
Discover the LTM Benchmark at NeurIPS 2024

October 09, 2024

We are glad to announce that our paper “Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models” has been accepted to NeurIPS 2024, where we will have the opportunity to share our work on Long-Term Memory (LTM) and LLM-based agents.

It was nearly a year ago that we open-sourced the first version of the LTM Benchmark, which we created in response to our need to measure progress in the development of LLM-based agents. From the beginning, the goal has been to evaluate agents through a very long conversation in which several pieces of information and challenging questions are intertwined as part of a natural user↔agent interaction.

This conversational setup is the key difference between the LTM Benchmark and other benchmarks or datasets, where individual tests are often based on a single prompt and LLM response. In our case, each test forms an indivisible part of the conversation, which grows organically as a natural byproduct of the interaction between the agent and our benchmarking system (acting as the user).
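
To make the idea concrete, here is a minimal, hypothetical sketch of an interleaved conversational test loop. It is not the benchmark's actual code: the `Task` fields, the `EchoAgent` stand-in, and the keyword-based scoring are illustrative assumptions only.

```python
import random
from dataclasses import dataclass

# Hypothetical agent interface: anything with a reply(message) -> str method.
class EchoAgent:
    def reply(self, message: str) -> str:
        return f"Noted: {message}"

@dataclass
class Task:
    """A test whose messages are spread across the conversation."""
    name: str
    setup_messages: list   # information the agent is expected to remember
    question: str           # asked later, after other tasks have interleaved
    expected_keyword: str   # naive scoring: keyword must appear in the answer

def run_interleaved(agent, tasks, seed=0):
    """Interleave all tasks' setup messages into one conversation, then ask each question."""
    rng = random.Random(seed)
    setup = [(t, m) for t in tasks for m in t.setup_messages]
    rng.shuffle(setup)                          # tasks become intertwined
    transcript = []
    for task, msg in setup:
        transcript.append((msg, agent.reply(msg)))
    scores = {}
    for task in tasks:
        answer = agent.reply(task.question)
        transcript.append((task.question, answer))
        scores[task.name] = int(task.expected_keyword.lower() in answer.lower())
    return scores, transcript

tasks = [
    Task("favourite_colour", ["My favourite colour is teal."],
         "What is my favourite colour?", "teal"),
    Task("meeting_time", ["Please remember the meeting moved to 3 pm."],
         "When is the meeting?", "3 pm"),
]
scores, _ = run_interleaved(EchoAgent(), tasks)
print(scores)  # the echo agent recalls nothing, so both tasks score 0
```

The real benchmark generates far longer conversations with many more tests and richer scoring, but the principle is the same: the agent must retrieve information deposited much earlier, amid unrelated messages.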

In the course of the conversation, the LTM Benchmark assesses a wide range of skills that are relevant to the creation, management, and retrieval of long-term memories. We have subjected many different LLMs and LLM-based agents to the LTM Benchmark and analyzed their results, from which we draw the following conclusions:

  • The chat environment is especially challenging, and most benchmarks do not model this aspect.
  • LLMs’ performance worsens as the conversation length increases, but LLM-based agents with an LTM system are more resilient.
  • The interleaving of tasks has the most impact on the scores.

Check out our GitHub repository and social media channels to learn more about the LTM Benchmark and our work on LLM-based agents and LTM: https://github.com/GoodAI/goodai-ltm-benchmark

