When it comes to long-term AI assistants, LLMs have some problems...
- Token limits can lead to truncated context.
- Information overload can muddy language understanding.
- Small details can be lost as conversations get long.
- Events from far back in a conversation are prone to hallucination.
- Processing time increases significantly as the context window grows.
How can we augment conversations with LLMs to improve long-term memory while preserving the seamless conversational experience that modern LLMs excel at?
We can fix it!
By pairing the durable persistence of structured SQL databases with OpenAI's function-calling API,
LLM4LLM lets large language models maintain memory points indefinitely, regardless of conversation length or complexity.
Users can add many customizable steps on top, including extra functions for the model to perform, validation and security checks to maintain data integrity, or reference tables to constrain user input (see the schema sketch below).
For testing purposes, we implemented an assistant reminiscent of a Dungeons & Dragons dungeon master, which offers ample opportunity to advance a long-running story
and provides a natural vehicle for testing long-term memory through inventory management.
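To make the reference-table idea concrete, here is a minimal sketch of how the demo's backing store could be set up. The table and column names (world_items, inventory) are our assumptions for illustration, not taken from the LLM4LLM code.

```python
import sqlite3

conn = sqlite3.connect("elara.db")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the reference-table constraint

# Reference table: the fixed set of items that exist in the game world.
conn.execute("""
    CREATE TABLE IF NOT EXISTS world_items (
        name TEXT PRIMARY KEY
    )
""")

# Inventory table: what Elara is currently carrying. The foreign key rejects
# any item the LLM tries to insert that doesn't exist in the world.
conn.execute("""
    CREATE TABLE IF NOT EXISTS inventory (
        item     TEXT NOT NULL REFERENCES world_items(name),
        quantity INTEGER NOT NULL DEFAULT 1 CHECK (quantity > 0)
    )
""")

conn.executemany(
    "INSERT OR IGNORE INTO world_items (name) VALUES (?)",
    [("longsword",), ("healing potion",), ("rope",)],
)
conn.commit()
```

With foreign keys enforced, a hallucinated item fails at the database layer instead of silently corrupting Elara's inventory.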
How does it work?
User Input to LLM
Based on the type of assistant, interact with the LLM as normal.
In our case, role-play as Elara Windrider, a courageous warrior with a heart of gold, as she travels the world on her adventures.
As you go on a virtual adventure, pick up items from those available in the world along the way!
Information Storage
The LLM parses the input for relevant information and, using the function-calling API,
inserts, deletes, or retrieves data from a structured database.
As you adventure as Elara in our demo, see the items obtained in the SQL table visualization!
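As a sketch of what this step might look like under the hood: the tool schemas below follow the OpenAI function-calling format, but the specific function names (add_item, remove_item) are hypothetical stand-ins for whatever LLM4LLM actually registers. It continues the schema sketch above.

```python
import json
import sqlite3

conn = sqlite3.connect("elara.db")

# Tool schemas in the OpenAI function-calling format. The model sees these
# descriptions and decides when a player's message warrants a database write.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "add_item",  # hypothetical name
            "description": "Record an item Elara picks up during the adventure.",
            "parameters": {
                "type": "object",
                "properties": {
                    "item": {"type": "string", "description": "Item name"},
                    "quantity": {"type": "integer", "minimum": 1},
                },
                "required": ["item"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "remove_item",  # hypothetical name
            "description": "Remove an item Elara uses, drops, or loses.",
            "parameters": {
                "type": "object",
                "properties": {"item": {"type": "string"}},
                "required": ["item"],
            },
        },
    },
]

def dispatch(tool_call) -> None:
    """Translate a tool call emitted by the model into a SQL statement."""
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "add_item":
        conn.execute(
            "INSERT INTO inventory (item, quantity) VALUES (?, ?)",
            (args["item"], args.get("quantity", 1)),
        )
    elif tool_call.function.name == "remove_item":
        conn.execute("DELETE FROM inventory WHERE item = ?", (args["item"],))
    conn.commit()
```

When the model decides an utterance like "Elara picks up a rope" warrants a write, it emits a tool call and the dispatcher turns it into SQL.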
Information Retrieval
When asked, the LLM queries the database to fulfill any request that requires the stored data, no matter how long ago it was first entered.
At any point in the adventure, ask Elara about her inventory state!
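The retrieval round trip might look like the following sketch, which continues the example above; the get_inventory tool name is again our assumption, not LLM4LLM's actual API.

```python
import json
import sqlite3
from openai import OpenAI

client = OpenAI()
conn = sqlite3.connect("elara.db")

# One retrieval tool (hypothetical name).
RETRIEVAL_TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_inventory",
        "description": "List every item currently in Elara's inventory.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def get_inventory() -> str:
    """Read the full inventory from SQL; the model never has to remember it."""
    rows = conn.execute("SELECT item, quantity FROM inventory").fetchall()
    return json.dumps([{"item": i, "quantity": q} for i, q in rows])

messages = [{"role": "user", "content": "Elara, what's in your pack right now?"}]
reply = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=messages, tools=RETRIEVAL_TOOLS
).choices[0].message

if reply.tool_calls:
    messages.append(reply)
    for call in reply.tool_calls:
        # Hand the query result back as a tool message, so the model answers
        # from the database rather than from its context window.
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": get_inventory()}
        )
    final = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    print(final.choices[0].message.content)
```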
How does LLM4LLM perform against base LLMs?
We tested our methodology against GPT-3.5-Turbo over multiple rounds of obtaining and retrieving information about items throughout Elara's journey.
To test each model's memory, we told the models that Elara had picked up certain items and then, at various points in the conversation, asked which items were in her inventory.
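The scoring formula isn't spelled out above, but one plausible way to compute per-probe accuracy and recall, treating each item in the game world as a yes/no prediction about Elara's inventory, is:

```python
# Hypothetical scoring for a single inventory probe; the actual protocol
# used in the evaluation may differ.
def score(predicted: set[str], actual: set[str], universe: set[str]):
    tp = len(predicted & actual)             # held items the model remembered
    tn = len(universe - predicted - actual)  # absent items correctly omitted
    recall = tp / len(actual) if actual else 1.0
    accuracy = (tp + tn) / len(universe)
    return accuracy, recall

acc, rec = score(
    predicted={"longsword", "rope"},
    actual={"longsword", "rope", "healing potion"},
    universe={"longsword", "rope", "healing potion", "shield", "torch"},
)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")  # accuracy=0.80, recall=0.67
```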
Let's see how LLM4LLM performs against a baseline LLM!
Baseline Performance
Accuracy = 63.5%
Recall = 27%
For our baseline model, we created a basic assistant without the LLM4LLM structured-database augmentation. By prompting it to obtain items and asking about the inventory at set intervals, we found that the baseline model struggled to perform consistently, with the majority of errors stemming from not "remembering" that particular items had been obtained.
LLM4LLM Performance
Accuracy = 98.75%
Recall = 99.38%
After augmenting the baseline assistant with LLM4LLM, performance improves significantly. By pulling the required information from the structured database, the LLM consistently and accurately answers questions about Elara's inventory at any point in time, with only scarce errors stemming from misunderstanding the question or hallucinating the initial prompts.