Memory and State Management in Multi-Turn AI Conversations

Models are stateless. They remember nothing between calls. Every "memory" you have ever seen in a chat product is an illusion built by the application — replaying past turns, summaries, or retrieved facts into each new request. This tutorial covers the three memory layers professional systems use and when to choose each.

1. Introduction

A single chat response feels simple from the outside. The user types, the assistant replies, the conversation continues. Under the hood, every reply involves rebuilding the entire context from scratch: system prompt, relevant history, current user message — all packaged and sent to the model. The model has no idea any earlier turn ever happened unless you tell it.

That sounds wasteful until you realise it is also liberating. Because you control what goes into the context, you can choose which past turns to include, summarise the rest, and pull in long-term facts on demand. Memory becomes an explicit design decision rather than a black-box feature.
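The rebuild-on-every-call idea can be sketched in a few lines. This is a minimal illustration, not any particular vendor's API: the function name `build_messages`, the `max_history` parameter, and the role/content message shape are assumptions chosen for the example.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(
    system_prompt: str,
    history: List[Message],
    user_message: str,
    max_history: int = 6,
) -> List[Message]:
    """Assemble the full context for one model call from scratch."""
    # Decide explicitly how much of the past the model gets to see.
    recent = history[-max_history:] if max_history > 0 else []
    return [
        {"role": "system", "content": system_prompt},
        *recent,
        {"role": "user", "content": user_message},
    ]


history = [
    {"role": "user", "content": "My name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada."},
]
messages = build_messages("You are a helpful assistant.", history, "What is my name?")
for m in messages:
    print(m["role"], "->", m["content"])
```

Setting `max_history=0` reproduces the "send nothing" behaviour: the model would see only the system prompt and the current question, and could not know the user's name.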

2. The Concept Explained

Mature chat systems use three complementary layers of memory, modelled loosely on how humans organise their own memory.

Working memory. The recent N turns of the conversation, copied verbatim. Cheap to maintain, fits naturally in the context window. Holds detail but is short-lived.
Summary memory. A running, rolling summary of older turns. Generated by the model itself, refreshed every K turns. Holds gist but loses detail.
Long-term memory. Facts stored in an external store — vector database, key-value store, or both. Retrieved on demand based on the current query. Holds anything that needs to survive across sessions or scale beyond the context window.

Picture a notepad you bring to every meeting. Working memory is the last few pages you can flip back to. Summary memory is the one-paragraph recap on the cover. Long-term memory is the filing cabinet down the corridor — you only fetch a folder when the meeting topic actually needs it.

[Figure: three layers of memory in a production chat system. Each one is rebuilt into the prompt for every model call.]

3. The Problem Without a Memory Strategy

Most beginner chat apps go to one of two extremes. The first is to send every past turn to the model on every request, which works for ten turns and breaks at five hundred when the prompt no longer fits the context window. The second is to send nothing, which works for trivia but produces a goldfish-style assistant that forgets the user's name two turns in.

Naive append-everything

# `model` stands in for whatever chat client you use.
history = []
while True:
    user = input("You: ")
    history.append({"role": "user", "content": user})
    resp = model.chat(messages=history)  # the whole history is resent on every call
    history.append({"role": "assistant", "content": resp})
    print("AI:", resp)

By turn 200, the history can easily run to tens of thousands of tokens (about 50,000 at roughly 250 tokens per turn). Costs balloon, latency crawls, the model's attention dilutes, and eventually the request errors out at the context limit. This pattern does not survive contact with real users.
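The growth is easy to underestimate because the cost is paid twice: each prompt gets longer, and every earlier turn is resent on every later request. The sketch below does the arithmetic under an assumed average of 250 tokens per turn (user message plus reply); the constant and both function names are made up for illustration, and real token counts depend on your tokenizer.

```python
TOKENS_PER_TURN = 250  # assumed average: one user message plus one reply


def prompt_size(turn: int, tokens_per_turn: int = TOKENS_PER_TURN) -> int:
    """Approximate tokens sent on request number `turn` with append-everything."""
    return turn * tokens_per_turn


def cumulative_tokens(turns: int, tokens_per_turn: int = TOKENS_PER_TURN) -> int:
    """Approximate total input tokens billed across the first `turns` requests."""
    return sum(prompt_size(t, tokens_per_turn) for t in range(1, turns + 1))


for turn in (10, 50, 200):
    print(
        f"turn {turn}: prompt ~{prompt_size(turn)} tokens, "
        f"cumulative ~{cumulative_tokens(turn)} tokens"
    )
```

Prompt size grows linearly with the number of turns, but the total number of tokens sent grows quadratically, which is why the bill rises much faster than the conversation length.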

4. The Solution: A Three-Layer Memory Manager

Layered memory (pseudocode)

WINDOW = 12        # working memory: keep the last N turns verbatim
SUMMARY_EVERY = 8  # summary memory: refresh every K turns

# model, vector_store and SYSTEM_PROMPT are placeholders for your own
# chat client, long-term store and prompt text.
state = {
  "summary": "",          # rolling gist of old turns
  "recent": [],           # verbatim messages (two per turn: user + assistant)
  "facts": vector_store,  # long-term memory
}

def handle_turn(user_message):
    # Long-term memory: fetch only the facts relevant to this message
    relevant_facts = state["facts"].search(user_message, k=4)

    messages = [
        { "role": "system",
          "content": SYSTEM_PROMPT
                     + "\n\nConversation summary so far:\n"
                     + state["summary"]
                     + "\n\nRelevant facts:\n"
                     + "\n".join(str(fact) for fact in relevant_facts) },
        *state["recent"],
        { "role": "user", "content": user_message },
    ]

    reply = model.chat(messages=messages)
    state["recent"].append({"role": "user", "content": user_message})
    state["recent"].append({"role": "assistant", "content": reply})

    # Each turn adds two messages. Let the buffer grow SUMMARY_EVERY turns
    # past the window, then fold the overflow into the summary in one call.
    if len(state["recent"]) >= 2 * (WINDOW + SUMMARY_EVERY):
        old = state["recent"][:-2 * WINDOW]
        state["summary"] = model.summarise(state["summary"], old)
        state["recent"] = state["recent"][-2 * WINDOW:]

    # Optionally extract durable facts and persist them
    new_facts = model.extract_facts(user_message, reply)
    if new_facts:
        state["facts"].upsert(new_facts)

    return reply

The model now has the three things it actually needs: the gist of the whole conversation, the precise wording of the most recent turns, and on-demand access to long-term facts. Because the verbatim window is capped and older turns are folded into the summary, token usage per request stays roughly bounded no matter how long the conversation runs.
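To see the bounded-context behaviour without wiring up a real model, here is a small runnable sketch that covers only the working and summary layers. `FakeModel` and `LayeredMemory` are hypothetical stand-ins invented for this example: the fake summariser just counts how many turns it has absorbed, and the long-term store is left out for brevity.

```python
from typing import Dict, List

Message = Dict[str, str]

WINDOW = 3         # turns kept verbatim (small so the effect shows quickly)
SUMMARY_EVERY = 2  # fold overflow into the summary every K turns


class FakeModel:
    """Stand-in for a real LLM client; both methods are hypothetical."""

    def chat(self, messages: List[Message]) -> str:
        return f"(reply to: {messages[-1]['content']})"

    def summarise(self, summary: str, old: List[Message]) -> str:
        # A real summariser would call the model; this one only counts turns.
        folded = int(summary.split()[0]) if summary else 0
        return f"{folded + len(old) // 2} earlier turns summarised"


class LayeredMemory:
    def __init__(self, model: FakeModel) -> None:
        self.model = model
        self.summary = ""
        self.recent: List[Message] = []
        self.last_prompt_size = 0  # number of messages in the last request

    def handle_turn(self, user_message: str) -> str:
        messages = [
            {"role": "system", "content": "Summary so far: " + self.summary},
            *self.recent,
            {"role": "user", "content": user_message},
        ]
        self.last_prompt_size = len(messages)
        reply = self.model.chat(messages)
        self.recent += [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": reply},
        ]
        # Two messages per turn: compact once the buffer is K turns over.
        if len(self.recent) >= 2 * (WINDOW + SUMMARY_EVERY):
            old = self.recent[: -2 * WINDOW]
            self.summary = self.model.summarise(self.summary, old)
            self.recent = self.recent[-2 * WINDOW:]
        return reply


memory = LayeredMemory(FakeModel())
sizes = []
for i in range(20):
    memory.handle_turn(f"message {i}")
    sizes.append(memory.last_prompt_size)

print("largest prompt (messages):", max(sizes))
print("summary:", memory.summary)
```

The prompt never exceeds the system message plus `WINDOW + SUMMARY_EVERY` turns, however many turns are played. Measuring in messages rather than tokens is a simplification; a production version would budget by token count.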

5. Step-by-Step Breakdown

Size your working memory. 8–20 turns is typical. Smaller for simple chats, larger for tasks that depend on recent context (coding, multi-step planning).
Choose a summarisation cadence. Don't summarise every turn — that wastes tokens. Don't summarise too rarely — the working memory will overflow. Refreshing every 8–12 turns is a good default.
Decide what is "durable". Not every fact deserves long-term storage. User preferences, key decisions, and identifying details (with consent) belong in long-term memory. Small talk does not.
Retrieve, don't dump. Even if your long-term store has thousands of facts, retrieve only the 3–6 most relevant for each turn. Stuffing everything in defeats the purpose.
Make memory inspectable. Build a debug view that shows exactly what context was sent on the last turn. When the model says something weird, you need to know whether it was reading stale memory.
Respect privacy by design. Long-term memory often involves storing personal data. Encrypt it, scope it per-user, give users a way to view and delete it, and never use it for purposes beyond what the user agreed to.
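The "retrieve, don't dump" and "make memory inspectable" steps can be illustrated together. The sketch below ranks stored facts against the current query and prints the scores so you can see why each fact was chosen. The word-overlap score is a deliberately crude stand-in for embedding similarity, and the function names and sample facts are invented for the example.

```python
from typing import List, Set, Tuple


def tokenise(text: str) -> Set[str]:
    return set(text.lower().split())


def score(query: str, fact: str) -> float:
    """Jaccard word overlap: a toy stand-in for embedding similarity."""
    q, f = tokenise(query), tokenise(fact)
    union = q | f
    return len(q & f) / len(union) if union else 0.0


def retrieve(query: str, facts: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return at most k (score, fact) pairs, best first, skipping zero scores."""
    ranked = sorted(((score(query, f), f) for f in facts), reverse=True)
    return [(s, f) for s, f in ranked[:k] if s > 0]


facts = [
    "user prefers indian food",
    "user works at a fintech",
    "user has a dog named biscuit",
    "user is planning a trip to lisbon",
]

hits = retrieve("which food does the user like", facts, k=2)

# Debug view: show exactly which facts would be placed in the prompt, and why.
for s, f in hits:
    print(f"{s:.2f}  {f}")
```

Logging the scores alongside the facts is what makes a stale or irrelevant memory easy to spot when the model says something odd.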

Tip: When summarising, preserve the things the model is most likely to need later: names, dates, decisions, preferences, open tasks. Strip the things it can rebuild from context: pleasantries, hedges, generic acknowledgements.
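One way to apply this tip is to spell out the keep and drop lists in the summariser prompt itself. The wording of `SUMMARY_INSTRUCTIONS` and the helper `build_summary_prompt` below are illustrative assumptions, not a tested prompt; expect to tune them against your own conversations.

```python
from typing import Dict, List

SUMMARY_INSTRUCTIONS = (
    "Update the running summary of this conversation.\n"
    "Keep: names, dates, decisions, preferences, open tasks.\n"
    "Drop: pleasantries, hedges, generic acknowledgements.\n"
    "Reply with the updated summary only."
)


def build_summary_prompt(previous_summary: str, old_turns: List[Dict[str, str]]) -> str:
    """Combine the instructions, the existing summary, and the turns to fold in."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)
    return (
        f"{SUMMARY_INSTRUCTIONS}\n\n"
        f"Previous summary:\n{previous_summary or '(none)'}\n\n"
        f"Turns to fold in:\n{transcript}"
    )


prompt = build_summary_prompt(
    "",
    [
        {"role": "user", "content": "Book the flight for 3 March"},
        {"role": "assistant", "content": "Sure, happy to help with that!"},
    ],
)
print(prompt)
```

Passing the previous summary back in lets the model update it incrementally instead of re-reading the whole conversation each time.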

6. Practice Exercises

Exercise 1

Build a minimal chat loop with only working memory (last 10 turns). Have a 30-turn conversation. Note which earlier facts the assistant has forgotten. This is the baseline.
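A possible starting point for this exercise, assuming a stand-in `echo_model` function in place of a real model call (swap in your own client). It uses `collections.deque` with `maxlen` so the oldest messages fall off automatically.

```python
from collections import deque

MAX_TURNS = 10
recent = deque(maxlen=2 * MAX_TURNS)  # two messages per turn


def echo_model(messages):
    """Hypothetical stand-in for a real model call."""
    return f"echo: {messages[-1]['content']}"


def chat_turn(user_message):
    user = {"role": "user", "content": user_message}
    # The model sees the retained turns plus the new message, nothing older.
    reply = echo_model([*recent, user])
    recent.append(user)
    recent.append({"role": "assistant", "content": reply})
    return reply


for i in range(1, 31):
    chat_turn(f"fact number {i}")

print("messages retained:", len(recent))
print("oldest retained:", recent[0]["content"])
```

After 30 turns only turns 21 to 30 remain in the buffer, so anything stated in the first 20 turns is invisible to the model.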

Exercise 2

Add a summary layer that refreshes every 8 turns. Run the same 30-turn conversation. Compare which facts now survive — and which the summariser dropped. Tune the summariser prompt accordingly.

Exercise 3

Add long-term memory using any vector store. Extract one or two durable facts per turn ("user prefers Indian food", "user works at a fintech"). End the session, restart, and check whether the assistant correctly recalls those facts in a new conversation.
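Before reaching for a vector store, it can help to check the restart behaviour with something simpler. The sketch below is a toy, per-user fact store persisted as a JSON file; the `FactStore` class and its method names are hypothetical, there is no retrieval ranking, and the file is stored unencrypted, so treat it as a test scaffold rather than a production design.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class FactStore:
    """Toy per-user fact store persisted as JSON (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.data: Dict[str, List[str]] = (
            json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
        )

    def add(self, user_id: str, fact: str) -> None:
        facts = self.data.setdefault(user_id, [])
        if fact not in facts:  # avoid storing duplicates
            facts.append(fact)
        self._save()

    def get(self, user_id: str) -> List[str]:
        return list(self.data.get(user_id, []))

    def forget(self, user_id: str) -> None:
        """Delete everything stored for one user."""
        self.data.pop(user_id, None)
        self._save()

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.data, indent=2), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "facts.json"

    # Session 1: store a durable fact, then let the object go out of scope.
    FactStore(path).add("user-1", "user prefers Indian food")

    # Session 2: a fresh instance reads the same file, simulating a restart.
    restarted = FactStore(path)
    recalled = restarted.get("user-1")

    # Deletion request: the fact must be gone after the next restart too.
    restarted.forget("user-1")
    after_delete = FactStore(path).get("user-1")

print("recalled:", recalled)
print("after delete:", after_delete)
```

Keying every fact by user ID and exposing a `forget` method from the start makes the per-user scoping and deletion requirements from section 5 much easier to honour later.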

7. Key Takeaways

LLMs are stateless. All "memory" is an application-level illusion, created by rebuilding the prompt every turn.
Production systems use three layers: working memory (verbatim recent turns), summary memory (rolling gist), and long-term memory (external store, retrieved on demand).
Each layer trades off cost, fidelity, and persistence — combine them rather than relying on any single layer.
Keep memory inspectable. Most "weird model behaviour" turns out to be a bad memory snapshot, not a model bug.
Treat long-term memory as user data. Encrypt, scope, audit, and respect deletion requests.
