Models are stateless. They remember nothing between calls. Every "memory" you have ever seen in a chat product is an illusion built by the application — replaying past turns, summaries, or retrieved facts into each new request. This tutorial covers the three memory layers professional systems use and when to choose each.
A single chat response feels simple from the outside. The user types, the assistant replies, the conversation continues. Under the hood, every reply involves rebuilding the entire context from scratch: system prompt, relevant history, current user message — all packaged and sent to the model. The model has no idea any earlier turn ever happened unless you tell it.
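As a minimal sketch of that rebuild step, the function below assembles the message list for a single request. The name `build_context` and the sample strings are illustrative assumptions, not part of any particular SDK; the role/content message shape simply follows the convention used in the code later in this tutorial.

```python
def build_context(system_prompt, history, user_message):
    """Assemble the full message list sent with a single request.

    Nothing persists on the model side, so every call starts from the
    system prompt and replays whatever history the application chooses.
    """
    return [
        {"role": "system", "content": system_prompt},
        *history,  # past turns the application decided to replay
        {"role": "user", "content": user_message},
    ]


history = [
    {"role": "user", "content": "My name is Priya."},
    {"role": "assistant", "content": "Nice to meet you, Priya."},
]
messages = build_context("You are a helpful assistant.", history, "What is my name?")
print(len(messages))  # 4: system prompt + 2 replayed messages + current message
```

If `history` were passed as an empty list, the model would see only the system prompt and the question, and would have no way to answer it.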
That sounds wasteful until you realise it is also liberating. Because you control what goes into the context, you can choose which past turns to include, summarise the rest, and pull in long-term facts on demand. Memory becomes an explicit design decision rather than a black-box feature.
Mature chat systems use three complementary layers of memory, modelled loosely after how humans organise their own memory.
Picture a notepad you bring to every meeting. Working memory is the last few pages you can flip back to. Summary memory is the one-paragraph recap on the cover. Long-term memory is the filing cabinet down the corridor — you only fetch a folder when the meeting topic actually needs it.
Most beginner chat apps fall into one of two extremes. The first is to send every past turn to the model on every request, which works for ten turns and breaks at five hundred, when the prompt no longer fits the context window. The second is to send nothing, which works for trivia but produces a goldfish-style assistant that forgets the user's name two turns in.
Naive append-everything
history = []
while True:
    user = input("You: ")
    history.append({"role": "user", "content": user})
    resp = model.chat(messages=history)
    history.append({"role": "assistant", "content": resp})
    print("AI:", resp)
Assuming an average of roughly 250 tokens per turn, the history reaches about 50,000 tokens by turn 200. Costs balloon, latency crawls, the model's attention dilutes, and eventually the request errors out at the context limit. This pattern does not survive contact with real users.
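The arithmetic behind that growth can be sketched in a few lines. The figure of 250 tokens per turn is an assumption chosen for illustration (real turns vary widely), and the helper names `prompt_tokens` and `cumulative_tokens` are hypothetical.

```python
TOKENS_PER_TURN = 250  # assumed average for one user message plus one reply


def prompt_tokens(turn, window=None):
    """Approximate prompt size at a given turn (1-indexed).

    With window=None every past turn is replayed; otherwise only the
    most recent `window` turns are.
    """
    turns_sent = turn if window is None else min(turn, window)
    return turns_sent * TOKENS_PER_TURN


def cumulative_tokens(total_turns, window=None):
    """Total prompt tokens sent across all requests in the conversation."""
    return sum(prompt_tokens(t, window) for t in range(1, total_turns + 1))


print(prompt_tokens(200))                 # 50000
print(cumulative_tokens(200))             # 5025000
print(cumulative_tokens(200, window=12))  # 583500
```

The per-request prompt grows linearly with the turn count, so the total tokens sent over the whole conversation grows quadratically. Under these assumptions, capping the replayed history at 12 turns cuts the 200-turn total by roughly a factor of nine.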
Layered memory (pseudocode)
WINDOW = 12          # working memory: keep last N turns verbatim
SUMMARY_EVERY = 8    # summary memory: refresh every K turns
MSGS_PER_TURN = 2    # one turn = user message + assistant reply

state = {
    "summary": "",            # rolling gist of old turns
    "recent": [],             # messages from the last N turns
    "facts": vector_store,    # long-term memory
}

def handle_turn(user_message):
    # Pull only the long-term facts relevant to this message
    relevant_facts = state["facts"].search(user_message, k=4)
    messages = [
        {"role": "system",
         "content": SYSTEM_PROMPT
                    + "\n\nConversation summary so far:\n"
                    + state["summary"]
                    + "\n\nRelevant facts:\n"
                    + format(relevant_facts)},
        *state["recent"],
        {"role": "user", "content": user_message},
    ]
    reply = model.chat(messages=messages)
    state["recent"].append({"role": "user", "content": user_message})
    state["recent"].append({"role": "assistant", "content": reply})
    # Once K turns have overflowed the window, fold them into the summary
    keep = WINDOW * MSGS_PER_TURN
    if len(state["recent"]) >= keep + SUMMARY_EVERY * MSGS_PER_TURN:
        old = state["recent"][:-keep]
        state["summary"] = model.summarise(state["summary"], old)
        state["recent"] = state["recent"][-keep:]
    # Optionally extract durable facts and persist them
    new_facts = model.extract_facts(user_message, reply)
    state["facts"].upsert(new_facts)
    return reply
The model now has the three things it actually needs: the gist of the whole conversation, the precise wording of the most recent turns, and on-demand access to long-term facts. Token usage stays roughly constant no matter how long the conversation runs, provided the summariser keeps the summary itself to a fixed length.
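To make the pattern concrete, here is a runnable sketch of the same loop that works offline. Everything model-related is a stand-in: `StubModel` and `KeywordFactStore` are hypothetical classes invented for this example, the fact store ranks by word overlap rather than embeddings, and the summary is refreshed on every overflow rather than in batches to keep the demo short. A real system would swap in an actual LLM client and vector store.

```python
import re

WINDOW = 4          # turns kept verbatim (small so the demo trims quickly)
MSGS_PER_TURN = 2   # one turn = user message + assistant reply
SYSTEM_PROMPT = "You are a helpful assistant."


def words(text):
    """Lowercased word set used for crude overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class KeywordFactStore:
    """Toy long-term memory: ranks facts by word overlap, not embeddings."""

    def __init__(self):
        self.facts = []

    def upsert(self, new_facts):
        for fact in new_facts:
            if fact not in self.facts:
                self.facts.append(fact)

    def search(self, query, k=4):
        q = words(query)
        scored = [(len(q & words(f)), f) for f in self.facts]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [fact for _, fact in ranked[:k]]


class StubModel:
    """Deterministic stand-in for an LLM client, so the loop runs offline."""

    def chat(self, messages):
        return f"(reply to: {messages[-1]['content']})"

    def summarise(self, summary, old_messages):
        gist = " | ".join(m["content"] for m in old_messages if m["role"] == "user")
        return (summary + " | " + gist).strip(" |")

    def extract_facts(self, user_message, reply):
        # Pretend any message starting with "I " states a durable fact.
        return [user_message] if user_message.startswith("I ") else []


model = StubModel()
state = {"summary": "", "recent": [], "facts": KeywordFactStore()}
prompt_sizes = []  # number of messages sent per request, for inspection


def handle_turn(user_message):
    facts = state["facts"].search(user_message, k=4)
    system = (SYSTEM_PROMPT
              + "\n\nConversation summary so far:\n" + state["summary"]
              + "\n\nRelevant facts:\n" + "\n".join(facts))
    messages = [{"role": "system", "content": system},
                *state["recent"],
                {"role": "user", "content": user_message}]
    prompt_sizes.append(len(messages))
    reply = model.chat(messages=messages)
    state["recent"] += [{"role": "user", "content": user_message},
                        {"role": "assistant", "content": reply}]
    keep = WINDOW * MSGS_PER_TURN
    if len(state["recent"]) > keep:
        old = state["recent"][:-keep]
        state["summary"] = model.summarise(state["summary"], old)
        state["recent"] = state["recent"][-keep:]
    state["facts"].upsert(model.extract_facts(user_message, reply))
    return reply


for i in range(1, 11):
    handle_turn(f"message {i}")
handle_turn("I prefer Indian food")

print(prompt_sizes[:10])  # [2, 4, 6, 8, 10, 10, 10, 10, 10, 10]
print(state["summary"])
print(state["facts"].search("what food do I like?"))
```

The printed prompt sizes plateau once the window fills, which is the bounded-context behaviour described above. The stub summariser only concatenates user messages, so its summary still grows; a real summariser would be asked to compress to a fixed length.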
Tip: When summarising, preserve the things the model is most likely to need later: names, dates, decisions, preferences, open tasks. Strip the things it can rebuild from context: pleasantries, hedges, generic acknowledgements.
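One way to encode that tip is in the summariser's own prompt. The sketch below builds such a prompt as plain text; the function name `build_summary_prompt`, the wording of the instructions, and the 150-word cap are all assumptions to adapt, not a recommended canonical prompt.

```python
KEEP = ["names", "dates", "decisions", "preferences", "open tasks"]
DROP = ["pleasantries", "hedges", "generic acknowledgements"]


def build_summary_prompt(previous_summary, old_messages, max_words=150):
    """Return the text of a request asking a model to update a rolling summary."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    return (
        "Update the conversation summary below with the new messages.\n"
        f"Always keep: {', '.join(KEEP)}.\n"
        f"Leave out: {', '.join(DROP)}.\n"
        f"Use at most {max_words} words.\n\n"
        f"Current summary:\n{previous_summary or '(none yet)'}\n\n"
        f"New messages:\n{transcript}\n"
    )


prompt = build_summary_prompt(
    "",
    [
        {"role": "user", "content": "Hi! Let's book the demo for 14 March."},
        {"role": "assistant", "content": "Sure, 14 March it is."},
    ],
)
print(prompt)
```

Keeping the keep/drop lists as data makes them easy to tune when a test conversation shows the summariser dropping something important.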
Build a minimal chat loop with only working memory (last 10 turns). Have a 30-turn conversation. Note which earlier facts the assistant has forgotten. This is the baseline.
Add a summary layer that refreshes every 8 turns. Run the same 30-turn conversation. Compare which facts now survive — and which the summariser dropped. Tune the summariser prompt accordingly.
Add long-term memory using any vector store. Extract one or two durable facts per turn ("user prefers Indian food", "user works at a fintech"). End the session, restart, and check whether the assistant correctly recalls those facts in a new conversation.
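For the last exercise, the restart check can be prototyped before committing to a real vector store. The sketch below persists facts to a JSON file and retrieves them by word overlap; `JsonFactStore` is a hypothetical stand-in, and word overlap is a deliberate simplification of embedding similarity chosen so the example has no dependencies.

```python
import json
import os
import re
import tempfile


class JsonFactStore:
    """Toy long-term memory persisted to a JSON file.

    Stand-in for a vector store: retrieval uses word overlap rather
    than embedding similarity.
    """

    def __init__(self, path):
        self.path = path
        self.facts = []
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.facts = json.load(fh)

    def upsert(self, new_facts):
        for fact in new_facts:
            if fact not in self.facts:
                self.facts.append(fact)
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.facts, fh)

    def search(self, query, k=4):
        q = set(re.findall(r"[a-z0-9]+", query.lower()))
        scored = []
        for fact in self.facts:
            overlap = len(q & set(re.findall(r"[a-z0-9]+", fact.lower())))
            if overlap:
                scored.append((overlap, fact))
        scored.sort(key=lambda pair: -pair[0])  # highest overlap first
        return [fact for _, fact in scored[:k]]


path = os.path.join(tempfile.mkdtemp(), "facts.json")

# Session 1: store two durable facts, then discard the object.
session_one = JsonFactStore(path)
session_one.upsert(["user prefers Indian food", "user works at a fintech"])
del session_one

# Session 2: a fresh object reloads the facts from disk.
session_two = JsonFactStore(path)
recalled = session_two.search("Which food does the user like?")
print(recalled[0])  # user prefers Indian food
```

Because both stored facts share the word "user" with the query, both are returned, with the food preference ranked first. A real vector store would rank by semantic similarity instead and would also match paraphrases that share no words with the stored fact.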