Prompt Compression: How to Fit More into Less Context

Every token in a prompt has a cost: in money, in latency, and in the model's ability to attend to what matters. Prompt compression is the discipline of shrinking prompts without losing fidelity. Done well, it can cut input-token spend by 40–70% on verbose prompts, and it often improves answer quality as well.

1. Introduction

It is tempting to treat the context window as free real estate. A 200,000-token window feels boundless until you do the maths on a million daily requests. Even on a single-call budget, padded prompts hurt: the model's attention is finite, and irrelevant tokens dilute the signal of the relevant ones. Long prompts also add latency, because the model has to process the entire prompt before it can emit the first output token, so time to first token grows with prompt length.
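To make the maths concrete, here is a back-of-the-envelope sketch. The price per million input tokens and both prompt sizes are hypothetical placeholders rather than figures from any provider, so substitute your own.

```python
# Back-of-the-envelope input-token cost.
# All numbers are illustrative assumptions, not real provider pricing.

def daily_input_cost(prompt_tokens: int, requests_per_day: int,
                     usd_per_million_tokens: float) -> float:
    """Return the daily spend on input tokens in USD."""
    total_tokens = prompt_tokens * requests_per_day
    return total_tokens / 1_000_000 * usd_per_million_tokens


REQUESTS_PER_DAY = 1_000_000   # "a million daily requests"
PRICE = 3.00                   # hypothetical USD per million input tokens

bloated = daily_input_cost(2_000, REQUESTS_PER_DAY, PRICE)
compressed = daily_input_cost(800, REQUESTS_PER_DAY, PRICE)

print(f"bloated:    ${bloated:,.2f}/day")
print(f"compressed: ${compressed:,.2f}/day")
print(f"saved:      ${bloated - compressed:,.2f}/day")
```

Under these assumed numbers, trimming a 2,000-token prompt to 800 tokens takes the daily input bill from $6,000 to $2,400. Output tokens are billed separately and are not affected by prompt compression.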

Prompt compression is not about being terse for its own sake. It is about packing the same instruction-content into fewer tokens by removing redundancy, replacing prose with structure, and pre-summarising the parts that the model does not need to read in full.

2. The Concept Explained

There are three useful flavours of compression, each tackling a different kind of waste:

Structural compression. Replace prose with markup. A bulleted list of constraints usually takes fewer tokens than the same constraints written as sentences, and each constraint sits on its own line instead of being buried inside a paragraph.
Semantic compression. Summarise long context (documents, conversation history, retrieved chunks) before passing it in. A 4,000-token document can usually be compressed to a 400-token bullet summary that the downstream model uses just as well for most queries.
Reference compression. Replace repeated content with short references. Define a label once at the top — "<policy> means the company's refund policy v3.2" — then refer to it by name in every later turn.

Compression preserves meaning while cutting tokens, often by 60–70% on verbose hand-written prompts, although the exact figure depends on how much redundancy the original contains.
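Reference compression is the easiest of the three to show in code. The sketch below is a minimal illustration, not a production technique: the function name `compress_references`, the label `refund_policy`, and the sample messages are all made up for the example, and it only handles verbatim repeats.

```python
# Reference compression: define a long passage once under a label,
# then replace every verbatim repeat with the label.
# Function name, label, and sample texts are hypothetical.

def compress_references(messages: list[str], passage: str, label: str) -> list[str]:
    """Prepend one definition of `label`, then swap repeats of `passage` for it."""
    tag = f"<{label}>"
    definition = f"{tag} means: {passage}"
    replaced = [m.replace(passage, tag) for m in messages]
    return [definition] + replaced


policy = "Refunds are available within 30 days if the product is unused and in its original packaging."
turns = [
    f"Customer asks about returns. {policy}",
    f"Customer asks about exchanges. {policy}",
]

compressed_turns = compress_references(turns, policy, "refund_policy")
for line in compressed_turns:
    print(line)

chars_before = sum(len(t) for t in turns)
chars_after = sum(len(t) for t in compressed_turns)
print(f"characters: {chars_before} -> {chars_after}")
```

The saving grows with the number of repeats: the definition is paid for once, and every later mention costs only the length of the label.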

3. The Problem Without Compression

Hand-written prompts grow over time. A team adds an instruction here, a clarification there, an edge case they remembered last week. After a year, the prompt can be twice as long as it needs to be, and much of the model's attention goes to stale instructions.

Bloated prompt

You are a very helpful AI assistant who helps customers
with their questions. Please be polite and professional
at all times. When you respond, please make sure that you
are clear and that you are helpful. The customer is very
important to us so please be respectful. If you don't know
the answer please say that you don't know rather than
making something up. Also please don't make up policies
that don't exist. We have a refund policy which says that
customers can get a refund within 30 days of purchase as
long as the product is unused and in its original
packaging. If a customer asks about refunds you should
mention this. ... (continues for 800 more tokens)

Lots of words, very little new signal. The model still has to read every token. Half the instructions are restatements of the default helpful-assistant behaviour.

4. The Solution: Structured, Deduplicated, Summarised

Compressed prompt

# Role
Customer support assistant. Polite, professional, concise.

# Hard rules
- Never invent policies. If unsure, say so.
- Use only the policies in `policies.md` (loaded below).
- Refunds: 30 days, unused, original packaging.

# Style
- 2–4 sentences per reply unless asked for more.
- Match the customer's language.

# Output
Reply text only. No preamble. No "I hope this helps".

Four sections, around 90 tokens, no signal lost. The structure makes each rule easy to locate. Crucially, every rule earns its place: restatements of default behaviour have been removed.
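A quick way to sanity-check a size claim like "around 90 tokens" is a character-based estimate. The sketch below assumes roughly four characters per token, a common rule of thumb for English text. It is not an exact count, and the helper name `estimate_tokens` is made up for illustration. When the number matters, use the tokenizer that belongs to your model.

```python
# Rough token estimate. The ~4 characters-per-token figure is a rule of
# thumb for English text, not an exact count; real tokenizers will differ.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate token count from character length."""
    return round(len(text) / chars_per_token)


COMPRESSED_PROMPT = """\
# Role
Customer support assistant. Polite, professional, concise.

# Hard rules
- Never invent policies. If unsure, say so.
- Use only the policies in `policies.md` (loaded below).
- Refunds: 30 days, unused, original packaging.

# Style
- 2–4 sentences per reply unless asked for more.
- Match the customer's language.

# Output
Reply text only. No preamble. No "I hope this helps".
"""

print(estimate_tokens(COMPRESSED_PROMPT))
```

Running the same function over the old and new prompt gives a cheap before-and-after comparison during editing, ahead of the precise count you take before shipping.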

5. Step-by-Step Breakdown

Audit the current prompt. Read each sentence and ask: "If I delete this, does the output change?" Restatements of default behaviour ("be helpful and polite") can almost always be deleted without changing the output.
Switch prose to structure. Headings, bullets, and short labels are denser than sentences and easier for the model to attend to.
Pre-summarise long context. If you are stuffing in 5,000 tokens of policy or chat history, run a one-time summarisation pass and store the summary. Use the summary in production; refresh it when the source changes.
Use reference labels. Define a label once, refer to it by name. "As described in <refund_policy>..." is shorter than restating the policy each time.
Measure before and after. Count tokens precisely. Run both prompts on a held-out set and compare quality. Only ship the compressed version if it matches or beats the original.
Watch for over-compression. Some redundancy is load-bearing — examples in particular. Stripping examples to save tokens often hurts quality more than it helps the budget.
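The pre-summarise step above ("run a one-time summarisation pass and store the summary ... refresh it when the source changes") can be sketched as a cache keyed by a hash of the source text. This is one possible design under stated assumptions: `cached_summary` and the in-memory dictionary are hypothetical, and `first_sentence` is a stand-in for a real model call so that the example runs on its own.

```python
import hashlib
from typing import Callable

# Hypothetical in-memory cache: hash of the source text -> stored summary.
# A real deployment would persist this somewhere durable.
_summary_cache: dict[str, str] = {}


def cached_summary(source: str, summarise: Callable[[str], str]) -> str:
    """Return a stored summary, re-summarising only when `source` changes."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    if key not in _summary_cache:
        _summary_cache[key] = summarise(source)
    return _summary_cache[key]


def first_sentence(text: str) -> str:
    """Stand-in summariser for the demo; a real one would call a model."""
    return text.split(". ")[0].rstrip(".") + "."


calls: list[str] = []  # records every time the summariser actually runs


def counting_summariser(text: str) -> str:
    calls.append(text)
    return first_sentence(text)


doc = "Refunds are allowed within 30 days. Items must be unused. Keep the box."
cached_summary(doc, counting_summariser)                   # summarises
cached_summary(doc, counting_summariser)                   # served from cache
cached_summary(doc + " New clause.", counting_summariser)  # source changed
print(len(calls))  # prints 2
```

Hashing the content rather than a filename or timestamp means the summary refreshes exactly when the text changes and not otherwise.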

Tip: Long-context models tolerate longer prompts, but they don't reward them. The "lost-in-the-middle" effect, where information buried in the middle of a long context is attended to less, has been observed across model families. Compression isn't just about cost; it is about staying in the model's high-attention zone.

6. Practice Exercises

Exercise 1

Take a long prompt you currently use and compress it to half its tokens. Run both versions on at least five representative inputs. If quality holds, you have cut the input-token cost of that prompt roughly in half.
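One way to structure the comparison is a small harness that scores both prompts on the same inputs. Everything here is a sketch under assumptions: `compare_prompts`, `fake_model`, and `mentions_input` are hypothetical names, the model is a toy echo function so the code runs without any API, and the scoring rule is a placeholder for whatever quality metric suits your task.

```python
from typing import Callable

# `model` is a hypothetical callable: (system_prompt, user_input) -> reply.
# `score` is a hypothetical metric: (user_input, reply) -> float, higher is better.
Model = Callable[[str, str], str]
Scorer = Callable[[str, str], float]


def compare_prompts(original: str, compressed: str, inputs: list[str],
                    model: Model, score: Scorer) -> dict[str, float]:
    """Average the score of each prompt over the same held-out inputs."""
    if not inputs:
        raise ValueError("need at least one input to compare on")

    def mean_score(prompt: str) -> float:
        return sum(score(x, model(prompt, x)) for x in inputs) / len(inputs)

    return {"original": mean_score(original), "compressed": mean_score(compressed)}


# Toy stand-ins so the sketch runs without a model API.
def fake_model(system_prompt: str, user_input: str) -> str:
    return f"Echo: {user_input}"


def mentions_input(user_input: str, reply: str) -> float:
    return 1.0 if user_input in reply else 0.0


result = compare_prompts("long prompt ...", "short prompt",
                         ["refund?", "shipping?"], fake_model, mentions_input)
ship_it = result["compressed"] >= result["original"]
print(result)
print("ship compressed prompt:", ship_it)
```

Swapping in a real model call and a task-specific scorer turns this into the "matches or beats the original" check described in the step-by-step section.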

Exercise 2

Pick a 2,000-token document you regularly stuff into prompts. Generate a 300-token summary once. Use the summary in production. Measure whether downstream answers stay accurate on queries that don't need the full text.

Exercise 3

Build a "compression linter" — a small script that flags prompts containing low-value phrases like "please be helpful", "make sure you", "I hope this helps". Run it across your prompt library. You will be surprised how much fat is hiding in plain sight.
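A possible starting point for the linter, using only the standard library. The phrase list comes from the exercise text; the function name `lint_prompt` and the sample prompt are made up for illustration.

```python
import re

# Phrases taken from the exercise text; extend the list for your own library.
LOW_VALUE_PHRASES = ["please be helpful", "make sure you", "i hope this helps"]

_PATTERN = re.compile(
    "|".join(re.escape(p) for p in LOW_VALUE_PHRASES), re.IGNORECASE
)


def lint_prompt(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_phrase) for each low-value phrase found."""
    findings = []
    for line_number, line in enumerate(prompt.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            findings.append((line_number, match.group(0).lower()))
    return findings


sample = (
    "You are an assistant.\n"
    "Please be helpful and make sure you are clear.\n"
    "Answer briefly."
)
for line_number, phrase in lint_prompt(sample):
    print(f"line {line_number}: {phrase}")
```

This version checks one line at a time, so a phrase that wraps across a line break is missed. Normalising whitespace before matching would be a natural next step.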

7. Key Takeaways

Prompt compression cuts cost, latency, and noise — often improving answer quality, not just shrinking the bill.
Three flavours work together: structural (prose → markup), semantic (summarise long context), and reference (label and reuse).
Remove restatements of default behaviour. Most assistant models are already helpful and polite by default.
Compress, then evaluate. If a compressed prompt loses quality, restore the parts that were load-bearing.
Long context windows do not eliminate the need to compress; attention to material in the middle of a long context tends to fall off regardless of window size.
