Constrained Generation: Controlling AI Output Format Precisely

Free-form text is great for humans and a nightmare for software. Constrained generation forces a model's output into a precise shape — valid JSON, regex-matching strings, schema-validated objects — so downstream code can parse it without defensive heroics. This is what turns a chat model into a real building block of a system.

1. Introduction

The first time you try to plug an LLM into a real pipeline, you discover the same lesson developers learned about web scraping in the 2000s: parsing free-text is hell. The model puts the JSON inside a markdown code fence on Monday, forgets the trailing brace on Tuesday, slips in a friendly "Sure, here's the JSON:" on Wednesday. Every shape change breaks your parser.

Constrained generation fixes this by either (a) using API-level features that guarantee the output matches a schema, or (b) writing prompts so disciplined that schema violations become rare enough to handle as exceptions. Most production systems use both.

2. The Concept Explained

There are four levels of "constrained" you can apply, in increasing order of strength:

Format described in prose. "Return your answer as JSON with the fields X, Y, Z." Weakest. Often works on capable models but no guarantees.
Format demonstrated by example. Few-shot prompts showing the exact shape. Stronger — the model is excellent at imitating shape from examples.
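JSON mode / structured outputs. Many APIs (OpenAI, Anthropic, Gemini, open-source servers) offer some form of this. Plain JSON mode typically guarantees only that the output is syntactically valid JSON; structured outputs go further, accepting a JSON Schema and constraining the output to match it. The schema-enforcing variant is the gold standard when available, though the supported subset of JSON Schema differs by provider.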
Grammar-constrained decoding. The decoder is restricted at the token level to only emit tokens that keep the output valid against a grammar (regex, BNF, JSON Schema). Available in llama.cpp, vLLM, Outlines, and other inference engines.

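With grammar-constrained decoding, every emitted token keeps the output valid against the grammar, so a completed response is guaranteed to match the schema's shape. The guarantee covers shape only: a response cut off by a token limit can still be incomplete, and the values themselves still need semantic validation downstream.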
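To make the token-level idea concrete, here is a toy sketch of constrained decoding in plain Python. It is not how llama.cpp, vLLM, or Outlines implement it; the vocabulary, the "grammar" (a fixed set of allowed strings), and the `fake_logits` scoring function are all hypothetical stand-ins. The point is only the masking step: tokens that would make the output invalid get a score of negative infinity before the next token is chosen.

```python
import math

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of tokens.
VOCAB = ["lo", "w", "hi", "gh", "High", "normal", "critical", " maybe", "<eos>"]

# The "grammar": the output must be exactly one of these strings.
ALLOWED = ["low", "normal", "high", "critical"]


def allowed_next(prefix: str) -> set[str]:
    """Tokens that keep `prefix + token` a prefix of some allowed string."""
    legal = set()
    for tok in VOCAB:
        if tok == "<eos>":
            # Stopping is only legal once the output is a complete allowed string.
            if prefix in ALLOWED:
                legal.add(tok)
        elif any(s.startswith(prefix + tok) for s in ALLOWED):
            legal.add(tok)
    return legal


def constrained_decode(logits_fn, max_steps: int = 8) -> str:
    """Greedy decoding where illegal tokens are masked to -inf at every step."""
    out = ""
    for _ in range(max_steps):
        logits = logits_fn(out)
        legal = allowed_next(out)
        masked = {t: (s if t in legal else -math.inf) for t, s in logits.items()}
        best = max(masked, key=masked.get)
        if best == "<eos>":
            return out
        out += best
    # Mirrors a real-world caveat: a step or token limit can cut output short.
    raise RuntimeError("ran out of steps before the grammar allowed <eos>")


def fake_logits(prefix: str) -> dict[str, float]:
    """Stand-in for a model that would prefer the invalid token 'High'."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores.update({"High": 5.0, " maybe": 4.0, "hi": 3.0, "gh": 2.0, "<eos>": 1.0})
    return scores


result = constrained_decode(fake_logits)
print(result)  # high
```

Left unconstrained, this fake model would emit "High" first. With the mask applied it is forced through "hi" and "gh", and may only stop once the string is complete.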

3. The Problem Without Constraints

Prose-only format request

Extract the customer's name, intent, and urgency from
the message below. Return as JSON.

Message: """I'd like to cancel my account please.
The product is broken and I've waited three days."""

On a typical week of production traffic you will see all of these variations: the model wraps the JSON in ```json … ```, adds a preface like "Here's the JSON you asked for:", leaves a trailing comma after the last field, or uses "urgency": "high" on one call and "urgency": "High" on another. Your parser breaks at 2 a.m.
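The sketch below runs a strict JSON parser over hand-written examples of these variations. The response strings and field names are hypothetical illustrations, not captured model output. Note that the casing variant parses without error, which makes it the most dangerous of the set: it fails silently, later, in whatever code compares the value.

```python
import json

FENCE = "`" * 3  # built at runtime to avoid a literal code fence in this example
BODY = '{"name": "unknown", "intent": "cancel", "urgency": "high"}'

# Hypothetical raw responses, one per failure mode.
RESPONSES = {
    "clean": BODY,
    "fenced": f"{FENCE}json\n{BODY}\n{FENCE}",
    "preface": f"Here's the JSON you asked for:\n{BODY}",
    "trailing_comma": '{"name": "unknown", "intent": "cancel", "urgency": "high",}',
    "missing_brace": '{"name": "unknown", "intent": "cancel", "urgency": "high"',
    "wrong_case": '{"name": "unknown", "intent": "cancel", "urgency": "High"}',
}


def parses(text: str) -> bool:
    """True if the text is accepted by a strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


report = {label: parses(text) for label, text in RESPONSES.items()}
for label, ok in report.items():
    print(f"{label:15} parses={ok}")
```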

4. The Solution: Explicit Schema + Structured Output

Schema + structured output API

// JSON Schema passed to the API alongside the prompt
{
  "name": "support_ticket",
  "schema": {
    "type": "object",
    "properties": {
      "customer_name":   { "type": "string" },
      "intent": {
        "type": "string",
        "enum": ["cancel", "complaint", "question", "compliment"]
      },
      "urgency": {
        "type": "string",
        "enum": ["low", "normal", "high", "critical"]
      },
      "needs_human":     { "type": "boolean" }
    },
    "required": ["customer_name", "intent", "urgency", "needs_human"],
    "additionalProperties": false
  },
  "strict": true
}

// Prompt
You are a support triage classifier. Extract the fields
defined in the schema. If the customer's name is missing,
use "unknown". Do not invent details. If a value is
ambiguous, pick the closest enum value.

Message: """I'd like to cancel my account please.
The product is broken and I've waited three days."""

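With strict mode enabled, the API constrains a completed response to the schema: urgency can only be one of the four enum values, and needs_human is always a boolean. Your downstream code parses it with JSON.parse and uses the values directly, with no defensive parsing required. What the guarantee does not cover is whether the values are right, or a response cut short by a token limit, so semantic checks and basic error handling still apply.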
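Here is a sketch of the consuming side in Python, where `json.loads` plays the role of JSON.parse. The checks restate the support_ticket schema by hand using only the standard library. With a strict structured-output API they should never fire, so read them as an executable description of the contract, or as a safety net for providers without strict mode. The response string is a hypothetical example, not real API output.

```python
import json

# Constraints copied from the support_ticket schema above.
REQUIRED = {"customer_name", "intent", "urgency", "needs_human"}
INTENTS = ("cancel", "complaint", "question", "compliment")
URGENCIES = ("low", "normal", "high", "critical")


def check_ticket(raw: str) -> dict:
    """Parse a response and re-check the schema's constraints by hand.

    Raises ValueError on any violation (json.JSONDecodeError is a subclass).
    """
    ticket = json.loads(raw)
    if not isinstance(ticket, dict):
        raise ValueError("expected a JSON object")
    # "required" plus "additionalProperties": false means an exact key set.
    if set(ticket) != REQUIRED:
        raise ValueError(f"unexpected field set: {sorted(ticket)}")
    if not isinstance(ticket["customer_name"], str):
        raise ValueError("customer_name must be a string")
    if ticket["intent"] not in INTENTS:
        raise ValueError(f"intent not in enum: {ticket['intent']!r}")
    if ticket["urgency"] not in URGENCIES:
        raise ValueError(f"urgency not in enum: {ticket['urgency']!r}")
    if not isinstance(ticket["needs_human"], bool):
        raise ValueError("needs_human must be a boolean")
    return ticket


# Hypothetical response for the cancellation message above.
raw_response = (
    '{"customer_name": "unknown", "intent": "cancel", '
    '"urgency": "high", "needs_human": true}'
)
ticket = check_ticket(raw_response)
print(ticket["intent"], ticket["urgency"], ticket["needs_human"])
```

In a real project a JSON Schema validation library would replace the hand-written checks; they are spelled out here only to keep the example dependency-free.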

5. Step-by-Step Breakdown

Define the schema before writing the prompt. Decide every field, every type, every enum. Treat the schema as the API contract between the model and your code.
Use enums everywhere you can. Enums turn open-ended classification into a closed set, which is easier for both the model and your downstream logic.
Prefer structured-output APIs over prose instructions. If your model provider offers a JSON Schema mode, use it. It eliminates an entire class of bugs.
Add validation even with structured output. Schemas guarantee shape but not semantic correctness. A customer_name can be schema-valid but obviously wrong. Validate after parsing.
Handle the rare failure. Even constrained decoding can fail in edge cases (especially with custom grammars). Wrap parsing in try/catch and either retry, fall back to a simpler prompt, or escalate to a human.
Keep examples handy. Even with API-level constraints, a few-shot example or two inside the prompt improves semantic quality — the model picks better field values when it sees what good looks like.
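The validation and failure-handling steps above can be sketched as one small function. Everything here is illustrative: the two semantic rules are assumptions chosen for the example, `triage` and its return shape are hypothetical, and the model callables are stubs standing in for real API calls.

```python
import json
from typing import Callable, Optional

MESSAGE = (
    "I'd like to cancel my account please. "
    "The product is broken and I've waited three days."
)


def semantic_problems(ticket: dict, message: str) -> list[str]:
    """Checks a schema cannot express. The rules are illustrative assumptions."""
    problems = []
    name = ticket["customer_name"]
    if name != "unknown" and name.lower() not in message.lower():
        problems.append(f"customer_name {name!r} does not appear in the message")
    if ticket["urgency"] == "critical" and not ticket["needs_human"]:
        problems.append("critical tickets should be routed to a human")
    return problems


def triage(message: str,
           call_model: Callable[[str], str],
           fallback_model: Optional[Callable[[str], str]] = None) -> dict:
    """Try the primary model, then an optional fallback, then escalate."""
    problems: list[str] = []
    for model in (call_model, fallback_model):
        if model is None:
            continue
        try:
            ticket = json.loads(model(message))
            problems = semantic_problems(ticket, message)
        except (ValueError, KeyError, TypeError, AttributeError) as exc:
            problems = [f"unusable response: {exc}"]
        if not problems:
            return {"status": "ok", "ticket": ticket}
    return {"status": "escalate", "problems": problems}


# Stub "models" standing in for real API calls.
def good_model(_: str) -> str:
    return ('{"customer_name": "unknown", "intent": "cancel", '
            '"urgency": "high", "needs_human": true}')


def hallucinating_model(_: str) -> str:
    return ('{"customer_name": "Alice", "intent": "cancel", '
            '"urgency": "high", "needs_human": true}')


def truncated_model(_: str) -> str:
    return '{"customer_name": "unknown", "intent": "can'


print(triage(MESSAGE, hallucinating_model, good_model)["status"])  # ok
print(triage(MESSAGE, truncated_model)["status"])  # escalate
```

The hallucinated name is schema-valid but fails the semantic check, which is exactly the gap that post-parse validation exists to close.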

Tip: When the API doesn't support structured outputs, the next best thing is "JSON between tags": instruct the model to emit its JSON only inside a <json>...</json> block and parse what is between the tags. This handles the most common failure (prefatory chatter) without needing any special API support.
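A minimal sketch of the tag-based extraction, using only the standard library. The function name `extract_json` and the sample reply are hypothetical.

```python
import json
import re

# Non-greedy match so only the first <json>...</json> block is captured;
# DOTALL lets the JSON span multiple lines.
JSON_BLOCK = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def extract_json(text: str):
    """Return the parsed contents of the first <json>...</json> block."""
    match = JSON_BLOCK.search(text)
    if match is None:
        raise ValueError("no <json>...</json> block found")
    # json.loads tolerates surrounding whitespace; malformed JSON still raises.
    return json.loads(match.group(1))


reply = (
    "Sure, here you go:\n"
    "<json>\n"
    '{"intent": "cancel", "urgency": "high"}\n'
    "</json>\n"
    "Let me know if you need anything else!"
)
extracted = extract_json(reply)
print(extracted)
```

This only strips chatter around the payload. Malformed JSON inside the tags still raises, so the call belongs inside the same error handling as any other parse.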

6. Practice Exercises

Exercise 1

Pick a task that produces structured output (extraction, classification, routing) and write the JSON Schema before the prompt. Then write the prompt to match. Run it on 30 real inputs and measure the schema-validity rate.
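As a starting scaffold for the measurement, the sketch below computes a validity rate over a list of raw responses. "Valid" here means only that the response parses and contains the required fields, which is a simplifying assumption; a full JSON Schema validator would be stricter. The sample responses are made up for illustration.

```python
import json

REQUIRED = ("intent", "urgency")  # hypothetical required fields


def is_valid(raw: str, required=REQUIRED) -> bool:
    """True if raw parses as a JSON object containing every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)


def validity_rate(responses: list[str]) -> float:
    """Fraction of responses that pass is_valid."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_valid(r) for r in responses) / len(responses)


# Made-up sample: two valid, one with a preface, one missing a field.
SAMPLE = [
    '{"intent": "cancel", "urgency": "high"}',
    '{"intent": "question", "urgency": "low"}',
    'Sure! {"intent": "cancel", "urgency": "high"}',
    '{"intent": "cancel"}',
]
rate = validity_rate(SAMPLE)
print(f"schema-validity rate: {rate:.0%}")
```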

Exercise 2

Compare three implementations of the same extraction task: (a) prose-only format request, (b) few-shot prompt with examples, (c) full structured-output API. Measure validity rate and downstream accuracy, and note the cost of each. Expect the cost difference to be small and the reliability difference to be large, but let your measurements decide.

Exercise 3

Build a tiny retry loop: on schema-validation failure, automatically retry once with the validation error appended to the prompt ("Your previous response failed validation: missing required field 'urgency'. Try again."). Measure how often the retry succeeds. This is a cheap and effective safety net.
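One possible shape for the loop is sketched below. The names `validation_error` and `call_with_retry` are hypothetical, the validation is deliberately minimal (it checks parsing and required fields only), and `flaky_model` is a stub that stands in for a real API call so the example runs on its own.

```python
import json
from typing import Callable, Optional

REQUIRED = ("customer_name", "intent", "urgency", "needs_human")


def validation_error(raw: str) -> Optional[str]:
    """Return a human-readable error, or None if the response passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"not valid JSON ({exc.msg})"
    if not isinstance(data, dict):
        return "top-level value is not an object"
    for field in REQUIRED:
        if field not in data:
            return f"missing required field '{field}'"
    return None


def call_with_retry(prompt: str,
                    call_model: Callable[[str], str],
                    max_retries: int = 1) -> tuple[str, int]:
    """Return (raw_response, attempts_used); raise if every attempt fails."""
    current = prompt
    error = None
    for attempt in range(1, max_retries + 2):
        raw = call_model(current)
        error = validation_error(raw)
        if error is None:
            return raw, attempt
        # Append the validation error so the model can correct itself.
        current = (f"{prompt}\n\nYour previous response failed validation: "
                   f"{error}. Try again.")
    raise ValueError(f"still invalid after {max_retries + 1} attempts: {error}")


def flaky_model(prompt: str) -> str:
    """Stub: omits 'urgency' unless the prompt carries a validation error."""
    if "failed validation" in prompt:
        return ('{"customer_name": "unknown", "intent": "cancel", '
                '"urgency": "high", "needs_human": true}')
    return '{"customer_name": "unknown", "intent": "cancel", "needs_human": true}'


raw, attempts = call_with_retry("Extract the fields.", flaky_model)
print(attempts)  # 2
```

Counting `attempts` across a batch of inputs gives the retry success rate the exercise asks for.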

7. Key Takeaways

Constrained generation is the bridge between free-text models and real software systems — without it, your parser becomes your single biggest source of bugs.
Four levels exist, from prose-described formats up to grammar-constrained decoding. Use the strongest one your stack supports.
Schemas should be designed first, prompts second. Treat the schema as the contract.
Enums are your friend — they turn open-ended generation into reliable classification.
Always validate after parsing, even with structured outputs. Schema-valid is not the same as semantically correct.
