A single model run can be smart but unreliable. Self-consistency prompting fixes that by sampling several independent reasoning paths for the same question and choosing the answer that shows up most often. It is one of the simplest ways to trade a little extra compute for a meaningful jump in accuracy.
Even with Chain-of-Thought, a single model response is one trajectory through a vast space of possible answers. With a nonzero temperature, you might get a slightly different reasoning chain each time you press send. Most of those chains will converge on the same correct answer — but a few will go off the rails. The catch is you usually do not know which response you got.
Self-consistency, introduced by Wang et al. in 2022, is a beautifully simple fix. Run the same prompt several times with sampling enabled, collect the final answers, and take a majority vote. The wisdom of crowds — applied to one model talking to itself. In the original paper, self-consistency added between 4 and 18 accuracy points on arithmetic, common-sense, and symbolic reasoning benchmarks compared to plain Chain-of-Thought.
The core insight is statistical. When a problem has one correct answer but multiple valid routes to that answer, sampling makes the model take different routes each time. Most routes lead to the correct answer, which is where the probability mass concentrates. A few unlucky routes lead elsewhere. If you sample five or seven completions and tally the final answers, the correct answer usually wins the vote, even when individual responses get it wrong.
It is the same logic as asking five colleagues to estimate a number independently and taking the median. As long as their errors are largely uncorrelated, the truth shows up in the middle.
Consider a tricky word problem where the model is right roughly 70% of the time. If you take a single answer, you have a 30% chance of being wrong with no way to tell. The model will sound equally confident either way — that is the worst kind of error in any system that depends on correctness.
Single sample
A factory produces widgets at 240 units/hour for the
first 4 hours, then output drops by 25% for the next
6 hours due to material shortage. How many widgets are
produced in the full 10-hour shift?
Think step by step.
Run this once. You will probably get the right answer (2,040). But on roughly three runs in ten, the model will misread "drops by 25%" as "drops to 25%" and return 1,320. There is no signal in the single response that says "I might be wrong here".
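For reference, both totals can be checked by hand. The short sketch below works through the two readings of the prompt; the variable names are illustrative.

```python
RATE = 240         # units per hour during the first phase
FIRST_HOURS = 4
LATER_HOURS = 6

first_phase = RATE * FIRST_HOURS  # 240 * 4 = 960

# Correct reading: output drops BY 25%, so the rate becomes 75% of 240 = 180
correct_total = first_phase + RATE * 75 // 100 * LATER_HOURS

# Misreading: output drops TO 25% of 240 = 60
misread_total = first_phase + RATE * 25 // 100 * LATER_HOURS

print(correct_total, misread_total)  # 2040 1320
```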
Run the same prompt N times with a nonzero temperature (around 0.7 is a good default), extract the final answer from each response, and pick the answer that appears most often.
Self-consistency loop (pseudocode)
from collections import Counter

prompt = """
A factory produces widgets at 240 units/hour for the
first 4 hours, then output drops by 25% for the next
6 hours. How many widgets are produced in the 10-hour
shift?
Think step by step, then write the final number on a
line that starts with "Answer:".
"""

answers = []
for i in range(7):
    # model.generate is a placeholder for your own API client call
    response = model.generate(prompt, temperature=0.7)
    # keep only the text after the last "Answer:" marker
    final = response.rsplit("Answer:", 1)[-1].strip()
    answers.append(final)

votes = Counter(answers)
result = votes.most_common(1)[0][0]
print(result, "votes:", votes)
You now get both the most-likely-correct answer and a confidence signal — a clean 7-out-of-7 sweep means very high confidence; a 4-3 split is a flag to investigate further or escalate to a stronger model.
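One way to turn the vote into an explicit confidence signal is sketched below. The normalize helper and the 0.6 threshold are assumptions chosen for illustration, not values from the original method; the normalization also assumes commas are thousands separators, and the vote function assumes at least one answer.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Make "2,040", " 2040 " and "2040." count as the same vote."""
    return answer.strip().rstrip(".").replace(",", "")

def vote(answers: list, min_share: float = 0.6) -> tuple:
    """Return (winner, share of votes, whether the share clears min_share)."""
    counts = Counter(normalize(a) for a in answers)
    winner, top = counts.most_common(1)[0]
    share = top / len(answers)
    return winner, share, share >= min_share

# A clean sweep clears the threshold
print(vote(["2040"] * 7))

# A 4-3 split does not, so the caller can investigate or escalate
print(vote(["2,040", "2040", "1320", "2040", "1320", "1320", "2040."]))
```

The boolean in the result is only a flag; what to do with a weak vote (re-sample, route to a stronger model, or ask a human) is left to the caller.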
Note: Ask the model to put its final answer in a fixed format, such as "Answer: <value>" or inside an XML tag like <final></final>. Otherwise voting becomes a parsing nightmare.
Tip: Self-consistency is embarrassingly parallel. Fire all N requests at once via async / concurrent calls instead of one after the other. With most APIs, seven parallel calls finish in roughly the same wall-clock time as one, provided you stay within your rate limits.
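The two answer formats and the parallel fan-out mentioned above can be sketched together using Python's asyncio. Here fake_generate is a hypothetical stand-in that returns canned responses; swap in your own async client call. The function names extract_answer and sample_answers are likewise illustrative.

```python
import asyncio
import random
import re
from collections import Counter
from typing import Optional

# Accept either an "Answer: <value>" line or a <final>value</final> tag.
LINE_RE = re.compile(r"^Answer:\s*(.+?)\s*$", re.MULTILINE)
TAG_RE = re.compile(r"<final>\s*(.*?)\s*</final>", re.DOTALL)

def extract_answer(text: str) -> Optional[str]:
    """Return the last marked answer in a response, or None if none is found."""
    matches = LINE_RE.findall(text) or TAG_RE.findall(text)
    return matches[-1] if matches else None

async def fake_generate(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a real async API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    value = random.choice(["2040", "2040", "2040", "2040", "1320"])
    return f"...reasoning...\nAnswer: {value}"

async def sample_answers(prompt: str, n: int = 7, temperature: float = 0.7) -> list:
    """Send n requests concurrently and return the parsed answers."""
    tasks = [fake_generate(prompt, temperature) for _ in range(n)]
    responses = await asyncio.gather(*tasks)
    # Drop responses where no marked answer could be found.
    return [a for a in map(extract_answer, responses) if a is not None]

answers = asyncio.run(sample_answers("How many widgets are produced?"))
print(Counter(answers).most_common())
```

Because asyncio.gather awaits all requests together, total latency is close to that of the slowest single call rather than the sum of all calls. With a real provider you may need to cap concurrency to respect rate limits.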
Pick a non-trivial arithmetic or logic problem the model occasionally gets wrong. Run it 10 times with temperature 0.7, collect the answers, and look at the distribution. How often does the majority answer differ from a single sample?
Apply self-consistency to a classification task — for example, labelling sentences as positive / neutral / negative. Use 5 samples per sentence. Track which inputs produce a unanimous vote and which produce a split. The split cases are usually the genuinely ambiguous ones.
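As a starting point for the classification exercise, the sketch below separates inputs with a unanimous vote from inputs with a split vote. The example votes are hand-written for illustration, not real model output, and split_by_agreement is an illustrative name.

```python
from collections import Counter

def split_by_agreement(votes_per_input: dict) -> tuple:
    """Separate inputs with a unanimous label from inputs with a split vote."""
    unanimous, split = {}, {}
    for text, labels in votes_per_input.items():
        counts = Counter(labels)
        winner, _ = counts.most_common(1)[0]
        target = unanimous if len(counts) == 1 else split
        target[text] = (winner, dict(counts))
    return unanimous, split

# Hand-written example votes (5 samples per sentence)
votes = {
    "I love this phone.": ["positive"] * 5,
    "It arrived on Tuesday.": ["neutral"] * 5,
    "Well, that was something.": ["neutral", "negative", "neutral", "positive", "negative"],
}

unanimous, split = split_by_agreement(votes)
print("unanimous:", sorted(unanimous))
print("split:", split)
```

When two labels tie for first place, the reported winner simply follows Counter's ordering, so tied inputs are best reviewed by hand rather than trusted.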
Compare cost vs accuracy. Run the same task with: (a) one sample at temperature 0, (b) one sample at 0.7, (c) seven samples at 0.7 with majority vote. Measure accuracy on a small evaluation set and the total token cost. Find the sweet spot for your problem.
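A minimal harness for the cost-versus-accuracy comparison might look like the sketch below. It assumes a hypothetical sample_fn(question) that wraps your own model call at a chosen temperature and returns the parsed answer together with the tokens used. The simulated sampler exists only so the code runs on its own; its 70% hit rate and 150 tokens per call are made-up demonstration values.

```python
import random
from collections import Counter

def evaluate(sample_fn, dataset, n_samples):
    """Return (accuracy, total_tokens) using a majority vote over n_samples per question.

    sample_fn(question) must return (answer, tokens_used).
    """
    correct = 0
    total_tokens = 0
    for question, gold in dataset:
        answers = []
        for _ in range(n_samples):
            answer, tokens = sample_fn(question)
            answers.append(answer)
            total_tokens += tokens
        winner = Counter(answers).most_common(1)[0][0]
        correct += winner == gold
    return correct / len(dataset), total_tokens

# Simulated sampler for demonstration only
rng = random.Random(0)

def simulated_sample(question):
    answer = "right" if rng.random() < 0.7 else "wrong"
    return answer, 150

dataset = [(f"q{i}", "right") for i in range(200)]
for n in (1, 7):
    accuracy, tokens = evaluate(simulated_sample, dataset, n)
    print(f"n={n}: accuracy={accuracy:.2f}, tokens={tokens}")
```

Token cost grows linearly with the number of samples, so the question for your task is whether the accuracy gained per extra sample justifies that cost.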