Project: Generate a Complete Data Analysis Report Using AI

In this project you will turn a raw CSV into a polished data analysis report — profile, hypotheses, insights, charts, and an executive summary — using a chain of prompts. The deliverable is a Markdown or PDF report that reads like work from a junior analyst, generated from a dataset and a structured brief.

1. Introduction

Much of an analyst's time goes not to analysis itself but to framing: understanding the data, picking the right questions, and writing up findings so a non-technical reader cares. AI can compress all three. You still need to think; the AI just removes the boilerplate.

We will use a fictional but realistic dataset: orders.csv from a small e-commerce business with 12 columns and 18 months of order data. The workflow generalises to any tabular data — sales, marketing, HR, support tickets.

2. The Concept Explained

Producing a good analysis report takes five steps: profile the data, frame hypotheses, run analyses, visualise the findings, and write the executive summary. Each step maps to a separate prompt. Running them as a chain keeps the model focused and gives you a checkpoint at every stage.

Five focused prompts feed into a single deliverable. Reviewing each step prevents the final report from being confidently wrong.
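
The chain-with-checkpoints idea can be sketched in a few lines. This is only an illustration: `run_chain`, `fake_model`, and the `approve` callback are hypothetical names, and `fake_model` stands in for whatever model API you actually use.

```python
STAGES = ["profile", "hypotheses", "analyses", "charts", "summary"]

def run_chain(prompts, call_model, approve):
    """Run the stage prompts in order, stopping at the first rejected output."""
    outputs = {}
    context = ""
    for stage in STAGES:
        # Each prompt sees the approved outputs of all earlier stages.
        reply = call_model(context + prompts[stage])
        if not approve(stage, reply):
            break  # checkpoint failed: fix the prompt or the data, then rerun
        outputs[stage] = reply
        context += f"[{stage} output]\n{reply}\n\n"
    return outputs

def fake_model(prompt):
    """Stand-in for a real model call (assumption, not a real API)."""
    return f"reply to a prompt of {len(prompt)} characters"

prompts = {stage: f"<{stage} prompt text>" for stage in STAGES}

# Simulate a reviewer who rejects the analyses stage.
outputs = run_chain(prompts, fake_model,
                    approve=lambda stage, reply: stage != "analyses")
print(list(outputs))
```

In practice `approve` would be you reading the output, not a lambda. The point is that nothing downstream runs on an output you have not reviewed.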

3. The Problem Without a Pipeline

One-shot analysis

Here is my CSV. Analyse it and write a report.

What you get is a soft, generic essay full of phrases like "the data suggests interesting trends". There are no specific numbers, no charts, no hypotheses, and worst of all — no auditable calculations. The model has to guess everything about your business in one shot.

4. The Solution: A Five-Prompt Pipeline

Step 1 — Profile prompt

You are a senior data analyst. Profile the dataset below.

Return:
- Row count, column count
- For each column: data type, % missing, 3-row sample, and any
  obvious anomaly (e.g. negative prices, future dates)
- A "data quality flags" section with ≤5 issues that would block
  analysis

Then propose 5 sensible cleaning steps. Do not run them yet.

Context:
- Business: small e-commerce store selling home decor
- 18 months of order data
- Goal of the analysis: understand which months and product lines
  drive revenue

Dataset (first 20 rows shown, full file attached):
"""
order_id,order_date,customer_id,country,product_id,product_line,
units,unit_price,discount_pct,shipping_cost,returned,refund_amount
1001,2024-09-12,C204,UK,P-DEC-019,Lighting,2,29.99,0,4.5,FALSE,0
...
"""

The model returns a clean profile with column types, missing-value percentages, and three or four concrete data-quality flags. You read it and either approve the cleaning steps or adjust them.
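
One way to review the model's profile is to compute the same basics yourself and compare. The sketch below assumes pandas is installed. Only the first data row comes from the prompt above; the other two rows, including the negative price and the missing customer ID, are invented to give the checks something to find.

```python
import io
import pandas as pd

# First data row is from the prompt above; rows 1002 and 1003 are hypothetical.
SAMPLE_CSV = """order_id,order_date,customer_id,country,product_id,product_line,units,unit_price,discount_pct,shipping_cost,returned,refund_amount
1001,2024-09-12,C204,UK,P-DEC-019,Lighting,2,29.99,0,4.5,FALSE,0
1002,2024-09-13,C310,UK,P-DEC-044,Rugs,1,-15.00,10,6.0,FALSE,0
1003,2024-09-14,,FR,P-DEC-019,Lighting,3,29.99,0,4.5,TRUE,89.97
"""

def profile(df):
    """Return one row per column with its dtype and percentage of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })

df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["order_date"])
report = profile(df)

# Two of the anomaly checks named in the prompt.
negative_prices = int((df["unit_price"] < 0).sum())
future_dates = int((df["order_date"] > pd.Timestamp.today()).sum())

print(df.shape)
print(report)
print("negative prices:", negative_prices, "| future dates:", future_dates)
```

If the model's row count, types, or missing-value percentages disagree with this table, trust the table and ask the model to redo the profile.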

Step 2 — Hypothesis prompt

Based on the profile above and our business goal
(understand which months and product lines drive revenue),
propose 6 hypotheses worth testing. For each:
- one-line statement
- the columns and aggregation needed
- a falsifiable expected outcome
- priority (high / medium / low) with one-line reason

Avoid trivial hypotheses ("higher units = higher revenue"). Look
for hypotheses that, if true or false, change a business decision.

Sample output (abbreviated): "H1: Revenue from Lighting drops in Q1 by > 30%. Columns: order_date, product_line, units × unit_price × (1 − discount_pct). Priority: high — affects Q1 stocking…" You now have a real research plan instead of a vague "analyse it".
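
The revenue expression in H1 only works if `discount_pct` is stored as a fraction between 0 and 1. The sample row shows a value of 0, which does not settle the question, so check the profile before relying on it. A minimal sketch, where `net_revenue` and its `pct_is_fraction` flag are hypothetical names:

```python
def net_revenue(units, unit_price, discount_pct, pct_is_fraction=True):
    """Compute units x unit_price x (1 - discount) for one order line.

    Set pct_is_fraction=False when discounts are stored as 0-100
    rather than 0-1 (an assumption to confirm against the profile).
    """
    discount = discount_pct if pct_is_fraction else discount_pct / 100
    if not 0 <= discount <= 1:
        raise ValueError(f"discount {discount_pct} is outside the expected scale")
    return units * unit_price * (1 - discount)

# The sample row from the profile prompt: 2 units at 29.99 with no discount.
print(round(net_revenue(2, 29.99, 0), 2))

# A hypothetical 10% discount stored on a 0-100 scale.
print(round(net_revenue(1, 50.0, 10, pct_is_fraction=False), 2))
```

The range check is deliberate: passing a 0-100 value while the function expects a fraction raises an error instead of silently producing negative revenue.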

Step 3 — Analyses prompt

For each high-priority hypothesis, do two things:

1) Generate the pandas (or DuckDB) code that would test it.
   Use the column names from the profile. Annotate each line.

2) Given the summary statistics I'll paste back below, write a
   2-sentence "what we found" verdict per hypothesis. State
   explicitly whether the hypothesis was supported, partially
   supported, or rejected.

When stating any number, round to 1 decimal and include units
(£, %, units). Never write "approximately" — write a number.

You run the generated code yourself in a notebook (or ask a code-capable AI to run it), then paste the resulting tables back in. The model writes the verdicts. This separation — model writes code, you run it, model interprets results — is what makes the analysis trustworthy.
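
For H1, the code the model hands back might resemble the sketch below. The orders are invented for illustration and are not results from orders.csv. Two assumptions are baked in: `discount_pct` is a fraction, and the Q1 drop is measured against the preceding Q4, since the hypothesis does not name its baseline.

```python
import pandas as pd

# Hypothetical orders, invented for illustration only.
orders = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2024-10-05", "2024-11-20", "2024-12-15",
        "2025-01-10", "2025-02-18", "2025-03-07",
    ]),
    "product_line": ["Lighting"] * 6,
    "units": [4, 5, 6, 3, 3, 3],
    "unit_price": [30.0] * 6,
    "discount_pct": [0.0] * 6,  # assumed to be a fraction between 0 and 1
})

# Net revenue per order line, as defined in H1.
orders["revenue"] = orders["units"] * orders["unit_price"] * (1 - orders["discount_pct"])

# Total Lighting revenue per calendar quarter.
lighting = orders[orders["product_line"] == "Lighting"]
quarterly = lighting.groupby(lighting["order_date"].dt.to_period("Q"))["revenue"].sum()

# Compare Q1 with the preceding Q4 (assumed baseline).
q4 = quarterly.loc[pd.Period("2024Q4")]
q1 = quarterly.loc[pd.Period("2025Q1")]
drop_pct = round((q4 - q1) / q4 * 100, 1)

# H1 claims a drop of more than 30%.
verdict = "supported" if drop_pct > 30 else "rejected"
print(quarterly)
print(f"Q1 drop vs Q4: {drop_pct}% -> {verdict}")
```

The `quarterly` table and the `drop_pct` figure are what you paste back to the model. Where to draw the line for "partially supported" is a judgment call, so this sketch leaves it out.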

Step 4 — Charts prompt

For each supported or partially-supported hypothesis, design one
chart. Return:
- Chart type and why (e.g. line chart — time series; bar chart —
  ranked categorical values)
- The matplotlib code to generate it
- A 1-sentence "alt text" describing what the chart shows
- A 1-sentence caption suitable for a business reader

Constraints:
- One idea per chart. No double-axes.
- Title states the finding, not the variable.
  Good: "Lighting revenue drops 38% in Q1"
  Bad:  "Revenue by product line over time"

The titles-as-findings rule is the single biggest upgrade you can make to any business chart. The AI will follow it once you tell it.
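
A minimal matplotlib sketch of a finding-style title, assuming matplotlib is installed. The quarterly figures are invented, and the output filename is a placeholder. Computing the title from the data keeps the headline and the plotted values from drifting apart.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical quarterly Lighting revenue in GBP, invented for illustration.
quarters = ["2024 Q3", "2024 Q4", "2025 Q1"]
revenue = [480.0, 500.0, 310.0]

# Derive the headline number from the same data that is plotted.
drop_pct = round((revenue[1] - revenue[2]) / revenue[1] * 100)
title = f"Lighting revenue drops {drop_pct}% in Q1"

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.plot(quarters, revenue, marker="o")  # one idea, one axis
ax.set_title(title)                     # the title states the finding
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue (£)")
ax.set_ylim(bottom=0)                   # start at zero so the drop is not exaggerated
fig.tight_layout()
fig.savefig("lighting_q1_drop.png")     # placeholder filename
```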

Step 5 — Executive summary prompt

Write the executive summary for the report. Audience: the founder,
non-technical, 5 minutes to read.

Structure:
- 1-line headline finding
- 3 bullet "what we learned"
- 3 bullet "what to do next" (each must be a concrete action,
  not a platitude)
- 1 paragraph on limitations and what to investigate later

Use the hypotheses verdicts and chart titles as your source of
truth. Do not introduce new numbers. Do not use the words
"leverage", "synergy", or "in conclusion".

Stitch the outputs of steps 1–5 into one Markdown file. Put the executive summary on top, then follow the order in which you produced the rest: profile → hypotheses → analyses → charts. That's the report.
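
The stitching step can also enforce the "do not introduce new numbers" rule from the summary prompt. The sketch below assumes the executive summary goes first in the file and that section texts are plain strings. `new_numbers` and `build_report` are hypothetical helper names, and the section texts are placeholders.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")
ORDER = ["Executive summary", "Data profile", "Hypotheses", "Analyses", "Charts"]

def new_numbers(summary, sources):
    """Return numbers that appear in the summary but in none of the source texts."""
    allowed = set()
    for text in sources:
        allowed.update(NUMBER.findall(text))
    return set(NUMBER.findall(summary)) - allowed

def build_report(sections):
    """Join the five sections into one Markdown string, summary first."""
    missing = [name for name in ORDER if name not in sections]
    if missing:
        raise KeyError(f"missing sections: {missing}")
    # The summary may only reuse numbers from the verdicts and chart titles.
    stray = new_numbers(sections["Executive summary"],
                        [sections["Analyses"], sections["Charts"]])
    if stray:
        raise ValueError(f"summary introduces new numbers: {sorted(stray)}")
    parts = ["# Data analysis report"]
    parts += [f"## {name}\n\n{sections[name].strip()}" for name in ORDER]
    return "\n\n".join(parts) + "\n"

# Placeholder section texts, invented for illustration.
sections = {
    "Data profile": "(profile output)",
    "Hypotheses": "(hypothesis list)",
    "Analyses": "H1 supported: Lighting revenue fell 38% in Q1.",
    "Charts": "Lighting revenue drops 38% in Q1",
    "Executive summary": "Lighting revenue fell 38% in Q1.",
}
report = build_report(sections)
print(report)
```

The comparison is textual, so "38" and "38.0" count as different numbers. That strictness is useful here: it flags any figure the summary has reworded or re-rounded.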

5. Step-by-Step Breakdown

Always profile first. Without a profile, every later prompt is operating in the dark. Five minutes on profiling saves an hour of confused analysis.
Force the model to propose hypotheses. Generic "what insights can you find" prompts produce generic insights. Specific falsifiable hypotheses produce real findings.
Separate code generation from interpretation. Let the model write the code, but run it yourself. Then paste the actual results back in for interpretation. This single discipline guards against the worst class of hallucination: numbers that look right but aren't.
Titles state findings, not variables. "Revenue by product line" is a label. "Lighting drives 42% of total revenue" is a finding.
Write the executive summary last. When the summary is written first, the rest of the analysis warps to fit it. Doing it last keeps you honest.
Save the full chain. The report is the deliverable, but the prompt chain is the portfolio piece. A structured, repeatable workflow often impresses recruiters more than a single document does.
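
Saving the chain can be as simple as writing each prompt and its output to a JSON file kept next to the report. A minimal sketch using only the standard library: `save_chain`, `load_chain`, and the record fields are hypothetical, and a temporary directory stands in for your project folder.

```python
import json
import tempfile
from pathlib import Path

def save_chain(steps, path):
    """Write the ordered list of prompt/output records to a JSON file."""
    Path(path).write_text(
        json.dumps(steps, indent=2, ensure_ascii=False), encoding="utf-8"
    )

def load_chain(path):
    """Read the records back, for rerunning the chain on a new dataset."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

# Placeholder records; in practice there is one per stage of the pipeline.
steps = [
    {"stage": "profile", "prompt": "<profile prompt text>", "output": "<model reply>"},
    {"stage": "hypotheses", "prompt": "<hypothesis prompt text>", "output": "<model reply>"},
]

with tempfile.TemporaryDirectory() as tmp:
    chain_file = Path(tmp) / "prompt_chain.json"
    save_chain(steps, chain_file)
    restored = load_chain(chain_file)

print(len(restored), "steps restored")
```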

6. Practice Exercises

Exercise 1

Find a public dataset (Kaggle, the UK government open data portal, or your own export from a tool you use). Run the profile prompt and read the output carefully. Note any anomalies the model spotted that you would have missed.

Exercise 2

Generate hypotheses for the same dataset twice — once with a vague "find interesting insights" prompt, and once with the structured hypothesis prompt above. Compare the two lists. The contrast is the most powerful argument for using a chain.

Exercise 3

Add a "robustness check" step between analyses and charts: "For each finding, list three reasons it might be wrong (data quality, sample bias, confounders) and one quick check that would rule each out." This single step transforms naive analysis into defensible analysis.

7. Key Takeaways

Analysis is five separate jobs: profile, hypothesise, analyse, visualise, summarise. Give each its own prompt.
Hypotheses must be falsifiable and tied to a decision. Otherwise the analysis stays vague.
The model writes code; you run it; the model interprets the result. Never let the model invent numbers.
Chart titles should state findings, not variables. This single rule upgrades every report you produce.
Write the executive summary last and keep it strictly grounded in the earlier steps.
