Exploratory Data Analysis (EDA) with AI Prompts

EDA is where data science projects live or die — miss a key distribution or an unexpected correlation and your model will underperform in ways that are hard to diagnose later. AI can be your most productive EDA companion if you give it a structured brief. This topic shows you how.

1. Introduction

Exploratory data analysis is the process of getting to know a dataset before you model it: understanding its shape, distributions, relationships, and anomalies. Traditionally this means writing dozens of small scripts and plots. With well-structured prompts, you can ask AI to draft the code for an EDA notebook, covering univariate statistics, bivariate correlations, time-series patterns, and anomaly flags, all tailored to your specific schema and business question.

2. The Concept Explained

Good EDA moves through layers of understanding, from broad to specific. First you understand the overall shape of the data (how many rows, what types). Then you look at individual column distributions. Then you explore relationships between columns. Finally, you look for anomalies — values or patterns that don't fit the expected story. Think of it like a pyramid: wide at the base (shape), narrowing toward specific hypotheses at the top.

The EDA pyramid: always start at the base (shape and schema) before moving toward anomaly detection at the apex.

An AI EDA prompt should mirror this pyramid. Start with a prompt that generates a data profiling report, then follow up with targeted prompts for correlations, time trends, and outlier investigation. Chaining prompts through the pyramid is far more effective than asking for "a complete EDA" in a single shot.

Prompt chain structure for EDA

Prompt 1 — Profile: Generate df.describe(), null counts, unique value counts, and dtype summary.
Prompt 2 — Distributions: Plot histograms for numeric columns, bar charts for categoricals.
Prompt 3 — Relationships: Correlation heatmap, scatter plots for target vs top features.
Prompt 4 — Anomalies: Flag rows outside 3 standard deviations; check for impossible values per column.
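
As a rough idea of what Prompt 1 might return, here is a minimal profiling sketch in Pandas. The profile helper and the four-row sample DataFrame are made up for illustration; they do not come from a real dataset or a real AI response.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, null share, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),  # nunique() ignores nulls by default
    })


# Tiny made-up sample, for illustration only.
df = pd.DataFrame({
    "plan_type": ["basic", "pro", "pro", None],
    "monthly_revenue": [9.0, 49.0, 49.0, 199.0],
})

print(df.shape)       # rows and columns
print(profile(df))    # per-column dtype, nulls, unique counts
print(df.describe())  # summary statistics for numeric columns
```

Keeping the profile in a single table makes it easy to paste back into the conversation as context for Prompts 2 to 4.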

3. The Problem Without This Technique

Weak prompt

Do an EDA on my data and find insights.

No schema, no business question, no output format. The AI will produce a boilerplate notebook that may not even run, and its "insights" will be generic observations about made-up column names. You end up spending more time fixing the output than you would have spent writing the code yourself.

Stronger prompt

Act as a data analyst running EDA in a Jupyter notebook.

Dataset: subscription SaaS metrics, ~50,000 rows.
Columns:
  customer_id (int), signup_date (datetime),
  plan_type (str: basic/pro/enterprise),
  monthly_revenue (float), churn_date (nullable datetime),
  country (str, 40 unique values), support_tickets (int).

Business question: What customer characteristics
and behaviours predict churn within 90 days?

Generate a Pandas + Matplotlib EDA script that:
1. Prints shape, dtypes, and null counts.
2. Plots distributions of monthly_revenue and
   support_tickets (histogram + box plot each).
3. Plots churn rate by plan_type and country (bar charts).
4. Shows a correlation matrix for numeric columns.
5. Flags any customer_id values that appear more than once.

Use plt.tight_layout() and label all axes clearly.
Return only the code, structured as commented sections.

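A prompt like this typically yields a structured, runnable script with clearly labelled sections, sensible plot choices per data type, and a correlation matrix that is useful later for feature selection. Expect aggregations in the style of df.groupby('plan_type')['churn_flag'].mean() and a heatmap built from df.corr(numeric_only=True). Check two things in the output. First, churn_flag is not in the schema, so the script has to derive it from churn_date. Second, the prompt asks for Pandas + Matplotlib, so a seaborn.heatmap call means the AI has added a dependency you did not request.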
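
To make the churn aggregation concrete, here is a minimal sketch of the two calculations such a script would need. It assumes pandas 1.5 or later (for the numeric_only argument) and assumes "churn within 90 days" means a churn_date no more than 90 days after signup_date; the four rows are invented for illustration.

```python
import pandas as pd

# Made-up rows following the schema in the prompt, for illustration only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "signup_date": pd.to_datetime(
        ["2024-01-01", "2024-01-15", "2024-02-01", "2024-02-10"]),
    "plan_type": ["basic", "basic", "pro", "pro"],
    "monthly_revenue": [9.0, 9.0, 49.0, 49.0],
    "churn_date": pd.to_datetime(["2024-02-01", None, None, "2024-09-01"]),
    "support_tickets": [5, 0, 1, 2],
})

# Assumed definition: churned within 90 days of signing up.
days_to_churn = (df["churn_date"] - df["signup_date"]).dt.days
# Customers who never churned have NaN here, which compares as False.
df["churn_flag"] = (days_to_churn <= 90).astype(int)

# Churn rate per plan: the mean of a 0/1 flag is the share of churners.
churn_by_plan = df.groupby("plan_type")["churn_flag"].mean()
print(churn_by_plan)

# numeric_only=True keeps string and datetime columns out of the matrix;
# customer_id is dropped because it is an identifier, not a measurement.
corr = df.drop(columns="customer_id").corr(numeric_only=True)
print(corr)
```

The corr table can be passed to whichever plotting call the script uses for the heatmap.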

4. The Solution

Chain your EDA prompts through the pyramid. The first prompt profiles the data; subsequent prompts drill into specific layers. Always state the business question — "what predicts churn?" shapes which analyses are actually useful and prevents the AI from spending effort on irrelevant columns.
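
Every prompt in the chain starts from the same schema description, so it can help to generate that text rather than type it by hand. The sketch below is one possible way to do it; schema_brief and its max_categories parameter are hypothetical names introduced here, and the sample data is invented.

```python
import pandas as pd


def schema_brief(df: pd.DataFrame, max_categories: int = 5) -> str:
    """Build a plain-text schema summary to paste into an EDA prompt."""
    lines = [f"Dataset: {len(df):,} rows, {df.shape[1]} columns.", "Columns:"]
    for name in df.columns:
        col = df[name]
        detail = f"{col.dtype}, {col.nunique()} unique, {col.isna().sum()} null"
        # For low-cardinality non-numeric columns, list the categories.
        is_numeric = pd.api.types.is_numeric_dtype(col)
        if not is_numeric and col.nunique() <= max_categories:
            values = "/".join(sorted(map(str, col.dropna().unique())))
            detail += f", values: {values}"
        lines.append(f"  {name} ({detail})")
    return "\n".join(lines)


# Made-up sample, for illustration only.
df = pd.DataFrame({
    "plan_type": ["basic", "pro", "enterprise", "pro"],
    "monthly_revenue": [9.0, 49.0, 199.0, 49.0],
})

brief = schema_brief(df)
print(brief)
```

Paste the printed text under your role line, then add the business question and the numbered list of analyses you want.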

5. Step-by-Step Breakdown

Share the schema and row count. The AI needs to know what it is working with before it can suggest sensible visualisations.
State the business question. This filters the EDA to analyses that matter rather than analyses that are generically possible.
Request profiling first. Shape, dtypes, nulls, unique counts: this is fast to generate and immediately surfaces issues.
Ask for visualisations by data type. Numeric → histogram + box plot. Categorical → bar chart. Date → line chart over time. Pairwise → scatter / heatmap.
Follow up with anomaly detection. After you have seen the distributions, send a follow-up prompt: "Flag rows where monthly_revenue is more than 3 standard deviations above the mean."
Iterate with follow-up questions. EDA is a conversation. Paste the output back in: "The histogram shows a bimodal distribution in monthly_revenue. What would cause that and how should I investigate further?"
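
The anomaly step above can be sketched as follows. The data is synthetic (200 generated values with two problems injected on purpose), and the rule that revenue cannot be negative is an assumed example of an "impossible value" check.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
revenue = rng.normal(loc=50.0, scale=10.0, size=200)
revenue[0] = 500.0   # injected extreme outlier
revenue[1] = -20.0   # injected impossible value (negative revenue)
df = pd.DataFrame({"customer_id": range(200), "monthly_revenue": revenue})

# Z-score: distance from the mean in units of standard deviation.
mean = df["monthly_revenue"].mean()
std = df["monthly_revenue"].std()
df["z_score"] = (df["monthly_revenue"] - mean) / std

# Statistical outliers: more than 3 standard deviations from the mean.
outliers = df[df["z_score"].abs() > 3]

# Domain rule (assumed): revenue should never be negative.
impossible = df[df["monthly_revenue"] < 0]

print(outliers)
print(impossible)
```

Note that the two checks catch different rows: the negative value is within 3 standard deviations here because the extreme outlier inflates the standard deviation. That is one reason to ask for per-column impossible-value checks as well as a statistical rule.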

6. Practice Exercises

Exercise 1

Download any public dataset from Kaggle or a government open data portal. Write a four-part EDA prompt chain (profile → distributions → relationships → anomalies) using the actual column names. Run each prompt in sequence and document what you learn at each stage.

Exercise 2

Use the following prompt with any numeric dataset: "For each numeric column, generate a histogram and print the skewness value. For columns with absolute skewness > 1, suggest whether a log transform or a square root transform would be more appropriate and why."
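
If you want a baseline to compare the AI's answer against, the skewness check can be sketched like this. The two columns are synthetic, and comparing skewness before and after each transform is only one possible way to choose between them; histograms are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Synthetic columns, for illustration only.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "symmetric": rng.normal(0.0, 1.0, 1000),
    "right_skewed": rng.lognormal(0.0, 1.0, 1000),
})

# Sample skewness for every numeric column.
skew = df.select_dtypes("number").skew()
print(skew)

# Columns whose absolute skewness exceeds the threshold in the prompt.
needs_transform = skew[skew.abs() > 1].index.tolist()

# Compare candidate transforms by the skewness that remains afterwards.
# np.log needs strictly positive values; np.log1p is a common
# alternative when a column contains zeros.
for name in needs_transform:
    after = {
        "log": np.log(df[name]).skew(),
        "sqrt": np.sqrt(df[name]).skew(),
    }
    best = min(after, key=lambda k: abs(after[k]))
    print(f"{name}: skew={skew[name]:.2f}, after={after}, suggested={best}")
```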

Exercise 3

Ask AI to write a reusable quick_eda(df, target_col) function that accepts any DataFrame and a target column name and outputs a standardised EDA report. This becomes a permanent tool in your data science toolkit.

7. Key Takeaways

EDA prompts should follow a pyramid structure: shape → distributions → relationships → anomalies.
Always include the business question in your EDA prompt — it filters analyses to the ones that actually matter.
Chain multiple focused prompts rather than asking for a "complete EDA" in one shot.
Paste outputs back into the conversation and ask follow-up questions — EDA with AI works best as a dialogue.
Request axis labels, tight layout, and comments in generated plotting code so the first draft is closer to presentation quality.
