Data Cleaning Prompts: Handle Missing Values, Outliers, Duplicates

Data cleaning is often said to consume up to 80% of a data scientist's project time. AI can compress that dramatically, but only if your prompts describe the problem precisely. This topic shows you how to prompt AI for missing-value imputation, outlier handling, and deduplication in a way that produces code you can actually run.

1. Introduction

Raw data is almost never analysis-ready. Dates come in four formats. Revenue columns have nulls that mean "zero" in one business unit and "unknown" in another. Customer records are duplicated because two systems merged. Fixing all of this manually is slow and error-prone. AI can generate a complete cleaning pipeline in seconds — but the quality of that pipeline depends entirely on how well you describe the mess you are starting with. This tutorial teaches you the prompting patterns that produce clean, defensive, well-commented cleaning code.

2. The Concept Explained

A data cleaning prompt has to do one thing a normal prompt does not: describe imperfection. You are not describing what the data should look like — you are describing what is wrong with it right now. The AI needs to know which columns have nulls, what fraction of values is missing, what the expected range of a numeric column is, and what "clean" looks like for your use case.

Think of it like calling a plumber. Saying "my house has a water problem" wastes everyone's time. Saying "the cold-water pipe under the kitchen sink has a slow drip at the elbow joint — here is a photo" gets you a quote in two minutes. Data cleaning prompts work exactly the same way: the more precisely you describe the defect, the more targeted the fix.

[Figure: the four stages of a data cleaning pipeline, each mapping to a distinct prompt pattern.]

Missing Values

Describe the column name, the percentage of nulls, and the business meaning. "The monthly_revenue column has ~12% nulls — these represent customers whose billing failed that month, not customers with zero revenue." This distinction changes whether you impute with zero, with the median, or flag them separately.
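To see why the business meaning matters, here is a minimal sketch of the three options applied to the same column. The toy values and the revenue_missing column name are assumptions made up for illustration; only monthly_revenue and customer_id come from the example above.

```python
import pandas as pd

# Toy data (made up for illustration): None marks a month where billing failed.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "monthly_revenue": [100.0, None, 120.0, 50.0, None],
})

# Option A: treat null as "zero revenue". Misleading here, because a failed
# billing run is not a true zero and it drags the average down.
as_zero = df["monthly_revenue"].fillna(0)

# Option B: treat null as "unknown" and impute with the global median.
as_median = df["monthly_revenue"].fillna(df["monthly_revenue"].median())

# Option C: leave the value missing but record the fact in a flag column
# (revenue_missing is a hypothetical name).
df["revenue_missing"] = df["monthly_revenue"].isna()

# Compare the average revenue each choice produces.
print(as_zero.mean(), as_median.mean(), df["monthly_revenue"].mean())
```

The same five rows give three different averages, which is why the prompt has to say which interpretation applies.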

Outliers

Specify which column, what the expected domain is, and whether outliers should be removed, capped, or flagged. "Flag rows where session_duration_seconds exceeds 7200 (2 hours) as is_outlier=True rather than dropping them."
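A prompt like that one should produce something close to the following sketch. The sample durations and the constant name MAX_SESSION_SECONDS are illustrative assumptions; the column names and the 7200-second threshold come from the example prompt.

```python
import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({"session_duration_seconds": [35, 610, 7200, 7201, 86400]})

MAX_SESSION_SECONDS = 7200  # 2 hours, the threshold named in the prompt

# Flag rather than drop, so every row survives and the decision stays reversible.
df["is_outlier"] = df["session_duration_seconds"] > MAX_SESSION_SECONDS

print(df)
```

Note that "exceeds 7200" is a strict comparison, so a session of exactly 7200 seconds is not flagged. If you want the boundary included, say so in the prompt.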

Duplicates

Define what "duplicate" means in your context. Is it an exact row match, or a match on a subset of key columns with different timestamps? "Consider a row a duplicate if customer_id and event_type are identical within the same calendar day — keep only the earliest event_timestamp."
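As a rough sketch of what that duplicate rule looks like in Pandas, assuming event_timestamp is already a datetime column (the sample rows and the helper column event_date are made up for illustration):

```python
import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "event_type": ["login", "login", "login", "login"],
    "event_timestamp": pd.to_datetime([
        "2024-03-01 09:30",  # same customer, event and day as the next row
        "2024-03-01 08:15",  # earliest of the pair, so this one is kept
        "2024-03-02 10:00",  # different calendar day, kept
        "2024-03-01 08:15",  # different customer, kept
    ]),
})

deduped = (
    df.assign(event_date=df["event_timestamp"].dt.normalize())  # calendar day
      .sort_values("event_timestamp", kind="stable")            # earliest first
      .drop_duplicates(subset=["customer_id", "event_type", "event_date"],
                       keep="first")
      .drop(columns="event_date")
      .sort_index()                                             # restore row order
)

print(deduped)
```

Sorting before drop_duplicates is what makes keep="first" mean "keep the earliest event"; without the sort it would mean "keep whichever row happens to come first in the file".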

3. The Problem Without This Technique

Weak prompt

Clean my dataset and handle missing values.

No schema. No description of what is missing or why. The AI will most likely fall back on generic code such as df.dropna(), which silently deletes every row that contains a null, including rows that are valuable. You can end up shipping a "cleaned" dataset that has quietly lost, say, 15% of its records.
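The row loss is easy to reproduce. In this small sketch (toy values, made up for illustration), a null churn_date means the customer is still active, so a blanket df.dropna() removes every active customer:

```python
import pandas as pd

# Toy data (made up for illustration). A null churn_date means "still active".
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "monthly_revenue": [100.0, None, 80.0, 60.0],
    "churn_date": [None, None, "2024-02-01", None],
})

rows_before = len(df)
cleaned = df.dropna()  # drops any row with a null in ANY column
rows_after = len(cleaned)

print(f"{rows_before} rows before, {rows_after} rows after")
```

Only the one churned customer survives, and nothing in the output warns you that this happened.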

Stronger prompt

Act as a senior data engineer writing defensive Pandas code.

Dataset schema:
  customer_id (int, no nulls)
  signup_date (string YYYY-MM-DD, 0% nulls)
  plan_type (str: 'basic'|'pro'|'enterprise', 2% nulls)
  monthly_revenue (float, 12% nulls — billing failures,
                   NOT true zeros)
  churn_date (nullable string, 68% null — means active)

Tasks:
1. Fill plan_type nulls with 'unknown'.
2. For monthly_revenue nulls, impute with the
   per-customer median across all non-null months;
   add a boolean flag column `revenue_imputed`.
3. Parse signup_date and churn_date to datetime.
4. Drop exact duplicate rows, keep first occurrence.
5. After cleaning, print a summary: row count before
   and after, null counts per column.

Return runnable Pandas code with inline comments.
Do not use for loops — vectorise where possible.

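The AI should return a commented Pandas script with a df.groupby('customer_id')['monthly_revenue'].transform('median')-based imputation, a revenue_imputed flag column, datetime parsing, and a diagnostic print block. That is a solid starting point, but review it before it goes to production. For instance, a customer with no non-null months still has a null after a per-customer median, so check the printed null counts.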
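For reference, a response to the stronger prompt might look roughly like the sketch below. It is one plausible output, not the only correct one. The sample rows are made up, and the billing_month column is an assumption: the schema in the prompt lists no period column, so the sketch adds a hypothetical one to make repeated rows per customer distinguishable.

```python
import pandas as pd

# Toy data following the schema in the prompt; values are made up.
# Assumption: one row per customer per billing month (billing_month is hypothetical).
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "billing_month": ["2023-03", "2023-04", "2023-05", "2023-03", "2023-04", "2023-04"],
    "signup_date": ["2023-01-05"] * 3 + ["2023-02-10"] * 3,
    "plan_type": ["pro", "pro", None, "basic", "basic", "basic"],
    "monthly_revenue": [40.0, None, 60.0, 10.0, 12.0, 12.0],
    "churn_date": [None] * 3 + ["2023-06-01"] * 3,
})
rows_before = len(df)

# 1. Fill plan_type nulls with an explicit 'unknown' category.
df["plan_type"] = df["plan_type"].fillna("unknown")

# 2. Record which rows are null BEFORE imputing, then fill with the
#    per-customer median of the non-null months.
df["revenue_imputed"] = df["monthly_revenue"].isna()
customer_median = df.groupby("customer_id")["monthly_revenue"].transform("median")
df["monthly_revenue"] = df["monthly_revenue"].fillna(customer_median)

# 3. Parse date strings; a null churn_date becomes NaT and still means "active".
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d")
df["churn_date"] = pd.to_datetime(df["churn_date"], format="%Y-%m-%d")

# 4. Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first").reset_index(drop=True)

# 5. Diagnostic summary.
print(f"Rows before: {rows_before}, rows after: {len(df)}")
print(df.isna().sum())
```

The flag column is created before the fill on purpose: once the nulls are replaced there is no way to tell which values were imputed.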

4. The Solution

The key upgrade is specificity in three areas: describe the defect (not just "nulls" but "12% nulls representing billing failures"), prescribe the fix (impute with per-customer median, not global median), and request defensive output (add a flag column, print a before/after summary). This turns a throw-away snippet into a reusable cleaning module.

5. Step-by-Step Breakdown

List every column with its type and null rate. Even 0% null columns are worth listing — it tells the AI which columns are safe to use as join keys.
Explain the business meaning of nulls. "Null = unknown" is different from "null = zero" — spell it out.
Specify the imputation or removal strategy per column. Never let the AI choose — you know the data; it does not.
Define duplicate semantics. Which columns define uniqueness? What to keep when duplicates exist?
Ask for a diagnostic summary. A before/after row count and null-count check makes the cleaning verifiable.
Request a flag column for imputed values. This preserves an audit trail and lets downstream models learn from the imputation pattern.
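The first step in the list above can be partly automated. The helper below is a hypothetical convenience function (describe_schema is not part of Pandas) that turns a DataFrame into a schema block you can paste into a prompt; the sample data is made up.

```python
import pandas as pd


def describe_schema(df: pd.DataFrame) -> str:
    """Return one line per column: name, dtype, and percentage of nulls."""
    null_pct = df.isna().mean() * 100  # fraction of nulls per column, as a percent
    lines = [
        f"  {col} ({df[col].dtype}, {null_pct[col]:.0f}% nulls)"
        for col in df.columns
    ]
    return "Dataset schema:\n" + "\n".join(lines)


# Toy data (made up for illustration).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "plan_type": ["basic", None, "pro", "pro"],
})

schema_text = describe_schema(df)
print(schema_text)
```

This only covers types and null rates. The business meaning of each null, the imputation strategy, and the duplicate rule still have to be written by you.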

6. Practice Exercises

Exercise 1

Take any CSV you have. Run df.info() and df.isnull().sum(). Now write a cleaning prompt that includes the column names, types, and null counts. Compare the AI output to what a generic "clean my data" prompt produces.

Exercise 2

Prompt AI: "I have a column called order_value (float). The 5th percentile is £2.50 and the 95th percentile is £480. Some values exceed £10,000, which are likely data entry errors. Write Pandas code to cap outliers at the 99th percentile value and add an order_value_capped boolean flag." Evaluate how well the AI handles this specific outlier definition.
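For comparison with whatever the AI returns, here is a minimal sketch of one reasonable answer. The toy data is made up; the column names come from the exercise.

```python
import pandas as pd

# Toy data (made up for illustration): 99 typical orders plus one implausible value.
df = pd.DataFrame({"order_value": [float(v) for v in range(1, 100)] + [12000.0]})

# 99th percentile of the observed values (linear interpolation by default).
cap = df["order_value"].quantile(0.99)

# Flag first, then cap, so the flag reflects the original values.
df["order_value_capped"] = df["order_value"] > cap
df["order_value"] = df["order_value"].clip(upper=cap)

print(f"Cap: {cap:.2f}, rows capped: {df['order_value_capped'].sum()}")
```

One thing to check in the AI's answer: the percentile here is computed from data that still contains the suspect values, so extreme entries pull the cap upward. If that matters for your data, ask for a fixed business threshold instead.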

Exercise 3

Ask AI to write a reusable Python function called clean_customer_data(df) that accepts a DataFrame and returns a cleaned version with a cleaning report dictionary. Specify the schema in the prompt and ask for type hints and a docstring.

7. Key Takeaways

Data cleaning prompts require you to describe the imperfection — not just the schema, but what is wrong and why.
Always specify the business meaning of null values before letting AI choose an imputation strategy.
Ask for flag columns on imputed values to preserve an audit trail for downstream models.
Define duplicate semantics explicitly: which columns define uniqueness, and what to keep.
Request a before/after diagnostic summary in every cleaning prompt — it makes the output verifiable.
