Data cleaning is often estimated to take up to 80% of a data scientist's project time. AI can compress that dramatically, but only if your prompts describe the problem precisely. This topic shows you how to prompt AI for missing value imputation, outlier handling, and deduplication in a way that produces code you can actually run.
Raw data is almost never analysis-ready. Dates come in four formats. Revenue columns have nulls that mean "zero" in one business unit and "unknown" in another. Customer records are duplicated because two systems merged. Fixing all of this manually is slow and error-prone. AI can generate a complete cleaning pipeline in seconds — but the quality of that pipeline depends entirely on how well you describe the mess you are starting with. This tutorial teaches you the prompting patterns that produce clean, defensive, well-commented cleaning code.
A data cleaning prompt has to do one thing a normal prompt does not: describe imperfection. You are not describing what the data should look like — you are describing what is wrong with it right now. The AI needs to know which columns have nulls, what fraction of values is missing, what the expected range of a numeric column is, and what "clean" looks like for your use case.
Think of it like calling a plumber. Saying "my house has a water problem" wastes everyone's time. Saying "the cold-water pipe under the kitchen sink has a slow drip at the elbow joint — here is a photo" gets you a quote in two minutes. Data cleaning prompts work exactly the same way: the more precisely you describe the defect, the more targeted the fix.
For missing values, describe the column name, the percentage of nulls, and the business meaning. "The monthly_revenue column has ~12% nulls; these represent customers whose billing failed that month, not customers with zero revenue." This distinction changes whether you impute with zero, with the median, or flag them separately.
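A minimal sketch of why the business meaning matters, assuming a toy table with the monthly_revenue column from the example above. The data values and the revenue_missing flag name are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): NaN means "billing failed", not "zero revenue".
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "monthly_revenue": [100.0, np.nan, 300.0, np.nan],
})

# Share of nulls: the figure you would quote in the prompt.
null_pct = df["monthly_revenue"].isna().mean() * 100

# Record which rows were missing before changing anything.
df["revenue_missing"] = df["monthly_revenue"].isna()

# Two candidate fills that lead to very different summary statistics.
filled_zero = df["monthly_revenue"].fillna(0.0)
filled_median = df["monthly_revenue"].fillna(df["monthly_revenue"].median())

print(f"{null_pct:.0f}% nulls")
print("mean if nulls are treated as zeros:  ", filled_zero.mean())
print("mean if nulls are treated as unknown:", filled_median.mean())
```

On this toy table the two fills produce means that differ by a factor of two, which is why the prompt should state what a null actually represents.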
For outliers, specify which column is affected, what its expected domain is, and whether outliers should be removed, capped, or flagged. "Flag rows where session_duration_seconds exceeds 7200 (2 hours) as is_outlier=True rather than dropping them."
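The kind of code that outlier instruction is asking for can be sketched as follows. The sample durations are made up for illustration; the column name, flag name, and 7200-second threshold come from the example prompt.

```python
import pandas as pd

MAX_SESSION_SECONDS = 7200  # 2 hours, the threshold from the example prompt

# Hypothetical sample values.
df = pd.DataFrame({"session_duration_seconds": [35, 1800, 7200, 7201, 86400]})

# Flag instead of dropping, so the row count stays the same and the
# decision can be revisited later. "Exceeds" is read as strictly greater than.
df["is_outlier"] = df["session_duration_seconds"] > MAX_SESSION_SECONDS

print(df)
```

Note that a value of exactly 7200 is not flagged here. Whether the boundary is inclusive is the sort of detail worth stating in the prompt.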
For duplicates, define what "duplicate" means in your context. Is it an exact row match, or a match on a subset of key columns with different timestamps? "Consider a row a duplicate if customer_id and event_type are identical within the same calendar day, and keep only the earliest event_timestamp."
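One way that duplicate rule could translate into Pandas is sketched below. The sample events and the temporary event_day helper column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical events: customer 1 logs in twice on 1 March and once on 2 March.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "event_type": ["login", "login", "login", "login"],
    "event_timestamp": pd.to_datetime([
        "2024-03-01 18:30:00",
        "2024-03-01 09:15:00",
        "2024-03-02 08:00:00",
        "2024-03-01 10:00:00",
    ]),
})

# Calendar day of each event (time set to midnight), used in the duplicate key.
df["event_day"] = df["event_timestamp"].dt.normalize()

deduped = (
    df.sort_values("event_timestamp")  # earliest first, so keep="first" keeps the earliest
      .drop_duplicates(subset=["customer_id", "event_type", "event_day"], keep="first")
      .drop(columns="event_day")       # remove the helper column
      .sort_index()                    # restore the original row order
)

print(deduped)
```

Sorting before drop_duplicates is what makes "keep the earliest" hold. Without it, keep="first" keeps whichever row happens to appear first in the file.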
Weak prompt
Clean my dataset and handle missing values.
No schema. No description of what is missing or why. The AI will most likely fall back on generic code such as df.dropna(), which silently deletes every row that contains a null, including rows that might be valuable. You can end up shipping a "cleaned" dataset that has quietly lost, say, 15% of its records.
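To see how much a blanket df.dropna() can remove, here is a minimal sketch on a made-up five-row table. The column names and values are illustrative assumptions, and a null churn_date is taken to mean the customer is still active.

```python
import numpy as np
import pandas as pd

# Hypothetical table: nulls in churn_date are meaningful, not defects.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "monthly_revenue": [50.0, np.nan, 75.0, 20.0, 90.0],
    "churn_date": [None, None, "2024-02-10", None, "2024-05-01"],
})

rows_before = len(df)
rows_after = len(df.dropna())  # drops any row with a null in any column

print(f"rows before: {rows_before}, rows after dropna(): {rows_after}")
```

In this toy case every active customer disappears, because their churn_date is null by design.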
Stronger prompt
Act as a senior data engineer writing defensive Pandas code.
Dataset schema:
customer_id (int, no nulls)
signup_date (string YYYY-MM-DD, 0% nulls)
plan_type (str: 'basic'|'pro'|'enterprise', 2% nulls)
monthly_revenue (float, 12% nulls: billing failures, NOT true zeros)
churn_date (nullable string, 68% null — means active)
Tasks:
1. Fill plan_type nulls with 'unknown'.
2. For monthly_revenue nulls, impute with the per-customer median across all non-null months; add a boolean flag column `revenue_imputed`.
3. Parse signup_date and churn_date to datetime.
4. Drop exact duplicate rows, keep first occurrence.
5. After cleaning, print a summary: row count before and after, null counts per column.
Return runnable Pandas code with inline comments.
Do not use for loops — vectorise where possible.
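A prompt like this should return a well-commented Pandas script with imputation based on df.groupby('customer_id')['monthly_revenue'].transform('median'), a revenue_imputed flag column, datetime parsing, and a diagnostic print block. Treat it as a strong first draft rather than finished production code: the per-customer median only makes sense if the table holds several rows per customer, and any customer with no non-null revenue at all will still be null after imputation.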
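For reference, a script answering the stronger prompt might look roughly like the sketch below. It is one plausible shape, not the output of any particular model. The sample rows are invented, the function name clean is hypothetical, and the table is assumed to hold several monthly rows per customer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: several monthly rows per customer, last two rows identical.
raw = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "signup_date": ["2023-01-05"] * 3 + ["2023-02-10"] * 3,
    "plan_type": ["pro", "pro", None, "basic", "basic", "basic"],
    "monthly_revenue": [40.0, np.nan, 60.0, 10.0, 12.0, 12.0],
    "churn_date": [None, None, None, "2023-09-01", "2023-09-01", "2023-09-01"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    rows_before = len(out)

    # 1. Missing plan types become an explicit 'unknown' category.
    out["plan_type"] = out["plan_type"].fillna("unknown")

    # 2. Flag nulls first, then fill with each customer's own median revenue.
    out["revenue_imputed"] = out["monthly_revenue"].isna()
    per_customer_median = out.groupby("customer_id")["monthly_revenue"].transform("median")
    out["monthly_revenue"] = out["monthly_revenue"].fillna(per_customer_median)

    # 3. Parse date strings; a null churn_date becomes NaT (customer still active).
    out["signup_date"] = pd.to_datetime(out["signup_date"], format="%Y-%m-%d")
    out["churn_date"] = pd.to_datetime(out["churn_date"], format="%Y-%m-%d")

    # 4. Drop exact duplicate rows, keeping the first occurrence.
    out = out.drop_duplicates(keep="first").reset_index(drop=True)

    # 5. Diagnostic summary.
    print(f"rows before: {rows_before}, rows after: {len(out)}")
    print(out.isna().sum())
    return out


cleaned = clean(raw)
```

The steps follow the order given in the prompt, so the median is computed before duplicates are removed. If duplicates should not influence the imputation, say so in the prompt and ask for deduplication first.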
The key upgrade is specificity in three areas: describe the defect (not just "nulls" but "12% nulls representing billing failures"), prescribe the fix (impute with per-customer median, not global median), and request defensive output (add a flag column, print a before/after summary). This turns a throw-away snippet into a reusable cleaning module.
Take any CSV you have. Run df.info() and df.isnull().sum(). Now write a cleaning prompt that includes the column names, types, and null counts. Compare the AI output to what a generic "clean my data" prompt produces.
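If typing the schema by hand is tedious, a small helper can draft it for you. This is a sketch: the function name describe_for_prompt and the sample table are assumptions, and you would still add the business meaning of each column yourself.

```python
import numpy as np
import pandas as pd


def describe_for_prompt(df: pd.DataFrame) -> str:
    """Return one 'name (dtype, N% nulls)' line per column, ready to paste into a prompt."""
    null_pct = df.isna().mean() * 100
    return "\n".join(
        f"{col} ({df[col].dtype}, {null_pct[col]:.0f}% nulls)" for col in df.columns
    )


# Hypothetical sample table.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "monthly_revenue": [10.0, np.nan, 30.0, 40.0],
})

summary = describe_for_prompt(df)
print(summary)
```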
Prompt AI: "I have a column called order_value (float). The 5th percentile is £2.50 and the 95th percentile is £480. Some values exceed £10,000, which are likely data entry errors. Write Pandas code to cap outliers at the 99th percentile value and add an order_value_capped boolean flag." Evaluate how well the AI handles this specific outlier definition.
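When you evaluate the AI's answer to the outlier prompt, it helps to have a baseline to compare against. The sketch below is one straightforward reading of that prompt, run on invented order values.

```python
import pandas as pd

# Hypothetical order values, including one likely data entry error.
df = pd.DataFrame({"order_value": [2.5, 20.0, 45.0, 120.0, 480.0, 12000.0]})

# 99th percentile of the observed values (pandas interpolates linearly by default).
cap = df["order_value"].quantile(0.99)

# Flag first, while the original values are still available, then cap.
df["order_value_capped"] = df["order_value"] > cap
df["order_value"] = df["order_value"].clip(upper=cap)

print(f"cap = {cap:.2f}")
print(df)
```

On a sample this small the 99th percentile is pulled up by the erroneous value itself, so the cap lands far above £480. Checking whether the AI's code shares this weakness, or suggests a fixed business threshold instead, is a useful part of the evaluation.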
Ask AI to write a reusable Python function called clean_customer_data(df) that accepts a DataFrame and returns a cleaned version with a cleaning report dictionary. Specify the schema in the prompt and ask for type hints and a docstring.
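As a yardstick for that exercise, a bare-bones version of such a function is sketched below. It covers only two cleaning steps, and the report keys are hypothetical names chosen for illustration; a real answer should handle the full schema you specify.

```python
from typing import Any

import pandas as pd


def clean_customer_data(df: pd.DataFrame) -> tuple[pd.DataFrame, dict[str, Any]]:
    """Return a cleaned copy of df plus a report describing what changed.

    Steps in this sketch: drop exact duplicate rows, then fill missing
    plan_type values with 'unknown'. The input DataFrame is not modified.
    """
    report: dict[str, Any] = {
        "rows_before": len(df),
        "nulls_before": df.isna().sum().to_dict(),
    }

    cleaned = df.drop_duplicates(keep="first").copy()
    cleaned["plan_type"] = cleaned["plan_type"].fillna("unknown")

    report["rows_after"] = len(cleaned)
    report["nulls_after"] = cleaned.isna().sum().to_dict()
    return cleaned, report


# Hypothetical sample: the first and third rows are exact duplicates.
sample = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "plan_type": ["pro", None, "pro"],
})
cleaned, report = clean_customer_data(sample)
print(report)
```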