Introduction: How Data Scientists Use Prompt Engineering

Prompt engineering is not just for writers and marketers. For data scientists, it is a productivity multiplier — turning hours of repetitive coding, documentation, and analysis into minutes of well-crafted instructions. This topic shows you where AI fits into the data science lifecycle and how to start using it effectively from day one.

1. Introduction

Data science projects move through a predictable cycle: define the problem, gather data, clean it, explore it, model it, evaluate results, and communicate findings. Each of those stages involves work that AI can accelerate dramatically. The data scientists who are moving fastest today are not necessarily the strongest coders — they are the ones who have learned to treat AI as a tireless analytical collaborator and can give it precise, well-structured instructions. This tutorial maps out that collaboration and shows you what good data science prompting actually looks like.

2. The Concept Explained

Prompt engineering for data science is the practice of crafting instructions that turn an AI assistant into a specialised analytical partner. Unlike general prompting, data science prompting almost always involves three extra ingredients: dataset description (column names, types, shape), expected output format (working code, a table, a plain-English explanation), and technical constraints (library version, performance requirements, downstream use of the result).
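The dataset description is the easiest of the three ingredients to generate automatically. The sketch below is one possible way to do it with Pandas: `describe_dataset` is a hypothetical helper (not part of any library), and the sample DataFrame is invented purely for illustration.

```python
import pandas as pd


def describe_dataset(df: pd.DataFrame, max_examples: int = 3) -> str:
    """Return a plain-text schema summary that can be pasted into a prompt."""
    lines = [f"Shape: {df.shape[0]} rows x {df.shape[1]} columns", "Columns:"]
    for name in df.columns:
        col = df[name]
        n_missing = int(col.isna().sum())
        # A few distinct non-null values help the AI infer formats and categories.
        examples = col.dropna().drop_duplicates().head(max_examples).tolist()
        lines.append(f"  {name} ({col.dtype}), {n_missing} missing, e.g. {examples}")
    return "\n".join(lines)


# Hypothetical toy data, used only to show the shape of the output.
sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "plan_type": ["basic", "pro", "basic"],
        "monthly_revenue": [9.99, 49.0, None],
    }
)
schema_text = describe_dataset(sample)
print(schema_text)
```

Pasting the printed summary into a prompt covers column names, types, and shape in one step. Check the example values before sharing them if the data is sensitive.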

Think of it like briefing a very capable consultant. If you walk into a room and say "help me with my data", they will ask twenty questions before they can start. But if you hand them a one-page brief — the dataset schema, the business question, the tool stack, and the desired deliverable — they can start producing value in minutes. A well-formed AI prompt is that one-page brief.

Figure: The data science workflow loop (question, data, analysis, insight, decision). AI can accelerate every stage, from sharpening the question to drafting the decision memo.

AI prompts map onto this cycle naturally. At the question stage, AI helps clarify hypotheses. At the data stage, it generates cleaning and transformation code. During analysis, it writes exploratory scripts and suggests statistical tests. At the insight stage, it drafts plain-English explanations. And at the decision stage, it helps format findings for different audiences.

3. The Problem Without This Technique

Without structured prompting, data scientists either avoid AI altogether (treating it as a toy for non-technical people) or use it naively, getting generic code that doesn't fit their actual schema and requires extensive manual fixing.

Weak prompt

Write Python code to analyse my sales data.

No dataset description. No column names. No stated goal. The AI will invent column names like date, amount, product that may not match reality, and the code will need heavy rewriting before it runs.

Stronger prompt

Act as a senior data analyst using Python and Pandas.

I have a CSV with these columns:
  customer_id (int), signup_date (YYYY-MM-DD string),
  plan_type (str: 'basic'|'pro'|'enterprise'),
  monthly_revenue (float), churn_date (nullable YYYY-MM-DD).

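Task: Write a Pandas script that calculates total
monthly_revenue by plan_type for each of the last 12
calendar months, counting a customer as active from
signup_date until churn_date (or to date if churn_date
is empty), then outputs a summary table sorted by month
descending.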

Use pd.to_datetime for date parsing. Add inline comments
explaining each transformation step. Return only the code.

Now the AI has the exact schema, the business question, the library preference, and the output format. The generated code is much more likely to run on the real dataset as written, or with only minor fixes.
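For orientation, here is a sketch of the kind of script the stronger prompt could produce. It is not actual AI output. It assumes that a customer contributes `monthly_revenue` to every calendar month from their signup month through their churn month (or indefinitely if `churn_date` is empty). The sample rows, the `as_of` parameter, and the function name `monthly_revenue_by_plan` are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real CSV described in the prompt.
CSV_TEXT = """customer_id,signup_date,plan_type,monthly_revenue,churn_date
1,2023-01-15,basic,10.0,
2,2023-03-01,pro,50.0,2023-05-20
3,2023-04-10,basic,10.0,
"""


def monthly_revenue_by_plan(df: pd.DataFrame, as_of: str, months: int = 12) -> pd.DataFrame:
    df = df.copy()
    # Parse the date strings; empty churn_date values become NaT.
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["churn_date"] = pd.to_datetime(df["churn_date"])

    # Reporting window: the first day of each of the last `months` calendar months.
    last_month = pd.Timestamp(as_of).to_period("M").to_timestamp()
    calendar = pd.DataFrame(
        {"month": pd.date_range(end=last_month, periods=months, freq="MS")}
    )
    calendar["next_month"] = calendar["month"] + pd.offsets.MonthBegin(1)

    # Pair every customer with every month, then keep the months they were active
    # (assumption: active from signup month through churn month, inclusive).
    grid = df.merge(calendar, how="cross")
    signed_up = grid["signup_date"] < grid["next_month"]
    not_churned = grid["churn_date"].isna() | (grid["churn_date"] >= grid["month"])
    active = grid[signed_up & not_churned]

    # Sum revenue per month and plan, newest month first.
    return (
        active.groupby(["month", "plan_type"], as_index=False)["monthly_revenue"]
        .sum()
        .sort_values(["month", "plan_type"], ascending=[False, True])
        .reset_index(drop=True)
    )


customers = pd.read_csv(io.StringIO(CSV_TEXT))
summary = monthly_revenue_by_plan(customers, as_of="2023-06-30")
print(summary)
```

The cross join creates one row per customer per month, which is fine for a 12-month window but worth revisiting for very large customer tables. If your definition of "active" differs, state it in the prompt so the AI does not have to guess.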

4. The Solution

The pattern for data science prompting is: Role → Dataset description → Task → Output format → Constraints. Each piece you add removes a guess the AI would otherwise have to make, so the output moves closer to what you need. The most important piece is the dataset description: with real column names and types in the prompt, the AI no longer invents a schema, which is what made the weak prompt's code unusable.
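If you write prompts like this often, the five-part pattern can be captured in a small template. The following is a minimal sketch under that assumption. `PromptSpec` and its field names are hypothetical, and the example values are taken from the stronger prompt in the previous section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptSpec:
    """Holds the five parts: role, dataset, task, output format, constraints."""

    role: str
    dataset: str
    task: str
    output_format: str
    constraints: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Emit the sections in the order Role, Dataset, Task, Output format, Constraints.
        sections = [
            f"Act as {self.role}.",
            f"Dataset:\n{self.dataset}",
            f"Task: {self.task}",
            f"Output format: {self.output_format}",
        ]
        if self.constraints:
            bullets = "\n".join(f"- {c}" for c in self.constraints)
            sections.append(f"Constraints:\n{bullets}")
        return "\n\n".join(sections)


spec = PromptSpec(
    role="a senior data analyst using Python and Pandas",
    dataset=(
        "CSV with columns: customer_id (int), signup_date (YYYY-MM-DD string), "
        "plan_type (str: 'basic'|'pro'|'enterprise'), monthly_revenue (float), "
        "churn_date (nullable YYYY-MM-DD)."
    ),
    task="Calculate monthly revenue by plan_type for the last 12 months.",
    output_format="Runnable Pandas code with inline comments. Return only the code.",
    constraints=["Use pd.to_datetime for date parsing."],
)
prompt = spec.render()
print(prompt)
```

Because every field except `constraints` is required, the template fails loudly when a part of the pattern is missing, which is the main reason to use a structure here rather than a free-form string.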

5. Step-by-Step Breakdown

State the role. "Act as a senior data scientist / SQL analyst / ML engineer." This calibrates vocabulary and code style.
Describe the dataset. List column names, data types, a rough row count, and any quirks (missing values, mixed formats).
State the task precisely. One clear verb: "calculate", "detect", "visualise", "build", "summarise". Avoid "help me with".
Specify the output format. "Return runnable Pandas code with inline comments", "Return a markdown table", "Return a plain-English paragraph for a non-technical stakeholder".
Add constraints. Library versions, performance requirements ("must handle 5M rows"), style guides, or things to avoid ("do not use for loops — vectorise").
Iterate. Paste error messages back in with context. "Running the code above gave this error: [traceback]. Fix it and explain what was wrong."
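The final step, iterating on errors, can be partly automated. The sketch below shows one way to capture a traceback and wrap it in a follow-up prompt. The helper names `run_and_capture` and `build_fix_prompt` are hypothetical, the "generated" snippet is invented to trigger a typical wrong-column-name error, and `exec` should only ever be used on code you have already reviewed.

```python
import traceback
from typing import Optional


def run_and_capture(code: str) -> Optional[str]:
    """Run reviewed code; return the formatted traceback on failure, else None."""
    try:
        exec(code, {})  # Only execute code you have read and trust.
    except Exception:
        return traceback.format_exc()
    return None


def build_fix_prompt(code: str, error_text: str) -> str:
    """Wrap the failing code and its traceback in a follow-up prompt."""
    return (
        "Running the code below gave this error.\n\n"
        f"Code:\n{code}\n\n"
        f"Error:\n{error_text}\n"
        "Fix it and explain what was wrong."
    )


# Invented example: the code assumes a column named 'amount' that does not exist.
generated_code = 'row = {"monthly_revenue": 10.0}\nprint(row["amount"])'

error_text = run_and_capture(generated_code)
if error_text is not None:
    follow_up = build_fix_prompt(generated_code, error_text)
    print(follow_up)
```

Including both the code and the full traceback gives the AI the same context you would give a colleague, which tends to produce a targeted fix rather than a full rewrite.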

6. Practice Exercises

Exercise 1

Pick a dataset you work with regularly. Write a prompt that includes its column names, types, and a specific analytical question. Compare the output to what you would get from a generic "analyse my data" prompt.

Exercise 2

Ask AI: "Given a dataset with columns customer_id, event_type, event_timestamp, and session_id — what are the five most important questions I should explore in an initial EDA? For each question, suggest the Pandas or SQL approach." Use this output as a project checklist.

Exercise 3

Take a piece of code you recently wrote yourself. Paste it into the AI with the prompt: "Review this Pandas code for correctness, performance, and readability. Suggest specific improvements with explanations." Notice how specific the critique becomes when you give it real code.

7. Key Takeaways

AI can accelerate every stage of the data science lifecycle, from defining the problem and cleaning data through analysis and modelling to communicating findings.
The most important addition to any data science prompt is a precise dataset description: column names, types, and shape.
Use the pattern: Role → Dataset → Task → Output format → Constraints.
Generic prompts generate generic code. The more specific your prompt, the closer the output is to production-ready.
Treat AI output as a first draft — always review generated code before running it on real data.
