Writing Python Pandas & NumPy Code Using AI

Pandas and NumPy power the majority of day-to-day data work in Python. AI can write almost any transformation you can describe — provided you give it a faithful picture of your DataFrame. This topic teaches the prompt patterns that produce vectorised, idiomatic, ready-to-run code instead of slow, generic boilerplate.

1. Introduction

Most Pandas problems look the same on the surface — load a CSV, group by something, compute an aggregate, write it back out — but the difference between code that takes ten minutes to write and code that takes ten seconds is almost always the quality of the prompt. AI excels at Pandas because the library is so widely documented, but it can only help you efficiently if you describe your DataFrame the way Pandas describes itself: with explicit column names, dtypes, index structure, and row count expectations. This tutorial gives you the templates and the habits to make every Pandas and NumPy prompt land first try.

2. The Concept Explained

Pandas code lives or dies on three small details: column names, dtypes, and shape. A perfectly correct snippet for a DataFrame of 1,000 rows with string dates can be slow, buggy, or completely wrong on a DataFrame of 50 million rows with timezone-aware timestamps. NumPy adds a fourth detail — array shape and dtype. When you prompt for Pandas or NumPy code without those details, the AI has to guess, and guesses lead to KeyError, SettingWithCopyWarning, or silently incorrect results.

A useful analogy is ordering a custom-made cabinet from a carpenter. If you say "make me a cabinet", you will get something generic. If you say "120cm wide, 180cm tall, oak frame, two glass doors, three internal shelves at 50/100/150cm" — you get exactly what you wanted, the first time. Pandas prompts work the same way: the dimensions and materials of your DataFrame are the spec.

The minimum brief for any Pandas prompt

Before asking the AI to write code, copy this short briefing pattern. It is almost always worth the thirty seconds it takes to fill in.

Reusable DataFrame brief

DataFrame name: orders_df
Approximate row count: 2.3 million
Index: default RangeIndex
Columns (name : dtype : notes):
  order_id        : int64  : unique, primary key
  customer_id     : int64  : 350k unique values
  order_timestamp : datetime64[ns, UTC]
  product_sku     : string : ~8k unique values
  quantity        : int64  : range 1..50
  unit_price_gbp  : float64 : range 0.99..2,499.00
  discount_pct    : float64 : 0..0.45, ~30% nulls
  status          : category : 'placed'|'shipped'|'returned'

Memory budget: must run inside 8 GB RAM.
Style: vectorised Pandas (no apply, no for loops unless
unavoidable). Add inline comments.

Paste that block at the top of any Pandas prompt and the AI will stop inventing schemas. NumPy briefs are even shorter — name, shape tuple, dtype, and a one-line description of what the values represent.
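If filling in the brief by hand feels tedious, a small helper can draft it from the DataFrame itself. The sketch below is one possible approach, not a standard Pandas feature: the function name dataframe_brief and the sample data are hypothetical, and the notes it generates (unique counts, share of nulls) are only a starting point that you would extend with business meaning by hand.

```python
import pandas as pd


def dataframe_brief(df: pd.DataFrame, name: str) -> str:
    """Build a plain-text brief (name, row count, index, columns) for a prompt."""
    lines = [
        f"DataFrame name: {name}",
        f"Approximate row count: {len(df):,}",
        f"Index: {type(df.index).__name__}",
        "Columns (name : dtype : notes):",
    ]
    for column in df.columns:
        series = df[column]
        null_share = series.isna().mean()  # fraction of missing values
        notes = f"{series.nunique()} unique values"
        if null_share > 0:
            notes += f", ~{null_share:.0%} nulls"
        lines.append(f"  {column} : {series.dtype} : {notes}")
    return "\n".join(lines)


# Tiny stand-in for a real frame, only to show the output format.
sample_df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "quantity": [2, 1, 5, 1],
        "discount_pct": [0.10, None, 0.25, None],
    }
)
print(dataframe_brief(sample_df, "sample_df"))
```

Memory budget and style lines are deliberately left out of the helper, since they depend on the task rather than on the data.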

3. The Problem Without This Technique

Generic prompts produce generic code. When the AI does not know your column names or types, it falls back on its training distribution — which means snake_case column names that don't exist, naive .apply() calls that crawl on large frames, and parsing logic that breaks the moment your timestamps include timezones.

Weak prompt

Write Pandas code to calculate revenue per customer per month.

No DataFrame name. No column names. No types. The AI will fabricate columns like date, amount, customer. The code may run on a toy example but will fail on your real schema, and you will spend more time fixing column names than you saved by asking.

Stronger prompt

Act as a senior Pandas engineer optimising for large frames.

DataFrame: orders_df (~2.3M rows)
  order_id (int64), customer_id (int64),
  order_timestamp (datetime64[ns, UTC]),
  quantity (int64), unit_price_gbp (float64),
  discount_pct (float64, ~30% null),
  status (category: placed|shipped|returned).

Task: produce a new DataFrame `revenue_monthly` with
columns [customer_id, year_month, net_revenue_gbp] where
net revenue per row = quantity * unit_price_gbp *
(1 - discount_pct.fillna(0)) and only status == 'shipped'
contributes. Group to month start in UTC.

Constraints:
- vectorised Pandas only
- use pd.Grouper(freq='MS') for monthly bucketing
- return the code only, with inline comments

The AI now has the schema, the business rule, and the performance constraint. The returned snippet will typically filter with orders_df.loc[orders_df['status'].eq('shipped')], use a single groupby with pd.Grouper, and apply fillna inside the revenue calculation, which is usually close to production-ready on the first try.
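For reference, here is a minimal sketch of the kind of code the stronger prompt tends to produce. It is an illustration under assumptions, not a guaranteed AI output: the four-row orders_df is made-up sample data that mirrors the schema in the prompt, and the intermediate name shipped is hypothetical.

```python
import pandas as pd

# Small stand-in for orders_df with the columns named in the prompt.
orders_df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [10, 10, 11, 10],
        "order_timestamp": pd.to_datetime(
            ["2024-01-05 09:00", "2024-01-20 14:30", "2024-01-07 11:15", "2024-02-02 08:45"],
            utc=True,
        ),
        "quantity": [2, 1, 3, 1],
        "unit_price_gbp": [10.0, 20.0, 5.0, 100.0],
        "discount_pct": [0.5, None, None, 0.0],
        "status": pd.Categorical(
            ["shipped", "shipped", "returned", "shipped"],
            categories=["placed", "shipped", "returned"],
        ),
    }
)

# Keep only shipped orders; .copy() keeps the new column off the original frame.
shipped = orders_df.loc[orders_df["status"].eq("shipped")].copy()

# Net revenue per row, treating a missing discount as 0.
shipped["net_revenue_gbp"] = (
    shipped["quantity"]
    * shipped["unit_price_gbp"]
    * (1 - shipped["discount_pct"].fillna(0))
)

# One groupby: customer plus month-start buckets on the UTC timestamp.
revenue_monthly = (
    shipped.groupby(["customer_id", pd.Grouper(key="order_timestamp", freq="MS")])[
        "net_revenue_gbp"
    ]
    .sum()
    .reset_index()
    .rename(columns={"order_timestamp": "year_month"})
)
print(revenue_monthly)
```

On this sample, customer 10 ends up with 30.0 for January 2024 and 100.0 for February 2024, while the returned order from customer 11 contributes nothing.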

4. The Solution

The pattern is: DataFrame brief → task verb → business rule → performance constraint → output shape. The DataFrame brief alone removes most of the noise. Add a single line about performance ("must vectorise", "avoid copies", "keep memory under 4 GB") and the AI is far more likely to choose idiomatic tools such as assign, eval, query, merge_asof, or NumPy's np.where instead of slow Python loops.
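To make the performance constraint concrete, the sketch below contrasts a row-by-row apply with the vectorised form a "must vectorise" line tends to elicit. The order_tier rule (10 or more units counts as "bulk") and the sample values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

orders_df = pd.DataFrame(
    {"quantity": [1, 12, 50, 3], "unit_price_gbp": [9.99, 2.50, 1.00, 250.00]}
)

# Row-by-row version: calls a Python function once per row.
slow_tier = orders_df.apply(
    lambda row: "bulk" if row["quantity"] >= 10 else "standard", axis=1
)

# Vectorised version: one comparison over the whole column, then np.where.
orders_df = orders_df.assign(
    order_tier=np.where(orders_df["quantity"] >= 10, "bulk", "standard"),
    line_total_gbp=lambda d: d["quantity"] * d["unit_price_gbp"],
)

# Both approaches give the same labels; only the speed differs on large frames.
assert slow_tier.tolist() == orders_df["order_tier"].tolist()
```

The two versions are interchangeable on four rows; the gap only becomes noticeable as the frame grows, which is why the constraint belongs in the prompt.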

For NumPy, the equivalent shortcut is to state the array shape and dtype, and to ask explicitly for vectorisation. A prompt like "given prices of shape (N, T) float32 representing N tickers over T days, compute rolling 20-day log returns, vectorised, no Python loops" usually yields a clean solution built from np.log and slicing.
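A minimal sketch of what such a solution can look like follows. It assumes "rolling 20-day log return" means log(price at day t) minus log(price 20 days earlier); the function name rolling_log_returns and the synthetic prices are hypothetical.

```python
import numpy as np


def rolling_log_returns(prices: np.ndarray, window: int = 20) -> np.ndarray:
    """Log return over `window` days per ticker; assumes 1 <= window < T.

    Input shape (N, T), output shape (N, T - window).
    """
    log_prices = np.log(prices)
    # Column t + window minus column t, for every ticker at once.
    return log_prices[:, window:] - log_prices[:, :-window]


# Synthetic prices: 3 tickers over 60 days (made-up data for illustration).
rng = np.random.default_rng(seed=0)
daily_moves = rng.normal(loc=0.0, scale=0.01, size=(3, 60))
prices = (100.0 * np.exp(np.cumsum(daily_moves, axis=1))).astype(np.float32)

returns_20d = rolling_log_returns(prices)
print(returns_20d.shape)  # (3, 40)
```

Because the prompt fixed the shape as (N, T), the slicing along axis 1 is unambiguous; without that detail the AI would have to guess which axis is time.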

5. Step-by-Step Breakdown

Step 1. Capture the schema from the source. Run df.info() and paste the relevant lines. This single step prevents most hallucinated column names.
Step 2. State the DataFrame name. Tell the AI the variable name you will use (orders_df, not df) so generated code drops straight into your notebook.
Step 3. Express the business rule in plain words first. Write "Net revenue = quantity × price × (1 − discount)", then ask for code. This catches logic errors before they become bug reports.
Step 4. Constrain the implementation style. "Vectorise. No apply. No for loops." The AI defaults to readable Python; you must explicitly request performance.
Step 5. Demand the output shape. "Return a new DataFrame called revenue_monthly with columns X, Y, Z and one row per customer-month." Naming the output prevents the AI from inventing it.
Step 6. Loop with errors. When something fails, paste the traceback back into the chat with "Here is the error. Fix it and explain the root cause." This is the fastest way to converge on correct code.
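Once you have demanded an output shape, it is cheap to verify that the generated code honoured it. The sketch below is one possible check, with a hypothetical helper name (check_output) and a hand-made revenue_monthly frame containing illustrative values; if an assertion fails, its message is exactly the kind of error text worth pasting back into the chat.

```python
import pandas as pd


def check_output(
    df: pd.DataFrame, expected_columns: list[str], key_columns: list[str]
) -> None:
    """Raise AssertionError if the result does not match the requested shape."""
    # Column names and order must match what the prompt asked for.
    assert list(df.columns) == expected_columns, f"unexpected columns: {list(df.columns)}"
    # One row per key combination, e.g. one row per customer-month.
    assert not df.duplicated(subset=key_columns).any(), "duplicate key rows found"


# Hand-made result in the shape requested from the AI (illustrative values).
revenue_monthly = pd.DataFrame(
    {
        "customer_id": [10, 10, 11],
        "year_month": pd.to_datetime(
            ["2024-01-01", "2024-02-01", "2024-01-01"], utc=True
        ),
        "net_revenue_gbp": [30.0, 100.0, 15.0],
    }
)

check_output(
    revenue_monthly,
    expected_columns=["customer_id", "year_month", "net_revenue_gbp"],
    key_columns=["customer_id", "year_month"],
)
```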

Tip: Save your DataFrame briefs in a Markdown file next to each project. When you start a new prompt, paste the matching brief first — it turns your codebase into a personal prompt library for AI assistants.

6. Practice Exercises

Exercise 1

Take a DataFrame you work with regularly. Run df.info(), copy the output, and paste it into your AI tool with the prompt: "From this DataFrame, write a function that returns the top 10 customers by total spend, broken out by quarter." Compare the result to a version where you only paste the column names without dtypes.

Exercise 2

Ask the AI to convert one of your existing .apply()-heavy Pandas snippets into a fully vectorised version. Use the prompt:

Rewrite this code with no apply or for loops. Use vectorised Pandas or NumPy operations. Explain each change.

Benchmark both with %timeit.

Exercise 3

Prompt the AI to generate a NumPy function that takes an (N, T) array of returns and produces an (N, N) covariance matrix, vectorised, with no Python loops. Then ask: "Now extend this to weighted covariance with weights of shape (T,)." Notice how a tight follow-up question keeps the momentum going.

7. Key Takeaways

The DataFrame brief — name, shape, columns, dtypes — is the single biggest lever for Pandas prompt quality.
Always state the implementation style: vectorise, no apply, no for loops. The AI defaults to readable code rather than fast code.
State business rules in plain English before asking for code; logic bugs surface earlier this way.
Name the output DataFrame and its expected columns so the AI doesn't invent its own variables.
For NumPy, supply array shape and dtype — those two facts unlock idiomatic vectorisation.
