Pandas and NumPy power the majority of day-to-day data work in Python. AI can write almost any transformation you can describe — provided you give it a faithful picture of your DataFrame. This topic teaches the prompt patterns that produce vectorised, idiomatic, ready-to-run code instead of slow, generic boilerplate.
Most Pandas problems look the same on the surface — load a CSV, group by something, compute an aggregate, write it back out — but the difference between code that takes ten minutes to write and code that takes ten seconds is almost always the quality of the prompt. AI excels at Pandas because the library is so widely documented, but it can only help you efficiently if you describe your DataFrame the way Pandas describes itself: with explicit column names, dtypes, index structure, and row count expectations. This tutorial gives you the templates and the habits to make every Pandas and NumPy prompt land first try.
Pandas code lives or dies on three small details: column names, dtypes, and shape. A snippet that is perfectly correct for a DataFrame of 1,000 rows with string dates can be slow, buggy, or completely wrong on a DataFrame of 50 million rows with timezone-aware timestamps. NumPy code depends on the same kind of detail: array shape and dtype. When you prompt for Pandas or NumPy code without those details, the AI has to guess, and guesses lead to KeyError, SettingWithCopyWarning, or silently incorrect results.
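To make the dtype point concrete, here is a minimal sketch that stores the same three dates as plain strings and as timezone-aware timestamps. The dates and variable names are made up for illustration; the only point is that the same .max() call answers a different question depending on dtype.

```python
import pandas as pd

# Hypothetical toy data: the same three dates in day/month/year form.
raw = ["02/01/2024", "15/03/2023", "10/12/2023"]

as_text = pd.Series(raw)  # plain strings
as_time = pd.to_datetime(as_text, format="%d/%m/%Y", utc=True)  # timezone-aware timestamps

# The "latest" date depends on how the column is stored.
latest_text = as_text.max()  # strings compare character by character
latest_time = as_time.max()  # timestamps compare chronologically

print(latest_text)
print(latest_time)
```

The string column reports 15/03/2023 as the latest date because "15" sorts after "02" and "10", while the timestamp column correctly reports 2 January 2024. No error is raised in either case, which is why stating the dtype in the prompt matters.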
A useful analogy is ordering a custom-made cabinet from a carpenter. If you say "make me a cabinet", you will get something generic. If you say "120cm wide, 180cm tall, oak frame, two glass doors, three internal shelves at 50/100/150cm" — you get exactly what you wanted, the first time. Pandas prompts work the same way: the dimensions and materials of your DataFrame are the spec.
Before asking the AI to write code, copy this short briefing pattern. It is almost always worth the thirty seconds it takes to fill in.
Reusable DataFrame brief
DataFrame name: orders_df
Approximate row count: 2.3 million
Index: default RangeIndex
Columns (name : dtype : notes):
  order_id : int64 : unique, primary key
  customer_id : int64 : 350k unique values
  order_timestamp : datetime64[ns, UTC]
  product_sku : string : ~8k unique values
  quantity : int64 : range 1..50
  unit_price_gbp : float64 : range 0.99..2,499.00
  discount_pct : float64 : 0..0.45, ~30% nulls
  status : category : 'placed'|'shipped'|'returned'
Memory budget: must run inside 8 GB RAM.
Style: vectorised Pandas (no apply, no for loops unless unavoidable). Add inline comments.
Paste that block at the top of any Pandas prompt and the AI will stop inventing schemas. NumPy briefs are even shorter — name, shape tuple, dtype, and a one-line description of what the values represent.
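If filling in the brief by hand feels tedious, a small helper can draft most of it from the DataFrame itself. The sketch below is one possible way to do that; dataframe_brief is a hypothetical helper name, the toy orders_df stands in for your real data, and the notes column is limited to what can be computed automatically (unique counts and null share).

```python
import pandas as pd


def dataframe_brief(df: pd.DataFrame, name: str) -> str:
    """Draft a plain-text brief for a prompt (hypothetical helper, not a Pandas API)."""
    lines = [
        f"DataFrame name: {name}",
        f"Row count: {len(df):,}",
        f"Index: {type(df.index).__name__}",
        "Columns (name : dtype : notes):",
    ]
    for col in df.columns:
        series = df[col]
        null_pct = series.isna().mean() * 100  # share of missing values, in percent
        notes = f"{series.nunique()} unique, {null_pct:.0f}% nulls"
        lines.append(f"  {col} : {series.dtype} : {notes}")
    return "\n".join(lines)


# Tiny stand-in for orders_df so the sketch runs on its own.
orders_df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "quantity": [2, 1, 5, 3],
        "discount_pct": [0.10, None, 0.25, None],
    }
)

brief = dataframe_brief(orders_df, "orders_df")
print(brief)
```

You would still add the parts only you know, such as the memory budget, value ranges, and what each column means.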
Generic prompts produce generic code. When the AI does not know your column names or types, it falls back on its training distribution — which means snake_case column names that don't exist, naive .apply() calls that crawl on large frames, and parsing logic that breaks the moment your timestamps include timezones.
Weak prompt
Write Pandas code to calculate revenue per customer per month.
No DataFrame name. No column names. No types. The AI will fabricate columns like date, amount, customer. The code may run on a toy example but will fail on your real schema, and you will spend more time fixing column names than you saved by asking.
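A short sketch of that failure mode: the groupby below uses the kind of column names an AI might guess (customer, amount), run against a toy frame that uses the real schema. The guessed names are hypothetical; the KeyError is what Pandas raises when a grouping column does not exist.

```python
import pandas as pd

# Toy frame using the real column names from the brief.
orders_df = pd.DataFrame(
    {"customer_id": [1, 1, 2], "quantity": [2, 1, 4], "unit_price_gbp": [9.99, 5.00, 1.50]}
)

# Code written against guessed column names fails immediately.
try:
    orders_df.groupby("customer")["amount"].sum()
    error_message = ""
except KeyError as exc:
    error_message = str(exc)

print("KeyError:", error_message)
```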
Stronger prompt
Act as a senior Pandas engineer optimising for large frames.
DataFrame: orders_df (~2.3M rows)
  order_id (int64), customer_id (int64),
  order_timestamp (datetime64[ns, UTC]),
  quantity (int64), unit_price_gbp (float64),
  discount_pct (float64, ~30% null),
  status (category: placed|shipped|returned).
Task: produce a new DataFrame `revenue_monthly` with
  columns [customer_id, year_month, net_revenue_gbp] where
  net revenue per row = quantity * unit_price_gbp *
  (1 - discount_pct.fillna(0)) and only status == 'shipped'
  contributes. Group to month start in UTC.
Constraints:
- vectorised Pandas only
- use pd.Grouper(freq='MS') for monthly bucketing
- return the code only, with inline comments
The AI now has the schema, the business rule, and the performance constraint. The returned snippet will typically filter with orders_df.loc[orders_df['status'].eq('shipped')], apply fillna inside the revenue calculation, and use a single groupby with pd.Grouper, which is usually close to production-ready on the first try.
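For reference, the kind of snippet this prompt tends to produce might look like the sketch below. It is one plausible answer, not the only correct one; the toy orders_df and its values are made up so the code runs on its own.

```python
import pandas as pd

# Toy stand-in for orders_df with the columns named in the prompt (values are made up).
orders_df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [10, 10, 11, 10],
        "order_timestamp": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-01-09", "2024-02-02"], utc=True
        ),
        "quantity": [2, 1, 3, 1],
        "unit_price_gbp": [10.0, 20.0, 5.0, 8.0],
        "discount_pct": [0.5, None, None, 0.25],
        "status": pd.Categorical(["shipped", "shipped", "returned", "shipped"]),
    }
)

# Keep only shipped orders; .copy() avoids SettingWithCopyWarning on the assignment below.
shipped = orders_df.loc[orders_df["status"].eq("shipped")].copy()

# Net revenue per row, treating a missing discount as zero.
shipped["net_revenue_gbp"] = (
    shipped["quantity"]
    * shipped["unit_price_gbp"]
    * (1 - shipped["discount_pct"].fillna(0))
)

# One groupby: customer plus month-start buckets on the UTC timestamp.
revenue_monthly = (
    shipped.groupby(["customer_id", pd.Grouper(key="order_timestamp", freq="MS")])[
        "net_revenue_gbp"
    ]
    .sum()
    .reset_index()
    .rename(columns={"order_timestamp": "year_month"})
)
print(revenue_monthly)
```

On the toy data, customer 10 ends up with 30.0 for January 2024 and 6.0 for February 2024, and customer 11 drops out because its only order was returned.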
The pattern is: DataFrame brief → task verb → business rule → performance constraint → output shape. The DataFrame brief alone removes most of the noise. Add a single line about performance ("must vectorise", "avoid copies", "keep memory under 4 GB") and the AI is far more likely to choose idiomatic patterns like assign, eval, query, merge_asof, or NumPy's np.where instead of slow Python loops.
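The difference a performance line makes can be sketched with a small example. The discount rule here (10% off orders of 10 or more units) is hypothetical and exists only to give both versions something to compute; the contrast is between a row-by-row apply and an assign with np.where.

```python
import numpy as np
import pandas as pd

orders_df = pd.DataFrame({"quantity": [1, 12, 30], "unit_price_gbp": [5.0, 2.0, 1.0]})


def line_total(row: pd.Series) -> float:
    """Row-by-row rule (hypothetical): 10% off when quantity is 10 or more."""
    multiplier = 0.9 if row["quantity"] >= 10 else 1.0
    return row["quantity"] * row["unit_price_gbp"] * multiplier


# Row-wise version: calls a Python function once per row.
slow = orders_df.apply(line_total, axis=1)

# Vectorised version: np.where picks the multiplier for the whole column at once.
fast = orders_df.assign(
    line_total_gbp=lambda d: d["quantity"]
    * d["unit_price_gbp"]
    * np.where(d["quantity"] >= 10, 0.9, 1.0)
)["line_total_gbp"]

print(np.allclose(slow, fast))
```

Both versions give the same numbers; on a frame with millions of rows the vectorised one is usually much faster because the arithmetic runs inside NumPy rather than in a Python loop.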
For NumPy, the equivalent shortcut is to state the array shape and dtype, and to ask explicitly for vectorisation. A prompt like "given prices of shape (N, T) float32 representing N tickers over T days, compute rolling 20-day log returns vectorised — no Python loops" reliably yields a clean np.log + slicing solution.
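A sketch of what such an answer might look like follows. It assumes "rolling 20-day log return" means log(P_t / P_(t-20)) for each ticker, which is one common reading of the phrase; rolling_log_returns is a hypothetical function name and the prices are synthetic.

```python
import numpy as np


def rolling_log_returns(prices: np.ndarray, window: int = 20) -> np.ndarray:
    """Log return over `window` days per ticker (window >= 1); output shape (N, T - window)."""
    log_p = np.log(prices)
    # Slicing lines up each day with the day `window` steps earlier, no Python loop needed.
    return log_p[:, window:] - log_p[:, :-window]


rng = np.random.default_rng(0)
# Synthetic positive prices: 3 tickers over 60 days, float32 as in the prompt.
steps = rng.normal(0.0, 0.01, size=(3, 60))
prices = (100 * np.exp(np.cumsum(steps, axis=1))).astype(np.float32)

returns_20d = rolling_log_returns(prices)
print(returns_20d.shape)  # (3, 40)
```

Because the shape and dtype were stated in the prompt, the result keeps float32 and the output shape can be checked against (N, T - 20) straight away.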
Run df.info() and paste the relevant lines. This single step prevents most hallucinated column names.
Use your real DataFrame name (orders_df, not df) so generated code drops straight into your notebook.
Describe the output you expect, for example "revenue_monthly with columns X, Y, Z and one row per customer-month." Naming the output prevents the AI from inventing it.
Tip: Save your DataFrame briefs in a Markdown file next to each project. When you start a new prompt, paste the matching brief first. Over time your codebase becomes a personal prompt library for AI assistants.
Take a DataFrame you work with regularly. Run df.info(), copy the output, and paste it into your AI tool with the prompt: "From this DataFrame, write a function that returns the top 10 customers by total spend, broken out by quarter." Compare the result to a version where you only paste the column names without dtypes.
Ask the AI to convert one of your existing .apply()-heavy Pandas snippets into a fully vectorised version. Use the prompt:
Rewrite this code with no apply or for loops. Use vectorised Pandas or NumPy operations. Explain each change.
Benchmark both with %timeit.
Prompt the AI to generate a NumPy function that takes an (N, T) array of returns and produces an (N, N) covariance matrix, vectorised with no Python loops. Then ask: "Now extend this to weighted covariance with weights of shape (T,)." Notice how a tight follow-up question keeps the momentum going.
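To check whatever the AI returns for this exercise, a reference sketch like the one below can help. The function names are hypothetical, the data is random, and the weighted version assumes weights are normalised to sum to 1 with no bias correction, which is only one of several conventions. Both results can be compared against np.cov.

```python
import numpy as np


def covariance(returns: np.ndarray) -> np.ndarray:
    """Sample covariance of N series over T observations: input (N, T), output (N, N)."""
    centred = returns - returns.mean(axis=1, keepdims=True)
    return centred @ centred.T / (returns.shape[1] - 1)


def weighted_covariance(returns: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted covariance with weights of shape (T,), normalised to sum to 1."""
    w = weights / weights.sum()
    mean = returns @ w                 # weighted mean per series, shape (N,)
    centred = returns - mean[:, None]
    return (centred * w) @ centred.T   # broadcasting applies w along the T axis


rng = np.random.default_rng(1)
r = rng.normal(size=(4, 250))          # synthetic returns: 4 series, 250 observations
w = np.linspace(0.5, 1.5, 250)         # synthetic weights, heavier on recent observations

cov = covariance(r)
wcov = weighted_covariance(r, w)
print(cov.shape, wcov.shape)  # (4, 4) (4, 4)
```

Comparing against np.cov(r) and np.cov(r, aweights=w, ddof=0) is a quick way to confirm that an AI-generated version uses the convention you intended.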