When data outgrows your laptop, every line of code matters. AI can recommend the right engine, refactor Pandas to chunked or distributed equivalents, and design out-of-core ML workflows — if your prompt names the data volume, the memory budget, and the latency target. This topic shows you how to make that brief precise.
The boundary between "comfortable Pandas" and "needs a real engine" lives roughly where a single DataFrame stops fitting in memory — somewhere between 5 and 50 million rows depending on column count and dtypes. Past that boundary, naive code crashes, swaps to disk, or runs for hours. The tools to cross the boundary (chunked Pandas, Polars, Dask, Spark, DuckDB, cloud warehouses) all have different prompt patterns. The good news: AI knows all of them, and a precise brief gives you idiomatic code in any of them. This tutorial covers the prompts that scale.
"Large" is relative. What matters in a prompt is the absolute size and the constraints around it. A useful brief includes row count, column count and dtypes, file format (CSV, Parquet, JSON Lines), memory budget, latency requirement, and the engine you want to use. With those six details, AI can pick chunked Pandas, switch you to Polars, propose a Dask DataFrame, or write a PySpark job.
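Before writing the brief, it helps to turn row count and dtypes into a rough memory figure. The sketch below is a back-of-the-envelope estimate, not a measurement: the helper names, the 50-byte average string width, and the 50% headroom factor are all assumptions chosen for illustration.

```python
# Rough in-memory size estimate from row count and dtypes.
# Fixed-width dtype sizes are exact; the string width is an assumed average.
DTYPE_BYTES = {"int64": 8, "float64": 8, "int32": 4, "float32": 4, "bool": 1}


def estimate_gb(rows: int, dtypes: list[str], avg_string_bytes: int = 50) -> float:
    """Approximate in-memory size in GB (10**9 bytes)."""
    bytes_per_row = sum(DTYPE_BYTES.get(d, avg_string_bytes) for d in dtypes)
    return rows * bytes_per_row / 1e9


def fits_in_memory(rows: int, dtypes: list[str], budget_gb: float,
                   headroom: float = 0.5) -> bool:
    """True if the estimate stays under a fraction of the budget.

    The headroom factor (assumed 0.5) leaves room for intermediate copies.
    """
    return estimate_gb(rows, dtypes) <= budget_gb * headroom


# Illustrative shape: 25 columns, 50 million rows, 16 GB budget.
dtypes = ["int64"] * 10 + ["float64"] * 10 + ["string"] * 5
size = estimate_gb(50_000_000, dtypes)
print(f"{size:.1f} GB")                                      # 20.5 GB
print(fits_in_memory(50_000_000, dtypes, budget_gb=16))      # False
```

A figure like this, even if crude, gives the AI something concrete to plan chunk sizes and engine choice around.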
An analogy: choosing a vehicle for cargo. A laptop is a hatchback — fine for groceries, useless for furniture. Polars is a van. Dask is a small fleet. Spark is a freight train. DuckDB is a clever forklift that handles awkward shapes. Each one excels at certain loads, and the wrong choice burns time and money.
Chunked Pandas: 10–50M rows on one machine when you can stream sequentially. Use pd.read_csv(chunksize=...) for CSV; for Parquet, use pd.read_parquet with column selection and filters, or iterate row groups with pyarrow, since Pandas does not chunk Parquet on its own.
Polars: any size from 5M to 500M rows on one machine; multi-threaded, lazy execution, often 5–10× faster than Pandas, depending on the workload. A close substitute for many workflows, though its API differs from Pandas.
Dask: when you want Pandas semantics across a cluster, or out-of-core single-machine work with familiar APIs. Excellent for ad-hoc analysis on TB-scale Parquet.
Spark / PySpark: production ETL across many TB; best when your team already runs Spark and the workload has predictable shape.
DuckDB: when SQL is more natural than dataframes; brilliant for analytical queries on Parquet at single-machine scale.
Warehouse SQL: when the data already lives in BigQuery, Snowflake, Databricks SQL, or Redshift, push the work there rather than pulling it down.
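As a concrete instance of the first option in the list above, here is a minimal sketch of a chunked Pandas aggregation. The function name, the user_id column, and the tiny in-memory CSV are assumptions for demonstration; on a real file you would pass a path and a much larger chunksize.

```python
import io

import pandas as pd


def chunked_event_counts(csv_source, chunksize=100_000):
    """Count rows per user_id while holding only one chunk in memory."""
    totals = None
    # usecols keeps each chunk small by reading only the column we need.
    for chunk in pd.read_csv(csv_source, usecols=["user_id"], chunksize=chunksize):
        counts = chunk["user_id"].value_counts()
        # Merge this chunk's counts into the running totals.
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    if totals is None:  # no data rows at all
        return pd.Series(dtype="int64")
    return totals.astype("int64").sort_index()


# Tiny stand-in for a large file (hypothetical data).
demo_csv = io.StringIO(
    "user_id,event_type\n1,click\n2,view\n1,view\n3,click\n1,click\n"
)
result = chunked_event_counts(demo_csv, chunksize=2)
print(result.to_dict())
```

The same shape (read a chunk, reduce it, merge into a running result) applies to any aggregation that can be combined across chunks, such as sums, counts, minima, and maxima.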
Weak prompt
Process my 50GB CSV file in Python.
No memory budget, no engine choice, no transformation goal. The AI will return generic pd.read_csv code that loads everything into memory and crashes within minutes on the real file. You spend the afternoon debugging instead of analysing.
Stronger prompt
Act as a senior data engineer experienced with Polars and DuckDB.
Data:
- events.parquet, 4.2 TB across 1,800 daily partitions on S3 (s3://prod-events/dt=YYYY-MM-DD/).
- ~120 columns, mix of int64, float64, string, struct.
- Row count per day: ~50M; total ~90B rows.
Compute environment:
- Single c6i.8xlarge EC2 instance (32 vCPU, 64 GB RAM).
- Local NVMe scratch (1.5 TB).
Task: produce a daily aggregate table that, for each (user_id, event_date), computes:
- session_count
- total_event_count
- unique_event_types
- first_event_ts, last_event_ts
Constraints:
- Stream-process one day at a time; never load more than 60 GB into memory at once.
- Use Polars lazy frames OR DuckDB SQL; pick the one that's clearer for this aggregation and justify the choice.
- Output: append-only Parquet at s3://prod-events-agg/user_daily/dt=YYYY-MM-DD/.
- Add a progress log and an end-of-run summary with rows processed and elapsed time.
You receive a Polars or DuckDB script with explicit per-day streaming, sensible memory limits, S3 path handling, and a logging block. Because the budget is stated up front, the script is far more likely to run on the EC2 instance without OOM errors.
The pattern is: data shape → environment → memory budget → engine choice (or "you pick") → output target → progress reporting. The memory budget is the most under-specified part of beginner prompts. State it explicitly — "must fit in 8 GB", "must fit in 60 GB", "must stream to disk above 10 GB" — and AI will choose chunk sizes, dtypes, and engines accordingly.
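If you write these briefs often, the pattern can be captured in a small template so no part gets forgotten. The sketch below is one hypothetical way to do it; the DataBrief class, its field names, and the example values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DataBrief:
    """Hypothetical container for the six parts of the brief, in pattern order."""
    data_shape: str
    environment: str
    memory_budget: str
    engine: str
    output_target: str
    progress_reporting: str

    def render(self, task: str) -> str:
        """Render the brief as plain text, one labelled line per part."""
        sections = [
            ("Data", self.data_shape),
            ("Compute environment", self.environment),
            ("Memory budget", self.memory_budget),
            ("Engine", self.engine),
            ("Output", self.output_target),
            ("Progress reporting", self.progress_reporting),
        ]
        body = "\n".join(f"{label}: {value}" for label, value in sections)
        return f"Task: {task}\n{body}"


# Illustrative values only.
brief = DataBrief(
    data_shape="200 GB CSV, ~40 columns, mostly int64 and string",
    environment="single machine, 16 GB RAM",
    memory_budget="must fit in 8 GB; stream to disk above that",
    engine="chunked Pandas (or you pick, and justify)",
    output_target="Parquet partitioned by date",
    progress_reporting="progress log plus end-of-run summary",
)
prompt = brief.render("daily ETL from CSV to Parquet")
print(prompt)
```

Because every field is required, leaving out the memory budget raises an error at construction time instead of producing a vague prompt.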
For ML workloads on large data, ask AI explicitly for out-of-core training strategies: incremental learning via SGDClassifier.partial_fit, gradient boosting on shards with LightGBM or XGBoost's external memory mode, or feature hashing for high-cardinality categoricals.
Tip: When refactoring slow Pandas to Polars, paste the original code and ask:
Rewrite this in Polars (lazy API where possible). Explain each change and the expected performance improvement.
You learn the new engine while migrating.
Take a slow Pandas script of your own. Paste it into AI with the prompt:
Rewrite this in Polars lazy mode. Then rewrite it as a DuckDB SQL query against the same Parquet file. Compare the two — which is clearer for this workload?
Prompt AI to design a chunked Pandas ETL job for a 200 GB CSV that you cannot rewrite to Parquet. Constraints: 16 GB RAM machine, daily run, idempotent output to Parquet partitioned by date. Demand a progress log and an end-of-run summary.
Ask AI to build an out-of-core training loop for a logistic regression model on 500M rows of features. Use SGDClassifier.partial_fit over chunked Parquet. Specify class weights for imbalance and request a calibration plot at the end.