Prompt Patterns for Working with Large Datasets

When data outgrows your laptop, every line of code matters. AI can recommend the right engine, refactor Pandas to chunked or distributed equivalents, and design out-of-core ML workflows — if your prompt names the data volume, the memory budget, and the latency target. This topic shows you how to make that brief precise.

1. Introduction

The boundary between "comfortable Pandas" and "needs a real engine" lives roughly where a single DataFrame stops fitting in memory — somewhere between 5 and 50 million rows depending on column count and dtypes. Past that boundary, naive code crashes, swaps to disk, or runs for hours. The tools to cross the boundary (chunked Pandas, Polars, Dask, Spark, DuckDB, cloud warehouses) all have different prompt patterns. The good news: AI knows all of them, and a precise brief gives you idiomatic code in any of them. This tutorial covers the prompts that scale.

2. The Concept Explained

"Large" is relative. What matters in a prompt is the absolute size and the constraints around it. A useful brief includes row count, column count and dtypes, file format (CSV, Parquet, JSON Lines), memory budget, latency requirement, and the engine you want to use, if you have a preference. With those six facts, AI can pick chunked Pandas, switch you to Polars, propose a Dask DataFrame, or write a PySpark job.
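To see why row count and dtypes belong in the brief, a back-of-envelope size estimate helps. The sketch below is only a rough guide: the function names, the 50-byte average string width, and the 2x headroom for intermediate copies are all assumptions made for illustration, not measurements.

```python
# Rough in-memory size estimate for a table, from row count and dtypes.
# Widths are for fixed-size numeric dtypes; string width is an assumed average.
BYTES_PER_VALUE = {"int64": 8, "float64": 8, "int32": 4, "float32": 4, "bool": 1}


def estimate_gb(rows: int, dtype_counts: dict, avg_string_bytes: int = 50) -> float:
    """Approximate size in GB (10**9 bytes) for `rows` rows.

    dtype_counts maps a dtype name to the number of columns of that dtype.
    Any dtype not in BYTES_PER_VALUE is treated as a string column.
    """
    bytes_per_row = 0
    for dtype, n_cols in dtype_counts.items():
        width = BYTES_PER_VALUE.get(dtype, avg_string_bytes)
        bytes_per_row += width * n_cols
    return rows * bytes_per_row / 1e9


def fits(rows: int, dtype_counts: dict, budget_gb: float, headroom: float = 2.0) -> bool:
    """True if the estimate, times an assumed headroom factor, fits the budget."""
    return estimate_gb(rows, dtype_counts) * headroom <= budget_gb


# 50M rows, 10 int64 columns and 10 float64 columns: 160 bytes per row.
print(estimate_gb(50_000_000, {"int64": 10, "float64": 10}))  # 8.0
```

If the estimate lands near or above the memory budget, say so in the prompt; that is the signal to ask for chunking or a different engine.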

An analogy: choosing a vehicle for cargo. A laptop is a hatchback — fine for groceries, useless for furniture. Polars is a van. Dask is a small fleet. Spark is a freight train. DuckDB is a clever forklift that handles awkward shapes. Each one excels at certain loads, and the wrong choice burns time and money.

When to choose what

Chunked Pandas: 10–50M rows on one machine when you can stream sequentially. Use pd.read_csv(chunksize=) or pd.read_parquet with row-group filters.

Polars: roughly 5M to 500M rows on one machine; multi-threaded, lazy execution, often 5–10× faster than Pandas. The API is not a drop-in replacement for Pandas, but many workflows translate almost line for line.

Dask: when you want Pandas semantics across a cluster, or out-of-core single machine work with familiar APIs. Excellent for ad-hoc analysis on TB-scale Parquet.

Spark / PySpark: production ETL across many TB; best when your team already runs Spark and the workload has predictable shape.

DuckDB: when SQL is more natural than dataframes; brilliant for analytical queries on Parquet at single-machine scale.

Warehouse SQL: when the data already lives in BigQuery, Snowflake, Databricks SQL, or Redshift, push the work there rather than pulling it down.
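As an illustration of the first option in the list above, chunked Pandas might look like the following. This is a minimal sketch, assuming a CSV with a user_id column; the function name and column name are hypothetical.

```python
import pandas as pd


def chunked_event_counts(csv_path: str, chunksize: int = 1_000_000) -> pd.Series:
    """Count rows per user_id without loading the whole CSV into memory."""
    totals = None
    # usecols keeps only the column we need; chunksize makes read_csv
    # return an iterator of DataFrames instead of one large DataFrame.
    reader = pd.read_csv(csv_path, usecols=["user_id"], chunksize=chunksize)
    for chunk in reader:
        counts = chunk["user_id"].value_counts()
        # Merge this chunk's counts into the running totals.
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    if totals is None:
        return pd.Series(dtype="int64")
    return totals.astype("int64").sort_index()
```

Only one chunk plus the running totals is held in memory at a time, so peak usage is governed by chunksize rather than by file size.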

3. The Problem Without This Technique

Weak prompt

Process my 50GB CSV file in Python.

No memory budget, no engine choice, no transformation goal. The AI will return generic pd.read_csv code that loads everything into memory and crashes within minutes on the real file. You spend the afternoon debugging instead of analysing.

Stronger prompt

Act as a senior data engineer experienced with Polars
and DuckDB.

Data:
- events.parquet, 4.2 TB across 1,800 daily partitions
  on S3 (s3://prod-events/dt=YYYY-MM-DD/).
- ~120 columns, mix of int64, float64, string, struct.
- Row count per day: ~50M; total ~90B rows.

Compute environment:
- Single c6i.8xlarge EC2 instance (32 vCPU, 64 GB RAM).
- Local NVMe scratch (1.5 TB).

Task: produce a daily aggregate table that, for each
(user_id, event_date), computes:
  - session_count
  - total_event_count
  - unique_event_types
  - first_event_ts, last_event_ts

Constraints:
- Stream-process one day at a time; never load
  more than 60 GB into memory at once.
- Use Polars lazy frames OR DuckDB SQL — pick the
  one that's clearer for this aggregation and
  justify the choice.
- Output: append-only Parquet at
  s3://prod-events-agg/user_daily/dt=YYYY-MM-DD/.
- Add a progress log and an end-of-run summary
  with rows processed and elapsed time.

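You receive a Polars or DuckDB script with explicit per-day streaming, an explicit memory limit, S3 path handling, and a logging block. With the budget stated up front, it is far less likely to hit OOM errors on the EC2 instance, though you should still test it on a single day before the full run.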

4. The Solution

The pattern is: data shape → environment → memory budget → engine choice (or "you pick") → output target → progress reporting. The memory budget is the most under-specified part of beginner prompts. State it explicitly — "must fit in 8 GB", "must fit in 60 GB", "must stream to disk above 10 GB" — and AI will choose chunk sizes, dtypes, and engines accordingly.
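If you write these briefs often, the pattern can be captured as a small template. The sketch below is one possible shape, not a standard tool: the DataBrief class, its field names, and the default wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataBrief:
    """Fields follow the pattern order: shape, environment, budget, engine, output, progress."""
    data_shape: str
    environment: str
    memory_budget: str
    engine: str = "you pick, and justify the choice"
    output_target: str = ""
    progress: str = "log one line per batch and an end-of-run summary"

    def render(self, task: str) -> str:
        """Return the brief as plain prompt text, skipping any empty field."""
        sections = [
            ("Data", self.data_shape),
            ("Compute environment", self.environment),
            ("Memory budget", self.memory_budget),
            ("Engine", self.engine),
            ("Output", self.output_target),
            ("Progress reporting", self.progress),
        ]
        lines = [f"Task: {task}", ""]
        for title, body in sections:
            if body:
                lines.append(f"{title}: {body}")
        return "\n".join(lines)


brief = DataBrief(
    data_shape="events.parquet, ~50M rows per day, ~120 columns",
    environment="single machine, 32 vCPU, 64 GB RAM",
    memory_budget="must fit in 8 GB",
)
print(brief.render("daily aggregate per (user_id, event_date)"))
```

Making memory_budget a required field is deliberate: the template cannot be filled in without stating the number that beginner prompts most often leave out.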

For ML workloads on large data, ask AI explicitly for out-of-core training strategies: incremental learning via SGDClassifier.partial_fit, gradient boosting on shards with LightGBM or XGBoost's external memory mode, or feature hashing for high-cardinality categoricals.

5. Step-by-Step Breakdown

State the data volume. Rows, columns, file format, total bytes, partitioning. Each fact narrows the engine choice.
State the compute environment. Machine type, RAM, scratch disk, cluster size. AI can match strategies to hardware once it knows the box.
Set a memory budget. A hard number prevents the AI from defaulting to "load everything into memory".
Pick or delegate the engine. Either name the engine or ask AI to recommend one and justify. Both work; pick based on whether you have a preference.
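Demand idempotent, append-only output. Each run should add a new date partition, and rerunning a date should replace that partition rather than duplicate it. For batch ETL, idempotency saves you when reruns happen, and they always happen.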
Request progress and timing. Long jobs that print nothing for two hours are nightmares. A one-line per-batch log is cheap insurance.
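The last two steps above can be sketched with the standard library alone. This example writes CSV rather than Parquet purely to stay dependency-free; the directory layout mirrors the dt=YYYY-MM-DD convention from the earlier prompt, and the function names and the part-0.csv file name are hypothetical.

```python
import csv
import logging
import os
import tempfile
import time
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("etl")


def write_partition(out_root: Path, day: str, rows: list) -> Path:
    """Write one day's rows to dt=<day>/part-0.csv, replacing any earlier run."""
    part_dir = Path(out_root) / f"dt={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    final_path = part_dir / "part-0.csv"
    # Write to a temp file first, then swap it in, so a crash mid-write
    # cannot leave a half-written part-0.csv behind.
    fd, tmp_name = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_name, final_path)
    return final_path


def run_job(out_root: Path, daily_batches) -> int:
    """Process (day, rows) pairs, logging one line per batch and a final summary."""
    start = time.perf_counter()
    total_rows = 0
    for day, rows in daily_batches:
        if not rows:
            log.info("dt=%s skipped (no rows)", day)
            continue
        write_partition(out_root, day, rows)
        total_rows += len(rows)
        log.info("dt=%s rows=%d elapsed=%.1fs", day, len(rows), time.perf_counter() - start)
    log.info("done rows=%d elapsed=%.1fs", total_rows, time.perf_counter() - start)
    return total_rows
```

Because each day maps to a fixed file path that is replaced on write, running the job twice for the same day leaves exactly one copy of that day's output.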

Tip: When refactoring slow Pandas to Polars, paste the original code and ask:
Rewrite this in Polars (lazy API where possible). Explain each change and the expected performance improvement.
You learn the new engine while migrating.

6. Practice Exercises

Exercise 1

Take a slow Pandas script of your own. Paste it into AI with the prompt:

Rewrite this in Polars lazy mode. Then rewrite it as a DuckDB SQL query against the same Parquet file. Compare the two — which is clearer for this workload?

Exercise 2

Prompt AI to design a chunked Pandas ETL job for a 200 GB CSV that you cannot rewrite to Parquet. Constraints: 16 GB RAM machine, daily run, idempotent output to Parquet partitioned by date. Demand a progress log and an end-of-run summary.

Exercise 3

Ask AI to build an out-of-core training loop for a logistic regression model on 500M rows of features. Use SGDClassifier.partial_fit over chunked Parquet. Specify class weights for imbalance (as explicit per-class values, since partial_fit does not support class_weight="balanced") and request a calibration plot at the end.

7. Key Takeaways

Brief the AI with rows, columns, format, environment, memory budget, and latency target — these six facts pick the engine.
State a hard memory budget; without it, AI defaults to "load everything", which fails at scale.
Polars, Dask, Spark, DuckDB, and warehouse SQL each have a sweet spot. Either name yours or let AI justify a pick.
For ML at scale, demand out-of-core strategies — partial_fit, external memory, feature hashing.
Always request idempotent output and per-batch progress logging; both save you when reruns happen.
