Machine Learning Model Prompts: From Selection to Evaluation

Machine learning projects fail more often from process gaps than from algorithm choice. AI can fill those gaps — recommending models, writing training scaffolds, suggesting metrics, and generating evaluation reports — if your prompts walk it through the lifecycle one stage at a time. This topic shows you how.

1. Introduction

Most ML tutorials jump straight to "import RandomForestClassifier" and skip the structural decisions that determine whether a project ships. In real work, the hardest moments are choosing the right model family for the problem, splitting the data correctly, picking metrics aligned to the business outcome, and producing an evaluation report a stakeholder can act on. AI can help with all of these, provided you treat the conversation as a series of focused prompts, each one corresponding to a step in the ML lifecycle. This tutorial maps that conversation.

2. The Concept Explained

The machine learning lifecycle is a chain of decisions, not a single function call. Each link in the chain has its own prompt shape: problem framing determines the model family, data preparation determines the splits, training determines the search strategy, and evaluation determines the metric set. If any link is weak, the whole chain breaks.

A useful analogy is building a house. You wouldn't ask a contractor for a roof before drawing the foundation. ML prompts should follow the same discipline: ask AI to help you frame the problem before you ask it to write the training loop. The reward is enormous — fewer rewrites, fewer leaks between train and test, fewer "the model is great but the business won't use it" outcomes.

The ML lifecycle in seven steps. Each step gets its own focused prompt — and metrics often loop you back to framing.

One prompt per stage

Frame: classification vs regression vs ranking? Imbalanced? Time-sensitive?
Split: random, stratified, group, or time-based?
Baseline: what naive model beats random?
Train: which scikit-learn (or XGBoost / LightGBM) estimator and pipeline?
Tune: grid, random, or Bayesian search?
Evaluate: which metrics and plots?
Ship: serialise, document, monitor.
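The Split stage's options correspond to splitter classes in scikit-learn. The sketch below is illustrative only: the data is synthetic, and the customer_id array is a hypothetical grouping key, not something defined in this tutorial. It checks the property each strategy is meant to guarantee.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, StratifiedKFold, TimeSeriesSplit

# Synthetic data (assumption): 200 rows, ~20% positives, 50 hypothetical customers.
rng = np.random.default_rng(0)
n_rows = 200
X = rng.normal(size=(n_rows, 3))
y = (rng.random(n_rows) < 0.2).astype(int)
customer_id = rng.integers(0, 50, size=n_rows)

# Stratified: every fold keeps roughly the overall positive rate.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]

# Group: no customer appears on both sides of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=customer_id))
shared_customers = set(customer_id[train_idx]) & set(customer_id[test_idx])

# Time-based: rows are assumed to be sorted by time, so training rows
# always come before the rows they are evaluated on.
tss = TimeSeriesSplit(n_splits=3)
time_ordered = all(tr.max() < te.min() for tr, te in tss.split(X))

print("positive rate per stratified fold:", np.round(fold_rates, 3))
print("customers shared across group split:", len(shared_customers))
print("time-based folds ordered:", time_ordered)
```

Naming the property you need (balanced folds, no shared customers, no look-ahead) in the prompt is usually enough for the AI to pick the matching splitter.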

3. The Problem Without This Technique

Weak prompt

Build a machine learning model for my customer data.

The AI doesn't know the target variable, the success metric, the data shape, or even whether this is classification or regression. It will most likely produce a generic RandomForestClassifier snippet, a default split, and an accuracy score, all of which may be exactly wrong for an imbalanced churn problem.
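To see why accuracy misleads on an imbalanced problem, consider a toy calculation. The numbers are illustrative (1,000 synthetic customers at a 7% churn rate), and the "model" simply predicts "no churn" for everyone.

```python
# Illustrative only: 1,000 synthetic customers, 7% of whom churn.
n_customers = 1000
n_churners = 70
y_true = [1] * n_churners + [0] * (n_customers - n_churners)

# A "model" that always predicts the majority class (no churn).
y_pred = [0] * n_customers

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_customers

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_churners

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.93 recall=0.00
```

A 93% accuracy here means the model found none of the customers the business wanted to save.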

Stronger prompt

Act as a senior ML engineer using scikit-learn and XGBoost.

Problem framing:
- Task: binary classification — predict customer churn
  within the next 90 days (target column: churn_90d, 0/1).
- Class balance: ~7% positives. Imbalanced.
- Business cost: false negatives cost ~10x more
  than false positives (we miss a saveable customer).

Data:
- DataFrame customers_df, ~480k rows.
- 38 features: 9 numeric, 22 one-hot encoded categoricals,
  7 engineered tenure/usage features.
- Pre-split: random by customer_id, 70/15/15 train/val/test.

Tasks:
1. Recommend two model families for this setup
   and justify why.
2. Build a scikit-learn Pipeline (ColumnTransformer that
   applies StandardScaler to the numeric columns + chosen estimator).
3. Use stratified k-fold (k=5) cross-validation.
4. Report PR-AUC as the primary metric (not ROC-AUC,
   given class imbalance) plus recall at top-10% scored.
5. Return runnable code with brief inline comments.

You will typically get a tailored pipeline (likely XGBoost with scale_pos_weight plus a logistic regression baseline), stratified CV, PR-AUC as the primary metric, and a recall-at-k function that maps directly to the business cost framing.
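For orientation, here is a minimal sketch of the kind of code such a prompt might return. It is not the output of any particular AI. Several things are assumptions made for illustration: the synthetic stand-in for customers_df, the three column names, a raw categorical column (the prompt's data is already one-hot encoded), and a logistic regression baseline in place of XGBoost so the example depends only on scikit-learn and pandas.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for customers_df (assumption: column names are made up).
rng = np.random.default_rng(42)
n_rows = 2000
tenure = rng.integers(1, 72, size=n_rows)
churn_prob = 1.0 / (1.0 + np.exp(2.0 + 0.04 * tenure))  # short tenure churns more
customers_df = pd.DataFrame({
    "tenure_months": tenure,
    "monthly_spend": rng.gamma(2.0, 30.0, size=n_rows),
    "plan": rng.choice(["basic", "plus", "pro"], size=n_rows),
    "churn_90d": (rng.random(n_rows) < churn_prob).astype(int),
})

numeric_cols = ["tenure_months", "monthly_spend"]
categorical_cols = ["plan"]

# Preprocessing lives inside the pipeline, so it is refitted on each training fold.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
baseline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X = customers_df[numeric_cols + categorical_cols]
y = customers_df["churn_90d"]

# Stratified 5-fold CV scored with average precision (scikit-learn's PR-AUC summary).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pr_auc_scores = cross_val_score(baseline, X, y, cv=cv, scoring="average_precision")

print("PR-AUC per fold:", np.round(pr_auc_scores, 3))
print(f"prevalence (no-skill PR-AUC): {y.mean():.3f}")
```

Swapping the "model" step for another estimator leaves the preprocessing and the cross-validation untouched, which is the main reason to ask for a Pipeline in the prompt.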

4. The Solution

The pattern is: problem framing → data brief → split strategy → metric → estimator family → output format. Problem framing is the link most beginners skip. Spelling out "imbalanced", "time-dependent", or "small sample" changes every downstream choice — and AI will make those choices intelligently if it knows.
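One way to make the pattern hard to skip is to keep it as a small template. The sketch below is a hypothetical helper (the ModelPromptBrief class and its field names are not from any library); it assembles the six parts in a fixed order, reusing the churn example's details.

```python
from dataclasses import dataclass


@dataclass
class ModelPromptBrief:
    """Hypothetical container for the six parts of the prompt pattern."""
    framing: str
    data_brief: str
    split_strategy: str
    metric: str
    estimator_family: str
    output_format: str

    def render(self) -> str:
        # Sections are emitted in the order the pattern prescribes.
        sections = [
            ("Problem framing", self.framing),
            ("Data", self.data_brief),
            ("Split strategy", self.split_strategy),
            ("Metric", self.metric),
            ("Estimator family", self.estimator_family),
            ("Output format", self.output_format),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


brief = ModelPromptBrief(
    framing="Binary classification: predict churn_90d (0/1). ~7% positives. "
            "False negatives cost ~10x more than false positives.",
    data_brief="DataFrame customers_df, ~480k rows, 38 features.",
    split_strategy="Stratified k-fold (k=5) cross-validation.",
    metric="PR-AUC as primary metric, plus recall at top-10% scored.",
    estimator_family="Recommend two model families and justify why.",
    output_format="Runnable code with brief inline comments.",
)
prompt = brief.render()
print(prompt)
```

Because every field is required, a missing framing or metric fails when the brief is constructed rather than surfacing later as a generic answer.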

One especially valuable habit: always ask AI to recommend a baseline before a model.

What is the simplest possible baseline I can beat on this problem?

The answer tells you the floor your sophisticated model has to clear. A 0.62 PR-AUC sounds great until you realise the prevalence baseline (the positive-class rate, which is what a no-skill model scores on PR-AUC) is 0.58.
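The prevalence floor can be checked directly. In this sketch the data is synthetic (an assumed 7% positive rate), and scikit-learn's DummyClassifier stands in for a no-skill model that gives every row the same score.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import average_precision_score

# Synthetic data (assumption): 5,000 rows, roughly 7% positives.
rng = np.random.default_rng(0)
n_rows = 5000
X = rng.normal(size=(n_rows, 4))
y = (rng.random(n_rows) < 0.07).astype(int)

# A no-skill model: every row gets the same score (the class prior).
no_skill = DummyClassifier(strategy="prior").fit(X, y)
scores = no_skill.predict_proba(X)[:, 1]

floor = average_precision_score(y, scores)
print(f"prevalence={y.mean():.4f}  no-skill PR-AUC={floor:.4f}")
```

The two printed numbers match, so any PR-AUC you report is best read as a lift over the positive rate rather than over zero.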

5. Step-by-Step Breakdown

Frame the problem precisely. Classification vs regression vs ranking. State the target, the class balance, and the business cost asymmetry.
Describe the data. Row count, feature types and counts, and pre-split plan. Include leakage risks ("don't use last_login_date after the churn cutoff").
Ask for a baseline. Logistic regression, gradient boosting with defaults, or a domain rule. The baseline is the contract your final model must beat.
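Request a scikit-learn Pipeline. Always wrap preprocessing in a Pipeline / ColumnTransformer. The transformers are then fitted only on the training portion of each fold, which prevents preprocessing leakage, and the whole model serialises as a single object.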
Specify the metric set. Primary metric, secondary metrics, and operational metrics (precision at top-k, calibration). Each one maps to a business decision.
Iterate by pasting back results. "Cross-val PR-AUC is 0.71 but test PR-AUC is 0.58. Diagnose the gap and suggest fixes." This kind of follow-up is where AI shines.

Tip: Ask AI to write your evaluation as a single function: evaluate(model, X_val, y_val) -> dict. You can call it after every experiment and the diff in metrics tells the whole story of an iteration.
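A minimal sketch of what such a function could look like follows. The choice of metrics, the recall_at_top_fraction helper, and the synthetic demo data are assumptions for illustration, and the model is assumed to expose predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score


def recall_at_top_fraction(y_true, scores, fraction=0.10):
    """Share of all positives captured in the top `fraction` of rows by score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_top = max(1, int(round(fraction * len(scores))))
    top_idx = np.argsort(-scores, kind="stable")[:n_top]
    total_pos = y_true.sum()
    return float(y_true[top_idx].sum() / total_pos) if total_pos > 0 else float("nan")


def evaluate(model, X_val, y_val) -> dict:
    """Return a flat dict of metrics so runs can be compared side by side."""
    scores = model.predict_proba(X_val)[:, 1]
    return {
        "pr_auc": float(average_precision_score(y_val, scores)),
        "roc_auc": float(roc_auc_score(y_val, scores)),
        "recall_at_top_10pct": recall_at_top_fraction(y_val, scores, 0.10),
        "brier": float(brier_score_loss(y_val, scores)),  # calibration proxy
        "prevalence": float(np.mean(y_val)),  # no-skill floor for PR-AUC
    }


# Demo on synthetic data (assumption): the first feature drives the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(size=1000) > 1.5).astype(int)
X_train, y_train, X_val, y_val = X[:700], y[:700], X[700:], y[700:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
metrics = evaluate(model, X_val, y_val)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Keeping the prevalence in the same dict means the PR-AUC is always shown next to its floor.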

6. Practice Exercises

Exercise 1

Write a "framing prompt" for a current ML problem at work. Include task type, target, class balance, cost asymmetry, and any leakage risks. Ask AI: "Given this framing, what are the three most important modelling decisions I need to get right, and what could go wrong with each?"

Exercise 2

Prompt AI to write a scikit-learn Pipeline that includes a ColumnTransformer for mixed numeric / categorical / text features, with a stratified k-fold cross-validation loop and a final test-set evaluation. Specify XGBoost as the estimator. Ask for hyperparameter ranges suitable for RandomizedSearchCV.

Exercise 3

Take an existing model's metrics from a previous project. Paste the train / val / test scores into AI and ask: "Diagnose what is likely happening based on these scores, and recommend three specific next experiments — in order of expected impact."

7. Key Takeaways

The ML lifecycle is a chain of decisions; give AI one focused prompt per stage rather than one mega-prompt.
Problem framing — task type, class balance, cost asymmetry — is the most under-specified part of weak prompts.
Always wrap preprocessing in a scikit-learn Pipeline / ColumnTransformer to prevent leakage.
Ask for a baseline before a sophisticated model; metrics only mean something against a floor.
State the metric set explicitly — primary, secondary, and operational — and align it to the business cost asymmetry.
