Prompting for NLP Tasks: Sentiment Analysis, Text Classification

Modern NLP work splits cleanly into two prompt styles: generative prompts that ask an LLM to label or extract directly, and code prompts that ask AI to write a classical scikit-learn or transformer pipeline. This topic gives you both — when to use each, and how to brief them well.

1. Introduction

For most data science teams, NLP used to mean weeks of annotation, tokenisation, and model fine-tuning. A surprising portion of that work can now be replaced by a well-crafted prompt to a capable LLM — at least for the prototyping phase. For production-scale work, classical NLP pipelines (TF-IDF + logistic regression, spaCy, transformer encoders) remain faster, cheaper, and more predictable. The data scientist's job is to pick the right approach per task and to brief AI well in both worlds. This tutorial covers the four most common NLP tasks: sentiment analysis, text classification, entity extraction, and retrieval-augmented analytics.

2. The Concept Explained

NLP prompts come in two flavours. Direct LLM prompts ask the model to perform the NLP task itself — "classify this support ticket into one of these eight intents". They are zero-shot or few-shot, no training required, and excellent for prototypes or low-volume work. Code-generation prompts ask AI to write a traditional pipeline — "produce scikit-learn code that trains a TF-IDF + logistic regression sentiment classifier on this DataFrame". They are ideal when latency, cost, or interpretability matter.

An analogy: a direct LLM prompt is like hiring a freelance consultant for one job — fast, flexible, expensive per task. A code-generation prompt is like building a factory — slower to set up, cheap per item, predictable output. Most production systems eventually need the factory; most prototypes start with the consultant.

Two paths for NLP: a direct LLM prompt or a classical pipeline. Both produce structured labels — the trade-offs are cost, latency, and predictability.

Retrieval-augmented analytics

When the corpus is too large or too proprietary to fit in any prompt, the standard answer is retrieval-augmented generation (RAG): embed the documents, retrieve the relevant chunks for a question, and pass them as context. For analytics, this lets you ask natural-language questions across thousands of support tickets, contracts, or research papers. Prompt design for RAG follows the same data-brief discipline: describe the corpus, the embedding model, the chunk size, and the question shape.

3. The Problem Without This Technique

Weak prompt

Do sentiment analysis on my customer reviews.

No label set ("positive/negative" or 5-star?), no domain (hotel reviews are different from software reviews), no output format (column? JSON? probabilities?). The AI will return either an over-generic Python snippet or, worse, a stream of unstructured opinions on individual reviews.

Stronger prompt

Act as a senior NLP engineer.

Task: label customer support tickets with both
sentiment and intent.

Input: DataFrame tickets_df (~18,000 rows) with columns
  ticket_id (int), ticket_text (str, 20-800 words),
  product_area (str), language (str, mostly 'en').

Output schema (one new column per row):
  sentiment ∈ {very_negative, negative, neutral,
              positive, very_positive}
  intent ∈ {bug_report, feature_request, billing,
            how_to, complaint, praise, other}
  intent_confidence ∈ [0, 1]
  short_summary (str, ≤ 18 words)

Constraints:
- For prototyping: provide a direct LLM prompt
  template (system + user) that returns strict JSON.
- For production: provide scikit-learn code that
  trains a TF-IDF + Logistic Regression classifier
  using a labelled subset of 2,000 rows
  (assume column `intent_label` exists).
- Include input validation and graceful error
  handling for malformed JSON output.

The AI will produce a clean JSON-emitting system prompt for the prototype, plus a scikit-learn pipeline (Pipeline + TfidfVectorizer + LogisticRegression with class weights) for the production version, along with a JSON parser that retries on malformed output.

4. The Solution

The pattern is: task framing → input description → output schema → constraint set → choice of approach. The output schema is the most powerful piece. If you specify the exact label set and ask for strict JSON, the LLM will emit parseable, columnar data you can drop into a DataFrame with a single json.loads per row.

For high-volume tasks, always ask AI to write both versions — a direct LLM prompt for the first 100 examples and a classical pipeline for the next 100,000. This dual-track approach gives you a working prototype on day one and a cheap, fast production system on week two.

5. Step-by-Step Breakdown

Define the label set exactly. "Positive/Negative/Neutral" with no further definition produces inconsistent labels. Provide an example of each class.
Specify the input shape. Length range, language, domain. Sentiment for hotel reviews and SaaS feedback uses different vocabulary.
Demand a strict output schema. JSON with a fixed list of keys. Add a validation step in the same prompt.
Choose your track. Direct LLM for low volume, prototyping, or rare classes. Classical pipeline for high volume, low latency, or strict cost control.
For RAG, brief the corpus. Document count, average length, embedding model, retrieval top-k, chunk size, and question domain. Each of these affects answer quality.
Evaluate against a held-out set. Ask AI to write the evaluation code in the same prompt — confusion matrix, macro-F1, and per-class recall.

Tip: When using direct LLM prompts in production, log the raw model output alongside the parsed labels. When the parsing breaks (and it will), you have the evidence to fix the prompt rather than the data.

6. Practice Exercises

Exercise 1

Take a sample of 50 free-text customer messages. Write a direct LLM prompt that labels each one with a sentiment and an intent, returning strict JSON. Run it. Manually grade the accuracy on the first 20. Use the errors to refine the prompt.

Exercise 2

Prompt AI to produce a scikit-learn Pipeline (TF-IDF + Logistic Regression with class_weight='balanced') for a multi-class classification problem. Specify the DataFrame columns, the target labels, and the evaluation metric (macro-F1). Compare its performance to the direct LLM approach on the same held-out set.

Exercise 3

Design a RAG prompt for analytics: "I have 4,000 sales-call transcripts. Build a retrieval-augmented system that lets me ask questions like 'what are the top three objections raised in calls with enterprise prospects in Q3?' Specify chunking, embedding model, retrieval strategy, and answer format."

7. Key Takeaways

NLP prompts come in two flavours: direct LLM labelling and code-generated classical pipelines. Pick per task.
Define the label set exactly — give one example per class — to avoid drift in LLM labelling.
Demand strict JSON output and include a parser-with-retry in the same prompt.
Use direct LLM prompts for prototyping; switch to classical pipelines for high-volume, low-latency production.
For RAG analytics, brief the corpus, chunking, retrieval strategy, and answer schema with the same discipline as a DataFrame brief.

Discussion

AI Prompts for Statistical Testing and Hypothesis Building Building Data Pipelines Using AI Prompt Workflows