Encoding Categorical Variables
Machine learning models work with numerical data, but real-world datasets often contain categorical variables such as Gender, Country, or Department. These values must be converted into numerical form before training a model.
This tutorial covers common encoding techniques, when to use each, and how to implement them using pandas and scikit-learn.
Why Encoding is Needed
Categorical data can be:
- Nominal: No inherent order (e.g., Red, Green, Blue)
- Ordinal: Ordered categories (e.g., Low, Medium, High)
Most ML algorithms cannot process strings directly. Encoding transforms these values into a format the algorithm can understand.
1. Label Encoding
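Converts each category into a unique integer. scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order and is intended mainly for target labels, so check the resulting mapping before using it on a feature.
When to Use:
- For ordinal features where the order matters, provided the assigned integers actually follow that order.
- Avoid using on nominal data in linear or distance-based models, since they read the integers as an order that does not exist. Tree-based models are generally more tolerant.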
Python Example:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({'Grade': ['Low', 'Medium', 'High', 'Medium', 'Low']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)
Output-
Grade Grade_Encoded
0 Low 1
1 Medium 2
2 High 0
3 Medium 2
4 Low 1
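In the output above, LabelEncoder numbered the categories alphabetically (High=0, Low=1, Medium=2), which does not match Low < Medium < High. One way to keep the real order is an explicit mapping with pandas. This is a minimal sketch; the grade_order dictionary is a name introduced here for illustration, not part of any library.

```python
import pandas as pd

df = pd.DataFrame({'Grade': ['Low', 'Medium', 'High', 'Medium', 'Low']})

# Hypothetical mapping: the integers follow the real order Low < Medium < High
grade_order = {'Low': 0, 'Medium': 1, 'High': 2}

# map() looks up each category in the dictionary
df['Grade_Encoded'] = df['Grade'].map(grade_order)
print(df)
```

Any category missing from the dictionary would become NaN, which makes unexpected values easy to spot.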
2. One-Hot Encoding
Creates a new column for each category with binary (0 or 1) values.
When to Use:
- For nominal features (no order).
- Works well with linear models.
Python Example (pandas):
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)
Output-
Color_Blue Color_Green Color_Red
0 False False True
1 False True False
2 True False False
3 False True False
4 False False True
Using scikit-learn:
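from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array; it replaces the older sparse argument (scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Color']])
# get_feature_names_out() supplies the column names for the encoded array
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)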
Output-
Color_Blue Color_Green Color_Red
0 0.0 0.0 1.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 1.0 0.0
4 0.0 0.0 1.0
3. Ordinal Encoding
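Use when categories have a clear order, such as Low < Medium < High. Passing the order through the categories argument makes the integers follow it rather than alphabetical order.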
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'Experience': ['Junior', 'Mid', 'Senior', 'Mid', 'Junior']})
encoder = OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior']])
# fit_transform returns an (n, 1) array; ravel() flattens it to a 1-D column
df['Experience_Encoded'] = encoder.fit_transform(df[['Experience']]).ravel()
print(df)
Output-
Experience Experience_Encoded
0 Junior 0.0
1 Mid 1.0
2 Senior 2.0
3 Mid 1.0
4 Junior 0.0
4. Frequency Encoding (Custom)
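Replaces each category with its frequency, either as a count or as a ratio of all rows. Categories that occur equally often receive the same value, as Mumbai and Chennai do in the output below.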
df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']})
freq = df['City'].value_counts(normalize=True)
df['City_Encoded'] = df['City'].map(freq)
print(df)
Output-
City City_Encoded
0 Delhi 0.6
1 Mumbai 0.2
2 Delhi 0.6
3 Chennai 0.2
4 Delhi 0.6
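When frequency encoding is applied to new data, the frequencies should come from the training data only. The sketch below assumes a hypothetical train/test split and fills categories unseen during training with 0; both the data and the fill value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical split: 'Kolkata' never appears in the training data
train = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']})
test = pd.DataFrame({'City': ['Mumbai', 'Kolkata', 'Delhi']})

# Learn the frequencies from the training data only
freq = train['City'].value_counts(normalize=True)

train['City_Encoded'] = train['City'].map(freq)
# Unseen categories map to NaN; here they are filled with 0.0 (an assumption)
test['City_Encoded'] = test['City'].map(freq).fillna(0.0)
print(test)
```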
5. Target Encoding (For Advanced Users)
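Replaces each category with the mean of the target variable for the rows in that category.
Use with caution to prevent data leakage: the means should be computed from training data only, never from the rows being evaluated.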
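A minimal sketch of the idea with pandas, assuming a hypothetical binary target column named Purchased and made-up data. The category means are learned on the training rows only, and categories unseen in training fall back to the overall training mean; that fallback is one possible choice, not a fixed rule.

```python
import pandas as pd

# Hypothetical data: 'Purchased' is the target variable
train = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi', 'Mumbai'],
    'Purchased': [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({'City': ['Delhi', 'Chennai', 'Kolkata']})

# Overall target mean, used as a fallback for unseen categories
global_mean = train['Purchased'].mean()

# Mean of the target per category, computed on training data only
city_means = train.groupby('City')['Purchased'].mean()

test['City_Encoded'] = test['City'].map(city_means).fillna(global_mean)
print(test)
```

Encoding the training rows themselves with these same means would leak the target into the features. A common remedy is to compute the means out-of-fold; scikit-learn 1.3 and later also provide a TargetEncoder class that does this internally.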
Which Encoding to Use?
Technique | Use Case |
---|---|
Label Encoding | Ordinal features, Tree models |
One-Hot Encoding | Nominal features, Linear models |
Ordinal Encoding | Ordered categories |
Frequency Encoding | High-cardinality nominal features |
Target Encoding | Advanced use, risk of leakage |
Things to Watch Out For
- High cardinality (e.g., thousands of unique categories) can make one-hot encoding inefficient.
- Always encode train and test data using the same encoder (fit on train, transform both).
- Don't use label encoding on nominal features with linear models (it adds false order).