Encoding Categorical Variables

Machine learning models work with numerical data, but real-world datasets often contain categorical variables such as Gender, Country, or Department. These values must be converted into numerical form before training a model.

This tutorial covers common encoding techniques, when to use each, and how to implement them using pandas and scikit-learn.


Why Encoding is Needed

Categorical data can be:

  • Nominal: No inherent order (e.g., Red, Green, Blue)
  • Ordinal: Ordered categories (e.g., Low, Medium, High)

Most ML algorithms cannot process strings directly. Encoding transforms these values into a format the algorithm can understand.


1. Label Encoding

Converts each category into a unique integer.

When to Use:

  • For ordinal features where the order matters. Note that LabelEncoder assigns integers alphabetically rather than by semantic order, so OrdinalEncoder (covered below) is usually the better choice for features.
  • Avoid on nominal features with linear or distance-based models, since the integers imply an order that does not exist; tree-based models are less sensitive to this.

Python Example:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({'Grade': ['Low', 'Medium', 'High', 'Medium', 'Low']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)

Output:

    Grade  Grade_Encoded
0     Low              1
1  Medium              2
2    High              0
3  Medium              2
4     Low              1
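Notice that High is encoded as 0 and Low as 1: LabelEncoder sorts the categories alphabetically before assigning codes, so the integers do not follow the Low < Medium < High order. You can inspect the mapping through the encoder's `classes_` attribute, where each category's position is its assigned code:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Low', 'Medium', 'High', 'Medium', 'Low'])

# classes_ lists categories in code order: index 0 -> code 0, etc.
print(list(encoder.classes_))  # ['High', 'Low', 'Medium'] (alphabetical)
```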

2. One-Hot Encoding

Creates a new column for each category with binary (0 or 1) values.

When to Use:

  • For nominal features (no order).
  • Works well with linear models.

Python Example (pandas):

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)

Output:

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False         True      False
4       False        False       True
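When feeding one-hot columns to a linear model, one column is often dropped to avoid the "dummy variable trap" (the columns always sum to 1, creating perfect multicollinearity). A minimal sketch using pandas' `drop_first` parameter, with the same illustrative Color column:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# drop_first=True removes the first (alphabetical) category column;
# the dropped category ('Blue') becomes the implicit baseline
df_encoded = pd.get_dummies(df, columns=['Color'], drop_first=True)
print(df_encoded.columns.tolist())  # ['Color_Green', 'Color_Red']
```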

Using scikit-learn:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)

Output:

   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         0.0          1.0        0.0
2         1.0          0.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0

3. Ordinal Encoding

Use when categories have a clear order, such as Low < Medium < High.

from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'Experience': ['Junior', 'Mid', 'Senior', 'Mid', 'Junior']})
encoder = OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior']])
df['Experience_Encoded'] = encoder.fit_transform(df[['Experience']])
print(df)

Output:

  Experience  Experience_Encoded
0     Junior                 0.0
1        Mid                 1.0
2     Senior                 2.0
3        Mid                 1.0
4     Junior                 0.0
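By default, OrdinalEncoder raises an error on categories it did not see during fitting. If the test set may contain new values, it can map them to a sentinel instead (available since scikit-learn 0.24); a sketch reusing the illustrative Experience column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[['Junior', 'Mid', 'Senior']],
    handle_unknown='use_encoded_value',  # map unseen categories...
    unknown_value=-1,                    # ...to -1 instead of raising
)
encoder.fit(pd.DataFrame({'Experience': ['Junior', 'Mid', 'Senior']}))

test = pd.DataFrame({'Experience': ['Mid', 'Intern']})  # 'Intern' unseen
print(encoder.transform(test))  # 'Mid' -> 1.0, 'Intern' -> -1.0
```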

4. Frequency Encoding (Custom)

Replaces each category with its frequency count or ratio.

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']})
freq = df['City'].value_counts(normalize=True)
df['City_Encoded'] = df['City'].map(freq)
print(df)

Output:

      City  City_Encoded
0    Delhi           0.6
1   Mumbai           0.2
2    Delhi           0.6
3  Chennai           0.2
4    Delhi           0.6

5. Target Encoding (For Advanced Users)

Encodes each category with the mean of the target variable for that category.
Use with caution: naive target encoding leaks information from the target into the features, so it is typically combined with smoothing or out-of-fold estimation.
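A minimal pandas-only sketch of the idea, using a hypothetical binary Purchased target (in practice, libraries such as category_encoders add smoothing and cross-fitting to reduce the leakage mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi'],
    'Purchased': [1, 0, 1, 0, 0],  # hypothetical binary target
})

# replace each category with the mean target value of its group
means = df.groupby('City')['Purchased'].mean()
df['City_Encoded'] = df['City'].map(means)
print(df)  # Delhi -> 2/3, Mumbai -> 0.0, Chennai -> 0.0
```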


Which Encoding to Use?

Technique             Use Case
Label Encoding        Ordinal features; tree-based models
One-Hot Encoding      Nominal features; linear models
Ordinal Encoding      Ordered categories
Frequency Encoding    High-cardinality nominal features
Target Encoding       Advanced use; risk of leakage

Things to Watch Out For

  • High cardinality (e.g., thousands of unique categories) can make one-hot encoding inefficient.
  • Always encode train and test data using the same encoder (fit on train, transform both).
  • Don't use label encoding on nominal features with linear models (it adds false order).
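The second point above is easy to get wrong with pd.get_dummies, because encoding train and test separately produces different columns when their category sets differ. One common pandas workaround (a sketch with hypothetical data) is to reindex the test columns against the training columns:

```python
import pandas as pd

train = pd.DataFrame({'Dept': ['HR', 'IT', 'HR', 'Sales']})
test = pd.DataFrame({'Dept': ['IT', 'Finance']})  # 'Finance' unseen in training

train_enc = pd.get_dummies(train, columns=['Dept'])
test_enc = pd.get_dummies(test, columns=['Dept'])

# align test columns to the training columns:
# drops the unseen 'Finance' dummy and fills missing dummies with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())  # ['Dept_HR', 'Dept_IT', 'Dept_Sales']
```

With scikit-learn encoders the same guarantee comes for free: fit once on the training data and call transform on both sets.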