Encoding Categorical Variables

Machine learning models work with numerical data, but real-world datasets often contain categorical variables such as Gender, Country, or Department. These values must be converted into numerical form before training a model.

This tutorial covers common encoding techniques, when to use each, and how to implement them using pandas and scikit-learn.


Why Encoding is Needed

Categorical data can be:

  • Nominal: No inherent order (e.g., Red, Green, Blue)
  • Ordinal: Ordered categories (e.g., Low, Medium, High)

Most ML algorithms cannot process strings directly. Encoding transforms these values into a format the algorithm can understand.


1. Label Encoding

Converts each category into a unique integer.

When to Use:

  • For ordinal features where the order matters. Note that LabelEncoder assigns integers alphabetically rather than by semantic order, so OrdinalEncoder (covered below) is usually the better choice for features.
  • Avoid on nominal features with linear or distance-based models, since the integers imply an order that does not exist; tree-based models are less sensitive to this.

Python Example:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({'Grade': ['Low', 'Medium', 'High', 'Medium', 'Low']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)

Output:

    Grade  Grade_Encoded
0     Low              1
1  Medium              2
2    High              0
3  Medium              2
4     Low              1
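Notice that High is encoded as 0 and Low as 1: LabelEncoder sorts the categories alphabetically before assigning codes, so the integers do not follow the Low < Medium < High order. You can inspect the mapping through the encoder's `classes_` attribute, where each category's position is its assigned code:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Low', 'Medium', 'High', 'Medium', 'Low'])

# classes_ lists categories in code order: index 0 -> code 0, etc.
print(list(encoder.classes_))  # ['High', 'Low', 'Medium'] (alphabetical)
```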

2. One-Hot Encoding

Creates a new column for each category with binary (0 or 1) values.

When to Use:

  • For nominal features (no order).
  • Works well with linear models.

Python Example (pandas):

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)

Output:

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False         True      False
4       False        False       True
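When feeding one-hot columns to a linear model, one column is often dropped to avoid the "dummy variable trap" (the columns always sum to 1, creating perfect multicollinearity). A minimal sketch using pandas' `drop_first` parameter, with the same illustrative Color column:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# drop_first=True removes the first (alphabetical) category column;
# the dropped category ('Blue') becomes the implicit baseline
df_encoded = pd.get_dummies(df, columns=['Color'], drop_first=True)
print(df_encoded.columns.tolist())  # ['Color_Green', 'Color_Red']
```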

Using scikit-learn:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)

Output:

   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         0.0          1.0        0.0
2         1.0          0.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0

3. Ordinal Encoding

Use when categories have a clear order, such as Low < Medium < High.

from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'Experience': ['Junior', 'Mid', 'Senior', 'Mid', 'Junior']})
encoder = OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior']])
df['Experience_Encoded'] = encoder.fit_transform(df[['Experience']])
print(df)

Output:

  Experience  Experience_Encoded
0     Junior                 0.0
1        Mid                 1.0
2     Senior                 2.0
3        Mid                 1.0
4     Junior                 0.0
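By default, OrdinalEncoder raises an error on categories it did not see during fitting. If the test set may contain new values, it can map them to a sentinel instead (available since scikit-learn 0.24); a sketch reusing the illustrative Experience column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[['Junior', 'Mid', 'Senior']],
    handle_unknown='use_encoded_value',  # map unseen categories...
    unknown_value=-1,                    # ...to -1 instead of raising
)
encoder.fit(pd.DataFrame({'Experience': ['Junior', 'Mid', 'Senior']}))

test = pd.DataFrame({'Experience': ['Mid', 'Intern']})  # 'Intern' unseen
print(encoder.transform(test))  # 'Mid' -> 1.0, 'Intern' -> -1.0
```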

4. Frequency Encoding (Custom)

Replaces each category with its frequency count or ratio.

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']})
freq = df['City'].value_counts(normalize=True)
df['City_Encoded'] = df['City'].map(freq)
print(df)

Output:

      City  City_Encoded
0    Delhi           0.6
1   Mumbai           0.2
2    Delhi           0.6
3  Chennai           0.2
4    Delhi           0.6

5. Target Encoding (For Advanced Users)

Encodes each category with the mean of the target variable for that category.
Use with caution: naive target encoding leaks information from the target into the features, so it is typically combined with smoothing or out-of-fold estimation.
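A minimal pandas-only sketch of the idea, using a hypothetical binary Purchased target (in practice, libraries such as category_encoders add smoothing and cross-fitting to reduce the leakage mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi'],
    'Purchased': [1, 0, 1, 0, 0],  # hypothetical binary target
})

# replace each category with the mean target value of its group
means = df.groupby('City')['Purchased'].mean()
df['City_Encoded'] = df['City'].map(means)
print(df)  # Delhi -> 2/3, Mumbai -> 0.0, Chennai -> 0.0
```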


Which Encoding to Use?

Technique             Use Case
Label Encoding        Ordinal features; tree-based models
One-Hot Encoding      Nominal features; linear models
Ordinal Encoding      Ordered categories
Frequency Encoding    High-cardinality nominal features
Target Encoding       Advanced use; risk of leakage

Things to Watch Out For

  • High cardinality (e.g., thousands of unique categories) can make one-hot encoding inefficient.
  • Always encode train and test data using the same encoder (fit on train, transform both).
  • Don't use label encoding on nominal features with linear models (it adds false order).
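The second point above is easy to get wrong with pd.get_dummies, because encoding train and test separately produces different columns when their category sets differ. One common pandas workaround (a sketch with hypothetical data) is to reindex the test columns against the training columns:

```python
import pandas as pd

train = pd.DataFrame({'Dept': ['HR', 'IT', 'HR', 'Sales']})
test = pd.DataFrame({'Dept': ['IT', 'Finance']})  # 'Finance' unseen in training

train_enc = pd.get_dummies(train, columns=['Dept'])
test_enc = pd.get_dummies(test, columns=['Dept'])

# align test columns to the training columns:
# drops the unseen 'Finance' dummy and fills missing dummies with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())  # ['Dept_HR', 'Dept_IT', 'Dept_Sales']
```

With scikit-learn encoders the same guarantee comes for free: fit once on the training data and call transform on both sets.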