Machine learning models work with numerical data, but real-world datasets often contain categorical variables such as Gender, Country, or Department. These values must be converted into numerical form before training a model.
This tutorial covers common encoding techniques, when to use each, and how to implement them using pandas and scikit-learn.
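Categorical data can be nominal (no inherent order, such as Country or Department) or ordinal (a meaningful order, such as Low < Medium < High).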
Most ML algorithms cannot process strings directly. Encoding transforms these values into a format the algorithm can understand.
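Before choosing an encoding, it can help to check which columns are categorical and how many distinct values each one holds. The following is a minimal sketch, assuming a small made-up DataFrame; the column names and values are illustrative only and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical example data; column names and values are illustrative only.
df = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M'],
    'Department': ['HR', 'IT', 'IT', 'Sales'],
    'Salary': [50000, 62000, 58000, 45000],
})

# Select the columns that pandas does not treat as numeric
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

# Count the distinct categories in each of those columns (cardinality)
cardinality = df[categorical_cols].nunique()

print(categorical_cols)
print(cardinality)
```

Columns with only a few distinct values are usually easy to one-hot encode, while columns with many distinct values may call for a more compact encoding.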
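Label encoding converts each category into a unique integer. scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the category names, so the codes do not necessarily follow any natural ranking.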
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({'Grade': ['Low', 'Medium', 'High', 'Medium', 'Low']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)

Output:
    Grade  Grade_Encoded
0     Low              1
1  Medium              2
2    High              0
3  Medium              2
4     Low              1

One-hot encoding creates a new column for each category with binary (0 or 1) values.
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)

Output:
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False         True      False
4       False        False       True

The same encoding with scikit-learn's OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions used sparse=False
encoded = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)

Output:
   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         0.0          1.0        0.0
2         1.0          0.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0

Use ordinal encoding when categories have a clear order, such as Low < Medium < High.
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'Experience': ['Junior', 'Mid', 'Senior', 'Mid', 'Junior']})
encoder = OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior']])
df['Experience_Encoded'] = encoder.fit_transform(df[['Experience']])
print(df)

Output:
   Experience  Experience_Encoded
0      Junior                 0.0
1         Mid                 1.0
2      Senior                 2.0
3         Mid                 1.0
4      Junior                 0.0

Frequency encoding replaces each category with its frequency count or ratio.
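The description above mentions both counts and ratios. As a sketch of the count variant, assuming hypothetical `train` and `new` DataFrames (names chosen here for illustration), the mapping can be learned from the training data and then applied to rows that arrive later, including categories that were never seen:

```python
import pandas as pd

# Hypothetical training and new data; values are illustrative only.
train = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']})
new = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Kolkata']})

# Learn raw counts from the training data only
counts = train['City'].value_counts()

# Apply the learned mapping; categories never seen in training get 0
train['City_Count'] = train['City'].map(counts)
new['City_Count'] = new['City'].map(counts).fillna(0).astype(int)

print(new)
```

Mapping unseen categories to 0 is one option among several. The ratio variant uses `value_counts(normalize=True)` instead of raw counts.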
df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']})
freq = df['City'].value_counts(normalize=True)
df['City_Encoded'] = df['City'].map(freq)
print(df)

Output:
      City  City_Encoded
0    Delhi           0.6
1   Mumbai           0.2
2    Delhi           0.6
3  Chennai           0.2
4    Delhi           0.6

Target encoding replaces each category with the mean of the target variable for that category.
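Use it with caution to prevent data leakage: compute the category means from training data only, ideally out-of-fold, so that a row's own target value never contributes to its encoding.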
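One way to limit leakage is to encode each row using target means computed from other rows only. The following is a minimal sketch of that out-of-fold idea, assuming a made-up binary target column `Purchased` and an output column `City_TE` (both names are hypothetical). It is an illustration under those assumptions, not a definitive implementation.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical data: 'City' is the feature, 'Purchased' is a binary target.
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai',
             'Delhi', 'Mumbai', 'Chennai', 'Delhi'],
    'Purchased': [1, 0, 1, 0, 0, 1, 1, 1],
})

# The overall target mean serves as a fallback value
global_mean = df['Purchased'].mean()
df['City_TE'] = global_mean

# Out-of-fold encoding: each row is encoded with category means
# computed from the other folds, never from its own target value.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    fold_means = df.iloc[fit_idx].groupby('City')['Purchased'].mean()
    # Categories missing from the fitting folds fall back to the global mean
    encoded = df.iloc[enc_idx]['City'].map(fold_means).fillna(global_mean)
    df.loc[df.index[enc_idx], 'City_TE'] = encoded

print(df)
```

scikit-learn 1.3 and later also ship `sklearn.preprocessing.TargetEncoder`, which applies a similar cross-fitting scheme inside `fit_transform`.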
| Technique | Use Case |
|---|---|
| Label Encoding | Ordinal features, Tree models |
| One-Hot Encoding | Nominal features, Linear models |
| Ordinal Encoding | Ordered categories |
| Frequency Encoding | High-cardinality nominal features |
| Target Encoding | Advanced use, risk of leakage |