Machine learning models work with numerical data, but real-world datasets often contain categorical variables such as Gender, Country, or Department. These values must be converted into numerical form before training a model.
This tutorial covers common encoding techniques, when to use each, and how to implement them using pandas and scikit-learn.
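Categorical data can be nominal (no inherent order, such as Color or Country) or ordinal (a meaningful order, such as Low < Medium < High).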
Most ML algorithms cannot process strings directly. Encoding transforms these values into a format the algorithm can understand.
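To make the point above concrete, here is a minimal sketch. It assumes scikit-learn's LogisticRegression as a stand-in model and a made-up Department column with a made-up target (neither comes from the original examples). Fitting on the raw strings is expected to raise an error, while fitting on an encoded copy succeeds.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one string-valued feature and a binary target
X = pd.DataFrame({"Department": ["Sales", "HR", "Sales", "IT"]})
y = [1, 0, 1, 0]

# Fitting on raw strings fails because the values cannot be converted to floats
try:
    LogisticRegression().fit(X, y)
    raw_fit_ok = True
except ValueError as err:
    raw_fit_ok = False
    print("Raw strings rejected:", err)

# After encoding the column as 0/1 indicator columns, the same model fits
X_encoded = pd.get_dummies(X, columns=["Department"], dtype=int)
model = LogisticRegression().fit(X_encoded, y)
print(model.predict(X_encoded))
```

The encoding step used here is just one option; the rest of the tutorial covers when each technique is appropriate.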
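Label encoding converts each category into a unique integer. The LabelEncoder in scikit-learn assigns integers in sorted (alphabetical) order, so in the example that follows, High maps to 0, Low to 1, and Medium to 2, which does not reflect Low < Medium < High.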
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({'Grade': ['Low', 'Medium', 'High', 'Medium', 'Low']})
encoder = LabelEncoder()
df['Grade_Encoded'] = encoder.fit_transform(df['Grade'])
print(df)
Output:
    Grade  Grade_Encoded
0     Low              1
1  Medium              2
2    High              0
3  Medium              2
4     Low              1
One-hot encoding creates a new column for each category with binary (0 or 1) values.
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)
Output:
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False         True      False
4       False        False       True
# The same encoding using scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older releases used sparse=False
encoded = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)
Output:
   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         0.0          1.0        0.0
2         1.0          0.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
Use ordinal encoding when categories have a clear order, such as Low < Medium < High. The order is passed explicitly so that the integers follow it.
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'Experience': ['Junior', 'Mid', 'Senior', 'Mid', 'Junior']})
encoder = OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior']])
df['Experience_Encoded'] = encoder.fit_transform(df[['Experience']])
print(df)
Output:
  Experience  Experience_Encoded
0     Junior                 0.0
1        Mid                 1.0
2     Senior                 2.0
3        Mid                 1.0
4     Junior                 0.0
Frequency encoding replaces each category with its frequency count or ratio.
df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']})
freq = df['City'].value_counts(normalize=True)
df['City_Encoded'] = df['City'].map(freq)
print(df)
Output:
      City  City_Encoded
0    Delhi           0.6
1   Mumbai           0.2
2    Delhi           0.6
3  Chennai           0.2
4    Delhi           0.6
Target encoding replaces each category with the mean of the target variable for that category.
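Use it with caution: if the means are computed on the same rows the model is later trained or evaluated on, the target leaks into the feature. Compute them from the training data only.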
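A minimal sketch of target encoding with plain pandas, assuming a made-up City feature and a made-up binary Purchased target (both hypothetical, not from the original examples). The category means are learned from the training rows only, and a category that never appeared in training falls back to the overall training mean, which is one common choice rather than the only one.

```python
import pandas as pd

# Hypothetical toy data: City is the feature, Purchased is a binary target
train = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Delhi", "Mumbai"],
    "Purchased": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"City": ["Delhi", "Chennai", "Pune"]})

# Learn the per-category target mean from the training rows only
city_means = train.groupby("City")["Purchased"].mean()
global_mean = train["Purchased"].mean()

# Apply the learned mapping; unseen categories (Pune) get the global mean
train["City_Encoded"] = train["City"].map(city_means)
test["City_Encoded"] = test["City"].map(city_means).fillna(global_mean)

print(test)
```

Note that the training rows are still encoded with means that include their own target values. To reduce that residual leakage, the means can be computed out-of-fold; recent scikit-learn releases (1.3 and later) also provide a TargetEncoder class that applies cross-fitting during fit_transform.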
| Technique | Use Case |
|---|---|
| Label Encoding | Target labels, tree models (integers follow alphabetical order, not category rank) |
| One-Hot Encoding | Nominal features, Linear models |
| Ordinal Encoding | Ordered categories |
| Frequency Encoding | High-cardinality nominal features |
| Target Encoding | Advanced use, risk of leakage |