Feature Extraction from Text and Images
Feature extraction is a critical step in machine learning pipelines: it converts raw data (text, images, etc.) into numerical representations that algorithms can work with.
We'll break it into two major sections:
- Feature Extraction from Text
- Feature Extraction from Images
1. Feature Extraction from Text
We will cover:
- Bag of Words (BoW)
- TF-IDF
- Word Embeddings (Word2Vec, GloVe)
- Using Pretrained Transformers (like BERT)
Example: Bag of Words using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Sample text data
corpus = [
    'Machine learning is fascinating',
    'Learning algorithms can be powerful',
    'Text data needs preprocessing'
]
# Convert text to numeric features
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
# Feature names
print(vectorizer.get_feature_names_out())
print(X_bow.toarray())
Output -
['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
'needs' 'powerful' 'preprocessing' 'text']
[[0 0 0 0 1 1 1 1 0 0 0 0]
[1 1 1 0 0 0 1 0 0 1 0 0]
[0 0 0 1 0 0 0 0 1 0 1 1]]
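To see where the matrix above comes from, here is a minimal from-scratch sketch of the same idea using only the standard library. It assumes the default CountVectorizer behaviour of lowercasing the text and keeping tokens of two or more word characters; the helper name tokenize is ours, not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    'Machine learning is fascinating',
    'Learning algorithms can be powerful',
    'Text data needs preprocessing',
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (an assumption that mirrors CountVectorizer's default settings)
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary word
bow_matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row simply counts how often every vocabulary word occurs in that document, which is why word order and grammar are lost in a Bag of Words representation.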
Example: TF-IDF using TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
# Display results
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray())
Output -
['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
'needs' 'powerful' 'preprocessing' 'text']
[[0. 0. 0. 0. 0.52863461 0.52863461
0.40204024 0.52863461 0. 0. 0. 0. ]
[0.46735098 0.46735098 0.46735098 0. 0. 0.
0.35543247 0. 0. 0.46735098 0. 0. ]
[0. 0. 0. 0.5 0. 0.
0. 0. 0.5 0. 0.5 0.5 ]]
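The weights in the first row can be checked by hand. The sketch below assumes TfidfVectorizer's default settings: a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row. The variable names are ours and the document frequencies are read off the corpus above.

```python
import math

n_docs = 3  # documents in the corpus

# Words of the first document and the number of documents containing each one
doc_freq = {'machine': 1, 'learning': 2, 'is': 1, 'fascinating': 1}

# Smoothed IDF (assumed default, smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

# Every word occurs once in this document, so tf = 1 and tf-idf equals idf.
# Dividing by the L2 norm scales the row to unit length.
norm = math.sqrt(sum(v ** 2 for v in idf.values()))
tfidf_doc = {w: v / norm for w, v in idf.items()}

for word, weight in tfidf_doc.items():
    print(f"{word}: {weight:.4f}")
```

'learning' gets a lower weight (about 0.402) than the other three words (about 0.529) because it appears in two documents instead of one, so it is less distinctive for this document.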
Example: Word Embeddings using gensim
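from gensim.models import Word2Vec
# Tokenize: lowercase each sentence and split on whitespace
sentences = [s.lower().split() for s in corpus]
# Train a small model: 50-dimensional vectors, a context window of 5 words,
# and min_count=1 so that words seen only once are kept
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
# Access vector for a word
print(model.wv['learning'])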
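Output (exact values vary from run to run, since training starts from random weights and uses several worker threads) -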
[-1.0724545e-03 4.7286271e-04 1.0206699e-02 1.8018546e-02
-1.8605899e-02 -1.4233618e-02 1.2917745e-02 1.7945977e-02
-1.0030856e-02 -7.5267432e-03 1.4761009e-02 -3.0669428e-03
-9.0732267e-03 1.3108104e-02 -9.7203208e-03 -3.6320353e-03
5.7531595e-03 1.9837476e-03 -1.6570430e-02 -1.8897636e-02
1.4623532e-02 1.0140524e-02 1.3515387e-02 1.5257311e-03
1.2701781e-02 -6.8107317e-03 -1.8928028e-03 1.1537147e-02
-1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
1.6154874e-02 -1.1861792e-02 9.0324880e-05 -9.5074680e-03
-1.9207101e-02 1.0014586e-02 -1.7519170e-02 -8.7836506e-03
-7.0199967e-05 -5.9236289e-04 -1.5322480e-02 1.9229487e-02
9.9641159e-03 1.8466286e-02]
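A word vector describes a single word, while most models need one feature vector per document. One common and simple approach is to average the vectors of the words in the document. The sketch below illustrates this with NumPy; the tiny 3-dimensional word_vectors dictionary is made up for illustration and stands in for a trained model's lookup, and document_vector is a hypothetical helper name.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
word_vectors = {
    'machine':  np.array([0.2, 0.1, 0.4]),
    'learning': np.array([0.4, 0.3, 0.0]),
    'is':       np.array([0.0, 0.2, 0.2]),
}

def document_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words: fall back to a zero vector of the right size
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

# 'fascinating' has no vector here, so it is skipped
doc_vec = document_vector('machine learning is fascinating'.split(), word_vectors)
print(doc_vec)
```

Averaging ignores word order, but it gives every document a vector of the same length regardless of how many words it contains.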
2. Feature Extraction from Images
We will cover:
- Raw Pixel Values
- Color Histograms
- Pre-trained CNNs (like VGG16)
Example: Flattened Pixel Features using OpenCV
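import cv2
# Load image as grayscale; cv2.imread returns None if the file cannot be read
image = cv2.imread('sample.jpg', cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError("Could not read 'sample.jpg'")
# Resize to a fixed 64x64 size so every image yields the same number of features
image = cv2.resize(image, (64, 64))
# Flatten to 1D array
features = image.flatten()
print("Feature shape:", features.shape)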
Output -
Feature shape: (4096,)
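Color histograms, the second technique in the list above, summarize how pixel intensities are distributed in each color channel while ignoring where the pixels are located. Here is a minimal NumPy sketch. To stay self-contained it uses a random array as a stand-in for a loaded image, and the choice of 8 bins per channel is an assumption, not a recommendation.

```python
import numpy as np

# Stand-in for a loaded image: random 64x64 pixels with 3 color channels
rng = np.random.default_rng(0)
image_rgb = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

bins_per_channel = 8  # illustrative choice
channel_histograms = []
for channel in range(3):
    # Count how many pixels fall into each intensity range for this channel
    counts, _ = np.histogram(
        image_rgb[:, :, channel], bins=bins_per_channel, range=(0, 256)
    )
    # Normalize so the feature does not depend on image size
    channel_histograms.append(counts / counts.sum())

# Concatenate the per-channel histograms into one feature vector
color_features = np.concatenate(channel_histograms)
print("Feature shape:", color_features.shape)
```

With 8 bins for each of the 3 channels, the feature vector has 24 values, far fewer than the 4096 raw pixel features above. Note that an image loaded with cv2.imread in color mode stores its channels in BGR order rather than RGB.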
Example: Pretrained CNN (VGG16) using Keras
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
import numpy as np
# Load model without the top classifier layers, so it outputs convolutional feature maps
model = VGG16(weights='imagenet', include_top=False)
# Load and preprocess image
img = image.load_img('sample.jpg', target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = preprocess_input(img_data)
# Extract features
features = model.predict(img_data)
print("Extracted features shape:", features.shape)
Output -
Extracted features shape: (1, 7, 7, 512)
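The output above is a 7x7 grid of 512-channel activations, while most classical models expect one flat feature vector per image. Two common ways to get there are flattening and global average pooling. The sketch below shows both with NumPy, using a random array as a stand-in for the real VGG16 output so that it runs without downloading the model.

```python
import numpy as np

# Stand-in for the VGG16 output above: a batch of one 7x7x512 feature map
feature_map = np.random.default_rng(0).random((1, 7, 7, 512))

# Option 1: flatten every value into one long vector per image
flat_features = feature_map.reshape(feature_map.shape[0], -1)

# Option 2: global average pooling over the two spatial axes
pooled_features = feature_map.mean(axis=(1, 2))

print("Flattened shape:", flat_features.shape)
print("Pooled shape:", pooled_features.shape)
```

Flattening keeps all 7 x 7 x 512 = 25088 values, while pooling keeps one average per channel (512 values), which is usually easier to feed into a small downstream classifier.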