Feature Extraction from Text and Images


Feature extraction is a critical step in machine learning pipelines where raw data (text, images, etc.) is converted into numerical representations that algorithms can understand.

We'll break it into two major sections:

  1. Feature Extraction from Text
  2. Feature Extraction from Images

1. Feature Extraction from Text

We will cover:

  • Bag of Words (BoW)
  • TF-IDF
  • Word Embeddings (Word2Vec, GloVe)
  • Using Pretrained Transformers (like BERT)

Bag of Words represents each document as a vector of raw word counts over the corpus vocabulary, ignoring word order.

Example: Bag of Words using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
# Sample text data
corpus = [
    'Machine learning is fascinating',
    'Learning algorithms can be powerful',
    'Text data needs preprocessing'
]
# Convert text to numeric features
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
# Feature names
print(vectorizer.get_feature_names_out())
print(X_bow.toarray())

Output:

['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
 'needs' 'powerful' 'preprocessing' 'text']
[[0 0 0 0 1 1 1 1 0 0 0 0]
 [1 1 1 0 0 0 1 0 0 1 0 0]
 [0 0 0 1 0 0 0 0 1 0 1 1]]
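
Once fitted, the same vectorizer encodes unseen text against the learned vocabulary, silently dropping any word outside it. A short usage sketch (the new sentence is illustrative):

# Encode a new document with the fitted vocabulary;
# the out-of-vocabulary word 'deep' is ignored
new_doc = ['Deep learning needs data']
print(vectorizer.transform(new_doc).toarray())
# [[0 0 0 1 0 0 1 0 1 0 0 0]]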

TF-IDF (term frequency-inverse document frequency) down-weights words that appear in many documents, so distinctive terms carry more weight than common ones.

Example: TF-IDF using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
# Reuse the corpus defined above
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
# Display feature names and the TF-IDF matrix
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray())

Output:

['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
 'needs' 'powerful' 'preprocessing' 'text']
[[0.         0.         0.         0.         0.52863461 0.52863461
  0.40204024 0.52863461 0.         0.         0.         0.        ]
 [0.46735098 0.46735098 0.46735098 0.         0.         0.
  0.35543247 0.         0.         0.46735098 0.         0.        ]
 [0.         0.         0.         0.5        0.         0.
  0.         0.         0.5        0.         0.5        0.5       ]]
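
The weights above follow scikit-learn's defaults: a smoothed idf, ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count, with each document row then L2-normalized. A quick sketch reproducing the 0.5286 weight of 'machine' in the first document:

import numpy as np
# scikit-learn's default smoothed idf
n_docs = 3
idf = lambda df: np.log((1 + n_docs) / (1 + df)) + 1
# Document 1 contains 'fascinating', 'is', 'learning', 'machine' once each;
# 'learning' occurs in 2 documents, the other three words in 1
row = np.array([idf(1), idf(1), idf(2), idf(1)])
print(row / np.linalg.norm(row))  # [0.5286 0.5286 0.402 0.5286]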

Word embeddings map each word to a dense, low-dimensional vector; words used in similar contexts end up with similar vectors.

Example: Word Embeddings using gensim

from gensim.models import Word2Vec
# Tokenize: lowercase and split each sentence into words
sentences = [s.lower().split() for s in corpus]
# min_count=1 keeps every word of this tiny corpus in the vocabulary
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
# Access the learned 50-dimensional vector for a word
print(model.wv['learning'])

Output:

[-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]
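
GloVe embeddings are typically used pretrained rather than trained locally. One minimal route is gensim's downloader API, sketched below; it fetches the 'glove-wiki-gigaword-50' model (a sizeable download) on first use:

import gensim.downloader as api
# Download (once) and load 50-dimensional pretrained GloVe vectors
glove = api.load('glove-wiki-gigaword-50')
print(glove['learning'])                       # 50-dimensional vector
print(glove.most_similar('learning', topn=3))  # nearest neighbors

For contextual features from a transformer such as BERT, a common recipe is to run sentences through a pretrained model and mean-pool the token embeddings. A minimal sketch, assuming the Hugging Face transformers and torch packages are installed:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
# Tokenize the corpus and run it through BERT
inputs = tokenizer(corpus, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool token embeddings into one 768-dimensional vector per sentence
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # torch.Size([3, 768])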

2. Feature Extraction from Images

We will cover:

  • Raw Pixel Values
  • Color Histograms
  • Pre-trained CNNs (like VGG16)

The simplest image features are the raw pixel intensities themselves, flattened into a single vector.

Example: Flattened Pixel Features using OpenCV

import cv2
# Load the image in grayscale (cv2.imread returns None if the file is missing)
image = cv2.imread('sample.jpg', cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError('sample.jpg not found')
image = cv2.resize(image, (64, 64))
# Flatten the 64x64 image into a 1D feature vector
features = image.flatten()
print("Feature shape:", features.shape)

Output:

Feature shape: (4096,)
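
Color histograms summarize how pixel intensities are distributed, independent of where they occur in the image. A minimal sketch with OpenCV, assuming 32 bins per BGR channel:

import cv2
import numpy as np
img = cv2.imread('sample.jpg')  # BGR color image
# One 32-bin histogram per channel, concatenated into one feature vector
hist = np.concatenate([
    cv2.calcHist([img], [ch], None, [32], [0, 256]).flatten()
    for ch in range(3)
])
hist /= hist.sum()  # normalize so the features do not depend on image size
print("Histogram feature shape:", hist.shape)  # (96,)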

A pretrained CNN such as VGG16 can act as a generic feature extractor: drop its classification head and use the activations of the last convolutional block as features.

Example: Pretrained CNN (VGG16) using Keras

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.utils import load_img, img_to_array
import numpy as np
# Load the model without its top classifier layers
model = VGG16(weights='imagenet', include_top=False)
# Load the image at VGG16's expected 224x224 input size
img = load_img('sample.jpg', target_size=(224, 224))
img_data = img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)  # add a batch dimension
img_data = preprocess_input(img_data)        # VGG16-specific preprocessing
# Extract convolutional features
features = model.predict(img_data)
print("Extracted features shape:", features.shape)

Output:

Extracted features shape: (1, 7, 7, 512)
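
The (1, 7, 7, 512) feature map is usually flattened, or globally average-pooled, before it is fed to a downstream classifier. A short usage sketch:

# Flatten to one 25088-dimensional vector per image
flat = features.reshape(features.shape[0], -1)
print(flat.shape)    # (1, 25088)
# Or global average pooling for a compact 512-dimensional descriptor
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (1, 512)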

 
