Feature Extraction from Text and Images


Feature extraction is a critical step in machine learning pipelines where raw data (text, images, etc.) is converted into numerical representations that algorithms can understand.

We'll break it into two major sections:

  1. Feature Extraction from Text
  2. Feature Extraction from Images

1. Feature Extraction from Text

We will cover:

  • Bag of Words (BoW)
  • TF-IDF
  • Word Embeddings (Word2Vec, GloVe)
  • Using Pretrained Transformers (like BERT)

Bag of Words represents each document as a vector of raw word counts over the corpus vocabulary, ignoring word order.

Example: Bag of Words using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
# Sample text data
corpus = [
    'Machine learning is fascinating',
    'Learning algorithms can be powerful',
    'Text data needs preprocessing'
]
# Convert text to numeric features
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
# Feature names
print(vectorizer.get_feature_names_out())
print(X_bow.toarray())

Output:

['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
 'needs' 'powerful' 'preprocessing' 'text']
[[0 0 0 0 1 1 1 1 0 0 0 0]
 [1 1 1 0 0 0 1 0 0 1 0 0]
 [0 0 0 1 0 0 0 0 1 0 1 1]]
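
Once fitted, the same vectorizer encodes unseen text against the learned vocabulary, silently dropping any word outside it. A short usage sketch (the new sentence is illustrative):

# Encode a new document with the fitted vocabulary;
# the out-of-vocabulary word 'deep' is ignored
new_doc = ['Deep learning needs data']
print(vectorizer.transform(new_doc).toarray())
# [[0 0 0 1 0 0 1 0 1 0 0 0]]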

TF-IDF (term frequency-inverse document frequency) down-weights words that appear in many documents, so distinctive terms carry more weight than common ones.

Example: TF-IDF using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
# Reuse the corpus defined above
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
# Display feature names and the TF-IDF matrix
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray())

Output:

['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
 'needs' 'powerful' 'preprocessing' 'text']
[[0.         0.         0.         0.         0.52863461 0.52863461
  0.40204024 0.52863461 0.         0.         0.         0.        ]
 [0.46735098 0.46735098 0.46735098 0.         0.         0.
  0.35543247 0.         0.         0.46735098 0.         0.        ]
 [0.         0.         0.         0.5        0.         0.
  0.         0.         0.5        0.         0.5        0.5       ]]
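
The weights above follow scikit-learn's defaults: a smoothed idf, ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count, with each document row then L2-normalized. A quick sketch reproducing the 0.5286 weight of 'machine' in the first document:

import numpy as np
# scikit-learn's default smoothed idf
n_docs = 3
idf = lambda df: np.log((1 + n_docs) / (1 + df)) + 1
# Document 1 contains 'fascinating', 'is', 'learning', 'machine' once each;
# 'learning' occurs in 2 documents, the other three words in 1
row = np.array([idf(1), idf(1), idf(2), idf(1)])
print(row / np.linalg.norm(row))  # [0.5286 0.5286 0.402 0.5286]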

Word embeddings map each word to a dense, low-dimensional vector; words used in similar contexts end up with similar vectors.

Example: Word Embeddings using gensim

from gensim.models import Word2Vec
# Tokenize: lowercase and split each sentence into words
sentences = [s.lower().split() for s in corpus]
# min_count=1 keeps every word of this tiny corpus in the vocabulary
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
# Access the learned 50-dimensional vector for a word
print(model.wv['learning'])

Output:

[-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]
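
GloVe embeddings are typically used pretrained rather than trained locally. One minimal route is gensim's downloader API, sketched below; it fetches the 'glove-wiki-gigaword-50' model (a sizeable download) on first use:

import gensim.downloader as api
# Download (once) and load 50-dimensional pretrained GloVe vectors
glove = api.load('glove-wiki-gigaword-50')
print(glove['learning'])                       # 50-dimensional vector
print(glove.most_similar('learning', topn=3))  # nearest neighbors

For contextual features from a transformer such as BERT, a common recipe is to run sentences through a pretrained model and mean-pool the token embeddings. A minimal sketch, assuming the Hugging Face transformers and torch packages are installed:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
# Tokenize the corpus and run it through BERT
inputs = tokenizer(corpus, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool token embeddings into one 768-dimensional vector per sentence
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # torch.Size([3, 768])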

2. Feature Extraction from Images

We will cover:

  • Raw Pixel Values
  • Color Histograms
  • Pre-trained CNNs (like VGG16)

The simplest image features are the raw pixel intensities themselves, flattened into a single vector.

Example: Flattened Pixel Features using OpenCV

import cv2
# Load the image in grayscale (cv2.imread returns None if the file is missing)
image = cv2.imread('sample.jpg', cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError('sample.jpg not found')
image = cv2.resize(image, (64, 64))
# Flatten the 64x64 image into a 1D feature vector
features = image.flatten()
print("Feature shape:", features.shape)

Output:

Feature shape: (4096,)
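
Color histograms summarize how pixel intensities are distributed, independent of where they occur in the image. A minimal sketch with OpenCV, assuming 32 bins per BGR channel:

import cv2
import numpy as np
img = cv2.imread('sample.jpg')  # BGR color image
# One 32-bin histogram per channel, concatenated into one feature vector
hist = np.concatenate([
    cv2.calcHist([img], [ch], None, [32], [0, 256]).flatten()
    for ch in range(3)
])
hist /= hist.sum()  # normalize so the features do not depend on image size
print("Histogram feature shape:", hist.shape)  # (96,)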

A pretrained CNN such as VGG16 can act as a generic feature extractor: drop its classification head and use the activations of the last convolutional block as features.

Example: Pretrained CNN (VGG16) using Keras

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.utils import load_img, img_to_array
import numpy as np
# Load the model without its top classifier layers
model = VGG16(weights='imagenet', include_top=False)
# Load the image at VGG16's expected 224x224 input size
img = load_img('sample.jpg', target_size=(224, 224))
img_data = img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)  # add a batch dimension
img_data = preprocess_input(img_data)        # VGG16-specific preprocessing
# Extract convolutional features
features = model.predict(img_data)
print("Extracted features shape:", features.shape)

Output:

Extracted features shape: (1, 7, 7, 512)
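
The (1, 7, 7, 512) feature map is usually flattened, or globally average-pooled, before it is fed to a downstream classifier. A short usage sketch:

# Flatten to one 25088-dimensional vector per image
flat = features.reshape(features.shape[0], -1)
print(flat.shape)    # (1, 25088)
# Or global average pooling for a compact 512-dimensional descriptor
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (1, 512)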

 
