Feature extraction is a critical step in machine learning pipelines where raw data (text, images, etc.) is converted into numerical representations that algorithms can understand.
We'll break it into two major sections: text feature extraction and image feature extraction.

Text Feature Extraction

We will cover:
- Bag-of-words counts with CountVectorizer
- TF-IDF weights with TfidfVectorizer
- Word embeddings with Word2Vec (gensim)
CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
# Sample text data
corpus = [
'Machine learning is fascinating',
'Learning algorithms can be powerful',
'Text data needs preprocessing'
]
# Convert text to numeric features
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
# Feature names
print(vectorizer.get_feature_names_out())
print(X_bow.toarray())

Output:
['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
'needs' 'powerful' 'preprocessing' 'text']
[[0 0 0 0 1 1 1 1 0 0 0 0]
[1 1 1 0 0 0 1 0 0 1 0 0]
[0 0 0 1 0 0 0 0 1 0 1 1]]
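Once fitted, the same vectorizer can encode new text against the learned vocabulary; words it has never seen are simply dropped. A minimal sketch (the sample sentence is our own addition, not from the original):

# Encode an unseen document with the fitted vocabulary
new_doc = ['machine learning needs more data']
print(vectorizer.transform(new_doc).toarray())
# Non-vocabulary words such as 'more' are ignored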
TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
# Display results
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray())

Output:
['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
'needs' 'powerful' 'preprocessing' 'text']
[[0. 0. 0. 0. 0.52863461 0.52863461
0.40204024 0.52863461 0. 0. 0. 0. ]
[0.46735098 0.46735098 0.46735098 0. 0. 0.
0.35543247 0. 0. 0.46735098 0. 0. ]
[0. 0. 0. 0.5 0. 0.
0. 0. 0.5 0. 0.5 0.5 ]]
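TF-IDF down-weights terms that appear across many documents, and by default each row is L2-normalized, so every document vector has unit length. A quick sanity check (our own snippet, reusing X_tfidf from above):

import numpy as np

# Default norm='l2': every document vector has length 1
print(np.linalg.norm(X_tfidf.toarray(), axis=1))  # [1. 1. 1.]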
Word2Vec (gensim)

from gensim.models import Word2Vec
sentences = [s.lower().split() for s in corpus]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
# Access vector for a word
print(model.wv['learning'])

Output:
[-1.0724545e-03 4.7286271e-04 1.0206699e-02 1.8018546e-02
-1.8605899e-02 -1.4233618e-02 1.2917745e-02 1.7945977e-02
-1.0030856e-02 -7.5267432e-03 1.4761009e-02 -3.0669428e-03
-9.0732267e-03 1.3108104e-02 -9.7203208e-03 -3.6320353e-03
5.7531595e-03 1.9837476e-03 -1.6570430e-02 -1.8897636e-02
1.4623532e-02 1.0140524e-02 1.3515387e-02 1.5257311e-03
1.2701781e-02 -6.8107317e-03 -1.8928028e-03 1.1537147e-02
-1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
1.6154874e-02 -1.1861792e-02 9.0324880e-05 -9.5074680e-03
-1.9207101e-02 1.0014586e-02 -1.7519170e-02 -8.7836506e-03
-7.0199967e-05 -5.9236289e-04 -1.5322480e-02 1.9229487e-02
9.9641159e-03 1.8466286e-02]
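The trained model also supports similarity queries through the standard gensim API. Note that with a three-sentence corpus the vectors are essentially random; these calls only become meaningful on a large training corpus:

# Nearest neighbours of 'learning' in the embedding space
print(model.wv.most_similar('learning', topn=3))
# Cosine similarity between two word vectors
print(model.wv.similarity('machine', 'learning'))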
Image Feature Extraction

We will cover:
- Raw pixel features from a flattened grayscale image
- Deep features from a pre-trained CNN (VGG16)

Raw Pixel Features

import cv2
import numpy as np
# Load image (convert to grayscale)
image = cv2.imread('sample.jpg', cv2.IMREAD_GRAYSCALE)
image = cv2.resize(image, (64, 64))
# Flatten to 1D array
features = image.flatten()
print("Feature shape:", features.shape)Output -
Feature shape: (4096,)
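Raw intensities range from 0 to 255, so it is common to rescale them before feeding a classifier. A minimal sketch (the scaling choice is ours, not prescribed above):

# Scale pixel values to [0, 1] so features share a comparable range
features = features.astype(np.float32) / 255.0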
Deep Features with VGG16

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
import numpy as np
# Load model without top classifier
model = VGG16(weights='imagenet', include_top=False)
# Load and preprocess image
img = image.load_img('sample.jpg', target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = preprocess_input(img_data)
# Extract features
features = model.predict(img_data)
print("Extracted features shape:", features.shape)Output -
Extracted features shape: (1, 7, 7, 512)
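The result is a 7×7×512 feature map per image. To get a single flat vector you can either flatten it yourself or ask Keras to pool it; a sketch under the same setup as above:

# Option 1: flatten the 7x7x512 map into one 25088-dimensional vector
flat = features.reshape(features.shape[0], -1)
print(flat.shape)  # (1, 25088)

# Option 2: global average pooling yields a compact 512-dimensional vector
pooled_model = VGG16(weights='imagenet', include_top=False, pooling='avg')
print(pooled_model.predict(img_data).shape)  # (1, 512)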