So far the networks we've built have been fully connected, meaning every neuron in one layer talks to every neuron in the next. That works for small inputs, but it falls apart on images. Convolutional neural networks were designed specifically for data with a grid structure, and they are a large part of the reason computers can now recognise faces, read handwriting, and help drive cars. This chapter explains what a convolution actually does and why it works so much better than a plain dense network for vision.
Take a modest colour photo of 224 by 224 pixels with three colour channels. Flattened into a single vector, that's 150,528 input values. If you connect that to a hidden layer of even 1,000 neurons, the first layer alone needs more than 150 million weights. That is a huge number to store and train for a single layer, it makes overfitting far more likely, and it gets worse with larger images.
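The arithmetic is easy to check. The short sketch below uses the image size and the 1,000-neuron layer from the text; the variable names are only illustrative.

```python
# Size of a flattened 224 x 224 colour image.
height, width, channels = 224, 224, 3
inputs = height * width * channels

# A dense layer has one weight per (input, neuron) pair, plus one bias per neuron.
hidden_neurons = 1000
weights = inputs * hidden_neurons
biases = hidden_neurons

print(f"inputs:  {inputs:,}")            # 150,528
print(f"weights: {weights:,}")           # 150,528,000
print(f"total:   {weights + biases:,}")  # 150,529,000
```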
There's a deeper problem than size. A dense layer treats every pixel as completely unrelated to its neighbours, so it has no notion that nearby pixels form edges, and edges form shapes. It also has to learn the same object separately in every position it might appear. A cat in the top-left corner and the same cat in the bottom-right look like entirely different inputs to a dense layer. We need an architecture that understands locality and doesn't care where in the image something appears.
The core idea is to slide a small window, called a filter or kernel, across the image. At each position the filter multiplies its values by the pixels underneath it and adds them up, producing a single number. Slide the filter across the whole image and you get a grid of these numbers called a feature map.
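To make the sliding window concrete, here is a minimal sketch in plain Python. It assumes a single-channel image, no padding, and a step of one pixel; the function name `conv2d` and the example numbers are made up for illustration, and real libraries do the same thing far more efficiently. Like deep learning libraries, it does not flip the filter before sliding it.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (both lists of rows): no padding, stride 1."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter by the patch underneath it and add everything up.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 1, 0, 1],
    [1, 0, 2, 3],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
# [2, 5, 2]
# [1, 1, 4]
# [2, 3, 3]
```

A 4 by 4 image and a 2 by 2 filter give a 3 by 3 feature map, because the filter fits in three positions along each axis.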
What makes this powerful is what the filter learns to detect. A particular filter might respond strongly wherever there is a vertical edge, staying quiet everywhere else. Another might fire on a patch of a certain colour or texture. The network learns these filters during training rather than having them designed by hand, and a single convolutional layer holds many filters, each looking for something different.
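As a rough illustration of filters that respond to different things, the sketch below applies two hand-written 3 by 3 filters to a tiny image with one vertical edge. The filter values are assumptions chosen for the example; a trained network would learn its own. The helper `conv2d` is a compact, no-padding sliding window written for this sketch, not a library function.

```python
def conv2d(image, kernel):
    """No-padding, stride-1 sliding-window sum of products."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h) for dj in range(k_w))
            for j in range(len(image[0]) - k_w + 1)
        ]
        for i in range(len(image) - k_h + 1)
    ]


# Dark left half, bright right half: one vertical edge down the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written filters, for illustration only.
filters = {
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
}

# One feature map per filter, as in a convolutional layer with two filters.
feature_maps = {name: conv2d(image, k) for name, k in filters.items()}
for name, fmap in feature_maps.items():
    print(name, fmap)
# vertical_edge [[0, 3, 3, 0], [0, 3, 3, 0]]
# horizontal_edge [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The vertical-edge filter fires only where its window straddles the edge, and the horizontal-edge filter stays at zero because the image has no horizontal edge.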
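Three properties fall out of this design, and together they solve the problems with dense layers. The first is local connectivity: each output depends only on a small patch of neighbouring pixels. The second is weight sharing: the same filter weights are reused at every position, so the number of parameters does not grow with the size of the image. The third is translation equivariance: when a feature moves in the image, its response moves with it in the feature map, so it is detected wherever it appears.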
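Because the same filter is reused at every position, moving a pattern in the input simply moves its response in the feature map. Here is a minimal one-dimensional sketch of that behaviour. The two-value filter and the signals are assumptions picked for illustration, and `conv1d` is a helper written for this example.

```python
def conv1d(signal, kernel):
    """1-D, no-padding, stride-1 version of the sliding window."""
    k = len(kernel)
    return [
        sum(signal[i + d] * kernel[d] for d in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1, 1]  # two shared weights, however long the signal is

signal = [0, 0, 1, 1, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 1, 0]  # same pattern, moved two places right

response = conv1d(signal, kernel)
shifted_response = conv1d(shifted, kernel)

print(response)          # [0, 1, 0, 0, -1, 0, 0]
print(shifted_response)  # [0, 0, 0, 1, 0, 0, -1]
```

The response to the shifted signal is the original response moved two places right. A dense layer would instead see two unrelated input vectors.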
After a convolution, it's common to downsample the feature map with a pooling layer. Max pooling is the usual choice: it slides a small window across the map and keeps only the largest value in each window, throwing away the rest. This makes the representation smaller and faster to process, and it adds a little tolerance to small shifts in the input, since the strongest signal in a region survives even if it moves slightly. The trend through a CNN is that the maps get smaller in width and height while growing deeper in the number of filters.
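A minimal sketch of max pooling in plain Python follows. It assumes non-overlapping 2 by 2 windows, so the window moves two cells at a time; the function name `max_pool2d` and the numbers are illustrative.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: window and stride are both `size`."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            # Keep only the largest value in each size x size window.
            max(
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 0, 3, 6],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 6]]
```

The 4 by 4 map shrinks to 2 by 2, and each surviving value is the strongest response in its region.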
A typical CNN repeats a simple pattern: a convolution to detect features, an activation (usually ReLU) to add non-linearity, and a pooling layer to shrink things down. Stack a few of these blocks, then flatten the result and feed it into one or two dense layers that make the final decision, ending in a softmax output for classification.
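The two named functions in that pattern are simple enough to write out. This is a sketch in plain Python with made-up input values, not how a library implements them.

```python
import math


def relu(values):
    """Zero out negatives, pass positives through unchanged."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


activations = relu([-2.0, 0.5, 3.0])
print(activations)  # [0.0, 0.5, 3.0]

probabilities = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probabilities])  # [0.659, 0.242, 0.099]
```

The largest score gets the largest probability, and the predicted class is simply the position of that maximum.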
What's happening as you go deeper is a hierarchy of features. The early layers detect simple things like edges and corners. The middle layers combine those into textures and parts, such as an eye or a wheel. The deepest layers respond to whole objects. Nobody tells the network to organise itself this way; it emerges from training, because each layer can only build on what the layers before it have already detected.
Here is a compact network that classifies images, the kind of thing you would point at a dataset like CIFAR-10:
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),  # 10 classes
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()
Run model.summary() and look at the parameter counts. The three convolutional layers, despite doing the heavy lifting of feature extraction, hold about 56,000 parameters between them. A single 64-unit dense layer connected directly to the raw 32 by 32 by 3 image would need nearly 200,000. That efficiency is the whole point.
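If you want to see where those counts come from without running TensorFlow, the sketch below reproduces them by hand. It assumes the Keras defaults used in the model: no padding on the convolutions, and pooling windows that do not overlap. The helper names, and the 64-unit dense layer on the raw image used for comparison, are hypothetical.

```python
def conv_params(kernel, in_channels, filters):
    """Weights for `filters` kernels of kernel x kernel x in_channels, plus biases."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return inputs * units + units


# A 3x3 convolution without padding trims 2 from width and height;
# 2x2 pooling halves them, rounding down.
size = 32
conv1 = conv_params(3, 3, 32)
size = (size - 2) // 2          # 32 -> 30 -> 15
conv2 = conv_params(3, 32, 64)
size = (size - 2) // 2          # 15 -> 13 -> 6
conv3 = conv_params(3, 64, 64)
size = size - 2                 # 6 -> 4

flat = size * size * 64         # values handed to the first dense layer
dense1 = dense_params(flat, 64)
dense2 = dense_params(64, 10)

conv_total = conv1 + conv2 + conv3
dense_on_raw = dense_params(32 * 32 * 3, 64)  # hypothetical comparison layer

print(conv1, conv2, conv3)                # 896 18496 36928
print(conv_total)                         # 56320
print(conv_total + dense1 + dense2)       # 122570 parameters in the whole model
print(dense_on_raw)                       # 196672
```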
Convolutional networks are behind most of what people mean when they say computer vision: classifying what's in a photo, detecting and locating multiple objects in a scene, segmenting an image pixel by pixel, reading text from images, and analysing medical scans. In modern practice you rarely train a large vision model from scratch; the architectures covered next and the transfer learning chapter show why.
CNNs are built for data laid out in a grid, where position matters but order in time does not. A lot of important data is the opposite: sequences such as text, speech, and time series, where what came before changes the meaning of what comes next. The next chapter covers recurrent networks, which are designed for exactly that.