Deep CV: How do CNNs work?

Computer Vision is the part of AI which deals with teaching our computers how to see. The machines have to understand what they are looking at when we feed images or video data to them. We humans have eyes to get visual information from our surroundings, and as we grow up and learn about the world we come to understand what we are seeing. In the end, our eyes are receiving light rays reflected from all the objects in front of us. Our brain receives this “image” of what’s in front of us and is able to make meaningful inferences. We are able to recognize faces that we have seen before, distinguish between hundreds of different animals and identify different objects. Now, with Deep Learning, we are able to let computers do the same.

The primary challenge in handling visual data is that each image is represented as a 2-dimensional matrix, where each element of the matrix holds the value of one pixel, instead of a single 1-dimensional vector, which is what we require for training typical neural networks.

We could always convert an image into a 1D format by “flattening” it. This basically means we keep each pixel next to each other one by one to form a single long vector, which we can then input into the neural network. But by doing this, we are losing a lot of spatial information present in the image. Each pixel location is now a separate feature and hence the neural network will not focus on finding patterns between neighbouring pixels in the image. To combat this issue, researchers came up with the convolutional neural network (CNN) architecture which is much more optimized for handling 2-dimensional input. Before we get into how CNNs work, let’s first look at how convolutions happen in images.
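To make the flattening step concrete, here is a minimal NumPy sketch. The 3*3 image and its pixel values are made up for illustration; the point is only to show that pixels which are neighbours in the image end up far apart in the flattened vector.

```python
import numpy as np

# A tiny 3x3 grayscale "image" with a vertical white line in the middle.
# The values are arbitrary and only serve as an illustration.
image = np.array([
    [0, 255, 0],
    [0, 255, 0],
    [0, 255, 0],
])

# Flattening places the rows one after another in a single long vector.
flat = image.reshape(-1)
print(flat.tolist())  # [0, 255, 0, 0, 255, 0, 0, 255, 0]

# Pixels (row 0, col 1) and (row 1, col 1) are vertical neighbours in the image,
# but in the flattened vector they sit a whole row width apart.
width = image.shape[1]
index_top = 0 * width + 1
index_below = 1 * width + 1
print(index_below - index_top)  # 3
```

For a realistic image with hundreds of pixels per row, vertically adjacent pixels would be hundreds of positions apart, which is the loss of spatial information described above.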


Convolutions have been used in image and signal processing applications since long before deep learning adopted them. But the reason why we use convolutions has always remained the same: to extract patterns from an image.

In convolutions, we have a kernel, and we convolve an image with this kernel. This kernel contains the pattern that we wish to extract from the image. The output of this convolution is another image. The output image has high-value pixels (white) near the areas in the original image where the pattern exists and it has low-value pixels (black) near the areas in the original image where the pattern does not exist.

A kernel is typically a very small matrix, usually of size 3x3 or 5x5. This kernel contains the pattern in the image which we want to extract. Let’s have a look at the example below.

Here, we have a kernel which is simply a straight vertical line. Keep in mind that the kernel is very small (3*3) so it is just 3 vertical white pixels surrounded by black pixels. So once we convolve the input image with this kernel, we will get the resultant output image. As you can see, the white pixels in the output image correspond to areas in the input image which had vertical white lines. Hence the output image represents a pixel map which indicates all the areas in the input image where the corresponding kernel pattern exists. This is how the convolution operation extracts patterns from any image. In this case, this convolution extracted all vertical line patterns from the input image. Similarly, we can extract any pattern by changing the kernel. The output image is also called a feature map.

Kernels act as feature extractors for images by extracting patterns from a given image through convolution.

Now that we know what convolution does, let’s look at how exactly we perform convolution.

In this animation, we have a 10*10 image and we are convolving it with a 3*3 kernel. First, we take the first 3*3 window in the input image, multiply it element-wise with the kernel and add up the results. This operation is essentially the dot product of the two matrices when each is treated as a flat vector. The resulting number is the value of the first pixel of the output image. We then slide over to the next 3*3 window and perform the same operation, which gives the value of the 2nd pixel in the output image. We keep doing this until the window has covered every position in the input image.
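The sliding-window procedure can be sketched in a few lines of NumPy. This is a deliberately slow, loop-based illustration rather than how a real library implements it, and the function name `convolve2d` and the test image are my own choices. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise multiplication followed by a sum.
            output[i, j] = np.sum(window * kernel)
    return output

# A 10x10 black image with a single vertical white line in column 4.
image = np.zeros((10, 10))
image[:, 4] = 1.0

# A 3x3 kernel: a vertical white line surrounded by black pixels.
kernel = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])

feature_map = convolve2d(image, kernel)
print(feature_map.shape)        # (8, 8)
print(feature_map[0].tolist())  # [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]
```

The feature map is non-zero only in the column where the window is centred on the line, which is exactly the "pixel map of where the pattern exists" behaviour described earlier.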

If we perform convolution with a 3*3 kernel, the output image will be slightly smaller than the input image, because the kernel cannot be centred on the edge pixels without hanging off the image. For example, a 10*10 input gives an 8*8 output. If we want the output image to be the same size as the input image, we perform padding: we add pixels around the edges of the input image before performing convolution. We usually just add pixels with value 0 (black pixels), which is called zero padding. In some cases we want the image size to decrease after convolution, so we do not perform padding.
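The size arithmetic is easy to check in code. The sketch below assumes the window moves one pixel at a time, as in the description above; the helper `output_size` is a name I made up for illustration, while `np.pad` is NumPy's standard padding function.

```python
import numpy as np

def output_size(input_size, kernel_size, padding=0):
    """Width (or height) of the output when the window slides one pixel at a time."""
    return input_size + 2 * padding - kernel_size + 1

image = np.ones((10, 10))

# Zero padding: add a 1-pixel border of zeros on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)                    # (12, 12)
print(output_size(10, 3))              # 8  -> no padding, output shrinks
print(output_size(10, 3, padding=1))   # 10 -> zero padding keeps the size
```

For a 3*3 kernel a 1-pixel border is enough to keep the size; a 5*5 kernel would need a 2-pixel border by the same formula.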

From a basic level, all visual information can be broken down into a set of patterns. We can recognize an object by its shape, colour, size, etc. All of these attributes can be considered as a mixture of many patterns that we see in images. These patterns are embedded into the pixels and we can extract them with kernels. For example, if we were trying to detect the number 0, we would look for curves in the image. If we convolve the image with kernels containing curve patterns and if these feature maps have high-intensity outputs in certain areas, then we can say that it might be a 0 if the right curves are in the right places. This is how we extract basic level patterns from images. Using 3*3 kernels we are limited to only extracting very basic level patterns like edges, curves and lines, but later we will look into how we perform convolution multiple times to aggregate these simple features to be able to extract more complex patterns from images. Let’s now look at how coloured images are represented.

Image Channels

Till now we have been dealing with 2D images which have a single value at each pixel. These are also called grayscale images, where each pixel is essentially a value between black and white. These images don’t encode colour. If we want to represent colour, we require channels. Any colour can be represented by a mixture of the primary colours red, green and blue, hence we require three channels to represent a coloured image. A single channel is a single 2D matrix, so a coloured image is essentially three 2D matrices: one for red, one for green and one for blue. Each of these 2D matrices gives the value of its respective colour for all the pixels in the image. For example, if a certain pixel has a red value of 200, a green value of 10 and a blue value of 210, the overall colour of that pixel is purplish since it’s mostly red and blue.
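In NumPy, a coloured image is commonly stored as a 3D array. The sketch below assumes the usual (height, width, channels) layout with the channels in red, green, blue order; the 2*2 image itself is a made-up example.

```python
import numpy as np

# A 2x2 colour image: shape (height, width, channels), channel order R, G, B.
image = np.zeros((2, 2, 3), dtype=np.uint8)

# Set the top-left pixel to the purplish colour from the example above.
image[0, 0] = [200, 10, 210]

# Each channel on its own is an ordinary 2D matrix.
red = image[:, :, 0]
green = image[:, :, 1]
blue = image[:, :, 2]

print(image.shape)  # (2, 2, 3)
print(red.shape)    # (2, 2)
print(int(red[0, 0]), int(green[0, 0]), int(blue[0, 0]))  # 200 10 210
```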

Now let’s go through the architecture of a CNN.

To understand the next part of the article it is recommended that you understand the fundamental concepts of Deep Learning. Please visit my publication here where I have explained in several articles about the basics of Deep Learning.

Convolutional Neural Network Architecture

Convolutional Layer

CNNs consist of specialized layers called convolutional layers, or conv layers for short. The input to each conv layer is an image. As we have established earlier, each image has 3 separate 2D matrices called channels, hence the shape is (W,H,3), where W and H are the width and the height of the image, and in this case, we have 3 channels. Previously when we looked at convolution we considered single-channel images, so how do we perform convolution on multi-channel images? We perform the same action but this time the kernel also has 3 channels, so that it can perform dot product with all the channels present in the image while it’s sliding through the image.

Hence, if we are using 3*3 kernels to convolve a certain input image, the shape of each kernel should be (3,3,n), where n is the number of channels of the input image.
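Here is a rough sketch of convolving a multi-channel image with a single multi-channel kernel. The function name and the random data are my own; note that NumPy arrays are indexed rows first, so the image shape reads (H, W, 3) here. Even though the kernel has 3 channels, the result is a single 2D feature map, because everything inside the window is summed into one number.

```python
import numpy as np

def convolve_multichannel(image, kernel):
    """Convolve an (H, W, C) image with one (k, k, C) kernel into a 2D feature map."""
    kh, kw, kc = kernel.shape
    assert kc == image.shape[2], "kernel and image must have the same number of channels"
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The window spans all channels at this spatial position.
            window = image[i:i + kh, j:j + kw, :]
            output[i, j] = np.sum(window * kernel)
    return output

rng = np.random.default_rng(0)
image = rng.random((10, 10, 3))           # random stand-in for a colour image
kernel = rng.standard_normal((3, 3, 3))   # one 3x3 kernel with 3 channels

feature_map = convolve_multichannel(image, kernel)
print(feature_map.shape)  # (8, 8)
```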

Now we need to choose the number of kernels we have in the first layer. Let’s say we choose 16 kernels. After convolution with a single kernel, we will get an output feature map. So now with 16 kernels, we will get a total of 16 feature maps.

What are we going to do with 16 feature maps? Remember how each image has 3 channels? We are going to do something similar here. We will now stack all 16 feature maps so that they form a 16 channel image. Each channel is a separate feature map and essentially contains information about a certain pattern. So the input is the (W,H,3) coloured image and after convolving with 16 kernels, the output is the (w,h,16) feature map block, where w and h are the width and height of each feature map. This entire operation is considered as a single convolutional layer in a CNN. The entire process is visualized in the animation below.
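A whole convolutional layer can be sketched the same way. The code below is a simplified illustration under a few assumptions of mine: no padding, no bias terms, no activation function, random kernel values, and a 32*32 input size picked arbitrarily. It only aims to show how 16 kernels turn a 3-channel image into a 16-channel feature map block.

```python
import numpy as np

def conv_layer(image, kernels):
    """image: (H, W, C_in); kernels: (N, k, k, C_in); returns (H-k+1, W-k+1, N)."""
    n, kh, kw, _ = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w, n))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw, :]
            # Broadcasting multiplies the window with all N kernels at once;
            # summing over the last three axes leaves one value per kernel.
            output[i, j, :] = np.sum(window * kernels, axis=(1, 2, 3))
    return output

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))               # stand-in for a colour image
kernels = rng.standard_normal((16, 3, 3, 3))  # 16 kernels, each of shape (3, 3, 3)

feature_block = conv_layer(image, kernels)
print(feature_block.shape)  # (30, 30, 16)
```

Channel `k` of the output is exactly the feature map produced by kernel `k` on its own, so stacking the 16 maps and computing them in one pass give the same result.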

We can consider this entire feature map block as another image. It has a similar shape to a normal image except the only difference is that it has more channels. Also, in normal images, each channel contains pixel intensity of a certain colour, but in the feature map block, each channel contains pixel intensities corresponding to certain patterns which the kernels have extracted.

Hence we can add another convolutional layer, this time taking the previous feature map block as the input and using another set of kernels. Let’s say we take 32 kernels this time in the 2nd convolutional layer. Each of these 32 kernels will have 16 channels since the input also has 16 channels. We will convolve each of these kernels with the input feature map block and we will get 32 output feature maps. Similar to before, we stack these 32 feature maps to get a 32 channel feature map block. By doing this we can keep adding convolutional layers one after the other. A CNN consists of several conv layers stacked together.

We can control the number of channels in the output feature map block by changing the number of kernels in the conv layer. If we keep n kernels in a particular conv layer, the output of that conv layer will have an n-channel feature map block. As we progress through the CNN, we will try to keep adding channels to the feature map blocks by using more kernels in each layer. This is so that the number of patterns we extract will keep increasing.
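As a quick sanity check of the shapes in this example, we can write down the kernel shapes of the two layers and count their weights. The helper below is hypothetical, and the count ignores bias terms, which this article does not cover.

```python
import math

def conv_weight_shape(num_kernels, kernel_size, in_channels):
    """Shape of all the kernels in one conv layer: (N, k, k, C_in)."""
    return (num_kernels, kernel_size, kernel_size, in_channels)

# Layer 1: 16 kernels of size 3x3 over a 3-channel colour image.
layer1 = conv_weight_shape(16, 3, 3)
# Layer 2: 32 kernels of size 3x3 over the 16-channel feature map block.
layer2 = conv_weight_shape(32, 3, 16)

print(layer1, math.prod(layer1))  # (16, 3, 3, 3) 432
print(layer2, math.prod(layer2))  # (32, 3, 3, 16) 4608
```

The number of output channels of a layer is always the first entry (the number of kernels), and the last entry always has to match the number of channels coming in from the previous layer.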

Intuition behind Convolutional Layers:

Why do we keep adding convolutional layers in a CNN? Because we want to be able to extract complex features from an image. Kernels are very small and can only extract very basic features. But when we try to look at a combination of multiple feature maps, we can aggregate these basic features to form more complex features.

Let’s say there is one kernel looking for vertical lines and one kernel looking for horizontal lines. These will have separate feature maps. Now when we try to find patterns by combining these feature maps, we can extract features which are a mix of both of these patterns. For example, wherever there is a presence of the end of a vertical line and the end of a horizontal line, it means there is a corner. This is a very basic example of how we aggregate simple features to get more complex features.

This is why we stack all the feature maps and then perform convolution again. This way, the next set of kernels can extract patterns which are a mixture of all the previously extracted features, allowing the conv layers to extract more and more complex features as the number of conv layers increases. Some of the deeper conv layers can even extract complex features like a whole human eye, a dog ear, etc.

As we go deeper with more convolutional layers, we can extract features which are a mixture of all the features extracted by previous layers, allowing the CNN to be able to extract complex high-level features in an image.

Max Pooling Layer

Apart from convolutional layers, CNNs also have max-pooling layers. These layers don’t contain any parameters; they simply reduce the width and the height of the feature maps. Let’s look at how they work.

First, we decide the size of the max pooling window. In most cases, we take a 2x2 window to perform max pooling. We then slide the window across the image and we take the maximum pixel value in each window as the output. It’s important to note that we slide the window in such a way that the windows don’t overlap, so for a 2x2 window, we will slide by 2 pixels each time to prevent overlapping. This is a visualization of max-pooling:
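Max pooling with non-overlapping windows can be sketched in NumPy as follows. The function name and the 4*4 feature map are made up for illustration, and the sketch simply drops any leftover rows or columns that do not fill a whole window.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling on a single 2D feature map."""
    h, w = feature_map.shape
    out_h, out_w = h // size, w // size
    # Drop leftover rows/columns that do not fill a complete window.
    trimmed = feature_map[:out_h * size, :out_w * size]
    # Split the map into (out_h, size, out_w, size) blocks and take each block's maximum.
    windows = trimmed.reshape(out_h, size, out_w, size)
    return windows.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
])

pooled = max_pool(feature_map)
print(pooled.tolist())  # [[4, 2], [2, 8]]
```

The 4*4 map becomes a 2*2 map, and each output value is the largest value in its 2x2 window. For a multi-channel feature map block, the same operation would be applied to every channel separately.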

Why do we do max pooling? As the CNN progresses and has more convolutional layers and feature maps, we need to aggregate the important features so that the feature maps get more concentrated. Furthermore, at the end of the CNN, we need to have smaller feature maps so that we focus on the most important features in the image while we are making the final decision.

In the feature maps, we have high-intensity pixel values wherever the pattern exists, right? That’s why we perform max-pooling: only the high-intensity pixels get carried forward while the size of the feature maps is reduced. This way we can still keep the important information about the extracted patterns while removing redundant pixels from the feature maps. This is how features get concentrated through max-pooling.

It is also important to note that we do not perform max pooling as often as we perform convolution in a CNN. We will perform it once for every 3–4 conv layers, sometimes even less often.

Fully Connected Layers

At the end of the CNN, we are going to end up with a feature map block which is small enough to concentrate all the important features and has many channels so that it contains several features. Now we need to perform a decision based on these features.

We will now have a typical neural network at the final few layers of the CNN. We do this by flattening the final feature map block. This means we take every single pixel from every feature map in the block and keep them next to each other to form a single one-dimensional vector. This vector is then fed into a typical neural network with hidden layers and an output layer. The output layer will be configured according to the task. If we are performing classification, we will use an output layer with a softmax activation function, and the number of neurons will be the number of classes.
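The flatten-then-classify step might look like the sketch below. To keep it short I have left out the hidden layers and gone straight from the flattened vector to the output layer. The block size (4, 4, 32), the 10 classes and the random weights are all assumptions for illustration; in a real CNN the weights would be learned.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

rng = np.random.default_rng(0)
feature_block = rng.random((4, 4, 32))  # hypothetical final feature map block
num_classes = 10                        # hypothetical number of classes

# Flatten: every pixel of every feature map goes into one long vector.
flat = feature_block.reshape(-1)

# Output layer: one neuron per class (random, untrained weights).
weights = rng.standard_normal((flat.size, num_classes)) * 0.01
bias = np.zeros(num_classes)

probabilities = softmax(flat @ weights + bias)
print(flat.shape)           # (512,)
print(probabilities.shape)  # (10,)
```

The predicted class would be the index of the largest probability, for example `int(np.argmax(probabilities))`.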

Why do we do this? At the final feature map block, each feature map pixel essentially contains information about a high-level feature which is extracted. If the pixel’s value is high, that means that high-level feature was present in the image and otherwise, it was absent. Hence by flattening this and feeding it into a neural network, the network will learn to classify the image based on the features present in the image. This is the basic intuition behind why we have fully connected layers at the end of the CNN.

Overall CNN structure

From the input image to the final layer, the overall CNN structure is as follows:

This is a very basic example of a complete CNN. It has a total of 2 conv layers and 2 max-pooling layers before the fully connected layers. Usually, we have a lot more conv layers and max-pooling layers because we need a smaller feature map size and more channels before the fully connected layers. Now let’s look at how CNNs learn from data.

How to train a CNN

Once we completely configure a CNN for a particular task, we train the CNN similarly to how we train any ordinary neural network. We simply feed each image into the CNN and calculate the loss at the output layer. We then backpropagate this loss throughout the entire CNN and update the parameters. Repeating this throughout the entire dataset over many iterations will eventually cause the CNN to be trained on the given data.
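Training a full CNN is normally done with a deep learning framework, so the following is only a toy illustration of the idea, not a real training setup. It learns a single 3*3 kernel by gradient descent so that its feature map matches the one produced by a known "target" kernel. The mean squared error loss, the learning rate, the number of steps and the random image are all assumptions I picked for the demo.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))

# The pattern we want the kernel to discover: a vertical line.
target_kernel = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])
target_map = convolve2d(image, target_kernel)

kernel = np.zeros((3, 3))  # the kernel starts out knowing nothing
learning_rate = 0.05
for step in range(1000):
    error = convolve2d(image, kernel) - target_map
    # Gradient of the mean squared error with respect to each kernel entry:
    # the error map correlated with the input image.
    gradient = 2.0 * convolve2d(image, error) / error.size
    kernel -= learning_rate * gradient

print(np.round(kernel, 2))
```

After training, the learned kernel should be very close to the vertical line pattern. In a real CNN the same kind of update is applied to every kernel in every layer, with the gradients computed by backpropagation from the loss at the output layer.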

But what’s actually happening as the CNN is training on the given data? Each kernel affects the overall output of the CNN in some way. At the same time, each kernel is also responsible for extracting important patterns from the images.

Hence, as the CNN gets trained over the images, the parameters in each kernel will try to adjust themselves to form patterns which frequently occur in the images in the dataset and which allow the CNN to distinguish between the different classes. For example, if we are training a CNN to distinguish between cats and dogs, some of the high-level features the CNN might learn are things like ears, eyes, fur, etc. These features occur in all the images and we can distinguish between the two classes by paying attention to the nature of these features.

I hope this article has made you understand how CNNs work. I will go through more advanced CNN architectures in future articles of Deep CV. Thanks for reading!