These are the notes for Andrew Ng's (Wu Enda's) deep learning course [1].

Author: Huang Haiguang [2]

Main writers: Huang Haiguang, Lin Xingmu (all the manuscripts of the fourth course, the first two weeks of the fifth course, and the first three quarters of its third week), Zhu Yansen (all the manuscripts of the third course), He Zhiyao (the third week of the fifth course), Wang Xiang, Hu Hanwen, Yu Xiao, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, Cao Yue, Lu Haoxiang, Qiu Muchen, Tang Tianze, Zhang Hao, Chen Zhihao, You Ren, Ze Lin, Shen Weichen, Jia Hongshun, Shi Chao, Chen Zhe, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian

Participating editors: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jiayong, Wang Xiang, Xie Shichen, Jiang Peng

Note: The notes, the assignments (including the data and the original assignment files), and the videos can all be downloaded from GitHub [3].

I will post the course notes in installments on the public account "Machine Learning Beginners", so stay tuned.

Course 4: Convolutional Neural Networks

## Week 1: Foundations of Convolutional Neural Networks

### 1.1 Computer vision

Welcome to this course on convolutional neural networks. Computer vision is a field that is advancing rapidly thanks to deep learning. Deep learning for computer vision helps self-driving cars identify the pedestrians and other cars around them and avoid them. It also makes face recognition far more efficient and accurate: you may soon be able to, or may already be able to, unlock your phone or a door just with your face. And once your phone is unlocked, I would guess it has many apps that show you pictures of food, hotels, or beautiful scenery. Some of the companies behind these apps use deep learning to show you the most attractive, most beautiful, or most relevant pictures. Deep learning has even given rise to new types of art. There are two reasons why deep learning for computer vision excites me, and I think it may excite you too.

First, the rapid progress of computer vision is enabling new applications that were unimaginable just a few years ago. By learning these tools, you may be able to create some of these new products and applications yourself.

Second, even if you do not end up building computer vision systems yourself, I have found that the computer vision research community is so inventive and creative, with so many new neural network architectures and algorithms, that its ideas inspire a lot of cross-fertilization with other fields. For example, when I was working on speech recognition, I often took inspiration from ideas in computer vision and borrowed them for my own work. So even if you do not end up working in computer vision, I hope you will be able to apply what you learn here to other algorithms and architectures. With that, let's get started.

Here are some of the problems we will study in this course. You have probably heard of image classification, sometimes called image recognition: for example, given this 64×64 picture, have the computer figure out that it is a cat.

Another example of a computer vision problem is object detection. In a self-driving car project, for example, it is not enough to recognize that the picture contains a vehicle; you also need to work out where the other vehicles are so that your car can avoid them. So in object detection, you first work out which objects are in the picture, such as cars, and then draw bounding boxes around them or use some other technique to mark their positions in the picture. Note that in this example a single picture contains several vehicles at the same time, each at a certain distance from your car.

Another, perhaps more fun, example is neural style transfer. Say you have a picture and you want it repainted in a different style. Neural style transfer takes a content image and a style image (the picture on the right is in fact a Picasso painting) and uses a neural network to merge them into a new picture: its overall outline comes from the image on the left, but its style comes from the one on the right, which yields the picture at the bottom. Algorithms like this have made new kinds of artwork possible, and in this course you will learn how to do this yourself.

One challenge in computer vision is that the inputs can get very large. In the previous courses you generally worked with small 64×64 pictures. The input size is actually 64×64×3, because each picture has 3 color channels. Multiplying that out gives 12288, so the input feature vector has dimension 12288. That is not too bad, because 64×64 is really a very small picture.

If you work with a larger picture, such as a 1000×1000 picture, which is only one megapixel, the dimension of the feature vector becomes 1000×1000×3, because there are 3 **RGB** channels, which is 3 million. If you are viewing this on a small screen it may not be apparent, but the image above is only 64×64, while the image below is 1000×1000.

With 3 million inputs, the feature vector has dimension 3 million. The first hidden layer might have 1000 hidden units, and all of its weights form a matrix. If you use a standard fully connected network, as in the first and second courses, this matrix has size 1000 × 3 million, because the input is now 3-million-dimensional. So this matrix has 3 billion parameters, which is a very large number. With that many parameters it is difficult to get enough data to keep the neural network from overfitting, and the computational and memory requirements of training a network with 3 billion parameters are hard to meet.
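The arithmetic in the last few paragraphs can be checked with a few lines of Python. This is only a sketch; the function names `input_size` and `dense_weight_count` are made up for illustration and bias terms are ignored.

```python
def input_size(height, width, channels=3):
    """Number of input features for an image with the given shape."""
    return height * width * channels


def dense_weight_count(n_inputs, n_hidden):
    """Number of weights in a fully connected layer (biases not counted)."""
    return n_inputs * n_hidden


small = input_size(64, 64)        # 64 x 64 x 3 inputs
large = input_size(1000, 1000)    # 1000 x 1000 x 3 inputs
params = dense_weight_count(large, 1000)  # first hidden layer with 1000 units

print(small, large, params)
```

The three printed values correspond to the 12288 features, the 3 million features, and the 3 billion weights discussed above.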

But for computer vision applications, you do not want to be limited to small pictures; you want to handle large pictures too. To do that, you need the convolution operation, which is one of the fundamental building blocks of convolutional neural networks. In the next video, I will show you how it works, using edge detection as the example.

### 1.2 Edge detection example

The convolution operation is one of the most basic building blocks of a convolutional neural network. In this video, using edge detection as the motivating example, you will see how convolution works.

In previous videos, I talked about how the early layers of a neural network might detect edges, later layers might detect parts of objects, and still later layers might detect complete objects, in this example faces. In this video, you will see how to detect edges in a picture.

Let's take an example. Given a picture like this, for the computer to figure out what objects are in it, the first thing you might do is detect the vertical edges in the picture. For example, the railings in this picture correspond to vertical lines, and the outlines of the pedestrians are also vertical lines to some extent; these lines are the output of the vertical edge detector. You might also want to detect horizontal edges: the railings here are also clear horizontal lines, and they get detected too, with the result shown here. So how do you detect these edges in an image?

Let's look at an example. This is a 6×6 grayscale image. Because it is grayscale, it is a 6×6×1 matrix rather than 6×6×3, since there are no separate **RGB** channels. To detect vertical edges in this image, you can construct a 3×3 matrix whose rows are all $[1, 0, -1]$. In the terminology of convolutional neural networks this is called a filter. Research papers sometimes call it a kernel instead, but in this video I will use the term filter. You then convolve the 6×6 image with the 3×3 filter; the convolution operation is denoted by "*".

There is one problem with this notation. In mathematics, "*" is the standard symbol for convolution, but in **Python** it is also used for multiplication, or element-wise multiplication. So "*" is an overloaded symbol with more than one meaning. In this video, I will point it out explicitly when "*" means convolution.

The output of this convolution will be a 4×4 matrix, which you can think of as a 4×4 image. Here is how to compute it. To compute the first element, the one in the upper left corner of the 4×4 output, take the 3×3 filter and overlay it on the upper left 3×3 region of the input image, as shown in the figure below. Then take the **element-wise product** of the filter and that region, and add up the nine products to get the upper left element of the output.

Add these 9 numbers together to get -5. Of course, you can add these 9 numbers in any order. I just wrote the first column, then the second and third columns.

Next, in order to figure out what the second element is, you have to move the blue square one step to the right, like this, remove the green marks:

Do the same element-wise multiplication and add up the products to get the second element.

The next step is the same: move one more step to the right and add up the 9 element-wise products to get 0.

Move one more step to get 8. You can verify this yourself.

Next, in order to get the elements of the next row, move the blue block down, and now the blue block is at this position:

Repeat the element-wise multiplication, and then add them up. By doing this you get -10. Move it to the right to get -2, then 2, 3. By analogy, the other elements in the matrix are calculated in this way.

To make this clearer, the -16 is obtained from the 3×3 region in the bottom right corner of the input.

So convolving the 6×6 matrix with the 3×3 matrix gives a 4×4 matrix. The image and the filter are just matrices of different dimensions, but it is natural to read the matrix on the left as an image, the one in the middle as a filter, and the one on the right as another image. This turns out to be a vertical edge detector, as you will see on the next slide.
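The sliding-window procedure described above can be sketched in NumPy as follows. The helper name `conv2d_valid` is made up for this illustration, and the 6×6 pixel values are an assumption: the figure is not reproduced in these notes, so the values below were chosen so that the output matches the numbers quoted in the text (-5, 0, 8, -10, -2, 2, 3, and -16).

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and, at each
    position, sum the element-wise products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f_h, j:j + f_w]
            out[i, j] = np.sum(patch * kernel)
    return out


# Assumed pixel values (see the note above).
image = np.array([
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
])

# Vertical edge filter: every row is [1, 0, -1].
vertical_filter = np.array([[1, 0, -1]] * 3)

result = conv2d_valid(image, vertical_filter)
print(result)
```

The explicit double loop is slow but mirrors the "move the blue box" description step by step; real frameworks use much faster implementations.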

Before moving on, one more note. If you implement this in a programming language, you will typically call a function rather than write "*" for convolution, and different frameworks name that function differently. In the programming exercises you will use a function called **conv_forward**. In **TensorFlow** the function is **tf.nn.conv2d**. In **Keras**, a framework you will see later in the course, convolution is implemented by **Conv2D**. All of these frameworks provide some function that implements the convolution operation.

Why does this detect vertical edges? Let's look at another example. To make it clear, I will use a simplified image. This is a simple 6×6 image whose left half is all 10 and whose right half is all 0. Viewed as a picture, the left half looks white, since 10 is the brighter pixel value, and the right half looks darker. I use gray to represent 0, although it could also be drawn as black. The image has a very clear vertical edge down the middle, where it transitions from white to black.

When you convolve it with the 3×3 filter, you can visualize the filter as shown below: bright pixels on the left (the column of 1s), a column of 0s in the middle, and dark pixels on the right (the column of -1s). The result of the convolution is the matrix on the right. If you like, you can verify it by hand. For example, the 0 in the upper left corner is obtained from this 3×3 block (marked by the green box) by taking the element-wise product and summing:

$$10 \times 1 + 10 \times 1 + 10 \times 1 + 10 \times 0 + 10 \times 0 + 10 \times 0 + 10 \times (-1) + 10 \times (-1) + 10 \times (-1) = 0$$

In contrast, this 30 is obtained from this block (marked by the red box):

$$10 \times 1 + 10 \times 1 + 10 \times 1 + 10 \times 0 + 10 \times 0 + 10 \times 0 + 0 \times (-1) + 0 \times (-1) + 0 \times (-1) = 30$$

If you view the rightmost matrix as an image, it looks like this: a brighter band in the middle, corresponding to the vertical edge down the middle of the 6×6 image. The proportions may look a bit off, because the detected edge seems very thick, but that is only because the image in this example is so small. With a 1000×1000 image instead of a 6×6 one, you would find that this does a good job of detecting the vertical edges. In this example, the bright band in the middle of the output indicates a strong vertical edge down the middle of the input. One intuition to take away is that, since we are using a 3×3 filter, a vertical edge is a 3×3 region with bright pixels on the left and dark pixels on the right, where the middle column does not matter. In the middle of this 6×6 image, bright pixels on the left and dark pixels on the right are what the filter treats as a vertical edge. The convolution operation gives a convenient way to find the vertical edges in an image.
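To check the simplified example numerically, here is a small sketch that builds the half-bright, half-dark image and applies the vertical edge filter whose rows are all [1, 0, -1]. The helper `conv2d_valid` is an illustrative name, not a function from the course.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """No padding, stride 1: sum of element-wise products at each position."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


# Left half bright (10), right half dark (0).
image = np.hstack([np.full((6, 3), 10), np.zeros((6, 3), dtype=int)])
vertical_filter = np.array([[1, 0, -1]] * 3)

edges = conv2d_valid(image, vertical_filter)
print(edges)  # every row is [0, 30, 30, 0]
```

The two middle columns of 30s are the "thick" bright band the text describes: on a 6×6 image the edge response covers half the output width.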

So you already know how convolution works. In the next video, you will see how to use convolution operations as the basic module of convolutional neural networks.

### 1.3 More edge detection

You have seen how convolution can implement a vertical edge detector. In this video, you will learn the difference between positive and negative edges, that is, light-to-dark versus dark-to-light transitions. You will also see other types of edge detectors, and how to have an algorithm learn an edge detector rather than coding one by hand as we have done so far. Let's get started.

This is the example from the previous video: a 6×6 picture that is brighter on the left and darker on the right. Convolving it with the vertical edge detection filter gives a result that lights up in the middle of the picture on the right.

What happens if the picture's colors are flipped, so that it is darker on the left and brighter on the right? Now the 10s are on the right half and the 0s on the left. If you convolve it with the same filter, the middle of the result is -30 instead of 30. Drawn as a picture, the matrix looks like the image below it. Because the transition is now reversed, the 30s become -30s, indicating a dark-to-light rather than a light-to-dark transition.

If you do not care which of the two cases it is, you can take the absolute value of the output matrix. But this particular filter really does distinguish between light-to-dark and dark-to-light edges.
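A short sketch of the flipped example, again assuming the vertical filter with rows [1, 0, -1] and using an illustrative helper name:

```python
import numpy as np


def conv2d_valid(image, kernel):
    """No padding, stride 1: sum of element-wise products at each position."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


vertical_filter = np.array([[1, 0, -1]] * 3)

# Dark (0) on the left, bright (10) on the right: the flipped picture.
flipped = np.hstack([np.zeros((6, 3), dtype=int), np.full((6, 3), 10)])

negative_edges = conv2d_valid(flipped, vertical_filter)
unsigned_edges = np.abs(negative_edges)  # discard the edge direction

print(negative_edges)  # every row is [0, -30, -30, 0]
print(unsigned_edges)  # every row is [0, 30, 30, 0]
```

Taking `np.abs` keeps the edge strength but throws away whether the transition was light-to-dark or dark-to-light.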

Let's look at more examples of edge detection. The 3×3 filter we have seen detects vertical edges. So it should come as no surprise that the filter on the right, whose rows are $[1, 1, 1]$, $[0, 0, 0]$, and $[-1, -1, -1]$, detects horizontal edges. As a reminder, a vertical edge according to this filter is a 3×3 region that is relatively bright on the left and relatively dark on the right. Similarly, a horizontal edge is a 3×3 region that is relatively bright on top and relatively dark at the bottom.

Here is a more complicated example, in which the upper left and lower right blocks have brightness 10. Drawn as a picture, the upper right and lower left blocks, where the brightness is 0, are darker (I will shade these darker areas), while the upper left and lower right are relatively bright. Convolving this image with the horizontal edge filter gives the matrix on the right.

For example, this 30 (the element marked by the green box in the right matrix) corresponds to this 3×3 region on the left (the part of the left matrix marked by the green box). This region is indeed brighter on top and darker at the bottom, so the filter finds a strong positive edge here. And this -30 (marked by the purple box in the right matrix) corresponds to another region on the left (marked by the purple box in the left matrix), which is brighter at the bottom and darker on top, so it is a negative edge.

Again, this is partly an artifact of working with relatively small pictures, only 6×6. The intermediate values, such as this 10 (the element marked by the yellow box in the right matrix), correspond to a region on the left (marked by the yellow box in the 6×6 matrix) whose two left columns contain a positive edge and whose right column contains a negative edge. Blending the positive and negative edges gives an intermediate value. If this were a very large checkerboard image, say 1000×1000, these transition bands of 10s would not be noticeable, because they would be very small relative to the size of the image.
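The checkerboard example can be reproduced with the sketch below. The image layout (10s in the upper left and lower right 3×3 blocks, 0s elsewhere) and the horizontal filter with rows [1, 1, 1], [0, 0, 0], [-1, -1, -1] are inferred from the description, since the figure itself is not in these notes.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """No padding, stride 1: sum of element-wise products at each position."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


# Bright blocks in the upper left and lower right, dark elsewhere.
image = np.zeros((6, 6), dtype=int)
image[:3, :3] = 10
image[3:, 3:] = 10

horizontal_filter = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [-1, -1, -1],
])

result = conv2d_valid(image, horizontal_filter)
print(result)
```

The two middle rows of the output come out as 30, 10, -10, -30: a positive edge on the left, a negative edge on the right, and blended intermediate values in between.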

To summarize, different filters let you find vertical or horizontal edges. The 3×3 vertical edge filter we have used is only one possible choice of numbers.

Historically, there has been a fair amount of debate in the computer vision literature about which set of numbers is best. One alternative is

$$\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}$$

which is called the **Sobel** filter. Its advantage is that it puts more weight on the middle row, which makes the result somewhat more robust.

Computer vision researchers use other sets of numbers as well, such as

$$\begin{bmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{bmatrix}$$

This is called the **Scharr** filter. It has its own, somewhat different properties, and it too is a vertical edge detector. If you rotate it by 90 degrees, you get the corresponding horizontal edge detector.
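As a rough illustration, the sketch below applies the Sobel and Scharr vertical filters to the half-bright, half-dark image from earlier. Transposing the matrix is used here as one convenient way to obtain a horizontal version; depending on the direction of a 90-degree rotation, the sign of the result may be flipped relative to this.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """No padding, stride 1: sum of element-wise products at each position."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


sobel_vertical = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
scharr_vertical = np.array([[3, 0, -3], [10, 0, -10], [3, 0, -3]])

# Horizontal counterpart: rows [1, 2, 1], [0, 0, 0], [-1, -2, -1].
sobel_horizontal = sobel_vertical.T

# Left half bright (10), right half dark (0).
image = np.hstack([np.full((6, 3), 10), np.zeros((6, 3), dtype=int)])

sobel_out = conv2d_valid(image, sobel_vertical)
scharr_out = conv2d_valid(image, scharr_vertical)
horizontal_out = conv2d_valid(image, sobel_horizontal)

print(sobel_out[0], scharr_out[0], horizontal_out[0])
```

Both vertical filters respond in the same two middle columns, just with larger magnitudes than the 1, 0, -1 filter, while the horizontal version gives all zeros because this image has no horizontal edge.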

With the rise of deep learning, one thing we have learned is that when you really want to detect edges in complicated images, you do not necessarily have to use the nine numbers hand-picked by researchers; you may gain a great deal by learning them instead. Treat the 9 numbers in this matrix as 9 parameters, and later you can learn them with the backpropagation algorithm. The goal is to learn these 9 parameters.

The aim is that when you take the 6×6 picture on the left and convolve it with this 3×3 filter, you get a good edge detector. As you will see in later videos, by treating these 9 numbers as parameters, backpropagation can learn the filter of 1s, 0s, and -1s, or the **Sobel** filter, or the **Scharr** filter if that is what the data calls for. More likely, it learns something that captures the statistics of your data even better than any of these hand-coded filters. Rather than only vertical and horizontal edges, it might learn to detect edges at 45°, 70°, 73°, or any other angle. By making all the numbers of the matrix parameters and letting the neural network learn them automatically from data, we find that neural networks can learn low-level features such as edges, often more robustly than computer vision researchers can code them by hand. Underlying all of these computations is still the convolution operation, which lets backpropagation learn whatever 3×3 filter it needs and apply it across the entire picture, here, here, and here (the parts of the left matrix marked by the blue boxes), to output whatever feature it is trying to detect, whether that is a vertical edge, a horizontal edge, an edge at some odd angle, or a filter we do not even have a name for.
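To make the idea of learning the 9 numbers concrete, here is a toy sketch. It is not the training procedure used later in the course: it uses plain gradient descent on a mean squared error, a single synthetic random image, and a target output produced by the hand-coded vertical edge filter. All of these choices (image size, learning rate, number of steps) are assumptions made only for illustration.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """No padding, stride 1: sum of element-wise products at each position."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))            # synthetic training image
target_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # the filter we hope to recover
target = conv2d_valid(image, target_filter)       # desired output

learned = np.zeros((3, 3))  # the 9 parameters, initialised to zero
learning_rate = 0.1

for step in range(500):
    error = conv2d_valid(image, learned) - target
    # Gradient of the mean squared error with respect to each of the
    # 9 filter entries: correlate the input with the error map.
    grad = 2 * conv2d_valid(image, error) / error.size
    learned -= learning_rate * grad

print(np.round(learned, 3))
```

After training, `learned` should be close to the vertical edge filter, which shows that the 9 numbers can be recovered from data rather than chosen by hand. In a real network the same gradient would be computed by backpropagation through many layers and averaged over many images.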

Therefore, the idea of treating these 9 numbers as parameters to be learned has become one of the most powerful ideas in computer vision. Later in the course, next week, we will discuss in detail how to use backpropagation to learn these 9 numbers. But first, we need to cover some other details, namely some variations on the basic convolution operation. In the next two videos, I will discuss how to use **padding** and different strides for convolutions. Both turn out to be important pieces of the convolutional building block in convolutional neural networks, so see you in the next video.

### 1.4 Padding

To build deep neural networks, one modification to the basic convolution operation that you need to know is **padding**. Let's see how it works.

We saw in the previous video that if you convolve a 6×6 image with a 3×3 filter, you end up with a 4×4 output, that is, a 4×4 matrix. That is because there are only 4×4 possible positions for the 3×3 filter to fit inside the 6×6 matrix. The math behind this is that if you have an $n \times n$ image and convolve it with an $f \times f$ filter, the output has dimension $(n-f+1) \times (n-f+1)$. In this example, $6-3+1=4$, so you get a 4×4 output.

There are two downsides to this. The first is that every time you apply a convolution, your image shrinks, here from 6×6 to 4×4. After doing this a few times, the image becomes very small, perhaps shrinking down to 1×1. You do not want your image to shrink every time you detect edges or other features, so that is the first downside.

The second downside is that a pixel at a corner or on the edge of the image is used in far fewer outputs. This corner pixel (shaded in green), for example, is touched by only one 3×3 region and so is used in only one output, whereas a pixel in the middle, such as this one (marked by the red box), is overlapped by many 3×3 regions. So pixels at the corners and edges are used much less in the output, which means you throw away a lot of the information near the edge of the image.

To recap the two problems: the first is the shrinking output. When we build deep neural networks, you will see why you do not want the image to shrink at every step. If you have, say, a 100-layer network and the image shrinks a little at every layer, then after 100 layers you are left with a very small image. The second problem is that much of the information at the edges of the image is thrown away.

To fix both problems, you can pad the image before applying the convolution. In this case, you can pad the image with one additional border of pixels all the way around. The 6×6 image then becomes an 8×8 image, and convolving this 8×8 image with a 3×3 filter gives not a 4×4 but a 6×6 output, so the output has the same size as the original 6×6 image. By convention, you pad with 0s. If $p$ is the padding amount (in this case $p=1$, because we pad with a one-pixel border all around), the output becomes $(n+2p-f+1) \times (n+2p-f+1)$. Here that is $6+2\times1-3+1=6$, so the output is 6×6, the same size as the input. The green corner pixel (left matrix) now influences several cells of the output (right matrix). In this way, the problem of throwing away information, or more precisely of under-using the information at the corners and edges of the image, is reduced.
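A minimal sketch of zero padding with `np.pad`, using an arbitrary 6×6 image (the pixel values are made up; only the shapes matter here):

```python
import numpy as np


def conv2d_valid(image, kernel):
    """No padding, stride 1: sum of element-wise products at each position."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


image = np.arange(36).reshape(6, 6)   # arbitrary 6x6 image
kernel = np.array([[1, 0, -1]] * 3)   # any 3x3 filter works for the shape check

# p = 1: add a one-pixel border of zeros all the way around.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

out_no_pad = conv2d_valid(image, kernel)
out_padded = conv2d_valid(padded, kernel)

print(padded.shape, out_no_pad.shape, out_padded.shape)
```

The shapes follow the formula in the text: 6 - 3 + 1 = 4 without padding, and 6 + 2×1 - 3 + 1 = 6 with a one-pixel border.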

I have just shown padding with a border of one pixel. If you want, you can also pad with a border of two pixels, that is, add one more layer around the image, or with even more pixels. The case I have drawn here would be $p=2$.

As for the choice of how many pixels to fill, there are usually two choices, called **Valid** convolution and **Same** convolution.

**Valid** convolution means no padding. In this case, convolving an $n \times n$ image with an $f \times f$ filter gives an $(n-f+1) \times (n-f+1)$ output. This is like the example in the previous video, where a 6×6 image convolved with a 3×3 filter gave a 4×4 output.

The other commonly used choice is **Same** convolution, which means you pad so that the output size is the same as the input size. When you pad with $p$ pixels, $n$ becomes $n+2p$, so $n-f+1$ becomes $n+2p-f+1$. That is, if you have an $n \times n$ image and pad it with a border of $p$ pixels, the output has size $(n+2p-f+1) \times (n+2p-f+1)$. If you want the output and the input to be the same size, set $n+2p-f+1=n$ and solve for $p$, which gives $p=(f-1)/2$. So when $f$ is odd, choosing the padding this way guarantees that the output is the same size as the input. That is why, in the example on the previous slide, with a 3×3 filter the padding needed to make the output size equal the input size is $(3-1)/2$, which is 1 pixel. As another example, if your filter is 5×5, so $f=5$, substituting into the formula shows that you need a padding of 2 to make the output the same size as the input.
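The Same-padding rule can be written down directly. The function names below are illustrative only, and stride 1 is assumed throughout:

```python
def same_padding(f):
    """Padding p that keeps the output the same size as the input
    for an odd filter size f and stride 1: p = (f - 1) / 2."""
    if f % 2 == 0:
        raise ValueError("symmetric Same padding needs an odd filter size")
    return (f - 1) // 2


def output_size(n, f, p=0):
    """Output side length for an n x n input, f x f filter, padding p, stride 1."""
    return n + 2 * p - f + 1


valid_out = output_size(6, 3)                     # Valid convolution, p = 0
same_out_3 = output_size(6, 3, same_padding(3))   # Same convolution, 3x3 filter
same_out_5 = output_size(6, 5, same_padding(5))   # Same convolution, 5x5 filter

print(same_padding(3), same_padding(5), valid_out, same_out_3, same_out_5)
```

Raising an error for even `f` reflects the point made next in the text: with an even filter size, no symmetric padding gives exactly the input size.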

By convention in computer vision, $f$ is usually odd, in fact almost always. You rarely see even-sized filters used in computer vision, and I think there are two reasons.

One is that if $f$ is even, you have to use some asymmetric padding. Only when $f$ is odd does **Same** convolution give a natural padding, where you pad the same amount all around rather than padding a little more on the left and a little less on the right.

The second reason is that an odd-sized filter, such as 3×3 or 5×5, has a center point. In computer vision it is sometimes convenient to have a central pixel that you can use to refer to the position of the filter.

Maybe these are not fully convincing reasons for $f$ almost always being odd, but if you read the convolution literature, you will see 3×3 filters very often, and some 5×5 and 7×7 filters too. Later we will also talk about 1×1 filters and when they make sense. By convention, I recommend using odd-sized filters. You might well get good performance with an even value of $f$, but to follow the computer vision convention, I usually use an odd $f$.

You have now seen how to use padded convolutions. To specify the **padding** for a convolution operation, you can give the value of $p$ directly, or you can say that it is a **Valid** convolution, meaning $p=0$, or a **Same** convolution, meaning you pad as much as needed to make the output the same size as the input. That is it for **padding**. In the next video, we discuss how to set the stride of a convolution.

### 1.5 Strided convolutions

Strided convolution is another basic building block of convolutional neural networks. Let me show you an example.

Say you want to convolve this 7×7 image with this 3×3 filter, except that now we use a stride of 2. As before, you take the element-wise product in the upper left 3×3 region and add it up, which gives 91.

Previously we moved the blue box over by one step; now, with a stride of 2, we move it over by two steps. Notice how the upper left corner of the box jumps two cells over, skipping one position. Then you again take the element-wise product and sum, which gives 100.

Continuing, we move the blue box over by another two steps, which gives 83. When you go to the next row, you again use a stride of 2 rather than 1, so the blue box moves down to here:

Notice that we skipped over one position here too, and this gives 69. Moving over by two steps each time then gives 91 and 127, and the last row is 44, 72, 74.

So in this example, convolving a 7×7 matrix with a 3×3 matrix gives a 3×3 output. The input and output dimensions are related by the following formula. If you convolve an $n \times n$ image with an $f \times f$ filter, using **padding** $p$ and stride $s$ (in this example $s=2$), then because you now move $s$ steps at a time rather than one step at a time, the output is

$$\left(\frac{n+2p-f}{s}+1\right) \times \left(\frac{n+2p-f}{s}+1\right)$$

In our example, $n=7$, $p=0$, $f=3$, $s=2$, so $\frac{7+0-3}{2}+1=3$, which gives the 3×3 output.

One last detail: what if this fraction is not an integer? In that case, we round down. The notation $\lfloor z \rfloor$ denotes the **floor** of $z$, which means rounding $z$ down to the nearest integer. The way this is implemented is that you compute with the blue box only when it is fully contained within the image, or the image plus padding. If any part of the blue box hangs outside, you do not do that computation. The convention is that your 3×3 filter must lie entirely within the image, or the padded image, before it produces a corresponding output. So when $\frac{n+2p-f}{s}$ is not an integer, the right way to compute the output dimension is to round down.

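To summarize the dimensions: if you have an $n \times n$ matrix or image and convolve it with an $f \times f$ matrix or filter, with **padding** $p$ and stride $s$, then the output size is

$$\left\lfloor \frac{n+2p-f}{s}+1 \right\rfloor \times \left\lfloor \frac{n+2p-f}{s}+1 \right\rfloor$$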

It is nice to choose all of these numbers so that the result is an integer, although sometimes you do not have to, since rounding down works fine. You can also work through some values of $n$, $f$, $p$, and $s$ yourself to convince yourself that this formula for the output size is correct.
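One way to experiment with values of n, f, p, and s is the sketch below, which implements the floor formula and a simple strided convolution. The function names are illustrative, and the test image is all ones, so the output values are not the ones from the lecture figure; only the shapes are being checked against the formula.

```python
import numpy as np


def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s + 1) for an n x n input and f x f filter."""
    return (n + 2 * p - f) // s + 1


def conv2d(image, kernel, padding=0, stride=1):
    """Zero-pad a square image, then slide a square kernel with the given
    stride. The kernel is only applied where it fits entirely inside the
    padded image."""
    padded = np.pad(image, padding, mode="constant")
    f = kernel.shape[0]
    size = (padded.shape[0] - f) // stride + 1
    out = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(padded[r:r + f, c:c + f] * kernel)
    return out


image = np.ones((7, 7))
kernel = np.ones((3, 3))
strided = conv2d(image, kernel, padding=0, stride=2)

print(conv_output_size(7, 3, p=0, s=2), strided.shape)
# A case where the fraction is not an integer: (6 - 3) / 2 + 1 = 2.5, rounded down.
print(conv_output_size(6, 3, p=0, s=2))
```

Integer division `//` performs the rounding down, which matches the rule that the filter must lie fully inside the (padded) image.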

Before moving on, here is a technical comment on cross-correlation versus convolution. It does not affect what you have to do to implement convolutional neural networks, but depending on whether you read a math textbook or a signal processing textbook, the notation may be inconsistent. In a typical math textbook, convolution is defined with one extra step before the element-wise product and sum: before convolving the 6×6 matrix with the 3×3 filter, you first flip the 3×3 filter along both the horizontal and the vertical axis, which amounts to mirroring the filter in both directions (Editor's note: this should be obtained by first rotating 90° clockwise and then flipping horizontally). You then copy the flipped matrix over here (onto the image matrix on the left) and multiply element-wise with the flipped matrix to compute the upper left element of the 4×4 output, as shown in the figure. Then you take these 9 numbers, shift them over by one position, then by one more, and so on.

So the way we have defined the convolution operation in these videos skips this mirroring step. Technically, what we have been doing in the previous videos is called **cross-correlation** rather than **convolution**. But in the deep learning literature, by convention, we simply call this (the operation without flipping) the convolution operation.

In summary, by the convention of machine learning, we usually do not bother with the flipping operation. Technically, this operation might be better called cross-correlation, but most of the deep learning literature calls it convolution, so we use that convention in these videos. If you read a lot of the machine learning literature, you will find that most people simply call it convolution and do not use the flips.

It turns out that in signal processing and in certain branches of mathematics, including the flip in the definition gives the convolution operator the property that $(A*B)*C=A*(B*C)$, which in mathematics is called associativity. That is useful for some signal processing applications, but for deep neural networks it does not really matter, so omitting the double mirroring simplifies the code and the neural networks work just as well.
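The difference between the two definitions can be seen in a small sketch. Here `cross_correlate` is the operation these notes call convolution, and `true_convolution` is an illustrative name for the textbook version that flips the filter along both axes first.

```python
import numpy as np


def cross_correlate(image, kernel):
    """Sum of element-wise products at each position, with no flipping.
    This is what deep learning usually calls convolution."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


def true_convolution(image, kernel):
    """Textbook convolution: mirror the kernel along both axes, then slide it."""
    return cross_correlate(image, np.flip(kernel))


image = np.hstack([np.full((6, 3), 10), np.zeros((6, 3), dtype=int)])
vertical_filter = np.array([[1, 0, -1]] * 3)

xcorr = cross_correlate(image, vertical_filter)
conv = true_convolution(image, vertical_filter)

print(xcorr[0])  # [ 0. 30. 30.  0.]
print(conv[0])   # [  0. -30. -30.   0.]

# For a filter that is unchanged by flipping, the two operations agree.
box_filter = np.ones((3, 3))
same_result = np.array_equal(cross_correlate(image, box_filter),
                             true_convolution(image, box_filter))
```

For the vertical edge filter, flipping both axes just negates it, so the two definitions differ only in sign here. Since a network learns its filter values anyway, it can equally well learn the flipped version, which is why the flip is omitted in practice.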

By convention, most of us just call this convolution, even though mathematicians sometimes prefer to call it cross-correlation. This does not affect anything you need to implement in the programming exercises, nor your ability to read and understand the deep learning literature.

You have now seen how to carry out convolutions, how to use padding, and how to choose the stride. But so far we have only convolved over matrices, such as a 6×6 matrix. In the next video, you will see how to convolve over volumes, which makes convolutions much more powerful. Let's go on to the next video.

### 1.6 Convolutions over volumes

You already know how to convolve a two-dimensional image. Now let's see how to perform convolution not only on a two-dimensional image, but on a three-dimensional volume.

Let's start with an example. Say you want to detect features not only in grayscale images but also in **RGB** color images. If the color image is 6×6×3, the 3 refers to the three color channels, and you can think of the image as a stack of three 6×6 images. To detect edges or other features in this image, you convolve it not with the earlier 3×3 filter but with a three-dimensional filter of dimension 3×3×3. This filter also has 3 layers, corresponding to the red, green and blue channels.

Let's name these dimensions. For the original image, the first 6 is the height, the second 6 is the width, and the 3 is the number of channels. Similarly, the filter has a height, a width and a number of channels, and the number of channels of the image must match the number of channels of the filter, so these two numbers (marked by the purple squares) must be equal. The next slide shows how this convolution is carried out. The output will be a 4×4 image. Note that it is 4×4×1; the last number is not 3.

Let's go through the details using a nicer drawing. This is a 6×6×3 image, and this is a 3×3×3 filter. The last number, the number of channels, must match the number of channels in the filter. To simplify the drawing of the 3×3×3 filter, we draw it not as a stack of 3 matrices but as a three-dimensional cube.

To compute the output of this convolution, first place the 3×3×3 filter in the upper-left corner. The filter has 27 numbers, or 27 parameters (3 cubed). Take each of these 27 numbers and multiply it by the corresponding number in the red, green and blue channels of the image: first the 9 numbers of the red channel, then the green channel, then the blue channel, each multiplied by the corresponding one of the 27 numbers covered by the yellow cube on the left. Then add all the products together to get the first output number.

If you want to calculate the next output, you slide the cube by one unit, multiply these 27 numbers, and add them all together to get the next output, and so on.
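The sliding-cube procedure described above can be sketched in NumPy as follows. This is a minimal illustration with random values; the function name `conv_volume` and the channels-last layout (height, width, channels) are assumptions, not something fixed by the lecture.

```python
import numpy as np

def conv_volume(image, filt):
    """Valid convolution (no flip) of an (n_h, n_w, n_c) image with an (f, f, n_c) filter."""
    n_h, n_w, n_c = image.shape
    f_h, f_w, f_c = filt.shape
    assert n_c == f_c, "image and filter must have the same number of channels"
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the 27 filter numbers with the 27 covered image numbers, then sum
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w, :] * filt)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))   # 6x6 image with 3 channels
filt = rng.standard_normal((3, 3, 3))    # 3x3x3 filter: 27 parameters
output = conv_volume(image, filt)        # two-dimensional 4x4 output
```

Note that the channel axis is summed away, which is why one filter gives a single 4×4 output rather than 4×4×3.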

So, what can this do? Take this 3×3×3 filter as an example. If you want to detect edges in the red channel of the image, you can set the first (red) layer of the filter to the vertical edge detector used before, and set the green layer to all 0 and the blue layer to all 0. Stacking these three together gives a 3×3×3 filter that detects vertical edges, but only in the red channel.

Or, if you don't care which color channel the vertical edge is in, you can use the same vertical edge detector in all three channels. With this second choice of filter parameters you have a 3×3×3 edge detector that detects edges in any color channel. With different parameter choices you get different feature detectors, all of them 3×3×3 filters.
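As a small illustration of these two parameter choices, the sketch below builds both filters and applies them to a synthetic image whose only edge is in the green channel. It assumes the vertical edge detector from the earlier lessons has columns 1, 0, -1; the test image and the helper `response_at_edge` are hypothetical.

```python
import numpy as np

vertical_edge = np.array([[1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0]])   # assumed 3x3 vertical edge detector
zeros = np.zeros((3, 3))

# Channels are stacked along the last axis in the order red, green, blue
red_only = np.stack([vertical_edge, zeros, zeros], axis=-1)   # reacts to red edges only
any_channel = np.stack([vertical_edge] * 3, axis=-1)          # reacts to edges in any channel

# Synthetic 6x6x3 image: bright left half in the green channel, everything else 0
image = np.zeros((6, 6, 3))
image[:, :3, 1] = 10.0

def response_at_edge(img, filt):
    """Filter response for the window covering columns 2..4, which straddles the edge."""
    return float(np.sum(img[0:3, 2:5, :] * filt))

r_red = response_at_edge(image, red_only)      # 0.0: the edge is not in the red channel
r_any = response_at_edge(image, any_channel)   # 30.0: the green edge is detected
```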

According to computer vision conventions, when your input has a specific height and width and number of channels, your filter can have different heights and different widths, but the number of channels must be the same. In theory, it is feasible for our filter to focus only on the red channel, or only focus on the green or blue channels.

Look at this convolution of volumes once more: a 6×6×3 input image convolved with a 3×3×3 filter gives a 4×4 two-dimensional output.

Now that you understand how to convolve over a volume, there is one last concept that is essential for building convolutional neural networks. What if we don't just want to detect vertical edges? What if we want to detect vertical and horizontal edges at the same time, as well as edges slanted at 45° or 70°? In other words, what if you want to use multiple filters at the same time?

This is the picture from the last slide: we convolve this 6×6×3 image with this 3×3×3 filter to get a 4×4 output. This first filter may be a vertical edge detector, or it may learn to detect some other feature. The second filter, drawn in orange, could be a horizontal edge detector.

Convolving with the first filter gives the first 4×4 output, and convolving with the second filter gives a different 4×4 output. We then take the two 4×4 outputs, put the first one in front and the second one behind it, and stack them to get a 4×4×2 output volume. You can redraw this volume as a box, so this is a 4×4×2 output volume. We took a 6×6×3 image, convolved it with two different 3×3×3 filters to get two 4×4 outputs, and stacked them into a 4×4×2 volume, where the 2 comes from using two different filters.

Let's summarize the dimensions. If you have an $n\times n\times n_c$ input image, where $n_c$ is the number of channels (6×6×3 in this example), and convolve it with an $f\times f\times n_c$ filter (3×3×3 in this example), then by convention the two values of $n_c$ must be equal. The output is $(n-f+1)\times(n-f+1)\times n_c'$, where $n_c'$ is the number of channels of the next layer, which is the number of filters you use; in our example the output is 4×4×2. This assumes a stride of 1 and no **padding** . If you use a different stride or **padding** , the value $n-f+1$ changes, as demonstrated in the previous videos.
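The dimension summary can be checked with a short NumPy sketch that applies several filters and stacks the outputs. The function name `conv_multi`, the random values, and the filter layout (f, f, channels, number of filters) are assumptions made for illustration; stride is 1 and there is no padding.

```python
import numpy as np

def conv_multi(image, filters):
    """image: (n_h, n_w, n_c); filters: (f, f, n_c, n_filters).

    Returns an (n_h - f + 1, n_w - f + 1, n_filters) volume.
    """
    n_h, n_w, n_c = image.shape
    f, _, f_c, n_filters = filters.shape
    assert n_c == f_c, "channel counts of image and filters must match"
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))
    for k in range(n_filters):              # one output channel per filter
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[:, :, :, k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))
filters = rng.standard_normal((3, 3, 3, 2))   # two different 3x3x3 filters
volume = conv_multi(image, filters)           # shape (4, 4, 2)
```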

This idea of convolving over volumes is very useful. You can now operate directly on three-channel **RGB** images. More importantly, you can detect two features, such as vertical and horizontal edges, or 10, or 128, or several hundred different features, and the number of output channels equals the number of features you are detecting.

For the notation, I use the number of channels ($n_c$) for the last dimension; in the literature it is also called the depth of the 3-dimensional volume. Both terms, channel and depth, are common in the literature. But I find depth confusing, because you usually also talk about the depth of the neural network. Therefore, in these videos I use the term channel for the size of the third dimension of the filter.

So you already know how to convolve a cube, and you are ready to implement one of the convolutional neural layers. Let us see how to do it in the next video.

### 1.7 One layer of a convolutional network

Today we are going to talk about how to build the convolutional layer of a convolutional neural network. Let's look at an example.

In the last lesson, we saw how to convolve a three-dimensional volume with two filters to output two different 4×4 matrices. Convolving with the first filter gives the first 4×4 matrix, and convolving with the second filter gives another 4×4 matrix.

To turn these into a convolutional layer, we add a bias to each of them. For the first output, the bias $b_1$ is a real number, and **Python** broadcasting adds the same value to all 16 elements. We then apply a non-linear function, here the activation function **ReLU** , and the result is a 4×4 matrix.

For the second 4×4 matrix, we add a different bias $b_2$, also a real number added to all 16 elements, and then apply the non-linear activation function **ReLU** to obtain another 4×4 matrix. As before, we stack the two matrices to get a 4×4×2 output. This computation, which takes a 6×6×3 input to a 4×4×2 output, is one layer of a convolutional neural network. Let's map it back to the forward propagation steps of a standard, non-convolutional neural network.

Recall that one step of forward propagation is $z^{[1]} = W^{[1]}a^{[0]} + b^{[1]}$, where $a^{[0]} = x$, followed by the non-linearity $a^{[1]} = g(z^{[1]})$. Here the input is $a^{[0]}$, that is $x$, and the filters play the role of $W^{[1]}$. In the convolution we take these 27 numbers, or really 27×2 since we use two filters, and multiply them with the input. This computes a linear function and yields a 4×4 matrix. The output of the convolution therefore plays the role of $W^{[1]}a^{[0]}$, and then the bias is added.

This part (marked by the blue border in the figure) is the value before the activation function **ReLU** is applied, so it plays the role of $z^{[1]}$. After applying the non-linear function, the resulting 4×4×2 matrix becomes the activation of the next layer, $a^{[1]}$.

This is how we go from $a^{[0]}$ to $a^{[1]}$: first compute the linear function by convolving, that is, multiplying element-wise and summing, then add the bias, and then apply the activation function **ReLU** . In this way, one layer of the network turns a 6×6×3 input $a^{[0]}$ into a 4×4×2 output $a^{[1]}$. This is one layer of a convolutional neural network.
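Putting the three steps together (convolve, add the bias, apply **ReLU** ), a minimal NumPy sketch of one convolutional layer might look like this. The names `conv_layer_forward`, `a0`, `W1` and `b1`, the random values, and the weight layout (f, f, input channels, filters) are illustrative assumptions; stride is 1 and there is no padding.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv_layer_forward(a_prev, W, b):
    """One convolutional layer with stride 1 and no padding.

    a_prev: (n_h, n_w, n_c_prev) input activations
    W:      (f, f, n_c_prev, n_c) stacked filters
    b:      (n_c,) one bias per filter
    Returns the activations a and the pre-activation values z.
    """
    n_h, n_w, _ = a_prev.shape
    f, _, _, n_c = W.shape
    z = np.zeros((n_h - f + 1, n_w - f + 1, n_c))
    for k in range(n_c):
        for i in range(z.shape[0]):
            for j in range(z.shape[1]):
                window = a_prev[i:i + f, j:j + f, :]
                z[i, j, k] = np.sum(window * W[:, :, :, k])   # linear part
    z = z + b          # broadcasting adds b[k] to all 16 entries of channel k
    return relu(z), z

rng = np.random.default_rng(1)
a0 = rng.standard_normal((6, 6, 3))       # input x
W1 = rng.standard_normal((3, 3, 3, 2))    # two 3x3x3 filters
b1 = np.array([0.5, -0.5])                # one real-valued bias per filter
a1, z1 = conv_layer_forward(a0, W1, b1)   # a1 has shape (4, 4, 2)
```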

In this example we have two filters, that is, two features, so we end up with a 4×4×2 output. If we used 10 filters instead of 2, we would take 10 feature maps instead of 2, stack them together, and end up with a 4×4×10 output, which is $a^{[1]}$.

To deepen your understanding, let's do an exercise. Suppose one layer of the network has 10 filters instead of 2, each of size 3×3×3. How many parameters does this layer have? Each filter is a 3×3×3 volume, so it has 27 parameters, that is, 27 numbers. Add one bias, the parameter $b$, and each filter has 28 parameters. The previous slide showed 2 filters; now we have 10, for a total of 28×10 = 280 parameters.

Note that no matter how big the input image is, 1000×1000 or 5000×5000, the number of parameters stays at 280. You can use these 10 filters to detect vertical edges, horizontal edges and other features, and even for very large images the number of parameters remains small. This property makes convolutional neural networks less prone to overfitting. Once you have learned 10 feature detectors, they can be applied to large images with the number of parameters fixed, in this example only 280, which is relatively small.
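The parameter count from this exercise can be written as a one-line function. This is a small sketch; the function name and argument names are hypothetical.

```python
def conv_layer_param_count(f, n_c_prev, n_filters):
    """Each filter has f * f * n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_filters

# 10 filters of size 3x3x3: (27 + 1) * 10 parameters
params = conv_layer_param_count(f=3, n_c_prev=3, n_filters=10)
```

The image height and width do not appear among the arguments at all, which is exactly why the count stays the same for a 1000×1000 or a 5000×5000 input.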

Finally, let's summarize the notation used to describe one layer of a convolutional neural network, taking layer $l$ as the example, where layer $l$ is a convolutional layer.

We use $f^{[l]}$ to denote the filter size: the filter is $f\times f$, and the superscript $[l]$ indicates that this is the filter size in layer $l$. As usual, the superscript $[l]$ marks the layer. We use $p^{[l]}$ to denote the amount of **padding** , which can also be specified as a **valid** convolution, meaning no **padding** , or a **same** convolution, meaning the **padding** is chosen so that the output has the same height and width as the input. We use $s^{[l]}$ to denote the stride.

The input to this layer has some dimension $n\times n\times n_c$, where $n_c$ is the number of channels of the previous layer.

We modify this slightly by adding the superscript $[l-1]$, giving $n^{[l-1]}\times n^{[l-1]}\times n_c^{[l-1]}$, because the input is the activation of the previous layer.

In this example the images have equal height and width, but they may differ, so we use the subscripts $H$ and $W$ to mark them: $n_H^{[l-1]}\times n_W^{[l-1]}\times n_c^{[l-1]}$. In layer $l$, the input is the output of the previous layer, hence the superscript $[l-1]$. The layer itself outputs a volume of size $n_H^{[l]}\times n_W^{[l]}\times n_c^{[l]}$.

As mentioned earlier, the formula $\lfloor \frac{n+2p-f}{s}+1 \rfloor$ gives the size of the output image, at least its height and width (use the result directly if it is an integer, otherwise round down). In the new notation, the output height of layer $l$ is $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$. The width is computed the same way with $W$ in place of $H$: $n_W^{[l]} = \lfloor \frac{n_W^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$. This is how $n_H^{[l]}$ and $n_W^{[l]}$ are derived from $n_H^{[l-1]}$ and $n_W^{[l-1]}$.
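As a quick check of the output-size formula, here is a small helper using integer floor division. The function name `conv_output_size` is hypothetical; the same function applies to height and width separately.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height or width: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

valid_6x6 = conv_output_size(6, 3)              # 6x6 input, 3x3 filter, valid -> 4
same_6x6 = conv_output_size(6, 3, p=1)          # padding of 1 keeps the size -> 6
strided = conv_output_size(7, 3, p=0, s=2)      # (7 - 3) / 2 + 1 -> 3
rounded = conv_output_size(6, 3, p=0, s=2)      # 1.5 + 1 rounds down -> 2
```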

So what about the number of channels? Where does it come from? The output also has depth, and from the previous examples we know it equals the number of filters in the layer. With 2 filters the output is 4×4×2, which has 2 channels; with 10 filters the output is 4×4×10. The number of output channels $n_c^{[l]}$ is the number of filters used in this layer. What about the size of each filter? We know that convolving a 6×6×3 image requires a 3×3×3 filter, so the number of channels in the filter must match the number of channels in the input. Therefore each filter has dimension $f^{[l]}\times f^{[l]}\times n_c^{[l-1]}$.

After applying the bias and the non-linear function, the output of this layer is its activation $a^{[l]}$, a three-dimensional volume of dimension $n_H^{[l]}\times n_W^{[l]}\times n_c^{[l]}$. When you run batch or mini-batch gradient descent with $m$ examples, you have $m$ sets of activations, and the output is $A^{[l]}$ with dimension $m\times n_H^{[l]}\times n_W^{[l]}\times n_c^{[l]}$. In this ordering, the index over training examples comes first, followed by the other three variables.

How about the weight parameter $W$? We already know that one filter has dimension $f^{[l]}\times f^{[l]}\times n_c^{[l-1]}$. The weights $W^{[l]}$ are the collection of all filters, so their dimension is $f^{[l]}\times f^{[l]}\times n_c^{[l-1]}\times n_c^{[l]}$, where $n_c^{[l]}$ is the number of filters in layer $l$.

Finally, the bias parameter. Each filter has one bias, a real number, so the bias $b^{[l]}$ is a vector of dimension $n_c^{[l]}$. We will see later in the course that, for convenience, the code represents it as a $1\times 1\times 1\times n_c^{[l]}$ four-dimensional tensor.
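The shapes in this notation summary can be collected in one place. The sketch below uses a hypothetical helper `layer_shapes` and assumes the channels-last ordering used in these notes; it also shows why a 1×1×1×$n_c^{[l]}$ bias is convenient, since it broadcasts over the whole batch.

```python
import numpy as np

def layer_shapes(m, n_h_prev, n_w_prev, n_c_prev, f, p, s, n_c):
    """Shapes of the filter, weights, bias and activations of one conv layer."""
    n_h = (n_h_prev + 2 * p - f) // s + 1
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "filter": (f, f, n_c_prev),      # one filter
        "W": (f, f, n_c_prev, n_c),      # all n_c filters
        "b": (1, 1, 1, n_c),             # one bias per filter
        "a": (n_h, n_w, n_c),            # activations for one example
        "A": (m, n_h, n_w, n_c),         # activations for m examples
    }

shapes = layer_shapes(m=8, n_h_prev=6, n_w_prev=6, n_c_prev=3, f=3, p=0, s=1, n_c=2)

# The four-dimensional bias broadcasts over examples, height and width
Z = np.zeros(shapes["A"])
b = np.arange(2.0).reshape(shapes["b"])   # biases 0.0 and 1.0
out = Z + b
```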

There are many notations for convolution; this is the one I use most. When you search online or read open source code, you will find no fully agreed standard for the order of height, width and channels. In source code on **GitHub** and other open source implementations, some authors put the channel first, and sometimes all variables follow that convention. In fact, some frameworks have a variable or parameter that specifies whether the number of channels comes first or last when indexing these volumes. Either convention works as long as it is used consistently. Unfortunately, this is one part of the notation on which the deep learning literature has not converged, but in this class I will use the order height, width, number of channels.

I know that with so much new notation at once, you may wonder how to remember it all. Don't worry, you don't need to memorize it; you will get familiar with it through this week's exercises. The point of this lesson is how one convolutional layer works: how to take the activations of one layer and compute the activations of the next. Once you know how one convolutional layer works, you can stack several to form a deep convolutional neural network, which we cover in the next lesson.

### 1.8 A simple convolution network example

In the last lesson, we talked about how to build one convolutional layer of a convolutional network. Today we look at a concrete example of a deep convolutional neural network and, along the way, practice the notation from the last lesson.

Suppose you have a picture and want to do image classification or image recognition. Call the input picture $x$; the task is to decide whether there is a cat in it, answering 0 or 1, and we build a convolutional neural network for this task. For this example I use a relatively small picture of size 39×39×3, which makes some of the numbers work out nicely. So $n_H^{[0]} = n_W^{[0]} = 39$, that is, the height and width both equal 39, and $n_c^{[0]} = 3$, that is, the number of channels in layer 0 is 3.

Suppose the first layer uses a 3×3 filter to extract features, so $f^{[1]} = 3$. Let $s^{[1]} = 1$ and $p^{[1]} = 0$, so the height and width use a **valid** convolution. With 10 filters, the activation of the next layer is 37×37×10. The 10 comes from using 10 filters, and the 37 comes from the formula $\frac{n+2p-f}{s}+1$, here $\frac{39+0-3}{1}+1 = 37$, so the output of this **valid** convolution is 37×37. In our notation, $n_H^{[1]} = n_W^{[1]} = 37$ and $n_c^{[1]} = 10$, where $n_c^{[1]}$ equals the number of filters in the first layer. This 37×37×10 is the dimension of the first layer's activation $a^{[1]}$.

Suppose there is another convolutional layer, this time with a 5×5 filter. In our notation, $f^{[2]} = 5$, the stride is 2, so $s^{[2]} = 2$, the **padding** is 0, so $p^{[2]} = 0$, and there are 20 filters. The output is a new volume of size 17×17×20. Because the stride is 2, the size shrinks quickly, from 37×37 to 17×17, a reduction of more than half. There are 20 filters, so there are 20 channels, and 17×17×20 is the dimension of the activation $a^{[2]}$. Therefore $n_H^{[2]} = n_W^{[2]} = 17$ and $n_c^{[2]} = 20$.

Let's build the last convolutional layer. Suppose the filter is again 5×5, so $f^{[3]} = 5$, the stride is 2, so $s^{[3]} = 2$, the **padding** is 0, and 40 filters are used. Skipping the calculation, the final output is 7×7×40.

At this point, the 39×39×3 input image has been processed into a 7×7×40 volume, that is, 7×7×40 = 1960 features. We can then flatten, or unroll, this volume into 1960 units. The flattened vector is fed into a **logistic** regression unit or a **softmax** regression unit, depending on whether we want to recognize whether the picture contains a cat or which of several objects it contains, and this gives the final prediction $\hat{y}$ of the network. To be clear, the last step takes all 1960 numbers, unrolls them into one long vector, and feeds that vector into a **logistic** or **softmax** regression unit to predict the final output.
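The shapes in this example can be traced layer by layer with a few lines of Python. The list-of-dicts layout and the helper name are assumptions made for this sketch; the hyperparameters are the ones given in the text.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

# Hyperparameters of the three convolutional layers in the example
layers = [
    {"f": 3, "s": 1, "p": 0, "n_filters": 10},
    {"f": 5, "s": 2, "p": 0, "n_filters": 20},
    {"f": 5, "s": 2, "p": 0, "n_filters": 40},
]

shape = (39, 39, 3)          # input image: height, width, channels
shapes = [shape]
for layer in layers:
    n_h = conv_output_size(shape[0], layer["f"], layer["p"], layer["s"])
    n_w = conv_output_size(shape[1], layer["f"], layer["p"], layer["s"])
    shape = (n_h, n_w, layer["n_filters"])
    shapes.append(shape)

# Number of units after flattening the last volume
n_features = shape[0] * shape[1] * shape[2]
```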

This is a typical example of a convolutional neural network. When designing a convolutional neural network, it takes time to determine these hyperparameters. It is necessary to determine the size, stride, **padding,** and how many filters to use. This week and next week, I will provide some suggestions and guidance on the selection of parameters.

What you should take away from this lesson is a common trend: as you go deeper in the network, the image typically starts out larger, here 39×39, the height and width stay the same for a while and then gradually decrease with depth, from 39 to 37, then 17, and finally 7, while the number of channels increases, from 3 to 10, then 20, and finally 40. You will see this trend in many other convolutional neural networks. I will explain how to choose these parameters in more detail in a later lesson. This is the first example of a convolutional neural network we have covered.

A typical convolutional neural network usually has three types of layers. The first is the convolutional layer, which we often label **Conv** ; in the previous example I used **CONV** . There are two other common types of layers, which we cover in the next two lessons. One is the pooling layer, which we call **POOL** . The other is the fully connected layer, denoted **FC** . Although it is possible to build a good neural network using only convolutional layers, most neural network architects also add pooling layers and fully connected layers. Fortunately, pooling layers and fully connected layers are easier to design than convolutional layers. In the next two lessons, we will quickly go through both so that you understand the most commonly used layers in neural networks and can use them to build more powerful networks.

Congratulations again that you have mastered the first convolutional neural network. In the next few lessons this week, we will learn how to train these convolutional neural networks. But before that, I would like to briefly introduce the pooling layer and the fully connected layer. Then I will train these networks, and then I will use the familiar back propagation training method. So in the next lesson, we will first understand how to build a neural network pooling layer.

### 1.9 Pooling layers

In addition to the convolutional layer, the convolutional network often uses a pooling layer to reduce the size of the model, increase the calculation speed, and improve the robustness of the extracted features. Let's take a look.

Let's start with an example of a pooling layer and then discuss why it is needed. Suppose the input is a 4×4 matrix and the pooling type is **max pooling** . The output of max pooling here is a 2×2 matrix. The procedure is simple: split the 4×4 input into regions, which I mark with different colors. Each element of the 2×2 output is the maximum value in the region of the corresponding color.

The maximum of the upper-left region is 9, the maximum of the upper-right region is 2, the maximum of the lower-left region is 6, and the maximum of the lower-right region is 3. To compute the 4 elements on the right, we take the maximum over 2×2 regions of the input matrix. It is like applying a filter of size 2, $f = 2$, with a stride of 2, $s = 2$, since we take 2×2 regions and move by 2. These are the hyperparameters of max pooling.

Because the filter is 2×2, the first output is 9. Move 2 steps to the right to get the maximum 2. For the second row, move down 2 steps to get the maximum 6. Finally, move 2 steps to the right to get the maximum 3. The filter is 2×2, so $f = 2$, and the stride is 2, so $s = 2$.
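The procedure above can be sketched in a few lines of NumPy. The input matrix below is an assumption: the figure is not reproduced in these notes, so the values are illustrative and chosen to give the maxima quoted in the text (9, 2, 6 and 3). The function name `max_pool2d` is also hypothetical.

```python
import numpy as np

def max_pool2d(x, f, s):
    """Max pooling of a 2D array with filter size f and stride s (no padding)."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

# Illustrative 4x4 input (assumed values, consistent with the maxima in the text)
x = np.array([[1.0, 3.0, 2.0, 1.0],
              [2.0, 9.0, 1.0, 1.0],
              [1.0, 4.0, 2.0, 3.0],
              [5.0, 6.0, 1.0, 2.0]])

pooled = max_pool2d(x, f=2, s=2)   # [[9, 2], [6, 3]]
```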

Here is the intuition behind max pooling. You can think of this 4×4 input as a set of features, that is, the activations of some layer of the neural network. A large number means that some particular feature may have been detected. The feature in the upper-left quadrant might be a vertical edge or an eye; clearly it is present in the upper-left quadrant, and it might come from a cat-eye detector. In the upper-right quadrant, however, this feature is absent. The effect of the max operation is that if a feature is detected anywhere within the filter region, its high value is preserved in the max pooling output. If the feature is not detected, as may be the case in the upper-right quadrant, the maximum is still small. This is the intuition behind max pooling.

It must be admitted that the main reason people use max pooling is that it works well in many experiments. Although the intuition just described is often quoted, I don't know of anyone who fully knows whether it is the real underlying reason max pooling works so well.

An interesting property of max pooling is that it has a set of hyperparameters but no parameters to learn. There is nothing for gradient descent to learn: once $f$ and $s$ are fixed, it is a fixed computation, and gradient descent changes nothing.

Let's look at an example with different hyperparameters, where the input is a 5×5 matrix. We use max pooling with a 3×3 filter, so $f = 3$, and a stride of 1, so $s = 1$; the output is 3×3. The formula for the output size of a convolutional layer from earlier, $\frac{n+2p-f}{s}+1$, also applies to max pooling.

In this example we compute each element of the 3×3 output. Start with the upper-left element: it covers a 3×3 region because the filter size is 3, and the maximum there is 9. Then move by one element, because the stride is 1; the maximum of the blue region is 9. Move right again; the maximum of the blue region is 5. Then go to the next row: because the stride is 1, we move down only one cell, and the maximum of this region is 9. The next region is also 9, and the one after it is 5. The three regions of the last row have maxima 8, 6 and 9. With the hyperparameters $f = 3$ and $s = 1$, the final output is as shown in the figure.

The above demonstrates max pooling on a two-dimensional input. If the input is three-dimensional, the output is also three-dimensional. For example, if the input is 5×5×2, the output is 3×3×2. Max pooling is computed by applying the procedure just described to each channel separately. As shown in the figure, the first channel is handled as before; for the second channel, drawn below it, we do the same calculation to get the second channel of the output. In general, if the input is $5\times 5\times n_c$, the output is $3\times 3\times n_c$, with max pooling computed independently on each of the $n_c$ channels. That is the max pooling algorithm.
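A minimal NumPy sketch of channel-wise max pooling follows, using $f = 3$ and $s = 1$ on a 5×5×2 input. The random input values and the function name `max_pool` are assumptions for illustration.

```python
import numpy as np

def max_pool(x, f, s):
    """Max pooling of an (n_h, n_w, n_c) volume, applied to each channel separately."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max over height and width only
    return out

rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=(5, 5, 2)).astype(float)   # illustrative 5x5x2 input
pooled = max_pool(x, f=3, s=1)                          # shape (3, 3, 2)
```

The function has no weights at all; its behavior is fully determined by `f` and `s`.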

There is another type of pooling, average pooling, which is not used very often. Briefly, as the name implies, it takes the average of each filter region instead of the maximum. In the example, the average of the purple region is 3.75, followed by 1.25, 4 and 2. The hyperparameters of this average pooling are $f = 2$ and $s = 2$, and other hyperparameters can be chosen as well.
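Average pooling differs from max pooling only in the reduction applied to each region. In the sketch below, the input matrix is again an assumption (the figure is not reproduced here); its values are chosen so that the averages match those quoted in the text.

```python
import numpy as np

def avg_pool2d(x, f, s):
    """Average pooling of a 2D array with filter size f and stride s (no padding)."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].mean()
    return out

# Illustrative 4x4 input (assumed values, consistent with the averages in the text)
x = np.array([[1.0, 3.0, 2.0, 1.0],
              [2.0, 9.0, 1.0, 1.0],
              [1.0, 4.0, 2.0, 3.0],
              [5.0, 6.0, 1.0, 2.0]])

averaged = avg_pool2d(x, f=2, s=2)   # [[3.75, 1.25], [4.0, 2.0]]
```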

Currently, max pooling is used much more often than average pooling. One exception is very deep in a neural network, where you might use average pooling to collapse a representation of size 7×7×1000 by averaging over the whole spatial extent to get 1×1×1000. We will see an example later. But in general, max pooling is used more than average pooling.

To sum up, the hyperparameters of pooling are the filter size $f$ and the stride $s$. A common choice is $f = 2$, $s = 2$, which is used very often and has the effect of halving the height and width. $f = 3$, $s = 2$ is also used. The other hyperparameter is whether you use max pooling or average pooling. You can also add a hyperparameter $p$ for **padding** if you wish, although this is very rarely used for max pooling, with one exception that we will see next week. By far the most common value is $p = 0$. The input to max pooling is $n_H\times n_W\times n_c$ and, assuming no **padding** , the output is $\lfloor \frac{n_H-f}{s}+1 \rfloor \times \lfloor \frac{n_W-f}{s}+1 \rfloor \times n_c$. The number of output channels equals the number of input channels, because pooling is applied to each channel independently. Note that pooling has no parameters to learn: during backpropagation, there are no parameters of max pooling to update. There are only these hyperparameters, which are set once, either by hand or through cross-validation.

With that, we have covered everything about pooling. Max pooling just computes a fixed function of one layer's activations; there is nothing to learn.

We have talked about pooling here, now we already know how to build a convolutional layer and a pooling layer. In the next lesson, we will analyze a more complex example of a convolutional network that can introduce a fully connected layer.

### 1.10 Convolutional neural network example

We now have essentially all the building blocks needed to build a full convolutional neural network. Let's look at an example.

Suppose there is an input picture of size 32×32×3, an **RGB** picture, and you want to recognize handwritten digits. The 32×32×3 **RGB** picture contains a digit, such as 7, and you want to identify which of the 10 digits from 0 to 9 it is. We build a neural network to do this.

The network I use is very similar to, and inspired by, the classic network **LeNet-5** , created by **Yann LeCun** many years ago. It is not exactly **LeNet-5** , but many of the parameter choices are similar. The input is a 32×32×3 matrix. If the first layer uses a filter size of 5×5, a stride of 1, a **padding** of 0 and 6 filters, the output is 28×28×6. We label this layer **CONV1** : it applies 6 filters, adds the bias, applies a non-linear function, perhaps **ReLU** , and outputs the result of **CONV1** .

Next, build a pooling layer. Here I use max pooling with $f = 2$ and $s = 2$; since the **padding** is 0, I won't write it out. The max pooling filter is 2×2 and the stride is 2, which means the height and width are halved. So 28×28 becomes 14×14 while the number of channels stays the same, and the output is 14×14×6, which we label **POOL1** .

In the convolutional neural network literature there are two conventions for what counts as a layer. One convention treats a convolutional layer together with its pooling layer as a single layer, here **Layer1** of the network. The other counts the convolutional layer as one layer and the pooling layer as another. When people report the number of layers in a network, they usually count only the layers that have weights and parameters, and the pooling layer has none, only a few hyperparameters. Here we treat **CONV1** and **POOL1** together as one layer, **Layer1** , although in online articles and research papers you may see the convolutional and pooling layers counted separately; these are just two different naming conventions. Since I count only layers with weights, **CONV1** and **POOL1** are both part of **Layer1** , and **POOL1** is grouped into **Layer1** because it has no weights of its own. The output is 14×14×6.

We build another convolutional layer on top of it: the filter size is 5×5, the stride is 1, and this time we use 10 filters, so the output is a 10×10×10 matrix, labeled **CONV2**.

Then we apply max pooling with hyperparameters f = 2 and s = 2. As you can probably guess, the height and width are halved, and the final output is 5×5×10, labeled **POOL2**. Together these form the second convolutional layer of the network, **Layer2**.

If the convolutional layer applied to **Layer1** instead uses 16 filters of size 5×5 (f = 5), with a stride of 1 and **padding** of 0 (omitted here), the output of **CONV2** is 10×10×16. This is the **CONV2** layer used in the rest of the example.

Next we apply max pooling again, with hyperparameters f = 2 and s = 2. Can you guess the result? Applied to the 10×10×16 input, max pooling halves the height and width, so the result is 5×5×16, with the same number of channels as before. We label it **POOL2**. Together, **CONV2** and **POOL2** form **Layer2**, because it contains only one set of weights, that of the convolutional layer **CONV2**.

The 5×5×16 matrix contains 400 elements. We now flatten **POOL2** into a one-dimensional vector of size 400. Think of the flattened result as a set of 400 neurons, which we use to build the next layer. The next layer contains 120 units; this is our first fully connected layer, labeled **FC3**. The 400 units are densely connected to the 120 units, which is what makes this a fully connected layer. It is just like the single neural network layer we discussed in the first and second courses, a standard neural network layer. Its weight matrix has dimension 120×400. It is called "fully connected" because each of the 400 units is connected to each of the 120 units. There is also a bias parameter, and the output is 120-dimensional because there are 120 output units.
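The flatten-then-connect step can be sketched in plain Python as follows. The random placeholder values, the small Gaussian initialization, and the ReLU activation are assumptions made for illustration; the text only fixes the shapes (400 inputs, 120 outputs, a 120×400 weight matrix and a bias).

```python
import random

random.seed(0)

# A 5x5x16 volume of placeholder activations, stored as nested lists.
pool2 = [[[random.random() for _ in range(16)] for _ in range(5)] for _ in range(5)]

# Flatten into a single vector of 5 * 5 * 16 = 400 numbers.
flat = [v for row in pool2 for cell in row for v in cell]

# FC3: a 120x400 weight matrix plus one bias per output unit.
n_in, n_out = len(flat), 120
W = [[random.gauss(0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
b = [0.0] * n_out

# Every output unit is connected to all 400 inputs (ReLU is an assumption here).
fc3 = [max(0.0, sum(w * x for w, x in zip(W[j], flat)) + b[j]) for j in range(n_out)]

n_params = n_out * n_in + n_out  # weights plus biases
print(len(flat), len(fc3), n_params)  # 400 120 48120
```

The parameter count of 48,120 for this one layer already dwarfs the convolutional layers before it, which is the point made later about where most parameters live.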

Then we add another fully connected layer after these 120 units. This layer is smaller; suppose it contains 84 units, labeled **FC4**.

Finally, we feed these 84 units into a **softmax** unit. If we want to recognize the 10 handwritten digits 0-9, this **softmax** will have 10 outputs.

The convolutional neural network in this example is quite typical. It appears to have many hyperparameters, and I will give more advice later on how to choose them. A common guideline is to avoid choosing hyperparameters from scratch; instead, check the literature to see which hyperparameters others have used, and pick an architecture that worked well on someone else's task, since it may well suit your own application too. I will discuss this in more detail next week.

For now, I want to point out that as the network gets deeper, the height and width usually decrease: as mentioned earlier, from 32×32 to 28×28, to 14×14, to 10×10, and then to 5×5. So as the number of layers increases, the height and width decrease while the number of channels increases, from 3 to 6 to 16, followed by the fully connected layers.
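The sequence of shapes can be reproduced with the standard output-size rule, floor((n + 2p - f) / s) + 1. The following is a minimal sketch; the helper name `output_size` and the tuple layout of `layers` are illustrative assumptions, while the layer hyperparameters are the ones described in this section.

```python
def output_size(n, f, s=1, p=0):
    """Height/width after a convolution or pooling layer."""
    return (n + 2 * p - f) // s + 1

# (name, kind, filter size, stride, number of filters)
layers = [
    ("CONV1", "conv", 5, 1, 6),
    ("POOL1", "pool", 2, 2, None),
    ("CONV2", "conv", 5, 1, 16),
    ("POOL2", "pool", 2, 2, None),
]

size, channels = 32, 3
shapes = [("input", (size, size, channels))]
for name, kind, f, s, n_filters in layers:
    size = output_size(size, f, s)
    if kind == "conv":
        channels = n_filters  # a conv layer outputs one channel per filter
    shapes.append((name, (size, size, channels)))

for name, shape in shapes:
    print(name, shape)
```

Running it prints the shapes 32×32×3, 28×28×6, 14×14×6, 10×10×16 and 5×5×16 in order, matching the walk-through above.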

Another common pattern in neural networks is one or more convolutional layers followed by a pooling layer, then one or more convolutional layers followed by another pooling layer, then several fully connected layers, and finally a **softmax**.

Next, let's look at the activation shapes, the activation sizes, and the number of parameters in the network. The input is 32×32×3; multiplying these numbers gives 3072, so the input activation has shape 32×32×3 and size 3072, and the input layer has no parameters. Try computing the activations of the other layers yourself; doing so gives the activation shape and activation size of every layer in the network.

There are a few points to note. First, pooling layers (here, max pooling layers) have no parameters. Second, convolutional layers have relatively few parameters; as mentioned in earlier lessons, most of the parameters sit in the fully connected layers. We also observe that as the network gets deeper, the activation size gradually decreases. If the activation size drops too quickly, it can hurt the performance of the network. In this example, the activation size is 4704 after **CONV1** (28×28×6), 1600 after **CONV2**, then falls gradually to 84, and finally the **softmax** result is output. Many convolutional networks share these properties and follow similar patterns.
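As a check on these observations, the sketch below tabulates activation sizes and parameter counts, computed from the layer shapes given in this section. It assumes each convolutional filter spans all input channels, so a layer has (f × f × channels_in + 1) × n_filters parameters; the lecture's own table may count differently.

```python
# (layer, activation shape, number of parameters)
# Conv: (f * f * channels_in + 1) * n_filters; FC: n_in * n_out + n_out.
table = [
    ("input", (32, 32, 3), 0),
    ("CONV1", (28, 28, 6), (5 * 5 * 3 + 1) * 6),
    ("POOL1", (14, 14, 6), 0),
    ("CONV2", (10, 10, 16), (5 * 5 * 6 + 1) * 16),
    ("POOL2", (5, 5, 16), 0),
    ("FC3", (120,), 400 * 120 + 120),
    ("FC4", (84,), 120 * 84 + 84),
    ("softmax", (10,), 84 * 10 + 10),
]

def size(shape):
    """Number of activations: the product of the shape's dimensions."""
    total = 1
    for d in shape:
        total *= d
    return total

for name, shape, params in table:
    print(f"{name:8s} shape={str(shape):14s} size={size(shape):5d} params={params}")
```

Under this assumption the two convolutional layers hold 456 and 2,416 parameters, while **FC3** alone holds 48,120, and the pooling layers hold none.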

We have now covered the basic building blocks of convolutional neural networks: the convolutional layer, the pooling layer, and the fully connected layer. Much computer vision research explores how to combine these building blocks into effective networks, and combining them well takes real insight. In my experience, the best way to develop that insight is to study many examples of what others have built. Next week I will walk through some concrete cases where these building blocks were combined into effective networks. I hope those examples will help you develop a feel for building effective networks, and you may even be able to apply architectures developed by others directly to your own application. That is next week's content. The next class is the last one of this week. In it I want to take a moment to discuss why you would want to use convolutions, what their benefits and advantages are, how to put multiple convolutions together, how to test a neural network, and how to train a neural network on a training set to recognize images or perform other tasks.

### 1.11 Why use convolution? (Why convolutions?)

This is the last class of this week. Let's look at why convolutions are so useful in neural networks, and then briefly summarize how to put these convolutions together and how to train a convolutional neural network on a labeled training set. Compared with using only fully connected layers, convolutional layers have two main advantages: parameter sharing and sparse connections. Let me illustrate with an example.

Suppose we have a 32×32×3 image, the example from the previous lesson. Using 6 filters of size 5×5 gives an output of dimension 28×28×6. Note that 32×32×3 = 3072 and 28×28×6 = 4704. If we built a fully connected network with 3072 units in one layer and 4704 units in the next, with every unit in one layer connected to every unit in the other, the weight matrix would have 4704×3072 ≈ 14 million entries, which is a lot of parameters to train. With current technology we can train networks with more than 14 million parameters, but keep in mind that this 32×32×3 image is very small; for a 1000×1000 image the weight matrix would become enormous. Now look at the number of parameters in the convolutional layer. Each filter is 5×5, so it has 25 parameters, plus a bias, for 26 parameters per filter. With 6 filters that is 156 parameters in total, which is very small. (This count treats each filter as a single 5×5 slice; counting all 3 input channels gives 5×5×3 + 1 = 76 parameters per filter and 456 in total, which is still tiny next to 14 million.)
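The comparison is simple arithmetic, sketched below in Python. The variable names are illustrative. The last line also computes the count when each filter spans all 3 input channels, which the 26-per-filter figure above leaves out.

```python
n_in = 32 * 32 * 3      # 3072 input units
n_out = 28 * 28 * 6     # 4704 output units

# Fully connected: one weight for every (input, output) pair.
dense_weights = n_in * n_out

f, n_filters, channels = 5, 6, 3
# Count used in the text: one 5x5 slice plus a bias per filter.
conv_single_slice = (f * f + 1) * n_filters
# Counting the filter across all 3 input channels, plus a bias.
conv_full_depth = (f * f * channels + 1) * n_filters

print(dense_weights, conv_single_slice, conv_full_depth)  # 14450688 156 456
```

Either way of counting, the convolutional layer needs tens of thousands of times fewer parameters than the fully connected alternative.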

There are two reasons why convolutional networks need so few parameters:

The first is parameter sharing. The observation is that if a feature detector, such as a vertical edge detector, is useful in one part of the image, it is probably useful in other parts too. In other words, if you use a 3×3 filter to detect vertical edges, you can apply that same 3×3 filter to the upper left corner of the image, to the region next to it (the part marked by the blue box in the left matrix), and so on. Each feature detector, and therefore each output, can use the same parameters at different positions of the input image to extract vertical edges or other features. This holds not only for low-level features such as edges but also for high-level features, such as detecting eyes on a face, cats, or other objects. Even with so few parameters, the same 9 parameters are enough to compute all 16 outputs. Intuitively, a feature detector such as a vertical edge detector that is useful in the upper left corner of the image is likely to be useful in the lower right corner as well, so you do not need separate feature detectors for the upper left and lower right corners. In some data sets the upper left and lower right corners may have somewhat different distributions, but they are usually similar enough that sharing the feature detector across the whole image works well.

The second reason is sparse connections. Let me explain. This 0 is computed by a 3×3 convolution, so it depends only on a 3×3 block of input cells. The output unit on the right (element 0) is connected to only 9 of the 36 input features. The other pixel values have no effect on this output, which is what we mean by sparse connections.

As another example, this output (the element 30 marked in red in the right matrix) depends only on these 9 features (the area marked by the red box in the left matrix). Only these 9 input features are connected to this output; the other pixels have no effect on it.
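Both mechanisms can be seen in a small plain-Python sketch. It assumes the figure shows the vertical edge example from earlier in the week, a 6×6 image with a bright left half and a 3×3 vertical edge filter; that choice of image and the helper name `conv2d` are assumptions for illustration.

```python
def conv2d(image, kernel):
    """Valid cross-correlation (the 'convolution' of deep learning), stride 1."""
    f = len(kernel)
    out_h = len(image) - f + 1
    out_w = len(image[0]) - f + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(f) for b in range(f))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 6x6 image: bright left half, dark right half (a vertical edge in the middle).
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# Parameter sharing: the same 9 weights are reused at every position.
kernel = [[1, 0, -1]] * 3

out = conv2d(image, kernel)
print(out[0])  # [0, 30, 30, 0]

# Sparse connections: a pixel outside the top-left 3x3 window
# has no effect on the top-left output.
changed = [row[:] for row in image]
changed[5][5] = 99
print(conv2d(changed, kernel)[0][0] == out[0][0])  # True
```

The 9 filter weights produce all 16 outputs of the 4×4 result, and each output reads only 9 of the 36 input values.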

Through these two mechanisms, a neural network has far fewer parameters, so we can train it with a smaller training set and it is less prone to overfitting. You may also have heard that convolutional neural networks are good at capturing translation invariance. If you shift a picture of a cat two pixels to the right, it is still clearly a cat. The convolutional structure helps here because an image shifted by a few pixels produces very similar features and should receive the same output label. Because we apply the same filter at every position of the image, in every layer, the network can learn to be more robust and better capture the desired translation invariance.
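A one-dimensional stand-in for an image row illustrates the idea. This is a simplified sketch under stated assumptions: the signal, the two-tap filter, and the helper name `conv1d` are made up for illustration, and it shows only that a shifted input gives the same response at a shifted position.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation with stride 1."""
    f = len(kernel)
    return [
        sum(signal[i + k] * kernel[k] for k in range(f))
        for i in range(len(signal) - f + 1)
    ]

kernel = [1, -1]                       # responds to a bright-to-dark step
signal = [10, 10, 10, 0, 0, 0, 0, 0]   # step after index 2
shifted = [10, 10] + signal[:-2]       # same pattern moved two pixels right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)
print(out)          # [0, 0, 10, 0, 0, 0, 0]
print(out_shifted)  # [0, 0, 0, 0, 10, 0, 0]

# The strongest response is the same; only its position has moved.
print(max(out) == max(out_shifted))  # True
```

Because the filter is shared across positions, the feature is detected equally well wherever it appears, and later pooling and fully connected layers can then map both versions to the same label.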

This is why convolutions or convolutional networks perform well in computer vision tasks.

Finally, let's put these layers together and see how to train such a network. Suppose we want to build a cat detector, and we have a labeled training set in which x represents an image and y is a binary label or one of several class labels. We choose a convolutional neural network: it takes the image as input, applies convolutional and pooling layers, then fully connected layers, and finally outputs a **softmax**, that is, ŷ. The convolutional and fully connected layers have various parameters w and biases b, and any setting of these parameters defines a cost function similar to the ones we discussed before. After randomly initializing w and b, the cost J equals the sum of the losses of the network's predictions over the entire training set, divided by m, that is, $J = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$. To train the network, all you have to do is use gradient descent, or another algorithm such as gradient descent with **Momentum**, **RMSProp**, or another optimizer, to adjust all the parameters of the network so as to reduce the cost J. In this way you can build an effective cat detector or other detectors.
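The training loop itself does not depend on the network architecture. In the minimal sketch below, a single sigmoid unit stands in for the whole convolutional network, the four-example training set is made up, and the parameters start from fixed values rather than random ones so that the run is reproducible. All names are illustrative assumptions.

```python
import math

def predict(w, b, x):
    """Stand-in for the network: a single sigmoid unit producing y_hat."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def cost(w, b, xs, ys):
    """J = (1/m) * sum of cross-entropy losses over the training set."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = predict(w, b, x)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / m

# Toy labeled training set (made up): y = 1 means "cat".
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b, lr = 0.1, 0.0, 0.5   # fixed start instead of random initialization
initial = cost(w, b, xs, ys)
m = len(xs)
for _ in range(100):
    # Gradients of J for the sigmoid + cross-entropy pair.
    dw = sum((predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / m
    db = sum((predict(w, b, x) - y) for x, y in zip(xs, ys)) / m
    w -= lr * dw
    b -= lr * db
final = cost(w, b, xs, ys)
print(initial, final)
```

The cost falls as gradient descent proceeds. In a real convolutional network the gradients come from backpropagation through every layer, but the update rule has the same form.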

Congratulations on completing this week's course. You have learned all the basic building blocks of convolutional neural networks and how to combine them into an effective image recognition system. This week's programming exercises will make these concepts more concrete: try putting the building blocks together and using them to solve your own problems.

Next week we will study convolutional neural networks in more depth. I mentioned that convolutional neural networks have many hyperparameters, so next week I plan to show some of the most effective convolutional network examples, which will help you judge which kinds of architecture work well. People often take architectures that others have discovered and published in research papers and apply them to their own applications. After seeing more concrete examples next week, I believe you will be able to do this better. We will also look at why convolutional neural networks work so well, and later in the course we will cover some newer computer vision applications, such as object detection and neural style transfer, and how these algorithms can be used to create new forms of art.

Reference