CN108717569B - Expansion full-convolution neural network device and construction method thereof - Google Patents

Expansion full-convolution neural network device and construction method thereof

Info

Publication number
CN108717569B
Authority
CN
China
Prior art keywords
expansion
layer
convolutional
feature
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810470228.5A
Other languages
Chinese (zh)
Other versions
CN108717569A (en)
Inventor
曹铁勇
方正
张雄伟
杨吉斌
孙蒙
李莉
赵斐
洪施展
项圣凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201810470228.5A priority Critical patent/CN108717569B/en
Publication of CN108717569A publication Critical patent/CN108717569A/en
Application granted granted Critical
Publication of CN108717569B publication Critical patent/CN108717569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Abstract

The invention discloses an expansion fully convolutional neural network and a construction method thereof. The neural network comprises a convolutional neural network, a feature extraction module, and a feature fusion module connected in sequence. The construction method comprises the following steps. Selecting a convolutional neural network: remove the fully connected layers and classification layer used for classification, leaving only the intermediate convolution and pooling layers, from which feature maps are extracted. Constructing a feature extraction module: the feature extraction module comprises several expansion up-sampling modules connected in series, each consisting of a feature map merging layer, an expansion convolution layer, and a deconvolution layer. Constructing a feature fusion module: the feature fusion module comprises a dense expansion fusion convolution block and a deconvolution layer. The method effectively solves the problems of feature extraction and fusion in convolutional neural networks and can be applied to pixel-level image labeling tasks.

Description

Expansion full-convolution neural network device and construction method thereof
Technical Field
The invention belongs to the technical field of image signal processing, and particularly relates to an expansion full convolution neural network device and a construction method thereof.
Background
Convolutional neural networks (CNNs) are the most widely used deep learning networks in image processing and computer vision. CNNs were originally designed for image recognition and classification: an input image passes through the CNN and a class label is output. However, in some areas of image processing, identifying the category of the whole image is far from sufficient. In image semantic segmentation, for example, the category of every pixel in the image must be labeled, so the output is not a class label but a map of the same size as the original image, in which each pixel is labeled with the semantic category of the corresponding pixel in the original image. A plain CNN cannot complete this task; its structure must be modified. The earliest network adapting the CNN to pixel-level labeling tasks was the fully convolutional network (FCN) (J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440), which replaces the classification layers at the tail of a conventional CNN with convolution and deconvolution layers to obtain an output map of the same size as the original image. The FCN was first used for semantic segmentation and has since been applied to other kinds of pixel-level labeling tasks. FCNs have two main applications:
(1) Image saliency detection: saliency detection aims to find the salient foreground object in an image, i.e., an algorithm separates the foreground object from the background. When a saliency detection model is learned with an FCN, the loss function of the network is generally the Euclidean distance or the cross entropy between the annotation map and the generated map.
(2) Image semantic segmentation: unlike salient object detection, semantic segmentation must find and label all semantic content in each image, segmenting the background as well as the foreground, and must also classify the labeled regions. When training a semantic segmentation model with an FCN, the loss function generally consists of cross entropy with a softmax classification function.
However, the result maps obtained from conventional fully convolutional networks often fail to preserve object edge information well and tend to be coarse, so a post-processing stage is usually added to improve labeling precision. Post-processing not only increases the complexity of the labeling model; because it splits the labeling process into separate stages, the obtained results are often unsmooth, contain many discontinuous pixels, and the final output is strongly affected. These drawbacks arise mainly because earlier FCNs do not extract and exploit the image features inside the network well, which degrades performance. In addition, conventional FCNs have a large number of parameters, which hampers model portability and miniaturization.
Disclosure of Invention
The invention aims to provide an expansion fully convolutional neural network device and a construction method thereof, so that the image features inside the network can be accurately extracted and exploited for pixel-level image segmentation tasks.
The technical solution that realizes the purpose of the invention is as follows: an expansion fully convolutional neural network device comprises a convolutional neural network, a feature extraction module, and a feature fusion module connected in sequence, wherein:
the convolutional neural network is the network backbone and comprises convolution layers and pooling layers, from which feature maps are extracted;
the feature extraction module comprises several expansion up-sampling modules connected in series, each consisting of a feature map merging layer, an expansion convolution layer, and a deconvolution layer; the feature map merging layer merges feature maps of the same size by stacking them; the expansion convolution layer enlarges the receptive field, defined as the size of the region of the original image onto which a pixel of the feature map output by a given layer of the convolutional neural network is mapped; the deconvolution layer up-samples the feature map, so that its output feature map is twice the size of its input feature map;
the feature fusion module comprises a dense expansion fusion convolution block, a deconvolution layer, and an activation function; the dense expansion fusion convolution block fuses all feature maps from the feature extraction module, the deconvolution layer restores the output image to the size of the original input image, and the activation function is selected according to the specific task.
Further, the feature extraction module comprises M expansion up-sampling modules connected in series: the first expansion up-sampling module extracts a feature map from the output of the first convolution layer before the third down-sampling in the convolutional neural network, the second extracts one from the output of the first convolution layer before the fourth down-sampling, and so on, until the Mth expansion up-sampling module extracts a feature map from the output of the first convolution layer before the (M+2)th down-sampling. From the Mth expansion up-sampling module to the 1st, the expansion factors of the expansion convolution layers decrease successively and are all smaller than 16; the up-sampling factor of the deconvolution layer in every expansion up-sampling module is 2.
Further, the dense expansion fusion convolution block fuses all feature maps from the feature extraction module; the feature map size is unchanged after each convolution layer in the block, and the number of channels of the block's output feature map is 1.
Further, the dense expansion fusion convolution block comprises 5 convolution layers, the input of each convolution layer coming from the outputs of all convolution layers before it; the first 3 are expansion convolution layers whose expansion factors increase successively and are all smaller than 16; the last 2 are ordinary convolution layers; the feature map size is unchanged after passing through the 5 convolution layers.
A construction method of an expansion fully convolutional neural network comprises the following steps:
Step 1, selecting a convolutional neural network: remove the fully connected layers and classification layer used for classification, leaving only the intermediate convolution and pooling layers, from which feature maps are extracted.
Step 2, constructing a feature extraction module: the feature extraction module comprises several expansion up-sampling modules connected in series, each consisting of a feature map merging layer, an expansion convolution layer, and a deconvolution layer.
Step 3, constructing a feature fusion module: the feature fusion module comprises a dense expansion fusion convolution block, a deconvolution layer, and an activation function; the dense expansion fusion convolution block fuses all feature maps from the feature extraction module, the deconvolution layer restores the output image to the size of the original input image, and the result map is output after the activation function is applied.
Further, the feature extraction module in step 2 comprises M expansion up-sampling modules connected in series: the first expansion up-sampling module extracts a feature map from the output of the first convolution layer before the third down-sampling in the convolutional neural network, the second extracts one from the output of the first convolution layer before the fourth down-sampling, and so on, until the Mth expansion up-sampling module extracts a feature map from the output of the first convolution layer before the (M+2)th down-sampling. From the Mth expansion up-sampling module to the 1st, the expansion factors of the expansion convolution layers decrease successively and are all smaller than 16; the up-sampling factor of the deconvolution layer in every expansion up-sampling module is 2.
Further, the dense expansion fusion convolution block in step 3 fuses all feature maps from the feature extraction module; the feature map size is unchanged after each convolution layer in the block, and the number of channels of the block's output feature map is 1.
Further, the activation function in step 3 is selected according to the specific task: if the network is trained for image semantic segmentation, the activation function is a softmax classification function; if it is trained for saliency detection, the activation function is a sigmoid function.
Further, the dense expansion fusion convolution block comprises 5 convolution layers, the input of each convolution layer coming from the outputs of all convolution layers before it; the first 3 are expansion convolution layers whose expansion factors increase successively and are all smaller than 16; the last 2 are ordinary convolution layers; the feature map size is unchanged after passing through the 5 convolution layers.
Compared with the prior art, the invention has the following notable advantages: (1) the constructed feature extraction and fusion modules effectively solve the problem of feature extraction and fusion in the FCN and better handle pixel-level labeling problems in image processing; (2) good results are obtained without any additional post-processing stage; (3) the model structure is simple, the final model has few parameters, and it runs fast.
Drawings
Fig. 1 is the overall structure diagram of the expansion fully convolutional neural network device of the present invention.
Fig. 2 is a schematic diagram of the expansion up-sampling module in the expansion fully convolutional neural network device of the present invention.
Fig. 3 is a schematic diagram of the dense fusion module in the expansion fully convolutional neural network device of the present invention.
Fig. 4 is an example of an expansion network constructed from a dense convolutional network by the method of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings.
An expansion fully convolutional neural network device comprises a convolutional neural network, a feature extraction module, and a feature fusion module connected in sequence, wherein:
the convolutional neural network is the network backbone and comprises convolution layers and pooling layers, from which feature maps are extracted;
the feature extraction module comprises several expansion up-sampling modules connected in series, each consisting of a feature map merging layer, an expansion convolution layer, and a deconvolution layer; the feature map merging layer merges feature maps of the same size by stacking them; the expansion convolution layer enlarges the receptive field, defined as the size of the region of the original image onto which a pixel of the feature map output by a given layer of the convolutional neural network is mapped; the deconvolution layer up-samples the feature map, so that its output feature map is twice the size of its input feature map;
the feature fusion module comprises a dense expansion fusion convolution block, a deconvolution layer, and an activation function; the dense expansion fusion convolution block fuses all feature maps from the feature extraction module, the deconvolution layer restores the output image to the size of the original input image, and the activation function is selected according to the specific task.
Further, the feature extraction module comprises M expansion up-sampling modules connected in series: the first expansion up-sampling module extracts a feature map from the output of the first convolution layer before the third down-sampling in the convolutional neural network, the second extracts one from the output of the first convolution layer before the fourth down-sampling, and so on, until the Mth expansion up-sampling module extracts a feature map from the output of the first convolution layer before the (M+2)th down-sampling. From the Mth expansion up-sampling module to the 1st, the expansion factors of the expansion convolution layers decrease successively and are all smaller than 16; the up-sampling factor of the deconvolution layer in every expansion up-sampling module is 2.
Further, the dense expansion fusion convolution block fuses all feature maps from the feature extraction module; the feature map size is unchanged after each convolution layer in the block, and the number of channels of the block's output feature map is 1.
Further, the dense expansion fusion convolution block comprises 5 convolution layers, the input of each convolution layer coming from the outputs of all convolution layers before it; the first 3 are expansion convolution layers whose expansion factors increase successively and are all smaller than 16; the last 2 are ordinary convolution layers; the feature map size is unchanged after passing through the 5 convolution layers.
A construction method of an expansion fully convolutional neural network comprises the following steps:
Step 1, selecting a convolutional neural network: remove the fully connected layers and classification layer used for classification, leaving only the intermediate convolution and pooling layers, from which feature maps are extracted.
Step 2, constructing a feature extraction module: the feature extraction module comprises several expansion up-sampling modules connected in series, each consisting of a feature map merging layer, an expansion convolution layer, and a deconvolution layer.
Step 3, constructing a feature fusion module: the feature fusion module comprises a dense expansion fusion convolution block, a deconvolution layer, and an activation function; the dense expansion fusion convolution block fuses all feature maps from the feature extraction module, the deconvolution layer restores the output image to the size of the original input image, and the result map is output after the activation function is applied.
Further, the feature extraction module in step 2 comprises M expansion up-sampling modules connected in series: the first expansion up-sampling module extracts a feature map from the output of the first convolution layer before the third down-sampling in the convolutional neural network, the second extracts one from the output of the first convolution layer before the fourth down-sampling, and so on, until the Mth expansion up-sampling module extracts a feature map from the output of the first convolution layer before the (M+2)th down-sampling. From the Mth expansion up-sampling module to the 1st, the expansion factors of the expansion convolution layers decrease successively and are all smaller than 16; the up-sampling factor of the deconvolution layer in every expansion up-sampling module is 2.
Further, the dense expansion fusion convolution block in step 3 fuses all feature maps from the feature extraction module; the feature map size is unchanged after each convolution layer in the block, and the number of channels of the block's output feature map is 1.
Further, the activation function in step 3 is selected according to the specific task: if the network is trained for image semantic segmentation, the activation function is a softmax classification function; if it is trained for saliency detection, the activation function is a sigmoid function.
Further, the dense expansion fusion convolution block comprises 5 convolution layers, the input of each convolution layer coming from the outputs of all convolution layers before it; the first 3 are expansion convolution layers whose expansion factors increase successively and are all smaller than 16; the last 2 are ordinary convolution layers; the feature map size is unchanged after passing through the 5 convolution layers.
Example 1
FIG. 1 shows the structure of the disclosed expansion fully convolutional network. The network consists of 3 parts: a convolutional neural network, a feature extraction module, and a feature fusion module. In the figure, "Conv" denotes a convolution layer and "Pooling" denotes a pooling layer.
(1) A convolutional neural network:
Any existing convolutional neural network can be selected, including VGG-Net, ResNet, DenseNet, and the like. Such a network is designed for image classification and generally consists of convolution layers, pooling layers, and fully connected layers. When constructing the fully convolutional network, the fully connected layers and classification layer at the end of the convolutional network are removed, leaving only the intermediate convolution and pooling layers, and output feature maps are extracted from these intermediate layers. As shown in Fig. 1, the feature map after each pooling layer is generally the one extracted: because every pooling layer down-samples its input, the feature maps after different pooling layers have different sizes. The specific analysis is given in the feature extraction module construction section below.
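As a concrete illustration of this step, the following is a minimal PyTorch sketch, assuming torchvision's VGG-16 as the backbone; the backbone choice and variable names are assumptions for illustration, not prescribed by the invention.

```python
# Minimal sketch: truncating a classification CNN into a feature-extracting
# backbone as described above. Assumes torchvision's VGG-16; the classifier
# head is simply discarded, leaving only convolution and pooling layers.
import torch
import torchvision

vgg = torchvision.models.vgg16(weights=None)   # classifier head will be ignored
features = vgg.features                        # only conv + pooling layers remain

x = torch.randn(1, 3, 224, 224)                # dummy N x N input image
feature_maps = []
for layer in features:
    x = layer(x)
    if isinstance(layer, torch.nn.MaxPool2d):  # grab the map after each pooling
        feature_maps.append(x)

for i, f in enumerate(feature_maps, 1):
    print(f"after pooling {i}: {tuple(f.shape)}")
# VGG-16 halves the spatial size at each of its 5 pooling layers, so the
# extracted maps come out at 112, 56, 28, 14, 7 for a 224 x 224 input.
```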
(2) The feature extraction module is constructed as follows:
the feature extraction is composed of a series of expansion up-sampling modules, and fig. 2 is a diagram of the expansion up-sampling module provided by the invention, which is composed of a layer for merging feature maps, an expansion convolution layer and a deconvolution layer. The following briefly describes deconvolution and dilation convolution, followed by the feature extraction module construction process.
Assume F is an N × N two-dimensional image and K is a k × k filter. The convolution of F and K is defined as:

S(x, y) = (F ⊛ K)(x, y) = Σ_i Σ_j F(x − i, y − j) K(i, j)    (1)

where ⊛ denotes the convolution operator and S(x, y) is the resulting convolution value.

Let l be the expansion factor. The convolution with expansion factor l, denoted ⊛_l, is defined as:

S_l(x, y) = (F ⊛_l K)(x, y) = Σ_i Σ_j F(x − l·i, y − l·j) K(i, j)    (2)
the expansion convolution can effectively enlarge the receptive field, and the definition of the receptive field is the size of the area of the pixel points on the characteristic diagram output by each layer of the convolution neural network, which are mapped on the original image. The larger the expansion factor is, the larger the receptive field in the feature map is, and more detailed information in the original image can be captured. When designing a dense dilation full convolution network, the smaller feature maps require a larger receptive field, so the dilation factors in the 4 dilation upsampling modules decrease sequentially from the fourth to the first and are smaller than 16.
Deconvolution is the inverse operation of convolution. In an FCN, deconvolution is used to up-sample the feature maps, because the original CNN structure is a series of down-sampling operations (convolution and pooling). In a convolutional neural network, the size relation between the input and output images of each convolution layer can be expressed as:

O_conv = (I_conv − K + 2P) / S + 1    (3)

where O_conv is the length or width of the output image, I_conv is the length or width of the input image, K is the convolution kernel size, P is the zero-padding amount, and S is the convolution stride.

The size relation between deconvolution input and output is:

O_deconv = (I_deconv − 1)·S + K − 2P    (4)

where O_deconv is the length or width of the output image, I_deconv is the length or width of the input image, K is the convolution kernel size, P is the zero-padding amount, and S is the convolution stride. The output size of a pooling layer is half its input size.
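Relations (3) and (4) can be verified arithmetically; a small sketch with kernel, stride, and padding values taken from Table 1 below, noting that for an expansion convolution K in relation (3) is the effective kernel size k + (k − 1)(l − 1):

```python
# Sketch: checking the size relations (3) and (4) with Table 1's settings.
def conv_out(i, k, p, s):
    return (i - k + 2 * p) // s + 1                  # relation (3)

def deconv_out(i, k, p, s):
    return (i - 1) * s + k - 2 * p                   # relation (4)

# Deconvolution with K=4, S=2, P=1 exactly doubles the map (Deconv1-5).
assert deconv_out(16, k=4, p=1, s=2) == 32
# A 3x3 expansion convolution with factor 5 has effective kernel 11;
# with P=5 (Conv1 in Table 1) the feature map size is preserved.
assert conv_out(64, k=11, p=5, s=1) == 64
# Pooling halves the size, matching "half of the input size" above.
assert conv_out(64, k=2, p=0, s=2) == 32
print("all size relations hold")
```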
The construction of the feature extraction module is described below, taking the dense expansion convolution network as an example, as shown in Fig. 4. In a convolutional neural network, data flows as 4-dimensional tensors. Assume the input image is N × N, so the input tensor is 1 × 3 × N × N. Each convolution outputs feature maps with different channel counts; according to the network structure, the feature map tensor extracted from dense convolution block 1 is 1 × n × (N/4) × (N/4), where n is the number of channels of the feature map. The channel counts can be chosen freely as circumstances require; in general, the larger n is, the more parameters the final model has, but the better the performance. In designing the feature extraction module of the invention, the main concern is the size relation among the feature maps output by the intermediate layers.
As stated above, the size of the feature map extracted from dense convolution block 1 is (N/4) × (N/4), that of dense convolution block 2 is (N/8) × (N/8), that of dense convolution block 3 is (N/16) × (N/16), and that of dense convolution block 4 is (N/32) × (N/32). However, a pixel-level labeling task requires an output map of the same size as the original image; at the same time, the feature maps of different layers carry different information, and exploiting the features of all layers requires all output feature maps to be up-sampled. For this purpose a cascaded up-sampling structure is constructed, and the feature maps of all layers are up-sampled to (N/2) × (N/2).
A single up-sampling building block is shown in Fig. 2: the deconvolution layer is parameterized to up-sample the feature map by a factor of 2, the expansion convolution keeps the input and output feature maps the same size, and the merging layer merges feature maps of the same size by stacking them. At construction time, the smallest feature map (the one output by dense convolution block 4 in Fig. 4) is (N/32) × (N/32), and no smaller feature map follows it, so the up-sampling module for this last layer needs no merging layer; its feature map is simply output at (N/16) × (N/16) after the expansion convolution and deconvolution. The feature map extracted from dense convolution block 3 is also (N/16) × (N/16), so the second up-sampling structure adds a merging layer, whose role is to merge these same-sized feature maps into one tensor. Suppose the data tensor extracted from dense convolution block 4 is 1 × n4 × (N/32) × (N/32), where n4 is its channel count; after the first up-sampling unit the output tensor becomes 1 × n4 × (N/16) × (N/16). The data tensor extracted from dense convolution block 3 is 1 × n3 × (N/16) × (N/16), and after the merging layer the combined tensor is 1 × (n3 + n4) × (N/16) × (N/16). Proceeding forward in this way, the output tensor of the final, fourth up-sampling structure is 1 × (n3 + n4 + n2 + n1) × (N/2) × (N/2).
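The cascade just described can be sketched in PyTorch as follows; a non-authoritative sketch in which the channel counts, expansion factors, and names are illustrative assumptions:

```python
# Sketch of one expansion up-sampling module (Fig. 2) and a two-step cascade.
import torch
import torch.nn as nn

class ExpansionUpsample(nn.Module):
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # padding = dilation keeps the 3x3 expansion convolution size-preserving
        self.dilated = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)
        # K=4, S=2, P=1 doubles the spatial size, per relation (4)
        self.deconv = nn.ConvTranspose2d(out_ch, out_ch, 4, stride=2, padding=1)

    def forward(self, x, skip=None):
        if skip is not None:                         # merging layer: stack the
            x = torch.cat([x, skip], dim=1)          # same-sized maps channel-wise
        return self.deconv(self.dilated(x))

# The module for the smallest map has no merging layer; the next module
# merges the up-sampled tensor with the feature map extracted from the backbone.
m4 = ExpansionUpsample(in_ch=512, out_ch=512, dilation=2)
m3 = ExpansionUpsample(in_ch=512 + 256, out_ch=256, dilation=3)

up = m4(torch.randn(1, 512, 8, 8))                   # -> 1 x 512 x 16 x 16
out = m3(up, skip=torch.randn(1, 256, 16, 16))       # -> 1 x 256 x 32 x 32
print(tuple(out.shape))
```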
Table 1 gives example parameter settings of the feature extraction module in the dense expansion convolution network; kernel size, zero padding, convolution stride, and expansion factor are the parameters of the convolution operations. Conv1-4 are the expansion convolution layers of the 4 up-sampling structures and Deconv1-4 are their deconvolution layers. The parameters may be chosen as circumstances require, provided that the input and output feature maps of each expansion convolution are the same size and that each deconvolution doubles the image size. The expansion factor of an expansion convolution is chosen according to the size of its input feature map: a large factor for a large feature map and a small factor for a small one. As the expansion factor grows, the effective convolution kernel grows with it, and if the feature map being convolved is too small, information in the feature map is lost.
For a different network, only the parameters or the number of up-sampling units need to be changed, following the steps above.
TABLE 1

Type    | Kernel size | Zero padding | Stride | Expansion factor | Output
------- | ----------- | ------------ | ------ | ---------------- | --------------------
Conv4   | 3×3         | 2            | 1      | 2                | n1 × (N/32) × (N/32)
Conv3   | 3×3         | 3            | 1      | 3                | n2 × (N/16) × (N/16)
Conv2   | 3×3         | 4            | 1      | 4                | n3 × (N/8) × (N/8)
Conv1   | 3×3         | 5            | 1      | 5                | n4 × (N/4) × (N/4)
Deconv4 | 4×4         | 1            | 2      | /                | n1 × (N/16) × (N/16)
Deconv3 | 4×4         | 1            | 2      | /                | n2 × (N/8) × (N/8)
Deconv2 | 4×4         | 1            | 2      | /                | n3 × (N/4) × (N/4)
Deconv1 | 4×4         | 1            | 2      | /                | n4 × (N/2) × (N/2)
Conv9_1 | 3×3         | 2            | 1      | 2                | 100 × (N/2) × (N/2)
Conv9_2 | 3×3         | 4            | 1      | 4                | 50 × (N/2) × (N/2)
Conv9_3 | 3×3         | 8            | 1      | 8                | 30 × (N/2) × (N/2)
Conv9_4 | 1×1         | 0            | 1      | /                | 20 × (N/2) × (N/2)
Conv9_5 | 1×1         | 0            | 1      | /                | 1 × (N/2) × (N/2)
Deconv5 | 4×4         | 1            | 2      | /                | 1 × N × N
(3) The feature fusion module structure:
the feature fusion module is composed of a dense expansion fusion volume block, an deconvolution layer and an activation function, wherein the structure of the dense expansion fusion volume block is shown in fig. 3, wherein Conv9-1, Conv9-2 and Conv9-3 are expansion convolutions, the expansion factors are respectively 2,4,8, Conv9-4 and the Conv9-5 convolution layer is 1 × 1 convolution layer. Conv9-1-Conv9-5 is densely connected, i.e., the input to each layer is from the output of all layers before this layer, such as Conv9-5 in FIG. 3, whose input is from all the outputs of the previous 4 convolutional layers. The dense dilation fusion convolution block is used to fuse all feature maps from the feature extraction module to get the corresponding result. Also illustrated by the example of the dense dilation convolution network of fig. 4, the input of the dense dilation fusion convolution module comes from the feature extraction module, and the input tensor is 1 × (n)3+n4+n2+n1) (N/2) × (N/2), after the dense convolution block goes through a series of convolution operations, a tensor of 1 × 1 (N/2) × (N/2) is output, and in design, it should be ensured that the size of the eigenmap passing through 5 convolution layers is not changed, and the number of channels of the output eigenmap of the Conv9-5 convolution layer must be 1, and an example of parameter design in the dense dilation convolution fusion block in fig. 4 is shown in table 1. The expansion factor of Conv9-1-Conv9-3 can be selected according to the situation, and the number of output characteristic diagrams of Conv9-1-Conv9-4 can also be selected according to the situation. Deconv5 functions to reset the output image to the input image size. And selecting the activation function according to a specific task, for example, training an image semantic segmentation task by using the network, wherein the activation function is a softmax classification function, and if the training of a saliency detection task is performed, the activation function is a sigmoid function.
(4) Network training: once the network is constructed, it can be trained for a specific task, with a different loss function selected for each task. For a saliency detection task, for example, a set of training images and their corresponding annotation maps is selected first, and the loss function is generally the Euclidean distance between the annotation maps and the generated maps, as in the following formula:

L = Σ_{i=1..N1} ‖f(Z_i) − M_i‖²    (5)

where Z = {Z_i} (i = 1, ..., N1) are the training images, f(Z_i) is the network output for image Z_i, and M_i (i = 1, ..., N1) is the annotation map corresponding to training image Z_i. The parameters of the network are updated by minimizing Eq. (5) with gradient descent.
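In PyTorch the objective and one gradient step might look as follows; a minimal sketch in which the tiny stand-in network and all names are illustrative assumptions, not the patent's architecture:

```python
# Sketch of the saliency objective (5): squared Euclidean distance between
# network outputs and annotation maps, minimized by gradient descent.
import torch
import torch.nn as nn

net = nn.Conv2d(3, 1, 3, padding=1)                  # stand-in for the full device
opt = torch.optim.SGD(net.parameters(), lr=1e-3)

z = torch.randn(4, 3, 64, 64)                        # stand-in training images Z_i
m = torch.rand(4, 1, 64, 64)                         # their annotation maps M_i

pred = torch.sigmoid(net(z))                         # sigmoid: saliency task
loss = ((pred - m) ** 2).sum(dim=(1, 2, 3)).mean()   # mean of ||f(Z_i) - M_i||^2
opt.zero_grad()
loss.backward()
opt.step()                                           # one gradient-descent update
```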
Different loss functions and parameter-update methods can be selected for different training tasks. The network effectively solves the problems of feature extraction and fusion in convolutional neural networks and can be applied to pixel-level image labeling tasks.

Claims (7)

1. An expansion fully convolutional neural network device, characterized in that it comprises a convolutional neural network, a feature extraction module, and a feature fusion module connected in sequence, wherein:
the convolutional neural network is the network backbone and comprises convolution layers and pooling layers, from which feature maps are extracted;
the feature extraction module comprises several expansion up-sampling modules connected in series, each consisting of a feature map merging layer, an expansion convolution layer, and a deconvolution layer; the feature map merging layer merges feature maps of the same size by stacking them; the expansion convolution layer enlarges the receptive field, defined as the size of the region of the original image onto which a pixel of the feature map output by a given layer of the convolutional neural network is mapped; the deconvolution layer up-samples the feature map, so that its output feature map is twice the size of its input feature map;
the feature fusion module comprises a dense expansion fusion convolution block, a deconvolution layer, and an activation function; the dense expansion fusion convolution block fuses all feature maps from the feature extraction module, the deconvolution layer restores the output image to the size of the original input image, and the activation function is selected according to the specific task;
the dense expansion fusion convolution block comprises 5 convolution layers, the input of each convolution layer coming from the outputs of all convolution layers before it; the first 3 are expansion convolution layers whose expansion factors increase successively and are all smaller than 16; the last 2 are ordinary convolution layers; the feature map size is unchanged after passing through the 5 convolution layers.
2. The expansion fully convolutional neural network device according to claim 1, characterized in that the feature extraction module comprises M expansion up-sampling modules connected in series: the first expansion up-sampling module extracts a feature map from the output of the first convolution layer before the third down-sampling in the convolutional neural network, the second extracts one from the output of the first convolution layer before the fourth down-sampling, and so on, until the Mth expansion up-sampling module extracts a feature map from the output of the first convolution layer before the (M+2)th down-sampling; from the Mth expansion up-sampling module to the 1st, the expansion factors of the expansion convolution layers decrease successively and are all smaller than 16; the up-sampling factor of the deconvolution layer in every expansion up-sampling module is 2.
3. The expansion fully convolutional neural network device according to claim 1, characterized in that the dense expansion fusion convolution block fuses all feature maps from the feature extraction module, the feature map size is unchanged after each convolution layer in the block, and the number of channels of the block's output feature map is 1.
4. A construction method of an expansion fully convolutional neural network, characterized by comprising the following steps:
Step 1, selecting a convolutional neural network: remove the fully connected layers and classification layer used for classification, leaving only the intermediate convolution and pooling layers, from which feature maps are extracted;
Step 2, constructing a feature extraction module: the feature extraction module comprises several expansion up-sampling modules connected in series, each consisting of a feature map merging layer, an expansion convolution layer, and a deconvolution layer;
Step 3, constructing a feature fusion module: the feature fusion module comprises a dense expansion fusion convolution block, a deconvolution layer, and an activation function; the dense expansion fusion convolution block fuses all feature maps from the feature extraction module, the deconvolution layer restores the output image to the size of the original input image, and the result map is output after the activation function is applied; the dense expansion fusion convolution block comprises 5 convolution layers, the input of each convolution layer coming from the outputs of all convolution layers before it; the first 3 are expansion convolution layers whose expansion factors increase successively and are all smaller than 16; the last 2 are ordinary convolution layers; the feature map size is unchanged after passing through the 5 convolution layers.
5. The construction method according to claim 4, characterized in that the feature extraction module in step 2 comprises M expansion up-sampling modules connected in series: the first expansion up-sampling module extracts a feature map from the output of the first convolution layer before the third down-sampling in the convolutional neural network, the second extracts one from the output of the first convolution layer before the fourth down-sampling, and so on, until the Mth expansion up-sampling module extracts a feature map from the output of the first convolution layer before the (M+2)th down-sampling; from the Mth expansion up-sampling module to the 1st, the expansion factors of the expansion convolution layers decrease successively and are all smaller than 16; the up-sampling factor of the deconvolution layer in every expansion up-sampling module is 2.
6. The construction method according to claim 4, characterized in that the dense expansion fusion convolution block in step 3 fuses all feature maps from the feature extraction module, the feature map size is unchanged after each convolution layer in the block, and the number of channels of the block's output feature map is 1.
7. The construction method according to claim 4, characterized in that the activation function in step 3 is selected according to the specific task: if the network is trained for image semantic segmentation, the activation function is a softmax classification function; if it is trained for saliency detection, the activation function is a sigmoid function.
CN201810470228.5A 2018-05-16 2018-05-16 Expansion full-convolution neural network device and construction method thereof Active CN108717569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810470228.5A CN108717569B (en) 2018-05-16 2018-05-16 Expansion full-convolution neural network device and construction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810470228.5A CN108717569B (en) 2018-05-16 2018-05-16 Expansion full-convolution neural network device and construction method thereof

Publications (2)

Publication Number Publication Date
CN108717569A CN108717569A (en) 2018-10-30
CN108717569B 2022-03-22

Family

ID=63900129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810470228.5A Active CN108717569B (en) 2018-05-16 2018-05-16 Expansion full-convolution neural network device and construction method thereof

Country Status (1)

Country Link
CN (1) CN108717569B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109615059B (en) * 2018-11-06 2020-12-25 海南大学 Edge filling and filter expansion operation method and system in convolutional neural network
CN109522966B (en) * 2018-11-28 2022-09-27 中山大学 Target detection method based on dense connection convolutional neural network
CN109492612A (en) * 2018-11-28 2019-03-19 平安科技(深圳)有限公司 Fall detection method and its falling detection device based on skeleton point
CN111292301A (en) * 2018-12-07 2020-06-16 北京市商汤科技开发有限公司 Focus detection method, device, equipment and storage medium
US10762393B2 (en) * 2019-01-31 2020-09-01 StradVision, Inc. Learning method and learning device for learning automatic labeling device capable of auto-labeling image of base vehicle using images of nearby vehicles, and testing method and testing device using the same
CN109961095B (en) * 2019-03-15 2023-04-28 深圳大学 Image labeling system and method based on unsupervised deep learning
CN110110782A (en) * 2019-04-30 2019-08-09 南京星程智能科技有限公司 Retinal fundus images optic disk localization method based on deep learning
CN110189282A (en) * 2019-05-09 2019-08-30 西北工业大学 Based on intensive and jump connection depth convolutional network multispectral and panchromatic image fusion method
CN110464611A (en) * 2019-07-23 2019-11-19 苏州国科视清医疗科技有限公司 A kind of digitlization amblyopia enhancing training device and system and its related algorithm
CN110473173A (en) * 2019-07-24 2019-11-19 熵智科技(深圳)有限公司 A kind of defect inspection method based on deep learning semantic segmentation
CN110956194A (en) * 2019-10-10 2020-04-03 深圳先进技术研究院 Three-dimensional point cloud structuring method, classification method, equipment and device
CN111047569B (en) * 2019-12-09 2023-11-24 北京联合大学 Image processing method and device
CN111144269B (en) * 2019-12-23 2023-11-24 威海北洋电气集团股份有限公司 Signal correlation behavior recognition method and system based on deep learning
CN111415000B (en) * 2020-04-29 2024-03-22 Oppo广东移动通信有限公司 Convolutional neural network, and data processing method and device based on convolutional neural network
CN111738338B (en) * 2020-06-23 2021-06-18 征图新视(江苏)科技股份有限公司 Defect detection method applied to motor coil based on cascaded expansion FCN network
CN112101214B (en) * 2020-09-15 2022-08-26 重庆市农业科学院 Network structure for rapidly counting tea plant bugs based on thermodynamic diagrams
CN112381131A (en) * 2020-11-10 2021-02-19 中国地质大学(武汉) Rock slice identification method, device, equipment and storage medium
CN112541878A (en) * 2020-12-24 2021-03-23 北京百度网讯科技有限公司 Method and device for establishing image enhancement model and image enhancement
CN112686377B (en) * 2021-03-18 2021-07-02 北京地平线机器人技术研发有限公司 Method and device for carrying out deconvolution processing on feature data by utilizing convolution hardware
CN113098862A (en) * 2021-03-31 2021-07-09 昆明理工大学 Intrusion detection method based on combination of hybrid sampling and expansion convolution
CN112800691B (en) * 2021-04-15 2021-07-30 中国气象局公共气象服务中心(国家预警信息发布中心) Method and device for constructing precipitation level prediction model
CN113139543B (en) * 2021-04-28 2023-09-01 北京百度网讯科技有限公司 Training method of target object detection model, target object detection method and equipment
CN113762476B (en) * 2021-09-08 2023-12-19 中科院成都信息技术股份有限公司 Neural network model for text detection and text detection method thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934456A (en) * 2017-03-16 2017-07-07 山东理工大学 A kind of depth convolutional neural networks model building method
CN107016664A (en) * 2017-01-18 2017-08-04 华侨大学 A kind of bad pin flaw detection method of large circle machine
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107203999A (en) * 2017-04-28 2017-09-26 北京航空航天大学 A kind of skin lens image automatic division method based on full convolutional neural networks
CN107844795A (en) * 2017-11-18 2018-03-27 中国人民解放军陆军工程大学 Convolutional neural networks feature extracting method based on principal component analysis
CN107977968A (en) * 2017-12-22 2018-05-01 长江勘测规划设计研究有限责任公司 The building layer detection method excavated based on buildings shadow information
CN108021923A (en) * 2017-12-07 2018-05-11 维森软件技术(上海)有限公司 A kind of image characteristic extracting method for deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9933264B2 (en) * 2015-04-06 2018-04-03 Hrl Laboratories, Llc System and method for achieving fast and reliable time-to-contact estimation using vision and range sensor data for autonomous navigation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016664A (en) * 2017-01-18 2017-08-04 华侨大学 A kind of bad pin flaw detection method of large circle machine
CN106934456A (en) * 2017-03-16 2017-07-07 山东理工大学 A kind of depth convolutional neural networks model building method
CN107203999A (en) * 2017-04-28 2017-09-26 北京航空航天大学 A kind of skin lens image automatic division method based on full convolutional neural networks
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107844795A (en) * 2017-11-18 2018-03-27 中国人民解放军陆军工程大学 Convolutional neural networks feature extracting method based on principal component analysis
CN108021923A (en) * 2017-12-07 2018-05-11 维森软件技术(上海)有限公司 A kind of image characteristic extracting method for deep neural network
CN107977968A (en) * 2017-12-22 2018-05-01 长江勘测规划设计研究有限责任公司 The building layer detection method excavated based on buildings shadow information

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Densely Connected Convolutional Networks;Gao Huang et al;《2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)》;20170726;2261-2269 *
RDFNet: RGB-D Multi-Level Residual Feature Fusion for Indoor Semantic Segmentation;Seong-Jin Park et al;《Proceedings of the IEEE International Conference on Computer Vision (ICCV)》;20171231;4980-4989 *
Research on semantic salient region detection based on fully convolutional networks; 曹铁勇 et al.; Acta Electronica Sinica (电子学报); 2017-11-15; Vol. 45, No. 11; 2593-2601 *
Crowd counting based on multi-scale fully convolutional network feature fusion; 彭山珍 et al.; Journal of Wuhan University (Natural Science Edition) (武汉大学学报(理学版)); 2018-03; Vol. 64, No. 3; 249-254 *
Spatio-temporal multi-task neural network for video action detection; 刘垚; China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑); 2018-01-15; No. 1; I138-1171 *

Also Published As

Publication number Publication date
CN108717569A (en) 2018-10-30

Similar Documents

Publication Publication Date Title
CN108717569B (en) Expansion full-convolution neural network device and construction method thereof
CN108596330B (en) Parallel characteristic full-convolution neural network device and construction method thereof
CN110232394B (en) Multi-scale image semantic segmentation method
CN108549893B (en) End-to-end identification method for scene text with any shape
CN109461157B (en) Image semantic segmentation method based on multistage feature fusion and Gaussian conditional random field
CN112446383B (en) License plate recognition method and device, storage medium and terminal
Jiao et al. A configurable method for multi-style license plate recognition
CN111428781A (en) Remote sensing image ground object classification method and system
CN108805874B (en) Multispectral image semantic cutting method based on convolutional neural network
CN112528976B (en) Text detection model generation method and text detection method
CN114118124B (en) Image detection method and device
CN107564009B (en) Outdoor scene multi-target segmentation method based on deep convolutional neural network
CN113569865B (en) Single sample image segmentation method based on class prototype learning
CN106682628B (en) Face attribute classification method based on multilayer depth feature information
Narang et al. Devanagari ancient documents recognition using statistical feature extraction techniques
CN110298841B (en) Image multi-scale semantic segmentation method and device based on fusion network
CN109492640A (en) Licence plate recognition method, device and computer readable storage medium
van den Brand et al. Instance-level segmentation of vehicles by deep contours
CN114782705A (en) Method and device for detecting closed contour of object
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN111275732B (en) Foreground object image segmentation method based on depth convolution neural network
CN113313162A (en) Method and system for detecting multi-scale feature fusion target
CN112767280A (en) Single image raindrop removing method based on loop iteration mechanism
CN111914947A (en) Image instance segmentation method, device and equipment based on feature fusion and storage medium
Edan Cuneiform symbols recognition based on k-means and neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant