CN113421269A - Real-time semantic segmentation method based on double-branch deep convolutional neural network - Google Patents

Real-time semantic segmentation method based on double-branch deep convolutional neural network Download PDF

Info

Publication number
CN113421269A
Authority
CN
China
Prior art keywords
feature
layer
prediction
image
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110640607.6A
Other languages
Chinese (zh)
Inventor
刘悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ruiyi Intelligent Technology Co ltd
Original Assignee
Nanjing Ruiyi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Ruiyi Intelligent Technology Co ltd filed Critical Nanjing Ruiyi Intelligent Technology Co ltd
Priority to CN202110640607.6A priority Critical patent/CN113421269A/en
Publication of CN113421269A publication Critical patent/CN113421269A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle

Abstract

The invention discloses a real-time semantic segmentation method based on a double-branch deep convolutional neural network. The method comprises the following steps: preprocessing the Cityscapes urban-scene semantic segmentation dataset; retraining the deep convolutional neural network ResNet on the dataset and extracting deep semantic features; designing a global branch composed of normalization convolutional layers, applying a normalization convolution to the feature maps from different ResNet stages to obtain feature maps of identical dimensions, and combining them along the channel dimension; sharing the feature information of different stages of the ResNet residual network through shared feature layers and pooling layers to construct a local branch rich in detail information; designing a feature merging module that fuses the feature maps of the global branch and the local branch and integrates feature information at different scales to obtain the final prediction map; mapping the prediction map back to the resolution of the original image with an upsampling operation; and classifying each pixel of the One-Hot-encoded prediction map with a Softmax classification layer to obtain the final image segmentation result. The invention increases the speed of segmentation prediction of deep convolutional networks on high-resolution images and improves both semantic segmentation accuracy and segmentation speed.

Description

Real-time semantic segmentation method based on double-branch deep convolutional neural network
Technical Field
The invention relates to the field of deep learning of computer vision, in particular to a real-time semantic segmentation method based on a double-branch deep convolutional neural network.
Background
Processing and analyzing images with computers is a central goal of machine vision and a highly challenging task. The human visual system rapidly parses the information captured by the eyes and, with many layers of cortical neurons, interprets the whole scene. By combining computers with the Deep Convolutional Neural Networks (DCNN) that have developed rapidly in recent years, image semantic information can likewise be extracted and analyzed: a pixel-wise mapping from the feature map back to the original image is learned, the boundaries between different region blocks in the image are segmented, and the whole scene is finally parsed. This capability has great research significance in fields such as medical image analysis, geographic remote-sensing image analysis, and autonomous driving.
From a methodological point of view, semantic segmentation falls into two categories: the first comprises classical image segmentation algorithms based on traditional image processing; the second comprises deep learning algorithms based on convolutional neural networks. In the 1960s and 1970s, image segmentation remained at the traditional stage, relying on simple image features. Prewitt et al. computed one or more gray-level thresholds from the gray-level features of an image, compared the gray value of each pixel against the thresholds, and assigned each pixel to the appropriate category according to the comparison. Boykov and Rother et al. proposed GraphCut and GrabCut, respectively, graph-theoretic image segmentation methods that relate the segmentation problem to the min-cut problem on a graph. The essence of graph-based segmentation is to remove specific edges so that the graph splits into several subgraphs, thereby realizing segmentation. Such manually designed operators usually extract only a single feature and therefore cannot fully represent the characteristics of an object.
Deep learning has earned its place in machine vision through its excellent feature representation. By constructing convolutional neural networks with deep structures, it obtains deep semantic features and generalizes more strongly. In 2015, Badrinarayanan et al. proposed the real-time semantic segmentation network SegNet, a typical Encoder-Decoder architecture. SegNet builds its encoder from the VGG-16 convolutional architecture with the fully connected layers removed, producing low-resolution feature maps that a decoder with a symmetric structure then maps to pixel-level predictions. The decoder consists of a series of upsampling and convolutional layers and ends in a Softmax classifier that predicts a label for every pixel, so the output reaches the same resolution as the input image. SegNet has a clearly symmetric structure, with one decoder layer for each encoder layer. Unlike the pooling in a fully convolutional network, the pooling in SegNet's encoder additionally records locations: because the values discarded by max-pooling cannot be recovered afterwards, SegNet stores the position of the maximum in each pooling window during encoding and uses the corresponding max-pooling indices in the decoder to realize non-linear upsampling. This avoids the heavy computation of deconvolution-based upsampling in fully convolutional networks and removes the need to store whole feature maps from the encoding stage, improving computational efficiency. Although the model has the advantage of real-time inference, its semantic segmentation accuracy still leaves room for improvement.
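For illustration, the index-based unpooling that SegNet relies on can be sketched in a few lines of PyTorch (a hypothetical minimal example; the tensor shape is an assumption, and this is not the patent's own code):

```python
import torch
import torch.nn.functional as F

# Hypothetical encoder feature map (batch, channels, height, width).
x = torch.randn(1, 64, 128, 256)

# Encoder side: max-pool while recording the argmax locations.
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)

# Decoder side: non-linear upsampling that routes each pooled value back
# to its recorded location; no deconvolution weights are required.
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(pooled.shape, unpooled.shape)  # (1, 64, 64, 128) then (1, 64, 128, 256)
```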
Disclosure of Invention
The invention aims to provide a real-time semantic segmentation method based on a double-branch deep convolutional neural network that is both fast and accurate.
The technical solution for realizing the purpose of the invention is as follows: a real-time semantic segmentation method based on a double-branch deep convolutional neural network comprises the following steps:
step 1, preprocessing the Cityscapes urban-scene semantic segmentation dataset to obtain the original images in the dataset;
step 2, retraining the deep convolutional neural network ResNet on the dataset and extracting deep semantic features;
step 3, designing a global branch composed of normalization convolutional layers, and applying a normalization convolution to the feature maps from different ResNet stages to obtain feature maps of identical dimensions for channel-dimension combination;
step 4, sharing the feature information of different stages of the ResNet residual network through shared feature layers and pooling layers, and constructing a local branch rich in detail information;
step 5, designing a feature merging module that fuses the feature maps of the global branch and the local branch and integrates feature information at different scales to obtain the final prediction map;
step 6, using an upsampling operation to map the prediction map back to the resolution of the original image;
step 7, using a Softmax classification layer to classify each pixel of the One-Hot-encoded prediction map, finally obtaining the image segmentation result.
further, step 2 retrains the deep convolutional neural network ResNet on the data set, and extracts deep semantic features, which is specifically as follows:
training a ResNet-18 residual neural network model on a preprocessed large-scale high-resolution city landscape City semantic segmentation data set, using the model as an extractor of deep semantic features, performing class prediction on each pixel, calculating cross entropy loss, and training by combining a back propagation algorithm, wherein a loss function corresponding to each pixel is as follows:
Figure BDA0003107458860000021
wherein pixel _ loss represents the loss of each pixel after being calculated by a convolutional neural network, classes represents all the prediction categories of the semantic segmentation model, and ytrueRepresenting a One-Hot matrix, wherein each element corresponds to One-Hot vector in the matrix, the elements only have two values of 0 and 1, if the category is the same as the sample category, the value is 1, and if the category is not consistent with the sample, the value is 0, ypredRepresenting the probability of the prediction sample belonging to the current class;
Figure BDA0003107458860000031
wherein bp _ loss represents the total loss of the back propagation of the whole image, w and h respectively represent the corresponding width and height of the whole image, and pixel _ lossijIndicating the loss of the pixel corresponding to the ith row and the jth column;
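As an illustrative sketch of these two formulas (assuming a PyTorch setting; the tensor names and shapes are hypothetical, not taken from the patent):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: logits (classes, h, w); one-hot labels (classes, h, w).
classes, h, w = 19, 64, 128
logits = torch.randn(classes, h, w, requires_grad=True)
labels = torch.randint(classes, (h, w))
y_true = F.one_hot(labels, classes).permute(2, 0, 1).float()

y_pred = torch.softmax(logits, dim=0)             # predicted probability per class
pixel_loss = -(y_true * y_pred.log()).sum(dim=0)  # loss of each pixel, shape (h, w)
bp_loss = pixel_loss.sum()                        # total back-propagated loss
bp_loss.backward()                                # training step via back-propagation
```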
further, step 3 is to design a global branch formed by the normalized convolutional layers, and perform normalized convolution operation on the feature maps of different stages of the ResNet respectively to obtain feature maps of the same dimension for channel dimension combination, specifically as follows:
the deep convolutional neural network is realized by utilizing the residual block in the residual network ResNet, and meanwhile, the overfitting phenomenon caused by deepening of a network layer can be avoided, wherein the characteristic mapping realized by the residual block is as follows:
Figure BDA0003107458860000032
wherein x is the input feature map of the residual block, F (x) represents the feature mapping function implemented by the residual block,
Figure BDA0003107458860000033
representing the output signature after passing through a residual block, which allows the network to converge more quickly.
Normalization convolutional layers are placed at different stages of the ResNet residual network; using the convolution operation, feature maps with different channel dimensions and different spatial dimensions are normalized to feature maps of exactly the same size, realizing the feature fusion of high-dimensional and low-dimensional feature maps, where the normalization convolution is defined as:

y_{c,i,j} = \sum_{k_i} \sum_{k_j} w(k_c, 0, k_i, k_j) \, x(k_c, i+k_i, j+k_j) + b_{k,c}

where k, c, i, j denote the kth network layer and the cth channel, ith row, and jth column of the corresponding feature map; y_{c,i,j} is the feature value of the pixel at the corresponding position of the output feature map; w(k_c, 0, k_i, k_j) denotes the weight parameters of the convolution kernel in the convolution operation; x(k_c, i+k_i, j+k_j) denotes the feature values of the input feature map covered by the convolution kernel; and b_{k,c} denotes the bias parameter of the cth channel at the kth network layer.
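The normalization of stage features to a common dimensionality followed by channel-dimension combination could be sketched as follows (hypothetical; the target size of 64x128 and the common channel count of 128 are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical ResNet stage outputs at decreasing resolutions.
stages = [torch.randn(1, c, s, 2 * s) for c, s in [(64, 64), (128, 32), (256, 16), (512, 8)]]

# One "normalization" convolution per stage maps every feature map to the
# same channel dimension; bilinear resizing unifies the spatial dimensions.
convs = nn.ModuleList(nn.Conv2d(c, 128, kernel_size=1) for c in (64, 128, 256, 512))
normalized = [
    F.interpolate(conv(x), size=(64, 128), mode="bilinear", align_corners=False)
    for conv, x in zip(convs, stages)
]
merged = torch.cat(normalized, dim=1)  # channel-dimension combination -> (1, 512, 64, 128)
```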
Further, in step 4 the shared feature layers and pooling layers are used to share the feature information of different stages of the ResNet residual network and to construct a local branch rich in detail information, specifically as follows:
feature maps are extracted at different stages of the residual network and learned with network layers such as pooling layers and upsampling layers, extracting rich image detail information as follows:

f_{i,j}(s)_{max} = \max_{0 \le m,n < K} s_{i+m, j+n}

f_{i,j}(s)_{avg} = \frac{1}{K^2} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} s_{i+m, j+n}

where f_{i,j}(s)_max denotes the max-pooling feature value; f_{i,j}(s)_avg denotes the average-pooling feature value; K denotes the size of the pooling kernel; i, j index the feature value computed at the ith row and jth column for the corresponding kernel position; and max and average denote the maximum and averaging operations, respectively.
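The pooling operations of the local branch can be sketched as follows (illustrative; the window size K = 2 and the additive combination of the two pooled maps are assumptions, not the patent's exact design):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 128, 64, 128)                  # a shared stage feature map
f_max = F.max_pool2d(x, kernel_size=2, stride=2)  # f_{i,j}(s)_max per window
f_avg = F.avg_pool2d(x, kernel_size=2, stride=2)  # f_{i,j}(s)_avg per window
# One possible way to recover the spatial size with an upsampling layer.
detail = F.interpolate(f_max + f_avg, scale_factor=2, mode="bilinear",
                       align_corners=False)
```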
Further, the upsampling operation of step 6 realizes the mapping from the prediction map back to the resolution of the original image:

f(x, y) \approx f(0,0)(1-x)(1-y) + f(1,0)\,x(1-y) + f(0,1)(1-x)\,y + f(1,1)\,xy

where x and y denote the abscissa and ordinate of the point in the (unit-square) coordinate system, and f(0,0), f(0,1), f(1,0), f(1,1) denote the values at the four known coordinate points used by the bilinear interpolation.
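In code, this mapping back to the original resolution reduces to a single bilinear interpolation call (illustrative; the 2048x1024 Cityscapes resolution and the 19-class output are assumptions):

```python
import torch
import torch.nn.functional as F

pred = torch.randn(1, 19, 128, 256)  # low-resolution prediction map, 19 classes
full = F.interpolate(pred, size=(1024, 2048), mode="bilinear",
                     align_corners=False)  # back to original image resolution
```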
Further, step 7 uses a Softmax classification layer to classify each pixel of the One-Hot-encoded prediction map, finally obtaining the image segmentation result:

P_i = \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}}

where P_i denotes the probability value of the ith target, k denotes the index of the current prediction category, K denotes the number of prediction categories of the semantic segmentation model, and a_i denotes the feature value of the ith target.
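The final per-pixel classification can be sketched as follows (illustrative):

```python
import torch

full = torch.randn(1, 19, 1024, 2048)  # upsampled prediction map (logits a_i)
probs = torch.softmax(full, dim=1)     # P_i = exp(a_i) / sum_k exp(a_k)
segmentation = probs.argmax(dim=1)     # class index per pixel -> (1, 1024, 2048)
```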
Compared with the prior art, the invention has the following notable advantages: (1) the easily trained ResNet-18 residual neural network is used as the feature extractor, which improves the model's ability to represent images, eases convergence, and raises the accuracy of the image segmentation algorithm; (2) through the corresponding convolutional network layers, the distinctive double-branch network architecture, and the feature merging strategy, the algorithm maintains segmentation accuracy while increasing segmentation prediction speed; (3) the method realizes real-time semantic segmentation and scene parsing of high-resolution video material and can be applied in fields such as autonomous driving.
Drawings
FIG. 1 is a flow chart of a real-time semantic segmentation method based on a dual-branch deep convolutional neural network according to the present invention.
Fig. 2 shows the results of the real-time semantic segmentation experiment on the Cityscapes urban-scene dataset, where (a) is an original image from the dataset, (b) is the ground-truth label map corresponding to the original image, and (c) is the segmentation map predicted for the original image by the semantic segmentation model.
Detailed Description
The invention discloses a real-time semantic segmentation method based on a double-branch deep convolutional neural network. First, the original label files and images of the Cityscapes dataset are processed to produce the corresponding training label images and to assemble the training dataset; a ResNet deep convolutional neural network is then retrained on this dataset to extract deep semantic features; a global branch composed of normalization convolutional layers is designed, and a normalization convolution is applied to the feature maps from the different ResNet stages to obtain feature maps of identical dimensions for channel-dimension combination; shared feature layers and pooling layers share the feature information of the different stages of the ResNet residual network to construct a local branch rich in detail information; a feature merging module fuses the feature maps of the global and local branches, integrating feature information at different scales to obtain the final prediction map; an upsampling operation maps the prediction map back to the resolution of the original image; and a Softmax classification layer classifies each pixel of the One-Hot-encoded prediction map to obtain the final image segmentation result.
The invention is described in further detail below with reference to the figures and the embodiments.
Examples
The real-time semantic segmentation method based on a double-branch deep convolutional neural network mainly comprises three components: the first is a global branch constructed from ResNet-18 as the backbone together with convolutional layers; the second is a local branch constructed from the feature maps of the different stages of the shared ResNet-18 model together with pooling layers; the third is a feature merging module that fuses the prediction maps of the global branch and the local branch. With reference to Fig. 1, the detailed steps are as follows (an end-to-end code sketch follows the step list):
step 1, preprocessing the Cityscapes urban-scene semantic segmentation dataset to obtain the original images in the dataset;
step 2, retraining the deep convolutional neural network ResNet on the dataset and extracting deep semantic features;
step 3, designing a global branch composed of normalization convolutional layers, and applying a normalization convolution to the feature maps from different ResNet stages to obtain feature maps of identical dimensions for channel-dimension combination;
step 4, sharing the feature information of different stages of the ResNet residual network through shared feature layers and pooling layers, and constructing a local branch rich in detail information;
step 5, designing a feature merging module that fuses the feature maps of the global branch and the local branch and integrates feature information at different scales to obtain the final prediction map;
step 6, using an upsampling operation to map the prediction map back to the resolution of the original image;
step 7, using a Softmax classification layer to classify each pixel of the One-Hot-encoded prediction map, finally obtaining the image segmentation result.
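Under the assumptions used in the sketches above, the three components might be wired together as follows (a minimal hypothetical pipeline in PyTorch; the channel counts, the pooling in the local branch, and the merge convolution are illustrative choices, not the patent's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class DoubleBranchSegNet(nn.Module):
    """Sketch: ResNet-18 backbone shared by a global and a local branch."""
    def __init__(self, num_classes: int = 19):
        super().__init__()
        r = resnet18()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        # Global branch: one normalization conv per backbone stage.
        self.norm_convs = nn.ModuleList(
            nn.Conv2d(c, 128, 1) for c in (64, 128, 256, 512))
        # Feature merging module (assumed: one conv after concatenation).
        self.merge = nn.Conv2d(128 * 4 + 64, num_classes, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        f = self.stem(x)
        feats = []
        for stage in self.stages:       # shared feature layers
            f = stage(f)
            feats.append(f)
        size = feats[0].shape[2:]
        # Global branch: normalize all stages to one shape, concat on channels.
        g = torch.cat([F.interpolate(conv(z), size=size, mode="bilinear",
                                     align_corners=False)
                       for conv, z in zip(self.norm_convs, feats)], dim=1)
        # Local branch: pooling over the shared shallow stage for detail.
        l = F.max_pool2d(feats[0], 3, stride=1, padding=1)
        pred = self.merge(torch.cat([g, l], dim=1))
        # Bilinear upsampling to input resolution, then per-pixel Softmax.
        return F.interpolate(pred, size=(h, w), mode="bilinear",
                             align_corners=False).softmax(dim=1)

out = DoubleBranchSegNet()(torch.randn(1, 3, 512, 1024))
```

The sketch mirrors the flow of Fig. 1: shared backbone features feed both branches, the merged prediction map is upsampled bilinearly, and the Softmax layer yields per-pixel class probabilities.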
further, step 2 retrains the deep convolutional neural network ResNet on the data set, and extracts deep semantic features, which is specifically as follows:
training a ResNet-18 residual neural network model on a preprocessed large-scale high-resolution city landscape City semantic segmentation data set, using the model as an extractor of deep semantic features, performing class prediction on each pixel, calculating cross entropy loss, and training by combining a back propagation algorithm, wherein a loss function corresponding to each pixel is as follows:
Figure BDA0003107458860000081
wherein pixel _ loss represents the loss of each pixel after being calculated by a convolutional neural network, classes represents all the prediction categories of the semantic segmentation model, and ytrueRepresenting an One-Hot matrix, each element corresponding to One of the matrixThe elements of the One-Hot vector only have two values of 0 and 1, if the category is the same as the category of the sample, the category is 1, and if the category is not consistent with the sample, the category is 0, ypredRepresenting the probability of the prediction sample belonging to the current class;
Figure BDA0003107458860000082
wherein bp _ loss represents the total loss of the back propagation of the whole image, w and h respectively represent the corresponding width and height of the whole image, and pixel _ lossijIndicating the loss of the pixel corresponding to the ith row and the jth column;
further, step 3 is to design a global branch formed by the normalized convolutional layers, and perform normalized convolution operation on the feature maps of different stages of the ResNet respectively to obtain feature maps of the same dimension for channel dimension combination, specifically as follows:
the deep convolutional neural network is realized by utilizing the residual block in the residual network ResNet, and meanwhile, the overfitting phenomenon caused by deepening of a network layer can be avoided, wherein the characteristic mapping realized by the residual block is as follows:
Figure BDA0003107458860000083
wherein x is the input feature map of the residual block, F (x) represents the feature mapping function implemented by the residual block,
Figure BDA0003107458860000084
representing the output signature after passing through a residual block, which allows the network to converge more quickly.
Normalizing convolutional layers at different stages of the ResNet residual error network, normalizing feature maps with different channel dimensions and different space dimensions to completely same size feature maps by utilizing convolution operation, and realizing feature fusion of a high-dimensional feature map and a low-dimensional feature map, wherein the normalization convolution operation is defined as:
Figure BDA0003107458860000091
wherein k, c, i, j are respectively a characteristic diagram c channel, an ith row, a jth column and y corresponding to the kth layerc,i,jFor outputting the characteristic value of the pixel at the corresponding position of the characteristic map, w (k)c,0,ki,kj) Weight parameter, x (k), representing the convolution kernel in a convolution operationc,i+ki,j+kj) A feature value representing the size of the convolution kernel corresponding to the input feature map in the convolution operation,
Figure BDA0003107458860000092
indicating the bias parameters for the c-th channel at the k-th layer network layer.
Further, in step 4, the shared feature layer and the pooling layer are used to share feature information of different stages in the ResNet residual error network, and a local branch with rich detail information is constructed, specifically as follows:
extracting feature maps at different stages of a residual error network, learning the feature maps by utilizing network layers such as a pooling layer and an upsampling layer, and extracting rich image detail information as follows:
Figure BDA0003107458860000093
Figure BDA0003107458860000094
wherein f isi,j(s)maxRepresenting a maximum pooling operation characteristic value; f. ofi,j(s)avgRepresenting an average pooling operation characteristic value; k represents the size of the convolution kernel; i, j represents the calculation characteristic value of the ith row and the jth column of the corresponding convolution kernel; max, average denote the maximum and average operations, respectively.
Further, the up-sampling operation in step 6 is used to implement the mapping transformation from the prediction image to the resolution of the original image:
Figure BDA0003107458860000095
wherein x and y respectively represent the abscissa and the ordinate of the midpoint of the coordinate system; f (0,0), f (0,1), f (1,0), f (1,1) represent the coordinates of four known coordinate points of the bilinear interpolation operation.
Further, the step 7 of performing classification prediction on each pixel in the prediction map of One-Hot encoding by using the Softmax classification layer, and finally obtaining an image segmentation result:
Figure BDA0003107458860000101
wherein, PiRepresenting the probability value of the ith target, k representing the index value of the current prediction category, k representing the number of prediction categories of the semantic segmentation model, aiRepresenting the eigenvalues of the ith target.
Fig. 2 shows the effect of semantic segmentation and scene parsing on an image from the urban-scene dataset: Fig. 2(a) is an image selected from the Cityscapes dataset whose prediction categories are pedestrians, vehicles, and roads; Fig. 2(b) is the ground-truth segmentation map corresponding to the original image, where different region blocks are distinguished by different colors; and Fig. 2(c) is the segmentation prediction map of the image after passing through the real-time semantic segmentation model.

Claims (6)

1. A real-time semantic segmentation method based on a double-branch deep convolutional neural network, characterized by comprising the following steps:
step 1, preprocessing an urban-scene semantic segmentation dataset to obtain the original images in the dataset;
step 2, retraining the deep convolutional neural network ResNet on the dataset and extracting deep semantic features;
step 3, designing a global branch composed of normalization convolutional layers, and applying a normalization convolution to the feature maps from different ResNet stages to obtain feature maps of identical dimensions for channel-dimension combination;
step 4, sharing the feature information of different stages of the ResNet residual network through shared feature layers and pooling layers, and constructing a local branch rich in detail information;
step 5, designing a feature merging module that fuses the feature maps of the global branch and the local branch and integrates feature information at different scales to obtain the final prediction map;
step 6, using an upsampling operation to map the prediction map back to the resolution of the original image;
step 7, using a Softmax classification layer to classify each pixel of the One-Hot-encoded prediction map, finally obtaining the image segmentation result.
2. The real-time semantic segmentation method based on the double-branch deep convolutional neural network according to claim 1, characterized in that step 2 retrains the deep convolutional neural network ResNet on the dataset and extracts deep semantic features, specifically as follows:
a ResNet-18 residual neural network model is trained on the preprocessed large-scale, high-resolution Cityscapes urban-scene semantic segmentation dataset and used as the extractor of deep semantic features; class prediction is performed for each pixel, the cross-entropy loss is calculated, and training is carried out with the back-propagation algorithm, the loss corresponding to each pixel being:

pixel_loss = -\sum_{c=1}^{classes} y_{true,c} \log(y_{pred,c})

where pixel_loss denotes the loss of each pixel after computation by the convolutional neural network, classes denotes all prediction categories of the semantic segmentation model, y_true denotes a One-Hot matrix in which each element belongs to a One-Hot vector and takes only the values 0 and 1 (1 if the category is the same as the sample category, 0 otherwise), and y_pred denotes the predicted probability that the sample belongs to the current category;

bp_loss = \sum_{i=1}^{h} \sum_{j=1}^{w} pixel_loss_{i,j}

where bp_loss denotes the total back-propagated loss of the whole image, w and h denote the width and height of the image, and pixel_loss_{i,j} denotes the loss of the pixel in the ith row and jth column.
3. The real-time semantic segmentation method based on the double-branch deep convolutional neural network according to claim 1, characterized in that in step 3 the designed global branch composed of normalization convolutional layers applies a normalization convolution to the feature maps from different ResNet stages to obtain feature maps of identical dimensions for channel-dimension combination, specifically as follows:
the deep convolutional neural network is built from the residual blocks of the residual network ResNet, which also avoids the overfitting caused by deepening the network, where the feature mapping realized by a residual block is:

y = F(x) + x

where x is the input feature map of the residual block, F(x) denotes the feature mapping function implemented by the residual block, and y denotes the output feature map after the residual block; this residual connection structure makes the network converge faster;
normalization convolutional layers at different stages of the ResNet residual network use the convolution operation to normalize feature maps with different channel dimensions and different spatial dimensions to feature maps of exactly the same size, realizing the feature fusion of high-dimensional and low-dimensional feature maps, where the normalization convolution is defined as:

y_{c,i,j} = \sum_{k_i} \sum_{k_j} w(k_c, 0, k_i, k_j) \, x(k_c, i+k_i, j+k_j) + b_{k,c}

where k, c, i, j denote the kth network layer and the cth channel, ith row, and jth column of the corresponding feature map; y_{c,i,j} is the feature value of the pixel at the corresponding position of the output feature map; w(k_c, 0, k_i, k_j) denotes the weight parameters of the convolution kernel in the convolution operation; x(k_c, i+k_i, j+k_j) denotes the feature values of the input feature map covered by the convolution kernel; and b_{k,c} denotes the bias parameter of the cth channel at the kth network layer.
4. The real-time semantic segmentation method based on the double-branch deep convolutional neural network according to claim 1, characterized in that in step 4 the shared feature layers and pooling layers are used to share the feature information of different stages of the ResNet residual network and to construct a local branch rich in detail information, specifically as follows:
feature maps are extracted at different stages of the residual network and learned with network layers such as pooling layers and upsampling layers, extracting rich image detail information as follows:

f_{i,j}(s)_{max} = \max_{0 \le m,n < K} s_{i+m, j+n}

f_{i,j}(s)_{avg} = \frac{1}{K^2} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} s_{i+m, j+n}

where f_{i,j}(s)_max denotes the max-pooling feature value; f_{i,j}(s)_avg denotes the average-pooling feature value; K denotes the size of the pooling kernel; i, j index the feature value computed at the ith row and jth column for the corresponding kernel position; and max and average denote the maximum and averaging operations, respectively.
5. The real-time semantic segmentation method based on the double-branch deep convolutional neural network according to claim 1, characterized in that step 6 uses an upsampling operation to map the prediction map back to the resolution of the original image:

f(x, y) \approx f(0,0)(1-x)(1-y) + f(1,0)\,x(1-y) + f(0,1)(1-x)\,y + f(1,1)\,xy

where x and y denote the abscissa and ordinate of the point in the coordinate system, and f(0,0), f(0,1), f(1,0), f(1,1) denote the values at the four known coordinate points used by the bilinear interpolation.
6. The real-time semantic segmentation method based on the double-branch deep convolutional neural network according to claim 1, characterized in that step 7 uses a Softmax classification layer to classify each pixel of the One-Hot-encoded prediction map, finally obtaining the image segmentation result:

P_i = \frac{e^{a_i}}{\sum_{k=1}^{K} e^{a_k}}

where P_i denotes the probability value of the ith target, k denotes the index of the current prediction category, K denotes the number of prediction categories of the semantic segmentation model, and a_i denotes the feature value of the ith target.
CN202110640607.6A 2021-06-09 2021-06-09 Real-time semantic segmentation method based on double-branch deep convolutional neural network Pending CN113421269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110640607.6A CN113421269A (en) 2021-06-09 2021-06-09 Real-time semantic segmentation method based on double-branch deep convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110640607.6A CN113421269A (en) 2021-06-09 2021-06-09 Real-time semantic segmentation method based on double-branch deep convolutional neural network

Publications (1)

Publication Number Publication Date
CN113421269A true CN113421269A (en) 2021-09-21

Family

ID=77788042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110640607.6A Pending CN113421269A (en) 2021-06-09 2021-06-09 Real-time semantic segmentation method based on double-branch deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN113421269A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952220A (en) * 2017-03-14 2017-07-14 长沙全度影像科技有限公司 A kind of panoramic picture fusion method based on deep learning
US20200160065A1 (en) * 2018-08-10 2020-05-21 Naver Corporation Method for training a convolutional recurrent neural network and for semantic segmentation of inputted video using the trained convolutional recurrent neural network
CN109711413A (en) * 2018-12-30 2019-05-03 陕西师范大学 Image, semantic dividing method based on deep learning
CN109919869A (en) * 2019-02-28 2019-06-21 腾讯科技(深圳)有限公司 A kind of image enchancing method, device and storage medium
CN111768415A (en) * 2020-06-15 2020-10-13 哈尔滨工程大学 Image instance segmentation method without quantization pooling
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUE LIU; ZHICHAO LIAN: "PSDNet: A Balanced Architecture of Accuracy and Parameters for Semantic Segmentation", 2020 25th International Conference on Pattern Recognition (ICPR) *
刘悦 (Liu Yue): "Research on Road Anomaly Detection Based on Semantic Segmentation", China Master's Theses Full-text Database, Engineering Science and Technology II *
霍雨佳 (Huo Yujia): "Research on Visual Target Tracking and 3D Reconstruction Methods at the End Effector of a Robotic Arm", China Master's Theses Full-text Database, Information Science and Technology *
马天浩; 谭海; 李天琪; 吴雅男; 刘祺 (Ma Tianhao; Tan Hai; Li Tianqi; Wu Yanan; Liu Qi): "Road Extraction from GF-1 Imagery Using a Dilated-Convolution Residual Network with Multi-scale Feature Fusion", Laser & Optoelectronics Progress *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688836A (en) * 2021-09-28 2021-11-23 四川大学 Real-time road image semantic segmentation method and system based on deep learning
CN114224354A (en) * 2021-11-15 2022-03-25 吉林大学 Arrhythmia classification method, device and readable storage medium
CN114224354B (en) * 2021-11-15 2024-01-30 吉林大学 Arrhythmia classification method, arrhythmia classification device, and readable storage medium
CN114399519A (en) * 2021-11-30 2022-04-26 西安交通大学 MR image 3D semantic segmentation method and system based on multi-modal fusion
CN114399519B (en) * 2021-11-30 2023-08-22 西安交通大学 MR image 3D semantic segmentation method and system based on multi-modal fusion
CN114332715A (en) * 2021-12-30 2022-04-12 武汉华信联创技术工程有限公司 Method, device and equipment for identifying snow through automatic meteorological observation and storage medium
CN114795258A (en) * 2022-04-18 2022-07-29 浙江大学 Child hip joint dysplasia diagnosis system
CN114943963A (en) * 2022-04-29 2022-08-26 南京信息工程大学 Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN115640418A (en) * 2022-12-26 2023-01-24 天津师范大学 Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN116052110A (en) * 2023-03-28 2023-05-02 四川公路桥梁建设集团有限公司 Intelligent positioning method and system for pavement marking defects

Similar Documents

Publication Publication Date Title
CN113421269A (en) Real-time semantic segmentation method based on double-branch deep convolutional neural network
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN112101175A (en) Expressway vehicle detection and multi-attribute feature extraction method based on local images
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN111612008B (en) Image segmentation method based on convolution network
CN108280397B (en) Human body image hair detection method based on deep convolutional neural network
CN111191583B (en) Space target recognition system and method based on convolutional neural network
CN112308860A (en) Earth observation image semantic segmentation method based on self-supervision learning
CN110706239B (en) Scene segmentation method fusing full convolution neural network and improved ASPP module
CN111291809A (en) Processing device, method and storage medium
Özkanoğlu et al. InfraGAN: A GAN architecture to transfer visible images to infrared domain
CN111696110B (en) Scene segmentation method and system
CN110717921B (en) Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN110969171A (en) Image classification model, method and application based on improved convolutional neural network
WO2023030182A1 (en) Image generation method and apparatus
CN111179193B (en) Dermatoscope image enhancement and classification method based on DCNNs and GANs
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN110490155B (en) Method for detecting unmanned aerial vehicle in no-fly airspace
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN113269224A (en) Scene image classification method, system and storage medium
CN113592894A (en) Image segmentation method based on bounding box and co-occurrence feature prediction
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN113592893A (en) Image foreground segmentation method combining determined main body and refined edge
CN111368776B (en) High-resolution remote sensing image classification method based on deep ensemble learning
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination