CN111210443B - Deformable convolution mixing task cascading semantic segmentation method based on embedding balance - Google Patents

Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Info

Publication number
CN111210443B
Authority
CN
China
Prior art keywords
network
deformable
candidate
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010004799.7A
Other languages
Chinese (zh)
Other versions
CN111210443A (en)
Inventor
陈玫玫
王健
吴金洋
曾博义
赖子轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202010004799.7A priority Critical patent/CN111210443B/en
Publication of CN111210443A publication Critical patent/CN111210443A/en
Application granted granted Critical
Publication of CN111210443B publication Critical patent/CN111210443B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The invention provides a deformable convolution mixed task cascade semantic segmentation method based on embedded balance, used for realizing image target recognition and semantic segmentation, comprising the following steps: inputting the cropped image into a pre-trained neural network; mapping the low-level and high-level feature maps to the same scale space through a feature pyramid network; performing information fusion on the semantic features extracted at different hierarchies; predicting a pixel-level segmentation result with a convolution layer; adopting a deformable convolutional neural network in the convolution and pooling parts of the feature pyramid network to extract features from the input image and obtain a feature map; dividing the feature map into parts of equal size; and inputting the feature map obtained from the feature pyramid network into a region candidate network for training. The region candidate network comprises a target detection classifier and a candidate-box localization classifier: the target detection classifier outputs a target recognition result and its prediction accuracy, and the candidate-box localization classifier accurately localizes candidate regions and outputs the candidate boxes of a plurality of candidate regions. The invention improves both the localization accuracy and the segmentation accuracy of semantic segmentation.

Description

Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
Technical Field
The invention provides a deformable convolution mixed task cascade semantic segmentation method based on embedded balance, used for realizing image target recognition and semantic segmentation.
Background
In traditional semantic segmentation, the task is defined as dividing an image into several disjoint parts, each with its own semantics; that is, each part contains only one category of object. Among traditional approaches, semantic segmentation based on user interaction has been studied extensively: the user selects a region, the color similarity, texture similarity and edge features between the other regions and the selected region are used as connection weights between regions, and the image is finally segmented with a Conditional Random Field (CRF) or Graph Cut. Such methods have clear disadvantages: when several targets exist in the image, human participation is required and adaptivity is low. A second class of methods is non-parametric: the image (or a part of the image) to be segmented is matched by retrieval against images in a dataset, the labels in the dataset are migrated into the segmentation of the target image, and the final segmented image is obtained through a series of post-processing steps such as a Markov Random Field (MRF). Although these methods need no manual participation, the quality of the matching result depends on the diversity of the dataset and also, to a large degree, on the matching scheme. After the advent of target detection methods such as the Deformable Part Model (DPM), many methods attempted to obtain connected regions in an image using low-level features such as color and texture, detect the targets in the image with the detection method, and finally combine the detection result at each pixel position with the connected regions to obtain a semantic segmentation result.
With the popularization of intelligent terminals such as smartphones and tablet computers, and the growing computing power of the corresponding low-power chips, the demand for computer vision techniques that perform well while occupying few resources continues to grow.
Disclosure of Invention
The invention aims to improve the localization accuracy and the segmentation accuracy of semantic segmentation, and provides a deep-learning-based computer vision method for image processing.
The invention adopts the following technical scheme:
a deformable convolution mixing task cascaded semantic segmentation method based on embedding balance, the method comprising:
inputting the cropped image into a pre-trained neural network;
performing dimensionality reduction on the input image with the 3 × 3 convolution kernels and pooling operations of a feature pyramid network; up-sampling the low-level feature maps and down-sampling the high-level feature maps, mapping both to the same scale space; performing information fusion on the semantic features extracted at different hierarchies; and predicting a pixel-level segmentation result with a convolution layer (a sketch of this resampling-and-fusion step follows this list);
performing feature extraction on the input image with a deformable convolutional neural network in the convolution and pooling parts of the feature pyramid network, where the convolution layer at the start of the deformable convolutional neural network is a deformable convolution layer and the region-of-interest pooling layer is a deformable region-of-interest pooling layer, to obtain a feature map; and dividing the feature map into parts of equal size, their number being the output dimensionality;
inputting the feature map obtained from the feature pyramid network into a region candidate network for training, where the region candidate network comprises a target detection classifier and a candidate-box localization classifier: the target detection classifier outputs a target recognition result and its prediction accuracy, and the candidate-box localization classifier accurately localizes candidate regions and outputs the candidate boxes of a plurality of candidate regions.
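By way of illustration only, the resampling-and-fusion step above can be sketched as follows (PyTorch is assumed; the function name, channel counts and the choice of the middle pyramid level as the common scale space are illustrative assumptions, not values taken from the patent):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def predict_pixel_labels(feats, classifier):
        # feats: pyramid levels from fine (low-level) to coarse (high-level),
        # each of shape (N, C, Hi, Wi); classifier: 1x1 conv giving class scores.
        size = feats[len(feats) // 2].shape[-2:]   # common scale space
        # Down-sample the finer maps and up-sample the coarser maps to this scale.
        aligned = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                   for f in feats]
        fused = torch.stack(aligned).mean(dim=0)   # information fusion across levels
        return classifier(fused)                   # pixel-level prediction

    feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
    logits = predict_pixel_labels(feats, nn.Conv2d(256, 21, kernel_size=1))
    print(logits.shape)  # torch.Size([1, 21, 16, 16])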
Further, in training the region candidate network, a balanced L1 loss function is adopted to adjust the weights of the respective L1 loss functions of a plurality of tasks, the tasks comprising target detection and candidate-box generation.
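A minimal sketch of such a balanced L1 loss is given below (the piecewise form and the defaults alpha = 0.5, gamma = 1.5 follow the published balanced-L1 formulation and are assumptions here; PyTorch is assumed):

    import math
    import torch

    def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5):
        # Balanced L1 loss: boosts the gradient contribution of easy
        # (small-error) samples relative to plain smooth L1.
        x = (pred - target).abs()
        b = math.exp(gamma / alpha) - 1      # makes the gradient continuous at |x| = 1
        C = gamma / b - alpha                # makes the value continuous at |x| = 1
        loss = torch.where(
            x < 1,
            alpha / b * (b * x + 1) * torch.log(b * x + 1) - alpha * x,
            gamma * x + C,
        )
        return loss.mean()

    print(balanced_l1_loss(torch.tensor([0.2, 2.0]), torch.tensor([0.0, 0.0])))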
Further, samples are introduced during training: following overlap-balanced (IoU-balanced) sampling, the sampling interval of the deformable regions of interest is evenly divided into K bins, the N hard samples are evenly distributed across the bins, and samples are then drawn uniformly from each bin.
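For illustration, a sketch of this overlap-balanced sampling, under the assumption that each candidate carries a precomputed IoU against its nearest ground-truth box (the names and the K = 3, max_iou = 0.3 defaults are illustrative):

    import random

    def iou_balanced_sample(candidates, num_samples, K=3, max_iou=0.3):
        # candidates: list of (roi, iou) pairs for the hard negative samples.
        # Split the IoU range [0, max_iou) into K equal bins and draw the
        # same number of samples from each bin, so low- and high-overlap
        # negatives are equally represented.
        per_bin = num_samples // K
        sampled = []
        for k in range(K):
            lo, hi = max_iou * k / K, max_iou * (k + 1) / K
            bin_rois = [c for c in candidates if lo <= c[1] < hi]
            sampled += random.sample(bin_rois, min(per_bin, len(bin_rois)))
        return sampled

    negs = [(f'roi{i}', random.uniform(0.0, 0.3)) for i in range(100)]
    print(len(iou_balanced_sample(negs, 30)))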
Further, inputting the feature map into the region candidate network for candidate-box localization and classification includes: setting 9 candidate deformable regions of interest for each pixel of the feature map, dividing the deformable regions of interest into foreground and background with a classifier while preliminarily adjusting their positions with a regressor, performing non-maximum suppression, i.e. sorting the deformable regions of interest by classification score, and selecting the first N of them to obtain the candidate boxes.
Further, the deformable regions of interest generated by the region candidate network are mapped onto the feature map extracted by the feature pyramid network to obtain the corresponding 7 × 7 feature maps, and the deformable regions of interest are aligned.
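This mapping-and-alignment step corresponds to standard RoI Align; a short usage sketch with torchvision follows (the spatial_scale of 1/8 is an assumed pyramid stride, not a value from the patent):

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 256, 100, 100)          # one pyramid level
    # Each row: (batch index, x1, y1, x2, y2) in original-image coordinates.
    rois = torch.tensor([[0, 16.0, 16.0, 240.0, 320.0]])
    # 7x7 output per region; bilinear sampling avoids coarse quantization.
    pooled = roi_align(feature_map, rois, output_size=(7, 7),
                       spatial_scale=1.0 / 8, sampling_ratio=2, aligned=True)
    print(pooled.shape)  # torch.Size([1, 256, 7, 7])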
Further, a cyclic semantic segmentation network predicts and generates a pixel-level mask for each aligned deformable region of interest through a full convolution neural network; after the target segmentation image is obtained, if the segmentation intersection-over-union is not satisfactory, the result is input into the full convolution neural network again for training until the required intersection-over-union result is obtained.
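One possible reading of this feedback loop is sketched below; the IoU threshold, the iteration cap and the way the previous mask is fed back (overwriting one feature channel) are all illustrative assumptions, and mask_head stands in for the full convolution branch:

    import torch
    import torch.nn as nn

    def cyclic_mask_prediction(mask_head, feat, target_mask,
                               iou_threshold=0.5, max_iters=3):
        # Re-run the mask branch, feeding the previous prediction back in,
        # until the mask IoU against the target is acceptable.
        mask = mask_head(feat)
        for _ in range(max_iters - 1):
            inter = ((mask > 0.5) & (target_mask > 0.5)).float().sum()
            union = ((mask > 0.5) | (target_mask > 0.5)).float().sum()
            if union > 0 and inter / union >= iou_threshold:
                break
            # One possible feedback: overwrite the last feature channel
            # with the coarse mask and predict again.
            feat = torch.cat([feat[:, :-1], mask], dim=1)
            mask = mask_head(feat)
        return mask

    mask_head = nn.Sequential(nn.Conv2d(256, 1, 3, padding=1), nn.Sigmoid())
    out = cyclic_mask_prediction(mask_head, torch.randn(1, 256, 14, 14),
                                 torch.zeros(1, 1, 14, 14))
    print(out.shape)  # torch.Size([1, 1, 14, 14])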
Further, the cyclic semantic segmentation network is trained, and the balanced L1 loss function is adopted to adjust the weight of the L1 loss function of the task, wherein the task comprises semantic segmentation.
Further, a padding operation of 0 is added to the image periphery before convolution in all processes.
Further, all convolutions are followed by a Rectified Linear Unit (ReLU).
The effective gain of the invention is:
the invention provides an embedded balanced deformable convolution mixed task cascading semantic segmentation method which is remarkably progressive compared with the traditional methods. The invention fully utilizes the context information of the network by combining the characteristic diagrams of a decoding end and a decoding end, improves the final accuracy to a certain extent, concatenates Mask RCNN and Cascade RCNN by embedding a balanced deformable convolution mixed task cascading framework and a new cascading framework for example segmentation, improves the information flow by combining cascading and multitasking of each stage, and further improves the accuracy by utilizing the spatial background. By combining object recognition, bounding box regression, and mask prediction in a multitasking manner at each stage. And (3) in the main network for extracting the features, adopting a feature pyramid network embedded with deformable convolution and pooling to extract the features. In addition, the method integrates IoU balanced sampling, a balanced feature pyramid, and a balanced L1 loss function for reducing sample, feature, and target level imbalances, respectively, benefiting from an overall balanced design. This framework significantly improves the accuracy of segmentation.
The method extracts features through a series of convolution and max-pooling operations based on the Cascade RCNN model: convolution doubles the number of channels of the feature map, and max pooling halves its length and width. To improve sensitivity to image features, two new modules are introduced to enhance the transformation modeling capability of CNNs, namely deformable convolution and deformable region-of-interest pooling. Both increase the spatial sampling locations in the module with additional spatial offsets, and the offsets for the target task are learned without extra supervision. The new modules can replace the ordinary modules in existing convolutional neural networks and can be trained end to end with standard back-propagation, yielding a deformable convolutional network. Applying regular convolution on the resulting deformable feature image therefore reflects complex structures more effectively.
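As an illustration of the deformable convolution module described above, the following sketch uses the deformable convolution operator available in torchvision; the block structure, with offsets predicted from the input by a plain convolution, is an assumption consistent with common practice:

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformBlock(nn.Module):
        # 3x3 deformable convolution whose sampling offsets are predicted
        # from the input itself by an ordinary convolution, trained
        # end to end with standard back-propagation.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            # Two offsets (dy, dx) for each of the 3*3 kernel positions.
            self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, 3, padding=1)
            self.deform_conv = DeformConv2d(in_ch, out_ch, 3, padding=1)

        def forward(self, x):
            return self.deform_conv(x, self.offset_conv(x))

    y = DeformBlock(64, 128)(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 128, 32, 32])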
The method makes full use of the informational correlation between target detection (bounding-box regression) and semantic segmentation (mask prediction). At each stage, the interrelationship between bounding-box regression and mask prediction is obtained in parallel, further improving the information flow within the RCNN. During target segmentation, the spatial background is removed and the remaining foreground is the target object in the picture. A full convolution branch within the CNN obtains the spatial background; this helps distinguish foreground that is hard to separate from a complex background and further improves the accuracy of target segmentation.
Drawings
FIG. 1 is a detailed structure diagram of a video semantic segmentation method based on a convolutional neural network;
FIG. 2 is a schematic diagram of a balanced feature pyramid;
FIG. 3 is a block diagram of a video semantic segmentation method based on a convolutional neural network;
FIG. 4 is a schematic diagram of a video semantic segmentation method based on a convolutional neural network;
FIG. 5 is a schematic diagram of the Mask RCNN and Cascade RCNN cascade;
FIG. 6 is a schematic diagram of the deformable convolutional network (DCN) principle;
FIG. 7 is a model segmentation effect diagram in a plurality of simple scenarios;
FIG. 8 is a model segmentation effect diagram in complex scenes (e.g., varying illumination and background color).
Detailed Description
The invention is further described in detail below with reference to the drawings.
With reference to fig. 1 and fig. 5: to overcome the defects and drawbacks of the prior art, the invention provides a completely new semantic image segmentation method. Compared with other methods it better accounts for both local and global information, and by improving Cascade RCNN and Mask RCNN, the shapes and boundaries of the objects in the final segmentation map are clearer and the classification is more accurate. The method is called the deformable convolution mixed task cascade semantic segmentation method based on embedded balance.
The system comprises three parts: an embedded-balance deformable convolutional neural network, a cyclic semantic segmentation network, and a cascaded sparse RoI classification and regression network.
Referring to fig. 2, the invention inputs the cropped image into a pre-trained neural network so as to extract the important features of different targets from a large number of pictures and obtain feature maps. The feature extraction network consists of a 50-layer deep residual network (ResNet) and a Feature Pyramid Network (FPN). The FPN performs prediction after up-sampling each layer's feature map and fusing it with the shallow layers, and then applies the embedded-balance feature fusion operation to further improve the recognition rate of tiny targets.

The feature pyramid network must be trained; during training, a balanced L1 loss function adjusts the weight of each task's L1 loss to balance the multiple tasks. Samples are introduced during training: following overlap-balanced sampling, the sampling interval is evenly divided into K bins, the N hard samples are evenly distributed across the bins, and samples are drawn uniformly from each bin, giving intersection-over-union balanced sampling. In contrast to the model architecture, the training process has a crucial impact on the performance of the target detection model, and training of the detection part is typically constrained by imbalance at the sample, feature, and objective levels. To reduce the resulting adverse effects, an embedded-balance FPN structure is proposed; it supports balanced learning of the model and improves the detection accuracy of small targets. In the multi-scale feature sampling of the FPN, intersection-over-union balanced sampling and the balanced L1 loss function are added to reduce imbalance in the samples, the feature extraction, and the target detection, breaking the balance limits of the overall model and yielding a balanced fused feature map.

Referring to fig. 6, feature extraction uses an FPN embedding deformable convolution and pooling; the deformable convolution and deformable pooling modules are introduced to enhance the transformation modeling capability of the Convolutional Neural Network (CNN). The new modules easily replace the ordinary modules in an existing CNN and can easily be trained end to end by standard back-propagation. The convolution layer at the start of the deformable convolutional neural network is a deformable convolution layer, and the region-of-interest pooling layer is deformable region-of-interest pooling. The resulting feature map (Feature Map) is divided into parts of equal size, their number being the output dimensionality; using the deformable sampling positions, the similar neighboring structure information of each pixel is compressed into a fixed grid, creating a deformable feature image. Applying regular convolution on this deformable feature image reflects the complex structure of the image more effectively and further improves sensitivity to tiny targets.
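A compact sketch of this embedded-balance fusion over the FPN levels follows (the gather-average-refine-scatter pattern; the single 3 × 3 refinement convolution is a simplifying assumption):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BalancedFeatureFusion(nn.Module):
        # Rescale all pyramid levels to one intermediate resolution,
        # average them into a single balanced semantic map, refine it,
        # then add it back to every level so each scale sees the same
        # balanced context.
        def __init__(self, channels=256):
            super().__init__()
            self.refine = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, levels):
            size = levels[len(levels) // 2].shape[-2:]
            gathered = [F.interpolate(l, size=size, mode='nearest') for l in levels]
            balanced = self.refine(torch.stack(gathered).mean(dim=0))
            return [l + F.interpolate(balanced, size=l.shape[-2:], mode='nearest')
                    for l in levels]

    levels = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
    print([f.shape[-1] for f in BalancedFeatureFusion()(levels)])  # [64, 32, 16, 8]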
According to the method, a filling (padding) operation is added for each convolution, expanding 0-valued pixels around the outermost periphery. Although padding may introduce weak noise, it keeps the resolution of the output segmentation image the same as that of the original image, which helps predict the category of each pixel accurately when segmenting dense small targets. Because a deeper network structure loses more spatial information and small-scale objects are difficult to recover from low-resolution feature maps, S-Dropout is used in the convolutional network to trim the network structure, removing the last pooling layer and the convolutional layer that follows it; the network thus gains a sparsification property and overfitting during training is reduced.
The invention inputs the feature maps into a region candidate network (RPN) for target recognition classification and candidate-box regression. The RPN is essentially a sliding-window-based target detector with a tree structure: a 3 × 3 convolutional layer as the trunk and two 1 × 1 convolutional layers as branches. The RPN classifier (a 1 × 1 convolutional layer) labels as foreground each RoI that overlaps a real target with an overlap area greater than 0.5, and as background each RoI that overlaps no target or whose overlap area is less than 0.1. The RPN regressor (a 1 × 1 convolutional layer) calculates the box offsets between the foreground RoIs and the real targets; non-maximum suppression is then performed, i.e., the RoIs are sorted by classification score and the first N are selected to obtain the RPN Box candidate boxes. Next, the classification results and the regressed candidate boxes are combined to obtain the candidate regions (Proposal). The loss function used by the RPN here is the sum of the classification error and the regression error.
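The tree-structured RPN head described here, a 3 × 3 trunk with two sibling 1 × 1 branches, can be sketched as follows (layer names are illustrative):

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        # Sliding-window detector: a shared 3x3 trunk, then two sibling
        # 1x1 branches for objectness scores and box regression.
        def __init__(self, in_ch=256, num_anchors=9):
            super().__init__()
            self.trunk = nn.Conv2d(in_ch, in_ch, 3, padding=1)
            self.cls = nn.Conv2d(in_ch, num_anchors, 1)      # fg/bg score per anchor
            self.reg = nn.Conv2d(in_ch, num_anchors * 4, 1)  # 4 box deltas per anchor

        def forward(self, x):
            t = torch.relu(self.trunk(x))
            return self.cls(t), self.reg(t)

    scores, deltas = RPNHead()(torch.randn(1, 256, 50, 50))
    print(scores.shape, deltas.shape)  # (1, 9, 50, 50) (1, 36, 50, 50)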
Each generated RoI is mapped onto the convolution feature map (Feature Map) extracted by the FPN, and the corresponding 7 × 7 feature map is obtained. RoI alignment includes two correspondences: first from the original image to the pixels of the feature map, and then from the feature map to the fixed-size output map. Because the pooling layers in the feature extraction network shrink the feature map by a certain ratio relative to the original image (related to the number and size of the pooling layers), identification errors caused by coarse quantization are avoided through bilinear interpolation and pooling: a pixel value in the target map is determined from the four real pixel values surrounding the virtual point in the original map, so that the pixels of the original image and of the feature map are accurately aligned, which improves target detection accuracy and facilitates target segmentation.
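The bilinear interpolation that this alignment relies on, written out for a single virtual sampling point (a minimal sketch of the four-surrounding-pixels rule):

    def bilinear_sample(fmap, y, x):
        # Value at fractional position (y, x), computed from the four real
        # pixels surrounding the virtual point; fmap is a 2-D list of rows.
        y0, x0 = int(y), int(x)
        y1 = min(y0 + 1, len(fmap) - 1)
        x1 = min(x0 + 1, len(fmap[0]) - 1)
        dy, dx = y - y0, x - x0
        top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
        bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
        return top * (1 - dy) + bottom * dy

    print(bilinear_sample([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5))  # 1.5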
A cyclic semantic segmentation network performs a Full Convolution Network (FCN) operation on the RoI-aligned feature map to predict and generate a pixel-level mask (Mask). The loss function used by the mask generation network is the sum of the target detection classification error, the bounding-box regression error, and the semantic segmentation error. The "Head layer" preceding the FCN mainly expands the dimensionality of the RoI-aligned output, which makes mask prediction more accurate. After the target segmentation image is obtained through the FCN, if the segmentation intersection-over-union is not satisfactory, the receptive field (Receptive Field) of the feature extraction is too small for accurate segmentation of small targets; the result is then input into the Head layer again, fused with the receptive-field layer in the convolutional layers, and segmented again through the FCN until a sufficiently accurate intersection-over-union is obtained. The receptive field is defined as the size of the area on the original image (the network input) to which each pixel of a layer's output feature map is mapped.
Compared with current target detection methods, the present method offers notable accuracy and sensitivity. Building on prior work, it redefines the network structure; extracts features by combining the embedded-balance FPN with deformable convolution; introduces a cyclic mechanism into semantic segmentation to improve the sensitivity of the receptive field; and adopts network cascading to improve the classification and localization accuracy of the candidate boxes. Through feature fusion across layers, the information of every network layer is fully utilized, the final accuracy improves to a certain extent, the image segmentation results are smoother, and the classification and localization results are more accurate, fully demonstrating the novelty and superiority of the method in semantic segmentation and target detection.
Referring to fig. 1 in conjunction with fig. 2 and 3 and fig. 4, in one embodiment, the method of the present invention comprises:
s1, first, an image of an arbitrary size is input into a backbone network composed of 13 convolutional layers, 13 Linear Unit (ReLU) layers, and 4 pooling layers for extracting picture features. Firstly, the input is a whole picture, and dimension reduction processing is carried out on each level of the characteristic pyramid network through a 1 × 1 convolution layer. Second, the low-level feature map is upsampled and the high-level feature map is downsampled to the same scale space with the step set to 8. It was found from experiments that this setup is sufficient for an accurate pixel level prediction of the whole image. Then, information fusion is carried out on semantic features extracted before different levels; meanwhile, four convolutional layers are added structurally to further enhance the generalization capability of extracting semantic features. Finally, the convolution layer is adopted to predict the pixel level segmentation result, and a Feature map (Feature Maps) is output. Performing feature extraction on the input image by adopting a deformable convolution neural network in the convolution and pooling part of the feature pyramid network; the convolution layer at the starting end of the deformable convolution neural network is a deformable convolution layer, and the pooling layer of the region of interest is a deformable region of interest pooling layer to obtain a characteristic diagram; dividing the characteristic diagram into parts with the same size, wherein the number of the parts is the output dimensionality;
s2, for each image, after the operation of S1, all the obtained Feature Maps (Feature Maps) are input into a Region candidate Network (RPN) for training. The method comprises the steps of firstly entering a 3 x 3 convolutional layer, then entering two 1 x 1 brother convolutional layers (sitting layers), finally classifying by using a Softmax layer, accurately positioning a candidate region, and selecting the candidate region. The area candidate network comprises two parts, namely a target detection classifier and a candidate frame positioning classifier, wherein the target detection classifier outputs a target identification result and prediction accuracy, and the candidate frame positioning classifier can accurately position candidate areas and output candidate frames of a plurality of candidate areas. In the training process, samples need to be introduced, sampling is balanced according to the overlapping degree, the deformable interesting region is uniformly divided into K intervals through interval sampling, N difficult samples are uniformly distributed to each interval, and then the sample intervals are uniformly selected from the N difficult samples.
Inputting the feature map into the region candidate network for candidate-box localization and classification comprises: setting 9 candidate deformable regions of interest for each pixel of the feature map, dividing the deformable regions of interest into foreground and background with a classifier while preliminarily adjusting their positions with a regressor, performing non-maximum suppression, i.e. sorting the deformable regions of interest by classification score, and selecting the first N of them to obtain the candidate boxes.
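The non-maximum suppression and top-N selection of this step, sketched with torchvision's NMS operator (the IoU threshold of 0.7 is a typical assumption, not a value from the patent):

    import torch
    from torchvision.ops import nms

    def select_proposals(boxes, scores, top_n=300, iou_thresh=0.7):
        # Sort candidate boxes by classification score, suppress heavily
        # overlapping ones, and keep the first N as candidate frames.
        keep = nms(boxes, scores, iou_thresh)[:top_n]  # indices, score-ordered
        return boxes[keep], scores[keep]

    boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
    scores = torch.tensor([0.9, 0.8, 0.7])
    print(select_proposals(boxes, scores, top_n=2)[0])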
A cyclic semantic segmentation network predicts and generates a pixel-level mask for each aligned deformable region of interest through a full convolution neural network; after the target segmentation image is obtained, if the segmentation intersection-over-union is not satisfactory, the result is input into the full convolution neural network again for training until the required intersection-over-union result is obtained.
S3. The anchors (Anchors) with the largest intersection-over-union with the real target box (Ground Truth Box) are taken as positive training samples to train the region candidate network. The loss function of the region candidate network is defined as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i indexes the anchors (Anchor) in a mini-batch, p_i denotes the probability that the i-th anchor is a foreground target, p_i* is 1 when the i-th anchor is a foreground target and 0 otherwise, t_i denotes the coordinates of the predicted bounding box (Bounding Box), and t_i* denotes the coordinates of the real target (Ground Truth).
The cyclic semantic segmentation network is trained as well, with a balanced L1 loss function adjusting the weight of the task's L1 loss, the task including semantic segmentation.
S4. RoI alignment is performed on the output of S3, and each RoI is mapped to the corresponding position of the feature map (Feature Map) of step S1 according to the input image. The mapped region is divided into parts of equal size, their number equal to the output dimension, and a max-pooling operation is performed on each part, so that fixed-size feature maps are obtained from boxes of different sizes.
S5. Classification and regression. The output of this layer is the final objective: the category to which each candidate region belongs and its precise location in the image.
In all of the above, a padding operation pads 0-valued pixels around the image periphery before each convolution, a ReLU follows each convolution, and the size and number of the convolution kernels can be chosen freely.
Referring to fig. 7 and fig. 8, which show model segmentation effect maps for various scenes processed by the method of the invention, the final accuracy is improved to a certain extent, making the image segmentation results more accurate and smooth. As the above examples and the detailed description of the figures show, the proposed semantic image segmentation method is a significant step forward compared with traditional methods, in both quantitative and qualitative comparison. The invention redefines the network structure and combines the feature maps of the encoding end and the decoding end, fully demonstrating its superiority in semantic image segmentation. Through feature fusion of different layers, the information of every network layer is fully utilized, the final accuracy improves to a certain degree, and the image segmentation result is more accurate and smooth.
The preferred embodiments of the present invention have been described in detail; however, the invention is not limited to the above specific embodiments. Those skilled in the art can make modifications or equivalent changes within the scope of the claims, and all such modifications and equivalent changes shall fall within the protection scope of the present invention.

Claims (7)

1. A deformable convolution mixing task cascading semantic segmentation method based on embedding balance is characterized by comprising the following steps:
inputting the cropped image into a pre-trained neural network;
performing dimensionality reduction on the input image with the 3 × 3 convolution kernels and pooling operations of a feature pyramid network; up-sampling the low-level feature maps and down-sampling the high-level feature maps, mapping both to the same scale space; performing information fusion on the semantic features extracted at different hierarchies; predicting a pixel-level segmentation result with a convolution layer;
performing feature extraction on the input image by adopting a deformable convolutional neural network in the convolution and pooling parts of the feature pyramid network; the convolution layer at the start of the deformable convolutional neural network is a deformable convolution layer, and the region-of-interest pooling layer is a deformable region-of-interest pooling layer, to obtain a feature map; dividing the feature map into parts of equal size, the number of parts being the output dimensionality;
inputting a feature map obtained after the feature pyramid network into a regional candidate network for training the network, wherein the regional candidate network comprises a target detection classifier and a candidate frame positioning classifier, the target detection classifier outputs a target identification result and prediction accuracy, and the candidate frame positioning classifier can accurately position candidate regions and output candidate frames of a plurality of candidate regions;
in the process of training the regional candidate network, a balanced L1 loss function is adopted to adjust the weight of the respective L1 loss function of a plurality of tasks, wherein the tasks comprise target detection and candidate frame generation;
in the training process, samples need to be introduced: following overlap-balanced sampling, the sampling interval of the deformable regions of interest is evenly divided into K bins, the N hard samples are evenly distributed across the bins, and samples are then drawn uniformly from each bin.
2. The method of claim 1, wherein inputting the feature map into the regional candidate network for candidate box location classification comprises: setting 9 candidate deformable interesting regions for each pixel point in the feature map, dividing the deformable interesting regions into a foreground and a background by using a classifier, simultaneously preliminarily adjusting the positions of the deformable interesting regions by using a regressor, carrying out non-maximum suppression, sequencing the deformable interesting regions according to classified scores, and selecting the first N deformable interesting regions to obtain a candidate frame.
3. The method of claim 2, wherein the deformable region of interest generated by the area candidate network is mapped onto the feature map extracted by the feature pyramid network, and a 7 x 7 feature map corresponding to the feature map is obtained for deformable region of interest alignment.
4. The method of claim 3, wherein a pixel-level mask is generated predictively using a recurrent semantic segmentation network through a full convolution neural network for each aligned deformable region of interest; after the target segmentation image is obtained through the full convolution neural network, if the target segmentation cross-over ratio is not ideal, the result needs to be input into the full convolution neural network again for training until the needed cross-over ratio result is obtained.
5. The method of claim 4, wherein the cyclic semantic segmentation network is trained to adjust the weights of the L1 loss functions of tasks comprising semantic segmentation using balanced L1 loss functions.
6. A method as claimed in claim 1, characterized in that a padding operation of 0 is applied to the image surroundings before the convolution in all processes.
7. A method as claimed in claim 1, characterized in that the convolution in all processes is followed by a rectified linear unit.
CN202010004799.7A 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance Active CN111210443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010004799.7A CN111210443B (en) 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010004799.7A CN111210443B (en) 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Publications (2)

Publication Number Publication Date
CN111210443A CN111210443A (en) 2020-05-29
CN111210443B true CN111210443B (en) 2022-09-13

Family

ID=70785546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010004799.7A Active CN111210443B (en) 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Country Status (1)

Country Link
CN (1) CN111210443B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768466B (en) 2020-06-30 2024-01-12 北京百度网讯科技有限公司 Image filling method, device, equipment and storage medium
CN111754531A (en) * 2020-07-08 2020-10-09 深延科技(北京)有限公司 Image instance segmentation method and device
CN111862119A (en) * 2020-07-21 2020-10-30 武汉科技大学 Semantic information extraction method based on Mask-RCNN
CN111860332B (en) * 2020-07-21 2022-05-31 国网山东省电力公司青岛供电公司 Dual-channel electrokinetic diagram part detection method based on multi-threshold cascade detector
CN111860508A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Image sample selection method and related equipment
CN112069907A (en) * 2020-08-11 2020-12-11 盛视科技股份有限公司 X-ray machine image recognition method, device and system based on example segmentation
CN111985503B (en) * 2020-08-17 2024-04-26 浩鲸云计算科技股份有限公司 Target detection method and device based on improved feature pyramid network structure
CN111951319A (en) * 2020-08-21 2020-11-17 清华大学深圳国际研究生院 Image stereo matching method
CN112132258B (en) * 2020-08-26 2022-06-24 中国海洋大学 Multitask image processing method based on deformable convolution
CN112017065B (en) * 2020-08-27 2024-05-24 中国平安财产保险股份有限公司 Method, device and computer readable storage medium for vehicle damage assessment and claim
CN112116620B (en) * 2020-09-16 2023-09-22 北京交通大学 Indoor image semantic segmentation and coating display method
CN112446862B (en) * 2020-11-25 2021-08-10 北京医准智能科技有限公司 Dynamic breast ultrasound video full-focus real-time detection and segmentation device and system based on artificial intelligence and image processing method
CN112396053A (en) * 2020-11-25 2021-02-23 北京联合大学 Method for detecting object of all-round fisheye image based on cascade neural network
CN112418163B (en) * 2020-12-09 2022-07-12 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
CN112560722B (en) * 2020-12-22 2022-09-09 中国人民解放军国防科技大学 Airplane target identification method and device, computer equipment and storage medium
CN112712078A (en) * 2020-12-31 2021-04-27 上海智臻智能网络科技股份有限公司 Text detection method and device
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN112926480B (en) * 2021-03-05 2023-01-31 山东大学 Multi-scale and multi-orientation-oriented aerial photography object detection method and system
CN112950703B (en) * 2021-03-11 2024-01-19 无锡禹空间智能科技有限公司 Small target detection method, device, storage medium and equipment
CN113205526B (en) * 2021-04-01 2022-07-26 国网江苏省电力有限公司淮安供电分公司 Distribution line accurate semantic segmentation method based on multi-source information fusion
CN113065650B (en) * 2021-04-02 2023-11-17 中山大学 Multichannel neural network instance separation method based on long-term memory learning
CN113034506B (en) * 2021-05-24 2021-08-06 湖南大学 Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN113657214B (en) * 2021-07-30 2024-04-02 哈尔滨工业大学 Building damage assessment method based on Mask RCNN
CN113792584B (en) * 2021-08-03 2023-10-27 云南大学 Wearing detection method and system for safety protection tool
CN114092818B (en) * 2022-01-07 2022-05-03 中科视语(北京)科技有限公司 Semantic segmentation method and device, electronic equipment and storage medium
CN114511485B (en) * 2022-01-29 2023-05-26 电子科技大学 Compressed video quality enhancement method adopting cyclic deformable fusion
CN114170230B (en) * 2022-02-14 2022-04-29 清华大学 Glass defect detection method and device based on deformable convolution and feature fusion
CN114897798A (en) * 2022-04-24 2022-08-12 四川思极科技有限公司 Transformer oil leakage image identification method and system based on growth type detection
CN114926886B (en) * 2022-05-30 2023-04-25 山东大学 Micro-expression action unit identification method and system
CN116012719B (en) * 2023-03-27 2023-06-09 中国电子科技集团公司第五十四研究所 Weak supervision rotating target detection method based on multi-instance learning
CN116079749B (en) * 2023-04-10 2023-06-20 南京师范大学 Robot vision obstacle avoidance method based on cluster separation conditional random field and robot

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376576A (en) * 2018-08-21 2019-02-22 中国海洋大学 The object detection method for training network from zero based on the intensive connection of alternately update
CN110276765A (en) * 2019-06-21 2019-09-24 北京交通大学 Image panorama dividing method based on multi-task learning deep neural network

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679351B2 (en) * 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
US10043113B1 (en) * 2017-10-04 2018-08-07 StradVision, Inc. Method and device for generating feature maps by using feature upsampling networks
CN108446662A (en) * 2018-04-02 2018-08-24 电子科技大学 A kind of pedestrian detection method based on semantic segmentation information
CN109145713B (en) * 2018-07-02 2021-09-28 南京师范大学 Small target semantic segmentation method combined with target detection
CN109670525A (en) * 2018-11-02 2019-04-23 平安科技(深圳)有限公司 Object detection method and system based on once shot detection
CN109584248B (en) * 2018-11-20 2023-09-08 西安电子科技大学 Infrared target instance segmentation method based on feature fusion and dense connection network
CN109685067B (en) * 2018-12-26 2022-05-03 江西理工大学 Image semantic segmentation method based on region and depth residual error network
CN110097129B (en) * 2019-05-05 2023-04-28 西安电子科技大学 Remote sensing target detection method based on profile wave grouping characteristic pyramid convolution
CN110264466B (en) * 2019-06-28 2021-08-06 广州市颐创信息科技有限公司 Reinforcing steel bar detection method based on deep convolutional neural network
CN110533105B (en) * 2019-08-30 2022-04-05 北京市商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
CN110633661A (en) * 2019-08-31 2019-12-31 南京理工大学 Semantic segmentation fused remote sensing image target detection method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376576A (en) * 2018-08-21 2019-02-22 中国海洋大学 The object detection method for training network from zero based on the intensive connection of alternately update
CN110276765A (en) * 2019-06-21 2019-09-24 北京交通大学 Image panorama dividing method based on multi-task learning deep neural network

Also Published As

Publication number Publication date
CN111210443A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
Hafiz et al. A survey on instance segmentation: state of the art
Lateef et al. Survey on semantic segmentation using deep learning techniques
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN113239981B (en) Image classification method of local feature coupling global representation
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
Chen et al. Corse-to-fine road extraction based on local Dirichlet mixture models and multiscale-high-order deep learning
Cao et al. A survey on image semantic segmentation methods with convolutional neural network
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN110866938B (en) Full-automatic video moving object segmentation method
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN112364873A (en) Character recognition method and device for curved text image and computer equipment
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN111723660A (en) Detection method for long ground target detection network
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN112183649A (en) Algorithm for predicting pyramid feature map
CN113850324A (en) Multispectral target detection method based on Yolov4
CN116645592A (en) Crack detection method based on image processing and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant