CN111210443B - Deformable convolution mixing task cascading semantic segmentation method based on embedding balance - Google Patents

Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Info

Publication number
CN111210443B
Authority
CN
China
Prior art keywords
network
deformable
candidate
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010004799.7A
Other languages
Chinese (zh)
Other versions
CN111210443A (en)
Inventor
陈玫玫
王健
吴金洋
曾博义
赖子轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202010004799.7A priority Critical patent/CN111210443B/en
Publication of CN111210443A publication Critical patent/CN111210443A/en
Application granted granted Critical
Publication of CN111210443B publication Critical patent/CN111210443B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The invention provides a deformable convolution mixed task cascade semantic segmentation method based on embedded balance, used for realizing image target recognition and semantic segmentation, comprising the following steps: inputting the cropped image into a pre-trained neural network; mapping the low-level and high-level feature maps to the same scale space through a feature pyramid network; performing information fusion on the semantic features extracted at different hierarchies; predicting a pixel-level segmentation result with a convolution layer; adopting a deformable convolutional neural network in the convolution and pooling parts of the feature pyramid network to extract features from the input image and obtain a feature map; dividing the feature map into parts of equal size; and inputting the feature map obtained from the feature pyramid network into a region candidate network for training. The region candidate network comprises a target detection classifier and a candidate-box localization classifier: the target detection classifier outputs a target recognition result and its prediction accuracy, and the candidate-box localization classifier accurately localizes candidate regions and outputs the candidate boxes of a plurality of candidate regions. The invention improves both the localization accuracy and the segmentation accuracy of semantic segmentation.

Description

Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
Technical Field
The invention provides a deformable convolution mixed task cascade semantic segmentation method based on embedded balance, used for realizing image target recognition and semantic segmentation.
Background
In traditional semantic segmentation, the task is defined as dividing an image into several disjoint parts, each with its own semantics; that is, each part contains only one category of object. Among traditional approaches, semantic segmentation based on user interaction has been studied extensively: the user selects a region, the color similarity, texture similarity and edge features between the other regions and the selected region are used as connection weights between regions, and the image is finally segmented with a Conditional Random Field (CRF) or Graph Cut. Such methods have clear disadvantages: when several targets exist in the image, human participation is required and adaptivity is low. A second class of methods is non-parametric: the image (or a part of the image) to be segmented is matched by retrieval against images in a dataset, the labels in the dataset are migrated into the segmentation of the target image, and the final segmented image is obtained through a series of post-processing steps such as a Markov Random Field (MRF). Although these methods need no manual participation, the quality of the matching result depends on the diversity of the dataset and also, to a large degree, on the matching scheme. After the advent of target detection methods such as the Deformable Part Model (DPM), many methods attempted to obtain connected regions in an image using low-level features such as color and texture, detect the targets in the image with the detection method, and finally combine the detection result at each pixel position with the connected regions to obtain a semantic segmentation result.
With the popularization of intelligent terminals such as smartphones and tablet computers, and the growing computing power of the corresponding low-power chips, the demand for computer vision techniques that perform well while occupying few resources continues to grow.
Disclosure of Invention
The invention aims to improve the localization accuracy and the segmentation accuracy of semantic segmentation, and provides a deep-learning-based computer vision method for image processing.
The invention adopts the following technical scheme:
a deformable convolution mixing task cascaded semantic segmentation method based on embedding balance, the method comprising:
inputting the cropped image into a pre-trained neural network;
performing dimensionality reduction on the input image with the 3 × 3 convolution kernels and pooling operations of a feature pyramid network; up-sampling the low-level feature maps and down-sampling the high-level feature maps, mapping both to the same scale space; performing information fusion on the semantic features extracted at different hierarchies; and predicting a pixel-level segmentation result with a convolution layer (a sketch of this resampling-and-fusion step follows this list);
performing feature extraction on the input image with a deformable convolutional neural network in the convolution and pooling parts of the feature pyramid network, where the convolution layer at the start of the deformable convolutional neural network is a deformable convolution layer and the region-of-interest pooling layer is a deformable region-of-interest pooling layer, to obtain a feature map; and dividing the feature map into parts of equal size, their number being the output dimensionality;
inputting the feature map obtained from the feature pyramid network into a region candidate network for training, where the region candidate network comprises a target detection classifier and a candidate-box localization classifier: the target detection classifier outputs a target recognition result and its prediction accuracy, and the candidate-box localization classifier accurately localizes candidate regions and outputs the candidate boxes of a plurality of candidate regions.
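By way of illustration only, the resampling-and-fusion step above can be sketched as follows (PyTorch is assumed; the function name, channel counts and the choice of the middle pyramid level as the common scale space are illustrative assumptions, not values taken from the patent):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def predict_pixel_labels(feats, classifier):
        # feats: pyramid levels from fine (low-level) to coarse (high-level),
        # each of shape (N, C, Hi, Wi); classifier: 1x1 conv giving class scores.
        size = feats[len(feats) // 2].shape[-2:]   # common scale space
        # Down-sample the finer maps and up-sample the coarser maps to this scale.
        aligned = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                   for f in feats]
        fused = torch.stack(aligned).mean(dim=0)   # information fusion across levels
        return classifier(fused)                   # pixel-level prediction

    feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
    logits = predict_pixel_labels(feats, nn.Conv2d(256, 21, kernel_size=1))
    print(logits.shape)  # torch.Size([1, 21, 16, 16])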
Further, in training the region candidate network, a balanced L1 loss function is adopted to adjust the weights of the respective L1 loss functions of a plurality of tasks, the tasks comprising target detection and candidate-box generation.
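A minimal sketch of such a balanced L1 loss is given below (the piecewise form and the defaults alpha = 0.5, gamma = 1.5 follow the published balanced-L1 formulation and are assumptions here; PyTorch is assumed):

    import math
    import torch

    def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5):
        # Balanced L1 loss: boosts the gradient contribution of easy
        # (small-error) samples relative to plain smooth L1.
        x = (pred - target).abs()
        b = math.exp(gamma / alpha) - 1      # makes the gradient continuous at |x| = 1
        C = gamma / b - alpha                # makes the value continuous at |x| = 1
        loss = torch.where(
            x < 1,
            alpha / b * (b * x + 1) * torch.log(b * x + 1) - alpha * x,
            gamma * x + C,
        )
        return loss.mean()

    print(balanced_l1_loss(torch.tensor([0.2, 2.0]), torch.tensor([0.0, 0.0])))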
Further, samples are introduced during training: following overlap-balanced (IoU-balanced) sampling, the sampling interval of the deformable regions of interest is evenly divided into K bins, the N hard samples are evenly distributed across the bins, and samples are then drawn uniformly from each bin.
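For illustration, a sketch of this overlap-balanced sampling, under the assumption that each candidate carries a precomputed IoU against its nearest ground-truth box (the names and the K = 3, max_iou = 0.3 defaults are illustrative):

    import random

    def iou_balanced_sample(candidates, num_samples, K=3, max_iou=0.3):
        # candidates: list of (roi, iou) pairs for the hard negative samples.
        # Split the IoU range [0, max_iou) into K equal bins and draw the
        # same number of samples from each bin, so low- and high-overlap
        # negatives are equally represented.
        per_bin = num_samples // K
        sampled = []
        for k in range(K):
            lo, hi = max_iou * k / K, max_iou * (k + 1) / K
            bin_rois = [c for c in candidates if lo <= c[1] < hi]
            sampled += random.sample(bin_rois, min(per_bin, len(bin_rois)))
        return sampled

    negs = [(f'roi{i}', random.uniform(0.0, 0.3)) for i in range(100)]
    print(len(iou_balanced_sample(negs, 30)))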
Further, inputting the feature map into the region candidate network for candidate-box localization and classification includes: setting 9 candidate deformable regions of interest for each pixel of the feature map, dividing the deformable regions of interest into foreground and background with a classifier while preliminarily adjusting their positions with a regressor, performing non-maximum suppression, i.e. sorting the deformable regions of interest by classification score, and selecting the first N of them to obtain the candidate boxes.
Further, the deformable regions of interest generated by the region candidate network are mapped onto the feature map extracted by the feature pyramid network to obtain the corresponding 7 × 7 feature maps, and the deformable regions of interest are aligned.
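This mapping-and-alignment step corresponds to standard RoI Align; a short usage sketch with torchvision follows (the spatial_scale of 1/8 is an assumed pyramid stride, not a value from the patent):

    import torch
    from torchvision.ops import roi_align

    feature_map = torch.randn(1, 256, 100, 100)          # one pyramid level
    # Each row: (batch index, x1, y1, x2, y2) in original-image coordinates.
    rois = torch.tensor([[0, 16.0, 16.0, 240.0, 320.0]])
    # 7x7 output per region; bilinear sampling avoids coarse quantization.
    pooled = roi_align(feature_map, rois, output_size=(7, 7),
                       spatial_scale=1.0 / 8, sampling_ratio=2, aligned=True)
    print(pooled.shape)  # torch.Size([1, 256, 7, 7])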
Further, a cyclic semantic segmentation network predicts and generates a pixel-level mask for each aligned deformable region of interest through a full convolution neural network; after the target segmentation image is obtained, if the segmentation intersection-over-union is not satisfactory, the result is input into the full convolution neural network again for training until the required intersection-over-union result is obtained.
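One possible reading of this feedback loop is sketched below; the IoU threshold, the iteration cap and the way the previous mask is fed back (overwriting one feature channel) are all illustrative assumptions, and mask_head stands in for the full convolution branch:

    import torch
    import torch.nn as nn

    def cyclic_mask_prediction(mask_head, feat, target_mask,
                               iou_threshold=0.5, max_iters=3):
        # Re-run the mask branch, feeding the previous prediction back in,
        # until the mask IoU against the target is acceptable.
        mask = mask_head(feat)
        for _ in range(max_iters - 1):
            inter = ((mask > 0.5) & (target_mask > 0.5)).float().sum()
            union = ((mask > 0.5) | (target_mask > 0.5)).float().sum()
            if union > 0 and inter / union >= iou_threshold:
                break
            # One possible feedback: overwrite the last feature channel
            # with the coarse mask and predict again.
            feat = torch.cat([feat[:, :-1], mask], dim=1)
            mask = mask_head(feat)
        return mask

    mask_head = nn.Sequential(nn.Conv2d(256, 1, 3, padding=1), nn.Sigmoid())
    out = cyclic_mask_prediction(mask_head, torch.randn(1, 256, 14, 14),
                                 torch.zeros(1, 1, 14, 14))
    print(out.shape)  # torch.Size([1, 1, 14, 14])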
Further, the cyclic semantic segmentation network is trained, and the balanced L1 loss function is adopted to adjust the weight of the L1 loss function of the task, wherein the task comprises semantic segmentation.
Further, a padding operation of 0 is added to the image periphery before convolution in all processes.
Further, all convolutions are followed by a Rectified Linear Unit (ReLU).
The effective gain of the invention is:
the invention provides an embedded balanced deformable convolution mixed task cascading semantic segmentation method which is remarkably progressive compared with the traditional methods. The invention fully utilizes the context information of the network by combining the characteristic diagrams of a decoding end and a decoding end, improves the final accuracy to a certain extent, concatenates Mask RCNN and Cascade RCNN by embedding a balanced deformable convolution mixed task cascading framework and a new cascading framework for example segmentation, improves the information flow by combining cascading and multitasking of each stage, and further improves the accuracy by utilizing the spatial background. By combining object recognition, bounding box regression, and mask prediction in a multitasking manner at each stage. And (3) in the main network for extracting the features, adopting a feature pyramid network embedded with deformable convolution and pooling to extract the features. In addition, the method integrates IoU balanced sampling, a balanced feature pyramid, and a balanced L1 loss function for reducing sample, feature, and target level imbalances, respectively, benefiting from an overall balanced design. This framework significantly improves the accuracy of segmentation.
The method extracts features through a series of convolution and max-pooling operations based on the Cascade RCNN model: convolution doubles the number of channels of the feature map, and max pooling halves its length and width. To improve sensitivity to image features, two new modules are introduced to enhance the transformation modeling capability of CNNs, namely deformable convolution and deformable region-of-interest pooling. Both increase the spatial sampling locations in the module with additional spatial offsets, and the offsets for the target task are learned without extra supervision. The new modules can replace the ordinary modules in existing convolutional neural networks and can be trained end to end with standard back-propagation, yielding a deformable convolutional network. Applying regular convolution on the resulting deformable feature image therefore reflects complex structures more effectively.
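As an illustration of the deformable convolution module described above, the following sketch uses the deformable convolution operator available in torchvision; the block structure, with offsets predicted from the input by a plain convolution, is an assumption consistent with common practice:

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformBlock(nn.Module):
        # 3x3 deformable convolution whose sampling offsets are predicted
        # from the input itself by an ordinary convolution, trained
        # end to end with standard back-propagation.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            # Two offsets (dy, dx) for each of the 3*3 kernel positions.
            self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, 3, padding=1)
            self.deform_conv = DeformConv2d(in_ch, out_ch, 3, padding=1)

        def forward(self, x):
            return self.deform_conv(x, self.offset_conv(x))

    y = DeformBlock(64, 128)(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 128, 32, 32])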
The method makes full use of the informational correlation between target detection (bounding-box regression) and semantic segmentation (mask prediction). At each stage, the interrelationship between bounding-box regression and mask prediction is obtained in parallel, further improving the information flow within the RCNN. During target segmentation, the spatial background is removed and the remaining foreground is the target object in the picture. A full convolution branch within the CNN obtains the spatial background; this helps distinguish foreground that is hard to separate from a complex background and further improves the accuracy of target segmentation.
Drawings
FIG. 1 is a detailed structure diagram of a video semantic segmentation method based on a convolutional neural network;
FIG. 2 is a schematic diagram of a balanced feature pyramid;
FIG. 3 is a block diagram of a video semantic segmentation method based on a convolutional neural network;
FIG. 4 is a schematic diagram of a video semantic segmentation method based on a convolutional neural network;
FIG. 5 is a schematic diagram of the Mask RCNN and Cascade RCNN cascade;
FIG. 6 is a schematic diagram of the deformable convolutional network (DCN) principle;
FIG. 7 is a model segmentation effect diagram in a plurality of simple scenarios;
FIG. 8 is a model segmentation effect diagram in complex scenes (e.g., varying illumination and background color).
Detailed Description
The invention is further described in detail below with reference to the drawings.
With reference to fig. 1 and fig. 5: to overcome the defects and drawbacks of the prior art, the invention provides a completely new semantic image segmentation method. Compared with other methods it better accounts for both local and global information, and by improving Cascade RCNN and Mask RCNN, the shapes and boundaries of the objects in the final segmentation map are clearer and the classification is more accurate. The method is called the deformable convolution mixed task cascade semantic segmentation method based on embedded balance.
The system comprises three parts: an embedded-balance deformable convolutional neural network, a cyclic semantic segmentation network, and a cascaded sparse RoI classification and regression network.
Referring to fig. 2, the invention inputs the cropped image into a pre-trained neural network so as to extract the important features of different targets from a large number of pictures and obtain feature maps. The feature extraction network consists of a 50-layer deep residual network (ResNet) and a Feature Pyramid Network (FPN). The FPN performs prediction after up-sampling each layer's feature map and fusing it with the shallow layers, and then applies the embedded-balance feature fusion operation to further improve the recognition rate of tiny targets.

The feature pyramid network must be trained; during training, a balanced L1 loss function adjusts the weight of each task's L1 loss to balance the multiple tasks. Samples are introduced during training: following overlap-balanced sampling, the sampling interval is evenly divided into K bins, the N hard samples are evenly distributed across the bins, and samples are drawn uniformly from each bin, giving intersection-over-union balanced sampling. In contrast to the model architecture, the training process has a crucial impact on the performance of the target detection model, and training of the detection part is typically constrained by imbalance at the sample, feature, and objective levels. To reduce the resulting adverse effects, an embedded-balance FPN structure is proposed; it supports balanced learning of the model and improves the detection accuracy of small targets. In the multi-scale feature sampling of the FPN, intersection-over-union balanced sampling and the balanced L1 loss function are added to reduce imbalance in the samples, the feature extraction, and the target detection, breaking the balance limits of the overall model and yielding a balanced fused feature map.

Referring to fig. 6, feature extraction uses an FPN embedding deformable convolution and pooling; the deformable convolution and deformable pooling modules are introduced to enhance the transformation modeling capability of the Convolutional Neural Network (CNN). The new modules easily replace the ordinary modules in an existing CNN and can easily be trained end to end by standard back-propagation. The convolution layer at the start of the deformable convolutional neural network is a deformable convolution layer, and the region-of-interest pooling layer is deformable region-of-interest pooling. The resulting feature map (Feature Map) is divided into parts of equal size, their number being the output dimensionality; using the deformable sampling positions, the similar neighboring structure information of each pixel is compressed into a fixed grid, creating a deformable feature image. Applying regular convolution on this deformable feature image reflects the complex structure of the image more effectively and further improves sensitivity to tiny targets.
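A compact sketch of this embedded-balance fusion over the FPN levels follows (the gather-average-refine-scatter pattern; the single 3 × 3 refinement convolution is a simplifying assumption):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BalancedFeatureFusion(nn.Module):
        # Rescale all pyramid levels to one intermediate resolution,
        # average them into a single balanced semantic map, refine it,
        # then add it back to every level so each scale sees the same
        # balanced context.
        def __init__(self, channels=256):
            super().__init__()
            self.refine = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, levels):
            size = levels[len(levels) // 2].shape[-2:]
            gathered = [F.interpolate(l, size=size, mode='nearest') for l in levels]
            balanced = self.refine(torch.stack(gathered).mean(dim=0))
            return [l + F.interpolate(balanced, size=l.shape[-2:], mode='nearest')
                    for l in levels]

    levels = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
    print([f.shape[-1] for f in BalancedFeatureFusion()(levels)])  # [64, 32, 16, 8]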
According to the method, a filling (padding) operation is added for each convolution, expanding 0-valued pixels around the outermost periphery. Although padding may introduce weak noise, it keeps the resolution of the output segmentation image the same as that of the original image, which helps predict the category of each pixel accurately when segmenting dense small targets. Because a deeper network structure loses more spatial information and small-scale objects are difficult to recover from low-resolution feature maps, S-Dropout is used in the convolutional network to trim the network structure, removing the last pooling layer and the convolutional layer that follows it; the network thus gains a sparsification property and overfitting during training is reduced.
The invention inputs the feature maps into a region candidate network (RPN) for target recognition classification and candidate-box regression. The RPN is essentially a sliding-window-based target detector with a tree structure: a 3 × 3 convolutional layer as the trunk and two 1 × 1 convolutional layers as branches. The RPN classifier (a 1 × 1 convolutional layer) labels as foreground each RoI that overlaps a real target with an overlap area greater than 0.5, and as background each RoI that overlaps no target or whose overlap area is less than 0.1. The RPN regressor (a 1 × 1 convolutional layer) calculates the box offsets between the foreground RoIs and the real targets; non-maximum suppression is then performed, i.e., the RoIs are sorted by classification score and the first N are selected to obtain the RPN Box candidate boxes. Next, the classification results and the regressed candidate boxes are combined to obtain the candidate regions (Proposal). The loss function used by the RPN here is the sum of the classification error and the regression error.
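The tree-structured RPN head described here, a 3 × 3 trunk with two sibling 1 × 1 branches, can be sketched as follows (layer names are illustrative):

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        # Sliding-window detector: a shared 3x3 trunk, then two sibling
        # 1x1 branches for objectness scores and box regression.
        def __init__(self, in_ch=256, num_anchors=9):
            super().__init__()
            self.trunk = nn.Conv2d(in_ch, in_ch, 3, padding=1)
            self.cls = nn.Conv2d(in_ch, num_anchors, 1)      # fg/bg score per anchor
            self.reg = nn.Conv2d(in_ch, num_anchors * 4, 1)  # 4 box deltas per anchor

        def forward(self, x):
            t = torch.relu(self.trunk(x))
            return self.cls(t), self.reg(t)

    scores, deltas = RPNHead()(torch.randn(1, 256, 50, 50))
    print(scores.shape, deltas.shape)  # (1, 9, 50, 50) (1, 36, 50, 50)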
Each generated RoI is mapped onto the convolution feature map (Feature Map) extracted by the FPN, and the corresponding 7 × 7 feature map is obtained. RoI alignment includes two correspondences: first from the original image to the pixels of the feature map, and then from the feature map to the fixed-size output map. Because the pooling layers in the feature extraction network shrink the feature map by a certain ratio relative to the original image (related to the number and size of the pooling layers), identification errors caused by coarse quantization are avoided through bilinear interpolation and pooling: a pixel value in the target map is determined from the four real pixel values surrounding the virtual point in the original map, so that the pixels of the original image and of the feature map are accurately aligned, which improves target detection accuracy and facilitates target segmentation.
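The bilinear interpolation that this alignment relies on, written out for a single virtual sampling point (a minimal sketch of the four-surrounding-pixels rule):

    def bilinear_sample(fmap, y, x):
        # Value at fractional position (y, x), computed from the four real
        # pixels surrounding the virtual point; fmap is a 2-D list of rows.
        y0, x0 = int(y), int(x)
        y1 = min(y0 + 1, len(fmap) - 1)
        x1 = min(x0 + 1, len(fmap[0]) - 1)
        dy, dx = y - y0, x - x0
        top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
        bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
        return top * (1 - dy) + bottom * dy

    print(bilinear_sample([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5))  # 1.5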
A cyclic semantic segmentation network performs a Full Convolution Network (FCN) operation on the RoI-aligned feature map to predict and generate a pixel-level mask (Mask). The loss function used by the mask generation network is the sum of the target detection classification error, the bounding-box regression error, and the semantic segmentation error. The "Head layer" preceding the FCN mainly expands the dimensionality of the RoI-aligned output, which makes mask prediction more accurate. After the target segmentation image is obtained through the FCN, if the segmentation intersection-over-union is not satisfactory, the receptive field (Receptive Field) of the feature extraction is too small for accurate segmentation of small targets; the result is then input into the Head layer again, fused with the receptive-field layer in the convolutional layers, and segmented again through the FCN until a sufficiently accurate intersection-over-union is obtained. The receptive field is defined as the size of the area on the original image (the network input) to which each pixel of a layer's output feature map is mapped.
Compared with current target detection methods, the present method offers notable accuracy and sensitivity. Building on prior work, it redefines the network structure; extracts features by combining the embedded-balance FPN with deformable convolution; introduces a cyclic mechanism into semantic segmentation to improve the sensitivity of the receptive field; and adopts network cascading to improve the classification and localization accuracy of the candidate boxes. Through feature fusion across layers, the information of every network layer is fully utilized, the final accuracy improves to a certain extent, the image segmentation results are smoother, and the classification and localization results are more accurate, fully demonstrating the novelty and superiority of the method in semantic segmentation and target detection.
Referring to fig. 1 in conjunction with fig. 2 and 3 and fig. 4, in one embodiment, the method of the present invention comprises:
s1, first, an image of an arbitrary size is input into a backbone network composed of 13 convolutional layers, 13 Linear Unit (ReLU) layers, and 4 pooling layers for extracting picture features. Firstly, the input is a whole picture, and dimension reduction processing is carried out on each level of the characteristic pyramid network through a 1 × 1 convolution layer. Second, the low-level feature map is upsampled and the high-level feature map is downsampled to the same scale space with the step set to 8. It was found from experiments that this setup is sufficient for an accurate pixel level prediction of the whole image. Then, information fusion is carried out on semantic features extracted before different levels; meanwhile, four convolutional layers are added structurally to further enhance the generalization capability of extracting semantic features. Finally, the convolution layer is adopted to predict the pixel level segmentation result, and a Feature map (Feature Maps) is output. Performing feature extraction on the input image by adopting a deformable convolution neural network in the convolution and pooling part of the feature pyramid network; the convolution layer at the starting end of the deformable convolution neural network is a deformable convolution layer, and the pooling layer of the region of interest is a deformable region of interest pooling layer to obtain a characteristic diagram; dividing the characteristic diagram into parts with the same size, wherein the number of the parts is the output dimensionality;
s2, for each image, after the operation of S1, all the obtained Feature Maps (Feature Maps) are input into a Region candidate Network (RPN) for training. The method comprises the steps of firstly entering a 3 x 3 convolutional layer, then entering two 1 x 1 brother convolutional layers (sitting layers), finally classifying by using a Softmax layer, accurately positioning a candidate region, and selecting the candidate region. The area candidate network comprises two parts, namely a target detection classifier and a candidate frame positioning classifier, wherein the target detection classifier outputs a target identification result and prediction accuracy, and the candidate frame positioning classifier can accurately position candidate areas and output candidate frames of a plurality of candidate areas. In the training process, samples need to be introduced, sampling is balanced according to the overlapping degree, the deformable interesting region is uniformly divided into K intervals through interval sampling, N difficult samples are uniformly distributed to each interval, and then the sample intervals are uniformly selected from the N difficult samples.
Inputting the feature map into the region candidate network for candidate-box localization and classification comprises: setting 9 candidate deformable regions of interest for each pixel of the feature map, dividing the deformable regions of interest into foreground and background with a classifier while preliminarily adjusting their positions with a regressor, performing non-maximum suppression, i.e. sorting the deformable regions of interest by classification score, and selecting the first N of them to obtain the candidate boxes.
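The non-maximum suppression and top-N selection of this step, sketched with torchvision's NMS operator (the IoU threshold of 0.7 is a typical assumption, not a value from the patent):

    import torch
    from torchvision.ops import nms

    def select_proposals(boxes, scores, top_n=300, iou_thresh=0.7):
        # Sort candidate boxes by classification score, suppress heavily
        # overlapping ones, and keep the first N as candidate frames.
        keep = nms(boxes, scores, iou_thresh)[:top_n]  # indices, score-ordered
        return boxes[keep], scores[keep]

    boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
    scores = torch.tensor([0.9, 0.8, 0.7])
    print(select_proposals(boxes, scores, top_n=2)[0])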
A cyclic semantic segmentation network predicts and generates a pixel-level mask for each aligned deformable region of interest through a full convolution neural network; after the target segmentation image is obtained, if the segmentation intersection-over-union is not satisfactory, the result is input into the full convolution neural network again for training until the required intersection-over-union result is obtained.
S3. The anchors (Anchors) with the largest intersection-over-union with the real target box (Ground Truth Box) are taken as positive training samples to train the region candidate network. The loss function of the region candidate network is defined as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i indexes the anchors (Anchor) in a mini-batch, p_i denotes the probability that the i-th anchor is a foreground target, p_i* is 1 when the i-th anchor is a foreground target and 0 otherwise, t_i denotes the coordinates of the predicted bounding box (Bounding Box), and t_i* denotes the coordinates of the real target (Ground Truth).
The cyclic semantic segmentation network is trained as well, with a balanced L1 loss function adjusting the weight of the task's L1 loss, the task including semantic segmentation.
S4. RoI alignment is performed on the output of S3, and each RoI is mapped to the corresponding position of the feature map (Feature Map) of step S1 according to the input image. The mapped region is divided into parts of equal size, their number equal to the output dimension, and a max-pooling operation is performed on each part, so that fixed-size feature maps are obtained from boxes of different sizes.
S5. Classification and regression. The output of this layer is the final objective: the category to which each candidate region belongs and its precise location in the image.
In all of the above, a padding operation pads 0-valued pixels around the image periphery before each convolution, a ReLU follows each convolution, and the size and number of the convolution kernels can be chosen freely.
Referring to fig. 7 and fig. 8, which show model segmentation effect maps for various scenes processed by the method of the invention, the final accuracy is improved to a certain extent, making the image segmentation results more accurate and smooth. As the above examples and the detailed description of the figures show, the proposed semantic image segmentation method is a significant step forward compared with traditional methods, in both quantitative and qualitative comparison. The invention redefines the network structure and combines the feature maps of the encoding end and the decoding end, fully demonstrating its superiority in semantic image segmentation. Through feature fusion of different layers, the information of every network layer is fully utilized, the final accuracy improves to a certain degree, and the image segmentation result is more accurate and smooth.
The preferred embodiments of the present invention have been described in detail; however, the invention is not limited to the above specific embodiments. Those skilled in the art can make modifications or equivalent changes within the scope of the claims, and all such modifications and equivalent changes shall fall within the protection scope of the present invention.

Claims (7)

1. A deformable convolution mixing task cascading semantic segmentation method based on embedding balance is characterized by comprising the following steps:
inputting the cropped image into a pre-trained neural network;
performing dimensionality reduction on the input image with the 3 × 3 convolution kernels and pooling operations of a feature pyramid network; up-sampling the low-level feature maps and down-sampling the high-level feature maps, mapping both to the same scale space; performing information fusion on the semantic features extracted at different hierarchies; predicting a pixel-level segmentation result with a convolution layer;
performing feature extraction on the input image by adopting a deformable convolutional neural network in the convolution and pooling parts of the feature pyramid network; the convolution layer at the start of the deformable convolutional neural network is a deformable convolution layer, and the region-of-interest pooling layer is a deformable region-of-interest pooling layer, to obtain a feature map; dividing the feature map into parts of equal size, the number of parts being the output dimensionality;
inputting a feature map obtained after the feature pyramid network into a regional candidate network for training the network, wherein the regional candidate network comprises a target detection classifier and a candidate frame positioning classifier, the target detection classifier outputs a target identification result and prediction accuracy, and the candidate frame positioning classifier can accurately position candidate regions and output candidate frames of a plurality of candidate regions;
in the process of training the regional candidate network, a balanced L1 loss function is adopted to adjust the weight of the respective L1 loss function of a plurality of tasks, wherein the tasks comprise target detection and candidate frame generation;
in the training process, samples need to be introduced: following overlap-balanced sampling, the sampling interval of the deformable regions of interest is evenly divided into K bins, the N hard samples are evenly distributed across the bins, and samples are then drawn uniformly from each bin.
2. The method of claim 1, wherein inputting the feature map into the regional candidate network for candidate box location classification comprises: setting 9 candidate deformable interesting regions for each pixel point in the feature map, dividing the deformable interesting regions into a foreground and a background by using a classifier, simultaneously preliminarily adjusting the positions of the deformable interesting regions by using a regressor, carrying out non-maximum suppression, sequencing the deformable interesting regions according to classified scores, and selecting the first N deformable interesting regions to obtain a candidate frame.
3. The method of claim 2, wherein the deformable region of interest generated by the area candidate network is mapped onto the feature map extracted by the feature pyramid network, and a 7 x 7 feature map corresponding to the feature map is obtained for deformable region of interest alignment.
4. The method of claim 3, wherein a pixel-level mask is generated predictively using a recurrent semantic segmentation network through a full convolution neural network for each aligned deformable region of interest; after the target segmentation image is obtained through the full convolution neural network, if the target segmentation cross-over ratio is not ideal, the result needs to be input into the full convolution neural network again for training until the needed cross-over ratio result is obtained.
5. The method of claim 4, wherein the cyclic semantic segmentation network is trained to adjust the weights of the L1 loss functions of tasks comprising semantic segmentation using balanced L1 loss functions.
6. A method as claimed in claim 1, characterized in that a padding operation of 0 is applied to the image surroundings before the convolution in all processes.
7. A method as claimed in claim 1, characterized in that the convolution in all processes is followed by a rectified linear unit.
CN202010004799.7A 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance Active CN111210443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010004799.7A CN111210443B (en) 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010004799.7A CN111210443B (en) 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Publications (2)

Publication Number Publication Date
CN111210443A CN111210443A (en) 2020-05-29
CN111210443B true CN111210443B (en) 2022-09-13

Family

ID=70785546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010004799.7A Active CN111210443B (en) 2020-01-03 2020-01-03 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance

Country Status (1)

Country Link
CN (1) CN111210443B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768466B (en) 2020-06-30 2024-01-12 北京百度网讯科技有限公司 Image filling method, device, equipment and storage medium
CN111754531A (en) * 2020-07-08 2020-10-09 深延科技(北京)有限公司 Image instance segmentation method and device
CN111862119A (en) * 2020-07-21 2020-10-30 武汉科技大学 Semantic information extraction method based on Mask-RCNN
CN111860332B (en) * 2020-07-21 2022-05-31 国网山东省电力公司青岛供电公司 Dual-channel electrokinetic diagram part detection method based on multi-threshold cascade detector
CN111860508A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Image sample selection method and related equipment
CN112069907A (en) * 2020-08-11 2020-12-11 盛视科技股份有限公司 X-ray machine image recognition method, device and system based on example segmentation
CN111985503B (en) * 2020-08-17 2024-04-26 浩鲸云计算科技股份有限公司 Target detection method and device based on improved feature pyramid network structure
CN111951319A (en) * 2020-08-21 2020-11-17 清华大学深圳国际研究生院 Image stereo matching method
CN112132258B (en) * 2020-08-26 2022-06-24 中国海洋大学 Multitask image processing method based on deformable convolution
CN112017065B (en) * 2020-08-27 2024-05-24 中国平安财产保险股份有限公司 Method, device and computer readable storage medium for vehicle damage assessment and claim
CN112116620B (en) * 2020-09-16 2023-09-22 北京交通大学 Indoor image semantic segmentation and coating display method
CN112446862B (en) * 2020-11-25 2021-08-10 北京医准智能科技有限公司 Dynamic breast ultrasound video full-focus real-time detection and segmentation device and system based on artificial intelligence and image processing method
CN112396053A (en) * 2020-11-25 2021-02-23 北京联合大学 Method for detecting object of all-round fisheye image based on cascade neural network
CN112418163B (en) * 2020-12-09 2022-07-12 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
CN112560722B (en) * 2020-12-22 2022-09-09 中国人民解放军国防科技大学 Airplane target identification method and device, computer equipment and storage medium
CN112712078A (en) * 2020-12-31 2021-04-27 上海智臻智能网络科技股份有限公司 Text detection method and device
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN112926480B (en) * 2021-03-05 2023-01-31 山东大学 Multi-scale and multi-orientation-oriented aerial photography object detection method and system
CN112950703B (en) * 2021-03-11 2024-01-19 无锡禹空间智能科技有限公司 Small target detection method, device, storage medium and equipment
CN113205526B (en) * 2021-04-01 2022-07-26 国网江苏省电力有限公司淮安供电分公司 Distribution line accurate semantic segmentation method based on multi-source information fusion
CN113065650B (en) * 2021-04-02 2023-11-17 中山大学 Multichannel neural network instance separation method based on long-term memory learning
CN113034506B (en) * 2021-05-24 2021-08-06 湖南大学 Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN113657214B (en) * 2021-07-30 2024-04-02 哈尔滨工业大学 Building damage assessment method based on Mask RCNN
CN113792584B (en) * 2021-08-03 2023-10-27 云南大学 Wearing detection method and system for safety protection tool
CN114092818B (en) * 2022-01-07 2022-05-03 中科视语(北京)科技有限公司 Semantic segmentation method and device, electronic equipment and storage medium
CN114511485B (en) * 2022-01-29 2023-05-26 电子科技大学 Compressed video quality enhancement method adopting cyclic deformable fusion
CN114170230B (en) * 2022-02-14 2022-04-29 清华大学 Glass defect detection method and device based on deformable convolution and feature fusion
CN114897798A (en) * 2022-04-24 2022-08-12 四川思极科技有限公司 Transformer oil leakage image identification method and system based on growth type detection
CN114926886B (en) * 2022-05-30 2023-04-25 山东大学 Micro-expression action unit identification method and system
CN116012719B (en) * 2023-03-27 2023-06-09 中国电子科技集团公司第五十四研究所 Weak supervision rotating target detection method based on multi-instance learning
CN116079749B (en) * 2023-04-10 2023-06-20 南京师范大学 Robot vision obstacle avoidance method based on cluster separation conditional random field and robot

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376576A (en) * 2018-08-21 2019-02-22 中国海洋大学 The object detection method for training network from zero based on the intensive connection of alternately update
CN110276765A (en) * 2019-06-21 2019-09-24 北京交通大学 Image panorama dividing method based on multi-task learning deep neural network

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679351B2 (en) * 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
US10043113B1 (en) * 2017-10-04 2018-08-07 StradVision, Inc. Method and device for generating feature maps by using feature upsampling networks
CN108446662A (en) * 2018-04-02 2018-08-24 电子科技大学 A kind of pedestrian detection method based on semantic segmentation information
CN109145713B (en) * 2018-07-02 2021-09-28 南京师范大学 Small target semantic segmentation method combined with target detection
CN109670525A (en) * 2018-11-02 2019-04-23 平安科技(深圳)有限公司 Object detection method and system based on once shot detection
CN109584248B (en) * 2018-11-20 2023-09-08 西安电子科技大学 Infrared target instance segmentation method based on feature fusion and dense connection network
CN109685067B (en) * 2018-12-26 2022-05-03 江西理工大学 Image semantic segmentation method based on region and depth residual error network
CN110097129B (en) * 2019-05-05 2023-04-28 西安电子科技大学 Remote sensing target detection method based on profile wave grouping characteristic pyramid convolution
CN110264466B (en) * 2019-06-28 2021-08-06 广州市颐创信息科技有限公司 Reinforcing steel bar detection method based on deep convolutional neural network
CN110533105B (en) * 2019-08-30 2022-04-05 北京市商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
CN110633661A (en) * 2019-08-31 2019-12-31 南京理工大学 Semantic segmentation fused remote sensing image target detection method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376576A (en) * 2018-08-21 2019-02-22 中国海洋大学 The object detection method for training network from zero based on the intensive connection of alternately update
CN110276765A (en) * 2019-06-21 2019-09-24 北京交通大学 Image panorama dividing method based on multi-task learning deep neural network

Also Published As

Publication number Publication date
CN111210443A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
Hafiz et al. A survey on instance segmentation: state of the art
Lateef et al. Survey on semantic segmentation using deep learning techniques
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN113239981B (en) Image classification method of local feature coupling global representation
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
Chen et al. Corse-to-fine road extraction based on local Dirichlet mixture models and multiscale-high-order deep learning
Cao et al. A survey on image semantic segmentation methods with convolutional neural network
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN110866938B (en) Full-automatic video moving object segmentation method
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN112364873A (en) Character recognition method and device for curved text image and computer equipment
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN111723660A (en) Detection method for long ground target detection network
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN112183649A (en) Algorithm for predicting pyramid feature map
CN113850324A (en) Multispectral target detection method based on Yolov4
CN116645592A (en) Crack detection method based on image processing and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant