CN115019039B - Instance segmentation method and system combining self-supervision and global information enhancement - Google Patents

Instance segmentation method and system combining self-supervision and global information enhancement

Info

Publication number: CN115019039B
Authority: CN (China)
Prior art keywords: instance, network, supervision, global information, self
Legal status: Active (granted)
Application number: CN202210582668.6A
Other languages: Chinese (zh)
Other versions: CN115019039A (en)
Inventors: 高榕 (Gao Rong), 沈加伟 (Shen Jiawei), 邵雄凯 (Shao Xiongkai)
Assignee: Hubei University of Technology
Application filed 2022-05-26 by Hubei University of Technology; priority to CN202210582668.6A
Publication of application CN115019039A: 2022-09-06; grant of CN115019039B: 2024-04-16

Classifications

    • G06V10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N3/02: Neural networks; G06N3/08: Learning methods
    • G06V10/764: Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/806: Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82: Image or video recognition using neural networks


Abstract

The invention discloses an instance segmentation method and system combining self-supervision and global information enhancement. A feature extraction network based on a ResNet backbone and an FPN module first builds a feature pyramid and fuses the feature maps; a Fastformer-based global information enhancement network then models the interactions among the pixels of the feature maps and extracts global information. Instance segmentation is performed by a prediction network: a category prediction network performs multi-label classification of the instances of interest, and a mask prediction network classifies the pixel values of the region where each instance is located to generate the instance mask. In addition, a self-supervised learning network performs contrastive learning among the instances in a picture, strengthening the model's understanding of the picture and its generalization. The method addresses the poor detection of occluded and incomplete objects, strengthens the generalization ability of the model, and improves segmentation performance in noisy scenes.

Description

Instance segmentation method and system combining self-supervision and global information enhancement
Technical Field
The invention relates to the technical field of artificial intelligence and computer vision, in particular to an instance segmentation method and system combining self-supervision and global information enhancement.
Background
Instance segmentation is a more challenging task than object detection in computer vision, since it combines object detection and semantic segmentation: objects of interest in an image are first located and classified, and each instance is then segmented semantically to separate foreground from background. With the rapid development of intelligent driving, medical image segmentation, and related technologies, higher demands are placed on the accuracy and real-time performance of instance segmentation algorithms. However, both the conventional top-down methods based on object detection and the bottom-up methods based on semantic segmentation still struggle to meet the real-time and accuracy requirements of fields such as intelligent driving.
Enhancing the performance of instance segmentation while shortening forward inference time is therefore of great significance. In recent years, several strong single-stage instance segmentation algorithms have been proposed that alleviate these problems and achieve reasonably good results. Nevertheless, these algorithms still suffer from drawbacks: convolution-based feature extraction networks lack global information, so detection of incomplete or occluded objects is poor; in addition, purely supervised training yields models with weak generalization, and performance degrades in high-noise scenes.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide an instance segmentation method and system combining self-supervision and global information enhancement, so as to solve the problems that existing instance segmentation methods lack global information in the feature extraction stage, generalize poorly, and segment badly in noisy scenes.
To achieve the above object, the present invention provides an instance segmentation method and system combining self-supervision and global information enhancement, comprising:
Step S1: establishing an instance segmentation model;
the instance segmentation model comprises a feature extraction network, a global information enhancement network, a self-supervised learning network, a category prediction network, and a mask prediction network;
The feature extraction network comprises a ResNet network and an FPN network. ResNet obtains a feature pyramid by stacking convolutional layers, ReLU layers, and normalization layers with residual connections; the FPN performs feature fusion by combining the rich semantic information of the upper-level feature maps with the accurate positional information of the lower-level feature maps in the pyramid (an illustrative sketch of this fusion follows the component list below);
The global information enhancement network is composed of Fastformer modules and models the interactions among the pixels of the feature map, extracting context information and enhancing the global information of the feature map;
The self-supervised learning network performs contrastive learning among the instances in a picture, strengthening the understanding of the picture and the generalization ability of the model;
The category prediction network performs multi-label classification of the instances of interest to obtain the category corresponding to each instance;
The mask prediction network performs binary classification of the pixels in the selected instance region, distinguishing foreground from background and generating the mask of the instance.
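The following is a minimal PyTorch sketch of the ResNet + FPN top-down fusion described above. It is illustrative only: the module name SimpleFPN, the channel sizes, and the nearest-neighbor upsampling are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down FPN fusion: 1x1 lateral convs align channels, and each
    upsampled higher-level map is added to the lower-level one."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats: C2..C5 from ResNet-50, low to high level
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # top-down pathway
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]  # fused P2..P5
```

Here the four input channel counts match the C2–C5 stages of a standard ResNet-50, and the 3×3 smoothing convolutions reduce the aliasing introduced by upsampling.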
Step S2: training the instance segmentation model;
The selected training data set, comprising picture data and corresponding label files, is input. Feature maps are first extracted and then fused; global information is then enhanced and the result is input to the prediction networks. A loss function is obtained by comparison with the label files, and back-propagation of the loss function guides the direction of model training.
Step S3: instance segmentation
The picture is divided into S×S grids, each of which is responsible for predicting the instance whose center point falls in that cell; that is, taking the grid cell as the center, the category and mask of the corresponding instance are predicted.
Optionally, the feature extraction network comprises ResNet-50 and FPN networks.
Further, the global information enhancement module is a Fastformer network based on additive attention.
The additive attention applies linear transformations to the input feature sequence E ∈ R^(N×d) (N is the sequence length and d is the hidden dimension) to obtain a query matrix, a key matrix, and a value matrix, denoted Q, K, V ∈ R^(N×d).
Additive attention over the query matrix Q generates attention weights, and the weighted sum of Q yields a global query vector. The global query vector is then multiplied element-wise with the key vectors in K, modeling their interrelationships.
Further, the same operation generates a global key vector, which is interactively modeled with the value vectors V; finally, feature vectors containing rich global semantic information are obtained.
The self-supervised learning network first obtains the feature representations of all instances using the bounding-box label information; for a randomly selected sample instance A, the remaining instances serve as a candidate pool, and similarity scores between A and the candidate pool are calculated.
Optionally, the similarity score is calculated as follows:
Further, the instances are ranked by similarity score, the top-k are taken as the query set Q, and the query set is then used to mine pseudo-positive instances from the candidate pool.
The pseudo-positive mining process comprises the following steps:
(1) The similarity between each instance in Q and each instance in the candidate pool is calculated; each instance I of the candidate pool obtains N similarity scores (N is the number of instances in the query set Q).
(2) The similarity scores are aggregated and sorted; the top-k instances exceeding a threshold are taken as pseudo-positive instances and added to the query set Q.
(3) Pseudo-positive mining continues with the updated query set Q until the scores of newly mined candidates fall below the threshold. The query set is taken as the pseudo-positive set, and the remaining instances in the candidate pool as the negative set.
(4) A similarity score between sample A and each instance of the pseudo-positive set is obtained with a softmax function:
S(A, p_i) = exp(A · p_i) / (exp(A · p_i) + Σ_{j=1}^{N_n} exp(A · n_j))
where p_i is an instance of the pseudo-positive set, N_n is the number of negative samples, and n_j is an instance of the negative set.
Optionally, taking the negative logarithm of the similarity score yields the contrastive learning loss function:
L_con = −Σ_i log S(A, p_i)
Further, the category prediction network adopts the Focal loss, obtaining a loss function from the predicted probability that each instance belongs to a certain category.
The mask prediction network performs binary classification of the pixels in the selected instance region, distinguishing foreground from background and generating the mask of the instance.
Optionally, the mask prediction network loss function is:
L_mask = (1 / N_pos) Σ_k ψ(p*_{i,j} > 0) · d_mask(m_k, m*_k)
where N_pos is the number of positive samples, p*_{i,j} is the category score predicted by the grid cell at position (i, j), ψ is the indicator function, and m_k and m*_k denote the predicted and ground-truth masks.
Optionally, d_mask adopts the Dice loss:
L_Dice = 1 − D(p, q),  D(p, q) = 2 Σ_{x,y} (p_{x,y} · q_{x,y}) / (Σ_{x,y} p_{x,y}² + Σ_{x,y} q_{x,y}²)
where p_{x,y} denotes the pixel value predicted by the cell at (x, y) and q_{x,y} the ground-truth pixel value at (x, y).
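A hedged sketch of the mask loss above, assuming the SOLO-style combination of an indicator over positive grid cells with the Dice distance; the function names, the eps stabilizer, and the soft-mask inputs are illustrative assumptions.

```python
import torch

def dice_coefficient(p, q, eps=1e-6):
    # p: predicted soft mask, q: ground-truth mask, both (H, W) in [0, 1]
    inter = (p * q).sum()
    return 2 * inter / (p.pow(2).sum() + q.pow(2).sum() + eps)

def mask_loss(pred_masks, gt_masks, pos_indicator):
    """L_mask = (1 / N_pos) * sum_k psi(positive) * (1 - Dice(p_k, q_k))."""
    n_pos = pos_indicator.sum().clamp(min=1)          # avoid division by zero
    losses = torch.stack([
        (1 - dice_coefficient(p, q)) * ind
        for p, q, ind in zip(pred_masks, gt_masks, pos_indicator.float())
    ])
    return losses.sum() / n_pos
```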
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
(1) On top of a single-stage instance segmentation algorithm, the method adds a Fastformer module based on additive attention to model pixel-level global semantic information in the feature maps, improving the segmentation of occluded and incomplete objects.
(2) A self-supervised learning module is added to the prediction network; contrastive learning over all instances in a picture strengthens the model's understanding of the picture and enhances its generalization ability.
Drawings
FIG. 1 is a flow chart of the instance segmentation model provided by an embodiment of the present invention;
FIG. 2 is a framework diagram of the instance segmentation model provided by an embodiment of the present invention;
FIG. 3 is an image to be tested provided by an embodiment;
FIG. 4 (a) is the segmentation result obtained by the original single-stage instance segmentation method;
FIG. 4 (b) is the instance segmentation result obtained using the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention provides an instance segmentation method and system combining self-supervision and global information enhancement, comprising the following steps:
Step S1: establishing an instance segmentation model;
As shown in FIG. 1, the instance segmentation model includes a feature extraction network, a global information enhancement network, a self-supervised learning network, a category prediction network, and a mask prediction network;
The feature extraction network comprises a ResNet-50 network and an FPN network. ResNet obtains a four-level feature pyramid at different scales by stacking convolutional layers, ReLU layers, and normalization layers with residual connections; the FPN performs feature fusion by combining the rich semantic information of the upper-level feature maps with the accurate positional information of the lower-level feature maps in the pyramid;
The global information enhancement network is a Fastformer module that models the interactions among the pixels of the feature map, extracting context information and enhancing the global information of the feature map.
Linear transformations are applied to the input feature sequence E ∈ R^(N×d) (N is the sequence length and d is the hidden dimension) to obtain a query matrix, a key matrix, and a value matrix, denoted Q, K, V ∈ R^(N×d): Q = [q_1, q_2, ..., q_N], K = [k_1, k_2, ..., k_N], V = [v_1, v_2, ..., v_N].
Additive attention over the query matrix Q produces attention weights, and the weighted sum of Q yields the global query vector:
q = Σ_{i=1}^{N} α_i · q_i,  α_i = exp(w_q^T q_i / √d) / Σ_{j=1}^{N} exp(w_q^T q_j / √d)
where α_i is the attention weight of vector q_i in the query matrix Q and w_q ∈ R^d is a learnable parameter vector. The global query vector q is then multiplied element-wise with each key vector in K, modeling their interrelationships.
The same operation generates a global key vector, which is interactively modeled with the value vectors V; finally, feature vectors containing rich global semantic information are obtained.
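The additive attention just described can be sketched in PyTorch as follows, following the public Fastformer formulation; the single-head layout, the residual connection, and the parameter initialization are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Fastformer-style additive attention with linear complexity in N."""
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.w_q = nn.Parameter(torch.randn(d))   # learnable scoring vectors
        self.w_k = nn.Parameter(torch.randn(d))
        self.out = nn.Linear(d, d)
        self.scale = math.sqrt(d)

    def forward(self, e):               # e: (B, N, d) flattened pixel sequence
        Q, K, V = self.q_proj(e), self.k_proj(e), self.v_proj(e)
        alpha = torch.softmax(Q @ self.w_q / self.scale, dim=1)        # (B, N)
        q_global = (alpha.unsqueeze(-1) * Q).sum(dim=1, keepdim=True)  # (B,1,d)
        P = q_global * K                 # element-wise query-key interaction
        beta = torch.softmax(P @ self.w_k / self.scale, dim=1)
        k_global = (beta.unsqueeze(-1) * P).sum(dim=1, keepdim=True)
        U = k_global * V                 # element-wise key-value interaction
        return self.out(U) + Q           # residual to the query, as in Fastformer
```

To apply this to a feature map, its H×W spatial grid is flattened into a sequence of N = H·W pixel tokens before the attention and reshaped back afterwards.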
The self-supervised learning network performs contrastive learning among the instances in the picture, strengthening the understanding of the picture and the generalization ability of the model;
First, the feature representations of all instances are obtained using the bounding-box label information; for a randomly selected sample instance A, the remaining instances serve as a candidate pool, and similarity scores between A and the candidate pool are calculated as follows:
The instances are ranked by similarity score, the top-k are taken as the query set Q, and pseudo-positive instances are then mined from the candidate pool using the query set. The mining process comprises the following steps:
(1) The similarity between each instance in Q and each instance in the candidate pool is calculated; each instance I of the candidate pool obtains N similarity scores (N is the number of instances in the query set Q):
S(I, Q) = (S(I, q_1), S(I, q_2), ..., S(I, q_N))
(2) The similarity scores are aggregated and sorted; the top-k instances exceeding a threshold are taken as pseudo-positive instances and added to the query set Q.
(3) Pseudo-positive mining continues with the updated query set Q until the scores of newly mined candidates fall below the threshold. The query set is taken as the pseudo-positive set, and the remaining instances in the candidate pool as the negative set.
(4) A similarity score between sample A and each instance of the pseudo-positive set is obtained with a softmax function:
S(A, p_i) = exp(A · p_i) / (exp(A · p_i) + Σ_{j=1}^{N_n} exp(A · n_j))
where p_i is an instance of the pseudo-positive set, N_n is the number of negative samples, and n_j is an instance of the negative set.
Taking the negative logarithm of the similarity score yields the contrastive learning loss function:
L_con = −Σ_i log S(A, p_i)
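A hedged sketch of the pseudo-positive mining loop and the contrastive loss described above; the cosine similarity, the mean aggregation of the N scores, the temperature tau, and the stopping rule are illustrative assumptions that this text does not fix.

```python
import torch
import torch.nn.functional as F

def mine_pseudo_positives(anchor, pool, k=3, threshold=0.7, max_rounds=5):
    """Grow a query set with pool instances whose aggregated similarity
    to the current query set exceeds `threshold`."""
    anchor = F.normalize(anchor, dim=-1)               # (d,) sample instance A
    pool = F.normalize(pool, dim=-1)                   # (M, d) candidate pool
    sims = pool @ anchor                               # similarity to A
    k = min(k, pool.size(0))
    query_idx = sims.topk(k).indices.tolist()          # top-k seed the query set
    remaining = [i for i in range(pool.size(0)) if i not in query_idx]
    for _ in range(max_rounds):
        if not remaining:
            break
        q = pool[query_idx]                            # (|Q|, d)
        agg = (pool[remaining] @ q.T).mean(dim=1)      # aggregate the N scores
        keep = (agg > threshold).nonzero(as_tuple=True)[0].tolist()
        if not keep:
            break                                      # scores below threshold
        new = [remaining[i] for i in keep]
        query_idx += new
        remaining = [i for i in remaining if i not in new]
    return query_idx, remaining         # pseudo-positive / negative indices

def contrastive_loss(anchor, positives, negatives, tau=0.1):
    """Negative log of each pseudo-positive's softmax score vs. the negatives."""
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positives, dim=-1) @ a / tau     # (N_p,) scores
    neg = F.normalize(negatives, dim=-1) @ a / tau     # (N_n,) scores
    denom = pos.exp() + neg.exp().sum()
    return -(pos.exp() / denom).log().mean()
```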
The category prediction network performs multi-label classification of the instances of interest to obtain the category corresponding to each instance;
The mask prediction network performs binary classification of the pixels in the selected instance region, distinguishing foreground from background and generating the mask of the instance. The mask prediction network loss function is:
L_mask = (1 / N_pos) Σ_k ψ(p*_{i,j} > 0) · d_mask(m_k, m*_k)
where N_pos is the number of positive samples, p*_{i,j} is the category score predicted by the grid cell at position (i, j), and ψ is the indicator function.
For d_mask, the Dice loss is adopted:
L_Dice = 1 − D(p, q),  D(p, q) = 2 Σ_{x,y} (p_{x,y} · q_{x,y}) / (Σ_{x,y} p_{x,y}² + Σ_{x,y} q_{x,y}²)
Step S2: training the instance segmentation model;
The selected training data set, comprising picture data and corresponding label files, is input. Feature maps are first extracted and then fused; global information is then enhanced, and the result is fed to the head networks for prediction. The resulting loss function is back-propagated to guide model training.
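A minimal sketch of one training step, under the assumption that the model returns its loss terms in a dictionary; the output keys and the weighting coefficients are assumptions, not values fixed by this description.

```python
import torch

def train_step(model, images, targets, optimizer, w_mask=3.0, w_ssl=0.1):
    """One optimization step over the combined loss."""
    optimizer.zero_grad()
    out = model(images, targets)      # assumed to return a dict of loss terms
    loss = (out["category_loss"]      # Focal loss of the category branch
            + w_mask * out["mask_loss"]          # Dice-based mask loss
            + w_ssl * out["contrastive_loss"])   # self-supervised term
    loss.backward()                   # back-propagate to guide training
    optimizer.step()
    return loss.item()
```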
The invention uses the urban road street-view dataset Cityscapes for model training, which contains street-view images of different cities: 2,975 training images, 500 validation images, and 1,525 test images with high-quality annotations.
Step S3: instance segmentation
The picture is first divided into S×S grids, each of which is responsible for predicting the instance whose center point falls in that cell; that is, taking the grid cell as the center, the category and mask of the corresponding instance are predicted.
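A minimal sketch of the S×S grid assignment described above, in the style of SOLO; S = 40 and the flat cell indexing are illustrative assumptions.

```python
import torch

def assign_instances_to_grid(centers, img_size, S=40):
    """Map each instance center (x, y), in pixels, to the grid cell that is
    responsible for predicting that instance's category and mask."""
    h, w = img_size
    xs, ys = centers[:, 0], centers[:, 1]    # centers: float tensor (K, 2)
    grid_x = (xs / w * S).long().clamp(0, S - 1)
    grid_y = (ys / h * S).long().clamp(0, S - 1)
    return grid_y * S + grid_x                # flat index of the responsible cell
```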
FIG. 3 is the image to be tested provided in the embodiment. The segmentation results of the original single-stage instance segmentation method are shown in FIG. 4 (a): the mask generated for the motorcycle on the right of the first picture fits poorly; in the second picture, the enclosing wall is misidentified as a truck owing to poor light and heavy noise in the right half; and in the third picture, the incomplete instances (the motorcycle and its rider) are not well separated. The instance segmentation results obtained with the method of the present invention, shown in FIG. 4 (b), improve markedly on all of the above cases.
The method alleviates, to a certain extent, the poor detection of occluded or incomplete objects by the original single-stage instance segmentation algorithm, and it considerably strengthens the generalization ability of the model and its segmentation in scenes with insufficient illumination, over-strong exposure, or rain.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. An instance segmentation method combining self-supervision and global information enhancement, comprising:
Step S1: establishing an instance segmentation model;
the instance segmentation model comprises a feature extraction network, a global information enhancement network, a self-supervised learning network, a category prediction network, and a mask prediction network;
the feature extraction network comprises a ResNet network and an FPN network, the ResNet network obtaining a feature pyramid by stacking convolutional layers, ReLU layers, and normalization layers with residual connections, and the FPN performing feature fusion by combining the rich semantic information of the upper-level feature maps with the accurate positional information of the lower-level feature maps in the feature pyramid;
the global information enhancement network is composed of Fastformer modules and models the interactions among the pixels of the feature map, extracting context information and enhancing the global information extraction capability of the feature map;
the self-supervised learning network performs self-supervised contrastive learning among the instances in a picture, strengthening the understanding of the picture and the generalization ability of the model;
the category prediction network performs multi-label classification of the instances of interest to obtain the category corresponding to each instance;
the mask prediction network performs binary classification of the pixels in the selected instance region, distinguishing foreground from background and generating the mask of the instance;
Step S2: training the instance segmentation model;
inputting a selected training data set comprising picture data and corresponding label files; extracting feature maps and then fusing them; enhancing global information, feeding the result to the head networks for prediction, obtaining a loss function, and back-propagating the loss function to guide the direction of model training;
Step S3: instance segmentation
firstly, dividing the picture into S×S grids, each grid being responsible for predicting the instance whose center point falls in that cell; that is, taking the grid cell as the center, predicting the category and mask of the corresponding instance.
2. The instance segmentation method combining self-supervision and global information enhancement according to claim 1, wherein the feature extraction network comprises ResNet-50 and FPN networks.
3. The instance segmentation method combining self-supervision and global information enhancement according to claim 1, wherein the global information enhancement network is a Fastformer network based on additive attention.
4. The instance segmentation method combining self-supervision and global information enhancement according to claim 3, wherein the additive attention applies linear transformations to the input feature sequence E ∈ R^(B×d), B being the sequence length and d the hidden dimension, to obtain a query matrix, a key matrix, and a value matrix, denoted Q, K, V ∈ R^(B×d).
5. The instance segmentation method combining self-supervision and global information enhancement according to claim 4, wherein additive attention over the query matrix Q generates attention weights whose weighted sum of Q yields a global query vector; the global query vector is then multiplied element-wise with K, modeling their interrelationship.
6. The instance segmentation method combining self-supervision and global information enhancement according to claim 5, wherein additive attention over the key matrix K generates attention weights whose weighted sum of K yields a global key vector, which is interactively modeled with V to finally obtain feature vectors containing rich global semantic information.
7. The instance segmentation method combining self-supervision and global information enhancement according to claim 1, wherein the self-supervised learning network first obtains the feature representations of all instances using the bounding-box label information and, for a randomly selected sample instance A, takes the remaining instances as a candidate pool and calculates similarity scores between A and the candidate pool.
8. The method of claim 7, wherein the similarity score is calculated as follows:
the instances are ranked by similarity score, the top-k instances are taken as the query set Q, and pseudo-positive instances are then mined from the candidate pool using the query set.
9. The method of claim 8, wherein the pseudo-positive mining process comprises:
(1) calculating the similarity between each instance in Q and each instance in the candidate pool, each instance I of the candidate pool obtaining N similarity scores, wherein N is the number of instances in the query set Q;
(2) aggregating and sorting the similarity scores, taking the top-k instances exceeding a threshold as pseudo-positive instances, and adding them to the query set Q;
(3) continuing pseudo-positive mining with the updated query set Q until the scores of newly mined candidates fall below the threshold, taking the query set as the pseudo-positive set and the remaining instances in the candidate pool as the negative set;
(4) obtaining a similarity score between sample A and each instance of the pseudo-positive set with a softmax function:
S(A, p_i) = exp(A · p_i) / (exp(A · p_i) + Σ_{j=1}^{N_n} exp(A · n_j))
wherein p_i is an instance of the pseudo-positive set, N_n is the number of negative samples, and n_j is an instance of the negative set;
(5) taking the negative logarithm of the similarity score to obtain the contrastive learning loss function:
L_con = −Σ_i log S(A, p_i).
10. The method of claim 1, wherein the category prediction network adopts the Focal loss, obtaining a loss function from the predicted probability that each instance belongs to a certain category; and the mask prediction network performs binary classification of the pixels in the selected instance region, distinguishing foreground from background and generating the mask of the instance.
CN202210582668.6A 2022-05-26 2022-05-26 Instance segmentation method and system combining self-supervision and global information enhancement Active CN115019039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210582668.6A CN115019039B (en) 2022-05-26 2022-05-26 Instance segmentation method and system combining self-supervision and global information enhancement


Publications (2)

Publication Number Publication Date
CN115019039A CN115019039A (en) 2022-09-06
CN115019039B (en) 2024-04-16

Family

ID=83071360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210582668.6A Active CN115019039B (en) 2022-05-26 2022-05-26 Instance segmentation method and system combining self-supervision and global information enhancement

Country Status (1)

Country Link
CN (1) CN115019039B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024103380A1 (en) * 2022-11-18 2024-05-23 Robert Bosch Gmbh Method and apparatus for instance segmentation
CN116664845B (en) * 2023-07-28 2023-10-13 山东建筑大学 Intelligent engineering image segmentation method and system based on inter-block contrast attention mechanism
CN117853732A (en) * 2024-01-22 2024-04-09 广东工业大学 Self-supervision re-digitizable terahertz image dangerous object instance segmentation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430946B1 (en) * 2019-03-14 2019-10-01 Inception Institute of Artificial Intelligence, Ltd. Medical image segmentation and severity grading using neural network architectures with semi-supervised learning techniques
CN112927245A (en) * 2021-04-12 2021-06-08 华中科技大学 End-to-end instance segmentation method based on instance query
CN113392711A (en) * 2021-05-19 2021-09-14 中国科学院声学研究所南海研究站 Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN113837205A (en) * 2021-09-28 2021-12-24 北京有竹居网络技术有限公司 Method, apparatus, device and medium for image feature representation generation
CN114387454A (en) * 2022-01-07 2022-04-22 东南大学 Self-supervision pre-training method based on region screening module and multi-level comparison

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11830253B2 (en) * 2020-04-14 2023-11-28 Toyota Research Institute, Inc. Semantically aware keypoint matching
US11941086B2 (en) * 2020-11-16 2024-03-26 Salesforce, Inc. Systems and methods for contrastive attention-supervised tuning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang X. et al. "SOLOv2: Dynamic and Fast Instance Segmentation." Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 17721–17732. *
Sadek, Assem et al. "Self-Supervised Attention Learning for Depth and Ego-motion Estimation." 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021. *

Also Published As

Publication number Publication date
CN115019039A (en) 2022-09-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant