CN111881981A - Mask coding-based single-stage instance segmentation method - Google Patents

Mask coding-based single-stage instance segmentation method

Info

Publication number
CN111881981A
CN111881981A (application CN202010747003.7A)
Authority
CN
China
Prior art keywords
mask
stage
coding
instance
dimensional
Prior art date
Legal status
Pending
Application number
CN202010747003.7A
Other languages
Chinese (zh)
Inventor
徐枫 (Xu Feng)
章如锋 (Zhang Rufeng)
Current Assignee
Suzhou Keben Information Technology Co ltd
Original Assignee
Suzhou Keben Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Keben Information Technology Co ltd filed Critical Suzhou Keben Information Technology Co ltd
Priority to CN202010747003.7A priority Critical patent/CN111881981A/en
Publication of CN111881981A publication Critical patent/CN111881981A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 - Feature extraction based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to the field of image processing, in particular to a mask coding-based single-stage instance segmentation method, which comprises a training stage model and a prediction stage model and is characterized by comprising the following steps. Step A, bilinear interpolation: category information is removed from original mask labels given in different sizes to obtain category-independent two-dimensional masks of the same size. Step B, principal component analysis coding: given the information redundancy of the two-dimensional mask M, it is coded and compressed by orthogonal transformation into a vector u of dimension N. Step C, construction of a single-stage instance segmenter: the segmenter is based on the single-stage object detector FCOS with corresponding modifications. Step D, non-maximum suppression post-processing. Step E, mask coding and decoding. The invention compresses the highly redundant two-dimensional mask into a mask vector with compact features, realizing fast and stable prediction and solving the problem that the prediction speed of current instance segmentation methods decreases as the number of targets increases.

Description

Mask coding-based single-stage instance segmentation method
Technical Field
The invention relates to the field of image processing, in particular to a mask coding-based single-stage instance segmentation method.
Background
Instance segmentation refers to performing, for a given image, a pixel-level semantic classification task and a region-level instance classification task simultaneously; that is, determining at once which category the pixel at the current position belongs to and which target object it is part of. The task has very important practical significance in fields such as autonomous driving and robot navigation.
In recent years, with the rapid development of deep learning, much two-stage instance segmentation work based on object detection has made dramatic progress: potential objects are first localized, and a pixel-level classification task is then performed on each local region. However, because this approach must classify the candidate regions at the pixel level one by one, the inference speed drops greatly when a large number of objects appear in an image, so stable and efficient prediction cannot be achieved in practice.
Compared with two-stage models, whose inference speed is limited by the number of targets, single-stage methods can predict all targets in the current image simultaneously without losing speed, so several single-stage instance segmenters based on fully convolutional networks have appeared. The first category of methods is based on semantic segmentation: a pixel-level classification task is first performed on the current image, and instance clustering is then performed in a post-processing stage. These methods often describe edges finely; however, owing to problems such as clustering errors, it is difficult for them to find a large number of targets accurately. In addition, the post-processing is slow and cannot meet the real-time requirements of practical tasks.
The second category of methods, inspired by single-stage object detection, describes targets by encoding their contours. However, describing a target with a bounding contour causes a great performance degradation, mainly because: 1) when the target contour becomes complex, a single contour line describes the target far less finely than a two-dimensional mask; 2) when a background region exists inside the target, the contour cannot represent that region, which is very important in interactive robot navigation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a single-stage instance segmentation method based on mask coding, which combines the high mask-description precision of two-stage models with the stable inference speed of single-stage models, unaffected by the number of objects.
The invention is realized by the following technical scheme:
a mask coding-based single-stage instance segmentation method comprises a training stage model and a prediction stage model, and comprises the following steps:
step A, bilinear interpolation: removing the category information from the original mask labels given in different sizes to obtain a category-independent two-dimensional mask of the same size, M ∈ {0, 1}^(H×W), where H and W are the height and width of the two-dimensional mask, respectively;
step B, principal component analysis coding: given the information redundancy of the two-dimensional mask M ∈ {0, 1}^(H×W), the two-dimensional mask M is coded and compressed by orthogonal transformation, using the principal component analysis method, into a vector u ∈ R^N, where N is the dimension of the mask vector;
step C, constructing a single-stage instance segmenter: the segmenter is based on the single-stage object detector FCOS with corresponding modifications; because FCOS is anchor-free, it has fewer parameters and higher detection precision than anchor-based object detectors;
Step D, non-maximum value inhibition post-treatment: the prediction stage model outputs a certain number of potential targets, the targets are arranged in a descending order according to classification scores, and non-maximum value inhibition is carried out according to the intersection ratio among target frames; selecting K targets before the score ranking from the rest candidates to perform subsequent operation to obtain K target vectors;
step E, mask coding and decoding: the K target vectors obtained in step D are decoded into masks, finally realizing image instance segmentation.
Further, the formula of bilinear interpolation is:
f(x,y)=f(0,0)(1-x)(1-y)+f(1,0)x(1-y)+f(0,1)(1-x)y+f(1,1)xy
where x and y are the coordinates of the interpolation point, x along the x-axis and y along the y-axis, and f(0,0), f(1,0), f(0,1), f(1,1) are the values at the four neighbouring grid points.
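As a concrete illustration, the interpolation formula above can be written directly as a function (a minimal Python sketch; the name `bilerp` and the toy corner values are ours, not from the patent):

```python
def bilerp(f00, f10, f01, f11, x, y):
    """Bilinear interpolation inside a unit cell.

    f00..f11 are the corner values f(0,0), f(1,0), f(0,1), f(1,1);
    x and y are the query coordinates in [0, 1].
    """
    return (f00 * (1 - x) * (1 - y) + f10 * x * (1 - y)
            + f01 * (1 - x) * y + f11 * x * y)

# at the cell centre the result is the mean of the four corners
print(bilerp(1, 2, 3, 4, 0.5, 0.5))  # 2.5
```

At the corners the formula reduces to the corner values themselves, which is the sanity check one usually applies first.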
Further, the formula of the principal component analysis coding in step B is:
u=TM;
T* = argmin_T ||M - TᵀTM||², subject to TTᵀ = I_N;
W = Tᵀ;
where matrix T is used to encode the two-dimensional mask as u and matrix W is used to decode the mask vector u as M.
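How T and W can be obtained in practice may be sketched with plain numpy (illustrative only: the toy 8×8 masks, the random seed, and N = 16 are our assumptions, and the mean-subtraction detail is a standard PCA choice not spelled out here):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy training set: 500 flattened 8x8 binary masks (stand-in for real labels)
X = (rng.random((500, 64)) > 0.5).astype(float)
mean = X.mean(axis=0)

# principal components: right singular vectors of the centred data
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
N = 16                # mask-vector dimension (60 in the patent's embodiment)
T = Vt[:N]            # encoding matrix: u = T (m - mean)
W = T.T               # decoding matrix: W = T^T, since T T^T = I_N

m = X[0]
u = T @ (m - mean)    # compressed mask vector
m_hat = W @ u + mean  # reconstructed (approximate) mask
```

Because the rows of T are orthonormal, decoding is just the transpose, which is what makes the whole codec a pair of matrix multiplications.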
Further, step C includes step C1: an extension branch, which learns the mask vector through an added convolutional branch network, thereby realizing the mask vector prediction task;
further, step C includes step C2: the mask vector training based on the MSE loss function and the mask coding enable the features between the mask vectors to be linearly uncorrelated pairwise, so that the MSE loss function is used for carrying out independent supervised regression on elements in the mask vector u, and a good effect can be achieved.
Further, step D includes step D1, target detection: a picture is input, the model outputs a certain number of potential detection boxes at once, the targets are sorted in descending order of classification score, non-maximum suppression is performed according to the intersection-over-union between target boxes, and the K top-scoring targets are retained for subsequent operations.
Further, step D includes step D2, mask vector prediction: in parallel with step D1, the image is input, the model predicts a set of mask vectors for each potential instance, and after non-maximum suppression processing, selects K top scoring targets from the remaining candidates for subsequent operations.
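The non-maximum suppression used in steps D1/D2 can be sketched in a few lines of plain Python (a greedy, quadratic-time version for clarity; the function names are ours):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.5, top_k=100):
    """Sort by score descending; keep a box only if it overlaps no kept box."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == top_k:
            break
    return keep

# the two heavily overlapping boxes collapse to the higher-scoring one
print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

The `top_k` cut-off corresponds to keeping the K top-scoring targets described above.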
Further, the mask coding and decoding in step E is principal component analysis mask decoding: the mask vector u is obtained from step D2, the two-dimensional mask is reconstructed using the decoding matrix W stored in step B, and the mask is interpolated to the corresponding size according to the detection box obtained in step D1, completing the instance segmentation task. Step E performs post-processing with the non-maximum suppression algorithm, which has the advantage of effectively suppressing a large number of low-quality prediction results and outputting high-quality instance masks.
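The decoding described here reduces to one matrix multiplication, a reshape, a threshold, and a resize. A minimal numpy sketch (the function name, the fixed 28×28 grid, the 0.5 threshold, and the nearest-neighbour resize used as a simple stand-in for interpolation are our illustrative assumptions):

```python
import numpy as np

def decode_mask(u, W, mean, box_h, box_w, thresh=0.5):
    """Reconstruct a 28x28 mask from vector u, then resize it to the box."""
    m = (W @ u + mean).reshape(28, 28)
    # nearest-neighbour upsampling in place of bilinear interpolation
    ys = np.arange(box_h) * 28 // box_h
    xs = np.arange(box_w) * 28 // box_w
    return (m[np.ix_(ys, xs)] > thresh).astype(np.uint8)
```

With W and the mean saved from the coding step, this is the only computation the prediction stage adds per kept target.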
An apparatus for mask coding-based single-stage instance segmentation, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor; when executed by the processor, the computer program implements the steps of the instance segmentation method described above.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the example segmentation method as described above.
Compared with the prior art, the invention has the following technical achievements:
1) compared with mask-based two-stage instance segmentation methods, the inference precision is guaranteed while the inference speed is unaffected by the number of targets;
2) compared with single-stage instance segmentation methods based on contour coding, detail description and precision are significantly improved at a comparable speed;
3) the mask coding involves only matrix operations and is very simple and efficient;
4) by unifying dimensions with bilinear interpolation, coding masks with principal component analysis, building the instance segmenter on the single-stage object detector FCOS, post-processing with non-maximum suppression, and decoding masks with principal component analysis, the two-dimensional mask can be greatly compressed while preserving most of the effective information, with good reconstruction ability;
5) the mask coding technique generalizes well: it can easily be embedded into any existing object detector with only a small amount of extra parameters and computation, realizing an end-to-end dense instance segmentation task that combines the high mask-description precision of two-stage models with the stable, object-count-independent inference speed of single-stage models.
Drawings
FIG. 1 is an overall flow diagram of a mask-coding-based single-phase example segmentation method;
FIG. 2 is a mask encoding flow diagram of a mask encoding based single-phase example segmentation method;
FIG. 3 is a diagram of a depth network structure for a mask-based single-stage example segmentation method;
FIG. 4 is a graph of mask reconstruction efficiency for a single-stage example segmentation method based on mask coding;
FIG. 5 is an example segmentation result presentation of a mask-coding-based single-stage example segmentation method.
Detailed Description
The method of the invention is as follows as a whole. Given original mask labels of different sizes, their category information is first removed, and the mask labels are unified to the same size by bilinear interpolation. Then, exploiting the large amount of information redundancy in the masks, the two-dimensional mask is encoded by orthogonal transformation into a low-dimensional label vector with compact features using the principal component analysis method, and the transformation matrix is saved. Next, on the basis of a single-stage object detector, a branch is added that predicts a group of mask vectors for each potential instance. Finally, after post-processing such as non-maximum suppression, the predicted mask vectors are decoded (the K top-scoring targets are selected and their mask vectors decoded) to obtain the mask result of each target. The whole encoding and decoding process involves only matrix operations and is very simple and efficient. The method also generalizes well: it can be embedded into any existing single-stage detector with only a small amount of extra parameters and computation, realizing an end-to-end dense instance segmentation task that balances precision and speed.
The method is simple and easy to implement, can compress a highly redundant two-dimensional mask to a mask vector with compact characteristics, can be embedded into any existing single-stage detector to perform an end-to-end intensive instance segmentation task, further realizes rapid and stable prediction, and solves the problem that the prediction speed is reduced as the number of targets is increased in the instance segmentation task at the existing stage.
To better illustrate the proposed mask coding-based single-stage instance segmentation method, we take the open-source large-scale instance segmentation data set Microsoft COCO as an example, use the single-stage object detector FCOS with the ResNet-50 network as the basic feature extractor, and set the length of the mask vector to 60 dimensions; the invention is further described below with reference to the attached drawings.
Fig. 1 is the overall flowchart of the invention, which comprises six parts: bilinear interpolation, principal component analysis coding, single-stage instance segmenter construction, target detection, mask vector prediction, and principal component analysis decoding. The first five parts belong to the training stage and the last three to the prediction stage, with two parts appearing in both stages.
Step A, bilinear interpolation: given original mask labels of different sizes, the category information is first removed to obtain M_o ∈ {0, 1}^(H_o×W_o), where H_o and W_o are the height and width of the original two-dimensional mask and differ between objects. The mask labels are then unified to the same size 28×28 by bilinear interpolation, giving masks of identical size M ∈ {0, 1}^(28×28). The formula of bilinear interpolation is as follows:
f(x,y)=f(0,0)(1-x)(1-y)+f(1,0)x(1-y)+f(0,1)(1-x)y+f(1,1)xy
where x and y are the coordinates of the interpolation point, x along the x-axis and y along the y-axis, and f(0,0), f(1,0), f(0,1), f(1,1) are the values at the four neighbouring grid points.
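Applying this formula over a whole grid gives the resizing used to unify all masks to 28×28. A numpy sketch under our own naming (a real pipeline would typically use a library resize such as OpenCV's, but the arithmetic is the same):

```python
import numpy as np

def resize_mask(mask, out_h=28, out_w=28):
    """Bilinearly resample a 2-D mask of any size to out_h x out_w."""
    h, w = mask.shape
    ys = np.linspace(0, h - 1, out_h)          # sample rows in source coords
    xs = np.linspace(0, w - 1, out_w)          # sample cols in source coords
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]                    # fractional offsets per row
    wx = (xs - x0)[None, :]                    # fractional offsets per col
    f00 = mask[np.ix_(y0, x0)]; f10 = mask[np.ix_(y0, x1)]
    f01 = mask[np.ix_(y1, x0)]; f11 = mask[np.ix_(y1, x1)]
    # f(x,y) = f00(1-x)(1-y) + f10 x(1-y) + f01(1-x)y + f11 xy, vectorised
    return (f00 * (1 - wy) * (1 - wx) + f10 * (1 - wy) * wx
            + f01 * wy * (1 - wx) + f11 * wy * wx)
```

Note the output is real-valued; a binary mask is recovered by thresholding at 0.5 where needed.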
Step B, principal component analysis coding: the given information-redundant two-dimensional mask M ∈ {0, 1}^(28×28) is encoded as a mask vector u ∈ R^60 with compact features. The principle is to minimize the information loss:
u = TM;
T* = argmin_T ||M - TᵀTM||², subject to TTᵀ = I_60;
W = Tᵀ;
where matrix T is used to encode the two-dimensional mask as u and matrix W is used to decode the mask vector u as M.
The whole process is realized by using a principal component analysis method, and the specific flow is shown in figure 2.
Step C, constructing a single-stage instance segmenter: the segmenter is based on the single-stage object detector FCOS. The original FCOS consists of five parts: one backbone network for feature extraction, one FPN layer for feature improvement, and three multi-task heads for classification, localization and center-ness prediction, respectively. On this basis, the following modifications are made:
1) adding a branch consisting of 4 convolutions for learning mask vectors;
2) the last ordinary convolution of every head is replaced by a deformable convolution to enlarge the receptive field and extract more effective features.
The whole training consists of a detection loss and a mask loss:
Loss_total = λ_det · Loss_det + λ_mask · Loss_mask
where the detection loss is composed of a classification loss, a localization loss and a center-ness loss:
Loss_cls = -α(1-p)^γ log p
Loss_reg = 1 - GIoU
Loss_ctr = -log x
and the mask loss is the mean squared error between the predicted and encoded mask vectors:
Loss_mask = (1/N) · Σ_i (u_i - û_i)²
where λ_det = λ_mask = 1, α = 0.25, γ = 2.
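Numerically each term is simple; a Python sketch of the loss pieces (scalar versions for a single positive sample; the helper names are ours, and these are stand-ins for the full FCOS losses, not the patent's training code):

```python
import numpy as np

def focal_cls_loss(p, alpha=0.25, gamma=2.0):
    """Loss_cls = -alpha * (1 - p)^gamma * log(p) for a positive sample."""
    return -alpha * (1.0 - p) ** gamma * np.log(p)

def giou_reg_loss(giou):
    """Loss_reg = 1 - GIoU."""
    return 1.0 - giou

def mask_mse_loss(u_pred, u_gt):
    """Loss_mask: elementwise MSE between predicted and encoded mask vectors."""
    u_pred, u_gt = np.asarray(u_pred), np.asarray(u_gt)
    return np.mean((u_pred - u_gt) ** 2)

def total_loss(loss_det, loss_mask, lam_det=1.0, lam_mask=1.0):
    """Loss_total = lambda_det * Loss_det + lambda_mask * Loss_mask."""
    return lam_det * loss_det + lam_mask * loss_mask
```

All four terms vanish for a perfect prediction (p = 1, GIoU = 1, identical mask vectors), which is an easy check on the signs.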
The whole network structure is shown in fig. 3.
Step D, target detection: a picture is input, the model outputs a large number of potential detection boxes at once, the targets are sorted in descending order of classification score, and non-maximum suppression is performed according to the intersection-over-union between target boxes. Finally, the top-K targets are retained.
Step E, mask vector prediction: in parallel with step D, a picture is input and the model predicts a group of mask vectors for each potential instance; after non-maximum suppression, the K top-scoring targets among the remaining candidates are selected for subsequent operations.
Step F, principal component analysis decoding: the mask vector u is obtained from step E, the two-dimensional mask is reconstructed using the decoding matrix W stored in step B, and the mask is interpolated to the corresponding size according to the detection box obtained in step D, completing the instance segmentation task.
As can be seen from fig. 4, the mask encoding can greatly compress the two-dimensional mask on the premise of saving most of the valid information. Besides, the whole encoding and decoding process only involves matrix operation, and is very simple and efficient. The method also has good generalization, can be embedded into any existing single-stage detector, introduces a small amount of parameters and calculation, further realizes an end-to-end dense instance segmentation task, and gives consideration to precision and speed.
As can be seen from FIG. 5, the method can accurately complete the task of image instance segmentation.
In this brief description, reference has been made to various embodiments. A description of a feature or function in connection with an example indicates that such feature or function is present in the example. The use of the terms "example" or "such as" or "may" in this document means that these features or functions are present in at least the described examples, whether or not explicitly described as examples, and that they may be present in some or all of the other examples, but are not necessarily present in these examples. Thus, "an example," "e.g.," or "may" refers to a particular instance of a class of examples. The attributes of an instance may be attributes of the instance only, or attributes of a class, or attributes of a subclass of a class that includes some, but not all, of the instances in the class. Thus, it is implicitly disclosed that features described with reference to one example but not with reference to another example may be used in that other example where possible, but are not necessarily used in that other example.
Although embodiments of the present invention have been described in the preceding paragraphs with reference to various embodiments, it should be appreciated that modifications to the embodiments given can be made without departing from the scope of the invention as claimed.
Features described in the foregoing description may be used in combinations other than the combinations explicitly described.
Although functions have been described with reference to certain features, those functions may be performed by other features whether described or not.
Although features have been described with reference to certain embodiments, such features may also be present in other embodiments whether described or not.
Whilst endeavoring in the foregoing specification to draw attention to those features of the invention believed to be of particular importance it should be understood that the applicant claims protection in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings, whether or not particular reference is made to the features or features illustrated in the drawings.

Claims (10)

1. A mask coding-based single-stage instance segmentation method comprises a training stage model and a prediction stage model, and is characterized by comprising the following steps:
step A, bilinear interpolation: removing the category information from the original mask labels given in different sizes to obtain a category-independent two-dimensional mask of the same size, M ∈ {0, 1}^(H×W), where H and W are the height and width of the two-dimensional mask, respectively;
step B, principal component analysis coding: given the information redundancy of the two-dimensional mask M ∈ {0, 1}^(H×W), the two-dimensional mask M is coded and compressed by orthogonal transformation, using the principal component analysis method, into a vector u ∈ R^N, where N is the dimension of the mask vector;
and step C, constructing a single-stage instance divider: the single-stage example divider is based on a single-stage target detector FCOS and is modified correspondingly;
step D, non-maximum suppression post-processing: the prediction stage model outputs a certain number of potential targets, which are sorted in descending order of classification score; non-maximum suppression is performed according to the intersection-over-union between target boxes, and the K top-scoring targets among the remaining candidates are selected for subsequent operations, yielding K target vectors;
step E, mask coding and decoding: the K target vectors obtained in step D are decoded into masks, finally realizing image instance segmentation.
2. The mask-coding-based single-stage instance partitioning method according to claim 1, wherein: the formula for bilinear interpolation is:
f(x,y)=f(0,0)(1-x)(1-y)+f(1,0)x(1-y)+f(0,1)(1-x)y+f(1,1)xy。
3. the mask-coding-based single-stage instance partitioning method according to claim 1, wherein: the formula of the principal component analysis coding in the step B is as follows:
u=TM;
T* = argmin_T ||M - TᵀTM||², subject to TTᵀ = I_N;
W = Tᵀ;
where matrix T is used to encode the two-dimensional mask as u and matrix W is used to decode the mask vector u as M.
4. The mask-coding-based single-stage instance partitioning method according to claim 1, wherein: step C includes step C1: and the extension branch is used for learning the mask vector by adding a convolution network of one branch, so that the mask vector prediction task is realized.
5. The mask-coding-based single-stage instance partitioning method according to claim 1, wherein: step C includes step C2: mask vector training based on MSE loss function.
6. The mask-coding-based single-stage instance partitioning method according to claim 1, wherein: step D includes step D1, target detection: inputting pictures, outputting a certain number of potential detection frames by the model once, arranging the targets in a descending order according to the classification scores, inhibiting non-maximum values according to the intersection ratio among the target frames, and reserving K targets with scores ranked ahead for subsequent operation.
7. The mask-coding-based single-stage instance partitioning method according to claim 1, wherein: step D includes step D2, mask vector prediction: inputting pictures, predicting a group of mask vectors for each potential instance by the model, and selecting K targets with top scores from the rest candidates for subsequent operation after non-maximum suppression processing.
8. The mask-coding-based single-stage instance partitioning method according to claim 1, wherein: and E, the mask coding and decoding are principal component analysis mask coding and decoding, a mask vector u is obtained from D2, the decoding matrix W stored in the step B is used for reconstructing the two-dimensional mask, the two-dimensional mask is interpolated to a corresponding size according to the size of the detection frame obtained in the step D1, and the example segmentation task is completed.
9. An apparatus for mask coding based single-stage instance partitioning, comprising a processor and a memory, wherein:
a computer program stored on the memory and executable on the processor;
the computer program, when being executed by a processor, implementing the steps of the instance splitting method as claimed in any one of claims 1-8.
10. A computer-readable storage medium characterized by:
the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the example segmentation method as claimed in any one of claims 1 to 8.
CN202010747003.7A 2020-07-29 2020-07-29 Mask coding-based single-stage instance segmentation method Pending CN111881981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010747003.7A CN111881981A (en) 2020-07-29 2020-07-29 Mask coding-based single-stage instance segmentation method


Publications (1)

Publication Number Publication Date
CN111881981A (en) 2020-11-03

Family

ID=73201235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010747003.7A Pending CN111881981A (en) 2020-07-29 2020-07-29 Mask coding-based single-stage instance segmentation method

Country Status (1)

Country Link
CN (1) CN111881981A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508029A (en) * 2020-12-03 2021-03-16 苏州科本信息技术有限公司 Instance segmentation method based on target box labeling

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6212235B1 (en) * 1996-04-19 2001-04-03 Nokia Mobile Phones Ltd. Video encoder and decoder using motion-based segmentation and merging
CN106778928A (en) * 2016-12-21 2017-05-31 广州华多网络科技有限公司 Image processing method and device
CN110675403A (en) * 2019-08-30 2020-01-10 电子科技大学 Multi-instance image segmentation method based on coding auxiliary information
CN111062402A (en) * 2018-10-16 2020-04-24 三星电子株式会社 Convolutional neural network for object detection
CN111147862A (en) * 2020-01-03 2020-05-12 南京大学 End-to-end image compression method based on target coding


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
小苑同学: "Mask Encoding for Single Shot Instance Segmentation", Retrieved from the Internet <URL:https://blog.csdn.net/yuansiming0920/article/details/107066593?spm=1001.2101.3001.6650.3&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-3-107066593-blog-106082209.235%5Ev38%5Epc_relevant_sort_base2&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-3-107066593-blog-106082209.235%5Ev38%5Epc_relevant_sort_base2&utm_relevant_index=4> *
Zhang Rufeng (章如锋): "Mask Encoding for Single Shot Instance Segmentation", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, pages 10223-10232 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201103