CN110111340B

CN110111340B - Weak supervision example segmentation method based on multi-path segmentation

Info

Publication number: CN110111340B
Application number: CN201910347532.5A
Authority: CN
Inventors: 程明明; 刘云; 吴宇寰
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2019-04-28
Filing date: 2019-04-28
Publication date: 2021-05-14
Anticipated expiration: 2039-04-28
Also published as: CN110111340A

Abstract

A weak supervision example segmentation method based on multi-path segmentation is disclosed. The method trains a convolutional neural network for instance segmentation using only image-level annotation data. Specifically, a training set only with image level labels is given, and a plurality of object recommendation areas irrelevant to the category are calculated for each image by a simulation sampling algorithm; and then, taking the image and the corresponding object recommendation area as input, taking the labeled image category as a learning target, and calculating category probability distribution and semantic features of each object recommendation area through a multi-instance learning framework. Establishing a large-scale graph model by taking object recommendation areas in the whole data set as nodes, regarding the graph model as a multi-path segmentation problem, and giving a category label to each object recommendation area as a result by a segmentation result; or as a training set to train any convolutional neural network for instance segmentation. Experiments show that the method is obviously superior to the existing weak supervision example segmentation method.

Description

Weak supervision example segmentation method based on multi-path segmentation

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a method for partitioning a weak supervision instance based on multi-way partitioning.

Background

Example segmentation is directed to segmenting each object in an image separately and identifying the class of the object. Based on the enormous value of business and academia, instance segmentation is an important task in computer vision. Recent example segmentation techniques have advanced primarily from some basic models based on convolutional neural networks, such as Fast R-CNN proposed by Ross Girshick at the ICCV conference 2015, Fast R-CNN proposed by Shaoqing Ren et al at the NIPS conference 2015, and Mask R-CNN proposed by Kaiming He et al at the ICCV conference 2017. However, these deep learning models rely heavily on a large amount of training data, which are all labeled with instances of objects at the pixel level. It is time consuming to label an image from the pixel level and collecting this much data is therefore a very expensive matter.

To reduce the need for pixel-level labeling data, some research efforts have trained the example segmentation model with object labeling boxes as the supervised information. Anna Khoreva et al published in 2017 on the CVPR conference, "Simple dots it: week superior instance and segmentation" paper used a modified version of the GrabCut algorithm to estimate object segmentations in the object marker box, and then used the MCG algorithm to correct these instance segmentations. The paper "weak-and semi-pervised corporate segmentation" published by Qizhu Li et al at the 2018 ECCV conference extends the method of Anna Khoreva et al, which uses an iterative approach to correct the estimated segmentation of the instances. Specifically, they first train the network by obtaining the initial example segmentation similar to Anna Khoreva, and then retrain the network by using the prediction result after the network training as a new segmentation estimate, and so on for several iterations to obtain the final result.

However, marking a large number of object frames is still time and labor consuming, and other tasks requiring object marking frames as supervisory information, such as object detection, have all begun to seek strategies for weakly supervised learning. Therefore, Yanzhao Zhou et al further relaxes the supervision information to image-level labeling in the "weak Supervised Segmentation using Class Peak Response" paper published on the CVPR in 2018, i.e. training the example Segmentation model using only images with Class labels as training data. They propose a new concept, "similar response peak value", that is, when the provided picture is used to train the image classification model, the convolution neural network has a larger response peak value on each object through a certain process, so as to obtain the approximate position of the object, and then the object recommendation area calculated by the analog sampling is combined to obtain the example segmentation result.

Disclosure of Invention

The invention aims to solve the technical problem of the prior example segmentation technology that a large amount of training data labeled at the pixel level is needed, and provides a method for weakly supervised example segmentation based on multi-way segmentation. The method only needs to provide pictures with category labels, and an example segmentation model can be learned.

In order to achieve the above purpose, the present invention firstly designs a multi-instance learning framework, which takes images and corresponding similarity sampling results as input, takes image categories as learning targets, and a trained model can calculate probability distribution and semantic features of each object recommendation region for one input image. Based on the probability distribution and semantic features, a multi-segmentation problem is constructed, and a correct class label is assigned to each object recommendation area.

The invention provides a weak supervision instance segmentation method based on multi-path segmentation, which comprises the following steps:

a. given a data set comprising a training set and a test set, each image in the training set has image-level labels, the method uses a general similarity sampling algorithm to generate object recommendation regions for each image in the data set, wherein the object recommendation regions may include objects of a target class, and the object recommendation regions may or may not include objects of the target class (i.e. background); and, these object recommendation areas are not category labeled, but only indicate that these areas may contain objects of the target category.

b. A multi-instance learning framework based on object recommendation areas is designed, the images and the corresponding object recommendation areas are used as input, the labeled categories of the images are used as learning targets, and the category probability distribution and semantic information can be learned and calculated for each object recommendation area through a loss function of the designed multi-instance learning framework.

The multi-instance learning framework based on the object recommendation regions designs a convolutional neural network model shown in fig. 2, so that the model can predict a probability distribution for each object recommendation region, and according to the probability distribution, the class labels of the images can be used as the supervision target of each object recommendation region. The loss function of the multi-instance learning framework based on the object recommendation region is composed of three parts, namely an attention loss function, a multi-instance learning loss function and a clustering center loss function, wherein the first two loss functions are mainly used for learning category information, and the clustering center loss function is used for learning semantic features of the object recommendation region.

c. And c, establishing a large-scale graph model by using the class probability distribution and the semantic information of the object recommendation areas calculated in the step b and using the object recommendation areas in the whole data set as nodes, regarding the graph model as a large-scale multi-segmentation problem, and giving a class mark to each object recommendation area by using a segmentation result.

Specifically, each object recommendation area is regarded as a node of the graph, each target class is regarded as a vertex of the graph, the distance from one node to the edge of one vertex is the predicted class probability, the distance between two nodes is the cosine value of the included angle between the semantic feature vectors of the two nodes, and the distance between two vertices is infinite. The goal of the multi-cut is to divide the entire graph into several subsets, with one and only one vertex in each subset, and with each node belonging to one and only one subset. It is impractical to solve this large-scale multi-segmentation problem, however, it can be decomposed into several small-scale multi-segmentation problems by limiting the maximum number of edges connected to each node. For each small-scale multi-segmentation problem, the collection of their solutions is the solution of the large graph. And dividing each node representing the object recommendation area into a subset by multi-path division, wherein the category corresponding to the vertex contained in the subset is the category of the object recommendation area.

d. Deleting the object recommendation areas marked as backgrounds in the step c, and taking the rest object recommendation areas and the corresponding category marks as segmentation results; any convolutional neural network for example segmentation can also be trained by using the remaining object recommendation regions as training data, and the network after training can be used for example segmentation of the image.

Advantages and advantageous effects of the invention

The method can simultaneously calculate the probability distribution and the semantic features of an object recommendation area through a multi-instance learning framework, and finally establishes a multi-path segmentation problem by using the probability distribution and the semantic features. This can be done in conjunction with information on the object instance, image, and entire data set to filter out unwanted object recommendation areas, preserve the correct object recommendation areas and assign category labels. This is more robust and accurate than the attention model based image classification networks.

Drawings

FIG. 1 is an overall flow chart of the present invention.

Fig. 2 is a convolutional neural network in the proposed multi-instance learning framework.

FIG. 3 is a comparison of experimental results of the present invention and related methods.

Fig. 4 shows several sets of exemplary results of the present invention. The first and fourth lines are the original input image, the second and fifth lines are the correct segmentations, and the third and sixth lines are the results of the method of the invention output, and a segmentation mask of the results is drawn into the artwork for viewing.

Detailed Description

The following describes in further detail embodiments of the present invention with reference to the accompanying drawings. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

The weak supervision example segmentation method based on multi-path segmentation specifically comprises the following operations:

a. the network model is a Convolutional neural network model with multi-example learning of object recommendation region pooling, wherein the feature extraction part can be a VGG16 framework mentioned in a 'Very Deep conditional Networks for Large-Scale Image registration' article published by Karen Simoyan, or a ResNet framework mentioned in a 'Deep residual learning for Image registration' article published by Kaming He, or other basic network architectures. For the ResNet-50 network, we add a region of interest pooling module to the last module of ResNet (before global averaging pooling), as shown in FIG. 2. The interest region pooling module inputs a plurality of object recommendation region frames obtained by analog sampling, then cuts out region features with the same position and size of the object recommendation region frames from the feature map, and performs maximum pooling sampling on the region features, wherein each module can obtain a feature map of 7x7 with the same channel number as that of the input feature map. Therefore, after the feature map is extracted by ResNet, we input the object recommendation area boxes into the module, and for each recommendation box, obtain 7 × 7 feature maps with the same number of channels (2048 channels) as the input feature map. And finally, after a layer of global average pooling, each recommendation box obtains a feature vector with a corresponding 2048 dimension. Inputting the feature vectors into a full-connection layer of 21 neurons and passing through a softmax layer, wherein each recommendation box corresponds to a 21-dimensional probability feature vector, and the i-th 2048-dimensional feature vector is recorded as f_iThe ith 21-dimensional probability feature vector is p_i。

b. For f obtained in a_iAnd p_iAnd the multiple instance learning framework in fig. 1, we propose several loss functions as joint supervision. The first is the loss of network attention, we calculated a class attention map using the CAM method proposed by Bolei Zhou et al in "Learning Deep Features for cognitive Localization" published in the 2016 CVPR conference, setting the normalization of the ith image to [0,1 [ ], C]The kth class of attention is sought as

The jth recommended frame in the ith image is marked as the attention category of the jth recommended frame

Setting:

the attention score of category k of box j is recommended for the ith picture,

can be calculated from the following formula:

attention loss function

Can be calculated from the following formula:

wherein, | S_i| is the total number of recommended boxes,

and

respectively predicting the jth recommendation frame in the ith image as a category

And K', K being the total number of target classes. After the attention loss function is given, a multi-instance learning loss function is proposed, a feature map intercepted by using a log-sum-exp function for all recommendation boxes of the ith image is used, and probability vectors of each category under all the recommendation boxes are estimated

Maximum value of (1) is

Is the probability estimation value of the kth' category of the ith image,

can be calculated from the following equation:

where r is a parameter of the log-sum-exp function, where r is 5, such that the function is the maximum of the estimated input vector. Probability estimation value of k' th category of estimated ith image

Later, multi-instance learning loss function

Can be calculated from the following formula:

wherein Y is_i′Is a positive example of the category of the case,

are negative case classes, which are mutually exclusive. After introducing the multiple instance learning penalty function, we present a third penalty function as follows: a cluster center loss function based on multi-instance learning. Cluster center loss function

The calculation is obtained by the following two formulas:

wherein

The category corresponding to the maximum probability of the jth recommendation box of the ith image is shown,

2048-dimensional feature vectors corresponding to the same recommendation box,

is a category

Will change slowly as training progresses:

wherein

Obtained for the last iteration

Calculated for this iteration

θ is a parameter of the update speed, and we use θ as 0.01. After introducing the three loss functions we propose, we use the fusion of these loss functions as the final loss function:

where α, β, γ are the weights of the three loss functions, respectively, where we use α ═ 0.5, β ═ 0.5, and γ ═ 0.1. In the above, we input pictures and their corresponding recommendation boxes derived from the similarity sampling method into the multi-instance learning framework and use L⁽ⁱ⁾Supervised training is performed as a loss function.

c. And c, after the training is finished by the method in the step b, inputting the pictures and the recommendation boxes into the frame, and obtaining the feature vector and the category probability vector corresponding to each recommendation box of each picture. With them, we can build a knowledge undirected graph. And (V, E) is an undirected graph, V represents a node set, and E represents an edge set. Recommendation box

And a set of object classes

Will act as a node, so:

wherein S is_iRepresenting a set of bounding boxes in the ith image. Is provided with

For terminals (terminals), the capacity E (u, v) ═ E (v, u) of the edge uv ∈ E is:

by using the above formula, we have established an undirected graph. It should be noted that we only reserve three edges of maximum capacity for each node except the terminal node when we build the graph.

An undirected graph has been built as above and is subsequently multi-sliced. I.e. solving an optimization problem:

wherein the content of the first and second substances,

||·||₁is the 1-norm of the vector. When the optimization problem is solved, the maximum number of edges connected with each node is limited to be 3, namely three edges with the maximum weight, so that an undirected graph to be solved is converted into sub-graphs of a plurality of connected domains, and the sub-graphs G are_t＝(V_t,E_t) Are not communicated with each other, and:

we solve the above optimization problem independently under each subgraph. For each subgraph, a multiple cut D can be obtained_tAnd, and:

∪_tD_t＝D，

where D is a multiple cut of G.

d. In the multi-way cutting in the step c, each node representing the object recommendation area is divided into a subset, and the category corresponding to the vertex contained in the subset is the category of the object recommendation area. Deleting the object recommendation areas marked as backgrounds, and taking the remaining object recommendation areas and the corresponding category marks as segmentation results; any convolutional neural network for example segmentation can also be trained by using the remaining object recommendation regions as training sets, and the network after training can perform example segmentation on the image.

FIG. 3 shows a comparison of our method with other methods. mAP_0.5 ^rAnd mAP_0.75 ^rThe average precision of the class-wise averages at thresholds of 0.5 and 0.75, respectively, is indicated, and ABO indicates the average best coverage. The CAM method is a method proposed by Bolei Zhou et al in "Learning Deep Features for cognitive Localization" published by CVPR conference 2016, and SPN is Zhu Yi et al in 2017The method proposed in "Soft porous networks for week super Object localization" published in ICCV conference, MELM is the method proposed in "Min-enhanced tension Model for week super Object Detection" published in 2018 CVPR by Fan et al, PRM is the method proposed in "Weakly super Object localization using Class Peak Response" published in 2018 CVPR by Yanzhao Zhou et al. LIID is the method we propose. Representative uses a block overlay attention map, Ellipse representative uses an attention map, and MCG representative uses the method mentioned in PabloArbelaez et al, "Multiscale composite group," published by CVPR conference 2014 to overlay attention maps. It was found that our method is superior to these methods in all indications.

FIG. 4 is 10 example graphs of example segmentation results obtained using our method. The first and fourth lines are the original input image, the second and fifth lines are the correct segmentations, and the third and sixth lines are the results of the method of the invention output, and a segmentation mask of the results is drawn into the artwork for viewing.

The top picture of each example graph is the original picture, the middle picture is the human labeled reference result, and the bottom is the result generated by our method.

Claims

1. A weak supervision example segmentation method based on multi-path segmentation is characterized by comprising the following steps:

a. given a data set comprising a training set and a test set, wherein each image in the training set has an image-level label, and an object recommendation area which possibly comprises an object of a target class is generated for each image in the data set by using an analog sampling algorithm;

b. designing a multi-instance learning framework based on an object recommendation region, wherein the multi-instance learning framework takes an image and a corresponding object recommendation region as input, takes a mark category of the image as a learning target, and designs a multi-instance learning loss function to learn and calculate category probability distribution and semantic information for each object recommendation region;

c. b, using the class probability distribution and semantic information of the object recommendation areas calculated in the step b, and establishing a large-scale graph model by using the object recommendation areas in the whole data set as nodes, specifically, taking each object recommendation area as a node of the graph, taking each target class as a vertex of the graph, wherein the distance from one node to the edge of one vertex is the predicted class probability, the distance between the edges of two nodes is the cosine value of the included angle between the semantic feature vectors of the two nodes, and the distance between the two vertices is infinite; regarding the graph model as a large-scale multi-path segmentation problem, and giving a category label to each object recommendation area by a segmentation result;

d. deleting the object recommendation areas marked as the background in the step c, and taking the rest object recommendation areas and the corresponding class marks as segmentation results; or training any convolution neural network for example segmentation by using the rest object recommendation area as training data, wherein the network after training is used for example segmentation of the image.

2. The method of multi-way segmentation-based weakly supervised instance segmentation as recited in claim 1, wherein: the multi-instance learning framework based on the object recommendation areas designs a convolutional neural network model, so that the model can predict a probability distribution for each object recommendation area, and class marks of images are used as supervision targets of each object recommendation area.

3. The method of multi-way segmentation-based weakly supervised instance segmentation as recited in claim 1, wherein: the loss function of the multi-instance learning framework based on the object recommendation area is composed of three parts, namely an attention loss function, a multi-instance learning loss function and a clustering center loss function, wherein the first two loss functions are used for learning category information, and the clustering center loss function is used for learning semantic features of the object recommendation area.

4. The multi-way cut-based weakly supervised instance partitioning party of claim 3The method is characterized in that: said attention loss function

Calculated from the following formula:

wherein, | S_i| is the total number of recommended boxes,

and

And K', K being the total number of target classes;

multiple instance learning loss function

Calculated from the following formula:

wherein Y is_i′Is a positive example of the category of the case,

is a negative case category, the two categories are mutually exclusive,

the probability estimated value of the ith image belonging to the kth' category;

cluster center loss function

The calculation is obtained by the following two formulas:

wherein

2048-dimensional feature vectors corresponding to the same recommendation box,

is a category

Is a statistical feature vector of | · |₂Represents the 2-norm, | S of the vector_iAnd | is the total number of recommended boxes.

5. The method of multi-way segmentation-based weakly supervised instance segmentation as recited in claim 4, wherein: the loss function of the multi-instance learning framework is finally expressed by fusing an attention loss function, a multi-instance learning loss function and a clustering center loss function as follows:

where α, β, γ are the weights of the three loss functions, respectively.

6. The method of multi-way segmentation-based weakly supervised instance segmentation as recited in claim 1, wherein: the large-scale multi-segmentation problem decomposes the large-scale multi-segmentation problem into several small-scale multi-segmentation problems by limiting the maximum number of edges connected to each node.