CN115393580A - Weak supervision instance segmentation method based on peak value mining and filtering - Google Patents

Weak supervision instance segmentation method based on peak value mining and filtering

Info

Publication number
CN115393580A
CN115393580A
Authority
CN
China
Prior art keywords
image
category
segmentation
peak
peak value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110565129.7A
Other languages
Chinese (zh)
Inventor
Wu Gangshan
Pan Dongsheng
Huang Zuxian
Wang Limin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110565129.7A priority Critical patent/CN115393580A/en
Publication of CN115393580A publication Critical patent/CN115393580A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A weakly supervised instance segmentation method based on peak mining and filtering comprises four stages: 1) a sample processing stage; 2) a network configuration stage; 3) a training stage; and 4) a testing stage. The peak mining and filtering strategy designed by the invention introduces feature fusion, adversarial erasure and cluster analysis to enhance the diversity and integrity of the peak response maps, and retrieves more accurate segmentation masks through iterative retrieval and confidence updating. Compared with existing weakly supervised instance segmentation methods, the proposed algorithm segments objects of different sizes in an image more completely and accurately, thereby effectively improving instance segmentation precision.

Description

Weak supervision instance segmentation method based on peak value mining and filtering
Technical Field
The invention belongs to the technical field of computer software, relates to weakly supervised instance segmentation, and particularly relates to a weakly supervised instance segmentation method based on peak mining and filtering.
Background
In the field of computer vision, conventional supervised learning requires lengthy dataset annotation that consumes resources and time, whereas weakly supervised learning reduces this cost by weakening the supervision strength of the annotation. The instance segmentation task aims to accurately segment all foreground objects in an image: it must not only predict the semantic category of each pixel but also decide which individual object each pixel belongs to. The goal of weakly supervised instance segmentation is therefore to accomplish a task that normally requires pixel-level annotation using weakened annotation only.
For the instance segmentation task, image-level category annotation is conducive to building large-scale datasets compared with pixel-level annotation, but it only provides the semantic categories of the objects present in an image and gives no direct information about their positions, shapes or number. Current weakly supervised instance segmentation algorithms based on image-level category labels fall into two main types: detection-based and segmentation-based. For the former, the PRM (Peak Response Map) algorithm proposes the concept of peak activation for object localization and classification, uses network gradients to compute probability dependencies so as to obtain the response of each peak on the original image, and then retrieves a candidate mask as the segmentation mask using the peak response map. The IAM algorithm uses an encoder-decoder network to learn the filling relation between peak response maps and segmentation masks, improving segmentation quality and model test speed. The CountSeg algorithm incorporates supervision from the number of objects in the image, using peaks as supervision to obtain a density map of the objects, which helps retrieve candidate masks. The Label-PENet algorithm uses a curriculum-learning strategy so that models learn step by step through classification, detection and segmentation, with later models providing supervision for earlier ones. For the latter type, the IRN (Inter-pixel Relation Network) algorithm first learns the relevance between pixels to realize semantic segmentation, then distinguishes instances using the priors that the displacements from points on an object to its center sum to zero and that boundaries exist between points of different semantic categories.
Existing methods that use peaks as proxies for object detection have an advantage in precision, but they still lag far behind fully supervised algorithms, suffering from localization misses, classification errors and partial segmentation, which limits the final segmentation precision.
Disclosure of Invention
The invention aims to solve the following problems: how to supervise the training of a classification network with image-level category labels only, so that the model detects all objects in an image more comprehensively, classifies them more accurately and segments them more completely, while filtering redundant and low-quality segmentation results more precisely. In general, the response of a classification model to an object is limited to a locally salient region and cannot cover all objects fully and completely, i.e. under-activation; meanwhile, blurred segmentation boundaries and intruding background severely affect segmentation precision, i.e. over-activation. The design objective of the invention is therefore to introduce mining and filtering strategies to address under-activation and over-activation respectively, finally realizing more accurate weakly supervised instance segmentation.
The technical scheme of the invention is as follows: a weakly supervised instance segmentation method based on peak mining and filtering constructs an image classification network and an instance segmentation network and trains them under the supervision of image-level category labels only. The image classification network is trained first; it then produces the training data used to supervise the training of the instance segmentation network, which completes the instance segmentation.
The image classification network includes the following configurations:
1) feature fusion: ResNet50 is used as the backbone network to extract image features, and feature maps of different layers and sizes in the network are fused to generate a feature map with richer representation and semantic information;
2) adversarial erasure: the feature map is reduced in dimension and converted into a class activation map; the salient regions of the class activation map corresponding to the ground-truth categories are erased from the feature map and filled with the feature mean, and the filled feature map is activated again by a convolution layer to expand the semantic response region; the two branches before and after erasure use two classifiers with different parameters for image recognition, each outputting a class activation map;
3) peak activation: peaks are used to localize and classify the objects in the image, each locally salient peak representing an individual object; a locally most salient peak on the class activation map of a semantic category indicates that an object of that category exists at that position; local peaks are obtained by local max pooling and supervised by the semantic annotation, and a max-pooling layer is applied to the class activation maps obtained before and after erasure in step 2), yielding a peak list activated from each branch;
4) a filtering module: a classifier, distinct from the pre- and post-erasure classifiers and the peak activation layer, is attached to the feature map obtained in step 1) to judge the semantic category of objects independently.
The instance segmentation network builds on the image classification network: cluster analysis and iterative retrieval are performed on the pre- and post-erasure branch responses, and the filtering module is then used to discard segmentation masks that do not meet the confidence requirement, yielding the instance segmentation results. Specifically: the image classification network is trained on a training set; for a test image with a class-agnostic candidate segmentation mask set to be instance-segmented, the network computes probability dependencies from its gradients to obtain peak response maps, i.e. the responses of peaks on the original image; cluster analysis based on the deep features of the peaks merges different peak response maps coming from the same object; during instance segmentation, the peak response maps iteratively retrieve the class-agnostic candidate mask set of the original image and the best match is selected as the segmentation mask of the object; the confidence is updated by combining the category and shape information of the object obtained from the filtering module to filter out low-quality masks; finally, the method is applied to the training images of the classification network to generate a dataset with pseudo pixel-level labels for the supervised training of a fully supervised instance segmentation algorithm.
Compared with the prior art, the invention has the following advantages.
The invention provides a weakly supervised instance segmentation method based on Peak Mining and Filtering (PMF). By mining peaks, the method segments the objects in an image more comprehensively and completely, and by adopting the filtering idea it suppresses the influence of low-quality segmentation masks, thereby significantly improving segmentation precision.
The designed method responds to more non-salient object regions through the mining strategy, integrates the semantic and shape information of objects through the filtering idea to generate more accurate segmentation masks, and produces a more refined pixel-level labeled dataset for a fully supervised instance segmentation algorithm. Compared with existing weakly supervised instance segmentation methods, it segments objects of different scales and categories well and effectively improves segmentation precision.
The method achieves good precision on the weakly supervised instance segmentation task with image-level category labels. Evaluated on a recognized dataset against existing methods, the proposed PMF method shows good performance and is even comparable to models trained with stronger supervision.
Drawings
FIG. 1 is a flowchart of the weakly supervised instance segmentation method based on peak mining and filtering.
FIG. 2 is a schematic diagram of the feature fusion based on ResNet50 in the network structure of the invention.
FIG. 3 is a schematic diagram of the adversarial erasure structure in the network structure of the invention.
FIG. 4 is a schematic flowchart of the cluster analysis based on peak features according to the invention.
FIG. 5 is a schematic flowchart of the iterative retrieval of segmentation masks according to the invention.
FIG. 6 is a schematic flowchart of the filtering of low-quality segmentation masks according to the invention.
FIG. 7 shows instance segmentation samples under the weakly supervised training scenario of the invention.
Detailed Description
Aiming at the problems of the prior art, the invention seeks to make a classification network respond to the objects in an image more comprehensively and completely and to filter redundant and coarse segmentation masks, thereby improving segmentation accuracy. To this end, it provides a weakly supervised instance segmentation method based on Peak Mining and Filtering, named PMF. The invention further proposes providing higher-precision pixel-level labels for a fully supervised instance segmentation algorithm, thereby realizing more accurate object segmentation.
The following is a description of the practice and effects of the present invention with reference to specific examples.
The weakly supervised instance segmentation method based on peak mining and filtering trains the classification network on the PASCAL VOC 2012 dataset using the training set with image-level category labels only, and evaluates instance segmentation performance on the training and validation sets with pixel-level labels, obtaining accurate segmentation precision. Specifically, the Python 3.6 programming language and the PyTorch 0.4 deep learning framework are used.
FIG. 1 is the flowchart of the weakly supervised instance segmentation method based on peak mining and filtering used by the invention, which takes the residual network ResNet50 as the backbone and accomplishes pixel-level instance segmentation using image-level category labels only. Training is performed twice: once for the classification network, and once for the segmentation network on the newly generated dataset. The method represents objects by locally salient peaks and their corresponding peak response maps, and realizes more comprehensive, complete and accurate segmentation of the objects in an image through peak mining and filtering. The whole method comprises a sample processing stage, a network configuration stage, a training stage and a testing stage, implemented as follows:
1) Sample processing stage: the training samples are RGB images provided with image-level category labels only; each image is resized to 448×448 as the network input, and horizontal-flip data augmentation with probability 0.5 is applied. The test samples are RGB images only, with network input size 448×448; meanwhile, the Multiscale Combinatorial Grouping algorithm (MCG) generates a candidate segmentation mask set for each test image, of which the 100 highest-scoring masks are kept.
2) Network configuration stage: the strategy based on peak mining and filtering trains only one classification network, which mainly comprises four parts: feature fusion, adversarial erasure, peak activation, and the filtering module.
2.1) Feature fusion: as shown in FIG. 2, ResNet50 with the final fully connected layer removed is used as the backbone for image feature extraction, where ResNet50 consists of layer0 with 1 convolution layer, layer1 with 9, layer2 with 12, layer3 with 18, and layer4 with 9 convolution layers. The network input is B×3×448×448, where B is the batch size, and the outputs of the successive layers have sizes F_0 ∈ R^(B×64×224×224), F_1 ∈ R^(B×256×112×112), F_2 ∈ R^(B×512×56×56), F_3 ∈ R^(B×1024×28×28) and F_4 ∈ R^(B×2048×14×14). The method fuses the features from layer3 and layer4: a convolution layer with kernel size 1×1 and stride 1 activates the layer3 feature map so that its channel count changes from 1024 to 2048, the layer4 output is upsampled by bilinear interpolation, and the two are added to obtain the final feature map F_last ∈ R^(B×2048×28×28).
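For reference, the following PyTorch sketch shows one way to realize this fusion step; class and variable names are illustrative assumptions, and loading ImageNet-pretrained weights here anticipates the training stage described below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FusedBackbone(nn.Module):
    """ResNet50 trunk whose layer3 and layer4 outputs are fused as in step 2.1)."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(pretrained=True)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)  # "layer0"
        self.layer1, self.layer2 = r.layer1, r.layer2
        self.layer3, self.layer4 = r.layer3, r.layer4
        # 1x1 convolution, stride 1: lift layer3 channels from 1024 to 2048
        self.lateral = nn.Conv2d(1024, 2048, kernel_size=1, stride=1)

    def forward(self, x):                        # x: B x 3 x 448 x 448
        x = self.layer2(self.layer1(self.stem(x)))
        f3 = self.layer3(x)                      # B x 1024 x 28 x 28
        f4 = self.layer4(f3)                     # B x 2048 x 14 x 14
        f4 = F.interpolate(f4, size=f3.shape[-2:], mode='bilinear',
                           align_corners=False)  # bilinear upsampling to 28 x 28
        return self.lateral(f3) + f4             # F_last: B x 2048 x 28 x 28
```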
2.2) Adversarial erasure: for each input sample, under the supervision of the same category labels, two classifier structures with different parameters perform image recognition. Specifically, as shown in FIG. 3, for the feature map F_last ∈ R^(B×2048×28×28) obtained after feature fusion in 2.1), the pre-erasure branch, denoted Branch_A, is activated by a convolution layer with kernel size 1×1 and stride 1, converting the channel count from 2048 to the number of semantic categories C; the network output is the class activation map M_A ∈ R^(B×C×28×28), where each channel represents the response to one semantic category in the image. The invention converts semantic categories into digital labels; for example, 20 foreground categories are represented by a 1×20 vector whose entry is 1 at the index of a category present in the image and 0 otherwise. During network training, the channels of M_A corresponding to the ground-truth categories are selected and a foreground saliency map M_s ∈ R^(B×1×28×28) is generated, with value space {0, 1}, where 1 denotes foreground. The i-th class activation map M_A^i is first normalized by dividing each element by the difference between the maximum and minimum elements of the map, and a threshold of 0.6 is then applied to obtain the saliency map M_s^i of the corresponding category. The saliency maps of the different ground-truth categories (predicted categories at test time) in the same image are merged during training to obtain the foreground saliency map of the whole image, and M_s is then used to erase the corresponding regions of F_last; the erased feature map is denoted F_erase. Only the regions where M_s is 0 are kept in F_last, while the regions where M_s is 1 may be filled by different methods such as zero, maximum or mean values; the invention fills them with the mean of F_last over the region where M_s is 0. Finally, the post-erasure branch, denoted Branch_B, is activated to obtain the class activation map M_B; specifically, a convolution layer with kernel size 1×1 and stride 1 converts the channel count of F_erase from 2048 to the number of semantic categories C. The two branch networks have the same structure but do not share parameters, and the network parameters are randomly initialized.
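A minimal sketch of this erase-and-fill operation follows, assuming cam_a is the pre-erasure class activation map M_A and labels the batch of multi-hot ground-truth vectors; the per-image loop favors clarity over speed, and the normalization follows the text (division by the max-min difference).

```python
def erase_features(f_last, cam_a, labels, thresh=0.6):
    """Erase salient foreground regions of F_last and fill them with the mean
    of the non-salient region, per image (sketch of step 2.2)."""
    f_erase = f_last.clone()
    for b in range(f_last.size(0)):
        fg = torch.zeros_like(cam_a[b, 0], dtype=torch.bool)
        for c in labels[b].nonzero(as_tuple=True)[0]:        # ground-truth classes
            m = cam_a[b, c]
            m = m / (m.max() - m.min() + 1e-8)               # normalize the map
            fg |= m > thresh                                 # class saliency -> M_s
        if fg.any() and (~fg).any():
            # channel-wise mean of the kept (M_s == 0) region fills the erased part
            mean = f_last[b][:, ~fg].mean(dim=1, keepdim=True)
            f_erase[b][:, fg] = mean
    return f_erase                                           # F_erase
```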
2.3) Peak activation: the invention localizes and classifies the objects in the image using peaks, a peak being a position whose value is the local maximum within the sampling window. Specifically, a max-pooling layer with window size 3×3 is applied to the class activation maps M_A and M_B obtained before and after erasure in 2.2), producing the peak lists P_A and P_B activated from the two branches. The calculation is:
G_c(x, y) = f(M_c)(x, y) = max_{(x′, y′) ∈ Ω(x, y)} M_c(x′, y′),
where Ω(x, y) is the 3×3 pooling window centered at (x, y), and a position is recorded as a peak of category c when M_c(x, y) = G_c(x, y); here G denotes the class activation map after peak activation, c, x, y denote the category and the region the kernel can access during pooling, f denotes the max-pooling function, and N_c denotes the number of peaks generated for category c. Each peak in P_A and P_B consists of its planar position x, y, its semantic category c, and its value M_c(x, y) on the pre- or post-erasure class activation map.
The network stores not only the pooled value of the class activation map, i.e. the peak, but also its position, represented by five parameters s, b, c, h, w: s is the value for the corresponding semantic category, b is the image index within the batch, c is the semantic category, h is the row (ordinate) of the position on the feature map plane, and w is the column (abscissa). The generated peak list is then filtered: for each image and each category, the median of the peaks on the feature map plane is used for screening. To effectively exploit the supervision of the image-level category labels, the retained peaks of each category from the pre- and post-erasure branches of each image are averaged, giving classification outputs S_A and S_B of size B×C, where B is the batch size and C is the number of semantic categories; S_A and S_B represent the classification scores of each image for the two branches.
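The sketch below illustrates peak extraction and scoring on one class activation map; stride 1 with padding for the 3×3 max pooling is an assumption made so that peaks keep their coordinates.

```python
def peak_activation(cam):
    """Find local 3x3 peaks on a class activation map (B x C x H x W), filter
    them by the per-class median, and return peaks plus class scores."""
    pooled = F.max_pool2d(cam, kernel_size=3, stride=1, padding=1)
    is_peak = cam == pooled                       # local maxima survive pooling
    B, C, _, _ = cam.shape
    scores = torch.zeros(B, C, device=cam.device)
    peaks = []                                    # records (s, b, c, h, w)
    for b in range(B):
        for c in range(C):
            mask = is_peak[b, c]
            if not mask.any():
                continue
            med = cam[b, c][mask].median()
            keep = mask & (cam[b, c] >= med)      # median filtering per class
            scores[b, c] = cam[b, c][keep].mean() # mean of retained peaks
            for h, w in zip(*keep.nonzero(as_tuple=True)):
                peaks.append((float(cam[b, c, h, w]), b, c, int(h), int(w)))
    return peaks, scores                          # peaks -> P_A/P_B, scores -> S_A/S_B
```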
2.4) Filtering module: a classifier of a different type is attached as a side branch. As shown in FIG. 6, the layer4 feature map mentioned in 2.1) is first pooled by average pooling; the pooled feature map is then flattened, i.e. the two-dimensional plane is converted into a one-dimensional vector; a fully connected layer with randomly initialized parameters performs the activation, and a Softmax layer computes the probability S_C of each category as the classifier output, with size B×C, where C is the number of semantic categories.
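A sketch of this side classifier; the 20-category default matches PASCAL VOC but is otherwise an assumption.

```python
class SideClassifier(nn.Module):
    """Filtering-module classifier: average pooling, flatten, one fully
    connected layer, Softmax over categories (sketch of step 2.4)."""
    def __init__(self, in_ch=2048, num_classes=20):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # Average Pooling
        self.fc = nn.Linear(in_ch, num_classes)       # randomly initialized

    def forward(self, f4):                            # f4: B x 2048 x 14 x 14
        z = self.pool(f4).flatten(1)                  # Flatten to B x 2048
        return torch.softmax(self.fc(z), dim=1)       # S_C: B x C
```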
The network configuration stage is illustrated below with an example. The structure of ResNet50 without its FC layer is used as the backbone and randomly initialized; feature extraction on the input image then yields feature maps of sizes B×64×224×224, B×256×112×112, B×512×56×56, B×1024×28×28 and B×2048×14×14 from layer0 through layer4 respectively. A convolution layer with kernel size 1×1 raises the channel count of the layer3 feature map to 2048 while the layer4 feature map is upsampled by bilinear interpolation to plane size 28×28, and the two feature maps are added. The network branches before and after erasure are activated with 1×1 convolution layers to generate class activation maps, converting the channel count to the number of semantic categories. From the class activation map obtained before erasure, the foreground saliency region M_s ∈ R^(B×1×28×28) is extracted according to the category labels; the feature map is erased according to the foreground region of M_s and filled with the mean of the non-salient region, forming the input features of the post-erasure branch. The class activation maps generated by the two branches, M_A, M_B ∈ R^(B×C×28×28), both pass through a max-pooling layer with kernel size 3×3 and are median-filtered per category, yielding the peak lists P_A and P_B; the retained peaks are then averaged per category to obtain S_A, S_B ∈ R^(B×C). Meanwhile, the layer4 feature map passes through a mean-pooling layer, a fully connected layer and a Softmax function, giving the probability S_C ∈ R^(B×C) of the image belonging to each category. The losses of S_A, S_B and S_C against the ground-truth category labels are each computed with the MultiLabelSoftMarginLoss loss function.
3) Training stage: the image classification network with its three classifier branches, whose outputs are image classification scores, is supervised and trained on the target dataset, i.e. the training set. S_A and S_B represent the per-category classification scores before and after erasure, their input features coming from before and after erasure respectively, while S_C takes the pre-erasure image features as classifier input and outputs the probability of the image belonging to each category. The 11540 training images of the PASCAL VOC dataset, provided with image-level category labels only, are used for model training. The backbone network is pre-trained on ImageNet, the batch size is 16, and network parameters are updated by back-propagation, specifically with a stochastic gradient descent optimizer (SGD) with momentum, where the momentum is set to 0.9, the learning rate to 0.01 and the weight decay to 0.0001. MultiLabelSoftMarginLoss is adopted as the loss function for the pre- and post-erasure branches and the filtering module. Training uses a single GTX 1080Ti GPU for 35 epochs in total.
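These hyper-parameters translate into a training loop of roughly the following shape; model and loader are placeholders for the configured network and the VOC data pipeline, and applying the same loss to all three outputs follows the text.

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                      weight_decay=0.0001)

for epoch in range(35):                      # 35 training epochs in total
    for images, labels in loader:            # labels: B x C multi-hot vectors
        s_a, s_b, s_c = model(images)        # scores of the three classifiers
        loss = (criterion(s_a, labels) + criterion(s_b, labels)
                + criterion(s_c, labels))
        optimizer.zero_grad()
        loss.backward()                      # back-propagation
        optimizer.step()
```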
4) Testing stage: the invention requires post-processing to generate the final clean, complete and accurate segmentation masks. The image to be instance-segmented is processed by the trained image classification network and then by the steps of the testing stage, and the segmentation masks are filtered using S_C to obtain the final segmentation result. The testing stage comprises four steps: peak response, cluster analysis, iterative retrieval, and the filtering strategy.
4.1) Peak response: the probability dependence of a peak between layers is computed through gradients in the network and propagated backward step by step to the corresponding region of the input image, yielding the peak response map. A peak alone can only localize and classify an object; to achieve complete object segmentation, the invention computes the response region of each peak on the input image according to its category and position on the class activation map. Specifically, for each peak P_i, gradients with respect to the input image are used to compute the probability dependence between layers, propagating backward from the class activation map to the corresponding response region on the input image, giving a peak response map R_i of size 1×448×448 whose pixel values lie in [0, 1] and sum to 1. For simplicity, consider an ordinary convolution layer with kernel W ∈ R^(h×w), where h and w are the kernel height and width. For a classification model trained to convergence with peak activation, the calculation is:
P(U_ij) = Σ_{p,q} P(U_ij | V_pq) · P(V_pq),
where U and V are the input and output of the convolution layer in the image classification network, i, j and p, q denote positions on the input and output feature planes, and P is the probability at each position. P(U_ij | V_pq) denotes the probability that the convolution layer obtains the input at position ij from the output at position pq, computed as:
P(U_ij | V_pq) = Z_pq · Û_ij · W⁺_(i−p)(j−q),
where Û_ij denotes the forward-propagated activation of the input U at position ij, W⁺ = ReLU(W) keeps only the non-negative convolution kernel parameters, i.e. only forward activations are counted, and Z_pq is a normalization factor ensuring Σ_{i,j} P(U_ij | V_pq) = 1. When the peak response map is computed, the convolution layers in the network are replaced by this calculation.
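For a single convolution layer, the rule above can be sketched as follows; this assumes stride 1 (so shapes line up without output padding), ignores groups and dilation, and is an illustration rather than the patented implementation.

```python
def prob_backprop_conv(conv, U, P_V):
    """One backward step of the probability propagation: redistribute the
    output probability map P_V onto input positions in proportion to
    U_hat * W+ (sketch; conv is an nn.Conv2d, U its forward input)."""
    W_plus = conv.weight.clamp(min=0)                    # W+ = ReLU(W)
    U_hat = U.clamp(min=0)                               # forward activations
    # Z: total positive evidence received by each output position
    Z = F.conv2d(U_hat, W_plus, stride=conv.stride,
                 padding=conv.padding) + 1e-12
    # the transposed convolution spreads P_V / Z back to input positions
    P_U = U_hat * F.conv_transpose2d(P_V / Z, W_plus, stride=conv.stride,
                                     padding=conv.padding)
    return P_U                                           # still sums to ~1
```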
4.2) Cluster analysis: the peak response map calculation of 4.1) is applied to the peak lists P_A and P_B generated by the pre- and post-erasure branches, i.e. a peak response map is computed from each peak in the lists. Peaks are first filtered with a simple threshold of 20, the threshold applying to the peak value itself, which reflects the activation strength of a category at that position; back-propagating the probability for each remaining peak yields the peak response map lists R_A and R_B of the two branches. The invention then performs cluster analysis on the peaks, grouping peaks from the same object by feature similarity. The flow is shown in FIG. 4: the feature vector at each peak position is extracted from the fused feature map plane, and clustering is realized by a spectral clustering algorithm refined with k-means; the feature vectors of the peaks activated by the pre-erasure branch serve as the initial cluster centers to guide the subsequent iterations, the number of clusters is set to the number of pre-erasure peaks, and the feature vectors of the peaks of both branches are clustered together.
In the spectral clustering algorithm, a similarity matrix W ∈ R^(n×n) between the peak vectors is first computed; the sum of each row of W gives the degree matrix D; the Laplacian matrix is computed as L = D − W; the eigenvalues of L are computed and sorted in ascending order, and the eigenvectors of the first k eigenvalues form a new matrix X ∈ R^(n×k) used as the input of the k-means clustering algorithm, which stops iterating when the assignments no longer change between rounds. The process outputs a cluster label for each peak.
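A compact NumPy/scikit-learn sketch of this clustering step; the Gaussian similarity kernel is an assumption (the text does not specify the similarity function), and init_centers holds the indices of the pre-erasure peaks used to seed k-means (its length must equal k).

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster_peaks(feats, k, init_centers):
    """Spectral clustering of n peak feature vectors with k-means refinement
    seeded by the pre-erasure peaks (sketch of step 4.2)."""
    d2 = ((feats[:, None] - feats[None, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (d2.mean() + 1e-12))        # similarity matrix, n x n
    D = np.diag(W.sum(axis=1))                   # degree matrix
    L = D - W                                    # Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    X = vecs[:, :k]                              # first k eigenvectors, n x k
    km = KMeans(n_clusters=k, init=X[init_centers], n_init=1).fit(X)
    return km.labels_                            # cluster label of each peak
```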
After clustering, the peak response maps of the peaks within each cluster are merged and re-normalized, i.e. the value at each pixel position is divided by the sum over all pixel positions. The more complete peak response maps generated by clustering are recorded as the list R_merge, which is combined with R_A and R_B to increase the diversity of the peak response maps.
4.3) Iterative retrieval: each peak response map is the response to one object in the image, but over-activation and under-activation are common, so the invention uses the peak response maps to retrieve the best match from the class-agnostic candidate segmentation mask set generated by the MCG algorithm as the segmentation mask of the corresponding object. The matching degree is computed as:
Score_ij = α · R_i * S_j + R_i * Ŝ_j − β · Q_i * S_j,
where Score_ij is the matching degree between the peak response map R_i and the class-agnostic candidate segmentation mask S_j, * denotes summing the element-wise product, Ŝ_j denotes the edge of the segmentation mask S_j, and Q_i denotes the semantic saliency region of the category of the peak response map, obtained by normalizing the corresponding channel plane of the class activation map and filtering it with a threshold of 0.5. α and β are hyper-parameters set to 0.73 and 1.9e-5 respectively. Following this retrieval process, the peak response map set R = {R_A, R_B} retrieves the corresponding segmentation masks.
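As a worked instance of the formula, with R, S, S_edge and Q as same-sized arrays (mask and edge binary, response and saliency real-valued):

```python
def match_score(R, S, S_edge, Q, alpha=0.73, beta=1.9e-5):
    """Matching degree Score_ij between one peak response map R and one
    candidate mask S; S_edge is the mask contour, Q the class saliency map."""
    return (alpha * (R * S).sum()      # response inside the mask
            + (R * S_edge).sum()       # response on the mask boundary
            - beta * (Q * S).sum())    # penalty from the saliency region
```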
The iterative retrieval process is shown in FIG. 5. The invention adopts the Non-Maximum Suppression (NMS) algorithm: the mask set is first sorted by matching degree Score_ij in descending order; the largest item in the current mask list is repeatedly selected, and the remaining masks of the same category whose intersection-over-union with it satisfies IoU ≥ 0.5 are filtered out while their corresponding peak response maps are fused, producing a new peak response map list R_merge1, which is merged with R_merge into R = {R_merge, R_merge1}. R is then used to retrieve the candidate segmentation mask set again, with NMS filtering of masks and fusion of peak response maps yielding a new list R_merge2. Finally, R_merge2 retrieves the candidate segmentation mask set and is filtered by NMS; the resulting segmentation mask set is taken as the instance segmentation result of the image.
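One greedy NMS round over retrieved items can be sketched as below, each item assumed to be a (score, category, mask, response) tuple; renormalization of the fused responses is omitted for brevity.

```python
import numpy as np

def nms_round(matches, iou_thresh=0.5):
    """Keep the best match, drop same-category masks with IoU >= 0.5, and
    fuse their peak response maps into the keeper (sketch of step 4.3)."""
    matches = sorted(matches, key=lambda m: m[0], reverse=True)
    kept = []
    while matches:
        score, cat, mask, resp = matches.pop(0)
        rest = []
        for m in matches:
            inter = np.logical_and(mask, m[2]).sum()
            union = np.logical_or(mask, m[2]).sum()
            if cat == m[1] and union > 0 and inter / union >= iou_thresh:
                resp = resp + m[3]               # fuse suppressed responses
            else:
                rest.append(m)
        matches = rest
        kept.append((score, cat, mask, resp))
    return kept      # surviving masks plus fused responses for the next round
```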
4.4) Filtering strategy: the segmentation result of step 4.3) comprises a segmentation mask S, a category C and the matching degree Score_match between the peak response map and the mask, which serves as the confidence. The invention additionally generates a category confidence for the corresponding object to update the final segmentation confidence. Specifically, as shown in FIG. 6, the input image is cropped by the segmentation mask and propagated forward through the network configured in step 2); the category with the maximum probability output by the final Softmax of the side classifier of the filtering module is used, and when it disagrees with the previous category, the current output category is taken as the correction; its probability is the category confidence Score_class. The final segmentation confidence is updated by:
Score = γ · Score_match + (1 − γ) · Score_class,
where γ is a balance coefficient that effectively integrates the shape and category information of the object through linear combination; it is set to 0.45 in the experiments, and a threshold of 0.2 is used for filtering.
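The update amounts to the following small filter; the argument names are illustrative.

```python
def final_confidence(score_match, score_class, gamma=0.45, thresh=0.2):
    """Fuse shape and category confidence and filter low-quality masks."""
    score = gamma * score_match + (1 - gamma) * score_class
    return score if score >= thresh else None    # None: mask is discarded
```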
The testing stage is illustrated below with an example. The complete testing flow is shown in FIG. 1. The input image passes through the sample processing stage of testing to obtain the processed image and its candidate segmentation mask set. The network configured in step 2) then outputs the peak lists P_A and P_B activated by the pre- and post-erasure branches. The peak response step 4.1) gives the peak response map lists R_A and R_B. Feature vectors are then extracted at the positions of the peaks in P_A and P_B on the feature map, spectral clustering with P_A as the initial cluster centers groups the peaks, and the peak response maps within each cluster are fused to obtain the new list R_merge. In step 4.3), the peak response maps retrieve the best matches from the class-agnostic candidate segmentation mask set as object segmentation results, and NMS iteratively filters and fuses the peak response maps under the prior that highly overlapping masks of the same category most probably correspond to the same object. Finally, the original image is cropped according to each segmentation mask and re-classified, the segmentation confidence is updated by combining the matching confidence and the classification confidence, and the segmentation results are filtered. FIG. 7 presents instance segmentation samples from the mining and filtering steps of the invention. The method is also applied to the dataset used for classification network training to generate a dataset with pseudo pixel-level labels for the supervised training of the Mask R-CNN algorithm, through which more accurate instance segmentation can be realized.
The invention evaluates segmentation precision on the PASCAL VOC training set of 1464 images and validation set of 1449 images with pixel-level labels. On the training set, mAP@25, mAP@50, mAP@75 and ABO reach 53.2%, 34.9%, 12.9% and 43.3% respectively; on the validation set they reach 52.1%, 32.9%, 14.7% and 43.4%. When the segmentation masks generated by this method on the classification training samples are used as pseudo pixel-level labels to supervise the training of the instance segmentation algorithm Mask R-CNN with a ResNet50-FPN backbone, mAP@50 and mAP@75 reach 45.6% and 21.2% respectively. Compared with current weakly supervised instance segmentation algorithms, the proposed method achieves better segmentation precision.

Claims (2)

1. A weakly supervised instance segmentation method based on peak mining and filtering, characterized in that an image classification network and an instance segmentation network are constructed and trained under the supervision of image-level category labels only; the image classification network is trained first and then produces the training data used to supervise the training of the instance segmentation network, which completes the instance segmentation;
the image classification network includes the following configurations:
1) feature fusion: ResNet50 is used as the backbone network to extract image features, and feature maps of different layers and sizes in the network are fused to generate a feature map with richer representation and semantic information;
2) adversarial erasure: the feature map is reduced in dimension and converted into a class activation map; the salient regions of the class activation map corresponding to the ground-truth categories are erased from the feature map and filled with the feature mean, and the filled feature map is activated again by a convolution layer to expand the semantic response region; the two branches before and after erasure use two classifiers with different parameters for image recognition, each outputting a class activation map;
3) peak activation: peaks are used to localize and classify the objects in the image, each locally salient peak representing an individual object; a locally most salient peak on the class activation map of a semantic category indicates that an object of that category exists at that position; local peaks are obtained by local max pooling and supervised by the semantic annotation, and a max-pooling layer is applied to the class activation maps obtained before and after erasure in step 2), yielding a peak list activated from each branch;
4) a filtering module: a classifier, distinct from the pre- and post-erasure classifiers and the peak activation layer, is attached to the feature map obtained in step 1) to judge the semantic category of objects independently;
the instance segmentation network builds on the image classification network: cluster analysis and iterative retrieval are performed on the pre- and post-erasure branch responses, and the filtering module is then used to discard segmentation masks that do not meet the confidence requirement, yielding the instance segmentation results; specifically: the image classification network is trained on a training set; for a test image with a class-agnostic candidate segmentation mask set to be instance-segmented, the network computes probability dependencies from its gradients to obtain peak response maps, i.e. the responses of peaks on the original image; cluster analysis based on the deep features of the peaks merges different peak response maps coming from the same object; during instance segmentation, the peak response maps iteratively retrieve the class-agnostic candidate mask set of the original image and the best match is selected as the segmentation mask of the object; the confidence is updated by combining the category and shape information of the object obtained from the filtering module to filter out low-quality masks; finally, the method is applied to the training images of the classification network to generate a dataset with pseudo pixel-level labels for the supervised training of a fully supervised instance segmentation algorithm.
2. The weakly supervised instance segmentation method based on peak mining and filtering as claimed in claim 1, characterized by comprising the following steps:
1) a sample processing stage: the training samples comprise RGB images and their corresponding category labels, the input images being augmented by random horizontal flipping; the test samples comprise RGB images only, and the MCG algorithm generates a class-agnostic candidate segmentation mask set for each image;
2) a network configuration stage: RGB images are taken as input and image-level category labels as supervision, the image classification network comprising the following configurations:
2.1) feature fusion: ResNet50 is used as the backbone network to extract image features, where ResNet50 comprises layer0, layer1, layer2, layer3 and layer4, and the features from layer3 and layer4 are fused to obtain the fused feature map F_last;
2.2) adversarial erasure: for each input sample, under the supervision of the same category labels, two classifier structures with different parameters perform image recognition; for the feature map F_last obtained after feature fusion in 2.1), the pre-erasure branch, denoted Branch_A, is activated by a convolution layer with kernel size 1×1 and stride 1, converting the channel count of the feature map from 2048 to the number of semantic categories C, the output of Branch_A being called the class activation map M_A; semantic categories are converted into digital labels, N foreground categories being represented by a 1×N vector whose entry is 1 at the index of a category present in the image and 0 otherwise; during supervised training, the channels of M_A corresponding to the ground-truth categories are selected and a foreground saliency map M_s is generated: the i-th class activation map M_A^i is first normalized by dividing each element by the difference between the maximum and minimum elements of the map, and a threshold of 0.6 is applied to obtain the saliency map M_s^i of the corresponding category; the saliency maps of the different ground-truth categories in the same image are merged during training to obtain the foreground saliency map of the whole image, and the corresponding regions of F_last are erased according to M_s, the erased feature map being denoted F_erase: only the regions where M_s is 0 are kept in F_last, and the regions where M_s is 1 are filled with the mean of F_last over the region where M_s is 0; the post-erasure branch, denoted Branch_B, is activated to obtain the class activation map M_B, converting the channel count of F_erase from 2048 to the number of semantic categories C; the two branch networks have the same structure but do not share parameters, and the network parameters are randomly initialized;
2.3) peak activation: a max-pooling layer with window size 3×3 is applied to the class activation maps M_A and M_B obtained before and after erasure in 2.2), producing the peak lists P_A and P_B activated from the two branches; the calculation is
G_c(x, y) = f(M_c)(x, y) = max_{(x′, y′) ∈ Ω(x, y)} M_c(x′, y′),
where Ω(x, y) is the 3×3 pooling window centered at (x, y), a position being recorded as a peak of category c when M_c(x, y) = G_c(x, y); here G denotes the class activation map after peak activation, c, x, y denote the category and the region the kernel can access during pooling, f denotes the max-pooling function, and N_c denotes the number of peaks generated for the c-th category;
the pooled value of the class activation map is stored as the peak and its corresponding position is retained, represented by five parameters s, b, c, h, w: s is the value for the corresponding semantic category, b is the image index within the batch, c is the semantic category, h is the row (ordinate) of the position on the feature map plane, and w is the column (abscissa); the generated peak list is filtered, the median of the peaks of each category of each image on the feature map plane being used for screening, and the retained peaks of each category from the pre- and post-erasure branches of each image are averaged to obtain the classification outputs S_A and S_B;
2.4) a filtering module: the layer4 feature map of 2.1) is first pooled by average pooling; the pooled feature map is flattened, converting the two-dimensional plane into a one-dimensional vector; a fully connected layer with randomly initialized parameters performs the activation, and a Softmax layer computes the probability of each category as the classifier output S_C;
3) a training stage: for the above network configuration, the backbone network pre-trained on the ImageNet dataset and its corresponding parameters are used, and the classification network with the three different classifiers is supervised and trained on the target dataset; during training, the losses of S_A, S_B and S_C against the ground-truth category labels are each computed with the MultiLabelSoftMarginLoss loss function;
4) a testing stage: for an image to be instance-segmented, the converged model generates segmentation masks for the objects in the image through the following steps:
4.1) peak response: for the peak sets generated by the pre- and post-erasure network branches, probabilities are computed through gradients and the response region that produced each peak, i.e. the peak response map, is obtained on the original image;
4.2) cluster analysis: based on the deep features at the peak positions, different peak response maps from the same object are merged to increase their diversity and integrity;
4.3) iterative retrieval: based on the prior that different segmentation masks of the same category with a high intersection-over-union most probably correspond to the same object, the peak response maps iteratively retrieve the best matches from the class-agnostic candidate segmentation mask set as segmentation masks;
4.4) a filtering strategy: the matching scores between the peak response maps and the segmentation masks of 4.3) are combined with the classification scores of the mask regions from the classifier attached beside the pre- and post-erasure network branches; the final confidence is updated, and masks with low confidence are filtered out to obtain the final instance segmentation result.
CN202110565129.7A 2021-05-24 2021-05-24 Weak supervision instance segmentation method based on peak value mining and filtering Pending CN115393580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110565129.7A CN115393580A (en) 2021-05-24 2021-05-24 Weak supervision instance segmentation method based on peak value mining and filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110565129.7A CN115393580A (en) 2021-05-24 2021-05-24 Weak supervision instance segmentation method based on peak value mining and filtering

Publications (1)

Publication Number Publication Date
CN115393580A true CN115393580A (en) 2022-11-25

Family

ID=84113714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110565129.7A Pending CN115393580A (en) 2021-05-24 2021-05-24 Weak supervision instance segmentation method based on peak value mining and filtering

Country Status (1)

Country Link
CN (1) CN115393580A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173072A (en) * 2023-11-03 2023-12-05 四川大学 Weak laser image enhancement method and device based on deep learning
CN117173072B (en) * 2023-11-03 2024-02-02 四川大学 Weak laser image enhancement method and device based on deep learning

Similar Documents

Publication Publication Date Title
CN109614985B (en) Target detection method based on densely connected feature pyramid network
US11315345B2 (en) Method for dim and small object detection based on discriminant feature of video satellite data
CN110619369B (en) Fine-grained image classification method based on feature pyramid and global average pooling
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN109934293B (en) Image recognition method, device, medium and confusion perception convolutional neural network
CN107526785B (en) Text classification method and device
CN110909820B (en) Image classification method and system based on self-supervision learning
CN107506761B (en) Brain image segmentation method and system based on significance learning convolutional neural network
CN109993102B (en) Similar face retrieval method, device and storage medium
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
US20140270489A1 (en) Learned mid-level representation for contour and object detection
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
US20120093396A1 (en) Digital image analysis utilizing multiple human labels
Redondo-Cabrera et al. Learning to exploit the prior network knowledge for weakly supervised semantic segmentation
CN109711448A (en) Based on the plant image fine grit classification method for differentiating key field and deep learning
WO2021088365A1 (en) Method and apparatus for determining neural network
CN113408605A (en) Hyperspectral image semi-supervised classification method based on small sample learning
CN106815323B (en) Cross-domain visual retrieval method based on significance detection
CN111062277B (en) Sign language-lip language conversion method based on monocular vision
CN111476315A (en) Image multi-label identification method based on statistical correlation and graph convolution technology
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN114821022A (en) Credible target detection method integrating subjective logic and uncertainty distribution modeling
CN113591529A (en) Action segmentation model processing method and device, computer equipment and storage medium
CN110689044A (en) Target detection method and system combining relationship between targets
CN115240024A (en) Method and system for segmenting extraterrestrial pictures by combining self-supervised learning and semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination