CN110008962B - Weak supervision semantic segmentation method based on attention mechanism


Info

Publication number
CN110008962B
CN110008962B
Authority
CN
China
Prior art keywords
target
target object
network
obtaining
pseudo mask
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910289248.7A
Other languages
Chinese (zh)
Other versions
CN110008962A (en)
Inventor
黄立勤
李良御
宋志刚
Current Assignee
Fuzhou University
CERNET Corp
Original Assignee
Fuzhou University
CERNET Corp
Priority date
Filing date
Publication date
Application filed by Fuzhou University, CERNET Corp filed Critical Fuzhou University
Priority to CN201910289248.7A
Publication of CN110008962A
Application granted
Publication of CN110008962B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a weakly supervised semantic segmentation method based on an attention mechanism. The method combines image-level supervision with detection-box-level supervision: the target attention obtained from image-level supervision mitigates the excessive noise introduced by the detection box, while the target completeness guaranteed by detection-box supervision remedies the incompleteness of the class activation map generated by image-level supervision. Compared with existing weakly supervised semantic segmentation techniques, the method achieves a better segmentation effect.

Description

Weak supervision semantic segmentation method based on attention mechanism
Technical Field
The invention relates to the field of machine vision, in particular to a weak supervision semantic segmentation method based on an attention mechanism.
Background
In recent years, convolutional neural networks have come to dominate computer vision. Their advantage is the ability to learn good features from large amounts of data, but this comes at the cost of requiring large amounts of training data. The drawback is particularly acute in image segmentation: most semantic segmentation approaches rely on massive, densely annotated data to train a deep neural network model, yet producing pixel-level annotations costs considerable labor and time. Statistically, pixel-level labeling of an image takes 4 to 5 minutes, whereas an image-level label takes only about 2 seconds and a bounding-box label about 20 seconds. Bounding-box and image-level annotations are therefore far cheaper.
Existing weakly supervised semantic segmentation methods mainly fall into three categories: 1. using image-level supervision information with an attention mechanism to obtain the attention points of the target from a detection network; 2. framing the target object with a detection box and then clustering color information to obtain an approximate target mask; 3. placing scribble annotations on the target object and taking regions highly similar to the marked points as the target mask.
The image-level-annotation approach has poor robustness to interference, and complex environments can cause over-segmentation. The detection-box approach, when the background color resembles the foreground target, easily mis-segments background noise inside the box as foreground or discards foreground as background. The scribble-annotation approach tends to miss parts of the target when the target's color varies too much.
Disclosure of Invention
In view of the above, the present invention provides a weakly supervised semantic segmentation method based on an attention mechanism, which improves on prior weakly supervised semantic segmentation by fully mining the content of the supervision information and exploiting the useful information hidden in it.
The invention is realized by adopting the following scheme: a weak supervision semantic segmentation method based on an attention mechanism specifically comprises the following steps:
Step S1: inputting a picture;
step S2: obtaining a target detection box using a target detection network, obtaining target initial positioning points and target object pseudo mask 1 using a classification network, and obtaining target object pseudo mask 2 using an improved GrabCut;
step S3: performing a pixel-wise AND operation on target object pseudo mask 1 and target object pseudo mask 2 obtained in step S2 to obtain the final target pseudo mask;
step S4: training a semantic segmentation network.
Further, in step S2, obtaining the target detection box using the target detection network specifically comprises: adopting a DCN architecture to divide the detection box into nine parts that cover different positions of the target object respectively.
Further, in step S2, obtaining the target initial positioning points and target object pseudo mask 1 using the classification network specifically comprises: first, training a neural network with image-level annotation information based on the CAM (Class Activation Mapping) framework, and analyzing whether an image contains the target object; the input of the classification network is the original picture and the output is a class activation map in which each pixel holds the probability that it belongs to the target class; then scaling the detection box to the same size as the class activation map and increasing the probability values at the corresponding positions inside the box; taking points whose final probability value exceeds a certain threshold as peak points, back-propagating each peak point through the neural network, and obtaining a back-propagation map for each peak point of the target object; finally, screening the peak-point back-propagation maps with the deformable detection box, coarsely retaining those belonging to the target object, and merging the retained maps into target object pseudo mask 1.
Further, in step S2, obtaining target object pseudo mask 2 using the improved GrabCut specifically comprises: using the class activation map obtained by the CAM framework to automatically supply GrabCut with foreground information, thereby generating target object pseudo mask 2.
Further, the semantic segmentation network adopts a DeepLab network.
Compared with the prior art, the invention has the following beneficial effects:
1. The method is built on a detection network and a classification network, and obtains the target foreground pseudo mask by letting the advantages of each compensate for the defects of the other: the detection network's complete coverage of the target supplements the incompleteness of the class activation map obtained by the CAM framework, while the deformable detection box screens the masks formed by back-propagating the peak points of the class activation map into the foreground pseudo mask.
2. The invention adopts an improved grabcut mode. And the class activation mapping graph obtained by the CAM architecture is utilized to automatically supplement the foreground information of the grabcut, so that a better foreground segmentation effect can be obtained.
Drawings
Fig. 1 is a schematic flow chart of the principle of the embodiment of the invention.
Fig. 2 is a schematic diagram of a detection frame and a deformable detection frame according to an embodiment of the invention.
FIG. 3 is a CAM-generated class activation map, according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of peak point backward propagation according to an embodiment of the present invention.
Fig. 5 is an exemplary raw picture of an embodiment of the present invention.
FIG. 6 is a schematic diagram of target object pseudo mask 1 according to an embodiment of the present invention.
Fig. 7 is a diagram of the effect of the original GrabCut according to an embodiment of the present invention.
Fig. 8 is a diagram of the effect of the improved GrabCut according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of an output result of DeepLab training according to an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in fig. 1, the embodiment provides a weak supervised semantic segmentation method based on an attention mechanism, which specifically includes the following steps:
step S1: inputting a picture;
step S2: obtaining a target detection box using a target detection network, obtaining target initial positioning points and target object pseudo mask 1 using a classification network, and obtaining target object pseudo mask 2 using an improved GrabCut;
step S3: performing a pixel-wise AND operation on target object pseudo mask 1 and target object pseudo mask 2 obtained in step S2 to obtain the final target pseudo mask;
step S4: training a semantic segmentation network.
In this embodiment, in step S2, obtaining the target detection box using the target detection network specifically comprises: adopting a DCN architecture to divide the detection box into nine parts that cover different positions of the target object respectively.
The DCN architecture is a network in which the detection box can change its shape according to the shape of the target (referred to in this embodiment as a "deformable detection box"). The original detection network gives a detection box that completely covers the target object; here the box is divided into nine equal parts that are placed over different positions of the target object and together cover most of the target. The deformable detection box is shown in fig. 2.
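The nine-part split can be sketched as a plain axis-aligned 3x3 grid (an illustration only, assuming corner-format boxes; the learned deformable offsets of the DCN are omitted, and split_box_3x3 is a hypothetical helper rather than code from the patent):

import numpy as np

def split_box_3x3(box):
    """Split a detection box (x1, y1, x2, y2) into a 3x3 grid of parts.

    Sketch only: a real DCN additionally learns offsets so that each
    part shifts to follow the target's shape.
    """
    x1, y1, x2, y2 = box
    xs = np.linspace(x1, x2, 4)  # 4 vertical edges -> 3 columns
    ys = np.linspace(y1, y2, 4)  # 4 horizontal edges -> 3 rows
    return [(xs[c], ys[r], xs[c + 1], ys[r + 1])
            for r in range(3) for c in range(3)]

# Example: a 90x90 box yields nine 30x30 parts.
parts = split_box_3x3((10, 10, 100, 100))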
In this embodiment, in step S2, obtaining the target initial positioning points and target object pseudo mask 1 using the classification network specifically comprises: first, training a neural network with image-level annotation information based on the CAM framework (a way of obtaining target positioning points with a classification network), and analyzing whether the image contains the target object. The network has five convolutional layers in total, whose main function is to extract features from the input image; a global pooling layer and a fully connected layer then screen the information extracted in the feature maps and select the useful information for judging the target category. The input of the classification network is the original picture and the output is a class activation map, whose size equals that of the feature map produced by the five convolutional layers; each pixel in the class activation map is the probability that it belongs to the target class. The class activation map is shown in fig. 3, which shows the per-pixel probability for the category "cat".
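A minimal PyTorch sketch of such a classifier follows; the channel widths are assumptions (the description fixes only the layer count), and CAMClassifier is a hypothetical name:

import torch
import torch.nn as nn

class CAMClassifier(nn.Module):
    """Five conv layers + global average pooling + one fully connected
    layer; the class activation map is recovered by re-applying the FC
    weights at every spatial location (the standard CAM construction)."""

    def __init__(self, num_classes):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512]   # assumed widths
        self.features = nn.Sequential(*(
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(5)))
        self.fc = nn.Linear(chans[-1], num_classes)

    def forward(self, x):
        fmap = self.features(x)                  # (N, C, H, W)
        logits = self.fc(fmap.mean(dim=(2, 3)))  # global average pooling
        # CAM: per-class weighted sum of the feature maps.
        cam = torch.einsum('kc,nchw->nkhw', self.fc.weight, fmap)
        return logits, cam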
In the CAM framework, the probability values of the resulting class activation map are usually large only at the most discriminative positions of the target object and small elsewhere, so a reasonably complete initial localization cannot be given. In the detection network, by contrast, the obtained detection box completely covers the target object, i.e., the probability of containing the target inside the box is high. This embodiment therefore scales the detection box to the same size as the class activation map and increases the probability values at the corresponding positions inside the box. Points whose final probability value exceeds a certain threshold are taken as peak points and back-propagated through the neural network, finally yielding, at the first layer of the convolutional network, a back-propagation map for each peak point of the target object; such a map is generally activated at the edges of the target, as shown in fig. 4. However, not all points inside the detection box belong to the target, so excessive noise is brought in, and the back-propagated edge maps likewise contain many noise edges. Finally, this embodiment uses the deformable detection box to screen the peak-point back-propagation maps, coarsely retaining those belonging to the target object, and merges the retained maps into target object pseudo mask 1, as shown in fig. 6.
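The box boost, thresholding, and per-peak back-propagation can be sketched as follows, assuming the CAMClassifier above; boost and thresh are illustrative values, and for simplicity the gradient is read at the input image rather than at the first convolutional layer as the embodiment does:

import torch

def peak_backprop_maps(model, image, box_mask, cls=0, boost=0.3, thresh=0.7):
    """image: (1, 3, H, W); box_mask: (H, W), 1 inside the detection box
    after resizing it to the class activation map. Returns one edge-like
    response map per peak point."""
    image = image.clone().requires_grad_(True)
    with torch.no_grad():
        _, cam = model(image)
        scores = torch.sigmoid(cam[0, cls]) + boost * box_mask
    peaks = (scores > thresh).nonzero()

    maps = []
    for y, x in peaks.tolist():
        model.zero_grad()
        image.grad = None
        _, cam = model(image)                        # fresh graph per peak
        torch.sigmoid(cam[0, cls, y, x]).backward()
        maps.append(image.grad[0].abs().sum(dim=0))  # (H, W) response
    return maps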
In this embodiment, in step S2, obtaining target object pseudo mask 2 using the improved GrabCut specifically comprises: using the class activation map obtained by the CAM framework to automatically supply GrabCut's foreground information and generate target object pseudo mask 2. In the original GrabCut algorithm, a segmentation of the target foreground can be obtained automatically given only a detection box; to improve the result, a user may additionally mark part of the object foreground and part of the background by hand so that the algorithm segments a more complete target foreground. This, however, requires manual involvement by the user to provide supervision for the target object. In this embodiment, the class activation map obtained by CAM is used instead: whenever a pixel's probability value exceeds a set threshold, that pixel is marked as foreground target. This removes the step of manually marking partial foreground, automates the foreground segmentation, and generates pseudo mask 2. The original and improved GrabCut effects are shown in fig. 7 and fig. 8, respectively.
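A sketch of the CAM-seeded GrabCut with OpenCV; fg_thresh is an assumed threshold (the embodiment does not give its value) and cam_seeded_grabcut is a hypothetical helper:

import cv2
import numpy as np

def cam_seeded_grabcut(img_bgr, box, cam, fg_thresh=0.8, iters=5):
    """GrabCut initialized from a detection box, with pixels whose CAM
    probability exceeds fg_thresh forced to definite foreground --
    standing in for the manual foreground strokes of plain GrabCut."""
    mask = np.full(img_bgr.shape[:2], cv2.GC_BGD, np.uint8)
    x1, y1, x2, y2 = box
    mask[y1:y2, x1:x2] = cv2.GC_PR_FGD     # probable foreground in the box
    mask[cam > fg_thresh] = cv2.GC_FGD     # CAM-confident foreground
    bgd = np.zeros((1, 65), np.float64)    # GMM scratch buffers
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img_bgr, mask, None, bgd, fgd, iters, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)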
In this embodiment, the pixel-wise AND operation marks a pixel as target foreground in the final result only when its value is 1 in both pseudo masks; this yields the final target pseudo mask.
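In NumPy terms the fusion reduces to one element-wise AND (a sketch, assuming both pseudo masks are binary arrays of equal shape):

import numpy as np

def fuse_masks(pseudo_mask1, pseudo_mask2):
    """Pixel-wise AND: foreground only where both pseudo masks are 1."""
    return np.logical_and(pseudo_mask1, pseudo_mask2).astype(np.uint8)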
In this embodiment, the semantic segmentation network employs a DeepLab network. DeepLab is a neural network that combines a deep convolutional network with a probabilistic graphical model and enlarges the receptive field with dilated (atrous) convolution, which yields a better segmentation effect. The input is the original picture, as shown in fig. 5, and the output is the image semantic segmentation mask, as shown in fig. 9.
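A sketch of this final stage, assuming the torchvision DeepLabv3 implementation (the embodiment does not name a specific DeepLab variant; the optimizer and loss settings are likewise assumptions):

import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(num_classes=2)   # background vs. target
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def train_step(images, pseudo_masks):
    """images: (N, 3, H, W) float; pseudo_masks: (N, H, W) long in {0, 1},
    i.e. the fused pseudo masks standing in for ground-truth labels."""
    model.train()
    logits = model(images)['out']           # (N, 2, H, W)
    loss = criterion(logits, pseudo_masks)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()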
This embodiment combines image-level supervision with detection-box-level supervision: the target attention from image-level supervision mitigates the excessive noise introduced by the detection box, while the target completeness of detection-box supervision remedies most of the defects in the class activation map generated by image-level supervision. In summary, compared with existing weakly supervised semantic segmentation techniques, this embodiment obtains a better segmentation effect.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. Any simple modification, equivalent change, or adaptation of the above embodiments according to the technical essence of the present invention remains within the protection scope of the technical solution of the present invention.

Claims (2)

1. A weakly supervised semantic segmentation method based on an attention mechanism, characterized by comprising the following steps:
step S1: inputting a picture;
step S2: obtaining a target detection box using a target detection network, obtaining target initial positioning points and target object pseudo mask 1 using a classification network, and obtaining target object pseudo mask 2 using an improved GrabCut;
step S3: performing a pixel-wise AND operation on target object pseudo mask 1 and target object pseudo mask 2 obtained in step S2 to obtain the final target pseudo mask;
step S4: training a semantic segmentation network;
in step S2, obtaining the target detection box using the target detection network specifically comprises: adopting a DCN architecture to divide the detection box into nine parts that cover different positions of the target object respectively;
in step S2, obtaining the target initial positioning points and target object pseudo mask 1 using the classification network specifically comprises: first, training a neural network with image-level annotation information based on the CAM (Class Activation Mapping) framework, and analyzing whether an image contains the target object; the input of the classification network is the original picture and the output is a class activation map in which each pixel holds the probability that it belongs to the target class; then scaling the detection box to the same size as the class activation map and increasing the probability values at the corresponding positions inside the box; taking points whose final probability value exceeds a certain threshold as peak points, back-propagating each peak point through the neural network, and obtaining a back-propagation map for each peak point of the target object; finally, screening the peak-point back-propagation maps with the deformable detection box, coarsely retaining those belonging to the target object, and merging the retained maps into target object pseudo mask 1;
in step S2, obtaining target object pseudo mask 2 using the improved GrabCut specifically comprises: using the class activation map obtained by the CAM framework to automatically supply GrabCut with foreground information, thereby generating target object pseudo mask 2.
2. The weakly supervised semantic segmentation method based on an attention mechanism as recited in claim 1, wherein the semantic segmentation network adopts a DeepLab network.
CN201910289248.7A 2019-04-11 2019-04-11 Weak supervision semantic segmentation method based on attention mechanism Expired - Fee Related CN110008962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910289248.7A CN110008962B (en) 2019-04-11 2019-04-11 Weak supervision semantic segmentation method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910289248.7A CN110008962B (en) 2019-04-11 2019-04-11 Weak supervision semantic segmentation method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN110008962A CN110008962A (en) 2019-07-12
CN110008962B true CN110008962B (en) 2022-08-12

Family

ID=67171098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910289248.7A Expired - Fee Related CN110008962B (en) 2019-04-11 2019-04-11 Weak supervision semantic segmentation method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN110008962B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674807A (en) * 2019-08-06 2020-01-10 中国科学院信息工程研究所 Curved scene character detection method based on semi-supervised and weakly supervised learning
CN110458172A (en) * 2019-08-16 2019-11-15 中国农业大学 A kind of Weakly supervised image, semantic dividing method based on region contrast detection
CN111291809B (en) * 2020-02-03 2024-04-12 华为技术有限公司 Processing device, method and storage medium
CN112116571A (en) * 2020-09-14 2020-12-22 中国科学院大学宁波华美医院 X-ray lung disease automatic positioning method based on weak supervised learning
CN112329659B (en) * 2020-11-10 2023-08-29 平安科技(深圳)有限公司 Weak supervision semantic segmentation method based on vehicle image and related equipment thereof
CN113298084B (en) * 2021-04-01 2023-04-07 山东师范大学 Feature map extraction method and system for semantic segmentation
CN113139969A (en) * 2021-05-17 2021-07-20 齐鲁工业大学 Attention mechanism-based weak supervision image semantic segmentation method and system
CN113449820B (en) * 2021-08-27 2022-01-18 深圳市商汤科技有限公司 Image processing method, electronic device, and storage medium
CN114170233B (en) * 2021-12-09 2024-02-09 北京字跳网络技术有限公司 Image segmentation label generation method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10424064B2 (en) * 2016-10-18 2019-09-24 Adobe Inc. Instance-level semantic segmentation system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016197303A1 (en) * 2015-06-08 2016-12-15 Microsoft Technology Licensing, Llc. Image semantic segmentation
CN106530305A (en) * 2016-09-23 2017-03-22 北京市商汤科技开发有限公司 Semantic segmentation model training and image segmentation method and device, and calculating equipment
CN106529565A (en) * 2016-09-23 2017-03-22 北京市商汤科技开发有限公司 Target identification model training and target identification method and device, and computing equipment
CN108647684A (en) * 2018-05-02 2018-10-12 深圳市唯特视科技有限公司 A kind of Weakly supervised semantic segmentation method based on guiding attention inference network
CN109033944A (en) * 2018-06-07 2018-12-18 西安电子科技大学 A kind of all-sky aurora image classification and crucial partial structurtes localization method and system
CN109584248A (en) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface object instance dividing method based on Fusion Features and dense connection network
CN109584251A (en) * 2018-12-06 2019-04-05 湘潭大学 A kind of tongue body image partition method based on single goal region segmentation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ling-Yun Ma. A Variant of WSL Framework For Weakly Supervised Semantic Segmentation. 2018 3rd International Conference on Mechanical, Control and Computer Engineering (ICMCCE), 2018. *
Mo Lingfei et al. A Survey of Video Prediction Research Based on Deep Learning. CAAI Transactions on Intelligent Systems (智能系统学报), 2018, (01). *

Also Published As

Publication number Publication date
CN110008962A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN110008962B (en) Weak supervision semantic segmentation method based on attention mechanism
CN108805170B (en) Forming data sets for fully supervised learning
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
US8319793B2 (en) Analyzing pixel data by imprinting objects of a computer-implemented network structure into other objects
US11393100B2 (en) Automatically generating a trimap segmentation for a digital image by utilizing a trimap generation neural network
US20220044366A1 (en) Generating an image mask for a digital image by utilizing a multi-branch masking pipeline with neural networks
CN111311475A (en) Detection model training method and device, storage medium and computer equipment
CN103345492A (en) Method and system for video enrichment
CN111091101B (en) High-precision pedestrian detection method, system and device based on one-step method
CN113614778A (en) Image analysis system and method of using the same
CN111382647B (en) Picture processing method, device, equipment and storage medium
CN113111804B (en) Face detection method and device, electronic equipment and storage medium
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
US11461880B2 (en) Generating image masks from digital images utilizing color density estimation and deep learning models
CN108564582B (en) MRI brain tumor image automatic optimization method based on deep neural network
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN113763424A (en) Real-time intelligent target detection method and system based on embedded platform
CN112733711A (en) Remote sensing image damaged building extraction method based on multi-scale scene change detection
CN110097603B (en) Fashionable image dominant hue analysis method
CN115082909B (en) Method and system for identifying lung lesions
CN107886502A (en) The shadow Detection and removal algorithm of color and gradient synthesis under a kind of complex environment
CN117237371A (en) Colon histological image gland segmentation method based on example perception diffusion model
CN112560925A (en) Complex scene target detection data set construction method and system
CN112488015B (en) Intelligent building site-oriented target detection method and system
CN109977816B (en) Information processing method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220812