CN116912501A - Weak supervision semantic segmentation method based on attention fusion - Google Patents
- Publication number
- CN116912501A CN116912501A CN202310981553.9A CN202310981553A CN116912501A CN 116912501 A CN116912501 A CN 116912501A CN 202310981553 A CN202310981553 A CN 202310981553A CN 116912501 A CN116912501 A CN 116912501A
- Authority
- CN
- China
- Prior art keywords
- attention
- semantic segmentation
- block
- weak supervision
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a weak supervision semantic segmentation method based on attention fusion, relates to the technical field of computer vision, and provides a simple and effective weak supervision semantic segmentation framework that uses a Vision Transformer as its basic network structure. In this framework, an adaptive attention fusion module is first designed to assign different weights to the attention of different layers, so that the fused attention suppresses background noise well while preserving target details. In addition, to address the problem that regions of secondary importance in the attention cannot activate the target region well, a modulation function is designed to increase the attention values of those regions, effectively highlighting the target region. The coarse class activation map is then refined using the modulated attention, so that target regions in the resulting class activation map are activated more completely and accurately, better alleviating the problem of incomplete activation in class activation maps.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a weak supervision semantic segmentation method based on attention fusion.
Background
Semantic segmentation is one of the fundamental and challenging tasks in the field of computer vision: it uses a computer's feature representations to mimic the human process of recognizing an image, assigning a semantic class label to each pixel of a given image. In recent years, driven by the vigorous development of deep learning methods, semantic segmentation has made remarkable progress. As a dense prediction task, however, training a semantic segmentation model is inseparable from large-scale pixel-level annotation data, and pixel-level annotation of images is difficult to obtain, time-consuming, and labor-intensive.
Weak supervision semantic segmentation technology can relieve the dependence of existing semantic segmentation models on large amounts of pixel-level annotation data, because the segmentation model is trained using only weak annotations; weak annotation has therefore become an academic research hotspot. Common weak annotations include bounding-box annotations, scribble annotations, point annotations, image-level annotations, and the like. Among these, image-level annotation is the easiest to obtain, yet weak supervision semantic segmentation based on image-level annotation is also the most challenging, since only the object classes present in the image are given, without any indication of where those classes are located.
Because the specific location of the target class in the image is unavailable, most image-level weak supervision semantic segmentation methods rely on the coarse location information produced by class activation maps. A class activation map is a technique based on a deep classification network that generates feature maps with the same number of channels as the total number of classes. The typical operation flow is: 1) obtain seed regions from the class activation map; 2) expand the seed regions to obtain pseudo labels; 3) use the pseudo labels to train a conventional fully supervised neural network to obtain the final segmentation results. Since class activation maps tend to cover only the most discriminative regions of an object and to mistake background for foreground, much effort has been devoted to generating higher-quality class activation maps.
With the rapid development of the Vision Transformer (ViT), researchers began introducing it into the weak supervision semantic segmentation task: some methods extract image features with a Vision Transformer structure, generate a coarse class activation map, and then refine it with attention to obtain a higher-quality class activation map. Typically, these methods directly add and fuse the attention of different layers. However, the shallow-layer attention of the Vision Transformer structure focuses more on local detail features of the image, so class activation maps refined with shallow attention often contain more detail information, whereas deep-layer attention focuses more on the global information of the image. Directly fusing the attention of different layers is therefore not the optimal choice and may introduce misleading information in the refinement stage.
Disclosure of Invention
In order to solve the technical problems, the invention provides a weak supervision semantic segmentation method based on attention fusion, which comprises the following steps of
S1, preparing a data set, wherein the data set comprises a training set, a verification set and a test set;
s2, carrying out data preprocessing on the images in the data set;
s3, constructing a weak supervision semantic segmentation model based on attention fusion, and adopting a data image converter DeiT pre-trained on an image recognition data set ImageNet as a backbone of the model; step S3 comprises the following sub-steps:
s3.1, dividing the image subjected to data preprocessing in the step S2 into N non-overlapping blocks, constructing N block tokens through linear mapping, and splicing C class tokens and N block tokens to obtain an input token of the model;
s3.2, inputting the input token into a Transformer coding layer of the weak supervision semantic segmentation model based on attention fusion to obtain an output token;
s3.3, extracting the last N block tokens from the output tokens to form the output block tokens, carrying out recombination operation and convolution operation on the output block tokens to obtain a rough class activation diagram,
Coarse-CAM=Conv(Reshape(Tp_out))
where Tp_out represents the output block token, reshape represents the reorganization operation, conv represents the convolution operation, and Coarse-CAM represents the Coarse class activation map;
s3.4, when the input token passes through the Transformer coding layer, attention is calculated on the input token by an Attention module to generate Attention, with the calculation formula:

Attention=softmax(QK^T/√d_k)

wherein Q and K respectively represent the Query matrix and Key matrix obtained by linear projection of the input token as it passes through the Transformer coding layer, d_k represents a scaling factor, and T represents the matrix transposition operation;
s3.5, each Transformer coding layer generates one attention map; after L Transformer coding layers, L attention maps are obtained and collectively denoted A; global average pooling is then applied to A, followed by a fully connected layer for information interaction, to generate the weights W:
W=FC(GAP(A))
GAP represents global average pooling operation, and FC represents a fully connected layer;
s3.6, multiplying the obtained weight W by A and fusing to obtain the final attention W';
s3.7, further dividing the final attention W' into class-to-block attention A_c2p and block-to-block attention A_p2p, and multiplying each of A_c2p and A_p2p by the modulation function G;
s3.8, optimizing the rough class activation diagram sequentially by using the modulated class-to-block attention and the modulated block-to-block attention to obtain a final class activation diagram;
s4, training the weak supervision semantic segmentation model based on attention fusion for multiple times, and storing the best parameters corresponding to the best round of results of training;
s5, loading the stored best parameters into the weak supervision semantic segmentation model based on attention fusion, and then inputting the test set data into the model to generate a complete class activation diagram.
The technical scheme of the invention is as follows:
further, in step S1, using the PASCAL VOC 2012 dataset and the MS COCO 2014 dataset as datasets, the PASCAL VOC 2012 dataset has 21 categories including 20 object categories and one background category; the MS COCO 2014 dataset has 81 categories, including 80 object classes and one background class.
In the aforementioned weak supervision semantic segmentation method based on attention fusion, the PASCAL VOC 2012 dataset comprises a training set of 1464 images, a verification set of 1449 images, and a test set of 1456 images, wherein the training set is augmented with additional data to 10582 images; the MS COCO 2014 dataset comprises a training set of 82081 images and a validation set of 40137 images.
The aforementioned weak supervision semantic segmentation method based on attention fusion, step S2 comprises the following sub-steps:
s2.1, applying random horizontal flipping and color jittering to the image;
s2.2, normalizing the image and resizing it to 256×256;
s2.3, finally, randomly cropping the image and resizing it to 224×224.
In the foregoing weak supervision semantic segmentation method based on attention fusion, in step S2.1, the color jittering of the image is performed as follows: the brightness, contrast and saturation values of the image are all set to 0.3.
In the foregoing weak supervision semantic segmentation method based on attention fusion, in step S3.7, the modulation function G is given by:

G(x)=(1/√(2πσ))e^(-(x-μ)²/(2σ))

where x represents the input, μ represents the mean of the input, σ represents the variance of the input, e represents the exponential function, and π represents the circumference ratio.
In the foregoing weak supervision semantic segmentation method based on attention fusion, in step S3.8, the method for optimizing the rough activation map specifically includes: firstly multiplying a Coarse class activation diagram Coarse-CAM with modulated class-to-block attention element by element to obtain a preliminarily optimized class activation diagram; and then the class activation map is further optimized by matrix multiplication by using the modulated block-to-block attention, and a Final class activation map Final-CAM is obtained.
In the foregoing weak supervision semantic segmentation method based on attention fusion, in step S4, the model-related hyper-parameters are set as follows: the number of training epochs is set to 60, the training batch size to 64, the optimizer used during training is AdamW, the loss function is multi-label cross-entropy loss, and the initial learning rate is set to 4e-4.
In the foregoing weak supervision semantic segmentation method based on attention fusion, in step S5, the stored best parameters are loaded into the weak supervision semantic segmentation model based on attention fusion, the pictures in the verification set and the test set are input into the model, the mean intersection-over-union (MIoU) is then calculated, and the semantic segmentation performance of the model on the PASCAL VOC 2012 and MS COCO 2014 datasets is measured by the obtained MIoU value.
The beneficial effects of the invention are as follows:
(1) The invention mainly solves the problem of incomplete activation of the class activation map in weak supervision semantic segmentation, using a Vision Transformer as the basic network structure and providing a simple and effective weak supervision semantic segmentation framework. In this framework, an adaptive attention fusion module is designed to assign different weights to the attention of different layers, so that the fused attention effectively suppresses background noise while preserving target details;
(2) In the invention, to address the problem that regions of secondary importance in the attention cannot activate the target region well, a modulation function is designed to increase the attention values of those regions and effectively highlight the target region;
(3) In the invention, the coarse class activation map is refined using the modulated attention; target regions in the resulting class activation map are activated more completely and accurately, effectively solving the problem of incomplete activation of the class activation map.
Drawings
FIG. 1 is a schematic diagram of a weak supervision semantic segmentation model based on attention fusion in an embodiment of the invention;
FIG. 2 is an exemplary graph of segmentation results on a PASCAL VOC 2012 validation set in accordance with an embodiment of the present invention;
fig. 3 is an exemplary diagram of segmentation results on the MS COCO 2014 verification set according to an embodiment of the present invention.
Detailed Description
The weak supervision semantic segmentation method based on attention fusion is used for weak supervision semantic segmentation tasks under image-level annotation. The overall structure of the framework is shown in fig. 1 and mainly comprises three parts: 1) extracting features with a Vision Transformer and generating a coarse class activation map; 2) an adaptive attention fusion module that adaptively assigns weights to the attention of different layers, so that the fused attention effectively suppresses background noise while preserving target details, together with a modulation function designed to increase the attention values of regions of secondary importance, which otherwise cannot activate the target region well; 3) refining the coarse class activation map with the modulated attention, so that the resulting final class activation map covers the target region more accurately and completely.
A weak supervision semantic segmentation method based on attention fusion is shown in FIG. 1, and comprises the following steps of S1, preparing a data set, wherein the data set comprises a training set, a verification set and a test set;
using the PASCAL VOC 2012 dataset and the MS COCO 2014 dataset as datasets, the PASCAL VOC 2012 dataset has 21 categories including 20 object classes and one background class; the MS COCO 2014 dataset has 81 categories, including 80 object classes and one background class.
The PASCAL VOC 2012 dataset includes a training set of 1464 images, a validation set of 1449 images, and a test set of 1456 images, wherein the training set is augmented with 10582 images of additional data; the MS COCO 2014 dataset included a training set consisting of 82081 images and a validation set consisting of 40137 images.
S2, preprocessing the data of the image in the data set, and specifically comprising the following sub-steps:
s2.1, carrying out random horizontal overturn and color dithering on an image, wherein the method for carrying out the color dithering on the image specifically comprises the following steps: setting the brightness, contrast and saturation value of the image to 0.3;
s2.2, performing normalization processing on the image, and adjusting the image size to 256×256;
s2.3, finally, randomly clipping the image, and adjusting the size of the image to 224 multiplied by 224.
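As a rough illustration (not the patented implementation), the S2 preprocessing pipeline can be sketched in NumPy; the normalization statistics and function names are illustrative assumptions, and color jittering is omitted for brevity:

```python
import numpy as np

def preprocess(img, rng=None):
    """Sketch of steps s2.1-s2.3 on an HxWx3 uint8 array (256x256 assumed).

    s2.1: random horizontal flip (color jittering omitted here);
    s2.2: normalize (the ImageNet mean/std below are assumed values);
    s2.3: random 224x224 crop.
    """
    rng = rng or np.random.default_rng(0)
    if rng.random() < 0.5:                      # random horizontal flip
        img = img[:, ::-1, :]
    x = img.astype(np.float32) / 255.0          # scale to [0, 1] before normalizing
    mean = np.array([0.485, 0.456, 0.406])      # common ImageNet statistics (assumed)
    std = np.array([0.229, 0.224, 0.225])
    x = (x - mean) / std
    top = rng.integers(0, 256 - 224 + 1)        # random 224x224 crop from 256x256
    left = rng.integers(0, 256 - 224 + 1)
    return x[top:top + 224, left:left + 224, :]

out = preprocess(np.zeros((256, 256, 3), dtype=np.uint8))
```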
S3, constructing a weak supervision semantic segmentation model based on attention fusion, and adopting a Data-efficient image Transformer (DeiT) pre-trained on the ImageNet image recognition dataset as the backbone of the model; step S3 comprises the following sub-steps:
s3.1, dividing the image subjected to data preprocessing in the step S2 into N non-overlapping blocks, constructing N block tokens through linear mapping, and splicing C class tokens and N block tokens to obtain an input token of the model;
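The tokenization of step s3.1 can be sketched as follows; the projection matrix and class tokens are random stand-ins for learned parameters, and the patch size, embedding dimension, and class count are illustrative assumptions:

```python
import numpy as np

def build_input_tokens(img, patch=16, dim=192, n_classes=20, rng=None):
    """Sketch of s3.1: split a 224x224x3 image into N non-overlapping patches,
    linearly map each patch to a block token, and prepend C class tokens."""
    rng = rng or np.random.default_rng(0)
    H, W, _ = img.shape
    n = (H // patch) * (W // patch)             # N = 14*14 = 196 for 224/16
    patches = img.reshape(H // patch, patch, W // patch, patch, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, patch * patch * 3)
    W_proj = rng.standard_normal((patch * patch * 3, dim)) * 0.02  # linear mapping (learned in practice)
    patch_tokens = patches @ W_proj
    class_tokens = rng.standard_normal((n_classes, dim)) * 0.02    # C learnable class tokens
    return np.concatenate([class_tokens, patch_tokens], axis=0)    # (C + N, dim) input token

tokens = build_input_tokens(np.zeros((224, 224, 3), dtype=np.float32))
```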
s3.2, inputting the input token into a Transformer coding layer of the weak supervision semantic segmentation model based on attention fusion to obtain an output token;
s3.3, extracting the last N block tokens from the output tokens to form the output block tokens, carrying out recombination operation and convolution operation on the output block tokens to obtain a rough class activation diagram,
Coarse-CAM=Conv(Reshape(Tp_out))
where Tp_out represents the output block token, reshape represents the reorganization operation, conv represents the convolution operation, and Coarse-CAM represents the Coarse class activation map;
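The Coarse-CAM = Conv(Reshape(Tp_out)) step can be sketched as below; the 1×1 convolution is modelled as a per-position linear projection with random stand-in weights, and the dimensions are assumed for illustration:

```python
import numpy as np

def coarse_cam(tp_out, n_classes=20):
    """Sketch of s3.3: reshape the (N, dim) output block tokens into a square
    feature map, then project to C class channels (a 1x1 convolution)."""
    n, dim = tp_out.shape
    side = int(n ** 0.5)                           # Reshape: N tokens -> side x side grid
    fmap = tp_out.reshape(side, side, dim)
    w = np.random.default_rng(0).standard_normal((dim, n_classes)) * 0.02
    return fmap @ w                                 # Conv (1x1): (side, side, C) coarse class activation map

cam = coarse_cam(np.zeros((196, 192)))
```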
s3.4, when the input token passes through the Transformer coding layer, attention is calculated on the input token by an Attention module to generate Attention, with the calculation formula:

Attention=softmax(QK^T/√d_k)

wherein Q and K respectively represent the Query matrix and Key matrix obtained by linear projection of the input token as it passes through the Transformer coding layer, d_k represents a scaling factor, and T represents the matrix transposition operation;
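The scaled dot-product attention of step s3.4 can be sketched as follows; the projection weights are random stand-ins for learned parameters:

```python
import numpy as np

def attention(tokens, dim_k=64, rng=None):
    """Sketch of s3.4: Attention = softmax(Q K^T / sqrt(d_k)), where Q and K
    are linear projections of the input tokens."""
    rng = rng or np.random.default_rng(0)
    d = tokens.shape[1]
    Wq = rng.standard_normal((d, dim_k)) * 0.02
    Wk = rng.standard_normal((d, dim_k)) * 0.02
    Q, K = tokens @ Wq, tokens @ Wk
    scores = Q @ K.T / np.sqrt(dim_k)              # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1

A = attention(np.ones((8, 32)))
```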
s3.5, each Transformer coding layer generates one attention map; after L Transformer coding layers, L attention maps are obtained and collectively denoted A; global average pooling is then applied to A, followed by a fully connected layer for information interaction, to generate the weights W:
W=FC(GAP(A))
GAP represents global average pooling operation, and FC represents a fully connected layer;
s3.6, multiplying the obtained weight W by A and fusing to obtain the final attention W';
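Steps s3.5 and s3.6 together can be sketched as below; the FC weights are random stand-ins, and reducing each layer's attention map to a single scalar via GAP is an assumed reading of the module:

```python
import numpy as np

def adaptive_fuse(A, rng=None):
    """Sketch of s3.5-s3.6: A has shape (L, T, T), one attention map per
    encoder layer. GAP reduces each layer to a scalar, an FC layer maps the
    L scalars to L weights W, and the fused attention W' is the weighted sum."""
    rng = rng or np.random.default_rng(0)
    L = A.shape[0]
    gap = A.mean(axis=(1, 2))                      # GAP(A): one scalar per layer, shape (L,)
    fc = rng.standard_normal((L, L)) * 0.1
    W = gap @ fc                                   # W = FC(GAP(A)), shape (L,)
    return (W[:, None, None] * A).sum(axis=0)      # fused attention W', shape (T, T)

fused = adaptive_fuse(np.ones((12, 10, 10)))
```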
s3.7, further dividing the final attention W' into class-to-block attention A_c2p and block-to-block attention A_p2p, and multiplying each of A_c2p and A_p2p by the modulation function G, which is given by:

G(x)=(1/√(2πσ))e^(-(x-μ)²/(2σ))

wherein x represents the input, μ represents the mean of the input, σ represents the variance of the input, e represents the exponential function, and π represents the circumference ratio;
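Reading G as a Gaussian density over the attention values (an assumption consistent with the symbols above, with σ taken as the variance), the modulation function can be sketched as:

```python
import numpy as np

def modulation(x):
    """Sketch of the modulation function G in s3.7: a Gaussian over the
    attention values, peaking near the mean, so values of secondary
    importance receive the largest boost."""
    mu = x.mean()                                  # mean of the input
    var = x.var()                                  # variance of the input (sigma)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

g = modulation(np.array([0.1, 0.5, 0.9]))
```

Note how the middle value (nearest the mean, i.e. of "secondary importance") receives the largest modulation, matching the stated goal of boosting such regions.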
s3.8, optimizing the rough class activation diagram sequentially by using the modulated class-to-block attention and the modulated block-to-block attention, and multiplying the rough class activation diagram Coarse-CAM with the modulated class-to-block attention element by element to obtain a primarily optimized class activation diagram; and then the class activation map is further optimized by matrix multiplication by using the modulated block-to-block attention, and a Final class activation map Final-CAM is obtained.
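The two-stage refinement of s3.8 can be sketched as follows; the flattened (classes × patches) layout and the identity attention used in the example are illustrative assumptions:

```python
import numpy as np

def refine_cam(coarse, a_c2p, a_p2p):
    """Sketch of s3.8: coarse is (C, N) with one row per class over N patches,
    a_c2p is (C, N) modulated class-to-block attention, a_p2p is (N, N)
    modulated block-to-block attention."""
    cam = coarse * a_c2p                           # step 1: element-wise product with class-to-block attention
    return cam @ a_p2p                             # step 2: matrix product propagates activations between blocks

final = refine_cam(np.ones((20, 196)), np.ones((20, 196)), np.eye(196))
```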
S4, training the weak supervision semantic segmentation model based on attention fusion for multiple rounds, and storing the best parameters corresponding to the best round of training results by observing the verification results;
The model-related hyper-parameters are set as follows: the number of training epochs is set to 60, the training batch size to 64, the optimizer used during training is AdamW, the loss function is multi-label cross-entropy loss, and the initial learning rate is set to 4e-4.
S5, loading the stored best parameters into a weak supervision semantic segmentation model based on attention fusion, and then inputting test set data into the model to generate a complete class activation diagram;
loading the stored best parameters into the weak supervision semantic segmentation model based on attention fusion, inputting the pictures in the verification set and the test set into the model, then calculating the mean intersection-over-union (Mean Intersection over Union, MIoU), and measuring the semantic segmentation performance of the model on the PASCAL VOC 2012 and MS COCO 2014 datasets by the obtained MIoU value; the segmentation results are shown in figs. 2 and 3.
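The MIoU metric used for evaluation can be sketched as below; averaging only over classes that appear in either map is a common convention assumed here:

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Sketch of mean Intersection-over-Union: per-class
    IoU = |pred ∩ gt| / |pred ∪ gt|, averaged over classes present
    in either the prediction or the ground truth."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

score = mean_iou(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), 2)
```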
In summary, the invention aims to solve the problem that the target region cannot be fully activated in weak supervision semantic segmentation. The embodiment provides an Adaptive Attention Fusion (AAF) module to measure the importance of the attention of each layer to the class activation map and thereby assign weights to the attention of different layers. To ensure that the adaptive attention fusion module estimates accurate weights, the embodiment adopts an end-to-end training strategy for the module. During training, the weights estimated by the adaptive attention fusion module are applied to the attention of the different layers. The attention fused in this way better suppresses background noise while preserving target detail.
In addition, experiments show that although regions of secondary importance in the attention can activate the target region during refinement, their attention values are small, so some regions that should be activated are not activated well. A modulation function is therefore designed to increase the attention values of these secondary regions so that more of the target region can be activated. Finally, the coarse class activation map is refined with the modulated attention to obtain a class activation map that covers more of the target region while effectively suppressing background noise, ultimately yielding high-quality pseudo labels for training a semantic segmentation network.
In addition to the embodiments described above, other embodiments of the invention are possible. All technical schemes formed by equivalent substitution or equivalent transformation fall within the protection scope of the invention.
Claims (9)
1. A weak supervision semantic segmentation method based on attention fusion is characterized by comprising the following steps of: comprises the following steps
S1, preparing a data set, wherein the data set comprises a training set, a verification set and a test set;
s2, carrying out data preprocessing on the images in the data set;
s3, constructing a weak supervision semantic segmentation model based on attention fusion, and adopting a Data-efficient image Transformer (DeiT) pre-trained on the ImageNet image recognition dataset as the backbone of the model; step S3 comprises the following sub-steps:
s3.1, dividing the image subjected to data preprocessing in the step S2 into N non-overlapping blocks, constructing N block tokens through linear mapping, and splicing C class tokens and N block tokens to obtain an input token of the model;
s3.2, inputting the input token into a Transformer coding layer of the weak supervision semantic segmentation model based on attention fusion to obtain an output token;
s3.3, extracting the last N block tokens from the output tokens to form the output block tokens, carrying out recombination operation and convolution operation on the output block tokens to obtain a rough class activation diagram,
Coarse-CAM=Conv(Reshape(Tp_out))
where Tp_out represents the output block token, reshape represents the reorganization operation, conv represents the convolution operation, and Coarse-CAM represents the Coarse class activation map;
s3.4, when the input token passes through the Transformer coding layer, attention is calculated on the input token by an Attention module to generate Attention, with the calculation formula:

Attention=softmax(QK^T/√d_k)

wherein Q and K respectively represent the Query matrix and Key matrix obtained by linear projection of the input token as it passes through the Transformer coding layer, d_k represents a scaling factor, and T represents the matrix transposition operation;
s3.5, each Transformer coding layer generates one attention map; after L Transformer coding layers, L attention maps are obtained and collectively denoted A; global average pooling is then applied to A, followed by a fully connected layer for information interaction, to generate the weights W:
W=FC(GAP(A))
GAP represents global average pooling operation, and FC represents a fully connected layer;
s3.6, multiplying the obtained weight W by A and fusing to obtain the final attention W';
s3.7, further dividing the final attention W' into class-to-block attention A_c2p and block-to-block attention A_p2p, and multiplying each of A_c2p and A_p2p by the modulation function G;
s3.8, optimizing the rough class activation diagram sequentially by using the modulated class-to-block attention and the modulated block-to-block attention to obtain a final class activation diagram;
s4, training the weak supervision semantic segmentation model based on attention fusion for multiple times, and storing the best parameters corresponding to the best round of results of training;
s5, loading the stored best parameters into a weak supervision semantic segmentation model based on attention fusion, and then inputting test set data into the model to generate a complete class activation diagram.
2. The weak supervision semantic segmentation method based on attention fusion according to claim 1, wherein: in the step S1, using the PASCAL VOC 2012 dataset and the MS COCO 2014 dataset as datasets, the PASCAL VOC 2012 dataset has 21 categories including 20 object categories and one background category; the MS COCO 2014 dataset has 81 categories, including 80 object classes and one background class.
3. The weak supervision semantic segmentation method based on attention fusion according to claim 2, wherein: the PASCAL VOC 2012 dataset comprises a training set of 1464 images, a validation set of 1449 images, and a test set of 1456 images, wherein the training set is 10582 images augmented with additional data; the MS COCO 2014 dataset included a training set consisting of 82081 images and a validation set consisting of 40137 images.
4. The weak supervision semantic segmentation method based on attention fusion according to claim 1, wherein: the step S2 comprises the following sub-steps:
s2.1, applying random horizontal flipping and color jittering to the image;
s2.2, normalizing the image and resizing it to 256×256;
s2.3, finally, randomly cropping the image and resizing it to 224×224.
5. The weak supervision semantic segmentation method based on attention fusion according to claim 4, wherein: in the step S2.1, the color jittering of the image is performed as follows: the brightness, contrast and saturation values of the image are all set to 0.3.
6. The weak supervision semantic segmentation method based on attention fusion according to claim 1, wherein: in the step S3.7, the modulation function G is represented by the following formula:
where x represents the input, μ represents the average of the input, σ represents the variance of the input, e represents the exponential function, and pi represents the circumference ratio.
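The formula for G itself is given as a figure in the patent and is not reproduced here; however, the listed symbols (input x, mean μ, variance σ, e, and π) are exactly those of a Gaussian density evaluated on the input's own statistics, which is the reading this hedged sketch assumes:

```python
import numpy as np

def modulate(x: np.ndarray) -> np.ndarray:
    """Assumed form of the modulation function G from claim 6: a Gaussian
    density over the input's own mean and variance. The exact formula in the
    patent figure may differ; this is one formula consistent with the symbol
    list (x, mu, sigma, e, pi) given in the claim."""
    mu = x.mean()
    var = x.var()
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

a = modulate(np.array([0.0, 1.0, 2.0, 3.0]))
print(a)
```

Under this reading, values close to the input mean are weighted most strongly, and the output is symmetric around the mean.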
7. The weak supervision semantic segmentation method based on attention fusion according to claim 1, wherein: in the step S3.8, the optimization of the coarse activation map specifically comprises: first multiplying the Coarse class activation map Coarse-CAM element-wise with the modulated class-to-block attention to obtain a preliminarily optimized class activation map; then further optimizing this class activation map by matrix multiplication with the modulated block-to-block attention, obtaining the Final class activation map Final-CAM.
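The two refinement steps of S3.8 can be sketched with dummy tensors. The shapes are assumptions for illustration (C classes, N transformer blocks/patch tokens): the class-to-block attention has the same shape as the CAM for the element-wise step, and the block-to-block attention is an N×N affinity applied by matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 20, 196  # assumed: 20 object classes, 14x14 = 196 patch tokens

coarse_cam = rng.random((C, N))   # Coarse-CAM: one activation row per class
cls2blk = rng.random((C, N))      # modulated class-to-block attention
blk2blk = rng.random((N, N))
blk2blk /= blk2blk.sum(axis=1, keepdims=True)  # row-normalized affinity

prelim = coarse_cam * cls2blk     # step 1: element-wise refinement
final_cam = prelim @ blk2blk      # step 2: propagate activations between blocks

print(final_cam.shape)
```

The matrix multiplication spreads each class's activation to blocks that attend strongly to each other, which is what lets the final CAM cover object regions the coarse CAM missed.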
8. The weak supervision semantic segmentation method based on attention fusion according to claim 1, wherein: in the step S4, model-related hyperparameters are set as follows: the number of training epochs is set to 60, the training batch size to 64, the optimizer to AdamW, the loss function to multi-label cross-entropy loss, and the initial learning rate to 4e-4.
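Of these settings, the multi-label cross-entropy loss is the one worth making concrete: with image-level labels, each class is treated as an independent binary prediction. A minimal numpy sketch (in practice this would be PyTorch's BCEWithLogitsLoss paired with torch.optim.AdamW; the dummy logits and targets below are illustrative only):

```python
import numpy as np

# Hyperparameters from claim 8.
EPOCHS, BATCH_SIZE, LR = 60, 64, 4e-4

def multilabel_bce(logits: np.ndarray, targets: np.ndarray) -> float:
    """Multi-label cross-entropy: binary cross-entropy per class, averaged."""
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per class (not softmax)
    eps = 1e-12
    per_class = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return float(per_class.mean())

logits = np.array([[2.0, -1.0], [0.5, 0.5]])   # dummy batch of 2, 2 classes
targets = np.array([[1.0, 0.0], [1.0, 1.0]])   # image-level multi-hot labels
loss = multilabel_bce(logits, targets)
print(round(loss, 4))
```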
9. The weak supervision semantic segmentation method based on attention fusion according to claim 1, wherein: in the step S5, the stored best parameters are loaded into the weak supervision semantic segmentation model based on attention fusion, the images in the verification set and the test set are input into the model, the mean intersection-over-union MIoU is then calculated, and the semantic segmentation performance of the weak supervision semantic segmentation model based on attention fusion on the PASCAL VOC 2012 and MS COCO 2014 datasets is measured by the obtained MIoU value.
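The MIoU metric of step S5 can be sketched as follows. This is a generic per-class IoU average on dense label maps, not the patent's evaluation code; the tiny 2×2 prediction/ground-truth pair is a made-up example.

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union, averaged over classes present in pred or GT."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])  # predicted label map
gt = np.array([[0, 1], [1, 1]])    # ground-truth label map
score = miou(pred, gt, num_classes=2)
print(round(score, 4))  # class 0: 1/2, class 1: 2/3, mean 7/12
```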
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310981553.9A CN116912501A (en) | 2023-08-04 | 2023-08-04 | Weak supervision semantic segmentation method based on attention fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116912501A true CN116912501A (en) | 2023-10-20 |
Family
ID=88362941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310981553.9A Pending CN116912501A (en) | 2023-08-04 | 2023-08-04 | Weak supervision semantic segmentation method based on attention fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116912501A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118154884A (en) * | 2024-05-13 | 2024-06-07 | 山东锋士信息技术有限公司 | Weak supervision image semantic segmentation method based on sample mixing and contrast learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||