CN114818920A - Weak supervision target detection method based on double attention erasing and attention information aggregation - Google Patents

Weak supervision target detection method based on double attention erasing and attention information aggregation Download PDF

Info

Publication number
CN114818920A
CN114818920A (application CN202210444165.2A); granted publication CN114818920B
Authority
CN
China
Prior art keywords
attention
channel
branch
target
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210444165.2A
Other languages
Chinese (zh)
Other versions
CN114818920B (en)
Inventor
龚声蓉
宋鹏鹏
应文豪
王朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changshu Institute of Technology
Original Assignee
Changshu Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changshu Institute of Technology filed Critical Changshu Institute of Technology
Priority to CN202210444165.2A priority Critical patent/CN114818920B/en
Publication of CN114818920A publication Critical patent/CN114818920A/en
Application granted granted Critical
Publication of CN114818920B publication Critical patent/CN114818920B/en
Legal status: Active (granted)

Classifications

    • G06F 18/2155 — Generating training patterns; bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/24 — Classification techniques
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods


Abstract

The invention discloses a weakly supervised target detection method based on double attention erasing and attention information aggregation. First, image features are extracted while a selective search algorithm extracts target candidate regions from the original image. The features are fed into an attention information aggregation network, which extracts global and local information from the target feature channels and constructs spatial information for different targets to enhance the feature map. The enhanced feature map is then fed into a double attention erasing network, which erases the salient local foreground attention region together with the background attention region, while a parallel Sigmoid operation generates an enhancement map. The convolution features of the final feature map and the candidate regions are input into a spatial pyramid pooling layer followed by two serially connected fully-connected layers, yielding a feature vector for each candidate box; this vector is then fed into a multiple-instance branch, optimization branches and a distillation branch to refine the detection result. The method addresses the over-emphasis of salient target regions in weakly supervised scenarios and improves detection accuracy.

Description

Weak supervision target detection method based on double attention erasing and attention information aggregation
Technical Field
The invention relates to a target detection method, in particular to a weakly supervised target detection method based on double attention erasing and attention information aggregation.
Background
Object detection is one of the hot problems in the field of computer vision. Fully supervised target detection based on deep learning requires a time-consuming and labor-intensive process of preparing a large amount of completely annotated data, and the annotation process may also introduce noise due to human factors. Weakly supervised detection methods fall mainly into two types: traditional detection methods based on multiple-instance learning, and end-to-end multiple-instance network detection methods. The main process of both is to first generate a large number of candidate regions and then apply multiple-instance learning to those regions. Although the traditional methods based on multiple-instance learning are fast, most rely on hand-crafted features that are not robust, so they are complex to operate and their detection accuracy is unsatisfactory.
Benefiting from the strong feature extraction capability of deep convolutional neural networks, more and more work adopts end-to-end multiple-instance detection networks, which markedly improve the accuracy of weakly supervised detection, require no hand-crafted features, and greatly simplify the detection process. However, since such methods are built on classification networks, which tend to extract the most discriminative local features of a target, the high-response regions of the target features concentrate in those areas, and in certain detection scenarios the model is prone to falling into a local minimum, i.e., it stabilizes on the most salient local target region, such as the head or tail of a non-rigid target. In weakly supervised detection, attending only to the most salient region of the target is not enough; how to make the model attend to the whole region of the target, and thereby further improve detection accuracy, is a critical problem to be solved urgently.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a weakly supervised target detection method based on double attention erasing and attention information aggregation, which addresses the problem that for non-rigid targets the model attends excessively to salient local regions instead of the whole object.
The technical scheme of the invention is as follows: a weak supervision target detection method based on double attention erasing and attention information aggregation comprises the following steps:
firstly, extracting features of an input image and simultaneously extracting a target candidate region of the input image by adopting a selective search algorithm;
step two, the features obtained in the step one are sent to an attention information aggregation network, global and local information of a target feature channel is extracted, and spatial information is constructed for different targets to obtain enhanced features;
step three, sending the enhanced features obtained in the step two into a double attention erasing network, where after an averaging calculation a first channel erases the salient local foreground attention region, so as to seek the whole target, while also erasing the background attention region, and a second channel performs a Sigmoid function operation; the result of the first or the second channel is randomly selected and multiplied element-wise with the enhanced features obtained in step two to produce the output;
and step four, after convolution, inputting the output obtained in the step three and the candidate area obtained in the step one into a spatial pyramid pooling layer, inputting two layers of fully-connected layers connected in series, outputting to obtain a feature vector of each candidate frame, and then sending the feature vector into a multi-example branch, an optimization branch and a distillation branch to be refined to obtain a detection result.
Further, feature extraction in step one uses the first four modules of the VGG16 network, with the max pooling layer of the last of these modules removed before extraction.
Further, the convolution in step four is a 3×3 dilated (hole) convolution.
Further, the second step specifically includes: performing global average pooling on the input features $F_b$ and applying channel attenuation to generate the global channel vector $f_{global}$; applying channel attenuation to the input features to generate the local channel information $f_{local}$; the output is

$$F_c = F_b \otimes \sigma(f_{global} \oplus f_{local}),$$

where $\sigma$, $\oplus$ and $\otimes$ denote the Sigmoid function, broadcast addition and element-by-element multiplication, respectively; performing channel average pooling and channel maximum pooling on the input features simultaneously, concatenating in the channel dimension, convolving, and passing through a Sigmoid function to obtain the spatial information $M_s$; the enhanced output of the attention information aggregation network is

$$F_e = F_c \oplus (F_c \otimes M_s).$$
Further, the operation of the first channel in step three includes setting thresholds $T_{fg}$ and $T_{bg}$: elements of $M_{sa}$ greater than $T_{fg}$ are set to 0 and the others to 1, generating the foreground erasure mask $M_{fg}$; elements of $M_{sa}$ less than the threshold $T_{bg}$ are set to 0 and the rest to 1, generating the background erasure mask $M_{bg}$. The total erasure mask of the first channel is

$$M_{drop} = M_{fg} \otimes M_{bg},$$

where $M_{sa}$ is obtained by averaging the enhanced features over the channel dimension.
Further, $T_{fg} = \lambda_{fg} \cdot \max(M_{sa})$, $T_{bg} = \lambda_{bg} \cdot \max(M_{sa})$, with $\lambda_{fg} \in [0,1]$ and $\lambda_{bg} \in [0,1]$.
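Under the definitions above, the first-channel mask generation can be sketched as follows (a NumPy illustration, not the patent's actual code; the function and argument names are assumptions, and the default threshold ratios follow the values used later in the experiments):

```python
import numpy as np

def erasure_masks(M_sa, lambda_fg=0.8, lambda_bg=0.05):
    """Illustrative sketch of the first-channel mask generation.

    M_sa: self-attention map of shape (H, W), obtained by averaging the
    enhanced features over the channel dimension.
    Returns the total erasure mask M_drop = M_fg * M_bg.
    """
    T_fg = lambda_fg * M_sa.max()              # foreground threshold T_fg
    T_bg = lambda_bg * M_sa.max()              # background threshold T_bg
    M_fg = np.where(M_sa > T_fg, 0.0, 1.0)     # erase most salient foreground
    M_bg = np.where(M_sa < T_bg, 0.0, 1.0)     # erase background attention
    return M_fg * M_bg                         # element-wise product M_drop
```

For example, with `M_sa = [[1.0, 0.5], [0.02, 0.9]]`, the thresholds become 0.8 and 0.05, so the top-left and bottom-right elements are erased as salient foreground and the bottom-left as background.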
Further, the supervision information of the first optimization branch comes from the multiple-instance branch, the supervision information of each remaining optimization branch comes from the previous optimization branch, and the supervision information of the distillation branch is the average of the outputs of all optimization branches.
The technical scheme provided by the invention has the advantages that:
for the extracted local saliency area of the target feature, attention erasure is introduced into a double attention erasure network, and the high response area of the target is expanded by erasing the most salient local foreground area and the background attention area, so that the whole network model can pay attention to the whole area of the target as much as possible, the network is prevented from concentrating attention on the background area, and the performance accuracy of classification is maintained. In addition, in order to generate the erasure mask more accurately, the attention information aggregation network may extract global and local information of the target feature channel, and construct spatial information for different targets to enhance the feature map and generate a more accurate attention erasure mask, thereby further improving the detection accuracy. The double attention erasing network and the attention information aggregation network are plug-and-play sub-networks, and the two sub-networks are mutually cooperated, are easy to transplant, realize and integrate into other networks to solve the problem that the detection performance is seriously influenced due to the prominent target significance region in a weak supervision scene.
Drawings
Fig. 1 is a schematic structural diagram of a target detection model of a weakly supervised target detection method based on double attention erasure and attention information aggregation according to the present invention.
Fig. 2 is a schematic structural diagram of an attention information aggregation network.
Fig. 3 is a schematic structural diagram of a dual-attention erasure network.
FIG. 4 is a schematic diagram of the structure of a multi-instance learning branch, optimization branch and distillation branch.
FIG. 5 is a visualization of the high response area feature map of the present invention on some non-rigid targets.
Detailed Description
The present invention is further described by the following examples, which are intended to be illustrative only and not limiting as to the scope of the invention; equivalent modifications made within the scope of the appended claims likewise fall within the scope of protection of the invention.
The method for weakly supervised target detection based on double attention erasing and attention information aggregation comprises establishing a target detection model, training the model on sample data, and detecting input images with the trained model. Referring to fig. 1, the process of target detection based on this model is as follows:
for an input image with only image-level labels, the corresponding class label is
Figure BDA0003615900430000031
Where C is the total number of classes. For a picture, if y c (1. ltoreq. C. ltoreq. C) 1, then it contains the target with the classmark C. The candidate region of the input image is represented as R ═ { R ═ R 1 ,R 2 ,...,R N N is the number of candidate regions.
The specific steps are as follows. Step one: an image with only a category label is input into the feature extraction network; the first four modules of the network (with the max pooling layer of the fourth module removed) extract features with 512 channels, and candidate regions of the input image are generated in advance using the selective search algorithm. The feature extraction network consists of five modified modules of the VGG16 network, modified as follows: the structure of the first four modules is kept unchanged (except that the max pooling layer of the fourth module is removed), the convolution layers of the fifth module are retained, and the serially connected attention information aggregation network and double attention erasing network are inserted between the convolution layers of the fourth and fifth modules. As a preferred embodiment, to protect the features of small objects, a 3×3 dilated convolution layer with dilation rate 2 replaces the convolution layer of the fifth module.
Step two: the features obtained in step one are sent into the attention information aggregation network (AIA), whose overall architecture is shown in figure 2. Specifically, the input features $F_b \in \mathbb{R}^{C \times H \times W}$ first undergo Global Average Pooling (GAP) to produce a per-channel descriptor, which then passes through a channel attention module consisting of two 1×1 convolution operations that first reduce and then restore the number of channels; a channel attenuation rate r is set in this process to reduce the computational cost, finally generating the global channel vector $f_{global} \in \mathbb{R}^{C \times 1 \times 1}$ as output. At the same time, $F_b$ is passed directly through the channel attention module to obtain the local channel information $f_{local} \in \mathbb{R}^{C \times H \times W}$. The channel weights are obtained from these two results by combining $f_{global}$ and $f_{local}$ with broadcast addition and then applying the Sigmoid function. The final output is

$$F_c = F_b \otimes \sigma(f_{global} \oplus f_{local}),$$

where $\sigma$, $\oplus$ and $\otimes$ denote the Sigmoid function, broadcast addition and element-by-element multiplication, respectively.
In addition, constructing spatial information for the features lets the network focus on which regions, and which positions within them, carry the key information, which is very helpful for weakly supervised detection, where position labels are absent during training. Therefore, to further improve localization of object regions, the invention constructs target spatial information for the input feature map. Channel Average Pooling (CAP) and Channel Maximum Pooling (CMP) operate on the input features simultaneously, and their outputs are concatenated in the channel dimension. The concatenation is then passed through a 7×7 convolution layer, and the spatial information $M_s$ is obtained through a Sigmoid function. However, simply multiplying in the spatial information would hurt model accuracy, because the weight output by the Sigmoid is a normalized map, so the response of the feature map becomes weak after element-wise multiplication. Therefore the element-wise product is computed first and the original features are then added back; the enhanced features output by the attention information aggregation module are

$$F_e = F_c \oplus (F_c \otimes M_s).$$
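As an illustrative sketch only (a NumPy stand-in for the PyTorch implementation, with the channel-attention 1×1 convolutions reduced to matrix multiplies, the 7×7 spatial convolution simplified to an average of the two pooled maps, and all function and parameter names being assumptions rather than the patent's code), the attention information aggregation computation might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aia_forward(F, W_down, W_up):
    """Sketch of the AIA module. F: input features of shape (C, H, W).
    W_down (C x C/r) and W_up (C/r x C) emulate the two 1x1 convolutions
    with channel attenuation rate r."""
    C, H, W = F.shape
    # global channel vector f_global: GAP, then channel attenuation
    gap = F.mean(axis=(1, 2))                               # (C,)
    f_global = np.maximum(gap @ W_down, 0) @ W_up           # (C,)
    # local channel information f_local: per-position channel attenuation
    flat = F.reshape(C, -1).T                               # (H*W, C)
    f_local = (np.maximum(flat @ W_down, 0) @ W_up).T.reshape(C, H, W)
    # F_c = F ⊗ σ(f_global ⊕ f_local), broadcast addition over space
    F_c = F * sigmoid(f_global[:, None, None] + f_local)
    # spatial information M_s: channel average + channel max pooling,
    # (7x7 convolution omitted in this sketch), then Sigmoid
    cap = F_c.mean(axis=0)
    cmp_ = F_c.max(axis=0)
    M_s = sigmoid((cap + cmp_) / 2.0)
    # enhanced output: element-wise product first, then add original features
    return F_c + F_c * M_s
```

With C = 4 and attenuation rate r = 2, `W_down` has shape (4, 2) and `W_up` shape (2, 4); the output keeps the input shape (C, H, W).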
and step three, sending the enhanced features obtained in the step two into a double attention erasure network (DAE). The overall structure of the dual attention erasure network is shown in fig. 3.
Two thresholds are introduced in the double attention erasing network, namely the foreground threshold $\lambda_{fg} \in [0,1]$ and the background threshold $\lambda_{bg} \in [0,1]$. The self-attention map $M_{sa} \in \mathbb{R}^{H \times W}$ is first obtained by averaging the input features $F \in \mathbb{R}^{C \times H \times W}$ over the channel dimension, where C, H and W are the number of channels, height and width, respectively. Elements of $M_{sa}$ greater than $T_{fg}$ are then set to 0 and the others to 1, generating the foreground erasure mask $M_{fg}$, where $T_{fg} = \lambda_{fg} \cdot \max(M_{sa})$. In contrast to the foreground attention erasure, the background erasure mask $M_{bg}$ is obtained by setting elements of $M_{sa}$ less than the threshold $T_{bg}$ to 0 and the others to 1, where $T_{bg} = \lambda_{bg} \cdot \max(M_{sa})$. The total erasure mask is therefore

$$M_{drop} = M_{fg} \otimes M_{bg},$$

where $\otimes$ denotes multiplication of corresponding elements.
In addition, another branch is introduced into the double attention erasing network: a Sigmoid function is applied to $M_{sa}$ to generate the enhancement map $M_{em}$, which maintains classification performance. By setting $\lambda_{drop\_rate}$, the network model randomly selects between $M_{drop}$ and $M_{em}$, deciding whether to apply attention erasing; erasing helps localization performance, while the other branch plays an important role in classification performance. Finally, the selected result is applied to the original input of the double attention erasing network by element-wise product to obtain the output feature map.
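A minimal NumPy sketch of the two-channel selection in the double attention erasing network follows (the names, the use of a uniform random draw, and the default threshold values are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def dae_forward(F, lambda_fg=0.8, lambda_bg=0.05, drop_rate=0.75, rng=None):
    """Sketch of the double attention erasing network.
    F: enhanced features of shape (C, H, W)."""
    if rng is None:
        rng = np.random.default_rng()
    M_sa = F.mean(axis=0)                          # self-attention map (H, W)
    if rng.random() < drop_rate:                   # first channel: erasing
        T_fg = lambda_fg * M_sa.max()
        T_bg = lambda_bg * M_sa.max()
        M_fg = np.where(M_sa > T_fg, 0.0, 1.0)     # erase salient foreground
        M_bg = np.where(M_sa < T_bg, 0.0, 1.0)     # erase background
        mask = M_fg * M_bg                         # M_drop
    else:                                          # second channel: enhancing
        mask = 1.0 / (1.0 + np.exp(-M_sa))         # M_em = Sigmoid(M_sa)
    return F * mask                                # element product with input
```

`drop_rate` plays the role of $\lambda_{drop\_rate}$: with probability 0.75 the erasing branch is chosen, otherwise the Sigmoid enhancement branch.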
Step four: the output feature map from step three is sent to the convolution layer of the fifth module of the feature extraction network (i.e. the 3×3 dilated convolution layer with dilation rate 2) to obtain convolution features; these and the candidate regions from step one are input into the spatial pyramid pooling layer and then into two serially connected fully-connected layers with 4096 channels, yielding a candidate-region feature vector with 4096 channels, which is then sent into the multiple-instance branch, the optimization branches and the distillation branch to further refine the detection result, as shown in fig. 4. The candidate-region feature vectors generated by the second fully-connected layer enter the multiple-instance learning branch and, simultaneously, the K optimization branches and the distillation branch. All branches are structurally identical, but the supervision information used in training differs.
Specifically, in the multiple-instance learning branch, the candidate-region feature vectors pass through the classification and detection streams of the multiple-instance detection network, producing the matrices $x^{class} \in \mathbb{R}^{C \times N}$ and $x^{det} \in \mathbb{R}^{C \times N}$, respectively. The scores of all candidate regions are obtained by multiplying corresponding elements, $x^0 = \sigma_{class}(x^{class}) \odot \sigma_{det}(x^{det})$, where $\sigma$ denotes the Softmax function (applied over classes in the classification stream and over candidate regions in the detection stream). Finally, the classification score of any class c is obtained by summing the scores associated with c over all candidate regions:

$$\varphi_c = \sum_{r=1}^{N} x^0_{c,r}.$$
The multi-class cross-entropy loss of the multiple-instance learning (MIL) branch may be defined as:

$$L_{mil} = -\sum_{c=1}^{C} \left[ y_c \log \varphi_c + (1 - y_c) \log(1 - \varphi_c) \right].$$
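A hedged NumPy sketch of the multiple-instance learning branch scoring and loss (the function names are illustrative; the Softmax axes follow the standard two-stream multiple-instance detection formulation assumed above):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_branch(x_class, x_det, y):
    """Sketch of the MIL branch.
    x_class, x_det: (C, N) score matrices from the classification and
    detection streams; y: (C,) binary image-level labels."""
    # Softmax over classes (classification stream) and over candidate
    # regions (detection stream), then element-wise product
    x0 = softmax(x_class, axis=0) * softmax(x_det, axis=1)   # (C, N)
    # image-level score per class: sum over candidate regions
    phi = np.clip(x0.sum(axis=1), 1e-6, 1 - 1e-6)            # (C,)
    # multi-class cross-entropy loss L_mil
    loss = -np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))
    return x0, loss
```

Because each row of the detection-stream Softmax sums to 1 and every classification-stream entry is at most 1, each $\varphi_c$ lies in (0, 1), so the cross-entropy is well defined.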
the output of the multi-instance branch is optimized by introducing an optimization branch and a distillation branch.
In the optimization branches, background label information is also considered, so the output of each optimization branch is represented as $x^k \in \mathbb{R}^{(C+1) \times N}$. Except for the first branch, whose supervision comes from the multiple-instance learning branch, the supervision of the k-th branch comes from the (k-1)-th branch. The candidate region with the highest score for any class is labeled with that class; other candidate regions with sufficiently high IoU with it receive the same label, and all remaining candidate regions are used as background or ignored. The supervision information of the r-th candidate region from the (k-1)-th optimization branch is therefore denoted $y^{k-1}_r$.
It serves as the pseudo-label for the k-th optimization branch. The multi-class cross-entropy loss of the k-th optimization branch is defined as

$$L_{opt}^{k} = -\frac{1}{N} \sum_{r=1}^{N} \sum_{c=1}^{C+1} w_r^k \, y_{c,r}^{k-1} \log x_{c,r}^{k},$$

where $w_r^k$ is the loss weight of the r-th candidate region in the k-th optimization branch, used to suppress the influence of unreliable candidate-region scores early in training.
For the distillation branch, the pseudo-label is obtained by averaging the outputs of the K optimization branches. The loss function $L_{distill}$ of the distillation branch has the same form as that of the optimization branches. The total loss function may be defined as:

$$L = L_{mil} + \sum_{k=1}^{K} L_{opt}^{k} + L_{distill}.$$
experiments on PASCAL VOC 2007 and VOC2012 are carried out on two widely used and challenging data sets, both of which comprise 20 target classes, the results of the weak supervision target detection are tested, and the validity of the method is verified. For VOC 2007, it contained 24640 labeled objects and 9963 images (of which 5011 images belonged to the training validation set trainval and 4952 images belonged to the test set test). For VOC2012, it contains 22531 images (of which 11540 images belong to the training verification set and 10991 images belong to the test set). For each data set, the experiment was trained on the training validation set and the test set was evaluated for test results.
Two indicators were used for evaluation: mean Average Precision (mAP) and Correct Localization (CorLoc). mAP measures the detection accuracy of the detector, while CorLoc, the percentage of correctly localized images out of the total, measures localization accuracy. Following the PASCAL protocol, a predicted box is considered correct only if its IoU with the ground-truth box is greater than 50%. For fair comparison, the method evaluates mAP on the test set and CorLoc on the training-validation set.
Experiment hardware environment: Ubuntu 16.04, a Tesla P100 graphics card with 16 GB of video memory. Code environment: Python 3.6, PyTorch 1.4. The whole network is built on the baseline model boost-OICR, with a VGG16 backbone pre-trained on the ImageNet data set. For fairness, all settings are the same as those of the baseline model. Initial candidate regions were generated for each image using Selective Search. In the training phase, K was set to 3 to optimize the instance classifiers. For the double attention erasing network, following the baseline model, $\lambda_{fg}$ was set to 80%, $\lambda_{drop\_rate}$ to 75%, and $\lambda_{bg}$ to 5%. In the attention information aggregation network, the channel attenuation rate r was set to 16. During evaluation, the final result is the average output of the optimization and distillation branches. Only image-level labels were used for training, without bounding-box annotations, but full annotations were used for evaluation at test time.
The invention compares the high-response-region feature maps of some non-rigid targets against boost-OICR; the visualization is shown in fig. 5. The feature maps of the conv5-3 layer of the VGG16 backbone were extracted and the salient features visualized. The high-response area of boost-OICR concentrates on the most salient area of the non-rigid object, leading to incomplete detection results, whereas the present method effectively expands the most salient area and activates other, less salient areas so as to localize the whole target as far as possible. In fig. 5: (a) original image; (b) visualization result of boost-OICR; (c) visualization result of the method of the invention.
In addition, the method was experimentally compared with other recent weakly supervised target detection methods on the VOC 2007 and VOC 2012 data sets; mAP and CorLoc on VOC 2007 are shown below.
[Table: per-class mAP and CorLoc results on PASCAL VOC 2007 (table image in the original document)]
As the table shows, the method achieves 50.5% mAP and 66.6% CorLoc. In particular, for non-rigid targets such as "cat", "dog", "horse" and "people", the method improves mAP over boost-OICR by 19.6%, 5.5%, 3.6% and 11.3% respectively, which fully demonstrates that it can effectively extend the salient local area of the target to perceive the whole target. The detection results on the VOC 2012 data set are shown in the following table: the method achieves 47.4% mAP and 67.3% CorLoc, competitive with other recent weakly supervised detection methods.
Method mAP CorLoc
OICR 37.9 62.1
PCL 40.6 63.2
WSRPN 40.8 64.9
C-WSL 43.0 64.9
SDCN 43.5 67.9
C-MIL 46.7 67.4
UWSOD 45.1 65.2
BOICR 46.3 65.8
OIM 45.3 67.1
Ji et al. 46.9 67.4
Ours 47.4 67.3

Claims (7)

1. A weak supervision target detection method based on double attention erasing and attention information aggregation is characterized by comprising the following steps:
firstly, extracting features of an input image and simultaneously extracting a target candidate region of the input image by adopting a selective search algorithm;
step two, the features obtained in the step one are sent to an attention information aggregation network, global and local information of a target feature channel is extracted, and spatial information is constructed for different targets to obtain enhanced features;
step three, sending the enhanced features obtained in the step two into a double attention erasing network, where after an averaging calculation a first channel erases the salient local foreground attention region, so as to seek the whole target, while also erasing the background attention region, and a second channel performs a Sigmoid function operation; the result of the first or the second channel is randomly selected and multiplied element-wise with the enhanced features obtained in step two to produce the output;
and step four, after convolution, inputting the output obtained in the step three and the candidate area obtained in the step one into a spatial pyramid pooling layer, inputting two layers of fully-connected layers connected in series, outputting to obtain a feature vector of each candidate frame, and then sending the feature vector into a multi-example branch, an optimization branch and a distillation branch to be refined to obtain a detection result.
2. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein feature extraction in step one uses the first four modules of the VGG16 network, with the max pooling layer of the last of these modules removed before extraction.
3. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 2, wherein the convolution in step four is a 3×3 dilated convolution.
4. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein the second step specifically comprises: performing global average pooling on the input features F in the channel dimension followed by channel attenuation to generate a global channel vector f_global; performing channel attenuation on the input features to generate local channel information f_local; the channel-attention output is

F′ = σ(f_global ⊕ f_local) ⊗ F,

wherein σ, ⊕ and ⊗ respectively represent the Sigmoid function, broadcast addition and element-by-element multiplication; channel average pooling and channel maximum pooling are simultaneously performed on the input features, the results are spliced in the channel dimension and convolved, and the spatial information M_s is obtained through the Sigmoid function; the enhanced output of the attention information aggregation network is

F″ = M_s ⊗ F′.
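The channel-then-spatial gating of claim 4 can be sketched as follows. This is a minimal numpy illustration under several assumptions: channel attenuation is realized as a ReLU bottleneck (the weight matrices are placeholders supplied by the caller), and the claim's splice-and-convolve spatial step is simplified to a two-tap weighted sum of the average- and max-pooled maps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_aggregation(F, Wg_down, Wg_up, Wl_down, Wl_up, spatial_w):
    """Channel gate from a global branch (GAP + attenuation) fused with a
    local branch by broadcast addition, then a spatial gate from channel
    average/max pooling. Weights are illustrative placeholders."""
    C, H, W = F.shape
    # global channel vector f_global: GAP, then C -> C/r -> C attenuation
    gap = F.mean(axis=(1, 2))                             # (C,)
    f_global = Wg_up @ np.maximum(Wg_down @ gap, 0)       # (C,)
    # local channel information f_local: per-position attenuation
    flat = F.reshape(C, -1)                               # (C, H*W)
    f_local = Wl_up @ np.maximum(Wl_down @ flat, 0)       # (C, H*W)
    # broadcast addition + Sigmoid, element-wise product with the input
    Fc = sigmoid(f_global[:, None] + f_local).reshape(C, H, W) * F
    # spatial gate: channel avg & max pooling; the claim's concat + conv
    # is simplified here to a weighted sum of the two pooled maps
    avg, mx = Fc.mean(axis=0), Fc.max(axis=0)             # (H, W) each
    Ms = sigmoid(spatial_w[0] * avg + spatial_w[1] * mx)
    return Ms[None, :, :] * Fc                            # (C, H, W)
```

Both gates are in (0, 1), so the output is an attenuated copy of the input with the same shape, i.e. a feature enhancement rather than a projection.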
5. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein the operation of the first channel in step three comprises: setting thresholds T_fg and T_bg; the elements of M_sa greater than T_fg are set to 0 and the others to 1, thus generating the foreground erasure mask M_fg; the elements of M_sa less than the threshold T_bg are set to 0 and the rest to 1, thus generating the background erasure mask M_bg; the total erasure mask of the first channel is

M = M_fg ⊗ M_bg,

wherein M_sa is obtained by performing average calculation on the enhanced features over the channel dimension.
6. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 5, wherein T_fg = λ_fg · max(M_sa), T_bg = λ_bg · max(M_sa), with λ_fg ∈ [0,1] and λ_bg ∈ [0,1].
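The thresholding of claims 5 and 6 can be sketched directly. Assumptions flagged: the two masks are combined by element-wise product (the combined-mask formula is an image in the source), and the λ values below are illustrative, not values taught by the patent.

```python
import numpy as np

def erasure_mask(M_sa, lam_fg=0.8, lam_bg=0.1):
    """Threshold the channel-averaged attention map M_sa (claims 5-6).
    Pixels above T_fg are erased (most salient foreground, forcing the
    network to find the rest of the object); pixels below T_bg erase
    low-attention background. lam_fg and lam_bg are example values."""
    T_fg = lam_fg * M_sa.max()
    T_bg = lam_bg * M_sa.max()
    M_fg = np.where(M_sa > T_fg, 0.0, 1.0)   # foreground erasure mask
    M_bg = np.where(M_sa < T_bg, 0.0, 1.0)   # background erasure mask
    return M_fg * M_bg   # combined mask (element-wise product assumed)

M_sa = np.array([[0.05, 0.4],
                 [0.90, 0.5]])
# max = 0.9, so T_fg = 0.72 erases the 0.9 peak, T_bg = 0.09 erases 0.05
print(erasure_mask(M_sa))
```

Multiplying this mask into the enhanced features zeroes out both the dominant local peak and the near-zero background, which is exactly the "search the whole target" behavior step three describes.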
7. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein the supervision information of the first optimization branch comes from the multiple-instance branch, the supervision information of each remaining optimization branch comes from the preceding optimization branch, and the supervision information of the distillation branch is the average of the outputs of all optimization branches.
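The supervision routing of claim 7 can be sketched as below. One reading assumption: "the last branch" is taken to mean the immediately preceding optimization branch (the usual cascaded-refinement design); the function and score shapes are illustrative, not from the patent.

```python
import numpy as np

def refinement_targets(mil_scores, opt_scores):
    """Route supervision for the refinement stage: the first optimization
    branch is taught by the multiple-instance branch, each later branch by
    the branch before it, and the distillation branch by the average of
    all optimization-branch outputs. Scores are (regions x classes)."""
    teachers = [mil_scores] + list(opt_scores[:-1])       # one teacher per branch
    distill_target = np.mean(np.stack(opt_scores), axis=0)
    return teachers, distill_target

mil = np.random.rand(10, 20)                    # MIL-branch scores
opts = [np.random.rand(10, 20) for _ in range(3)]  # 3 optimization branches
teachers, distill = refinement_targets(mil, opts)
```

Chaining teachers this way lets each branch refine the pseudo-labels of the previous one, while the averaged distillation target smooths out the noise of any single branch.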
CN202210444165.2A 2022-04-26 2022-04-26 Weak supervision target detection method based on double-attention erasure and attention information aggregation Active CN114818920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210444165.2A CN114818920B (en) 2022-04-26 2022-04-26 Weak supervision target detection method based on double-attention erasure and attention information aggregation


Publications (2)

Publication Number Publication Date
CN114818920A true CN114818920A (en) 2022-07-29
CN114818920B CN114818920B (en) 2024-08-20

Family

ID=82508059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210444165.2A Active CN114818920B (en) 2022-04-26 2022-04-26 Weak supervision target detection method based on double-attention erasure and attention information aggregation

Country Status (1)

Country Link
CN (1) CN114818920B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183414A (en) * 2020-09-29 2021-01-05 南京信息工程大学 Weak supervision remote sensing target detection method based on mixed hole convolution
CN112329800A (en) * 2020-12-03 2021-02-05 河南大学 Salient object detection method based on global information guiding residual attention
CN113191314A (en) * 2021-05-20 2021-07-30 上海眼控科技股份有限公司 Multi-target tracking method and equipment
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN114385930A (en) * 2021-12-28 2022-04-22 清华大学 Interest point recommendation method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ji Zhong; Kong Qiankun; Wang Jian: "A dual attention model guided object detection algorithm", Laser & Optoelectronics Progress, no. 06, 2 September 2019 (2019-09-02) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860681A (en) * 2020-07-30 2020-10-30 江南大学 Method for generating deep network difficult sample under double-attention machine mechanism and application
CN111860681B (en) * 2020-07-30 2024-04-30 江南大学 Deep network difficulty sample generation method under double-attention mechanism and application

Also Published As

Publication number Publication date
CN114818920B (en) 2024-08-20

Similar Documents

Publication Publication Date Title
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
Hu et al. Videomatch: Matching based video object segmentation
Huang et al. Mask scoring r-cnn
JP6853560B2 Method for auto-labeling training images to be used for learning a deep learning network that analyzes images with high precision, and auto-labeling device using the same
Chen et al. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform
Trnovszky et al. Animal recognition system based on convolutional neural network
CN111310731B (en) Video recommendation method, device, equipment and storage medium based on artificial intelligence
Kim et al. San: Learning relationship between convolutional features for multi-scale object detection
CN108537119B (en) Small sample video identification method
CN103810473B (en) A kind of target identification method of human object based on HMM
Bae Object detection based on region decomposition and assembly
CN110969166A (en) Small target identification method and system in inspection scene
CN109034086B (en) Vehicle weight identification method, device and system
US11062455B2 (en) Data filtering of image stacks and video streams
CN112115879B (en) Self-supervision pedestrian re-identification method and system with shielding sensitivity
CN113221987A (en) Small sample target detection method based on cross attention mechanism
KR20180054406A (en) Image processing apparatus and method
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
CN112381034A (en) Lane line detection method, device, equipment and storage medium
CN110287970B (en) Weak supervision object positioning method based on CAM and covering
CN114818920A (en) Weak supervision target detection method based on double attention erasing and attention information aggregation
Cholakkal et al. A classifier-guided approach for top-down salient object detection
CN112348011B (en) Vehicle damage assessment method and device and storage medium
Huang et al. On the Concept Trustworthiness in Concept Bottleneck Models
Qiu et al. Revisiting multi-level feature fusion: A simple yet effective network for salient object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant