CN114818920A - Weak supervision target detection method based on double attention erasing and attention information aggregation - Google Patents

Weak supervision target detection method based on double attention erasing and attention information aggregation Download PDF

Info

Publication number
CN114818920A
CN114818920A (application CN202210444165.2A); granted publication CN114818920B
Authority
CN
China
Prior art keywords
attention
channel
branch
target
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210444165.2A
Other languages
Chinese (zh)
Other versions
CN114818920B (en)
Inventor
龚声蓉
宋鹏鹏
应文豪
王朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changshu Institute of Technology
Original Assignee
Changshu Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changshu Institute of Technology filed Critical Changshu Institute of Technology
Priority to CN202210444165.2A priority Critical patent/CN114818920B/en
Publication of CN114818920A publication Critical patent/CN114818920A/en
Application granted granted Critical
Publication of CN114818920B publication Critical patent/CN114818920B/en
Legal status: Active (granted)

Classifications

    • G06F 18/2155 — Generating training patterns; bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/24 — Classification techniques
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods


Abstract

The invention discloses a weakly supervised target detection method based on double attention erasing and attention information aggregation. First, image features are extracted while a selective search algorithm extracts target candidate regions from the original image. The features are fed into an attention information aggregation network, which extracts global and local information from the target feature channels and constructs spatial information for different targets to enhance the feature map. The enhanced feature map is then fed into a double attention erasing network, which erases the salient local foreground attention region together with the background attention region, while a parallel Sigmoid operation generates an enhancement map. The convolution features of the final feature map and the candidate regions are input into a spatial pyramid pooling layer followed by two serially connected fully-connected layers, yielding a feature vector for each candidate box; this vector is then fed into a multiple-instance branch, optimization branches and a distillation branch to refine the detection result. The method addresses the over-emphasis of salient target regions in weakly supervised scenarios and improves detection accuracy.

Description

Weak supervision target detection method based on double attention erasing and attention information aggregation
Technical Field
The invention relates to a target detection method, in particular to a weakly supervised target detection method based on double attention erasing and attention information aggregation.
Background
Object detection is one of the hot problems in the field of computer vision. Fully supervised target detection based on deep learning requires a time-consuming and labor-intensive process of preparing a large amount of completely annotated data, and the annotation process may also introduce noise due to human factors. Weakly supervised detection methods fall mainly into two types: traditional detection methods based on multiple-instance learning, and end-to-end multiple-instance network detection methods. The main process of both is to first generate a large number of candidate regions and then apply multiple-instance learning to those regions. Although the traditional methods based on multiple-instance learning are fast, most rely on hand-crafted features that are not robust, so they are complex to operate and their detection accuracy is unsatisfactory.
Benefiting from the strong feature extraction capability of deep convolutional neural networks, more and more work adopts end-to-end multiple-instance detection networks, which markedly improve the accuracy of weakly supervised detection, require no hand-crafted features, and greatly simplify the detection process. However, since such methods are built on classification networks, which tend to extract the most discriminative local features of a target, the high-response regions of the target features concentrate in those areas, and in certain detection scenarios the model is prone to falling into a local minimum, i.e., it stabilizes on the most salient local target region, such as the head or tail of a non-rigid target. In weakly supervised detection, attending only to the most salient region of the target is not enough; how to make the model attend to the whole region of the target, and thereby further improve detection accuracy, is a critical problem to be solved urgently.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a weakly supervised target detection method based on double attention erasing and attention information aggregation, which addresses the problem that for non-rigid targets the model attends excessively to salient local regions instead of the whole object.
The technical scheme of the invention is as follows: a weak supervision target detection method based on double attention erasing and attention information aggregation comprises the following steps:
firstly, extracting features of an input image and simultaneously extracting a target candidate region of the input image by adopting a selective search algorithm;
step two, the features obtained in the step one are sent to an attention information aggregation network, global and local information of a target feature channel is extracted, and spatial information is constructed for different targets to obtain enhanced features;
step three, sending the enhanced features obtained in the step two into a double attention erasing network, where after an averaging calculation a first channel erases the salient local foreground attention region, so as to seek the whole target, while also erasing the background attention region, and a second channel performs a Sigmoid function operation; the result of the first or the second channel is randomly selected and multiplied element-wise with the enhanced features obtained in step two to produce the output;
and step four, after convolution, inputting the output obtained in the step three and the candidate area obtained in the step one into a spatial pyramid pooling layer, inputting two layers of fully-connected layers connected in series, outputting to obtain a feature vector of each candidate frame, and then sending the feature vector into a multi-example branch, an optimization branch and a distillation branch to be refined to obtain a detection result.
Further, feature extraction in step one uses the first four modules of the VGG16 network, with the max pooling layer of the last of these modules removed before extraction.
Further, the convolution in step four is a 3×3 dilated (hole) convolution.
Further, the second step specifically includes: performing global average pooling on the input features $F_b$ and applying channel attenuation to generate the global channel vector $f_{global}$; applying channel attenuation to the input features to generate the local channel information $f_{local}$; the output is

$$F_c = F_b \otimes \sigma(f_{global} \oplus f_{local}),$$

where $\sigma$, $\oplus$ and $\otimes$ denote the Sigmoid function, broadcast addition and element-by-element multiplication, respectively; performing channel average pooling and channel maximum pooling on the input features simultaneously, concatenating in the channel dimension, convolving, and passing through a Sigmoid function to obtain the spatial information $M_s$; the enhanced output of the attention information aggregation network is

$$F_e = F_c \oplus (F_c \otimes M_s).$$
Further, the operation of the first channel in step three includes setting thresholds $T_{fg}$ and $T_{bg}$: elements of $M_{sa}$ greater than $T_{fg}$ are set to 0 and the others to 1, generating the foreground erasure mask $M_{fg}$; elements of $M_{sa}$ less than the threshold $T_{bg}$ are set to 0 and the rest to 1, generating the background erasure mask $M_{bg}$. The total erasure mask of the first channel is

$$M_{drop} = M_{fg} \otimes M_{bg},$$

where $M_{sa}$ is obtained by averaging the enhanced features over the channel dimension.
Further, $T_{fg} = \lambda_{fg} \cdot \max(M_{sa})$, $T_{bg} = \lambda_{bg} \cdot \max(M_{sa})$, with $\lambda_{fg} \in [0,1]$ and $\lambda_{bg} \in [0,1]$.
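Under the definitions above, the first-channel mask generation can be sketched as follows (a NumPy illustration, not the patent's actual code; the function and argument names are assumptions, and the default threshold ratios follow the values used later in the experiments):

```python
import numpy as np

def erasure_masks(M_sa, lambda_fg=0.8, lambda_bg=0.05):
    """Illustrative sketch of the first-channel mask generation.

    M_sa: self-attention map of shape (H, W), obtained by averaging the
    enhanced features over the channel dimension.
    Returns the total erasure mask M_drop = M_fg * M_bg.
    """
    T_fg = lambda_fg * M_sa.max()              # foreground threshold T_fg
    T_bg = lambda_bg * M_sa.max()              # background threshold T_bg
    M_fg = np.where(M_sa > T_fg, 0.0, 1.0)     # erase most salient foreground
    M_bg = np.where(M_sa < T_bg, 0.0, 1.0)     # erase background attention
    return M_fg * M_bg                         # element-wise product M_drop
```

For example, with `M_sa = [[1.0, 0.5], [0.02, 0.9]]`, the thresholds become 0.8 and 0.05, so the top-left and bottom-right elements are erased as salient foreground and the bottom-left as background.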
Further, the supervision information of the first optimization branch comes from the multiple-instance branch, the supervision information of each remaining optimization branch comes from the previous optimization branch, and the supervision information of the distillation branch is the average of the outputs of all optimization branches.
The technical scheme provided by the invention has the advantages that:
for the extracted local saliency area of the target feature, attention erasure is introduced into a double attention erasure network, and the high response area of the target is expanded by erasing the most salient local foreground area and the background attention area, so that the whole network model can pay attention to the whole area of the target as much as possible, the network is prevented from concentrating attention on the background area, and the performance accuracy of classification is maintained. In addition, in order to generate the erasure mask more accurately, the attention information aggregation network may extract global and local information of the target feature channel, and construct spatial information for different targets to enhance the feature map and generate a more accurate attention erasure mask, thereby further improving the detection accuracy. The double attention erasing network and the attention information aggregation network are plug-and-play sub-networks, and the two sub-networks are mutually cooperated, are easy to transplant, realize and integrate into other networks to solve the problem that the detection performance is seriously influenced due to the prominent target significance region in a weak supervision scene.
Drawings
Fig. 1 is a schematic structural diagram of a target detection model of a weakly supervised target detection method based on double attention erasure and attention information aggregation according to the present invention.
Fig. 2 is a schematic structural diagram of an attention information aggregation network.
Fig. 3 is a schematic structural diagram of a dual-attention erasure network.
FIG. 4 is a schematic diagram of the structure of a multi-instance learning branch, optimization branch and distillation branch.
FIG. 5 is a visualization of the high response area feature map of the present invention on some non-rigid targets.
Detailed Description
The present invention is further described by the following examples, which are intended to be illustrative only and not limiting as to the scope of the invention; equivalent modifications made within the scope of the appended claims likewise fall within the scope of protection of the invention.
The method for weakly supervised target detection based on double attention erasing and attention information aggregation comprises establishing a target detection model, training the model on sample data, and detecting input images with the trained model. Referring to fig. 1, the process of target detection based on this model is as follows:
for an input image with only image-level labels, the corresponding class label is
Figure BDA0003615900430000031
Where C is the total number of classes. For a picture, if y c (1. ltoreq. C. ltoreq. C) 1, then it contains the target with the classmark C. The candidate region of the input image is represented as R ═ { R ═ R 1 ,R 2 ,...,R N N is the number of candidate regions.
The specific steps are as follows. Step one: an image with only a category label is input into the feature extraction network; the first four modules of the network (with the max pooling layer of the fourth module removed) extract features with 512 channels, and candidate regions of the input image are generated in advance using the selective search algorithm. The feature extraction network consists of five modified modules of the VGG16 network, modified as follows: the structure of the first four modules is kept unchanged (except that the max pooling layer of the fourth module is removed), the convolution layers of the fifth module are retained, and the serially connected attention information aggregation network and double attention erasing network are inserted between the convolution layers of the fourth and fifth modules. As a preferred embodiment, to protect the features of small objects, a 3×3 dilated convolution layer with dilation rate 2 replaces the convolution layer of the fifth module.
Step two: the features obtained in step one are sent into the attention information aggregation network (AIA), whose overall architecture is shown in figure 2. Specifically, the input features $F_b \in \mathbb{R}^{C \times H \times W}$ first undergo Global Average Pooling (GAP) to produce a per-channel descriptor, which then passes through a channel attention module consisting of two 1×1 convolution operations that first reduce and then restore the number of channels; a channel attenuation rate r is set in this process to reduce the computational cost, finally generating the global channel vector $f_{global} \in \mathbb{R}^{C \times 1 \times 1}$ as output. At the same time, $F_b$ is passed directly through the channel attention module to obtain the local channel information $f_{local} \in \mathbb{R}^{C \times H \times W}$. The channel weights are obtained from these two results by combining $f_{global}$ and $f_{local}$ with broadcast addition and then applying the Sigmoid function. The final output is

$$F_c = F_b \otimes \sigma(f_{global} \oplus f_{local}),$$

where $\sigma$, $\oplus$ and $\otimes$ denote the Sigmoid function, broadcast addition and element-by-element multiplication, respectively.
In addition, constructing spatial information for the features lets the network focus on which regions, and which positions within them, carry the key information, which is very helpful for weakly supervised detection, where position labels are absent during training. Therefore, to further improve localization of object regions, the invention constructs target spatial information for the input feature map. Channel Average Pooling (CAP) and Channel Maximum Pooling (CMP) operate on the input features simultaneously, and their outputs are concatenated in the channel dimension. The concatenation is then passed through a 7×7 convolution layer, and the spatial information $M_s$ is obtained through a Sigmoid function. However, simply multiplying in the spatial information would hurt model accuracy, because the weight output by the Sigmoid is a normalized map, so the response of the feature map becomes weak after element-wise multiplication. Therefore the element-wise product is computed first and the original features are then added back; the enhanced features output by the attention information aggregation module are

$$F_e = F_c \oplus (F_c \otimes M_s).$$
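As an illustrative sketch only (a NumPy stand-in for the PyTorch implementation, with the channel-attention 1×1 convolutions reduced to matrix multiplies, the 7×7 spatial convolution simplified to an average of the two pooled maps, and all function and parameter names being assumptions rather than the patent's code), the attention information aggregation computation might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aia_forward(F, W_down, W_up):
    """Sketch of the AIA module. F: input features of shape (C, H, W).
    W_down (C x C/r) and W_up (C/r x C) emulate the two 1x1 convolutions
    with channel attenuation rate r."""
    C, H, W = F.shape
    # global channel vector f_global: GAP, then channel attenuation
    gap = F.mean(axis=(1, 2))                               # (C,)
    f_global = np.maximum(gap @ W_down, 0) @ W_up           # (C,)
    # local channel information f_local: per-position channel attenuation
    flat = F.reshape(C, -1).T                               # (H*W, C)
    f_local = (np.maximum(flat @ W_down, 0) @ W_up).T.reshape(C, H, W)
    # F_c = F ⊗ σ(f_global ⊕ f_local), broadcast addition over space
    F_c = F * sigmoid(f_global[:, None, None] + f_local)
    # spatial information M_s: channel average + channel max pooling,
    # (7x7 convolution omitted in this sketch), then Sigmoid
    cap = F_c.mean(axis=0)
    cmp_ = F_c.max(axis=0)
    M_s = sigmoid((cap + cmp_) / 2.0)
    # enhanced output: element-wise product first, then add original features
    return F_c + F_c * M_s
```

With C = 4 and attenuation rate r = 2, `W_down` has shape (4, 2) and `W_up` shape (2, 4); the output keeps the input shape (C, H, W).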
and step three, sending the enhanced features obtained in the step two into a double attention erasure network (DAE). The overall structure of the dual attention erasure network is shown in fig. 3.
Two thresholds are introduced in the double attention erasing network, namely the foreground threshold $\lambda_{fg} \in [0,1]$ and the background threshold $\lambda_{bg} \in [0,1]$. The self-attention map $M_{sa} \in \mathbb{R}^{H \times W}$ is first obtained by averaging the input features $F \in \mathbb{R}^{C \times H \times W}$ over the channel dimension, where C, H and W are the number of channels, height and width, respectively. Elements of $M_{sa}$ greater than $T_{fg}$ are then set to 0 and the others to 1, generating the foreground erasure mask $M_{fg}$, where $T_{fg} = \lambda_{fg} \cdot \max(M_{sa})$. In contrast to the foreground attention erasure, the background erasure mask $M_{bg}$ is obtained by setting elements of $M_{sa}$ less than the threshold $T_{bg}$ to 0 and the others to 1, where $T_{bg} = \lambda_{bg} \cdot \max(M_{sa})$. The total erasure mask is therefore

$$M_{drop} = M_{fg} \otimes M_{bg},$$

where $\otimes$ denotes multiplication of corresponding elements.
In addition, another branch is introduced into the double attention erasing network: a Sigmoid function is applied to $M_{sa}$ to generate the enhancement map $M_{em}$, which maintains classification performance. By setting $\lambda_{drop\_rate}$, the network model randomly selects between $M_{drop}$ and $M_{em}$, deciding whether to apply attention erasing; erasing helps localization performance, while the other branch plays an important role in classification performance. Finally, the selected result is applied to the original input of the double attention erasing network by element-wise product to obtain the output feature map.
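A minimal NumPy sketch of the two-channel selection in the double attention erasing network follows (the names, the use of a uniform random draw, and the default threshold values are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def dae_forward(F, lambda_fg=0.8, lambda_bg=0.05, drop_rate=0.75, rng=None):
    """Sketch of the double attention erasing network.
    F: enhanced features of shape (C, H, W)."""
    if rng is None:
        rng = np.random.default_rng()
    M_sa = F.mean(axis=0)                          # self-attention map (H, W)
    if rng.random() < drop_rate:                   # first channel: erasing
        T_fg = lambda_fg * M_sa.max()
        T_bg = lambda_bg * M_sa.max()
        M_fg = np.where(M_sa > T_fg, 0.0, 1.0)     # erase salient foreground
        M_bg = np.where(M_sa < T_bg, 0.0, 1.0)     # erase background
        mask = M_fg * M_bg                         # M_drop
    else:                                          # second channel: enhancing
        mask = 1.0 / (1.0 + np.exp(-M_sa))         # M_em = Sigmoid(M_sa)
    return F * mask                                # element product with input
```

`drop_rate` plays the role of $\lambda_{drop\_rate}$: with probability 0.75 the erasing branch is chosen, otherwise the Sigmoid enhancement branch.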
Step four: the output feature map from step three is sent to the convolution layer of the fifth module of the feature extraction network (i.e. the 3×3 dilated convolution layer with dilation rate 2) to obtain convolution features; these and the candidate regions from step one are input into the spatial pyramid pooling layer and then into two serially connected fully-connected layers with 4096 channels, yielding a candidate-region feature vector with 4096 channels, which is then sent into the multiple-instance branch, the optimization branches and the distillation branch to further refine the detection result, as shown in fig. 4. The candidate-region feature vectors generated by the second fully-connected layer enter the multiple-instance learning branch and, simultaneously, the K optimization branches and the distillation branch. All branches are structurally identical, but the supervision information used in training differs.
Specifically, in the multiple-instance learning branch, the candidate-region feature vectors pass through the classification and detection streams of the multiple-instance detection network, producing the matrices $x^{class} \in \mathbb{R}^{C \times N}$ and $x^{det} \in \mathbb{R}^{C \times N}$, respectively. The scores of all candidate regions are obtained by multiplying corresponding elements, $x^0 = \sigma_{class}(x^{class}) \odot \sigma_{det}(x^{det})$, where $\sigma$ denotes the Softmax function (applied over classes in the classification stream and over candidate regions in the detection stream). Finally, the classification score of any class c is obtained by summing the scores associated with c over all candidate regions:

$$\varphi_c = \sum_{r=1}^{N} x^0_{c,r}.$$
The multi-class cross-entropy loss of the multiple-instance learning (MIL) branch may be defined as:

$$L_{mil} = -\sum_{c=1}^{C} \left[ y_c \log \varphi_c + (1 - y_c) \log(1 - \varphi_c) \right].$$
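A hedged NumPy sketch of the multiple-instance learning branch scoring and loss (the function names are illustrative; the Softmax axes follow the standard two-stream multiple-instance detection formulation assumed above):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_branch(x_class, x_det, y):
    """Sketch of the MIL branch.
    x_class, x_det: (C, N) score matrices from the classification and
    detection streams; y: (C,) binary image-level labels."""
    # Softmax over classes (classification stream) and over candidate
    # regions (detection stream), then element-wise product
    x0 = softmax(x_class, axis=0) * softmax(x_det, axis=1)   # (C, N)
    # image-level score per class: sum over candidate regions
    phi = np.clip(x0.sum(axis=1), 1e-6, 1 - 1e-6)            # (C,)
    # multi-class cross-entropy loss L_mil
    loss = -np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))
    return x0, loss
```

Because each row of the detection-stream Softmax sums to 1 and every classification-stream entry is at most 1, each $\varphi_c$ lies in (0, 1), so the cross-entropy is well defined.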
the output of the multi-instance branch is optimized by introducing an optimization branch and a distillation branch.
In the optimization branches, background label information is also considered, so the output of each optimization branch is represented as $x^k \in \mathbb{R}^{(C+1) \times N}$. Except for the first branch, whose supervision comes from the multiple-instance learning branch, the supervision of the k-th branch comes from the (k-1)-th branch. The candidate region with the highest score for any class is labeled with that class; other candidate regions with sufficiently high IoU with it receive the same label, and all remaining candidate regions are used as background or ignored. The supervision information of the r-th candidate region from the (k-1)-th optimization branch is therefore denoted $y^{k-1}_r$.
It serves as the pseudo-label for the k-th optimization branch. The multi-class cross-entropy loss of the k-th optimization branch is defined as

$$L_{opt}^{k} = -\frac{1}{N} \sum_{r=1}^{N} \sum_{c=1}^{C+1} w_r^k \, y_{c,r}^{k-1} \log x_{c,r}^{k},$$

where $w_r^k$ is the loss weight of the r-th candidate region in the k-th optimization branch, used to suppress the influence of unreliable candidate-region scores early in training.
For the distillation branch, the pseudo-label is obtained by averaging the outputs of the K optimization branches. The loss function $L_{distill}$ of the distillation branch has the same form as that of the optimization branches. The total loss function may be defined as:

$$L = L_{mil} + \sum_{k=1}^{K} L_{opt}^{k} + L_{distill}.$$
experiments on PASCAL VOC 2007 and VOC2012 are carried out on two widely used and challenging data sets, both of which comprise 20 target classes, the results of the weak supervision target detection are tested, and the validity of the method is verified. For VOC 2007, it contained 24640 labeled objects and 9963 images (of which 5011 images belonged to the training validation set trainval and 4952 images belonged to the test set test). For VOC2012, it contains 22531 images (of which 11540 images belong to the training verification set and 10991 images belong to the test set). For each data set, the experiment was trained on the training validation set and the test set was evaluated for test results.
Two indicators were used for evaluation: mean Average Precision (mAP) and Correct Localization (CorLoc). mAP measures the detection accuracy of the detector, while CorLoc, the percentage of correctly localized images out of the total, measures localization accuracy. Following the PASCAL protocol, a predicted box is considered correct only if its IoU with the ground-truth box is greater than 50%. For fair comparison, the method evaluates mAP on the test set and CorLoc on the training-validation set.
Experiment hardware environment: Ubuntu 16.04, a Tesla P100 graphics card with 16 GB of video memory. Code environment: Python 3.6, PyTorch 1.4. The whole network is built on the baseline model boost-OICR, with a VGG16 backbone pre-trained on the ImageNet data set. For fairness, all settings are the same as those of the baseline model. Initial candidate regions were generated for each image using Selective Search. In the training phase, K was set to 3 to optimize the instance classifiers. For the double attention erasing network, following the baseline model, $\lambda_{fg}$ was set to 80%, $\lambda_{drop\_rate}$ to 75%, and $\lambda_{bg}$ to 5%. In the attention information aggregation network, the channel attenuation rate r was set to 16. During evaluation, the final result is the average output of the optimization and distillation branches. Only image-level labels were used for training, without bounding-box annotations, but full annotations were used for evaluation at test time.
The invention compares the high-response-region feature maps of some non-rigid targets against boost-OICR; the visualization is shown in fig. 5. The feature maps of the conv5-3 layer of the VGG16 backbone were extracted and the salient features visualized. The high-response area of boost-OICR concentrates on the most salient area of the non-rigid object, leading to incomplete detection results, whereas the present method effectively expands the most salient area and activates other, less salient areas so as to localize the whole target as far as possible. In fig. 5: (a) original image; (b) visualization result of boost-OICR; (c) visualization result of the method of the invention.
In addition, the method was experimentally compared with other recent weakly supervised target detection methods on the VOC 2007 and VOC 2012 data sets; mAP and CorLoc on VOC 2007 are shown below.
[Table: per-class mAP and CorLoc results on PASCAL VOC 2007 (table image in the original document)]
As the table shows, the method achieves 50.5% mAP and 66.6% CorLoc. In particular, for non-rigid targets such as "cat", "dog", "horse" and "people", the method improves mAP over boost-OICR by 19.6%, 5.5%, 3.6% and 11.3% respectively, which fully demonstrates that it can effectively extend the salient local area of the target to perceive the whole target. The detection results on the VOC 2012 data set are shown in the following table: the method achieves 47.4% mAP and 67.3% CorLoc, competitive with other recent weakly supervised detection methods.
Method mAP CorLoc
OICR 37.9 62.1
PCL 40.6 63.2
WSRPN 40.8 64.9
C-WSL 43.0 64.9
SDCN 43.5 67.9
C-MIL 46.7 67.4
UWSOD 45.1 65.2
BOICR 46.3 65.8
OIM 45.3 67.1
Ji et al. 46.9 67.4
Ours 47.4 67.3

Claims (7)

1. A weak supervision target detection method based on double attention erasing and attention information aggregation is characterized by comprising the following steps:
firstly, extracting features of an input image and simultaneously extracting a target candidate region of the input image by adopting a selective search algorithm;
step two, the features obtained in the step one are sent to an attention information aggregation network, global and local information of a target feature channel is extracted, and spatial information is constructed for different targets to obtain enhanced features;
step three, sending the enhanced features obtained in the step two into a double attention erasing network, where after an averaging calculation a first channel erases the salient local foreground attention region, so as to seek the whole target, while also erasing the background attention region, and a second channel performs a Sigmoid function operation; the result of the first or the second channel is randomly selected and multiplied element-wise with the enhanced features obtained in step two to produce the output;
and step four, after convolution, inputting the output obtained in the step three and the candidate area obtained in the step one into a spatial pyramid pooling layer, inputting two layers of fully-connected layers connected in series, outputting to obtain a feature vector of each candidate frame, and then sending the feature vector into a multi-example branch, an optimization branch and a distillation branch to be refined to obtain a detection result.
2. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein feature extraction in step one uses the first four modules of the VGG16 network, with the max pooling layer of the last of these modules removed before extraction.
3. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 2, wherein the convolution in step four is a 3×3 dilated convolution.
4. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein the second step specifically comprises: performing global average pooling on the input features F in the channel dimension followed by channel attenuation to generate a global channel vector f_global; performing channel attenuation on the input features to generate local channel information f_local; the channel-attention output is

F′ = σ(f_global ⊕ f_local) ⊗ F,

wherein σ, ⊕ and ⊗ respectively represent the Sigmoid function, broadcast addition and element-by-element multiplication; channel average pooling and channel maximum pooling are simultaneously performed on the input features, the results are spliced in the channel dimension and convolved, and the spatial information M_s is obtained through the Sigmoid function; the enhanced output of the attention information aggregation network is

F″ = M_s ⊗ F′.
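The channel-then-spatial gating of claim 4 can be sketched as follows. This is a minimal numpy illustration under several assumptions: channel attenuation is realized as a ReLU bottleneck (the weight matrices are placeholders supplied by the caller), and the claim's splice-and-convolve spatial step is simplified to a two-tap weighted sum of the average- and max-pooled maps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_aggregation(F, Wg_down, Wg_up, Wl_down, Wl_up, spatial_w):
    """Channel gate from a global branch (GAP + attenuation) fused with a
    local branch by broadcast addition, then a spatial gate from channel
    average/max pooling. Weights are illustrative placeholders."""
    C, H, W = F.shape
    # global channel vector f_global: GAP, then C -> C/r -> C attenuation
    gap = F.mean(axis=(1, 2))                             # (C,)
    f_global = Wg_up @ np.maximum(Wg_down @ gap, 0)       # (C,)
    # local channel information f_local: per-position attenuation
    flat = F.reshape(C, -1)                               # (C, H*W)
    f_local = Wl_up @ np.maximum(Wl_down @ flat, 0)       # (C, H*W)
    # broadcast addition + Sigmoid, element-wise product with the input
    Fc = sigmoid(f_global[:, None] + f_local).reshape(C, H, W) * F
    # spatial gate: channel avg & max pooling; the claim's concat + conv
    # is simplified here to a weighted sum of the two pooled maps
    avg, mx = Fc.mean(axis=0), Fc.max(axis=0)             # (H, W) each
    Ms = sigmoid(spatial_w[0] * avg + spatial_w[1] * mx)
    return Ms[None, :, :] * Fc                            # (C, H, W)
```

Both gates are in (0, 1), so the output is an attenuated copy of the input with the same shape, i.e. a feature enhancement rather than a projection.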
5. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein the operation of the first channel in step three comprises: setting thresholds T_fg and T_bg; the elements of M_sa greater than T_fg are set to 0 and the others to 1, thus generating the foreground erasure mask M_fg; the elements of M_sa less than the threshold T_bg are set to 0 and the rest to 1, thus generating the background erasure mask M_bg; the total erasure mask of the first channel is

M = M_fg ⊗ M_bg,

wherein M_sa is obtained by performing average calculation on the enhanced features over the channel dimension.
6. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 5, wherein T_fg = λ_fg · max(M_sa), T_bg = λ_bg · max(M_sa), with λ_fg ∈ [0,1] and λ_bg ∈ [0,1].
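The thresholding of claims 5 and 6 can be sketched directly. Assumptions flagged: the two masks are combined by element-wise product (the combined-mask formula is an image in the source), and the λ values below are illustrative, not values taught by the patent.

```python
import numpy as np

def erasure_mask(M_sa, lam_fg=0.8, lam_bg=0.1):
    """Threshold the channel-averaged attention map M_sa (claims 5-6).
    Pixels above T_fg are erased (most salient foreground, forcing the
    network to find the rest of the object); pixels below T_bg erase
    low-attention background. lam_fg and lam_bg are example values."""
    T_fg = lam_fg * M_sa.max()
    T_bg = lam_bg * M_sa.max()
    M_fg = np.where(M_sa > T_fg, 0.0, 1.0)   # foreground erasure mask
    M_bg = np.where(M_sa < T_bg, 0.0, 1.0)   # background erasure mask
    return M_fg * M_bg   # combined mask (element-wise product assumed)

M_sa = np.array([[0.05, 0.4],
                 [0.90, 0.5]])
# max = 0.9, so T_fg = 0.72 erases the 0.9 peak, T_bg = 0.09 erases 0.05
print(erasure_mask(M_sa))
```

Multiplying this mask into the enhanced features zeroes out both the dominant local peak and the near-zero background, which is exactly the "search the whole target" behavior step three describes.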
7. The weakly supervised target detection method based on double attention erasing and attention information aggregation as recited in claim 1, wherein the supervision information of the first optimization branch comes from the multiple-instance branch, the supervision information of each remaining optimization branch comes from the preceding optimization branch, and the supervision information of the distillation branch is the average of the outputs of all optimization branches.
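The supervision routing of claim 7 can be sketched as below. One reading assumption: "the last branch" is taken to mean the immediately preceding optimization branch (the usual cascaded-refinement design); the function and score shapes are illustrative, not from the patent.

```python
import numpy as np

def refinement_targets(mil_scores, opt_scores):
    """Route supervision for the refinement stage: the first optimization
    branch is taught by the multiple-instance branch, each later branch by
    the branch before it, and the distillation branch by the average of
    all optimization-branch outputs. Scores are (regions x classes)."""
    teachers = [mil_scores] + list(opt_scores[:-1])       # one teacher per branch
    distill_target = np.mean(np.stack(opt_scores), axis=0)
    return teachers, distill_target

mil = np.random.rand(10, 20)                    # MIL-branch scores
opts = [np.random.rand(10, 20) for _ in range(3)]  # 3 optimization branches
teachers, distill = refinement_targets(mil, opts)
```

Chaining teachers this way lets each branch refine the pseudo-labels of the previous one, while the averaged distillation target smooths out the noise of any single branch.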
CN202210444165.2A 2022-04-26 2022-04-26 Weak supervision target detection method based on double-attention erasure and attention information aggregation Active CN114818920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210444165.2A CN114818920B (en) 2022-04-26 2022-04-26 Weak supervision target detection method based on double-attention erasure and attention information aggregation


Publications (2)

Publication Number Publication Date
CN114818920A true CN114818920A (en) 2022-07-29
CN114818920B CN114818920B (en) 2024-08-20

Family

ID=82508059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210444165.2A Active CN114818920B (en) 2022-04-26 2022-04-26 Weak supervision target detection method based on double-attention erasure and attention information aggregation

Country Status (1)

Country Link
CN (1) CN114818920B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183414A (en) * 2020-09-29 2021-01-05 南京信息工程大学 Weak supervision remote sensing target detection method based on mixed hole convolution
CN112329800A (en) * 2020-12-03 2021-02-05 河南大学 Salient object detection method based on global information guiding residual attention
CN113191314A (en) * 2021-05-20 2021-07-30 上海眼控科技股份有限公司 Multi-target tracking method and equipment
US20210390723A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Monocular unsupervised depth estimation method based on contextual attention mechanism
CN114385930A (en) * 2021-12-28 2022-04-22 清华大学 Interest point recommendation method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ji Zhong; Kong Qiankun; Wang Jian: "A dual attention model guided object detection algorithm", Laser & Optoelectronics Progress, no. 06, 2 September 2019 (2019-09-02) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860681A (en) * 2020-07-30 2020-10-30 江南大学 Method for generating deep network difficult sample under double-attention machine mechanism and application
CN111860681B (en) * 2020-07-30 2024-04-30 江南大学 Deep network difficulty sample generation method under double-attention mechanism and application

Also Published As

Publication number Publication date
CN114818920B (en) 2024-08-20

Similar Documents

Publication Publication Date Title
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
Hu et al. Videomatch: Matching based video object segmentation
Huang et al. Mask scoring r-cnn
JP6853560B2 Method for auto-labeling training images to be used for learning a deep learning network that analyzes images with high precision, and auto-labeling device using the same
Chen et al. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform
Trnovszky et al. Animal recognition system based on convolutional neural network
CN111310731B (en) Video recommendation method, device, equipment and storage medium based on artificial intelligence
Kim et al. San: Learning relationship between convolutional features for multi-scale object detection
CN108537119B (en) Small sample video identification method
CN103810473B (en) A kind of target identification method of human object based on HMM
Bae Object detection based on region decomposition and assembly
CN110969166A (en) Small target identification method and system in inspection scene
CN109034086B (en) Vehicle weight identification method, device and system
US11062455B2 (en) Data filtering of image stacks and video streams
CN112115879B (en) Self-supervision pedestrian re-identification method and system with shielding sensitivity
CN113221987A (en) Small sample target detection method based on cross attention mechanism
KR20180054406A (en) Image processing apparatus and method
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
CN112381034A (en) Lane line detection method, device, equipment and storage medium
CN110287970B (en) Weak supervision object positioning method based on CAM and covering
CN114818920A (en) Weak supervision target detection method based on double attention erasing and attention information aggregation
Cholakkal et al. A classifier-guided approach for top-down salient object detection
CN112348011B (en) Vehicle damage assessment method and device and storage medium
Huang et al. On the Concept Trustworthiness in Concept Bottleneck Models
Qiu et al. Revisiting multi-level feature fusion: A simple yet effective network for salient object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant