CN115713672A - Target detection method based on two-way parallel attention mechanism - Google Patents

Target detection method based on two-way parallel attention mechanism

Info

Publication number
CN115713672A
CN115713672A (application CN202211416425.1A)
Authority
CN
China
Prior art keywords
target detection
model
training
way parallel
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211416425.1A
Other languages
Chinese (zh)
Inventor
杜松林
袁伟哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute Of Southeast University
Southeast University
Original Assignee
Shenzhen Institute Of Southeast University
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute Of Southeast University, Southeast University filed Critical Shenzhen Institute Of Southeast University
Priority to CN202211416425.1A priority Critical patent/CN115713672A/en
Publication of CN115713672A publication Critical patent/CN115713672A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a target detection method based on a two-way parallel attention mechanism, which comprises the following steps: preparing a training data set; configuring a deep learning environment; and inputting the training data set into a target detection network based on a two-way parallel attention mechanism for iterative training, obtaining a trained model for target detection. The scheme proposes a more flexible encoder to better extract features of the training images; the encoder comprises a two-way parallel attention mechanism for capturing global and local features of the image and a channel attention for processing the channel-dimension information of the image features. In the two-way parallel attention mechanism, one branch uses a variant of the self-attention mechanism to capture long-distance dependencies among image features, and the other branch uses a multi-layer perceptron network to extract local relationships among image features. The proposed target detection method based on the two-way parallel attention mechanism can reduce the number of model parameters and improve model accuracy.

Description

Target detection method based on two-way parallel attention mechanism
Technical Field
The invention relates to a target detection method based on a two-way parallel attention mechanism, and belongs to the technical field of computer vision based on deep learning.
Background
Object detection is a fundamental task in the field of computer vision. Many computer vision tasks, such as image segmentation, object tracking and keypoint detection, often rely on target detection. Currently, there are two main types of target detection models. One is the classical convolutional architecture, which has made dramatic progress over the past decade. Recently, an end-to-end Transformer-based object detection model called DETR has emerged, which not only requires no manually designed components but also exhibits better performance. Most DETR-like work attempts to achieve faster convergence and better performance by improving DETR's query design; DAB-DETR and DINO have enabled models to achieve good performance through query design. The present invention instead designs an encoder to better extract features, so that the model obtains better performance.
Convolutional Neural Networks (CNNs) have long been the dominant architecture, with convolution as the core operation. Various architectures such as GoogLeNet, VGGNet and ResNet have been proposed. In pursuit of higher model performance, model structures have become more and more complex, while the core operation has always remained convolution and its variants. Vision Transformers and vision MLPs are new paradigms of computer vision architecture.
A recent advance in computer vision has been the rise of vision transformers (ViTs). A vision transformer divides the input image into a series of tokens, which are then aggregated by self-attention layers to generate a representation. Since their introduction, ViTs have performed well in many visual recognition tasks. Most current work revolves around queries, linking queries to location information and providing anchor-box coordinate constraints for the features. Although query improvement is important, the DETR encoder is equally important for feature extraction from the input image. Existing vision transformers typically use a multi-head attention design and then aggregate information from multiple independent heads with MLP blocks. Because different heads tend to focus on different parts of an object, mixing information across heads has become a bottleneck. How to aggregate information from different heads is therefore important.
Multilayer perceptron (MLP)-like architectures, composed of fully connected layers and nonlinear activation functions, are receiving much attention. Although their structure is simpler and introduces less inductive bias, their performance is still comparable to excellent models. MLP-like models mainly comprise channel-mixing MLPs and token-mixing MLPs: the channel-mixing MLP transforms the features of each token, while the token-mixing MLP aggregates information across different tokens. However, many current MLP architectures do not perform as well as CNN architectures and transformers. Aggregating different tokens with the fixed weights of fully connected layers is one cause of the visual MLP bottleneck. Existing visual MLP models ignore the differences in semantic information between tokens and aggregate them using fixed weights, which may not aggregate all tokens of an input image well.
The target detection network based on the two-way parallel attention mechanism provides a more flexible encoder based on DAB-DETR, which better extracts the features of training images. The encoder includes a two-way parallel attention module for capturing global and local features of the image, and a channel attention module for processing the channel-dimension information of the image features.
In the two-way parallel attention mechanism, one branch uses a variant of the self-attention mechanism to capture the long-distance dependencies between image features. The other branch uses an MLP network to extract local features of the image: each token is represented as a wave with an amplitude and a phase, where the amplitude represents the value of each token's feature and the phase represents the relationship between tokens; the phase difference between tokens affects their aggregated output, and tokens with similar phases reinforce each other. The present invention uses channel attention rather than a feed-forward network, which makes the model focus more on channel-dimension information than on spatial information and facilitates channel selection of features by re-weighting. Compared with DAB-DETR, the target detection network based on the two-way parallel attention mechanism has fewer parameters and better performance, and the convergence speed and precision of the model are greatly improved.
Target detection still has several open problems that need study, such as detecting small target objects; detecting targets with large occluded areas; distinguishing non-target objects whose shapes are similar to the target objects in images; real-time detection; improving the transfer-training effect with small data volumes; and, as target detection is applied more and more widely, the lack of diverse training sample data.
Disclosure of Invention
Aiming at the problem that the precision of target detection models is not high enough, the invention provides a target detection method based on a two-way parallel attention mechanism, which uses a two-way parallel attention module and a channel attention module to better capture image features and thereby improve model precision.
The complete technical means and method disclosed by the invention are as follows.
In order to achieve the above purpose, the invention adopts the following technical scheme: a target detection method based on a two-way parallel attention mechanism, comprising the following steps:
step 1, preparing a data set required by model training;
step 2, downloading the COCO2017 data set (Lin, Tsung-Yi, et al. "Microsoft COCO: Common Objects in Context." European Conference on Computer Vision. Springer, Cham, 2014.) to a local computer, and then putting the data into the corresponding folders according to the format;
step 3, configuring a deep learning environment;
Step 4, inputting the data set into a target detection network based on a two-way parallel attention mechanism for iterative training to finally obtain a target detection model;
step 5, according to the target detection model obtained in the step 4, sending the verification set image in the data set into the target detection model for verification to obtain the precision of the model;
the target detection method based on the two-way parallel attention mechanism comprises the following specific steps of 1:
Log in to the COCO dataset webpage (cocodataset.org) to download the data set required for model training, and download COCO2017 to the local computer.
The specific procedure of step 2 of the target detection method based on the two-way parallel attention mechanism is as follows:
step 2-1: the data set consists of four parts, namely a training set image, a training set label file, a verification set image and a verification set label file. A folder named data is created in a project file, and three folders named train2017, val2017 and annotations are created in the folder.
Step 2-2: The training set images are placed in the train2017 folder, the verification set images are placed in the val2017 folder, and the training set annotation file and the verification set annotation file are placed in the annotations folder.
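As a reference, the folder layout of steps 2-1 and 2-2 can be created with a short Python sketch (the project root is assumed to be the current working directory):

from pathlib import Path

# Create the data/train2017, data/val2017 and data/annotations folders
# described in steps 2-1 and 2-2.
data_root = Path("data")
for sub in ("train2017", "val2017", "annotations"):
    (data_root / sub).mkdir(parents=True, exist_ok=True)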
The specific procedure of step 3 of the target detection method based on the two-way parallel attention mechanism is as follows:
step 3-1: creating a virtual environment in the software "Anaconda" to configure the environment dependence required by the target detection network based on the two-way parallel attention mechanism, wherein the Python version of the corresponding virtual environment is 3.7.
Step 3-2: Log in to the PyTorch official website, download and install the deep learning framework, selecting PyTorch version 1.10.0 and CUDA version 11.3 to build the deep learning framework. All dependencies of the target detection network based on the two-way parallel attention mechanism are installed in this virtual environment.
The specific procedure of step 4 of the target detection method based on the two-way parallel attention mechanism is as follows:
and (4) inputting the algorithm requirement of the step 4 into the distributed training data set, and performing iterative training to obtain a target detection model.
Step 4-1: The backbone of the target detection network based on the two-way parallel attention mechanism is a residual neural network (ResNet). An input image is fed into the residual neural network, and image features are extracted from layers C3-C5, with corresponding down-sampling rates of 8, 16 and 32, finally obtaining the output F of the backbone of the target detection network based on the two-way parallel attention mechanism.
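For illustration, the multi-scale feature extraction of step 4-1 can be sketched in PyTorch as follows; ResNet-50 and the input resolution are assumptions, since the text only specifies a residual network with strides 8, 16 and 32:

import torch
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter

# Pull the C3-C5 feature maps (down-sampling rates 8, 16, 32) from a ResNet.
backbone = resnet50(weights=None)
body = IntermediateLayerGetter(
    backbone, return_layers={"layer2": "C3", "layer3": "C4", "layer4": "C5"}
)
feats = body(torch.randn(1, 3, 800, 800))
# feats["C3"]: stride 8, feats["C4"]: stride 16, feats["C5"]: stride 32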
Step 4-2: The extracted feature maps are input into the encoder of the network; the input and output of the encoder are feature maps of the same resolution. In the encoder, the feature F is first input into the two-way parallel attention module. The module contains two parallel branches that share the same input but process features in different ways. Finally, the outputs of the two branches are added directly as the output of the two-way parallel attention module.
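A minimal sketch of this module structure, with the two branch implementations left abstract (they are detailed below):

import torch.nn as nn

class TwoWayParallelAttention(nn.Module):
    """Two parallel branches share the same input; their outputs are
    added directly to form the module output."""
    def __init__(self, attention_branch: nn.Module, mlp_branch: nn.Module):
        super().__init__()
        self.attention_branch = attention_branch  # global features
        self.mlp_branch = mlp_branch              # local features

    def forward(self, x):
        return self.attention_branch(x) + self.mlp_branch(x)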
The key to the attention mechanism is to generate an attention map that indicates the connections between the parts of the image. One branch of the two-way parallel attention module adopts an attention branch to learn the relationships between tokens and extract global features. There are currently two main approaches to constructing the relationship between different parts of an image, namely computing attention maps with large-kernel convolution and with self-attention mechanisms. Because an ordinary large-kernel convolution requires heavy computation and many parameters, the invention captures the long-range relationships of features by decomposing the large-kernel convolution. The method first constructs an attention map using a large-kernel convolution module. The large-kernel convolution is decomposed into three parts: a depth-wise convolution, a depth-wise dilated convolution and a point-wise convolution, representing spatial local convolution, spatial long-distance convolution and channel convolution respectively. The large-kernel convolution is defined as:
A_m = Conv_{1×1}(DW_D_Conv(DW_Conv(F)))

F' = A_m ⊗ F

where F represents the input feature of the encoder, A_m represents the attention map, in which each value indicates the importance of the corresponding feature, ⊗ represents the element-wise product, and F' represents the output of the large-kernel convolution module.
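A sketch of the decomposed large-kernel convolution in PyTorch; the concrete kernel sizes (a 5×5 depth-wise convolution followed by a 7×7 depth-wise dilated convolution with dilation 3) are assumptions, as the text only names the three components:

import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Decomposed large-kernel convolution: depth-wise conv (spatial local),
    depth-wise dilated conv (spatial long-range), point-wise conv (channel)."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_d_conv = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)
        self.pw_conv = nn.Conv2d(dim, dim, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        a_m = self.pw_conv(self.dw_d_conv(self.dw_conv(f)))  # attention map A_m
        return a_m * f                                       # F' = A_m ⊗ F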
The output F' of the large-kernel convolution module is then fed into the self-attention module, which further deepens the relationships between the features. The input sequence F' is first projected to query (Q = W_q F'), key (K = W_k F') and value (V = W_v F') tensors, where W_q, W_k and W_v are parameter matrices. The self-attention module then employs H independent heads to compute self-attention in parallel. The self-attention module is calculated as follows:
F'' = W_L · Concat(head_1, …, head_H),  head_h = Softmax(Q_h K_h^T / √d) V_h

where W_L represents a linear transformation, √d represents the scaling factor, and F'' represents the output of the attention branch.
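For reference, this step can be sketched with PyTorch's standard multi-head attention, used here as an assumed equivalent of the formula above (the W_q, W_k, W_v and W_L projections are internal to the module):

import torch
import torch.nn as nn

dim, heads = 256, 8
mhsa = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
f_prime = torch.randn(1, 1000, dim)                  # flattened tokens from the large-kernel module
f_double_prime, _ = mhsa(f_prime, f_prime, f_prime)  # F'': output of the attention branch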
In addition to the attention branch, the present invention also proposes an MLP branch, which is a pure MLP architecture. This module represents each token as a wave with an amplitude and a phase: the amplitude is a value representing each token's feature, and the phase is used to represent the relationship between tokens. Phase differences between tokens affect their aggregated output, and tokens with similar phases tend to reinforce each other. The phase-aware token-mixing module of the present invention represents tokens as waves with amplitude and phase and then aggregates them, producing the wave-module output F_m, which enhances the representation capability. The final output of the two-way parallel attention module is:

F = F'' + F_m
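A sketch of the phase-aware token-mixing idea; the exact parameterization (how the phase is estimated and how the cosine and sine components are mixed) is an assumption:

import torch
import torch.nn as nn

class PhaseAwareTokenMixing(nn.Module):
    """Represents each token as a wave with an amplitude and a phase and
    aggregates tokens: waves with similar phases reinforce each other,
    while opposite phases cancel."""
    def __init__(self, dim: int, num_tokens: int):
        super().__init__()
        self.amplitude = nn.Linear(dim, dim)              # per-token amplitude
        self.phase = nn.Linear(dim, dim)                  # per-token phase estimate
        self.mix_cos = nn.Linear(num_tokens, num_tokens)  # token mixing, cosine part
        self.mix_sin = nn.Linear(num_tokens, num_tokens)  # token mixing, sine part
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        amp, theta = self.amplitude(x), self.phase(x)
        real = amp * torch.cos(theta)                     # unfold each wave into
        imag = amp * torch.sin(theta)                     # real/imaginary components
        real = self.mix_cos(real.transpose(1, 2)).transpose(1, 2)
        imag = self.mix_sin(imag.transpose(1, 2)).transpose(1, 2)
        return self.proj(real + imag)                     # aggregated output F_m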
two limitations of the traditional self-attention mechanism are applied along the channel dimension. The first problem is computational complexity, and the other problem is that the Softmax operation makes the parameters inefficient. Since the self-attention mechanism focuses more on feature information in the spatial dimension and ignores information in the channel dimension. Thus, the present invention utilizes channel attention in the encoder instead of a feed-forward network to aggregate features. This module not only improves the performance of the model, but also reduces the number of parameters of the model. The channel attention is calculated as:
F_c = σ(z W_Q') ⊗ F

where W_Q' denotes a parameter matrix, σ denotes the Softmax operation, and z denotes a token prototype.
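A sketch of this channel re-weighting under the stated definitions; obtaining the token prototype by averaging all tokens is an assumption:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Replaces the feed-forward network: per-channel weights derived from
    a token prototype re-weight the feature channels."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # parameter matrix W_Q'

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        z = x.mean(dim=1)                        # token prototype
        w = torch.softmax(self.proj(z), dim=-1)  # channel weights (sigma)
        return x * w.unsqueeze(1)                # channel selection by re-weighting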
Step 4-3: The input to the network decoder includes the feature map from the encoder and query objects derived from learnable location features. The decoder contains cross-attention and self-attention modules, and the query elements of both types of attention module are the query objects.
In the cross-attention module, the query objects extract features from the feature map, with the key elements coming from the output feature map of the encoder. In the self-attention module, the query objects interact with each other, and the key elements are the query objects themselves.
The present invention directly uses anchor box coordinates as queries and updates them dynamically layer by layer. Using anchor boxes as queries not only helps improve query-to-feature similarity through an explicit positional prior, but also eliminates the slow training convergence problem of DETR. The width and height information of the box is used to modulate the positional attention so that the queries in DETR can explicitly perform soft ROI pooling layer by layer in a cascaded fashion.
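For illustration, the positional query derived from anchor box coordinates can be sketched as a sinusoidal embedding of (x, y, w, h), in the spirit of DAB-DETR; the embedding details below are assumptions:

import math
import torch

def box_positional_query(boxes: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """boxes: (num_queries, 4) normalized (x, y, w, h) anchor boxes.
    Returns (num_queries, dim) positional query embeddings."""
    d = dim // 4  # features per coordinate; dim must be divisible by 8
    freqs = 10000 ** (torch.arange(d // 2, dtype=torch.float32) * 2 / d)
    parts = []
    for i in range(4):  # x, y, w, h
        pos = boxes[:, i : i + 1] * 2 * math.pi / freqs
        parts.append(torch.cat([pos.sin(), pos.cos()], dim=-1))
    return torch.cat(parts, dim=-1)

queries = box_positional_query(torch.rand(100, 4))  # 100 queries, as in step 4-4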
Step 4-4: Iterative training is performed on the training set images according to the above principle for 50 epochs. The other important parameters are: the learning rate is 0.0001, the weight decay is 0.1, the learning rate begins to decay at epoch 40, the batch size is 8, and the number of queries is 100. The model is trained iteratively on an NVIDIA graphics card throughout, and after training a model file usable for the target detection task is generated.
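The stated hyperparameters map onto a PyTorch training setup roughly as follows; the AdamW optimizer, the step scheduler and the 0.1 decay factor are assumptions, while the numeric values (learning rate 1e-4, weight decay 0.1, decay from epoch 40, 50 epochs, batch size 8, 100 queries) come from the text:

import torch

model = torch.nn.Linear(256, 91)  # placeholder for the detection network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

BATCH_SIZE, NUM_QUERIES, NUM_EPOCHS = 8, 100, 50
for epoch in range(NUM_EPOCHS):
    # ... one pass over the COCO2017 training set on an NVIDIA GPU ...
    scheduler.step()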
The specific procedure of step 5 of the target detection method based on the two-way parallel attention mechanism is as follows:
and (4) inputting the verification set images in the data set in the step 2-2 into the trained target detection model for detection according to the target detection model obtained in the step 4, and calculating the precision of the model.
There are six precision indexes in total: AP, AP50, AP75, APS, APM and APL. AP is the overall average precision of the model; AP50 is the average precision at IoU = 0.50; AP75 is the average precision at IoU = 0.75; and APS, APM and APL are the average precision of the model for detecting small, medium and large targets, respectively.
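These six metrics can be computed with the standard COCO evaluation API; the file paths below are illustrative placeholders:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("data/annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")  # detections in COCO result format
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, APS, APM and APL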
The invention relates to a target detection method based on a two-way parallel attention mechanism, which has the following beneficial effects:
1. A target detection model with fewer parameters than DETR is provided, alleviating the problems of slow convergence and low precision of the DETR model.
2. The original self-attention module is replaced by the two-way parallel attention module, which captures global and local features through different branches so that the model can better extract image features.
3. The channel attention module replaces a feedforward network to aggregate the features, so that the model focuses more on the information of the feature channel dimension than the information of the space dimension. The method can reduce the parameter quantity of the model, accelerate the convergence of the model and improve the performance of the model.
In conclusion, the target detection method based on the two-way parallel attention mechanism uses a more flexible encoder with fewer parameters, which not only saves computational cost but also markedly improves training speed and model precision. Specifically, in the two-way parallel attention module, the attention branch first uses a large-kernel convolution to construct an attention map representing the importance of feature points, and then uses the self-attention mechanism to capture long-distance dependencies between image features. The MLP branch uses an MLP network to extract local features of the image: each token is represented as a wave with an amplitude and a phase, where the amplitude represents the value of each token's feature and the phase represents the relationship between tokens; phase differences between tokens affect their aggregated output, and tokens with similar phases reinforce each other. This combination of local and global image features better mines the feature information of the image. Channel attention makes the model focus more on channel-dimension information than on spatial-dimension information, and channel selection of features is facilitated by re-weighting. The above methods improve the model's ability to capture features from two aspects.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a block diagram of the encoder of the present invention;
FIG. 3 is a schematic diagram of token fusion in accordance with the present invention;
FIG. 4 is a block diagram of the channel attention of the present invention.
Detailed Description
Example 1: the following describes in detail a target detection method based on a two-way parallel attention mechanism according to the present invention with reference to the accompanying drawings:
step 1, preparing a data set required by model training.
And 2, downloading the COCO2017 data set to a local computer, and then putting the data into a corresponding folder according to a format.
The data set consists of a training set image, a training set label file, a verification set image and a verification set label file. A folder named data is created in a project file, and three folders named train2017, val2017 and annotations are created in the folder.
The training set images are placed in the train2017 folder, the verification set images are placed in the val2017 folder, and the training set annotation file and the verification set annotation file are placed in the annotations folder.
And 3, creating a virtual environment in the software 'Anaconda' to configure all the dependencies required by the target detection network based on the two-way parallel attention mechanism, wherein the Python version of the corresponding virtual environment is 3.7.
The deep learning framework is downloaded and installed from the PyTorch official website, selecting PyTorch version 1.10.0 and CUDA version 11.3 to build the deep learning architecture.
All dependencies in the target detection network based on a two-way parallel attention mechanism are installed in this virtual environment.
And 4, inputting the image data of the training set into a target detection network based on a two-way parallel attention mechanism for iterative training.
First, the backbone of the target detection network based on the two-way parallel attention mechanism extracts the features of the input image, and multi-scale image features are extracted from the residual neural network.
The extracted feature map is input into an encoder of the network, and global and local features of the image are captured by a two-way parallel attention module. One branch uses a variant of the self-attention mechanism to capture long-distance dependencies between image features, and the other branch uses a multi-layer perceptron network to extract local relationships of image features.
The encoder also uses channel attention, rather than a feed-forward network, to aggregate features, making the model focus more on channel-dimension information than on spatial information. The output of the encoder is a feature map with the same resolution as the input.
The output of the encoder is fed to a decoder for prediction and model optimization.
The iterative training is repeated for 50 epochs, finally obtaining a model for target detection.
And step 5, according to the target detection model obtained in step 4, the verification set images in the data set of step 2 are input into the trained target detection model for detection, and the model precision is calculated.
There are six precision indexes in total: AP, AP50, AP75, APS, APM and APL. AP is the overall average precision of the model; AP50 is the average precision at IoU = 0.50; AP75 is the average precision at IoU = 0.75; and APS, APM and APL are the average precision of the model for detecting small, medium and large targets, respectively.
The invention relates to a target detection method based on a two-way parallel attention mechanism, which replaces the original self-attention module with a two-way parallel attention module and captures global and local features through different branches, so that the model can better extract image features. The channel attention module replaces the feed-forward network to aggregate features, making the model focus more on channel-dimension information than on spatial-dimension information. The method can reduce the number of model parameters, accelerate model convergence and improve model performance.
The above contents only illustrate the technical idea of the present invention and do not thereby limit its protection scope; all modifications made on the basis of this technical solution belong to the technical idea proposed by the present invention and fall within the protection scope of the claims of the present invention.

Claims (6)

1. A target detection method based on a two-way parallel attention mechanism is characterized by comprising the following steps:
step 1, preparing a data set required by model training;
step 2, downloading the COCO2017 data set to a local computer, and then putting the data into a corresponding folder according to a format;
step 3, configuring a deep learning environment;
step 4, inputting the data set into a target detection network based on a two-way parallel attention mechanism for iterative training to finally obtain a target detection model;
and 5, sending the verification set images in the data set into the target detection model for verification according to the target detection model obtained in the step 4 to obtain the precision of the model.
2. The target detection method based on the two-way parallel attention mechanism according to claim 1, characterized in that in step 1: log in to the COCO dataset webpage (cocodataset.org) to download the data set required for model training, and download COCO2017 to the local computer.
3. The target detection method based on the two-way parallel attention mechanism according to claim 1, wherein in step 2: the data set comprises four parts, namely training set images, a training set annotation file, verification set images and a verification set annotation file; a folder named data is created in the project file, and three folders named train2017, val2017 and annotations are created inside it; the training set images are placed in the train2017 folder, the verification set images are placed in the val2017 folder, and the training set annotation file and the verification set annotation file are placed in the annotations folder.
4. The target detection method based on the two-way parallel attention mechanism according to claim 1, wherein in step 3: the deep learning environment is built with PyTorch from Facebook AI Research as the deep learning framework, on an Ubuntu 20.04 system with Python version 3.7, PyTorch 1.10.0 and CUDA 11.3.
5. The target detection method based on the two-way parallel attention mechanism according to claim 1, characterized in that in step 4: the training set images in the data set are input into the target detection network based on the two-way parallel attention mechanism for iterative training; an image is input, image features are extracted by the backbone of the target detection network based on the two-way parallel attention mechanism, and the features are input into the target detection model based on the two-way parallel attention mechanism for iterative training, finally yielding a trained target detection model; the learning rate is set to 0.0001, the model is trained for 50 epochs, the weight decay rate is 0.1, the learning rate begins to decay at epoch 40, the batch size is 8, and the number of queries is 100, using one NVIDIA graphics card.
6. The target detection method based on the two-way parallel attention mechanism according to claim 1, characterized in that in step 5: according to the target detection model obtained in step 4, the verification set images in the data set are input into the trained target detection model for detection, and the precision of the model is calculated.
CN202211416425.1A 2022-11-12 2022-11-12 Target detection method based on two-way parallel attention mechanism Pending CN115713672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211416425.1A CN115713672A (en) 2022-11-12 2022-11-12 Target detection method based on two-way parallel attention mechanism

Publications (1)

Publication Number Publication Date
CN115713672A true CN115713672A (en) 2023-02-24

Family

ID=85232953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211416425.1A Pending CN115713672A (en) 2022-11-12 2022-11-12 Target detection method based on two-way parallel attention mechanism

Country Status (1)

Country Link
CN (1) CN115713672A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117155595A (en) * 2023-05-12 2023-12-01 中国刑事警察学院 Malicious encryption traffic detection method and model based on visual attention network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination