CN115713672A

CN115713672A - Target detection method based on two-way parallel attention mechanism

Info

Publication number: CN115713672A
Application number: CN202211416425.1A
Authority: CN
Inventors: 杜松林; 袁伟哲
Original assignee: Shenzhen Institute Of Southeast University; Southeast University
Current assignee: Shenzhen Institute Of Southeast University; Southeast University
Priority date: 2022-11-12
Filing date: 2022-11-12
Publication date: 2023-02-24

Abstract

The invention provides a target detection method based on a two-way parallel attention mechanism, comprising the following steps: preparing a training data set; configuring a deep learning environment; and inputting the training data set into a target detection network based on a two-way parallel attention mechanism for iterative training to obtain a trained model for target detection. The solution proposes a more flexible encoder that better extracts features from training images. The encoder comprises a two-way parallel attention mechanism for capturing global and local features of the image and a channel attention for processing the channel-dimension information of the image features. In the two-way parallel attention mechanism, one branch uses a variant of the self-attention mechanism to capture long-distance dependencies among image features, and the other branch uses a multi-layer perceptron network to extract local relationships among the image features. The method reduces the number of model parameters and improves model accuracy.

Description

Target detection method based on two-way parallel attention mechanism

Technical Field

The invention relates to a target detection method based on a two-way parallel attention mechanism, and belongs to the technical field of computer vision based on deep learning.

Background

Target detection is a fundamental task in computer vision. Many computer vision tasks, such as image segmentation, object tracking, and keypoint detection, rely on target detection. There are currently two main types of target detection models. One is the classical convolutional architecture, which has made dramatic progress over the past decade. The other is DETR, a recently introduced end-to-end Transformer-based detection model that requires no manually designed components and also exhibits better performance. Most DETR-like work attempts to achieve faster convergence and better performance by improving DETR's query design; DAB-DETR and DINO, for example, achieve good performance through their query designs. The present invention instead designs an encoder that extracts features better, so that the model obtains better performance.

Convolutional Neural Networks (CNNs), in which convolution is the core operation, have long been the dominant architecture. Examples include GoogLeNet, VGGNet, and ResNet. In pursuit of higher performance, model structures have become more and more complex, but the core operation has always been convolution and its variants. Vision transformers and vision MLPs are new paradigms of computer vision architecture.

A recent advance in computer vision is the rise of vision transformers (ViTs). A vision transformer divides the input image into a series of tokens, which are then aggregated by self-attention layers to generate a representation. Since their introduction, ViTs have performed well in many visual recognition tasks. Most current work revolves around queries, linking queries to positional information and providing calibration box coordinate constraints for feature pooling. Although improving queries is important, the DETR encoder is equally important for extracting features from the input image. Existing vision transformers typically use a multi-head attention design and then aggregate information from the multiple independent heads with MLP blocks. Because different heads tend to focus on different parts of an object, information mixing in multi-head attention designs has become a bottleneck. How to aggregate information from the different heads is therefore important.

Multilayer perceptron (MLP)-like architectures, composed of fully connected layers and nonlinear activation functions, are receiving much attention. Although their structure is simpler and introduces less inductive bias, their performance is still comparable to that of state-of-the-art models. MLP-like models mainly comprise channel-mixing MLPs and token-mixing MLPs. The channel-mixing MLP transforms the features of each token, while the token-mixing MLP aggregates information from different tokens. However, many current MLP architectures do not perform as well as CNN architectures and transformers. One cause of this bottleneck is that fully connected layers aggregate different tokens with fixed weights. Existing vision MLP models ignore the differences in semantic information among tokens and aggregate them using fixed weights, which may not aggregate all tokens of an input image well.

The target detection network based on the two-way parallel attention mechanism builds on DAB-DETR and provides a more flexible encoder that better extracts the features of a training image. The encoder includes a two-way parallel attention module for capturing global and local features of the image and a channel attention module for processing the channel-dimension information of the image features.

In the two-way parallel attention mechanism, one branch uses a variant of the self-attention mechanism to capture the long-distance dependencies between image features. The other branch uses an MLP network to extract local features of the image. In this branch each token is represented as a wave with an amplitude and a phase: the amplitude represents the feature value of the token, and the phase represents the relationship between tokens. The phase difference between tokens affects their aggregated output, and tokens with close phases reinforce each other. The present invention uses channel attention instead of a feed-forward network, which makes the model focus on channel-dimension information rather than spatial information and facilitates channel selection of features by re-weighting. Compared with DAB-DETR, the target detection network based on the two-way parallel attention mechanism has fewer parameters and better performance, and both the convergence speed and the precision of the model are greatly improved.

Target detection still has open problems that need to be studied, such as: detecting small target objects; detecting targets that are heavily occluded; distinguishing targets from non-target objects of similar shape in the image; detecting in real time; improving transfer training on small amounts of data; and coping with the lack of varied training samples as target detection is applied ever more widely.

Disclosure of Invention

Aiming at the problem that the precision of the target detection model is not high enough, the invention aims to provide a target detection method based on a two-way parallel attention mechanism, which can better capture image features by utilizing a two-way parallel attention module and a channel attention module so as to improve the model precision.

The invention discloses a complete technical means and a method.

In order to achieve the purpose, the invention adopts the following technical scheme: a target detection method based on a two-way parallel attention mechanism comprises the following steps:

step 1, preparing a data set required by model training;

step 2, downloading the COCO2017 data set (Lin, Tsung-Yi, et al. "Microsoft COCO: Common Objects in Context." European Conference on Computer Vision. Springer, Cham, 2014.) to a local computer, and then placing the data into the corresponding folders according to the required format;

step 3, configuring a deep learning environment;

Step 4, inputting the data set into a target detection network based on a two-way parallel attention mechanism for iterative training to finally obtain a target detection model;

step 5, according to the target detection model obtained in the step 4, sending the verification set image in the data set into the target detection model for verification to obtain the precision of the model;

Step 1 of the target detection method based on the two-way parallel attention mechanism specifically comprises:

logging in to the data set web page (…org) and downloading the COCO2017 data set required for model training to the local computer.

Step 2 of the target detection method based on the two-way parallel attention mechanism specifically comprises the following sub-steps:

Step 2-1: the data set consists of four parts, namely the training set images, the training set annotation file, the verification set images and the verification set annotation file. A folder named data is created in the project directory, and three folders named train2017, val2017 and annotations are created inside it.

Step 2-2: the training set images are placed in the train2017 folder, the verification set images are placed in the val2017 folder, and the training set and verification set annotation files are placed in the annotations folder.
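The folder layout of steps 2-1 and 2-2 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `create_coco_layout` and the use of a temporary directory as the project root are assumptions made for the example and do not come from the original text.

```python
import tempfile
from pathlib import Path


def create_coco_layout(project_root):
    """Create data/train2017, data/val2017 and data/annotations under project_root."""
    data_dir = Path(project_root) / "data"
    for name in ("train2017", "val2017", "annotations"):
        # parents=True also creates the "data" folder on the first pass
        (data_dir / name).mkdir(parents=True, exist_ok=True)
    return data_dir


# Hypothetical usage: a throwaway directory stands in for the project directory.
project_root = tempfile.mkdtemp()
data_dir = create_coco_layout(project_root)
print(sorted(p.name for p in data_dir.iterdir()))
```

The images and annotation files would then be copied into these folders by hand or with a separate script.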

Step 3-1: a virtual environment is created in the Anaconda software to hold the dependencies required by the target detection network based on the two-way parallel attention mechanism; the Python version of this virtual environment is 3.7.

Step 3-2: the deep learning framework is downloaded and installed from the official PyTorch website, selecting PyTorch version 1.10.0 and CUDA version 11.3. All dependencies of the target detection network based on the two-way parallel attention mechanism are installed in this virtual environment.

Step 4 specifically comprises inputting the prepared training data set into the network as the algorithm requires and performing iterative training to obtain a target detection model.

Step 4-1: the backbone of the target detection network based on the two-way parallel attention mechanism is a residual neural network (ResNet). Image features are extracted from layers C₃ to C₅ of the residual neural network, with corresponding down-sampling rates of 8, 16 and 32, yielding the output F of the backbone of the target detection network based on the two-way parallel attention mechanism.
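To make the three down-sampling rates concrete, the sketch below computes the spatial size of each feature map for a given input image. The 800 x 1216 input size and the use of ceiling division are assumptions chosen for illustration; the original text does not state an input resolution or a rounding rule.

```python
import math


def feature_map_sizes(height, width, strides=(8, 16, 32)):
    """Return {stride: (h, w)} for an input image of the given size.

    Ceiling division is an assumption; an actual backbone may round differently.
    """
    return {s: (math.ceil(height / s), math.ceil(width / s)) for s in strides}


# Hypothetical input resolution.
sizes = feature_map_sizes(800, 1216)
for stride, (h, w) in sizes.items():
    print(f"down-sampling rate {stride}: {h} x {w}")
```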

Step 4-2: the extracted feature maps are input into the encoder of the network; the input and the output of the encoder are feature maps of the same resolution. In the encoder, the feature F is first input into the two-way parallel attention module. The module contains two parallel branches that share the same input but process the features in different ways. The outputs of the two branches are then added directly to form the output of the two-way parallel attention module.

The key to the attention mechanism is to generate an attention map that shows the connections between the parts of the image. One branch of the two-way parallel attention module is an attention branch that learns the relationships between tokens and extracts global features. There are currently two main approaches to modeling the relationships between different parts of an image: computing an attention map with large-kernel convolution, and computing it with a self-attention mechanism. Because an ordinary large-kernel convolution is computationally expensive and has many parameters, the invention captures the long-range relationships of the features by decomposing the large-kernel convolution. The method first constructs an attention map using a large-kernel convolution module. The large-kernel convolution is divided into three parts: a depthwise convolution, a depthwise dilated convolution and a pointwise convolution, which correspond to spatial local convolution, spatial long-range convolution and channel convolution, respectively. The large-kernel convolution is defined as:

$$A_m = \mathrm{Conv}_{1\times 1}\big(\mathrm{DW\_D\_Conv}(\mathrm{DW\_Conv}(F))\big)$$

$$F' = A_m \otimes F$$

where F denotes the input features of the encoder; A_m denotes the attention map, in which each value indicates the importance of the corresponding feature; ⊗ denotes the element-wise product; and F' denotes the output of the large-kernel convolution module.
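The parameter saving from decomposing a large-kernel convolution can be checked with simple arithmetic. The sketch below counts weights (ignoring biases) for a depthwise convolution, a depthwise dilated convolution and a 1×1 pointwise convolution, and compares the total with a single dense convolution covering the same receptive field. The kernel sizes (5 and 7), the dilation (3) and the channel count (256) are hypothetical values chosen for illustration; the original text does not specify them.

```python
def dense_conv_params(channels, kernel):
    """Weights of a standard kernel x kernel convolution, channels -> channels (no bias)."""
    return kernel * kernel * channels * channels


def decomposed_conv_params(channels, dw_kernel, dilated_kernel):
    """Weights of depthwise + depthwise dilated + 1x1 pointwise convolutions (no bias)."""
    depthwise = dw_kernel * dw_kernel * channels
    depthwise_dilated = dilated_kernel * dilated_kernel * channels
    pointwise = channels * channels
    return depthwise + depthwise_dilated + pointwise


def receptive_field(dw_kernel, dilated_kernel, dilation):
    """Receptive field of the depthwise conv followed by the dilated depthwise conv."""
    return dw_kernel + (dilated_kernel - 1) * dilation


channels = 256                       # hypothetical channel count
rf = receptive_field(5, 7, 3)        # hypothetical kernel sizes and dilation
dense = dense_conv_params(channels, rf)
decomposed = decomposed_conv_params(channels, 5, 7)
print(f"receptive field: {rf} x {rf}")
print(f"dense conv weights:      {dense}")
print(f"decomposed conv weights: {decomposed}")
```

Under these assumed sizes, the pointwise convolution dominates the decomposed cost, while the two depthwise stages add only a few thousand weights per layer.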

The output F' of the large-kernel convolution module is then fed into the self-attention module, which further strengthens the relationships between the features. The input sequence F' is first projected into query (Q = W_q F'), key (K = W_k F') and value (V = W_v F') tensors, where W_q, W_k and W_v are parameter matrices. The self-attention module then uses H independent heads to compute self-attention in parallel. The self-attention module is calculated as follows (the formula is not reproduced in this text):

where W_L denotes a linear transformation, the attention scores are divided by a scaling factor, and F'' denotes the output of the attention branch.
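For orientation, the following is a minimal single-head sketch of scaled dot-product attention in plain Python. It is a generic illustration and not the patented module: the learned projections W_q, W_k and W_v, the H parallel heads and the final linear transformation W_L are omitted, and the choice of the square root of the feature dimension as the scaling factor is the usual convention, assumed here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of token vectors."""
    scale = math.sqrt(len(Q[0]))  # assumed scaling factor: sqrt of feature dimension
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return out, weights


# Hypothetical toy tokens: 2 queries, 3 keys/values, feature dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Each row of `weights` sums to 1, so every output token is a weighted average of all value vectors, which is what lets this branch relate distant positions.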

In addition to the attention branch, the present invention also proposes an MLP branch, which is a pure MLP architecture. The module represents each token as a wave with an amplitude and a phase. The amplitude represents the feature value of each token, and the phase represents the relationship between tokens. Phase differences between tokens affect their aggregated output, and tokens with close phases tend to reinforce each other. The phase-aware token mixing module of the present invention represents tokens as waves with an amplitude and a phase and then aggregates them. The final output of the wave module, F_m, is obtained through channel attention, which enhances its representation capability. The final output of the two-way parallel attention module is:

$$F = F'' + F_m$$
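The statement that tokens with close phases reinforce each other can be illustrated with complex numbers. In the sketch below each token is written as amplitude · e^(i·phase) and the tokens are summed; this complex-number form and the function names are assumptions made for illustration, not the module's actual implementation, which would learn the phases and the aggregation weights.

```python
import cmath
import math


def as_wave(amplitude, phase):
    """Represent a token feature as the complex number amplitude * e^(i * phase)."""
    return amplitude * cmath.exp(1j * phase)


def aggregate(amplitudes, phases, weights=None):
    """Sum the token waves (optionally weighted) and return the magnitude of the result."""
    if weights is None:
        weights = [1.0] * len(amplitudes)
    total = sum(w * as_wave(a, p) for w, a, p in zip(weights, amplitudes, phases))
    return abs(total)


in_phase = aggregate([1.0, 1.0], [0.0, 0.1])      # phases close together
opposed = aggregate([1.0, 1.0], [0.0, math.pi])   # phases opposite
print(f"close phases:    {in_phase:.4f}")
print(f"opposite phases: {opposed:.4f}")
```

Two unit-amplitude tokens with nearly equal phases aggregate to almost 2, while the same tokens with opposite phases cancel to almost 0, so the phase controls how strongly tokens contribute to each other.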

Applying the traditional self-attention mechanism along the channel dimension has two limitations. The first is computational complexity, and the second is that the Softmax operation lowers parameter efficiency. Moreover, the self-attention mechanism focuses on feature information in the spatial dimension and ignores information in the channel dimension. The present invention therefore uses channel attention in the encoder, instead of a feed-forward network, to aggregate features. This module not only improves the performance of the model but also reduces its number of parameters. The channel attention is calculated as follows (the formula is not reproduced in this text):

where W_Q' denotes a parameter matrix, σ denotes the Softmax operation, and the remaining operand is a token prototype.

Step 4-3: the input to the decoder of the network includes the feature map from the encoder and object queries derived from learnable positional features. The decoder contains cross-attention and self-attention modules, and the query elements of both types of attention module are the object queries.

In the cross-attention module, the object queries extract features from the feature map, and the key elements come from the output feature map of the encoder. In the self-attention module, the object queries interact with each other, and the key elements are the object queries themselves.

The present invention directly uses the calibration box coordinates as queries and dynamically updates them layer by layer. Using calibration boxes as queries not only improves query-to-feature similarity through explicit positional priors, but also eliminates the slow training convergence of DETR. The width and height of the box are used to modulate the positional attention, so that the queries in DETR explicitly perform soft ROI pooling layer by layer in a cascaded fashion.

Step 4-4: iterative training is performed on the training set images according to the above principles for 50 rounds. The other important parameters are: a learning rate of 0.0001, a weight decay rate of 0.1, learning rate decay starting in the 40th round, a batch size of 8, and 100 queries. The model is trained on an NVIDIA graphics card throughout, and a model file that can be used for the target detection task is generated once training finishes.
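The training settings of step 4-4 can be collected into a small configuration with a step learning-rate schedule, as sketched below. The decay factor `gamma = 0.1` and the zero-based round counting are assumptions: the original text states only that the learning rate begins to decay in the 40th round, not by how much. The dictionary keys are illustrative names, not options of any particular training script.

```python
def learning_rate(epoch, base_lr=1e-4, drop_epoch=40, gamma=0.1):
    """Step schedule: base_lr before drop_epoch, base_lr * gamma from drop_epoch on.

    gamma is an assumed decay factor, and epochs are counted from 0 (also assumed).
    """
    return base_lr * gamma if epoch >= drop_epoch else base_lr


# Values taken from the text; key names are hypothetical.
config = {
    "epochs": 50,
    "batch_size": 8,
    "num_queries": 100,
    "weight_decay": 0.1,
    "base_lr": 1e-4,
}

schedule = [learning_rate(e, config["base_lr"]) for e in range(config["epochs"])]
print(schedule[39], schedule[40])
```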

Step 5 specifically comprises inputting the verification set images of step 2-2 into the trained target detection model obtained in step 4 for detection, and calculating the precision of the model.

Six precision metrics are used: AP, AP₅₀, AP₇₅, AP_S, AP_M and AP_L. AP is the overall average precision of the model, AP₅₀ is the average precision at IoU = 0.50, AP₇₅ is the average precision at IoU = 0.75, and AP_S, AP_M and AP_L are the average precision of the model on small, medium and large targets, respectively.
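The IoU thresholds behind AP₅₀ and AP₇₅ decide whether a predicted box counts as a correct detection. The sketch below computes IoU for two axis-aligned boxes and checks one prediction against both thresholds; the box coordinates and the helper name `is_match` are hypothetical, and the full AP computation (ranking predictions and integrating the precision-recall curve) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


pred = (0.0, 0.0, 10.0, 10.0)    # hypothetical predicted box
truth = (2.0, 0.0, 12.0, 10.0)   # hypothetical ground-truth box
value = iou(pred, truth)
print(f"IoU = {value:.4f}")
print("counts at IoU 0.50:", is_match(pred, truth, 0.50))
print("counts at IoU 0.75:", is_match(pred, truth, 0.75))
```

The same prediction is accepted at the 0.50 threshold and rejected at 0.75, which is why AP₇₅ is the stricter of the two metrics.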

The invention relates to a target detection method based on a two-way parallel attention mechanism, which has the following beneficial effects:

1. A target detection model with fewer parameters than DETR is provided, which addresses the slow convergence and low precision of the DETR model.

2. The original self-attention module is replaced by the two-way parallel attention module, and global and local features are captured through different branches, so that the model can better extract image features.

3. The channel attention module replaces a feedforward network to aggregate the features, so that the model focuses more on the information of the feature channel dimension than the information of the space dimension. The method can reduce the parameter quantity of the model, accelerate the convergence of the model and improve the performance of the model.

In conclusion, the target detection method based on the two-way parallel attention mechanism uses a more flexible encoder with fewer parameters, which not only saves computational cost but also significantly improves training speed and model precision. Specifically, in the two-way parallel attention module, the attention branch first uses a large-kernel convolution to construct an attention map representing the importance of the feature points, and then uses the self-attention mechanism to capture the long-distance dependencies between image features. The MLP branch uses an MLP network to extract local features of the image: each token is represented as a wave with an amplitude and a phase, where the amplitude represents the feature value of the token and the phase represents the relationship between tokens; the phase difference between tokens affects their aggregated output, and tokens with close phases reinforce each other. Combining the local and global features of the image in this way mines the feature information of the image more effectively. Channel attention makes the model focus on channel-dimension information rather than spatial-dimension information, and facilitates channel selection of features by re-weighting. The method thus improves the model's ability to capture features in two respects.

Drawings

FIG. 1 is an overall flow diagram of the present invention;

FIG. 2 is a block diagram of the encoder of the present invention;

FIG. 3 is a schematic diagram of token fusion in accordance with the present invention;

FIG. 4 is a block diagram of the channel attention of the present invention.

Detailed Description

Example 1: the following describes in detail a target detection method based on a two-way parallel attention mechanism according to the present invention with reference to the accompanying drawings:

Step 1, preparing the data set required for model training.

Step 2, downloading the COCO2017 data set to a local computer, and then placing the data into the corresponding folders according to the required format.

The data set consists of the training set images, the training set annotation file, the verification set images and the verification set annotation file. A folder named data is created in the project directory, and three folders named train2017, val2017 and annotations are created inside it.

The training set images are placed in the train2017 folder, the verification set images are placed in the val2017 folder, and the training set and verification set annotation files are placed in the annotations folder.

Step 3, creating a virtual environment in the Anaconda software to hold all the dependencies required by the target detection network based on the two-way parallel attention mechanism, wherein the Python version of this virtual environment is 3.7.

The deep learning framework is downloaded and installed from the official PyTorch website, selecting PyTorch version 1.10.0 and CUDA version 11.3.

All dependencies in the target detection network based on a two-way parallel attention mechanism are installed in this virtual environment.

Step 4, inputting the training set images into the target detection network based on the two-way parallel attention mechanism for iterative training.

First, the backbone of the target detection network based on the two-way parallel attention mechanism, a residual neural network, extracts multi-scale features from the input image.

The extracted feature map is input into an encoder of the network, and global and local features of the image are captured by a two-way parallel attention module. One branch uses a variant of the self-attention mechanism to capture long-distance dependencies between image features, and the other branch uses a multi-layer perceptron network to extract local relationships of image features.

The encoder also uses channel attention, rather than a feed-forward network, to aggregate features by focusing on information about the feature channel dimensions, making the model more focused on information about the channel dimensions rather than spatial information. The output of the encoder is a feature map with the same resolution as the input.

The output of the encoder is fed to a decoder for prediction and model optimization.

Iterative training is repeated for 50 rounds to finally obtain a model for target detection.

Step 5, inputting the verification set images of the data set in step 2 into the trained target detection model obtained in step 4 for detection, and calculating the model precision.

Six precision metrics are used: AP, AP₅₀, AP₇₅, AP_S, AP_M and AP_L. AP is the overall average precision of the model, AP₅₀ is the average precision at IoU = 0.50, AP₇₅ is the average precision at IoU = 0.75, and AP_S, AP_M and AP_L are the average precision of the model on small, medium and large targets, respectively.

The target detection method based on the two-way parallel attention mechanism replaces the original self-attention module with a two-way parallel attention module, which captures global and local features through different branches and enables the model to better extract image features. The channel attention module replaces the feed-forward network to aggregate the features, so that the model focuses on the information of the feature channel dimension rather than the information of the spatial dimension. The method reduces the number of model parameters, accelerates model convergence and improves model performance.

The above-mentioned contents only illustrate the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and all the modifications made on the basis of the technical solution belong to the technical idea proposed by the present invention and fall within the protection scope of the claims of the present invention.

Claims

1. A target detection method based on a two-way parallel attention mechanism is characterized by comprising the following steps:

step 1, preparing a data set required by model training;

step 2, downloading the COCO2017 data set to a local computer, and then putting the data into a corresponding folder according to a format;

step 3, configuring a deep learning environment;

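step 4, inputting the data set into a target detection network based on a two-way parallel attention mechanism for iterative training to finally obtain a target detection model;

step 5, sending the verification set images in the data set into the target detection model obtained in step 4 for verification to obtain the precision of the model.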

2. The target detection method based on the two-way parallel attention mechanism according to claim 1, characterized in that the step 1 comprises: logging in to the data set web page (…org) and downloading the COCO2017 data set required for model training to the local computer.

3. The target detection method based on the two-way parallel attention mechanism according to claim 1, wherein the step 2 comprises: the data set comprises four parts, namely the training set images, the training set annotation file, the verification set images and the verification set annotation file; a folder named data is created in the project directory, and three folders named train2017, val2017 and annotations are created inside it; the training set images are placed in the train2017 folder, the verification set images are placed in the val2017 folder, and the training set annotation file and the verification set annotation file are placed in the annotations folder.

4. The target detection method based on the two-way parallel attention mechanism according to claim 1, wherein the step 3 comprises: building the deep learning environment with PyTorch, from Facebook's artificial intelligence research institute, as the deep learning framework, on an Ubuntu 20.04 system with Python 3.7, PyTorch 1.10.0 and CUDA 11.3.

5. The target detection method based on the two-way parallel attention mechanism according to claim 1, characterized in that the step 4 comprises: inputting the training set images in the data set into the target detection network based on the two-way parallel attention mechanism for iterative training; extracting image features from the input images using the backbone of the target detection network based on a structure self-adaptive attention mechanism; inputting the features into the target detection model based on the two-way parallel attention mechanism for iterative training, and finally obtaining a trained target detection model; wherein the learning rate is set to 0.0001, the model is trained for 50 rounds, the weight decay rate is set to 0.1, decay begins in the 40th round, the batch size is set to 8, the number of queries is 100, and one NVIDIA graphics card is used.

6. The target detection method based on the two-way parallel attention mechanism according to claim 1, characterized in that the step 5 comprises: inputting the verification set images in the data set into the trained target detection model obtained in step 4 for detection, and calculating the precision of the model.