CN115294356A - Target detection method based on wide area receptive field space attention - Google Patents
Target detection method based on wide area receptive field space attention
- Publication number: CN115294356A
- Application number: CN202210882431.XA
- Authority: CN (China)
- Prior art keywords: wide area; receptive field; target detection; attention; field space
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
- G06N3/08—Learning methods for neural networks
- G06V10/764—Image or video recognition using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806—Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
- G06V10/82—Image or video recognition using neural networks
- G06V2201/07—Target detection
Abstract
The invention discloses a target detection method based on wide area receptive field spatial attention, comprising the following steps: preparing an image data set for training and testing; constructing a target detection network based on wide area receptive field spatial attention, the network comprising four parts, namely a Backbone, a Neck, a Head and an MSA module; and performing feature extraction on the test-set images with the trained network. The invention captures pixel-level feature information from the perspective of a wide area receptive field while accounting for the interaction between different feature information, and greatly improves the feature-extraction effect without significantly increasing the parameter count or the amount of computation.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a target detection method based on wide area receptive field space attention.
Background
Against the background of deep learning's development, convolutional neural networks have gained ever wider acceptance and application. Deep-learning-based target detection algorithms use a convolutional neural network (CNN) to select features automatically; the features are then fed to a detector that classifies and localizes targets.
In neural-network learning, a model with more parameters generally has stronger expressive power and stores more information, but this can cause information overload. By introducing an attention mechanism, the network focuses on the information most critical to the current task among the many inputs, reduces attention to other information, and even filters out irrelevant information; this alleviates information overload and improves both the efficiency and the accuracy of task processing.

In recent years, attention mechanisms have been widely used in different deep-learning tasks such as object detection, semantic segmentation and pose estimation. Attention is divided into soft attention and hard attention, and the soft attention mechanism covers three attention domains: the spatial domain, the channel domain and the hybrid domain. The spatial domain corresponds to spatial transformations in the image; the channel domain concentrates information across the global channels; the hybrid domain combines channel and spatial attention. To let the network focus more attention on the area around a salient target, the present invention proposes a wide area receptive field spatial attention module to process the extracted feature maps.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a target detection method based on wide area receptive field spatial attention, which improves the feature-expression capability of a network without excessively increasing the number of model parameters. The method mainly comprises pooling operations, reshaping operations, dilated convolution blocks and up-sampling operations, and greatly enhances the expression of important feature information.

To achieve this purpose, the technical scheme provided by the invention is a target detection method based on wide area receptive field spatial attention, comprising the following steps:

Step 1, preparing an image data set for training and testing.

Step 2, constructing the target detection network based on wide area receptive field spatial attention.
Step 3, training the target detection network model based on wide area receptive field spatial attention using the training-set images.

Step 4, performing target detection on the test-set images using the network model trained in step 3.
In step 1, all images are resized to 512 × 512 for multi-scale training, and a series of data-enhancement operations is applied to the image data set: random flipping, padding, random cropping, normalization and image-distortion processing.
In step 2, the target detection network based on wide area receptive field spatial attention is composed of a Backbone, a Neck, a Head and the MSA wide area receptive field spatial attention module. The Backbone adopts a ResNet50 backbone network to extract picture features; the Neck structure connects the Backbone and the Head and fuses features; the Head detects objects and performs classification and regression of targets; the MSA modules are placed between the Backbone and the Neck and between the Neck and the Head.

The ResNet50 backbone outputs 4 feature maps [C1, C2, C3, C4] of different sizes, with strides [4, 8, 16, 32] and channel sizes [256, 512, 1024, 2048]. The Neck structure takes three Backbone feature maps [C2, C3, C4], reduces their channels to 256 with 1 × 1 convolutions, and fuses them as [P1, P2, P3] in an FPN structure; P3 is then downsampled twice to obtain P4 and P5. Finally, a 3 × 3 convolution is applied to each feature map to suppress the aliasing effect of fusion, and 5 feature maps of different sizes are output, with strides [8, 16, 32, 64, 128] and a channel size of 256 throughout.
The structure of the MSA is as follows: let F ∈ R^(C×H×W) be the input tensor, where C, H and W denote channel, height and width respectively. A 3 × 3 convolution halves the height and width of F, giving F′ ∈ R^(C×(H/2)×(W/2)). F′ then passes through an ordinary convolution branch to obtain F0 ∈ R^(1×(H/2)×(W/2)) and through three depthwise-separable convolution branches to obtain F1, F2, F3 ∈ R^((C/2)×(H/2)×(W/2)). F1, F2 and F3 are then reshaped from three dimensions to two, giving M1, M2 and M3, i.e.:

M_i = reshape(F_i), i = 1, 2, 3 (1)

M1, M2 and M3 share the matrix shape [H/2·W/2, C/2], where H/2·W/2 and C/2 are the rows and columns of the matrix. Each is multiplied by its own transpose, producing three relation matrices N1, N2 and N3, in which every value represents the relation between a pair of pixels in the features:

N_i = M_i ⊗ M_i^T, i = 1, 2, 3 (2)

where ⊗ denotes matrix multiplication and M_i^T is the transpose of M_i; N1, N2 and N3 have shape [H/2·W/2, H/2·W/2], the two factors being the rows and columns of the matrix.

N1, N2 and N3 are reshaped into T1, T2 and T3 of shape [H/2·W/2, H/2, W/2], the three factors denoting channel, height and width. To obtain an output containing a more useful global prior, F0 is spliced together with T1, T2 and T3 to give the feature F_M:

F_M = concat[F0, T1, T2, T3] (3)

where F_M ∈ R^(((H/2·W/2)·3)×(H/2)×(W/2)), the factors denoting channel, height and width.

F_M is reshaped into Y1 to generate an attention weight; Y1 is then resized to Y2 by an interpolation algorithm, matching the spatial size of the input feature Input. A reshaping operation turns Y2 into a three-dimensional tensor of size [1, H, W], which is passed through a Sigmoid function and multiplied with Input to obtain the final Output.
In step 3, the training-set images are resized uniformly to 512 × 512, the learning rate is set to 0.001, batch_size is set to 4, and training runs for 12 epochs, with the learning rate reduced to 1/10 of its previous value at the 8th and 11th epochs.
Compared with the prior art, the invention has the following advantages:
Compared with ordinary spatial attention, the proposed method captures pixel-level feature information from the perspective of a wide area receptive field while accounting for the interaction between different feature information, greatly improving the feature-extraction effect without significantly increasing the parameter count or the amount of computation.
Drawings
Fig. 1 is a schematic diagram of a network structure according to the present invention.
Fig. 2 is a schematic view of a spatial attention structure of a wide area receptive field.
Fig. 3 is a schematic diagram of the network detection effect according to the present invention.
Detailed Description
The invention provides a target detection method based on wide area receptive field space attention, and the technical scheme of the invention is further explained by combining the attached drawings and an embodiment.
As shown in fig. 1, the process of the embodiment of the present invention includes the following steps:
The COCO 2017 data set is selected: a large and rich object detection, segmentation and captioning data set containing 80 detection categories, i.e. 80 common objects in daily life such as people, bicycles, automobiles, motorcycles, airplanes, buses, trains, trucks, ships and traffic lights. It comprises four parts: annotations, test2017, train2017 and val2017, where train2017 contains 118,287 images, val2017 contains 5,000 images, and test2017 contains 28,660 images; the annotations, covering object instances, object keypoints and image captions, are stored as JSON files.
All images are resized to 512 × 512 for multi-scale training, and a series of data-enhancement operations is applied to the image data set: random flipping, padding, random cropping, normalization and image-distortion processing.
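A minimal sketch of these enhancement operations (flip, pad, crop, normalize), assuming an HWC image and ImageNet mean/std constants; the pad width, flip probability and normalization values are illustrative choices not stated in the text, and the distortion step is omitted for brevity:

```python
import numpy as np

# Assumed normalization constants (ImageNet RGB mean/std); the text does
# not specify which values are used.
MEAN = np.array([123.675, 116.28, 103.53])
STD = np.array([58.395, 57.12, 57.375])

def augment(img, rng, pad=32):
    """Random flip -> zero padding -> random crop back to the original size -> normalize."""
    h, w = img.shape[:2]
    if rng.random() < 0.5:                               # random horizontal flip
        img = img[:, ::-1]
    img = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))  # zero padding on all sides
    top = rng.integers(0, 2 * pad + 1)                   # random crop offsets
    left = rng.integers(0, 2 * pad + 1)
    img = img[top:top + h, left:left + w]                # crop back to h x w
    return (img.astype(np.float64) - MEAN) / STD         # per-channel normalization
```

In a real pipeline these steps would be composed with the resize to 512 × 512 and the photometric-distortion transform that the text also mentions.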
Step 2, constructing the target detection network based on wide area receptive field spatial attention.
As shown in fig. 1, the target detection network based on wide area receptive field spatial attention is composed of four parts: Backbone, Neck, Head and MSA (Multiple-receptive-field Spatial Attention). The Backbone adopts a ResNet50 backbone network to extract picture features and outputs 4 feature maps [C1, C2, C3, C4] of different sizes, with strides [4, 8, 16, 32] and channel sizes [256, 512, 1024, 2048]. The Neck structure connects the Backbone and the Head and fuses features: it takes three Backbone feature maps [C2, C3, C4], reduces their channels to 256 with 1 × 1 convolutions, fuses them as [P1, P2, P3] in the FPN structure, and then downsamples P3 twice to obtain P4 and P5; finally, a 3 × 3 convolution is applied to each feature map to suppress the aliasing effect of fusion, and 5 feature maps of different sizes are output, with strides [8, 16, 32, 64, 128] and a channel size of 256. The Head detects objects and performs classification and regression of targets.

The MSA wide area receptive field spatial attention mechanism is placed between the Backbone and the Neck, and between the Neck and the Head, i.e. the "MSA" blocks in fig. 1, in 8 positions.
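As a quick check of the shapes above: for a 512 × 512 input, each stride s yields a feature map of side 512/s, so the backbone and neck output sizes follow directly from the listed strides.

```python
def feature_map_sizes(input_size, strides):
    """Spatial side length of each feature map: input size divided by its stride."""
    return [input_size // s for s in strides]

backbone_sizes = feature_map_sizes(512, [4, 8, 16, 32])    # C1..C4
neck_sizes = feature_map_sizes(512, [8, 16, 32, 64, 128])  # P1..P5
```

For a 512 × 512 image this gives 128/64/32/16 for the backbone maps and 64/32/16/8/4 for the five neck outputs.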
The structure of the MSA is shown in FIG. 2. Let F ∈ R^(C×H×W) be the input tensor, where C, H and W denote channel, height and width respectively. To reduce the parameters and the amount of computation, a 3 × 3 convolution halves the height and width of F, giving F′ ∈ R^(C×(H/2)×(W/2)). F′ then passes through an ordinary convolution branch to obtain F0 ∈ R^(1×(H/2)×(W/2)) and through three depthwise-separable convolution branches to obtain F1, F2, F3 ∈ R^((C/2)×(H/2)×(W/2)). F1, F2 and F3 are then reshaped from three dimensions to two, giving M1, M2 and M3, i.e.:

M_i = reshape(F_i), i = 1, 2, 3 (1)

M1, M2 and M3 share the matrix shape [H/2·W/2, C/2], where H/2·W/2 and C/2 are the rows and columns of the matrix. Each is multiplied by its own transpose, producing three relation matrices N1, N2 and N3, in which every value represents the relation between a pair of pixels in the features:

N_i = M_i ⊗ M_i^T, i = 1, 2, 3 (2)

where ⊗ denotes matrix multiplication and M_i^T is the transpose of M_i; N1, N2 and N3 have shape [H/2·W/2, H/2·W/2]. The matrix multiplication helps fuse richer feature information while extracting features more finely from a pixel perspective.

N1, N2 and N3 are reshaped into T1, T2 and T3 for the next feature-fusion operation; T1, T2 and T3 have shape [H/2·W/2, H/2, W/2], the three factors denoting channel, height and width.

To obtain an output containing a more useful global prior, F0 is spliced together with T1, T2 and T3 to give the feature F_M ∈ R^(((H/2·W/2)·3)×(H/2)×(W/2)), the factors denoting channel, height and width:

F_M = concat[F0, T1, T2, T3] (3)

F_M is reshaped into Y1 to generate an attention weight; Y1 is then resized to Y2 by an interpolation algorithm, matching the spatial size of the input feature Input. A reshaping operation turns Y2 into a three-dimensional tensor of size [1, H, W], which is passed through a Sigmoid function and multiplied with Input to obtain the Output.
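The steps above can be sketched as a PyTorch module. Several details are assumptions, since the text does not specify them: the dilation rates of the three depthwise-separable branches, the exact layer layout of each branch, and how F_M is collapsed into a single-channel attention map (the text only says "reshape"; a channel mean is used here as an illustrative stand-in).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSA(nn.Module):
    """Sketch of the wide area receptive field spatial attention (MSA) module."""

    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        # 3x3 stride-2 convolution: halves the height and width (F -> F')
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        # ordinary convolution branch producing the single-channel map F0
        self.plain = nn.Conv2d(channels, 1, 3, padding=1)
        # three depthwise-separable branches producing F1..F3 with C/2 channels;
        # the dilation rates are an assumption (the text mentions dilated
        # convolution blocks but gives no rates)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels),             # depthwise
                nn.Conv2d(channels, channels // 2, 1),  # pointwise
            )
            for d in dilations
        ])

    def forward(self, x):
        b, c, h, w = x.shape
        f = self.down(x)                               # F': [B, C, H/2, W/2]
        h2, w2 = f.shape[2:]
        feats = [self.plain(f)]                        # F0: [B, 1, H/2, W/2]
        for branch in self.branches:
            fi = branch(f)                             # F_i: [B, C/2, H/2, W/2]
            mi = fi.flatten(2).transpose(1, 2)         # M_i: [B, HW/4, C/2]   (eq. 1)
            ni = torch.bmm(mi, mi.transpose(1, 2))     # N_i = M_i M_i^T       (eq. 2)
            feats.append(ni.view(b, h2 * w2, h2, w2))  # T_i: [B, HW/4, H/2, W/2]
        fm = torch.cat(feats, dim=1)                   # F_M = concat[F0, T1, T2, T3] (eq. 3)
        # collapse F_M to one attention map; the channel mean is an assumption
        y1 = fm.mean(dim=1, keepdim=True)              # Y1: [B, 1, H/2, W/2]
        y2 = F.interpolate(y1, size=(h, w), mode="bilinear",
                           align_corners=False)        # Y2: [B, 1, H, W]
        return x * torch.sigmoid(y2)                   # reweighted Output
```

Note that with F0 included the concatenated F_M has 3·(H/2·W/2) + 1 channels (the text states 3·(H/2·W/2)), and that the H/2·W/2-channel T_i maps make the module best suited to the relatively small feature maps of the detection neck.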
Step 3, training the target detection network model based on wide area receptive field spatial attention using the training-set images.
The training-set images are resized uniformly to 512 × 512, the learning rate is set to 0.001, batch_size is set to 4, and training runs for 12 epochs, with the learning rate reduced to 1/10 of its previous value at the 8th and 11th epochs.
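The step schedule above (drop to 1/10 at epochs 8 and 11) corresponds to PyTorch's MultiStepLR with milestones [8, 11] and gamma 0.1; a framework-free sketch of the same rule:

```python
def learning_rate(epoch, base_lr=0.001, milestones=(8, 11), gamma=0.1):
    """Step decay: the learning rate is multiplied by gamma at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

So epochs 0-7 train at 0.001, epochs 8-10 at 0.0001, and epoch 11 at 0.00001.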
Step 4, performing target detection on the test-set images using the network model trained in step 3.
Experimental environment: a Python environment with PyTorch 1.6, torchvision 0.7.0, CUDA 10.0 and cuDNN 7.4 as the deep-learning framework, implemented on the mmdetection 2.6 platform.

Experimental equipment: CPU Intel Xeon E5-2683 v3 @ 2.00 GHz; RAM 16 GB; graphics card Nvidia GTX 2060 Super; hard disk 500 GB.
To test the influence of the MSA wide area receptive field spatial attention structure on detection precision, comparison experiments were carried out on several networks. The evaluation standard adopts Average Precision (AP), with AP50, AP75, APS, APM and APL as the main criteria: AP50 and AP75 take the detector's results at IoU thresholds greater than 0.50 and greater than 0.75 respectively, while APS, APM and APL correspond to the detection accuracy on small, medium and large targets. The experimental results are shown in table 1.
Table 1 effect of MSA spatial attention on different networks
Table 1 shows the effect of the MSA wide area receptive field spatial attention structure on the COCO 2017 data set. As can be seen from the table, the increase for each network is between 0.7% and 0.9%. Because COCO 2017 images often contain many complex objects, the category, scale and pose of the targets to be detected are often uncertain, which makes detection harder; for example, the detection of large targets by ATSS and VFNet with MSA spatial attention added is slightly worse than with the original networks. Overall, the wide area receptive field spatial attention mechanism extracts important features well.
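The IoU thresholds behind AP50 and AP75 compare a predicted box with a ground-truth box by intersection-over-union; a minimal sketch with boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection counts as correct for AP50 when its IoU with a matching ground-truth box exceeds 0.50, and for AP75 when it exceeds 0.75.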
Some test pictures were selected to examine the final result. As can be seen from fig. 3, the proposed target detection network achieves good results. In the third picture, where only one bird is present, the network detects the object accurately; in the other pictures, which contain multiple objects, a good detection effect is also achieved. In the fourth and fifth pictures, categories are still identified accurately even when some objects are occluded. In addition, the proposed detection network also handles small objects and blurred images well, such as the ships in the fourth picture and the horses in the eighth picture. In general, the proposed network completes the target detection task accurately and recognizes objects well even at image edges.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (5)
1. A target detection method based on wide area receptive field space attention is characterized by comprising the following steps:
step 1, preparing an image data set for testing and training;
step 2, constructing a target detection network based on wide area receptive field space attention, wherein the network is composed of a Backbone, a Neck, a Head and an MSA wide area receptive field space attention, the Backbone adopts a ResNet50 Backbone network for extracting the characteristics of pictures, the Neck structure is used for connecting the Backbone and the Head and fusing the characteristics, the Head is used for detecting objects and realizing the classification and regression of the target, and the MSA is arranged between the Backbone and the Neck and between the Neck and the Head;
step 3, training a target detection network model based on wide area receptive field space attention by using a training set image;
and 4, performing target detection on the images in the test set by using the network model trained in the step 3.
2. The target detection method based on wide area receptive field spatial attention as claimed in claim 1, characterized in that: in step 1, all images are resized to 512 × 512 for multi-scale training, and a series of data-enhancement operations is applied to the image data set: random flipping, padding, random cropping, normalization and image-distortion processing.
3. The target detection method based on wide area receptive field spatial attention as claimed in claim 1, characterized in that: in step 2, the ResNet50 backbone network outputs 4 feature maps [C1, C2, C3, C4] of different sizes, with strides [4, 8, 16, 32] and channel sizes [256, 512, 1024, 2048]; the Neck structure takes three Backbone feature maps [C2, C3, C4], reduces their channels to 256 with 1 × 1 convolutions, fuses them as [P1, P2, P3] in the FPN structure, and then downsamples P3 twice to obtain P4 and P5; finally, a 3 × 3 convolution is applied to each feature map to suppress the aliasing effect of fusion, and 5 feature maps of different sizes are output, with strides [8, 16, 32, 64, 128] and a channel size of 256 throughout.
4. The target detection method based on wide area receptive field spatial attention as claimed in claim 1, characterized in that the structure of the MSA in step 2 is as follows: let F ∈ R^(C×H×W) be the input tensor, where C, H and W denote channel, height and width respectively; a 3 × 3 convolution halves the height and width of F, giving F′ ∈ R^(C×(H/2)×(W/2)); F′ then passes through an ordinary convolution branch to obtain F0 ∈ R^(1×(H/2)×(W/2)) and through three depthwise-separable convolution branches to obtain F1, F2, F3 ∈ R^((C/2)×(H/2)×(W/2)); F1, F2 and F3 are then reshaped from three dimensions to two, giving M1, M2 and M3, i.e.:

M_i = reshape(F_i), i = 1, 2, 3 (1)

M1, M2 and M3 share the matrix shape [H/2·W/2, C/2], where H/2·W/2 and C/2 are the rows and columns of the matrix; each is multiplied by its own transpose, producing three relation matrices N1, N2 and N3, in which every value represents the relation between a pair of pixels in the features:

N_i = M_i ⊗ M_i^T, i = 1, 2, 3 (2)

where ⊗ denotes matrix multiplication and M_i^T is the transpose of M_i; N1, N2 and N3 have shape [H/2·W/2, H/2·W/2], the two factors being the rows and columns of the matrix;

N1, N2 and N3 are reshaped into T1, T2 and T3 of shape [H/2·W/2, H/2, W/2], the three factors denoting channel, height and width; to obtain an output containing a more useful global prior, F0 is spliced together with T1, T2 and T3 to give the feature F_M:

F_M = concat[F0, T1, T2, T3] (3)

where F_M ∈ R^(((H/2·W/2)·3)×(H/2)×(W/2)), the factors denoting channel, height and width;

F_M is reshaped into Y1 to generate an attention weight; Y1 is then resized to Y2 by an interpolation algorithm, matching the spatial size of the input feature Input; a reshaping operation turns Y2 into a three-dimensional tensor of size [1, H, W], which is passed through a Sigmoid function and multiplied with Input to obtain the final Output.
5. The target detection method based on wide area receptive field spatial attention as claimed in claim 4, characterized in that: in step 3, the training-set images are resized uniformly to 512 × 512, the learning rate is set to 0.001, batch_size is set to 4, and training runs for 12 epochs, with the learning rate reduced to 1/10 of its previous value at the 8th and 11th epochs.
Priority Applications (1)
- CN202210882431.XA, priority/filing date 2022-07-26: Target detection method based on wide area receptive field space attention

Publications (1)
- CN115294356A (application), publication date 2022-11-04; status: Pending

Family ID: 83824991
Cited By (2)
- CN115810183A (published 2023-03-17; granted as CN115810183B, 2023-10-24), Yanshan University: Traffic sign detection method based on improved VFNet algorithm
- CN115690522A (published 2023-02-03), Hubei University of Technology: Target detection method based on multi-pooling fusion channel attention and application thereof
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination