CN114022682A - Weak and small target detection method based on attention secondary feature fusion mechanism - Google Patents
- Publication number
- CN114022682A (application CN202111305479.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses a weak and small target detection method based on an attention secondary feature fusion mechanism, which uses a secondary feature fusion module built on a double-layer pyramid to fuse the feature information that a backbone network extracts from the picture to be processed. The backbone of the secondary feature fusion module is a residual network. The first, top-down feature fusion applies successive convolution and downsampling to the feature maps extracted by the backbone; the second, bottom-up fusion applies successive convolution and upsampling to the maps obtained in the first pass. An attention mechanism is then introduced, and feature maps of the same size are fused by element-wise addition, so that the small-target feature information in the shallow feature maps is retained. Finally, predictions are made on the fused feature maps, whose receptive fields differ in size, and the prediction results are aggregated. The invention enables effective detection of vehicle targets in a haze environment.
Description
Technical Field
The invention relates to the technical field of target detection, and in particular to a weak and small target detection method based on an attention secondary feature fusion mechanism.
Background
With the improving performance of embedded devices and the development of edge computing, depth models deployed on edge computing devices have become increasingly common, making it feasible for an unmanned aerial vehicle (UAV) carrying an embedded device to perform high-altitude target detection tasks. As a relatively new platform, UAVs have been adopted by more and more industries because they are flexible, lightweight, unconstrained by terrain, and well suited to acquiring near-ground remote sensing data.
Among these, the performance of the depth intelligence model is critical. With the development of convolutional neural networks (CNNs), target detection algorithms have improved greatly; applying CNNs to target detection yields large gains in both speed and accuracy over traditional algorithms. These detectors fall into two categories. (1) Classification-based two-stage frameworks, represented by the R-CNN series, first use algorithms such as Selective Search or Edge Boxes to generate candidate regions (region proposals) that may contain targets, and then classify these regions and refine their positions to obtain the final detection result. (2) Regression-based one-stage frameworks, represented by the YOLO and SSD algorithms, need no candidate regions: anchor boxes are placed directly on the feature map, detection results are regressed from them and finally mapped back to the original image, which makes them much faster than two-stage methods. Two-stage detectors are accurate but slow; one-stage detectors are fast but less accurate. For detection against complex backgrounds, with eventual deployment on resource-limited hardware, achieving both accuracy and speed is very challenging.
However good a depth model is, its effectiveness depends on the dataset. Most existing models are trained on public datasets whose images were essentially captured under ideal conditions, so most existing depth models cannot be applied directly in real, complex environments. Moreover, UAV imagery is generally captured at high altitude, so the targets occupy very few pixels, and mainstream detection models perform poorly on such small targets.
Unmanned intelligent systems built from UAVs carrying intelligent algorithms are also increasingly popular. In the military field, for example, a military UAV can carry an intelligent detection algorithm to detect military targets; in industry, UAVs carrying detection algorithms support forest fire prevention, wildlife and plant protection, and so on. At present, however, most deep-learning-based target detection is performed in ideal environments, without considering real complex conditions such as haze. Meanwhile, targets photographed by a UAV at high altitude are often tiny, and mainstream detectors handle small targets poorly. How to detect vehicle targets effectively in a haze environment, and how to detect the small targets captured by UAVs, are therefore technical problems that must be solved.
Disclosure of Invention
Addressing these technical shortcomings of the prior art, the invention aims to provide a weak and small target detection method based on an attention secondary feature fusion mechanism. By designing a secondary feature fusion network built on this mechanism, the method enables real-time detection by a UAV carrying an embedded device in a haze environment and effectively improves the detection result.
The technical scheme adopted for realizing the purpose of the invention is as follows:
A weak and small target detection method based on an attention secondary feature fusion mechanism uses a secondary feature fusion module built on a double-layer pyramid to fuse the feature information that a backbone network extracts from the picture to be processed. The backbone of the secondary feature fusion module is a residual network. The first, top-down feature fusion applies successive convolution and downsampling to the feature maps extracted by the backbone; the second, bottom-up fusion applies successive convolution and upsampling to the maps obtained in the first pass. An attention mechanism is then introduced, and feature maps of the same size are fused by element-wise addition, retaining the small-target feature information in the shallow feature maps. Finally, predictions are made on the fused feature maps with their different receptive field sizes, and the prediction results are aggregated.
Preferably, during successive convolution and downsampling the residual network produces feature maps at downsampling multiples of 2, 4, 8, 16 and 32.
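As a concrete illustration (not part of the claimed method), the five downsampling multiples fix the spatial sizes of the pyramid levels. The sketch below assumes a hypothetical 640 x 640 input picture; the input size and level count are illustrative, not taken from the patent.

```python
# Hypothetical illustration: side lengths of the five feature maps a stride-2
# backbone would produce for an assumed 640x640 input picture.
def pyramid_sizes(input_size: int, num_levels: int = 5) -> list:
    sizes = []
    size = input_size
    for _ in range(num_levels):
        size //= 2  # each stage halves the resolution: 2x, 4x, 8x, 16x, 32x overall
        sizes.append(size)
    return sizes

print(pyramid_sizes(640))  # [320, 160, 80, 40, 20]
```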
Preferably, while fusing the feature maps, the residual network adds an attention mechanism between feature maps of the same size to focus on target features, so that the model increases the feature weight of target regions affected by haze, concentrates its attention on regions occluded by haze, and extracts target features more effectively.
Preferably, the attention mechanism employs a hybrid attention mechanism that combines spatial attention and channel attention.
Preferably, after the input multi-channel feature map has been convolved, a channel attention mechanism is introduced first, assigning a different weight to each channel and outputting the processed result. Spatial attention is then applied to that result: a mean operation over the channels learns the overall distribution of all channels, discards singular channels, and treats the picture features within each channel equally, finally outputting the result map.
Preferably, the pictures to be processed include pictures, acquired by an aircraft in a haze environment and/or at high altitude, in which the target occupies fewer pixels than a preset threshold.
The invention provides an attention mechanism based on secondary feature fusion that enables effective detection of vehicle targets in a haze environment. By embedding the attention mechanism into the second, bottom-up feature fusion pass, the feature information extracted from the vehicle target at different receptive fields is fully exploited, the model's attention is focused on effective target features, and the detection result is improved.
Drawings
FIG. 1 is a schematic diagram showing heat maps of targets of different sizes on feature maps with different receptive fields.
FIG. 2 is a schematic diagram of a double-layer pyramid feature fusion method according to the present invention.
FIG. 3 is a block diagram of the secondary feature fusion module based on an attention mechanism.
FIG. 4 is a schematic diagram of a 64-channel feature map.
FIG. 5 is a schematic diagram of the hybrid attention mechanism of the present invention.
FIG. 6 is a flowchart of the weak and small target detection method based on the attention secondary feature fusion mechanism.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A depth-model-based target detection algorithm generally comprises a backbone network (backbone), a feature fusion network (neck) and a prediction network (head). The backbone extracts the feature information in the picture for the later networks to use; the feature fusion network sits between the backbone and the prediction network and makes better use of the features the backbone extracted; the prediction network produces the network's output, predicting results from the previously extracted features.
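The backbone/neck/head decomposition above can be sketched as follows. This is a deliberately toy numpy illustration of the data flow only (average pooling stands in for convolution, stacked copies stand in for channels, a mean stands in for prediction); none of it reproduces the patent's actual layers.

```python
import numpy as np

def backbone(image, levels=3, channels=8):
    """Toy feature extractor: average-pool the image to several scales."""
    feats, x = [], image
    for _ in range(levels):
        h, w = x.shape[0] // 2, x.shape[1] // 2
        x = x[:h * 2, :w * 2].reshape(h, 2, w, 2).mean(axis=(1, 3))  # 2x downsample
        feats.append(np.stack([x] * channels))  # fake a channel dimension
    return feats  # shallow -> deep

def neck(feats):
    """Toy fusion: upsample each deeper map and add it to the next shallower one."""
    fused = [feats[-1]]
    for f in reversed(feats[:-1]):
        up = fused[-1].repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbour 2x
        fused.append(f + up[:, :f.shape[1], :f.shape[2]])
    return fused[::-1]  # back to shallow -> deep order

def head(fused):
    """Toy prediction: one 'objectness' score per fused map."""
    return [float(f.mean()) for f in fused]

image = np.ones((64, 64))
scores = head(neck(backbone(image)))
print(scores)  # one score per pyramid level
```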
To address the problems described in the background, the embodiment of the invention designs a secondary feature-pyramid fusion module that fuses bottom-up and top-down, solving the detection problem for small targets, and adds an attention mechanism to the secondary fusion process, solving target detection in haze environments.
In a CNN-based depth model, target features are closely tied to the receptive fields of the feature maps: feature maps at different stages correspond to receptive fields of different sizes and accordingly express information at different levels of abstraction. Shallow feature maps have small receptive fields and are therefore better suited to detecting small objects, while deep feature maps have large receptive fields and are better suited to detecting large objects.
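The growth of the receptive field with depth can be made concrete with the standard back-of-envelope recursion (a textbook formula, not something stated in the patent): for a layer with kernel k and stride s, r_out = r_in + (k - 1) * j_in and j_out = j_in * s, where j is the spacing of one output unit in input pixels.

```python
# Standard receptive-field recursion for a chain of conv/pool layers.
def receptive_field(layers):
    """layers: list of (kernel, stride) tuples, applied in order."""
    r, j = 1, 1  # receptive field and jump of a single output unit
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Two stacked 3x3 stride-1 convolutions see a 5x5 input window...
print(receptive_field([(3, 1), (3, 1)]))  # 5
# ...while stride-2 layers make the field grow far faster with depth,
# which is why deep maps suit large objects and shallow maps small ones.
print(receptive_field([(3, 2)] * 5))      # 63
```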
As convolution deepens, the receptive field of the feature map grows, so the features of small targets gradually disappear. This can be observed by printing heat maps at different stages of the convolution, as shown in FIG. 1: a large target such as the zebra is captured more easily by higher-layer features, because a large object needs a large receptive field and high-level semantic features, while discriminating the flock at the bottom of the image relies more on the fine-grained information in the shallow features. Hence, to improve small-target detection, the shallow and deep features must be fused.
To fully exploit target features on feature maps with different receptive fields, the invention designs a double-layer-pyramid secondary feature fusion module, shown in FIG. 2. The backbone is a residual network that produces feature maps at downsampling multiples of 2, 4, 8, 16 and 32 during successive convolution and downsampling. The first, top-down fusion pass applies successive convolution and downsampling to the feature maps extracted by the backbone; the second, bottom-up pass applies successive convolution and upsampling to the maps obtained in the first pass. Feature maps of the same size are then fused by element-wise addition, predictions are made on the fused maps with their different receptive field sizes, and the prediction results are aggregated. Through this repeated fusion, the small-target feature information in the shallow feature maps is retained and small-target detection improves.
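The two fusion passes can be sketched in numpy as below. This is a simplified illustration of the data flow only: the convolutions are replaced by identity (block-average pooling stands in for "convolution + downsampling", nearest-neighbour repetition for "convolution + upsampling"), and the three-level pyramid is an assumed toy size, not the patent's 2x-32x configuration.

```python
import numpy as np

def down2(x):  # stand-in for "convolution + downsampling"
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x.reshape(h, 2, w, 2).mean(axis=(1, 3))

def up2(x):    # stand-in for "convolution + upsampling" (nearest neighbour)
    return x.repeat(2, axis=0).repeat(2, axis=1)

def secondary_fusion(backbone_maps):
    """backbone_maps: feature maps ordered shallow (large) -> deep (small)."""
    # First fusion pass: walk down the pyramid, downsampling and adding.
    first = [backbone_maps[0]]
    for m in backbone_maps[1:]:
        first.append(m + down2(first[-1]))
    # Second fusion pass: walk back up, upsampling and adding same-size maps,
    # so the shallow maps' small-target detail is retained in the output.
    second = [first[-1]]
    for m in reversed(first[:-1]):
        second.append(m + up2(second[-1]))
    return second[::-1]  # shallow -> deep, one fused map per receptive field

maps = [np.ones((16, 16)), np.ones((8, 8)), np.ones((4, 4))]
out = secondary_fusion(maps)
print([m.shape for m in out])  # sizes are preserved level by level
```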
To extract target features effectively in a haze environment, the embodiment of the invention designs an attention-based secondary fusion module, whose structure is shown in FIG. 3; an attention mechanism is added between feature maps of the same size to focus on target features and improve feature extraction for the target.
The attention mechanism closely resembles human visual attention, a signal-processing mechanism specific to human vision. A human rapidly scans the global image to locate the region that deserves focus, then devotes more attention to that region to gather the needed detail about the target while suppressing other, useless information. It is a means of rapidly filtering high-value information out of a large volume of data with limited attentional resources, a survival mechanism formed over long evolution, and it greatly improves the efficiency and accuracy of visual information processing.
For example, given a newspaper with a printed picture, a reader first looks at the headline and then at the highlighted picture. The attention mechanism in deep learning is essentially similar to this selective human visual attention: its goal is to select, from many pieces of information, those most critical to the current task.
The attention mechanism is generally regarded as a resource-allocation mechanism. The original network model allocates resources to the input data uniformly, but research shows that data carry structure: the important parts can be identified and resources reallocated according to importance, highlighting key features. In a haze environment, therefore, the method uses attention to make the model increase the feature weight of target regions affected by haze; put simply, it focuses the model's attention on the regions occluded by haze.
Attention modules divide into channel attention and spatial attention. Channel attention assigns a different weight to each channel, so the network emphasizes important features and suppresses unimportant ones. Spatial attention shares the information of the more important channels with the less important ones, raising the importance of every channel of the feature map as a whole.
The difference between the two is explained below using the 64-channel feature map of FIG. 4.
Channel attention: each channel receives a different weight. For example, the shape of the horse is obvious in channels 1 and 2, so those channels receive larger weights, while channels 3 and 4 receive smaller ones.
Spatial attention: a mean operation over the 64 channels yields a single w x h weight map; the mean learns the overall distribution of all channels and discards singular channels. Channels 1 and 2 depict the horse's shape well while channels 3 and 4 do not (although the shape is essentially still present); after the mean is taken and the w x h weight map applied, channels 3 and 4 also receive some weight, which amounts to giving them some attention, so that they too can depict the horse's shape.
Knowing the design ideas behind the two attention mechanisms, a simple comparison reveals their strengths and weaknesses. Spatial attention ignores information in the channel domain and treats the picture features in every channel equally, so this spatial-domain transformation is limited to the original feature extraction stage and is hard to interpret when applied at other layers of a neural network. Channel attention, in turn, global-average-pools the information within each channel directly, ignoring the local information inside it, which is a rather crude practice. The invention therefore fuses the two single mechanisms into a hybrid attention model in which each compensates for the other's deficiency. The attention mechanism adopted by the model is this hybrid of spatial and channel attention, shown in FIG. 5:
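A minimal numpy sketch of this channel-then-spatial ordering is given below. The sigmoid-of-mean weight functions are illustrative stand-ins under assumed semantics, not the patent's actual layers; only the composition order (channel attention first, then a per-pixel mean over channels for spatial attention) follows the description above.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Weight each channel by a sigmoid of its global average (one weight per channel)."""
    w = _sigmoid(x.mean(axis=(1, 2)))
    return x * w[:, None, None]

def spatial_attention(x):
    """Weight each spatial position by the mean response across all channels."""
    w = _sigmoid(x.mean(axis=0))  # one weight per (h, w) cell
    return x * w[None, :, :]

def hybrid_attention(x):
    # Channel attention first, then spatial attention on its output.
    return spatial_attention(channel_attention(x))

x = np.random.rand(64, 8, 8)  # a 64-channel feature map, as in FIG. 4
y = hybrid_attention(x)
print(y.shape)  # the map's shape is unchanged; only the weighting differs
```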
Finally, the hybrid attention mechanism is embedded into the secondary feature fusion process, results are predicted on feature maps of different sizes, and the predictions are aggregated.
In experiments on the COCO dataset, the network model is compared with open-source detection models: detection of small targets improves by 4.9% and overall detection by 2.1%.
The experimental comparison results are as follows:
| Model | AP | AP50 | AP75 | APs | APm | APl | ARs | ARm | ARl |
|---|---|---|---|---|---|---|---|---|---|
| DeNet | 33.8 | 53.4 | 36.1 | 12.3 | 36.1 | 50.8 | 19.2 | 46.9 | 64.3 |
| CoupleNet | 34.4 | 54.8 | 37.2 | 13.4 | 38.1 | 50.8 | 20.7 | 53.1 | 68.5 |
| Cascade-RCNN | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 | - | - | - |
| Mask-RCNN | 39.8 | 62.3 | 43.3 | 22.1 | 43.2 | 51.2 | - | - | - |
| DSSD | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 | 21.8 | 49.1 | 66.4 |
| CornerNet | 42.1 | 57.8 | 45.3 | 20.8 | 44.8 | 56.7 | 38.5 | 62.7 | 77.4 |
| SSD | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 | 21.8 | 49.1 | 66.4 |
| YOLOv4 | 43.8 | 60.7 | 46.9 | 25.3 | 48.6 | 56.7 | 45.8 | 69.1 | 79.0 |
| YOLOv5 | 42.5 | 61.2 | 45.2 | 20.6 | 47.4 | 62.6 | 32.8 | 64.4 | 78.5 |
| Ours | 44.6 | 63.7 | 47.5 | 25.5 | 49.7 | 61.7 | 42.2 | 65.8 | 77.0 |

TABLE 1
Meanwhile, the invention also performs ablation experiments on the designed network model and tests its effect on a vehicle dataset. Comparative experiments adjust the detection branch and the attention module; the results are shown in Table 2:
| Configuration | P | R | AP0.5 | AP |
|---|---|---|---|---|
| Normal | 71.7 | 75.6 | 70.8 | 48.1 |
| Add-Detection | 68.3 | 80.5 | 74.9 | 51.8 |
| Add-Attention | 72.3 | 76.4 | 72.0 | 48.8 |
| Add-Detection-Attention | 73.8 | 81.2 | 75.9 | 50.8 |

TABLE 2
The lightweight network model makes it possible to deploy the depth model on resource-limited hardware and run detection in real time. Meanwhile, the attention-based secondary feature fusion module effectively extracts weak and small target features in haze, so that the UAV can detect the target vehicle effectively in a haze environment.
Test results show that the invention achieves real-time detection (30 FPS) on NVIDIA AGX Xavier hardware and improves detection in a haze environment.
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the invention, and such modifications and refinements should also be regarded as falling within the protection scope of the invention.
Claims (6)
1. A weak and small target detection method based on an attention secondary feature fusion mechanism, characterized in that a secondary feature fusion module built on a double-layer pyramid fuses the feature information that a backbone network extracts from the picture to be processed; the backbone of the secondary feature fusion module is a residual network; the first, top-down feature fusion applies successive convolution and downsampling to the feature maps extracted by the backbone; the second, bottom-up fusion applies successive convolution and upsampling to the maps obtained in the first pass; an attention mechanism is then introduced and feature maps of the same size are fused by element-wise addition, retaining the small-target feature information in the shallow feature maps; and predictions are then made on the fused feature maps, whose receptive fields differ in size, and the prediction results are aggregated.
2. The method of claim 1, wherein the residual network produces feature maps at downsampling multiples of 2, 4, 8, 16 and 32 during successive convolution and downsampling.
3. The method of claim 1 or 2, wherein, while fusing the feature maps, the residual network adds an attention mechanism between feature maps of the same size to focus on target features, so that the model increases the feature weight of target regions affected by haze, concentrates its attention on regions occluded by haze, and extracts target features more effectively.
4. The method of claim 3, wherein the attention mechanism is a hybrid attention mechanism combining spatial attention and channel attention.
5. The method of claim 4, wherein, after the input multi-channel feature map has been convolved, a channel attention mechanism is introduced first, assigning a different weight to each channel and outputting the processed result; spatial attention is then applied to that result, performing a mean operation over the channels to learn the overall distribution of all channels, discarding singular channels, treating the picture features within each channel equally, and finally outputting the result map.
6. The method of claim 1 or 5, wherein the pictures to be processed include pictures, acquired by an aircraft in a haze environment and/or at high altitude, in which the target occupies fewer pixels than a preset threshold.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111305479.6A | 2021-11-05 | 2021-11-05 | Weak and small target detection method based on attention secondary feature fusion mechanism |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114022682A | 2022-02-08 |
Family
ID=80061726
Application Events
- 2021-11-05: Application filed; CN CN202111305479.6A patent/CN114022682A/en, status Pending
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110533084A (en) * | 2019-08-12 | 2019-12-03 | 长安大学 | Multi-scale target detection method based on self-attention mechanism |
CN110929578A (en) * | 2019-10-25 | 2020-03-27 | 南京航空航天大学 | Anti-occlusion pedestrian detection method based on attention mechanism |
CN111179217A (en) * | 2019-12-04 | 2020-05-19 | 天津大学 | Attention mechanism-based remote sensing image multi-scale target detection method |
CN111415342A (en) * | 2020-03-18 | 2020-07-14 | 北京工业大学 | Automatic pulmonary nodule image detection method using a three-dimensional convolutional neural network fused with an attention mechanism |
CN112580859A (en) * | 2020-06-01 | 2021-03-30 | 北京理工大学 | Haze prediction method based on global attention mechanism |
CN111860693A (en) * | 2020-07-31 | 2020-10-30 | 元神科技(杭州)有限公司 | Lightweight visual target detection method and system |
CN112270366A (en) * | 2020-11-02 | 2021-01-26 | 重庆邮电大学 | Tiny target detection method based on adaptive multi-feature fusion |
CN112396115A (en) * | 2020-11-23 | 2021-02-23 | 平安科技(深圳)有限公司 | Target detection method and device based on attention mechanism and computer equipment |
CN112509001A (en) * | 2020-11-24 | 2021-03-16 | 河南工业大学 | Multi-scale and multi-feature fusion feature pyramid network blind restoration method |
CN112925904A (en) * | 2021-01-27 | 2021-06-08 | 天津大学 | Lightweight text classification method based on Tucker decomposition |
CN112990317A (en) * | 2021-03-18 | 2021-06-18 | 中国科学院长春光学精密机械与物理研究所 | Weak and small target detection method |
CN113408340A (en) * | 2021-05-12 | 2021-09-17 | 北京化工大学 | Dual-polarization SAR small ship detection method based on enhanced feature pyramid |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114387346A (en) * | 2022-03-25 | 2022-04-22 | 阿里巴巴达摩院(杭州)科技有限公司 | Image recognition and prediction model processing method, three-dimensional modeling method and device |
CN117036967A (en) * | 2023-10-08 | 2023-11-10 | 江西师范大学 | Remote sensing image captioning method using channel attention for non-visually perceived regions |
CN117036967B (en) * | 2023-10-08 | 2024-01-19 | 江西师范大学 | Remote sensing image captioning method using channel attention for non-visually perceived regions |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109271856B (en) | Optical remote sensing image target detection method based on dilated residual convolution | |
Jiao et al. | A deep learning based forest fire detection approach using UAV and YOLOv3 | |
Almeida et al. | EdgeFireSmoke: A novel lightweight CNN model for real-time video fire–smoke detection | |
CN109636848B (en) | Unmanned aerial vehicle-based oil and gas pipeline inspection method | |
Zhuo et al. | Cloud classification of ground-based images using texture–structure features | |
US11556745B2 (en) | System and method for ordered representation and feature extraction for point clouds obtained by detection and ranging sensor | |
CN111046880A (en) | Infrared target image segmentation method and system, electronic device and storage medium | |
CN111222396A (en) | All-weather multispectral pedestrian detection method | |
CN114022682A (en) | Weak and small target detection method based on attention secondary feature fusion mechanism | |
EP3690744B1 (en) | Method for integrating driving images acquired from vehicles performing cooperative driving and driving image integrating device using same | |
CN113160062B (en) | Infrared image target detection method, device, equipment and storage medium | |
CN115205264A (en) | High-resolution remote sensing ship detection method based on improved YOLOv4 | |
CN115294473A (en) | Insulator fault identification method and system based on target detection and instance segmentation | |
CN115631344B (en) | Target detection method based on feature self-adaptive aggregation | |
CN112348758B (en) | Optical remote sensing image data enhancement method and target identification method | |
CN114781514A (en) | Floating object detection method and system integrating an attention mechanism | |
CN115147745A (en) | Small target detection method based on urban unmanned aerial vehicle image | |
CN113128476A (en) | Low-power consumption real-time helmet detection method based on computer vision target detection | |
CN114519819B (en) | Remote sensing image target detection method based on global context awareness | |
Li et al. | License plate detection using convolutional neural network | |
CN115115863A (en) | Water surface multi-scale target detection method, device and system and storage medium | |
CN116258940A (en) | Small target detection method for multi-scale features and self-adaptive weights | |
CN116310323A (en) | Aircraft target instance segmentation method, system and readable storage medium | |
CN114943903A (en) | Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle | |
Xia et al. | Fast detection of airports on remote sensing images with single shot multibox detector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||