CN111738110A - Remote sensing image vehicle target detection method based on multi-scale attention mechanism - Google Patents
Remote sensing image vehicle target detection method based on multi-scale attention mechanism
- Publication number
- CN111738110A CN111738110A CN202010521480.1A CN202010521480A CN111738110A CN 111738110 A CN111738110 A CN 111738110A CN 202010521480 A CN202010521480 A CN 202010521480A CN 111738110 A CN111738110 A CN 111738110A
- Authority
- CN
- China
- Prior art keywords
- feature
- feature map
- network
- attention
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Astronomy & Astrophysics (AREA)
- Remote Sensing (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a remote sensing image vehicle target detection method based on a multi-scale attention mechanism, which comprises the following steps: S1, extracting features of the original picture with a multilayer convolutional neural network, and constructing a bottom-up pyramid network from the generated feature maps of different scales; S2, for the constructed pyramid network, performing top-down feature fusion, during which channel attention is computed on each high-level feature map in turn and fused into the low-level feature map; S3, acquiring the spatial attention information of the fused low-level feature map and fusing it into the original low-level features; S4, generating a large number of candidate boxes according to preset sizes and aspect ratios, determining the feature map to use according to the size of the ground-truth box of the detection target, and labeling candidate boxes as positive or negative according to their intersection-over-union (IoU) with the ground-truth box; and S5, directly predicting the category information and regression information of the positive-sample candidate boxes, and filtering overlapping same-category candidate boxes with non-maximum suppression.
Description
Technical Field
The invention belongs to the technical field of deep-learning image processing, and particularly relates to a remote sensing image vehicle target detection method based on a multi-scale attention mechanism.
Background
With the development of remote sensing satellite technology, large numbers of remote sensing pictures spanning space and time can be acquired easily. Remote sensing images provide a new visual angle for analyzing ground vehicles. Detecting vehicle targets from an aerial perspective supports tasks such as urban intelligent traffic, urban traffic planning, military target detection and tracking, and cross-regional remote monitoring, and the identification and detection of vehicle targets is an important and fundamental capability in all of these tasks. The quality of a remote sensing image varies with its acquisition platform and acquisition mode, and different ground sampling distances give the same target different scales, which challenges the detection of targets in general and small targets in particular.
Traditional methods that identify vehicles in remote sensing images with hand-crafted features are difficult to design and have low recognition rates; they struggle to accurately identify vehicles in small, dense vehicle target areas and to suppress interference from complex ground environments.
With the development of deep learning, vehicle target semantic information can be acquired easily by training a deep neural network, yet accurately locating vehicles remains a non-trivial challenge. Feature pyramids built on deep neural networks are widely used for detecting multi-scale and small targets: feature maps of different scales are selected for detection according to the area of the target, which yields some improvement. However, because most vehicle targets are small, they concentrate on the lower-layer features, and lower-layer features obtained by simple upsampling and addition often lack rich semantic features.
Disclosure of Invention
In view of the above technical problems, the invention provides a remote sensing image vehicle target detection method based on a multi-scale attention mechanism. Aiming at the characteristically small size of vehicle targets, the low-level features of the feature pyramid are strengthened with an attention mechanism. By fusing a channel attention mechanism and a spatial attention mechanism into the lower-layer feature maps, the low-level features carry different weights over channel and spatial information, providing more accurate semantic information for target identification and detection by the subsequent network and reducing the interference of background information in the remote sensing image with the vehicle target.
In order to solve the technical problems, the invention adopts the following technical scheme:
a remote sensing image vehicle target detection method based on a multi-scale attention mechanism comprises the following steps:
S1, extracting features of the original picture with a multilayer convolutional neural network, and constructing a bottom-up pyramid network from the generated feature maps of different scales;
S2, for the constructed pyramid network, performing top-down feature fusion, during which channel attention is computed on each high-level feature map in turn and fused into the low-level feature map;
S3, acquiring the spatial attention information of the fused low-level feature map, and fusing it into the original low-level features;
S4, generating a large number of candidate boxes according to preset sizes and aspect ratios, determining the feature map to use according to the size of the ground-truth box of the detection target, and labeling candidate boxes as positive or negative according to their intersection-over-union (IoU) with the ground-truth box;
and S5, directly predicting the category information and regression information of the positive-sample candidate boxes, and filtering overlapping same-category candidate boxes with non-maximum suppression to obtain the final detection result.
Preferably, the S1 includes: selecting ResNet-50 as the basic convolutional neural network and passing the picture through the network, which outputs feature maps of different scales at different layers; each subsequent feature map is computed from the previous one by further network layers. The different feature maps have different channel numbers at this point: the higher the layer, the more channels but the smaller the scale. The channel numbers of the different feature maps are therefore unified first, as follows:
Pi = Conv3×3(Ci, 256, 3, 1, 1) (1)
where Pi represents the feature map of the i-th layer, Conv3×3 denotes a 3×3 convolutional layer, and Ci denotes the i-th feature map obtained by passing the input picture through ResNet-50. In Conv3×3(Ci, 256, 3, 1, 1), Ci is the input feature map, 256 is the number of output channels, 3 is the size of the convolution kernel, the first 1 is the stride of the convolution kernel, and the second 1 is the amount of boundary padding applied to the feature map.
Preferably, the S2 includes: each feature map fusion always operates on one high-level feature and one low-level feature; the highest-level feature map P4 is passed through unchanged, and the next-highest feature map P3 fuses information from feature map P4. Channel max pooling and channel average pooling are applied to the high-level feature map first, and the two pooled results are passed through a 1×1 convolution to obtain a feature block with 256 channels and 1×1 scale; next, this feature block is multiplied channel-wise with the low-level feature map to obtain a low-level feature map containing channel attention, and the upsampled high-level feature map is added. The process is expressed as follows:
Ai = Conv1×1(cat(Cmaxpool(Pi), Cavgpool(Pi)))
P′i-1 = Pi-1 ⊗ Ai + upsample(Pi) (2)
where Ai is the channel-attention block computed from the high-level feature map Pi, P′i-1 represents the low-level feature map after fusing channel attention, Pi-1 is the feature map of the layer below Pi, ⊗ denotes channel-wise multiplication, Conv1×1 denotes a 1×1 convolution, cat() denotes concatenation of feature maps, Cmaxpool() denotes channel max pooling, Cavgpool() denotes channel average pooling, and upsample() denotes upsampling of the feature map.
Preferably, the S3 includes: first performing spatial max pooling on the feature map obtained in the previous step, yielding a feature block with unchanged scale and 1 channel, and simultaneously obtaining the spatially average-pooled feature block; the two feature blocks are concatenated and fed into a convolution block with kernel size 1×1, giving a feature block with 1 channel that fuses the spatial information of the feature map;
secondly, a Sigmoid() activation function maps the value of each pixel in the feature block into the interval (0, 1); finally, the feature map is multiplied element-wise by this feature block to obtain the final result. The process can be expressed as follows:
P″i = P′i ⊗ Sigmoid(Conv1×1(cat(Smaxpool(P′i), Savgpool(P′i)))) (3)
where P″i represents the feature map finally obtained through both channel attention and spatial attention, P′i is the feature map obtained in S2, Smaxpool() denotes spatial max pooling, Savgpool() denotes spatial average pooling, and Sigmoid() denotes sigmoid activation of the feature block obtained after convolution.
Preferably, the S4 includes: after the feature pyramid is generated and attention information fused, the network has several 256-channel feature maps of different scales from top to bottom. A large number of candidate boxes are generated in the input remote sensing picture by a candidate region generation method; after filtering candidate boxes that exceed the picture boundary, each candidate box is labeled positive or negative according to its intersection-over-union (IoU) with the target box of a vehicle target in the input remote sensing picture. A positive-sample candidate box is considered to contain a vehicle target.
Preferably, the S5 includes: the feature maps obtained in S3 are fed into two sub-networks, a target box category prediction sub-network and a target box regression sub-network. The category prediction sub-network applies multiple convolutions to the input feature map, producing a feature block of unchanged scale with 2 channels, where 2 indicates the two predicted categories: vehicle target and non-vehicle target. The regression sub-network applies multiple convolutions to the input feature map, producing a feature block of unchanged scale with 4 channels, where 4 is the number of regression parameters of the target box.
The invention has the following beneficial effects:
(1) The embodiment of the invention considers the attention information of the vehicle target in the feature map when using the feature pyramid, and uses the fused attention information to extract the important information of the vehicle target in the spatial and channel dimensions of the feature map.
(2) The embodiment of the invention fuses two attention mechanisms into the feature pyramid network, improving the precision and recall of remote sensing image target detection without greatly increasing the memory footprint and running time of the network.
Drawings
FIG. 1 is a schematic diagram of a remote sensing image vehicle target detection method based on a multi-scale attention mechanism according to the invention;
FIG. 2 is a schematic diagram of a method of incorporating attention into a pyramid of features according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a remote sensing image vehicle target detection method based on a multi-scale attention mechanism, which includes the following steps:
S1, feature extraction is performed on the original picture with a multilayer convolutional neural network, and a bottom-up pyramid network is constructed from the generated feature maps of different scales.
As a specific implementation, ResNet-50 is selected as the basic convolutional neural network. As shown on the left side of Fig. 1, the picture passes through the network and feature maps of different scales are output at different layers; each subsequent feature map is computed from the previous one by further network layers. The feature maps have different channel numbers: the higher the layer, the more channels but the smaller the scale. The channel numbers of the different feature maps are unified first, as follows:
Pi = Conv3×3(Ci, 256, 3, 1, 1) (1)
where Pi represents the feature map of the i-th layer, Conv3×3 denotes a 3×3 convolutional layer, and Ci denotes the i-th feature map obtained by passing the input picture through ResNet-50. In Conv3×3(Ci, 256, 3, 1, 1), Ci is the input feature map, 256 is the number of output channels, 3 is the size of the convolution kernel, the first 1 is the stride of the convolution kernel, and the second 1 is the amount of boundary padding applied to the feature map.
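For illustration only (the patent presents no code), the channel-unification step of equation (1) can be sketched in PyTorch. The tuple of input channel counts is an assumption about which ResNet-50 stage outputs are used; all names are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative sketch of equation (1): unify each backbone feature map C_i
# to 256 channels with a 3x3 convolution (stride 1, padding 1), so the
# spatial scale is preserved while channel counts become uniform.
# (256, 512, 1024, 2048) are the standard ResNet-50 stage output widths.
lateral_convs = nn.ModuleList([
    nn.Conv2d(in_channels=c, out_channels=256, kernel_size=3, stride=1, padding=1)
    for c in (256, 512, 1024, 2048)
])

# Example: a C3-like map with 512 channels keeps its spatial size
c3 = torch.randn(1, 512, 100, 100)
p3 = lateral_convs[1](c3)
print(tuple(p3.shape))  # (1, 256, 100, 100)
```

Because stride is 1 and padding matches the 3×3 kernel, only the channel dimension changes.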
S2, top-down feature fusion is performed on the constructed pyramid network. In the fusion process, channel attention is computed on each high-level feature map in turn and fused into the low-level feature map.
As a specific implementation, each feature map fusion always operates on one high-level feature and one low-level feature. As shown on the right side of Fig. 1, the highest-level feature map P4 is passed through unchanged, and the next-highest feature map P3 fuses the information from feature map P4. As shown on the left side of Fig. 2, channel max pooling and channel average pooling are applied to the high-level feature map first, and the two pooled results are passed through a 1×1 convolution to obtain a feature block with 256 channels and a scale of 1×1. Next, this feature block is multiplied channel-wise with the low-level feature map to obtain the low-level feature map containing channel attention, and the upsampled high-level feature map is added. The process can be expressed as follows:
Ai = Conv1×1(cat(Cmaxpool(Pi), Cavgpool(Pi)))
P′i-1 = Pi-1 ⊗ Ai + upsample(Pi) (2)
where Ai is the channel-attention block computed from the high-level feature map Pi, P′i-1 represents the low-level feature map after fusing channel attention, Pi-1 is the feature map of the layer below Pi, ⊗ denotes channel-wise multiplication, Conv1×1 denotes a 1×1 convolution, cat() denotes concatenation of feature maps, Cmaxpool() denotes channel max pooling, Cavgpool() denotes channel average pooling, and upsample() denotes upsampling of the feature map.
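The fusion step above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the patent's implementation: the sigmoid on the attention block is an assumption borrowed from common channel-attention designs (the text does not state the activation), and the nearest-neighbor upsampling mode is likewise assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionFusion(nn.Module):
    """Illustrative sketch of the top-down fusion step: a channel-attention
    block is computed from the higher-level map, multiplied channel-wise
    into the lower-level map, and the upsampled higher-level map is added."""

    def __init__(self, channels=256):
        super().__init__()
        # 1x1 conv mixing the concatenated max- and avg-pooled vectors
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, p_high, p_low):
        # Global max / average pooling -> two (N, 256, 1, 1) feature blocks
        mx = F.adaptive_max_pool2d(p_high, 1)
        avg = F.adaptive_avg_pool2d(p_high, 1)
        attn = torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))
        # Channel-wise multiplication, then add the upsampled high-level map
        up = F.interpolate(p_high, size=p_low.shape[-2:], mode="nearest")
        return p_low * attn + up

fusion = ChannelAttentionFusion()
p4 = torch.randn(1, 256, 50, 50)    # higher-level map, half resolution
p3 = torch.randn(1, 256, 100, 100)  # lower-level map
fused = fusion(p4, p3)
```

The output keeps the lower-level map's spatial size, so the result can serve as the next P′i-1 in the top-down pass.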
S3, the spatial attention information of the fused low-level feature map is acquired and fused into the original low-level features.
As a specific implementation, as shown on the right side of Fig. 2, spatial max pooling is first applied to the feature map obtained in the previous step, producing a feature block with unchanged scale and 1 channel; the spatially average-pooled feature block is obtained at the same time. The two feature blocks are concatenated and fed into a convolution block with kernel size 1×1, giving a feature block with 1 channel that fuses the spatial information of the feature map.
Then a Sigmoid() activation function maps the value of each pixel into the interval (0, 1). Finally, the feature map is multiplied element-wise by this feature block to obtain the final result. The process can be expressed as follows:
P″i = P′i ⊗ Sigmoid(Conv1×1(cat(Smaxpool(P′i), Savgpool(P′i)))) (3)
where P″i represents the feature map finally obtained through both channel attention and spatial attention, P′i is the feature map obtained in S2, Smaxpool() denotes spatial max pooling, Savgpool() denotes spatial average pooling, and Sigmoid() denotes sigmoid activation of the feature block obtained after convolution.
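The spatial-attention step can be sketched as follows; this is an illustrative reconstruction in PyTorch, not the patent's code, with the 1×1 kernel taken from the text:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Illustrative sketch of the spatial-attention step: per-pixel max and
    mean over channels give two 1-channel maps, a 1x1 convolution reduces
    their concatenation to one channel, a sigmoid maps each pixel into
    (0, 1), and the result re-weights the input feature map."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)  # kernel size 1x1 per the text

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)  # spatial max pooling: (N, 1, H, W)
        avg = x.mean(dim=1, keepdim=True)   # spatial average pooling: (N, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))
        return x * attn

sa = SpatialAttention()
x = torch.randn(2, 256, 64, 64)
y = sa(x)
```

The single-channel attention map broadcasts across all 256 channels, so the output shape equals the input shape.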
S4, a large number of candidate boxes are generated according to preset sizes and aspect ratios. The feature map to use is determined by the size of the ground-truth box of the detection target, and candidate boxes are labeled positive or negative by their intersection-over-union (IoU) with the ground-truth box.
As a specific implementation, after generating the feature pyramid and fusing attention information, the network has 256-channel feature maps of different scales from top to bottom. A large number of candidate boxes are generated in the input remote sensing picture by a candidate region generation method; after candidate boxes exceeding the picture boundary are filtered out, each candidate box is labeled positive or negative according to its IoU with the target box of a vehicle target in the input picture. A positive-sample candidate box is considered to contain a vehicle target.
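The IoU-based labeling can be sketched as follows. This is an illustrative implementation; the 0.5 threshold matches the value used in the simulation experiment, and the example boxes are made up:

```python
import torch

def box_iou(boxes1, boxes2):
    """Plain intersection-over-union between two sets of (x1, y1, x2, y2)
    boxes, returned as a (len(boxes1), len(boxes2)) matrix."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])  # intersection top-left
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)

# Label candidate boxes positive when IoU with any ground-truth box >= 0.5
anchors = torch.tensor([[0., 0., 10., 10.], [20., 20., 30., 30.]])
gt = torch.tensor([[0., 0., 10., 10.]])
iou = box_iou(anchors, gt)
positive = iou.max(dim=1).values >= 0.5
print(positive.tolist())  # [True, False]
```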
S5, the category information and regression information of the positive-sample candidate boxes are directly predicted, and overlapping same-category candidate boxes are filtered with non-maximum suppression to obtain the final detection result.
As a specific implementation, the feature maps obtained in step 3 are fed into two sub-networks: a target box category prediction sub-network and a target box regression sub-network. The category prediction sub-network applies multiple convolutions to the input feature map, producing feature blocks of unchanged scale with 2 channels (2 indicates the two predicted categories: vehicle target and non-vehicle target). The regression sub-network applies multiple convolutions to the input feature map, producing feature blocks of unchanged scale with 4 channels (4 is the number of regression parameters of the target box).
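The two prediction heads can be sketched in PyTorch as follows. The depth of the head (four intermediate convolutions) and its width are assumptions, since the text only says the feature map is convolved "multiple" times:

```python
import torch
import torch.nn as nn

def make_head(out_channels, in_channels=256, width=256, num_convs=4):
    """Illustrative prediction head: several 3x3 convolutions that preserve
    the spatial scale, ending in out_channels (2 for the category head,
    4 for the box regression head)."""
    layers = []
    c = in_channels
    for _ in range(num_convs):
        layers += [nn.Conv2d(c, width, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        c = width
    layers.append(nn.Conv2d(c, out_channels, kernel_size=3, padding=1))
    return nn.Sequential(*layers)

cls_head = make_head(2)  # 2 channels: vehicle target vs. non-vehicle target
reg_head = make_head(4)  # 4 channels: box regression parameters
feat = torch.randn(1, 256, 64, 64)
cls_out, reg_out = cls_head(feat), reg_head(feat)
```

All convolutions use padding 1 with a 3×3 kernel, so the W×H scale of the input feature map is unchanged and only the channel count differs between the two heads.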
To verify the validity of the inventive scheme, the following simulation experiment was performed.
First, the pre-trained ResNet-50 model provided by torchvision is loaded to initialize the network parameters; the processed, labeled remote sensing picture is input into the neural network, and feature maps of different scales and channel numbers are extracted. A feature pyramid network is formed in the manner of step 1.
Then, attention information is fused into every feature map in the feature pyramid except the highest layer. The high-level feature map first undergoes global channel max pooling and global channel average pooling. The concatenated result is passed through a 1×1 convolution to obtain the channel attention block, which is multiplied channel-wise with the low-level feature map. The high-level feature map is then upsampled 2× and added to the low-level feature map fused with the channel attention information.
Next, the feature map containing channel attention information obtained in the previous step undergoes spatial max pooling and spatial average pooling. Likewise, the concatenated feature blocks are passed through a 1×1 convolution, reducing the channel number to 1. A sigmoid() activation function then maps the value of each pixel in the resulting spatial attention block into the interval (0, 1); the closer a pixel is to 1, the higher its importance. Finally, this feature block is multiplied with the feature map, yielding a feature map carrying both channel attention and spatial attention information.
Then, a subsequent category prediction sub-network and target box regression sub-network are generated for each feature map. The input feature map has size W×H×256; two FCN-like sub-networks produce feature blocks of W×H×2 (category scores) and W×H×4 (box regression parameters), respectively. Meanwhile, a large number of candidate boxes are generated on the feature maps of different scales, and each candidate box is labeled positive or negative by its intersection-over-union with the ground-truth boxes in the picture (threshold 0.5 here).
Finally, after the feature map of each positive-sample candidate box is determined, the two sub-networks following that layer's feature map compute the network loss: Focal Loss for the category prediction sub-network and Smooth L1 loss for the target box regression sub-network. In the inference phase, the sub-networks output the target boxes and their confidences. Confidences are screened with a threshold of 0.05, and non-maximum suppression with a threshold of 0.5 filters out overlapping low-confidence target boxes.
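A self-contained sketch of this post-processing (the 0.05 confidence threshold and 0.5 suppression threshold come from the text; everything else, including the example boxes, is illustrative):

```python
import torch

def nms(boxes, scores, iou_thr=0.5, score_thr=0.05):
    """Illustrative non-maximum suppression: drop boxes below the confidence
    threshold, then greedily keep the highest-scoring box and suppress
    remaining boxes whose IoU with it exceeds iou_thr."""
    keep_mask = scores >= score_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort(descending=True)
    kept = []
    while order.numel() > 0:
        i = order[0].item()
        kept.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of the kept box against all remaining boxes
        lt = torch.max(boxes[i, :2], boxes[rest, :2])
        rb = torch.min(boxes[i, 2:], boxes[rest, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]
    return boxes[kept], scores[kept]

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
kept_boxes, kept_scores = nms(boxes, scores)
```

Here the second box overlaps the first with IoU ≈ 0.68 > 0.5 and is suppressed, while the distant third box survives.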
In addition, because vehicle targets in remote sensing images are small, their scale and sharpness deviate considerably across satellite images with different sampling distances and regions, and the background around typical vehicle regions is complex, interfering with vehicle detection. The method combines these characteristics of vehicle targets in remote sensing images: by fusing multiple attention mechanisms into the feature pyramid, it strengthens the semantic information of low-level features, makes the vehicle-relevant parts of the feature map more prominent in both channel and space, and weakens the influence of background noise on the detection result.
In conclusion, the invention further improves the vehicle detection performance in the remote sensing image by combining the data characteristics of the vehicle target in the remote sensing image.
It is to be understood that the exemplary embodiments described herein are illustrative and not restrictive. Although one or more embodiments of the present invention have been described with reference to the accompanying drawings, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the following claims.
Claims (6)
1. A remote sensing image vehicle target detection method based on a multi-scale attention mechanism is characterized by comprising the following steps:
S1, extracting features of the original picture with a multilayer convolutional neural network, and constructing a bottom-up pyramid network from the generated feature maps of different scales;
S2, for the constructed pyramid network, performing top-down feature fusion, during which channel attention is computed on each high-level feature map in turn and fused into the low-level feature map;
S3, acquiring the spatial attention information of the fused low-level feature map, and fusing it into the original low-level features;
S4, generating a large number of candidate boxes according to preset sizes and aspect ratios, determining the feature map to use according to the size of the ground-truth box of the detection target, and labeling candidate boxes as positive or negative according to their intersection-over-union (IoU) with the ground-truth box;
and S5, directly predicting the category information and regression information of the positive-sample candidate boxes, and filtering overlapping same-category candidate boxes with non-maximum suppression to obtain the final detection result.
2. The remote sensing image vehicle target detection method based on the multi-scale attention mechanism as claimed in claim 1, wherein said S1 comprises: selecting ResNet-50 as the basic convolutional neural network and passing the picture through the network, which outputs feature maps of different scales at different layers, each subsequent feature map being computed from the previous one by further network layers; the different feature maps have different channel numbers at this point, with higher layers having more channels but smaller scale; the channel numbers of the different feature maps are first unified, as follows:
Pi = Conv3×3(Ci, 256, 3, 1, 1) (1)
where Pi represents the feature map of the i-th layer, Conv3×3 denotes a 3×3 convolutional layer, and Ci denotes the i-th feature map obtained by passing the input picture through ResNet-50; in Conv3×3(Ci, 256, 3, 1, 1), Ci is the input feature map, 256 is the number of output channels, 3 is the size of the convolution kernel, the first 1 is the stride of the convolution kernel, and the second 1 is the amount of boundary padding applied to the feature map.
3. The remote sensing image vehicle target detection method based on the multi-scale attention mechanism as claimed in claim 1, wherein said S2 comprises: each feature map fusion always operates on one high-level feature and one low-level feature; the highest-level feature map P4 is passed through unchanged, and the next-highest feature map P3 fuses information from feature map P4; channel max pooling and channel average pooling are applied to the high-level feature map first, and the two pooled results are passed through a 1×1 convolution to obtain a feature block with 256 channels and 1×1 scale; next, this feature block is multiplied channel-wise with the low-level feature map to obtain a low-level feature map containing channel attention, and the upsampled high-level feature map is added; the process is expressed as follows:
Ai = Conv1×1(cat(Cmaxpool(Pi), Cavgpool(Pi)))
P′i-1 = Pi-1 ⊗ Ai + upsample(Pi) (2)
where Ai is the channel-attention block computed from the high-level feature map Pi, P′i-1 represents the low-level feature map after fusing channel attention, Pi-1 is the feature map of the layer below Pi, ⊗ denotes channel-wise multiplication, Conv1×1 denotes a 1×1 convolution, cat() denotes concatenation of feature maps, Cmaxpool() denotes channel max pooling, Cavgpool() denotes channel average pooling, and upsample() denotes upsampling of the feature map.
4. The remote sensing image vehicle target detection method based on the multi-scale attention mechanism as claimed in claim 1, wherein said S3 comprises: first performing spatial max pooling on the feature map obtained in the previous step, yielding a feature block with unchanged scale and 1 channel, and simultaneously obtaining the spatially average-pooled feature block; the two feature blocks are concatenated and fed into a convolution block with kernel size 1×1, giving a feature block with 1 channel that fuses the spatial information of the feature map;
secondly, a Sigmoid() activation function maps the value of each pixel in the feature block into the interval (0, 1);
finally, the feature map is multiplied element-wise by this feature block to obtain the final result, and the process can be expressed as follows:
P″i = P′i ⊗ Sigmoid(Conv1×1(cat(Smaxpool(P′i), Savgpool(P′i)))) (3)
where P″i represents the feature map finally obtained through both channel attention and spatial attention, P′i is the feature map obtained in S2, Smaxpool() denotes spatial max pooling, Savgpool() denotes spatial average pooling, and Sigmoid() denotes sigmoid activation of the feature block obtained after convolution.
5. The remote sensing image vehicle target detection method based on the multi-scale attention mechanism as claimed in claim 1, wherein said S4 comprises: after the feature pyramid is generated and attention information fused, the network has several 256-channel feature maps of different scales from top to bottom; a large number of candidate boxes are generated in the input remote sensing picture by a candidate region generation method; after filtering candidate boxes that exceed the picture boundary, each candidate box is labeled positive or negative according to its intersection-over-union with the target box of a vehicle target in the input remote sensing picture; a positive-sample candidate box is considered to contain a vehicle target.
6. The remote sensing image vehicle target detection method based on the multi-scale attention mechanism as claimed in claim 1, wherein said S5 comprises: sending the plurality of feature maps obtained in S3 into two sub-networks, a target-box class prediction sub-network and a target-box regression sub-network; the class prediction sub-network performs multiple convolutions on the input feature map to obtain a feature block with unchanged scale and a channel number of 2, the 2 indicating that there are two predicted classes, vehicle target and non-vehicle target; the regression sub-network performs multiple convolutions on the input feature map to obtain a feature block with unchanged scale and a channel number of 4, the 4 indicating the number of regression parameters of the target box.
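The channel counts of the two sub-networks in claim 6 can be demonstrated with the toy head below. This is a sketch, not the patented network: a real head stacks several convolutions, while here a single 1 x 1-convolution step (a matrix product per pixel) with random placeholder weights is enough to show that the scale is unchanged and only the channel count differs between the two heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def head(feat, out_channels, weights):
    """One 1x1-convolution step of a prediction head: (C, H, W) -> (out_channels, H, W).
    The spatial scale H x W is preserved; only the channel count changes."""
    c, h, w = feat.shape
    return (weights @ feat.reshape(c, h * w)).reshape(out_channels, h, w)

feat = rng.normal(size=(256, 16, 16))                 # one pyramid level, 256 channels
cls_out = head(feat, 2, rng.normal(size=(2, 256)))    # 2 classes: vehicle / non-vehicle
box_out = head(feat, 4, rng.normal(size=(4, 256)))    # 4 target-box regression parameters
print(cls_out.shape, box_out.shape)  # (2, 16, 16) (4, 16, 16)
```

The same pair of heads is applied to every pyramid level, so each spatial position of each feature map yields a 2-way class score and 4 box-regression parameters.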
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010521480.1A CN111738110A (en) | 2020-06-10 | 2020-06-10 | Remote sensing image vehicle target detection method based on multi-scale attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111738110A true CN111738110A (en) | 2020-10-02 |
Family
ID=72648522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010521480.1A Pending CN111738110A (en) | 2020-06-10 | 2020-06-10 | Remote sensing image vehicle target detection method based on multi-scale attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111738110A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2780595A1 (en) * | 2011-06-22 | 2012-12-22 | Roman Palenychka | Method and multi-scale attention system for spatiotemporal change determination and object detection |
CN110084210A (en) * | 2019-04-30 | 2019-08-02 | 电子科技大学 | The multiple dimensioned Ship Detection of SAR image based on attention pyramid network |
CN110533084A (en) * | 2019-08-12 | 2019-12-03 | 长安大学 | A kind of multiscale target detection method based on from attention mechanism |
CN110909642A (en) * | 2019-11-13 | 2020-03-24 | 南京理工大学 | Remote sensing image target detection method based on multi-scale semantic feature fusion |
CN111179217A (en) * | 2019-12-04 | 2020-05-19 | 天津大学 | Attention mechanism-based remote sensing image multi-scale target detection method |
2020-06-10: CN application CN202010521480.1A filed in China (publication CN111738110A; legal status: active, Pending)
Non-Patent Citations (2)
Title |
---|
Pang Lixin; Gao Fan; He Dahai; Li Manqin; Liu Fangyao: "A Small Target Detection Method Based on the Attention-Mechanism RetinaNet" *
Shen Wenxiang; Qin Pinle; Zeng Jianchao: "Indoor Crowd Detection Network Based on Multi-Level Features and a Hybrid Attention Mechanism" *
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112489001B (en) * | 2020-11-23 | 2023-07-25 | 石家庄铁路职业技术学院 | Tunnel water seepage detection method based on improved deep learning |
CN112489001A (en) * | 2020-11-23 | 2021-03-12 | 石家庄铁路职业技术学院 | Tunnel water seepage detection method based on improved deep learning |
CN112396115A (en) * | 2020-11-23 | 2021-02-23 | 平安科技(深圳)有限公司 | Target detection method and device based on attention mechanism and computer equipment |
WO2021208726A1 (en) * | 2020-11-23 | 2021-10-21 | 平安科技(深圳)有限公司 | Target detection method and apparatus based on attention mechanism, and computer device |
CN112396115B (en) * | 2020-11-23 | 2023-12-22 | 平安科技(深圳)有限公司 | Attention mechanism-based target detection method and device and computer equipment |
CN112560907A (en) * | 2020-12-02 | 2021-03-26 | 西安电子科技大学 | Limited pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention |
CN112560907B (en) * | 2020-12-02 | 2024-05-28 | 西安电子科技大学 | Finite pixel infrared unmanned aerial vehicle target detection method based on mixed domain attention |
CN112529005B (en) * | 2020-12-11 | 2022-12-06 | 西安电子科技大学 | Target detection method based on semantic feature consistency supervision pyramid network |
CN112529005A (en) * | 2020-12-11 | 2021-03-19 | 西安电子科技大学 | Target detection method based on semantic feature consistency supervision pyramid network |
CN112633352A (en) * | 2020-12-18 | 2021-04-09 | 浙江大华技术股份有限公司 | Target detection method and device, electronic equipment and storage medium |
CN112633352B (en) * | 2020-12-18 | 2023-08-29 | 浙江大华技术股份有限公司 | Target detection method and device, electronic equipment and storage medium |
CN112633158A (en) * | 2020-12-22 | 2021-04-09 | 广东电网有限责任公司电力科学研究院 | Power transmission line corridor vehicle identification method, device, equipment and storage medium |
CN112800964A (en) * | 2021-01-27 | 2021-05-14 | 中国人民解放军战略支援部队信息工程大学 | Remote sensing image target detection method and system based on multi-module fusion |
CN112906718A (en) * | 2021-03-09 | 2021-06-04 | 西安电子科技大学 | Multi-target detection method based on convolutional neural network |
CN112906718B (en) * | 2021-03-09 | 2023-08-22 | 西安电子科技大学 | Multi-target detection method based on convolutional neural network |
CN112990299A (en) * | 2021-03-11 | 2021-06-18 | 五邑大学 | Depth map acquisition method based on multi-scale features, electronic device and storage medium |
CN112990299B (en) * | 2021-03-11 | 2023-10-17 | 五邑大学 | Depth map acquisition method based on multi-scale features, electronic equipment and storage medium |
CN113111718A (en) * | 2021-03-16 | 2021-07-13 | 苏州海宸威视智能科技有限公司 | Fine-grained weak-feature target emergence detection method based on multi-mode remote sensing image |
CN113128575A (en) * | 2021-04-01 | 2021-07-16 | 西安电子科技大学广州研究院 | Target detection sample balancing method based on soft label |
CN113065601A (en) * | 2021-04-12 | 2021-07-02 | 陕西理工大学 | Deep learning forest fire abnormity detection method based on genetic algorithm optimization |
CN113255443A (en) * | 2021-04-16 | 2021-08-13 | 杭州电子科技大学 | Pyramid structure-based method for positioning time sequence actions of graph attention network |
CN113255443B (en) * | 2021-04-16 | 2024-02-09 | 杭州电子科技大学 | Graph annotation meaning network time sequence action positioning method based on pyramid structure |
CN113095265A (en) * | 2021-04-21 | 2021-07-09 | 西安电子科技大学 | Fungal target detection method based on feature fusion and attention |
CN113011443B (en) * | 2021-04-23 | 2022-06-03 | 电子科技大学 | Key point-based target detection feature fusion method |
CN113011443A (en) * | 2021-04-23 | 2021-06-22 | 电子科技大学 | Key point-based target detection feature fusion method |
CN113361428B (en) * | 2021-06-11 | 2023-03-24 | 浙江澄视科技有限公司 | Image-based traffic sign detection method |
CN113361428A (en) * | 2021-06-11 | 2021-09-07 | 浙江澄视科技有限公司 | Image-based traffic sign detection method |
CN113469287A (en) * | 2021-07-27 | 2021-10-01 | 北京信息科技大学 | Spacecraft multi-local component detection method based on instance segmentation network |
CN113658114A (en) * | 2021-07-29 | 2021-11-16 | 南京理工大学 | Contact net opening pin defect target detection method based on multi-scale cross attention |
CN113743521A (en) * | 2021-09-10 | 2021-12-03 | 中国科学院软件研究所 | Target detection method based on multi-scale context sensing |
CN113743521B (en) * | 2021-09-10 | 2023-06-27 | 中国科学院软件研究所 | Target detection method based on multi-scale context awareness |
CN114092813A (en) * | 2021-11-25 | 2022-02-25 | 中国科学院空天信息创新研究院 | Industrial park image extraction method, model, electronic equipment and storage medium |
CN114092813B (en) * | 2021-11-25 | 2022-08-05 | 中国科学院空天信息创新研究院 | Industrial park image extraction method and system, electronic equipment and storage medium |
CN113920468A (en) * | 2021-12-13 | 2022-01-11 | 松立控股集团股份有限公司 | Multi-branch pedestrian detection method based on cross-scale feature enhancement |
CN113920468B (en) * | 2021-12-13 | 2022-03-15 | 松立控股集团股份有限公司 | Multi-branch pedestrian detection method based on cross-scale feature enhancement |
CN114332644A (en) * | 2021-12-30 | 2022-04-12 | 北京建筑大学 | Large-view-field traffic density acquisition method based on video satellite data |
CN114445801A (en) * | 2022-01-25 | 2022-05-06 | 杭州飞步科技有限公司 | Lane line detection method based on cross-layer optimization |
CN116091787A (en) * | 2022-10-08 | 2023-05-09 | 中南大学 | Small sample target detection method based on feature filtering and feature alignment |
CN115984661B (en) * | 2023-03-20 | 2023-08-29 | 北京龙智数科科技服务有限公司 | Multi-scale feature map fusion method, device, equipment and medium in target detection |
CN115984661A (en) * | 2023-03-20 | 2023-04-18 | 北京龙智数科科技服务有限公司 | Multi-scale feature map fusion method, device, equipment and medium in target detection |
CN117496132A (en) * | 2023-12-29 | 2024-02-02 | 数据空间研究院 | Scale sensing detection method for small-scale target detection |
CN117689880A (en) * | 2024-02-01 | 2024-03-12 | 东北大学 | Method and system for target recognition in biomedical images based on machine learning |
CN117689880B (en) * | 2024-02-01 | 2024-04-16 | 东北大学 | Method and system for target recognition in biomedical images based on machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111738110A (en) | Remote sensing image vehicle target detection method based on multi-scale attention mechanism | |
CN108764063B (en) | Remote sensing image time-sensitive target identification system and method based on characteristic pyramid | |
CN112818903B (en) | Small sample remote sensing image target detection method based on meta-learning and cooperative attention | |
CN109086668B (en) | Unmanned aerial vehicle remote sensing image road information extraction method based on multi-scale generation countermeasure network | |
CN110119148B (en) | Six-degree-of-freedom attitude estimation method and device and computer readable storage medium | |
CN108305260B (en) | Method, device and equipment for detecting angular points in image | |
CN110516514B (en) | Modeling method and device of target detection model | |
CN112084869A (en) | Compact quadrilateral representation-based building target detection method | |
CN113609896A (en) | Object-level remote sensing change detection method and system based on dual-correlation attention | |
CN113901900A (en) | Unsupervised change detection method and system for homologous or heterologous remote sensing image | |
CN113313094B (en) | Vehicle-mounted image target detection method and system based on convolutional neural network | |
CN115631344B (en) | Target detection method based on feature self-adaptive aggregation | |
CN113255589A (en) | Target detection method and system based on multi-convolution fusion network | |
CN114119610B (en) | Defect detection method based on rotating target detection | |
CN114494870A (en) | Double-time-phase remote sensing image change detection method, model construction method and device | |
CN110909656B (en) | Pedestrian detection method and system integrating radar and camera | |
CN115984537A (en) | Image processing method and device and related equipment | |
CN115620141A (en) | Target detection method and device based on weighted deformable convolution | |
CN111860411A (en) | Road scene semantic segmentation method based on attention residual error learning | |
CN115661767A (en) | Image front vehicle target identification method based on convolutional neural network | |
CN113378642B (en) | Method for detecting illegal occupation buildings in rural areas | |
CN114724021A (en) | Data identification method and device, storage medium and electronic device | |
CN115376094B (en) | Scale-perception neural network-based road surface identification method and system for unmanned sweeper | |
CN111160282A (en) | Traffic light detection method based on binary Yolov3 network | |
CN115984712A (en) | Multi-scale feature-based remote sensing image small target detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||