CN116229295A - Remote sensing image target detection method based on fusion convolution attention mechanism - Google Patents


Info

Publication number
CN116229295A
Authority
CN
China
Prior art keywords
convolution
layer
module
attention
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310176483.XA
Other languages
Chinese (zh)
Inventor
朱虎明
王晨
王金成
缪孔苗
李秋明
薛怡煜
侯彪
焦李成
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority claimed from application CN202310176483.XA
Publication of CN116229295A
Legal status: Pending

Classifications

    • G06V 20/17 (image or video recognition: terrestrial scenes taken from planes or by drones)
    • G06N 3/08 (neural-network learning methods)
    • G06V 10/44 (local feature extraction, e.g. edges, contours, corners; connectivity analysis)
    • G06V 10/7715 (feature extraction, e.g. by transforming the feature space; subspace methods)
    • G06V 10/776 (validation; performance evaluation)
    • G06V 10/82 (recognition or understanding using neural networks)
    • G06V 2201/07 (indexing scheme: target detection)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image target detection method based on a fused convolution attention mechanism, which addresses the low detection precision on small targets and the slow convergence of existing methods. The method comprises the following steps: collecting and processing remote sensing image data; constructing a feature extraction backbone network and a Transformer encoder-decoder architecture fused with convolution; forming a target detection network model with a fused convolution attention mechanism; and training and testing the model. The invention adopts a pyramid-structured, downsampling feature extraction backbone network that outputs feature matrices of the same size for input images of different sizes; builds a convolution module with depthwise and pointwise convolutions, enhancing the extraction of local features from remote sensing images; and replaces part of the attention heads with convolution modules, reducing the large number of parameters involved in matrix operations and shortening training time. The method is applicable to fields with high real-time and accuracy requirements for remote sensing image target detection, such as aerial vehicles, remote sensing satellites, intelligent transportation and smart agriculture.

Description

Remote sensing image target detection method based on fusion convolution attention mechanism
Technical Field
The invention belongs to the technical field of remote sensing image target detection, mainly relates to target detection in optical remote sensing images, and particularly relates to a remote sensing image target detection method based on a fused convolution attention mechanism. The method is applicable to real-time detection of ground targets by aerial vehicles and the like.
Background
Remote sensing is a technology for acquiring characteristic information about a distant target in a non-contact manner. Supported by suitable technical equipment and systems, it records and analyzes the electromagnetic-wave characteristics of a detected target without contact in order to obtain target characteristic information. Over decades of development, remote sensing technology has been widely applied in many fields, such as agricultural development, geological analysis, marine monitoring, military reconnaissance and environmental protection.
Target detection has become an important research hotspot in fields such as remote sensing ground-feature identification and computer vision. Through target detection, a specific target in an image can be identified and its category and position obtained, which plays an important role in intelligent transportation, smart cities, public safety, military applications and other fields. Research on target detection for remote sensing image data is therefore of great significance in the ocean, military and agricultural domains, among others: it can reduce costs, improve efficiency and promote technological progress in the field. With the rapid development of high-resolution satellites, the number of high-resolution remote sensing images has increased sharply, making big-data-based target detection of remote sensing images an urgent requirement in the current field of high-resolution remote sensing image detection.
The development of target detection technology can be traced back to the 1990s. Early methods were mainly based on hand-crafted feature extraction and classifier training, such as SVM and AdaBoost, but such methods struggle to adapt to complex scene changes. With the spread of deep learning, target detection has made great progress. Deep learning is an effective machine learning approach with a strong ability to learn complex data representations. In deep-learning-based object detection, convolutional neural networks (CNNs) are among the most commonly used models and can learn complex feature representations of the objects in an image.
Early CNN-based detection methods such as the R-CNN series and Fast R-CNN fully exploit the learning ability of CNNs and greatly improve the accuracy of target detection. However, these methods still suffer from high computational complexity and slow inference.
In view of these problems, remote sensing target detection has turned toward single-stage methods such as YOLO, SSD and RetinaNet. These methods perform detection in a single stage, reducing computational complexity while improving detection precision and inference speed, and they have achieved excellent performance on various target detection benchmarks.
Transformer-based target detection is a new direction that has become popular in the field in recent years. Its main idea is to apply the encoder-decoder architecture originally proposed for natural language processing (NLP) tasks to target detection.
Compared with convolution, the vision Transformer breaks the limitation that traditional convolutional detection models cannot be computed in parallel; the number of operations the Transformer needs to relate two target positions does not grow with their distance; and the self-attention mechanism yields a more interpretable model: the encoder module computes an attention matrix from the feature map, and the values of this matrix contribute directly to the predicted box coordinates, so target boxes can be predicted directly.
The core of the Transformer approach is the self-attention mechanism, which lets the model attend to different regions of the input image and dynamically adjust the importance of each region. Compared with traditional CNN-based methods, Transformer-based target detection is more flexible and can handle complex scenes with multiple objects.
One of the first efforts in this area was DETR, which proposed a Transformer-based end-to-end target detection framework. DETR predicts object positions and categories with a set of object queries and processes the image with an encoder-decoder architecture to output predictions. The self-attention mechanism allows DETR to handle instances of different scales and shapes, and performing detection in a single stage makes it more efficient than traditional two-stage approaches.
Although current DETR-based end-to-end Transformer frameworks can achieve good results in remote sensing image target detection, some problems remain, such as overly long training caused by the attention mechanism's difficulty in converging, and low detection precision on small targets caused by the attention mechanism's inability to effectively capture local information.
In summary, although DETR offers a framework that simplifies target detection for remote sensing images and improves overall detection performance, low detection performance on small targets and slow model convergence remain unsolved.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a remote sensing image target detection method based on a fused convolution attention mechanism, with a stronger ability to capture local image features and a faster model convergence speed.
The invention relates to a remote sensing image target detection method based on a fusion convolution attention mechanism, which is characterized by comprising the following steps:
step 1, collecting and processing remote sensing image data: obtain public remote sensing images from a public website and divide them into a training dataset, a validation dataset and a test dataset, which together form the remote sensing image dataset; the dataset contains fifteen classes of targets: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabouts, football fields and swimming pools; generate a txt file from the coordinates and class information of all targets in each piece of original image data, and input the txt files together with the original image data into the built feature extraction backbone network;
Step 2, building the feature extraction backbone network: the backbone is formed by connecting four convolution groups in sequence; the first convolution group consists of a convolution layer, a Norm layer, an activation function layer and a max-pooling layer in sequence; the second, third and fourth convolution groups are each formed by connecting different numbers of residual connection units in sequence, each residual connection unit being a stack of a convolution layer, a GN layer and an activation function layer; after the input original image data passes through the backbone's downsampling operations, a remote sensing image feature matrix is output;
step 3, constructing the fused convolution Transformer encoder: the encoder contains a fused convolution multi-head attention module in which a convolution module and an attention module are connected in parallel; from the input end, the encoder consists of the fused convolution multi-head attention module, a residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module in sequence; the convolution module in the fused convolution multi-head attention module comprises a first convolution layer, a first activation function layer, a second convolution layer, a BN layer, a second activation function layer and a third convolution layer connected in sequence, and the attention module comprises an LN layer, a self-attention layer and a feed-forward network layer connected in sequence; the ratio of convolution modules to attention modules is 4:4, the matrix output by the convolution module has the same size as that output by the attention module, and after concatenation they form an output matrix of the same size as the input matrix of the fused convolution Transformer encoder module;
Step 4, building the hybrid attention Transformer decoder: the decoder removes redundant information from the input target query matrix through a self-attention mechanism, a cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and a forward propagation module performs feature transformation on the image features and the prediction boxes;
step 5, forming the target detection network model with a fused convolution attention mechanism: build the model, hereafter simply called the network model, from the feature extraction backbone network, the fused convolution Transformer encoder and the hybrid attention Transformer decoder connected in sequence;
step 6, training a network model: training the network model by using a training data set to obtain a trained target detection network model integrating a convolution attention mechanism;
step 7, testing the network model: detect the test dataset with the trained target detection network model with a fused convolution attention mechanism, i.e. input the test set into the trained network model to obtain the detection result for each class of target in the remote sensing image dataset, including the per-class average precision (AP) and the mean average precision (mAP) over all classes.
The invention solves the technical problems of slow training convergence and low detection accuracy on small targets in end-to-end remote sensing image target detection frameworks.
Compared with the prior art, the invention has the following advantages:
The detection precision of the model on small targets is improved: the invention designs a convolution module in the encoder consisting of pointwise convolution, depthwise convolution, activation functions and a normalization layer; the module captures local image information without changing the size of the encoder's input matrix. The encoder formed by connecting the convolution module and the attention module in parallel has a better ability to extract both the global and the local features of the image, so the target detection network model with the fused convolution attention mechanism improves detection precision on small targets while maintaining it on large targets.
The training time of the model is reduced: in the prior art, an attention-based encoder's cost of encoding an image is quadratic in the number of pixels, so the model has high computational complexity and a large number of parameters. The pointwise and depthwise convolutions in the convolution module designed by the invention have few parameters, which reduces the computational complexity of the model, accelerates convergence and shortens training time.
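The parameter-count argument above can be made concrete with a small back-of-envelope calculation; the width `d` is an illustrative assumption and biases are ignored, so the counts are approximate rather than the patent's figures:

```python
# Rough parameter counts for one self-attention block versus the
# pointwise/depthwise convolution branch that replaces part of it.
d = 256                          # token / channel width (assumed)
attn_params = 4 * d * d          # Wq, Wk, Wv and the output projection
conv_params = (d * d             # first pointwise 1x1 convolution
               + 3 * 3 * d       # 3x3 depthwise convolution, one filter per channel
               + d * d)          # second pointwise 1x1 convolution
print(attn_params, conv_params)
```

At this width the convolution branch carries roughly half the parameters of the attention block, which is the source of the faster-convergence claim.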
Drawings
FIG. 1 is a block flow diagram of an implementation of the present invention;
FIG. 2 is a block diagram of a backbone network for extracting image features in the present invention;
FIG. 3 is a block diagram of the fused convolution Transformer encoder constructed in accordance with the present invention;
FIG. 4 is a block diagram of a convolution module constructed in the encoder of the present invention;
FIG. 5 is a flow chart of the encoder-decoder of the present invention;
fig. 6 is a graph of experimental results of the present invention, in which fig. 6(a) shows remote sensing detection results for small-vehicle and large-vehicle targets in the DOTA test set, and fig. 6(b) shows remote sensing detection results for small-vehicle and roundabout targets in the DOTA test set.
Detailed Description
Example 1
In the prior art, Transformer-based target detection methods fall into two types: image feature extraction with a Transformer backbone, and Transformer-based set prediction. DETR, proposed by the Facebook team in 2020, is the first end-to-end target detection framework of the Transformer-based set-prediction type: image features are extracted by the backbone network and fed, together with position coding, into the encoder; the encoder output matrix is fed into the decoder together with the target sequence; and the decoder output goes to the prediction head, where a feed-forward neural network predicts object classes and bounding boxes. DETR predicts object positions and classes with a set of object queries, processes the image with an encoder-decoder architecture and outputs the predictions; the framework is simple and direct, predicting detection boxes straight from the image sequence and removing the non-maximum suppression step that traditional target detection requires. However, DETR's attention mechanism attends mostly to the global features of the image, so the model's detection accuracy on small targets is low; moreover, the attention mechanism needs many parameters and converges with more difficulty than conventional convolutional detection networks. The invention therefore proposes a remote sensing image target detection method based on a fused convolution attention mechanism.
The remote sensing image target detection method based on a fused convolution attention mechanism according to the invention is shown in fig. 1, a flow chart of the implementation; the method comprises the following steps:
Step 1, collecting and processing remote sensing image data: obtain public remote sensing images from a public website and divide them into a training dataset, a validation dataset and a test dataset in the ratio 4:2:3; together these form the remote sensing image dataset, whose images are called the original image data. The dataset contains fifteen classes of targets: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabouts, football fields and swimming pools. The invention generates a txt file from the coordinates and class information of all targets in each piece of original image data and inputs the txt files, together with the original image data, into the built feature extraction backbone network. The dataset is annotated with oriented (inclined) bounding boxes; the pixel sizes of the images differ, and they contain objects of different scales, orientations and shapes.
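The per-image txt generation in step 1 can be sketched as follows. The line format (eight oriented-box corner coordinates followed by a class name, DOTA-style) and the class-name spellings are assumptions for illustration, not the patent's specification:

```python
# Sketch of step 1: writing one annotation .txt file per image, with one
# line per target (eight oriented-box corner coordinates + class name).
from pathlib import Path

CLASSES = ["plane", "ship", "storage-tank", "baseball-diamond", "tennis-court",
           "basketball-court", "ground-track-field", "harbor", "bridge",
           "large-vehicle", "small-vehicle", "helicopter", "roundabout",
           "soccer-ball-field", "swimming-pool"]

def write_annotation(txt_path, targets):
    """targets: list of (x1, y1, x2, y2, x3, y3, x4, y4, class_name) tuples."""
    lines = []
    for *coords, cls in targets:
        assert cls in CLASSES, f"unknown class {cls!r}"
        lines.append(" ".join(f"{c:.1f}" for c in coords) + f" {cls}")
    Path(txt_path).write_text("\n".join(lines) + "\n")

# one small vehicle annotated with an axis-aligned quadrilateral
write_annotation("P0001.txt", [(10, 10, 60, 10, 60, 40, 10, 40, "small-vehicle")])
```

The txt files produced this way are then paired with the original images as backbone input.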
Step 2, building the feature extraction backbone network: the backbone is formed by connecting four convolution groups in sequence. The first convolution group consists of a convolution layer, a Norm layer, an activation function layer and a max-pooling layer in sequence; the second, third and fourth convolution groups each further downsample the feature map output by the previous group. After the input original image data passes through the backbone's downsampling operations, a remote sensing image feature matrix is output. The backbone design must handle multiple image scales; so that it outputs feature matrices of the same size after extracting features from images of different sizes, the invention adds partial downsampling operations to the feature extraction backbone.
Step 3, constructing the fused convolution Transformer encoder: referring to fig. 3, the encoder contains a fused convolution multi-head attention module in which a convolution module and an attention module are connected in parallel. From the input end, the encoder consists of the fused convolution multi-head attention module, a residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module in sequence. The convolution module in the fused convolution multi-head attention module comprises a first pointwise convolution layer, a first activation function layer, a depthwise convolution layer, a BN layer, a second activation function layer and a second pointwise convolution layer connected in sequence; the attention module comprises an LN layer, a self-attention layer and a feed-forward network layer connected in sequence. The ratio of convolution modules to attention modules is 4:4, the matrix output by the convolution module has the same size as that output by the attention module, and after concatenation they form an output matrix of the same size as the input matrix of the fused convolution Transformer encoder. In this embodiment the improved multi-head attention module uses eight heads; the number of heads can be chosen to trade off model training time against detection precision, and eight heads balance parameter count and precision, while the equal number of convolution and attention modules ensures the encoder is not biased toward either global or local features when extracting image features.
The convolution module keeps the feature-map size unchanged by using pointwise convolution and keeps the channel count unchanged by using depthwise convolution, ensuring that the matrix it outputs has the same size as the matrix output by the attention module, so the two can be combined directly.
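A minimal PyTorch sketch of the fused convolution multi-head attention module described above: the 4:4 split and the pointwise-depthwise-pointwise layout come from the description, while the model width, token grid, GELU activations and 3x3 depthwise kernel are assumptions. The output-size check at the end mirrors the size-preservation argument of the preceding paragraph:

```python
import torch
import torch.nn as nn

class FusedConvMultiHeadAttention(nn.Module):
    """Of eight heads, four remain self-attention and four are replaced by a
    convolution branch; each branch preserves its channel slice's size, so
    the concatenated output matches the input (layer widths are assumed)."""
    def __init__(self, d_model=256, n_heads=8, conv_heads=4, h=20, w=20):
        super().__init__()
        self.h, self.w = h, w
        self.d_conv = d_model * conv_heads // n_heads   # conv-branch channels
        self.d_attn = d_model - self.d_conv             # attention-branch channels
        self.attn = nn.MultiheadAttention(self.d_attn, n_heads - conv_heads,
                                          batch_first=True)
        c = self.d_conv
        self.conv = nn.Sequential(        # pointwise -> depthwise -> pointwise
            nn.Conv2d(c, c, 1), nn.GELU(),
            nn.Conv2d(c, c, 3, padding=1, groups=c),    # depthwise 3x3
            nn.BatchNorm2d(c), nn.GELU(),
            nn.Conv2d(c, c, 1))

    def forward(self, x):                 # x: (batch, h*w tokens, d_model)
        b = x.shape[0]
        xa, xc = x.split([self.d_attn, self.d_conv], dim=-1)
        ya = self.attn(xa, xa, xa)[0]                   # global branch
        xc = xc.transpose(1, 2).reshape(b, self.d_conv, self.h, self.w)
        yc = self.conv(xc).flatten(2).transpose(1, 2)   # local branch
        return torch.cat([ya, yc], dim=-1)  # same size as the encoder input
```

Because the 1x1 convolutions keep the spatial size and the padded depthwise convolution keeps both size and channel count, the two branch outputs can be concatenated back to the encoder input width.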
Step 4, building the hybrid attention Transformer decoder: the decoder removes redundant information from the input target query matrix through a self-attention mechanism, a cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and a forward propagation module performs feature transformation on the image features and the prediction boxes. The decoder is formed by connecting six decoder units in sequence; after the matrix passes through the forward propagation module, each decoder unit outputs a set of predictions for the class and position of each target in the image. The attention coefficient matrix output by the encoder, defined over image pixel regions, is converted into an attention coefficient matrix over the objects in the image.
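A minimal PyTorch sketch of one of the six decoder units described above, assuming a DETR-style layout; the layer widths, head count and normalization placement are illustrative assumptions rather than the patent's exact configuration:

```python
import torch
import torch.nn as nn

class HybridAttentionDecoderLayer(nn.Module):
    """One decoder unit from step 4 (sketch): self-attention de-duplicates
    the object queries, cross-attention relates them to the encoder feature
    matrix, and a feed-forward block transforms the result."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, memory):
        # queries: (batch, n_queries, d_model); memory: encoder output tokens
        q = self.n1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.n2(q + self.cross_attn(q, memory, memory)[0])   # image context
        return self.n3(q + self.ffn(q))
```

Stacking six such units, each consuming the sixth encoder unit's output as `memory`, gives the decoder described in the text.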
Step 5, forming the target detection network model with a fused convolution attention mechanism: build the model, simply called the network model, from the feature extraction backbone network, the fused convolution Transformer encoder and the hybrid attention Transformer decoder connected in sequence.
Step 6, training the target detection network model with a fused convolution attention mechanism: train the model, formed by connecting the feature extraction backbone network, the fused convolution attention encoder and the hybrid attention decoder in sequence, with the training dataset to obtain the trained model. Because the attention modules carry relatively many parameters, the invention keeps the number of images per training batch as small as possible when setting the training parameters, on the premise of preserving the model's convergence speed.
Step 7, testing the target detection network model with a fused convolution attention mechanism: detect the test dataset with the trained model, i.e. input the test set into it to obtain the detection result for each class of target in the remote sensing image dataset, including the per-class average precision (AP) and the mean average precision (mAP) over all classes.
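The per-class AP and the mAP reported in step 7 can be computed as in the following sketch; VOC-style all-point interpolation is assumed, and the per-class values shown are illustrative placeholders, not the patent's results:

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: recall/precision arrays are ordered by
    descending detection confidence; mAP is the mean of AP over classes."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP over per-class APs (illustrative values only)
ap_per_class = {"small-vehicle": 0.62, "large-vehicle": 0.71, "roundabout": 0.58}
mAP = sum(ap_per_class.values()) / len(ap_per_class)
```

With two detections at recall 0.5 and 1.0 and precision 1.0 and 0.5, for example, this yields an AP of 0.75.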
The technical idea of the invention is as follows: fusing convolution layers into the encoder increases the model's extraction of local image features and thus its detection precision on small targets; and replacing part of the parameter-heavy attention modules with lightweight convolution modules reduces model parameters and speeds up model training.
To remedy the defects of existing end-to-end remote sensing image target detection frameworks, the invention introduces convolution modules to replace part of the self-attention heads of the multi-head attention module in the Transformer encoder, yielding a remote sensing image target detection method with a fused convolution attention mechanism. The method can be applied to real-time detection of ground targets by aerial vehicles and the like.
Example 2
The remote sensing image target detection method based on the fused convolution attention mechanism is the same as in Embodiment 1. The feature extraction backbone network built in step 2 is shown in fig. 2, a structure diagram of the backbone used to extract image features; it is formed by connecting four convolution groups in sequence. In this example, the first convolution group consists of a convolution layer (kernel size 6×6, 32 kernels, stride 1), a GroupNorm layer, a ReLU activation function layer, and a max-pooling layer (window 3×3, stride 2) in sequence; the second convolution group is formed by three identical residual modules 1, each consisting of a 1×1 convolution layer (128 kernels, stride 1), a 3×3 convolution layer (128 kernels, stride 1) and a 2×2 convolution layer (128 kernels, stride 1) connected in sequence; the third convolution group is formed by four identical residual modules 2, each consisting of a 1×1 convolution layer (128 kernels, stride 1), a 3×3 convolution layer (128 kernels, stride 1) and a 1×1 convolution layer (512 kernels, stride 1) connected in sequence; the fourth convolution group is formed by nine identical residual modules 3, each consisting of a 1×1 convolution layer (128 kernels, stride 2), a 3×3 convolution layer (256 kernels, stride 1) and a 1×1 convolution layer (512 kernels, stride 1) connected in sequence. The network structure parameters given in this embodiment are a set that performs well on the remote sensing image target detection task; the backbone parameters can be adjusted for specific tasks.
Because remote sensing images have characteristics such as an overhead viewing angle, high resolution, non-uniform target scales, multi-directional target rotation and complex backgrounds, the backbone network must be designed to handle multiple image scales: input pictures of different sizes are downsampled so that the feature matrices output by the feature extraction backbone network all have the same dimensions.
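As a minimal sketch (not the invention's exact code), one such bottleneck residual module can be written in PyTorch; the class name, input/output channel counts and GroupNorm group count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualModule1(nn.Module):
    """Sketch of a residual module: three convolution layers (1x1 -> 3x3 -> 1x1),
    each followed by GroupNorm and ReLU, with an additive shortcut connection."""
    def __init__(self, in_ch=64, mid_ch=128, out_ch=256, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=1),
            nn.GroupNorm(groups, mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(groups, mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1),
            nn.GroupNorm(groups, out_ch),
        )
        # 1x1 projection so the shortcut matches the output channel count
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
y = ResidualModule1()(x)
assert tuple(y.shape) == (1, 256, 56, 56)  # spatial size preserved, channels widened
```

With stride 1 throughout, the module widens the channel dimension while preserving spatial size; stride-2 variants (as in the fourth convolution group) would halve the spatial resolution.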
Example 3
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-2. For the fusion convolution Transformer encoder constructed in step 3, referring to fig. 3 (the fusion convolution Transformer encoder constructed by the invention), the encoder is formed by sequentially connecting six encoder units. Position codes are added to the output sequence of the feature extraction backbone network to generate a position-coded feature sequence, which serves as the input of the whole encoder. Each encoder unit has the same structure and is composed of a fusion convolution multi-head attention module, a first residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module, connected in sequence. The first residual connection and layer normalization module adds the input matrix of the encoder unit to the output matrix of the multi-head attention module as a shortcut, and then normalizes the summed matrix. The forward propagation module is formed by sequentially connecting a linear layer, a ReLU activation function layer and a dropout layer. The second residual connection and layer normalization module adds the output matrix of the first residual connection and layer normalization module to the output matrix of the forward propagation module, and then normalizes the summed matrix. For the first through fifth encoder units, the output matrix of the current encoder unit serves as the input matrix of the next encoder unit; in particular, the output matrix of the sixth encoder unit serves as the input matrix of each of the six decoder units in the decoder.
The convolution module designed by the invention consists of a depthwise convolution, a pointwise convolution, activation functions and normalization layers, and guarantees that the output matrix has the same size as the input matrix while still extracting local image features. Likewise, the output matrix of the attention module has the same size as its input matrix, so the convolution module can be added in parallel to the attention module without changing the matrix size.
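A minimal PyTorch sketch of such a depthwise/pointwise convolution unit, assuming a Conformer-style ordering (pointwise convolution + GLU, depthwise convolution + BatchNorm + Swish, then a pointwise convolution back to the input width); the feature width of 256 and the class name are illustrative:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Sketch of the encoder convolution unit: pointwise conv + GLU,
    depthwise conv + BatchNorm + Swish, then a pointwise conv back to
    the input width, so the output has exactly the input's size."""
    def __init__(self, d_model=256, kernel_size=3):
        super().__init__()
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)  # doubled for GLU
        self.glu = nn.GLU(dim=1)                                   # halves channels again
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)  # depthwise
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pw2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                     # x: (batch, d_model, seq_len)
        x = self.glu(self.pw1(x))
        x = self.swish(self.bn(self.dw(x)))
        return self.pw2(x)

x = torch.randn(2, 256, 850)
out = ConvModule()(x)
assert out.shape == x.shape  # same size in and out, as required for the parallel branch
```

The `groups=d_model` argument is what makes the middle convolution depthwise: each channel is filtered independently, keeping the parameter count small.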
Example 4
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-3. Referring to fig. 4, a structure diagram of the convolution module constructed in the encoder of the invention, the fusion convolution multi-head attention module in the fusion convolution Transformer encoder is formed by four self-attention units and four convolution units in parallel. The four self-attention units have the same structure: each unit first multiplies the input matrix by three matrices with different parameters, i.e. performs three different linear transformations on the input matrix, to obtain three matrices Q, K and V of the same size but different parameters, and then computes the attention parameter matrix from Q, K and V with a softmax function, according to the formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where d_k is the dimension of the K matrix.
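The scaled dot-product attention computation of a single self-attention unit can be sketched as follows (shapes are illustrative; the learned Q/K/V projections are assumed to have already been applied):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V -- attention for one head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (seq, seq) similarity matrix
    return torch.softmax(scores, dim=-1) @ v            # weighted sum of values

seq_len, d_k = 850, 64          # illustrative sequence length and head width
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
out = scaled_dot_product_attention(q, k, v)
assert out.shape == (seq_len, d_k)   # attention preserves the input size
```

The (seq, seq) score matrix is what makes attention's cost quadratic in the sequence length, the motivation given below for replacing some heads with convolution units.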
The four convolution units have the same structure; each convolution unit is formed by sequentially connecting a pointwise convolution layer with a 1×1 kernel, 128 kernels and a stride of 1, a GLU activation function layer, a depthwise convolution layer with a 3×3 kernel, 256 kernels and a stride of 1, a BN normalization layer, a Swish activation function layer, and a pointwise convolution layer with a 1×1 kernel, 256 kernels and a stride of 1.
By extracting image features with the convolution module and the attention module in parallel, the invention obtains both the global information and the local information of the image, improving the detection precision of the trained model for small targets while preserving its precision for large targets. Moreover, because the computation of the attention mechanism scales with the square of the image feature dimension, its parameter count is very large; introducing the local convolution module reduces the parameter count of the model, accelerates its convergence and shortens training time.
Example 5
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-4. Referring to fig. 5, a flow chart of the encoder-decoder of the invention, the mixed-attention Transformer decoder of step 4 is formed by sequentially connecting six decoder units. Each decoder unit has the same structure and is formed by sequentially connecting, from input to output, a multi-head self-attention module, a first residual connection and layer normalization module, a multi-head cross-attention module, a second residual connection and layer normalization module, a forward propagation module, and a third residual connection and layer normalization module.
In this embodiment, the input target query matrix has size 100×256, the encoder output matrix has size 850×256, the mask matrix has size 25×34, and the output matrix has size 100×256. This simplifies the flow by which the target detection task generates detection frames for the image.
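A quick sanity check of these sizes, assuming the 850-token encoder output is the flattened 25×34 feature map (25 × 34 = 850) and using PyTorch's stock multi-head attention as a stand-in for the cross-attention module:

```python
import torch
import torch.nn as nn

# Assumed shapes from this embodiment: 100 object queries of width 256 and an
# encoder memory of 850 tokens, i.e. a 25 x 34 feature map flattened.
queries = torch.randn(100, 1, 256)   # (target_len, batch, d_model)
memory = torch.randn(850, 1, 256)    # (source_len, batch, d_model)

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)
out, _ = cross_attn(queries, memory, memory)

assert 25 * 34 == 850                # mask matrix size matches the memory length
assert out.shape == (100, 1, 256)    # output keeps the query matrix's size
```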
Example 6
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-5. Training the network model in step 6 of the invention means training the fusion convolution attention mechanism target detection network with the remote sensing image training data set, specifically:
6.1 Hyperparameter setting: set the initial learning rate to R, the learning rate schedule to steps mode, the weight decay parameter to a, the batch size to B, and the number of training epochs to E;
6.2 Training method: update the weights and biases of the whole network model with a stochastic gradient descent algorithm, performing one update for every B input training images; after a total of

(N / B) × E

updates, where N is the number of images in the training data set, updating stops and training is finished;
6.3 Obtaining the final trained network model: when the iterations stop, the trained target detection network model with the fusion convolution attention mechanism is obtained.
In this example, the initial learning rate is set to 0.001, the learning rate schedule to steps mode, the weight decay parameter to 0.0001, the batch size to 4, and the number of training epochs to 100. The weights and biases of the whole network model are updated with stochastic gradient descent, once for every 4 input training images, and updating stops after 40000 updates in total, yielding the final trained network model. These parameters are a set that trains well in this example; the invention allows them to be adjusted for different target detection tasks.
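A quick consistency check of the update count under these hyperparameters, assuming a training set of N = 1600 images (the value implied by 40000 updates at batch size 4 over 100 epochs):

```python
# Assumed values from this example: batch size B = 4, epochs E = 100, and
# N = 1600 training images (40000 updates * 4 images per update / 100 epochs).
N, B, E = 1600, 4, 100

updates_per_epoch = N // B           # one weight/bias update per batch
total_updates = updates_per_epoch * E
assert total_updates == 40000        # matches the stated stopping point
```

The same arithmetic with batch size 8 gives the 20000 updates quoted in the later, larger-batch example.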
A more detailed example is given below to further illustrate the invention.
Example 7
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-6. Referring to fig. 1, a flow chart of the implementation of the invention, the method comprises the following steps:
step 1, collecting and processing remote sensing image data: acquiring a public remote sensing image from a public website, dividing the image into a training data set, a verification data set and a test data set according to the proportion of 3:1:2, and forming a remote sensing image data set as a whole, wherein the image in the remote sensing image data set is called as original image data; the remote sensing image dataset contains fifteen types of targets, which are respectively: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabout, football fields, swimming pools. And generating txt files by using the coordinates and the category information of all targets of each piece of original image data in the remote sensing image data set, and inputting the txt files and the original image data into the built feature extraction backbone network.
Step 2, building the feature extraction backbone network: the constructed feature extraction backbone network is formed by sequentially connecting four convolution groups. The first convolution group is formed by sequentially connecting a convolution layer, a Norm layer, an activation function layer and a max-pooling layer; the second, third and fourth convolution groups are formed by sequentially connecting different numbers of residual connection units, each of which is formed by stacking a convolution layer, a GN layer and an activation function layer in sequence. The backbone network uses residual connections of different convolution modules and stacked feature pyramid modules, and generates the same feature matrix after downsampling input pictures of different sizes. The feature matrix output by the backbone network is dimension-reduced, added to position codes of the same dimension, and sent to the encoder.
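The patent does not specify the form of the position codes; a common choice, shown here purely as an assumption, is the fixed sinusoidal encoding of the original Transformer, added element-wise to the flattened feature sequence:

```python
import math
import torch

def sine_position_encoding(seq_len, d_model):
    """Fixed sinusoidal position codes with the same dimension as the features."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cosine
    return pe

features = torch.randn(850, 256)         # flattened backbone output (illustrative)
encoder_input = features + sine_position_encoding(850, 256)
assert encoder_input.shape == features.shape
```

Because the codes have the same dimension as the features, the addition leaves the matrix size unchanged before it enters the encoder.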
Referring to fig. 2, the feature extraction backbone network constructed by the invention is formed by sequentially connecting four convolution groups; the first convolution group consists of a convolution layer, a GroupNorm layer, an activation function layer and a maximum pooling layer sequentially; the second convolution group is formed by sequentially connecting three identical residual modules 1, and each residual module 1 is formed by sequentially connecting three different convolution layers, a normalization layer and an activation function layer; the third convolution group is formed by sequentially connecting four identical residual modules 2, and each residual module 2 is formed by sequentially connecting three different convolution layers, a normalization layer and an activation function layer; the fourth convolution group is formed by sequentially connecting nine identical residual modules 3, and each residual module 3 is formed by sequentially connecting three different convolution layers, a normalization layer and an activation function layer.
The first convolution group consists of a convolution layer with a 7×7 kernel, 64 kernels and a stride of 2, a GroupNorm layer, a ReLU activation function layer, and a max-pooling layer with a 3×3 window and a stride of 2. The second convolution group is formed by sequentially connecting three identical residual modules 1, each consisting of a convolution layer with a 1×1 kernel, 128 kernels and a stride of 1, a convolution layer with a 3×3 kernel, 128 kernels and a stride of 1, and a convolution layer with a 1×1 kernel, 256 kernels and a stride of 1, connected in sequence. The third convolution group is formed by sequentially connecting four identical residual modules 2, each consisting of a convolution layer with a 1×1 kernel, 128 kernels and a stride of 1, a convolution layer with a 3×3 kernel, 128 kernels and a stride of 1, and a convolution layer with a 1×1 kernel, 512 kernels and a stride of 1, connected in sequence. The fourth convolution group is formed by sequentially connecting nine identical residual modules 3, each consisting of a convolution layer with a 1×1 kernel, 256 kernels and a stride of 1, a convolution layer with a 3×3 kernel, 256 kernels and a stride of 1, and a convolution layer with a 1×1 kernel, 1024 kernels and a stride of 1, connected in sequence.
By stacking the residual unit modules, the feature extraction backbone network designed by the invention downsamples the input images so that inputs of different sizes yield feature matrices of the same size after passing through the backbone, solving the multi-scale detection problem for remote sensing image targets.
Step 3, constructing the fusion convolution Transformer encoder: referring to fig. 3, the constructed Transformer encoder is formed by sequentially connecting six encoder units, each of which contains a fusion convolution multi-head attention module formed by connecting a convolution module and an attention module in parallel. From the input end, each encoder unit comprises, in sequence, a fusion convolution multi-head attention module, a first residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module. The first residual connection and layer normalization module adds the input matrix of the encoder unit to the output matrix of the multi-head attention module as a shortcut, and then normalizes the summed matrix. The forward propagation module is formed by sequentially connecting a linear layer, a ReLU activation function layer and a dropout layer. The second residual connection and layer normalization module adds the output matrix of the first residual connection and layer normalization module to the output matrix of the forward propagation module, and then normalizes the summed matrix. For the first through fifth encoder modules, the output matrix of the current encoder module serves as the input matrix of the next encoder module; in particular, the output matrix of the sixth encoder module serves as the input matrix of each of the six decoder units in the decoder module.
The fusion convolution multi-head attention module in the fusion convolution Transformer encoder is formed by four self-attention units and four convolution units in parallel. The four self-attention units have the same structure; each comprises an LN layer, a self-attention layer and a feed-forward network layer connected in sequence. Three matrices Q, K and V of the same size but different parameters are obtained by multiplying the input matrix by three matrices with different parameters: Q is the query matrix obtained by a linear transformation of the image feature matrix, K is the key matrix obtained by a linear transformation of the image feature matrix, and V is the value matrix obtained by a linear transformation of the image feature matrix. The attention parameter matrix is then computed from Q, K and V with a softmax function, according to the formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where d_k is the dimension of the K matrix.
The four convolution units have the same structure; referring to fig. 4, each convolution unit comprises a pointwise convolution layer, an activation function layer, a depthwise convolution layer, a BN layer, an activation function layer and a pointwise convolution layer connected in sequence. The ratio of convolution modules to attention modules is 4:4; the matrix output by the convolution modules has the same size as the matrix output by the attention modules, and after concat cascading they form an output matrix of the same size as the input matrix of the fusion convolution Transformer encoder module.
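A simplified sketch of the 4 + 4 parallel arrangement; the branch widths, the omission of GLU/BatchNorm inside the convolution unit, and all class names are illustrative assumptions, not the invention's exact layers:

```python
import math
import torch
import torch.nn as nn

class AttnHead(nn.Module):
    """One self-attention branch projecting d_model down to d_head."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)

    def forward(self, x):                         # x: (seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v  # (seq, d_head)

class ConvUnit(nn.Module):
    """One convolution branch: pointwise -> depthwise -> pointwise."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.pw1 = nn.Conv1d(d_model, d_head, 1)
        self.dw = nn.Conv1d(d_head, d_head, 3, padding=1, groups=d_head)
        self.pw2 = nn.Conv1d(d_head, d_head, 1)
        self.act = nn.SiLU()

    def forward(self, x):                         # x: (seq, d_model)
        h = x.transpose(0, 1).unsqueeze(0)        # (1, d_model, seq)
        h = self.pw2(self.act(self.dw(self.pw1(h))))
        return h.squeeze(0).transpose(0, 1)       # (seq, d_head)

class FusedConvAttention(nn.Module):
    """4 attention heads and 4 convolution units in parallel; their outputs
    are concatenated back to the input width (8 branches x 32 = 256)."""
    def __init__(self, d_model=256, n_attn=4, n_conv=4):
        super().__init__()
        d_head = d_model // (n_attn + n_conv)
        self.branches = nn.ModuleList(
            [AttnHead(d_model, d_head) for _ in range(n_attn)]
            + [ConvUnit(d_model, d_head) for _ in range(n_conv)])

    def forward(self, x):                         # x: (seq, d_model)
        return torch.cat([b(x) for b in self.branches], dim=-1)

x = torch.randn(850, 256)
y = FusedConvAttention()(x)
assert y.shape == x.shape   # concat restores the encoder input size
```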
Step 4, building the mixed-attention Transformer decoder module: the decoder removes redundant information from the input target query matrix through a self-attention mechanism, the cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and the forward propagation module performs feature transformations for the image features and the prediction frames.
Referring to fig. 5, the mixed-attention Transformer decoder module constructed by the invention is composed of six decoder units connected in sequence; each decoder unit has the same structure and is formed by sequentially connecting a multi-head self-attention module, a first residual connection and layer normalization module, a multi-head cross-attention module, a second residual connection and layer normalization module, a forward propagation module, and a third residual connection and layer normalization module. The target query matrix is first input into the multi-head self-attention module to remove redundant information; the processed target query matrix and the matrix output by the encoder are then input into the multi-head cross-attention module for cross-attention computation, converting the attention matrix over image regions into an attention matrix over image targets; finally, each decoder unit outputs a prediction matrix for the image through the forward propagation module, which predicts the targets.
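The decoder unit described above (self-attention, cross-attention and feed-forward blocks, each followed by a residual connection and layer normalization) matches the standard Transformer decoder layer, so a sketch can reuse PyTorch's built-in module with the sizes of this embodiment; the feed-forward width is an assumption:

```python
import torch
import torch.nn as nn

# One decoder unit = self-attention + cross-attention + feed-forward,
# each with residual connection and layer normalization, which is what
# torch.nn.TransformerDecoderLayer provides; six units form the decoder.
layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(layer, num_layers=6)

queries = torch.zeros(100, 1, 256)   # target query matrix (zeros as a placeholder)
memory = torch.randn(850, 1, 256)    # output of the sixth encoder unit
out = decoder(queries, memory)
assert out.shape == (100, 1, 256)    # one 256-wide prediction row per object query
```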
Step 5, forming the target detection network model with the fusion convolution attention mechanism: establish the target detection network model of the fusion convolution attention mechanism (network model for short), formed by connecting the feature extraction backbone network, the fusion convolution Transformer encoder module and the mixed-attention Transformer decoder module in sequence.
Step 6, training the target detection network model with the fusion convolution attention mechanism: train the fusion convolution attention mechanism target detection network model, formed by sequentially connecting the feature extraction backbone network, the fusion convolution attention mechanism encoder and the mixed-attention mechanism decoder, with the training data set to obtain a trained model. Specifically:
6.1 Hyperparameter setting: set the initial learning rate to R, the learning rate schedule to steps mode, the weight decay parameter to a, the batch size to B, and the number of training epochs to E;
6.2 Training method: update the weights and biases of the whole network model with a stochastic gradient descent algorithm, performing one update for every B input training images; after a total of

(N / B) × E

updates, where N is the number of images in the training data set, updating stops and training is finished;
6.3 Obtaining the final trained network model: when the iterations stop, the trained target detection network model with the fusion convolution attention mechanism is obtained.
In this example, the initial learning rate is set to 0.0025, the learning rate schedule to steps mode, the weight decay parameter to 0.0001, the batch size to 8, and the number of training epochs to 100. The weights and biases of the whole network model are updated with stochastic gradient descent, once for every 8 input training images, and updating stops after 20000 updates in total, yielding the final trained network model.
The invention accelerates the convergence speed of the model and reduces the time consumption of model training.
Step 7, testing the target detection network model with the fusion convolution attention mechanism: detect the test data set with the trained model, i.e. input the test set into the trained target detection network model to obtain the detection results for each class of target in the remote sensing image data set, including the average precision (AP) of each class and the mean average precision (mAP) over all classes.
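mAP is simply the mean of the per-class average precisions; a sketch with hypothetical AP values (the class names and numbers are illustrative, not results from the document):

```python
def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class average precision (AP) values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class AP values, for illustration only
aps = {"plane": 0.90, "ship": 0.80, "small-vehicle": 0.70}
assert abs(mean_average_precision(aps) - 0.80) < 1e-9
```

Each per-class AP itself summarizes a precision-recall curve at a chosen IoU threshold; only the final averaging step is shown here.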
The invention adopts a pyramid-structured feature extraction backbone network that includes downsampling operations and outputs feature matrices of the same size for input images of different sizes, solving the difficulty of multi-scale detection of remote sensing image targets. A convolution module comprising depthwise and pointwise convolutions is built, enhancing the model's ability to extract local features of the remote sensing image. Part of the attention heads in the encoder are replaced with the built convolution modules, reducing the large parameter count of an encoder composed entirely of attention mechanisms, improving the convergence speed of the model and reducing training time. The method is used in fields with high requirements on the real-time performance and accuracy of remote sensing image target detection, such as aviation aircraft, remote sensing satellites, intelligent transportation and intelligent agriculture.
The technical effects of the invention are further explained below through experiments and their result data.
Example 8
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-7.
experimental conditions: all experiments are carried out under the same platform, the CPU of the hardware configuration of the platform is Intel8358P, the GPU is NVIDIA GeForce RTX 3090, and the video memory is 24G. The operating system used in the experiment is Ubuntu 18.04LTS, the deep learning framework used is Pytorch 1.7.1, the GPU computing platform is CUDA 11.0, and the GPU acceleration library is cuDNN 8.0.5.
Experimental contents: target detection is performed on the public remote sensing data set DOTA with the remote sensing image target detection method based on the fusion convolution attention mechanism. A trained model is obtained with the above training method, and the target class and position accuracy are tested on the 937 pictures of the DOTA test set. Two detection result pictures were randomly extracted and are shown in fig. 6, where fig. 6(a) is a detection result picture containing small-vehicle and large-vehicle targets detected by the method.
experimental results and analysis: referring to fig. 6 (a), as can be seen from fig. 6 (a), the targets of the small and large vehicles in all the detected remote sensing images are detected, the confidence of most detection frames is high, the targets in the rotation direction and the targets in the rotation directions can be accurately detected by using the rotation frames, the condition of target missing detection is avoided, and the invention has good detection performance on the large targets and the small targets in the remote sensing data set.
Example 9
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-7; the experimental conditions and contents are the same as in Embodiment 8.
Experimental results and analysis: fig. 6(b) is a picture of the detection result for targets including small vehicles and a roundabout detected by the invention. As fig. 6(b) shows for the target class and position accuracy, all small-vehicle and roundabout targets in the detected remote sensing image are found, and the confidence of most detection frames is high. The invention accurately detects targets with large scale differences, i.e. the small vehicles and the roundabout, with no missed detections, verifying the ability of the designed feature extraction backbone network to extract multi-scale image features through downsampling operations.
Example 10
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as that of the embodiments 1-7, and the experimental conditions are the same as that of the embodiment 8.
Experimental contents: a comparison test on the DOTA data set is carried out with the remote sensing image target detection method based on the fusion convolution attention mechanism, the attention-based remote sensing image target detection method DETR, and the convolution-based remote sensing image target detection method ReDet; the comparison results are shown in Table 1.
TABLE 1 results of comparative experiments on DOTA remote sensing dataset
Experimental results and analysis: as Table 1 shows, compared with the convolution-based remote sensing image target detection method, the invention improves detection accuracy on most targets, including targets with large scale differences; compared with the attention-based remote sensing image target detection method, the invention achieves comparable accuracy on large targets and better accuracy on small targets.
In summary, the invention provides a remote sensing image target detection method based on a fusion convolution attention mechanism, which solves the technical problems of low detection precision for small targets in remote sensing images and slow convergence during model training in existing end-to-end remote sensing image target detection technology. The implementation comprises: collecting and processing remote sensing image data; constructing a feature extraction backbone network; constructing a fusion convolution Transformer encoder; building a mixed-attention Transformer decoder; forming a target detection network model with a fusion convolution attention mechanism; training the target detection network model; and testing the target detection network model. The invention adopts a pyramid-structured downsampling feature extraction backbone network that outputs feature matrices of the same size for input images of different sizes, solving the difficulty of multi-scale detection of remote sensing image targets. A convolution module consisting of depthwise convolution, pointwise convolution, an activation function and a normalization layer is built, enhancing the model's ability to extract local features of the remote sensing image. Part of the attention heads in the multi-head attention module are replaced with the built convolution modules, reducing the large parameter count of the attention mechanism's dot-product operations, which scale with the square of the number of image pixels, thereby improving the convergence speed of the model and reducing training time. The method is used in fields with high requirements on the real-time performance and accuracy of remote sensing image target detection, such as aviation aircraft, remote sensing satellites, intelligent transportation and intelligent agriculture.

Claims (6)

1. The remote sensing image target detection method based on the fusion convolution attention mechanism is characterized by comprising the following steps of:
step 1, collecting and processing remote sensing image data: acquiring public remote sensing images from a public website and dividing them into a training data set, a validation data set and a test data set, which together form a remote sensing image data set; the remote sensing image data set contains fifteen classes of targets: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabouts, football fields, swimming pools; generating txt files from the coordinates and class information of all targets of the original image data in the remote sensing image data set, and inputting the txt files and the original image data into the constructed feature extraction backbone network;
step 2, building a feature extraction backbone network: the built feature extraction backbone network is formed by sequentially connecting four convolution groups, wherein the first convolution group sequentially comprises a convolution layer, a Norm layer, an activation function layer and a maximum pooling layer; the second, third and fourth convolution groups are formed by sequentially connecting residual error connecting units with different numbers, and each residual error connecting unit is formed by sequentially stacking a convolution layer, a GN layer and an activation function layer; the method comprises the steps that input original image data is subjected to downsampling operation of a built feature extraction backbone network, and then a remote sensing image feature matrix is output;
step 3, constructing a fused convolution Transformer encoder: the built Transformer encoder comprises a fused convolution multi-head attention module formed by connecting a convolution module and an attention module in parallel; from the input end, the encoder sequentially comprises the fused convolution multi-head attention module, a residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module; the convolution module in the fused convolution multi-head attention module comprises a first convolution layer, a first activation function layer, a second convolution layer, a BN layer, a second activation function layer and a third convolution layer which are sequentially connected, and the attention module comprises an LN layer, a self-attention layer and a feedforward network layer which are sequentially connected; the ratio of convolution modules to attention modules is 4:4, the matrix output by the convolution modules has the same size as the matrix output by the attention modules, and after concat concatenation they form an output matrix with the same size as the input matrix of the fused convolution Transformer encoder;
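A minimal sketch of the fused convolution multi-head attention module described above, for illustration only: the input channels are split evenly between a convolution branch and a 4-head self-attention branch (one reading of the 4:4 ratio), and the two outputs are concatenated back to the input size. The class name `FusedConvAttention` and all dimensions are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FusedConvAttention(nn.Module):
    """Illustrative fused-convolution multi-head attention module (step 3):
    a convolution branch and a 4-head self-attention branch run in parallel
    on half of the channels each; their outputs are concatenated so the
    output matrix matches the input matrix in size."""
    def __init__(self, dim: int = 64, seq_hw: int = 8):
        super().__init__()
        half = dim // 2
        self.seq_hw = seq_hw
        # Convolution branch: conv -> act -> conv -> BN -> act -> conv.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(half, half, 1), nn.GELU(),
            nn.Conv2d(half, half, 3, padding=1), nn.BatchNorm2d(half),
            nn.GELU(), nn.Conv2d(half, half, 1),
        )
        # Attention branch: LN -> 4-head self-attention -> feedforward layer.
        self.norm = nn.LayerNorm(half)
        self.attn = nn.MultiheadAttention(half, num_heads=4, batch_first=True)
        self.ffn = nn.Linear(half, half)

    def forward(self, x):                      # x: (batch, tokens, channels)
        b, n, c = x.shape
        xc, xa = x.chunk(2, dim=-1)            # split channels between branches
        h = w = self.seq_hw                    # tokens fold back to an h x w map
        conv_out = self.conv_branch(
            xc.transpose(1, 2).reshape(b, c // 2, h, w)
        ).reshape(b, c // 2, n).transpose(1, 2)
        xa = self.norm(xa)
        attn_out, _ = self.attn(xa, xa, xa)
        attn_out = self.ffn(attn_out)
        return torch.cat([conv_out, attn_out], dim=-1)  # same size as input

m = FusedConvAttention(dim=64, seq_hw=8)
y = m(torch.randn(2, 64, 64))  # 64 tokens (8x8), 64 channels
```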
step 4, building a Transformer decoder with a hybrid attention mechanism: the decoder processes the redundant information in the input target query matrix through a self-attention mechanism, the cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and the forward propagation module performs feature transformation on the image features and the prediction boxes;
step 5, forming a target detection network model with a fused convolution attention mechanism: establishing the target detection network model formed by sequentially connecting the feature extraction backbone network, the fused convolution Transformer encoder and the hybrid attention mechanism Transformer decoder; this target detection network model is called the network model for short;
step 6, training the target detection network model with the fused convolution attention mechanism: training the model, formed by sequentially connecting the feature extraction backbone network, the fused convolution attention mechanism encoder and the hybrid attention mechanism decoder, with the training data set to obtain a trained target detection network model with the fused convolution attention mechanism;
step 7, testing the target detection network model with the fused convolution attention mechanism: detecting the test data set with the trained model, namely inputting the test set into the trained target detection network model with the fused convolution attention mechanism to obtain the detection result for each type of target in the remote sensing image data set, the detection result comprising the average precision AP of each class and the mean average precision mAP over all classes of targets.
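For illustration only: mAP in step 7 is the arithmetic mean of the per-class average precisions. The sketch below assumes the per-class AP values have already been computed from each class's precision-recall curve; the AP numbers shown are placeholders, not results from the patent.

```python
def mean_average_precision(per_class_ap: dict) -> float:
    """mAP = arithmetic mean of the per-class average precisions (AP).
    In practice each AP is derived from that class's precision-recall
    curve over the detections on the test set."""
    return sum(per_class_ap.values()) / len(per_class_ap)

# Placeholder per-class APs for three of the fifteen target classes.
ap = {"aircraft": 0.92, "ship": 0.88, "storage tank": 0.79}
map_value = mean_average_precision(ap)
```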
2. The remote sensing image target detection method based on the fusion convolution attention mechanism according to claim 1, wherein the feature extraction backbone network constructed in step 2 is formed by sequentially connecting four convolution groups; the first convolution group consists of a convolution layer, a GroupNorm layer, an activation function layer and a maximum pooling layer connected sequentially; the second convolution group is formed by sequentially connecting three identical residual modules 1, each residual module 1 being formed by sequentially connecting three different convolution layers; the third convolution group is formed by sequentially connecting four identical residual modules 2, each residual module 2 being formed by sequentially connecting three different convolution layers; the fourth convolution group is formed by sequentially connecting nine identical residual modules 3, each residual module 3 being formed by sequentially connecting three different convolution layers.
3. The remote sensing image target detection method based on a fusion convolution attention mechanism according to claim 1, wherein the fused convolution Transformer encoder constructed in step 3 is formed by sequentially connecting six identical encoder units; each encoder unit has an identical structure and is formed by sequentially connecting a fused convolution multi-head attention module, a first residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module; position coding is added to the output sequence of the feature extraction backbone network to generate a position-coded feature sequence, which serves as the input of the whole encoder and is fed to the first encoder unit; within an encoder unit, the first residual connection and layer normalization module adds the input matrix of the encoder unit to the output matrix of the multi-head attention module through a shortcut connection, and then normalizes the summed matrix; the forward propagation module is formed by sequentially connecting a linear layer, a relu activation function layer and a dropout layer; the second residual connection and layer normalization module adds the output matrix of the first residual connection and layer normalization module to the output matrix of the forward propagation module through a shortcut connection, and then normalizes the summed matrix; for the first through fifth encoder units, the output matrix of the current encoder unit serves as the input matrix of the next encoder unit; the output matrix of the sixth encoder unit serves as the input matrix of each of the six decoder units in the decoder.
4. The remote sensing image target detection method based on the fusion convolution attention mechanism according to claim 1 or 3, wherein the fused convolution multi-head attention module in the fused convolution Transformer encoder is formed by four self-attention units connected in parallel with four convolution units; the four self-attention units have the same structure: each self-attention unit first multiplies the input matrix by three matrices with different parameters, performing three different linear transformations on the input matrix to obtain three matrices Q, K and V of the same size but with different parameters, and then computes the attention parameter matrix from the three matrices Q, K and V through a softmax function, according to the following formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V, where d_k is the dimension of the matrix K
The four convolution units have the same structure, each convolution unit being formed by sequentially connecting a first convolution layer, a first activation function layer, a second convolution layer, a Norm layer, a second activation function layer and a third convolution layer.
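The scaled dot-product attention formula of claim 4 can be written out numerically as follows; this is the standard computation, shown here with illustrative matrix sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is
    the dimension of K, as in claim 4."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise similarity scores
    return softmax(scores) @ V        # attention-weighted values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
out = attention(Q, K, V)  # same shape as V
```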
5. The remote sensing image target detection method based on the fusion convolution attention mechanism according to claim 1, wherein the Transformer decoder with the hybrid attention mechanism in step 4 is formed by sequentially connecting six decoder units; each decoder unit has the same structure and is formed by sequentially connecting a multi-head self-attention module, a first residual connection and layer normalization module, a multi-head cross-attention module, a second residual connection and layer normalization module, a forward propagation module, and a third residual connection and layer normalization module.
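For illustration only: the decoder-unit structure of claim 5 (self-attention, cross-attention and a forward propagation module, each followed by residual connection and layer normalization) matches the layout of PyTorch's stock Transformer decoder layer, which can stand in as a sketch. The dimensions and the 100-query size below are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

# One decoder unit: multi-head self-attention -> add & norm ->
# multi-head cross-attention -> add & norm -> feed-forward -> add & norm.
layer = nn.TransformerDecoderLayer(d_model=64, nhead=8, batch_first=True)

# Six identical decoder units connected sequentially, as in claim 5.
decoder = nn.TransformerDecoder(layer, num_layers=6)

memory = torch.randn(2, 49, 64)    # feature sequence from the encoder
queries = torch.zeros(2, 100, 64)  # target query matrix (size illustrative)
out = decoder(queries, memory)     # cross-attention relates queries to memory
```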
6. The remote sensing image target detection method based on the fused convolution attention mechanism according to claim 1, wherein training the network model in step 6 means training the target detection network with the fused convolution attention mechanism using the remote sensing image training data set, specifically:
6.1 hyperparameter setting: set the initial learning rate to R, the learning rate schedule to the step decay mode, the weight decay parameter to a, the batch size to B, and the number of training epochs to E;
6.2 training method: update the weight and bias of the whole network model with a stochastic gradient descent algorithm, performing one weight and bias update for every B input training images; after a total of ⌊N/B⌋ × E iterations, where N is the number of images in the training data set and E is the number of training epochs, updating stops and training is finished;
6.3 obtaining the final trained network model: when iteration stops, the trained target detection network model with the fused convolution attention mechanism is obtained.
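The iteration budget of the training method in 6.2 works out as a simple count; the sketch below uses illustrative values for N, B and E (the patent leaves them symbolic).

```python
import math

def total_iterations(num_train_images: int, batch_size: int, epochs: int) -> int:
    """Number of weight/bias updates in step 6: one update per batch of B
    images, repeated over E epochs, i.e. floor(N / B) * E."""
    return math.floor(num_train_images / batch_size) * epochs

# Illustrative values only, not taken from the patent:
iters = total_iterations(num_train_images=10000, batch_size=8, epochs=50)
```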
CN202310176483.XA 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism Pending CN116229295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310176483.XA CN116229295A (en) 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310176483.XA CN116229295A (en) 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism

Publications (1)

Publication Number Publication Date
CN116229295A 2023-06-06

Family

ID=86590726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310176483.XA Pending CN116229295A (en) 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism

Country Status (1)

Country Link
CN (1) CN116229295A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824407A (en) * 2023-06-21 2023-09-29 深圳市华赛睿飞智能科技有限公司 Target detection method, device and equipment based on patrol robot
CN116883729A (en) * 2023-06-27 2023-10-13 西北大学 Ceramic cultural relic fragment microscopic image classification method based on combination of Transformer and CNN
CN116758562A (en) * 2023-08-22 2023-09-15 杭州实在智能科技有限公司 Universal text verification code identification method and system
CN116758562B (en) * 2023-08-22 2023-12-08 杭州实在智能科技有限公司 Universal text verification code identification method and system
CN116824525A (en) * 2023-08-29 2023-09-29 中国石油大学(华东) Image information extraction method based on traffic road image
CN116824525B (en) * 2023-08-29 2023-11-14 中国石油大学(华东) Image information extraction method based on traffic road image
CN117576513B (en) * 2023-11-24 2024-05-14 铜陵学院 Method, device and medium for detecting end-to-end spacecraft assembly
CN117576513A (en) * 2023-11-24 2024-02-20 铜陵学院 Method, device and medium for detecting end-to-end spacecraft assembly
CN117312931B (en) * 2023-11-30 2024-02-23 山东科技大学 Drilling machine stuck drill prediction method based on transformer
CN117312931A (en) * 2023-11-30 2023-12-29 山东科技大学 Drilling machine stuck drill prediction method based on transformer
CN117593514A (en) * 2023-12-08 2024-02-23 耕宇牧星(北京)空间科技有限公司 Image target detection method and system based on deep principal component analysis assistance
CN117593514B (en) * 2023-12-08 2024-05-24 耕宇牧星(北京)空间科技有限公司 Image target detection method and system based on deep principal component analysis assistance
CN117593666A (en) * 2024-01-19 2024-02-23 南京航空航天大学 Geomagnetic station data prediction method and system for aurora image
CN117593666B (en) * 2024-01-19 2024-05-17 南京航空航天大学 Geomagnetic station data prediction method and system for aurora image
CN117636269A (en) * 2024-01-23 2024-03-01 济南博赛网络技术有限公司 Intelligent detection method for road guardrail collision
CN117994254A (en) * 2024-04-03 2024-05-07 江苏兴力工程管理有限公司 Overhead line insulator positioning and identifying method based on conditional cross attention mechanism
CN117994254B (en) * 2024-04-03 2024-08-06 江苏兴力工程管理有限公司 Overhead line insulator positioning and identifying method based on conditional cross attention mechanism
CN118096541A (en) * 2024-04-28 2024-05-28 山东省淡水渔业研究院(山东省淡水渔业监测中心) Fishery remote sensing test image data processing method
CN118212476A (en) * 2024-05-20 2024-06-18 山东云海国创云计算装备产业创新中心有限公司 Image classification method, product and storage medium
CN118365974A (en) * 2024-06-20 2024-07-19 山东省水利科学研究院 Water quality class detection method, system and equipment based on hybrid neural network

Similar Documents

Publication Publication Date Title
CN116229295A (en) Remote sensing image target detection method based on fusion convolution attention mechanism
CN112070729B (en) Anchor-free remote sensing image target detection method and system based on scene enhancement
CN113807464B (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLO V5
CN110189304B (en) Optical remote sensing image target on-line rapid detection method based on artificial intelligence
Lu et al. A cnn-transformer hybrid model based on cswin transformer for uav image object detection
CN114332639A (en) Satellite attitude vision measurement algorithm of nonlinear residual error self-attention mechanism
CN113283409A (en) Airplane detection method in aerial image based on EfficientDet and Transformer
CN117011883A (en) Pedestrian re-recognition method based on pyramid convolution and transducer double branches
CN114842681A (en) Airport scene flight path prediction method based on multi-head attention mechanism
Shen et al. An improved UAV target detection algorithm based on ASFF-YOLOv5s
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN115690730A (en) High-speed rail contact net foreign matter detection method and system based on single classification and abnormal generation
Ning et al. Point-voxel and bird-eye-view representation aggregation network for single stage 3D object detection
CN117576591A (en) Unmanned aerial vehicle image small target detection algorithm based on sea rescue
Zhang et al. Full-scale Feature Aggregation and Grouping Feature Reconstruction Based UAV Image Target Detection
CN116229272B (en) High-precision remote sensing image detection method and system based on representative point representation
CN117710841A (en) Small target detection method and device for aerial image of unmanned aerial vehicle
CN117392568A (en) Method for unmanned aerial vehicle inspection of power transformation equipment in complex scene
CN116820131A (en) Unmanned aerial vehicle tracking method based on target perception ViT
CN114694042A (en) Disguised person target detection method based on improved Scaled-YOLOv4
Zhang et al. Sea surface ships detection method of UAV based on improved YOLOv3
CN118397476B (en) Improvement method of remote sensing image target detection model
Liang et al. A lightweight vision transformer with symmetric modules for vision tasks
Kang et al. Efficient Object Detection with Deformable Convolution for Optical Remote Sensing Imagery
Song et al. Lightweight small target detection algorithm based on YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination