CN116229295A - Remote sensing image target detection method based on fusion convolution attention mechanism - Google Patents


Info

Publication number
CN116229295A
Authority
CN
China
Prior art keywords
convolution
layer
module
attention
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310176483.XA
Other languages
Chinese (zh)
Inventor
朱虎明
王晨
王金成
缪孔苗
李秋明
薛怡煜
侯彪
焦李成
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority claimed from application CN202310176483.XA
Publication of CN116229295A
Legal status: Pending

Classifications

    • G06V 20/17 (image or video recognition: terrestrial scenes taken from planes or by drones)
    • G06N 3/08 (neural-network learning methods)
    • G06V 10/44 (local feature extraction, e.g. edges, contours, corners; connectivity analysis)
    • G06V 10/7715 (feature extraction, e.g. by transforming the feature space; subspace methods)
    • G06V 10/776 (validation; performance evaluation)
    • G06V 10/82 (recognition or understanding using neural networks)
    • G06V 2201/07 (indexing scheme: target detection)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image target detection method based on a fused convolution attention mechanism, which addresses the low detection precision on small targets and the slow convergence of existing methods. The method comprises the following steps: collecting and processing remote sensing image data; constructing a feature extraction backbone network and a Transformer encoder-decoder architecture fused with convolution; forming a target detection network model with a fused convolution attention mechanism; and training and testing the model. The invention adopts a pyramid-structured, downsampling feature extraction backbone network that outputs feature matrices of the same size for input images of different sizes; builds a convolution module with depthwise and pointwise convolutions, enhancing the extraction of local features from remote sensing images; and replaces part of the attention heads with convolution modules, reducing the large number of parameters involved in matrix operations and shortening training time. The method is applicable to fields with high real-time and accuracy requirements for remote sensing image target detection, such as aerial vehicles, remote sensing satellites, intelligent transportation and smart agriculture.

Description

Remote sensing image target detection method based on fusion convolution attention mechanism
Technical Field
The invention belongs to the technical field of remote sensing image target detection, mainly relates to target detection in optical remote sensing images, and particularly relates to a remote sensing image target detection method based on a fused convolution attention mechanism. The method is applicable to real-time detection of ground targets by aerial vehicles and the like.
Background
Remote sensing is a technology for acquiring characteristic information about a distant target in a non-contact manner. Supported by suitable technical equipment and systems, it records and analyzes the electromagnetic-wave characteristics of a detected target without contact in order to obtain target characteristic information. Over decades of development, remote sensing technology has been widely applied in many fields, such as agricultural development, geological analysis, marine monitoring, military reconnaissance and environmental protection.
Target detection has become an important research hotspot in fields such as remote sensing ground-feature identification and computer vision. Through target detection, a specific target in an image can be identified and its category and position obtained, which plays an important role in intelligent transportation, smart cities, public safety, military applications and other fields. Research on target detection for remote sensing image data is therefore of great significance in the ocean, military and agricultural domains, among others: it can reduce costs, improve efficiency and promote technological progress in the field. With the rapid development of high-resolution satellites, the number of high-resolution remote sensing images has increased sharply, making big-data-based target detection of remote sensing images an urgent requirement in the current field of high-resolution remote sensing image detection.
The development of target detection technology can be traced back to the 1990s. Early methods were mainly based on hand-crafted feature extraction and classifier training, such as SVM and AdaBoost, but such methods struggle to adapt to complex scene changes. With the spread of deep learning, target detection has made great progress. Deep learning is an effective machine learning approach with a strong ability to learn complex data representations. In deep-learning-based object detection, convolutional neural networks (CNNs) are among the most commonly used models and can learn complex feature representations of the objects in an image.
Early CNN-based detection methods such as the R-CNN series and Fast R-CNN fully exploit the learning ability of CNNs and greatly improve the accuracy of target detection. However, these methods still suffer from high computational complexity and slow inference.
In view of these problems, remote sensing target detection has turned toward single-stage methods such as YOLO, SSD and RetinaNet. These methods perform detection in a single stage, reducing computational complexity while improving detection precision and inference speed, and they have achieved excellent performance on various target detection benchmarks.
Transformer-based target detection is a new direction that has become popular in the field in recent years. Its main idea is to apply the encoder-decoder architecture originally proposed for natural language processing (NLP) tasks to target detection.
Compared with convolution, the vision Transformer breaks the limitation that traditional convolutional detection models cannot be computed in parallel; the number of operations the Transformer needs to relate two target positions does not grow with their distance; and the self-attention mechanism yields a more interpretable model: the encoder module computes an attention matrix from the feature map, and the values of this matrix contribute directly to the predicted box coordinates, so target boxes can be predicted directly.
The core of the Transformer approach is the self-attention mechanism, which lets the model attend to different regions of the input image and dynamically adjust the importance of each region. Compared with traditional CNN-based methods, Transformer-based target detection is more flexible and can handle complex scenes with multiple objects.
One of the first efforts in this area was DETR, which proposed a Transformer-based end-to-end target detection framework. DETR predicts object positions and categories with a set of object queries and processes the image with an encoder-decoder architecture to output predictions. The self-attention mechanism allows DETR to handle instances of different scales and shapes, and performing detection in a single stage makes it more efficient than traditional two-stage approaches.
Although current DETR-based end-to-end Transformer frameworks can achieve good results in remote sensing image target detection, some problems remain, such as overly long training caused by the attention mechanism's difficulty in converging, and low detection precision on small targets caused by the attention mechanism's inability to effectively capture local information.
In summary, although DETR offers a framework that simplifies target detection for remote sensing images and improves overall detection performance, low detection performance on small targets and slow model convergence remain unsolved.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a remote sensing image target detection method based on a fused convolution attention mechanism, with a stronger ability to capture local image features and a faster model convergence speed.
The invention relates to a remote sensing image target detection method based on a fusion convolution attention mechanism, which is characterized by comprising the following steps:
step 1, collecting and processing remote sensing image data: obtain public remote sensing images from a public website and divide them into a training dataset, a validation dataset and a test dataset, which together form the remote sensing image dataset; the dataset contains fifteen classes of targets: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabouts, football fields and swimming pools; generate a txt file from the coordinates and class information of all targets in each piece of original image data, and input the txt files together with the original image data into the built feature extraction backbone network;
Step 2, building the feature extraction backbone network: the backbone is formed by connecting four convolution groups in sequence; the first convolution group consists of a convolution layer, a Norm layer, an activation function layer and a max-pooling layer in sequence; the second, third and fourth convolution groups are each formed by connecting different numbers of residual connection units in sequence, each residual connection unit being a stack of a convolution layer, a GN layer and an activation function layer; after the input original image data passes through the backbone's downsampling operations, a remote sensing image feature matrix is output;
step 3, constructing the fused convolution Transformer encoder: the encoder contains a fused convolution multi-head attention module in which a convolution module and an attention module are connected in parallel; from the input end, the encoder consists of the fused convolution multi-head attention module, a residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module in sequence; the convolution module in the fused convolution multi-head attention module comprises a first convolution layer, a first activation function layer, a second convolution layer, a BN layer, a second activation function layer and a third convolution layer connected in sequence, and the attention module comprises an LN layer, a self-attention layer and a feed-forward network layer connected in sequence; the ratio of convolution modules to attention modules is 4:4, the matrix output by the convolution module has the same size as that output by the attention module, and after concatenation they form an output matrix of the same size as the input matrix of the fused convolution Transformer encoder module;
Step 4, building the hybrid attention Transformer decoder: the decoder removes redundant information from the input target query matrix through a self-attention mechanism, a cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and a forward propagation module performs feature transformation on the image features and the prediction boxes;
step 5, forming the target detection network model with a fused convolution attention mechanism: build the model, hereafter simply called the network model, from the feature extraction backbone network, the fused convolution Transformer encoder and the hybrid attention Transformer decoder connected in sequence;
step 6, training a network model: training the network model by using a training data set to obtain a trained target detection network model integrating a convolution attention mechanism;
step 7, testing the network model: detect the test dataset with the trained target detection network model with a fused convolution attention mechanism, i.e. input the test set into the trained network model to obtain the detection result for each class of target in the remote sensing image dataset, including the per-class average precision (AP) and the mean average precision (mAP) over all classes.
The invention solves the technical problems of slow training convergence and low detection accuracy on small targets in end-to-end remote sensing image target detection frameworks.
Compared with the prior art, the invention has the following advantages:
The detection precision of the model on small targets is improved: the invention designs a convolution module in the encoder consisting of pointwise convolution, depthwise convolution, activation functions and a normalization layer; the module captures local image information without changing the size of the encoder's input matrix. The encoder formed by connecting the convolution module and the attention module in parallel has a better ability to extract both the global and the local features of the image, so the target detection network model with the fused convolution attention mechanism improves detection precision on small targets while maintaining it on large targets.
The training time of the model is reduced: in the prior art, an attention-based encoder's cost of encoding an image is quadratic in the number of pixels, so the model has high computational complexity and a large number of parameters. The pointwise and depthwise convolutions in the convolution module designed by the invention have few parameters, which reduces the computational complexity of the model, accelerates convergence and shortens training time.
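The parameter-count argument above can be made concrete with a small back-of-envelope calculation; the width `d` is an illustrative assumption and biases are ignored, so the counts are approximate rather than the patent's figures:

```python
# Rough parameter counts for one self-attention block versus the
# pointwise/depthwise convolution branch that replaces part of it.
d = 256                          # token / channel width (assumed)
attn_params = 4 * d * d          # Wq, Wk, Wv and the output projection
conv_params = (d * d             # first pointwise 1x1 convolution
               + 3 * 3 * d       # 3x3 depthwise convolution, one filter per channel
               + d * d)          # second pointwise 1x1 convolution
print(attn_params, conv_params)
```

At this width the convolution branch carries roughly half the parameters of the attention block, which is the source of the faster-convergence claim.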
Drawings
FIG. 1 is a block flow diagram of an implementation of the present invention;
FIG. 2 is a block diagram of a backbone network for extracting image features in the present invention;
FIG. 3 is a block diagram of the fused convolution Transformer encoder constructed in accordance with the present invention;
FIG. 4 is a block diagram of a convolution module constructed in the encoder of the present invention;
FIG. 5 is a flow chart of the encoder-decoder of the present invention;
fig. 6 is a graph of experimental results of the present invention, in which fig. 6(a) shows remote sensing detection results for small-vehicle and large-vehicle targets in the DOTA test set, and fig. 6(b) shows remote sensing detection results for small-vehicle and roundabout targets in the DOTA test set.
Detailed Description
Example 1
In the prior art, Transformer-based target detection methods fall into two types: image feature extraction with a Transformer backbone, and Transformer-based set prediction. DETR, proposed by the Facebook team in 2020, is the first end-to-end target detection framework of the Transformer-based set-prediction type: image features are extracted by the backbone network and fed, together with position coding, into the encoder; the encoder output matrix is fed into the decoder together with the target sequence; and the decoder output goes to the prediction head, where a feed-forward neural network predicts object classes and bounding boxes. DETR predicts object positions and classes with a set of object queries, processes the image with an encoder-decoder architecture and outputs the predictions; the framework is simple and direct, predicting detection boxes straight from the image sequence and removing the non-maximum suppression step that traditional target detection requires. However, DETR's attention mechanism attends mostly to the global features of the image, so the model's detection accuracy on small targets is low; moreover, the attention mechanism needs many parameters and converges with more difficulty than conventional convolutional detection networks. The invention therefore proposes a remote sensing image target detection method based on a fused convolution attention mechanism.
The remote sensing image target detection method based on a fused convolution attention mechanism according to the invention is shown in fig. 1, a flow chart of the implementation; the method comprises the following steps:
Step 1, collecting and processing remote sensing image data: obtain public remote sensing images from a public website and divide them into a training dataset, a validation dataset and a test dataset in the ratio 4:2:3; together these form the remote sensing image dataset, whose images are called the original image data. The dataset contains fifteen classes of targets: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabouts, football fields and swimming pools. The invention generates a txt file from the coordinates and class information of all targets in each piece of original image data and inputs the txt files, together with the original image data, into the built feature extraction backbone network. The dataset is annotated with oriented (inclined) bounding boxes; the pixel sizes of the images differ, and they contain objects of different scales, orientations and shapes.
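The per-image txt generation in step 1 can be sketched as follows. The line format (eight oriented-box corner coordinates followed by a class name, DOTA-style) and the class-name spellings are assumptions for illustration, not the patent's specification:

```python
# Sketch of step 1: writing one annotation .txt file per image, with one
# line per target (eight oriented-box corner coordinates + class name).
from pathlib import Path

CLASSES = ["plane", "ship", "storage-tank", "baseball-diamond", "tennis-court",
           "basketball-court", "ground-track-field", "harbor", "bridge",
           "large-vehicle", "small-vehicle", "helicopter", "roundabout",
           "soccer-ball-field", "swimming-pool"]

def write_annotation(txt_path, targets):
    """targets: list of (x1, y1, x2, y2, x3, y3, x4, y4, class_name) tuples."""
    lines = []
    for *coords, cls in targets:
        assert cls in CLASSES, f"unknown class {cls!r}"
        lines.append(" ".join(f"{c:.1f}" for c in coords) + f" {cls}")
    Path(txt_path).write_text("\n".join(lines) + "\n")

# one small vehicle annotated with an axis-aligned quadrilateral
write_annotation("P0001.txt", [(10, 10, 60, 10, 60, 40, 10, 40, "small-vehicle")])
```

The txt files produced this way are then paired with the original images as backbone input.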
Step 2, building the feature extraction backbone network: the backbone is formed by connecting four convolution groups in sequence. The first convolution group consists of a convolution layer, a Norm layer, an activation function layer and a max-pooling layer in sequence; the second, third and fourth convolution groups each further downsample the feature map output by the previous group. After the input original image data passes through the backbone's downsampling operations, a remote sensing image feature matrix is output. The backbone design must handle multiple image scales; so that it outputs feature matrices of the same size after extracting features from images of different sizes, the invention adds partial downsampling operations to the feature extraction backbone.
Step 3, constructing the fused convolution Transformer encoder: referring to fig. 3, the encoder contains a fused convolution multi-head attention module in which a convolution module and an attention module are connected in parallel. From the input end, the encoder consists of the fused convolution multi-head attention module, a residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module in sequence. The convolution module in the fused convolution multi-head attention module comprises a first pointwise convolution layer, a first activation function layer, a depthwise convolution layer, a BN layer, a second activation function layer and a second pointwise convolution layer connected in sequence; the attention module comprises an LN layer, a self-attention layer and a feed-forward network layer connected in sequence. The ratio of convolution modules to attention modules is 4:4, the matrix output by the convolution module has the same size as that output by the attention module, and after concatenation they form an output matrix of the same size as the input matrix of the fused convolution Transformer encoder. In this embodiment the improved multi-head attention module uses eight heads; the number of heads can be chosen to trade off model training time against detection precision, and eight heads balance parameter count and precision, while the equal number of convolution and attention modules ensures the encoder is not biased toward either global or local features when extracting image features.
The convolution module keeps the feature-map size unchanged by using pointwise convolution and keeps the channel count unchanged by using depthwise convolution, ensuring that the matrix it outputs has the same size as the matrix output by the attention module, so the two can be combined directly.
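A minimal PyTorch sketch of the fused convolution multi-head attention module described above: the 4:4 split and the pointwise-depthwise-pointwise layout come from the description, while the model width, token grid, GELU activations and 3x3 depthwise kernel are assumptions. The output-size check at the end mirrors the size-preservation argument of the preceding paragraph:

```python
import torch
import torch.nn as nn

class FusedConvMultiHeadAttention(nn.Module):
    """Of eight heads, four remain self-attention and four are replaced by a
    convolution branch; each branch preserves its channel slice's size, so
    the concatenated output matches the input (layer widths are assumed)."""
    def __init__(self, d_model=256, n_heads=8, conv_heads=4, h=20, w=20):
        super().__init__()
        self.h, self.w = h, w
        self.d_conv = d_model * conv_heads // n_heads   # conv-branch channels
        self.d_attn = d_model - self.d_conv             # attention-branch channels
        self.attn = nn.MultiheadAttention(self.d_attn, n_heads - conv_heads,
                                          batch_first=True)
        c = self.d_conv
        self.conv = nn.Sequential(        # pointwise -> depthwise -> pointwise
            nn.Conv2d(c, c, 1), nn.GELU(),
            nn.Conv2d(c, c, 3, padding=1, groups=c),    # depthwise 3x3
            nn.BatchNorm2d(c), nn.GELU(),
            nn.Conv2d(c, c, 1))

    def forward(self, x):                 # x: (batch, h*w tokens, d_model)
        b = x.shape[0]
        xa, xc = x.split([self.d_attn, self.d_conv], dim=-1)
        ya = self.attn(xa, xa, xa)[0]                   # global branch
        xc = xc.transpose(1, 2).reshape(b, self.d_conv, self.h, self.w)
        yc = self.conv(xc).flatten(2).transpose(1, 2)   # local branch
        return torch.cat([ya, yc], dim=-1)  # same size as the encoder input
```

Because the 1x1 convolutions keep the spatial size and the padded depthwise convolution keeps both size and channel count, the two branch outputs can be concatenated back to the encoder input width.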
Step 4, building the hybrid attention Transformer decoder: the decoder removes redundant information from the input target query matrix through a self-attention mechanism, a cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and a forward propagation module performs feature transformation on the image features and the prediction boxes. The decoder is formed by connecting six decoder units in sequence; after the matrix passes through the forward propagation module, each decoder unit outputs a set of predictions for the class and position of each target in the image. The attention coefficient matrix output by the encoder, defined over image pixel regions, is converted into an attention coefficient matrix over the objects in the image.
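A minimal PyTorch sketch of one of the six decoder units described above, assuming a DETR-style layout; the layer widths, head count and normalization placement are illustrative assumptions rather than the patent's exact configuration:

```python
import torch
import torch.nn as nn

class HybridAttentionDecoderLayer(nn.Module):
    """One decoder unit from step 4 (sketch): self-attention de-duplicates
    the object queries, cross-attention relates them to the encoder feature
    matrix, and a feed-forward block transforms the result."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, memory):
        # queries: (batch, n_queries, d_model); memory: encoder output tokens
        q = self.n1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.n2(q + self.cross_attn(q, memory, memory)[0])   # image context
        return self.n3(q + self.ffn(q))
```

Stacking six such units, each consuming the sixth encoder unit's output as `memory`, gives the decoder described in the text.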
Step 5, forming the target detection network model with a fused convolution attention mechanism: build the model, simply called the network model, from the feature extraction backbone network, the fused convolution Transformer encoder and the hybrid attention Transformer decoder connected in sequence.
Step 6, training the target detection network model with a fused convolution attention mechanism: train the model, formed by connecting the feature extraction backbone network, the fused convolution attention encoder and the hybrid attention decoder in sequence, with the training dataset to obtain the trained model. Because the attention modules carry relatively many parameters, the invention keeps the number of images per training batch as small as possible when setting the training parameters, on the premise of preserving the model's convergence speed.
Step 7, testing the target detection network model with a fused convolution attention mechanism: detect the test dataset with the trained model, i.e. input the test set into it to obtain the detection result for each class of target in the remote sensing image dataset, including the per-class average precision (AP) and the mean average precision (mAP) over all classes.
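The per-class AP and the mAP reported in step 7 can be computed as in the following sketch; VOC-style all-point interpolation is assumed, and the per-class values shown are illustrative placeholders, not the patent's results:

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: recall/precision arrays are ordered by
    descending detection confidence; mAP is the mean of AP over classes."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP over per-class APs (illustrative values only)
ap_per_class = {"small-vehicle": 0.62, "large-vehicle": 0.71, "roundabout": 0.58}
mAP = sum(ap_per_class.values()) / len(ap_per_class)
```

With two detections at recall 0.5 and 1.0 and precision 1.0 and 0.5, for example, this yields an AP of 0.75.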
The technical idea of the invention is as follows: fusing convolution layers into the encoder increases the model's extraction of local image features and thus its detection precision on small targets; and replacing part of the parameter-heavy attention modules with lightweight convolution modules reduces model parameters and speeds up model training.
To remedy the defects of existing end-to-end remote sensing image target detection frameworks, the invention introduces convolution modules to replace part of the self-attention heads of the multi-head attention module in the Transformer encoder, yielding a remote sensing image target detection method with a fused convolution attention mechanism. The method can be applied to real-time detection of ground targets by aerial vehicles and the like.
Example 2
The remote sensing image target detection method based on the fused convolution attention mechanism is the same as in Embodiment 1. The feature extraction backbone network built in step 2 is shown in fig. 2, a structure diagram of the backbone used to extract image features; it is formed by connecting four convolution groups in sequence. In this example, the first convolution group consists of a convolution layer (kernel size 6×6, 32 kernels, stride 1), a GroupNorm layer, a ReLU activation function layer, and a max-pooling layer (window 3×3, stride 2) in sequence; the second convolution group is formed by three identical residual modules 1, each consisting of a 1×1 convolution layer (128 kernels, stride 1), a 3×3 convolution layer (128 kernels, stride 1) and a 2×2 convolution layer (128 kernels, stride 1) connected in sequence; the third convolution group is formed by four identical residual modules 2, each consisting of a 1×1 convolution layer (128 kernels, stride 1), a 3×3 convolution layer (128 kernels, stride 1) and a 1×1 convolution layer (512 kernels, stride 1) connected in sequence; the fourth convolution group is formed by nine identical residual modules 3, each consisting of a 1×1 convolution layer (128 kernels, stride 2), a 3×3 convolution layer (256 kernels, stride 1) and a 1×1 convolution layer (512 kernels, stride 1) connected in sequence. The network structure parameters given in this embodiment are a set that performs well on the remote sensing image target detection task; the backbone parameters can be adjusted for specific tasks.
Because remote sensing images have characteristics such as an overhead viewing angle, high resolution, non-uniform target scales, multi-directional target rotation and complex backgrounds, the backbone network must be designed to handle multiple image scales: input pictures of different sizes are downsampled so that the feature matrices output by the feature extraction backbone network all have the same dimensions.
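As a minimal sketch (not the invention's exact code), one such bottleneck residual module can be written in PyTorch; the class name, input/output channel counts and GroupNorm group count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualModule1(nn.Module):
    """Sketch of a residual module: three convolution layers (1x1 -> 3x3 -> 1x1),
    each followed by GroupNorm and ReLU, with an additive shortcut connection."""
    def __init__(self, in_ch=64, mid_ch=128, out_ch=256, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=1),
            nn.GroupNorm(groups, mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(groups, mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1),
            nn.GroupNorm(groups, out_ch),
        )
        # 1x1 projection so the shortcut matches the output channel count
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
y = ResidualModule1()(x)
assert tuple(y.shape) == (1, 256, 56, 56)  # spatial size preserved, channels widened
```

With stride 1 throughout, the module widens the channel dimension while preserving spatial size; stride-2 variants (as in the fourth convolution group) would halve the spatial resolution.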
Example 3
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-2. For the fusion convolution Transformer encoder constructed in step 3, referring to fig. 3 (the fusion convolution Transformer encoder constructed by the invention), the encoder is formed by sequentially connecting six encoder units. Position codes are added to the output sequence of the feature extraction backbone network to generate a position-coded feature sequence, which serves as the input of the whole encoder. Each encoder unit has the same structure and is composed of a fusion convolution multi-head attention module, a first residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module, connected in sequence. The first residual connection and layer normalization module adds the input matrix of the encoder unit to the output matrix of the multi-head attention module as a shortcut, and then normalizes the summed matrix. The forward propagation module is formed by sequentially connecting a linear layer, a ReLU activation function layer and a dropout layer. The second residual connection and layer normalization module adds the output matrix of the first residual connection and layer normalization module to the output matrix of the forward propagation module, and then normalizes the summed matrix. For the first through fifth encoder units, the output matrix of the current encoder unit serves as the input matrix of the next encoder unit; in particular, the output matrix of the sixth encoder unit serves as the input matrix of each of the six decoder units in the decoder.
The convolution module designed by the invention consists of a depthwise convolution, a pointwise convolution, activation functions and normalization layers, and guarantees that the output matrix has the same size as the input matrix while still extracting local image features. Likewise, the output matrix of the attention module has the same size as its input matrix, so the convolution module can be added in parallel to the attention module without changing the matrix size.
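A minimal PyTorch sketch of such a depthwise/pointwise convolution unit, assuming a Conformer-style ordering (pointwise convolution + GLU, depthwise convolution + BatchNorm + Swish, then a pointwise convolution back to the input width); the feature width of 256 and the class name are illustrative:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Sketch of the encoder convolution unit: pointwise conv + GLU,
    depthwise conv + BatchNorm + Swish, then a pointwise conv back to
    the input width, so the output has exactly the input's size."""
    def __init__(self, d_model=256, kernel_size=3):
        super().__init__()
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)  # doubled for GLU
        self.glu = nn.GLU(dim=1)                                   # halves channels again
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)  # depthwise
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pw2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                     # x: (batch, d_model, seq_len)
        x = self.glu(self.pw1(x))
        x = self.swish(self.bn(self.dw(x)))
        return self.pw2(x)

x = torch.randn(2, 256, 850)
out = ConvModule()(x)
assert out.shape == x.shape  # same size in and out, as required for the parallel branch
```

The `groups=d_model` argument is what makes the middle convolution depthwise: each channel is filtered independently, keeping the parameter count small.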
Example 4
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-3. Referring to fig. 4, a structure diagram of the convolution module constructed in the encoder of the invention, the fusion convolution multi-head attention module in the fusion convolution Transformer encoder is formed by four self-attention units and four convolution units in parallel. The four self-attention units have the same structure: each unit first multiplies the input matrix by three matrices with different parameters, i.e. performs three different linear transformations on the input matrix, to obtain three matrices Q, K and V of the same size but different parameters, and then computes the attention parameter matrix from Q, K and V with a softmax function, according to the formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where d_k is the dimension of the K matrix.
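The scaled dot-product attention computation of a single self-attention unit can be sketched as follows (shapes are illustrative; the learned Q/K/V projections are assumed to have already been applied):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V -- attention for one head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (seq, seq) similarity matrix
    return torch.softmax(scores, dim=-1) @ v            # weighted sum of values

seq_len, d_k = 850, 64          # illustrative sequence length and head width
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
out = scaled_dot_product_attention(q, k, v)
assert out.shape == (seq_len, d_k)   # attention preserves the input size
```

The (seq, seq) score matrix is what makes attention's cost quadratic in the sequence length, the motivation given below for replacing some heads with convolution units.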
The four convolution units have the same structure; each convolution unit is formed by sequentially connecting a pointwise convolution layer with a 1×1 kernel, 128 kernels and a stride of 1, a GLU activation function layer, a depthwise convolution layer with a 3×3 kernel, 256 kernels and a stride of 1, a BN normalization layer, a Swish activation function layer, and a pointwise convolution layer with a 1×1 kernel, 256 kernels and a stride of 1.
By extracting image features with the convolution module and the attention module in parallel, the invention obtains both the global information and the local information of the image, improving the detection precision of the trained model for small targets while preserving its precision for large targets. Moreover, because the computation of the attention mechanism scales with the square of the image feature dimension, its parameter count is very large; introducing the local convolution module reduces the parameter count of the model, accelerates its convergence and shortens training time.
Example 5
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-4. Referring to fig. 5, a flow chart of the encoder-decoder of the invention, the mixed-attention Transformer decoder of step 4 is formed by sequentially connecting six decoder units. Each decoder unit has the same structure and is formed by sequentially connecting, from input to output, a multi-head self-attention module, a first residual connection and layer normalization module, a multi-head cross-attention module, a second residual connection and layer normalization module, a forward propagation module, and a third residual connection and layer normalization module.
In this embodiment, the input target query matrix has size 100×256, the encoder output matrix has size 850×256, the mask matrix has size 25×34, and the output matrix has size 100×256. This simplifies the flow by which the target detection task generates detection frames for the image.
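A quick sanity check of these sizes, assuming the 850-token encoder output is the flattened 25×34 feature map (25 × 34 = 850) and using PyTorch's stock multi-head attention as a stand-in for the cross-attention module:

```python
import torch
import torch.nn as nn

# Assumed shapes from this embodiment: 100 object queries of width 256 and an
# encoder memory of 850 tokens, i.e. a 25 x 34 feature map flattened.
queries = torch.randn(100, 1, 256)   # (target_len, batch, d_model)
memory = torch.randn(850, 1, 256)    # (source_len, batch, d_model)

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)
out, _ = cross_attn(queries, memory, memory)

assert 25 * 34 == 850                # mask matrix size matches the memory length
assert out.shape == (100, 1, 256)    # output keeps the query matrix's size
```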
Example 6
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-5. Training the network model in step 6 of the invention means training the fusion convolution attention mechanism target detection network with the remote sensing image training data set, specifically:
6.1 Hyperparameter setting: set the initial learning rate to R, the learning rate schedule to steps mode, the weight decay parameter to a, the batch size to B, and the number of training epochs to E;
6.2 Training method: update the weights and biases of the whole network model with a stochastic gradient descent algorithm, performing one update for every B input training images; after a total of

(N / B) × E

updates, where N is the number of images in the training data set, updating stops and training is finished;
6.3 Obtaining the final trained network model: when the iterations stop, the trained target detection network model with the fusion convolution attention mechanism is obtained.
In this example, the initial learning rate is set to 0.001, the learning rate schedule to steps mode, the weight decay parameter to 0.0001, the batch size to 4, and the number of training epochs to 100. The weights and biases of the whole network model are updated with stochastic gradient descent, once for every 4 input training images, and updating stops after 40000 updates in total, yielding the final trained network model. These parameters are a set that trains well in this example; the invention allows them to be adjusted for different target detection tasks.
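A quick consistency check of the update count under these hyperparameters, assuming a training set of N = 1600 images (the value implied by 40000 updates at batch size 4 over 100 epochs):

```python
# Assumed values from this example: batch size B = 4, epochs E = 100, and
# N = 1600 training images (40000 updates * 4 images per update / 100 epochs).
N, B, E = 1600, 4, 100

updates_per_epoch = N // B           # one weight/bias update per batch
total_updates = updates_per_epoch * E
assert total_updates == 40000        # matches the stated stopping point
```

The same arithmetic with batch size 8 gives the 20000 updates quoted in the later, larger-batch example.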
A more detailed example is given below to further illustrate the invention.
Example 7
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-6. Referring to fig. 1, a flow chart of the implementation of the invention, the method comprises the following steps:
step 1, collecting and processing remote sensing image data: acquiring a public remote sensing image from a public website, dividing the image into a training data set, a verification data set and a test data set according to the proportion of 3:1:2, and forming a remote sensing image data set as a whole, wherein the image in the remote sensing image data set is called as original image data; the remote sensing image dataset contains fifteen types of targets, which are respectively: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabout, football fields, swimming pools. And generating txt files by using the coordinates and the category information of all targets of each piece of original image data in the remote sensing image data set, and inputting the txt files and the original image data into the built feature extraction backbone network.
Step 2, building the feature extraction backbone network: the constructed feature extraction backbone network is formed by sequentially connecting four convolution groups. The first convolution group is formed by sequentially connecting a convolution layer, a Norm layer, an activation function layer and a max-pooling layer; the second, third and fourth convolution groups are formed by sequentially connecting different numbers of residual connection units, each of which is formed by stacking a convolution layer, a GN layer and an activation function layer in sequence. The backbone network uses residual connections of different convolution modules and stacked feature pyramid modules, and generates the same feature matrix after downsampling input pictures of different sizes. The feature matrix output by the backbone network is dimension-reduced, added to position codes of the same dimension, and sent to the encoder.
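The patent does not specify the form of the position codes; a common choice, shown here purely as an assumption, is the fixed sinusoidal encoding of the original Transformer, added element-wise to the flattened feature sequence:

```python
import math
import torch

def sine_position_encoding(seq_len, d_model):
    """Fixed sinusoidal position codes with the same dimension as the features."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cosine
    return pe

features = torch.randn(850, 256)         # flattened backbone output (illustrative)
encoder_input = features + sine_position_encoding(850, 256)
assert encoder_input.shape == features.shape
```

Because the codes have the same dimension as the features, the addition leaves the matrix size unchanged before it enters the encoder.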
Referring to fig. 2, the feature extraction backbone network constructed by the invention is formed by sequentially connecting four convolution groups; the first convolution group consists of a convolution layer, a GroupNorm layer, an activation function layer and a maximum pooling layer sequentially; the second convolution group is formed by sequentially connecting three identical residual modules 1, and each residual module 1 is formed by sequentially connecting three different convolution layers, a normalization layer and an activation function layer; the third convolution group is formed by sequentially connecting four identical residual modules 2, and each residual module 2 is formed by sequentially connecting three different convolution layers, a normalization layer and an activation function layer; the fourth convolution group is formed by sequentially connecting nine identical residual modules 3, and each residual module 3 is formed by sequentially connecting three different convolution layers, a normalization layer and an activation function layer.
The first convolution group consists of a convolution layer with a 7×7 kernel, 64 kernels and a stride of 2, a GroupNorm layer, a ReLU activation function layer, and a max-pooling layer with a 3×3 window and a stride of 2. The second convolution group is formed by sequentially connecting three identical residual modules 1, each consisting of a convolution layer with a 1×1 kernel, 128 kernels and a stride of 1, a convolution layer with a 3×3 kernel, 128 kernels and a stride of 1, and a convolution layer with a 1×1 kernel, 256 kernels and a stride of 1, connected in sequence. The third convolution group is formed by sequentially connecting four identical residual modules 2, each consisting of a convolution layer with a 1×1 kernel, 128 kernels and a stride of 1, a convolution layer with a 3×3 kernel, 128 kernels and a stride of 1, and a convolution layer with a 1×1 kernel, 512 kernels and a stride of 1, connected in sequence. The fourth convolution group is formed by sequentially connecting nine identical residual modules 3, each consisting of a convolution layer with a 1×1 kernel, 256 kernels and a stride of 1, a convolution layer with a 3×3 kernel, 256 kernels and a stride of 1, and a convolution layer with a 1×1 kernel, 1024 kernels and a stride of 1, connected in sequence.
By stacking the residual unit modules, the feature extraction backbone network designed by the invention downsamples the input images so that inputs of different sizes yield feature matrices of the same size after passing through the backbone, solving the multi-scale detection problem for remote sensing image targets.
Step 3, constructing the fusion convolution Transformer encoder: referring to fig. 3, the constructed Transformer encoder is formed by sequentially connecting six encoder units, each of which contains a fusion convolution multi-head attention module formed by connecting a convolution module and an attention module in parallel. From the input end, each encoder unit comprises, in sequence, a fusion convolution multi-head attention module, a first residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module. The first residual connection and layer normalization module adds the input matrix of the encoder unit to the output matrix of the multi-head attention module as a shortcut, and then normalizes the summed matrix. The forward propagation module is formed by sequentially connecting a linear layer, a ReLU activation function layer and a dropout layer. The second residual connection and layer normalization module adds the output matrix of the first residual connection and layer normalization module to the output matrix of the forward propagation module, and then normalizes the summed matrix. For the first through fifth encoder modules, the output matrix of the current encoder module serves as the input matrix of the next encoder module; in particular, the output matrix of the sixth encoder module serves as the input matrix of each of the six decoder units in the decoder module.
The fusion convolution multi-head attention module in the fusion convolution Transformer encoder is formed by four self-attention units and four convolution units in parallel. The four self-attention units have the same structure; each comprises an LN layer, a self-attention layer and a feed-forward network layer connected in sequence. Three matrices Q, K and V of the same size but different parameters are obtained by multiplying the input matrix by three matrices with different parameters: Q is the query matrix obtained by a linear transformation of the image feature matrix, K is the key matrix obtained by a linear transformation of the image feature matrix, and V is the value matrix obtained by a linear transformation of the image feature matrix. The attention parameter matrix is then computed from Q, K and V with a softmax function, according to the formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where d_k is the dimension of the K matrix.
The four convolution units have the same structure; referring to fig. 4, each convolution unit comprises a pointwise convolution layer, an activation function layer, a depthwise convolution layer, a BN layer, an activation function layer and a pointwise convolution layer connected in sequence. The ratio of convolution modules to attention modules is 4:4; the matrix output by the convolution modules has the same size as the matrix output by the attention modules, and after concat cascading they form an output matrix of the same size as the input matrix of the fusion convolution Transformer encoder module.
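A simplified sketch of the 4 + 4 parallel arrangement; the branch widths, the omission of GLU/BatchNorm inside the convolution unit, and all class names are illustrative assumptions, not the invention's exact layers:

```python
import math
import torch
import torch.nn as nn

class AttnHead(nn.Module):
    """One self-attention branch projecting d_model down to d_head."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)

    def forward(self, x):                         # x: (seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v  # (seq, d_head)

class ConvUnit(nn.Module):
    """One convolution branch: pointwise -> depthwise -> pointwise."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.pw1 = nn.Conv1d(d_model, d_head, 1)
        self.dw = nn.Conv1d(d_head, d_head, 3, padding=1, groups=d_head)
        self.pw2 = nn.Conv1d(d_head, d_head, 1)
        self.act = nn.SiLU()

    def forward(self, x):                         # x: (seq, d_model)
        h = x.transpose(0, 1).unsqueeze(0)        # (1, d_model, seq)
        h = self.pw2(self.act(self.dw(self.pw1(h))))
        return h.squeeze(0).transpose(0, 1)       # (seq, d_head)

class FusedConvAttention(nn.Module):
    """4 attention heads and 4 convolution units in parallel; their outputs
    are concatenated back to the input width (8 branches x 32 = 256)."""
    def __init__(self, d_model=256, n_attn=4, n_conv=4):
        super().__init__()
        d_head = d_model // (n_attn + n_conv)
        self.branches = nn.ModuleList(
            [AttnHead(d_model, d_head) for _ in range(n_attn)]
            + [ConvUnit(d_model, d_head) for _ in range(n_conv)])

    def forward(self, x):                         # x: (seq, d_model)
        return torch.cat([b(x) for b in self.branches], dim=-1)

x = torch.randn(850, 256)
y = FusedConvAttention()(x)
assert y.shape == x.shape   # concat restores the encoder input size
```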
Step 4, building the mixed-attention Transformer decoder module: the decoder removes redundant information from the input target query matrix through a self-attention mechanism, the cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and the forward propagation module performs feature transformations for the image features and the prediction frames.
Referring to fig. 5, the mixed-attention Transformer decoder module constructed by the invention is composed of six decoder units connected in sequence; each decoder unit has the same structure and is formed by sequentially connecting a multi-head self-attention module, a first residual connection and layer normalization module, a multi-head cross-attention module, a second residual connection and layer normalization module, a forward propagation module, and a third residual connection and layer normalization module. The target query matrix is first input into the multi-head self-attention module to remove redundant information; the processed target query matrix and the matrix output by the encoder are then input into the multi-head cross-attention module for cross-attention computation, converting the attention matrix over image regions into an attention matrix over image targets; finally, each decoder unit outputs a prediction matrix for the image through the forward propagation module, which predicts the targets.
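The decoder unit described above (self-attention, cross-attention and feed-forward blocks, each followed by a residual connection and layer normalization) matches the standard Transformer decoder layer, so a sketch can reuse PyTorch's built-in module with the sizes of this embodiment; the feed-forward width is an assumption:

```python
import torch
import torch.nn as nn

# One decoder unit = self-attention + cross-attention + feed-forward,
# each with residual connection and layer normalization, which is what
# torch.nn.TransformerDecoderLayer provides; six units form the decoder.
layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(layer, num_layers=6)

queries = torch.zeros(100, 1, 256)   # target query matrix (zeros as a placeholder)
memory = torch.randn(850, 1, 256)    # output of the sixth encoder unit
out = decoder(queries, memory)
assert out.shape == (100, 1, 256)    # one 256-wide prediction row per object query
```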
Step 5, forming the target detection network model with the fusion convolution attention mechanism: establish the target detection network model of the fusion convolution attention mechanism (network model for short), formed by connecting the feature extraction backbone network, the fusion convolution Transformer encoder module and the mixed-attention Transformer decoder module in sequence.
Step 6, training the target detection network model with the fusion convolution attention mechanism: train the fusion convolution attention mechanism target detection network model, formed by sequentially connecting the feature extraction backbone network, the fusion convolution attention mechanism encoder and the mixed-attention mechanism decoder, with the training data set to obtain a trained model. Specifically:
6.1 Hyperparameter setting: set the initial learning rate to R, the learning rate schedule to steps mode, the weight decay parameter to a, the batch size to B, and the number of training epochs to E;
6.2 Training method: update the weights and biases of the whole network model with a stochastic gradient descent algorithm, performing one update for every B input training images; after a total of

(N / B) × E

updates, where N is the number of images in the training data set, updating stops and training is finished;
6.3 Obtaining the final trained network model: when the iterations stop, the trained target detection network model with the fusion convolution attention mechanism is obtained.
In this example, the initial learning rate is set to 0.0025, the learning rate schedule to steps mode, the weight decay parameter to 0.0001, the batch size to 8, and the number of training epochs to 100. The weights and biases of the whole network model are updated with stochastic gradient descent, once for every 8 input training images, and updating stops after 20000 updates in total, yielding the final trained network model.
The invention accelerates the convergence speed of the model and reduces the time consumption of model training.
Step 7, testing the target detection network model with the fusion convolution attention mechanism: detect the test data set with the trained model, i.e. input the test set into the trained target detection network model to obtain the detection results for each class of target in the remote sensing image data set, including the average precision (AP) of each class and the mean average precision (mAP) over all classes.
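mAP is simply the mean of the per-class average precisions; a sketch with hypothetical AP values (the class names and numbers are illustrative, not results from the document):

```python
def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class average precision (AP) values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class AP values, for illustration only
aps = {"plane": 0.90, "ship": 0.80, "small-vehicle": 0.70}
assert abs(mean_average_precision(aps) - 0.80) < 1e-9
```

Each per-class AP itself summarizes a precision-recall curve at a chosen IoU threshold; only the final averaging step is shown here.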
The invention adopts a pyramid-structured feature extraction backbone network that includes downsampling operations and outputs feature matrices of the same size for input images of different sizes, solving the difficulty of multi-scale detection of remote sensing image targets. A convolution module comprising depthwise and pointwise convolutions is built, enhancing the model's ability to extract local features of the remote sensing image. Part of the attention heads in the encoder are replaced with the built convolution modules, reducing the large parameter count of an encoder composed entirely of attention mechanisms, improving the convergence speed of the model and reducing training time. The method is used in fields with high requirements on the real-time performance and accuracy of remote sensing image target detection, such as aviation aircraft, remote sensing satellites, intelligent transportation and intelligent agriculture.
The technical effects of the invention are further explained below through experiments and their result data.
Example 8
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-7.
experimental conditions: all experiments are carried out under the same platform, the CPU of the hardware configuration of the platform is Intel8358P, the GPU is NVIDIA GeForce RTX 3090, and the video memory is 24G. The operating system used in the experiment is Ubuntu 18.04LTS, the deep learning framework used is Pytorch 1.7.1, the GPU computing platform is CUDA 11.0, and the GPU acceleration library is cuDNN 8.0.5.
Experimental contents: target detection is performed on the public remote sensing data set DOTA with the remote sensing image target detection method based on the fusion convolution attention mechanism. A trained model is obtained with the above training method, and the target class and position accuracy are tested on the 937 pictures of the DOTA test set. Two detection result pictures were randomly extracted and are shown in fig. 6, where fig. 6(a) is a detection result picture containing small-vehicle and large-vehicle targets detected by the method.
experimental results and analysis: referring to fig. 6 (a), as can be seen from fig. 6 (a), the targets of the small and large vehicles in all the detected remote sensing images are detected, the confidence of most detection frames is high, the targets in the rotation direction and the targets in the rotation directions can be accurately detected by using the rotation frames, the condition of target missing detection is avoided, and the invention has good detection performance on the large targets and the small targets in the remote sensing data set.
Example 9
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as in Embodiments 1-7; the experimental conditions and contents are the same as in Embodiment 8.
Experimental results and analysis: fig. 6(b) is a picture of the detection result for targets including small vehicles and a roundabout detected by the invention. As fig. 6(b) shows for the target class and position accuracy, all small-vehicle and roundabout targets in the detected remote sensing image are found, and the confidence of most detection frames is high. The invention accurately detects targets with large scale differences, i.e. the small vehicles and the roundabout, with no missed detections, verifying the ability of the designed feature extraction backbone network to extract multi-scale image features through downsampling operations.
Example 10
The remote sensing image target detection method based on the fusion convolution attention mechanism is the same as that of the embodiments 1-7, and the experimental conditions are the same as that of the embodiment 8.
Experimental contents: a comparison test on the DOTA data set is carried out with the remote sensing image target detection method based on the fusion convolution attention mechanism, the attention-based remote sensing image target detection method DETR, and the convolution-based remote sensing image target detection method ReDet; the comparison results are shown in Table 1.
TABLE 1 results of comparative experiments on DOTA remote sensing dataset
Experimental results and analysis: as Table 1 shows, compared with the convolution-based remote sensing image target detection method, the invention improves detection accuracy on most targets, including targets with large scale differences; compared with the attention-based remote sensing image target detection method, the invention achieves comparable accuracy on large targets and better accuracy on small targets.
In summary, the invention provides a remote sensing image target detection method based on a fusion convolution attention mechanism, which solves the technical problems of low detection precision for small targets in remote sensing images and slow convergence during model training in existing end-to-end remote sensing image target detection technology. The implementation comprises: collecting and processing remote sensing image data; constructing a feature extraction backbone network; constructing a fusion convolution Transformer encoder; building a mixed-attention Transformer decoder; forming a target detection network model with a fusion convolution attention mechanism; training the target detection network model; and testing the target detection network model. The invention adopts a pyramid-structured downsampling feature extraction backbone network that outputs feature matrices of the same size for input images of different sizes, solving the difficulty of multi-scale detection of remote sensing image targets. A convolution module consisting of depthwise convolution, pointwise convolution, an activation function and a normalization layer is built, enhancing the model's ability to extract local features of the remote sensing image. Part of the attention heads in the multi-head attention module are replaced with the built convolution modules, reducing the large parameter count of the attention mechanism's dot-product operations, which scale with the square of the number of image pixels, thereby improving the convergence speed of the model and reducing training time. The method is used in fields with high requirements on the real-time performance and accuracy of remote sensing image target detection, such as aviation aircraft, remote sensing satellites, intelligent transportation and intelligent agriculture.

Claims (6)

1. The remote sensing image target detection method based on the fusion convolution attention mechanism is characterized by comprising the following steps of:
step 1, collecting and processing remote sensing image data: acquiring public remote sensing images from a public website and dividing them into a training data set, a validation data set and a test data set, which together form a remote sensing image data set; the remote sensing image data set contains fifteen classes of targets: aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, playgrounds, ports, bridges, large vehicles, small vehicles, helicopters, roundabouts, football fields, swimming pools; generating txt files from the coordinates and class information of all targets of the original image data in the remote sensing image data set, and inputting the txt files and the original image data into the constructed feature extraction backbone network;
step 2, building a feature extraction backbone network: the built feature extraction backbone network is formed by sequentially connecting four convolution groups, wherein the first convolution group sequentially comprises a convolution layer, a Norm layer, an activation function layer and a maximum pooling layer; the second, third and fourth convolution groups are formed by sequentially connecting residual error connecting units with different numbers, and each residual error connecting unit is formed by sequentially stacking a convolution layer, a GN layer and an activation function layer; the method comprises the steps that input original image data is subjected to downsampling operation of a built feature extraction backbone network, and then a remote sensing image feature matrix is output;
step 3, constructing a fused convolution Transformer encoder: the built Transformer encoder comprises a fused convolution multi-head attention module formed by connecting a convolution module and an attention module in parallel; from the input end, the encoder sequentially comprises the fused convolution multi-head attention module, a residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module; the convolution module in the fused convolution multi-head attention module comprises a first convolution layer, a first activation function layer, a second convolution layer, a BN layer, a second activation function layer and a third convolution layer which are sequentially connected, and the attention module comprises an LN layer, a self-attention layer and a feedforward network layer which are sequentially connected; the ratio of convolution modules to attention modules is 4:4, the matrix output by the convolution modules has the same size as the matrix output by the attention modules, and after concat concatenation they form an output matrix with the same size as the input matrix of the fused convolution Transformer encoder;
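A minimal sketch of the fused convolution multi-head attention module described above, for illustration only: the input channels are split evenly between a convolution branch and a 4-head self-attention branch (one reading of the 4:4 ratio), and the two outputs are concatenated back to the input size. The class name `FusedConvAttention` and all dimensions are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FusedConvAttention(nn.Module):
    """Illustrative fused-convolution multi-head attention module (step 3):
    a convolution branch and a 4-head self-attention branch run in parallel
    on half of the channels each; their outputs are concatenated so the
    output matrix matches the input matrix in size."""
    def __init__(self, dim: int = 64, seq_hw: int = 8):
        super().__init__()
        half = dim // 2
        self.seq_hw = seq_hw
        # Convolution branch: conv -> act -> conv -> BN -> act -> conv.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(half, half, 1), nn.GELU(),
            nn.Conv2d(half, half, 3, padding=1), nn.BatchNorm2d(half),
            nn.GELU(), nn.Conv2d(half, half, 1),
        )
        # Attention branch: LN -> 4-head self-attention -> feedforward layer.
        self.norm = nn.LayerNorm(half)
        self.attn = nn.MultiheadAttention(half, num_heads=4, batch_first=True)
        self.ffn = nn.Linear(half, half)

    def forward(self, x):                      # x: (batch, tokens, channels)
        b, n, c = x.shape
        xc, xa = x.chunk(2, dim=-1)            # split channels between branches
        h = w = self.seq_hw                    # tokens fold back to an h x w map
        conv_out = self.conv_branch(
            xc.transpose(1, 2).reshape(b, c // 2, h, w)
        ).reshape(b, c // 2, n).transpose(1, 2)
        xa = self.norm(xa)
        attn_out, _ = self.attn(xa, xa, xa)
        attn_out = self.ffn(attn_out)
        return torch.cat([conv_out, attn_out], dim=-1)  # same size as input

m = FusedConvAttention(dim=64, seq_hw=8)
y = m(torch.randn(2, 64, 64))  # 64 tokens (8x8), 64 channels
```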
step 4, building a Transformer decoder with a hybrid attention mechanism: the decoder processes the redundant information in the input target query matrix through a self-attention mechanism, the cross-attention mechanism models the relation between the feature matrix output by the encoder and the target query matrix, and the forward propagation module performs feature transformation on the image features and the prediction boxes;
step 5, forming a target detection network model with a fused convolution attention mechanism: establishing the target detection network model formed by sequentially connecting the feature extraction backbone network, the fused convolution Transformer encoder and the hybrid attention mechanism Transformer decoder; this target detection network model is called the network model for short;
step 6, training the target detection network model with the fused convolution attention mechanism: training the model, formed by sequentially connecting the feature extraction backbone network, the fused convolution attention mechanism encoder and the hybrid attention mechanism decoder, with the training data set to obtain a trained target detection network model with the fused convolution attention mechanism;
step 7, testing the target detection network model with the fused convolution attention mechanism: detecting the test data set with the trained model, namely inputting the test set into the trained target detection network model with the fused convolution attention mechanism to obtain the detection result for each type of target in the remote sensing image data set, the detection result comprising the average precision AP of each class and the mean average precision mAP over all classes of targets.
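For illustration only: mAP in step 7 is the arithmetic mean of the per-class average precisions. The sketch below assumes the per-class AP values have already been computed from each class's precision-recall curve; the AP numbers shown are placeholders, not results from the patent.

```python
def mean_average_precision(per_class_ap: dict) -> float:
    """mAP = arithmetic mean of the per-class average precisions (AP).
    In practice each AP is derived from that class's precision-recall
    curve over the detections on the test set."""
    return sum(per_class_ap.values()) / len(per_class_ap)

# Placeholder per-class APs for three of the fifteen target classes.
ap = {"aircraft": 0.92, "ship": 0.88, "storage tank": 0.79}
map_value = mean_average_precision(ap)
```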
2. The remote sensing image target detection method based on the fusion convolution attention mechanism according to claim 1, wherein the feature extraction backbone network constructed in step 2 is formed by sequentially connecting four convolution groups; the first convolution group consists of a convolution layer, a GroupNorm layer, an activation function layer and a maximum pooling layer connected sequentially; the second convolution group is formed by sequentially connecting three identical residual modules 1, each residual module 1 being formed by sequentially connecting three different convolution layers; the third convolution group is formed by sequentially connecting four identical residual modules 2, each residual module 2 being formed by sequentially connecting three different convolution layers; the fourth convolution group is formed by sequentially connecting nine identical residual modules 3, each residual module 3 being formed by sequentially connecting three different convolution layers.
3. The remote sensing image target detection method based on a fusion convolution attention mechanism according to claim 1, wherein the fused convolution Transformer encoder constructed in step 3 is formed by sequentially connecting six identical encoder units; each encoder unit has an identical structure and is formed by sequentially connecting a fused convolution multi-head attention module, a first residual connection and layer normalization module, a forward propagation module, and a second residual connection and layer normalization module; position coding is added to the output sequence of the feature extraction backbone network to generate a position-coded feature sequence, which serves as the input of the whole encoder and is fed to the first encoder unit; within an encoder unit, the first residual connection and layer normalization module adds the input matrix of the encoder unit to the output matrix of the multi-head attention module through a shortcut connection, and then normalizes the summed matrix; the forward propagation module is formed by sequentially connecting a linear layer, a relu activation function layer and a dropout layer; the second residual connection and layer normalization module adds the output matrix of the first residual connection and layer normalization module to the output matrix of the forward propagation module through a shortcut connection, and then normalizes the summed matrix; for the first through fifth encoder units, the output matrix of the current encoder unit serves as the input matrix of the next encoder unit; the output matrix of the sixth encoder unit serves as the input matrix of each of the six decoder units in the decoder.
4. The remote sensing image target detection method based on the fusion convolution attention mechanism according to claim 1 or 3, wherein the fused convolution multi-head attention module in the fused convolution Transformer encoder is formed by four self-attention units connected in parallel with four convolution units; the four self-attention units have the same structure: each self-attention unit first multiplies the input matrix by three matrices with different parameters, performing three different linear transformations on the input matrix to obtain three matrices Q, K and V of the same size but with different parameters, and then computes the attention parameter matrix from the three matrices Q, K and V through a softmax function, according to the following formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V, where d_k is the dimension of the matrix K
The four convolution units have the same structure, each convolution unit being formed by sequentially connecting a first convolution layer, a first activation function layer, a second convolution layer, a Norm layer, a second activation function layer and a third convolution layer.
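The scaled dot-product attention formula of claim 4 can be written out numerically as follows; this is the standard computation, shown here with illustrative matrix sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is
    the dimension of K, as in claim 4."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise similarity scores
    return softmax(scores) @ V        # attention-weighted values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
out = attention(Q, K, V)  # same shape as V
```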
5. The remote sensing image target detection method based on the fusion convolution attention mechanism according to claim 1, wherein the Transformer decoder with the hybrid attention mechanism in step 4 is formed by sequentially connecting six decoder units; each decoder unit has the same structure and is formed by sequentially connecting a multi-head self-attention module, a first residual connection and layer normalization module, a multi-head cross-attention module, a second residual connection and layer normalization module, a forward propagation module, and a third residual connection and layer normalization module.
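For illustration only: the decoder-unit structure of claim 5 (self-attention, cross-attention and a forward propagation module, each followed by residual connection and layer normalization) matches the layout of PyTorch's stock Transformer decoder layer, which can stand in as a sketch. The dimensions and the 100-query size below are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

# One decoder unit: multi-head self-attention -> add & norm ->
# multi-head cross-attention -> add & norm -> feed-forward -> add & norm.
layer = nn.TransformerDecoderLayer(d_model=64, nhead=8, batch_first=True)

# Six identical decoder units connected sequentially, as in claim 5.
decoder = nn.TransformerDecoder(layer, num_layers=6)

memory = torch.randn(2, 49, 64)    # feature sequence from the encoder
queries = torch.zeros(2, 100, 64)  # target query matrix (size illustrative)
out = decoder(queries, memory)     # cross-attention relates queries to memory
```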
6. The remote sensing image target detection method based on the fused convolution attention mechanism according to claim 1, wherein training the network model in step 6 means training the target detection network with the fused convolution attention mechanism using the remote sensing image training data set, specifically:
6.1 hyperparameter setting: set the initial learning rate to R, the learning rate schedule to the step decay mode, the weight decay parameter to a, the batch size to B, and the number of training epochs to E;
6.2 training method: update the weight and bias of the whole network model with a stochastic gradient descent algorithm, performing one weight and bias update for every B input training images; after a total of ⌊N/B⌋ × E iterations, where N is the number of images in the training data set and E is the number of training epochs, updating stops and training is finished;
6.3 obtaining the final trained network model: when iteration stops, the trained target detection network model with the fused convolution attention mechanism is obtained.
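The iteration budget of the training method in 6.2 works out as a simple count; the sketch below uses illustrative values for N, B and E (the patent leaves them symbolic).

```python
import math

def total_iterations(num_train_images: int, batch_size: int, epochs: int) -> int:
    """Number of weight/bias updates in step 6: one update per batch of B
    images, repeated over E epochs, i.e. floor(N / B) * E."""
    return math.floor(num_train_images / batch_size) * epochs

# Illustrative values only, not taken from the patent:
iters = total_iterations(num_train_images=10000, batch_size=8, epochs=50)
```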
CN202310176483.XA 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism Pending CN116229295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310176483.XA CN116229295A (en) 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310176483.XA CN116229295A (en) 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism

Publications (1)

Publication Number Publication Date
CN116229295A 2023-06-06

Family

ID=86590726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310176483.XA Pending CN116229295A (en) 2023-02-28 2023-02-28 Remote sensing image target detection method based on fusion convolution attention mechanism

Country Status (1)

Country Link
CN (1) CN116229295A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824407A (en) * 2023-06-21 2023-09-29 深圳市华赛睿飞智能科技有限公司 Target detection method, device and equipment based on patrol robot
CN116883729A (en) * 2023-06-27 2023-10-13 西北大学 Ceramic cultural relic fragment microscopic image classification method based on combination of Transformer and CNN
CN116758562A (en) * 2023-08-22 2023-09-15 杭州实在智能科技有限公司 Universal text verification code identification method and system
CN116758562B (en) * 2023-08-22 2023-12-08 杭州实在智能科技有限公司 Universal text verification code identification method and system
CN116824525A (en) * 2023-08-29 2023-09-29 中国石油大学(华东) Image information extraction method based on traffic road image
CN116824525B (en) * 2023-08-29 2023-11-14 中国石油大学(华东) Image information extraction method based on traffic road image
CN117576513B (en) * 2023-11-24 2024-05-14 铜陵学院 Method, device and medium for detecting end-to-end spacecraft assembly
CN117576513A (en) * 2023-11-24 2024-02-20 铜陵学院 Method, device and medium for detecting end-to-end spacecraft assembly
CN117312931B (en) * 2023-11-30 2024-02-23 山东科技大学 Drilling machine stuck drill prediction method based on transformer
CN117312931A (en) * 2023-11-30 2023-12-29 山东科技大学 Drilling machine stuck drill prediction method based on transformer
CN117593514A (en) * 2023-12-08 2024-02-23 耕宇牧星(北京)空间科技有限公司 Image target detection method and system based on deep principal component analysis assistance
CN117593514B (en) * 2023-12-08 2024-05-24 耕宇牧星(北京)空间科技有限公司 Image target detection method and system based on deep principal component analysis assistance
CN117593666A (en) * 2024-01-19 2024-02-23 南京航空航天大学 Geomagnetic station data prediction method and system for aurora image
CN117593666B (en) * 2024-01-19 2024-05-17 南京航空航天大学 Geomagnetic station data prediction method and system for aurora image
CN117636269A (en) * 2024-01-23 2024-03-01 济南博赛网络技术有限公司 Intelligent detection method for road guardrail collision
CN117994254A (en) * 2024-04-03 2024-05-07 江苏兴力工程管理有限公司 Overhead line insulator positioning and identifying method based on conditional cross attention mechanism
CN117994254B (en) * 2024-04-03 2024-08-06 江苏兴力工程管理有限公司 Overhead line insulator positioning and identifying method based on conditional cross attention mechanism
CN118096541A (en) * 2024-04-28 2024-05-28 山东省淡水渔业研究院(山东省淡水渔业监测中心) Fishery remote sensing test image data processing method
CN118212476A (en) * 2024-05-20 2024-06-18 山东云海国创云计算装备产业创新中心有限公司 Image classification method, product and storage medium
CN118365974A (en) * 2024-06-20 2024-07-19 山东省水利科学研究院 Water quality class detection method, system and equipment based on hybrid neural network

Similar Documents

Publication Publication Date Title
CN116229295A (en) Remote sensing image target detection method based on fusion convolution attention mechanism
CN112070729B (en) Anchor-free remote sensing image target detection method and system based on scene enhancement
CN113807464B (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLO V5
CN110189304B (en) Optical remote sensing image target on-line rapid detection method based on artificial intelligence
Lu et al. A cnn-transformer hybrid model based on cswin transformer for uav image object detection
CN114332639A (en) Satellite attitude vision measurement algorithm of nonlinear residual error self-attention mechanism
CN113283409A (en) Airplane detection method in aerial image based on EfficientDet and Transformer
CN117011883A (en) Pedestrian re-recognition method based on pyramid convolution and transducer double branches
CN114842681A (en) Airport scene flight path prediction method based on multi-head attention mechanism
Shen et al. An improved UAV target detection algorithm based on ASFF-YOLOv5s
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN115690730A (en) High-speed rail contact net foreign matter detection method and system based on single classification and abnormal generation
Ning et al. Point-voxel and bird-eye-view representation aggregation network for single stage 3D object detection
CN117576591A (en) Unmanned aerial vehicle image small target detection algorithm based on sea rescue
Zhang et al. Full-scale Feature Aggregation and Grouping Feature Reconstruction Based UAV Image Target Detection
CN116229272B (en) High-precision remote sensing image detection method and system based on representative point representation
CN117710841A (en) Small target detection method and device for aerial image of unmanned aerial vehicle
CN117392568A (en) Method for unmanned aerial vehicle inspection of power transformation equipment in complex scene
CN116820131A (en) Unmanned aerial vehicle tracking method based on target perception ViT
CN114694042A (en) Disguised person target detection method based on improved Scaled-YOLOv4
Zhang et al. Sea surface ships detection method of UAV based on improved YOLOv3
CN118397476B (en) Improvement method of remote sensing image target detection model
Liang et al. A lightweight vision transformer with symmetric modules for vision tasks
Kang et al. Efficient Object Detection with Deformable Convolution for Optical Remote Sensing Imagery
Song et al. Lightweight small target detection algorithm based on YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination