CN111860398A - Remote sensing image target detection method and system and terminal equipment - Google Patents

Remote sensing image target detection method and system and terminal equipment

Info

Publication number: CN111860398A
Application number: CN202010737230.1A
Authority: CN (China)
Prior art keywords: feature map, attention, residual block, scale, context
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN111860398B
Inventors: 刘京, 田亮, 郭蔚, 杨烁今, 陈栋, 周丙寅
Current Assignee: Hebei Normal University
Original Assignee: Hebei Normal University
Application filed by Hebei Normal University; priority to CN202010737230.1A
Publication of CN111860398A; application granted; publication of CN111860398B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464 Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The invention belongs to the technical field of image processing and discloses a method, a system and a terminal device for detecting targets in remote sensing images. The method comprises the following steps: acquiring a remote sensing image to be detected; inputting the remote sensing image to be detected into a trained parallel perceptual attention network model to obtain a plurality of output feature maps of different scales; and performing target detection according to the plurality of output feature maps of different scales to obtain a detection result. By performing feature extraction through the parallel perceptual attention network model, the invention extracts not only the multi-scale, context and global features of the target, but also the correlation features among non-local targets and direction-sensitive target features.

Description

Remote sensing image target detection method and system and terminal equipment
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method, a system and a terminal device for detecting a remote sensing image target.
Background
Target detection is an important research topic in the field of image processing; it has high practical application value and has attracted wide attention from researchers at home and abroad. With the development of deep learning, applying deep learning to target detection in remote sensing images is becoming increasingly common.
Currently, target detection models based on deep learning fall into two main categories. The first category, represented by RCNN and Fast-RCNN, is based on region proposals: these models predict the bounding boxes and classes of targets in two coarse-to-fine steps, achieving high accuracy but low detection speed. The second category is regression-based, represented by YOLO and SSD (Single Shot Detector): these models directly predict the bounding box and class of a target without the coarse-to-fine process, achieving high detection speed but only moderate accuracy. The prior art therefore cannot provide both high target detection speed and high target detection accuracy.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, and a terminal device for detecting targets in remote sensing images, so as to solve the problem that the prior art cannot achieve both high target detection speed and high target detection accuracy.
A first aspect of the embodiments of the invention provides a remote sensing image target detection method, comprising the following steps:
acquiring a remote sensing image to be detected;
inputting the remote sensing image to be detected into a trained parallel perceptual attention network model to obtain a plurality of output feature maps of different scales;
and performing target detection according to the plurality of output feature maps of different scales to obtain a detection result.
A second aspect of an embodiment of the present invention provides a remote sensing image target detection system, including:
the acquisition module is used for acquiring a remote sensing image to be detected;
the feature extraction module is used for inputting the remote sensing image to be detected into the trained parallel perception attention network model to obtain a plurality of output feature maps with different scales;
and the target detection module is used for performing target detection according to the plurality of output feature maps of different scales to obtain a detection result.
A third aspect of the embodiments of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for detecting a target in a remote sensing image according to the first aspect when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by one or more processors, implements the steps of the method for object detection in remote sensing images according to the first aspect.
Compared with the prior art, the embodiments of the invention have the following beneficial effects: according to the embodiments of the invention, the remote sensing image to be detected is first acquired, then input into the trained parallel perceptual attention network model to obtain a plurality of output feature maps of different scales, and finally target detection is performed according to the plurality of output feature maps of different scales to obtain the detection result. Feature extraction through the parallel perceptual attention network model captures not only the multi-scale, context and global features of the target but also the correlation features among non-local targets and direction-sensitive target features.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a method for detecting a target in a remote sensing image according to an embodiment of the present invention;
FIG. 2 is a diagram of a parallel perceptual attention network model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a first multi-scale attention submodule provided in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a first context attention sub-module provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a first channel attention sub-module provided in accordance with an embodiment of the present invention;
FIG. 6 is a schematic heat map of a first scale feature map provided by an embodiment of the present invention;
FIG. 7 is a schematic heat map of a first context feature map provided by an embodiment of the present invention;
FIG. 8 is a schematic heat map of a first channel feature map provided by an embodiment of the present invention;
FIG. 9 is a schematic flow chart illustrating an implementation of a method for detecting a target in a remote sensing image according to another embodiment of the present invention;
FIG. 10 is a schematic diagram of experimental test results provided by an embodiment of the present invention;
FIG. 11 is a schematic block diagram of a remote sensing image target detection system provided by an embodiment of the invention;
fig. 12 is a schematic block diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Fig. 1 is a schematic flow chart of an implementation of a method for detecting targets in remote sensing images according to an embodiment of the present invention; for convenience of description, only the parts related to the embodiment are shown. The method of the embodiment may be executed by a terminal device. As shown in fig. 1, the method may include the following steps:
s101: and acquiring a remote sensing image to be detected.
In the embodiment of the invention, the remote sensing image to be detected may be acquired by any existing method.
S102: inputting the remote sensing image to be detected into the trained parallel perceptual attention network model to obtain a plurality of output feature maps of different scales.
In the embodiment of the invention, a parallel perception attention network model is firstly constructed, and then the constructed parallel perception attention network model is trained through a training set to obtain the trained parallel perception attention network model.
In one embodiment of the invention, the training of the parallel perceptual attention network model uses a class loss function and a regression loss function, wherein the regression loss function is a distance intersection over union (DIoU) loss function.
Specifically, the class loss function is:
[class loss equation given as an image in the original; not reproduced]
The distance intersection over union (DIoU) loss function is:
$L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2}$
in the distance cross-correlation loss function, bgtRespectively representing the center points of the anchor bounding box and the label bounding box, p representing the Euclidean distance for calculating the two center points, and c representing the diagonal distance of the minimum rectangle which can simultaneously cover the anchor bounding box and the label bounding box. The normalized distance between the anchor bounding box and the tag bounding box is thus modeled in DIoU. The loss function is beneficial to improving the detection accuracy of the small target while accelerating the convergence.
In the embodiment of the invention, the DIoU loss replaces the traditional regression loss, which speeds up training and enhances the detection accuracy of small targets.
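As an illustration, the following is a minimal NumPy sketch of the DIoU loss described above; the function name and the (x1, y1, x2, y2) box format are assumptions, not taken from the patent.

```python
import numpy as np

def diou_loss(box_a, box_g):
    """DIoU loss for one anchor box and one label box, each (x1, y1, x2, y2)."""
    # Intersection area of the two boxes
    ix1, iy1 = max(box_a[0], box_g[0]), max(box_a[1], box_g[1])
    ix2, iy2 = min(box_a[2], box_g[2]), min(box_a[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_a + area_g - inter + 1e-9)

    # Squared Euclidean distance rho^2 between the two box centers
    ca = np.array([(box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2])
    cg = np.array([(box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2])
    rho2 = float(np.sum((ca - cg) ** 2))

    # Squared diagonal c^2 of the smallest rectangle enclosing both boxes
    ex1, ey1 = min(box_a[0], box_g[0]), min(box_a[1], box_g[1])
    ex2, ey2 = max(box_a[2], box_g[2]), max(box_a[3], box_g[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9

    return 1.0 - iou + rho2 / c2
```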
In one embodiment of the present invention, referring to fig. 2, the parallel perceptual attention network model takes a residual network as its backbone;
the parallel perceptual attention network model comprises a first residual block B1, a second residual block B2, a third residual block B3 and a fourth residual block B4, together with a first parallel perceptual attention module, a second parallel perceptual attention module, a third parallel perceptual attention module and a fourth parallel perceptual attention module; the first residual block B1, the second residual block B2, the third residual block B3 and the fourth residual block B4 all differ in size.
The first parallel perceptual attention module takes the first residual block B1 and the second residual block B2 as input and outputs a first fused feature map IB1; the second parallel perceptual attention module takes the second residual block B2 and the third residual block B3 as input and outputs a second fused feature map IB2; the third parallel perceptual attention module takes the third residual block B3 and the fourth residual block B4 as input and outputs a third fused feature map IB3; the fourth parallel perceptual attention module takes the fourth residual block B4 as input and outputs a fourth fused feature map IB4.
The fourth fused feature map IB4 passes through a deformable convolution to give the output feature map O4 of the fourth scale; the third fused feature map IB3, after deformable convolution, is added to the 2x-upsampled O4 to give the output feature map O3 of the third scale; the second fused feature map IB2, after deformable convolution, is added to the 2x-upsampled O3 to give the output feature map O2 of the second scale; and the first fused feature map IB1, after deformable convolution, is added to the 2x-upsampled O2 to give the output feature map O1 of the first scale.
The backbone of the parallel perceptual attention network model adopts the residual network ResNet-101.
A parallel perceptual attention module is introduced behind each residual block, each fused feature map is obtained through a fusion operation, position-sensitive features are then accurately extracted using deformable convolution, and four output feature maps of different scales are obtained with a multi-scale fusion strategy.
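A sketch of this top-down fusion in PyTorch-style Python is shown below; the per-level deformable convolutions are passed in as ready-made modules (for instance built around torchvision.ops.DeformConv2d plus an offset branch), and all names are illustrative assumptions.

```python
import torch.nn.functional as F

def build_output_maps(ib1, ib2, ib3, ib4, deform_conv):
    """Top-down fusion of the four fused feature maps IB1..IB4.

    Each fused map first passes through its deformable convolution; every
    finer level then adds the 2x-upsampled output of the coarser level.
    `deform_conv` maps a level name to its deformable-conv module."""
    o4 = deform_conv["l4"](ib4)
    o3 = deform_conv["l3"](ib3) + F.interpolate(o4, scale_factor=2, mode="nearest")
    o2 = deform_conv["l2"](ib2) + F.interpolate(o3, scale_factor=2, mode="nearest")
    o1 = deform_conv["l1"](ib1) + F.interpolate(o2, scale_factor=2, mode="nearest")
    return o1, o2, o3, o4
```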
Because target orientations in remote sensing images vary widely, conventional convolution easily picks up irrelevant information. To reduce the influence of irrelevant features on direction-sensitive targets, the embodiment of the invention applies a deformable convolution after obtaining the fused feature maps at each scale. This operation corrects each sampling position by predicting a pair of offsets in the x and y directions for it, thereby replacing the traditional regular sampling grid; it can sample objects of arbitrary shape and strengthens feature extraction for direction-sensitive targets.
Specifically, the deformable convolution takes a feature map of size H × W × C as input, where H is the height, W the width and C the number of channels of the feature map. A convolution produces an H × W × 2C map whose channel count is twice the original and which represents the offset of each pixel in the X and Y directions. The final feature map is obtained by adding each pixel's index in the input image to the offset obtained through convolution; the shifted positions must be kept inside the picture. Because the offsets are usually fractional, they cannot be used directly as coordinates, and forced rounding would introduce a large error; to avoid this, bilinear interpolation is normally used to obtain the final feature map.
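The bilinear sampling step can be sketched in a few lines of NumPy; the function below is illustrative (single channel, one sampling location) and clamps shifted positions inside the image, as required above.

```python
import numpy as np

def sample_bilinear(feat, y, x):
    """Bilinearly sample a single-channel feature map `feat` (H x W) at a
    fractional location (y, x), clamped inside the image."""
    h, w = feat.shape
    y = float(np.clip(y, 0.0, h - 1.0))
    x = float(np.clip(x, 0.0, w - 1.0))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

# A predicted (dy, dx) offset shifts the regular sampling position (i, j):
# value = sample_bilinear(feat, i + dy, j + dx)
```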
In one embodiment of the invention, the first parallel perceptual attention module, the second parallel perceptual attention module and the third parallel perceptual attention module have the same structure;
referring to fig. 2, the first parallel perceptual attention module includes a first multi-scale attention submodule, a first context attention submodule, and a first channel attention submodule;
the first multi-scale attention submodule takes the first residual block B1 and the second residual block B2 as input and outputs a first scale feature map E;
the first context attention submodule takes the first residual block B1 as input and outputs a first context feature map F;
the first channel attention submodule takes the first residual block B1 as input and outputs a first channel feature map G;
the first scale feature map E, the first context feature map F and the first channel feature map G are fused to obtain the first fused feature map IB1.
In the embodiment of the invention, the first parallel perceptual attention module, the second parallel perceptual attention module and the third parallel perceptual attention module have the same structure but different inputs and outputs. The second parallel perceptual attention module comprises a second multi-scale attention submodule, a second context attention submodule and a second channel attention submodule; the third parallel perceptual attention module comprises a third multi-scale attention submodule, a third context attention submodule and a third channel attention submodule.
Specifically, in the second parallel perceptual attention module, the second residual block B2 takes the place of the first residual block B1 in the first parallel perceptual attention module, and the third residual block B3 takes the place of the second residual block B2. Similarly, in the third parallel perceptual attention module, the third residual block B3 takes the place of the first residual block B1, and the fourth residual block B4 takes the place of the second residual block B2.
In one embodiment of the invention, referring to fig. 3, in the first multi-scale attention submodule, the first residual block B1 is convolved to obtain a first intermediate-scale feature map A, and the second residual block B2 is convolved to obtain a second intermediate-scale feature map B. The second intermediate-scale feature map B is matrix-transformed and multiplied by the first intermediate-scale feature map A to obtain a third intermediate-scale feature map, which is normalized to obtain a first multi-scale attention weight map M. The first multi-scale attention weight map M is multiplied by the second intermediate-scale feature map B to obtain a fourth intermediate-scale feature map, which is upsampled and added to the first residual block B1 to obtain the first scale feature map E;
referring to fig. 4, in the first context attention submodule, the first residual block B1 is convolved to obtain a first intermediate context feature map K and a second intermediate context feature map D. The second intermediate context feature map D is matrix-transformed and multiplied by the first intermediate context feature map K to obtain a third intermediate context feature map, which is normalized to obtain a first context attention weight map P. The first context attention weight map P is multiplied by the first residual block B1 to obtain a fourth intermediate context feature map, which is matrix-transformed and added to the first residual block B1 to obtain the first context feature map F;
referring to fig. 5, in the first channel attention submodule, the first residual block B1 is matrix-transformed and multiplied by the first residual block B1 to obtain a first intermediate channel feature map, which is normalized to obtain a first channel attention weight map Q. The first channel attention weight map Q is multiplied by the first residual block B1 to obtain a second intermediate channel feature map, which is matrix-transformed and added to the first residual block B1 to obtain the first channel feature map G.
In the embodiment of the present invention, the specific working processes of the first multi-scale attention submodule, the first context attention submodule and the first channel attention submodule of the first parallel perceptual attention module are given below. Since the first, second and third parallel perceptual attention modules share the same structure and differ only in their inputs and outputs, the corresponding processes of the second and third modules are not described again here.
Specifically, in a deep convolutional neural network, feature maps of different scales contain different degrees of structural and semantic information: high-level feature maps are rich in semantic information, while low-level feature maps are rich in structural information. Both kinds of information are very important for detecting targets in remote sensing images, especially small targets. To make full use of them, the embodiment of the invention provides a multi-scale attention module to enhance the feature expression of small targets.
The embodiment of the invention provides the specific working process of the first multi-scale attention submodule. The first intermediate-scale feature map A and the second intermediate-scale feature map B are obtained from the first residual block B1 and the second residual block B2 by separate 1 × 1 convolutions, with H and W denoting the height and width of the first residual block B1 and C denoting its number of channels. The matrix transformation may be a matrix transposition, and the normalization may be a Softmax normalization. Because the second intermediate-scale feature map B lies in a deeper network layer, the first intermediate-scale feature map A is richer in structural information; the first multi-scale attention weight map M therefore carries a prior from the structural information of the first intermediate-scale feature map A onto the second intermediate-scale feature map B, so that the first scale feature map E obtained through M contains both rich structural information and deeper semantic information, which benefits the detection of small-scale targets.
In one embodiment of the present invention, the first multi-scale attention weight map M is calculated as:
$M_{ji} = \frac{\exp(A_i \cdot B_j)}{\sum_{i=1}^{N} \exp(A_i \cdot B_j)}$
where i denotes the i-th row, j denotes the j-th column, N is the height of the first residual block B1, A is the first intermediate-scale feature map and B is the second intermediate-scale feature map;
the calculation formula of the first scale feature map E is as follows:
$E_j = \alpha \sum_{i=1}^{N} (M_{ji} B_i) + (B_1)_j$
where B1 is the first residual block and α is a learnable first weight coefficient.
Optionally, j may take any positive integer value from 1 to the width of the first residual block B1.
Optionally, the height and width of the first residual block B1 are the same.
Mji is a normalized weight coefficient in the first multi-scale attention weight map M that measures the influence of the i-th position on the j-th position at each scale, and α is a learnable first weight coefficient used to balance the corrected feature map against the initial feature map. Referring to fig. 6, which shows a heat map of part of the first scale feature map E, more small-aircraft regions are activated.
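The three attention submodules share one computational pattern, sketched below in NumPy with all feature maps flattened to two dimensions; the shapes, the flattening itself, and the omission of the 1 × 1 convolutions and the upsampling are simplifying assumptions.

```python
import numpy as np

def softmax_attention(a, b, value, coeff, residual):
    """Shared attention pattern of the submodules: weights
    M_ji = softmax_i(a_i . b_j), output O_j = coeff * sum_i M_ji * value_i
    + residual_j, with rows of each array indexing positions (or channels)."""
    logits = a @ b.T                          # logits[i, j] = a_i . b_j
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)      # softmax over i, column by column
    return coeff * (w.T @ value) + residual

# First multi-scale attention submodule: A and B come from 1x1 convolutions of
# residual blocks B1 and B2; rows are spatial positions, columns are channels.
n, c = 64, 256                                # hypothetical flattened sizes
A = np.random.randn(n, c)                     # first intermediate-scale feature map
B = np.random.randn(n, c)                     # second intermediate-scale feature map
B1 = np.random.randn(n, c)                    # first residual block, flattened
E = softmax_attention(A, B, B, coeff=0.1, residual=B1)   # first scale feature map
```

In the real model the coefficient (α here) is learned, and the attended map is upsampled back to B1's resolution before the final addition.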
The embodiment of the invention also provides the specific working process of the first context attention submodule. Context information effectively distinguishes foreground from background and benefits remote sensing image target detection against complex backgrounds; the first context attention submodule embeds context information into the attention mechanism so as to fully extract the association between foreground and background and further strengthen the feature expression capability of the network. Its main structure is shown in fig. 4.
In the first context attention submodule, the first residual block B1 is passed through two 7 × 7 convolutions to obtain the first intermediate context feature map K and the second intermediate context feature map D; in the second context attention submodule, the second residual block B2 is passed through two 5 × 5 convolutions to obtain two intermediate context feature maps; in the third context attention submodule, the third residual block B3 is passed through two 3 × 3 convolutions to obtain two intermediate context feature maps; and in the fourth context attention submodule, the fourth residual block B4 is passed through two 1 × 1 convolutions to obtain two intermediate context feature maps.
The first context attention weight map encodes, at each scale, the contribution of the context information of non-locally related targets to the classification and regression of the target. The first context feature map enhances the expression of the target and of the associated information around it.
In one embodiment of the present invention, the first contextual attention weight map P is calculated as:
$P_{ji} = \frac{\exp(K_i \cdot D_j)}{\sum_{i=1}^{N} \exp(K_i \cdot D_j)}$
wherein K is a first intermediate context feature map, and D is a second intermediate context feature map;
the calculation formula of the first context feature map F is:
$F_j = \beta \sum_{i=1}^{N} (P_{ji} (B_1)_i) + (B_1)_j$
where β is a learnable second weight coefficient.
Pji is the weighted influence coefficient of the i-th position on the j-th position in the context-information weight map; β is a learnable second weight coefficient used to balance the corrected feature map against the initial feature map. Referring to fig. 7, which shows a heat map of part of the first context feature map F, more local information around the object is activated.
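Continuing the sketch above, the first context attention submodule reuses the same pattern, with K and D as the attention inputs and the first residual block itself on the value path (names remain illustrative):

```python
# K and D come from the two 7x7 convolutions of residual block B1 (see above);
# beta (coeff) weighs the corrected map against the initial one.
K = np.random.randn(n, c)
D = np.random.randn(n, c)
F_ctx = softmax_attention(K, D, B1, coeff=0.1, residual=B1)  # first context feature map
```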
The embodiment of the invention also provides the specific working process of the first channel attention submodule. Each channel of a convolutional neural network feature map carries global information about different categories and spatial positions; some of this information helps target detection and some hinders it. To strengthen the positive responses and weaken the negative ones, the embodiment of the invention provides a channel attention submodule that models the interrelation among channels and the non-local associations within the feature map. The specific process is shown in fig. 5.
In one embodiment of the present invention, the first channel attention weight map Q is calculated as:
$Q_{ji} = \frac{\exp((B_1)_i \cdot (B_1)_j)}{\sum_{i=1}^{C} \exp((B_1)_i \cdot (B_1)_j)}$
where C is the number of channels of the first residual block B1;
the calculation formula of the first channel characteristic diagram G is as follows:
$G_j = \gamma \sum_{i=1}^{C} (Q_{ji} (B_1)_i) + (B_1)_j$
where γ is a learnable third weight coefficient.
Qji measures the influence of the i-th channel on the j-th channel, and γ is a learnable third weight coefficient used to balance the corrected feature map against the initial feature map. Referring to fig. 8, which shows a heat map of part of the first channel feature map G, more global information associated with the target is activated.
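The first channel attention submodule applies the same pattern across channels rather than positions, again continuing the sketch above:

```python
# Transpose so that rows index channels: B1_ch is (c, n), and the Gram-style
# product B1_ch @ B1_ch.T compares channels; gamma (coeff) weighs the result.
B1_ch = B1.T
G = softmax_attention(B1_ch, B1_ch, B1_ch, coeff=0.1, residual=B1_ch)
G = G.T                                       # back to the (n, c) layout
```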
In one embodiment of the invention, the fourth parallel perceptual attention module comprises a fourth context attention submodule and a fourth channel attention submodule;
the fourth context attention submodule takes the fourth residual block B4 as input and outputs a fourth context feature map;
the fourth channel attention submodule takes the fourth residual block B4 as input and outputs a fourth channel feature map;
the fourth context feature map and the fourth channel feature map are fused to obtain the fourth fused feature map IB4.
Unlike the first three parallel perceptual attention modules, the fourth parallel perceptual attention module contains only a context attention submodule and a channel attention submodule; these work in the same way as the first context attention submodule and the first channel attention submodule described above, so their details are not repeated here.
Optionally, before S102, the method may further include:
preprocessing a remote sensing image to be detected to obtain a preprocessed remote sensing image to be detected;
accordingly, S102 may include:
and inputting the preprocessed remote sensing image to be detected into the trained parallel perception attention network model to obtain a plurality of output characteristic graphs with different scales.
S103: performing target detection according to the plurality of output feature maps of different scales to obtain a detection result.
In the embodiment of the invention, any existing method can be used to perform target detection on the plurality of output feature maps of different scales to obtain a detection result.
Alternatively, referring to fig. 9, after feature extraction through the trained parallel perceptual attention network model, target detection may proceed through a region proposal network, alignment and pooling operations, followed by non-maximum suppression to output the classification and localization results, thereby obtaining the detection result.
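The surrounding detection pipeline is not specified in detail here; as a hedged illustration, the final non-maximum suppression step could look like the following, using torchvision's nms operator (thresholds and names are assumptions):

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, iou_thresh=0.7, score_thresh=0.05):
    """Drop low-confidence boxes, then apply non-maximum suppression.
    boxes: (N, 4) tensor in (x1, y1, x2, y2) format; scores: (N,) tensor."""
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)     # indices of surviving boxes
    return boxes[kept], scores[kept]
```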
In the embodiment of the present invention, the design detail parameters of the parallel perceptual attention network model are shown in table 1.
TABLE 1. Design detail parameters of the parallel perceptual attention network model
[table image not reproduced]
The target detection effect of the embodiment of the invention is verified through experiments.
The hardware and software environments used for the experiments were as follows:
CPU: Intel Core i7 6700, 3.30 GHz; GPU: P2000, 5 GB; memory: 16 GB; operating system: Ubuntu 16.04; development environment: TensorFlow; programming language: Python 3.5; IDE: PyCharm.
Experimental data set:
The experiment uses two public remote sensing image data sets, RSOD and UCAS-AOD; 80% of the car and airplane categories are randomly selected as the training set and the remaining 20% as the test set.
The network model adopts a 101-layer residual network as its backbone, with parameters initialized from weights pre-trained on ImageNet. Input pictures are uniformly resized to 800 × 800 pixels, and the model is trained for 30000 iterations with stochastic gradient descent, starting from a learning rate of 0.001 that is reduced to 0.0001 after 15000 iterations. For the anchor bounding boxes, four scales (32 × 32, 64 × 64, 128 × 128 and 256 × 256) and the aspect ratios 1:1, 2:1 and 1:2 are used; this choice reduces computation while maintaining good accuracy. The IoU threshold is set to 0.7.
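For example, the quoted anchor configuration yields twelve anchor shapes; a small sketch follows, interpreting each ratio as height/width at constant area (a common but here assumed convention):

```python
import itertools
import numpy as np

def make_anchor_shapes(scales=(32, 64, 128, 256), ratios=(1.0, 2.0, 0.5)):
    """Return (w, h) pairs for every scale/aspect-ratio combination,
    keeping the anchor area fixed at scale**2."""
    shapes = []
    for s, r in itertools.product(scales, ratios):
        w = s / np.sqrt(r)
        h = s * np.sqrt(r)
        shapes.append((w, h))
    return np.array(shapes)                   # 4 scales x 3 ratios = 12 anchors
```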
TABLE 2. Comparison of average accuracy and recall with other methods
[table image not reproduced]
The experimental results are as follows:
the evaluation indexes of the experiment adopt average accuracy and recall rate. Fig. 10 shows a comparison between the detection results of the target detection method according to the embodiment of the present invention and the detection results of the current mainstream deep learning method, where the first three columns show the detection results of a target (aircraft) in a complex background under a small scale and under an occlusion condition, and the last column shows the detection results of an automobile in each scene, where the first row is an original picture, the second row and the third row are the detection results of a regression-based target detection model YOLO and an SSD, respectively, and it can be seen from the frame that the detection accuracy is not high, and there are still many missed detection conditions in a complex scene. The fourth line and the fifth line are target detection models FPN and Faster-RCNN based on region recommendation, and the detection accuracy is higher than that of YOLO and SSD according to results.
Table 2 compares the accuracy and recall of the car and airplane detection results of the method provided in the embodiment of the present invention with other methods. Compared with the other deep learning methods, the method improves the average accuracy and recall of car and airplane detection by 7% on average, which is about 1% higher than the best existing detection method.
Table 3 compares the detection speed of the method provided by the embodiment of the present invention with other methods. As table 3 shows, using this network model as the backbone for target detection reaches about 8.8 FPS, a threefold improvement over the previous model; the detection speed is also higher than that of mainstream region-proposal-based network models.
TABLE 3. Comparison of detection speed with other methods
[table image not reproduced]
The experiment also includes an ablation study to verify the effect of each submodule on the detection results. From the ablation data in table 4, the average accuracy improves by 0.9% when the model uses only the channel attention submodule and the context attention submodule, by 2.1% with the multi-scale attention submodule and the channel attention submodule, and by 2.3% with the context attention submodule and the multi-scale attention submodule, which indicates that multi-scale and context information features help target detection more. With all submodules, the average accuracy improves by 3.7%, showing that every submodule is effective for target detection.
TABLE 4. Effect of each module on the detection results
[table image not reproduced]
The embodiment of the invention provides a parallel perceptual attention network model based on an attention mechanism to improve the accuracy and speed of remote sensing image target detection. The network model comprises parallel multi-scale, context and channel attention submodules. First, the outputs of the three parallel submodules at multiple scales are fused to obtain rich multi-scale features, context features and non-local association features. Then, deformable convolution replaces traditional convolution on the fused feature maps to better extract direction-sensitive object features. Finally, the distance intersection over union loss replaces the traditional bounding-box loss, accelerating model convergence and yielding more accurate target localization. The experimental results show that using this network model as the backbone for target detection effectively improves both detection accuracy and detection speed, and that it also detects targets well in complex scenes.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 11 is a schematic block diagram of a remote sensing image target detection system according to an embodiment of the present invention, and for convenience of description, only the parts related to the embodiment of the present invention are shown.
In the embodiment of the present invention, the remote sensing image target detection system 110 may include an acquisition module 1101, a feature extraction module 1102, and a target detection module 1103.
The acquisition module 1101 is used for acquiring a remote sensing image to be detected;
the feature extraction module 1102 is used for inputting the remote sensing image to be detected into the trained parallel perceptual attention network model to obtain a plurality of output feature maps with different scales;
and the target detection module 1103 is configured to perform target detection according to a plurality of output feature maps with different scales to obtain a detection result.
Optionally, in the feature extraction module 1102, the parallel perceptual attention network model takes a residual network as its backbone;
the parallel perception attention network model comprises a first residual block, a second residual block, a third residual block, a fourth residual block, a first parallel perception attention module, a second parallel perception attention module, a third parallel perception attention module and a fourth parallel perception attention module; the sizes of the first residual block, the second residual block, the third residual block and the fourth residual block are all different;
the first parallel perceptual attention module takes the first residual block and the second residual block as input and outputs a first fused feature map IB1; the second parallel perceptual attention module takes the second residual block and the third residual block as input and outputs a second fused feature map; the third parallel perceptual attention module takes the third residual block and the fourth residual block as input and outputs a third fused feature map; the fourth parallel perceptual attention module takes the fourth residual block as input and outputs a fourth fused feature map;
obtaining an output characteristic diagram of a fourth scale by the fourth fusion characteristic diagram through deformable convolution; after the third fusion characteristic diagram is subjected to deformable convolution, adding the third fusion characteristic diagram and the output characteristic diagram of the fourth scale subjected to 2 times of upsampling to obtain an output characteristic diagram of the third scale; after the second fusion characteristic diagram is subjected to deformable convolution, adding the second fusion characteristic diagram and the output characteristic diagram of the third scale subjected to 2 times of upsampling to obtain an output characteristic diagram of the second scale; the first fused feature map IB1 is subjected to deformable convolution and then added with the output feature map of the second scale after being subjected to 2 times of upsampling to obtain the output feature map of the first scale.
Optionally, the first parallel attention sensing module, the second parallel attention sensing module and the third parallel attention sensing module have the same structure;
the first parallel perceptual attention module includes a first multi-scale attention submodule, a first contextual attention submodule, and a first channel attention submodule;
the first multi-scale attention submodule takes the first residual block and the second residual block as input and outputs a first scale feature map;
the first context attention submodule takes the first residual block as input and outputs a first context feature map;
the first channel attention submodule takes the first residual block as input and outputs a first channel characteristic diagram;
and fusing the first scale feature map, the first context feature map and the first channel feature map to obtain the first fused feature map IB1.
Optionally, in the first multi-scale attention submodule, convolving the first residual block to obtain a first intermediate-scale feature map, convolving the second residual block to obtain a second intermediate-scale feature map, performing matrix transformation on the second intermediate-scale feature map, multiplying the second intermediate-scale feature map by the first intermediate-scale feature map to obtain a third intermediate-scale feature map, normalizing the third intermediate-scale feature map to obtain a first multi-scale attention weight map, multiplying the first multi-scale attention weight map by the second intermediate-scale feature map to obtain a fourth intermediate-scale feature map, upsampling the fourth intermediate-scale feature map, and adding the upsampled fourth intermediate-scale feature map to the first residual block to obtain the first scale feature map;
in a first context attention submodule, performing convolution on a first residual block to respectively obtain a first intermediate context feature map and a second intermediate context feature map, performing matrix transformation on the second intermediate context feature map, then performing multiplication operation on the second intermediate context feature map and the first intermediate context feature map to obtain a third intermediate context feature map, normalizing the third intermediate context feature map to obtain a first context attention weight map, performing multiplication operation on the first context attention weight map and a first residual block to obtain a fourth intermediate context feature map, performing matrix transformation on the fourth intermediate context feature map, and then performing addition operation on the fourth intermediate context feature map and the first residual block to obtain a first context feature map;
in the first channel attention submodule, a first residual block is subjected to matrix transformation and then multiplied by the first residual block to obtain a first intermediate channel characteristic diagram, the first intermediate channel characteristic diagram is normalized to obtain a first channel attention weight diagram, the first channel attention weight diagram is multiplied by the first residual block to obtain a second intermediate channel characteristic diagram, and the second intermediate channel characteristic diagram is subjected to matrix transformation and then added to the first residual block to obtain a first channel characteristic diagram.
Optionally, the calculation formula of the first multi-scale attention weight map M is:
$M_{ji} = \frac{\exp(A_i \cdot B_j)}{\sum_{i=1}^{N} \exp(A_i \cdot B_j)}$
wherein i represents the ith row, j represents the jth column, N is the height of the first residual block, A is a first intermediate-scale feature map, and B is a second intermediate-scale feature map;
the calculation formula of the first scale feature map E is as follows:
$E_j = \alpha \sum_{i=1}^{N} (M_{ji} B_i) + (B_1)_j$
where B1 is the first residual block and α is a learnable first weight coefficient;
the calculation formula of the first contextual attention weight map P is:
$P_{ji} = \frac{\exp(K_i \cdot D_j)}{\sum_{i=1}^{N} \exp(K_i \cdot D_j)}$
wherein K is a first intermediate context feature map, and D is a second intermediate context feature map;
the calculation formula of the first context feature map F is:
$F_j = \beta \sum_{i=1}^{N} (P_{ji} (B_1)_i) + (B_1)_j$
wherein β is a learnable second weight coefficient;
the first channel attention weight map Q is calculated as:
$Q_{ji} = \frac{\exp((B_1)_i \cdot (B_1)_j)}{\sum_{i=1}^{C} \exp((B_1)_i \cdot (B_1)_j)}$
wherein, C is the channel number of the first residual block;
the calculation formula of the first channel characteristic diagram G is as follows:
$G_j = \gamma \sum_{i=1}^{C} (Q_{ji} (B_1)_i) + (B_1)_j$
where γ is a learnable third weight coefficient.
Optionally, the fourth parallel perceptual attention module comprises a fourth contextual attention submodule and a fourth channel attention submodule;
the fourth context attention submodule takes the fourth residual block as input and outputs a fourth context feature map;
the fourth channel attention submodule takes the fourth residual block as input and outputs a fourth channel characteristic diagram;
and fusing the fourth context feature map and the fourth channel feature map to obtain a fourth fused feature map.
Optionally, in the training of the parallel perceptual attention network model, a class loss function and a regression loss function are used, wherein the regression loss function is a distance intersection over union (DIoU) loss function.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is merely used as an example, and in practical applications, the foregoing function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the remote sensing image target detection system is divided into different functional units or modules to perform all or part of the above-described functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the above-mentioned apparatus may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 12 is a schematic block diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 12, the terminal device 120 of this embodiment includes: one or more processors 1201, a memory 1202, and a computer program 1203 stored in the memory 1202 and executable on the processors 1201. The processor 1201 implements the steps in the embodiments of the remote sensing image target detection method described above, for example, steps S101 to S103 shown in fig. 1, when executing the computer program 1203. Alternatively, the processor 1201 realizes the functions of the modules/units in the embodiment of the remote sensing image target detection system, for example, the functions of the modules 1101 to 1103 shown in fig. 11, when executing the computer program 1203.
Illustratively, the computer program 1203 may be partitioned into one or more modules/units that are stored in the memory 1202 and executed by the processor 1201 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 1203 in the terminal device 120. For example, the computer program 1203 may be divided into an acquisition module, a feature extraction module, and an object detection module, and each module specifically functions as follows:
the acquisition module is used for acquiring a remote sensing image to be detected;
the feature extraction module is used for inputting the remote sensing image to be detected into the trained parallel perception attention network model to obtain a plurality of output feature maps with different scales;
and the target detection module is used for performing target detection according to the plurality of output feature maps of different scales to obtain a detection result.
Other modules or units can refer to the description of the embodiment shown in fig. 11, and are not described again here.
The terminal device 120 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device 120 includes, but is not limited to, a processor 1201 and a memory 1202. Those skilled in the art will appreciate that fig. 12 is only one example of a terminal device 120, and does not constitute a limitation to terminal device 120, and may include more or less components than those shown, or combine certain components, or different components, for example, terminal device 120 may also include an input device, an output device, a network access device, a bus, etc.
The Processor 1201 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 1202 may be an internal storage unit of the terminal device 120, such as a hard disk or a memory of the terminal device 120. The memory 1202 may also be an external storage device of the terminal device 120, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 120. Further, the memory 1202 may also include both an internal storage unit of the terminal device 120 and an external storage device. The memory 1202 is used for storing the computer program 1203 and other programs and data required by the terminal device 120. The memory 1202 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed remote sensing image target detection system and method may be implemented in other ways. For example, the above-described embodiments of the remote sensing image object detection system are merely illustrative, and for example, the division of the modules or units is only a logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain other components which may be suitably increased or decreased as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media which may not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A remote sensing image target detection method is characterized by comprising the following steps:
acquiring a remote sensing image to be detected;
inputting the remote sensing image to be detected into the trained parallel perceptual attention network model to obtain a plurality of output feature maps of different scales;
and performing target detection according to the output feature maps of different scales to obtain a detection result.
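Read as a whole, claim 1 describes a three-step inference pipeline. A minimal sketch in PyTorch, assuming illustrative names (`detect`, `backbone`, and `head` are not from the patent, and any standard detection head may stand in for the unspecified detector):

```python
import torch

def detect(image: torch.Tensor, backbone, head):
    """Three-step method of claim 1 on a single remote sensing image.

    image: the remote sensing image to be detected, e.g. a [1, 3, H, W] tensor.
    backbone: the trained parallel perceptual attention network model.
    head: any detection head consuming multi-scale feature maps (assumed).
    """
    feature_maps = backbone(image)  # output feature maps of different scales
    return head(feature_maps)       # detection result: boxes, classes, scores
```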
2. The remote sensing image target detection method of claim 1, wherein the parallel perceptual attention network model is based on a residual network;
the parallel perceptual attention network model comprises a first residual block, a second residual block, a third residual block, a fourth residual block, a first parallel perceptual attention module, a second parallel perceptual attention module, a third parallel perceptual attention module and a fourth parallel perceptual attention module; the first, second, third and fourth residual blocks all have different sizes;
the first parallel perceptual attention module takes the first residual block and the second residual block as input and outputs a first fused feature map; the second parallel perceptual attention module takes the second residual block and the third residual block as input and outputs a second fused feature map; the third parallel perceptual attention module takes the third residual block and the fourth residual block as input and outputs a third fused feature map; the fourth parallel perceptual attention module takes the fourth residual block as input and outputs a fourth fused feature map;
the fourth fused feature map is subjected to deformable convolution to obtain an output feature map of a fourth scale; the third fused feature map is subjected to deformable convolution and then added to the output feature map of the fourth scale after 2× upsampling to obtain an output feature map of a third scale; the second fused feature map is subjected to deformable convolution and then added to the output feature map of the third scale after 2× upsampling to obtain an output feature map of a second scale; and the first fused feature map is subjected to deformable convolution and then added to the output feature map of the second scale after 2× upsampling to obtain an output feature map of a first scale.
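A minimal sketch of this top-down fusion, assuming a uniform channel count across the four fused feature maps, adjacent scales differing by a factor of 2, and torchvision's `DeformConv2d` with a learned offset branch standing in for the deformable convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """3x3 deformable convolution with a learned offset branch."""
    def __init__(self, channels: int):
        super().__init__()
        # 2 offsets (dx, dy) per position of the 3x3 kernel.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.dcn(x, self.offset(x))

def top_down_fusion(fused, dcn_blocks):
    """fused: [fused1, ..., fused4] from high to low resolution;
    dcn_blocks: one DeformBlock per scale."""
    out4 = dcn_blocks[3](fused[3])                       # fourth-scale output
    out3 = dcn_blocks[2](fused[2]) + F.interpolate(out4, scale_factor=2)
    out2 = dcn_blocks[1](fused[1]) + F.interpolate(out3, scale_factor=2)
    out1 = dcn_blocks[0](fused[0]) + F.interpolate(out2, scale_factor=2)
    return out1, out2, out3, out4
```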
3. The remote sensing image target detection method of claim 2, wherein the first parallel perceptual attention module, the second parallel perceptual attention module and the third parallel perceptual attention module have the same structure;
the first parallel perceptual attention module includes a first multi-scale attention submodule, a first context attention submodule, and a first channel attention submodule;
the first multi-scale attention submodule takes the first residual block and the second residual block as input and outputs a first scale feature map;
the first context attention submodule takes the first residual block as input and outputs a first context feature map;
the first channel attention submodule takes the first residual block as input and outputs a first channel feature map;
and fusing the first scale feature map, the first context feature map and the first channel feature map to obtain the first fused feature map.
4. The remote sensing image target detection method of claim 3, wherein, in the first multi-scale attention submodule, convolving the first residual block to obtain a first intermediate-scale feature map, convolving the second residual block to obtain a second intermediate-scale feature map, performing matrix transformation on the second intermediate-scale feature map and then multiplying it by the first intermediate-scale feature map to obtain a third intermediate-scale feature map, normalizing the third intermediate-scale feature map to obtain a first multi-scale attention weight map, multiplying the first multi-scale attention weight map by the second intermediate-scale feature map to obtain a fourth intermediate-scale feature map, and upsampling the fourth intermediate-scale feature map and then adding it to the first residual block to obtain the first scale feature map;
in the first context attention submodule, performing convolution on the first residual block to respectively obtain a first intermediate context feature map and a second intermediate context feature map, performing matrix transformation on the second intermediate context feature map, then performing multiplication operation on the second intermediate context feature map and the first intermediate context feature map to obtain a third intermediate context feature map, normalizing the third intermediate context feature map to obtain a first context attention weight map, performing multiplication operation on the first context attention weight map and the first residual block to obtain a fourth intermediate context feature map, performing matrix transformation on the fourth intermediate context feature map, and then performing addition operation on the fourth intermediate context feature map and the first residual block to obtain the first context feature map;
in the first channel attention submodule, performing matrix transformation on the first residual block and then multiplying the result by the first residual block to obtain a first intermediate channel feature map, normalizing the first intermediate channel feature map to obtain a first channel attention weight map, multiplying the first channel attention weight map by the first residual block to obtain a second intermediate channel feature map, and performing matrix transformation on the second intermediate channel feature map and then adding it to the first residual block to obtain the first channel feature map.
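The three submodules of claims 3 and 4 can be sketched in PyTorch as follows. Where the claims leave details open, this sketch makes assumptions: a stride-2 convolution aligns the first residual block with the second before the cross-scale multiplication, 1×1 convolutions produce the context branch's K and D, normalization is a softmax, and the three branch outputs are fused by elementwise summation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttention(nn.Module):
    """First multi-scale attention submodule (claim 4); the stride-2
    query convolution that aligns the two resolutions is an assumption."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable first weight

    def forward(self, b1, b2):
        n, c, h, w = b2.shape
        a = self.query(b1).flatten(2)                      # A: [n, c, h*w]
        b = self.key(b2).flatten(2)                        # B: [n, c, h*w]
        m = torch.softmax(a.transpose(1, 2) @ b, dim=-1)   # weight map M
        fourth = (b @ m.transpose(1, 2)).view(n, c, h, w)  # M applied to B
        up = F.interpolate(fourth, scale_factor=2)         # 2x upsampling
        return self.alpha * up + b1                        # scale feature map E

class ContextAttention(nn.Module):
    """First context attention submodule (claim 4): position-wise
    self-attention over the first residual block."""
    def __init__(self, channels: int):
        super().__init__()
        self.k = nn.Conv2d(channels, channels, 1)
        self.d = nn.Conv2d(channels, channels, 1)
        self.beta = nn.Parameter(torch.zeros(1))   # learnable second weight

    def forward(self, x):
        n, c, h, w = x.shape
        k = self.k(x).flatten(2)                           # K: [n, c, h*w]
        d = self.d(x).flatten(2)                           # D: [n, c, h*w]
        p = torch.softmax(d.transpose(1, 2) @ k, dim=-1)   # weight map P
        out = (x.flatten(2) @ p.transpose(1, 2)).view(n, c, h, w)
        return self.beta * out + x                         # context feature map F

class ChannelAttention(nn.Module):
    """First channel attention submodule (claim 4): channel-wise
    self-attention over the first residual block."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable third weight

    def forward(self, x):
        n, c, h, w = x.shape
        flat = x.flatten(2)                                # [n, c, h*w]
        q = torch.softmax(flat @ flat.transpose(1, 2), dim=-1)  # Q: [n, c, c]
        out = (q @ flat).view(n, c, h, w)
        return self.gamma * out + x                        # channel feature map G

class ParallelPerceptualAttention(nn.Module):
    """Claim 3: run the three submodules in parallel and fuse their
    outputs (elementwise sum is an assumed fusion)."""
    def __init__(self, channels: int):
        super().__init__()
        self.scale = MultiScaleAttention(channels)
        self.context = ContextAttention(channels)
        self.channel = ChannelAttention()

    def forward(self, b1, b2):
        return self.scale(b1, b2) + self.context(b1) + self.channel(b1)
```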
5. The remote sensing image target detection method of claim 4, wherein the calculation formula of the first multi-scale attention weight map M is as follows:
$$M_{ij}=\frac{\exp(A_i\cdot B_j)}{\sum_{i=1}^{N}\exp(A_i\cdot B_j)}$$
wherein i represents the ith row, j represents the jth column, N is the height of the first residual block, A is the first intermediate-scale feature map, and B is the second intermediate-scale feature map;
the calculation formula of the first scale feature map E is as follows:
$$E=\alpha\cdot\mathrm{Up}_{2}(BM)+B_1$$
wherein B1 is the first residual block, α is a learnable first weight coefficient, and Up2(·) denotes 2× upsampling;
the calculation formula of the first context attention weight map P is as follows:
$$P_{ij}=\frac{\exp(K_i\cdot D_j)}{\sum_{i=1}^{N}\exp(K_i\cdot D_j)}$$
wherein K is the first intermediate context feature map, and D is the second intermediate context feature map;
the calculation formula of the first context feature map F is as follows:
$$F=\beta\cdot(B_1P)+B_1$$
wherein β is a learnable second weight coefficient;
the calculation formula of the first channel attention weight map Q is as follows:
$$Q_{ij}=\frac{\exp(B_{1i}\cdot B_{1j})}{\sum_{i=1}^{C}\exp(B_{1i}\cdot B_{1j})}$$
wherein C is the number of channels of the first residual block;
the calculation formula of the first channel characteristic diagram G is as follows:
$$G=\gamma\cdot(QB_1)+B_1$$
wherein γ is a learnable third weight coefficient.
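As a quick consistency check of these formulas against the sketch given after claim 4 (the sizes are arbitrary; the only constraint assumed is that the second residual block has half the resolution of the first):

```python
b1 = torch.randn(2, 64, 32, 32)   # first residual block, C = 64
b2 = torch.randn(2, 64, 16, 16)   # second residual block, half resolution
ppa = ParallelPerceptualAttention(64)
fused1 = ppa(b1, b2)              # E + F + G
assert fused1.shape == b1.shape   # the fused feature map keeps B1's shape
```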
6. The remote sensing image target detection method of claim 2, wherein the fourth parallel perceptual attention module comprises a fourth context attention submodule and a fourth channel attention submodule;
the fourth context attention submodule takes the fourth residual block as input and outputs a fourth context feature map;
the fourth channel attention submodule takes the fourth residual block as input and outputs a fourth channel feature map;
and fusing the fourth context feature map and the fourth channel feature map to obtain the fourth fused feature map.
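Continuing the sketch above, the fourth module omits the multi-scale branch, since there is no lower-resolution residual block to pair with (elementwise-sum fusion again assumed):

```python
class ParallelPerceptualAttention4(nn.Module):
    """Claim 6: context and channel attention only, on the fourth residual block."""
    def __init__(self, channels: int):
        super().__init__()
        self.context = ContextAttention(channels)
        self.channel = ChannelAttention()

    def forward(self, b4):
        return self.context(b4) + self.channel(b4)
```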
7. The remote sensing image target detection method of any one of claims 1-6, wherein a classification loss function and a regression loss function are used in the training process of the parallel perceptual attention network model, and the regression loss function is a distance cross-correlation loss function.
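The claim does not spell the regression loss out. If the "distance cross-correlation loss" is read as a distance-IoU-style penalty on box centres (an assumption, not a statement of the patent's definition), a minimal sketch is:

```python
import torch

def distance_iou_loss(pred, target, eps: float = 1e-7):
    """Assumed DIoU-style regression loss; boxes are [N, 4] as (x1, y1, x2, y2)."""
    # IoU term.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared distance between box centres.
    rho2 = (((pred[:, :2] + pred[:, 2:]) - (target[:, :2] + target[:, 2:])) ** 2).sum(1) / 4
    # Squared diagonal of the smallest enclosing box.
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(1) + eps
    return (1 - iou + rho2 / c2).mean()
```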
8. A remote sensing image target detection system, comprising:
the acquisition module is used for acquiring a remote sensing image to be detected;
the feature extraction module is used for inputting the remote sensing image to be detected into the trained parallel perceptual attention network model to obtain a plurality of output feature maps of different scales;
and the target detection module is used for performing target detection according to the output feature maps of different scales to obtain a detection result.
9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the remote sensing image target detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by one or more processors, implements the steps of the remote sensing image target detection method according to any one of claims 1 to 7.
CN202010737230.1A 2020-07-28 2020-07-28 Remote sensing image target detection method and system and terminal equipment Active CN111860398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010737230.1A CN111860398B (en) 2020-07-28 2020-07-28 Remote sensing image target detection method and system and terminal equipment

Publications (2)

Publication Number Publication Date
CN111860398A true CN111860398A (en) 2020-10-30
CN111860398B CN111860398B (en) 2022-05-10

Family

ID=72948358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010737230.1A Active CN111860398B (en) 2020-07-28 2020-07-28 Remote sensing image target detection method and system and terminal equipment

Country Status (1)

Country Link
CN (1) CN111860398B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110414377A (en) * 2019-07-09 2019-11-05 武汉科技大学 A kind of remote sensing images scene classification method based on scale attention network
CN110378297A (en) * 2019-07-23 2019-10-25 河北师范大学 A kind of Remote Sensing Target detection method based on deep learning
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111079739A (en) * 2019-11-28 2020-04-28 长沙理工大学 Multi-scale attention feature detection method
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN111160249A (en) * 2019-12-30 2020-05-15 西北工业大学深圳研究院 Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN111274869A (en) * 2020-01-07 2020-06-12 中国地质大学(武汉) Method for classifying hyperspectral images based on parallel attention mechanism residual error network
CN111259758A (en) * 2020-01-13 2020-06-09 中国矿业大学 Two-stage remote sensing image target detection method for dense area

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270278A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Key point-based blue top house detection method
CN112487900A (en) * 2020-11-20 2021-03-12 中国人民解放军战略支援部队航天工程大学 SAR image ship target detection method based on feature fusion
CN112487900B (en) * 2020-11-20 2022-11-15 中国人民解放军战略支援部队航天工程大学 SAR image ship target detection method based on feature fusion
CN112949453A (en) * 2021-02-26 2021-06-11 南京恩博科技有限公司 Training method of smoke and fire detection model, smoke and fire detection method and smoke and fire detection equipment
CN112949453B (en) * 2021-02-26 2023-12-26 南京恩博科技有限公司 Training method of smoke and fire detection model, smoke and fire detection method and equipment
CN113129345A (en) * 2021-04-19 2021-07-16 重庆邮电大学 Target tracking method based on multi-feature map fusion and multi-scale expansion convolution
CN113159013A (en) * 2021-04-28 2021-07-23 平安科技(深圳)有限公司 Paragraph identification method and device based on machine learning, computer equipment and medium
CN113239825B (en) * 2021-05-19 2022-08-19 四川中烟工业有限责任公司 High-precision tobacco beetle detection method in complex scene
CN113239825A (en) * 2021-05-19 2021-08-10 四川中烟工业有限责任公司 High-precision tobacco beetle detection method in complex scene
CN113628245A (en) * 2021-07-12 2021-11-09 中国科学院自动化研究所 Multi-target tracking method, device, electronic equipment and storage medium
CN113628245B (en) * 2021-07-12 2023-10-31 中国科学院自动化研究所 Multi-target tracking method, device, electronic equipment and storage medium
CN114529808B (en) * 2022-04-21 2022-07-19 南京北控工程检测咨询有限公司 Pipeline detection panoramic shooting processing system and method
CN114529808A (en) * 2022-04-21 2022-05-24 南京北控工程检测咨询有限公司 Pipeline detection panoramic shooting processing method

Also Published As

Publication number Publication date
CN111860398B (en) 2022-05-10

Similar Documents

Publication Publication Date Title
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
US11734851B2 (en) Face key point detection method and apparatus, storage medium, and electronic device
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN109086811B (en) Multi-label image classification method and device and electronic equipment
WO2020119527A1 (en) Human action recognition method and apparatus, and terminal device and storage medium
CN111402130B (en) Data processing method and data processing device
CN108399386A (en) Information extracting method in pie chart and device
CN111476719B (en) Image processing method, device, computer equipment and storage medium
CN111340077B (en) Attention mechanism-based disparity map acquisition method and device
CN113469088B (en) SAR image ship target detection method and system under passive interference scene
Wang et al. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection
CN115953665B (en) Target detection method, device, equipment and storage medium
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN109815931A (en) A kind of method, apparatus, equipment and the storage medium of video object identification
CN112634316A (en) Target tracking method, device, equipment and storage medium
CN115272691A (en) Training method, recognition method and equipment for steel bar binding state detection model
CN114049491A (en) Fingerprint segmentation model training method, fingerprint segmentation device, fingerprint segmentation equipment and fingerprint segmentation medium
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
CN110633630B (en) Behavior identification method and device and terminal equipment
CN116309643A (en) Face shielding score determining method, electronic equipment and medium
CN111104965A (en) Vehicle target identification method and device
CN115601820A (en) Face fake image detection method, device, terminal and storage medium
CN115577768A (en) Semi-supervised model training method and device
CN114820755A (en) Depth map estimation method and system
CN114758145A (en) Image desensitization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant