CN117315458A - Target detection method and device for remote sensing image, electronic equipment and storage medium

Info

Publication number: CN117315458A
Application number: CN202311048592.XA
Authority: CN
Inventors: 刘阁; 吕亚龙; 汪磊; 李强; 李健存
Original assignee: Beijing Guanwei Technology Co ltd
Current assignee: Beijing Guanwei Technology Co ltd
Priority date: 2023-08-18
Filing date: 2023-08-18
Publication date: 2023-12-29

Abstract

The invention provides a target detection method, a target detection device, electronic equipment and a storage medium of a remote sensing image, and relates to the technical field of image data processing, wherein the target detection method comprises the following steps: extracting features of the remote sensing images according to different scales to obtain first feature images under different scales; respectively carrying out context semantic feature enhancement on the plurality of first feature images to obtain a plurality of second feature images; performing feature alignment on the plurality of second feature images to obtain an aligned feature image of the remote sensing image; and carrying out target detection on the alignment feature map to obtain a target detection result in the remote sensing image. According to the method, after multi-scale feature extraction is carried out on the remote sensing image, context semantic feature enhancement can be carried out on a plurality of first feature images, a plurality of second feature images with context global features are obtained, feature alignment is carried out, an aligned feature image with good image quality is obtained, and then a target detection result in the remote sensing image with high accuracy can be obtained.

Description

Target detection method and device for remote sensing image, electronic equipment and storage medium

Technical Field

The present invention relates to the field of image data processing technologies, and in particular, to a target detection method and apparatus for a remote sensing image, an electronic device, and a storage medium.

Background

With the rapid development of multispectral remote sensing imaging technology, the spatial resolution of remote sensing images has steadily increased. In early low-resolution remote sensing images, target detection could often distinguish only coarse ground-object categories, whereas in high-resolution remote sensing images, ground-object targets can be detected and identified automatically using methods such as image processing and deep learning.

The target detection method of the remote sensing image has wide application prospect in the aspects of intelligent monitoring of military, ecology and the like, such as natural disaster detection based on the remote sensing image, military base and target airplane detection and the like. However, compared with a natural image, the background and the content of the remote sensing image are more complex, and when the target detection result is determined by adopting a traditional target detection method, semantic ambiguity easily occurs, the detection precision is affected, and the finally obtained target detection result is not accurate enough.

Disclosure of Invention

The invention provides a target detection method and device for a remote sensing image, an electronic device, and a storage medium, which address the following defect of the prior art: because the background and content of a remote sensing image are complex, semantic ambiguity easily arises when a traditional target detection method is used to determine a target detection result, which lowers detection precision and makes the final target detection result inaccurate.

The invention provides a target detection method of a remote sensing image, which comprises the following steps:

extracting features of the remote sensing images according to different scales to obtain first feature images under the different scales;

respectively carrying out context semantic feature enhancement on the plurality of first feature images to obtain a plurality of second feature images;

performing feature alignment on the plurality of second feature images to obtain an aligned feature image of the remote sensing image;

and carrying out target detection on the alignment feature map to obtain a target detection result in the remote sensing image.

According to the target detection method for a remote sensing image provided by the invention, performing context semantic feature enhancement on the plurality of first feature maps respectively to obtain a plurality of second feature maps comprises: determining, according to the first one of the plurality of first feature maps and the first feature map adjacent to it, the first one of the second feature maps, which corresponds to that first one of the first feature maps; for each other first feature map except the first one, determining the other second feature map corresponding to that other first feature map according to the other first feature map, the first feature maps adjacent to it, and the second feature map corresponding to the first feature map preceding it; and determining the first one of the second feature maps and the other second feature maps as the plurality of second feature maps.

According to the target detection method for a remote sensing image provided by the invention, determining the first one of the second feature maps according to the first one of the plurality of first feature maps and the first feature map adjacent to it comprises: determining the first one of the second feature maps according to a first formula. The first formula and its symbols were rendered as images and are not reproduced in the source text; the symbols represent, respectively, the first one of the second feature maps, the first one of the first feature maps, and the first feature map adjacent to the first one of the first feature maps;

determining the other second feature map corresponding to the other first feature map, according to the other first feature map, the first feature maps adjacent to it, and the second feature map corresponding to the first feature map preceding it, comprises: determining the other second feature map according to a second formula. The second formula and its symbols were rendered as images and are not reproduced in the source text; the symbols represent, respectively, the other second feature map, the other first feature map, the first feature maps adjacent to the other first feature map, and the second feature map corresponding to the first feature map preceding the other first feature map.

According to the target detection method for a remote sensing image provided by the invention, performing feature alignment on the plurality of second feature maps to obtain an aligned feature map of the remote sensing image comprises: acquiring anchor box information corresponding to the remote sensing image; and performing feature alignment on the plurality of second feature maps according to the anchor box information to obtain the aligned feature map of the remote sensing image.

According to the target detection method for the remote sensing image provided by the invention, the feature alignment is carried out on the plurality of second feature images according to the anchor point box information to obtain an aligned feature image of the remote sensing image, and the target detection method comprises the following steps: determining an offset field according to the anchor point box information; and carrying out alignment convolution on the offset field and the plurality of second feature images to obtain an alignment feature image of the remote sensing image.

According to the target detection method for the remote sensing image provided by the invention, the target detection is carried out on the alignment feature map to obtain a target detection result in the remote sensing image, and the target detection method comprises the following steps: encoding the direction information of the targets in the alignment feature images by adopting an active rotation filter to obtain filter feature images of channels in different directions; and determining a target detection result in the remote sensing image according to the plurality of filtering feature diagrams.

According to the target detection method for the remote sensing image provided by the invention, the feature extraction is carried out on the remote sensing image according to different scales to obtain the first feature map under the different scales, and the target detection method comprises the following steps: extracting features of the remote sensing images according to different scales to obtain initial feature images under the different scales; and adopting an attention mechanism to respectively perform feature capturing on the plurality of initial feature images to obtain first feature images under different scales.

The invention also provides a target detection device for a remote sensing image, which comprises:

the multi-scale feature extraction module is used for carrying out feature extraction on the remote sensing image according to different scales to obtain a first feature map under the different scales;

the bidirectional feature pyramid module is used for respectively carrying out context semantic feature enhancement on the plurality of first feature images to obtain a plurality of second feature images;

the aggregated feature alignment module is used for carrying out feature alignment on the plurality of second feature images to obtain an aligned feature image of the remote sensing image;

and the object-oriented directional detection module is used for carrying out target detection on the alignment feature map to obtain a target detection result in the remote sensing image.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the target detection method of the remote sensing image when executing the program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of target detection of a remote sensing image as described in any of the above.

The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of target detection of a remote sensing image as described in any one of the above.

According to the target detection method and device for a remote sensing image, the electronic device, and the storage medium provided by the invention, features of the remote sensing image are extracted at different scales to obtain first feature maps at the different scales; context semantic feature enhancement is performed on each of the plurality of first feature maps to obtain a plurality of second feature maps; feature alignment is performed on the plurality of second feature maps to obtain an aligned feature map of the remote sensing image; and target detection is performed on the aligned feature map to obtain a target detection result in the remote sensing image. The method addresses the defect of the prior art that, because the background and content of a remote sensing image are complex, semantic ambiguity easily arises when a traditional target detection method is used, which lowers detection precision and makes the final target detection result insufficiently accurate. After multi-scale feature extraction is performed on the remote sensing image, context semantic feature enhancement is performed on the plurality of first feature maps to obtain a plurality of second feature maps carrying contextual global features; feature alignment is then performed on the second feature maps to obtain an aligned feature map of better quality, from which a more accurate target detection result in the remote sensing image can be obtained.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a target detection method of a remote sensing image provided by the invention;

FIG. 2 is a schematic diagram of the multi-scale feature extraction module with an embedded attention mechanism provided by the present invention;

FIG. 3 is a schematic diagram of a bi-directional feature pyramid module provided by the present invention;

FIG. 4 is a schematic view of a target detection method of a remote sensing image according to the present invention;

FIG. 5 is a schematic diagram of a target detection device for remote sensing images according to the present invention;

FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the prior art, aircraft target detection in multispectral remote sensing images/aerial images aims to identify the location and class of aircraft of interest. Within the framework of deep convolutional neural networks, object detection in aerial images (Object Detection in Aerial Images, ODAI) has made significant progress in recent years, and most existing approaches address the challenges posed by large scale variations and arbitrary orientations of crowded objects in aerial images. To obtain better detection performance, most state-of-the-art aerial object detectors rely on a complex region-based convolutional neural network (Region-based Convolutional Neural Network, R-CNN) framework consisting of two parts: a region proposal network (Region Proposal Network, RPN) and an R-CNN detection head. In the general pipeline, the region proposal network (such as a pyramid network of interest) generates high-quality regions of interest (Region of Interest, RoI) from horizontal anchors to obtain horizontal RoIs; an RoI pooling operator then extracts accurate features from the horizontal RoIs; and finally the R-CNN detection head regresses the bounding boxes and classifies the features.

Notably, horizontal RoIs typically cause severe misalignment between bounding boxes and oriented objects; for example, because objects in aerial images are oriented and densely packed, a horizontal RoI often contains multiple instances. A natural solution is to use oriented bounding boxes as anchors, but this requires carefully designed anchors with different angles, scales, and aspect ratios, which leads to heavy computation and memory usage. Recently, the RoI Transformer has been proposed to address this problem by converting horizontal RoIs into rotated RoIs, avoiding a large number of anchors, but it still requires heuristically defined anchors and complex RoI operations. In contrast to R-CNN detectors, single-stage detectors regress the bounding boxes and classify them directly using regular, densely sampled anchors. This architecture is computationally efficient, but its detection precision lags behind, so the final target detection result is inaccurate.

Targets in multispectral remote sensing images are typically dense, widely distributed, and arbitrarily oriented. General object detection methods that use horizontal anchors usually suffer from severe misalignment in this case, because one anchor/RoI may contain multiple instances. Some approaches alleviate this problem with rotated anchors of different angles, scales, and aspect ratios, but these involve extensive anchor-related computation (e.g., bounding box transformation and ground truth matching). Existing region proposal networks can convert horizontal RoIs into rotated RoIs, which avoids a large number of anchors and alleviates the misalignment, but they still require heuristically defined anchors and complex RoI operations. Another approach glides the vertices of the horizontal bounding box to describe an oriented object accurately, but the corresponding RoI features remain horizontal and are still misaligned. R3Det samples features from five locations of the corresponding anchor box (e.g., the center and the corners) and sums them to re-encode the location information.

In a two-stage object detector, an RoI operator (e.g., RoI pooling, RoI Align, or deformable RoI pooling) is used to extract fixed-length features inside the RoI that approximately represent the position of the object. RoI pooling first divides the RoI into a grid of sub-regions and then max-pools each sub-region into the corresponding output grid cell. However, RoI pooling quantizes the floating-point boundaries of the RoI to integers, which causes misalignment between the RoI and the features. To avoid this quantization, RoI Align uses bilinear interpolation to compute the value at each sampling point in a sub-region, which significantly improves localization performance. Meanwhile, deformable RoI pooling adds an offset to each sub-region of the RoI, enabling adaptive feature selection. However, RoI operators typically involve a large number of region-wise operations, such as feature warping and feature interpolation, which become a bottleneck for fast target detection. More recently, Guided Anchoring attempts to align features under the guidance of anchor shapes: an offset field is learned from the anchor prediction map and then guides a deformable convolution to extract aligned features. AlignDet designs an RoI convolution to achieve the same effect as RoI Align in a single-stage detector. These methods are suitable for target detection in natural images, but tend to lose performance when detecting oriented and dense objects in aerial images.
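The difference between quantized sampling and bilinear interpolation described above can be illustrated with a minimal sketch. The function names `sample_quantized` and `sample_bilinear` and the toy 3×3 feature map are assumptions made for illustration only; they are not part of the invention or of any particular library.

```python
import math

def sample_quantized(feature, y, x):
    """Pooling-style lookup: floor the floating-point coordinates to integers."""
    return feature[int(math.floor(y))][int(math.floor(x))]

def sample_bilinear(feature, y, x):
    """Align-style lookup: bilinear interpolation of the four nearest cells."""
    h, w = len(feature), len(feature[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

# Toy feature map (hypothetical values).
feature = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]

# The same fractional sampling point gives different values under each scheme.
quantized = sample_quantized(feature, 0.5, 1.5)
bilinear = sample_bilinear(feature, 0.5, 1.5)
print(quantized, bilinear)
```

In this toy case the quantized lookup snaps to a single cell, while the interpolated lookup blends the four surrounding cells, which is why interpolation keeps the sampled feature closer to the true fractional position.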

In summary, the following problems exist in the detection of aircraft targets in multispectral remote sensing images/aerial images:

(1) The background is complex, and objects are dense and widely distributed. Meanwhile, heuristically defined anchors are of low quality and cannot cover the objects, which causes misalignment between objects and anchors. For example, large and small aircraft span a wide range of aspect ratios, typically between 1/3 and 1/30, so only a few anchors, or even none, can be assigned to them. Such misalignment often aggravates the imbalance between the background and the different object classes and severely degrades detection performance.

(2) When a simple feature extractor acquires aircraft features, redundant information is reused many times and the ability to represent contextual global information and local details is neglected. When a pyramid structure is used to aggregate features, mostly only unidirectionally propagated information flow is considered, so detail information is lost and the final detection and localization performance suffers.

(3) The convolutional features of a backbone network are typically axis-aligned with a fixed receptive field, whereas aircraft targets in multispectral remote sensing images tend to be arbitrarily oriented and to vary in appearance. Even when an anchor box is assigned to an object with high confidence, there is still misalignment between the anchor box and the convolutional features. In short, the features corresponding to an anchor box can hardly represent the entire aircraft target. As a result, the final classification score cannot accurately reflect the localization accuracy, which severely affects the post-processing stage (screening of anchor boxes by non-maximum suppression (Non Maximum Suppression, NMS)).

In summary, the multispectral remote sensing image/aerial image aircraft target detection in the prior art has certain limitations, which can lead to inaccurate target detection results finally obtained.

In order to solve the above problems, the invention provides a target detection method and device for a remote sensing image, an electronic device, and a storage medium, in which features of the remote sensing image are extracted at different scales to obtain first feature maps at the different scales; context semantic feature enhancement is performed on each of the plurality of first feature maps to obtain a plurality of second feature maps; feature alignment is performed on the plurality of second feature maps to obtain an aligned feature map of the remote sensing image; and target detection is performed on the aligned feature map to obtain a target detection result in the remote sensing image. The method addresses the defect of the prior art that, because the background and content of a remote sensing image are complex, semantic ambiguity easily arises when a traditional target detection method is used, which lowers detection precision and makes the final target detection result insufficiently accurate. After multi-scale feature extraction is performed on the remote sensing image, context semantic feature enhancement is performed on the plurality of first feature maps to obtain a plurality of second feature maps carrying contextual global features; feature alignment is then performed on the second feature maps to obtain an aligned feature map of better quality, from which a more accurate target detection result in the remote sensing image can be obtained.

It should be noted that, the electronic device according to the embodiment of the present invention may include: computer, mobile terminal, wearable device, etc.

The execution body according to the embodiment of the present invention may be a target detection device for a remote sensing image or an electronic device; the embodiment of the present invention is further described below by taking the electronic device as an example.

As shown in fig. 1, a flow chart of a target detection method for a remote sensing image provided by the present invention may include:

101. Extract features of the remote sensing image at different scales to obtain first feature maps at the different scales.

Here, a scale refers to size information.

The remote sensing image may also be referred to as a satellite image/multispectral remote sensing image, which refers to an image acquired by a satellite, the number of remote sensing images being unlimited.

For example, after the electronic device acquires the remote sensing image, the remote sensing image may be input into the multi-scale feature extraction module, so as to obtain a plurality of first feature maps output by the multi-scale feature extraction module. The multi-scale feature extraction module may include 4 stages (Stage), namely Stage 1, stage 2, stage 3 and Stage 4, each Stage being a subspace, each Stage corresponding to a scale, that is, the multi-scale feature extraction module may include 4 scales, wherein any two of the 4 scales are different. Furthermore, each scale corresponds to a first feature map.

In the process of carrying out feature extraction on the remote sensing image by adopting the multi-scale feature extraction module, the electronic equipment can firstly carry out feature extraction on the remote sensing image according to the scale corresponding to the stage 1 to obtain a first feature map F1 corresponding to the stage 1; then, inputting the first feature map F1 into a stage 2, and extracting features of the first feature map F1 according to the scale corresponding to the stage 2 to obtain a first feature map F2 corresponding to the stage 2; then, inputting the first feature map F2 into a stage 3, and carrying out feature extraction on the first feature map F2 according to the scale corresponding to the stage 3 to obtain a first feature map F3 corresponding to the stage 3; and then, inputting the first characteristic diagram F3 into a stage 4, and carrying out characteristic extraction on the first characteristic diagram F3 according to the scale corresponding to the stage 4 to obtain a first characteristic diagram F4 corresponding to the stage 4. In this way, the electronic device may finally obtain 4 first feature graphs, which are respectively: the first feature map F1, the first feature map F2, the first feature map F3, and the first feature map F4.
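The stage-by-stage flow above, in which each stage consumes the output of the previous one, can be sketched as follows. This is only an illustrative sketch: the 2×2 average pooling used as the per-stage operation, the 16×16 toy image, and the names `avg_pool_2x2` and `extract_multiscale` are assumptions, since the text does not specify what each stage computes or how the scales relate.

```python
def avg_pool_2x2(fmap):
    """Halve height and width by averaging non-overlapping 2x2 windows."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            (fmap[2 * i][2 * j] + fmap[2 * i][2 * j + 1]
             + fmap[2 * i + 1][2 * j] + fmap[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]

def extract_multiscale(image, num_stages=4):
    """Feed each stage's output into the next stage and collect F1..F4."""
    feature_maps = []
    current = image
    for _ in range(num_stages):
        current = avg_pool_2x2(current)  # hypothetical stand-in for a real stage
        feature_maps.append(current)
    return feature_maps

# Toy single-channel "image" (hypothetical values).
image = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
F1, F2, F3, F4 = extract_multiscale(image)
shapes = [(len(f), len(f[0])) for f in (F1, F2, F3, F4)]
print(shapes)
```

Each returned map has a different spatial size, mirroring the one-scale-per-stage arrangement; a real backbone would replace the pooling stand-in with learned convolutional blocks.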

In some embodiments, the electronic device performs feature extraction on the remote sensing image according to different scales, to obtain a first feature map under the different scales, and may include: the electronic equipment performs feature extraction on the remote sensing image according to different scales to obtain initial feature images under different scales; the electronic equipment adopts an Attention (Attention) mechanism to respectively perform feature capture on a plurality of initial feature images to obtain first feature images under different scales.

The attention mechanism can reduce the use of redundant information in the attention fusion process, and improve the representation of multi-scale local features, so that the accurate capture of detail features can be respectively carried out on a plurality of initial feature graphs.

For example, FIG. 2 shows a schematic structure of the multi-scale feature extraction module with the embedded attention mechanism provided by the present invention. As can be seen from FIG. 2, the attention mechanism is embedded in each stage of the multi-scale feature extraction module; the module is thus a Res2Net backbone network with an embedded attention mechanism.

In the process of extracting features of the remote sensing image with the multi-scale feature extraction module, the electronic device may first perform a convolution operation with a 1×1 kernel (i.e., Conv1×1(·)) on the remote sensing image to obtain a third feature map, and then obtain, in the 4 stages of the multi-scale feature extraction module, 4 fourth feature maps corresponding to the third feature map, namely: fourth feature map f1, fourth feature map f2, fourth feature map f3, and fourth feature map f4. The remote sensing image may be denoted by x, with $x \in \mathbb{R}^{H \times W \times C}$, where H represents the height of the remote sensing image x, W represents its width, and C represents its number of feature channels.

Then, the fourth feature map f1 is determined as an initial feature map f1'; the fourth feature map f2, the fourth feature map f3, and the fourth feature map f4 are each subjected to a convolution operation with a 3×3 kernel (i.e., Conv3×3(·)) to obtain an initial feature map f2', an initial feature map f3', and an initial feature map f4'.

Then, an attention mechanism (i.e., Attention(·)) may be used to perform feature capture on the initial feature map f1', the initial feature map f2', the initial feature map f3', and the initial feature map f4' respectively, thereby refining them, and the third feature map is fused to obtain a fifth feature map f1'', a fifth feature map f2'', a fifth feature map f3'', and a fifth feature map f4''.

Finally, a convolution operation with a 1×1 kernel is performed on the fifth feature map f1'', the fifth feature map f2'', the fifth feature map f3'', and the fifth feature map f4'', and the third feature map is fused again, so that 4 more accurate first feature maps can be obtained, namely: the first feature map F1, the first feature map F2, the first feature map F3, and the first feature map F4.

To sum up, the whole process can be expressed by { F1, F2, F3, F4} = AttRes2Net (x).

Wherein three symbols were rendered as images in the original and are not reproduced in the source text; they denote, respectively, the first feature maps at the different scales, the fifth feature maps in the different subspaces, and the convolution results of the fifth feature maps in the different subspaces. Of the remaining symbols, s ∈ {1, 2, 3, 4} denotes the scale; f_conv1×1(x) denotes the third feature map; p_k' denotes the initial feature map in the k-th subspace, with k ∈ {1, 2, 3, 4} denoting the subspace; and α_s denotes the attention coefficient corresponding to each scale.
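The data flow of the attention-embedded module (1×1 convolution, split into four subspaces, 3×3 convolution on all but the first, attention refinement with fusion, then a final 1×1 convolution with fusion) can be traced with a minimal sketch. Everything concrete here is an assumption made for illustration: feature maps are flat lists of channel values, `conv1x1` is a plain scaling, `conv3x3` is a 3-tap moving average, `attention` is a softmax reweighting, and "fusing the third feature map" is taken to mean adding the matching slice of the third feature map. The actual formulas are not reproduced in the source text, and this simplified sketch does not model the differing scales of F1 to F4.

```python
import math

def conv1x1(values, weight=1.0):
    """Stand-in for Conv1x1: a per-channel scaling (hypothetical)."""
    return [weight * v for v in values]

def conv3x3(values):
    """Stand-in for Conv3x3: a 3-tap moving average (hypothetical)."""
    n = len(values)
    out = []
    for i in range(n):
        window = values[max(0, i - 1):min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out

def attention(values):
    """Stand-in for Attention: scale each entry by its softmax coefficient."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [v * e / total for v, e in zip(values, exps)]

def att_res2net_block(x, groups=4):
    third = conv1x1(x)                                             # third feature map
    size = len(third) // groups
    fourth = [third[g * size:(g + 1) * size] for g in range(groups)]  # f1..f4
    initial = [fourth[0]] + [conv3x3(f) for f in fourth[1:]]       # f1'..f4'
    fifth = [                                                      # f1''..f4''
        [a + r for a, r in zip(attention(p), f)]
        for p, f in zip(initial, fourth)
    ]
    return [                                                       # F1..F4
        [c + r for c, r in zip(conv1x1(q), f)]
        for q, f in zip(fifth, fourth)
    ]

F1, F2, F3, F4 = att_res2net_block([1.0] * 8)
print(F1, F2, F3, F4)
```

The point of the sketch is the ordering of operations and the two fusion steps, not the numerical behavior of any particular convolution or attention layer.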

Optionally, the method may further include: the electronic device supervises and adjusts the multi-scale feature extraction module with the embedded attention mechanism according to a first loss function.

Wherein the first loss function is given by a formula that is not reproduced in the source text; its symbols represent, respectively, the loss deviation of the multi-scale feature extraction module with the embedded attention mechanism, the real category of the target in the remote sensing image, and the predicted category of the target.

Wherein the target may be an aircraft. Optionally, the classes of aircraft may include 11 classes in total: reconnaissance aircraft, transport aircraft, helicopters, bombers, tankers, attack aircraft, fighters, trainers, anti-submarine aircraft, early-warning aircraft, and others.

102. Perform context semantic feature enhancement on each of the plurality of first feature maps to obtain a plurality of second feature maps.

After the electronic device obtains the first feature maps, it may input them into a bidirectional feature pyramid module (Bi-Feature Pyramid Networks, BiFPM) to obtain the second feature maps output by that module. The bidirectional feature pyramid module acquires contextual global features in two directions; that is, it aggregates the feature information in the plurality of first feature maps both from top to bottom and from bottom to top, so that the context semantic feature information in the first feature maps is effectively enhanced and a plurality of more accurate second feature maps are obtained.

In some embodiments, the electronic device performing context semantic feature enhancement on the plurality of first feature maps to obtain a plurality of second feature maps may include: the electronic device determines, according to the first one of the plurality of first feature maps and the first feature map adjacent to it, the first one of the second feature maps, which corresponds to that first one of the first feature maps; for each other first feature map except the first one, the electronic device determines the other second feature map corresponding to that other first feature map according to the other first feature map, the first feature maps adjacent to it, and the second feature map corresponding to the first feature map preceding it; the electronic device determines the first one of the second feature maps and the other second feature maps as the plurality of second feature maps.

Illustratively, stage 1 corresponds to the first one of the first feature maps, namely first feature map F1, and stage 2, stage 3, and stage 4 correspond to the other first feature maps, namely first feature map F2, first feature map F3, and first feature map F4, respectively. Fig. 3 is a schematic structural diagram of the bidirectional feature pyramid module provided by the present invention. As can be seen from Fig. 3, in the process of performing context semantic feature enhancement on the plurality of first feature maps by using the bidirectional feature pyramid module, the electronic device may determine the second feature map F1' corresponding to the first feature map F1 according to the first feature map F1 and the first feature map F2; then determine the second feature map F2' corresponding to the first feature map F2 according to the first feature map F1, the first feature map F2, the first feature map F3, and the second feature map F1'; then determine the second feature map F3' corresponding to the first feature map F3 according to the first feature map F2, the first feature map F3, the first feature map F4, and the second feature map F2'; and finally determine the second feature map F4' corresponding to the first feature map F4 according to the first feature map F3, the first feature map F4, and the second feature map F3'. In this way, the bidirectional feature pyramid module aggregates the feature streams both from bottom to top and from top to bottom, obtaining the second feature maps F1', F2', F3', and F4' with higher accuracy.

Wherein, in Fig. 3, "↑" and "↗" indicate that feature streams are aggregated from bottom to top; "↓" and "↘" indicate that feature streams are aggregated from top to bottom.

To sum up, the whole process can be expressed by { F1', F2', F3', F4' } = BiFPM (F1, F2, F3, F4).
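The data flow described for Fig. 3 can be summarized as a dependency table. The following Python sketch only lists which maps feed each second feature map; the fusion operators themselves are given by the first and second formulas and are not modeled here. The function name `bifpm_inputs` is hypothetical and used only for illustration.

```python
def bifpm_inputs(n):
    """Return, for each stage k = 1..n, the maps used to build the second map Fk'.

    Follows the textual description: F1' uses F1 and its neighbour F2; every
    other Fk' uses Fk, its adjacent first feature maps, and the previous
    second feature map F(k-1)'.
    """
    deps = {}
    for k in range(1, n + 1):
        names = []
        if k > 1:
            names.append(f"F{k - 1}")   # lower adjacent first feature map
        names.append(f"F{k}")           # the stage's own first feature map
        if k < n:
            names.append(f"F{k + 1}")   # upper adjacent first feature map
        if k > 1:
            names.append(f"F{k - 1}'")  # previously computed second feature map
        deps[f"F{k}'"] = names
    return deps


if __name__ == "__main__":
    for target, sources in bifpm_inputs(4).items():
        print(target, "<-", ", ".join(sources))
```

With n = 4 the table reproduces the four cases listed for F1' to F4'.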

In some embodiments, the determining, by the electronic device, of the first second feature map according to the first one of the plurality of first feature maps and the first feature map adjacent to it may include: the electronic device determines the first second feature map corresponding to the first feature map according to a first formula.

Wherein, the first formula is:

Here, the three terms of the first formula represent, in order: the first second feature map, namely the second feature map F1'; the first one of the first feature maps, namely the first feature map F1; and the first feature map adjacent to it, namely the first feature map F2.

The electronic device can accurately determine the second feature map F1' according to the first formula.

In some embodiments, the determining, by the electronic device, of the other second feature map according to the other first feature map, the first feature maps adjacent to the other first feature map, and the second feature map corresponding to the first feature map preceding the other first feature map may include: the electronic device determines the other second feature maps corresponding to the other first feature maps according to a second formula.

Wherein, the second formula is:

Here, the four terms of the second formula represent, in order: the other second feature map; the other first feature map; the first feature maps adjacent to the other first feature map; and the second feature map corresponding to the first feature map preceding the other first feature map.

Illustratively, n has a value of 4. For the second feature map F2', the terms of the second formula represent, in order, the second feature map F2', the first feature map F2, the first feature map F1, the first feature map F3, and the second feature map F1'.

For the second feature map F3', the terms of the second formula represent, in order, the second feature map F3', the first feature map F3, the first feature map F2, the first feature map F4, and the second feature map F2'.

For the second feature map F4', the terms of the second formula represent, in order, the second feature map F4', the first feature map F4, the first feature map F3, and the second feature map F3'.

103. And carrying out feature alignment on the plurality of second feature images to obtain an aligned feature image of the remote sensing image.

After the electronic device acquires the plurality of second feature maps, the plurality of second feature maps can be input into an aggregated feature alignment module (Convergent Feature Alignment Module, CFAM) to obtain the alignment feature map of the remote sensing image output by the aggregated feature alignment module. The aggregated feature alignment module may comprise an anchor refinement component (Anchor Refinement Component, ARC) and an aligned convolution layer (Aligned Convolution Layers, ACL). The anchor refinement component may include an anchor classification branch and an anchor regression branch: the anchor classification branch divides anchors into different categories, and the anchor regression branch refines horizontal anchors into high-quality rotated anchors. In addition, an alignment convolution is embedded in the aggregated feature alignment module to extract aligned features, so that an alignment feature map with higher accuracy is obtained. Typically, the regressed anchor boxes may be used to adjust the sampling locations in the alignment convolution.

The aggregated feature alignment module CFAM produces its output through a regression operation.

In some embodiments, the electronic device performs feature alignment on the plurality of second feature maps to obtain an aligned feature map of the remote sensing image, which may include: the electronic equipment acquires anchor point box information corresponding to the remote sensing image; and the electronic equipment performs characteristic alignment on the plurality of second characteristic images according to the anchor point box information to obtain an aligned characteristic image of the remote sensing image.

The anchor point box information can be expressed as (χ, W, H, β), where χ represents the current pixel position of the target in the remote sensing image, W and H represent the width and height of the anchor box, and β represents the rotation angle.
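As an illustration of this parameterization, the following Python sketch converts one rotated anchor into its four corner points. It assumes that the pixel position is the box centre (cx, cy), that the angle is given in radians, and that rotation is counter-clockwise in the (x, y) axes; none of these conventions is stated in the text, and the class name `RotatedAnchor` is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RotatedAnchor:
    cx: float    # assumed: x coordinate of the box centre, in pixels
    cy: float    # assumed: y coordinate of the box centre, in pixels
    w: float     # box width W
    h: float     # box height H
    beta: float  # rotation angle, assumed to be in radians

    def corners(self):
        """Return the four corners, starting from the unrotated top-left one."""
        c, s = math.cos(self.beta), math.sin(self.beta)
        half = [(-self.w / 2, -self.h / 2), (self.w / 2, -self.h / 2),
                (self.w / 2, self.h / 2), (-self.w / 2, self.h / 2)]
        # Rotate each half-extent vector by beta, then translate to the centre.
        return [(self.cx + dx * c - dy * s, self.cy + dx * s + dy * c)
                for dx, dy in half]


if __name__ == "__main__":
    print(RotatedAnchor(100.0, 50.0, 40.0, 20.0, math.pi / 6).corners())
```

A horizontal anchor is the special case beta = 0, in which the corners reduce to the usual axis-aligned box.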

In the process of determining the alignment feature map of the remote sensing image, the electronic device may first use the anchor refinement component to generate high-quality anchor point box information from the remote sensing image, and then combine it with the plurality of second feature maps to obtain the alignment feature map of the remote sensing image.

In some embodiments, the electronic device performs feature alignment on the plurality of second feature maps according to the anchor point box information to obtain an aligned feature map of the remote sensing image, and may include: the electronic equipment determines an offset field according to the anchor point box information; the electronic device performs alignment convolution on the offset field and the plurality of second feature images to obtain an alignment feature image of the remote sensing image.

In the process of carrying out feature alignment on a plurality of second feature images according to anchor point box information, the electronic equipment can firstly determine an offset field of the remote sensing image based on the anchor point box information; and then, the electronic equipment inputs the offset field and the plurality of second feature images into an alignment convolution layer to carry out alignment convolution, so as to obtain an alignment feature image of the remote sensing image.

Optionally, the determining, by the electronic device, of the offset field according to the anchor point box information may include: the electronic device determines the offset field of the remote sensing image according to an offset field formula.

The offset field formula is:

Here, Φ represents the offset field; η_p represents a given anchor position in the anchor point box information θ; the sampling-point term represents the i-th sampling point of the anchor position η_p; η_s represents the stride of the feature map; v represents the kernel size; and (w, h) represents the coordinate information of the sampling point.
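The following Python sketch shows one way such an offset field could be computed for a single feature-map location. It is an assumption, not the formula of this document: it follows the commonly published alignment-convolution form, in which the regular kernel grid is scaled to the anchor size, rotated by the anchor angle, mapped to feature-map coordinates by the stride, and compared with the regular sampling positions. The function name `offset_field` and the anchor layout (x, y, w, h, theta) are hypothetical.

```python
import math


def offset_field(p, anchor, stride, k=3):
    """Offsets of the k*k sampling points at feature-map location p = (px, py).

    anchor = (x, y, w, h, theta) in image coordinates (assumed layout).
    Each offset is: anchor-based sampling position minus regular position.
    """
    px, py = p
    x, y, w, h, theta = anchor
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half = k // 2
    offsets = []
    for ry in range(-half, half + 1):
        for rx in range(-half, half + 1):
            # Scale the regular grid point to the anchor size.
            dx, dy = w * rx / k, h * ry / k
            # Rotate it by the anchor angle.
            ux = dx * cos_t - dy * sin_t
            uy = dx * sin_t + dy * cos_t
            # Map the image-space position to feature-map coordinates.
            lx, ly = (x + ux) / stride, (y + uy) / stride
            # Subtract the regular sampling position p + r.
            offsets.append((lx - px - rx, ly - py - ry))
    return offsets


if __name__ == "__main__":
    print(offset_field((5, 7), (40.0, 56.0, 48.0, 24.0, 0.0), stride=8))
```

Under these assumptions, an unrotated anchor centred on the location and exactly k strides wide and high yields zero offsets, so the alignment convolution degenerates to a standard convolution.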

Optionally, the method may further include: the electronic device supervises and adjusts the aggregated feature alignment module according to a second loss function. Wherein the second loss function is:

Here, δ_CFAM represents the loss of the aggregated feature alignment module; N_CF represents the number of correct samples of the aggregated feature alignment module; δ_C represents the focal classification loss; the two class terms represent the predicted class of the aggregated feature alignment module and the true class of the target; a further term represents an exponential function; δ_r represents the smooth L1 regression loss; and the two position terms represent the predicted position information of the aggregated feature alignment module and the true position information of the target.
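The two named loss terms can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the second loss function itself: the focal-loss hyperparameters (alpha = 0.25, gamma = 2) are commonly used defaults rather than values from this document, the regression term is applied to positive samples only, and the sum is normalized by the number of positive samples. The function names are hypothetical.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a predicted probability p in (0, 1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic for |x| < beta, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def module_loss(samples):
    """samples: list of (p, y, predicted_box, true_box) with equal-length box tuples.

    Assumed form: classification over all samples, regression over positive
    samples only, normalized by the number of positive samples.
    """
    n_pos = max(1, sum(1 for _, y, _, _ in samples if y == 1))
    cls = sum(focal_loss(p, y) for p, y, _, _ in samples)
    reg = sum(smooth_l1(a - b)
              for _, y, pred, true in samples if y == 1
              for a, b in zip(pred, true))
    return (cls + reg) / n_pos


if __name__ == "__main__":
    batch = [(0.9, 1, (10.0, 10.0, 4.0, 2.0, 0.1), (10.5, 10.0, 4.0, 2.0, 0.0)),
             (0.2, 0, (0.0,) * 5, (0.0,) * 5)]
    print(module_loss(batch))
```

The same two building blocks could be reused for the third loss function, with the balance parameter weighting the regression term.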

104. And carrying out target detection on the alignment feature map to obtain a target detection result in the remote sensing image.

After the electronic device acquires the alignment feature map, the alignment feature map may be input into an object-oriented orientation detection module (Object-Oriented Directional Detection Module, OODM) to obtain the target detection result in the remote sensing image output by the object-oriented orientation detection module. The object-oriented orientation detection module may comprise an active rotation filter; direction-sensitive features and direction-invariant features are generated based on the active rotation filter, which alleviates the inconsistency between classification scores and positioning accuracy, so that the target detection result finally obtained by the electronic device is accurate.

In some embodiments, the electronic device performs target detection on the alignment feature map to obtain a target detection result in the remote sensing image, which may include: the electronic equipment adopts an active rotation filter to encode the direction information of the target in the alignment feature map so as to obtain filter feature maps of channels in different directions; the electronic device determines a target detection result in the remote sensing image according to the plurality of filtering feature diagrams.

In the process of determining the target detection result in the remote sensing image by using the active rotation filter, the electronic device may first encode the direction information of the target in the alignment feature map, carrying out N−1 convolution operations during encoding to generate filtering feature maps with N direction channels, where N ≥ 1; then the electronic device determines the target detection result in the remote sensing image according to the N filtering feature maps.

Optionally, the filtering feature map of the channel in the μ-th direction, μ = 0, …, N−1, may be determined from the following quantities: a direction angle in the direction information; the n-th direction channel of the filter feature; and F⁽ⁿ⁾, the n-th direction channel of the mapping feature F of the alignment feature map.

It should be noted that, when the active rotation filter is applied to the convolution layer, direction-sensitive features can be obtained by explicitly encoding the direction information, and the direction-sensitive features are then pooled to extract direction-invariant features. Notably, the bounding box regression task benefits from direction-sensitive features, while the object classification task benefits from direction-invariant features. In addition, compared with direction-sensitive features, direction-invariant features require fewer parameters and are more efficient. In this way, the target detection result determined by the electronic device through the active rotation filter is accurate.
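The relation between the two kinds of features can be illustrated with a small Python sketch. It assumes an idealized model in which the N direction channels correspond to angles μ·2π/N and in which rotating the input by one such step cyclically shifts the direction channels; the use of max pooling over channels is likewise an assumption, since the text only states that the direction-sensitive features are pooled. All function names are hypothetical.

```python
import math


def direction_angles(n):
    """Assumed direction angle of each channel: mu * 2*pi / n for mu = 0..n-1."""
    return [mu * 2.0 * math.pi / n for mu in range(n)]


def rotate_channels(responses, steps):
    """Cyclically shift direction channels (idealized effect of rotating the input)."""
    steps %= len(responses)
    if steps == 0:
        return list(responses)
    return responses[-steps:] + responses[:-steps]


def orientation_invariant(responses):
    """Pool over the direction channels; the maximum ignores channel order."""
    return max(responses)


if __name__ == "__main__":
    sensitive = [0.1, 0.9, 0.3, 0.2]          # one response per direction channel
    rotated = rotate_channels(sensitive, 1)   # same target, rotated by one step
    print(sensitive, rotated)
    print(orientation_invariant(sensitive), orientation_invariant(rotated))
```

The per-channel responses change with the rotation, which is what box regression needs, while the pooled value stays the same, which is what classification needs.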

Optionally, the method may further include: the electronic device supervises and adjusts the object-oriented orientation detection module according to a third loss function. Wherein the third loss function is:

Here, δ_OODM represents the loss of the object-oriented orientation detection module; λ represents a preset balance parameter; N_OOD represents the number of correct samples of the object-oriented orientation detection module; and the remaining two terms represent the predicted class and the predicted bounding box information of the object-oriented orientation detection module.

Optionally, after step 104, the method may further include: and the electronic equipment outputs the target detection result on the display screen.

Thus, a user can intuitively check the target detection result of the remote sensing image.

In the embodiment of the invention, feature extraction is performed on the remote sensing image according to different scales to obtain first feature maps under the different scales; context semantic feature enhancement is performed on the plurality of first feature maps respectively to obtain a plurality of second feature maps; feature alignment is performed on the plurality of second feature maps to obtain an alignment feature map of the remote sensing image; and target detection is performed on the alignment feature map to obtain a target detection result in the remote sensing image. The method addresses a defect of the prior art: because the background and content of a remote sensing image are complex, semantic ambiguity easily occurs when a traditional target detection method is used, which affects detection precision and makes the final target detection result insufficiently accurate. In the method, after multi-scale feature extraction is performed on the remote sensing image, context semantic feature enhancement is performed on the plurality of first feature maps to obtain a plurality of second feature maps carrying contextual global features; feature alignment is then performed on these second feature maps to obtain an alignment feature map of better quality, from which a target detection result of higher accuracy can be obtained.

Illustratively, Fig. 4 is a schematic diagram of a scenario of the target detection method for a remote sensing image provided by the present invention. As can be seen from Fig. 4, the electronic device is provided with a multi-scale feature extraction module, a bidirectional feature pyramid module, an aggregated feature alignment module, and an object-oriented orientation detection module. In the process of performing target detection on the remote sensing image, the electronic device may use the multi-scale feature extraction module to determine the respective first feature maps of the 4 stages; then use the bidirectional feature pyramid module to determine the second feature maps corresponding to the 4 first feature maps; then use the aggregated feature alignment module to perform feature alignment on the 4 second feature maps to obtain the alignment feature map of the remote sensing image; and finally use the object-oriented orientation detection module to perform target detection on the alignment feature map to obtain the target detection result in the remote sensing image.

It should be noted that, the multi-scale feature extraction module, the bidirectional feature pyramid module, the aggregate feature alignment module and the object-oriented orientation detection module may form an alignment convolution detection framework enhanced by the bidirectional feature pyramid, where the framework is intended to describe features of the target in the remote sensing image from different directions and different scales, so as to fully obtain the contextual feature and the global detail feature of the target, so that the target detection result finally obtained by the electronic device is more accurate.
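The way the four modules chain together can be sketched as a plain function composition. The sketch below is only structural: the four callables are placeholders supplied by the caller, and the function name `detect` and its parameter names are hypothetical.

```python
def detect(image, extract, bifpm, cfam, oodm):
    """Run the four modules in the order described for the framework."""
    first_maps = extract(image)       # multi-scale feature extraction
    second_maps = bifpm(first_maps)   # context semantic feature enhancement
    aligned = cfam(second_maps)       # feature alignment
    return oodm(aligned)              # target detection on the aligned map


if __name__ == "__main__":
    # Stub modules that only trace the data flow; real modules would be networks.
    result = detect(
        "image",
        extract=lambda img: [f"F{k}({img})" for k in range(1, 5)],
        bifpm=lambda maps: [m + "'" for m in maps],
        cfam=lambda maps: "aligned[" + ", ".join(maps) + "]",
        oodm=lambda aligned: {"input": aligned, "detections": []},
    )
    print(result)
```

Keeping the modules as separate callables mirrors the text, in which each module is supervised by its own loss function.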

Optionally, a total loss function corresponding to the alignment convolution detection framework enhanced by the bidirectional feature pyramid may be determined according to the first loss function, the second loss function, and the third loss function; in the total loss function, γ represents a learnable balance factor.

The total loss function can effectively supervise the alignment convolution detection framework enhanced by the bidirectional feature pyramid so that it learns a better feature representation, which effectively improves the detection precision of the target in the remote sensing image and makes the finally obtained target detection result accurate.

In addition, the alignment convolution detection framework enhanced by the bidirectional feature pyramid has the following advantages:

(1) Aiming at problems such as the large span of scales across different aircraft targets and low anchor quality, an alignment convolution is embedded in the aircraft detection framework to adaptively form aligned features. Unlike other methods that use densely sampled anchors, only one square anchor is used for each position in the feature map, and the anchor refinement component refines the square anchor into a high-quality rotated anchor. At the same time, the alignment convolution adaptively aligns features according to the shape, size, and direction of the anchor corresponding to the target.

(2) Aiming at the problem that a feature extractor tends to ignore context and local detail, the bidirectional feature pyramid models information flow in both the forward and backward directions, improving the representation of context semantic features. Meanwhile, considering that a plain ResNet backbone network cannot sufficiently capture local details of a target (such as an aircraft), Res2Net, which handles detail better, is adopted as the backbone network to encode the local details of the target in the remote sensing image.

(3) The active rotation filter is adopted to encode the direction information and generate direction-sensitive features, and direction-invariant features are then extracted from the direction-sensitive features. At the same time, the misalignment between axis-aligned convolution features and arbitrarily oriented objects is relieved in a fully convolutional manner. Finally, the features are fed into the regression sub-network and the classification sub-network included in the object-oriented orientation detection module to generate the final prediction, so that a target detection result with higher accuracy is obtained.

Optionally, the collected data may be preprocessed before the modules in the alignment convolution detection framework enhanced by the bidirectional feature pyramid are trained, where the preprocessing may include three parts: data acquisition, data processing, and data sample division.

(1) Data acquisition: 91 satellite images may be acquired and divided into training samples, verification samples, and test samples. In addition, a plurality of oversized images are acquired separately to demonstrate the detection efficiency and detection precision of the detection framework on oversized images.

(2) Data processing: labeling software is used to label the aircraft targets in the 91 satellite images, and preprocessing operations such as rotation, color transformation, and normalization are applied to improve detection precision and increase the number of samples. Next, to ensure that the satellite images can be input into the detection framework, the 91 satellite images are cropped into 1024×1024 patches at multi-scale scaling ratios r = {0.5, 1.0, 1.5} with a sliding-window stride g = 724. Finally, to balance positive and negative samples, 40% of the negative samples are randomly deleted from all samples.

(3) Data sample division: all samples are divided into three parts, namely training samples, verification samples, and test samples: 60% of all samples are randomly selected as training samples, 10% as verification samples, and the remaining 30% as test samples.
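The cropping and division steps can be sketched in Python as follows. The window size (1024), stride (724), scaling ratios (0.5, 1.0, 1.5), and the 60/10/30 split come from the text; the handling of the image border (the last window is shifted back so that it ends at the border), the treatment of scaled images smaller than one window, and the fixed random seed are assumptions made for illustration. The function names are hypothetical.

```python
import random


def window_starts(length, window=1024, stride=724):
    """Start offsets of sliding windows along one axis of a scaled image."""
    if length <= window:
        return [0]  # assumed: a smaller image yields one window (padding needed)
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # assumed: last window ends at the border
    return starts


def crop_boxes(width, height, ratios=(0.5, 1.0, 1.5), window=1024, stride=724):
    """Return (ratio, left, top, right, bottom) for every crop of one image."""
    boxes = []
    for r in ratios:
        sw, sh = round(width * r), round(height * r)
        for top in window_starts(sh, window, stride):
            for left in window_starts(sw, window, stride):
                boxes.append((r, left, top, left + window, top + window))
    return boxes


def split_samples(samples, seed=0):
    """Randomly split samples into 60% training, 10% verification, 30% test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 60 // 100
    n_val = len(items) * 10 // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


if __name__ == "__main__":
    print(len(crop_boxes(2048, 2048)))
    train, val, test = split_samples(range(100))
    print(len(train), len(val), len(test))
```

With a stride of 724 and a window of 1024, neighbouring patches overlap by 300 pixels, which reduces the chance that a target is cut by a patch border in every patch that contains it.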

Table 1:

Aircraft type	Class	Ground truths (Gts)	Detections (Dets)	Recall	AP
Reconnaissance aircraft	Militaryplane01	1032	1290	0.984	0.904
Transport aircraft	Militaryplane02	3008	3892	0.993	0.908
Helicopter	Militaryplane03	7943	8989	0.983	0.907
Bomber	Militaryplane04	1240	1602	1.000	0.999
Tanker aircraft	Militaryplane05	1291	1392	0.997	0.909
Attack aircraft	Militaryplane06	1296	1756	0.994	0.907
Fighter	Militaryplane07	8322	10423	0.971	0.906
Trainer	Militaryplane08	1236	1732	0.950	0.897
Anti-submarine aircraft	Militaryplane09	1356	1604	0.933	0.907
Early-warning aircraft	Militaryplane10	691	783	0.933	0.909
Others	Militaryplane11	2621	3781	0.990	0.904
mAP	–	–	–	–	0.914

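The electronic device performs target detection simulation on remote sensing images containing different targets, where the targets are aircraft; the simulation results are shown in Table 1.

As can be seen from Table 1, the accuracy of the target detection results is high: the AP of each class is at least 0.897, the recall of each class is at least 0.933, and the mAP over the 11 classes is 0.914.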
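The mAP value in Table 1 is the arithmetic mean of the 11 per-class AP values, which can be checked with a few lines of Python. The AP values are copied from Table 1; the variable names are only illustrative.

```python
# Per-class AP values from Table 1, keyed by class identifier.
ap = {
    "Militaryplane01": 0.904, "Militaryplane02": 0.908, "Militaryplane03": 0.907,
    "Militaryplane04": 0.999, "Militaryplane05": 0.909, "Militaryplane06": 0.907,
    "Militaryplane07": 0.906, "Militaryplane08": 0.897, "Militaryplane09": 0.907,
    "Militaryplane10": 0.909, "Militaryplane11": 0.904,
}

# Mean average precision: unweighted mean over the classes.
mean_ap = sum(ap.values()) / len(ap)
print(round(mean_ap, 3))  # 0.914
```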

The target detection device of the remote sensing image provided by the invention is described below, and the target detection device of the remote sensing image described below and the target detection method of the remote sensing image described above can be correspondingly referred to each other.

Fig. 5 is a schematic structural diagram of the target detection device for a remote sensing image provided by the present invention. As shown in Fig. 5, the device may include:

the multi-scale feature extraction module 501 is configured to perform feature extraction on the remote sensing image according to different scales, so as to obtain a first feature map under the different scales;

the bidirectional feature pyramid module 502 is configured to perform context semantic feature enhancement on the plurality of first feature maps to obtain a plurality of second feature maps;

an aggregated feature alignment module 503, configured to perform feature alignment on the plurality of second feature images to obtain an aligned feature image of the remote sensing image;

and the object-oriented orientation detection module 504 is configured to perform object detection on the alignment feature map to obtain an object detection result in the remote sensing image.

Optionally, the bidirectional feature pyramid module 502 is specifically configured to determine, according to a first feature map of the plurality of first feature maps and a first feature map adjacent to the first feature map, a first second feature map corresponding to the first feature map; for each other first feature map except the first feature map, determining other second feature maps corresponding to the other first feature maps according to the other first feature maps, the first feature maps adjacent to the other first feature maps and the second feature maps corresponding to the other first feature maps before the other first feature maps; the first second feature map and the other second feature maps are determined as the plurality of second feature maps.

Optionally, the bidirectional feature pyramid module 502 is specifically configured to determine, according to a first formula, a first second feature map corresponding to the first feature map; wherein the terms of the first formula represent, in order, the first second feature map, the first feature map, and the first feature map adjacent to the first feature map;

the bidirectional feature pyramid module 502 is specifically configured to determine other second feature maps corresponding to the other first feature maps according to a second formula; wherein the terms of the second formula represent, in order, the other second feature map, the other first feature map, the first feature map adjacent to the other first feature map, and the second feature map corresponding to the first feature map preceding the other first feature map.

Optionally, the aggregated feature alignment module 503 is specifically configured to obtain anchor point box information corresponding to the remote sensing image; and perform feature alignment on the plurality of second feature maps according to the anchor point box information to obtain an alignment feature map of the remote sensing image.

Optionally, the aggregated feature alignment module 503 is specifically configured to determine an offset field according to the anchor point box information; and carrying out alignment convolution on the offset field and the plurality of second feature images to obtain an alignment feature image of the remote sensing image.

Optionally, the object-oriented orientation detection module 504 is specifically configured to encode the direction information of the target in the alignment feature map by using an active rotation filter to obtain a filtering feature map of channels in different directions; and determining a target detection result in the remote sensing image according to the plurality of filtering characteristic diagrams.

Optionally, the multi-scale feature extraction module 501 is specifically configured to perform feature extraction on the remote sensing image according to different scales, so as to obtain an initial feature map under the different scales; and adopting an attention mechanism to respectively perform feature capturing on the plurality of initial feature images to obtain first feature images under different scales.

Fig. 6 is a schematic structural diagram of an electronic device provided by the present invention. As shown in Fig. 6, the electronic device may include: a processor 610, a communication interface (Communications Interface) 620, a memory 630, and a communication bus 640, wherein the processor 610, the communication interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a target detection method for a remote sensing image, the method comprising: extracting features of the remote sensing image according to different scales to obtain first feature maps under the different scales; respectively carrying out context semantic feature enhancement on the plurality of first feature maps to obtain a plurality of second feature maps; performing feature alignment on the plurality of second feature maps to obtain an alignment feature map of the remote sensing image; and carrying out target detection on the alignment feature map to obtain a target detection result in the remote sensing image.

Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of executing a method for detecting an object of a remote sensing image provided by the above methods, the method comprising: extracting features of the remote sensing images according to different scales to obtain first feature images under the different scales; respectively carrying out context semantic feature enhancement on the plurality of first feature images to obtain a plurality of second feature images; performing feature alignment on the plurality of second feature images to obtain an aligned feature image of the remote sensing image; and carrying out target detection on the alignment feature map to obtain a target detection result in the remote sensing image.

In yet another aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method for target detection of a remote sensing image provided by the above methods, the method comprising: extracting features of the remote sensing images according to different scales to obtain first feature images under the different scales; respectively carrying out context semantic feature enhancement on the plurality of first feature images to obtain a plurality of second feature images; performing feature alignment on the plurality of second feature images to obtain an aligned feature image of the remote sensing image; and carrying out target detection on the alignment feature map to obtain a target detection result in the remote sensing image.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

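1. A target detection method for a remote sensing image, comprising:

performing feature extraction on the remote sensing image at different scales to obtain first feature maps at the different scales;

performing context semantic feature enhancement on the plurality of first feature maps respectively to obtain a plurality of second feature maps;

performing feature alignment on the plurality of second feature maps to obtain an aligned feature map of the remote sensing image;

and performing target detection on the aligned feature map to obtain a target detection result in the remote sensing image.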

2. The method of claim 1, wherein performing context semantic feature enhancement on the plurality of first feature maps respectively to obtain a plurality of second feature maps comprises:

determining a first second feature map corresponding to the first feature map according to the first feature map in the plurality of first feature maps and a first feature map adjacent to the first feature map;

for each other first feature map except the first feature map, determining the other second feature map corresponding to the other first feature map according to the other first feature map, the first feature map adjacent to the other first feature map, and the second feature map corresponding to the other first feature map before the other first feature map;

and determining the first second feature map and the other second feature maps as the plurality of second feature maps.
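The order of dependencies in this enhancement step can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed formulas: `enhance_context` and the `fuse` callable are hypothetical names, "adjacent" is assumed to mean the immediately neighbouring scales, and every previously computed second feature map is handed to `fuse`, which may use only the relevant one.

```python
from typing import Any, Callable, List, Sequence


def enhance_context(
    first_maps: Sequence[Any],
    fuse: Callable[[Any, List[Any], List[Any]], Any],
) -> List[Any]:
    """Build second feature maps in order; `fuse` is an assumed fusion operator."""
    if len(first_maps) < 2:
        raise ValueError("at least two first feature maps are required")
    second_maps: List[Any] = []
    for i, current in enumerate(first_maps):
        # Assumption: adjacent first feature maps are the neighbouring scales.
        neighbours = [
            first_maps[j] for j in (i - 1, i + 1) if 0 <= j < len(first_maps)
        ]
        # For i == 0 no second feature map exists yet, so `previous` is empty,
        # matching the case that uses only first feature maps.
        previous = list(second_maps)
        second_maps.append(fuse(current, neighbours, previous))
    return second_maps
```

With scalars standing in for feature maps and `fuse = lambda cur, nb, prev: cur + sum(nb) + sum(prev)`, the input `[1, 2, 3]` yields `[3, 9, 17]`, which shows how each later second feature map accumulates context from the earlier ones.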

3. The method according to claim 2, wherein determining, from a first feature map of the plurality of first feature maps and a first feature map adjacent to the first feature map, a first second feature map corresponding to the first feature map includes:

determining a first second feature map corresponding to the first feature map according to a first formula;

wherein the first formula is:

[formula not recoverable from the source text]

where the symbols of the first formula denote, in order: the first second feature map; the first feature map; and the first feature map adjacent to the first feature map;

the determining, according to the other first feature map, the first feature map adjacent to the other first feature map, and the second feature map corresponding to the other first feature map before the other first feature map, the other second feature map corresponding to the other first feature map includes:

determining other second feature maps corresponding to the other first feature maps according to a second formula;

wherein the second formula is:

[formula not recoverable from the source text]

where the symbols of the second formula denote, in order: the other second feature map; the other first feature map; the first feature map adjacent to the other first feature map; and the second feature map corresponding to the other first feature map before the other first feature map.

4. The method according to any one of claims 1-3, wherein performing feature alignment on the plurality of second feature maps to obtain an aligned feature map of the remote sensing image comprises:

acquiring anchor box information corresponding to the remote sensing image;

and performing feature alignment on the plurality of second feature maps according to the anchor box information to obtain the aligned feature map of the remote sensing image.

5. The method of claim 4, wherein performing feature alignment on the plurality of second feature maps according to the anchor box information to obtain the aligned feature map of the remote sensing image comprises:

determining an offset field according to the anchor box information;

and performing alignment convolution on the offset field and the plurality of second feature maps to obtain the aligned feature map of the remote sensing image.
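How an offset field can be derived from an anchor box is sketched below in plain Python. This is one common formulation from the oriented-detection literature, given here as an assumption rather than as the formula of the invention: the anchor is taken to be `(cx, cy, w, h, theta)` in image coordinates, `stride` is the downsampling factor of the feature map, and the function name `anchor_offsets` is hypothetical. For each tap of a k x k kernel, the offset is the difference between the anchor-guided sampling point and the regular grid point.

```python
import math
from typing import List, Tuple


def anchor_offsets(
    anchor: Tuple[float, float, float, float, float],
    position: Tuple[int, int],
    stride: float,
    kernel_size: int = 3,
) -> List[Tuple[float, float]]:
    """Return (dx, dy) offsets for every kernel tap at one feature-map position."""
    cx, cy, w, h, theta = anchor
    px, py = position
    half = kernel_size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    offsets = []
    for ry in range(-half, half + 1):
        for rx in range(-half, half + 1):
            # Spread the kernel taps evenly over the anchor box.
            dx = rx * w / kernel_size
            dy = ry * h / kernel_size
            # Rotate by the anchor angle, move to the anchor centre,
            # then convert from image to feature-map coordinates.
            sx = (cx + dx * cos_t - dy * sin_t) / stride
            sy = (cy + dx * sin_t + dy * cos_t) / stride
            # Offset relative to the regular sampling point p + r.
            offsets.append((sx - (px + rx), sy - (py + ry)))
    return offsets
```

When the anchor is axis-aligned, centred on the position, and exactly `kernel_size * stride` wide and high, every offset is zero and the alignment convolution reduces to an ordinary convolution. In a full implementation, offsets of this kind would be fed to a deformable convolution operator together with the second feature maps.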

6. The method according to any one of claims 1-3, wherein performing target detection on the aligned feature map to obtain a target detection result in the remote sensing image comprises:

encoding the orientation information of the targets in the aligned feature map by using an active rotation filter to obtain filtered feature maps for channels of different orientations;

and determining the target detection result in the remote sensing image according to the plurality of filtered feature maps.
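The idea of producing one response channel per orientation can be illustrated with a deliberately simplified sketch. It assumes a single square kernel rotated in exact 90-degree steps, giving four orientation channels; active rotation filters as described in the literature also handle finer angles by interpolation and shift orientation channels cyclically, which is omitted here. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def rotate90(kernel: Matrix) -> Matrix:
    """Rotate a square kernel counter-clockwise by 90 degrees."""
    n = len(kernel)
    return [[kernel[c][n - 1 - r] for c in range(n)] for r in range(n)]


def rotated_filters(kernel: Matrix) -> List[Matrix]:
    """Return the kernel at 0, 90, 180 and 270 degrees."""
    filters = [kernel]
    for _ in range(3):
        filters.append(rotate90(filters[-1]))
    return filters


def correlate_valid(feature: Matrix, kernel: Matrix) -> Matrix:
    """Plain 2-D cross-correlation without padding."""
    k = len(kernel)
    rows = len(feature) - k + 1
    cols = len(feature[0]) - k + 1
    return [
        [
            sum(
                feature[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def orientation_responses(feature: Matrix, kernel: Matrix) -> List[Matrix]:
    """One filtered feature map per orientation channel."""
    return [correlate_valid(feature, f) for f in rotated_filters(kernel)]
```

A pattern that matches the unrotated kernel gives its strongest response in the first orientation channel, so the index of the strongest channel carries a coarse estimate of the target orientation.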

7. The method according to any one of claims 1 to 3, wherein performing feature extraction on the remote sensing image at different scales to obtain first feature maps at the different scales comprises:

performing feature extraction on the remote sensing image at different scales to obtain initial feature maps at the different scales;

and applying an attention mechanism to each of the plurality of initial feature maps to capture features, thereby obtaining the first feature maps at the different scales.
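The claim does not fix a particular attention mechanism. As one possible illustration only, the sketch below applies a parameter-free channel gate in the spirit of squeeze-and-excitation: each channel is summarised by its mean and rescaled by the sigmoid of that mean. The function name `channel_attention` and the absence of learned weights are assumptions made for brevity; a trained network would learn the gating.

```python
import math
from typing import List

Channel = List[List[float]]


def channel_attention(feature_map: List[Channel]) -> List[Channel]:
    """Rescale each channel by sigmoid(mean of that channel)."""
    weights = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        # Sigmoid gate in (0, 1); larger mean activation gives a larger weight.
        weights.append(1.0 / (1.0 + math.exp(-mean)))
    return [
        [[v * w for v in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]
```

Applied to each initial feature map in turn, a gate of this kind emphasises channels with strong average activation and damps the others before the maps are passed on as first feature maps.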

8. An object detection device for a remote sensing image, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the target detection method for a remote sensing image according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the target detection method for a remote sensing image according to any one of claims 1 to 7.