CN116229272B - High-precision remote sensing image detection method and system based on representative point representation - Google Patents

High-precision remote sensing image detection method and system based on representative point representation

Info

Publication number
CN116229272B
CN116229272B (application CN202310241950.2A)
Authority
CN
China
Prior art keywords
features
feature
network
resolution
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310241950.2A
Other languages
Chinese (zh)
Other versions
CN116229272A (en)
Inventor
张锦
顾因
陈锋
段晔鑫
姜伟成
蔡军
耿京
刘雁轩
杨子恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Military Transportation University of PLA Zhenjiang
Original Assignee
Army Military Transportation University of PLA Zhenjiang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Military Transportation University of PLA Zhenjiang filed Critical Army Military Transportation University of PLA Zhenjiang
Priority to CN202310241950.2A priority Critical patent/CN116229272B/en
Publication of CN116229272A publication Critical patent/CN116229272A/en
Application granted granted Critical
Publication of CN116229272B publication Critical patent/CN116229272B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application discloses a high-precision remote sensing image detection method based on representative point representation, which comprises the following steps: acquiring a remote sensing image to be detected; inputting the acquired remote sensing image into a pre-trained single feature aggregation depth network, and outputting a feature map. The backbone network is used for decomposing the remote sensing image into multi-scale features at several scales; the feature up-fusion network is used for fusing the low-resolution deep features upward with the high-resolution shallow features to obtain primary fusion features; the feature down-fusion network gradually fuses the high-resolution deep primary fusion features downward with the low-resolution shallow primary fusion features to obtain advanced fusion features; the feature transformation network is used for reducing the dimension of the features and outputting lightweight features; the convolution module is used for carrying out convolution operations on the images; the single feature aggregation depth network is used for encoding and decoding features of the original image and outputting a feature map with channel dimension na×(nc+1+2×np); the loss function comprises a classification loss term, a localization loss term, a confidence loss term, a geometric regularization term and a feature regularization term.

Description

High-precision remote sensing image detection method and system based on representative point representation
Technical Field
The application relates to a high-precision remote sensing image detection method and system based on representative point representation, and belongs to the technical field of remote sensing image detection.
Background
Remote sensing image detection is an important means of online information mining and dynamic monitoring of important targets over wide regions, and can be widely applied to scenarios such as large-area personnel search and rescue, forest fire detection, geological survey, and real-time battlefield information sensing and reconnaissance. On the one hand, current rotation detectors, which directly regress the angle of the "rotated box", face the "angle critical problem" and the "circle-like problem". The "angle critical problem" refers to the contradiction that, near the critical angle, the difference between the predicted and true angle values is small but the loss is large; the "circle-like problem" refers to circle-like objects being essentially direction-independent, while the loss remains highly sensitive to the direction prediction. On the other hand, remote sensing image detectors based on rotated boxes generally have low inference speed and can hardly meet the speed requirements of high-resolution image detection, while fast horizontal-box detection algorithms migrated directly can hardly satisfy the rotation-invariance and scale-variation requirements of remote sensing images. The present application therefore addresses the speed requirement of remote sensing image detection and the angle critical problem brought by the rotated box.
Disclosure of Invention
The application aims to provide a high-precision remote sensing image detection method and system based on representative point representation, which meet the speed requirement of remote sensing image detection and solve the angle critical problem brought by the rotated box.
A high-precision remote sensing image detection method based on representative point representation, the method comprising:
acquiring a remote sensing image to be detected;
inputting the acquired remote sensing image into a pre-trained single feature aggregation depth network, and outputting feature mapping;
the single feature aggregation depth network comprises a backbone network based on a single feature aggregation module, a feature up-fusion network, a feature down-fusion network, a feature transformation network, a representative point loss function and a convolution module;
the backbone network is used for decomposing the remote sensing image into multi-scale features at several scales; the feature up-fusion network is used for fusing the low-resolution deep features upward with the high-resolution shallow features to obtain primary fusion features; the feature down-fusion network gradually fuses the high-resolution deep primary fusion features downward with the low-resolution shallow primary fusion features to obtain advanced fusion features; the feature transformation network is used for reducing the dimension of the features and outputting lightweight features; the convolution module is used for carrying out convolution operations on the images; the single feature aggregation depth network is used for encoding and decoding features of the original image and outputting a feature map with channel dimension na×(nc+1+2×np); the representative point loss function comprises a classification loss term, a localization loss term, a confidence loss term, a geometric regularization term and a feature regularization term.
Further, the single feature aggregation depth network is trained through a representative point loss function, and the formula of the representative point loss function is as follows:
L_RP = Σ_j β_j · ( α_1·L_cls^j + α_2·L_conf^j + α_3·L_loc^j + α_4·L_geo^j + α_5·L_feat^j ), with L_cls = CE(P_cls, T_cls), L_conf = CE(P_conf, T_conf) and L_loc = ConvexGIOU(P_loc, T_loc);
wherein L_cls denotes the classification loss, L_conf the confidence loss, L_loc the localization loss, L_geo the geometric regularization term and L_feat the feature regularization term; j denotes the corresponding scale; α denotes the weights of the different loss types and β the weights of the different scales; P denotes predicted values and T denotes ground-truth values; CE denotes the classical cross-entropy loss and ConvexGIOU the polygon generalized intersection-over-union operator; L_geo and L_feat constrain the distribution of the set of representative points representing an object from the geometric and feature perspectives, respectively.
Further, the single feature aggregation depth network performs feature extraction and multi-scale feature fusion based on a single feature aggregation module.
Further, the single feature aggregation module includes:
performing convolution processing on the input features with channel number c0 to obtain 4 groups of output features, each with channel number c;
splicing the obtained 4 groups of output features with channel number c along the channel dimension into a group of features with channel number 4c, and then performing aggregation with a convolution operation to obtain a group of features with channel number 2c.
Further, the convolution module comprises a first convolution and a second convolution; the first convolution is a convolution with kernel size 3×3 and stride 1 cascaded with an activation function with parameter 0.1, and the second convolution is a convolution with kernel size 1×1, realizing fast dimensionality reduction of high-dimensional input features.
Further, the backbone network performs layer-by-layer abstraction and processing on the input remote sensing image based on a single feature aggregation module and a plurality of independent convolution and maximum pooling operations, and outputs feature mapping with different scale resolutions.
Further, the feature up-fusion network first uses convolutions with 1×1 kernels to process the low-resolution input features with one channel number and the high-resolution input features with another channel number to obtain two features with a unified channel dimension, then upsamples the low-resolution features to further unify the resolution, and fuses the two obtained features using a single feature aggregation module.
Further, the feature down-fusion network realizes downsampling of the high-resolution features through a convolution operation with a stride of 2.
Further, the feature transformation network uses a lightweight 1×1 convolution to reduce the channel dimension of the features and obtain dimension-reduced features, then applies multiple groups of pooling operations to the dimension-reduced features to encode diversified features, and finally aggregates the multi-branch features to output lightweight features.
A high-precision remote sensing image detection system based on representative point representation, the system comprising:
the acquisition module is used for acquiring a remote sensing image to be detected;
the processing module is used for processing the input remote sensing image to be detected;
the processing module comprises a backbone network based on a single feature aggregation module, an upper feature fusion network, a lower feature fusion network, a feature transformation network, a representative point loss function and a convolution module; the backbone network is used for dividing the remote sensing image into multi-scale characteristics of a plurality of scales; the above-feature fusion network is used for upwardly fusing deep features with low resolution with shallow features with high resolution to obtain primary fusion features; the feature lower fusion network gradually fuses the deep high-resolution primary fusion features with the shallow low-resolution primary fusion features downwards to obtain high-level fusion features; the feature transformation network is used for reducing the dimension of the features and outputting lightweight features; the convolution module is used for carrying out convolution operation on the image; the single feature aggregation depth network is used for carrying out feature encoding and decoding on the original image and outputting feature mapping with the channel dimension of na× (nc+1+2 x np); the representative point loss function includes a classification loss term, a location loss term, a confidence loss term, a geometric regularization term, and a feature regularization term.
Compared with the prior art, the application has the following beneficial effects: the single feature aggregation depth network of the application performs feature extraction and multi-scale feature fusion with a GPU-inference-efficient single feature aggregation module as its core component; the single feature aggregation module OFA avoids the inefficient computation and large memory footprint brought by dense feature-reuse schemes, thereby improving inference speed, and extracts more diversified feature representations than element-wise addition schemes.
The application "discards" the rectangular box representation commonly used for "object" and selects np points to represent "object". In the restoration process, a rectangular frame surrounding all representative points is generated based on a classical Jarvis March algorithm or a minimum area algorithm.
Because the representative points and the representative-point-based box generation algorithm do not involve any angle representation of rotated rectangular boxes, the angle critical problem and the circle-like problem are fundamentally avoided, which ensures stable model training and high detection precision.
Drawings
FIG. 1 is a single feature aggregation depth network of the present application;
FIG. 2 is a schematic diagram of a single feature aggregation module of the present application;
FIG. 3 is a schematic diagram of a backbone network of the present application;
FIG. 4 is a schematic diagram of a feature transformation network of the present application;
FIG. 5 is a schematic diagram of the feature up-fusion network of the present application;
FIG. 6 is a schematic diagram of the feature down-fusion network of the present application.
Detailed Description
The application is further described below in connection with specific embodiments, so that the technical means, creative features, objectives and effects of the application are easy to understand.
Example 1
A high-precision remote sensing image detection method based on representative point representation, the method comprising:
acquiring a remote sensing image to be detected;
inputting the acquired remote sensing image into a pre-trained single feature aggregation depth network, and outputting feature mapping;
the single feature aggregation depth network comprises a backbone network based on a single feature aggregation module, a feature up-fusion network, a feature down-fusion network, a feature transformation network, a representative point loss function and a convolution module;
the backbone network is used for decomposing the remote sensing image into multi-scale features at several scales; the feature up-fusion network is used for fusing the low-resolution deep features upward with the high-resolution shallow features to obtain primary fusion features; the feature down-fusion network gradually fuses the high-resolution deep primary fusion features downward with the low-resolution shallow primary fusion features to obtain advanced fusion features; the feature transformation network is used for reducing the dimension of the features and outputting lightweight features; the convolution module is used for carrying out convolution operations on the images; the single feature aggregation depth network is used for encoding and decoding features of the original image and outputting a feature map with channel dimension na×(nc+1+2×np); the representative point loss function comprises a classification loss term, a localization loss term, a confidence loss term, a geometric regularization term and a feature regularization term.
The application "discards" the rectangular box representation commonly used for "object" and selects np points to represent "object". At the time of restoration, a rectangular frame surrounding all representative points is generated based on a classical Jarvis March algorithm or a minimum area algorithm (MinAeraRect).
Because the representative points and the representative-point-based box generation algorithm do not involve any angle representation of rotated rectangular boxes, the angle critical problem and the circle-like problem are fundamentally avoided, which ensures stable model training and high detection precision. The single feature aggregation depth network OFAN performs feature extraction and multi-scale feature fusion with a GPU-inference-efficient single feature aggregation module (One-pass Feature Aggregation, OFA) as its core component. The single feature aggregation module OFA not only avoids the inefficient computation and large memory footprint brought by dense feature-reuse schemes (such as DenseNet), thereby improving inference speed, but also extracts more diversified feature representations than element-wise addition schemes (such as ResNet).
The specific implementation is as follows:
as shown in fig. 1, the single feature aggregation deep network OFAN mainly includes a Backbone network (Backbone), a fusion-on-feature network (FuseDown 2Up, fuseD 2U), a fusion-under-feature network (FuseUp 2Down, fuseU 2D), and a feature transformation network (Transition) and a convolution module (conv), wherein the Backbone network (Backbone) uses a single feature aggregation module OFA as a core component.
The workflow of the single feature aggregation depth network OFAN is as follows: the remote sensing image is fed into the Backbone network to obtain multi-scale features at 4 scales; the low-resolution deep features are then gradually fused upward with the high-resolution shallow features ("upward" means that the resolution of the fused features is consistent with that of the high-resolution shallow features), yielding a series of primary fusion features (the output features of FuseD2U); the high-resolution deep primary fusion features are then gradually fused downward with the low-resolution shallow primary fusion features ("downward" means that the resolution of the fused features is consistent with that of the low-resolution shallow features), yielding the advanced fusion features (the output features of FuseU2D).
Finally, the advanced fusion features at each scale are processed by the convolution module to obtain the final output feature map with channel dimension na×(nc+1+2×np).
A specific explanation of this channel dimension is as follows: each prediction box is covered by np representative points (called a point set); accordingly, 2×np neurons represent the x- and y-coordinate offsets of this point set; nc neurons represent the class confidence of the predicted object (the class space has size nc); 1 neuron represents the confidence that the object covered by this set of representative points belongs to the foreground; and na denotes the number of prediction boxes per position.
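For illustration only, the following sketch shows one way such a head output could be split into the three groups of neurons described above; the tensor layout and the function names are assumptions and are not prescribed by the application.

```python
# Illustrative sketch: splitting a head output of shape (B, na*(nc+1+2*np), H, W)
# into class scores, foreground confidence and the 2*np point offsets per anchor.
import torch

def decode_head(output: torch.Tensor, na: int, nc: int, np_points: int):
    b, c, h, w = output.shape
    assert c == na * (nc + 1 + 2 * np_points)
    out = output.view(b, na, nc + 1 + 2 * np_points, h, w)
    cls_scores = out[:, :, :nc]            # nc class confidences
    objectness = out[:, :, nc:nc + 1]      # 1 foreground confidence
    point_offsets = out[:, :, nc + 1:]     # 2*np (x, y) offsets of the point set
    return cls_scores, objectness, point_offsets

if __name__ == "__main__":
    na, nc, np_points = 3, 15, 9           # e.g. 15 DOTA classes, 9 representative points
    feat = torch.randn(1, na * (nc + 1 + 2 * np_points), 32, 32)
    cls_s, obj, pts = decode_head(feat, na, nc, np_points)
    print(cls_s.shape, obj.shape, pts.shape)
```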
As shown in Equation 1, the representative point loss function RP is composed of the classification loss, the confidence loss, the localization loss, the geometric regularization term and the feature regularization term. The first three losses correspond to the three groups of nc, 1 and 2×np neurons, respectively. j denotes the corresponding scale; α and β denote the weights of the different loss types and the weights of the different scales, respectively; P and T denote predicted and ground-truth values; CE and ConvexGIOU denote the classical cross-entropy loss and the polygon generalized intersection-over-union operator. The geometric and feature regularization terms constrain the distribution of the set of representative points representing an object from the geometric and feature perspectives, respectively.
The geometric regularization term guides the point set to spread out so as to cover the whole object as completely as possible; the application designs it as the inverse of the total distance ρ_kc of the points in the set from the point-set center c (Equation 5).
The feature regularization term minimizes the similarity of the features corresponding to the points, so that points on different semantic parts of the object are selected to characterize the whole object; the application designs it as the sum of the feature similarities e_kc.
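For illustration only, the following sketch gives one possible reading of the two regularization terms described above (inverse of the total distance of the points from the point-set center, and the sum of the feature similarities between points); the exact formulas, normalization and similarity measure are assumptions.

```python
# Hedged sketch of the two regularizers; formulas and the cosine-similarity
# choice are illustrative assumptions, not the application's exact definitions.
import torch
import torch.nn.functional as F

def geometric_reg(points: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """points: (np, 2). Inverse of the total distance of the points from the
    point-set center, so spreading the points out reduces the loss."""
    center = points.mean(dim=0, keepdim=True)
    total_dist = (points - center).norm(dim=1).sum()
    return 1.0 / (total_dist + eps)

def feature_reg(point_feats: torch.Tensor) -> torch.Tensor:
    """point_feats: (np, d) features sampled at the representative points.
    Sum of pairwise similarities; minimizing it pushes the points toward
    different semantic parts of the object."""
    f = F.normalize(point_feats, dim=1)
    sim = f @ f.t()                                   # (np, np) similarity matrix
    off_diag = sim - torch.diag(torch.diag(sim))      # ignore self-similarity
    return off_diag.sum() / 2                         # each pair counted once

if __name__ == "__main__":
    pts = torch.rand(9, 2) * 32
    feats = torch.randn(9, 64)
    print(geometric_reg(pts).item(), feature_reg(feats).item())
```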
As shown in fig. 2, the single feature aggregation module OFA(c0; 2c) processes input features with channel number c0 and outputs features with channel number 2c. The single feature aggregation module OFA adjusts the longest path (branch 1) and the shortest path (branch 4) in the module to increase gradient diversity, so that the network can learn more diversified features, improving precision and accelerating convergence. The specific workflow is as follows: OFA builds 4 branches based on convolution operations to extract diversified features, splices them along the channel dimension (the channel number after splicing is 4c), and then aggregates them with a convolution operation. OFA involves 2 different types of convolution operations: conv3×3,1,LeakyReLU(0.1) and conv1×1,1,LeakyReLU(0.1). The former denotes a convolution with kernel size 3×3 and stride 1 followed by a cascaded LeakyReLU activation with parameter 0.1 (positive inputs pass through unchanged, negative inputs are multiplied by 0.1); the latter uses a 1×1 convolution kernel to achieve fast dimensionality reduction of high-dimensional input features.
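For illustration only, an OFA(c0; 2c)-style block consistent with the description above could be sketched as follows; since the exact composition of the four branches is not fully specified here, the branch depths are illustrative assumptions, and only the two convolution types named above are used.

```python
# Illustrative OFA(c0; 2c)-style block: four branches, channel concatenation
# to 4c, and a 1x1 aggregation down to 2c. Branch depths are assumptions.
import torch
import torch.nn as nn

def conv3x3(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.LeakyReLU(0.1))

def conv1x1(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 1, 1, 0), nn.LeakyReLU(0.1))

class OFA(nn.Module):
    def __init__(self, c0: int, c: int):
        super().__init__()
        self.branch1 = nn.Sequential(conv1x1(c0, c), conv3x3(c, c), conv3x3(c, c))  # longest path
        self.branch2 = nn.Sequential(conv1x1(c0, c), conv3x3(c, c))
        self.branch3 = nn.Sequential(conv1x1(c0, c), conv3x3(c, c))
        self.branch4 = conv1x1(c0, c)                                               # shortest path
        self.aggregate = conv1x1(4 * c, 2 * c)        # concatenate to 4c, aggregate to 2c

    def forward(self, x):
        feats = [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)]
        return self.aggregate(torch.cat(feats, dim=1))

if __name__ == "__main__":
    m = OFA(c0=64, c=64)
    print(m(torch.randn(1, 64, 32, 32)).shape)   # -> (1, 128, 32, 32)
```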
As shown in fig. 3, the Backbone network performs layer-by-layer abstraction and processing of the input remote sensing image (resolution h×w) based on the single feature aggregation module OFA and several independent convolution (conv) and Max Pooling operations, and finally outputs feature maps at 4 different scale resolutions (b2, b3, b4, b5). Downsampling of the features is implemented with a stride-2 convolution (conv3×3,2,LeakyReLU(0.1)) and a pooling operation with kernel size 2 (Max Pooling 2,2). Although the downsampled deep features lose part of the spatial information (localization capability is reduced), the single feature aggregation module OFA obtains a larger receptive field, so that high-level semantic features are easier to encode (object-class prediction capability is improved). Conversely, the shallow features before downsampling are rich in spatial information (strong localization capability) but weak in semantic information (insufficient object-class prediction capability). To exploit the advantages of both the high-resolution shallow features and the low-resolution deep features, feature fusion is carried out in 2 stages, "up" fusion and "down" fusion, based on the FuseD2U and FuseU2D modules.
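For illustration only, a backbone of this style could be sketched as follows; the stage layout and channel progression are assumptions chosen so that b5 has 1024 channels (consistent with the 1024-to-256 reduction described below), and the OFA block is replaced by a simple stand-in rather than repeating the sketch above.

```python
# Illustrative backbone sketch: stem, then four stages producing b2..b5.
# The stand-in block maps c0 input channels to 2c output channels, like OFA.
import torch
import torch.nn as nn

def ofa_standin(c0, c):  # stand-in for the OFA module of fig. 2 (c0 -> 2c channels)
    return nn.Sequential(nn.Conv2d(c0, 2 * c, 3, 1, 1), nn.LeakyReLU(0.1))

def down(cin, cout):     # stride-2 convolution used for downsampling, as described
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1), nn.LeakyReLU(0.1))

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.LeakyReLU(0.1),
                                  nn.MaxPool2d(2, 2))                        # h/4
        self.stage2 = ofa_standin(32, 64)                                    # b2: 128 ch
        self.stage3 = nn.Sequential(down(128, 128), ofa_standin(128, 128))   # b3: 256 ch
        self.stage4 = nn.Sequential(down(256, 256), ofa_standin(256, 256))   # b4: 512 ch
        self.stage5 = nn.Sequential(down(512, 512), ofa_standin(512, 512))   # b5: 1024 ch

    def forward(self, x):
        b2 = self.stage2(self.stem(x))
        b3 = self.stage3(b2)
        b4 = self.stage4(b3)
        b5 = self.stage5(b4)
        return b2, b3, b4, b5

if __name__ == "__main__":
    shapes = [f.shape for f in Backbone()(torch.randn(1, 3, 256, 256))]
    print(shapes)   # four scales with 128/256/512/1024 channels
```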
To improve computational efficiency, the deep feature b5 is first dimension-reduced by Transition before feature fusion, aggregating the number of feature channels from 1024 to 256 (fig. 1). The Transition(4c; c) module shown in FIG. 4 processes input features with channel number 4c to obtain output features with channel number c. The feature transformation network Transition uses a lightweight 1×1 convolution to reduce the 4c-channel features to dimension-reduced features with channel number c, then applies 3 groups of parallel pooling operations with receptive fields of 5, 9 and 13 to encode diversified features, and finally aggregates the multi-branch features to output lightweight features.
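For illustration only, a Transition(4c; c)-style block consistent with this description could look as follows; the use of max-pooling for the three parallel branches and the final 1×1 aggregation are assumptions.

```python
# Illustrative Transition(4c; c)-style block: 1x1 reduction, three parallel
# poolings with receptive fields 5/9/13, then aggregation back to c channels.
import torch
import torch.nn as nn

class Transition(nn.Module):
    def __init__(self, c4: int, c: int):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(c4, c, 1), nn.LeakyReLU(0.1))
        self.pools = nn.ModuleList([nn.MaxPool2d(k, stride=1, padding=k // 2)
                                    for k in (5, 9, 13)])
        self.aggregate = nn.Sequential(nn.Conv2d(4 * c, c, 1), nn.LeakyReLU(0.1))

    def forward(self, x):
        r = self.reduce(x)
        branches = [r] + [p(r) for p in self.pools]   # identity branch plus 3 pooled branches
        return self.aggregate(torch.cat(branches, dim=1))

if __name__ == "__main__":
    t = Transition(1024, 256)       # b5: 1024 -> 256 channels, as stated in the text
    print(t(torch.randn(1, 1024, 32, 32)).shape)
```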
As shown in fig. 5, the feature up-fusion network FuseDown2Up(c0, c1; c) module first uses convolutions with 1×1 kernels to process the low-resolution input features with channel number c0 and the high-resolution input features with channel number c1, obtaining two features whose channel dimension is unified to c, and then upsamples the low-resolution features to further unify the resolution. Finally, the two obtained features are fused by a single feature aggregation module OFA. The feature down-fusion network FuseUp2Down(c0, c1; c) is similar to the feature up-fusion network FuseDown2Up(c0, c1; c); the difference is that FuseUp2Down realizes downsampling of the high-resolution features through a convolution operation with a stride of 2, see fig. 6.
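For illustration only, the two fusion modules could be sketched as follows; the OFA fusion step is replaced by a simple stand-in, and concatenating the two inputs before fusion as well as nearest-neighbour upsampling are assumptions.

```python
# Illustrative FuseDown2Up / FuseUp2Down sketch under stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv1x1(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 1), nn.LeakyReLU(0.1))

def ofa_standin(cin, cout):  # stand-in for the OFA fusion described in fig. 2
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.LeakyReLU(0.1))

class FuseDown2Up(nn.Module):
    """Fuse a low-resolution feature (c0 ch) upward into a high-resolution one (c1 ch)."""
    def __init__(self, c0, c1, c):
        super().__init__()
        self.low, self.high = conv1x1(c0, c), conv1x1(c1, c)
        self.fuse = ofa_standin(2 * c, c)

    def forward(self, low_res, high_res):
        low = F.interpolate(self.low(low_res), size=high_res.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([low, self.high(high_res)], dim=1))

class FuseUp2Down(nn.Module):
    """Same idea, but the high-resolution branch is downsampled by a stride-2 conv."""
    def __init__(self, c0, c1, c):
        super().__init__()
        self.low = conv1x1(c0, c)
        self.high_down = nn.Sequential(nn.Conv2d(c1, c, 3, 2, 1), nn.LeakyReLU(0.1))
        self.fuse = ofa_standin(2 * c, c)

    def forward(self, low_res, high_res):
        return self.fuse(torch.cat([self.low(low_res), self.high_down(high_res)], dim=1))
```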
Experimental conditions: the advancement of the present application was verified experimentally on the world's largest remote sensing image detection dataset (DOTA). The DOTA dataset contains 2806 remote sensing images with nearly 190,000 annotated instances in 15 categories (plane, ship, storage tank, tennis court, basketball court, baseball diamond, ground track field, harbor, bridge, small vehicle, large vehicle, helicopter, roundabout, soccer field, swimming pool). The experimental procedure was divided into 3 stages: 1) network training and parameter tuning based on the public DOTA training and validation sets; 2) inference with the trained network on the unannotated test set; 3) submitting the inference results to the DOTA official website to obtain the algorithm evaluation.
Algorithm parameter settings: the weights α1, α2, α3, α4, α5 corresponding to the five loss parts, namely the classification loss, the confidence loss, the localization loss, the geometric regularization term and the feature regularization term, are set to 0.07, 0.0375, 1.92, 0.03, respectively; the weights (β1, β2, β3, β4) of the losses from the small scale to the large scale are set to 4, 1, 0.25 and 0.06, respectively. The number np of points in a point set is set to 9; the training batch size is set to 48, the number of epochs to 200, and the initial learning rate to 0.01, which after linear decay reaches 0.002 in the final epoch. The input image resolution for both training and testing is 1024×1024. The IOU threshold for non-maximum suppression is set to 0.45, and the confidence thresholds for training and inference are set to 0.01 and 0.25, respectively.
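For reference, the settings listed above can be collected into a plain configuration dictionary; the key names are illustrative, and only four of the five α values are recoverable from the text above.

```python
# Collected training settings from the paragraph above (key names are illustrative).
CONFIG = {
    "loss_weights_alpha": [0.07, 0.0375, 1.92, 0.03],  # four values listed for the five loss terms
    "scale_weights_beta": [4, 1, 0.25, 0.06],          # small -> large scale
    "num_points_np": 9,
    "batch_size": 48,
    "epochs": 200,
    "lr_initial": 0.01,
    "lr_final": 0.002,                                  # after linear decay
    "input_size": (1024, 1024),
    "nms_iou_threshold": 0.45,
    "conf_threshold_train": 0.01,
    "conf_threshold_infer": 0.25,
}
```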
TABLE 1 Comprehensive performance of high-performance remote sensing image detection algorithms on the DOTA test set
Experimental analysis: Table 1 compares the overall mAP accuracy and inference speed (FPS) of the present application (OFAN) with those of high-performance remote sensing image detection algorithms on the DOTA test set. The inference speed of existing high-performance algorithms does not exceed 20 frames per second (FPS), while the detection speed of the present method on a GTX3090 graphics card reaches 62.5 FPS (16 ms). In particular, with a batch size of 32, the inference speed of the application reaches 177 FPS (5.6 ms), which better meets the speed requirement of remote sensing image detection. For level-20 high-resolution remote sensing imagery (0.27 m spatial resolution), the algorithm can monitor an area of about 13.5 square kilometers in near real time every second (177 × 0.27×1024/1000 × 0.27×1024/1000). While greatly improving speed, the mAP of the application reaches 73.7, close to the detection precision of the current most advanced aerial image detectors.
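The per-second coverage figure can be checked with the following short calculation.

```python
# Quick arithmetic check of the coverage claim above: at 0.27 m resolution a
# 1024x1024 tile covers (0.27*1024/1000)^2 km^2; at 177 FPS that is ~13.5 km^2/s.
tile_km = 0.27 * 1024 / 1000        # tile side length in km (~0.276 km)
area_per_tile = tile_km ** 2        # ~0.0764 km^2 per image
print(177 * area_per_tile)          # ~13.5 km^2 processed per second
```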
Example 2
A high-precision remote sensing image detection system based on representative point representation, the system comprising:
the acquisition module is used for acquiring a remote sensing image to be detected;
the processing module is used for processing the input remote sensing image to be detected;
the processing module comprises a backbone network based on a single feature aggregation module, an upper feature fusion network, a lower feature fusion network, a feature transformation network, a representative point loss function and a convolution module; the backbone network is used for dividing the remote sensing image into multi-scale characteristics of a plurality of scales; the above-feature fusion network is used for upwardly fusing deep features with low resolution with shallow features with high resolution to obtain primary fusion features; the feature lower fusion network gradually fuses the deep high-resolution primary fusion features with the shallow low-resolution primary fusion features downwards to obtain high-level fusion features; the feature transformation network is used for reducing the dimension of the features and outputting lightweight features; the convolution module is used for carrying out convolution operation on the image; the single feature aggregation depth network is used for carrying out feature encoding and decoding on the original image and outputting feature mapping with the channel dimension of na× (nc+1+2 x np); the representative point loss function includes a classification loss term, a location loss term, a confidence loss term, a geometric regularization term, and a feature regularization term.
The foregoing is merely a preferred embodiment of the present application, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and variations should also be regarded as being within the scope of the application.

Claims (8)

1. The high-precision remote sensing image detection method based on representative point representation is characterized by comprising the following steps of:
acquiring a remote sensing image to be detected;
inputting the acquired remote sensing image into a pre-trained single feature aggregation depth network, and outputting feature mapping;
the single feature aggregation depth network comprises a backbone network based on a single feature aggregation module, a feature up-fusion network, a feature down-fusion network, a feature transformation network, a representative point loss function and a convolution module;
the backbone network is used for decomposing the remote sensing image into multi-scale features at several scales; the feature up-fusion network is used for fusing the low-resolution deep features upward with the high-resolution shallow features to obtain primary fusion features; the feature down-fusion network gradually fuses the high-resolution deep primary fusion features downward with the low-resolution shallow primary fusion features to obtain advanced fusion features; the feature transformation network is used for reducing the dimension of the features and outputting lightweight features; the convolution module is used for carrying out convolution operations on the images; the single feature aggregation depth network is used for encoding and decoding features of the original image and outputting a feature map with channel dimension na×(nc+1+2×np); the representative point loss function comprises a classification loss term, a localization loss term, a confidence loss term, a geometric regularization term and a feature regularization term;
the single feature aggregation depth network is trained through a representative point loss function, and the formula of the representative point loss function is as follows:
L_RP = Σ_j β_j · ( α_1·L_cls^j + α_2·L_conf^j + α_3·L_loc^j + α_4·L_geo^j + α_5·L_feat^j ), with L_cls = CE(P_cls, T_cls), L_conf = CE(P_conf, T_conf) and L_loc = ConvexGIOU(P_loc, T_loc);
wherein L_cls denotes the classification loss, L_conf the confidence loss, L_loc the localization loss, L_geo the geometric regularization term and L_feat the feature regularization term; j denotes the corresponding scale; α denotes the weights of the different loss types and β the weights of the different scales; P denotes predicted values and T denotes ground-truth values; CE denotes the classical cross-entropy loss and ConvexGIOU the polygon generalized intersection-over-union operator; L_geo and L_feat constrain the distribution of the set of representative points representing an object from the geometric and feature perspectives, respectively;
the feature up-fusion network first uses convolutions with 1×1 kernels to process the low-resolution input features with one channel number and the high-resolution input features with another channel number to obtain two features with a unified channel dimension, then upsamples the low-resolution features to further unify the resolution, and fuses the two obtained features using a single feature aggregation module;
the deep high-resolution primary fusion features are gradually fused downward with the shallow low-resolution primary fusion features to obtain advanced fusion features.
2. The high-precision remote sensing image detection method based on representative point representation according to claim 1, wherein the single feature aggregation depth network performs feature extraction and multi-scale feature fusion based on a single feature aggregation module.
3. The method for detecting a high-precision remote sensing image based on representative point representation according to claim 1, wherein the single feature aggregation module comprises:
performing convolution processing on the input features with channel number c0 to obtain 4 groups of output features, each with channel number c;
splicing the obtained 4 groups of output features with channel number c along the channel dimension into a group of features with channel number 4c, and then performing aggregation with a convolution operation to obtain a group of features with channel number 2c.
4. The method for detecting the high-precision remote sensing image based on the representative point representation according to claim 1, wherein the convolution module comprises a first convolution and a second convolution; the first convolution is a convolution with kernel size 3×3 and stride 1 cascaded with an activation function with parameter 0.1, and the second convolution is a convolution with kernel size 1×1, realizing fast dimensionality reduction of high-dimensional input features.
5. The method for detecting the high-precision remote sensing image based on representative point representation according to claim 1, wherein the backbone network performs layer-by-layer abstraction and processing on the input remote sensing image based on a single feature aggregation module and a plurality of independent convolution and maximum pooling operations, and outputs feature mapping with different scale resolutions.
6. The method for detecting the high-precision remote sensing image based on representative point representation according to claim 1, wherein the feature down-fusion network realizes downsampling of the high-resolution features through a convolution operation with a stride of 2.
7. The method for detecting the high-precision remote sensing image based on representative point representation according to claim 1, wherein the feature transformation network uses a lightweight 1×1 convolution to reduce the channel dimension of the features and obtain dimension-reduced features, then applies multiple groups of pooling operations to the dimension-reduced features to encode diversified features, and finally aggregates the multi-branch features to output lightweight features.
8. A high-precision remote sensing image detection system based on representative point representation, the system comprising:
the acquisition module is used for acquiring a remote sensing image to be detected;
the processing module is used for processing the input remote sensing image to be detected;
the processing module comprises a backbone network based on a single feature aggregation module, an upper feature fusion network, a lower feature fusion network, a feature transformation network, a representative point loss function and a convolution module; the backbone network is used for dividing the remote sensing image into multi-scale characteristics of a plurality of scales; the above-feature fusion network is used for upwardly fusing deep features with low resolution with shallow features with high resolution to obtain primary fusion features; the feature lower fusion network gradually fuses the deep high-resolution primary fusion features with the shallow low-resolution primary fusion features downwards to obtain high-level fusion features; the feature transformation network is used for reducing the dimension of the features and outputting lightweight features; the convolution module is used for carrying out convolution operation on the image; the single feature aggregation depth network is used for carrying out feature encoding and decoding on the original image and outputting feature mapping with the channel dimension of na× (nc+1+2 x np); the representative point loss function comprises a classification loss term, a positioning loss term, a confidence loss term, a geometric regularization term and a characteristic regularization term;
the single feature aggregation depth network is trained through a representative point loss function, and the formula of the representative point loss function is as follows:
L_RP = Σ_j β_j · ( α_1·L_cls^j + α_2·L_conf^j + α_3·L_loc^j + α_4·L_geo^j + α_5·L_feat^j ), with L_cls = CE(P_cls, T_cls), L_conf = CE(P_conf, T_conf) and L_loc = ConvexGIOU(P_loc, T_loc);
wherein L_cls denotes the classification loss, L_conf the confidence loss, L_loc the localization loss, L_geo the geometric regularization term and L_feat the feature regularization term; j denotes the corresponding scale; α denotes the weights of the different loss types and β the weights of the different scales; P denotes predicted values and T denotes ground-truth values; CE denotes the classical cross-entropy loss and ConvexGIOU the polygon generalized intersection-over-union operator; L_geo and L_feat constrain the distribution of the set of representative points representing an object from the geometric and feature perspectives, respectively;
the feature up-fusion network first uses convolutions with 1×1 kernels to process the low-resolution input features with one channel number and the high-resolution input features with another channel number to obtain two features with a unified channel dimension, then upsamples the low-resolution features to further unify the resolution, and fuses the two obtained features using a single feature aggregation module; the deep high-resolution primary fusion features are gradually fused downward with the shallow low-resolution primary fusion features to obtain advanced fusion features.
CN202310241950.2A 2023-03-14 2023-03-14 High-precision remote sensing image detection method and system based on representative point representation Active CN116229272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310241950.2A CN116229272B (en) 2023-03-14 2023-03-14 High-precision remote sensing image detection method and system based on representative point representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310241950.2A CN116229272B (en) 2023-03-14 2023-03-14 High-precision remote sensing image detection method and system based on representative point representation

Publications (2)

Publication Number Publication Date
CN116229272A CN116229272A (en) 2023-06-06
CN116229272B true CN116229272B (en) 2023-10-31

Family

ID=86569399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310241950.2A Active CN116229272B (en) 2023-03-14 2023-03-14 High-precision remote sensing image detection method and system based on representative point representation

Country Status (1)

Country Link
CN (1) CN116229272B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274763B (en) * 2023-11-21 2024-04-05 珠江水利委员会珠江水利科学研究院 Remote sensing image space-spectrum fusion method, system, equipment and medium based on balance point analysis

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245655A (en) * 2019-05-10 2019-09-17 天津大学 A kind of single phase object detecting method based on lightweight image pyramid network
CN111079683A (en) * 2019-12-24 2020-04-28 天津大学 Remote sensing image cloud and snow detection method based on convolutional neural network
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
WO2020143323A1 (en) * 2019-01-08 2020-07-16 平安科技(深圳)有限公司 Remote sensing image segmentation method and device, and storage medium and server
CN114022793A (en) * 2021-10-28 2022-02-08 天津大学 Optical remote sensing image change detection method based on twin network
CN114648684A (en) * 2022-03-24 2022-06-21 南京邮电大学 Lightweight double-branch convolutional neural network for image target detection and detection method thereof
CN114821357A (en) * 2022-04-24 2022-07-29 中国人民解放军空军工程大学 Optical remote sensing target detection method based on transformer

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143323A1 (en) * 2019-01-08 2020-07-16 平安科技(深圳)有限公司 Remote sensing image segmentation method and device, and storage medium and server
CN110245655A (en) * 2019-05-10 2019-09-17 天津大学 A kind of single phase object detecting method based on lightweight image pyramid network
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN111079683A (en) * 2019-12-24 2020-04-28 天津大学 Remote sensing image cloud and snow detection method based on convolutional neural network
CN114022793A (en) * 2021-10-28 2022-02-08 天津大学 Optical remote sensing image change detection method based on twin network
CN114648684A (en) * 2022-03-24 2022-06-21 南京邮电大学 Lightweight double-branch convolutional neural network for image target detection and detection method thereof
CN114821357A (en) * 2022-04-24 2022-07-29 中国人民解放军空军工程大学 Optical remote sensing target detection method based on transformer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Effective Features of Remote Sensing Image Classification Using Interactive Adaptive Thresholding Method; T. Balaji et al.; arXiv; 1-5 *
Research on Obstacle Detection Methods for Intelligent Vehicles Based on Laser Point Cloud and Vision Fusion; Zhang Yu; China Masters' Theses Full-text Database (Electronic Journal); Vol. 2023, No. 02; full text *

Also Published As

Publication number Publication date
CN116229272A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
Song et al. Mstdsnet-cd: Multiscale swin transformer and deeply supervised network for change detection of the fast-growing urban regions
CN109446925A (en) A kind of electric device maintenance algorithm based on convolutional neural networks
CN109784283A (en) Based on the Remote Sensing Target extracting method under scene Recognition task
CN111353487A (en) Equipment information extraction method for transformer substation
CN116229272B (en) High-precision remote sensing image detection method and system based on representative point representation
CN116229295A (en) Remote sensing image target detection method based on fusion convolution attention mechanism
CN113569672A (en) Lightweight target detection and fault identification method, device and system
CN111046756A (en) Convolutional neural network detection method for high-resolution remote sensing image target scale features
CN112395953A (en) Road surface foreign matter detection system
CN116740344A (en) Knowledge distillation-based lightweight remote sensing image semantic segmentation method and device
Wang et al. Fault detection for power line based on convolution neural network
CN117132910A (en) Vehicle detection method and device for unmanned aerial vehicle and storage medium
Zhao et al. A target detection algorithm for remote sensing images based on a combination of feature fusion and improved anchor
Zhang et al. RoI Fusion Strategy With Self-Attention Mechanism for Object Detection in Remote Sensing Images
Cao et al. Small Object Detection Algorithm for Railway Scene
Sato et al. Semantic Segmentation of Outcrop Images using Deep Learning Networks Toward Realization of Carbon Capture and Storage
CN114897858A (en) Rapid insulator defect detection method and system based on deep learning
Han et al. Instance Segmentation of Transmission Line Images Based on an Improved D-SOLO Network
Que et al. Low altitude, slow speed and small size object detection improvement in noise conditions based on mixed training
Luo et al. SOLOv2-cable: A Power Cable Segmentation Algorithm in Complex Scenarios
Qu et al. Research on UAV Image Detection Method in Urban Low-altitude Complex Background
Zhou et al. Insulator detection for high-resolution satellite images based on deep learning
CN116524348B (en) Aviation image detection method and system based on angle period representation
CN117557775B (en) Substation power equipment detection method and system based on infrared and visible light fusion
Zhu et al. Rgb-d saliency detection based on cross-modal and multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant