CN115115601A

CN115115601A - Remote sensing ship target detection method based on deformation attention pyramid

Info

Publication number: CN115115601A
Application number: CN202210786076.6A
Authority: CN
Inventors: 张磊; 张悦; 刘佰龙; 梁志贞
Original assignee: China University of Mining and Technology CUMT
Current assignee: China University of Mining and Technology CUMT
Priority date: 2022-07-04
Filing date: 2022-07-04
Publication date: 2022-09-27

Abstract

A remote sensing ship target detection method based on a deformed attention pyramid belongs to the technical field of remote sensing detection. The method comprises the following steps: the system comprises a deformation attention pyramid network, a multi-scale feature enhancement network, a multi-scale self-adaptive pooling network and a rotary prediction network. The detected image sequentially passes through the sub-networks, so that the feature extraction, the feature enhancement, the region of interest generation and the ship marking are gradually realized. The advantages are that: the pyramid network suppresses background noise of the remote sensing image from multiple scales, and reduces interference of irrelevant features on the model. The multi-scale feature enhancement network uses parallel Laplace convolution kernels and cavity convolution kernels to sharpen weak and small target features, and feature description of a model on a small target ship is enhanced. The multi-scale self-adaptive pooling network acquires a more accurate detection region of interest by using a multi-scale self-adaptive pooling strategy. The rotation prediction network generates a rotation prediction frame based on a regression strategy, and the rotation prediction frame is used for accurately marking ships in any direction.

Description

Remote sensing ship target detection method based on deformation attention pyramid

Technical Field

The invention belongs to the technical field of remote sensing detection, and particularly relates to a remote sensing ship target detection method based on a deformed attention pyramid.

Background

The automatic detection of ships has important significance and value in marine traffic management, marine rescue and national defense safety application. In recent years, with the rapid development of remote sensing technology, researchers can acquire high-quality optical remote sensing images from satellite-borne or airborne sensors at reasonable cost. Therefore, the optical remote sensing image becomes a main image source for ship detection. Therefore, the ship target detection technology based on the optical remote sensing image becomes a research hotspot.

At present, the ship detection methods related to remote sensing images mainly fall into two categories: a traditional detection method and a ship detection method based on deep learning. Prior to the advent of deep-learning ship detection methods, many researchers have applied machine learning techniques to the field of ship detection. However, ships in the remote sensing images are affected by background noise such as illumination, cloud layers and particles in the atmosphere, and the ships have the characteristics of small targets, arbitrary pointing and crowded arrangement. Therefore, the traditional method is difficult to accurately summarize the ship characteristics under various imaging conditions. The traditional detection method has low accuracy and lacks generalization capability.

With the wide application of the convolutional neural network, the convolutional neural network has strong performance in the field of target detection by virtue of superior feature extraction capability. Many ship detection methods based on deep learning are born, but most methods use a horizontal bounding box to detect ships, so that detection results have more redundant information and multi-scale ship detection cannot be carried out. Some later scholars proposed a spinning Libra R-CNN method. The method has the idea of balancing three levels of a neural network, and firstly, extracting multi-scale ship features by using a balancing pyramid; then, generating a region proposal by using the cross-over ratio balance sampling; and finally, balanced L1 Loss is introduced into the rotary branch, so that the ship detection performance in a crowded scene is greatly improved. However, the method is still interfered by background noise, and has higher missing rate on small target ships. How to extract the characteristics of ships with large aspect ratio, arbitrary shape and size and arbitrary direction in crowded scenes to the maximum extent and effectively inhibit background noise is the problem existing at present.

Disclosure of Invention

Aiming at the existing problems or defects, the invention provides a remote sensing ship target detection method based on a deformation attention pyramid, which solves the problems that ships in an optical remote sensing image are interfered by background noise, and the existing target detection method cannot accurately detect the ships due to the small, large aspect ratio, crowded arrangement and any direction characteristics of the ships.

The purpose of the invention is realized as follows: the remote sensing ship target detection method comprises the following steps: the following four sub-networks, respectively: the system comprises a deformation attention pyramid network, a multi-scale feature enhancement network, a multi-scale self-adaptive pooling network and a rotation prediction network; the detected image sequentially passes through the sub-networks, so that the feature extraction, the feature enhancement, the region of interest generation and the ship marking are gradually realized;

firstly, integrating a ship deformation convolution unit and a ship attention module in a deformation attention pyramid network to fully extract the characteristics of ships in any direction;

the ship deformation convolution unit can distribute convolution sampling points of original fixed positions inside the ship; the ship attention module can effectively distribute the weight of different components in the multi-scale characteristic diagram, so that the model focuses on ships with different scales and suppresses multi-scale background noise.

Thirdly, extracting the characteristics of the ships by adopting a multi-scale characteristic enhancement module in the multi-scale characteristic enhancement network so as to enhance the characteristics of the ships with different scales and enhance the target characteristics of the ships;

the multi-scale feature enhancement module uses parallel Laplace convolution kernels and cavity convolution kernels, and the Laplace convolution kernels enhance ship targets in different scale feature maps, so that the detection performance of the network is further improved; background information of the small-scale target is extracted through the cavity convolution kernel, and the detection performance of the network on the small-target ship is favorably improved;

then, a multi-scale self-adaptive pooling strategy is adopted in the multi-scale self-adaptive pooling network, the region of interest of the ship with the large aspect ratio is accurately obtained, and multi-scale context information of the ship is learned;

the multi-scale self-adaptive pooling strategy is based on a bilinear interpolation method, and can self-adaptively acquire transverse and longitudinal sampling points according to the regional suggestions, so that the model can more uniformly sample the ship target to acquire more effective ship feature description, and can realize reasoning on an adjacent unknown target to a certain extent.

And finally, generating a ship target-based RBB (radial basis function) by the rotary prediction network through five-parameter regression and foreground and background classification for accurately marking ships in any direction.

The method comprises the following specific steps:

step S1, inputting the image to be detected into a deformation attention pyramid network for processing, wherein the network uses a ship deformation convolution unit as a basic feature extraction module and uses a ship attention module to suppress background noise;

step S2, inputting the multi-scale feature map processed by the deformed attention pyramid network into a multi-scale feature enhancement network, and performing feature enhancement on the multi-scale feature map by using a multi-scale feature enhancement module in the multi-scale feature enhancement network;

step S3, inputting the data after the multi-scale feature enhancement network is enhanced into a multi-scale self-adaptive pooling network, firstly generating a multi-scale rotating region suggestion by using an Oriented RPN, and then obtaining a more accurate region of interest by using a multi-scale self-adaptive pooling strategy; the Oriented RPN is a method based on an anchor strategy, and can generate regional suggestions with various scales and in any direction;

s4, further identifying the ship in the remote sensing image on the basis of the region of interest by the rotation prediction network, and firstly generating a rotation bounding box to mark the ship in any direction; calculating the intersection ratio of the rotating bounding boxes by using a skew IoU strategy, so as to better distinguish positive and negative samples; finally, reasonably selecting a rotation prediction box by using a soft-rotate-NMS strategy; the skew IoU strategy is based on a triangulation idea and can accurately calculate the intersection ratio of the rotating bounding boxes; the soft-rotate-NMS strategy updates the confidence degrees of the residual adjacent rotation prediction frames according to the overlapping area, and then reasonably selects the rotation prediction frames;

step S5, training the ship detection network by adopting a multitask loss function, wherein the multitask loss function comprises regression loss and classification loss, the regression loss is used for adjusting the position of the rotating boundary frame, and the classification loss is used for judging whether the target to be detected is a ship or not; and training the optimized parameters step by step to realize the detection of the remote sensing ship target.

In step S1, the deformed attention pyramid network processing adopts two processing procedures, namely bottom-up processing and top-down processing;

the bottom-up treatment process comprises the following steps: fully extracting the characteristics of the input image by using the SDC unit to form a characteristic pyramid { C with one dimension increasing and resolution decreasing ₂ ,C ₃ ,C ₄ ,C ₅ }；

The top-down process is: with { C ₂ ,C ₃ ,C ₄ ,C ₅ Constructing a characteristic pyramid (P) from top to bottom in an up-sampling and transverse connection mode by taking the P as input ₂ ,P ₃ ,P ₄ ,P ₅ }; in the two processing processes, a ship attention module is integrated to suppress background noise in feature maps with different scales;

when identifying ships with complex sizes, the following processing is carried out:

the detection aiming at the ships with smaller sizes adopts the feature map with higher resolution, and utilizes the multi-scale feature fusion mode to combine the feature map with low resolution { P } ₄ ,P ₅ The semantic information in the map is merged into a high-resolution feature map P ₂ ,P ₃ }；

Aiming at a large-size ship, a high-dimensional characteristic diagram P ₅ Performing a down-sampling operation of step size 2 on the basis of (2) to generate P ₆ The characteristic diagram is specially used for detecting large ships.

The feature extraction process of the SDC unit comprises the following steps:

firstly, adjusting the number of channels of an input feature map by using a 1 × 1 traditional convolution, and adaptively acquiring a two-dimensional offset by using a3 × 3 traditional convolution; then, based on the two-dimensional offset, using 3 multiplied by 3 deformation convolution to extract the ship characteristics with any shape, size and direction; finally, using 1 × 1 traditional convolution to reduce dimension, using a linear activation function to activate features, and generating an output feature map;

the ship deformation convolution unit is as follows:

the remote sensing image has higher resolution, so that a model needs to occupy larger video memory resources when training the image; if the batch size is set to be too large, the video memory of the computer overflows and cannot be trained; compared with batch normalization, the calculation mode of grouping normalization is optimized; the grouping normalization operation does not depend on the size of the batch size, and the mean value and the variance of each group in the channel direction are calculated; the grouping normalization is more suitable for training the remote sensing image; the batch size represents the number of samples taken for a single training session;

a linear activation function is used in the ship deformation convolution unit, so that a model is ensured to keep more complete characteristic information; if the input features are located in the non-low-dimensional subspace of the input space, the nonlinear activation function can destroy the extracted features and influence the detection performance;

the ship attention module is as follows:

the ship attention module suppresses the background noise of the remote sensing image from a plurality of scales, reduces the interference of irrelevant features on the model, and suppresses the background noise to the maximum extent;

the ship attention module is based on the inherent multi-scale architecture of the deformed attention pyramid network, including SA ₂ 、SA ₃ And SA ₄ A module; SA ₂ The process of module attentiveness adjustment is described as follows:

V ₃ ＝Flatten(Conv(F ₃ ))

A ₃ ＝Grid(Sigmoid(V ₃ ))

wherein the content of the first and second substances,

representing that feature graphs with the same size are fused; f ₃ Representing the fused feature map; flatten represents a flattening operation that can transform the feature map into a one-dimensional vector V ₃ (ii) a Sigmoid is an activation function that can obtain V ₃ Attention coefficients of the respective feature points; grid denotes a gridding operation, changing the one-dimensional attention coefficient to P ₃ An attention coefficient map a3 having the same size;

representing the multiplication of corresponding elements of the two matrices;

SA ₃ and SA ₄ Attention regulation process and SA ₂ The process of the module to adjust attention is the same.

In step S2, a multi-scale feature enhancement module in a multi-scale feature enhancement network is used, where the multi-scale feature enhancement module uses parallel laplacian convolution kernels and hole convolution kernels; the Laplace convolution kernel is used for sharpening small and weak target features; the cavity convolution kernel is used for improving the receptive field of the weak and small target, and the receptive field is an area where the network can sense the input characteristic diagram, so that the multi-scale characteristic enhancement network enhances the characteristic description of the model on the small target ship; performing feature enhancement on the obtained feature map, performing sufficient feature extraction on the ship target in the remote sensing image through a deformed attention pyramid network, accurately detecting a multi-scale ship, and eliminating the omission factor of a small target ship;

the enhancement process of the multi-scale feature enhancement module is as follows:

wherein, GN (·)) Representing packet normalization; LConv _3×3 The method comprises the steps of representing a Laplace convolution kernel with the size of 3 x 3, wherein the Laplace convolution kernel has the function of sharpening ship targets with different scales through Laplace characteristics; conv _1×1 A standard convolution kernel with a size of 1 × 1 is represented for adjusting the number of channels; DConv _3×3 Representing the cavity convolution with the size of 3 multiplied by 3, extracting the background information of the small-scale target in a mode of enlarging the receptive field so as to promote the characteristic description of the model to the weak and small target ship;

a feature fusion operation is represented.

In step S3, generating a region suggestion in various scales and arbitrary directions using an oriente RPN; performing pooling operation to generate an ROI with a fixed size of k multiplied by k;

the multi-scale self-adaptive pooling strategy is based on a bilinear interpolation method, can self-adaptively and uniformly acquire transverse and longitudinal sampling points, and can aggregate the transverse and longitudinal sampling points into a plurality of closed-loop rectangles to form a compact characteristic representation;

in the multi-scale self-adaptive pooling network, the deep characteristic diagram has low resolution, large receptive field and weak geometric information representation capability, so that the contour characteristics of the ship can be more easily highlighted, and the positioning capability of the model can be improved; the deep features are represented as closed loop rectangles near the center point;

the shallow feature map has high resolution, small corresponding receptive field and strong representation capability of geometric information, and is beneficial to improving the classification capability of the model; the shallow features are represented as closed-loop rectangles away from the center point;

finally, the multi-scale adaptive pooling network generates a fixed-size feature map covering the deep and shallow feature information through the closed-loop rectangles.

In step S4, the rotation prediction network further identifies the ship in the remote sensing image on the basis of the region of interest, and generates a rotation bounding box to mark the ship in any direction.

The rotating bounding box may be represented by five parameters, x, y, w, h, θ, respectively. (x, y) represents the coordinates of the center point of the rotating bounding box; w and h represent the width and height of the rotating bounding box, respectively; θ represents the angle of w of the rotated bounding box with respect to the x-axis. The angle range is [ -90 °,0 ° ]; the regression process for the rotated bounding box is as follows:

t _x ＝(x-x _a )/w _a ,t _y ＝(y-y _a )/h _a ,t _w ＝log(w/w _a ),t _h ＝log(h/h _a ),t _θ ＝θ-θ _a

t′ _x ＝(x′-x _a )/w _a ,t′ _y ＝(y′-y _a )/h _a ,t′ _w ＝log(w′/w _a ),t′ _h ＝log(h′/h _a ),t′ _θ ＝θ′-θ _a

in the above formula, x, y, w, h, θ represent parameters of the label bounding box; x is the number of _a ,y _a ,w _a ,h _a ,θ _a Parameters representing auxiliary bounding boxes marked according to the regional suggestions; x ', y ', w ', h ', θ ' represent parameters of the rotation prediction box.

After the rotating bounding box is generated, in order to better distinguish positive and negative samples, calculating the intersection ratio of the rotating bounding box by using a skewIoU strategy; the skewIoU strategy is based on a triangulation idea, and can calculate intersection ratio more accurately;

finally removing redundant rotation prediction frames by using soft-rotate-NMS strategy; when the rotation prediction frame with the highest confidence coefficient is selected, the soft-rotate-NMS strategy updates the confidence coefficients of the rest adjacent rotation prediction frames according to the overlapping area, reasonably selects the rotation prediction frame, and accurately marks the ship.

In step S5, training the ship detection network optimization parameters to achieve remote sensing ship target detection: adopting a multi-task loss function, including regression loss and classification loss;

the regression loss is used for adjusting the position of the rotating bounding box, and the classification loss is used for judging whether the target to be detected is a ship or not; the definition is as follows:

wherein i refers to the ith auxiliary bounding box; l is _cls The logarithmic loss of the foreground and the background is obtained, wherein the foreground is a ship and the background is a non-ship target; l is _reg The regression Loss is expressed as the regression Loss of a target candidate frame, and in order to improve the robustness of the model and effectively solve the problem of imbalance of positive and negative samples, balance L1 Loss is used as the regression Loss; p _i The probability that each auxiliary bounding box belongs to the target ship is indicated; t is t _i Coordinate parameters representing a prediction candidate box of the sample;

coordinate parameters representing true candidate frames of the sample; λ is a balance parameter of the weight;

class L _cls And regression of L _reg The loss of (a) is specifically expressed as:

L _cls ＝-(1-y _i )log(1-p _i )-(y _i )log(p _i )

wherein, y _i Denotes the sign of the sample when y _i Is 1, it means that the corresponding sample is a positive sample, when y _i A value of 0 indicates that the corresponding sample is a negative sample. In particular, when y _i At 0, there is no regression loss. | x | represents

Gamma is defined as

The method has the advantages that due to the adoption of the scheme, the method is composed of a deformation attention pyramid network, a multi-scale feature enhancement network, a multi-scale self-adaptive pooling network and a rotation prediction network. Firstly, the deformation attention pyramid network integrates a ship deformation convolution unit and a ship attention module which are designed independently so as to fully extract the characteristics of ships in any direction and guide the network to pay attention to ship regions with different scales. And thirdly, designing a multi-scale feature enhancement module, wherein the module uses parallel Laplace convolution kernels and cavity convolution kernels to extract features of ships so as to enhance the characteristics of the ships with different scales. Then, a multi-scale feature enhancement network is designed for accurately acquiring the region of interest of the ship with the large aspect ratio and learning multi-scale context information of the ship. And finally, generating a rotation detection frame through a rotation prediction network, and accurately marking the ships in any direction. Compared with other mainstream target detection algorithms, the method has obvious advantages.

The problem that the existing target detection method cannot accurately detect the ships due to the fact that the ships in the optical remote sensing image are interfered by background noise and the ships generally have the characteristics of small target, large aspect ratio, crowded arrangement and any direction is solved, and the purpose of the invention is achieved.

The advantages are that: the pyramid network is noticed in a deformation mode, the problem that other advanced target detection models cannot sufficiently extract ship features is solved, background noise of remote sensing images is restrained from multiple scales, and interference of irrelevant features on the models is reduced.

The multi-scale feature enhancement network uses parallel Laplace convolution kernels and cavity convolution kernels to sharpen the features of the weak and small targets, and improves the receptive field of the weak and small targets, wherein the receptive field is an area where the network can sense an input feature map, and the feature description of a model on a small target ship is enhanced;

the multi-scale self-adaptive pooling network acquires a more accurate detection region of interest by using a multi-scale self-adaptive pooling strategy.

The rotation prediction network generates a ship with a rotation prediction frame accurately marked in any direction in the remote sensing image based on a regression strategy.

Description of the drawings:

FIG. 1 is a flow chart of the present invention.

Fig. 2 is a diagram of the remote sensing ship target detection structure of the present invention.

FIG. 3 is a structural diagram of a deformed attention pyramid network according to the present invention.

Fig. 4 is a diagram of a multi-scale feature enhanced network architecture according to the present invention.

Fig. 5 is a diagram of a multi-scale adaptive pooling network architecture of the present invention.

Fig. 6 illustrates the working principle of the multi-scale adaptive pooling of the present invention.

FIG. 7 is a representation of a rotating bounding box according to the present invention.

FIG. 8 is a graph showing the results of the detection of the present invention.

Detailed Description

The remote sensing ship target detection method comprises the following steps: the following four sub-networks, respectively: the method comprises the following steps of (1) transforming an attention pyramid network, a multi-scale feature enhancement network, a multi-scale self-adaptive pooling network and a rotary prediction network; the detected image sequentially passes through the sub-networks, so that the feature extraction, the feature enhancement, the region of interest generation and the ship marking are gradually realized;

firstly, integrating a ship deformation convolution unit and a ship attention module in a deformation attention pyramid network to suppress multi-scale background noise;

the ship deformation convolution unit can distribute convolution sampling points of original fixed positions inside the ship; the ship attention module can effectively distribute the weight of different components in the multi-scale characteristic diagram, so that the model pays attention to ships with different scales and inhibits multi-scale background noise.

Then, a multi-scale feature enhancement module is adopted in the multi-scale feature enhancement network to extract features of the ships, so that the ship features of different scales are enhanced, and the ship target features are enhanced;

the multi-scale feature enhancement module uses parallel Laplace convolution kernels and cavity convolution kernels, and the Laplace convolution kernels enhance ship targets in different scale feature maps so as to further improve the detection performance of the network; background information of the small-scale target is extracted through the cavity convolution kernel, and the detection performance of the network on the small-target ship is favorably improved;

thirdly, a multi-scale self-adaptive pooling strategy is adopted in the multi-scale self-adaptive pooling network, so that the region of interest of the ship with the large aspect ratio is accurately obtained and the multi-scale context information of the ship is learned;

the multi-scale self-adaptive pooling strategy is based on a bilinear interpolation method, and can self-adaptively acquire transverse and longitudinal sampling points according to the regional suggestions, so that the model can more uniformly sample ship targets to acquire more effective ship feature description, and inference on an adjacent unknown target is realized to a certain extent.

And finally, generating a rotation detection frame through a rotation prediction network, and generating a rotation boundary frame based on a ship target through five-parameter regression and foreground and background classification for accurately marking ships in any direction.

English expression of the deformed Attention Pyramid Network is Deformable Attention Pyramid Network, and is abbreviated as DAPNet; the multi-scale Feature Enhancement Network is represented as a Multiscale Feature Enhancement Network in English, and is abbreviated as MFENet; the Multiscale Adaptive Pooling Network is represented as Multiscale Adaptive Positive Network in English, and is abbreviated as MAPNT; the Rotation Prediction Network is represented by a Rotation Prediction Network in English, and is abbreviated as RPNet.

The method comprises the following specific steps:

step S1, inputting the image to be detected into DAPNet processing, wherein the network uses a ship deformation convolution unit as a basic feature extraction module and uses a ship attention module to suppress background noise; the Ship Deformation Convolution is represented by Ship Deformation convention in English, and is abbreviated as SDC; the Ship Attention English is expressed as Ship Attention, and is abbreviated as SA;

step S2, inputting the multi-scale feature map processed by the DAPNet into the MFENet, and performing feature enhancement on the multi-scale feature map by using a multi-scale feature enhancement module in the MFENet; the multi-scale Feature Enhancement module is expressed as Multiscale Feature Enhancement in English, and is abbreviated as MFE;

step S3, inputting data into MAPNT, wherein the MAPNT firstly uses an Oriented RPN to generate a multi-scale rotating region suggestion, and then uses a multi-scale self-adaptive pooling strategy to obtain a more accurate region of interest; the Oriented RPN is a method based on an anchor strategy, and can generate regional suggestions with various scales and any directions; the multi-scale self-Adaptive Pooling English is expressed as Multiscale Adaptive Pooling, which is abbreviated as MA Pooling; the Region of Interest is represented as Region of Interest in English, and is abbreviated as ROI;

s4, further identifying ships in the remote sensing image on the basis of the ROI, and firstly generating a rotating boundary frame to mark the ships in any direction; calculating the intersection ratio of the rotating bounding boxes by using a skew IoU strategy, so as to better distinguish positive and negative samples; finally, reasonably selecting a rotation prediction box by using a soft-rotate-NMS strategy; the English expression of the Rotating Bounding Box is a Rotating Bounding Box, which is abbreviated as RBB; the Intersection ratio is expressed as Intersection over Union, which is abbreviated as IoU; the skew IoU strategy is based on a triangulation idea and can accurately calculate the intersection ratio of the rotating bounding boxes; the soft-rotate-NMS strategy updates the confidence degrees of the residual adjacent rotation prediction frames according to the overlapping area, and then reasonably selects the rotation prediction frames;

step S5, training the ship detection network by adopting a multitask loss function, wherein the multitask loss function comprises regression loss and classification loss, the regression loss is used for adjusting the position of the RBB, and the classification loss is used for judging whether the target to be detected is a ship or not; and training the optimized parameters step by step to realize the detection of the remote sensing ship target.

In step S1, the DAPNet process employs two bottom-up and top-down processes;

The top-down process is: with { C ₂ ,C ₃ ,C ₄ ,C ₅ Constructing a characteristic pyramid (P) from top to bottom in an up-sampling and transverse connection mode by taking the P as input ₂ ,P ₃ ,P ₄ ,P ₅ }; in the two processing processes, an SA module is integrated to inhibit background noise in feature maps with different scales;

Aiming at a large-size ship, a high-dimensional characteristic diagram P ₅ On the basis of (2) performing a down-sampling operation of step size 2 to generate P ₆ The characteristic diagram is specially used for detecting large ships.

The feature extraction process of the SDC unit comprises the following steps:

firstly, adjusting the number of channels of an input feature map by using a 1 × 1 traditional convolution, and adaptively acquiring a two-dimensional offset by using a3 × 3 traditional convolution; then, based on the two-dimensional offset, using 3 x 3 deformation convolution to extract the ship features with any shape, size and direction; finally, using 1 × 1 traditional convolution to reduce dimension, using a linear activation function to activate features, and generating an output feature map;

the SDC unit is as follows:

the remote sensing image has higher resolution, so that a model needs to occupy larger video memory resources when training the image; if the batch size is set to be too large, the video memory of the computer overflows and cannot be trained; compared with batch normalization, the calculation mode of grouping normalization is optimized; the grouping normalization operation does not depend on the size of the batch size, and the mean value and the variance of each group in the channel direction are calculated; the grouping normalization is more suitable for training the remote sensing image; the batch size represents the number of samples taken for a single training session; the Batch Normalization English is expressed as Batch Normalization and is abbreviated as BN; the grouping Normalization English is expressed as Group Normalization, and is abbreviated as GN;

a linear activation function is used in the SDC unit, so that a model is ensured to keep more complete characteristic information; if the input features are located in the non-low-dimensional subspace of the input space, the nonlinear activation function can destroy the extracted features and influence the detection performance;

the SA module is as follows:

the SA module suppresses the background noise of the remote sensing image from a plurality of scales, reduces the interference of irrelevant features on the model, and suppresses the background noise to the maximum extent;

the SA module is based on the inherent multi-scale architecture of the DAPNet itself, including SA ₂ 、SA ₃ And SA ₄ A module; SA ₂ The process of module attention adjustment is described as follows:

V ₃ ＝Flatten(Conv(F ₃ ))

A ₃ ＝Grid(Sigmoid(V ₃ ))

wherein the content of the first and second substances,

showing that the feature maps with the same size are fused; f ₃ Representing the fused feature map; flatten denotes a flattening operation, which can transform the feature map into a one-dimensional vector V ₃ (ii) a Sigmoid is an activation function that can obtain V ₃ Attention coefficients of the respective feature points; grid denotes a gridding operation, changing the one-dimensional attention coefficient to P ₃ Attention coefficient map A having the same size ₃ ；

Representing the multiplication of corresponding elements of the two matrices;

In step S2, an MFE module in the MFENet is employed, where the MFE module uses parallel laplacian convolution kernels and hole convolution kernels; the Laplace convolution kernel is used for sharpening the weak and small target features; the cavity convolution kernel is used for improving the reception field of the weak and small target, and the reception field is an area where the network can sense an input characteristic diagram, so that the MFENet enhances the characteristic description of the model on the small target ship; after the ship target in the remote sensing image is fully subjected to feature extraction through the DAPNet, the obtained feature map is subjected to feature enhancement, a multi-scale ship is accurately detected, and the missing rate of a small target ship is eliminated;

the enhancement process for the MFE module is as follows:

wherein GN (·) denotes packet normalization; LConv _3×3 The method comprises the steps of representing a Laplace convolution kernel with the size of 3 x 3, wherein the Laplace convolution kernel has the function of sharpening ship targets with different scales through Laplace characteristics; conv _1×1 A standard convolution kernel with a size of 1 × 1 is represented for adjusting the number of channels; DConv _3×3 Representing the cavity convolution with the size of 3 multiplied by 3, extracting the background information of the small-scale target in a mode of enlarging the receptive field so as to promote the characteristic description of the model to the weak and small target ship;

a feature fusion operation is represented.

the MApooling strategy is based on a bilinear interpolation method, can adaptively and uniformly acquire transverse and longitudinal sampling points, and can aggregate the transverse and longitudinal sampling points into a plurality of closed-loop rectangles to form a compact characteristic representation.

In MAPNT, the deep characteristic map has low resolution, large receptive field and weak geometric information representation capability, so that the contour characteristics of ships can be more easily highlighted, and the positioning capability of the model can be improved; the deep features are represented as closed loop rectangles near the center point;

finally, MAPNet generates a fixed-size feature map covering the deep and shallow feature information from these closed-loop rectangles.

In step S4, RPNet further identifies ships in the remote sensing image on the basis of the ROI area and generates an RBB to mark the ships in any direction;

the RBB can be represented by five parameters, x, y, w, h, θ. (x, y) represents the center point coordinates of the RBB; w and h represent the width and height of the RBB, respectively; θ represents the angle of w of the RBB with respect to the x-axis. The angle range is [ -90 °,0 ° ]; the regression process for RBB is as follows:

in the above formula, x, y, w, h, θ represent parameters of the label bounding box; x is the number of _a ,y _a ,w _a ,h _a ,θ _a Parameters representing auxiliary bounding boxes marked according to the regional suggestions; x ', y ', w ', h ', θ ' represents the parameters of the rotation prediction box;

after the RBB is generated, in order to better distinguish positive and negative samples, calculating the intersection ratio of the rotating bounding boxes by using a skewIoU strategy; the skewIoU strategy is based on a triangulation idea, and can calculate intersection ratio more accurately;

the regression loss is used for adjusting the position of the RBB, and the classification loss is used for judging whether the target to be detected is a ship or not; the definition is as follows:

wherein i refers to the ith auxiliary bounding box; l is _cls The logarithmic loss for the foreground and background (foreground is ship, background is non-ship target); l is _reg Expressed as the regression Loss of the target candidate frame, in order to improve the robustness of the model and effectively solve the problem of unbalance of positive and negative samples, the invention takes balanced L1 Loss as the regression Loss; p _i The probability that each auxiliary bounding box belongs to the target ship is indicated; t is t _i Coordinate parameters representing a prediction candidate box of the sample;

L _cls ＝-(1-y _i )log(1-p _i )-(y _i )log(p _i )

Gamma is defined as

Example 1:

the following description of the embodiments of the present invention will be provided in conjunction with the accompanying drawings 1 to 8 in order to facilitate the understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments, and it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined and defined in the appended claims, and all matters produced by the inventive concept are protected thereby.

An embodiment of a remote sensing ship target detection method based on a deformed attention pyramid in a DOTA data set comprises the following steps:

s1, a DOTA data set is one of the largest public data sets used in the field of remote sensing target detection, and remote sensing images of the DOTA data set are derived from Google Earth, GF-2 and JL-1 satellites. The data set contains 15 types of targets (ships, harbors, helicopters, etc.), and the resolution range of the image is about 800 × 800 to 4000 × 4000.

Because the images of the ships contained in the DOTA are limited, the invention uses data enhancement techniques to augment them. A total of 13624 images (about 165K ship targets) were obtained and 1/2 was randomly drawn as a training set, 1/3 as a test set, and 1/6 as a validation set. Finally, in order to fairly compare the performances of the HBB detector and the RBB detector, the invention respectively uses the HBB and the RBB to label the training sets, the used labeling tools are LabelImg and RoLabelImg, and the target label is ship.

And S2, inputting the DOTA data set to be processed into the DAPNet for processing, wherein the network uses an autonomously designed ship deformation convolution unit as a basic feature extraction module and uses a ship attention module to suppress background noise. The structure of the DAPNet is shown in fig. 3.

And S3, inputting the DAPNet-processed multi-scale feature map into the MFENet, and performing feature enhancement processing on the multi-scale feature map by using a multi-scale feature enhancement module in the MFENet. The MFENet structure is shown in fig. 4.

S4, then inputting data into MAPNet, where the MAPNet generates a multi-scale rotation region suggestion using an organized RPN, and then obtains a more precise region of interest using an MA posing strategy. The structure of the MAPNET structure is shown in FIG. 5, and the operating principle of the MA poolling strategy is shown in FIG. 6.

S5, and finally, RPNet is used to mark ships in any direction by generating RBBs, as shown in fig. 7.

S6, setting experimental environment and hyper-parameters:

the basic backbone network used by the model is ResNet101 integrated with an SDC unit and an SA module, named SS-ResNet101, the deep learning framework depended on is PyTorch 1.1.0, and the programming language is Python 3.6. All experiments are carried out on a computer carrying NVIDIA GeForce RTX 3080Ti, and the deep learning acceleration environment is CUDA 10.0 and cuDNN7.0. During training, the invention performs optimization training using MomentumOptimizer, with a batch size of 16, weight decay of 0.0001, and momentum of 0.9. The total number of iterations of training is 160K, the initial learning rate is 0.003, and the learning rate is attenuated by the increase of the number of iterations, so that the model can fully learn the target features.

The results of the model tests are shown in FIG. 8. (a) The (a) is a scene of ship crowding distribution; (e) (h) is the scene of a small target ship; (i) (l) a scene of a multi-scale ship; (m) to (p) are scenes in which background noise is present.

Claims

1. A remote sensing ship target detection method based on a deformed attention pyramid is characterized by comprising the following steps: the system comprises the following four sub-networks: the method comprises the following steps of (1) transforming an attention pyramid network, a multi-scale feature enhancement network, a multi-scale self-adaptive pooling network and a rotary prediction network; the detected image sequentially passes through the sub-networks, so that the feature extraction, the feature enhancement, the region of interest generation and the ship marking are gradually realized;

and finally, generating a rotation detection frame through a rotation prediction network, and generating a rotation boundary frame based on a ship target through parameter regression and foreground and background classification for accurately marking ships in any direction.

2. The remote sensing ship target detection method based on the deformed attention pyramid as claimed in claim 1, characterized in that: the method comprises the following specific steps:

step S2, inputting the multi-scale feature map processed by the deformed attention pyramid network into a multi-scale feature enhancement network, and performing feature enhancement by using a multi-scale feature enhancement module therein;

step S3, inputting the data after the multi-scale feature enhancement network is enhanced into a multi-scale self-adaptive pooling network, firstly generating a multi-scale rotating region suggestion by using OrientedRPN, and then obtaining a more accurate region of interest by using a multi-scale self-adaptive pooling strategy;

step S4, the ship in the remote sensing image is further identified by the rotation prediction network on the basis of the region of interest, and a rotation bounding box is firstly generated to mark the ship in any direction; then calculating the intersection ratio of the rotating bounding boxes by using a skewIoU strategy, thereby better distinguishing positive and negative samples; finally, reasonably selecting a rotation prediction box by using a soft-rotate-NMS strategy;

3. The remote sensing ship target detection method based on the deformed attention pyramid as claimed in claim 2, characterized in that: in step S1, the deformed attention pyramid network processing adopts two processing procedures, namely bottom-up processing and top-down processing;

the bottom-up treatment process comprises the following steps: fully extracting the characteristics of the input image by using a ship deformation convolution unit to form a characteristic pyramid { C with gradually increased dimensionality and gradually decreased resolution ₂ ,C ₃ ,C ₄ ,C ₅ }；

Aiming at a ship with larger size, a high-dimensional characteristic diagram P ₅ On the basis of (2) performing a down-sampling operation of step size 2 to generate P ₆ The characteristic diagram is specially used for detecting large ships.

4. The remote sensing ship target detection method based on the deformed attention pyramid as claimed in claim 3, characterized in that: the characteristic extraction process of the ship deformation convolution unit comprises the following steps:

the ship deformation convolution unit is as follows:

the remote sensing image has higher resolution, so that a model needs to occupy larger video memory resources when training the image; if the batch size is set to be too large, the video memory of the computer overflows and cannot be trained; compared with batch normalization, the calculation mode of grouping normalization is optimized; the grouping normalization operation does not depend on the size of the batch size, and the mean value and the variance of each group in the channel direction are calculated; the grouping normalization is more suitable for training the remote sensing image;

the ship attention module is as follows:

V ₃ ＝Flatten(Conv(F ₃ ))

A ₃ ＝Grid(Sigmoid(V ₃ ))

wherein the content of the first and second substances,

showing that the feature maps with the same size are fused; f ₃ Representing the fused feature map; flatten represents a flattening operation that can transform the feature map into a one-dimensional vector V ₃ (ii) a Sigmoid is an activation function that can obtain V ₃ Attention coefficients of the respective feature points; grid denotes a gridding operation, changing the one-dimensional attention coefficient to P ₃ Attention coefficient map A with the same size ₃ ；

Representing the multiplication of corresponding elements of the two matrices;

5. The remote sensing ship target detection method based on the deformed attention pyramid as claimed in claim 2, characterized in that: in step S2, a multi-scale feature enhancement module in a multi-scale feature enhancement network is used, where the multi-scale feature enhancement module uses parallel laplacian convolution kernels and hole convolution kernels; the Laplace convolution kernel is used for sharpening the weak and small target features; the cavity convolution kernel is used for improving the receptive field of the weak and small target, so that the multi-scale feature enhancement network enhances the feature description of the model on the small target ship; after the ship target in the remote sensing image is fully subjected to feature extraction through the deformed attention pyramid network, the obtained feature map is subjected to feature enhancement, a multi-scale ship is accurately detected, and the omission factor of a small target ship is eliminated;

representing a feature fusion operation.

6. The remote sensing ship target detection method based on the deformed attention pyramid as claimed in claim 2, characterized in that: in step S3, generating a region suggestion in various scales and arbitrary directions using orientrpn; performing pooling operation to generate an interested area with a fixed size of k multiplied by k;

the multi-scale adaptive pooling network generates a fixed-size feature map covering deep and shallow feature information through the closed-loop rectangles.

7. The remote sensing ship target detection method based on the deformed attention pyramid as claimed in claim 2, characterized in that: in the step S4, the rotation prediction network further identifies ships in the remote sensing image on the basis of the region of interest, and generates a rotation boundary frame to mark the ships in any direction;

the rotating bounding box may be represented by five parameters, x, y, w, h, θ; (x, y) represents the coordinates of the center point of the rotating bounding box; w and h represent the width and height of the rotating bounding box, respectively; θ represents the angle of w of the rotating bounding box with respect to the x-axis; the angle range is [ -90 °,0 ° ]; the regression process for the rotated bounding box is as follows:

finally removing redundant rotation prediction frames by using soft-rotate-NMS strategy; and when the rotation prediction frame with the highest confidence coefficient is selected, updating the confidence coefficients of the rest adjacent rotation prediction frames by the soft-rotate-NMS strategy according to the overlapping area, reasonably selecting the rotation prediction frame, and accurately marking the ship.

8. The remote sensing ship target detection method based on the deformed attention pyramid as claimed in claim 2, characterized in that: in step S5, training the ship detection network optimization parameters to realize remote sensing ship target detection: adopting a multi-task loss function, including regression loss and classification loss;

wherein i refers to the ith auxiliary bounding box; l is _cls The logarithmic loss of the foreground and the background is realized, in the method, the foreground is a ship, and the background is a non-ship target; l is _reg Expressed as the regression loss of the target candidate box; p _i The probability that each auxiliary bounding box belongs to the target ship is indicated; t is t _i Coordinate parameters representing a prediction candidate box of the sample;

coordinate parameters representing true candidate frames of the sample; λ is a balance parameter of the weight; class L _cls And regression of L _reg The loss of (a) is specifically expressed as:

L _cls ＝-(1-y _i )log(1-p _i )-(y _i )log(p _i )

wherein, y _i Denotes the sign of the sample when y _i Is 1, it means that the corresponding sample is a positive sample, when y _i When the value is 0, the corresponding sample is a negative sample; in particular, when y _i At 0, there is no regression loss. | x | represents

Gamma is defined as