CN115565066A - SAR image ship target detection method based on Transformer
SAR image ship target detection method based on Transformer
- Publication number
- CN115565066A (application CN202211173313.8A)
- Authority
- CN
- China
- Prior art keywords
- edge
- image
- target
- transformer
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a Transformer-based SAR image ship target detection method which, for small-scale SAR ship targets, uses a Transformer as the backbone network and fuses the effective information of ships through a deformable attention mechanism to improve detection accuracy. The input original image is first subjected to Patch division. The patch-divided image is input into a four-stage feature extraction backbone network formed by Transformers to obtain four features of different scales from shallow to deep. The four features of different scales are input into a feature pyramid network for feature fusion, yielding five fused features of different scales from shallow to deep. A coarse edge map of the target is extracted according to the ship position annotations of the original image. The shallowest fused feature and the coarse edge map are input into an edge-guided shape enhancement module to obtain the enhanced shallowest fused feature. The enhanced shallowest fused feature and the fused features of the other four scales are input into an anchor-free target detection head to obtain the target detection result.
Description
Technical Field
The invention relates to the technical field of target detection, in particular to a SAR image ship target detection method based on a Transformer.
Background
Synthetic Aperture Radar (SAR) is an active remote sensing system whose operating frequencies lie in the microwave band. Compared with optical sensors, the main advantages of SAR sensors are that they can work day and night in all weather conditions and that their transmitted signals have strong penetrating capability, passing through clouds and fog.
SAR image data has grown rapidly with the wide application of SAR sensors, and automatic detection of targets in SAR images has become an important research topic. Existing SAR image target detection methods can be divided into traditional methods and deep learning methods. Traditional SAR target detection methods are mainly based on contrast information, geometric and texture features, and statistical analysis. Benefiting from the development of deep learning and GPU computing power, deep Convolutional Neural Networks (CNNs) have brought great breakthroughs to target detection. At present, target detection techniques based on convolutional neural networks have become the mainstream of the target detection field. The algorithms fall mainly into two classes. The first class is two-stage target detection algorithms, represented by Faster R-CNN, which first generate candidate regions that may contain targets and then perform detection; their detection accuracy is high, but their efficiency is low. The second class is single-stage target detection algorithms, mainly represented by the SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once) series; these algorithms do not generate candidate regions but perform detection directly by regression, so their detection efficiency is high while their accuracy is inferior to that of the first class. As a new neural network structure, the Transformer provides a new way of thinking for vision tasks. The Transformer was originally used in the field of Natural Language Processing (NLP). It adopts a non-recurrent encoder-decoder structure with a self-attention mechanism and achieved state-of-the-art machine translation performance. The successful application of the Transformer in NLP has led researchers to explore its use in computer vision. Backbones that replace convolutions with Transformers, such as ViT and Swin Transformer, have been shown to outperform CNNs, because the global interaction mechanism of the Transformer can quickly expand the effective receptive field of features.
However, there are two problems with using a Transformer as the backbone network in SAR ship detection. First, the background of offshore SAR ship images is very simple, so the global relationship modeling mechanism of the Transformer associates features with redundant background information. Second, the contours between inshore SAR ship targets and the shore are blurred, making inshore ship targets difficult to distinguish from the background. The Transformer-extracted features therefore need to be reconstructed with more object detail so that SAR ship targets can be focused on within similar backgrounds.
At present, no related technical solution addresses the reduction in detection accuracy caused by blurred ship target contours against backgrounds such as coastlines, islands and sea waves in SAR images.
Disclosure of the Invention
In view of this, the invention provides a Transformer-based SAR image ship target detection method which, for small-scale SAR ship targets, uses a local sparse information aggregation Transformer based on the Swin Transformer architecture as the backbone network and effectively fuses the effective information of small ships through a deformable sparse attention mechanism, thereby improving detection accuracy.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
Step 1: Perform Patch division on the input original image to obtain a patch-divided image.
Step 2: Input the patch-divided image into a four-stage feature extraction backbone network formed by local sparse information aggregation Transformers to obtain four features of different scales from shallow to deep.
Step 3: Input the four features of different scales into a feature pyramid network for feature fusion across scales, obtaining five fused features of different scales from shallow to deep.
Step 4: Extract the target edge according to the ship position annotations of the original image to obtain a coarse edge map of the target.
Step 5: Input the shallowest of the five fused features together with the coarse edge map into an edge-guided shape enhancement module to obtain the enhanced shallowest fused feature.
Step 6: Input the enhanced shallowest fused feature and the fused features of the other four scales into an anchor-free target detection head to obtain the target detection result.
Further, the Patch division of the input original image comprises the following step: dividing an input image of size H × W × 3 into non-overlapping 4 × 4 patches, wherein the feature dimension of each patch is 4 × 4 × 3 = 48 and the number of patches is (H/4) × (W/4).
Further, the four-stage feature extraction backbone network formed by the local sparse information aggregation Transformer comprises: a basic structure based on the Swin Transformer backbone, divided into four stages denoted stage one, stage two, stage three and stage four in order, which successively output four features of different scales from shallow to deep.
Stage one comprises a linear embedding module and a double Transformer module connected in sequence; the linear embedding module performs a dimension transformation on the patch-divided image.
Stage two comprises a patch fusion module and a double Transformer module connected in sequence.
Stage three comprises a patch fusion module and 3 double Transformer modules connected in sequence.
Stage four comprises a patch fusion module and a double Transformer module connected in sequence.
The double Transformer module comprises a front Transformer module and a rear Transformer module. The front Transformer module performs the following steps: a local sparse information aggregation attention map is computed for the input features of the front Transformer module; the attention map is added to the input features of the front Transformer module through a residual connection; the result is then passed through a linear transformation and a multilayer perceptron and combined with the result through another residual connection, giving the output features of the local sparse information aggregation Transformer.
The output features of the front Transformer module serve as the input features of the rear Transformer module.
The rear Transformer module performs the following steps: a sampling transformation is applied to the input features of the rear Transformer module; a local sparse information aggregation attention map is computed from the sampling-transformed features; the attention map is added to the input features of the rear Transformer module through a residual connection; the result is then passed through a linear transformation and a multilayer perceptron and combined with the result through another residual connection, giving the output features of the local sparse information aggregation Transformer.
The sampling transformation comprises the following steps: the input features of the rear Transformer module are passed through a convolutional layer to obtain data-based sampling values; bilinear interpolation sampling is then performed on the input features of the rear Transformer module using these sampling values to obtain the sampling-transformed features.
The local sparse information aggregation attention map is computed by taking as the current input features A either the input features of the front Transformer module or the sampling-transformed input features of the rear Transformer module, and performing the following steps:
S1: divide the current input features A into non-overlapping windows of equal size to obtain a window-partitioned feature map.
S2: input the window-partitioned feature map into an offset generation network to obtain the offset matrix required for computing the deformable attention.
S3: apply a linear transformation to the window-partitioned feature map to obtain the value matrix required for computing the deformable attention.
S4: apply a linear transformation to the window-partitioned feature map to obtain the attention weight matrix required for computing the deformable attention.
S5: perform bilinear interpolation sampling on the value matrix using the offset matrix, weight and sum the sampling results using the attention weight matrix, and then apply a linear transformation to obtain the local sparse information aggregation attention map.
Further, the local sparse information aggregation attention map is denoted DeformAttn(z_q, p_q, x) and is computed as:
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m · x(p_q + Δp_{mqk}) ]
where z_q and x are two representations of the input features, p_q is any reference point on the features, Δp_{mqk} is a sampling offset, A_{mqk} is an attention weight, W_m is a learnable weight, W'_m is the transpose of W_m, M is the number of attention heads, and K is the number of sampling points.
Further, the patch fusion module comprises: taking the values at the same position of each computation area in the output features of the double Transformer module to form new patches, which are concatenated to obtain features downsampled by a factor of 2.
Further, the input of the feature pyramid network is four features C1 to C4 of different scales from shallow to deep.
The feature pyramid network processes the features C1 to C4 as follows to obtain five fused features P1 to P5 of different scales from shallow to deep: P4 is obtained directly from C4; P5 is obtained by downsampling P4; P3 is obtained by fusing upsampled P4 with C3; P2 is obtained by fusing upsampled P3 with C2; and P1 is obtained by fusing upsampled P2 with C1.
Further, the edge-guided shape enhancement module comprises:
fusing the shallowest fused feature output by the Transformer with the coarse edge map and applying a Sobel edge extraction operator to obtain a target edge prediction map;
computing the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map;
weighting the shallowest fused feature output by the Transformer with the normalized target edge prediction map and inputting the weighted feature into a convolutional network to obtain a target shape prediction map;
computing the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map, and optimizing the parameters of the convolutional network based on these losses to enhance the features.
Further, the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map is denoted Loss_ce, where y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; N_c is the number of edge pixels in the target edge ground-truth map; and N is the total number of pixels in the target edge ground-truth map.
Further, the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map are denoted Loss_se and Loss_Dice, respectively, where Loss_se is the binary classification loss between the target shape prediction map and the target binary segmentation ground-truth map; Loss_Dice is the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map; y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; and α determines the weight of the two losses.
Further, the anchor-free target detection head comprises the detection head of FCOS, which outputs the target detection classification and regression results.
Beneficial effects:
1. The SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer provided by the invention uses, for small-scale SAR ship targets, a local sparse information aggregation Transformer based on the Swin Transformer architecture as the backbone network and effectively fuses the effective information of small ships through a deformable sparse attention mechanism.
2. The SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer provided by the invention combines a data-dependent offset generator when replacing the self-attention mechanism with a deformable attention mechanism, so as to obtain more salient features of small SAR ship targets.
3. The SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer provided by the invention further provides an explicit edge-guided shape enhancement module, which can more effectively enhance SAR ships with blurred contours in the Transformer-extracted features and distinguish them from background interference.
Drawings
Fig. 1 is a schematic flow chart of a method for detecting a ship target in an SAR image based on explicit edge-guided local sparse information aggregation transform according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of a feature extraction network according to a first embodiment of the present invention;
fig. 3 is a schematic structural diagram of a transform module for local sparse information aggregation according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of an edge-guided shape enhancement module according to a first embodiment of the present invention;
fig. 5 is a schematic structural diagram of a method for detecting a ship target in an SAR image based on explicit edge-directed local sparse information aggregation transform according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
This embodiment provides an SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer; the flow of the method is shown in Fig. 1. First, each non-overlapping 4 × 4 pixel area of the input image is divided into a patch to obtain the patch-divided image, which is input into a four-stage feature extraction backbone network formed by local sparse information aggregation Transformers to obtain four features of different scales from shallow to deep. These features are then input into a feature pyramid network for feature fusion across scales, yielding five fused features of different scales from shallow to deep. Meanwhile, a coarse edge map of the target is obtained from the ship position annotations of the original image. The shallowest fused feature and the coarse edge map are then input into an edge-guided shape enhancement module to obtain the enhanced shallowest feature. Finally, the features are input into an anchor-free target detection head to obtain the target detection result.
The specific implementation process of the scheme comprises the following steps:
Step 1: Divide each non-overlapping 4 × 4 pixel region of the input image into one patch; that is, an input image of size H × W × 3 is divided into non-overlapping 4 × 4 patches, the feature dimension of each patch is 4 × 4 × 3 = 48, and the number of patches is (H/4) × (W/4). The patch-divided image is then input to the subsequent structure.
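For illustration only (not part of the original disclosure), a minimal PyTorch sketch of this patch division step is given below; the function name patch_partition and the use of PyTorch are assumptions of this sketch.

```python
import torch

def patch_partition(img: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Split an image of shape (B, 3, H, W) into non-overlapping patch x patch
    regions and flatten each into a token of dimension patch * patch * 3 (= 48)."""
    b, c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of the patch size"
    # (B, C, H/4, 4, W/4, 4) -> (B, H/4, W/4, 4, 4, C) -> (B, H/4 * W/4, 48)
    x = img.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // patch) * (w // patch), patch * patch * c)
    return x

tokens = patch_partition(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 4096, 48]) -> (H/4) * (W/4) tokens of 4 * 4 * 3 = 48 dims
```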
Step 2: Input the patch-divided image into a four-stage feature extraction backbone network formed by local sparse information aggregation Transformers to obtain four features of different scales from shallow to deep.
An improved Swin Transformer feature extraction network is shown in Fig. 2.
Its basic structure is based on the Swin Transformer backbone and is divided into four stages, denoted stage one, stage two, stage three and stage four in order, which successively output four features of different scales from shallow to deep.
Stage one comprises a linear embedding module and a double Transformer module connected in sequence; the linear embedding module performs a dimension transformation on the patch-divided image.
Stage two comprises a patch fusion module and a double Transformer module connected in sequence.
Stage three comprises a patch fusion module and 3 double Transformer modules connected in sequence.
Stage four comprises a patch fusion module and a double Transformer module connected in sequence.
the double Transformer module comprises a front Transformer module and a rear Transformer module, wherein the front Transformer module executes the following steps:
calculating a local sparse information aggregation attention diagram aiming at the input characteristics of the front Transformer module, performing residual error connection on the local sparse information aggregation attention diagram and the input characteristics of the front Transformer module, and performing residual error linkage on the result and the result after linear transformation and a multi-layer perceptron to obtain the output characteristics of the local sparse information aggregation Transformer;
the output characteristic of the front Transformer module is used as the input characteristic of the rear Transformer module;
the post-Transformer module performs the following steps:
and after sampling transformation is carried out on the input characteristics of the post-Transformer module, calculating a local sparse information aggregation attention diagram according to the characteristics after sampling transformation, carrying out residual error connection on the local sparse information aggregation attention diagram and the input characteristics of the post-Transformer module, and carrying out residual error linkage on the result and the result after linear transformation and a multi-layer perceptron to obtain the output characteristics of the local sparse information aggregation Transformer.
The sampling transformation comprises the following steps: the input features of the rear Transformer module are passed through a convolutional layer to obtain data-based sampling values; bilinear interpolation sampling is then performed on the input features of the rear Transformer module using these sampling values to obtain the sampling-transformed features.
The local sparse information aggregation attention map is computed by taking as the current input features A either the input features of the front Transformer module or the sampling-transformed input features of the rear Transformer module, and performing the following steps:
S1: divide the current input features A into non-overlapping windows of equal size to obtain a window-partitioned feature map.
S2: input the window-partitioned feature map into an offset generation network to obtain the offset matrix required for computing the deformable attention.
S3: apply a linear transformation to the window-partitioned feature map to obtain the value matrix required for computing the deformable attention.
S4: apply a linear transformation to the window-partitioned feature map to obtain the attention weight matrix required for computing the deformable attention.
S5: perform bilinear interpolation sampling on the value matrix using the offset matrix, weight and sum the sampling results using the attention weight matrix, and then apply a linear transformation to obtain the local sparse information aggregation attention map.
The structure of the local sparse information aggregation Transformer module is shown in fig. 3.
The deformable attention calculation formula is:
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m · x(p_q + Δp_{mqk}) ]
where DeformAttn(z_q, p_q, x) is the attention map, z_q and x are two representations of the input features, p_q is any reference point on the features, Δp_{mqk} is a sampling offset, A_{mqk} is an attention weight, W_m is a learnable weight, W'_m is the transpose of W_m, M is the number of attention heads, and K is the number of sampling points.
The attention map is added to the input features through a residual connection, and the result is then passed through a linear transformation and a multilayer perceptron and combined with the result through another residual connection to obtain the output features of the local sparse information aggregation Transformer.
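The following PyTorch sketch (illustration only) shows one way steps S1-S5 and the above formula could be realised for a single attention head; the class name, window size, number of sampling points, layer shapes and the use of normalised offsets are assumptions, and the residual connections and multilayer perceptron described above are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowDeformableAttention(nn.Module):
    """Sketch of S1-S5: window partition, offset generation, value and attention-weight
    projections, bilinear sampling at shifted points, weighted sum, output projection."""

    def __init__(self, dim: int, window: int = 8, k_points: int = 4):
        super().__init__()
        self.window, self.k = window, k_points
        self.offset_net = nn.Conv2d(dim, 2 * k_points, kernel_size=3, padding=1)  # S2: offset generation network
        self.value_proj = nn.Linear(dim, dim)        # S3: value matrix
        self.weight_proj = nn.Linear(dim, k_points)  # S4: attention weight matrix
        self.out_proj = nn.Linear(dim, dim)          # S5: final linear transformation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws = self.window
        # S1: split into non-overlapping ws x ws windows, treat each window as a batch item
        xw = x.reshape(b, c, h // ws, ws, w // ws, ws).permute(0, 2, 4, 1, 3, 5)
        xw = xw.reshape(-1, c, ws, ws)                           # (B*nW, C, ws, ws)
        offsets = self.offset_net(xw)                            # (B*nW, 2K, ws, ws), normalised units
        offsets = offsets.permute(0, 2, 3, 1).reshape(-1, ws, ws, self.k, 2)
        feats = xw.permute(0, 2, 3, 1)                           # (B*nW, ws, ws, C)
        value = self.value_proj(feats).permute(0, 3, 1, 2)       # (B*nW, C, ws, ws)
        attn = F.softmax(self.weight_proj(feats), dim=-1)        # (B*nW, ws, ws, K)
        # reference grid p_q in normalised [-1, 1] coordinates, (x, y) order for grid_sample
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, ws), torch.linspace(-1, 1, ws), indexing="ij")
        ref = torch.stack((xs, ys), dim=-1).to(x)                # (ws, ws, 2)
        sampled = []
        for k in range(self.k):                                  # S5: bilinear sampling at p_q + Δp
            grid = (ref + offsets[:, :, :, k, :]).clamp(-1, 1)
            sampled.append(F.grid_sample(value, grid, align_corners=True))
        sampled = torch.stack(sampled, dim=-1)                   # (B*nW, C, ws, ws, K)
        out = (sampled * attn.unsqueeze(1)).sum(-1)              # weighted sum over K sampling points
        out = self.out_proj(out.permute(0, 2, 3, 1))             # (B*nW, ws, ws, C)
        out = out.reshape(b, h // ws, w // ws, ws, ws, c).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(b, c, h, w)                           # merge windows back

attn = WindowDeformableAttention(dim=96)
print(attn(torch.randn(1, 96, 32, 32)).shape)  # torch.Size([1, 96, 32, 32])
```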
Because two consecutive Transformer modules would otherwise lack information interaction between windows, a sampling transformation is applied to the feature map obtained by the previous Transformer module: the features are input into a convolutional layer to obtain data-based sampling values, bilinear interpolation sampling is performed on the features using these sampling values, and the result is then processed by the next Transformer module.
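A minimal sketch of such a data-dependent sampling transformation is given below (illustration only), assuming the convolutional layer's outputs are interpreted as per-pixel coordinate shifts applied with bilinear grid sampling; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplingTransform(nn.Module):
    """Sketch: a conv layer predicts data-based sampling shifts, then the feature map
    is resampled at the shifted locations by bilinear interpolation."""

    def __init__(self, dim: int):
        super().__init__()
        self.sample_pred = nn.Conv2d(dim, 2, kernel_size=3, padding=1)  # data-based sampling values (x, y shifts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        shift = self.sample_pred(x).permute(0, 2, 3, 1)                 # (B, H, W, 2)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)                            # identity grid, (x, y) order
        grid = (base + shift).clamp(-1, 1)
        return F.grid_sample(x, grid, align_corners=True)               # bilinear interpolation sampling

st = SamplingTransform(dim=96)
print(st(torch.randn(1, 96, 32, 32)).shape)  # torch.Size([1, 96, 32, 32])
```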
The linear embedding module performs a dimension transformation on the patch-divided image.
The patch fusion module takes the values at the same position of each computation area in the Transformer output features to form new patches, which are concatenated to obtain features downsampled by a factor of 2.
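A sketch of this patch fusion step in the style of Swin Transformer patch merging is shown below (illustration only); the trailing linear channel reduction is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class PatchFusion(nn.Module):
    """Sketch: pixels at the same position of every 2x2 area are regrouped and
    concatenated along channels, giving a 2x downsampled feature map; a linear layer
    then reduces the channel count (assumed, as in Swin Transformer)."""

    def __init__(self, dim: int):
        super().__init__()
        self.reduce = nn.Linear(4 * dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H and W even
        tl, tr = x[:, :, 0::2, 0::2], x[:, :, 0::2, 1::2]   # same-position values of each 2x2 area
        bl, br = x[:, :, 1::2, 0::2], x[:, :, 1::2, 1::2]
        merged = torch.cat((tl, tr, bl, br), dim=1)          # (B, 4C, H/2, W/2)
        out = self.reduce(merged.permute(0, 2, 3, 1))        # (B, H/2, W/2, 2C)
        return out.permute(0, 3, 1, 2)                       # (B, 2C, H/2, W/2)

pf = PatchFusion(dim=96)
print(pf(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 192, 28, 28])
```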
The output of the feature extraction network is four features with different scales from shallow to deep.
Step 3: Input the four features of different scales into the feature pyramid network for feature fusion across scales, obtaining five fused features of different scales from shallow to deep.
As shown in the feature fusion pyramid structure in Fig. 5, the input of the feature pyramid network is four features C1 to C4 of different scales from shallow to deep. The feature pyramid network processes the features C1 to C4 as follows to obtain five fused features P1 to P5 of different scales from shallow to deep: P4 is obtained directly from C4; P5 is obtained by downsampling P4; P3 is obtained by fusing upsampled P4 with C3; P2 is obtained by fusing upsampled P3 with C2; and P1 is obtained by fusing upsampled P2 with C1.
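This fusion rule can be sketched as follows (illustration only); the 1 × 1 lateral convolutions, the 256-channel width, max-pooling for the downsampling and nearest-neighbour upsampling are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Sketch of the P1-P5 fusion rule: P4 from C4, P5 by downsampling P4,
    P3/P2/P1 by fusing an upsampled deeper level with the lateral C feature."""

    def __init__(self, in_dims=(96, 192, 384, 768), out_dim: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in in_dims)

    def forward(self, c1, c2, c3, c4):
        p4 = self.lateral[3](c4)                                            # P4 directly from C4
        p5 = F.max_pool2d(p4, kernel_size=2)                                # P5 by downsampling P4
        p3 = self.lateral[2](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[1](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lateral[0](c1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        return p1, p2, p3, p4, p5

fpn = FeaturePyramid()
c1, c2, c3, c4 = (torch.randn(1, d, s, s) for d, s in zip((96, 192, 384, 768), (64, 32, 16, 8)))
print([p.shape[-1] for p in fpn(c1, c2, c3, c4)])  # [64, 32, 16, 8, 4]
```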
Step 4: Extract the target edge according to the ship position annotations of the original image to obtain a coarse edge map of the target.
Step 5: Input the shallowest of the five fused features together with the coarse edge map into the edge-guided shape enhancement module to obtain the enhanced shallowest fused feature.
The shallowest of the five features of different scales and the coarse edge map are input into the edge-guided shape enhancement module as follows: the shallowest feature is fused with the coarse edge map, and a Sobel edge extraction operator is applied to obtain a target edge prediction map; the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map is computed; the shallowest feature is weighted with the normalized target edge prediction map and input into a convolutional network to obtain a target shape prediction map; the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map are computed; finally, based on these losses, the network parameters are optimized by training to enhance the features.
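A simplified sketch of this module's forward pass (illustration only, omitting the loss computation) is given below; the fusion convolution, the layer widths of the shape prediction network and the sigmoid normalisation are assumptions of the sketch, while the Sobel kernels are the standard horizontal/vertical operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeGuidedShapeEnhancement(nn.Module):
    """Sketch: fuse the shallowest feature with the coarse edge map, apply a fixed
    Sobel operator to predict edges, reweight the feature with the normalised edge
    prediction, and predict the target shape with a small conv network."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(dim + 1, 1, kernel_size=3, padding=1)        # feature + coarse edge -> 1 channel
        sobel = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]],
                              [[[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]]])
        self.register_buffer("sobel", sobel)                                # fixed Sobel kernels (x and y)
        self.shape_head = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(dim, 1, 1))               # shape prediction network

    def forward(self, feat: torch.Tensor, coarse_edge: torch.Tensor):
        fused = self.fuse(torch.cat((feat, coarse_edge), dim=1))            # (B, 1, H, W)
        grad = F.conv2d(fused, self.sobel, padding=1)                       # Sobel edge extraction
        edge_pred = grad.pow(2).sum(dim=1, keepdim=True).sqrt()             # gradient magnitude as edge map
        weight = torch.sigmoid(edge_pred)                                   # normalised edge prediction
        enhanced = feat + feat * weight                                     # edge-weighted feature fusion
        shape_pred = self.shape_head(enhanced)                              # target shape prediction map
        return enhanced, edge_pred, shape_pred

mod = EdgeGuidedShapeEnhancement()
enh, edge, shape = mod(torch.randn(1, 256, 64, 64), torch.rand(1, 1, 64, 64))
print(enh.shape, edge.shape, shape.shape)
```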
In the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map, y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; N_c is the number of edge pixels in the target edge ground-truth map; and N is the total number of pixels in the target edge ground-truth map.
In the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map, Loss_se is the binary classification loss between the target shape prediction map and the target binary segmentation ground-truth map; Loss_Dice is the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map; y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; and α determines the weight of the two losses.
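The loss formulas themselves appear as images in the original publication. One plausible reconstruction (an assumption, not the patent's exact formulas), consistent with the symbol definitions above and assuming a class-balanced cross entropy for the edge loss, a standard binary cross entropy for the shape loss, a standard Dice formulation, and α weighting the two shape losses, is:

```latex
% Assumed reconstruction of the loss terms; Loss_{shape} is an illustrative name.
\mathrm{Loss}_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\tfrac{N-N_c}{N}\, y_i \log p_i + \tfrac{N_c}{N}\,(1-y_i)\log(1-p_i)\Big]

\mathrm{Loss}_{se} = -\frac{1}{N}\sum_{i=1}^{N}\big[\, y_i \log p_i + (1-y_i)\log(1-p_i)\big]

\mathrm{Loss}_{Dice} = 1-\frac{2\sum_i y_i p_i}{\sum_i y_i + \sum_i p_i},\qquad
\mathrm{Loss}_{shape} = \alpha\,\mathrm{Loss}_{se} + (1-\alpha)\,\mathrm{Loss}_{Dice}
```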
Step 6: Input the enhanced shallowest fused feature and the fused features of the other four scales into the anchor-free target detection head to obtain the target detection result.
As shown in Fig. 5, the enhanced shallowest feature and the features of the other four scales are input into the anchor-free target detection head of FCOS, and the target detection classification and regression results are obtained.
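A sketch of such an FCOS-style anchor-free head applied to the five pyramid levels is shown below (illustration only); the number of tower convolutions, channel widths and output conventions are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Sketch of an FCOS-style anchor-free head shared across the five pyramid levels:
    per-pixel classification, centre-ness and (l, t, r, b) box regression."""

    def __init__(self, dim: int = 256, num_classes: int = 1):
        super().__init__()
        self.tower = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.cls_head = nn.Conv2d(dim, num_classes, 3, padding=1)   # ship / background scores
        self.ctr_head = nn.Conv2d(dim, 1, 3, padding=1)             # centre-ness
        self.reg_head = nn.Conv2d(dim, 4, 3, padding=1)             # distances to the four box sides

    def forward(self, feats):
        outs = []
        for f in feats:                                              # P1 ... P5
            t = self.tower(f)
            outs.append((self.cls_head(t), self.ctr_head(t), torch.relu(self.reg_head(t))))
        return outs

head = AnchorFreeHead()
levels = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8, 4)]
print([o[0].shape[-1] for o in head(levels)])  # [64, 32, 16, 8, 4]
```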
Those skilled in the art will appreciate that the steps and modules of the present invention described in the above embodiments can be implemented with general-purpose computing hardware and software, which can be centralized on a single computing device or distributed across a network of multiple computing devices. Modules implemented in executable program code may optionally be stored in Random Access Memory (RAM), Read-Only Memory (ROM), a hard disk, a removable disk, a CD-ROM, or any other form of computer storage medium known in the art. In some cases, the steps shown or described may be performed in an order different from that presented herein, or they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A SAR image ship target detection method based on a Transformer is characterized by comprising the following steps:
step 1: performing Patch division on an input original image to obtain a patch-divided image;
step 2: inputting the patch-divided image into a four-stage feature extraction backbone network formed by a local sparse information aggregation Transformer to obtain four features of different scales from shallow to deep;
step 3: inputting the four features of different scales into a feature pyramid network for feature fusion across scales to obtain five fused features of different scales from shallow to deep;
step 4: extracting a target edge according to the ship position annotations of the original image to obtain a coarse edge map of the target;
step 5: inputting the shallowest of the five fused features of different scales together with the coarse edge map into an edge-guided shape enhancement module to obtain an enhanced shallowest fused feature;
step 6: inputting the enhanced shallowest fused feature and the fused features of the other four scales into an anchor-free target detection head to obtain a target detection result.
2. The Transformer-based SAR image ship target detection method according to claim 1, wherein the Patch division of the input original image comprises the following step:
dividing an input image of size H × W × 3 into non-overlapping 4 × 4 patches, wherein the feature dimension of each patch is 4 × 4 × 3 = 48 and the number of patches is (H/4) × (W/4).
3. The method for detecting SAR image ship targets based on Transformer according to claim 1, wherein the four-stage feature extraction backbone network formed by local sparse information aggregation Transformer comprises:
a basic structure based on the Swin Transformer backbone, divided into four stages denoted stage one, stage two, stage three and stage four in order, which successively output four features of different scales from shallow to deep;
stage one comprises a linear embedding module and a double Transformer module connected in sequence; the linear embedding module performs a dimension transformation on the patch-divided image;
stage two comprises a patch fusion module and a double Transformer module connected in sequence;
stage three comprises a patch fusion module and 3 double Transformer modules connected in sequence;
stage four comprises a patch fusion module and a double Transformer module connected in sequence;
the double Transformer module comprises a front Transformer module and a rear Transformer module, wherein the front Transformer module performs the following steps:
computing a local sparse information aggregation attention map for the input features of the front Transformer module, adding the attention map to the input features of the front Transformer module through a residual connection, then passing the result through a linear transformation and a multilayer perceptron and combining it with the result through another residual connection to obtain the output features of the local sparse information aggregation Transformer;
the output features of the front Transformer module serve as the input features of the rear Transformer module;
the rear Transformer module performs the following steps:
applying a sampling transformation to the input features of the rear Transformer module, computing a local sparse information aggregation attention map from the sampling-transformed features, adding the attention map to the input features of the rear Transformer module through a residual connection, then passing the result through a linear transformation and a multilayer perceptron and combining it with the result through another residual connection to obtain the output features of the local sparse information aggregation Transformer;
the sampling transformation comprises: passing the input features of the rear Transformer module through a convolutional layer to obtain data-based sampling values, and performing bilinear interpolation sampling on the input features of the rear Transformer module using the data-based sampling values to obtain the sampling-transformed features;
the local sparse information aggregation attention map is computed by taking as the current input features A either the input features of the front Transformer module or the sampling-transformed input features of the rear Transformer module, and performing the following steps:
S1: dividing the current input features A into non-overlapping windows of equal size to obtain a window-partitioned feature map;
S2: inputting the window-partitioned feature map into an offset generation network to obtain the offset matrix required for computing the deformable attention;
S3: applying a linear transformation to the window-partitioned feature map to obtain the value matrix required for computing the deformable attention;
S4: applying a linear transformation to the window-partitioned feature map to obtain the attention weight matrix required for computing the deformable attention;
S5: performing bilinear interpolation sampling on the value matrix using the offset matrix, weighting and summing the sampling results using the attention weight matrix, and then applying a linear transformation to obtain the local sparse information aggregation attention map.
4. The Transformer-based SAR image ship target detection method according to any one of claims 1-3, wherein the local sparse information aggregation attention map is denoted DeformAttn(z_q, p_q, x) and is computed as:
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m · x(p_q + Δp_{mqk}) ]
wherein z_q and x are two representations of the input features, p_q is any reference point on the features, Δp_{mqk} is a sampling offset, A_{mqk} is an attention weight, W_m is a learnable weight, W'_m is the transpose of W_m, M is the number of attention heads, and K is the number of sampling points.
5. The Transformer-based SAR image ship target detection method according to claim 3, wherein the patch fusion module comprises: taking the values at the same position of each computation area in the output features of the double Transformer module to form new patches, which are concatenated to obtain features downsampled by a factor of 2.
6. The Transformer-based SAR image ship target detection method according to claim 1, wherein the input of the feature pyramid network is four features C1 to C4 of different scales from shallow to deep;
the feature pyramid network processes the features C1 to C4 as follows to obtain five fused features P1 to P5 of different scales from shallow to deep: P4 is obtained directly from C4; P5 is obtained by downsampling P4; P3 is obtained by fusing upsampled P4 with C3; P2 is obtained by fusing upsampled P3 with C2; and P1 is obtained by fusing upsampled P2 with C1.
7. The Transformer-based SAR image ship target detection method according to claim 1, wherein the edge-guided shape enhancement module comprises:
fusing the shallowest fused feature output by the Transformer with the coarse edge map and applying a Sobel edge extraction operator to obtain a target edge prediction map;
computing the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map;
weighting the shallowest fused feature output by the Transformer with the normalized target edge prediction map and inputting the weighted feature into a convolutional network to obtain a target shape prediction map;
computing the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map, and optimizing the parameters of the convolutional network based on these losses to enhance the features.
8. The Transformer-based SAR image ship target detection method according to claim 7, wherein the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map is denoted Loss_ce,
wherein y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; N_c is the number of edge pixels in the target edge ground-truth map; and N is the total number of pixels in the target edge ground-truth map.
9. The Transformer-based SAR image ship target detection method according to claim 7, wherein, in the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map,
Loss_se is the binary classification loss between the target shape prediction map and the target binary segmentation ground-truth map; Loss_Dice is the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map; y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; and α determines the weight of the two losses.
10. The Transformer-based SAR image ship target detection method according to claim 1, wherein the anchor-free target detection head comprises the detection head of FCOS, which outputs the target detection classification and regression results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211173313.8A CN115565066A (en) | 2022-09-26 | 2022-09-26 | SAR image ship target detection method based on Transformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211173313.8A CN115565066A (en) | 2022-09-26 | 2022-09-26 | SAR image ship target detection method based on Transformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115565066A true CN115565066A (en) | 2023-01-03 |
Family
ID=84742425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211173313.8A Pending CN115565066A (en) | 2022-09-26 | 2022-09-26 | SAR image ship target detection method based on Transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115565066A (en) |
2022
- 2022-09-26 CN CN202211173313.8A patent/CN115565066A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116740370A (en) * | 2023-05-18 | 2023-09-12 | 北京理工大学 | Complex target recognition method based on deep self-attention transformation network |
CN117372676A (en) * | 2023-09-26 | 2024-01-09 | 南京航空航天大学 | Sparse SAR ship target detection method and device based on attention feature fusion |
CN117830874A (en) * | 2024-03-05 | 2024-04-05 | 成都理工大学 | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117830874B (en) * | 2024-03-05 | 2024-05-07 | 成都理工大学 | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117935251A (en) * | 2024-03-22 | 2024-04-26 | 济南大学 | Food identification method and system based on aggregated attention |
CN118628933A (en) * | 2024-08-15 | 2024-09-10 | 西南交通大学 | Ship target detection method, system, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115565066A (en) | SAR image ship target detection method based on Transformer | |
CN108596248B (en) | Remote sensing image classification method based on improved deep convolutional neural network | |
Mahmoud et al. | Object detection using adaptive mask RCNN in optical remote sensing images | |
CN112507777A (en) | Optical remote sensing image ship detection and segmentation method based on deep learning | |
CN111898432B (en) | Pedestrian detection system and method based on improved YOLOv3 algorithm | |
Hou et al. | SolarNet: a deep learning framework to map solar power plants in China from satellite imagery | |
Wang et al. | Ship detection based on fused features and rebuilt YOLOv3 networks in optical remote-sensing images | |
Wang et al. | Automatic SAR ship detection based on multifeature fusion network in spatial and frequency domains | |
Zhang et al. | Efficiently utilizing complex-valued PolSAR image data via a multi-task deep learning framework | |
CN116758130A (en) | Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion | |
CN113505634B (en) | Optical remote sensing image salient target detection method of double-flow decoding cross-task interaction network | |
Xu et al. | Fast ship detection combining visual saliency and a cascade CNN in SAR images | |
CN114022408A (en) | Remote sensing image cloud detection method based on multi-scale convolution neural network | |
Wang et al. | SAR ship detection in complex background based on multi-feature fusion and non-local channel attention mechanism | |
CN114565824B (en) | Single-stage rotating ship detection method based on full convolution network | |
CN115294468A (en) | SAR image ship identification method for improving fast RCNN | |
Chen et al. | Ship detection with optical image based on attention and loss improved YOLO | |
Yang et al. | SAR image target detection and recognition based on deep network | |
Luo et al. | SAM-RSIS: Progressively adapting SAM with box prompting to remote sensing image instance segmentation | |
Fang et al. | Scinet: Spatial and contrast interactive super-resolution assisted infrared uav target detection | |
Liu et al. | Content-guided and class-oriented learning for vhr image semantic segmentation | |
Wang et al. | Attention-aware Sobel Graph Convolutional Network for Remote Sensing Image Change Detection | |
CN115147719A (en) | Remote sensing image deep land utilization classification method based on enhanced semantic representation | |
Bendre et al. | Natural disaster analytics using high resolution satellite images | |
Wang et al. | NAS-YOLOX: ship detection based on improved YOLOX for SAR imagery |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |