CN115565066A - SAR image ship target detection method based on Transformer
SAR image ship target detection method based on Transformer
- Publication number
- CN115565066A (application CN202211173313.8A)
- Authority
- CN
- China
- Prior art keywords
- edge
- image
- target
- transformer
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a Transformer-based SAR image ship target detection method which, for small-scale SAR ship targets, uses a Transformer as the backbone network and fuses the effective information of ships through a deformable attention mechanism to improve detection accuracy. The input original image is first subjected to Patch division. The patch-divided image is input into a four-stage feature extraction backbone network formed by Transformers to obtain four features of different scales from shallow to deep. The four features of different scales are input into a feature pyramid network for feature fusion, yielding five fused features of different scales from shallow to deep. A coarse edge map of the target is extracted according to the ship position annotations of the original image. The shallowest fused feature and the coarse edge map are input into an edge-guided shape enhancement module to obtain the enhanced shallowest fused feature. The enhanced shallowest fused feature and the fused features of the other four scales are input into an anchor-free target detection head to obtain the target detection result.
Description
Technical Field
The invention relates to the technical field of target detection, in particular to a SAR image ship target detection method based on a Transformer.
Background
Synthetic Aperture Radar (SAR) is an active remote sensing system whose operating frequencies lie in the microwave band. Compared with optical sensors, the main advantages of SAR sensors are that they can work day and night in all weather conditions and that their transmitted signals have strong penetrating capability, passing through clouds and fog.
SAR image data has grown rapidly with the wide application of SAR sensors, and automatic detection of targets in SAR images has become an important research topic. Existing SAR image target detection methods can be divided into traditional methods and deep learning methods. Traditional SAR target detection methods are mainly based on contrast information, geometric and texture features, and statistical analysis. Benefiting from the development of deep learning and GPU computing power, deep Convolutional Neural Networks (CNNs) have brought great breakthroughs to target detection. At present, target detection techniques based on convolutional neural networks have become the mainstream of the target detection field. The algorithms fall mainly into two classes. The first class is two-stage target detection algorithms, represented by Faster R-CNN, which first generate candidate regions that may contain targets and then perform detection; their detection accuracy is high, but their efficiency is low. The second class is single-stage target detection algorithms, mainly represented by the SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once) series; these algorithms do not generate candidate regions but perform detection directly by regression, so their detection efficiency is high while their accuracy is inferior to that of the first class. As a new neural network structure, the Transformer provides a new way of thinking for vision tasks. The Transformer was originally used in the field of Natural Language Processing (NLP). It adopts a non-recurrent encoder-decoder structure with a self-attention mechanism and achieved state-of-the-art machine translation performance. The successful application of the Transformer in NLP has led researchers to explore its use in computer vision. Backbones that replace convolutions with Transformers, such as ViT and Swin Transformer, have been shown to outperform CNNs, because the global interaction mechanism of the Transformer can quickly expand the effective receptive field of features.
However, there are two problems with using a Transformer as the backbone network in SAR ship detection. First, the background of offshore SAR ship images is very simple, so the global relationship modeling mechanism of the Transformer associates features with redundant background information. Second, the contours between inshore SAR ship targets and the shore are blurred, making inshore ship targets difficult to distinguish from the background. The Transformer-extracted features therefore need to be reconstructed with more object detail so that SAR ship targets can be focused on within similar backgrounds.
At present, no related technical solution addresses the reduction in detection accuracy caused by blurred ship target contours against backgrounds such as coastlines, islands and sea waves in SAR images.
Disclosure of the Invention
In view of this, the invention provides a Transformer-based SAR image ship target detection method which, for small-scale SAR ship targets, uses a local sparse information aggregation Transformer based on the Swin Transformer architecture as the backbone network and effectively fuses the effective information of small ships through a deformable sparse attention mechanism, thereby improving detection accuracy.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
Step 1: Perform Patch division on the input original image to obtain a patch-divided image.
Step 2: Input the patch-divided image into a four-stage feature extraction backbone network formed by local sparse information aggregation Transformers to obtain four features of different scales from shallow to deep.
Step 3: Input the four features of different scales into a feature pyramid network for feature fusion across scales, obtaining five fused features of different scales from shallow to deep.
Step 4: Extract the target edge according to the ship position annotations of the original image to obtain a coarse edge map of the target.
Step 5: Input the shallowest of the five fused features together with the coarse edge map into an edge-guided shape enhancement module to obtain the enhanced shallowest fused feature.
Step 6: Input the enhanced shallowest fused feature and the fused features of the other four scales into an anchor-free target detection head to obtain the target detection result.
Further, the Patch division of the input original image comprises the following step: dividing an input image of size H × W × 3 into non-overlapping 4 × 4 patches, wherein the feature dimension of each patch is 4 × 4 × 3 = 48 and the number of patches is (H/4) × (W/4).
Further, the four-stage feature extraction backbone network formed by the local sparse information aggregation Transformer comprises: a basic structure based on the Swin Transformer backbone, divided into four stages denoted stage one, stage two, stage three and stage four in order, which successively output four features of different scales from shallow to deep.
Stage one comprises a linear embedding module and a double Transformer module connected in sequence; the linear embedding module performs a dimension transformation on the patch-divided image.
Stage two comprises a patch fusion module and a double Transformer module connected in sequence.
Stage three comprises a patch fusion module and 3 double Transformer modules connected in sequence.
Stage four comprises a patch fusion module and a double Transformer module connected in sequence.
The double Transformer module comprises a front Transformer module and a rear Transformer module. The front Transformer module performs the following steps: a local sparse information aggregation attention map is computed for the input features of the front Transformer module; the attention map is added to the input features of the front Transformer module through a residual connection; the result is then passed through a linear transformation and a multilayer perceptron and combined with the result through another residual connection, giving the output features of the local sparse information aggregation Transformer.
The output features of the front Transformer module serve as the input features of the rear Transformer module.
The rear Transformer module performs the following steps: a sampling transformation is applied to the input features of the rear Transformer module; a local sparse information aggregation attention map is computed from the sampling-transformed features; the attention map is added to the input features of the rear Transformer module through a residual connection; the result is then passed through a linear transformation and a multilayer perceptron and combined with the result through another residual connection, giving the output features of the local sparse information aggregation Transformer.
The sampling transformation comprises the following steps: the input features of the rear Transformer module are passed through a convolutional layer to obtain data-based sampling values; bilinear interpolation sampling is then performed on the input features of the rear Transformer module using these sampling values to obtain the sampling-transformed features.
The local sparse information aggregation attention map is computed by taking as the current input features A either the input features of the front Transformer module or the sampling-transformed input features of the rear Transformer module, and performing the following steps:
S1: divide the current input features A into non-overlapping windows of equal size to obtain a window-partitioned feature map.
S2: input the window-partitioned feature map into an offset generation network to obtain the offset matrix required for computing the deformable attention.
S3: apply a linear transformation to the window-partitioned feature map to obtain the value matrix required for computing the deformable attention.
S4: apply a linear transformation to the window-partitioned feature map to obtain the attention weight matrix required for computing the deformable attention.
S5: perform bilinear interpolation sampling on the value matrix using the offset matrix, weight and sum the sampling results using the attention weight matrix, and then apply a linear transformation to obtain the local sparse information aggregation attention map.
Further, the local sparse information aggregation attention map is denoted DeformAttn(z_q, p_q, x) and is computed as:
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m · x(p_q + Δp_{mqk}) ]
where z_q and x are two representations of the input features, p_q is any reference point on the features, Δp_{mqk} is a sampling offset, A_{mqk} is an attention weight, W_m is a learnable weight, W'_m is the transpose of W_m, M is the number of attention heads, and K is the number of sampling points.
Further, the patch fusion module comprises: taking the values at the same position of each computation area in the output features of the double Transformer module to form new patches, which are concatenated to obtain features downsampled by a factor of 2.
Further, the input of the feature pyramid network is four features C1 to C4 of different scales from shallow to deep.
The feature pyramid network processes the features C1 to C4 as follows to obtain five fused features P1 to P5 of different scales from shallow to deep: P4 is obtained directly from C4; P5 is obtained by downsampling P4; P3 is obtained by fusing upsampled P4 with C3; P2 is obtained by fusing upsampled P3 with C2; and P1 is obtained by fusing upsampled P2 with C1.
Further, the edge-guided shape enhancement module comprises:
fusing the shallowest fused feature output by the Transformer with the coarse edge map and applying a Sobel edge extraction operator to obtain a target edge prediction map;
computing the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map;
weighting the shallowest fused feature output by the Transformer with the normalized target edge prediction map and inputting the weighted feature into a convolutional network to obtain a target shape prediction map;
computing the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map, and optimizing the parameters of the convolutional network based on these losses to enhance the features.
Further, the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map is denoted Loss_ce, where y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; N_c is the number of edge pixels in the target edge ground-truth map; and N is the total number of pixels in the target edge ground-truth map.
Further, the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map are denoted Loss_se and Loss_Dice, respectively, where Loss_se is the binary classification loss between the target shape prediction map and the target binary segmentation ground-truth map; Loss_Dice is the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map; y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; and α determines the weight of the two losses.
Further, the anchor-free target detection head comprises the detection head of FCOS, which outputs the target detection classification and regression results.
Beneficial effects:
1. The SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer provided by the invention uses, for small-scale SAR ship targets, a local sparse information aggregation Transformer based on the Swin Transformer architecture as the backbone network and effectively fuses the effective information of small ships through a deformable sparse attention mechanism.
2. The SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer provided by the invention combines a data-dependent offset generator when replacing the self-attention mechanism with a deformable attention mechanism, so as to obtain more salient features of small SAR ship targets.
3. The SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer provided by the invention further provides an explicit edge-guided shape enhancement module, which can more effectively enhance SAR ships with blurred contours in the Transformer-extracted features and distinguish them from background interference.
Drawings
Fig. 1 is a schematic flow chart of a method for detecting a ship target in an SAR image based on explicit edge-guided local sparse information aggregation transform according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of a feature extraction network according to a first embodiment of the present invention;
fig. 3 is a schematic structural diagram of a transform module for local sparse information aggregation according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of an edge-guided shape enhancement module according to a first embodiment of the present invention;
fig. 5 is a schematic structural diagram of a method for detecting a ship target in an SAR image based on explicit edge-directed local sparse information aggregation transform according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
This embodiment provides an SAR image ship target detection method based on an explicit edge-guided local sparse information aggregation Transformer; the flow of the method is shown in Fig. 1. First, each non-overlapping 4 × 4 pixel area of the input image is divided into a patch to obtain the patch-divided image, which is input into a four-stage feature extraction backbone network formed by local sparse information aggregation Transformers to obtain four features of different scales from shallow to deep. These features are then input into a feature pyramid network for feature fusion across scales, yielding five fused features of different scales from shallow to deep. Meanwhile, a coarse edge map of the target is obtained from the ship position annotations of the original image. The shallowest fused feature and the coarse edge map are then input into an edge-guided shape enhancement module to obtain the enhanced shallowest feature. Finally, the features are input into an anchor-free target detection head to obtain the target detection result.
The specific implementation process of the scheme comprises the following steps:
Step 1: Divide each non-overlapping 4 × 4 pixel region of the input image into one patch; that is, an input image of size H × W × 3 is divided into non-overlapping 4 × 4 patches, the feature dimension of each patch is 4 × 4 × 3 = 48, and the number of patches is (H/4) × (W/4). The patch-divided image is then input to the subsequent structure.
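For illustration only (not part of the original disclosure), a minimal PyTorch sketch of this patch division step is given below; the function name patch_partition and the use of PyTorch are assumptions of this sketch.

```python
import torch

def patch_partition(img: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Split an image of shape (B, 3, H, W) into non-overlapping patch x patch
    regions and flatten each into a token of dimension patch * patch * 3 (= 48)."""
    b, c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of the patch size"
    # (B, C, H/4, 4, W/4, 4) -> (B, H/4, W/4, 4, 4, C) -> (B, H/4 * W/4, 48)
    x = img.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // patch) * (w // patch), patch * patch * c)
    return x

tokens = patch_partition(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 4096, 48]) -> (H/4) * (W/4) tokens of 4 * 4 * 3 = 48 dims
```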
Step 2: Input the patch-divided image into a four-stage feature extraction backbone network formed by local sparse information aggregation Transformers to obtain four features of different scales from shallow to deep.
An improved Swin Transformer feature extraction network is shown in Fig. 2.
Its basic structure is based on the Swin Transformer backbone and is divided into four stages, denoted stage one, stage two, stage three and stage four in order, which successively output four features of different scales from shallow to deep.
Stage one comprises a linear embedding module and a double Transformer module connected in sequence; the linear embedding module performs a dimension transformation on the patch-divided image.
Stage two comprises a patch fusion module and a double Transformer module connected in sequence.
Stage three comprises a patch fusion module and 3 double Transformer modules connected in sequence.
Stage four comprises a patch fusion module and a double Transformer module connected in sequence.
the double Transformer module comprises a front Transformer module and a rear Transformer module, wherein the front Transformer module executes the following steps:
calculating a local sparse information aggregation attention diagram aiming at the input characteristics of the front Transformer module, performing residual error connection on the local sparse information aggregation attention diagram and the input characteristics of the front Transformer module, and performing residual error linkage on the result and the result after linear transformation and a multi-layer perceptron to obtain the output characteristics of the local sparse information aggregation Transformer;
the output characteristic of the front Transformer module is used as the input characteristic of the rear Transformer module;
the post-Transformer module performs the following steps:
and after sampling transformation is carried out on the input characteristics of the post-Transformer module, calculating a local sparse information aggregation attention diagram according to the characteristics after sampling transformation, carrying out residual error connection on the local sparse information aggregation attention diagram and the input characteristics of the post-Transformer module, and carrying out residual error linkage on the result and the result after linear transformation and a multi-layer perceptron to obtain the output characteristics of the local sparse information aggregation Transformer.
The sampling transformation comprises the following steps: the input features of the rear Transformer module are passed through a convolutional layer to obtain data-based sampling values; bilinear interpolation sampling is then performed on the input features of the rear Transformer module using these sampling values to obtain the sampling-transformed features.
The local sparse information aggregation attention map is computed by taking as the current input features A either the input features of the front Transformer module or the sampling-transformed input features of the rear Transformer module, and performing the following steps:
S1: divide the current input features A into non-overlapping windows of equal size to obtain a window-partitioned feature map.
S2: input the window-partitioned feature map into an offset generation network to obtain the offset matrix required for computing the deformable attention.
S3: apply a linear transformation to the window-partitioned feature map to obtain the value matrix required for computing the deformable attention.
S4: apply a linear transformation to the window-partitioned feature map to obtain the attention weight matrix required for computing the deformable attention.
S5: perform bilinear interpolation sampling on the value matrix using the offset matrix, weight and sum the sampling results using the attention weight matrix, and then apply a linear transformation to obtain the local sparse information aggregation attention map.
The structure of the local sparse information aggregation Transformer module is shown in fig. 3.
The deformable attention calculation formula is:
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m · x(p_q + Δp_{mqk}) ]
where DeformAttn(z_q, p_q, x) is the attention map, z_q and x are two representations of the input features, p_q is any reference point on the features, Δp_{mqk} is a sampling offset, A_{mqk} is an attention weight, W_m is a learnable weight, W'_m is the transpose of W_m, M is the number of attention heads, and K is the number of sampling points.
The attention map is added to the input features through a residual connection, and the result is then passed through a linear transformation and a multilayer perceptron and combined with the result through another residual connection to obtain the output features of the local sparse information aggregation Transformer.
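The following PyTorch sketch (illustration only) shows one way steps S1-S5 and the above formula could be realised for a single attention head; the class name, window size, number of sampling points, layer shapes and the use of normalised offsets are assumptions, and the residual connections and multilayer perceptron described above are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowDeformableAttention(nn.Module):
    """Sketch of S1-S5: window partition, offset generation, value and attention-weight
    projections, bilinear sampling at shifted points, weighted sum, output projection."""

    def __init__(self, dim: int, window: int = 8, k_points: int = 4):
        super().__init__()
        self.window, self.k = window, k_points
        self.offset_net = nn.Conv2d(dim, 2 * k_points, kernel_size=3, padding=1)  # S2: offset generation network
        self.value_proj = nn.Linear(dim, dim)        # S3: value matrix
        self.weight_proj = nn.Linear(dim, k_points)  # S4: attention weight matrix
        self.out_proj = nn.Linear(dim, dim)          # S5: final linear transformation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws = self.window
        # S1: split into non-overlapping ws x ws windows, treat each window as a batch item
        xw = x.reshape(b, c, h // ws, ws, w // ws, ws).permute(0, 2, 4, 1, 3, 5)
        xw = xw.reshape(-1, c, ws, ws)                           # (B*nW, C, ws, ws)
        offsets = self.offset_net(xw)                            # (B*nW, 2K, ws, ws), normalised units
        offsets = offsets.permute(0, 2, 3, 1).reshape(-1, ws, ws, self.k, 2)
        feats = xw.permute(0, 2, 3, 1)                           # (B*nW, ws, ws, C)
        value = self.value_proj(feats).permute(0, 3, 1, 2)       # (B*nW, C, ws, ws)
        attn = F.softmax(self.weight_proj(feats), dim=-1)        # (B*nW, ws, ws, K)
        # reference grid p_q in normalised [-1, 1] coordinates, (x, y) order for grid_sample
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, ws), torch.linspace(-1, 1, ws), indexing="ij")
        ref = torch.stack((xs, ys), dim=-1).to(x)                # (ws, ws, 2)
        sampled = []
        for k in range(self.k):                                  # S5: bilinear sampling at p_q + Δp
            grid = (ref + offsets[:, :, :, k, :]).clamp(-1, 1)
            sampled.append(F.grid_sample(value, grid, align_corners=True))
        sampled = torch.stack(sampled, dim=-1)                   # (B*nW, C, ws, ws, K)
        out = (sampled * attn.unsqueeze(1)).sum(-1)              # weighted sum over K sampling points
        out = self.out_proj(out.permute(0, 2, 3, 1))             # (B*nW, ws, ws, C)
        out = out.reshape(b, h // ws, w // ws, ws, ws, c).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(b, c, h, w)                           # merge windows back

attn = WindowDeformableAttention(dim=96)
print(attn(torch.randn(1, 96, 32, 32)).shape)  # torch.Size([1, 96, 32, 32])
```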
Because two consecutive Transformer modules would otherwise lack information interaction between windows, a sampling transformation is applied to the feature map obtained by the previous Transformer module: the features are input into a convolutional layer to obtain data-based sampling values, bilinear interpolation sampling is performed on the features using these sampling values, and the result is then processed by the next Transformer module.
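A minimal sketch of such a data-dependent sampling transformation is given below (illustration only), assuming the convolutional layer's outputs are interpreted as per-pixel coordinate shifts applied with bilinear grid sampling; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplingTransform(nn.Module):
    """Sketch: a conv layer predicts data-based sampling shifts, then the feature map
    is resampled at the shifted locations by bilinear interpolation."""

    def __init__(self, dim: int):
        super().__init__()
        self.sample_pred = nn.Conv2d(dim, 2, kernel_size=3, padding=1)  # data-based sampling values (x, y shifts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        shift = self.sample_pred(x).permute(0, 2, 3, 1)                 # (B, H, W, 2)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)                            # identity grid, (x, y) order
        grid = (base + shift).clamp(-1, 1)
        return F.grid_sample(x, grid, align_corners=True)               # bilinear interpolation sampling

st = SamplingTransform(dim=96)
print(st(torch.randn(1, 96, 32, 32)).shape)  # torch.Size([1, 96, 32, 32])
```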
The linear embedding module performs a dimension transformation on the patch-divided image.
The patch fusion module takes the values at the same position of each computation area in the Transformer output features to form new patches, which are concatenated to obtain features downsampled by a factor of 2.
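A sketch of this patch fusion step in the style of Swin Transformer patch merging is shown below (illustration only); the trailing linear channel reduction is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class PatchFusion(nn.Module):
    """Sketch: pixels at the same position of every 2x2 area are regrouped and
    concatenated along channels, giving a 2x downsampled feature map; a linear layer
    then reduces the channel count (assumed, as in Swin Transformer)."""

    def __init__(self, dim: int):
        super().__init__()
        self.reduce = nn.Linear(4 * dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H and W even
        tl, tr = x[:, :, 0::2, 0::2], x[:, :, 0::2, 1::2]   # same-position values of each 2x2 area
        bl, br = x[:, :, 1::2, 0::2], x[:, :, 1::2, 1::2]
        merged = torch.cat((tl, tr, bl, br), dim=1)          # (B, 4C, H/2, W/2)
        out = self.reduce(merged.permute(0, 2, 3, 1))        # (B, H/2, W/2, 2C)
        return out.permute(0, 3, 1, 2)                       # (B, 2C, H/2, W/2)

pf = PatchFusion(dim=96)
print(pf(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 192, 28, 28])
```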
The output of the feature extraction network is four features with different scales from shallow to deep.
Step 3: Input the four features of different scales into the feature pyramid network for feature fusion across scales, obtaining five fused features of different scales from shallow to deep.
As shown in the feature fusion pyramid structure in Fig. 5, the input of the feature pyramid network is four features C1 to C4 of different scales from shallow to deep. The feature pyramid network processes the features C1 to C4 as follows to obtain five fused features P1 to P5 of different scales from shallow to deep: P4 is obtained directly from C4; P5 is obtained by downsampling P4; P3 is obtained by fusing upsampled P4 with C3; P2 is obtained by fusing upsampled P3 with C2; and P1 is obtained by fusing upsampled P2 with C1.
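This fusion rule can be sketched as follows (illustration only); the 1 × 1 lateral convolutions, the 256-channel width, max-pooling for the downsampling and nearest-neighbour upsampling are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Sketch of the P1-P5 fusion rule: P4 from C4, P5 by downsampling P4,
    P3/P2/P1 by fusing an upsampled deeper level with the lateral C feature."""

    def __init__(self, in_dims=(96, 192, 384, 768), out_dim: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in in_dims)

    def forward(self, c1, c2, c3, c4):
        p4 = self.lateral[3](c4)                                            # P4 directly from C4
        p5 = F.max_pool2d(p4, kernel_size=2)                                # P5 by downsampling P4
        p3 = self.lateral[2](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[1](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lateral[0](c1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        return p1, p2, p3, p4, p5

fpn = FeaturePyramid()
c1, c2, c3, c4 = (torch.randn(1, d, s, s) for d, s in zip((96, 192, 384, 768), (64, 32, 16, 8)))
print([p.shape[-1] for p in fpn(c1, c2, c3, c4)])  # [64, 32, 16, 8, 4]
```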
Step 4: Extract the target edge according to the ship position annotations of the original image to obtain a coarse edge map of the target.
Step 5: Input the shallowest of the five fused features together with the coarse edge map into the edge-guided shape enhancement module to obtain the enhanced shallowest fused feature.
The shallowest of the five features of different scales and the coarse edge map are input into the edge-guided shape enhancement module as follows: the shallowest feature is fused with the coarse edge map, and a Sobel edge extraction operator is applied to obtain a target edge prediction map; the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map is computed; the shallowest feature is weighted with the normalized target edge prediction map and input into a convolutional network to obtain a target shape prediction map; the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map are computed; finally, based on these losses, the network parameters are optimized by training to enhance the features.
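A simplified sketch of this module's forward pass (illustration only, omitting the loss computation) is given below; the fusion convolution, the layer widths of the shape prediction network and the sigmoid normalisation are assumptions of the sketch, while the Sobel kernels are the standard horizontal/vertical operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeGuidedShapeEnhancement(nn.Module):
    """Sketch: fuse the shallowest feature with the coarse edge map, apply a fixed
    Sobel operator to predict edges, reweight the feature with the normalised edge
    prediction, and predict the target shape with a small conv network."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(dim + 1, 1, kernel_size=3, padding=1)        # feature + coarse edge -> 1 channel
        sobel = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]],
                              [[[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]]])
        self.register_buffer("sobel", sobel)                                # fixed Sobel kernels (x and y)
        self.shape_head = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(dim, 1, 1))               # shape prediction network

    def forward(self, feat: torch.Tensor, coarse_edge: torch.Tensor):
        fused = self.fuse(torch.cat((feat, coarse_edge), dim=1))            # (B, 1, H, W)
        grad = F.conv2d(fused, self.sobel, padding=1)                       # Sobel edge extraction
        edge_pred = grad.pow(2).sum(dim=1, keepdim=True).sqrt()             # gradient magnitude as edge map
        weight = torch.sigmoid(edge_pred)                                   # normalised edge prediction
        enhanced = feat + feat * weight                                     # edge-weighted feature fusion
        shape_pred = self.shape_head(enhanced)                              # target shape prediction map
        return enhanced, edge_pred, shape_pred

mod = EdgeGuidedShapeEnhancement()
enh, edge, shape = mod(torch.randn(1, 256, 64, 64), torch.rand(1, 1, 64, 64))
print(enh.shape, edge.shape, shape.shape)
```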
In the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map, y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; N_c is the number of edge pixels in the target edge ground-truth map; and N is the total number of pixels in the target edge ground-truth map.
In the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map, Loss_se is the binary classification loss between the target shape prediction map and the target binary segmentation ground-truth map; Loss_Dice is the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map; y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; and α determines the weight of the two losses.
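The loss formulas themselves appear as images in the original publication. One plausible reconstruction (an assumption, not the patent's exact formulas), consistent with the symbol definitions above and assuming a class-balanced cross entropy for the edge loss, a standard binary cross entropy for the shape loss, a standard Dice formulation, and α weighting the two shape losses, is:

```latex
% Assumed reconstruction of the loss terms; Loss_{shape} is an illustrative name.
\mathrm{Loss}_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\tfrac{N-N_c}{N}\, y_i \log p_i + \tfrac{N_c}{N}\,(1-y_i)\log(1-p_i)\Big]

\mathrm{Loss}_{se} = -\frac{1}{N}\sum_{i=1}^{N}\big[\, y_i \log p_i + (1-y_i)\log(1-p_i)\big]

\mathrm{Loss}_{Dice} = 1-\frac{2\sum_i y_i p_i}{\sum_i y_i + \sum_i p_i},\qquad
\mathrm{Loss}_{shape} = \alpha\,\mathrm{Loss}_{se} + (1-\alpha)\,\mathrm{Loss}_{Dice}
```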
Step 6: Input the enhanced shallowest fused feature and the fused features of the other four scales into the anchor-free target detection head to obtain the target detection result.
As shown in Fig. 5, the enhanced shallowest feature and the features of the other four scales are input into the anchor-free target detection head of FCOS, and the target detection classification and regression results are obtained.
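A sketch of such an FCOS-style anchor-free head applied to the five pyramid levels is shown below (illustration only); the number of tower convolutions, channel widths and output conventions are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """Sketch of an FCOS-style anchor-free head shared across the five pyramid levels:
    per-pixel classification, centre-ness and (l, t, r, b) box regression."""

    def __init__(self, dim: int = 256, num_classes: int = 1):
        super().__init__()
        self.tower = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.cls_head = nn.Conv2d(dim, num_classes, 3, padding=1)   # ship / background scores
        self.ctr_head = nn.Conv2d(dim, 1, 3, padding=1)             # centre-ness
        self.reg_head = nn.Conv2d(dim, 4, 3, padding=1)             # distances to the four box sides

    def forward(self, feats):
        outs = []
        for f in feats:                                              # P1 ... P5
            t = self.tower(f)
            outs.append((self.cls_head(t), self.ctr_head(t), torch.relu(self.reg_head(t))))
        return outs

head = AnchorFreeHead()
levels = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8, 4)]
print([o[0].shape[-1] for o in head(levels)])  # [64, 32, 16, 8, 4]
```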
Those skilled in the art will appreciate that the steps and modules of the present invention described in the above embodiments can be implemented with general-purpose computing hardware and software, which can be centralized on a single computing device or distributed across a network of multiple computing devices. Modules implemented in executable program code may optionally be stored in Random Access Memory (RAM), Read-Only Memory (ROM), a hard disk, a removable disk, a CD-ROM, or any other form of computer storage medium known in the art. In some cases, the steps shown or described may be performed in an order different from that presented herein, or they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A SAR image ship target detection method based on a Transformer is characterized by comprising the following steps:
step 1: performing Patch division on an input original image to obtain a patch-divided image;
step 2: inputting the patch-divided image into a four-stage feature extraction backbone network formed by a local sparse information aggregation Transformer to obtain four features of different scales from shallow to deep;
step 3: inputting the four features of different scales into a feature pyramid network for feature fusion across scales to obtain five fused features of different scales from shallow to deep;
step 4: extracting a target edge according to the ship position annotations of the original image to obtain a coarse edge map of the target;
step 5: inputting the shallowest of the five fused features of different scales together with the coarse edge map into an edge-guided shape enhancement module to obtain an enhanced shallowest fused feature;
step 6: inputting the enhanced shallowest fused feature and the fused features of the other four scales into an anchor-free target detection head to obtain a target detection result.
2. The Transformer-based SAR image ship target detection method according to claim 1, wherein the Patch division of the input original image comprises the following step:
dividing an input image of size H × W × 3 into non-overlapping 4 × 4 patches, wherein the feature dimension of each patch is 4 × 4 × 3 = 48 and the number of patches is (H/4) × (W/4).
3. The method for detecting SAR image ship targets based on Transformer according to claim 1, wherein the four-stage feature extraction backbone network formed by local sparse information aggregation Transformer comprises:
a basic structure based on the Swin Transformer backbone, divided into four stages denoted stage one, stage two, stage three and stage four in order, which successively output four features of different scales from shallow to deep;
stage one comprises a linear embedding module and a double Transformer module connected in sequence; the linear embedding module performs a dimension transformation on the patch-divided image;
stage two comprises a patch fusion module and a double Transformer module connected in sequence;
stage three comprises a patch fusion module and 3 double Transformer modules connected in sequence;
stage four comprises a patch fusion module and a double Transformer module connected in sequence;
the double Transformer module comprises a front Transformer module and a rear Transformer module, wherein the front Transformer module performs the following steps:
computing a local sparse information aggregation attention map for the input features of the front Transformer module, adding the attention map to the input features of the front Transformer module through a residual connection, then passing the result through a linear transformation and a multilayer perceptron and combining it with the result through another residual connection to obtain the output features of the local sparse information aggregation Transformer;
the output features of the front Transformer module serve as the input features of the rear Transformer module;
the rear Transformer module performs the following steps:
applying a sampling transformation to the input features of the rear Transformer module, computing a local sparse information aggregation attention map from the sampling-transformed features, adding the attention map to the input features of the rear Transformer module through a residual connection, then passing the result through a linear transformation and a multilayer perceptron and combining it with the result through another residual connection to obtain the output features of the local sparse information aggregation Transformer;
the sampling transformation comprises: passing the input features of the rear Transformer module through a convolutional layer to obtain data-based sampling values, and performing bilinear interpolation sampling on the input features of the rear Transformer module using the data-based sampling values to obtain the sampling-transformed features;
the local sparse information aggregation attention map is computed by taking as the current input features A either the input features of the front Transformer module or the sampling-transformed input features of the rear Transformer module, and performing the following steps:
S1: dividing the current input features A into non-overlapping windows of equal size to obtain a window-partitioned feature map;
S2: inputting the window-partitioned feature map into an offset generation network to obtain the offset matrix required for computing the deformable attention;
S3: applying a linear transformation to the window-partitioned feature map to obtain the value matrix required for computing the deformable attention;
S4: applying a linear transformation to the window-partitioned feature map to obtain the attention weight matrix required for computing the deformable attention;
S5: performing bilinear interpolation sampling on the value matrix using the offset matrix, weighting and summing the sampling results using the attention weight matrix, and then applying a linear transformation to obtain the local sparse information aggregation attention map.
4. The Transformer-based SAR image ship target detection method according to any one of claims 1-3, wherein the local sparse information aggregation attention map is denoted DeformAttn(z_q, p_q, x) and is computed as:
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m · x(p_q + Δp_{mqk}) ]
wherein z_q and x are two representations of the input features, p_q is any reference point on the features, Δp_{mqk} is a sampling offset, A_{mqk} is an attention weight, W_m is a learnable weight, W'_m is the transpose of W_m, M is the number of attention heads, and K is the number of sampling points.
5. The Transformer-based SAR image ship target detection method according to claim 3, wherein the patch fusion module comprises: taking the values at the same position of each computation area in the output features of the double Transformer module to form new patches, which are concatenated to obtain features downsampled by a factor of 2.
6. The Transformer-based SAR image ship target detection method according to claim 1, wherein the input of the feature pyramid network is four features C1 to C4 of different scales from shallow to deep;
the feature pyramid network processes the features C1 to C4 as follows to obtain five fused features P1 to P5 of different scales from shallow to deep: P4 is obtained directly from C4; P5 is obtained by downsampling P4; P3 is obtained by fusing upsampled P4 with C3; P2 is obtained by fusing upsampled P3 with C2; and P1 is obtained by fusing upsampled P2 with C1.
7. The Transformer-based SAR image ship target detection method according to claim 1, wherein the edge-guided shape enhancement module comprises:
fusing the shallowest fused feature output by the Transformer with the coarse edge map and applying a Sobel edge extraction operator to obtain a target edge prediction map;
computing the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map;
weighting the shallowest fused feature output by the Transformer with the normalized target edge prediction map and inputting the weighted feature into a convolutional network to obtain a target shape prediction map;
computing the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map, and optimizing the parameters of the convolutional network based on these losses to enhance the features.
8. The Transformer-based SAR image ship target detection method according to claim 7, wherein the weighted cross entropy loss between the target edge prediction map and the target edge ground-truth map is denoted Loss_ce,
wherein y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; N_c is the number of edge pixels in the target edge ground-truth map; and N is the total number of pixels in the target edge ground-truth map.
9. The Transformer-based SAR image ship target detection method according to claim 7, wherein, in the binary classification loss and the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map,
Loss_se is the binary classification loss between the target shape prediction map and the target binary segmentation ground-truth map; Loss_Dice is the Dice loss between the target shape prediction map and the target binary segmentation ground-truth map; y_i is the edge pixel value of the target edge ground-truth map and p_i is the probability that the i-th pixel is classified as an edge pixel; conversely, 1 - y_i is the background pixel value of the target edge ground-truth map and 1 - p_i is the probability that the i-th pixel is classified as a background pixel; and α determines the weight of the two losses.
10. The Transformer-based SAR image ship target detection method according to claim 1, wherein the anchor-free target detection head comprises the detection head of FCOS, which outputs the target detection classification and regression results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211173313.8A CN115565066A (en) | 2022-09-26 | 2022-09-26 | SAR image ship target detection method based on Transformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211173313.8A CN115565066A (en) | 2022-09-26 | 2022-09-26 | SAR image ship target detection method based on Transformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115565066A true CN115565066A (en) | 2023-01-03 |
Family
ID=84742425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211173313.8A Pending CN115565066A (en) | 2022-09-26 | 2022-09-26 | SAR image ship target detection method based on Transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115565066A (en) |
2022
- 2022-09-26 CN CN202211173313.8A patent/CN115565066A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116740370A (en) * | 2023-05-18 | 2023-09-12 | 北京理工大学 | Complex target recognition method based on deep self-attention transformation network |
CN117372676A (en) * | 2023-09-26 | 2024-01-09 | 南京航空航天大学 | Sparse SAR ship target detection method and device based on attention feature fusion |
CN117830874A (en) * | 2024-03-05 | 2024-04-05 | 成都理工大学 | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117830874B (en) * | 2024-03-05 | 2024-05-07 | 成都理工大学 | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117935251A (en) * | 2024-03-22 | 2024-04-26 | 济南大学 | Food identification method and system based on aggregated attention |
CN118628933A (en) * | 2024-08-15 | 2024-09-10 | 西南交通大学 | Ship target detection method, system, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115565066A (en) | SAR image ship target detection method based on Transformer | |
CN108596248B (en) | Remote sensing image classification method based on improved deep convolutional neural network | |
Mahmoud et al. | Object detection using adaptive mask RCNN in optical remote sensing images | |
CN112507777A (en) | Optical remote sensing image ship detection and segmentation method based on deep learning | |
CN111898432B (en) | Pedestrian detection system and method based on improved YOLOv3 algorithm | |
Hou et al. | SolarNet: a deep learning framework to map solar power plants in China from satellite imagery | |
Wang et al. | Ship detection based on fused features and rebuilt YOLOv3 networks in optical remote-sensing images | |
Wang et al. | Automatic SAR ship detection based on multifeature fusion network in spatial and frequency domains | |
Zhang et al. | Efficiently utilizing complex-valued PolSAR image data via a multi-task deep learning framework | |
CN116758130A (en) | Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion | |
CN113505634B (en) | Optical remote sensing image salient target detection method of double-flow decoding cross-task interaction network | |
Xu et al. | Fast ship detection combining visual saliency and a cascade CNN in SAR images | |
CN114022408A (en) | Remote sensing image cloud detection method based on multi-scale convolution neural network | |
Wang et al. | SAR ship detection in complex background based on multi-feature fusion and non-local channel attention mechanism | |
CN114565824B (en) | Single-stage rotating ship detection method based on full convolution network | |
CN115294468A (en) | SAR image ship identification method for improving fast RCNN | |
Chen et al. | Ship detection with optical image based on attention and loss improved YOLO | |
Yang et al. | SAR image target detection and recognition based on deep network | |
Luo et al. | SAM-RSIS: Progressively adapting SAM with box prompting to remote sensing image instance segmentation | |
Fang et al. | Scinet: Spatial and contrast interactive super-resolution assisted infrared uav target detection | |
Liu et al. | Content-guided and class-oriented learning for vhr image semantic segmentation | |
Wang et al. | Attention-aware Sobel Graph Convolutional Network for Remote Sensing Image Change Detection | |
CN115147719A (en) | Remote sensing image deep land utilization classification method based on enhanced semantic representation | |
Bendre et al. | Natural disaster analytics using high resolution satellite images | |
Wang et al. | NAS-YOLOX: ship detection based on improved YOLOX for SAR imagery |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |