CN114627292A - Industrial occlusion target detection method


Publication number
CN114627292A
Authority
CN
China
Prior art keywords: feature map, wiggle, window, perform, segmentation
Prior art date
Legal status: Granted
Application number
CN202210227869.4A
Other languages: Chinese (zh)
Other versions: CN114627292B
Inventors: 王慧燕, 林文君, 闫义祥, 何浩
Current assignee: Hangzhou Xiaoli Technology Co., Ltd.; Zhejiang Gongshang University
Original assignee: Hangzhou Xiaoli Technology Co., Ltd.; Zhejiang Gongshang University
Application filed by Hangzhou Xiaoli Technology Co., Ltd. and Zhejiang Gongshang University
Priority to CN202210227869.4A
Publication of CN114627292A; application granted; publication of CN114627292B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract



The present application relates to an industrial occlusion target detection method. In three consecutive Wiggle Transformer Block operations, window segmentation is performed with a different module each time, namely the W-MSA module, the WWL-MSA module and the WWR-MSA module, and attention is computed separately within each window formed after segmentation. Different MSA modules can extract different window positions, realizing more cross-window connections, increasing window-to-window interaction, preserving the necessary interaction of image-edge information, and improving globality and robustness. Computing attention separately within each window both introduces the locality of the CNN convolution operation and saves computation, so as to meet the needs of industrial applications. The present application can both reduce the influence of occluders on the detection target and add detail features at multiple levels, thereby improving the accuracy of occluded-target detection.


Description

Industrial occlusion target detection method

Technical field

The present application relates to the field of computer vision technology, and in particular to an industrial occlusion target detection method.

Background

In recent years, with the rapid development of deep learning, related techniques have been widely applied in the field of object detection. As one of the more fundamental recognition technologies, object detection provides technical support for intelligent transportation, public safety, the industrial Internet and other fields, and has a wide range of application scenarios.

Occlusion is common in object detection across many fields, which makes occluded-target detection a difficult problem in object detection and one of the most widely discussed problems in the industrial application of detection methods. Depending on the occluder, occlusion in real scenes falls mainly into two cases: the target to be detected is occluded by an interfering object, which often causes loss of target information and thus missed detections; or the targets to be detected occlude one another, which typically introduces a large amount of interference information and leads to false detections.

Object detection has long been dominated by convolutional neural networks (CNNs). Although CNN-based detectors have achieved good results, most methods use only the features of the last convolutional layer, which cannot provide scale diversity from features with different receptive fields, and the computational cost grows with the number of convolutional layers, making inference time-consuming. In particular, convolution excels at extracting local features but struggles to capture global representations. Many works have combined self-attention with ResNet for object detection; compared with pure convolution, the combination achieves a better trade-off between accuracy and speed than the corresponding ResNet architecture alone. However, the expensive memory access of such models makes their actual latency significantly higher than that of convolutional networks, and the problem of how to precisely embed local features and global representations remains. The subsequent Vision Transformer made some progress in image classification, but its structure is not suitable as a general-purpose backbone for dense vision tasks. Limited by the matrix nature of images, a picture needs at least several hundred pixels to express its information, and modeling such long sequences makes the Transformer computationally expensive, preventing a good trade-off among accuracy, memory access, and image processing speed.

In summary, for industrial occluded-target detection, the main problems at present are insufficient detection accuracy and detection speed that fails to meet requirements.

Summary of the invention

In view of this, it is necessary to provide an industrial occlusion target detection method to address the problems that traditional methods have insufficient detection accuracy and that their detection speed fails to meet requirements. The method provided in this application can be widely used in various industrial inspection fields.

The present application provides an industrial occlusion target detection method, the method comprising:

inputting a picture to be processed, performing patch partition on the picture and applying a linear embedding layer for dimensionality reduction, to obtain a first feature map;

performing three consecutive Wiggle Transformer Block operations on the first feature map to generate a second feature map;

inputting the second feature map into the Patch Merging module to perform a Restructure operation, and performing three consecutive Wiggle Transformer Block operations on the restructured second feature map to generate a third feature map;

inputting the third feature map into the Patch Merging module to perform a Restructure operation, and performing six consecutive Wiggle Transformer Block operations on the restructured third feature map to generate a fourth feature map;

inputting the fourth feature map into the Patch Merging module to perform a Restructure operation, and performing three consecutive Wiggle Transformer Block operations on the restructured fourth feature map to generate a fifth feature map;

feeding the fifth feature map into an RPN network, performing occluded-target detection on the fifth feature map through the RPN network, and outputting the occluded-target detection results;

wherein, in the three consecutive Wiggle Transformer Block operations, the W-MSA module, the WWL-MSA module and the WWR-MSA module are used in turn for window segmentation.

The present application relates to an industrial occlusion target detection method that improves the Swin Transformer block module of the Swin Transformer network. In three consecutive Wiggle Transformer Block operations, a different type of MSA module is used for the window segmentation step of each operation: the W-MSA module, the WWL-MSA module and the WWR-MSA module, with attention computed separately within each window formed after segmentation. Different MSA modules can extract different window positions, realizing more cross-window connections, increasing window-to-window interaction, preserving the necessary interaction of image-edge information, and improving globality and robustness. Computing attention separately within each window both introduces the locality of the CNN convolution operation and saves computation, raising the model's detection speed to meet the needs of industrial applications. Moreover, by merging patches, the resolution can be reduced and the number of channels adjusted to form a hierarchical design, which also saves a certain amount of computation. At the same time, the method can both reduce the influence of occluders on the detection target and add detail features at multiple levels, improving the accuracy of occluded-target detection and meeting the high-precision requirements of industrial applications.

Description of drawings

FIG. 1 is a schematic flowchart of the industrial occlusion target detection method provided by an embodiment of the present application.

FIG. 2 is a network structure diagram of the industrial occlusion target detection method provided by an embodiment of the present application.

FIG. 3 is a block diagram of the three consecutive Wiggle Transformer Block operations in the industrial occlusion target detection method provided by an embodiment of the present application.

FIG. 4 is a schematic diagram of window cutting in the industrial occlusion target detection method provided by an embodiment of the present application.

FIG. 5 is a schematic diagram of the window reorganization mode of the W-MSA module in the industrial occlusion target detection method provided by an embodiment of the present application.

FIG. 6 is a schematic diagram of the window reorganization mode of the WWL-MSA module in the industrial occlusion target detection method provided by an embodiment of the present application.

FIG. 7 is a schematic diagram of the window reorganization mode of the WWR-MSA module in the industrial occlusion target detection method provided by an embodiment of the present application.

FIG. 8 is a schematic diagram of S320 in the industrial occlusion target detection method provided by an embodiment of the present application, in which the Patch Merging module selects 2 × 2 patch blocks at intervals along the height and width directions of the second feature map and splices them.

Detailed description of the embodiments

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application, not to limit it.

The present application provides an industrial occlusion target detection method. It should be noted that the method provided in this application is applied to pictures in which objects occlude one another.

In addition, the present application does not restrict the execution subject of the industrial occlusion target detection method. Optionally, the execution subject may be an industrial occlusion target detection terminal.

As shown in FIG. 1, in an embodiment of the present application, the industrial occlusion target detection method includes the following S100 to S600:

S100: input a picture to be processed, perform patch partition on the picture and apply a linear embedding layer for dimensionality reduction, to obtain a first feature map.

S200: perform three consecutive Wiggle Transformer Block operations on the first feature map to generate a second feature map.

S300: input the second feature map into the Patch Merging module to perform a Restructure operation, and perform three consecutive Wiggle Transformer Block operations on the restructured second feature map to generate a third feature map.

S400: input the third feature map into the Patch Merging module to perform a Restructure operation, and perform six consecutive Wiggle Transformer Block operations on the restructured third feature map to generate a fourth feature map.

S500: input the fourth feature map into the Patch Merging module to perform a Restructure operation, and perform three consecutive Wiggle Transformer Block operations on the restructured fourth feature map to generate a fifth feature map.

S600: feed the fifth feature map into the RPN network, perform occluded-target detection on the fifth feature map through the RPN network, and output the occluded-target detection results.

Here, in the three consecutive Wiggle Transformer Block operations, the W-MSA module, the WWL-MSA module and the WWR-MSA module are used in turn for window segmentation, and attention is computed separately within each window formed after segmentation.
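To make the stage-by-stage data flow of S100 to S600 concrete, the following is a minimal shape-bookkeeping sketch (an illustration, not the patent's code). It assumes the 224 × 224 × 3 example input used later in the description, a preset dimension C = 96, and that each Patch Merging / Restructure halves both spatial sides while doubling the channel count, as suggested by the hierarchical design described above.

```python
# Sketch (assumption, not from the patent text): feature-map shapes through
# the S100-S600 pipeline for a 224x224x3 input with C = 96, assuming each
# Patch Merging halves each spatial side and doubles the channels.

def pipeline_shapes(h=224, w=224, c=96):
    """Return the feature-map shape produced by each stage."""
    shapes = {}
    # S100: 4x4 patch partition + linear embedding to dimension C.
    h, w = h // 4, w // 4
    shapes["first (S100)"] = (h, w, c)
    # S200: 3 Wiggle Transformer Blocks keep the shape unchanged.
    shapes["second (S200)"] = (h, w, c)
    # S300-S500: Patch Merging (Restructure), then 3/6/3 blocks.
    for name in ("third (S300)", "fourth (S400)", "fifth (S500)"):
        h, w, c = h // 2, w // 2, c * 2
        shapes[name] = (h, w, c)
    return shapes

for name, shape in pipeline_shapes().items():
    print(name, shape)
```

Under these assumptions the fifth feature map fed to the RPN would be 7 × 7 × 768, mirroring the Swin-style hierarchy.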

Specifically, the picture to be processed is a picture in which objects occlude one another.

In the three consecutive Wiggle Transformer Block operations, this embodiment uses the W-MSA module, the WWL-MSA module and the WWR-MSA module respectively for window segmentation, and computes attention separately within each window formed after segmentation.

As shown in FIG. 2, the network structure diagram of the industrial occlusion target detection method provided by an embodiment of the present application outlines the flow of the entire method, i.e. S100 to S600.

The Image in FIG. 2 is the picture to be processed. Stage 1 corresponds to S100 and S200, stage 2 to S300, stage 3 to S400, and stage 4 to S500. The final RPN part corresponds to S600: occluded-target detection is performed on the fifth feature map through the RPN network and the detection results are output. There are two detection outputs: a class loss value and a bbox_loss value.

In this embodiment, the Swin Transformer block module of the Swin Transformer network is improved: in three consecutive Wiggle Transformer Block operations, a different type of MSA module is used for the window segmentation step of each operation, namely the W-MSA module, the WWL-MSA module and the WWR-MSA module. Different MSA modules can extract different window positions, realizing more cross-window connections, increasing window-to-window interaction, preserving the necessary interaction of image-edge information, and improving globality and robustness. Computing attention separately within each window both introduces the locality of the CNN convolution operation and saves computation, raising the model's detection speed to meet the needs of industrial applications. Moreover, by merging patches, the resolution can be reduced and the number of channels adjusted to form a hierarchical design, which also saves a certain amount of computation. At the same time, the method can both reduce the influence of occluders on the detection target and add detail features at multiple levels, improving the accuracy of occluded-target detection and meeting the high-precision requirements of industrial applications.
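The patch merging mentioned above can be sketched as follows. This is an assumption based on FIG. 8's description of selecting 2 × 2 patch blocks at intervals along the height and width and splicing them; the patent gives no code for it. Concatenating the four interleaved sub-maps quadruples the channels; in Swin Transformer a linear layer then reduces 4C to 2C, which would match the channel doubling across stages here.

```python
import numpy as np

# Hedged sketch of Patch Merging / Restructure (an assumption inferred
# from FIG. 8): take 2x2 patch blocks at intervals along height and
# width and concatenate them, halving resolution and quadrupling channels.

def patch_merging(feat):
    """(H, W, C) -> (H/2, W/2, 4C) by interleaved 2x2 concatenation."""
    a = feat[0::2, 0::2]   # top-left element of each 2x2 group
    b = feat[1::2, 0::2]   # bottom-left
    c = feat[0::2, 1::2]   # top-right
    d = feat[1::2, 1::2]   # bottom-right
    return np.concatenate([a, b, c, d], axis=-1)

feat = np.ones((56, 56, 96))
merged = patch_merging(feat)
print(merged.shape)  # (28, 28, 384)
```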

In addition, three convolution operations are serially appended as the last step of each Wiggle Transformer Block operation; the local features extracted by convolution are fused into the Transformer to enhance representation learning, better preserving the representational power of both local features and global representations.

In an embodiment of the present application, S100 includes the following S110 to S130:

S110: input a picture to be processed with height H, width W, and 3 channels.

Specifically, if, for example, H is 224 pixels and W is 224 pixels, the input picture to be processed is a 224 × 224 × 3 picture.

S120: divide the picture to be processed into (H/4) × (W/4) patch blocks of the same size, each patch block being 4 pixels high and 4 pixels wide.

Specifically, as shown in FIG. 2, the picture to be processed is divided by Patch Partition into a set of non-overlapping patches; the number of patch blocks is (H/4) × (W/4). Following the example above, if H is 224 pixels and W is 224 pixels, the number of patch blocks after division in this step is 56 × 56 = 3136, each patch being 4 × 4, i.e. 4 pixels high and 4 pixels wide. These patch blocks of the same size can then be output as a 56 × 56 × 48 feature map, called the original feature map in this embodiment. Its feature dimension is 48: the map can be viewed as a structure produced by stacking multiple 2-D patch blocks, with feature dimension 4 × 4 × 3 = 48.
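The patch-count arithmetic above can be checked directly (a trivial illustration using the example values):

```python
# Arithmetic check for S110-S120: patch count and raw feature dimension
# for an H x W x 3 input split into 4x4 patches (example values H = W = 224).

H, W = 224, 224
patches_per_side = (H // 4, W // 4)        # (H/4, W/4)
num_patches = patches_per_side[0] * patches_per_side[1]
raw_dim = 4 * 4 * 3                        # each patch flattened

print(num_patches, raw_dim)  # 3136 48
```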

S130: input the patch blocks of the same size, as the original feature map, into the linear embedding layer, and change the feature dimension of the original feature map to a preset dimension C, so as to convert the 2-D original feature map into 1-D patch slices; take the converted 1-D patch slices as the first feature map.

Specifically, following the example above, the 56 × 56 × 48 original feature map is input into the linear embedding layer (Linear Embedding), which maps it to a preset dimension C. The principle is to flatten the H and W dimensions of the 2-D original feature map, converting it into 1-D patch slices and generating the final feature structure (feature vectors) as the first feature map.

It can thus be understood that, with C = 96, the 56 × 56 × 48 original feature map is transformed into a 56 × 56 × 96 first feature map.
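A minimal numeric sketch of S130, assuming C = 96 (the weight matrix here is random and purely illustrative, not the patent's trained parameters):

```python
import numpy as np

# Sketch of the linear embedding step: flatten the 56x56x48 original
# feature map into (3136, 48) patch tokens and project them to the
# preset dimension C = 96 with a (hypothetical) learned weight matrix.

rng = np.random.default_rng(0)
original = rng.standard_normal((56, 56, 48))    # original feature map
W_embed = rng.standard_normal((48, 96)) * 0.02  # linear embedding weights

tokens = original.reshape(-1, 48)               # 2-D map -> 1-D patch slices
first_feature_map = tokens @ W_embed            # (3136, 96)
print(first_feature_map.shape)
```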

In an embodiment of the present application, S200 includes the following S210 to S230:

S210: input the first feature map into the Wiggle Transformer Block and perform the first Wiggle Transformer Block operation. During the first Wiggle Transformer Block operation, the W-MSA module is used for window segmentation to generate multiple first windows, and attention is computed separately within each first window formed after segmentation.

S220: input the first feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block and perform the second Wiggle Transformer Block operation. During the second Wiggle Transformer Block operation, the WWL-MSA module is used for window segmentation to generate multiple second windows, and attention is computed separately within each second window formed after segmentation.

S230: input the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block, perform the third Wiggle Transformer Block operation, and take the resulting feature map as the second feature map. During the third Wiggle Transformer Block operation, the WWR-MSA module is used for window segmentation to generate multiple third windows, and attention is computed separately within each third window formed after segmentation.

Specifically, S210 to S230 comprise three consecutive Wiggle Transformer Block operations. FIG. 3 is a block diagram of the three consecutive Wiggle Transformer Block operations. The window segmentation module used in each Wiggle Transformer Block operation is different, which means the windows are partitioned differently. The W-MSA module is a windowed multi-head attention module with no window-to-window interaction. The WWL-MSA module is a windowed multi-head attention module that emphasizes interaction with the windows on the left. The WWR-MSA module is a windowed multi-head attention module that emphasizes interaction with the windows on the right. Each has a different emphasis.

Specifically, a window generated by window segmentation with the W-MSA module is named a first window; a window generated with the WWL-MSA module, a second window; and a window generated with the WWR-MSA module, a third window. The terms "first window", "second window" and "third window" used below all follow these definitions, which are not repeated.

Each of S210, S220 and S230 includes two steps: window segmentation, and attention computation within the windows.
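The attention computed inside each window can be sketched as standard scaled dot-product self-attention over that window's tokens. The patent does not spell out the formula, so the softmax(QK^T / sqrt(d)) V form below, with a single head and random illustrative projections, is an assumption:

```python
import numpy as np

# Hedged sketch of per-window attention: scaled dot-product self-attention
# computed independently inside one window (single head; the projection
# matrices are random placeholders, not trained parameters).

def window_attention(x, wq, wk, wv):
    """x: (tokens_in_window, dim) -> attended features, same shape."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(1)
d = 32
window = rng.standard_normal((16, d))   # one 4x4 window -> 16 tokens
wq, wk, wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))
out = window_attention(window, wq, wk, wv)
print(out.shape)
```

Because attention is restricted to the 16 tokens of one window, cost scales with window size rather than with the whole image, which is the computational saving the text refers to.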

Window segmentation itself is divided into two steps: the first is window cutting, and the second is window reorganization.

FIG. 4 is a schematic diagram of window cutting in the industrial occlusion target detection method provided by an embodiment of the present application; it shows an example of window cutting and illustrates where the windows come from. In FIG. 4, the smallest unit is the smallest grid cell, and the patch block framed by the dotted outline of 4 × 4 smallest units is a window. It can be seen that FIG. 4(a) has 3 × 3 = 9 windows.

Window cutting can be performed in different ways. In FIG. 4(a), a quarter-window-sized piece is taken at the centre point of each group of 2 × 2 adjacent windows to form a new window; repeating this four times yields window 1, window 2, window 3 and window 4. In FIG. 4(b), along the two outermost rows in the horizontal direction, half a window is taken between each pair of adjacent windows to form a new window; repeating this four times yields window 5, window 6, window 9 and window 11. In FIG. 4(c), along the two outermost columns in the vertical direction, half a window is taken between each pair of adjacent window blocks to form a new window; repeating this four times yields window 7, window 8, window 11 and window 12.
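The basic non-overlapping partition of FIG. 4(a) (9 windows of 4 × 4 cells on a 12 × 12 grid) can be sketched as follows; the overlapping centre/edge cuts of FIGS. 4(b) and 4(c) would be strided slices of the same map:

```python
import numpy as np

# Sketch of window cutting: split a feature map whose sides are multiples
# of the window size into non-overlapping windows, reproducing the
# 3x3 = 9 windows of FIG. 4(a) (window size 4, a 12x12 map assumed here).

def cut_windows(feat, win=4):
    """feat: (H, W, C) -> (num_windows, win, win, C)."""
    H, W, C = feat.shape
    feat = feat.reshape(H // win, win, W // win, win, C)
    return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

feat = np.arange(12 * 12 * 1).reshape(12, 12, 1)
windows = cut_windows(feat)
print(windows.shape)  # (9, 4, 4, 1)
```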

The three different window reorganization modes of the W-MSA, WWL-MSA and WWR-MSA modules are shown in FIGS. 5 to 7.

The window reorganization mode of the W-MSA module is shown in FIG. 5 and mainly emphasizes no window-to-window interaction. Following the above embodiment, nine of the windows obtained by the window cutting of FIG. 4 are randomly selected and arranged into a 3 × 3 matrix; then, within each of the nine windows, attention is computed separately.

The window reorganization method of the WWL-MSA module is shown in Fig. 6; it mainly focuses on interaction with the left-hand windows. Following the above embodiment, 9 of the windows obtained by the window cutting in Fig. 4 are randomly selected and arranged into a 3×3 matrix, but window 1, window 2, window 3 and window 4 are placed in the lower-right corner, with the other windows surrounding their upper-left side, reflecting the focus on interaction with the left-hand windows. Attention is then computed independently inside each of the 9 windows.

The window reorganization method of the WWR-MSA module is shown in Fig. 7; it mainly focuses on interaction with the right-hand windows. 9 of the windows obtained by the window cutting in Fig. 4 are randomly selected and arranged into a 3×3 matrix, but window 1, window 2, window 3 and window 4 are placed in the upper-left corner, with the other windows surrounding their lower-right side, reflecting the focus on interaction with the right-hand windows. Attention is then computed independently inside each of the 9 windows.
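The two corner placements described above can be sketched as follows; the `arrange` helper and the fill order of the remaining window ids are illustrative assumptions, not the exact layouts of Figs. 6 and 7:

```python
def arrange(corner):
    # Build a 3x3 grid of window ids. Windows 1-4 form a 2x2 block in the
    # given corner: "lower_right" mimics the WWL-MSA placement, "upper_left"
    # mimics the WWR-MSA placement; remaining ids 5..9 fill the other cells.
    grid = [[0] * 3 for _ in range(3)]
    block = [(1, 1), (1, 2), (2, 1), (2, 2)] if corner == "lower_right" \
        else [(0, 0), (0, 1), (1, 0), (1, 1)]
    for wid, (r, c) in zip((1, 2, 3, 4), block):
        grid[r][c] = wid
    rest = iter(range(5, 10))
    for r in range(3):
        for c in range(3):
            if grid[r][c] == 0:
                grid[r][c] = next(rest)
    return grid

for row in arrange("lower_right"):  # WWL-MSA-style placement
    print(row)
# [5, 6, 7]
# [8, 1, 2]
# [9, 3, 4]
```

Attention would then be computed independently within each of the 9 cells, whichever placement is used.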

In this embodiment, in three consecutive Wiggle Transformer Block operations, window segmentation is performed with the W-MSA module, the WWL-MSA module and the WWR-MSA module respectively. The different MSA modules extract different window positions, realizing more cross-window connections and increasing window-to-window interaction, which improves globality and robustness.

如图3所示,在本申请的一实施例中,所述S210包括如下S211至S217:As shown in FIG. 3, in an embodiment of the present application, the S210 includes the following S211 to S217:

S211,将第一特征图输入至Wiggle Transformer Block中,进行LayerNormalization操作。S211 , input the first feature map into the Wiggle Transformer Block, and perform a LayerNormalization operation.

具体地,S211至S217具体阐述了第一Wiggle Transformer Block操作的 过程,就是图3中的第一列操作。首先在S211中,先对第一特征图进行Layer Normalization操作,简称为LN操作。为了便于描述,这里将第一特征图记为 Zx-1Specifically, S211 to S217 specifically describe the process of the first Wiggle Transformer Block operation, which is the first column operation in FIG. 3 . First, in S211, a Layer Normalization operation, which is referred to as an LN operation for short, is performed on the first feature map. For the convenience of description, the first feature map is denoted as Z x-1 here.

Layer Normalization操作的作用是对Zx-1进行归一化。The role of the Layer Normalization operation is to normalize Z x-1 .

S212,将经Layer Normalization操作后的特征图进行W-MSA模块窗口切 分,生成多个第一窗口,并在每个切分后形成的第一窗口中各自进行attention 计算,生成经W-MSA模块窗口切分及attention计算后的特征图。S212, perform the W-MSA module window segmentation on the feature map after the Layer Normalization operation, generate a plurality of first windows, and perform attention calculation in the first windows formed after each segmentation to generate the W-MSA Feature map after module window segmentation and attention calculation.

具体地,经Layer Normalization操作后的特征图即S211中经LayerNormalization操作后的第一特征图。Specifically, the feature map after the Layer Normalization operation is the first feature map after the LayerNormalization operation in S211.

W-MSA模块窗口切分方式如图5所示。The window segmentation method of the W-MSA module is shown in Figure 5.

When the W-MSA module performs window segmentation, the number of heads of the multi-head self-attention (Multi-Head Attention) is 3; K, Q and V are each a Tensor of length 3, each Tensor has dimensions (56, 56, 96), and the window size is 7×7. Owing to canvas-size limits, each window shown in Fig. 4 is 4×4 rather than 7×7. Following the above embodiment, the number of windows is then 56/7 × 56/7 = 8 × 8 = 64.
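As an illustrative sketch of this partition (assuming row-major NumPy layout; `window_partition` is a hypothetical helper, not the patent's code), a (56, 56, 96) map splits into exactly 64 windows of 7×7:

```python
import numpy as np

def window_partition(x, M):
    # split an (H, W, C) feature map into non-overlapping MxM windows
    H, W, C = x.shape
    assert H % M == 0 and W % M == 0
    windows = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return windows.reshape(-1, M, M, C)

x = np.zeros((56, 56, 96))
wins = window_partition(x, 7)
print(wins.shape)  # (64, 7, 7, 96)
```

The reshape/transpose pair simply regroups pixels; no values are changed, so the inverse operation recovers the original map.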

In the feature map obtained by window segmentation, each 7×7 window performs Self-Attention within itself (i.e., the attention calculation mentioned above), and a relative position encoding B is added to the Q, K term of the original attention formula, further improving model performance.

本申请的Attention计算的公式如下:The formula for the Attention calculation of this application is as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}}+B\right)V$$

其中,Q为Query向量,K为Key向量,V为Value向量,d是score归一 化的参数。B就是本申请新加入的相对位置编码,它是一种可以自学习的参数。Among them, Q is the Query vector, K is the Key vector, V is the Value vector, and d is the parameter of score normalization. B is the relative position code newly added in this application, which is a self-learning parameter.
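A minimal NumPy sketch of this attention calculation, for a single head over one 7×7 window (N = 49 tokens, dimension 96); the random matrix `B` merely stands in for the learned relative position encoding, and the shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(Q, K, V, B):
    # Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d) + B  # (windows, N, N)
    return softmax(scores) @ V

N, d = 49, 96  # one 7x7 window -> 49 tokens of dimension 96
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1, N, d)) for _ in range(3))
B = rng.standard_normal((1, N, N)) * 0.02  # stand-in for the learned bias
out = window_attention(Q, K, V, B)
print(out.shape)  # (1, 49, 96)
```

Because B enters before the SoftMax, it biases which positions each token attends to, which is how the relative position encoding influences the result.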

After each 7×7 window has performed Self-Attention inside itself, the complete window segmentation flow is finished, and the generated feature map is denoted $\hat{Z}_x$, i.e., the feature map obtained after the W-MSA module window segmentation; it is the result of first applying the Layer Normalization operation to the first feature map and then performing the W-MSA module window segmentation.

S213,将经W-MSA模块窗口切分及attention计算后的特征图和第一特征 图进行残差连接。S213, perform residual connection between the feature map after the W-MSA module window segmentation and the attention calculation and the first feature map.

Specifically, this step performs a residual connection between $\hat{Z}_x$ and $Z_{x-1}$. The role of the residual connection is to explore the relationship between the feature map obtained after the W-MSA module window segmentation and the feature map before the segmentation.

S214,将残差连接后得到的特征图进行Layer Normalization操作。S214, perform a Layer Normalization operation on the feature map obtained after residual connection.

具体地,残差连接后得到的特征图再次进行一个LN操作。Specifically, the feature map obtained after residual connection is again subjected to an LN operation.

S215,将经Layer Normalization操作后的特征图输入至一个2层的MLP 神经网络模块进行神经网络处理。S215, the feature map after the Layer Normalization operation is input into a 2-layer MLP neural network module for neural network processing.

具体地,本步骤中所说的“经Layer Normalization操作后的特征图”是 S214的输出结果。本步骤是将S214的输出结果输入至一个2层的MLP神经网络 模块进行神经网络处理。Specifically, the "feature map after Layer Normalization operation" mentioned in this step is the output result of S214. This step is to input the output result of S214 into a 2-layer MLP neural network module for neural network processing.

2层的MLP神经网络模块是一个2层的多层感知器。The 2-layer MLP neural network module is a 2-layer multilayer perceptron.

S216,将经神经网络处理后的特征图进行三次卷积。S216, perform three convolutions on the feature map processed by the neural network.

Specifically, the feature map processed by the neural network is input into a 3-layer convolution structure and convolved three times. The kernels of the three convolutions are 3×3, 5×5 and 1×1, respectively. The convolutions make the feature map focus more on extracting local detail features.

S217,将三次卷积后的特征图和残差连接后得到的特征图再次进行残差连 接,最终得到经第一Wiggle Transformer Block操作后的第一特征图。三次卷 积的卷积核分别为3×3,5×5和1×1。S217, perform residual connection again on the feature map obtained after three convolutions and the feature map obtained by residual connection, and finally obtain the first feature map after the operation of the first Wiggle Transformer Block. The convolution kernels of the three convolutions are 3×3, 5×5 and 1×1, respectively.

具体地,本次残差连接相当于是将S213的输出结果和S216的输出结果进 行残差连接,最终得到经第一Wiggle Transformer Block操作后的第一特征图, 记为ZxSpecifically, this residual connection is equivalent to performing residual connection between the output result of S213 and the output result of S216, and finally obtains the first feature map after the first Wiggle Transformer Block operation, which is denoted as Z x .

In this embodiment, a 3-layer convolution module is added after the MLP module of each block, so that while attending to global features the network also overcomes the situation where the target is occluded by an occluder and confused with the background. This achieves global feature interaction within a larger receptive field while paying more attention to local features, thereby improving the accuracy of occluded target detection. It fully accounts for characteristics of CNNs such as shift invariance and the relationship between receptive field and depth, effectively combining the respective advantages of CNNs and transformers.
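The S211-S217 pipeline can be sketched as below; `msa`, `mlp` and `conv3` are placeholder callables standing in for the real MSA module, the 2-layer MLP and the 3-layer convolution, so only the LN/residual wiring described in the text is shown:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def wiggle_block(x, msa, mlp, conv3):
    # S211/S212: LN, then windowed attention (W-MSA, WWL-MSA or WWR-MSA)
    y = msa(layer_norm(x))
    # S213: first residual connection with the block input
    y = y + x
    # S214/S215: LN, then the 2-layer MLP
    z = mlp(layer_norm(y))
    # S216: three convolutions (3x3, 5x5, 1x1) emphasizing local detail
    z = conv3(z)
    # S217: second residual connection
    return z + y

# toy stand-ins: identity attention/convolution, tanh as a tiny "MLP"
x = np.random.default_rng(1).standard_normal((49, 96))
out = wiggle_block(x, msa=lambda t: t, mlp=np.tanh, conv3=lambda t: t)
print(out.shape)  # (49, 96)
```

Swapping the `msa` argument is all that distinguishes the first, second and third block of a stage, matching the W-MSA/WWL-MSA/WWR-MSA cycle.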

如图3所示,在本申请的一实施例中,所述S220包括如下S221至S227:As shown in FIG. 3, in an embodiment of the present application, the S220 includes the following S221 to S227:

S221,将经第一Wiggle Transformer Block操作后的第一特征图输入至WiggleTransformer Block中,进行Layer Normalization操作。S221 , input the first feature map after the operation of the first Wiggle Transformer Block into the Wiggle Transformer Block, and perform a Layer Normalization operation.

S222,将经Layer Normalization操作后的特征图进行WWL-MSA模块窗口 切分,生成多个第二窗口,并在每个切分后形成的第二窗口中各自进行 attention计算,生成经WWL-MSA模块窗口切分及attention计算后的特征图。S222, perform the WWL-MSA module window segmentation on the feature map after the Layer Normalization operation, generate multiple second windows, and perform attention calculation in each of the second windows formed after segmentation to generate the WWL-MSA Feature map after module window segmentation and attention calculation.

S223,将经WWL-MSA模块窗口切分及attention计算后的特征图和经第一 WiggleTransformer Block操作后的第一特征图进行残差连接。S223, perform residual connection between the feature map after window segmentation and attention calculation by the WWL-MSA module and the first feature map after the first WiggleTransformer Block operation.

S224,将残差连接后得到的特征图进行Layer Normalization操作。S224 , performing a Layer Normalization operation on the feature map obtained after the residuals are connected.

S225,将经Layer Normalization操作后的特征图输入至一个2层的MLP 神经网络模块进行神经网络处理。S225, the feature map after the Layer Normalization operation is input into a 2-layer MLP neural network module for neural network processing.

S226,将经神经网络处理后的特征图进行三次卷积。S226, perform three convolutions on the feature map processed by the neural network.

S227,将三次卷积后的特征图和残差连接后得到的特征图再次进行残差连 接,最终得到经第二Wiggle Transformer Block操作后的第一特征图。三次卷 积的卷积核分别为3×3,5×5和1×1。S227: Perform residual connection again on the feature map obtained after the three convolutions and the feature map obtained by residual connection, and finally obtain the first feature map after the second Wiggle Transformer Block operation. The convolution kernels of the three convolutions are 3×3, 5×5 and 1×1, respectively.

Specifically, S221 to S227 correspond to the second column of operations in FIG. 3. Their working principle is largely the same as that of S211 to S217 and is not repeated here; the difference lies in the module used for window segmentation, i.e., the segmentation form (in fact, the window reorganization method) is different. S222 uses the WWL-MSA module.

WWL-MSA模块窗口切分方式如图6所示。Figure 6 shows the window segmentation method of the WWL-MSA module.

The WWL-MSA module realizes cross-window interaction: the windows after masking contain elements of the originally adjacent windows, and more attention is paid to communication between the left-hand windows.

Zx经过执行S221至S227后,生成经第二Wiggle Transformer Block 操作后的第一特征图,记为Zx+1After performing S221 to S227 for Z x , a first feature map after the second Wiggle Transformer Block operation is generated, which is denoted as Z x+1 .

在本申请的一实施例中,所述S230包括如下S231至S237:In an embodiment of the present application, the S230 includes the following S231 to S237:

S231,将经第二Wiggle Transformer Block操作后的第一特征图输入至 WiggleTransformer Block中,进行Layer Normalization操作。S231: Input the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block, and perform a Layer Normalization operation.

S232,将经Layer Normalization操作后的特征图进行WWR-MSA模块窗口 切分,生成多个第三窗口,并在每个切分后形成的第三窗口中各自进行 attention计算,生成经WWR-MSA模块窗口切分及attention计算后的特征图。S232, perform the WWR-MSA module window segmentation on the feature map after the Layer Normalization operation, generate a plurality of third windows, and perform attention calculation in the third windows formed after each segmentation to generate the WWR-MSA Feature map after module window segmentation and attention calculation.

S233,将经WWR-MSA模块窗口切分及attention计算后的特征图和经第二 WiggleTransformer Block操作后的第一特征图进行残差连接。S233, perform residual connection between the feature map after window segmentation and attention calculation by the WWR-MSA module and the first feature map after the second WiggleTransformer Block operation.

S234,将残差连接后得到的特征图进行Layer Normalization操作。S234 , performing a Layer Normalization operation on the feature map obtained after residual connection.

S235,将经Layer Normalization操作后的特征图输入至一个2层的MLP 神经网络模块进行神经网络处理。S235, the feature map after the Layer Normalization operation is input into a 2-layer MLP neural network module for neural network processing.

S236,将经神经网络处理后的特征图进行三次卷积。S236, perform three convolutions on the feature map processed by the neural network.

S237,将三次卷积后的特征图和残差连接后得到的特征图再次进行残差连 接,最终得到第二特征图。三次卷积的卷积核分别为3×3,5×5和1×1。S237, perform residual connection again on the feature map obtained after the three convolutions and the feature map obtained after residual connection, and finally obtain a second feature map. The convolution kernels of the cubic convolution are 3×3, 5×5 and 1×1, respectively.

Specifically, S231 to S237 correspond to the third column of operations in FIG. 3. Their working principle is largely the same as that of S211 to S217 and S221 to S227 and is not repeated here; the difference lies in the module used for window segmentation, i.e., the segmentation form is different. S232 uses the WWR-MSA module. On the basis of the W-MSA module's partition, the WWR-MSA module masks the feature map at different positions to obtain the "shifted" windows; the mask size is M/2, and the number of windows remains unchanged.

WWR-MSA模块窗口切分方式如图7所示。The window segmentation method of the WWR-MSA module is shown in Figure 7.

WWR-MSA模块实现了跨窗口之间的交互,进行mask之后的窗口包含原本相 邻窗口的元素,并且更注重右边窗口之间的通信。The WWR-MSA module realizes the interaction between windows. The window after masking contains the elements of the original adjacent windows, and pays more attention to the communication between the right windows.

Zx+1经过执行S231至S237后,生成第二特征图,记为Zx+2After executing S231 to S237 for Z x+1 , a second feature map is generated, which is denoted as Z x+2 .

综上,执行连续三次的Wiggle Transformer Block操作后,Zx-1变为 Zx+2,但是分辨率保持在H为224/4,W为224/4。In summary, after performing three consecutive Wiggle Transformer Block operations, Z x-1 becomes Z x+2 , but the resolution remains at 224/4 for H and 224/4 for W.

The subsequent S300, S400 and S500 all contain the same kind of step as S200, i.e., consecutive Wiggle Transformer Block operations whose working principles are exactly the same. The only difference is that S300, S400 and S500 each perform a "Restructure operation" before the consecutive Wiggle Transformer Block operations; the process of this Restructure operation is explained in detail below through S310 to S330.

在本申请的一实施例中,所述S300包括如下S310至S330:In an embodiment of the present application, the S300 includes the following S310 to S330:

S310,将第二特征图输入至Patch Merging模块。S310, input the second feature map to the Patch Merging module.

S320,通过Patch Merging模块在第二特征图的高度方向和宽度方向上, 间隔选取2×2patch块进行拼接。S320 , 2×2 patch blocks are selected at intervals in the height direction and the width direction of the second feature map through the Patch Merging module for splicing.

S330,将拼接后的第二特征图进行一次卷积核为1×1的卷积,最终生成经Restructure操作后的第二特征图。S330, perform a convolution with a convolution kernel of 1×1 on the spliced second feature map, and finally generate a second feature map after the Restructure operation.

Specifically, the "Restructure operation" comprises two steps: the splicing step in S320 and the convolution step in S330. In terms of processing principle, the "Restructure operation" is equivalent to a form of downsampling.

The splicing process of S320 is shown in Fig. 8. It can be understood as dividing the patch blocks in the second feature map into groups; as shown in Fig. 8(a), the upper-left corner of Fig. 8(a) is one group, and Fig. 8(a) contains 4×4 = 16 groups in total. Each group comprises 4 adjacent patch blocks arranged in a 2×2 matrix. The 4 patch blocks within each group are then split into an upper-left, an upper-right, a lower-left and a lower-right patch block. The upper-left patch block is taken from every group and all upper-left patch blocks are spliced together to generate the patch block of Fig. 8(b); likewise, all upper-right patch blocks generate the patch block of Fig. 8(c), all lower-left patch blocks generate the patch block of Fig. 8(d), and all lower-right patch blocks generate the patch block of Fig. 8(e). Finally, the patch blocks of Figs. 8(b) to 8(e) are stacked to generate the spliced second feature map, after which the subsequent S330 is performed.

Following the above embodiment, after splicing, $Z_{x+2}$ becomes a feature map with H = 28, W = 28 and 384 feature dimensions. The convolution then reduces the feature dimension from 384 to 192.
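A sketch of this Restructure operation under the stated shapes, with the 1×1 convolution expressed as a per-position linear map (the weight `w` is a random illustrative stand-in for the learned kernel):

```python
import numpy as np

def restructure(x, w):
    # x: (H, W, C) feature map. Take the four interleaved sub-grids
    # (upper-left, upper-right, lower-left, lower-right patch of every
    # 2x2 group) and stack them on the channel axis: (H/2, W/2, 4C).
    tl, tr = x[0::2, 0::2, :], x[0::2, 1::2, :]
    bl, br = x[1::2, 0::2, :], x[1::2, 1::2, :]
    merged = np.concatenate([tl, tr, bl, br], axis=-1)
    # a 1x1 convolution is a per-position linear map halving the channels
    return merged @ w

C = 96
x = np.random.default_rng(2).standard_normal((56, 56, C))
w = np.random.default_rng(3).standard_normal((4 * C, 2 * C)) * 0.01
y = restructure(x, w)
print(y.shape)  # (28, 28, 192)
```

Halving the spatial resolution while doubling the channel count is what gives the backbone its hierarchical, pyramid-like structure.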

After performing S310 to S330, the second feature map after the Restructure operation is obtained. Three consecutive Wiggle Transformer Block operations (with the same principle as S210 to S230) are then performed on it. Here the number of heads of the multi-head self-attention (Multi-Head Attention) is 6, and the tensor dimensions of Q, K and V are (28, 28, 192). The number of windows is now 28/7 × 28/7, i.e., 4×4 = 16 windows in total. The final output third feature map is (28, 28, 192), i.e., the resolution of the third feature map is kept at H/8, W/8.

在本申请的一实施例中,所述S400包括如下S410至S470:In an embodiment of the present application, the S400 includes the following S410 to S470:

S410,将第三特征图输入至Patch Merging模块执行Restructure操作。S410, the third feature map is input to the Patch Merging module to perform the Restructure operation.

Specifically, this is the same downsampling as in S310 to S330: each group of 2×2 adjacent patch blocks in the third feature map is first spliced, followed by a convolution with a 1×1 kernel.

After splicing each group of 2×2 adjacent patch blocks in the third feature map, the feature map becomes (14, 14, 768); after the 1×1 convolution, the feature dimension is reduced from 768 to 384.

S420,将经Restructure操作后的第三特征图输入至Wiggle Transformer Block中,进行第一Wiggle Transformer Block操作。在进行第一Wiggle Transformer Block操作的过程中,利用W-MSA模块进行窗口切分,生成多个第 一窗口,并在每个切分后形成的第一窗口中各自进行attention计算。S420: Input the third feature map after the Restructure operation into the Wiggle Transformer Block, and perform the first Wiggle Transformer Block operation. In the process of performing the first Wiggle Transformer Block operation, the W-MSA module is used for window segmentation, multiple first windows are generated, and attention calculation is performed in each of the first windows formed after segmentation.

S430, input the third feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block and perform the second Wiggle Transformer Block operation. During the second Wiggle Transformer Block operation, the WWL-MSA module is used for window segmentation to generate multiple second windows, and attention is computed within each second window formed after segmentation.

S440,将经第二Wiggle Transformer Block操作后的第三特征图输入至 WiggleTransformer Block中,进行第三Wiggle Transformer Block操作。在 进行第三WiggleTransformer Block操作的过程中,利用WWR-MSA模块进行窗 口切分,生成多个第三窗口,并在每个切分后形成的第三窗口中各自进行 attention计算。S440: Input the third feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block, and perform the third Wiggle Transformer Block operation. In the process of performing the third WiggleTransformer Block operation, the WWR-MSA module is used to perform window segmentation, multiple third windows are generated, and attention calculation is performed in each of the third windows formed after segmentation.

S450,将经第三Wiggle Transformer Block操作后的第三特征图输入至 WiggleTransformer Block中,进行第二次第一Wiggle Transformer Block操 作。在进行第二次第一Wiggle Transformer Block操作的过程中,利用W-MSA 模块进行窗口切分,生成多个第一窗口,并在每个切分后形成的第一窗口中各 自进行attention计算。S450, input the third feature map after the third Wiggle Transformer Block operation into the Wiggle Transformer Block, and perform the second first Wiggle Transformer Block operation. In the process of the second first Wiggle Transformer Block operation, the W-MSA module is used to perform window segmentation to generate multiple first windows, and the attention calculation is performed in each of the first windows formed after segmentation.

S460,将经第二次第一Wiggle Transformer Block操作后的第三特征图输 入至Wiggle Transformer Block中,进行第二次第二Wiggle Transformer Block 操作。在进行第二次第二Wiggle Transformer Block操作的过程中,利用WWL-MSA 模块进行窗口切分,生成多个第二窗口,并在每个切分后形成的第二窗口中各 自进行attention计算。S460, input the third feature map after the second first Wiggle Transformer Block operation into the Wiggle Transformer Block, and perform the second second Wiggle Transformer Block operation. In the process of the second second Wiggle Transformer Block operation, the WWL-MSA module is used to perform window segmentation to generate multiple second windows, and the attention calculation is performed in each of the second windows formed after segmentation.

S470, input the third feature map after the second-round second Wiggle Transformer Block operation into the Wiggle Transformer Block and perform the second-round third Wiggle Transformer Block operation, taking the finally obtained feature map as the fourth feature map. During the second-round third Wiggle Transformer Block operation, the WWR-MSA module is used for window segmentation to generate multiple third windows, and attention is computed within each third window formed after segmentation.

Specifically, in this embodiment six consecutive Wiggle Transformer Block operations are performed instead of three, but the principle is the same as for three consecutive operations: there are two groups of consecutive Wiggle Transformer Block operations, each group containing three.
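The cycling of MSA modules over consecutive blocks described here can be sketched as:

```python
def msa_schedule(depth):
    # the MSA module used by each consecutive Wiggle Transformer Block
    cycle = ("W-MSA", "WWL-MSA", "WWR-MSA")
    return [cycle[i % 3] for i in range(depth)]

print(msa_schedule(6))
# ['W-MSA', 'WWL-MSA', 'WWR-MSA', 'W-MSA', 'WWL-MSA', 'WWR-MSA']
```

With `depth=6` this yields exactly two groups of the three-module cycle, matching S420-S470.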

The number of heads of the multi-head self-attention (Multi-Head Attention) is 12, and the tensor dimensions of Q, K and V are (14, 14, 384). The number of windows is now 14/7 × 14/7, i.e., 2×2 = 4 windows in total. The resolution of the finally generated fourth feature map is kept at H/16, W/16.

In addition, S500 follows the same principle as the aforementioned S300. First, in the Restructure operation, after splicing each group of 2×2 adjacent patch blocks in the fourth feature map, the feature map becomes (7, 7, 1536); after the 1×1 convolution, the feature dimension is reduced from 1536 to 768.

During the window segmentation in S500, the number of heads of the multi-head self-attention (Multi-Head Attention) is 24, and the tensor dimensions of Q, K and V are (7, 7, 768). The number of windows is now 7/7 × 7/7, i.e., only 1 window remains. The resolution of the finally generated fifth feature map is kept at H/32, W/32.
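Collecting the figures quoted across the embodiments (224×224 input, 4×4 patches, window size 7, base dimension 96, heads doubling per stage), the per-stage shapes can be tabulated; this schedule is inferred from the text rather than given as code in the patent:

```python
def stage_schedule(img=224, patch=4, window=7, base_dim=96, base_heads=3, stages=4):
    # per stage: (spatial resolution, feature dim, attention heads, windows)
    out, res, dim, heads = [], img // patch, base_dim, base_heads
    for _ in range(stages):
        out.append((res, dim, heads, (res // window) ** 2))
        res, dim, heads = res // 2, dim * 2, heads * 2
    return out

for res, dim, heads, n_win in stage_schedule():
    print(res, dim, heads, n_win)
# 56 96 3 64
# 28 192 6 16
# 14 384 12 4
# 7 768 24 1
```

Each Restructure halves the resolution and doubles the dimension, so the window count shrinks from 64 down to a single window in the last stage.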

在本申请的一实施例中,所述S600包括如下S610至S620:In an embodiment of the present application, the S600 includes the following S610 to S620:

S610,将第五特征图送入RPN网络。S610: Send the fifth feature map to the RPN network.

S620,通过RPN网络对第五特征图进行二值分类,得到分类loss值。S620, perform binary classification on the fifth feature map through the RPN network to obtain a classification loss value.

Specifically, the classification loss value is a single value that characterizes the element type of the occluded part. For example, if the classification value for a bird is 0, that for a tree is 1 and that for a human is 2, then the element type of the occluded part can be qualitatively determined from the obtained value.

在本申请的一实施例中,所述S600还包括如下步骤:In an embodiment of the present application, the S600 further includes the following steps:

S630,通过RPN网络对第五特征图进行Bounding Box回归,得到回归loss 值。S630, perform Bounding Box regression on the fifth feature map through the RPN network to obtain the regression loss value.

具体地,回归loss值也是一个值。Specifically, the regression loss value is also a value.

Aiming at the complexity and uncertainty of occluded targets, the present invention designs a new network module, adding the WWL-MSA module, the WWR-MSA module and a 3-layer convolution structure to overcome the confusion caused by occluders in target detection, and increasing window-to-window interaction to improve globality and robustness. Attention is computed separately within each window, which both introduces the locality of the CNN convolution operation and saves computation. Moreover, by merging patches, the resolution can be reduced and the number of channels adjusted to form a hierarchical design, which also saves a certain amount of computation. This both reduces the influence of occluders on the detected target and adds detail features at multiple levels, improving the accuracy of occluded target detection.

以上所述实施例的各技术特征可以进行任意的组合,各方法步骤也并不做 执行顺序的限制,为使描述简洁,未对上述实施例中的各个技术特征所有可能 的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为 是本说明书记载的范围。The technical features of the above-described embodiments can be combined arbitrarily, and the execution order of each method step is not limited. For the sake of brevity, all possible combinations of the technical features in the above-described embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of the description in this specification.

以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细, 但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域 的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和 改进,这些都属于本申请的保护范围。因此,本申请的保护范围应以所附权利 要求为准。The above-mentioned embodiments only represent several embodiments of the present application, and the descriptions thereof are relatively specific and detailed, but should not be construed as a limitation on the scope of the patent of the present application. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of the present application, some modifications and improvements can be made, which all belong to the protection scope of the present application. Therefore, the scope of protection of this application should be determined by the appended claims.

Claims (10)

1. An industrial occlusion target detection method, characterized in that the method comprises:
inputting a picture to be processed, performing patch partition on the picture to be processed, and applying a linear embedding layer for dimensionality reduction, to obtain a first feature map;
performing three consecutive Wiggle Transformer Block operations on the first feature map to generate a second feature map;
inputting the second feature map into a Patch Merging module to perform a Restructure operation, and performing three consecutive Wiggle Transformer Block operations on the second feature map after the Restructure operation to generate a third feature map;
inputting the third feature map into the Patch Merging module to perform a Restructure operation, and performing six consecutive Wiggle Transformer Block operations on the third feature map after the Restructure operation to generate a fourth feature map;
inputting the fourth feature map into the Patch Merging module to perform a Restructure operation, and performing three consecutive Wiggle Transformer Block operations on the fourth feature map after the Restructure operation to generate a fifth feature map;
feeding the fifth feature map into an RPN network, performing occlusion target detection on the fifth feature map through the RPN network, and outputting an occlusion target detection result;
wherein, in the three consecutive Wiggle Transformer Block operations, the W-MSA module, the WWL-MSA module and the WWR-MSA module are respectively used to perform window partition, and attention is computed separately within each window formed after partition.

2. The industrial occlusion target detection method according to claim 1, wherein the inputting a picture to be processed, performing patch partition on the picture to be processed, and applying a linear embedding layer for dimensionality reduction to obtain a first feature map comprises:
inputting a picture to be processed with a height of H, a width of W and 3 channels;
dividing the picture to be processed into (H/4) × (W/4) patch blocks of equal size, each patch block being 4 pixels high and 4 pixels wide;
inputting the plurality of patch blocks of equal size, as an original feature map, into the linear embedding layer, and changing the feature dimension of the original feature map to a preset dimension C, so as to convert the 2-dimensional original feature map into a 1-dimensional patch slice, the converted 1-dimensional patch slice being taken as the first feature map.
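The patch-count arithmetic of claim 2 can be sketched as follows; the function name and the example values H = W = 224, C = 96 are illustrative assumptions, not values fixed by the claim:

```python
# Hypothetical sketch of the patch partition in claim 2: an H x W x 3 image is
# cut into 4x4 patch blocks, then each flattened patch is linearly embedded to
# a preset dimension C.
def patch_partition_shapes(H, W, C, patch=4):
    """Return (num_patches, raw_patch_dim, embedded_dim) for a 3-channel image."""
    assert H % patch == 0 and W % patch == 0, "H and W are assumed divisible by 4"
    num_patches = (H // patch) * (W // patch)  # the (H/4) x (W/4) grid of claim 2
    raw_dim = patch * patch * 3                # 4*4*3 = 48 values per flattened patch
    return num_patches, raw_dim, C             # the embedding layer maps raw_dim -> C

print(patch_partition_shapes(224, 224, 96))  # (3136, 48, 96)
```

So a 224 × 224 input would yield a 56 × 56 grid of patches, i.e. a length-3136 sequence of C-dimensional tokens forming the first feature map.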
3. The industrial occlusion target detection method according to claim 2, wherein the performing three consecutive Wiggle Transformer Block operations on the first feature map to generate a second feature map comprises:
inputting the first feature map into a Wiggle Transformer Block and performing a first Wiggle Transformer Block operation, wherein during the first Wiggle Transformer Block operation the W-MSA module performs window partition to generate a plurality of first windows, and attention is computed separately within each first window formed after partition;
inputting the first feature map after the first Wiggle Transformer Block operation into a Wiggle Transformer Block and performing a second Wiggle Transformer Block operation, wherein during the second Wiggle Transformer Block operation the WWL-MSA module performs window partition to generate a plurality of second windows, and attention is computed separately within each second window formed after partition;
inputting the first feature map after the second Wiggle Transformer Block operation into a Wiggle Transformer Block, performing a third Wiggle Transformer Block operation, and taking the finally obtained feature map as the second feature map, wherein during the third Wiggle Transformer Block operation the WWR-MSA module performs window partition to generate a plurality of third windows, and attention is computed separately within each third window formed after partition.

4. The industrial occlusion target detection method according to claim 3, wherein the inputting the first feature map into a Wiggle Transformer Block and performing a first Wiggle Transformer Block operation comprises:
inputting the first feature map into the Wiggle Transformer Block and performing a Layer Normalization operation;
performing W-MSA window partition on the feature map after the Layer Normalization operation to generate a plurality of first windows, and computing attention separately within each first window formed after partition, to generate a feature map after W-MSA window partition and attention computation;
performing a residual connection between the feature map after W-MSA window partition and attention computation and the first feature map;
performing a Layer Normalization operation on the feature map obtained after the residual connection;
inputting the feature map after the Layer Normalization operation into a 2-layer MLP neural network module for neural network processing;
performing three convolutions on the feature map after the neural network processing;
performing a residual connection between the feature map after the three convolutions and the feature map obtained after the earlier residual connection, to finally obtain the first feature map after the first Wiggle Transformer Block operation; the kernels of the three convolutions are 3×3, 5×5 and 1×1 respectively.

5. The industrial occlusion target detection method according to claim 4, wherein the inputting the first feature map after the first Wiggle Transformer Block operation into a Wiggle Transformer Block and performing a second Wiggle Transformer Block operation comprises:
inputting the first feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block and performing a Layer Normalization operation;
performing WWL-MSA window partition on the feature map after the Layer Normalization operation to generate a plurality of second windows, and computing attention separately within each second window formed after partition, to generate a feature map after WWL-MSA window partition and attention computation;
performing a residual connection between the feature map after WWL-MSA window partition and attention computation and the first feature map after the first Wiggle Transformer Block operation;
performing a Layer Normalization operation on the feature map obtained after the residual connection;
inputting the feature map after the Layer Normalization operation into a 2-layer MLP neural network module for neural network processing;
performing three convolutions on the feature map after the neural network processing;
performing a residual connection between the feature map after the three convolutions and the feature map obtained after the earlier residual connection, to finally obtain the first feature map after the second Wiggle Transformer Block operation; the kernels of the three convolutions are 3×3, 5×5 and 1×1 respectively.

6. The industrial occlusion target detection method according to claim 5, wherein the inputting the first feature map after the second Wiggle Transformer Block operation into a Wiggle Transformer Block, performing a third Wiggle Transformer Block operation, and taking the finally obtained feature map as the second feature map comprises:
inputting the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block and performing a Layer Normalization operation;
performing WWR-MSA window partition on the feature map after the Layer Normalization operation to generate a plurality of third windows, and computing attention separately within each third window formed after partition, to generate a feature map after WWR-MSA window partition and attention computation;
performing a residual connection between the feature map after WWR-MSA window partition and attention computation and the first feature map after the second Wiggle Transformer Block operation;
performing a Layer Normalization operation on the feature map obtained after the residual connection;
inputting the feature map after the Layer Normalization operation into a 2-layer MLP neural network module for neural network processing;
performing three convolutions on the feature map after the neural network processing;
performing a residual connection between the feature map after the three convolutions and the feature map obtained after the earlier residual connection, to finally obtain the second feature map; the kernels of the three convolutions are 3×3, 5×5 and 1×1 respectively.

7. The industrial occlusion target detection method according to claim 6, wherein the inputting the second feature map into a Patch Merging module to perform a Restructure operation comprises:
inputting the second feature map into the Patch Merging module;
selecting, through the Patch Merging module, 2×2 patch blocks at intervals along the height direction and the width direction of the second feature map and concatenating them, to generate a concatenated second feature map;
performing one convolution with a 1×1 kernel on the concatenated second feature map, to finally generate the second feature map after the Restructure operation.
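The interval 2×2 selection of claim 7 can be illustrated on a toy feature map stored as nested lists; the concatenation order within each 2×2 group and the omission of the trailing 1×1 convolution are assumptions of this sketch, not details fixed by the claim:

```python
# Hedged sketch of Patch Merging (claim 7) on a [H][W][C] nested-list feature map:
# 2x2 neighbouring patch blocks are selected at stride-2 intervals and their
# channel vectors concatenated, halving spatial size and quadrupling channels.
def patch_merging(fmap):
    H, W = len(fmap), len(fmap[0])
    merged = []
    for i in range(0, H, 2):
        row = []
        for j in range(0, W, 2):
            # concatenate the four interval-selected patches along the channel axis
            row.append(fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1])
        merged.append(row)
    return merged

toy = [[[r * 10 + c] for c in range(4)] for r in range(4)]  # 4x4 map, 1 channel
out = patch_merging(toy)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
```

In the claimed method a 1×1 convolution then follows, which would map the quadrupled channel count back down to the desired dimension.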
8. The industrial occlusion target detection method according to claim 7, wherein the inputting the third feature map into the Patch Merging module to perform a Restructure operation and performing six consecutive Wiggle Transformer Block operations on the third feature map after the Restructure operation to generate a fourth feature map comprises:
inputting the third feature map into the Patch Merging module to perform the Restructure operation;
inputting the third feature map after the Restructure operation into a Wiggle Transformer Block and performing a first Wiggle Transformer Block operation, wherein during the first Wiggle Transformer Block operation the W-MSA module performs window partition to generate a plurality of first windows, and attention is computed separately within each first window formed after partition;
inputting the third feature map after the first Wiggle Transformer Block operation into a Wiggle Transformer Block and performing a second Wiggle Transformer Block operation, wherein during the second Wiggle Transformer Block operation the WWL-MSA module performs window partition to generate a plurality of second windows, and attention is computed separately within each second window formed after partition;
inputting the third feature map after the second Wiggle Transformer Block operation into a Wiggle Transformer Block and performing a third Wiggle Transformer Block operation, wherein during the third Wiggle Transformer Block operation the WWR-MSA module performs window partition to generate a plurality of third windows, and attention is computed separately within each third window formed after partition;
inputting the third feature map after the third Wiggle Transformer Block operation into a Wiggle Transformer Block and performing a second first Wiggle Transformer Block operation, wherein during the second first Wiggle Transformer Block operation the W-MSA module performs window partition to generate a plurality of first windows, and attention is computed separately within each first window formed after partition;
inputting the third feature map after the second first Wiggle Transformer Block operation into a Wiggle Transformer Block and performing a second second Wiggle Transformer Block operation, wherein during the second second Wiggle Transformer Block operation the WWL-MSA module performs window partition to generate a plurality of second windows, and attention is computed separately within each second window formed after partition;
inputting the third feature map after the second second Wiggle Transformer Block operation into a Wiggle Transformer Block, performing a second third Wiggle Transformer Block operation, and taking the finally obtained feature map as the fourth feature map, wherein during the second third Wiggle Transformer Block operation the WWR-MSA module performs window partition to generate a plurality of third windows, and attention is computed separately within each third window formed after partition.

9. The industrial occlusion target detection method according to claim 8, wherein the feeding the fifth feature map into an RPN network, performing occlusion target detection on the fifth feature map through the RPN network, and outputting an occlusion target detection result comprises:
feeding the fifth feature map into the RPN network;
performing binary classification on the fifth feature map through the RPN network to obtain a classification loss value.

10. The industrial occlusion target detection method according to claim 9, wherein the feeding the fifth feature map into an RPN network, performing occlusion target detection on the fifth feature map through the RPN network, and outputting an occlusion target detection result further comprises:
performing Bounding Box regression on the fifth feature map through the RPN network to obtain a regression loss value.
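The claims alternate three window-partition modules (W-MSA, then "wiggle left" WWL-MSA, then "wiggle right" WWR-MSA) without fixing their internal geometry. A speculative 1-D illustration of why alternating partitions help with occlusion: shifting the window grid puts each token into a different window in successive blocks, so attention information crosses window boundaries. The shift amount of 2 below is purely an assumption of this sketch:

```python
# Speculative 1-D model of alternating window partitions. W-MSA is taken as the
# aligned grid (offset 0); the WWL-MSA / WWR-MSA "wiggle" partitions are modelled
# as the same grid shifted by a small offset, which the claims do not specify.
def partition(n_tokens, window=4, offset=0):
    """Group token indices 0..n_tokens-1 into windows on a grid shifted by `offset`."""
    edges = list(range(offset % window, n_tokens, window))
    if not edges or edges[0] != 0:
        edges = [0] + edges          # keep the leading partial window
    edges.append(n_tokens)           # and the trailing one
    return [list(range(a, b)) for a, b in zip(edges, edges[1:]) if a < b]

print(partition(8, 4, 0))  # aligned grid:  [[0, 1, 2, 3], [4, 5, 6, 7]]
print(partition(8, 4, 2))  # wiggled grid:  [[0, 1], [2, 3, 4, 5], [6, 7]]
```

Note that tokens 3 and 4 never share a window in the aligned grid but do in the wiggled one, so stacking the three partitions in consecutive blocks lets features of a partially occluded object propagate across window boundaries.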
CN202210227869.4A 2022-03-08 2022-03-08 Industrial Occluded Target Detection Method Active CN114627292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210227869.4A CN114627292B (en) 2022-03-08 2022-03-08 Industrial Occluded Target Detection Method

Publications (2)

Publication Number Publication Date
CN114627292A true CN114627292A (en) 2022-06-14
CN114627292B CN114627292B (en) 2024-05-14

Family

ID=81900180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210227869.4A Active CN114627292B (en) 2022-03-08 2022-03-08 Industrial Occluded Target Detection Method

Country Status (1)

Country Link
CN (1) CN114627292B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200098108A1 (en) * 2018-09-26 2020-03-26 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for image processing
CN111144203A (en) * 2019-11-19 2020-05-12 浙江工商大学 A Pedestrian Occlusion Detection Method Based on Deep Learning
CN113657409A (en) * 2021-08-16 2021-11-16 平安科技(深圳)有限公司 Vehicle loss detection method, device, electronic device and storage medium
CN113888744A (en) * 2021-10-14 2022-01-04 浙江大学 Image semantic segmentation method based on Transformer visual upsampling module
CN114022770A (en) * 2021-11-11 2022-02-08 中山大学 Mountain crack detection method based on improved self-attention mechanism and transfer learning
CN114066902A (en) * 2021-11-22 2022-02-18 安徽大学 Medical image segmentation method, system and device based on convolution and transformer fusion
CN114066820A (en) * 2021-10-26 2022-02-18 武汉纺织大学 A fabric defect detection method based on Swin-Transformer and NAS-FPN
CN114078230A (en) * 2021-11-19 2022-02-22 西南交通大学 Small target detection method for self-adaptive feature fusion redundancy optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, Haitao; ZHANG, Meng: "SSD Object Detection Algorithm Incorporating a Channel Attention Mechanism", Computer Engineering, no. 08, 15 August 2020 (2020-08-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156157A (en) * 2023-04-24 2023-05-23 长沙海信智能系统研究院有限公司 Camera shielding abnormality detection method and electronic equipment
CN116156157B (en) * 2023-04-24 2023-08-18 长沙海信智能系统研究院有限公司 Camera shielding abnormality detection method and electronic equipment

Also Published As

Publication number Publication date
CN114627292B (en) 2024-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant