CN117132774B - Multi-scale polyp segmentation method and system based on PVT - Google Patents
- Publication number
- CN117132774B (application CN202311097260.0A)
- Authority
- CN
- China
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/32—Normalisation of the pattern dimensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
The invention discloses a PVT-based multi-scale polyp segmentation method and system, relating to the technical field of deep-learning medical image semantic segmentation. The invention comprises the following steps: obtaining a colonoscopy image to be detected and preprocessing it; performing multi-scale feature extraction on the preprocessed image with a PVTv2 backbone network; fusing the original feature maps of different scales generated by the PVT step by step with a parallel Sobel edge decoder to obtain a global prediction map; performing multi-receptive-field feature extraction on the original feature maps with a multi-scale parallel dilated convolution attention module; using the global prediction map to guide the stages one by one and progressively generate multi-stage prediction maps; and comparing the global prediction map and the multi-stage prediction maps with the ground-truth map to obtain the prediction loss, the last-stage prediction map being the final polyp segmentation prediction map. The method can accurately identify and segment polyps in colonoscopy images, providing effective help for doctors in reaching a correct diagnosis.
Description
Technical Field
The invention belongs to the technical field of deep-learning medical image semantic segmentation, and particularly relates to a PVT-based multi-scale polyp segmentation method and system.
Background
Colorectal cancer is a common malignancy, and its early detection and treatment are of great importance in improving patient survival. Since colorectal cancer presents no typical symptoms in its early stages, screening has become increasingly important, and one of the principal screening means is colonoscopy. In colonoscopy images, polyps are similar in color to the surrounding normal tissue, variable in shape, and different in size; small polyps may even stick together, and polyp boundaries are often ambiguous. This poses many challenges for polyp segmentation of colonoscopy imaging results. Traditional medical image segmentation methods, such as threshold segmentation and region growing, often require manual annotation assisted by a professional physician; affected by illumination conditions, physician experience, subjective factors, and the like, the segmentation is time-consuming and labor-intensive and suffers from large errors and instability. How to achieve automatic segmentation of colonoscopy images, and thereby obtain polyp segmentation results with clearer boundaries more efficiently, has become one of the hot topics in medical image segmentation.
In recent years, deep-learning techniques have been widely applied to medical image segmentation. For colonoscopy, polyp segmentation methods based on convolutional neural networks (CNNs) have been widely used, with two main typical architectures: U-shaped structures based on U-Net, and the PraNet architecture. U-Net uses an encoder-decoder architecture and combines low-level and high-level features via skip connections to effectively preserve spatial locality information, but it is susceptible to noise and occlusion. PraNet first uses a parallel partial decoder (PPD) to aggregate high-level features and generate a global map that roughly locates polyps, then uses a reverse attention (RA) module to progressively refine regions and boundaries. However, owing to the inherent limitations of convolutional neural networks, such models still have problems with segmentation accuracy and robustness. There is therefore a need to improve existing models so as to raise polyp segmentation performance on colonoscopy imaging results.
Recently, the success of Transformers in natural language processing (NLP) has inspired computer-vision researchers, leading to applications and developments of Transformers in computer-vision research tasks. Since Transformer-based networks are good at capturing long-range dependencies between image regions through global self-attention, the Transformer can be applied to the polyp segmentation task: it learns the dependencies between different regions in a colonoscopy image, and this information is exploited to improve the segmentation performance and robustness of the model. In addition, advanced optimization algorithms can accelerate model training and improve the convergence rate. By applying these techniques, the segmentation of polyps in colonoscopy images can be further improved, providing clinicians with more accurate and reliable diagnostic results.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a PVT-based multi-scale polyp segmentation method and system, which effectively solve the problem that polyp regions cannot be accurately identified in the prior art, further improve the accuracy and semantic integrity of polyp segmentation boundaries in colonoscopy images, and realize accurate, rapid, and automatic polyp segmentation.
In order to achieve the above object, the present invention provides the following solutions:
a PVT-based multi-scale polyp segmentation method, comprising the following steps:
S1, acquiring a colonoscopy image to be detected and preprocessing it;
S2, performing multi-scale feature extraction on the preprocessed colonoscopy image with a PVTv2 backbone network to obtain original feature maps of different scales;
S3, fusing the original feature maps step by step with a parallel Sobel edge decoder to obtain a global prediction map;
S4, performing multi-receptive-field feature extraction on the original feature maps with a multi-scale parallel dilated convolution attention module;
S5, using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction, progressively generating multi-stage prediction maps;
S6, comparing the global prediction map and the multi-stage prediction maps with the ground-truth map and calculating the loss; the resulting last-stage prediction map is the final polyp segmentation prediction map.
Preferably, in S1, the method for preprocessing the colonoscopy image comprises:
enhancing the colonoscopy image data with random rotation, vertical flipping, horizontal flipping, and normalization, then uniformly cropping the image to 352×352 and scaling it with a {0.75, 1, 1.25} multi-scale strategy.
Preferably, in S2, the method for performing multi-scale feature extraction on the preprocessed colonoscopy image with the PVTv2 backbone network comprises:
judging whether the preprocessed colonoscopy image input to the PVTv2 backbone network is a 3-channel image; if it is, sending it directly into the network for feature extraction; if it is not, adjusting the number of channels to 3 with a single 1×1 convolution;
performing four-stage feature extraction with a pre-trained PVTv2-B2 model.
Preferably, in S3, the method for fusing the original feature maps step by step with the parallel Sobel edge decoder to obtain the global prediction map comprises:
S31: in the first branch, compressing the feature-map channels with a 1×1 convolution;
S32: in the second branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×3 and 3×1 asymmetric convolutions and a 3×3 convolution with dilation rate 3;
S33: in the third branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×5 and 5×1 asymmetric convolutions and a 3×3 convolution with dilation rate 5;
S34: in the fourth branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×7 and 7×1 asymmetric convolutions and a 3×3 convolution with dilation rate 7;
S35: concatenating the compressed feature map of the first branch with the feature maps of the second, third, and fourth branches after feature extraction along the channel dimension, then compressing the channels with a 1×1 convolution;
S36: adding the compressed concatenated feature map pixel by pixel to the original feature map whose channels were compressed by convolution, passing the sum through a ReLU nonlinear activation function, and then feeding it into the Sobel operation;
S37: adding the feature maps gradient-sharpened by the Sobel operator pixel by pixel, and generating the initial global polyp segmentation prediction map through a 1×1 convolution and a bilinear-interpolation upsampling operation.
Preferably, in S4, the method for performing multi-receptive-field feature extraction on the original feature maps with the multi-scale parallel dilated convolution attention module comprises:
S41: compressing the channels of the four layers of original feature maps from the PVT encoder with a 1×1 convolution to obtain multi-channel feature maps with a reduced number of channels relative to the original feature maps;
S42: dividing the channels of the compressed feature map evenly into groups and sending them into four branches for processing, namely extracting features in the respective branches with 3×3 convolutions of dilation rates 1, 3, 5, and 7, and then concatenating the processing results of the four branches;
S43: applying a 1×1 convolution to the channel-concatenated feature map, followed in turn by batch normalization (BN) and the nonlinear ReLU activation function, to obtain the processed feature map;
S44: sending the processed feature map into a CBAM module for further attention weighting, obtaining a more discriminative feature map.
Preferably, in S5, the method for using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction and progressively generate the multi-stage prediction maps comprises:
S51: spatially downsampling the global prediction map so that its resolution matches that of the fourth-stage PVT feature map; sending it into the RA module for the reverse-attention operation to generate an attention map; multiplying this element by element with the fourth-stage PVT feature map; then, after feature dimensionality reduction by three 3×3 convolutions, adding the result pixel by pixel to the prediction map of the previous stage to generate the prediction map of the current stage;
S52: sending the prediction map of the current stage into the next stage and performing the same operation as in S51, guiding the generation of the final-stage feature map.
Preferably, in S6, the method for comparing the global prediction map and the multi-stage prediction maps with the ground-truth map and calculating the loss, the resulting last-stage prediction map being the final polyp segmentation prediction map, comprises:
applying a bilinear-interpolation spatial upsampling operation to the global prediction map and the multi-stage prediction maps, resizing all prediction maps to the size of the ground-truth map corresponding to the input image, and calculating the mixed loss of weighted BCE and weighted IoU;
the weighted BCE loss is defined as

$$L_{wBCE} = -\frac{\sum_{(x,y)}\omega(x,y)\left[G(x,y)\log P(x,y)+(1-G(x,y))\log\left(1-P(x,y)\right)\right]}{\sum_{(x,y)}\omega(x,y)},$$

where G denotes the ground-truth map, P denotes the prediction map, and (x, y) denotes any pixel position in the image; the corresponding weighting coefficient ω(x, y) represents the importance of pixel (x, y) and is defined as

$$\omega(x,y) = 1+\gamma\left|\frac{\sum_{(i,j)\in A_{(x,y)}}G(i,j)}{\left|A_{(x,y)}\right|}-G(x,y)\right|,$$

where A_{(x,y)} denotes a local neighborhood around pixel (x, y) and γ is set to 5;
the weighted IoU loss is defined as

$$L_{wIoU} = 1-\frac{\sum_{(x,y)}\omega(x,y)\,G(x,y)\,P(x,y)}{\sum_{(x,y)}\omega(x,y)\left[G(x,y)+P(x,y)-G(x,y)P(x,y)\right]};$$

combining the weighted BCE and weighted IoU losses, the mixed loss of the prediction map relative to the ground-truth map is

$$L_{seg} = L_{wBCE}+L_{wIoU}.$$
the invention also provides a PVT-based multi-scale polyp segmentation system, comprising a preprocessing module, a first feature extraction module, a fusion module, a second feature extraction module, a guiding module, and a prediction module;
the preprocessing module is used for acquiring a colonoscopy image to be detected and preprocessing it;
the first feature extraction module is used for performing multi-scale feature extraction on the preprocessed colonoscopy image with a PVTv2 backbone network to obtain original feature maps of different scales;
the fusion module is used for fusing the original feature maps step by step with a parallel Sobel edge decoder to obtain a global prediction map;
the second feature extraction module is used for performing multi-receptive-field feature extraction on the original feature maps with a multi-scale parallel dilated convolution attention module;
the guiding module is used for using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction, progressively generating multi-stage prediction maps;
the prediction module is used for comparing the global prediction map and the multi-stage prediction maps with the ground-truth map and calculating the loss; the resulting last-stage prediction map is the final polyp segmentation prediction map.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention adopts PVTv2 as the backbone network in place of the ResNet backbone in PraNet, giving the network better global feature extraction capability.
2. The invention provides a parallel Sobel edge decoder that fuses feature maps at different scales, improving the model's segmentation of polyps of different sizes.
3. The invention also provides a multi-scale parallel dilated convolution attention module, which performs multi-receptive-field feature extraction on feature maps of different scales and uses CBAM to reassign weights to the feature maps and extract the regions of interest.
4. The invention also trains the model with a new loss function that combines weighted BCE loss and weighted IoU loss, eliminating the influence of the unbalanced distribution of positive and negative samples and further improving the segmentation accuracy and robustness of the model.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the embodiments are briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an implementation of the PVT-based multi-scale polyp segmentation method provided in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the PraNet network model structure;
Fig. 3 is a schematic diagram of the network model structure of the PVT-based multi-scale polyp segmentation method constructed in an embodiment of the present invention;
Fig. 4 is a block diagram of the parallel Sobel edge decoder of the present invention;
Fig. 5 is a schematic diagram of the RFB operation in the parallel Sobel edge decoder module of the present invention;
Fig. 6 is a schematic diagram of the multi-scale parallel dilated convolution attention module of the present invention;
Fig. 7 is a visual comparison of experimental results of the PVT-based multi-scale polyp segmentation method in an embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
Fig. 1 is a schematic flow chart of an implementation of a PVT-based multi-scale polyp segmentation method according to an embodiment of the present invention. As shown in fig. 1, the PVT-based multi-scale polyp segmentation method comprises the following steps:
Step S1: obtain the colonoscopy images to be detected from the Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-T datasets, and preprocess the images.
In implementation, the preprocessing may include random rotation, vertical flipping, horizontal flipping, normalization, and similar processing of the colonoscopy images to be detected, turning them into images that meet the detection requirements. Each image is then uniformly resized to 352 rows × 352 columns and scaled with a {0.75, 1, 1.25} multi-scale strategy. These preprocessing techniques provide more reliable input data for the neural network model and let the network handle polyps of different sizes, improving polyp segmentation in colonoscopy.
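In code, the preprocessing might look as follows (a minimal PyTorch/torchvision sketch; the rotation angle, the ImageNet normalization statistics, and the batch-level application of the multi-scale strategy are assumptions, since the embodiment only names the operations):

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

BASE_SIZE = 352                # images are uniformly resized to 352 × 352
SCALES = (0.75, 1.0, 1.25)     # multi-scale training strategy

# Geometric transforms must be applied identically to image and mask;
# they are shown on the image alone here for brevity.
train_transform = transforms.Compose([
    transforms.Resize((BASE_SIZE, BASE_SIZE)),
    transforms.RandomRotation(degrees=90),            # "random rotation" (angle assumed)
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet stats (assumed)
                         std=[0.229, 0.224, 0.225]),
])

def rescale_batch(images: torch.Tensor, scale: float) -> torch.Tensor:
    """Resize a (B, C, H, W) batch by one of SCALES, keeping the side length
    divisible by 32 so the stride-32 PVTv2 stage stays well defined."""
    side = int(round(BASE_SIZE * scale / 32) * 32)
    return F.interpolate(images, size=(side, side),
                         mode='bilinear', align_corners=False)
```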
Step S2: perform multi-scale feature extraction on the preprocessed image with the PVTv2 backbone network.
Referring to fig. 2, which shows the PraNet network model structure: PraNet is a parallel reverse-attention network that can accurately segment polyps from colonoscopy images. The network first uses a parallel partial decoder (PPD) to aggregate high-level features and generate an initial global prediction map that guides the subsequent steps; it then establishes the relationship between the target region and its boundary with a reverse attention module, fully exploiting the complementarity between edges and regions. However, owing to the limited receptive field of the PraNet backbone, only local information can be captured, while spatial context and global information are ignored. The invention improves on this; fig. 3 is a schematic diagram of the network model structure of the PVT-based multi-scale polyp segmentation method constructed in an embodiment of the invention.
In the embodiment of the invention, the PVTv2 backbone network is used for multi-scale feature extraction. PVTv2 is one of the most advanced pre-trained models currently available; thanks to its Pyramid Vision Transformer architecture, it provides more accurate and robust feature extraction from input images and performs well in a variety of visual tasks, including image classification, object detection, and segmentation. Using a PVTv2 backbone allows polyps of different resolutions in colonoscopy images to be handled better.
To model local continuity information, PVTv2 tokenizes images with overlapping patch embedding: the patch window is enlarged so that adjacent windows overlap by half, and the feature map is zero-padded to preserve resolution. To keep PVTv2 at the same linear complexity as a CNN, the overlapping patch embedding is implemented with a zero-padded convolution, as in the sketch below.
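A minimal sketch of such an overlapping patch embedding (a stage-1 configuration following the PVTv2 paper's defaults, not text recovered from this patent):

```python
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding: a zero-padded strided convolution
    tokenizes the image while adjacent patch windows overlap by half,
    preserving local continuity at CNN-like linear complexity."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                       # (B, C, H/stride, W/stride)
        _, _, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence
        return self.norm(tokens), h, w
```

The specific operation of multi-scale feature extraction with the PVTv2 backbone network is as follows: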
S201: first check whether the preprocessed colonoscopy image input to the network is a 3-channel image.
S202: if it is a 3-channel image, send it directly into the network for feature extraction; if it is not, adjust the channels with a single 1×1 convolution so that the number of image channels becomes 3.
S203: the model performs four-stage feature extraction with a pre-trained PVTv2-B2. Across the four stages, the numbers of PVTv2 basic-unit layers are 3, 4, 6, and 3, respectively, and the four layers of generated original feature maps X1–X4 have dimensions (channels × rows × columns): 64×(H/4)×(W/4), 128×(H/8)×(W/8), 320×(H/16)×(W/16), and 512×(H/32)×(W/32).
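Steps S201–S203 can be sketched as a thin wrapper (assuming a timm build that exposes PVT-v2 with features_only; the embodiment does not name a particular implementation):

```python
import timm
import torch.nn as nn

class PVTv2Backbone(nn.Module):
    """Force 3 input channels with a 1×1 convolution when needed (S201–S202),
    then extract the four pyramid stages from pre-trained PVTv2-B2 (S203)."""
    def __init__(self, in_chans: int = 3):
        super().__init__()
        self.adjust = (nn.Identity() if in_chans == 3
                       else nn.Conv2d(in_chans, 3, kernel_size=1))
        self.encoder = timm.create_model('pvt_v2_b2', pretrained=True,
                                         features_only=True)

    def forward(self, x):
        # Returns X1..X4 with 64/128/320/512 channels at strides 4/8/16/32.
        return self.encoder(self.adjust(x))
```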
Step S3: fuse the original feature maps of different scales generated by the PVT step by step with the parallel Sobel edge decoder to obtain the global prediction map. The parallel Sobel edge decoder module of the present invention, shown in fig. 4, operates as follows:
As shown in fig. 4, X1, X2, X3, and X4 are the original feature maps of the four PVT stages. RFB operations are performed on them in parallel; the specific operation is shown in fig. 5. Taking X1 as an example, X1 undergoes convolution operations along four branches, as follows:
S301: in the first branch, the number of channels of the feature map is compressed with a 1×1 convolution; for convenience of computation, the channel counts of all features are compressed to 32.
S302: in the second branch, the number of channels is first compressed with a 1×1 convolution, then features are extracted once each with 1×3 and 3×1 asymmetric convolutions and a 3×3 convolution with dilation rate 3.
S303: in the third branch, the number of channels is first compressed with a 1×1 convolution, then features are extracted once each with 1×5 and 5×1 asymmetric convolutions and a 3×3 convolution with dilation rate 5.
S304: in the fourth branch, the number of channels is first compressed with a 1×1 convolution, then features are extracted once each with 1×7 and 7×1 asymmetric convolutions and a 3×3 convolution with dilation rate 7.
S305: the feature maps of the four branches are concatenated along the channel dimension, and the number of channels of the concatenated map is then compressed with a 1×1 convolution.
S306: this feature map is added pixel by pixel to the original feature map whose channels were compressed with a 1×1 convolution, passed through the nonlinear ReLU activation function, and then fed into the Sobel operation.
The RFB operations on the feature maps X2, X3, and X4 are the same as above.
S307: gradient-sharpen the four processed feature maps with the Sobel operator and add them pixel by pixel; a 1×1 convolution followed by a bilinear-interpolation spatial upsampling operation then generates the initial global polyp segmentation prediction map.
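The whole decoder can be sketched as follows (assumptions: batch normalization after each convolution, fusion at the stride-4 resolution, and sharpening realized as adding the Sobel gradient magnitude back onto the features; the description fixes the branch structure and the 32-channel compression but not these details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(in_c, out_c, k, d=1):
    # k is an int or an asymmetric (kh, kw) tuple; padding preserves size.
    kh, kw = (k, k) if isinstance(k, int) else k
    pad = (d * (kh - 1) // 2, d * (kw - 1) // 2)
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, (kh, kw), padding=pad, dilation=d, bias=False),
        nn.BatchNorm2d(out_c))

class RFBBranch(nn.Module):
    """Four-branch RFB block of S301-S306 with 32 output channels."""
    def __init__(self, in_c, out_c=32):
        super().__init__()
        def branch(k, d):  # 1×1 → 1×k → k×1 → dilated 3×3
            return nn.Sequential(conv_bn(in_c, out_c, 1),
                                 conv_bn(out_c, out_c, (1, k)),
                                 conv_bn(out_c, out_c, (k, 1)),
                                 conv_bn(out_c, out_c, 3, d=d))
        self.b1 = conv_bn(in_c, out_c, 1)
        self.b2, self.b3, self.b4 = branch(3, 3), branch(5, 5), branch(7, 7)
        self.fuse = conv_bn(4 * out_c, out_c, 1)
        self.res = conv_bn(in_c, out_c, 1)   # channel-compressed residual path

    def forward(self, x):
        y = self.fuse(torch.cat([self.b1(x), self.b2(x),
                                 self.b3(x), self.b4(x)], dim=1))
        return F.relu(y + self.res(x))       # pixel-wise addition, then ReLU

def sobel_sharpen(x):
    """Per-channel Sobel gradient magnitude added back onto the features."""
    kx = x.new_tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    c = x.shape[1]
    wx = kx.expand(c, 1, 3, 3).contiguous()
    wy = kx.t().expand(c, 1, 3, 3).contiguous()
    gx = F.conv2d(x, wx, padding=1, groups=c)
    gy = F.conv2d(x, wy, padding=1, groups=c)
    return x + torch.sqrt(gx * gx + gy * gy + 1e-6)

class SobelEdgeDecoder(nn.Module):
    def __init__(self, chans=(64, 128, 320, 512), mid=32):
        super().__init__()
        self.rfbs = nn.ModuleList(RFBBranch(c, mid) for c in chans)
        self.head = nn.Conv2d(mid, 1, kernel_size=1)

    def forward(self, feats, out_size):
        base = feats[0].shape[2:]            # fuse at the stride-4 resolution
        fused = 0
        for f, rfb in zip(feats, self.rfbs):
            y = sobel_sharpen(rfb(f))        # S307: gradient sharpening
            fused = fused + F.interpolate(y, size=base, mode='bilinear',
                                          align_corners=False)
        pred = self.head(fused)              # 1×1 convolution to one channel
        return F.interpolate(pred, size=out_size, mode='bilinear',
                             align_corners=False)
```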
Step S4: perform multi-receptive-field feature extraction on the original feature maps with the multi-scale parallel dilated convolution attention module. The module of the present invention, shown in fig. 6, operates as follows:
S401: the original feature maps of the four layers of the PVT encoder are each sent into the multi-scale parallel dilated convolution attention module for further feature extraction. Channel compression is first performed with a 1×1 convolution so that the processed feature map has a reduced number of channels relative to the original feature map.
S402: the processed feature map is divided evenly into four groups by channel, and the four groups are sent into four branches for processing: features are extracted with 3×3 convolutions of dilation rates 1, 3, 5, and 7, respectively, and the results are then concatenated by channel.
S403: a 1×1 convolution is applied to the channel-concatenated feature map, followed in turn by batch normalization (BN) and the nonlinear ReLU activation.
S404: the feature map obtained in S403 is input to a CBAM module to further strengthen attention over the feature map. The CBAM module consists mainly of two parts: channel attention and spatial attention. The channel attention module assigns globally meaningful importance across channels, reducing redundant computation; the spatial attention module attends to spatially local information to different degrees, retaining the more effective local features. The feature map passed through CBAM is thereby refined into a more discriminative feature map. Finally, feature maps of four sizes (channels × rows × columns) are generated: 64×(H/4)×(W/4), 128×(H/8)×(W/8), 320×(H/16)×(W/16), and 512×(H/32)×(W/32).
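A minimal sketch of the module (the channel-compression ratio is unreadable in the source and is left configurable; the CBAM here is a compact re-implementation of channel + spatial attention rather than the exact module in the drawings):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Compact CBAM: channel attention from max- and average-pooled
    descriptors through a shared MLP, then spatial attention from a 7×7
    convolution over pooled channel maps."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(c, c // reduction, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(c // reduction, c, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.amax(dim=(2, 3), keepdim=True)) +
                           self.mlp(x.mean(dim=(2, 3), keepdim=True)))
        x = x * ca                                   # channel re-weighting
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)))
        return x * sa                                # spatial re-weighting

class MultiScaleDilatedConvAttention(nn.Module):
    """S401-S404: 1×1 channel compression, four channel groups processed by
    parallel 3×3 convolutions with dilation rates 1/3/5/7, channel concat,
    1×1 convolution + BN + ReLU, then CBAM re-weighting."""
    def __init__(self, in_c, out_c):
        super().__init__()
        assert out_c % 4 == 0
        g = out_c // 4
        self.compress = nn.Conv2d(in_c, out_c, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(g, g, 3, padding=d, dilation=d) for d in (1, 3, 5, 7))
        self.fuse = nn.Sequential(nn.Conv2d(out_c, out_c, 1),
                                  nn.BatchNorm2d(out_c),
                                  nn.ReLU(inplace=True))
        self.cbam = CBAM(out_c)

    def forward(self, x):
        groups = torch.chunk(self.compress(x), 4, dim=1)
        x = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        return self.cbam(self.fuse(x))
```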
Step S5: use the global prediction map to guide the stages one by one and progressively generate the multi-stage prediction maps. The operation is the same as in PraNet, as follows:
S501: spatially downsample the global prediction map so that its resolution matches that of the fourth-stage PVT feature map; send it into the RA module for the reverse-attention operation to generate an attention map; multiply this element by element with the fourth-stage PVT feature map; reduce the feature dimensionality with three 3×3 convolutions; then add the result pixel by pixel to the prediction map of the previous stage to generate the prediction map of the current stage.
S502: the prediction map of the current stage is sent into the next stage, and the same operation as in S501 is performed, guiding the generation of the final-stage feature map.
In more detail, the global prediction map generated by the parallel Sobel edge decoder guides the four feature maps generated in S4 step by step and progressively generates the prediction maps:
First, guided prediction is performed with the global prediction map from S3 and the 512×(H/32)×(W/32) feature map from S4: the global prediction map is spatially downsampled to the size of the S4 feature map and then fused with it to generate a prediction map of size 1×(H/32)×(W/32). This guided prediction combines global and local information to segment polyps in colonoscopy images more accurately.
Next, the resulting prediction map guides the 320×(H/16)×(W/16) feature map from S4: the prediction map is first expanded to the size of the S4 feature map by a spatial upsampling operation and then fused with it to generate a prediction map of size 1×(H/16)×(W/16). Guided prediction then proceeds to the following stages in order, successively generating two further polyp segmentation prediction maps of sizes 1×(H/8)×(W/8) and 1×(H/4)×(W/4).
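One reverse-attention stage might be sketched as follows (PraNet-style; the width of the three 3×3 convolutions is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseAttentionStage(nn.Module):
    """Resize the incoming prediction to the feature resolution, build a
    reverse-attention map 1 - sigmoid(pred), weight the stage features with
    it, reduce with three 3×3 convolutions, and add back the prediction."""
    def __init__(self, in_c, mid_c=64):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_c, mid_c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_c, mid_c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_c, 1, 3, padding=1))

    def forward(self, feat, prev_pred):
        pred = F.interpolate(prev_pred, size=feat.shape[2:], mode='bilinear',
                             align_corners=False)
        att = 1.0 - torch.sigmoid(pred)     # reverse-attention map (RA module)
        refined = self.reduce(feat * att)   # element-wise weighting + 3 convs
        return refined + pred               # pixel-wise addition (S501)

# Usage: starting from the global prediction map, refine through stages
# 4 → 1 to produce the multi-stage prediction maps described above.
```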
Step S6: compare the generated global prediction map and multi-stage prediction maps with the ground-truth map and calculate the prediction loss; the last-stage prediction map is the final polyp segmentation prediction map. The specific operation is as follows:
S601: apply a bilinear-interpolation spatial upsampling operation to the five prediction maps, resize them to the size of the ground-truth map corresponding to the input image, and calculate the mixed loss L_seg composed of weighted BCE and weighted IoU.
The weighted BCE loss is calculated as

$$L_{wBCE} = -\frac{\sum_{(x,y)}\omega(x,y)\left[G(x,y)\log P(x,y)+(1-G(x,y))\log\left(1-P(x,y)\right)\right]}{\sum_{(x,y)}\omega(x,y)},$$

where G denotes the ground-truth map, P the prediction map, and (x, y) any pixel position in the image.
The weighting coefficient ω(x, y) represents the importance of pixel (x, y) and is calculated as

$$\omega(x,y) = 1+\gamma\left|\frac{\sum_{(i,j)\in A_{(x,y)}}G(i,j)}{\left|A_{(x,y)}\right|}-G(x,y)\right|,$$

where A_{(x,y)} denotes a local neighborhood around pixel (x, y); in the concrete calculation, γ is set to 5.
The weighted IoU loss is calculated as

$$L_{wIoU} = 1-\frac{\sum_{(x,y)}\omega(x,y)\,G(x,y)\,P(x,y)}{\sum_{(x,y)}\omega(x,y)\left[G(x,y)+P(x,y)-G(x,y)P(x,y)\right]}.$$

The mixed loss function of the final prediction map P relative to the ground-truth map G is

$$L_{seg} = L_{wBCE}+L_{wIoU},$$

where L_{wIoU} and L_{wBCE} are the weighted IoU and weighted BCE losses above.
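The mixed loss can be implemented compactly; the sketch below is the weighted-BCE + weighted-IoU "structure loss" popularized by F3Net and used by PraNet, where pred holds logits and mask the binary ground truth. The 31×31 averaging window for ω comes from those works and is an assumption here.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # ω = 1 + γ · |local mean of G − G|, with γ = 5 as in the description.
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    # Weighted BCE, normalized by the total pixel weight.
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    # Weighted IoU on the sigmoid probabilities.
    prob = torch.sigmoid(pred)
    inter = (prob * mask * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()   # L_seg = L_wBCE + L_wIoU
```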
To demonstrate the effectiveness of the method, it is trained with the Kvasir-SEG and CVC-ClinicDB datasets as the training set, then tested on the Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-T datasets, and the test results are compared with mainstream polyp segmentation algorithms in the prior art. In the experiments, the model is built on the PyTorch 1.8 deep-learning framework and trained on one NVIDIA RTX 2080Ti GPU with 11 GB of memory; the input image size is set to 352×352, the initial learning rate to 1e-4 with an AdamW optimizer, the batch size to 6, and the total number of training epochs to 100. The test results on the Kvasir-SEG and CVC-ClinicDB datasets are shown in Table 1:
Table 1: test results of the invention and 7 other polyp segmentation methods on the Kvasir-SEG and CVC-ClinicDB datasets (table body not reproduced in this text).
The experimental results on the CVC-ClinicDB, CVC-ColonDB, and ETIS datasets are shown in Table 2.
Table 2: test results of the invention and 7 other polyp segmentation methods on the CVC-ClinicDB, CVC-ColonDB, and ETIS datasets (table body not reproduced in this text).
Of the 7 evaluation metrics, the first 2 are metrics commonly used in semantic segmentation tasks; for the first 6 metrics, values closer to 1 indicate better segmentation, while the 7th metric is non-negative and better the closer it is to 0.
Fig. 7 is a visual comparison of the experimental results of the method of the invention against other methods; the results show that the polyp segmentation method of the invention obtains segmentation results with more accurate boundaries and more complete semantic structure.
Example 2
The invention also provides a PVT-based multi-scale polyp segmentation system, comprising a preprocessing module, a first feature extraction module, a fusion module, a second feature extraction module, a guiding module, and a prediction module;
the preprocessing module is used for acquiring a colonoscopy image to be detected and preprocessing it;
the first feature extraction module is used for performing multi-scale feature extraction on the preprocessed colonoscopy image with a PVTv2 backbone network to obtain original feature maps of different scales;
the fusion module is used for fusing the original feature maps step by step with a parallel Sobel edge decoder to obtain a global prediction map;
the second feature extraction module is used for performing multi-receptive-field feature extraction on the original feature maps with a multi-scale parallel dilated convolution attention module;
the guiding module is used for using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction, progressively generating multi-stage prediction maps;
the prediction module is used for comparing the global prediction map and the multi-stage prediction maps with the ground-truth map and calculating the loss; the resulting last-stage prediction map is the final polyp segmentation prediction map.
The above embodiments merely illustrate preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art without departing from the spirit of the present invention fall within the scope of the present invention as defined in the appended claims.
Claims (5)
1. A PVT-based multi-scale polyp segmentation method, characterized by comprising the following steps:
S1, acquiring a colonoscopy image to be detected and preprocessing it;
S2, performing multi-scale feature extraction on the preprocessed colonoscopy image with a PVTv2 backbone network to obtain original feature maps of different scales;
S3, fusing the original feature maps step by step with a parallel Sobel edge decoder to obtain a global prediction map;
wherein, in S3, the method for fusing the original feature maps step by step with the parallel Sobel edge decoder to obtain the global prediction map comprises:
S31: in the first branch, compressing the feature-map channels with a 1×1 convolution;
S32: in the second branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×3 and 3×1 asymmetric convolutions and a 3×3 convolution with dilation rate 3;
S33: in the third branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×5 and 5×1 asymmetric convolutions and a 3×3 convolution with dilation rate 5;
S34: in the fourth branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×7 and 7×1 asymmetric convolutions and a 3×3 convolution with dilation rate 7;
S35: concatenating the compressed feature map of the first branch with the feature maps of the second, third, and fourth branches after feature extraction along the channel dimension, then compressing the channels with a 1×1 convolution;
S36: adding the compressed concatenated feature map pixel by pixel to the original feature map whose channels were compressed by convolution, passing the sum through a ReLU nonlinear activation function, and then feeding it into the Sobel operation;
S37: adding the feature maps gradient-sharpened by the Sobel operator pixel by pixel, and generating the initial global polyp segmentation prediction map through a 1×1 convolution and a bilinear-interpolation upsampling operation;
S4, performing multi-receptive-field feature extraction on the original feature maps with a multi-scale parallel dilated convolution attention module;
wherein, in S4, the method for performing multi-receptive-field feature extraction on the original feature maps with the multi-scale parallel dilated convolution attention module comprises:
S41: compressing the channels of the four layers of original feature maps from the PVT encoder with a 1×1 convolution to obtain multi-channel feature maps with a reduced number of channels relative to the original feature maps;
S42: dividing the channels of the compressed feature map evenly into groups and sending them into four branches for processing, namely extracting features in the respective branches with 3×3 convolutions of dilation rates 1, 3, 5, and 7, and then concatenating the processing results of the four branches;
S43: applying a 1×1 convolution to the channel-concatenated feature map, followed in turn by batch normalization (BN) and the nonlinear ReLU activation function, to obtain the processed feature map;
S44: sending the processed feature map into a CBAM module for further attention weighting, obtaining a more discriminative feature map;
S5, using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction, progressively generating multi-stage prediction maps;
wherein, in S5, the method for using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction and progressively generate the multi-stage prediction maps comprises:
S51: spatially downsampling the global prediction map so that its resolution matches that of the fourth-stage PVT feature map; sending it into the RA module for the reverse-attention operation to generate an attention map; multiplying this element by element with the fourth-stage PVT feature map; then, after feature dimensionality reduction by three 3×3 convolutions, adding the result pixel by pixel to the prediction map of the previous stage to generate the prediction map of the current stage;
S52: sending the prediction map of the current stage into the next stage and performing the same operation as in S51, guiding the generation of the final-stage feature map;
S6, comparing the global prediction map and the multi-stage prediction maps with the ground-truth map and calculating the loss, the resulting last-stage prediction map being the final polyp segmentation prediction map.
2. The PVT-based multi-scale polyp segmentation method according to claim 1, characterized in that, in S1, the method for preprocessing the colonoscopy image comprises:
enhancing the colonoscopy image data with random rotation, vertical flipping, horizontal flipping, and normalization, then uniformly cropping the image to 352×352 and scaling it with a {0.75, 1, 1.25} multi-scale strategy.
3. The PVT-based multi-scale polyp segmentation method according to claim 1, characterized in that, in S2, the method for performing multi-scale feature extraction on the preprocessed colonoscopy image with the PVTv2 backbone network comprises:
judging whether the preprocessed colonoscopy image input to the PVTv2 backbone network is a 3-channel image; if it is, sending it directly into the network for feature extraction; if it is not, adjusting the number of channels to 3 with a single 1×1 convolution;
performing four-stage feature extraction with a pre-trained PVTv2-B2 model.
4. The PVT-based multi-scale polyp segmentation method according to claim 1, characterized in that, in S6, the method for comparing the global prediction map and the multi-stage prediction maps with the ground-truth map and calculating the loss, the resulting last-stage prediction map being the final polyp segmentation prediction map, comprises:
applying a bilinear-interpolation spatial upsampling operation to the global prediction map and the multi-stage prediction maps, resizing all prediction maps to the size of the ground-truth map corresponding to the input image, and calculating the mixed loss of weighted BCE and weighted IoU;
the weighted BCE loss being defined as

$$L_{wBCE} = -\frac{\sum_{(x,y)}\omega(x,y)\left[G(x,y)\log P(x,y)+(1-G(x,y))\log\left(1-P(x,y)\right)\right]}{\sum_{(x,y)}\omega(x,y)},$$

where G denotes the ground-truth map, P denotes the prediction map, and (x, y) denotes any pixel position in the image; the corresponding weighting coefficient ω(x, y) represents the importance of pixel (x, y) and is defined as

$$\omega(x,y) = 1+\gamma\left|\frac{\sum_{(i,j)\in A_{(x,y)}}G(i,j)}{\left|A_{(x,y)}\right|}-G(x,y)\right|,$$

where A_{(x,y)} denotes a local neighborhood around pixel (x, y) and γ is set to 5;
the weighted IoU loss being defined as

$$L_{wIoU} = 1-\frac{\sum_{(x,y)}\omega(x,y)\,G(x,y)\,P(x,y)}{\sum_{(x,y)}\omega(x,y)\left[G(x,y)+P(x,y)-G(x,y)P(x,y)\right]};$$

and the mixed loss of the prediction map relative to the ground-truth map, combining the weighted BCE and weighted IoU losses, being

$$L_{seg} = L_{wBCE}+L_{wIoU}.$$
5. A PVT-based multi-scale polyp segmentation system, characterized by comprising a preprocessing module, a first feature extraction module, a fusion module, a second feature extraction module, a guiding module, and a prediction module;
the preprocessing module is used for acquiring a colonoscopy image to be detected and preprocessing it;
the first feature extraction module is used for performing multi-scale feature extraction on the preprocessed colonoscopy image with a PVTv2 backbone network to obtain original feature maps of different scales;
the fusion module is used for fusing the original feature maps step by step with a parallel Sobel edge decoder to obtain a global prediction map;
wherein the process of fusing the original feature maps step by step with the parallel Sobel edge decoder to obtain the global prediction map comprises:
compressing the feature-map channels with a 1×1 convolution in the first branch;
in the second branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×3 and 3×1 asymmetric convolutions and a 3×3 convolution with dilation rate 3;
in the third branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×5 and 5×1 asymmetric convolutions and a 3×3 convolution with dilation rate 5;
in the fourth branch, first compressing the feature-map channels with a 1×1 convolution, then extracting features once each with 1×7 and 7×1 asymmetric convolutions and a 3×3 convolution with dilation rate 7;
concatenating the compressed feature map of the first branch with the feature maps of the second, third, and fourth branches after feature extraction along the channel dimension, then compressing the channels with a 1×1 convolution;
adding the compressed concatenated feature map pixel by pixel to the original feature map whose channels were compressed by convolution, passing the sum through a ReLU nonlinear activation function, and then feeding it into the Sobel operation;
adding the feature maps gradient-sharpened by the Sobel operator pixel by pixel, and generating the initial global polyp segmentation prediction map through a 1×1 convolution and a bilinear-interpolation upsampling operation;
the second feature extraction module is used for performing multi-receptive-field feature extraction on the original feature maps with a multi-scale parallel dilated convolution attention module;
wherein the process of performing multi-receptive-field feature extraction on the original feature maps with the multi-scale parallel dilated convolution attention module comprises:
compressing the channels of the four layers of original feature maps from the PVT encoder with a 1×1 convolution to obtain multi-channel feature maps with a reduced number of channels relative to the original feature maps;
dividing the channels of the compressed feature map evenly into groups and sending them into four branches for processing, namely extracting features in the respective branches with 3×3 convolutions of dilation rates 1, 3, 5, and 7, and then concatenating the processing results of the four branches;
applying a 1×1 convolution to the channel-concatenated feature map, followed in turn by batch normalization (BN) and the nonlinear ReLU activation function, to obtain the processed feature map;
sending the processed feature map into a CBAM module for further attention weighting, obtaining a more discriminative feature map;
the guiding module is used for using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction, progressively generating multi-stage prediction maps;
wherein the process of using the global prediction map to guide, stage by stage, the original feature maps after multi-receptive-field feature extraction and progressively generate the multi-stage prediction maps comprises:
spatially downsampling the global prediction map so that its resolution matches that of the fourth-stage PVT feature map; sending it into the RA module for the reverse-attention operation to generate an attention map; multiplying this element by element with the fourth-stage PVT feature map; then, after feature dimensionality reduction by three 3×3 convolutions, adding the result pixel by pixel to the prediction map of the previous stage to generate the prediction map of the current stage;
sending the prediction map of the current stage into the next stage and performing the same operation as the previous stage, guiding the generation of the final-stage feature map;
the prediction module is used for comparing the global prediction map and the multi-stage prediction maps with the ground-truth map and calculating the loss; the resulting last-stage prediction map is the final polyp segmentation prediction map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311097260.0A CN117132774B (en) | 2023-08-29 | 2023-08-29 | Multi-scale polyp segmentation method and system based on PVT |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311097260.0A CN117132774B (en) | 2023-08-29 | 2023-08-29 | Multi-scale polyp segmentation method and system based on PVT |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117132774A CN117132774A (en) | 2023-11-28 |
CN117132774B true CN117132774B (en) | 2024-03-01 |
Family
ID=88859454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311097260.0A Active CN117132774B (en) | 2023-08-29 | 2023-08-29 | Multi-scale polyp segmentation method and system based on PVT |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117132774B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117338556B (en) * | 2023-12-06 | 2024-03-29 | 四川大学华西医院 | Gastrointestinal endoscopy pressing system |
CN117392157B (en) * | 2023-12-13 | 2024-03-19 | 长春理工大学 | Edge-aware protective cultivation straw coverage rate detection method |
CN117853432B (en) * | 2023-12-26 | 2024-08-16 | 北京长木谷医疗科技股份有限公司 | Hybrid model-based osteoarthropathy identification method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220138548A1 (en) * | 2022-01-18 | 2022-05-05 | Intel Corporation | Analog hardware implementation of activation functions |
- 2023-08-29: application CN202311097260.0A filed in China; granted as CN117132774B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114820635A (en) * | 2022-04-21 | 2022-07-29 | 重庆理工大学 | Polyp segmentation method combining attention U-shaped network and multi-scale feature fusion |
CN115331024A (en) * | 2022-08-22 | 2022-11-11 | 浙江工业大学 | Intestinal polyp detection method based on deep supervision and gradual learning |
CN115601330A (en) * | 2022-10-20 | 2023-01-13 | 湖北工业大学(Cn) | Colonic polyp segmentation method based on multi-scale space reverse attention mechanism |
CN115841495A (en) * | 2022-12-19 | 2023-03-24 | 安徽大学 | Polyp segmentation method and system based on double-boundary guiding attention exploration |
CN115965596A (en) * | 2022-12-26 | 2023-04-14 | 深圳英美达医疗技术有限公司 | Blood vessel identification method and device, electronic equipment and readable storage medium |
CN116630245A (en) * | 2023-05-05 | 2023-08-22 | 浙江工业大学 | Polyp segmentation method based on saliency map guidance and uncertainty semantic enhancement |
Non-Patent Citations (4)
Title |
---|
Object localization and edge refinement network for salient object detection; Zhaojian Yao et al.; Expert Systems with Applications; vol. 213; pp. 1-18 *
Polyp2Seg: Improved Polyp Segmentation with Vision Transformer; Vittorino Mandujano-Cornejo et al.; MIUA 2022: Medical Image Understanding and Analysis; vol. 3413; pp. 519-534 *
Research on pulmonary nodule detection in CT images based on convolutional neural networks; Fu Huanyu; China Masters' Theses Full-text Database (Medicine and Health Sciences), no. 06; E072-113 *
Research on medical auxiliary diagnosis methods based on retinal images; Lei Ying; China Masters' Theses Full-text Database (Medicine and Health Sciences), no. 11; E073-8 *
Also Published As
Publication number | Publication date |
---|---|
CN117132774A (en) | 2023-11-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||