CN115546466A - Weak supervision image target positioning method based on multi-scale significant feature fusion - Google Patents
Weak supervision image target positioning method based on multi-scale significant feature fusion
- Publication number
- CN115546466A (application number CN202211201019.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- pyramid
- layer
- network
- cam
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/245—Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
A weak supervision image target positioning method based on multi-scale salient feature fusion, belonging to the field of computer vision. To solve two problems, the laborious ROI annotation of small-target images and insufficient CAM activation, the invention focuses on optimizing the class activation maps output by a classification network under weak supervision. The invention fuses information at two levels: (1) the lowest-level feature map of a convolutional neural network carries weak semantic information but strong position information, so it is fused with the highest-level feature map to obtain the final feature map of the classification network; (2) because the classification network's sensitivity to ROIs of different scales differs, the resulting class activation maps also differ; fusing the complementary object information in these activation maps improves the localization of the target region in the image and generates more accurate pseudo labels for the segmentation task.
Description
Technical Field
The invention relates to a weak supervision image target positioning method based on multi-scale salient feature fusion, and belongs to the field of computer vision.
Background
Localization and segmentation of image regions of interest (ROIs) is a classic problem in computer vision, and ROI localization and segmentation on natural images has made great progress. However, for some non-natural images in specific fields (e.g., medical images, pollen grain images), the ROIs are smaller than in natural images, so localization and segmentation methods designed for natural images are not fully applicable to such images. Small-target localization and segmentation for such domain-specific images is therefore of great significance.
At present, the mainstream deep-learning-based small-target localization and segmentation methods fall into fully supervised learning and weakly supervised learning. Z. Ning et al. [1] use saliency maps of the foreground and background of breast ultrasound images to guide a main network and an auxiliary network to learn foreground and background salient representations respectively, and finally fuse the features of the two networks to enhance the morphology-learning ability of the segmentation network. However, fully supervised deep learning generally requires a large number of labeled data sets, and acquiring pixel-level labels of images is tedious and time-consuming; data sets with only category information are comparatively easy to obtain, so much work realizes target localization and segmentation using only image-level labels, i.e., weakly supervised methods. However, the class activation map (CAM) obtained from the classification network in weakly supervised learning covers only the most discriminative part of the image and cannot indicate the complete target area, i.e., the localization accuracy of the CAM is low (insufficient activation). To address this, Li Y et al. [2] first use prior knowledge of breast anatomy to constrain the classification network's search space for breast lesion tissue, and then modify the CAM with a level-set algorithm. But this ignores an important piece of information: for targets of different scales, the discriminative regions captured by the classification network are not consistent.
To solve the two problems of laborious ROI annotation of small-target images and insufficient CAM activation, the invention focuses on optimizing the class activation maps output by a classification network under weak supervision. The invention fuses information at two levels: (1) the lowest-level feature map of a convolutional neural network carries weak semantic information but strong position information, so it is fused with the highest-level feature map to obtain the final feature map of the classification network; (2) because the classification network's sensitivity to ROIs of different scales differs, the resulting class activation maps also differ; fusing the complementary object information in these activation maps improves the localization of the target region and generates more accurate pseudo labels for the segmentation task.
Reference documents:
[1] Z. Ning, S. Zhong, Q. Feng, W. Chen and Y. Zhang, "SMU-Net: Saliency-Guided Morphology-Aware U-Net for Breast Lesion Segmentation in Ultrasound Image," IEEE Transactions on Medical Imaging, vol. 41, no. 2, pp. 476-490, Feb. 2022, doi: 10.1109/TMI.2021.3116087.
[2] Li Y, Liu Y, Huang L, Wang Z, Luo J. Deep weakly-supervised breast tumor segmentation in ultrasound images with explicit anatomical constraints. Med Image Anal. 2022 Feb;76:102315. doi: 10.1016/j.media.2021.102315. Epub 2021 Nov 28. PMID: 34902792.
Disclosure of the Invention
Aiming at the problems that existing small-target image localization and segmentation based on fully supervised learning requires laborious annotation, and that single-scale weakly supervised approaches suffer from insufficient CAM activation, the invention designs a weakly supervised image target positioning method based on multi-scale salient feature fusion. Specifically, an image pyramid is constructed to obtain three images of different scales, from which multi-scale CAMs of the same image are obtained and fused; the fused CAM is then used as weak supervision information to train a segmentation network.
The weakly supervised image target positioning method based on multi-scale salient feature fusion comprises five stages. The first stage is image preprocessing, which mainly unifies the resolution of the images in the data set. The second stage is the construction of the image pyramid: the input image is taken as the source image, downsampling constructs the pyramid top layer, upsampling constructs the pyramid bottom layer, and the final number of pyramid layers is determined. The third stage is the acquisition of the classifier feature maps: a classifier is trained for the images of each pyramid layer, and for each classifier the highest-level feature map is spliced with the lowest-level feature map to obtain a fused feature map. The fourth stage is the fusion of the multi-scale CAMs: the multi-scale CAMs of the same image are obtained from the weighted sum of the feature maps of each layer, then all CAMs are aligned and finally fused into the final CAM of the source image. The fifth stage is the prediction of the target area: the fused CAM is converted into a pseudo binary label, the pseudo label is used to train a segmentation network, and the target area is finally predicted by the segmentation network.
The specific scheme of the invention is shown in figure 2.
Step 1: image pre-processing
The purpose of image preprocessing is to unify the size of all images in the data set. The data targeted by the invention are mainly small-target image data, such as public breast image data sets and pollen image data sets. If the image resolutions in the data set are not uniform, the sizes of the feature maps obtained by the subsequent classification network also differ, and the parameters of the fully connected layer in the classification network cannot adapt to feature maps of different sizes, so all input images must be fixed to a uniform size.
Step 2: image pyramid construction
In this step, the images in the data set are taken as source images, and three scale transformations of each input image are obtained by constructing a Gaussian pyramid. To obtain information that is both more global and finer-grained than the original image, the Gaussian pyramid constructed by the invention mixes downsampling and upsampling.
Step 2.1, image pyramid top layer construction: taking the input image as the source image, first apply Gaussian smoothing with a 5 × 5 Gaussian kernel, then downsample the smoothed image by removing the even rows and columns of the image matrix; the result is an image 1/4 the size of the input image, which serves as the image pyramid top layer.
Step 2.2, image pyramid bottom layer construction: taking the input image as the source image, first expand the image to twice its size in each direction, filling the newly added rows and columns with zeros; then multiply the 5 × 5 Gaussian kernel by 4 and convolve it with the enlarged image to obtain approximate values for the newly added pixels. The result is an image 4 times the size of the input image, which serves as the image pyramid bottom layer.
Step 2.3, determining the number of image pyramid layers: the images of the different pyramid layers are numbered starting from 0, with image resolution decreasing as the layer number increases. The image pyramid constructed by the invention has 3 layers; the original image is located in the middle layer, with layer number 1.
And step 3: classifier feature map acquisition
In the step, a classifier is respectively trained for three images with different scales in an image pyramid so as to obtain class activation maps with three different scales of the same image.
Step 3.1, training a classification network: the invention selects classical ResNet50 as a classification network for judging the category of the input image. Since there are three images of different scales in the image pyramid, it is finally necessary to train one classifier for each of the three image datasets of different scales.
Step 3.2, fusing the high-low layer characteristic diagrams: for each classification network, the superficial receptive field is small, and low-level geometric information such as texture, edge and the like is extracted; and the high-level receptive field is large, and more global and deeper semantic information is extracted. Therefore, the invention aligns and splices the highest layer features and the lowest layer features in each classification network, and prompts the network to enhance the low-level features of the small target object so as to obtain the final fusion feature map of the network.
And 4, step 4: multi-scale CAM fusion
The step obtains the CAMs of the three classification networks, and the CAMs are aligned and then fused to finally obtain a fused CAM image corresponding to the image.
Step 4.1, obtaining by the classification network CAM: and (3) multiplying the final fusion characteristic graph obtained in the step (3.2) by a weight matrix of a full connection layer in the classification network to obtain the CAM. Because the invention uses three classification networks, three CAMs with different scales are finally obtained for each source image to form a CAM pyramid.
Step 4.2 multiple CAM alignment: and aligning the CAMs with different scales based on the size of the source image so as to facilitate the subsequent fusion operation.
Step 4.3 multiple CAM fusions: for any pixel in the fused CAM, the invention adopts the following judgment mechanism: if the activation value of at least two independent CAMs at the point relative to a certain category is larger than or equal to the threshold value, the pixel point is considered to belong to the category. If the pixel point is not allocated to any category after passing through the judgment mechanism, ignoring the pixel point; and if the pixel point is allocated to a plurality of categories, allocating the pixel point to the category corresponding to the maximum average activation value of the three independent CAMs at the point.
Step 5: ROI prediction
In this step, the fused CAM obtained in step 4.3 is converted into pseudo labels, a localization/segmentation network for the image ROI is trained on the pseudo labels, and the ROI is finally predicted with that network.
Step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask used for training the segmentation network. The invention adopts the following rule: if a pixel in the fused CAM belongs to the non-target class, its value is set to 0, otherwise it is set to 1.
Step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the segmentation architecture selected by the method is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
Compared with the prior art, the invention has the beneficial effects that:
1. The weakly supervised image target positioning method based on multi-scale salient feature fusion avoids the pixel-level ROI annotation required under fully supervised learning, greatly reducing the data annotation workload.
2. By splicing the highest-level and lowest-level feature maps obtained by each classification network, the method strengthens the network's learning of the low-level features of small target objects, so that the network attends to more features of the small target.
3. By constructing a pyramid that mixes downsampling and upsampling of the source image, the method simultaneously obtains features that are more global and finer-grained than the original image, and fuses them to obtain a more complete CAM.
Drawings
FIG. 1 is a schematic diagram of an image pyramid constructed in the present invention.
FIG. 2 is an overall flow chart of the proposed method of the present invention.
Detailed Description
The following detailed description of an embodiment of the invention is made with reference to FIG. 2:
The weakly supervised image target positioning method based on multi-scale salient feature fusion comprises five stages. The first stage is image preprocessing, mainly unifying the resolution of the data set. The second stage is the construction of the image pyramid: the input image is taken as the source image, downsampling constructs the pyramid top layer, upsampling constructs the pyramid bottom layer, and the final number of pyramid layers is determined. The third stage is the acquisition of the classifier feature maps: a classifier is trained for the images of each pyramid layer, and for each classifier the highest-level feature map is spliced with the lowest-level feature map to obtain a fused feature map. The fourth stage is the fusion of the multi-scale CAMs: the multi-scale CAMs of the same image are obtained from the weighted sum of the feature maps of each layer, all CAMs are aligned, and they are finally fused into the final CAM of the source image. The fifth stage is the prediction of the ROI: the fused CAM is first converted into a pseudo binary label, the pseudo label is used to train a segmentation network, and the ROI is finally predicted by the segmentation network.
Specifically, the method comprises the following steps:
step 1: image pre-processing
The purpose of image preprocessing is to unify the size of all images in the data set. The data targeted by the invention are mainly small-target image data, such as public breast image data sets and pollen image data sets. If the image resolutions in the data set are not uniform, the feature maps produced by the last convolution layer of the subsequent classification network differ in size, while the parameter dimensions connecting the fully connected layer to the previous layer are fixed in advance; that is, the fully connected layer cannot adapt to different feature map sizes, so all input images must be fixed to one size. To minimize the loss of image information and facilitate the convolution operations in the subsequent classification networks, all images are set to 512 × 512.
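A minimal sketch of this preprocessing step is given below (an illustration, not part of the patent text); only the 512 × 512 target size comes from the description, and the interpolation method is an assumption.

```python
import cv2

def preprocess(image):
    """Resize an image (H x W [x C] numpy array) to the common 512 x 512 size."""
    # INTER_LINEAR is an assumed choice; the patent only fixes the output size.
    return cv2.resize(image, (512, 512), interpolation=cv2.INTER_LINEAR)
```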
Step 2: image pyramid construction
This step obtains three scale transformations of the input image by constructing a Gaussian pyramid. To obtain information that is both more global and finer-grained than the original image, the constructed Gaussian pyramid mixes downsampling and upsampling. Specifically, the construction comprises two parts: first, the width and height of the input original image are downsampled to 50% of the original through the Gaussian pyramid, yielding a 256 × 256 image as the pyramid top layer; second, the width and height of the input original image are upsampled to 200% of the original, yielding a 1024 × 1024 image as the pyramid bottom layer.
Step 2.1, image pyramid top layer construction: for a given 512 × 512 original image, downsampling constructs the top layer of the Gaussian pyramid from an image 1/4 the size of the original, i.e. with resolution 256 × 256. The process is given by formula (1): first, Gaussian smoothing is applied once to the 512 × 512 original; unlike simple averaging, Gaussian smoothing assigns higher weights to pixels closer to the center point when computing the weighted average of surrounding pixels. The smoothed image is then downsampled by removing the even rows and columns of the image matrix, giving a 256 × 256 image.
G_l(x, y) = Σ_{m=-2..2} Σ_{n=-2..2} W(m, n) · G_{l-1}(2x + m, 2y + n),  1 ≤ l ≤ L, 0 ≤ x ≤ R_l, 0 ≤ y ≤ C_l    (1)
where G_l is the image of the l-th layer of the Gaussian pyramid (layer numbering starts from 0), L is the layer number of the pyramid top, R_l and C_l are the numbers of rows and columns of the l-th layer image, and W(m, n) is the value in the m-th row and n-th column of the Gaussian filter template, whose size is generally 5 × 5. The two-dimensional, separable 5 × 5 Gaussian kernel widely used in the unsharp masking algorithm is selected to smooth the original image; its values are given in (2).
W = (1/256) · [1 4 6 4 1; 4 16 24 16 4; 6 24 36 24 6; 4 16 24 16 4; 1 4 6 4 1]    (2)
Step 2.2, image pyramid bottom layer construction: for a given 512 × 512 original image, upsampling constructs the lowest layer of the Gaussian pyramid from an image 4 times the size of the original, i.e. with resolution 1024 × 1024. The process is: first, the image is expanded to twice its size in each direction, with the newly added rows and columns filled with zeros; then the Gaussian kernel used for downsampling is multiplied by 4 and convolved with the enlarged image to obtain approximate values for the newly added pixels, finally giving a 1024 × 1024 image.
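The two constructions above can be sketched, for illustration only, with OpenCV's pyrDown/pyrUp, which apply a 5 × 5 Gaussian kernel together with even row/column removal (downsampling) or zero insertion followed by convolution with the 4× kernel (upsampling); mapping them directly onto steps 2.1 and 2.2 is our assumption.

```python
import cv2

def build_pyramid(image_512):
    """Return {layer_number: image} for the 3-layer mixed pyramid of step 2.

    image_512 is assumed to be the 512 x 512 source image from step 1.
    Layer 0 = 1024 x 1024 (bottom), layer 1 = 512 x 512 (source image),
    layer 2 = 256 x 256 (top), matching the numbering of step 2.3.
    """
    top = cv2.pyrDown(image_512)     # Gaussian smoothing + even row/column removal
    bottom = cv2.pyrUp(image_512)    # zero insertion + convolution with the 4x kernel
    return {0: bottom, 1: image_512, 2: top}
```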
Step 2.3, determining the number of image pyramid layers: after the image pyramid is constructed, the layer number l in the Gaussian pyramid corresponding to an image of resolution w × h is determined by formula (3).
l = l_0 + log₂(512 / w)    (3)
where l_0 is the layer number of the 512 × 512 original image in the image pyramid; since the three scales of images in the Gaussian pyramid are 1024, 512 and 256, the layer number corresponding to the original image is l_0 = 1. From formula (3), a 1024 × 1024 image corresponds to layer 0, i.e. the lowest layer of the Gaussian pyramid, and a 256 × 256 image corresponds to layer 2, i.e. the topmost layer of the Gaussian pyramid.
And step 3: classifier feature map acquisition
In the step, a classifier is respectively trained for three images with different scales in an image pyramid so as to obtain class activation maps with three different scales of the same image.
Step 3.1, training a classification network: the classification network is used for judging the class of the input image, the classification network selected by the invention is ResNet50, the ResNet50 network comprises 49 convolution layers and 1 full-connection layer, each residual block has three layers of convolution, and the residual structure of the network can directly connect the input to the subsequent network layer so as to avoid information loss. The invention respectively trains a classifier for images with three resolutions of 256 × 256, 512 × 512 and 1024 × 1024 in an image pyramid, and the classifier is marked as R 1 、R 2 、R 3 。
Step 3.2, fusion of high-level and low-level feature maps: in each classification network, the shallow feature maps have small receptive fields and extract local, general features such as image texture and edges, i.e. low-level geometric information; as the network deepens, the receptive fields of the high-level feature maps grow and deeper, more global features are extracted, i.e. high-level semantic information. The invention therefore splices the highest-level and lowest-level features of each classification network into the final feature map output by the network, so that the network enhances the low-level features of small target objects. For classifier R_1, let the highest-level feature map obtained by its network be F_h^1 and the lowest-level feature map be F_l^1; the final feature map f_1 of classifier R_1 is then obtained from equation (4).
f_1 = UP(F_h^1) ⊕ F_l^1    (4)
where UP is the upsampling operation, i.e. the highest-level feature map is upsampled to the same size as the lowest-level feature map to facilitate subsequent processing, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition. In the same way, the final feature maps f_2 and f_3 of classifiers R_2 and R_3 are obtained from equations (5) and (6).
f_2 = UP(F_h^2) ⊕ F_l^2    (5)
f_3 = UP(F_h^3) ⊕ F_l^3    (6)
where F_h^2 and F_l^2 are the highest-level and lowest-level feature maps obtained by classifier R_2, F_h^3 and F_l^3 are those obtained by classifier R_3, UP is the upsampling operation, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition.
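As an illustration of step 3.2 (one possible reading, not the patent's exact network), the sketch below wires a ResNet50 so that the highest-level feature map (layer4) is upsampled and added element by element to the lowest-level map (layer1); the 1 × 1 convolution used to match channel counts is an added assumption, since the patent only specifies a lateral, element-wise connection.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FusionClassifier(nn.Module):
    """ResNet50 classifier returning logits and the fused feature map f_k."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet50()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.lateral = nn.Conv2d(256, 2048, kernel_size=1)   # assumed channel matching
        self.fc = nn.Linear(2048, num_classes)                # supplies the CAM weights

    def forward(self, x):
        low = self.layer1(self.stem(x))                       # lowest-level feature map
        high = self.layer4(self.layer3(self.layer2(low)))     # highest-level feature map
        high_up = F.interpolate(high, size=low.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = high_up + self.lateral(low)                   # element-by-element addition
        logits = self.fc(F.adaptive_avg_pool2d(fused, 1).flatten(1))
        return logits, fused                                  # fused map feeds step 4.1

# One such classifier would be trained per pyramid scale (R_1, R_2, R_3).
```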
Step 4: Multi-scale CAM fusion
This step acquires the CAMs of the three classification networks, aligns them and fuses them, finally outputting the fused CAM corresponding to the image.
Step 4.1, CAM acquisition from the classification networks: for classifier R_1, the activation value M_1^c(x, y) of a spatial pixel u(x, y) of the 256 × 256 image with respect to class c is obtained from equation (7).
M_1^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^1(x, y)    (7)
where i is the channel index of the last convolution layer of the classification network, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^1(x, y) is the value at position (x, y) of channel i of the final fused feature map of classifier R_1. Likewise, the activation values M_2^c(x, y) and M_3^c(x, y) of a pixel u(x, y) with respect to class c for classifiers R_2 and R_3 are obtained from equations (8) and (9), respectively.
M_2^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^2(x, y)    (8)
M_3^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^3(x, y)    (9)
where i is the channel index of the last convolution layer, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^2(x, y) and f_i^3(x, y) are the values at position (x, y) of channel i of the final fused feature maps of classifiers R_2 and R_3.
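For illustration, equations (7)-(9) amount to a channel-wise weighted sum of the fused feature map, which can be sketched as below; `model` is assumed to be the FusionClassifier sketched above, and the ReLU plus min-max normalization are common practice rather than something stated in the patent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def compute_cam(model, image, target_class):
    """M_c(x, y) = sum_i w_i^c * f_i(x, y) for a single (C, H, W) image tensor."""
    _, fused = model(image.unsqueeze(0))          # fused feature map: (1, K, H, W)
    weights = model.fc.weight[target_class]       # (K,) weights w_i^c for class c
    cam = torch.einsum("k,khw->hw", weights, fused.squeeze(0))
    cam = F.relu(cam)                             # keep positive evidence (assumption)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```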
Step 4.2, multi-CAM alignment: since the inputs of classifiers R_1, R_2 and R_3 are the three layers of the image pyramid, the sizes of the three obtained class activation maps also form an activation-map pyramid. To fuse the three CAMs of different scales they must be aligned; the invention resizes all CAMs to the size of the original input image, i.e. 512 × 512.
Step 4.3, multi-CAM fusion: the three aligned CAMs are fused into the final CAM. For a pixel u(x, y) of the fused class activation map M_agg, the fusion mechanism of the invention is as follows: if at least two of the independent activation maps have an activation value for class c at this point that is greater than or equal to the threshold θ (θ ∈ [0.5, 0.7]), the pixel in M_agg is considered to belong to class c. If the pixel is not assigned to any class by this mechanism, it is ignored; if it is assigned to several classes, the class cla(x, y) to which it finally belongs is determined by formula (10).
cla(x, y) = index( max( [ (1/P) Σ_{j=1}^{P} M_j^0(x, y), (1/P) Σ_{j=1}^{P} M_j^1(x, y), …, (1/P) Σ_{j=1}^{P} M_j^N(x, y) ] ) )    (10)
where j is the pyramid layer index, P is the total number of pyramid layers (here P = 3), N is the number of classes into which the data set is divided (excluding the background class), and M_j^c(x, y) is the activation value of pixel u(x, y) with respect to class c in the feature map obtained from the j-th pyramid layer. (1/P) Σ_j M_j^0(x, y) is the average activation value of pixel u(x, y) over the P feature maps for the background class (class number 0), (1/P) Σ_j M_j^c(x, y) is its average activation value for class c, and (1/P) Σ_j M_j^N(x, y) is its average activation value for class N. index is the index-taking operation, i.e. the index of the maximum value in the array, which is also the class to which the pixel belongs; for example, if the 0-th average activation value in the array is the largest, the returned index is 0, meaning the pixel belongs to class 0.
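One possible reading of steps 4.2-4.3 is sketched below: the per-class CAMs of each scale are resized to 512 × 512, a pixel is assigned class c only if at least two of the three CAMs reach the threshold θ for c, and ties between several supported classes are broken by the largest average activation, approximating formula (10). The array layout and the fallback of unsupported pixels to background are our assumptions.

```python
import numpy as np
import cv2

def fuse_cams(cams, theta=0.5, size=(512, 512)):
    """cams: list of P arrays, each (N+1, h, w) with per-class activations in
    [0, 1] (class 0 = background). Returns a (512, 512) integer label map."""
    aligned = np.stack([
        np.stack([cv2.resize(c.astype(np.float32), size) for c in level])
        for level in cams
    ])                                              # (P, N+1, 512, 512)
    votes = (aligned >= theta).sum(axis=0)          # per-class votes per pixel
    mean_act = aligned.mean(axis=0)                 # average activation per class
    supported = votes >= 2                          # classes backed by >= 2 CAMs
    masked = np.where(supported, mean_act, -np.inf) # drop unsupported classes
    label = masked.argmax(axis=0)                   # largest mean activation wins
    label[~supported.any(axis=0)] = 0               # no supported class -> background
    return label
```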
Step 5: ROI prediction
In this step, the fused CAM obtained in step 4.3 is converted into pseudo labels, a segmentation network is trained on the pseudo labels, and the ROI is finally predicted with that network.
Step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask M̂ used for training the segmentation network. The value of pixel u(x, y) in the pseudo binary mask M̂ is determined by formula (11).
M̂(x, y) = 0 if cla(x, y) = 0, and M̂(x, y) = 1 otherwise    (11)
where cla(x, y) is the class to which pixel u(x, y) belongs, and cla(x, y) = 0 indicates that the pixel belongs to the non-target class.
Step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the segmentation architecture selected by the method is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
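A minimal sketch of steps 5.1-5.2 follows (assumptions, not the patent's exact training recipe): the fused label map is binarized via formula (11) and used to supervise a U-Net; `unet` stands for any U-Net implementation returning one logit map, and the loss choice is illustrative.

```python
import torch
import torch.nn.functional as F

def to_pseudo_mask(label_map):
    """Formula (11): 0 for the non-target (background) class, 1 otherwise."""
    return (torch.as_tensor(label_map) != 0).float()

def train_step(unet, optimizer, image, label_map):
    mask = to_pseudo_mask(label_map)[None, None]     # (1, 1, H, W) pseudo binary label
    pred = unet(image.unsqueeze(0))                  # (1, 1, H, W) predicted logits
    loss = F.binary_cross_entropy_with_logits(pred, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```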
The method is mainly intended for small-target image data, such as medical images with small focal regions and pollen image data. By fusing the multi-scale salient features obtained from the image pyramid, it strengthens the position and contour information of the small target object and thereby improves the performance of small-target localization and segmentation under weak supervision. The steps described in the specific embodiments may be modified without departing from the basic spirit of the invention. The present embodiments are therefore to be considered illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description.
Claims (2)
1. A weakly supervised image target positioning method based on multi-scale salient feature fusion, characterized by comprising the following steps:
step 1: image pre-processing
The purpose of image pre-processing is to unify the size of all images within a data set;
step 2: image pyramid construction
Taking the images in the data set as source images, three scale transformations of the input image are obtained by constructing a Gaussian pyramid; to obtain information more global and finer-grained than the original image, the constructed Gaussian pyramid adopts a structure mixing downsampling and upsampling;
step 2.1, image pyramid top layer construction: taking the input image as the source image, first apply Gaussian smoothing with a 5 × 5 Gaussian kernel, then downsample the smoothed image by removing the even rows and columns of the image matrix, finally obtaining an image 1/4 the size of the input image, which is taken as the image pyramid top layer;
step 2.2, image pyramid bottom layer construction: taking the input image as the source image, first expand the image to twice its size in each direction, filling the newly added rows and columns with zeros; then multiply the 5 × 5 Gaussian kernel by 4 and convolve it with the enlarged image to obtain approximate values for the newly added pixels; finally an image 4 times the size of the input image is obtained and taken as the image pyramid bottom layer;
step 2.3, determining the number of image pyramid layers: the images of the different pyramid layers are numbered starting from 0, with image resolution decreasing as the layer number increases; the constructed image pyramid has 3 layers, and the original image is located in the middle layer, with layer number 1;
and step 3: classifier feature map acquisition
Respectively training a classifier aiming at three images with different scales in an image pyramid to obtain class activation graphs of the same image with three different scales;
step 3.1, training a classification network: selecting classical ResNet50 as a classification network for judging the class of the input image; because three images with different scales exist in the image pyramid, a classifier is required to be trained for three image data sets with different scales respectively;
step 3.2, fusing the high-low layer characteristic diagrams:
aligning and splicing the highest layer features and the lowest layer features in each classification network to promote the network to enhance the low-level features of the small target object so as to obtain a final fusion feature map of the network;
Step 4: Multi-scale CAM fusion
The CAMs of the three classification networks are acquired, aligned and then fused, finally obtaining the fused CAM corresponding to the image;
step 4.1, CAM acquisition from the classification networks: the CAM is obtained by multiplying the final fused feature map obtained in step 3.2 by the weight matrix of the fully connected layer of the classification network; since three classification networks are used, three CAMs of different scales are finally obtained for each source image, forming a CAM pyramid;
step 4.2, multi-CAM alignment: the CAMs of different scales are aligned based on the size of the source image to facilitate the subsequent fusion;
step 4.3, multi-CAM fusion: for any pixel in the fused CAM, the following decision mechanism is adopted: if the activation values of at least two of the independent CAMs at that point for a certain category are greater than or equal to the threshold, the pixel is considered to belong to that category; if the pixel is not assigned to any category by this mechanism, it is ignored; if the pixel is assigned to several categories, it is assigned to the category with the largest average activation value of the three independent CAMs at that point;
Step 5: ROI prediction
first the fused CAM obtained in step 4.3 is converted into pseudo labels, then a localization/segmentation network for the image ROI is trained on the pseudo labels, and finally the ROI is predicted with that network;
step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask for training the segmentation network; the following rule is adopted: if a pixel in the fused CAM belongs to the non-target class, its value is set to 0, otherwise it is set to 1;
step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the selected segmentation architecture is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
2. The weakly supervised image target positioning method based on multi-scale salient feature fusion according to claim 1, characterized in that:
step 1: image pre-processing
The purpose of image pre-processing is to unify the size of all images in the dataset; all images were sized 512 x 512;
step 2: image pyramid construction
The construction comprises two parts: first, the width and height of the input original image are downsampled to 50% of the original through the Gaussian pyramid, yielding a 256 × 256 image as the pyramid top layer; second, the width and height of the input original image are upsampled to 200% of the original, yielding a 1024 × 1024 image as the pyramid bottom layer; the specific steps are as follows:
step 2.1, image pyramid top layer construction:
for a given 512 × 512 original image, downsampling constructs the top layer of the Gaussian pyramid from an image 1/4 the size of the original, with corresponding resolution 256 × 256; the process is given by formula (1): first, Gaussian smoothing is applied once to the 512 × 512 original; unlike simple averaging, Gaussian smoothing assigns higher weights to pixels closer to the center point when computing the weighted average of surrounding pixels; the smoothed image is then downsampled by removing the even rows and columns of the image matrix, giving a 256 × 256 image;
G_l(x, y) = Σ_{m=-2..2} Σ_{n=-2..2} W(m, n) · G_{l-1}(2x + m, 2y + n),  1 ≤ l ≤ L, 0 ≤ x ≤ R_l, 0 ≤ y ≤ C_l    (1)
where G_l is the image of the l-th layer of the Gaussian pyramid (layer numbering starts from 0), L is the layer number of the pyramid top, R_l and C_l are the numbers of rows and columns of the l-th layer image, and W(m, n) is the value in the m-th row and n-th column of the Gaussian filter template, whose size is generally 5 × 5; the two-dimensional, separable 5 × 5 Gaussian kernel widely used in the unsharp masking algorithm is selected to smooth the original image, with values as shown in (2);
W = (1/256) · [1 4 6 4 1; 4 16 24 16 4; 6 24 36 24 6; 4 16 24 16 4; 1 4 6 4 1]    (2)
step 2.2, image pyramid bottom layer construction:
for a given 512 × 512 original image, upsampling constructs the lowest layer of the Gaussian pyramid from an image 4 times the size of the original, with corresponding resolution 1024 × 1024; the process is: first, the image is expanded to twice its size in each direction, with the newly added rows and columns filled with zeros; then the Gaussian kernel used for downsampling is multiplied by 4 and convolved with the enlarged image to obtain approximate values for the newly added pixels, finally giving a 1024 × 1024 image;
step 2.3, determining the pyramid layer number of the image:
after the image pyramid is constructed, the layer number l in the Gaussian pyramid corresponding to an image of resolution w × h is determined by formula (3);
l = l_0 + log₂(512 / w)    (3)
where l_0 is the layer number of the 512 × 512 original image in the image pyramid; since the three scales of images in the Gaussian pyramid are 1024, 512 and 256, the layer number corresponding to the original image is l_0 = 1; from formula (3), a 1024 × 1024 image corresponds to layer 0, i.e. the lowest layer of the Gaussian pyramid, and a 256 × 256 image corresponds to layer 2, i.e. the topmost layer of the Gaussian pyramid;
and step 3: classifier feature map acquisition
Respectively training a classifier aiming at three images with different scales in an image pyramid to obtain class activation graphs of the same image with three different scales, wherein the steps are as follows;
step 3.1, training a classification network: the classification network is used for judging the class of the input image, the selected classification network is ResNet50, the ResNet50 network comprises 49 convolution layers and 1 full-connection layer, and each residual block has three layers of convolution; respectively training a classifier for images with three resolutions of 256 × 256, 512 × 512 and 1024 × 1024 in the image pyramid, and marking as R 1 、R 2 、R 3 ;
step 3.2, fusion of high-level and low-level feature maps:
the highest-level and lowest-level features of each classification network are spliced into the final feature map output by the network, so that the network enhances the low-level features of small target objects; for classifier R_1, let the highest-level feature map obtained by its network be F_h^1 and the lowest-level feature map be F_l^1; the final feature map f_1 of classifier R_1 is then obtained from formula (4);
f_1 = UP(F_h^1) ⊕ F_l^1    (4)
where UP is the upsampling operation, i.e. the highest-level feature map is upsampled to the same size as the lowest-level feature map to facilitate subsequent processing, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition; in the same way, the final feature maps f_2 and f_3 of classifiers R_2 and R_3 are obtained from formulas (5) and (6);
f_2 = UP(F_h^2) ⊕ F_l^2    (5)
f_3 = UP(F_h^3) ⊕ F_l^3    (6)
where F_h^2 and F_l^2 are the highest-level and lowest-level feature maps obtained by classifier R_2, F_h^3 and F_l^3 are those obtained by classifier R_3, UP is the upsampling operation, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition;
Step 4: Multi-scale CAM fusion
The CAMs of the three classification networks are acquired, aligned and fused, the final output being the fused CAM corresponding to the image, specifically as follows:
step 4.1, CAM acquisition from the classification networks: for classifier R_1, the activation value M_1^c(x, y) of a spatial pixel u(x, y) of the 256 × 256 image with respect to class c is obtained from formula (7);
M_1^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^1(x, y)    (7)
where i is the channel index of the last convolution layer of the classification network, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^1(x, y) is the value at position (x, y) of channel i of the final fused feature map of classifier R_1; likewise, the activation values M_2^c(x, y) and M_3^c(x, y) of a pixel u(x, y) with respect to class c for classifiers R_2 and R_3 are obtained from formulas (8) and (9), respectively;
M_2^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^2(x, y)    (8)
M_3^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^3(x, y)    (9)
where i is the channel index of the last convolution layer, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^2(x, y) and f_i^3(x, y) are the values at position (x, y) of channel i of the final fused feature maps of classifiers R_2 and R_3;
step 4.2, multi-CAM alignment: since the inputs of classifiers R_1, R_2 and R_3 are the three layers of the image pyramid, the sizes of the three obtained class activation maps also form an activation-map pyramid; to fuse the three CAMs of different scales they must be aligned, and all CAMs are set to the size of the original input image, i.e. 512 × 512;
step 4.3, multi-CAM fusion: the three aligned CAMs are fused into the final CAM; for a pixel u(x, y) of the fused class activation map M_agg, the fusion mechanism is as follows: if at least two of the independent activation maps have an activation value for class c at this point that is greater than or equal to the threshold θ, θ ∈ [0.5, 0.7], the pixel in M_agg is considered to belong to class c; if the pixel is not assigned to any class by this mechanism, it is ignored; if it is assigned to several classes, the class cla(x, y) to which it finally belongs is determined by formula (10);
cla(x, y) = index( max( [ (1/P) Σ_{j=1}^{P} M_j^0(x, y), (1/P) Σ_{j=1}^{P} M_j^1(x, y), …, (1/P) Σ_{j=1}^{P} M_j^N(x, y) ] ) )    (10)
where j is the pyramid layer index, P is the total number of pyramid layers (here P = 3), N is the number of classes into which the data set is divided (excluding the background class), and M_j^c(x, y) is the activation value of pixel u(x, y) with respect to class c in the feature map obtained from the j-th pyramid layer; (1/P) Σ_j M_j^0(x, y) is the average activation value of pixel u(x, y) over the P feature maps for the background class (numbered 0), (1/P) Σ_j M_j^c(x, y) is its average activation value for class c, and (1/P) Σ_j M_j^N(x, y) is its average activation value for class N; index is the index-taking operation, i.e. the index of the maximum value in the array, which is also the class to which the pixel belongs; for example, if the 0-th average activation value in the array is the largest, the returned index is 0, meaning the pixel belongs to class 0;
Step 5: ROI prediction
first the fused CAM obtained in step 4.3 is converted into pseudo labels, a segmentation network is trained on the pseudo labels, and finally the ROI is predicted with that network; the specific steps are as follows:
step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask M̂ for training the segmentation network; the value of pixel u(x, y) in the pseudo binary mask M̂ is determined by formula (11);
M̂(x, y) = 0 if cla(x, y) = 0, and M̂(x, y) = 1 otherwise    (11)
where cla(x, y) is the class to which pixel u(x, y) belongs, and cla(x, y) = 0 indicates that the pixel belongs to the non-target class;
step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the selected segmentation architecture is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211201019.3A CN115546466A (en) | 2022-09-28 | 2022-09-28 | Weak supervision image target positioning method based on multi-scale significant feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211201019.3A CN115546466A (en) | 2022-09-28 | 2022-09-28 | Weak supervision image target positioning method based on multi-scale significant feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115546466A true CN115546466A (en) | 2022-12-30 |
Family
ID=84731704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211201019.3A Pending CN115546466A (en) | 2022-09-28 | 2022-09-28 | Weak supervision image target positioning method based on multi-scale significant feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115546466A (en) |
-
2022
- 2022-09-28 CN CN202211201019.3A patent/CN115546466A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665095A (en) * | 2023-05-18 | 2023-08-29 | 中国科学院空间应用工程与技术中心 | Method and system for detecting motion ship, storage medium and electronic equipment |
CN116665095B (en) * | 2023-05-18 | 2023-12-22 | 中国科学院空间应用工程与技术中心 | Method and system for detecting motion ship, storage medium and electronic equipment |
CN117079103A (en) * | 2023-10-16 | 2023-11-17 | 暨南大学 | Pseudo tag generation method and system for neural network training |
CN117079103B (en) * | 2023-10-16 | 2024-01-02 | 暨南大学 | Pseudo tag generation method and system for neural network training |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111784671B (en) | Pathological image focus region detection method based on multi-scale deep learning | |
CN114120102A (en) | Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium | |
CN112308860A (en) | Earth observation image semantic segmentation method based on self-supervision learning | |
US8351676B2 (en) | Digital image analysis using multi-step analysis | |
CN109685801B (en) | Skin mirror image processing method combining texture features and deep neural network information | |
CN115546466A (en) | Weak supervision image target positioning method based on multi-scale significant feature fusion | |
CN112036231B (en) | Vehicle-mounted video-based lane line and pavement indication mark detection and identification method | |
CN113034505A (en) | Glandular cell image segmentation method and device based on edge perception network | |
CN114092439A (en) | Multi-organ instance segmentation method and system | |
CN114332572B (en) | Method for extracting breast lesion ultrasonic image multi-scale fusion characteristic parameters based on saliency map-guided hierarchical dense characteristic fusion network | |
CN116645592B (en) | Crack detection method based on image processing and storage medium | |
CN114048822A (en) | Attention mechanism feature fusion segmentation method for image | |
CN110648331A (en) | Detection method for medical image segmentation, medical image segmentation method and device | |
CN112348059A (en) | Deep learning-based method and system for classifying multiple dyeing pathological images | |
CN112686902A (en) | Two-stage calculation method for brain glioma identification and segmentation in nuclear magnetic resonance image | |
CN116883650A (en) | Image-level weak supervision semantic segmentation method based on attention and local stitching | |
CN117635628B (en) | Sea-land segmentation method based on context attention and boundary perception guidance | |
CN116630971A (en) | Wheat scab spore segmentation method based on CRF_Resunate++ network | |
CN118230166A (en) | Corn canopy organ identification method and canopy phenotype detection method based on improved Mask2YOLO network | |
CN114494786A (en) | Fine-grained image classification method based on multilayer coordination convolutional neural network | |
CN117291935A (en) | Head and neck tumor focus area image segmentation method and computer readable medium | |
Cogan et al. | Deep understanding of breast density classification | |
Arefin et al. | Deep learning approach for detecting and localizing brain tumor from magnetic resonance imaging images | |
CN114862883A (en) | Target edge extraction method, image segmentation method and system | |
Shahzad et al. | Semantic segmentation of anaemic RBCs using multilevel deep convolutional encoder-decoder network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |