CN115082675B - Transparent object image segmentation method and system - Google Patents
- Publication number
- CN115082675B (application CN202210633162.3A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- feature
- resolution
- convolution
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4007—Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a transparent object image segmentation method comprising the following steps. S1: establish a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch to obtain a high-resolution feature map and a multi-scale fused low-resolution feature map. S2: use a differential boundary attention module to perform differential convolution and spatial attention operations on the feature maps of different scales extracted in step S1, and extract and fuse multi-scale edge feature maps. S3: use a region attention module to model the category-level context of the high-resolution feature map and the multi-scale fused low-resolution feature map to obtain a pixel-region enhanced feature map; fuse the high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map, and obtain the final transparent object segmentation result after feature dimension reduction. The method effectively alleviates the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a transparent object image segmentation method and a transparent object image segmentation system.
Background
Image semantic segmentation is one of the key technologies that enable intelligent systems to understand natural scenes. However, for targets such as transparent objects, which are ubiquitous in the real world, conventional general-purpose image segmentation methods often fail to obtain satisfactory results. The main problems are as follows: transparent objects are easily affected by environmental factors, making it difficult to extract robust features; transparent objects are easily occluded, leading to incomplete semantic information; and the edges of transparent objects are segmented inaccurately. These problems ultimately degrade the segmentation of transparent objects.
Disclosure of Invention
The invention aims to overcome the above shortcomings of the prior art and provides a transparent object image segmentation method that continuously enhances the features of pixels belonging to the same object by incorporating high-resolution features and a self-attention mechanism, effectively alleviating the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
In order to solve the technical problems, the invention adopts the following technical scheme:
a transparent object image segmentation method comprising the steps of:
S1: establishing a dual-resolution feature extraction module comprising a high-resolution branch and a low-resolution branch, inputting an input image into the dual-resolution feature extraction module, and maintaining accurate spatial position information by the high-resolution branch through connecting parallel different-resolution feature images and repeatedly performing multi-scale cross fusion to obtain a high-resolution feature image with the size of 1/8 original image; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map to obtain a low-resolution feature map with the size of 1/64 original map; a depth pyramid pooling module is added at the tail end of the low-resolution branch and is used for expanding an effective receptive field and fusing multi-scale context information to obtain a multi-scale fused low-resolution feature map;
S2: the differential boundary attention module is utilized to respectively carry out differential convolution and spatial attention operation on the feature graphs with different dimensions extracted in the S1, the edge feature graphs with multiple dimensions are extracted and fused, an edge prediction image of a transparent object is obtained after feature dimension reduction, an edge loss function L1 is calculated as a part of the total loss function L, and the model parameters are optimized by participating in network weight updating in the gradient descent process; wherein L1 adopts a cross entropy loss function, p i is the prediction result of the pixel i at the boundary, and y i is the actual result of the pixel i at the boundary; the calculation formula is as follows:
S3: carrying out category-level context relation modeling on the high-resolution feature map obtained in the step S1 and the multi-scale fused low-resolution feature map by utilizing a region attention module, enhancing the features of pixels from the same object, and obtaining a pixel-region enhanced feature map; and fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-area enhancement feature map, obtaining a final transparent object segmentation result after feature dimension reduction, and calculating a loss function L2 of the transparent object as another part of the total loss function L. Wherein L2 and L1 are the same as each other and are the cross entropy loss functions, and the total loss function L is the sum of L1 and L2.
Further, the dual resolution feature extraction network is composed of six levels of conv1, conv2, conv3_x, conv4_x, conv5_x, DPPM, where x=1 or 2, x=1 represents a high resolution branch, and x=2 represents a low resolution branch;
conv1 comprises a convolution layer with a step size of 2 and a convolution kernel of 3*3, a BatchNorm layer and a ReLU layer, conv1 layer being used to change the dimension of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and is used to obtain a feature map feature2 at 1/8 of the original image size;
at conv3_x the network splits into two parallel branches of high and low resolution, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain a high-resolution branch feature map feature3_1 at 1/8 of the original image size, and conv3_2 downsamples the output of conv2 to obtain a low-resolution branch feature map feature3_2 at 1/16 of the original image size;
conv4_x is divided into two parallel branches of high and low resolution, conv4_1 and conv4_2; conv4_1 continuously merges low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size, and conv4_2 is used to obtain a low-resolution branch feature map at 1/32 of the original image size;
conv5_x is divided into two parallel branches of high and low resolution, conv5_1 and conv5_2; conv5_1 continuously merges low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size, and conv5_2 is used to obtain a low-resolution branch feature map at 1/64 of the original image size;
DPPM is used to expand receptive fields and fuse multi-scale context information.
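As a minimal PyTorch sketch of this six-level layout, the following uses plain conv-BN-ReLU stacks in place of the residual blocks; the channel widths are illustrative assumptions, and the cross fusion at conv4_x/conv5_x and the DPPM are omitted here because they are sketched after the following paragraphs.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k=3, s=1):
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DualResolutionBackbone(nn.Module):
    """Skeleton of conv1..conv5_x with a parallel high-resolution (x=1) and
    low-resolution (x=2) branch; only the output scales are meant to be faithful."""
    def __init__(self, c=32):
        super().__init__()
        self.conv1 = conv_bn_relu(3, c, s=2)                              # 1/2, changes input dimensions
        self.conv2 = nn.Sequential(conv_bn_relu(c, c, s=2),
                                   conv_bn_relu(c, c, s=2))               # feature2: 1/8
        self.conv3_1 = conv_bn_relu(c, c)                                 # high-res branch, stays at 1/8
        self.conv3_2 = conv_bn_relu(c, 2 * c, s=2)                        # low-res branch: 1/16
        self.conv4_1, self.conv4_2 = conv_bn_relu(c, c), conv_bn_relu(2 * c, 4 * c, s=2)  # 1/8, 1/32
        self.conv5_1, self.conv5_2 = conv_bn_relu(c, c), conv_bn_relu(4 * c, 8 * c, s=2)  # 1/8, 1/64

    def forward(self, x):
        feature2 = self.conv2(self.conv1(x))
        feature3_1, feature3_2 = self.conv3_1(feature2), self.conv3_2(feature2)
        feature4_1, feature4_2 = self.conv4_1(feature3_1), self.conv4_2(feature3_2)
        feature5_1, feature5_2 = self.conv5_1(feature4_1), self.conv5_2(feature4_2)
        return feature2, feature3_2, feature4_2, feature5_2, feature5_1

outs = DualResolutionBackbone()(torch.randn(1, 3, 512, 512))
print([o.shape for o in outs])   # 1/8, 1/16, 1/32, 1/64 maps plus the 1/8 high-res map
```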
Further, the Basic Block of conv2 includes two convolution layers with a 3*3 convolution kernel and an Identity Block; the 3*3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block carries the features of shallow layers forward, preventing the gradient from vanishing as the network deepens.
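A minimal sketch of such a Basic Block, assuming equal input and output channel counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus an identity shortcut that carries shallow features
    forward, which is what keeps the gradient from vanishing in deeper stacks."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + x)   # identity branch (Identity Block)

print(BasicBlock(32)(torch.randn(1, 32, 64, 64)).shape)
```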
Further, the feature map feature3_1 is passed through the conv4_1 operation to obtain a feature map hfeature3_1; the feature map feature3_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The feature map feature3_2 is passed through the conv4_2 operation to obtain a feature map lfeature3_2; the feature map feature3_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size.
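The cross fusion at this stage can be sketched as follows; the 1*1 channel compression plus bilinear upsampling on the low-resolution path and the stride-2 3*3 convolution on the high-resolution path follow the text, while elementwise addition as the fusion operation and the exact placement relative to the conv4_1/conv4_2 stages are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusion(nn.Module):
    """Bidirectional fusion of the parallel branches at one stage."""
    def __init__(self, c_high, c_low):
        super().__init__()
        self.compress = nn.Conv2d(c_low, c_high, kernel_size=1, bias=False)   # channel compression
        self.down = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1, bias=False)

    def forward(self, high, low):
        # high: e.g. feature3_1 at 1/8 scale; low: e.g. feature3_2 at 1/16 scale
        up = F.interpolate(self.compress(low), size=high.shape[-2:],
                           mode='bilinear', align_corners=False)
        fused_high = high + up              # towards feature4_1 (stays at 1/8)
        fused_low = low + self.down(high)   # towards feature4_2 (before the stage's own downsampling)
        return fused_high, fused_low

high, low = torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)
print([t.shape for t in CrossFusion(32, 64)(high, low)])
```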
Further, conv5_x is composed of cascaded residual blocks (Bottleneck Block); a Bottleneck Block comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel and an Identity Block, which reduces the computational cost in deep networks. The feature map feature4_1 is passed through the conv5_1 operation to obtain a feature map hfeature4_1; the feature map feature4_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The feature map feature4_2 is passed through the conv5_2 operation to obtain a feature map lfeature4_2; the feature map feature4_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
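A minimal sketch of such a Bottleneck Block; the mid-channel reduction factor of 4 is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions with an identity shortcut; squeezing the middle
    channels is what keeps deep stages of the low-resolution branch cheap."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels))

    def forward(self, x):
        return F.relu(self.body(x) + x)   # identity branch (Identity Block)

print(Bottleneck(256)(torch.randn(1, 256, 8, 8)).shape)
```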
Further, the DPPM includes five parallel branches: the feature map feature5_2 is passed through a 1*1 convolution to obtain a feature map y1; feature5_2 is passed through a pooling layer with kernel_size=3 and stride=2, a 1*1 convolution and upsampling, fused with y1, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y2; feature5_2 is passed through a pooling layer with kernel_size=5 and stride=4, a 1*1 convolution and upsampling, fused with y2, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y3; feature5_2 is passed through a pooling layer with kernel_size=9 and stride=8, a 1*1 convolution and upsampling, fused with y3, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y4; feature5_2 is passed through global average pooling, a 1*1 convolution and upsampling, fused with y4, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y5; the feature maps y1, y2, y3, y4 and y5 are concatenated and then passed through a 1*1 convolution to change the number of channels, giving the final multi-scale fused low-resolution feature map feature6.
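A sketch of the five-branch DPPM under the pooling sizes and strides listed above; average pooling, the channel widths and elementwise addition as the fusion operation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPPM(nn.Module):
    """Five parallel branches (1x1 conv, three pooled branches, global pooling) whose
    outputs y1..y5 are chained by fusion + 3x3 conv, concatenated and reduced by 1x1 conv."""
    def __init__(self, cin=512, mid=128, cout=256):
        super().__init__()
        self.scale0 = nn.Conv2d(cin, mid, 1, bias=False)                       # branch for y1
        self.pools = nn.ModuleList([
            nn.AvgPool2d(kernel_size=3, stride=2, padding=1),                  # towards y2
            nn.AvgPool2d(kernel_size=5, stride=4, padding=2),                  # towards y3
            nn.AvgPool2d(kernel_size=9, stride=8, padding=4),                  # towards y4
            nn.AdaptiveAvgPool2d(1),                                           # global pooling, towards y5
        ])
        self.reduce = nn.ModuleList([nn.Conv2d(cin, mid, 1, bias=False) for _ in range(4)])
        self.fuse = nn.ModuleList([nn.Conv2d(mid, mid, 3, padding=1, bias=False) for _ in range(4)])
        self.out = nn.Conv2d(5 * mid, cout, 1, bias=False)                     # concat -> 1x1 conv

    def forward(self, feature5_2):
        size = feature5_2.shape[-2:]
        ys = [self.scale0(feature5_2)]                                         # y1
        for pool, red, fuse in zip(self.pools, self.reduce, self.fuse):
            y = F.interpolate(red(pool(feature5_2)), size=size, mode='bilinear', align_corners=False)
            ys.append(fuse(y + ys[-1]))                                        # fuse with previous, then 3x3 conv
        return self.out(torch.cat(ys, dim=1))                                  # feature6

print(DPPM()(torch.randn(1, 512, 8, 8)).shape)   # -> (1, 256, 8, 8)
```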
Further, the differential boundary attention module is composed of four parallel branches, each consisting of a pixel difference convolution module and a spatial attention module; the pixel difference convolution module comprises a difference convolution layer with a 3*3 convolution kernel, a ReLU layer and a convolution layer with a 1*1 convolution kernel; the spatial attention module comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 first passes through a pixel difference convolution module (PDCM) and then through a spatial attention module (SAM) to obtain the corresponding boundary feature map.
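One branch of this module could be sketched as below; the central pixel-difference form of the 3*3 difference convolution is one common realization of the LBP-inspired operation and, like the exact layer ordering inside the SAM and the intermediate channel width, is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralPixelDifferenceConv(nn.Module):
    """3x3 pixel difference convolution: each tap contributes w_j * (x_j - x_center),
    so gradient information between neighbouring pixels is encoded in the convolution."""
    def __init__(self, cin, cout):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(cout, cin, 3, 3) * 0.1)

    def forward(self, x):
        w_sum = self.weight.sum(dim=(2, 3), keepdim=True)            # per-filter sum of taps
        return F.conv2d(x, self.weight, padding=1) - F.conv2d(x, w_sum)

class BoundaryAttentionBranch(nn.Module):
    """PDCM (difference conv -> ReLU -> 1x1 conv) followed by SAM, whose single-channel
    Sigmoid map reweights the features before the boundary map is produced."""
    def __init__(self, cin, mid=32):
        super().__init__()
        self.pdc = CentralPixelDifferenceConv(cin, mid)
        self.proj = nn.Conv2d(mid, mid, kernel_size=1)
        self.sam = nn.Sequential(nn.Conv2d(mid, mid, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(mid, mid, 3, padding=1),
                                 nn.Conv2d(mid, 1, 1), nn.Sigmoid())
        self.to_boundary = nn.Conv2d(mid, 1, kernel_size=1)

    def forward(self, x, out_size):
        f = self.proj(F.relu(self.pdc(x)))
        f = f * self.sam(f)                                           # suppress background noise
        boundary = self.to_boundary(f)                                # single-channel boundary map
        return F.interpolate(boundary, size=out_size, mode='bilinear', align_corners=False)

print(BoundaryAttentionBranch(64)(torch.randn(1, 64, 32, 32), (256, 256)).shape)
```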
Further, the pixel-region enhanced feature map in S3 is obtained through the following specific steps (an illustrative sketch is given after step S3-4):
S3-1: carrying out Softmax operation on the multi-scale fused low-resolution feature map to obtain K rough segmentation areas { R 1,R2,...,RK }, wherein K represents the number of segmented categories, R K is a two-dimensional vector, and each element in R K represents the probability that the corresponding pixel belongs to the category K;
S3-2: the K-th region representation feature is obtained by using the following formula, namely, the weighted summation of the features of all pixels of the whole image and the probability that the features belong to the region K is carried out:
Where x i represents the feature of pixel p i, r ki represents the probability that pixel p i belongs to region K, and f k represents the region representation feature;
s3-3: the corresponding relation between each pixel and each region is calculated through a self-attention mechanism, and the calculation formula is as follows:
where t (x, f) =u 1(x)Tu2(f),u1、u2、u3 and u 4 represent a transfer function FFN consisting of 1*1 convolutions, batchNorm layers and ReLU layers; taking u 1(x)T and u 2 (f) as keys and queries of a self-attention mechanism respectively, calculating the correlation between the pixel characteristics and the region representation characteristics, normalizing the correlation to obtain w ik, and multiplying w ik as a weight by u 3(fk to obtain a pixel-region enhancement characteristic y i;
S3-4: the pixel-area enhancement feature map y aug is composed using the pixel-area enhancement feature y aug for each pixel point.
The beneficial effects of the invention are as follows:
1. Compared with traditional edge extraction operators, the edge feature extraction in the invention, with its added convolution layers and ReLU layers, is less affected by factors such as the environment and illumination and generalizes better. Compared with edge feature extraction modules based on ordinary convolutional neural networks, whose convolution kernel parameters are optimized from random initialization and do not encode gradient information, making it difficult to focus on edge-related features, the boundary attention in the invention uses pixel difference convolution: starting from the principle of edge formation, it encodes gradient information through the differences between adjacent pixels when optimizing the convolution kernel parameters, and spatial attention is added after the pixel difference convolution to reduce interference from background noise, so the extracted edge features are more effective.
2. Unlike current image segmentation algorithms (typified by DeepLabv3+) that exploit global context information, the region attention in the invention models context at the category level: a coarse segmentation region map is first generated from the low-resolution feature map, and then high-resolution features and a self-attention mechanism are added to continuously enhance the features of pixels belonging to the same object, which effectively alleviates the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
Drawings
Fig. 1 is a block diagram of a network model of a transparent object image segmentation method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a differential boundary attention module in an embodiment of the present invention.
FIG. 3 is a block diagram of the region attention module in an embodiment of the present invention.
FIG. 4 is a block diagram of a high resolution branch and low resolution branch cross-fusion architecture in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings and examples. It should be noted that the examples do not limit the scope of the invention as claimed.
Example 1
As shown in fig. 1 to 4, a transparent object image segmentation method includes the following steps:
S1: establishing a dual-resolution feature extraction module comprising a high-resolution branch and a low-resolution branch, wherein the high-resolution branch maintains accurate spatial position information by connecting parallel feature images with different resolutions and repeatedly performing multi-scale cross fusion to obtain a high-resolution feature image with the size of 1/8 original image; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map to obtain a low-resolution feature map with the size of 1/64 original map; the depth pyramid pooling module is added at the tail end of the low-resolution branch, so that the effective receptive field is enlarged, the multi-scale context information is fused, a multi-scale fused low-resolution feature map is obtained, an original image is acquired through a camera, the original image is subjected to pretreatment of random cutting, random overturning, luminosity distortion and normalization, an input image is obtained, and the input image is input into the dual-resolution feature extraction module, and the specific implementation method is as follows:
The dual-resolution feature extraction network is composed of six levels: conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM (x=1 or 2, where 1 denotes the high-resolution branch and 2 denotes the low-resolution branch). The conv1 level comprises a convolution layer with a stride of 2 and a 3*3 convolution kernel, a BatchNorm layer and a ReLU layer; it changes the dimensions of the input image, and the feature map feature1 is obtained through the conv1 operation. The conv2 level consists of cascaded residual blocks (Basic Block); a Basic Block comprises two convolution layers with a 3*3 convolution kernel and an Identity Block, where the 3*3 convolution layers extract different input features at low computational cost and the Identity Block carries shallow-layer features forward to avoid vanishing gradients as the network deepens; the feature map feature2 at 1/8 of the original image size is obtained through the conv2 operation. At conv3 the network splits into two parallel branches of high and low resolution, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 and yields the high-resolution branch feature map feature3_1 at 1/8 of the original image size, while conv3_2 downsamples the output of conv2 and yields the low-resolution branch feature map feature3_2 at 1/16 of the original image size. The feature map feature3_1 is passed through the conv4_1 operation to obtain hfeature3_1; feature3_2 is passed through a 1*1 convolution for channel compression and upsampled by bilinear interpolation to obtain hfeature3_2; hfeature3_1 and hfeature3_2 are fused into the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The feature map feature3_2 is passed through the conv4_2 operation to obtain lfeature3_2; feature3_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain lfeature3_1; lfeature3_1 and lfeature3_2 are fused into the low-resolution branch feature map feature4_2 at 1/32 of the original image size. The feature maps feature4_1 and feature4_2 are thus the result of the cross fusion of the high-resolution and low-resolution branch feature maps. The conv5_x level consists of cascaded residual blocks (Bottleneck Block); a Bottleneck Block contains two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel and an Identity Block, reducing the computational cost in deep networks. The feature map feature4_1 is passed through the conv5_1 operation to obtain hfeature4_1; feature4_2 is passed through a 1*1 convolution for channel compression and upsampled by bilinear interpolation to obtain hfeature4_2; hfeature4_1 and hfeature4_2 are fused into the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The feature map feature4_2 is passed through the conv5_2 operation to obtain lfeature4_2; feature4_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain lfeature4_1; lfeature4_1 and lfeature4_2 are fused into the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
A depth pyramid pooling module (DPPM) is added after the feature map feature5_2 to effectively enlarge the receptive field and fuse multi-scale context information. It comprises five parallel branches: feature5_2 is passed through a 1*1 convolution to obtain the feature map y1; feature5_2 is passed through a pooling layer with kernel_size=3 and stride=2, a 1*1 convolution and upsampling, fused with y1, and the fused feature map is passed through a 3*3 convolution to obtain y2; feature5_2 is passed through a pooling layer with kernel_size=5 and stride=4, a 1*1 convolution and upsampling, fused with y2, and the fused feature map is passed through a 3*3 convolution to obtain y3; feature5_2 is passed through a pooling layer with kernel_size=9 and stride=8, a 1*1 convolution and upsampling, fused with y3, and the fused feature map is passed through a 3*3 convolution to obtain y4; feature5_2 is passed through global average pooling, a 1*1 convolution and upsampling, fused with y4, and the fused feature map is passed through a 3*3 convolution to obtain y5. The feature maps y1, y2, y3, y4 and y5 are concatenated and then passed through a 1*1 convolution to change the number of channels, giving the final multi-scale fused low-resolution feature map feature6.
This step connects parallel feature maps of different resolutions and repeatedly performs multi-scale cross fusion to maintain the high-resolution feature map, so the generated high-resolution feature map provides rich detail information and helps improve the precision of the segmentation result. The low-resolution branch extracts rich semantic information through successive downsampling and cross fusion with the high-resolution feature map; the feature map at the end of this branch is 1/64 of the original image size, and after the depth pyramid pooling module is added, the effective receptive field is enlarged, multi-scale context information is fused, and the computational cost of the model is reduced. Unlike most existing feature extraction modules, which are connected in series, the two resolution branches of this module are connected in parallel: the high-resolution branch always maintains accurate spatial position information and continuously integrates low-resolution information, avoiding the information loss incurred when serial designs recover resolution by downsampling followed by upsampling. This effectively addresses the difficulty of extracting feature information from transparent objects under background and illumination changes and is important for the subsequent refinement of regions.
S2: the differential boundary attention module is utilized to respectively carry out differential convolution and spatial attention operation on four feature images feature2, feature3_2, feature4_2 and feature5_2 with different scales extracted in the extracted S1, multi-scale edge feature images are extracted and fused, and edge segmentation images of transparent objects are obtained after feature dimension reduction, and the specific implementation method comprises the following steps:
The four feature maps of different scales extracted in S1, feature2, feature3_2, feature4_2 and feature5_2, are selected and passed through the differential boundary attention module to obtain the boundary feature maps boundary1, boundary2, boundary3 and boundary4 of the four branches. The differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules: each feature map selected in S1 first passes through a pixel difference convolution module (PDCM) and then through a spatial attention module (SAM) to obtain the corresponding boundary feature map. The pixel difference convolution module comprises a difference convolution layer with a 3*3 convolution kernel, a ReLU layer and a convolution layer with a 1*1 convolution kernel. The pixel difference convolution layer combines the traditional edge detection operator LBP (local binary pattern) with the convolutional neural network: a 3*3 convolution kernel computes pixel differences over the 8-neighborhood of each local region of the image, and the differences are multiplied element-wise by the convolution kernel weights and summed to produce the values of the output feature map. The spatial attention module contains two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel, a ReLU layer and a Sigmoid function; a 1*1 convolution layer compresses the feature map into a single channel, and the feature map is restored to the original image size by bilinear interpolation. Finally, the boundary feature maps boundary1, boundary2, boundary3 and boundary4 obtained from the four branches are concatenated into the multi-scale boundary feature map boundary5, which is passed through a convolution layer with a 1*1 convolution kernel and a Sigmoid function to obtain the edge segmentation map of the transparent object.
This step thus comprises several pixel difference convolution modules and spatial attention modules: the difference convolution modules acquire rich boundary information by convolving the convolution kernels with inter-pixel differences, and the spatial attention modules reduce interference from background noise.
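A sketch of the last fusion step, assuming the four boundary maps are single-channel and already restored to a common size:

```python
import torch
import torch.nn as nn

class BoundaryFusionHead(nn.Module):
    """Concatenate boundary1..boundary4 into boundary5 and reduce to an edge
    segmentation map with a 1x1 convolution followed by a Sigmoid."""
    def __init__(self, num_branches=4):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(num_branches, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, boundary_maps):
        boundary5 = torch.cat(boundary_maps, dim=1)       # multi-scale boundary feature map
        return self.head(boundary5), boundary5            # edge map and the fused feature

edge, boundary5 = BoundaryFusionHead()([torch.rand(1, 1, 256, 256) for _ in range(4)])
print(edge.shape, boundary5.shape)
```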
An edge prediction image of the transparent object is obtained after feature dimension reduction, and the edge loss function L1 is calculated as one part of the total loss function L; it participates in the network weight update during gradient descent to optimize the model parameters. L1 adopts the cross entropy loss function, where p_i is the predicted boundary probability of pixel i and y_i is the ground-truth boundary label of pixel i; the calculation formula is:

L1 = −Σ_i [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
S3: and modeling the context relation of the category level of the high-resolution feature map feature5_1 and the multi-scale fused low-resolution feature map feature6 in the S1 by using the regional attention module, enhancing the features of pixels from the same object, and obtaining a pixel-region enhanced feature map. And (3) fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-area enhancement feature map, obtaining a final transparent object segmentation result after feature dimension reduction, and calculating a loss function L2 of the transparent object as another part of the total loss function L. Wherein L2 and L1 are the same as each other and are the cross entropy loss functions, and the total loss function L is the sum of L1 and L2. The specific implementation method comprises the following steps:
A Softmax operation is performed on the multi-scale fused low-resolution feature map feature6 to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation categories, R_k is a two-dimensional vector, and each element of R_k represents the probability that the corresponding pixel belongs to category k. The k-th region representation feature is the weighted sum of the features of all pixels and their probabilities of belonging to region k:

f_k = Σ_i r_ki · x_i
where x_i denotes the feature of pixel p_i, r_ki denotes the probability that pixel p_i belongs to region k, and f_k denotes the region representation feature. The correspondence between each pixel and each region is then calculated through a self-attention mechanism, with the calculation formula:

w_ik = exp(t(x_i, f_k)) / Σ_j exp(t(x_i, f_j)),   y_i = u_4( Σ_k w_ik · u_3(f_k) )
where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1*1 convolution, a BatchNorm layer and a ReLU layer. u_1(x)^T and u_2(f) are taken as the key and query of the self-attention mechanism to calculate the correlation between the pixel features and the region representation features; the correlation is normalized to obtain w_ik, which is used as a weight and multiplied by u_3(f_k) to obtain the pixel-region enhanced feature y_i; the pixel-region enhanced features y_i of all pixel points form the pixel-region enhanced feature map y_aug. The high-resolution feature map feature5_1, the multi-scale edge feature map boundary5 and the pixel-region enhanced feature map y_aug are fused by a concatenation operation and finally passed through a convolution layer with a 1*1 convolution kernel and a Sigmoid function to obtain the segmentation result of the transparent object, which alleviates the problems of the transparent object being occluded and affected by the environment.
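The final fusion can be sketched as a small head; the channel counts and the bilinear upsampling of all three maps to a common output size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Concatenate feature5_1, boundary5 and y_aug, then reduce channels with a
    1x1 convolution and apply a Sigmoid to obtain the transparent object mask."""
    def __init__(self, c_high=64, c_boundary=4, c_aug=64, num_outputs=1):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(c_high + c_boundary + c_aug, num_outputs, 1),
                                  nn.Sigmoid())

    def forward(self, feature5_1, boundary5, y_aug, out_size):
        maps = [F.interpolate(m, size=out_size, mode='bilinear', align_corners=False)
                for m in (feature5_1, boundary5, y_aug)]
        return self.head(torch.cat(maps, dim=1))

head = SegmentationHead()
mask = head(torch.randn(1, 64, 64, 64), torch.randn(1, 4, 256, 256),
            torch.randn(1, 64, 64, 64), out_size=(512, 512))
print(mask.shape)   # (1, 1, 512, 512)
```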
The performance of the invention compared with several general image segmentation algorithms on the transparent object dataset Trans10K-v2 is shown in Table 1, where mIoU denotes the mean intersection-over-union between ground truth and prediction over all classes and ACC denotes the pixel accuracy.
TABLE 1
Compared with existing mainstream semantic segmentation algorithms, the method of the invention has clear advantages in both performance indicators, ACC and mIoU. Compared with UNet, its indicators improve greatly, showing that the dual-resolution feature extraction module extracts more robust features of transparent objects; compared with DeepLabv3+ and DenseASPP, its indicators also improve, showing that the region attention alleviates the loss of semantic information caused by occlusion of transparent objects; and its indicators are better than those of OCRNet, showing that adding boundary attention improves the segmentation results for transparent objects.
The embodiment of the invention also provides a transparent object image segmentation system based on the differential boundary attention and the regional attention, which comprises computer equipment; the computer device is configured or programmed to perform the steps of the embodiment methods described above.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.
Claims (7)
1. A transparent object image segmentation method, characterized by comprising the steps of:
S1: establishing a dual-resolution feature extraction module comprising a high-resolution branch and a low-resolution branch, inputting an input image into the dual-resolution feature extraction module, and maintaining accurate spatial position information by the high-resolution branch through connecting parallel different-resolution feature images and repeatedly performing multi-scale cross fusion to obtain a high-resolution feature image with the size of 1/8 original image; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map to obtain a low-resolution feature map with the size of 1/64 original map; adding a depth pyramid pooling module at the tail end of the low-resolution branch, wherein the depth pyramid pooling module is used for expanding an effective receptive field and fusing multi-scale context information to obtain a multi-scale fused low-resolution feature map;
the dual-resolution feature extraction network is composed of six levels of conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM, wherein x=1 or 2, x=1 represents a high-resolution branch, and x=2 represents a low-resolution branch;
conv1 comprises a convolution layer with a step size of 2 and a convolution kernel of 3*3, a BatchNorm layer and a ReLU layer, conv1 layer being used to change the dimension of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and is used to obtain a feature map feature2 at 1/8 of the original image size;
at conv3_x the network splits into two parallel branches of high and low resolution, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain a high-resolution branch feature map feature3_1 at 1/8 of the original image size, and conv3_2 downsamples the output of conv2 to obtain a low-resolution branch feature map feature3_2 at 1/16 of the original image size;
The conv4_x is divided into two branches of high and low resolution conv4_1 and conv4_2 in parallel, wherein the conv4_1 is used for continuously merging low resolution information and maintaining a high resolution branch feature map feature4_1 of 1/8 original size; conv4_2 is used for obtaining a low-resolution branch feature map feature4_2 with the original size of 1/32;
The conv5_x is divided into two branches of high and low resolution conv5_1 and conv5_2 in parallel, wherein the conv5_1 is used for continuously merging low resolution information and maintaining a high resolution branch feature map feature5_1 of 1/8 original size; conv5_2 is used for obtaining a low-resolution branch feature map feature5_2 with the original size of 1/64;
DPPM is used to expand receptive fields and fuse multi-scale context information;
S2: differential convolution and spatial attention operations are respectively carried out on the feature images feature2 with the 1/8 original image size and the low-resolution branch feature images feature3_2 with the 1/16 original image size extracted in the S1, the low-resolution branch feature images feature4_2 with the 1/32 original image size and the low-resolution branch feature images feature5_2 with the 1/64 original image size by utilizing a differential boundary attention module, multi-scale edge feature images are extracted and fused, and an edge prediction image of a transparent object is obtained after feature dimension reduction;
s3: carrying out category-level context relation modeling on the high-resolution feature map obtained in the step S1 and the multi-scale fused low-resolution feature map by utilizing a region attention module, enhancing the features of pixels from the same object, and obtaining a pixel-region enhanced feature map; and fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-area enhancement feature map, and obtaining a final transparent object segmentation result after feature dimension reduction.
2. The transparent object image segmentation method according to claim 1, wherein the Basic Block of conv2 comprises two convolution layers with a 3*3 convolution kernel and an Identity Block; the 3*3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block carries the features of shallow layers forward, preventing the gradient from vanishing as the network deepens.
3. The transparent object image segmentation method according to claim 1, wherein the feature map feature3_1 is passed through the conv4_1 operation to obtain a feature map hfeature3_1; the feature map feature3_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size; the feature map feature3_2 is passed through the conv4_2 operation to obtain a feature map lfeature3_2; the feature map feature3_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size; conv5_x is composed of cascaded residual blocks (Bottleneck Block); a Bottleneck Block comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel and an Identity Block, which reduces the computational cost in deep networks; the feature map feature4_1 is passed through the conv5_1 operation to obtain a feature map hfeature4_1; the feature map feature4_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size; the feature map feature4_2 is passed through the conv5_2 operation to obtain a feature map lfeature4_2; the feature map feature4_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
4. The transparent object image segmentation method according to claim 3, wherein the DPPM comprises five parallel branches: the feature map feature5_2 is passed through a 1*1 convolution to obtain a feature map y1; feature5_2 is passed through a pooling layer with kernel_size=3 and stride=2, a 1*1 convolution and upsampling, fused with y1, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y2; feature5_2 is passed through a pooling layer with kernel_size=5 and stride=4, a 1*1 convolution and upsampling, fused with y2, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y3; feature5_2 is passed through a pooling layer with kernel_size=9 and stride=8, a 1*1 convolution and upsampling, fused with y3, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y4; feature5_2 is passed through global average pooling, a 1*1 convolution and upsampling, fused with y4, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y5; the feature maps y1, y2, y3, y4 and y5 are concatenated and then passed through a 1*1 convolution to change the number of channels, giving the final multi-scale fused low-resolution feature map feature6.
5. The transparent object image segmentation method according to claim 1, wherein the differential boundary attention module is composed of four parallel branches, each consisting of a pixel difference convolution module and a spatial attention module; the pixel difference convolution module comprises a difference convolution layer with a 3*3 convolution kernel, a ReLU layer and a convolution layer with a 1*1 convolution kernel; the spatial attention module comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 first passes through a pixel difference convolution module (PDCM) and then through a spatial attention module (SAM) to obtain the corresponding boundary feature map.
6. The transparent object image segmentation method according to claim 1, wherein the pixel-region enhanced feature map in S3 is obtained through the following specific steps:
S3-1: A Softmax operation is performed on the multi-scale fused low-resolution feature map to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation categories, R_k is a two-dimensional vector, and each element of R_k represents the probability that the corresponding pixel belongs to category k;
S3-2: The k-th region representation feature is obtained with the following formula, i.e. the features of all pixels of the whole image are weighted by their probabilities of belonging to region k and summed:

f_k = Σ_i r_ki · x_i

where x_i denotes the feature of pixel p_i, r_ki denotes the probability that pixel p_i belongs to region k, and f_k denotes the region representation feature;
S3-3: The correspondence between each pixel and each region is calculated through a self-attention mechanism, with the calculation formula:

w_ik = exp(t(x_i, f_k)) / Σ_j exp(t(x_i, f_j)),   y_i = u_4( Σ_k w_ik · u_3(f_k) )

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1*1 convolution, a BatchNorm layer and a ReLU layer; u_1(x)^T and u_2(f) are taken as the key and query of the self-attention mechanism to calculate the correlation between the pixel features and the region representation features, the correlation is normalized to obtain w_ik, and w_ik is used as a weight and multiplied by u_3(f_k) to obtain the pixel-region enhanced feature y_i;
S3-4: The pixel-region enhanced feature map y_aug is composed of the pixel-region enhanced features y_i of all pixel points.
7. The transparent object image segmentation method according to claim 1, wherein the edge prediction image of the transparent object obtained in S2 is used to calculate an edge loss function L1, the transparent object segmentation result obtained in S3 is used to calculate a loss function L2 of the transparent object, and the total loss function L is the sum of the edge loss function L1 and the loss function L2; the total loss participates in the network weight update during gradient descent to optimize the model parameters; the edge loss function L1 and the loss function L2 both adopt the cross entropy loss function; taking the edge loss function L1 as an example, the calculation formula is:

L1 = −Σ_i [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]

where p_i is the predicted boundary probability of pixel i and y_i is the ground-truth boundary label of pixel i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210633162.3A CN115082675B (en) | 2022-06-07 | 2022-06-07 | Transparent object image segmentation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210633162.3A CN115082675B (en) | 2022-06-07 | 2022-06-07 | Transparent object image segmentation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115082675A CN115082675A (en) | 2022-09-20 |
CN115082675B (en) | 2024-06-04
Family
ID=83248245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210633162.3A Active CN115082675B (en) | 2022-06-07 | 2022-06-07 | Transparent object image segmentation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115082675B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115294412A (en) * | 2022-10-10 | 2022-11-04 | 临沂大学 | Real-time coal rock segmentation network generation method based on deep learning |
CN116309274B (en) * | 2022-12-12 | 2024-01-30 | 湖南红普创新科技发展有限公司 | Method and device for detecting small target in image, computer equipment and storage medium |
CN115880567B (en) * | 2023-03-03 | 2023-07-25 | 深圳精智达技术股份有限公司 | Self-attention calculating method and device, electronic equipment and storage medium |
CN117788722B (en) * | 2024-02-27 | 2024-05-03 | 国能大渡河金川水电建设有限公司 | BIM-based safety data monitoring system for underground space |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1854965B1 (en) * | 2006-05-02 | 2009-10-21 | Carl Freudenberg KG | Oil seal |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
CN114359297A (en) * | 2022-01-04 | 2022-04-15 | 浙江大学 | Attention pyramid-based multi-resolution semantic segmentation method and device |
Non-Patent Citations (1)
Title |
---|
Dual-path semantic segmentation combined with an attention mechanism; Zhai Pengbo; Yang Hao; Song Tingting; Yu Kang; Ma Longxiang; Huang Xiangsheng; Journal of Image and Graphics; 2020-08-12 (No. 08); 119-128 *
Also Published As
Publication number | Publication date |
---|---|
CN115082675A (en) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115082675B (en) | Transparent object image segmentation method and system | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN111462126B (en) | Semantic image segmentation method and system based on edge enhancement | |
CN109522966B (en) | Target detection method based on dense connection convolutional neural network | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
CN113256649B (en) | Remote sensing image station selection and line selection semantic segmentation method based on deep learning | |
CN113362242B (en) | Image restoration method based on multi-feature fusion network | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
CN116797787A (en) | Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network | |
CN115631513B (en) | Transformer-based multi-scale pedestrian re-identification method | |
CN113066089A (en) | Real-time image semantic segmentation network based on attention guide mechanism | |
CN115424017B (en) | Building inner and outer contour segmentation method, device and storage medium | |
CN114092824A (en) | Remote sensing image road segmentation method combining intensive attention and parallel up-sampling | |
CN115424059A (en) | Remote sensing land use classification method based on pixel level comparison learning | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN117726954B (en) | Sea-land segmentation method and system for remote sensing image | |
Van Hoai et al. | Feeding Convolutional Neural Network by hand-crafted features based on Enhanced Neighbor-Center Different Image for color texture classification | |
Cho et al. | Modified perceptual cycle generative adversarial network-based image enhancement for improving accuracy of low light image segmentation | |
Vijayalakshmi K et al. | Copy-paste forgery detection using deep learning with error level analysis | |
CN116681978A (en) | Attention mechanism and multi-scale feature fusion-based saliency target detection method | |
Li et al. | A new algorithm of vehicle license plate location based on convolutional neural network | |
CN116704367A (en) | Multi-scale feature fusion farmland change detection method and system | |
Mujtaba et al. | Automatic solar panel detection from high-resolution orthoimagery using deep learning segmentation networks | |
Bendre et al. | Natural disaster analytics using high resolution satellite images | |
CN113065547A (en) | Character supervision information-based weak supervision text detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||