CN115082675A - Transparent object image segmentation method and system - Google Patents

Transparent object image segmentation method and system

Info

Publication number
CN115082675A
CN115082675A
Authority
CN
China
Prior art keywords
feature map
resolution
convolution
feature
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210633162.3A
Other languages
Chinese (zh)
Other versions
CN115082675B (en)
Inventor
胡泊
王勇
邹逸群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210633162.3A priority Critical patent/CN115082675B/en
Priority claimed from CN202210633162.3A external-priority patent/CN115082675B/en
Publication of CN115082675A publication Critical patent/CN115082675A/en
Application granted granted Critical
Publication of CN115082675B publication Critical patent/CN115082675B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 - Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a transparent object image segmentation method, which comprises the following steps. S1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch to obtain a high-resolution feature map and a multi-scale fused low-resolution feature map. S2: performing differential convolution and spatial attention operations on the feature maps of different scales extracted in S1 by using a differential boundary attention module, extracting multi-scale edge feature maps and fusing them. S3: performing category-level context modeling on the high-resolution feature map and the multi-scale fused low-resolution feature map by using a region attention module to obtain a pixel-region enhanced feature map. The high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map are fused, and the final transparent object segmentation result is obtained after feature dimensionality reduction, which effectively alleviates the loss of semantic information caused by factors such as the environment and occlusion of transparent objects.

Description

Transparent object image segmentation method and system
Technical Field
The invention relates to the field of computer vision, in particular to a transparent object image segmentation method and a transparent object image segmentation system.
Background
Image semantic segmentation is one of the key technologies that allow an intelligent system to understand a natural scene. However, for transparent objects, which are ubiquitous in the real world, existing general-purpose image segmentation methods often fail to produce satisfactory results. The main problems are as follows: transparent objects are easily affected by environmental factors, which makes it difficult to extract robust features; transparent objects are easily occluded, which leaves their semantic information incomplete; and the edges of transparent objects are segmented inaccurately. All of these problems ultimately degrade the segmentation of transparent objects.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a transparent object image segmentation method which, by adding high-resolution features and a self-attention mechanism, continuously enhances the features of pixels belonging to the same object and effectively alleviates the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a transparent object image segmentation method, comprising the steps of:
s1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch, and inputting the input image into the dual-resolution feature extraction module; the high-resolution branch maintains accurate spatial position information by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion, obtaining a high-resolution feature map at 1/8 of the original image size; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map, obtaining a low-resolution feature map at 1/64 of the original image size; a depth pyramid pooling module is added at the end of the low-resolution branch to enlarge the effective receptive field and fuse multi-scale context information, obtaining a multi-scale fused low-resolution feature map;
s2: difference of utilizationThe boundary attention module respectively performs differential convolution and space attention operations on the feature maps with different dimensions extracted in the S1, extracts multi-scale edge feature maps and fuses the feature maps, obtains an edge prediction image of the transparent object after feature dimensionality reduction, calculates an edge loss function L1 as a part of a total loss function L, and participates in network weight updating in the gradient descent process to optimize model parameters; where L1 employs a cross-entropy loss function, p i Is the prediction of the pixel i at the boundary, y i Is the actual result of pixel i at the boundary; the calculation formula is as follows:
L1 = -\sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
s3: performing category-level context modeling on the high-resolution feature map obtained in S1 and the multi-scale fused low-resolution feature map by using a region attention module, and enhancing the features of pixels from the same class of object to obtain a pixel-region enhanced feature map; fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map, obtaining the final transparent object segmentation result after feature dimensionality reduction, and calculating the loss function L2 of the transparent object as the other part of the total loss function L, wherein L2 is the same cross-entropy loss function as L1 and the total loss function L is the sum of L1 and L2.
Further, the dual-resolution feature extraction network is composed of six levels, namely conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM, where x = 1 or 2: x = 1 denotes the high-resolution branch and x = 2 denotes the low-resolution branch;
conv1 contains a convolution layer with stride 2 and a 3 × 3 kernel, a BatchNorm layer and a ReLU layer; the conv1 layer is used to change the dimensions of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and yields the feature map feature2 at 1/8 of the original image size;
conv3_x splits into two parallel high- and low-resolution branches, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain the high-resolution branch feature map feature3_1 at 1/8 of the original image size; conv3_2 downsamples the output of conv2 to obtain the low-resolution branch feature map feature3_2 at 1/16 of the original image size;
conv4_x is divided into two parallel high- and low-resolution branches, conv4_1 and conv4_2; conv4_1 continuously blends in low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size; conv4_2 produces a low-resolution branch feature map at 1/32 of the original image size;
conv5_x is divided into two parallel high- and low-resolution branches, conv5_1 and conv5_2; conv5_1 continuously blends in low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size; conv5_2 produces a low-resolution branch feature map at 1/64 of the original image size;
DPPM is used to expand the receptive field and fuse multi-scale contextual information.
Furthermore, the Basic Block of conv2 contains two convolution layers with 3 × 3 kernels and an Identity Block; the 3 × 3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block copies the shallow features, preventing the gradient from vanishing as the network deepens.
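By way of illustration, a minimal PyTorch-style sketch of such a residual block could look as follows; channel counts and the exact placement of BatchNorm and ReLU are assumptions, not details taken from the patent.

    import torch
    import torch.nn as nn

    class BasicBlock(nn.Module):
        # residual block: two 3x3 convolutions plus an identity shortcut
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            identity = x                                  # shallow features copied by the Identity Block
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + identity)              # residual addition mitigates vanishing gradients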
Further, the conv4_1 operation is performed on feature map feature3_1 to obtain feature map hfeature3_1; feature map feature3_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The conv4_2 operation is performed on feature map feature3_2 to obtain feature map lfeature3_2; feature map feature3_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size.
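This bidirectional cross fusion can be sketched as below in a PyTorch-style illustration; channel counts, the number of stride-2 downsampling steps and the normalization details are assumptions chosen to keep the example self-contained.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossFusion(nn.Module):
        # bidirectional fusion between a high-resolution map and a low-resolution map
        def __init__(self, high_ch, low_ch, num_down=2):
            super().__init__()
            # low -> high: 1x1 convolution compresses channels, bilinear upsampling restores resolution
            self.compress = nn.Conv2d(low_ch, high_ch, kernel_size=1, bias=False)
            # high -> low: stacked 3x3 stride-2 convolutions; num_down is chosen to match the resolution gap
            downs, ch = [], high_ch
            for i in range(num_down):
                out_ch = low_ch if i == num_down - 1 else ch
                downs += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1, bias=False),
                          nn.BatchNorm2d(out_ch)]
                ch = out_ch
            self.down = nn.Sequential(*downs)

        def forward(self, high, low):
            up = F.interpolate(self.compress(low), size=high.shape[2:],
                               mode='bilinear', align_corners=False)
            fused_high = high + up                 # e.g. hfeature + upsampled low branch -> feature4_1
            fused_low = low + self.down(high)      # e.g. lfeature + downsampled high branch -> feature4_2
            return fused_high, fused_low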
Further, conv5_x is composed of cascaded residual blocks (Bottleneck Block); each Bottleneck Block contains two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel and an Identity Block, which reduces the computational cost in the deep part of the network. The conv5_1 operation is performed on feature map feature4_1 to obtain feature map hfeature4_1; feature map feature4_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The conv5_2 operation is performed on feature map feature4_2 to obtain feature map lfeature4_2; feature map feature4_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
Further, the DPPM contains five parallel branches: feature map feature5_2 passes through a 1 × 1 convolution to obtain feature map y1; feature5_2 passes through a pooling layer with kernel_size = 3 and stride = 2, a 1 × 1 convolution and upsampling, the result is fused with y1, and the fused map passes through a 3 × 3 convolution to obtain feature map y2; feature5_2 passes through a pooling layer with kernel_size = 5 and stride = 4, a 1 × 1 convolution and upsampling, the result is fused with y2, and the fused map passes through a 3 × 3 convolution to obtain feature map y3; feature5_2 passes through a pooling layer with kernel_size = 9 and stride = 8, a 1 × 1 convolution and upsampling, the result is fused with y3, and the fused map passes through a 3 × 3 convolution to obtain feature map y4; feature5_2 passes through global average pooling, a 1 × 1 convolution and upsampling, the result is fused with y4, and the fused map passes through a 3 × 3 convolution to obtain feature map y5; the feature maps y1, y2, y3, y4 and y5 are spliced and a 1 × 1 convolution changes the number of channels to obtain the final multi-scale fused low-resolution feature map feature6.
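A minimal PyTorch-style sketch of such a pyramid pooling head is given below; the pooling kernel sizes and strides follow the description, while channel counts, padding values and the use of average pooling are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DPPM(nn.Module):
        # deep pyramid pooling: hierarchical multi-scale context aggregation over feature5_2
        def __init__(self, in_ch, mid_ch, out_ch):
            super().__init__()
            self.scale0 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)                 # branch y1
            self.pools = nn.ModuleList([
                nn.AvgPool2d(kernel_size=3, stride=2, padding=1),                 # branch y2
                nn.AvgPool2d(kernel_size=5, stride=4, padding=2),                 # branch y3
                nn.AvgPool2d(kernel_size=9, stride=8, padding=4),                 # branch y4
                nn.AdaptiveAvgPool2d(1),                                          # branch y5 (global average)
            ])
            self.reduce = nn.ModuleList([nn.Conv2d(in_ch, mid_ch, 1, bias=False) for _ in range(4)])
            self.fuse = nn.ModuleList([nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False) for _ in range(4)])
            self.out = nn.Conv2d(mid_ch * 5, out_ch, 1, bias=False)

        def forward(self, x):
            size = x.shape[2:]
            ys = [self.scale0(x)]                                                 # y1
            for pool, red, fuse in zip(self.pools, self.reduce, self.fuse):
                y = F.interpolate(red(pool(x)), size=size, mode='bilinear', align_corners=False)
                ys.append(fuse(y + ys[-1]))                                       # fuse with the previous branch, then 3x3 conv
            return self.out(torch.cat(ys, dim=1))                                 # splice y1..y5, 1x1 conv -> feature6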
Further, the differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules; each pixel difference convolution module contains a differential convolution layer with a 3 × 3 kernel, a ReLU layer and a convolution layer with a 1 × 1 kernel; the spatial attention module contains two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 passes through a Pixel Difference Convolution Module (PDCM) and then a Spatial Attention Module (SAM) to obtain the corresponding boundary feature map.
Further, the specific steps of obtaining the pixel-region enhanced feature map in S3 are:
s3-1: performing a Softmax operation on the multi-scale fused low-resolution feature map to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation classes; R_k is a two-dimensional map in which each element represents the probability that the corresponding pixel belongs to class k;
s3-2: the k-th region representation feature is obtained with the following formula, i.e. the features of all pixels of the whole image are weighted and summed with the probability that each pixel belongs to region k:

f_k = \sum_{i} r_{ki} x_i

where x_i denotes the feature of pixel p_i, r_{ki} denotes the probability that pixel p_i belongs to region k, and f_k is the region representation feature;
s3-3: the correspondence between each pixel and each region is calculated through a self-attention mechanism with the following formulas:

w_{ik} = \frac{\exp\left(t(x_i, f_k)\right)}{\sum_{j=1}^{K} \exp\left(t(x_i, f_j)\right)}

y_i = \sum_{k=1}^{K} w_{ik}\, u_3(f_k)

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1 × 1 convolution, a BatchNorm layer and a ReLU layer; u_1(x)^T and u_2(f) serve as the keys and queries of the self-attention mechanism, the correlation between the pixel features and the region representation features is calculated and normalized to obtain w_{ik}, and w_{ik} is used as a weight and multiplied with u_3(f_k) to obtain the pixel-region enhanced feature y_i;
s3-4: the pixel-region enhanced features y_i of all pixels form the pixel-region enhanced feature map y_aug.
The invention has the beneficial effects that:
1. Compared with traditional edge extraction operators, the boundary attention in this method, after adding the convolution layers and the ReLU layer, is less easily affected by factors such as the environment and illumination when extracting the edge feature map, and generalizes better. Edge feature extraction modules based on ordinary convolutional neural networks optimize their convolution kernel parameters from random initialization and do not encode gradient information, so they have difficulty focusing on edge-related features. The boundary attention in the invention instead uses pixel difference convolution: based on the way edges arise, the differences between adjacent pixels are used to encode gradient information and optimize the convolution kernel parameters, and spatial attention is added after the pixel difference convolution to reduce the interference of background noise, so the extracted edge features are more effective.
2. Unlike current image segmentation algorithms that exploit global context information (represented by DeepLabv3+), the region attention in the invention models class-level context: a coarse segmentation region is first generated from the low-resolution feature map, and then high-resolution features and a self-attention mechanism are added to continuously enhance the features of pixels belonging to the same class of object, which effectively alleviates the loss of semantic information caused by factors such as the environment and occlusion of transparent objects.
Drawings
Fig. 1 is a block diagram of a network model structure of a transparent object image segmentation method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a differential boundary attention module according to an embodiment of the present invention.
Fig. 3 is a block diagram of a regional attention module according to an embodiment of the present invention.
Fig. 4 is a block diagram of a cross-fusion structure of a high resolution branch and a low resolution branch according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be further described with reference to the drawings and examples. It should be noted that the examples do not limit the scope of the claimed invention.
Example 1
As shown in fig. 1 to 4, a method for segmenting an image of a transparent object includes the following steps:
s1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch; the high-resolution branch maintains accurate spatial position information by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion, obtaining a high-resolution feature map at 1/8 of the original image size; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map, obtaining a low-resolution feature map at 1/64 of the original image size; a depth pyramid pooling module is added at the end of the low-resolution branch to enlarge the effective receptive field and fuse multi-scale context information, obtaining a multi-scale fused low-resolution feature map. An original image is collected by a camera and preprocessed by random cropping, random flipping, photometric distortion and normalization to obtain the input image, which is fed into the dual-resolution feature extraction module. The specific implementation is as follows:
The dual-resolution feature extraction network is composed of six levels, namely conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM (x = 1 or 2, where 1 denotes the high-resolution branch and 2 denotes the low-resolution branch). The conv1 layer contains a convolution layer with stride 2 and a 3 × 3 kernel, a BatchNorm layer and a ReLU layer; it changes the dimensions of the input image, and the conv1 operation yields feature map feature1. conv2 is composed of cascaded Basic Blocks, each containing two convolution layers with 3 × 3 kernels and an Identity Block; the 3 × 3 convolution layers extract different input features while keeping the computational cost low, and the Identity Block copies the shallow features to prevent the gradient from vanishing as the network deepens; the conv2 operation yields feature map feature2 at 1/8 of the original image size. conv3 splits into two parallel high- and low-resolution branches, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 and yields the high-resolution branch feature map feature3_1 at 1/8 of the original image size; conv3_2 downsamples the output of conv2 and yields the low-resolution branch feature map feature3_2 at 1/16 of the original image size. The conv4_1 operation on feature3_1 yields feature map hfeature3_1; feature3_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to yield feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused into the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The conv4_2 operation on feature3_2 yields feature map lfeature3_2; feature3_1 is downsampled by a 3 × 3 convolution with stride 2 to yield feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused into the low-resolution branch feature map feature4_2 at 1/32 of the original image size, completing the cross fusion between the high-resolution and low-resolution branches at this level. conv5_x is composed of cascaded Bottleneck Blocks, each containing two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel and an Identity Block, which reduces the computational cost in the deep part of the network. The conv5_1 operation on feature4_1 yields feature map hfeature4_1; feature4_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to yield feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused into the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The conv5_2 operation on feature4_2 yields feature map lfeature4_2; feature4_1 is downsampled by a 3 × 3 convolution with stride 2 to yield feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused into the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
A Depth Pyramid Pooling Module (DPPM) is added after feature map feature5_2 to effectively enlarge the receptive field and fuse multi-scale context information. It contains five parallel branches: feature5_2 passes through a 1 × 1 convolution to obtain feature map y1; feature5_2 passes through a pooling layer with kernel_size = 3 and stride = 2, a 1 × 1 convolution and upsampling, the result is fused with y1, and the fused map passes through a 3 × 3 convolution to obtain feature map y2; feature5_2 passes through a pooling layer with kernel_size = 5 and stride = 4, a 1 × 1 convolution and upsampling, the result is fused with y2, and the fused map passes through a 3 × 3 convolution to obtain feature map y3; feature5_2 passes through a pooling layer with kernel_size = 9 and stride = 8, a 1 × 1 convolution and upsampling, the result is fused with y3, and the fused map passes through a 3 × 3 convolution to obtain feature map y4; feature5_2 passes through global average pooling, a 1 × 1 convolution and upsampling, the result is fused with y4, and the fused map passes through a 3 × 3 convolution to obtain feature map y5; the feature maps y1, y2, y3, y4 and y5 are spliced and a 1 × 1 convolution changes the number of channels to obtain the final multi-scale fused low-resolution feature map feature6.
In this step, the high-resolution feature map is maintained by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion; the resulting high-resolution feature map provides rich detail information and greatly helps to improve the precision of the segmentation result. The low-resolution branch extracts rich semantic information through continuous downsampling and cross fusion with the high-resolution feature map; the feature map at the end of this branch is 1/64 of the original image, and the added depth pyramid module enlarges the effective receptive field, fuses multi-scale context information and reduces the computational cost of the model. Unlike most existing serially connected feature extraction modules, the dual-resolution branches are connected in parallel: the high-resolution branch always maintains accurate spatial position information and continuously integrates low-resolution information, avoiding the information loss caused by downsampling and then recovering the resolution in a serial design. This effectively alleviates the difficulty of extracting feature information from transparent objects affected by background and illumination changes, and is of great importance for the subsequent refinement of regions.
S2: the method comprises the following steps of respectively performing differential convolution and spatial attention operations on four feature maps, feature2, feature3_2, feature4_2 and feature5_2 of different scales extracted in S1 by using a differential boundary attention module, extracting multi-scale edge feature maps, fusing the edge feature maps, and obtaining an edge segmentation image of a transparent object after feature dimensionality reduction, wherein the specific implementation method comprises the following steps:
The four feature maps of different scales selected from S1, feature2, feature3_2, feature4_2 and feature5_2, are passed through the differential boundary attention module to obtain the boundary feature maps boundary1, boundary2, boundary3 and boundary4 of the four branches. The differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules: each feature map selected from S1 passes through a Pixel Difference Convolution Module (PDCM) and then a Spatial Attention Module (SAM) to obtain the corresponding boundary feature map. The pixel difference convolution module contains a differential convolution layer with a 3 × 3 kernel, a ReLU layer and a convolution layer with a 1 × 1 kernel. The pixel difference convolution layer combines the traditional edge detection operator LBP (local binary pattern) with a convolutional neural network: a 3 × 3 kernel computes pixel differences in the 8-neighborhood of each local region of the image, and these differences are multiplied element-wise by the kernel weights and summed to produce the values of the output feature map. The spatial attention module contains two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel, a ReLU layer and a Sigmoid function; a 1 × 1 convolution compresses the feature map into a single channel, and bilinear interpolation restores the feature map to the original size. Finally, the boundary feature maps boundary1, boundary2, boundary3 and boundary4 obtained from the four branches are spliced into the multi-scale edge feature map boundary5, and the edge segmentation map of the transparent object is obtained through a convolution layer with a 1 × 1 kernel and a Sigmoid function.
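A minimal PyTorch-style sketch of the PDCM and SAM is given below; the central-difference formulation of the pixel difference convolution and the exact wiring of the attention output are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PixelDifferenceConv(nn.Module):
        # 3x3 convolution applied to the differences between each neighbour and the centre pixel:
        # sum_j w_j * (x_j - x_c) = conv(x) - (sum_j w_j) * x_c
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)

        def forward(self, x):
            kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)   # (out_ch, in_ch, 1, 1)
            return self.conv(x) - F.conv2d(x, kernel_sum)

    class PDCM(nn.Module):
        # pixel difference convolution module: PDC 3x3 -> ReLU -> 1x1 convolution
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.pdc = PixelDifferenceConv(in_ch, out_ch)
            self.proj = nn.Conv2d(out_ch, out_ch, 1)

        def forward(self, x):
            return self.proj(F.relu(self.pdc(x)))

    class SAM(nn.Module):
        # spatial attention: 1x1 conv, 3x3 conv, ReLU, 1x1 conv to a single channel, Sigmoid
        def __init__(self, in_ch, mid_ch=16):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 1), nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
                nn.ReLU(inplace=True), nn.Conv2d(mid_ch, 1, 1), nn.Sigmoid())

        def forward(self, x, out_size):
            boundary = self.body(x)                                       # single-channel boundary response
            return F.interpolate(boundary, size=out_size, mode='bilinear', align_corners=False)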
This step includes multiple pixel difference convolution modules and spatial attention modules. The pixel difference convolution module obtains rich boundary information by convolving the differences between pixels with the kernel weights, and the spatial attention module reduces the interference of background noise.
An edge prediction image of the transparent object is obtained after feature dimensionality reduction, and the edge loss function L1 is calculated as one part of the total loss function L; it participates in the network weight update during gradient descent to optimize the model parameters. L1 adopts a cross-entropy loss function, where p_i is the prediction for pixel i at the boundary and y_i is the ground truth for pixel i at the boundary; the calculation formula is as follows:
L1 = -\sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
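In code, the edge loss and the total loss can be sketched as follows, assuming a sigmoid-activated boundary prediction and standard PyTorch loss functions; the reduction mode and any loss weighting are assumptions.

    import torch
    import torch.nn.functional as F

    def edge_loss(pred_boundary, gt_boundary):
        # binary cross-entropy L1 between predicted (p_i) and ground-truth (y_i) boundary maps
        return F.binary_cross_entropy(pred_boundary, gt_boundary.float())

    def total_loss(pred_boundary, gt_boundary, pred_seg, gt_seg):
        # total loss L = L1 (edge) + L2 (segmentation), both cross-entropy terms
        l1 = edge_loss(pred_boundary, gt_boundary)
        l2 = F.cross_entropy(pred_seg, gt_seg)   # multi-class cross entropy over segmentation classes
        return l1 + l2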
s3: performing category-level context modeling on the high-resolution feature map feature5_1 and the multi-scale fused low-resolution feature map feature6 from S1 by using the region attention module, and enhancing the features of pixels from the same class of object to obtain the pixel-region enhanced feature map. The high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map are fused, the final transparent object segmentation result is obtained after feature dimensionality reduction, and the loss function L2 of the transparent object is calculated as the other part of the total loss function L, where L2 is the same cross-entropy loss function as L1 and the total loss function L is the sum of L1 and L2. The specific implementation is as follows:
A Softmax operation is performed on the multi-scale fused low-resolution feature map feature6 to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation classes; R_k is a two-dimensional map in which each element represents the probability that the corresponding pixel belongs to class k. The k-th region representation feature is the weighted sum of the features of all pixels with their probability of belonging to region k, expressed as follows:

f_k = \sum_{i} r_{ki} x_i

where x_i denotes the feature of pixel p_i, r_{ki} denotes the probability that pixel p_i belongs to region k, and f_k is the region representation feature. Then the correspondence between each pixel and each region is calculated through a self-attention mechanism with the following formulas:

w_{ik} = \frac{\exp\left(t(x_i, f_k)\right)}{\sum_{j=1}^{K} \exp\left(t(x_i, f_j)\right)}

y_i = \sum_{k=1}^{K} w_{ik}\, u_3(f_k)

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1 × 1 convolution, a BatchNorm layer and a ReLU layer. u_1(x)^T and u_2(f) serve as the keys and queries of the self-attention mechanism, the correlation between the pixel features and the region representation features is calculated and normalized to obtain w_{ik}, and w_{ik} is used as a weight and multiplied with u_3(f_k) to obtain the pixel-region enhanced feature y_i; the pixel-region enhanced features y_i of all pixels form the pixel-region enhanced feature map y_aug. The high-resolution feature map feature5_1, the multi-scale edge feature map boundary5 and the pixel-region enhanced feature map y_aug are fused by a splicing operation, and the segmentation result of the transparent object is finally obtained through a convolution layer with a 1 × 1 kernel and a Sigmoid function, addressing the problems that the transparent object is occluded and affected by the environment.
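The final fusion step can be sketched as follows; the splicing (concatenation), 1 × 1 convolution and Sigmoid follow the description above, while the channel counts and the number of output classes are assumptions.

    import torch
    import torch.nn as nn

    class FusionHead(nn.Module):
        # concatenate feature5_1, boundary5 and y_aug, then reduce channels to the segmentation output
        def __init__(self, high_ch, boundary_ch, region_ch, num_classes):
            super().__init__()
            self.classifier = nn.Conv2d(high_ch + boundary_ch + region_ch, num_classes, kernel_size=1)

        def forward(self, feature5_1, boundary5, y_aug):
            fused = torch.cat([feature5_1, boundary5, y_aug], dim=1)   # channel-wise splicing
            return torch.sigmoid(self.classifier(fused))               # per-pixel transparent-object prediction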
The performance of the invention compared with some general image segmentation algorithms on transparent object data sets Trans10K-v2 is shown in Table 1, wherein mIOU represents the average value of the intersection and union ratio of real and predicted values of each category, and ACC represents the pixel accuracy.
TABLE 1
[Table 1: comparison of mIOU and ACC on Trans10K-v2; provided as an image in the original publication, values not reproduced here.]
As can be seen from the table, compared with current mainstream semantic segmentation algorithms, the method provided by the invention has clear advantages on both performance indexes, ACC and mIOU. Compared with UNet, the performance indexes of the invention are greatly improved, demonstrating that the dual-resolution feature extraction module extracts more robust features of transparent objects.
The embodiment of the invention also provides a transparent object image segmentation system based on the differential boundary attention and the regional attention, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the above-described embodiment method.
Finally, the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all of them should be covered in the claims of the present invention.

Claims (7)

1. A transparent object image segmentation method is characterized by comprising the following steps:
s1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch, and inputting the input image into the dual-resolution feature extraction module; the high-resolution branch maintains accurate spatial position information by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion, obtaining a high-resolution feature map at 1/8 of the original image size; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map, obtaining a low-resolution feature map at 1/64 of the original image size; a depth pyramid pooling module is added at the end of the low-resolution branch to enlarge the effective receptive field and fuse multi-scale context information, obtaining a multi-scale fused low-resolution feature map;
the dual-resolution feature extraction network is composed of six levels, namely conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM, where x is 1 or 2; x = 1 represents the high-resolution branch and x = 2 represents the low-resolution branch;
conv1 contains a convolution layer with stride 2 and a 3 × 3 kernel, a BatchNorm layer and a ReLU layer; the conv1 layer is used to change the dimensions of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and yields the feature map feature2 at 1/8 of the original image size;
conv3_x splits into two parallel high- and low-resolution branches, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain the high-resolution branch feature map feature3_1 at 1/8 of the original image size; conv3_2 downsamples the output of conv2 to obtain the low-resolution branch feature map feature3_2 at 1/16 of the original image size;
conv4_x is divided into two parallel high- and low-resolution branches, conv4_1 and conv4_2; conv4_1 continuously fuses low-resolution information and maintains the high-resolution branch feature map feature4_1 at 1/8 of the original image size; conv4_2 obtains the low-resolution branch feature map feature4_2 at 1/32 of the original image size;
conv5_x is divided into two parallel high- and low-resolution branches, conv5_1 and conv5_2; conv5_1 continuously fuses low-resolution information and maintains the high-resolution branch feature map feature5_1 at 1/8 of the original image size; conv5_2 obtains the low-resolution branch feature map feature5_2 at 1/64 of the original image size;
DPPM is used for enlarging the receptive field and fusing multi-scale context information;
s2: using a differential boundary attention module, differential convolution and spatial attention operations are respectively performed on feature map feature2 at 1/8 of the original image size, the low-resolution branch feature map feature3_2 at 1/16 of the original image size, the low-resolution branch feature map feature4_2 at 1/32 of the original image size and the low-resolution branch feature map feature5_2 at 1/64 of the original image size extracted in S1; multi-scale edge feature maps are extracted and fused, and an edge prediction image of the transparent object is obtained after feature dimensionality reduction;
s3: performing category-level context modeling on the high-resolution feature map obtained in step S1 and the multi-scale fused low-resolution feature map by using a region attention module, and enhancing the features of pixels from the same class of object to obtain a pixel-region enhanced feature map; and fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map, and obtaining the final transparent object segmentation result after feature dimensionality reduction.
2. The transparent object image segmentation method as claimed in claim 1, wherein the Basic Block of conv2 includes two convolution layers with 3 × 3 kernels and an Identity Block, wherein the 3 × 3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block copies the shallow features, preventing the gradient from vanishing as the network deepens.
3. The transparent object image segmentation method as claimed in claim 1, wherein the conv4_1 operation is performed on feature map feature3_1 to obtain feature map hfeature3_1; feature map feature3_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size; the conv4_2 operation is performed on feature map feature3_2 to obtain feature map lfeature3_2; feature map feature3_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size; conv5_x is composed of cascaded residual blocks (Bottleneck Block), each containing two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel and an Identity Block, which reduces the computational cost in the deep part of the network; the conv5_1 operation is performed on feature map feature4_1 to obtain feature map hfeature4_1; feature map feature4_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size; the conv5_2 operation is performed on feature map feature4_2 to obtain feature map lfeature4_2; feature map feature4_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
4. The method according to claim 3, wherein the DPPM comprises five parallel branches: feature map feature5_2 passes through a 1 × 1 convolution to obtain feature map y1; feature5_2 passes through a pooling layer with kernel_size = 3 and stride = 2, a 1 × 1 convolution and upsampling, the result is fused with y1, and the fused map passes through a 3 × 3 convolution to obtain feature map y2; feature5_2 passes through a pooling layer with kernel_size = 5 and stride = 4, a 1 × 1 convolution and upsampling, the result is fused with y2, and the fused map passes through a 3 × 3 convolution to obtain feature map y3; feature5_2 passes through a pooling layer with kernel_size = 9 and stride = 8, a 1 × 1 convolution and upsampling, the result is fused with y3, and the fused map passes through a 3 × 3 convolution to obtain feature map y4; feature5_2 passes through global average pooling, a 1 × 1 convolution and upsampling, the result is fused with y4, and the fused map passes through a 3 × 3 convolution to obtain feature map y5; the feature maps y1, y2, y3, y4 and y5 are spliced and a 1 × 1 convolution changes the number of channels to obtain the final multi-scale fused low-resolution feature map feature6.
5. The transparent object image segmentation method as claimed in claim 1, wherein the differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules; the pixel difference convolution module comprises a differential convolution layer with a 3 × 3 kernel, a ReLU layer and a convolution layer with a 1 × 1 kernel; the spatial attention module comprises two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 passes through a Pixel Difference Convolution Module (PDCM) and then a Spatial Attention Module (SAM) to obtain the corresponding boundary feature map.
6. The method for segmenting the transparent object image according to claim 1, wherein the specific steps of obtaining the pixel-region enhanced feature map in S3 are:
s3-1: performing a Softmax operation on the multi-scale fused low-resolution feature map to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation classes; R_k is a two-dimensional map in which each element represents the probability that the corresponding pixel belongs to class k;
s3-2: the k-th region representation feature is obtained with the following formula, i.e. the features of all pixels of the whole image are weighted and summed with the probability that each pixel belongs to region k:

f_k = \sum_{i} r_{ki} x_i

where x_i denotes the feature of pixel p_i, r_{ki} denotes the probability that pixel p_i belongs to region k, and f_k is the region representation feature;
s3-3: the correspondence between each pixel and each region is calculated through a self-attention mechanism with the following formulas:

w_{ik} = \frac{\exp\left(t(x_i, f_k)\right)}{\sum_{j=1}^{K} \exp\left(t(x_i, f_j)\right)}

y_i = \sum_{k=1}^{K} w_{ik}\, u_3(f_k)

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1 × 1 convolution, a BatchNorm layer and a ReLU layer; u_1(x)^T and u_2(f) serve as the keys and queries of the self-attention mechanism, the correlation between the pixel features and the region representation features is calculated and normalized to obtain w_{ik}, and w_{ik} is used as a weight and multiplied with u_3(f_k) to obtain the pixel-region enhanced feature y_i;
s3-4: the pixel-region enhanced features y_i of all pixels form the pixel-region enhanced feature map y_aug.
7. The method for segmenting the transparent object image according to claim 1, wherein the edge prediction image of the transparent object obtained in S2 is used for calculating the edge loss function L1, the segmentation result of the transparent object obtained in S3 is used for calculating the loss function L2, and the total loss function L is the sum of the edge loss function L1 and the loss function L2; L participates in the network weight update during gradient descent to optimize the model parameters; the edge loss function L1 and the loss function L2 both use cross-entropy loss functions; taking the edge loss function L1 as an example, the calculation formula is as follows:

L1 = -\sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

where p_i is the prediction for pixel i at the boundary and y_i is the ground truth for pixel i at the boundary.
CN202210633162.3A 2022-06-07 Transparent object image segmentation method and system Active CN115082675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210633162.3A CN115082675B (en) 2022-06-07 Transparent object image segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210633162.3A CN115082675B (en) 2022-06-07 Transparent object image segmentation method and system

Publications (2)

Publication Number Publication Date
CN115082675A (en) 2022-09-20
CN115082675B CN115082675B (en) 2024-06-04


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294412A (en) * 2022-10-10 2022-11-04 临沂大学 Real-time coal rock segmentation network generation method based on deep learning
CN115880567A (en) * 2023-03-03 2023-03-31 深圳精智达技术股份有限公司 Self-attention calculation method and device, electronic equipment and storage medium
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN117788722A (en) * 2024-02-27 2024-03-29 国能大渡河金川水电建设有限公司 BIM-based safety data monitoring system for underground space

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1854965B1 (en) * 2006-05-02 2009-10-21 Carl Freudenberg KG Oil seal
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN114359297A (en) * 2022-01-04 2022-04-15 浙江大学 Attention pyramid-based multi-resolution semantic segmentation method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1854965B1 (en) * 2006-05-02 2009-10-21 Carl Freudenberg KG Oil seal
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN114359297A (en) * 2022-01-04 2022-04-15 浙江大学 Attention pyramid-based multi-resolution semantic segmentation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAI Pengbo; YANG Hao; SONG Tingting; YU Kang; MA Longxiang; HUANG Xiangsheng: "Dual-path semantic segmentation combined with attention mechanism", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12), pages 119 - 128 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294412A (en) * 2022-10-10 2022-11-04 临沂大学 Real-time coal rock segmentation network generation method based on deep learning
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309274B (en) * 2022-12-12 2024-01-30 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN115880567A (en) * 2023-03-03 2023-03-31 深圳精智达技术股份有限公司 Self-attention calculation method and device, electronic equipment and storage medium
CN115880567B (en) * 2023-03-03 2023-07-25 深圳精智达技术股份有限公司 Self-attention calculating method and device, electronic equipment and storage medium
CN117788722A (en) * 2024-02-27 2024-03-29 国能大渡河金川水电建设有限公司 BIM-based safety data monitoring system for underground space
CN117788722B (en) * 2024-02-27 2024-05-03 国能大渡河金川水电建设有限公司 BIM-based safety data monitoring system for underground space

Similar Documents

Publication Publication Date Title
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN110738207A (en) character detection method for fusing character area edge information in character image
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN113256649B (en) Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN112001931A (en) Image segmentation method, device, equipment and storage medium
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN111401380A (en) RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization
CN115775316A (en) Image semantic segmentation method based on multi-scale attention mechanism
CN114119993A (en) Salient object detection method based on self-attention mechanism
CN112926533A (en) Optical remote sensing image ground feature classification method and system based on bidirectional feature fusion
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN117576402B (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method
Ren et al. Robust low-rank deep feature recovery in cnns: Toward low information loss and fast convergence
CN112215241B (en) Image feature extraction device based on small sample learning
CN114445620A (en) Target segmentation method for improving Mask R-CNN
Van Hoai et al. Feeding Convolutional Neural Network by hand-crafted features based on Enhanced Neighbor-Center Different Image for color texture classification
Chan et al. Asymmetric cascade fusion network for building extraction
CN113888505A (en) Natural scene text detection method based on semantic segmentation
CN115035402B (en) Multistage feature aggregation system and method for land cover classification problem
CN116681978A (en) Attention mechanism and multi-scale feature fusion-based saliency target detection method
CN115082675B (en) Transparent object image segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant