CN115082675A - Transparent object image segmentation method and system - Google Patents

Transparent object image segmentation method and system

Info

Publication number
CN115082675A
CN115082675A
Authority
CN
China
Prior art keywords
feature map
resolution
convolution
feature
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210633162.3A
Other languages
Chinese (zh)
Other versions
CN115082675B (en)
Inventor
胡泊
王勇
邹逸群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210633162.3A priority Critical patent/CN115082675B/en
Priority claimed from CN202210633162.3A external-priority patent/CN115082675B/en
Publication of CN115082675A publication Critical patent/CN115082675A/en
Application granted granted Critical
Publication of CN115082675B publication Critical patent/CN115082675B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 - Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a transparent object image segmentation method, which comprises the following steps. S1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch to obtain a high-resolution feature map and a multi-scale fused low-resolution feature map. S2: performing differential convolution and spatial attention operations on the feature maps of different scales extracted in S1 by using a differential boundary attention module, extracting multi-scale edge feature maps and fusing them. S3: performing category-level context modeling on the high-resolution feature map and the multi-scale fused low-resolution feature map by using a region attention module to obtain a pixel-region enhanced feature map. The high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map are fused, and the final transparent object segmentation result is obtained after feature dimensionality reduction, which effectively alleviates the loss of semantic information caused by factors such as the environment and occlusion of transparent objects.

Description

Transparent object image segmentation method and system
Technical Field
The invention relates to the field of computer vision, in particular to a transparent object image segmentation method and a transparent object image segmentation system.
Background
Image semantic segmentation is one of the key technologies that allow an intelligent system to understand a natural scene. However, for transparent objects, which are ubiquitous in the real world, existing general-purpose image segmentation methods often fail to produce satisfactory results. The main problems are as follows: transparent objects are easily affected by environmental factors, which makes it difficult to extract robust features; transparent objects are easily occluded, which leaves their semantic information incomplete; and the edges of transparent objects are segmented inaccurately. All of these problems ultimately degrade the segmentation of transparent objects.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a transparent object image segmentation method which, by adding high-resolution features and a self-attention mechanism, continuously enhances the features of pixels belonging to the same object and effectively alleviates the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a transparent object image segmentation method, comprising the steps of:
s1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch, and inputting the input image into the dual-resolution feature extraction module; the high-resolution branch maintains accurate spatial position information by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion, obtaining a high-resolution feature map at 1/8 of the original image size; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map, obtaining a low-resolution feature map at 1/64 of the original image size; a depth pyramid pooling module is added at the end of the low-resolution branch to enlarge the effective receptive field and fuse multi-scale context information, obtaining a multi-scale fused low-resolution feature map;
s2: difference of utilizationThe boundary attention module respectively performs differential convolution and space attention operations on the feature maps with different dimensions extracted in the S1, extracts multi-scale edge feature maps and fuses the feature maps, obtains an edge prediction image of the transparent object after feature dimensionality reduction, calculates an edge loss function L1 as a part of a total loss function L, and participates in network weight updating in the gradient descent process to optimize model parameters; where L1 employs a cross-entropy loss function, p i Is the prediction of the pixel i at the boundary, y i Is the actual result of pixel i at the boundary; the calculation formula is as follows:
L1 = -\sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
s3: performing category-level context modeling on the high-resolution feature map obtained in S1 and the multi-scale fused low-resolution feature map by using a region attention module, and enhancing the features of pixels from the same class of object to obtain a pixel-region enhanced feature map; fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map, obtaining the final transparent object segmentation result after feature dimensionality reduction, and calculating the loss function L2 of the transparent object as the other part of the total loss function L, wherein L2 is the same cross-entropy loss function as L1 and the total loss function L is the sum of L1 and L2.
Further, the dual-resolution feature extraction network is composed of six levels, namely conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM, where x = 1 or 2: x = 1 denotes the high-resolution branch and x = 2 denotes the low-resolution branch;
conv1 contains a convolution layer with stride 2 and a 3 × 3 kernel, a BatchNorm layer and a ReLU layer; the conv1 layer is used to change the dimensions of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and yields the feature map feature2 at 1/8 of the original image size;
conv3_x splits into two parallel high- and low-resolution branches, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain the high-resolution branch feature map feature3_1 at 1/8 of the original image size; conv3_2 downsamples the output of conv2 to obtain the low-resolution branch feature map feature3_2 at 1/16 of the original image size;
conv4_x is divided into two parallel high- and low-resolution branches, conv4_1 and conv4_2; conv4_1 continuously blends in low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size; conv4_2 produces a low-resolution branch feature map at 1/32 of the original image size;
conv5_x is divided into two parallel high- and low-resolution branches, conv5_1 and conv5_2; conv5_1 continuously blends in low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size; conv5_2 produces a low-resolution branch feature map at 1/64 of the original image size;
DPPM is used to expand the receptive field and fuse multi-scale contextual information.
Furthermore, the Basic Block of conv2 contains two convolution layers with 3 × 3 kernels and an Identity Block; the 3 × 3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block copies the shallow features, preventing the gradient from vanishing as the network deepens.
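By way of illustration, a minimal PyTorch-style sketch of such a residual block could look as follows; channel counts and the exact placement of BatchNorm and ReLU are assumptions, not details taken from the patent.

    import torch
    import torch.nn as nn

    class BasicBlock(nn.Module):
        # residual block: two 3x3 convolutions plus an identity shortcut
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            identity = x                                  # shallow features copied by the Identity Block
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + identity)              # residual addition mitigates vanishing gradients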
Further, the conv4_1 operation is performed on feature map feature3_1 to obtain feature map hfeature3_1; feature map feature3_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The conv4_2 operation is performed on feature map feature3_2 to obtain feature map lfeature3_2; feature map feature3_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size.
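This bidirectional cross fusion can be sketched as below in a PyTorch-style illustration; channel counts, the number of stride-2 downsampling steps and the normalization details are assumptions chosen to keep the example self-contained.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossFusion(nn.Module):
        # bidirectional fusion between a high-resolution map and a low-resolution map
        def __init__(self, high_ch, low_ch, num_down=2):
            super().__init__()
            # low -> high: 1x1 convolution compresses channels, bilinear upsampling restores resolution
            self.compress = nn.Conv2d(low_ch, high_ch, kernel_size=1, bias=False)
            # high -> low: stacked 3x3 stride-2 convolutions; num_down is chosen to match the resolution gap
            downs, ch = [], high_ch
            for i in range(num_down):
                out_ch = low_ch if i == num_down - 1 else ch
                downs += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1, bias=False),
                          nn.BatchNorm2d(out_ch)]
                ch = out_ch
            self.down = nn.Sequential(*downs)

        def forward(self, high, low):
            up = F.interpolate(self.compress(low), size=high.shape[2:],
                               mode='bilinear', align_corners=False)
            fused_high = high + up                 # e.g. hfeature + upsampled low branch -> feature4_1
            fused_low = low + self.down(high)      # e.g. lfeature + downsampled high branch -> feature4_2
            return fused_high, fused_low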
Further, conv5_x is composed of cascaded residual blocks (Bottleneck Block); each Bottleneck Block contains two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel and an Identity Block, which reduces the computational cost in the deep part of the network. The conv5_1 operation is performed on feature map feature4_1 to obtain feature map hfeature4_1; feature map feature4_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The conv5_2 operation is performed on feature map feature4_2 to obtain feature map lfeature4_2; feature map feature4_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
Further, the DPPM contains five parallel branches: feature map feature5_2 passes through a 1 × 1 convolution to obtain feature map y1; feature5_2 passes through a pooling layer with kernel_size = 3 and stride = 2, a 1 × 1 convolution and upsampling, the result is fused with y1, and the fused map passes through a 3 × 3 convolution to obtain feature map y2; feature5_2 passes through a pooling layer with kernel_size = 5 and stride = 4, a 1 × 1 convolution and upsampling, the result is fused with y2, and the fused map passes through a 3 × 3 convolution to obtain feature map y3; feature5_2 passes through a pooling layer with kernel_size = 9 and stride = 8, a 1 × 1 convolution and upsampling, the result is fused with y3, and the fused map passes through a 3 × 3 convolution to obtain feature map y4; feature5_2 passes through global average pooling, a 1 × 1 convolution and upsampling, the result is fused with y4, and the fused map passes through a 3 × 3 convolution to obtain feature map y5; the feature maps y1, y2, y3, y4 and y5 are spliced and a 1 × 1 convolution changes the number of channels to obtain the final multi-scale fused low-resolution feature map feature6.
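A minimal PyTorch-style sketch of such a pyramid pooling head is given below; the pooling kernel sizes and strides follow the description, while channel counts, padding values and the use of average pooling are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DPPM(nn.Module):
        # deep pyramid pooling: hierarchical multi-scale context aggregation over feature5_2
        def __init__(self, in_ch, mid_ch, out_ch):
            super().__init__()
            self.scale0 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)                 # branch y1
            self.pools = nn.ModuleList([
                nn.AvgPool2d(kernel_size=3, stride=2, padding=1),                 # branch y2
                nn.AvgPool2d(kernel_size=5, stride=4, padding=2),                 # branch y3
                nn.AvgPool2d(kernel_size=9, stride=8, padding=4),                 # branch y4
                nn.AdaptiveAvgPool2d(1),                                          # branch y5 (global average)
            ])
            self.reduce = nn.ModuleList([nn.Conv2d(in_ch, mid_ch, 1, bias=False) for _ in range(4)])
            self.fuse = nn.ModuleList([nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False) for _ in range(4)])
            self.out = nn.Conv2d(mid_ch * 5, out_ch, 1, bias=False)

        def forward(self, x):
            size = x.shape[2:]
            ys = [self.scale0(x)]                                                 # y1
            for pool, red, fuse in zip(self.pools, self.reduce, self.fuse):
                y = F.interpolate(red(pool(x)), size=size, mode='bilinear', align_corners=False)
                ys.append(fuse(y + ys[-1]))                                       # fuse with the previous branch, then 3x3 conv
            return self.out(torch.cat(ys, dim=1))                                 # splice y1..y5, 1x1 conv -> feature6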
Further, the differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules; each pixel difference convolution module contains a differential convolution layer with a 3 × 3 kernel, a ReLU layer and a convolution layer with a 1 × 1 kernel; the spatial attention module contains two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 passes through a Pixel Difference Convolution Module (PDCM) and then a Spatial Attention Module (SAM) to obtain the corresponding boundary feature map.
Further, the specific steps of obtaining the pixel-region enhanced feature map in S3 are:
s3-1: performing a Softmax operation on the multi-scale fused low-resolution feature map to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation classes; R_k is a two-dimensional map in which each element represents the probability that the corresponding pixel belongs to class k;
s3-2: the k-th region representation feature is obtained with the following formula, i.e. the features of all pixels of the whole image are weighted and summed with the probability that each pixel belongs to region k:

f_k = \sum_{i} r_{ki} x_i

where x_i denotes the feature of pixel p_i, r_{ki} denotes the probability that pixel p_i belongs to region k, and f_k is the region representation feature;
s3-3: the correspondence between each pixel and each region is calculated through a self-attention mechanism with the following formulas:

w_{ik} = \frac{\exp\left(t(x_i, f_k)\right)}{\sum_{j=1}^{K} \exp\left(t(x_i, f_j)\right)}

y_i = \sum_{k=1}^{K} w_{ik}\, u_3(f_k)

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1 × 1 convolution, a BatchNorm layer and a ReLU layer; u_1(x)^T and u_2(f) serve as the keys and queries of the self-attention mechanism, the correlation between the pixel features and the region representation features is calculated and normalized to obtain w_{ik}, and w_{ik} is used as a weight and multiplied with u_3(f_k) to obtain the pixel-region enhanced feature y_i;
s3-4: the pixel-region enhanced features y_i of all pixels form the pixel-region enhanced feature map y_aug.
The invention has the beneficial effects that:
1. Compared with traditional edge extraction operators, the boundary attention in this method, after adding the convolution layers and the ReLU layer, is less easily affected by factors such as the environment and illumination when extracting the edge feature map, and generalizes better. Edge feature extraction modules based on ordinary convolutional neural networks optimize their convolution kernel parameters from random initialization and do not encode gradient information, so they have difficulty focusing on edge-related features. The boundary attention in the invention instead uses pixel difference convolution: based on the way edges arise, the differences between adjacent pixels are used to encode gradient information and optimize the convolution kernel parameters, and spatial attention is added after the pixel difference convolution to reduce the interference of background noise, so the extracted edge features are more effective.
2. Unlike current image segmentation algorithms that exploit global context information (represented by DeepLabv3+), the region attention in the invention models class-level context: a coarse segmentation region is first generated from the low-resolution feature map, and then high-resolution features and a self-attention mechanism are added to continuously enhance the features of pixels belonging to the same class of object, which effectively alleviates the loss of semantic information caused by factors such as the environment and occlusion of transparent objects.
Drawings
Fig. 1 is a block diagram of a network model structure of a transparent object image segmentation method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a differential boundary attention module according to an embodiment of the present invention.
Fig. 3 is a block diagram of a regional attention module according to an embodiment of the present invention.
Fig. 4 is a block diagram of a cross-fusion structure of a high resolution branch and a low resolution branch according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be further described with reference to the drawings and examples. It should be noted that the examples do not limit the scope of the claimed invention.
Example 1
As shown in fig. 1 to 4, a method for segmenting an image of a transparent object includes the following steps:
s1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch; the high-resolution branch maintains accurate spatial position information by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion, obtaining a high-resolution feature map at 1/8 of the original image size; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map, obtaining a low-resolution feature map at 1/64 of the original image size; a depth pyramid pooling module is added at the end of the low-resolution branch to enlarge the effective receptive field and fuse multi-scale context information, obtaining a multi-scale fused low-resolution feature map. An original image is collected by a camera and preprocessed by random cropping, random flipping, photometric distortion and normalization to obtain the input image, which is fed into the dual-resolution feature extraction module. The specific implementation is as follows:
The dual-resolution feature extraction network is composed of six levels, namely conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM (x = 1 or 2, where 1 denotes the high-resolution branch and 2 denotes the low-resolution branch). The conv1 layer contains a convolution layer with stride 2 and a 3 × 3 kernel, a BatchNorm layer and a ReLU layer; it changes the dimensions of the input image, and the conv1 operation yields feature map feature1. conv2 is composed of cascaded Basic Blocks, each containing two convolution layers with 3 × 3 kernels and an Identity Block; the 3 × 3 convolution layers extract different input features while keeping the computational cost low, and the Identity Block copies the shallow features to prevent the gradient from vanishing as the network deepens; the conv2 operation yields feature map feature2 at 1/8 of the original image size. conv3 splits into two parallel high- and low-resolution branches, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 and yields the high-resolution branch feature map feature3_1 at 1/8 of the original image size; conv3_2 downsamples the output of conv2 and yields the low-resolution branch feature map feature3_2 at 1/16 of the original image size. The conv4_1 operation on feature3_1 yields feature map hfeature3_1; feature3_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to yield feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused into the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The conv4_2 operation on feature3_2 yields feature map lfeature3_2; feature3_1 is downsampled by a 3 × 3 convolution with stride 2 to yield feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused into the low-resolution branch feature map feature4_2 at 1/32 of the original image size, completing the cross fusion between the high-resolution and low-resolution branches at this level. conv5_x is composed of cascaded Bottleneck Blocks, each containing two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel and an Identity Block, which reduces the computational cost in the deep part of the network. The conv5_1 operation on feature4_1 yields feature map hfeature4_1; feature4_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to yield feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused into the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The conv5_2 operation on feature4_2 yields feature map lfeature4_2; feature4_1 is downsampled by a 3 × 3 convolution with stride 2 to yield feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused into the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
A Depth Pyramid Pooling Module (DPPM) is added after feature map feature5_2 to effectively enlarge the receptive field and fuse multi-scale context information. It contains five parallel branches: feature5_2 passes through a 1 × 1 convolution to obtain feature map y1; feature5_2 passes through a pooling layer with kernel_size = 3 and stride = 2, a 1 × 1 convolution and upsampling, the result is fused with y1, and the fused map passes through a 3 × 3 convolution to obtain feature map y2; feature5_2 passes through a pooling layer with kernel_size = 5 and stride = 4, a 1 × 1 convolution and upsampling, the result is fused with y2, and the fused map passes through a 3 × 3 convolution to obtain feature map y3; feature5_2 passes through a pooling layer with kernel_size = 9 and stride = 8, a 1 × 1 convolution and upsampling, the result is fused with y3, and the fused map passes through a 3 × 3 convolution to obtain feature map y4; feature5_2 passes through global average pooling, a 1 × 1 convolution and upsampling, the result is fused with y4, and the fused map passes through a 3 × 3 convolution to obtain feature map y5; the feature maps y1, y2, y3, y4 and y5 are spliced and a 1 × 1 convolution changes the number of channels to obtain the final multi-scale fused low-resolution feature map feature6.
In this step, the high-resolution feature map is maintained by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion; the resulting high-resolution feature map provides rich detail information and greatly helps to improve the precision of the segmentation result. The low-resolution branch extracts rich semantic information through continuous downsampling and cross fusion with the high-resolution feature map; the feature map at the end of this branch is 1/64 of the original image, and the added depth pyramid module enlarges the effective receptive field, fuses multi-scale context information and reduces the computational cost of the model. Unlike most existing serially connected feature extraction modules, the dual-resolution branches are connected in parallel: the high-resolution branch always maintains accurate spatial position information and continuously integrates low-resolution information, avoiding the information loss caused by downsampling and then recovering the resolution in a serial design. This effectively alleviates the difficulty of extracting feature information from transparent objects affected by background and illumination changes, and is of great importance for the subsequent refinement of regions.
S2: the method comprises the following steps of respectively performing differential convolution and spatial attention operations on four feature maps, feature2, feature3_2, feature4_2 and feature5_2 of different scales extracted in S1 by using a differential boundary attention module, extracting multi-scale edge feature maps, fusing the edge feature maps, and obtaining an edge segmentation image of a transparent object after feature dimensionality reduction, wherein the specific implementation method comprises the following steps:
The four feature maps of different scales selected from S1, feature2, feature3_2, feature4_2 and feature5_2, are passed through the differential boundary attention module to obtain the boundary feature maps boundary1, boundary2, boundary3 and boundary4 of the four branches. The differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules: each feature map selected from S1 passes through a Pixel Difference Convolution Module (PDCM) and then a Spatial Attention Module (SAM) to obtain the corresponding boundary feature map. The pixel difference convolution module contains a differential convolution layer with a 3 × 3 kernel, a ReLU layer and a convolution layer with a 1 × 1 kernel. The pixel difference convolution layer combines the traditional edge detection operator LBP (local binary pattern) with a convolutional neural network: a 3 × 3 kernel computes pixel differences in the 8-neighborhood of each local region of the image, and these differences are multiplied element-wise by the kernel weights and summed to produce the values of the output feature map. The spatial attention module contains two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel, a ReLU layer and a Sigmoid function; a 1 × 1 convolution compresses the feature map into a single channel, and bilinear interpolation restores the feature map to the original size. Finally, the boundary feature maps boundary1, boundary2, boundary3 and boundary4 obtained from the four branches are spliced into the multi-scale edge feature map boundary5, and the edge segmentation map of the transparent object is obtained through a convolution layer with a 1 × 1 kernel and a Sigmoid function.
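A minimal PyTorch-style sketch of the PDCM and SAM is given below; the central-difference formulation of the pixel difference convolution and the exact wiring of the attention output are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PixelDifferenceConv(nn.Module):
        # 3x3 convolution applied to the differences between each neighbour and the centre pixel:
        # sum_j w_j * (x_j - x_c) = conv(x) - (sum_j w_j) * x_c
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)

        def forward(self, x):
            kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)   # (out_ch, in_ch, 1, 1)
            return self.conv(x) - F.conv2d(x, kernel_sum)

    class PDCM(nn.Module):
        # pixel difference convolution module: PDC 3x3 -> ReLU -> 1x1 convolution
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.pdc = PixelDifferenceConv(in_ch, out_ch)
            self.proj = nn.Conv2d(out_ch, out_ch, 1)

        def forward(self, x):
            return self.proj(F.relu(self.pdc(x)))

    class SAM(nn.Module):
        # spatial attention: 1x1 conv, 3x3 conv, ReLU, 1x1 conv to a single channel, Sigmoid
        def __init__(self, in_ch, mid_ch=16):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 1), nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
                nn.ReLU(inplace=True), nn.Conv2d(mid_ch, 1, 1), nn.Sigmoid())

        def forward(self, x, out_size):
            boundary = self.body(x)                                       # single-channel boundary response
            return F.interpolate(boundary, size=out_size, mode='bilinear', align_corners=False)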
This step includes multiple pixel difference convolution modules and spatial attention modules. The pixel difference convolution module obtains rich boundary information by convolving the differences between pixels with the kernel weights, and the spatial attention module reduces the interference of background noise.
An edge prediction image of the transparent object is obtained after feature dimensionality reduction, and the edge loss function L1 is calculated as one part of the total loss function L; it participates in the network weight update during gradient descent to optimize the model parameters. L1 adopts a cross-entropy loss function, where p_i is the prediction for pixel i at the boundary and y_i is the ground truth for pixel i at the boundary; the calculation formula is as follows:
L1 = -\sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
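In code, the edge loss and the total loss can be sketched as follows, assuming a sigmoid-activated boundary prediction and standard PyTorch loss functions; the reduction mode and any loss weighting are assumptions.

    import torch
    import torch.nn.functional as F

    def edge_loss(pred_boundary, gt_boundary):
        # binary cross-entropy L1 between predicted (p_i) and ground-truth (y_i) boundary maps
        return F.binary_cross_entropy(pred_boundary, gt_boundary.float())

    def total_loss(pred_boundary, gt_boundary, pred_seg, gt_seg):
        # total loss L = L1 (edge) + L2 (segmentation), both cross-entropy terms
        l1 = edge_loss(pred_boundary, gt_boundary)
        l2 = F.cross_entropy(pred_seg, gt_seg)   # multi-class cross entropy over segmentation classes
        return l1 + l2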
s3: performing category-level context modeling on the high-resolution feature map feature5_1 and the multi-scale fused low-resolution feature map feature6 from S1 by using the region attention module, and enhancing the features of pixels from the same class of object to obtain the pixel-region enhanced feature map. The high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map are fused, the final transparent object segmentation result is obtained after feature dimensionality reduction, and the loss function L2 of the transparent object is calculated as the other part of the total loss function L, where L2 is the same cross-entropy loss function as L1 and the total loss function L is the sum of L1 and L2. The specific implementation is as follows:
A Softmax operation is performed on the multi-scale fused low-resolution feature map feature6 to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation classes; R_k is a two-dimensional map in which each element represents the probability that the corresponding pixel belongs to class k. The k-th region representation feature is the weighted sum of the features of all pixels with their probability of belonging to region k, expressed as follows:

f_k = \sum_{i} r_{ki} x_i

where x_i denotes the feature of pixel p_i, r_{ki} denotes the probability that pixel p_i belongs to region k, and f_k is the region representation feature. Then the correspondence between each pixel and each region is calculated through a self-attention mechanism with the following formulas:

w_{ik} = \frac{\exp\left(t(x_i, f_k)\right)}{\sum_{j=1}^{K} \exp\left(t(x_i, f_j)\right)}

y_i = \sum_{k=1}^{K} w_{ik}\, u_3(f_k)

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1 × 1 convolution, a BatchNorm layer and a ReLU layer. u_1(x)^T and u_2(f) serve as the keys and queries of the self-attention mechanism, the correlation between the pixel features and the region representation features is calculated and normalized to obtain w_{ik}, and w_{ik} is used as a weight and multiplied with u_3(f_k) to obtain the pixel-region enhanced feature y_i; the pixel-region enhanced features y_i of all pixels form the pixel-region enhanced feature map y_aug. The high-resolution feature map feature5_1, the multi-scale edge feature map boundary5 and the pixel-region enhanced feature map y_aug are fused by a splicing operation, and the segmentation result of the transparent object is finally obtained through a convolution layer with a 1 × 1 kernel and a Sigmoid function, addressing the problems that the transparent object is occluded and affected by the environment.
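The final fusion step can be sketched as follows; the splicing (concatenation), 1 × 1 convolution and Sigmoid follow the description above, while the channel counts and the number of output classes are assumptions.

    import torch
    import torch.nn as nn

    class FusionHead(nn.Module):
        # concatenate feature5_1, boundary5 and y_aug, then reduce channels to the segmentation output
        def __init__(self, high_ch, boundary_ch, region_ch, num_classes):
            super().__init__()
            self.classifier = nn.Conv2d(high_ch + boundary_ch + region_ch, num_classes, kernel_size=1)

        def forward(self, feature5_1, boundary5, y_aug):
            fused = torch.cat([feature5_1, boundary5, y_aug], dim=1)   # channel-wise splicing
            return torch.sigmoid(self.classifier(fused))               # per-pixel transparent-object prediction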
The performance of the invention compared with some general image segmentation algorithms on transparent object data sets Trans10K-v2 is shown in Table 1, wherein mIOU represents the average value of the intersection and union ratio of real and predicted values of each category, and ACC represents the pixel accuracy.
TABLE 1
[Table 1: comparison of mIOU and ACC on Trans10K-v2; provided as an image in the original publication, values not reproduced here.]
As can be seen from the table, compared with current mainstream semantic segmentation algorithms, the method provided by the invention has clear advantages on both performance indexes, ACC and mIOU. Compared with UNet, the performance indexes of the invention are greatly improved, demonstrating that the dual-resolution feature extraction module extracts more robust features of transparent objects.
The embodiment of the invention also provides a transparent object image segmentation system based on the differential boundary attention and the regional attention, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the above-described embodiment method.
Finally, the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all of them should be covered in the claims of the present invention.

Claims (7)

1. A transparent object image segmentation method is characterized by comprising the following steps:
s1: establishing a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch, and inputting the input image into the dual-resolution feature extraction module; the high-resolution branch maintains accurate spatial position information by connecting feature maps of different resolutions in parallel and repeatedly performing multi-scale cross fusion, obtaining a high-resolution feature map at 1/8 of the original image size; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map, obtaining a low-resolution feature map at 1/64 of the original image size; a depth pyramid pooling module is added at the end of the low-resolution branch to enlarge the effective receptive field and fuse multi-scale context information, obtaining a multi-scale fused low-resolution feature map;
the dual-resolution feature extraction network is composed of six levels, namely conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM, where x is 1 or 2; x = 1 represents the high-resolution branch and x = 2 represents the low-resolution branch;
conv1 contains a convolution layer with stride 2 and a 3 × 3 kernel, a BatchNorm layer and a ReLU layer; the conv1 layer is used to change the dimensions of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and yields the feature map feature2 at 1/8 of the original image size;
conv3_x splits into two parallel high- and low-resolution branches, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain the high-resolution branch feature map feature3_1 at 1/8 of the original image size; conv3_2 downsamples the output of conv2 to obtain the low-resolution branch feature map feature3_2 at 1/16 of the original image size;
conv4_x is divided into two parallel high- and low-resolution branches, conv4_1 and conv4_2; conv4_1 continuously fuses low-resolution information and maintains the high-resolution branch feature map feature4_1 at 1/8 of the original image size; conv4_2 obtains the low-resolution branch feature map feature4_2 at 1/32 of the original image size;
conv5_x is divided into two parallel high- and low-resolution branches, conv5_1 and conv5_2; conv5_1 continuously fuses low-resolution information and maintains the high-resolution branch feature map feature5_1 at 1/8 of the original image size; conv5_2 obtains the low-resolution branch feature map feature5_2 at 1/64 of the original image size;
DPPM is used for enlarging the receptive field and fusing multi-scale context information;
s2: using a differential boundary attention module, differential convolution and spatial attention operations are respectively performed on feature map feature2 at 1/8 of the original image size, the low-resolution branch feature map feature3_2 at 1/16 of the original image size, the low-resolution branch feature map feature4_2 at 1/32 of the original image size and the low-resolution branch feature map feature5_2 at 1/64 of the original image size extracted in S1; multi-scale edge feature maps are extracted and fused, and an edge prediction image of the transparent object is obtained after feature dimensionality reduction;
s3: performing category-level context modeling on the high-resolution feature map obtained in step S1 and the multi-scale fused low-resolution feature map by using a region attention module, and enhancing the features of pixels from the same class of object to obtain a pixel-region enhanced feature map; and fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map, and obtaining the final transparent object segmentation result after feature dimensionality reduction.
2. The transparent object image segmentation method as claimed in claim 1, wherein the Basic Block of conv2 includes two convolution layers with 3 × 3 kernels and an Identity Block, wherein the 3 × 3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block copies the shallow features, preventing the gradient from vanishing as the network deepens.
3. The transparent object image segmentation method as claimed in claim 1, wherein the conv4_1 operation is performed on feature map feature3_1 to obtain feature map hfeature3_1; feature map feature3_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size; the conv4_2 operation is performed on feature map feature3_2 to obtain feature map lfeature3_2; feature map feature3_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size; conv5_x is composed of cascaded residual blocks (Bottleneck Block), each containing two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel and an Identity Block, which reduces the computational cost in the deep part of the network; the conv5_1 operation is performed on feature map feature4_1 to obtain feature map hfeature4_1; feature map feature4_2 is compressed in channels by a 1 × 1 convolution and then upsampled by bilinear interpolation to obtain feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size; the conv5_2 operation is performed on feature map feature4_2 to obtain feature map lfeature4_2; feature map feature4_1 is downsampled by a 3 × 3 convolution with stride 2 to obtain feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
4. The method according to claim 3, wherein the DPPM comprises five parallel branches: feature map feature5_2 passes through a 1 × 1 convolution to obtain feature map y1; feature5_2 passes through a pooling layer with kernel_size = 3 and stride = 2, a 1 × 1 convolution and upsampling, the result is fused with y1, and the fused map passes through a 3 × 3 convolution to obtain feature map y2; feature5_2 passes through a pooling layer with kernel_size = 5 and stride = 4, a 1 × 1 convolution and upsampling, the result is fused with y2, and the fused map passes through a 3 × 3 convolution to obtain feature map y3; feature5_2 passes through a pooling layer with kernel_size = 9 and stride = 8, a 1 × 1 convolution and upsampling, the result is fused with y3, and the fused map passes through a 3 × 3 convolution to obtain feature map y4; feature5_2 passes through global average pooling, a 1 × 1 convolution and upsampling, the result is fused with y4, and the fused map passes through a 3 × 3 convolution to obtain feature map y5; the feature maps y1, y2, y3, y4 and y5 are spliced and a 1 × 1 convolution changes the number of channels to obtain the final multi-scale fused low-resolution feature map feature6.
5. The transparent object image segmentation method as claimed in claim 1, wherein the differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules; the pixel difference convolution module comprises a differential convolution layer with a 3 × 3 kernel, a ReLU layer and a convolution layer with a 1 × 1 kernel; the spatial attention module comprises two convolution layers with 1 × 1 kernels, one convolution layer with a 3 × 3 kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 passes through a Pixel Difference Convolution Module (PDCM) and then a Spatial Attention Module (SAM) to obtain the corresponding boundary feature map.
6. The method for segmenting the transparent object image according to claim 1, wherein the specific steps of obtaining the pixel-region enhanced feature map in S3 are:
s3-1: performing a Softmax operation on the multi-scale fused low-resolution feature map to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation classes; R_k is a two-dimensional map in which each element represents the probability that the corresponding pixel belongs to class k;
s3-2: the k-th region representation feature is obtained with the following formula, i.e. the features of all pixels of the whole image are weighted and summed with the probability that each pixel belongs to region k:

f_k = \sum_{i} r_{ki} x_i

where x_i denotes the feature of pixel p_i, r_{ki} denotes the probability that pixel p_i belongs to region k, and f_k is the region representation feature;
s3-3: the correspondence between each pixel and each region is calculated through a self-attention mechanism with the following formulas:

w_{ik} = \frac{\exp\left(t(x_i, f_k)\right)}{\sum_{j=1}^{K} \exp\left(t(x_i, f_j)\right)}

y_i = \sum_{k=1}^{K} w_{ik}\, u_3(f_k)

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1 × 1 convolution, a BatchNorm layer and a ReLU layer; u_1(x)^T and u_2(f) serve as the keys and queries of the self-attention mechanism, the correlation between the pixel features and the region representation features is calculated and normalized to obtain w_{ik}, and w_{ik} is used as a weight and multiplied with u_3(f_k) to obtain the pixel-region enhanced feature y_i;
s3-4: the pixel-region enhanced features y_i of all pixels form the pixel-region enhanced feature map y_aug.
7. The method for segmenting the transparent object image according to claim 1, wherein the edge prediction image of the transparent object obtained in S2 is used for calculating the edge loss function L1, the segmentation result of the transparent object obtained in S3 is used for calculating the loss function L2, and the total loss function L is the sum of the edge loss function L1 and the loss function L2; L participates in the network weight update during gradient descent to optimize the model parameters; the edge loss function L1 and the loss function L2 both use cross-entropy loss functions; taking the edge loss function L1 as an example, the calculation formula is as follows:

L1 = -\sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

where p_i is the prediction for pixel i at the boundary and y_i is the ground truth for pixel i at the boundary.
CN202210633162.3A 2022-06-07 Transparent object image segmentation method and system Active CN115082675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210633162.3A CN115082675B (en) 2022-06-07 Transparent object image segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210633162.3A CN115082675B (en) 2022-06-07 Transparent object image segmentation method and system

Publications (2)

Publication Number Publication Date
CN115082675A (en) 2022-09-20
CN115082675B CN115082675B (en) 2024-06-04


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294412A (en) * 2022-10-10 2022-11-04 临沂大学 Real-time coal rock segmentation network generation method based on deep learning
CN115880567A (en) * 2023-03-03 2023-03-31 深圳精智达技术股份有限公司 Self-attention calculation method and device, electronic equipment and storage medium
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN117788722A (en) * 2024-02-27 2024-03-29 国能大渡河金川水电建设有限公司 BIM-based safety data monitoring system for underground space

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1854965B1 (en) * 2006-05-02 2009-10-21 Carl Freudenberg KG Oil seal
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN114359297A (en) * 2022-01-04 2022-04-15 浙江大学 Attention pyramid-based multi-resolution semantic segmentation method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1854965B1 (en) * 2006-05-02 2009-10-21 Carl Freudenberg KG Oil seal
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN114359297A (en) * 2022-01-04 2022-04-15 浙江大学 Attention pyramid-based multi-resolution semantic segmentation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAI Pengbo; YANG Hao; SONG Tingting; YU Kang; MA Longxiang; HUANG Xiangsheng: "Dual-path semantic segmentation combined with attention mechanism", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12), pages 119 - 128 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294412A (en) * 2022-10-10 2022-11-04 临沂大学 Real-time coal rock segmentation network generation method based on deep learning
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309274B (en) * 2022-12-12 2024-01-30 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN115880567A (en) * 2023-03-03 2023-03-31 深圳精智达技术股份有限公司 Self-attention calculation method and device, electronic equipment and storage medium
CN115880567B (en) * 2023-03-03 2023-07-25 深圳精智达技术股份有限公司 Self-attention calculating method and device, electronic equipment and storage medium
CN117788722A (en) * 2024-02-27 2024-03-29 国能大渡河金川水电建设有限公司 BIM-based safety data monitoring system for underground space
CN117788722B (en) * 2024-02-27 2024-05-03 国能大渡河金川水电建设有限公司 BIM-based safety data monitoring system for underground space

Similar Documents

Publication Publication Date Title
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN110738207A (en) character detection method for fusing character area edge information in character image
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN113256649B (en) Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN112001931A (en) Image segmentation method, device, equipment and storage medium
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN111401380A (en) RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization
CN115775316A (en) Image semantic segmentation method based on multi-scale attention mechanism
CN114119993A (en) Salient object detection method based on self-attention mechanism
CN112926533A (en) Optical remote sensing image ground feature classification method and system based on bidirectional feature fusion
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN117576402B (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method
Ren et al. Robust low-rank deep feature recovery in cnns: Toward low information loss and fast convergence
CN112215241B (en) Image feature extraction device based on small sample learning
CN114445620A (en) Target segmentation method for improving Mask R-CNN
Van Hoai et al. Feeding Convolutional Neural Network by hand-crafted features based on Enhanced Neighbor-Center Different Image for color texture classification
Chan et al. Asymmetric cascade fusion network for building extraction
CN113888505A (en) Natural scene text detection method based on semantic segmentation
CN115035402B (en) Multistage feature aggregation system and method for land cover classification problem
CN116681978A (en) Attention mechanism and multi-scale feature fusion-based saliency target detection method
CN115082675B (en) Transparent object image segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant