CN115082675B - Transparent object image segmentation method and system - Google Patents
- Publication number
- CN115082675B (application CN202210633162.3A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- feature
- resolution
- convolution
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4007—Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a transparent object image segmentation method comprising the following steps. S1: establish a dual-resolution feature extraction module containing a high-resolution branch and a low-resolution branch to obtain a high-resolution feature map and a multi-scale fused low-resolution feature map. S2: use a differential boundary attention module to perform differential convolution and spatial attention operations on the feature maps of different scales extracted in step S1, and extract and fuse multi-scale edge feature maps. S3: use a region attention module to model the category-level context of the high-resolution feature map and the multi-scale fused low-resolution feature map to obtain a pixel-region enhanced feature map; fuse the high-resolution feature map, the multi-scale edge feature map and the pixel-region enhanced feature map, and obtain the final transparent object segmentation result after feature dimension reduction. The method effectively alleviates the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a transparent object image segmentation method and a transparent object image segmentation system.
Background
Image semantic segmentation is one of the key technologies that enable intelligent systems to understand natural scenes. However, for targets such as transparent objects, which are ubiquitous in the real world, conventional general-purpose image segmentation methods often fail to obtain satisfactory results. The main problems are as follows: transparent objects are easily affected by environmental factors, making it difficult to extract robust features; transparent objects are easily occluded, leading to incomplete semantic information; and the edges of transparent objects are segmented inaccurately. These problems ultimately degrade the segmentation of transparent objects.
Disclosure of Invention
The invention aims to overcome the above shortcomings of the prior art and provides a transparent object image segmentation method that continuously enhances the features of pixels belonging to the same object by incorporating high-resolution features and a self-attention mechanism, effectively alleviating the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
In order to solve the technical problems, the invention adopts the following technical scheme:
a transparent object image segmentation method comprising the steps of:
S1: establishing a dual-resolution feature extraction module comprising a high-resolution branch and a low-resolution branch, inputting an input image into the dual-resolution feature extraction module, and maintaining accurate spatial position information by the high-resolution branch through connecting parallel different-resolution feature images and repeatedly performing multi-scale cross fusion to obtain a high-resolution feature image with the size of 1/8 original image; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map to obtain a low-resolution feature map with the size of 1/64 original map; a depth pyramid pooling module is added at the tail end of the low-resolution branch and is used for expanding an effective receptive field and fusing multi-scale context information to obtain a multi-scale fused low-resolution feature map;
S2: the differential boundary attention module is utilized to respectively carry out differential convolution and spatial attention operation on the feature graphs with different dimensions extracted in the S1, the edge feature graphs with multiple dimensions are extracted and fused, an edge prediction image of a transparent object is obtained after feature dimension reduction, an edge loss function L1 is calculated as a part of the total loss function L, and the model parameters are optimized by participating in network weight updating in the gradient descent process; wherein L1 adopts a cross entropy loss function, p i is the prediction result of the pixel i at the boundary, and y i is the actual result of the pixel i at the boundary; the calculation formula is as follows:
S3: carrying out category-level context relation modeling on the high-resolution feature map obtained in the step S1 and the multi-scale fused low-resolution feature map by utilizing a region attention module, enhancing the features of pixels from the same object, and obtaining a pixel-region enhanced feature map; and fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-area enhancement feature map, obtaining a final transparent object segmentation result after feature dimension reduction, and calculating a loss function L2 of the transparent object as another part of the total loss function L. Wherein L2 and L1 are the same as each other and are the cross entropy loss functions, and the total loss function L is the sum of L1 and L2.
Further, the dual resolution feature extraction network is composed of six levels of conv1, conv2, conv3_x, conv4_x, conv5_x, DPPM, where x=1 or 2, x=1 represents a high resolution branch, and x=2 represents a low resolution branch;
conv1 comprises a convolution layer with a step size of 2 and a convolution kernel of 3*3, a BatchNorm layer and a ReLU layer, conv1 layer being used to change the dimension of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and is used to obtain a feature map feature2 at 1/8 of the original image size;
at conv3_x the network splits into two parallel branches of high and low resolution, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain a high-resolution branch feature map feature3_1 at 1/8 of the original image size, and conv3_2 downsamples the output of conv2 to obtain a low-resolution branch feature map feature3_2 at 1/16 of the original image size;
conv4_x is divided into two parallel branches of high and low resolution, conv4_1 and conv4_2; conv4_1 continuously merges low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size, and conv4_2 is used to obtain a low-resolution branch feature map at 1/32 of the original image size;
conv5_x is divided into two parallel branches of high and low resolution, conv5_1 and conv5_2; conv5_1 continuously merges low-resolution information and maintains a high-resolution branch feature map at 1/8 of the original image size, and conv5_2 is used to obtain a low-resolution branch feature map at 1/64 of the original image size;
DPPM is used to expand receptive fields and fuse multi-scale context information.
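As a minimal PyTorch sketch of this six-level layout, the following uses plain conv-BN-ReLU stacks in place of the residual blocks; the channel widths are illustrative assumptions, and the cross fusion at conv4_x/conv5_x and the DPPM are omitted here because they are sketched after the following paragraphs.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k=3, s=1):
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DualResolutionBackbone(nn.Module):
    """Skeleton of conv1..conv5_x with a parallel high-resolution (x=1) and
    low-resolution (x=2) branch; only the output scales are meant to be faithful."""
    def __init__(self, c=32):
        super().__init__()
        self.conv1 = conv_bn_relu(3, c, s=2)                              # 1/2, changes input dimensions
        self.conv2 = nn.Sequential(conv_bn_relu(c, c, s=2),
                                   conv_bn_relu(c, c, s=2))               # feature2: 1/8
        self.conv3_1 = conv_bn_relu(c, c)                                 # high-res branch, stays at 1/8
        self.conv3_2 = conv_bn_relu(c, 2 * c, s=2)                        # low-res branch: 1/16
        self.conv4_1, self.conv4_2 = conv_bn_relu(c, c), conv_bn_relu(2 * c, 4 * c, s=2)  # 1/8, 1/32
        self.conv5_1, self.conv5_2 = conv_bn_relu(c, c), conv_bn_relu(4 * c, 8 * c, s=2)  # 1/8, 1/64

    def forward(self, x):
        feature2 = self.conv2(self.conv1(x))
        feature3_1, feature3_2 = self.conv3_1(feature2), self.conv3_2(feature2)
        feature4_1, feature4_2 = self.conv4_1(feature3_1), self.conv4_2(feature3_2)
        feature5_1, feature5_2 = self.conv5_1(feature4_1), self.conv5_2(feature4_2)
        return feature2, feature3_2, feature4_2, feature5_2, feature5_1

outs = DualResolutionBackbone()(torch.randn(1, 3, 512, 512))
print([o.shape for o in outs])   # 1/8, 1/16, 1/32, 1/64 maps plus the 1/8 high-res map
```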
Further, the Basic Block of conv2 includes two convolution layers with a 3*3 convolution kernel and an Identity Block; the 3*3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block carries the features of shallow layers forward, preventing the gradient from vanishing as the network deepens.
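A minimal sketch of such a Basic Block, assuming equal input and output channel counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus an identity shortcut that carries shallow features
    forward, which is what keeps the gradient from vanishing in deeper stacks."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + x)   # identity branch (Identity Block)

print(BasicBlock(32)(torch.randn(1, 32, 64, 64)).shape)
```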
Further, the feature map feature3_1 is passed through the conv4_1 operation to obtain a feature map hfeature3_1; the feature map feature3_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The feature map feature3_2 is passed through the conv4_2 operation to obtain a feature map lfeature3_2; the feature map feature3_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size.
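The cross fusion at this stage can be sketched as follows; the 1*1 channel compression plus bilinear upsampling on the low-resolution path and the stride-2 3*3 convolution on the high-resolution path follow the text, while elementwise addition as the fusion operation and the exact placement relative to the conv4_1/conv4_2 stages are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusion(nn.Module):
    """Bidirectional fusion of the parallel branches at one stage."""
    def __init__(self, c_high, c_low):
        super().__init__()
        self.compress = nn.Conv2d(c_low, c_high, kernel_size=1, bias=False)   # channel compression
        self.down = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1, bias=False)

    def forward(self, high, low):
        # high: e.g. feature3_1 at 1/8 scale; low: e.g. feature3_2 at 1/16 scale
        up = F.interpolate(self.compress(low), size=high.shape[-2:],
                           mode='bilinear', align_corners=False)
        fused_high = high + up              # towards feature4_1 (stays at 1/8)
        fused_low = low + self.down(high)   # towards feature4_2 (before the stage's own downsampling)
        return fused_high, fused_low

high, low = torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)
print([t.shape for t in CrossFusion(32, 64)(high, low)])
```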
Further, conv5_x is composed of cascaded residual blocks (Bottleneck Block); a Bottleneck Block comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel and an Identity Block, which reduces the computational cost in deep networks. The feature map feature4_1 is passed through the conv5_1 operation to obtain a feature map hfeature4_1; the feature map feature4_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The feature map feature4_2 is passed through the conv5_2 operation to obtain a feature map lfeature4_2; the feature map feature4_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
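A minimal sketch of such a Bottleneck Block; the mid-channel reduction factor of 4 is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions with an identity shortcut; squeezing the middle
    channels is what keeps deep stages of the low-resolution branch cheap."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels))

    def forward(self, x):
        return F.relu(self.body(x) + x)   # identity branch (Identity Block)

print(Bottleneck(256)(torch.randn(1, 256, 8, 8)).shape)
```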
Further, the DPPM includes five parallel branches: the feature map feature5_2 is passed through a 1*1 convolution to obtain a feature map y1; feature5_2 is passed through a pooling layer with kernel_size=3 and stride=2, a 1*1 convolution and upsampling, fused with y1, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y2; feature5_2 is passed through a pooling layer with kernel_size=5 and stride=4, a 1*1 convolution and upsampling, fused with y2, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y3; feature5_2 is passed through a pooling layer with kernel_size=9 and stride=8, a 1*1 convolution and upsampling, fused with y3, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y4; feature5_2 is passed through global average pooling, a 1*1 convolution and upsampling, fused with y4, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y5; the feature maps y1, y2, y3, y4 and y5 are concatenated and then passed through a 1*1 convolution to change the number of channels, giving the final multi-scale fused low-resolution feature map feature6.
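A sketch of the five-branch DPPM under the pooling sizes and strides listed above; average pooling, the channel widths and elementwise addition as the fusion operation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPPM(nn.Module):
    """Five parallel branches (1x1 conv, three pooled branches, global pooling) whose
    outputs y1..y5 are chained by fusion + 3x3 conv, concatenated and reduced by 1x1 conv."""
    def __init__(self, cin=512, mid=128, cout=256):
        super().__init__()
        self.scale0 = nn.Conv2d(cin, mid, 1, bias=False)                       # branch for y1
        self.pools = nn.ModuleList([
            nn.AvgPool2d(kernel_size=3, stride=2, padding=1),                  # towards y2
            nn.AvgPool2d(kernel_size=5, stride=4, padding=2),                  # towards y3
            nn.AvgPool2d(kernel_size=9, stride=8, padding=4),                  # towards y4
            nn.AdaptiveAvgPool2d(1),                                           # global pooling, towards y5
        ])
        self.reduce = nn.ModuleList([nn.Conv2d(cin, mid, 1, bias=False) for _ in range(4)])
        self.fuse = nn.ModuleList([nn.Conv2d(mid, mid, 3, padding=1, bias=False) for _ in range(4)])
        self.out = nn.Conv2d(5 * mid, cout, 1, bias=False)                     # concat -> 1x1 conv

    def forward(self, feature5_2):
        size = feature5_2.shape[-2:]
        ys = [self.scale0(feature5_2)]                                         # y1
        for pool, red, fuse in zip(self.pools, self.reduce, self.fuse):
            y = F.interpolate(red(pool(feature5_2)), size=size, mode='bilinear', align_corners=False)
            ys.append(fuse(y + ys[-1]))                                        # fuse with previous, then 3x3 conv
        return self.out(torch.cat(ys, dim=1))                                  # feature6

print(DPPM()(torch.randn(1, 512, 8, 8)).shape)   # -> (1, 256, 8, 8)
```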
Further, the differential boundary attention module is composed of four parallel branches, each consisting of a pixel difference convolution module and a spatial attention module; the pixel difference convolution module comprises a difference convolution layer with a 3*3 convolution kernel, a ReLU layer and a convolution layer with a 1*1 convolution kernel; the spatial attention module comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 first passes through a pixel difference convolution module (PDCM) and then through a spatial attention module (SAM) to obtain the corresponding boundary feature map.
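One branch of this module could be sketched as below; the central pixel-difference form of the 3*3 difference convolution is one common realization of the LBP-inspired operation and, like the exact layer ordering inside the SAM and the intermediate channel width, is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralPixelDifferenceConv(nn.Module):
    """3x3 pixel difference convolution: each tap contributes w_j * (x_j - x_center),
    so gradient information between neighbouring pixels is encoded in the convolution."""
    def __init__(self, cin, cout):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(cout, cin, 3, 3) * 0.1)

    def forward(self, x):
        w_sum = self.weight.sum(dim=(2, 3), keepdim=True)            # per-filter sum of taps
        return F.conv2d(x, self.weight, padding=1) - F.conv2d(x, w_sum)

class BoundaryAttentionBranch(nn.Module):
    """PDCM (difference conv -> ReLU -> 1x1 conv) followed by SAM, whose single-channel
    Sigmoid map reweights the features before the boundary map is produced."""
    def __init__(self, cin, mid=32):
        super().__init__()
        self.pdc = CentralPixelDifferenceConv(cin, mid)
        self.proj = nn.Conv2d(mid, mid, kernel_size=1)
        self.sam = nn.Sequential(nn.Conv2d(mid, mid, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(mid, mid, 3, padding=1),
                                 nn.Conv2d(mid, 1, 1), nn.Sigmoid())
        self.to_boundary = nn.Conv2d(mid, 1, kernel_size=1)

    def forward(self, x, out_size):
        f = self.proj(F.relu(self.pdc(x)))
        f = f * self.sam(f)                                           # suppress background noise
        boundary = self.to_boundary(f)                                # single-channel boundary map
        return F.interpolate(boundary, size=out_size, mode='bilinear', align_corners=False)

print(BoundaryAttentionBranch(64)(torch.randn(1, 64, 32, 32), (256, 256)).shape)
```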
Further, the pixel-region enhanced feature map in S3 is obtained through the following specific steps (an illustrative sketch is given after step S3-4):
S3-1: carrying out Softmax operation on the multi-scale fused low-resolution feature map to obtain K rough segmentation areas { R 1,R2,...,RK }, wherein K represents the number of segmented categories, R K is a two-dimensional vector, and each element in R K represents the probability that the corresponding pixel belongs to the category K;
S3-2: the K-th region representation feature is obtained by using the following formula, namely, the weighted summation of the features of all pixels of the whole image and the probability that the features belong to the region K is carried out:
Where x i represents the feature of pixel p i, r ki represents the probability that pixel p i belongs to region K, and f k represents the region representation feature;
s3-3: the corresponding relation between each pixel and each region is calculated through a self-attention mechanism, and the calculation formula is as follows:
where t (x, f) =u 1(x)Tu2(f),u1、u2、u3 and u 4 represent a transfer function FFN consisting of 1*1 convolutions, batchNorm layers and ReLU layers; taking u 1(x)T and u 2 (f) as keys and queries of a self-attention mechanism respectively, calculating the correlation between the pixel characteristics and the region representation characteristics, normalizing the correlation to obtain w ik, and multiplying w ik as a weight by u 3(fk to obtain a pixel-region enhancement characteristic y i;
S3-4: the pixel-area enhancement feature map y aug is composed using the pixel-area enhancement feature y aug for each pixel point.
The beneficial effects of the invention are as follows:
1. Compared with traditional edge extraction operators, the edge feature extraction in the invention, with its added convolution layers and ReLU layers, is less affected by factors such as the environment and illumination and generalizes better. Compared with edge feature extraction modules based on ordinary convolutional neural networks, whose convolution kernel parameters are optimized from random initialization and do not encode gradient information, making it difficult to focus on edge-related features, the boundary attention in the invention uses pixel difference convolution: starting from the principle of edge formation, it encodes gradient information through the differences between adjacent pixels when optimizing the convolution kernel parameters, and spatial attention is added after the pixel difference convolution to reduce interference from background noise, so the extracted edge features are more effective.
2. Unlike current image segmentation algorithms (typified by DeepLabv3+) that exploit global context information, the region attention in the invention models context at the category level: a coarse segmentation region map is first generated from the low-resolution feature map, and then high-resolution features and a self-attention mechanism are added to continuously enhance the features of pixels belonging to the same object, which effectively alleviates the loss of semantic information of transparent objects caused by factors such as the environment and occlusion.
Drawings
Fig. 1 is a block diagram of a network model of a transparent object image segmentation method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a differential boundary attention module in an embodiment of the present invention.
FIG. 3 is a block diagram of the region attention module in an embodiment of the present invention.
FIG. 4 is a block diagram of a high resolution branch and low resolution branch cross-fusion architecture in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings and examples. It should be noted that the examples do not limit the scope of the invention as claimed.
Example 1
As shown in fig. 1 to 4, a transparent object image segmentation method includes the following steps:
S1: establishing a dual-resolution feature extraction module comprising a high-resolution branch and a low-resolution branch, wherein the high-resolution branch maintains accurate spatial position information by connecting parallel feature images with different resolutions and repeatedly performing multi-scale cross fusion to obtain a high-resolution feature image with the size of 1/8 original image; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map to obtain a low-resolution feature map with the size of 1/64 original map; the depth pyramid pooling module is added at the tail end of the low-resolution branch, so that the effective receptive field is enlarged, the multi-scale context information is fused, a multi-scale fused low-resolution feature map is obtained, an original image is acquired through a camera, the original image is subjected to pretreatment of random cutting, random overturning, luminosity distortion and normalization, an input image is obtained, and the input image is input into the dual-resolution feature extraction module, and the specific implementation method is as follows:
The dual-resolution feature extraction network is composed of six levels: conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM (x=1 or 2, where 1 denotes the high-resolution branch and 2 denotes the low-resolution branch). The conv1 level comprises a convolution layer with a stride of 2 and a 3*3 convolution kernel, a BatchNorm layer and a ReLU layer; it changes the dimensions of the input image, and the feature map feature1 is obtained through the conv1 operation. The conv2 level consists of cascaded residual blocks (Basic Block); a Basic Block comprises two convolution layers with a 3*3 convolution kernel and an Identity Block, where the 3*3 convolution layers extract different input features at low computational cost and the Identity Block carries shallow-layer features forward to avoid vanishing gradients as the network deepens; the feature map feature2 at 1/8 of the original image size is obtained through the conv2 operation. At conv3 the network splits into two parallel branches of high and low resolution, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 and yields the high-resolution branch feature map feature3_1 at 1/8 of the original image size, while conv3_2 downsamples the output of conv2 and yields the low-resolution branch feature map feature3_2 at 1/16 of the original image size. The feature map feature3_1 is passed through the conv4_1 operation to obtain hfeature3_1; feature3_2 is passed through a 1*1 convolution for channel compression and upsampled by bilinear interpolation to obtain hfeature3_2; hfeature3_1 and hfeature3_2 are fused into the high-resolution branch feature map feature4_1 at 1/8 of the original image size. The feature map feature3_2 is passed through the conv4_2 operation to obtain lfeature3_2; feature3_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain lfeature3_1; lfeature3_1 and lfeature3_2 are fused into the low-resolution branch feature map feature4_2 at 1/32 of the original image size. The feature maps feature4_1 and feature4_2 are thus the result of the cross fusion of the high-resolution and low-resolution branch feature maps. The conv5_x level consists of cascaded residual blocks (Bottleneck Block); a Bottleneck Block contains two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel and an Identity Block, reducing the computational cost in deep networks. The feature map feature4_1 is passed through the conv5_1 operation to obtain hfeature4_1; feature4_2 is passed through a 1*1 convolution for channel compression and upsampled by bilinear interpolation to obtain hfeature4_2; hfeature4_1 and hfeature4_2 are fused into the high-resolution branch feature map feature5_1 at 1/8 of the original image size. The feature map feature4_2 is passed through the conv5_2 operation to obtain lfeature4_2; feature4_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain lfeature4_1; lfeature4_1 and lfeature4_2 are fused into the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
A depth pyramid pooling module (DPPM) is added after the feature map feature5_2 to effectively enlarge the receptive field and fuse multi-scale context information. It comprises five parallel branches: feature5_2 is passed through a 1*1 convolution to obtain the feature map y1; feature5_2 is passed through a pooling layer with kernel_size=3 and stride=2, a 1*1 convolution and upsampling, fused with y1, and the fused feature map is passed through a 3*3 convolution to obtain y2; feature5_2 is passed through a pooling layer with kernel_size=5 and stride=4, a 1*1 convolution and upsampling, fused with y2, and the fused feature map is passed through a 3*3 convolution to obtain y3; feature5_2 is passed through a pooling layer with kernel_size=9 and stride=8, a 1*1 convolution and upsampling, fused with y3, and the fused feature map is passed through a 3*3 convolution to obtain y4; feature5_2 is passed through global average pooling, a 1*1 convolution and upsampling, fused with y4, and the fused feature map is passed through a 3*3 convolution to obtain y5. The feature maps y1, y2, y3, y4 and y5 are concatenated and then passed through a 1*1 convolution to change the number of channels, giving the final multi-scale fused low-resolution feature map feature6.
This step connects parallel feature maps of different resolutions and repeatedly performs multi-scale cross fusion to maintain the high-resolution feature map, so the generated high-resolution feature map provides rich detail information and helps improve the precision of the segmentation result. The low-resolution branch extracts rich semantic information through successive downsampling and cross fusion with the high-resolution feature map; the feature map at the end of this branch is 1/64 of the original image size, and after the depth pyramid pooling module is added, the effective receptive field is enlarged, multi-scale context information is fused, and the computational cost of the model is reduced. Unlike most existing feature extraction modules, which are connected in series, the two resolution branches of this module are connected in parallel: the high-resolution branch always maintains accurate spatial position information and continuously integrates low-resolution information, avoiding the information loss incurred when serial designs recover resolution by downsampling followed by upsampling. This effectively addresses the difficulty of extracting feature information from transparent objects under background and illumination changes and is important for the subsequent refinement of regions.
S2: the differential boundary attention module is utilized to respectively carry out differential convolution and spatial attention operation on four feature images feature2, feature3_2, feature4_2 and feature5_2 with different scales extracted in the extracted S1, multi-scale edge feature images are extracted and fused, and edge segmentation images of transparent objects are obtained after feature dimension reduction, and the specific implementation method comprises the following steps:
The four feature maps of different scales extracted in S1, feature2, feature3_2, feature4_2 and feature5_2, are selected and passed through the differential boundary attention module to obtain the boundary feature maps boundary1, boundary2, boundary3 and boundary4 of the four branches. The differential boundary attention module is composed of four parallel pixel difference convolution modules and spatial attention modules: each feature map selected in S1 first passes through a pixel difference convolution module (PDCM) and then through a spatial attention module (SAM) to obtain the corresponding boundary feature map. The pixel difference convolution module comprises a difference convolution layer with a 3*3 convolution kernel, a ReLU layer and a convolution layer with a 1*1 convolution kernel. The pixel difference convolution layer combines the traditional edge detection operator LBP (local binary pattern) with the convolutional neural network: a 3*3 convolution kernel computes pixel differences over the 8-neighborhood of each local region of the image, and the differences are multiplied element-wise by the convolution kernel weights and summed to produce the values of the output feature map. The spatial attention module contains two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel, a ReLU layer and a Sigmoid function; a 1*1 convolution layer compresses the feature map into a single channel, and the feature map is restored to the original image size by bilinear interpolation. Finally, the boundary feature maps boundary1, boundary2, boundary3 and boundary4 obtained from the four branches are concatenated into the multi-scale boundary feature map boundary5, which is passed through a convolution layer with a 1*1 convolution kernel and a Sigmoid function to obtain the edge segmentation map of the transparent object.
This step thus comprises several pixel difference convolution modules and spatial attention modules: the difference convolution modules acquire rich boundary information by convolving the convolution kernels with inter-pixel differences, and the spatial attention modules reduce interference from background noise.
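A sketch of the last fusion step, assuming the four boundary maps are single-channel and already restored to a common size:

```python
import torch
import torch.nn as nn

class BoundaryFusionHead(nn.Module):
    """Concatenate boundary1..boundary4 into boundary5 and reduce to an edge
    segmentation map with a 1x1 convolution followed by a Sigmoid."""
    def __init__(self, num_branches=4):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(num_branches, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, boundary_maps):
        boundary5 = torch.cat(boundary_maps, dim=1)       # multi-scale boundary feature map
        return self.head(boundary5), boundary5            # edge map and the fused feature

edge, boundary5 = BoundaryFusionHead()([torch.rand(1, 1, 256, 256) for _ in range(4)])
print(edge.shape, boundary5.shape)
```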
An edge prediction image of the transparent object is obtained after feature dimension reduction, and the edge loss function L1 is calculated as one part of the total loss function L; it participates in the network weight update during gradient descent to optimize the model parameters. L1 adopts the cross entropy loss function, where p_i is the predicted boundary probability of pixel i and y_i is the ground-truth boundary label of pixel i; the calculation formula is:

L1 = −Σ_i [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
S3: and modeling the context relation of the category level of the high-resolution feature map feature5_1 and the multi-scale fused low-resolution feature map feature6 in the S1 by using the regional attention module, enhancing the features of pixels from the same object, and obtaining a pixel-region enhanced feature map. And (3) fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-area enhancement feature map, obtaining a final transparent object segmentation result after feature dimension reduction, and calculating a loss function L2 of the transparent object as another part of the total loss function L. Wherein L2 and L1 are the same as each other and are the cross entropy loss functions, and the total loss function L is the sum of L1 and L2. The specific implementation method comprises the following steps:
A Softmax operation is performed on the multi-scale fused low-resolution feature map feature6 to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation categories, R_k is a two-dimensional vector, and each element of R_k represents the probability that the corresponding pixel belongs to category k. The k-th region representation feature is the weighted sum of the features of all pixels and their probabilities of belonging to region k:

f_k = Σ_i r_ki · x_i
where x_i denotes the feature of pixel p_i, r_ki denotes the probability that pixel p_i belongs to region k, and f_k denotes the region representation feature. The correspondence between each pixel and each region is then calculated through a self-attention mechanism, with the calculation formula:

w_ik = exp(t(x_i, f_k)) / Σ_j exp(t(x_i, f_j)),   y_i = u_4( Σ_k w_ik · u_3(f_k) )
where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1*1 convolution, a BatchNorm layer and a ReLU layer. u_1(x)^T and u_2(f) are taken as the key and query of the self-attention mechanism to calculate the correlation between the pixel features and the region representation features; the correlation is normalized to obtain w_ik, which is used as a weight and multiplied by u_3(f_k) to obtain the pixel-region enhanced feature y_i; the pixel-region enhanced features y_i of all pixel points form the pixel-region enhanced feature map y_aug. The high-resolution feature map feature5_1, the multi-scale edge feature map boundary5 and the pixel-region enhanced feature map y_aug are fused by a concatenation operation and finally passed through a convolution layer with a 1*1 convolution kernel and a Sigmoid function to obtain the segmentation result of the transparent object, which alleviates the problems of the transparent object being occluded and affected by the environment.
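The final fusion can be sketched as a small head; the channel counts and the bilinear upsampling of all three maps to a common output size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Concatenate feature5_1, boundary5 and y_aug, then reduce channels with a
    1x1 convolution and apply a Sigmoid to obtain the transparent object mask."""
    def __init__(self, c_high=64, c_boundary=4, c_aug=64, num_outputs=1):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(c_high + c_boundary + c_aug, num_outputs, 1),
                                  nn.Sigmoid())

    def forward(self, feature5_1, boundary5, y_aug, out_size):
        maps = [F.interpolate(m, size=out_size, mode='bilinear', align_corners=False)
                for m in (feature5_1, boundary5, y_aug)]
        return self.head(torch.cat(maps, dim=1))

head = SegmentationHead()
mask = head(torch.randn(1, 64, 64, 64), torch.randn(1, 4, 256, 256),
            torch.randn(1, 64, 64, 64), out_size=(512, 512))
print(mask.shape)   # (1, 1, 512, 512)
```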
The performance of the invention compared with several general image segmentation algorithms on the transparent object dataset Trans10K-v2 is shown in Table 1, where mIoU denotes the mean intersection-over-union between ground truth and prediction over all classes and ACC denotes the pixel accuracy.
TABLE 1
Compared with existing mainstream semantic segmentation algorithms, the method of the invention has clear advantages in both performance indicators, ACC and mIoU. Compared with UNet, its indicators improve greatly, showing that the dual-resolution feature extraction module extracts more robust features of transparent objects; compared with DeepLabv3+ and DenseASPP, its indicators also improve, showing that the region attention alleviates the loss of semantic information caused by occlusion of transparent objects; and its indicators are better than those of OCRNet, showing that adding boundary attention improves the segmentation results for transparent objects.
The embodiment of the invention also provides a transparent object image segmentation system based on the differential boundary attention and the regional attention, which comprises computer equipment; the computer device is configured or programmed to perform the steps of the embodiment methods described above.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.
Claims (7)
1. A transparent object image segmentation method, characterized by comprising the steps of:
S1: establishing a dual-resolution feature extraction module comprising a high-resolution branch and a low-resolution branch, inputting an input image into the dual-resolution feature extraction module, and maintaining accurate spatial position information by the high-resolution branch through connecting parallel different-resolution feature images and repeatedly performing multi-scale cross fusion to obtain a high-resolution feature image with the size of 1/8 original image; the low-resolution branch extracts high-dimensional semantic information through continuous downsampling and cross fusion with the high-resolution feature map to obtain a low-resolution feature map with the size of 1/64 original map; adding a depth pyramid pooling module at the tail end of the low-resolution branch, wherein the depth pyramid pooling module is used for expanding an effective receptive field and fusing multi-scale context information to obtain a multi-scale fused low-resolution feature map;
the dual-resolution feature extraction network is composed of six levels of conv1, conv2, conv3_x, conv4_x, conv5_x and DPPM, wherein x=1 or 2, x=1 represents a high-resolution branch, and x=2 represents a low-resolution branch;
conv1 comprises a convolution layer with a step size of 2 and a convolution kernel of 3*3, a BatchNorm layer and a ReLU layer, conv1 layer being used to change the dimension of the input image;
conv2 is composed of cascaded residual blocks (Basic Block) and is used to obtain a feature map feature2 at 1/8 of the original image size;
at conv3_x the network splits into two parallel branches of high and low resolution, conv3_1 and conv3_2; conv3_1 adopts the same residual block as conv2 to obtain a high-resolution branch feature map feature3_1 at 1/8 of the original image size, and conv3_2 downsamples the output of conv2 to obtain a low-resolution branch feature map feature3_2 at 1/16 of the original image size;
The conv4_x is divided into two branches of high and low resolution conv4_1 and conv4_2 in parallel, wherein the conv4_1 is used for continuously merging low resolution information and maintaining a high resolution branch feature map feature4_1 of 1/8 original size; conv4_2 is used for obtaining a low-resolution branch feature map feature4_2 with the original size of 1/32;
The conv5_x is divided into two branches of high and low resolution conv5_1 and conv5_2 in parallel, wherein the conv5_1 is used for continuously merging low resolution information and maintaining a high resolution branch feature map feature5_1 of 1/8 original size; conv5_2 is used for obtaining a low-resolution branch feature map feature5_2 with the original size of 1/64;
DPPM is used to expand receptive fields and fuse multi-scale context information;
S2: differential convolution and spatial attention operations are respectively carried out on the feature images feature2 with the 1/8 original image size and the low-resolution branch feature images feature3_2 with the 1/16 original image size extracted in the S1, the low-resolution branch feature images feature4_2 with the 1/32 original image size and the low-resolution branch feature images feature5_2 with the 1/64 original image size by utilizing a differential boundary attention module, multi-scale edge feature images are extracted and fused, and an edge prediction image of a transparent object is obtained after feature dimension reduction;
s3: carrying out category-level context relation modeling on the high-resolution feature map obtained in the step S1 and the multi-scale fused low-resolution feature map by utilizing a region attention module, enhancing the features of pixels from the same object, and obtaining a pixel-region enhanced feature map; and fusing the high-resolution feature map, the multi-scale edge feature map and the pixel-area enhancement feature map, and obtaining a final transparent object segmentation result after feature dimension reduction.
2. The transparent object image segmentation method according to claim 1, wherein the Basic Block of conv2 comprises two convolution layers with a 3*3 convolution kernel and an Identity Block; the 3*3 convolution layers extract different input features while keeping the computational cost of the model low, and the Identity Block carries the features of shallow layers forward, preventing the gradient from vanishing as the network deepens.
3. The transparent object image segmentation method according to claim 1, wherein the feature map feature3_1 is passed through the conv4_1 operation to obtain a feature map hfeature3_1; the feature map feature3_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature3_2; hfeature3_1 and hfeature3_2 are fused to obtain the high-resolution branch feature map feature4_1 at 1/8 of the original image size; the feature map feature3_2 is passed through the conv4_2 operation to obtain a feature map lfeature3_2; the feature map feature3_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature3_1; lfeature3_1 and lfeature3_2 are fused to obtain the low-resolution branch feature map feature4_2 at 1/32 of the original image size; conv5_x is composed of cascaded residual blocks (Bottleneck Block); a Bottleneck Block comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel and an Identity Block, which reduces the computational cost in deep networks; the feature map feature4_1 is passed through the conv5_1 operation to obtain a feature map hfeature4_1; the feature map feature4_2 is passed through a 1*1 convolution for channel compression and then upsampled by bilinear interpolation to obtain a feature map hfeature4_2; hfeature4_1 and hfeature4_2 are fused to obtain the high-resolution branch feature map feature5_1 at 1/8 of the original image size; the feature map feature4_2 is passed through the conv5_2 operation to obtain a feature map lfeature4_2; the feature map feature4_1 is downsampled by a 3*3 convolution with a stride of 2 to obtain a feature map lfeature4_1; lfeature4_1 and lfeature4_2 are fused to obtain the low-resolution branch feature map feature5_2 at 1/64 of the original image size.
4. The transparent object image segmentation method according to claim 3, wherein the DPPM comprises five parallel branches: the feature map feature5_2 is passed through a 1*1 convolution to obtain a feature map y1; feature5_2 is passed through a pooling layer with kernel_size=3 and stride=2, a 1*1 convolution and upsampling, fused with y1, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y2; feature5_2 is passed through a pooling layer with kernel_size=5 and stride=4, a 1*1 convolution and upsampling, fused with y2, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y3; feature5_2 is passed through a pooling layer with kernel_size=9 and stride=8, a 1*1 convolution and upsampling, fused with y3, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y4; feature5_2 is passed through global average pooling, a 1*1 convolution and upsampling, fused with y4, and the fused feature map is passed through a 3*3 convolution to obtain a feature map y5; the feature maps y1, y2, y3, y4 and y5 are concatenated and then passed through a 1*1 convolution to change the number of channels, giving the final multi-scale fused low-resolution feature map feature6.
5. The transparent object image segmentation method according to claim 1, wherein the differential boundary attention module is composed of four parallel branches, each consisting of a pixel difference convolution module and a spatial attention module; the pixel difference convolution module comprises a difference convolution layer with a 3*3 convolution kernel, a ReLU layer and a convolution layer with a 1*1 convolution kernel; the spatial attention module comprises two convolution layers with a 1*1 convolution kernel, one convolution layer with a 3*3 convolution kernel, a ReLU layer and a Sigmoid function; each feature map selected in S1 first passes through a pixel difference convolution module (PDCM) and then through a spatial attention module (SAM) to obtain the corresponding boundary feature map.
6. The transparent object image segmentation method according to claim 1, wherein the pixel-region enhanced feature map in S3 is obtained through the following specific steps:
S3-1: A Softmax operation is performed on the multi-scale fused low-resolution feature map to obtain K coarse segmentation regions {R_1, R_2, ..., R_K}, where K is the number of segmentation categories, R_k is a two-dimensional vector, and each element of R_k represents the probability that the corresponding pixel belongs to category k;
S3-2: The k-th region representation feature is obtained with the following formula, i.e. the features of all pixels of the whole image are weighted by their probabilities of belonging to region k and summed:

f_k = Σ_i r_ki · x_i

where x_i denotes the feature of pixel p_i, r_ki denotes the probability that pixel p_i belongs to region k, and f_k denotes the region representation feature;
S3-3: The correspondence between each pixel and each region is calculated through a self-attention mechanism, with the calculation formula:

w_ik = exp(t(x_i, f_k)) / Σ_j exp(t(x_i, f_j)),   y_i = u_4( Σ_k w_ik · u_3(f_k) )

where t(x, f) = u_1(x)^T u_2(f), and u_1, u_2, u_3 and u_4 denote transfer functions FFN consisting of a 1*1 convolution, a BatchNorm layer and a ReLU layer; u_1(x)^T and u_2(f) are taken as the key and query of the self-attention mechanism to calculate the correlation between the pixel features and the region representation features, the correlation is normalized to obtain w_ik, and w_ik is used as a weight and multiplied by u_3(f_k) to obtain the pixel-region enhanced feature y_i;
S3-4: The pixel-region enhanced feature map y_aug is composed of the pixel-region enhanced features y_i of all pixel points.
7. The transparent object image segmentation method according to claim 1, wherein the edge prediction image of the transparent object obtained in S2 is used to calculate an edge loss function L1, the transparent object segmentation result obtained in S3 is used to calculate a loss function L2 of the transparent object, and the total loss function L is the sum of the edge loss function L1 and the loss function L2; the total loss participates in the network weight update during gradient descent to optimize the model parameters; the edge loss function L1 and the loss function L2 both adopt the cross entropy loss function; taking the edge loss function L1 as an example, the calculation formula is:

L1 = −Σ_i [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]

where p_i is the predicted boundary probability of pixel i and y_i is the ground-truth boundary label of pixel i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210633162.3A CN115082675B (en) | 2022-06-07 | 2022-06-07 | Transparent object image segmentation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210633162.3A CN115082675B (en) | 2022-06-07 | 2022-06-07 | Transparent object image segmentation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115082675A CN115082675A (en) | 2022-09-20 |
CN115082675B (en) | 2024-06-04
Family
ID=83248245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210633162.3A Active CN115082675B (en) | 2022-06-07 | 2022-06-07 | Transparent object image segmentation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115082675B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115294412A (en) * | 2022-10-10 | 2022-11-04 | 临沂大学 | Real-time coal rock segmentation network generation method based on deep learning |
CN116309274B (en) * | 2022-12-12 | 2024-01-30 | 湖南红普创新科技发展有限公司 | Method and device for detecting small target in image, computer equipment and storage medium |
CN115880567B (en) * | 2023-03-03 | 2023-07-25 | 深圳精智达技术股份有限公司 | Self-attention calculating method and device, electronic equipment and storage medium |
CN117788722B (en) * | 2024-02-27 | 2024-05-03 | 国能大渡河金川水电建设有限公司 | BIM-based safety data monitoring system for underground space |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1854965B1 (en) * | 2006-05-02 | 2009-10-21 | Carl Freudenberg KG | Oil seal |
AU2020103901A4 (en) * | 2020-12-04 | 2021-02-11 | Chongqing Normal University | Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field |
CN114359297A (en) * | 2022-01-04 | 2022-04-15 | 浙江大学 | Attention pyramid-based multi-resolution semantic segmentation method and device |
Non-Patent Citations (1)
Title |
---|
Dual-path semantic segmentation combined with an attention mechanism; Zhai Pengbo; Yang Hao; Song Tingting; Yu Kang; Ma Longxiang; Huang Xiangsheng; Journal of Image and Graphics; 2020-08-12 (No. 08); 119-128 *
Also Published As
Publication number | Publication date |
---|---|
CN115082675A (en) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115082675B (en) | Transparent object image segmentation method and system | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN111462126B (en) | Semantic image segmentation method and system based on edge enhancement | |
CN109522966B (en) | Target detection method based on dense connection convolutional neural network | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
CN113256649B (en) | Remote sensing image station selection and line selection semantic segmentation method based on deep learning | |
CN113362242B (en) | Image restoration method based on multi-feature fusion network | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
CN116797787A (en) | Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network | |
CN115631513B (en) | Transformer-based multi-scale pedestrian re-identification method | |
CN113066089A (en) | Real-time image semantic segmentation network based on attention guide mechanism | |
CN115424017B (en) | Building inner and outer contour segmentation method, device and storage medium | |
CN114092824A (en) | Remote sensing image road segmentation method combining intensive attention and parallel up-sampling | |
CN115424059A (en) | Remote sensing land use classification method based on pixel level comparison learning | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN117726954B (en) | Sea-land segmentation method and system for remote sensing image | |
Van Hoai et al. | Feeding Convolutional Neural Network by hand-crafted features based on Enhanced Neighbor-Center Different Image for color texture classification | |
Cho et al. | Modified perceptual cycle generative adversarial network-based image enhancement for improving accuracy of low light image segmentation | |
Vijayalakshmi K et al. | Copy-paste forgery detection using deep learning with error level analysis | |
CN116681978A (en) | Attention mechanism and multi-scale feature fusion-based saliency target detection method | |
Li et al. | A new algorithm of vehicle license plate location based on convolutional neural network | |
CN116704367A (en) | Multi-scale feature fusion farmland change detection method and system | |
Mujtaba et al. | Automatic solar panel detection from high-resolution orthoimagery using deep learning segmentation networks | |
Bendre et al. | Natural disaster analytics using high resolution satellite images | |
CN113065547A (en) | Character supervision information-based weak supervision text detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||