CN114821580A

CN114821580A - Noise-containing image segmentation method by stage-by-stage merging with denoising module

Info

Publication number: CN114821580A
Application number: CN202210497742.4A
Authority: CN
Inventors: 陈飞; 黄琳; 曾勋勋
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2022-05-09
Filing date: 2022-05-09
Publication date: 2022-07-29

Abstract

The invention provides a segmentation method of a noisy image merged into a denoising module in stages, which comprises the steps of firstly inputting a noisy image into a backbone network, and extracting feature maps of four stages through convolution operation; secondly, obtaining a preliminary semantic segmentation result by a double attention mechanism on the feature map extracted in the fourth stage. On the basis, by means of feature difference of different stages of a backbone network, through iterative fusion of multi-stage semantic features, a denoising assisting partition mode is formed, and a denoising assisting mode is partitioned; and finally, combining the obtained three semantic segmentation results to form a final segmentation result, and further optimizing parameters through mixed cross entropy loss. The invention utilizes the cooperative denoising and segmentation to improve the semantic segmentation precision of the noise image, and solves the problems that the denoising link in the existing semantic segmentation method aiming at the noise image loses semantic information, so that the accuracy of subsequent target category segmentation and the segmentation integrity of the target contour are influenced.

Description

Noise-containing image segmentation method by stage-by-stage merging with denoising module

Technical Field

The invention relates to the technical field of computer vision, in particular to a noise-containing image segmentation method by stage merging a denoising module.

Background

The goal of semantic segmentation is to determine the classification of each pixel (e.g., belonging to the background, person or car, etc.) and thereby convert some of the original image into a mask with a highlighted region of interest. At present, many advanced segmentation methods have been applied in various fields, such as automatic driving, scene parsing, target detection, human-computer interaction, and the like. Recently, deep neural networks have achieved significant success in the task of semantic segmentation. Such as PSANet, DenseAPP, and DANet. However, the success of these networks is premised on a high quality training data set, i.e., clean, noise-free images. However, in practical applications, due to the environment, focus failure and shake of the camera during shooting, the image captured by the image capturing apparatus often has different degrees of noise information, such as: gaussian noise, short noise, thermal noise, etc. These noise information are not controllable, and even a sophisticated image capturing apparatus cannot control the environment when a real image is captured. Noise information tends to cover some small textures of the image, reducing the ability to semantically segment the model. When a noisy data set is used to train a semantic segmentation model of the current mainstream, the segmentation accuracy is significantly reduced. In order to reduce the interference of noise on semantic information, the most direct method is to complete an image denoising task before performing a semantic segmentation task. A series connection method of one-step in-place denoising and one-step in-place segmentation is adopted, although the method can remove noise, some small texture semantic information is lost correspondingly, and the result of training a segmentation model by using a clean image still cannot be achieved. When the DANet is trained by using a data set with noise, the noise information seriously interferes the image texture structure, the boundary part of a target is influenced, so that the context information cannot be accurately acquired, the target area is wrongly divided, and the semantic segmentation precision is obviously reduced. A simple series architecture is formed by adding a denoising module to the DANet, and although the damage of noise information to a target texture structure can be weakened, the loss of partial texture information is inevitably caused, so that the error positioning is generated in the semantic segmentation in the next step, and the segmentation of a target region is incomplete.

Disclosure of Invention

In view of the above, the present invention aims to provide a noisy image segmentation method that incorporates a denoising module in stages, which improves the semantic segmentation precision of a noisy image by using cooperative denoising and segmentation, and solves the problem that the semantic information is lost in the denoising link in the existing noisy image semantic segmentation method, so that the accuracy of subsequent target class division and the segmentation integrity of a target contour are affected.

In order to achieve the purpose, the invention adopts the following technical scheme: a noise-containing image segmentation method by stage merging into a denoising module comprises the following steps:

step S1: in a clean PASCAL VOC 2012 data set y ⁽¹⁾ ,y ⁽²⁾ ,...,y ^(m) On the mean value of random superposition 0, standard deviation 0,30]The Gaussian noise of (2) yields a training set of noisy images { x } ⁽¹⁾ ,x ⁽²⁾ ,...,x ^(m) }；

Step S2: noise image x ⁽ⁱ⁾ Inputting the data into a backbone network ResNet50, and sequentially passing through Stage1, Stage2, Stage3 and Stage4 to extract the characteristics of each Stage;

step S3: generating a feature map f by the backbone network Stage4 ₄ Inputting the data into a DAM (double attention Module), refining the characteristics, and outputting a primary segmentation result z ₁ ；

Step S4: generating a feature map f by the backbone network Stage3 ₃ And a preliminary segmentation result z ₁ Input to segmentation based on staged collaborationIn noise block SDBSC; firstly, the stage characteristic, the segmentation result and the multi-scale characteristic of the denoising task of the backbone network are combined through a linear transformation formula to generate a new characteristic diagram, and then a new segmentation result z is generated through a segmentation module SSM in the SDBSC ₂ ；

Step S5: generating a feature map f of a backbone network Stage2 ₂ And a preliminary segmentation result z ₂ Inputting the data into a segmentation and denoising block SDBSC based on the stepwise cooperation, repeating the step S4, and generating a new segmentation result z ₃ And through the clean image y ⁽ⁱ⁾ Calculating the mean square error loss L of the denoised image _d ，L _d Can be expressed as:

wherein y is _i A group Truth representing the pixel i,

representing the probability estimation of the pixel i, and n represents the number of pixel points;

step S6: finally, the staged segmentation result z is obtained ₁ z ₂ z ₃ Performing superposition to generate semantic segmentation result of multi-stage feature fusion

Step S7: by splitting label z pairs

Calculating the Mixed Cross entropy loss L _S Can be represented as

Wherein Ls ₁ Represents the cross entropy loss:

where p represents the number of pixels of a picture,

represents the group Truth class of pixel i,

representing the probability estimate, Ls, of the pixel i ₂ Represents mIOU loss:

where X denotes a predicted pixel set and Y denotes a GT pixel set.

In a preferred embodiment: the method for segmenting the noisy image by being fused into the denoising module in stages according to claim 1, wherein: in S3, the method further includes:

step S31: feature map f generated by Stage4 of backbone network ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C ^×HW Characteristic diagram B ^HW×C (ii) a Performing matrix multiplication on B and A, and calculating by a softmax layer to obtain a spatial attention feature map M ^HW×HW It can be expressed as:

wherein M is _ji Representing the relation between the ith position and the jth position in the characteristic diagram, and H and W respectively representing the characteristic diagram f ₄ Length and width of (B) _i Denotes the ith position, A, of the matrix B _j Representing the jth position of matrix a. Will the characteristic diagram f ₄ ^C×H×W Remodelling to C ^C×HW Multiplying by M and reshaping it into a feature map P consistent with the original feature map size ^C×H×W Multiplying the value by a scale parameter lambda, initializing the lambda to be 0, and continuously distributing more weights through learning, wherein the formula is as follows:

wherein, P _j Denotes the jth position, C, of the feature map P _i Represents the ith position of the matrix C;

step S32: feature map f generated by Stage4 of backbone network ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C ^×HW Characteristic diagram B ^HW×C (ii) a Matrix multiplication is carried out on A and B, and a channel attention feature map N is obtained through calculation of a softmax layer ^C×C It can be expressed as:

wherein N is _ji Representing the association of the ith location and the jth channel in the profile,

to represent

Will the characteristic diagram f ₄ ^C×H×W Remodelling to C ^C×HW Multiplying the matrix N and reshaping the matrix N into a feature map Q consistent with the original feature map size ^C ^×H×W Multiplying it by a scale parameter μ, initializing μ to 0, and formulating as:

wherein Q _j The jth position of the characteristic diagram Q is shown.

In a preferred embodiment: in S4, the method further includes:

step S41: for input image R ^3×H×W Performing convolution operation with convolution kernel of 3 and padding of 1 for multiple times and adopting a global average pooling downsampling mode to obtain a feature map R ^4C×H/4×W/4 Is obtained toLow-level features of noisy images;

step S42: respectively convolving the low-level features of the noise image by the holes of different expansion factors (rate3, rate6 and rate9), and fusing multi-scale information to obtain a new feature map R ^4C×H/4×W/4 And then the linear conversion formula is:

wherein α (·) and

for a linear transformation function, x represents a feature map R convolved with a hole ^4C×H/4×W/4 Stage represents the profile of the backbone network output, S _out Output representing last stage semantic segmentation

output is a feature graph output after fusion;

step S43: in the decoding stage of the denoising step, the upsampling adopts a deconvolution mode, and the linear conversion formula in the step S42 is applied in the last step of the decoding step.

Compared with the prior art, the invention has the following beneficial effects:

1) by introducing the idea of multi-scale and utilizing the cavity convolution, the problem that the small texture structure of the image is covered by noise is better solved. Moreover, the semantic information of each stage, the low-level characteristics in a denoising link and the segmentation result are effectively combined by utilizing linear conversion through converting the staged segmentation into a denoising method, so that the semantic segmentation promotes the image denoising;

2) the method for enhancing the target contour semantic information by utilizing the high-level semantic features and the low-level semantic features extracted from each stage in the backbone network realizes denoising and segmentation of the image with noise in stages and helps denoising and segmentation tasks from different levels.

Drawings

Fig. 1 is a flowchart of a noisy image segmentation method by stage-wise merging into a denoising module in a preferred embodiment of the present invention.

FIG. 2 is a graph comparing semantic segmentation algorithms with other noisy images in a preferred embodiment of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

As shown in fig. 1, the method for segmenting a noisy image by incorporating a denoising module in stages according to the present invention is implemented by the following steps:

step S3: generating a feature map f of a backbone network Stage4 ₄ Inputting the data into a Double Attention Module (DAM), thinning the characteristics and outputting a preliminary segmentation result z 1;

step S4: generating a feature map f of a backbone network Stage3 ₃ And preliminarySegmentation result z ₁ Inputting the data into a Segmentation and Denoising Block (SDBSC) based on the stage synergy. Firstly, the stage characteristic, the segmentation result and the multi-scale characteristic of the denoising task of the backbone network are combined through a linear transformation formula to generate a new characteristic diagram, and then a new segmentation result z is generated through a segmentation module SSM in the SDBSC ₂ ；

Step S5: generating a feature map f of a backbone network Stage2 ₂ And a preliminary segmentation result z ₂ Inputting the data into a Segmentation and Denoising Block (SDBSC) based on the stepwise cooperation, repeating the step S4, and generating a new segmentation result z ₃ And through the clean image y ⁽ⁱ⁾ Calculating the mean square error loss L of the denoised image _d ，L _d Can be expressed as:

wherein y is _i A group Truth representing the pixel i,

Step S7: by splitting label z pairs

Calculating the Mixed Cross entropy loss L _S Can be represented as

Wherein Ls ₁ Represents the cross entropy loss:

where p represents the number of pixels of a picture,

represents the group Truth class of pixel i,

where X denotes a predicted pixel set and Y denotes a GT pixel set.

In step S3, the method further includes the steps of:

step S31: feature map f generated by Stage4 of backbone network ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C ^×HW Characteristic diagram B ^HW×C . Performing matrix multiplication on B and A, and calculating by a softmax layer to obtain a spatial attention feature map M ^HW×HW It can be expressed as:

step S32: feature map f generated by Stage4 of backbone network ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C ^×HW Characteristic diagram B ^HW×C . Matrix multiplication is carried out on A and B, and a channel attention feature map N is obtained through calculation of a softmax layer ^C×C It can be expressed as:

to represent

wherein Q _j The jth position of the characteristic diagram Q is shown.

In step S4, the method further includes the steps of:

step S41: for input image R ^3×H×W Performing convolution operation with convolution kernel of 3 and padding of 1 for multiple times and adopting a downsampling mode of global average pooling to obtain a feature map R ^4C×H/4×W/4 Acquiring low-level features of the noise image;

step S42: will make an uproarThe low-level features of the sound image are respectively subjected to cavity convolution by different expansion factors (rate3, rate6 and rate9), and multi-scale information is fused to obtain a new feature map R ^4C×H/4×W/4 And then the linear conversion formula is:

wherein α (·) and

output is a feature graph output after fusion;

The following is a specific embodiment of the present invention.

The application of the algorithm provided by the invention to the semantic segmentation of the noisy image comprises the following specific steps:

1. in a clean PASCAL VOC 2012 data set y ⁽¹⁾ ,y ⁽²⁾ ,...,y ^(m) On the mean value of random superposition 0, standard deviation 0,30]The Gaussian noise of (2) yields a training set of noisy images { x } ⁽¹⁾ ,x ⁽²⁾ ,...,x ^(m) }；

2. Noise image x ⁽ⁱ⁾ Inputting the data into a backbone network ResNet50, and sequentially passing through Stage1, Stage2, Stage3 and Stage4 to extract the characteristics of each Stage;

3. generating a feature map f of a backbone network Stage4 ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C×HW Characteristic diagram B ^HW×C . Performing matrix multiplication on B and A, and calculating by a softmax layer to obtain a spatial attention feature map M ^HW×HW ；

4. Generating a feature map f of a backbone network Stage4 ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C×HW Characteristic diagram B ^HW×C . Matrix multiplication is carried out on A and B, and a channel attention feature map N is obtained through calculation of softmax layer ^C×C ；

5. Feature map M of spatial attention ^HW×HW And channel attention profile N ^C×C Are respectively reacted with A ^C×HW Matrix multiplication is carried out, and then a primary segmentation result z1 is obtained through addition after reshaping;

6. generating a feature map f of a backbone network Stage3 ₃ And a preliminary segmentation result z ₁ Inputting the data into a Segmentation and Denoising Block (SDBSC) based on the staged collaboration;

7. for input image R ^3×H×W Performing convolution operation with convolution kernel of 3 and padding of 1 for multiple times and adopting a global average pooling downsampling mode to obtain a feature map R ^4C×H/4×W/4 Acquiring low-level features of the noise image;

8. respectively convolving the low-level features of the noise image by the holes of different expansion factors (rate3, rate6 and rate9), and fusing multi-scale information to obtain a new feature map R ^4C×H/4×W/4 By the linear transformation formula:

obtaining a fused feature map;

9. in the decoding stage of the denoising link, the up-sampling adopts a deconvolution mode, and simultaneously, a linear conversion formula is applied in the last link of the decoding stage to obtain an image after preliminary denoising.

10. Passing the denoised image through a backbone network ResNet50, and then repeating the steps 3) -5) to obtain a primary segmentation result z ₂ 。

11. Generating a feature map f of a backbone network Stage2 ₂ And a preliminary segmentation result z ₂ Inputting the data into a Segmentation and Denoising Block (SDBSC) based on the stage synergy, repeating the steps 7) -10) to obtain a new segmentation result z ₃ And denoised images.

12. Segmenting the result z in stages ₁ z ₂ z ₃ Performing superposition to generate semantic segmentation result of multi-stage feature fusion

13. Computing mean square error loss for denoised images

14. By splitting label z pairs

Calculating the mixed cross entropy loss:

wherein

Figure 2 is a graph showing the qualitative comparison of the algorithm of this example with other methods on a PASCAL VOC 2012 data set with a gaussian noise standard deviation of 20. As can be seen from the three columns of fig. 2(c) (d) (e), after the denoising module is added, the semantic segmentation quality is still not satisfactory, the target range is identified incorrectly, and the boundary of the target is not accurate and smooth, fig. 2(f) shows the result of the DMS, the problem of incorrect identification is improved, but the boundary area is not identified well. Fig. 2(g) shows the result of the algorithm in this embodiment, and it is obvious that the segmentation can be performed well no matter in the case of multiple classes or small targets. Moreover, as can be seen from the images in the 4 th and 5 th rows, the algorithm in the present embodiment achieves a significant improvement in the target boundary.

Claims

1. A noise-containing image segmentation method by stage merging into a denoising module is characterized by comprising the following steps: the method comprises the following steps:

step S3: feature map f generated by Stage4 of backbone network ₄ Inputting the data into a DAM (double attention Module), refining the characteristics, and outputting a primary segmentation result z ₁ ；

Step S4: generating a feature map f by the backbone network Stage3 ₃ And a preliminary segmentation result z ₁ Inputting the data into a segmentation de-noising block SDBSC based on the stage cooperation; firstly, the stage characteristic, the segmentation result and the multi-scale characteristic of the denoising task of the backbone network are combined through a linear transformation formula to generate a new characteristic diagram, and then a new segmentation result z is generated through a segmentation module SSM in the SDBSC ₂ ；

Step S5: generating a feature map f of a backbone network Stage2 ₂ And a preliminary segmentation result z ₂ Inputting the data into a segmentation and denoising block SDBSC based on the stepwise cooperation, repeating the step S4, and generating a new segmentation result z ₃ And through the clean image y ⁽ⁱ⁾ Calculating the mean square error loss L of the denoised image _d ，L _d Expressed as:

wherein y is _i A group Truth representing the pixel i,

Step S7: by splitting label z pairs

Calculating the Mixed Cross entropy loss L _S Is shown as

Wherein

Represents the cross entropy loss:

where p represents the number of pixels of a picture,

represents the group Truth class of pixel i,

representing the probability estimate for the pixel i,

represents mIOU loss:

where X denotes a predicted pixel set and Y denotes a GT pixel set.

2. The method for segmenting the noisy image by being fused into the denoising module in stages according to claim 1, wherein: in S3, the method further includes:

step S31: feature map f generated by Stage4 of backbone network ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C×HW Characteristic diagram B ^HW×C (ii) a Performing matrix multiplication on B and A, and calculating by a softmax layer to obtain a spatial attention feature map M ^HW×HW Expressed as:

wherein M is _ji Representing the relation between the ith position and the jth position in the characteristic diagram, and H and W respectively representing the characteristic diagram f ₄ Length and width of (B) _i Denotes the ith position, A, of the matrix B _j Represents the jth position of matrix A; will the characteristic diagram f ₄ ^C×H×W Remodelling to C ^C×HW Multiplying by M and reshaping it into a feature map P consistent with the original feature map size ^C×H×W Multiplying the value by a scale parameter lambda, initializing the lambda to be 0, and continuously distributing more weights through learning, wherein the formula is as follows:

step S32: feature map f generated by Stage4 of backbone network ₄ ^C×H×W Respectively obtaining feature maps A through reshape ^C×HW Characteristic diagram B ^HW×C (ii) a Matrix multiplication is carried out on A and B, and a channel attention feature map N is obtained through calculation of a softmax layer ^C×C Expressed as:

to represent

Will the characteristic diagram f ₄ ^C×H×W Remodelling to C ^C×HW Multiplying the matrix N and reshaping the matrix N into a feature map Q consistent with the original feature map size ^C×H×W Multiplying it by a scale parameter μ, initializing μ to 0, and formulating as:

wherein Q _j The jth position of the characteristic diagram Q is shown.

3. The method for segmenting the noisy image by being fused into the denoising module in stages according to claim 1, wherein: in S4, the method further includes:

wherein α (·) and

for a linear transformation function, x represents a feature map R convolved with a hole ^4C×H/4×W/4 ，stage represents a characteristic diagram of the backbone network output, S _out Output representing last stage semantic segmentation

output is a feature graph output after fusion;