CN117975002A - Weakly supervised image segmentation method based on multi-scale pseudo-label fusion - Google Patents

Weakly supervised image segmentation method based on multi-scale pseudo-label fusion

Info

Publication number
CN117975002A
Authority
CN
China
Prior art keywords
image
segmentation
attention
feature
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410094113.6A
Other languages
Chinese (zh)
Inventor
付佳
王国泰
张少霆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202410094113.6A
Publication of CN117975002A
Legal status: Pending

Classifications

    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V2201/03 Recognition of patterns in medical or anatomical images
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised image segmentation method based on multi-scale pseudo-label fusion, and belongs to the technical field of automatic image recognition. The invention uses an image classification model based on multi-scale features to perform uncertainty weighting on the multi-level attention maps of training images that contain only image-level labels, and generates a high-resolution fused attention map. To address the problem of unclear edges in the attention map, the invention further provides a method for generating a pseudo label containing context information based on seed-point expansion; this pseudo label is spatially adaptively fused with the attention map to obtain a high-quality segmentation pseudo label. To address noise in the segmentation pseudo label, the invention further provides a segmentation model training method based on pseudo-label reliability weighting, and thereby obtains a high-performance segmentation model. The proposed method addresses the difficulty of obtaining accurate pixel-level annotations for image segmentation tasks and enables the segmentation model to achieve good performance when trained with only image-level labels, greatly reducing the amount and cost of annotation for image segmentation training sets.

Description

Weakly supervised image segmentation method based on multi-scale pseudo-label fusion
Technical Field
The invention relates to a technique for automatic image segmentation using models trained only with image-level labels, and belongs to the technical field of automatic image recognition.
Background
Image segmentation is an important basis for intelligent image recognition. For example, accurate segmentation of a target organ or lesion in medical images can assist disease diagnosis and quantitative assessment of treatment. Owing to factors such as variable backgrounds, irregular object shapes and low image contrast, traditional image segmentation methods such as thresholding, active contour models, region growing and edge detection cannot achieve reliable segmentation accuracy.
In recent years, deep learning methods have been able to effectively extract knowledge of the target region by learning from large numbers of labeled training images, and to obtain high-accuracy segmentation results. However, this good performance relies on a large number of images with accurate pixel-level annotations to train the model. Pixel-level annotation of images must be performed by experienced annotators and is very time-consuming, which makes high-quality annotated data difficult to acquire and greatly limits the application of deep learning algorithms to image segmentation tasks.
To overcome this problem, weakly supervised learning methods help reduce the reliance on high-quality annotated data. Weakly supervised methods train with only weak labels such as bounding boxes, scribbles, points or image-level labels, which can greatly reduce the annotation workload. Among these labeling schemes, image-level labeling only requires annotators to label the category of an image (for example, whether it contains a certain object) without precisely delineating object edges, so labeling efficiency is extremely high and the cost is lower than that of the other schemes. However, because pixel-level labeling information is missing, the supervisory signal provided to the image segmentation model is very limited, and it becomes very difficult to train a segmentation model with good performance under these conditions.
Existing weakly supervised methods based on image-level labels usually train an image classification model and then visualize its attention maps, so that the target region is initially localized to obtain a segmentation result. However, the resolution of the attention map obtained from the classification model in such methods is typically eight times lower than that of the original image, making it difficult to obtain a high-resolution segmentation result. Moreover, these methods capture only the most salient part of the target object rather than its accurate extent, which easily leads to incomplete segmentation regions or false-positive regions, so the accuracy of the segmentation result remains low. Therefore, a more effective method is needed to achieve higher-accuracy segmentation when the segmentation model is trained with only image-level annotated data.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing weakly supervised segmentation methods based on image-level labels, and provides a new weakly supervised learning algorithm based on multi-scale attention maps and pseudo-label learning to address the low segmentation accuracy obtained under weak annotation. The method consists of two stages. In the first stage, an image classification model is trained with image-level labels, and the attention maps obtained by the model at different resolutions are adaptively weighted and fused, effectively exploiting localization information at different resolutions to obtain a higher-quality fused attention map; a refined pseudo label is then generated using the geodesic distance to seed points. In the second stage, this refined pseudo label is used as supervision, and accurate segmentation of the target object is achieved by training the segmentation model with noise-robust learning.
The aim of the invention is achieved by the following technical scheme: a weakly supervised image segmentation method based on multi-scale pseudo-label fusion, comprising the following steps:
Step 1: constructing an image classification model based on multi-scale features;
For a segmentation training set containing only image category labels, an image classification model is first trained using the category labels. The classification model consists of a feature extractor and a classification head. The feature extractor consists of M cascaded feature perception modules; between two feature perception modules there is a downsampling layer that reduces the resolution of the feature map to 1/2 of its original resolution through a convolution with stride 2. Each feature perception module comprises two multi-scale attention modules, with a Dropout layer and a skip connection between them, the skip connection meaning that the input and output of the two multi-scale attention modules are added. Each multi-scale attention module comprises a multi-scale feature extraction module and a channel self-attention module. The multi-scale feature extraction module aims to extract and fuse features at different scales in the image; it is implemented by K parallel convolution layers with different dilation rates, the dilation rates of the K convolution layers being set to d_1, d_2, …, d_K respectively. Let F ∈ R^(C×H×W) denote the input feature map, where C is the number of channels and H and W are the height and width respectively. The operation of the multi-scale feature extraction module is expressed as:
F' = cov(d_1, F) ⊕ cov(d_2, F) ⊕ … ⊕ cov(d_K, F)   (Equation 1)
where cov(d, F) denotes the combination of a convolution layer with dilation rate d, a batch normalization layer and an activation layer, and ⊕ denotes concatenation of feature maps along the channel dimension;
On the basis of F', the channel self-attention module recalibrates the feature map so as to adaptively obtain a feature map with stronger representational capacity. The number of channels of F' is denoted C'. Global average pooling P_avg and global max pooling P_max are applied to F', and the outputs of the two operations are transformed by a multi-layer perceptron module MLP; the resulting feature vectors are respectively expressed as:
v_a = MLP(P_avg(F'))   (Equation 2)
v_m = MLP(P_max(F'))   (Equation 3)
The MLP consists of two fully connected neural network layers whose numbers of output nodes are C'/2 and C' respectively; v_a is the output of the multi-layer perceptron module after global average pooling, and v_m is its output after global max pooling;
On the basis of v_a and v_m, the feature recalibration coefficient α is obtained as:
α = σ(v_a + v_m)   (Equation 4)
where σ denotes the Sigmoid activation function; the output of the channel self-attention module is expressed as:
F'' = F'·α + F'   (Equation 5)
The classification head of the classification model converts the output of the M-th cascaded feature perception module into a probability output for each category; it comprises two fully connected layers and a Softmax layer in series. R denotes the number of categories annotated in the training set, and the classification result of the classification model is denoted p ∈ [0,1]^R;
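As an illustration of Step 1, the following is a minimal PyTorch sketch of one multi-scale attention module (Equations 1-5); the class name, the even channel split across branches and the default dilation rates are assumptions chosen for illustration, not details fixed by the invention.

```python
import torch
import torch.nn as nn

class MultiScaleAttentionModule(nn.Module):
    """One multi-scale attention module: K parallel dilated convolutions (Equation 1)
    followed by a channel self-attention block (Equations 2-5)."""
    def __init__(self, in_channels, out_channels, dilations=(1, 2, 4)):
        super().__init__()
        branch_channels = out_channels // len(dilations)
        # K parallel convolution branches with different dilation rates (cov(d, F) in Equation 1).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True))
            for d in dilations])
        c_prime = branch_channels * len(dilations)
        # Shared MLP for channel self-attention: C' -> C'/2 -> C'.
        self.mlp = nn.Sequential(
            nn.Linear(c_prime, c_prime // 2),
            nn.ReLU(inplace=True),
            nn.Linear(c_prime // 2, c_prime))

    def forward(self, x):
        # Equation 1: concatenate the dilated-convolution outputs along the channel axis.
        f_prime = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Equations 2-3: global average / max pooling followed by the shared MLP.
        v_a = self.mlp(f_prime.mean(dim=(2, 3)))
        v_m = self.mlp(f_prime.amax(dim=(2, 3)))
        # Equation 4: channel recalibration coefficients.
        alpha = torch.sigmoid(v_a + v_m)[:, :, None, None]
        # Equation 5: recalibrate the features and add the residual.
        return f_prime * alpha + f_prime
```

A feature perception module would stack two such modules with a Dropout layer and an additive skip connection between them, and the full feature extractor would cascade M of them separated by stride-2 convolutions.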
Step 2: training an image classification model;
After the image classification model has been constructed, the training images and their classification labels are fed into the model for training; a cross entropy loss function is used for optimization and the weights of the classification model are updated. The classification cross entropy loss is defined as:
L_cls = −Σ_{r=1}^{R} y_r · log(p_r)   (Equation 6)
where p denotes the classification prediction of the model, y denotes the classification label of the training image, R denotes the number of categories, y_r denotes the probability of the r-th category in the image-level label, and p_r is the corresponding prediction of the classification model;
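A brief training-loop sketch for Step 2 follows (PyTorch; the optimizer, learning rate and epoch count are assumptions not specified by the invention). Note that nn.CrossEntropyLoss implements Equation 6 on raw logits, so the final Softmax would be applied only at inference time.

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, epochs=100, lr=1e-3, device="cuda"):
    """Optimize the image classification model with the cross entropy loss of Equation 6."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # Equation 6 for class-index labels
    for _ in range(epochs):
        for images, labels in loader:      # labels are image-level class indices in [0, R)
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)   # model output shape: (batch, R)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```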
step 3: generating an attention map based on multi-scale information fusion;
Using the trained classification model, attention maps of the target object are computed in turn from the feature maps output by the last N feature perception modules of the classification model and are then fused by weighting. For the feature map F^(n) of the n-th feature perception module, the corresponding attention map is denoted A^(n) and is computed as:
α_c = Σ_i ∂p/∂F_c^(n)(i)   (Equation 7)
A^(n)(i) = ReLU( Σ_c α_c · F_c^(n)(i) )   (Equation 8)
where F_c^(n) is the c-th channel of the feature map F^(n), i is the pixel index, ∂p/∂F_c^(n)(i) denotes the partial derivative of the classification prediction with respect to F_c^(n)(i), α_c is the weight of the c-th channel, ReLU denotes the rectified linear activation function, and A^(n)(i) denotes the value of attention map A^(n) at the i-th pixel;
A^(n) is upsampled to the input image size and normalized to [0,1] using its maximum and minimum values; the resulting normalized attention map is denoted Â^(n). The set of normalized attention maps obtained in turn from the last N feature perception modules of the classification model is denoted {Â^(1), …, Â^(N)};
Because the attention maps Â^(n) for different values of n differ in resolution, the elements of this set are fused by uncertainty-based weighting in order to obtain a more stable attention map. The uncertainty map u^(n) and weight map w^(n) corresponding to the n-th attention map Â^(n) are given by:
where w^(n)(i) denotes the value of the i-th pixel in the weight map w^(n); the fused attention map Â obtained with the weights w^(n) is expressed as:
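The sketch below illustrates Step 3 in PyTorch: Grad-CAM-style attention maps (Equations 7-8) computed from the last N feature perception modules, upsampled, normalized and fused by uncertainty weighting. Since Equations 9-11 are not reproduced in this text, the binary-entropy uncertainty and the softmax(-u) weights used here are assumptions.

```python
import torch
import torch.nn.functional as F

def fused_attention_map(model, image, target_class, feature_layers):
    """Compute and fuse multi-level attention maps for one image tensor of shape (C, H, W)."""
    feats = []
    hooks = [layer.register_forward_hook(lambda m, inp, out: feats.append(out))
             for layer in feature_layers]
    prob = model(image.unsqueeze(0))[0, target_class]   # classification score to explain
    for h in hooks:
        h.remove()

    norm_maps = []
    for f in feats:
        grad = torch.autograd.grad(prob, f, retain_graph=True)[0]
        alpha = grad.sum(dim=(2, 3), keepdim=True)                 # Equation 7
        cam = F.relu((alpha * f).sum(dim=1, keepdim=True))         # Equation 8
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
        norm_maps.append(cam)

    maps = torch.cat(norm_maps, dim=1)                             # shape (1, N, H, W)
    # Assumed uncertainty: pixel-wise binary entropy of each normalized map.
    u = -(maps * (maps + 1e-8).log() + (1 - maps) * (1 - maps + 1e-8).log())
    w = torch.softmax(-u, dim=1)                                   # assumed weights, sum to 1 over n
    return (w * maps).sum(dim=1)[0]                                # fused attention map, shape (H, W)
```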
Step 4: generating a pseudo label based on seed-point expansion;
Because the edges of the fused attention map Â are not precise, Â is used as a preliminary localization result to generate seed points, and a pseudo label based on the geodesic distance to these seed points is then generated;
First, seed points are generated from Â: Â is thresholded with a threshold t, and the region of Â with values above t is taken as the foreground region Ω_f; the bounding box of Ω_f is then computed, the center point of the bounding box forms the set of foreground seed points S_f, and the four corner points of the bounding box form the set of background seed points S_b;
Then, the similarity between each pixel and the foreground and background seed points is computed from the geodesic distance. For the foreground seed set S_f, the geodesic distance is computed as:
D_geo(i, S_f, I) = min_{r ∈ P_{i,S_f}} ∫_0^1 ‖∇I(r(l)) · u(l)‖ dl   (Equation 12)
where I denotes the input image, P_{i,S_f} is the set of all possible paths from pixel i to S_f, r(l) is one such path parameterized by l ∈ [0,1], and u(l) is the unit tangent vector of the path at l;
The similarity map of the foreground seed points is denoted P_fs and is computed as:
Similarly, the similarity map of the background seed points is denoted P_bs and is obtained by the following formula:
Finally, P_fs and P_bs are combined to obtain a single seed-based pseudo-label map P_seed, computed as follows:
where a larger value of P_seed at pixel i indicates that the pixel is more likely to belong to the foreground, and a value of 0.5 indicates that the pixel is in an undetermined state;
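A numpy sketch of Step 4 follows. The raster-scan geodesic distance is only an approximation of Equation 12 (a dedicated geodesic-distance library could be used instead), and the exponential similarity maps and the ratio used to form P_seed stand in for Equations 13-15, which are not reproduced in this text.

```python
import numpy as np

def geodesic_distance(image, seed_mask, lam=1.0, n_passes=2):
    """Approximate geodesic distance (Equation 12) from every pixel to the seed set
    using repeated forward/backward raster scans over the image grid."""
    h, w = image.shape
    dist = np.where(seed_mask, 0.0, np.inf)
    fwd = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]     # causal neighbours (forward scan)
    bwd = [(1, 1), (1, 0), (1, -1), (0, 1)]         # anti-causal neighbours (backward scan)
    for _ in range(n_passes):
        for offsets, ys, xs in ((fwd, range(h), range(w)),
                                (bwd, range(h - 1, -1, -1), range(w - 1, -1, -1))):
            for y in ys:
                for x in xs:
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            # local cost: spatial step plus weighted intensity change
                            step = np.hypot(np.hypot(dy, dx),
                                            lam * (image[y, x] - image[ny, nx]))
                            dist[y, x] = min(dist[y, x], dist[ny, nx] + step)
    return dist

def seed_pseudo_label(fused_attention, image, t=0.5):
    """Seed-point generation and seed-based pseudo label P_seed of Step 4."""
    fg = fused_attention > t                                  # foreground region Omega_f
    ys, xs = np.nonzero(fg)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()   # bounding box of Omega_f
    seeds_fg = np.zeros_like(fg)
    seeds_fg[(y0 + y1) // 2, (x0 + x1) // 2] = True           # centre point -> S_f
    seeds_bg = np.zeros_like(fg)
    for y, x in ((y0, x0), (y0, x1), (y1, x0), (y1, x1)):     # four corner points -> S_b
        seeds_bg[y, x] = True
    p_fs = np.exp(-geodesic_distance(image, seeds_fg))        # assumed similarity maps
    p_bs = np.exp(-geodesic_distance(image, seeds_bg))
    return p_fs / (p_fs + p_bs + 1e-8)                        # assumed form of P_seed
```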
Step 5: pseudo tag space self-adaptive fusion;
On the basis of Â and P_seed, the two are fused to obtain a more comprehensive segmentation pseudo label. Because P_seed is more reliable near the seed points while Â is more reliable far from the seed points, a spatially adaptive weight map W is designed to fuse them; the fused pseudo label, denoted P_fuse, is given by:
Then, because P_fuse is a soft label, it is sharpened in order to provide a stronger supervisory signal for the segmentation model, as follows:
where T ≥ 1 is a hyperparameter controlling the degree of sharpening, and the sharpened result Y* is the final pseudo label;
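The following sketch illustrates Step 5. Because Equations 16-17 are not reproduced in this text, the Gaussian weight map W centred on the foreground seed point and the power-T sharpening used below are assumptions chosen only to match the stated behaviour (trust P_seed near the seeds, trust Â far from them, and sharpen the soft label with T ≥ 1).

```python
import numpy as np

def fuse_and_sharpen(attention, p_seed, seed_center, T=2.0, sigma=64.0):
    """Spatially adaptive fusion of the attention map and P_seed, followed by sharpening."""
    h, w = attention.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Assumed W: close to 1 near the foreground seed point, decaying with distance.
    W = np.exp(-((yy - seed_center[0]) ** 2 + (xx - seed_center[1]) ** 2)
               / (2.0 * sigma ** 2))
    fused = W * p_seed + (1.0 - W) * attention        # assumed form of Equation 16
    # Assumed sharpening: raise foreground/background probabilities to the power T.
    fg, bg = fused ** T, (1.0 - fused) ** T
    return fg / (fg + bg + 1e-8)                      # final pseudo label Y* in [0, 1]
```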
Step 6: further training by adopting a pseudo tag segmentation model;
Step 7: and (3) dividing the image to be divided by adopting the division model trained in the step (6).
Further, the training method of step 6 is as follows:
Because the pseudo labels contain noise, using them directly to train the segmentation model lets the noise interfere with training and limits the performance of the model. A noise-robust training method based on pseudo-label reliability weighting is therefore adopted, in which noise interference is suppressed through a consistency constraint between the predictions from two different views;
First, a set of spatial transforms is defined, including random scaling, rotation, flipping and cropping. Two different spatial transforms T_a and T_b are applied to an input image X to obtain two augmented images, which are fed into the segmentation model; the predictions are mapped back to the original image space by the corresponding inverse transforms, and the results are denoted P_a and P_b respectively, with their average denoted P_mean. Pixels where P_mean and the pseudo label Y* differ are likely to be erroneous pseudo labels with low reliability, so their loss weight is reduced. To this end, a consistency-weighted pseudo-label loss L_p is defined:
where the weight π(i) of pixel i is defined as:
Second, for the parts where P_mean and Y* are inconsistent, a consistency constraint loss between P_a and P_b is further introduced, defined as:
Finally, the loss function for training the segmentation model with the pseudo labels is defined as:
L_seg = L_p + λ·L_c   (Equation 21)
where λ is a weight coefficient, typically set to 1.0. The segmentation model is trained on the training images with the pseudo labels and L_seg; after training, the segmentation model is used to infer the segmentation result of a test image.
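A PyTorch sketch of the noise-robust training loss of Step 6 follows, for a single-channel (binary) segmentation output. Horizontal and vertical flips stand in for the random scale/rotate/flip/crop transforms, and the forms of the reliability weight π(i), of L_p and of L_c are assumptions, since Equations 18-20 are not reproduced in this text.

```python
import torch
import torch.nn.functional as F

def noise_robust_loss(model, x, pseudo_label, lam=1.0):
    """Consistency-weighted pseudo-label loss L_p plus consistency loss L_c (Equation 21)."""
    # Two spatial transforms T_a, T_b and their inverses (here: horizontal / vertical flip).
    p_a = torch.flip(model(torch.flip(x, dims=[-1])), dims=[-1])
    p_b = torch.flip(model(torch.flip(x, dims=[-2])), dims=[-2])
    p_a, p_b = torch.sigmoid(p_a), torch.sigmoid(p_b)
    p_mean = 0.5 * (p_a + p_b)                          # average prediction in image space

    # Assumed reliability weight pi(i): high where the mean prediction agrees with Y*.
    pi = (1.0 - torch.abs(p_mean - pseudo_label)).detach()
    # Assumed form of Equation 18: pixel-wise weighted cross entropy against the pseudo label.
    bce = F.binary_cross_entropy(p_mean, pseudo_label, reduction="none")
    l_p = (pi * bce).sum() / (pi.sum() + 1e-8)
    # Assumed form of Equation 20: consistency between the two transformed predictions.
    l_c = F.mse_loss(p_a, p_b)
    return l_p + lam * l_c                              # Equation 21: L_seg = L_p + lambda * L_c
```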
Compared with the prior art, the invention has the following advantages:
(1) Existing deep-learning-based image segmentation methods require a large number of pixel-level segmentation annotations, which entails a heavy annotation workload and high labor and time costs. The weakly supervised image segmentation method provided by the invention trains the segmentation model using only image-level annotations, greatly reducing annotation cost and time.
(2) Most existing weakly supervised segmentation methods generate class activation maps only from the feature maps close to the output layer, so the class activation maps have overly smooth edges and low resolution. The invention obtains class activation maps with higher resolution and higher accuracy by fusing activation maps generated at multiple layers of the classification network, and obtains higher-quality segmentation pseudo labels through correction based on the geodesic distance to seed points.
(3) Most existing methods train the image segmentation network directly with the pseudo labels under a fully supervised strategy, ignoring the influence of noise in the pseudo labels on the performance of the segmentation model. The invention provides a noise-robust learning strategy based on consistency constraints, which effectively reduces the influence of noisy pseudo labels and thus improves the performance of the segmentation model.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of attention map generation based on multi-scale information fusion with the image classification model in the present invention;
FIG. 3 is a schematic diagram of pseudo-label generation and fusion based on seed-point expansion;
FIG. 4 compares the fetal brain segmentation results of the present invention with other methods on MRI images; white lines denote the gold standard and black lines denote the segmentation results of each algorithm.
Detailed Description
The following embodiment of brain region segmentation in fetal brain MRI images based on image-level labels is provided in connection with the present invention. It was implemented on a computer with an Intel(R) Core(TM) i9-10940X 3.30 GHz CPU, an Nvidia GTX 3090 GPU and 64.0 GB of memory, using Python as the programming language.
Step 1, training data collection and preprocessing
A batch of abdominal three-dimensional MRI images was collected and converted into two-dimensional slices. Through preprocessing, the resolution of these two-dimensional images was resampled to 1 mm × 1 mm, cropped to 256 × 256, and saved in PNG format. The data were divided into an 80% training set and a 20% validation set. Each slice in the training set was manually given a binary label: 0 indicates that the slice does not contain the fetal brain, and 1 indicates that it does.
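A preprocessing sketch for this step is given below (assuming each 3D volume has already been loaded as a numpy array of shape (slices, H, W) with known in-plane pixel spacing in millimetres; the volume loading itself, e.g. from DICOM or NIfTI, is omitted).

```python
import numpy as np
from PIL import Image
from scipy.ndimage import zoom

def volume_to_slices(volume, spacing, out_dir, prefix):
    """Resample each 2D slice to 1 mm x 1 mm, centre-crop/pad to 256 x 256, save as PNG."""
    for k, sl in enumerate(volume.astype(np.float32)):
        sl = zoom(sl, (spacing[0] / 1.0, spacing[1] / 1.0), order=1)   # resample to 1 mm spacing
        sl = (255.0 * (sl - sl.min()) / (sl.max() - sl.min() + 1e-8)).astype(np.uint8)
        canvas = np.zeros((256, 256), dtype=np.uint8)
        h, w = min(sl.shape[0], 256), min(sl.shape[1], 256)
        y0, x0 = (sl.shape[0] - h) // 2, (sl.shape[1] - w) // 2        # centre-crop offsets
        canvas[(256 - h) // 2:(256 - h) // 2 + h,
               (256 - w) // 2:(256 - w) // 2 + w] = sl[y0:y0 + h, x0:x0 + w]
        Image.fromarray(canvas).save(f"{out_dir}/{prefix}_{k:03d}.png")
```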
Step 2, classification network training based on image-level annotation
An image classification model is first trained using the class labels. The classification model consists of a feature extractor and a classification head. The feature extractor consists of M = 6 cascaded feature perception modules; between two feature perception modules there is a downsampling layer that reduces the resolution of the feature map to 1/2 of its original resolution through a convolution with stride 2. Each feature perception module comprises two multi-scale attention modules, with a Dropout layer and a skip connection between them (the input and output of the two multi-scale attention modules are added). Each multi-scale attention module comprises a multi-scale feature extraction module and a channel self-attention module. The multi-scale feature extraction module extracts and fuses features at different scales in the image; it is implemented by K = 3 parallel convolution layers with different dilation rates, set to d_1 = 1, d_2 = 2, d_3 = 4 respectively. Let F ∈ R^(C×H×W) denote the input feature map, where C is the number of channels and H and W are the height and width respectively. The operation of the multi-scale feature extraction module is expressed as:
F' = cov(d_1, F) ⊕ cov(d_2, F) ⊕ … ⊕ cov(d_K, F)   (Equation 1)
where cov(d, F) denotes the combination of a convolution layer with dilation rate d, a batch normalization layer and an activation layer, and ⊕ denotes concatenation of feature maps along the channel dimension. The number of channels of F' is denoted C'. Global average pooling (P_avg) and global max pooling (P_max) are applied to F', and the outputs of the two operations are transformed by a multi-layer perceptron module (MLP); the resulting feature vectors are respectively expressed as:
v_a = MLP(P_avg(F'))   (Equation 2)
v_m = MLP(P_max(F'))   (Equation 3)
where the MLP consists of two fully connected neural network layers whose numbers of output nodes are C'/2 and C' respectively. In the M = 6 modules of the feature extractor, the values of C' are set to 32, 64, 128, 256, 512 and 512 in this order.
Further, on the basis of v_a and v_m, the feature recalibration coefficient is obtained as:
α = σ(v_a + v_m)   (Equation 4)
where σ denotes the Sigmoid activation function. The output of the channel self-attention module is expressed as:
F'' = F'·α + F'   (Equation 5)
The classification head of the classification model converts the output of the M-th cascaded feature perception module into a probability output for each class; it comprises two fully connected layers and a Softmax layer in series. The number of classes annotated in the training set is R = 2, and the classification result of the classification model is denoted p ∈ [0,1]^R.
Step 3: training of image classification models
After the image classification model has been constructed, the training images and their classification labels are fed into the model for training, and the cross entropy loss function is used for optimization. The classification cross entropy loss is defined as:
L_cls = −Σ_{r=1}^{R} y_r · log(p_r)   (Equation 6)
where p denotes the classification prediction of the model, y denotes the classification label of the training image, and R = 2 is the number of classes; y_r denotes the probability of the r-th category in the image-level label, and p_r is the corresponding prediction of the classification model.
Step 4: generating attention-seeking diagrams based on multi-scale information fusion
Using the trained classification model, attention maps of the target object are computed in turn from the feature maps of the last N = 4 feature perception modules of the classification model and fused by weighting. For the feature map F^(n) of the n-th feature perception module, the corresponding attention map is denoted A^(n) and is computed as:
α_c = Σ_i ∂p/∂F_c^(n)(i)   (Equation 7)
A^(n)(i) = ReLU( Σ_c α_c · F_c^(n)(i) )   (Equation 8)
where F_c^(n) is the c-th channel of the feature map F^(n), i is the pixel index, ∂p/∂F_c^(n)(i) denotes the partial derivative of the classification prediction with respect to F_c^(n)(i), α_c is the weight of the c-th channel, ReLU denotes the rectified linear activation function, and A^(n)(i) denotes the value of attention map A^(n) at the i-th pixel.
A^(n) is upsampled to the input image size and normalized to [0,1] using its maximum and minimum values; the resulting normalized attention map is denoted Â^(n). To obtain a more stable attention map, the maps Â^(n) are fused by uncertainty-based weighting. The uncertainty map u^(n) and weight map w^(n) corresponding to Â^(n) are given by:
where w^(n)(i) denotes the value of the i-th pixel in the weight map w^(n); the fused attention map Â obtained with the weights w^(n) is expressed as:
step 5: pseudo tag generation based on seed point expansion
First, seed points are generated from Â: Â is thresholded with the threshold set to t = 0.5, and the region of Â with values above t is taken as the foreground region Ω_f; the bounding box of Ω_f is then computed, the center point of the bounding box forming the set of foreground seed points S_f and its four corner points forming the set of background seed points S_b.
Then, the similarity between each pixel and the foreground and background seed points is computed from the geodesic distance. For the foreground seed set S_f, the geodesic distance is computed as:
D_geo(i, S_f, I) = min_{r ∈ P_{i,S_f}} ∫_0^1 ‖∇I(r(l)) · u(l)‖ dl   (Equation 12)
where I denotes the input image, P_{i,S_f} is the set of all possible paths from pixel i to S_f, r(l) is one such path parameterized by l ∈ [0,1], and u(l) is the unit tangent vector of the path at l.
The similarity map of the foreground seed points is denoted P_fs and is computed as:
Similarly, the similarity map of the background seed points is denoted P_bs and is obtained by the following formula:
Finally, P_fs and P_bs are combined to obtain a single seed-based pseudo-label map P_seed, computed as follows:
step 6: pseudo tag space adaptive fusion
On the basis of Â and P_seed, the two are fused to obtain a more comprehensive segmentation pseudo label. A spatially adaptive weight map W is designed to fuse them; the fused pseudo label, denoted P_fuse, is given by:
Then, P_fuse is sharpened as follows:
where T = 2 is a hyperparameter controlling the degree of sharpening, and the sharpened result Y* is the final pseudo label.
Step 7: segmentation model training based on pseudo tag credibility weighting
Based on the pseudo labels, the segmentation model is trained with a consistency constraint between the predictions from two different views, which suppresses the interference of noise. First, a set of spatial transforms is defined, including random scaling, rotation, flipping and cropping. Two different spatial transforms T_a and T_b are applied to an input image X to obtain two augmented images, which are fed into the segmentation model; the predictions are mapped back to the original image space by the corresponding inverse transforms, and the results are denoted P_a and P_b respectively, with their average denoted P_mean. Pixels where P_mean and the pseudo label Y* differ are likely to be erroneous pseudo labels with low reliability, so their pseudo-label loss weight is reduced. To this end, a consistency-weighted pseudo-label loss L_p is defined:
where the weight π(i) of pixel i is defined as:
Second, for the parts where P_mean and Y* are inconsistent, a consistency constraint loss between P_a and P_b is further introduced, defined as:
Finally, the loss function for training the segmentation model with the pseudo labels is defined as:
L_seg = L_p + λ·L_c   (Equation 21)
where λ = 1.0 is the weight coefficient. The segmentation model is trained on the training images with the pseudo labels and L_seg. After training, the segmentation model is used to infer the segmentation result of a test image.
Step8: segmentation on test images
After training, for any test image X_t, the segmentation model trained in step 7 is used directly to predict X_t and complete the segmentation of the target.
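A minimal inference sketch for this step (assuming the trained segmentation model takes a single-channel 256 × 256 input and produces a single-channel logit map):

```python
import numpy as np
import torch
from PIL import Image

def segment_image(model, png_path, device="cuda", threshold=0.5):
    """Predict the binary fetal brain mask for one preprocessed test slice."""
    model = model.to(device).eval()
    img = np.asarray(Image.open(png_path), dtype=np.float32) / 255.0
    x = torch.from_numpy(img)[None, None].to(device)     # shape (1, 1, 256, 256)
    with torch.no_grad():
        prob = torch.sigmoid(model(x))[0, 0].cpu().numpy()
    return (prob > threshold).astype(np.uint8)
```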

Claims (4)

1. A weakly supervised image segmentation method based on multi-scale pseudo-label fusion, comprising the following steps:
Step 1: constructing an image classification model based on multi-scale features;
For a segmentation training set containing only image category labels, an image classification model is first trained using the category labels. The classification model consists of a feature extractor and a classification head. The feature extractor consists of M cascaded feature perception modules; between two feature perception modules there is a downsampling layer that reduces the resolution of the feature map to 1/2 of its original resolution through a convolution with stride 2. Each feature perception module comprises two multi-scale attention modules, with a Dropout layer and a skip connection between them, the skip connection meaning that the input and output of the two multi-scale attention modules are added. Each multi-scale attention module comprises a multi-scale feature extraction module and a channel self-attention module. The multi-scale feature extraction module aims to extract and fuse features at different scales in the image; it is implemented by K parallel convolution layers with different dilation rates, the dilation rates of the K convolution layers being set to d_1, d_2, …, d_K respectively. Let F ∈ R^(C×H×W) denote the input feature map, where C is the number of channels and H and W are the height and width respectively. The operation of the multi-scale feature extraction module is expressed as:
F' = cov(d_1, F) ⊕ cov(d_2, F) ⊕ … ⊕ cov(d_K, F)   (Equation 1)
where cov(d, F) denotes the combination of a convolution layer with dilation rate d, a batch normalization layer and an activation layer, and ⊕ denotes concatenation of feature maps along the channel dimension;
On the basis of F', the channel self-attention module recalibrates the feature map so as to adaptively obtain a feature map with stronger representational capacity. The number of channels of F' is denoted C'. Global average pooling P_avg and global max pooling P_max are applied to F', and the outputs of the two operations are transformed by a multi-layer perceptron module MLP; the resulting feature vectors are respectively expressed as:
v_a = MLP(P_avg(F'))   (Equation 2)
v_m = MLP(P_max(F'))   (Equation 3)
The MLP consists of two fully connected neural network layers whose numbers of output nodes are C'/2 and C' respectively; v_a is the output of the multi-layer perceptron module after global average pooling, and v_m is its output after global max pooling;
On the basis of v_a and v_m, the feature recalibration coefficient α is obtained as:
α = σ(v_a + v_m)   (Equation 4)
where σ denotes the Sigmoid activation function; the output of the channel self-attention module is expressed as:
F'' = F'·α + F'   (Equation 5)
The classification head of the classification model converts the output of the M-th cascaded feature perception module into a probability output for each category; it comprises two fully connected layers and a Softmax layer in series. R denotes the number of categories annotated in the training set, and the classification result of the classification model is denoted p ∈ [0,1]^R;
Step 2: training an image classification model;
step 3: generating an attention map based on multi-scale information fusion;
Using the trained classification model, attention maps of the target object are computed in turn from the feature maps output by the last N feature perception modules of the classification model and are then fused by weighting. For the feature map F^(n) of the n-th feature perception module, the corresponding attention map is denoted A^(n) and is computed as:
α_c = Σ_i ∂p/∂F_c^(n)(i)   (Equation 7)
A^(n)(i) = ReLU( Σ_c α_c · F_c^(n)(i) )   (Equation 8)
where F_c^(n) is the c-th channel of the feature map F^(n), i is the pixel index, ∂p/∂F_c^(n)(i) denotes the partial derivative of the classification prediction with respect to F_c^(n)(i), α_c is the weight of the c-th channel, ReLU denotes the rectified linear activation function, and A^(n)(i) denotes the value of attention map A^(n) at the i-th pixel;
A^(n) is upsampled to the input image size and normalized to [0,1] using its maximum and minimum values; the resulting normalized attention map is denoted Â^(n). The set of normalized attention maps obtained in turn from the last N feature perception modules of the classification model is denoted {Â^(1), …, Â^(N)};
Because the attention maps Â^(n) for different values of n differ in resolution, the elements of this set are fused by uncertainty-based weighting in order to obtain a more stable attention map. The uncertainty map u^(n) and weight map w^(n) corresponding to the n-th attention map Â^(n) are given by:
where w^(n)(i) denotes the value of the i-th pixel in the weight map w^(n); the fused attention map Â obtained with the weights w^(n) is expressed as:
Step 4: generating a pseudo label based on seed-point expansion;
Because the edges of the fused attention map Â are not precise, Â is used as a preliminary localization result to generate seed points, and a pseudo label based on the geodesic distance to these seed points is then generated;
First, seed points are generated from Â: Â is thresholded with a threshold t, and the region of Â with values above t is taken as the foreground region Ω_f; the bounding box of Ω_f is then computed, the center point of the bounding box forms the set of foreground seed points S_f, and the four corner points of the bounding box form the set of background seed points S_b;
Then, the similarity between each pixel and the foreground and background seed points is computed from the geodesic distance. For the foreground seed set S_f, the geodesic distance is computed as:
D_geo(i, S_f, I) = min_{r ∈ P_{i,S_f}} ∫_0^1 ‖∇I(r(l)) · u(l)‖ dl   (Equation 12)
where I denotes the input image, P_{i,S_f} is the set of all possible paths from pixel i to S_f, r(l) is one such path parameterized by l ∈ [0,1], and u(l) is the unit tangent vector of the path at l;
The similarity map of the foreground seed points is denoted P_fs and is computed as:
Similarly, the similarity map of the background seed points is denoted P_bs and is obtained by the following formula:
Finally, P_fs and P_bs are combined to obtain a single seed-based pseudo-label map P_seed, computed as follows:
where a larger value of P_seed at pixel i indicates that the pixel is more likely to belong to the foreground, and a value of 0.5 indicates that the pixel is in an undetermined state;
Step 5: pseudo tag space self-adaptive fusion;
Step 6: further training by adopting a pseudo tag segmentation model;
Step 7: and (3) dividing the image to be divided by adopting the division model trained in the step (6).
2. The weakly supervised image segmentation method based on multi-scale pseudo-label fusion according to claim 1, wherein the specific method of step 2 is as follows:
After the image classification model has been constructed, the training images and their classification labels are fed into the model for training; a cross entropy loss function is used for optimization and the weights of the classification model are updated. The classification cross entropy loss is defined as:
L_cls = −Σ_{r=1}^{R} y_r · log(p_r)   (Equation 6)
where p denotes the classification prediction of the model, y denotes the classification label of the training image, R denotes the number of categories, y_r denotes the probability of the r-th category in the image-level label, and p_r is the corresponding prediction of the classification model.
3. The weakly supervised image segmentation method based on multi-scale pseudo-label fusion according to claim 1, wherein the specific method of step 5 is as follows:
On the basis of Â and P_seed, the two are fused to obtain a more comprehensive segmentation pseudo label. Because P_seed is more reliable near the seed points while Â is more reliable far from the seed points, a spatially adaptive weight map W is designed to fuse them; the fused pseudo label, denoted P_fuse, is given by:
Then, because P_fuse is a soft label, it is sharpened in order to provide a stronger supervisory signal for the segmentation model, as follows:
where T ≥ 1 is a hyperparameter controlling the degree of sharpening, and the sharpened result Y* is the final pseudo label.
4. The weakly supervised image segmentation method based on multi-scale pseudo-label fusion according to claim 1, wherein the training method of step 6 is as follows:
First, a set of spatial transforms is defined, including random scaling, rotation, flipping and cropping. Two different spatial transforms T_a and T_b are applied to an input image X to obtain two augmented images, which are fed into the segmentation model; the predictions are mapped back to the original image space by the corresponding inverse transforms, and the results are denoted P_a and P_b respectively, with their average denoted P_mean. Pixels where P_mean and the pseudo label Y* differ are likely to be erroneous pseudo labels with low reliability, so their loss weight is reduced. To this end, a consistency-weighted pseudo-label loss L_p is defined:
where the weight π(i) of pixel i is defined as:
Second, for the parts where P_mean and Y* are inconsistent, a consistency constraint loss between P_a and P_b is further introduced, defined as:
Finally, the loss function for training the segmentation model with the pseudo labels is defined as:
L_seg = L_p + λ·L_c   (Equation 21)
where λ is a weight coefficient, typically set to 1.0. The segmentation model is trained on the training images with the pseudo labels and L_seg; after training, the segmentation model is used to infer the segmentation result of a test image.
CN202410094113.6A 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion Pending CN117975002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410094113.6A CN117975002A (en) 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410094113.6A CN117975002A (en) 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion

Publications (1)

Publication Number Publication Date
CN117975002A 2024-05-03

Family

ID=90850550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410094113.6A Pending CN117975002A (en) 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion

Country Status (1)

Country Link
CN (1) CN117975002A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212422A (en) * 2024-05-22 2024-06-18 山东军地信息技术集团有限公司 Weak supervision image segmentation method for multidimensional differential mining
CN118397283A (en) * 2024-07-01 2024-07-26 山东大学 Gastric atrophy area segmentation system, electronic equipment and readable storage medium


Similar Documents

Publication Publication Date Title
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
Li Research and application of deep learning in image recognition
CN113421269B (en) Real-time semantic segmentation method based on double-branch deep convolutional neural network
CN111612008B (en) Image segmentation method based on convolution network
CN114120102A (en) Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN113763442B (en) Deformable medical image registration method and system
CN111461232A (en) Nuclear magnetic resonance image classification method based on multi-strategy batch type active learning
CN113313164B (en) Digital pathological image classification method and system based on super-pixel segmentation and graph convolution
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
CN117975002A (en) Weak supervision image segmentation method based on multi-scale pseudo tag fusion
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN114359153B (en) Insulator defect detection method based on improvement CENTERNET
CN110599502B (en) Skin lesion segmentation method based on deep learning
CN111626994A (en) Equipment fault defect diagnosis method based on improved U-Net neural network
CN113591617B (en) Deep learning-based water surface small target detection and classification method
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN115393289A (en) Tumor image semi-supervised segmentation method based on integrated cross pseudo label
CN114283285A (en) Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN113344110A (en) Fuzzy image classification method based on super-resolution reconstruction
CN115661459A (en) 2D mean teacher model using difference information
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN117274355A (en) Drainage pipeline flow intelligent measurement method based on acceleration guidance area convolutional neural network and parallel multi-scale unified network
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN113269171B (en) Lane line detection method, electronic device and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination