CN117975002A - Weakly supervised image segmentation method based on multi-scale pseudo-label fusion - Google Patents

Weakly supervised image segmentation method based on multi-scale pseudo-label fusion

Info

Publication number
CN117975002A
Authority
CN
China
Prior art keywords
image
segmentation
attention
feature
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410094113.6A
Other languages
Chinese (zh)
Inventor
付佳
王国泰
张少霆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202410094113.6A
Publication of CN117975002A
Legal status: Pending

Classifications

    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V2201/03 Recognition of patterns in medical or anatomical images
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised image segmentation method based on multi-scale pseudo-label fusion, and belongs to the technical field of automatic image recognition. The invention uses an image classification model based on multi-scale features to perform uncertainty weighting on the multi-level attention maps of training images that contain only image-level labels, and generates a high-resolution fused attention map. To address the problem of unclear edges in the attention map, the invention further provides a method for generating a pseudo label containing context information based on seed-point expansion; this pseudo label is spatially adaptively fused with the attention map to obtain a high-quality segmentation pseudo label. To address noise in the segmentation pseudo label, the invention further provides a segmentation model training method based on pseudo-label reliability weighting, and thereby obtains a high-performance segmentation model. The proposed method addresses the difficulty of obtaining accurate pixel-level annotations for image segmentation tasks and enables the segmentation model to achieve good performance when trained with only image-level labels, greatly reducing the amount and cost of annotation for image segmentation training sets.

Description

Weakly supervised image segmentation method based on multi-scale pseudo-label fusion
Technical Field
The invention relates to a technique for automatic image segmentation using models trained only with image-level labels, and belongs to the technical field of automatic image recognition.
Background
Image segmentation is an important basis for intelligent image recognition. For example, accurate segmentation of a target organ or lesion in medical images can assist disease diagnosis and quantitative assessment of treatment. Owing to factors such as variable backgrounds, irregular object shapes and low image contrast, traditional image segmentation methods such as thresholding, active contour models, region growing and edge detection cannot achieve reliable segmentation accuracy.
In recent years, deep learning methods have been able to effectively extract knowledge of the target region by learning from large numbers of labeled training images, and to obtain high-accuracy segmentation results. However, this good performance relies on a large number of images with accurate pixel-level annotations to train the model. Pixel-level annotation of images must be performed by experienced annotators and is very time-consuming, which makes high-quality annotated data difficult to acquire and greatly limits the application of deep learning algorithms to image segmentation tasks.
To overcome this problem, weakly supervised learning methods help reduce the reliance on high-quality annotated data. Weakly supervised methods train with only weak labels such as bounding boxes, scribbles, points or image-level labels, which can greatly reduce the annotation workload. Among these labeling schemes, image-level labeling only requires annotators to label the category of an image (for example, whether it contains a certain object) without precisely delineating object edges, so labeling efficiency is extremely high and the cost is lower than that of the other schemes. However, because pixel-level labeling information is missing, the supervisory signal provided to the image segmentation model is very limited, and it becomes very difficult to train a segmentation model with good performance under these conditions.
Existing weakly supervised methods based on image-level labels usually train an image classification model and then visualize its attention maps, so that the target region is initially localized to obtain a segmentation result. However, the resolution of the attention map obtained from the classification model in such methods is typically eight times lower than that of the original image, making it difficult to obtain a high-resolution segmentation result. Moreover, these methods capture only the most salient part of the target object rather than its accurate extent, which easily leads to incomplete segmentation regions or false-positive regions, so the accuracy of the segmentation result remains low. Therefore, a more effective method is needed to achieve higher-accuracy segmentation when the segmentation model is trained with only image-level annotated data.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing weakly supervised segmentation methods based on image-level labels, and provides a new weakly supervised learning algorithm based on multi-scale attention maps and pseudo-label learning to address the low segmentation accuracy obtained under weak annotation. The method consists of two stages. In the first stage, an image classification model is trained with image-level labels, and the attention maps obtained by the model at different resolutions are adaptively weighted and fused, effectively exploiting localization information at different resolutions to obtain a higher-quality fused attention map; a refined pseudo label is then generated using the geodesic distance to seed points. In the second stage, this refined pseudo label is used as supervision, and accurate segmentation of the target object is achieved by training the segmentation model with noise-robust learning.
The aim of the invention is achieved by the following technical scheme: a weakly supervised image segmentation method based on multi-scale pseudo-label fusion, comprising the following steps:
Step 1: constructing an image classification model based on multi-scale features;
For a segmentation training set containing only image category labels, an image classification model is first trained using the category labels. The classification model consists of a feature extractor and a classification head. The feature extractor consists of M cascaded feature perception modules; between two feature perception modules there is a downsampling layer that reduces the resolution of the feature map to 1/2 of its original resolution through a convolution with stride 2. Each feature perception module comprises two multi-scale attention modules, with a Dropout layer and a skip connection between them, the skip connection meaning that the input and output of the two multi-scale attention modules are added. Each multi-scale attention module comprises a multi-scale feature extraction module and a channel self-attention module. The multi-scale feature extraction module aims to extract and fuse features at different scales in the image; it is implemented by K parallel convolution layers with different dilation rates, the dilation rates of the K convolution layers being set to d_1, d_2, …, d_K respectively. Let F ∈ R^(C×H×W) denote the input feature map, where C is the number of channels and H and W are the height and width respectively. The operation of the multi-scale feature extraction module is expressed as:
F' = cov(d_1, F) ⊕ cov(d_2, F) ⊕ … ⊕ cov(d_K, F)   (Equation 1)
where cov(d, F) denotes the combination of a convolution layer with dilation rate d, a batch normalization layer and an activation layer, and ⊕ denotes concatenation of feature maps along the channel dimension;
On the basis of F', the channel self-attention module recalibrates the feature map so as to adaptively obtain a feature map with stronger representational capacity. The number of channels of F' is denoted C'. Global average pooling P_avg and global max pooling P_max are applied to F', and the outputs of the two operations are transformed by a multi-layer perceptron module MLP; the resulting feature vectors are respectively expressed as:
v_a = MLP(P_avg(F'))   (Equation 2)
v_m = MLP(P_max(F'))   (Equation 3)
The MLP consists of two fully connected neural network layers whose numbers of output nodes are C'/2 and C' respectively; v_a is the output of the multi-layer perceptron module after global average pooling, and v_m is its output after global max pooling;
On the basis of v_a and v_m, the feature recalibration coefficient α is obtained as:
α = σ(v_a + v_m)   (Equation 4)
where σ denotes the Sigmoid activation function; the output of the channel self-attention module is expressed as:
F'' = F'·α + F'   (Equation 5)
The classification head of the classification model converts the output of the M-th cascaded feature perception module into a probability output for each category; it comprises two fully connected layers and a Softmax layer in series. R denotes the number of categories annotated in the training set, and the classification result of the classification model is denoted p ∈ [0,1]^R;
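As an illustration of Step 1, the following is a minimal PyTorch sketch of one multi-scale attention module (Equations 1-5); the class name, the even channel split across branches and the default dilation rates are assumptions chosen for illustration, not details fixed by the invention.

```python
import torch
import torch.nn as nn

class MultiScaleAttentionModule(nn.Module):
    """One multi-scale attention module: K parallel dilated convolutions (Equation 1)
    followed by a channel self-attention block (Equations 2-5)."""
    def __init__(self, in_channels, out_channels, dilations=(1, 2, 4)):
        super().__init__()
        branch_channels = out_channels // len(dilations)
        # K parallel convolution branches with different dilation rates (cov(d, F) in Equation 1).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True))
            for d in dilations])
        c_prime = branch_channels * len(dilations)
        # Shared MLP for channel self-attention: C' -> C'/2 -> C'.
        self.mlp = nn.Sequential(
            nn.Linear(c_prime, c_prime // 2),
            nn.ReLU(inplace=True),
            nn.Linear(c_prime // 2, c_prime))

    def forward(self, x):
        # Equation 1: concatenate the dilated-convolution outputs along the channel axis.
        f_prime = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Equations 2-3: global average / max pooling followed by the shared MLP.
        v_a = self.mlp(f_prime.mean(dim=(2, 3)))
        v_m = self.mlp(f_prime.amax(dim=(2, 3)))
        # Equation 4: channel recalibration coefficients.
        alpha = torch.sigmoid(v_a + v_m)[:, :, None, None]
        # Equation 5: recalibrate the features and add the residual.
        return f_prime * alpha + f_prime
```

A feature perception module would stack two such modules with a Dropout layer and an additive skip connection between them, and the full feature extractor would cascade M of them separated by stride-2 convolutions.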
Step 2: training an image classification model;
After the image classification model has been constructed, the training images and their classification labels are fed into the model for training; a cross entropy loss function is used for optimization and the weights of the classification model are updated. The classification cross entropy loss is defined as:
L_cls = −Σ_{r=1}^{R} y_r · log(p_r)   (Equation 6)
where p denotes the classification prediction of the model, y denotes the classification label of the training image, R denotes the number of categories, y_r denotes the probability of the r-th category in the image-level label, and p_r is the corresponding prediction of the classification model;
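A brief training-loop sketch for Step 2 follows (PyTorch; the optimizer, learning rate and epoch count are assumptions not specified by the invention). Note that nn.CrossEntropyLoss implements Equation 6 on raw logits, so the final Softmax would be applied only at inference time.

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, epochs=100, lr=1e-3, device="cuda"):
    """Optimize the image classification model with the cross entropy loss of Equation 6."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # Equation 6 for class-index labels
    for _ in range(epochs):
        for images, labels in loader:      # labels are image-level class indices in [0, R)
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)   # model output shape: (batch, R)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```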
step 3: generating an attention map based on multi-scale information fusion;
Using the trained classification model, attention maps of the target object are computed in turn from the feature maps output by the last N feature perception modules of the classification model and are then fused by weighting. For the feature map F^(n) of the n-th feature perception module, the corresponding attention map is denoted A^(n) and is computed as:
α_c = Σ_i ∂p/∂F_c^(n)(i)   (Equation 7)
A^(n)(i) = ReLU( Σ_c α_c · F_c^(n)(i) )   (Equation 8)
where F_c^(n) is the c-th channel of the feature map F^(n), i is the pixel index, ∂p/∂F_c^(n)(i) denotes the partial derivative of the classification prediction with respect to F_c^(n)(i), α_c is the weight of the c-th channel, ReLU denotes the rectified linear activation function, and A^(n)(i) denotes the value of attention map A^(n) at the i-th pixel;
A^(n) is upsampled to the input image size and normalized to [0,1] using its maximum and minimum values; the resulting normalized attention map is denoted Â^(n). The set of normalized attention maps obtained in turn from the last N feature perception modules of the classification model is denoted {Â^(1), …, Â^(N)};
Because the attention maps Â^(n) for different values of n differ in resolution, the elements of this set are fused by uncertainty-based weighting in order to obtain a more stable attention map. The uncertainty map u^(n) and weight map w^(n) corresponding to the n-th attention map Â^(n) are given by:
where w^(n)(i) denotes the value of the i-th pixel in the weight map w^(n); the fused attention map Â obtained with the weights w^(n) is expressed as:
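The sketch below illustrates Step 3 in PyTorch: Grad-CAM-style attention maps (Equations 7-8) computed from the last N feature perception modules, upsampled, normalized and fused by uncertainty weighting. Since Equations 9-11 are not reproduced in this text, the binary-entropy uncertainty and the softmax(-u) weights used here are assumptions.

```python
import torch
import torch.nn.functional as F

def fused_attention_map(model, image, target_class, feature_layers):
    """Compute and fuse multi-level attention maps for one image tensor of shape (C, H, W)."""
    feats = []
    hooks = [layer.register_forward_hook(lambda m, inp, out: feats.append(out))
             for layer in feature_layers]
    prob = model(image.unsqueeze(0))[0, target_class]   # classification score to explain
    for h in hooks:
        h.remove()

    norm_maps = []
    for f in feats:
        grad = torch.autograd.grad(prob, f, retain_graph=True)[0]
        alpha = grad.sum(dim=(2, 3), keepdim=True)                 # Equation 7
        cam = F.relu((alpha * f).sum(dim=1, keepdim=True))         # Equation 8
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
        norm_maps.append(cam)

    maps = torch.cat(norm_maps, dim=1)                             # shape (1, N, H, W)
    # Assumed uncertainty: pixel-wise binary entropy of each normalized map.
    u = -(maps * (maps + 1e-8).log() + (1 - maps) * (1 - maps + 1e-8).log())
    w = torch.softmax(-u, dim=1)                                   # assumed weights, sum to 1 over n
    return (w * maps).sum(dim=1)[0]                                # fused attention map, shape (H, W)
```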
Step 4: generating a pseudo label based on seed-point expansion;
Because the edges of the fused attention map Â are not precise, Â is used as a preliminary localization result to generate seed points, and a pseudo label based on the geodesic distance to these seed points is then generated;
First, seed points are generated from Â: Â is thresholded with a threshold t, and the region of Â with values above t is taken as the foreground region Ω_f; the bounding box of Ω_f is then computed, the center point of the bounding box forms the set of foreground seed points S_f, and the four corner points of the bounding box form the set of background seed points S_b;
Then, the similarity between each pixel and the foreground and background seed points is computed from the geodesic distance. For the foreground seed set S_f, the geodesic distance is computed as:
D_geo(i, S_f, I) = min_{r ∈ P_{i,S_f}} ∫_0^1 ‖∇I(r(l)) · u(l)‖ dl   (Equation 12)
where I denotes the input image, P_{i,S_f} is the set of all possible paths from pixel i to S_f, r(l) is one such path parameterized by l ∈ [0,1], and u(l) is the unit tangent vector of the path at l;
The similarity map of the foreground seed points is denoted P_fs and is computed as:
Similarly, the similarity map of the background seed points is denoted P_bs and is obtained by the following formula:
Finally, P_fs and P_bs are combined to obtain a single seed-based pseudo-label map P_seed, computed as follows:
where a larger value of P_seed at pixel i indicates that the pixel is more likely to belong to the foreground, and a value of 0.5 indicates that the pixel is in an undetermined state;
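A numpy sketch of Step 4 follows. The raster-scan geodesic distance is only an approximation of Equation 12 (a dedicated geodesic-distance library could be used instead), and the exponential similarity maps and the ratio used to form P_seed stand in for Equations 13-15, which are not reproduced in this text.

```python
import numpy as np

def geodesic_distance(image, seed_mask, lam=1.0, n_passes=2):
    """Approximate geodesic distance (Equation 12) from every pixel to the seed set
    using repeated forward/backward raster scans over the image grid."""
    h, w = image.shape
    dist = np.where(seed_mask, 0.0, np.inf)
    fwd = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]     # causal neighbours (forward scan)
    bwd = [(1, 1), (1, 0), (1, -1), (0, 1)]         # anti-causal neighbours (backward scan)
    for _ in range(n_passes):
        for offsets, ys, xs in ((fwd, range(h), range(w)),
                                (bwd, range(h - 1, -1, -1), range(w - 1, -1, -1))):
            for y in ys:
                for x in xs:
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            # local cost: spatial step plus weighted intensity change
                            step = np.hypot(np.hypot(dy, dx),
                                            lam * (image[y, x] - image[ny, nx]))
                            dist[y, x] = min(dist[y, x], dist[ny, nx] + step)
    return dist

def seed_pseudo_label(fused_attention, image, t=0.5):
    """Seed-point generation and seed-based pseudo label P_seed of Step 4."""
    fg = fused_attention > t                                  # foreground region Omega_f
    ys, xs = np.nonzero(fg)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()   # bounding box of Omega_f
    seeds_fg = np.zeros_like(fg)
    seeds_fg[(y0 + y1) // 2, (x0 + x1) // 2] = True           # centre point -> S_f
    seeds_bg = np.zeros_like(fg)
    for y, x in ((y0, x0), (y0, x1), (y1, x0), (y1, x1)):     # four corner points -> S_b
        seeds_bg[y, x] = True
    p_fs = np.exp(-geodesic_distance(image, seeds_fg))        # assumed similarity maps
    p_bs = np.exp(-geodesic_distance(image, seeds_bg))
    return p_fs / (p_fs + p_bs + 1e-8)                        # assumed form of P_seed
```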
Step 5: pseudo tag space self-adaptive fusion;
On the basis of Â and P_seed, the two are fused to obtain a more comprehensive segmentation pseudo label. Because P_seed is more reliable near the seed points while Â is more reliable far from the seed points, a spatially adaptive weight map W is designed to fuse them; the fused pseudo label, denoted P_fuse, is given by:
Then, because P_fuse is a soft label, it is sharpened in order to provide a stronger supervisory signal for the segmentation model, as follows:
where T ≥ 1 is a hyperparameter controlling the degree of sharpening, and the sharpened result Y* is the final pseudo label;
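The following sketch illustrates Step 5. Because Equations 16-17 are not reproduced in this text, the Gaussian weight map W centred on the foreground seed point and the power-T sharpening used below are assumptions chosen only to match the stated behaviour (trust P_seed near the seeds, trust Â far from them, and sharpen the soft label with T ≥ 1).

```python
import numpy as np

def fuse_and_sharpen(attention, p_seed, seed_center, T=2.0, sigma=64.0):
    """Spatially adaptive fusion of the attention map and P_seed, followed by sharpening."""
    h, w = attention.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Assumed W: close to 1 near the foreground seed point, decaying with distance.
    W = np.exp(-((yy - seed_center[0]) ** 2 + (xx - seed_center[1]) ** 2)
               / (2.0 * sigma ** 2))
    fused = W * p_seed + (1.0 - W) * attention        # assumed form of Equation 16
    # Assumed sharpening: raise foreground/background probabilities to the power T.
    fg, bg = fused ** T, (1.0 - fused) ** T
    return fg / (fg + bg + 1e-8)                      # final pseudo label Y* in [0, 1]
```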
Step 6: further training by adopting a pseudo tag segmentation model;
Step 7: and (3) dividing the image to be divided by adopting the division model trained in the step (6).
Further, the training method of step 6 is as follows:
Because the pseudo labels contain noise, using them directly to train the segmentation model lets the noise interfere with training and limits the performance of the model. A noise-robust training method based on pseudo-label reliability weighting is therefore adopted, in which noise interference is suppressed through a consistency constraint between the predictions from two different views;
First, a set of spatial transforms is defined, including random scaling, rotation, flipping and cropping. Two different spatial transforms T_a and T_b are applied to an input image X to obtain two augmented images, which are fed into the segmentation model; the predictions are mapped back to the original image space by the corresponding inverse transforms, and the results are denoted P_a and P_b respectively, with their average denoted P_mean. Pixels where P_mean and the pseudo label Y* differ are likely to be erroneous pseudo labels with low reliability, so their loss weight is reduced. To this end, a consistency-weighted pseudo-label loss L_p is defined:
where the weight π(i) of pixel i is defined as:
Second, for the parts where P_mean and Y* are inconsistent, a consistency constraint loss between P_a and P_b is further introduced, defined as:
Finally, the loss function for training the segmentation model with the pseudo labels is defined as:
L_seg = L_p + λ·L_c   (Equation 21)
where λ is a weight coefficient, typically set to 1.0. The segmentation model is trained on the training images with the pseudo labels and L_seg; after training, the segmentation model is used to infer the segmentation result of a test image.
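A PyTorch sketch of the noise-robust training loss of Step 6 follows, for a single-channel (binary) segmentation output. Horizontal and vertical flips stand in for the random scale/rotate/flip/crop transforms, and the forms of the reliability weight π(i), of L_p and of L_c are assumptions, since Equations 18-20 are not reproduced in this text.

```python
import torch
import torch.nn.functional as F

def noise_robust_loss(model, x, pseudo_label, lam=1.0):
    """Consistency-weighted pseudo-label loss L_p plus consistency loss L_c (Equation 21)."""
    # Two spatial transforms T_a, T_b and their inverses (here: horizontal / vertical flip).
    p_a = torch.flip(model(torch.flip(x, dims=[-1])), dims=[-1])
    p_b = torch.flip(model(torch.flip(x, dims=[-2])), dims=[-2])
    p_a, p_b = torch.sigmoid(p_a), torch.sigmoid(p_b)
    p_mean = 0.5 * (p_a + p_b)                          # average prediction in image space

    # Assumed reliability weight pi(i): high where the mean prediction agrees with Y*.
    pi = (1.0 - torch.abs(p_mean - pseudo_label)).detach()
    # Assumed form of Equation 18: pixel-wise weighted cross entropy against the pseudo label.
    bce = F.binary_cross_entropy(p_mean, pseudo_label, reduction="none")
    l_p = (pi * bce).sum() / (pi.sum() + 1e-8)
    # Assumed form of Equation 20: consistency between the two transformed predictions.
    l_c = F.mse_loss(p_a, p_b)
    return l_p + lam * l_c                              # Equation 21: L_seg = L_p + lambda * L_c
```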
Compared with the prior art, the invention has the following advantages:
(1) Existing deep-learning-based image segmentation methods require a large number of pixel-level segmentation annotations, which entails a heavy annotation workload and high labor and time costs. The weakly supervised image segmentation method provided by the invention trains the segmentation model using only image-level annotations, greatly reducing annotation cost and time.
(2) Most existing weakly supervised segmentation methods generate class activation maps only from the feature maps close to the output layer, so the class activation maps have overly smooth edges and low resolution. The invention obtains class activation maps with higher resolution and higher accuracy by fusing activation maps generated at multiple layers of the classification network, and obtains higher-quality segmentation pseudo labels through correction based on the geodesic distance to seed points.
(3) Most existing methods train the image segmentation network directly with the pseudo labels under a fully supervised strategy, ignoring the influence of noise in the pseudo labels on the performance of the segmentation model. The invention provides a noise-robust learning strategy based on consistency constraints, which effectively reduces the influence of noisy pseudo labels and thus improves the performance of the segmentation model.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of attention map generation based on multi-scale information fusion with the image classification model in the present invention;
FIG. 3 is a schematic diagram of pseudo-label generation and fusion based on seed-point expansion;
FIG. 4 compares the fetal brain segmentation results of the present invention with other methods on MRI images; white lines denote the gold standard and black lines denote the segmentation results of each algorithm.
Detailed Description
The following embodiment of brain region segmentation in fetal brain MRI images based on image-level labels is provided in connection with the present invention. It was implemented on a computer with an Intel(R) Core(TM) i9-10940X 3.30 GHz CPU, an Nvidia GTX 3090 GPU and 64.0 GB of memory, using Python as the programming language.
Step 1, training data collection and preprocessing
A batch of abdominal three-dimensional MRI images was collected and converted into two-dimensional slices. Through preprocessing, the resolution of these two-dimensional images was resampled to 1 mm × 1 mm, cropped to 256 × 256, and saved in PNG format. The data were divided into an 80% training set and a 20% validation set. Each slice in the training set was manually given a binary label: 0 indicates that the slice does not contain the fetal brain, and 1 indicates that it does.
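A preprocessing sketch for this step is given below (assuming each 3D volume has already been loaded as a numpy array of shape (slices, H, W) with known in-plane pixel spacing in millimetres; the volume loading itself, e.g. from DICOM or NIfTI, is omitted).

```python
import numpy as np
from PIL import Image
from scipy.ndimage import zoom

def volume_to_slices(volume, spacing, out_dir, prefix):
    """Resample each 2D slice to 1 mm x 1 mm, centre-crop/pad to 256 x 256, save as PNG."""
    for k, sl in enumerate(volume.astype(np.float32)):
        sl = zoom(sl, (spacing[0] / 1.0, spacing[1] / 1.0), order=1)   # resample to 1 mm spacing
        sl = (255.0 * (sl - sl.min()) / (sl.max() - sl.min() + 1e-8)).astype(np.uint8)
        canvas = np.zeros((256, 256), dtype=np.uint8)
        h, w = min(sl.shape[0], 256), min(sl.shape[1], 256)
        y0, x0 = (sl.shape[0] - h) // 2, (sl.shape[1] - w) // 2        # centre-crop offsets
        canvas[(256 - h) // 2:(256 - h) // 2 + h,
               (256 - w) // 2:(256 - w) // 2 + w] = sl[y0:y0 + h, x0:x0 + w]
        Image.fromarray(canvas).save(f"{out_dir}/{prefix}_{k:03d}.png")
```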
Step 2, classification network training based on image-level annotation
An image classification model is first trained using the class labels. The classification model consists of a feature extractor and a classification head. The feature extractor consists of M = 6 cascaded feature perception modules; between two feature perception modules there is a downsampling layer that reduces the resolution of the feature map to 1/2 of its original resolution through a convolution with stride 2. Each feature perception module comprises two multi-scale attention modules, with a Dropout layer and a skip connection between them (the input and output of the two multi-scale attention modules are added). Each multi-scale attention module comprises a multi-scale feature extraction module and a channel self-attention module. The multi-scale feature extraction module extracts and fuses features at different scales in the image; it is implemented by K = 3 parallel convolution layers with different dilation rates, set to d_1 = 1, d_2 = 2, d_3 = 4 respectively. Let F ∈ R^(C×H×W) denote the input feature map, where C is the number of channels and H and W are the height and width respectively. The operation of the multi-scale feature extraction module is expressed as:
F' = cov(d_1, F) ⊕ cov(d_2, F) ⊕ … ⊕ cov(d_K, F)   (Equation 1)
where cov(d, F) denotes the combination of a convolution layer with dilation rate d, a batch normalization layer and an activation layer, and ⊕ denotes concatenation of feature maps along the channel dimension. The number of channels of F' is denoted C'. Global average pooling (P_avg) and global max pooling (P_max) are applied to F', and the outputs of the two operations are transformed by a multi-layer perceptron module (MLP); the resulting feature vectors are respectively expressed as:
v_a = MLP(P_avg(F'))   (Equation 2)
v_m = MLP(P_max(F'))   (Equation 3)
where the MLP consists of two fully connected neural network layers whose numbers of output nodes are C'/2 and C' respectively. In the M = 6 modules of the feature extractor, the values of C' are set to 32, 64, 128, 256, 512 and 512 in this order.
Further, on the basis of v_a and v_m, the feature recalibration coefficient is obtained as:
α = σ(v_a + v_m)   (Equation 4)
where σ denotes the Sigmoid activation function. The output of the channel self-attention module is expressed as:
F'' = F'·α + F'   (Equation 5)
The classification head of the classification model converts the output of the M-th cascaded feature perception module into a probability output for each class; it comprises two fully connected layers and a Softmax layer in series. The number of classes annotated in the training set is R = 2, and the classification result of the classification model is denoted p ∈ [0,1]^R.
Step 3: training of image classification models
After the image classification model has been constructed, the training images and their classification labels are fed into the model for training, and the cross entropy loss function is used for optimization. The classification cross entropy loss is defined as:
L_cls = −Σ_{r=1}^{R} y_r · log(p_r)   (Equation 6)
where p denotes the classification prediction of the model, y denotes the classification label of the training image, and R = 2 is the number of classes; y_r denotes the probability of the r-th category in the image-level label, and p_r is the corresponding prediction of the classification model.
Step 4: generating attention-seeking diagrams based on multi-scale information fusion
Using the trained classification model, attention maps of the target object are computed in turn from the feature maps of the last N = 4 feature perception modules of the classification model and fused by weighting. For the feature map F^(n) of the n-th feature perception module, the corresponding attention map is denoted A^(n) and is computed as:
α_c = Σ_i ∂p/∂F_c^(n)(i)   (Equation 7)
A^(n)(i) = ReLU( Σ_c α_c · F_c^(n)(i) )   (Equation 8)
where F_c^(n) is the c-th channel of the feature map F^(n), i is the pixel index, ∂p/∂F_c^(n)(i) denotes the partial derivative of the classification prediction with respect to F_c^(n)(i), α_c is the weight of the c-th channel, ReLU denotes the rectified linear activation function, and A^(n)(i) denotes the value of attention map A^(n) at the i-th pixel.
A^(n) is upsampled to the input image size and normalized to [0,1] using its maximum and minimum values; the resulting normalized attention map is denoted Â^(n). To obtain a more stable attention map, the maps Â^(n) are fused by uncertainty-based weighting. The uncertainty map u^(n) and weight map w^(n) corresponding to Â^(n) are given by:
where w^(n)(i) denotes the value of the i-th pixel in the weight map w^(n); the fused attention map Â obtained with the weights w^(n) is expressed as:
step 5: pseudo tag generation based on seed point expansion
First, seed points are generated from Â: Â is thresholded with the threshold set to t = 0.5, and the region of Â with values above t is taken as the foreground region Ω_f; the bounding box of Ω_f is then computed, the center point of the bounding box forming the set of foreground seed points S_f and its four corner points forming the set of background seed points S_b.
Then, the similarity between each pixel and the foreground and background seed points is computed from the geodesic distance. For the foreground seed set S_f, the geodesic distance is computed as:
D_geo(i, S_f, I) = min_{r ∈ P_{i,S_f}} ∫_0^1 ‖∇I(r(l)) · u(l)‖ dl   (Equation 12)
where I denotes the input image, P_{i,S_f} is the set of all possible paths from pixel i to S_f, r(l) is one such path parameterized by l ∈ [0,1], and u(l) is the unit tangent vector of the path at l.
The similarity map of the foreground seed points is denoted P_fs and is computed as:
Similarly, the similarity map of the background seed points is denoted P_bs and is obtained by the following formula:
Finally, P_fs and P_bs are combined to obtain a single seed-based pseudo-label map P_seed, computed as follows:
step 6: pseudo tag space adaptive fusion
On the basis of Â and P_seed, the two are fused to obtain a more comprehensive segmentation pseudo label. A spatially adaptive weight map W is designed to fuse them; the fused pseudo label, denoted P_fuse, is given by:
Then, P_fuse is sharpened as follows:
where T = 2 is a hyperparameter controlling the degree of sharpening, and the sharpened result Y* is the final pseudo label.
Step 7: segmentation model training based on pseudo tag credibility weighting
Based on the pseudo labels, the segmentation model is trained with a consistency constraint between the predictions from two different views, which suppresses the interference of noise. First, a set of spatial transforms is defined, including random scaling, rotation, flipping and cropping. Two different spatial transforms T_a and T_b are applied to an input image X to obtain two augmented images, which are fed into the segmentation model; the predictions are mapped back to the original image space by the corresponding inverse transforms, and the results are denoted P_a and P_b respectively, with their average denoted P_mean. Pixels where P_mean and the pseudo label Y* differ are likely to be erroneous pseudo labels with low reliability, so their pseudo-label loss weight is reduced. To this end, a consistency-weighted pseudo-label loss L_p is defined:
where the weight π(i) of pixel i is defined as:
Second, for the parts where P_mean and Y* are inconsistent, a consistency constraint loss between P_a and P_b is further introduced, defined as:
Finally, the loss function for training the segmentation model with the pseudo labels is defined as:
L_seg = L_p + λ·L_c   (Equation 21)
where λ = 1.0 is the weight coefficient. The segmentation model is trained on the training images with the pseudo labels and L_seg. After training, the segmentation model is used to infer the segmentation result of a test image.
Step8: segmentation on test images
After training, for any test image X_t, the segmentation model trained in step 7 is used directly to predict X_t and complete the segmentation of the target.
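A minimal inference sketch for this step (assuming the trained segmentation model takes a single-channel 256 × 256 input and produces a single-channel logit map):

```python
import numpy as np
import torch
from PIL import Image

def segment_image(model, png_path, device="cuda", threshold=0.5):
    """Predict the binary fetal brain mask for one preprocessed test slice."""
    model = model.to(device).eval()
    img = np.asarray(Image.open(png_path), dtype=np.float32) / 255.0
    x = torch.from_numpy(img)[None, None].to(device)     # shape (1, 1, 256, 256)
    with torch.no_grad():
        prob = torch.sigmoid(model(x))[0, 0].cpu().numpy()
    return (prob > threshold).astype(np.uint8)
```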

Claims (4)

1. A weakly supervised image segmentation method based on multi-scale pseudo-label fusion, comprising the following steps:
Step 1: constructing an image classification model based on multi-scale features;
For a segmentation training set containing only image category labels, an image classification model is first trained using the category labels. The classification model consists of a feature extractor and a classification head. The feature extractor consists of M cascaded feature perception modules; between two feature perception modules there is a downsampling layer that reduces the resolution of the feature map to 1/2 of its original resolution through a convolution with stride 2. Each feature perception module comprises two multi-scale attention modules, with a Dropout layer and a skip connection between them, the skip connection meaning that the input and output of the two multi-scale attention modules are added. Each multi-scale attention module comprises a multi-scale feature extraction module and a channel self-attention module. The multi-scale feature extraction module aims to extract and fuse features at different scales in the image; it is implemented by K parallel convolution layers with different dilation rates, the dilation rates of the K convolution layers being set to d_1, d_2, …, d_K respectively. Let F ∈ R^(C×H×W) denote the input feature map, where C is the number of channels and H and W are the height and width respectively. The operation of the multi-scale feature extraction module is expressed as:
F' = cov(d_1, F) ⊕ cov(d_2, F) ⊕ … ⊕ cov(d_K, F)   (Equation 1)
where cov(d, F) denotes the combination of a convolution layer with dilation rate d, a batch normalization layer and an activation layer, and ⊕ denotes concatenation of feature maps along the channel dimension;
On the basis of F', the channel self-attention module recalibrates the feature map so as to adaptively obtain a feature map with stronger representational capacity. The number of channels of F' is denoted C'. Global average pooling P_avg and global max pooling P_max are applied to F', and the outputs of the two operations are transformed by a multi-layer perceptron module MLP; the resulting feature vectors are respectively expressed as:
v_a = MLP(P_avg(F'))   (Equation 2)
v_m = MLP(P_max(F'))   (Equation 3)
The MLP consists of two fully connected neural network layers whose numbers of output nodes are C'/2 and C' respectively; v_a is the output of the multi-layer perceptron module after global average pooling, and v_m is its output after global max pooling;
On the basis of v_a and v_m, the feature recalibration coefficient α is obtained as:
α = σ(v_a + v_m)   (Equation 4)
where σ denotes the Sigmoid activation function; the output of the channel self-attention module is expressed as:
F'' = F'·α + F'   (Equation 5)
The classification head of the classification model converts the output of the M-th cascaded feature perception module into a probability output for each category; it comprises two fully connected layers and a Softmax layer in series. R denotes the number of categories annotated in the training set, and the classification result of the classification model is denoted p ∈ [0,1]^R;
Step 2: training an image classification model;
step 3: generating an attention map based on multi-scale information fusion;
Using the trained classification model, attention maps of the target object are computed in turn from the feature maps output by the last N feature perception modules of the classification model and are then fused by weighting. For the feature map F^(n) of the n-th feature perception module, the corresponding attention map is denoted A^(n) and is computed as:
α_c = Σ_i ∂p/∂F_c^(n)(i)   (Equation 7)
A^(n)(i) = ReLU( Σ_c α_c · F_c^(n)(i) )   (Equation 8)
where F_c^(n) is the c-th channel of the feature map F^(n), i is the pixel index, ∂p/∂F_c^(n)(i) denotes the partial derivative of the classification prediction with respect to F_c^(n)(i), α_c is the weight of the c-th channel, ReLU denotes the rectified linear activation function, and A^(n)(i) denotes the value of attention map A^(n) at the i-th pixel;
A^(n) is upsampled to the input image size and normalized to [0,1] using its maximum and minimum values; the resulting normalized attention map is denoted Â^(n). The set of normalized attention maps obtained in turn from the last N feature perception modules of the classification model is denoted {Â^(1), …, Â^(N)};
Because the attention maps Â^(n) for different values of n differ in resolution, the elements of this set are fused by uncertainty-based weighting in order to obtain a more stable attention map. The uncertainty map u^(n) and weight map w^(n) corresponding to the n-th attention map Â^(n) are given by:
where w^(n)(i) denotes the value of the i-th pixel in the weight map w^(n); the fused attention map Â obtained with the weights w^(n) is expressed as:
Step 4: generating a pseudo label based on seed-point expansion;
Because the edges of the fused attention map Â are not precise, Â is used as a preliminary localization result to generate seed points, and a pseudo label based on the geodesic distance to these seed points is then generated;
First, seed points are generated from Â: Â is thresholded with a threshold t, and the region of Â with values above t is taken as the foreground region Ω_f; the bounding box of Ω_f is then computed, the center point of the bounding box forms the set of foreground seed points S_f, and the four corner points of the bounding box form the set of background seed points S_b;
Then, the similarity between each pixel and the foreground and background seed points is computed from the geodesic distance. For the foreground seed set S_f, the geodesic distance is computed as:
D_geo(i, S_f, I) = min_{r ∈ P_{i,S_f}} ∫_0^1 ‖∇I(r(l)) · u(l)‖ dl   (Equation 12)
where I denotes the input image, P_{i,S_f} is the set of all possible paths from pixel i to S_f, r(l) is one such path parameterized by l ∈ [0,1], and u(l) is the unit tangent vector of the path at l;
The similarity map of the foreground seed points is denoted P_fs and is computed as:
Similarly, the similarity map of the background seed points is denoted P_bs and is obtained by the following formula:
Finally, P_fs and P_bs are combined to obtain a single seed-based pseudo-label map P_seed, computed as follows:
where a larger value of P_seed at pixel i indicates that the pixel is more likely to belong to the foreground, and a value of 0.5 indicates that the pixel is in an undetermined state;
Step 5: pseudo tag space self-adaptive fusion;
Step 6: further training by adopting a pseudo tag segmentation model;
Step 7: and (3) dividing the image to be divided by adopting the division model trained in the step (6).
2. The weakly supervised image segmentation method based on multi-scale pseudo-label fusion according to claim 1, wherein the specific method of step 2 is as follows:
After the image classification model has been constructed, the training images and their classification labels are fed into the model for training; a cross entropy loss function is used for optimization and the weights of the classification model are updated. The classification cross entropy loss is defined as:
L_cls = −Σ_{r=1}^{R} y_r · log(p_r)   (Equation 6)
where p denotes the classification prediction of the model, y denotes the classification label of the training image, R denotes the number of categories, y_r denotes the probability of the r-th category in the image-level label, and p_r is the corresponding prediction of the classification model.
3. The weakly supervised image segmentation method based on multi-scale pseudo-label fusion according to claim 1, wherein the specific method of step 5 is as follows:
On the basis of Â and P_seed, the two are fused to obtain a more comprehensive segmentation pseudo label. Because P_seed is more reliable near the seed points while Â is more reliable far from the seed points, a spatially adaptive weight map W is designed to fuse them; the fused pseudo label, denoted P_fuse, is given by:
Then, because P_fuse is a soft label, it is sharpened in order to provide a stronger supervisory signal for the segmentation model, as follows:
where T ≥ 1 is a hyperparameter controlling the degree of sharpening, and the sharpened result Y* is the final pseudo label.
4. The weakly supervised image segmentation method based on multi-scale pseudo-label fusion according to claim 1, wherein the training method of step 6 is as follows:
First, a set of spatial transforms is defined, including random scaling, rotation, flipping and cropping. Two different spatial transforms T_a and T_b are applied to an input image X to obtain two augmented images, which are fed into the segmentation model; the predictions are mapped back to the original image space by the corresponding inverse transforms, and the results are denoted P_a and P_b respectively, with their average denoted P_mean. Pixels where P_mean and the pseudo label Y* differ are likely to be erroneous pseudo labels with low reliability, so their loss weight is reduced. To this end, a consistency-weighted pseudo-label loss L_p is defined:
where the weight π(i) of pixel i is defined as:
Second, for the parts where P_mean and Y* are inconsistent, a consistency constraint loss between P_a and P_b is further introduced, defined as:
Finally, the loss function for training the segmentation model with the pseudo labels is defined as:
L_seg = L_p + λ·L_c   (Equation 21)
where λ is a weight coefficient, typically set to 1.0. The segmentation model is trained on the training images with the pseudo labels and L_seg; after training, the segmentation model is used to infer the segmentation result of a test image.
CN202410094113.6A 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion Pending CN117975002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410094113.6A CN117975002A (en) 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410094113.6A CN117975002A (en) 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion

Publications (1)

Publication Number Publication Date
CN117975002A 2024-05-03

Family

ID=90850550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410094113.6A Pending CN117975002A (en) 2024-01-23 2024-01-23 Weak supervision image segmentation method based on multi-scale pseudo tag fusion

Country Status (1)

Country Link
CN (1) CN117975002A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212422A (en) * 2024-05-22 2024-06-18 山东军地信息技术集团有限公司 Weak supervision image segmentation method for multidimensional differential mining
CN118397283A (en) * 2024-07-01 2024-07-26 山东大学 Gastric atrophy area segmentation system, electronic equipment and readable storage medium


Similar Documents

Publication Publication Date Title
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
Li Research and application of deep learning in image recognition
CN113421269B (en) Real-time semantic segmentation method based on double-branch deep convolutional neural network
CN111612008B (en) Image segmentation method based on convolution network
CN114120102A (en) Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN113763442B (en) Deformable medical image registration method and system
CN111461232A (en) Nuclear magnetic resonance image classification method based on multi-strategy batch type active learning
CN113313164B (en) Digital pathological image classification method and system based on super-pixel segmentation and graph convolution
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
CN117975002A (en) Weak supervision image segmentation method based on multi-scale pseudo tag fusion
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN114359153B (en) Insulator defect detection method based on improvement CENTERNET
CN110599502B (en) Skin lesion segmentation method based on deep learning
CN111626994A (en) Equipment fault defect diagnosis method based on improved U-Net neural network
CN113591617B (en) Deep learning-based water surface small target detection and classification method
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN115393289A (en) Tumor image semi-supervised segmentation method based on integrated cross pseudo label
CN114283285A (en) Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN113344110A (en) Fuzzy image classification method based on super-resolution reconstruction
CN115661459A (en) 2D mean teacher model using difference information
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN117274355A (en) Drainage pipeline flow intelligent measurement method based on acceleration guidance area convolutional neural network and parallel multi-scale unified network
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN113269171B (en) Lane line detection method, electronic device and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination