WO2024098318A1 - Medical image segmentation method - Google Patents

Medical image segmentation method

Info

Publication number
WO2024098318A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
image
loss
transformer
medical image
Prior art date
Application number
PCT/CN2022/131075
Other languages
French (fr)
Chinese (zh)
Inventor
吴文霞
李志成
梁栋
赵源深
段静娴
Original Assignee
中国科学院深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Priority to PCT/CN2022/131075
Publication of WO2024098318A1

Definitions

  • the invention relates to a medical image segmentation method.
  • Medical image segmentation is the basis of various medical image applications. In clinical auxiliary diagnosis, image-guided surgery and radiotherapy, medical image segmentation technology shows increasingly important clinical value. Traditional medical image segmentation is based on manual segmentation by experienced doctors, but this purely manual segmentation method is often time-consuming and laborious, and is greatly affected by the doctor's subjective influence. With the rapid development of deep learning technology, fully automatic image segmentation based on deep learning has developed rapidly. However, deep learning often relies on a large amount of high-quality labeled data, while medical image data is often scarce, and it is usually difficult to obtain high-quality labeled data.
  • the semi-supervised learning framework can directly learn from limited labeled data and a large amount of unlabeled data to obtain high-quality segmentation results.
  • Current semi-supervised medical image segmentation methods can be divided into three categories: adversarial learning methods, consistency regularization methods, and collaborative training methods.
  • Adversarial learning methods use discriminators to align the distribution of labeled and unlabeled data in the embedding space. The data needs to meet the distribution assumption, and many adversarial learning models are difficult to train.
  • the basic idea of the consistency regularization method is to regularize the model prediction, that is, a robust model should have similar outputs for similar inputs.
  • the difference between each method lies in how to inject noise and how to calculate consistency, but the consistency regularization method relies on a suitable data augmentation strategy, and the wrong pseudo-labels will continue to strengthen during training.
  • the collaborative training method is based on the assumption of low-density separation of data. The disadvantage of this method is that if the generated pseudo-labels are inaccurate, they will lead to self-reinforcement of classification errors.
  • semi-supervised segmentation is generally performed using adversarial learning methods, consistency regularization methods, and collaborative training methods.
  • the above methods all use the consistency of the output space and lack constraints in the feature space. Therefore, in many cases, the model cannot recognize the wrong features, causing this error to continue to accumulate during the training process.
  • the present invention provides a medical image segmentation method, which comprises the following steps: a. collecting nuclear magnetic resonance image data of tumor patients as a data set; b. performing data processing on the image data in the data set, wherein the data processing comprises: performing format conversion, resampling, registration and standardization on the image data in the data set; c. taking multimodal images that meet the requirements in the data set after the data processing as input of a model; d. establishing a multi-branch Transformer neural network as an encoder, and designing a separate Transformer for each modality to extract features; e. designing a modality fusion Transformer to fuse data of multiple modalities; f. establishing a decoder, and gradually reshaping encoder outputs of different scales into the input size to obtain a segmentation result matching the original image; g. constructing a weakly enhanced image and a strongly enhanced image for unlabeled data in the data set; h. selecting positive examples and negative examples according to the output of the encoder for the differently enhanced images, and calculating the contrastive loss; i. calculating the dice loss for labels and segmentation results; j. training the model, selecting the result with better effect as the final model and saving it.
  • the patient's nuclear magnetic resonance image data is a multimodal nuclear magnetic resonance image; the nuclear magnetic resonance image data of each patient includes four commonly used modalities; the four commonly used modalities are T1, T2, T1C, and Flair modalities.
  • the step b specifically includes:
  • the DICOM format is converted into the NIFTI format; then the image is resampled; then the image is registered, and the points corresponding to the same spatial position at multiple time points are matched one by one.
  • the rigid registration mode is used for registration, and the mutual information is used as the image similarity measure; the image data in the data set is standardized using grayscale normalization and histogram equalization methods.
  • the step c specifically includes:
  • the multimodal images that meet the requirements in the dataset are used as the input of the model, and the dataset is divided into a training set and a test set.
  • the magnetic resonance image data with missing modalities, failed registration, or without tumors are excluded to avoid affecting the generalization performance of the model.
  • the dataset is divided into a training set and a test set in a ratio of 4:1.
  • the labeled data and unlabeled data are divided as needed and processed separately.
  • the step d specifically includes:
  • a separate Transformer is designed for each modality to extract features.
  • a multi-branch Transformer is proposed with the same number of branches as the input modalities in order to simultaneously extract independent features of multiple modalities.
  • the three-dimensional whole brain image is divided into K three-dimensional image blocks of fixed size, mapped into a one-dimensional vector of fixed length D, and position encoding is added to retain position information before being input into the visual Transformer model.
  • the step e specifically includes:
  • a fusion Transformer based on the cross-attention mechanism is designed separately: the fusion Transformer based on the cross-attention mechanism is divided into two parts, namely, a partial fusion Transformer and a global fusion Transformer; the partial fusion Transformer uses a single one-dimensional vector of each branch as a query to exchange information with other branches, and inputs the partial fusion result into the global fusion Transformer, and the multimodal information is more thoroughly fused together through the self-attention mechanism therein, thereby utilizing the global context information at the overall semantic structure level of the data.
  • the step f specifically includes:
  • the decoder gradually reshapes the encoder outputs of different scales to the input size to obtain a segmentation result that matches the original image.
  • the decoder takes the encoder output as five channel inputs.
  • the encoder outputs of each layer are fused layer by layer through convolution and deconvolution operations, and the image is restored to the specified size, and the sigmoid function is applied to obtain the final segmentation result.
  • the step g specifically includes:
  • Two enhancement methods are designed for a single unlabeled image.
  • a transformation is randomly selected for each sample in the batch from a predefined range: the first enhancement method is weak enhancement, which is the result of random flipping, moving and random scaling strategies with a probability of 50%; the other enhancement method is strong enhancement, which adds grayscale transformation on the basis of the weakly enhanced image.
  • the step h specifically includes:
  • the unlabeled data loss is divided into two parts, including the output space consistency loss and the contrastive learning loss.
  • the contrastive learning loss is calculated in that the encoder generates features based on the weakly enhanced image and the strongly enhanced image respectively.
  • the features at the same position are regarded as positive examples, and the features at different positions are regarded as negative examples.
  • the sampling method of negative examples adopts the Gumbel sampling strategy, and selects k pixels with the smallest cosine similarity to form negative examples, or selects pixels with a longer distance as negative examples based on anatomical prior knowledge.
  • the InfoNCE loss is combined with the cosine similarity to obtain the pixel contrast loss.
  • the step i specifically includes:
  • the dice loss is calculated with the label as the supervised learning loss; for unlabeled data, the consistency loss is calculated between the results of weakly enhanced images and strongly enhanced images.
  • the step j specifically includes:
  • stochastic gradient descent is used as the optimizer for training, and weight decay is used to prevent overfitting; after the model training is completed, the more accurate model under each proportion of supervised data is selected and saved.
  • This application not only takes into account the consistency of the output space, but also solves to a certain extent the problem of error accumulation caused by the inability to filter out erroneous features that is common in current methods. It also uses Transformer as the main feature extraction network and utilizes the attention mechanism and global receptive field advantages in Transformer to locate tumors faster and more accurately, which improves the accuracy compared to the convolutional neural network method with only a local receptive field.
  • FIG1 is a flow chart of a medical image segmentation method according to the present invention.
  • FIG2 is a schematic diagram of a Transformer neural network provided by an embodiment of the present invention.
  • FIG3 is a schematic diagram of the Transformer neural network segmentation process provided by an embodiment of the present invention.
  • FIG. 1 is a flowchart of a preferred embodiment of the medical image segmentation method of the present invention.
  • Step S1, collecting the MRI image data of tumor patients as a data set. Specifically:
  • nuclear magnetic resonance image data of tumor patients are collected.
  • the nuclear magnetic resonance image data of the patients are multimodal nuclear magnetic resonance images.
  • the nuclear magnetic resonance image data of each patient includes four common modes; the four common modes are T1, T2, T1C, and Flair modes.
  • the patient images obtained in this step come from the patient image datasets jointly collected by the hospital, TCIA (The Cancer Imaging Archive) and TCGA (The Cancer Genome Atlas).
  • This embodiment does not limit the size of the data set.
  • Step S2 performing data processing on the image data in the data set, the data processing comprising: performing format conversion, resampling, registration and standardization on the image data in the data set. Specifically:
  • DICOM (Digital Imaging and Communications in Medicine) refers to the medical digital image transmission protocol, which is a set of universal standard protocols for medical image processing, storage, printing and transmission.
  • the data obtained from the medical device is in DICOM format.
  • the DICOM format is converted into NIFTI (Neuro Imaging Informatics Technology Initiative) format; then the image is resampled to improve the image resolution; then the image is registered, and the points corresponding to the same position in space at multiple time points are matched one by one.
  • the rigid registration mode is used for registration, and mutual information is used as the image similarity metric.
  • after registration and resampling, the spatial resolution of the image is 1 mm; grayscale normalization, histogram equalization and other methods are used to standardize the image data in the data set.
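A minimal preprocessing sketch of this step, assuming the SimpleITK package; `dicom_to_nifti`, `resample_to_1mm`, `rigid_register` and `normalize` are hypothetical helper names and the optimizer settings are illustrative, while the 1 mm spacing, the rigid transform and the mutual-information metric follow the description above.

```python
import SimpleITK as sitk

def dicom_to_nifti(dicom_dir, out_path):
    # Read the DICOM series from the scanner and write it out in NIfTI format.
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(reader.GetGDCMSeriesFileNames(dicom_dir))
    image = reader.Execute()
    sitk.WriteImage(image, out_path)
    return image

def resample_to_1mm(image):
    # Resample to isotropic 1 mm spacing while keeping the physical extent.
    spacing = [1.0, 1.0, 1.0]
    size = [int(round(sz * sp)) for sz, sp in zip(image.GetSize(), image.GetSpacing())]
    return sitk.Resample(image, size, sitk.Transform(), sitk.sitkLinear,
                         image.GetOrigin(), spacing, image.GetDirection(), 0.0,
                         image.GetPixelID())

def rigid_register(fixed, moving):
    # Rigid registration with Mattes mutual information as the similarity metric.
    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetOptimizerAsGradientDescent(learningRate=1.0, numberOfIterations=200)
    init = sitk.CenteredTransformInitializer(fixed, moving, sitk.Euler3DTransform(),
                                             sitk.CenteredTransformInitializerFilter.GEOMETRY)
    reg.SetInitialTransform(init, inPlace=False)
    transform = reg.Execute(sitk.Cast(fixed, sitk.sitkFloat32),
                            sitk.Cast(moving, sitk.sitkFloat32))
    return sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0, moving.GetPixelID())

def normalize(image):
    # Grayscale normalization (zero mean, unit variance) followed by histogram equalization.
    arr = sitk.GetArrayFromImage(sitk.Cast(image, sitk.sitkFloat32))
    arr = (arr - arr.mean()) / (arr.std() + 1e-8)
    norm = sitk.GetImageFromArray(arr)
    norm.CopyInformation(image)
    return sitk.AdaptiveHistogramEqualization(norm)
```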
  • Step S3 After the data is processed, the multimodal images that meet the requirements in the data set are used as the input of the model, and the data set is divided into a training set and a test set.
  • the training set is divided into labeled data and unlabeled data as needed, and processed separately.
  • the multimodal images that meet the requirements in the dataset are used as the input of the model, and the dataset is divided into a training set and a test set.
  • For the training set, the labeled data and unlabeled data are divided as needed and processed separately. In semi-supervised tasks, the proportion of labeled data strongly affects the segmentation results; therefore, the amount of labeled data in the training set is gradually reduced in 10% steps, and experiments are carried out separately.
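A small sketch of the split described in this step, under the assumption that each case is represented by a dict with a `complete` flag; the 4:1 ratio and the stepwise reduction of the labeled fraction come from the text, while the function name and seeding are illustrative.

```python
import random

def split_dataset(cases, labeled_fraction, seed=0):
    """Split cases 4:1 into train/test, then mark a fraction of the training set as labeled."""
    rng = random.Random(seed)
    # Drop cases with missing modalities, failed registration, or no tumor.
    cases = [c for c in cases if c["complete"]]
    rng.shuffle(cases)
    n_test = len(cases) // 5                      # 4:1 train/test split
    test, train = cases[:n_test], cases[n_test:]
    n_labeled = int(len(train) * labeled_fraction)
    labeled, unlabeled = train[:n_labeled], train[n_labeled:]
    return labeled, unlabeled, test

# Experiments repeated while reducing the labeled fraction in 10% steps, e.g.:
# for frac in (0.5, 0.4, 0.3, 0.2, 0.1):
#     labeled, unlabeled, test = split_dataset(all_cases, frac)
```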
  • Step S4 Establish a multi-branch Transformer neural network as an encoder, and design a separate Transformer for each modality to extract features. Specifically:
  • a multi-branch Transformer neural network is established.
  • the expected segmentation model is an encoder-decoder structure.
  • the encoder extracts appropriate features and the decoder restores the image to the input size.
  • a separate Transformer is designed for each modality to extract features.
  • a multi-branch Transformer is proposed, whose number of branches is equal to the number of input modalities in order to extract independent features of multiple modalities at the same time.
  • the three-dimensional whole brain image is divided into K fixed-size three-dimensional image blocks, mapped into a one-dimensional vector of fixed length D, and position encoding is added to retain position information, and input into the visual Transformer model.
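A compact PyTorch sketch of the per-modality branch just described: the 3D volume is cut into fixed-size patches, each patch is projected to a D-dimensional token, a learnable position encoding is added, and the tokens pass through that modality's own Transformer. `ModalityBranch`, `MultiBranchEncoder` and the patch size, depth and head counts are illustrative choices, not values from the patent.

```python
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """One Transformer branch for a single MRI modality."""
    def __init__(self, img_size=128, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 3        # K patches per volume
        # Conv3d with kernel = stride = patch_size performs the patch split + linear projection.
        self.to_tokens = nn.Conv3d(1, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                   # x: (B, 1, D, H, W)
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)   # (B, K, dim)
        return self.encoder(tokens + self.pos_embed)

class MultiBranchEncoder(nn.Module):
    """One branch per input modality (T1, T2, T1C, Flair)."""
    def __init__(self, num_modalities=4, **kw):
        super().__init__()
        self.branches = nn.ModuleList([ModalityBranch(**kw) for _ in range(num_modalities)])

    def forward(self, x):                                   # x: (B, num_modalities, D, H, W)
        return [branch(x[:, i:i + 1]) for i, branch in enumerate(self.branches)]
```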
  • Step S5 Design a modality fusion Transformer to fuse data from multiple modalities. Specifically:
  • this application separately designs a fusion Transformer based on a cross-attention mechanism.
  • the fusion Transformer based on the cross-attention mechanism is divided into two parts, namely a partial fusion Transformer and an overall fusion Transformer.
  • the partial fusion Transformer uses a single one-dimensional vector of each branch as a query to exchange information with other branches.
  • the partial fusion result is input into the overall fusion Transformer, and the multimodal information is more thoroughly integrated through the self-attention mechanism, thereby utilizing the global context information at the overall semantic structure level of the data.
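One way to sketch the two-stage fusion described above in PyTorch: in the partial fusion step a single summary token from each branch queries the tokens of the other branches via cross-attention, and the global fusion step applies self-attention over all branches' tokens. Using mean pooling to obtain the per-branch query token, and the dimensions, are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=8, depth=2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.global_fusion = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, branch_tokens):        # list of (B, K, dim), one entry per modality
        fused = []
        for i, tokens in enumerate(branch_tokens):
            # Partial fusion: one summary token of this branch queries all other branches.
            query = tokens.mean(dim=1, keepdim=True)                      # (B, 1, dim)
            others = torch.cat([t for j, t in enumerate(branch_tokens) if j != i], dim=1)
            exchanged, _ = self.cross_attn(query, others, others)         # (B, 1, dim)
            fused.append(tokens + exchanged)                              # broadcast back onto the branch
        # Global fusion: self-attention over the concatenated multimodal tokens.
        return self.global_fusion(torch.cat(fused, dim=1))                # (B, num_modalities*K, dim)
```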
  • Step S6 Establish a decoder to gradually reshape the encoder outputs of different scales into the input size to obtain a segmentation result that matches the original image. Specifically:
  • the decoder gradually reshapes the encoder outputs of different scales to the input size to obtain a segmentation result that matches the original image.
  • the decoder takes the encoder output as five channel inputs.
  • the encoder outputs of each layer are fused layer by layer through convolution and deconvolution operations, and the image is restored to the specified size, and the sigmoid function is applied to obtain the final segmentation result.
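A rough PyTorch sketch of the decoder behaviour described here, assuming the five encoder outputs have already been reshaped into 3D feature maps of decreasing resolution; the channel counts and normalization choices are placeholders.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Fuse five encoder feature maps coarse-to-fine and restore the input resolution."""
    def __init__(self, channels=(256, 128, 64, 32, 16)):
        super().__init__()
        self.up = nn.ModuleList()
        self.fuse = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # Deconvolution (transposed convolution) doubles the spatial size.
            self.up.append(nn.ConvTranspose3d(c_in, c_out, kernel_size=2, stride=2))
            # Convolutional fusion of the upsampled map with the skip feature.
            self.fuse.append(nn.Sequential(
                nn.Conv3d(2 * c_out, c_out, kernel_size=3, padding=1),
                nn.InstanceNorm3d(c_out), nn.ReLU(inplace=True)))
        self.head = nn.Conv3d(channels[-1], 1, kernel_size=1)

    def forward(self, feats):                 # feats: [f5, f4, f3, f2, f1], coarse to fine
        x = feats[0]
        for up, fuse, skip in zip(self.up, self.fuse, feats[1:]):
            x = fuse(torch.cat([up(x), skip], dim=1))
        return torch.sigmoid(self.head(x))    # final binary segmentation probabilities
```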
  • Step S7 construct weakly enhanced images and strongly enhanced images for unlabeled data. Specifically:
  • Two enhancement methods are designed for a single unlabeled image.
  • a transformation is randomly selected for each sample in the batch from a predefined range.
  • the first enhancement method is weak enhancement, which is the result of a random flip, move, and random scaling strategy with a probability of 50%.
  • the other enhancement method is strong enhancement, which adds grayscale transformation to the weakly enhanced image.
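A sketch of the two augmentation branches, assuming a volume is a PyTorch tensor of shape (C, D, H, W); each weak operation is applied with 50% probability, and the strong view adds a grayscale (gamma) transform on top of the weak view, as described above. The parameter ranges are illustrative.

```python
import random
import torch
import torch.nn.functional as F

def weak_augment(x):                        # x: (C, D, H, W)
    if random.random() < 0.5:               # random flip along one spatial axis
        x = torch.flip(x, dims=[random.choice([1, 2, 3])])
    if random.random() < 0.5:               # random shift (simple circular roll here)
        x = torch.roll(x, shifts=random.randint(-8, 8), dims=random.choice([1, 2, 3]))
    if random.random() < 0.5:               # random scaling
        scale = random.uniform(0.9, 1.1)
        x = F.interpolate(x.unsqueeze(0), scale_factor=scale, mode="trilinear",
                          align_corners=False).squeeze(0)
    return x

def strong_augment(x_weak):
    # Strong view = grayscale (gamma) transform applied on top of the weakly enhanced image.
    gamma = random.uniform(0.7, 1.5)
    x_min, x_max = x_weak.min(), x_weak.max()
    x = (x_weak - x_min) / (x_max - x_min + 1e-8)
    return x.pow(gamma) * (x_max - x_min) + x_min

# x_weak = weak_augment(volume); x_strong = strong_augment(x_weak)
```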
  • Step S8 select positive examples and negative examples according to the output of the encoder for different enhanced images, and calculate the contrast loss. Specifically:
  • the loss of unlabeled data is divided into two parts, including output space consistency loss and contrastive learning loss.
  • the calculation method of contrastive learning loss is that the encoder generates features based on weakly enhanced images and strongly enhanced images respectively. Features at the same position are regarded as positive examples, and features at different positions are regarded as negative examples.
  • the sampling method of negative examples adopts the Gumbel sampling strategy, and selects k pixels with the smallest cosine similarity to form negative examples, or selects pixels with a longer distance as negative examples based on anatomical prior knowledge.
  • the goal of contrastive learning loss is to increase its similarity with positive pixels and reduce its similarity with k negative pixels. To achieve this goal, InfoNCE loss is combined with cosine similarity to obtain pixel contrast loss.
  • the positive example uses all labels as 1 to calculate the cross entropy loss
  • the negative example uses all labels as 0 to calculate the cross entropy loss.
  • the sum of the losses is the contrastive learning loss.
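A simplified sketch of the pixel-level contrastive term as described in this step: features of the weak and strong views at the same position are positives, the k features with the smallest cosine similarity serve as negatives (a stand-in for the Gumbel-based or anatomy-based sampling), and positives and negatives are pushed towards labels 1 and 0 with an InfoNCE-style cross entropy. The temperature and k are illustrative.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feat_weak, feat_strong, k=16, tau=0.1):
    """feat_weak, feat_strong: (N, C) features at N sampled positions of the two views."""
    w = F.normalize(feat_weak, dim=1)
    s = F.normalize(feat_strong, dim=1)
    sim = (w @ s.t()) / tau                            # (N, N) scaled cosine similarities
    pos = sim.diagonal()                               # same position across views -> positives
    # Negatives: for each anchor, the k positions with the smallest cosine similarity
    # (a simple stand-in for the Gumbel sampling / anatomical-prior selection in the text).
    neg_sim = sim.clone()
    neg_sim.fill_diagonal_(float("inf"))
    neg, _ = neg_sim.topk(k, dim=1, largest=False)     # (N, k)
    # Cross entropy with positives labelled 1 and negatives labelled 0; their sum is the loss.
    pos_loss = F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
    neg_loss = F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg))
    return pos_loss + neg_loss
```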
  • Step S9 calculate the dice loss for the label and the segmentation result. Calculate the consistency loss for the output of the two branches of the unlabeled data.
  • the total loss is the sum of the supervised learning loss, the contrastive learning loss and the consistency loss. Specifically:
  • Calculate the total loss For the segmentation results obtained with labeled data, calculate the dice loss with the label as the supervised learning loss. For unlabeled data, calculate the consistency loss between the results of the weakly enhanced image and the strongly enhanced image; the consistency loss is added to the contrastive learning loss as the semi-supervised loss. The total loss is the sum of the semi-supervised loss and the supervised loss.
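A sketch of how the three terms described above might be combined, assuming `pred_labeled`, `pred_weak` and `pred_strong` are sigmoid probability maps; the soft Dice formulation, the mean-squared-error consistency term and the unit weighting are illustrative choices.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-5):
    # Soft Dice loss between the predicted probability map and the binary label.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(pred_labeled, label, pred_weak, pred_strong, contrastive, w_semi=1.0):
    supervised = dice_loss(pred_labeled, label)                   # labeled branch
    consistency = F.mse_loss(pred_strong, pred_weak.detach())     # weak/strong output agreement
    semi_supervised = consistency + contrastive                   # semi-supervised part
    return supervised + w_semi * semi_supervised
```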
  • Step S10 train the model, select the result with better effect as the final model and save it. Specifically:
  • data enhancement is performed using methods including but not limited to rotation, translation, scaling, and cropping to improve the generalization ability of the model;
  • stochastic gradient descent is used as the optimizer for training, with weight decay to prevent overfitting; for the input image data, the network output is a binary segmentation result;
  • the results output by the network correspond to the original image, assisting doctors in diagnosing patients.
  • the more accurate model is selected and saved under the supervised data of each proportion.
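A skeleton of the training loop implied by this step: stochastic gradient descent with weight decay, the combined loss from the sketches above, and the best checkpoint kept for each labeled-data proportion. The learning rate, momentum, the `return_features` keyword and the `val_fn` validation callback are assumptions.

```python
import torch

def train(model, labeled_loader, unlabeled_loader, val_fn, epochs=200):
    # Stochastic gradient descent with weight decay, as stated in the method description.
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
    best_dice, best_state = 0.0, None
    for epoch in range(epochs):
        # The unlabeled loader is assumed to yield (weak view, strong view) pairs.
        for (x_l, y_l), (x_w, x_s) in zip(labeled_loader, unlabeled_loader):
            pred_l = model(x_l)
            pred_w, feat_w = model(x_w, return_features=True)   # hypothetical feature-returning call
            pred_s, feat_s = model(x_s, return_features=True)
            # total_loss and pixel_contrastive_loss are the sketches shown earlier.
            loss = total_loss(pred_l, y_l, pred_w, pred_s,
                              pixel_contrastive_loss(feat_w, feat_s))
            opt.zero_grad()
            loss.backward()
            opt.step()
        dice = val_fn(model)                                     # validation Dice for this epoch
        if dice > best_dice:                                     # keep the most accurate checkpoint
            best_dice = dice
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    torch.save(best_state, "best_model.pt")
```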
  • This application uses the ability of contrastive learning to bring similar features closer and push heterogeneous features farther away to impose constraints on the feature space, further improving the effect of semi-supervised learning.
  • the model is constructed using visual Transformer instead of convolutional neural network, and the global receptive field brought by the attention mechanism is used to fuse multimodal information to better locate the tumor position, thereby improving the segmentation effect.

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A medical image segmentation method, comprising: collecting nuclear magnetic resonance image data of tumor patients as a data set (S1); performing data processing on the image data in the data set; after the data processing, using multi-modal images meeting requirements in the data set as input of a model; designing a separate Transformer for each modality to extract features; designing a modality fusion Transformer to fuse data of a plurality of modalities (S5); gradually reshaping encoder outputs of different scales into the input size so as to obtain a segmentation result matching the original image; for unlabeled data in the data set, constructing a weakly enhanced image and a strongly enhanced image; selecting positive and negative examples according to the encoder outputs for the differently enhanced images, and calculating a contrastive loss (S8); calculating a dice loss between labels and the segmentation result; and training the model to obtain and save a final model. The medical image segmentation method can better localize a tumor, thereby improving the segmentation effect.

Description

Medical Image Segmentation Method
Technical Field
The invention relates to a medical image segmentation method.
Background Technique
Medical image segmentation is the basis of various medical image applications. In clinical auxiliary diagnosis, image-guided surgery and radiotherapy, medical image segmentation technology shows increasingly important clinical value. Traditional medical image segmentation is based on manual segmentation by experienced doctors, but this purely manual segmentation method is often time-consuming and laborious, and is greatly affected by the doctor's subjective influence. With the rapid development of deep learning technology, fully automatic image segmentation based on deep learning has developed rapidly. However, deep learning often relies on a large amount of high-quality labeled data, while medical image data is often scarce, and it is usually difficult to obtain high-quality labeled data.
The semi-supervised learning framework can directly learn from limited labeled data and a large amount of unlabeled data to obtain high-quality segmentation results. Current semi-supervised medical image segmentation methods can be divided into three categories: adversarial learning methods, consistency regularization methods, and collaborative training methods. Adversarial learning methods use discriminators to align the distribution of labeled and unlabeled data in the embedding space. The data needs to meet the distribution assumption, and many adversarial learning models are difficult to train. The basic idea of the consistency regularization method is to regularize the model prediction, that is, a robust model should have similar outputs for similar inputs. The difference between each method lies in how to inject noise and how to calculate consistency, but the consistency regularization method relies on a suitable data augmentation strategy, and the wrong pseudo-labels will continue to strengthen during training. The collaborative training method is based on the assumption of low-density separation of data. The disadvantage of this method is that if the generated pseudo-labels are inaccurate, they will lead to self-reinforcement of classification errors.
In order to make full use of unlabeled data, semi-supervised segmentation is generally performed using adversarial learning methods, consistency regularization methods, and collaborative training methods. In general, the above methods all use the consistency of the output space and lack constraints in the feature space. Therefore, in many cases, the model cannot recognize the wrong features, causing this error to continue to accumulate during the training process.
Summary of the Invention
In view of this, it is necessary to provide a medical image segmentation method.
The present invention provides a medical image segmentation method, which comprises the following steps: a. collecting nuclear magnetic resonance image data of tumor patients as a data set; b. performing data processing on the image data in the data set, wherein the data processing comprises: performing format conversion, resampling, registration and standardization on the image data in the data set; c. taking multimodal images that meet the requirements in the data set after the data processing as input of a model; d. establishing a multi-branch Transformer neural network as an encoder, and designing a separate Transformer for each modality to extract features; e. designing a modality fusion Transformer to fuse data of multiple modalities; f. establishing a decoder, and gradually reshaping encoder outputs of different scales into input sizes to obtain a segmentation result matching the original image; g. constructing a weakly enhanced image and a strongly enhanced image for unlabeled data in the data set; h. selecting positive examples and negative examples according to the output of the encoder for different enhanced images, and calculating the contrast loss; i. calculating the dice loss for labels and segmentation results; j. training the model, selecting a result with better effect as the final model and saving it.
Specifically, the patient's nuclear magnetic resonance image data is a multimodal nuclear magnetic resonance image; the nuclear magnetic resonance image data of each patient includes four commonly used modalities; the four commonly used modalities are T1, T2, T1C, and Flair modalities.
Specifically, the step b specifically includes:
First, the DICOM format is converted into the NIFTI format; then the image is resampled; then the image is registered, and the points corresponding to the same spatial position at multiple time points are matched one by one. The rigid registration mode is used for registration, and the mutual information is used as the image similarity measure; the image data in the data set is standardized using grayscale normalization and histogram equalization methods.
Specifically, the step c specifically includes:
The multimodal images that meet the requirements in the dataset are used as the input of the model, and the dataset is divided into a training set and a test set. First, the magnetic resonance image data with missing modalities, failed registration, or without tumors are excluded to avoid affecting the generalization performance of the model. Then, the dataset is divided into a training set and a test set in a ratio of 4:1. For the training set, the labeled data and unlabeled data are divided as needed and processed separately.
Specifically, the step d specifically includes:
A separate Transformer is designed for each modality to extract features. For input with four modalities, a multi-branch Transformer is proposed with the same number of branches as the input modalities in order to simultaneously extract independent features of multiple modalities. The three-dimensional whole brain image is divided into K three-dimensional image blocks of fixed size, mapped into a one-dimensional vector of fixed length D, and position encoding is added to retain position information before being input into the visual Transformer model.
Specifically, the step e specifically includes:
A fusion Transformer based on the cross-attention mechanism is designed separately: the fusion Transformer based on the cross-attention mechanism is divided into two parts, namely, a partial fusion Transformer and a global fusion Transformer; the partial fusion Transformer uses a single one-dimensional vector of each branch as a query to exchange information with other branches, and inputs the partial fusion result into the global fusion Transformer, and the multimodal information is more thoroughly fused together through the self-attention mechanism therein, thereby utilizing the global context information at the overall semantic structure level of the data.
Specifically, the step f specifically includes:
The decoder gradually reshapes the encoder outputs of different scales to the input size to obtain a segmentation result that matches the original image. The decoder takes the encoder output as five channel inputs. The encoder outputs of each layer are fused layer by layer through convolution and deconvolution operations, and the image is restored to the specified size, and the sigmoid function is applied to obtain the final segmentation result.
Specifically, the step g specifically includes:
Two enhancement methods are designed for a single unlabeled image. In each training step, a transformation is randomly selected for each sample in the batch from a predefined range: the first enhancement method is weak enhancement, which is the result of random flipping, moving and random scaling strategies with a probability of 50%; the other enhancement method is strong enhancement, which adds grayscale transformation on the basis of the weakly enhanced image.
Specifically, the step h specifically includes:
The unlabeled data loss is divided into two parts, including the output space consistency loss and the contrastive learning loss. The contrastive learning loss is calculated in that the encoder generates features based on the weakly enhanced image and the strongly enhanced image respectively. The features at the same position are regarded as positive examples, and the features at different positions are regarded as negative examples. The sampling method of negative examples adopts the Gumbel sampling strategy, and selects k pixels with the smallest cosine similarity to form negative examples, or selects pixels with a longer distance as negative examples based on anatomical prior knowledge. The InfoNCE loss is combined with the cosine similarity to obtain the pixel contrast loss.
Specifically, the step i specifically includes:
For the segmentation results obtained with labeled data, the dice loss is calculated with the label as the supervised learning loss; for unlabeled data, the consistency loss is calculated between the results of weakly enhanced images and strongly enhanced images.
Specifically, the step j specifically includes:
Stochastic gradient descent is used as the optimizer for training, with weight decay to prevent overfitting; after the model training is completed, the more accurate model under each proportion of supervised data is selected and saved.
This application not only takes into account the consistency of the output space, but also solves to a certain extent the problem of error accumulation caused by the inability to filter out erroneous features that is common in current methods. It also uses Transformer as the main feature extraction network and utilizes the attention mechanism and global receptive field advantages in Transformer to locate tumors faster and more accurately, which improves the accuracy compared to the convolutional neural network method with only a local receptive field.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flow chart of a medical image segmentation method according to the present invention;
FIG. 2 is a schematic diagram of a Transformer neural network provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the Transformer neural network segmentation process provided by an embodiment of the present invention.
Detailed Description of the Embodiments
The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, there is shown a flowchart of a preferred embodiment of the medical image segmentation method of the present invention.
Please refer also to FIGS. 2-3. Step S1, collecting the MRI image data of tumor patients as a data set. Specifically:
In this embodiment, nuclear magnetic resonance image data of tumor patients are collected. The nuclear magnetic resonance image data of the patients are multimodal nuclear magnetic resonance images. The nuclear magnetic resonance image data of each patient includes four commonly used modalities; the four commonly used modalities are T1, T2, T1C, and Flair modalities.
The patient images obtained in this step come from the patient image datasets jointly collected by the hospital, TCIA (The Cancer Imaging Archive) and TCGA (The Cancer Genome Atlas).
This embodiment does not limit the size of the data set. The larger the data set, the stronger the generalization ability.
Step S2, performing data processing on the image data in the data set, the data processing comprising: performing format conversion, resampling, registration and standardization on the image data in the data set. Specifically:
Format conversion, resampling, registration and standardization are performed on the image data in the data set. DICOM (Digital Imaging and Communications in Medicine) refers to the medical digital image transmission protocol, which is a set of universal standard protocols for medical image processing, storage, printing and transmission.
The data obtained from the medical device is in DICOM format. First, the DICOM format is converted into NIFTI (Neuro Imaging Informatics Technology Initiative) format; then the image is resampled to improve the image resolution; then the image is registered, and the points corresponding to the same position in space at multiple time points are matched one by one. The rigid registration mode is used for registration, and mutual information is used as the image similarity metric. After registration and resampling, the spatial resolution of the image is 1 mm. Grayscale normalization, histogram equalization and other methods are used to standardize the image data in the data set.
Step S3, after the data is processed, the multimodal images that meet the requirements in the data set are used as the input of the model, and the data set is divided into a training set and a test set. The training set is divided into labeled data and unlabeled data as needed, and processed separately. Specifically:
The multimodal images that meet the requirements in the dataset are used as the input of the model, and the dataset is divided into a training set and a test set. First, the MRI data with missing modalities, failed registration, or without tumors are excluded to avoid affecting the generalization performance of the model. Then the dataset is divided into a training set and a test set at a ratio of 4:1. For the training set, the labeled data and unlabeled data are divided as needed and processed separately. In semi-supervised tasks, the proportion of labeled data strongly affects the segmentation results; therefore, the amount of labeled data in the training set is gradually reduced in 10% steps, and experiments are carried out separately.
Step S4, establish a multi-branch Transformer neural network as an encoder, and design a separate Transformer for each modality to extract features. Specifically:
A multi-branch Transformer neural network is established. The expected segmentation model is an encoder-decoder structure. The encoder extracts appropriate features and the decoder restores the image to the input size. A separate Transformer is designed for each modality to extract features. For input with four modalities, a multi-branch Transformer is proposed, whose number of branches is equal to the number of input modalities in order to extract independent features of multiple modalities at the same time. The three-dimensional whole brain image is divided into K fixed-size three-dimensional image blocks, mapped into a one-dimensional vector of fixed length D, and position encoding is added to retain position information before being input into the visual Transformer model.
Step S5, design a modality fusion Transformer to fuse data from multiple modalities. Specifically:
In order to fully integrate the features of each modality from multiple angles to produce stronger image features, this application separately designs a fusion Transformer based on a cross-attention mechanism. The fusion Transformer based on the cross-attention mechanism is divided into two parts, namely a partial fusion Transformer and an overall fusion Transformer. The partial fusion Transformer uses a single one-dimensional vector of each branch as a query to exchange information with other branches. The partial fusion result is input into the overall fusion Transformer, and the multimodal information is more thoroughly integrated through the self-attention mechanism, thereby utilizing the global context information at the overall semantic structure level of the data.
Step S6, establish a decoder to gradually reshape the encoder outputs of different scales into the input size to obtain a segmentation result that matches the original image. Specifically:
The decoder gradually reshapes the encoder outputs of different scales to the input size to obtain a segmentation result that matches the original image. The decoder takes the encoder output as five channel inputs. The encoder outputs of each layer are fused layer by layer through convolution and deconvolution operations, and the image is restored to the specified size, and the sigmoid function is applied to obtain the final segmentation result.
Step S7, construct weakly enhanced images and strongly enhanced images for unlabeled data. Specifically:
Two enhancement methods are designed for a single unlabeled image. In each training step, a transformation is randomly selected for each sample in the batch from a predefined range. The first enhancement method is weak enhancement, which is the result of a random flip, move, and random scaling strategy with a probability of 50%. The other enhancement method is strong enhancement, which adds grayscale transformation to the weakly enhanced image.
Step S8, select positive examples and negative examples according to the output of the encoder for different enhanced images, and calculate the contrastive loss. Specifically:
The loss of unlabeled data is divided into two parts, including output space consistency loss and contrastive learning loss. The calculation method of contrastive learning loss is that the encoder generates features based on weakly enhanced images and strongly enhanced images respectively. Features at the same position are regarded as positive examples, and features at different positions are regarded as negative examples. The sampling method of negative examples adopts the Gumbel sampling strategy, and selects k pixels with the smallest cosine similarity to form negative examples, or selects pixels with a longer distance as negative examples based on anatomical prior knowledge. The goal of the contrastive learning loss is to increase the similarity with positive pixels and reduce the similarity with the k negative pixels. To achieve this goal, the InfoNCE loss is combined with cosine similarity to obtain the pixel contrastive loss. Specifically, the cross-entropy loss is calculated for positive examples with all labels set to 1, and for negative examples with all labels set to 0; the sum of the two losses is the contrastive learning loss.
Step S9, calculate the dice loss for the label and the segmentation result, and calculate the consistency loss for the outputs of the two branches of the unlabeled data. The total loss is the sum of the supervised learning loss, the contrastive learning loss and the consistency loss. Specifically:
Calculate the total loss. For the segmentation results obtained with labeled data, calculate the dice loss with the label as the supervised learning loss. For unlabeled data, calculate the consistency loss between the results of the weakly enhanced image and the strongly enhanced image; the consistency loss is added to the contrastive learning loss as the semi-supervised loss. The total loss is the sum of the semi-supervised loss and the supervised loss.
Step S10, train the model, select the result with better effect as the final model and save it. Specifically:
During training, data enhancement is performed using methods including but not limited to rotation, translation, scaling, and cropping to improve the generalization ability of the model;
Stochastic gradient descent is used as the optimizer for training, and weight decay is used to prevent overfitting. For the input image data, the network output is a binary segmentation result;
The results output by the network correspond to the original image, assisting doctors in diagnosing patients.
After the model training is completed, the more accurate model under each proportion of supervised data is selected and saved.
This application uses the ability of contrastive learning to bring similar features closer and push heterogeneous features farther away to impose constraints on the feature space, further improving the effect of semi-supervised learning. The model is constructed using a visual Transformer instead of a convolutional neural network, and the global receptive field brought by the attention mechanism is used to fuse multimodal information to better locate the tumor position, thereby improving the segmentation effect.
The specific implementation methods of the invention are described in detail above, but they are only examples, and the invention is not limited to the specific implementation methods described above. For those skilled in the art, any equivalent modification or substitution of the invention is also within the scope of the invention; therefore, equivalent changes, modifications and improvements made without departing from the spirit and principle of the invention should all be included in the scope of the invention.

Claims (11)

  1. A medical image segmentation method, characterized in that the method comprises the following steps:
    a. Collect magnetic resonance imaging data of tumor patients as a data set;
    b. performing data processing on the image data in the data set, the data processing comprising: format conversion, resampling, registration and standardization on the image data in the data set;
    c. After data processing, the multimodal images in the data set that meet the requirements are used as the input of the model;
    d. Establish a multi-branch Transformer neural network as an encoder and design a separate Transformer for each modality to extract features;
    e. Design a modality fusion Transformer to fuse data from multiple modalities;
    f. Build a decoder to gradually reshape the encoder outputs of different scales to the input size to obtain a segmentation result that matches the original image;
    g. For the unlabeled data in the dataset, construct weakly enhanced images and strongly enhanced images;
    h. Select positive and negative examples based on the encoder's output of different enhanced images and calculate the contrast loss;
    i. Calculate the dice loss for the labels and segmentation results;
    j. Train the model, select the result with better effect as the final model and save it.
  2. The medical image segmentation method as described in claim 1, characterized in that the patient's nuclear magnetic resonance image data is a multimodal nuclear magnetic resonance image; the nuclear magnetic resonance image data of each patient includes four commonly used modalities; the four commonly used modalities are T1, T2, T1C, and Flair modalities.
  3. The medical image segmentation method according to claim 2, wherein the step b specifically comprises:
    First, the DICOM format is converted into the NIFTI format; then the image is resampled; then the image is registered, and the points corresponding to the same spatial position at multiple time points are matched one by one. The rigid registration mode is used for registration, and the mutual information is used as the image similarity measure; the image data in the data set is standardized using grayscale normalization and histogram equalization methods.
  4. The medical image segmentation method according to claim 3, characterized in that said step c specifically comprises:
    The multimodal images that meet the requirements in the dataset are used as the input of the model, and the dataset is divided into a training set and a test set. First, the magnetic resonance image data with missing modalities, failed registration, or without tumors are excluded to avoid affecting the generalization performance of the model. Then, the dataset is divided into a training set and a test set in a ratio of 4:1. For the training set, labeled data and unlabeled data are divided as needed and processed separately.
  5. The medical image segmentation method according to claim 4, characterized in that the step d specifically comprises:
    A separate Transformer is designed for each modality to extract features. For input with four modalities, a multi-branch Transformer is proposed with the same number of branches as the input modalities in order to simultaneously extract independent features of multiple modalities. The three-dimensional whole brain image is divided into K three-dimensional image blocks of fixed size, mapped into a one-dimensional vector of fixed length D, and position encoding is added to retain position information before being input into the visual Transformer model.
  6. The medical image segmentation method according to claim 5, characterized in that the step e specifically comprises:
    A fusion Transformer based on the cross-attention mechanism is designed separately: the fusion Transformer based on the cross-attention mechanism is divided into two parts, namely, a partial fusion Transformer and a global fusion Transformer; the partial fusion Transformer uses a single one-dimensional vector of each branch as a query to exchange information with other branches, and inputs the partial fusion result into the global fusion Transformer, and the multimodal information is more thoroughly fused together through the self-attention mechanism therein, thereby utilizing the global context information at the overall semantic structure level of the data.
  7. The medical image segmentation method according to claim 6, wherein the step f specifically comprises:
    The decoder gradually reshapes the encoder outputs of different scales to the input size to obtain a segmentation result that matches the original image. The decoder takes the encoder output as five channel inputs. The encoder outputs of each layer are fused layer by layer through convolution and deconvolution operations, and the image is restored to the specified size, and the sigmoid function is applied to obtain the final segmentation result.
  8. The medical image segmentation method according to claim 7, characterized in that the step g specifically comprises:
    Two enhancement methods are designed for a single unlabeled image. In each training step, a transformation is randomly selected for each sample in the batch from a predefined range: the first enhancement method is weak enhancement, which is the result of random flipping, moving and random scaling strategies with a probability of 50%; the other enhancement method is strong enhancement, which adds grayscale transformation on the basis of the weakly enhanced image.
  9. The medical image segmentation method according to claim 8, wherein the step h specifically comprises:
    The unlabeled data loss is divided into two parts, including output space consistency loss and contrastive learning loss. The calculation method of contrastive learning loss is that the encoder generates features based on weakly enhanced images and strongly enhanced images respectively. Features at the same position are regarded as positive examples, and features at different positions are regarded as negative examples. The sampling method of negative examples adopts the Gumbel sampling strategy, and selects k pixels with the smallest cosine similarity to form negative examples, or selects pixels with a longer distance as negative examples based on anatomical prior knowledge. The InfoNCE loss is combined with the cosine similarity to obtain the pixel contrast loss.
  10. The medical image segmentation method according to claim 9, wherein the step i specifically comprises:
    For the segmentation results obtained with labeled data, the dice loss is calculated with the label as the supervised learning loss; for unlabeled data, the consistency loss is calculated between the results of weakly enhanced images and strongly enhanced images.
  11. The medical image segmentation method according to claim 10, characterized in that the step j specifically comprises:
    Use stochastic gradient descent as the optimizer for training, and use weight decay to prevent overfitting; after the model training is completed, select the more accurate model under supervised data of various proportions to save.
PCT/CN2022/131075 2022-11-10 2022-11-10 Medical image segmentation method WO2024098318A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/131075 WO2024098318A1 (en) 2022-11-10 2022-11-10 Medical image segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/131075 WO2024098318A1 (en) 2022-11-10 2022-11-10 Medical image segmentation method

Publications (1)

Publication Number Publication Date
WO2024098318A1 true WO2024098318A1 (en) 2024-05-16

Family

ID=91031586

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/131075 WO2024098318A1 (en) 2022-11-10 2022-11-10 Medical image segmentation method

Country Status (1)

Country Link
WO (1) WO2024098318A1 (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113180633A (en) * 2021-04-28 2021-07-30 济南大学 MR image liver cancer postoperative recurrence risk prediction method and system based on deep learning
WO2022228958A1 (en) * 2021-04-28 2022-11-03 Bayer Aktiengesellschaft Method and apparatus for processing of multi-modal data
CN115272386A (en) * 2021-04-30 2022-11-01 中国医学科学院基础医学研究所 Multi-branch segmentation system for cerebral hemorrhage and peripheral edema based on automatic generation label
CN113674253A (en) * 2021-08-25 2021-11-19 浙江财经大学 Rectal cancer CT image automatic segmentation method based on U-transducer
CN114494296A (en) * 2022-01-27 2022-05-13 复旦大学 Brain glioma segmentation method and system based on fusion of Unet and Transformer
CN114972756A (en) * 2022-05-30 2022-08-30 湖南大学 Semantic segmentation method and device for medical image
CN115147600A (en) * 2022-06-17 2022-10-04 浙江中医药大学 GBM multi-mode MR image segmentation method based on classifier weight converter

Similar Documents

Publication Publication Date Title
Liu et al. Auto-encoding knowledge graph for unsupervised medical report generation
Zhuang et al. An Effective WSSENet-Based Similarity Retrieval Method of Large Lung CT Image Databases.
Mahapatra et al. Training data independent image registration using generative adversarial networks and domain adaptation
CN115908800A (en) Medical image segmentation method
CN111260705B (en) Prostate MR image multi-task registration method based on deep convolutional neural network
CN114596318A (en) Breast cancer magnetic resonance imaging focus segmentation method based on Transformer
Wang et al. Multiscale transunet++: dense hybrid u-net with transformer for medical image segmentation
Han et al. Multi-scale 3D convolution feature-based broad learning system for Alzheimer’s disease diagnosis via MRI images
CN115512110A (en) Medical image tumor segmentation method related to cross-modal attention mechanism
Albishri et al. AM-UNet: automated mini 3D end-to-end U-net based network for brain claustrum segmentation
Fonov et al. DARQ: Deep learning of quality control for stereotaxic registration of human brain MRI to the T1w MNI-ICBM 152 template
Zhang et al. TW-Net: Transformer weighted network for neonatal brain MRI segmentation
Men et al. Continual improvement of nasopharyngeal carcinoma segmentation with less labeling effort
WO2024098318A1 (en) Medical image segmentation method
Verma et al. Role of deep learning in classification of brain MRI images for prediction of disorders: a survey of emerging trends
Cheng et al. Feature-enhanced adversarial semi-supervised semantic segmentation network for pulmonary embolism annotation
Li et al. IAS‐NET: Joint intraclassly adaptive GAN and segmentation network for unsupervised cross‐domain in neonatal brain MRI segmentation
CN116258685A (en) Multi-organ segmentation method and device for simultaneous extraction and fusion of global and local features
Guan et al. A mutual promotion encoder-decoder method for ultrasonic hydronephrosis diagnosis
CN114359194A (en) Multi-mode stroke infarct area image processing method based on improved U-Net network
Chen et al. WS-MTST: Weakly Supervised Multi-Label Brain Tumor Segmentation With Transformers
Dharwadkar et al. Right ventricle segmentation of magnetic resonance image using the modified convolutional neural network
Liu et al. Joint cranial bone labeling and landmark detection in pediatric CT images using context encoding
Tan et al. SwinUNeLCsT: Global–local spatial representation learning with hybrid CNN–transformer for efficient tuberculosis lung cavity weakly supervised semantic segmentation
Wu et al. RClaNet: An explainable Alzheimer's disease diagnosis framework by joint registration and classification