CN117274147A - Lung CT image segmentation method based on mixed Swin Transformer U-Net - Google Patents
Lung CT image segmentation method based on mixed Swin Transformer U-Net
- Publication number
- CN117274147A CN117274147A CN202211412454.0A CN202211412454A CN117274147A CN 117274147 A CN117274147 A CN 117274147A CN 202211412454 A CN202211412454 A CN 202211412454A CN 117274147 A CN117274147 A CN 117274147A
- Authority
- CN
- China
- Prior art keywords
- swin
- module
- lung
- image
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0012—Biomedical image inspection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10081—Computed x-ray tomography [CT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20021—Dividing image into blocks, subimages or windows
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30061—Lung
Abstract
The invention relates to a lung CT image segmentation method based on a hybrid Swin Transformer U-Net. The method comprises data preprocessing and data enhancement; construction of the segmentation model HySwinUNet; setting a training strategy and a loss function and training the model; and verifying the trained model. The HySwinUNet model combines convolution with the Transformer and adds a pre-activation residual module, exploiting the inductive bias of convolutions on images to avoid large-scale pre-training; during the forward and backward propagation of the network, any module can pass information directly to any other module, which reduces the training burden and lets the network train better. An adaptive attention module integrates two attention mechanisms to acquire multi-scale global features, raising the weight ratio of target-region features. The invention combines the Swin Transformer and U-Net to enhance the functionality and flexibility of the traditional encoder-decoder architecture, realizes automatic segmentation of lung-infection regions in lung CT, and can accurately segment infected areas from CT images.
Description
Technical Field
The invention belongs to the technical field of image segmentation, and relates to a lung CT image segmentation method based on a hybrid Swin Transformer U-Net.
Background
Medical images play a critical role in helping clinicians diagnose and treat patients. The study of medical images has traditionally depended on the radiologist's visual interpretation, which typically takes a great deal of time and is highly subjective, varying with the radiologist's experience. To overcome these limitations, computer-aided systems have become necessary. Automated medical image segmentation plays an important role in medical imaging applications, with wide use in diagnosis, pathology localization, the study of anatomical structure, treatment planning, computer-integrated surgery, and other fields. However, the variability and complexity of human anatomy mean that medical image segmentation remains a difficult problem.
The current standard for diagnosing COVID-19 is the real-time reverse transcription polymerase chain reaction (RT-PCR) swab test. However, RT-PCR results take several hours to process, and the assay's false-negative rate is high, often requiring repeated testing. Compared with RT-PCR, chest computed tomography (CT) imaging enables efficient screening for COVID-19, with high sensitivity and ease of use in a clinical setting.
Applying deep learning to medical diagnosis can improve the detection rate and efficiency for many diseases, and it has had great success in medical image recognition. To diagnose lung cancer, lung tumors, and lung nodules, many researchers have studied deep-learning-based lung CT image recognition, and CT image recognition has proven very useful for diagnosing lung diseases. If infected lung areas can be accurately segmented from CT images, this is of great value for the quantification and diagnosis of lung diseases, including COVID-19.
However, accurate segmentation of lung infection lesions on CT images remains challenging for two reasons: 1. on CT images, infection boundaries are irregular, vary in size and shape, and appear blurred with low contrast, so small ground-glass lesions are easily missed or the infection is over-segmented; 2. labeled datasets are scarce, and large-scale infection annotations by clinicians are not readily available.
Disclosure of Invention
The invention aims to provide a lung CT image segmentation method based on a hybrid Swin Transformer U-Net that accurately segments lung infection areas from CT images.
The method specifically comprises the following steps:
step one, data preprocessing and data enhancement:
collecting a large number of public lung-infection CT images, performing data enhancement to expand the number of samples, and normalizing the images, which then serve as the model's training set; the data enhancement specifically applies random cropping, flipping, rotation, scaling, and translation to the images.
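As a minimal sketch of the enhancement and normalization in step one (the patent does not fix the exact parameters, so the flip probability, the 90-degree rotation steps, and the zero-mean/unit-variance normalization below are assumptions for illustration):

```python
import numpy as np

def augment_and_normalize(img, rng):
    """Randomly flip/rotate a square CT slice, then normalize it.

    Toy stand-ins for the augmentations named in step one; cropping,
    scaling and translation are omitted for brevity.
    """
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)                 # random horizontal flip
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # rotation by a multiple of 90 degrees
    img = img.astype(np.float64)
    return (img - img.mean()) / (img.std() + 1e-8)  # zero mean, unit variance

rng = np.random.default_rng(0)
ct_slice = rng.integers(0, 255, size=(64, 64))      # a fake 64x64 CT slice
out = augment_and_normalize(ct_slice, rng)
```

Rotations are restricted to multiples of 90 degrees so the slice shape is preserved without interpolation; arbitrary-angle rotation would need a resampling library.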
Step two, constructing a segmentation model HySwinUNet:
the segmentation model HySwinUNet is built on the encoder-decoder structure of U-Net and comprises an encoder, an adaptive attention module, a decoder, and skip connections; the basic unit of HySwinUNet is the Swin Transformer block (Swin Transformer Block), and the Swin Transformer serves as the backbone network of the U-Net;
in the encoder, the input image is first divided into 4×4 patches by patch partition (Patch Partition); after linear embedding (Linear Embedding), the dimension of each token vector becomes a preset value C; the feature map of dimension C and resolution H/4 × W/4 is fed into two successive Swin Transformer blocks for representation learning, during which feature dimension and resolution remain unchanged; the Swin Transformer block is responsible for feature representation learning, after which patch merging (Patch Merging) downsamples and increases the dimension, halving the spatial size and doubling the feature dimension, so that a hierarchical design is formed; this procedure is repeated three times in the encoder, and a pre-activation residual block (PRB) is traversed first during each layer's propagation;
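The shape bookkeeping of patch partition and patch merging can be traced with plain array reshapes; the learned linear embedding to dimension C and the merging's linear reduction to 2C are omitted, and the 224×224 input size is an assumption for illustration:

```python
import numpy as np

def patch_partition(img, patch=4):
    """Split an H x W x ch image into non-overlapping patch x patch tiles and
    flatten each tile into one token, as in the encoder's Patch Partition."""
    H, W, ch = img.shape
    t = img.reshape(H // patch, patch, W // patch, patch, ch)
    return t.transpose(0, 2, 1, 3, 4).reshape(H // patch, W // patch, patch * patch * ch)

def patch_merging(x):
    """Concatenate each 2x2 neighbourhood of tokens: halves the resolution and
    quadruples the channels (a learned linear layer would then halve them to 2C)."""
    h, w, c = x.shape
    m = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return m.reshape(h // 2, w // 2, 4 * c)

img = np.zeros((224, 224, 1))      # assumed single-channel CT input
tok = patch_partition(img)         # resolution H/4 x W/4
merged = patch_merging(tok)        # spatial size halved, channels x4
```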
during encoding, an adaptive dual-attention module (ADM) is adopted to locate the feature information of the region of interest (RoI) and suppress uncorrelated regional features, so that feature information is extracted effectively and lesion regions are segmented more accurately; this raises the weight ratio of target-region features and improves the network's segmentation precision;
a symmetric decoder is constructed from Swin Transformer blocks; upsampling reshapes feature maps of adjacent stages into higher-resolution feature maps and correspondingly halves the feature dimension; the extracted context features are fused with the encoder's multi-scale features through skip connections to compensate for the spatial information lost during downsampling and recover valuable spatial detail.
A pre-activation residual block is placed at the entrance of the encoding stage and the exit of the decoding stage; it initializes the Transformer with a convolutional network, using convolutional layers to extract local intensity features and thereby sparing the Transformer large-scale pre-training, which makes the Swin Transformer easier to train. Since misclassified regions typically lie on the boundary of the region of interest (RoI), high-resolution context information plays a critical role in segmentation. The block applies two successive stages of batch normalization (BN), ReLU activation, and convolution (Conv), then performs element-wise addition with the original input; the convolution is applied after the ReLU layer and does not change the dimension or resolution of the feature map. The pre-activation residual block lets information flow more smoothly during the forward and backward propagation of the network;
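A toy single-channel sketch of the BN → ReLU → Conv, BN → ReLU → Conv, add structure of the pre-activation residual block (the per-map normalization and the hand-rolled 3×3 "same" convolution are assumed stand-ins for the learned layers):

```python
import numpy as np

def bn_relu_conv(x, w):
    """One BN -> ReLU -> 3x3 Conv stage (single channel, 'same' padding)."""
    x = (x - x.mean()) / (x.std() + 1e-8)   # toy per-map batch norm
    x = np.maximum(x, 0.0)                  # ReLU
    pad = np.pad(x, 1)
    out = np.zeros_like(x)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = (pad[i:i + 3, j:j + 3] * w).sum()
    return out

def pre_activation_residual_block(x, w1, w2):
    """Two successive BN-ReLU-Conv stages, then element-wise addition with the
    original input; resolution and dimension are unchanged, as described above."""
    return x + bn_relu_conv(bn_relu_conv(x, w1), w2)

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3)) * 0.1
y = pre_activation_residual_block(x, w, w)
```

Because normalization and activation precede the convolution, the skip path is a pure identity and gradients reach earlier layers unscaled, which is the reason this "pre-activation" ordering trains more smoothly.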
the Swin converter modules are constructed based on moving windows, comprising two successive Swin converters, each Swin converter comprising a multi-headed self-attention Module (MSA) and a multi-layer perceptron (MLP), further employing a Layer Normalization (LN) layer prior to each MSA module and MLP module; on the basis of the multi-head self-attention module, the Swin converter provides a window-based multi-head self-attention module (W-MSA) and a moving window-based multi-head self-attention module (SW-MSA), and the calculation formula is as follows:
ẑ^l = W-MSA(LN(z^{l−1})) + z^{l−1}
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1}
where ẑ^l and z^l denote the outputs of the W-MSA module and the MLP of layer l, and ẑ^{l+1} and z^{l+1} denote the outputs of the SW-MSA module and the MLP of layer l+1; the self-attention of W-MSA and SW-MSA is computed as
Attention(Q, K, V) = SoftMax(QK^T/√d + B)V
where Q, K, V ∈ R^{M²×d} are the query, key, and value matrices; M² is the number of patches in a window and d is the dimension of the query or key; the values of B are taken from the bias matrix B̂ ∈ R^{(2M−1)×(2M−1)}.
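The windowed self-attention formula can be exercised numerically; for brevity Q, K, and V are taken equal to the window's tokens rather than learned projections (an assumption), and the relative position bias B is random:

```python
import numpy as np

def window_attention(x, B, d):
    """Attention(Q, K, V) = SoftMax(QK^T / sqrt(d) + B) V inside one window."""
    Q = K = V = x                                          # (M*M, d) tokens
    scores = Q @ K.T / np.sqrt(d) + B                      # (M*M, M*M) with bias B
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)         # row-wise SoftMax
    return attn @ V                                        # convex mix of values

M, d = 7, 32                                               # 7x7 window, head dim 32
rng = np.random.default_rng(1)
x = rng.standard_normal((M * M, d))
B = rng.standard_normal((M * M, M * M)) * 0.01
out = window_attention(x, B, d)
```

Because each SoftMax row is non-negative and sums to one, every output token is a convex combination of the value vectors, so the output stays within the range of the inputs.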
The input channel of the adaptive dual-attention module (ADM) is passed through 3×3 convolutions with dilation rates of 1 and 3, and the results are combined into a dual-attention input so that two different attention mechanisms discover different global information; from the C×H×W feature matrix, channel statistics are obtained by global average pooling (Global Average Pool) and pixel-wise correlation (Pixel-wise Correlation), combined by a concatenation operation (Concat), and then normalized with a Sigmoid function; fully connected layers (FC) then generate further non-linear features; finally, a Softmax operation is applied over the channels, and this cross-channel attention adaptively selects receptive fields of different sizes.
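A sketch of the two channel descriptors the module combines; the patent does not give the exact pixel-wise correlation formula, so the squared-mean statistic below is an assumed stand-in, and the convolutions and FC layers are omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_channel_descriptors(feat):
    """Two per-channel statistics in the spirit of the dual-attention input:
    global average pooling summarizes overall channel response, while a
    pixel-wise statistic (assumed: mean of the squared map) reacts to local
    intensity structure; both are squashed with Sigmoid and concatenated."""
    gap = feat.mean(axis=(1, 2))            # (C,) global average pool
    pwc = (feat ** 2).mean(axis=(1, 2))     # (C,) pixel-wise statistic (assumption)
    return np.concatenate([sigmoid(gap), sigmoid(pwc)])  # (2C,) in (0, 1)

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 16, 16))     # C=8 feature maps of 16x16
desc = dual_channel_descriptors(feat)
```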
Step three, setting a training strategy and a loss function, and training the model;
dividing the preprocessed dataset into a training set, a test set, and a verification set; adopting random initialization and the Adam optimization algorithm; setting the batch size, the number of epochs, and an appropriate learning rate, while adopting a regularization strategy to prevent overfitting; updating the weights and biases of the segmentation model HySwinUNet with the back-propagation algorithm; and updating the parameters during training iterations using the loss function;
step four, verifying the trained network model: the verification set is fed into the trained segmentation model HySwinUNet, whose output segments the lesion region in each lung CT image; the model is evaluated by comparing expert-segmented CT images with the images segmented by the trained network;
after verification, any lung CT image is input into a segmentation model HySwinUNet, and a lung CT image of the segmented focus is output.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
the invention effectively combines the Swin transducer and the U-Net to enhance the functionality and flexibility of the traditional encoder-decoder architecture, is applied to the field of medical image segmentation, realizes automatic segmentation of lung CT lung infection parts, and can accurately segment lung infection areas from CT images.
Because the Transformer has no inductive bias for images, it performs poorly on small-scale datasets; even pre-trained on ImageNet, a Transformer may not perform as well as a residual network. The HySwinUNet model combines convolution with the Transformer and adds a pre-activation residual block (PRB); it exploits the inductive bias of convolutions on images to avoid large-scale pre-training, and during the forward and backward propagation of the network any module can pass information directly to any other module, reducing the training burden and letting the network train better;
An adaptive dual-attention module integrates two attention mechanisms to acquire multi-scale global features, raising the weight ratio of target-region features so that the irregular, low-contrast lesion regions of novel coronavirus infection in CT images are segmented more accurately. On CT images, infected areas may have discontinuous boundaries and irregular shapes, and the images may be blurred with low contrast; channel and pixel information is key to acquiring representative features of the region of interest, so the adaptive dual-attention module (Adaptive Dual-attention Module) is used to extract feature information and raise the weight ratio of target-region features.
Drawings
Fig. 1 is a block diagram of the HySwinUNet network of the invention;
FIG. 2 is a block diagram of the pre-activation residual block of the invention;
FIG. 3 illustrates the Swin Transformer block;
Fig. 4 illustrates the adaptive attention module.
Detailed Description
The invention is further described with reference to the accompanying drawings:
a lung CT image segmentation method based on a mixture Swin Transformer U-Net specifically comprises the following steps:
step one, data preprocessing and data enhancement;
in this example, novel-coronavirus-infection CT images collected by the Italian Society of Medical and Interventional Radiology (SIRM) and those in the MosMedData dataset are used to train the model; random cropping, flipping, rotation, scaling, translation, and similar operations enlarge the dataset, increase the number of training samples, and improve the robustness of the model; finally, the images are normalized;
step two, constructing a segmentation model HySwinUNet:
as shown in fig. 1, the segmentation model HySwinUNet is a U-Net-based encoder-decoder architecture comprising an encoder, an adaptive attention module, a decoder, and skip connections; the basic unit of HySwinUNet is the Swin Transformer block (Swin Transformer Block), with the Swin Transformer as the backbone network of the U-Net;
for the encoder, the input image is first divided into 4×4 patches (Patch) by patch partition (Patch Partition); after linear embedding (Linear Embedding), the dimension of each token vector becomes a preset value C; the feature map of dimension C and resolution H/4 × W/4 is fed into two successive Swin Transformer blocks for representation learning, during which both feature dimension and resolution remain unchanged; the Swin Transformer block is responsible for feature representation learning, while patch merging (Patch Merging) downsamples and increases the dimension, halving the spatial size and doubling the feature dimension, thereby forming a hierarchical design; during each layer's propagation a pre-activation residual block (Pre-activation Residual Block) is traversed first; this procedure is repeated three times in the encoder;
during encoding, in order to extract feature information effectively and segment lesion regions more accurately, an adaptive dual-attention module (Adaptive Dual-attention Module) is adopted to locate the feature information of the region of interest (RoI) and suppress uncorrelated regional features, raising the weight ratio of target-region features and improving segmentation precision;
for the decoder, a symmetric decoder is constructed from Swin Transformer blocks; upsampling reshapes feature maps of adjacent stages into higher-resolution feature maps and correspondingly halves the feature dimension; the extracted context features are fused with the encoder's multi-scale features through skip connections to compensate for the spatial information lost during downsampling and recover valuable spatial detail;
as shown in fig. 2, the Transformer has no inductive bias for images and therefore performs poorly on small-scale datasets; even pre-trained on ImageNet it may not perform as well as a residual network; the HySwinUNet model combines convolution with the Transformer: a pre-activation residual block (Pre-activation Residual Block) initializes the Transformer with a convolutional network, using convolutional layers to extract local intensity features and thereby sparing the Transformer large-scale pre-training, which makes the Swin Transformer easier to train; since misclassified regions typically lie on the boundary of the region of interest (RoI), high-resolution context information plays a critical role in segmentation; the block applies two successive stages of batch normalization (BN), ReLU activation, and convolution (Conv), then performs element-wise addition with the original input; the convolution is applied after the ReLU layer and does not change the dimension or resolution of the feature map; the pre-activation residual block lets information flow more smoothly during the forward and backward propagation of the network; pre-activation residual blocks are adopted at both the entrance of the encoding stage and the exit of the decoding stage;
as shown in fig. 3, unlike a conventional multi-head self-attention module (MSA), the Swin Transformer block is constructed around shifted windows and consists of two successive Swin Transformer layers, each composed of a multi-head self-attention module (Multi-head Self-attention Module) and a multi-layer perceptron (MLP), with a layer normalization (LN) layer applied before each MSA and MLP module; building on multi-head self-attention, the Swin Transformer provides a window-based multi-head self-attention module (W-MSA) and a shifted-window-based multi-head self-attention module (SW-MSA), computed as follows:
ẑ^l = W-MSA(LN(z^{l−1})) + z^{l−1}
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1}
where ẑ^l and z^l denote the outputs of the W-MSA module and the MLP of layer l, and ẑ^{l+1} and z^{l+1} denote the outputs of the SW-MSA module and the MLP of layer l+1; the self-attention of W-MSA and SW-MSA is
Attention(Q, K, V) = SoftMax(QK^T/√d + B)V
where Q, K, V ∈ R^{M²×d} are the query, key, and value matrices; M² is the number of patches in a window and d is the dimension of the query or key; the values of B are taken from the bias matrix B̂ ∈ R^{(2M−1)×(2M−1)};
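The "shifted window" in SW-MSA is commonly realized as a cyclic shift of the feature map before it is re-partitioned into windows, so that tokens near old window borders can attend to each other; this detail follows the original Swin Transformer design, which the patent does not spell out:

```python
import numpy as np

def cyclic_shift(x, shift):
    """Cyclically shift a 2-D token map by `shift` along both spatial axes;
    np.roll reproduces the shift applied before window re-partitioning in
    SW-MSA, and the inverse shift restores the map afterwards."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

x = np.arange(16).reshape(4, 4)   # a toy 4x4 token map
shifted = cyclic_shift(x, 2)      # shift by half a (window) size of 4
restored = cyclic_shift(shifted, -2)
```

After the shift, tokens that sat on opposite sides of a window boundary fall into the same window; in the full model a mask keeps wrapped-around tokens from attending across the image edge.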
The infected area on a lung CT image may have discontinuous boundaries and irregular shapes, and the image may be blurred with low contrast; channel and pixel information is key to acquiring representative features of the region of interest; therefore an adaptive dual-attention module (Adaptive Dual-attention Module) is used to extract more comprehensive and discriminative feature information and raise the weight ratio of target-region features so as to identify lesion boundaries; the module captures the discontinuous boundaries of a lung CT lesion with global average pooling and handles shape irregularity with a separate pixel-wise correlation;
the structure of the adaptive attention module is shown in fig. 4; the input channel is passed through 3×3 convolutions with dilation rates (dilation rate) of 1 and 3 respectively, and the results are combined into a dual-attention input so that two different attention mechanisms discover different global information; from the C×H×W feature matrix, channel statistics are obtained by global average pooling (Global Average Pool) and pixel-wise correlation (Pixel-wise Correlation), combined by a concatenation operation (Concat), and then normalized with a Sigmoid function; fully connected layers (FC) then generate further non-linear features; finally, a Softmax operation is applied over the channels and a feature map of the original size is output; this cross-channel attention adaptively selects receptive fields of different sizes;
step three, setting a training strategy and a loss function, and training the model;
dividing the preprocessed dataset sequentially into a training set, a test set, and a verification set in the ratio 5:3:2; adopting random initialization and the Adam optimization algorithm; setting the batch size, the number of epochs, and an appropriate learning rate, while adopting a regularization strategy to prevent overfitting; updating the weights and biases of the HySwinUNet network model with the back-propagation algorithm; and updating the parameters during training iterations using the loss function;
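The 5:3:2 split can be sketched as follows (the shuffle and seed are implementation assumptions):

```python
import numpy as np

def split_dataset(n, seed=0):
    """Shuffle sample indices and split them 5:3:2 into
    train / test / verification subsets, as specified in step three."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = n * 5 // 10
    n_test = n * 3 // 10
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

train, test, val = split_dataset(100)
```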
the HySwinUNet network model is trained according to the set training strategy; in the training stage, HySwinUNet is trained end-to-end using an objective function, with the loss function updating the parameters at each iteration; for the loss function, all networks are trained with a combination of the Dice loss (Dice Loss) and the binary cross-entropy loss (Binary Cross Entropy Loss), so the loss function is
Loss = αL_Dice + βL_BCE
L_Dice = 1 − 2Σ_i y_i ŷ_i / (Σ_i y_i + Σ_i ŷ_i)
L_BCE = −Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]
where y_i is the true label of sample i and ŷ_i is the predicted probability for sample i; L_Dice and L_BCE denote the Dice loss and the binary cross-entropy loss respectively; Loss is the final loss function, combining the Dice loss and the binary cross-entropy loss in one term; giving L_Dice the larger weight handles the class-imbalance problem better; α is set to 0.9 and β to 0.1.
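The combined loss with α = 0.9 and β = 0.1 can be sketched as follows (the clipping epsilon and the mean-reduced BCE are implementation assumptions):

```python
import numpy as np

def combined_loss(y_true, y_pred, alpha=0.9, beta=0.1, eps=1e-7):
    """Loss = alpha * Dice loss + beta * binary cross-entropy,
    with the weights alpha = 0.9 and beta = 0.1 from the patent."""
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    dice = 1 - (2 * (y_true * y_pred).sum() + eps) / (y_true.sum() + y_pred.sum() + eps)
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return alpha * dice + beta * bce

y = np.array([1.0, 1.0, 0.0, 0.0])
perfect = combined_loss(y, np.array([1.0, 1.0, 0.0, 0.0]))  # near zero
bad = combined_loss(y, np.array([0.0, 0.0, 1.0, 1.0]))      # large
```

The Dice term directly rewards overlap with the (typically small) lesion region, which is why it receives the larger weight when foreground and background are imbalanced.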
Step 4, verifying the trained network model;
the verification set is fed into the trained HySwinUNet network model, whose output segments the lesion region in each lung CT image; the model is evaluated by comparing expert-segmented CT images with the images segmented by the trained network;
four widely adopted evaluation criteria are used to measure the performance of the HySwinUNet model; the evaluation indices are as follows:
dice similarity coefficient (Dice similarity coefficient): DSC is used to measure the similarity between predicted pulmonary infections and facts, where V Seg Represents the region divided by the model algorithm, V GT Representing a real segmentation area; TP, TN, FP, FN each represents a true positive, a true negative, a false positive, and a false negative;
sensitivity (Sensitivity): SEN represents the percentage of correctly segmented lung infections;
specificity (Specificity): SPE represents the percentage of non-infected areas that are correctly segmented;
positive predictive value (Precision): PRE = TP / (TP + FP), the accuracy of the segmentation of the lung infection area.
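The four evaluation indices can be computed from the TP/TN/FP/FN counts in a few lines; a NumPy sketch over binary masks (the function name is illustrative):

```python
import numpy as np

def segmentation_metrics(gt, pred):
    """Compute DSC, SEN, SPE and PRE from binary ground-truth and
    predicted segmentation masks (any shape)."""
    gt = np.asarray(gt).astype(bool)
    pred = np.asarray(pred).astype(bool)
    tp = np.sum(pred & gt)     # infected, predicted infected
    tn = np.sum(~pred & ~gt)   # healthy, predicted healthy
    fp = np.sum(pred & ~gt)    # healthy, predicted infected
    fn = np.sum(~pred & gt)    # infected, predicted healthy
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "PRE": tp / (tp + fp),
    }
```

Note that DSC = 2·TP / (2·TP + FP + FN) is the count form of 2|V_Seg ∩ V_GT| / (|V_Seg| + |V_GT|).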
after verification, any lung CT image is input into a segmentation model HySwinUNet, and a lung CT image of the segmented focus is output.
The drawings in the disclosed embodiments of the invention relate only to the structures involved in those embodiments; the above description is only a preferred embodiment of the invention and is applicable to various fields of adaptation, and the invention is therefore not limited to the specific details and illustrations shown and described herein, provided there is no departure from the general concepts defined by the claims and their equivalents.
Claims (5)
1. A lung CT image segmentation method based on a hybrid Swin Transformer U-Net, characterized in that the method specifically comprises the following steps:
step one, data preprocessing and data enhancement:
collecting a large number of public lung infection CT images, performing data enhancement, expanding the number of samples, normalizing the images to serve as a training set of a model, and training the model;
step two, constructing a segmentation model HySwinUNet:
constructing a segmentation model HySwinUNet based on a U-Net encoder-decoder structure, wherein the segmentation model HySwinUNet comprises an encoder, an adaptive attention module, a decoder and jump connection;
in the encoder, the input image is divided into small 4×4 patches by patch partition; after linear embedding, the vector dimension becomes a preset value C; the feature map, of dimension C and resolution H/4 × W/4, is fed into two successive Swin Transformer blocks for representation learning, during which the feature dimension and resolution remain unchanged; the Swin Transformer module is responsible for feature representation learning, and after learning is completed a patch-merging layer performs downsampling and dimension expansion, reducing the spatial size by 1/2 and doubling the feature dimension, thereby forming a hierarchical design; this process is repeated three times in the encoder, with a pre-activation residual module applied in the propagation of each layer;
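A NumPy sketch of how patch partition and patch merging change tensor shapes in the encoder; the random projection stands in for the learned linear layer that maps 4C to 2C (in the real model a learned linear embedding would also map the 48-dimensional raw patches to C):

```python
import numpy as np

def patch_partition(img, patch=4):
    """Split an H×W×3 image into non-overlapping patch×patch tokens."""
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H // patch, W // patch,
                                           patch * patch * C)
    return x

def patch_merging(x):
    """Concatenate 2×2 neighbouring tokens (halving resolution, giving 4C
    channels), then project 4C -> 2C; a random matrix stands in for the
    learned projection weights."""
    H, W, C = x.shape
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1)                                   # (H/2, W/2, 4C)
    W_proj = np.random.default_rng(0).standard_normal((4 * C, 2 * C))
    return merged @ W_proj                         # (H/2, W/2, 2C)

img = np.zeros((224, 224, 3))
tokens = patch_partition(img)   # (56, 56, 48): spatial size /4
merged = patch_merging(tokens)  # (28, 28, 96): spatial size /2, channels ×2
print(tokens.shape, merged.shape)
```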
during encoding, the adaptive attention module is adopted to locate the feature information of the region of interest and suppress uncorrelated region features, so that feature information is effectively extracted and the lesion region is segmented more accurately; this raises the weight of the target-region features and improves the network segmentation precision;
constructing a symmetric decoder based on Swin Transformer blocks; the feature maps of adjacent stages are reshaped into higher-resolution feature maps by upsampling, with the feature dimension correspondingly reduced to half of the original; the extracted context features are fused with the multi-scale features of the encoder through skip connections to compensate for the spatial information lost in downsampling and recover valuable spatial information;
step three, setting a training strategy and a loss function;
dividing the preprocessed data set into a training set, a test set and a verification set; adopting random initialization and the Adam optimization algorithm; setting the batch size, the number of epochs and a suitable learning rate, and adopting a regularization strategy to prevent overfitting; updating the weights and biases in the segmentation model HySwinUNet with the back-propagation algorithm; updating parameters with the loss function during the training iterations;
step four, verifying the trained network model: inputting the verification set into the trained segmentation model HySwinUNet, whose output segments the lesion region in the lung CT image to obtain a segmented image; the model is evaluated by comparing the CT images segmented by experts with the images segmented by the trained network model;
after verification, any lung CT image is input into a segmentation model HySwinUNet, and a lung CT image of the segmented focus is output.
2. The hybrid Swin Transformer U-Net based lung CT image segmentation method of claim 1, wherein the data enhancement specifically comprises: subjecting the images to random cropping, flipping, rotation, scaling and translation.
3. The hybrid Swin Transformer U-Net based lung CT image segmentation method of claim 1, wherein a pre-activation residual module is adopted at the entry of the encoding stage and the exit of the decoding stage; the pre-activation residual module initializes the Transformer as a convolutional network, using convolution layers to extract local intensity features and avoid large-scale pre-training of the Transformer, which makes the Swin Transformer easier to train; since misclassified regions are typically located on the boundary of the region of interest, high-resolution context information plays a critical role in segmentation; the module performs batch normalization (BN), ReLU activation and convolution (Conv) twice in succession, then element-wise addition with the original input; because the Conv operation follows the ReLU layer, the dimension and resolution of the feature map are unchanged, and information flows more smoothly in the forward and backward propagation of the network.
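A minimal single-channel NumPy sketch of the pre-activation residual module of claim 3 (BN → ReLU → Conv, twice, plus the identity); the hand-rolled 3×3 operator and unscaled normalization are simplifications for illustration:

```python
import numpy as np

def bn(x, eps=1e-5):
    # Simplified batch norm: normalize, no learned scale/shift
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relu(x):
    return np.maximum(x, 0)

def conv3x3(x, k):
    # 3×3 'same' cross-correlation (the CNN convention for convolution)
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

def pre_activation_residual(x, k1, k2):
    """BN -> ReLU -> Conv applied twice, then element-wise addition with
    the original input; dimension and resolution are unchanged."""
    y = conv3x3(relu(bn(x)), k1)
    y = conv3x3(relu(bn(y)), k2)
    return x + y

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
out = pre_activation_residual(x,
                              rng.standard_normal((3, 3)) * 0.1,
                              rng.standard_normal((3, 3)) * 0.1)
print(out.shape)  # (8, 8): same as the input
```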
4. The hybrid Swin Transformer U-Net based lung CT image segmentation method of claim 1, wherein the Swin Transformer module is constructed on shifted windows and comprises two successive Swin Transformer blocks, each containing a multi-head self-attention module MSA and a multi-layer perceptron MLP, with a layer normalization layer LN applied before each MSA module and each MLP module; on the basis of the multi-head self-attention module, the Swin Transformer provides a window-based multi-head self-attention module W-MSA and a shifted-window-based multi-head self-attention module SW-MSA, computed as follows:
ẑ^l = W-MSA(LN(z^(l-1))) + z^(l-1); z^l = MLP(LN(ẑ^l)) + ẑ^l; ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l; z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1); wherein ẑ^l and z^l denote the outputs of the W-MSA module and the MLP of the l-th layer, respectively; ẑ^(l+1) and z^(l+1) denote the outputs of the SW-MSA module and the MLP of the (l+1)-th layer, respectively; W-MSA and SW-MSA compute self-attention as Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V, wherein Q, K, V ∈ R^(M²×d) denote the query, key and value matrices; M² and d denote the number of patches in a window and the dimension of the query or key, respectively; the values of B are taken from the bias matrix B̂ ∈ R^((2M-1)×(2M-1)).
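A NumPy sketch of the windowed self-attention formula of claim 4 for a single window and a single head; the random Q, K, V and bias B are stand-ins for learned values (in Swin, B is gathered from the (2M-1)×(2M-1) bias table by relative position, which is omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(Q, K, V, B):
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V for one window.
    Q, K, V: (M*M, d); B: (M*M, M*M) relative position bias."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B
    return softmax(scores) @ V

M, d = 4, 8  # 4×4 window (M² = 16 patches), query/key dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((M * M, d)) for _ in range(3))
B = rng.standard_normal((M * M, M * M)) * 0.01  # stand-in bias values
out = window_attention(Q, K, V, B)
print(out.shape)  # (16, 8): one output token per patch in the window
```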
5. The hybrid Swin Transformer U-Net based lung CT image segmentation method of claim 1, wherein the input channels of the adaptive attention module are combined into a dual-attention input through 3×3 convolutions with dilation rates of 1 and 3, and different global information is captured by two different attention mechanisms; the feature map of size C×H×W is reduced by global average pooling over the pixel-wise correlations to a matrix of size C×1, which after the concatenation operation is normalized with a Sigmoid function; further, more non-linear features are generated with fully connected layers; finally, a Softmax operation applied across the channels allows the cross-channel attention to adaptively select receptive fields of different sizes.
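A hedged NumPy sketch of the channel-selection idea in claim 5: two branch feature maps (as if produced by dilated convolutions with rates 1 and 3) are squeezed to C×1 descriptors by global average pooling, gated with Sigmoid, and a Softmax across the branches selects the receptive field per channel; the dilated convolutions and fully connected layers of the full module are omitted, and the function name is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_channel_attention(feat_a, feat_b):
    """feat_a, feat_b: (C, H, W) branch features; returns their fusion
    weighted by per-channel, per-branch attention."""
    za = sigmoid(feat_a.mean(axis=(1, 2)))     # (C,) global average pooling
    zb = sigmoid(feat_b.mean(axis=(1, 2)))
    w = softmax(np.stack([za, zb]), axis=0)    # Softmax across the branches
    return w[0][:, None, None] * feat_a + w[1][:, None, None] * feat_b

rng = np.random.default_rng(0)
feat_a = rng.standard_normal((4, 8, 8))  # C=4, H=W=8
feat_b = rng.standard_normal((4, 8, 8))
fused = adaptive_channel_attention(feat_a, feat_b)
print(fused.shape)  # (4, 8, 8)
```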
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211412454.0A CN117274147A (en) | 2022-11-11 | 2022-11-11 | Lung CT image segmentation method based on mixed Swin Transformer U-Net |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117274147A true CN117274147A (en) | 2023-12-22 |
Family
ID=89205088
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211412454.0A Pending CN117274147A (en) | 2022-11-11 | 2022-11-11 | Lung CT image segmentation method based on mixed Swin Transformer U-Net |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117274147A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117876370A (en) * | 2024-03-11 | 2024-04-12 | 南京信息工程大学 | CT image kidney tumor segmentation system based on three-dimensional axial transducer model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||