CN113888466A - Pulmonary nodule image detection method and system based on CT image

Info

Publication number: CN113888466A
Application number: CN202111030746.3A
Authority: CN
Inventors: 李波; 徐麒皓
Original assignee: Wuhan University of Science and Engineering WUSE
Current assignee: Wuhan University of Science and Engineering WUSE; Wuhan University of Science and Technology WHUST
Priority date: 2021-09-03
Filing date: 2021-09-03
Publication date: 2022-01-04

Abstract

The invention discloses a lung nodule image detection method and system based on CT images. The detection method comprises the following steps: S1, image serialization: tokenization is performed by reshaping slices of the input lung CT image into a sequence of patches; S2, patch embedding: a trainable linear mapping maps the vectorized patch sequence to a latent two-dimensional embedding space; S3, building a hybrid CNN and Transformer encoder: the tokenized image patches from the CNN feature map are encoded by the Transformer as an input sequence from which the global context is extracted; S4, cascaded decoder: the encoded features obtained in step S3 are first upsampled by the decoder, the upsampled features are then combined with the high-resolution CNN feature maps to achieve accurate localization, and finally U-Net recovers local spatial information to refine the detail of the detection. The method can effectively improve the accuracy of pulmonary nodule detection.

Description

Pulmonary nodule image detection method and system based on CT image

Technical Field

The invention relates to the technical field of image processing, in particular to a lung nodule image detection method and system based on a CT image.

Background

Lung cancer is the leading cause of cancer death worldwide. Lung nodules are an early manifestation of lung cancer and appear on CT images as roughly circular lung opacities no more than 3 cm in diameter; accurately detecting their contours helps doctors distinguish benign from malignant nodules. Because lung nodules are small and resemble tissues such as blood vessels in the lung parenchyma in morphology and brightness, they are difficult to tell apart by visual inspection alone and can seriously interfere with a doctor's judgment. To reduce doctors' workload and improve the efficiency of nodule diagnosis, computer-aided diagnosis techniques have been introduced into clinical work.

Deep learning currently achieves excellent results in computer vision. The U-Net architecture has become the de facto standard for many medical image segmentation tasks and has been very successful. However, because convolution operations are inherently local, U-Net is typically limited in explicitly modeling long-range dependencies. Transformers, designed for sequence-to-sequence prediction, have emerged as an alternative architecture with an innate global self-attention mechanism, but they may localize poorly because their low-level features lack detail.

Disclosure of Invention

Existing methods that encode tokenized image patches with a Transformer alone and then directly upsample the hidden feature representation into a dense full-resolution output cannot produce satisfactory results, particularly because nodules differ greatly in texture, shape and size between patients. To address this problem, the invention provides a lung nodule image detection method and system based on CT images. The detection method uses a TransformerUnet combined framework and, building on existing research, proposes a self-attention mechanism based on CNN features. Unlike existing CNN-based methods, TransformerUnet establishes the self-attention mechanism from the perspective of sequence-to-sequence prediction. To compensate for the loss of feature resolution caused by the Transformer, the network adopts a hybrid CNN and Transformer structure that uses both the detailed high-resolution spatial information from the CNN features and the global context encoded by the Transformer. Inspired by the U-shaped architecture, the self-attention features encoded by the Transformer are then upsampled and combined with the different high-resolution CNN features passed over skip connections from the encoding path, to achieve accurate localization.

The invention relates to a pulmonary nodule image detection method and system based on CT images that adopts a TransformerUnet combined framework. The framework combines a neural network architecture based on attention encoding in deep learning (Transformer) with a biomedical semantic segmentation network architecture based on fully convolutional networks (U-Net). The Transformer, designed for sequence-to-sequence prediction, serves as an alternative architecture with an innate global self-attention mechanism, and adding the medical image segmentation model addresses the low localization accuracy caused by insufficient low-level features. Unlike existing manually designed pulmonary nodule detection models, the detection framework provided by the invention consists of two parts: a Transformer part and a U-Net part.

Interpretation of terms:

1. Transformer: a neural network architecture based on attention encoding.

2. CNN: Convolutional Neural Network.

3. U-Net: a biomedical semantic segmentation network architecture; it is a fully convolutional neural network.

4. Patch: the feature detector in a convolutional neural network divides the input image into a number of small blocks, each of which is called a patch.

5. CUP: Cascaded Upsampler, a cascaded decoder that upsamples larger images with a less computationally intensive decoder to increase decoding speed.

6. MSA: Multi-Head Self-Attention, which computes multiple attention heads in parallel so that the input sequence is interpreted from different perspectives.

7. MLP: Multi-Layer Perceptron, also called an artificial neural network; besides the input and output layers, it has multiple hidden layers in between.

The technical scheme adopted by the invention for overcoming the technical problems is as follows:

The invention discloses a lung nodule image detection method based on CT images, which detects lung nodule images using a TransformerUnet combined framework comprising a Transformer part and a U-Net part. The detection method comprises the following steps:

S1, image serialization: tokenization is performed by reshaping slices of the input lung CT image into a sequence of patches;

S2, patch embedding: a trainable linear mapping maps the vectorized patch sequence to a latent two-dimensional embedding space;

S3, building a hybrid CNN and Transformer encoder: the tokenized image patches from the CNN feature map are encoded by the Transformer as an input sequence from which the global context is extracted;

S4, cascaded decoder: the encoded features obtained in step S3 are first upsampled by the decoder, the upsampled features are then combined with the high-resolution CNN feature maps to achieve accurate localization, and finally U-Net recovers local spatial information to refine the detail of the detection.

Further, in step S1, let the lung CT image be $x \in \mathbb{R}^{H \times W \times C}$, where H × W is the spatial resolution and C is the number of channels.

Further, step S1 specifically includes:

tokenization is performed by reshaping the input lung CT image x into a sequence of flattened patches $\{x_p^i \in \mathbb{R}^{p^2 \cdot C} \mid i = 1, \dots, N\}$,

where p is the patch size, so each patch is p × p, and the number of patches per image, $N = \frac{HW}{p^2}$, is the input sequence length.
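As a minimal sketch of the serialization step, the following Python splits a toy image into non-overlapping p × p patches in row-major order and flattens each one. The function name `patchify`, the nested-list image layout and the toy 4 × 4 input are illustrative assumptions rather than part of the disclosed method; the patch count equals HW/p² when H and W are divisible by p.

```python
def patchify(image, p):
    """Split an H x W x C image (nested lists) into flattened p x p patches."""
    H, W = len(image), len(image[0])
    if H % p or W % p:
        raise ValueError("H and W must be divisible by the patch size p")
    patches = []
    for top in range(0, H, p):            # patch rows, top to bottom
        for left in range(0, W, p):       # patch columns, left to right
            flat = []
            for r in range(top, top + p):
                for c in range(left, left + p):
                    flat.extend(image[r][c])   # append the C channel values
            patches.append(flat)
    return patches


# Toy 4 x 4 single-channel "slice" whose pixel value encodes its position.
H, W, C, p = 4, 4, 1, 2
image = [[[r * W + c] for c in range(W)] for r in range(H)]
patches = patchify(image, p)

# Sequence length N = H*W / p**2, each patch vector has p*p*C entries.
print(len(patches), len(patches[0]))
```

Row-major ordering is one possible choice; any fixed ordering works as long as the position embedding and the later reshape back to a grid use the same one.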

Further, step S2 specifically includes:

S21, to encode the spatial information of the patch sequence, a learned position embedding is added to the patch embeddings to retain positional information, as shown in equation (1):

$$z_0 = [x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{pos} \qquad (1)$$

where $E \in \mathbb{R}^{(p^2 \cdot C) \times D}$ is the patch embedding projection, $E_{pos} \in \mathbb{R}^{N \times D}$ is the position embedding, and D is the dimension of the patch embedding;

S22, to recover the spatial order of the embedded patches, the encoded feature is first reshaped from $\frac{HW}{p^2}$ to $\frac{H}{p} \times \frac{W}{p}$. A 1 × 1 convolution reduces the channel size of the features to the number of feature classes, and the feature map is then directly upsampled to the full resolution H × W to predict the final segmentation result.
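The embedding of step S2 can be sketched as a matrix product plus one learned offset per position, followed by a reshape of the token sequence back into a spatial grid. In this hedged example the names `embed_patches` and `to_grid`, the random initial weights and the toy sizes are assumptions made for illustration; in a trained network the projection E and the position embedding E_pos would be learned parameters.

```python
import random


def embed_patches(patches, E, E_pos):
    """Return z0[i] = patches[i] @ E + E_pos[i] for every patch i."""
    D = len(E[0])
    z0 = []
    for i, x in enumerate(patches):
        row = [sum(x[k] * E[k][d] for k in range(len(x))) + E_pos[i][d]
               for d in range(D)]
        z0.append(row)
    return z0


def to_grid(tokens, H, W, p):
    """Reshape a length-(H*W/p**2) token sequence into an (H/p) x (W/p) grid."""
    rows, cols = H // p, W // p
    return [tokens[r * cols:(r + 1) * cols] for r in range(rows)]


random.seed(0)
N, patch_dim, D = 4, 4, 3      # toy sizes: 4 patches of length p*p*C = 4
patches = [[float(i + k) for k in range(patch_dim)] for i in range(N)]
E = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(patch_dim)]
E_pos = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(N)]

z0 = embed_patches(patches, E, E_pos)   # N tokens of dimension D
grid = to_grid(z0, H=4, W=4, p=2)       # 2 x 2 grid of D-dimensional tokens
print(len(z0), len(z0[0]), len(grid), len(grid[0]))
```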

Further, step S3 specifically includes:

the CNN and Transformer hybrid encoder consists of L layers of multi-head self-attention and multi-layer perceptron blocks, as shown in equations (2) and (3), so the output of the l-th layer can be written as follows:

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \qquad (2)$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l \qquad (3)$$

where MSA denotes multi-head self-attention, MLP denotes the multi-layer perceptron, LN(·) denotes the layer normalization operator, $z'_l$ denotes the multi-head self-attention output of the l-th layer, and $z_l$ denotes the encoded image representation of the l-th layer.
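The residual structure of one encoder layer can be illustrated as follows. This is a sketch under stated assumptions: it takes the layer to be the usual pre-normalization form z' = MSA(LN(z)) + z followed by z_out = MLP(LN(z')) + z', uses a layer normalization without learned scale and shift, and substitutes simple stand-ins (`uniform_attention`, `relu_mlp`) for the real MSA and MLP sublayers, which the text does not specify in detail.

```python
import math


def layer_norm(z, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in z:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def add(a, b):
    """Element-wise sum of two token matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def transformer_layer(z_prev, msa, mlp):
    z_mid = add(msa(layer_norm(z_prev)), z_prev)   # attention sublayer + residual
    return add(mlp(layer_norm(z_mid)), z_mid)      # MLP sublayer + residual


def uniform_attention(z):
    """Stand-in for MSA: every token attends equally to all tokens."""
    n = len(z)
    mean = [sum(col) / n for col in zip(*z)]
    return [list(mean) for _ in z]


def relu_mlp(z):
    """Stand-in for the MLP: a bare ReLU with no weights."""
    return [[max(0.0, v) for v in row] for row in z]


z = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]   # 2 tokens, D = 3
for _ in range(2):                        # two stacked layers
    z = transformer_layer(z, uniform_attention, relu_mlp)
print(len(z), len(z[0]))                  # the token shape is preserved
```

Because both sublayers are wrapped in residual connections, a layer whose sublayers output zeros returns its input unchanged, which is a convenient sanity check.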

Further, the method also compensates for the information loss of the CNN and Transformer hybrid encoder, specifically:

similar to U-Net, skip connections are used to fuse the multi-scale features from the hybrid encoder with the upsampled features. A CNN serves as the feature extractor and generates the feature map used as input, instead of 1 × 1 patches extracted from the original image, thereby preserving more deep and shallow features to compensate for the information loss.

Further, step S4 specifically includes:

multiple upsampling steps decode the hidden features to output the final segmentation mask, specifically:

the hidden feature sequence $z_L \in \mathbb{R}^{\frac{HW}{p^2} \times D}$ is first reshaped into $\frac{H}{p} \times \frac{W}{p} \times D$;

a cascaded decoder is then formed by cascading multiple upsampling blocks to go from the resolution $\frac{H}{p} \times \frac{W}{p}$ to the full resolution H × W, where each upsampling block consists, in order, of a 2× upsampling operator, a 3 × 3 convolutional layer and a ReLU layer;

finally, the cascaded decoder and the hybrid encoder together form a U-shaped structure, in which skip connections fuse features with the upsampled feature maps at different resolution levels.
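The resolution schedule of the cascaded decoder can be checked with a short calculation. The sketch below assumes each block doubles the spatial size, so that log2(p) blocks are needed to go from H/p × W/p back to H × W; it shows only the upsampling operator, uses nearest-neighbour interpolation as an assumption (the text does not name the interpolation), and omits the 3 × 3 convolution and ReLU. The function names are hypothetical.

```python
def num_upsampling_blocks(p):
    """Number of 2x blocks needed to undo a downscaling factor of p."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two for 2x upsampling blocks")
    return p.bit_length() - 1


def decoder_resolutions(H, W, p):
    """Spatial size after the reshape and after each cascaded 2x block."""
    res = [(H // p, W // p)]
    for _ in range(num_upsampling_blocks(p)):
        h, w = res[-1]
        res.append((2 * h, 2 * w))
    return res


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a single-channel 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]   # repeat each column twice
        out.append(wide)
        out.append(list(wide))                      # repeat each row twice
    return out


# With a 224 x 224 input and patch size 16 the schedule needs four blocks.
schedule = decoder_resolutions(224, 224, 16)
print(schedule)
print(upsample2x([[1, 2], [3, 4]]))
```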

The invention also discloses a lung nodule image detection system based on CT images, which adopts the TransformerUnet combined framework and specifically comprises:

an image serialization module for performing tokenization by reshaping slices of the input lung CT image into a sequence of patches;

a patch embedding module for mapping the vectorized patch sequence to a latent two-dimensional embedding space using a trainable linear mapping;

a CNN and Transformer hybrid encoder module for encoding, through the Transformer, the tokenized image patches from the CNN feature map as an input sequence from which the global context is extracted;

and a cascaded decoder module for first upsampling, through the decoder, the encoded features obtained by the CNN and Transformer hybrid encoder module, then combining the upsampled features with the high-resolution CNN feature maps to achieve accurate localization, and finally using U-Net to recover local spatial information to refine the detail of the detection.

The invention has the beneficial effects that:

1. Existing pulmonary nodule detection algorithms spend a great deal of time on feature extraction. Traditional feature extraction algorithms require extensive manual labeling, and the features require substantial prior knowledge. Detecting and classifying pulmonary nodules with deep learning can effectively avoid the subjective uncertainty of a doctor's judgment, relieve doctors' workload and improve the accuracy of pulmonary nodule detection, because a deep learning model automatically learns and extracts the features suited to the current task.

2. Using a Transformer directly for lung nodule detection performs worse than U-Net or Attention. The Transformer extracts high-level semantic features well, which benefits classification tasks, but it lacks the low-level features needed to segment a lung nodule image. Therefore, the TransformerUnet network, formed by combining the Transformer with the U-Net structure through skip connections, has strong learning capacity for both high-level semantic features and low-level detail features, can effectively improve the accuracy of pulmonary nodule detection, and assists doctors in their judgment.

Drawings

Fig. 1 shows serial slices of a lung CT image according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of the principle of the TransformerUnet combined architecture according to an embodiment of the present invention.

Detailed Description

In order to facilitate a better understanding of the invention for those skilled in the art, the invention will be described in further detail with reference to the accompanying drawings and specific examples, which are given by way of illustration only and do not limit the scope of the invention.

Example 1

As shown in Fig. 1 and Fig. 2, this embodiment discloses a lung nodule image detection method based on CT images, which detects lung nodule images using a TransformerUnet combined architecture comprising a Transformer part and a U-Net part.

The lung nodule image detection method based on CT images comprises the following steps:

Step S1, image preprocessing: image serialization is carried out, in which tokenization is performed by reshaping slices of the input lung CT image into a sequence of patches.

Let the lung CT image be $x \in \mathbb{R}^{H \times W \times C}$, where H × W is the spatial resolution and C is the number of channels. The goal is to predict a pixel-wise label map of the corresponding size H × W. Prior methods train a CNN (e.g., U-Net) to encode the image into a high-level feature representation and then decode it back to full spatial resolution. Unlike these methods, the present method introduces the self-attention mechanism into the encoder design by using a Transformer: the image is first encoded into a high-level feature representation and then decoded to the original resolution.

Pixel size and slice thickness differ between scans, which hinders model training; image serialization effectively avoids this problem. The image serialization described in this embodiment specifically includes:

tokenization is performed by reshaping the input lung CT image x into a sequence of flattened patches $\{x_p^i \in \mathbb{R}^{p^2 \cdot C} \mid i = 1, \dots, N\}$,

where p is the patch size in pixels, so each patch is p × p, and the number of patches per image, $N = \frac{HW}{p^2}$, is the input sequence length.

Step S2, patch embedding: a trainable linear mapping maps the vectorized patch sequence to a latent two-dimensional embedding space.

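In this embodiment, step S2 specifically includes:

S21, to encode the spatial information of the patch sequence, a learned position embedding is added to the patch embeddings to retain positional information:

$$z_0 = [x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{pos} \qquad (1)$$

where $E \in \mathbb{R}^{(p^2 \cdot C) \times D}$ is the patch embedding projection and $E_{pos} \in \mathbb{R}^{N \times D}$ is the position embedding;

S22, to recover the spatial order of the embedded patches, the encoded feature is first reshaped from $\frac{HW}{p^2}$ to $\frac{H}{p} \times \frac{W}{p}$.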

Step S3, building a hybrid CNN and Transformer encoder: the tokenized image patches from the CNN feature map are encoded by the Transformer as an input sequence from which the global context is extracted.

In the hybrid CNN and Transformer encoder, different candidate sets of suspected lung nodules are obtained after patch embedding. Because of the inherent limitations of convolution operations (which remain weak at modeling long-range relationships), CNN-based architectures often perform poorly, especially for patients whose structures differ greatly in texture, shape and size. To overcome this limitation, a self-attention mechanism is established on top of the CNN features, which encodes the tokenized image patches from the CNN feature map into an input sequence from which the global context is extracted. In addition, unlike previous CNN-based methods, the Transformer is not only powerful at global feature extraction but also transfers very well to downstream tasks after large-scale pre-training; as an alternative architecture, it dispenses with convolution operations entirely and relies only on the attention mechanism.

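Specifically, step S3 includes:

the output of the l-th layer of the hybrid encoder is written as

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \qquad (2)$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l \qquad (3)$$

where $z'_l$ denotes the multi-head self-attention output of the l-th layer and $z_l$ denotes the encoded image representation of the l-th layer.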

Because the CNN and Transformer hybrid encoder loses information, this embodiment also compensates for that loss: a hybrid CNN-Transformer architecture is used as the encoder, and cascaded upsampling is performed to achieve accurate localization. Specifically:

Here shallow and deep features are concatenated to reduce the loss of spatial information caused by down-sampling. A linear layer then follows, so that the dimension of the concatenated features stays the same as that of the upsampled features.
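The fusion described here can be sketched as a channel-wise concatenation followed by a per-pixel linear layer that brings the channel count back to that of the upsampled features. The helper names `concat_channels` and `linear`, the H × W × C nested-list layout and the hand-picked weights are assumptions for illustration only; a real network would learn the weights.

```python
def concat_channels(upsampled, skip):
    """Concatenate two H x W x C feature maps along the channel axis."""
    if len(upsampled) != len(skip) or len(upsampled[0]) != len(skip[0]):
        raise ValueError("feature maps must share the same spatial size")
    return [[u + s for u, s in zip(u_row, s_row)]
            for u_row, s_row in zip(upsampled, skip)]


def linear(fmap, weight):
    """Per-pixel linear layer mapping C_in channels to C_out channels."""
    c_out = len(weight[0])
    return [[[sum(px[k] * weight[k][j] for k in range(len(px)))
              for j in range(c_out)]
             for px in row]
            for row in fmap]


# One pixel with 2 deep (upsampled) channels and 2 shallow (skip) channels.
deep = [[[1.0, 2.0]]]
shallow = [[[3.0, 4.0]]]
fused = concat_channels(deep, shallow)        # 4 channels per pixel

# Hypothetical 4 x 2 weight that sums matching deep and shallow channels.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
out = linear(fused, weight)                   # back to 2 channels per pixel
print(fused, out)
```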

Step S4, cascaded decoder: the encoded features obtained in step S3 are first upsampled by the decoder, the upsampled features are then combined with the high-resolution CNN feature maps to achieve accurate localization, and finally U-Net recovers local spatial information to refine the detail of the detection, which effectively reduces false positives in lung nodule detection and provides accurate images for the computer-aided diagnosis system.

In this embodiment, step S4 specifically includes:

the hidden feature sequence $z_L \in \mathbb{R}^{\frac{HW}{p^2} \times D}$ is reshaped into $\frac{H}{p} \times \frac{W}{p} \times D$.

The TransformerUnet combined architecture provided by the invention is shown in Fig. 2; it establishes the self-attention mechanism from the perspective of sequence-to-sequence prediction. To compensate for the loss of feature resolution caused by the Transformer, TransformerUnet employs a hybrid CNN-Transformer structure that exploits both the high-resolution spatial information from the CNN features and the global context encoded by the Transformer. Inspired by the U-Net design, the self-attention features encoded by the Transformer are then upsampled and combined with the different high-resolution CNN features passed over skip connections from the encoding path, to achieve accurate localization. This design lets the overall network retain the advantages of the Transformer while also benefiting lung nodule image detection. Fig. 1 shows slices of an acquired lung CT image.

The network is built with a deep learning framework in a Python environment under the Ubuntu 16 operating system on an Nvidia RTX 2080 Ti GPU hardware platform, and is trained on the LUNA16 and LIDC datasets; extensive experiments demonstrate the feasibility of training and testing the TransformerUnet model.

Data augmentation, such as random rotation and flipping, was used in all experiments. For the Transformer encoder, ViT with 12 Transformer layers is employed. For the hybrid encoder design, ResNet-50 is combined with ViT. All Transformer architectures (i.e., ViT) and ResNet-50 are pre-trained on ImageNet. The input image resolution and patch size are set to 224 × 224 and 16, respectively, and four cascaded upsampling blocks are needed in the CUP to reach the original image resolution. The model was trained using an SGD optimizer with a learning rate of 0.01, momentum of 0.9 and weight decay of 1e-4. The default batch size is 24, and the default number of training iterations is 20k for the LUNA16 dataset and 14k for the LIDC dataset.
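For orientation, the reported hyperparameters can be collected in a small configuration and applied in a single SGD-with-momentum update. The update rule below is an assumption: it follows one common convention (the one used, for example, by PyTorch's SGD), in which L2 weight decay is added to the gradient before the momentum buffer is updated. The text does not state which framework or convention was used, and the dictionary keys and function name are hypothetical.

```python
# Hyperparameters as reported in the text; the key names are illustrative.
config = {
    "input_size": 224,
    "patch_size": 16,
    "vit_layers": 12,
    "batch_size": 24,
    "lr": 0.01,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "iterations": {"LUNA16": 20_000, "LIDC": 14_000},
}

# Token sequence length implied by the input size and patch size.
seq_len = (config["input_size"] // config["patch_size"]) ** 2


def sgd_momentum_step(w, grad, velocity, lr, momentum, weight_decay):
    """One SGD update with momentum and L2 weight decay (assumed convention)."""
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + weight_decay * wi      # weight decay folded into the gradient
        v = momentum * vi + g           # momentum buffer update
        new_v.append(v)
        new_w.append(wi - lr * v)       # parameter update
    return new_w, new_v


w, v = sgd_momentum_step(
    w=[1.0, -2.0], grad=[0.5, 0.1], velocity=[0.0, 0.0],
    lr=config["lr"], momentum=config["momentum"],
    weight_decay=config["weight_decay"],
)
print(seq_len, w, v)
```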

The invention is characterized in that, on the one hand, a CNN architecture (U-Net) provides a way to extract low-level feature cues, which supplement fine spatial detail well. On the other hand, within the U-Net framework, a Transformer network encodes the tokenized image patches from the convolutional neural network (CNN) feature map into an input sequence from which the global context is extracted. Finally, the decoder upsamples the encoded features and combines them with the high-resolution CNN feature maps to achieve accurate localization. Combined with U-Net, which recovers local spatial information, the Transformer can serve as a powerful encoder for lung nodule detection tasks.

Example 2

This embodiment discloses a system implementing the lung nodule image detection method based on CT images described in Example 1. The system adopts the TransformerUnet combined architecture and specifically includes:

The functions of the above modules correspond to those of embodiment 1, and are not described herein again.

The foregoing merely illustrates the principles and preferred embodiments of the invention. Those skilled in the art may make many variations and modifications in light of the foregoing description, all of which fall within the scope of the invention.

Claims

1. A lung nodule image detection method based on CT images, characterized in that lung nodule image detection is carried out using a TransformerUnet combined framework, and the detection method comprises the following steps:

2. The method according to claim 1, wherein in step S1 the lung CT image is $x \in \mathbb{R}^{H \times W \times C}$, where H × W is the spatial resolution and C is the number of channels.

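3. The method according to claim 2, wherein step S1 specifically includes:

tokenization is performed by reshaping the input lung CT image x into a sequence of flattened patches $\{x_p^i \in \mathbb{R}^{p^2 \cdot C} \mid i = 1, \dots, N\}$, where p is the patch size, so each patch is p × p, and the number of patches per image, $N = \frac{HW}{p^2}$, is the input sequence length.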

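4. The method according to claim 3, wherein step S2 specifically comprises:

S21, to encode the spatial information of the patch sequence, a learned position embedding is added to the patch embeddings to retain positional information:

$$z_0 = [x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{pos} \qquad (1)$$

where $E \in \mathbb{R}^{(p^2 \cdot C) \times D}$ is the patch embedding projection and $E_{pos} \in \mathbb{R}^{N \times D}$ is the position embedding;

S22, to recover the spatial order of the embedded patches, the encoded feature is first reshaped from $\frac{HW}{p^2}$ to $\frac{H}{p} \times \frac{W}{p}$.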

5. The method according to claim 1, wherein step S3 specifically comprises:

6. The method according to any one of claims 1-5, further comprising compensating for a loss of information of a hybrid encoder of CNN and Transformer, specifically comprising:

7. The method according to claim 5, wherein step S4 specifically comprises:

the hidden feature sequence $z_L \in \mathbb{R}^{\frac{HW}{p^2} \times D}$ is reshaped into $\frac{H}{p} \times \frac{W}{p} \times D$.

8. A pulmonary nodule image detection system based on CT images, characterized in that a TransformerUnet combined framework is adopted, and the pulmonary nodule image detection system specifically comprises: