CN115526829A - Honeycomb lung focus segmentation method and network based on ViT and context feature fusion - Google Patents

Honeycomb lung focus segmentation method and network based on ViT and context feature fusion

Info

Publication number
CN115526829A
Authority
CN
China
Prior art keywords
convolution
channel
layer
feature
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210907330.3A
Other languages
Chinese (zh)
Inventor
张玲
李钢
卫建建
贺艺斌
孙梦霞
孙源瑾
李智超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202210907330.3A
Publication of CN115526829A
Legal status: Pending

Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 - Computing arrangements based on biological models
                    • G06N 3/02 - Neural networks
                        • G06N 3/08 - Learning methods
            • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 - Image analysis
                    • G06T 7/0002 - Inspection of images, e.g. flaw detection
                        • G06T 7/0012 - Biomedical image inspection
                    • G06T 7/10 - Segmentation; Edge detection
                        • G06T 7/11 - Region-based segmentation
                • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 - Image acquisition modality
                        • G06T 2207/10072 - Tomographic images
                            • G06T 2207/10081 - Computed x-ray tomography [CT]
                    • G06T 2207/20 - Special algorithmic details
                        • G06T 2207/20081 - Training; Learning
                        • G06T 2207/20084 - Artificial neural networks [ANN]
                    • G06T 2207/30 - Subject of image; Context of image processing
                        • G06T 2207/30004 - Biomedical image processing
                            • G06T 2207/30061 - Lung
            • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 - Arrangements for image or video recognition or understanding
                    • G06V 10/40 - Extraction of image or video features
                        • G06V 10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
                    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
                            • G06V 10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
                            • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                                • G06V 10/806 - Fusion of extracted features
                        • G06V 10/82 - Arrangements using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a honeycomb lung lesion segmentation method and network based on ViT and context feature fusion, belonging to the technical field of image processing. The network achieves higher-precision segmentation of honeycomb lung lesion regions: a channel-mixed convolution block increases information interaction among different channels and fully extracts the feature information of the lesion region; a Transformer architecture serves as the feature connector between the encoder and decoder, strengthening the feature expression of global information and enlarging the receptive field of the network; and a context-aware fusion (CFB) module fuses multi-stage features, reducing the semantic gap between high-level and low-level features and improving the segmentation precision of lesion regions and target edges. The invention is applied to the segmentation of honeycomb lung lesions.

Description

Honeycomb lung focus segmentation method and network based on ViT and context feature fusion
Technical Field
The invention provides a honeycomb lung lesion segmentation method and network based on ViT and context feature fusion, and belongs to the technical field of image processing.
Background
Interstitial lung disease is a diffuse lung disease that is highly lethal, insidious and destructive; because parts of the lung interstitium develop honeycomb-like changes that appear as high-density patterns in CT images, it is also called "honeycomb lung". At present, clinical diagnosis of interstitial lung disease is still mostly performed by professional radiologists, who analyze a patient's lung CT images for the presence of honeycomb lung. With the rapid growth of clinical imaging data, China faces a serious shortage of radiologists relative to its huge population. During diagnosis, a specialist visually inspects CT images based on personal knowledge and experience, but the heavy mental workload and long working hours easily cause visual fatigue and errors, making diagnoses highly subjective and sometimes leading to misdiagnosis or missed diagnosis, which increases the difficulty of later treatment. Therefore, automatically segmenting honeycomb lung with an image segmentation method can assist doctors in accurately assessing the degree of a patient's condition, improve diagnostic accuracy and film-reading efficiency, and provide suitable guidance for clinical decision-making and prognostic treatment, which has very important clinical value.
In recent years, thanks to deep learning's strong feature expression capability and ability to model complex tasks, CNN-based methods have been widely used in medical image processing; in particular, U-shaped convolutional neural networks composed of skip connections and an encoder-decoder have achieved remarkable performance in medical image segmentation. Ronneberger et al., using the idea of "full convolution", first proposed an encoder-decoder network model for medical image segmentation, UNet, which uses skip connections to fuse the high-level features of the up-sampling stage with the low-level features of the down-sampling stage, and obtained better segmentation results on three medical data sets. To address the varying scales of medical images, Ibtehaz et al. proposed the MultiResUNet network for skin lesion segmentation, constructing a multi-residual convolution module and introducing residual paths into the UNet network with the idea of residual learning, improving the training effect. Sharp U-Net applies a sharpening convolution kernel to generate intermediate feature maps that replace the skip connections in the U-shaped network, alleviating the over-segmentation caused by semantic gaps, and shows better segmentation performance on data sets such as lung pneumonia and COVID-19 pneumonia. To obtain more accurate edge information, Alom et al. proposed the R2U-Net model based on recurrent neural networks, which accumulates features with recurrent residual convolution layers and performs better in retinal segmentation tasks without increasing the network parameters. The HDA-ResUNet network replaces the bottom convolution layer in UNet with dilated convolution layers containing a channel attention mechanism, fusing information from receptive fields of different sizes and addressing the loss of multi-scale information in the network. Building on UNet, UNet++ adopts improved dense skip connections to fuse feature information across depth levels, reducing the semantic gap between encoder and decoder feature maps. Chen et al. proposed the DeepLabv3+ model with an encoder-decoder structure, enlarging the image receptive field by introducing atrous convolution and spatial pyramid pooling to extract richer context information. To address the easy loss of spatial feature information during convolution, Gu et al. proposed the CE-Net model for medical image segmentation, using a context connector fused with dense convolution blocks to generate more semantic feature maps. Similarly, CA-Net, based on an encoder-decoder architecture, proposed joint spatial attention and scale attention modules that recalibrate channel feature responses with an attention mechanism to enhance the expression of relevant feature channels. The above CNN-based methods are all variants of the UNet network, and their excellent segmentation performance on numerous medical data sets fully demonstrates the applicability of UNet in medical segmentation; therefore, the present invention uses the UNet network as the basic model for segmenting honeycomb lung.
These works mainly address insufficient feature extraction and multi-scale information loss during network feature extraction. Although CNN methods based on the encoder-decoder structure have great advantages in extracting local image features, the stacked convolution operations in a fully convolutional network cause feature redundancy and degrade the segmentation effect. Moreover, most encoder-decoder networks only recalibrate the encoder feature maps and ignore the importance of the deep semantic information in decoder feature maps for extracting edge information, so detail information is lost. Meanwhile, because the receptive field of a single convolution kernel is limited, the network focuses only on certain sub-regions of the image and struggles to model the contextual relationships within it. This limitation of convolution poses a challenge for learning global information in an image, which is especially important for pixel-level tasks such as semantic segmentation.
Recently, the advent of the Transformer has broken the dominance of CNNs in computer vision tasks. ViT, one of the Transformer's many variants, performs well in medical image segmentation tasks. Chen et al. proposed the TransUNet network for multi-organ segmentation, which uses ViT instead of ordinary convolution blocks as the basic encoder module of the segmentation network and exploits the Transformer's excellent global-information modeling capability to locate lesion regions accurately. To segment target edges more precisely, Medical Transformer uses a gated axial Transformer module to obtain more accurate position information. Thus, a medical segmentation network based on the Transformer architecture can fully extract global features by establishing global relationships within the image and complete medical segmentation tasks with high quality.
Disclosure of Invention
In order to overcome the defects of the prior art, the technical problem to be solved by the invention is to provide an improved honeycomb lung lesion segmentation method based on ViT and context feature fusion.
In order to solve the above technical problem, the invention adopts the following technical scheme: a honeycomb lung lesion segmentation method based on ViT and context feature fusion, comprising the following steps:
S1: acquiring honeycomb lung CT image data, preprocessing the original images, and dividing the preprocessed data set into a training set and a test set;
S2: constructing a basic UNet network comprising a down-sampling encoder, an up-sampling decoder, skip connections and a bottleneck layer;
S3: improving the constructed basic UNet network by changing the convolution operations in its up-sampling and down-sampling layers, replacing the traditional convolution structure with a channel-mixed convolution block;
wherein the channel-mixed convolution block separates the feature map into two branches S1 and S2: branch S1 keeps the feature map and the number of channels unchanged, branch S2 is convolved, and the convolved feature map is then fused with the feature map of branch S1; finally, a channel mixing operation exchanges information among all channels of the feature map, the obtained feature map has the same size as the upper-layer feature map, and it again undergoes channel separation, branch convolution, channel concatenation and channel mixing;
S4: in the bottleneck layer of the network, modeling the global relationship among all pixels in the image with a ViT-based context connector, replacing the high-channel convolution module with ViT;
S5: improving the feature fusion scheme for the low-level and high-level features at the skip connections with a context-aware fusion module, and relearning the features;
S6: defining the hyper-parameters of the segmentation model and training with the data set processed in S1 to obtain the loss value of the loss function and the segmentation results;
S7: adjusting the network parameters according to the results, generating and saving a trained lesion segmentation model, inputting the test set data into the trained lesion segmentation model, segmenting the lesions in honeycomb lung CT images, and outputting the segmentation results.
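Steps S6 and S7 can be made concrete with a short training and testing loop. The sketch below is illustrative only and assumes PyTorch datasets that yield (image, mask) pairs with per-pixel class indices; the cross-entropy loss, Adam optimizer, batch size, learning rate and checkpoint name are assumptions, since the invention does not fix these hyper-parameters.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_and_test(model, train_set, test_set, epochs=100, lr=1e-4, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # S6: define hyper-parameters
    criterion = nn.CrossEntropyLoss()                        # assumed loss function
    loader = DataLoader(train_set, batch_size=8, shuffle=True)
    for _ in range(epochs):
        model.train()
        for image, mask in loader:                           # S1: preprocessed CT data
            image, mask = image.to(device), mask.to(device)
            loss = criterion(model(image), mask)             # S6: loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "lesion_seg_model.pth")   # S7: save the trained model
    model.eval()
    results = []
    with torch.no_grad():                                    # S7: segment the test set
        for image, _ in DataLoader(test_set, batch_size=1):
            results.append(model(image.to(device)).argmax(dim=1).cpu())  # segmentation result
    return results
```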
The ViT-based context connector adopts six Transformer encoding blocks to extract global information in the image, where ViT comprises four parts:
(1) Slice embedding: converting the original two-dimensional image into one-dimensional sequence data, setting the slice size according to the encoder's output image, converting the image into several slices, and computing the slice dimensions;
(2) Position encoding: marking each slice with its corresponding position information so that the dimensions of the image can be restored; the position information is kept consistent with the slice-embedding output, the correct position of each slice is stored, and its dimensions are computed;
(3) The multi-head attention mechanism in the Transformer encoding block: after the image slicing and position-encoding stages, the slice information of the image is input into the Transformer encoding block, which learns the relationships between slices and between the pixels within each slice and performs context modeling of the global information; the multi-head attention layer receives the slice and position-encoding information, uses multiple attention heads to learn more relevant information in different subspaces, initializes Q and K in each attention head with different weight matrices, applies a Dropout operation to the global features learned by the multi-head attention layer, and uses a normalization layer to adjust feature dimensions for the subsequent output and accelerate model convergence;
(4) The multi-layer perceptron layer in the Transformer encoding block: this part uses an MLP block to learn the non-linear relationships between features; residual connections and layer normalization are used at each sub-layer.
The context-aware fusion module receives low-level features from the encoder and high-level features from the decoder. First, global average pooling (GAP) generates feature maps with global spatial information, and a shared-weight multi-layer perceptron models the context information in the low-level and high-level feature maps to generate vectors h and l, which represent the weight vectors of the high-level and low-level feature maps respectively. Second, following the idea of residual learning, the weight vectors are multiplied with the two feature maps to generate redistributed feature maps, which are then concatenated along the channel dimension to produce a feature map containing global context information of both the local and high-level stages. Third, two 3x3 convolutions realize weighted feature fusion, and a residual connection receives information from the high-level features.
The encoder comprises four channel-mixed convolution blocks for extracting high-level and low-level feature information from the image. Each channel-mixed convolution block contains 2 convolution layers, a batch normalization layer and a rectified linear unit (ReLU). The feature map extracted by each channel-mixed convolution block follows two paths: the first connects to a max pooling layer, which down-samples the feature map and passes it to the next convolution block; the other enters the skip connection path.
The decoder comprises 4 channel-mixed convolution blocks identical to those in the encoder, each containing 2 layers of convolution, batch normalization and ReLU activation. Each channel-mixed convolution block up-samples the feature map with deconvolution, and the transposed-convolution operation doubles the feature map size.
The honeycomb lung lesion segmentation network based on ViT and context feature fusion comprises an encoder, a decoder, a ViT-based context connector at the network bottleneck layer, and four context-aware fusion modules at the skip connection stage. The encoder comprises four down-sampling modules and the decoder comprises four up-sampling modules, each containing channel-mixed convolution blocks; each channel-mixed convolution block contains 2 convolution layers, a batch normalization layer and a rectified linear unit (ReLU). The feature map extracted by each channel-mixed convolution block in a down-sampling module follows two paths: the first connects to a max pooling layer, which down-samples the feature map and passes it to the next convolution block; the other enters the skip connection path. Each channel-mixed convolution block in an up-sampling module up-samples the feature map with deconvolution, and the transposed-convolution operation doubles the feature map size;
at the bottleneck layer of the network, ViT slices the feature map and computes the global relationships within it;
and at the skip connection stage, the feature maps obtained from the four channel-mixed convolution blocks in the encoder and the up-sampled feature maps are respectively input into paths containing a context-aware fusion module for feature enhancement.
The channel-mixed convolution block comprises a channel separation module, a channel concatenation module and a channel mixing module. The channel separation module comprises two branches S1 and S2: branch S1 keeps the feature map and the number of channels unchanged, and branch S2 is convolved. The feature maps output by the two branches pass through the channel concatenation module and the channel mixing module, and each channel-mixed convolution block outputs its feature map after two rounds of channel separation, channel concatenation and channel mixing.
The ViT-based context connector comprises a slice embedding module, a position encoding module and six Transformer encoding blocks, each of which contains a multi-head attention module and a multi-layer perceptron module. The slice embedding module and the position encoding module are located at the beginning of the ViT module: the slice embedding module converts the original two-dimensional image into one-dimensional sequence data of several slices, and the position encoding module marks each slice with its corresponding position information. The multi-head attention module, located in the first half of the Transformer encoding block, receives the slice and position-encoding information and comprises several attention heads, a Dropout operation and a normalization layer. The multi-layer perceptron module, located in the second half of the Transformer encoding block, comprises a multi-layer perceptron layer and a normalization layer.
The context-aware fusion module comprises two parallel global average pooling layers, a multi-layer perceptron layer and a convolution block. The two parallel global average pooling layers respectively receive low-level features from the encoder and high-level features from the decoder to generate feature maps with global spatial information; the multi-layer perceptron uses shared weights to model the context information in the low-level and high-level feature maps and generate weight vectors; the weight vectors are multiplied with the two feature maps to generate redistributed feature maps, which are concatenated along the channel dimension to produce feature maps containing global context information of both the local and high-level stages; and weighted feature fusion is realized with two 3x3 convolutions in the convolution block.
Compared with the prior art, the invention has the following beneficial effects: based on deep learning, the invention provides a U-shaped image segmentation network (CSC-UTNet) that combines a channel-mixed convolution block, ViT and a context-aware fusion module for honeycomb lung segmentation, effectively addressing the loss of edge information at lesion regions and reducing over-segmentation. The model uses the channel-mixed convolution block to increase information interaction among different channels and fully extract the feature information of honeycomb lung lesion regions; ViT serves as the feature connector between the encoder and decoder, strengthening the feature expression of global information and enlarging the receptive field of the network; and the context-aware fusion module fuses multi-stage features, reducing the semantic gap between high-level and low-level features and improving the segmentation precision of lesion edges. Ablation experiments on the honeycomb lung data set show that CSC-UTNet achieves better results on evaluation metrics such as intersection-over-union, Dice coefficient, mIoU and mDice, with stronger generalization and higher segmentation precision.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a network architecture according to the present invention;
FIG. 3 is a schematic diagram of the channel-mixed convolution block structure at the up-sampling stage according to the present invention;
FIG. 4 is a schematic diagram of the ViT-based context connector of the present invention;
FIG. 5 is a diagram illustrating the calculation process of the multi-head attention mechanism in the Transformer encoding block according to the present invention;
FIG. 6 is a schematic structural diagram of a context-aware fusion module according to the present invention.
Detailed Description
As shown in fig. 1 to fig. 6, the main improvements of the honeycomb lung lesion segmentation method based on ViT and context feature fusion of the present invention are:
(1) A lightweight channel-mixed convolution block is proposed, reducing convolution computation on part of the feature channels and preventing excessive redundant features from degrading segmentation accuracy;
(2) ViT is introduced into the network bottleneck layer as the feature connector between encoder and decoder, strengthening the feature expression of global information and improving segmentation precision in the lesion region;
(3) A context-aware fusion module reconstructs the feature distribution in the encoder, reducing the semantic gap caused by mismatched receptive fields, enhancing the semantic relevance between encoder and decoder, and achieving semantically matched context feature fusion.
The invention adopts the UNet network as the baseline segmentation network for honeycomb lung segmentation. The network consists of an encoder and a decoder; the bottleneck layer and the skip connection modules connect the encoder and decoder by high-channel convolution and feature addition, respectively. The structure of the proposed medical image segmentation network CSC-UTNet is shown in fig. 2. Specifically, the encoder includes four channel-mixed convolution blocks for extracting high-level and low-level feature information from the image. Each convolution block comprises 2 convolution layers, a batch normalization layer (Batch Normalization) and a rectified linear unit (ReLU); the feature map extracted by each convolution block follows two paths: the first connects to a max pooling layer, which down-samples the feature map before passing it to the next convolution block, while the other enters the skip connection path. Similar to the encoder, the decoder contains 4 identical convolution blocks, each comprising 2 layers of convolution, batch normalization and ReLU activation. Each convolution block is then up-sampled by deconvolution, and the transposed-convolution operation doubles the feature map size. At the bottleneck layer of the network, ViT slices the feature map, improving the interaction of global information across the slice features. At the skip connection stage, the feature maps obtained from the four convolution blocks in the encoder correspond one-to-one with the up-sampled feature maps; each pair is input into a path containing a context-aware fusion module for feature enhancement, and the improved fusion scheme for low-level and high-level features compensates for the semantic gaps between them. During feature fusion, a concatenation operation retains the more useful features, avoiding the loss of original features and the impact on segmentation results that direct addition would cause.
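To make the wiring concrete, a minimal PyTorch sketch of the overall forward pass follows. It is a structural sketch, not the exact implementation: it assumes the ChannelMixConvBlock, ViTConnector and ContextAwareFusion classes sketched in the subsections below, a single-channel 256x256 input (so the bottleneck feature map is 16x16), and illustrative channel widths.

```python
import torch
import torch.nn as nn

class CSCUTNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2, widths=(64, 128, 256, 512)):
        super().__init__()
        self.enc = nn.ModuleList()
        c = in_ch
        for w in widths:                                   # four down-sampling stages
            self.enc.append(ChannelMixConvBlock(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)                        # max pooling per the encoder description
        self.bottleneck = ViTConnector(widths[-1], img_size=16)
        self.ups, self.cfbs, self.dec = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        prev = widths[-1]
        for w in reversed(widths):                         # four up-sampling stages
            self.ups.append(nn.ConvTranspose2d(prev, w, kernel_size=2, stride=2))
            self.cfbs.append(ContextAwareFusion(w))        # one CFB per skip connection
            self.dec.append(ChannelMixConvBlock(w, w))
            prev = w
        self.head = nn.Conv2d(widths[0], num_classes, kernel_size=1)

    def forward(self, x):
        skips = []
        for blk in self.enc:
            x = blk(x)
            skips.append(x)                                # path into the skip connection
            x = self.pool(x)                               # down-sampling path
        x = self.bottleneck(x)                             # ViT-based context connector
        for up, cfb, blk in zip(self.ups, self.cfbs, self.dec):
            x = up(x)                                      # deconvolution doubles the size
            x = blk(cfb(skips.pop(), x))                   # fuse skip and up-sampled features
        return self.head(x)
```

Note that concatenation-based fusion happens inside the context-aware fusion module, so no feature maps are added directly.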
The improved parts of the network are described in turn below.
Channel-mixed convolution block: in current mainstream segmentation models, many networks use high-channel convolutions to extract deep abstract features of the image to obtain better segmentation results. However, stacking a large number of convolution layers increases the computation of the model, and a growing channel count in particular causes excessive parameters. The invention therefore proposes a new channel-mixed convolution block, whose structure is shown in fig. 3. The module separates the feature map into two branches S1 and S2: branch S1 keeps the feature map and the number of channels unchanged, while branch S2 is convolved, and the convolution computation of branch S2 is reduced by a factor of 1/(1-r) compared with the ordinary convolution in the original network, where r is the split ratio. The convolved feature map is then fused with the feature map of branch S1. Finally, a channel mixing operation exchanges information among all channels of the feature map, enhancing the information interaction among its different groups. At this point the obtained feature map has the same size as the upper-layer feature map, and it again undergoes channel separation, branch convolution, channel concatenation and channel mixing. Finally, the feature map M is down-sampled with an average pooling layer, and the output feature map serves as the input of the next channel-mixed convolution block. The channel mixing and channel split operations effectively reduce the computation of the convolution process, increase information interaction among different channels, effectively extract local feature information of the lesion region, and prevent the waste of computational resources caused by feature redundancy.
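A minimal PyTorch sketch of the channel-mixed convolution block follows. The split, branch convolution, concatenation and channel mixing pattern applied twice per block comes from the description above; the split ratio, the 1x1 projection used to change channel width, and the shuffle group count are assumptions, and the pooling between blocks (max pooling in the encoder description, average pooling in fig. 3's description) is left outside the block.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Channel mixing: interleave channels so information crosses the two branches."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(b, c, h, w))

class ChannelMixUnit(nn.Module):
    """One pass of channel separation -> branch convolution -> concatenation -> mixing."""
    def __init__(self, channels, r=0.5):
        super().__init__()
        self.keep = int(channels * r)          # branch S1: kept unchanged
        conv_ch = channels - self.keep         # branch S2: convolved
        self.branch = nn.Sequential(
            nn.Conv2d(conv_ch, conv_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(conv_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        s1, s2 = x[:, :self.keep], x[:, self.keep:]    # channel separation
        out = torch.cat([s1, self.branch(s2)], dim=1)  # channel concatenation
        return channel_shuffle(out)                    # channel mixing

class ChannelMixConvBlock(nn.Module):
    """Two units per block, giving the 2 convolution layers with BN and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # assumed 1x1 lift to out_ch
        self.unit1 = ChannelMixUnit(out_ch)
        self.unit2 = ChannelMixUnit(out_ch)

    def forward(self, x):
        return self.unit2(self.unit1(self.proj(x)))
```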
ViT-based context connector: many existing segmentation models currently extract features with convolution operations. Although their local feature extraction capability is strong, the fixed convolution kernel size limits the receptive field over the image, and repeated convolution operations cause global information loss and local feature redundancy. Meanwhile, the edges of honeycomb lung lesions are irregular and blurred, so a CNN cannot sufficiently extract their features and easily loses edge information. To address these problems and construct a feature extractor that focuses on global information and has a larger receptive field, ViT is introduced into the bottleneck layer of the U-shaped network: it expands the global receptive field over the image by enriching the encoded representation information in different subspaces of the network, and the acquired global information is fused with the local information of the encoder to extract more effective feature information.
To improve model performance without adding excessive computational cost, the ViT-based context connector adopts six Transformer encoding blocks to extract global information in the image, providing more useful features for the decoder part of the U-shaped network to learn. The model structure of the ViT-based context connector is shown in figs. 4 and 5 and mainly comprises four parts:
(1) Slice embedding. The vision task is converted into an NLP problem by converting the original two-dimensional image into one-dimensional sequence data. ViT sits at the bottleneck layer of the U-shaped network, where the output image of the encoder has size X ∈ R^(C×H×W). With the slice size set to patch ∈ R^(p×p), the image is converted into (H/p)×(W/p) slices, and the slice dimension is x_p^i ∈ R^(N×(p×p×C)) (where N is the number of slices).
(2) Position encoding. This part marks each slice with its corresponding position information so that the dimensions of the image can be restored, ensuring that the structure of the image is not destroyed. Likewise, the position information is kept consistent with the slice-embedding output, the correct position information of each slice is saved, and its dimension is E_pos ∈ R^((p²∙C)×D) (where D is the dimension of the linear mapping layer).
(3) The multi-head attention mechanism in the Transformer encoding block. After the image slicing and position-encoding stages, the slice information of the image is input into the Transformer encoding block, which learns the relationships between slices and between the pixels within each slice and performs context modeling of the global information. The multi-head attention layer of this part receives the slice and position-encoding information and is configured with 4 attention heads; the multiple attention heads learn more relevant information in different subspaces, and Q and K in each attention head are initialized with different weight matrices to express more diverse features. To prevent over-fitting, a Dropout operation is applied to the global features learned by the multi-head attention layer, and a normalization layer adjusts feature dimensions for the subsequent output, accelerating model convergence.
(4) The multi-layer perceptron layer in the Transformer encoding block. This part uses an MLP block to learn the non-linear relationships between features; likewise, to prevent vanishing or exploding gradients and accelerate model convergence, residual connections and layer normalization are used at each sub-layer. The computation of the Transformer encoding block is given in equations (1) to (3).
z_0 = [x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos    (1)

z'_ℓ = MSA(LN(z_(ℓ-1))) + z_(ℓ-1),  ℓ = 1, …, L    (2)

z_ℓ = MLP(LN(z'_ℓ)) + z'_ℓ,  ℓ = 1, …, L    (3)

In the above formulas, x_p^i and E denote the slices and the linear projection layer respectively, E_pos denotes the position information of the respective slices, MSA and LN denote the multi-head self-attention and layer normalization operations, and z'_ℓ and z_ℓ denote the outputs of the multi-head attention layer and the MLP layer respectively.
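A minimal PyTorch sketch of the ViT-based context connector follows, implementing equations (1) to (3) with six encoding blocks and 4 attention heads as described above. The patch size, embedding dimension, MLP width, Dropout rate and the bilinear up-sampling back to the bottleneck resolution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    """One Transformer encoding block: eq. (2) then eq. (3), in pre-norm form."""
    def __init__(self, dim, heads=4, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Dropout(drop), nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # eq. (2): MSA + residual
        return z + self.mlp(self.norm2(z))                 # eq. (3): MLP + residual

class ViTConnector(nn.Module):
    """ViT-based context connector at the bottleneck layer of the U-shaped network."""
    def __init__(self, in_ch, img_size, patch=2, dim=512, depth=6):
        super().__init__()
        self.patch = patch
        n = (img_size // patch) ** 2                       # N slices
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # slice embedding E
        self.pos = nn.Parameter(torch.zeros(1, n, dim))    # position encoding E_pos
        self.blocks = nn.Sequential(*[EncoderBlock(dim) for _ in range(depth)])
        self.unembed = nn.Conv2d(dim, in_ch, kernel_size=1)

    def forward(self, x):
        b, _, hgt, wid = x.shape
        z = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # eq. (1): z_0
        z = self.blocks(z)                                       # six encoding blocks
        z = z.transpose(1, 2).reshape(b, -1, hgt // self.patch, wid // self.patch)
        return F.interpolate(self.unembed(z), size=(hgt, wid),
                             mode="bilinear", align_corners=False)  # back to bottleneck size
```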
Context-aware fusion module: in a traditional U-shaped network, the encoder and decoder features are directly connected to fuse low-level semantic information with high-level abstract features, ignoring the contextual relationship between the features and weakening the expression of important ones. The proposed network introduces a context-aware fusion module to reconstruct the skip connection structure; using the idea of residual learning, the module fuses multi-stage features, enhances the expression of important features in the global context information, and suppresses the irrelevant background noise introduced by low-order features. The structure of the context-aware fusion module is shown in fig. 6. The module receives the low-level features Fl from the encoder and the high-level features Fh from the decoder. First, global average pooling (GAP) generates feature maps with global spatial information, and a shared-weight multi-layer perceptron models the context information in the low-level and high-level feature maps to generate vectors h and l, which represent the weight vectors of the high-level and low-level feature maps respectively. Second, following the idea of residual learning, the weight vectors are multiplied with the two feature maps to generate the redistributed feature maps, which are then concatenated along the channel dimension to produce a feature map containing global context information of both the local and high-level stages. Third, two 3x3 convolutions realize weighted feature fusion, and a residual connection receives information from the high-level features to capture more effective features.
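A minimal PyTorch sketch of the context-aware fusion module follows, matching the GAP, shared MLP, re-weighting, concatenation, two 3x3 convolutions and residual pipeline described above. The sigmoid gating, the MLP reduction ratio, and equal channel counts for Fl and Fh are assumptions.

```python
import torch
import torch.nn as nn

class ContextAwareFusion(nn.Module):
    """CFB: re-weights encoder (low-level) and decoder (high-level) features, then fuses them."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.mlp = nn.Sequential(                     # shared-weight multi-layer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                             # assumed gating for the weight vectors
        )
        self.fuse = nn.Sequential(                    # two 3x3 convolutions for weighted fusion
            nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, f_low, f_high):
        b, c, _, _ = f_low.shape
        l = self.mlp(self.gap(f_low).view(b, c)).view(b, c, 1, 1)   # weight vector l
        h = self.mlp(self.gap(f_high).view(b, c)).view(b, c, 1, 1)  # weight vector h
        redistributed = torch.cat([f_low * l, f_high * h], dim=1)   # concat along channels
        return self.fuse(redistributed) + f_high                    # residual from high-level side
```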
It should be noted that, as regards the specific structure of the invention, the connection relationships between the modules adopted by the invention are determinate and realizable; except as specifically described in the embodiments, these connection relationships bring the corresponding technical effects and solve the technical problems proposed by the invention without depending on the execution of any corresponding software program.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the invention.

Claims (9)

1. A honeycomb lung lesion segmentation method based on ViT and context feature fusion, characterized in that it comprises the following steps:
S1: acquiring honeycomb lung CT image data, preprocessing the original images, and dividing the preprocessed data set into a training set and a test set;
S2: constructing a basic UNet network comprising a down-sampling encoder, an up-sampling decoder, skip connections and a bottleneck layer;
S3: improving the constructed basic UNet network by changing the convolution operations in its up-sampling and down-sampling layers, replacing the traditional convolution structure with a channel-mixed convolution block;
wherein the channel-mixed convolution block separates the feature map into two branches S1 and S2: branch S1 keeps the feature map and the number of channels unchanged, branch S2 is convolved, and the convolved feature map is then fused with the feature map of branch S1; finally, a channel mixing operation exchanges information among all channels of the feature map, the obtained feature map has the same size as the upper-layer feature map, and it again undergoes channel separation, branch convolution, channel concatenation and channel mixing;
S4: in the bottleneck layer of the network, modeling the global relationship among all pixels in the image with a ViT-based context connector, replacing the high-channel convolution module with ViT;
S5: improving the feature fusion scheme for the low-level and high-level features at the skip connections with a context-aware fusion module, and relearning the features;
S6: defining the hyper-parameters of the segmentation model and training with the data set processed in S1 to obtain the loss value of the loss function and the segmentation results;
S7: adjusting the network parameters according to the results, generating and saving a trained lesion segmentation model, inputting the test set data into the trained lesion segmentation model, segmenting the lesions in honeycomb lung CT images, and outputting the segmentation results.
2. The honeycomb lung lesion segmentation method based on ViT and context feature fusion according to claim 1, characterized in that: the ViT-based context connector adopts six Transformer encoding blocks to extract global information in the image, wherein ViT comprises four parts:
(1) Slice embedding: converting the original two-dimensional image into one-dimensional sequence data, setting the slice size according to the encoder's output image, converting the image into several slices, and computing the slice dimensions;
(2) Position encoding: marking each slice with its corresponding position information so that the dimensions of the image can be restored, wherein the position information is kept consistent with the slice-embedding output, the correct position of each slice is stored, and its dimensions are computed;
(3) The multi-head attention mechanism in the Transformer encoding block: after the image slicing and position-encoding stages, inputting the slice information of the image into the Transformer encoding block, learning the relationships between slices and between the pixels within each slice, and performing context modeling of the global information; the multi-head attention layer receives the slice and position-encoding information, uses multiple attention heads to learn more relevant information in different subspaces, initializes Q and K in each attention head with different weight matrices, applies a Dropout operation to the global features learned by the multi-head attention layer, and uses a normalization layer to adjust feature dimensions for the subsequent output and accelerate model convergence;
(4) The multi-layer perceptron layer in the Transformer encoding block: this part uses an MLP block to learn the non-linear relationships between features, with residual connections and layer normalization at each sub-layer.
3. The honeycomb lung lesion segmentation method based on ViT and context feature fusion according to claim 1, characterized in that: the context-aware fusion module receives low-level features from the encoder and high-level features from the decoder; first, global average pooling (GAP) generates feature maps with global spatial information, and a shared-weight multi-layer perceptron models the context information in the low-level and high-level feature maps to generate vectors h and l, which represent the weight vectors of the high-level and low-level feature maps respectively; second, following the idea of residual learning, the weight vectors are multiplied with the two feature maps to generate redistributed feature maps, which are then concatenated along the channel dimension to produce a feature map containing global context information of both the local and high-level stages; third, two 3x3 convolutions realize weighted feature fusion, and a residual connection receives information from the high-level features.
4. The honeycomb lung lesion segmentation method based on ViT and context feature fusion according to claim 1, characterized in that: the encoder comprises four channel-mixed convolution blocks for extracting high-level and low-level feature information from the image; each channel-mixed convolution block contains 2 convolution layers, a batch normalization layer and a rectified linear unit (ReLU); the feature map extracted by each channel-mixed convolution block follows two paths: the first connects to a max pooling layer, which down-samples the feature map and passes it to the next convolution block, while the other enters the skip connection path.
5. The honeycomb lung lesion segmentation method based on ViT and context feature fusion according to claim 1, characterized in that: the decoder comprises 4 channel-mixed convolution blocks identical to those in the encoder, each containing 2 layers of convolution, batch normalization and ReLU activation; each channel-mixed convolution block up-samples the feature map with deconvolution, and the transposed-convolution operation doubles the feature map size.
6. A honeycomb lung lesion segmentation network based on ViT and context feature fusion, characterized in that: it comprises an encoder, a decoder, a ViT-based context connector at the network bottleneck layer, and four context-aware fusion modules at the skip connection stage; the encoder comprises four down-sampling modules and the decoder comprises four up-sampling modules, each containing channel-mixed convolution blocks; each channel-mixed convolution block contains 2 convolution layers, a batch normalization layer and a rectified linear unit (ReLU); the feature map extracted by each channel-mixed convolution block in a down-sampling module follows two paths: the first connects to a max pooling layer, which down-samples the feature map and passes it to the next convolution block, while the other enters the skip connection path; each channel-mixed convolution block in an up-sampling module up-samples the feature map with deconvolution, and the transposed-convolution operation doubles the feature map size;
at the bottleneck layer of the network, ViT slices the feature map and computes the global relationships within it;
and at the skip connection stage, the feature maps obtained from the four channel-mixed convolution blocks in the encoder and the up-sampled feature maps are respectively input into paths containing a context-aware fusion module for feature enhancement.
7. The honeycomb lung lesion segmentation network based on ViT and context feature fusion according to claim 6, characterized in that: the channel-mixed convolution block comprises a channel separation module, a channel concatenation module and a channel mixing module; the channel separation module comprises two branches S1 and S2, where branch S1 keeps the feature map and the number of channels unchanged and branch S2 is convolved; the feature maps output by the two branches pass through the channel concatenation module and the channel mixing module, and each channel-mixed convolution block outputs its feature map after two rounds of channel separation, channel concatenation and channel mixing.
8. The honeycomb lung lesion segmentation network based on ViT and context feature fusion according to claim 6, characterized in that: the ViT-based context connector comprises a slice embedding module, a position encoding module and six Transformer encoding blocks, each of which contains a multi-head attention module and a multi-layer perceptron module; the slice embedding module and the position encoding module are located at the beginning of the ViT module, where the slice embedding module converts the original two-dimensional image into one-dimensional sequence data of several slices and the position encoding module marks each slice with its corresponding position information; the multi-head attention module, located in the first half of the Transformer encoding block, receives the slice and position-encoding information and comprises several attention heads, a Dropout operation and a normalization layer; and the multi-layer perceptron module, located in the second half of the Transformer encoding block, comprises a multi-layer perceptron layer and a normalization layer.
9. The honeycomb lung lesion segmentation network based on ViT and context feature fusion according to claim 6, characterized in that: the context-aware fusion module comprises two parallel global average pooling layers, a multi-layer perceptron layer and a convolution block; the two parallel global average pooling layers respectively receive low-level features from the encoder and high-level features from the decoder to generate feature maps with global spatial information; the multi-layer perceptron uses shared weights to model the context information in the low-level and high-level feature maps and generate weight vectors; the weight vectors are multiplied with the two feature maps to generate redistributed feature maps, which are concatenated along the channel dimension to produce feature maps containing global context information of both the local and high-level stages; and weighted feature fusion is realized with two 3x3 convolutions in the convolution block.
CN202210907330.3A 2022-07-29 2022-07-29 Honeycomb lung focus segmentation method and network based on ViT and context feature fusion Pending CN115526829A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210907330.3A CN115526829A (en) 2022-07-29 2022-07-29 Honeycomb lung focus segmentation method and network based on ViT and context feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210907330.3A CN115526829A (en) 2022-07-29 2022-07-29 Honeycomb lung focus segmentation method and network based on ViT and context feature fusion

Publications (1)

Publication Number Publication Date
CN115526829A 2022-12-27

Family

ID=84695650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210907330.3A Pending CN115526829A (en) 2022-07-29 2022-07-29 Honeycomb lung focus segmentation method and network based on ViT and context feature fusion

Country Status (1)

Country Link
CN (1) CN115526829A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579616A (en) * 2023-07-10 2023-08-11 武汉纺织大学 Risk identification method based on deep learning
CN116579616B (en) * 2023-07-10 2023-09-29 武汉纺织大学 Risk identification method based on deep learning
CN117275681A (en) * 2023-11-23 2023-12-22 太原理工大学 Method and device for detecting and evaluating honeycomb lung disease course period based on transducer parallel cross fusion model
CN117275681B (en) * 2023-11-23 2024-02-09 太原理工大学 Method and device for detecting and evaluating honeycomb lung disease course period based on transducer parallel cross fusion model
CN117523203A (en) * 2023-11-27 2024-02-06 太原理工大学 Image segmentation and recognition method for honeycomb lung disease kitchen based on transducer semi-supervised algorithm
CN118396071A (en) * 2024-07-01 2024-07-26 山东科技大学 Boundary driving neural network structure for unmanned ship environment understanding

Similar Documents

Publication Publication Date Title
CN111145170B (en) Medical image segmentation method based on deep learning
CN115526829A (en) Honeycomb lung focus segmentation method and network based on ViT and context feature fusion
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
Chen et al. PCAT-UNet: UNet-like network fused convolution and transformer for retinal vessel segmentation
CN112330724B (en) Integrated attention enhancement-based unsupervised multi-modal image registration method
CN114283158A (en) Retinal blood vessel image segmentation method and device and computer equipment
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN113393469A (en) Medical image segmentation method and device based on cyclic residual convolutional neural network
CN112132878B (en) End-to-end brain nuclear magnetic resonance image registration method based on convolutional neural network
CN113888412B (en) Image super-resolution reconstruction method for diabetic retinopathy classification
CN112734755A (en) Lung lobe segmentation method based on 3D full convolution neural network and multitask learning
CN114066913B (en) Heart image segmentation method and system
Li et al. Deep recursive up-down sampling networks for single image super-resolution
CN117078941B (en) Cardiac MRI segmentation method based on context cascade attention
CN112288749A (en) Skull image segmentation method based on depth iterative fusion depth learning model
CN112420170A (en) Method for improving image classification accuracy of computer aided diagnosis system
CN117333750A (en) Spatial registration and local global multi-scale multi-modal medical image fusion method
CN117611599B (en) Blood vessel segmentation method and system integrating centre line diagram and contrast enhancement network
CN116468732A (en) Lung CT image segmentation method and imaging method based on deep learning
Liu et al. MRL-Net: multi-scale representation learning network for COVID-19 lung CT image segmentation
Zhou et al. GA-Net: Ghost convolution adaptive fusion skin lesion segmentation network
CN117726872A (en) Lung CT image classification method based on multi-view multi-task feature learning
Jianjian et al. MCSC-UTNet: Honeycomb lung segmentation algorithm based on Separable Vision Transformer and context feature fusion
CN116977387A (en) Deformable medical image registration method based on deformation field fusion
CN117197156B (en) Lesion segmentation method and system based on double decoders UNet and Transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination