CN117593275A - Medical image segmentation system - Google Patents

Medical image segmentation system

Info

Publication number
CN117593275A
Authority
CN
China
Prior art keywords
layer
convolution
model
image segmentation
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311620496.8A
Other languages
Chinese (zh)
Inventor
潜丽妃
钟代笛
仲元红
罗玲
赵宇
黄智勇
韩术
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202311620496.8A
Publication of CN117593275A


Classifications

    • G06T 7/0012 Biomedical image inspection
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/10081 Computed x-ray tomography [CT]
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20132 Image cropping
    • G06T 2207/30084 Kidney; Renal
    • G06T 2207/30096 Tumor; Lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a medical image segmentation system comprising a preprocessing module, an image segmentation module and a network lightweight module. The system completes the selection, division and preprocessing of a data set; builds a segmentation network framework; performs module design on the selected baseline network; and selects an optimization strategy, a supervision strategy and a suitable loss function. On the basis of the designed model, it performs model simplification to reduce the computation and storage resource requirements of the model and improve its running efficiency and deployment flexibility, and re-parameterizes the model according to the re-parameterization principle, reducing the number of parameters through reasonable parameter fusion and approximate representation. Through in-depth study of image segmentation algorithms, an algorithm is designed and realized for three-dimensional CT images of kidney tumors; with the re-parameterized model, the deep-learning-based kidney tumor image segmentation system provides powerful support for the diagnosis and treatment of kidney tumors.

Description

Medical image segmentation system
Technical Field
The invention relates to the technical field of computer vision, in particular to a medical image segmentation system.
Background
Medical image segmentation plays a vital role in the modern medical field, particularly in tumor diagnosis and treatment planning. Kidney tumors are common malignant tumors, and their early, accurate segmentation and localization are vital to early diagnosis and treatment. However, the shape and size of kidney tumors tend to vary in very complex ways, which challenges traditional image segmentation methods, so developing an efficient and accurate medical image segmentation algorithm for kidney tumors has become urgent. Automatic medical image segmentation based on deep learning already has many applications in clinical quantification, treatment and surgical planning. Medical image segmentation aims at assigning labels to pixels so that pixels with the same label form a segmented object; as network structures have continued to grow, image segmentation algorithms have been further improved and deliver increasingly accurate segmentation results.
At present, deep-learning-based methods have achieved remarkable results in the field of medical image segmentation, and their segmentation precision has surpassed that of traditional methods. Fully convolutional networks (FCN) achieve pixel-level prediction by introducing transposed convolution layers, so input images can be segmented at the pixel level; because an FCN contains only convolutional layers, it can output a segmentation map of the same size as the input image. FCN is a milestone among deep-learning-based image semantic segmentation models, and, inspired by it, many models for image segmentation were subsequently proposed. U-Net, for example, consists of two parts: a contracting path that captures context and a symmetric expanding path that achieves accurate localization; U-Net also uses data augmentation to learn effectively from few annotated images.
However, while CNN-based networks perform well, they still have limitations in image segmentation tasks, particularly in handling multi-scale objects and precise boundaries, and their computational efficiency suffers on large-scale images. To overcome these limitations, the Transformer has attracted extensive attention: although originally designed for natural language processing tasks, its powerful representation and parallel computing capabilities give it enormous potential in the image segmentation field. The Vision Transformer is a model based on the Transformer architecture that introduces the Transformer into image processing tasks: it applies the self-attention mechanism and fully-connected layers to image data, enabling global modeling of the relationships between pixels so as to capture both global information and local features in the image. Vision Transformer achieves remarkable performance on image processing tasks without relying on convolution operations and surpasses traditional CNN models on some benchmark datasets; however, due to the high dimensionality and computational complexity of images, it struggles with large images and dense pixel-level tasks. To solve this problem, the Swin Transformer, an image processing model based on the Transformer architecture, was designed specifically for processing high-resolution images. It improves on the Vision Transformer to address the computational and memory overhead of conventional Transformers on large images: it introduces a hierarchical window mechanism that divides the image into windows and performs self-attention within each window, which shortens the input sequence, reduces computation and memory overhead, and makes high-resolution images tractable.
Although the Transformer architecture can be used independently of CNN models, hybrid Transformer-CNN models can fully exploit the advantages of both and are a popular research direction. UNETR is such a hybrid, combining a Transformer with the U-Net structure for image segmentation tasks: it adopts an encoder-decoder structure in which the encoder uses a Transformer to model and capture long-range dependencies in a global scope, while the decoder adopts the U-Net structure to decode and upsample the features output by the encoder into the final segmentation result. Swin UNETR further improves on UNETR by replacing the Transformer with a Swin Transformer encoder, reducing model parameters and inference time while improving performance; the decoder part still adopts the U-Net structure. A related line of models differs in that a CNN first performs feature extraction and the resulting feature map is then fed into the Transformer to realize global modeling.
However, in practical applications hardware resources are usually limited, especially in resource-constrained environments such as embedded or mobile devices, so to achieve efficient and real-time inference, the simplification of image segmentation models is also a concern. Model simplification methods include parameter pruning, model quantization, network structure design, knowledge distillation and the like. DeepLabv3 introduces depthwise separable convolution to reduce parameters and improve computational efficiency: a depthwise separable convolution decomposes the standard convolution operation into a depthwise convolution and a pointwise convolution, reducing computational complexity while maintaining good feature representation capability.
Kidney tumor image segmentation is a widely studied field, but because different segmentation targets have different morphological characteristics, no general algorithm currently applies to all of them.
Therefore, how to design a medical image segmentation system that improves the segmentation accuracy of kidney tumor images while reducing parameter count and time complexity is the technical problem to be solved.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a medical image segmentation system with the advantage of high image segmentation precision. It addresses the problem that, although kidney tumor image segmentation is widely studied, no general algorithm currently applies to all segmentation targets because their morphological characteristics differ.
In order to achieve the above purpose, the present invention provides the following technical solution: a medical image segmentation system comprising a preprocessing module, an image segmentation module and a network lightweight module;
wherein the preprocessing module is used for: preprocessing an input image;
the image segmentation module is used for: inputting the preprocessed image into an image segmentation network and obtaining segmentation results;
and the network lightweight module is used for: making the network lightweight.
Further, the image segmentation module operates as follows:
S1, feeding the input data into a CNN network and a self-attention network in parallel to obtain the feature maps computed by each;
S2, concatenating the feature map extracted by the CNN and the feature map extracted by self-attention along the channel dimension;
S3, connecting the concatenated feature maps with the upsampled feature maps in the decoder through skip connections, and performing further feature extraction and upsampling with convolution layers;
S4, computing the loss function between the finally obtained four levels of segmentation results and the labels, assigning weights according to image size, performing a weighted sum of the per-level loss functions to obtain the final loss function, and updating the model parameters.
Further, the network lightweight module comprises the following steps:
1) Replacing the standard convolutions in the model with depthwise separable convolutions;
2) Replacing the IN layers with BN layers, performing re-parameterization calculation on the trained model parameters to obtain new model parameters, building a new model with the BN layers removed, and loading the re-parameterized parameters to complete the re-parameterization.
Further, the system adopts an encoder-decoder architecture. The encoder is composed of a CNN network and a self-attention network; each stage of the CNN part of the encoder comprises two convolution layers, where the first convolution layer has a stride of 2 and twice as many output channels as input channels, the second convolution layer has a stride of 1 and as many output channels as input channels, and each convolution layer is followed by an InstanceNorm layer and a LeakyReLU layer for feature normalization and nonlinear modeling.
Further, each stage of the self-attention part of the encoder includes two self-attention computation layers: the first is a standard self-attention layer and the second is a shifted-window self-attention layer. The decoder part upsamples the input of the previous layer: the decoder includes a deconvolution layer for upsampling, followed by a decoding block formed by two convolution layers, where the first convolution layer reduces the output channels, the second keeps the number of output channels unchanged, both have stride 1, and each decoding block is followed by an InstanceNorm layer and a LeakyReLU layer.
Further, the fused encoder feature maps are transferred to the decoder part through skip connections, realizing the fusion of deep and shallow features.
Further, the skip-connected feature maps and the decoder's upsampled feature maps are concatenated and fused along the channel dimension.
Further, the standard convolutions in the model are replaced with depthwise separable convolutions: in the stacked convolution layers of the encoder and decoder, the system changes each standard convolution layer into a group convolution whose number of groups equals the number of input channels (DW convolution), and adds a standard convolution with kernel size 1 after the DW convolution.
Further, the model is re-parameterized: a re-parameterization algorithm is used to reduce the parameter count of the model, the IN layers are replaced with BN layers for training, and re-parameterization calculation is performed on the resulting model parameters to obtain new ones. The output of a convolution layer followed by BN is:
O_j = γ_j · ((I * F)_j - μ_j) / σ_j + β_j,
where I is the input, F denotes the convolution kernel parameters, j is the channel index, μ_j and σ_j are the accumulated per-channel mean and standard deviation, and γ_j and β_j are the learned scaling factor and bias term, respectively.
Further, the re-parameterized model parameters are:
F′_j,:,:,: = (γ_j / σ_j) · F_j,:,:,:,
b′_j = β_j - μ_j · γ_j / σ_j.
Compared with the prior art, the technical solution of the present application has the following beneficial effects:
The medical image segmentation system completes the selection, division and preprocessing of a data set; builds a segmentation network framework; performs module design on the selected baseline network; and selects an optimization strategy, a supervision strategy and a suitable loss function. On the basis of the designed model it performs model simplification, reducing the computation and storage resource requirements of the model and improving its running efficiency and deployment flexibility. According to the re-parameterization principle, and based on the correlation and redundancy among the model parameters, it re-parameterizes the model and reduces the number of parameters through reasonable parameter fusion and approximate representation. Through in-depth study of image segmentation algorithms, an algorithm is designed and realized for three-dimensional CT images of kidney tumors, providing an efficient and accurate solution; together with the re-parameterization of the model, the deep-learning-based kidney tumor image segmentation system provides powerful support for the diagnosis and treatment of kidney tumors.
Drawings
FIG. 1 is a cross-sectional view of the CT image of case 00000 according to the present invention;
FIG. 2 is a cross-sectional view of the CT image of case 00000 after preprocessing according to the present invention;
FIG. 3 is a schematic diagram of the STransUnet network according to the present invention;
FIG. 4 is a block diagram of the Swin Transformer part of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIGS. 1 to 4, the medical image segmentation system in this embodiment includes a preprocessing module, an image segmentation module and a network lightweight module;
wherein the preprocessing module is used for: preprocessing an input image;
the image segmentation module is used for: inputting the preprocessed image into an image segmentation network and obtaining segmentation results;
and the network lightweight module is used for: making the network lightweight.
The image segmentation module operates as follows:
S1, feeding the input data into a CNN network and a self-attention network in parallel to obtain the feature maps computed by each;
S2, concatenating the feature map extracted by the CNN and the feature map extracted by self-attention along the channel dimension;
S3, connecting the concatenated feature maps with the upsampled feature maps in the decoder through skip connections, and performing further feature extraction and upsampling with convolution layers;
S4, computing the loss function between the finally obtained four levels of segmentation results and the labels, assigning weights according to image size, performing a weighted sum of the per-level loss functions to obtain the final loss function, and updating the model parameters.
The network lightweight module comprises the following steps:
1) Replacing the standard convolutions in the model with depthwise separable convolutions;
2) Replacing the IN layers with BN layers, performing re-parameterization calculation on the trained model parameters to obtain new model parameters, building a new model with the BN layers removed, and loading the re-parameterized parameters to complete the re-parameterization.
The system adopts an encoder-decoder architecture. The encoder is composed of a CNN network and a self-attention network; each stage of the CNN part of the encoder comprises two convolution layers, where the first convolution layer has a stride of 2 and twice as many output channels as input channels, the second convolution layer has a stride of 1 and as many output channels as input channels, and each convolution layer is followed by an InstanceNorm layer and a LeakyReLU layer for feature normalization and nonlinear modeling.
Each stage of the self-attention part of the encoder comprises two self-attention computation layers: the first is a standard self-attention layer and the second is a shifted-window self-attention layer. The decoder part upsamples the input of the previous layer: the decoder comprises a deconvolution layer for upsampling, followed by a decoding block formed by two convolution layers, where the first convolution layer reduces the output channels, the second keeps the number of output channels unchanged, both have stride 1, and each decoding block is followed by an InstanceNorm layer and a LeakyReLU layer.
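A hedged PyTorch sketch of one such decoder stage follows; the 3×3×3 kernels, the padding, and the assumption that the first convolution sees the upsampled features concatenated with a skip connection are implementation choices, not confirmed details of the patent.

```python
import torch
import torch.nn as nn

# One decoder stage as described: a deconvolution upsamples the previous
# layer (2x resolution, channels halved), then two stride-1 convolutions,
# each followed by InstanceNorm and LeakyReLU.
class DecoderStage(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.block = nn.Sequential(
            # first convolution reduces channels; its input is the upsampled
            # features plus the skip-connected features (assumed concatenation)
            nn.Conv3d(out_ch * 2, out_ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
            # second convolution keeps the channel count unchanged
            nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # upsample: 2x resolution, half channels
        x = torch.cat([x, skip], dim=1)  # fuse with skip-connected feature map
        return self.block(x)
```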
The fused encoder feature maps are transferred to the decoder part through skip connections to realize the fusion of deep and shallow features.
The skip-connected feature maps and the decoder's upsampled feature maps are concatenated and fused along the channel dimension.
The standard convolutions in the model are replaced with depthwise separable convolutions: in the stacked convolution layers of the encoder and decoder, the system changes each standard convolution layer into a group convolution whose number of groups equals the number of input channels (DW convolution), and adds a standard convolution with kernel size 1 after the DW convolution.
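A minimal sketch of this replacement, assuming 3D convolutions with kernel size 3 and padding 1:

```python
import torch.nn as nn

# Depthwise separable replacement as described: the standard convolution
# becomes a group convolution with groups equal to the input channels (DW),
# followed by a standard convolution with kernel size 1 (PW).
def depthwise_separable(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch),       # DW: one filter per channel
        nn.Conv3d(in_ch, out_ch, kernel_size=1),  # PW: 1x1x1 channel mixing
    )
```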
The model is re-parameterized: a re-parameterization algorithm is used to reduce the parameter count of the model, the IN layers are replaced with BN layers for training, and re-parameterization calculation is performed on the resulting model parameters to obtain new ones. The output of a convolution layer followed by BN is:
O_j = γ_j · ((I * F)_j - μ_j) / σ_j + β_j,
where I is the input, F denotes the convolution kernel parameters, j is the channel index, μ_j and σ_j are the accumulated per-channel mean and standard deviation, and γ_j and β_j are the learned scaling factor and bias term, respectively.
The re-parameterized model parameters are:
F′_j,:,:,: = (γ_j / σ_j) · F_j,:,:,:,
b′_j = β_j - μ_j · γ_j / σ_j.
specific examples:
Image preprocessing is of great significance in image segmentation tasks: it can improve algorithm performance, improve model robustness and reduce the influence of noise. Image segmentation algorithms are very sensitive to image quality, and low-quality images lead to inaccurate segmentation results; preprocessing can eliminate or reduce problems such as noise, blurring and artifacts in the image and improve image quality, so that segmentation results can be obtained more accurately. Preprocessing can also emphasize important features in the image, such as edges, colors and textures, helping the model better distinguish different regions.
Resampling: resampling is an important step in image processing that changes the resolution or size of the image. Large images can greatly increase the model's computation and slow down the segmentation task; by selecting an appropriate sampling interval, the image can be adjusted to a suitable size, reducing the computational burden and accelerating inference. Large images also require a large amount of memory, which limits running the model on resource-constrained devices; resampling reduces the memory requirement so the model can run in such environments. Distortion or information loss can occur at different resolutions, and selecting an appropriate sampling interval minimizes distortion while preserving the important features and details of the image. The sampling interval determines the image resolution: the larger the interval, the lower the resolution and the less detail the image contains, but the more context is retained; the smaller the interval, the higher the resolution and the more detail, but context is lost. Because the voxel spacing of the dataset is not uniform (the x-axis voxel spacing lies in (0.39, 1.04), the y-axis spacing in (0.39, 1.04) and the z-axis spacing in (0.5, 5.0)), the invention follows nnU-Net and selects the median spacing of each axis as the target spacing. When the anisotropy is severe, selecting the median can cause serious interpolation artifacts or information loss because the resolution varies greatly across the dataset; therefore, when the spacing anisotropy exceeds a factor of 3, one tenth of the maximum spacing is selected as the target spacing. Since the anisotropy of the dataset used by the invention is not severe, the selected target spacing is (1.00, 0.78, 0.78); to further reduce memory usage and shorten processing time, the final voxel target spacing is (2.38, 1.86, 1.86), and resampling is performed with third-order spline interpolation according to the selected target spacing.
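A hedged sketch of this step using SciPy's spline-based zoom; the (z, y, x) axis ordering and the function structure are assumptions rather than the patent's code.

```python
import numpy as np
from scipy.ndimage import zoom

# Resample a CT volume to the target voxel spacing with third-order
# spline interpolation; `spacing` and `target_spacing` are (z, y, x).
def resample(image: np.ndarray, spacing, target_spacing=(2.38, 1.86, 1.86)):
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    return zoom(image, factors, order=3)  # order=3: third-order spline
```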
Image cropping: image cropping is a common preprocessing operation in which part of the image is cut away while the region of interest is kept, removing image borders and irrelevant areas. Because a segmentation algorithm usually computes on every pixel of the image, it consumes a large amount of memory and lengthens training and inference time; cropping reduces the image to the region of interest, which reduces the number of pixels the algorithm must process, lightens the computational burden and speeds up execution. Images also usually contain irrelevant areas such as background or noise that can interfere with the segmentation algorithm and make it difficult to segment the target object accurately; cropping removes these areas so the algorithm focuses more on the region of interest. The performance of a segmentation algorithm is also related to the relative size of the target area in the image: if the target occupies a larger proportion of the image, it is easier for the algorithm to segment it accurately. After cropping, the invention processes the images as patches, with a patch size of 128.
Normalization of image intensity: image intensity normalization is an image preprocessing technique that adjusts the pixel values of an image so that they are distributed within a certain range. Because the brightness and contrast of different images can vary greatly, which can affect algorithm performance, normalization reduces the brightness variation within images, improves the algorithm's robustness to different images and lets it perform well in different scenes. In deep learning, intensity normalization gives the input images a similar distribution; excessively large or small pixel values can cause numerical overflow or instability, and normalization limits the pixel values to a smaller range, reducing the probability of numerical anomalies during calculation. This helps make the training process more stable, avoids problems such as gradient explosion or vanishing gradients, makes training more reliable and accelerates model convergence. The invention collects all intensity values appearing in the training dataset, clips them at the 0.005 and 0.995 quantiles of the collected intensity values, and then performs z-score normalization according to the mean and standard deviation of the whole dataset:
Z = (X - μ) / σ,
where X is the raw data, μ is the mean and σ is the standard deviation.
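A minimal sketch of this normalization, assuming the clipping bounds and statistics are computed once over the pooled training-set intensities:

```python
import numpy as np

# Clip to the 0.005 and 0.995 quantiles of the pooled training intensities,
# then z-score with the dataset-wide mean and standard deviation.
def normalize_intensity(volume, train_intensities):
    lo, hi = np.percentile(train_intensities, [0.5, 99.5])
    mu, sigma = train_intensities.mean(), train_intensities.std()
    return (np.clip(volume, lo, hi) - mu) / sigma
```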
Image data augmentation applies a series of transformations or operations to the original data during training to generate new samples that are modified but keep the same label. Augmentation expands the scale of the training set by generating new samples; in many cases the number of training samples is limited, which may cause the model to overfit or prevent it from learning the diversity of the data sufficiently, and augmentation effectively increases the training set size and improves the model's generalization ability. Augmentation also introduces diversity and variability, helping the model adapt better to different data distributions, illumination conditions, angles and other changes, which improves robustness and real-world performance in different scenes. For tasks with class imbalance, augmentation can generate more minority-class samples, balancing the data distribution of each class and improving recognition of minority classes. Common augmentation techniques include rotation, flipping, translation, scaling, cropping and color transformations. The augmentation techniques used in the invention are as follows:
Random rotation: the image is randomly rotated by a certain angle, which improves the model's ability to recognize objects at different angles; rotation generates samples with different angles and orientations. The probability of random rotation is 0.2, and the rotation angle range is (-0.52, 0.52).
Random scaling: a random scaling operation is applied to the image; scaling the size and proportion of the image increases the model's ability to segment objects at different scales. The probability of random scaling is 0.2, and the scale range is (0.7, 1.4).
Random elastic deformation: a random deformation operation is applied to the image; introducing random deformation parameters within a certain range increases the model's tolerance to deformation and distortion and improves its robustness.
Gamma correction: gamma correction is an enhancement that adjusts the brightness and contrast of an image; by changing the pixel value distribution of the image, it improves the model's adaptation to different brightness conditions. The formula is:
Y = c · X^γ,
where the invention sets the scaling factor c to 1, and γ is an adjustment constant that controls the degree of the gamma transformation and has a great influence on the shape of the transformation function. The range of γ is (0.7, 1.5) and the probability of gamma correction is 0.3. When γ > 1, the gray levels of brighter images are compressed; when γ < 1, the contrast of darker images is enhanced and image details are emphasized.
Mirroring: the image is flipped horizontally or vertically to generate a mirror-symmetric sample, which increases the model's invariance and robustness for object segmentation.
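The following is a hedged sketch of this augmentation pipeline with the probabilities and ranges quoted above; the interpolation order, the rotation axes, the mirroring probability, treating the rotation range as radians, and the rescaling inside the gamma step are all assumptions.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(volume, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < 0.2:                       # random rotation, p = 0.2
        angle_deg = np.degrees(rng.uniform(-0.52, 0.52))
        volume = rotate(volume, angle_deg, axes=(1, 2), reshape=False, order=1)
    if rng.random() < 0.2:                       # random scaling, p = 0.2
        volume = zoom(volume, rng.uniform(0.7, 1.4), order=1)
    if rng.random() < 0.3:                       # gamma correction, p = 0.3
        g = rng.uniform(0.7, 1.5)                # Y = c * X^g with c = 1
        lo, span = volume.min(), np.ptp(volume) + 1e-8
        volume = ((volume - lo) / span) ** g * span + lo
    if rng.random() < 0.5:                       # mirroring (probability assumed)
        volume = volume[..., ::-1]               # flip along the last axis
    return volume
```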
Network architecture:
The invention selects the low-resolution nnU-Net, i.e. the first stage of the cascaded nnU-Net, as the baseline network (because, for three-dimensional CT images with high information density, training a model without reducing the resolution incurs a high time cost and a long response delay at the inference stage). On this basis it integrates the long-range feature information extracted by the Swin Transformer to compensate for the limited local receptive field of nnU-Net. To fuse the feature maps extracted by nnU-Net with those extracted by the Swin Transformer while keeping the nnU-Net framework intact, the invention concatenates the level 1, 2, 3 and 4 feature maps output by nnU-Net with the level 0, 1, 2 and 3 feature maps output by the Swin Transformer along the channel dimension, so the final encoder output feature map sizes are as follows:
Table 1 Output feature map size of each network encoder

Output stage | nnU-Net          | Swin Transformer | STransUnet
0            | (32,128,128,128) | (32,64,64,64)    | (32,128,128,128)
1            | (64,64,64,64)    | (64,32,32,32)    | (96,64,64,64)
2            | (128,32,32,32)   | (128,16,16,16)   | (192,32,32,32)
3            | (256,16,16,16)   | (256,8,8,8)      | (384,16,16,16)
4            | (320,8,8,8)      | (512,4,4,4)      | (576,8,8,8)
Our network structure adopts the Encoder-Decoder form. The Encoder-Decoder is a common neural network structure composed of two parts: the Encoder encodes the input image into a fixed-length vector representation, and the Decoder uses that representation to generate the output image. The invention trains end-to-end: a single model completes the entire process from input to output, with no intermediate stage or manual intervention, mapping input directly to output, which removes the complexity of hand-designed features or intermediate steps and makes the whole training process simpler and more efficient. In the Encoder stage, features are encoded by two stacked convolution layers: the first convolution layer has stride 2 and twice as many output channels as input channels, realizing downsampling; the second has stride 1 and as many output channels as input channels, further extracting features. The convolution kernels are all of size 3, and each convolution layer is followed by an InstanceNorm layer and a LeakyReLU layer to normalize the feature sequences and increase the nonlinearity of the model. After the feature maps extracted by the convolutional network and by the self-attention mechanism are concatenated and fused along the channel dimension, the fused encodings are transferred to the Decoder part through skip connections to fuse deep and shallow features; the skip-connected feature maps are concatenated with the Decoder's upsampled feature maps along the channel dimension and processed further in the Decoder. In the Decoder stage, where information is relatively abstract, the fused shallow features help enrich detail information such as the textures and boundaries of the image. The Decoder upsamples with a deconvolution that halves the number of channels and doubles the resolution (both the kernel size and the stride of the deconvolution are 2); a decoding block is then formed by two stacked convolution layers, where the first reduces the output channels, the second keeps them unchanged, both have stride 1, and each is followed by an InstanceNorm layer and a LeakyReLU layer.
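A minimal sketch of one CNN encoder stage following this description (kernel size 3 with padding 1; the exact module composition is an assumption):

```python
import torch.nn as nn

# One encoder stage: the first convolution has stride 2 and doubles the
# channels (downsampling); the second has stride 1 and keeps the channel
# count; each convolution is followed by InstanceNorm and LeakyReLU.
def encoder_stage(in_ch):
    out_ch = in_ch * 2
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.InstanceNorm3d(out_ch),
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm3d(out_ch),
        nn.LeakyReLU(inplace=True),
    )
```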
Supervision strategy:
The invention uses a deep supervision strategy. Deep supervision is a technique for training neural networks that aims to improve the training effect and optimize gradient propagation: besides the final segmentation result (the level 0 output), extra supervision signals are added at intermediate layers of the network (the level 1, 2 and 3 output layers) to better guide network learning and gradient propagation. Loss functions are computed on the four output layers: each output feature map passes through a 1×1 convolution and a softmax layer to produce a prediction, which is then compared with the real label (or the downsampled real label). Deep supervision alleviates vanishing or exploding gradients, so gradients propagate better through the network; introducing multiple supervision signals at different levels provides more learning information and helps the network learn and fit the data better, and because the network is guided more directly at different levels, deep supervision also helps it converge faster. The four levels of loss are weighted to obtain the final loss, with weights assigned by the size of the output result: the weights of the level 0 to level 3 losses are 0.53, 0.27, 0.13 and 0.07 in turn. Weight decay is also introduced into the calculation of the loss function; the expression is as follows:
L(w) = L_0(ŷ, y) + λ · ‖w‖²,
where L(w) is the loss function with weight decay, L_0 is the original loss function measuring the difference between the model prediction ŷ and the real label y, λ is the coefficient of the weight decay term controlling the strength of the regularization (the invention takes λ = 3e-05), and ‖w‖² is the squared L2 norm of the model's weight parameters w, i.e. the sum of squares of the weight parameters.
By adding the weight decay term to the original loss function, the optimization not only minimizes the original loss but also constrains the magnitude of the weight parameters through the regularization term, so the optimization tends to select smaller weight values, limiting the complexity of the model and reducing the risk of overfitting.
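A minimal sketch of the weighted deep-supervision loss; `dice_ce` stands for the compound Dice + cross-entropy loss chosen in the experiments below and is assumed given, and the weight decay term is left to the optimizer rather than added here.

```python
# Deep supervision weights for output levels 0-3, as quoted above.
LEVEL_WEIGHTS = (0.53, 0.27, 0.13, 0.07)

def deep_supervision_loss(outputs, labels, dice_ce):
    # outputs: four prediction maps, level 0 = full resolution;
    # labels: matching ground truth (downsampled where needed);
    # dice_ce: compound Dice + cross-entropy loss function (assumed given).
    return sum(w * dice_ce(o, y)
               for w, o, y in zip(LEVEL_WEIGHTS, outputs, labels))
```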
Optimization algorithm:
The optimization algorithm used in the invention is stochastic gradient descent (SGD) with Nesterov momentum. SGD estimates the gradient of the whole training set from random subsets of training samples: for each mini-batch, the gradient of the loss function with respect to the model parameters is computed, representing the rate and direction of change of the loss at the current parameter values. SGD with momentum accelerates convergence and reduces oscillation by introducing momentum, which can be understood as inertia in the parameter updates, similar to momentum in physics: historical gradient information is taken into account during updates, with larger weight given to the most recent gradients, reducing the variance and instability of the updates. Nesterov momentum is an improvement on the standard momentum method that estimates the gradient more accurately and guides the updates better: it first computes the gradient at the position of the current parameters plus the momentum term, and then uses that gradient for the parameter update; this additional step makes the gradient estimate more accurate and improves the update direction. Nesterov momentum is implemented as:
V_t = γ · V_{t-1} + η · ∇L(θ - γ · V_{t-1}),
θ = θ - V_t,
where the momentum variable V_0 is initialized to 0 and has the same dimensions as the model parameters, η is the learning rate, ∇L(θ - γ · V_{t-1}) is the gradient of the loss function at the parameters temporarily updated by the velocity V_{t-1} of the previous instant, and θ denotes the parameters. Compared with the standard momentum method, Nesterov momentum estimates the gradient at the current parameter position more accurately, avoids the excessive updates that can occur with standard momentum, and converges to the optimal solution faster, especially under large curvature or in flat areas; by computing the gradient in advance, it better guides the update direction and reduces unnecessary updates. In the invention, γ is set to 0.99. The learning rate determines the step size of parameter updates: a larger learning rate can lead to rapid convergence but may miss the optimal solution, while a smaller one is more stable but converges more slowly; the initial learning rate selected by the invention is 0.01.
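This configuration maps directly onto PyTorch's SGD; `model` here is an assumed, already-built network.

```python
import torch

# SGD with Nesterov momentum as described: momentum gamma = 0.99,
# initial learning rate 0.01, weight decay lambda = 3e-05.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.99,
    nesterov=True,
    weight_decay=3e-05,
)
```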
Experimental results:
In all experiments we used the official evaluation code, which computes the Dice and Surface Dice indices of the segmentation results. The segmentation labels of the KiTS23 dataset are organized into hierarchical evaluation classes (HECs): "kidney+tumor+cyst", "tumor+cyst" and "tumor only". In the tables of the present invention, the Dice scores of the three HECs are denoted D1, D2 and D3 with their average denoted MD; the Surface Dice scores are denoted S1, S2 and S3 with their average denoted MS; and the overall average is denoted Avg.
Firstly, in order to determine the loss function used by the data set, we choose several loss functions with the best performance in the image segmentation field to perform experiments, the model used in the experiments is a baseline model (nnU-Net), the number of training iterations (epochs) is 100, the display card used in the experiments is an NVIDIA RTX 4090 display card, the time spent in one iteration is about 280 seconds, and the experimental results are shown in the following table:
Table 2 Loss function selection experiment results

Loss function | D1     | D2     | D3     | MD     | S1     | S2     | S3     | MS
Dice+CE       | 0.9663 | 0.8311 | 0.7865 | 0.8613 | 0.9244 | 0.6718 | 0.6135 | 0.7366
Dice+TopK     | 0.9673 | 0.8154 | 0.7926 | 0.8585 | 0.9243 | 0.6414 | 0.6032 | 0.7230
GDL+CE        | 0.9432 | 0.6713 | 0.6880 | 0.7675 | 0.8870 | 0.4753 | 0.4914 | 0.6179
According to experimental results, the loss functions of the die+ce are respectively 0.33%, 1.88%, 12.22% and 19.21% higher than those of the die+topk and the gdl+ce in the indexes of MD and MS, so that the loss functions selected by the invention are compound loss functions of the die+ce, and the final loss function expression is as follows in combination with weight attenuation:
Subsequently, after 500 training iterations (we increased the number of iterations to improve model performance) with our STransUnet and with the baseline network nnU-Net, the test set was segmented; the experimental results are shown in the following table:
TABLE 3 Experimental results of STransUnet versus the baseline network nnU-Net
Compared with the baseline network nnU-Net, STransUnet improves the MD and MS indices by 0.040% and 0.036% respectively, and the average index by 0.039%, showing that STransUnet has a certain advantage; in particular, STransUnet performs better than the baseline nnU-Net on the "tumor+cyst" HEC and the "tumor only" HEC, whose segmentation labels are smaller.
On one NVIDIA RTX 4090 graphics card, the parameter count of STransUnet is 36.43M, training one epoch takes 303.60 seconds and inferring one case takes 29.60 seconds, while the parameter count of nnU-Net is 35.54M, training one epoch takes 266.79 seconds and inferring one case takes 27.10 seconds; the inference time of STransUnet is only 2.50 seconds longer than nnU-Net's, and its parameter count is 2.5% larger. Analyzing the experimental results, STransUnet performs better than the baseline nnU-Net on the "tumor+cyst" HEC and the "tumor only" HEC, whose segmentation labels are smaller, because it benefits from the feature maps extracted by the self-attention mechanism incorporated in the encoder stage: self-attention can capture local and contextual information and adapt better to fine label details. The self-attention mechanism allows the network to focus on relevant regions at different positions in the image, so object details can be attended to more accurately in the segmentation task. In particular, self-attention dynamically assigns weights within the feature map to distinguish the importance of different positions, which is very beneficial for smaller labels because attention can be concentrated on these tiny areas and their features captured better; this helps the network discern boundaries, textures and other critical features more accurately, improving segmentation accuracy. Self-attention can also capture contextual information, so even when the object is small, the context of the surrounding areas can be exploited to improve segmentation, which is important for keeping small objects consistent with their surroundings. The Swin Transformer window size in the invention is (7, 7, 7), which can be understood as a receptive field of size (7, 7, 7); contextual information is integrated through the sliding windows, preserving the detail and integrity of certain local features to some extent and improving model performance. In summary, the self-attention mechanism introduced by STransUnet helps the network better adapt to segmentation tasks with smaller labels, improving segmentation accuracy and detail-capturing ability.
Model lightweighting:
In STransUnet, the parameters of the convolution modules amount to 35.54M, 97.56% of the model's parameters, so simplifying the convolution modules is meaningful for making STransUnet lightweight. The convolution module of STransUnet is a stacked convolution, and the two stacked convolution layers are modified in turn under four schemes: (1) the first convolution layer is changed to a PW convolution and the second is kept as a standard convolution; (2) the first layer is changed to a depthwise separable convolution (DS) and the second is kept standard; (3) the first layer is changed to DS and the second to PW; (4) both layers are changed to DS. All other configurations are kept unchanged, and the experimental results are shown in the following table:
Table 4 Convolution module lightweighting experiment results
It can be seen from the table that after a standard convolution is changed to a PW convolution, segmentation precision drops because the convolution kernel shrinks; adding a DW convolution before the PW convolution (i.e. changing it into a depthwise separable convolution, DS) compensates for part of the lost precision because the DW convolution extracts spatial information. The best average precision is achieved by STransUnet itself, and the second best by STransUnet after the depthwise separable convolution replacement, whose average precision drops by 0.07%.
The parameters and the inference time of each network model are as follows:
TABLE 5 Parameter count and time complexity of each network model
As can be seen from the table, the network with the smallest parameter count is DS+PW, i.e. the model whose first stacked convolution layer is changed to a depthwise separable convolution and whose second layer is changed to a PW convolution: its parameter count is 6.48M, 17.79% of STransUnet's, and one training round takes 78.30 seconds less than STransUnet. The model with the shortest inference time is PW+Standard, i.e. the network whose first stacked convolution layer is changed to a PW layer; its inference time is 9.18 seconds shorter than STransUnet's.
Considering model precision, parameter count and time complexity together, the DS+DS model is selected as the lightweight STransUnet: with a small precision loss (0.07%), it simplifies the model parameters (to 17.92% of the original) and reduces the time complexity (training time shortened by 78.12 seconds and inference time by 5.13 seconds).
Then, a re-parameterization algorithm is used for the lightweight stransuret to further simplify the model, before re-parameterization, the network structure of the stransuret needs to be finely adjusted, an IN layer is changed into a BN layer, all BN layers and corresponding convolution layer parameters thereof are found out to carry out re-parameterization, and a re-parameterization calculation formula is as follows:
F′ j,:,:,: =bn.weight/std·F j,:,:,:
b′ j =bn.bias-bn.running_mean·gamma/std,
Where bn.running_var, bn.eps are parameters in the BN layer, representing the run-time variance of each feature tracked and accumulated during training and a small one to prevent division operationsZero dividing error constant value for ensuring numerical stability, F' j And b' j Is the re-parameterized convolution kernel weight and offset.
After recalculating the convolution kernel parameters, the model architecture is changed: the BN layers are deleted and the re-parameterized parameter values are reloaded.
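A hedged PyTorch sketch of this folding step, following the formulas above; it mutates a trained Conv3d/BatchNorm3d pair in place so the BN layer can then be dropped.

```python
import torch

@torch.no_grad()
def fold_bn(conv: torch.nn.Conv3d, bn: torch.nn.BatchNorm3d):
    std = torch.sqrt(bn.running_var + bn.eps)        # per-channel std
    scale = bn.weight / std                          # gamma / std
    conv.weight.mul_(scale.reshape(-1, 1, 1, 1, 1))  # F' = (gamma/std) * F
    bias = conv.bias.detach() if conv.bias is not None \
        else torch.zeros_like(bn.bias)
    # b' = beta + (b - running_mean) * gamma / std
    conv.bias = torch.nn.Parameter(bn.bias + (bias - bn.running_mean) * scale)
    return conv  # the BN layer can now be deleted and the weights reloaded
```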
Ablation experiments were performed for each improvement, with the following table of experimental results:
Table 6 Ablation experiment results

Network model     | D1     | D2     | D3     | MD     | S1     | S2     | S3     | MS     | Avg
nnU-Net           | 0.9751 | 0.8787 | 0.8484 | 0.9001 | 0.9467 | 0.7651 | 0.7227 | 0.8115 | 0.8558
STransUnet        | 0.9748 | 0.8822 | 0.8541 | 0.9037 | 0.9437 | 0.7685 | 0.7310 | 0.8144 | 0.8591
STransUnet+DS     | 0.9750 | 0.8769 | 0.8466 | 0.8995 | 0.9448 | 0.7719 | 0.7355 | 0.8174 | 0.8585
STransUnet+DS+Rep | 0.9750 | 0.8769 | 0.8465 | 0.8995 | 0.9447 | 0.7719 | 0.7355 | 0.8174 | 0.8585
The parameters and time complexity of each model are as follows:
TABLE 7 Parameter count and time complexity of each model
The re-parameterized STransUnet is computed from the parameters of the STransUnet+DS model. Compared with STransUnet+DS, the re-parameterized model removes the BN layer parameters, 7680 bytes in total; its precision is almost unchanged, with only the D3 and S1 indices dropping by 0.0001; and its time complexity is reduced, with the lightweight model's inference time shortened by 4.06 seconds.
Compared with the baseline network nnU-Net, the lightweight STransUnet's average Dice index drops by 0.0006, its average Surface Dice index improves by 0.0059, and its overall average index improves by 0.0027; its parameter count is reduced by 29.02M, to 18.35% of the baseline network's, and its inference time is shortened by 6.69 seconds.
Compared with the prior art, the technical solution of the present application has the following beneficial effects:
The medical image segmentation system completes the selection, division and preprocessing of a data set; builds a segmentation network framework; performs module design on the selected baseline network; and selects an optimization strategy, a supervision strategy and a suitable loss function. On the basis of the designed model it performs model simplification, reducing the computation and storage resource requirements of the model and improving its running efficiency and deployment flexibility. According to the re-parameterization principle, and based on the correlation and redundancy among the model parameters, it re-parameterizes the model and reduces the number of parameters through reasonable parameter fusion and approximate representation. Through in-depth study of image segmentation algorithms, an algorithm is designed and realized for three-dimensional CT images of kidney tumors, providing an efficient and accurate solution; together with the re-parameterization of the model, the deep-learning-based kidney tumor image segmentation system provides powerful support for the diagnosis and treatment of kidney tumors.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made hereto without departing from the spirit and principles of the present invention.

Claims (10)

1. A medical image segmentation system, characterized by comprising a preprocessing module, an image segmentation module and a network lightweight module;
wherein, preprocessing module is used for: preprocessing an input image;
an image segmentation module for: inputting the preprocessed image into an image segmentation network and obtaining segmentation results;
the network lightweight module is used for: the network is lightweight.
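Purely as an illustration of the claimed three-module composition, the following PyTorch-style sketch shows one way the modules could be wired together; every name and interface here is the editor's assumption, not part of the claim:

import torch.nn as nn

class MedicalImageSegmentationSystem(nn.Module):
    # Hypothetical composition of the three claimed modules.
    def __init__(self, preprocess, seg_net, lighten):
        super().__init__()
        self.preprocess = preprocess  # preprocessing module
        self.seg_net = seg_net        # image segmentation module
        self.lighten = lighten        # network lightweight module

    def forward(self, image):
        x = self.preprocess(image)    # preprocess the input image
        return self.seg_net(x)        # obtain the segmentation result

    def make_lightweight(self):
        # Apply the network lightweight module offline after training
        # (see the sketches after claims 8 and 10 for its two steps).
        self.seg_net = self.lighten(self.seg_net)
        return self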
2. A medical image segmentation system according to claim 1, wherein the image segmentation module operates as follows:
S1, inputting data synchronously into a CNN network and a self-attention network to obtain feature extraction maps from the two branches;
S2, concatenating the feature map extracted by the CNN and the feature map extracted by self-attention along the channel dimension;
S3, connecting the concatenated feature map with the upsampled feature map in the decoder through a skip connection, and performing further feature extraction and upsampling using convolution layers;
S4, calculating the loss function between the segmentation results of the final four stages and the label, assigning weights according to image size, weighting and summing the per-stage losses to obtain the final loss function, and updating the model parameters, as sketched below.
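As a minimal PyTorch-style sketch of the weighted deep-supervision loss in step S4 (the choice of cross-entropy as the base loss, the four-stage list, and nearest-neighbour label resizing are assumptions by the editor, not parts of the claim):

import torch
import torch.nn.functional as F

def deep_supervision_loss(stage_logits, label, base_loss=F.cross_entropy):
    # stage_logits: logits of the four supervised stages, coarse to fine;
    # label: full-resolution ground-truth label map of shape (N, D, H, W).
    # Weight each stage by its spatial size ("weight distribution according
    # to the image size"), then normalize the weights to sum to 1.
    weights = torch.tensor([float(l[0, 0].numel()) for l in stage_logits])
    weights = weights / weights.sum()
    total = 0.0
    for w, logits in zip(weights, stage_logits):
        # Resize the label to this stage's resolution; nearest-neighbour
        # interpolation keeps the labels integral.
        target = F.interpolate(label[:, None].float(), size=logits.shape[2:],
                               mode="nearest").squeeze(1).long()
        total = total + w * base_loss(logits, target)
    return total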
3. The medical image segmentation system according to claim 1, wherein the network lightweight module operates as follows:
1) Replacing the standard convolutions in the model with depthwise separable convolutions;
2) Replacing the IN layer with a BN layer, performing re-parameterization calculation on the trained model parameters to obtain new model parameters, building a new model with the BN layer removed, and loading the re-parameterized parameters into it to complete the re-parameterization.
4. A medical image segmentation system according to claim 1, characterized in that the system employs an encoder-decoder architecture, the encoder being composed of a CNN network and a self-attention network; each stage of the CNN part of the encoder comprises two convolution layers, the first convolution layer having a stride of 2 and twice as many output channels as input channels, and the second convolution layer having a stride of 1 and as many output channels as input channels; each convolution layer is followed by an InstanceNorm layer and a LeakyReLU layer for feature normalization and nonlinear modeling.
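A minimal PyTorch sketch of one such encoder stage; the kernel size of 3 and the use of 3D convolutions (for the CT volumes the system targets) are assumptions by the editor:

import torch.nn as nn

def encoder_stage(in_ch):
    # First conv: stride 2, doubles the channels; second conv: stride 1,
    # keeps the channels; each conv is followed by InstanceNorm + LeakyReLU.
    return nn.Sequential(
        nn.Conv3d(in_ch, 2 * in_ch, kernel_size=3, stride=2, padding=1),
        nn.InstanceNorm3d(2 * in_ch),
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(2 * in_ch, 2 * in_ch, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm3d(2 * in_ch),
        nn.LeakyReLU(inplace=True),
    )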
5. The medical image segmentation system according to claim 4, wherein each stage of the self-attention part of the encoder includes two self-attention computation layers, the first being a standard self-attention layer and the second a sliding-window self-attention layer; the decoder upsamples the input from the preceding layer, each decoding block being formed by a deconvolution layer for upsampling followed by two convolution layers, the first convolution layer reducing the number of output channels and the second convolution layer keeping the number of output channels constant with a stride of 1; each decoding block is followed by an InstanceNorm layer and a LeakyReLU layer.
6. A medical image segmentation system according to claim 4, wherein the fused encoder feature map is transferred to the decoder by means of a skip connection to achieve fusion of deep features and shallow features.
7. The medical image segmentation system according to claim 6, wherein the skip-connected feature map is merged with the decoder's upsampled feature map along the channel dimension.
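Taken together, claims 5 to 7 describe a decoding block of the following shape; this PyTorch sketch is illustrative only, and the kernel sizes and 3D setting are assumptions by the editor:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # Deconvolution (transposed convolution) for upsampling (claim 5).
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        # First conv reduces channels; second keeps them, stride 1 (claim 5).
        self.conv1 = nn.Conv3d(out_ch + skip_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, stride=1, padding=1)
        self.norm1 = nn.InstanceNorm3d(out_ch)
        self.norm2 = nn.InstanceNorm3d(out_ch)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x, skip):
        x = self.up(x)
        # Merge the skip-connected feature map along the channel
        # dimension (claims 6 and 7).
        x = torch.cat([x, skip], dim=1)
        x = self.act(self.norm1(self.conv1(x)))
        return self.act(self.norm2(self.conv2(x)))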
8. A medical image segmentation system according to any one of claims 1 to 7, wherein the standard convolutions in the model are replaced with depthwise separable convolutions: in the stacked convolution layers of the encoder and decoder, the standard convolution layer is changed to a grouped convolution whose number of groups equals the number of input channels (a DW convolution), and a standard convolution with a kernel size of 1 is added after the DW convolution.
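A sketch of this replacement in PyTorch (the default kernel size and the padding rule are assumptions by the editor):

import torch.nn as nn

def depthwise_separable(in_ch, out_ch, kernel_size=3, stride=1):
    # DW convolution: grouped convolution with groups == input channels,
    # i.e. one spatial filter per channel, followed by a standard
    # kernel-size-1 (pointwise) convolution that mixes channels, per claim 8.
    return nn.Sequential(
        nn.Conv3d(in_ch, in_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=in_ch),
        nn.Conv3d(in_ch, out_ch, kernel_size=1),
    )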
9. A medical image segmentation system according to any one of claims 1 to 8, wherein the model is re-parameterized: a re-parameterization algorithm is used to reduce the number of model parameters; the IN layer is replaced by a BN layer for training, and the trained model parameters are re-parameterized to obtain new model parameters. The output of a convolution layer followed by BN is calculated as follows:

O_{j,:,:} = γ_j · ((I ∗ F)_{j,:,:} − μ_j) / σ_j + β_j

wherein I is the input, F denotes the convolution kernel parameters, j is the channel index, μ_j and σ_j are the accumulated channel mean and standard deviation, and γ_j and β_j are the learned scaling factor and bias term, respectively.
10. A medical image segmentation system according to any one of claims 1 to 9, characterized in that the model parameters after the re-parameterization are as follows:

F′_{j,:,:,:} = (γ_j / σ_j) · F_{j,:,:,:}

b_j = −μ_j · γ_j / σ_j + β_j
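A minimal PyTorch sketch of the BN fusion these formulas express; the handling of an existing convolution bias and of BatchNorm's eps term goes beyond the claim and is an assumption by the editor:

import torch

@torch.no_grad()
def fuse_conv_bn(conv, bn):
    # Fold a trained BatchNorm layer into the preceding convolution:
    # F' = (gamma / sigma) * F  and  b = -mu * gamma / sigma + beta.
    sigma = torch.sqrt(bn.running_var + bn.eps)   # per-channel std
    scale = bn.weight / sigma                     # gamma_j / sigma_j
    # Scale every output-channel slice of the kernel.
    conv.weight.mul_(scale.view(-1, *([1] * (conv.weight.dim() - 1))))
    old_bias = conv.bias if conv.bias is not None \
        else torch.zeros_like(bn.running_mean)
    conv.bias = torch.nn.Parameter((old_bias - bn.running_mean) * scale
                                   + bn.bias)
    return conv

After fusion the BN layer can be dropped and the new parameters loaded into a BN-free copy of the model, as described in claim 3.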
CN202311620496.8A 2023-11-30 2023-11-30 Medical image segmentation system Pending CN117593275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311620496.8A CN117593275A (en) 2023-11-30 2023-11-30 Medical image segmentation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311620496.8A CN117593275A (en) 2023-11-30 2023-11-30 Medical image segmentation system

Publications (1)

Publication Number Publication Date
CN117593275A true CN117593275A (en) 2024-02-23

Family

ID=89909665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311620496.8A Pending CN117593275A (en) 2023-11-30 2023-11-30 Medical image segmentation system

Country Status (1)

Country Link
CN (1) CN117593275A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786823A (en) * 2024-02-26 2024-03-29 陕西天润科技股份有限公司 Light weight processing method based on building monomer model
CN117786823B (en) * 2024-02-26 2024-05-03 陕西天润科技股份有限公司 Light weight processing method based on building monomer model

Similar Documents

Publication Publication Date Title
CN113077471B (en) Medical image segmentation method based on U-shaped network
Žbontar et al. Stereo matching by training a convolutional neural network to compare image patches
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
CN111784671B (en) Pathological image focus region detection method based on multi-scale deep learning
CN110930421A (en) Segmentation method for CBCT (Cone Beam computed tomography) tooth image
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN111696110B (en) Scene segmentation method and system
CN111462012A (en) SAR image simulation method for generating countermeasure network based on conditions
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN115661144A (en) Self-adaptive medical image segmentation method based on deformable U-Net
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN112949838A (en) Convolutional neural network based on four-branch attention mechanism and image segmentation method
CN117593275A (en) Medical image segmentation system
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
CN116664605B (en) Medical image tumor segmentation method based on diffusion model and multi-mode fusion
CN115018888A (en) Optical flow unsupervised estimation method based on Transformer
CN111932458B (en) Image information extraction and generation method based on inter-region attention mechanism
CN111445496B (en) Underwater image recognition tracking system and method
CN115953784A (en) Laser coding character segmentation method based on residual error and feature blocking attention
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN117437423A (en) Weak supervision medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement
CN113837179A (en) Multi-discriminant GAN network construction method, device and system for processing images and storage medium
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN116778164A Semantic segmentation method for improving DeeplabV3+ network based on multi-scale structure
CN116129417A (en) Digital instrument reading detection method based on low-quality image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination