CN117994517A - Method capable of accurately segmenting medical images - Google Patents

Method capable of accurately segmenting medical images

Info

Publication number
CN117994517A
CN117994517A
Authority
CN
China
Prior art keywords
stage
feature map
encoder
module
resolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410174697.8A
Other languages
Chinese (zh)
Inventor
兰利彬
夏遵辉
冯欣
谭暑秋
蔡鹏洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202410174697.8A priority Critical patent/CN117994517A/en
Publication of CN117994517A publication Critical patent/CN117994517A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a method capable of accurately segmenting medical images, which comprises the following steps: obtaining a public dataset, then building and training a U-shaped hybrid CNN-Transformer network named BRAU-Net++. BRAU-Net++ uses bi-level routing attention as the core building block to design a U-shaped encoder-decoder structure, in which both the encoder and the decoder are built hierarchically so as to learn global semantic information while reducing computational complexity. In addition, BRAU-Net++ redesigns the skip connections by integrating channel-spatial attention, which adopts convolution operations to minimize the loss of local spatial information and amplify the global dimension interaction of multi-scale features. Finally, a test medical image is input into the trained CNN-Transformer network, and the output is the segmentation result of the test medical image. Extensive experiments performed on datasets of three different imaging modalities demonstrate the universality and robustness of the method for multi-modality medical image segmentation tasks.

Description

Method capable of accurately segmenting medical images
Technical Field
The invention relates to the technical field of medical image segmentation, in particular to a method capable of accurately segmenting medical images.
Background
Accurate and robust medical image segmentation is an integral part of computer-aided diagnosis systems, in particular in image-guided clinical surgery, disease diagnosis, treatment planning, clinical quantification, etc. Medical image segmentation is generally considered to be substantially identical to natural image segmentation, the corresponding techniques of which are generally derived from the latter. Common to both fields is that they extract the exact regions of interest (ROIs) of the image as a research target, either manually or automatically. With the aid of deep learning techniques, segmentation tasks in natural image vision have gained attractive performance. But unlike natural image segmentation, medical image segmentation requires more accurate ROI segmentation results, such as organs, lesions, and abnormalities, to quickly identify ROI boundaries and accurately assess the level of the ROI. This is because subtle segmentation errors in medical images may reduce the user experience and increase the risk of subsequent computer-aided diagnosis. Furthermore, manually delineating the ROI and its boundaries in various imaging modalities requires a lot of effort, is very time consuming and even impractical, and the final segmentation results may be affected by clinician preferences and expertise. Therefore, it is considered critical to develop intelligent and robust techniques to efficiently and accurately segment organs, lesions, and abnormal regions in medical images.
With the development and widespread, promising application of deep learning, many medical image segmentation methods have been proposed that rely on convolution operations to segment specific target objects in medical images. Among these approaches, U-shaped architectures such as U-Net and Fully Convolutional Networks (FCNs) have been dominant in medical image segmentation. Subsequent variants, such as U-Net++, U-Net3+, Attention U-Net, 3D U-Net, and V-Net, have also been developed for 2D and 3D medical image segmentation of different imaging modalities and have achieved outstanding success in numerous medical applications, such as multi-organ segmentation, skin lesion segmentation, and polyp segmentation. This suggests that Convolutional Neural Networks (CNNs) have a strong ability to learn semantic information. However, they often have limitations in explicitly capturing long-range dependencies due to the inherent locality of convolution operations. To address this limitation, some studies have proposed expanding the receptive field by deeply stacking standard convolutions or by using dilated convolutions, or by establishing self-attention mechanisms on top of CNN features. However, these methods do not significantly improve the ability to model long-range dependencies.
Inspired by the successful application of the Transformer to Natural Language Processing (NLP), many studies have attempted to introduce the Transformer into the vision field. These efforts have led to consistent improvements on various visual tasks, indicating that vision Transformers have great potential. However, conventional Transformers are often plagued by high computational cost and large memory usage, and suffer from model-efficiency problems in long-sequence scenarios. The most common improvement is to introduce a sparsity bias into the ordinary attention mechanism, i.e., to employ sparse attention rather than global attention in order to reduce computational complexity. Global attention must compute pairwise token similarities at all spatial locations, whereas sparse attention allows each query token to attend to only a small number of key-value tokens rather than the entire sequence. To this end, several hand-crafted static sparse attention methods have been proposed according to specific predefined patterns, such as local attention, dilated attention, axial attention, or deformable attention. In the field of medical image analysis, many studies have also considered introducing Transformers into medical image segmentation tasks, such as nnFormer, UTNet, TransUNet, TransCeption, HiFormer, Focal-UNet, and MISSFormer. However, to our knowledge, only a few studies have considered introducing the sparsity concept into this field, with representative works including Swin-Unet and Gated Axial UNet (MedT). These sparse attention mechanisms, however, merge or select sparse patterns in a manually specified manner; thus, the selected patterns are query-agnostic, i.e., shared by all queries. The application of dynamic, query-dependent sparse attention mechanisms to medical image segmentation has not yet been extensively explored.
Disclosure of Invention
Aiming at the problems existing in the prior art, the technical problem to be solved by the invention is: how to perform accurate medical image segmentation.
In order to solve the technical problems, the invention adopts the following technical scheme: a method for accurately segmenting medical images, comprising the steps of:
s1: and acquiring the public data set and a label corresponding to each picture in the public data set.
S2: a U-shaped hybrid CNN-fransformer network is constructed and trained that includes seven phases, including an encoder, bottlennek, and a decoder in sequence, and a hopped connection channel-space attention SCCSA module.
The encoder and the decoder are each constructed hierarchically using a three-stage pyramid structure; the encoder comprises the first to third stages, the Bottleneck is the fourth stage, and the decoder comprises the fifth to seventh stages.
The SCCSA module includes a channel attention sub-module and a spatial attention sub-module. First, the outputs from the encoder and the decoder are concatenated; then, the channel attention sub-module uses an MLP to enhance cross-dimensional channel-spatial dependencies; in the spatial attention sub-module, convolution layers are used to focus spatial information.
Each picture input to the encoder is a three-channel picture of size H×W×3. The resolution of the feature map obtained after the first stage is (H/4)×(W/4); after the second stage the resolution of the feature map is (H/8)×(W/8); after the third stage the resolution of the feature map is (H/16)×(W/16). The feature map output by the third stage of the encoder has a resolution of (H/32)×(W/32) after Bottleneck processing. The feature map of resolution (H/32)×(W/32) enters the patch expanding layer of the fifth stage, where the feature-map resolution becomes (H/16)×(W/16). Then, the feature map obtained by processing this feature map together with the feature map from the third stage of the encoder through the SCCSA module enters the sixth stage, and a resolution of (H/8)×(W/8) is obtained after the patch expanding layer of the sixth stage. The feature map obtained by processing this feature map together with the feature map from the second stage through the SCCSA module enters the seventh stage, and a resolution of (H/4)×(W/4) is obtained after the patch expanding layer of the seventh stage. The feature map obtained by processing this feature map together with the feature map from the first stage through the SCCSA module then passes through a 4× up-sampling patch expanding layer to obtain a feature map of size H×W×C, followed by a linear mapping layer that outputs the final feature map of size H×W×Class, where Class denotes the number of channels. For each pixel, the value of each channel represents the confidence or probability, assigned by the U-shaped hybrid CNN-Transformer network, that the pixel belongs to the corresponding class; the pixel is assigned to the class whose channel has the largest confidence or probability, and the segmented picture is obtained by determining the class of each pixel in the final feature map.
The CNN-Transformer network is trained using the public dataset, and when the loss of the CNN-Transformer network no longer decreases, the trained CNN-Transformer network is considered to have been obtained.
S3: inputting the new medical picture into a trained CNN-converter network, and outputting the new medical picture to be the picture after the new medical picture is segmented.
Preferably, the encoder and the decoder are each constructed hierarchically in a three-stage pyramid structure, wherein the first stage of the encoder comprises a patch embedding layer and BiFormer blocks, the second and third stages of the encoder each comprise a patch merging layer and BiFormer blocks, and each of the three decoder stages comprises a patch expanding layer and BiFormer blocks.
The patch embedding layer adopts two 3×3 convolution layers to convert the feature dimension of each region into an arbitrary dimension, i.e., the channel number, denoted C.
Preferably, the Bottleneck consists of a patch merging layer and BiFormer blocks, which reduce the resolution and increase the channel number of the encoder output.
Preferably, the SCCSA module specifically comprises: the feature map x1 from the encoder and the feature map x2 from the decoder are input; the intermediate states F1, F2, F3 and the output x3 are expressed as:
F1 = Concat(x1, x2), (8)
F2 = σ(MLP(F1)) ⊗ F1, (9)
F3 = σ(Conv(Conv(F2))) ⊗ F2, (10)
x3 = FC(F3), (11)
wherein F2 and F3 are the outputs of the channel attention sub-module and the spatial attention sub-module, respectively, and ⊗ and σ denote element-wise multiplication and the sigmoid activation function, respectively.
Preferably, the loss function L used for training the U-shaped hybrid CNN-Transformer network is:
L = λ·L_Dice + (1 − λ)·L_CE,
wherein L_Dice denotes the Dice loss and L_CE denotes the cross-entropy loss, N is the number of pixels, g(k,i) ∈ (0,1) and p(k,i) ∈ (0,1) denote the ground-truth label and the predicted probability of pixel i for class k, respectively, K is the number of classes, Σ_k ω_k is the sum of the weights of all classes, λ is the weighting factor balancing the influence of L_Dice and L_CE, ω_k denotes the weight of class k, k is the index of the summation over classes, i is the index of the pixel, and n is a scaling factor.
Compared with the prior art, the invention has at least the following advantages:
The invention provides a U-shaped hybrid CNN-Transformer architecture, called BRAU-Net++, for medical image segmentation tasks. The architecture uses dynamic sparse attention instead of full attention or statically hand-designed sparse attention, and can effectively learn local and global semantic information while reducing computational complexity. Furthermore, a novel module is proposed: the skip connection channel-spatial attention (SCCSA) module, which integrates multi-scale features to compensate for the loss of spatial information and to enhance cross-dimensional interactions. Experimental results show that the method achieves state-of-the-art performance under almost all evaluation metrics on the Synapse multi-organ segmentation, ISIC-2018 challenge, and CVC-ClinicDB datasets, and is particularly good at capturing the features of small targets.
Drawings
Fig. 1 (a) shows the BRAU-Net++ architecture, and fig. 1 (b) shows the skip connection channel-spatial attention (SCCSA) module.
FIG. 2 is a qualitative comparison of the method of the present invention with other state-of-the-art methods on the Synapse multi-organ segmentation dataset.
FIG. 3 is a visual comparison of the method of the present invention with other state-of-the-art methods on the ISIC-2018 challenge and CVC-ClinicDB datasets.
Detailed Description
The present invention will be described in further detail below.
The U-shaped hybrid CNN-Transformer network is named BRAU-Net++. BRAU-Net++ is composed of an encoder, a Bottleneck, a decoder, and SCCSA modules, which together form a U-shaped hybrid network structure. The overall architecture of BRAU-Net++ is shown in FIG. 1 (a). At the top of the network, a linear projection layer is applied to the full-resolution feature map to reduce its dimension to the number of classes for predicting the final pixel-level classification result. The core components of BRAU-Net++ are the BiFormer blocks and the SCCSA modules. The network has 7 stages in total, and each stage from stage 1 to stage 7 contains its own stack of BiFormer blocks (2, 2, and 8 blocks in the three encoder stages and 2 in the Bottleneck, as detailed below). The SCCSA module replaces the traditional skip connection, aggregates features of different scales, and is implemented based on a global attention mechanism so as to minimize the loss of local spatial information and enhance the global dimension interaction of multi-scale features. The details of the SCCSA module are shown in fig. 1 (b). The overall network combines the advantages of self-attention and convolution to enhance the ability to capture long-range dependencies and learn local information. Furthermore, owing to the dynamics and sparsity of the bi-level routing attention, the network has the advantage of low complexity.
A method for accurately segmenting medical images, comprising the steps of:
s1: and acquiring the public data set and a label corresponding to each picture in the public data set.
S2: a U-shaped hybrid CNN-fransformer network is constructed and trained that includes seven phases, including an encoder, bottlennek, and a decoder in sequence, and a hopped connection channel-space attention SCCSA module.
The encoder and the decoder are each constructed hierarchically using a three-stage pyramid structure; the encoder comprises the first to third stages, the Bottleneck is the fourth stage, and the decoder comprises the fifth to seventh stages.
The SCCSA module can effectively compensate for the spatial information loss caused by downsampling and enhance the global dimension interaction of multi-scale features at each stage of the decoder, thereby achieving fine-grained detail recovery when generating the output mask. As shown in fig. 1 (b), the SCCSA module includes a channel attention sub-module and a spatial attention sub-module. First, the outputs from the encoder and the decoder are concatenated to obtain the fused feature map; then, the channel attention sub-module uses a two-layer MLP with a reduction ratio e=4 to enhance cross-dimensional channel-spatial dependencies; in the spatial attention sub-module, two 7×7 convolution layers are used to focus spatial information, because they have a relatively large receptive field.
Each picture input to the encoder is a three-channel picture of size H×W×3. The resolution of the feature map obtained after the first stage is (H/4)×(W/4); after the second stage the resolution of the feature map is (H/8)×(W/8); after the third stage the resolution of the feature map is (H/16)×(W/16). The feature map output by the third stage of the encoder has a resolution of (H/32)×(W/32) after Bottleneck processing. The feature map of resolution (H/32)×(W/32) enters the patch expanding layer of the fifth stage, where the feature-map resolution becomes (H/16)×(W/16). Then, the feature map obtained by processing this feature map together with the feature map from the third stage of the encoder through the SCCSA module enters the sixth stage, and a resolution of (H/8)×(W/8) is obtained after the patch expanding layer of the sixth stage. The feature map obtained by processing this feature map together with the feature map from the second stage through the SCCSA module enters the seventh stage, and a resolution of (H/4)×(W/4) is obtained after the patch expanding layer of the seventh stage. The feature map obtained by processing this feature map together with the feature map from the first stage through the SCCSA module then passes through a 4× up-sampling patch expanding layer to obtain a feature map of size H×W×C, followed by a linear mapping layer that outputs the final feature map of size H×W×Class, where Class denotes the number of channels. For each pixel, the value of each channel represents the confidence or probability, assigned by the U-shaped hybrid CNN-Transformer network, that the pixel belongs to the corresponding class; the pixel is assigned to the class whose channel has the largest confidence or probability, and the segmented picture is obtained by determining the class of each pixel in the final feature map.
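The resolution flow just described can be summarised with the following minimal Python sketch, which prints the feature-map size expected at every stage for an illustrative 224×224×3 input. The base channel number C = 96 and the mirrored decoder channel widths (4C, 2C, C) are assumptions made only for illustration; the description itself fixes the resolutions and the encoder/Bottleneck dimensions C, 2C, 4C, and 8C.

```python
# Sanity check of the expected feature-map shapes in BRAU-Net++.
# Assumptions (not fixed by the description): input 224x224x3, base channel C = 96,
# decoder channel widths mirroring the encoder.
H, W, C = 224, 224, 96

stages = [
    ("stage 1 (encoder, patch embedding)", H // 4,  W // 4,  C),
    ("stage 2 (encoder, patch merging)",   H // 8,  W // 8,  2 * C),
    ("stage 3 (encoder, patch merging)",   H // 16, W // 16, 4 * C),
    ("stage 4 (Bottleneck)",               H // 32, W // 32, 8 * C),
    ("stage 5 (decoder, patch expanding)", H // 16, W // 16, 4 * C),
    ("stage 6 (decoder, patch expanding)", H // 8,  W // 8,  2 * C),
    ("stage 7 (decoder, patch expanding)", H // 4,  W // 4,  C),
]

for name, h, w, c in stages:
    print(f"{name}: {h} x {w} x {c}")

# The final 4x patch expanding restores H x W x C, and a linear projection
# maps the C channels to the number of classes.
print(f"output: {H} x {W} x num_classes")
```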
The CNN-Transformer network is trained using the public dataset, and when the loss of the CNN-Transformer network no longer decreases, the trained CNN-Transformer network is considered to have been obtained.
S3: and inputting the new medical picture into a trained CNN-converter network, and outputting the new medical picture to be a picture after the new medical picture is segmented.
Specifically, the encoder and the decoder are both constructed hierarchically using a three-stage pyramid structure, wherein the first stage of the encoder comprises a patch embedding layer and BiFormer blocks, the second and third stages of the encoder each comprise a patch merging layer and BiFormer blocks, and each of the three decoder stages comprises a patch expanding layer and BiFormer blocks. In the invention, the number of BiFormer blocks in the encoder stages is set to 2, 2, and 8 in turn.
The patch embedding layer adopts two 3×3 convolution layers to convert the feature dimension of each region (for example, in stage 1 the resolution of the feature map is (H/4)×(W/4), and the feature dimension of each region is 8×8 = 64) into an arbitrary dimension, i.e., the channel number, denoted C; the patch merging layer uses a 3×3 convolution layer to halve the spatial resolution of the feature map while increasing the dimension by a factor of 2.
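A hedged PyTorch sketch of these two layers is shown below. The two 3×3 convolutions of the patch embedding and the single 3×3 merging convolution follow the description; the stride-2 configuration, the intermediate width, the GELU activation, and the omission of normalisation layers are assumptions made only so that the example runs.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Two 3x3 convolutions mapping H x W x 3 to (H/4) x (W/4) x C.
    Stride-2 convolutions, GELU, and the intermediate width are assumptions;
    the description fixes only the kernel size and the overall 4x reduction."""
    def __init__(self, in_ch=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        )
    def forward(self, x):
        return self.proj(x)

class PatchMerging(nn.Module):
    """One 3x3 convolution that halves the spatial resolution and doubles the channels."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
    def forward(self, x):
        return self.reduction(x)

x = torch.randn(1, 3, 224, 224)
feat = PatchEmbedding()(x)      # -> (1, 96, 56, 56)
feat = PatchMerging(96)(feat)   # -> (1, 192, 28, 28)
print(feat.shape)
```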
As shown in fig. 1, at stage 1 the tokenized inputs of S×S regions (each region having a dimension of 64) with C channels per region are fed into two consecutive BiFormer blocks to learn the feature representation. In stage 2, the first patch merging layer performs 2× downsampling, reducing the resolution to (H/8)×(W/8), while the feature dimension is increased by a factor of 2 to 2C. In stage 3, the process is similar to stage 2, with a resolution of (H/16)×(W/16) and a dimension of 4C.
Specifically, bottlennek consists of a patch merging layer and BiFormer blocks, and reduces resolution and improves channel number for the output of the encoder; wherein the number of BiFormer blocks is set to 2. The patch merging layer increases the dimension to 8C, i.e., the dimension of each region is 8C, and the resolution of the feature map is reduced toI.e. each area has a size of 1 x 1; that is, each region is now one pixel. The resolution and dimensions of the feature map passed through two consecutive BiFormer blocks remain unchanged.
Specifically, the steps of the SCCSA module include: the feature map x1 from the encoder and the feature map x2 from the decoder are input; the intermediate states F1, F2, F3 and the output x3 are expressed as:
F1 = Concat(x1, x2), (15)
F2 = σ(MLP(F1)) ⊗ F1, (16)
F3 = σ(Conv7×7(Conv7×7(F2))) ⊗ F2, (17)
x3 = FC(F3), (18)
wherein F2 and F3 are the outputs of the channel attention sub-module and the spatial attention sub-module, respectively, and ⊗ and σ denote element-wise multiplication and the sigmoid activation function, respectively.
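A minimal PyTorch sketch of this module is given below, assuming the operation order just described (concatenation, MLP-based channel attention with reduction ratio 4, two 7×7 convolutions for spatial attention, and a final fully connected projection). The intermediate channel widths, the activation functions, and the use of a 1×1 convolution as the final FC layer are illustrative assumptions rather than details fixed by the description.

```python
import torch
import torch.nn as nn

class SCCSA(nn.Module):
    """Sketch of the skip connection channel-spatial attention (SCCSA) module."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        c2 = 2 * channels  # concatenated encoder + decoder features
        # Channel attention sub-module: two-layer MLP applied per spatial position.
        self.channel_mlp = nn.Sequential(
            nn.Linear(c2, c2 // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(c2 // reduction, c2),
        )
        # Spatial attention sub-module: two 7x7 convolutions.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(c2, c2 // reduction, kernel_size=7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(c2 // reduction, c2, kernel_size=7, padding=3),
        )
        # Final fully connected (1x1 convolution) projection back to `channels`.
        self.fc = nn.Conv2d(c2, channels, kernel_size=1)

    def forward(self, x1, x2):
        f1 = torch.cat([x1, x2], dim=1)                 # F1 = Concat(x1, x2)
        attn_c = torch.sigmoid(
            self.channel_mlp(f1.permute(0, 2, 3, 1))    # MLP over the channel dimension
        ).permute(0, 3, 1, 2)
        f2 = attn_c * f1                                # channel attention output
        f3 = torch.sigmoid(self.spatial_conv(f2)) * f2  # spatial attention output
        return self.fc(f3)                              # x3 = FC(F3)

enc = torch.randn(1, 96, 56, 56)   # encoder feature (illustrative shape)
dec = torch.randn(1, 96, 56, 56)   # decoder feature
print(SCCSA(96)(enc, dec).shape)   # -> torch.Size([1, 96, 56, 56])
```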
Specifically, the loss function L used for training the U-shaped hybrid CNN-Transformer network is:
L = λ·L_Dice + (1 − λ)·L_CE,
wherein L_Dice denotes the Dice loss and L_CE denotes the cross-entropy loss, N is the number of pixels, g(k,i) ∈ (0,1) and p(k,i) ∈ (0,1) denote the ground-truth label and the predicted probability of pixel i for class k, respectively, K is the number of classes, Σ_k ω_k is the sum of the weights of all classes, λ is the weighting factor balancing the influence of L_Dice and L_CE, ω_k denotes the weight of class k, k is the index of the summation over classes, i is the index of the pixel, and n is typically a scaling factor used to control the value range of the loss function.
In all our experiments, ω_k and λ were set empirically, with λ set to 0.6.
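A possible PyTorch realisation of this combined loss is sketched below. Only the weighted Dice + cross-entropy combination with λ = 0.6 is taken from the description; the default per-class weights of 1/K and the smoothing constant are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Sketch of L = lambda * L_Dice + (1 - lambda) * L_CE.
    Per-class weights w_k = 1/K and the smoothing term are assumptions."""
    def __init__(self, num_classes, lam=0.6, class_weights=None, smooth=1e-5):
        super().__init__()
        self.num_classes = num_classes
        self.lam = lam
        self.smooth = smooth
        self.class_weights = class_weights or [1.0 / num_classes] * num_classes

    def forward(self, logits, target):
        # logits: (B, K, H, W); target: (B, H, W) with integer class labels.
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        dice = 0.0
        for k, w_k in enumerate(self.class_weights):
            p, g = probs[:, k], one_hot[:, k]
            inter = (p * g).sum()
            dice += w_k * (2 * inter + self.smooth) / (p.sum() + g.sum() + self.smooth)
        dice_loss = 1.0 - dice / sum(self.class_weights)
        return self.lam * dice_loss + (1.0 - self.lam) * ce

logits = torch.randn(2, 9, 224, 224)           # e.g. 8 organs + background
target = torch.randint(0, 9, (2, 224, 224))
print(DiceCELoss(num_classes=9)(logits, target))
```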
1. Experimental setup
(1) Data set
We trained and tested the proposed BRAU-Net++ on three publicly available medical image segmentation datasets: Synapse multi-organ segmentation, ISIC-2018 challenge, and CVC-ClinicDB. The details of the data splits are shown in Table 1. All these datasets are relevant to clinical diagnosis, so their segmentation results are crucial for the treatment of patients; they comprise images of different modalities and their corresponding ground-truth label masks. We deliberately chose these diverse imaging-modality datasets to assess the universality and robustness of the proposed method. More detailed information about these datasets is as follows.
TABLE 1. Details of the medical segmentation datasets used in the experiments
Synapse multi-organ segmentation dataset: the dataset used in the experiments includes 30 abdominal Computed Tomography (CT) scans from the MICCAI multi-atlas abdominal labeling challenge, containing 3,779 abdominal CT images in total. Each CT volume includes 85-198 slices of 512×512 pixels, with a voxel spatial resolution of ([0.54-0.54] × [0.98-0.98] × [2.5-5.0]) mm³. The training set and the test set contain 18 samples (with 2,212 axial slices) and 12 samples, respectively.
ISIC-2018 challenge dataset: the dataset used is the training set of the lesion segmentation task in the ISIC-2018 challenge, which contains 2,594 dermoscopic images with ground-truth segmentation annotations. We performed five-fold cross-validation to evaluate the performance of the model and selected the best model for inference.
CVC-ClinicDB dataset: the CVC-ClinicDB dataset is typically used for polyp segmentation tasks. It is also the training dataset of the MICCAI automatic polyp detection challenge. The dataset contains 612 images, randomly split into 490 training images, 61 validation images, and 61 test images.
(2) Evaluation index
To evaluate the performance of BRAU-Net++, the average Dice Similarity Coefficient (DSC) and average Hausdorff Distance (HD) over 8 abdominal organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach) are used as evaluation indices, while DSC alone is used for the evaluation of individual organs. In addition, for model performance on the ISIC-2018 challenge and CVC-ClinicDB datasets, mean Intersection over Union (mIoU), DSC, accuracy, precision, recall, etc. are used as evaluation metrics. Specifically, predictions can be divided into true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), and then DSC, IoU, accuracy, precision, and recall are calculated as follows:
DSC = 2TP / (2TP + FP + FN), IoU = TP / (TP + FP + FN), Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN).
HD can be described as:
HD(Y, Ŷ) = max{ max_{y∈Y} min_{ŷ∈Ŷ} d(y, ŷ), max_{ŷ∈Ŷ} min_{y∈Y} d(y, ŷ) },
wherein Y and Ŷ are the ground-truth label mask and the predicted segmentation map, respectively, and d(y, ŷ) denotes the Euclidean distance between points y and ŷ.
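For reference, the following Python sketch computes these metrics from a binary prediction and a binary ground-truth mask; the small smoothing constant eps and the naive point-set Hausdorff distance are implementation conveniences rather than details taken from the description.

```python
import numpy as np

def confusion_counts(pred, gt):
    """pred, gt: binary numpy arrays of the same shape."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    tn = np.sum((pred == 0) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    return tp, fp, tn, fn

def segmentation_metrics(pred, gt, eps=1e-7):
    tp, fp, tn, fn = confusion_counts(pred, gt)
    return dict(
        DSC=2 * tp / (2 * tp + fp + fn + eps),
        IoU=tp / (tp + fp + fn + eps),
        Accuracy=(tp + tn) / (tp + tn + fp + fn + eps),
        Precision=tp / (tp + fp + eps),
        Recall=tp / (tp + fn + eps),
    )

def hausdorff_distance(points_y, points_yhat):
    """Symmetric Hausdorff distance between two point sets given as (N, 2) arrays."""
    d = np.linalg.norm(points_y[:, None, :] - points_yhat[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

pred = np.random.randint(0, 2, (64, 64))
gt = np.random.randint(0, 2, (64, 64))
print(segmentation_metrics(pred, gt))
print(hausdorff_distance(np.argwhere(gt == 1), np.argwhere(pred == 1)))
```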
(3) Implementation details
We trained our BRAU-Net++ model and its various ablation variants on an NVIDIA RTX 3090 graphics card with 24 GB of memory. The method of the present invention is implemented with Python 3.10 and PyTorch 2.0. During training, models were initialized and fine-tuned using BiFormer weights pre-trained on ImageNet-1K; we also trained the proposed model from scratch, but only on the Synapse multi-organ segmentation dataset, taking space constraints into account. On these resulting models, we performed a series of ablation studies to analyze the contribution of each component.
For the Synapse multi-organ segmentation dataset, we resized all images to 224×224 resolution and trained for 400 epochs using stochastic gradient descent, with a batch size of 24, a learning rate of 0.05, a momentum of 0.9, and a weight decay of 1e-4. For the ISIC-2018 challenge and CVC-ClinicDB datasets, we resized all images to 256×256 resolution and used the Adam optimizer to train all models for 200 epochs with a batch size of 16. We used CosineAnnealingLR scheduling with an initial learning rate of 5e-4. Data augmentation with a probability of 0.25, such as horizontal flip, vertical flip, rotation, and cutout, was used to enhance data diversity.
Other hyper-parameters are also set empirically. For example, the region partition factor S is set to 7 and 8 for the resolutions 224×224 and 256×256, respectively. The top-k values from stage 1 to stage 7 are set to 2, 4, 8, S², 8, 4, and 2, respectively, where S² means that full attention is used.
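The optimizer, scheduler, and top-k settings listed above can be written down roughly as follows; the placeholder model and the T_max value of the cosine schedule are illustrative assumptions, not settings taken from the description.

```python
import torch

# Hypothetical placeholder model; any nn.Module with parameters would do for this sketch.
model = torch.nn.Conv2d(3, 9, kernel_size=1)

# Synapse multi-organ dataset: SGD, 400 epochs, batch size 24 (per the description).
sgd = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)

# ISIC-2018 / CVC-ClinicDB: Adam with cosine-annealing schedule, 200 epochs, batch size 16.
adam = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(adam, T_max=200)  # T_max assumed

# Top-k routing schedule for the seven stages (S is the region partition factor).
S = 7                                # for 224 x 224 inputs; S = 8 for 256 x 256
top_k = [2, 4, 8, S * S, 8, 4, 2]    # S*S at the bottleneck means full attention
print(top_k)
```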
2. Experimental results
We compare in detail the performance of BRAU-Net++ with other state-of-the-art methods, including CNN-based, Transformer-based, and hybrid CNN-Transformer methods, on the Synapse multi-organ segmentation, ISIC-2018 challenge, and CVC-ClinicDB datasets.
(1) Comparison on the Synapse multi-organ segmentation dataset
Automated multi-organ abdominal CT segmentation plays an important role in improving the efficiency of clinical workflows, including disease diagnosis, prognosis analysis, and treatment planning. Therefore, we chose this dataset to evaluate the performance of the various methods. We compared our proposed method with the previous state-of-the-art methods on the Synapse multi-organ abdominal CT segmentation dataset, with DSC and HD as indicators; the best results are shown in bold in Table 2.
BRAU-Net++ is significantly better than the CNN-based methods and our baseline BRAU-Net under both the DSC and HD evaluation metrics. Compared with the two mainstream Transformer-based methods TransUNet and Swin-Unet, BRAU-Net++ improves DSC by 4.49% and 3.34% and reduces HD by 12.62 mm and 2.48 mm, respectively. This suggests that using bi-level routing attention as the core building block to design a U-shaped encoder-decoder structure helps to learn global semantic information effectively. More specifically, BRAU-Net++ consistently outperforms the other approaches in the segmentation of most organs, particularly the left kidney and liver. As can be seen from Table 2, the DSC value obtained by the method of the present invention is the highest, reaching 82.47%, and the segmentation map predicted by the method has a higher overlap with the ground-truth label mask. The method of the present invention achieves a relatively low HD value (19.07 mm) compared with the best and second-best results of 14.70 mm (HiFormer) and 18.20 mm (MISSFormer). BRAU-Net++ increases HD by only 0.87 mm compared with MISSFormer, but by 4.37 mm relative to HiFormer, indicating that the method of the present invention may be less capable of learning target edge information than HiFormer. Overall, Table 2 shows that the proposed BRAU-Net++ improves significantly upon previous work other than HiFormer and MISSFormer, with performance gains on DSC ranging from 0.51% to 12.2% and on HD from 1.59 mm to 20.63 mm.
Further, as can be seen from Table 2, the number of learnable parameters of BRAU-Net++ is about 50.76M, with SCCSA module generating parameters of about 19.36M.
Table 2: the method of the invention is used for quantifying the parameters, DSC and HD of other most advanced methods on a Synapse multi-organ segmentation data set.
FIG. 2 shows some qualitative results of different methods on the Synapse dataset. As can be seen from fig. 2, our method produces smooth segmentation maps for the gallbladder, left kidney, and pancreas, indicating that bi-level routing attention may be good at capturing the features of small objects, while BRAU-Net++ can better learn local and long-range semantic information, yielding better segmentation results.
(2) Comparison on the ISIC-2018 challenge dataset
We performed five-fold cross-validation on the ISIC-2018 challenge dataset to evaluate the performance of the method of the present invention and to avoid overfitting. We reproduced the results of all methods from their publicly released code. The quantitative and qualitative results are presented in Table 3 and fig. 3 (left side), respectively. The method of the present invention reaches 84.01 mIoU, 90.10 DSC, 95.61 accuracy, 91.18 precision, and 92.24 recall, achieving the best performance in terms of mIoU, DSC, and accuracy and the second-best results in terms of precision and recall. It can be observed that the proposed BRAU-Net++ improves mIoU by 1.84% and 1.2% over the recently published DCSAU-Net and the preprint BRAU-Net, respectively. In addition, the method of the present invention achieves a recall of 92.24, which is more advantageous in clinical applications. From the above analysis and fig. 3 (left side), it is clear that BRAU-Net++ achieves better boundary segmentation predictions on the ISIC-2018 challenge dataset, and the mask contour segmented by BRAU-Net++ is closer to the ground-truth label.
TABLE 3. Quantitative results of different methods on the ISIC-2018 challenge dataset
(3) Comparison on CVC-ClinicDB
Early detection can improve survival before polyps potentially develop into colorectal cancer, which is of great significance for clinical practice. Therefore, we also chose this dataset to verify the performance of our model in the experiments. The quantitative results are presented in Table 4. The method of the present invention achieves the best results in mIoU (88.17), DSC (92.94), precision (93.84), and recall (93.06), exceeding the second-best methods by 1.99%, 1.27%, 2.12%, and 1.03%, respectively. The qualitative results are shown in fig. 3 (right). It can be seen that the polyp mask generated by the method of the present invention closely matches the boundary and shape of the ground-truth label.
TABLE 4. Quantitative results of different methods on the CVC-ClinicDB dataset
(4) Ablation study
We validated and analyzed the effect of the SCCSA module on all three datasets, and studied the effects of the number of skip connections, the top-k number, the input size and partition factor S, and the model size and pre-trained weights on the Synapse dataset only.
1) Effectiveness of the SCCSA module: The SCCSA module is an important part of BRAU-Net++. It uses channel-spatial attention to enhance cross-dimensional interactions along the channel and spatial dimensions, helping to generate more accurate segmentation masks. Table 2 shows the results of BRAU-Net++ without the SCCSA module and with the SCCSA module (i.e., BRAU-Net++) on the Synapse dataset. Compared with BRAU-Net++ without SCCSA, BRAU-Net++ increases DSC by 0.82% and reduces HD by 0.39 mm. The segmentation results on the ISIC-2018 challenge and CVC-ClinicDB datasets are shown in Table 5. We can see that adding the SCCSA module to the BRAU-Net++ model achieves the best results under almost all evaluation metrics. For example, with respect to the mIoU index, SCCSA helps to improve by 0.54% on the ISIC-2018 challenge dataset and by 0.8% on CVC-ClinicDB. In addition, the number of parameters, the number of floating-point operations (FLOPs), and the number of frames per second (FPS) were also calculated to further investigate the effectiveness of the module. It can be observed that SCCSA does not have a significant impact on the FPS on either dataset; on the ISIC-2018 challenge dataset it even appears to improve the FPS.
TABLE 5 ablation study of SCCSA Module effects on ISIC-2018 challenge and CVC-ClinicDB dataset
2) Effectiveness of the number of skip connections: It has been observed that the skip connections of a U-shaped network can improve finer segmentation details by using low-level spatial information. This ablation experiment mainly discusses the influence of different numbers of skip connections on the performance of BRAU-Net++. The experiment was performed on the Synapse dataset. Skip connections are added at the 1/4, 1/8, and 1/16 resolution scale positions, and the number of skip connections can be set to 0, 1, 2, or 3 by combining connections at different positions, where "0" means that no skip connection is added. The added connections and their corresponding DSC and HD segmentation performance are shown in Table 6. It can be observed that the segmentation performance gradually improves as the number of skip connections increases, and that adding skip connections at all positions of the 1/4, 1/8, and 1/16 resolution scales achieves the best DSC and HD. We therefore adopt this configuration with 3 skip connections to enhance the ability of BRAU-Net++ to learn accurate low-level details.
TABLE 6. Ablation study of the number of skip connections on the Synapse dataset
3) Effectiveness of the top-k number: As the size of the routing regions gradually decreases in the later stages, we increase k accordingly to keep a reasonable number of tokens for attention computation. The ablation results on the Synapse dataset with respect to the top-k number are shown in Table 7, which lists the top-k number and the number of attended tokens for each stage of the network. It can be seen that increasing the number of attended tokens at the stages near the bottom of the encoder seems to improve segmentation performance. This may be because building blocks near the bottom of the network can capture low-level information, such as edges or textures, which is necessary for the segmentation task. Furthermore, blindly increasing the number of attended tokens may harm performance, which suggests that an explicit sparsity constraint may act as a regularization method and improve the generalization ability of the model. The sketch following this paragraph illustrates how the top-k number translates into the number of attended tokens at each stage.
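To make the relation between the top-k number and the number of attended tokens concrete (as referenced above), the short sketch below computes, for each stage, how many key-value tokens a query attends to under bi-level routing attention, assuming a 224×224 input, S = 7, and the default top-k schedule; counting attended tokens as top-k × tokens-per-region is a simplification made for illustration.

```python
# Number of key-value tokens attended per query under bi-level routing attention.
# Assumes a 224x224 input, partition factor S = 7, and the default top-k schedule.
S = 7
stage_resolutions = [56, 28, 14, 7, 14, 28, 56]   # stages 1..7 (stage 4 = bottleneck)
top_k = [2, 4, 8, S * S, 8, 4, 2]

for stage, (res, k) in enumerate(zip(stage_resolutions, top_k), start=1):
    tokens_per_region = (res // S) ** 2    # fine-grained tokens inside one region
    attended = k * tokens_per_region       # tokens gathered from the top-k regions
    total = res * res                      # all tokens at this stage
    print(f"stage {stage}: top-k={k}, attended {attended} of {total} tokens")
```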
TABLE 7 Top-k quantitative ablation study on Synapse dataset
4) Effectiveness of the input resolution and partition factor S: The main goal of this ablation is to test the effect of the input resolution on the performance of the model. We performed three sets of experiments on the Synapse dataset at resolution scales of 128×128, 224×224, and 256×256, respectively, and report the results in Table 8. The partition factor S is chosen as a divisor of the feature-map size at each stage to avoid padding, so images of different input resolutions should use different partition factors S. Thus, for the three resolutions above, we set the corresponding partition factors to S=4, S=7, and S=8. It can be seen that keeping the patch size the same (e.g., 32) while progressively increasing the resolution scale (i.e., increasing the sequence length of tokens) leads to a consistent increase in model performance. This is consistent with the common observation that higher-resolution images contain more semantic information, thereby improving performance. However, this comes at the cost of a higher computational cost. Therefore, considering the computational cost and for a fair comparison with other methods, all experiments were performed with the default resolution set to 224×224.
Table 8. Ablation study of the input resolution and partition factor S on the Synapse dataset. The marked entry denotes the original resolution.
5) Effectiveness of model size and pre-trained weights: We also discuss the impact of model size on performance; the performance of a Transformer-based model is strongly affected by pre-training. Therefore, we considered four ablation studies on two different model scales, with each model trained from scratch and with pre-training, respectively. These two model scales are called the "tiny" and "base" models, respectively. Their configurations and results on the Synapse dataset are listed in Table 9. It can be seen that the "base" model produces better results. In particular, on the HD evaluation index, the result of the "base" model is improved by 14.77 mm compared with that of the "tiny" model. This suggests that the "base" model can better predict edges. Therefore, we use the "base" model for medical image segmentation.
TABLE 9 ablation study of model size and pre-training weights for Synapse dataset
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.

Claims (5)

1. A method for accurately segmenting medical images, comprising the steps of:
S1: acquiring a public data set and a label corresponding to each picture in the public data set;
S2: constructing and training a U-shaped hybrid CNN-transporter network, wherein the U-shaped hybrid CNN-transporter network comprises seven stages, namely an encoder, a Bottlennek and a decoder, and a jump connection channel-space attention SCCSA module;
the encoder and the decoder are respectively constructed in a layering way by adopting a three-stage pyramid structure; the encoder includes first to third stages, bottlennek is a fourth stage, and the decoder includes fifth to seventh stages;
The SCCSA module includes a channel attention sub-module and a spatial attention sub-module. First, by connecting outputs from the encoder and decoder, we get Then, the channel attention sub-module uses MLP to enhance cross-dimensional channel-space dependencies; in the spatial attention sub-module, the spatial information is focused using a convolution layer;
Each picture input into the encoder is a three-channel picture with H x W x 3, and the resolution of the feature map obtained after the first stage is that After the second stage, the resolution of the characteristic diagram is/>After the third stage, the resolution of the characteristic diagram is as followsThe resolution of the feature map output by the third stage of the encoder after Bottlennek processing is/>Resolution is/>The feature map of (2) enters a patch expansion layer of the fifth stage, and the feature map resolution becomes/>Then, the feature map obtained by processing the feature map obtained in the third stage and the feature map obtained in the third stage through SCCSA modules enters a sixth stage, and the resolution is obtained after the patch expansion layer in the sixth stage is processedThe feature map obtained by carrying out SCCSA module processing on the feature map and the feature map obtained in the second stage enters a seventh stage, and the resolution is obtained after the patch expansion layer processing in the seventh stageThe feature map obtained by processing the feature map obtained in the first stage and the feature map obtained in the first stage through a SCCSA module is subjected to up-sampling to obtain a patch expansion layer which is 4 times as large as that of the feature map obtained in the first stage, then the feature map is subjected to linear mapping layer, finally the final feature map of H.W.class is output, class represents the number of channels, the value of each channel represents the confidence or probability that the U-shaped hybrid CNN-transducer network belongs to the corresponding Class of pixel points, and for one pixel point, the confidence or probability of which channel is the largest is that the pixel point belongs to which Class, and the segmented picture is obtained after judging the Class of each pixel point in the final feature map;
Training the CNN-transporter network by using the public data set, and when the loss of the CNN-transporter network is not changed, considering that the trained CNN-transporter network is obtained;
S3: inputting the new medical picture into a trained CNN-converter network, and outputting the new medical picture to be the picture after the new medical picture is segmented.
2. The method for accurately segmenting medical images of claim 1, wherein: the encoder and the decoder are each constructed hierarchically using a three-stage pyramid structure, wherein the first stage of the encoder comprises a patch embedding layer and BiFormer blocks, the second and third stages of the encoder each comprise a patch merging layer and BiFormer blocks, and each of the three decoder stages comprises a patch expanding layer and BiFormer blocks;
the patch embedding layer adopts two 3×3 convolution layers to convert the feature dimension of each region into an arbitrary dimension, i.e., the channel number, denoted C.
3. A method of enabling accurate medical image segmentation as set forth in claim 2, wherein: the Bottleneck consists of a patch merging layer and BiFormer blocks, which reduce the resolution and increase the channel number of the encoder output.
4. A method of enabling accurate medical image segmentation as set forth in claim 3, wherein the SCCSA module specifically comprises the following steps: the feature map x1 from the encoder and the feature map x2 from the decoder are input; the intermediate states F1, F2, F3 and the output x3 are expressed as:
F1 = Concat(x1, x2), (1)
F2 = σ(MLP(F1)) ⊗ F1, (2)
F3 = σ(Conv(Conv(F2))) ⊗ F2, (3)
x3 = FC(F3), (4)
wherein F2 and F3 are the outputs of the channel attention sub-module and the spatial attention sub-module, respectively, and ⊗ and σ denote element-wise multiplication and the sigmoid activation function, respectively.
5. A method of enabling accurate medical image segmentation as set forth in claim 3, wherein the loss function L used for training the U-shaped hybrid CNN-Transformer network is:
L = λ·L_Dice + (1 − λ)·L_CE,
wherein L_Dice denotes the Dice loss and L_CE denotes the cross-entropy loss; N is the number of pixels; g(k,i) ∈ (0,1) and p(k,i) ∈ (0,1) denote the ground-truth label and the predicted probability of pixel i for class k, respectively; K is the number of classes and Σ_k ω_k is the sum of the weights of all classes; λ is the weighting factor balancing the influence of L_Dice and L_CE; ω_k denotes the weight of class k, k is the index of the summation over classes, i is the index of the pixel, and n is a scaling factor.
CN202410174697.8A 2024-02-07 2024-02-07 Method capable of accurately dividing medical image Pending CN117994517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410174697.8A CN117994517A (en) 2024-02-07 2024-02-07 Method capable of accurately dividing medical image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410174697.8A CN117994517A (en) 2024-02-07 2024-02-07 Method capable of accurately dividing medical image

Publications (1)

Publication Number Publication Date
CN117994517A true CN117994517A (en) 2024-05-07

Family

ID=90892323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410174697.8A Pending CN117994517A (en) 2024-02-07 2024-02-07 Method capable of accurately dividing medical image

Country Status (1)

Country Link
CN (1) CN117994517A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination