CN114066902A - Medical image segmentation method, system and device based on convolution and transformer fusion - Google Patents

Medical image segmentation method, system and device based on convolution and transformer fusion Download PDF

Info

Publication number
CN114066902A
CN114066902A (application CN202111381789.6A)
Authority
CN
China
Prior art keywords
convolution
layer
fusion
network
transformer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111381789.6A
Other languages
Chinese (zh)
Inventor
方贤勇
王凯兵
汪粼波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202111381789.6A priority Critical patent/CN114066902A/en
Publication of CN114066902A publication Critical patent/CN114066902A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing

Abstract

The invention belongs to the field of image segmentation, and particularly relates to a medical image segmentation method, system and device based on convolution and transformer fusion. The method comprises the following steps: S1: construct an improved transformer module with a sliding window based on the standard transformer module. S2: construct a deep fusion network comprising a convolution module, the improved transformer module, a feature fusion module and a decoder module. S3: select a plurality of medical images containing polyps to form an original data set, and divide the original data set into a training set and a test set. S4: set the learning strategy, training epochs and loss function for the training stage, and train and test the deep fusion network with the training set and test set. S5: save the trained deep fusion network for segmenting medical images. The invention solves the problems that conventional convolutional neural networks applied to medical image segmentation have an insufficient receptive field and cannot effectively establish long-range dependencies or exploit global context information.

Description

Medical image segmentation method, system and device based on convolution and transformer fusion
Technical Field
The invention belongs to the field of image segmentation, and particularly relates to a medical image segmentation method, system and device based on convolution and transformer fusion.
Background
Medical image segmentation is an important and challenging research topic that covers many common tasks in clinical applications, such as polyp segmentation, lesion segmentation and cell segmentation, and it is also one of the most complex and critical steps in medical image processing. Medical image segmentation plays an important role in computer-aided clinical diagnosis systems: it can semi-automatically or automatically segment and extract regions of special significance from a medical image, thereby providing a reliable basis for clinical diagnosis and pathological research and assisting doctors in making more accurate diagnoses.
Convolutional neural networks, represented by ResNet, have enjoyed great success in computer vision, particularly in object detection, image classification and image segmentation. Likewise, convolutional neural networks dominate a wide range of medical image segmentation tasks. U-Net proposed the classic encoder-decoder structure and stands out in segmentation tasks: the encoder extracts features through successive down-sampling, and the decoder up-samples while progressively reusing the encoder features through skip connections, so the network can exploit the features more fully. On this basis, researchers have developed a series of networks specially designed for medical image segmentation, such as UNet++, Res-UNet, Attention-UNet, DenseUNet and R2U-Net, all of which achieve good segmentation results.
Although CNNs (convolutional neural networks) have achieved great success in medical image segmentation, they have hit a bottleneck. The receptive field of the convolution operation is very limited: it computes only highly local features, cannot capture global features, and cannot exploit context information. Although the receptive field can be enlarged in some networks by stacking convolutional layers and down-sampling, this approach still loses much information. Researchers have also tried to alleviate the problem with new convolution operations (such as dilated convolution or deformable convolution), but these make the network more complicated and more prone to overfitting. These problems all limit the application of convolutional networks to high-precision medical image segmentation.
Disclosure of Invention
To solve the problems that existing convolutional neural networks suffer from an insufficient receptive field, cannot effectively compute global features, and are prone to parameter bloat and overfitting in medical image segmentation, the invention provides a medical image segmentation method, system and device based on convolution and transformer fusion.
The invention is realized by adopting the following technical scheme:
a medical image segmentation method based on convolution and transformer fusion comprises the following steps:
S1: Construct an improved transformer module with a sliding window based on the standard transformer module. The improved transformer module consists of two consecutive Swin Transformer Blocks. The former Swin Transformer Block comprises a window-based MSA (W-MSA, window-based multi-head self-attention) layer and an MLP (multi-layer perceptron) layer connected in sequence; an LN (LayerNorm) layer precedes both the window-based MSA layer and the MLP layer, and a residual connection is applied after each of them. The latter Swin Transformer Block comprises a shifted-window-based MSA (SW-MSA, multi-head self-attention with a sliding window) layer and an MLP layer connected in sequence; an LN layer precedes both the shifted-window-based MSA layer and the MLP layer, and a residual connection is applied after each of them.
S2: Construct a deep fusion network comprising a convolution module, the improved transformer network, a feature fusion module and a decoder module. The input of the deep fusion network is a medical image, and the output is the segmentation result of the target region in the medical image. After being input into the deep fusion network, the medical image is first processed by the convolution module. The convolution module is the backbone network, and the convolution features it outputs are routed along three paths. The first path is input into the improved transformer network. The second path is output to the feature fusion module, where it is fused with the transformer features output by the improved transformer network. The third path, together with the fusion features output by the feature fusion module, is sent to the decoder module for decoding, yielding the required segmentation result.
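The three-path routing described above can be sketched as a small PyTorch module. The submodules are passed in as placeholders, and every name here (DeepFusionNet, transformer_blocks, fusion_modules) is an illustrative assumption rather than an identifier from the patent.

```python
# Minimal wiring sketch of the deep fusion network's forward pass (PyTorch).
# The backbone is assumed to return the four convolution feature maps e1..e4;
# one transformer branch and one fusion module per scale are assumed.
import torch.nn as nn

class DeepFusionNet(nn.Module):
    def __init__(self, backbone, transformer_blocks, fusion_modules, decoder):
        super().__init__()
        self.backbone = backbone                              # convolution module (Res2Net-50 style)
        self.transformer = nn.ModuleList(transformer_blocks)  # improved transformer network, one per scale
        self.fusion = nn.ModuleList(fusion_modules)           # feature fusion modules, one per scale
        self.decoder = decoder

    def forward(self, image):
        conv_feats = self.backbone(image)                     # e1..e4, shallow to deep
        # path 1: each convolution feature passes through the improved transformer network
        trans_feats = [t(e) for t, e in zip(self.transformer, conv_feats)]        # d1..d4
        # path 2: convolution features are fused with the transformer features
        fused = [f(e, d) for f, e, d in zip(self.fusion, conv_feats, trans_feats)]
        # path 3: convolution features and fused features go to the decoder together
        return self.decoder(conv_feats, fused)
```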
S3: Select a plurality of medical images containing polyps as original data to form an original data set, and apply image transformation and enhancement processing to the original data to expand the data set. The original data set is then divided into a training set and a test set at a data-volume ratio of 2:1.
S4: Set the learning strategy, training epochs and loss function for the training stage, train the constructed deep fusion network with the training set, and test the training effect with the test set.
S5: After training and testing, save the deep fusion network whose performance reaches the preset index, and use it as the image segmentation network to perform semantic segmentation on the medical images to be segmented.
As a further improvement of the present invention, in step S1, the consecutive Swin Transformer Blocks in the improved transformer network are computed as:

$\hat{Z}_i = \text{W-MSA}(\text{LN}(Z_{i-1})) + Z_{i-1}$

$Z_i = \text{MLP}(\text{LN}(\hat{Z}_i)) + \hat{Z}_i$

$\hat{Z}_{i+1} = \text{SW-MSA}(\text{LN}(Z_i)) + Z_i$

$Z_{i+1} = \text{MLP}(\text{LN}(\hat{Z}_{i+1})) + \hat{Z}_{i+1}$

In the above formulas, $Z_{i-1}$ denotes the input feature of the i-th layer Swin Transformer Block; $\hat{Z}_i$ denotes the output of the W-MSA of the i-th layer; $Z_i$ is the output feature of the i-th layer Swin Transformer Block and also the input feature of the (i+1)-th layer; $\hat{Z}_{i+1}$ denotes the output of the SW-MSA of the (i+1)-th layer; $Z_{i+1}$ denotes the output feature of the (i+1)-th layer Swin Transformer Block.
as a further improvement of the invention, in the deep fusion network of step S2, the convolution module selects Res2net-50 to form the backbone part of the network, and after the medical image is output to the convolution module, the convolution characteristics e from shallow layer to deep layer are obtained in turn through convolution processingiI is 1, 2, 3, 4; the channel dimensions of the four sets of convolution features are 256, 512, 1024, 2048, respectively, and the feature scales are 128, 64, 32, 16, respectively.
Convolution characteristic e of convolution module outputiAfter being processed by the improved transformer network, four groups of transformer characteristics containing global characteristics are respectively obtained and marked as di,i=1、2、3、4。
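As a rough shape check for these four feature groups, the sketch below extracts multi-stage features with torchvision's ResNet-50 standing in for Res2Net-50 (both expose stages with 256/512/1024/2048 channels); a recent torchvision with create_feature_extractor is assumed, and the 128/64/32/16 scales correspond to a 512×512 input.

```python
# Sketch: extracting the four convolution feature groups e1..e4 with a ResNet-50
# stand-in for Res2Net-50; only the channel widths and strides are the point here.
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "e1", "layer2": "e2", "layer3": "e3", "layer4": "e4"},
)
feats = backbone(torch.randn(1, 3, 512, 512))
for name, f in feats.items():
    print(name, tuple(f.shape))   # e1: (1, 256, 128, 128) ... e4: (1, 2048, 16, 16)
```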
As a further improvement of the present invention, in the deep fusion network of step S2, the feature fusion module includes a front convolutional layer, an up-sampling layer, a feature splicing layer and a back convolutional layer. Both the front and back convolutional layers consist of two 3×3 convolution modules. The feature map output by the up-sampling layer is twice the scale of its input. The feature splicing layer concatenates the two input features along the channel dimension, and the result is processed by the back convolutional layer and output as the fusion feature.
As a further improvement of the invention, in the feature fusion module, the transformer feature d_i output by the improved transformer network is processed by the front convolutional layer and then rescaled by the up-sampling layer, so that the resulting transformer feature has the same size as the convolution feature e_i. The transformer feature M_i and the convolution feature of the same size are then input to the feature splicing layer, which concatenates the two input features along the channel dimension; after processing by the back convolutional layer, the required fusion feature Z_i is obtained.

The fusion feature Z_i output by the feature fusion module is expressed as:

$M_i = \text{upsample}(\text{conv}(\text{conv}(e_i)))$

$Z_i = \sigma(\text{conv}(\text{cat}(M_i, d_i)))$

where conv denotes a 3×3 convolution with stride 1, upsample denotes up-sampling, cat denotes the concatenation operation, and σ is the ReLU activation function.
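A hedged PyTorch sketch of this fusion step follows. Note that the prose routes the transformer feature d_i through the front convolutions and the up-sampling, while the printed formula applies them to e_i; the sketch follows the prose reading, so treat it as one plausible interpretation rather than the patent's exact implementation, and the module/argument names are illustrative.

```python
# Sketch of the feature fusion module: pre-convolve and 2x-upsample the transformer
# feature to match the convolution feature, concatenate along channels, post-convolve.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, trans_ch, conv_ch, out_ch):
        super().__init__()
        # "front" convolutional layer: two 3x3 convolutions, stride 1
        self.pre = nn.Sequential(
            nn.Conv2d(trans_ch, out_ch, 3, padding=1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # "back" convolutional layer (described as two 3x3 convolutions in the text)
        self.post = nn.Sequential(
            nn.Conv2d(out_ch + conv_ch, out_ch, 3, padding=1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, e_i, d_i):
        m_i = self.up(self.pre(d_i))            # M_i, now the same spatial size as e_i
        z_i = self.act(self.post(torch.cat([m_i, e_i], dim=1)))   # Z_i
        return z_i
```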
As a further improvement of the invention, in the deep fusion network of step S2, the inputs of the decoder module are the convolution features e_i output by the convolution module and the fusion features output by the feature fusion module; the output of the decoder module is the decoded image segmentation result.
As a further refinement of the present invention, in step S3, the raw data in the original data set are drawn from the public polyp data sets kvasir, cvc-clinicDB, ETIS, cvc-colonDB and EndoScene. The image transformation methods adopted during data set augmentation include random horizontal mirror flipping, vertical mirror flipping, and rotations of 90°, 180° and 270°. The image enhancement processing methods include random brightness, contrast and sharpening adjustments. The random probability of each image transformation and image enhancement method is set to 0.5.
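One way to realize this augmentation policy is with the albumentations library (a version ≥ 1.0 is assumed for A.Sharpen); the parameter ranges are library defaults chosen for illustration, since the patent only fixes the probability of 0.5.

```python
# Approximate augmentation pipeline mirroring the transformations listed above:
# horizontal/vertical flips, 90/180/270 degree rotations, and random brightness,
# contrast and sharpening, each applied with probability 0.5.
import albumentations as A

train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),               # rotations by 90/180/270 degrees
    A.RandomBrightnessContrast(p=0.5),
    A.Sharpen(p=0.5),
])

# usage: augmented = train_aug(image=image, mask=mask)
```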
As a further improvement of the present invention, in the training process of step S4, the BCE loss function and the IoU loss function are selected as the loss functions, the PolyLr learning-rate decay strategy is selected, the learning rate is set to 0.0001, and the number of training epochs is set to 240.
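A hedged sketch of these training settings is given below; the poly exponent of 0.9 and the IoU smoothing constant are assumptions, as the patent does not state them.

```python
# Sketch: combined BCE + IoU loss and a "poly" learning-rate decay (PyTorch).
import torch
import torch.nn.functional as F

def bce_iou_loss(logits, target, eps=1.0):
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(2, 3))
    union = (prob + target - prob * target).sum(dim=(2, 3))
    iou = 1 - (inter + eps) / (union + eps)
    return bce + iou.mean()

def poly_lr(optimizer, base_lr, epoch, max_epoch=240, power=0.9):
    lr = base_lr * (1 - epoch / max_epoch) ** power
    for group in optimizer.param_groups:
        group["lr"] = lr
    return lr

# base_lr = 1e-4 and max_epoch = 240 per the training settings above
```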
The invention also comprises a medical image segmentation system based on convolution and transformer fusion, wherein the medical image segmentation system adopts the medical image segmentation method based on convolution and transformer fusion to perform semantic segmentation on the acquired medical image so as to obtain an image segmentation prediction result of the target characteristic.
The medical image segmentation system comprises: the system comprises an image acquisition module, a convolution network, an improved transformer network, a feature fusion network and a decoder.
The image acquisition module is used for acquiring a medical image to be segmented and preprocessing the medical image so as to meet the input standard of the system.
The convolutional network uses Res2net-50 to form the backbone network of the system. After the medical image is input into the convolution network for processing, the output of the convolution network is convolution characteristics, and the output path of the convolution characteristics comprises three paths.
The improved transformer network receives the first path of convolution features output by the convolution network. The improved transformer network consists of two consecutive Swin Transformer Blocks. The former Swin Transformer Block comprises a window-based MSA layer and an MLP layer connected in sequence; an LN layer precedes both the window-based MSA layer and the MLP layer, and a residual connection is applied after each of them. The latter Swin Transformer Block comprises a shifted-window-based MSA layer and an MLP layer connected in sequence; an LN layer precedes both, and a residual connection is applied after each of them. After the input convolution features are processed by the improved transformer network, the output is the transformer features.
The feature fusion network receives the second path of convolution features output by the convolution network and the transformer features output by the improved transformer network. The feature fusion module comprises a front convolutional layer, an up-sampling layer, a feature splicing layer and a back convolutional layer. Both the front and back convolutional layers consist of two 3×3 convolution modules, and the feature map output by the up-sampling layer is twice the scale of its input. The feature fusion network first processes the input transformer features with the front convolutional layer and then rescales them with the up-sampling layer, so that the up-sampled transformer features have the same size as the convolution features. The convolution features and the transformer features are concatenated along the channel dimension in the feature splicing layer, and the spliced features are output as the fusion features after processing by the back convolutional layer.
The decoder receives the third path of convolution features output by the convolution network and the fusion features output by the feature fusion network, and decodes them to obtain the required semantic segmentation result of the medical image.
The invention also comprises a medical image segmentation apparatus based on convolution and transformer fusion, comprising a memory, a processor and a computer program stored on the memory and executable on the processor. When executing the program, the processor performs the steps of the medical image segmentation method based on convolution and transformer fusion as described above.
The technical scheme provided by the invention has the following beneficial effects:
In the medical image segmentation method based on convolution and transformer fusion, a new deep fusion network is creatively constructed. The deep fusion network introduces an improved Swin Transformer module whose self-attention mechanism can fully exploit context information and establish long-range dependencies. This solves the problems that a traditional convolutional neural network has a small receptive field, cannot exploit global information, loses extracted feature information, and therefore produces inaccurate segmentation results.
Through two consecutive Swin Transformer Blocks with sliding windows, the improved transformer network also reduces the heavy computation and high complexity involved in processing large-scale feature maps, thereby improving network robustness and avoiding overfitting, while at the same time capturing richer global information and improving the accuracy of medical image segmentation.
To make better use of the features generated by the Swin Transformer module, a feature fusion module is further provided. It fuses the convolution features and the transformer features into a single fusion feature, so that the deep fusion network combines the advantages of both convolution and transformer features.
Drawings
Fig. 1 is a flowchart illustrating the steps of the medical image segmentation method based on convolution and transformer fusion according to embodiment 1 of the present invention.
Fig. 2 is a schematic structural diagram of the improved transformer module in embodiment 1 of the present invention.
Fig. 3 is a model architecture diagram of the deep fusion network constructed in embodiment 1 of the present invention.
Fig. 4 is a basic flowchart of the deep fusion network processing procedure in embodiment 1 of the present invention.
Fig. 5 is a schematic structural diagram of the feature fusion module in embodiment 1 of the present invention.
Fig. 6 is a schematic block diagram of the medical image segmentation system based on convolution and transformer fusion according to embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
The present embodiment provides a medical image segmentation method based on convolution and transformer fusion. As shown in fig. 1, the medical image segmentation method includes the following steps:
S1: Construct an improved transformer network with a sliding window based on the standard transformer module. As shown in fig. 2, the improved transformer network consists of two consecutive Swin Transformer Blocks. The former Swin Transformer Block comprises a window-based MSA layer and an MLP layer connected in sequence; an LN layer precedes both the window-based MSA layer and the MLP layer, and a residual connection is applied after each of them. The latter Swin Transformer Block comprises a shifted-window-based MSA layer and an MLP layer connected in sequence; an LN layer precedes both the shifted-window-based MSA layer and the MLP layer, and a residual connection is applied after each of them.
The standard transformer module is usually composed of a multi-head self-attention (MSA) module and a multi-layer perceptron (MLP); a LayerNorm (LN) layer is applied before each MSA module and each MLP module, and a residual connection is applied after each module. The output of the l-th layer of the transformer encoder can therefore be expressed as:

$\hat{Z}_l = \text{MSA}(\text{LN}(Z_{l-1})) + Z_{l-1}$

$Z_l = \text{MLP}(\text{LN}(\hat{Z}_l)) + \hat{Z}_l$

In the above formulas, $Z_{l-1}$ denotes the input feature of the l-th layer transformer block, $\hat{Z}_l$ denotes the output of the l-th layer MSA, and $Z_l$ is the output feature of the l-th layer transformer block and also the input feature of the (l+1)-th layer.

The operation of the multi-head self-attention (MSA) module is defined as follows:

$\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right) V$

where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ are obtained by multiplying the input features by three matrices, $M^{2}$ and $d$ denote the number of patches in a divided region and the channel dimension respectively, and the values in $B$ are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$.
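Expressed directly in code, the attention computed inside one divided region is simply the formula above; a minimal single-head sketch, with the bias B supplied externally, is:

```python
# Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V, computed per window.
import torch

def window_attention(q, k, v, bias):
    # q, k, v: (num_windows, M*M, d); bias: (M*M, M*M) relative position bias B
    d = q.size(-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5 + bias, dim=-1)
    return attn @ v
```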
The standard transformer module performs self-attention over the whole input feature map, so both the amount of computation and the complexity of the process are large. In the improved transformer network of this embodiment, the Swin Transformer module divides the whole input feature map into several non-overlapping regions and computes self-attention within each region, which reduces the computational complexity. The improved transformer network comprises two consecutive Swin Transformer modules; in the second module the regions are partitioned again with a shift, so that the new regions overlap the boundaries of the previous partition and richer global information is captured.
Specifically, in the improved transformer network of this embodiment, the consecutive Swin Transformer Blocks are computed as:

$\hat{Z}_i = \text{W-MSA}(\text{LN}(Z_{i-1})) + Z_{i-1}$

$Z_i = \text{MLP}(\text{LN}(\hat{Z}_i)) + \hat{Z}_i$

$\hat{Z}_{i+1} = \text{SW-MSA}(\text{LN}(Z_i)) + Z_i$

$Z_{i+1} = \text{MLP}(\text{LN}(\hat{Z}_{i+1})) + \hat{Z}_{i+1}$

In the above formulas, $Z_{i-1}$ denotes the input feature of the i-th layer Swin Transformer Block; $\hat{Z}_i$ denotes the output of the W-MSA of the i-th layer; $Z_i$ is the output feature of the i-th layer Swin Transformer Block and also the input feature of the (i+1)-th layer; $\hat{Z}_{i+1}$ denotes the output of the SW-MSA of the (i+1)-th layer; $Z_{i+1}$ denotes the output feature of the (i+1)-th layer Swin Transformer Block.
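A compact PyTorch sketch of this block pair follows. It omits the relative position bias and the attention mask normally paired with the shifted windows, uses nn.MultiheadAttention (batch_first requires a reasonably recent PyTorch), and assumes the feature height and width are multiples of the window size; names such as SwinBlockPair are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention computed independently inside ws x ws windows."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, shift=0):
        # x: (B, H, W, C); H and W are assumed to be multiples of the window size
        B, H, W, C = x.shape
        ws = self.window_size
        if shift:                        # SW-MSA: cyclically shift before partitioning
            x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
        win = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, ws * ws, C)              # (num_windows*B, ws*ws, C)
        out, _ = self.attn(win, win, win)
        out = out.reshape(B, H // ws, W // ws, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        out = out.reshape(B, H, W, C)
        if shift:                        # undo the cyclic shift
            out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
        return out

class SwinBlockPair(nn.Module):
    """LN -> W-MSA -> residual -> LN -> MLP -> residual, then the same with SW-MSA."""
    def __init__(self, dim, window_size=7, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.w_msa = WindowAttention(dim, window_size, num_heads)
        self.sw_msa = WindowAttention(dim, window_size, num_heads)
        self.mlp1 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))
        self.shift = window_size // 2

    def forward(self, z):                                        # z: (B, H, W, C)
        z = z + self.w_msa(self.norms[0](z))                     # first block: W-MSA
        z = z + self.mlp1(self.norms[1](z))
        z = z + self.sw_msa(self.norms[2](z), shift=self.shift)  # second block: SW-MSA
        z = z + self.mlp2(self.norms[3](z))
        return z
```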
S2: Construct a deep fusion network comprising a convolution module, the improved transformer network, a feature fusion module and a decoder module, as shown in fig. 3. The input of the deep fusion network is a medical image, and the output is the segmentation result of the target region in the medical image. After being input into the deep fusion network, the medical image is first processed by the convolution module. The convolution module is the backbone network, and the convolution features it outputs are routed along three paths. The first path is input into the improved transformer network. The second path is output to the feature fusion module, where it is fused with the transformer features output by the improved transformer network. The third path, together with the fusion features output by the feature fusion module, is sent to the decoder module for decoding, yielding the required segmentation result.
In the deep fusion network of this embodiment, the network workflow is as shown in fig. 4:
specifically, the convolution module selects Res2net-50 to form a backbone part of the network, and after the medical image is input into the deep fusion network, the medical image is firstly processed by the convolution module to extract characteristic information in the medical image. In the Res2net-50 network, four groups of convolution characteristics from a shallow layer to a deep layer are sequentially obtained from an input medical image through convolution processing, and the four groups of convolution characteristics are marked as eiAnd i is 1, 2, 3 and 4. The channel dimensions of the four sets of convolution features are 256, 512, 1024, 2048, respectively, and the feature scales are 128, 64, 32, 16, respectively.
One path of the convolution features e_i output by the convolution module is sent to the improved transformer network with a sliding window. In the improved transformer network, the two consecutive Swin Transformer modules are applied to the four groups of convolution features e1, e2, e3, e4 separately. During processing, a convolution feature e_i is first reshaped and multiplied by three matrices to obtain the three features Q, K, V; attention is computed from these three values, the output is sent to the LN layer, and the result then enters the MLP module, completing one Swin Transformer module. After the first Swin Transformer module finishes, the same operations are repeated in the second Swin Transformer module, and the required transformer features are finally output. The convolution features input to the improved transformer network thus comprise four groups, e1, e2, e3, e4, and the transformer features output also comprise four groups, d1, d2, d3, d4.
The convolution features e1, e2, e3, e4 and the transformer features d1, d2, d3, d4 are both input into the feature fusion module for feature fusion. As shown in fig. 5, the feature fusion module includes a front convolutional layer, an up-sampling layer, a feature splicing layer and a back convolutional layer. Both the front and back convolutional layers consist of two 3×3 convolution modules. The feature map output by the up-sampling layer is twice the scale of its input. The feature splicing layer concatenates the two input features along the channel dimension, and the result is processed by the back convolutional layer and output as the fusion feature.
The fusion of the two types of features in the feature fusion module therefore proceeds as follows: the transformer feature d_i is first processed by the front convolutional layer and then rescaled by the up-sampling layer. Because the processed transformer feature d_i is smaller than the convolution feature e_i (the latter is twice the size of the former), the transformer feature after the 2× up-sampling operation has the same size as the convolution feature e_i. Once the sizes match, fusion can be performed. The transformer feature M_i and the convolution feature are input to the feature splicing layer, which concatenates the two input features along the channel dimension to obtain a mixed feature; this mixed feature is processed by the back convolutional layer to obtain the required fusion feature Z_i.

The fusion feature Z_i output by the feature fusion module is expressed as:

$M_i = \text{upsample}(\text{conv}(\text{conv}(e_i)))$

$Z_i = \sigma(\text{conv}(\text{cat}(M_i, d_i)))$

where conv denotes a 3×3 convolution with stride 1, upsample denotes up-sampling, cat denotes the concatenation operation, and σ is the ReLU activation function.
The resulting fusion features are output to the decoder module. The decoder module decodes the feature information according to the convolution features e_i output by the convolution module and the fusion features, thereby obtaining the image segmentation result.
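The patent does not detail the decoder's internal layout; the sketch below assumes a U-Net-style top-down pass that repeatedly up-samples, injects the fused feature of the matching scale, and merges with the corresponding convolution feature. This structure, and the assumption that every fusion module outputs the same channel count, are hypothetical.

```python
# Hypothetical decoder sketch: the patent states only that the decoder consumes the
# convolution features e_i and the fused features Z_i; the layout below is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, conv_chs=(256, 512, 1024, 2048), fused_ch=64, num_classes=1):
        super().__init__()
        # one merge convolution per scale: (decoder state + e_i channels) -> fused_ch
        self.merge = nn.ModuleList(
            [nn.Conv2d(c + fused_ch, fused_ch, 3, padding=1) for c in conv_chs]
        )
        self.head = nn.Conv2d(fused_ch, num_classes, 1)

    def forward(self, conv_feats, fused_feats):
        # conv_feats: [e1..e4]; fused_feats: [Z1..Z4], each assumed to have fused_ch channels
        x = torch.relu(self.merge[-1](torch.cat([fused_feats[-1], conv_feats[-1]], dim=1)))
        for i in range(len(conv_feats) - 2, -1, -1):
            x = F.interpolate(x, size=conv_feats[i].shape[2:], mode="bilinear",
                              align_corners=False)
            x = torch.relu(self.merge[i](torch.cat([x + fused_feats[i], conv_feats[i]], dim=1)))
        # e1 is at 1/4 of the input resolution, so upsample by 4 before the prediction head
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        return self.head(x)
```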
S3: Select a plurality of medical images containing polyps as original data to form an original data set, and apply image transformation and enhancement processing to the original data to expand the data set. The original data set is then divided into a training set and a test set at a data-volume ratio of 2:1.
S4: Set the learning strategy, training epochs and loss function for the training stage, train the constructed deep fusion network with the training set, and test the training effect with the test set.
S5: After training and testing, save the deep fusion network whose performance reaches the preset index, and use it as the image segmentation network to perform semantic segmentation on the medical images to be segmented.
In the medical image segmentation method based on convolution and transformer fusion, a new deep fusion network is creatively constructed. The deep fusion network introduces an improved Swin Transformer module whose self-attention mechanism can fully exploit context information and establish long-range dependencies. This solves the problems that a traditional convolutional neural network has a small receptive field, cannot exploit global information, loses extracted feature information, and therefore produces inaccurate segmentation results.
Through two consecutive Swin Transformer Blocks with sliding windows, the improved transformer network also reduces the heavy computation and high complexity involved in processing large-scale feature maps, thereby improving network robustness and avoiding overfitting, while at the same time capturing richer global information and improving the accuracy of medical image segmentation.
The Transformer is a novel architecture originally designed for sequence-to-sequence modeling in natural language processing, where it has made great progress on most NLP tasks such as machine translation, named entity recognition and question answering. This embodiment applies it to the field of medical image segmentation and makes effective use of the multi-head self-attention (MSA) mechanism, leveraging its powerful ability to establish global connections between the tokens of a sequence and to relate long-range context information.
To make better use of the features generated by the Swin Transformer module, a feature fusion module is further provided. It fuses the convolution features and the transformer features into a single fusion feature, so that the deep fusion network combines the advantages of both convolution and transformer features.
In order to verify the effectiveness of the method provided by this embodiment, a simulation experiment was also designed. The experimental environment is: Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70 GHz, 16 GB memory, Ubuntu 20.04, a GTX 2060 graphics card, the PyCharm programming environment, and the PyTorch 1.5.1 deep learning framework.
For training data, this embodiment uses the five polyp data sets kvasir, cvc-clinicDB (cvc-612), ETIS, cvc-colonDB and EndoScene that are publicly available online. The kvasir data set contains 1000 images; 900 are randomly selected for the training set and the remaining 100 go to the test set. The cvc-clinicDB data set has 612 images; following the same procedure as for kvasir, 550 images are randomly selected for the training set and the rest go to the test set. At this point, the training set has 1450 images and the test set has 162 images.
In addition, the ETIS and cvc-colonDB data sets contain 196 and 380 images, respectively; this embodiment uses both of them entirely as test sets, further verifying the generalization capability of the network. The EndoScene data set is made up of cvc-612 and cvc-300. Since part of the cvc-612 data has already been used for training, only the EndoScene-cvc300 portion, containing 60 images, is used as a test set. In total, the training set employed in this embodiment includes 1450 polyp images and the test set includes 789 polyp images; the data volume ratio of the training set to the test set is close to 2:1.
For the original data in the training set and the test set, this embodiment applies data set augmentation to enlarge the amount of training data and enhance the robustness of the deep fusion network. The augmentation comprises image transformation methods and image enhancement processing methods. The image transformation methods adopted in this embodiment include random horizontal mirror flipping, vertical mirror flipping, and rotations of 90°, 180° and 270°. The image enhancement processing methods include random brightness, contrast and sharpening adjustments. The random probability of each image transformation and image enhancement method is set to 0.5.
In the initial stage of training, this embodiment resizes all images to 448×448; in the later stage, to improve the generalization of the deep fusion network across different tasks, a multi-scale training strategy is also adopted. In addition, during training the BCE loss function and the IoU loss function are selected as the loss functions, a PolyLr learning-rate decay strategy is used, the learning rate in the training phase is set to 0.0001, and the number of training epochs is set to 240.
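The multi-scale strategy mentioned above could, for example, rescale each batch randomly before the forward pass; the scale set {0.75, 1.0, 1.25} and snapping sizes to multiples of 32 are assumptions, since the patent does not give these details.

```python
# Sketch of one multi-scale training step around the 448x448 base size.
import random
import torch.nn.functional as F

def multiscale_step(model, images, masks, criterion, base_size=448,
                    scales=(0.75, 1.0, 1.25)):
    s = random.choice(scales)
    size = int(round(base_size * s / 32) * 32)        # keep sizes divisible by 32
    images = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
    masks = F.interpolate(masks, size=(size, size), mode="nearest")
    logits = model(images)
    return criterion(logits, masks)
```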
After training, the training effect of the deep fusion network is verified on the test sets in this embodiment. The segmentation performance of the deep fusion network on images from the different data sets is evaluated using mIoU (mean intersection-over-union) and mDice (mean Dice coefficient) as criteria, giving the results shown in the following table.
Table 1: Segmentation results of the deep fusion network on images from different data sets in this embodiment
[Table 1 is reproduced as an image in the original publication; it lists the mIoU and mDice of the deep fusion network for each test data set.]
Analysis of the above data shows that the deep fusion network adopted in the method of this embodiment achieves a good segmentation effect on images from the kvasir and cvc-clinicDB data sets, with mIoU reaching 0.856 and 0.877 respectively, higher than traditional convolutional neural networks such as U-Net, SFA and PraNet. For the ETIS, cvc-colonDB and EndoScene data sets, whose data were not used during training, the deep fusion network still shows a good segmentation effect at test time. This demonstrates that the deep fusion network model provided by this embodiment also generalizes well and is suitable for segmenting a variety of different medical images.
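For reference, the two reported criteria can be computed per image as below, with mIoU and mDice then taken as averages over each test set; thresholding the prediction at 0.5 is an assumption.

```python
# Per-image IoU and Dice coefficient for binary segmentation masks.
import torch

def iou_dice(pred, target, thresh=0.5, eps=1e-6):
    p = (pred > thresh).float()
    t = (target > 0.5).float()
    inter = (p * t).sum()
    union = (p + t - p * t).sum()
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    return iou.item(), dice.item()
```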
Example 2
The present embodiment provides a medical image segmentation system based on convolution and transformer fusion, which performs semantic segmentation on an acquired medical image using the medical image segmentation method based on convolution and transformer fusion described in embodiment 1, so as to obtain an image segmentation prediction result for the target feature.
As shown in fig. 6, the medical image segmentation system includes: the system comprises an image acquisition module, a convolution network, an improved transformer network, a feature fusion network and a decoder.
The image acquisition module is used for acquiring a medical image to be segmented and preprocessing the medical image so as to meet the input standard of the system.
The convolutional network uses Res2net-50 to form the backbone network of the system. After the medical image is input into the convolution network for processing, the output of the convolution network is convolution characteristics, and the output path of the convolution characteristics comprises three paths.
The improved transformer network receives the first path of convolution features output by the convolution network. The improved transformer network consists of two consecutive Swin Transformer Blocks. The former Swin Transformer Block comprises a window-based MSA layer and an MLP layer connected in sequence; an LN layer precedes both the window-based MSA layer and the MLP layer, and a residual connection is applied after each of them. The latter Swin Transformer Block comprises a shifted-window-based MSA layer and an MLP layer connected in sequence; an LN layer precedes both, and a residual connection is applied after each of them. After the input convolution features are processed by the improved transformer network, the output is the transformer features.
The feature fusion network receives the second path of convolution features output by the convolution network and the transformer features output by the improved transformer network. The feature fusion module comprises a front convolutional layer, an up-sampling layer, a feature splicing layer and a back convolutional layer. Both the front and back convolutional layers consist of two 3×3 convolution modules, and the feature map output by the up-sampling layer is twice the scale of its input. The feature fusion network first processes the input transformer features with the front convolutional layer and then rescales them with the up-sampling layer, so that the up-sampled transformer features have the same size as the convolution features. The convolution features and the transformer features are concatenated along the channel dimension in the feature splicing layer, and the spliced features are output as the fusion features after processing by the back convolutional layer.
The decoder receives the third path of convolution features output by the convolution network and the fusion features output by the feature fusion network, and decodes them to obtain the required semantic segmentation result of the medical image.
Example 3
The invention also comprises a medical image segmentation apparatus based on convolution and transformer fusion, comprising a memory, a processor and a computer program stored on the memory and executable on the processor. When executing the program, the processor implements the steps of the medical image segmentation method based on convolution and transformer fusion described in embodiment 1.
The computer device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) capable of executing programs, and the like. The computer device of the embodiment at least includes but is not limited to: a memory, a processor communicatively coupled to each other via a system bus.
In this embodiment, the memory (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device. Of course, the memory may also include both internal and external storage devices for the computer device. In this embodiment, the memory is generally used for storing an operating system, various types of application software, and the like installed in the computer device. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output.
The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run the program code stored in the memory or process data to implement the processing procedure of the medical image segmentation method based on convolution and transform fusion in the foregoing embodiment, so as to obtain a segmentation result of feature information such as polyps in an image according to a given medical image.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A medical image segmentation method based on convolution and transformer fusion is characterized by comprising the following steps:
s1: constructing an improved Transformer network with a sliding window based on a standard Transformer module, wherein the improved Transformer network consists of two continuous Swin Transformer blocks; the former Swin Transformer Block comprises a window based MSA layer and an MLP layer which are connected in sequence; an LN layer is arranged in front of the window based MSA layer and the MLP layer, and residual errors are used for connection after the window based MSA layer and the MLP layer; the latter Swin Transformer Block comprises a shifted window based MSA layer and an MLP layer which are connected in sequence; an LN layer is arranged in front of the shifted window based MSA layer and the MLP layer, and residual errors are used for connection after the shifted window based MSA layer and the MLP layer;
s2: constructing a deep fusion network, wherein the deep fusion network comprises a convolution module, the improved transformer module, a feature fusion module and a decoder module; the input of the depth fusion network is a medical image, and the output is a segmentation result of a target region in the medical image; after being input into a depth fusion network, a medical image is firstly processed by a convolution module; the convolution module is a backbone network, the output path of the convolution characteristic output by the convolution module is divided into three paths, and the first path is input into the improved transformer network; the second path is output to a characteristic fusion module and is subjected to characteristic fusion with the transformer characteristics output by the improved transformer network; the third path and the feature fusion module output fusion features and jointly send the fusion features to the decoder module to complete decoding, and further required segmentation results are obtained;
s3: selecting a plurality of medical images with polyps as original data to form an original data set, and dividing the original data set into a training set and a testing set according to a data volume ratio of 2: 1;
s4: setting a learning strategy, a training epoch and a loss function in a training stage, training the constructed deep fusion network by using a training set, and testing a training effect by using a test set;
s5: storing a deep fusion network with the performance reaching a preset index after training is finished and testing; and performing semantic segmentation on the medical image to be segmented by using the network as an image segmentation network.
2. The medical image segmentation method based on convolution and transformer fusion of claim 1, characterized in that: in step S1, the consecutive Swin Transformer Blocks in the improved transformer network are computed as:

$\hat{Z}_i = \text{W-MSA}(\text{LN}(Z_{i-1})) + Z_{i-1}$

$Z_i = \text{MLP}(\text{LN}(\hat{Z}_i)) + \hat{Z}_i$

$\hat{Z}_{i+1} = \text{SW-MSA}(\text{LN}(Z_i)) + Z_i$

$Z_{i+1} = \text{MLP}(\text{LN}(\hat{Z}_{i+1})) + \hat{Z}_{i+1}$

In the above formulas, $Z_{i-1}$ denotes the input feature of the i-th layer Swin Transformer Block; $\hat{Z}_i$ denotes the output of the W-MSA of the i-th layer; $Z_i$ is the output feature of the i-th layer Swin Transformer Block and also the input feature of the (i+1)-th layer; $\hat{Z}_{i+1}$ denotes the output of the SW-MSA of the (i+1)-th layer; $Z_{i+1}$ denotes the output feature of the (i+1)-th layer Swin Transformer Block.
3. The medical image segmentation method based on convolution and transformer fusion of claim 2, characterized in that: in the deep fusion network of step S2, the convolution module uses Res2net-50 to form the backbone part of the network; after the medical image is input into the convolution module, four groups of convolution features from shallow to deep, denoted e_i, i = 1, 2, 3, 4, are obtained in turn through convolution processing; the channel dimensions of the four groups of convolution features are 256, 512, 1024 and 2048, respectively, and the feature scales are 128, 64, 32 and 16, respectively;
the convolution features e_i output by the convolution module are processed by the improved transformer network to obtain four groups of transformer features containing global features, denoted d_i, i = 1, 2, 3, 4.
4. The medical image segmentation method based on convolution and transformer fusion according to claim 3, characterized in that: in the deep fusion network of step S2, the feature fusion module includes a front convolutional layer, an up-sampling layer, a feature splicing layer and a back convolutional layer; both the front and back convolutional layers consist of two 3×3 convolution modules, and the feature map output by the up-sampling layer is twice the scale of its input; the feature splicing layer concatenates the two input features along the channel dimension, and the result is processed by the back convolutional layer and output as the fusion feature.
5. The medical image segmentation method based on convolution and transformer fusion of claim 4, characterized in that: in the feature fusion module, the transformer feature d_i output by the improved transformer network is processed by the front convolutional layer and then rescaled by the up-sampling layer, so that the resulting transformer feature M_i has the same size as the convolution feature e_i; the transformer feature and the convolution feature of the same size are input to the feature splicing layer, which concatenates the two input features along the channel dimension, and the required fusion feature is obtained after processing by the back convolutional layer; the fusion feature Z_i output by the feature fusion module is expressed as:

$M_i = \text{upsample}(\text{conv}(\text{conv}(e_i)))$

$Z_i = \sigma(\text{conv}(\text{cat}(M_i, d_i)))$

where conv denotes a 3×3 convolution with stride 1, upsample denotes up-sampling, cat denotes the concatenation operation, and σ is the ReLU activation function.
6. The medical image segmentation method based on convolution and transformer fusion of claim 5, characterized in that: in the deep fusion network of step S2, the inputs of the decoder module are the convolution features e_i output by the convolution module and the fusion features output by the feature fusion module; the output of the decoder module is the decoded image segmentation result.
7. The medical image segmentation method based on convolution and transform fusion of claim 1, characterized by: in step S3, the original data in the original data set is derived from the public polyp data sets kvasir, cvc-clinicDB, ETIS, cvc-colonDB and EndoScene; the number of the original data sets is amplified by carrying out image transformation and enhancement processing on the original data in the original data sets; the image transformation method adopted in the data set amplification process comprises random horizontal mirror image turning, vertical mirror image turning and angle rotation of 90 degrees, 180 degrees and 270 degrees; the adopted image enhancement processing method comprises random brightness, contrast and sharpening adjustment; the random probability of each image transformation method and image enhancement processing method is set to be 0.5.
8. The medical image segmentation method based on convolution and transformer fusion of claim 1, characterized in that: in the training process of step S4, the BCE loss function and the IoU loss function are selected as the loss functions, the PolyLr learning-rate decay strategy is selected, the learning rate is set to 0.0001, and the number of training epochs is set to 240.
9. A medical image segmentation system based on convolution and transform fusion, wherein the medical image segmentation system adopts the medical image segmentation method based on convolution and transform fusion according to any one of claims 1 to 8 to perform semantic segmentation on an acquired medical image so as to obtain an image segmentation prediction result of a target feature; the medical image segmentation system comprises:
the image acquisition module is used for acquiring a medical image to be segmented and preprocessing the medical image so as to meet the input standard of a system;
a convolution network, which adopts Res2net-50 to form a backbone network of the system; after the medical image is input into a convolution network for processing, the output of the convolution network is a convolution characteristic, and the output path of the convolution characteristic comprises three paths;
the improved transformer network receives a first path of convolution characteristics output by the convolution network; the improved Transformer network consists of two consecutive Swin Transformer blocks; the former Swin Transformer Block comprises a window based MSA layer and an MLP layer which are connected in sequence; an LN layer is arranged in front of the window based MSA layer and the MLP layer, and residual errors are used for connection after the window based MSA layer and the MLP layer; the latter Swin Transformer Block comprises a shifted window based MSA layer and an MLP layer which are connected in sequence; an LN layer is arranged in front of the shifted window based MSA layer and the MLP layer, and residual errors are used for connection after the shifted window based MSA layer and the MLP layer; after the input convolution characteristics are processed by the improved transformer network, outputting the convolution characteristics as transformer characteristics;
a feature fusion network receiving the second path of convolution features output by the convolution network and the transform features output by the improved transform network; the characteristic fusion module comprises a front convolution layer, an upper sampling layer, a characteristic splicing layer and a rear convolution layer; the front convolution layer and the rear convolution layer are both two convolution modules of 3 x 3, and the characteristic graph scale output by the upper sampling layer is twice of the input characteristic graph scale; the feature fusion network firstly carries out pre-convolution processing on input transformer features and then carries out scale transformation on the input transformer features through an upper sampling layer, and the dimensions of the transformer features processed by the upper sampling layer are the same as those of the convolution features; splicing the convolution characteristics and the transformer characteristics in the channel dimension in a characteristic splicing network, and then outputting the spliced characteristics as fusion characteristics after the processing of a post convolution layer; and the decoder is used for receiving the third path of convolution characteristics output by the convolution network and the fusion characteristics output by the characteristic fusion network, and then decoding the third path of convolution characteristics and the fusion characteristics to obtain a semantic segmentation result of the required medical image.
10. Medical image segmentation apparatus based on convolution and transform fusion, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the medical image segmentation method based on convolution and transform fusion according to any one of claims 1 to 8 when executing the program.
CN202111381789.6A 2021-11-22 2021-11-22 Medical image segmentation method, system and device based on convolution and transformer fusion Pending CN114066902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111381789.6A CN114066902A (en) 2021-11-22 2021-11-22 Medical image segmentation method, system and device based on convolution and transformer fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111381789.6A CN114066902A (en) 2021-11-22 2021-11-22 Medical image segmentation method, system and device based on convolution and transformer fusion

Publications (1)

Publication Number Publication Date
CN114066902A true CN114066902A (en) 2022-02-18

Family

ID=80278600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111381789.6A Pending CN114066902A (en) 2021-11-22 2021-11-22 Medical image segmentation method, system and device based on convolution and transformer fusion

Country Status (1)

Country Link
CN (1) CN114066902A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565763B (en) * 2022-02-28 2024-01-05 北京百度网讯科技有限公司 Image segmentation method, device, apparatus, medium and program product
CN114565763A (en) * 2022-02-28 2022-05-31 北京百度网讯科技有限公司 Image segmentation method, apparatus, device, medium, and program product
CN114638842B (en) * 2022-03-15 2024-03-22 桂林电子科技大学 Medical image segmentation method based on MLP
CN114638842A (en) * 2022-03-15 2022-06-17 桂林电子科技大学 Medical image segmentation method based on MLP
CN114612759A (en) * 2022-03-22 2022-06-10 北京百度网讯科技有限公司 Video processing method, video query method, model training method and model training device
CN114419449A (en) * 2022-03-28 2022-04-29 成都信息工程大学 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN114912575A (en) * 2022-04-06 2022-08-16 西安交通大学 Medical image segmentation model and method based on Swin transform connection path
CN114912575B (en) * 2022-04-06 2024-04-09 西安交通大学 Medical image segmentation model and method based on connection Swin transducer path
GB2617555A (en) * 2022-04-07 2023-10-18 Milestone Systems As Image processing method, apparatus, computer program and computer-readable data carrier
CN114494254A (en) * 2022-04-14 2022-05-13 科大智能物联技术股份有限公司 Product appearance defect classification method based on fusion of GLCM and CNN-Transformer and storage medium
CN114898110A (en) * 2022-04-25 2022-08-12 四川大学 Medical image segmentation method based on full-resolution representation network
CN115115523A (en) * 2022-08-26 2022-09-27 中加健康工程研究院(合肥)有限公司 CNN and Transformer fused medical image depth information extraction method
CN115393321A (en) * 2022-08-26 2022-11-25 南通大学 Multi-classification pulmonary tuberculosis detection method based on deep learning multi-layer spiral CT
CN115170808A (en) * 2022-09-05 2022-10-11 中邮消费金融有限公司 Image segmentation method and system
CN115661507A (en) * 2022-09-22 2023-01-31 北京建筑大学 Building garbage classification method and device based on optimized Swin Transformer network
CN115409990A (en) * 2022-09-28 2022-11-29 北京医准智能科技有限公司 Medical image segmentation method, device, equipment and storage medium
CN115311317A (en) * 2022-10-12 2022-11-08 广州中平智能科技有限公司 Laparoscope image segmentation method and system based on ScaleFormer algorithm
CN115578406A (en) * 2022-12-13 2023-01-06 四川大学 CBCT jaw bone region segmentation method and system based on context fusion mechanism
CN116258658A (en) * 2023-05-11 2023-06-13 齐鲁工业大学(山东省科学院) Swin transducer-based image fusion method
CN116258914B (en) * 2023-05-15 2023-08-25 齐鲁工业大学(山东省科学院) Remote Sensing Image Classification Method Based on Machine Learning and Local and Global Feature Fusion
CN116258914A (en) * 2023-05-15 2023-06-13 齐鲁工业大学(山东省科学院) Remote sensing image classification method based on machine learning and local and global feature fusion
CN116309596B (en) * 2023-05-23 2023-08-04 杭州华得森生物技术有限公司 CTC cell detection method and system based on micro-fluidic chip
CN116309596A (en) * 2023-05-23 2023-06-23 杭州华得森生物技术有限公司 CTC cell detection method and system based on micro-fluidic chip
CN116912253A (en) * 2023-09-14 2023-10-20 吉林大学 Lung cancer pathological image classification method based on multi-scale mixed neural network
CN116912253B (en) * 2023-09-14 2023-12-05 吉林大学 Lung cancer pathological image classification method based on multi-scale mixed neural network

Similar Documents

Publication Publication Date Title
CN114066902A (en) Medical image segmentation method, system and device based on convolution and transformer fusion
CN113468996B (en) Camouflage object detection method based on edge refinement
EP4085369A1 (en) Forgery detection of face image
CN113888541B (en) Image identification method, device and storage medium for laparoscopic surgery stage
CN113012155A (en) Bone segmentation method in hip image, electronic device, and storage medium
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
CN114445904A (en) Iris segmentation method, apparatus, medium, and device based on full convolution neural network
CN114529574A (en) Image matting method and device based on image segmentation, computer equipment and medium
CN111104941B (en) Image direction correction method and device and electronic equipment
CN115880317A (en) Medical image segmentation method based on multi-branch feature fusion refining
Zhu et al. DFTR: Depth-supervised fusion transformer for salient object detection
Wang et al. Thermal images-aware guided early fusion network for cross-illumination RGB-T salient object detection
CN112749576B (en) Image recognition method and device, computing equipment and computer storage medium
CN114066905A (en) Medical image segmentation method, system and device based on deep learning
TWI803243B (en) Method for expanding images, computer device and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
Fan et al. EGFNet: Efficient guided feature fusion network for skin cancer lesion segmentation
CN115049546A (en) Sample data processing method and device, electronic equipment and storage medium
CN114694150A (en) Method and system for improving generalization capability of digital image classification model
Pei et al. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction
Kim et al. Stereo confidence estimation via locally adaptive fusion and knowledge distillation
CN112633285A (en) Domain adaptation method, domain adaptation device, electronic equipment and storage medium
CN111476267A (en) Method and electronic device for classifying drug efficacy according to cell image
Jones Deep learning for image enhancement and visibility improvement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination