CN116740513A - Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image - Google Patents

Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Info

Publication number
CN116740513A
Authority
CN
China
Prior art keywords
original image
feature map
resolution
convolution
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310558286.4A
Other languages
Chinese (zh)
Inventor
樊亚文
黄谌子谊
胡正开
陈建新
王潮远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310558286.4A priority Critical patent/CN116740513A/en
Publication of CN116740513A publication Critical patent/CN116740513A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/467Encoded features or binary features, e.g. local binary patterns [LBP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-modal fusion lightweight segmentation network and segmentation method for brain MRI images, which mainly comprise three parts. The encoding part comprises four independent encoders that extract features from the original images of the four modalities and apply different attention strategies to different modalities; each encoder contains a three-layer convolution module that downsamples through convolution and pooling layers. The feature fusion part fuses the four modalities at the feature level; during fusion, lightweight modal attention, spatial attention and channel attention are added to different feature layers in different combinations to improve the segmentation accuracy of the model. The decoding part restores the original resolution of the feature maps using convolution and upsampling, where the upsampling is realized by transposed convolution. Compared with the prior art, the network architecture designed by the invention achieves higher segmentation accuracy while keeping the model lightweight.

Description

Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image
Technical Field
The invention relates to a multi-modal fusion lightweight segmentation network and segmentation method for brain MRI images, and belongs to the field of medical image segmentation.
Background
Brain tumors are formed by cancerous cells in the brain and are a life-threatening medical condition. The goal of brain tumor segmentation is to identify and separate tumor regions from healthy tissue by means of medical imaging. Computed tomography (CT), positron emission tomography (PET) and magnetic resonance imaging (MRI) are the three most common imaging methods for diagnosing brain tumors. Among them, MRI is widely used because of its high resolution, strong soft-tissue contrast and non-invasiveness. Manual delineation of tumor contours on brain tumor images is both laborious and error-prone. With the development of deep learning, neural network models that segment brain tumors from automatically learned image features have become mainstream. However, the four MRI modalities are usually simply stacked, so the characteristics of the individual modalities are not exploited and the segmentation accuracy remains limited. Moreover, 3D medical segmentation is computationally expensive and training is slow, so reducing the number of model parameters while keeping the accuracy at a level suitable for application is important.
U-Net consists of an encoder that extracts features from an input image and a decoder that generates a segmentation mask. The encoder passes high-level features to the decoder through "bridge" (skip) connections. Because U-Net combines a small number of parameters with high accuracy, it has been widely used in various segmentation tasks. However, most current mainstream models are U-Net variants in which the four modalities are directly stacked at the pixel level and then encoded by a single encoder, so different strategies cannot be adopted for different modalities.
Disclosure of Invention
The invention aims to provide a multi-modal fusion lightweight segmentation network and segmentation method for brain MRI images that improve segmentation accuracy while keeping the model lightweight.
To achieve the above object, the present invention provides a multi-modal fusion lightweight segmentation network for brain MRI images, comprising:
an encoding part comprising four independent encoders that extract features from the original images of the four modalities and apply different attention strategies to different modalities, each encoder containing a three-layer convolution module that downsamples through convolution and pooling layers;
a feature fusion part that fuses the four modalities at the feature level, adding lightweight modal attention, spatial attention and channel attention to different feature layers in different combinations during fusion to improve the segmentation accuracy of the model; and
a decoding part that restores the original resolution of the feature maps using convolution and upsampling, where the upsampling is realized by transposed convolution.
As a further improvement of the invention, before feature fusion a channel attention mechanism is added to every layer of the encoder of the enhanced sequence, so that the weights of the different channels of the enhanced sequence are attended to.
As a further improvement of the present invention, in the feature fusion part the same-layer features of the four encoders are connected by skip connections.
As a further improvement of the invention, the core component of the feature fusion part is a lightweight attention module ACSMB based on the combination of modal, channel and spatial attention, in which channel attention CA, spatial attention SA and modal attention MA are used to compute the weights of the channels, the spatial positions and the modalities, respectively.
As a further improvement of the invention, the second-layer features of the four encoders contain both detail information and high-level semantics; the joint modal, channel and spatial attention mechanism is therefore used when fusing the second-layer features, while the other two layers use only the joint channel and spatial attention mechanism during fusion.
As a further improvement of the invention, in the decoding part the fused multi-modal features are decoded through a single decoder with three upsampling stages and output to obtain the segmentation result.
To achieve the above object, the present invention further provides a segmentation method of the multi-modal fusion lightweight segmentation network, which is applied to the above segmentation network and mainly comprises the following steps:
S1: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1, T2, T1ce and Flair sequences to generate first feature maps with a resolution equal to the original image and 16 channels;
S2: performing max pooling with a stride of 2 on the first feature maps of the T1, T2, T1ce and Flair sequences, then obtaining second feature maps with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S3: performing max pooling with a stride of 2 on the second feature maps of the T1, T2, T1ce and Flair sequences, then obtaining third feature maps with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S4: performing two consecutive 3×3 convolutions with a stride of 2 on the third feature maps of the T1, T2, T1ce and Flair sequences, respectively, to generate fourth feature maps with a resolution of 1/8 of the original image and an unchanged channel number of 64;
S5: concatenating the first feature maps of the four sequences along the channel dimension to produce a feature map with 64 channels, passing it through a channel-spatial joint attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a fifth feature map with a resolution equal to the original image and 16 channels;
S6: concatenating the second feature maps of the four sequences along the channel dimension to produce a feature map with 128 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a sixth feature map with a resolution of 1/2 of the original image and 32 channels;
S7: concatenating the third feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a seventh feature map with a resolution of 1/4 of the original image and 64 channels;
S8: concatenating the fourth feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, then generating an eighth feature map with a resolution of 1/8 of the original image and 128 channels through two consecutive 3×3 convolutions with a stride of 2;
S9: applying a 3×3 transposed convolution with a stride of 2 to the eighth feature map to generate a ninth feature map with a resolution of 1/4 of the original image and 64 channels;
S10: concatenating the ninth and seventh feature maps along the channel dimension to produce a feature map with 128 channels, then generating a tenth feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S11: applying a 3×3 transposed convolution with a stride of 2 to the tenth feature map to generate an eleventh feature map with a resolution of 1/2 of the original image and 32 channels;
S12: concatenating the eleventh and sixth feature maps along the channel dimension to produce a feature map with 64 channels, then generating a twelfth feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S13: applying a 3×3 transposed convolution with a stride of 2 to the twelfth feature map to generate a thirteenth feature map with a resolution equal to the original image and 16 channels;
S14: concatenating the thirteenth and fifth feature maps along the channel dimension to produce a feature map with 32 channels, then generating a fourteenth feature map with a resolution equal to the original image and 16 channels, i.e. the segmentation result map, through two consecutive 3×3 convolutions with a stride of 2.
As a further improvement of the present invention, step S1 specifically comprises:
S11: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S12: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T2 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S13: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1ce sequence and then passing the result through a channel attention module to generate a first feature map with a resolution equal to the original image and 16 channels;
S14: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image Flair sequence to generate a first feature map with a resolution equal to the original image and 16 channels.
As a further improvement of the present invention, step S2 specifically comprises:
S21: performing max pooling with a stride of 2 on the first feature map of the T1 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S22: performing max pooling with a stride of 2 on the first feature map of the T2 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S23: performing max pooling with a stride of 2 on the first feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a second feature map with a resolution of 1/2 of the original image and 32 channels;
S24: performing max pooling with a stride of 2 on the first feature map of the Flair sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2.
As a further improvement of the present invention, step S3 specifically comprises:
S31: performing max pooling with a stride of 2 on the second feature map of the T1 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S32: performing max pooling with a stride of 2 on the second feature map of the T2 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S33: performing max pooling with a stride of 2 on the second feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a third feature map with a resolution of 1/4 of the original image and 64 channels;
S34: performing max pooling with a stride of 2 on the second feature map of the Flair sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2.
Compared with the prior art, the invention has the following technical effect: the accuracy of the network architecture is significantly improved over existing models, and experimental results show that the network architecture designed by the invention achieves higher segmentation accuracy while keeping the model lightweight.
Drawings
Fig. 1 is a schematic diagram of the multi-modal fusion lightweight segmentation network for brain MRI images according to the present invention.
Fig. 2 shows the channel attention, spatial attention and modal attention modules and the 3D ConvBlock of the present invention.
Fig. 3 is a block diagram of the modal-channel-spatial joint attention module (ACSMB) and the channel-spatial joint attention module (ACSB) of the present invention.
Fig. 4 shows segmentation results of the network of the present invention on the Brats2021 dataset.
Fig. 5 shows experimental results of the present invention compared with other models.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
To avoid obscuring the present invention with unnecessary detail, only the structures and/or processing steps closely related to the invention are shown in the drawings, and other details of little relevance are omitted.
The labels in the Brats2021 dataset are divided into the enhancing tumor region (ET), the tumor core (TC) and the whole tumor region (WT). Compared with conventional T1, T1ce shows a high signal in the ET region. The TC region contains the bulk of the tumor and is typically the region that needs to be resected. Necrotic (NCR) and non-enhancing (NET) tumor tissue usually appears markedly hypointense in T1ce compared with T1. The whole tumor describes the full extent of the disease, since it includes the tumor core and the peritumoral edema (ED), which is usually depicted by the hyperintense signal in FLAIR. According to these different tumor biological characteristics, different attention mechanisms are added to different modalities in the encoding part, and corresponding strategies are adopted when the feature layers are fused.
The overall network framework is a multi-modal fusion lightweight segmentation network based on modal attention: the four modality sequences first pass through independent encoders, multi-modal fusion is then performed at the feature level, and finally the fused features are decoded to obtain the segmented image.
As shown in fig. 1, the multi-modal fusion lightweight segmentation network for brain MRI images provided by the present invention consists of three basic parts: an encoding part, a feature fusion part and a decoding part. The encoding part comprises four independent encoders that extract features from the original images of the four modalities, so that a different attention strategy can be applied to each modality, according to its tumor biological characteristics, before feature fusion. Each encoder contains a three-layer convolution module that downsamples through convolution and pooling layers. The invention focuses on the differences between modalities and treats them differently before early fusion. According to domain knowledge in the medical field, the T1ce enhanced sequence carries richer feature information about the tumor region, so a channel attention (CA) mechanism is added to every layer of the encoder of the T1ce sequence before the modality features are fused, in order to attend to the weights of its different channels and make full use of the enhanced characteristics of this sequence.
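For illustration only, the following is a minimal sketch (assuming a PyTorch-style implementation) of how such a per-modality encoder could be realized; it is not the implementation of the present invention. The names ModalityEncoder and conv_block, the BatchNorm/ReLU placement, the padding and the simple channel gate used as a stand-in for the CA module are assumptions; stride-1 convolutions with padding are used so that the stated resolutions are preserved, with stride-2 max pooling doing the downsampling, and the 16/32/64 channel widths follow steps S1-S3 below.

```python
# Hypothetical sketch (PyTorch assumed) of a per-modality encoder; not the patent's code.
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3x3 convolutions, each followed by BatchNorm and ReLU; resolution preserved."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
    )

class ModalityEncoder(nn.Module):
    """Three convolution stages (16/32/64 channels) with stride-2 max pooling in between.
    For the T1ce sequence, a simple channel gate re-weights the channels of every stage."""
    def __init__(self, in_ch=1, widths=(16, 32, 64), use_channel_attention=False):
        super().__init__()
        self.stages = nn.ModuleList([
            conv_block(in_ch, widths[0]),
            conv_block(widths[0], widths[1]),
            conv_block(widths[1], widths[2]),
        ])
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)
        self.use_ca = use_channel_attention
        if use_channel_attention:
            # stand-in for the patent's channel attention (CA) module
            self.gates = nn.ModuleList([
                nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Conv3d(w, w, 1), nn.Sigmoid())
                for w in widths
            ])

    def forward(self, x):
        feats = []
        for i, stage in enumerate(self.stages):
            if i > 0:
                x = self.pool(x)           # halve D, H, W between stages
            x = stage(x)
            if self.use_ca:
                x = self.gates[i](x) * x   # channel re-weighting for the enhanced sequence
            feats.append(x)
        return feats                        # full-, 1/2- and 1/4-resolution feature maps

# One encoder per modality; only the T1ce encoder uses channel attention.
encoders = nn.ModuleDict({
    m: ModalityEncoder(use_channel_attention=(m == "t1ce"))
    for m in ("t1", "t2", "t1ce", "flair")
})
```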
The feature fusion part fuses the four modalities at the feature level; during fusion, lightweight modal attention, spatial attention and channel attention are added to different feature layers in different combinations to improve the segmentation accuracy of the model. Specifically, in the feature fusion part the same-layer features of the four encoders are connected through skip connections. Considering that the second-layer features of the four encoders contain both detail information and high-level semantics, only the second layer uses the joint modal, channel and spatial attention (ACSM) mechanism during fusion, while the other two layers use only the joint channel and spatial attention (ACS) mechanism; this improves the accuracy of the model while keeping it lightweight.
That is, the core component of the feature fusion part is a novel lightweight attention module based on the combination of modal, channel and spatial attention (ACSM, Attention based on Channel, Spatial and Modal). It applies lightweight channel attention (CA), spatial attention (SA) and modal attention (MA) in sequence to compute the weights of the channels, the spatial positions and the modalities, respectively, so that the individual characteristics of each modality are fully exploited during fusion and the segmentation accuracy is improved. At the same time, the lightweight attention module effectively reduces the computational complexity of the network, so that the network keeps very few model parameters while improving feature extraction, achieving a good compromise between accuracy and efficiency.
The decoding part uses convolution and upsampling to restore the original resolution of the feature maps, with the upsampling implemented by transposed convolution. Specifically, the fused multi-modal features are decoded through a single decoder with three upsampling stages and output to obtain the segmentation result.
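A minimal sketch (PyTorch assumed) of one such decoder stage as just described: transposed-convolution upsampling, concatenation with the fused skip feature of the same resolution, then two 3×3×3 convolutions. The class name DecoderStage and the padding/normalization choices are assumptions, not the patent's actual code.

```python
# Hypothetical decoder-stage sketch (PyTorch assumed); not the patent's code.
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 3x3x3 transposed convolution with stride 2 doubles D, H and W.
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3, stride=2,
                                     padding=1, output_padding=1)
        self.conv = nn.Sequential(
            nn.Conv3d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # channel-wise concatenation with the fused feature
        return self.conv(x)

# Example wiring following steps S8-S10 below: the eighth feature map (128 channels,
# 1/8 resolution) is upsampled and merged with the seventh feature map (64 channels, 1/4 resolution).
stage = DecoderStage(in_ch=128, skip_ch=64, out_ch=64)
```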
The overall network framework of fig. 1 can efficiently extract modality information and detail information from the MRI data and can be trained end-to-end. Compared with the latest mainstream segmentation networks, the designed architecture achieves higher segmentation accuracy with a lightweight model structure.
Referring to fig. 1, the segmentation method of the overall network structure designed by the invention is explained; it mainly comprises the following steps:
S1: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1, T2, T1ce and Flair sequences to generate first feature maps with a resolution equal to the original image and 16 channels. This specifically comprises:
S11: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S12: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T2 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S13: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1ce sequence and then passing the result through a channel attention module to generate a first feature map with a resolution equal to the original image and 16 channels;
S14: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image Flair sequence to generate a first feature map with a resolution equal to the original image and 16 channels.
S2: the first feature map of the sequence T1, T2, T1ce, flair is first pooled at a maximum of step size 2, and then, through two continuous convolution with the step length of 3 multiplied by 3 being 2, obtaining a second characteristic diagram with the resolution of 1/2 of the original image and the channel number of 32. The method specifically comprises the following steps:
s21: pooling the maximum value of the step length of 2 of the first feature map of the T1 sequence, and then obtaining a second feature map with the resolution of 1/2 of the original image and the channel number of 32 through convolution of two continuous 3X 3 step lengths of 2;
s22: pooling the maximum value of the step length of 2 of the first feature map of the T2 sequence, and obtaining a second feature map with the resolution of 1/2 of the original image and the channel number of 32 through convolution of two continuous 3X 3 step lengths of 2;
s23: pooling the first feature map of the T1ce sequence with a maximum value of step length of 2, convolving two continuous convolution with step length of 3 multiplied by 3 with step length of 2, and obtaining a second feature map with resolution of 1/2 of the original image and channel number of 32 through a channel attention module;
s24: the first feature map of the Flair sequence is first pooled at a step size of 2, then convolved at step sizes of 2 by two consecutive 3 x 3, a second feature map with a resolution of 1/2 of the original image and a channel number of 32 is obtained.
S3: and (3) carrying out maximum value pooling with the step length of 2 on the second characteristic diagram of the T1, T2, T1ce and Flair sequence, and then obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through two continuous convolution with the step length of 3 multiplied by 3 and the step length of 2. The method specifically comprises the following steps:
s31: carrying out maximum value pooling with the step length of 2 on the second characteristic diagram of the T1 sequence, and then obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through convolution with the step length of 2 of two continuous 3 multiplied by 3;
s32: pooling the maximum value of the step length of 2 of the second feature map of the T2 sequence, and obtaining a third feature map with the resolution of 1/4 of the original image and the channel number of 64 through convolution of two continuous 3X 3 steps of 2;
s33: the second feature map of the T1ce sequence is first pooled with a step size of 2 maximum, and two consecutive 3 x 3 convolutions with a step size of 2, obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through a channel attention module;
s34: and (3) carrying out maximum value pooling with the step length of 2 on the second characteristic diagram of the Flair sequence, and then obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through convolution with the step length of 2 of two continuous 3 multiplied by 3.
S4: two consecutive 3×3 convolutions (3D ConvBlock) with step size of 2 are performed on the third feature map of the sequence T1, T2, T1ce, flair, respectively, to generate a fourth feature map with resolution of 1/8 of the original image and channel number of 64 unchanged.
S5: after the characteristic mosaic of channel dimension is carried out on the first characteristic graph of the four sequences, the generated characteristic graph with the channel number of 64 is subjected to convolution (3 DConvBlock) with the step length of 2, which is two continuous 3 multiplied by 3, after passing through a channel space joint attention module (ACSB), and a fifth characteristic graph with the resolution equal to the original image and the channel number of 16 is generated.
S6: after the characteristic mosaic of channel dimension is carried out on the second characteristic graphs of the four sequences, the generated characteristic graph with the channel number of 128 passes through an ACSMB module, and then two continuous convolution (3D ConvBlock) with the step length of 3 multiplied by 3 and 2 is carried out, so that a sixth characteristic graph with the resolution of 1/2 of the original image and the channel number of 32 is generated.
S7: after the characteristic mosaic of the channel dimension is carried out on the third characteristic graphs of the four sequences, the generated characteristic graphs with 256 channels pass through a light self-attention module (ACSMB), two continuous convolution (3D ConvBlock) with the step length of 3 multiplied by 3 being 2 is carried out, and a seventh characteristic graph with the resolution of 1/4 of the original image and the channel number of 64 is generated.
S8: after the feature stitching of the channel dimension is carried out on the fourth feature graphs of the four sequences, the generated feature graphs with 256 channels are convolved (3D ConvBlock) with 2 step sizes of two continuous 3 multiplied by 3, and then an eighth feature graph with 1/8 of the resolution of the original image and 128 channels is generated.
S9: after convolving the eighth feature map with a 3 x 3 transpose with a step size of 2, a ninth feature map having a resolution of 1/4 of the original image and a channel number of 64 is generated.
S10: after the ninth feature map and the seventh feature map are subjected to feature stitching of channel dimensions, a feature map with the channel number of 128 is generated, and then a tenth feature map with the resolution of 1/4 of the original image and the channel number of 64 is generated through two continuous convolutions (3D ConvBlock) with the step length of 3×3 being 2.
S11: after convolving the tenth feature map with a 3 x 3 transpose with a step size of 2, an eleventh feature map having a resolution of 1/2 of the original image and a channel number of 32 is generated.
S12: after the eleventh feature map and the sixth feature map are subjected to feature stitching of channel dimensions, a feature map with the channel number of 64 is generated, and then a twelfth feature map with the resolution of 1/2 of the original image and the channel number of 32 is generated through two continuous convolutions (3D ConvBlock) with the step length of 3×3 being 2.
S13: after convolving the twelfth feature map with a 3 x 3 transpose of step size 2, a thirteenth feature map with a resolution equal to the original image and a channel number of 16 is generated.
S14: and after the thirteenth feature map and the fifth feature map are subjected to feature stitching of channel dimension, generating a feature map with the channel number of 32, and generating a fourteenth feature map with the channel number of 16, namely a segmentation result map, by two continuous convolution (3D ConvBlock) with the step length of 3 multiplied by 3 being 2, wherein the resolution is equal to that of the original image.
A specific note: steps S1-S4 give the outputs of the three layers of the four modality encoders, respectively, while steps S5-S14 are described in conjunction with fig. 1; the steps and fig. 1 corroborate each other.
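As a rough, hypothetical illustration of how one fusion layer in steps S5-S8 might be wired (PyTorch assumed): the same-layer feature maps of the four sequences are concatenated along the channel dimension, passed through a joint attention block, and reduced by two 3×3×3 convolutions. The class name FusionLayer and the normalization/padding choices are assumptions; the attention argument stands in for the ACSB/ACSMB modules described below with reference to fig. 3 (nn.Identity() can be passed for step S8, which uses no attention block).

```python
# Hypothetical fusion-layer sketch (PyTorch assumed); not the patent's code.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, per_modality_ch, out_ch, attention: nn.Module):
        super().__init__()
        fused_ch = 4 * per_modality_ch          # four modality feature maps concatenated
        self.attention = attention              # ACSB, ACSMB or nn.Identity()
        self.conv = nn.Sequential(
            nn.Conv3d(fused_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, f_t1, f_t2, f_t1ce, f_flair):
        fused = torch.cat([f_t1, f_t2, f_t1ce, f_flair], dim=1)   # e.g. 4 x 16 = 64 channels (S5)
        return self.conv(self.attention(fused))
```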
Fig. 2 shows the core modules of the entire network, CA, SA, MA and the 3D ConvBlock, which are described in detail as follows:
Fig. 2(a) is the spatial attention module. A given three-dimensional input feature map F_1 is max-pooled and average-pooled along the channel dimension to generate two 3D spatial attention maps. The two maps are added, passed through a 7×7×7 convolution layer with a stride of 1, and finally compressed by a Sigmoid activation function to obtain the spatial attention weight M_S(F_1):

M_S(F_1) = sigmoid(Conv_{7×7×7}(AvgPool(F_1) + MaxPool(F_1)))    (1)
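A minimal sketch (PyTorch assumed) of equation (1); the class name SpatialAttention3D is an assumption, and the module returns only the weight map M_S, to be multiplied with the input by the caller.

```python
# Hypothetical sketch of Eq. (1) (PyTorch assumed); not the patent's code.
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """M_S(F) = sigmoid(Conv_7x7x7(AvgPool_channel(F) + MaxPool_channel(F)))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(1, 1, kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, x):
        # x: (B, C, D, H, W); pool along the channel dimension.
        avg_map = x.mean(dim=1, keepdim=True)               # (B, 1, D, H, W)
        max_map = x.amax(dim=1, keepdim=True)               # (B, 1, D, H, W)
        return torch.sigmoid(self.conv(avg_map + max_map))  # spatial weights, broadcast over C
```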
Fig. 2(b) is the 3D ConvBlock, which consists of two 3×3 convolution layers, batch normalization and a ReLU layer in series.
Fig. 2(c) is the channel attention module. A given three-dimensional input feature map F_2 is passed through adaptive average pooling and adaptive max pooling to generate two 1D channel attention descriptors, M_CA and M_CM. The two descriptors are each fed into a multi-layer perceptron (MLP) module, the results are added, and the sum is compressed by a Sigmoid activation function to obtain the channel attention weight M_C(F_2):

M_C(F_2) = sigmoid(MLP(AvgPool(F_2)) + MLP(MaxPool(F_2)))    (2)
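A minimal sketch (PyTorch assumed) of equation (2). The reduction ratio of the MLP and the use of a shared MLP for both pooled descriptors are assumptions; the module returns the weight vector M_C.

```python
# Hypothetical sketch of Eq. (2) (PyTorch assumed); reduction ratio and shared MLP are assumptions.
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        avg_desc = self.mlp(self.avg_pool(x).view(b, c))   # M_CA branch
        max_desc = self.mlp(self.max_pool(x).view(b, c))   # M_CM branch
        return torch.sigmoid(avg_desc + max_desc).view(b, c, 1, 1, 1)  # broadcast over D, H, W
```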
Fig. 2(d) is the modal attention module MA designed in the present invention. Given an input F_3, its dimensions are first reshaped so that the feature maps of the four modalities are separated. Average pooling then compresses the spatial dimensions, and a further average pooling compresses the channel dimension, yielding one weight scalar per modality. The generated weights of the four modalities are multiplied back onto the original feature map:

M_M(F_3) = Reshape(AvgPool(AvgPool(Reshape(F_3))) × F_3')    (3)
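A minimal sketch (PyTorch assumed) of equation (3). The convention that the channel dimension is split into four equal modality groups, and the absence of any normalization of the modality weights, are assumptions read off the description above.

```python
# Hypothetical sketch of Eq. (3) (PyTorch assumed); the reshape convention is an assumption.
import torch
import torch.nn as nn

class ModalAttention(nn.Module):
    def __init__(self, num_modalities=4):
        super().__init__()
        self.m = num_modalities

    def forward(self, x):
        # x: (B, C, D, H, W), with the C channels holding the concatenated modality groups.
        b, c, d, h, w = x.shape
        groups = x.view(b, self.m, c // self.m, d, h, w)   # Reshape(F): split the modalities
        weights = groups.mean(dim=(3, 4, 5)).mean(dim=2)   # spatial, then channel average pooling -> (B, m)
        weights = weights.view(b, self.m, 1, 1, 1, 1)
        return (groups * weights).view(b, c, d, h, w)      # multiply the weights back, restore shape
```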
Fig. 3(a) is a schematic diagram of the lightweight attention module (ACSMB) of the present invention. A given input F is first passed through the channel attention module, and the resulting channel attention weights are multiplied with the original feature map to obtain F_C. F_C is then passed through the spatial attention module, and the resulting spatial attention weights are multiplied with F_C to obtain F_CS. F_CS is finally passed through the modal attention module, which attends to the features of the different modalities, and the output is added to F_CS to generate the fused feature map F_CSM:

F_CSM = M_M(F_CS) + F_CS    (6)
Fig. 3(b) is a schematic diagram of the ACSB module; it is identical to fig. 3(a) except that the modal attention module is removed from the ACSMB.
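Combining the hypothetical modules sketched above, the ACSMB and ACSB blocks of fig. 3 could be assembled as follows (sketch only; it assumes the SpatialAttention3D, ChannelAttention3D and ModalAttention classes defined in the previous sketches).

```python
# Hypothetical ACSMB / ACSB sketch (PyTorch assumed); reuses the classes sketched above.
import torch.nn as nn

class ACSMB(nn.Module):
    """Channel -> spatial -> modal attention, with a residual add as in Eq. (6)."""
    def __init__(self, channels, num_modalities=4):
        super().__init__()
        self.ca = ChannelAttention3D(channels)
        self.sa = SpatialAttention3D()
        self.ma = ModalAttention(num_modalities)

    def forward(self, x):
        f_c = self.ca(x) * x          # channel-refined features F_C
        f_cs = self.sa(f_c) * f_c     # spatially refined features F_CS
        return self.ma(f_cs) + f_cs   # F_CSM = M_M(F_CS) + F_CS

class ACSB(nn.Module):
    """Same as ACSMB with the modal attention branch removed (fig. 3(b))."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention3D(channels)
        self.sa = SpatialAttention3D()

    def forward(self, x):
        f_c = self.ca(x) * x
        return self.sa(f_c) * f_c
```

With the FusionLayer sketched earlier, step S6 would then read, for example, FusionLayer(32, 32, ACSMB(128)); these channel widths follow the numbers stated in steps S2 and S6 and are otherwise illustrative.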
Fig. 4 shows segmentation results of the segmentation network of the present invention on the Brats2021 dataset. To verify the accuracy and efficiency of the designed network, the model was trained, evaluated and tested on the widely used Brats2021 dataset, whose training/validation/test sets contain 1251/219/570 images, respectively.
Fig. 5 shows the experimental results compared with other models. The average Dice coefficient of the four-encoder model is 0.009 higher than that of the basic single-encoder model. After the channel attention module is added to the T1ce sequence, the feature information of that sequence is better exploited: the Dice coefficients of ET and TC are improved by 0.013 and 0.010, respectively, compared with the plain four-encoder structure. For the feature-concatenation part of the multi-modal fusion, ACS modules are added to the first and third layers and an ACSM module to the second layer, which strengthens the weighting of each part during modality fusion and the attention paid to the different modalities. The final model achieves the best overall average Dice of 0.891, with the Dice coefficients of ET, TC and WT improved by 0.037, 0.038 and 0.015, respectively.
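For reference, the Dice similarity coefficient reported in fig. 5 measures the overlap between a predicted mask P and the ground truth G as Dice = 2|P∩G| / (|P| + |G|). A minimal, generic sketch of how it is typically computed follows (the smoothing constant eps is an assumption added to avoid division by zero; this is not tied to the patent's evaluation code).

```python
# Generic Dice-coefficient sketch (PyTorch assumed); not the patent's evaluation code.
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks of the same shape."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```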
In summary, the invention uses a novel lightweight attention module (ACSM) based on the combination of modal, channel and spatial attention, applying lightweight channel attention CA, spatial attention SA and modal attention MA in sequence to compute the weights of the channels, the spatial positions and the modalities, respectively, so that the individual characteristics of each modality are fully exploited during fusion and the segmentation accuracy is improved. Compared with existing models, the accuracy of the proposed network architecture is significantly improved, and the experimental results show that it achieves higher segmentation accuracy while keeping the model lightweight.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A multi-modal fusion lightweight segmentation network for brain MRI images, comprising:
an encoding part comprising four independent encoders that extract features from the original images of the four modalities and apply different attention strategies to different modalities, each encoder containing a three-layer convolution module that downsamples through convolution and pooling layers;
a feature fusion part that fuses the four modalities at the feature level, adding lightweight modal attention, spatial attention and channel attention to different feature layers in different combinations during fusion to improve the segmentation accuracy of the model; and
a decoding part that restores the original resolution of the feature maps using convolution and upsampling, where the upsampling is realized by transposed convolution.
2. The multi-modal fusion lightweight segmentation network of claim 1, wherein: before feature fusion, a channel attention mechanism is added to every layer of the encoder of the enhanced sequence, so that the weights of the different channels of the enhanced sequence are attended to.
3. The multi-modal fusion lightweight segmentation network of claim 1, wherein: in the feature fusion part, the same-layer features of the four encoders are connected by skip connections.
4. The multi-modal fusion lightweight segmentation network of claim 1, wherein: the core component of the feature fusion part is a lightweight attention module ACSMB based on the combination of modal, channel and spatial attention, in which channel attention CA, spatial attention SA and modal attention MA are applied in sequence to compute the weights of the channels, the spatial positions and the modalities, respectively.
5. The multi-modal fusion lightweight segmentation network of claim 4, wherein: the second-layer features of the four encoders contain both detail information and high-level semantics; the joint modal, channel and spatial attention mechanism is used when fusing the second-layer features, while the other two layers use only the joint channel and spatial attention mechanism during fusion.
6. The multi-modal fusion lightweight segmentation network of claim 1, wherein: in the decoding part, the fused multi-modal features are decoded through a single decoder with three upsampling stages and output to obtain the segmentation result.
7. A segmentation method of a multi-modal fusion lightweight segmentation network, applied to the segmentation network of any one of claims 1-6, comprising the following steps:
S1: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1, T2, T1ce and Flair sequences to generate first feature maps with a resolution equal to the original image and 16 channels;
S2: performing max pooling with a stride of 2 on the first feature maps of the T1, T2, T1ce and Flair sequences, then obtaining second feature maps with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S3: performing max pooling with a stride of 2 on the second feature maps of the T1, T2, T1ce and Flair sequences, then obtaining third feature maps with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S4: performing two consecutive 3×3 convolutions with a stride of 2 on the third feature maps of the T1, T2, T1ce and Flair sequences, respectively, to generate fourth feature maps with a resolution of 1/8 of the original image and an unchanged channel number of 64;
S5: concatenating the first feature maps of the four sequences along the channel dimension to produce a feature map with 64 channels, passing it through a channel-spatial joint attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a fifth feature map with a resolution equal to the original image and 16 channels;
S6: concatenating the second feature maps of the four sequences along the channel dimension to produce a feature map with 128 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a sixth feature map with a resolution of 1/2 of the original image and 32 channels;
S7: concatenating the third feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a seventh feature map with a resolution of 1/4 of the original image and 64 channels;
S8: concatenating the fourth feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, then generating an eighth feature map with a resolution of 1/8 of the original image and 128 channels through two consecutive 3×3 convolutions with a stride of 2;
S9: applying a 3×3 transposed convolution with a stride of 2 to the eighth feature map to generate a ninth feature map with a resolution of 1/4 of the original image and 64 channels;
S10: concatenating the ninth and seventh feature maps along the channel dimension to produce a feature map with 128 channels, then generating a tenth feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S11: applying a 3×3 transposed convolution with a stride of 2 to the tenth feature map to generate an eleventh feature map with a resolution of 1/2 of the original image and 32 channels;
S12: concatenating the eleventh and sixth feature maps along the channel dimension to produce a feature map with 64 channels, then generating a twelfth feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S13: applying a 3×3 transposed convolution with a stride of 2 to the twelfth feature map to generate a thirteenth feature map with a resolution equal to the original image and 16 channels;
S14: concatenating the thirteenth and fifth feature maps along the channel dimension to produce a feature map with 32 channels, then generating a fourteenth feature map with a resolution equal to the original image and 16 channels, i.e. the segmentation result map, through two consecutive 3×3 convolutions with a stride of 2.
8. The segmentation method according to claim 7, wherein step S1 specifically comprises:
S11: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S12: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T2 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S13: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1ce sequence and then passing the result through a channel attention module to generate a first feature map with a resolution equal to the original image and 16 channels;
S14: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image Flair sequence to generate a first feature map with a resolution equal to the original image and 16 channels.
9. The segmentation method according to claim 7, wherein step S2 specifically comprises:
S21: performing max pooling with a stride of 2 on the first feature map of the T1 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S22: performing max pooling with a stride of 2 on the first feature map of the T2 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S23: performing max pooling with a stride of 2 on the first feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a second feature map with a resolution of 1/2 of the original image and 32 channels;
S24: performing max pooling with a stride of 2 on the first feature map of the Flair sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2.
10. The segmentation method according to claim 7, wherein step S3 specifically comprises:
S31: performing max pooling with a stride of 2 on the second feature map of the T1 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S32: performing max pooling with a stride of 2 on the second feature map of the T2 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S33: performing max pooling with a stride of 2 on the second feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a third feature map with a resolution of 1/4 of the original image and 64 channels;
S34: performing max pooling with a stride of 2 on the second feature map of the Flair sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2.
CN202310558286.4A 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image Pending CN116740513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310558286.4A CN116740513A (en) 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310558286.4A CN116740513A (en) 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Publications (1)

Publication Number Publication Date
CN116740513A true CN116740513A (en) 2023-09-12

Family

ID=87900222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310558286.4A Pending CN116740513A (en) 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Country Status (1)

Country Link
CN (1) CN116740513A (en)

Similar Documents

Publication Publication Date Title
CN110084863B (en) Multi-domain image conversion method and system based on generation countermeasure network
Yuan et al. An effective CNN and Transformer complementary network for medical image segmentation
Liang et al. MCFNet: Multi-layer concatenation fusion network for medical images fusion
CN113012172B (en) AS-UNet-based medical image segmentation method and system
CN115482241A (en) Cross-modal double-branch complementary fusion image segmentation method and device
CN111627019A (en) Liver tumor segmentation method and system based on convolutional neural network
CN110706214B (en) Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN116433914A (en) Two-dimensional medical image segmentation method and system
CN115311194A (en) Automatic CT liver image segmentation method based on transformer and SE block
CN112819914A (en) PET image processing method
CN112488971A (en) Medical image fusion method for generating countermeasure network based on spatial attention mechanism and depth convolution
CN115880312A (en) Three-dimensional image automatic segmentation method, system, equipment and medium
CN115457359A (en) PET-MRI image fusion method based on adaptive countermeasure generation network
Xu et al. Infrared and visible image fusion using a deep unsupervised framework with perceptual loss
KR102092205B1 (en) Image processing method and apparatus for generating super resolution, inverse tone mapping and joint super resolution-inverse tone mapping processed multiple output image
CN117475268A (en) Multimode medical image fusion method based on SGDD GAN
CN117197627B (en) Multi-mode image fusion method based on high-order degradation model
CN116740513A (en) Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image
CN113744284B (en) Brain tumor image region segmentation method and device, neural network and electronic equipment
CN116757982A (en) Multi-mode medical image fusion method based on multi-scale codec
Chen et al. Contrastive learning with feature fusion for unpaired thermal infrared image colorization
CN115731280A (en) Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN115994892A (en) Lightweight medical image segmentation method and system based on ghostnet

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination