CN116740513A - Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image - Google Patents

Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Info

Publication number
CN116740513A
Authority
CN
China
Prior art keywords
original image
feature map
resolution
convolution
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310558286.4A
Other languages
Chinese (zh)
Inventor
樊亚文
黄谌子谊
胡正开
陈建新
王潮远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310558286.4A priority Critical patent/CN116740513A/en
Publication of CN116740513A publication Critical patent/CN116740513A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/467Encoded features or binary features, e.g. local binary patterns [LBP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-modal fusion lightweight segmentation network and segmentation method for brain MRI images, which mainly comprise three parts. The encoding part comprises four independent encoders that extract features from the original images of the four modalities and apply different attention strategies to different modalities; each encoder contains a three-layer convolution module that downsamples through convolution and pooling layers. The feature fusion part fuses the four modalities at the feature level; during fusion, lightweight modal attention, spatial attention and channel attention are added to different feature layers in different combinations to improve the segmentation accuracy of the model. The decoding part restores the original resolution of the feature maps using convolution and upsampling, where the upsampling is realized by transposed convolution. Compared with the prior art, the network architecture designed by the invention achieves higher segmentation accuracy while keeping the model lightweight.

Description

Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image
Technical Field
The invention relates to a multi-modal fusion lightweight segmentation network and segmentation method for brain MRI images, and belongs to the field of medical image segmentation.
Background
Brain tumors are formed by cancerous cells in the brain and are a life-threatening medical condition. The goal of brain tumor segmentation is to identify and separate tumor regions from healthy tissue by means of medical imaging. Computed tomography (CT), positron emission tomography (PET) and magnetic resonance imaging (MRI) are the three most common imaging methods for diagnosing brain tumors. Among them, MRI is widely used because of its high resolution, strong soft-tissue contrast and non-invasiveness. Manual delineation of tumor contours on brain tumor images is both laborious and error-prone. With the development of deep learning, neural network models that segment brain tumors from automatically learned image features have become mainstream. However, the four MRI modalities are usually simply stacked, so the characteristics of the individual modalities are not exploited and the segmentation accuracy remains limited. Moreover, 3D medical segmentation is computationally expensive and training is slow, so reducing the number of model parameters while keeping the accuracy at a level suitable for application is important.
U-Net consists of an encoder that extracts features from an input image and a decoder that generates a segmentation mask. The encoder passes high-level features to the decoder through "bridge" (skip) connections. Because U-Net combines a small number of parameters with high accuracy, it has been widely used in various segmentation tasks. However, most current mainstream models are U-Net variants in which the four modalities are directly stacked at the pixel level and then encoded by a single encoder, so different strategies cannot be adopted for different modalities.
Disclosure of Invention
The invention aims to provide a multi-modal fusion lightweight segmentation network and segmentation method for brain MRI images that improve segmentation accuracy while keeping the model lightweight.
To achieve the above object, the present invention provides a multi-modal fusion lightweight segmentation network for brain MRI images, comprising:
an encoding part comprising four independent encoders that extract features from the original images of the four modalities and apply different attention strategies to different modalities, each encoder containing a three-layer convolution module that downsamples through convolution and pooling layers;
a feature fusion part that fuses the four modalities at the feature level, adding lightweight modal attention, spatial attention and channel attention to different feature layers in different combinations during fusion to improve the segmentation accuracy of the model; and
a decoding part that restores the original resolution of the feature maps using convolution and upsampling, where the upsampling is realized by transposed convolution.
As a further improvement of the invention, before feature fusion a channel attention mechanism is added to every layer of the encoder of the enhanced sequence, so that the weights of the different channels of the enhanced sequence are attended to.
As a further improvement of the present invention, in the feature fusion part the same-layer features of the four encoders are connected by skip connections.
As a further improvement of the invention, the core component of the feature fusion part is a lightweight attention module ACSMB based on the combination of modal, channel and spatial attention, in which channel attention CA, spatial attention SA and modal attention MA are used to compute the weights of the channels, the spatial positions and the modalities, respectively.
As a further improvement of the invention, the second-layer features of the four encoders contain both detail information and high-level semantics; the joint modal, channel and spatial attention mechanism is therefore used when fusing the second-layer features, while the other two layers use only the joint channel and spatial attention mechanism during fusion.
As a further improvement of the invention, in the decoding part the fused multi-modal features are decoded through a single decoder with three upsampling stages and output to obtain the segmentation result.
To achieve the above object, the present invention further provides a segmentation method of the multi-modal fusion lightweight segmentation network, which is applied to the above segmentation network and mainly comprises the following steps:
S1: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1, T2, T1ce and Flair sequences to generate first feature maps with a resolution equal to the original image and 16 channels;
S2: performing max pooling with a stride of 2 on the first feature maps of the T1, T2, T1ce and Flair sequences, then obtaining second feature maps with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S3: performing max pooling with a stride of 2 on the second feature maps of the T1, T2, T1ce and Flair sequences, then obtaining third feature maps with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S4: performing two consecutive 3×3 convolutions with a stride of 2 on the third feature maps of the T1, T2, T1ce and Flair sequences, respectively, to generate fourth feature maps with a resolution of 1/8 of the original image and an unchanged channel number of 64;
S5: concatenating the first feature maps of the four sequences along the channel dimension to produce a feature map with 64 channels, passing it through a channel-spatial joint attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a fifth feature map with a resolution equal to the original image and 16 channels;
S6: concatenating the second feature maps of the four sequences along the channel dimension to produce a feature map with 128 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a sixth feature map with a resolution of 1/2 of the original image and 32 channels;
S7: concatenating the third feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a seventh feature map with a resolution of 1/4 of the original image and 64 channels;
S8: concatenating the fourth feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, then generating an eighth feature map with a resolution of 1/8 of the original image and 128 channels through two consecutive 3×3 convolutions with a stride of 2;
S9: applying a 3×3 transposed convolution with a stride of 2 to the eighth feature map to generate a ninth feature map with a resolution of 1/4 of the original image and 64 channels;
S10: concatenating the ninth and seventh feature maps along the channel dimension to produce a feature map with 128 channels, then generating a tenth feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S11: applying a 3×3 transposed convolution with a stride of 2 to the tenth feature map to generate an eleventh feature map with a resolution of 1/2 of the original image and 32 channels;
S12: concatenating the eleventh and sixth feature maps along the channel dimension to produce a feature map with 64 channels, then generating a twelfth feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S13: applying a 3×3 transposed convolution with a stride of 2 to the twelfth feature map to generate a thirteenth feature map with a resolution equal to the original image and 16 channels;
S14: concatenating the thirteenth and fifth feature maps along the channel dimension to produce a feature map with 32 channels, then generating a fourteenth feature map with a resolution equal to the original image and 16 channels, i.e. the segmentation result map, through two consecutive 3×3 convolutions with a stride of 2.
As a further improvement of the present invention, step S1 specifically comprises:
S11: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S12: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T2 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S13: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1ce sequence and then passing the result through a channel attention module to generate a first feature map with a resolution equal to the original image and 16 channels;
S14: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image Flair sequence to generate a first feature map with a resolution equal to the original image and 16 channels.
As a further improvement of the present invention, step S2 specifically comprises:
S21: performing max pooling with a stride of 2 on the first feature map of the T1 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S22: performing max pooling with a stride of 2 on the first feature map of the T2 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S23: performing max pooling with a stride of 2 on the first feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a second feature map with a resolution of 1/2 of the original image and 32 channels;
S24: performing max pooling with a stride of 2 on the first feature map of the Flair sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2.
As a further improvement of the present invention, step S3 specifically comprises:
S31: performing max pooling with a stride of 2 on the second feature map of the T1 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S32: performing max pooling with a stride of 2 on the second feature map of the T2 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S33: performing max pooling with a stride of 2 on the second feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a third feature map with a resolution of 1/4 of the original image and 64 channels;
S34: performing max pooling with a stride of 2 on the second feature map of the Flair sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2.
Compared with the prior art, the invention has the following technical effect: the accuracy of the network architecture is significantly improved over existing models, and experimental results show that the network architecture designed by the invention achieves higher segmentation accuracy while keeping the model lightweight.
Drawings
Fig. 1 is a schematic diagram of the multi-modal fusion lightweight segmentation network for brain MRI images according to the present invention.
Fig. 2 shows the channel attention, spatial attention and modal attention modules and the 3D ConvBlock of the present invention.
Fig. 3 is a block diagram of the modal-channel-spatial joint attention module (ACSMB) and the channel-spatial joint attention module (ACSB) of the present invention.
Fig. 4 shows segmentation results of the network of the present invention on the Brats2021 dataset.
Fig. 5 shows experimental results of the present invention compared with other models.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
To avoid obscuring the present invention with unnecessary detail, only the structures and/or processing steps closely related to the invention are shown in the drawings, and other details of little relevance are omitted.
The labels in the Brats2021 dataset are divided into the enhancing tumor region (ET), the tumor core (TC) and the whole tumor region (WT). Compared with conventional T1, T1ce shows a high signal in the ET region. The TC region contains the bulk of the tumor and is typically the region that needs to be resected. Necrotic (NCR) and non-enhancing (NET) tumor tissue usually appears markedly hypointense in T1ce compared with T1. The whole tumor describes the full extent of the disease, since it includes the tumor core and the peritumoral edema (ED), which is usually depicted by the hyperintense signal in FLAIR. According to these different tumor biological characteristics, different attention mechanisms are added to different modalities in the encoding part, and corresponding strategies are adopted when the feature layers are fused.
The overall network framework is a multi-modal fusion lightweight segmentation network based on modal attention: the four modality sequences first pass through independent encoders, multi-modal fusion is then performed at the feature level, and finally the fused features are decoded to obtain the segmented image.
As shown in fig. 1, the multi-modal fusion lightweight segmentation network for brain MRI images provided by the present invention consists of three basic parts: an encoding part, a feature fusion part and a decoding part. The encoding part comprises four independent encoders that extract features from the original images of the four modalities, so that a different attention strategy can be applied to each modality, according to its tumor biological characteristics, before feature fusion. Each encoder contains a three-layer convolution module that downsamples through convolution and pooling layers. The invention focuses on the differences between modalities and treats them differently before early fusion. According to domain knowledge in the medical field, the T1ce enhanced sequence carries richer feature information about the tumor region, so a channel attention (CA) mechanism is added to every layer of the encoder of the T1ce sequence before the modality features are fused, in order to attend to the weights of its different channels and make full use of the enhanced characteristics of this sequence.
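For illustration only, the following is a minimal sketch (assuming a PyTorch-style implementation) of how such a per-modality encoder could be realized; it is not the implementation of the present invention. The names ModalityEncoder and conv_block, the BatchNorm/ReLU placement, the padding and the simple channel gate used as a stand-in for the CA module are assumptions; stride-1 convolutions with padding are used so that the stated resolutions are preserved, with stride-2 max pooling doing the downsampling, and the 16/32/64 channel widths follow steps S1-S3 below.

```python
# Hypothetical sketch (PyTorch assumed) of a per-modality encoder; not the patent's code.
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3x3 convolutions, each followed by BatchNorm and ReLU; resolution preserved."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
    )

class ModalityEncoder(nn.Module):
    """Three convolution stages (16/32/64 channels) with stride-2 max pooling in between.
    For the T1ce sequence, a simple channel gate re-weights the channels of every stage."""
    def __init__(self, in_ch=1, widths=(16, 32, 64), use_channel_attention=False):
        super().__init__()
        self.stages = nn.ModuleList([
            conv_block(in_ch, widths[0]),
            conv_block(widths[0], widths[1]),
            conv_block(widths[1], widths[2]),
        ])
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)
        self.use_ca = use_channel_attention
        if use_channel_attention:
            # stand-in for the patent's channel attention (CA) module
            self.gates = nn.ModuleList([
                nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Conv3d(w, w, 1), nn.Sigmoid())
                for w in widths
            ])

    def forward(self, x):
        feats = []
        for i, stage in enumerate(self.stages):
            if i > 0:
                x = self.pool(x)           # halve D, H, W between stages
            x = stage(x)
            if self.use_ca:
                x = self.gates[i](x) * x   # channel re-weighting for the enhanced sequence
            feats.append(x)
        return feats                        # full-, 1/2- and 1/4-resolution feature maps

# One encoder per modality; only the T1ce encoder uses channel attention.
encoders = nn.ModuleDict({
    m: ModalityEncoder(use_channel_attention=(m == "t1ce"))
    for m in ("t1", "t2", "t1ce", "flair")
})
```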
The feature fusion part fuses the four modalities at the feature level; during fusion, lightweight modal attention, spatial attention and channel attention are added to different feature layers in different combinations to improve the segmentation accuracy of the model. Specifically, in the feature fusion part the same-layer features of the four encoders are connected through skip connections. Considering that the second-layer features of the four encoders contain both detail information and high-level semantics, only the second layer uses the joint modal, channel and spatial attention (ACSM) mechanism during fusion, while the other two layers use only the joint channel and spatial attention (ACS) mechanism; this improves the accuracy of the model while keeping it lightweight.
That is, the core component of the feature fusion part is a novel lightweight attention module based on the combination of modal, channel and spatial attention (ACSM, Attention based on Channel, Spatial and Modal). It applies lightweight channel attention (CA), spatial attention (SA) and modal attention (MA) in sequence to compute the weights of the channels, the spatial positions and the modalities, respectively, so that the individual characteristics of each modality are fully exploited during fusion and the segmentation accuracy is improved. At the same time, the lightweight attention module effectively reduces the computational complexity of the network, so that the network keeps very few model parameters while improving feature extraction, achieving a good compromise between accuracy and efficiency.
The decoding part uses convolution and upsampling to restore the original resolution of the feature maps, with the upsampling implemented by transposed convolution. Specifically, the fused multi-modal features are decoded through a single decoder with three upsampling stages and output to obtain the segmentation result.
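A minimal sketch (PyTorch assumed) of one such decoder stage as just described: transposed-convolution upsampling, concatenation with the fused skip feature of the same resolution, then two 3×3×3 convolutions. The class name DecoderStage and the padding/normalization choices are assumptions, not the patent's actual code.

```python
# Hypothetical decoder-stage sketch (PyTorch assumed); not the patent's code.
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # 3x3x3 transposed convolution with stride 2 doubles D, H and W.
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3, stride=2,
                                     padding=1, output_padding=1)
        self.conv = nn.Sequential(
            nn.Conv3d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # channel-wise concatenation with the fused feature
        return self.conv(x)

# Example wiring following steps S8-S10 below: the eighth feature map (128 channels,
# 1/8 resolution) is upsampled and merged with the seventh feature map (64 channels, 1/4 resolution).
stage = DecoderStage(in_ch=128, skip_ch=64, out_ch=64)
```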
The overall network framework of fig. 1 can efficiently extract modality information and detail information from the MRI data and can be trained end-to-end. Compared with the latest mainstream segmentation networks, the designed architecture achieves higher segmentation accuracy with a lightweight model structure.
Referring to fig. 1, the segmentation method of the overall network structure designed by the invention is explained; it mainly comprises the following steps:
S1: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1, T2, T1ce and Flair sequences to generate first feature maps with a resolution equal to the original image and 16 channels. This specifically comprises:
S11: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S12: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T2 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S13: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1ce sequence and then passing the result through a channel attention module to generate a first feature map with a resolution equal to the original image and 16 channels;
S14: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image Flair sequence to generate a first feature map with a resolution equal to the original image and 16 channels.
S2: the first feature map of the sequence T1, T2, T1ce, flair is first pooled at a maximum of step size 2, and then, through two continuous convolution with the step length of 3 multiplied by 3 being 2, obtaining a second characteristic diagram with the resolution of 1/2 of the original image and the channel number of 32. The method specifically comprises the following steps:
s21: pooling the maximum value of the step length of 2 of the first feature map of the T1 sequence, and then obtaining a second feature map with the resolution of 1/2 of the original image and the channel number of 32 through convolution of two continuous 3X 3 step lengths of 2;
s22: pooling the maximum value of the step length of 2 of the first feature map of the T2 sequence, and obtaining a second feature map with the resolution of 1/2 of the original image and the channel number of 32 through convolution of two continuous 3X 3 step lengths of 2;
s23: pooling the first feature map of the T1ce sequence with a maximum value of step length of 2, convolving two continuous convolution with step length of 3 multiplied by 3 with step length of 2, and obtaining a second feature map with resolution of 1/2 of the original image and channel number of 32 through a channel attention module;
s24: the first feature map of the Flair sequence is first pooled at a step size of 2, then convolved at step sizes of 2 by two consecutive 3 x 3, a second feature map with a resolution of 1/2 of the original image and a channel number of 32 is obtained.
S3: and (3) carrying out maximum value pooling with the step length of 2 on the second characteristic diagram of the T1, T2, T1ce and Flair sequence, and then obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through two continuous convolution with the step length of 3 multiplied by 3 and the step length of 2. The method specifically comprises the following steps:
s31: carrying out maximum value pooling with the step length of 2 on the second characteristic diagram of the T1 sequence, and then obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through convolution with the step length of 2 of two continuous 3 multiplied by 3;
s32: pooling the maximum value of the step length of 2 of the second feature map of the T2 sequence, and obtaining a third feature map with the resolution of 1/4 of the original image and the channel number of 64 through convolution of two continuous 3X 3 steps of 2;
s33: the second feature map of the T1ce sequence is first pooled with a step size of 2 maximum, and two consecutive 3 x 3 convolutions with a step size of 2, obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through a channel attention module;
s34: and (3) carrying out maximum value pooling with the step length of 2 on the second characteristic diagram of the Flair sequence, and then obtaining a third characteristic diagram with the resolution of 1/4 of the original image and the channel number of 64 through convolution with the step length of 2 of two continuous 3 multiplied by 3.
S4: two consecutive 3×3 convolutions (3D ConvBlock) with step size of 2 are performed on the third feature map of the sequence T1, T2, T1ce, flair, respectively, to generate a fourth feature map with resolution of 1/8 of the original image and channel number of 64 unchanged.
S5: after the characteristic mosaic of channel dimension is carried out on the first characteristic graph of the four sequences, the generated characteristic graph with the channel number of 64 is subjected to convolution (3 DConvBlock) with the step length of 2, which is two continuous 3 multiplied by 3, after passing through a channel space joint attention module (ACSB), and a fifth characteristic graph with the resolution equal to the original image and the channel number of 16 is generated.
S6: after the characteristic mosaic of channel dimension is carried out on the second characteristic graphs of the four sequences, the generated characteristic graph with the channel number of 128 passes through an ACSMB module, and then two continuous convolution (3D ConvBlock) with the step length of 3 multiplied by 3 and 2 is carried out, so that a sixth characteristic graph with the resolution of 1/2 of the original image and the channel number of 32 is generated.
S7: after the characteristic mosaic of the channel dimension is carried out on the third characteristic graphs of the four sequences, the generated characteristic graphs with 256 channels pass through a light self-attention module (ACSMB), two continuous convolution (3D ConvBlock) with the step length of 3 multiplied by 3 being 2 is carried out, and a seventh characteristic graph with the resolution of 1/4 of the original image and the channel number of 64 is generated.
S8: after the feature stitching of the channel dimension is carried out on the fourth feature graphs of the four sequences, the generated feature graphs with 256 channels are convolved (3D ConvBlock) with 2 step sizes of two continuous 3 multiplied by 3, and then an eighth feature graph with 1/8 of the resolution of the original image and 128 channels is generated.
S9: after convolving the eighth feature map with a 3 x 3 transpose with a step size of 2, a ninth feature map having a resolution of 1/4 of the original image and a channel number of 64 is generated.
S10: after the ninth feature map and the seventh feature map are subjected to feature stitching of channel dimensions, a feature map with the channel number of 128 is generated, and then a tenth feature map with the resolution of 1/4 of the original image and the channel number of 64 is generated through two continuous convolutions (3D ConvBlock) with the step length of 3×3 being 2.
S11: after convolving the tenth feature map with a 3 x 3 transpose with a step size of 2, an eleventh feature map having a resolution of 1/2 of the original image and a channel number of 32 is generated.
S12: after the eleventh feature map and the sixth feature map are subjected to feature stitching of channel dimensions, a feature map with the channel number of 64 is generated, and then a twelfth feature map with the resolution of 1/2 of the original image and the channel number of 32 is generated through two continuous convolutions (3D ConvBlock) with the step length of 3×3 being 2.
S13: after convolving the twelfth feature map with a 3 x 3 transpose of step size 2, a thirteenth feature map with a resolution equal to the original image and a channel number of 16 is generated.
S14: and after the thirteenth feature map and the fifth feature map are subjected to feature stitching of channel dimension, generating a feature map with the channel number of 32, and generating a fourteenth feature map with the channel number of 16, namely a segmentation result map, by two continuous convolution (3D ConvBlock) with the step length of 3 multiplied by 3 being 2, wherein the resolution is equal to that of the original image.
A specific note: steps S1-S4 give the outputs of the three layers of the four modality encoders, respectively, while steps S5-S14 are described in conjunction with fig. 1; the steps and fig. 1 corroborate each other.
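As a rough, hypothetical illustration of how one fusion layer in steps S5-S8 might be wired (PyTorch assumed): the same-layer feature maps of the four sequences are concatenated along the channel dimension, passed through a joint attention block, and reduced by two 3×3×3 convolutions. The class name FusionLayer and the normalization/padding choices are assumptions; the attention argument stands in for the ACSB/ACSMB modules described below with reference to fig. 3 (nn.Identity() can be passed for step S8, which uses no attention block).

```python
# Hypothetical fusion-layer sketch (PyTorch assumed); not the patent's code.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, per_modality_ch, out_ch, attention: nn.Module):
        super().__init__()
        fused_ch = 4 * per_modality_ch          # four modality feature maps concatenated
        self.attention = attention              # ACSB, ACSMB or nn.Identity()
        self.conv = nn.Sequential(
            nn.Conv3d(fused_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, f_t1, f_t2, f_t1ce, f_flair):
        fused = torch.cat([f_t1, f_t2, f_t1ce, f_flair], dim=1)   # e.g. 4 x 16 = 64 channels (S5)
        return self.conv(self.attention(fused))
```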
Fig. 2 shows the core modules of the entire network, CA, SA, MA and the 3D ConvBlock, which are described in detail as follows:
Fig. 2(a) is the spatial attention module. A given three-dimensional input feature map F_1 is max-pooled and average-pooled along the channel dimension to generate two 3D spatial attention maps. The two maps are added, passed through a 7×7×7 convolution layer with a stride of 1, and finally compressed by a Sigmoid activation function to obtain the spatial attention weight M_S(F_1):

M_S(F_1) = sigmoid(Conv_{7×7×7}(AvgPool(F_1) + MaxPool(F_1)))    (1)
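A minimal sketch (PyTorch assumed) of equation (1); the class name SpatialAttention3D is an assumption, and the module returns only the weight map M_S, to be multiplied with the input by the caller.

```python
# Hypothetical sketch of Eq. (1) (PyTorch assumed); not the patent's code.
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """M_S(F) = sigmoid(Conv_7x7x7(AvgPool_channel(F) + MaxPool_channel(F)))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(1, 1, kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, x):
        # x: (B, C, D, H, W); pool along the channel dimension.
        avg_map = x.mean(dim=1, keepdim=True)               # (B, 1, D, H, W)
        max_map = x.amax(dim=1, keepdim=True)               # (B, 1, D, H, W)
        return torch.sigmoid(self.conv(avg_map + max_map))  # spatial weights, broadcast over C
```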
Fig. 2(b) is the 3D ConvBlock, which consists of two 3×3 convolution layers, batch normalization and a ReLU layer in series.
Fig. 2(c) is the channel attention module. A given three-dimensional input feature map F_2 is passed through adaptive average pooling and adaptive max pooling to generate two 1D channel attention descriptors, M_CA and M_CM. The two descriptors are each fed into a multi-layer perceptron (MLP) module, the results are added, and the sum is compressed by a Sigmoid activation function to obtain the channel attention weight M_C(F_2):

M_C(F_2) = sigmoid(MLP(AvgPool(F_2)) + MLP(MaxPool(F_2)))    (2)
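A minimal sketch (PyTorch assumed) of equation (2). The reduction ratio of the MLP and the use of a shared MLP for both pooled descriptors are assumptions; the module returns the weight vector M_C.

```python
# Hypothetical sketch of Eq. (2) (PyTorch assumed); reduction ratio and shared MLP are assumptions.
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        avg_desc = self.mlp(self.avg_pool(x).view(b, c))   # M_CA branch
        max_desc = self.mlp(self.max_pool(x).view(b, c))   # M_CM branch
        return torch.sigmoid(avg_desc + max_desc).view(b, c, 1, 1, 1)  # broadcast over D, H, W
```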
Fig. 2(d) is the modal attention module MA designed in the present invention. Given an input F_3, its dimensions are first reshaped so that the feature maps of the four modalities are separated. Average pooling then compresses the spatial dimensions, and a further average pooling compresses the channel dimension, yielding one weight scalar per modality. The generated weights of the four modalities are multiplied back onto the original feature map:

M_M(F_3) = Reshape(AvgPool(AvgPool(Reshape(F_3))) × F_3')    (3)
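A minimal sketch (PyTorch assumed) of equation (3). The convention that the channel dimension is split into four equal modality groups, and the absence of any normalization of the modality weights, are assumptions read off the description above.

```python
# Hypothetical sketch of Eq. (3) (PyTorch assumed); the reshape convention is an assumption.
import torch
import torch.nn as nn

class ModalAttention(nn.Module):
    def __init__(self, num_modalities=4):
        super().__init__()
        self.m = num_modalities

    def forward(self, x):
        # x: (B, C, D, H, W), with the C channels holding the concatenated modality groups.
        b, c, d, h, w = x.shape
        groups = x.view(b, self.m, c // self.m, d, h, w)   # Reshape(F): split the modalities
        weights = groups.mean(dim=(3, 4, 5)).mean(dim=2)   # spatial, then channel average pooling -> (B, m)
        weights = weights.view(b, self.m, 1, 1, 1, 1)
        return (groups * weights).view(b, c, d, h, w)      # multiply the weights back, restore shape
```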
Fig. 3(a) is a schematic diagram of the lightweight attention module (ACSMB) of the present invention. A given input F is first passed through the channel attention module, and the resulting channel attention weights are multiplied with the original feature map to obtain F_C. F_C is then passed through the spatial attention module, and the resulting spatial attention weights are multiplied with F_C to obtain F_CS. F_CS is finally passed through the modal attention module, which attends to the features of the different modalities, and the output is added to F_CS to generate the fused feature map F_CSM:

F_CSM = M_M(F_CS) + F_CS    (6)
Fig. 3(b) is a schematic diagram of the ACSB module; it is identical to fig. 3(a) except that the modal attention module is removed from the ACSMB.
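Combining the hypothetical modules sketched above, the ACSMB and ACSB blocks of fig. 3 could be assembled as follows (sketch only; it assumes the SpatialAttention3D, ChannelAttention3D and ModalAttention classes defined in the previous sketches).

```python
# Hypothetical ACSMB / ACSB sketch (PyTorch assumed); reuses the classes sketched above.
import torch.nn as nn

class ACSMB(nn.Module):
    """Channel -> spatial -> modal attention, with a residual add as in Eq. (6)."""
    def __init__(self, channels, num_modalities=4):
        super().__init__()
        self.ca = ChannelAttention3D(channels)
        self.sa = SpatialAttention3D()
        self.ma = ModalAttention(num_modalities)

    def forward(self, x):
        f_c = self.ca(x) * x          # channel-refined features F_C
        f_cs = self.sa(f_c) * f_c     # spatially refined features F_CS
        return self.ma(f_cs) + f_cs   # F_CSM = M_M(F_CS) + F_CS

class ACSB(nn.Module):
    """Same as ACSMB with the modal attention branch removed (fig. 3(b))."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention3D(channels)
        self.sa = SpatialAttention3D()

    def forward(self, x):
        f_c = self.ca(x) * x
        return self.sa(f_c) * f_c
```

With the FusionLayer sketched earlier, step S6 would then read, for example, FusionLayer(32, 32, ACSMB(128)); these channel widths follow the numbers stated in steps S2 and S6 and are otherwise illustrative.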
Fig. 4 shows segmentation results of the segmentation network of the present invention on the Brats2021 dataset. To verify the accuracy and efficiency of the designed network, the model was trained, evaluated and tested on the widely used Brats2021 dataset, whose training/validation/test sets contain 1251/219/570 images, respectively.
Fig. 5 shows the experimental results compared with other models. The average Dice coefficient of the four-encoder model is 0.009 higher than that of the basic single-encoder model. After the channel attention module is added to the T1ce sequence, the feature information of that sequence is better exploited: the Dice coefficients of ET and TC are improved by 0.013 and 0.010, respectively, compared with the plain four-encoder structure. For the feature-concatenation part of the multi-modal fusion, ACS modules are added to the first and third layers and an ACSM module to the second layer, which strengthens the weighting of each part during modality fusion and the attention paid to the different modalities. The final model achieves the best overall average Dice of 0.891, with the Dice coefficients of ET, TC and WT improved by 0.037, 0.038 and 0.015, respectively.
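For reference, the Dice similarity coefficient reported in fig. 5 measures the overlap between a predicted mask P and the ground truth G as Dice = 2|P∩G| / (|P| + |G|). A minimal, generic sketch of how it is typically computed follows (the smoothing constant eps is an assumption added to avoid division by zero; this is not tied to the patent's evaluation code).

```python
# Generic Dice-coefficient sketch (PyTorch assumed); not the patent's evaluation code.
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks of the same shape."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```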
In summary, the invention uses a novel lightweight attention module (ACSM) based on the combination of modal, channel and spatial attention, applying lightweight channel attention CA, spatial attention SA and modal attention MA in sequence to compute the weights of the channels, the spatial positions and the modalities, respectively, so that the individual characteristics of each modality are fully exploited during fusion and the segmentation accuracy is improved. Compared with existing models, the accuracy of the proposed network architecture is significantly improved, and the experimental results show that it achieves higher segmentation accuracy while keeping the model lightweight.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A multi-modal fusion lightweight segmentation network for brain MRI images, comprising:
an encoding part comprising four independent encoders that extract features from the original images of the four modalities and apply different attention strategies to different modalities, each encoder containing a three-layer convolution module that downsamples through convolution and pooling layers;
a feature fusion part that fuses the four modalities at the feature level, adding lightweight modal attention, spatial attention and channel attention to different feature layers in different combinations during fusion to improve the segmentation accuracy of the model; and
a decoding part that restores the original resolution of the feature maps using convolution and upsampling, where the upsampling is realized by transposed convolution.
2. The multi-modal fusion lightweight segmentation network of claim 1, wherein: before feature fusion, a channel attention mechanism is added to every layer of the encoder of the enhanced sequence, so that the weights of the different channels of the enhanced sequence are attended to.
3. The multi-modal fusion lightweight segmentation network of claim 1, wherein: in the feature fusion part, the same-layer features of the four encoders are connected by skip connections.
4. The multi-modal fusion lightweight segmentation network of claim 1, wherein: the core component of the feature fusion part is a lightweight attention module ACSMB based on the combination of modal, channel and spatial attention, in which channel attention CA, spatial attention SA and modal attention MA are applied in sequence to compute the weights of the channels, the spatial positions and the modalities, respectively.
5. The multi-modal fusion lightweight segmentation network of claim 4, wherein: the second-layer features of the four encoders contain both detail information and high-level semantics; the joint modal, channel and spatial attention mechanism is used when fusing the second-layer features, while the other two layers use only the joint channel and spatial attention mechanism during fusion.
6. The multi-modal fusion lightweight segmentation network of claim 1, wherein: in the decoding part, the fused multi-modal features are decoded through a single decoder with three upsampling stages and output to obtain the segmentation result.
7. A segmentation method of a multi-modal fusion lightweight segmentation network, applied to the segmentation network of any one of claims 1-6, comprising the following steps:
S1: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1, T2, T1ce and Flair sequences to generate first feature maps with a resolution equal to the original image and 16 channels;
S2: performing max pooling with a stride of 2 on the first feature maps of the T1, T2, T1ce and Flair sequences, then obtaining second feature maps with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S3: performing max pooling with a stride of 2 on the second feature maps of the T1, T2, T1ce and Flair sequences, then obtaining third feature maps with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S4: performing two consecutive 3×3 convolutions with a stride of 2 on the third feature maps of the T1, T2, T1ce and Flair sequences, respectively, to generate fourth feature maps with a resolution of 1/8 of the original image and an unchanged channel number of 64;
S5: concatenating the first feature maps of the four sequences along the channel dimension to produce a feature map with 64 channels, passing it through a channel-spatial joint attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a fifth feature map with a resolution equal to the original image and 16 channels;
S6: concatenating the second feature maps of the four sequences along the channel dimension to produce a feature map with 128 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a sixth feature map with a resolution of 1/2 of the original image and 32 channels;
S7: concatenating the third feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, passing it through a lightweight attention module and then two consecutive 3×3 convolutions with a stride of 2 to generate a seventh feature map with a resolution of 1/4 of the original image and 64 channels;
S8: concatenating the fourth feature maps of the four sequences along the channel dimension to produce a feature map with 256 channels, then generating an eighth feature map with a resolution of 1/8 of the original image and 128 channels through two consecutive 3×3 convolutions with a stride of 2;
S9: applying a 3×3 transposed convolution with a stride of 2 to the eighth feature map to generate a ninth feature map with a resolution of 1/4 of the original image and 64 channels;
S10: concatenating the ninth and seventh feature maps along the channel dimension to produce a feature map with 128 channels, then generating a tenth feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S11: applying a 3×3 transposed convolution with a stride of 2 to the tenth feature map to generate an eleventh feature map with a resolution of 1/2 of the original image and 32 channels;
S12: concatenating the eleventh and sixth feature maps along the channel dimension to produce a feature map with 64 channels, then generating a twelfth feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S13: applying a 3×3 transposed convolution with a stride of 2 to the twelfth feature map to generate a thirteenth feature map with a resolution equal to the original image and 16 channels;
S14: concatenating the thirteenth and fifth feature maps along the channel dimension to produce a feature map with 32 channels, then generating a fourteenth feature map with a resolution equal to the original image and 16 channels, i.e. the segmentation result map, through two consecutive 3×3 convolutions with a stride of 2.
8. The segmentation method according to claim 7, wherein step S1 specifically comprises:
S11: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S12: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T2 sequence to generate a first feature map with a resolution equal to the original image and 16 channels;
S13: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image T1ce sequence and then passing the result through a channel attention module to generate a first feature map with a resolution equal to the original image and 16 channels;
S14: performing two consecutive 3×3 convolutions with a stride of 2 on the input original-image Flair sequence to generate a first feature map with a resolution equal to the original image and 16 channels.
9. The segmentation method according to claim 7, wherein step S2 specifically comprises:
S21: performing max pooling with a stride of 2 on the first feature map of the T1 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S22: performing max pooling with a stride of 2 on the first feature map of the T2 sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2;
S23: performing max pooling with a stride of 2 on the first feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a second feature map with a resolution of 1/2 of the original image and 32 channels;
S24: performing max pooling with a stride of 2 on the first feature map of the Flair sequence, then obtaining a second feature map with a resolution of 1/2 of the original image and 32 channels through two consecutive 3×3 convolutions with a stride of 2.
10. The segmentation method according to claim 7, wherein step S3 specifically comprises:
S31: performing max pooling with a stride of 2 on the second feature map of the T1 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S32: performing max pooling with a stride of 2 on the second feature map of the T2 sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2;
S33: performing max pooling with a stride of 2 on the second feature map of the T1ce sequence, then applying two consecutive 3×3 convolutions with a stride of 2 and a channel attention module to obtain a third feature map with a resolution of 1/4 of the original image and 64 channels;
S34: performing max pooling with a stride of 2 on the second feature map of the Flair sequence, then obtaining a third feature map with a resolution of 1/4 of the original image and 64 channels through two consecutive 3×3 convolutions with a stride of 2.
CN202310558286.4A 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image Pending CN116740513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310558286.4A CN116740513A (en) 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310558286.4A CN116740513A (en) 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Publications (1)

Publication Number Publication Date
CN116740513A true CN116740513A (en) 2023-09-12

Family

ID=87900222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310558286.4A Pending CN116740513A (en) 2023-05-17 2023-05-17 Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image

Country Status (1)

Country Link
CN (1) CN116740513A (en)

Similar Documents

Publication Publication Date Title
CN110084863B (en) Multi-domain image conversion method and system based on generation countermeasure network
Yuan et al. An effective CNN and Transformer complementary network for medical image segmentation
Liang et al. MCFNet: Multi-layer concatenation fusion network for medical images fusion
CN113012172B (en) AS-UNet-based medical image segmentation method and system
CN115482241A (en) Cross-modal double-branch complementary fusion image segmentation method and device
CN111627019A (en) Liver tumor segmentation method and system based on convolutional neural network
CN110706214B (en) Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN116433914A (en) Two-dimensional medical image segmentation method and system
CN115311194A (en) Automatic CT liver image segmentation method based on transformer and SE block
CN112819914A (en) PET image processing method
CN112488971A (en) Medical image fusion method for generating countermeasure network based on spatial attention mechanism and depth convolution
CN115880312A (en) Three-dimensional image automatic segmentation method, system, equipment and medium
CN115457359A (en) PET-MRI image fusion method based on adaptive countermeasure generation network
Xu et al. Infrared and visible image fusion using a deep unsupervised framework with perceptual loss
KR102092205B1 (en) Image processing method and apparatus for generating super resolution, inverse tone mapping and joint super resolution-inverse tone mapping processed multiple output image
CN117475268A (en) Multimode medical image fusion method based on SGDD GAN
CN117197627B (en) Multi-mode image fusion method based on high-order degradation model
CN116740513A (en) Multi-mode fusion lightweight segmentation network and segmentation method for brain MRI image
CN113744284B (en) Brain tumor image region segmentation method and device, neural network and electronic equipment
CN116757982A (en) Multi-mode medical image fusion method based on multi-scale codec
Chen et al. Contrastive learning with feature fusion for unpaired thermal infrared image colorization
CN115731280A (en) Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN115994892A (en) Lightweight medical image segmentation method and system based on ghostnet

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination