CN114119515A - Brain tumor detection method based on attention mechanism and MRI multi-modal fusion

Brain tumor detection method based on attention mechanism and MRI multi-modal fusion

Info

Publication number
CN114119515A
Authority
CN
China
Prior art keywords
attention
key
pixel
convolution
temp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111343977.XA
Other languages
Chinese (zh)
Inventor
蒋宗礼
李聪
张津丽
顾问
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111343977.XA priority Critical patent/CN114119515A/en
Publication of CN114119515A publication Critical patent/CN114119515A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10072 Tomographic images
    • G06T 2207/10088 Magnetic resonance imaging [MRI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30016 Brain
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30096 Tumor; Lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Magnetic Resonance Imaging Apparatus (AREA)

Abstract

The invention discloses a brain tumor detection method based on an attention mechanism and MRI multi-modal fusion. Starting from the Multi-Unet model, the ordinary convolution blocks of the encoder are replaced with hybrid dilated convolution blocks. Drawing on the multi-branch encoder structure of the Inception model, a multi-branch output convolution block, MB-OutConv for short, is independently designed; a channel-based attention module, CB-Attention, is also designed to capture pixel-level associations between the channels of the original segmentation map and apply attention weighting to those channels. The neural network is suitably improved, and the newly designed attention module, which operates on image channels and completes attention weighting at the pixel level, further refines the segmentation result. Finally, the tumor and other lesion regions in brain MRI images are segmented. Building on the multi-modal convolutional neural network Multi-Unet, part of its encoder branches are improved and an attention module is appended, jointly improving the segmentation of brain tumors.

Description

Brain tumor detection method based on attention mechanism and MRI multi-modal fusion
Technical Field
The invention relates to the fields of deep learning and medical image processing, at the intersection of computer science and medicine. Based on a multi-modal convolutional neural network and in view of current challenges, the network is suitably improved and a new attention module is independently designed to further refine the segmentation result; the attention module operates on image channels and performs attention weighting at the pixel level. The ultimate goal is to segment tumors and other diseased regions in brain MRI images.
Background
Image segmentation, the process of covering a target contour with a pixel-level mask, is more meaningful than object detection: lesion regions are irregular in shape, and a physician must judge the position of the disease as accurately as possible during diagnosis, since operating on healthy tissue as if it were diseased can put the patient's life at risk. Experienced physicians can generally locate a lesion in an image accurately, but in practice hospitals generate a large number of images every day, beyond what physicians' numbers and energy allow, so computers are urgently needed to identify lesion regions quickly for physicians' reference. Deep learning is now well established in natural image analysis, and these studies show that computers can learn a physician's diagnostic experience in a short time and be used to analyze medical images.
MRI is commonly used to image soft tissue such as the brain and spinal cord. After a scan, 155 2D slices are generated and then integrated along the depth axis into a 3D image, reflecting the brain's spatial structure. The 3D image sequence generated by MRI, [T1, T2, T1c, Flair], is also called a multi-modal image sequence. Although MRI images contain many detailed soft-tissue features, and some images do show tumor regions clearly, research shows that diseases originating in the brain are usually accompanied by additional symptoms. For example, a growing glioma inevitably presses on peripheral tissue and causes edema, and the tissue enclosed by the tumor often necrotizes from long-term insufficient blood supply; the lesion region is therefore the sum of these symptoms, not the tumor alone. Because of this complexity of MRI images, together with the limited energy of physicians, computer-aided diagnosis of brain tumors is needed; computer segmentation of MRI images should also distinguish the position and contour of each lesion type as accurately as possible to assist diagnosis. On this basis, the model herein was designed through research specifically to address inaccurate segmentation.
There are many types of brain lesions, so from the computer's perspective segmenting brain tumors in MRI images is essentially a pixel-level multi-classification problem: identifying the class to which each pixel in the image belongs. Segmentation grew out of CNN-based image classification: FCN replaced the fully-connected layers with deconvolution layers, enabling the network to be used for segmentation. Thereafter, many models improved on FCN to raise performance. SegNet adds multiple deconvolution layers together with pooling-index connections. VGG replaces the original large convolution kernels with several small ones, reducing the parameter count while preserving the receptive field. Unet improves on SegNet by passing the downsampled feature maps directly to the decoder, recovering detail features lost during encoding. DenseNet, drawing on ResNet, designs dense connections so that every layer has residual connections to the other layers, strengthening feature reuse. Mask-RCNN first crops the image with an object-detection box and then upsamples the cropped image, avoiding interference from irrelevant regions.
An MRI image is a 3D image consisting of many 2D scan layers, each of which can also be understood as a 2D cross-section, so how to process the 3D-MRI image is the first question. Existing models take two broad approaches: one processes the 3D image directly with 3D convolution kernels of different sizes, as in DeepMedic; the other is a 2D slice-and-recombine method, in which a 2D convolutional network processes each 2D cross-section of the source 3D image in time order and the segmentation maps are then reassembled into a 3D structure. A 3D convolution kernel carries many parameters, and that is for a single channel; using it to extract features from a source 3D-MRI image and store them across many channels makes the overall data volume enormous, beyond what either main memory or GPU memory supports. In that case both 3D-MRI and 3D-CT images must be cut into many small cubes before entering a 3D convolutional network, which destroys the 2D cross-sectional structure of the source 3D image. Since most features of an MRI image are distributed on the 2D cross-sections, the 2D slice-and-recombine method is the better choice for processing 3D-MRI images, and it also matches how MRI images are generated.
Furthermore, an MRI device typically scans the brain from four angles: the basic structure T1, tissue water content T2, tissue blood supply T1c, and tissue bound-water content Flair. Different modalities reflect different information, and failing to fuse them seriously hurts segmentation accuracy. Further improvement on Unet is therefore needed; the current mainstream approach designs the encoder as a multi-branch convolution architecture fused deep in the network, as in Multi-Path Dense Unet and IVD-Net (the latter has denser connections than the former), both suited to segmenting multi-modal MRI images.
Although most features are distributed across the 2D cross-sections, a sequence-based model is still needed to capture the correlation between these 2D images. With the development of deep learning, the attention mechanism that originated in the Transformer has gradually been applied to image segmentation. AutoFocus, for example, uses attention to select the globally optimal scale for each image region; Ozan Oktay et al. embed an Attention Gate in the Unet decoder to focus feature recovery on the target region and suppress irrelevant regions; DANet applies attention to the image and improves segmentation from both the channel and position aspects, an idea since widely adopted; CAnet embeds the DANet attention block between the Unet encoder and decoder to recover tree-shaped detail features in medical images; Ashish Sinha et al. use DANet to analyze the connection between each level of ResNet and the sum of all feature layers, capturing long-distance dependence and channel feature dependence. These models offer many inspirations. On the other hand, the 1 x 1 convolution, which resembles attention, was initially used to adjust the inter-channel feature distribution and replace the fully-connected layer for classification; Self-Attention and 1 x 1 convolution therefore differ in nature, and the attention module herein combines the advantages of both.
The key index of a model's segmentation performance is whether most pixels can be classified correctly. Existing segmentation networks, whether 2D- or 3D-based, can segment the general outline of the lesion region; but beyond modeling the connection between 2D images, the convolutions themselves leave room for improvement. In T1 and T1c images, for example, most contours are unclear, especially in edema regions, which also makes the network prone to over-segmentation when extracting features. In addition, necrotic and tumor regions are hard to distinguish on an MRI image, and mis-segmentation between the two is common. To address these problems, an HDC convolution block replaces part of the original downsampling branches of Multi-Unet, enlarging its receptive field so that more features are extracted, whose combination can change the original distribution.
Disclosure of Invention
The invention aims to improve a multi-modal convolutional neural network so that it captures previously unconsidered features in both the 2D cross-sectional space and the 3D depth-axis space, improving segmentation accuracy. To this end, starting from the multi-modal convolutional neural network Multi-Unet, part of its encoder branches are improved and an attention module is appended, jointly improving the segmentation of brain tumors. The structure of the whole model is shown in FIG. 1 and comprises the following three key structures.
Key structure 1: based on the Multi-Unet model, the ordinary convolution blocks of the encoder are replaced with Hybrid Dilated Convolution Blocks (HDC-Block for short) that enlarge the receptive field and extract more details beyond the boundary contour; the improved model is named HDC-MUnet.
Key structure 2: drawing on the multi-branch encoder structure of the Inception model, a Multi-Branch Output Convolution Block (MB-OutConv for short) is independently designed; the segmentation map generated by HDC-MUnet is processed by 1 x 1 and 3 x 3 convolution kernels simultaneously, and the features are summarized and sorted to produce an original segmentation map, Origin-Segment, usable for classification.
Key structure 3: drawing on the Self-Attention model in the Transformer, a Channel-Based Attention Block (CB-Attention for short) is independently designed to capture pixel-level associations between the channels of the original segmentation map and apply attention weighting to those channels. The specific stage designs of the invention are described below together with the data preprocessing and the key structures above.
The technical scheme adopted by the invention is a brain tumor detection method based on an attention mechanism and MRI multi-modal fusion, comprising the following steps.
S1: and preprocessing key input data.
Since each sample consists of four 3D-MRI images, the overall dimension is (4,155,240,240), representing modality, depth, length, and width, respectively. A 3D-MRI image can be seen as a sequence of 155 slices of 2D images, which 2D images are also referred to as 2D cross-sections, as mentioned earlier, and are represented in fig. 1, so that the second dimension is also referred to as the time series dimension. Considering from the time series dimension, two 2D cross sections with close distances have stronger characteristic correlation, in order to reduce useless calculation, the 3D-MRI is firstly divided into a plurality of 3D-slices (temp < <155) with the dimension of (4, temp, 240) in the time dimension, and then each layer of the 2D cross sections of the 3D-slices are transmitted into the HDC-MUnet one by one in the time sequence for processing. Where N is the number of output channels of the first convolutional layer of HDC-mutet, and if the value is too small, the features cannot be fully extracted, and if the value is too large, feature redundancy and even overfitting are easily caused.
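As a minimal sketch of the slicing step just described, assuming each sample is held as a NumPy array; the helper name slice_3d_mri and the value temp = 5 are illustrative, not taken from the patent:

```python
import numpy as np

def slice_3d_mri(sample: np.ndarray, temp: int):
    """Split one (4, 155, 240, 240) multi-modal MRI sample along the
    depth (time-series) axis into 3D-slices of shape (4, temp, 240, 240).
    A trailing remainder shorter than temp is simply dropped here."""
    n_modalities, depth, h, w = sample.shape
    n_slices = depth // temp
    return [sample[:, i * temp:(i + 1) * temp] for i in range(n_slices)]

# Example: temp = 5 yields 31 slices of shape (4, 5, 240, 240)
sample = np.zeros((4, 155, 240, 240), dtype=np.float32)
slices = slice_3d_mri(sample, temp=5)
```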
Since the labels of the BraTS dataset are unevenly distributed, the model sometimes struggles to converge to the global optimum. A median-frequency balancing strategy is used to solve this problem; its basic principle is to redefine the per-class weight of the cross-entropy loss as the following formula.
weight(c) = median(freq) / freq(c)
where freq(c) is the proportion of pixels belonging to category c among all pixels, i.e. the frequency with which that pixel type appears, and median(freq) is the median of the class frequencies. This treatment is scientifically validated and is also adopted in IVD-Net and LSTM-MUnet.
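A small sketch of the median-frequency balancing weights defined by the formula above, assuming integer label maps; the helper name median_frequency_weights is illustrative. The result can be passed, for example, to a weighted cross-entropy loss such as torch.nn.CrossEntropyLoss(weight=...):

```python
import numpy as np

def median_frequency_weights(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class cross-entropy weights via median frequency balancing:
    weight(c) = median(freq) / freq(c), where freq(c) is the fraction of
    all pixels belonging to class c."""
    counts = np.bincount(labels.astype(np.int64).ravel(), minlength=num_classes)
    freq = counts / counts.sum()
    median = np.median(freq[freq > 0])
    # classes absent from the labels get weight 0
    return np.where(freq > 0, median / np.maximum(freq, 1e-12), 0.0)
```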
S2: improve the Multi-Unet and generate the HDC-MUnet structure.
Since the T1 image reflects only the basic structure, it distinguishes details poorly and suffers from boundary blurring. Moreover, as a tumor grows it first squeezes the surrounding soft tissue, causing edema; second, the growing tumor deprives tissue of nutrients, leading to necrosis of the surrounding soft tissue. Yet even after edema or necrosis, blood supply changes little, so the two lesions differ only slightly on the T1c image. If an ordinary 3 x 3 convolution block scans the T1 and T1c images, its receptive field is small and features across a blurred boundary remain barely distinguishable after convolution, so reasonably accurate segmentation requires long training. In addition, pixels that differ little may nonetheless carry different labels, so the model easily confuses necrotic and tumor regions, holding back accuracy. Observing that some clearly distinguishable pixels do lie around the blurred boundary, the receptive field is enlarged so that more such pixels can be captured.
Dilated (hole) convolution enlarges the receptive field without increasing the convolution kernel size and is well suited to segmenting larger objects or objects with blurred boundaries. The ordinary convolution blocks in the original T1 and T1c downsampling branches are therefore replaced by the Hybrid Dilated Convolution Block (HDC-Block), shown in fig. 2. In the HDC-Block, the first 3 x 3 convolution kernel does not use dilation, keeping the original dilation rate of 1 to extract detail features comprehensively; the second 3 x 3 convolution kernel uses a dilation rate of 2, extending the receptive field from 3 x 3 to 5 x 5. Tracing back to the input, one pixel value in the output feature map corresponds to an input receptive field enlarged from 5 x 5 to 7 x 7. The T2 and Flair modalities reflect water content, so the edema region boundary is clear in those two images and dilated convolution is not needed to enlarge their receptive field.
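A possible PyTorch rendering of the HDC-Block as described, with two 3 x 3 convolutions at dilation rates 1 and 2; the BatchNorm and ReLU layers are assumptions typical of such blocks, not specified in the text:

```python
import torch
import torch.nn as nn

class HDCBlock(nn.Module):
    """Hybrid dilated convolution block: a 3x3 conv with dilation 1 for
    full detail extraction, then a 3x3 conv with dilation 2, growing the
    receptive field over the block input from 5x5 to 7x7."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, dilation=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```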
S3: an original segmentation map is generated that can be weighted for attention.
Fig. 1 also shows the structure of the multi-branch output convolution block MB-OutConv, which consists of 1 x 1 and 3 x 3 convolutions in parallel, a flow structure different from the serial convolution blocks of MUnet. The 1 x 1 and 3 x 3 convolutions have different receptive fields and clearly different roles. The 1 x 1 convolution redistributes and integrates features; it resembles a linear weighting that integrates the features of N channels to compress the channel count. The 3 x 3 convolution extracts and summarizes features: its field of view is moderate, so it can scan the image in all directions to extract and fuse useful features.
The feature map (temp, N, 240, 240) output by HDC-MUnet contains N internal channels, but the essence of image segmentation is pixel classification, with C possible classes per pixel, so the channel count of the feature map must be converted from N to the final class count C; this is the role of the MB-OutConv block. To exploit the advantages of both convolutions at once, the 3 x 3 and 1 x 1 convolutions process the output feature maps of HDC-MUnet in parallel, and the feature maps they generate are added to produce the original segmentation map Origin-Segment of dimension (temp, C, 240, 240), where C is the final number of lesion types. The original segmentation map is a temporal image sequence [OS_1, OS_2, ..., OS_temp] of length temp, where the subscript is the time slice of the current segmentation map OS; the formula is as follows.
[OS_1, OS_2, ..., OS_temp] = MBOutConv{HDCMUnet([img_1, img_2, ..., img_temp])}
The idea of the multi-branch output convolution block derives from the Inception multi-branch architecture proposed by Google. Since the CB-Attention of S4 must calculate the degree of association between every two time slices of Origin-Segment and then attention-weight Origin-Segment, the latter serves as the input of CB-Attention.
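A minimal PyTorch sketch of the MB-OutConv block described above, with parallel 1 x 1 and 3 x 3 convolutions whose outputs are summed to map N internal channels to C classes; layer names are illustrative:

```python
import torch
import torch.nn as nn

class MBOutConv(nn.Module):
    """Multi-branch output convolution block: a 1x1 branch (feature
    redistribution and channel compression) and a 3x3 branch (local
    feature extraction) applied in parallel and added together."""
    def __init__(self, n_channels: int, n_classes: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(n_channels, n_classes, kernel_size=1)
        self.conv3x3 = nn.Conv2d(n_channels, n_classes, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, H, W) -> Origin-Segment slice of shape (batch, C, H, W)
        return self.conv1x1(x) + self.conv3x3(x)
```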
S4: the separating effect is further improved by an attention mechanism.
For each 3D-Slice, consider a single modality with dimension (temp, 240, 240): two 2D cross-sections that are close in space necessarily correlate strongly, so capturing the pixel-level correlation between the 2D cross-sectional images is essential. Each cross-sectional layer of the 3D MRI image can also be regarded as a channel; the image-channel-based attention module CB-Attention is proposed specifically to solve this problem, with the whole module shown in fig. 3. The idea of CB-Attention derives from Self-Attention, and it uses dot-product attention because dot products are more efficient than additive attention.
First, note that the input image of the attention module is the original segmentation map Origin-Segment, obtained from the 3D-Slice input after each layer of 2D cross-section is processed in time order by HDC-MUnet and MB-OutConv; its dimension is (temp, C, 240, 240). Next, the attention weights between the i-th (1 <= i <= temp) element OSeg_i of Origin-Segment and the other elements are calculated. Let Query_i = OSeg_i; the other images serve as Keys, expressed by the following formulas.
[Key_1, Key_2, ..., Key_{i-1}] = [OSeg_1, OSeg_2, ..., OSeg_{i-1}];
[Key_i, Key_{i+1}, ..., Key_{temp-1}] = [OSeg_{i+1}, OSeg_{i+2}, ..., OSeg_temp]
That is, when the subscript j of Key_j is less than i, Key_j = OSeg_j; otherwise Key_j = OSeg_{j+1}.
Then the point multiplication (dot product) of the Query with each Key is performed, yielding temp-1 relevance matrices:

RelevMatrix_1 = Query_i ⊙ Key_1
......
RelevMatrix_{temp-1} = Query_i ⊙ Key_{temp-1}

where ⊙ denotes pixel-wise multiplication. These are spliced along the time channel into a three-dimensional relevance matrix RelevMatrix:

RelevMatrix = [RelevMatrix_1, RelevMatrix_2, ..., RelevMatrix_{temp-1}]
Next, the degree of association between the Query and the Keys is calculated. In a sense, obtaining attention weights is itself a special classification problem, because analyzing the association between channels amounts to analyzing the probability that each Key belongs to the Query's type. To find the Key most relevant to the Query, a 1 x 1 x 1 convolution processes RelevMatrix, fusing the channel count from C down to 1; this 3D convolution kernel linearly combines the channels, adjusting their feature distribution for the classification step. It resembles a fully-connected layer, but with far fewer parameters, and it can process image data directly, so this step is implemented with it.
After this adjustment, the feature map reflects the pixel-level association between the Query and each Key. To obtain the attention weights, global average pooling is applied directly to each time channel of RelevMatrix, converting each feature map into a single value; Softmax then converts the pooled results into probability values between 0 and 1. The overall formulas are as follows.
weight_x = AvgPooling{Conv_{1x1x1}(RelevMatrix_x)}
AttenWeight_x = exp(weight_x) / Σ_j exp(weight_j),  j = 1, 2, ..., temp-1
The channel Max-Key corresponding to the time step with the highest probability is the channel most correlated with the Query. Finally, attention weighting is performed: Max-Key is point-multiplied by its attention weight Max-weight and added directly to the Query to obtain AttenQuery, completing the pixel-level attention weighting, expressed as follows.
AttenQuery = Query + Max-weight ⊙ Max-Key
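A minimal PyTorch sketch of the CB-Attention computation for a single Query slice, following the steps above (point multiplication with each Key, 1 x 1 x 1 channel fusion, global average pooling, Softmax, then fusing the highest-weighted Key back into the Query); shapes and names are assumptions read from the description, not the patent's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAttention(nn.Module):
    """Channel-based attention over the time slices of Origin-Segment."""
    def __init__(self, n_classes: int):
        super().__init__()
        # 1x1x1 3D convolution fusing the C class channels into one
        self.fuse = nn.Conv3d(n_classes, 1, kernel_size=1)

    def forward(self, oseg: torch.Tensor, i: int) -> torch.Tensor:
        # oseg: Origin-Segment of shape (temp, C, H, W); slice i is the Query
        query = oseg[i]                                    # (C, H, W)
        keys = torch.cat([oseg[:i], oseg[i + 1:]], dim=0)  # (temp-1, C, H, W)
        relev = query.unsqueeze(0) * keys                  # point multiplication
        # reshape to (1, C, temp-1, H, W) so Conv3d fuses the class channels
        fused = self.fuse(relev.permute(1, 0, 2, 3).unsqueeze(0))
        scores = fused.mean(dim=(3, 4)).flatten()          # one value per Key
        weights = F.softmax(scores, dim=0)                 # attention weights
        j = int(torch.argmax(weights))                     # Max-Key index
        return query + weights[j] * keys[j]                # AttenQuery
```

Applying this for every i = 1, ..., temp and stacking the results yields Attention-Segment.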
After attention weighting, AttenQuery reflects the fusion of Max-Key into the Query at the pixel level, which changes some pixel values of the source Query. Extending this over the entire original segmentation map Origin-Segment yields, after attention weighting, the new attention segmentation map Attention-Segment with dimension (temp, C, 240, 240). Although Attention-Segment is more refined, its pixels are still gray values that do not yet reflect each pixel's classification, and its channel count is still C. The channel count of Attention-Segment must therefore finally be converted to 1 to complete the pixel classification, as detailed below.
Consider one time slice of Attention-Segment, with dimension (C, 240, 240); let pixel(A) be a pixel on the 2D plane with plane coordinates (x, y). Within Attention-Segment, the only other pixels directly related to pixel(A) are (x, y, 0), (x, y, 1), ..., (x, y, C). Since image segmentation is a pixel classification problem, for the pixel at (x, y) the coordinate sequence (x, y, 0), (x, y, 1), ..., (x, y, C) is first processed with Log-Softmax, following the standard classification method, to obtain C probability values (P_0, P_1, ..., P_C); the cross-entropy loss between (P_0, P_1, ..., P_C) and the true label Y_i is then calculated. All other pixels are processed the same way.
The final step of the attention module generates the final segmentation map Final-Segment, mainly by scanning the C channels of Attention-Segment: if, for a pixel (x, y) on the 2D plane, the value (x, y, ch) on the ch-th channel is maximal (ch ∈ (0, C)), the pixel at that position is classified as ch. CB-Attention is added to further adjust the pixel distribution across channels, so that pixels belonging to lesion type ch land on channel ch as accurately as possible, further improving accuracy.
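Under the (temp, C, H, W) tensor reading of Attention-Segment, the channel scan above is simply a per-pixel argmax over the class channels; a one-function sketch, with the name final_segment illustrative:

```python
import torch

def final_segment(atten_segment: torch.Tensor) -> torch.Tensor:
    """Scan the C channels of Attention-Segment: each pixel (x, y) takes
    the class ch whose channel holds the maximal value at that position."""
    # atten_segment: (temp, C, H, W) -> (temp, H, W) integer class map
    return atten_segment.argmax(dim=1)
```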
Finally, viewing the whole pipeline end to end, from the input 3D-Slice images to the output Final-Segment segmentation map: the data names, dimension changes, and model components traversed along the way are shown in detail in the flowchart of fig. 4, where T denotes the time-slice length temp and C the final number of classes, i.e. the number of lesion types.
Drawings
FIG. 1: the structure of the whole model.
FIG. 2: HDC-Block hybrid hole convolution Block maps.
FIG. 3: CB-Attention block diagram.
FIG. 4: and a data format change flow chart.
FIG. 5: and calculating a region map referred by the evaluation index.
FIG. 6: and comparing the actual segmentation graph with the basic model.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
A brain tumor detection method based on an attention mechanism and MRI multi-modal fusion comprises the following steps.
step S1: introduction of data sets.
To verify the model's performance, BraTS2015 was selected as the dataset. The entire dataset contains 274 samples, 220 of which are HGG cases, medically known as high-grade gliomas, poorly differentiated malignancies; the remaining 54 are LGG cases, medically known as low-grade gliomas, better-differentiated and relatively benign tumors. Each sample contains five 3D volumes, each consisting of 155 layers of 2D images. The first four 3D images are the MRI scan results T1, T2, T1c, and Flair, representing the basic structure, tissue water content, tissue blood supply, and tissue bound-water content respectively. The last 3D image is the label, i.e. the manually annotated ground-truth segmentation map. In this scenario each label pixel value has only the five possibilities [0, 1, 2, 3, 4], representing five different lesion types; in other scenarios this generalizes to C classes, with each pixel value taking possibilities [0, 1, ..., C].
Step S2: an evaluation index and a region are determined.
Fig. 5 shows the regions referenced when calculating the evaluation indices, where T1 denotes the true disease region, P1 the predicted disease region, T0 the true healthy region, and P0 the predicted healthy region. From these four regions, IoU, Dice, Sensitivity, and PPV are computed to reflect the model's segmentation performance in all respects, described one by one below.
IoU is the intersection-over-union: given a prediction map and a ground-truth image, IoU measures the degree of overlap of the same target region across the two images, ranging from 0 to 1, with higher values meaning more accurate prediction.
IoU = (T1 ∩ P1) / (T1 ∪ P1)
Dice measures the similarity between two set distributions: given a prediction map and a ground-truth map, Dice measures the overall similarity of the two images at the pixel level, ranging from 0 to 1, from worst to best.
Dice = 2 x (T1 ∩ P1) / (T1 + P1)
Sensitivity indicates how much of the true tumor region is predicted as tumor.
Sensitivity = (T1 ∩ P1) / T1
The positive predictive value PPV indicates how much of the region predicted as tumor is truly tumor.
PPV = (T1 ∩ P1) / P1
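The four indices above reduce to simple set operations on the predicted and true masks. A compact sketch under that reading, assuming integer class maps; the function name region_metrics is illustrative:

```python
import numpy as np

def region_metrics(pred: np.ndarray, truth: np.ndarray, classes) -> dict:
    """IoU, Dice, Sensitivity and PPV for one label combination, e.g.
    classes=[1, 2, 3, 4] for the entire lesion region. pred and truth
    are integer class maps of identical shape (P1 and T1 in fig. 5)."""
    p1 = np.isin(pred, classes)                 # predicted disease region P1
    t1 = np.isin(truth, classes)                # true disease region T1
    tp = float(np.logical_and(p1, t1).sum())
    eps = 1e-8                                  # guard against empty regions
    return {
        "IoU": tp / (float(np.logical_or(p1, t1).sum()) + eps),
        "Dice": 2 * tp / (float(p1.sum()) + float(t1.sum()) + eps),
        "Sensitivity": tp / (float(t1.sum()) + eps),
        "PPV": tp / (float(p1.sum()) + eps),
    }
```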
Segmenting the tumor region is in essence a pixel-level multi-classification process. For example, in the five-class problem each pixel value has the five possibilities [0, 1, 2, 3, 4], representing healthy region, necrosis, edema, tumor, and enhancing tumor respectively. For validation and testing, [0, 1, 2, 3, 4] is grouped into the following three combinations, each evaluated with IoU, Dice, Sensitivity, and PPV.
[1, 2, 3, 4] denotes the entire lesion region, i.e. the Complete Lesion Area
[1, 3, 4] denotes the entire tumor region, i.e. the Entire Tumor Area
[3, 4] denotes the core tumor region, i.e. the Core Tumor Area
In addition, classes 0, 1, 2, 3, and 4 were each tested separately to assess segmentation performance on every single lesion type, again using the above evaluation indices.
Step S3: training and testing the model.
So that the model learns the characteristics of different tumors simultaneously, the HGG and LGG samples were mixed for training. Of the 274 samples, 224 were selected as the training set, 20 as the validation set, and 30 as the test set. Note that training, validation, and test sets were all chosen randomly, so there is no inflation of test results from similar data-distribution patterns. During training, to check in real time whether the current epoch is the best so far, the model is tested once on the validation set after every epoch, yielding the validation Dice on the [1, 2, 3, 4] region. If this Dice beats the previous best, the current model is saved and the global best Dice is updated immediately; otherwise neither is updated. If, as training epochs increase, the validation loss gradually rises, the model is drifting toward overfitting; in that case, check whether Dice is still improving meaningfully, and if it is essentially unchanged, stop training immediately.
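A condensed sketch of the validate-and-checkpoint logic just described; train_one_epoch and validate_dice are hypothetical callables standing in for the actual training and validation routines, and the checkpoint file name is illustrative:

```python
import torch

def train_with_best_dice(model, train_one_epoch, validate_dice,
                         max_epochs: int, ckpt_path: str = "best_model.pt") -> float:
    """After every epoch, evaluate Dice on the validation set (the
    [1, 2, 3, 4] region); save the model and update the global best
    only when the new Dice beats it."""
    best_dice = 0.0
    for _ in range(max_epochs):
        train_one_epoch(model)
        dice = validate_dice(model)
        if dice > best_dice:
            best_dice = dice
            torch.save(model.state_dict(), ckpt_path)
    return best_dice
```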
The training server used four NVIDIA GeForce RTX 2080 Ti graphics cards; each GPU holds a batch size of 2, so the batch size was set to 8. With only one or two 2080 Ti cards the model can still be trained normally, but the batch size must drop to 2 or 4. If the GPU compute power is below a 2080 Ti, the internal channel count N of HDC-MUnet must be reduced for training to proceed, though segmentation performance will most likely degrade. The server CPU should also be strong, at least at the i9-9900K level; 96 GB of memory supports training the model at batch size 8.
The model is trained so that back-propagation updates all parameters to complete the task as accurately as possible, so choosing a suitable optimizer matters. Adam is efficient and accounts for first- and second-order momentum, making the model robust to gradient scaling. In addition, the optimal values of the other hyper-parameters were found by controlling variables: the internal channel count N and the learning rate are 32 and 1e-4 respectively. After training, the model performs well on the IoU, Dice, and PPV indices and surpasses existing classical models such as Unet, Attention-Unet, and Vnet.
Step S4: and outputting the segmentation chart.
To demonstrate the segmentation improvement visually, the segmentation maps output by the base models and by this model are shown in fig. 6, where red, green, blue, and yellow represent labels 1, 2, 3, and 4, i.e. the necrosis, edema, tumor, and enhancing tumor types respectively. Column (f) is this model's segmentation result, the last column is the true label, and the first six columns are the base models' segmentation maps. It is visually evident that the invention improves segmentation overall and suppresses noise and mis-segmentation to a degree.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof; the present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein; any reference sign in a claim should not be construed as limiting the claim concerned.
Moreover, although this specification is described in terms of embodiments, not every embodiment contains only a single technical solution; this manner of description is adopted only for clarity, and those skilled in the art should take the specification as a whole, since the technical solutions of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (4)

1. A brain tumor detection method based on attention mechanism and MRI multi-modal fusion is characterized in that: the brain tumor detection method comprises the following steps
S1: preprocessing key input data;
each sample consists of four 3D-MRI images, with overall dimension (4, 155, 240, 240) representing modality, depth, length, and width respectively; one 3D-MRI image is regarded as a sequence of 155 2D images, the 2D images also being called 2D cross-sections, and the second dimension also being called the time-series dimension; the 3D-MRI is cut along the time dimension into several 3D-slices of dimension (4, temp, 240, 240), and each layer of 2D cross-section of a 3D-slice is fed one by one in time order into HDC-MUnet for processing, wherein N is the number of output channels of the first convolution layer of HDC-MUnet;
redefining the weight of each category of the cross entropy loss into the following formula;
weight(c) = median(freq) / freq(c)
wherein freq(c) is the proportion of pixels belonging to category c among all pixels, i.e. the frequency with which the current pixel type appears, and median(freq) is the median of the class frequencies;
S2: improving Multi-Unet and generating the HDC-MUnet structure;
pixels with obvious differences exist around the blurred boundary, and the receptive field is expanded to capture more of them;
replacing the ordinary convolution blocks in the original T1 and T1c downsampling branches with the hybrid dilated convolution block HDC-Block, wherein in the HDC-Block the first 3 x 3 convolution kernel does not use dilated convolution, keeping the original dilation rate of 1 for comprehensively extracting detail features; the second 3 x 3 convolution kernel uses a dilated convolution with dilation rate 2 to expand the receptive field from 3 x 3 to 5 x 5; tracing back to the input, one pixel value in the output feature map corresponds to an input receptive field enlarged from 5 x 5 to 7 x 7; because the T2 and Flair modalities reflect water content, the edema region boundary of those two images is clear and dilated convolution is not needed to enlarge the receptive field;
s3: generating an original segmentation map which can be weighted by attention;
1 x 1 and 3 x 3 convolutions are combined in parallel to form the convolution block producing the attention-weightable segmentation map; the role of the 1 x 1 convolution is feature redistribution and integration, integrating the features of N channels to compress the channel count;
the feature map (temp, N, 240, 240) output by HDC-MUnet contains N internal channels, but the essence of image segmentation is pixel classification, each pixel having C classification possibilities, so the channel count of the feature map is converted from N to the final class count C, which is the role of the MB-OutConv block; to exploit the advantages of both convolutions simultaneously, the 3 x 3 and 1 x 1 convolutions process the output feature maps of HDC-MUnet in parallel, and the feature maps they generate are added to produce the original segmentation map Origin-Segment of dimension (temp, C, 240, 240), wherein C is the final number of lesion types; the original segmentation map is a temporal image sequence [OS_1, OS_2, ..., OS_temp] of length temp, the subscript being the time slice of the current segmentation map OS, expressed by the following formula;
[OS_1, OS_2, ..., OS_temp] = MBOutConv{HDCMUnet([img_1, img_2, ..., img_temp])}
attention-weighting the Origin-Segment, so it is taken as input to CB-Attention;
s4: improving the segmentation effect by an attention mechanism;
for each 3D-Slice, considering a single modality, the dimension is (temp, 240, 240), and two 2D cross-sections that are close in space necessarily have a degree of correlation; each cross-sectional layer of the 3D MRI image is regarded as a channel, and a dot-product-based attention mechanism is used;
calculating the attention weights between the i-th (1 <= i <= temp) element OSeg_i of Origin-Segment and the other elements; letting Query_i = OSeg_i and taking the other images as Keys, expressed by the following formulas;
[Key_1, Key_2, ..., Key_{i-1}] = [OSeg_1, OSeg_2, ..., OSeg_{i-1}];
[Key_i, Key_{i+1}, ..., Key_{temp-1}] = [OSeg_{i+1}, OSeg_{i+2}, ..., OSeg_temp]
when the subscript j of Key_j is less than i, Key_j = OSeg_j; otherwise Key_j = OSeg_{j+1}; next, the point multiplication of the Query with each Key is performed, obtaining temp-1 relevance matrices;
RelevMatrix_j = Query_i ⊙ Key_j,  j = 1, 2, ..., temp-1
splicing the time channels to obtain a three-dimensional relevance matrix RelevMatrix;
RelevMatrix = [RelevMatrix_1, RelevMatrix_2, ..., RelevMatrix_{temp-1}]
calculating the degree of association between the Query and the Keys, i.e. analyzing the probability that each Key belongs to the Query's type; to find the Key most relevant to the Query, a 1 x 1 x 1 convolution processes RelevMatrix, fusing the channel count from C to 1, the 3D convolution kernel linearly combining the channels to adjust their feature distribution for the classification process;
at this point the feature map reflects the pixel-level association between the Query and the Keys; to obtain the attention weights, global average pooling acts directly on each time channel of RelevMatrix, converting each feature map into a value; Softmax then converts the pooled results into probability values between 0 and 1, with the overall formulas as follows;
weight_x = AvgPooling{Conv_{1x1x1}(RelevMatrix_x)}
AttenWeight_x = exp(weight_x) / Σ_j exp(weight_j),  j = 1, 2, ..., temp-1
the channel Max-Key corresponding to the time step with the maximum probability is the channel with the maximum correlation degree with the Query; finally, attention weighting is carried out; performing point multiplication on the Max-Key and the attention weight Max-weight corresponding to the Max-Key, then directly adding the point multiplication with the Query to obtain AttenQuery, and finishing the attention weighting process of pixel points, wherein a formula is expressed as follows;
AttenQuery = Query + Max-weight ⊙ Max-Key
after attention weighting, AttenQuery reflects the pixel-level fusion of Max-Key with the Query, which changes some pixel values of the source Query; extending this over the entire original segmentation map Origin-Segment yields, after attention weighting, the new attention segmentation map Attention-Segment, whose dimension is still (temp, C, 240, 240); finally, the channel count of Attention-Segment is converted to 1 to complete the pixel classification.
2. The brain tumor detection method based on attention mechanism and MRI multi-modal fusion according to claim 1, characterized in that: in the pixel classification process, one time slice of Attention-Segment is considered, with dimension (C, 240, 240); let a pixel on the two-dimensional plane be pixel(A) with plane coordinates (x, y); within Attention-Segment, the only other pixels directly related to pixel(A) are (x, y, 0), (x, y, 1), ..., (x, y, C); since image segmentation is a pixel classification problem, for the pixel at (x, y) the coordinate sequence (x, y, 0), (x, y, 1), ..., (x, y, C) is first processed with Log-Softmax according to the standard classification method to obtain C probability values (P_0, P_1, ..., P_C); the cross-entropy loss between (P_0, P_1, ..., P_C) and the true label Y_i is then calculated.
3. The brain tumor detection method based on attention mechanism and MRI multi-modal fusion according to claim 2, characterized in that: generating the final segmentation map Final-Segment is realized by scanning the C channels of Attention-Segment; if, for a pixel (x, y) on the 2D plane, the value (x, y, ch) on the ch-th channel is maximal (ch ∈ (0, C)), the pixel at that position is classified as ch.
4. The brain tumor detection method based on attention mechanism and MRI multi-modal fusion as claimed in claim 3, characterized in that: the CB-Attention is added for adjusting the pixel distribution of the channel, so that the pixel points belonging to the pathological change type ch are accurately distributed on the channel ch, and the accuracy is improved.
CN202111343977.XA 2021-11-14 2021-11-14 Brain tumor detection method based on attention mechanism and MRI multi-mode fusion Pending CN114119515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111343977.XA CN114119515A (en) 2021-11-14 2021-11-14 Brain tumor detection method based on attention mechanism and MRI multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111343977.XA CN114119515A (en) 2021-11-14 2021-11-14 Brain tumor detection method based on attention mechanism and MRI multi-mode fusion

Publications (1)

Publication Number Publication Date
CN114119515A true CN114119515A (en) 2022-03-01

Family

ID=80379534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111343977.XA Pending CN114119515A (en) 2021-11-14 2021-11-14 Brain tumor detection method based on attention mechanism and MRI multi-mode fusion

Country Status (1)

Country Link
CN (1) CN114119515A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926477A (en) * 2022-05-16 2022-08-19 东北大学 Brain tumor multi-modal MRI (magnetic resonance imaging) image segmentation method based on deep learning
CN114663431A (en) * 2022-05-19 2022-06-24 浙江大学 Pancreatic tumor image segmentation method and system based on reinforcement learning and attention
CN114663431B (en) * 2022-05-19 2022-08-30 浙江大学 Pancreatic tumor image segmentation method and system based on reinforcement learning and attention
CN115345886A (en) * 2022-10-20 2022-11-15 天津大学 Brain glioma segmentation method based on multi-modal fusion
CN115345886B (en) * 2022-10-20 2022-12-30 天津大学 Brain glioma segmentation method based on multi-modal fusion
WO2024103284A1 (en) * 2022-11-16 2024-05-23 中国科学院深圳先进技术研究院 Survival analysis method and system for brain tumor patient
CN116824213A (en) * 2023-05-17 2023-09-29 杭州新中大科技股份有限公司 Muck truck mud detection method and device based on multi-view feature fusion

Similar Documents

Publication Publication Date Title
CN114119515A (en) Brain tumor detection method based on attention mechanism and MRI multi-mode fusion
CN115578404B (en) Liver tumor image enhancement and segmentation method based on deep learning
CN115496771A (en) Brain tumor segmentation method based on brain three-dimensional MRI image design
KR102220109B1 (en) Method for classifying images using deep neural network and apparatus using the same
CN114494296A (en) Brain glioma segmentation method and system based on fusion of Unet and Transformer
Zhou et al. LAEDNet: a lightweight attention encoder–decoder network for ultrasound medical image segmentation
CN111784704B (en) MRI hip joint inflammation segmentation and classification automatic quantitative classification sequential method
CN112465754B (en) 3D medical image segmentation method and device based on layered perception fusion and storage medium
Ma et al. ATFE-Net: Axial Transformer and Feature Enhancement-based CNN for ultrasound breast mass segmentation
CN116579982A (en) Pneumonia CT image segmentation method, device and equipment
Pei et al. Alzheimer’s disease diagnosis based on long-range dependency mechanism using convolutional neural network
CN113436173A (en) Abdomen multi-organ segmentation modeling and segmentation method and system based on edge perception
Qin et al. Joint transformer and multi-scale CNN for DCE-MRI breast cancer segmentation
CN114882048A (en) Image segmentation method and system based on wavelet scattering learning network
CN117710760B (en) Method for detecting chest X-ray focus by using residual noted neural network
CN117876399A (en) MRI brain tumor segmentation method based on improved U-Net multi-scale feature fusion
Xu et al. RUnT: A network combining residual U-Net and transformer for vertebral edge feature fusion constrained spine CT image segmentation
CN115984257A (en) Multi-modal medical image fusion method based on multi-scale transform
Pang et al. Correlation matters: multi-scale fine-grained contextual information extraction for hepatic tumor segmentation
Zhang et al. Sclmnet: A dual-branch guided network for lung and lung lobe segmentation
CN114926477B (en) Brain tumor multi-mode MRI image segmentation method based on deep learning
KR102234139B1 (en) Method for classifying images using deep neural network and apparatus using the same
CN115619810B (en) Prostate partition segmentation method, system and equipment
CN112750137B (en) Liver tumor segmentation method and system based on deep learning
Iqbal et al. Pseudo slicer on three dimensional brain tumor segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination