CN116152492A - Medical image segmentation method based on multi-attention fusion

Medical image segmentation method based on multi-attention fusion

Info

Publication number
CN116152492A
CN116152492A (application CN202310064474.1A)
Authority
CN
China
Prior art keywords
image segmentation
fusion
attention
medical image
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310064474.1A
Other languages
Chinese (zh)
Inventor
章勇勤
米继宗
刘钰
叶易凤
常明则
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY
Priority to CN202310064474.1A
Publication of CN116152492A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/03 Recognition of patterns in medical or anatomical images

Abstract

The application relates to a medical image segmentation method based on multi-attention fusion. By improving and optimizing attention modules, the model is made to focus more on the target area, which improves its segmentation precision for the region of interest and its boundaries. The method addresses a shortcoming of existing target detection methods, which tend to segment and classify the target area but rarely achieve high-precision edge segmentation. Experiments show that the method preserves the segmentation quality of the backbone network and is clearly superior to existing methods in segmenting lesion areas.

Description

Medical image segmentation method based on multi-attention fusion
Technical Field
The application relates to the technical field of image segmentation, in particular to a medical image segmentation method based on multi-attention fusion.
Background
The brain is the core of the human nervous system, and brain lesions may lead to permanent impairment of brain function, disability or death. Accurate detection and segmentation of brain lesions can help quantify various pathological indicators (e.g., total lesion volume, lesion location and the number of lesion masses). These quantitative indicators are closely related to aging and pathological changes of the brain, may provide useful clues for patient prognosis, and can further be used to analyze the effects of pharmaceutical interventions and to guide the design of surgical intervention protocols. Hemorrhagic transformation after cerebral infarction refers to bleeding caused by the restoration of blood perfusion in the ischemic area following acute cerebral infarction. It is part of the natural course of cerebral infarction and a major adverse reaction of therapies such as thrombolysis; it is not only associated with poor prognosis of cerebral infarction but is also an important reason why various blood-flow-improving therapies are underused. Rapid analysis of CT images of post-infarction hemorrhagic transformation determines whether a doctor can quickly and accurately diagnose and treat the patient's condition, so a scientific method for rapidly segmenting the hemorrhagic transformation area is necessary.
In the prior art, deep learning automatic segmentation is used to detect and segment brain lesions such as post-infarction hemorrhagic transformation, which is significant for further analysis of brain tissue and for the diagnosis and accurate localization of brain diseases. However, for tasks requiring high edge accuracy, such methods depend on the accuracy of the bounding box: segmentation of non-rectangular objects tends to be less effective, and for targets without fixed boundaries, such as the irregularly shaped lesions that make up most cases, segmentation performance is often poor.
Disclosure of Invention
In order to overcome at least one of the shortcomings in the prior art, embodiments of the present application provide a medical image segmentation method based on multi-attention fusion.
In a first aspect, a medical image segmentation model based on multi-attention fusion is provided, comprising: a deep residual network, a dilated convolution spatial attention module, a pyramid expansion module and a dual-branch fusion module;
the deep residual network is used for extracting features from the image to be segmented to obtain a plurality of first feature maps of different scales;
the dilated convolution spatial attention module is used for performing dilated convolution on the first feature maps of different scales to obtain second feature maps of different scales;
the pyramid expansion module is used for performing convolution and fusion operations on the second feature maps of different scales, from large scale to small, to obtain third feature maps of different scales, and then convolving and fusing the third feature maps of different scales, from small scale to large, to obtain fourth feature maps of different scales;
the dual-branch fusion module is used for performing category mask prediction and foreground/background mask prediction on the fourth feature maps of different scales, and fusing the category mask prediction result and the foreground/background mask prediction result to obtain the image segmentation result.
In one embodiment, the dilated convolution spatial attention module includes an input layer, a max-pooling and average-pooling layer, and a dilated convolution layer connected in sequence.
In one embodiment, the dual-branch fusion module includes a category mask prediction branch and a foreground/background mask prediction branch;
the category mask prediction branch comprises 4 convolution layers, a deconvolution layer and a convolution layer connected in sequence;
the foreground/background mask prediction branch comprises 2 convolution layers and a fully connected layer connected in sequence;
the last convolution layer of the category mask prediction branch outputs the category mask prediction result, and the fully connected layer of the foreground/background mask prediction branch outputs the foreground/background mask prediction result.
In a second aspect, a medical image segmentation method based on multi-attention fusion is provided, including:
inputting an image to be segmented into a medical image segmentation model based on multi-attention fusion to obtain an image segmentation result;
the medical image segmentation model based on multi-attention fusion is the model provided in the first aspect above.
In one embodiment, the method further comprises training the medical image segmentation model based on multi-attention fusion to obtain a trained medical image segmentation model.
In one embodiment, training a medical image segmentation model based on multi-attention fusion includes:
determining at least one target area for each fourth feature map obtained by the pyramid expansion module, and inputting the at least one target area into the dual-branch fusion module.
In one embodiment, determining at least one target region for each fourth feature map comprises:
determining a plurality of predicted target region boxes in each fourth feature map;
calculating the corresponding intersection-over-union (IoU) of each predicted target region box from the real target region box and the plurality of predicted target region boxes;
and sorting the IoU values of all predicted target region boxes from large to small, and selecting at least one top-ranked predicted target region box as the at least one target area.
In one embodiment, the corresponding intersection-over-union IoU_new of each predicted target region box is calculated from the real target region box and the plurality of predicted target region boxes using the following formula:
[Formula image in the original: IoU_new expressed in terms of the predicted box S1, the real box S2, and a penalty factor λ; the exact expression is not recoverable from the text.]
where S1 is the predicted target region box, S2 is the real target region box, and λ is a penalty factor.
In one embodiment, training a medical image segmentation model based on multi-attention fusion includes:
performing HU-value truncation and data augmentation on the image slices containing lesion areas to obtain training data.
In one embodiment, during model training, the deep residual network in the multi-attention fusion based medical image segmentation model is a pre-trained deep residual network.
Compared with the prior art, the application has the following beneficial effects:
according to the method, the model is focused on the target region by improving the modules such as the optimized attention, and the segmentation accuracy of the model on the region of interest and the boundary is improved; the method solves the problems that the existing target detection method tends to segment and classify the target area, but tends to not have high edge precision segmentation, and experiments show that the method better maintains the segmentation result of the backbone network and is obviously superior to the existing method in the segmentation result of the focus area.
Drawings
The present application may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification, together with the following detailed description. In the drawings:
FIG. 1 shows a schematic diagram of a multi-attention fusion based medical image segmentation model according to an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a hole convolution spatial attention module according to an embodiment of the present application;
FIG. 3 illustrates a schematic diagram of a pyramid expansion module according to an embodiment of the present application;
FIG. 4 shows a schematic diagram of a dual branch fusion module according to an embodiment of the present application;
FIG. 5 shows a comparison between the image segmentation experimental results of the present application and the prior art.
Detailed Description
Exemplary embodiments of the present application will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, and that these decisions may vary from one implementation to another.
It should be noted that, in order to avoid obscuring the present application with unnecessary details, only the device structures closely related to the solution according to the present application are shown in the drawings, and other details not greatly related to the present application are omitted.
It is to be understood that the present application is not limited to the embodiments described below with reference to the drawings. Where possible, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, and one or more features may be omitted from an embodiment.
The embodiment of the application provides a medical image segmentation method based on multi-attention fusion, which comprises the following steps: and inputting the image to be segmented into a medical image segmentation model based on multi-attention fusion to obtain an image segmentation result.
Turning to the specific structure of the medical image segmentation model based on multi-attention fusion, FIG. 1 shows a schematic diagram of the model according to an embodiment of the present application. The model includes: a deep residual network, a dilated convolution spatial attention module (Dilated Convolutional Spatial Attention, DCSA), a pyramid expansion module, and a dual-branch fusion module (Dual Branch Fusion Module, DBF). The functions of each module are described in detail below.
The deep residual network is used for extracting features from the image to be segmented to obtain a plurality of first feature maps of different scales;
the dilated convolution spatial attention module is used for performing dilated convolution on the first feature maps of different scales to obtain second feature maps of different scales;
the pyramid expansion module is used for performing convolution and fusion operations on the second feature maps of different scales, from large scale to small, to obtain third feature maps of different scales, and then convolving and fusing the third feature maps, from small scale to large, to obtain fourth feature maps of different scales;
the dual-branch fusion module is used for performing category mask prediction and foreground/background mask prediction on the fourth feature maps of different scales, and fusing the two prediction results to obtain the image segmentation result.
In this embodiment, the deep residual network is a 50-layer ResNet. The image to be segmented is input into the network, and feature extraction yields four first feature maps [C2, C3, C4, C5] of different scales (128×128×256, 64×64×256, 32×32×256 and 16×16×256). Experiments show that as the depth of the ResNet increases further, performance tends to saturate, so 50 layers are sufficient.
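To make the feature-extraction step concrete, the following is a minimal sketch (in PyTorch, an assumed framework choice) of pulling four-scale features out of a ResNet-50. The 1×1 lateral convolutions that bring every scale to 256 channels are our assumption, since raw ResNet stages emit 256/512/1024/2048 channels while the text lists 256 channels at every scale:

```python
# Minimal sketch: multi-scale features [C2, C3, C4, C5] from a ResNet-50 backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter

backbone = resnet50(weights="IMAGENET1K_V1")  # pre-trained, as the text suggests
body = IntermediateLayerGetter(
    backbone, return_layers={"layer1": "C2", "layer2": "C3",
                             "layer3": "C4", "layer4": "C5"})
laterals = nn.ModuleDict({
    "C2": nn.Conv2d(256, 256, 1), "C3": nn.Conv2d(512, 256, 1),
    "C4": nn.Conv2d(1024, 256, 1), "C5": nn.Conv2d(2048, 256, 1)})

x = torch.randn(1, 3, 512, 512)  # a 512x512 CT slice (assumed input size)
feats = {k: laterals[k](v) for k, v in body(x).items()}
# -> C2: 128x128x256, C3: 64x64x256, C4: 32x32x256, C5: 16x16x256
```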
In one embodiment, the dilated convolution spatial attention module includes an input layer, a max-pooling and average-pooling layer, and a dilated convolution layer connected in sequence.
In order to make the network automatically attend to pixel-rich regions of the image during learning while adding only a small amount of computation, a simple and effective dilated convolution spatial attention module for feed-forward convolutional neural networks is added to the feature extraction part of the model. FIG. 2 shows a schematic diagram of the dilated convolution spatial attention module according to an embodiment of the present application. DCSA is an improved SAM spatial attention module in which the ordinary convolution of SAM is replaced by dilated convolution.
In this embodiment, dilated convolution brings two benefits. First, it helps enlarge the receptive field while reducing computation: a large receptive field is needed to detect large segmentation targets, and a high resolution is needed to locate them accurately. Second, it can capture multi-scale context information: dilated convolution has a parameter, the dilation rate, and different dilation rates yield different receptive fields, i.e. multi-scale information, which is important in visual tasks. Combining the conventional SAM spatial attention with dilated convolution, the experiments here show that a dilation rate of 2 greatly improves the model's ability to detect and segment brain lesion areas. After the first feature maps pass through the dilated convolution spatial attention module, second feature maps [D2, D3, D4, D5] of different scales are obtained.
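As an illustration of the DCSA idea, here is a minimal sketch assuming the usual SAM/CBAM spatial-attention layout (channel-wise max and average pooling, a convolution over their concatenation, and a sigmoid gate). The 7×7 kernel and the sigmoid are assumptions; the dilation rate of 2 is the value the text reports works best:

```python
import torch
import torch.nn as nn

class DCSA(nn.Module):
    def __init__(self, kernel_size: int = 7, dilation: int = 2):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2  # keep spatial size unchanged
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        max_map, _ = x.max(dim=1, keepdim=True)  # channel-wise max pooling
        avg_map = x.mean(dim=1, keepdim=True)    # channel-wise average pooling
        attn = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attn                          # re-weight the input feature map

# D_i = DCSA()(C_i) for each scale i in {2, 3, 4, 5}
```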
Specifically, FIG. 3 shows a schematic diagram of the pyramid expansion module according to an embodiment of the present application. The module performs convolution and fusion operations on the second feature maps [D2, D3, D4, D5], from large scale to small, to obtain third feature maps [P2, P3, P4, P5]. That is, a 1×1 convolution is first applied to D5 to obtain P5; then a 1×1 convolution is applied to D4 and the result is fused with P5 to obtain P4; then a 1×1 convolution is applied to D3 and the result is fused with P4 to obtain P3; finally a 1×1 convolution is applied to D2 and the result is fused with P3 to obtain P2.
The third feature maps [P2, P3, P4, P5] are then convolved and fused from small scale to large to obtain fourth feature maps [N2, N3, N4, N5]. That is, a 3×3 convolution is first applied to P2 to obtain N2; then a 3×3 convolution is applied to P3 and the result is fused with N2 to obtain N3; then a 3×3 convolution is applied to P4 and the result is fused with N3 to obtain N4; finally a 3×3 convolution is applied to P5 and the result is fused with N4 to obtain N5.
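A minimal sketch of the pyramid expansion module as described, under stated assumptions: the resizing needed to fuse maps of different scales (nearest-neighbour interpolation below) and the additive form of the fusion are not spelled out in the text and are assumed:

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidExpansion(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.lat = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(4)])
        self.out = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)])

    def forward(self, d2, d3, d4, d5):
        # top-down pass with 1x1 convolutions -> [P2..P5]
        p5 = self.lat[3](d5)
        p4 = self.lat[2](d4) + F.interpolate(p5, size=d4.shape[-2:], mode="nearest")
        p3 = self.lat[1](d3) + F.interpolate(p4, size=d3.shape[-2:], mode="nearest")
        p2 = self.lat[0](d2) + F.interpolate(p3, size=d2.shape[-2:], mode="nearest")
        # bottom-up pass with 3x3 convolutions -> [N2..N5]
        n2 = self.out[0](p2)
        n3 = self.out[1](p3) + F.interpolate(n2, size=p3.shape[-2:], mode="nearest")
        n4 = self.out[2](p4) + F.interpolate(n3, size=p4.shape[-2:], mode="nearest")
        n5 = self.out[3](p5) + F.interpolate(n4, size=p5.shape[-2:], mode="nearest")
        return n2, n3, n4, n5
```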
In one embodiment, FIG. 4 shows a schematic diagram of the dual-branch fusion module according to an embodiment of the present application, which includes a category mask prediction branch and a foreground/background mask prediction branch. The category mask prediction branch comprises 4 convolution layers, a deconvolution layer and a convolution layer connected in sequence; the foreground/background mask prediction branch comprises 2 convolution layers and a fully connected layer connected in sequence. The last convolution layer of the category mask prediction branch outputs the category mask prediction result, and the fully connected layer of the foreground/background mask prediction branch outputs the foreground/background mask prediction result.
In this embodiment, the category mask prediction branch is the main path: a small FCN composed of 4 consecutive convolution layers and 1 deconvolution layer. It is an end-to-end network whose main building blocks are convolution and deconvolution. The image features are first convolved several times to extract deep information; a deconvolution (i.e. interpolation) operation then progressively enlarges the feature map, and finally each pixel value is classified, achieving accurate segmentation of the input image. Each convolution layer of the main path uses a 3×3 kernel, the deconvolution layer upsamples by a factor of 2, and a binary pixel mask is predicted independently for each class to obtain the category mask prediction result, realizing segmentation and classification.
A short path, i.e. the foreground/background mask prediction branch, is added after the 3rd convolution layer (conv3) of the main path. It comprises 2 3×3 convolution layers and 1 fully connected layer and predicts a category-independent foreground/background mask. It is not only efficient, but also lets the parameters of the fully connected layer be trained with more samples, yielding better generality. (After conv3 the main path additionally applies a deconvolution and a convolution to adjust the feature-map dimensions.) To obtain the final mask prediction, the per-class features of the main path and the foreground/background prediction from the short path's fully connected layer are fused. Only one fully connected layer is used in the short path instead of several, which avoids collapsing the hidden spatial feature map into a short feature vector and thereby losing spatial information. Ablation experiments show that branching the short path from conv3 and fusing at the end gives the best results.
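A minimal sketch of the dual-branch fusion head under stated assumptions: the 256-channel width, the 14×14 RoI size, the number of classes, and the element-wise additive fusion are illustrative choices, not taken from the patent:

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, in_ch: int = 256, num_classes: int = 2, roi: int = 14):
        super().__init__()
        self.conv1_4 = nn.ModuleList([                      # main path: 4 convs
            nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
            for _ in range(4)])
        self.deconv = nn.ConvTranspose2d(in_ch, in_ch, 2, 2)  # x2 upsampling
        self.mask = nn.Conv2d(in_ch, num_classes, 1)          # per-class masks
        # short path from conv3: 2 convs + 1 fully connected layer
        self.short = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch // 2, 3, padding=1), nn.ReLU(), nn.Flatten(),
            nn.Linear(in_ch // 2 * roi * roi, (2 * roi) * (2 * roi)))

    def forward(self, x):
        for i, conv in enumerate(self.conv1_4):
            x = conv(x)
            if i == 2:                   # tap the features right after conv3
                fg_bg = self.short(x)    # class-agnostic fg/bg mask (flattened)
        cls_masks = self.mask(self.deconv(x))
        side = cls_masks.shape[-1]
        return cls_masks + fg_bg.view(-1, 1, side, side)  # fuse, broadcast over classes
```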
Further, the medical image segmentation method based on multi-attention fusion according to the embodiment of the present application also includes: training the medical image segmentation model based on multi-attention fusion to obtain the trained model.
Specifically, training a medical image segmentation model based on multi-attention fusion includes:
determining at least one target area for each fourth feature map obtained by the pyramid expansion module, and inputting the at least one target area into the dual-branch fusion module.
In this embodiment, candidate regions (regions of interest, RoIs) with low classification scores can be filtered out when identifying the target regions, alleviating the class-imbalance problem while reducing subsequent computation on unnecessary information.
Specifically, determining at least one target region for each fourth feature map includes:
determining a plurality of predicted target region boxes in each fourth feature map; here, the predicted target region boxes may be determined using a Pyramid RoI Align module from the related art.
Then, the corresponding intersection-over-union of each predicted target region box is calculated from the real target region box and the plurality of predicted target region boxes; specifically, the following formula may be used:
[Formula image in the original: IoU_new expressed in terms of the predicted box S1, the real box S2, and a penalty factor λ; the exact expression is not recoverable from the text.]
where S1 is the predicted target region box, S2 is the real target region box, and λ is a penalty factor.
Finally, the IoU values of all predicted target region boxes are sorted from large to small, and at least one top-ranked predicted target region box is selected as the at least one target area.
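A minimal sketch of this RoI filtering step follows. Because the exact formula for IoU_new appears only as an image in the original, the penalty term below (λ times the non-overlapping fraction of the predicted box) is purely an assumption used to illustrate how a penalized IoU could rank candidate boxes:

```python
import numpy as np

def iou_new(pred, gt, lam=0.1):
    """pred, gt: boxes as (x1, y1, x2, y2); lam: penalty factor (assumed role)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    penalty = lam * (area_p - inter) / max(area_p, 1e-9)  # assumed penalty term
    return inter / max(union, 1e-9) - penalty

def select_targets(pred_boxes, gt_box, k=1):
    scores = [iou_new(np.asarray(b, float), np.asarray(gt_box, float))
              for b in pred_boxes]
    order = np.argsort(scores)[::-1]           # sort IoU_new from large to small
    return [pred_boxes[i] for i in order[:k]]  # keep the top-k boxes as targets
```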
Further, training the medical image segmentation model based on multi-attention fusion includes performing HU-value truncation and data augmentation on the image slices containing lesion areas to obtain training data. Here, data augmentation can use cropping/padding, horizontal flipping, vertical flipping, affine transformation and similar methods that do not change the original pixel values of the CT image, preserving the characteristics of the original images to the greatest extent. Medical image data are scarce, and label data manually annotated by doctors are scarcer still, so data augmentation increases the number and diversity of training samples and extracts as much value as possible from limited data. HU-value truncation is a routine operation in medical image processing; it helps visualize post-infarction hemorrhagic-transformation data and effectively improves the lesion detection rate.
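A minimal sketch of this preprocessing, assuming NumPy and a 0-100 HU brain window (the window bounds are illustrative and not given in the patent):

```python
import numpy as np

def hu_truncate(slice_hu: np.ndarray, lo: float = 0.0, hi: float = 100.0):
    clipped = np.clip(slice_hu, lo, hi)  # HU-value truncation to the window
    return (clipped - lo) / (hi - lo)    # normalize to [0, 1]

def augment(img: np.ndarray, rng: np.random.Generator):
    if rng.random() < 0.5:
        img = img[:, ::-1]               # horizontal flip (pixel values unchanged)
    if rng.random() < 0.5:
        img = img[::-1, :]               # vertical flip
    return np.ascontiguousarray(img)

rng = np.random.default_rng(0)
ct = rng.normal(40, 20, size=(512, 512))  # stand-in for a CT slice in HU
sample = augment(hu_truncate(ct), rng)
```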
Further, during model training, the deep residual network in the medical image segmentation model based on multi-attention fusion is a pre-trained deep residual network. Using a pre-trained network accelerates model convergence and quickly yields a better-performing model. Experiments prove that this effectively improves the lesion segmentation level.
The present application innovatively introduces a dilated convolution spatial attention module and a dual-branch fusion module on top of a backbone network. The evaluation index used is the AP (Average Precision) value, a common metric for target detection. For a given class, a fixed confidence threshold is set (0.7 in the experiments), and TP (true positives, positive samples detected as positive), FP (false positives, negative samples detected as positive) and FN (false negatives, positive samples detected as negative) are counted; precision = TP/(TP+FP) and recall = TP/(TP+FN) are then computed at each confidence threshold. From the resulting series of precision-recall points a PR curve is drawn, and the AP is computed from it; a higher AP value indicates a better detection result (a minimal sketch of this computation follows Table 1). The present application makes comparative experiments against several recent models, with the following results:
table 1 comparative results
Figure BDA0004073673560000121
In Table 1, AP50 represents the AP measurement at a IoU (cross-over) threshold of 0.5, AP75 represents the AP measurement at a IoU threshold of 0.75, and AP S Representative pixel area is less than 32 2 AP measurement at target frame time, AP M Representative pixel area is at 32 2 -96 2 Measurement of target frame in between, AP L Representative pixel area is greater than 96 2 AP measurements for the target frame of (a).
In Table 1, the first row (Mask R-CNN) is the segmentation result of the backbone network model; the second, third and fourth rows are the segmentation results of the existing methods Mask Transfiner, DCT-Mask and RefineMask, respectively; and the fifth row is the segmentation result of the method of the present application. As can be seen from Table 1, the present application achieves a good image segmentation effect.
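As noted above, here is a minimal sketch of the AP computation: sweep detections by confidence, accumulate TP/FP at a fixed IoU threshold, trace the precision-recall curve, and integrate it. The matching of detections to ground truth is simplified to a precomputed flag:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: confidence per prediction; is_tp: whether it matched a GT box
    at the chosen IoU threshold; num_gt: number of ground-truth boxes."""
    order = np.argsort(scores)[::-1]
    tp = np.cumsum(np.asarray(is_tp, float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, float)[order])
    precision = tp / np.maximum(tp + fp, 1e-9)
    recall = tp / max(num_gt, 1)
    return float(np.trapz(precision, recall))  # area under the PR curve

# e.g. three detections, two of which matched a lesion box at IoU >= 0.5
print(average_precision([0.9, 0.8, 0.3], [1, 0, 1], num_gt=2))
```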
FIG. 5 shows a comparison between the image segmentation experimental results of the present application and the prior art: the first column is the original CT slice, the second column the label data manually annotated by doctors, the third column the segmentation result of the prior-art backbone network, and the fourth column the segmentation result of the present application. As FIG. 5 shows, compared with the backbone network, the present application enhances the propagation of semantic information through several simple and effective components, achieves better segmentation and detection precision, and produces results closer to the ground-truth labels manually annotated by doctors, demonstrating the feasibility of the method.
The foregoing is merely various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A medical image segmentation model based on multi-attention fusion, comprising: a deep residual network, a dilated convolution spatial attention module, a pyramid expansion module and a dual-branch fusion module;
the deep residual network is used for extracting features from the image to be segmented to obtain a plurality of first feature maps of different scales;
the dilated convolution spatial attention module is used for performing dilated convolution on the first feature maps of different scales to obtain second feature maps of different scales;
the pyramid expansion module is used for performing convolution and fusion operations on the second feature maps of different scales, from large scale to small, to obtain third feature maps of different scales, and convolving and fusing the third feature maps of different scales, from small scale to large, to obtain fourth feature maps of different scales;
the dual-branch fusion module is used for performing category mask prediction and foreground/background mask prediction on the fourth feature maps of different scales, and fusing the category mask prediction result and the foreground/background mask prediction result to obtain an image segmentation result.
2. The model of claim 1, wherein the dilated convolution spatial attention module comprises an input layer, a max-pooling and average-pooling layer, and a dilated convolution layer connected in sequence.
3. The model of claim 1, wherein the dual-branch fusion module includes a category mask prediction branch and a foreground/background mask prediction branch;
the category mask prediction branch comprises 4 convolution layers, a deconvolution layer and a convolution layer connected in sequence;
the foreground/background mask prediction branch comprises 2 convolution layers and a fully connected layer connected in sequence;
the last convolution layer of the category mask prediction branch outputs the category mask prediction result, and the fully connected layer of the foreground/background mask prediction branch outputs the foreground/background mask prediction result.
4. A medical image segmentation method based on multi-attention fusion, comprising:
inputting an image to be segmented into a medical image segmentation model based on multi-attention fusion to obtain an image segmentation result;
the multi-attention fusion based medical image segmentation model is the multi-attention fusion based medical image segmentation model according to any one of claims 1-3.
5. The method of claim 4, further comprising training the multi-attention fusion-based medical image segmentation model to obtain a trained multi-attention fusion-based medical image segmentation model.
6. The method of claim 5, wherein the training the multi-attention fusion-based medical image segmentation model comprises:
determining at least one target area for each fourth feature map obtained by the pyramid expansion module, and inputting the at least one target area into the dual-branch fusion module.
7. The method of claim 6, wherein said determining at least one target region for each of said fourth feature maps comprises:
determining a plurality of predicted target region boxes in each fourth feature map;
calculating the corresponding intersection-over-union of each predicted target region box from the real target region box and the plurality of predicted target region boxes;
and sorting the IoU values of all predicted target region boxes from large to small, and selecting at least one top-ranked predicted target region box as the at least one target area.
8. The method of claim 7, wherein the corresponding intersection-over-union IoU_new of each predicted target region box is calculated from the real target region box and the plurality of predicted target region boxes using the following formula:
[Formula image in the original: IoU_new expressed in terms of the predicted box S1, the real box S2, and a penalty factor λ; the exact expression is not recoverable from the text.]
where S1 is the predicted target region box, S2 is the real target region box, and λ is a penalty factor.
9. The method of claim 5, wherein the training the multi-attention fusion-based medical image segmentation model comprises:
performing HU-value truncation and data augmentation on the image slices containing lesion areas to obtain training data.
10. The method of claim 5, wherein, during model training, the deep residual network in the multi-attention fusion based medical image segmentation model is a pre-trained deep residual network.
CN202310064474.1A 2023-01-12 2023-01-12 Medical image segmentation method based on multi-attention fusion Pending CN116152492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310064474.1A CN116152492A (en) 2023-01-12 2023-01-12 Medical image segmentation method based on multi-attention fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310064474.1A CN116152492A (en) 2023-01-12 2023-01-12 Medical image segmentation method based on multi-attention fusion

Publications (1)

Publication Number Publication Date
CN116152492A (publication date 2023-05-23)

Family

ID=86353839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310064474.1A Pending CN116152492A (en) 2023-01-12 2023-01-12 Medical image segmentation method based on multi-attention fusion

Country Status (1)

Country Link
CN (1) CN116152492A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117809122A (en) * 2024-02-29 2024-04-02 北京航空航天大学 Processing method, system, electronic equipment and medium for intracranial large blood vessel image

Similar Documents

Publication Publication Date Title
CN106056595B (en) Based on the pernicious assistant diagnosis system of depth convolutional neural networks automatic identification Benign Thyroid Nodules
CN108464840B (en) Automatic detection method and system for breast lumps
Tian et al. Multi-path convolutional neural network in fundus segmentation of blood vessels
CN108133476B (en) Method and system for automatically detecting pulmonary nodules
CN110197493A (en) Eye fundus image blood vessel segmentation method
CN111127466A (en) Medical image detection method, device, equipment and storage medium
CN108257135A (en) The assistant diagnosis system of medical image features is understood based on deep learning method
Liu et al. A framework of wound segmentation based on deep convolutional networks
CN111640120B (en) Pancreas CT automatic segmentation method based on significance dense connection expansion convolution network
CN113781439B (en) Ultrasonic video focus segmentation method and device
CN107169998A (en) A kind of real-time tracking and quantitative analysis method based on hepatic ultrasound contrast enhancement image
CN111598853B (en) CT image scoring method, device and equipment for pneumonia
CN110705403A (en) Cell sorting method, cell sorting device, cell sorting medium, and electronic apparatus
CN110751636A (en) Fundus image retinal arteriosclerosis detection method based on improved coding and decoding network
Jo et al. Segmentation of the main vessel of the left anterior descending artery using selective feature mapping in coronary angiography
CN112215217B (en) Digital image recognition method and device for simulating doctor to read film
CN104545792A (en) Arteriovenous retinal vessel optic disk positioning method of eye fundus image
CN115546605A (en) Training method and device based on image labeling and segmentation model
CN116152492A (en) Medical image segmentation method based on multi-attention fusion
CN112489088A (en) Twin network visual tracking method based on memory unit
CN111429457B (en) Intelligent evaluation method, device, equipment and medium for brightness of local area of image
CN112634291A (en) Automatic burn wound area segmentation method based on neural network
CN116883341A (en) Liver tumor CT image automatic segmentation method based on deep learning
CN111062909A (en) Method and equipment for judging benign and malignant breast tumor
CN114862885A (en) Liver MRI image domain-shrinkage segmentation and three-dimensional focus reconstruction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination