CN112288041B - Feature fusion method of multi-mode deep neural network - Google Patents


Info

Publication number
CN112288041B
Authority
CN
China
Prior art keywords
dimensional
mode
channel
feature map
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011477932.7A
Other languages
Chinese (zh)
Other versions
CN112288041A (en)
Inventor
陈凌
朱闻韬
张铎
申慧
李辉
叶宏伟
王瑶法
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Mingfeng Medical System Co Ltd
Original Assignee
Zhejiang Lab
Mingfeng Medical System Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab, Mingfeng Medical System Co Ltd filed Critical Zhejiang Lab
Priority to CN202011477932.7A priority Critical patent/CN112288041B/en
Publication of CN112288041A publication Critical patent/CN112288041A/en
Application granted granted Critical
Publication of CN112288041B publication Critical patent/CN112288041B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a feature fusion method for a multi-modal deep neural network. In a multi-modal deep three-dimensional CNN, a squeeze-and-excitation (S&E) module is applied in the deep-learning feature domain to obtain a channel attention mask across the modalities, i.e., among all modalities, the channels that clearly help the task target are given greater attention, thereby explicitly establishing the weight distribution of the multi-modal three-dimensional depth feature maps over the channels. Subsequently, a spatial attention mask across the modalities is obtained by four-dimensional convolution followed by a Sigmoid activation function, i.e., for the three-dimensional feature map of each modality, the spatial positions that require greater attention are determined, thereby explicitly establishing the spatial correlation of the multi-modal three-dimensional depth feature maps. Positions carrying important information across modality, channel, and space are thus given greater attention, which improves the diagnostic efficiency of a multi-modal intelligent diagnosis system.

Description

Feature fusion method of multi-mode deep neural network
Technical Field
The invention relates to the field of medical imaging and the field of deep learning, in particular to a feature fusion method of a multi-mode deep neural network.
Background
Existing tumor detection and diagnosis generally rely on medical imaging, including planar X-ray imaging, CT, MRI, PET/CT, ultrasound, and other modalities, with tissue biopsy performed on suspicious lesions found in the images. However, due to the heterogeneity of tumors, their properties cannot be fully characterized on single-modality images. On planar X-ray and CT images, what is characterized is the degree to which tumor tissue absorbs X-rays; on MRI images, the hydrogen proton density of the tumor tissue; on FDG PET/CT, the glucose metabolic activity of the tumor tissue; and on ultrasound images, the degree to which the tumor tissue reflects acoustic waves. Therefore, more and more clinical research is now based on multi-modal images, aiming to provide more comprehensive, multi-dimensional indexes for diagnosis and prognosis prediction and to help physicians formulate treatment plans, thereby realizing precision medicine.
The deep convolutional neural network (CNN) is one of the common methods for constructing medical artificial-intelligence models in recent years. It extracts high-order feature information from an image through multiple layers of convolution, reduces the dimensionality of the features through pooling, and feeds the extracted high-order features into a subsequent task-specific network for tasks such as classification, segmentation, registration, detection, and noise reduction. At present, more and more multi-modal imaging studies are performed with CNNs, and the diagnostic performance is significantly improved compared with that of a single modality.
Existing CNN-based multi-modal intelligent diagnosis models basically use one of the following three methods for feature fusion between modalities: (1) multi-branch: there are several deep convolutional neural network branches, each responsible for the convolution of one modality, and all depth feature maps are added and fused at the same level of each branch; (2) multi-channel: at data input, images of different modalities are stacked into a multi-channel image as input; (3) image fusion: at data input, a specific image fusion algorithm fuses the images of the multiple modalities into a single-channel fused image as input. However, none of these methods analyzes the feature weight distribution of each modality; they simply perform direct addition, stacking, or fusion. The method provided by the invention gives greater attention, in the deep-learning feature domain, to the positions that carry important information across modalities, channels, and space, thereby improving the diagnostic efficiency of a multi-modal intelligent diagnosis system.
Disclosure of Invention
The invention aims to provide a feature fusion method for a multi-modal deep neural network, addressing the defects of the prior art. In a multi-modal deep three-dimensional CNN, a squeeze-and-excitation (S&E) module is used to obtain a channel attention mask across the modalities, i.e., among all modalities, the channels that clearly help the task target are given greater attention, thereby explicitly establishing the weight distribution of the multi-modal three-dimensional depth feature maps over the channels. Subsequently, four-dimensional convolution followed by a Sigmoid activation function is used to obtain spatial attention masks across the modalities, i.e., for the three-dimensional feature map of each modality, the spatial positions that require greater attention are determined, thereby explicitly establishing the spatial correlation of the multi-modal three-dimensional depth feature maps.
The purpose of the invention is realized by the following technical scheme: a feature fusion method of a multi-modal deep neural network specifically comprises the following steps:
Step one: in a multi-branch (MB) multi-modal (MM) deep CNN, the three-dimensional feature maps output by the nth stage of each branch are stacked along the channel dimension to obtain a three-dimensional feature map with x times the original number of channels, where x is the number of branches. Average pooling is performed over the depth, height, and width dimensions, compressing the map into a one-dimensional vector along the channel dimension. The one-dimensional vector is down-sampled and up-sampled and passed through a Sigmoid activation function to obtain the multi-modal channel attention mask. The multi-modal channel attention mask is multiplied with the three-dimensional feature map having x times the original channel number to obtain a multi-modal three-dimensional feature map, which is then split along the channel dimension according to the channel count of each original branch modality, yielding x single-modality three-dimensional feature maps weighted across the modalities. The three-dimensional feature maps output by the modalities all have the same number of channels, depth, height, and width.
Step two: performing one-dimensional average pooling and one-dimensional maximum pooling calculation on the multi-mode three-dimensional feature map on channel dimensions to obtain two three-dimensional feature maps after pooling, respectively newly building modal dimensions in the two three-dimensional feature maps after pooling, and overlapping on the modal dimensions to obtain a four-dimensional feature map; and (3) performing convolution on the four-dimensional feature map by using x four-dimensional convolution kernels, so that the four-dimensional convolution kernels learn how to obtain the weight distribution of x modes on the space. The number of channels of the convolution output is x, and the channels correspond to each mode respectively; calculating each output by using an activation function Sigmoid, and compressing on a modal dimension to obtain x single-modal spatial attention masks;
step three: and multiplying the single-mode space attention mask obtained in the step two by the single-mode three-dimensional feature map obtained in the step one according to the corresponding mode to obtain a multi-mode fusion feature map, and completing multi-mode feature fusion.
Further, the single-modality three-dimensional feature maps obtained in step one are used as the inputs of the feature fusion at the next level of the corresponding branches.
Further, the modalities include at least two of planar X-ray imaging, CT, MRI, PET/CT, ultrasound, and the like.
Further, the multi-branch multi-modal deep convolutional neural network is ResNet, Faster RCNN, U-Net, or CenterNet.
The advantage of the method is that, by using the multi-modal channel and spatial attention mechanism, the multi-modal depth feature maps can be fused more finely at each level of the multi-branch deep CNN, and the positions in the multi-modal feature maps that carry important information across channel, modality, and space receive greater attention, thereby maximizing the performance of the multi-modal deep CNN intelligent diagnosis model. A concrete tensor-shape trace of the three steps, taking the bimodal case as an example, is sketched below.
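This trace is a minimal PyTorch illustration only; the batch and feature-map sizes, the random stand-in weights, and the final element-wise sum of the two masked maps are assumptions made for illustration and are not part of the claimed method.

```python
# Minimal shape trace of steps one to three for the bimodal case (x = 2).
# Learnable weights of the S&E module and the four-dimensional convolution are
# replaced by random stand-ins, so this only illustrates the tensor mechanics.
import torch
import torch.nn.functional as F

B, C, D, H, W = 1, 32, 16, 64, 64            # batch, channels per branch, depth, height, width
u_ct = torch.randn(B, C, D, H, W)            # CT branch feature map at stage n
u_pt = torch.randn(B, C, D, H, W)            # PET branch feature map at stage n

# Step one: stack on channels, squeeze to a channel vector, excite, re-weight, split.
u_stack = torch.cat([u_ct, u_pt], dim=1)               # (B, 2C, D, H, W)
z = F.adaptive_avg_pool3d(u_stack, 1).flatten(1)       # (B, 2C)
W_a = torch.randn(2 * C // 16, 2 * C)                  # stand-in for the 16:1 down-sampling
W_b = torch.randn(2 * C, 2 * C // 16)                  # stand-in for the 1:16 up-sampling
s = torch.sigmoid(F.relu(z @ W_a.t()) @ W_b.t())       # (B, 2C) channel attention mask
u_weighted = u_stack * s.view(B, 2 * C, 1, 1, 1)       # broadcast over d, h, w
u_ct_w, u_pt_w = u_weighted.split(C, dim=1)            # two (B, C, D, H, W) weighted maps

# Step two: channel-wise avg/max pooling, stack on a new dim, conv + Sigmoid.
p_stack = torch.stack([u_weighted.mean(1), u_weighted.amax(1)], dim=1)  # (B, 2, D, H, W)
kernel = torch.randn(2, 2, 3, 3, 3)                    # stand-in for the learned 4D kernel
masks = torch.sigmoid(F.conv3d(p_stack, kernel, padding=1))             # (B, 2, D, H, W)
p_ct, p_pt = masks[:, :1], masks[:, 1:]                # per-modality spatial attention masks

# Step three: multiply the masks into the weighted maps and combine (here: summed).
u_fusion = u_ct_w * p_ct + u_pt_w * p_pt               # (B, C, D, H, W) fused feature map
print(u_fusion.shape)
```

Running the snippet prints torch.Size([1, 32, 16, 64, 64]), i.e., the fused map has the same shape as each single-branch feature map.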
Drawings
FIG. 1 is a flow chart of the bimodal depth feature fusion method of the present invention, taking PET/CT as an example;
FIG. 2 is a schematic diagram of the U-Net network structure.
Detailed Description
The present invention is described in detail below with reference to the drawings, taking PET/CT bimodal imaging (i.e., x = 2) as an example.
As shown in FIG. 1, the method of the present invention specifically includes the following steps:
Step one: in the bimodal three-dimensional CNN with two branches, the two branches correspond to the convolution branch of the PET modality and the convolution branch of the CT modality, respectively. The three-dimensional feature maps output by the nth stage of the two three-dimensional convolution branches are stacked along the channel dimension to obtain a three-dimensional feature map with twice the original number of channels. Average pooling is then performed over the depth, height, and width dimensions, compressing these three dimensions to obtain a one-dimensional vector along the channel dimension. Down-sampling and up-sampling with a compression ratio of 16:1:16 are applied to this vector, followed by a Sigmoid activation function, yielding the weight distribution of the bimodal PET-CT feature map over the channels, i.e., the bimodal channel attention mask. Finally, the bimodal channel attention mask is multiplied with the previously stacked feature map having twice the original channel number, and the result is split 1:1 along the channel dimension, giving two single-modality three-dimensional feature maps weighted across the PET-CT modalities. The step specifically comprises the following substeps:
(1.1) At the nth stage of the two-branch deep three-dimensional CNN, the feature map of the CT branch is u_ct(c, d, h, w) and the feature map of the PET branch is u_pt(c, d, h, w). The three-dimensional feature maps of the CT branch and the PET branch are first stacked along the channel dimension to obtain the stacked three-dimensional feature map u_stack(2×c, d, h, w), where c, d, h, w denote channel, depth, height, and width, respectively.
(1.2) Using equation (1), average pooling of the stacked three-dimensional feature map is performed over the depth, height, and width dimensions to obtain z_stack(2×c, 1, 1, 1), and these three dimensions are compressed to give the pooled vector z_stack(2×c):
$$z_{stack}(c') = \frac{1}{D_n H_n W_n} \sum_{i=1}^{D_n} \sum_{j=1}^{H_n} \sum_{k=1}^{W_n} u_{stack}(c', i, j, k), \qquad c' = 1, \ldots, 2C_n$$  (1)
where C_n is the number of channels output by the nth stage, and D_n, H_n, W_n are the depth, height, and width of the three-dimensional feature map output by the nth stage; C_n is preferably divisible by 16.
(1.3) For the pooled vector z_stack(2×c) obtained in step (1.2), down-sampling and up-sampling with a compression ratio of 16:1:16 are performed along the channel dimension according to equation (2), each sampling followed by an activation function, finally giving the bimodal channel attention mask s(2×c). Here W_a is a matrix of size C_n'/16 × C_n', W_b is a matrix of size C_n' × C_n'/16, δ is the ReLU function, σ is the Sigmoid function, and C_n' = 2×C_n.
$$s = \sigma\big(W_b \, \delta(W_a \, z_{stack})\big)$$  (2)
(1.4) The bimodal channel attention mask obtained in step (1.3) is expanded over the depth, height, and width dimensions by a broadcasting mechanism to s(2×c, d, h, w) and multiplied element-wise (denoted ⊙ in equation (3)) with the stacked three-dimensional feature map obtained in step (1.1), giving the stacked bimodal three-dimensional feature map corrected by the bimodal attention mask, ũ_stack(2×c, d, h, w). Finally, according to the original channel count of each modality, this feature map is split 1:1 along the channel dimension into its CT and PET parts, giving the two single-modality three-dimensional feature maps ũ_ct(c, d, h, w) and ũ_pt(c, d, h, w):
$$\tilde{u}_{stack}(2{\times}c, d, h, w) = s(2{\times}c, d, h, w) \odot u_{stack}(2{\times}c, d, h, w)$$  (3)
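Substeps (1.1) to (1.4) can be sketched as a PyTorch module roughly as follows. This is a minimal sketch that assumes the 16:1:16 down-sampling and up-sampling are realized by two fully connected layers, as in a standard squeeze-and-excitation block; the class and variable names are illustrative and do not come from the patent.

```python
import torch
import torch.nn as nn


class BimodalChannelAttention(nn.Module):
    """Steps (1.1)-(1.4): channel attention over the stacked CT/PET feature maps."""

    def __init__(self, channels_per_branch: int, reduction: int = 16):
        super().__init__()
        stacked = 2 * channels_per_branch                 # C_n' = 2 x C_n
        self.pool = nn.AdaptiveAvgPool3d(1)               # squeeze over d, h, w   (eq. 1)
        self.fc = nn.Sequential(                          # 16:1:16 excitation     (eq. 2)
            nn.Linear(stacked, stacked // reduction),     # W_a
            nn.ReLU(inplace=True),
            nn.Linear(stacked // reduction, stacked),     # W_b
            nn.Sigmoid(),
        )

    def forward(self, u_ct: torch.Tensor, u_pt: torch.Tensor):
        # (1.1) stack the two modality feature maps along the channel dimension
        u_stack = torch.cat([u_ct, u_pt], dim=1)          # (B, 2C, D, H, W)
        b, c2 = u_stack.shape[:2]
        # (1.2) average-pool depth/height/width down to a channel vector
        z = self.pool(u_stack).flatten(1)                 # (B, 2C)
        # (1.3) bimodal channel attention mask
        s = self.fc(z).view(b, c2, 1, 1, 1)               # (B, 2C, 1, 1, 1)
        # (1.4) broadcast-multiply and split 1:1 back into the two modalities (eq. 3)
        u_weighted = u_stack * s
        u_ct_w, u_pt_w = u_weighted.chunk(2, dim=1)
        return u_weighted, u_ct_w, u_pt_w
```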
Step two: one-dimensional average pooling and one-dimensional maximum pooling are performed on the stacked bimodal three-dimensional feature map along the channel dimension, yielding two pooled three-dimensional feature maps. A new modal dimension is added to each of the two pooled feature maps, and they are stacked along the modal dimension to obtain a four-dimensional feature map whose dimensions are modality, depth, height, and width. The four-dimensional feature map is convolved with 2 four-dimensional convolution kernels, so that the kernels learn the spatial weight distribution of the two modalities. The four-dimensional convolution output has 2 channels corresponding to the two modalities, i.e., the two channels of the fused four-dimensional feature map output by the four-dimensional convolution correspond to PET and CT. The two-channel output is passed through a Sigmoid activation function and compressed along the modal dimension, giving the single-modality spatial attention masks. The step specifically comprises the following substeps:
(2.1) For the bimodal three-dimensional feature map ũ_stack(2×c, d, h, w) weighted by the bimodal channel attention mask in step (1.4), one-dimensional average pooling and one-dimensional maximum pooling are computed along the channel dimension, giving u_mean(c, d, h, w) and u_max(c, d, h, w). A new dimension m, called the modal dimension, is added to each of the two pooled three-dimensional feature maps, and they are stacked along the modal dimension to obtain the bimodal four-dimensional feature map p_stack(c, m, d, h, w).
(2.2) The bimodal four-dimensional feature map obtained in step (2.1) is convolved with a four-dimensional convolution kernel Kernel_4d having 2 output channels, kernel size (2, 3, 3, 3), stride (1, 1, 1, 1), and padding (0, 1, 1, 1), and the result is passed through the Sigmoid function σ (equation (4), where * denotes convolution). The resulting two-channel four-dimensional feature map is the bimodal spatial attention mask p(c, m, d, h, w), where m = 1 and c = 2. The correspondence between channels and modalities can be specified, for example channel one for CT and channel two for PET. Taking the four-dimensional feature map of the channel corresponding to each modality and compressing the modal dimension gives the spatial attention masks corresponding to CT and PET, p_ct(d, h, w) and p_pt(d, h, w), respectively:
$$p(c, m, d, h, w) = \sigma\big(\mathrm{Kernel}_{4d} * p_{stack}(c, m, d, h, w)\big)$$  (4)
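Substeps (2.1) and (2.2) can be sketched as follows. Because the four-dimensional kernel of size (2, 3, 3, 3) spans the whole modal dimension (m = 2) with zero padding in that dimension, the sketch assumes it can be rewritten as an ordinary three-dimensional convolution over the avg/max pair treated as two input channels; this reformulation and the class name are assumptions of the sketch, not the literal construction described above.

```python
import torch
import torch.nn as nn


class BimodalSpatialAttention(nn.Module):
    """Steps (2.1)-(2.2): per-modality spatial attention masks.

    The 4D convolution (out_channels=2, kernel=(2,3,3,3), stride=1, padding=(0,1,1,1))
    collapses the modal dimension of size 2 entirely, so it is implemented here as an
    equivalent Conv3d with two input channels (the avg- and max-pooled maps).
    """

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(in_channels=2, out_channels=2, kernel_size=3, padding=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, u_weighted: torch.Tensor):
        # (2.1) one-dimensional average and maximum pooling along the channel dimension
        u_mean = u_weighted.mean(dim=1, keepdim=True)     # (B, 1, D, H, W)
        u_max = u_weighted.amax(dim=1, keepdim=True)      # (B, 1, D, H, W)
        p_stack = torch.cat([u_mean, u_max], dim=1)       # (B, 2, D, H, W)
        # (2.2) convolution + Sigmoid (eq. 4); channel 0 -> CT mask, channel 1 -> PET mask
        masks = self.sigmoid(self.conv(p_stack))          # (B, 2, D, H, W)
        p_ct, p_pt = masks[:, 0:1], masks[:, 1:2]
        return p_ct, p_pt
```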
step three: and (3) fusing the CT and PET modal feature map obtained in the step (1.4) and the CT and PET modal attention mask obtained in the step (2.2) through a formula (5), and finally obtaining a PET/CT bimodal fusion feature map weighted by the bimodal channel spatial attention masku fusion The fusion feature map considers attention weights of the two modalities between the channels and also considers attention weights of the two modalities on space;
$$u_{fusion}(c, d, h, w) = \tilde{u}_{ct}(c, d, h, w) \odot p_{ct}(d, h, w) + \tilde{u}_{pt}(c, d, h, w) \odot p_{pt}(d, h, w)$$  (5)
To compute the single-modality three-dimensional feature maps of the next stage (n+1) of each branch, the channel-attention-weighted CT and PET three-dimensional feature maps ũ_ct(c, d, h, w) and ũ_pt(c, d, h, w) obtained in step (1.4) are used as inputs for computing the next-stage (n+1) multi-modal fusion feature map.
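Putting the three steps together, one fusion node could be sketched as below, reusing the two modules sketched above; the element-wise sum of the two masked maps in equation (5) and all class names are assumptions of this sketch rather than a definitive implementation.

```python
import torch
import torch.nn as nn


class BimodalFeatureFusion(nn.Module):
    """One fusion node: channel attention, spatial attention, and fusion (eq. 5)."""

    def __init__(self, channels_per_branch: int, reduction: int = 16):
        super().__init__()
        self.channel_att = BimodalChannelAttention(channels_per_branch, reduction)
        self.spatial_att = BimodalSpatialAttention()

    def forward(self, u_ct: torch.Tensor, u_pt: torch.Tensor):
        u_weighted, u_ct_w, u_pt_w = self.channel_att(u_ct, u_pt)   # step one
        p_ct, p_pt = self.spatial_att(u_weighted)                   # step two
        u_fusion = u_ct_w * p_ct + u_pt_w * p_pt                    # step three (eq. 5)
        # the channel-attention-weighted maps feed the next stage of each branch
        return u_fusion, u_ct_w, u_pt_w
```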
The following takes the deep convolutional neural network DLA-34 as an example to explain how to apply the method of the invention to a deep convolutional neural network.
In a practical implementation with the DLA-34 network structure (from Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403-2412, 2018), taking PET/CT bimodal feature fusion as an example, the down-sampling part of DLA-34 needs to be expanded into a dual-branch form (as shown in FIG. 2), i.e., each branch corresponds to the down-sampling of one modality. The fusion node at each stage then performs bimodal feature fusion with the present invention, and the fused features are used as the input of the subsequent up-sampling part. The specific implementation steps are as follows:
(1) PET/CT bimodal data are input: the CT image data are fed to branch 1, and the PET image data are fed to branch 2;
(2) In each fusion node of the down-sampling module, the three-dimensional feature maps output by branches 1 and 2 are fused with the method of the invention and output to the up-sampling part as residual terms. Meanwhile, the single-modality three-dimensional feature maps of branches 1 and 2 obtained from the bimodal channel attention calculation serve as the inputs of the next stage (in FIG. 2, the single-modality three-dimensional feature maps obtained from the fusion at the 1/8-scale stage are used as the inputs of the fusion in the 1/16-scale down-sampling module), where they undergo further feature fusion and down-sampling. This continues until the end of down-sampling.
(3) The up-sampling process remains the same as in the original single branch, except that the input feature maps are the bimodal feature maps fused with the method of the invention. A usage sketch of wiring the fusion nodes into such a dual-branch down-sampling backbone is given below.
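As a hypothetical illustration of how such fusion nodes might be wired into a dual-branch down-sampling backbone, consider the toy two-stage encoder below; its layer counts, channel widths, and class names are invented for the example and do not reproduce the actual DLA-34 topology.

```python
import torch
import torch.nn as nn


class DualBranchDownsampler(nn.Module):
    """Toy two-stage dual-branch encoder with a fusion node after each stage."""

    def __init__(self):
        super().__init__()
        # one 3D convolution stage per modality and per level (stride 2 = down-sampling)
        self.ct_stages = nn.ModuleList([
            nn.Conv3d(1, 32, 3, stride=2, padding=1),
            nn.Conv3d(32, 64, 3, stride=2, padding=1),
        ])
        self.pt_stages = nn.ModuleList([
            nn.Conv3d(1, 32, 3, stride=2, padding=1),
            nn.Conv3d(32, 64, 3, stride=2, padding=1),
        ])
        self.fusions = nn.ModuleList([
            BimodalFeatureFusion(32),
            BimodalFeatureFusion(64),
        ])

    def forward(self, ct: torch.Tensor, pet: torch.Tensor):
        fused_per_stage = []                        # residual inputs for the up-sampling part
        u_ct, u_pt = ct, pet
        for ct_stage, pt_stage, fusion in zip(self.ct_stages, self.pt_stages, self.fusions):
            u_ct = ct_stage(u_ct)                   # branch 1: CT down-sampling
            u_pt = pt_stage(u_pt)                   # branch 2: PET down-sampling
            u_fusion, u_ct, u_pt = fusion(u_ct, u_pt)   # fusion node; weighted maps continue
            fused_per_stage.append(u_fusion)
        return fused_per_stage


# example forward pass with random PET/CT volumes
net = DualBranchDownsampler()
ct = torch.randn(1, 1, 32, 64, 64)
pet = torch.randn(1, 1, 32, 64, 64)
features = net(ct, pet)
print([f.shape for f in features])
```

Each stage returns its fused map for the up-sampling part, while the channel-attention-weighted single-modality maps continue down their respective branches, mirroring step (2) above.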
The performance of the multi-modal deep CNN intelligent diagnosis model can thus be maximized, because greater attention is given to the positions in the multi-modal feature maps that carry important information across channels, modalities, and space.
Furthermore, the method of the invention can also be used in deep two-dimensional convolutional neural networks. The procedure is the same as above, except that there is no depth dimension (d).
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (3)

1. A feature fusion method of a multi-modal deep neural network is characterized by specifically comprising the following steps:
Step one: in a multi-branch multi-modal deep convolutional neural network, the three-dimensional feature maps output by the nth stage of each branch are stacked along the channel dimension to obtain a three-dimensional feature map with x times the original number of channels, where x is the number of branches; the three-dimensional feature map with x times the original channel number is average-pooled over the depth, height, and width dimensions and compressed into a one-dimensional vector along the channel dimension; the one-dimensional vector is down-sampled and up-sampled and passed through a Sigmoid activation function to obtain a multi-modal channel attention mask; the multi-modal channel attention mask is multiplied with the three-dimensional feature map having x times the original channel number to obtain a multi-modal three-dimensional feature map, which is split along the channel dimension according to the channel count of each original branch modality to obtain x single-modality three-dimensional feature maps weighted across the modalities;
Step two: one-dimensional average pooling and one-dimensional maximum pooling are performed on the multi-modal three-dimensional feature map along the channel dimension to obtain two pooled three-dimensional feature maps; a new modal dimension is added to each of the two pooled three-dimensional feature maps, and they are stacked along the modal dimension to obtain a four-dimensional feature map; the four-dimensional feature map is convolved with x four-dimensional convolution kernels, so that the kernels learn the spatial weight distribution of each modality; the convolution output has x channels, one corresponding to each modality; each output is passed through a Sigmoid activation function and compressed along the modal dimension to obtain x single-modality spatial attention masks;
Step three: the single-modality spatial attention masks obtained in step two are multiplied with the corresponding single-modality three-dimensional feature maps obtained in step one to obtain the multi-modal fusion feature map;
wherein, the modalities are at least two of planar X-ray imaging, CT, MRI, PET/CT and ultrasonography.
2. The feature fusion method according to claim 1, wherein the monomodal three-dimensional feature map obtained in the first step is used as an input of feature fusion at a next stage of the corresponding branch.
3. The feature fusion method of claim 1 wherein the multi-branched multi-modal deep convolutional neural network is ResNet, Faster RCNN, U-Net or CenterNet.
CN202011477932.7A 2020-12-15 2020-12-15 Feature fusion method of multi-mode deep neural network Active CN112288041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011477932.7A CN112288041B (en) 2020-12-15 2020-12-15 Feature fusion method of multi-mode deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011477932.7A CN112288041B (en) 2020-12-15 2020-12-15 Feature fusion method of multi-mode deep neural network

Publications (2)

Publication Number Publication Date
CN112288041A (en) 2021-01-29
CN112288041B (en) 2021-03-30

Family

ID=74425908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011477932.7A Active CN112288041B (en) 2020-12-15 2020-12-15 Feature fusion method of multi-mode deep neural network

Country Status (1)

Country Link
CN (1) CN112288041B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065450B (en) * 2021-03-29 2022-09-20 重庆邮电大学 Human body action recognition method based on separable three-dimensional residual error attention network
CN114661968B (en) * 2022-05-26 2022-11-22 卡奥斯工业智能研究院(青岛)有限公司 Product data processing method, device and storage medium
CN115431279B (en) * 2022-11-07 2023-03-24 佛山科学技术学院 Mechanical arm autonomous grabbing method based on visual-touch fusion under weak rigidity characteristic condition
CN116051545B (en) * 2023-03-07 2024-02-06 复旦大学 Brain age prediction method for bimodal images
CN116705297B (en) * 2023-06-07 2024-01-23 广州华科盈医疗科技有限公司 Carotid artery detector based on multiple information processing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296699A (en) * 2016-08-16 2017-01-04 电子科技大学 Cerebral tumor dividing method based on deep neural network and multi-modal MRI image
CN106373109B (en) * 2016-08-31 2018-10-26 南方医科大学 A kind of medical image mode synthetic method
CN108038501B (en) * 2017-12-08 2021-06-11 桂林电子科技大学 Hyperspectral image classification method based on multi-mode compression bilinear pooling

Also Published As

Publication number Publication date
CN112288041A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN112288041B (en) Feature fusion method of multi-mode deep neural network
Zhou et al. Deep learning of the sectional appearances of 3D CT images for anatomical structure segmentation based on an FCN voting method
CN107610194B (en) Magnetic resonance image super-resolution reconstruction method based on multi-scale fusion CNN
CN111445390B (en) Wide residual attention-based three-dimensional medical image super-resolution reconstruction method
Punn et al. RCA-IUnet: a residual cross-spatial attention-guided inception U-Net model for tumor segmentation in breast ultrasound imaging
CN111291825B (en) Focus classification model training method, apparatus, computer device and storage medium
CN110889853A (en) Tumor segmentation method based on residual error-attention deep neural network
CN112767417B (en) Multi-modal image segmentation method based on cascaded U-Net network
CN110706214B (en) Three-dimensional U-Net brain tumor segmentation method fusing condition randomness and residual error
Sun et al. Attention-embedded complementary-stream CNN for false positive reduction in pulmonary nodule detection
CN113436173B (en) Abdominal multi-organ segmentation modeling and segmentation method and system based on edge perception
CN115131369A (en) CT image liver tumor segmentation method based on 3DA-U-Nets framework
CN117078692B (en) Medical ultrasonic image segmentation method and system based on self-adaptive feature fusion
CN116912503B (en) Multi-mode MRI brain tumor semantic segmentation method based on hierarchical fusion strategy
Mienye et al. Improved predictive sparse decomposition method with densenet for prediction of lung cancer
Wang et al. Multiscale transunet++: dense hybrid u-net with transformer for medical image segmentation
Molahasani Majdabadi et al. Capsule GAN for prostate MRI super-resolution
CN112150470A (en) Image segmentation method, image segmentation device, image segmentation medium, and electronic device
CN116563533A (en) Medical image segmentation method and system based on target position priori information
CN115619797A (en) Lung image segmentation method of parallel U-Net network based on attention mechanism
CN117649385A (en) Lung CT image segmentation method based on global and local attention mechanisms
Yuan et al. FM-Unet: Biomedical image segmentation based on feedback mechanism Unet
Wang et al. An improved CapsNet applied to recognition of 3D vertebral images
CN113362350B (en) Method, device, terminal equipment and storage medium for segmenting cancer medical record image
CN110570417B (en) Pulmonary nodule classification device and image processing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant