CN115761377A - Smoker brain magnetic resonance image classification method based on contextual attention mechanism - Google Patents


Info

Publication number
CN115761377A
CN115761377A (application CN202211561153.4A)
Authority
CN
China
Prior art keywords: layer, net model, feature, sampling, dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211561153.4A
Other languages
Chinese (zh)
Inventor
邓鹤
匡星
林富春
张晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN202211561153.4A priority Critical patent/CN115761377A/en
Publication of CN115761377A publication Critical patent/CN115761377A/en
Pending legal-status Critical Current

Landscapes

  • Magnetic Resonance Imaging Apparatus (AREA)

Abstract

The invention discloses a smoker brain magnetic resonance image classification method based on a contextual attention mechanism, which comprises the steps of: building a U-net model; superimposing a multi-head spatial attention module between the penultimate and last down-sampling layers of the U-net model; superimposing a lightweight channel attention module before each skip connection of the U-net model and the corresponding up-sampling layer; connecting a fully connected layer after the last up-sampling layer of the U-net model; and inputting samples into the U-net model and training it. The invention helps overcome the shortcomings of the plain U-net model, capture morphological characteristics of the brain, and explore the rules and mechanisms of structural brain changes.

Description

Smoker brain magnetic resonance image classification method based on contextual attention mechanism
Technical Field
The invention belongs to the technical field of digital image processing, is applicable to various two-dimensional and three-dimensional imaging modalities, and particularly relates to a smoker brain magnetic resonance image classification method based on a contextual attention mechanism.
Background
The human brain contains tens of billions of neurons and is a dynamically changing complex system. In recent years, techniques such as electroencephalography, near-infrared spectroscopy, positron emission tomography, computed tomography, and magnetic resonance imaging (MRI) have been used to probe the neural activity of the brain non-invasively. Among them, MRI is non-invasive, non-destructive, and free of radiation, and offers high temporal and spatial resolution, so it is widely used in brain science research (D. A. Wood et al., Accurate brain-age models for routine clinical MRI examinations, NeuroImage, 2022, 249:118871). By acquiring structural and functional brain images of a substance-dependent subject through different imaging sequences, the technique can reflect the anatomical structure and physiological state of the brain, and thus reveal how addictive substances alter the patterns of brain neural activity.
Brain MRI makes it possible to study structural and physiological changes in the brains of smokers. One mainstream approach to change detection is to classify the images first and then compare the classification results. Its advantage is that it does not depend on the image source or imaging mode; its drawback is that high-precision image classification is difficult to achieve. Because deep learning stacks multiple layers of nonlinear transformations, it not only eliminates the influence of subjective factors inherent in manual methods but also extracts high-level abstract features more effectively, which helps greatly improve the classification efficiency of smokers' brain images.
However, the limited receptive field of the convolution operations in U-net, the common model for medical image processing tasks (a fully convolutional neural network with a U-shaped architecture, skip connections, and contracting and expanding paths), makes it difficult to model the context of distant features. A multi-head spatial attention module (such as a Transformer) can model global context information through a multi-head attention mechanism, but has the disadvantage that local detail information is difficult to obtain. If a lightweight channel attention module (such as Squeeze-and-Excitation) is superimposed on each skip connection of the U-net model, the edge features lost by the down-sampling operations can be effectively compensated, and feature directivity and detail information are enhanced. To achieve accurate classification of smokers' brain MRI images, a contextual attention mechanism can therefore be adopted to overcome the shortcomings of the plain U-net model.
Disclosure of Invention
Aiming at the technical problems of existing smoker brain MRI image classification, the invention provides a smoker brain magnetic resonance image classification method based on a contextual attention mechanism. First, a multi-head spatial attention module is superimposed after the penultimate down-sampling layer of a U-net model, and the global context information of the image is modeled through a multi-head attention mechanism; then, a lightweight channel attention module is superimposed before each skip connection and the corresponding up-sampling layer to enhance feature directivity and detail information; finally, a fully connected layer is connected after the last up-sampling layer to achieve accurate classification of brain images.
A smoker brain magnetic resonance image classification method based on a contextual attention mechanism comprises the following steps:
step 1, building a U-net model, wherein the U-net model comprises multiple down-sampling layers followed by multiple up-sampling layers, and each down-sampling layer is connected to the corresponding up-sampling layer by a skip connection;
step 2, superimposing a multi-head spatial attention module between the penultimate and last down-sampling layers of the U-net model;
step 3, superimposing a lightweight channel attention module before each skip connection of the U-net model and the corresponding up-sampling layer;
step 4, connecting a fully connected layer after the last up-sampling layer of the U-net model;
and step 5, inputting samples into the U-net model, wherein the samples comprise brain magnetic resonance images of the smoking group and of the normal group, and training the U-net model with the objective of minimizing the cross-entropy loss function Loss to obtain optimal U-net model weight parameters.
In the contraction process as described above, each down-sampling layer sequentially comprises two convolutional layers and one pooling layer, with a ReLU activation function added after each convolution; each up-sampling layer sequentially comprises a deconvolution layer and two convolutional layers.
The multi-head spatial attention module comprises a linear projection layer, a position encoding layer, n transformer layers, and a feature mapping layer arranged in sequence, wherein
the linear projection layer divides the input feature information into feature block vectors of equal size by a linear transformation of the feature dimensions, and converts the encoded feature block vectors into a linear embedding sequence that is input to the transformer layers;
the position encoding layer encodes the feature block vectors;
the n transformer layers perform feature extraction on the input linear embedding sequence and output the result to the feature mapping layer;
and the feature mapping layer converts the dimension of the output features of the nth transformer layer and transmits the dimension-converted features to the last down-sampling layer.
The dimension of the output features of the penultimate down-sampling layer of the U-net model as described above is (B, C_{m-1}, W_{m-1}, H_{m-1}), where C_{m-1}, W_{m-1}, and H_{m-1} respectively denote the number of channels, the feature width, and the feature height after the penultimate down-sampling of the U-net model, and B is the number of samples; the input dimension of a transformer layer is (B, num_token, token_dim), where num_token = (W_{m-1}/p) × (H_{m-1}/p) + 1 (one added classification dimension), token_dim = C_{m-1} × p × p, and p is the size of the feature block vectors after division by the linear projection layer;
the feature mapping layer converts the output features of the nth transformer layer from dimension (B, (W_{m-1}/p) × (H_{m-1}/p) + 1, C_{m-1} × p × p) to (B, C_{m-1}, W_{m-1}, H_{m-1}).
As described above, in the lightweight channel attention module, the input feature size is W × H × C and the scaling parameter is r, where W, H, and C respectively denote the input feature width, height, and number of channels, and r is a constant greater than 0:
the W × H × C features are compressed, and a 1 × 1 × C global compressed feature is obtained through global average pooling;
the global compressed feature first passes through a fully connected layer containing C/r neurons and a ReLU activation function;
a normalized channel weight for each feature channel is then obtained through a fully connected layer containing C neurons and a sigmoid activation function;
and a feature re-calibration operation of channel-weight multiplication is performed on the W × H × C features input to the lightweight channel attention module, yielding features of dimension W × H × C that serve as the input of the deconvolution layer of the corresponding up-sampling layer.
Compared with the prior art, the invention has the following advantages:
1. Aiming at the characteristics of smokers' brain magnetic resonance images, a U-net model suited to classifying them is built for the first time. On one hand, this eliminates the influence of subjective factors in manual classification; on the other hand, it can greatly improve the classification efficiency of smokers' brain images compared with traditional classifiers (such as a support vector machine).
2. Constrained by the inherent nature of convolution, the features extracted by the conventional U-net model lack global context information. A multi-head spatial attention module (such as a Transformer) is superimposed after the penultimate down-sampling layer of the U-net model, and global feature information can be obtained by modeling the global context through the module's multi-head attention mechanism.
3. A disadvantage of a multi-head spatial attention module (such as a Transformer) is the lack of local detail information. A lightweight channel attention module (such as Squeeze-and-Excitation) is superimposed before each skip connection and up-sampling layer in the U-net model, adding only a small amount of computation. By integrating the multi-head spatial attention module and the channel attention modules into the U-net model, a framework for global long-range information interaction and local detail recovery in brain magnetic resonance images is constructed: it can extract global information on one hand and recover local detail information on the other, which facilitates image interpretation.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows the training and validation results of the smokers' brain magnetic resonance image classification of the present invention, wherein A is the accuracy curve of the training and validation sets, B is the loss-function curve of the training and validation sets, and the number of epochs is 100.
Detailed Description
To help those of ordinary skill in the art understand and practice the invention, the present invention is described in further detail below with reference to examples. It should be understood that the embodiments described here are illustrative, and the invention is not to be construed as limited thereto.
A smoker brain magnetic resonance image classification method based on a contextual attention mechanism comprises the following steps:
Step 1: build a U-net model for smokers' brain magnetic resonance images, the U-net model comprising down-sampling layers, up-sampling layers, and skip connections.
A U-net model is built for smokers' brain magnetic resonance images; it comprises multiple down-sampling layers, up-sampling layers, and skip connections, with a skip connection between each down-sampling layer and the corresponding up-sampling layer, as shown in FIG. 1. The U-net model has a symmetric encoder and decoder. The encoder has four down-sampling layers in total; in the contraction process, each down-sampling layer sequentially comprises two convolutional layers (with a ReLU activation function after each convolution) and one pooling layer. In each down-sampling layer of the U-net model, the number of feature channels doubles and the image feature size halves, so the deep semantic information of the smoker's magnetic resonance image can be obtained. The decoder has four up-sampling layers; each up-sampling layer sequentially comprises a deconvolution layer and two convolutional layers. Through the expansion operation of up-sampling, the number of feature channels in the U-net model halves and the image feature size doubles, so the low-level position information of the smoker's brain magnetic resonance image can be obtained.
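As an illustrative sketch of this channel-doubling and size-halving scheme (the 256 × 256 input size and initial channel count of 64 are assumptions for illustration, not values taken from the patent):

```python
# Sketch: track feature shapes through the 4-level U-net encoder described
# above. At each level the two convolutions produce the stage's channels,
# then pooling halves the spatial size and the next stage doubles channels.

def encoder_shapes(C0=64, W0=256, H0=256, levels=4):
    """Return the (channels, width, height) after each down-sampling stage."""
    shapes = []
    C, W, H = C0, W0, H0
    for _ in range(levels):
        shapes.append((C, W, H))          # after this stage's convolutions
        C, W, H = C * 2, W // 2, H // 2   # pooling halves W, H; channels double
    return shapes

print(encoder_shapes())
# [(64, 256, 256), (128, 128, 128), (256, 64, 64), (512, 32, 32)]
```

The decoder mirrors this table in reverse: each up-sampling stage halves the channel count and doubles the spatial size.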
Let the dimension of the input features of the U-net model be (B, C_l, W_l, H_l) and the dimension of the output features be (B, C_r, W_r, H_r), with class parameter N, where B denotes the number of samples input to and output from the U-net model during training, C_l and C_r respectively denote the numbers of input and output channels of the U-net model, W_l and H_l respectively denote the width and height of the input features, and W_r and H_r respectively denote the width and height of the output features. The method mainly performs binary classification of smoking and non-smoking brain magnetic resonance images, with N = 2, and evaluates the classification performance on the smoking-group and normal-group images through deep-learning training, validation, and testing. The samples comprise brain magnetic resonance images of the smoking group and of the normal group; by random division, 70% of the samples are used as the training set, 10% as the validation set, and 20% as the test set, with label 0 denoting the normal group and label 1 denoting the smoking group.
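The random 70/10/20 division with labels 0 (normal group) and 1 (smoking group) can be sketched as follows; the sample count and fixed seed are illustrative assumptions:

```python
import random

def split_samples(samples, seed=0):
    """Randomly divide samples into 70% train / 10% validation / 20% test."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(0.7 * len(samples))
    n_val = int(0.1 * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

# Label 0 = normal group, label 1 = smoking group (as in the description).
samples = [("normal_%03d" % i, 0) for i in range(50)] + \
          [("smoker_%03d" % i, 1) for i in range(50)]
train, val, test = split_samples(samples)
print(len(train), len(val), len(test))  # 70 10 20
```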
Step 2: superimpose a multi-head spatial attention module (such as a Transformer) between the penultimate and last down-sampling layers of the U-net model; in this embodiment the number of down-sampling layers is m = 4, so the penultimate down-sampling layer is the third-level down-sampling layer. The goal is to obtain deep semantic information fusing global and local features.
After the third-level down-sampling layer of the U-net model obtained in step 1, a multi-head spatial attention module (such as a Transformer) is embedded, as shown in FIG. 1. The multi-head spatial attention module comprises a Linear Projection layer, a Position Encoding layer, n Transformer Layers, and a Feature Mapping layer arranged in sequence. The image feature data of the brain magnetic resonance image must undergo dimension conversion as it passes through the network model, including feature-dimension preprocessing and postprocessing, which are mainly embodied in the Linear Projection and Feature Mapping layers. Let the dimension of the input features of the multi-head spatial attention module be (B, C_{m-1}, W_{m-1}, H_{m-1}), in this embodiment (B, C_3, W_3, H_3), where C_{m-1}, W_{m-1}, and H_{m-1} respectively denote the number of channels, the feature width, and the feature height after the penultimate down-sampling of the U-net model, and B is the number of samples. Feature-dimension preprocessing is performed by the Linear Projection layer, which converts the input feature dimension into a feature dimension (B, num_token, token_dim) acceptable to the Transformer Layers, so that the multi-head spatial attention module can be linked after the third-level down-sampling layer of the U-net model. Then, through Position Encoding, the position of each feature block vector (patch) is encoded by position embedding, so that the model can perceive the input order of the patches.
Features are then extracted through the n Transformer Layers, during which the dimension (B, num_token, token_dim) remains unchanged. The extracted features must undergo feature-dimension postprocessing through the Feature Mapping layer, i.e., the dimension (B, num_token, token_dim) of the output features of the nth Transformer Layer is converted into the four-dimensional feature dimension (B, C_3, W_3, H_3) acceptable to the fourth-level down-sampling layer linked after the multi-head spatial attention module.
The feature-dimension preprocessing of the Linear Projection layer is as follows: the feature dimension of the image after the three down-sampling operations is the four-dimensional (B, C_3, W_3, H_3), while a Transformer Layer accepts features of dimension (B, num_token, token_dim), so the Linear Projection in the multi-head spatial attention module must perform feature-dimension preprocessing. Through a linear transformation of the feature dimensions, the Linear Projection divides the input feature information into feature block vectors (patches) of equal size and converts them into a linear embedding sequence (B, num_token, token_dim) acceptable to the Transformer Layers, where num_token and token_dim respectively denote the number and dimension of the linear embedding sequence. Let the feature block vector size after division by the Linear Projection layer be p; the dimension of the linear embedding sequence then changes as (B, C_3 × p × p, W_3/p, H_3/p), (B, C_3 × p × p, (W_3/p) × (H_3/p)), (B, (W_3/p) × (H_3/p), C_3 × p × p), and, after adding one classification dimension, (B, (W_3/p) × (H_3/p) + 1, C_3 × p × p), so that num_token = (W_3/p) × (H_3/p) + 1 and token_dim = C_3 × p × p.
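The dimension chain above can be checked with a small NumPy sketch. The concrete values of B, C_3, W_3, H_3, and p are illustrative assumptions, and the classification token is appended as zeros purely to show the shape (a learned token would be used in practice):

```python
import numpy as np

B, C3, W3, H3, p = 2, 256, 16, 16, 4   # illustrative values; W3, H3 divisible by p

x = np.zeros((B, C3, W3, H3))          # output of the third down-sampling layer
# (B, C3, W3, H3) -> (B, C3*p*p, (W3/p)*(H3/p)): gather each p x p patch
x = x.reshape(B, C3, W3 // p, p, H3 // p, p)
x = x.transpose(0, 1, 3, 5, 2, 4).reshape(B, C3 * p * p, (W3 // p) * (H3 // p))
# -> (B, (W3/p)*(H3/p), C3*p*p): tokens along the second axis
x = x.transpose(0, 2, 1)
# append one classification token -> (B, (W3/p)*(H3/p)+1, C3*p*p)
cls_token = np.zeros((B, 1, C3 * p * p))
x = np.concatenate([cls_token, x], axis=1)

num_token, token_dim = x.shape[1], x.shape[2]
print(num_token, token_dim)  # (16/4)*(16/4)+1 = 17, 256*4*4 = 4096
```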
The feature-dimension postprocessing of the Feature Mapping layer is as follows: the multi-head spatial attention module is followed by the fourth-level down-sampling layer, which accepts features of dimension (B, C_3, W_3, H_3), so the Feature Mapping layer must perform feature-dimension postprocessing, i.e., convert the output features of the nth Transformer Layer from dimension (B, (W_3/p) × (H_3/p) + 1, C_3 × p × p) into features of the dimension (B, C_3, W_3, H_3) acceptable to the fourth-level down-sampling layer. The output features of the nth Transformer Layer, of dimension (B, (W_3/p) × (H_3/p) + 1, C_3 × p × p), are split by the Feature Mapping layer into features of dimensions (B, (W_3/p) × (H_3/p), C_3 × p × p) and (B, 1, C_3 × p × p); the features of dimension (B, (W_3/p) × (H_3/p), C_3 × p × p) are then reshaped to obtain features of dimension (B, C_3, W_3, H_3).
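A minimal NumPy sketch of this postprocessing, under the same illustrative values as before, splitting off the classification token and reshaping the remaining tokens back to four dimensions:

```python
import numpy as np

B, C3, W3, H3, p = 2, 256, 16, 16, 4   # illustrative values

# Transformer output: (B, (W3/p)*(H3/p)+1, C3*p*p)
y = np.zeros((B, (W3 // p) * (H3 // p) + 1, C3 * p * p))

# Split off the classification token, keeping (B, (W3/p)*(H3/p), C3*p*p)
cls_token, feat = y[:, :1, :], y[:, 1:, :]

# Reshape back to the (B, C3, W3, H3) expected by the 4th down-sampling layer
feat = feat.transpose(0, 2, 1)                  # (B, C3*p*p, (W3/p)*(H3/p))
feat = feat.reshape(B, C3, p, p, W3 // p, H3 // p)
feat = feat.transpose(0, 1, 4, 2, 5, 3).reshape(B, C3, W3, H3)
print(feat.shape)  # (2, 256, 16, 16)
```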
Step 3: superimpose a lightweight channel attention module (such as Squeeze-and-Excitation) before each skip connection of the U-net model and the corresponding up-sampling layer, in order to enhance feature directivity and recover detail information.
A lightweight channel attention module (such as Squeeze-and-Excitation) is embedded in the U-net model obtained through steps 1 and 2. As shown in FIG. 1, the lightweight channel attention module performs a compression (squeeze) operation, then an excitation operation, and then a feature re-calibration (scale) operation. Let the input feature size of the lightweight channel attention module be W × H × C and the scaling parameter be r, where W, H, and C respectively denote the input feature width, height, and number of channels, and r is a constant greater than 0. First, the W × H × C features undergo the compression operation, and a 1 × 1 × C global compressed feature is obtained through global average pooling. Second, the global compressed feature passes through a fully connected layer containing C/r neurons and a ReLU activation function (input feature dimension 1 × 1 × C, output feature dimension 1 × 1 × C/r), and then through a fully connected layer containing C neurons and a sigmoid activation function (input feature dimension 1 × 1 × C/r, output feature dimension 1 × 1 × C), yielding a normalized channel weight for each feature channel. Then, the feature re-calibration (scale) operation of channel-weight multiplication is performed on the W × H × C features input to the lightweight channel attention module, i.e., the features are re-calibrated by channel-wise weighting to obtain features of dimension W × H × C, which serve as the input of the deconvolution layer of the corresponding up-sampling layer.
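A minimal NumPy sketch of the squeeze, excitation, and scale operations. The random, untrained weight matrices stand in for the two fully connected layers, and the channels-last (W, H, C) layout follows the dimensions used in the text; all concrete sizes are illustrative assumptions:

```python
import numpy as np

def se_module(x, r=16, seed=0):
    """Squeeze-and-excitation sketch for an input of shape (W, H, C)."""
    rng = np.random.default_rng(seed)
    W, H, C = x.shape
    # Squeeze: global average pooling over the spatial axes -> (C,)
    z = x.mean(axis=(0, 1))
    # Excitation: FC(C -> C/r) + ReLU, then FC(C/r -> C) + sigmoid.
    # Untrained random weights here; a trained module would learn them.
    W1 = rng.standard_normal((C, C // r)) * 0.01
    W2 = rng.standard_normal((C // r, C)) * 0.01
    s = np.maximum(z @ W1, 0.0)                 # ReLU
    w = 1.0 / (1.0 + np.exp(-(s @ W2)))         # sigmoid channel weights, (C,)
    # Scale: channel-wise re-calibration; output keeps shape (W, H, C)
    return x * w

x = np.ones((8, 8, 32))
y = se_module(x, r=4)
print(y.shape)  # (8, 8, 32)
```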
Step 4: connect a fully connected layer after the last up-sampling layer of the U-net model to carry out the classification task on the brain magnetic resonance structural images.
A fully connected layer and a softmax function are linked after the last up-sampling layer of the U-net model obtained in steps 1-3, so that the feature vector obtained by the last convolutional layer is mapped to the output layer of the network, yielding the classification of the smoker's brain magnetic resonance image.
Step 5: in the classification experiment, 70% of the samples are used as the training set, 10% as the validation set, and 20% as the test set. In the U-net model training, the epoch parameter is set to 100, the batch size to 4, and the learning rate to 0.0001; a cross-entropy loss function and a stochastic gradient descent optimization algorithm are adopted. The cross-entropy loss function Loss can be expressed as Loss = -[y × log q + (1 - y) × log(1 - q)], where y denotes the label and q denotes the predicted probability corresponding to label y; by feeding the image feature information into the overall model, the model outputs the predicted probability for label y according to the feature weights it continuously updates and learns iteratively. In the present invention, label 0 denotes the normal group and label 1 the smoking group. The U-net model is trained with the objective of minimizing the cross-entropy loss function Loss to obtain optimal U-net model weight parameters.
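The binary cross-entropy loss above can be sketched directly; the probe probabilities 0.9 and 0.1 are illustrative:

```python
import math

def binary_cross_entropy(y, q):
    """Loss = -[y*log(q) + (1-y)*log(1-q)] for label y in {0, 1}, 0 < q < 1."""
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

# Label 0 = normal group, label 1 = smoking group.
print(round(binary_cross_entropy(1, 0.9), 4))  # confident and correct -> 0.1054
print(round(binary_cross_entropy(1, 0.1), 4))  # confident but wrong  -> 2.3026
```

Minimizing this quantity over the training set, averaged across samples, is the training objective described above.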
Through steps 1 to 4, a multi-head spatial attention module (such as a Transformer) is first superimposed after the third-level down-sampling layer of the U-net model; second, a lightweight channel attention module (such as Squeeze-and-Excitation) is superimposed before each skip connection and up-sampling layer of the U-net model; and then a fully connected layer is linked after the fourth-level up-sampling layer of the U-net model, thereby classifying smokers' brain magnetic resonance images, as shown in FIG. 1. In this way, the convolution limitation of the U-net model can be overcome and accurate classification of smokers' brain magnetic resonance images achieved.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications, additions, or substitutions may be made to the described embodiments by those skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (5)

1. A smoker brain magnetic resonance image classification method based on a contextual attention mechanism is characterized by comprising the following steps:
step 1, building a U-net model, wherein the U-net model comprises multiple down-sampling layers followed by multiple up-sampling layers, and each down-sampling layer is connected to the corresponding up-sampling layer by a skip connection;
step 2, superimposing a multi-head spatial attention module between the penultimate and last down-sampling layers of the U-net model;
step 3, superimposing a lightweight channel attention module before each skip connection of the U-net model and the corresponding up-sampling layer;
step 4, connecting a fully connected layer after the last up-sampling layer of the U-net model;
and step 5, inputting samples into the U-net model, wherein the samples comprise brain magnetic resonance images of the smoking group and of the normal group, and training the U-net model with the objective of minimizing the cross-entropy loss function Loss to obtain optimal U-net model weight parameters.
2. The method for classifying smokers' brain magnetic resonance images based on a contextual attention mechanism according to claim 1, wherein, in the contraction process, each down-sampling layer sequentially comprises two convolutional layers and one pooling layer, a ReLU activation function is added after each convolution of a down-sampling layer, and each up-sampling layer sequentially comprises a deconvolution layer and two convolutional layers.
3. The method for classifying smokers' brain magnetic resonance images based on a contextual attention mechanism according to claim 2, wherein the multi-head spatial attention module comprises a linear projection layer, a position encoding layer, n transformer layers, and a feature mapping layer arranged in sequence, wherein
the linear projection layer divides the input feature information into feature block vectors of equal size by a linear transformation of the feature dimensions, and converts the encoded feature block vectors into a linear embedding sequence that is input to the transformer layers;
the position encoding layer encodes the feature block vectors;
the n transformer layers perform feature extraction on the input linear embedding sequence and output the result to the feature mapping layer;
and the feature mapping layer converts the dimension of the output features of the nth transformer layer and transmits the dimension-converted features to the last down-sampling layer.
4. The method of claim 3, wherein the dimension of the output features of the penultimate down-sampling layer of the U-net model is (B, C_{m-1}, W_{m-1}, H_{m-1}), where C_{m-1}, W_{m-1}, and H_{m-1} respectively denote the number of channels, the feature width, and the feature height after the penultimate down-sampling of the U-net model, and B is the number of samples; the input dimension of a transformer layer is (B, num_token, token_dim), where num_token = (W_{m-1}/p) × (H_{m-1}/p) + 1 (one added classification dimension), token_dim = C_{m-1} × p × p, and p is the size of the feature block vectors after division by the linear projection layer;
the feature mapping layer converts the output features of the nth transformer layer from dimension (B, (W_{m-1}/p) × (H_{m-1}/p) + 1, C_{m-1} × p × p) to (B, C_{m-1}, W_{m-1}, H_{m-1}).
5. The method for classifying smokers' brain magnetic resonance images based on a contextual attention mechanism according to claim 4, wherein the input feature size of the lightweight channel attention module is W × H × C and the scaling parameter is r, where W, H, and C respectively denote the input feature width, height, and number of channels, and r is a constant greater than 0; in the lightweight channel attention module:
the W × H × C features are compressed, and a 1 × 1 × C global compressed feature is obtained through global average pooling;
the global compressed feature first passes through a fully connected layer containing C/r neurons and a ReLU activation function;
a normalized channel weight for each feature channel is then obtained through a fully connected layer containing C neurons and a sigmoid activation function;
and a feature re-calibration operation of channel-weight multiplication is performed on the W × H × C features input to the lightweight channel attention module, yielding features of dimension W × H × C that serve as the input of the deconvolution layer of the corresponding up-sampling layer.
CN202211561153.4A 2022-12-07 2022-12-07 Smoker brain magnetic resonance image classification method based on contextual attention mechanism Pending CN115761377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211561153.4A CN115761377A (en) 2022-12-07 2022-12-07 Smoker brain magnetic resonance image classification method based on contextual attention mechanism

Publications (1)

Publication Number Publication Date
CN115761377A true CN115761377A (en) 2023-03-07

Family

ID=85343831

Country Status (1)

Country Link
CN (1) CN115761377A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116421152A (en) * 2023-06-13 2023-07-14 Changchun University of Science and Technology Sleep stage result determining method, device, equipment and medium
CN116421152B (en) * 2023-06-13 2023-08-22 Changchun University of Science and Technology Sleep stage result determining method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN107610194B (en) Magnetic resonance image super-resolution reconstruction method based on multi-scale fusion CNN
CN110363215B (en) Method for converting SAR image into optical image based on generating type countermeasure network
CN109615582A (en) A kind of face image super-resolution reconstruction method generating confrontation network based on attribute description
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN110930421A (en) Segmentation method for CBCT (Cone Beam computed tomography) tooth image
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN109214989A (en) Single image super resolution ratio reconstruction method based on Orientation Features prediction priori
CN110930378B (en) Emphysema image processing method and system based on low data demand
CN115511767B (en) Self-supervised learning multi-modal image fusion method and application thereof
CN116739985A (en) Pulmonary CT image segmentation method based on transducer and convolutional neural network
CN113393469A (en) Medical image segmentation method and device based on cyclic residual convolutional neural network
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN115761377A (en) Smoker brain magnetic resonance image classification method based on contextual attention mechanism
CN113808075A (en) Two-stage tongue picture identification method based on deep learning
CN115661165A (en) Glioma fusion segmentation system and method based on attention enhancement coding and decoding network
CN115578262A (en) Polarization image super-resolution reconstruction method based on AFAN model
CN113256733B (en) Camera spectral sensitivity reconstruction method based on confidence voting convolutional neural network
CN114821050A (en) Named image segmentation method based on transformer
CN113744275B (en) Feature transformation-based three-dimensional CBCT tooth image segmentation method
CN112734638B (en) Remote sensing image super-resolution reconstruction method and device and storage medium
CN112712855B (en) Joint training-based clustering method for gene microarray containing deletion value
CN115273236A (en) Multi-mode human gait emotion recognition method
CN114862865A (en) Vessel segmentation method and system based on multi-view coronary angiography sequence image
CN115049739A (en) Binocular vision stereo matching method based on edge detection
CN111951177B (en) Infrared image detail enhancement method based on image super-resolution loss function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination