CN114140368B

CN114140368B - Multi-mode medical image synthesis method based on generation type countermeasure network

Info

Publication number: CN114140368B
Application number: CN202111465819.1A
Authority: CN
Inventors: 朱鹏飞; 刘家旭; 曹兵; 胡清华
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-12-03
Filing date: 2021-12-03
Publication date: 2024-04-23
Anticipated expiration: 2041-12-03
Also published as: CN114140368A

Abstract

The invention relates to a multi-mode medical image synthesis method based on a generation type countermeasure network. The method comprises the following steps: constructing a modal attention generation type countermeasure network; the modal attention-generating countermeasure network includes a self-representation network and an image conversion network implemented by the generating countermeasure network; training and testing the modal attention generation type countermeasure network by adopting a multi-modal medical image data set to generate a trained modal attention generation type countermeasure network; and inputting the existing multi-modal medical image into the trained modal attention generation type countermeasure network, and outputting a composite image of the target modality. The method is used for synthesizing the missing medical mode images, the quality of the synthesis of the multi-mode medical images can be improved, and the complete multi-mode medical images are beneficial to doctors to make more accurate decisions.

Description

Multi-mode medical image synthesis method based on generation type countermeasure network

Technical Field

The invention relates to the technical field of image synthesis, in particular to a multi-mode medical image synthesis method based on a generation type countermeasure network.

Background

With the diversification of data acquisition means, more and more fields are beginning to use multi-modal data. For example, in the field of precise medicine, brain nuclear magnetic resonance can obtain image data of four modes, namely T1 weighted imaging (T1), gd contrast agent imaging (T1 Gd), T2 weighted imaging (T2) and liquid attenuation inversion recovery sequence imaging (T2-FLAIR), and a doctor can make more precise diagnosis by combining medical images of multiple modes. The actual situation is limited by the factors of limited medical conditions, insufficient scanning time and cost limitation, and systematic errors, even partial modality deletion and the like may occur in the multi-modality data acquisition process, so that partial modality images are finally unavailable. These unavailable modality images may cause deviations in the physician decision.

How to synthesize unusable modalities with existing, complete modality image data has become an important topic of research in computer vision and image processing. Complementing the missing modalities in multi-modality image data can be seen as a special class of image conversion problems. Image transformation refers to the transformation of an image from one image domain to another, similar tasks include style migration, semantic segmentation, deblurring, super resolution, etc. Different image fields refer to various imaging differences between lighting conditions, facial expressions, sensors, etc. Researchers have invested significant effort in developing efficient image conversion algorithms and have made significant progress in methods based on generating a countermeasure network, such as Pix2Pix. However, these methods are applicable to single modality inputs only, and do not consider the valid information contained in other available modalities in a multi-modality scenario in terms of model inputs and network design. CollaGAN (Collaborative GENERATIVE ADVERSARIAL Networks) is one of a few image conversion models designed for multiple modes, dongwook Lee et al propose multiple-cycle consistency loss for the multiple mode scene on the basis of CycleGAN aiming at the multiple mode scene, so that the function of generating target mode images from any multiple mode input is realized. CollaGAN has the disadvantage that no modeling is directed to modality complementarity, which makes it impossible to make full use of the information of the input model, and thus the quality of the composite image is low when applied to multi-modality medical image synthesis.

Disclosure of Invention

The invention aims to provide a multi-modal medical image synthesis method based on a generation type countermeasure network, so as to improve the quality of multi-modal medical image synthesis.

In order to achieve the above object, the present invention provides the following solutions:

a method of multimodal medical image synthesis based on a generated countermeasure network, comprising:

Constructing a modal attention generation type countermeasure network; the modal attention-generating countermeasure network includes a self-representation network and an image conversion network implemented by the generating countermeasure network;

Training and testing the modal attention generation type countermeasure network by adopting a multi-modal medical image data set to generate a trained modal attention generation type countermeasure network;

and inputting the existing multi-modal medical image into the trained modal attention generation type countermeasure network, and outputting a composite image of the target modality.

Optionally, the constructing the modal attention generation type countermeasure network specifically includes:

Constructing a self-representing network; the self-presenting network includes an encoder and a decoder;

Constructing an image conversion network; the image conversion network comprises a generator and a discriminator; the generator includes a multi-encoder, a modal attention module, and a decoder.

Optionally, the method for implanting in the modal attention module specifically includes:

acquiring a plurality of feature maps corresponding to medical images of other modes except the target mode;

Calculating the cross-modal channel weight according to the feature maps;

supplementing the information of the other modes into each input mode according to the cross-mode channel weight so as to obtain a new feature diagram after the supplementary information of a plurality of modes;

and splicing the new feature graphs after the plurality of modal supplementary information to generate a spliced feature graph.

Optionally, the training and testing the modal attention generation type countermeasure network by using the multi-modal medical image dataset to generate a trained modal attention generation type countermeasure network specifically includes:

Inputting the multi-modal medical image dataset into an encoder of the self-representing network, training the self-representing network using a random gradient descent algorithm, generating a pre-trained self-representing network;

inputting medical images of a target mode into the pre-trained self-expression network, inputting medical images of other modes except the target mode and masks corresponding to the target mode into a multi-encoder of the image conversion network, training the image conversion network by using a random gradient descent algorithm, and monitoring the training of the image conversion network by the self-expression network in the training process to generate a trained image conversion network; the pre-trained self-representing network and the trained image conversion network together comprise the trained modal attention-generating countermeasure network.

Optionally, the inputting the existing multi-modal medical image into the trained modal attention generation type countermeasure network outputs a composite image of the target modality, specifically including:

and inputting the existing multi-modal medical image and the mask corresponding to the target modality into a multi-encoder of the trained image conversion network, and outputting a composite image of the target modality by a generator of the trained image conversion network.

A multi-modal medical image composition system based on a generative countermeasure network, comprising:

the modal attention generation type antagonism network construction module is used for constructing a modal attention generation type antagonism network; the modal attention-generating countermeasure network includes a self-representation network and an image conversion network implemented by the generating countermeasure network;

the system comprises a modal attention generation type countermeasure network training module, a modal attention generation type countermeasure network training module and a modal attention generation type countermeasure network training module, wherein the modal attention generation type countermeasure network training module is used for training and testing the modal attention generation type countermeasure network by adopting a multi-modal medical image data set to generate a trained modal attention generation type countermeasure network;

and the multi-modal image synthesis module is used for inputting the existing multi-modal medical images into the trained modal attention generation type countermeasure network and outputting the synthesized images of the target modalities.

Optionally, the modal attention generation type countering network construction module specifically includes:

A self-expression network construction unit for constructing a self-expression network; the self-presenting network includes an encoder and a decoder;

an image conversion network construction unit for constructing an image conversion network; the image conversion network comprises a generator and a discriminator; the generator includes a multi-encoder, a modal attention module, and a decoder.

Optionally, the modal attention module specifically includes:

The characteristic map acquisition unit is used for acquiring a plurality of characteristic maps corresponding to medical images of other modes except the target mode;

A channel weight calculation unit, configured to calculate a cross-modal channel weight according to the plurality of feature graphs;

the modal information complementation unit is used for supplementing the information of the other modalities into each input modality according to the cross-modal channel weight so as to obtain a new feature map after the plurality of modal supplementary information;

And the characteristic diagram splicing unit is used for splicing the new characteristic diagrams after the plurality of modal supplementary information to generate spliced characteristic diagrams.

Optionally, the modal attention generation type countermeasure network training module specifically includes:

A self-expression network training unit, configured to input the multi-modal medical image dataset into an encoder of the self-expression network, train the self-expression network using a random gradient descent algorithm, and generate a pre-trained self-expression network;

an image conversion network training unit, configured to input a medical image of a target modality into the pre-trained self-expression network, input medical images of other modalities except the target modality and masks corresponding to the target modality into a multi-encoder of the image conversion network, train the image conversion network using a random gradient descent algorithm, and monitor training of the image conversion network by the self-expression network during training, so as to generate a trained image conversion network; the pre-trained self-representing network and the trained image conversion network together comprise the trained modal attention-generating countermeasure network.

Optionally, the multi-mode image synthesis module specifically includes:

and the multi-mode image synthesis unit is used for inputting the masks corresponding to the existing multi-mode medical images and the target modes into a multi-encoder of the trained image conversion network, and outputting the synthesized image of the target modes by a generator of the trained image conversion network.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention provides a multi-mode medical image synthesis method based on a generated type countermeasure network, which comprises the following steps: constructing a modal attention generation type countermeasure network; the modal attention-generating countermeasure network includes a self-representation network and an image conversion network implemented by the generating countermeasure network; training and testing the modal attention generation type countermeasure network by adopting a multi-modal medical image data set to generate a trained modal attention generation type countermeasure network; and inputting the existing multi-modal medical image into the trained modal attention generation type countermeasure network, and outputting a composite image of the target modality. The method is used for synthesizing the missing medical mode images, the quality of the synthesis of the multi-mode medical images can be improved, and the complete multi-mode medical images are beneficial to doctors to make more accurate decisions.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for synthesizing a multimodal medical image based on a generated challenge network in accordance with the present invention;

FIG. 2 is a schematic diagram of a modal attention-generating type countermeasure network according to the present invention;

FIG. 3 is a schematic diagram of a training process of a modal attention generating type countermeasure network provided by the present invention;

FIG. 4 is a schematic diagram of one embodiment of a modal attention-generating countermeasure network provided by the present invention;

FIG. 5 is a schematic diagram illustrating the complementation of model information in an embodiment of a modal attention generation type countermeasure network according to the present invention;

Fig. 6 is a block diagram of a multi-modal medical image synthesis system based on a generated countermeasure network according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Fig. 1 is a flowchart of a method for synthesizing a multi-modal medical image based on a generated countermeasure network according to the present invention. As shown in fig. 1, the method for synthesizing a multi-modal medical image based on a generated type countermeasure network of the present invention includes:

step 101: a modal attention generating type countermeasure network is constructed.

A model of a Modal Attention GENERATIVE ADVERSARIAL Networks (MAGAN) proposed by the present invention is constructed, and the model MAGAN includes a self-expression network and an image conversion network (also referred to as a conversion network in the present invention) implemented by the generated type of network, where the image conversion network includes a Modal Attention module proposed by the present invention.

Fig. 2 is a schematic diagram of a modal attention generation type countermeasure network according to the present invention. Referring to fig. 2, the modal attention generation type countermeasure network MAGAN includes a conversion network and a self-expression network, and the conversion network incorporates the modal attention module proposed by the present invention to better utilize complementary information between modalities. The structure of the self-presenting network and the switching network is shown in fig. 2. Wherein the self-expression network is composed of an encoder and a decoder; the conversion network consists of a generator and a discriminator, wherein the generator consists of n-1 encoders, a modal attention module and a decoder, and the encoders and the decoder are connected through the modal attention module provided by the invention.

Specifically, the step 101 constructs a modal attention generation type countermeasure network, including:

Step 1.1: constructing a self-representing network; the self-presenting network includes an encoder and a decoder.

The self-expression network is based on a self-encoder implementation and consists of an encoder and a decoder.

Step 1.2: constructing an image conversion network; the image conversion network comprises a generator and a discriminator; the generator includes a multi-encoder, a modal attention module, and a decoder.

A switching network as shown in fig. 2 is constructed. The transformation network is based on Pix2Pix (image transformation classical model based on deep learning, known to the person skilled in the art) implementation, and is consistent with other generation-antagonism networks, the transformation network being composed of a generator G and a arbiter D. In order to adapt to the input of a plurality of modes, the single encoder structure of the generator G is changed into a multi-encoder structure shown on the left side of fig. 2, the number of encoders is n-1, and n is the total mode number of a specific application field. Furthermore, in order to fully mine the correlation among the modes, the invention adds the mode attention module in the characteristic fusion stage of the generator, and the mode attention module can automatically fuse the information from a plurality of input modes according to the information required by the target mode, thereby improving the quality of the synthesized mode image. The modal attention module is connected with a multi-encoder branch and a decoder, and the multi-encoder, the modal attention module and the decoder together form a generator of the conversion network.

The model MAGAN in the present invention is composed of a self-presenting network and a conversion network, and the improvement to the model is focused on the implementation of the conversion network. The conversion network is realized based on Pix2Pix, and the improvement is divided into two aspects, namely, in order to use a plurality of inputs, the input of the network is changed into a multi-encoder structure from an original single encoder; secondly, a mode attention module originally created by the method is added, and feature fusion is carried out on multi-branch extraction features aiming at a target mode. The principles and implementation of the modal attention module will be described in detail below.

Each channel in the feature map can be seen as a pattern extracted from the input by the deep neural network. The information contained in each input modality is different for any channel. Taking the medical image as an example, if the T-th channel is a texture feature for a tumor, it is apparent that T1Gd contains the most abundant and valuable information; t2 is the most important modality if the channel is a feature with respect to cerebrospinal fluid. One natural idea is to add tumor-related information in T1Gd to the T2 modality and cerebrospinal fluid-related information in T2 to the T1Gd modality. This is in effect the complementarity between the utilized modalities.

It is worth noting that it is necessary to know which modality contains the most rich information, or more attention to which modality, for a given channel or mode before supplementing other modalities. In the following description, the problem is formed as medical image data of n modalities altogether, with the objective of providing n-1 images belonging to n-1 modalities respectively, and synthesizing an image of the nth modality (i.e., the target modality).

The mode attention module is realized by adopting an algorithm, and the algorithm implanted in the mode attention module specifically comprises the following steps:

Step 1.2.1: and acquiring a plurality of feature maps corresponding to medical images of other modes except the target mode.

Referring to fig. 2, without loss of generality, medical images from the n modalities described above are denoted by x ₁,x₂,...,x_n, respectively, where medical images from n-1 input modalities (i.e., other modalities than the target modality) are denoted by x ₁,x₂,...,x_n-1, and medical images from n modalities, i.e., the target modality, are denoted by x _n.

As shown in fig. 2, the masks corresponding to x ₁,x₂,...,x_n-1 and the target mode are respectively input to n-1 encoders of the conversion network, so as to obtain feature images f ₁,f₂,...,f_n-1, that is, a plurality of feature images f ₁,f₂,...,f_n-1 corresponding to medical images of modes other than the target mode.

Step 1.2.2: and calculating the cross-modal channel weight according to the feature maps.

For a set of samples { x ₁,x₁,...,x_n } from n modalities, equation (1) is used to calculate the cross-modality channel weights (denoted by a):

A＝S₂(σ((δ([f₁,f₂,...,f_n-1]W₁)W₂) (1)

here, f _k＝E_k(x_k) represents a feature map corresponding to the medical image x _k of the kth modality other than the target modality, k E (1, 2,..n-1), E _k (·) represents the output of the kth encoding branch E _k of the generator G. The notation [. Cndot. ] indicates that tensors are stitched, e.g., [ f ₁,f₂,...,f_n-1 ] indicates that the n-1 tensors f ₁,f₂,...,f_n-1 are flattened and then stitched. W ₁ and W ₂ are two fully connected layers. Sigma and delta are the ReLU and sigmoid activation functions, respectively (two activation functions in the deep learning field, well known to those skilled in the art). S ₂ is a function of softmax operations for each row of the weight matrix a. From the result, element a _i,j is the attention of the ith channel to the jth modality, in other words, the weight of the jth modality when supplementing the ith channel with information.

Step 1.2.3: and supplementing the information of the other modes into each input mode according to the cross-mode channel weight so as to obtain a new feature map after the supplementary information of a plurality of modes.

Next, supplementing information of other modes into each input mode by using a cross-mode channel weight matrix a according to a formula (2):

Where a _j represents the multiplication between the scalar and the feature map or the channel multiplication between the vector and the feature map, and a is the j-th column of the channel weight matrix a. The information of the current mode can be reserved by setting balance parameters gamma while the information is supplemented, and the value of gamma defaults to 0.5. Representing the new feature map after the kth modality supplemental information.

Step 1.2.4: and splicing the new feature graphs after the plurality of modal supplementary information to generate a spliced feature graph.

New feature map obtained by supplementing multiple modes according to formula (3)Splicing to generate a spliced characteristic diagram f _all:

Aiming at the problem that Pix2Pix is only suitable for single-mode input, the invention modifies the network structure and adds the supervision of the self-expression network; aiming at the problem that CollaGAN does not fully utilize the mode complementarity, a mode attention module is provided; these two improvements improve the quality of multi-modality medical image synthesis.

Step 102: training and testing the modal attention generation type countermeasure network by adopting the multi-modal medical image data set, and generating the trained modal attention generation type countermeasure network.

Fig. 3 is a schematic diagram of a training process of a modal attention generation type countermeasure network provided by the present invention. Referring to fig. 3, the image synthesis network is trained and tested using a multi-modality medical image dataset, and the training of the network supervision switching network is self-represented during the training process. The multi-modal medical image dataset used in the training of the present invention is medical image dataset BraTS2020, which is a public dataset provided by an international brain tumor segmentation challenge (Brain Tumor Segmentation, abbreviated as BraTS), the training set contains 369 sets of samples, and the test set contains 125 sets of samples. Each set of samples contained four modalities of brain tumor nmr scans of T1, T1Gd, T2-FLAIR.

Training the self-expression network by using target modal data in the training set to obtain a pre-trained self-expression network SR; training the conversion network by using all data of the training set, and extracting the characteristics of the SR supervision generator of the self-expression network in the process to obtain the conversion network after training.

The user can synthesize the image of any target mode by using the trained conversion network, namely, inputting medical images respectively belonging to n-1 modes and masks corresponding to the target modes to the conversion network, and outputting the synthesized image of the target mode by the conversion network.

As mentioned previously, the present invention introduces a self-representing network (implemented with a self-encoder) to drive the training of the switching network. The self-encoder is an unsupervised feature extraction model with strong expression capacity in the field of deep learning, and the difficulty of the multi-branch feature extraction structure of the conversion network in extracting effective features is considered to be larger than that of the original single branch, so that the pre-trained self-expression network SR is used for driving the feature extraction of the generator. Specifically, the following losses are calculated during the training phase:

In equation (4) we assume that the nth modality is the target modality, E ^SR (·) represents the output of the self-representative network encoder.

In addition, the total loss of the model conversion network generator is:

Wherein the method comprises the steps of And L _GAN are loss functions already defined in Pix2Pix, which are common knowledge in the art and are not described in detail herein. Lambda is the hyper-parameter of balance L _SR, defaulting to 1. The calculation formula for self-representing network loss is/>I.e. the L1 norm between the generated image and the original. Equations (4) and (5) are modal attention generating type against the constraints of the training phase of the network, with the training goal being to minimize loss from phase to phase.

Specifically, the step 102 trains and tests the modal attention generation type countermeasure network by using the multi-modal medical image dataset, and generates a trained modal attention generation type countermeasure network, which specifically includes:

Step 2.1: inputting the multi-modal medical image dataset into an encoder of the self-representing network, training the self-representing network using a random gradient descent algorithm, generating a pre-trained self-representing network.

All images are taken from the training set and input into the encoder of the self-representative network, which is trained using SGD (stationary GRADIENT DESCENT, random gradient descent), resulting in a trained self-representative network SR, hereinafter referred to as pre-trained SR.

Specifically, referring to fig. 3, in the embodiment of the present invention, a training set in BraTS2020 is used to train the self-expression network SR, where the parameter of SR is W _SR, and a specific training procedure of SR is as follows:

Step 1): inputting iteration times T, randomly initializing a parameter W _SR, and initializing the current iteration times iter;

Step 2): traversing all images of the training set, and repeating the step 3 and the step 4 until the iter reaches T;

step 3): inputting the current images I to SR by the formula Calculating loss;

Step 4): w _SR is updated using an SGD (random gradient descent) algorithm, updating iter=iter+1.

Step 2.2: inputting medical images of a target mode into the pre-trained self-expression network, inputting medical images of other modes except the target mode and masks corresponding to the target mode into a multi-encoder of the image conversion network, training the image conversion network by using a random gradient descent algorithm, and monitoring the training of the image conversion network by the self-expression network in the training process to generate a trained image conversion network; the pre-trained self-representing network and the trained image conversion network together comprise the trained modal attention-generating countermeasure network.

Referring to fig. 3, after the training of the self-presenting network SR is completed, the switching network is trained next. The above-mentioned requirement is that a mask corresponding to the target modality is added to the network input so that the model can determine which modality of image to synthesize. The mask adopts one-hot coding, taking BraTS datasets as an example, and defining that T1, T1Gd, T2 and T2-FLAIR are respectively 1 st to 4 th modes, then the mask corresponding to the T1 mode is tensor of dimension 4 x h x w (h and w are respectively width and height of an image), the mask can be regarded as being formed by splicing matrices of 4 layers h x w, then the value of the matrix of the 1 st layer is set to be 1, the rest values are set to be 0, namely the mask corresponding to the T1 mode, and then the mask corresponding to the T1Gd, the T2 and the T2-FLAIR is obtained by analogy. In training and using a network conversion network, a mask corresponding to a target modality needs to be stitched behind each input modality. The training process of the switching network is as follows:

Step 1): inputting iteration times T, parameters W _SR of a pre-trained SR, randomly initializing a generator parameter W _G and a discriminator parameter W _D, and initializing a current iteration times item;

step 2): repeating the step3 and the step 4 until the iteration times reach T;

Step 3): because MAGAN provided by the invention can receive the input of any mode, each mode can be used as a target mode in sequence, other modes are used as input modes to synthesize images, the loss of a generator is calculated by using a formula L _G, and the loss of a discriminator is calculated by using a formula L _GAN;

Step 4): w _G and W _D are alternately updated using an SGD (random gradient descent) algorithm, with iter=iter+1 updated.

More specifically, referring to fig. 2 and 3, the detailed training steps of the image conversion network are as follows:

Step (1): the medical images from the n-1 input modalities are denoted by x ₁,x₂,...,x_n-1 and the medical images from the n modalities, i.e. the target modality, are denoted by x _n. As shown in fig. 2, the masks corresponding to x ₁,x₂,...,x_n-1 and the target modes are respectively input into n-1 encoders to respectively obtain feature maps f ₁,f₂,...,f_n-1;

Step (2): inputting x _n into a pre-trained encoder of the self-expression network to obtain a feature map E ^SR(x_n), and calculating the difference between E ^SR(x_n) and f ₁,f₂,...,f_n-1 by adopting a formula (4) to obtain a part L _SR of the generator loss, thereby completing supervision of feature extraction by the self-expression network;

Step (3): inputting the f ₁,f₂,...,f_n-1 into a modal attention module to perform modal information complementation to obtain a spliced characteristic diagram f _all;

Step (4): inputting f _all into a decoder of a conversion network to obtain a synthetic image of a target mode The function of the generator is to generate/>, as close as possible to the real x _n

Step (5): will beInput to a arbiter, which essentially is a classifier, with the goal of distinguishing as much as possible between x _n and/>

Step (6): repeating the steps (1) to (4), and alternately optimizing the generator and the discriminator by using an SGD algorithm until the appointed iteration times T are reached, so as to finally obtain the trained image conversion network.

Step 103: and inputting the existing multi-modal medical image into the trained modal attention generation type countermeasure network, and outputting a composite image of the target modality.

In the using stage of the model, the user can synthesize an image of any target mode by using the conversion network after training is completed, and the method is similar to the model training: the medical images belonging to n-1 modes and the masks corresponding to the target modes are input into n-1 encoders of a conversion network generator G, and the G outputs the synthesized images of the target modes. For example, in fig. 2, x ₁,x₂,...,x_n-1 is input to n-1 encoders of the generator, respectively, and the output of the generator is a composite image of the nth mode (i.e., the target mode)

Thus, the step 103 inputs the existing multi-modal medical image into the trained modal attention generation type countermeasure network, and outputs a composite image of the target modality, specifically including:

The invention provides an image synthesis framework MAGAN in a multi-mode scene: the user inputs images of a plurality of modes and a target mode corresponding mask, and the model MAGAN automatically synthesizes medical images corresponding to the target mode; the method introduces a modal attention module, and effectively improves the quality of the synthesized image by modeling the correlation among the modalities; the present invention proposes a method for extracting features related to a target modality using a self-representative network constraint model, which allows MAGAN (modality attention generating type countermeasure network) to extract only features related to the target modality from the input. In summary, the multi-modal medical image synthesis model MAGAN based on the generated countermeasure network can receive any modal image as input and synthesize a target modal image, is mainly used for synthesizing a missing modal image in the multi-modal medical image, can improve the quality of synthesizing the missing medical modal image, and is beneficial to a doctor to make more accurate decisions.

An embodiment of a method for synthesizing a multi-modal medical image based on a generated countermeasure network according to the present invention is given below, and fig. 4 is a schematic diagram of an embodiment of a modal attention generated countermeasure network according to the present invention, where n=4 is assumed. Referring to fig. 4, a specific embodiment of the method comprises the steps of:

Step one: the network model shown in fig. 4 is constructed, and the network model comprises a conversion network and a self-expression network, wherein the conversion network is added with the modal attention module provided by the invention so as to better utilize the complementary information among modalities. The structure of the self-presenting network and the switching network is shown in fig. 4: the self-expression network consists of an encoder and a decoder; the conversion network consists of a generator and a arbiter, wherein the generator consists of n-1 encoders and one decoder, the encoders and the decoders being connected by the modal attention module presented herein.

Step two: training the self-expression network by using target modal data in the training set to obtain a pre-trained self-expression network; training the conversion network by using all data of the training set, and extracting the characteristics of the SR supervision generator of the self-expression network in the process to obtain the conversion network for completing training.

As shown in fig. 4, for a set of samples from BraTS2020, the medical images of the 4 modes are denoted by x ₁,x₂,x₃ and x ₄ respectively, in fig. 4 x ₁,x₂,x₃ is input to 3 encoders of the generator (we assume n=4, and hence n-1=3 here) respectively, and the output of the generator is a composite image of the 4 th mode

In this particular embodiment, the training process of the switching network is as follows:

1. Respectively inputting masks corresponding to the x ₁,x₂,x₃ and the target modes into 3 encoders to respectively obtain a feature map f ₁,f₂,f₃;

2. Inputting the true x ₄ into a pre-trained decoder of the self-expression network to obtain a feature map E ^SR(x₄), and calculating the difference between E ^SR(x₄) and f ₁,f₂,f₃ by adopting a formula (4) to obtain a part L _SR of the generator loss, thereby completing supervision of feature extraction by the self-expression network;

3. Inputting f ₁,f₂,f₃ into a modal attention module, and carrying out modal information complementation to obtain f _all; the method comprises the following steps:

Using equation (1) to calculate the cross-modal channel weight (denoted as a):

A＝S₂(σ((δ([f₁,f₂,f₃]W₁)W₂)

Information of other modalities is next supplemented into each input modality using a. Fig. 5 is a schematic diagram illustrating complementation of model information in an embodiment of a modal attention generation type countermeasure network according to the present invention. In fig. 5, the 3 rd channel of the 1 st modality is supplemented with information of other modalities, the 2 nd modality supplements the 3 rd channel of the 1 st modality with weight a _3,2, and the 3 rd modality supplements the 3 rd channel of the 1 st modality with weight a _3,3. In this way, a new feature f _k ^comp after the kth modality supplemental information can be obtained:

Here, f _k＝E_k(x_k),E_k (·) represents the output of generator G kth encoding branch E _k, [ ·, ·, · ] represents the concatenation after flattening the 3 tensors. W ₁ and W ₂ are two fully connected layers. Sigma and delta are the ReLU and sigmoid activation functions, respectively (two activation functions in the deep learning field, well known to those skilled in the art). S ₂ is a function of softmax operation on each row of the weight matrix a, representing multiplication between scalar and feature map or channel multiplication between vector and feature map, a _j is the j-th column of the weight matrix a. The information of the current mode can be reserved by setting balance parameters gamma while the information is supplemented, and the value of gamma defaults to 0.5.

Finally, since the decoder is of a single-branch structure, the mode attention module needs to splice the feature images and send the feature images into the decoder, and the output of the mode attention module is as follows:

f_all＝[f₁ ^comp,f₂ ^comp,f₃ ^comp]

4. Inputting f _all into a decoder of a conversion network to obtain a synthetic image of a target mode The function of the generator is to generate/>, as close as possible to x ₄

5. Will beInput to a arbiter, which essentially is a classifier, with the goal of distinguishing as much as possible between x ₄ and/>

6. Repeating the steps 1, 4, and alternately optimizing the generator and the discriminator by using the SGD algorithm until the appointed iteration times are reached, and finally obtaining the trained conversion network.

Step three: the user can synthesize the image of any target mode by using the conversion network obtained in the last step, specifically, the image belonging to n-1 modes and the mask corresponding to the target mode are input to the conversion network, and the conversion network outputs the synthesized image of the target mode.

For example, x ₁,x₂,x₃ is input to 3 encoders of a generator, and the output of the generator is a composite image of the 4 th mode

In summary, according to the multi-mode medical image synthesis method based on the generation type countermeasure network, through a single model, any mode image can be used as input, and a target mode image (generally a mode with a missing or poor quality) can be synthesized. The method can solve the problem of missing of a certain mode image in the multi-mode medical image scene.

The experimental results of the embodiment of the present invention on BraTS's 2020 validation set are shown in table 1 and table 2, table 1 compares the scores of MAGAN and Pix2Pix proposed by the present invention on the structural similarity (structural similarity index, SSIM), and table 2 compares the scores of MAGAN and Pix2Pix proposed by the present invention on the feature similarity (feature similarity index, FSIM) (SSIM and FSIM are both methods for measuring image similarity, which are well known to those skilled in the art).

Table 1 SSIM score comparison

Table 2 FSIM score comparison

As can be seen from the comparison of the data in tables 1 and 2, the structural similarity and the characteristic similarity of the invention MAGAN are greatly improved in the synthetic scenes of various modes.

Based on the method for synthesizing the multi-modal medical image based on the generated type countermeasure network, the invention also provides a system for synthesizing the multi-modal medical image based on the generated type countermeasure network. Fig. 6 is a block diagram of a multi-modal medical image synthesis system based on a generated countermeasure network according to the present invention, and referring to fig. 6, the system includes:

a modal attention generation type antagonism network construction module 601 for constructing a modal attention generation type antagonism network; the modal attention-generating countermeasure network includes a self-representation network and an image conversion network implemented by the generating countermeasure network;

The modal attention generation type countermeasure network training module 602 is configured to train and test the modal attention generation type countermeasure network by using a multi-modal medical image dataset, and generate a trained modal attention generation type countermeasure network;

The multi-mode image synthesis module 603 is configured to input an existing multi-mode medical image into the trained modal attention generation type countermeasure network, and output a synthesized image of the target modality.

The modal attention generation type countering network construction module 601 specifically includes:

The modal attention module specifically includes:

The modal attention generation type countermeasure network training module 602 specifically includes:

The multi-modal image composition module 603 specifically includes:

The invention provides a multi-modal medical image synthesis method and a system based on a generation type countermeasure network, which provide a multi-modal medical image synthesis model based on the generation type countermeasure network, and the quality of synthesized modal images can be improved by improving the structure of the existing image synthesis model so as to adapt to multi-modal input and fully utilize information of multiple modes; the invention provides a mode attention module, which supplements information required by generating a target mode for each input mode according to the complementarity among modes, solves the problem that the existing model CollaGAN does not model the correlation of the input modes, and effectively improves the quality of synthesized mode images; the invention also introduces the self-expression network to guide the training of the generator, and filters out irrelevant information in the characteristic extraction stage, thereby accelerating the convergence rate of the model and further improving the quality of the synthesized modal image.

In summary, the method and the system of the invention can adapt to the image synthesis network input by multiple modes, add the mode attention module, introduce the self-expression network to guide the training of the generator, and finally achieve the purpose of synthesizing the high-quality medical mode image with missing.

In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.

The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A method of multi-modal medical image synthesis based on a generated countermeasure network, comprising:

the construction modality attention generation type countermeasure network specifically includes:

Constructing an image conversion network; the image conversion network comprises a generator and a discriminator; the generator includes a multi-encoder, a modal attention module, and a decoder;

the method for implanting the modal attention module specifically comprises the following steps:

Calculating the cross-modal channel weight according to the feature maps;

Splicing the new feature graphs after the plurality of modal supplementary information to generate a spliced feature graph;

2. The method according to claim 1, wherein the training and testing the modal attention-generating countermeasure network with the multimodal medical image dataset to generate a trained modal attention-generating countermeasure network, specifically comprises:

3. The method according to claim 2, wherein said inputting an existing multi-modal medical image into the trained modal attention generating countermeasure network outputs a composite image of a target modality, comprising in particular:

4. A multi-modal medical image composition system based on a generative countermeasure network, comprising:

The modal attention generation type antagonism network construction module specifically comprises:

An image conversion network construction unit for constructing an image conversion network; the image conversion network comprises a generator and a discriminator; the generator includes a multi-encoder, a modal attention module, and a decoder;

the modal attention module specifically includes:

The feature map splicing unit is used for splicing the new feature maps after the plurality of modal supplementary information to generate a spliced feature map;

5. The system of claim 4, wherein the modal attention generation type countermeasure network training module specifically comprises:

6. The system of claim 5, wherein the multi-modality image synthesis module specifically comprises: