WO2022120762A1 - Multi-modal medical image generation method and apparatus - Google Patents


Info

Publication number
WO2022120762A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
network
generator
sampling
training
Prior art date
Application number
PCT/CN2020/135439
Other languages
French (fr)
Chinese (zh)
Inventor
蒋昌辉
胡战利
梁栋
张其阳
洪序达
郑海荣
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Priority to PCT/CN2020/135439
Publication of WO2022120762A1

Classifications

    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis

Definitions

  • the embodiments of the present application relate to the field of medical imaging, in particular to medical image modalities, and more particularly to a method and apparatus for generating multimodal medical images.
  • X-ray computed tomography is a technique that exploits the interaction between X-rays and matter: a slice of a certain thickness of the region to be examined is scanned, a detector receives the X-rays passing through the slice and converts them into electrical signals, an analog-to-digital converter turns these into digital projection data, and a computer reconstructs an image of the object's internal structure.
  • Magnetic resonance (MR) imaging is based on the principle of nuclear magnetic resonance: the energy released by nuclei attenuates differently in different structural environments within a material, so by applying gradient magnetic fields and detecting the emitted electromagnetic waves, the position and type of the nuclei can be determined and used to map the internal structure of the object.
  • Positron emission tomography (PET) imaging is a diagnostic tool for tumor diseases. Substances essential to biological metabolism, such as glucose, proteins, nucleic acids, and fatty acids, are labeled with short half-life radionuclides (such as 18F or 11C) and injected into the human body. Because tumor cells have a vigorous metabolism, these substances accumulate at tumor sites; by detecting and imaging the photons emitted by the radionuclide, the tumor can be localized and the lesion diagnosed and analyzed.
  • PET Positron emission tomography
  • PET images, as shown in Figure 1(a), can provide diagnostic information about a tumor, but they lack the anatomical structure of the tumor and its surrounding tissue; this additional anatomy must be provided by CT or MR images, as shown in Figure 1(b).
  • in current clinical examinations, each medical imaging modality requires its own device; the images are then gathered and passed to the physician, who makes a comprehensive diagnosis from the multiple modalities, which consumes considerable manpower and material resources.
  • obtaining PET, CT, and MR images requires three kinds of equipment, so the cost of purchasing equipment is very high for the operating institution; for patients, these examinations are also expensive and time-consuming.
  • the present invention overcomes these shortcomings by converting one modality of a medical image to generate another, realizing mutual conversion and generation among computed tomography (CT), magnetic resonance (MR), and positron emission tomography (PET) images. Only one of the three modalities needs to be scanned; the other two can be generated. This effectively reduces the procurement burden on equipment operators while saving patients examination time and expense.
  • CT computed tomography
  • MR nuclear magnetic resonance
  • PET positron emission tomography
  • the embodiments of the present application propose a method and apparatus for generating a multimodal medical image.
  • an embodiment of the present application provides a method for generating a multimodal medical image, including:
  • inputting a first modality image and a target second modality image into a training network for supervised learning, stopping when iterative training reaches the stopping condition, and obtaining a generator;
  • inputting a first modality image into the pre-trained generator and converting it into a second modality image, the modality of which differs from that of the first;
  • outputting the second modality image.
  • the generator network of the generator may adopt a convolutional neural network, a residual network, or a generative adversarial network.
  • the generator network adopts a generative adversarial network, and the network structure of the generator includes a generator and a discriminator;
  • the generator is used to generate the target modality image
  • the discriminator is used to judge whether the generated image meets the requirements; once the requirement for stopping training is met, training stops.
  • the network structure of the generator can use transfer learning technology to speed up network training by loading a pre-trained network.
  • the generator is used to implement multi-level down-sampling of the input image, and then corresponds to multi-level up-sampling, and multiple residual network connections are used between down-sampling and up-sampling.
  • the network combining the down-sampling and the up-sampling includes, but is not limited to: both down-sampling and up-sampling using conventional convolutional layers or residual convolutional layers, with relu, leaky_relu, tanh, or sigmoid as the activation function of the convolutional layers.
  • the part of the combination of down-sampling and up-sampling may be directly connected, may be connected by a fully connected layer, or may be connected by one or more residual networks.
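The multi-level down-sampling/up-sampling scheme with residual connections described above can be sketched in a minimal, framework-free form. This is an illustrative NumPy toy, not the patent's actual convolutional network: max pooling stands in for a down-sampling level, nearest-neighbour repetition for an up-sampling level, and element-wise addition for the residual path; all function names are our own.

```python
import numpy as np

def downsample(x):
    """One down-sampling level: 2x2 max pooling halves each spatial dimension."""
    m, n = x.shape
    return x[:m - m % 2, :n - n % 2].reshape(m // 2, 2, n // 2, 2).max(axis=(1, 3))

def upsample(x):
    """One up-sampling level: nearest-neighbour repetition doubles each dimension."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def generator_pass(img, levels=2):
    """Multi-level down-sampling, then mirrored multi-level up-sampling,
    with a residual (skip) addition joining each corresponding level."""
    skips, x = [], img
    for _ in range(levels):
        skips.append(x)          # remember the feature map entering this level
        x = downsample(x)
    for _ in range(levels):
        x = upsample(x)
        x = x + skips.pop()      # residual connection across the two paths
    return x

img = np.arange(64, dtype=float).reshape(8, 8)
out = generator_pass(img)
print(out.shape)  # (8, 8): output spatial size matches the input
```

Because each up-sampling level exactly undoes the size change of its paired down-sampling level, the skip additions line up and the output keeps the input's spatial size, which is what the A/B network pairing in the figures requires.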
  • the embodiments of the present application also provide a multimodal medical image generation device, including:
  • the training device is used for inputting the first modal image and the target second modal image into the training network to realize supervised learning, stop the training after the iterative training reaches the stopping condition, and obtain the generator;
  • an input device for inputting a first modality image into a pre-trained generator and converting it into a second modality image, the modality of the second image differing from that of the first;
  • an output device for outputting the second modal image.
  • the generator network of the generator may adopt a convolutional neural network, a residual network, or a generative adversarial network.
  • the network structure of the generator includes a generator and a discriminator
  • the generator is used to generate the target modality image
  • the discriminator is used to judge whether the generated image meets the requirements; once the requirement for stopping training is met, training stops.
  • the network structure of the generator can use transfer learning technology to speed up network training by loading a pre-trained network.
  • the generator is used to implement multi-level down-sampling of the input image, and then corresponds to multi-level up-sampling, and multiple residual network connections are used between down-sampling and up-sampling.
  • the network combining the down-sampling and the up-sampling includes, but is not limited to: both down-sampling and up-sampling using conventional convolutional layers or residual convolutional layers, with relu, leaky_relu, tanh, or sigmoid as the activation function of the convolutional layers.
  • the part of the combination of down-sampling and up-sampling may be directly connected, may be connected by a fully connected layer, or may be connected by one or more residual networks.
  • an electronic device, including:
  • one or more processors;
  • a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the above multimodal medical image generation embodiments.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method of any of the foregoing multimodal medical image generation embodiments.
  • the invention overcomes the shortcomings of the prior art by converting one modality of a medical image to generate another, realizing mutual conversion and generation among computed tomography (CT), magnetic resonance (MR), and positron emission tomography (PET) images.
  • Fig. 3 is the overall flowchart of generating an MR image from a PET image in the present application.
  • Fig. 4 is a schematic diagram of the generator network structure of the present application.
  • Fig. 5 is a schematic diagram of the discriminator network structure of the present application.
  • Fig. 6 is a schematic diagram of the generator residual network structure of the present application.
  • Fig. 7 is a schematic diagram of the structure of convolutional network A in the generator of the present application.
  • Fig. 8 is a schematic diagram of the structure of convolutional network B in the generator of the present application.
  • Fig. 9 is a comparison of an MR image generated from a PET image according to an embodiment of the present application.
  • Fig. 10 is a schematic diagram of the structure of the multimodal medical image generation apparatus of the present application.
  • FIG. 2 shows a flowchart of one embodiment of a multimodal medical image generation method according to the present application.
  • the multimodal medical image generation method includes the following steps:
  • Step 201: Input the first modality image and the target second modality image into the training network for supervised learning; stop when iterative training reaches the stopping condition, obtaining a generator;
  • Step 202: Input the first modality image into the pre-trained generator and convert it into a second modality image, the modality of which differs from that of the first modality image;
  • Step 203: Output the second modality image.
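Steps 201 to 203 amount to: train until a stopping condition is met, then convert and output. As a hedged illustration of that loop only, the sketch below substitutes a single linear map for the generator network and mean-squared error for the training criterion; the actual method uses a deep convolutional/adversarial network, and all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired data: "first modality" vectors and target "second modality"
# vectors related by an unknown linear map (a stand-in for real image pairs).
first_mod = rng.normal(size=(32, 4))
W_true = rng.normal(size=(4, 4))
second_mod = first_mod @ W_true

W = np.zeros((4, 4))                    # "generator" parameters to be learned
lr, stop_loss, max_iters = 0.05, 1e-4, 10000

for step in range(max_iters):           # Step 201: supervised iterative training
    pred = first_mod @ W                # generator output for the first modality
    err = pred - second_mod
    loss = (err ** 2).mean()
    if loss < stop_loss:                # stopping condition reached: end training
        break
    W -= lr * first_mod.T @ err / len(first_mod)   # gradient descent update

generated = first_mod[0] @ W            # Step 202: convert with trained generator
print(loss < stop_loss)                 # training reached the stopping condition
```

The structure, supervised pairs plus an explicit stopping condition, is the point here; swapping the linear map for the patent's generator network leaves the loop unchanged.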
  • the second modality image is one of PET, CT, MR or other medical modality images.
  • the pre-trained generator in step 202 is trained in the following manner:
  • the first modality image data is input into the neural network.
  • the first modal image and the target second modal image are input into the training network to realize supervised learning, and the training stops when the iterative training reaches the stopping condition.
  • a single modality image is input to obtain the target modality image. Because the deep convolutional neural network has been pre-trained, it can be applied directly when deployed, and the entire generation process is very fast. Taking a PET image as input and an MR image as output, as shown in Figure 3, conversion between other modalities follows the same method.
  • the generator network can be implemented by networks such as convolutional neural networks, residual networks, and generative adversarial networks.
  • This example uses a generative adversarial network.
  • Its network structure includes a generator and a discriminator.
  • the generator is used to generate the target modality image, and the discriminator is used to judge whether the generated image meets the requirements; once the requirement for stopping training is met, training stops.
  • the network structure of the generator can use transfer learning technology to speed up network training by loading a pre-trained network.
  • the generator is shown in FIG. 4 .
  • the generator implements multi-level downsampling of the input image, and then corresponds to multi-level upsampling. Multiple residual network connections are used between downsampling and upsampling.
  • the structure of convolutional network A in the generator is shown in Figure 7. It contains several convolutional layers and activation layers; the last layer is a pooling layer, which may use a max pooling or average pooling operation.
  • its main function is to extract high-level features of the image; there are n such networks in series, each short-circuited (skip-connected) to the corresponding convolutional network B;
  • the structure of convolutional network B in the generator is shown in Figure 8. There are n in total, each containing several convolutional layers and activation layers; the last layer is an up-sampling layer, which may use a deconvolution or un-pooling operation to realize image generation;
  • the residual network structure in the generator is shown in Figure 6.
  • the discriminator network is shown in Figure 5. It contains n convolutional layers and n activation layers, and its last three layers are fully connected layer + activation layer + fully connected layer.
  • the number of convolutional networks A is 5; in each, there are 3 convolutional layers and 3 activation layers, and the last layer is a max pooling layer
  • the number of corresponding convolutional networks B is 5; in each, there are 3 convolutional layers and 3 activation layers, and the last layer is a deconvolution (up-sampling) layer
  • in the discriminator, there are 8 convolutional layers and 8 activation layers, and the last three layers are fully connected layer + activation layer + fully connected layer.
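The discriminator's final "fully connected + activation + fully connected" head can be sketched on its own. This NumPy toy takes a pre-computed feature tensor in place of the 8 convolutional stages, and its weight shapes, the leaky_relu choice, and the 0.2 slope are all illustrative assumptions, not values from the patent.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def discriminator_head(features, w1, w2):
    """The last three discriminator layers named in the text:
    fully connected -> activation -> fully connected (one output score)."""
    x = features.reshape(-1)     # flatten the conv-stage feature maps
    x = leaky_relu(w1 @ x)       # fully connected layer + activation layer
    return float(w2 @ x)         # final fully connected layer, a scalar score

rng = np.random.default_rng(1)
features = rng.normal(size=(4, 4, 8))               # hypothetical conv output
w1 = 0.01 * rng.normal(size=(32, features.size))    # hypothetical weights
w2 = 0.01 * rng.normal(size=(32,))
score = discriminator_head(features, w1, w2)
print(np.isfinite(score))
```

The single scalar output is what lets the discriminator "judge whether the generated image meets the requirements": in adversarial training it is typically thresholded or fed into a loss against the real/fake label.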
  • the generator network structure adopts a generative adversarial network.
  • the downsampling process of the generator network is as follows: the input and output image sizes of convolutional network A1 at the start of the network equal the input image size of the network, denoted M × N, and the feature images generated by the other convolutional layers inside this module (multiple concatenated convolution units) are also M × N.
  • after the first downsampling, the feature image size in convolutional network A2 is (M/2) × (N/2).
  • the feature image size of the inner convolutional layers of convolutional network An is (M/n) × (N/n).
  • a residual path is used to directly connect the output of each downsampling convolutional network A to the output of the corresponding upsampling convolutional network B.
  • the output feature image of the downsampling convolutional network A1 has size M × N;
  • the output feature image of the upsampling convolutional network B2 likewise has size M × N;
  • the output of downsampling convolution module A1 is therefore connected to the output of upsampling convolutional network B2, and together they are fed into convolutional network B1.
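The A1-to-B2 join described above pairs two feature maps of identical spatial size M × N before convolutional network B1. The patent text does not fix how the two outputs are combined; the NumPy sketch below shows channel concatenation, one common realization in U-Net-style generators (element-wise addition is another), and the channel counts are illustrative assumptions.

```python
import numpy as np

# Hypothetical shapes: both feature maps are M x N spatially (M = N = 8 here)
# with 16 channels each; these channel counts are assumptions for illustration.
a1_out = np.ones((8, 8, 16))   # output of down-sampling network A1, size M x N
b2_out = np.ones((8, 8, 16))   # output of up-sampling network B2, also M x N

# Join the two outputs before feeding convolutional network B1. Channel
# concatenation keeps both feature sets intact and lets B1 mix them.
joint = np.concatenate([a1_out, b2_out], axis=-1)
print(joint.shape)  # (8, 8, 32): spatial size unchanged, channels combined
```

Either join only works because A1 and B2 produce the same M × N spatial size, which is exactly why the residual paths connect matching levels of the two pyramids.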
  • the size of the convolution kernel used by each convolution module in the network structure may be selected from 3 × 3, 5 × 5, 7 × 7, and the like.
  • the number of input and output feature images of each convolution module and of the convolutional layers inside the module may be selected from 8, 16, 32, 64, etc.
  • the activation function of each convolution module and of the convolutional layers inside the module may be selected from relu, leaky_relu, tanh, etc.
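The candidate activation functions listed above have simple closed forms. A NumPy sketch for reference (the alpha = 0.2 slope for leaky_relu is a common default, not a value specified in the text):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))        # negative inputs clipped to 0
print(leaky_relu(x))  # negative inputs scaled by alpha instead of clipped
print(np.tanh(x))     # squashed into (-1, 1)
print(sigmoid(x))     # squashed into (0, 1)
```

In practice relu/leaky_relu are the usual choices inside the convolution modules, while tanh or sigmoid are more often used at an output layer to bound pixel intensities.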
  • a PET image is used as input, and an MR image is generated by the network; the result is shown in Figure 9.
  • the effect is satisfactory, and the method can provide doctors with the anatomical structure information of both the PET image and the MR image for the localization and diagnosis of lesions.
  • the same method and process can be used for conversion between other modality images (such as PET to CT, CT to MR, CT to PET, MR to CT, and MR to PET).
  • the present application further provides a multimodal medical image generation apparatus, the apparatus embodiment corresponds to the method embodiment shown in FIG. 2 , and the apparatus can be specifically applied to various electronic devices.
  • the multimodal medical image generation apparatus 1000 in this embodiment includes: a training apparatus 1001, an input apparatus 1002, and an output apparatus 1003.
  • the training device 1001 is used to input the first modal image and the target second modal image into the training network to realize supervised learning, stop training after the iterative training reaches the stop condition, and obtain a generator;
  • an input device 1002 configured to input a first modality image into a pre-trained generator and convert it into a second modality image, the modality of the second image differing from that of the first;
  • an output device 1003 configured to output the second modality image;
  • the second modality image is one of PET, CT, MR, or other medical modality images.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

A multi-modal medical image generation method and apparatus (1000), an electronic device, and a computer-readable storage medium. The method comprises: inputting a first modality image into a pre-trained generator; converting it into a second modality image, the modality of which differs from that of the first; and outputting the second modality image. By means of this method, one modality of a medical image is converted to generate another, realizing mutual conversion and generation among computed tomography (CT), nuclear magnetic resonance (MR), and positron emission tomography (PET) images. Only one of the three modalities CT, MR, and PET needs to be scanned, and the other two can be generated. The purchasing burden on equipment operators is effectively reduced, and examination time and expense are saved for patients.

Description

多模态医学图像生成方法和装置Multimodal medical image generation method and device 技术领域technical field
本申请实施例涉及医学图像领域,具体涉及医学图像模态领域,尤其涉及多模态医学图像生成方法和装置。The embodiments of the present application relate to the field of medical images, in particular to the field of medical image modalities, and in particular, to a method and apparatus for generating a multimodal medical image.
背景技术Background technique
X射线计算机断层成像技术(X射线CT)是一种利用X射线与物质相互作用的原理,对待检部位一定厚度的层面进行扫描,由探测器接收透过该层面的X射线,转变为电信号,再经模拟/数字转换器转为数字信号(投影数据),输入计算机,对物体内部信息进行成像的一种技术。X-ray computed tomography (X-ray CT) is a method that uses the principle of interaction between X-rays and matter to scan a layer with a certain thickness of the part to be inspected, and the detector receives the X-rays passing through the layer and converts them into electrical signals , and then converted into a digital signal (projection data) by an analog/digital converter, input into a computer, and a technology for imaging the internal information of an object.
核磁共振(MR)成像技术是利用核磁共振(nuclear magnetic resonance)原理,依据所释放的能量在物质内部不同结构环境中不同的衰减,通过外加梯度磁场检测所发射出的电磁波,即可得知构成这一物体原子核的位置和种类,据此可以绘制成物体内部的结构图像。Nuclear magnetic resonance (MR) imaging technology is based on the principle of nuclear magnetic resonance (nuclear magnetic resonance), according to the different attenuation of the released energy in different structural environments inside the material, and by applying a gradient magnetic field to detect the emitted electromagnetic waves, you can know the composition. The position and type of the nuclei of the object can be used to map the internal structure of the object.
正电子发射断层扫描(PET)成像技术是一种用于肿瘤疾病诊断工具,其方法是将某种物质,一般是生物生命代谢中必须的物质,如:葡萄糖、蛋白质、核酸、脂肪酸,标记上短半衰期放射性核素(如 18F, 11C等),注入人体,由于肿瘤细胞代谢旺盛,这些物质会在肿瘤细胞处聚集。通过探测放射性核素发射的光子并成像,来定位肿瘤情况,从而可对病变进行诊断和分析。 Positron emission tomography (PET) imaging technology is a diagnostic tool for tumor diseases. Its method is to label certain substances, generally necessary substances in the metabolism of biological life, such as glucose, protein, nucleic acid, and fatty acid. Short half-life radionuclides (such as 18 F, 11 C, etc.) are injected into the human body. Due to the vigorous metabolism of tumor cells, these substances will accumulate at the tumor cells. By detecting and imaging the photons emitted by the radionuclide to localize the tumor, the lesion can be diagnosed and analyzed.
PET图像,如图1(a)所示,能提供肿瘤的诊断信息,但图像缺失了肿瘤 及其周围组织的解剖结构信息,这些额外的解剖需要CT或MR图像提供,如图1(b)所示。PET images, as shown in Figure 1(a), can provide diagnostic information of the tumor, but the images lack the anatomical structure information of the tumor and its surrounding tissues. These additional anatomies need to be provided by CT or MR images, as shown in Figure 1(b) shown.
目前在临床医疗检查中,各个医学成像模态都需要单独的设备进行成像,然后再汇总到医生处,根据多个模态图像进行综合诊断,耗费大量的人力物力。At present, in clinical medical examination, each medical imaging modality requires a separate device for imaging, and then aggregates it to the doctor for comprehensive diagnosis based on multiple modality images, which consumes a lot of manpower and material resources.
目前在临床医疗检查中,各个医学成像模态都需要单独的设备进行成像,要得到PET、CT、MR图像就需要三种设备,对于使用单位来说采购设备的成本很高;对于患者来说这些检查也非常昂贵,耗费的时间也很长。At present, in clinical medical examinations, each medical imaging modality requires separate equipment for imaging. To obtain PET, CT, and MR images, three types of equipment are required. The cost of purchasing equipment is very high for the user; for patients These inspections are also very expensive and time-consuming.
本发明克服了上面的缺点,将医学图像的一种模态转换、生成另一种模态图像,实现了计算机断层成像(CT)图像、核磁共振(MR)图像以及正电子发射断层扫描(PET)图像之间的相互转换与生成。只需要扫描CT、MR、PET三种模态中的一种,就可生成其他另外两种。能有效减轻设备使用单位的采购负担,同时为患者节省检查时间和经济负担。The present invention overcomes the above shortcomings, converts one mode of a medical image to generate another mode image, and realizes a computed tomography (CT) image, a nuclear magnetic resonance (MR) image, and a positron emission tomography (PET) image. ) mutual conversion and generation between images. Only one of the three modalities of CT, MR, and PET needs to be scanned, and the other two can be generated. It can effectively reduce the procurement burden of equipment users, while saving examination time and economic burden for patients.
发明内容SUMMARY OF THE INVENTION
本申请实施例提出了一种多模态医学图像生成方法和装置。The embodiments of the present application propose a method and apparatus for generating a multimodal medical image.
第一方面,本申请实施例提供了一种多模态医学图像生成方法,包括:In a first aspect, an embodiment of the present application provides a method for generating a multimodal medical image, including:
将第一模态图像与目标第二模态图像输入训练网络,实现监督学习,迭代训练达到停止条件后停止训练,获得生成器;Input the first modal image and the target second modal image into the training network to realize supervised learning, stop the training after the iterative training reaches the stopping condition, and obtain the generator;
将第一模态图像输入预先训练好的生成器,将所述第一模态图像转换为第二模态图像,所述第二模态图像与所述第一模态图像的模态不同;inputting the first modal image into a pre-trained generator, and converting the first modal image into a second modal image, where the second modal image is different from the first modal image;
输出所述第二模态图像。The second modality image is output.
在一些实施例中,其中,所述生成器的生成器网络采用卷积神经网络,残差网络和生成对抗网络。In some embodiments, the generator network of the generator adopts a convolutional neural network, a residual network and a generative adversarial network.
在一些实施例中,其中,所述生成器网络采用生成对抗网络,生成器的网络结构包含生成器与判别器;In some embodiments, the generator network adopts a generative adversarial network, and the network structure of the generator includes a generator and a discriminator;
生成器用于生成目标模态图像;The generator is used to generate the target modality image;
判别器用于判断生成的图像是否达到了要求,达到了停止训练的要求即停止训练。The discriminator is used to judge whether the generated image meets the requirements, and stops training if it meets the requirement to stop training.
在一些实施例中,其中,所述生成器的网络结构可以运用迁移学习技术,通过加载预训练网络加快网络训练。In some embodiments, the network structure of the generator can use transfer learning technology to speed up network training by loading a pre-trained network.
在一些实施例中,其中,所述生成器用于实现输入图像的多级降采样,之后再对应多级升采样,降采样与升采样之间采用多个残差网络连接。In some embodiments, the generator is used to implement multi-level down-sampling of the input image, and then corresponds to multi-level up-sampling, and multiple residual network connections are used between down-sampling and up-sampling.
在一些实施例中,其中,所述降采样与所述升采样结合的网络,包括但不限于:降采样与升采样都采用常规卷积层或残差卷积层,卷积层激活函数采用relu、leaky_relu、tanh、sigmod。In some embodiments, the network combining the down-sampling and the up-sampling includes but is not limited to: both down-sampling and up-sampling use conventional convolution layers or residual convolution layers, and the activation function of the convolution layer uses relu, leaky_relu, tanh, sigmod.
在一些实施例中,其中,降采样与升采样结合的部分,可以采用直接相连,也可以采用全连接层相连接,也可以采用一个或多个残差网络相连接。In some embodiments, the part of the combination of down-sampling and up-sampling may be directly connected, may be connected by a fully connected layer, or may be connected by one or more residual networks.
第二方面,本申请实施例还提供了一种多模态医学图像生成装置,包括:In a second aspect, the embodiments of the present application also provide a multimodal medical image generation device, including:
训练装置,用于将第一模态图像与目标第二模态图像输入训练网络,实现监督学习,迭代训练达到停止条件后停止训练,获得生成器;The training device is used for inputting the first modal image and the target second modal image into the training network to realize supervised learning, stop the training after the iterative training reaches the stopping condition, and obtain the generator;
输入装置,用于将第一模态图像输入预先训练好的生成器,将所述第一模 态图像转换为第二模态图像,所述第二模态图像与所述第一模态图像的模态不同;an input device for inputting a first modal image into a pre-trained generator, converting the first modal image into a second modal image, the second modal image and the first modal image modalities are different;
输出装置,用于输出所述第二模态图像。an output device for outputting the second modal image.
在一些实施例中,其中,所述生成器的生成器网络采用卷积神经网络,残差网络和生成对抗网络。In some embodiments, the generator network of the generator adopts a convolutional neural network, a residual network and a generative adversarial network.
在一些实施例中,其中,所述生成器的网络结构包含生成器与判别器;In some embodiments, the network structure of the generator includes a generator and a discriminator;
生成器用于生成目标模态图像;The generator is used to generate the target modality image;
判别器用于判断生成的图像是否达到了要求,达到了停止训练的要求即停止训练。The discriminator is used to judge whether the generated image meets the requirements, and stops training if it meets the requirement to stop training.
在一些实施例中,其中,所述生成器的网络结构可以运用迁移学习技术,通过加载预训练网络加快网络训练。In some embodiments, the network structure of the generator can use transfer learning technology to speed up network training by loading a pre-trained network.
在一些实施例中,其中,所述生成器用于实现输入图像的多级降采样,之后再对应多级升采样,降采样与升采样之间采用多个残差网络连接。In some embodiments, the generator is used to implement multi-level down-sampling of the input image, and then corresponds to multi-level up-sampling, and multiple residual network connections are used between down-sampling and up-sampling.
在一些实施例中,其中,所述降采样与所述升采样结合的网络,包括但不限于:降采样与升采样都采用常规卷积层或残差卷积层,卷积层激活函数采用relu、leaky_relu、tanh、sigmod。In some embodiments, the network combining the down-sampling and the up-sampling includes but is not limited to: both down-sampling and up-sampling use conventional convolution layers or residual convolution layers, and the activation function of the convolution layer uses relu, leaky_relu, tanh, sigmod.
在一些实施例中,其中,降采样与升采样结合的部分,可以采用直接相连,也可以采用全连接层相连接,也可以采用一个或多个残差网络相连接。In some embodiments, the part of the combination of down-sampling and up-sampling may be directly connected, may be connected by a fully connected layer, or may be connected by one or more residual networks.
第三方面,本申请实施例提供了一种电子设备,包括:In a third aspect, an embodiment of the present application provides an electronic device, including:
一个或多个处理器;one or more processors;
存储装置,用于存储一个或多个程序,storage means for storing one or more programs,
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现上述多模态医学图像生成方法中任一实施例的方法。When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method of any one of the above-described multimodal medical image generation methods.
第四方面,本申请实施例提供了一种计算机可读存储介质,其上存储有计算机程序,其中,该程序被处理器执行时实现上述多模态医学图像生成方法中任一实施例的方法。In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the method of any of the foregoing multimodal medical image generation methods is implemented .
本发明克服了现有技术中的缺点,将医学图像的一种模态转换、生成另一种模态图像,实现计算机断层成像(CT)图像、核磁共振(MR)图像以及正电子发射断层扫描(PET)图像之间的相互转换与生成。以输入PET图像输出MR图像为例,本发明将患者检查得到的PET图像输入本发明神经网络,直接可生成MR图像,不需要另外专用的MR设备重新扫描。因深度卷积神经网络已经预先训练完成,在部署时,可直接应用,整个重建过程速度非常迅速。The invention overcomes the shortcomings in the prior art, converts one mode of medical images to generate another mode image, and realizes computed tomography (CT) images, nuclear magnetic resonance (MR) images and positron emission tomography scans Interconversion and generation between (PET) images. Taking inputting PET images and outputting MR images as an example, the present invention inputs the PET images obtained by patient inspection into the neural network of the present invention, and can directly generate MR images without re-scanning with additional dedicated MR equipment. Because the deep convolutional neural network has been pre-trained, it can be directly applied when deployed, and the entire reconstruction process is very fast.
Description of Drawings
Fig. 1 shows medical images of two modalities in the prior art;
Fig. 2 is a flowchart of an embodiment of the multimodal medical image generation method of the present application;
Fig. 3 is an overall flowchart of generating an MR image from a PET image according to the present application;
Fig. 4 is a schematic diagram of the generator network structure of the present application;
Fig. 5 is a schematic diagram of the discriminator network structure of the present application;
Fig. 6 is a schematic diagram of the residual network structure in the generator of the present application;
Fig. 7 is a schematic diagram of the structure of convolutional network A in the generator of the present application;
Fig. 8 is a schematic diagram of the structure of convolutional network B in the generator of the present application;
Fig. 9 is a comparison of an MR image generated from a PET image according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of the multimodal medical image generation apparatus of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 2 shows a flowchart of an embodiment of the multimodal medical image generation method according to the present application. The method includes the following steps:
Step 201: input a first modality image and a target second modality image into a training network to perform supervised learning, stop training once iterative training reaches a stopping condition, and obtain a generator;
Step 202: input a first modality image into the pre-trained generator and convert the first modality image into a second modality image, the second modality image being of a different modality from the first modality image;
Step 203: output the second modality image.
The second modality image is one of PET, CT, MR, or another medical imaging modality.
In this embodiment, the pre-trained generator of step 202 is trained as follows:
First, the first modality image data are input into the neural network. During training, the first modality image and the target second modality image are input into the training network to perform supervised learning, and training stops once iterative training reaches the stopping condition. At deployment, a single image of one modality is input to obtain the target modality image. Because the deep convolutional neural network is trained in advance, it can be applied directly at deployment, and the whole reconstruction process is very fast. The case of inputting a PET image and outputting an MR image is shown in Fig. 3; conversion between other modalities follows the same method.
The generator network may be implemented with a convolutional neural network, a residual network, a generative adversarial network, or similar architectures. This example uses a generative adversarial network, whose structure comprises a generator and a discriminator: the generator produces the target modality image, and the discriminator judges whether the generated image meets the requirements; once the stopping requirement is met, training stops. The generator's network structure may also use transfer learning, loading a pre-trained network to speed up training.
In this embodiment, the generator is shown in Fig. 4. It performs multi-level down-sampling of the input image followed by corresponding multi-level up-sampling, with several residual networks connecting the down-sampling and up-sampling stages. The structure of convolutional network A in the generator is shown in Fig. 7: it contains several convolutional and activation layers, with a pooling layer (max pooling or average pooling) as the last layer. Its main role is to extract high-level image features; there are n such networks in series, each with a skip connection to the corresponding convolutional network B. The structure of convolutional network B in the generator is shown in Fig. 8: there are likewise n of them, each containing several convolutional and activation layers, with an up-sampling layer (deconvolution or unpooling) as the last layer to realize image generation. The residual network structure in the generator is shown in Fig. 6. The discriminator network, shown in Fig. 5, has n convolutional layers and n activation layers, and its last three layers are a fully connected layer, an activation layer, and a fully connected layer.
In some optional implementations of this embodiment, in the generator the number of convolutional networks A is 5, each containing 3 convolutional layers and 3 activation layers with a max-pooling layer as the last layer; correspondingly, the number of convolutional networks B is 5, each containing 3 convolutional layers and 3 activation layers with a deconvolution layer as the last layer. In the discriminator, there are 8 convolutional layers and 8 activation layers, and the last three layers are a fully connected layer, an activation layer, and a fully connected layer.
In some optional implementations of this embodiment, the generator network structure adopts a generative adversarial network. In each training round, the discriminator network is first trained n times (here n = 5), and then the generator is trained once. The discriminator judges whether the image produced by the generator has reached the clarity of the reference image; if not, the discriminator is trained another 5 times and the generator once more, and this cycle repeats until the network training is completed after multiple iterations.
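The alternating schedule just described, n discriminator updates followed by one generator update and repeated until the stopping condition is met, can be sketched in outline. The stubs below are hypothetical stand-ins for the real gradient steps, and the stopping check is simplified to a fixed number of rounds:

```python
def train_gan(n_disc=5, rounds=2):
    """Run `rounds` training rounds, each consisting of n_disc
    discriminator updates followed by a single generator update."""
    schedule = []
    for _ in range(rounds):
        for _ in range(n_disc):
            schedule.append("D")  # one discriminator update on real/generated pairs
        schedule.append("G")      # one generator update against the updated discriminator
    return schedule

# Two rounds with n = 5 give the pattern: D D D D D G  D D D D D G
print(train_gan(n_disc=5, rounds=2))
```

In a real implementation the loop would terminate when the discriminator's judgment indicates the generated images match the reference clarity, rather than after a fixed `rounds` count.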
In some optional implementations of this embodiment, the down-sampling process of the generator network is as follows. At the start of the network, the input and output image sizes of convolutional network A1 equal the network's input image size; denoting the input size as M×N, the feature images generated by the other convolutional layers (multiple cascaded convolution units) inside this module are also M×N. The input layer (unit) of the next-level convolutional network A2 uses stride = 2 down-sampling convolution, giving feature images of size (M/2)×(N/2), and the other layers (units) of that module keep the size (M/2)×(N/2). By analogy, the feature image size of the convolutional layers inside network An is (M/n)×(N/n).
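Assuming the standard convolution output-size formula, out = ⌊(in + 2·padding − kernel) / stride⌋ + 1, the size bookkeeping for the down-sampling path can be checked numerically. The 3×3 kernel and padding of 1 are illustrative assumptions, not values fixed by the text:

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a 2D convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

M = 256
# Inside network A1, a stride-1 layer preserves the M x N size:
assert conv_out(M, stride=1) == 256
# The stride-2 input layer of network A2 halves it to M/2:
assert conv_out(M, stride=2) == 128
```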
In some optional implementations of this embodiment, the up-sampling process of the generator network mirrors the down-sampling. The internal convolutional layers of network Bn produce feature images of size (M/n)×(N/n); the last layer of the module uses a stride = 2 "deconvolution" to up-sample the feature images to (M/(n-1))×(N/(n-1)), and the subsequent up-sampling modules proceed likewise until the final up-sampling network B1. The last layer of B1 no longer up-samples, and a reshaping layer ensures that the output image size matches the network's original input size.
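Conversely, a stride-2 transposed convolution ("deconvolution") doubles the spatial size. Assuming the standard formula out = (in − 1)·stride − 2·padding + kernel + output_padding (the kernel of 3, padding of 1, and output padding of 1 are illustrative assumptions), a sanity check:

```python
def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output size of a 2D transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# A stride-2 deconvolution restores the size halved by a stride-2 convolution:
assert deconv_out(128) == 256
# Two successive stages quadruple the size, mirroring two down-sampling stages:
assert deconv_out(deconv_out(64)) == 256
```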
In some optional implementations of this embodiment, when a down-sampling convolutional network and an up-sampling convolutional network output feature images of the same size, a residual path connects the output of down-sampling network A directly to the output of up-sampling network B. For example, if the output feature image of down-sampling network A1 is M×N and the output feature image of up-sampling network B2 is also M×N, the output of down-sampling module A1 is connected to the output of up-sampling network B2, and the two are jointly fed into network B1.
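This residual path is an encoder-decoder skip connection: when the spatial sizes match, the down-sampling output is combined with the up-sampling output before the next up-sampling stage. A minimal sketch tracking only shapes (the dict representation and channel counts are illustrative assumptions):

```python
def skip_merge(down_feat, up_feat):
    """Join a down-sampling (encoder) output with an up-sampling (decoder)
    output of the same spatial size, concatenating along channels."""
    if down_feat["size"] != up_feat["size"]:
        raise ValueError("skip path requires matching feature-image sizes")
    return {"size": down_feat["size"],
            "channels": down_feat["channels"] + up_feat["channels"]}

a1 = {"size": (256, 256), "channels": 64}  # output of down-sampling network A1
b2 = {"size": (256, 256), "channels": 64}  # output of up-sampling network B2
merged = skip_merge(a1, b2)                # jointly fed into network B1
```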
In some optional implementations of this embodiment, multiple cascaded residual networks connect the end of the down-sampling path to the beginning of the up-sampling path.
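Each residual block at this bottleneck adds its transform back onto its input, y = x + F(x). A toy numeric sketch, with a hypothetical stand-in transform in place of the block's real convolution and activation layers:

```python
def residual_block(x, transform):
    """y = x + F(x): the block adds its transform back to its input."""
    return [xi + fi for xi, fi in zip(x, transform(x))]

def residual_cascade(x, transforms):
    """Apply several residual blocks in series, as at the bottleneck."""
    for t in transforms:
        x = residual_block(x, t)
    return x

double = lambda v: [2.0 * xi for xi in v]  # stand-in for conv + activation layers
# Each block maps x -> x + 2x = 3x, so two cascaded blocks map x -> 9x:
out = residual_cascade([1.0, 2.0], [double, double])
```

The identity path means each block only has to learn a correction F(x), which is what makes deep cascades of such blocks trainable.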
In some optional implementations of this embodiment, the convolution kernel size of the "convolution modules" in the network structure may be chosen as 3×3, 5×5, 7×7, and so on. The number of input and output feature images of each convolution module, and of the convolutional layers inside each module, may be chosen as 8, 16, 32, 64, and so on. The activation function of each convolution module and its internal convolutional layers may be relu, leaky_relu, tanh, and so on.
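The listed activation functions are standard element-wise nonlinearities; scalar reference versions follow (the leaky_relu slope of 0.01 is an illustrative assumption, the text does not fix it):

```python
import math

def relu(x):
    """Zero out negative inputs, pass positives through."""
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Like relu, but with a small negative slope instead of a hard zero."""
    return x if x > 0 else alpha * x

def tanh(x):
    """Squash inputs into the open interval (-1, 1)."""
    return math.tanh(x)

assert relu(-3.0) == 0.0 and relu(2.5) == 2.5
assert leaky_relu(-2.0) == -0.02
assert -1.0 < tanh(-5.0) < tanh(5.0) < 1.0
```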
In one embodiment, a PET image is used as input and an MR image is generated through the network; the result is shown in Fig. 9. The effect is satisfactory, and it can provide doctors with the anatomical structure information of the MR image corresponding to the PET image, for use in locating and diagnosing lesions. The same method and process apply to conversions between other modality pairs (such as PET to CT, CT to MR, CT to PET, MR to CT, and MR to PET).
With further reference to Fig. 10, the present application also provides a multimodal medical image generation apparatus. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied in various electronic devices.
As shown in Fig. 10, the multimodal medical image generation apparatus 1000 of this embodiment includes a training device 1001, an input device 1002, and an output device 1003.
The training device 1001 is configured to input the first modality image and the target second modality image into the training network to perform supervised learning, stop training once iterative training reaches the stopping condition, and obtain the generator.
The input device 1002 is configured to input the first modality image into the pre-trained generator and convert the first modality image into a second modality image, the second modality image being of a different modality from the first modality image.
The output device 1003 is configured to output the second modality image.
The second modality image is one of PET, CT, MR, or another medical imaging modality.
The above description covers only preferred embodiments of this specification and is not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall fall within its scope of protection.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer, which may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined here, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprising", "including", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding parts of the method embodiments.

Claims (16)

  1. A multimodal medical image generation method, comprising:
    inputting a first modality image and a target second modality image into a training network to perform supervised learning, stopping training once iterative training reaches a stopping condition, and obtaining a generator;
    inputting a first modality image into the pre-trained generator and converting the first modality image into a second modality image, the second modality image being of a different modality from the first modality image; and
    outputting the second modality image.
  2. The method of claim 1, wherein the generator network of the generator adopts a convolutional neural network, a residual network, and a generative adversarial network.
  3. The method of claim 2, wherein the generator network adopts a generative adversarial network, and the network structure of the generator comprises a generator and a discriminator;
    the generator is configured to generate a target modality image; and
    the discriminator is configured to judge whether the generated image meets the requirements, training being stopped once the requirement for stopping training is met.
  4. The method of claim 3, wherein the network structure of the generator uses transfer learning, loading a pre-trained network to speed up network training.
  5. The method of claim 4, wherein the generator performs multi-level down-sampling of the input image followed by corresponding multi-level up-sampling, with multiple residual networks connecting the down-sampling and up-sampling stages.
  6. The method of claim 5, wherein the network combining the down-sampling and the up-sampling includes, but is not limited to, one in which both down-sampling and up-sampling use conventional convolutional layers or residual convolutional layers, and the convolutional-layer activation function is relu, leaky_relu, tanh, or sigmoid.
  7. The method of claim 6, wherein the junction between the down-sampling and up-sampling stages is a direct connection, a fully connected layer, or one or more residual networks.
  8. A multimodal medical image generation apparatus, comprising:
    a training device configured to input a first modality image and a target second modality image into a training network to perform supervised learning, stop training once iterative training reaches a stopping condition, and obtain a generator;
    an input device configured to input a first modality image into the pre-trained generator and convert the first modality image into a second modality image, the second modality image being of a different modality from the first modality image; and
    an output device configured to output the second modality image.
  9. The apparatus of claim 8, wherein the generator network of the generator adopts a convolutional neural network, a residual network, and a generative adversarial network.
  10. The apparatus of claim 9, wherein the network structure of the generator comprises a generator and a discriminator;
    the generator is configured to generate a target modality image; and
    the discriminator is configured to judge whether the generated image meets the requirements, training being stopped once the requirement for stopping training is met.
  11. The apparatus of claim 10, wherein the network structure of the generator uses transfer learning, loading a pre-trained network to speed up network training.
  12. The apparatus of claim 11, wherein the generator performs multi-level down-sampling of the input image followed by corresponding multi-level up-sampling, with multiple residual networks connecting the down-sampling and up-sampling stages.
  13. The apparatus of claim 12, wherein the network combining the down-sampling and the up-sampling includes, but is not limited to, one in which both down-sampling and up-sampling use conventional convolutional layers or residual convolutional layers, and the convolutional-layer activation function is relu, leaky_relu, tanh, or sigmoid.
  14. The apparatus of claim 13, wherein the junction between the down-sampling and up-sampling stages is a direct connection, a fully connected layer, or one or more residual networks.
  15. An electronic device, comprising:
    one or more processors; and
    a storage device storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
  16. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
PCT/CN2020/135439 2020-12-10 2020-12-10 Multi-modal medical image generation method and apparatus WO2022120762A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/135439 WO2022120762A1 (en) 2020-12-10 2020-12-10 Multi-modal medical image generation method and apparatus

Publications (1)

Publication Number: WO2022120762A1, published 2022-06-16

Family ID: 81974107

Family Applications (1): PCT/CN2020/135439 — Multi-modal medical image generation method and apparatus (priority/filing date 2020-12-10)

Country Status (1): WO (1) WO2022120762A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160113630A1 (en) * 2014-10-23 2016-04-28 Samsung Electronics Co., Ltd. Ultrasound imaging apparatus and method of controlling the same
CN110111395A (en) * 2019-04-24 2019-08-09 上海理工大学 A method of PET-MRI image is synthesized based on MRI image
CN110234400A (en) * 2016-09-06 2019-09-13 医科达有限公司 For generating the neural network of synthesis medical image
CN110444277A (en) * 2019-07-19 2019-11-12 重庆邮电大学 It is a kind of based on generating multipair anti-multi-modal brain MRI image bi-directional conversion method more
CN110544239A (en) * 2019-08-19 2019-12-06 中山大学 Multi-modal MRI conversion method, system and medium for generating countermeasure network based on conditions
CN110800066A (en) * 2017-07-03 2020-02-14 通用电气公司 Physiological mapping from multi-parameter radiological data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402865A (en) * 2023-06-06 2023-07-07 之江实验室 Multi-mode image registration method, device and medium using diffusion model
CN116402865B (en) * 2023-06-06 2023-09-15 之江实验室 Multi-mode image registration method, device and medium using diffusion model
CN116433795A (en) * 2023-06-14 2023-07-14 之江实验室 Multi-mode image generation method and device based on countermeasure generation network
CN116433795B (en) * 2023-06-14 2023-08-29 之江实验室 Multi-mode image generation method and device based on countermeasure generation network

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number 20964700; country of ref document: EP; kind code: A1)
NENP Non-entry into the national phase (ref country code: DE)
122 EP: PCT application non-entry in European phase (ref document number 20964700; country of ref document: EP; kind code: A1)