CN117994143A - Multi-mode MR image synthesis method, system, storage medium and equipment - Google Patents

Multi-mode MR image synthesis method, system, storage medium and equipment

Info

Publication number
CN117994143A
CN117994143A (application CN202410037947.3A)
Authority
CN
China
Prior art keywords
mode
image
modality
encoder
images
Prior art date
Legal status
Granted
Application number
CN202410037947.3A
Other languages
Chinese (zh)
Other versions
CN117994143B (en)
Inventor
吕骏
陈秀东
Current Assignee
Yantai University
Original Assignee
Yantai University
Priority date
Filing date
Publication date
Application filed by Yantai University
Priority to CN202410037947.3A
Publication of CN117994143A
Application granted
Publication of CN117994143B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10072 Tomographic images
    • G06T2207/10088 Magnetic resonance imaging [MRI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Abstract

The invention relates to a multi-modality MR image synthesis method, system, storage medium and device. MR images of different modalities are acquired and divided into mutually non-overlapping patches; images of at least two modalities are masked in a randomly aligned manner; the aligned visible tokens and their corresponding positions in each modality image are extracted based on an encoder; an image of the target modality is obtained based on a decoder; consistency processing is applied to the features of the target-modality image; and a synthesized image of the real modality is obtained through training. During training, a generator is used to obtain the edge map corresponding to each modality image, which is concatenated with the original modality image to form the feature representation of the given modality. By synthesizing missing modalities from limited paired data, the time spent pairing data can be saved and the cost of MR equipment reduced.

Description

Multi-mode MR image synthesis method, system, storage medium and equipment
Technical Field
The invention relates to the technical field of image synthesis, and in particular to a multi-mode MR image synthesis method, system, storage medium and equipment.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
MR (magnetic resonance) imaging is a non-invasive, non-radiative imaging technique that uses the signals generated by nuclei under strong magnetic fields and radio-frequency pulses to produce high-resolution images of human tissue. Its high contrast, multi-planar imaging and superior depiction of soft tissue make it an indispensable tool in clinical practice and scientific research. The technique nevertheless has some limitations: the manufacturing and maintenance costs of the equipment are high; acquiring magnetic resonance images generally requires a long scan, which for some patients, particularly those who cannot remain still, may cause discomfort and reduce imaging quality; and for patients carrying metal implants or devices, acquiring magnetic resonance images presents a certain safety risk.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a multi-mode MR image synthesis method, system, storage medium and equipment. Under the condition of limited paired data, a multi-modal medical image synthesis framework (MCF-Net) with patch complementary pre-training (PC-MAE) is introduced, and the limitations of traditional methods regarding noise, motion artifacts and cost are overcome by fusing information from different modalities of the magnetic resonance image, thereby reducing the cost of acquiring images and improving image quality.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A first aspect of the invention provides a method of multi-modality MR image synthesis comprising the steps of:
MR images of different modalities are acquired and divided into mutually non-overlapping patches; at least two of the modality images are masked in a randomly aligned manner; the aligned visible tokens and their corresponding positions in each modality image are extracted based on an encoder; an image of the target modality is obtained based on a decoder; consistency processing is applied to the features of the target-modality image; and a synthesized image of the real modality is obtained through training;
during training, a generator is used to obtain the edge map corresponding to each modality image, which is concatenated with the original modality image to form the feature representation of the given modality.
Further, the aligned visible tokens and their corresponding positions in each modality image are extracted based on an encoder, specifically: the encoder corresponding to each modality is used to acquire the aligned visible tokens and their corresponding positions in that modality image, and a shared encoder is used to acquire high-dimensional encoded representations of the different modality images.
Further, the image of the target modality is obtained based on a decoder, specifically: cross-modality interaction is performed by a shared decoder corresponding to the shared encoder, and the images of the corresponding target modalities are obtained based on the specific decoders.
Further, for the feature representation of a given modality, downsampling is realized through the encoder branch corresponding to each modality, the decoder corresponding to each modality restores the feature resolution, and the image of the target modality is obtained through fusion.
Further, obtaining the image of the target modality through fusion comprises the following steps:
determining the feature representation acquired by the encoder corresponding to each modality and the upsampled feature representation acquired by the decoder corresponding to each modality;
in the spatial attention guidance stage, concatenating the feature representation acquired by the encoder with the upsampled feature representation acquired by the decoder, obtaining a spatial attention map through a point-wise convolution operation, and scaling it;
performing spatial multiplication between the spatial attention map and the input features to obtain filtered features.
Further, obtaining the image of the target modality through fusion further comprises:
in the channel attention guidance stage, applying pooling to the filtered features to obtain channel context descriptors;
recombining the channel context descriptors to obtain a fused feature map, which serves as the input of the next upsampling stage.
Further, the consistency processing is specifically: using the fused target-modality image and the corresponding true value of the missing modality, measuring the difference between the synthesized image and the true value based on a feature consistency loss, and balancing the loss functions through a weight parameter to realize the consistency processing.
A second aspect of the invention provides a multi-modality MR image synthesis system comprising:
A patch complementary pre-training module configured to: acquire MR images of different modalities and divide them into mutually non-overlapping patches; mask at least two of the modality images in a randomly aligned manner; extract the aligned visible tokens and their corresponding positions in each modality image based on an encoder; obtain an image of the target modality based on a decoder; apply consistency processing to the features of the target-modality image; and obtain a synthesized image of the real modality through training;
an edge enhancement fine-tuning module configured to: during training, use a generator to obtain the edge map corresponding to each modality image and concatenate it with the original modality image to form the feature representation of the given modality.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above-described multi-modality MR image synthesis method.
A fourth aspect of the invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the above-described multi-modality MR image synthesis method when the program is executed.
Compared with the prior art, the above technical scheme has the following beneficial effects:
MR images are widely used in medical image processing, but because of the high cost of data acquisition and labeling, only a small amount of paired data is often available; at the same time, a missing modality may make the medical image data incomplete, which can affect subsequent clinical diagnosis and scientific research. Therefore, the limited modality images are divided into mutually non-overlapping patches so that each patch contains complete anatomical structure information; differences and defects in actual medical images are simulated by randomly aligned masking; and the two complete, originally unmasked modality images (the images of the target modalities) are reconstructed with the corresponding encoders and decoders, thereby determining the weights. At run time, a generator is used to obtain the edge maps corresponding to the two input modality images and concatenate them with the original modality images; features are further extracted; and the synthesized third-modality image is obtained by feature fusion using the previously obtained weights. In this way, the time spent pairing data can be saved and the cost of MR equipment can be reduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of an overall architecture of a multi-modality MR image synthesis process provided by one or more embodiments of the present invention;
FIG. 2 is a schematic diagram of a multi-modal collaborative fusion (MCF-Net) network architecture according to one or more embodiments of the present invention;
FIG. 3 is a schematic diagram of the architecture of the attention-driven fusion (ADF) module provided by one or more embodiments of the invention;
FIG. 4 shows the visualized results of the task of synthesizing FLAIR from T1 and T2 provided by one or more embodiments of the present invention;
FIG. 5 shows the visualized results of the task of synthesizing T1 from T2 and FLAIR provided by one or more embodiments of the present invention;
FIG. 6 shows the visualized results of the task of synthesizing T2 from T1 and FLAIR provided by one or more embodiments of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Magnetic resonance imaging (MRI, or MR) uses the principle of nuclear magnetic resonance (NMR): according to the different attenuation of the released energy in different structural environments inside a material, the emitted electromagnetic waves are detected by means of externally applied gradient magnetic fields, so as to obtain the positions and types of the nuclei constituting the object, from which an image of the internal structure of the object can be drawn. It is commonly used by medical institutions for examining human tissue.
The following embodiments provide a multi-mode MR image synthesis method, system, storage medium and apparatus. Under the condition of limited paired data, a multi-modal medical image synthesis framework (MCF-Net) with patch complementary pre-training (PC-MAE) is introduced, and the limitations of traditional methods regarding noise, motion artifacts and cost are overcome by fusing the information of different modalities of the magnetic resonance image, thereby reducing the cost of acquiring images and improving image quality.
Embodiment one:
As shown in Figs. 1-6, the multi-modality MR image synthesis method includes the following steps:
Step 1: data acquisition and preprocessing.
1.1 Data acquisition: collecting MR images of different modalities (e.g., the published BRATS2020 data) from a medical imaging database, from which three modality images are selected, such as T1 weighted, T2 weighted, and FLAIR images;
1.2 image block partitioning: dividing the image of each modality into mutually non-overlapping blocks, typically of fixed size, to ensure that each block contains complete anatomical information;
1.3 random pair Ji Zhebi: randomly aligned masking of two modal images after division into a plurality of blocks, namely masking blocks of a first modal image, and not masking blocks at the same position of a second input modal image (so-called masking some blocks, namely adding great noise to small blocks of the images to cause the small blocks to lose original information so as to simulate differences and defects in actual medical images)
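The complementary random masking of step 1.3 can be sketched as follows (a minimal sketch assuming 2D slices and a ViT-style patch grid; the patch size of 16 and the 50% mask ratio are illustrative and not taken from the patent):

```python
import torch

def complementary_masks(batch, height, width, patch=16, mask_ratio=0.5):
    """Boolean masks over the patch grid of two aligned modality images.
    A patch masked in modality 1 stays visible in modality 2 and vice versa,
    so every position keeps its information in at least one modality."""
    n = (height // patch) * (width // patch)          # patches per image
    n_mask = int(n * mask_ratio)
    ids = torch.rand(batch, n).argsort(dim=1)         # random permutation per sample
    mask1 = torch.zeros(batch, n, dtype=torch.bool)
    mask1.scatter_(1, ids[:, :n_mask], True)          # patches hidden in modality 1
    mask2 = ~mask1                                    # complementary set, hidden in modality 2
    return mask1, mask2

m1, m2 = complementary_masks(batch=2, height=224, width=224)
assert not (m1 & m2).any()                            # no patch is masked in both modalities
```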
Step 2: feature extraction and decoding.
2.1 Encoder feature extraction: the unmasked patches of the two modalities are each arranged into a one-dimensional token sequence and fed into their respective vision Transformer encoders, which extract the aligned visible tokens and the corresponding position information from each modality image; the features output by the two specific encoders are then concatenated and fed into the shared encoder and decoder;
2.2 Decoder decoding: based on the features and position information obtained above, the information is decoded by two specific decoders, realizing the reconstruction of the two modality images (the images were first partitioned and masked, only the unmasked patches were input and their information extracted, and from this the two complete, originally unmasked modality images are reconstructed).
The above constitutes the pre-training stage: the two modality images are masked and reconstructed to train the model and its feature extraction, yielding pre-trained weights for the subsequent synthesis task.
Step 3: edge enhancement and weight reading.
3.1: The pre-trained weights are read in, and training of the downstream task continues on this basis. A Sobel generator produces the edge maps corresponding to the two input modalities; each edge map is concatenated with its input modality image, and the results are fed together into the vision Transformer encoder. A sketch of the Sobel edge generator is given below.
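A sketch of the Sobel edge generator and the channel-wise concatenation of step 3.1 (a minimal version assuming single-channel 2D slices; any normalization of the edge map is not specified in the patent):

```python
import torch
import torch.nn.functional as F

def sobel_edge_map(img):
    """Gradient-magnitude edge map for a (B, 1, H, W) image batch."""
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                            # vertical-gradient kernel
    gx = F.conv2d(img, kx.to(img), padding=1)
    gy = F.conv2d(img, ky.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

x = torch.randn(2, 1, 224, 224)                        # one input modality
x_with_edges = torch.cat([x, sobel_edge_map(x)], dim=1)  # (2, 2, 224, 224) encoder input
```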
Step 4: target modality synthesis and consistency processing.
4.1 Fusion of features: further extracting features through an MCF-Net model, and fusing the feature information obtained from the front side together through an ADF module to finally obtain a third target mode image;
4.2 consistency handling: the synthesized image and the real image are aligned in characteristics through the pre-trained encoder, so that the difference between the two modes is reduced, and the synthesized image is more lifelike.
The synthesis process of this embodiment can be divided into two parts: patch complementary pre-training and edge-enhanced fine-tuning.
Patch Complementary Pre-training. The input image is first divided into a series of non-overlapping patches, and the two input modalities are then masked in a randomly aligned manner, ensuring that the patches at the same position in the two modalities are informationally complementary. The unmasked patches are projected into D-dimensional embeddings, while the masked patches are withheld from the encoder input.
So that the encoders can learn the modality differences by mapping the inputs to the feature space, the aligned visible tokens together with their respective position and modality embeddings are provided to the corresponding encoders. Feature fusion and cross-modality interaction of the visible patches is then facilitated by the shared encoder.
In general, the MAE benefits from learning a generic encoder that captures high-dimensional encoded representations of the different modality data. Because of the differences between the two modalities, specialized decoders are required to decode the high-level latent information of the respective modalities. The purpose of the additional shared decoder layer is ultimately to let the encoder focus more on feature extraction while ignoring the details of modality interaction. Since the MAE adopts an asymmetric auto-encoder design in which the mask tokens do not pass through the shared encoder, the mask tokens are passed, in addition to the visible tokens, through the shared decoder. Without such a design, the decoder branches would be completely separated and the mask tokens of the different modalities would not participate in feature fusion. After the shared decoder, a specific decoder is designed for each modality to achieve a better reconstruction.
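A compressed sketch of this pre-training layout (the module sizes, the use of plain nn.TransformerEncoder blocks, and the token bookkeeping are illustrative assumptions; positional and modality embeddings are omitted for brevity, and Fig. 1 of the patent gives the actual architecture):

```python
import torch
import torch.nn as nn

def block(dim=256, heads=8, depth=2):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)

class PCMAE(nn.Module):
    """Modality-specific encoders, a shared encoder over the visible tokens of
    both modalities, a shared decoder that also receives the mask tokens
    (cross-modality interaction), and modality-specific decoders."""
    def __init__(self, dim=256, patch_pixels=256):
        super().__init__()
        self.enc_a, self.enc_b = block(dim), block(dim)
        self.shared_enc, self.shared_dec = block(dim), block(dim)
        self.dec_a, self.dec_b = block(dim), block(dim)
        self.mask_tok = nn.Parameter(torch.zeros(1, 1, dim))
        self.head_a = nn.Linear(dim, patch_pixels)     # per-patch pixel reconstruction
        self.head_b = nn.Linear(dim, patch_pixels)

    def forward(self, vis_a, vis_b, n_mask_a, n_mask_b):
        # vis_a, vis_b: visible patch embeddings of each modality, (B, N_vis, dim)
        na, nb = vis_a.size(1), vis_b.size(1)
        z = self.shared_enc(torch.cat([self.enc_a(vis_a), self.enc_b(vis_b)], 1))
        ma = self.mask_tok.expand(z.size(0), n_mask_a, -1)   # mask tokens bypass the encoders
        mb = self.mask_tok.expand(z.size(0), n_mask_b, -1)
        z = self.shared_dec(torch.cat([z, ma, mb], 1))       # asymmetric MAE design
        tok_a = torch.cat([z[:, :na], z[:, na + nb:na + nb + n_mask_a]], 1)
        tok_b = torch.cat([z[:, na:na + nb], z[:, na + nb + n_mask_a:]], 1)
        return self.head_a(self.dec_a(tok_a)), self.head_b(self.dec_b(tok_b))
```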
The reconstruction losses are measured using the mean squared error (MSE) and denoted $\mathcal{L}_{m_1}$ and $\mathcal{L}_{m_2}$ respectively. The input of the shared decoder includes the complete set of encoded visible features and mask tokens for both modalities, and the shared decoder performs cross-modality interaction on these latent representations. Then, at the separate decoder stage, each decoder maps back to its image, with the final loss $\mathcal{L}_{pre} = \mathcal{L}_{m_1} + \mathcal{L}_{m_2}$. The detailed architecture is shown in Fig. 1.
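In code, the pre-training loss is simply the sum of the two per-modality MSE terms (a sketch; computing the loss only on masked patches follows standard MAE practice and is an assumption here):

```python
import torch.nn.functional as F

def pretrain_loss(rec_a, rec_b, tgt_a, tgt_b, mask_a, mask_b):
    """L_pre = L_m1 + L_m2, with each term an MSE over that modality's masked
    patches (rec_*/tgt_*: (B, N, patch_pixels); mask_*: (B, N) bool)."""
    loss_a = F.mse_loss(rec_a[mask_a], tgt_a[mask_a])
    loss_b = F.mse_loss(rec_b[mask_b], tgt_b[mask_b])
    return loss_a + loss_b
```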
Edge-Enhanced Fine-Tuning. A Sobel generator produces the edge map corresponding to each input modality image; the generated edge map is concatenated with the original modality image and, together with the multi-modality complementary pre-training stage, forms the feature representation of the specific modality. These feature representations are input into the two independent encoder branches of MCF-Net (the multi-modal collaborative fusion network), each branch containing multiple downsampling stages. At each stage, Swin Transformer layers and a patch merging module are adopted; successive operations reduce the resolution of the feature map and double the feature dimension, forming a hierarchical feature representation.
Notably, the large-scale branch includes an additional downsampling stage compared to the small-scale branch, further enhancing the feature-capture capability of the network. In order to retain important modality-specific information in the final synthesis, a decoder comprising patch expansion operations and Swin Transformer layers is employed to restore the feature resolution. By introducing skip connections, tight communication between the encoder and the decoder is achieved, helping to preserve more modality-specific information. This design enables MCF-Net to capture the associations between the modalities more flexibly and effectively during feature fusion, providing stronger performance for multi-modality synthesis tasks; the detailed architecture is shown in Figs. 1-2. A structural sketch of one encoder branch follows.
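A structural sketch of one encoder branch (plain strided convolutions stand in for the Swin Transformer layers and patch merging; channel counts and stage numbers are illustrative, Fig. 2 gives the real layout):

```python
import torch.nn as nn

class EncoderBranch(nn.Module):
    """Hierarchical branch: each stage halves the spatial resolution and
    doubles the feature dimension, mimicking Swin block + patch merging."""
    def __init__(self, in_ch=2, base=48, stages=4):   # in_ch=2: image + edge map
        super().__init__()
        layers, c = [nn.Conv2d(in_ch, base, 3, padding=1)], base
        for _ in range(stages):
            layers += [nn.GELU(), nn.Conv2d(c, 2 * c, kernel_size=2, stride=2)]
            c *= 2
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
```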
The following introduces the two important modules inside MCF-Net (the multi-modal collaborative fusion network):
1) Attention-Driven Fusion module: in order to better fuse the modality-specific features extracted by the dual-branch encoder, an attention-driven fusion (ADF) module is proposed to effectively integrate the multi-modality features and propagate the fused features into the decoder through skip connections. Inspired by existing attention-based feature selection modules, the proposed ADF module consists of two key components, spatial attention-guided fusion and channel attention-guided fusion, as shown in Fig. 3. Denote the modality-specific features of the dual-branch encoder as $F_a \in \mathbb{R}^{C\times(H\times W)}$ and $F_b \in \mathbb{R}^{C\times(H\times W)}$, and the upsampled feature from the decoder as $F_s \in \mathbb{R}^{C\times(H\times W)}$. In the spatial attention guidance stage, $F_a$ and $F_b$ are each concatenated with $F_s$, and the concatenated features are fed into a point-wise convolution operation to generate two spatial attention maps $M_a \in \mathbb{R}^{H\times W}$ and $M_b \in \mathbb{R}^{H\times W}$.
The spatial attention maps are scaled to $[0,1]$ using a Sigmoid function. To suppress noise features while emphasizing informative features, spatial multiplication is performed between the spatial attention maps and the input features, yielding filtered features $F'_a$ and $F'_b$. In the channel attention guidance stage, global average pooling and global max pooling are applied to the filtered features $F'_a$ and $F'_b$ respectively, giving channel context descriptors $P_a^{avg}$, $P_a^{max}$ and $P_b^{avg}$, $P_b^{max}$. The max-pooled and average-pooled results are added for each branch: $P_a = P_a^{avg} + P_a^{max}$, $P_b = P_b^{avg} + P_b^{max}$. They are then combined into $F_p = [P_a, P_b] \in \mathbb{R}^{2C\times 1}$, and a softmax operation is performed along the channel dimension: $\hat{P}_a^i = \frac{\exp(P_a^i)}{\exp(P_a^i)+\exp(P_b^i)}$, $\hat{P}_b^i = \frac{\exp(P_b^i)}{\exp(P_a^i)+\exp(P_b^i)}$, where $\hat{P}_a^i$ and $\hat{P}_b^i$ denote the $i$-th element of the corresponding channel attention map, so that $\hat{P}_a^i + \hat{P}_b^i = 1$, and $P_a^i$ and $P_b^i$ denote the $i$-th elements of $P_a$ and $P_b$ respectively. The informative features are obtained by fusing the dual-branch features: $F = \hat{P}_a \odot F'_a + \hat{P}_b \odot F'_b$, where $F$ denotes the fused feature map. Finally, the concatenation of $F$ and $F_s$ is used as the input of the next upsampling stage; the detailed architecture is shown in Fig. 3.
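The two stages can be put together in a short module sketch (shapes follow the formulas above; the exact convolution and pooling choices are assumptions):

```python
import torch
import torch.nn as nn

class ADF(nn.Module):
    """Attention-Driven Fusion sketch: spatial attention gates each encoder
    branch against the decoder feature F_s, then a channel softmax across the
    two branches fuses them, following the formulas above."""
    def __init__(self, channels):
        super().__init__()
        self.pw_a = nn.Conv2d(2 * channels, 1, kernel_size=1)  # point-wise conv -> M_a
        self.pw_b = nn.Conv2d(2 * channels, 1, kernel_size=1)  # point-wise conv -> M_b

    def forward(self, fa, fb, fs):                     # all (B, C, H, W)
        ma = torch.sigmoid(self.pw_a(torch.cat([fa, fs], 1)))  # spatial map in [0, 1]
        mb = torch.sigmoid(self.pw_b(torch.cat([fb, fs], 1)))
        fa, fb = fa * ma, fb * mb                      # filtered features F'_a, F'_b
        pa = fa.mean(dim=(2, 3)) + fa.amax(dim=(2, 3)) # avg-pool + max-pool descriptor
        pb = fb.mean(dim=(2, 3)) + fb.amax(dim=(2, 3))
        w = torch.softmax(torch.stack([pa, pb]), dim=0)        # channel attention, sums to 1
        f = w[0, :, :, None, None] * fa + w[1, :, :, None, None] * fb
        return torch.cat([f, fs], 1)                   # input to the next upsampling stage
```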
2) Feature consistency module: in order to obtain a rich and generic representation of the modality features learned from large-scale data, the pre-trained encoder is chosen as the core of the feature consistency module. Specifically, the synthesized image $\hat{y} = G(x)$ and the corresponding true value $y$ of the missing modality are fed into the pre-trained PC-MAE encoder $E$, where $x$, $E$ and $G$ denote the input images, the pre-trained encoder and the proposed MCF-Net respectively. Let $F_j(y)$ and $F_j(\hat{y})$ denote the outputs of the $j$-th Transformer layer of the feature consistency module, extracting multi-level features from $y$ and $\hat{y}$ respectively. The perceptual difference between the synthesized image and the true value is measured with the feature consistency loss, defined as:

$$\mathcal{L}_{fc} = \mathbb{E}\left[\sum_{j=1}^{l} \left\lVert F_j(y) - F_j(\hat{y}) \right\rVert_1\right]$$

where $\mathbb{E}$ denotes the expectation and $l$ denotes the number of Transformer layers of the feature consistency module. The feature consistency loss is used in the framework to place more emphasis on content and style similarity between images. In addition, the conventional pixel-wise difference between the synthesized image and the true value is measured by:

$$\mathcal{L}_{pix} = \mathbb{E}\left[\left\lVert y - \hat{y} \right\rVert_1\right]$$

Then the overall objective function of the fine-tuning framework is formed by their linear combination:

$$\mathcal{L}_{total} = \mathcal{L}_{pix} + \lambda\,\mathcal{L}_{fc}$$

where $\lambda$ is the weight parameter used to balance the two loss functions.
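The fine-tuning objective in code form (a sketch: `feat_layers` is a hypothetical helper returning the list [F_1(.), ..., F_l(.)] from the pre-trained encoder, and the lambda value is illustrative, not taken from the patent):

```python
def finetune_loss(y, y_hat, feat_layers, lam=0.1):
    """L_total = L_pix + lambda * L_fc on torch tensors, with L1 pixel and
    feature terms as in the formulas above."""
    l_pix = (y - y_hat).abs().mean()
    l_fc = sum((fy - fyh).abs().mean()
               for fy, fyh in zip(feat_layers(y), feat_layers(y_hat)))
    return l_pix + lam * l_fc
```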
The core goal of this embodiment is to synthesize missing modalities from limited paired data; in medical image processing, such data is often available only in small amounts due to the high cost of data acquisition and labeling. This embodiment can therefore use limited paired data to generate the missing modalities, so that such data is better utilized in medical image analysis and diagnosis.
1. The data pairing time can be saved: conventional supervised learning methods typically require a large amount of paired multi-modal data to train. However, in practical applications, it is often expensive or impractical to obtain sufficient pairing data. The method provided by the embodiment can effectively utilize limited pairing data.
2. Can support medical diagnosis and research: the missing modalities may lead to incomplete medical image data, which can affect clinical diagnosis and scientific research. By synthesizing the missing modalities, doctors and researchers can obtain more comprehensive image information, helping to make more accurate diagnoses and conduct more intensive studies.
3. Can save cost: obtaining complete multimodal medical image data generally requires more cost and time, including additional scan time and equipment. By synthesizing the missing modalities, the required information can be obtained without increasing cost and time, thereby improving the efficiency of medical care.
4. Can enhance the case data: for medical research and education, it is very valuable to have more case data. The synthetic missing modality can be used to augment existing medical image datasets, enabling researchers to conduct more extensive research and analysis.
Furthermore, in some cases, a physician may need multimodal information to formulate a treatment plan for a patient. The resultant missing modalities may provide the physician with complete information that helps to better plan the treatment plan.
The above procedure was tested on three different synthesis tasks, as shown in Figs. 4-6. In each figure, from left to right, columns 1-2 show the inputs to the model, column 3 shows the true result to be synthesized, column 4 shows the result synthesized by the model, and columns 5-8 show the results synthesized by other comparison methods. The comparison shows that this embodiment achieves good results on all three tasks, which also highlights its effectiveness in multi-modality MR image synthesis.
Embodiment two:
A multi-modality MR image synthesis system comprising:
A patch complementary pre-training module configured to: acquire MR images of different modalities and divide them into mutually non-overlapping patches; mask at least two of the modality images in a randomly aligned manner; extract the aligned visible tokens and their corresponding positions in each modality image based on an encoder; obtain an image of the target modality based on a decoder; apply consistency processing to the features of the target-modality image; and obtain a synthesized image of the real modality through training;
an edge enhancement fine-tuning module configured to: during training, use a generator to obtain the edge map corresponding to each modality image and concatenate it with the original modality image to form the feature representation of the given modality.
MR images are widely used in medical image processing; because of the high cost of data acquisition and labeling, only a small amount of paired data is often available, and a missing modality may make the medical image data incomplete, affecting subsequent clinical diagnosis and scientific research. By synthesizing the missing modality from limited paired data, the time spent pairing data can be saved and the cost of the MR equipment can be reduced.
Embodiment III:
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the multi-modality MR image synthesis method as described in the above embodiment.
Embodiment four:
the present embodiment provides a computer device, including a memory, a processor and a computer program stored on the memory and executable on the processor, where the processor implements the steps in the multi-modality MR image synthesis method according to the above embodiment when executing the program.
The steps involved in the second to fourth embodiments correspond to those of the first embodiment; for details, refer to the related description of the first embodiment. The term "computer-readable storage medium" should be understood to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium capable of storing, encoding or carrying a set of instructions for execution by a processor that cause the processor to perform any one of the methods of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A multi-modality MR image synthesis method, characterized by comprising the following steps:
acquiring MR images of different modalities and dividing them into mutually non-overlapping patches; masking at least two of the modality images in a randomly aligned manner; extracting the aligned visible tokens and their corresponding positions in each modality image based on an encoder; obtaining an image of the target modality based on a decoder; applying consistency processing to the features of the target-modality image; and obtaining a synthesized image of the real modality through training;
during training, using a generator to obtain the edge map corresponding to each modality image and concatenating it with the original modality image to form the feature representation of the given modality.
2. The multi-modality MR image synthesis method of claim 1, wherein the aligned visible tokens and their corresponding positions in each modality image are extracted based on an encoder, specifically: the encoder corresponding to each modality is used to acquire the aligned visible tokens and their corresponding positions in that modality image, and a shared encoder is used to acquire high-dimensional encoded representations of the different modality images.
3. The multi-modality MR image synthesis method of claim 1, wherein the image of the target modality is obtained based on a decoder, specifically: cross-modality interaction is performed by a shared decoder corresponding to the shared encoder, and the images of the corresponding target modalities are obtained based on the specific decoders.
4. The multi-modality MR image synthesis method of claim 1, wherein, for the feature representation of a given modality, downsampling is realized through the encoder branch corresponding to each modality, the decoder corresponding to each modality restores the feature resolution, and the image of the target modality is obtained through fusion.
5. The multi-modality MR image synthesis method of claim 4, wherein obtaining the image of the target modality through fusion comprises:
determining the feature representation acquired by the encoder corresponding to each modality and the upsampled feature representation acquired by the decoder corresponding to each modality;
in the spatial attention guidance stage, concatenating the feature representation acquired by the encoder with the upsampled feature representation acquired by the decoder, obtaining a spatial attention map through a point-wise convolution operation, and scaling it;
performing spatial multiplication between the spatial attention map and the input features to obtain filtered features.
6. The multi-modality MR image synthesis method of claim 4, wherein obtaining the image of the target modality through fusion further comprises:
in the channel attention guidance stage, applying pooling to the filtered features to obtain channel context descriptors;
recombining the channel context descriptors to obtain a fused feature map, which serves as the input of the next upsampling stage.
7. The multi-modality MR image synthesis method of claim 1, wherein the consistency processing is specifically: using the fused target-modality image and the corresponding true value of the missing modality, measuring the difference between the synthesized image and the true value based on a feature consistency loss, and balancing the loss functions through a weight parameter to realize the consistency processing.
8. A multi-modality MR image synthesis system, characterized by comprising:
a patch complementary pre-training module configured to: acquire MR images of different modalities and divide them into mutually non-overlapping patches; mask at least two of the modality images in a randomly aligned manner; extract the aligned visible tokens and their corresponding positions in each modality image based on an encoder; obtain an image of the target modality based on a decoder; apply consistency processing to the features of the target-modality image; and obtain a synthesized image of the real modality through training;
an edge enhancement fine-tuning module configured to: during training, use a generator to obtain the edge map corresponding to each modality image and concatenate it with the original modality image to form the feature representation of the given modality.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the multi-modality MR image synthesis method as claimed in any one of claims 1-7.
10. A computer device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, realizes the steps of the multi-modality MR image synthesis method according to any one of claims 1-7.
CN202410037947.3A 2024-01-09 2024-01-09 Multi-mode MR image synthesis method, system, storage medium and equipment Active CN117994143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410037947.3A CN117994143B (en) 2024-01-09 2024-01-09 Multi-mode MR image synthesis method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410037947.3A CN117994143B (en) 2024-01-09 2024-01-09 Multi-mode MR image synthesis method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN117994143A (en) 2024-05-07
CN117994143B CN117994143B (en) 2024-08-20

Family

ID=90886560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410037947.3A Active CN117994143B (en) 2024-01-09 2024-01-09 Multi-mode MR image synthesis method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN117994143B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972231A (en) * 2022-05-17 2022-08-30 华中科技大学 Multi-modal MR image segmentation method based on prior-posterior probability encoder
CN115861464A (en) * 2022-12-01 2023-03-28 南方医科大学 Pseudo CT (computed tomography) synthesis method based on multimode MRI (magnetic resonance imaging) synchronous generation
CN115965638A (en) * 2022-12-25 2023-04-14 衡阳师范学院 Twin self-distillation method and system for automatically segmenting modal-deficient brain tumor image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SALMAN UH DAR ET AL.: "Image synthesis in multi-contrast MRI with conditional generative adversarial networks", 《IEEE TRANS. MED. IMAGING》, vol. 38, no. 10, 26 February 2019 (2019-02-26), pages 2375, XP011748194, DOI: 10.1109/TMI.2019.2901750 *
YONGHAO LI ET AL.: "Multi-Scale Transformer Network With Edge-Aware Pre-Training for Cross-Modality MR Image Synthesis", 《IEEE TRANSACTIONS ON MEDICAL IMAGING》, vol. 42, no. 11, 20 June 2023 (2023-06-20), pages 3395 - 3407 *

Also Published As

Publication number Publication date
CN117994143B (en) 2024-08-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant