CN114882135A - CT image synthesis method, device, equipment and medium based on MR image

CT image synthesis method, device, equipment and medium based on MR image

Info

Publication number
CN114882135A
Authority
CN
China
Prior art keywords: image, result, module, feature extraction, processing
Prior art date
Legal status
Pending
Application number
CN202210428309.5A
Other languages
Chinese (zh)
Inventor
陈泽立
钟丽明
阳维
Current Assignee
Southern Medical University
Original Assignee
Southern Medical University
Priority date
Filing date
Publication date
Application filed by Southern Medical University
Priority to CN202210428309.5A
Publication of CN114882135A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/003 Reconstruction from projections, e.g. tomography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10072 Tomographic images
    • G06T 2207/10081 Computed x-ray tomography [CT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10072 Tomographic images
    • G06T 2207/10088 Magnetic resonance imaging [MRI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30096 Tumor; Lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a CT image synthesis method, apparatus, device and medium based on MR images. A first MR image and a first CT image are input into a network model for training to obtain an image synthesis model. The global block embedding module performs global feature extraction on the first MR image, so that the global feature extraction result has a stronger global character; the encoding module performs local feature extraction on the first MR image; the RSC self-attention module performs first feature processing according to the global feature extraction result and the local feature extraction result and enlarges the receptive field of the features; and the decoding module decodes the first feature processing result to generate a second CT image. The discriminator generates a discrimination result from the first CT image and the second CT image for training, and the finally obtained target CT image has richer and clearer details.

Description

CT image synthesis method, device, equipment and medium based on MR image
Technical Field
The invention relates to the field of image processing, and in particular to a CT image synthesis method, device, equipment and medium based on MR images.
Background
Existing treatment of malignant tumors such as head and neck cancer may adopt a radiation therapy plan. In a radiation therapy plan, computed tomography (CT) images of the patient's lesion need to be acquired in order to plan the radiation dose and localize the tumor, and CT is therefore regarded as the gold standard for this purpose. Magnetic resonance imaging (MRI, or MR), with its superior soft-tissue contrast compared to CT, is increasingly used alongside CT in radiotherapy planning to segment tumor tissue from healthy organs. However, acquiring both MR and CT images of a patient imposes a considerable economic burden on the patient and is inefficient, so the academic community has begun to explore synthesizing the corresponding CT image from the MR image.
Traditional synthesis methods are mainly based on feature matching, for example models based on segmentation, image blocks, atlases and sparse coding, but all have certain drawbacks. In segmentation-based methods, because each tissue type shares the same CT value, blurred image details are lost; in sparse-coding-based models, since the sparse codes at all image positions are optimized, the computational cost of the algorithm is often high. Traditional methods therefore often require high data quality and high computational cost, and the synthesized images suffer from blurring and severe loss of detail, so a solution needs to be provided.
Disclosure of Invention
In view of the above, in order to solve at least one of the above technical problems, an object of the present invention is to provide a CT image synthesis method, apparatus, device and medium based on MR images.
The embodiment of the invention adopts the technical scheme that:
the CT image synthesis method based on the MR image comprises the following steps:
acquiring training data; the training data comprises a first MR image and a first CT image with the same target object;
inputting the first MR image and the first CT image into a network model for training to obtain an image synthesis model; the network model comprises a CST generator and a discriminator, wherein the CST generator comprises an encoding module, a decoding module, an RSC self-attention module and a global block embedding module; the global block embedding module is used for carrying out global feature extraction on the first MR image, the coding module is used for carrying out local feature extraction on the first MR image, the RSC self-attention module is used for carrying out first feature processing according to a global feature extraction result and a local feature extraction result, and the decoding module is used for decoding a first feature processing result to generate a second CT image; the discriminator is used for generating discrimination results according to the first CT image and the second CT image so as to carry out training;
and inputting the MR image to be synthesized into the image synthesis model to obtain a target CT image.
Further, the acquiring training data includes:
acquiring an original MR image and an original CT image of the same target object;
performing offset field operation on the original MR image and then registering the original MR image with the original CT image to obtain a registered MR image and a registered CT image;
performing assignment processing of a preset range on the registration CT image;
carrying out normalization processing on the registered MR image and the registered CT image after the assignment processing;
and intercepting an image with a preset size from the registration MR image after the normalization processing to obtain a first MR image, and intercepting an image with a preset size from the registration CT image after the normalization processing to obtain a first CT image.
Further, the inputting the first MR image and the first CT image into a network model for training to obtain an image synthesis model includes:
performing local feature extraction on the first MR image through the encoding module;
performing global feature extraction on the first MR image through the global block embedding module;
fusing the local feature extraction result and the global feature extraction result, and performing first feature processing on the fused result through the RSC self-attention module;
decoding the first characteristic processing result through a decoding module to generate a second CT image;
inputting the first CT image and the second CT image into the discriminator to obtain a discrimination result;
and training a network model according to the identification result and a preset loss function to obtain an image synthesis model.
Further, the encoding module comprises at least a first CNN module, a first RSC self-attention module, and a second CNN module; the local feature extraction of the first MR image by the encoding module includes:
performing at least one time of first feature extraction on the first MR image through the first CNN module to obtain first feature information;
performing second feature processing on the first feature information through a first RSC self-attention module;
and performing second feature extraction on the second feature processing result through the second CNN module to obtain a local feature extraction result.
Further, the global block embedding module comprises a block embedding module and a second RSC self-attention module; the global feature extraction of the first MR image by the global block embedding module includes:
performing second feature extraction on the first MR image through the block embedding module to obtain second feature information;
and performing third feature processing on the second feature information through the second RSC self-attention module to obtain a global feature extraction result.
Further, the first signature process comprises at least one signature sub-process, the RSC self-attention module comprising at least one third RSC self-attention module; the performing, by the RSC self-attention module, a first feature processing on the fusion result includes:
performing the feature sub-processing on the fusion result through the third RSC self-attention module, wherein the feature sub-processing specifically includes:
performing a first separable convolution on the fusion result, performing a first dilated convolution on the first separable convolution result, performing a second separable convolution on the first dilated convolution result, and performing a second dilated convolution on the second separable convolution result;
performing first-layer normalization according to the second expansion convolution result, and performing window multi-head attention on the first-layer normalization result;
performing first addition on the multi-head attention result of the window and the second expansion convolution result, performing second-layer normalization on the first addition result, and performing processing on the second-layer normalization result by using a first multilayer perceptron;
performing second addition on the processing result of the first multilayer perceptron and the first addition result, performing third-layer normalization on the second addition result, and performing moving-window multi-head attention on the third-layer normalization result;
performing third addition on the multi-head attention result of the moving window and the second addition result, performing fourth-layer normalization on the third addition result, and performing processing on the fourth-layer normalization result by using a second multilayer perceptron;
and performing fourth addition according to the processing result of the second multilayer perceptron, the third addition result and the second expansion convolution result to obtain a characteristic sub-processing result.
Further, the training a network model according to the identification result and a preset loss function to obtain an image synthesis model includes:
determining a confrontation loss value of the discriminator according to the discrimination result and a confrontation loss function;
determining an average absolute error value according to the first CT image, the second CT image and an average absolute error loss function;
determining a perceptual loss value according to the first CT image, the second CT image and a perceptual loss function;
and adjusting network parameters of a network model according to the confrontation loss value, the average absolute error value and the perception loss value to obtain an image synthesis model.
The embodiment of the present invention further provides a CT image synthesis apparatus based on MR images, including:
the acquisition module is used for acquiring training data; the training data comprises a first MR image and a first CT image of the same target object;
the training module is used for inputting the first MR image and the first CT image into a network model for training to obtain an image synthesis model; the network model comprises a CST generator and a discriminator, wherein the CST generator comprises an encoding module, a decoding module, an RSC self-attention module and a global block embedding module; the global block embedding module is used for carrying out global feature extraction on the first MR image, the coding module is used for carrying out local feature extraction on the first MR image, the RSC self-attention module is used for carrying out first feature processing according to a global feature extraction result and a local feature extraction result, and the decoding module is used for decoding a first feature processing result to generate a second CT image; the discriminator is used for generating discrimination results according to the first CT image and the second CT image so as to carry out training;
and the synthesis module is used for inputting the MR image to be synthesized into the image synthesis model to obtain a target CT image.
An embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method.
Embodiments of the present invention also provide a computer-readable storage medium, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the storage medium, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the method.
The beneficial effects of the invention are as follows: training data is acquired, the training data comprising a first MR image and a first CT image of the same target object; the first MR image and the first CT image are input into a network model for training to obtain an image synthesis model; the network model comprises a CST generator and a discriminator, wherein the CST generator comprises an encoding module, a decoding module, an RSC self-attention module and a global block embedding module; the global block embedding module performs global feature extraction on the first MR image, so that the global feature extraction result has a stronger global character; the encoding module performs local feature extraction on the first MR image; the RSC self-attention module performs first feature processing according to the global feature extraction result and the local feature extraction result and enlarges the receptive field of the features; and the decoding module decodes the first feature processing result to generate a second CT image; the discriminator generates a discrimination result according to the first CT image and the second CT image for training, and the MR image to be synthesized is input into the image synthesis model, so that the details of the finally obtained target CT image are richer and clearer.
Drawings
FIG. 1 is a schematic flow chart illustrating the steps of the MR image-based CT image synthesis method according to the present invention;
FIG. 2 is a schematic diagram of a network model according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a third RSC self-attention module in accordance with certain embodiments of the present invention;
FIG. 4 is a schematic diagram of the generation of an authentication result according to an embodiment of the present invention;
FIG. 5(a) is a schematic diagram of an MR image to be synthesized according to an embodiment, and FIG. 5(b) is a real CT corresponding to the MR image to be synthesized according to an embodiment; FIG. 5(c) is a schematic representation of a target CT image in accordance with one embodiment; fig. 5(d) is a schematic diagram of the output difference.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
The following terms used in the embodiments of the present invention are explained:
transformer: a neural network based on a self-attention mechanism can acquire global context information, and is one of the main network frameworks for deep learning at present.
Swin Transformer: a neural network based on a hierarchical local-window self-attention mechanism.
Convolutional neural network: CNN, a feedforward neural network widely used in image processing and speech recognition, and one of the main network frameworks for deep learning at present.
Generative adversarial network: GAN, in which learning is performed by making two neural networks (a generator and a discriminator) compete with each other, and the output result must imitate the real samples in the training set as closely as possible.
Patch embedding: block embedding, similar to word embedding in natural language processing, divides an image into image blocks, and embeds a vector for each image block. The length of the vector represents the coding length of the image.
Registration: in the medical field, one spatial transformation (or a series of transformations) is used to bring the corresponding points of one medical image and another medical image into spatial agreement; agreement means that the same anatomical point on the human body has the same spatial position in the two matched images.
Instance normalization: also called contrast normalization. Normalization rescales the values of numerical columns in a data set to a common scale when the features have different ranges, which can accelerate optimization. Instance normalization, however, is applied to each individual image rather than to a whole batch of images.
Context information: generally understood to be the perception and application of some or all of the information that can affect objects in scenes and images, and in the image domain primarily understood to be some connection of pixels and surrounding pixels.
Residual connection: also called jump connection, is a core method in a residual neural network to alleviate the phenomena of gradient dissipation and gradient explosion occurring in a deep neural network. The output is a linear superposition of the input and a non-linear variation of the input.
MLP: a multilayer perceptron, which is a neural network of forward architecture, maps a set of input vectors to a set of output vectors.
Tanh: an activation function of a neural network that maps inputs to the interval -1 to 1.
GELU: is an activation function of a neural network.
ReLU: is an activation function of a neural network.
Discrete wavelet transform: the scale and translation of a basic wavelet are discretized; dyadic wavelets are commonly used as the wavelet function in image processing. Signals can be decomposed at different scales, and the choice of scales can be determined according to different targets.
As shown in fig. 1, an embodiment of the present invention provides a CT image synthesis method based on MR images, including steps S100-S300:
and S100, acquiring training data.
In an embodiment of the present invention, the training data includes a first MR image and a first CT image of the same target object; that is, the first MR image and the first CT image are both images obtained by capturing the same target object, for example, but not limited to, a certain part of the object such as a hand, a leg or a neck. MR is short for magnetic resonance imaging (MRI), and CT refers to computed tomography.
Optionally, step S100 includes steps S110-S150:
s110, acquiring an original MR image and an original CT image of the same target object.
It should be noted that there may be several original MR images and original CT images, i.e., images before preprocessing.
And S120, registering the original MR image with the original CT image after the offset field operation is carried out on the original MR image to obtain a registered MR image and a registered CT image.
Optionally, the original MR image and the original CT image need to be preprocessed. Specifically, an N4 bias field (offset field) correction is performed on the original MR image, which is then registered with the original CT image using the Elastix method, and the length and width of the images are adjusted to a resolution of 512, obtaining a registered MR image and a registered CT image.
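As an illustrative sketch only, one possible implementation of this preprocessing step in Python is given below, using SimpleITK and assuming a build with the SimpleElastix extension (which provides ElastixImageFilter); the Otsu mask, the rigid parameter map and the function name are exemplary choices, and the resampling to a resolution of 512 is omitted:

import SimpleITK as sitk

def correct_and_register(mr_path, ct_path):
    """N4 bias (offset) field correction of the original MR image, then registration to the CT."""
    mr = sitk.ReadImage(mr_path, sitk.sitkFloat32)
    ct = sitk.ReadImage(ct_path, sitk.sitkFloat32)

    # N4 offset-field operation on the original MR image
    mask = sitk.OtsuThreshold(mr, 0, 1, 200)
    mr_corrected = sitk.N4BiasFieldCorrection(mr, mask)

    # Registration of the corrected MR image to the original CT image with Elastix
    elastix = sitk.ElastixImageFilter()
    elastix.SetFixedImage(ct)
    elastix.SetMovingImage(mr_corrected)
    elastix.SetParameterMap(sitk.GetDefaultParameterMap("rigid"))
    elastix.Execute()
    return elastix.GetResultImage(), ct   # registered MR image, registered CT image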
And S130, carrying out assignment processing of a preset range on the registered CT image.
Optionally, the registered MR images and the registered CT images are divided into a training set, a validation set and a test set according to a preset ratio. Assignment processing of a preset range is performed on the registered CT images in the training set, where the preset range can be adjusted in practice. For example: CT values of the registered CT image lower than -1000 are set to -1000 and values higher than 2500 are set to 2500, i.e., the preset range is set to -1000 to 2500.
And S140, carrying out normalization processing on the registered MR image and the registered CT image after the assignment processing.
In the embodiment of the invention, the values of the registered MR image and the registered CT image after the value assignment processing are normalized to be-1 to 1.
S150, intercepting an image with a preset size from the registration MR image after normalization processing to obtain a first MR image, and intercepting an image with a preset size from the registration CT image after normalization processing to obtain a first CT image.
Specifically, a central point is selected from the normalized registered MR image, a region of a certain size, for example, a region of a size 8 × 256 × 256, is determined with the central point as the center, and then images of a size 8 × 256 × 256 are respectively cut at corresponding positions in the normalized registered MR image and the normalized registered CT image, so as to obtain a first MR image and a first CT image.
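As an illustrative sketch only, steps S130 to S150 (assignment within the preset range, normalization to -1 to 1, and cropping a region of size 8 × 256 × 256 around the selected central point) can be rendered in NumPy as follows; the function name and the per-volume min-max normalization are exemplary assumptions:

import numpy as np

def make_training_pair(mr, ct, center, size=(8, 256, 256)):
    """mr, ct: registered volumes of shape (D, H, W); center: (d, h, w) patch centre."""
    # S130: assign CT values outside the preset range [-1000, 2500] to the range limits
    ct = np.clip(ct, -1000, 2500)

    # S140: normalize both volumes to the range [-1, 1] (min-max scaling is one possible choice)
    def to_minus_one_one(x):
        x = (x - x.min()) / (x.max() - x.min() + 1e-8)
        return x * 2.0 - 1.0
    mr, ct = to_minus_one_one(mr), to_minus_one_one(ct)

    # S150: crop the same preset-size region from both volumes around the central point
    starts = [int(c) - s // 2 for c, s in zip(center, size)]
    patch = tuple(slice(st, st + s) for st, s in zip(starts, size))
    return mr[patch], ct[patch]   # first MR image, first CT image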
S200, inputting the first MR image and the first CT image into a network model for training to obtain an image synthesis model.
Optionally, the network model comprises a CST generator and a discriminator. The basic framework of the CST generator is a UNet structure; the CST generator is called a CNN-Swin-Transformer generator and is a deep learning model combining the Swin Transformer and a CNN network. In the embodiment of the invention, the CST generator comprises an encoding module, a decoding module, an RSC self-attention module and a global block embedding module. It should be noted that, by using the CNN-Swin-Transformer generator, the local-window self-attention mechanism makes the computational complexity of the Transformer linear in the image size, thereby reducing the computational overhead.
Optionally, step S200 includes steps S210-S260:
and S210, local feature extraction is carried out on the first MR image through an encoding module.
Optionally, the encoding module includes at least a first CNN module, a first RSC self-attention module, and a second CNN module.
In the embodiment of the present invention, step S210 includes steps S2101 to S2103:
s2101, performing at least one time of first feature extraction on the first MR image through a first CNN module to obtain first feature information.
As shown in fig. 2, it should be noted that the number of first CNN modules may be set as needed. In the embodiment of the present invention, three first CNN modules are set between the first RSC self-attention module 104 and the input first MR image as an example, namely the first CNN module 101, the first CNN module 102 and the first CNN module 103. The first feature extraction performed by each first CNN module is the same: the first CNN module 101 performs first feature extraction on its input, the first CNN module 102 performs first feature extraction on its input (i.e., the output of the first CNN module 101), and the first CNN module 103 performs first feature extraction on its input (i.e., the output of the first CNN module 102). That is, the first MR image undergoes three rounds of first feature extraction, and the first CNN module 103 outputs the first feature information.
Optionally, each first feature extraction includes convolution, instance normalization and a ReLU activation function performed in sequence. The principle is to map the number of channels of the input (e.g., the first MR image) to a high-dimensional space and output a feature map, whose size may include the number of channels, the layer thickness, the height and the width of the feature map.
S2102, performing a second feature process on the first feature information by the first RSC self-attention module.
Specifically, the first feature information output by the first CNN module 103 is input into the first RSC self-attention module 104 to perform the second feature processing on the first feature information.
S2103, performing second feature extraction on the second feature processing result through the second CNN module to obtain a local feature extraction result.
It should be noted that the second feature extraction of the second CNN module 105 is similar to the first feature extraction. Optionally, the kernel size, padding and number of channels of the first CNN module 101 are set to 3 × 5 × 5, 1 × 2 × 2 and 16; the convolution kernels and padding of the first CNN module 102, the first CNN module 103 and the second CNN module 105 are set to 3 × 3 × 3 and 1 × 1 × 1, and the number of channels of each is set to twice that of the previous CNN module (which may be the second CNN module or the first CNN module); for example, the number of channels of the first CNN module 103 is twice that of the first CNN module 102. In addition, the convolution stride of the first CNN module 102 is set to 2 × 2 × 2, and the convolution strides of the first CNN module 103 and the second CNN module 105 are set to 1 × 2 × 2, which avoids excessive down-sampling in the depth dimension D; since the input image has a relatively small number of layers D, such down-sampling would destroy image information.
In the embodiment of the present invention, the image is down-sampled by controlling the convolution strides of the first CNN module 102, the first CNN module 103 and the second CNN module 105, and the number of channels is doubled. After three convolutions with down-sampling, the deep local feature information of the image, i.e., the local feature extraction result, is obtained; for example, the local feature extraction result finally output by the second CNN module 105 is a feature map F_local ∈ R^(C×H×W×D), where R denotes the set of real numbers, C is the number of channels of the first MR image, H is the height of the first MR image, W is the width of the first MR image, and D is the thickness (number of layers) of the first MR image.
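As an illustrative sketch only, one convolution + instance normalization + ReLU block and the three first CNN modules described above can be written in PyTorch as follows; the single input channel, the stride of the first CNN module 101 and the concrete channel widths are exemplary assumptions:

import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """Convolution + instance normalization + ReLU, as used by the first and second CNN modules."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1)):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=kernel, stride=stride, padding=padding),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Encoder stack following the sizes given above (channel widths are illustrative):
enc101 = CNNBlock(1, 16, kernel=(3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2))
enc102 = CNNBlock(16, 32, stride=(2, 2, 2))   # doubles channels, downsamples D, H and W
enc103 = CNNBlock(32, 64, stride=(1, 2, 2))   # doubles channels, downsamples H and W only

x = torch.randn(1, 1, 8, 256, 256)            # (batch, C, D, H, W)
print(enc103(enc102(enc101(x))).shape)        # torch.Size([1, 64, 4, 64, 64])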
S220, global feature extraction is carried out on the first MR image through a global block embedding module.
As shown in fig. 2, optionally, the global block embedding module 201 forms a global Patch embedding branch for performing global feature extraction on the first MR image. Specifically, the global block embedding module 201 includes a block embedding module (Patch embedding module 202) and a second RSC self-attention module 203.
Optionally, step S220 includes steps S2201-S2202:
s2201, performing second feature extraction on the first MR image through a block embedding module to obtain second feature information.
In the embodiment of the invention, let C be the number of channels of the first MR image, H the height, W the width, and D the thickness (number of layers) of the first MR image, so that the first MR image F ∈ R^(C×H×W×D). The second feature extraction includes, but is not limited to, convolution processing, dimension flattening, transposition, layer normalization and restoration processing:
The convolution processing yields a convolution image block F_conv, with the formula:

F_conv = Conv(F)

where Conv(·) represents a convolution operation.
Then, the height, width and thickness dimensions of F_conv are flattened into one dimension and transposed, and layer normalization is performed over the C channels to obtain the layer normalization result F_Pemb, with the formula:

F_Pemb = Layernorm(transpose(flatten(F_conv)))

where Layernorm represents layer normalization, transpose represents transposition, and flatten represents flattening of dimensions. Finally, the normalization result F_Pemb is restored to the dimensions of the original input of the block embedding module to obtain the second feature information, which is then output.
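As an illustrative sketch only, this block embedding flow (convolution, dimension flattening, transposition, layer normalization and restoration) can be written in PyTorch as follows; the patch size of 2 × 4 × 4 and the embedding dimension of 96 are exemplary assumptions:

import torch
import torch.nn as nn

class PatchEmbedding3D(nn.Module):
    """Block (patch) embedding: convolution, flatten, transpose, layer normalization, restore."""
    def __init__(self, in_ch, embed_dim, patch_size=(2, 4, 4)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                            # x: (B, C, D, H, W)
        f_conv = self.proj(x)                        # F_conv = Conv(F)
        B, C, D, H, W = f_conv.shape
        tokens = f_conv.flatten(2).transpose(1, 2)   # flatten D, H, W and transpose: (B, D*H*W, C)
        tokens = self.norm(tokens)                   # layer normalization over the C channels
        # restore the features to the layout of the block embedding module's input
        return tokens.transpose(1, 2).reshape(B, C, D, H, W)

x = torch.randn(1, 1, 8, 256, 256)
print(PatchEmbedding3D(1, 96)(x).shape)              # torch.Size([1, 96, 4, 64, 64])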
And S2202, performing third feature processing on the second feature information through the second RSC self-attention module to obtain a global feature extraction result.
As shown in fig. 2, in the embodiment of the present invention, in order to increase the information interaction and global character among the features of the block embedding module (i.e., the Patch embedding module 202), a second RSC self-attention module 203 is introduced to perform third feature processing on the second feature information after it is output.
And S230, fusing the local feature extraction result and the global feature extraction result, and performing first feature processing on the fusion result through the RSC self-attention module.
Specifically, the global feature extraction result, which represents features at the image-block level, and the local feature extraction result, which represents features at the pixel level, are fused, where the fusion includes but is not limited to addition; the RSC self-attention module then performs first feature processing on the fusion result. It should be noted that the RSC self-attention module includes at least one third RSC self-attention module, i.e., a Residual Swin-CNN (RSC) Transformer module. The number of third RSC self-attention modules can be adjusted according to the actual situation and is not particularly limited; each third RSC self-attention module performs one feature sub-processing on its input, so that the first feature processing includes at least one feature sub-processing.
In the embodiment of the invention, the RSC self-attention module is introduced, and separable convolution and dilated convolution are used to optimize the Swin Transformer module and enlarge the receptive field of the features; by combining this with the CNN, the local detail of the CNN is preserved while the global character of the features is enhanced.
For example, suppose the RSC self-attention module 300 includes 6 third RSC self-attention modules connected in sequence. The structure of each third RSC self-attention module is the same; the structures of the third RSC self-attention module, the second RSC self-attention module and the first RSC self-attention module are the same; and the feature sub-processing follows the same flow as the second feature processing and the third feature processing.
As shown in fig. 3, optionally, taking one of the third RSC self-attention modules as an example, the third RSC self-attention module includes a first unit 301, a first Swin Transformer module (comprising a second unit 302 and a third unit 303) and a second Swin Transformer module (comprising a fourth unit 304 and a fifth unit 305); the first unit 301, the second unit 302, the third unit 303, the fourth unit 304 and the fifth unit 305 are connected by residual connections. The feature sub-processing included in the first feature processing in step S230 specifically includes S2301 to S2306:
s2301, performing a first separable convolution on the fusion result, performing a first dilated convolution on the first separable convolution result, performing a second separable convolution on the first dilated convolution result, and performing a second dilated convolution on the second separable convolution result.
Specifically, for example, the input is a fusion result, the first unit 301 performs first separable convolution on the fusion result, performs first dilated convolution on the first separable convolution result, performs second separable convolution on the first dilated convolution result, and performs second dilated convolution on the second separable convolution result and outputs the result.
It should be noted that the separable convolution here is a shared separable convolution. An ordinary separable (depthwise) convolution processes each channel separately and possesses the same number of filters and feature maps as the number of channels, whereas sharing the weights among the filters of the shared separable convolution is equivalent to possessing only one filter, which can significantly reduce the computation of the convolution process. Optionally, the dilation rate used by the first dilated convolution and the second dilated convolution is 1 × 2 × 2, which can effectively enlarge the receptive field of the extracted features.
In the embodiment of the present invention, by introducing the combination of separable convolution and dilation convolution, it is beneficial to expand the receptive field of the input image, so that the image features subsequently input into the second unit 302 have stronger global property, and the quality of the finally synthesized CT image is improved.
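As an illustrative sketch only, the first unit 301 can be written in PyTorch as follows; interpreting the weight sharing of the shared separable convolution as a single filter applied to every channel is an assumption, and the kernel sizes are exemplary:

import torch
import torch.nn as nn

class SharedSeparableConv3d(nn.Module):
    """Depthwise convolution whose filter weights are shared across all channels
    (one 3-D filter applied to every channel; an assumption about 'weight sharing')."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.shared = nn.Conv3d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                              # x: (B, C, D, H, W)
        B, C, D, H, W = x.shape
        y = self.shared(x.reshape(B * C, 1, D, H, W))  # the same filter for every channel
        return y.reshape(B, C, D, H, W)

class FirstUnit(nn.Module):
    """Separable conv -> dilated conv -> separable conv -> dilated conv (unit 301)."""
    def __init__(self, channels, dilation=(1, 2, 2)):
        super().__init__()
        self.sep1, self.sep2 = SharedSeparableConv3d(), SharedSeparableConv3d()
        self.dil1 = nn.Conv3d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.dil2 = nn.Conv3d(channels, channels, 3, padding=dilation, dilation=dilation)

    def forward(self, x):
        return self.dil2(self.sep2(self.dil1(self.sep1(x))))

print(FirstUnit(64)(torch.randn(1, 64, 4, 64, 64)).shape)   # torch.Size([1, 64, 4, 64, 64])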
S2302, performing first-layer normalization according to the second expansion convolution result, and performing window multi-head attention on the first-layer normalization result.
Specifically, the input of the second unit 302 is the sum of the second dilated convolution result and the original input of the first unit 301 (a residual connection); the second unit 302 performs first-layer normalization on this result, performs window multi-head attention on the first-layer normalization result, and outputs it.
Assume that the input to the second unit 302 is X ∈ R^(C_s × D_s × H_s × W_s), where C_s is the number of channels, D_s the dimension (thickness), H_s the height and W_s the width. The dimensions of X from D_s onward are flattened and transposed, and the first-layer normalization is applied, giving the first-layer normalization result X_S ∈ R^(L × C_s), where L = D_s × H_s × W_s. The window multi-head attention WMSA consists of setting a three-dimensional local window and computing multi-head self-attention MSA; that is, the input image block is divided into three-dimensional local windows and multi-head self-attention is then computed within each window. Specifically, according to the three-dimensional local window size S_d × S_h × S_w (the depth, height and width of the three-dimensional local window), X_S is reshaped to X_S ∈ R^(N_win × S_win × C_s), where N_win represents the number of three-dimensional local windows, S_win = S_d × S_h × S_w represents the window size, and C_s represents the code length of each three-dimensional local window. The three-dimensional local window size includes, but is not limited to, 2 × 4 × 4, and the number of heads is 8. In addition, the multi-head self-attention MSA is calculated as follows:

MSA(Q, K, V) = Concat(head_1, …, head_i)W_0

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Attention(Q, K, V) = softmax(QK^T / sqrt(D_k))V

where Q, K and V are the query, key and value respectively; head_i represents the self-attention of the i-th head; Concat is the concatenation function; W_0 is the weight matrix of the multi-head attention; W_i^Q represents the weight matrix corresponding to Q, W_i^K the weight matrix corresponding to K, and W_i^V the weight matrix corresponding to V; Attention(·) is the self-attention function; softmax(·) represents an activation function of the neural network; D_k is the dimension of K; and K^T is the transpose of K.
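As an illustrative sketch only, the window partition and the window multi-head self-attention can be condensed in PyTorch as follows; torch.nn.MultiheadAttention realizes the per-head weight matrices W_i^Q, W_i^K, W_i^V and the output matrix W_0 internally, and the feature dimension used in the example is an exemplary assumption:

import torch
import torch.nn as nn

def window_partition(x, window=(2, 4, 4)):
    """(B, D, H, W, C) -> (B*N_win, S_win, C), with S_win = S_d * S_h * S_w."""
    B, D, H, W, C = x.shape
    sd, sh, sw = window
    x = x.view(B, D // sd, sd, H // sh, sh, W // sw, sw, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return x.view(-1, sd * sh * sw, C)

class WindowMSA(nn.Module):
    """Multi-head self-attention computed inside each three-dimensional local window (WMSA)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, windows):                        # windows: (B*N_win, S_win, C)
        out, _ = self.attn(windows, windows, windows)  # Q = K = V = the window tokens
        return out

x = torch.randn(1, 4, 64, 64, 96)                      # (B, D_s, H_s, W_s, C_s)
tokens = window_partition(x)                           # 512 windows of 2*4*4 = 32 tokens each
print(WindowMSA(96)(tokens).shape)                     # torch.Size([512, 32, 96])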
S2303, carrying out first addition according to the multi-head attention result of the window and the second expansion convolution result, carrying out second-layer normalization on the first addition result, and carrying out processing of the first multilayer perceptron on the second-layer normalization result.
Specifically, the input of the third unit 303 is the first addition result obtained by performing a first addition on the window multi-head attention result and the second dilated convolution result; the third unit 303 performs second-layer normalization on the first addition result and processes the second-layer normalization result with the first multilayer perceptron.
Specifically, the formula of the first multilayer perceptron MLP is as follows:

MLP(X) = GELU(XW_1 + b_1)W_2 + b_2

where X is the second-layer normalization result, GELU(·) is an activation function of the neural network, W_1 and W_2 are the weights of the two fully connected layers in the first multilayer perceptron MLP, and b_1 and b_2 are the bias terms of the two fully connected layers, respectively.
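As an illustrative rendering of this formula in PyTorch (the hidden width is left as a parameter, since it is not specified above):

import torch.nn as nn

class MLP(nn.Module):
    """Two fully connected layers with a GELU activation: MLP(X) = GELU(X·W_1 + b_1)·W_2 + b_2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)   # weights W_1, bias b_1
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)   # weights W_2, bias b_2

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))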
S2304, carrying out second addition on the processing result of the first multilayer perceptron and the first addition result, carrying out third-layer normalization on the second addition result, and carrying out moving-window multi-head attention on the third-layer normalization result.
Specifically, the input of the fourth unit 304 is the second addition result obtained by performing a second addition on the processing result of the first multilayer perceptron and the first addition result; the fourth unit 304 performs third-layer normalization on the second addition result and performs moving-window multi-head attention on the third-layer normalization result. It should be noted that the moving window can shift, enabling information interaction between windows.
S2305, carrying out third addition on the multi-head attention result of the moving window and the second addition result, carrying out fourth-layer normalization on the third addition result, and carrying out processing of the second multilayer perceptron on the fourth-layer normalization result.
Specifically, the input of the fifth unit 305 is a third addition result obtained by performing third addition on the moving-window multi-head attention result and the second addition result, and the fifth unit 305 performs fourth-layer normalization on the third addition result and performs processing of the second multilayer perceptron on the fourth-layer normalization result.
It should be noted that the flow of the second Swin Transformer module is similar to that of the first Swin Transformer module and is not repeated here; the difference is that the multi-head attention in the fourth unit 304 is the moving-window multi-head attention (SWMSA).
S2306, performing fourth addition according to the processing result of the second multilayer perceptron, the third addition result and the second expansion convolution result to obtain a feature sub-processing result.
Specifically, a fourth addition is performed on the processing result of the second multilayer perceptron, the third addition result and the second dilated convolution result to obtain the feature sub-processing result, which is then output.
It can be understood that, when the RSC self-attention module 300 has only one third RSC self-attention module, the first feature processing result output by that third RSC self-attention module is the feature sub-processing result; when the RSC self-attention module 300 has six third RSC self-attention modules, the feature sub-processing result output by each third RSC self-attention module is used as the input of the next third RSC self-attention module for another round of feature sub-processing, until the output of the last third RSC self-attention module is obtained as the final first feature processing result.
Thus, the flow of the Swin Transformer module is as follows:

X_hat^l = WMSA(Layernorm(X^(l-1))) + X^(l-1)   (SWMSA is used instead of WMSA in the second layer)

X^l = MLP(Layernorm(X_hat^l)) + X_hat^l

where l represents the layer index of the Swin Transformer module (e.g., the second unit 302 and the third unit 303 form the first layer, and the fourth unit 304 and the fifth unit 305 form the second layer), X^(l-1) represents the input of the l-th layer, X_hat^l represents the output of the multi-head attention, WMSA(·) represents window multi-head self-attention, SWMSA(·) represents moving-window multi-head self-attention, Layernorm represents layer normalization (e.g., the first-layer, second-layer, third-layer and fourth-layer normalizations), and MLP represents the first multilayer perceptron or the second multilayer perceptron. It should be noted that, in SWMSA, the window is translated to the lower left by half of the three-dimensional window size, and the self-attention is recalculated according to the new window information, thereby effectively improving the information interaction between windows.
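As an illustrative sketch only, the residual flow of one Swin Transformer layer can be written in PyTorch as follows; the stand-in attention sub-module uses an ordinary multi-head attention over window tokens rather than a true shifted-window attention, and the dimensions are exemplary:

import torch
import torch.nn as nn

class SwinLayer(nn.Module):
    """One layer of the flow above:
       X_hat^l = (S)WMSA(Layernorm(X^(l-1))) + X^(l-1)
       X^l     = MLP(Layernorm(X_hat^l)) + X_hat^l
    `attn` is a window (or moving-window) multi-head attention module and `mlp` a
    two-layer perceptron, so the same class covers units 302/303 and units 304/305."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.mlp = attn, mlp

    def forward(self, x):                      # x: (num_windows, tokens, dim)
        x = self.attn(self.norm1(x)) + x       # multi-head attention + residual connection
        x = self.mlp(self.norm2(x)) + x        # multilayer perceptron + residual connection
        return x

class SelfAttn(nn.Module):
    """Stand-in attention: plain multi-head self-attention over the window tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        return self.mha(x, x, x)[0]

dim = 96
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
layer = SwinLayer(dim, SelfAttn(dim), mlp)
print(layer(torch.randn(512, 32, dim)).shape)  # torch.Size([512, 32, 96])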
And S240, decoding the first characteristic processing result through a decoding module to generate a second CT image.
Optionally, the decoding module 400 includes a plurality of decoder CNN modules 401 connected in sequence; their number corresponds to the number of first CNN modules, and each decoder CNN module 401 is likewise composed of (upsampling + convolution + instance normalization + ReLU activation function), forming a symmetrical structure with the first CNN modules. The decoding process is as follows: the first feature processing result output by the RSC self-attention module is input into the first decoder CNN module 401 of the decoding module 400. Each decoder CNN module 401 differs from the first CNN module in that upsampling is performed by nearest-neighbor interpolation before the convolution, and the upsampled feature map is concatenated, via a jump (skip) connection, with the feature map of the corresponding first CNN module at the same scale, so that the image is gradually restored to the original input size. A Tanh activation function is then applied, so that the last decoder CNN module 401 outputs the synthesized CT image, i.e., the second CT image.
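As an illustrative sketch only, one decoder CNN module 401 can be written in PyTorch as follows; the channel counts, the upsampling factor and the placement of the Tanh activation in the last module are exemplary assumptions:

import torch
import torch.nn as nn

class DecoderCNNBlock(nn.Module):
    """Nearest-neighbour upsampling, concatenation with the skip-connected encoder feature
    map of the same scale, then convolution + instance normalization + ReLU."""
    def __init__(self, in_ch, skip_ch, out_ch, scale=(1, 2, 2)):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="nearest")
        self.block = nn.Sequential(
            nn.Conv3d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                          # gradually restore the spatial size
        x = torch.cat([x, skip], dim=1)         # jump (skip) connection from the encoder
        return self.block(x)

# The last decoder module would instead end with nn.Tanh(), so that the synthesized
# second CT image lies in [-1, 1], matching the normalization of the training data.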
And S250, inputting the first CT image and the second CT image into a discriminator to obtain a discrimination result.
Optionally, the second CT image synthesized by the CST generator and the first CT image are input into a Markov discriminator for judgment to obtain a discrimination result; for example, the discrimination result is 1 for a real image and 0 for a synthesized (fake) image.
As shown in fig. 4, specifically, before being input into the discriminator, the first CT image and the second CT image may each be input into a conversion module and subjected to discrete wavelet transform and convolution conversion. The input image (e.g., the first CT image or the second CT image) is decomposed in the horizontal direction and the vertical direction by the Haar discrete wavelet to obtain 4 feature maps whose resolution is reduced to half of the original image, and convolution conversion is then performed by a convolution module: a convolution kernel with a stride of 2 is used to obtain 1 feature map whose resolution is likewise reduced to half of the original image. The 5 feature maps obtained by the two conversion methods for each of the two images are concatenated as the input of the Markov discriminator. The Markov discriminator then divides the input image into a plurality of N × N image blocks for judgment, which effectively models the image as a Markov random field, so that a discrimination result can be obtained from the input image. In this way, on the basis of the existing Markov discriminator, discrete wavelet transform and convolution conversion are applied to the input, which enhances the detail information of the image and improves the discrimination capability.
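As an illustrative sketch only, the conversion module for a single two-dimensional slice can be written as follows; the Haar averaging/differencing normalization and the convolution kernel size are exemplary assumptions:

import torch
import torch.nn as nn

def haar_dwt_2d(x):
    """Single-level Haar decomposition along the height and width axes.
    Input (B, C, H, W) -> 4 sub-bands, each of shape (B, C, H/2, W/2)."""
    a, b = x[..., 0::2, :], x[..., 1::2, :]            # even / odd rows
    lo, hi = (a + b) / 2, (a - b) / 2                  # row-wise low / high pass
    ll, lh = (lo[..., 0::2] + lo[..., 1::2]) / 2, (lo[..., 0::2] - lo[..., 1::2]) / 2
    hl, hh = (hi[..., 0::2] + hi[..., 1::2]) / 2, (hi[..., 0::2] - hi[..., 1::2]) / 2
    return ll, lh, hl, hh

class ConversionModule(nn.Module):
    """Builds the 5-channel discriminator input: 4 Haar sub-bands + 1 stride-2 convolution map."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, stride=2, padding=1)

    def forward(self, ct):                             # ct: (B, 1, H, W), first or second CT image
        bands = haar_dwt_2d(ct)                        # 4 feature maps at half resolution
        conv_map = self.conv(ct)                       # 1 feature map at half resolution
        return torch.cat([*bands, conv_map], dim=1)    # (B, 5, H/2, W/2)

print(ConversionModule()(torch.randn(2, 1, 256, 256)).shape)   # torch.Size([2, 5, 128, 128])

The two converted images (from the first CT image and the second CT image) are then concatenated as the input of the Markov discriminator.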
And S260, training the network model according to the identification result and a preset loss function to obtain an image synthesis model.
Optionally, the preset loss function includes, but is not limited to, a confrontation loss function, a mean absolute error loss function, and a perceptual loss function, and step S260 includes steps S2601-S2603:
s2601, determining the confrontation loss value of the discriminator according to the discrimination result and the confrontation loss function.
Specifically, the confrontation loss value L_adv is calculated from the discrimination result and the confrontation loss function.
S2602, determining a mean absolute error value according to the first CT image, the second CT image and the mean absolute error loss function.
Specifically, the mean absolute error value L_L1 is determined from the first CT image, the second CT image, and the mean absolute error loss function (i.e., the L1 loss function).
S2603, determining a perception loss value according to the first CT image, the second CT image and the perception loss function.
Specifically, the perceptual loss value L_P is determined from the first CT image, the second CT image, and the perceptual loss function. It should be noted that the perceptual loss function adopts a pre-trained VGG19 perceptual loss model; the VGG19 model convolves the first CT image and the second CT image respectively to obtain a first feature and a second feature, and the difference between the first feature and the second feature is then calculated to obtain the perceptual loss value L_P.
S2604, adjusting network parameters of the network model according to the confrontation loss value, the average absolute error value and the perception loss value to obtain an image synthesis model.
A total loss function L_total is formed from the confrontation loss value, the average absolute error value and the perception loss value, as shown in the following equation:

L_total = min_G max_D L_adv + λ1 L_L1 + λ2 L_P

where G represents the CST generator, D represents the discriminator, min denotes minimizing the generator loss, max denotes maximizing the discriminator loss, L_adv represents the confrontation loss value, L_L1 represents the average absolute error value, L_P represents the perception loss value, and λ1 and λ2 are weighting coefficients. Specifically, the CST generator and the discriminator are trained alternately, and the network parameters are updated until the value of the total loss function reaches the target range, so that the image synthesis model is obtained. Optionally, the initial learning rate used in the training process is 0.0002, and an Adam optimizer is used to train for 150 epochs.
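As an illustrative sketch only, the losses used for the alternating training can be written in PyTorch as follows; the binary cross-entropy form of the confrontation loss, the placeholder weighting coefficients and the perceptual-loss callable (e.g., an L1 distance between VGG19 features) are exemplary assumptions:

import torch
import torch.nn as nn

l1_loss = nn.L1Loss()
bce = nn.functional.binary_cross_entropy_with_logits

def generator_loss(d_fake, fake_ct, real_ct, perceptual, lambda_l1=1.0, lambda_p=1.0):
    """Generator side of L_total: fool the discriminator while keeping the synthesized
    second CT image close to the first CT image in L1 and perceptual terms."""
    adv = bce(d_fake, torch.ones_like(d_fake))
    return adv + lambda_l1 * l1_loss(fake_ct, real_ct) + lambda_p * perceptual(fake_ct, real_ct)

def discriminator_loss(d_real, d_fake):
    """Discriminator side: real patches are pushed towards 1, synthesized patches towards 0."""
    return bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))

# Alternating training step (sketch): update the discriminator with discriminator_loss,
# then update the generator with generator_loss, e.g. with torch.optim.Adam(params, lr=2e-4).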
Optionally, the MR image-based CT image synthesis method according to the embodiment of the present invention further includes step S270:
and S270, testing and storing the model.
For example, the metric used in the testing and verification process is the mean square error: the validation set is evaluated in each batch, the mean square error between the second CT image and the first CT image is calculated, and the network parameters corresponding to the smallest mean square error are saved as the model result. Finally, the model result is evaluated on the test set to obtain the final model result, i.e., the image synthesis model.
Optionally, the network parameters include, but are not limited to, data processing (or pre-processing) related parameters, training process and training related parameters, or network related parameters. For example, data processing (or pre-processing) related parameters include, but are not limited to, rich database parameters (enrich data), feature normalization and scaling parameters (feature normalization and scaling) and BN processing parameters (batch normalization); parameters related to training in the training process include but are not limited to training momentum, learning rate, attenuation function, weight initialization and regularization related methods; network-related parameters include, but are not limited to, selection parameters of the classifier, the number of neurons, the number of filters, and the number of network layers.
And S300, inputting the MR image to be synthesized into the image synthesis model to obtain a target CT image.
As shown in the figures, fig. 5(a) is an MR image to be synthesized and fig. 5(b) is the real CT corresponding to the MR image to be synthesized; the MR image to be synthesized is input into the image synthesis model to obtain the target CT image shown in fig. 5(c), and it can be seen that the difference between the target CT image and the real CT is small. Specifically, after the MR images to be synthesized of 43 patients in the test set are input into the image synthesis model, the finally obtained mean absolute error MAE is 69.02, the structural similarity SSIM is 0.759, and the peak signal-to-noise ratio PSNR is 27.78. The output difference is shown in fig. 5(d), which is the result of subtracting the real CT in fig. 5(b) from the target CT image in fig. 5(c); it can be seen that the overall difference is small.
In addition, the CT image synthesis method based on the MR image has high synthesis quality and rich synthesis details, and the synthesis in the air and bone parts is more prominent than that in the prior art.
The embodiment of the present invention further provides a CT image synthesis apparatus based on MR images, including:
the acquisition module is used for acquiring training data; the training data comprises a first MR image and a first CT image with the same target object;
the training module is used for inputting the first MR image and the first CT image into the network model for training to obtain an image synthesis model; the network model comprises a CST generator and a discriminator, wherein the CST generator comprises an encoding module, a decoding module, a RSC self-attention module and a global block embedding module; the system comprises a global block embedding module, an encoding module, an RSC self-attention module, a decoding module and a first CT image generation module, wherein the global block embedding module is used for carrying out global feature extraction on a first MR image, the encoding module is used for carrying out local feature extraction on the first MR image, the RSC self-attention module is used for carrying out first feature processing according to a global feature extraction result and a local feature extraction result, and the decoding module is used for decoding the first feature processing result to generate a second CT image; the discriminator is used for generating discrimination results according to the first CT image and the second CT image so as to carry out training;
and the synthesis module is used for inputting the MR image to be synthesized into the image synthesis model to obtain the target CT image.
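A minimal data-flow sketch of the CST generator described above follows; the class and sub-module names are placeholders passed in from outside, and the channel-wise concatenation used for feature fusion is an assumption rather than the specific fusion of the embodiment.

import torch
import torch.nn as nn

class CSTGenerator(nn.Module):
    """Structural sketch only: the encoder extracts local features, the global
    block-embedding branch extracts global features, the two are fused and
    refined by the RSC self-attention module, and the decoder produces the
    synthesized (second) CT image.  All sub-modules are placeholders."""
    def __init__(self, encoder, global_patch_embed, rsc_attention, decoder):
        super().__init__()
        self.encoder = encoder                        # local feature extraction (CNN branch)
        self.global_patch_embed = global_patch_embed  # global feature extraction
        self.rsc_attention = rsc_attention            # first feature processing
        self.decoder = decoder                        # decodes to the second CT image

    def forward(self, mr):
        local_feat = self.encoder(mr)
        global_feat = self.global_patch_embed(mr)
        # fusion by channel concatenation is assumed; spatial sizes must match
        fused = torch.cat([local_feat, global_feat], dim=1)
        processed = self.rsc_attention(fused)
        return self.decoder(processed)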
The contents in the above method embodiments are all applicable to the present apparatus embodiment; the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the MR image based CT image synthesis method according to the foregoing embodiment. The electronic device of the embodiment of the present invention includes, but is not limited to, any intelligent terminal such as a mobile phone, a tablet computer, or a vehicle-mounted computer.
The contents in the above method embodiments are all applicable to the present electronic device embodiment; the functions specifically implemented by the present electronic device embodiment are the same as those in the above method embodiments, and the beneficial effects achieved by the present electronic device embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the MR image based CT image synthesis method according to the foregoing embodiment.
Embodiments of the present invention also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the MR image-based CT image synthesis method of the foregoing embodiment.
The terms "first," "second," "third," "fourth," and the like (if any) in the description of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. The CT image synthesis method based on the MR image is characterized by comprising the following steps:
acquiring training data; the training data comprises a first MR image and a first CT image with the same target object;
inputting the first MR image and the first CT image into a network model for training to obtain an image synthesis model; the network model comprises a CST generator and a discriminator, wherein the CST generator comprises an encoding module, a decoding module, an RSC self-attention module and a global block embedding module; the global block embedding module is used for carrying out global feature extraction on the first MR image, the coding module is used for carrying out local feature extraction on the first MR image, the RSC self-attention module is used for carrying out first feature processing according to a global feature extraction result and a local feature extraction result, and the decoding module is used for decoding a first feature processing result to generate a second CT image; the discriminator is used for generating discrimination results according to the first CT image and the second CT image so as to carry out training;
and inputting the MR image to be synthesized into the image synthesis model to obtain a target CT image.
2. The MR image based CT image synthesis method according to claim 1, wherein: the acquiring training data comprises:
acquiring an original MR image and an original CT image of the same target object;
performing a bias field (offset field) correction operation on the original MR image and then registering it with the original CT image to obtain a registered MR image and a registered CT image;
performing assignment processing of a preset range on the registered CT image;
carrying out normalization processing on the registered MR image and the registered CT image after the assignment processing;
and intercepting an image of a preset size from the registered MR image after the normalization processing to obtain the first MR image, and intercepting an image of a preset size from the registered CT image after the normalization processing to obtain the first CT image.
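Purely as an illustration of the steps recited above, the following NumPy sketch clips the registered CT to a preset value range (one possible reading of the assignment processing, stated here as an assumption), min-max normalizes both registered images, and crops a patch of preset size; bias field correction and registration are assumed to have been performed beforehand, and the HU range and patch size are placeholders.

import numpy as np

def prepare_pair(reg_mr, reg_ct, hu_range=(-1000.0, 2000.0), patch=(256, 256)):
    """Sketch of the post-registration steps on 2-D slices: clip the registered
    CT to a preset range, min-max normalize both images, and take a central
    crop of preset size to form the first MR image and first CT image."""
    ct = np.clip(reg_ct, *hu_range)                        # assignment to the preset range (clipping assumed)

    def minmax(img):
        lo, hi = img.min(), img.max()
        return (img - lo) / (hi - lo + 1e-8)

    mr, ct = minmax(reg_mr), minmax(ct)                    # normalization processing
    h, w = mr.shape
    top, left = (h - patch[0]) // 2, (w - patch[1]) // 2   # central crop of preset size
    first_mr = mr[top:top + patch[0], left:left + patch[1]]
    first_ct = ct[top:top + patch[0], left:left + patch[1]]
    return first_mr, first_ct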
3. A CT image synthesis method based on MR images according to any of claims 1-2, characterized in that: the inputting the first MR image and the first CT image into a network model for training to obtain an image synthesis model includes:
performing local feature extraction on the first MR image through the encoding module;
performing global feature extraction on the first MR image through the global block embedding module;
fusing the local feature extraction result and the global feature extraction result, and performing first feature processing on the fused result through the RSC self-attention module;
decoding the first characteristic processing result through a decoding module to generate a second CT image;
inputting the first CT image and the second CT image into the discriminator to obtain a discrimination result;
and training a network model according to the discrimination result and a preset loss function to obtain an image synthesis model.
4. The MR image based CT image synthesis method according to claim 3, wherein: the encoding module comprises at least one first CNN module, a first RSC self-attention module and a second CNN module; the local feature extraction of the first MR image by the encoding module includes:
performing at least one time of first feature extraction on the first MR image through the first CNN module to obtain first feature information;
performing second feature processing on the first feature information through a first RSC self-attention module;
and performing second feature extraction on the second feature processing result through the second CNN module to obtain a local feature extraction result.
5. The MR image based CT image synthesis method according to claim 3, wherein: the global block embedding module comprises a block embedding module and a second RSC self-attention module; the global feature extraction of the first MR image by the global block embedding module includes:
performing second feature extraction on the first MR image through the block embedding module to obtain second feature information;
and performing third feature processing on the second feature information through the second RSC self-attention module to obtain a global feature extraction result.
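One common realization of such block (patch) embedding, given here only as a hedged sketch, projects non-overlapping patches of the MR image to embedding vectors with a strided convolution; the class name, patch size and embedding dimension are placeholders and are not taken from the embodiment.

import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Illustrative block (patch) embedding: split the MR image into
    non-overlapping patches with a strided convolution and project each patch
    to an embedding vector for subsequent self-attention processing."""
    def __init__(self, in_channels=1, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 1, H, W) MR image
        x = self.proj(x)                       # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, embed_dim)
        return self.norm(x)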
6. The MR image based CT image synthesis method according to claim 3, wherein: the first feature processing comprises at least one feature sub-processing, and the RSC self-attention module comprises at least one third RSC self-attention module; the performing of the first feature processing on the fusion result through the RSC self-attention module includes:
performing the feature sub-processing on the fusion result through the third RSC self-attention module, wherein the feature sub-processing specifically includes:
performing a first separable convolution on the fusion result, performing a first dilated convolution on the first separable convolution result, performing a second separable convolution on the first dilated convolution result, and performing a second dilated convolution on the second separable convolution result;
performing first-layer normalization according to the second dilated convolution result, and performing window multi-head attention on the first-layer normalization result;
performing first addition on the multi-head attention result of the window and the second expansion convolution result, performing second-layer normalization on the first addition result, and performing processing on the second-layer normalization result by using a first multilayer perceptron;
performing second addition on the processing result of the first multilayer perceptron and the first addition result, performing third-layer normalization on the second addition result, and performing moving-window multi-head attention on the third-layer normalization result;
performing third addition on the multi-head attention result of the moving window and the second addition result, performing fourth-layer normalization on the third addition result, and performing processing on the fourth-layer normalization result by using a second multilayer perceptron;
and performing fourth addition according to the processing result of the second multilayer perceptron, the third addition result and the second dilated convolution result to obtain a feature sub-processing result.
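The following PyTorch sketch illustrates one possible arrangement of the feature sub-processing recited above; for brevity the window and moving-window multi-head attention are replaced by plain multi-head attention over flattened positions, and all channel dimensions are placeholders, so this is a simplified approximation rather than the exact processing of the embodiment.

import torch
import torch.nn as nn

class RSCBlockSketch(nn.Module):
    """Simplified sketch of one feature sub-processing step: two separable
    convolutions interleaved with two dilated convolutions, followed by two
    attention/MLP stages with layer normalization and the four additions."""
    def __init__(self, channels, heads=4, mlp_ratio=4):
        super().__init__()
        def separable(c):  # separable convolution = depthwise conv + pointwise conv
            return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                                 nn.Conv2d(c, c, 1))
        def mlp(c):        # multilayer perceptron
            return nn.Sequential(nn.Linear(c, mlp_ratio * c), nn.GELU(),
                                 nn.Linear(mlp_ratio * c, c))
        self.sep1, self.sep2 = separable(channels), separable(channels)
        self.dil1 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.dil2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.norm1, self.norm2 = nn.LayerNorm(channels), nn.LayerNorm(channels)
        self.norm3, self.norm4 = nn.LayerNorm(channels), nn.LayerNorm(channels)
        self.attn1 = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp1, self.mlp2 = mlp(channels), mlp(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        conv = self.dil2(self.sep2(self.dil1(self.sep1(x))))   # convolution branch
        tokens = conv.flatten(2).transpose(1, 2)                # (B, H*W, C)
        y, _ = self.attn1(*(self.norm1(tokens),) * 3)           # "window" attention stand-in
        a1 = y + tokens                                         # first addition
        a2 = self.mlp1(self.norm2(a1)) + a1                     # second addition
        y, _ = self.attn2(*(self.norm3(a2),) * 3)               # "moving-window" stand-in
        a3 = y + a2                                             # third addition
        out = self.mlp2(self.norm4(a3)) + a3 + tokens           # fourth addition (with conv result)
        return out.transpose(1, 2).reshape(b, c, h, w)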
7. The MR image based CT image synthesis method according to claim 3, wherein: the training of the network model according to the discrimination result and the preset loss function to obtain the image synthesis model comprises the following steps:
determining an adversarial loss value of the discriminator according to the discrimination result and an adversarial loss function;
determining an average absolute error value according to the first CT image, the second CT image and an average absolute error loss function;
determining a perceptual loss value according to the first CT image, the second CT image and a perceptual loss function;
and adjusting network parameters of the network model according to the adversarial loss value, the average absolute error value and the perceptual loss value to obtain the image synthesis model.
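A hedged sketch of how these three terms could be combined into a single generator objective is given below; the least-squares adversarial formulation, the loss weights and the frozen feature_extractor used for the perceptual term are assumptions, not values or choices taken from the embodiment.

import torch
import torch.nn as nn

mae_loss = nn.L1Loss()      # average absolute error term
adv_loss = nn.MSELoss()     # least-squares adversarial term (illustrative choice)

def generator_loss(disc_fake, fake_ct, real_ct, feature_extractor,
                   w_adv=1.0, w_mae=100.0, w_perc=10.0):
    """Weighted sum of adversarial, average-absolute-error and perceptual
    terms; feature_extractor is a hypothetical frozen pretrained network whose
    feature maps define the perceptual distance, and the weights are placeholders."""
    loss_adv = adv_loss(disc_fake, torch.ones_like(disc_fake))   # fool the discriminator
    loss_mae = mae_loss(fake_ct, real_ct)
    loss_perc = mae_loss(feature_extractor(fake_ct), feature_extractor(real_ct))
    return w_adv * loss_adv + w_mae * loss_mae + w_perc * loss_perc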
8. An MR image-based CT image synthesis apparatus, comprising:
the acquisition module is used for acquiring training data; the training data comprises a first MR image and a first CT image with the same target object;
the training module is used for inputting the first MR image and the first CT image into a network model for training to obtain an image synthesis model; the network model comprises a CST generator and a discriminator, wherein the CST generator comprises an encoding module, a decoding module, an RSC self-attention module and a global block embedding module; the global block embedding module is used for carrying out global feature extraction on the first MR image, the coding module is used for carrying out local feature extraction on the first MR image, the RSC self-attention module is used for carrying out first feature processing according to a global feature extraction result and a local feature extraction result, and the decoding module is used for decoding a first feature processing result to generate a second CT image; the discriminator is used for generating discrimination results according to the first CT image and the second CT image so as to carry out training;
and the synthesis module is used for inputting the MR image to be synthesized into the image synthesis model to obtain a target CT image.
9. An electronic device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method according to any one of claims 1-7.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 7.
CN202210428309.5A 2022-04-22 2022-04-22 CT image synthesis method, device, equipment and medium based on MR image Pending CN114882135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210428309.5A CN114882135A (en) 2022-04-22 2022-04-22 CT image synthesis method, device, equipment and medium based on MR image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210428309.5A CN114882135A (en) 2022-04-22 2022-04-22 CT image synthesis method, device, equipment and medium based on MR image

Publications (1)

Publication Number Publication Date
CN114882135A true CN114882135A (en) 2022-08-09

Family

ID=82670948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210428309.5A Pending CN114882135A (en) 2022-04-22 2022-04-22 CT image synthesis method, device, equipment and medium based on MR image

Country Status (1)

Country Link
CN (1) CN114882135A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309604A (en) * 2023-05-24 2023-06-23 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) Method, system, device and storage medium for dynamic analysis of time-series MR images
CN116309604B (en) * 2023-05-24 2023-08-22 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) Method, system, device and storage medium for dynamic analysis of time-series MR images

Similar Documents

Publication Publication Date Title
Liang et al. MCFNet: Multi-layer concatenation fusion network for medical images fusion
CN113012172B (en) AS-UNet-based medical image segmentation method and system
CN116012344B (en) Cardiac magnetic resonance image registration method based on mask self-encoder CNN-transducer
CN107194912B (en) Brain CT/MR image fusion method based on sparse representation and improved coupled dictionary learning
CN111260741B (en) Three-dimensional ultrasonic simulation method and device by utilizing generated countermeasure network
CN116309650B (en) Medical image segmentation method and system based on double-branch embedded attention mechanism
CN111080657A (en) CT image organ segmentation method based on convolutional neural network multi-dimensional fusion
CN112686898B (en) Automatic radiotherapy target area segmentation method based on self-supervision learning
CN111428761B (en) Image feature visualization method, image feature visualization device and electronic equipment
CN116433914A (en) Two-dimensional medical image segmentation method and system
CN116030259B (en) Abdominal CT image multi-organ segmentation method and device and terminal equipment
CN111862261B (en) FLAIR modal magnetic resonance image generation method and system
CN111861945A (en) Text-guided image restoration method and system
CN116188452A (en) Medical image interlayer interpolation and three-dimensional reconstruction method
CN114882135A (en) CT image synthesis method, device, equipment and medium based on MR image
CN114511798B (en) Driver distraction detection method and device based on transformer
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN114943656A (en) Face image restoration method and system
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN116385317B (en) Low-dose CT image recovery method based on self-adaptive convolution and transducer mixed structure
CN116563402A (en) Cross-modal MRI-CT image synthesis method, system, equipment and medium
CN115861464A (en) Pseudo CT (computed tomography) synthesis method based on multimode MRI (magnetic resonance imaging) synchronous generation
CN116137043A (en) Infrared image colorization method based on convolution and transfomer
CN113744284B (en) Brain tumor image region segmentation method and device, neural network and electronic equipment
CN114463214A (en) Double-path iris completion method and system guided by regional attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination