CN116503515A - Brain lesion image generation method and system based on text and image multimodality
- Publication number
- CN116503515A (application number CN202310463730.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- model
- brain
- lesion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/60—Editing figures and text; Combining figures or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a multimodal text-and-image brain lesion image generation method and system, comprising: a data collection module, which collects public datasets of brain lesions and pairs each image with a text template to form a text-image template dataset; a fine-tuning module, which encodes the image and the text separately into a common embedding space and matches them by computing similarity; a data expansion module, which uses the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt; and an annotated-image generation module, which edits a target image with the cyclically fine-tuned model, given a mask that specifies the editing region and a text description, to generate a target lesion, thereby producing a labeled lesion image. The advantages of the invention are that it processes image and text information simultaneously, generates more realistic, accurate and diverse brain lesion images, and realizes the unsupervised brain lesion segmentation task more efficiently, accurately and diversely.
Description
Technical Field
The invention relates to the technical field of artificial intelligence generation of medical images, and in particular to a deep-learning-based multimodal text-and-image brain lesion image generation method and system.
Background
A brain lesion image is an image of an abnormal lesion found by examining the human brain with imaging techniques such as magnetic resonance imaging (MRI) and computed tomography (CT). Doctors use these images to diagnose and treat brain diseases. Common brain lesions include cerebral hemorrhage, cerebral infarction, tumors, and the like. With computer-aided diagnosis (CAD) and artificial intelligence techniques, brain lesions can be identified and localized more accurately, providing important references for doctors when formulating treatment plans.
The prior art proposes deep-learning-based brain disease image generation models such as GANs (Generative Adversarial Networks) [1][2]. These models are typically based on a single data source, such as brain MRI images, to generate brain lesion images. However, these methods have several drawbacks. First, they cannot exploit other available data sources, such as lesion description text, to improve the accuracy and realism of the generated images, nor can they inject prior information through a text description. Second, these methods typically require a large amount of training data, which is often difficult to obtain, especially for rare brain diseases. Moreover, GAN models are highly sensitive to hyperparameters, their training is unstable, and the diversity of the generated images leaves room for improvement.
In addition, the prior art also proposes GAN-based multimodal image generation models [3], such as text-to-image generation models, which combine text and image information to generate an image. However, these models are often not well suited to generating brain disease images, because the pathological features of brain diseases differ from those of ordinary objects and require specialized models to handle finer textures. Most importantly, these generative models provide labels neither for the generated image nor for the edited lesion region, and therefore cannot create the labeled image pairs needed to train a segmentation model.
From the foregoing, prior art models are typically based on a single data source and struggle to generate realistic, accurate brain lesion images from combined text and image multimodal information. Furthermore, the prior art requires large amounts of training data and computational resources to train a model, which does not suit the special needs of the medical imaging field; merely generating images, without producing labels for the edited regions (e.g., added lesions), cannot yield the paired images and labels required to train a segmentation model.
[1] Han, Changhee, et al. "GAN-based synthetic brain MR image generation." 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018.
[2] Huang, Yawen, et al. "MCMT-GAN: Multi-task coherent modality transferable GAN for 3D brain image synthesis." IEEE Transactions on Image Processing 29 (2020): 8187-8198.
[3] Zhu, Minfeng, et al. "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multimodal text-and-image brain lesion image generation method and system that can process image and text information simultaneously. The method uses the multimodal information of text and image, together with a specified editing region, to generate more realistic and accurate labeled brain lesion images, while reducing the requirements on training data and computational resources.
In order to achieve the above object, the present invention adopts the following technical scheme:
A multimodal text-and-image brain lesion image generation method, comprising the following steps:
Step 1, collect public datasets of brain lesions, pair each image with a text template, and fill the reserved slots in the template according to each image sample, forming a text-image template dataset.
Step 2, use the paired text-image template dataset to fine-tune the CLIP module of DALLE2 so as to align the image and text feature spaces.
Step 3, use the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt. The CLIP module encodes the prompt into the text feature space, a prior model converts the text features into image features, and a decoder generates an image with the required semantics from those features. The generated images serve as expansion data, further satisfying the DALLE2 model's demand for data on top of the limited public datasets.
Step 4, use the paired image-text data generated in Step 3, together with the original text-image template dataset of Step 1, to cyclically fine-tune the model again.
Step 5, edit a target image with the cyclically fine-tuned model: provide a mask specifying the region of the image to edit and input a text description to generate a target lesion, thereby producing a labeled lesion image for subsequent segmentation tasks.
Further, the fine-tuning in Step 2 is specifically: a model pre-trained on large-scale text-image pairs is loaded as the initialization, and training is continued on it with the locally constructed text-image data of brain lesion images, completing the fine-tuning of the model weights.
Further, the cyclic fine-tuning in Step 4 is specifically: the preliminarily fine-tuned model of Step 2 is taken as the initialization, and the model is trained again with the expansion data generated in Step 3 together with the original data, completing a second fine-tuning of the model parameters.
The invention further discloses a brain lesion image generation system that can be used to implement the above brain lesion image generation method, and that specifically comprises: a data collection module, a fine-tuning module, a data expansion module, and an annotated-image generation module.
Data collection module: collects public datasets of brain lesions and pairs each image with a text template to form a text-image template dataset.
Fine-tuning module: encodes the image and the text separately, i.e., converts both into a common embedding space, and matches them by computing the similarity between them.
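As an illustration only, the following minimal sketch (Python/PyTorch; the `image_encoder` and `text_encoder` names are placeholders standing in for the CLIP image and text towers, not part of the disclosure) shows how the module's matching by similarity in a common embedding space can be computed:

```python
import torch
import torch.nn.functional as F

def match(image_encoder, text_encoder, images, tokenized_texts):
    """Embed images and texts into the shared space and score all pairs.

    image_encoder / text_encoder are assumed to map their inputs to
    feature vectors of the same dimension, as the CLIP towers do.
    """
    img_emb = F.normalize(image_encoder(images), dim=-1)          # (N, D), unit norm
    txt_emb = F.normalize(text_encoder(tokenized_texts), dim=-1)  # (M, D), unit norm
    sim = img_emb @ txt_emb.t()        # (N, M) matrix of cosine similarities
    best_text = sim.argmax(dim=1)      # index of the best-matching text per image
    return sim, best_text
```

Normalizing both embeddings makes the dot product equal to the cosine similarity, which is the matching criterion used throughout the fine-tuning below.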
Data expansion module: uses the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt.
Annotated-image generation module: edits a target image with the cyclically fine-tuned model, given a mask that specifies the editing region and a text description, to generate a target lesion, thereby producing a labeled lesion image.
The invention also discloses a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the above brain lesion image generation method.
Compared with the prior art, the invention has the following advantages:
1. It exploits the strengths of large multimodal models (such as CLIP and DALLE2), processes image and text information simultaneously, and can use language priors to guide image generation on demand.
2. It generates more realistic, accurate and diverse brain lesion images, realizes annotation-free generation of lesion-segmentation image pairs, and provides better tools and support for medical diagnosis and research.
3. It is better adapted to the brain lesion image generation task.
4. It adopts a state-of-the-art diffusion generation model (DALLE2) to produce highly realistic and diverse brain lesion images with corresponding lesion labels, and uses the generated image-label pairs to train segmentation models, realizing the unsupervised brain lesion segmentation task.
5. Using the large multimodal model DALLE2, a state-of-the-art diffusion generation model, the method generates brain lesion image-annotation pairs more efficiently, accurately and diversely, thereby better supporting medical lesion segmentation, diagnosis and research.
Drawings
Fig. 1 is a flowchart of the brain lesion image generation method according to an embodiment of the invention.
Fig. 2 is a flowchart of the fine-tuning process according to an embodiment of the invention.
Fig. 3 is a flowchart of generating a labeled lesion image according to an embodiment of the invention.
Detailed Description
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a multimodal text-and-image brain lesion image generation method includes the following steps:
step1, collecting relevant public data sets of brain focus, such as brain tumor segmentation data set BraTS, old cerebral apoplexy segmentation data ATLAS, ischemic cerebral apoplexy data set ISLES, multiple sclerosis focus segmentation data set MSSEG, etc. And each image is provided with a text template, for example: [ Stroke ] with [ hyper ] intensity on [ DWI ] module ] fills [ ] locations in the text template from each image sample, forming a text-image template dataset.
Step 2, use the paired text-image template dataset to fine-tune the CLIP module of DALLE2 so as to align the image and text feature spaces. Specifically, a model pre-trained on large-scale text-image pairs is loaded as the initialization, and training is continued on it with the locally constructed text-image data of brain lesion images, completing the fine-tuning of the model weights.
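A sketch of one such fine-tuning round is given below, using the symmetric contrastive loss with which CLIP is trained; the use of the open_clip library, the "ViT-B-32" checkpoint, and the learning rate are assumptions for illustration, not requirements of the method:

```python
import torch
import torch.nn.functional as F
import open_clip  # assumed available; any CLIP implementation exposing
                  # encode_image / encode_text would work the same way

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")          # pre-trained initialization
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def clip_finetune_step(images, captions):
    """One contrastive step on a batch of brain-lesion text-image pairs."""
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(tokenizer(captions)), dim=-1)
    logits = model.logit_scale.exp() * img @ txt.t()   # (B, B) similarities
    labels = torch.arange(len(images))                 # matched pairs lie on the diagonal
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2   # symmetric CLIP loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The symmetric loss pulls matched brain-image/caption pairs together in the shared space, which is exactly the alignment that the prior and decoder of Step 3 rely on.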
Step 3, use the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt. Specifically, the CLIP module encodes the prompt into the text feature space, a prior model converts the text features into image features, and a decoder generates an image with the required semantics from those features. Since DALLE2 generates through a diffusion model, the same model with fixed weight parameters produces different outputs under different initialization noise, so the corresponding generated data also differ from one another. The generated images therefore amount to a data expansion, further satisfying the DALLE2 model's demand for data on top of the limited public datasets.
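The text-to-image path of this step can be sketched as follows; the `prior` and `decoder` callables stand in for the diffusion prior and decoder of a DALLE2-style implementation, and their interfaces here are assumed for illustration only:

```python
import torch

@torch.no_grad()
def generate_variants(model, tokenizer, prior, decoder, prompt, n_variants=4):
    """Generate n image variants for one text prompt (DALLE2-style pipeline).

    prior:   assumed callable mapping CLIP text features -> CLIP image features
    decoder: assumed callable mapping CLIP image features -> pixel images
    Because both stages are diffusion models, different initial noise yields
    different images for the same fixed weights, which is what makes the
    outputs usable as expansion data.
    """
    text_feat = model.encode_text(tokenizer([prompt] * n_variants))
    img_feat = prior(text_feat)    # text feature space -> image feature space
    images = decoder(img_feat)     # image features -> images carrying the
    return images                  #   semantics required by the prompt

# Hypothetical usage with the fine-tuned CLIP from Step 2:
# expanded = generate_variants(model, tokenizer, prior, decoder,
#                              "[Stroke] with [hyper] intensity on [DWI] modality")
```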
Step 4, use the paired image-text data generated in Step 3, together with the original text-image template dataset of Step 1, to cyclically fine-tune the model again. Specifically, the preliminarily fine-tuned model of Step 2 serves as the initialization of this step; that is, starting from its parameter weights, the model is trained again with the expansion data generated in Step 3 together with the original data, completing a second fine-tuning of the model parameters.
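A minimal sketch of this cyclic round, assuming the original and generated pairs are wrapped in PyTorch datasets of the same form and reusing the fine-tuning step sketched under Step 2:

```python
from torch.utils.data import ConcatDataset, DataLoader

# original_pairs:  the Step-1 text-image template dataset (assumed Dataset)
# generated_pairs: the Step-3 expansion data in the same Dataset form
cyclic_data = ConcatDataset([original_pairs, generated_pairs])
loader = DataLoader(cyclic_data, batch_size=32, shuffle=True)

# Starting from the weights of the preliminarily fine-tuned model (not from
# scratch), run the same contrastive step over the enlarged dataset:
for images, captions in loader:
    clip_finetune_step(images, captions)   # second fine-tuning of the weights
```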
Step 5, edit the target image with the cyclically fine-tuned model, providing a mask that specifies the editing region and a text description to generate the target lesion, thereby producing a labeled lesion image that can be used for subsequent segmentation tasks.
Specifically, the details of the fine-tuning of the CLIP module and the diffusion model in Step 2 are shown in Fig. 2. Matching is performed by encoding the image and the text separately, i.e., converting both into a common embedding space, and then computing the similarity between them.
In Step 5, as shown in Fig. 3, the input text (marked (2) in Fig. 3) is "A tumor at axial view with hyper-intensity", and a binary mask (the label indicated by the arrow at (3) in Fig. 3) specifies a region near the ventricle of the input image to be edited (marked (1) in Fig. 3); from these, the model generates an image containing the lesion (marked (3) in Fig. 3).
As can be seen from Fig. 3, for the same input image, different language descriptions and specified editing regions can be applied to generate lesion images on demand, while all features and details of the original image outside the editing region are preserved. The generated sample data and its labeled region can be used for subsequent segmentation tasks, greatly reducing, and even removing, the dependence on annotated data.
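One plausible realization of this mask-guided editing is RePaint/GLIDE-style diffusion inpainting, sketched below: at each denoising step, the region outside the mask is reset to a correspondingly noised copy of the original image, so only the masked region is synthesized from the text while everything outside it is preserved. The `denoise_step` and `add_noise` interfaces are assumptions standing in for a concrete diffusion decoder:

```python
import torch

@torch.no_grad()
def inpaint_lesion(decoder, image, mask, text_feat, num_steps=50):
    """Generate a lesion inside `mask` (1 = editable region), conditioned on
    text features, while keeping the original image outside the mask.

    decoder.denoise_step(x_t, t, cond) -> x_{t-1}   (assumed interface)
    decoder.add_noise(x_0, t)          -> noised x  (assumed interface)
    """
    x = torch.randn_like(image)                          # start from pure noise
    for t in reversed(range(num_steps)):
        x = decoder.denoise_step(x, t, cond=text_feat)   # text-guided denoising
        known = decoder.add_noise(image, t)              # original at noise level t
        x = mask * x + (1 - mask) * known                # keep unedited region intact
    label = mask    # the editing mask directly serves as the lesion label
    return x, label
```

Because the unedited region is copied from the original at every step, the output differs from the input only inside the mask, which is what allows the mask to double as a pixel-accurate lesion annotation for the segmentation task.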
In yet another embodiment of the invention, a brain lesion image generation system is provided that can be used to implement the above brain lesion image generation method, specifically comprising: a data collection module, a fine-tuning module, a data expansion module, and an annotated-image generation module;
the data collection module collects public datasets of brain lesions and pairs each image with a text template to form a text-image template dataset;
the fine-tuning module encodes the image and the text separately, i.e., converts both into a common embedding space, and matches them by computing the similarity between them;
the data expansion module uses the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt;
the annotated-image generation module edits a target image with the cyclically fine-tuned model, given a mask that specifies the editing region and a text description, to generate a target lesion, thereby producing a labeled lesion image.
In a further embodiment of the invention, a storage medium is provided, in particular a computer-readable storage medium (memory), which is a memory device in a terminal device for storing programs and data. The computer-readable storage medium here may include both a built-in storage medium of the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space that stores the operating system of the terminal. One or more instructions, which may be one or more computer programs (including program code), are also stored in this storage space and are adapted to be loaded and executed by a processor. The computer-readable storage medium may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory.
One or more instructions stored in the computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the brain lesion image generation method in the above embodiments; the one or more instructions are loaded by the processor and perform the following steps:
Step 1, collect public datasets of brain lesions, pair each image with a text template, and fill the reserved slots in the template according to each image sample, forming a text-image template dataset.
Step 2, use the paired text-image template dataset to fine-tune the CLIP module of DALLE2 so as to align the image and text feature spaces.
Step 3, use the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt. The CLIP module encodes the prompt into the text feature space, a prior model converts the text features into image features, and a decoder generates an image with the required semantics from those features. The generated images serve as expansion data, further satisfying the DALLE2 model's demand for data on top of the limited public datasets.
Step 4, use the paired image-text data generated in Step 3, together with the original text-image template dataset of Step 1, to cyclically fine-tune the model again.
Step 5, edit a target image with the cyclically fine-tuned model, provide a mask specifying the region of the image to edit, and input a text description to generate a target lesion, thereby producing a labeled lesion image for subsequent segmentation tasks.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to aid the reader in understanding the practice of the invention and that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.
Claims (5)
1. A multimodal text-and-image brain lesion image generation method, characterized in that it comprises the following steps:
Step 1, collect public datasets of brain lesions, pair each image with a text template, and fill the reserved slots in the template according to each image sample, forming a text-image template dataset;
Step 2, use the paired text-image template dataset to fine-tune the CLIP module of DALLE2 so as to align the image and text feature spaces;
Step 3, use the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt; the CLIP module encodes the prompt into the text feature space, a prior model converts the text features into image features, and a decoder generates an image with the required semantics from those features; the generated images serve as expansion data, further satisfying the DALLE2 model's demand for data on top of the limited public datasets;
Step 4, use the paired image-text data generated in Step 3, together with the original text-image template dataset of Step 1, to cyclically fine-tune the model again;
Step 5, edit a target image with the cyclically fine-tuned model, provide a mask specifying the region of the image to edit, and input a text description to generate a target lesion, thereby producing a labeled lesion image for subsequent segmentation tasks.
2. The multimodal text-and-image brain lesion image generation method according to claim 1, characterized in that the fine-tuning in Step 2 is specifically: loading a model pre-trained on large-scale text-image pairs as the initialization, and continuing training on it with the locally constructed text-image data of brain lesion images, completing the fine-tuning of the model weights.
3. The multimodal text-and-image brain lesion image generation method according to claim 1, characterized in that the cyclic fine-tuning in Step 4 is specifically: taking the preliminarily fine-tuned model of Step 2 as the initialization, and training the model again with the expansion data generated in Step 3 together with the original data, completing a second fine-tuning of the model parameters.
4. A brain lesion image generation system, characterized in that the system can be used to implement the brain lesion image generation method of any one of claims 1 to 3;
the brain lesion image generation system comprises: a data collection module, a fine-tuning module, a data expansion module, and an annotated-image generation module;
the data collection module collects public datasets of brain lesions and pairs each image with a text template to form a text-image template dataset;
the fine-tuning module encodes the image and the text separately, i.e., converts both into a common embedding space, and matches them by computing the similarity between them;
the data expansion module uses the fine-tuned DALLE2 model to perform image generation and variant generation from the required text prompt;
the annotated-image generation module edits a target image with the cyclically fine-tuned model, given a mask that specifies the editing region and a text description, to generate a target lesion, thereby producing a labeled lesion image.
5. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the brain lesion image generation method according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310463730.4A | 2023-04-26 | 2023-04-26 | Brain lesion image generation method and system based on text and image multimodality
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310463730.4A | 2023-04-26 | 2023-04-26 | Brain lesion image generation method and system based on text and image multimodality
Publications (1)
Publication Number | Publication Date
---|---
CN116503515A | 2023-07-28
Family
ID=87327980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310463730.4A | CN116503515A (pending) | 2023-04-26 | 2023-04-26
Country Status (1)
Country | Link
---|---
CN | CN116503515A (pending)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116775960A (en) * | 2023-08-23 | 2023-09-19 | 成都安哲斯生物医药科技有限公司 | Multi-mode medical data question-answering method and storage medium |
CN116775960B (en) * | 2023-08-23 | 2023-10-20 | 成都安哲斯生物医药科技有限公司 | Multi-mode medical data question-answering method and storage medium |
CN117558394A (en) * | 2023-09-28 | 2024-02-13 | 兰州交通大学 | Cross-modal network-based chest X-ray image report generation method |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination