CN117152282A

CN117152282A - Method and terminal for generating hand X-ray image through text

Info

Publication number: CN117152282A
Application number: CN202310940459.9A
Authority: CN
Inventors: 龚元浩; 黄万霖
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2023-07-27
Filing date: 2023-07-27
Publication date: 2023-12-01

Abstract

The invention discloses a method and a terminal for generating an X-ray image of a hand through text, wherein the method comprises the following steps: collecting hand X-ray images with different postures, and obtaining an image data set according to the collected hand X-ray images; generating corresponding text information according to the hand X-ray images to obtain a corresponding text data set; inputting the image data set and the text data set into an image generation model for training to obtain a trained image generation model; and converting text into corresponding hand X-ray images through the trained image generation model. The invention solves the problems of high hand image error rate and poor quality in the image generation process, and reduces both the mismatch between text information and generated images and the image noise.

Description

Method and terminal for generating hand X-ray image through text

Technical Field

The present invention relates to the field of text encoding and medical image generation, and in particular, to a method, a terminal, and a storage medium for generating an X-ray image of a hand through text.

Background

With the rapid development of deep neural networks and computing power, more and more large models, and multi-modal algorithms built on large models, have been studied and have achieved remarkable results. Image generation is one of the more popular and widely applied fields, covering the enhancement and reconstruction of medical images, scene synthesis for virtual reality, assisted creation in art and design, and the like. Common image generation models share several problems: training requires large computing resources, training time is long, training converges slowly, and the training result is hard to control.

In the field of medical images, existing databases generally suffer from small data volume. Medical images involve patients' private information, so ethical factors must be considered; moreover, producing a medical image database requires the participation of experts in the medical industry, so producing a large database is difficult and complex. Image generation therefore has broad prospects and practical demand in expanding medical image databases.

However, existing medical image generation methods commonly suffer from the following problems: 1. the generation quality is poor, and the generated images cannot be used directly to supplement a database; 2. the generation is too random, and the generated artificial data must be cleaned; 3. the generated data is unlabeled and still needs to be labeled by medical experts.

Therefore, the conventional technology for generating the medical image according to the text has the problems of high error rate and poor quality of the hand image.

Disclosure of Invention

The invention aims to solve the technical problems of high hand image error rate and poor quality in the existing image generation process, and to reduce both the mismatch between text information and generated images and the image noise.

The technical scheme adopted for solving the technical problems is as follows:

in a first aspect, the invention provides a method of generating an X-ray image of a hand by text, comprising:

collecting hand X-ray images with different postures, and obtaining an image dataset according to the collected hand X-ray images;

generating corresponding text information according to the hand X-ray image to obtain a corresponding text data set;

inputting the image data set and the text data set into an image generation model for training to obtain a trained image generation model;

and converting text into corresponding hand X-ray images through the trained image generation model.

In one implementation, the acquiring hand X-ray images of different postures, and obtaining an image dataset according to the acquired hand X-ray images, includes:

driving fingers of a pre-constructed hand three-dimensional model to bend, and acquiring hand models with different postures;

performing X-ray projection on the hand models with different postures to obtain corresponding hand X-ray images;

and obtaining an image data set according to the acquired hand X-ray image.

In one implementation manner, the performing X-ray projection on the hand models with different postures to obtain corresponding hand X-ray images includes:

storing the intensity information of the X-rays on the projection surfaces of the hand models in different postures, and converting the intensity values of all pixel positions on each projection surface into integer variables of 0-255 to obtain corresponding hand X-ray images.

In one implementation, the generating corresponding text information according to the hand X-ray image to obtain a corresponding text data set includes:

performing corresponding text expression on the hand X-ray image to obtain a corresponding text data set; wherein the specific paradigm of the text expression includes age, gender, and gesture; the gestures include a normal flat gesture and other finger curved gestures.

In one implementation, the inputting the image dataset and the text dataset into an image generation model, to obtain a trained image generation model, includes:

generating a training file according to the image dataset and the text dataset;

and inputting the training file into the image generation model for training to obtain the trained image generation model.

In one implementation, the generating a training file from the image dataset and the text dataset includes:

generating a training file according to the hand X-ray images in the image data set and the text information corresponding to the text data set;

each row of the training file corresponds to one hand X-ray image and text information corresponding to the hand X-ray image.

In one implementation, the image generation model includes: a text feature extraction encoder and an image generation decoder.

In one implementation, the converting text into corresponding hand X-ray images through the trained image generation model includes:

inputting text information, and converting the text information into text feature information through the text feature extraction encoder;

inputting the text feature information into the image generation decoder to generate a corresponding hand X-ray image.

In a second aspect, the present invention also provides a terminal, including: a processor and a memory storing a text-to-hand X-ray image program which, when executed by the processor, is operative to implement the method of text-to-hand X-ray image of the first aspect.

In a third aspect, the present invention also provides a computer-readable storage medium storing a text-to-hand X-ray image program for implementing the operations of the text-to-hand X-ray image method of the first aspect when executed by a processor.

The technical scheme adopted by the invention has the following effects:

according to the method, hand X-ray images in different postures are collected to construct an image data set and a corresponding text data set, which are used to train the image generation model under conditional constraints and obtain a trained image generation model; text information is then encoded and used to guide the trained model to generate the corresponding hand X-ray image. The method solves the problems of high error rate and poor quality of hand images in the existing image generation process, reduces both the mismatch between text information and generated images and the image noise, reduces the randomness of image generation, keeps the input process simple and the text and image highly consistent, and yields generated images with a low error rate and high resolution. Because text information is added, a human-machine interaction interface is provided, which makes the image generation method more widely and conveniently applicable.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method of textually generating an X-ray image of a hand in one implementation of the invention.

FIG. 2 is a schematic view of an X-ray three-dimensional simulated projection in one implementation of the invention.

FIG. 3 is a schematic diagram of a text data format portion presentation in one implementation of the invention.

FIG. 4 is a schematic diagram of an overall image generation model in one implementation of the invention.

FIG. 5 is a schematic diagram of the resulting effect in one implementation of the invention.

FIG. 6 is a functional schematic of a terminal in one implementation of the invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Exemplary method

Common image generation models share several problems: training requires large computing resources, training time is long, training converges slowly, and the training result is hard to control. In the medical field, patient privacy and ethical considerations, together with the need for medical experts to take part in building large databases, limit the available data; as a result, medical image generation suffers from poor generation quality, strong randomness, and unlabeled generated data.

To address these technical problems, the embodiment of the invention provides a method for generating a hand X-ray image through text. The method collects hand X-ray images with different postures, constructs an image data set and a corresponding text data set, and uses them to train an image generation model under conditional constraints, obtaining a trained image generation model; text information is then encoded and used to guide the trained model to generate a corresponding hand X-ray image. The embodiment of the invention solves the problem of high hand image error rate in the existing image generation process, and reduces both the mismatch between text information and generated images and the image noise.

As shown in fig. 1, an embodiment of the present invention provides a method for generating an X-ray image of a hand through text, including the following steps:

step S100, collecting hand X-ray images in different postures, and obtaining an image data set according to the collected hand X-ray images.

In this embodiment, a method for generating a hand X-ray image through text is provided. The method constructs an image data set by collecting hand X-ray images with different postures, and gives each image in the image data set a corresponding text representation to form a text data set; the image data set and the corresponding text data set are input into an image generation model to obtain a trained image generation model; and text is converted into a corresponding hand X-ray image through the trained image generation model. The method solves the problems of high error rate and poor quality of hand images in the existing image generation process, reduces both the mismatch between text information and generated images and the image noise, reduces the randomness of image generation, keeps the input process simple and the text and image highly consistent, and yields generated images with a low error rate and high resolution.

Specifically, in one implementation manner of the embodiment of the present invention, the step S100 includes the following steps:

step S110, driving fingers of a pre-constructed hand three-dimensional model to bend, and acquiring hand models with different postures;

step S120, performing X-ray projection on the hand models with different postures to obtain corresponding hand X-ray images;

step S130, an image data set is obtained according to the collected hand X-ray image.

Because the original hand bone age prediction data set only has age and gender information, the hand bone age prediction data set has the problems of relatively single gesture information, mostly flat gesture and poor data variety; therefore, the invention simulates various complex gestures through the projection of the pre-constructed hand three-dimensional model to obtain more hand models so as to obtain more hand X-ray images.

In this embodiment, more hand X-ray image data are obtained by driving the hand model into different postures, and X-ray three-dimensional parallel projection is performed on the hand model to ensure that the hand X-ray images are complete and clear. The intensity information of the X-rays on the projection surface is retained, and the intensity values of all pixel positions in the hand X-ray image are converted into integer variables to obtain the image data set. The image data set contains rich, clear, and complete image information, so it can be conveniently organized into a training file for input into the image generation model, and individual samples can be located accurately when the image generation model is trained.

Specifically, in one implementation manner of the embodiment of the present invention, the step S120 includes the following steps:

step S121, saving the intensity information of X-rays on the projection surfaces of the hand models with different postures, and converting the intensity values of all pixel positions on each projection surface into integer variables of 0-255 to obtain corresponding hand X-ray images.

In this embodiment, the fingers of the pre-constructed hand three-dimensional model are driven to bend through parameters, and the hand three-dimensional model is saved as three-dimensional mesh (surface patch) models with different hand postures; the saved files are in the STL file format.
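The source does not describe how the bending parameters act on the model. As a minimal sketch, assuming a finger segment is bent by rotating its mesh vertices about an axis through a joint, the idea can be illustrated with NumPy; the function name `bend_about_joint`, the choice of the x axis, and the toy vertices are all hypothetical.

```python
import numpy as np

def bend_about_joint(vertices, joint, angle_rad):
    """Rotate vertices by angle_rad about an x-parallel axis through `joint`."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0, c, -s],
                      [0.0, s, c]])
    # Shift so the joint is the origin, rotate, then shift back.
    return (vertices - joint) @ rot_x.T + joint

# Hypothetical finger segment: two vertices lying along the y axis.
segment = np.array([[0.0, 1.0, 0.0],
                    [0.0, 2.0, 0.0]])
joint = np.array([0.0, 1.0, 0.0])  # assumed joint position

# Bend the segment by 90 degrees; the vertex at the joint does not move.
bent = bend_about_joint(segment, joint, np.pi / 2)
```

Sweeping `angle_rad` over a range of values would give a family of postures from one base model, which is one plausible reading of "driven to bend through parameters".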

in this embodiment, three-dimensional projection of X-rays is performed on hand models with different postures, and as illustrated in fig. 2, the projection mode is parallel projection:

P(x, y) = ∫ f(x, y, z) dz

where P(x, y) represents the pixel value at position (x, y) on the projected image, and f(x, y, z) is the absorption coefficient of the projected object at coordinates (x, y, z).

The intensity information of the X-rays on the projection surface is stored, and the intensity values of all pixel positions in the hand X-ray image are converted into integer variables of 0-255 to obtain the image data set.
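The projection and the 0-255 conversion can be sketched as follows. This is an illustrative approximation only: it assumes the hand model has already been voxelized into a 3D array of absorption coefficients, approximates the integral along z by a Riemann sum, and uses linear min-max rescaling, which the source does not specify. The names `parallel_projection` and `to_uint8` and the toy volume are hypothetical.

```python
import numpy as np

def parallel_projection(volume, dz=1.0):
    """Approximate P(x, y) = integral of f(x, y, z) dz by summing along z."""
    return volume.sum(axis=2) * dz

def to_uint8(projection):
    """Linearly rescale projection values to integers in the range 0-255."""
    lo, hi = float(projection.min()), float(projection.max())
    if hi == lo:
        # A constant projection carries no contrast; map it to all zeros.
        return np.zeros(projection.shape, dtype=np.uint8)
    scaled = (projection - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

# Toy volume standing in for a voxelized hand model (assumption):
# a solid block with absorption coefficient 1 inside empty space.
volume = np.zeros((64, 64, 32), dtype=np.float32)
volume[16:48, 24:40, 8:24] = 1.0

image = to_uint8(parallel_projection(volume))
```

Pixels whose rays pass through more absorbing material receive larger values, so in this toy case the block appears as a bright rectangle on a dark background.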

In this embodiment, the hand three-dimensional model is driven and saved as models with different hand gestures, X-ray three-dimensional projection is performed on these hand models, the X-ray intensity information on the projection surfaces of the hand models in different gestures is retained, and the intensity values of all pixel positions on each projection surface are converted into integer variables to obtain an image data set; the image data set obtained in this way contains image data with richer hand information.

step S200, generating corresponding text information according to the hand X-ray image to obtain a corresponding text data set.

In this embodiment, every image in the image data set is given a corresponding text representation, which generates its corresponding Prompt, that is, the text template used in the text representation; this facilitates text feature extraction when the training images are input into the image generation model for training. Designing the Prompt helps unify the text features, which improves the accuracy of the generated images and the degree of correspondence between images and text.

Specifically, in one implementation manner of the embodiment of the present invention, the step S200 includes the following steps:

step S210, performing corresponding text expression on the hand X-ray image to obtain a corresponding text data set; wherein the specific paradigm of the text expression includes age, gender, and gesture; the gestures include a normal flat gesture and a plurality of finger bending gestures.

In this embodiment, corresponding text representations are produced for the images in the image data set, and the corresponding Prompts are generated to obtain the text data set; distinguishing the gestures of the fingers yields more text data and improves the accuracy of the subsequently trained image generation model.

In this embodiment, a corresponding text expression is produced for each X-ray image in the image data set, where the specific template of the text expression is composed of age, gender, and gesture; the gestures include a normal flat gesture and other finger bending gestures, and the English templates for the two gestures are as follows:

Prompt_1 = "This is an X-ray image of hand from M-month-old girl/boy, flat pose."

Prompt_2 = "This is an X-ray image of hand from M-month-old girl/boy, other pose with fingers curved."

wherein M is the corresponding age (in months), girl/boy corresponds to gender, flat pose represents a flat normal gesture, and other pose represents other finger bending gestures.
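The two templates can be filled in programmatically. The following is a minimal sketch; the function name `build_prompt`, its parameters, and the exact spacing after the comma are assumptions rather than part of the source.

```python
def build_prompt(age_months: int, gender: str, flat: bool) -> str:
    """Fill the age/gender/gesture template for one hand X-ray image."""
    if gender not in ("girl", "boy"):
        raise ValueError("gender must be 'girl' or 'boy'")
    if age_months < 0:
        raise ValueError("age_months must be non-negative")
    # Prompt_1 uses the flat pose; Prompt_2 uses the curved-finger pose.
    pose = "flat pose" if flat else "other pose with fingers curved"
    return (
        f"This is an X-ray image of hand from "
        f"{age_months}-month-old {gender}, {pose}."
    )

flat_prompt = build_prompt(120, "girl", flat=True)
curved_prompt = build_prompt(96, "boy", flat=False)
```

Generating every prompt from one function keeps the wording identical across the data set, which is the unification of text features that the Prompt design aims for.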

In this embodiment, each image in the image data set is given a corresponding text representation. In the text template, the age corresponding to the image is expressed in months, the gender of the person to whom the image belongs is indicated by girl/boy, and the flat normal gesture or the finger bending gesture is indicated by flat pose/other pose; the text representation of an image yields its corresponding Prompt. Converting the text into the Prompt makes it easy to generate a training file paired with the corresponding images and facilitates text feature extraction when the training images are input into the image generation model for training. Designing the Prompt helps unify the text features, which improves the accuracy of the generated hand X-ray images and the degree of correspondence between the generated hand X-ray images and the input text.

and step S300, inputting the image data set and the text data set into an image generation model for training, and obtaining the trained image generation model.

In this embodiment, the image data set and the corresponding text data set form a training file, which is input into the image generation model. The text data is used as the input of the image generation model, the corresponding image data is used as the training target, and the parameters of the image generation model are adjusted to obtain the trained image generation model; as a result, images generated from text have a low error rate, are close to real X-ray images, are rich in detail, and correspond accurately to the text information.

Specifically, in one implementation manner of the embodiment of the present invention, the step S300 includes the following steps:

step S310, generating a training file according to the image data set and the text data set;

step S320, inputting the training file into the image generation model for training, so as to obtain the trained image generation model.

In this embodiment, the image generation model is composed of a text feature extraction encoder and an image generation decoder, and the image generation decoder mainly consists of a UNet neural network that iteratively denoises random noise and a latent-space autoencoder.

In this embodiment, the obtained image data set and its corresponding text data set form a training file, which is input into the image generation model. The text information in the text data set is encoded by the text feature extractor to generate text feature information, which has the same size as the image features in the UNet. The text feature information is combined with the image features during denoising to generate an image carrying the text information, and the difference between the generated image and the original image in the image data set corresponding to the text is back-propagated to the UNet as guidance for adjusting the UNet parameters.

Specifically, in one implementation manner of the embodiment of the present invention, the step S310 includes the following steps:

step S311, generating a training file according to the hand X-ray images in the image data set and the text information corresponding to the text data set; each row of the training file corresponds to one hand X-ray image and text information corresponding to the hand X-ray image.

In this embodiment, a training file is made according to the images in the image dataset and the text information of the corresponding paradigm, wherein the images in the image dataset include real pictures and pictures simulating projection; in the training file, each row corresponds to an image and text information corresponding to the image, and the format of the training file is as follows:

{"file_name": "X.png", "text": Prompt}

where "X.png" is the file name of the image file, Prompt is the text template used in the text representation, and the internal format of the entire file is shown in FIG. 3.
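A training file with one JSON object per row can be produced as sketched below. The writer function `write_training_file`, the example file names, and the use of an in-memory buffer are illustrative assumptions; the source only fixes the two keys `file_name` and `text`.

```python
import io
import json

def write_training_file(records, stream):
    """Write one JSON object per row: {"file_name": ..., "text": ...}."""
    for file_name, prompt in records:
        row = {"file_name": file_name, "text": prompt}
        stream.write(json.dumps(row) + "\n")

# Hypothetical image/prompt pairs.
records = [
    ("0001.png",
     "This is an X-ray image of hand from 120-month-old girl, flat pose."),
    ("0002.png",
     "This is an X-ray image of hand from 96-month-old boy, "
     "other pose with fingers curved."),
]

# An in-memory buffer stands in for a file opened in text mode.
buffer = io.StringIO()
write_training_file(records, buffer)

# Reading the file back: each line parses independently of the others.
rows = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because each row is a complete record, a training loop can sample rows at random without parsing the whole file into a nested structure first.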

In this embodiment, the training file is composed of the images in the image data set and the text information in the corresponding template; this training file structure facilitates random sampling of each batch during training, and also facilitates data augmentation before training, such as horizontal flipping and vertical flipping.
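Horizontal and vertical flipping can be sketched with NumPy as follows. The helper name `flip_augment` and the 2 x 2 toy image are hypothetical, and the source does not say which library performs the augmentation.

```python
import numpy as np

def flip_augment(image):
    """Return the original image with its horizontal and vertical flips."""
    return {
        "original": image,
        "horizontal": np.fliplr(image),  # mirror left-right
        "vertical": np.flipud(image),    # mirror top-bottom
    }

# Tiny stand-in for a 0-255 hand X-ray image.
image = np.array([[0, 50],
                  [100, 255]], dtype=np.uint8)
views = flip_augment(image)
```

Since the Prompt encodes only age, gender, and gesture, a flipped image can presumably reuse the text of its source row, so each augmented view adds one more row to the training file.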

In this embodiment, the image data set and the corresponding text data set are input into the large image generation model Stable Diffusion for training, and the overall framework of the image generation model Stable Diffusion is shown in FIG. 4.

In this embodiment, the image generation model Stable Diffusion is composed of a text feature extraction encoder and an image generation decoder. The text feature extractor is CLIP, a pre-trained neural network model trained by OpenAI on 400 million image-text pairs, and is used to encode the input text. The generation decoder mainly consists of a UNet neural network that iteratively denoises random noise and a latent-space autoencoder; the autoencoder reduces the dimension of the input image, which reduces the number of UNet parameters and allows the UNet to denoise and generate a latent-space image containing the text information.

In this embodiment, when training the image generation model Stable Diffusion, the Prompt text is used as the input of the whole model and is converted into text feature information by the text feature extractor. The text features have the same size as the image features in the UNet; after they are combined with the image features during denoising, an image carrying the text information is generated, and the difference between the generated image and the image in the image data set corresponding to the text is back-propagated to the UNet as guidance for adjusting the UNet parameters, so as to obtain the trained image generation model.

In this embodiment, mixed precision of fp16 + fp32 is used when training the image generation model, that is, 16-bit (2-byte) and 32-bit (4-byte) floating-point precision; this reduces the GPU memory required by the whole large model, adapts it to more application scenarios, and speeds up both the convergence and the training of the model.
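The storage figures above can be checked with NumPy, along with one common reason for keeping part of the computation in fp32. This is a sketch for illustration: the parameter count and the size of the update are hypothetical, and the source does not state which tensors are held in which precision.

```python
import numpy as np

def tensor_bytes(num_values, dtype):
    """Bytes needed to store num_values elements of the given dtype."""
    return num_values * np.dtype(dtype).itemsize

num_params = 1_000_000  # hypothetical parameter count
fp16_bytes = tensor_bytes(num_params, np.float16)  # 2 bytes per value
fp32_bytes = tensor_bytes(num_params, np.float32)  # 4 bytes per value

# A small update can vanish when added to a weight stored in fp16,
# because fp16 values near 1.0 are spaced about 0.001 apart.
small_update = 1e-4  # hypothetical step size
lost_in_fp16 = bool(
    np.float16(1.0) + np.float16(small_update) == np.float16(1.0)
)
kept_in_fp32 = bool(
    np.float32(1.0) + np.float32(small_update) != np.float32(1.0)
)
```

Storing values in fp16 halves their memory footprint, while applying small updates in fp32 avoids the rounding loss shown by `lost_in_fp16`.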

In this embodiment, the images in the image data set and their corresponding text data form a training file used to train the image generation model. During training, the text data is used as the input of the whole model, and the corresponding original images guide the adjustment of the model parameters to obtain the trained image generation model; as a result, images generated from text by the trained model have a low error rate, are close to real X-ray images, are rich in detail, and correspond accurately to the text information.

step S400, converting text into corresponding hand X-ray images through the trained image generation model.

In this embodiment, the required text information is input into the trained image generation model and encoded into text features, which guide the trained image generation model to generate, from random noise, a hand X-ray image whose length and width are both 512; the generated hand X-ray image is close to a real X-ray image, is rich in detail, and corresponds accurately to the text information, which solves the problems of high hand image error rate, poor correspondence to the text information, and heavy image noise in the image generation process.

Specifically, in one implementation manner of the embodiment of the present invention, the step S400 includes the following steps:

step S410, inputting text information, and converting the text information into text feature information through the text feature extraction encoder;

step S420, inputting the text feature information into the image generation decoder to generate a corresponding hand X-ray image.

In this embodiment, text information is input, the text information is encoded into text features based on the Prompt designed above, the text features are input into a trained image generation model, and iterative generation is performed to obtain a corresponding hand X-ray image.

In this embodiment, based on the Prompt designed above, a corresponding text of the form "age + gender + gesture" is input. The text is encoded by the text feature extractor into text feature information that the image generation model can use, and the text feature information enters the iterative denoising module UNet to guide the UNet in iteratively denoising a random noise map into a clear image. Because the image generation model has already been trained, the UNet parameters need not change here to generate a corresponding X-ray image containing the input text information; the resulting generated image is shown in FIG. 5.

The following technical effects are achieved through the technical scheme:

Exemplary apparatus

Based on the above embodiment, the present invention further provides a terminal, including: the system comprises a processor, a memory, an interface, a display screen and a communication module which are connected through a system bus; wherein the processor is configured to provide computing and control capabilities; the memory includes a storage medium and an internal memory; the storage medium stores an operating system and a computer program; the internal memory provides an environment for the operation of the operating system and computer programs in the storage medium; the interface is used for connecting external equipment, such as mobile terminals, computers and other equipment; the display screen is used for displaying corresponding information; the communication module is used for communicating with a cloud server or a mobile terminal.

The computer program is configured to perform operations of a method for textually generating an X-ray image of a hand when executed by the processor.

It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal to which the present inventive arrangements may be applied, and that a particular terminal may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, there is provided a computer terminal, including: a processor and a memory storing a text-to-hand X-ray image program which when executed by the processor is operative to implement the operations of the text-to-hand X-ray image method as described above.

In one embodiment, a computer readable storage medium is provided, wherein the computer readable storage medium stores a textually generated hand X-ray image program that when executed by the processor is operative to implement the method of textually generating hand X-ray images as described above.

Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program comprising instructions for the relevant hardware, the computer program being stored on a non-volatile storage medium, the computer program when executed comprising the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory.

In summary, the invention discloses a method and a terminal for generating an X-ray image of a hand through text, wherein the method comprises the following steps: collecting hand X-ray images with different postures, and obtaining an image data set according to the collected hand X-ray images; generating corresponding text information according to the hand X-ray images to obtain a corresponding text data set; inputting the image data set and the text data set into an image generation model for training to obtain a trained image generation model; and converting text into corresponding hand X-ray images through the trained image generation model. The invention solves the problems of high hand image error rate and poor quality in the image generation process, and reduces both the mismatch between text information and generated images and the image noise.

It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims

1. A method for generating a hand X-ray image from text, the method comprising the steps of:

2. The method of generating hand X-ray images from text according to claim 1, wherein the acquiring hand X-ray images of different poses and deriving the image dataset from the acquired hand X-ray images comprises:

and obtaining an image data set according to the acquired hand X-ray image.

3. The method for generating a hand X-ray image through text according to claim 2, wherein the performing X-ray projection on the hand models with different postures to obtain the corresponding hand X-ray image comprises:

4. The method for generating a hand X-ray image by text according to claim 1, wherein generating corresponding text information from the hand X-ray image, to obtain a corresponding text data set, comprises:

5. The method of generating a hand X-ray image by text according to claim 1, wherein said inputting the image dataset and the text dataset into an image generation model results in a trained image generation model comprising:

generating a training file according to the image dataset and the text dataset;

6. The method of generating a hand X-ray image from text as defined in claim 5, wherein generating a training file from the image dataset and the text dataset comprises:

7. The method of generating a hand X-ray image from text according to claim 1, wherein the image generation model comprises: a text feature extraction encoder and an image generation decoder.

8. The method of generating a hand X-ray image from text according to claim 7, wherein said generating a model from said trained image converts text into a corresponding hand X-ray image, comprising:

9. A terminal, comprising: a processor and a memory storing a text-to-hand X-ray image program for implementing the text-to-hand X-ray image method of any one of claims 1-8 when executed by the processor.

10. A computer readable storage medium storing a text-to-hand X-ray image program which, when executed by a processor, is operable to carry out the operations of the text-to-hand X-ray image method of any one of claims 1 to 8.