CN116168108A - Method and device for generating image through text, storage medium and electronic equipment - Google Patents
- Publication number
- CN116168108A (application number CN202310266304.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- data
- stage
- time step
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Image Processing (AREA)
Abstract
The present disclosure relates to the field of image data processing and generation technologies, and in particular to a method and apparatus for generating an image from text, a computer-readable storage medium, and an electronic device. The method includes: encoding natural language text describing an image to obtain text-encoded data; fusing the text-encoded data with first Gaussian noise of a first preset time step to obtain intermediate text data; sequentially processing the intermediate text data corresponding to each time step with encoded image generation models of different time steps to obtain image-encoded data; and decoding the image-encoded data to obtain a target image; wherein the set of time steps of the plurality of encoded image generation models completely covers the first preset time step. The technical scheme of the disclosed embodiments improves the precision of images generated from text and their match with the semantic information of the text.
Description
Technical Field
The present disclosure relates to the field of image data processing and generation technology, and in particular to a method and apparatus for generating an image from text, a computer-readable storage medium, and an electronic device.
Background
Text-based image generation has wide application prospects in many scenarios, including personalized wallpaper creation for mobile phone theme providers, sourcing of creative images for slides, content creation in virtual spaces, multimodal dialogue interaction systems, and the like.
However, text-to-image methods in the related art have low precision, and the images they generate match the semantic information of the text poorly.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide a method of generating an image from text, an apparatus for generating an image from text, a computer-readable medium, and an electronic device, thereby improving, at least to some extent, the precision of text-generated images and their match with the semantic information of the text.
According to a first aspect of the present disclosure, there is provided a method of generating an image from text, comprising: encoding natural language text describing an image to obtain text-encoded data; fusing the text-encoded data with first Gaussian noise of a first preset time step to obtain intermediate text data; sequentially processing the intermediate text data corresponding to each time step with encoded image generation models of different time steps to obtain image-encoded data; and decoding the image-encoded data to obtain a target image; wherein a set of time steps of a plurality of said encoded image generation models completely covers said first preset time step.
According to a second aspect of the present disclosure, there is provided an apparatus for generating an image from text, comprising: an encoding module configured to encode natural language text describing an image to obtain text-encoded data; a fusion module configured to fuse the text-encoded data with first Gaussian noise of a first preset time step to obtain intermediate text data; a processing module configured to sequentially process the intermediate text data corresponding to each time step with encoded image generation models of different time steps to obtain image-encoded data; and a decoding module configured to decode the image-encoded data to obtain a target image; wherein a set of time steps of a plurality of said encoded image generation models completely covers said first preset time step.
According to a third aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus, comprising: one or more processors; and a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the methods described above.
According to the method for generating an image from text provided by the embodiments of the present disclosure, natural language text describing an image is encoded to obtain text-encoded data; the text-encoded data is fused with first Gaussian noise of a first preset time step to obtain intermediate text data; the intermediate text data corresponding to each time step is processed sequentially with encoded image generation models of different time steps to obtain image-encoded data; and the image-encoded data is decoded to obtain a target image. Compared with the prior art, sequentially processing the intermediate text data with encoded image generation models of different time steps increases the model capacity and thereby improves the quality of the generated images.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow chart of a method of text generation of an image in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a block diagram of capturing text in natural language in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flowchart for acquiring image encoding data in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart for acquiring an encoded image generation model in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a dataflow diagram of a training encoded image generation model in an exemplary embodiment of the disclosure;
FIG. 7 schematically illustrates a flowchart for acquiring a target image in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a flowchart for acquiring an intermediate image in an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates a dataflow diagram of a training image decoding model in an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates another flowchart for acquiring an intermediate image in an exemplary embodiment of the present disclosure;
FIG. 11 schematically illustrates another flowchart for acquiring a target image in an exemplary embodiment of the present disclosure;
FIG. 12 schematically illustrates a flowchart of still another method of acquiring a target image in an exemplary embodiment of the present disclosure;
FIG. 13 schematically illustrates a dataflow diagram of a method of text generation of an image in an exemplary embodiment of the disclosure;
FIG. 14 schematically illustrates a dataflow diagram of a web page-side text-generated image in an exemplary embodiment of the present disclosure;
FIG. 15 schematically illustrates an effect presentation of a web page end generating preview image in an exemplary embodiment of the present disclosure;
fig. 16 schematically illustrates a composition diagram of an apparatus for text-generating an image in an exemplary embodiment of the present disclosure;
fig. 17 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
In the related art, text-based image generation has wide application prospects in many scenarios, including personalized wallpaper creation for mobile phone theme vendors, sourcing of creative images for slides, content creation in virtual spaces, multimodal dialogue interaction systems, and the like. The main challenge of such open-domain text-to-image generation is to produce high-quality images that carry the semantic information of the guiding text. Current technical schemes are mainly based on GAN models, autoregressive models, and diffusion models.
A GAN assumes that a latent variable Z follows some common distribution (e.g., a normal or uniform distribution) and trains a generator X = G(Z) together with a discriminator that learns to tell real pictures from generated ones until the two become indistinguishable. The probability density function is not modeled explicitly; instead, the generator and the discriminator are trained adversarially. The main disadvantages are unstable training and poor sample diversity.
An autoregressive model expresses the joint distribution as a Bayesian network through the chain-rule factorization: each element is generated conditioned on the previously generated results. The approach was first applied to text and speech generation in the NLP field, but it also fits images, since an image consists of a finite number of pixels whose values are discrete and finite, so an image can be described by a discrete distribution. The problems with autoregression are that generation is too slow and the decoding space is too large to generate high-resolution images.
Based on the above drawbacks, the present disclosure provides a method for generating an image by text, fig. 1 shows a schematic diagram of a system architecture in which the method for generating an image by text may be implemented, and the system architecture 100 may include a terminal 110 and a server 120. The terminal 110 may be a terminal device such as a smart phone, a tablet computer, a desktop computer, a notebook computer, etc., and the server 120 generally refers to a background system that provides a service related to text-to-image generation in the present exemplary embodiment, and may be a server or a cluster formed by multiple servers. The terminal 110 and the server 120 may form a connection through a wired or wireless communication link for data interaction.
In one embodiment, the above-described method of text generating an image may be performed by the terminal 110. For example, the user obtains natural language text using the terminal 110, and the terminal 110 generates a target image based on the natural language text and outputs the target image.
In one embodiment, the method of generating an image from text described above may be performed by the server 120. For example, after the user acquires the natural language text using the terminal 110, the terminal 110 uploads the natural language text to the server 120, the server 120 generates a target image based on the natural language text, and returns the target image to the terminal 110.
As can be seen from the above, the subject of execution of the method of generating an image by text in the present exemplary embodiment may be the terminal 110 or the server 120 described above, which is not limited by the present disclosure.
The method for generating an image from text in the present exemplary embodiment will be described below with reference to fig. 2, which shows an exemplary flow of the method and may include:
step S210, coding natural language texts describing images to obtain text coding data;
step S220, fusing the text coding data with first Gaussian noise of a first preset time step to obtain intermediate text data;
Step S230, sequentially processing the intermediate text data corresponding to the time step by using coding image generation models of different time steps to obtain image coding data;
step S240, decoding the image coding data to obtain a target image;
wherein a set of time steps of a plurality of said encoded image generation models completely covers said first preset time step.
Based on the method, the image coding data is obtained by sequentially processing the intermediate text data corresponding to the time step by using the coding image generation models with different time steps, so that the model capacity is increased, and the quality of image generation is improved.
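The four steps above can be sketched end to end. The toy text encoder, noise mixing rule, and denoising stage models below are hypothetical stand-ins for illustration only, not the concrete networks the disclosure describes:

```python
import numpy as np

def encode_text(text: str, dim: int = 8) -> np.ndarray:
    # Hypothetical stand-in for a text encoder: accumulate UTF-8 bytes
    # into a fixed-size vector (a real system uses a learned encoder).
    vec = np.zeros(dim)
    for i, b in enumerate(text.encode("utf-8")):
        vec[i % dim] += b
    return vec / max(len(text), 1)

def fuse_with_noise(code: np.ndarray, t: int, total_steps: int = 1000) -> np.ndarray:
    # Mix the encoded data with Gaussian noise; larger t means more noise.
    alpha = 1.0 - t / total_steps
    rng = np.random.default_rng(0)
    return np.sqrt(alpha) * code + np.sqrt(1.0 - alpha) * rng.standard_normal(code.shape)

def run_stages(x: np.ndarray, models) -> np.ndarray:
    # Stage N runs first; each earlier stage consumes the later stage's output.
    for model in reversed(models):
        x = model(x)
    return x

# Hypothetical stage models: each shrinks its input toward the mean ("denoising").
stages = [lambda x: 0.9 * x + 0.1 * x.mean() for _ in range(3)]

text_code = encode_text("a misty forest at dawn")
intermediate = fuse_with_noise(text_code, t=999)
image_code = run_stages(intermediate, stages)
print(image_code.shape)  # (8,)
```

The stage models run from stage N down to stage 1, matching the serial arrangement of step S230 below.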
Each step in fig. 2 is specifically described below.
Referring to fig. 2, in step S210, natural language text describing an image is encoded to obtain text encoded data.
In an example embodiment of the present disclosure, the method for generating an image from text may be applied at a web page. Before the natural language text describing the image is encoded into text-encoded data, that text may first be obtained.
In an example embodiment, the natural language text describing the image may include a picture subject, detail words, and modifier words. The detail words can be combined freely, and the modifier words can specify one style or several; the basic requirement is that the text conform to normal grammatical logic. As shown in fig. 3, an example text describes a very detailed digital painting of a person standing in front of a portal, set in a mysterious forest of fantastical trees; the experience page also offers several preset text descriptions. A style can then be selected in the functional area, including cartoon, Chinese style, and the like, or customized to the user's needs, along with personalized labels comprising detail words and modifier words. Finally, the modifiable settings include the resolution of the picture, the number of pictures generated, the level of picture detail, the number of generation steps, and so on; the specific content of the natural language text describing the image may likewise be customized to the user's requirements, which this example embodiment does not specifically limit.
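As a sketch of how such a prompt might be assembled from a picture subject, detail words, and modifier words (the helper name and example words are illustrative, not from the disclosure):

```python
def build_prompt(subject: str, details: list, modifiers: list) -> str:
    # Assemble a drawing prompt from a picture subject, freely combinable
    # detail words, and style modifier words; empty entries are skipped.
    parts = [subject] + details + modifiers
    return ", ".join(p.strip() for p in parts if p.strip())

prompt = build_prompt(
    "a person standing before a portal in a mysterious forest",
    ["fantasy trees", "highly detailed"],
    ["digital painting", "cartoon style"],
)
print(prompt)
```

In a real interface the details and modifiers would come from the personalized-label and style selectors rather than hard-coded lists.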
In this exemplary embodiment, the natural language text describing the image may be obtained through user input and selection. Specifically, referring to fig. 3, the user may enter a text and then select a style, personalized labels, and so on, where the style and the personalized labels are optional.
In an exemplary embodiment of the present disclosure, after the natural language text describing the image is obtained, it may be encoded to obtain text-encoded data. Specifically, the XLM-RoBERTa-Large-ViT-B-16-plus encoder may be used to encode the natural language text, or another encoder may be used, which this exemplary embodiment does not specifically limit.
In step S220, the text encoding data is fused with the first gaussian noise of the first preset time step to obtain intermediate text data.
In an example embodiment of the present disclosure, after the text-encoded data is obtained, it may be fused with first Gaussian noise, where the fusion proceeds over a preset number of time steps; the first preset time step may be 1000 steps or 2000 steps, or may be customized according to the user's requirements, which this example embodiment does not specifically limit. The intermediate text data is obtained after fusion.
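One common way to realize this fusion is the closed-form forward diffusion step; the linear beta schedule below is an assumption for illustration, not a schedule specified by the disclosure:

```python
import numpy as np

def make_alpha_bar(total_steps: int = 1000, beta_start: float = 1e-4,
                   beta_end: float = 0.02) -> np.ndarray:
    # Cumulative product of (1 - beta_t) under an assumed linear beta schedule.
    betas = np.linspace(beta_start, beta_end, total_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0: np.ndarray, t: int, alpha_bar: np.ndarray, rng) -> np.ndarray:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(42)
x0 = np.ones(4)
x_999 = q_sample(x0, 999, alpha_bar, rng)  # nearly pure noise at the last step
x_0 = q_sample(x0, 0, alpha_bar, rng)      # nearly the clean signal at step 0
print(alpha_bar[0], alpha_bar[-1])
```

With 1000 steps, the signal coefficient decays from almost 1 at step 0 to almost 0 at step 999, which is the sense in which "more time steps" means "more noise".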
In step S230, the intermediate text data corresponding to the time step is sequentially processed by using the encoded image generation models of different time steps to obtain image encoded data.
In the present example embodiment, referring to fig. 4, the above step may include steps S410 to S430.
In step S410, the intermediate text data of the nth stage is processed by using the encoded image generation model corresponding to the nth stage to obtain the image encoded data of the nth stage.
In this example embodiment, referring to fig. 5, the method may further include acquiring the encoded image generation model, and in particular, may include steps S510 to S530.
In step S510, a first initial model and first training data are acquired, where the first training data includes a plurality of reference text encoding data and truth image encoding data corresponding to the reference text encoding data.
In one example embodiment of the present disclosure, a first initial model and training data may be first obtained, wherein the training data may include a plurality of reference text encoding data and truth image encoding data corresponding to the reference text encoding data.
The reference text-encoded data is obtained by encoding reference natural language text, which may include industry open-source Chinese corpora, translated Chinese data, collected crawled data, and the like, and may be customized according to the user's requirements, which this example embodiment does not specifically limit.
Before the reference natural language text data is encoded, it may also be cleaned: descriptions of physical entity words are filtered, and the text content is cleaned, including removing special punctuation marks, normalizing traditional and simplified characters, and the like, which this exemplary embodiment does not specifically limit.
The truth image-encoded data can be obtained by encoding the truth images; before the truth images are encoded, images with low resolution or low aesthetic scores may be deleted from them to improve the quality of the training data.
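A minimal sketch of this pre-encoding filter might look as follows; the field names and thresholds are illustrative assumptions, not values from the disclosure:

```python
def filter_training_images(records, min_width=256, min_height=256, min_aesthetic=5.0):
    # Drop low-resolution and low-aesthetic-score images before encoding;
    # the thresholds here are assumed for illustration.
    return [
        r for r in records
        if r["width"] >= min_width
        and r["height"] >= min_height
        and r["aesthetic_score"] >= min_aesthetic
    ]

records = [
    {"width": 512, "height": 512, "aesthetic_score": 6.1},
    {"width": 128, "height": 128, "aesthetic_score": 7.0},  # too small
    {"width": 512, "height": 512, "aesthetic_score": 3.2},  # low aesthetic score
]
kept = filter_training_images(records)
print(len(kept))  # 1
```

Only images passing both checks are encoded into truth image-encoded data.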
In step S520, the reference text encoded data is input to the first initial model to obtain reference image encoded data.
In an example embodiment of the present disclosure, after the reference text-encoded data is obtained, it may be input into the first initial model, where the first initial model may be composed of Transformer encoder blocks, and the generation of the reference image-encoded data may be conditioned on random Gaussian noise and an encoding of the time step of the reference text-encoded data.
In step S530, training the first initial model based on the reference image encoding data and the truth image encoding data results in the encoded image generation model.
In an example embodiment of the present disclosure, after the reference image encoded data and the truth image encoded data are obtained, a first initial model may be trained based on the reference image encoded data and the truth image encoded data to obtain the encoded image generation model.
Specifically, referring to fig. 6, a reference natural language text is input to perform data encoding to obtain reference text encoded data, the reference text encoded data is input to the first initial model to obtain reference image encoded data, a truth image is used to perform encoding to obtain truth image encoded data, and the truth image encoded data and the reference image encoded data are used to update the first initial model to obtain an encoded image generation model.
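The training loop of fig. 6 can be sketched with a deliberately simplified stand-in model; the linear map and mean-squared-error loss below are assumptions replacing the Transformer encoder the disclosure describes:

```python
import numpy as np

def train_first_initial_model(ref_text_codes, truth_image_codes, lr=0.1, epochs=500):
    # Fit a linear map W from reference text-encoded data to truth
    # image-encoded data by gradient descent on mean squared error
    # (a toy stand-in for the disclosure's Transformer-encoder model).
    rng = np.random.default_rng(0)
    d_in, d_out = ref_text_codes.shape[1], truth_image_codes.shape[1]
    W = rng.standard_normal((d_in, d_out)) * 0.01
    for _ in range(epochs):
        pred = ref_text_codes @ W                          # reference image-encoded data
        grad = ref_text_codes.T @ (pred - truth_image_codes) / len(ref_text_codes)
        W -= lr * grad                                     # pull predictions toward truth
    return W

rng = np.random.default_rng(1)
ref_text = rng.standard_normal((32, 6))        # 32 synthetic reference text codes
W_true = rng.standard_normal((6, 4))
truth_images = ref_text @ W_true               # synthetic truth image codes
W = train_first_initial_model(ref_text, truth_images)
err = float(np.abs(ref_text @ W - truth_images).mean())
print(round(err, 6))
```

The update step mirrors fig. 6: the reference image-encoded data and the truth image-encoded data jointly determine how the model is revised.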
In this example embodiment, the first preset time step may include N stages, where the first preset time step may include N encoded image generation models that are set in series, and each encoded image generation model processes data of one stage, so that pertinence of the model can be enhanced, and quality of an obtained image can be improved.
In this exemplary embodiment, the intermediate text data is obtained after the text-encoded data is fused with the first Gaussian noise of the first preset time step, and the intermediate text data of the Nth stage is then processed with the encoded image generation model corresponding to the Nth stage to obtain the image-encoded data of the Nth stage.
In step S420, the image-encoded data of the Nth stage is processed with the encoded image generation model corresponding to the (N-1)th stage to obtain the image-encoded data of the (N-1)th stage.
In step S430, the image coded data of the first stage is outputted.
In the present exemplary embodiment, the image-encoded data of the (N-1)th stage is obtained by processing the image-encoded data of the Nth stage with the encoded image generation model corresponding to the (N-1)th stage; that is, the input of the (N-1)th-stage encoded image generation model is the output of the Nth-stage encoded image generation model. The image-encoded data of the first stage is finally output. Here, N is a positive integer greater than 1.
It should be noted that the set of time steps of the plurality of encoded image generation models completely covers the first preset time step: for example, if the first preset time step is 1000 steps, the combined time steps of the encoded image generation models total at least 1000 steps. The number of steps in each of the N stages may be the same or different and may be customized according to the user's requirements, which this exemplary embodiment does not specifically limit.
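The coverage requirement can be checked mechanically; the even partition of the 1000 steps into contiguous stages below is one illustrative choice, not mandated by the disclosure:

```python
def partition_time_steps(total_steps: int, n_stages: int):
    # Split [0, total_steps) into n_stages contiguous ranges; stage N handles
    # the latest (noisiest) steps and stage 1 the earliest.
    size, rem = divmod(total_steps, n_stages)
    ranges, start = [], 0
    for i in range(n_stages):
        end = start + size + (1 if i < rem else 0)
        ranges.append(range(start, end))
        start = end
    return ranges

def covers(ranges, total_steps: int) -> bool:
    # The union of the stage step sets must completely cover the preset steps.
    covered = set()
    for r in ranges:
        covered.update(r)
    return covered >= set(range(total_steps))

stage_steps = partition_time_steps(1000, 3)
print(covers(stage_steps, 1000))  # True
```

The same check applies unchanged to the M decoding stages and P super-resolution stages described later.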
In step S240, the image encoded data is decoded to obtain a target image.
In an example embodiment of the present disclosure, referring to fig. 7, the above steps may include steps S710 to S720.
In step S710, the image-encoded data is decoded using an image decoding model to obtain an intermediate image.
In the present exemplary embodiment, referring to fig. 8, the above step may include steps S810 to S820.
In step S810, the image encoding data is fused with a second gaussian noise of a second preset time step to obtain an image to be decoded.
In an example embodiment of the present disclosure, after the image-encoded data is obtained, it may be fused with second Gaussian noise, where the fusion proceeds over a preset number of time steps; the second preset time step may be 1000 steps or 2000 steps, or may be customized according to the user's requirements, which this example embodiment does not specifically limit. The image to be decoded is obtained after fusion.
In step S820, the image decoding models with different time steps are utilized to sequentially decode the image to be decoded corresponding to the time step to obtain an intermediate image.
In this exemplary embodiment, referring to fig. 9, a second initial model and second training data may first be acquired, where the second training data includes a plurality of reference images to be decoded and true intermediate images corresponding to them. The reference images to be decoded are input into the second initial model to obtain reference intermediate images, and finally the second initial model is trained and updated based on the reference intermediate images and the true intermediate images to obtain the image decoding model.
The specific training process may refer to training the above-mentioned code image generation model, and will not be described in detail in this example embodiment.
In this example embodiment, the inputs of the image decoding model may include random third Gaussian noise, with the image-encoded data and the training time step as conditions. Further, when the diffusion model performs the conditional generation task to produce the intermediate image, a strategy of randomly zeroing the text-encoded data for a certain proportion of the time may also be used.
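Randomly zeroing the text condition for a fraction of training steps resembles the classifier-free guidance recipe; the 10% drop probability below is an assumed value, as the disclosure does not fix the proportion:

```python
import numpy as np

def maybe_drop_condition(text_code: np.ndarray, rng, drop_prob: float = 0.1):
    # With probability drop_prob, replace the text condition with zeros so
    # the model also learns unconditional generation (classifier-free style).
    if rng.random() < drop_prob:
        return np.zeros_like(text_code)
    return text_code

rng = np.random.default_rng(0)
code = np.ones(8)
dropped = sum(
    int(not maybe_drop_condition(code, rng).any()) for _ in range(10_000)
)
print(dropped)  # close to 1000 of the 10000 draws are zeroed
```

Training on both conditional and unconditional cases lets sampling trade fidelity to the text against diversity.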
In this example embodiment, the second preset time step includes M stages, referring to fig. 10, and decoding the to-be-decoded image corresponding to the time step by using an image decoding model of a different time step to obtain an intermediate image may include steps S1010 to S1030.
In step S1010, the image to be decoded in the mth stage is decoded by using the image decoding model corresponding to the mth stage to obtain an intermediate image in the mth stage.
In this example embodiment, the second preset time step may include M stages, where the second preset time step may include M image decoding models that are set in series, and each image decoding model processes data of one stage, so that pertinence of the model can be enhanced, and quality of an obtained image can be improved.
In this exemplary embodiment, the image to be decoded in the mth stage is processed by using the image decoding model corresponding to the mth stage to obtain the intermediate image in the mth stage.
In step S1020, the intermediate image of the Mth stage is decoded with the image decoding model corresponding to the (M-1)th stage to obtain the intermediate image of the (M-1)th stage.
In step S1030, the intermediate image of the first stage is output.
In the present exemplary embodiment, the intermediate image of the (M-1)th stage is obtained by processing the intermediate image of the Mth stage with the image decoding model corresponding to the (M-1)th stage; that is, the input of the (M-1)th-stage image decoding model is the output of the Mth-stage image decoding model. The intermediate image of the first stage is finally output. Here, M is a positive integer greater than 1.
It should be noted that the set of time steps of the plurality of image decoding models completely covers the second preset time step: for example, if the second preset time step is 1000 steps, the combined time steps of the image decoding models total at least 1000 steps. The number of steps in each of the M stages may be the same or different and may be customized according to the user's requirements, which this exemplary embodiment does not specifically limit.
In step S720, the intermediate image is subjected to super-resolution processing at least once to obtain the target image.
In an example embodiment of the present disclosure, referring to fig. 11, the above steps may include steps S1110 to S1120.
In step S1110, the intermediate image is fused with a third gaussian noise of a third preset time step to obtain an intermediate super-resolution image.
In an example embodiment of the present disclosure, after the intermediate image is obtained, it may be fused with third Gaussian noise, where the fusion proceeds over a preset number of time steps; the third preset time step may be 1000 steps or 2000 steps, or may be customized according to the user's requirements, which this example embodiment does not specifically limit. The intermediate super-resolution image is obtained after fusion.
In step S1120, super-resolution is sequentially performed on the intermediate super-resolution images corresponding to the time steps by using the image super-resolution models of different time steps to obtain a target image.
In this example embodiment, a third initial model and third training data may first be acquired, where the third training data includes a plurality of reference intermediate images and ground-truth super-resolution images corresponding to the reference intermediate images. The reference intermediate image is input into the third initial model to obtain a reference super-resolution image, and the third initial model is then trained based on the reference super-resolution image and the ground-truth super-resolution image, updating the third initial model to obtain the image super-resolution model.
The specific training process may refer to the training of the above-mentioned encoded image generation model and will not be described in detail in this example embodiment.
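As a toy illustration of the training loop just described (predict a super-resolution image from the reference intermediate image, compare it with the ground truth, update the parameters), the sketch below uses a single scalar weight standing in for the third initial model. All names are invented for illustration and do not reflect the patent's actual model.

```python
def train_sr_model(ref_images, truth_images, lr=0.1, epochs=200):
    """Minimal squared-error training loop; `w` stands in for the model."""
    w = 0.0  # the "model": one scalar parameter, for illustration only
    for _ in range(epochs):
        for x, y in zip(ref_images, truth_images):
            pred = w * x                 # reference super-resolution image
            grad = 2.0 * (pred - y) * x  # gradient of the squared error
            w -= lr * grad               # update toward the ground truth
    return w
```

In practice the model is a diffusion network and the update is done by an optimizer over mini-batches, but the predict-compare-update structure is the same.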
In one example embodiment of the present disclosure, referring to fig. 12, the third preset time step includes P phases; sequentially performing super resolution on the intermediate super resolution images corresponding to the time steps by using the image super resolution models of different time steps to obtain a target image may include steps S1210 to S1230.
In step S1210, super resolution is performed on the intermediate image of the P-th stage by using the image super resolution model corresponding to the P-th stage to obtain the target image of the P-th stage.
In this example embodiment, the third preset time step may include P stages, and P image super-resolution models arranged in series may be provided, each processing the data of one stage. This enhances the pertinence of each model and improves the quality of the resulting image.
In this exemplary embodiment, the intermediate image is fused with the third Gaussian noise of the third preset time step to obtain the intermediate super-resolution image, and then the intermediate image of the P-th stage is processed by using the image super-resolution model corresponding to the P-th stage to obtain the target image of the P-th stage.
In step S1220, the target image of the P-th stage is super-resolved by using the image super-resolution model corresponding to the (P-1)-th stage to obtain the target image of the (P-1)-th stage.
In step S1230, the target image of the first stage is output.
In the present exemplary embodiment, the target image of the P-th stage is processed by using the image super-resolution model corresponding to the (P-1)-th stage to obtain the target image of the (P-1)-th stage; that is, the input of the (P-1)-th stage image super-resolution model is the output of the P-th stage image super-resolution model. The target image of the first stage may then be output. P is a positive integer greater than 1.
It should be noted that the set of time steps of the plurality of image super-resolution models completely covers the third preset time step: for example, if the third preset time step is 1000 steps, the time steps of the image super-resolution models must together amount to at least 1000 steps. The step length of each of the P stages may be the same or different and may be customized according to user requirements, which is not specifically limited in this exemplary embodiment.
In this exemplary embodiment, the number of the image super-resolution models may be one or more, and in one exemplary embodiment, after obtaining the target image using the image super-resolution models, the SwinIR super-resolution model may be further used to update the target image.
The method for generating an image from text will be described in detail with reference to fig. 13. The natural language text may first be input into the text encoding module to obtain text encoding data; the text encoding data may be processed by the encoded image generation model to obtain image encoding data; the image encoding data is then input into the image decoding model to obtain an intermediate image; and the intermediate image is super-resolved by the image super-resolution model to obtain a target image. The SwinIR super-resolution model may further be used to super-resolve the target image and thereby update it.
In this exemplary embodiment, the resolution of the intermediate image may be 64×64, the resolution of the target image after the image super-resolution model may be 256×256, and the resolution of the target image after the SwinIR super-resolution update may be 1024×1024. The specific resolution values may be customized according to user requirements and are not specifically limited in this exemplary embodiment.
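The resolution cascade can be summarized with a small helper (illustrative only; the ×4 upscale factors come from the example values above and may be customized):

```python
def resolution_schedule(start=64, factors=(4, 4)):
    """Resolutions produced by the cascade: 64 -> 256 (image super-resolution
    model) -> 1024 (SwinIR update), using the example values from the text."""
    sizes = [start]
    for f in factors:
        sizes.append(sizes[-1] * f)  # each stage upscales the previous output
    return sizes
```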
In an example embodiment of the present disclosure, the above method may be applied at a web page end. Specifically, referring to fig. 14, the user's natural language text is sent to a web server (that is, the user sends a request to the web server); the web server forwards the natural language text to the text-to-image model, which generates a target image by using the above method for generating an image from text and feeds it back to the user via the web server.
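For illustration, the request flow of fig. 14 might be sketched as below. The class and method names are invented for this sketch; the patent does not specify a server API.

```python
# Hypothetical sketch of the web-side flow: the web server receives the user's
# natural language text, forwards it to the text-to-image model, and feeds the
# generated image back to the user.

class TextToImageWebServer:
    def __init__(self, text_to_image_model):
        self.model = text_to_image_model  # wraps the method described above

    def handle_request(self, natural_language_text):
        image = self.model(natural_language_text)  # generate the target image
        return {"status": "ok", "image": image}    # response fed back to the user
```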
Referring to fig. 15, a user may preview an image at the web page end, and after the user clicks the generated-image identifier, the generated image may be displayed on the image display interface. The number of generated images may be customized according to user requirements, which is not specifically limited in this exemplary embodiment.
In summary, in the present exemplary embodiment, sequentially processing the intermediate text data corresponding to each time step with encoded image generation models of different time steps to obtain the image encoding data increases the model capacity and improves the quality of image generation. Likewise, image decoding models of different time steps sequentially decode the data corresponding to each time step, and image super-resolution models of different time steps super-resolve the intermediate image corresponding to each time step, further improving the quality of the generated image. The image super-resolution model, the encoded image generation model, and the image decoding model all adopt diffusion models, so the text-to-image generation has strong fitting capability for complex distributions and big data, while the generation quality is high, the diversity is good, and the editability is strong.
It is noted that the above-described figures are merely schematic illustrations of processes involved in a method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Further, referring to fig. 16, in this exemplary embodiment, there is further provided an apparatus 1600 for generating an image by text, including an encoding module 1610, a fusing module 1620, a processing module 1630, and a decoding module 1640. Wherein:
the encoding module 1610 may be configured to encode natural language text describing an image into text encoded data.
The fusion module 1620 may be configured to fuse the text encoding data with the first gaussian noise of the first preset time step to obtain intermediate text data.
The processing module 1630 may be configured to sequentially process the intermediate text data corresponding to the time step by using the encoded image generation models of different time steps to obtain image encoded data.
In an example embodiment, processing module 1630 may be configured to obtain a first initial model and first training data, wherein the first training data comprises a plurality of reference text encoding data and truth image encoding data corresponding to the reference text encoding data; inputting the reference text coding data into the first initial model to obtain reference image coding data; training the first initial model based on the reference image encoding data and the truth image encoding data to obtain the encoding image generation model.
In an example embodiment, the first preset time step includes N stages, and the processing module 1630 may be configured to process the intermediate text data of the nth stage by using the encoded image generation model corresponding to the nth stage to obtain image encoded data of the nth stage; processing the image coding data of the N stage by using a coding image generation model corresponding to the N-1 stage to obtain the image coding data of the N-1 stage; outputting the image coding data of the first stage; wherein N is a positive integer greater than 1.
The decoding module 1640 may be used to decode the image encoded data to obtain a target image.
In an example embodiment, the decoding module 1640 may be configured to decode the image encoded data using an image decoding model to obtain an intermediate image; and performing super-resolution processing on the intermediate image at least once to obtain the target image.
In an example embodiment, the decoding module 1640 may be configured to fuse the image encoded data with a second gaussian noise for a second preset time step to obtain an image to be decoded; sequentially decoding the images to be decoded corresponding to the time step by using image decoding models with different time steps to obtain intermediate images; wherein a set of time steps of a plurality of said image decoding models completely covers said second preset time step.
In this example embodiment, the decoding module 1640 may be configured to decode the image to be decoded of the M-th stage using the image decoding model corresponding to the M-th stage to obtain the intermediate image of the M-th stage; decode the intermediate image of the M-th stage by using the image decoding model corresponding to the (M-1)-th stage to obtain the intermediate image of the (M-1)-th stage; and output the intermediate image of the first stage; wherein M is a positive integer greater than 1.
In an example embodiment, the decoding module 1640 may be configured to fuse the intermediate image with a third gaussian noise of a third preset time step to obtain an intermediate super-resolution image; sequentially carrying out super resolution on the intermediate super-resolution images corresponding to the time step by utilizing image super-resolution models with different time steps to obtain a target image; wherein a set of time steps of a plurality of said image super-resolution models completely covers said third preset time step.
In this example embodiment, the third preset time step includes P stages, and the decoding module 1640 may be configured to perform super resolution on the intermediate image of the P-th stage by using the image super-resolution model corresponding to the P-th stage to obtain the target image of the P-th stage; perform super resolution on the target image of the P-th stage by using the image super-resolution model corresponding to the (P-1)-th stage to obtain the target image of the (P-1)-th stage; and output the target image of the first stage; wherein P is a positive integer greater than 1.
The specific details of each module in the above apparatus are already described in the method section, and the details that are not disclosed can be referred to the embodiment of the method section, so that they will not be described in detail.
The exemplary embodiments of the present disclosure also provide an electronic device for performing the above-described method of generating an image from text, which may be the above-described terminal 110 or server 120. In general, the electronic device may include a processor and a memory for storing executable instructions of the processor, the processor being configured to perform the above-described method via execution of the executable instructions.
The configuration of the electronic device will be exemplarily described below, taking the mobile terminal 1700 in fig. 17 as an example. It will be appreciated by those skilled in the art that, apart from the components intended specifically for mobile use, the configuration of fig. 17 can also be applied to stationary devices.
As shown in fig. 17, the mobile terminal 1700 may specifically include: processor 1701, memory 1702, bus 1703, mobile communication module 1704, antenna 1, wireless communication module 1705, antenna 2, display 1706, camera module 1707, audio module 1708, power module 1709 and sensor module 1710.
The processor 1701 may include one or more processing units. For example, the processor 1701 may include an AP (Application Processor), a modem processor, a GPU (Graphics Processing Unit), an ISP (Image Signal Processor), a controller, an encoder, a decoder, a DSP (Digital Signal Processor), a baseband processor, and/or an NPU (Neural-Network Processing Unit), and the like. The method of generating an image from text in the present exemplary embodiment may be performed by the AP, GPU, or DSP, and may be performed by the NPU when the method involves neural-network-related processing.
The encoder may encode (i.e., compress) an image or video; for example, the target image may be encoded into a particular format to reduce its data size for storage or transmission. The decoder may decode (i.e., decompress) the encoded data of an image or video to restore the image or video data; for example, the encoded data of the target image may be read and decoded by the decoder to restore the target image data for further text-to-image related processing. The mobile terminal 1700 may support one or more encoders and decoders. In this way, the mobile terminal 1700 can process images or videos in various encoding formats, such as image formats like JPEG (Joint Photographic Experts Group), PNG (Portable Network Graphics), and BMP (Bitmap), and video formats like MPEG-1 (Moving Picture Experts Group), MPEG-2, H.263, H.264, and HEVC (High Efficiency Video Coding).
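The encode/decode round trip can be illustrated with a lossless stdlib codec standing in for the image codecs listed above (zlib is an assumption chosen for illustration; real image and video codecs such as JPEG or H.264 are typically lossy and hardware-accelerated): encoding shrinks the data for storage or transmission, and decoding restores it.

```python
import zlib

def encode_image(raw_bytes: bytes) -> bytes:
    """Compress raw image bytes (zlib stands in for an image codec)."""
    return zlib.compress(raw_bytes, level=9)

def decode_image(encoded: bytes) -> bytes:
    """Restore the original bytes from the encoded data."""
    return zlib.decompress(encoded)
```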
The processor 1701 may form a connection with the memory 1702 or other components via a bus 1703.
The memory 1702 may be used to store computer executable program code that includes instructions. The processor 1701 performs various functional applications and data processing of the mobile terminal 1700 by executing instructions stored in the memory 1702. The memory 1702 may also store application data, such as files that store images, videos, and the like.
The communication functions of the mobile terminal 1700 may be implemented by the mobile communication module 1704, the antenna 1, the wireless communication module 1705, the antenna 2, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The mobile communication module 1704 may provide a 2G, 3G, 4G, 5G, etc. mobile communication solution for application on the mobile terminal 1700. The wireless communication module 1705 may provide wireless communication solutions for wireless local area networks, bluetooth, near field communications, etc. that are applied on the mobile terminal 1700.
The display 1706 is used to implement display functions, such as displaying user interfaces, images, and video. The image capturing module 1707 is used to implement a capturing function, such as capturing images or video. The audio module 1708 is used to implement audio functions, such as playing audio and collecting speech. The power module 1709 is used to implement power management functions, such as charging the battery, powering the device, and monitoring the battery status. The sensor module 1710 may include a depth sensor 17101, a pressure sensor 17102, a gyro sensor 17103, a barometric pressure sensor 17104, etc., to implement the corresponding sensing functions.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (11)
1. A method of generating an image from text, comprising:
coding natural language text describing the image to obtain text coding data;
fusing the text coding data with first Gaussian noise of a first preset time step to obtain intermediate text data;
sequentially processing the intermediate text data corresponding to the time step by using coding image generation models with different time steps to obtain image coding data;
decoding the image coding data to obtain a target image;
wherein a set of time steps of a plurality of said encoded image generation models completely covers said first preset time step.
2. The method according to claim 1, wherein the method further comprises:
acquiring a first initial model and first training data, wherein the first training data comprises a plurality of reference text coding data and true image coding data corresponding to the reference text coding data;
inputting the reference text coding data into the first initial model to obtain reference image coding data;
training the first initial model based on the reference image encoding data and the truth image encoding data to obtain the encoding image generation model.
3. The method of claim 1, wherein the first preset time step comprises N phases; and sequentially processing the intermediate text data corresponding to the time step by using coding image generation models with different time steps to obtain image coding data comprises the following steps:
processing the intermediate text data of the N stage by using the coding image generation model corresponding to the N stage to obtain the image coding data of the N stage;
processing the image coding data of the N stage by using a coding image generation model corresponding to the N-1 stage to obtain the image coding data of the N-1 stage;
outputting the image coding data of the first stage;
wherein N is a positive integer greater than 1.
4. The method of claim 1, wherein decoding the image encoded data to obtain a target image comprises:
decoding the image coding data by using an image decoding model to obtain an intermediate image;
and performing super-resolution processing on the intermediate image at least once to obtain the target image.
5. The method of claim 4, wherein decoding the image encoded data using an image decoding model to obtain an intermediate image comprises:
fusing the image coding data with second Gaussian noise of a second preset time step to obtain an image to be decoded;
sequentially decoding the images to be decoded corresponding to the time step by using image decoding models with different time steps to obtain intermediate images;
wherein a set of time steps of a plurality of said image decoding models completely covers said second preset time step.
6. The method of claim 5, wherein the second preset time step comprises M phases; and sequentially processing the image to be decoded corresponding to the time step by using image decoding models with different time steps to obtain an intermediate image comprises the following steps:
decoding an image to be decoded in the M stage by using an image decoding model corresponding to the M stage to obtain an intermediate image in the M stage;
decoding the intermediate image of the M stage by using an image decoding model corresponding to the M-1 stage to obtain the intermediate image of the M-1 stage;
outputting the intermediate image of the first stage;
wherein M is a positive integer greater than 1.
7. The method of claim 4, wherein performing super-resolution processing on the intermediate image at least once to obtain the target image comprises:
fusing the intermediate image with third Gaussian noise of a third preset time step to obtain an intermediate super-resolution image;
sequentially carrying out super resolution on the intermediate super-resolution images corresponding to the time step by utilizing image super-resolution models with different time steps to obtain a target image;
wherein a set of time steps of a plurality of said image super-resolution models completely covers said third preset time step.
8. The method of claim 7, wherein the third preset time step comprises P phases; and sequentially performing super resolution on the intermediate super-resolution images corresponding to the time step by using image super-resolution models with different time steps to obtain a target image comprises the following steps:
performing super resolution on the intermediate image of the P stage by using an image super resolution model corresponding to the P stage to obtain a target image of the P stage;
performing super resolution on the target image of the P stage by using an image super resolution model corresponding to the P-1 stage to obtain a target image of the P-1 stage;
outputting the target image of the first stage;
wherein P is a positive integer greater than 1.
9. An apparatus for generating an image from text, comprising:
the coding module is used for coding the natural language text describing the image to obtain text coding data;
the fusion module is used for fusing the text coding data with first Gaussian noise of a first preset time step to obtain intermediate text data;
the processing module is used for sequentially processing the intermediate text data corresponding to the time step by using the coded image generation models of different time steps to obtain image coded data;
a decoding module, configured to decode the image coding data to obtain a target image;
wherein a set of time steps of a plurality of said encoded image generation models completely covers said first preset time step.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a method of generating an image of a text as claimed in any one of claims 1 to 8.
11. An electronic device, comprising:
one or more processors; and
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of text generation of an image of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310266304.1A CN116168108A (en) | 2023-03-17 | 2023-03-17 | Method and device for generating image through text, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116168108A true CN116168108A (en) | 2023-05-26 |
Family
ID=86416462
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116912352A (en) * | 2023-09-12 | 2023-10-20 | 苏州浪潮智能科技有限公司 | Picture generation method and device, electronic equipment and storage medium |
CN118426657A (en) * | 2023-08-18 | 2024-08-02 | 北京字跳网络技术有限公司 | Image generation method and related equipment |
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination