CN116188632A - Image generation method and device, storage medium and electronic equipment - Google Patents

Image generation method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN116188632A
CN116188632A (application CN202310448216.3A)
Authority
CN
China
Prior art keywords
text
model
image
target image
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310448216.3A
Other languages
Chinese (zh)
Inventor
李太豪
齐旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310448216.3A priority Critical patent/CN116188632A/en
Publication of CN116188632A publication Critical patent/CN116188632A/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; Combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The specification discloses an image generation method and apparatus, a storage medium, and an electronic device. A description text of the image to be generated is first acquired and input into a pre-trained first model, which outputs several target images corresponding to the description text. Each target image is then input into a pre-trained second model to obtain the image features of that target image, and the description text is input into the same second model to obtain the text features of the description text. Finally, a final target image is determined according to the similarity between the image features of each target image and the text features. The method can generate multiple images from the description text and select the one that best matches it, avoiding the low image quality that can result from generating a single image, so that the generated image is of higher quality and more consistent with the text description.

Description

Image generation method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating an image, a storage medium, and an electronic device.
Background
With the development of technology, machine learning is increasingly used. In the field of image generation, machine learning models make a great contribution to the generation of images.
Currently, some descriptive text may be input into image generation software to obtain an image corresponding to that text. Typically, such software has an image generation model deployed in it, for example a stable diffusion model (Stable Diffusion Models, SDMs) or a latent diffusion model (Latent Diffusion Models, LDMs). When using the image generation software, the user can input text into it and obtain, through the image generation model in the software, an image that matches the text. For example, if the text entered by the user is "A Persian cat with blue eyes and white hair", the image generation software can output an image of a Persian cat with blue eyes and white hair. Clearly, how to generate an image consistent with the content of the text description is a critical issue.
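For concreteness, the call flow just described might be sketched as follows with the open-source diffusers library. This is an illustration only, not part of the method of the present application; the checkpoint name is an assumption.

```python
# Minimal text-to-image sketch with the open-source diffusers library.
# The checkpoint name is illustrative; any Stable Diffusion checkpoint works.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The descriptive text from the example above.
image = pipe("A Persian cat with blue eyes and white hair").images[0]
image.save("persian_cat.png")
```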
Based on this, the present specification provides a method of generating an image.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a storage medium, and an electronic device for generating an image, so as to at least partially solve the foregoing problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a method for generating an image, the method comprising:
acquiring a description text of an image to be generated;
inputting the description text into a pre-trained first model to obtain each target image corresponding to the description text output by the first model;
inputting each target image into a pre-trained second model to obtain the image characteristics of the target image output by the second model; inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model;
and determining a final target image according to the similarity between the image characteristics of each target image and the text characteristics.
Optionally, acquiring the description text of the image to be generated specifically includes:
acquiring an original text of an image to be generated;
determining the input text language of the first model and/or the second model;
and determining the description text of the image to be generated according to the input text language and the original text.
Optionally, obtaining each target image corresponding to the description text output by the first model specifically includes:
accelerating the pre-trained first model by using TensorRT, and obtaining, through the accelerated first model, each target image corresponding to the description text output by the first model.
Optionally, before accelerating the pre-trained first model using TensorRT, the method further comprises:
converting parameters in the pre-trained first model into a parameter format of the TensorRT.
Optionally, inputting the descriptive text into a pre-trained first model specifically includes:
acquiring an image style of an image to be generated;
determining a prompt text corresponding to the image style;
the prompt text and the description text are input into a first model trained in advance.
Optionally, inputting the description text into a pre-trained first model to obtain each target image corresponding to the description text output by the first model, which specifically includes:
acquiring at least two random seeds for the first model; wherein a random seed is used to enable the first model to generate a target image;
and inputting, for each random seed, the random seed and the description text into the first model, so that the first model initializes a noise image according to the random seed, and obtains a target image corresponding to the random seed according to the noise image and the description text.
Optionally, the second model includes at least: a text encoder and an image encoder;
inputting the target image into a pre-trained second model to obtain the image characteristics of the target image output by the second model, wherein the method specifically comprises the following steps:
inputting the target image into the image encoder to obtain the image characteristics of the target image output by the image encoder;
inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model, wherein the method specifically comprises the following steps:
and inputting the descriptive text into the text encoder to obtain text characteristics of the descriptive text output by the text encoder.
Optionally, the first model is a stable diffusion model;
the second model is a contrast text image model.
Optionally, determining the final target image according to the similarity between the image features and the text features of each target image specifically includes:
and taking the target image corresponding to the maximum similarity in the similarities as a final target image.
The present specification provides an image generation apparatus including:
the text acquisition module is used for acquiring the descriptive text of the image to be generated;
the first input module is used for inputting the description text into a pre-trained first model to obtain each target image corresponding to the description text output by the first model;
the second input module is used for inputting each target image into a pre-trained second model to obtain the image characteristics of the target image output by the second model; inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model;
and the image determining module is used for determining a final target image according to the similarity between the image characteristics of each target image and the text characteristics.
Optionally, the text obtaining module is specifically configured to obtain an original text of the image to be generated; determining the input text language of the first model and/or the second model; and determining the description text of the image to be generated according to the input text language and the original text.
Optionally, the first input module is specifically configured to accelerate the pre-trained first model by using TensorRT, and obtain, through the accelerated first model, each target image corresponding to the description text output by the first model.
Optionally, the first input module is further configured to convert parameters in the pre-trained first model into the TensorRT parameter format.
Optionally, the first input module is specifically configured to obtain an image style of an image to be generated; determining a prompt text corresponding to the image style; the prompt text and the description text are input into a first model trained in advance.
Optionally, the first input module is specifically configured to obtain at least two random seeds for the first model, wherein a random seed is used to enable the first model to generate a target image; and to input, for each random seed, the random seed and the description text into the first model, so that the first model initializes a noise image according to the random seed, and obtains a target image corresponding to the random seed according to the noise image and the description text.
Optionally, the second model includes at least: a text encoder and an image encoder;
the second input module is specifically configured to input the target image into the image encoder, so as to obtain an image feature of the target image output by the image encoder; and inputting the descriptive text into the text encoder to obtain text characteristics of the descriptive text output by the text encoder.
Optionally, the first model is a stable diffusion model;
the second model is a contrast text image model.
Optionally, the image determining module is specifically configured to take a target image corresponding to a maximum similarity among the similarities as a final target image.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described image generation method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of generating an image described above when executing the program.
At least one of the technical solutions adopted in the present specification can achieve the following beneficial effects:
In the image generation method provided in this specification, the description text of the image to be generated is acquired and input into the first model to obtain each target image. The image features of each target image and the text features of the description text are then obtained through the second model, and the image that best matches the description text is determined among the target images according to these features.
In this way, a plurality of images can be generated and the one that best matches the description text can be determined among them, avoiding the low image quality that may result from generating a single image, so that the generated image is of higher quality and consistent with the description text.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a flow chart of an image generating method in the present specification;
FIG. 2a is a schematic illustration of a second model provided in the present specification;
FIG. 2b is a schematic view of feature extraction provided in the present specification;
FIG. 3 is a schematic diagram of an image generating apparatus provided in the present specification;
fig. 4 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of an image generating method provided in the present specification, which specifically includes the following steps:
s100: and acquiring a description text of the image to be generated.
Generally, when a user uses image generation software, the user inputs the description text into the software, which generates an image through the image generation model and then outputs and displays it to the user. However, a single generated image often fails to meet users' varied practical needs, and the output image may be inconsistent with the text description, i.e., the image quality is low. Based on this, the present specification provides an image generation method that makes the generated image higher in quality and more consistent with the text description.
The technical solution of the present application is described below with a client as the execution subject. A pre-trained first model and a pre-trained second model are deployed in the client: the first model is used to convert input text into an image, and the second model, which includes at least a text encoder and an image encoder, is used to extract features from text and images.
In one or more embodiments of the present description, the client may obtain descriptive text of the image to be generated. Wherein the descriptive text is a description of the desired image by the user.
S102: and inputting the description text into a pre-trained first model to obtain each target image corresponding to the description text output by the first model.
In one or more embodiments of the present description, the pre-trained first model may be a stable diffusion model (Stable Diffusion Models, SDMs). The client may input descriptive text into the stable diffusion model to obtain an output image.
In general, when a stable diffusion model is used, a random seed (seed) accompanies the text input. The random seed initializes the noise image generated by the stable diffusion model, and the model then denoises this noise image according to the input description text to generate a target image. Different random seeds yield different initialization noise images.
Therefore, to avoid the low image quality that may result from generating a single image, in one or more embodiments of the present disclosure the client may obtain a plurality of random seeds and generate a plurality of images corresponding to the description text from the description text and these seeds, so that the image of the best quality (i.e., the one that best matches the description text) can be selected from them in a subsequent step.
Specifically, when inputting the description text into the first model, the client may acquire at least two random seeds and, for each random seed, input that seed together with the description text into the first model to generate a target image. For example, assuming the set random seeds are 50 and 60, seed=50 and the description text may be input into the first model to obtain one target image, and seed=60 and the description text may then be input to obtain another target image.
That is, for the descriptive text, the first model may output as many target images as there are random seeds, and then, in a subsequent step, an image most conforming to the description of the descriptive text may be selected among the plurality of target images.
It should be noted that, when obtaining random seeds, a random seed algorithm may be set in the client to generate seeds randomly within the seed interval accepted by the first model. For example, if the first model accepts random seeds in the interval [50,100], a random seed algorithm that generates seeds only within [50,100] may be set. The specific number of random seeds obtained is not limited in this specification and may be more than two.
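A minimal sketch of this multi-seed generation step, assuming the diffusers pipeline from the background example and the seed interval [50,100] mentioned above:

```python
# Sketch of S102: one candidate image per random seed. The interval [50, 100]
# follows the example above; `pipe` is assumed to be a diffusers
# StableDiffusionPipeline as in the earlier snippet.
import random
import torch

def generate_candidates(pipe, prompt, num_seeds=2, low=50, high=100):
    images, seeds = [], []
    for _ in range(num_seeds):
        seed = random.randint(low, high)  # seed drawn from the accepted interval
        # The generator initializes the noise image that the model then
        # denoises, conditioned on the description text.
        generator = torch.Generator(device="cuda").manual_seed(seed)
        images.append(pipe(prompt, generator=generator).images[0])
        seeds.append(seed)
    return images, seeds
```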
S104: inputting each target image into a pre-trained second model to obtain the image characteristics of the target image output by the second model; and inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model.
In one or more embodiments of the present description, the second model contains a text encoder and an image encoder, and may be a Contrastive Language-Image Pre-training (CLIP) model. As shown in Fig. 2a, the second model provided in the present application includes an image encoder and a text encoder, which extract the features of the image and of the text respectively.
The CLIP model is trained on text-image pairs, i.e., images together with their corresponding text descriptions, so that it learns the matching relation between text and images. Specifically, the text encoder and the image encoder in the CLIP model may be trained as follows. First, a number of sample images and their corresponding sample description texts are obtained and input into the CLIP model to be trained; the image features of each sample image are extracted by the image encoder, and the text features of each sample description text are extracted by the text encoder. Then, for each sample image, that image and its corresponding sample description text are taken as a positive sample pair, while that image paired with the description texts of other sample images forms negative sample pairs. The text encoder and the image encoder of the CLIP model to be trained are then trained with the optimization target of maximizing the similarity between image features and text features in positive sample pairs and minimizing it in negative sample pairs, yielding the trained CLIP model.
That is, N sample text-image pairs may be input into the CLIP model; the text encoder extracts N text features, the image encoder extracts N image features, and these can be combined into N x N text-image feature pairs. Among them, the original N matching pairs are taken as positive samples and the remaining pairs as negative samples, and the text encoder and image encoder of the CLIP model are trained with the optimization target of maximizing the similarity within positive sample pairs and minimizing it within negative sample pairs.
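The training objective described above can be sketched as the standard symmetric contrastive loss over the N x N similarity matrix. This is a generic CLIP-style formulation, not code from the present application:

```python
# CLIP-style contrastive objective: the N matching (diagonal) text-image
# pairs are positives, the remaining N*N - N combinations are negatives.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature  # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Maximize diagonal (positive-pair) similarity, minimize the rest.
    loss_image = F.cross_entropy(logits, targets)      # image -> text
    loss_text = F.cross_entropy(logits.t(), targets)   # text -> image
    return (loss_image + loss_text) / 2
```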
Then, in order to find out the image most conforming to the descriptive text in each target image, the client may input the target image into the second model for each target image, and obtain, through an image encoder in the second model, the image characteristics of the target image output by the second model. And descriptive text may be input into the second model, and text features output by the second model may be obtained by a text encoder in the second model.
Fig. 2b shows the feature extraction schematic provided in the present specification, taking three target images as an example: the three images are input into the image encoder of the second model to obtain three image features, while the description text is input into the text encoder to obtain the text features of the description text.
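Assuming the second model is realized with the open-source transformers CLIP implementation (the checkpoint name below is illustrative), the feature extraction of S104 might look as follows:

```python
# Sketch of S104: extracting image features and text features with CLIP.
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_features(target_images, description_text):
    image_inputs = processor(images=target_images, return_tensors="pt")
    text_inputs = processor(text=[description_text],
                            return_tensors="pt", padding=True)
    image_features = clip.get_image_features(**image_inputs)  # image encoder
    text_features = clip.get_text_features(**text_inputs)     # text encoder
    return image_features, text_features
```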
S106: and determining a final target image according to the similarity between the image characteristics of each target image and the text characteristics.
The similarity between image features and text features characterizes how well the image matches the text. In one or more embodiments of the present disclosure, after the client obtains the image features of each target image and the text features of the description text, it may determine, for each target image, the similarity between that image's features and the text features. The target image corresponding to the maximum similarity among these similarities is then taken as the final target image, i.e., the image that best matches the description text.
When determining the similarity between the image features of a target image and the text features, the cosine similarity between them may be calculated, or the Euclidean distance, and so on; this is not limited in the present specification.
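A minimal sketch of S106 using cosine similarity, one of the measures the text permits:

```python
# Sketch of S106: choose the target image whose features are most similar
# to the text features.
import torch
import torch.nn.functional as F

def select_final_image(target_images, image_features, text_features):
    # image_features: (N, D); text_features: (1, D) -> similarities: (N,)
    similarities = F.cosine_similarity(image_features, text_features, dim=-1)
    best = torch.argmax(similarities).item()
    return target_images[best], similarities[best].item()
```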
In the image generation method provided in this specification based on Fig. 1, the description text is input into the first model multiple times with different random seeds to obtain multiple target images, and the second model is then used to select, from among them, the target image that best matches the description text as the final image. This avoids output images that are inconsistent with the text description and of low quality, since the image most consistent with the description text can be selected from the multiple generated images, so the generated image is of higher quality.
In addition, many image generation models are trained on text in only a single language, while different users use different languages, and accepting text in only one language limits the application of such models. Therefore, so that the client can generate corresponding images from texts in different languages and thus serve a wider range of users, in one or more embodiments of the present specification a language recognition model and a translation model are also deployed in the client.
Furthermore, the client can determine the language of the description text (such as Chinese or English) through the language recognition model, and also determine the input languages recognized by the first model and the second model. If the language of the description text differs from the language recognized by the first model and/or the second model, the description text can be translated by the translation model into text that the first model and/or the second model can recognize. For example, suppose the first model recognizes English input and the second model recognizes Chinese input; if the client determines through the language recognition model that the description text is in Russian, the translation model can translate it into an English text and a Chinese text so that both models can recognize the description text.
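A hedged sketch of this language-handling step; detect_language and translate below stand in for the deployed language recognition model and translation model, and are hypothetical helpers rather than a real API:

```python
# Sketch: translate the description text only when its language differs from
# the language the downstream model accepts. The helper functions are
# hypothetical placeholders for the recognition/translation models.
def prepare_text(original_text, model_language, detect_language, translate):
    source_language = detect_language(original_text)   # e.g. "ru"
    if source_language != model_language:              # e.g. model expects "en"
        return translate(original_text,
                         src=source_language, dst=model_language)
    return original_text
```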
Further, since the first model generates a plurality of images for each description text, in order to increase its running speed and improve image generation efficiency, in one or more embodiments of the present disclosure the first model may be accelerated with TensorRT, and each target image corresponding to the description text is then obtained through the accelerated first model.
Before the first model is accelerated with TensorRT, the parameters in the pre-trained first model are converted into the TensorRT parameter format.
In addition, TensorRT is an inference acceleration library that speeds up neural network models through operator fusion, quantization, automatic kernel tuning, and the like. In one or more embodiments of the present specification, the client's acceleration of the first model with TensorRT may include two stages. The first stage converts the pre-trained first model (e.g., a PyTorch model) into an ONNX model; the second stage performs network optimizations such as operator fusion on the model and converts the optimized model into a TRT engine file. After both stages are finished, the TRT engine file can be used directly for inference in place of the original first model (the PyTorch model). Because the TRT engine file is the network-optimized first model, inference is accelerated.
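The two-stage conversion can be sketched as follows; the input names and the dummy input are illustrative assumptions, and stage two uses the trtexec tool shipped with TensorRT:

```python
# Stage one: export the pre-trained first model (a PyTorch model) to ONNX.
import torch

def export_to_onnx(model, dummy_input, path="first_model.onnx"):
    model.eval()
    torch.onnx.export(model, dummy_input, path,
                      input_names=["input"], output_names=["output"])

# Stage two: build the TRT engine file from the ONNX model. This is commonly
# done with the trtexec command-line tool, which applies optimizations such
# as operator fusion while serializing the engine:
#
#   trtexec --onnx=first_model.onnx --saveEngine=first_model.trt --fp16
```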
Further, in one or more embodiments of the present disclosure, to make the generated image more accurate, the client sets prompt texts corresponding to different image styles. When entering the description text, the user may select an image style; the corresponding prompt text is then determined according to the selected style and input into the first model together with the description text, so that a target image in that style is output.
The prompt text may include a prefix prompt text and a suffix prompt text. In one or more embodiments of the present disclosure, the first model may be a stable diffusion model whose recognized input language is English; taking English prompt texts as an example, eight styles of prompt text are provided, as shown in Table 1.
[Table 1: the eight image styles and their corresponding prefix and suffix prompt texts, rendered as an image in the original publication.]
The prompt texts in Table 1 are only examples; prompt texts for other styles may also be provided, and the specific contents of the prompt texts are not limited to those shown in Table 1.
In addition, when the prompt text and the description text are input into the first model, they may be spliced together in the order prefix prompt text, description text, suffix prompt text. For example, assuming the image style selected by the user is "concept oil" and the text entered by the user is "a cat", the client may input "a beautiful painting of a cat isometric, Warwick Goble, trending on artstation" into the first model.
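The splicing itself reduces to simple string assembly; the fragments below follow the worked example above and are a minimal sketch, not the exact prompt table of the application:

```python
# Sketch: splice prefix prompt text, description text, and suffix prompt text.
def build_prompt(prefix, description, suffix):
    return f"{prefix} {description} {suffix}"

prompt = build_prompt("a beautiful painting of", "a cat",
                      "isometric, Warwick Goble, trending on artstation")
# -> "a beautiful painting of a cat isometric, Warwick Goble, trending on artstation"
```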
By setting different image styles and setting prompt texts corresponding to the image styles, the generated image can be more accurate.
Based on the above-mentioned image generation method, the embodiment of the present disclosure further provides a schematic diagram of a generation device for an image, as shown in fig. 3.
Fig. 3 is a schematic diagram of an apparatus for generating an image according to an embodiment of the present disclosure, where the apparatus includes:
a text obtaining module 300, configured to obtain a description text of an image to be generated;
the first input module 302 is configured to input the description text into a first model trained in advance, and obtain each target image corresponding to the description text output by the first model;
a second input module 304, configured to input, for each target image, the target image into a second model trained in advance, so as to obtain image features of the target image output by the second model; inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model;
an image determining module 306, configured to determine a final target image according to the similarity between the image features and the text features of the target images.
Optionally, the text obtaining module 300 is specifically configured to obtain an original text of the image to be generated; determining the input text language of the first model and/or the second model; and determining the description text of the image to be generated according to the input text language and the original text.
Optionally, the first input module 302 is specifically configured to accelerate the pre-trained first model by using TensorRT, and obtain, through the accelerated first model, each target image corresponding to the description text output by the first model.
Optionally, the first input module 302 is further configured to convert parameters in the pre-trained first model into the TensorRT parameter format.
Optionally, the first input module 302 is specifically configured to obtain an image style of an image to be generated; determining a prompt text corresponding to the image style; the prompt text and the description text are input into a first model trained in advance.
Optionally, the first input module 302 is specifically configured to obtain at least two random seeds for the first model, wherein a random seed is used to enable the first model to generate a target image; and to input, for each random seed, the random seed and the description text into the first model, so that the first model initializes a noise image according to the random seed, and obtains a target image corresponding to the random seed according to the noise image and the description text.
Optionally, the second model includes at least: a text encoder and an image encoder;
the second input module 304 is specifically configured to input the target image into the image encoder, so as to obtain an image feature of the target image output by the image encoder; and inputting the descriptive text into the text encoder to obtain text characteristics of the descriptive text output by the text encoder.
Optionally, the first model is a stable diffusion model;
the second model is a contrast text image model.
Optionally, the image determining module 306 is specifically configured to take, as the final target image, the target image corresponding to the greatest similarity among the similarities.
The embodiments of the present specification also provide a computer-readable storage medium storing a computer program operable to execute the image generation method described above.
Based on the above image generation method, the embodiment of the present disclosure further provides the schematic structural diagram of the electronic device shown in Fig. 4. At the hardware level, as shown in Fig. 4, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the image generation method described above.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language called a hardware description language (Hardware Description Language, HDL), of which there is not just one but many kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can readily be obtained merely by slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner; for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art also know that, in addition to implementing the controller purely in computer-readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present application.

Claims (20)

1. A method of generating an image, the method comprising:
acquiring a description text of an image to be generated;
inputting the description text into a pre-trained first model to obtain each target image corresponding to the description text output by the first model;
inputting each target image into a pre-trained second model to obtain the image characteristics of the target image output by the second model; inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model;
and determining a final target image according to the similarity between the image characteristics of each target image and the text characteristics.
2. The method of claim 1, wherein the acquiring descriptive text of the image to be generated specifically comprises:
acquiring an original text of an image to be generated;
determining the input text language of the first model and/or the second model;
and determining the description text of the image to be generated according to the input text language and the original text.
3. The method of claim 1, wherein obtaining each target image corresponding to the descriptive text output by the first model specifically comprises:
accelerating the pre-trained first model by using TensorRT, and obtaining, through the accelerated first model, each target image corresponding to the description text output by the first model.
4. The method of claim 3, wherein prior to accelerating the pre-trained first model using TensorRT, the method further comprises:
converting parameters in the pre-trained first model into a parameter format of the TensorRT.
5. The method of claim 1, wherein inputting the descriptive text into a pre-trained first model, specifically comprises:
acquiring an image style of an image to be generated;
determining a prompt text corresponding to the image style;
the prompt text and the description text are input into a first model trained in advance.
6. The method of claim 1, wherein inputting the descriptive text into a pre-trained first model to obtain each target image corresponding to the descriptive text output by the first model, specifically comprises:
acquiring random seeds of at least two first models; wherein the random seed is used for enabling the first model to generate a target image;
and inputting, for each random seed, the random seed and the description text into the first model, so that the first model initializes a noise image according to the random seed, and obtains a target image corresponding to the random seed according to the noise image and the description text.
7. The method of claim 1, wherein the second model comprises at least: a text encoder and an image encoder;
inputting the target image into a pre-trained second model to obtain the image characteristics of the target image output by the second model, wherein the method specifically comprises the following steps:
inputting the target image into the image encoder to obtain the image characteristics of the target image output by the image encoder;
inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model, wherein the method specifically comprises the following steps:
and inputting the descriptive text into the text encoder to obtain text characteristics of the descriptive text output by the text encoder.
8. The method of claim 1, wherein the first model is a stable diffusion model;
the second model is a contrast text image model.
9. The method according to claim 1, wherein determining the final target image based on the similarity between the image features and the text features of the target images, comprises:
and taking the target image corresponding to the maximum similarity in the similarities as a final target image.
10. An image generation apparatus, characterized in that the apparatus specifically comprises:
the text acquisition module is used for acquiring the descriptive text of the image to be generated;
the first input module is used for inputting the description text into a pre-trained first model to obtain each target image corresponding to the description text output by the first model;
the second input module is used for inputting each target image into a pre-trained second model to obtain the image characteristics of the target image output by the second model; inputting the descriptive text into the pre-trained second model to obtain text characteristics of the descriptive text output by the second model;
and the image determining module is used for determining a final target image according to the similarity between the image characteristics of each target image and the text characteristics.
11. The apparatus of claim 10, wherein the text acquisition module is specifically configured to acquire an original text of an image to be generated; determining the input text language of the first model and/or the second model; and determining the description text of the image to be generated according to the input text language and the original text.
12. The apparatus of claim 10, wherein the first input module is specifically configured to accelerate the pre-trained first model by using TensorRT, and obtain, through the accelerated first model, each target image corresponding to the description text output by the first model.
13. The apparatus of claim 12, wherein the first input module is further configured to convert parameters in the pre-trained first model into the TensorRT parameter format.
14. The apparatus of claim 10, wherein the first input module is specifically configured to obtain an image style of an image to be generated; determining a prompt text corresponding to the image style; the prompt text and the description text are input into a first model trained in advance.
15. The apparatus of claim 10, wherein the first input module is specifically configured to obtain at least two random seeds for the first model, wherein a random seed is used to enable the first model to generate a target image; and to input, for each random seed, the random seed and the description text into the first model, so that the first model initializes a noise image according to the random seed, and obtains a target image corresponding to the random seed according to the noise image and the description text.
16. The apparatus of claim 10, wherein the second model comprises at least: a text encoder and an image encoder;
the second input module is specifically configured to input the target image into the image encoder, so as to obtain an image feature of the target image output by the image encoder; and inputting the descriptive text into the text encoder to obtain text characteristics of the descriptive text output by the text encoder.
17. The apparatus of claim 10, wherein the first model is a stable diffusion model;
the second model is a contrast text image model.
18. The apparatus of claim 10, wherein the image determining module is specifically configured to take a target image corresponding to a maximum similarity among the similarities as the final target image.
19. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-9.
20. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding claims 1-9 when the program is executed.
CN202310448216.3A 2023-04-24 2023-04-24 Image generation method and device, storage medium and electronic equipment Pending CN116188632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310448216.3A CN116188632A (en) 2023-04-24 2023-04-24 Image generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310448216.3A CN116188632A (en) 2023-04-24 2023-04-24 Image generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116188632A 2023-05-30

Family

ID=86449288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310448216.3A Pending CN116188632A (en) 2023-04-24 2023-04-24 Image generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116188632A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105938485A (en) * 2016-04-14 2016-09-14 北京工业大学 Image description method based on convolution cyclic hybrid model
CN110301117A (en) * 2017-11-24 2019-10-01 微软技术许可有限责任公司 Response is provided in a session
WO2020227971A1 (en) * 2019-05-15 2020-11-19 Microsoft Technology Licensing, Llc Image generation
CN110263247A (en) * 2019-05-20 2019-09-20 话媒(广州)科技有限公司 A kind of return information recommended method and electronic equipment
CN113313022A (en) * 2021-05-27 2021-08-27 北京百度网讯科技有限公司 Training method of character recognition model and method for recognizing characters in image
US20230068103A1 (en) * 2021-08-31 2023-03-02 Alibaba Damo (Hangzhou) Technology Co., Ltd. Image generation system and method
CN113449135A (en) * 2021-08-31 2021-09-28 阿里巴巴达摩院(杭州)科技有限公司 Image generation system and method
CN114067327A (en) * 2021-11-18 2022-02-18 北京有竹居网络技术有限公司 Text recognition method and device, readable medium and electronic equipment
CN114359132A (en) * 2021-11-18 2022-04-15 中国空间技术研究院 Method for searching pedestrian by using text description generated image
CN114138954A (en) * 2021-11-22 2022-03-04 泰康保险集团股份有限公司 User consultation problem recommendation method, system, computer equipment and storage medium
CN114550177A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Image processing method, text recognition method and text recognition device
CN115017266A (en) * 2022-06-23 2022-09-06 天津理工大学 Scene text retrieval model and method based on text detection and semantic matching and computer equipment
CN115908640A (en) * 2022-12-23 2023-04-04 北京字跳网络技术有限公司 Method and device for generating image, readable medium and electronic equipment

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863034B (en) * 2023-07-11 2024-05-14 华院计算技术(上海)股份有限公司 Method for improving diffusion model picture generation effect
CN116863034A (en) * 2023-07-11 2023-10-10 华院计算技术(上海)股份有限公司 Method for improving diffusion model picture generation effect
CN116682110A (en) * 2023-07-20 2023-09-01 腾讯科技(深圳)有限公司 Image processing method, device, equipment and medium
CN116682110B (en) * 2023-07-20 2023-10-31 腾讯科技(深圳)有限公司 Image processing method, device, equipment and medium
CN116645668B (en) * 2023-07-21 2023-10-20 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN116645668A (en) * 2023-07-21 2023-08-25 腾讯科技(深圳)有限公司 Image generation method, device, equipment and storage medium
CN116721186A (en) * 2023-08-10 2023-09-08 北京红棉小冰科技有限公司 Drawing image generation method and device, electronic equipment and storage medium
CN116721186B (en) * 2023-08-10 2023-12-01 北京红棉小冰科技有限公司 Drawing image generation method and device, electronic equipment and storage medium
CN117079048A (en) * 2023-08-29 2023-11-17 贵州电网有限责任公司 Geological disaster image recognition method and system based on CLIP model
CN117079048B (en) * 2023-08-29 2024-05-14 贵州电网有限责任公司 Geological disaster image recognition method and system based on CLIP model
CN117351328B (en) * 2023-12-04 2024-02-13 杭州灵西机器人智能科技有限公司 Method, system, equipment and medium for generating annotation image
CN117351328A (en) * 2023-12-04 2024-01-05 杭州灵西机器人智能科技有限公司 Method, system, equipment and medium for generating annotation image
CN117493608A (en) * 2023-12-26 2024-02-02 西安邮电大学 Text video retrieval method, system and computer storage medium
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium
CN118015161A (en) * 2024-04-08 2024-05-10 之江实验室 Method and device for generating rehabilitation video

Similar Documents

Publication Publication Date Title
CN116188632A (en) Image generation method and device, storage medium and electronic equipment
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN117331561B (en) Intelligent low-code page development system and method
CN116502176A (en) Pre-training method and device of language model, medium and electronic equipment
CN115981870B (en) Data processing method and device, storage medium and electronic equipment
CN116720008B (en) Machine reading method and device, storage medium and electronic equipment
CN117555644B (en) Front-end page construction method and device based on natural language interaction
CN117369783B (en) Training method and device for security code generation model
CN117409466B (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN117828360A (en) Model training method, model training device, model code generating device, storage medium and storage medium
CN116402165B (en) Operator detection method and device, storage medium and electronic equipment
CN116186231A (en) Method and device for generating reply text, storage medium and electronic equipment
CN112287130A (en) Searching method, device and equipment for graphic questions
CN116451808B (en) Model training method and device, storage medium and electronic equipment
CN115859975B (en) Data processing method, device and equipment
CN117992600B (en) Service execution method and device, storage medium and electronic equipment
CN115658891B (en) Method and device for identifying intention, storage medium and electronic equipment
CN117710510B (en) Image generation method and device
CN113343716B (en) Multilingual translation method, device, storage medium and equipment
CN117934858B (en) Point cloud processing method and device, storage medium and electronic equipment
CN117079646B (en) Training method, device, equipment and storage medium of voice recognition model
CN117540706A (en) Action insertion method and device, storage medium and electronic equipment
CN118626663A (en) Model training and task executing method, device, storage medium and equipment
CN117591217A (en) Information display method, device, equipment and storage medium
CN117711403A (en) Text error correction model training method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination