CN117475020A - Image generation method, device, equipment and medium

Image generation method, device, equipment and medium

Info

Publication number: CN117475020A
Application number: CN202311531987.5A
Authority: CN (China)
Prior art keywords: stage, input information, preset, current, control network
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 刘艺博, 张璐, 陶明
Current and original assignee: Shanghai Renyimen Technology Co ltd
Application filed 2023-11-16 by Shanghai Renyimen Technology Co ltd; priority to CN202311531987.5A
Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation (G: PHYSICS; G06: COMPUTING; G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL)
    • G06N 20/00: Machine learning (G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS)
    • G06N 5/04: Inference or reasoning models (G06N 5/00: Computing arrangements using knowledge-based models)
    • G06T 3/4007: Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation (G06T 3/00: Geometric image transformations in the plane of the image; G06T 3/40: Scaling of whole images or parts thereof)
    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution


Abstract

The application discloses an image generation method, device, equipment and medium, relating to the field of computer technology, and comprising the following steps: acquiring current input information, and performing inference on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result; screening abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words; using a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result; and determining the two-stage image generation result as the target generated image corresponding to the current input information. The first stage uses a diffusion model to obtain a coarse-grained one-stage image generation result and screens out abstract prompt words as the two-stage prompt words; the second stage uses the target control network and the two-stage prompt words to guide the generated image to converge toward higher quality, so the generated image has higher resolution and richer detail.

Description

Image generation method, device, equipment and medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an image generating method, apparatus, device, and medium.
Background
The quality of the generated image is central to an AIGC (Artificial Intelligence Generated Content, i.e., generative artificial intelligence) project: image generation technology can rapidly produce large amounts of vivid and innovative image content and greatly improve creative efficiency, so achieving a better generation effect is of significant importance.
In the prior art, a diffusion model is generally used to generate images, but the generated images often lack detail, are prone to blurring and distortion, and tend to have low resolution.
In summary, how to make the resolution of the generated image higher and the details richer is a problem to be solved in the art.
Disclosure of Invention
In view of the above, the present invention aims to provide an image generating method, apparatus, device and medium, which can make the resolution of the generated image higher and the details richer. The specific scheme is as follows:
in a first aspect, the present application discloses an image generation method, including:
acquiring current input information, and performing inference on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result;
screening abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words;
using a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result;
and determining the two-stage image generation result as a target generated image corresponding to the current input information.
Optionally, the acquiring of the current input information includes:
acquiring current input information in a text-to-picture mode, wherein the current input information is text information;
or, acquiring current input information in a picture-to-picture mode, wherein the current input information includes text information and picture information.
Optionally, the performing of inference on the current input information using the first preset diffusion model and the preset prompt words includes:
configuring one-stage inference parameters of the first preset diffusion model, so as to perform inference on the current input information using the first preset diffusion model and the preset prompt words;
correspondingly, the using of the target control network to guide the second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words includes:
configuring two-stage inference parameters of the second preset diffusion model, so that the target control network guides the second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words.
Optionally, before the current input information is obtained, the method further includes:
collecting original image data, and preprocessing the original image data to obtain initial training data;
constructing initial text condition input information corresponding to the initial training data;
and updating parameters of an initial control network by using the initial training data and the initial text condition input information to obtain a target control network.
Optionally, the preprocessing of the original image data to obtain the initial training data includes:
performing center cropping on the central square area of the original image data to obtain cropped image data;
and performing linear interpolation on the cropped image data using a preset linear interpolator to obtain the initial training data.
Optionally, the performing of linear interpolation on the cropped image data using the preset linear interpolator to obtain the initial training data includes:
performing linear interpolation on the cropped image data using the preset linear interpolator to adjust the resolution of the cropped image data, so as to obtain first interpolated image data;
and processing the first interpolated image data again using the preset linear interpolator to obtain second interpolated image data at different resolutions, and determining the second interpolated image data at different resolutions as the initial training data.
Optionally, the updating of the parameters of the initial control network using the initial training data and the initial text condition input information to obtain the target control network includes:
determining the initial training data, the initial text condition input information and an initial control network as the current training data, the current text condition input information and the current control network, respectively;
inputting the current training data and the current text condition input information into the current control network to obtain a current output result;
judging whether a preset iteration stop condition is currently met;
if not, determining a loss function value between the current training data and the current output result, and updating the current control network according to the loss function value to obtain a next control network;
and updating the next training data, the next text condition input information and the next control network to be the current training data, the current text condition input information and the current control network, respectively, then jumping back to the step of inputting the current training data and the current text condition input information into the current control network, until the preset iteration stop condition is met, at which point the current control network is determined as the target control network.
In a second aspect, the present application discloses an image generation apparatus comprising:
the one-stage inference module, configured to acquire current input information and perform inference on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result;
the prompt word adjustment module, configured to screen abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words;
the two-stage inference module, configured to use a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result;
and the target generated image determining module, configured to determine the two-stage image generation result as the target generated image corresponding to the current input information.
In a third aspect, the present application discloses an electronic device comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the previously disclosed image generation method.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the steps of the previously disclosed image generation method.
The beneficial effects of the application are as follows: current input information is acquired, and inference is performed on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result; abstract modification-class prompt words are screened from the preset prompt words to obtain two-stage prompt words; a target control network is used to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result; and the two-stage image generation result is determined as the target generated image corresponding to the current input information. In the first stage, the diffusion model thus produces a coarse-grained one-stage image generation result, and abstract prompt words are screened from the preset prompt words as the two-stage prompt words; in other words, the prompt words used in the two stages differ, so that in the second stage the target control network and the two-stage prompt words guide the generated image to converge toward higher quality, giving the generated image higher resolution and richer detail.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of an image generation method disclosed in the present application;
FIG. 2 is a flowchart of a specific image generation method disclosed in the present application;
FIG. 3 is a schematic diagram of a specific image generation structure disclosed herein;
FIG. 4 is a schematic diagram of an image generating apparatus disclosed in the present application;
fig. 5 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The quality of the generated image is central to an AIGC project: image generation technology can rapidly produce large amounts of vivid and innovative image content and greatly improve creative efficiency, so a better generation effect is of significant importance.
In the prior art, a diffusion model is generally used to generate images, but the generated images often lack detail, are prone to blurring and distortion, and tend to have low resolution.
Therefore, the present application provides a corresponding image generation scheme, so that the generated image has higher resolution and richer detail.
Referring to fig. 1, an embodiment of the present application discloses an image generating method, including:
step S11: the method comprises the steps of obtaining current input information, and reasoning the current input information by using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result.
In this embodiment, the obtaining the current input information includes: acquiring current input information of a text generation picture mode; the current input information is text information; or, acquiring current input information of a picture generation picture mode; wherein the current input information includes text information and picture information. It may be understood that the current input information is reference information for subsequent image generation, that is, the subsequent image generation is based on the current input information, where the current input information may be text information, and further, the current input information may be image generation of a subsequent expanded text generation picture mode, that is, a text map mode, or the current input information may also include text information and picture information, that is, image generation of a subsequent expanded picture generation picture mode, that is, a map mode, and the user may determine the current input information according to different usage scenarios so as to obtain the current input information by using a diffusion model (stable diffusion, that is, SD).
In this embodiment, the reasoning the current input information by using the first preset diffusion model and the preset prompting word includes: and configuring first-stage reasoning parameters of a first preset diffusion model to reason the current input information by using the first preset diffusion model and preset prompt words. In this embodiment, a user configures a first-stage reasoning parameter for a first preset diffusion model separately, and then reasoning about current input information based on the first preset-stage reasoning parameter by using the first preset diffusion model and a preset prompt word.
The user sets a preset prompting word, wherein the preset prompting word comprises an abstract modification prompting word and a negative prompting word, and also comprises an entity word, the abstract modification prompting word is, for example, "best quality", "masterpiece", the negative prompting word is, for example, "world quality", "bad", "ugly", and the entity word is, for example, "a handname man". It should be noted that, because the diffusion model is trained based on english text during training, the diffusion model cannot recognize chinese, and thus the preset hint word is english.
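For illustration only (not part of the claimed scheme), the one-stage inference described above could be configured along the following lines with the open-source diffusers library; the checkpoint name, parameter values and prompt strings are assumptions, not values given in this application:

```python
from diffusers import StableDiffusionPipeline
import torch

# Hypothetical one-stage setup; checkpoint and parameter values are
# illustrative assumptions, not specified by this application.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

preset_prompt = "a handsome man, best quality, masterpiece"  # entity word + abstract modifiers
negative_prompt = "worst quality, bad, ugly"                 # negative prompt words

# One-stage inference parameters are configured independently (step S11).
stage1_image = pipe(
    prompt=preset_prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,  # assumed value
    guidance_scale=7.5,      # assumed value
).images[0]
```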
Step S12: screen abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words.
For example, if the preset prompt words are "a handsome man, best quality, masterpiece", then "a handsome man" is an entity word while "best quality" and "masterpiece" are abstract modification-class prompt words, so only "best quality, masterpiece" is kept as the two-stage prompt words; that is, the two-stage prompt words contain no entity words. In conventional two-stage redrawing, the text prompt words of the two stages are usually identical, which over-emphasizes some entity semantic information during generation and partially distorts the generated image. In this embodiment, the preset prompt words used in the first stage and the two-stage prompt words are deliberately not identical: the two-stage prompt words consist of quality-modifier prompt words, and this text guidance makes the image converge toward higher quality.
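A minimal sketch of this screening step, assuming the abstract modification-class vocabulary is maintained as a simple allow-list (the list contents and function name are illustrative assumptions):

```python
# Illustrative allow-list of abstract modification-class prompt words;
# a real system would maintain its own vocabulary.
ABSTRACT_MODIFIERS = {"best quality", "masterpiece", "highly detailed", "sharp focus"}

def screen_two_stage_prompts(preset_prompt: str) -> str:
    """Keep only abstract modification-class prompt words (step S12);
    entity words such as "a handsome man" are dropped."""
    tokens = [t.strip() for t in preset_prompt.split(",")]
    return ", ".join(t for t in tokens if t.lower() in ABSTRACT_MODIFIERS)

print(screen_two_stage_prompts("a handsome man, best quality, masterpiece"))
# -> "best quality, masterpiece"
```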
Step S13: use a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result.
It can be understood that an initial control network (ControlNet) is created in advance and then iteratively trained to obtain the target control network, so that the target control network and the two-stage prompt words can be used to run inference on the one-stage image generation result again, that is, to redraw the image in the second stage, so as to obtain the two-stage image generation result.
A control network is an adapter network for controlling the generation result: by placing the generation process under different guidance conditions, the output can be steered in the direction of the guidance condition. The detail-control ControlNet designed in this embodiment enriches the generated details (textures and the like) and improves image resolution, under the guidance condition that the two-stage image generation result remains semantically consistent with the one-stage image generation result. Image redrawing is a common trick in the prior art, but because conventional redrawing does not apply ControlNet conditional control, the redrawn result often deviates considerably from the one-stage generation effect.
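For background, ControlNet-style adapters are commonly built by training a copy of the denoising network's encoder and injecting its output into the frozen backbone through zero-initialized convolutions; the PyTorch sketch below illustrates that general technique as an assumption, not the exact architecture of this application:

```python
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution initialized to zero, so the adapter initially
    contributes nothing and gradually learns its influence."""
    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlAdapterBlock(nn.Module):
    """One adapter block: a trainable copy of an encoder block whose
    output is added to the frozen backbone through a zero convolution."""
    def __init__(self, encoder_block_copy: nn.Module, channels: int):
        super().__init__()
        self.trainable_copy = encoder_block_copy  # trained; backbone stays frozen
        self.zero_conv = ZeroConv2d(channels)

    def forward(self, backbone_feature, condition_feature):
        return backbone_feature + self.zero_conv(self.trainable_copy(condition_feature))
```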
In this embodiment, the using of the target control network to guide the second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words includes: configuring two-stage inference parameters of the second preset diffusion model, so that the target control network guides the second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words. It should be noted that the parameters of the first preset diffusion model and the parameters of the second preset diffusion model are configured independently, which avoids losing part of the original image information after the second redrawing, improves the quality of the generated image, and enriches picture detail.
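Continuing the earlier sketch, the two-stage redrawing could be wired up with diffusers' ControlNet image-to-image pipeline; the checkpoint names and parameter values are assumptions, and `target_controlnet_path` is a placeholder for the trained target control network:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# "target_controlnet_path" is a placeholder for the trained target control network.
controlnet = ControlNetModel.from_pretrained(
    "target_controlnet_path", torch_dtype=torch.float16
)
pipe2 = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Two-stage inference parameters are configured independently of stage one.
stage2_image = pipe2(
    prompt="best quality, masterpiece",  # two-stage prompt words only
    image=stage1_image,                  # one-stage result as the img2img input
    control_image=stage1_image,          # and as the ControlNet condition
    strength=0.6,                        # assumed redraw strength
    num_inference_steps=30,              # assumed value
    guidance_scale=7.5,                  # assumed value
).images[0]
```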
Step S14: determine the two-stage image generation result as the target generated image corresponding to the current input information.
The two-stage image generation result is the target generated image corresponding to the current input information; the target generated image is displayed in the display interface and may also be stored in a corresponding storage location so that the user can process it later.
The beneficial effects of the application are as follows: current input information is acquired, and inference is performed on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result; abstract modification-class prompt words are screened from the preset prompt words to obtain two-stage prompt words; a target control network is used to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result; and the two-stage image generation result is determined as the target generated image corresponding to the current input information. In the first stage, the diffusion model thus produces a coarse-grained one-stage image generation result, and abstract prompt words are screened from the preset prompt words as the two-stage prompt words; in other words, the prompt words used in the two stages differ, so that in the second stage the target control network and the two-stage prompt words guide the generated image to converge toward higher quality, giving the generated image higher resolution and richer detail.
Referring to fig. 2, an embodiment of the present application discloses a specific image generating method, including:
step S21: collecting original image data, and preprocessing the original image data to obtain initial training data; constructing initial text condition input information corresponding to the initial training data; and updating parameters of an initial control network by using the initial training data and the initial text condition input information to obtain a target control network.
In this embodiment, the preprocessing of the original image data to obtain the initial training data includes: performing center cropping on the central square area of the original image data to obtain cropped image data; and performing linear interpolation on the cropped image data using a preset linear interpolator to obtain the initial training data. A large number of original images are collected, for example 100,000. Each original image is preprocessed by cropping its central square area to obtain cropped image data, and each piece of cropped image data is then linearly interpolated with the preset linear interpolator to change its resolution, finally yielding the initial training data.
In this embodiment, the performing of linear interpolation on the cropped image data using the preset linear interpolator to obtain the initial training data includes: performing linear interpolation on the cropped image data using the preset linear interpolator to adjust its resolution, so as to obtain first interpolated image data; and processing the first interpolated image data again using the preset linear interpolator to obtain second interpolated image data at different resolutions, and determining the second interpolated image data at different resolutions as the initial training data. The specific process of acquiring the initial training data is as follows:
1) Perform a first linear interpolation on the cropped image data to change its resolution, obtaining first interpolated image data at a resolution of 512×512;
2) Transform the first interpolated image data by linear interpolation to obtain second interpolated image data at three smaller resolutions, 64×64, 128×128 and 256×256, respectively;
3) Use the second interpolated image data at resolutions 64×64, 128×128 and 256×256 as the initial training data.
It will be appreciated that if there are 50 pieces of first interpolated image data, then 150 pieces of second interpolated image data are obtained after the second linear interpolation, that is, second interpolated image data at 64×64, 128×128 and 256×256 for each piece of first interpolated image data. The pre-training weights of the diffusion model were trained on 512×512 samples, and fine-tuning works better with square pictures, so linear interpolation is needed to obtain square initial training data; the purpose of the second interpolation pass is to obtain small-resolution sample data, which strengthens the diffusion model's detail generation on small-resolution images.
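As an illustrative sketch consistent with steps 1) to 3) above, using Pillow's bilinear resampling as the preset linear interpolator (the choice of Pillow and the file path are assumptions):

```python
from PIL import Image

SMALL_RESOLUTIONS = (64, 128, 256)  # small training resolutions from the text

def preprocess(original_path: str) -> list[Image.Image]:
    """Center-crop the central square, interpolate to 512x512, then
    interpolate again down to 64/128/256 initial training samples."""
    img = Image.open(original_path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    cropped = img.crop((left, top, left + side, top + side))  # center square
    first = cropped.resize((512, 512), Image.BILINEAR)        # first interpolation
    # Second interpolation pass: small-resolution initial training data.
    return [first.resize((r, r), Image.BILINEAR) for r in SMALL_RESOLUTIONS]

samples = preprocess("example.jpg")  # three training samples per original image
```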
It should be noted that the initial text condition input information constructed for the initial training data consists of abstract modifier-class prompt words and contains no prompt words describing the picture's entities.
In this embodiment, updating the parameters of the initial control network using the initial training data and the initial text condition input information to obtain the target control network includes: determining the initial training data, the initial text condition input information and the initial control network as the current training data, the current text condition input information and the current control network, respectively; inputting the current training data and the current text condition input information into the current control network to obtain a current output result; judging whether a preset iteration stop condition is currently met; if not, determining a loss function value between the current training data and the current output result, and updating the current control network according to the loss function value to obtain a next control network; and updating the next training data, the next text condition input information and the next control network to be the current training data, the current text condition input information and the current control network, respectively, then jumping back to the step of inputting the current training data and the current text condition input information into the current control network, until the preset iteration stop condition is met, at which point the current control network is determined as the target control network.
The iterative training of the control network is an iterative update of the parameters of the current control network, where the parameters are updated through the loss function value (loss) between the input data and the output data of the current control network, that is, according to the difference or error between the current control network's prediction and the actual target.
Before the iterative training of the control network, training hyper-parameters are set, for example 60 epochs (rounds), a learning rate of 1e-5, the AdamW optimizer, and fp16 precision. If the preset iteration stop condition is not currently met, that is, the current number of iterations has not reached the preset threshold (for example, a preset threshold of 60 epochs with a current count of 45), iterative training continues.
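A minimal training-loop sketch under the stated hyper-parameters (60 epochs, learning rate 1e-5, AdamW, fp16); the control-network class, the dataloader and the MSE objective are illustrative assumptions, since this application does not fix them:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_control_net(control_net, dataloader, epochs=60, lr=1e-5):
    """Iterative parameter update sketch; `control_net` and `dataloader`
    (yielding image tensors and text-condition embeddings) are assumed."""
    control_net = control_net.cuda()
    optimizer = torch.optim.AdamW(control_net.parameters(), lr=lr)
    scaler = GradScaler()           # fp16 mixed-precision training
    criterion = torch.nn.MSELoss()  # assumed loss between training data and output

    for epoch in range(epochs):     # preset iteration stop condition: 60 epochs
        for images, text_cond in dataloader:
            images, text_cond = images.cuda(), text_cond.cuda()
            optimizer.zero_grad()
            with autocast():
                output = control_net(images, text_cond)
                loss = criterion(output, images)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
    return control_net              # the target control network
```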
Step S22: acquire current input information, and perform inference on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result.
For example, in the specific image generation structure shown in fig. 3, when current input information in picture format is acquired, the one-stage inference parameters are determined, so that the first preset diffusion model and the preset prompt words can be used to perform inference on the current input information and obtain the one-stage image generation result.
Step S23: screen abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words.
Step S24: use a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result.
As shown in fig. 3, after the one-stage image generation result is obtained, the image is redrawn again on the basis of that result; that is, the target control network guides the second preset diffusion model to perform inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain the two-stage image generation result. The two-stage inference parameters need to be set separately; that is, the one-stage inference parameters and the two-stage inference parameters are configured independently.
Step S25: determine the two-stage image generation result as the target generated image corresponding to the current input information.
In summary, the first stage of the application generates the outline and general information of the target object; on that basis, the one-stage generation result is taken as input for a two-stage image-to-image redrawing, in which the additionally trained ControlNet serves as the guide and controls the similarity of the two-stage picture, so that the second stage finally generates a target image with richer detail and higher resolution. In addition, independent control conditions and parameter strategies are set for the two stages, and this decoupled configuration improves the robustness of the similarity between the two generations.
Referring to fig. 4, an embodiment of the present application discloses an image generating apparatus, including:
the one-stage inference module 11, configured to acquire current input information and perform inference on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result;
the prompt word adjustment module 12, configured to screen abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words;
the two-stage inference module 13, configured to use a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result;
and the target generated image determining module 14, configured to determine the two-stage image generation result as the target generated image corresponding to the current input information.
The beneficial effects of the application are as follows: current input information is acquired, and inference is performed on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result; abstract modification-class prompt words are screened from the preset prompt words to obtain two-stage prompt words; a target control network is used to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result; and the two-stage image generation result is determined as the target generated image corresponding to the current input information. In the first stage, the diffusion model thus produces a coarse-grained one-stage image generation result, and abstract prompt words are screened from the preset prompt words as the two-stage prompt words; in other words, the prompt words used in the two stages differ, so that in the second stage the target control network and the two-stage prompt words guide the generated image to converge toward higher quality, giving the generated image higher resolution and richer detail.
Further, an embodiment of the present application also provides an electronic device. Fig. 5 is a block diagram of an electronic device 20 according to an exemplary embodiment, and the contents of the figure should not be construed as limiting the scope of use of the present application in any way.
Fig. 5 is a schematic structural diagram of the electronic device according to an embodiment of the present application, which specifically includes: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is adapted to store a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the image generation method performed by the electronic device as disclosed in any of the foregoing embodiments.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device; the communication interface 24 can create a data transmission channel between the electronic device and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The processor 21 may include one or more processing cores, such as a 4-core or 8-core processor, and may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 21 may also comprise a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 21 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may also include an AI (Artificial Intelligence) processor for handling machine-learning computations.
The memory 22, as a carrier for storing resources, may be a read-only memory, a random access memory, a magnetic disk, or an optical disk; the resources stored on it include an operating system 221, a computer program 222, and data 223, and the storage may be temporary or permanent.
The operating system 221 manages and controls the hardware devices on the electronic device and the computer program 222, enabling the processor 21 to operate on and process the mass data 223 in the memory 22; it may be Windows, Unix, or Linux. The computer program 222 may further include, in addition to the computer program able to perform the image generation method disclosed in any of the foregoing embodiments, computer programs for other specific tasks. The data 223 may include, in addition to data received by the electronic device from external devices, data collected by its own input/output interface 25, and so on.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the image generation method disclosed previously. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, the embodiments are described in a progressive manner, each focusing on its differences from the others; for the parts that are the same or similar, the embodiments may refer to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both; the various illustrative elements and steps have been described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be placed in RAM (Random Access Memory), ROM (Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), a register, a hard disk, a removable disk, a CD-ROM (Compact Disc Read-Only Memory), or any other form of storage medium known in the art.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any such actual relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but possibly also other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The image generation method, apparatus, device and medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the invention, and the above embodiments are intended only to help understand the method and its core ideas. At the same time, a person skilled in the art may vary the specific implementation and application scope according to the ideas of the invention; in view of this, the contents of this description should not be construed as limiting the invention.

Claims (10)

1. An image generation method, comprising:
acquiring current input information, and performing inference on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result;
screening abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words;
using a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result;
and determining the two-stage image generation result as a target generated image corresponding to the current input information.
2. The image generation method according to claim 1, wherein the acquiring current input information includes:
acquiring current input information in a text-to-picture mode, wherein the current input information is text information;
or, acquiring current input information in a picture-to-picture mode, wherein the current input information includes text information and picture information.
3. The image generation method according to claim 1, wherein the performing of inference on the current input information using the first preset diffusion model and the preset prompt words comprises:
configuring one-stage inference parameters of the first preset diffusion model, so as to perform inference on the current input information using the first preset diffusion model and the preset prompt words;
correspondingly, the using of the target control network to guide the second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words comprises:
configuring two-stage inference parameters of the second preset diffusion model, so that the target control network guides the second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words.
4. The image generation method according to any one of claims 1 to 3, wherein, prior to the acquiring of the current input information, the method further comprises:
collecting original image data, and preprocessing the original image data to obtain initial training data;
constructing initial text condition input information corresponding to the initial training data;
and updating parameters of an initial control network by using the initial training data and the initial text condition input information to obtain a target control network.
5. The image generation method according to claim 4, wherein the preprocessing of the original image data to obtain the initial training data comprises:
performing center cropping on the central square area of the original image data to obtain cropped image data;
and performing linear interpolation on the cropped image data using a preset linear interpolator to obtain the initial training data.
6. The image generation method according to claim 5, wherein the performing of linear interpolation on the cropped image data using the preset linear interpolator to obtain the initial training data comprises:
performing linear interpolation on the cropped image data using the preset linear interpolator to adjust the resolution of the cropped image data, so as to obtain first interpolated image data;
and processing the first interpolated image data again using the preset linear interpolator to obtain second interpolated image data at different resolutions, and determining the second interpolated image data at different resolutions as the initial training data.
7. The image generation method according to claim 4, wherein updating parameters of an initial control network using the initial training data and the initial text condition input information to obtain a target control network, comprises:
determining the initial training data, the initial text condition input information and an initial control network as current training data, current text condition input information and a current control network respectively;
inputting the current training data and the current text condition input information into the current control network to obtain a current output result;
judging whether a preset iteration stop condition is currently met;
if not, determining a loss function value between the current training data and the current output result, and updating the current control network according to the loss function value to obtain a next control network;
and updating the next training data, the next text condition input information and the next control network to be the current training data, the current text condition input information and the current control network, respectively, then jumping back to the step of inputting the current training data and the current text condition input information into the current control network, until the preset iteration stop condition is met, at which point the current control network is determined as the target control network.
8. An image generating apparatus, comprising:
the one-stage inference module, configured to acquire current input information and perform inference on the current input information using a first preset diffusion model and preset prompt words to obtain a one-stage image generation result;
the prompt word adjustment module, configured to screen abstract modification-class prompt words from the preset prompt words to obtain two-stage prompt words;
the two-stage inference module, configured to use a target control network to guide a second preset diffusion model in performing inference on the one-stage image generation result based on the two-stage prompt words, so as to obtain a two-stage image generation result;
and the target generated image determining module, configured to determine the two-stage image generation result as the target generated image corresponding to the current input information.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the image generation method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program; wherein the computer program when executed by a processor implements the steps of the image generation method of any of claims 1 to 7.
Priority Applications (1)

CN202311531987.5A (priority and filing date 2023-11-16): Image generation method, device, equipment and medium. Status: Pending.

Publications (1)

CN117475020A (en), published 2024-01-30

Family ID: 89639552
Country Status: CN (1)

Cited By (2)

* Cited by examiner, † Cited by third party

CN117808933A * (priority 2024-02-29, published 2024-04-02), 成都索贝数码科技股份有限公司: Image element decomposition and reconstruction method and device
CN117808933B * (priority 2024-02-29, published 2024-05-24), 成都索贝数码科技股份有限公司: Image element decomposition and reconstruction method and device


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination