CN116363262B - Image generation method and device and electronic equipment


Info

Publication number
CN116363262B
Authority
CN
China
Prior art keywords
image; text; features; network; reference images
Prior art date
Legal status
Active
Application number
CN202310343918.5A
Other languages
Chinese (zh)
Other versions
CN116363262A (en)
Inventor
Wang Zhanpeng (王展鹏)
Li Guohao (李国豪)
She Qiaoqiao (佘俏俏)
Li Wei (李伟)
Liu Jiachen (刘家辰)
Xiao Xinyan (肖欣延)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310343918.5A
Publication of CN116363262A
Application granted
Publication of CN116363262B
Legal status: Active

Classifications

    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text
    • G06N 20/00 Machine learning
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06T 3/40 Scaling the whole image or part thereof
    • G06T 3/4038 Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G06T 5/70
    • G06T 9/00 Image coding
    • G06T 2200/32 Indexing scheme involving image mosaicing
    • G06T 2207/20081 Training; Learning

Abstract

The disclosure provides an image generation method, an image generation apparatus, and an electronic device, relating to the technical field of artificial intelligence, and in particular to the technical fields of deep learning, natural language processing, and computer vision. The specific implementation scheme is as follows: acquiring material information to be processed, wherein the material information comprises a plurality of reference images; performing feature extraction and splicing processing on the plurality of reference images to obtain spliced features; and performing denoising processing and feature decoding processing on noise image features according to the spliced features to obtain a generated image corresponding to the material information. The method is applicable to scenes in which an image is generated based on a plurality of reference images, which improves image generation efficiency.

Description

Image generation method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, natural language processing and computer vision, and particularly relates to an image generation method, an image generation device and electronic equipment.
Background
The current hybrid image generation method mainly takes one image and a text as input, performs feature extraction on the image and the text, and performs splicing processing to obtain spliced features; the spliced features are then input into a diffusion network to obtain a generated image.
In this method, only one image feature is input into the diffusion network, so the method is difficult to apply to scenes in which an image is generated based on a plurality of images, which reduces image generation efficiency.
Disclosure of Invention
The disclosure provides an image generation method, an image generation device and electronic equipment.
According to an aspect of the present disclosure, there is provided an image generation method including: acquiring material information to be processed, wherein the material information comprises a plurality of reference images; performing feature extraction and splicing processing on the plurality of reference images to obtain spliced features; and performing denoising processing and feature decoding processing on noise image features according to the spliced features to obtain a generated image corresponding to the material information.
According to another aspect of the present disclosure, there is provided a training method of an image generation model, the method including: acquiring training data, the training data comprising: a plurality of pairs of image text; the text in the image text pair is used for describing the image in the image text pair; an initial image generation model is acquired, the image generation model comprising: the system comprises an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising and a decoding network; the splicing module is respectively connected with the image coding network, the text coding network and the diffusion network, and the diffusion network is connected with the decoding network; and training the initial image generation model by adopting a plurality of image text pairs to obtain a trained image generation model.
According to another aspect of the present disclosure, there is provided an image generating apparatus including: the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring material information to be processed, and the material information comprises a plurality of reference images; the first processing module is used for carrying out feature extraction and splicing processing on the plurality of reference images to obtain spliced features; and the second processing module is used for carrying out denoising processing and feature decoding processing on the noise image features according to the splicing features to obtain a generated image corresponding to the material information.
According to another aspect of the present disclosure, there is provided a training apparatus of an image generation model, the apparatus including: the first acquisition module is used for acquiring training data, and the training data comprises: a plurality of pairs of image text; the text in the image text pair is used for describing the image in the image text pair; a second acquisition module for acquiring an initial image generation model, the image generation model comprising: the system comprises an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising and a decoding network; the splicing module is respectively connected with the image coding network, the text coding network and the diffusion network, and the diffusion network is connected with the decoding network; and the training module is used for training the initial image generation model by adopting a plurality of image text pairs to obtain a trained image generation model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image generation method or the training method of the image generation model set forth above in the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the image generation method or the training method of the image generation model proposed in the above-described disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the image generation method or the training method of the image generation model set forth above of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram of image generation;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device used to implement the image generation method or the training method of the image generation model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The current hybrid image generation method mainly takes one image and a text as input, performs feature extraction on the image and the text, and performs splicing processing to obtain spliced features; the spliced features are then input into a diffusion network to obtain a generated image.
In this method, only one image feature is input into the diffusion network, so the method is difficult to apply to scenes in which an image is generated based on a plurality of images, which reduces image generation efficiency.
In view of the above, the present disclosure provides an image generating method, an image generating device, and an electronic device.
Fig. 1 is a schematic diagram of a first embodiment of the present disclosure. It should be noted that the image generation method of the embodiment of the present disclosure may be applied to an image generation apparatus, which may be configured in an electronic device so that the electronic device can perform the image generation function. In the following embodiments, an electronic device is taken as an example of the execution body.
The electronic device may be any device with computing capability, for example a personal computer (PC), a mobile terminal, or a server; the mobile terminal may be, for example, a vehicle-mounted device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or a smart speaker, that is, a hardware device with an operating system and a touch screen and/or a display screen.
As shown in fig. 1, the image generation method may include the steps of:
Step 101, acquiring material information to be processed, wherein the material information comprises a plurality of reference images.
In the embodiment of the present disclosure, each of the plurality of reference images included in the material information has elements that need to be incorporated into the generated image. The elements are, for example, style, color, and subject. The style can be set according to actual needs, for example fresh, literary, ink-wash, or light-and-shadow styles.
Wherein the subject is, for example, an animal, plant, character, object, or the like.
The plurality of reference images each have elements to be incorporated into the generated image, and the reference images may or may not be correlated with one another. An association exists, for example, when the subjects in the plurality of reference images are all objects of the same kind.
In the embodiment of the present disclosure, at least one of the following may be further included in the material information: the influence weight of the reference image, the description text for describing the generated image, the influence weight of the description text, the size information of the generated image, the number information of the generated images, and the like.
Taking two reference images, an image of a ragdoll cat and an image of a Labrador retriever, as an example, the description text may be, for example, "a dog with fur colored like the ragdoll cat", which describes the characteristics of the generated image to be generated.
Because the material information includes at least one of the influence weight of a reference image, the description text describing the generated image, the influence weight of the description text, the size information of the generated image, and the quantity information of the generated images, the user can flexibly set the description text, influence weights, size information, quantity information, and so on according to requirements, and a generated image meeting the user's requirements can be produced, which improves the flexibility and efficiency of image generation.
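As a concrete illustration, the material information described above may be organized as a simple record. The following is a minimal Python sketch; all field names and default values are hypothetical illustrations rather than structures taken from the patent:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class MaterialInfo:
    # The only mandatory content: a plurality of reference images.
    reference_images: List[str]                                # e.g. file paths
    # Optional items named in the paragraph above; defaults are assumptions.
    image_weights: List[float] = field(default_factory=list)   # influence weight per reference image
    description_text: Optional[str] = None                     # text describing the generated image
    text_weight: float = 1.0                                   # influence weight of the description text
    size: Optional[Tuple[int, int]] = None                     # (width, height) of the generated image
    num_images: int = 1                                        # how many images to generate
```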
In the disclosed embodiments, a reference image may be an original image and/or a historically generated image. The original image is, for example, an image captured by an image acquisition device; the historically generated image is, for example, another generated image previously produced from a plurality of original images.
The reference images may all be original images, all be historically generated images, or be a mix of original images and historically generated images. This arrangement allows the electronic device to generate a new image based on a historically generated image in combination with the present embodiment. For example, if a historically generated image does not meet the user's requirements, the user can provide other reference images and regenerate the image by combining the historically generated image with the other reference images, so that an image meeting the user's requirements can be produced quickly, which improves image generation efficiency.
Step 102, performing feature extraction and splicing processing on the plurality of reference images to obtain spliced features.
In the embodiment of the present disclosure, the process of the electronic device executing step 102 may be, for example: performing feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images; and performing splicing processing on the image features of the plurality of reference images to obtain spliced features.
The electronic device can input the reference images into the image coding network of the image generation model to obtain the image features output by the image coding network, and can input the image features into the splicing module of the image generation model to obtain the spliced features output by the splicing module. The spliced features are then input into the diffusion network and the decoding network to obtain the generated image.
An image quantity threshold and a text quantity threshold are set in the diffusion network. The number of reference images may be less than or equal to the image quantity threshold. Descriptive text may or may not be included in the material information, but when it is included, the number of descriptive texts may be less than or equal to the text quantity threshold. The text quantity threshold may be 1, for example.
When descriptive text is not included in the material information and the number of reference images is equal to the image quantity threshold, the electronic device can acquire the unconditional text feature of the empty character, and perform splicing processing on the image features of the plurality of reference images and the unconditional text feature to obtain the spliced features.
When descriptive text is not included in the material information and the number of reference images is smaller than the image quantity threshold, the electronic device may determine the difference between the image quantity threshold and the number of reference images, denoted N for example; acquire the unconditional text feature of the empty character and N unconditional image features; and perform splicing processing on the image features of the plurality of reference images, the N unconditional image features, and the unconditional text feature to obtain the spliced features.
Setting unconditional text features and unconditional image features makes the input of the diffusion network the splicing result of an image-quantity-threshold number of image features and a text-quantity-threshold number of text features, which meets the input requirement of the diffusion network. The user can decide whether to input descriptive text according to actual requirements and can flexibly adjust the number of reference images, which improves the flexibility and efficiency of image generation.
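This padding scheme can be sketched in a few lines of PyTorch. The threshold value, the feature width, and the use of zero vectors as the unconditional features are all assumptions for illustration, since the patent does not fix them (in practice the unconditional features could also be learned embeddings):

```python
import torch

IMAGE_COUNT_THRESHOLD = 4          # assumed value of the image quantity threshold
FEATURE_DIM = 768                  # assumed feature width

# Assumed stand-ins for the unconditional features.
UNCOND_IMAGE_FEATURE = torch.zeros(1, FEATURE_DIM)
UNCOND_TEXT_FEATURE = torch.zeros(1, FEATURE_DIM)   # feature of the empty character

def build_spliced_features(image_features, text_feature=None):
    """image_features: list of (1, FEATURE_DIM) tensors, at most the threshold."""
    n_missing = IMAGE_COUNT_THRESHOLD - len(image_features)
    padded = list(image_features) + [UNCOND_IMAGE_FEATURE] * n_missing
    text = text_feature if text_feature is not None else UNCOND_TEXT_FEATURE
    # Splice along the sequence axis: (IMAGE_COUNT_THRESHOLD + 1, FEATURE_DIM).
    return torch.cat(padded + [text], dim=0)
```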
Step 103, performing denoising processing and feature decoding processing on the noise image features according to the spliced features to obtain a generated image corresponding to the material information.
In the embodiment of the present disclosure, the process of the electronic device executing step 103 may be, for example: inputting the noise image features and the spliced features into the diffusion network that performs denoising processing, and obtaining the hidden features output by the diffusion network, where the hidden features are obtained by the diffusion network denoising the noise image features multiple times in combination with the spliced features; and performing feature decoding processing on the hidden features to obtain a generated image corresponding to the material information.
The denoising process of the diffusion network may be, for example: denoising random noise image features in combination with the spliced features to obtain primary hidden features; denoising the primary hidden features in combination with the spliced features to obtain secondary hidden features; and repeating the above process until the number of denoising passes reaches a preset threshold.
The noise image features are subjected to denoising processing by combining the diffusion network and the splicing features, and then decoding processing is performed, so that a generated image with the splicing features can be obtained, and the generation efficiency of the generated image is improved.
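A minimal sketch of this conditioned denoising loop follows. `diffusion_net`, `scheduler`, and `decoder` are hypothetical callables, and the scheduler interface is modeled loosely on common diffusion toolkits rather than taken from the patent:

```python
import torch

def generate_image(diffusion_net, scheduler, decoder, spliced_features,
                   latent_shape=(1, 4, 64, 64)):
    latent = torch.randn(latent_shape)                   # random noise image features
    for t in scheduler.timesteps:                        # preset number of denoising passes
        # Each pass is conditioned on the spliced features.
        noise_pred = diffusion_net(latent, t, spliced_features)
        latent = scheduler.step(noise_pred, t, latent)   # one denoising pass -> hidden features
    return decoder(latent)                               # feature decoding -> generated image
```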
According to the image generation method of the embodiment of the present disclosure, material information to be processed is acquired, wherein the material information comprises a plurality of reference images; feature extraction and splicing processing are performed on the plurality of reference images to obtain spliced features; and denoising processing and feature decoding processing are performed on the noise image features according to the spliced features to obtain a generated image corresponding to the material information. The method is therefore applicable to scenes in which an image is generated based on a plurality of reference images, which improves image generation efficiency.
In order to generate the image accurately and improve image generation efficiency, the material information may include description text describing the generated image, so that the electronic device can produce a generated image that conforms to the description text. As shown in fig. 2, fig. 2 is a schematic diagram of a second embodiment according to the present disclosure, and the embodiment shown in fig. 2 may include the following steps:
Step 201, acquiring material information to be processed, where the material information includes a plurality of reference images and description text for describing the generated image.
Step 202, performing image feature extraction processing on the plurality of reference images respectively to obtain image features of the plurality of reference images.
In the embodiment of the disclosure, for each reference image, the electronic device may input the reference image into the image coding network of the image generation model, and acquire the image characteristics output by the image coding network.
Step 203, performing text feature extraction processing on the descriptive text to obtain text features of the descriptive text.
In the embodiment of the disclosure, the electronic device may input the description text into the text encoding network of the image generation model to obtain the text characteristics output by the text encoding network.
Step 204, performing splicing processing on the image features of the plurality of reference images and the text features of the descriptive text to obtain spliced features.
In the embodiment of the disclosure, an image number threshold and a text number threshold are set in the diffusion network. Wherein the number of reference images may be less than or equal to the image number threshold; the number of descriptive text may be less than or equal to the text number threshold. The text quantity threshold may be 1, for example.
Under the condition that descriptive text is included in the material information and the number of the reference images is equal to the threshold value of the number of the images, the electronic equipment can directly splice the image characteristics of the plurality of reference images and the text characteristics of the descriptive text to obtain spliced characteristics.
Wherein, in the case that the descriptive text is included in the material information and the number of the reference images is smaller than the image number threshold, the electronic device may determine a difference value between the image number threshold and the number of the reference images, for example, N; acquiring N unconditional image features; and performing splicing processing on the image features of the plurality of reference images, the N unconditional image features and the text features of the descriptive text to obtain splicing features.
Setting unconditional image features makes the input of the diffusion network an image-quantity-threshold number of image features, which meets the input requirement of the diffusion network, and the user can flexibly adjust the number of reference images according to actual requirements, which improves the flexibility and efficiency of image generation.
Step 205, performing denoising processing and feature decoding processing on the noise image features according to the spliced features to obtain a generated image corresponding to the material information.
It should be noted that, for details of step 201 and step 205, reference may be made to step 101 and step 103 in the embodiment shown in fig. 1, and details are not repeated here.
According to the image generation method, material information to be processed is obtained, wherein the material information comprises a plurality of reference images and descriptive text for describing the generated images; respectively carrying out image feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images; extracting text features of the descriptive text to obtain the text features of the descriptive text; splicing the image features of the plurality of reference images and the text features of the descriptive text to obtain spliced features; and according to the splicing characteristics, denoising processing and characteristic decoding processing are carried out on the noise image characteristics to obtain a generated image corresponding to the material information, so that the method is applicable to a scene of image generation based on a plurality of reference images and descriptive texts, and the image generation efficiency is improved.
In order to flexibly adjust the influence of the reference images and the description text on the generated image and improve image generation efficiency, the material information may include the influence weights of the reference images and the influence weight of the description text. As shown in fig. 3, fig. 3 is a schematic diagram of a third embodiment according to the present disclosure, and the embodiment shown in fig. 3 may include the following steps:
Step 301, acquiring material information to be processed, where the material information includes a plurality of reference images, description text for describing the generated image, influence weights of the plurality of reference images, and an influence weight of the description text.
In the embodiment of the disclosure, the user can determine the reference elements to be referenced in each reference image according to requirements, and determine the influence weight of each reference image according to the reference elements expected to be referenced in it.
Likewise, the user can determine the influence weight of the descriptive text according to requirements.
Step 302, performing image feature extraction processing on the plurality of reference images respectively to obtain image features of the plurality of reference images.
Step 303, for each reference image, weighting the image features of the reference image according to the influence weights of the reference image, so as to obtain weighted image features of the reference image.
In the embodiment of the disclosure, an image number threshold and a text number threshold are set in the diffusion network. Wherein the number of reference images may be less than or equal to the image number threshold; the number of descriptive text may be less than or equal to the text number threshold. The text quantity threshold may be 1, for example.
When the number of reference images in the material information is smaller than the image quantity threshold, the electronic device may determine the difference between the image quantity threshold and the number of reference images, denoted N for example, and acquire N unconditional image features. Because the unconditional image features are ignored during the processing of the diffusion network, the electronic device may set their influence weights randomly and weight the unconditional image features with those weights.
Step 304, performing text feature extraction processing on the descriptive text to obtain text features of the descriptive text.
In the case where the descriptive text is not included in the material information, the electronic device may acquire an unconditional text feature of the null character. Because the unconditional text feature is ignored or not considered during the processing of the diffusion network, the electronic device can randomly set the influence weight for the unconditional text feature and perform the weighting processing on the unconditional text feature in combination with the influence weight.
Step 305, weighting the text features of the descriptive text according to the influence weight of the descriptive text to obtain weighted text features of the descriptive text.
Step 306, performing splicing processing on the weighted image features and the weighted text features to obtain spliced features.
Step 307, performing denoising processing and feature decoding processing on the noise image features according to the spliced features to obtain a generated image corresponding to the material information.
In addition, the user's requirement for the generated image may be, for example, mixing the subjects of a plurality of reference images, migrating the style of some reference images onto one reference image, or mixing the subjects of some reference images with the background of one reference image. According to the requirement, suitable reference images can be selected and suitable description text, influence weights, and the like can be set, so that a generated image meeting the requirement is conveniently produced.
It should be noted that, for details of step 301 and step 307, reference may be made to step 101 and step 103 in the embodiment shown in fig. 1, and details are not repeated here.
The image generation method of the embodiment of the disclosure includes acquiring material information to be processed, wherein the material information comprises a plurality of reference images, descriptive text for describing the generated images, influence weights of the plurality of reference images and influence weights of the descriptive text; respectively carrying out image feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images; for each reference image, weighting the image characteristics of the reference image according to the influence weight of the reference image to obtain weighted image characteristics of the reference image; extracting text features of the descriptive text to obtain the text features of the descriptive text; according to the influence weight of the descriptive text, carrying out weighting processing on the text characteristics of the descriptive text to obtain weighted text characteristics of the descriptive text; splicing the weighted image features and the weighted text features to obtain spliced features; according to the splicing characteristics, denoising processing and characteristic decoding processing are carried out on the noise image characteristics to obtain a generated image corresponding to the material information, so that the method is applicable to scenes for image generation based on a plurality of reference images, description texts, influence weights of the reference images and influence weights of the description texts, and the image generation efficiency is improved.
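The weighting and splicing of steps 303 to 306 can be condensed into a few lines. This is a sketch under the same assumptions as the earlier snippets; multiplying each feature by its scalar influence weight is the natural reading of the "weighting processing" above, though the patent does not spell out the arithmetic:

```python
import torch

def weight_and_splice(image_features, image_weights, text_feature, text_weight):
    # Scale each encoded feature by its influence weight, then concatenate.
    weighted = [w * f for f, w in zip(image_features, image_weights)]
    weighted.append(text_weight * text_feature)
    return torch.cat(weighted, dim=0)   # spliced features for the diffusion network
```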
The following example is illustrative. Fig. 4 is a schematic diagram of image generation. In fig. 4, the material information includes 3 reference images (image 1, image 2, image 3) and 1 description text. (1) Images 1 to 3 are input into the image coding network (image encoder), and the features output by the image encoder are weighted according to the corresponding influence weights (1.0 for image 1, 1.5 for image 2, and 0.75 for image 3) to obtain 3 image features (the feature of image 1, the feature of image 2, and the feature of image 3). (2) The text description is input into the text coding network (text encoder), and the feature output by the text encoder is weighted according to the corresponding influence weight (0.8 for the text description) to obtain one text feature. (3) The splicing module performs splicing processing on the feature of image 1, the feature of image 2, the feature of image 3, an unconditional image feature (unconditional vector), and the text feature, and inputs the result into the diffusion network (diffusion model). (4) The diffusion model performs denoising processing on the input noise image features multiple times according to the spliced features to obtain hidden features. (5) The hidden features are input into the decoding network (VAE decoder) to obtain the generated image (result image).
The splicing module is specifically realized as calling logic for each network in the image generation model. The calling logic can call the image coding network and the text coding network to acquire the image features and the text features; it performs weighted splicing processing on the image features and the text features in combination with the influence weights and provides the spliced features to the diffusion network; it can then call the diffusion network to perform denoising processing multiple times to obtain the hidden features; finally, it can provide the hidden features to the decoding network and call the decoding network to output the generated image.
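Putting the Fig. 4 numbers through the weighting scheme above looks roughly like this. The sketch is self-contained with dummy random encoders; only the influence weights 1.0, 1.5, 0.75, and 0.8 come from the figure, and everything else (feature width, a threshold of 4 image slots) is an assumption:

```python
import torch

FEATURE_DIM = 768
image_encoder = lambda img: torch.randn(1, FEATURE_DIM)   # dummy stand-in encoder
text_encoder = lambda txt: torch.randn(1, FEATURE_DIM)    # dummy stand-in encoder

image_weights = [1.0, 1.5, 0.75]          # influence weights of images 1-3 in Fig. 4
image_feats = [w * image_encoder(i) for i, w in zip(range(3), image_weights)]
text_feat = 0.8 * text_encoder("text description")
uncond = torch.zeros(1, FEATURE_DIM)      # one unused image slot, assuming a threshold of 4
spliced = torch.cat(image_feats + [uncond, text_feat], dim=0)
print(spliced.shape)                      # torch.Size([5, 768])
```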
Fig. 5 is a schematic diagram of a fourth embodiment of the disclosure. It should be noted that the training method of the image generation model according to the embodiment of the disclosure may be applied to a training apparatus of the image generation model, which may be configured in an electronic device so that the electronic device can perform the training function of the image generation model. In the following embodiments, an electronic device is taken as an example of the execution body.
The electronic device may be any device with computing capability, for example a personal computer (PC), a mobile terminal, or a server; the mobile terminal may be, for example, a vehicle-mounted device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or a smart speaker, that is, a hardware device with an operating system and a touch screen and/or a display screen.
As shown in fig. 5, the training method of the image generation model may include the steps of:
step 501, obtaining training data, wherein the training data comprises: a plurality of pairs of image text; text in the image text pair is used to describe the image in the image text pair.
Step 502, an initial image generation model is obtained, wherein the image generation model comprises: the system comprises an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising and a decoding network; and the splicing module is respectively connected with the image coding network, the text coding network and the diffusion network, and the diffusion network is connected with the decoding network.
In the embodiments of the present disclosure, the input of the image encoding network may be an image and the output may be an image feature of the image. The image coding network in the initial image generation model can be a pre-trained image coding network, so that the accuracy of the extracted image features is improved, the training speed of the image generation model is further increased, and the accuracy of the image generation model obtained through training is improved.
In the embodiments of the present disclosure, the input of the text encoding network may be text and the output may be a text feature of the text. The text coding network in the initial image generation model can be a text coding network after pre-training, so that the accuracy of the extracted text features is improved, the training speed of the image generation model is further improved, and the accuracy of the image generation model obtained through training is improved.
The input of the splicing module can be a plurality of image features, or the plurality of image features plus text features, and the output of the splicing module can be the splicing features. The input of the diffusion network can be a splicing characteristic, and the output can be a hidden characteristic obtained by denoising the noise image characteristic by combining the splicing characteristic. The input to the decoding network may be a hidden feature and the output may be a generated image.
The splicing module is specifically realized as calling logic for each network in the image generation model. The calling logic can call the image coding network and the text coding network to acquire the image features and the text features; it performs weighted splicing processing on the image features and the text features in combination with the influence weights and provides the spliced features to the diffusion network; it can then call the diffusion network to perform denoising processing multiple times to obtain the hidden features; finally, it can provide the hidden features to the decoding network and call the decoding network to output the generated image.
In the embodiment of the disclosure, the diffusion network in the initial image generation model may be a pre-trained diffusion network, so as to improve accuracy of the extracted hidden features, further speed up training of the image generation model, and improve accuracy of the image generation model obtained by training.
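The five components of Fig. 5 and their wiring can be sketched as one module. This is a PyTorch skeleton under stated assumptions: the sub-networks are injected placeholders, and the single diffusion call stands in for the multi-pass denoising described above:

```python
import torch
import torch.nn as nn

class ImageGenerationModel(nn.Module):
    def __init__(self, image_encoder, text_encoder, diffusion_net, decoder):
        super().__init__()
        self.image_encoder = image_encoder   # extracts image features
        self.text_encoder = text_encoder     # extracts text features
        self.diffusion_net = diffusion_net   # denoises conditioned on spliced features
        self.decoder = decoder               # decodes hidden features into an image

    def forward(self, images, text, noise):
        feats = [self.image_encoder(img) for img in images]
        feats.append(self.text_encoder(text))
        spliced = torch.cat(feats, dim=0)            # the splicing module
        hidden = self.diffusion_net(noise, spliced)  # diffusion network
        return self.decoder(hidden)                  # decoding network
```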
Step 503, training the initial image generation model by using the plurality of image text pairs to obtain a trained image generation model.
In the embodiment of the present disclosure, the process of the electronic device executing step 503 may be, for example: for each image text pair, inputting the image and the text in the image text pair into the image generation model to obtain a prediction-generated image output by the image generation model; constructing a loss function according to the prediction-generated image and the image in the image text pair; and adjusting parameters of the image generation model according to the value of the loss function to obtain the trained image generation model.
An image text pair comprises an image and a text that describes the image, and an image generated based on the text and the image is generally similar to the paired image. Therefore, a loss function can be constructed from the prediction-generated image output by the image generation model and the image in the image text pair, and used to adjust the parameters of the image generation model, which reduces the cost of acquiring training data and the cost of training the image generation model.
The loss function can be constructed according to the similarity between the prediction-generated image and the image in the image text pair.
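A minimal training-step sketch for step 503, reusing the model skeleton above, might look as follows. The similarity-based loss is rendered as mean-squared error purely for illustration; the patent says only that the loss is built from the similarity between the prediction-generated image and the paired image:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, image, text, noise):
    predicted = model([image], text, noise)      # prediction-generated image
    loss = F.mse_loss(predicted, image)          # similarity to the paired image (assumed form)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```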
In an embodiment of the present disclosure, in order to balance the effect of training data on the trained image generation model, after step 503, the electronic device may further perform the following process: and adjusting parameters of the diffusion network in the trained image generation model according to parameters of the pre-trained diffusion network.
In the process of adjusting the parameters of the diffusion network in the trained image generation model, the electronic device may, for example, obtain, for each parameter of the diffusion network in the trained image generation model, the corresponding parameter in the pre-trained diffusion network; perform weighting processing on the parameter and the corresponding parameter to obtain a weighted parameter; and update the parameter of the diffusion network in the image generation model with the weighted parameter.
Adjusting each parameter of the diffusion network in the trained image generation model according to the corresponding parameter in the pre-trained diffusion network balances the influence of the training data on the parameters of the diffusion network, prevents the model from over-fitting, and improves the accuracy of the trained image generation model.
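This parameter adjustment can be sketched as a per-parameter blend with the pre-trained weights. The blending factor `alpha` is an assumption; the patent says only that each parameter and its pre-trained counterpart are weighted together:

```python
import torch

@torch.no_grad()
def blend_with_pretrained(trained_diffusion, pretrained_diffusion, alpha=0.5):
    pretrained = dict(pretrained_diffusion.named_parameters())
    for name, param in trained_diffusion.named_parameters():
        # weighted parameter = alpha * trained + (1 - alpha) * pre-trained
        param.mul_(alpha).add_((1.0 - alpha) * pretrained[name])
```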
According to the training method of the image generation model, training data is obtained, and the training data comprises the following steps: a plurality of pairs of image text; text in the image text pair for describing the image in the image text pair; an initial image generation model is acquired, the image generation model comprising: the system comprises an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising and a decoding network; the splicing module is respectively connected with the image coding network, the text coding network and the diffusion network, and the diffusion network is connected with the decoding network; and training the initial image generation model by adopting a plurality of image text pairs to obtain a trained image generation model, so that the trained image generation model can be suitable for scenes of image generation based on a plurality of reference images, and further the image generation efficiency is improved.
In order to achieve the above embodiments, the present disclosure also provides an image generating apparatus. As shown in fig. 6, fig. 6 is a schematic diagram according to a fifth embodiment of the present disclosure. The image generating apparatus 60 may include: an acquisition module 601, a first processing module 602 and a second processing module 603.
The acquiring module 601 is configured to acquire material information to be processed, where the material information includes a plurality of reference images;
the first processing module 602 is configured to perform feature extraction and stitching on the plurality of reference images to obtain stitching features;
and a second processing module 603, configured to perform denoising processing and feature decoding processing on the noise image feature according to the stitching feature, so as to obtain a generated image corresponding to the material information.
As one possible implementation manner of the embodiment of the present disclosure, the material information further includes description text for describing the generated image; the first processing module 602 is specifically configured to perform feature extraction and stitching processing on the plurality of reference images and the description text, so as to obtain the stitching feature.
As one possible implementation of the embodiment of the present disclosure, the first processing module 602 includes: the device comprises a first extraction unit, a second extraction unit and a splicing unit; the first extraction unit is used for respectively carrying out image feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images; the second extraction unit is used for extracting text features of the descriptive text to obtain the text features of the descriptive text; and the splicing unit is used for carrying out splicing processing on the image characteristics of the plurality of reference images and the text characteristics of the descriptive text to obtain the splicing characteristics.
As one possible implementation manner of the embodiment of the present disclosure, the material information further includes influence weights of a plurality of the reference images, and influence weights of the description text; the stitching unit is specifically configured to perform weighting processing on image features of each reference image according to influence weights of the reference images, so as to obtain weighted image features of the reference images; according to the influence weight of the descriptive text, weighting the text characteristics of the descriptive text to obtain weighted text characteristics of the descriptive text; and performing splicing processing on the weighted image features and the weighted text features to obtain the spliced features.
As one possible implementation manner of the embodiment of the present disclosure, the number of the reference images is less than or equal to an image number threshold of the diffusion network performing the denoising process, and the apparatus further includes: a determining module; the determining module is used for determining a difference value between the image quantity threshold value and the quantity of the reference images in the condition that the quantity of the reference images is smaller than the image quantity threshold value; the first processing module 602 is further configured to perform stitching on the image features of the plurality of reference images, the unconditional image features of the difference value, and the text features of the descriptive text, to obtain the stitched feature.
As one possible implementation manner of the embodiment of the present disclosure, a diffusion network performing denoising processing is provided with a text quantity threshold, where the text quantity threshold is 1; the first processing module 602 is specifically configured to perform feature extraction processing on a plurality of reference images to obtain image features of the plurality of reference images; acquiring unconditional text characteristics of the blank character; and performing splicing processing on the image features of the plurality of reference images and the unconditional text features of the blank characters to obtain the splicing features.
As a possible implementation manner of the embodiment of the present disclosure, the second processing module 603 is specifically configured to input the noise image feature and the stitching feature into a diffusion network performing denoising processing, obtain a hidden feature output by the diffusion network, where the hidden feature is obtained by denoising the noise image feature multiple times by combining the diffusion network with the stitching feature; and carrying out feature decoding processing on the hidden features to obtain a generated image corresponding to the material information.
As one possible implementation of the embodiments of the present disclosure, the reference image is an original image, and/or a historically generated image.
As one possible implementation manner of the embodiment of the present disclosure, the material information further includes: size information and/or number information for instructing generation of a generated image having the size information and/or for instructing generation of a generated image having the number information.
According to the image generating apparatus of the embodiment of the present disclosure, material information to be processed is acquired, wherein the material information comprises a plurality of reference images; feature extraction and splicing processing are performed on the plurality of reference images to obtain spliced features; and denoising processing and feature decoding processing are performed on the noise image features according to the spliced features to obtain a generated image corresponding to the material information. The apparatus is therefore applicable to scenes in which an image is generated based on a plurality of reference images, which improves image generation efficiency.
In order to achieve the above embodiment, the present disclosure further provides a training apparatus for an image generation model. As shown in fig. 7, fig. 7 is a schematic diagram according to a sixth embodiment of the present disclosure. The training apparatus 70 for an image generation model may include: a first acquisition module 701, a second acquisition module 702, and a training module 703.
The first obtaining module 701 is configured to obtain training data, where the training data includes: a plurality of pairs of image text; the text in the image text pair is used for describing the image in the image text pair;
A second acquisition module 702, configured to acquire an initial image generation model, where the image generation model includes: the system comprises an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising and a decoding network; the splicing module is respectively connected with the image coding network, the text coding network and the diffusion network, and the diffusion network is connected with the decoding network;
and the training module 703 is configured to perform training processing on the initial image generation model by using a plurality of pairs of the image text, so as to obtain a trained image generation model.
As one possible implementation manner of the embodiment of the present disclosure, the training module 703 is specifically configured to, for each image text pair, input, into the image generation model, an image and a text in the image text pair, and obtain a prediction generated image output by the image generation model; generating an image according to the prediction and constructing a loss function of the image in the image text pair; and carrying out parameter adjustment on the image generation model according to the numerical value of the loss function to obtain a trained image generation model.
As one possible implementation of the embodiments of the present disclosure, the diffusion network in the initial image generation model is a pre-trained diffusion network.
As one possible implementation manner of the embodiments of the present disclosure, the apparatus further includes: and the adjustment module is used for adjusting the parameters of the diffusion network in the trained image generation model according to the parameters of the pre-trained diffusion network.
As one possible implementation manner of the embodiments of the present disclosure, the adjustment module is specifically configured to obtain, for each parameter of a diffusion network in the trained image generation model, a corresponding parameter in the pre-trained diffusion network; weighting the parameters and the corresponding parameters to obtain weighted parameters; and updating the parameters of the diffusion network in the image generation model according to the weighted parameters.
According to the training device for the image generation model, training data is obtained, and the training data comprises: a plurality of pairs of image text; text in the image text pair for describing the image in the image text pair; an initial image generation model is acquired, the image generation model comprising: the system comprises an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising and a decoding network; the splicing module is respectively connected with the image coding network, the text coding network and the diffusion network, and the diffusion network is connected with the decoding network; and training the initial image generation model by adopting a plurality of image text pairs to obtain a trained image generation model, so that the trained image generation model can be suitable for scenes of image generation based on a plurality of reference images, and further the image generation efficiency is improved.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information involved are all performed on the premise of obtaining the user's consent, comply with the relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, the image generation method or the training method of the image generation model. For example, in some embodiments, the image generation method or the training method of the image generation model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the image generation method or the training method of the image generation model described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image generation method or the training method of the image generation model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps of the various flows shown above may be reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (24)

1. An image generation method, the method comprising:
acquiring material information to be processed, wherein the material information comprises a plurality of reference images;
performing feature extraction and splicing processing on the plurality of reference images to obtain spliced features;
performing denoising processing and feature decoding processing on noise image features according to the spliced features, to obtain a generated image corresponding to the material information;
wherein the material information further includes descriptive text for describing the generated image, influence weights of the plurality of reference images, and an influence weight of the descriptive text;
the performing feature extraction and splicing processing on the plurality of reference images to obtain spliced features comprises:
respectively performing image feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images;
performing text feature extraction on the descriptive text to obtain text features of the descriptive text;
for each reference image, weighting the image features of the reference image according to the influence weight of the reference image to obtain weighted image features of the reference image;
weighting the text features of the descriptive text according to the influence weight of the descriptive text to obtain weighted text features of the descriptive text;
and performing splicing processing on the weighted image features and the weighted text features to obtain the spliced features.
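For illustration only, the weighting and splicing of claim 1 can be sketched in Python; the encoder outputs, tensor shapes, and the choice of token-axis concatenation are assumptions, not part of the claimed method:

    import torch

    def build_spliced_features(image_feats, image_weights, text_feat, text_weight):
        # image_feats: one (tokens, dim) tensor per reference image.
        # Scale each image feature by its influence weight.
        weighted_images = [w * f for w, f in zip(image_weights, image_feats)]
        # Scale the descriptive-text features by their influence weight.
        weighted_text = text_weight * text_feat
        # Concatenate along the token axis to form the spliced features
        # that condition the later denoising.
        return torch.cat(weighted_images + [weighted_text], dim=0)

Raising one reference image's influence weight scales its contribution to the spliced condition, which is the control the influence weights are introduced to provide.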
2. The method of claim 1, wherein the number of reference images is less than or equal to an image number threshold of a diffusion network performing denoising processing, the method further comprising:
determining a difference between the image number threshold and the number of reference images in a case where the number of reference images is less than the image number threshold;
and performing splicing processing on the image features of the plurality of reference images, a number of unconditional image features equal to the difference, and the text features of the descriptive text to obtain the spliced features.
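A minimal sketch of the padding in claim 2, assuming the unconditional image feature is a single fixed or learned placeholder that is simply repeated; that repetition is an assumption:

    def pad_image_features(image_feats, uncond_image_feat, image_threshold):
        # When fewer reference images are supplied than the diffusion
        # network's image number threshold, fill the gap with unconditional
        # image features so the spliced input keeps a fixed layout.
        deficit = image_threshold - len(image_feats)
        return image_feats + [uncond_image_feat] * max(deficit, 0)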
3. The method according to claim 1, wherein the diffusion network performing the denoising processing is provided with a text number threshold of 1;
the performing feature extraction and splicing processing on the plurality of reference images to obtain spliced features comprises:
performing feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images;
acquiring unconditional text features of a blank character;
and performing splicing processing on the image features of the plurality of reference images and the unconditional text features of the blank character to obtain the spliced features.
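Claim 3's substitution can be pictured as below; encoding an empty string as the blank character is an assumption about how the unconditional text feature is obtained:

    import torch

    def splice_without_text(image_feats, text_encoder):
        # With a text number threshold of 1 and no descriptive text,
        # the unconditional text feature of a blank character fills
        # the single text slot in the spliced features.
        uncond_text = text_encoder("")
        return torch.cat(image_feats + [uncond_text], dim=0)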
4. The method according to claim 1, wherein the performing denoising processing and feature decoding processing on the noise image features according to the spliced features to obtain a generated image corresponding to the material information includes:
inputting the noise image features and the spliced features into a diffusion network for denoising, and obtaining hidden features output by the diffusion network, wherein the hidden features are obtained by the diffusion network denoising the noise image features a plurality of times in combination with the spliced features;
and performing feature decoding processing on the hidden features to obtain the generated image corresponding to the material information.
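A hedged sketch of claim 4's generation loop; the method name denoise_step, the fixed step count, and the scheduler-free loop are illustrative assumptions rather than the patent's implementation:

    import torch

    @torch.no_grad()
    def generate_image(noise_feats, spliced_feats, diffusion_net, decoder, steps=50):
        # The diffusion network denoises the noise image features a
        # plurality of times, conditioned on the spliced features.
        hidden = noise_feats
        for t in reversed(range(steps)):
            hidden = diffusion_net.denoise_step(hidden, t, cond=spliced_feats)
        # The hidden features are then decoded into the generated image.
        return decoder(hidden)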
5. The method of claim 1, wherein each reference image is an original image and/or a historically generated image.
6. The method of claim 1, wherein the material information further comprises: size information for instructing generation of a generated image having the indicated size, and/or quantity information for instructing generation of generated images in the indicated quantity.
7. A training method of an image generation model, the method comprising:
acquiring training data, the training data comprising a plurality of image text pairs, wherein the text in each image text pair is used for describing the image in the image text pair;
acquiring an initial image generation model, the image generation model comprising: an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising, and a decoding network; the splicing module is respectively connected with the image coding network, the text coding network, and the diffusion network, and the diffusion network is connected with the decoding network;
training the initial image generation model by adopting the plurality of image text pairs to obtain a trained image generation model;
wherein the image coding network is used for extracting image features of the images in the plurality of image text pairs to obtain a plurality of image features;
the text coding network is used for extracting text features of the texts in the plurality of image text pairs to obtain text features;
the splicing module is used for: weighting the plurality of image features output by the image coding network according to the corresponding influence weights to obtain a plurality of weighted image features; weighting the text features output by the text coding network according to the corresponding influence weight to obtain weighted text features; and performing splicing processing on the weighted image features, the unconditional image features, and the weighted text features to obtain spliced features;
the diffusion network is used for performing denoising processing on the input noise image features a plurality of times according to the spliced features to obtain hidden features;
the decoding network is used for carrying out feature decoding processing on the hidden features and outputting a generated image.
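The wiring of claim 7 can be summarized in a sketch; the constructor arguments and the forward signature are assumptions, and the unconditional-feature padding of the splicing module is omitted for brevity:

    import torch

    class ImageGenerationModel(torch.nn.Module):
        def __init__(self, image_encoder, text_encoder, diffusion_net, decoder, steps=50):
            super().__init__()
            self.image_encoder = image_encoder  # image coding network
            self.text_encoder = text_encoder    # text coding network
            self.diffusion_net = diffusion_net  # denoising diffusion network
            self.decoder = decoder              # decoding network
            self.steps = steps                  # assumed number of denoising passes

        def forward(self, images, text, image_weights, text_weight, noise):
            # Splicing module: weight, then concatenate, the encoded features.
            image_feats = [w * self.image_encoder(im)
                           for w, im in zip(image_weights, images)]
            text_feat = text_weight * self.text_encoder(text)
            spliced = torch.cat(image_feats + [text_feat], dim=0)
            # Diffusion network: repeated denoising conditioned on the splice.
            hidden = noise
            for t in reversed(range(self.steps)):
                hidden = self.diffusion_net.denoise_step(hidden, t, cond=spliced)
            # Decoding network: hidden features to generated image.
            return self.decoder(hidden)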
8. The method of claim 7, wherein the training the initial image generation model by adopting the plurality of image text pairs to obtain a trained image generation model comprises:
for each image text pair, inputting the image and the text in the image text pair into the image generation model, and obtaining a predicted image output by the image generation model;
constructing a loss function according to the predicted image and the image in the image text pair;
and adjusting parameters of the image generation model according to the value of the loss function to obtain the trained image generation model.
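One way to read claim 8 as code, assuming a model callable that maps an image text pair to a predicted image; the mean-squared-error loss and the optimizer are assumed choices, since the claim only requires a loss between the predicted image and the paired ground-truth image:

    import torch

    def train_step(model, image, text, optimizer):
        # Forward one image text pair and obtain the predicted image.
        predicted = model(image, text)
        # Construct a loss between the prediction and the ground truth.
        loss = torch.nn.functional.mse_loss(predicted, image)
        # Adjust the model parameters according to the loss value.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()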
9. The method of claim 7, wherein the diffusion network in the initial image generation model is a pre-trained diffusion network.
10. The method of claim 9, wherein the method further comprises:
adjusting the parameters of the diffusion network in the trained image generation model according to the parameters of the pre-trained diffusion network.
11. The method of claim 10, wherein the adjusting the parameters of the diffusion network in the trained image generation model according to the parameters of the pre-trained diffusion network comprises:
for each parameter of the diffusion network in the trained image generation model, acquiring the corresponding parameter in the pre-trained diffusion network;
weighting the parameter and the corresponding parameter to obtain a weighted parameter;
and updating the parameter of the diffusion network in the image generation model according to the weighted parameter.
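Claim 11's parameter blending resembles an exponential-moving-average style update; it is sketched below with an assumed 50/50 mixing weight, which the claim does not specify:

    import torch

    @torch.no_grad()
    def blend_diffusion_parameters(trained_net, pretrained_net, alpha=0.5):
        # For each trained parameter, fetch the corresponding pre-trained
        # parameter, weight the pair, and write the result back.
        for p, p_pre in zip(trained_net.parameters(), pretrained_net.parameters()):
            p.copy_(alpha * p + (1.0 - alpha) * p_pre)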
12. An image generation apparatus, the apparatus comprising:
an acquisition module, a first processing module, and a second processing module, wherein the acquisition module is used for acquiring material information to be processed, and the material information comprises a plurality of reference images;
the first processing module is used for carrying out feature extraction and splicing processing on the plurality of reference images to obtain spliced features;
the second processing module is used for performing denoising processing and feature decoding processing on noise image features according to the spliced features to obtain a generated image corresponding to the material information;
wherein the material information further includes descriptive text for describing the generated image, influence weights of the plurality of reference images, and an influence weight of the descriptive text;
the first processing module includes: a first extraction unit, a second extraction unit, and a splicing unit;
the first extraction unit is used for respectively carrying out image feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images;
The second extraction unit is used for extracting text features of the descriptive text to obtain the text features of the descriptive text;
the splicing unit is used for: weighting the image features of each reference image according to the influence weight of the reference image to obtain weighted image features of the reference image; weighting the text features of the descriptive text according to the influence weight of the descriptive text to obtain weighted text features of the descriptive text; and performing splicing processing on the weighted image features and the weighted text features to obtain the spliced features.
13. The apparatus of claim 12, wherein the number of reference images is less than or equal to an image number threshold of a diffusion network performing denoising processing, the apparatus further comprising: a determining module;
the determining module is used for determining a difference between the image number threshold and the number of reference images in a case where the number of reference images is less than the image number threshold;
the first processing module is further configured to perform splicing processing on the image features of the plurality of reference images, a number of unconditional image features equal to the difference, and the text features of the descriptive text to obtain the spliced features.
14. The apparatus of claim 12, wherein the diffusion network performing the denoising processing is provided with a text number threshold of 1; the first processing module is specifically configured to:
perform feature extraction processing on the plurality of reference images to obtain image features of the plurality of reference images;
acquire unconditional text features of a blank character;
and perform splicing processing on the image features of the plurality of reference images and the unconditional text features of the blank character to obtain the spliced features.
15. The apparatus of claim 12, wherein the second processing module is configured to:
input the noise image features and the spliced features into a diffusion network for denoising, and obtain hidden features output by the diffusion network, wherein the hidden features are obtained by the diffusion network denoising the noise image features a plurality of times in combination with the spliced features;
and perform feature decoding processing on the hidden features to obtain the generated image corresponding to the material information.
16. The apparatus of claim 12, wherein each reference image is an original image and/or a historically generated image.
17. The apparatus of claim 12, wherein the material information further comprises: size information for instructing generation of a generated image having the indicated size, and/or quantity information for instructing generation of generated images in the indicated quantity.
18. A training apparatus for image generation models, the apparatus comprising:
the first acquisition module is used for acquiring training data, the training data comprising a plurality of image text pairs, wherein the text in each image text pair is used for describing the image in the image text pair;
a second acquisition module for acquiring an initial image generation model, the image generation model comprising: an image coding network for extracting image features, a text coding network for extracting text features, a splicing module, a diffusion network for denoising, and a decoding network; the splicing module is respectively connected with the image coding network, the text coding network, and the diffusion network, and the diffusion network is connected with the decoding network;
the training module is used for training the initial image generation model by adopting the plurality of image text pairs to obtain a trained image generation model;
wherein the image coding network is used for extracting image features of the images in the plurality of image text pairs to obtain a plurality of image features;
the text coding network is used for extracting text features of the texts in the plurality of image text pairs to obtain text features;
the splicing module is used for: weighting the plurality of image features output by the image coding network according to the corresponding influence weights to obtain a plurality of weighted image features; weighting the text features output by the text coding network according to the corresponding influence weight to obtain weighted text features; and performing splicing processing on the weighted image features, the unconditional image features, and the weighted text features to obtain spliced features;
the diffusion network is used for performing denoising processing on the input noise image features a plurality of times according to the spliced features to obtain hidden features;
the decoding network is used for carrying out feature decoding processing on the hidden features and outputting a generated image.
19. The apparatus according to claim 18, wherein the training module is specifically configured to:
for each image text pair, input the image and the text in the image text pair into the image generation model, and obtain a predicted image output by the image generation model;
construct a loss function according to the predicted image and the image in the image text pair;
and adjust parameters of the image generation model according to the value of the loss function to obtain the trained image generation model.
20. The apparatus of claim 18, wherein the diffusion network in the initial image generation model is a pre-trained diffusion network.
21. The apparatus of claim 18, further comprising: an adjustment module, used for adjusting parameters of the diffusion network in the trained image generation model according to parameters of the pre-trained diffusion network.
22. The apparatus according to claim 21, wherein the adjustment module is specifically configured to:
for each parameter of the diffusion network in the trained image generation model, acquire the corresponding parameter in the pre-trained diffusion network;
weight the parameter and the corresponding parameter to obtain a weighted parameter;
and update the parameter of the diffusion network in the image generation model according to the weighted parameter.
23. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6, or the method of any one of claims 7 to 11.
24. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6, or the method of any one of claims 7 to 11.
CN202310343918.5A 2023-03-31 2023-03-31 Image generation method and device and electronic equipment Active CN116363262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310343918.5A CN116363262B (en) 2023-03-31 2023-03-31 Image generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310343918.5A CN116363262B (en) 2023-03-31 2023-03-31 Image generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN116363262A CN116363262A (en) 2023-06-30
CN116363262B (en) 2024-02-02

Family

ID=86920605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310343918.5A Active CN116363262B (en) 2023-03-31 2023-03-31 Image generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116363262B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976449A (en) * 2010-11-25 2011-02-16 上海合合信息科技发展有限公司 Method for shooting and matching multiple text images
US9990753B1 (en) * 2017-01-11 2018-06-05 Macau University Of Science And Technology Image stitching
JP2022058775A (en) * 2021-06-30 2022-04-12 Beijing Baidu Netcom Science Technology Co., Ltd. Target object generating method, apparatus therefor, electronic device, and storage medium
CN114357204A (en) * 2021-11-25 2022-04-15 腾讯科技(深圳)有限公司 Media information processing method and related equipment
KR20220127189A (en) * 2022-03-21 2022-09-19 Beijing Baidu Netcom Science Technology Co., Ltd. Training method of text recognition model, text recognition method, and apparatus
CN115861747A (en) * 2022-11-21 2023-03-28 科大讯飞股份有限公司 Image generation method, image generation device, electronic equipment and storage medium
CN115861462A (en) * 2022-10-17 2023-03-28 北京百度网讯科技有限公司 Training method and device for image generation model, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461203A (en) * 2020-03-30 2020-07-28 北京百度网讯科技有限公司 Cross-modal processing method and device, electronic equipment and computer storage medium


Also Published As

Publication number Publication date
CN116363262A (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN113674421B (en) 3D target detection method, model training method, related device and electronic equipment
CN115294349A (en) Method and device for training model, electronic equipment and storage medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN113407850B (en) Method and device for determining and acquiring virtual image and electronic equipment
CN115063875A (en) Model training method, image processing method, device and electronic equipment
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN116363261A (en) Training method of image editing model, image editing method and device
CN113033408B (en) Data queue dynamic updating method and device, electronic equipment and storage medium
CN114529796A (en) Model training method, image recognition method, device and electronic equipment
CN113870399A (en) Expression driving method and device, electronic equipment and storage medium
CN116363262B (en) Image generation method and device and electronic equipment
CN116402914A (en) Method, device and product for determining stylized image generation model
CN114399513B (en) Method and device for training image segmentation model and image segmentation
CN112784967B (en) Information processing method and device and electronic equipment
CN115101069A (en) Voice control method, device, equipment, storage medium and program product
CN114358198A (en) Instance segmentation method and device and electronic equipment
CN113989569A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113344213A (en) Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium
CN113408632A (en) Method and device for improving image classification accuracy, electronic equipment and storage medium
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN116229214B (en) Model training method and device and electronic equipment
CN113033415B (en) Data queue dynamic updating method and device, electronic equipment and storage medium
CN114330592B (en) Model generation method, device, electronic equipment and computer storage medium
CN115456167B (en) Lightweight model training method, image processing device and electronic equipment
CN114743206B (en) Text detection method, model training method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant