CN116993864A - Image generation method and device, electronic equipment and storage medium - Google Patents

Image generation method and device, electronic equipment and storage medium

Info

Publication number
CN116993864A
Authority
CN
China
Prior art keywords
image
initial
information
training
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310777137.7A
Other languages
Chinese (zh)
Inventor
陈瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310777137.7A
Publication of CN116993864A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; Combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to an image generation method, apparatus, electronic device, and storage medium. The method comprises: acquiring image feature information of an initial image, the image feature information comprising initial posture information of an initial object in the initial image and initial background semantic information of the initial image; acquiring object feature information of a target object; and synthesizing a noise image by using the object feature information and the image feature information to obtain a target image containing the target object, the noise image being obtained by adding Gaussian noise to the initial image. The background semantic information of the target image matches the initial background semantic information, and the posture of the target object in the target image matches the initial posture information. The embodiments of the disclosure provide a novel image generation method with broad application prospects in the field of image generation.

Description

Image generation method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of internet, and in particular relates to an image generation method, an image generation device, electronic equipment and a storage medium.
Background
With the development of digital technology and artificial intelligence (AI), content generation is gradually shifting from professionally produced content (for example, content from short-video bloggers with domain expertise) to artificial intelligence generated content (AIGC). AIGC greatly expands the richness and imaginative scope of digital content creation, enriches people's digital lives, and will be an indispensable supporting force for the new era of digital civilization.
In the current field of AI image generation, tasks such as generating images from text and producing new images by style conversion of an original image are widely applied. However, the existing task forms are relatively limited, and new tasks need to be developed to increase the diversity of applications in the image generation field.
Disclosure of Invention
The disclosure provides an image generation method, an image generation device, an electronic device and a storage medium, and the technical scheme of the disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided an image generating method including:
acquiring image characteristic information of an initial image; the image characteristic information comprises initial posture information of an initial object in an initial image and initial background semantic information of the initial image;
acquiring object feature information of a target object;
synthesizing a noise image by using the object feature information and the image feature information to obtain a target image containing the target object; the noise image is obtained by adding Gaussian noise to the initial image; the background semantic information of the target image matches the initial background semantic information, and the posture of the target object in the target image matches the initial posture information.
In some possible embodiments, the synthesizing the noise image using the object feature information and the image feature information to obtain a target image including the target object includes:
Taking object characteristic information and image characteristic information as guide information;
denoising the noise image for a plurality of times based on the guide information to obtain a denoised image;
and taking the denoising image as a target image until the denoising image contains the target object and the initial background of the initial image.
In some possible embodiments, denoising the noise image multiple times based on the guidance information to obtain a denoised image, including:
processing the guide information and the noise image through an image denoising model, and predicting added noise data;
determining a denoising image distribution parameter according to the noise data, the noise image and the model super-parameters of the image denoising model; the denoising image distribution parameters comprise a mean value and a variance;
and sampling from the standard Gaussian distribution based on the denoising image distribution parameter to obtain a denoising image.
In some possible embodiments, the data dimension of the object feature information is different from the data dimension of the image feature information;
the object feature information and the image feature information are used as guiding information, and the guiding information comprises:
feature fusion is carried out on the object feature information and the image feature information, and fused feature information is obtained; the data dimension of the fused characteristic information meets the preset data dimension;
And taking the fused characteristic information as guide information.
In some possible embodiments, the training manner of the image denoising model includes:
acquiring an initial training image; the initial training image comprises an initial training object;
acquiring object training characteristic information of an initial training object;
acquiring a first initial training model; the first initial training model comprises an initial feature processing model, an image noise adding module and an initial image noise removing model;
according to the image noise adding module, adding noise data conforming to Gaussian distribution to the initial training image to obtain a corresponding training noise image;
acquiring image training feature information of an initial training image according to the initial feature processing model; the image training characteristic information comprises initial posture information of an initial training object and initial background semantic information of an initial training image;
based on the training noise image, the object training feature information and the image training feature information, obtaining prediction noise data by using an initial image denoising model;
training an initial feature processing model and an initial image denoising model based on the predicted noise data and the actually added noise data until a trained feature processing model and an image denoising model are obtained;
The feature processing model is used for extracting features of the initial image to obtain image feature information of the initial image.
In some possible embodiments, acquiring image characteristic information of the initial image includes:
acquiring an initial object mask image;
determining an initial object region in the initial image based on the initial object mask image;
extracting features of the initial object region to obtain initial posture information of the initial object;
determining a region except for the initial object in the initial image as a background region based on the initial object mask image;
and carrying out background semantic extraction on the background area to obtain initial background semantic information of the initial image.
In some possible embodiments, obtaining object feature information of the target object includes:
acquiring description information of a target object; the descriptive information comprises any one of picture descriptive information, text descriptive information and audio descriptive information;
and carrying out feature coding on the description information of the target object according to the feature coding model to obtain object feature information.
In some possible embodiments, the training manner of the feature coding model includes:
acquiring a plurality of training sample pairs; each training sample pair of the plurality of training sample pairs includes first descriptive information of a training object and second descriptive information of the training object;
Acquiring a second initial training model;
performing feature coding on the first description information in each training sample pair according to the second initial training model to obtain a first feature code of each training sample pair; obtaining a first feature code set based on the first feature codes of each training sample pair;
performing feature coding on the second description information in each training sample pair according to a second initial training model to obtain a second feature code in each training sample pair; obtaining a second feature code set based on the second feature codes of each training sample pair;
and training the second initial training model based on the similarity information between each first feature code in the first feature code set and each second feature code in the second feature code set until a feature code model is obtained.
According to a second aspect of the embodiments of the present disclosure, there is provided an image generating method, including:
displaying an image synthesis page; the image synthesis page comprises a first display area, a second display area and a third display area; an initial image is displayed in the first display area; displaying an image of the target object in the second display area;
in response to a synthesis instruction, displaying, in the third display area, a target image synthesized from the target object and the initial image with the initial object removed; the background semantic information of the target image matches the initial background semantic information of the initial image, and the posture of the target object in the target image matches the initial posture information of the initial object.
In some possible embodiments, the method further comprises:
responding to a trigger instruction acted on the first display area, and displaying the selected initial image in the first display area;
in response to a target object selection instruction acting on the second presentation area, an image of the selected target object is displayed on the second presentation area.
In some possible embodiments, the image composition page includes a first operation control therein, and the method further includes:
and responding to an operation instruction aiming at the first operation control, starting an erasing function, and shielding an initial object in an initial image displayed in the first display area based on the erasing function.
In some possible embodiments, the image composition page includes a second operation control therein, and the method further includes:
and generating a composite instruction in response to the operation instruction for the second operation control, and displaying the target image in the third display area based on the composite instruction.
According to a third aspect of the embodiments of the present disclosure, there is provided an image generating apparatus including:
a first acquisition module configured to perform acquisition of image feature information of an initial image; the image characteristic information comprises initial posture information of an initial object in an initial image and initial background semantic information of the initial image;
A second acquisition module configured to perform acquisition of object feature information of the target object;
a synthesizing module configured to synthesize a noise image by using the object feature information and the image feature information to obtain a target image containing the target object; the noise image is obtained by adding Gaussian noise to the initial image; the background semantic information of the target image matches the initial background semantic information, and the posture of the target object in the target image matches the initial posture information.
In some possible embodiments, the synthesizing module is further configured to take the object feature information and the image feature information as the guide information; denoise the noise image a plurality of times based on the guide information to obtain a denoised image; and take the denoised image as the target image when the denoised image contains the target object and the initial background of the initial image.
In some possible embodiments, the synthesizing module is further configured to perform processing of the guide information and the noise image by the image denoising model, predicting the added noise data; determining a denoising image distribution parameter according to the noise data, the noise image and the model super-parameters of the image denoising model; the denoising image distribution parameters comprise a mean value and a variance; and sampling from the standard Gaussian distribution based on the denoising image distribution parameter to obtain a denoising image.
In some possible embodiments, the data dimension of the object feature information is different from the data dimension of the image feature information;
the synthesis module is further configured to perform feature fusion on the object feature information and the image feature information to obtain fused feature information; the data dimension of the fused characteristic information meets the preset data dimension; and taking the fused characteristic information as guide information.
In some possible embodiments, the apparatus further comprises a training module of the image denoising model;
the training module of the image denoising model is configured to acquire an initial training image; the initial training image comprises an initial training object; acquiring object training characteristic information of an initial training object; acquiring a first initial training model; the first initial training model comprises an initial feature processing model, an image noise adding module and an initial image noise removing model; according to the image noise adding module, adding noise data conforming to Gaussian distribution to the initial training image to obtain a corresponding training noise image; acquiring image training feature information of an initial training image according to the initial feature processing model; the image training characteristic information comprises initial posture information of an initial training object and initial background semantic information of an initial training image; based on the training noise image, the object training feature information and the image training feature information, obtaining prediction noise data by using an initial image denoising model; training an initial feature processing model and an initial image denoising model based on the predicted noise data and the actually added noise data until a trained feature processing model and an image denoising model are obtained; the feature processing model is used for extracting features of the initial image to obtain image feature information of the initial image.
In some possible embodiments, the first acquisition module is configured to perform acquisition of an initial object mask image; determining an initial object region in the initial image based on the initial object mask image; extracting features of the initial object region to obtain initial posture information of the initial object; determining a region except for the initial object in the initial image as a background region based on the initial object mask image; and carrying out background semantic extraction on the background area to obtain initial background semantic information of the initial image.
In some possible embodiments, the second obtaining module is configured to perform obtaining the description information of the target object; the descriptive information comprises any one of picture descriptive information, text descriptive information and audio descriptive information; and carrying out feature coding on the description information of the target object according to the feature coding model to obtain object feature information.
In some possible embodiments, the apparatus further comprises a training module of the feature encoding model;
a training module of the feature encoding model configured to perform acquiring a plurality of training sample pairs; each training sample pair comprises first descriptive information of a training object and second descriptive information of the training object; acquiring a second initial training model; performing feature coding on the first description information in each training sample pair according to the second initial training model to obtain a first feature code of each training sample pair; obtaining a first feature code set based on the first feature codes of each training sample pair; performing feature coding on the second description information in each training sample pair according to a second initial training model to obtain a second feature code in each training sample pair; obtaining a second feature code set based on the second feature codes of each training sample pair; and training the second initial training model based on the similarity information between each first feature code in the first feature code set and each second feature code in the second feature code set until a feature code model is obtained.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an image generating apparatus including:
a first display module configured to perform displaying an image composition page; the image synthesis page comprises a first display area, a second display area and a third display area; an initial image is displayed in the first display area; displaying an image of the target object in the second display area;
a second display module configured to display, in the third display area in response to a synthesis instruction, a target image synthesized from the target object and the initial image with the initial object removed; the background semantic information of the target image matches the initial background semantic information of the initial image, and the posture of the target object in the target image matches the initial posture information of the initial object.
In some possible embodiments, the apparatus further comprises a third display module;
a third display module configured to display the selected initial image in the first display area in response to a trigger instruction acting on the first display area, and to display an image of the selected target object in the second display area in response to a target object selection instruction acting on the second display area.
In some possible embodiments, the image synthesis page includes a first operation control, and the apparatus further includes a fourth display module;
and a fourth display module configured to start an erasing function in response to an operation instruction for the first operation control, and to shield the initial object in the initial image displayed in the first display area based on the erasing function.
In some possible embodiments, the image synthesis page includes a second operation control, and the apparatus further includes a fifth display module;
and a fifth display module configured to generate a synthesis instruction in response to an operation instruction for the second operation control, and to display the target image in the third display area based on the synthesis instruction.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute instructions to implement the image generation method as in the first aspect of the embodiments of the present disclosure or the image generation method as in the second aspect of the embodiments of the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium whose instructions, when executed by a processor of an electronic device, cause the electronic device to perform the image generation method of the first aspect or the second aspect of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the image feature information of the initial image and the object feature information of the target object are obtained, and a noise image is synthesized by using the image feature information and the object feature information to obtain a target image containing the target object. At the same time, it can be ensured that the background semantic information of the target image matches the background semantic information of the initial image and that the posture of the target object in the target image matches the posture of the initial object in the initial image. The embodiments of the disclosure provide a novel image generation method that can erase part of the content of an input image and then generate a new subject in the erased region, where the new subject blends naturally with the original input image, giving the method broad application prospects in the field of image generation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of an application environment shown in accordance with an exemplary embodiment;
FIG. 2 is a flowchart illustrating an image generation method according to an exemplary embodiment;
FIG. 3 is a schematic diagram of an initial image shown according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating one way of deriving image characteristic information for an initial image, according to an example embodiment;
FIG. 5 is a schematic diagram illustrating an image generation process according to an example embodiment;
FIG. 6 is a flowchart illustrating an example of obtaining object feature information of a target object according to an example embodiment;
FIG. 7 is a schematic diagram of a model structure shown in accordance with an exemplary embodiment;
FIG. 8 is a training flow diagram of a feature encoding model, according to an example embodiment;
FIG. 9 is a schematic diagram of a diffusion model according to an exemplary embodiment;
FIG. 10 is a flowchart illustrating one method of obtaining a target image according to an exemplary embodiment;
FIG. 11 is a flowchart illustrating one method of obtaining a denoised image according to an exemplary embodiment;
FIG. 12 is a training flow diagram of an image denoising model, according to an exemplary embodiment;
FIG. 13 is a training process diagram of an image denoising model, according to an exemplary embodiment;
FIG. 14 is a schematic diagram of an image composition page shown in accordance with an exemplary embodiment;
FIG. 15 is a flowchart illustrating an image generation method according to an exemplary embodiment;
FIG. 16 is a block diagram of an image generation apparatus according to an exemplary embodiment;
FIG. 17 is a block diagram of another image generation apparatus shown in accordance with an exemplary embodiment;
fig. 18 is a block diagram of an electronic device for image generation, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate so that the embodiments of the disclosure described herein can operate in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The user information (including but not limited to user equipment information, user personal information, etc.) related to the present disclosure is information authorized by the user or sufficiently authorized by each party.
Referring to fig. 1, fig. 1 is a schematic diagram of an application environment according to an exemplary embodiment, which includes a server 01 and a terminal device 02. Alternatively, the server 01 and the terminal device 02 may be connected through a wireless link or a wired link, which is not limited herein.
The server 01 may provide an image generation service; the server 01 may perform synthesis processing on the noise image based on the acquired image feature information of the initial image and the object feature information of the target object, to generate a target image; the server 01 generates a target image in which the initial object in the initial image is replaced by the target object, but the posture of the target object is still ensured to be consistent with the posture of the initial object, and the background semantic of the target image is consistent with the background semantic of the initial image.
Optionally, the server 01 may include a separate physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms. Operating systems running on the server may include, but are not limited to, android systems, IOS systems, linux, windows, unix, and the like.
The terminal device 02 may be the provider of the initial image and the target object and transmit them to the server 01. Alternatively, the terminal device 02 may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an augmented reality (AR)/virtual reality (VR) device, a smart wearable device, or the like; it may also be software running on such a client, for example an application or applet. Alternatively, the operating system running on the client may include, but is not limited to, Android, iOS, Linux, Windows, Unix, and the like.
In addition, it should be noted that the application environment shown in fig. 1 is merely an example. In practical applications, the image generation method of the embodiment of the present disclosure may be executed by the terminal device and the server in cooperation, or the image generation method of the embodiment of the present disclosure may be executed independently by the terminal device or the server, and the embodiment of the present disclosure is not limited to a specific application environment.
Fig. 2 is a flowchart illustrating an image generation method according to an exemplary embodiment, which may be applied to a server as shown in fig. 2, including the steps of:
In step S201, image feature information of an initial image is acquired; the image feature information includes initial pose information of an initial object in the initial image and initial background semantic information of the initial image.
In the embodiment of the application, the server may acquire the initial image from the terminal device or from another server; the initial image is an image containing an initial object, where the initial object may be any type of object including, but not limited to, a person, an animal, or an inanimate object.
In some possible embodiments, the initial image may contain four image channels: three channels are the red (R), green (G), and blue (B) color components of the original image containing the initial object, and the fourth channel is a mask image of the initial object, obtained by masking out the initial object in the original image. Fig. 3 is a schematic diagram of an initial image with four image channels. Through a series of operations, the server may generate a new target image based on the initial image, in which the initial object is replaced with a different target object while the remaining semantic information of the initial image, such as the background semantics, is kept unchanged.
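By way of illustration, the four-channel layout described above can be assembled by a simple channel concatenation. The following Python sketch is illustrative only; the function name and tensor conventions are assumptions and are not taken from the disclosure:

import torch

def build_initial_image(rgb, object_mask):
    """Assemble the four-channel initial image: rgb is a (3, H, W) tensor, object_mask a binary (H, W) tensor."""
    assert rgb.shape[0] == 3 and rgb.shape[1:] == object_mask.shape
    # Stack the mask of the initial object as a fourth channel -> (4, H, W)
    return torch.cat([rgb, object_mask.unsqueeze(0).float()], dim=0)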
In the embodiment of the application, the server can perform feature extraction on the initial image through a feature extraction algorithm or model to obtain the image feature information of the initial image; the image feature information includes initial pose information of the initial object and initial background semantic information of the initial image.
In some possible embodiments, the acquiring the image characteristic information of the initial image may include the following steps as shown in fig. 4:
in step S401, an initial object mask image is acquired.
As described above, the initial image has four image channels: an initial object image formed by the three RGB color channels and containing the initial object, and an initial object mask image.
In step S403, an initial object region in the initial image is determined based on the initial object mask image.
In step S405, feature extraction is performed on the initial object region, and initial posture information of the initial object is obtained.
In steps S403 to S405, the server may use the initial object mask image to locate the region of the initial object in the initial object image formed by the three RGB color channels, and extract the posture features of the initial object to obtain the initial posture information of the initial object. The initial posture information of the initial object may include information such as the posture, shape, position, and orientation of the initial object.
In step S407, based on the initial object mask image, the area other than the initial object in the initial image is determined as the background area.
In step S409, background semantic extraction is performed on the background area, so as to obtain initial background semantic information of the initial image.
In steps S407 to S409, the server uses the initial object mask image to determine the image area outside the initial object area, in the initial object image formed by the three RGB color channels, as the background area, and then performs semantic extraction on the background area to obtain the initial background semantic information. The initial background semantic information of the initial image includes environment information around the initial object, information on other objects in the environment, association information between the initial object and the other objects, and the like.
In a specific application scenario, an initial object image and an initial object mask image are shown in fig. 5, where the initial object is a puppy. By extracting features of the puppy in the initial object image, the server can obtain feature information such as the posture and orientation of the puppy; by extracting background semantics from the initial object image, the server can determine that the puppy is lying on the edge of a water bucket in a lush green environment.
In the above embodiment, by performing feature extraction on the initial image, the server extracts, as far as possible, semantically rich initial posture information of the initial object and initial background semantic information of the initial image, to be used for generating the subsequent new target image.
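By way of illustration, the mask-based region split of steps S401 to S409 can be sketched as follows in Python; the pose_encoder and background_encoder used here are hypothetical placeholders for the feature extraction models, and only the mask-based splitting is shown concretely:

import torch

def extract_image_features(initial_image, pose_encoder, background_encoder):
    """initial_image: (4, H, W) tensor = RGB channels plus the initial-object mask."""
    rgb, mask = initial_image[:3], initial_image[3:4]
    object_region = rgb * mask                       # region of the initial object (S403)
    background_region = rgb * (1.0 - mask)           # region outside the initial object (S407)
    initial_pose_info = pose_encoder(object_region.unsqueeze(0))                       # S405
    initial_background_semantics = background_encoder(background_region.unsqueeze(0))  # S409
    return initial_pose_info, initial_background_semantics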
In step S203, object feature information of the target object is acquired.
In embodiments of the present application, the target object may be any type of object, including but not limited to a person, animal, or other inanimate object. The target object may be the same type as the initial object or may be different.
In some possible embodiments, the acquiring the object feature information of the target object may include the following steps as shown in fig. 6:
in step S601, description information of a target object is acquired.
The description information of the target object may include any one of picture description information, text description information and audio description information of the target object.
In this step, the server may acquire, from the terminal device or other servers, description information of the target object in any description form, such as a picture, text, and audio.
In step S603, the description information of the target object is feature-coded according to the feature coding model, so as to obtain object feature information.
In this step, the server acquires a feature encoding model trained in advance, and then inputs the description information of the target object into the feature encoding model to obtain the encoded object feature information. Feature encoders of the corresponding modalities are pre-trained according to the different forms of the description information of the target object; for example, a text feature encoder is pre-trained for description information given in text form.
In a specific embodiment, the feature encoding model uses a Contrastive Language-Image Pre-training (CLIP) model as shown in fig. 7, and the feature encoding model may be obtained by the training procedure shown in fig. 8:
in step S801, a plurality of training sample pairs are acquired.
In this step, when training the model, the server first needs to collect a large number of training sample pairs, each of which comprises first description information for a training object and second description information for the same training object; the first description information and the second description information in the same training sample pair differ in description form, but both describe the same training object.
Specifically, when training the CLIP model, a large number of text-image pairs need to be collected; the text in each text-image pair describes the content of an object A, and the image is an image of that same object A.
In step S803, a second initial training model is acquired.
In this step, referring to fig. 7, a second initial training model is constructed based on the CLIP model structure, so that the second initial training model may include a first encoder, which may be a text encoder, and a second encoder, which may be an image encoder. Correspondingly, the first descriptive information is text descriptive information, and the second descriptive information is image descriptive information.
In step S805, performing feature encoding on the first description information in each training sample pair according to the second initial training model to obtain a first feature encoding of each training sample pair; a first set of feature codes is derived based on the first feature codes for each training sample pair.
In this step, the server encodes the first description information by the first encoder. Assuming that N training sample pairs are provided, the server inputs N first description information in the N training sample pairs into the first encoder, and performs feature coding on each first description information through the first encoder to obtain N first feature codes corresponding to the N training sample pairs, namely a first feature code set.
As shown in fig. 7, the text encoder performs feature encoding on the N pieces of text description information to obtain the N text feature codes corresponding to the N training sample pairs, namely the text feature code set [T1, T2, T3, …, TN].
In step S807, feature encoding is performed on the second description information in each training sample pair according to the second initial training model, so as to obtain a second feature encoding in each training sample pair; a second set of feature codes is derived based on the second feature codes for each training sample pair.
In this step, the server encodes the second description information by a second encoder. The server inputs N second description information in N training sample pairs into a second encoder, and performs feature coding on each second description information through the second encoder to obtain N second feature codes corresponding to the N training sample pairs, namely a second feature code set.
As shown in fig. 7, the image encoder performs feature coding on the N pieces of image description information to obtain the N image feature codes corresponding to the N training sample pairs, namely the image feature code set [I1, I2, I3, …, IN].
In step S809, the second initial training model is trained based on similarity information between each first feature code in the first set of feature codes and each second feature code in the second set of feature codes until a feature code model is obtained.
In this step, after obtaining the first feature code set and the second feature code set, the server calculates similarity information between each first feature code and each second feature code, and trains the second initial training model with the goal of increasing the similarity between first and second feature codes that belong to the same training sample pair and reducing the similarity between first and second feature codes that do not belong to the same training sample pair, until a preset training ending condition is met, thereby obtaining the feature encoding model. The preset training ending condition may include ending training when the number of iterations reaches a preset number, or ending training when the loss value is less than or equal to a preset value; here, the loss value is calculated based on a cross-entropy loss function and the similarity information between each first feature code in the first feature code set and each second feature code in the second feature code set.
Specifically, as shown in fig. 7, the server calculates the inner products between the text feature code set [T1, T2, T3, …, TN] and the image feature code set [I1, I2, I3, …, IN] to obtain an N×N matrix, where each element of the matrix represents a similarity; for a given text feature Ti, only the corresponding image feature Ii is a positive sample, and the other image features are negative samples. The training goal of the server is to optimize this matrix so that the diagonal values become as large as possible and the values elsewhere in the matrix become as small as possible.
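By way of illustration, the contrastive objective described above can be written as the following Python sketch; the temperature value and function name are assumptions, and the sketch follows the standard CLIP-style symmetric cross-entropy loss rather than an implementation detail stated in this disclosure:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_codes, image_codes, temperature=0.07):
    """text_codes, image_codes: (N, D) feature codes produced by the two encoders."""
    text_codes = F.normalize(text_codes, dim=-1)
    image_codes = F.normalize(image_codes, dim=-1)
    logits = text_codes @ image_codes.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(text_codes.shape[0], device=text_codes.device)  # positives lie on the diagonal
    loss_text = F.cross_entropy(logits, targets)                           # text -> image direction
    loss_image = F.cross_entropy(logits.t(), targets)                      # image -> text direction
    return (loss_text + loss_image) / 2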
After training is completed, the text encoder and the image encoder can each be used independently in practical applications: a piece of text description information of a target object is input into the text feature encoder to output the text feature information of the target object, and a piece of image description information of the target object is input into the image feature encoder to obtain the image feature information of the target object.
In the above embodiment, the server encodes the description information of the target object with a feature encoding model obtained by pre-training. The feature encoding model can represent different types of description information of the target object, such as text, image, and voice, in the same feature space, which allows different types of data to be related to one another and improves the accuracy of feature encoding. During training of the feature encoding model, a much broader source of supervision, natural language, is used to supervise the training of the visual task; compared with fully supervised training, where labeled data is difficult to obtain, training can be completed on unlabeled data, which greatly improves the training efficiency of the model.
In step S205, the noise image is synthesized by using the object feature information and the image feature information to obtain a target image containing the target object; the noise image is obtained by adding Gaussian noise to the initial image; the background semantic information of the target image matches the initial background semantic information, and the posture of the target object in the target image matches the initial posture information.
In the embodiment of the application, the noise image is obtained by adding Gaussian noise to the initial image. Using the object feature information and the image feature information, the server predicts the noise to be removed at each step and performs iterative denoising, finally obtaining a high-quality target image containing the target object, such that the background semantic information of the target image matches the initial background semantic information and the posture of the target object in the target image matches the initial posture information.
Two example generation results are provided in fig. 5, where the server generates two target images based on two different target objects (a tiger, and a puppy of a different breed). In each target image, only the initial object of the initial image is replaced: the puppy in the initial image is replaced by the tiger or by the puppy of a different breed, the posture of the new object remains consistent with that of the puppy in the initial image, and the overall background semantics of the target image remain consistent with those of the initial image.
In the embodiment of the application, the server performs iterative denoising of the noise image based on the principle of the diffusion model, so as to synthesize the target image. As shown in fig. 9, a standard diffusion model has two main processes: forward diffusion and reverse diffusion. In the forward diffusion stage, the original clean image X0 is gradually contaminated by introduced noise until the image becomes completely random noise XT. In the reverse process, the predicted noise is progressively removed at each time step of a Markov chain, thereby recovering the data X0 from the Gaussian noise XT. In practical application, the noise image is denoised layer by layer through the reverse diffusion process, that is, the denoising operation is executed in a loop, to obtain the target image.
In the image generation task, the diffusion model is a conditional model that depends on a prior, which may be text, an image, or a semantic map. In some possible embodiments, the present application uses the object feature information and the image feature information calculated in the foregoing steps as the prior guidance information. Accordingly, synthesizing the noise image by using the object feature information and the image feature information to obtain the target image containing the target object may include the following steps, as shown in fig. 10:
In step S1001, object feature information and image feature information are used as guidance information.
In this step, the server takes the object feature information and the image feature information as the guidance information. Since the object feature information and the image feature information are obtained through different algorithms or models, their data dimensions may be inconsistent; for example, the object feature information may be 1×1024-dimensional while the image feature information is N×512×512-dimensional.
Generally, the guidance information is used to guide the subsequent image denoising and must satisfy a specific data dimension. Therefore, in order to satisfy the dimension required of the guidance information, in a specific embodiment, taking the object feature information and the image feature information as the guidance information may include: performing feature fusion on the object feature information and the image feature information to obtain fused feature information, and taking the fused feature information as the guidance information so that the data dimension of the guidance information meets the preset data dimension.
In particular, the feature fusion may employ attention mechanisms, including but not limited to cross-attention and self-attention. When a cross-attention mechanism is used, the server inputs the object feature information and the image feature information into a cross-attention module separately, which outputs the fused feature information by performing cross computation between them; when a self-attention mechanism is used, the server first concatenates the object feature information and the image feature information, inputs the concatenated feature information into a self-attention module, and obtains the fused feature information as output.
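By way of illustration, a cross-attention fusion module of the kind described above may look like the following Python sketch; the channel and dimension choices (a 320-dimensional fused space, a 1024-dimensional object feature) and the assignment of queries and keys are assumptions for the example only:

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, image_channels=320, object_dim=1024, fused_dim=320):
        super().__init__()
        self.to_q = nn.Conv2d(image_channels, fused_dim, kernel_size=1)  # project image feature maps to queries
        self.to_kv = nn.Linear(object_dim, fused_dim)                    # project the object feature to keys/values
        self.attn = nn.MultiheadAttention(fused_dim, num_heads=8, batch_first=True)

    def forward(self, image_feat, object_feat):
        """image_feat: (B, C, H, W) image feature information; object_feat: (B, 1, object_dim) object feature information."""
        b, _, h, w = image_feat.shape
        q = self.to_q(image_feat).flatten(2).transpose(1, 2)   # (B, H*W, fused_dim)
        kv = self.to_kv(object_feat)                           # (B, 1, fused_dim)
        fused, _ = self.attn(q, kv, kv)                        # image queries attend to the object feature
        return fused.transpose(1, 2).reshape(b, -1, h, w)      # fused feature information, (B, fused_dim, H, W)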
In step S1003, a plurality of times of denoising processing is performed on the noise image based on the guidance information, resulting in a denoised image.
In step S1005, the denoising image is taken as the target image until the denoising image includes the target object and the initial background of the initial image.
In this step, the server first acquires a noise image and a current time step corresponding to the noise image, where the noise image is the same size as the initial image. Specifically, when the server acquires the noise image, the noise image may be searched from the local storage, or the noise image may also be acquired from another third party database, or the like, which may be specifically set according to actual needs, where the method for acquiring the noise image is not specifically limited in the embodiments of the present disclosure.
The noise image can be understood as obtained by performing the noise-adding process up to the current time step on the original clean image in the noise-adding stage. Assuming that the total number of diffusion time steps set in the diffusion model scenario is T, and the current time step is denoted by t, the range of the current time step is 0 to T, and the current time step t is a random number within 0 to T. For example, assuming that the randomly determined current time step is t = 5, the noise-adding process in the diffusion model performs the noise-adding operation on the original clean image five consecutive times: the second noise-adding operation is performed on the noise image obtained by the first operation, the third on the noise image obtained by the second, the fourth on the noise image obtained by the third, and the fifth on the noise image obtained by the fourth. In this way, performing the noise-adding operation five times on the original clean image yields the noise image of the original clean image at the current time step t = 5.
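By way of illustration, the step-by-step noise-adding process described above can be sketched as follows in Python; the linear schedule of the noise variances is the exemplary one mentioned later in this description (0.0001 to 0.02), and the function name is an assumption:

import torch

def add_noise_iteratively(x0, t, total_steps=1000):
    """x0: clean image tensor; t: current time step; returns the noise image at step t."""
    betas = torch.linspace(1e-4, 0.02, total_steps)    # variance schedule beta_1 ... beta_T
    x = x0
    for step in range(t):                              # e.g. t = 5 performs five noise-adding passes
        noise = torch.randn_like(x)                    # Gaussian noise
        x = torch.sqrt(1.0 - betas[step]) * x + torch.sqrt(betas[step]) * noise
    return x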
Since the target image is finally generated based on the noise image and the size of the target image needs to be kept identical to the initial image, the size of the noise image should also be identical to the size of the initial image.
Then, the server performs denoising processing on the noise image based on the guide information and the current time step to obtain a denoising image corresponding to the current time step.
It can be understood that, in the process of the server cyclically denoising the noise image, the guide information used for the noise images of different rounds, that is, of different time steps, is the same.
And in each time step, the server performs denoising processing on the noise image corresponding to the current time step based on the guide information and the current time step to obtain a denoising image corresponding to the current time step.
After obtaining the denoising image corresponding to the current time step, the server updates the current time step, specifically, performs decreasing operation on the current time step, and obtains the updated current time step. Then judging whether the updated current time step is a target time step, and if the updated current time step is the target time step, taking a denoising image corresponding to the current time step as a target image; otherwise, taking the denoising image corresponding to the current time step as the noise image corresponding to the updated current time step, and carrying out denoising processing of the next round on the noise image corresponding to the updated current time step. And when the updated current time step is the target time step, displaying a clear target object and an initial background of the initial image in the denoising image, and taking the denoising image corresponding to the target time step as the target image.
For example, the server acquires the noise image X1000 corresponding to step 1000 and starts iterative denoising from it. First, when the current time step is 1000, steps S1005 to S1105 are performed to obtain the denoised image X1000' corresponding to step 1000; here, the target time step is set to 0. The server then decrements the current time step by 1 to obtain the updated current time step 999, and takes the denoised image X1000' corresponding to step 1000 as the noise image X999 corresponding to step 999. Steps S1005 to S1105 are then performed on the noise image X999 to obtain the denoised image X999' corresponding to step 999, and so on. When the updated current time step is 1, the denoised image X2' corresponding to step 2 is taken as the noise image X1 corresponding to step 1; the server performs steps S1005 to S1105 on the noise image X1 to obtain the denoised image X1' corresponding to step 1, and then decrements the current time step by 1 to obtain the updated current time step 0. At this point, the server detects that the updated current time step is the target time step, ends the iterative denoising, and takes the denoised image X1' corresponding to step 1 as the target image.
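By way of illustration, the iterative loop in this example can be sketched as follows in Python; denoise_one_step is a hypothetical placeholder for the per-step denoising described with fig. 11, and the function name is assumed:

import torch

def iterative_denoise(noise_image, guidance, denoise_one_step, start_step=1000, target_step=0):
    """Run the reverse loop from start_step down to target_step; each round's denoised image becomes the next round's noise image."""
    x = noise_image                           # e.g. the noise image X1000
    t = start_step
    while t > target_step:
        x = denoise_one_step(x, t, guidance)  # denoised image for the current time step
        t -= 1                                # decrement the current time step
    return x                                  # the final denoised image is taken as the target image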
In some possible embodiments, the above denoising process is performed on the noise image multiple times based on the guiding information, so as to obtain a denoised image, which may include the following steps as shown in fig. 11:
in step S1101, the guide information and the noise image are processed by the image denoising model, and the added noise data is predicted.
The image denoising model is trained by the server in advance; the specific training steps are described in detail below and are not repeated here. The server uses the image denoising model cyclically: at each time step, the image denoising model processes the noise image corresponding to the current time step and predicts the noise data added at that time step.
In step S1103, a denoised image distribution parameter is determined according to the noise data, the noise image, and the model super parameter of the image denoise model; the denoised image distribution parameters include mean and variance.
In the step, in each time step, the server calculates a denoising image distribution parameter corresponding to the current time step according to noise data corresponding to the current time step, a noise image corresponding to the current time step and a model super parameter of an image denoising model, wherein the denoising image distribution parameter comprises a mean value and a variance.
The model hyperparameters of the image denoising model include a series of Gaussian-distribution variance hyperparameters β_t and the corresponding α_t, where α_t = 1 - β_t, and β_t increases as t increases. For example, β_t may be a linear interpolation from 0.0001 to 0.02.
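By way of illustration, such a hyperparameter schedule may be constructed as follows (a minimal sketch; the linear interpolation is only one example given by this embodiment):

```python
import numpy as np

T = 1000                                # total number of diffusion time steps
betas = np.linspace(0.0001, 0.02, T)    # beta_t: linear interpolation from 0.0001 to 0.02
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
assert betas[0] < betas[-1]             # beta_t increases as t increases
```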
In step S1105, a denoised image is obtained by sampling from a standard gaussian distribution based on the denoised image distribution parameter.
In this step, at each time step, the server samples from the standard Gaussian distribution via the reparameterization trick, according to the mean and variance of the denoised image corresponding to the current time step, and obtains the denoised image corresponding to the current time step.
Specifically, the denoised image is obtained by sampling according to the following expression:

z = μ + σ * ε, ε ~ N(0, I)

where z denotes the denoised image; μ and σ are respectively the mean and standard deviation of the denoised image; and ε is sampled from the standard Gaussian distribution.
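By way of illustration, the sampling expressed above can be sketched with the reparameterization trick as follows (the function name and the use of NumPy are assumptions):

```python
import numpy as np

def sample_denoised_image(mu, sigma, rng=None):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(np.shape(mu))   # sample from the standard Gaussian distribution
    return mu + sigma * eps                   # denoised image for the current time step
```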
In the above embodiment, the server uses the object feature information and the image feature information as guide information and performs cyclic denoising on the noise image via the reverse diffusion process of the diffusion model. The guide information steers the server to gradually restore the noise image into a target image whose semantics match those of the guide information. This improves the accuracy of the generated target image and its degree of matching with the initial image, and yields a target image with richer texture details and a target object whose shape is more regular and realistic.
An optional training method for the image denoising model is described below. In some possible embodiments, the training method of the image denoising model may include the following steps, as shown in fig. 12:
in step S1201, an initial training image is acquired; the initial training image includes an initial training object.
In this step, the server pre-collects a number of initial training images including different initial training objects for training of the model.
In step S1203, object training feature information of the initial training object is acquired.
Here, the object training feature information of the initial training object may be obtained by the server performing feature extraction based on the initial training image.
Specifically, as shown in fig. 13, the server may identify an initial training object from the initial training image, and crop the initial training image to obtain an initial training object image; then, the server performs feature coding on the initial training object image by using the image encoder mentioned in the above embodiment to obtain image coding features of the initial training object image, and uses the image coding features as object training feature information.
In a further alternative embodiment, after obtaining the initial training object image, the server may perform enhancement processing on it in color space or affine space, including but not limited to translation, rotation, shearing, symmetric (flip) operations, and the like.
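By way of illustration, such color-space or affine-space enhancement could be sketched with torchvision as follows; the specific transforms and parameter values are assumptions, not prescribed by this embodiment:

```python
import torchvision.transforms as T

# Illustrative color-space / affine-space enhancement for the cropped initial training object image.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                  # symmetric (flip) operation
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), shear=10),     # rotation, translation, shearing
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),    # color-space enhancement
])
# augmented = augment(initial_training_object_image)  # PIL image or tensor (hypothetical input name)
```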
In step S1205, a first initial training model is acquired.
As shown in fig. 13, the server builds a first initial training model that includes an initial feature processing model, an image noise adding module, and an initial image denoising model. The image noise adding module is used to add noise to the initial training image during the training stage and is not used in the actual application stage.
In step S1207, noise data conforming to gaussian distribution is added to the initial training image according to the image noise adding module, so as to obtain a corresponding training noise image.
In this step, according to the image noise adding module, the server sequentially adds noise data conforming to a Gaussian distribution to the initial training image over the preset time steps, obtaining a training noise image corresponding to each time step.
Specifically, the server first sets the preset time step, which characterizes the total number of forward-diffusion steps, and then cyclically executes the noise-adding process over those steps, sequentially adding noise data that conforms to a Gaussian distribution.
The preset time step may be, for example, 1000 steps. In the first time step, the server adds noise data conforming to a Gaussian distribution to the initial training image to obtain the training noise image corresponding to step 1; then, in the second time step, it continues to add Gaussian-distributed noise data to the training noise image corresponding to step 1, and so on, until the final noise image is obtained at step 1000. For example, in fig. 13, at step t-1, noise data Z_t conforming to a Gaussian distribution is added to the image x_{t-1} to obtain the noise image x_t.
In step S1209, image training feature information of the initial training image is acquired according to the initial feature processing model.
In the step, the server performs feature extraction on an initial training image according to an initial feature processing model to obtain image training feature information of the initial training image; the image training feature information comprises initial posture information of an initial training object and initial background semantic information of an initial training image.
Similar to the initial image, the initial training image may include four image channels; see the embodiments above for details. The initial posture information of the initial training object may include information such as its posture, form, position and orientation. The initial background semantic information of the initial training image may include the environment in which the initial training object is located, information about other objects in that environment, association information between the initial training object and those other objects, and the like.
In step S1211, based on the training noise image, the object training feature information, and the image training feature information, prediction noise data is obtained using the initial image denoising model.
In the step, the server predicts the added noise of each time step in turn by using an initial image denoising model based on the training noise image, the object training characteristic information and the image training characteristic information corresponding to each time step, and obtains the predicted noise data of each time step.
Specifically, the server circularly uses an initial image denoising model, namely, in each time step, the server processes a training noise image corresponding to the current time step through the initial image denoising model based on object training feature information and image training feature information, and predicts noise data added by the current time step.
As shown in FIG. 13, at step t-1, based on the object training feature information and the image training feature information, the server uses the initial image denoising model to process the training noise image x_t corresponding to the current time step and predict the noise data Z_t' added at the current time step.
In step S1213, the initial feature processing model and the initial image denoising model are trained based on the predicted noise data and the actually added noise data until a trained feature processing model and image denoising model are obtained.
In the step, the server trains an initial feature processing model and an initial image denoising model based on the predicted noise data of each time step and the noise data actually added by each time step until a preset training ending condition is met, and a trained feature processing model and an image denoising model are obtained.
In this step, as known from step S1207, the noise data actually added in each time step is known, so that the server may determine the loss based on the predicted noise data in each time step and the noise data actually added in each time step, train the initial feature processing model and the initial image denoising model until the preset training end condition is satisfied, and obtain the trained feature processing model and the trained image denoising model. The preset training ending condition may include ending training when the number of iterations reaches a preset number, or ending training when the loss value is less than or equal to a preset value. Here, the trained feature processing model may be used to perform feature extraction on the initial image, to obtain image feature information of the initial image.
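By way of illustration, one training iteration could be sketched as follows; the embodiment only states that a loss is determined from the predicted and actually added noise data, so the mean-squared-error objective and the model call signatures below are assumptions:

```python
import torch.nn.functional as F

def training_step(feature_model, denoise_model, optimizer,
                  x0, x_t, t, true_noise, object_feat):
    """One illustrative training iteration: extract image training features from the
    clean training image, predict the noise added at step t, and minimize the
    difference to the noise that was actually added."""
    image_feat = feature_model(x0)                        # image training feature information
    pred_noise = denoise_model(x_t, t, object_feat, image_feat)
    loss = F.mse_loss(pred_noise, true_noise)             # predicted vs. actually added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```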
In the above embodiment, when training the image denoising model, features of the initial training object in the initial training image are extracted as object training feature information, and image training feature information of the initial training image is extracted; both are used as guide information to predict the noise data, and the initial feature processing model and the initial image denoising model are then trained against the actually added noise data, continuously updating the model parameters. This improves the training efficiency of the image denoising model and effectively enhances its denoising performance.
Based on the embodiments of the image generation method, the application also provides an image synthesis page for a user to operate on a terminal device, together with an image generation method adapted to that page. Fig. 14 is a schematic diagram of an image synthesis page according to an exemplary embodiment, and fig. 15 is a flowchart illustrating an image generation method according to an exemplary embodiment. As shown in fig. 15, the image generation method may be applied to a terminal device and includes the following steps:
in step S1501, an image composition page is displayed; the image synthesis page comprises a first display area, a second display area and a third display area; an initial image is displayed in the first display area; an image of the target object is displayed in the second display area.
As shown in fig. 14, the terminal device may display an image composition page including a first display area, a second display area, and a third display area. The positions of the respective areas shown in the drawings are only one example, and the embodiments of the present disclosure do not limit the positions of the first display area, the second display area, and the third display area in the image composition page.
In some possible embodiments, the image generation method may further include the steps of: responding to a trigger instruction acted on the first display area, and displaying the selected initial image in the first display area; in response to a target object selection instruction acting on the second presentation area, an image of the selected target object is displayed on the second presentation area.
Specifically, the user can click the first display area; based on this click operation, the terminal device generates a corresponding trigger instruction and provides an initial-image selection window, after which the terminal device displays the initial image selected by the user in the first display area. The user can then click the second display area; based on this click operation, the terminal device generates a target object selection instruction and provides a corresponding target-object selection window, after which the terminal device displays the image of the target object selected by the user in the second display area. Here, depending on the form of the target object's data, the second display area may be designed to present text, an image, or audio describing the target object.
In step S1503, in response to the composition instruction, a target image obtained by composition based on the initial image and the target object after the initial object in the initial image is removed is displayed in the third display area; the background semantic information of the target image is matched with the initial background semantic information of the initial image, and the gesture of the target object in the target image is matched with the initial gesture information of the initial object.
In this step, in response to the synthesis instruction, the terminal device synthesizes the initial image with the initial object removed and the target object; the synthesis can be carried out at the server by sending a synthesis processing request to it. The terminal device then directly displays the synthesized target image in the third display area. The background semantic information of the finally displayed target image matches the initial background semantic information of the initial image, and the gesture of the target object in the target image matches the initial gesture information of the initial object.
In some possible embodiments, the image synthesis page includes a first operation control, and the image generating method further includes the steps of: and responding to an operation instruction aiming at the first operation control, starting an erasing function, and shielding an initial object in an initial image displayed in the first display area based on the erasing function.
Specifically, the first operation control is an erasure control. As shown in fig. 14, an erasure control is provided in the image synthesis page, below the first display area, for ease of operation. Correspondingly, when the terminal device detects a click on the erasure control, it starts the erasing function; the user can then erase content in the initial image in the first display area. The terminal device detects the erasing operation, determines the currently erased object, takes it as the initial object, and shields the initial object.
In some possible embodiments, the image synthesis page includes a second operation control, and the image generating method further includes the steps of: and generating a composite instruction in response to the operation instruction for the second operation control, and displaying the target image in the third display area based on the composite instruction.
Specifically, the second operation control is a synthesis control. As shown in fig. 14, a synthesis control is provided in the image synthesis page; it may be disposed between the first display area and the second display area for ease of operation. Correspondingly, if the terminal device detects the user's click on the synthesis control, a synthesis instruction is generated.
In the above embodiment, the present disclosure provides an image synthesis page for the user's terminal device and a corresponding image generation method. The approach is highly intelligent and can generate imaginative images, so the highly automated image generation technology provides end users with an easy authoring environment and greatly improves the user experience.
In summary, the embodiments of the present application provide a novel image generation method with broad application prospects in the field of image generation. The method obtains image feature information of an initial image and object feature information of a target object, and uses both to perform synthesis processing on a noise image, obtaining a target image containing the target object; the background semantic information of the final target image matches the background semantic information of the initial image, and the gesture of the target object in the target image matches the gesture of the initial object in the initial image. In other words, the application can erase part of the content of an input picture and generate a new subject in the erased region, and the new subject blends well with the original input image.
Fig. 16 is a block diagram of an image generating apparatus according to an exemplary embodiment. Referring to fig. 16, the apparatus includes a first acquisition module 1601, a second acquisition module 1602, and a synthesis module 1603;
a first acquisition module 1601 configured to perform acquisition of image feature information of an initial image; the image characteristic information comprises initial posture information of an initial object in an initial image and initial background semantic information of the initial image;
a second acquisition module 1602 configured to perform acquisition of object feature information of the target object;
a synthesizing module 1603 configured to perform synthesizing processing on the noise image using the object feature information and the image feature information to obtain a target image containing a target object; the noise image is obtained by adding Gaussian noise to the initial image; the background semantic information of the target image is matched with the initial background semantic information, and the gesture of the target object in the target image is matched with the initial gesture information.
In some possible embodiments, the synthesizing module 1603 is further configured to use the object feature information and the image feature information as guide information; perform denoising processing on the noise image multiple times based on the guide information to obtain a denoised image; and, when the denoised image contains the target object and the initial background of the initial image, take the denoised image as the target image.
In some possible embodiments, the synthesizing module 1603 is further configured to perform processing of the guide information and the noise image by the image denoising model, predicting the added noise data; determining a denoising image distribution parameter according to the noise data, the noise image and the model super-parameters of the image denoising model; the denoising image distribution parameters comprise a mean value and a variance; and sampling from the standard Gaussian distribution based on the denoising image distribution parameter to obtain a denoising image.
In some possible embodiments, the data dimension of the object feature information is different from the data dimension of the image feature information;
the synthesizing module 1603 is further configured to perform feature fusion on the object feature information and the image feature information to obtain fused feature information; the data dimension of the fused characteristic information meets the preset data dimension; and taking the fused characteristic information as guide information.
In some possible embodiments, the apparatus module further comprises a training module of the image denoising model;
the training module of the image denoising model is configured to acquire an initial training image; the initial training image comprises an initial training object; acquiring object training characteristic information of an initial training object; acquiring a first initial training model; the first initial training model comprises an initial feature processing model, an image noise adding module and an initial image noise removing model; according to the image noise adding module, adding noise data conforming to Gaussian distribution to the initial training image to obtain a corresponding training noise image; acquiring image training feature information of an initial training image according to the initial feature processing model; the image training characteristic information comprises initial posture information of an initial training object and initial background semantic information of an initial training image; based on the training noise image, the object training feature information and the image training feature information, obtaining prediction noise data by using an initial image denoising model; training an initial feature processing model and an initial image denoising model based on the predicted noise data and the actually added noise data until a trained feature processing model and an image denoising model are obtained; the feature processing model is used for extracting features of the initial image to obtain image feature information of the initial image.
In some possible embodiments, the first acquiring module 1601 is configured to perform acquiring an initial object mask image; determining an initial object region in the initial image based on the initial object mask image; extracting features of the initial object region to obtain initial posture information of the initial object; determining a region except for the initial object in the initial image as a background region based on the initial object mask image; and carrying out background semantic extraction on the background area to obtain initial background semantic information of the initial image.
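By way of illustration, the mask-based separation of the initial object region and the background region could be sketched as follows, assuming a binary H×W mask and an H×W×C image; the subsequent feature extraction and background-semantic extraction are not shown:

```python
import numpy as np

def split_by_mask(initial_image: np.ndarray, object_mask: np.ndarray):
    """Use the initial-object mask to separate the initial object region from the background region."""
    mask = object_mask.astype(bool)[..., None]            # H x W -> H x W x 1, broadcast over channels
    object_region = np.where(mask, initial_image, 0)      # pixels belonging to the initial object
    background_region = np.where(mask, 0, initial_image)  # everything except the initial object
    return object_region, background_region
```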
In some possible embodiments, the second obtaining module 1602 is configured to perform obtaining the description information of the target object; the descriptive information comprises any one of picture descriptive information, text descriptive information and audio descriptive information; and carrying out feature coding on the description information of the target object according to the feature coding model to obtain object feature information.
In some possible embodiments, the apparatus further comprises a training module of the feature encoding model;
a training module of the feature encoding model configured to perform acquiring a plurality of training sample pairs; each training sample pair comprises first descriptive information of a training object and second descriptive information of the training object; acquiring a second initial training model; performing feature coding on the first description information in each training sample pair according to the second initial training model to obtain a first feature code of each training sample pair; obtaining a first feature code set based on the first feature codes of each training sample pair; performing feature coding on the second description information in each training sample pair according to a second initial training model to obtain a second feature code in each training sample pair; obtaining a second feature code set based on the second feature codes of each training sample pair; and training the second initial training model based on the similarity information between each first feature code in the first feature code set and each second feature code in the second feature code set until a feature code model is obtained.
Fig. 17 is a block diagram of another image generation apparatus according to an exemplary embodiment. Referring to fig. 17, the apparatus includes a first display module 1701 and a second display module 1702;
a first display module 1701 configured to perform displaying an image composition page; the image synthesis page comprises a first display area, a second display area and a third display area; an initial image is displayed in the first display area; displaying an image of the target object in the second display area;
a second display module 1702 configured to execute a process of displaying, in response to the composition instruction, a target image, which is synthesized based on the initial image from which the initial object in the initial image is removed and the target object, in a third display area; the background semantic information of the target image is matched with the initial background semantic information of the initial image, and the gesture of the target object in the target image is matched with the initial gesture information of the initial object.
In some possible embodiments, the apparatus further comprises a third display module;
a third display module configured to execute a trigger instruction that is responsive to the trigger instruction acting on the first display area, to display the selected initial image on the first display area; in response to a target object selection instruction acting on the second presentation area, an image of the selected target object is displayed on the second presentation area.
In some possible embodiments, the image synthesis page includes a first operation control, and the apparatus further includes a fourth display module;
and a fourth display module configured to execute an operation instruction responding to the first operation control, start an erasing function, and shade an initial object in an initial image displayed in the first display area based on the erasing function.
In some possible embodiments, the image synthesis page includes a second operation control, and the apparatus further includes a fifth display module;
and a fifth display module configured to execute an operation instruction in response to the second operation control, generate a composition instruction, and display the target image in the third display area based on the composition instruction.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments and will not be repeated here.
Fig. 18 is a block diagram illustrating an electronic device for image generation, which may be a terminal, according to an exemplary embodiment, and an internal structure diagram thereof may be as shown in fig. 18. The electronic device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image generation method. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 18 is merely a block diagram of a portion of the structure associated with the disclosed aspects and is not limiting of the electronic device to which the disclosed aspects apply, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an exemplary embodiment, there is also provided an electronic device including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the image generation method as in the embodiments of the present disclosure.
In an exemplary embodiment, a computer-readable storage medium is also provided, which when executed by a processor of an electronic device, enables the electronic device to perform the image generation method in the embodiments of the present disclosure.
In an exemplary embodiment, there is also provided a computer program product containing instructions, the computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of the computer device reads and executes the computer program, causing the computer device to perform the image generation method of the embodiments of the present disclosure.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. An image generation method, comprising:
acquiring image characteristic information of an initial image; the image characteristic information comprises initial posture information of an initial object in the initial image and initial background semantic information of the initial image;
acquiring object feature information of a target object;
synthesizing the noise image by utilizing the object characteristic information and the image characteristic information to obtain a target image containing the target object; the noise image is obtained by adding Gaussian noise to the initial image; the background semantic information of the target image is matched with the initial background semantic information, and the gesture of the target object in the target image is matched with the initial gesture information.
2. The image generation method according to claim 1, wherein the synthesizing the noise image using the object feature information and the image feature information to obtain a target image including the target object, comprises:
taking the object characteristic information and the image characteristic information as guiding information;
denoising the noise image for a plurality of times based on the guide information to obtain a denoised image;
and taking the denoising image as the target image until the denoising image contains the target object and the initial background of the initial image.
3. The image generating method according to claim 2, wherein the performing denoising processing on the noise image multiple times based on the guidance information to obtain a denoised image comprises:
processing the guide information and the noise image through an image denoising model, and predicting added noise data;
determining a denoising image distribution parameter according to the noise data, the noise image and the model super-parameters of the image denoising model; the denoising image distribution parameters comprise a mean value and a variance;
and sampling from standard Gaussian distribution based on the denoising image distribution parameter to obtain the denoising image.
4. The image generation method according to claim 2, wherein a data dimension of the object feature information is different from a data dimension of the image feature information;
the method for using the object feature information and the image feature information as guiding information comprises the following steps:
feature fusion is carried out on the object feature information and the image feature information, and fused feature information is obtained; the data dimension of the fused characteristic information meets the preset data dimension;
and taking the fused characteristic information as the guide information.
5. The image generation method according to claim 3, wherein the training mode of the image denoising model comprises:
acquiring an initial training image; the initial training image comprises an initial training object;
acquiring object training characteristic information of the initial training object;
acquiring a first initial training model; the first initial training model comprises an initial feature processing model, an image noise adding module and an initial image denoising model;
according to the image noise adding module, adding noise data conforming to Gaussian distribution to the initial training image to obtain a corresponding training noise image;
Acquiring image training feature information of the initial training image according to the initial feature processing model; the image training characteristic information comprises initial posture information of the initial training object and initial background semantic information of the initial training image;
based on the training noise image, the object training feature information and the image training feature information, obtaining predicted noise data by using the initial image denoising model;
training the initial feature processing model and the initial image denoising model based on the predicted noise data and the actually added noise data until a trained feature processing model and the image denoising model are obtained;
the feature processing model is used for extracting features of the initial image to obtain image feature information of the initial image.
6. The image generation method according to claim 1, wherein the acquiring image feature information of the initial image includes:
acquiring an initial object mask image;
determining an initial object region in the initial image based on the initial object mask image;
extracting features of the initial object region to obtain initial posture information of the initial object;
Determining a region except the initial object in the initial image as a background region based on the initial object mask image;
and extracting the background semantics of the background area to obtain the initial background semantic information of the initial image.
7. The image generating method according to any one of claims 1 to 6, wherein the acquiring object feature information of the target object includes:
acquiring description information of the target object; the description information comprises any one of picture description information, text description information and audio description information;
and carrying out feature coding on the description information of the target object according to a feature coding model to obtain the object feature information.
8. The image generation method according to claim 7, wherein the training mode of the feature coding model includes:
acquiring a plurality of training sample pairs; each training sample pair comprises first descriptive information of a training object and second descriptive information of the training object;
acquiring a second initial training model;
performing feature coding on the first description information of each training sample pair according to the second initial training model to obtain a first feature code of each training sample pair; obtaining a first feature code set based on the first feature codes of each training sample pair;
Performing feature coding on the second description information in each training sample pair according to the second initial training model to obtain a second feature code in each training sample pair; obtaining a second feature code set based on the second feature codes of each training sample pair;
and training the second initial training model based on similarity information between each first feature code in the first feature code set and each second feature code in the second feature code set until the feature code model is obtained.
9. An image generation method, comprising:
displaying an image synthesis page; the image synthesis page comprises a first display area, a second display area and a third display area; an initial image is displayed in the first display area; an image of the target object is displayed in the second display area;
responding to a synthesis instruction, and displaying a target image synthesized based on the initial image after the initial object in the initial image is removed and the target object in the third display area; the background semantic information of the target image is matched with the initial background semantic information of the initial image, and the gesture of the target object in the target image is matched with the initial gesture information of the initial object.
10. The image generation method according to claim 9, characterized in that the method further comprises:
responding to a trigger instruction acted on the first display area, and displaying the selected initial image in the first display area;
and responding to a target object selection instruction acting on the second display area, and displaying the selected image of the target object in the second display area.
11. The image generation method of claim 9, wherein the image composition page includes a first operation control therein, the method further comprising:
and responding to an operation instruction aiming at the first operation control, starting an erasing function, and shielding an initial object in the initial image displayed in the first display area based on the erasing function.
12. The image generation method of claim 11, wherein the image composition page includes a second operation control therein, the method further comprising:
and generating the synthetic instruction in response to the operation instruction for the second operation control, and displaying the target image in the third display area based on the synthetic instruction.
13. An image generating apparatus, comprising:
a first acquisition module configured to perform acquisition of image feature information of an initial image; the image characteristic information comprises initial posture information of an initial object in the initial image and initial background semantic information of the initial image;
a second acquisition module configured to perform acquisition of object feature information of the target object;
a synthesizing module configured to perform synthesizing processing on a noise image using the object feature information and the image feature information to obtain a target image containing the target object; the noise image is obtained by adding Gaussian noise to the initial image; the background semantic information of the target image is matched with the initial background semantic information, and the gesture of the target object in the target image is matched with the initial gesture information.
14. An image generating apparatus, comprising:
a first display module configured to perform displaying an image composition page; the image synthesis page comprises a first display area, a second display area and a third display area; an initial image is displayed in the first display area; an image of the target object is displayed in the second display area;
A second display module configured to execute a synthesis instruction, and display, in the third display area, a target image synthesized based on the initial image from which the initial object in the initial image is removed and the target object; the background semantic information of the target image is matched with the initial background semantic information of the initial image, and the gesture of the target object in the target image is matched with the initial gesture information of the initial object.
15. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image generation method of any of claims 1-12.
16. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image generation method of any one of claims 1-12.
CN202310777137.7A 2023-06-28 2023-06-28 Image generation method and device, electronic equipment and storage medium Pending CN116993864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310777137.7A CN116993864A (en) 2023-06-28 2023-06-28 Image generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310777137.7A CN116993864A (en) 2023-06-28 2023-06-28 Image generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116993864A true CN116993864A (en) 2023-11-03

Family

ID=88531041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310777137.7A Pending CN116993864A (en) 2023-06-28 2023-06-28 Image generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116993864A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117808933A (en) * 2024-02-29 2024-04-02 成都索贝数码科技股份有限公司 Image element decomposition and reconstruction method and device
CN117808933B (en) * 2024-02-29 2024-05-24 成都索贝数码科技股份有限公司 Image element decomposition and reconstruction method and device

Similar Documents

Publication Publication Date Title
CN108509915B (en) Method and device for generating face recognition model
CN111079532B (en) Video content description method based on text self-encoder
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
US20230359865A1 (en) Modeling Dependencies with Global Self-Attention Neural Networks
WO2022142450A1 (en) Methods and apparatuses for image segmentation model training and for image segmentation
Muhammad et al. Visual saliency models for summarization of diagnostic hysteroscopy videos in healthcare systems
Cao et al. Image captioning with bidirectional semantic attention-based guiding of long short-term memory
CN109858333B (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN109740158B (en) Text semantic parsing method and device
Chang et al. On the design fundamentals of diffusion models: A survey
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN112818955A (en) Image segmentation method and device, computer equipment and storage medium
CN113780326A (en) Image processing method and device, storage medium and electronic equipment
CN115880317A (en) Medical image segmentation method based on multi-branch feature fusion refining
CN114972929A (en) Pre-training method and device for medical multi-modal model
CN114399424A (en) Model training method and related equipment
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN115292439A (en) Data processing method and related equipment
CN112069810A (en) Text filling method and device
CN116993864A (en) Image generation method and device, electronic equipment and storage medium
CN116384401A (en) Named entity recognition method based on prompt learning
CN113592881A (en) Image reference segmentation method and device, computer equipment and storage medium
US20210224947A1 (en) Computer Vision Systems and Methods for Diverse Image-to-Image Translation Via Disentangled Representations
CN112256953B (en) Query rewrite method, query rewrite apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination