CN113096055B - Training method and device for image generation model, electronic equipment and storage medium


Info

Publication number: CN113096055B
Application number: CN202110316298.7A
Authority: CN (China)
Prior art keywords: face; image; target; training; face images
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113096055A
Inventor: Deng Hongbo (邓红波)
Current assignee: Beijing Dajia Internet Information Technology Co., Ltd.
Original assignee: Beijing Dajia Internet Information Technology Co., Ltd.
Events: application filed by Beijing Dajia Internet Information Technology Co., Ltd.; publication of CN113096055A; application granted; publication of CN113096055B

Classifications

    • G06T 5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06F 18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06N 3/08 — Neural network learning methods
    • G06T 9/002 — Image coding using neural networks
    • G06T 2207/20081 — Training; Learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06T 2207/20221 — Image fusion; Image merging
    • G06T 2207/30201 — Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The disclosure relates to a training method and apparatus for an image generation model, an electronic device, and a storage medium, which address the problem that face fusion images produced by current methods have low similarity to, and low realism relative to, the original images participating in the fusion. The method comprises the following steps: acquiring a plurality of sample face images; training a generative adversarial network according to the plurality of sample face images to obtain a target face generation network, wherein the generative adversarial network comprises a generator and a discriminator, and the target face generation network is constructed according to the trained generator; training an initial encoder according to the plurality of sample face images and the target face generation network to obtain a target encoder, wherein the initial encoder is constructed according to the trained discriminator; and sequentially connecting the target encoder and the target face generation network to obtain a target image generation model, the target image generation model being used for generating a fused face image of at least two face images according to the at least two face images.

Description

Training method and device for image generation model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing, and in particular to a training method and apparatus for an image generation model, an electronic device, and a storage medium.
Background
Face fusion technology can be used for virtual try-on of goods in online marketing, and can also enrich product functions in games and entertainment products to improve user experience.
At present, face fusion methods for fusing two face images generally extract feature information from a first face image, apply that feature information to a second face image, and then perform color correction on the result to obtain a face fusion image of the first and second face images.
Disclosure of Invention
The disclosure provides a training method, apparatus, electronic device, and storage medium for an image generation model, to address the problem that images obtained by current face fusion methods have low similarity to, and low realism relative to, the original images participating in the fusion.
The technical solution of the present disclosure is as follows:
In a first aspect, an embodiment of the present disclosure provides a training method for an image generation model, including: acquiring a plurality of sample face images; training a generative adversarial network according to the plurality of sample face images to obtain a target face generation network, wherein the generative adversarial network comprises a generator and a discriminator, and the target face generation network is constructed according to the trained generator; training an initial encoder according to the plurality of sample face images and the target face generation network to obtain a target encoder, wherein the initial encoder is constructed according to the trained discriminator; and sequentially connecting the target encoder and the target face generation network to obtain a target image generation model, the target image generation model being used for generating a fused face image of at least two face images according to the at least two face images.
In the embodiment of the disclosure, the target face generation network and the target encoder are trained separately. The initial encoder used to train the target encoder is constructed from the discriminator used in training the target face generation network, and the trained discriminator already has good encoding capability. Because the target encoder is obtained by training the initial encoder with the target face generation network as supervision, the target encoder can parse the latent code of a face image more accurately; the target face generation network then generates images from the latent codes parsed by the target encoder, so the generated fused face image has higher similarity to the original face images and higher realism.
In one possible implementation, training the initial encoder according to the plurality of sample face images and the target face generation network to obtain the target encoder includes: removing the last two fully connected layers of the trained discriminator to obtain the initial encoder; inputting the plurality of sample face images into the initial encoder to obtain latent codes respectively corresponding to the plurality of sample face images; inputting the latent codes of the plurality of sample face images into the target face generation network to obtain predicted face images respectively corresponding to those latent codes; acquiring a plurality of first losses according to each sample face image and its corresponding predicted face image; inputting each predicted face image into the initial encoder to obtain the latent code of each predicted face image; acquiring a plurality of second losses according to the latent code and the predicted latent code of each sample face image, wherein the predicted latent code is the latent code of the predicted face image corresponding to that sample face image's latent code; and training the initial encoder according to the first losses and the second losses to obtain the target encoder.
In the embodiment of the disclosure, the first loss makes the image generated from the latent code produced by the target encoder resemble the image input to the target encoder at the pixel level, and the second loss makes it resemble that image at the semantic level, further improving the similarity between the fused face image generated by an image generation model comprising the target encoder and the original images.
In another possible implementation, training the generative adversarial network according to the plurality of sample face images to obtain the target face generation network includes: training the generative adversarial network in batches according to the plurality of sample face images; acquiring, according to a pre-trained detection network, an evaluation index value of the generative adversarial network obtained by each batch of training, wherein the evaluation index value is the average difference between input face images and output face images, the input face images are face images input to the trained generative adversarial network, and the output face images are face images obtained by that network fusing the input face images; and stopping training when the evaluation index value is less than or equal to a threshold, taking the trained generative adversarial network as the target face generation network.
In the embodiment of the disclosure, the pre-trained detection network acquires the evaluation index value of the generative adversarial network after each batch of training; this value reflects, to a certain extent, the difference between face images generated by the network and real images. Training stops when the evaluation index value is less than or equal to the threshold, and the resulting generative adversarial network serves as the target face generation network, further enhancing the realism of the face images generated by the target face generation network.
In a second aspect, an embodiment of the present disclosure provides an image generation method, including: acquiring at least two face images; inputting the at least two face images into a target encoder of a target image generation model to respectively obtain latent codes of the at least two face images, wherein the target image generation model is trained according to the training method of the image generation model provided by any possible implementation of the first aspect; fusing the latent codes of the at least two face images to obtain a fused latent code; and inputting the fused latent code into a target face generation network of the target image generation model to generate a face fusion image.
In the embodiment of the present disclosure, the target image generation model is obtained by the training method provided by any possible implementation of the first aspect, so the beneficial effects are the same as those described for the first aspect and are not repeated here.
In one possible implementation, fusing the latent codes of the at least two face images to obtain the fused latent code includes: acquiring weights for the latent codes of the at least two face images; and fusing the latent codes of the at least two face images according to the weight of each face image's latent code to obtain the fused latent code.
In the embodiment of the disclosure, the weights assigned to different face images can be used to adjust the similarity between the generated face fusion image and each face image input to the model, improving user experience.
In a third aspect, an embodiment of the present disclosure provides a training apparatus for an image generation model, including: an acquisition module configured to acquire a plurality of sample face images; a first training module configured to train a generative adversarial network according to the plurality of sample face images to obtain a target face generation network, wherein the generative adversarial network comprises a generator and a discriminator, and the target face generation network is constructed according to the trained generator; a second training module configured to train an initial encoder according to the plurality of sample face images and the target face generation network to obtain a target encoder, wherein the initial encoder is constructed according to the trained discriminator; and a generating module configured to sequentially connect the target encoder and the target face generation network to obtain a target image generation model, the target image generation model being used for generating a fused face image of at least two face images according to the at least two face images.
Optionally, the second training module is specifically configured to: remove the last two fully connected layers of the trained discriminator to obtain an initial encoder; input the plurality of sample face images into the initial encoder to obtain latent codes respectively corresponding to the plurality of sample face images; input the latent codes of the plurality of sample face images into the target face generation network to obtain predicted face images respectively corresponding to those latent codes; acquire a plurality of first losses according to each sample face image and its corresponding predicted face image; input each predicted face image into the initial encoder to obtain the latent code of each predicted face image; acquire a plurality of second losses according to the latent code and the predicted latent code of each sample face image, the predicted latent code being the latent code of the predicted face image corresponding to that sample face image's latent code; and train the initial encoder according to the first losses and the second losses to obtain the target encoder.
Optionally, the first training module is specifically configured to: train the generative adversarial network in batches according to the plurality of sample face images; acquire, according to a pre-trained detection network, an evaluation index value of the generative adversarial network obtained by each batch of training, the evaluation index value being the average difference between input face images and output face images, where the input face images are face images input to the trained generative adversarial network and the output face images are face images obtained by that network fusing the input face images; and stop training when the evaluation index value is less than or equal to a threshold, taking the trained generative adversarial network as the target face generation network.
In a fourth aspect, an image generating apparatus is provided, comprising: an acquisition module configured to acquire at least two face images and input the at least two face images into a target encoder of a target image generation model to respectively obtain latent codes of the at least two face images, wherein the target image generation model is trained according to the training method of the image generation model provided by any possible implementation of the first aspect; a fusion module configured to fuse the latent codes of the at least two face images to obtain a fused latent code; and a generating module configured to input the fused latent code into a target face generation network of the target image generation model to generate a face fusion image.
Optionally, the acquisition module is further configured to acquire weights for the latent codes of the at least two face images, and the fusion module is specifically configured to fuse the latent codes of the at least two face images according to the weight of each face image's latent code to obtain the fused latent code.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including: a processor; a memory for storing processor-executable instructions. Wherein the processor is configured to execute the instructions to implement the training method of the image generation model shown in the first aspect and any of the possible implementations of the first aspect, or to implement the image generation method shown in the second aspect and any of the possible implementations of the second aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable storage medium storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the training method of the image generation model shown in the first aspect and any possible implementation of the first aspect, or to perform the image generation method shown in the second aspect and any possible implementation of the second aspect.
In a seventh aspect, embodiments of the present disclosure provide a computer program product directly loadable into the internal memory of a computer and containing software code; after being loaded and executed by the computer, the computer program can implement the training method of the image generation model shown in the first aspect and any possible implementation of the first aspect, or perform the image generation method shown in the second aspect and any possible implementation of the second aspect.
Any of the training apparatuses for an image generation model, image generating apparatuses, electronic devices, computer-readable storage media, or computer program products provided above is used to execute the corresponding method provided above; for its beneficial effects, reference may be made to those of the corresponding scheme in the corresponding method, which are not repeated here.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow diagram illustrating a training method for an image generation model, according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating an image generation method according to an exemplary embodiment;
FIG. 3 is a schematic diagram of an application scenario illustrated in accordance with an exemplary embodiment;
FIG. 4 is a block diagram of a training apparatus for an image generation model, according to an exemplary embodiment;
FIG. 5 is a block diagram of an image generation apparatus according to an exemplary embodiment;
fig. 6 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In the embodiments of the present disclosure, the words "exemplary" and "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this disclosure should not be construed as preferred or advantageous over other embodiments or designs; rather, such words are intended to present related concepts in a concrete fashion.
It should be noted that the terms "first", "second", and the like in the description and claims of the present disclosure and in the foregoing figures are used to distinguish between similar objects and not necessarily to describe a particular sequence or chronological order. It is to be understood that data so termed may be interchanged where appropriate, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
In the presently disclosed embodiments, "at least one" refers to one or more. "plurality" means two or more.
In the embodiments of the present disclosure, "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B both exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Face image fusion technology can be used for user privacy protection and film and television production, as well as for product functions such as intelligent avatars, virtual try-on in online marketing, games, entertainment, and live streaming, improving user experience and enabling intelligent synthesis of new publicity materials.
In the training method for an image generation model provided by the embodiments of the present disclosure, a target image generation model is trained. A target encoder in the target image generation model parses the face images to be fused to obtain their latent codes; those latent codes are fused to obtain a fused latent code; and the fused latent code is then input into a target face generation network to obtain a fused face image of the face images to be fused, such that the similarity between the fused face image and each of the face images to be fused is greater than a first threshold.
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on the embodiments provided by the present disclosure are within the scope of the present disclosure.
Fig. 1 is a flowchart of a training method for an image generation model according to an exemplary embodiment. The method may be applied to an electronic device and may include the following steps:
s100: a plurality of sample face images are acquired.
In one possible implementation, the electronic device may download a public dataset from a public database, for example, the face images in the Flickr-Faces-HQ (FFHQ) high-quality face image dataset.
In another possible implementation, the electronic device reads a plurality of sample face images stored locally.
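For illustration only (not part of the claimed method), the following is a minimal sketch of loading locally stored sample face images with PyTorch once the images are on disk; the directory path data/ffhq, the PNG extension, and the 256×256 resolution are assumptions:

```python
# Hypothetical sketch: loading sample face images from a local FFHQ copy.
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),                       # scales pixels to [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # shift to [-1, 1], common for GANs
])

def load_sample_faces(root="data/ffhq", limit=8):
    """Read up to `limit` face images and stack them into one batch tensor."""
    paths = sorted(Path(root).glob("*.png"))[:limit]
    return torch.stack([to_tensor(Image.open(p).convert("RGB")) for p in paths])
```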
S101: training the generative adversarial network (GAN) according to the plurality of sample face images to obtain a target face generation network. The generative adversarial network includes a generator and a discriminator. The target face generation network is constructed according to the trained generator.
According to the embodiment of the disclosure, the plurality of sample face images acquired by the electronic device may be used as training sample data for the generative adversarial network.
The generative adversarial network may include a generator and a discriminator. The generator may be a deep learning neural network that learns the distribution of real images so that it can generate images resembling real ones; for example, the generator may be a residual network (ResNet). The discriminator may be a neural network based on a classification algorithm, such as a Bayesian classifier or a k-nearest-neighbor classifier; it classifies whether an image input to it comes from the real training sample data or is a generated image output by the generator. The output of the discriminator may be the probability that the image comes from the training sample data. More generally, the generator may be any function that can be fitted to generate images; the present disclosure does not limit the type of the generator.
Optionally, the electronic device trains the generative adversarial network in batches according to the plurality of sample face images and acquires, according to a pre-trained detection network, an evaluation index value of the generative adversarial network obtained by each batch of training. The evaluation index value is the average difference between input face images and output face images: the input face images are face images input to the trained generative adversarial network, and the output face images are face images obtained by that network fusing the input face images. Training stops when the evaluation index value is less than or equal to a threshold, and the trained generative adversarial network is taken as the target face generation network.
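A minimal sketch of this batch-wise training loop with the evaluation-index stopping rule follows. It is a sketch under stated assumptions, not the disclosed implementation: `gan.train_step`, `gan.generate`, `gan.generator`, `detection_net`, and the threshold value are hypothetical placeholders, and computing the average difference from detection-network outputs is an assumption based on the description above:

```python
import torch

def train_to_target_network(gan, detection_net, batches, threshold=0.05):
    """Train the GAN batch by batch; stop once the evaluation index value
    (average difference between input faces and the faces the GAN outputs,
    measured via a pre-trained detection network) reaches the threshold."""
    for faces in batches:
        gan.train_step(faces)  # placeholder: one batch of generator/discriminator updates
        with torch.no_grad():
            fused = gan.generate(faces)  # placeholder: faces output by the current GAN
            # Evaluation index value: average difference between input and output faces.
            index_value = (detection_net(faces) - detection_net(fused)).abs().mean()
        if index_value <= threshold:
            break  # training stops once the index value is small enough
    return gan.generator  # trained generator serves as the target face generation network
```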
Illustratively, the electronic device uses a trained StyleGAN (style-based generative adversarial network) as the target face generation network. The ability of StyleGAN to generate face images is widely recognized; it can generate face images that are difficult to distinguish from real ones. The StyleGAN network maps face image information from the training sample dataset into a latent code carrying semantic information, and then generates a face image from that latent code. The realism of face images generated by the target face generation network is therefore high.
S102: training an initial encoder according to the plurality of sample face images and the target face generation network to obtain a target encoder. The initial encoder is constructed based on the trained discriminator described above.
Specifically, the electronic device obtains the target encoder through the following steps:
step one: the electronic device removes the last two full connection layers in the trained arbiter to obtain the initial encoder.
The discriminator trained together with the target face generation network already has good encoding capability, so borrowing it directly saves a great amount of training time when training the encoder.
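A minimal sketch of step one, assuming (purely for illustration) that the trained discriminator is a torch.nn.Sequential whose final two layers are the fully connected layers; the layer sizes and 256×256 input are assumptions, not details from the disclosure:

```python
import torch
import torch.nn as nn

# Hypothetical discriminator layout: a small convolutional backbone followed
# by two fully connected layers (the real architecture is not specified here).
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(128 * 64 * 64, 512),  # first fully connected layer (removed)
    nn.Linear(512, 1),              # second fully connected layer (removed)
)

# Step one: drop the last two fully connected layers; what remains maps a
# face image to a latent code and serves as the initial encoder.
initial_encoder = nn.Sequential(*list(discriminator.children())[:-2])

face = torch.randn(1, 3, 256, 256)  # one 256x256 face image (size assumed)
latent = initial_encoder(face)      # latent code of shape (1, 128*64*64)
```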
Step two: the electronic device first inputs the plurality of sample face images into the initial encoder to obtain the latent codes respectively corresponding to the sample face images; it then inputs these latent codes into the target face generation network to obtain the predicted face images respectively corresponding to them; finally, it acquires a plurality of first losses according to each sample face image and the predicted face image corresponding to it.
Specifically, for each sample face image, the electronic device: inputs the sample face image into the initial encoder to obtain its latent code; inputs that latent code into the target face generation network to obtain the corresponding predicted face image; and computes the pixel-wise difference between the sample face image and its predicted face image to obtain a first loss. After performing these steps on all of the sample face images, the electronic device obtains a plurality of first losses.
Step three: the electronic device inputs each predicted face image into the initial encoder to obtain the latent code of each predicted face image, and acquires a plurality of second losses according to the latent code and the predicted latent code of each sample face image, where the predicted latent code is the latent code of the predicted face image corresponding to that sample face image's latent code.
Step four: the electronic device trains the initial encoder according to the first losses and the second losses to obtain a target encoder.
Specifically, the electronic device back-propagates the first losses and the second losses to train the initial encoder, obtaining the target encoder.
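The following sketch condenses steps two to four into one training iteration. It is a hedged sketch: the use of mean-squared error for both losses, the loss weighting, and the optimizer setup are assumptions; the disclosure only specifies that the two losses are computed and back-propagated:

```python
import torch
import torch.nn.functional as F

def encoder_train_step(initial_encoder, target_generator, faces, optimizer,
                       second_loss_weight=1.0):
    """One illustrative training iteration of the initial encoder (steps two
    to four), supervised by the target face generation network. The optimizer
    is assumed to hold only the encoder's parameters."""
    latents = initial_encoder(faces)        # latent codes of the sample faces
    predicted = target_generator(latents)   # predicted faces from those latents

    # First loss: pixel-level difference between each sample face image and
    # its corresponding predicted face image (MSE is an assumption).
    first_loss = F.mse_loss(predicted, faces)

    # Second loss: difference between each sample face's latent code and the
    # predicted latent code (the latent code of the predicted face image).
    predicted_latents = initial_encoder(predicted)
    second_loss = F.mse_loss(predicted_latents, latents)

    loss = first_loss + second_loss_weight * second_loss
    optimizer.zero_grad()
    loss.backward()      # back-propagate both losses
    optimizer.step()     # update the encoder's weights
    return first_loss.item(), second_loss.item()
```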
In this way, the embodiment of the disclosure uses not only the first loss but also the second loss when training the target encoder. The first loss represents the difference between the sample face image input to the encoder and the predicted face image generated from the latent code the encoder parsed from it. The second loss reflects the difference between the latent code of the predicted face image, parsed by the encoder, and the latent code of the sample face image. The first loss pushes the sample face image and the predicted face image closer at the pixel level, while the second loss ensures they are closer at the semantic level. The target encoder obtained by such training can therefore parse the latent code of a face image more accurately, and face images later generated from these latent codes are clearer and more lifelike.
S103: sequentially connecting the target encoder and the target face generation network to obtain a target image generation model. The target image generation model is used for generating a fused face image of at least two face images according to the at least two face images.
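S103 amounts to composing the two trained networks. A sketch follows, with all names illustrative; the two-image signature and the default weight are assumptions for demonstration:

```python
import torch.nn as nn

class TargetImageGenerationModel(nn.Module):
    """Sketch of the target image generation model: the target encoder
    connected in sequence to the target face generation network."""
    def __init__(self, target_encoder: nn.Module, target_generator: nn.Module):
        super().__init__()
        self.encoder = target_encoder
        self.generator = target_generator

    def forward(self, face_a, face_b, weight_a=0.5):
        """Generate a fused face image from two face images: encode each face
        to a latent code, blend the codes (weights sum to 1), and decode the
        blended code with the face generation network."""
        latent_a = self.encoder(face_a)
        latent_b = self.encoder(face_b)
        fused_latent = weight_a * latent_a + (1.0 - weight_a) * latent_b
        return self.generator(fused_latent)
```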
In the embodiment of the disclosure, the target face generation network and the target encoder are trained separately. The initial encoder used to train the target encoder is constructed from the discriminator used in training the target face generation network, and the trained discriminator already has good encoding capability. Because the target encoder is obtained by training the initial encoder with the target face generation network as supervision, the target encoder can parse the latent code of a face image more accurately; the target face generation network then generates images from the latent codes parsed by the target encoder, so the generated fused face image has higher similarity to the original face images and higher realism.
The following describes a method for generating an image using a target image generation model trained with the training method provided by the embodiments of the present disclosure.
Fig. 2 is a flow chart illustrating an image generation method according to an exemplary embodiment. The method shown in fig. 2 can be applied to an electronic device, and the method shown in fig. 2 includes the following steps:
S200: at least two face images are acquired.
The embodiment of the disclosure does not limit how the at least two face images are acquired. The following takes two face images as an example:
In one possible implementation, the electronic device captures one face image with a camera and reads the other face image locally or receives it from another electronic device. In another possible implementation, the computer device selects, from stored images, the two face images indicated by a received user selection instruction; the user selection instruction may be received from the user through an input unit.
In one example, while a user is using an application on an electronic device, the electronic device receives an image processing instruction selected by the user in the application, indicating that face fusion processing is to be performed. The electronic device prompts the user to select at least two face images and, in response to the user selection instruction, acquires the at least two face images indicated by it.
S201: and inputting the at least two face images into a target encoder of a target image generation model to respectively obtain the latent codes of the at least two face images. The target image generation model is trained according to the training method of the image generation model of the above disclosed embodiment.
Specifically, the electronic device inputs the at least two face images into the target encoder of the trained target image generation model, and obtains the latent codes of the at least two face images respectively.
S202: and fusing the latent codes of the at least two face images to obtain a fused latent code.
Specifically, the electronic device acquires the weight of the latent code of each face image, and fuses the latent codes of the at least two face images according to the weight of the latent code of each face image to obtain a fused latent code. The sum of the weights of the latent codes of all face images is 1.
In an example with two face images, assume the latent code of one face image is W1 and the latent code of the other is W2, with the weight of W1 being 60% and the weight of W2 being 40%; then the fused latent code W' = 60% × W1 + 40% × W2.
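This weighted fusion is plain elementwise arithmetic over the latent code tensors. A minimal sketch of the example above; the 512-dimensional latent shape is an assumption for illustration:

```python
import torch

w1 = torch.randn(1, 512)  # latent code of the first face image (shape assumed)
w2 = torch.randn(1, 512)  # latent code of the second face image

weight1, weight2 = 0.6, 0.4            # the weights must sum to 1
w_fused = weight1 * w1 + weight2 * w2  # fused latent code W' = 60%*W1 + 40%*W2
```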
S203: and inputting the fusion latent codes into a target face generation network of a target image generation model to generate a face fusion image.
Specifically, the electronic device inputs the fusion latent code into the trained target face generation network, and the target face generation network generates a face fusion image according to the fusion latent code.
It should be noted that, the execution body of the embodiment of the present disclosure may be a terminal device or a server. The terminal equipment comprises, but is not limited to, a mobile phone, a tablet personal computer, a notebook computer, a palm computer, a vehicle-mounted terminal and the like. The server may be one server, or may be a server cluster formed by a plurality of servers, which is not limited in the embodiments of the present disclosure.
In one possible application scenario, as shown in fig. 3, the terminal device 301 uploads the acquired at least two face images to the server 302. After receiving the image, the server 302 executes the above-described S200 to S203 to acquire a face fusion image and transmits it to the terminal device 301. After receiving the face fusion image, the terminal device 301 displays the face fusion image to the user. Of course, if the terminal device 301 performs the steps of S200 to S203 to obtain the face fusion image, the server 302 is not required.
In other possible application scenarios, after the terminal device captures two face images with its camera, it generates a face fusion image of the two face images by the method provided in the embodiments of the present disclosure and displays the generated face fusion image on the screen, adding to the user's enjoyment. The embodiments of the present disclosure may also be used to fuse a selected face image with a face image in a video to generate a face fusion image, thereby replacing the face image in the video with the face fusion image.
In the embodiment of the disclosure, the target encoder in the target image generation model, trained with the training method provided above, can more accurately parse the latent code of each face image to be fused (such as the at least two face images). The latent codes of the face images to be fused are then fused into a fused latent code, and the fused latent code is input into the target face generation network to obtain the fused face image. Because a face image's latent code represents its semantic-level features, the obtained fused face image combines the features of each of the at least two face images; the fusion is therefore more thorough and more natural, and the similarity to each face image to be fused is greater than a threshold. Moreover, the ability of the target face generation network to generate face images is widely recognized, and it can generate face images that are difficult to distinguish from real ones, so the realism of the face images it generates is high.
The foregoing description of the embodiments of the present disclosure has been presented primarily in terms of methods. To achieve the above functions, corresponding hardware structures and/or software modules performing the respective functions are included. Those of skill in the art will readily appreciate that the various illustrative method steps described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present disclosure.
The embodiment of the disclosure may divide the functional modules of the electronic device according to the above method examples, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present disclosure, the division of the modules is merely a logic function division, and other division manners may be implemented in actual practice.
Fig. 4 is a block diagram of a training apparatus for an image generation model according to an exemplary embodiment. Referring to fig. 4, the training apparatus 40 for an image generation model includes an acquisition module 401, a first training module 402, a second training module 403, and a generation module 404. The acquisition module 401 is configured to acquire a plurality of sample face images. The first training module 402 is configured to train a generative adversarial network according to the plurality of sample face images to obtain a target face generation network; the generative adversarial network includes a generator and a discriminator, and the target face generation network is constructed according to the trained generator. The second training module 403 is configured to train an initial encoder according to the plurality of sample face images and the target face generation network to obtain a target encoder; the initial encoder is constructed according to the trained discriminator. The generation module 404 is configured to sequentially connect the target encoder and the target face generation network to obtain a target image generation model; the target image generation model is used for generating a fused face image of at least two face images according to the at least two face images. For example, in connection with fig. 1, the acquisition module 401 may be used to perform S100, the first training module 402 may be used to perform S101, the second training module 403 may be used to perform S102, and the generation module 404 may be used to perform S103.
Optionally, the second training module 403 is specifically configured to: remove the last two fully connected layers of the trained discriminator to obtain an initial encoder; input the plurality of sample face images into the initial encoder to obtain latent codes respectively corresponding to the plurality of sample face images; input the latent codes of the plurality of sample face images into the target face generation network to obtain predicted face images respectively corresponding to those latent codes; acquire a plurality of first losses according to each sample face image and its corresponding predicted face image; input each predicted face image into the initial encoder to obtain the latent code of each predicted face image; acquire a plurality of second losses according to the latent code and the predicted latent code of each sample face image, the predicted latent code being the latent code of the predicted face image corresponding to that sample face image's latent code; and train the initial encoder according to the first losses and the second losses to obtain the target encoder.
Optionally, the first training module 402 is specifically configured to: train the generative adversarial network in batches according to the plurality of sample face images; acquire, according to a pre-trained detection network, an evaluation index value of the generative adversarial network obtained by each batch of training, the evaluation index value being the average difference between input face images and output face images, where the input face images are face images input to the trained generative adversarial network and the output face images are face images obtained by that network fusing the input face images; and stop training when the evaluation index value is less than or equal to a threshold, taking the trained generative adversarial network as the target face generation network.
The specific manner in which the respective modules perform the operations in the training apparatus for an image generation model in the above-described embodiment has been described in detail in the embodiment concerning the method, and will not be explained in detail here.
Reference is made to the foregoing method embodiments for the detailed description of the foregoing optional modes, and details are not repeated herein. In addition, the explanation and description of the beneficial effects of any of the training devices 40 for generating models can refer to the corresponding method embodiments described above, and will not be repeated.
It should be noted that the actions correspondingly performed by the above modules are only specific examples, and the actions actually performed by the modules refer to the actions or steps mentioned in the description of the embodiment described based on fig. 1.
Fig. 5 is a block diagram of an image generating apparatus according to an exemplary embodiment. Referring to fig. 5, the apparatus 50 for generating a face fusion image includes an acquisition module 501, a fusion module 502, and a generation module 503. Wherein, the acquiring module 501 is configured to acquire at least two face images; inputting at least two face images into a target encoder of a target image generation model to respectively obtain latent codes of the at least two face images; the target image generation model is obtained by training according to the training method of the image generation model provided by the embodiment of the method; the fusion module 502 is configured to fuse the latent codes of at least two face images to obtain a fused latent code; the generating module 503 is configured to input the fusion latent code into a target face generating network of the target image generating model to generate a face fusion image. For example, in connection with FIG. 2, the acquisition module 501 may be used to perform S200-S201 and the fusion module 502 may be used to perform S202. The generating module 503 may be used to perform S203.
Optionally, the obtaining module 501 is further configured to: acquiring weights of latent codes of at least two face images; the fusion module 502 is specifically configured to: and fusing the latent codes of at least two face images according to the weight of the obtained latent codes of each face image to obtain fused latent codes.
With respect to the image generating apparatus in the above-described embodiment, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment concerning the method, and will not be described in detail here.
Reference is made to the foregoing method embodiments for the detailed description of the foregoing optional modes, and details are not repeated herein. In addition, any explanation and description of the beneficial effects of the image generating apparatus 50 provided above may refer to the corresponding method embodiments described above, and will not be repeated.
It should be noted that the actions correspondingly performed by the above modules are only specific examples, and the actions actually performed by the modules refer to the actions or steps mentioned in the description of the embodiment described based on fig. 2.
The disclosed embodiments also provide an electronic device, and fig. 6 is a block diagram of an electronic device, which is shown according to an exemplary embodiment. Referring to fig. 6, the electronic device 60 includes: a memory 601 and a processor 602; the memory 601 is for storing a computer program and the processor 602 is for invoking the computer program to perform the actions or steps referred to in any of the embodiments provided above.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the actions or steps mentioned in any of the embodiments provided above.
The embodiment of the disclosure also provides a chip. The chip integrates circuitry and one or more interfaces for implementing the functions of the above face fusion image generation apparatus. Optionally, the functions supported by the chip may include the processing actions of the embodiment described with reference to fig. 1, which are not repeated here. Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by a program instructing associated hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory or a random access memory. The processing unit or processor may be a central processing unit, a general-purpose processor, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
The disclosed embodiments also provide a computer program product comprising instructions that, when run on a computer, cause the computer to perform any of the methods of the above embodiments. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from a website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
It should be noted that the above-mentioned devices for storing computer instructions or computer programs provided by the embodiments of the present disclosure, such as, but not limited to, the above-mentioned memories, computer-readable storage media, and communication chips, are all non-transitory.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of training an image generation model, the method comprising:
acquiring a plurality of sample face images;
training a generative adversarial network according to the plurality of sample face images to obtain a target face generation network; the generative adversarial network comprises a generator and a discriminator; the target face generation network is constructed according to the trained generator;
removing the last two fully connected layers of the trained discriminator to obtain an initial encoder; the initial encoder is constructed according to the trained discriminator;
inputting the plurality of sample face images into the initial encoder to obtain latent codes respectively corresponding to the plurality of sample face images;
inputting the latent codes of the plurality of sample face images into the target face generation network to obtain predicted face images respectively corresponding to the latent codes of the plurality of sample face images;
acquiring a plurality of first losses according to each sample face image and a predicted face image corresponding to each sample face image;
inputting each predicted face image into the initial encoder to obtain latent codes corresponding to each predicted face image respectively;
acquiring a plurality of second losses according to the latent codes and the predicted latent codes of each sample face image; the prediction latent code is: a latent code of the predicted face image corresponding to the latent code of each sample face image;
training the initial encoder according to the first losses and the second losses to obtain a target encoder; and sequentially connecting the target encoder and the target face generation network to obtain a target image generation model; wherein the target image generation model is used for generating a fused face image of at least two face images according to the at least two face images.
2. The method of claim 1, wherein training the generative adversarial network according to the plurality of sample face images to obtain the target face generation network comprises:
training the generative adversarial network in batches according to the plurality of sample face images;
acquiring, according to a pre-trained detection network, an evaluation index value of the generative adversarial network obtained by each batch of training; the evaluation index value is an average difference value between input face images and output face images; the input face images are face images input to the trained generative adversarial network, and the output face images are face images obtained by the trained generative adversarial network fusing the input face images;
and stopping training when the evaluation index value is less than or equal to a threshold value, and taking the trained generative adversarial network as the target face generation network.
3. An image generation method, characterized in that the generation method comprises:
acquiring at least two face images;
inputting the at least two face images into a target encoder of a target image generation model to respectively obtain latent codes of the at least two face images; the target image generation model is trained according to the training method of the image generation model as set forth in claim 1 or 2;
fusing the latent codes of the at least two face images to obtain fused latent codes;
and inputting the fusion latent codes into a target face generation network of the target image generation model to generate a face fusion image.
4. The method according to claim 3, wherein fusing the latent codes of the at least two face images to obtain the fused latent code comprises:
acquiring weights of the latent codes of the at least two face images;
and fusing the latent codes of the at least two face images according to the acquired weight of the latent code of each face image to obtain the fused latent code.
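For illustration only: a sketch of the weighted fusion of claim 4. Normalizing the weights so they sum to one is an added assumption, not a requirement of the claim.

```python
import torch

def fuse_latents(latents, weights):
    # Weighted combination of the latent codes of two or more face images.
    w = torch.tensor(weights, dtype=latents[0].dtype)
    w = w / w.sum()  # normalize the weights (assumption)
    return sum(wi * z for wi, z in zip(w, latents))

# e.g. keep 70% of the first face and 30% of the second:
# fused = fuse_latents([z_a, z_b], [0.7, 0.3])
```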
5. A training device for an image generation model, comprising:
an acquisition module configured to acquire a plurality of sample face images;
a first training module configured to train a generative adversarial network according to the plurality of sample face images to obtain a target face generation network; the generative adversarial network comprises a generator and a discriminator; the target face generation network is constructed from the trained generator;
a second training module configured to: remove the last two fully connected layers from the trained discriminator to obtain an initial encoder, the initial encoder thus being constructed from the trained discriminator; input the plurality of sample face images into the initial encoder to obtain latent codes respectively corresponding to the plurality of sample face images; input the latent codes of the plurality of sample face images into the target face generation network to obtain predicted face images respectively corresponding to the latent codes of the plurality of sample face images; acquire a plurality of first losses according to each sample face image and the predicted face image corresponding to each sample face image; input each predicted face image into the initial encoder to obtain latent codes respectively corresponding to each predicted face image; acquire a plurality of second losses according to the latent code of each sample face image and the corresponding predicted latent code, the predicted latent code being the latent code of the predicted face image corresponding to the latent code of each sample face image; and train the initial encoder according to the plurality of first losses and the plurality of second losses to obtain a target encoder;
a generating module configured to sequentially connect the target encoder and the target face generation network to obtain a target image generation model; the target image generation model is used for generating a fused face image of at least two face images according to the at least two face images.
6. The training device of claim 5, wherein the first training module is specifically configured to:
train the generative adversarial network in batches according to the plurality of sample face images;
acquire, according to a pre-trained detection network, an evaluation index value of the generative adversarial network obtained by each batch of training; the evaluation index value is an average difference value between the input face images and the output face images; the input face images are face images input into the generative adversarial network obtained by the training, and the output face images are face images obtained by that network fusing the input face images;
and stop training when the evaluation index value is smaller than or equal to a threshold value, and take the generative adversarial network obtained by the training as the target face generation network.
7. An image generating apparatus, comprising:
an acquisition module configured to: acquire at least two face images; and input the at least two face images into a target encoder of a target image generation model to respectively obtain latent codes of the at least two face images; the target image generation model is trained according to the training method of the image generation model as set forth in claim 1 or 2;
a fusion module configured to fuse the latent codes of the at least two face images to obtain a fused latent code;
and a generating module configured to input the fused latent code into a target face generation network of the target image generation model to generate a face fusion image.
8. The image generating apparatus according to claim 7, wherein,
the acquisition module is further configured to acquire weights of the latent codes of the at least two face images;
the fusion module is specifically configured to fuse the latent codes of the at least two face images according to the acquired weight of the latent code of each face image to obtain the fused latent code.
9. An electronic device, comprising:
a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the executable instructions to implement the training method of the image generation model of claim 1 or 2 or to implement the image generation method of claim 3 or 4.
10. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the training method of an image generation model according to claim 1 or 2, or to perform the image generation method according to claim 3 or 4.
CN202110316298.7A 2021-03-24 2021-03-24 Training method and device for image generation model, electronic equipment and storage medium Active CN113096055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110316298.7A CN113096055B (en) 2021-03-24 2021-03-24 Training method and device for image generation model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113096055A (en) 2021-07-09
CN113096055B (en) 2024-03-08

Family

ID=76669483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110316298.7A Active CN113096055B (en) 2021-03-24 2021-03-24 Training method and device for image generation model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113096055B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642491A (en) * 2021-08-20 2021-11-12 Beijing Baidu Netcom Science and Technology Co., Ltd. Face fusion method, and training method and device of face fusion model
CN113850890A (en) * 2021-09-29 2021-12-28 Beijing Zitiao Network Technology Co., Ltd. Method, device, equipment and storage medium for generating animal image
CN116630138B * 2022-02-09 2024-10-18 Tencent Technology (Shenzhen) Co., Ltd. Image processing method, apparatus, electronic device, and computer-readable storage medium
CN117649695B * 2024-01-30 2024-04-12 Shenzhen Zongjiang Technology Co., Ltd. Face image generation method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796080A (en) * 2019-10-29 2020-02-14 Chongqing University Multi-pose pedestrian image synthesis algorithm based on generative adversarial networks
WO2020038589A1 (en) * 2018-08-24 2020-02-27 Toyota Motor Europe Methods for automatically generating diverse image data
CN110956079A (en) * 2019-10-12 2020-04-03 Shenzhen OneConnect Smart Technology Co., Ltd. Face recognition model construction method and device, computer equipment and storage medium
CN111368662A (en) * 2020-02-25 2020-07-03 South China University of Technology Method, device, storage medium and equipment for editing attribute of face image
CN111652828A (en) * 2020-05-27 2020-09-11 Beijing Baidu Netcom Science and Technology Co., Ltd. Face image generation method, device, equipment and medium
CN112509154A (en) * 2020-11-26 2021-03-16 Beijing Dajia Internet Information Technology Co., Ltd. Training method of image generation model, image generation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A style-based generator architecture for generative adversarial networks; Tero Karras et al.; arXiv:1812.04948 [cs.NE]; full text *
Blurred face enhancement based on deep generative adversarial networks; Tong Zonghe, Liu Zhao; Computer Applications and Software (Issue 09); full text *


Similar Documents

Publication Publication Date Title
CN113096055B (en) Training method and device for image generation model, electronic equipment and storage medium
CN110599492B (en) Training method and device for image segmentation model, electronic equipment and storage medium
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN111275784B (en) Method and device for generating image
CN110728319B (en) Image generation method and device and computer storage medium
CN114419509B (en) Multi-mode emotion analysis method and device and electronic equipment
CN114398973B (en) Media content tag identification method, device, equipment and storage medium
CN111182332B (en) Video processing method, device, server and storage medium
CN110097004B (en) Facial expression recognition method and device
CN107506479A (en) A kind of object recommendation method and apparatus
WO2024152686A1 (en) Method and apparatus for determining recommendation index of resource information, device, storage medium and computer program product
CN112269943B (en) Information recommendation system and method
CN116956204A (en) Network structure determining method, data predicting method and device of multi-task model
CN116488850A (en) Authenticity verification method and device
CN111901673B (en) Video prediction method, device, storage medium and terminal
CN114897046B (en) Semantic feature determining method and device for media resources, storage medium and equipment
CN117274615B (en) Human body action prediction method and related products
CN115391602A (en) Answer generation method of video question-answering system
EP4361895A1 (en) Machine learning
CN118485098A (en) Method, device, medium and equipment for generating model and pushing multimedia resources
CN116821842A (en) Feature gating network training and feature fusion method, device and storage medium
CN117171562A (en) Training method and device of intent prediction model, electronic equipment and storage medium
CN116977655A (en) Image processing method, device, electronic equipment and storage medium
CN115501617A (en) Game room matching method and device, readable medium and electronic equipment
CN117056561A (en) Video recommendation method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant