CN115578779B - Training of face changing model, video-based face changing method and related device

Publication number: CN115578779B (application CN202211477442.6A; earlier publication CN115578779A)
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 朱俊伟 (Zhu Junwei), 贺珂珂 (He Keke), 邰颖 (Tai Ying), 汪铖杰 (Wang Chengjie)
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Active (granted)

Classifications: G06V40/16 (human faces, e.g. facial parts, sketches or expressions); G06V10/74 (image or video pattern matching; proximity measures in feature spaces); G06V10/774 (generating sets of training patterns, e.g. bagging or boosting); G06V20/40 (scene-specific elements in video content); G06T3/04.

Abstract

The embodiment of the application discloses a training method for a face changing model, a video-based face changing method and a related apparatus. The training method of the face changing model comprises: collecting a plurality of real face images and a plurality of virtual face images, and sampling them to obtain a plurality of first target sample images and a corresponding plurality of first source sample images, wherein the first target sample images comprise real face images and the first source sample images comprise virtual face images; performing generation processing on each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image; and training the first generation model by minimizing a first loss function over the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model, which is used as the target face changing model. The target face changing model can generate a face generation image that expresses the face image of the virtual face image more realistically while ensuring the robustness of face changing.

Description

Training of face changing model, video-based face changing method and related device
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to training of a face changing model, a video-based face changing method and a related apparatus.
Background
With the rapid development of video interaction technology, more and more users interact through video in daily life and work, for example through live video and video chat. During video interaction, a virtual face image often needs to be used to replace the real face image in the video, for example to protect the real face image or to provide the experience of a virtual face image.
In the related art, a face-change model may be obtained by training in advance using a collected real face image as a training sample, and a face generation image may be obtained by performing generation processing on a virtual face image and a real face image in a video through the face-change model, so as to replace the real face image in the video with the face generation image.
However, the face image of a virtual face image is more exaggerated than that of a real face image. When the face changing model is trained only with collected real face images as training samples, the face generation image it produces is closer to the real face image, so the face image of the virtual face image is difficult to express realistically, and the texture of the face image of the virtual face after face changing is poor.
Disclosure of Invention
In order to solve the above technical problem, the application provides training of a face changing model, a video-based face changing method and a related apparatus. The trained target face changing model can generate a face generation image that expresses the face image of the virtual face image more vividly while ensuring the robustness of face changing, so applying the target face changing model to change faces can improve the texture of the face image of the virtual face after face changing.
The embodiment of the application discloses the following technical scheme:
in one aspect, the present application provides a method for training a face-changing model, where the method includes:
sampling the plurality of real face images and the plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images includes the real face image, the plurality of first source sample images includes the virtual face image;
generating each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image;
training the first generation model by minimizing a first loss function according to the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model; the first loss function is used to calculate a loss of face image between the first face generation image and the first source sample image and a loss of face state between the first face generation image and the first target sample image;
and determining the first face changing model as a target face changing model.
In another aspect, the present application provides a video-based face changing method, including:
acquiring a real face image to be replaced in a video to be displayed;
determining a preset virtual face image corresponding to the real face image to be replaced;
generating and processing the real face image to be replaced and the preset virtual face image through a target face changing model to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced;
replacing the real face image to be replaced with the target virtual face image when the video to be displayed is displayed;
the target face-changing model is obtained by executing the training method of the face-changing model in the aspect.
In another aspect, the present application provides a training device for face changing model, the device comprising: the device comprises a sampling unit, a first generating unit, a training unit and a first determining unit;
the sampling unit is used for sampling a plurality of real face images and a plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images includes the real face image, the plurality of first source sample images includes the virtual face image;
the first generation unit is used for generating and processing each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image;
the training unit is used for training the first generation model to obtain a first face changing model by minimizing a first loss function according to the first face generation image, the first target sample image and the first source sample image; the first loss function is used to calculate a loss of face image between the first face generation image and the first source sample image and a loss of face state between the first face generation image and the first target sample image;
the first determining unit is configured to determine the first face changing model as a target face changing model.
In another aspect, the present application provides a video-based face-changing device, including: the device comprises an acquisition unit, a second determination unit, a second generation unit and a replacement unit;
the acquisition unit is used for acquiring a real face image to be replaced in a video to be displayed;
the second determining unit is used for determining a preset virtual face image corresponding to the real face image to be replaced;
the second generating unit is used for generating and processing the real face image to be replaced and the preset virtual face image through a target face changing model to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced;
the replacing unit is used for replacing the real face image to be replaced with the target virtual face image when the video to be displayed is displayed;
the target face-changing model is obtained by executing the training method of the face-changing model in the aspect.
In another aspect, the present application provides a computer device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the training method of the face-changing model or the video-based face-changing method according to instructions in the program code.
In another aspect, an embodiment of the present application provides a computer-readable storage medium for storing a computer program, where the computer program is executed by a processor to perform the method for training a face-changing model or the method for changing faces based on videos according to the foregoing aspects.
In another aspect, embodiments of the present application provide a computer program product, which includes a computer program or instructions; the computer program or instructions, when executed by a processor, perform a method of training a face-changing model or a video-based face-changing method as described in the above aspects.
According to the technical scheme, firstly, a plurality of real face images and a plurality of virtual face images are collected, a plurality of first target sample images and a plurality of corresponding first source sample images are obtained through sampling, the plurality of first target sample images comprise the real face images, and the plurality of first source sample images comprise the virtual face images; then, generating each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image; finally, training a first generation model in a mode of minimizing a first loss function through the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model as a target face changing model; the first loss function is used to calculate a face image loss between the first face-generating image and the first source sample image, and a face state loss between the first face-generating image and the first target sample image.
It can be seen that, adding a virtual face image on the basis of the real face image, and sampling to obtain a plurality of first target sample images including the real face image and a plurality of first source sample images including the virtual face image; training a first generation model aiming at each first target sample image and the corresponding first source sample image to mine the face image of the first source sample image and the face state of the first target sample image, so that the trained first face changing model further learns the face image of the virtual face image on the basis of learning the face state of the real face image; the first face changing model is used as a target face changing model, the target face changing model can generate a face generation image which expresses the face image of the virtual face image more vividly while ensuring the face changing robustness, and therefore the face changing model is applied to change the face, and the face image texture of the virtual face after face changing can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic view of an application scenario of a face changing model training method and a video-based face changing method according to an embodiment of the present application;
fig. 2 is a flowchart of a method for training a face-changing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a real facial image and a virtual facial image according to an embodiment of the present application;
fig. 4 is a schematic diagram of a framework for obtaining a first face-changed model by training a first generation model through a plurality of first target sample images and a corresponding plurality of first source sample images according to an embodiment of the present application;
fig. 5 is a schematic diagram of a framework provided in the embodiment of the present application for obtaining a second face change model by training a second generative model through a plurality of second target sample images and a corresponding plurality of second source sample images;
fig. 6 is a schematic diagram of a plurality of branch model parameters in a second generative model and fused model parameters in an updated second face-changed model according to an embodiment of the present application;
fig. 7 is a flowchart of a method of a video-based face changing method according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating an effect of replacing a real face image to be replaced in a video to be displayed with a target virtual face image according to an embodiment of the present application;
fig. 9 is a flowchart of another method for changing a video face based on a preset binding relationship according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a training apparatus for a face changing model according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a video-based face changing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
At present, in video interaction processes such as live video and video chat of user A, the real face image of user A in video V needs to be replaced with a virtual face image, for example to protect the real face image of user A. Generally, a face changing model is obtained by training with collected real face images as training samples, the virtual face image and the real face image of user A in video V are subjected to generation processing to obtain a face generation image, and the face generation image is used to replace the real face image of user A in video V.
However, research shows that the face image of the virtual face image is more exaggerated than the real face image of user A in video V. Because the face changing model is trained with collected real face images as training samples, the face generation image it produces is closer to the real face image of user A in video V, so the face image of the virtual face image is difficult to express realistically, and the texture of the face image of the virtual face after face changing of user A in video V is poor.
In view of the above, the present application provides a training of a face-changing model, a video-based face-changing method and a related apparatus, in which a virtual face image is added on the basis of a real face image, and a plurality of first target sample images including the real face image and a plurality of first source sample images including the virtual face image are obtained by sampling; training a first generation model aiming at each first target sample image and the corresponding first source sample image to mine the face image of the first source sample image and the face state of the first target sample image, so that the trained first face changing model further learns the face image of the virtual face image on the basis of learning the face state of the real face image; the first face changing model is used as a target face changing model, the target face changing model can generate a face generation image which expresses the face image of the virtual face image more vividly while ensuring the face changing robustness, and therefore the face changing model is applied to change the face, and the face image texture of the virtual face after face changing can be improved.
In order to facilitate understanding of the technical scheme of the present application, a training method of a face changing model and a face changing method based on a video provided in the embodiments of the present application are introduced below with reference to an actual application scenario.
Referring to fig. 1, the figure is a schematic view of an application scenario of a face changing model training method and a video-based face changing method provided in an embodiment of the present application. The application scenario shown in fig. 1 includes a server 101 and a face changing device 102. The server 101 stores a plurality of real face images and a plurality of virtual face images, the virtual face images being obtained by rendering a plurality of virtual face models in different ways.
The server 101 samples a plurality of real face images and a plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images includes a real face image and the plurality of first source sample images includes a virtual face image. As an example, the first target sample image is a real face image a, and the corresponding first source sample image is a virtual face image b.
The server 101 performs generation processing on each first target sample image and the corresponding first source sample image through a first generation model, and obtains a first face generation image. As an example, the first generation model is generation model 1, and on the basis of the above example, the server 101 performs generation processing on the real face image a and the virtual face image b by the generation model 1 to obtain a first face generation image as the face generation image 1.
The server 101 trains a first generation model to obtain a first face changing model by minimizing a first loss function according to the first face generation image, the first target sample image and the first source sample image; the first loss function is used to calculate a face image loss between the first face-generating image and the first source sample image, and a face state loss between the first face-generating image and the first target sample image. As an example, the first loss function is a loss function 1, and on the basis of the above example, the loss function 1 is used to calculate a loss of facial image between the face generation image 1 and the virtual face image b, and a loss of face state between the face generation image 1 and the real face image a; the server 101 trains the generating model 1 in a mode of minimizing the loss function 1, and obtains a first face changing model as the face changing model 1.
The server 101 determines the first face changing model as the target face changing model, and transmits the target face changing model to the face changing device 102, so that the face changing device 102 deploys the target face changing model. As an example, on the basis of the above example, the server 101 takes the face changing model 1 as the target face changing model and transmits the face changing model 1 to the face changing device 102, so that the face changing device 102 deploys the face changing model 1.
After the face changing device 102 deploys the target face changing model, a real face image to be changed in the video to be displayed is acquired, and a preset virtual face image corresponding to the real face image to be changed is determined. As an example, on the basis of the above example, after the face changing device 102 deploys the face changing model 1, it may acquire a real face image to be changed of the user a in the video V as a real face image pp, and determine that a preset virtual face image corresponding to the real face image pp is a virtual face image qq.
The face changing device 102 generates and processes a real face image to be changed and a preset virtual face image through a target face changing model to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced. As an example, on the basis of the above example, the face changing device 102 performs generation processing on the real face image pp and the virtual face image qq by the face changing model 1, and obtains a target virtual face image that matches the face image of the virtual face image qq and the face state of the real face image pp as a virtual face image pq that more realistically expresses the face image of the virtual face image qq.
When the face changing device 102 displays the video to be displayed, it replaces the real face image to be replaced with the target virtual face image. As an example, on the basis of the above example, when displaying the video V, the face changing device 102 replaces the real face image pp with the virtual face image pq, which has better virtual-face texture.
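For illustration only, a minimal PyTorch-style sketch of this deployment-side flow follows. It assumes the deployed target face changing model behaves as a callable mapping a real face to be replaced and a preset virtual face image to a target virtual face image; face detection, cropping and pasting the result back into the frame are abstracted away, and all names are hypothetical.

import torch

@torch.no_grad()
def swap_video_faces(face_changing_model, frame_faces, preset_virtual_face):
    # frame_faces: cropped real face images to be replaced, one per video frame
    # (e.g. the real face image pp of user A in video V).
    swapped = []
    for real_face in frame_faces:
        # The target virtual face image carries the face image (appearance) of the
        # preset virtual face (e.g. qq) and the face state of the real face (pp).
        swapped.append(face_changing_model(real_face, preset_virtual_face))
    return swapped  # shown in place of the real faces when the video is displayed

# Usage with a stand-in model (placeholder only, not the trained face changing model):
toy_model = lambda real, virtual: 0.5 * (real + virtual)
frames = [torch.rand(1, 3, 256, 256) for _ in range(3)]
replaced = swap_video_faces(toy_model, frames, torch.rand(1, 3, 256, 256))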
Therefore, the virtual face image is added on the basis of the real face image, and a plurality of first target sample images comprising the real face image and a plurality of first source sample images comprising the virtual face image are obtained through sampling; training a first generation model aiming at each first target sample image and the corresponding first source sample image to mine the face image of the first source sample image and the face state of the first target sample image, so that the trained first face changing model further learns the face image of the virtual face image on the basis of learning the face state of the real face image; the first face changing model is used as a target face changing model, the target face changing model can generate a face generation image which can express the face image of the virtual face image more realistically while ensuring the robustness of face changing, and therefore the texture of the face image of the virtual face after face changing can be improved by applying the target face changing model to change faces.
The face changing model training method and the video-based face changing method can be applied to processing equipment with data processing capacity, such as a server and terminal equipment. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, a cloud server providing cloud computing service, and the like, but is not limited thereto; the terminal device includes, but is not limited to, a mobile phone, a tablet, a computer, an intelligent camera, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, an aircraft, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The application provides a training method of a face changing model, and relates to artificial intelligence. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The basic technologies of artificial intelligence generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.
The training method of the face changing model mainly relates to the large directions of computer vision technology, machine learning/deep learning and the like. Computer Vision technology (CV) is a scientific discipline for researching how to make a machine "see", and more specifically, it refers to that a camera and a Computer are used to replace human eyes to perform machine Vision such as identification, following and measurement on a target, and further to perform graphic processing, so that the Computer processing becomes an image more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can capture information from images or multidimensional data. The computer vision technology generally includes image processing, image recognition, image semantic understanding, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, automatic driving, intelligent transportation and other technologies, and also includes common face recognition and other recognition technologies.
Machine Learning (ML) is a multi-domain cross subject, and relates to multi-domain subjects such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
The training method of the face changing model and the face changing method based on the video can be applied to various scenes including but not limited to cloud technology, artificial intelligence, vehicle-mounted scenes, intelligent traffic, driving assistance and the like.
Next, the method for training a face-changing model provided in the embodiment of the present application will be specifically described below with reference to the server as a training device for the face-changing model.
Referring to fig. 2, this figure is a flowchart of a method for training a face-changing model according to an embodiment of the present application. As shown in fig. 2, the training method of the face-changing model includes the following steps:
s201: sampling the plurality of real face images and the plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images includes a real face image and the plurality of first source sample images includes a virtual face image.
In the related technology, a face changing model is generally obtained by training collected real face images as training samples, and the face changing model is used for generating and processing virtual face images and real face images in a video to obtain face generating images; and replacing the real face image in the video with the face-generating image. However, it has been found through research that the face image of the virtual face image is more exaggerated than the face image of the real face image of the video, and the face-changing model is trained by using the collected real face image as a training sample, so that the face-generating image generated by the face-changing model is closer to the real face image of the video, and thus it is difficult to realistically express the face image of the virtual face image, and the texture of the face image of the virtual face after face-changing in the video is poor.
Therefore, in the embodiment of the present application, in order to solve the above problem, a new face change model needs to be trained, and the new face change model needs to learn the face image of the virtual face image, so as to be able to generate a face generation image that more realistically expresses the face image of the virtual face image; based on this, it is considered that, while the collected real face image is used as a training sample to ensure the face-changing robustness of the new face-changing model, a virtual face image needs to be added on the basis of the real face image as a training sample, so as to obtain the new face-changing model through training.
On the basis of the above description, first, it is necessary to prepare a real face image set and a virtual face image set, that is, a plurality of real face images and a plurality of virtual face images. Then, sampling a batch of images from the real face image set and the virtual face image set each time to obtain a training sample set, wherein each training sample in each training sample set comprises a target sample image and a source sample image, and the source sample image is used for replacing the target sample image; that is, a plurality of real face images and a plurality of virtual face images are sampled, and a plurality of first target sample images and a corresponding plurality of first source sample images are obtained. Considering that a face change scene corresponding to the virtual face image replacing the real face image and the new face change model need to learn the face image of the virtual face image, the plurality of first target sample images need to include the real face image and the plurality of first source sample images need to include the virtual face image. Referring to fig. 3, the figure is a schematic diagram of a real face image and a virtual face image provided in an embodiment of the present application; in fig. 3, (a) shows a real face image, and in fig. 3, (b) shows a virtual face image.
The plurality of real face images may be acquired from an Asian-celeb, vggFace2, or other open source data set. The plurality of virtual face images can be obtained by differently rendering the plurality of virtual face models, the plurality of virtual face models can be obtained from a virtual face model library, and can also be obtained by mixing different virtual face models in the virtual face model library.
As an example, 50+ virtual face models are obtained, and each virtual face model is rendered differently with the preset production software UE (Unreal Engine), for example at different angles and with different postures, different expressions, different illumination, and the like, to obtain a plurality of virtual face images. The different angles may include, for example, a front angle, a left 45° angle, a right 45° angle, an upper 30° angle, a lower 30° angle, and the like.
In the specific implementation of S201, considering that the face-change robustness of the new face-change model is obtained by training based on the plurality of first target sample images and the corresponding plurality of first source sample images, and the new face-change model can generate a face image that more realistically expresses the virtual face image, in the plurality of first target sample images and the corresponding plurality of first source sample images, the real face image and the virtual face image need to satisfy a certain ratio, that is, the first ratio. In addition, the objects to which the first target sample image and the corresponding first source sample image belong may be the same or different, and in consideration of the balance of the training sample, in the plurality of first target sample images and the corresponding plurality of first source sample images, the same object and the different object to which the first target sample image and the corresponding first source sample image belong also need to satisfy another certain ratio, that is, a second ratio. Therefore, in a possible implementation manner of the embodiment of the present application, S201 may specifically be, for example: sampling the plurality of real face images and the plurality of virtual face images according to a first proportion of the real face images and the virtual face images and a second proportion of the same object and different objects to which the first target sample images and the corresponding first source sample images belong, and obtaining a plurality of first target sample images and a plurality of corresponding first source sample images.
As an example, in the plurality of first target sample images and the corresponding plurality of first source sample images, the first ratio of the real face image and the virtual face image may be 9:1, the second ratio of the same object and different objects to which the first target sample image and the corresponding first source sample image belong may be 1:1.
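The sampling rule itself is not spelled out beyond these two ratios, so the following Python sketch is only one assumed realization: the target image is drawn from the real face set with probability 0.9 (first ratio 9:1), and the source image belongs to the same object as the target with probability 0.5 (second ratio 1:1), otherwise to a different (virtual) object. All names are illustrative.

import random

def sample_pair(real_faces, virtual_faces, p_real=0.9, p_same=0.5):
    # real_faces / virtual_faces: dicts mapping an object identifier to a list of
    # face images of that object.
    target_pool = real_faces if random.random() < p_real else virtual_faces
    target_id = random.choice(list(target_pool))
    first_target = random.choice(target_pool[target_id])
    if random.random() < p_same:
        source_id, source_pool = target_id, target_pool           # same object
    else:
        source_pool = virtual_faces                                # virtual source
        source_id = random.choice([k for k in source_pool if k != target_id])
    first_source = random.choice(source_pool[source_id])
    return first_target, first_source, target_id == source_id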
s202: and generating the first target sample image and the first source sample image through a first generation model aiming at each first target sample image and the corresponding first source sample image to obtain a first face generation image.
In the embodiment of the present application, after performing S201 to obtain a plurality of first target sample images and a corresponding plurality of first source sample images, for each first target sample image and the corresponding first source sample image, it is necessary to mine the face image of the first source sample image and the face state of the first target sample image through a generation model, so as to generate a face generation image corresponding to the first target sample image and the first source sample image, that is, a first face generation image; therefore, the first target sample image and the first source sample image are subjected to generation processing by the first generation model to obtain the first face generation image.
As an example, the first target sample image is X_t1 and the first source sample image is X_s1; the first generation model performs generation processing on X_t1 and X_s1 to obtain the first face generation image Y_s1,t1.
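The application does not fix a particular generator architecture here; as a purely illustrative sketch, one common way to realize such generation processing is to encode the face state (attributes) of X_t1 and the face image (identity) of X_s1 separately and decode them jointly, as in the toy module below. All layer choices are assumptions.

import torch
import torch.nn as nn

class ToyGenerationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.attr_encoder = nn.Conv2d(3, 16, 3, padding=1)        # face state of X_t1
        self.id_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(1),  # face image of X_s1
                                        nn.Flatten(), nn.Linear(3, 16))
        self.decoder = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, x_t1, x_s1):
        attr = self.attr_encoder(x_t1)
        ident = self.id_encoder(x_s1)[:, :, None, None]
        return self.decoder(attr + ident)                          # Y_s1,t1

y_s1_t1 = ToyGenerationModel()(torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256))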
S203: training a first generation model by minimizing a first loss function according to a first face generation image, a first target sample image and a first source sample image to obtain a first face changing model; the first loss function is used to calculate a face image loss between the first face-generating image and the first source sample image, and a face state loss between the first face-generating image and the first target sample image.
In the embodiment of the application, the training target is to enable the trained new face change model to learn not only the face state of the first target sample image but also the face image of the first source sample image, so that under the condition that the plurality of first target sample images comprise real face images and the plurality of first source sample images comprise virtual face images, the new face change model learns the face image of the virtual face images on the basis of learning the face state of the real face images; therefore, one loss function, i.e., a first loss function for calculating a face figure loss between the first face generation image and the first source sample image and a face state loss between the first face generation image and the first target sample image is configured for the first generation model.
Based on this, after the first face generation image is obtained in S202, based on the first face generation image, the first target sample image and the first source sample image, the first generation model needs to be trained in a manner of minimizing the first loss function, and a new face change model, i.e., the first face change model, is obtained after the training is completed.
In step S203, first, the first face change model needs to learn the face state of the first target sample image, which actually means that the discriminant features of the first face generation image obtained by the discriminator need to be matched with the discriminant features of the first target sample image obtained by the discriminator; the first loss function needs to include one feature loss function, i.e., a first feature loss function, which is used to calculate the loss between the discriminatory features of the first face-generating image and the discriminatory features of the first target sample image. In addition, when the first target sample image and the corresponding first source sample image belong to the same object, the first face-changing model needs to learn the face state of the first target sample image, and may also be expressed that the first face generation image is a reconstructed image of the first target sample image; the first loss function needs to include one reconstruction loss function, i.e., a first reconstruction loss function, which is used to calculate the reconstruction loss between the first face generation image and the first target sample image.
Then, the first face changing model needs to learn the face image of the first source sample image, which actually means that the object identifier of the first face generation image obtained through the identity recognition model needs to be matched with the object identifier of the first source sample image obtained through the identity recognition model; the first loss function needs to include an identity loss function, i.e. a first identity loss function, which is used to calculate the loss between the object identification of the first face generation image and the object identification of the first source sample image.
Therefore, in a possible implementation manner of the embodiment of the present application, the first loss function includes a first reconstruction loss function, a first identity loss function and a first feature loss function; S203 may include, for example, the following S2031-S2035:
s2031: and if the object identifier of the first source sample image is the same as the object identifier of the first target sample image, obtaining a first reconstruction loss value according to the first face generation image, the first target sample image and the first reconstruction loss function.
As an example, on the basis of the above example, the first reconstruction loss function is as follows:
L_{11} = \mathrm{Mask}_{same\_id} \cdot \lVert Y_{s1,t1} - X_{t1} \rVert_{1}

wherein L_11 is the first reconstruction loss function; Mask_same_id is the reconstruction coefficient, which is 1 if the object identifier of the first source sample image X_s1 is the same as the object identifier of the first target sample image X_t1, and 0 if the object identifiers of X_s1 and X_t1 are different; Y_s1,t1 is the first face generation image; and \lVert\cdot\rVert_1 is the L1 norm.
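A compact PyTorch rendering of this masked reconstruction term is sketched below (names are illustrative; the pixel mean stands in for the L1 norm up to normalization):

import torch

def first_reconstruction_loss(y_s1_t1, x_t1, same_id_mask):
    # same_id_mask: 1.0 when source and target share the same object identifier,
    # 0.0 otherwise (the Mask_same_id coefficient above).
    return same_id_mask * torch.mean(torch.abs(y_s1_t1 - x_t1))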
S2032: a first identity loss value is obtained from the object identification of the first face generation image, the object identification of the first source sample image, and the first identity loss function.
When the first target sample image and the corresponding first source sample image belong to different objects in the specific implementation of S2032, for the facial image of the first source sample image that needs to be learned by the first face replacement model, not only the object identifier of the first face generation image needs to be matched with the object identifier of the first source sample image, but also the first similarity between the object identifier of the first face generation image and the object identifier of the first target sample image needs to be matched with the second similarity between the object identifier of the first source sample image and the object identifier of the first target sample image; the first loss function needs to include two sub-loss functions, namely a first sub-loss function for calculating a loss between the object identification of the first face generation image and the object identification of the first source sample image and a second sub-loss function for calculating a loss between the first similarity and the second similarity. Therefore, in one possible implementation manner of the embodiment of the present application, the first identity loss function includes a first sub-loss function and a second sub-loss function; s2032 may include, for example, the following S1 to S3:
s1: and obtaining a first sub-loss value according to the object identification of the first face generation image, the object identification of the first source sample image and the first sub-loss function.
S2: obtaining a second sub-loss value according to the object identifier of the first face generation image, the object identifier of the first source sample image, the object identifier of the first target sample image and a second sub-loss function; the second sub-loss function is used for calculating a loss between a first similarity and a second similarity, wherein the first similarity is a similarity between an object identifier of the first face generation image and an object identifier of the first target sample image, and the second similarity is a similarity between an object identifier of the first source sample image and an object identifier of the first target sample image.
S3: and summing the first sub-loss value and the second sub-loss value to obtain a first identity loss value.
As an example, on the basis of the above example, the first identity loss function is as follows:
L_{ICL1} = \bigl(1 - \cos(z_{id}(Y_{s1,t1}), z_{id}(X_{s1}))\bigr) + \bigl(\cos(z_{id}(Y_{s1,t1}), z_{id}(X_{t1})) - \cos(z_{id}(X_{s1}), z_{id}(X_{t1}))\bigr)^{2}

wherein L_ICL1 is the first identity loss function; 1 - \cos(z_{id}(Y_{s1,t1}), z_{id}(X_{s1})) is the first sub-loss function; (\cos(z_{id}(Y_{s1,t1}), z_{id}(X_{t1})) - \cos(z_{id}(X_{s1}), z_{id}(X_{t1})))^2 is the second sub-loss function; z_id is the pre-trained identity recognition model; and z_{id}(Y_{s1,t1}), z_{id}(X_{s1}) and z_{id}(X_{t1}) are the object identifiers of the first face generation image Y_s1,t1, the first source sample image X_s1 and the first target sample image X_t1, respectively.
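An equivalent PyTorch-style sketch of this identity term, assuming z_id returns identity embeddings as tensors (an illustrative helper, not a fixed interface):

import torch.nn.functional as F

def first_identity_loss(z_gen, z_src, z_tgt):
    # z_gen, z_src, z_tgt: identity embeddings of Y_s1,t1, X_s1 and X_t1 from z_id.
    first_sub = 1 - F.cosine_similarity(z_gen, z_src, dim=-1)
    second_sub = (F.cosine_similarity(z_gen, z_tgt, dim=-1)
                  - F.cosine_similarity(z_src, z_tgt, dim=-1)) ** 2
    return (first_sub + second_sub).mean()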
S2033: and obtaining a first feature loss value according to the discriminant features of the first face generation image, the discriminant features of the first target sample image and the first feature loss function.
As an example, on the basis of the above example, the first feature loss function is as follows:

L_{FM1} = \sum_{i=1}^{m} \lVert D_{i}(Y_{s1,t1}) - D_{i}(X_{t1}) \rVert_{2}

wherein L_FM1 is the first feature loss function; D_i is the i-th layer of the discriminator and m is the total number of layers of the discriminator; D_{i}(Y_{s1,t1}) is the feature of the first face generation image Y_s1,t1 at layer i and D_{i}(X_{t1}) is the feature of the first target sample image X_t1 at layer i; and \lVert\cdot\rVert_2 is the L2 norm.
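A corresponding sketch of this feature-matching term, assuming the discriminator's per-layer features D_i(·) are available as a list of tensors:

import torch

def first_feature_loss(feats_gen, feats_tgt):
    # feats_gen / feats_tgt: lists [D_1(.), ..., D_m(.)] for Y_s1,t1 and X_t1.
    return sum(torch.norm(fg - ft, p=2) for fg, ft in zip(feats_gen, feats_tgt))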
S2034: and performing weighting processing according to the first reconstruction loss value, the first identity loss value, the first characteristic loss value and the corresponding first preset coefficient to obtain a first total loss value.
In addition, in the embodiment of the application, in order to enable the first face changing model to better learn the facial expression of the first target sample image, the expression category of the first face generation image obtained through the expression recognition model needs to match the expression category of the first target sample image obtained through the expression recognition model; the first loss function may therefore further include an expression loss function, i.e., a first expression loss function, for calculating the loss between the expression category of the first face generation image and the expression category of the first target sample image. It should be noted that the first loss function may further include a generative adversarial loss function corresponding to the first generation model, i.e., a first generative adversarial loss function, for calculating the loss between the discrimination category of the first face generation image and the discrimination category of the first target sample image, both at generation time and at discrimination time. Therefore, in a possible implementation manner of the embodiment of the present application, the first loss function further includes a first expression loss function and a first generative adversarial loss function; the method may, for example, further comprise the following S4-S5:
s4: and obtaining a first expression loss value according to the expression category of the first face generation image, the expression category of the first target sample image and the first expression loss function.
As an example, on the basis of the above example, the first expression loss function is as follows:
L_{EXP1} = \lVert E(Y_{s1,t1}) - E(X_{t1}) \rVert_{2}

wherein L_EXP1 is the first expression loss function; E is the pre-trained expression recognition model; E(Y_{s1,t1}) is the expression category of the first face generation image Y_s1,t1; E(X_{t1}) is the expression category of the first target sample image X_t1; and \lVert\cdot\rVert_2 is the L2 norm.
S5: and obtaining a first generated confrontation loss value according to the discrimination category of the first face generation image, the discrimination category of the first target sample image and the first generated confrontation loss function.
As an example, on the basis of the above example, the first generative adversarial loss function is as follows:

L_{GAN1} = \min_{G}\max_{D}\; \mathbb{E}\bigl[\log D(X_{t1})\bigr] + \mathbb{E}\bigl[\log\bigl(1 - D(Y_{s1,t1})\bigr)\bigr]

wherein L_GAN1 is the first generative adversarial loss function; D is the discriminator and G is the generator; \mathbb{E} denotes expectation; D(Y_{s1,t1}) is the discrimination category of the first face generation image Y_s1,t1; and D(X_{t1}) is the discrimination category of the first target sample image X_t1.
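Illustrative sketches of these last two terms follow. The expression term mirrors the formula above; for the adversarial term, the min-max objective is instantiated here as a binary cross-entropy GAN loss on discriminator logits, which is an assumption about one common realization rather than the specific form used in this application.

import torch
import torch.nn.functional as F

def first_expression_loss(expr_gen, expr_tgt):
    # expr_gen / expr_tgt: outputs of the pre-trained expression model E.
    return torch.norm(expr_gen - expr_tgt, p=2)

def generator_gan_loss(d_logits_gen):
    # Generator side: push D towards classifying Y_s1,t1 as a real target image.
    return F.binary_cross_entropy_with_logits(d_logits_gen, torch.ones_like(d_logits_gen))

def discriminator_gan_loss(d_logits_real, d_logits_gen):
    # Discriminator side: real first target sample vs. generated face image.
    return (F.binary_cross_entropy_with_logits(d_logits_real, torch.ones_like(d_logits_real))
            + F.binary_cross_entropy_with_logits(d_logits_gen, torch.zeros_like(d_logits_gen)))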
Correspondingly, on the basis of S4-S5, S2034 may specifically be, for example: weighting the first reconstruction loss value, the first identity loss value, the first feature loss value, the first expression loss value and the first generative adversarial loss value by the corresponding second preset coefficients to obtain the first total loss value.
As an example, on the basis of the above example, the first loss function is as follows:
L_{total1} = L_{GAN1} + L_{FM1} + 5\,L_{EXP1} + 5\,L_{ICL1} + 10\,L_{11}

wherein L_total1 is the first loss function; the second preset coefficients corresponding to L_GAN1, L_FM1, L_EXP1, L_ICL1 and L_11 are 1, 1, 5, 5 and 10, respectively; that is, the first reconstruction loss value, the first identity loss value, the first feature loss value, the first expression loss value and the first generative adversarial loss value are weighted by second preset coefficients of 10, 5, 1, 5 and 1, respectively.
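Putting the pieces together, the weighted first total loss value can be sketched as follows (the default weights mirror the coefficients listed above and are configurable in practice):

def first_total_loss(l_rec, l_id, l_fm, l_exp, l_gan,
                     w_rec=10.0, w_id=5.0, w_fm=1.0, w_exp=5.0, w_gan=1.0):
    # Weighted sum of the individual loss values with their preset coefficients.
    return w_rec * l_rec + w_id * l_id + w_fm * l_fm + w_exp * l_exp + w_gan * l_gan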
S2035: and training the model parameters of the first generation model by minimizing the first loss function according to the first total loss value until the model converges to obtain a first face changing model.
In practical application, model parameters of a first generation model are trained by minimizing a first loss function based on a first total loss value, an Adam optimizer is used, and the learning rate is 0.0001 until the model converges to obtain a first face changing model.
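Condensing the above, a single generator update could look like the sketch below. It assumes the first generation model and the pre-trained z_id / E / discriminator are PyTorch modules, that the discriminator exposes its per-layer features through a features() method (a hypothetical interface), and uses Adam with learning rate 0.0001 as stated; none of the other details are prescribed by the text.

import torch
import torch.nn.functional as F

def generator_train_step(generator, discriminator, z_id, expr_model, opt_g,
                         x_t1, x_s1, same_id_mask):
    y = generator(x_t1, x_s1)                                      # Y_s1,t1
    l_rec = same_id_mask * torch.mean(torch.abs(y - x_t1))         # L_11
    l_id = (1 - F.cosine_similarity(z_id(y), z_id(x_s1), dim=-1)
            + (F.cosine_similarity(z_id(y), z_id(x_t1), dim=-1)
               - F.cosine_similarity(z_id(x_s1), z_id(x_t1), dim=-1)) ** 2).mean()
    l_fm = sum(torch.norm(a - b, p=2)                              # L_FM1 (features() is hypothetical)
               for a, b in zip(discriminator.features(y), discriminator.features(x_t1)))
    l_exp = torch.norm(expr_model(y) - expr_model(x_t1), p=2)      # L_EXP1
    d_out = discriminator(y)
    l_gan = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    loss = 10 * l_rec + 5 * l_id + l_fm + 5 * l_exp + l_gan        # L_total1
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()

# opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)   # Adam, learning rate 0.0001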
Referring to fig. 4, the drawing is a schematic diagram of a framework for obtaining a first face changing model by training a first generation model through a plurality of first target sample images and a corresponding plurality of first source sample images according to an embodiment of the present application. In the plurality of first target sample images and the corresponding plurality of first source sample images, the first ratio of real face images to virtual face images is 9:1, and the second ratio of same-object pairs to different-object pairs is 1:1. The first target sample image is X_t1 and the first source sample image is X_s1; the first generation model performs generation processing on X_t1 and X_s1 to obtain the first face generation image Y_s1,t1. Based on Y_s1,t1, X_t1 and X_s1, the first generation model is trained by minimizing the first loss function to obtain the first face changing model, where the first loss function L_total1 comprises the first reconstruction loss function L_11, the first identity loss function L_ICL1, the first feature loss function L_FM1, the first expression loss function L_EXP1 and the first generative adversarial loss function L_GAN1.
S204: and determining the first face changing model as a target face changing model.
In the embodiment of the present application, after the obtaining of the first face change model in S203 is performed, since the first face change model is capable of learning not only the face state of the first target sample image but also the face figure of the first source sample image, under the condition that the plurality of first target sample images include real face images and the plurality of first source sample images include virtual face images, the first face change model further learns the face figure of the virtual face image on the basis of learning the face state of the real face image; based on this, the first face-changing model can generate a face generation image that more realistically expresses the face image of the virtual face image while ensuring the face-changing robustness, and then the first face-changing model can be used as the target face-changing model.
In addition, in the embodiment of the present application, the first face change model is obtained by training a first generation model, and if the calculated amount of the first generation model is large, the calculated amount is also large when the first face change model is used as a target face change model for face change, and it is difficult to complete face change in real time in a video interaction process; therefore, in order to solve the problem, another generation model with a small calculation amount, namely a second generation model, can be set up, still by using the prepared real face image set and virtual face image set, the second generation model is trained to obtain a second face change model, the second face change model is used as a target face change model for face change, the calculation amount is small, and face change is completed in real time in the video interaction process.
In practical application, firstly, only the first proportion of the real face image to the virtual face image is used for sampling the real face images and the virtual face images to obtain a plurality of second target sample images and a plurality of corresponding second source sample images; also, the plurality of second target sample images need to include a real face image, and the plurality of second source sample images need to include a virtual face image.
Then, for each second target sample image and the corresponding second source sample image, the face image of the second source sample image and the face state of the second target sample image are mined through a second generation model to generate a face generation image corresponding to the second target sample image and the second source sample image, that is, a second face generation image. Correspondingly, the face image of the second source sample image and the face state of the second target sample image also need to be mined through the first face changing model to generate a third face generation image corresponding to the second target sample image and the second source sample image, and the third face generation image is used as a reference against which the second face generation image is compared.
Then, since the training target is such that the second face change model learns not only the face state of the second target sample image but also the face image of the second source sample image, so that the second face change model further learns the face image of the virtual face image on the basis of learning the face state of the real face image under the condition that the plurality of second target sample images include the real face image and the plurality of second source sample images include the virtual face image; therefore, another loss function configured identically for the second generative model, i.e., a second loss function for calculating a loss of facial image between the second face generation image and the second source sample image, and a loss of facial state between the second face generation image and the second target sample image.
Therefore, in one possible implementation manner of the embodiment of the present application, after S203 and before S204, the method may further include, for example, S6 to S9:
S6: sampling the plurality of real face images and the plurality of virtual face images according to a first proportion of the real face images and the virtual face images to obtain a plurality of second target sample images and a plurality of corresponding second source sample images; the plurality of second target sample images includes a real face image and the plurality of second source sample images includes a virtual face image.
S7: generating and processing each second target sample image and the corresponding second source sample image through a second generation model to obtain a second face generation image; the amount of computation of the second generative model is less than the amount of computation of the first generative model.
Considering that the second face changing model is to be used as the target face changing model to complete face changing in real time during video interaction, the face changing is constrained by the real-time calculation capacity of the face changing equipment on which the target face changing model is to be deployed. The maximum calculation amount allowed for the second generation model, namely the preset calculation amount, can therefore be determined from that real-time calculation capacity, and the calculation amount of the second generation model is less than or equal to the preset calculation amount. Therefore, in a possible implementation manner of the embodiment of the present application, the calculation amount of the second generation model is less than or equal to a preset calculation amount, where the preset calculation amount is determined according to the real-time calculation amount of the face changing equipment to which the target face changing model is to be deployed.
As an example, the second target sample image is $X_{t2}$ and the second source sample image is $X_{s2}$; $X_{t2}$ and $X_{s2}$ are processed by the second generation model to obtain a second face generation image $Y_{s2,t2}$. The preset calculation amount is 500 MFLOPs, so the calculation amount of the second generation model is less than or equal to 500 MFLOPs, where 1 MFLOPs represents one million floating-point operations per second.
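As an illustrative sketch of how such a budget might be checked in practice — assuming a PyTorch-style model and a simple multiply-accumulate counting convention, with all function names being assumptions rather than part of the embodiments above — the per-layer cost of a candidate second generation model can be estimated and compared against the preset calculation amount:

```python
import torch.nn as nn

def conv_flops(layer: nn.Conv2d, out_h: int, out_w: int) -> int:
    """Rough multiply-accumulate count for one Conv2d layer applied to an
    output feature map of spatial size out_h x out_w (assumption: the cost is
    dominated by convolutions)."""
    kh, kw = layer.kernel_size
    cin_per_group = layer.in_channels // layer.groups
    return kh * kw * cin_per_group * layer.out_channels * out_h * out_w

def within_budget(total_flops: int, budget_mflops: float = 500.0) -> bool:
    """Compare an estimated per-frame cost with the preset calculation amount
    (500 MFLOPs in the example above)."""
    return total_flops / 1e6 <= budget_mflops

# Illustrative check for a single 3x3 convolution, 32 -> 64 channels, on a 64x64 map.
layer = nn.Conv2d(32, 64, kernel_size=3, padding=1)
print(within_budget(conv_flops(layer, 64, 64)))
```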
S8: and generating and processing the second target sample image and the second source sample image through the first face changing model to obtain a third face generation image.
As an example, on the basis of the above example, $X_{t2}$ and $X_{s2}$ are processed by the first face changing model to obtain a third face generation image $Y_{teacher}$.
S9: training a second generation model by minimizing a second loss function according to the second face generation image, the third face generation image and the second source sample image to obtain a second face changing model; the second loss function is used to calculate a face image loss between the second face generation image and the second source sample image, and a face state loss between the second face generation image and the third face generation image.
In the specific implementation of S9, referring to the description of the first loss function, and given that the third face generation image serves as the reference for the second face generation image, the distinguishing features of the second face generation image first need to match the distinguishing features of the third face generation image; the second loss function therefore needs to include a feature loss function, namely a second feature loss function, for calculating the loss between the distinguishing features of the second face generation image and those of the third face generation image. Further, the second face generation image is a reconstructed image of the second target sample image and needs to match the third face generation image; the second loss function therefore needs to include a reconstruction loss function, namely a second reconstruction loss function, for calculating the reconstruction loss between the second face generation image and the third face generation image.
Then, in the same way, the object identifier of the second face generation image needs to be matched with the object identifier of the second source sample image; the second loss function needs to include one identity loss function, i.e. a second identity loss function, which is used to calculate the loss between the object identification of the second face generation image and the object identification of the second source sample image.
Therefore, in a possible implementation manner of the embodiment of the present application, the second loss function includes a second reconstruction loss function, a second identity loss function, and a second feature loss function; s9 may include, for example, the following S91-S95:
S91: And obtaining a second reconstruction loss value according to the second face generation image, the third face generation image and the second reconstruction loss function.
As an example, on the basis of the above example, the second reconstruction loss function is as follows:
$$ L_{12} = \left\| Y_{s2,t2} - Y_{teacher} \right\|_1 $$

wherein $L_{12}$ is the second reconstruction loss function, $Y_{s2,t2}$ is the second face generation image, $Y_{teacher}$ is the third face generation image, and $\|\cdot\|_1$ is the L1 norm.
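A minimal sketch of how this term could be computed, assuming PyTorch tensors for the student output $Y_{s2,t2}$ and the teacher output $Y_{teacher}$; the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(y_student: torch.Tensor, y_teacher: torch.Tensor) -> torch.Tensor:
    """L1 distance between the second face generation image (student output)
    and the third face generation image (teacher output)."""
    return F.l1_loss(y_student, y_teacher)
```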
S92: and obtaining a second identity loss value according to the object identifier of the second face generation image, the object identifier of the second source sample image and the second identity loss function.
When the second target sample image and the corresponding second source sample image belong to different objects, not only the object identifier of the second face generation image needs to be matched with the object identifier of the second source sample image, but also the third similarity between the object identifier of the second face generation image and the object identifier of the second target sample image needs to be matched with the fourth similarity between the object identifier of the second source sample image and the object identifier of the second target sample image; the second loss function needs to include two sub-loss functions, namely a third sub-loss function for calculating a loss between the object identification of the second face generation image and the object identification of the second source sample image, and a fourth sub-loss function for calculating a loss between the third similarity and the fourth similarity.
Therefore, in a possible implementation manner of the embodiment of the present application, the second identity loss function includes a third sub-loss function and a fourth sub-loss function; s92 may include, for example: obtaining a third sub-loss value according to the object identifier of the second face generation image, the object identifier of the second source sample image and a third sub-loss function; obtaining a fourth sub-loss value according to the object identifier of the second face generation image, the object identifier of the second source sample image, the object identifier of the second target sample image and a fourth sub-loss function; the fourth sub-loss function is used for calculating the loss between a third similarity and a fourth similarity, wherein the third similarity is the similarity between the object identifier of the second face generation image and the object identifier of the second target sample image, and the fourth similarity is the similarity between the object identifier of the second source sample image and the object identifier of the second target sample image; and summing the third sub-loss value and the fourth sub-loss value to obtain a second identity loss value.
As an example, on the basis of the above example, the second identity loss function is as follows:
$$ L_{ICL2} = 1 - \cos\!\big(z_{id}(Y_{s2,t2}),\, z_{id}(X_{s2})\big) + \big(\cos(z_{id}(Y_{s2,t2}),\, z_{id}(X_{t2})) - \cos(z_{id}(X_{s2}),\, z_{id}(X_{t2}))\big)^2 $$

wherein $L_{ICL2}$ is the second identity loss function, $1-\cos(z_{id}(Y_{s2,t2}), z_{id}(X_{s2}))$ is the third sub-loss function, $(\cos(z_{id}(Y_{s2,t2}), z_{id}(X_{t2})) - \cos(z_{id}(X_{s2}), z_{id}(X_{t2})))^2$ is the fourth sub-loss function, $z_{id}$ is the pre-trained identity recognition model, $z_{id}(Y_{s2,t2})$ is the object identification of the second face generation image $Y_{s2,t2}$, $z_{id}(X_{s2})$ is the object identification of the second source sample image $X_{s2}$, and $z_{id}(X_{t2})$ is the object identification of the second target sample image $X_{t2}$.
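A possible implementation sketch of this identity term, assuming the pre-trained identity model returns one embedding vector per image and that PyTorch is used; the names and the mean reduction over the batch are assumptions:

```python
import torch
import torch.nn.functional as F

def identity_loss(z_student: torch.Tensor, z_source: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Second identity loss: the third sub-loss pulls the student output toward the
    source identity, the fourth sub-loss matches the cosine similarity of the
    student output to the target identity with that of the source to the target."""
    third = 1.0 - F.cosine_similarity(z_student, z_source, dim=-1)
    fourth = (F.cosine_similarity(z_student, z_target, dim=-1)
              - F.cosine_similarity(z_source, z_target, dim=-1)) ** 2
    return (third + fourth).mean()
```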
S93: and obtaining a second feature loss value according to the distinguishing feature of the second face generation image, the distinguishing feature of the third face generation image and the second feature loss function.
As an example, on the basis of the above example, the second characteristic loss function is as follows:
$$ L_{FM2} = \sum_{i=1}^{m} \left\| D_i(Y_{s2,t2}) - D_i(Y_{teacher}) \right\|_2 $$

wherein $L_{FM2}$ is the second feature loss function, $D_i$ is the $i$-th layer of the discriminator, $m$ is the total number of layers of the discriminator, $D_i(Y_{s2,t2})$ is the feature of the second face generation image $Y_{s2,t2}$ at the $i$-th layer, $D_i(Y_{teacher})$ is the feature of the third face generation image $Y_{teacher}$ at the $i$-th layer, and $\|\cdot\|_2$ is the L2 norm.
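A minimal sketch, assuming the discriminator can return a list of per-layer feature maps for an input image (an interface assumed here, not specified by the embodiments):

```python
import torch

def feature_matching_loss(feats_student, feats_teacher):
    """Sum over the m discriminator layers of the L2 distance between the
    features D_i of the student output and those of the teacher output."""
    return sum(torch.norm(fs - ft, p=2) for fs, ft in zip(feats_student, feats_teacher))
```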
S94: and carrying out weighting processing according to the second reconstruction loss value, the second identity loss value, the second characteristic loss value and a corresponding third preset coefficient to obtain a second total loss value.
In addition, in the embodiment of the present application, referring to the description of the first loss function, it follows in the same way that the expression category of the second face generation image needs to match the expression category of the second target sample image; the second loss function may therefore further include an expression loss function, namely a second expression loss function, for calculating the loss between the expression category of the second face generation image and the expression category of the second target sample image. It should be noted that the second loss function may further include a generation adversarial loss function corresponding to the second generation model, namely a second generation adversarial loss function, for calculating the loss between the discrimination category of the second face generation image and the discrimination category of the second target sample image during both the generation pass and the discrimination pass. Therefore, in a possible implementation manner of the embodiment of the present application, the second loss function further includes a second expression loss function and a second generation adversarial loss function; for example, the following S10 to S11 may further be included:
S10: And obtaining a second expression loss value according to the expression category of the second face generation image, the expression category of the second target sample image and the second expression loss function.
As an example, on the basis of the above example, the second expression loss function is as follows:
$$ L_{EXP2} = \left\| E(Y_{s2,t2}) - E(X_{t2}) \right\|_2 $$

wherein $L_{EXP2}$ is the second expression loss function, $E$ is the pre-trained expression recognition model, $E(Y_{s2,t2})$ is the expression category of the second face generation image $Y_{s2,t2}$, $E(X_{t2})$ is the expression category of the second target sample image $X_{t2}$, and $\|\cdot\|_2$ is the L2 norm.
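A minimal sketch of this expression term, assuming the pre-trained expression recognition model E maps an image to an expression feature vector:

```python
import torch

def expression_loss(expr_student: torch.Tensor, expr_target: torch.Tensor) -> torch.Tensor:
    """L2 distance between E(Y_{s2,t2}) and E(X_{t2})."""
    return torch.norm(expr_student - expr_target, p=2)
```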
S10: and obtaining a second generated confrontation loss value according to the discrimination category of the second face generation image, the discrimination category of the second target sample image and the second generated confrontation loss function.
As an example, on the basis of the above example, the second generation adversarial loss function is as follows:

$$ L_{GAN2} = \mathbb{E}\big[\log D(X_{t2})\big] + \mathbb{E}\big[\log\big(1 - D(Y_{s2,t2})\big)\big] $$

wherein $L_{GAN2}$ is the second generation adversarial loss function, $D$ is the discriminator, $D(Y_{s2,t2})$ is the discrimination category of the second face generation image $Y_{s2,t2}$, $D(X_{t2})$ is the discrimination category of the second target sample image $X_{t2}$, $\mathbb{E}$ is the expectation, and $G$ is the generator (the second generation model) that produces $Y_{s2,t2}$, trained to minimize this loss while $D$ is trained to maximize it.
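A possible sketch of the two sides of this adversarial term, written with the common binary cross-entropy formulation as an assumption (the embodiments do not fix a particular GAN objective):

```python
import torch
import torch.nn.functional as F

def generator_adv_loss(d_fake_logits: torch.Tensor) -> torch.Tensor:
    """Generator side: the second face generation image should be judged real."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

def discriminator_adv_loss(d_real_logits: torch.Tensor, d_fake_logits: torch.Tensor) -> torch.Tensor:
    """Discriminator side: real second target sample images judged real,
    student outputs judged fake."""
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake
```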
Correspondingly, S94 may specifically be, for example: performing weighting processing on the second reconstruction loss value, the second identity loss value, the second characteristic loss value, the second expression loss value and the second generation adversarial loss value with corresponding fourth preset coefficients to obtain a second total loss value.
As an example, on the basis of the above example, the second loss function is as follows:
$$ L_{total2} = 10\,L_{12} + 5\,L_{ICL2} + L_{FM2} + 5\,L_{EXP2} + L_{GAN2} $$

wherein $L_{total2}$ is the second loss function; the fourth preset coefficients corresponding to $L_{GAN2}$, $L_{FM2}$, $L_{EXP2}$, $L_{ICL2}$ and $L_{12}$ are 1, 1, 5, 5 and 10, respectively; that is, the second reconstruction loss value, the second identity loss value, the second characteristic loss value, the second expression loss value and the second generation adversarial loss value are weighted by 10, 5, 1, 5 and 1, respectively.
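A one-line sketch of the weighting in S94, with the coefficients above as defaults (an illustrative helper that reuses the loss sketches given earlier):

```python
def total_loss(l_rec, l_id, l_fm, l_exp, l_gan,
               w_rec=10.0, w_id=5.0, w_fm=1.0, w_exp=5.0, w_gan=1.0):
    """Weighted sum of the five terms of the second loss function."""
    return w_rec * l_rec + w_id * l_id + w_fm * l_fm + w_exp * l_exp + w_gan * l_gan
```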
S95: and training the model parameters of the second generation model by minimizing the second loss function according to the second total loss value until the model converges to obtain a second face changing model.
Referring to fig. 5, this is a schematic diagram of a framework provided in the embodiment of the present application for training the second generation model by using a plurality of second target sample images and a corresponding plurality of second source sample images to obtain the second face changing model. Among the plurality of second target sample images and the corresponding plurality of second source sample images, the first proportion of real face images to virtual face images is 9:1. The second target sample image is $X_{t2}$ and the second source sample image is $X_{s2}$; $X_{t2}$ and $X_{s2}$ are processed by the second generation model to obtain a second face generation image $Y_{s2,t2}$, and by the first face changing model to obtain a third face generation image $Y_{teacher}$. Based on $Y_{s2,t2}$, $X_{t2}$, $Y_{teacher}$ and $X_{s2}$, the second generation model is trained by minimizing the second loss function to obtain the second face changing model; the second loss function $L_{total2}$ includes the second reconstruction loss function $L_{12}$, the second identity loss function $L_{ICL2}$, the second feature loss function $L_{FM2}$, the second expression loss function $L_{EXP2}$ and the second generation adversarial loss function $L_{GAN2}$.
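A sketch of one training step under this framework is shown below. It reuses the hypothetical loss helpers sketched above and assumes that both generation models take a (target, source) image batch, that the discriminator returns logits and exposes a features() method giving per-layer feature maps, and that the identity and expression models return feature vectors; none of these interfaces are prescribed by the embodiments.

```python
import torch

def train_step(student, teacher, discriminator, id_model, expr_model,
               x_t2, x_s2, opt_g, opt_d):
    """One distillation step: the frozen first face changing model (teacher)
    produces Y_teacher; the lightweight second generation model (student)
    is updated by minimizing the second loss function."""
    with torch.no_grad():
        y_teacher = teacher(x_t2, x_s2)      # third face generation image
    y_student = student(x_t2, x_s2)          # second face generation image

    # Update the discriminator on real target images vs. detached student outputs.
    opt_d.zero_grad()
    d_loss = discriminator_adv_loss(discriminator(x_t2), discriminator(y_student.detach()))
    d_loss.backward()
    opt_d.step()

    # Update the student generator with the weighted second loss function.
    opt_g.zero_grad()
    g_loss = total_loss(
        reconstruction_loss(y_student, y_teacher),
        identity_loss(id_model(y_student), id_model(x_s2), id_model(x_t2)),
        feature_matching_loss(discriminator.features(y_student),
                              discriminator.features(y_teacher)),
        expression_loss(expr_model(y_student), expr_model(x_t2)),
        generator_adv_loss(discriminator(y_student)),
    )
    g_loss.backward()
    opt_g.step()
    return g_loss.item(), d_loss.item()
```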
Correspondingly, S204 may specifically be, for example: and determining the second face changing model as a target face changing model.
In addition, in the embodiment of the present application, since the calculation amount of the second generation model is small, the second generation model is generally built in a re-parameterization manner in order to ensure the face-changing robustness of the second face changing model obtained by training it, and the model parameters of the second generation model therefore include a plurality of branch model parameters. Based on this, the model parameters of the second face changing model include a plurality of trained branch model parameters, and the second face changing model can be further compressed; that is, the plurality of trained branch model parameters are fused to update the second face changing model and obtain an updated second face changing model, whose model parameters include the fused model parameters. When the updated second face changing model is used as the target face changing model for face changing, the calculation amount is even smaller, so that face changing can be completed in real time during video interaction, which is more suitable for real-time video interaction scenarios such as live video. Therefore, in a possible implementation manner of the embodiment of the present application, when the model parameters of the second generation model include a plurality of branch model parameters, the model parameters of the second face changing model include a plurality of trained branch model parameters, and the method may further include, for example, S12: fusing the plurality of trained branch model parameters to update the second face changing model to obtain an updated second face changing model, where the model parameters of the updated second face changing model include the fused model parameters; correspondingly, S204 may specifically be, for example: determining the updated second face changing model as the target face changing model.
As an example, refer to fig. 6, which is a schematic diagram of a plurality of branch model parameters in the second generation model and the fused model parameters in the updated second face changing model according to an embodiment of the present application. In fig. 6, (a) represents the plurality of branch model parameters in the second generation model, and (b) represents the fused model parameters in the updated second face changing model. The plurality of branch model parameters may be 3 branch convolution kernels, for example 3×3 convolution kernels and a 1×1 convolution kernel. All 3 branch convolution kernels are first converted into 3×3 convolution kernels; when the 1×1 convolution kernel is converted into a 3×3 convolution kernel, its weight is placed at the centre and the surrounding weights are set to 0. The 3 converted 3×3 convolution kernels are then fused by adding their weights w and biases b, and the resulting new 3×3 convolution kernel serves as the fused model parameter.
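A minimal sketch of the fusion step described above, assuming the 3×3 and 1×1 branches are stored as PyTorch weight tensors; an identity branch, if present, would be handled analogously:

```python
import torch
import torch.nn.functional as F

def fuse_branches(w3x3: torch.Tensor, b3x3: torch.Tensor,
                  w1x1: torch.Tensor, b1x1: torch.Tensor):
    """Fuse a 1x1 convolution branch into a 3x3 branch for inference: the 1x1
    kernel is zero-padded so its weight sits at the centre of a 3x3 kernel,
    then the weights w and biases b of the branches are added."""
    w1x1_as_3x3 = F.pad(w1x1, [1, 1, 1, 1])  # pad the last two dims: 1x1 -> 3x3
    return w3x3 + w1x1_as_3x3, b3x3 + b1x1
```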
The training method of the face changing model provided in the above embodiment includes, first, collecting a plurality of real face images and a plurality of virtual face images, and sampling to obtain a plurality of first target sample images and a plurality of corresponding first source sample images, where the plurality of first target sample images include the real face images, and the plurality of first source sample images include the virtual face images; then, generating each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image; finally, training a first generation model in a mode of minimizing a first loss function through the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model as a target face changing model; the first loss function is used to calculate a face image loss between the first face-generating image and the first source sample image, and a face state loss between the first face-generating image and the first target sample image.
It can be seen that, adding a virtual face image on the basis of the real face image, and sampling to obtain a plurality of first target sample images including the real face image and a plurality of first source sample images including the virtual face image; training a first generation model aiming at each first target sample image and the corresponding first source sample image to mine the face image of the first source sample image and the face state of the first target sample image, so that the trained first face changing model further learns the face image of the virtual face image on the basis of learning the face state of the real face image; the first face changing model is used as a target face changing model, the target face changing model can generate a face generation image which expresses the face image of the virtual face image more vividly while ensuring the face changing robustness, and therefore the face changing model is applied to change the face, and the face image texture of the virtual face after face changing can be improved.
On the basis of the above training method of the face changing model, the target face changing model is deployed on face changing equipment. After the face changing equipment acquires a video to be displayed, and when there is a requirement to replace a real face image in the video to be displayed with a virtual face image, the face changing equipment first needs to acquire the real face image to be replaced in the video to be displayed, and then needs to determine the virtual face image that is to replace it, namely the preset virtual face image. Next, the face changing equipment generates and processes the real face image to be replaced and the preset virtual face image through the target face changing model, so as to obtain a virtual face image that matches the face image of the preset virtual face image and the face state of the real face image to be replaced, namely the target virtual face image, which expresses the face image of the preset virtual face image more realistically. Finally, when displaying the video to be displayed, the face changing equipment replaces the real face image to be replaced with the target virtual face image to complete face changing, and the face image texture of the virtual face after face changing is better.
Next, the video-based face changing method provided in the embodiment of the present application is specifically described below with respect to the server or the terminal device as the face changing device.
Referring to fig. 7, the flowchart of a method for changing a face based on a video according to an embodiment of the present application is shown. As shown in fig. 7, the video-based face changing method includes the following steps:
S701: And acquiring a real face image to be replaced in the video to be displayed.
S702: and determining a preset virtual face image corresponding to the real face image to be replaced.
S703: generating and processing a real face image to be changed and a preset virtual face image through a target face changing model to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced.
S704: and replacing the real face image to be replaced with the target virtual face image when the video to be displayed is displayed.
Referring to fig. 8, the figure is a schematic diagram of an effect of replacing a real face image to be replaced in a video to be displayed with a target virtual face image according to an embodiment of the present application. Wherein, in fig. 8, (a) represents a to-be-replaced real face image in a to-be-displayed video, in fig. 8, (b) represents a preset virtual face image corresponding to the to-be-replaced real face image, and in fig. 8, (c) represents a target virtual face image obtained by generating and processing the to-be-replaced real face image and the preset virtual face image through a target face-replacing model; the target virtual face image expresses the face image of the preset virtual face image more vividly, and the face image of the virtual face after face changing has better texture.
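A minimal inference sketch for S703, assuming the deployed target face changing model takes a batch of (real face, preset virtual face) image tensors; the function and argument names are illustrative:

```python
import torch

@torch.no_grad()
def swap_frame(model, real_face: torch.Tensor, preset_virtual_face: torch.Tensor) -> torch.Tensor:
    """Generate the target virtual face image for one frame: the output matches the
    face image of the preset virtual face and the face state of the real face."""
    model.eval()
    return model(real_face.unsqueeze(0), preset_virtual_face.unsqueeze(0)).squeeze(0)
```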
In addition, in the embodiment of the application, considering that the preset virtual face image may have a proprietary attribute, that is, the preset virtual face image only belongs to the preset object, the preset object and the preset virtual face image may be bound to obtain a preset binding relationship indicating that the preset object and the preset virtual face image are bound; after the face changing equipment acquires a video to be displayed, face recognition is carried out on the video to be displayed to obtain a recognized real face image, and the face characteristics of the recognized real face image are acquired; and judging whether the facial features of the recognized real facial image are matched with the facial features of the preset object or not, if so, indicating that the recognized real facial image is the real facial image to be replaced in the video to be displayed. The method can only change the faces of the real face images bound with the virtual face images in the video, so that normal face changing can still be realized when the real face images of different objects are switched or a plurality of objects are simultaneously in one video picture, and in addition, the copyright protection of the virtual face images can also be realized. Therefore, in a possible implementation manner of the embodiment of the present application, before S701, the method may further include, for example, the following S13 to S15:
S13: And binding the facial features of the preset object with the preset virtual facial image to obtain a preset binding relationship.
S14: and acquiring the facial features of the recognized real facial image in the video to be displayed.
S15: and if the facial features of the recognized real facial image are matched with the facial features of the preset object, determining the recognized real facial image as the real facial image to be replaced.
Correspondingly, S702 may specifically be, for example: and determining a preset virtual face image according to the face features of the real face image to be replaced and the preset binding relationship.
In addition, in the embodiment of the present application, there may be a case where the facial feature of the recognized real face image does not match the facial feature of the preset object, which indicates that the recognized real face image does not need to be changed in face, and the recognized real face image is multiplexed when the video to be displayed is displayed. Therefore, in a possible implementation manner of the embodiment of the present application, the method may further include, for example, S16: and if the facial features of the recognized real facial image are not matched with the facial features of the preset object, multiplexing the recognized real facial image when displaying the video to be displayed.
Referring to fig. 9, the figure is a flowchart of another method for changing a video face based on a preset binding relationship according to an embodiment of the present application. The first step is as follows: binding the facial features of the preset object with the preset virtual facial image to obtain a preset binding relationship; the second step: acquiring the face features of the recognized real face image in the video to be displayed; the third step: judging whether the facial features of the recognized real facial image are matched with the facial features of the preset object or not, if so, executing the fourth step, and if not, executing the sixth step; the fourth step: determining the identified real face image as a real face image to be replaced, and generating and processing the real face image to be replaced and a preset virtual face image through a target face replacing model to obtain a target virtual face image; the fifth step: replacing the real face image to be replaced with a target virtual face image when displaying the video to be displayed; and a sixth step: and multiplexing the identified real face image when displaying the video to be displayed.
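A sketch of the matching decision in the third step, assuming face features are embedding tensors compared by cosine similarity; the binding structure and the 0.6 threshold are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def select_face_for_swap(frame_embedding: torch.Tensor, preset_bindings: dict, threshold: float = 0.6):
    """Return (object name, bound preset virtual face) if the recognised face matches
    a preset object, otherwise None (the recognised real face is then reused as-is)."""
    for name, (ref_feature, virtual_face) in preset_bindings.items():
        if F.cosine_similarity(frame_embedding, ref_feature, dim=-1).item() >= threshold:
            return name, virtual_face
    return None
```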
The video-based face changing method provided by this embodiment first obtains the real face image to be replaced in the video to be displayed, and then determines the virtual face image that is to replace it, namely the preset virtual face image; next, the real face image to be replaced and the preset virtual face image are generated and processed through the target face changing model to obtain a virtual face image that matches the face image of the preset virtual face image and the face state of the real face image to be replaced, namely the target virtual face image, which expresses the face image of the preset virtual face image more realistically; finally, when the video to be displayed is displayed, the real face image to be replaced is replaced with the target virtual face image to complete face changing, and the face image texture of the virtual face after face changing is better.
For the above-described training method of the face change model, the embodiment of the present application further provides a training device for the face change model, and the following specifically introduces the training device for the face change model provided in the embodiment of the present application.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a training device for a face changing model according to an embodiment of the present application. As shown in fig. 10, the training apparatus 1000 for face-changing model includes: a sampling unit 1001, a first generating unit 1002, a training unit 1003, and a first determining unit 1004;
a sampling unit 1001 configured to sample a plurality of real face images and a plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images include a real face image, and the plurality of first source sample images include a virtual face image;
a first generating unit 1002, configured to perform generation processing on each first target sample image and a corresponding first source sample image through a first generation model to obtain a first face generation image;
a training unit 1003, configured to train a first generation model by minimizing a first loss function according to the first face generation image, the first target sample image, and the first source sample image, to obtain a first face change model; a first loss function for calculating a face image loss between the first face generation image and the first source sample image and a face state loss between the first face generation image and the first target sample image;
a first determining unit 1004 for determining the first face change model as a target face change model.
As a possible implementation manner, the sampling unit 1001 is specifically configured to:
and sampling the plurality of real face images and the plurality of virtual face images according to a first proportion of the real face images and the virtual face images and a second proportion of the same object and different objects to which the first target sample images and the corresponding first source sample images belong, and obtaining a plurality of first target sample images and a plurality of corresponding first source sample images.
As a possible implementation, the first loss function includes a first reconstruction loss function, a first identity loss function, and a first characteristic loss function; a training unit 1003 including: a first obtaining subunit and a first training subunit;
a first obtaining subunit, configured to obtain a first reconstruction loss value according to the first face generation image, the first target sample image, and the first reconstruction loss function if the object identifier of the first source sample image is the same as the object identifier of the first target sample image;
the first obtaining subunit is further configured to obtain a first identity loss value according to the object identifier of the first face generation image, the object identifier of the first source sample image, and the first identity loss function;
the first obtaining subunit is further configured to obtain a first feature loss value according to the distinguishing feature of the first face generation image, the distinguishing feature of the first target sample image, and the first feature loss function;
the first obtaining subunit is further configured to perform weighting processing according to the first reconstruction loss value, the first identity loss value, the first characteristic loss value, and a corresponding first preset coefficient, so as to obtain a first total loss value;
and the first training subunit is used for training the model parameters of the first generation model by minimizing the first loss function according to the first total loss value until the model converges to obtain a first face changing model.
As a possible implementation, the first identity loss function includes a first sub-loss function and a second sub-loss function; a first obtaining subunit, specifically configured to:
obtaining a first sub-loss value according to the object identification of the first face generation image, the object identification of the first source sample image and the first sub-loss function;
obtaining a second sub-loss value according to the object identifier of the first face generation image, the object identifier of the first source sample image, the object identifier of the first target sample image and a second sub-loss function; the second sub-loss function is used for calculating the loss between a first similarity and a second similarity, wherein the first similarity is the similarity between the object identifier of the first face generation image and the object identifier of the first target sample image, and the second similarity is the similarity between the object identifier of the first source sample image and the object identifier of the first target sample image;
and summing the first sub-loss value and the second sub-loss value to obtain a first identity loss value.
As a possible implementation, the first loss function further includes a first expression loss function and a first generation adversarial loss function; the first obtaining subunit is further configured to:
obtaining a first expression loss value according to the expression category of the first face generation image, the expression category of the first target sample image and the first expression loss function;
obtaining a first generation adversarial loss value according to the discrimination category of the first face generation image, the discrimination category of the first target sample image and the first generation adversarial loss function;
a first obtaining subunit, specifically configured to:
and weighting the anti-loss value and a corresponding second preset coefficient according to the first reconstruction loss value, the first identity loss value, the first characteristic loss value, the first expression loss value and the first generation to obtain a first total loss value.
As a possible implementation manner, the sampling unit 1001 is further configured to sample the multiple real face images and the multiple virtual face images according to a first ratio between the real face images and the virtual face images, and obtain multiple second target sample images and multiple corresponding second source sample images; the plurality of second target sample images include a real face image, and the plurality of second source sample images include a virtual face image;
the first generating unit 1002 is further configured to perform, for each second target sample image and corresponding second source sample image, generation processing on the second target sample image and the second source sample image through a second generation model to obtain a second face generation image; the calculation amount of the second generative model is less than that of the first generative model;
the first generating unit 1002 is further configured to perform generation processing on the second target sample image and the second source sample image through the first face changing model, so as to obtain a third face generation image;
a training unit 1003, configured to train a second generation model by minimizing a second loss function according to the second face generation image, the third face generation image, and the second source sample image, to obtain a second face change model; a second loss function for calculating a face image loss between the second face generation image and the second source sample image, and a face state loss between the second face generation image and the third face generation image;
the first determining unit 1004 is specifically configured to:
and determining the second face changing model as a target face changing model.
As a possible implementation, the training unit 1003 includes: a second obtaining subunit and a second training subunit;
a second obtaining subunit, configured to obtain a second reconstruction loss value according to the second face generation image, the third face generation image, and the second reconstruction loss function;
the second obtaining subunit is further configured to obtain a second identity loss value according to the object identifier of the second face generation image, the object identifier of the second source sample image, and the second identity loss function;
the second obtaining subunit is further configured to obtain a second feature loss value according to the distinguishing feature of the second face generation image, the distinguishing feature of the third face generation image, and the second feature loss function;
the second obtaining subunit is further configured to perform weighting processing according to the second reconstruction loss value, the second identity loss value, the second characteristic loss value, and a corresponding third preset coefficient, so as to obtain a second total loss value;
and the second training subunit is used for training the model parameters of the second generation model by minimizing the second loss function according to the second total loss value until the model converges to obtain a second face changing model.
As a possible implementation manner, the second loss function further includes a second expression loss function and a second generation adversarial loss function; the second obtaining subunit is further configured to:
obtaining a second expression loss value according to the expression category of the second face generation image, the expression category of the second target sample image and a second expression loss function;
obtaining a second generation adversarial loss value according to the discrimination category of the second face generation image, the discrimination category of the second target sample image and the second generation adversarial loss function;
a second obtaining subunit, specifically configured to:
and performing weighting processing according to the second reconstruction loss value, the second identity loss value, the second characteristic loss value, the second expression loss value, the second generated confrontation loss value and a corresponding fourth preset coefficient to obtain a second total loss value.
As a possible implementation manner, when the model parameters of the second generation model include a plurality of branch model parameters, the model parameters of the second face-changing model include a plurality of branch model parameters after training, and the apparatus further includes: an update unit;
the updating unit is used for fusing the parameters of the trained branch models to update the second face changing model to obtain an updated second face changing model; the updated model parameters of the second face changing model comprise the fused model parameters;
the determining unit 1004 is specifically configured to:
and determining the updated second face changing model as a target face changing model.
As a possible implementation manner, the calculation amount of the second generation model is less than or equal to a preset calculation amount, and the preset calculation amount is determined according to the real-time calculation amount of the face changing device to which the target face changing model is to be deployed.
The training device for the face changing model provided in the above embodiment first collects a plurality of real face images and a plurality of virtual face images, and samples to obtain a plurality of first target sample images and a plurality of corresponding first source sample images, where the plurality of first target sample images include the real face images, and the plurality of first source sample images include the virtual face images; then, generating each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image; finally, training a first generation model in a mode of minimizing a first loss function through the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model serving as a target face changing model; the first loss function is used to calculate a face image loss between the first face-generating image and the first source sample image, and a face state loss between the first face-generating image and the first target sample image.
It can be seen that, adding a virtual face image on the basis of the real face image, and sampling to obtain a plurality of first target sample images including the real face image and a plurality of first source sample images including the virtual face image; training a first generation model aiming at each first target sample image and the corresponding first source sample image to mine the face image of the first source sample image and the face state of the first target sample image, so that the trained first face changing model further learns the face image of the virtual face image on the basis of learning the face state of the real face image; the first face changing model is used as a target face changing model, the target face changing model can generate a face generation image which can express the face image of the virtual face image more realistically while ensuring the robustness of face changing, and therefore the texture of the face image of the virtual face after face changing can be improved by applying the target face changing model to change faces.
For the video-based face changing method described above, an embodiment of the present application further provides a video-based face changing device, and the following specifically introduces the video-based face changing device provided in the embodiment of the present application.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a video-based face changing apparatus according to an embodiment of the present application. As shown in fig. 11, the video-based face-changing apparatus 1100 includes: an acquisition unit 1101, a second determination unit 1102, a second generation unit 1103, and a replacement unit 1104;
an obtaining unit 1101, configured to obtain a to-be-replaced real face image in a to-be-displayed video;
a second determining unit 1102, configured to determine a preset virtual face image corresponding to the real face image to be replaced;
a second generating unit 1103, configured to generate and process the to-be-replaced real face image and the preset virtual face image through the target face replacing model, so as to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced;
a replacing unit 1104 for replacing the real face image to be replaced with the target virtual face image when the video to be displayed is displayed;
the target face-changing model is obtained by executing the training method of the face-changing model of the above embodiment.
As a possible implementation manner, the apparatus further includes: a binding unit;
The binding unit is used for binding the facial features of the preset object with the preset virtual facial image to obtain a preset binding relationship;
the acquiring unit 1101 is further configured to acquire a facial feature of the recognized real face image in the video to be displayed;
the second determining unit 1102 is further configured to determine the recognized real face image as a real face image to be replaced if the facial features of the recognized real face image match with the facial features of the preset object;
the second determining unit 1102 is specifically configured to:
and determining a preset virtual face image according to the face features of the real face image to be replaced and the preset binding relationship.
As a possible implementation, the apparatus further includes: a multiplexing unit;
and the multiplexing unit is used for multiplexing the identified real face image when the video to be displayed is displayed if the facial features of the identified real face image are not matched with the facial features of the preset object.
The video-based face changing apparatus provided by this embodiment first acquires the real face image to be replaced in the video to be displayed, and then determines the virtual face image that is to replace it, namely the preset virtual face image; next, the real face image to be replaced and the preset virtual face image are generated and processed through the target face changing model to obtain a virtual face image that matches the face image of the preset virtual face image and the face state of the real face image to be replaced, namely the target virtual face image, which expresses the face image of the preset virtual face image more realistically; finally, when the video to be displayed is displayed, the real face image to be replaced is replaced with the target virtual face image to complete face changing, and the face image texture of the virtual face after face changing is better.
For the above-described training method of the face change model and the video-based face change method, the embodiment of the present application further provides a computer device that can serve as the training device of the face change model or as the video-based face change device, so that the above methods can be implemented and applied in practice; the computer device provided by the embodiment of the present application is introduced below from the perspective of hardware implementation.
Referring to fig. 12, fig. 12 is a schematic diagram of a server 1200, which may have a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 1222 (e.g., one or more processors) and a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) for storing applications 1242 or data 1244 according to an embodiment of the present application. Memory 1232 and storage media 1230, among other things, can be transient or persistent storage. The program stored in the storage medium 1230 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 1222 may be configured to communicate with the storage medium 1230, to execute a series of instruction operations in the storage medium 1230 on the server 1200.
The server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 12.
The CPU 1222 is configured to perform the following steps:
sampling the plurality of real face images and the plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images include a real face image, and the plurality of first source sample images include a virtual face image;
generating and processing each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image;
training a first generation model by minimizing a first loss function according to a first face generation image, a first target sample image and a first source sample image to obtain a first face changing model; a first loss function for calculating a face image loss between the first face generation image and the first source sample image and a face state loss between the first face generation image and the first target sample image;
and determining the first face changing model as a target face changing model.
The CPU 1222 is further configured to perform the following steps:
acquiring a real face image to be replaced in a video to be displayed;
determining a preset virtual face image corresponding to a real face image to be replaced;
generating and processing a real face image to be changed and a preset virtual face image through a target face changing model to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced;
replacing the real face image to be replaced with a target virtual face image when displaying the video to be displayed;
the target face-changing model is obtained by executing the training method of the face-changing model described in the above embodiment.
Optionally, the CPU 1222 may further perform the method steps of any specific implementation of the training method of the face changing model or the video-based face changing method in the embodiment of the present application.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown, and details of the specific technology are not disclosed. The terminal device can be any terminal device including a mobile phone, a tablet computer, a PDA and the like, taking the terminal device as the mobile phone as an example:
fig. 13 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 13, the cellular phone includes: radio Frequency (RF) circuitry 1310, memory 1320, input unit 1330, display unit 1340, sensor 1350, audio circuitry 1360, wireless fidelity (WiFi) module 1370, processor 1380, and power supply 1390. Those skilled in the art will appreciate that the handset configuration shown in fig. 13 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 13:
RF circuit 1310 may be used for receiving and transmitting signals during message transmission or a call; in particular, after receiving downlink information from a base station, the RF circuit passes it to processor 1380 for processing, and in addition, uplink data is transmitted to the base station. In general, the RF circuit 1310 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 1310 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1320 may be used to store software programs and modules, and the processor 1380 may implement various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 1320. The memory 1320 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1320 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 1330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 1330 may include a touch panel 1331 and other input devices 1332. Touch panel 1331, also referred to as a touch screen, can collect touch operations by a user (e.g., operations by a user on or near touch panel 1331 using any suitable object or accessory such as a finger, a stylus, etc.) and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 1331 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1380, where the touch controller can receive and execute commands sent by the processor 1380. In addition, the touch panel 1331 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 1330 may include other input devices 1332 in addition to the touch panel 1331. In particular, other input devices 1332 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1340 may be used to display information input by or provided to the user and various menus of the mobile phone. The Display unit 1340 may include a Display panel 1341, and optionally, the Display panel 1341 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, touch panel 1331 can overlay display panel 1341, and when touch panel 1331 detects a touch event thereon or nearby, the touch event can be transmitted to processor 1380 to determine the type of touch event, and then processor 1380 can provide a corresponding visual output on display panel 1341 according to the type of touch event. Although in fig. 13, the touch panel 1331 and the display panel 1341 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1331 and the display panel 1341 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1350, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 1341 according to the brightness of ambient light, and a proximity sensor, which turns off the display panel 1341 and/or the backlight when the mobile phone is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes) and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration-recognition-related functions (such as a pedometer and tapping). The mobile phone may be further configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described herein again.
The audio circuit 1360, speaker 1361, and microphone 1362 may provide an audio interface between the user and the handset. The audio circuit 1360 can transmit the electrical signal converted from the received audio data to the speaker 1361, where it is converted into a sound signal and output; on the other hand, the microphone 1362 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1360 and converted into audio data; the audio data is then processed by the processor 1380 and sent via the RF circuit 1310 to, for example, another mobile phone, or output to the memory 1320 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1370, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 13 shows the WiFi module 1370, it is understood that the module is not an essential component of the handset and may be omitted as needed without changing the essence of the invention.
The processor 1380 is the control center of the mobile phone. It connects the various parts of the entire phone using various interfaces and lines, and performs the functions of the phone and processes data by running or executing software programs and/or modules stored in the memory 1320 and calling data stored in the memory 1320, thereby monitoring the phone as a whole. Optionally, the processor 1380 may include one or more processing units; preferably, the processor 1380 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 1380.
The handset also includes a power supply 1390 (such as a battery) that supplies power to the various components. Preferably, the power supply is logically coupled to the processor 1380 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described herein again.
In an embodiment of the present application, the handset includes the memory 1320, which stores program code and transfers the program code to the processor.
The processor 1380 included in the handset may perform the training method of the face changing model or the video-based face changing method provided in the various alternative implementations of the above aspects according to instructions in the program code.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute the face changing model training method or the video-based face changing method provided in the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the training method of the face changing model or the video-based face changing method provided in the various alternative implementations of the above aspects.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium may be at least one of the following media capable of storing program code: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that the embodiments in this specification are described in a progressive manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments, and the relevant points can be found in the corresponding descriptions of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
The above description is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can be readily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A training method of a face changing model, characterized by comprising the following steps:
sampling a plurality of real face images and a plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images includes the real face image, the plurality of first source sample images includes the virtual face image;
performing generation processing on each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image;
training the first generation model by minimizing a first loss function according to the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model; the first loss function is used for calculating a face image loss between the first face generation image and the first source sample image and a face state loss between the first face generation image and the first target sample image;
sampling the plurality of real face images and the plurality of virtual face images according to a first proportion of the real face images and the virtual face images to obtain a plurality of second target sample images and a plurality of corresponding second source sample images; the plurality of second target sample images includes the real face image, the plurality of second source sample images includes the virtual face image;
performing generation processing on each second target sample image and the corresponding second source sample image through a second generation model to obtain a second face generation image; the computation amount of the second generation model is less than that of the first generation model;
performing generation processing on the second target sample image and the second source sample image through the first face changing model to obtain a third face generation image;
training the second generation model by minimizing a second loss function according to the second face generation image, the third face generation image and the second source sample image to obtain a second face changing model; the second loss function is used for calculating face image loss between the second face generation image and the second source sample image and face state loss between the second face generation image and the third face generation image;
and determining the second face changing model as a target face changing model.
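By way of illustration only, and not as part of the claimed subject matter, the two-stage procedure of claim 1 can be read as a teacher–student training loop: the larger first generation model is trained first, and the lighter second generation model is then trained with the first model's output supervising its face state. The sketch below is a minimal PyTorch-style outline under that reading; the `teacher`, `student`, `sample_pairs`, `appearance_loss`, and `state_loss` arguments are hypothetical stand-ins, not names defined in this application.

```python
# Illustrative two-stage training loop in the spirit of claim 1.
# All helper arguments (sample_pairs, appearance_loss, state_loss) are hypothetical.
import torch

def train_face_swap_models(real_faces, virtual_faces, teacher, student,
                           sample_pairs, appearance_loss, state_loss,
                           steps=10000, lr=1e-4):
    # Stage 1: train the larger first generation model ("teacher").
    opt_t = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(steps):
        target, source = sample_pairs(real_faces, virtual_faces)   # first target/source samples
        generated = teacher(target, source)                        # first face generation image
        loss = appearance_loss(generated, source) + state_loss(generated, target)
        opt_t.zero_grad(); loss.backward(); opt_t.step()

    # Stage 2: train the lightweight second generation model ("student"),
    # supervising its face state with the teacher's output (the third face generation image).
    opt_s = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        target, source = sample_pairs(real_faces, virtual_faces)   # second target/source samples
        with torch.no_grad():
            teacher_out = teacher(target, source)                  # third face generation image
        student_out = student(target, source)                      # second face generation image
        loss = appearance_loss(student_out, source) + state_loss(student_out, teacher_out)
        opt_s.zero_grad(); loss.backward(); opt_s.step()

    return student  # candidate target face changing model
```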
2. The method according to claim 1, wherein the sampling of the plurality of real face images and the plurality of virtual face images to obtain a plurality of first target sample images and a corresponding plurality of first source sample images is specifically:
and sampling the plurality of real face images and the plurality of virtual face images according to a first proportion of the real face images and the virtual face images and a second proportion of the same object and different objects to which the first target sample images and the corresponding first source sample images belong, so as to obtain the plurality of first target sample images and the corresponding plurality of first source sample images.
3. The method of claim 2, wherein the first loss function comprises a first reconstruction loss function, a first identity loss function and a first feature loss function; and the training the first generation model by minimizing a first loss function according to the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model comprises:
if the object identifier of the first source sample image is the same as the object identifier of the first target sample image, obtaining a first reconstruction loss value according to the first face generation image, the first target sample image and the first reconstruction loss function;
obtaining a first identity loss value according to the object identifier of the first face generation image, the object identifier of the first source sample image and the first identity loss function;
obtaining a first feature loss value according to the distinguishing feature of the first face generation image, the distinguishing feature of the first target sample image and the first feature loss function;
performing weighting processing according to the first reconstruction loss value, the first identity loss value, the first feature loss value and a corresponding first preset coefficient to obtain a first total loss value;
and training model parameters of the first generation model by minimizing the first loss function according to the first total loss value until the model converges to obtain the first face changing model.
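As an informal illustration of the conditional reconstruction term in claim 3 (not a definitive implementation), the reconstruction loss can be applied only when the source and target samples carry the same object identifier; the L1 distance used below is an assumption, since the claim does not fix the distance metric.

```python
# Illustrative conditional reconstruction loss (claim 3): supervision is applied only
# when the source and target samples share the same object identifier.
# The L1 distance is an assumption; the claim does not fix the metric.
import torch

def reconstruction_loss(generated, target, source_id, target_id):
    if source_id != target_id:
        return torch.zeros((), device=generated.device)  # no reconstruction term for cross-identity pairs
    return (generated - target).abs().mean()
```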
4. The method of claim 3, wherein the first identity loss function comprises a first sub-loss function and a second sub-loss function; and the obtaining a first identity loss value according to the object identifier of the first face generation image, the object identifier of the first source sample image and the first identity loss function comprises:
obtaining a first sub-loss value according to the object identifier of the first face generation image, the object identifier of the first source sample image and the first sub-loss function;
obtaining a second sub-loss value according to the object identifier of the first face generation image, the object identifier of the first source sample image, the object identifier of the first target sample image and the second sub-loss function; the second sub-loss function is configured to calculate a loss between a first similarity and a second similarity, where the first similarity is a similarity between an object identifier of the first face generation image and an object identifier of the first target sample image, and the second similarity is a similarity between an object identifier of the first source sample image and an object identifier of the first target sample image;
and summing the first sub-loss value and the second sub-loss value to obtain the first identity loss value.
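Claim 4's two sub-losses can be pictured as (i) pulling the generated image's identity toward the source identity and (ii) matching the generated-to-target identity similarity to the source-to-target identity similarity. The following sketch is one plausible reading; cosine similarity over identity embeddings is an assumption, not something the claim prescribes.

```python
# Illustrative identity loss with two sub-losses (claim 4).
# Cosine similarity over identity embeddings is an assumption, not claim language.
import torch.nn.functional as F

def identity_loss(id_generated, id_source, id_target):
    # First sub-loss: pull the generated image's identity toward the source identity.
    sub1 = 1.0 - F.cosine_similarity(id_generated, id_source, dim=-1).mean()
    # Second sub-loss: match the generated-to-target similarity to the source-to-target similarity.
    sim_gen_tgt = F.cosine_similarity(id_generated, id_target, dim=-1)
    sim_src_tgt = F.cosine_similarity(id_source, id_target, dim=-1)
    sub2 = (sim_gen_tgt - sim_src_tgt).abs().mean()
    # Summing the two sub-loss values gives the identity loss value.
    return sub1 + sub2
```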
5. The method of claim 3 or 4, wherein the first loss function further comprises a first expression loss function and a first generative adversarial loss function; and the method further comprises:
obtaining a first expression loss value according to the expression category of the first face generation image, the expression category of the first target sample image and the first expression loss function;
obtaining a first generative adversarial loss value according to the discrimination category of the first face generation image, the discrimination category of the first target sample image and the first generative adversarial loss function;
the performing weighting processing according to the first reconstruction loss value, the first identity loss value, the first feature loss value and a corresponding first preset coefficient to obtain a first total loss value specifically comprises:
and performing weighting processing according to the first reconstruction loss value, the first identity loss value, the first feature loss value, the first expression loss value, the first generative adversarial loss value and a corresponding second preset coefficient to obtain the first total loss value.
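Read together, claims 3 to 5 describe a first total loss that weights reconstruction, identity, feature, expression, and generative adversarial terms with preset coefficients. A minimal sketch of that weighting is shown below; the coefficient values are placeholders, not values disclosed in this application.

```python
# Illustrative weighted first total loss (claims 3-5). The coefficient values are
# placeholders, not values disclosed in this application.
def first_total_loss(rec, idt, feat, expr, adv,
                     w_rec=10.0, w_id=5.0, w_feat=1.0, w_expr=1.0, w_adv=0.1):
    # Each argument is an already-computed scalar loss term; the preset coefficients
    # weight the terms before summation.
    return w_rec * rec + w_id * idt + w_feat * feat + w_expr * expr + w_adv * adv
```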
6. The method of claim 1, wherein the second loss function comprises a second reconstruction loss function, a second identity loss function and a second feature loss function; and the training the second generation model by minimizing a second loss function according to the second face generation image, the third face generation image and the second source sample image to obtain a second face changing model comprises:
obtaining a second reconstruction loss value according to the second face generation image, the third face generation image and the second reconstruction loss function;
obtaining a second identity loss value according to the object identifier of the second face generation image, the object identifier of the second source sample image and the second identity loss function;
obtaining a second feature loss value according to the distinguishing feature of the second face generation image, the distinguishing feature of the third face generation image and the second feature loss function;
performing weighting processing according to the second reconstruction loss value, the second identity loss value, the second feature loss value and a corresponding third preset coefficient to obtain a second total loss value;
and training the model parameters of the second generation model by minimizing the second loss function according to the second total loss value until the model converges to obtain the second face changing model.
7. The method of claim 6, wherein the second loss function further comprises a second expression loss function and a second generative adversarial loss function; and the method further comprises:
obtaining a second expression loss value according to the expression category of the second face generation image, the expression category of the second target sample image and the second expression loss function;
obtaining a second generative adversarial loss value according to the discrimination category of the second face generation image, the discrimination category of the second target sample image and the second generative adversarial loss function;
the performing weighting processing according to the second reconstruction loss value, the second identity loss value, the second feature loss value and a corresponding third preset coefficient to obtain a second total loss value specifically comprises:
and performing weighting processing according to the second reconstruction loss value, the second identity loss value, the second feature loss value, the second expression loss value, the second generative adversarial loss value and a corresponding fourth preset coefficient to obtain the second total loss value.
8. The method according to any one of claims 1 to 7, wherein, when the model parameters of the second generation model include a plurality of branch model parameters, the model parameters of the second face changing model include a plurality of trained branch model parameters, and the method further comprises:
fusing the trained branch model parameters to update the second face changing model to obtain an updated second face changing model; the updated model parameters of the second face changing model comprise fused model parameters;
determining the second face changing model as a target face changing model, specifically:
and determining the updated second face changing model as the target face changing model.
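Claim 8's fusion of trained branch model parameters is reminiscent of re-parameterization techniques in which parallel branches are merged into a single operator after training. The sketch below assumes, purely for illustration, that the branches are parallel 2D convolutions with identical hyperparameters, in which case linearity allows their kernels and biases to be summed; the actual branch structure of the second generation model is not limited to this.

```python
# Illustrative fusion of trained branch parameters (claim 8), assuming the branches are
# parallel 2D convolutions with identical hyperparameters; by linearity their kernels and
# biases can then be summed into a single convolution. This is one possible realization only.
import torch
import torch.nn as nn

def fuse_parallel_convs(branches):
    """branches: list of nn.Conv2d sharing in/out channels, kernel size, stride and padding."""
    ref = branches[0]
    fused = nn.Conv2d(ref.in_channels, ref.out_channels, ref.kernel_size,
                      stride=ref.stride, padding=ref.padding, bias=True)
    with torch.no_grad():
        fused.weight.copy_(sum(b.weight for b in branches))
        fused.bias.copy_(sum((b.bias if b.bias is not None else torch.zeros_like(fused.bias))
                             for b in branches))
    return fused
```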
9. The method according to any one of claims 1 to 7, wherein the computation amount of the second generative model is less than or equal to a preset computation amount, and the preset computation amount is determined according to a real-time computation amount of a face changing device to which the target face changing model is to be deployed.
10. A video-based face changing method, the method comprising:
acquiring a real face image to be replaced in a video to be displayed;
determining a preset virtual face image corresponding to the real face image to be replaced;
performing generation processing on the real face image to be replaced and the preset virtual face image through a target face changing model to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced;
replacing the real face image to be replaced with the target virtual face image when the video to be displayed is displayed;
wherein the target face-changing model is obtained by executing the training method of the face-changing model according to any one of claims 1 to 9.
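At an implementation level, claim 10 can be pictured as a per-frame loop over the video: detect faces, look up the preset virtual face bound to the face to be replaced, run the deployed target face changing model, and paste the result back into the frame. The helpers `detect_faces`, `lookup_virtual_face`, and `paste_back` below are hypothetical placeholders, not interfaces defined in this application.

```python
# Illustrative per-frame application of the deployed model (claim 10).
# detect_faces, lookup_virtual_face and paste_back are hypothetical helpers.
def swap_faces_in_video(frames, target_model, detect_faces, lookup_virtual_face, paste_back):
    output = []
    for frame in frames:
        for face_crop, box in detect_faces(frame):
            virtual_face = lookup_virtual_face(face_crop)    # preset virtual face image, or None
            if virtual_face is None:
                continue                                     # not a face to be replaced
            swapped = target_model(face_crop, virtual_face)  # target virtual face image
            frame = paste_back(frame, swapped, box)
        output.append(frame)
    return output
```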
11. The method of claim 10, further comprising:
binding the facial features of a preset object with the preset virtual face image to obtain a preset binding relationship;
acquiring the facial features of a recognized real face image in the video to be displayed;
if the facial features of the recognized real face image match the facial features of the preset object, determining the recognized real face image as the real face image to be replaced;
the determining of the preset virtual face image corresponding to the real face image to be replaced specifically includes:
and determining the preset virtual face image according to the face features of the real face image to be replaced and the preset binding relationship.
12. The method of claim 11, further comprising:
and if the facial features of the recognized real face image do not match the facial features of the preset object, reusing the recognized real face image when the video to be displayed is displayed.
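Claims 11 and 12 bind a preset object's facial features to a preset virtual face image and then match recognized faces against that binding, reusing the original face when no match is found. One plausible (non-authoritative) realization compares identity embeddings with a similarity threshold, as sketched below; the `embed` function and the threshold value are assumptions.

```python
# Illustrative binding and matching of facial features (claims 11-12).
# The embed() function and the similarity threshold are assumptions.
import torch.nn.functional as F

def build_binding(preset_faces, preset_virtual_faces, embed):
    """Bind each preset object's facial-feature embedding to its preset virtual face image."""
    return [(embed(face), virtual) for face, virtual in zip(preset_faces, preset_virtual_faces)]

def lookup_virtual_face(face_crop, binding, embed, threshold=0.6):
    """Return the bound preset virtual face image if the recognized face matches a preset object."""
    query = embed(face_crop)
    for ref_embedding, virtual_face in binding:
        if F.cosine_similarity(query, ref_embedding, dim=-1).item() > threshold:
            return virtual_face
    return None  # no match: the recognized real face image is reused as-is (claim 12)
```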
13. An apparatus for training a face changing model, the apparatus comprising: a sampling unit, a first generation unit, a training unit and a first determining unit;
the sampling unit is used for sampling a plurality of real face images and a plurality of virtual face images to obtain a plurality of first target sample images and a plurality of corresponding first source sample images; the plurality of first target sample images includes the real face image, the plurality of first source sample images includes the virtual face image;
the first generation unit is used for performing generation processing on each first target sample image and the corresponding first source sample image through a first generation model to obtain a first face generation image;
the training unit is used for training the first generation model by minimizing a first loss function according to the first face generation image, the first target sample image and the first source sample image to obtain a first face changing model; the first loss function is used for calculating a face image loss between the first face generation image and the first source sample image and a face state loss between the first face generation image and the first target sample image;
the sampling unit is further configured to sample the plurality of real face images and the plurality of virtual face images according to a first proportion of the real face images and the virtual face images to obtain a plurality of second target sample images and a plurality of corresponding second source sample images; the plurality of second target sample images includes the real face image, the plurality of second source sample images includes the virtual face image;
the first generation unit is further configured to perform generation processing on each second target sample image and the corresponding second source sample image through a second generation model to obtain a second face generation image; the computation amount of the second generation model is less than that of the first generation model;
the first generation unit is further configured to perform generation processing on the second target sample image and the second source sample image through the first face changing model to obtain a third face generation image;
the training unit is further configured to train the second generative model according to the second face generation image, the third face generation image, and the second source sample image by minimizing a second loss function to obtain a second face change model; the second loss function is used for calculating face image loss between the second face generation image and the second source sample image and face state loss between the second face generation image and the third face generation image;
the first determining unit is configured to determine the second face change model as a target face change model.
14. A video-based face changing apparatus, comprising: an acquisition unit, a second determining unit, a second generation unit and a replacing unit;
the acquisition unit is used for acquiring a real face image to be replaced in a video to be displayed;
the second determining unit is used for determining a preset virtual face image corresponding to the real face image to be replaced;
the second generation unit is used for performing generation processing on the real face image to be replaced and the preset virtual face image through a target face changing model to obtain a target virtual face image; the target virtual face image is matched with the face image of the preset virtual face image and the face state of the real face image to be replaced;
the replacing unit is used for replacing the real face image to be replaced with the target virtual face image when the video to be displayed is displayed;
wherein the target face-changing model is obtained by executing the training method of the face-changing model according to any one of claims 1 to 9.
15. A computer device, comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute, according to instructions in the program code, the training method of a face changing model according to any one of claims 1 to 9 or the video-based face changing method according to any one of claims 10 to 12.
16. A computer-readable storage medium for storing a computer program which, when executed by a processor, performs a method of training a face-changing model according to any one of claims 1 to 9, or a method of video-based face-changing according to any one of claims 10 to 12.
CN202211477442.6A 2022-11-23 2022-11-23 Training of face changing model, video-based face changing method and related device Active CN115578779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211477442.6A CN115578779B (en) 2022-11-23 2022-11-23 Training of face changing model, video-based face changing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211477442.6A CN115578779B (en) 2022-11-23 2022-11-23 Training of face changing model, video-based face changing method and related device

Publications (2)

Publication Number Publication Date
CN115578779A CN115578779A (en) 2023-01-06
CN115578779B true CN115578779B (en) 2023-03-10

Family

ID=84590424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211477442.6A Active CN115578779B (en) 2022-11-23 2022-11-23 Training of face changing model, video-based face changing method and related device

Country Status (1)

Country Link
CN (1) CN115578779B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929617B (en) * 2019-11-14 2023-05-30 绿盟科技集团股份有限公司 Face-changing synthesized video detection method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783603A (en) * 2020-06-24 2020-10-16 有半岛(北京)信息科技有限公司 Training method for generating confrontation network, image face changing method and video face changing method and device
WO2021258920A1 (en) * 2020-06-24 2021-12-30 百果园技术(新加坡)有限公司 Generative adversarial network training method, image face swapping method and apparatus, and video face swapping method and apparatus
CN112734634A (en) * 2021-03-30 2021-04-30 中国科学院自动化研究所 Face changing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115578779A (en) 2023-01-06

Similar Documents

Publication Publication Date Title
US11763541B2 (en) Target detection method and apparatus, model training method and apparatus, device, and storage medium
CN111652121B (en) Training method of expression migration model, and method and device for expression migration
US11356619B2 (en) Video synthesis method, model training method, device, and storage medium
US11776097B2 (en) Image fusion method, model training method, and related apparatuses
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN110852942B (en) Model training method, and media information synthesis method and device
CN110738211A (en) object detection method, related device and equipment
US20210152751A1 (en) Model training method, media information synthesis method, and related apparatuses
CN110516113B (en) Video classification method, video classification model training method and device
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN110837858A (en) Network model training method and device, computer equipment and storage medium
CN111581958A (en) Conversation state determining method and device, computer equipment and storage medium
CN114092920B (en) Model training method, image classification method, device and storage medium
CN109993234B (en) Unmanned driving training data classification method and device and electronic equipment
CN113269279B (en) Multimedia content classification method and related device
CN114722937A (en) Abnormal data detection method and device, electronic equipment and storage medium
CN112561084B (en) Feature extraction method and device, computer equipment and storage medium
CN114333031A (en) Vulnerability detection method and device of living body detection model and storage medium
CN114281936A (en) Classification method and device, computer equipment and storage medium
CN115578779B (en) Training of face changing model, video-based face changing method and related device
CN115171196A (en) Face image processing method, related device and storage medium
CN111914106B (en) Texture and normal library construction method, texture and normal map generation method and device
CN112270238A (en) Video content identification method and related device
CN113819913A (en) Path planning method and device, computer equipment and storage medium
CN113569043A (en) Text category determination method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40079076
Country of ref document: HK