CN117540789A - Model training method, facial expression migration method, device, equipment and medium - Google Patents


Info

Publication number
CN117540789A
Authority
CN
China
Prior art keywords
expression
face
model
face image
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410031578.7A
Other languages
Chinese (zh)
Other versions
CN117540789B (en)
Inventor
卫华威 (Wei Huawei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410031578.7A
Priority claimed from CN202410031578.7A (external priority)
Publication of CN117540789A
Application granted
Publication of CN117540789B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
                  • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
                • G06N 3/0464 Convolutional networks [CNN, ConvNet]
                • G06N 3/0475 Generative networks
              • G06N 3/08 Learning methods
                • G06N 3/084 Backpropagation, e.g. using gradient descent
                • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
                • G06N 3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/40 Extraction of image or video features
              • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
                • G06V 10/443 Local feature extraction by matching or filtering
                  • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
                    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
                      • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
            • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
                  • G06V 10/7753 Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
              • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
          • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
              • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
                • G06V 40/168 Feature extraction; Face representation
                • G06V 40/174 Facial expression recognition

Abstract

The application discloses a model training method, a facial expression migration method, a device, equipment and a medium, belonging to the field of computer technology. The method comprises the following steps: acquiring a first facial image and a second facial image; encoding the first facial image and the second facial image through an encoding model to obtain a first feature and a second feature; decoding the first feature through a first decoding model to obtain a third facial image; decoding the second feature through a second decoding model to obtain a fourth facial image; training the first decoding model based on the first facial image and the third facial image; and training the encoding model based on the first facial image, the third facial image, the second facial image, and the fourth facial image. With the trained encoding model and the first decoding model, any facial image can be converted into a virtual facial image, which ensures the accuracy of expression migration.

Description

Model training method, facial expression migration method, device, equipment and medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a model training method, a facial expression migration method, a device, equipment and a medium.
Background
Facial expression migration refers to migrating the facial expression of any subject onto the face of a target subject, changing only the expression of the target subject while leaving its other facial features unchanged. Although existing facial expression migration methods can migrate facial expressions, their accuracy is poor.
Disclosure of Invention
The embodiment of the application provides a model training method, a facial expression migration method, a device, equipment and a medium, which can improve the accuracy of facial expression migration. The technical scheme is as follows.
In one aspect, a model training method is provided, the method comprising:
acquiring a first face image and a second face image, wherein the first face image comprises a virtual face, and the face in the second face image is different from the virtual face;
encoding the first facial image and the second facial image respectively through an encoding model to obtain a first feature and a second feature, wherein the first feature indicates the first facial image, and the second feature indicates the second facial image;
decoding the first feature through a first decoding model to obtain a third face image;
decoding the second feature through a second decoding model to obtain a fourth facial image;
training the first decoding model based on the first face image and the third face image; training the coding model based on the first face image, the third face image, the second face image, and the fourth face image;
the first decoding model and the encoding model are used for generating a virtual face image based on any face image, and the expression of the virtual face in the virtual face image is identical to that of the face in the face image.
In another aspect, a facial expression migration method is provided, the method including:
acquiring any facial image;
encoding the facial image through an encoding model to obtain the characteristics of the facial image;
decoding the characteristics of the face image through a first decoding model to obtain a virtual face image, wherein the virtual face image comprises a virtual face, and the expression of the face in the face image is the same as that of the virtual face in the virtual face image;
the coding model and the first decoding model are trained based on the model training method described in the above aspect.
In yet another aspect, a model training apparatus is provided, the apparatus comprising:
the acquisition module is used for acquiring a first facial image and a second facial image, wherein the first facial image comprises a virtual face, and the face in the second facial image is different from the virtual face;
the encoding module is used for respectively encoding the first facial image and the second facial image through an encoding model to obtain a first feature and a second feature, wherein the first feature indicates the first facial image, and the second feature indicates the second facial image;
the decoding module is used for decoding the first feature through a first decoding model to obtain a third face image;
the decoding module is further configured to decode the second feature through a second decoding model to obtain a fourth facial image;
a training module for training the first decoding model based on the first face image and the third face image; training the coding model based on the first face image, the third face image, the second face image, and the fourth face image;
the first decoding model and the encoding model are used for generating a virtual face image based on any face image, and the expression of the virtual face in the virtual face image is identical to that of the face in the face image.
In one possible implementation, the obtaining module is configured to generate the first facial image based on a sample expression parameter, where the sample expression parameter indicates an expression of the virtual face in the first facial image;
the apparatus further comprises:
the recognition module is used for carrying out expression recognition on the first facial image through an expression recognition model to obtain predicted expression parameters, wherein the predicted expression parameters indicate the expression of the virtual face in the first facial image;
the training module is further configured to train the expression recognition model based on the sample expression parameter and the predicted expression parameter.
In another possible implementation, the sample expression parameters include sample expression parameters of a plurality of locations, and the predicted expression parameters include predicted expression parameters of the plurality of locations; the training module is used for determining a first loss value based on the sample expression parameters of the plurality of parts and the predicted expression parameters of the plurality of parts, wherein the first loss value indicates the difference between the sample expression parameters and the predicted expression parameters of the same part; and training the expression recognition model based on the first loss value.
In another possible implementation, the training module is configured to determine a second loss value based on the first facial image and the third facial image, the second loss value indicating a difference between the first facial image and the third facial image; determining a third loss value based on the second face image and the fourth face image, the third loss value indicating a difference between the second face image and the fourth face image; training the coding model based on the second loss value and the third loss value.
In another possible implementation manner, the training module is further configured to train the second decoding model based on the second face image and the fourth face image in a case where the encoding model and the first decoding model are iteratively trained.
In yet another aspect, there is provided a facial expression migration apparatus, the apparatus including:
the acquisition module is used for acquiring any facial image;
the coding module is used for coding the facial image through a coding model to obtain the characteristics of the facial image;
the decoding module is used for decoding the characteristics of the face image through a first decoding model to obtain a virtual face image, wherein the virtual face image comprises a virtual face, and the expression of the face in the face image is the same as that of the virtual face in the virtual face image;
The coding model and the first decoding model are obtained by training according to the model training method described in the above aspect.
In one possible implementation, the apparatus further includes:
the recognition module is used for carrying out expression recognition on the virtual facial image through the expression recognition model to obtain expression parameters, and the expression parameters indicate the expression of the virtual face in the virtual facial image.
In another possible implementation, the apparatus further includes:
and the adjusting module is used for adjusting the face of the virtual object in the virtual scene based on the expression parameters so that the facial expression of the adjusted virtual object is the same as the facial expression in the facial image.
In another possible implementation manner, the expression parameters include expression parameters of a plurality of parts, the expression parameters of each part include a plurality of expression parameters, and different expression parameters of the same part indicate different actions of that part; the adjusting module is used for fusing the plurality of expression parameters of the same part based on the expression parameters of the plurality of parts to obtain a fused expression parameter of each part; and adjusting each part in the virtual object based on the fused expression parameter of each part.
In another possible implementation, the face image is any video frame in a video; the adjusting module is used for adjusting the face of the virtual object in sequence according to the sequence of the video frames based on the expression parameters of the video frames under the condition that the expression parameters of the video frames are obtained, so that the facial expression of the virtual object changes along with the expression change of the face in the video.
In yet another aspect, a computer device is provided that includes a processor and a memory having at least one computer program stored therein, the at least one computer program loaded and executed by the processor to implement the operations performed by the model training method or the facial expression migration method as described in the above aspects.
In yet another aspect, a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement the operations performed by the model training method or the facial expression migration method as described in the above aspects is provided.
In yet another aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the operations performed by the model training method or the facial expression migration method as described in the above aspects.
In the scheme provided by the embodiments of the present application, a first facial image and a second facial image containing different faces are encoded by the same encoding model to obtain the features of the two facial images; the first decoding model decodes the features of the first facial image into a third facial image, and the second decoding model decodes the features of the second facial image into a fourth facial image. The first decoding model is then trained based on the first facial image and the third facial image, and the encoding model is trained based on the first facial image, the third facial image, the second facial image, and the fourth facial image. This training scheme is simple and keeps model training efficient: the encoding model learns to encode facial images into features suitable for the first decoding model, so the features it outputs are guaranteed to be applicable to the first decoding model, and the first decoding model learns to decode the features output by the encoding model into facial images containing the virtual face. As a result, any facial image can be converted into a virtual facial image by the trained encoding model and first decoding model; the virtual facial image contains the virtual face, whose expression is the same as the expression in the input facial image, which ensures the accuracy of facial expression migration.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flow chart of a model training method provided in an embodiment of the present application;
FIG. 3 is a flow chart of another model training method provided by an embodiment of the present application;
FIG. 4 is a flow chart of an image processing provided in an embodiment of the present application;
FIG. 5 is a flowchart for training an expression recognition model according to an embodiment of the present application;
FIG. 6 is a schematic illustration of a first facial image provided in an embodiment of the present application;
fig. 7 is a flowchart of a facial expression migration method provided in an embodiment of the present application;
FIG. 8 is a flowchart of another method for facial expression migration provided by embodiments of the present application;
FIG. 9 is a flowchart for obtaining expression parameters according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a real face image and a virtual face image according to an embodiment of the present application;
FIG. 11 is a schematic diagram of another real face image and virtual face image provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a model training device according to an embodiment of the present disclosure;
FIG. 13 is a schematic structural view of another model training apparatus according to an embodiment of the present disclosure;
fig. 14 is a schematic structural view of a facial expression migration apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural view of another facial expression migration apparatus provided in an embodiment of the present application;
fig. 16 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," "fourth," and the like as used herein may be used to describe various concepts, but are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first facial image may be referred to as a second facial image, and similarly, a second facial image may be referred to as a first facial image, without departing from the scope of the present application.
The terms "at least one," "a plurality," "each," "any one," as used herein, include one, two or more, a plurality includes two or more, and each refers to each of a corresponding plurality, any one referring to any one of the plurality. For example, the plurality of face images includes 3 face images, and each refers to each of the 3 face images, and any one of the 3 face images can be the first face image, or the second face image, or the third face image.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, facial images, videos, or virtual facial models referred to in this application are all acquired with sufficient authorization.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
Computer Vision (CV) is the science of studying how to make a machine "see"; more specifically, it replaces human eyes with cameras and computers to recognize and measure targets, and performs further graphic processing so that the resulting images are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought an important transformation to the development of computer vision: pre-trained vision models such as the Swin Transformer (a deep learning model), ViT (Vision Transformer, a deep learning model), V-MoE (a vision model), and MAE (Masked Autoencoders) can be quickly and widely applied to specific downstream tasks through fine-tuning (Fine Tune). Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three-dimensional) techniques, virtual reality, augmented reality, and simultaneous localization and mapping, among others.
According to the scheme provided by the embodiment of the application, based on the artificial intelligence machine learning technology, a model training method can be realized to train a model for facial expression migration, and the trained model is utilized to realize the facial expression migration method.
The model training method and the facial expression migration method provided by the embodiments of the present application can be used in a computer device. Optionally, the computer device is a terminal or a server. Optionally, the server is an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), and big data and artificial intelligence platforms. Optionally, the terminal is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, or the like, but is not limited thereto.
In some embodiments, the computer program related to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by the communication network can form a blockchain system.
In some embodiments, the computer device is provided as a server. FIG. 1 is a schematic diagram of an implementation environment provided by embodiments of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 are connected by a wireless or wired network.
The terminal 101 is configured to provide a face image to the server 102, and the server 102 is configured to train the coding model and the first decoding model in combination with the second decoding model based on the face image provided by the terminal 101.
In one possible implementation, the terminal 101 is provided with an application served by the server 103, and after the server 102 has trained the coding model and the first decoding model, the coding model and the first decoding model can be deployed in the server 103. The terminal 101 can realize functions such as facial expression migration through the application. Optionally, the application is an application in the operating system of the terminal 101 or an application provided by a third party. For example, the application is an expression migration application having a facial expression migration function; of course, the expression migration application can also have other functions, such as a comment function, a shopping function, a navigation function, a game function, and the like.
The terminal 101 is configured to log in to the application based on a user identification and send a facial image to the server 103 through the application. The server 103 is configured to receive the facial image sent by the terminal 101 and generate a virtual facial image based on the facial image through the coding model and the first decoding model, where the expression of the virtual face in the virtual facial image is the same as the expression of the face in the facial image.
It should be noted that the embodiment of the present application takes the case where the trained encoding model and first decoding model are deployed in the server 103 as an example. In another embodiment, the server 102 is configured to provide services for an application having a facial expression migration function; in that case, the server 102 stores the trained encoding model and first decoding model locally, so as to subsequently provide the facial expression migration service through them.
Fig. 2 is a flowchart of a model training method provided in an embodiment of the present application, the method being performed by a computer device, as shown in fig. 2, the method including the following steps.
201. The computer device obtains a first facial image and a second facial image, the first facial image including a virtual face, the face in the second facial image being different from the virtual face.
Wherein the face in the second face image can be an arbitrary character face, for example, a face of a person, a face of an animal, or a face of a cartoon character, and the virtual face can be an arbitrary character face, for example, a virtual face is a face of a cartoon character or a face of a virtual object in a virtual scene. The second face image is any face image that does not contain a virtual face, for example, the second face image is a real face image, i.e., the second face image contains a face of a person.
In the embodiment of the application, the first facial image and the second facial image belong to different types of facial images, so that the model is trained through the different types of facial images later, and the model can realize the migration of expressions in the different types of facial images.
202. The computer equipment encodes the first facial image and the second facial image through the encoding model respectively to obtain a first feature and a second feature, wherein the first feature indicates the first facial image, and the second feature indicates the second facial image.
The coding model is used for encoding an input image to obtain the features of the image. The coding model can be any network model, for example, a deep network built from ResNet-34 (Residual Network 34). The first feature and the second feature can each be represented in any form, for example, as feature vectors; e.g., the first feature and the second feature are each a feature matrix with a scale of 8 x 8.
In the embodiment of the application, the first facial image is encoded through the encoding model to obtain a first feature; and encoding the second facial image through the encoding model to obtain a second characteristic.
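As an illustration of this encoding step, the following is a minimal PyTorch sketch of such an encoding model, assuming the ResNet-34 trunk mentioned above; the output channel width and the final 1x1 projection are illustrative assumptions, since the text only fixes the backbone example and the 8 x 8 feature scale.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet34

    class FaceEncoder(nn.Module):
        # Encodes a 256 x 256 face image into an 8 x 8 feature map.
        # The ResNet-34 trunk follows the example in the text; the output
        # channel width and the 1x1 projection are illustrative assumptions.
        def __init__(self, feat_channels: int = 512):
            super().__init__()
            trunk = resnet34(weights=None)
            # Keep the convolutional stages only; drop the avgpool and fc head.
            self.features = nn.Sequential(*list(trunk.children())[:-2])
            self.project = nn.Conv2d(512, feat_channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, 3, 256, 256) -> ResNet-34 downsamples by 32 -> (B, 512, 8, 8)
            return self.project(self.features(x))

    encoder = FaceEncoder()
    first_feature = encoder(torch.randn(1, 3, 256, 256))  # shape: (1, 512, 8, 8)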
203. The computer device decodes the first feature through the first decoding model to obtain a third face image.
The first decoding model is used for decoding the features of a facial image into a facial image that includes the virtual face, and it can be any network model; for example, the first decoding model is a model composed of a plurality of deconvolution layers. The third facial image is the facial image output by the first decoding model based on the features of the first facial image. If the first decoding model is sufficiently accurate, the third facial image should be sufficiently similar to the first facial image; however, limited by the accuracy of the first decoding model, the third facial image may differ from the first facial image, and the virtual face in the third facial image may differ from the virtual face in the first facial image.
204. The computer device decodes the second feature through the second decoding model to obtain a fourth facial image.
The second decoding model is used for decoding the features of a facial image into a facial image, and it can be any network model; for example, the second decoding model is a model composed of a plurality of deconvolution layers. The fourth facial image is the facial image output by the second decoding model based on the features of the second facial image. If the second decoding model is sufficiently accurate, the fourth facial image should be sufficiently similar to the second facial image; however, limited by the accuracy of the second decoding model, the fourth facial image may differ from the second facial image, and the face in the fourth facial image may differ from the face in the second facial image.
205. The computer device training the first decoding model based on the first face image and the third face image; the coding model is trained based on the first face image, the third face image, the second face image, and the fourth face image.
In this embodiment of the present application, the difference between the third face image and the first face image can reflect the accuracy of the first decoding model. The first decoding model is therefore trained based on the first face image and the third face image to improve its accuracy, so that when the features of the first face image are decoded by the trained first decoding model, the output face image becomes more similar to the first face image, and the virtual face in the output face image becomes more similar to the virtual face in the first face image.
In this embodiment of the present application, both the first feature and the second feature are obtained through the coding model. An inaccurate first feature may cause the resulting third face image to be dissimilar to the first face image, so the difference between the first face image and the third face image can reflect the accuracy of the coding model; likewise, an inaccurate second feature may cause the resulting fourth face image to be dissimilar to the second face image, so the difference between the second face image and the fourth face image can also reflect the accuracy of the coding model.
In this embodiment of the present application, the encoding model and the first decoding model are trained in the above manner so that the encoding model learns to encode facial images into features applicable to the first decoding model and the second decoding model, and the first decoding model learns to decode the features output by the encoding model into facial images containing the virtual face. After the encoding model and the first decoding model are trained, a virtual facial image can be generated from any facial image through the encoding model and the first decoding model. The virtual face contained in the virtual facial image is the same as the virtual face in the first facial image used in the above training process, although its expression may differ; the expression of the virtual face in the virtual facial image is the same as the expression of the face in the input facial image. That is, the encoding model and the first decoding model realize the following migration of expression: the expression of the face in the input facial image is migrated onto the virtual face, yielding a virtual facial image containing the virtual face.
In the scheme provided by the embodiment of the present application, a first facial image and a second facial image containing different faces are encoded by the same encoding model to obtain the features of the two facial images; the first decoding model decodes the features of the first facial image into a third facial image, and the second decoding model decodes the features of the second facial image into a fourth facial image. The first decoding model is then trained based on the first facial image and the third facial image, and the encoding model is trained based on the first facial image, the third facial image, the second facial image, and the fourth facial image. This training scheme is simple and keeps model training efficient: the encoding model learns to encode facial images into features suitable for the first decoding model, so the features it outputs are guaranteed to be applicable to the first decoding model, and the first decoding model learns to decode the features output by the encoding model into facial images containing the virtual face. As a result, any facial image can be converted into a virtual facial image by the trained encoding model and first decoding model; the virtual facial image contains the virtual face, whose expression is the same as the expression in the input facial image, which ensures the accuracy of facial expression migration.
Based on the embodiment shown in fig. 2, the embodiment of the present application can acquire a first facial image including a virtual face by using the sample expression parameters, so that the coding model and the first decoding model can be trained based on the sample expression parameters and the first facial image, and meanwhile, the expression recognition model can be trained based on the sample expression parameters and the first facial image, which is described in detail in the following embodiments.
FIG. 3 is a flowchart of another model training method provided by an embodiment of the present application, the method being performed by a computer device, as shown in FIG. 3, the method comprising the following steps.
301. The computer device generates a first facial image based on sample expression parameters that indicate an expression of a virtual face in the first facial image.
In an embodiment of the present application, the first facial image includes a virtual face, the sample expression parameter indicates an expression of the virtual face, and the first facial image is generated based on the sample expression parameter, so that the expression of the virtual face in the first facial image is matched with the sample expression parameter, so as to provide training data for a subsequent expression recognition model.
In one possible implementation, the sample expression parameter can indicate any expression; for example, the sample expression parameter indicates expressions such as blinking, closing the eyes, frowning, opening the mouth, pouting, grinning, and the like.
Optionally, the sample expression parameters include sample expression parameters of a plurality of parts. The plurality of parts refers to a plurality of parts of the virtual face; for example, the plurality of parts include the eyes, nose, mouth, forehead, and the like of the virtual face. Optionally, the sample expression parameters of each part comprise a plurality of sample expression parameters, and different sample expression parameters of the same part indicate different actions of that part. For example, the mouth has 2 sample expression parameters: the 1st indicates how wide the mouth is opened, and the 2nd indicates the degree of pouting.
For example, the expression parameters of multiple locations in the virtual face can constitute one 130-dimensional sample expression parameter, with different dimensions of the sample expression parameter indicating different actions of the same location or indicating actions of different locations.
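For illustration only, the following sketch shows one possible way to organise such a 130-dimensional sample expression parameter vector by part; the specific per-part dimension split and the [0, 1] value range are assumptions, as the text only states that the parameters of multiple parts together form a 130-dimensional vector.

    import numpy as np

    # Hypothetical per-part layout of a 130-dimensional sample expression
    # parameter vector; only the total of 130 dimensions comes from the text.
    PART_SLICES = {
        "eyes":     slice(0, 40),
        "nose":     slice(40, 60),
        "mouth":    slice(60, 110),
        "forehead": slice(110, 130),
    }

    def split_by_part(expr: np.ndarray) -> dict:
        # Split one 130-dimensional expression vector into per-part groups.
        assert expr.shape == (130,)
        return {part: expr[s] for part, s in PART_SLICES.items()}

    sample_expression = np.random.uniform(0.0, 1.0, size=130)  # assumed value range [0, 1]
    per_part = split_by_part(sample_expression)                # e.g. per_part["mouth"] holds the mouth parameters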
In one possible implementation, this step 301 includes: based on the sample expression parameters, the face of the virtual face model is adjusted, and the adjusted virtual face model is shot to obtain a first face image.
In the embodiment of the present application, the virtual face is a face of the virtual face model. The virtual face model is a three-dimensional face model, for example, the virtual face model is a virtual face model or a virtual animal face model. And controlling the expression of the virtual facial model based on the sample expression parameters, and further, shooting the virtual facial model after the expression of the virtual facial model is adjusted to obtain a first facial image containing the expression of the virtual facial model.
For example, according to the actions of each part indicated by the sample expression parameters, the face of the virtual facial model is adjusted so that the action of each part of the virtual facial model is the same as the action of that part indicated by the sample expression parameters; at this point the expression of the virtual facial model matches the sample expression parameters, and the virtual facial model is then photographed to obtain the first facial image.
Optionally, the virtual facial model may be a digital facial model produced on any digital content production platform. After the digital facial model is produced, it is imported into a rendering engine, the rendering engine adjusts the expression of the digital facial model based on the sample expression parameter, and the digital facial model with the adjusted expression is photographed to obtain the first facial image.
For example, on the computer device, the rendering engine can control the expression of the digital facial model. After the digital facial model is imported into the rendering engine, a user sets any sample expression parameter through the rendering engine; the rendering engine adjusts the expression of the digital facial model based on that sample expression parameter and photographs the digital facial model to obtain the first facial image.
For another example, on the computer device, the rendering engine can control the expression of the digital facial model, and the digital facial model is imported into the rendering engine. The terminal can display a control interface of the rendering engine in which the digital facial model is shown; the user can freely adjust the expression of the digital facial model through the control interface, and the rendering engine records the adjustment operations performed on the digital facial model. In response to a confirmation operation on the adjusted digital facial model, the terminal photographs the digital facial model through the rendering engine to obtain the first facial image, and generates the sample expression parameter based on the recorded adjustment operations.
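The data-generation procedure described above can be summarised by the following sketch; the rendering-engine interface (set_expression, capture_frame) is a hypothetical stand-in, since the text does not name a specific engine or API.

    import numpy as np

    class RenderingEngineStub:
        # Hypothetical stand-in for the rendering engine hosting the imported
        # digital facial model; a real engine exposes its own API.
        def set_expression(self, params: np.ndarray) -> None:
            pass
        def capture_frame(self) -> np.ndarray:
            return np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder render

    def build_first_image_dataset(engine: RenderingEngineStub, n_samples: int) -> list:
        # Generate (sample expression parameter, first facial image) training pairs.
        pairs = []
        for _ in range(n_samples):
            expr = np.random.uniform(0.0, 1.0, size=130)  # random sample expression parameters
            engine.set_expression(expr)                   # adjust the virtual face
            image = engine.capture_frame()                # photograph the adjusted model
            pairs.append((expr, image))
        return pairs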
It should be noted that, in the embodiment of the present application, the first facial image is obtained based on the expression parameter of the sample, and in another embodiment, the above step 301 is not required to be performed, but the first facial image is obtained in other manners.
302. The computer device obtains a second facial image in which the face is different from the virtual face in the first facial image.
In one possible implementation, if the second facial image is a real facial image, the process of acquiring the second facial image includes: and shooting the face of the person to obtain a second face image.
In the embodiment of the present application, the face contained in the second facial image is the real face of a person. When the person is photographed, the person can make various expressions toward the camera, and the person's face is photographed by the camera to obtain the second facial image.
In one possible implementation, a face of a person is photographed to obtain a video, and a video frame including the face is extracted from the video as a second face image.
In the embodiment of the present application, when a person is filmed, the person can make various expressions toward the camera; the person's face is captured by the camera to obtain a video containing the person's various expressions, and video frames containing the person's face are extracted from the video as second facial images.
303. The computer equipment encodes the first facial image and the second facial image through the encoding model respectively to obtain a first feature and a second feature, wherein the first feature indicates the first facial image, and the second feature indicates the second facial image.
In one possible implementation, before encoding the facial image, the facial image is further scaled so that the scale of the scaled facial image is the target scale, that is, the scaling process includes: for any one of the first face image and the second face image, if the face image has a scale smaller than the target scale, enlarging the face image so that the scale of the enlarged face image is the target scale; or, if the face image has a size larger than the target size, the face image is cut so that the size of the cut face image is the target size and the cut face image includes a face.
Where the target scale is the scale of the input image of the coding model, e.g. the target scale is 256 x 256. In the embodiment of the application, before encoding the facial image, the facial image is subjected to scale adjustment so as to ensure that the encoding model can encode the input facial image and ensure the accuracy of encoding.
In addition, when the first face image and the second face image are both already of the target scale, their scales need not be adjusted. For example, the target scale is 256 x 256, the first face image and the second face image are each images of scale 256 x 256, and they are encoded by the encoding model to obtain the first feature and the second feature, whose feature scales are 8 x 8.
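A minimal sketch of this scale adjustment is given below, assuming the 256 x 256 target scale and a face bounding box supplied by an external face detector (how the box is obtained is not specified in the text).

    from PIL import Image

    TARGET = 256  # target scale of the encoding model input

    def to_target_scale(img: Image.Image, face_box: tuple) -> Image.Image:
        # face_box = (left, top, right, bottom) of the face, assumed to be given.
        w, h = img.size
        if w < TARGET or h < TARGET:
            # Smaller than the target scale: enlarge the facial image.
            return img.resize((TARGET, TARGET), Image.BILINEAR)
        # Larger than the target scale: cut out a 256 x 256 window that still
        # contains the face, here centred on the face box.
        cx, cy = (face_box[0] + face_box[2]) // 2, (face_box[1] + face_box[3]) // 2
        left = min(max(cx - TARGET // 2, 0), w - TARGET)
        top = min(max(cy - TARGET // 2, 0), h - TARGET)
        return img.crop((left, top, left + TARGET, top + TARGET))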
304. The computer device decodes the first feature through the first decoding model to obtain a third face image.
In one possible implementation, the first decoding model consists of a plurality of deconvolution layers, and the scale of the output feature of each deconvolution layer is twice the scale of its input feature. For example, the first feature has a scale of 8 x 8 and the first decoding model includes 5 deconvolution layers; when the first feature is decoded by the first decoding model, each deconvolution layer doubles the scale of its input feature, so the third face image obtained after the 5 deconvolution layers has a scale of 256 x 256.
305. The computer device decodes the second feature through the second decoding model to obtain a fourth facial image.
In one possible implementation, the first decoding model is identical in structure to the second decoding model, but the model parameters are different. For example, the first decoding model and the second decoding model are each composed of 5 deconvolution layers, but the parameters of the deconvolution layers in the first decoding model and the second decoding model are different.
In this embodiment of the present application, the first decoding model and the second decoding model have the same structure, and the process of decoding the second feature by the second decoding model is the same as the process of decoding the first feature by the first decoding model, which is not described herein.
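The following sketch shows such a decoding model built from five transposed-convolution (deconvolution) layers, each doubling the spatial scale from 8 x 8 up to 256 x 256; the channel widths and activations are assumptions, and the second decoding model is simply a second instance with its own parameters, matching the description above.

    import torch
    import torch.nn as nn

    class FaceDecoder(nn.Module):
        # Five deconvolution layers, each doubling the spatial scale:
        # 8 -> 16 -> 32 -> 64 -> 128 -> 256.
        def __init__(self, in_channels: int = 512):
            super().__init__()
            chans = [in_channels, 256, 128, 64, 32, 3]  # channel widths are assumptions
            layers = []
            for i in range(5):
                layers.append(nn.ConvTranspose2d(chans[i], chans[i + 1],
                                                 kernel_size=4, stride=2, padding=1))
                layers.append(nn.ReLU(inplace=True) if i < 4 else nn.Sigmoid())
            self.net = nn.Sequential(*layers)

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            # feat: (B, 512, 8, 8) -> image: (B, 3, 256, 256)
            return self.net(feat)

    decoder_first = FaceDecoder()   # decodes features into images of the virtual face
    decoder_second = FaceDecoder()  # same structure, separate model parameters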
306. The computer device trains the first decoding model based on the first face image and the third face image.
In one possible implementation, a second loss value is determined based on the first and third facial images, the second loss value indicating a difference between the first and third facial images, and the first decoding model is trained based on the second loss value.
In this embodiment of the present application, the second loss value is determined so as to quantify the difference between the first facial image and the third facial image and to ensure that it accurately reflects that difference. The first decoding model is trained with the second loss value so that the difference between the first facial image and the facial image obtained by decoding the first feature with the trained first decoding model becomes smaller, thereby improving the accuracy of the first decoding model.
307. The computer device trains the coding model based on the first face image, the third face image, the second face image, and the fourth face image.
In one possible implementation, this step 307 includes: determining a second loss value based on the first face image and the third face image, the second loss value indicating the difference between the first face image and the third face image; determining a third loss value based on the second face image and the fourth face image, the third loss value indicating the difference between the second face image and the fourth face image; and training the coding model based on the second loss value and the third loss value.
In the embodiment of the present application, the second loss value is determined so as to quantify the difference between the first face image and the third face image, the third loss value is determined so as to quantify the difference between the second face image and the fourth face image, so that the second loss value can accurately reflect the difference between the first face image and the third face image, and the third loss value can accurately reflect the difference between the second face image and the fourth face image.
Optionally, in the case of determining the second loss value and the third loss value, determining a loss value sum value of the second loss value and the third loss value, and training the coding model based on the loss value sum value.
In this embodiment of the present application, the encoding model, the first decoding model, and the second decoding model can form a face image converter. Taking the first face image as a virtual face image and the second face image as a real face image as an example, the virtual face image and the real face image are encoded by the same encoding model to obtain a first feature indicating the virtual face image and a second feature indicating the real face image. The first decoding model is the model that decodes the virtual face image, and the second decoding model decodes the real face image: the virtual face image can be decoded from the first feature by the first decoding model, and the real face image can be decoded from the second feature by the second decoding model. An image reconstruction loss is then used to train the first decoding model and the coding model, which improves the accuracy of both.
As shown in fig. 4, the first face image is input into the coding model, the coding model inputs the features into the first decoding model, and the third face image is output from the first decoding model; the second face image is input into the coding model, the coding model inputs the features into the second decoding model, and the fourth face image is output from the second decoding model, so that model training is realized by the first face image, the second face image, the third face image and the fourth face image.
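A minimal sketch of one training step along the two branches of Fig. 4 follows, reusing the encoder and decoder instances from the earlier sketches; the use of L1 reconstruction losses and of a single optimizer covering all three models are assumptions. Note that the second loss only back-propagates into the encoding model and the first decoding model, and the third loss only into the encoding model and the second decoding model, which matches the training targets described above.

    import torch
    import torch.nn.functional as F

    def train_step(encoder, decoder_first, decoder_second, optimizer,
                   first_image, second_image):
        first_feature = encoder(first_image)             # feature of the first facial image
        second_feature = encoder(second_image)           # feature of the second facial image
        third_image = decoder_first(first_feature)       # reconstruction of the first facial image
        fourth_image = decoder_second(second_feature)    # reconstruction of the second facial image

        second_loss = F.l1_loss(third_image, first_image)    # difference between first and third images
        third_loss = F.l1_loss(fourth_image, second_image)   # difference between second and fourth images
        loss_sum = second_loss + third_loss                  # loss value sum used for the encoding model

        optimizer.zero_grad()
        loss_sum.backward()
        optimizer.step()
        return second_loss.item(), third_loss.item()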
In the embodiment of the present application described above, the encoding model and the first decoding model are trained once; in another embodiment, the encoding model and the first decoding model may be trained iteratively according to steps 301 to 307. The iterative training of the encoding model and the first decoding model may be stopped if the number of iterations reaches an iteration-count threshold, if the second loss value and the third loss value are both smaller than a loss value threshold, or if the loss value sum of the second loss value and the third loss value is smaller than the loss value threshold.
For example, in the process of iteratively training the coding model and the first decoding model, after one iteration has been performed according to steps 301 to 307, a next first facial image and a next second facial image are acquired. The newly acquired first facial image contains the same virtual face as the first facial image used in the previous iteration, but with a different expression; the newly acquired second facial image contains the same face as the second facial image used in the previous iteration, but with a different expression. Based on the currently acquired first and second facial images, the coding model and the first decoding model are iterated again according to steps 303 to 307.
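Putting the pieces together, the iterative training and its stopping conditions can be sketched as below; the threshold values, and the pair_loader that yields matched (first facial image, second facial image) pairs as described above, are assumptions, and train_step is the sketch from the previous example.

    ITERATION_THRESHOLD = 100_000   # assumed iteration-count threshold
    LOSS_THRESHOLD = 0.01           # assumed loss value threshold

    for iteration, (first_image, second_image) in enumerate(pair_loader, start=1):
        second_loss, third_loss = train_step(encoder, decoder_first, decoder_second,
                                             optimizer, first_image, second_image)
        # Stop when the iteration count reaches the threshold, or when both
        # loss values (and hence their sum) fall below the loss value threshold.
        if iteration >= ITERATION_THRESHOLD or max(second_loss, third_loss) < LOSS_THRESHOLD:
            break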
In this embodiment of the present application, after training the coding model and the first decoding model, the first decoding model and the coding model are used to generate a virtual face image based on any one of the face images, where the expression of the virtual face in the virtual face image is the same as the expression of the face in the face image.
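After training, expression migration itself reduces to one encoding pass and one decoding pass; a sketch, reusing the encoder and first decoder instances from the examples above:

    import torch

    @torch.no_grad()
    def migrate_expression(encoder, decoder_first, face_image):
        # face_image: (B, 3, 256, 256) tensor containing any face.
        feature = encoder(face_image)            # features of the input facial image
        virtual_image = decoder_first(feature)   # virtual facial image with the same expression
        return virtual_image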
308. The computer equipment performs expression recognition on the first facial image through the expression recognition model to obtain predicted expression parameters, and the predicted expression parameters indicate the expression of the virtual face in the first facial image.
In the embodiment of the application, the expression recognition model is used for recognizing the expression of the face in the facial image so as to acquire the expression parameter used for indicating the expression of the face in the facial image. And carrying out expression recognition on the first facial image through the expression recognition model to obtain predicted expression parameters, wherein the predicted expression parameters can reflect the accuracy of the expression recognition model so as to train the expression recognition model based on the predicted expression parameters.
The expression recognition model is an arbitrary network model; for example, the expression recognition model is a network model including a backbone (feature extraction) layer, a Pooling layer, an FC (Fully Connected) layer, and a Sigmoid (normalization) layer.
309. The computer device trains the expression recognition model based on the sample expression parameters and the predicted expression parameters.
In the embodiment of the application, the sample expression parameter and the predicted expression parameter both indicate the expression of the virtual face in the first facial image: the sample expression parameter is the real expression parameter of the first facial image, and the predicted expression parameter is obtained through the expression recognition model, so the difference between the two can reflect the accuracy of the expression recognition model. The expression recognition model is trained based on the sample expression parameter and the predicted expression parameter, so that when expression recognition is performed on the first facial image by the trained expression recognition model, the difference between the obtained expression parameter and the sample expression parameter becomes smaller, thereby improving the accuracy of the expression recognition model.
In one possible implementation, this step 309 includes: determining a fourth loss value based on the sample expression parameter and the predicted expression parameter, the fourth loss value indicating a difference between the sample expression parameter and the predicted expression parameter; based on the fourth loss value, the expression recognition model is trained.
In the embodiment of the application, the fourth loss value is determined so as to quantify the difference between the sample expression parameter and the predicted expression parameter, so that the fourth loss value is ensured to accurately reflect the difference between the sample expression parameter and the predicted expression parameter, and the expression recognition model is trained through the fourth loss value, so that the first facial image is subjected to expression recognition through the trained expression recognition model, and the difference between the obtained expression parameter and the sample expression parameter becomes smaller, so that the accuracy of the expression recognition model is improved.
In one possible implementation, the sample expression parameters include sample expression parameters of a plurality of locations, and the predicted expression parameters include predicted expression parameters of the plurality of locations; the step 309 includes: determining a first loss value based on the sample expression parameters of the plurality of parts and the predicted expression parameters of the plurality of parts, wherein the first loss value indicates the difference between the sample expression parameters and the predicted expression parameters of the same part; based on the first loss value, the expression recognition model is trained.
The plurality of parts refers to a plurality of parts of the virtual face; for example, the plurality of parts include the eyes, nose, mouth, forehead, and the like of the virtual face. The sample expression parameter of any part indicates the expression of that part. Optionally, each part has a plurality of sample expression parameters, and different sample expression parameters of the same part indicate different actions of the part. For example, the mouth has 2 sample expression parameters: the 1st sample expression parameter indicates how wide the mouth is opened, and the 2nd sample expression parameter indicates the degree of pouting.
In the embodiment of the application, the first loss value is determined, so that the difference between the sample expression parameter and the predicted expression parameter of each part is reflected, the expression recognition model is trained through the first loss value, the first facial image is subjected to expression recognition through the trained expression recognition model, the difference between the obtained expression parameter and the sample expression parameter is reduced, and the accuracy of the expression recognition model is improved.
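As a purely illustrative sketch (not part of this application), the following Python code shows one plausible way to compute such a per-part loss; the part names, slice sizes, and the use of a mean absolute difference are assumptions.

import torch

# Hypothetical layout: each part owns a slice of the expression parameter vector.
PART_SLICES = {
    "eyes": slice(0, 10),
    "nose": slice(10, 14),
    "mouth": slice(14, 30),
    "forehead": slice(30, 36),
}

def first_loss_value(predicted: torch.Tensor, sample: torch.Tensor) -> torch.Tensor:
    # Sum, over all parts, of the mean absolute difference between the predicted
    # and sample expression parameters of the same part (assumed formulation).
    loss = predicted.new_zeros(())
    for part, index_slice in PART_SLICES.items():
        loss = loss + (predicted[..., index_slice] - sample[..., index_slice]).abs().mean()
    return loss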
It should be noted that, in the embodiment of the present application, the description is given by taking one round of training of the expression recognition model as an example; in another embodiment, the expression recognition model can be trained iteratively according to the steps 308 to 309. For example, according to the above step 301, a plurality of first facial images and the sample expression parameters corresponding to each first facial image are obtained, and based on each first facial image and its corresponding sample expression parameters, one iteration is performed on the expression recognition model according to the above steps 308 to 309; the iterative training of the expression recognition model is stopped in the case that the number of iterations reaches a number-of-times threshold or the first loss value is smaller than a loss value threshold.
It should be noted that the process of training the coding model and the first decoding model and the process of training the expression recognition model are two completely independent processes. In this embodiment, the description is given by taking the case where the expression recognition model is trained after the coding model and the first decoding model as an example; in another embodiment, the coding model and the first decoding model can be iteratively trained first, and after they are trained, the expression recognition model is iteratively trained according to the steps 308 to 309; alternatively, the expression recognition model can be iteratively trained first according to the steps 308 to 309, and the coding model and the first decoding model are iteratively trained after the expression recognition model is trained.
In the scheme provided by the embodiment of the application, for a first facial image and a second facial image containing different faces, the first facial image and the second facial image are respectively encoded through the same encoding model so as to acquire the features of the two facial images; the first decoding model decodes the features of the first facial image into a third facial image, and the second decoding model decodes the features of the second facial image into a fourth facial image; the first decoding model is then trained based on the first facial image and the third facial image, and the coding model is trained based on the first facial image, the third facial image, the second facial image and the fourth facial image. This training mode is simple and ensures the training efficiency of the models: the coding model learns to encode a facial image into features suitable for the first decoding model, which ensures that the features output by the coding model are suitable for the first decoding model, and the first decoding model learns to decode the features output by the coding model into a facial image containing the virtual face. Therefore, any facial image can be converted into a virtual facial image through the trained coding model and first decoding model, the virtual facial image contains the virtual face, and the expression of the virtual face is identical to the expression of the face in the input facial image, which ensures the accuracy of facial expression migration.
In addition, the embodiment of the application identifies the expression parameters of a facial image through the expression recognition model. When the expression recognition model is trained, if real facial images are used as its training data, accurate expression parameters of the real facial images cannot be obtained, so the accuracy of the trained expression recognition model is poor. Facial images containing a virtual face and their corresponding expression parameters are much easier to obtain; therefore, using facial images containing a virtual face and their corresponding expression parameters as the training data of the expression recognition model ensures sufficient and accurate training data, and thus ensures the accuracy of the trained expression recognition model.
Once the virtual face model is imported into the rendering engine, it is only necessary to set arbitrary sample expression parameters and input them into the rendering engine, and the rendering engine generates facial images containing the virtual face based on the virtual face model. A large number of sample expression parameters and corresponding first facial images can therefore be obtained, which provides enough training data for training the expression recognition model, and the process of obtaining the training data is simple. Considering that various scenes usually require the expression parameters of real facial images, the embodiment of the application trains the coding model and the first decoding model so that a real facial image is converted into a virtual facial image through the trained coding model and first decoding model, and the expression of the virtual face in the virtual facial image is identical to the expression of the face in the real facial image; in this way, the expression parameters of the real facial image can be identified through the cooperation among the coding model, the first decoding model and the expression recognition model, and the accuracy of the expression parameters is ensured.
It should be noted that, in the embodiment of the present application, under the condition that a large number of real facial images and corresponding expression parameters cannot be obtained, the coding model, the first decoding model and the expression recognition model are trained according to the above manner, so that the real facial images are converted into virtual facial images including virtual faces through the coding model and the first decoding model, and the expression parameters of the virtual facial images are recognized through the expression recognition model. Under the condition that a large number of real facial images and corresponding expression parameters can be obtained, the expression recognition model is trained only by the real facial images and the corresponding expression parameters, and the expression parameters of the real facial images are directly recognized by the expression recognition model.
In addition, in the process of performing iterative training on the coding model and the first decoding model, the second decoding model may also be trained synchronously, where the process of training the second decoding model includes: in the case of performing iterative training on the coding model and the first decoding model, the second decoding model is trained based on the second face image and the fourth face image.
In the embodiment of the present application, the difference between the second face image and the fourth face image can reflect the accuracy of the second decoding model, so the second decoding model is trained based on the second face image and the fourth face image, so that the difference between the face image obtained by decoding the second feature through the trained second decoding model and the second face image becomes smaller, thereby improving the accuracy of the second decoding model. In addition, the difference between the second face image and the fourth face image reflects not only the accuracy of the second decoding model but also the accuracy of the coding model. Therefore, in the process of performing iterative training on the coding model and the first decoding model, the second decoding model is iteratively trained to improve its accuracy as much as possible, which weakens the influence of an inaccurate second decoding model on the difference between the second face image and the fourth face image, so that this difference is caused only by the coding model; training the coding model based on the difference between the second face image and the fourth face image can then further improve the accuracy of the coding model.
For example, in the process of performing iterative training on the coding model, the first decoding model, and the second decoding model, in one iteration, the coding model and the first decoding model are trained according to the steps 301 to 307, and the second decoding model is trained based on the second face image and the fourth face image; and then the next iteration is performed.
Optionally, the training the second decoding model comprises: determining a third loss value based on the second face image and the fourth face image, the third loss value indicating a difference between the second face image and the fourth face image; the second decoding model is trained based on the third loss value.
In the embodiment of the present application, the third loss value is determined so as to quantify the difference between the second face image and the fourth face image, which ensures that the third loss value accurately reflects this difference; the second decoding model is then trained with the third loss value, so that when the second feature is decoded by the trained second decoding model, the difference between the obtained face image and the second face image becomes smaller, thereby improving the accuracy of the second decoding model.
It should be noted that, in the foregoing embodiment shown in fig. 3, one first face image and one second face image are used in each iteration of the iterative training of the encoding model, the first decoding model, and the second decoding model; in another embodiment, a plurality of first face images and a plurality of second face images can also be used in each iteration. Taking the second facial image as a real facial image and the number-of-times threshold as 200 as an example, an Adam (Adaptive Moment Estimation, an adaptive learning rate optimization algorithm) optimizer is used to iteratively train the coding model, the first decoding model, and the second decoding model. In the iterative training process, the learning rate is 1e-4, and the batch size is set to 128, i.e., 128 facial images are acquired in each iteration. The 128 face images comprise 64 first face images and 64 second face images; the 64 first face images all contain the same virtual face, but the expression of the virtual face differs across the first face images; the 64 second face images all contain the same face but with different expressions, or the 64 second face images each contain a different face. The 128 face images are input into the coding model, and each face image is encoded by the coding model to obtain its features, i.e., 64 first features and 64 second features are obtained, where the 64 first features correspond one-to-one to the 64 first face images and the 64 second features correspond one-to-one to the 64 second face images. Each first feature is decoded by the first decoding model to obtain the third face image corresponding to each first face image, and each second feature is decoded by the second decoding model to obtain the fourth face image corresponding to each second face image. A loss value is then determined using the following loss function based on the 64 first face images, the 64 second face images, the 64 third face images, and the 64 fourth face images, and the coding model is trained based on the loss value using the Adam optimizer.
L = ‖x̂ − x‖
where L denotes the loss value; x denotes the first face image or the second face image; x̂ denotes the third face image output by the first decoding model (corresponding to the first face image) or the fourth face image output by the second decoding model (corresponding to the second face image); and ‖·‖ denotes the norm.
For 64 first face images and 64 third face images, determining a loss value based on a difference between each first face image and the corresponding third face image, the loss value indicating a difference between each first face image and the corresponding third face image; the first decoding model is trained based on the loss values using an Adam optimizer. For the 64 second face images and the 64 fourth face images, determining a loss value based on a difference between each second face image and the corresponding fourth face image, the loss value indicating a difference between each second face image and the corresponding fourth face image; the second decoding model is trained based on the loss values using an Adam optimizer.
In the process of iteratively training the coding model, the first decoding model and the second decoding model, when the number of iterations reaches 200, training of the coding model, the first decoding model and the second decoding model is complete. A real face image can then be encoded by the coding model to obtain the features of the real face image, and the features of the real face image are decoded by the first decoding model to obtain a virtual face image; the virtual face in the virtual face image is identical to the virtual face in the first face images used during training, and the expression of the virtual face in the virtual face image is identical to the expression of the face in the real face image.
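The worked example above can be summarized in the following PyTorch sketch of the training loop (shared coding model, two decoding models, Adam with learning rate 1e-4, 64 rendered plus 64 real face images per iteration, 200 iterations). The layer configurations, the L1 form of the reconstruction norm, and the data-loading placeholder are assumptions; the application does not fix these details.

import torch
import torch.nn as nn

# Minimal stand-ins for the coding model and the two decoding models.
encoder = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                        nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
first_decoder = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                              nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())
second_decoder = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                               nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(first_decoder.parameters()) + list(second_decoder.parameters()),
    lr=1e-4)
reconstruction = nn.L1Loss()  # assumed norm; the application only states that a norm is used

def load_batch():
    # Placeholder: 64 first (rendered) and 64 second (real) face images, each 256x256.
    return torch.rand(64, 3, 256, 256), torch.rand(64, 3, 256, 256)

for iteration in range(200):  # number-of-times threshold of 200
    first_images, second_images = load_batch()
    third_images = first_decoder(encoder(first_images))     # reconstruct the rendered faces
    fourth_images = second_decoder(encoder(second_images))  # reconstruct the real faces
    loss = reconstruction(third_images, first_images) + reconstruction(fourth_images, second_images)
    # For brevity, all three models are updated with the summed loss here; the application
    # describes training each decoding model with its own reconstruction loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()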
Optionally, the first facial image corresponds to a facial image of a rendering domain, the second facial image corresponds to a facial image of a real domain, the first decoding model corresponds to a rendering domain decoder, and the second decoding model corresponds to a real domain decoder; and forming a picture converter by the trained coding model and the first decoding model, and converting the real face image into a virtual face image through the picture converter.
In one manner of acquiring a plurality of first face images and a plurality of second face images, the face of the virtual face model is adjusted by the rendering engine based on a plurality of set sample expression parameters, and the adjusted virtual face model is photographed to obtain the plurality of first face images; the face of a user is photographed by a terminal to obtain a video, and a plurality of video frames containing the face in the video are taken as the plurality of second face images.
It should be noted that, in the foregoing embodiment shown in fig. 3, one first facial image is used in each iteration of the iterative training of the expression recognition model; in another embodiment, a plurality of first facial images can also be used in each iteration. In the embodiment of the present application, the expression recognition model includes a backbone layer, a Pooling layer, an FC layer, and a Sigmoid layer, where the backbone is ResNet50 (Residual Network 50). As shown in fig. 5, the process of training the expression recognition model includes: according to step 301, a plurality of sample expression parameters and the first facial image corresponding to each sample expression parameter are obtained through the rendering engine; for example, a plurality of first facial images are acquired as shown in fig. 6. In the process of iteratively training the expression recognition model, the batch size is set to B, i.e., B sample expression parameters and the first facial image corresponding to each sample expression parameter are obtained in each iteration. The obtained B first facial images are taken as the input of the expression recognition model, and expression recognition is performed on each first facial image by the expression recognition model to obtain the predicted expression parameters of each first facial image. Taking the scale of the first facial image as 256×256 as an example, feature extraction is performed on the B first facial images by the backbone in the expression recognition model to obtain the features of the B first facial images; for example, a B×2048×8×8 Feature Map output by the backbone represents the features of the B first facial images. The B×2048×8×8 Feature Map output by the backbone is input into the Pooling layer, the Pooling layer outputs B×2048 features into the FC layer, and after the FC layer outputs its features, the Sigmoid layer outputs B×130 expression parameters. That is, the expression parameters of the B first facial images are obtained, the dimension of the expression parameter of each first facial image is 130, and expression parameters of different dimensions indicate the expressions of different parts or different actions of the same part. The parameter dimension of the FC layer is 2048×130, and the Sigmoid layer is used to normalize the values of the B×130 expression parameters output by the FC layer to between 0 and 1 by adopting the following function.
σ(x) = 1 / (1 + e^(−x))
where x denotes any expression parameter input to the Sigmoid layer, σ denotes the Sigmoid function, and e denotes the natural constant.
When the B×130 expression parameters output by the Sigmoid layer are obtained, the following loss function is used to determine a loss value, and the expression recognition model is trained based on the loss value.
L = (1/B) · Σ_{i=1,…,B} ‖ŷ_i − y_i‖
where L denotes the loss value; B denotes the number of first facial images acquired in each iteration, i.e., the batch size; x_i denotes any first facial image; ŷ_i denotes the predicted expression parameters obtained by performing expression recognition on the first facial image x_i through the expression recognition model; and y_i denotes the sample expression parameters corresponding to the first facial image x_i.
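A minimal PyTorch sketch consistent with the shapes described above (a ResNet-50 backbone is assumed, B×2048 pooled features, a 2048×130 FC layer, a Sigmoid output, and a mean absolute difference as the loss) is given below for illustration only.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class ExpressionRecognizer(nn.Module):
    # Backbone -> Pooling -> FC(2048, 130) -> Sigmoid, as described above.
    def __init__(self, num_params: int = 130):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # B x 2048 x 8 x 8 for 256x256 input
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_params)  # parameter dimension 2048 x 130
        self.act = nn.Sigmoid()                # normalizes each output to between 0 and 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.pool(self.backbone(x)).flatten(1)  # B x 2048
        return self.act(self.fc(features))                 # B x 130

model = ExpressionRecognizer()
images = torch.rand(8, 3, 256, 256)    # B first facial images from the rendering engine (B = 8 here)
sample_params = torch.rand(8, 130)     # corresponding sample expression parameters
predicted_params = model(images)
loss = (predicted_params - sample_params).abs().mean()  # mean absolute difference (assumed loss form)
loss.backward()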
Taking the case that the number of iterations for training the expression recognition model is 40 as an example, after the expression recognition model is iteratively trained 40 times according to the above process, the training of the expression recognition model is complete.
According to the embodiment of the application, the coding model, the first decoding model and the second decoding model are trained in a semi-supervised manner: in the case that the expression parameters corresponding to real face images cannot be obtained, the coding model and the first decoding model are trained with real face images and virtual face images so as to convert a real face image into a virtual face image through the coding model and the first decoding model, and the expression recognition model is trained with virtual face images and the corresponding sample expression parameters, which avoids an inaccurate expression recognition model caused by insufficient training data and ensures the accuracy of the expression recognition model. Through the coding model and the first decoding model, a real face is first converted into a virtual face image, and the expression parameters of the virtual face image are then recognized by the expression recognition model, which realizes a new expression recognition scheme; the model training process is simple, the expression recognition scheme is simple, and the application range of the expression recognition scheme is broadened.
It should be noted that, based on the above model training method, the embodiment of the present application can utilize the trained encoding model and the first decoding model to realize facial expression migration, and the specific process is described in the following embodiment.
Fig. 7 is a flowchart of a facial expression migration method provided in an embodiment of the present application, where the method is performed by a computer device, and as shown in fig. 7, the method includes the following steps.
701. The computer device acquires any face image.
In the embodiment of the present application, the face image can be any face image, for example, the face image is a real face image or a cartoon face image, or the like.
In one possible implementation, the computer device is provided as a server, and the process of acquiring the face image includes: the terminal shoots the face through the camera to obtain a face image, the face image is sent to the server, and the server receives the face image.
702. The computer device encodes the facial image through the encoding model to obtain the features of the facial image.
This step 702 is similar to step 303 described above and will not be described again here.
703. The computer equipment decodes the characteristics of the face image through the first decoding model to obtain a virtual face image, wherein the virtual face image comprises a virtual face, and the expression of the face in the face image is identical to that of the virtual face in the virtual face image.
In the embodiment of the present application, the virtual face in the virtual face image is the virtual face in the first face image used when training the coding model and the first decoding model, but the expression of the virtual face in the virtual face image may be different from the expression of the virtual face in the first face image. The first decoding model can decode the features output by the encoding model into a facial image containing the virtual face, and the features of the facial image output by the encoding model can represent the facial image and the expression of the face in it; therefore, when the features of the facial image are decoded by the first decoding model, the expression of the virtual face in the obtained virtual facial image is the same as the expression of the face in the facial image, thereby realizing the transfer of the expression of the face to the virtual face.
In the scheme provided by the embodiment of the application, as the first decoding model can decode the characteristics output by the encoding model into the facial image containing the virtual face, any facial image is encoded through the encoding model, the characteristics obtained by encoding are decoded through the first decoding model, the virtual facial image containing the virtual face is obtained, and the expression of the virtual face in the virtual facial image is the same as the expression of the face in the facial image, so that the migration of the expression is realized, and the accuracy of the facial expression migration is ensured.
For example, taking the face image as a real face image, the real face image is encoded by the encoding model to obtain the features of the real face image, and the features of the real face image are decoded by the first decoding model; the first decoding model can recognize the features of the real face image and decode a virtual face image containing the virtual face, and the virtual face in the virtual face image retains the same expression as the face in the real face image, which is equivalent to converting the picture type of the real face image, that is, converting the real face image into a virtual face image.
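Reusing the hypothetical module names from the earlier training sketch, expression migration at inference time reduces to one encode and one decode, as illustrated below; this is only a sketch of the described flow.

import torch

@torch.no_grad()
def migrate_expression(face_image: torch.Tensor,
                       encoder: torch.nn.Module,
                       first_decoder: torch.nn.Module) -> torch.Tensor:
    # Convert any face image (e.g. a real face image of shape 3 x 256 x 256) into a
    # virtual face image whose virtual face carries the same expression, using the
    # trained coding model and the trained first decoding model.
    features = encoder(face_image.unsqueeze(0))   # step 702: encode the face image
    virtual_face_image = first_decoder(features)  # step 703: decode into a virtual face image
    return virtual_face_image.squeeze(0)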
On the basis of the embodiment shown in fig. 7, the embodiment of the present application can also perform expression recognition on the facial image obtained by facial expression migration through the expression recognition model, so as to adjust the facial expression of the virtual object in the virtual scene by using the expression parameters obtained by recognition, and the specific process is described in the following embodiment.
Fig. 8 is a flowchart of another facial expression migration method provided in an embodiment of the present application, where the method is performed by a computer device, and as shown in fig. 8, the method includes the following steps.
801. The computer device acquires any face image.
802. The computer device encodes the facial image through the encoding model to obtain the features of the facial image.
803. The computer equipment decodes the characteristics of the face image through the first decoding model to obtain a virtual face image, wherein the virtual face image comprises a virtual face, and the expression of the face in the face image is identical to that of the virtual face in the virtual face image.
The steps 801 to 803 are the same as the steps 701 to 703, and are not described herein.
804. The computer equipment carries out expression recognition on the virtual facial image through the expression recognition model to obtain expression parameters, and the expression parameters indicate the expression of the virtual face in the virtual facial image.
In the embodiment of the application, the expression recognition is performed on the virtual facial image through the expression recognition model, so that the obtained expression parameters are matched with the expression of the virtual face in the virtual facial image, and the accuracy of the expression parameters is guaranteed.
Step 804 is similar to step 308 described above and will not be described again.
805. The computer device adjusts the face of the virtual object in the virtual scene based on the expression parameters so that the facial expression of the adjusted virtual object is the same as the expression of the face in the facial image.
In the embodiment of the application, under the condition that the expression parameters of the facial image are obtained, the face of the virtual object in the virtual scene is adjusted based on the expression parameters, so that the facial expression of the virtual object is identical to the expression of the face in the facial image, a novel expression driving mode is realized, the virtual object in the virtual scene is driven by the facial image to make the expression identical to the expression of the face in the facial image, so that the facial expression of the virtual object is synchronous with the expression of the face in the facial image, the accuracy of controlling the facial expression of the virtual object is ensured, and the high-precision expression driving effect is realized.
The virtual object can be any object, for example, a cartoon object, a virtual character in a game, or the like.
In one possible implementation, the virtual object corresponds to an expression controller, where the expression controller is configured to control a facial expression of the virtual object, and the step 805 includes: and adjusting the face of the virtual object in the virtual scene based on the expression parameters through the expression controller so that the facial expression of the adjusted virtual object is the same as the facial expression in the facial image.
In the embodiment of the application, the expression controller can control the facial expression of the virtual object, and then the facial expression of the virtual object is adjusted based on the expression parameters through the expression controller so as to ensure the accuracy of expression adjustment.
Optionally, the virtual object corresponds to a plurality of expression controllers, the expression parameters include a plurality of expression parameters, the expression parameters correspond one-to-one to the expression controllers, and each expression controller is used for controlling one fine-grained expression of the virtual object. For example, if the expression parameter is 130-dimensional, that is, includes 130 expression parameters, the virtual object corresponds to 130 expression controllers. For another example, if the expression parameters include the expression parameters of 3 parts and each part has 5 expression parameters, then the virtual object corresponds to 15 expression controllers.
In the embodiment of the application, for any expression parameter, the expression parameter is assigned to the corresponding expression controller, and the expression controller controls the facial expression of the virtual object according to the expression parameter, so that the face of the virtual object presents the corresponding expression. For example, an expression controller is used for controlling the eye closure of the virtual object: when the expression parameter corresponding to the expression controller is 1, the eyes are closed, so the eyes of the virtual object are controlled to close; when the expression parameter corresponding to the expression controller is 0, the eyes are open, so the eyes of the virtual object are controlled to open; when the expression parameter corresponding to the expression controller is 0.5, the eyes are half-closed, so the eyes of the virtual object are controlled to be half-closed.
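The controller interface below is purely hypothetical; the sketch only illustrates the described one-to-one assignment of expression parameters to expression controllers.

from typing import Callable, Dict, List

def drive_controllers(expression_params: List[float],
                      controllers: Dict[int, Callable[[float], None]]) -> None:
    # Assign each expression parameter to its corresponding expression controller;
    # for a 130-dimensional expression parameter, `controllers` maps indices 0..129
    # to callables that set the controlled facial attribute (e.g. eye closure,
    # where 1 means fully closed, 0 fully open, and 0.5 half closed).
    for index, value in enumerate(expression_params):
        controllers[index](value)

# Hypothetical usage: record the value handed to each controller.
state: Dict[int, float] = {}
controllers = {i: (lambda value, i=i: state.__setitem__(i, value)) for i in range(130)}
drive_controllers([0.5] * 130, controllers)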
In one possible implementation, the expression parameters include the expression parameters of a plurality of parts, the expression parameters of each part include a plurality of expression parameters, and different expression parameters of the same part indicate different actions of the part; then the step 805 includes: based on the expression parameters of the plurality of parts, the plurality of expression parameters of the same part are fused to obtain the fused expression parameter of each part, and each part of the virtual object is adjusted based on the fused expression parameter of that part, as shown in the sketch after the next paragraph.
In the embodiment of the application, the same part is provided with a plurality of expression parameters, different expression parameters of the same part indicate different actions of the part, and then the plurality of expression parameters of the same part are fused, so that the fusion expression parameters of the part can reflect the actions executed by the part, and then the part in the virtual object is adjusted based on the fusion expression parameters, so that the part presents the actions indicated by the fusion expression parameters, and further the face of the virtual object presents the expressions indicated by the expression parameters, and the accuracy of expression control is ensured.
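The application does not specify how the several expression parameters of one part are fused; the short sketch below assumes a plain average purely for illustration.

from typing import Dict, List

def fuse_part_params(part_params: Dict[str, List[float]]) -> Dict[str, float]:
    # Fuse the several expression parameters of each part into one fused expression
    # parameter per part; a simple average is assumed here.
    return {part: sum(values) / len(values) for part, values in part_params.items()}

fused = fuse_part_params({"mouth": [0.8, 0.1], "eyes": [0.0, 0.0, 1.0]})  # {'mouth': 0.45, 'eyes': ~0.33}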
In one possible implementation, the facial image is any video frame in the video; a process for adjusting faces of virtual objects in a virtual scene, comprising: under the condition that the expression parameters of a plurality of video frames in the video are obtained, the faces of the virtual objects are sequentially adjusted according to the sequence of the video frames based on the expression parameters of the video frames, so that the facial expression of the virtual objects changes along with the expression change of the faces in the video.
In this embodiment of the present application, the video includes a plurality of video frames, where the plurality of video frames include the same face, the video can reflect the expression change of the face in a period of time, according to the steps 801-804 described above, the expression parameter of each video frame can be obtained to indicate the expression of the face in the video frame, and based on the expression parameters of the plurality of video frames, the face of the virtual object is adjusted to present the change of the facial expression of the virtual object along with the expression change of the face in the video, so as to further present the animation of the facial expression change of the virtual object, so as to ensure the synchronization of the expression of the virtual object and the expression of the person in the video, and ensure the expression migration effect.
Wherein the video is an arbitrary video. For example, the video is a video obtained by photographing any face by a user through a terminal, and during photographing, the photographed face makes various expressions.
For example, the video frames in the video are real face images: the user shoots an actor's face through the camera of the terminal to obtain the video, the video frames in the video are preprocessed, and the scale of the video frames is adjusted, for example, to 256×256. The coding model and the first decoding model can form a picture converter, which converts each adjusted video frame into a face image containing the virtual face, i.e., the real face images in the video correspond one-to-one to the virtual face images, and the obtained plurality of virtual face images can also form a video. Expression recognition is performed on each virtual face image through the expression recognition model to obtain the expression parameters of each virtual face image, i.e., the expression parameters of the real face image corresponding to each virtual face image; based on the expression parameters of the plurality of real face images, the face of the virtual object in the virtual scene is adjusted so that the virtual object makes the same expressions as the actor.
Optionally, the video is a video acquired by the terminal in real time. Taking the case where the computer device is provided as a server as an example, the process of adjusting the face of the virtual object includes: the terminal acquires the video in real time and sends it to the server in real time; the server receives the video and adjusts the facial expression of the virtual object in the virtual scene based on the video frames in the video according to the steps 801 to 805, so that the terminal displays an animation in which the facial expression of the virtual object changes along with the facial expression in the video.
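The per-frame pipeline just described can be sketched as follows, reusing the hypothetical models from the earlier sketches; the loop is illustrative only.

import torch

@torch.no_grad()
def drive_virtual_object_from_video(frames, encoder, first_decoder, recognizer, controllers):
    # For each preprocessed 256x256 video frame containing a real face: convert it into
    # a virtual face image, recognize the 130 expression parameters, and assign them to
    # the virtual object's expression controllers in the order of the video frames.
    for frame in frames:
        virtual_face_image = first_decoder(encoder(frame.unsqueeze(0)))
        params = recognizer(virtual_face_image).squeeze(0).tolist()
        for index, value in enumerate(params):
            controllers[index](value)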
For example, the terminal displays the virtual object in the virtual scene in the display interface, and can shoot the video in real time through the camera and send the video to the server, and when the server acquires the expression parameter of each video frame, the server can adjust the face of the virtual object, and the terminal displays the picture of facial expression change of the virtual object in the display interface. Or under the condition that the server acquires the expression parameter of each video frame, the server sends the expression parameter of each video frame to the terminal, the terminal adjusts the face of the virtual object in the virtual scene based on the expression parameter, and a picture of facial expression change of the virtual object is displayed in the display interface.
According to the scheme provided by the embodiment of the application, a visual motion-capture application or plug-in can be accessed to capture the expression of a face and migrate the captured expression to the face of the virtual object, so that the virtual object faithfully reproduces the facial movements of the face.
In the scheme provided by the embodiment of the application, since the first decoding model can decode the characteristics output by the encoding model into the facial image containing the virtual face, any facial image is encoded through the encoding model, the characteristics obtained by encoding are decoded through the first decoding model, the virtual facial image containing the virtual face is obtained, and the expression of the virtual face in the virtual facial image is the same as the expression of the face in the facial image, so that the migration of the expression is realized, the accuracy of the expression migration is ensured, and the expression driving effect with high precision is realized.
As shown in fig. 9, the coding model and the first decoding model provided in the embodiments of the present application can form a picture converter, where the picture converter is configured to convert a real facial image into a virtual facial image including a virtual face, and the expression recognition model can recognize expression parameters from the virtual facial image.
It should be noted that the facial expression migration method provided by the embodiment of the application has low requirements on equipment: high-precision expression parameters can be obtained from facial images shot by any camera, so high-precision expression capture is realized and the effect of driving a virtual face through the expression parameters is ensured. Compared with the related art, the facial expression migration method provided by the embodiment of the application enables the avatar to faithfully reproduce the facial movements of an actor when driving the facial expression of the virtual object, makes the facial expression of the virtual object more real and vivid, achieves a better expression driving effect, and simplifies the model training process and the expression migration process, thereby ensuring the model training efficiency and the expression migration efficiency.
The scheme provided by the embodiment of the application can be applied to various scenes, for example, a movie scene or other virtual object control scenes.
For example, taking application in a movie scene as an example, according to the method provided in the embodiment of the application, a video is obtained by shooting an actor, and expression parameters of a video frame including the face of the actor in the video can be identified through an encoding model, a first decoding model and an expression identification model, so that facial expressions of virtual objects in the virtual scene are controlled based on the expression parameters, and the facial expressions of the virtual objects are synchronized and consistent with the expressions of the actor, so that an animation including the facial expressions of the virtual objects can be added into the movie subsequently, and therefore, the facial expressions of the virtual objects are driven without identifying the expressions of the actor through professional equipment.
For another example, the terminal logs in an application through a user identifier, the application is a virtual object control application, and the user can select any virtual object through the application in the terminal; the terminal displays the selected virtual object in the application interface; the user shoots the face of the user through the terminal, the terminal acquires the video containing the face of the user in real time and sends the video to a server providing service for the application, the server receives the video in real time, and according to the scheme provided by the embodiment of the application, the expression parameter of each video frame is determined, and the expression parameter is sent to the terminal; the terminal controls the facial expression of the virtual object based on the expression parameters through the expression controller corresponding to the virtual object, so that the facial expression of the virtual object displayed by the terminal is synchronous with the facial expression of the user shot by the terminal, as shown in fig. 10 and 11, so that the user can control the facial expression of the virtual object in the virtual scene by shooting the user himself in real time. And the user can generate the video based on the expression animation of the displayed virtual object through the terminal, and can share the video to other users.
For another example, taking an application in a game scene as an example, a terminal logs in the game application through a user identifier, and a user displays a virtual scene through the game application in the terminal, wherein a virtual object controlled by the terminal is displayed in the virtual scene; the user shoots the face of the user through the terminal, the terminal acquires the video containing the face of the user in real time and sends the video to a server for providing service for game application, the server receives the video in real time, and according to the scheme provided by the embodiment of the application, the expression parameter of each video frame is determined, and the expression parameter is sent to the terminal; the terminal controls the facial expression of the virtual object based on the expression parameters through the expression controller corresponding to the virtual object in the game application, so that the facial expression of the virtual object in the virtual scene is synchronous with the facial expression of the user shot by the terminal, and the user can control the facial expression of the virtual object in the virtual scene by shooting the user in real time.
Fig. 12 is a schematic structural diagram of a model training device according to an embodiment of the present application, as shown in fig. 12, where the device includes:
an acquisition module 1201, configured to acquire a first face image and a second face image, where the first face image includes a virtual face, and the face in the second face image is different from the virtual face;
The encoding module 1202 is configured to encode the first facial image and the second facial image through an encoding model, to obtain a first feature and a second feature, where the first feature indicates the first facial image, and the second feature indicates the second facial image;
the decoding module 1203 is configured to decode the first feature through the first decoding model to obtain a third face image;
the decoding module 1203 is further configured to decode the second feature through the second decoding model to obtain a fourth facial image;
a training module 1204 for training the first decoding model based on the first face image and the third face image; training the coding model based on the first face image, the third face image, the second face image, and the fourth face image;
the first decoding model and the encoding model are used for generating a virtual face image based on any face image, wherein the expression of the virtual face in the virtual face image is the same as that of the face in the face image.
In one possible implementation, the obtaining module 1201 is configured to generate a first facial image based on a sample expression parameter, where the sample expression parameter indicates an expression of a virtual face in the first facial image;
As shown in fig. 13, the apparatus further includes:
the recognition module 1205 is configured to perform expression recognition on the first facial image through an expression recognition model to obtain predicted expression parameters, where the predicted expression parameters indicate an expression of a virtual face in the first facial image;
the training module 1204 is further configured to train the expression recognition model based on the sample expression parameter and the predicted expression parameter.
In another possible implementation, the sample expression parameters include sample expression parameters of a plurality of locations, and the predicted expression parameters include predicted expression parameters of the plurality of locations; the training module 1204 is configured to determine a first loss value based on the sample expression parameters of the plurality of parts and the predicted expression parameters of the plurality of parts, where the first loss value indicates a difference between the sample expression parameters and the predicted expression parameters of the same part; based on the first loss value, the expression recognition model is trained.
In another possible implementation, the training module 1204 is configured to determine a second loss value based on the first facial image and the third facial image, the second loss value indicating a difference between the first facial image and the third facial image; determining a third loss value based on the second face image and the fourth face image, the third loss value indicating a difference between the second face image and the fourth face image; the coding model is trained based on the second loss value and the third loss value.
In another possible implementation, the training module 1204 is further configured to train the second decoding model based on the second face image and the fourth face image in case of performing iterative training on the encoding model and the first decoding model.
It should be noted that: the model training apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the model training device and the model training method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 14 is a schematic structural diagram of a facial expression migration apparatus according to an embodiment of the present application, as shown in fig. 14, where the apparatus includes:
an acquisition module 1401 for acquiring any one of the face images;
the encoding module 1402 is configured to encode the facial image through the encoding model to obtain features of the facial image;
a decoding module 1403, configured to decode, by using the first decoding model, features of the face image to obtain a virtual face image, where the virtual face image includes a virtual face, and a face in the face image has the same expression as the virtual face in the virtual face image;
The coding model and the first decoding model are obtained by training based on the model training method in the embodiment.
In one possible implementation, as shown in fig. 15, the apparatus further includes:
the recognition module 1404 is configured to perform expression recognition on the virtual facial image through the expression recognition model to obtain expression parameters, where the expression parameters indicate an expression of a virtual face in the virtual facial image.
In another possible implementation, as shown in fig. 15, the apparatus further includes:
the adjusting module 1405 is configured to adjust a face of the virtual object in the virtual scene based on the expression parameter, so that the facial expression of the adjusted virtual object is the same as the expression of the face in the facial image.
In another possible implementation, the expression parameters include expression parameters of a plurality of parts, each of the expression parameters of the parts includes a plurality of the expression parameters, and different expression parameters of the same part indicate different actions of the part; an adjusting module 1405, configured to fuse multiple expression parameters of the same location based on the expression parameters of the multiple locations, to obtain a fused expression parameter of each location; and adjusting each part in the virtual object based on the fusion expression parameters of each part.
In another possible implementation, the facial image is any video frame in the video; the adjusting module 1405 is configured to, when obtaining expression parameters of a plurality of video frames in the video, sequentially adjust a face of the virtual object according to an order of the plurality of video frames based on the expression parameters of the plurality of video frames, so that a facial expression of the virtual object changes along with an expression change of the face in the video.
It should be noted that: the facial expression migration apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the facial expression migration device provided in the above embodiment and the facial expression migration method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.
The embodiment of the application further provides a computer device, which comprises a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor to realize the operations executed by the model training method or the facial expression migration method of the embodiment.
Optionally, the computer device is provided as a terminal. Fig. 16 shows a block diagram of a terminal 1600 according to an exemplary embodiment of the present application. Terminal 1600 includes: a processor 1601, and a memory 1602.
Processor 1601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1601 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. Memory 1602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1602 is used to store at least one computer program for execution by processor 1601 to implement the model training method or the facial expression migration method provided by the method embodiments in the present application.
In some embodiments, terminal 1600 may also optionally include: a peripheral interface 1603, and at least one peripheral. The processor 1601, memory 1602, and peripheral interface 1603 may be connected by bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1603 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1604, a display screen 1605, a camera assembly 1606, audio circuitry 1607, and a power supply 1608.
Peripheral interface 1603 may be used to connect I/O (Input/Output) related at least one peripheral to processor 1601 and memory 1602. In some embodiments, the processor 1601, memory 1602, and peripheral interface 1603 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 1601, memory 1602, and peripheral interface 1603 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1604 is used for receiving and transmitting RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1604 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1604 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 1604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuit 1604 may also include NFC (Near Field Communication ) related circuits, which are not limited in this application.
The display screen 1605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1605 is a touch display screen, the display screen 1605 also has the ability to collect touch signals on or above its surface. The touch signal may be input to the processor 1601 as a control signal for processing. At this point, the display screen 1605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1605, disposed on the front panel of the terminal 1600; in other embodiments, there may be at least two display screens 1605, each disposed on a different surface of the terminal 1600 or in a folded design; in still other embodiments, the display screen 1605 may be a flexible display screen disposed on a curved or folded surface of the terminal 1600. The display screen 1605 may even be set to a non-rectangular irregular shape, that is, a shaped screen. The display screen 1605 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1606 is used to capture images or video. Optionally, the camera assembly 1606 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1607 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them to the processor 1601 for processing, or to the radio frequency circuit 1604 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones may be provided at different locations of the terminal 1600. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 1607 may also include a headphone jack.
The power supply 1608 is used to supply power to the various components in the terminal 1600. The power supply 1608 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1608 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, while a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
Those skilled in the art will appreciate that the structure shown in fig. 16 is not limiting, and that more or fewer components than shown may be included, certain components may be combined, or a different arrangement of components may be employed.
Optionally, the computer device is provided as a server. Fig. 17 is a schematic structural diagram of a server provided in an embodiment of the present application. The server 1700 may vary considerably in configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1701 and one or more memories 1702, where at least one computer program is stored in the memory 1702 and is loaded and executed by the processor 1701 to implement the methods provided by the respective method embodiments described above. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for implementing other functions of the device, which are not described in detail here.
Embodiments of the present application also provide a computer readable storage medium having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to implement the operations performed by the model training method or the facial expression migration method of the above embodiments.
Embodiments of the present application also provide a computer program product, including a computer program, which when executed by a processor implements the operations performed by the model training method or the facial expression migration method of the above embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The foregoing is merely an optional embodiment of the present application and is not intended to limit the present application. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present application shall be included in the scope of protection of the present application.

Claims (14)

1. A method of model training, the method comprising:
acquiring a first face image and a second face image, wherein the first face image comprises a virtual face, and the face in the second face image is different from the virtual face;
encoding the first facial image and the second facial image respectively through an encoding model to obtain a first feature and a second feature, wherein the first feature indicates the first facial image, and the second feature indicates the second facial image;
decoding the first feature through a first decoding model to obtain a third face image;
decoding the second feature through a second decoding model to obtain a fourth facial image;
training the first decoding model based on the first face image and the third face image; training the coding model based on the first face image, the third face image, the second face image, and the fourth face image;
the first decoding model and the encoding model are used for generating a virtual face image based on any face image, and the expression of the virtual face in the virtual face image is identical to that of the face in the face image.
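For readability, the following PyTorch-style sketch illustrates the structure recited in claim 1: a single encoding model shared by both face images, a first decoding model that reconstructs the virtual-face image, and a second decoding model that reconstructs the other face image. The layer sizes, the 128×128 input resolution, and all identifiers (ConvEncoder, ConvDecoder, first_decoder, second_decoder) are illustrative assumptions and are not taken from the claims.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Shared encoding model: maps a 3x128x128 face image to a feature vector."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, feat_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ConvDecoder(nn.Module):
    """Decoding model: maps a feature vector back to a 3x128x128 face image."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 128 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32 -> 64
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 64 -> 128
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.fc(z).view(-1, 128, 16, 16)
        return self.net(h)

if __name__ == "__main__":
    encoder = ConvEncoder()
    first_decoder, second_decoder = ConvDecoder(), ConvDecoder()  # virtual face / other face
    first_image = torch.rand(2, 3, 128, 128)    # stand-in batch of virtual-face images
    second_image = torch.rand(2, 3, 128, 128)   # stand-in batch of other-face images
    third_image = first_decoder(encoder(first_image))
    fourth_image = second_decoder(encoder(second_image))
    print(third_image.shape, fourth_image.shape)  # both torch.Size([2, 3, 128, 128])
```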
2. The method of claim 1, wherein acquiring the first facial image comprises:
generating the first facial image based on a sample expression parameter, the sample expression parameter indicating an expression of the virtual face in the first facial image;
after the training of the coding model based on the first face image, the third face image, the second face image, and the fourth face image, the method further includes:
carrying out expression recognition on the first facial image through an expression recognition model to obtain predicted expression parameters, wherein the predicted expression parameters indicate the expression of the virtual face in the first facial image;
and training the expression recognition model based on the sample expression parameters and the predicted expression parameters.
3. The method of claim 2, wherein the sample expression parameters comprise sample expression parameters of a plurality of parts, and the predicted expression parameters comprise predicted expression parameters of the plurality of parts; the training the expression recognition model based on the sample expression parameters and the predicted expression parameters comprises:
determining a first loss value based on the sample expression parameters of the plurality of parts and the predicted expression parameters of the plurality of parts, the first loss value indicating a difference between the sample expression parameter and the predicted expression parameter of the same part;
and training the expression recognition model based on the first loss value.
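Claims 2 and 3 describe supervising the expression recognition model with a first loss value computed per part between sample and predicted expression parameters. The sketch below assumes each part's parameters are small tensors of blendshape-like weights and uses a mean squared error as the difference measure; both choices are illustrative assumptions, as the claims do not fix the parameter format or the distance.

```python
import torch
import torch.nn.functional as F

def first_loss_value(sample_params: dict, predicted_params: dict) -> torch.Tensor:
    """Accumulate the difference between sample and predicted expression
    parameters of the same part over all parts (e.g. brows, eyes, mouth)."""
    per_part = [
        F.mse_loss(predicted_params[part], sample_params[part])
        for part in sample_params
    ]
    return torch.stack(per_part).mean()

# hypothetical per-part expression parameters
sample = {"mouth": torch.rand(8), "left_eye": torch.rand(4), "right_eye": torch.rand(4)}
predicted = {part: torch.rand_like(v) for part, v in sample.items()}
loss = first_loss_value(sample, predicted)
print(float(loss))  # in training, this loss would be backpropagated through the recognition model
```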
4. The method of claim 1, wherein the training the coding model based on the first face image, the third face image, the second face image, and the fourth face image comprises:
determining a second loss value based on the first face image and the third face image, the second loss value indicating a difference between the first face image and the third face image;
determining a third loss value based on the second face image and the fourth face image, the third loss value indicating a difference between the second face image and the fourth face image;
training the coding model based on the second loss value and the third loss value.
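As an illustration of claim 4, the sketch below computes the second loss value (difference between the first and third face images) and the third loss value (difference between the second and fourth face images) and updates only the encoding model. The L1 reconstruction loss, the equal weighting of the two terms, and the tiny stand-in networks are all assumptions made so the example stays self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# tiny stand-in networks; in practice these would be the convolutional
# encoder and decoders sketched after claim 1
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
first_decoder = nn.Sequential(nn.Linear(128, 3 * 64 * 64), nn.Unflatten(1, (3, 64, 64)))
second_decoder = nn.Sequential(nn.Linear(128, 3 * 64 * 64), nn.Unflatten(1, (3, 64, 64)))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # only the encoder is updated here

def encoder_training_step(first_image: torch.Tensor, second_image: torch.Tensor) -> float:
    third_image = first_decoder(encoder(first_image))     # reconstruction of the virtual face
    fourth_image = second_decoder(encoder(second_image))  # reconstruction of the other face
    second_loss = F.l1_loss(third_image, first_image)     # "second loss value"
    third_loss = F.l1_loss(fourth_image, second_image)    # "third loss value"
    loss = second_loss + third_loss                       # equal weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(encoder_training_step(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)))
```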
5. The method according to claim 1, wherein the method further comprises:
when the coding model and the first decoding model are iteratively trained, the second decoding model is trained based on the second face image and the fourth face image.
6. A method of facial expression migration, the method comprising:
acquiring any facial image;
encoding the facial image through an encoding model to obtain the characteristics of the facial image;
decoding the characteristics of the face image through a first decoding model to obtain a virtual face image, wherein the virtual face image comprises a virtual face, and the expression of the face in the face image is the same as that of the virtual face in the virtual face image;
wherein the coding model and the first decoding model are trained based on the method of any one of claims 1-5.
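The inference path of claim 6 reduces to encoding an arbitrary face image and decoding the resulting feature with the first decoding model. Below is a minimal sketch, assuming the encoder/decoder interfaces used in the earlier examples; the stand-in networks exist only to keep the snippet runnable.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def migrate_expression(encoder: nn.Module, first_decoder: nn.Module,
                       face_image: torch.Tensor) -> torch.Tensor:
    """Encode any face image, then decode its feature with the first decoding
    model to obtain a virtual-face image with the same expression."""
    feature = encoder(face_image.unsqueeze(0))   # add a batch dimension
    return first_decoder(feature).squeeze(0)

# usage with stand-in networks
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
first_decoder = nn.Sequential(nn.Linear(128, 3 * 64 * 64), nn.Unflatten(1, (3, 64, 64)))
virtual_face = migrate_expression(encoder, first_decoder, torch.rand(3, 64, 64))
print(virtual_face.shape)  # torch.Size([3, 64, 64])
```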
7. The method of claim 6, wherein after decoding the features of the face image with the first decoding model to obtain a virtual face image, the method further comprises:
and carrying out expression recognition on the virtual facial image through an expression recognition model to obtain expression parameters, wherein the expression parameters indicate the expression of the virtual face in the virtual facial image.
8. The method according to claim 7, wherein after performing expression recognition on the virtual facial image by using an expression recognition model to obtain expression parameters, the method further comprises:
and adjusting the face of the virtual object in the virtual scene based on the expression parameter so that the facial expression of the adjusted virtual object is the same as the expression of the face in the facial image.
9. The method of claim 8, wherein the expression parameters comprise expression parameters of a plurality of parts, there are a plurality of expression parameters for each part, and different expression parameters of the same part indicate different actions of the part; the adjusting the face of the virtual object in the virtual scene based on the expression parameters comprises:
based on the expression parameters of the multiple parts, fusing the expression parameters of the same part to obtain fused expression parameters of each part;
and adjusting each part in the virtual object based on the fusion expression parameters of each part.
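A minimal sketch of the per-part fusion step in claim 9, assuming each part carries several candidate expression parameters and that simple averaging is the fusion rule; the part names and the averaging choice are illustrative only.

```python
from typing import Dict, List

def fuse_part_parameters(part_params: Dict[str, List[float]]) -> Dict[str, float]:
    """Fuse the several expression parameters of each part into a single
    fused parameter per part (here: the mean)."""
    return {part: sum(values) / len(values) for part, values in part_params.items()}

# hypothetical parameters predicted for each facial part
params = {"mouth": [0.2, 0.6, 0.1], "left_brow": [0.8, 0.7], "right_eye": [0.3]}
fused = fuse_part_parameters(params)
print(fused)  # each part of the virtual object would then be adjusted from these values
```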
10. The method of claim 8, wherein the facial image is any one of video frames in a video; the adjusting the face of the virtual object in the virtual scene based on the expression parameter so that the facial expression of the adjusted virtual object is the same as the facial expression in the facial image, includes:
under the condition that the expression parameters of a plurality of video frames in the video are obtained, sequentially adjusting the face of the virtual object according to the sequence of the video frames based on the expression parameters of the video frames, so that the facial expression of the virtual object changes along with the expression change of the face in the video.
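Claim 10 applies the expression parameters of successive video frames in frame order so the virtual object's expression tracks the face in the video. A sketch under the assumption that the rendering engine exposes a callback that accepts one frame's parameters:

```python
from typing import Callable, Dict, Iterable

def drive_virtual_face(frame_params: Iterable[Dict[str, float]],
                       apply_to_face: Callable[[Dict[str, float]], None]) -> None:
    """Adjust the virtual object's face frame by frame, in video order."""
    for params in frame_params:   # frame_params is assumed to be in video order
        apply_to_face(params)     # e.g. set the rig's blendshape weights for this frame

# usage with a hypothetical engine callback
frames = [{"mouth_open": 0.1}, {"mouth_open": 0.5}, {"mouth_open": 0.9}]
drive_virtual_face(frames, apply_to_face=lambda p: print("set blendshapes:", p))
```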
11. A model training apparatus, the apparatus comprising:
an acquisition module, used for acquiring a first facial image and a second facial image, wherein the first facial image comprises a virtual face, and the face in the second facial image is different from the virtual face;
the encoding module is used for respectively encoding the first facial image and the second facial image through an encoding model to obtain a first feature and a second feature, wherein the first feature indicates the first facial image, and the second feature indicates the second facial image;
the decoding module is used for decoding the first feature through a first decoding model to obtain a third face image;
the decoding module is further configured to decode the second feature through a second decoding model to obtain a fourth facial image;
a training module, used for training the first decoding model based on the first face image and the third face image; and training the coding model based on the first face image, the third face image, the second face image, and the fourth face image;
the first decoding model and the encoding model are used for generating a virtual face image based on any face image, and the expression of the virtual face in the virtual face image is identical to that of the face in the face image.
12. A facial expression migration apparatus, the apparatus comprising:
the acquisition module is used for acquiring any facial image;
the coding module is used for coding the facial image through a coding model to obtain the characteristics of the facial image;
the decoding module is used for decoding the characteristics of the face image through a first decoding model to obtain a virtual face image, wherein the virtual face image comprises a virtual face, and the expression of the face in the face image is the same as that of the virtual face in the virtual face image;
wherein the coding model and the first decoding model are trained based on the method of any one of claims 1-5.
13. A computer device comprising a processor and a memory, the memory having stored therein at least one computer program that is loaded and executed by the processor to perform the operations performed by the model training method of any of claims 1 to 5; or to implement the operations performed by the facial expression migration method of any one of claims 6 to 10.
14. A computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement the operations performed by the model training method of any of claims 1 to 5; or to implement the operations performed by the facial expression migration method of any one of claims 6 to 10.
CN202410031578.7A 2024-01-09 Model training method, facial expression migration method, device, equipment and medium Active CN117540789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410031578.7A CN117540789B (en) 2024-01-09 Model training method, facial expression migration method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410031578.7A CN117540789B (en) 2024-01-09 Model training method, facial expression migration method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117540789A (en) 2024-02-09
CN117540789B CN117540789B (en) 2024-04-26

Citations (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977598A (en) * 2016-10-21 2018-05-01 三星电子株式会社 Method and apparatus for identifying facial expression
CN109745062A (en) * 2019-01-30 2019-05-14 腾讯科技(深圳)有限公司 Generation method, device, equipment and the storage medium of CT image
CN110148081A (en) * 2019-03-25 2019-08-20 腾讯科技(深圳)有限公司 Training method, image processing method, device and the storage medium of image processing model
CN110826593A (en) * 2019-09-29 2020-02-21 腾讯科技(深圳)有限公司 Training method for fusion image processing model, image processing method, image processing device and storage medium
CN110889381A (en) * 2019-11-29 2020-03-17 广州华多网络科技有限公司 Face changing method and device, electronic equipment and storage medium
CN111553267A (en) * 2020-04-27 2020-08-18 腾讯科技(深圳)有限公司 Image processing method, image processing model training method and device
CN111598979A (en) * 2020-04-30 2020-08-28 腾讯科技(深圳)有限公司 Method, device and equipment for generating facial animation of virtual character and storage medium
CN111798551A (en) * 2020-07-20 2020-10-20 网易(杭州)网络有限公司 Virtual expression generation method and device
CN111860485A (en) * 2020-07-24 2020-10-30 腾讯科技(深圳)有限公司 Training method of image recognition model, and image recognition method, device and equipment
CN112528760A (en) * 2020-11-24 2021-03-19 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, computer device, and medium
CN112541445A (en) * 2020-12-16 2021-03-23 中国联合网络通信集团有限公司 Facial expression migration method and device, electronic equipment and storage medium
CN112818772A (en) * 2021-01-19 2021-05-18 网易(杭州)网络有限公司 Facial parameter identification method and device, electronic equipment and storage medium
CN112907725A (en) * 2021-01-22 2021-06-04 北京达佳互联信息技术有限公司 Image generation method, image processing model training method, image processing device, and image processing program
CN113205449A (en) * 2021-05-21 2021-08-03 珠海金山网络游戏科技有限公司 Expression migration model training method and device and expression migration method and device
CN113780249A (en) * 2021-11-10 2021-12-10 腾讯科技(深圳)有限公司 Expression recognition model processing method, device, equipment, medium and program product
CN113821658A (en) * 2021-06-30 2021-12-21 腾讯科技(深圳)有限公司 Method, device and equipment for training encoder and storage medium
CN114120389A (en) * 2021-09-09 2022-03-01 广州虎牙科技有限公司 Network training and video frame processing method, device, equipment and storage medium
CN114677739A (en) * 2022-03-30 2022-06-28 北京字跳网络技术有限公司 Facial expression capturing method and device, computer equipment and storage medium
US20220245961A1 (en) * 2020-06-01 2022-08-04 Tencent Technology (Shenzhen) Company Limited Training method for expression transfer model, expression transfer method and apparatus
CN114937115A (en) * 2021-07-29 2022-08-23 腾讯科技(深圳)有限公司 Image processing method, face replacement model processing method and device and electronic equipment
CN115984944A (en) * 2023-01-20 2023-04-18 北京字跳网络技术有限公司 Expression information identification method, device, equipment, readable storage medium and product
CN116469147A (en) * 2023-04-12 2023-07-21 平安科技(深圳)有限公司 Facial expression migration method and device, electronic equipment and storage medium
CN116508076A (en) * 2020-10-30 2023-07-28 微软技术许可有限责任公司 Character characteristic normalization using an automatic encoder
CN116665695A (en) * 2023-07-28 2023-08-29 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium
CN116703711A (en) * 2023-06-08 2023-09-05 网易(杭州)网络有限公司 Expression migration evaluation method, device, equipment and storage medium
WO2023184817A1 (en) * 2022-03-30 2023-10-05 腾讯科技(深圳)有限公司 Image processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN117218507A (en) * 2023-04-27 2023-12-12 腾讯科技(深圳)有限公司 Image processing model training method, image processing device and electronic equipment
CN117315758A (en) * 2023-10-18 2023-12-29 摩尔线程智能科技(北京)有限责任公司 Facial expression detection method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant