CN113205449A - Expression migration model training method and device and expression migration method and device - Google Patents

Expression migration model training method and device and expression migration method and device

Info

Publication number
CN113205449A
CN113205449A (Application No. CN202110560292.4A)
Authority
CN
China
Prior art keywords
dimensional face
training
decoder
encoder
expression migration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110560292.4A
Other languages
Chinese (zh)
Inventor
梁延研
冯梓原
林旭新
杨林
史少桦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Kingsoft Online Game Technology Co Ltd
Original Assignee
Zhuhai Kingsoft Online Game Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Kingsoft Online Game Technology Co Ltd
Priority to CN202110560292.4A
Publication of CN113205449A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/80 2D [Two Dimensional] animation, e.g. using sprites

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method and device for an expression migration model, and an expression migration method and device. The expression migration model comprises an encoder, a first decoder and a second decoder, and the training method comprises the following steps: acquiring a first three-dimensional face sample and a second three-dimensional face sample; training the first three-dimensional face sample based on the encoder and the first decoder, and training the second three-dimensional face sample based on the encoder and the second decoder; determining whether a training stop condition is reached, and stopping the training process in the event that the training stop condition is reached. According to the training method of the expression migration model, features are extracted by a graph convolutional neural network and training adopts an auto-encoder network structure, so that migration of an expression to a specific three-dimensional face is achieved.

Description

Expression migration model training method and device and expression migration method and device
Technical Field
The specification relates to the technical field of computers, in particular to a method and a device for training an expression migration model and a method and a device for expression migration.
Background
With the development of technology, the processing and analysis of facial expressions has become a research hotspot in the fields of computer vision and graphics, and facial expression migration is also widely applied. Facial expression migration means mapping a captured real user expression onto another target image, so that the facial expression is migrated to the target image. This technology not only enables a user to control the facial expression in a target picture or video with an input face, but also provides data augmentation for face recognition tasks.
Existing three-dimensional facial expression migration is mainly performed by detecting face key points and fitting the parameters of a 3DMM model (3D Morphable Model). The 3DMM model is constructed by Principal Component Analysis (PCA) of a database and can be regarded as an average face (mean face) obtained from a large number of faces in the database, so it always carries features of the faces in the database. A face reconstructed by the 3DMM model therefore only approximates the appearance and expression of the input rather than reproducing the exact expression, so the expression cannot be migrated to a specific three-dimensional face.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a training method for an expression migration model and an expression migration method. The present specification also relates to a training apparatus for an expression migration model, an expression migration apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical defects in the prior art.
According to a first aspect of embodiments of the present specification, there is provided a training method for an expression migration model, where the expression migration model includes an encoder, a first decoder, and a second decoder, the training method includes:
acquiring a first three-dimensional face sample and a second three-dimensional face sample;
training the first three-dimensional face sample based on the encoder and the first decoder, and training the second three-dimensional face sample based on the encoder and the second decoder;
determining whether a training stop condition is reached, and stopping the training process in the event that the training stop condition is reached.
Optionally, training a first three-dimensional face sample based on the encoder and the first decoder comprises:
inputting the initial vertex information and the adjacency matrix of the first three-dimensional face sample into the encoder to obtain a first encoding vector;
inputting the first coding vector into the first decoder to obtain a first decoding vector, and obtaining a loss value according to the first decoding vector and the initial vertex information;
adjusting a coefficient vector of a network layer in the encoder and the first decoder according to the loss value.
Optionally, training the second three-dimensional face sample based on the encoder and the second decoder comprises:
inputting the initial vertex information and the adjacency matrix of the second three-dimensional face sample into the encoder to obtain a second encoding vector,
inputting the second coding vector into the second decoder to obtain a second decoding vector, and obtaining a loss value according to the second decoding vector and the initial vertex information of the second three-dimensional face sample;
adjusting a coefficient vector of a network layer in the encoder and the second decoder according to the loss value.
Optionally, the encoder comprises a graph convolution neural network layer, a down-sampling layer and a full connection layer, the graph convolution neural network layer and the down-sampling layer are sequentially arranged at intervals, the first decoder and the second decoder comprise the full connection layer, an up-sampling layer and a convolutional neural network layer, and the up-sampling layer and the convolutional neural network layer are sequentially arranged at intervals.
Optionally, training a first preset number of first three-dimensional face samples and training a second preset number of second three-dimensional face samples are alternately performed.
Optionally, the obtaining a first three-dimensional face sample comprises:
the method comprises the steps of obtaining a plurality of face images, and carrying out face reconstruction on the face images to obtain a first three-dimensional face sample.
Optionally, the obtaining the first three-dimensional face sample further includes:
and after face reconstruction is carried out on the plurality of face images, carrying out spatial alignment on the reconstructed three-dimensional face and the second three-dimensional face sample.
According to a second aspect of embodiments of the present specification, there is provided an expression migration method using an expression migration model, the expression migration model including an encoder, a first decoder, and a second decoder and being trained in advance by the training method of any one of the above, the method including:
acquiring a first three-dimensional face of an expression to be migrated;
and performing expression migration on the first three-dimensional face based on the encoder and the second decoder to obtain a second three-dimensional face.
Optionally, performing expression migration on the first three-dimensional face based on the encoder and the second decoder comprises:
inputting the initial vertex information and the adjacency matrix of the first three-dimensional face into the encoder to obtain a first encoding vector;
inputting the first encoded vector to the second decoder.
Optionally, the obtaining of the first three-dimensional face with the expression to be migrated includes:
and intercepting multi-frame face images of the same face from a target video, and carrying out face reconstruction on the multi-frame face images to obtain a plurality of first three-dimensional faces of expressions to be migrated.
Optionally, the expression migration method further includes:
and generating an animation based on the second three-dimensional face.
According to a third aspect of embodiments of the present specification, there is provided a training apparatus for an expression migration model, the expression migration model including an encoder, a first decoder, and a second decoder, the training apparatus including:
a first obtaining module configured to obtain a first three-dimensional face sample and a second three-dimensional face sample;
a training module configured to train the first three-dimensional face sample based on the encoder and the first decoder, and train the second three-dimensional face sample based on the encoder and the second decoder;
a judging module configured to judge whether a training stop condition is reached and, in case the training stop condition is reached, stop the training process.
According to a fourth aspect of embodiments of the present specification, there is provided an expression migration apparatus using an expression migration model, the expression migration model including an encoder, a first decoder, and a second decoder and being trained in advance by the training method described in any one of the above, the expression migration apparatus including:
the second acquisition module is configured to acquire a first three-dimensional face of the expression to be migrated;
and the migration module is configured to perform expression migration on the first three-dimensional face based on the encoder and the second decoder to obtain a second three-dimensional face.
According to a fifth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, where the computer-executable instructions, when executed by the processor, implement the method for training an expression migration model according to the first aspect, or implement the operation steps of the expression migration method using an expression migration model according to the second aspect.
According to a sixth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the method for training an expression migration model according to the first aspect or the operation steps of the expression migration method using an expression migration model according to the second aspect.
The method for training the expression migration model according to this specification comprises: acquiring a first three-dimensional face sample and a second three-dimensional face sample; training the first three-dimensional face sample based on an encoder and a first decoder, and training the second three-dimensional face sample based on the encoder and a second decoder; and determining whether a training stop condition is reached and, in case the training stop condition is reached, stopping the training process.
The training method of the expression migration model according to this specification extracts features with a graph convolutional neural network and trains with an auto-encoder network structure, realizing the migration of expressions to a specific three-dimensional face.
Drawings
Fig. 1 is a flowchart illustrating a training method of an expression migration model according to an embodiment of the present specification;
fig. 2 is a process schematic diagram illustrating a training method of an expression migration model according to an embodiment of the present specification;
FIG. 3 is a schematic diagram illustrating a network architecture of an expression migration model according to an embodiment of the present specification;
FIG. 4 illustrates a schematic diagram of a down-sampling layer and an up-sampling layer in the expression migration model of FIG. 3;
FIG. 5 is a flowchart illustrating an expression migration method using an expression migration model according to an embodiment of the present specification;
fig. 6 is a flowchart illustrating an expression migration method using an expression migration model according to an embodiment of the present specification;
FIG. 7 is a flowchart illustrating a training method for an expression migration model according to an embodiment of the present specification;
FIG. 8 is a flowchart illustrating a process of a method for three-dimensional facial expression migration according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram illustrating a training apparatus for an expression migration model according to an embodiment of the present specification;
fig. 10 is a schematic structural diagram illustrating an expression transfer apparatus according to an embodiment of the present specification;
fig. 11 shows a block diagram of a computing device according to an embodiment of the present specification.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
In the present specification, a method for training an expression migration model is provided, and the present specification also relates to an apparatus for training an expression migration model, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Fig. 1 is a flowchart illustrating a training method for an expression migration model according to an embodiment of the present specification, where the expression migration model includes an encoder, a first decoder, and a second decoder, and the training method specifically includes the steps of:
step 102: and acquiring a first three-dimensional face sample and a second three-dimensional face sample.
The first three-dimensional face sample and the second three-dimensional face sample are three-dimensional face samples of characters or virtual characters, and the first three-dimensional face sample and the second three-dimensional face sample correspond to different characters or virtual characters. A three-dimensional face with a given expression is represented by a three-dimensional face mesh with vertices arranged on the mesh; the three-dimensional face mesh is defined as a graph G = (V, A), where V ∈ R^{n×3} is the set of vertices and A is an adjacency matrix that characterizes the relationship between the vertices, each vertex being represented by a three-dimensional coordinate value. Each vertex of the three-dimensional face sample corresponds to one row and one column of the adjacency matrix, and the element at each position is determined by whether the vertices of the corresponding row and column are adjacent. In one embodiment, the first three-dimensional face sample is a three-dimensional face sample of a character, and the second three-dimensional face sample is a three-dimensional face sample of a virtual character, such as a three-dimensional face sample of a game character.
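For illustration only, the following is a minimal sketch (not taken from the patent; the helper name mesh_to_graph is an assumption) of how the vertex set and the adjacency matrix A of a triangle mesh can be built: each vertex is a three-dimensional coordinate, and A[i, j] is set to 1 when vertices i and j share an edge of some triangle.

```python
import numpy as np

def mesh_to_graph(vertices, faces):
    """vertices: (n, 3) float array; faces: (m, 3) integer array of vertex indices."""
    n = vertices.shape[0]
    A = np.zeros((n, n), dtype=np.float32)
    for i, j, k in faces:
        for a, b in ((i, j), (j, k), (k, i)):
            A[a, b] = 1.0        # vertices a and b share an edge of this triangle
            A[b, a] = 1.0        # the adjacency matrix is symmetric
    return vertices.astype(np.float32), A

# toy example: a mesh consisting of a single triangle
V, A = mesh_to_graph(np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]]),
                     np.array([[0, 1, 2]]))
```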
Optionally, the obtaining a first three-dimensional face sample comprises:
the method comprises the steps of obtaining a plurality of face images, and carrying out face reconstruction on the face images to obtain a first three-dimensional face sample.
The first three-dimensional face sample can be obtained by performing three-dimensional face reconstruction on two-dimensional images, for example by capturing images frame by frame from a video and then performing three-dimensional face reconstruction to obtain the three-dimensional face samples. Three-dimensional face reconstruction can also be performed on a plurality of directly captured two-dimensional images to obtain the three-dimensional face samples.
The three-dimensional face can be reconstructed by conventional three-dimensional face reconstruction methods, for example single-image modeling (including muscle models and image-based modeling), multi-image modeling (including orthogonal-view modeling and multi-image modeling systems), model-based three-dimensional face reconstruction (including the generic face model CANDIDE-3 and the three-dimensional morphable model 3DMM), and end-to-end three-dimensional face reconstruction (including VRNet and PRNet). The two-dimensional images are thus reconstructed into three-dimensional faces.
Optionally, the obtaining the first three-dimensional face sample further includes:
and after face reconstruction is carried out on the plurality of face images, carrying out spatial alignment on the reconstructed three-dimensional face and the second three-dimensional face sample.
Spatial alignment, as shown in fig. 2, is used to align the reconstructed three-dimensional face with the second three-dimensional face sample in space, so that the eye, mouth and nose positions of the two faces coincide. Topology unification is then performed so that the reconstructed three-dimensional face and the second three-dimensional face sample have the same number of points: a three-dimensional face is formed by faces and the faces are formed by points, which are unified, and the indices of the points forming each face must also be the same, so that the faces of the two three-dimensional meshes correspond one to one and the two meshes come to share the same topology structure.
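As one possible way to realize the spatial alignment described above (the patent does not specify the algorithm; the rigid Procrustes/Kabsch alignment and the use of a few facial landmarks such as eye, nose and mouth points are assumptions for illustration), a sketch could look like this:

```python
import numpy as np

def rigid_align(src_landmarks, dst_landmarks, src_vertices):
    """Rotate and translate src_vertices so src_landmarks best match dst_landmarks."""
    mu_s, mu_d = src_landmarks.mean(0), dst_landmarks.mean(0)
    H = (src_landmarks - mu_s).T @ (dst_landmarks - mu_d)   # 3x3 covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # avoid a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return src_vertices @ R.T + t   # aligned vertex coordinates
```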
Step 104: the first three-dimensional face sample is trained based on an encoder and a first decoder, and the second three-dimensional face sample is trained based on the encoder and a second decoder.
As shown in fig. 2, a first three-dimensional face sample is trained by an encoder and a first decoder, and a second three-dimensional face sample is trained by an encoder and a second decoder. Optionally, training a first three-dimensional face sample based on the encoder and the first decoder may be achieved by:
inputting the initial vertex information and the adjacency matrix of the first three-dimensional face sample into the encoder to obtain a first encoding vector;
inputting the first coding vector into the first decoder to obtain a first decoding vector, and obtaining a loss value according to the first decoding vector and the initial vertex information;
adjusting a coefficient vector of a network layer in the encoder and the first decoder according to the loss value.
Optionally, training the second three-dimensional face sample based on the encoder and the second decoder may be achieved by:
inputting the initial vertex information and the adjacency matrix of the second three-dimensional face sample into the encoder to obtain a second encoding vector,
inputting the second coding vector into the second decoder to obtain a second decoding vector, and obtaining a loss value according to the second decoding vector and the initial vertex information of the second three-dimensional face sample;
adjusting a coefficient vector of a network layer in the encoder and the second decoder according to the loss value.
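A minimal sketch of one training step for each of the two branches described above is given below; the module names (encoder, decoder), the tensor shapes and the use of an L1 reconstruction loss are assumptions, since the patent only states that a loss value is obtained from the decoded vector and the initial vertex information.

```python
import torch
import torch.nn.functional as F

def train_branch(encoder, decoder, vertices, adjacency, optimizer):
    """vertices: (batch, n, 3) tensor of initial vertex info; adjacency: (n, n) tensor."""
    optimizer.zero_grad()
    code = encoder(vertices, adjacency)    # encoding vector
    recon = decoder(code)                  # decoded vertex coordinates
    loss = F.l1_loss(recon, vertices)      # loss between decoded and initial vertices (L1 assumed)
    loss.backward()                        # gradients used to adjust the coefficient vectors
    optimizer.step()
    return loss.item()
```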
The network architecture for expression migration of a three-dimensional face is composed of an encoder and two decoders, as shown in fig. 3; the encoder encodes expression information, and the decoders recover the identity information of the three-dimensional face. Optionally, the encoder comprises graph convolution neural network layers, down-sampling layers and a fully connected layer, the graph convolution neural network layers and the down-sampling layers being sequentially arranged at intervals; the first decoder and the second decoder comprise a fully connected layer, up-sampling layers and convolutional neural network layers, the up-sampling layers and the convolutional neural network layers being sequentially arranged at intervals. As shown in fig. 3, G represents a graph convolutional neural network layer, F represents a fully connected layer, and the length of a box represents the number of vertices of the face mesh. Except for the last layer, each graph convolutional layer is followed by a down-sampling operation, the graph convolutional layers and the down-sampling layers being sequentially arranged at intervals, and each down-sampling reduces the number of input vertices to 1/4; the last layer is a fully connected layer, and its output is Z. Each down-sampling could also reduce the number of vertices to one sixth, one eighth, etc. of the original number, with the number of graph convolution layers unchanged. The numbers of encoder and decoder layers can be adjusted and are not limited to 4 layers; they can be 6 layers, 8 layers, etc., which is not limited in this application.
As shown in fig. 2 and fig. 3, the decoder part is composed of two decoders, which decode the corresponding 3D face meshes respectively. The fully connected layer is followed by an up-sampling layer, which is followed by a graph convolutional neural network layer. Graph convolution and up-sampling are performed alternately; each up-sampling restores the number of vertices to 4 times the previous number, and finally the vertices of the complete 3D face mesh are output. The first decoder comprises a fully connected layer, up-sampling layers and convolutional neural network layers, the up-sampling layers and the convolutional neural network layers being sequentially arranged at intervals. The original vertex information can be recovered during up-sampling: up-sampling corresponds to down-sampling, vertices are deleted during down-sampling and the coordinates of the deleted points are recorded at the same time, so that during up-sampling the original vertex information can be recovered as far as possible from the recorded coordinates.
As shown in FIG. 4, part (a) represents the original mesh; the down-sampling Q_d deletes some points to form part (b); convolution is then performed to form part (c); during the up-sampling Q_u the deleted points are projected onto the nearest mesh according to their recorded coordinates and recovered to form part (d), and the final number of points is unchanged compared with the original.
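The following simplified sketch illustrates the Q_d / Q_u idea described above: down-sampling keeps a subset of vertices while recording which ones were removed, and up-sampling puts a value back at each removed position. For simplicity the removed vertices are restored from their nearest kept vertex, whereas the patent projects the deleted points onto the nearest mesh; all names are illustrative assumptions.

```python
import numpy as np

def downsample(features, keep_idx):
    """features: (n, f) per-vertex features; keep_idx: indices of vertices kept."""
    return features[keep_idx]

def upsample(features, keep_idx, nearest_kept, n):
    """nearest_kept[i]: row in `features` of the kept vertex closest to vertex i."""
    out = np.zeros((n, features.shape[1]), dtype=features.dtype)
    out[keep_idx] = features                            # kept vertices pass through
    removed = np.setdiff1d(np.arange(n), keep_idx)
    out[removed] = features[nearest_kept[removed]]      # restore deleted vertices
    return out
```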
The vertex information is coordinate information of the vertexes of the three-dimensional face mesh, and provides input dimensionality, output dimensionality and an adjacency matrix when the graph convolution network is defined. The vertex set is input into a first graph convolution neural network of the encoder, and graph convolution is carried out by combining adjacency matrix information. In the graph convolution neural network, vertex information is aggregated by using a preset function according to a certain rule to obtain a new coding vector and output the new coding vector.
The graph convolution is explained below. The degree matrix D is computed from the adjacency matrix A, the Laplacian matrix is obtained as L = D - A, and the eigenvectors u_0, u_1, …, u_{n-1} of the Laplacian matrix are obtained, where L = U Λ U^T, U = [u_0, u_1, …, u_{n-1}], and Λ = diag([λ_0, λ_1, …, λ_{n-1}]) ∈ R^{n×n}. The general definition of graph convolution is:
x * y = U((U^T x) ⊙ (U^T y))    (1)
where x and y are inputs, * represents the convolution operation, U is the matrix of eigenvectors, U^T x represents the graph Fourier transform of x, U^T y represents the graph Fourier transform of y, and ⊙ represents the Hadamard product. x is the input vector to be convolved, y is the signal information on the graph, and the convolution operation is carried out after the Fourier transform.
Given a graph, its Laplacian matrix is obtained, and the eigenvalues and eigenvectors of the Laplacian matrix are computed. The eigenvectors are used as a set of basis vectors for the Fourier transform; the input on the graph is then Fourier-transformed into a representation in the frequency domain, the convolution operation is carried out in the frequency-domain space, and the result is transformed back to the original space. In other words, the signal to be convolved is transformed by this set of basis vectors into the space spanned by the basis, i.e. the frequency-domain space, convolved there, and then inverse-transformed back to the original space by the same basis vectors. The above convolution process is called spectral convolution.
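A minimal numerical sketch of the spectral convolution in formula (1), written with NumPy for illustration (the function name is an assumption):

```python
import numpy as np

def spectral_conv(x, y, A):
    """x, y: (n,) graph signals; A: (n, n) adjacency matrix."""
    D = np.diag(A.sum(axis=1))
    L = D - A                              # graph Laplacian
    _, U = np.linalg.eigh(L)               # columns of U are the eigenvectors
    return U @ ((U.T @ x) * (U.T @ y))     # x * y = U((U^T x) ⊙ (U^T y))
```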
The above is the theory of spectral convolution; the specific convolution operation is described below. The eigen-decomposition is approximated by Chebyshev polynomials, as shown in the following formula:
g_θ(L) = Σ_{k=0}^{K-1} θ_k T_k(L̃)    (2)
where g_θ is the convolution kernel, L̃ = 2L/λ_max - I_n is the scaled Laplacian, I_n is the n-dimensional identity matrix, θ is the Chebyshev coefficient vector to be trained by the model, T_k is the k-th order Chebyshev polynomial, the order K can be 6, 8, 10, etc. and can be set as desired, and λ_max is the maximum eigenvalue of the Laplacian matrix.
The Chebyshev polynomial recurrence is T_k(x) = 2x·T_{k-1}(x) - T_{k-2}(x), with T_0 = 1 and T_1 = x. The graph convolution after applying the Chebyshev polynomial is defined as follows:
y_j = Σ_{i=1}^{F_in} g_{θ_{i,j}}(L) x_i    (3)
where y_j is the j-th feature of the output y ∈ R^{n×F_out}, x ∈ R^{n×F_in} is the input with F_in features, and F_in is the coordinate dimension of each vertex, in this case F_in = 3. Each convolution layer has F_in × F_out Chebyshev coefficient vectors θ_{i,j} ∈ R^K, which are used as training parameters. F_out, the number of channels of the output vector, is specified when the graph convolution network layer is set up. In one embodiment, n = 3791 and F_out = 16. In one embodiment, during network training, K is set to 6, an SGD stochastic gradient descent optimizer is used, the weight decay rate is 0.0005, the batch size is 16, the learning rate is 0.008, and the learning-rate decay is 0.99. The weight decay is used for regularization to prevent overfitting. The batch size is a hyper-parameter, defined as 16 in this embodiment, which means that only 16 three-dimensional faces are taken for training at a time; it may be set to other values, preferably multiples of 2.
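The following is a hedged PyTorch sketch of a Chebyshev graph-convolution layer corresponding to formulas (2) and (3): each layer holds F_in × F_out coefficient vectors θ_{i,j} ∈ R^K and evaluates T_k(L̃)x by the Chebyshev recurrence. The class name, initialization and exact optimizer/scheduler calls are assumptions; the hyper-parameters follow the embodiment above.

```python
import torch
import torch.nn as nn

class ChebGraphConv(nn.Module):
    def __init__(self, f_in, f_out, K, scaled_laplacian):
        super().__init__()
        self.K = K
        # scaled Laplacian L~ = 2L/lambda_max - I_n, shape (n, n)
        self.register_buffer("L", scaled_laplacian)
        # K Chebyshev coefficient matrices, i.e. F_in x F_out vectors theta_{i,j} in R^K
        self.theta = nn.Parameter(torch.randn(K, f_in, f_out) * 0.01)

    def forward(self, x):                                 # x: (batch, n, f_in)
        Tx_prev = x                                       # T_0(L~) x = x
        Tx = torch.einsum("ij,bjf->bif", self.L, x)       # T_1(L~) x = L~ x
        out = torch.einsum("bnf,fo->bno", Tx_prev, self.theta[0])
        if self.K > 1:
            out = out + torch.einsum("bnf,fo->bno", Tx, self.theta[1])
        for k in range(2, self.K):                        # T_k = 2 L~ T_{k-1} - T_{k-2}
            Tx_prev, Tx = Tx, 2 * torch.einsum("ij,bjf->bif", self.L, Tx) - Tx_prev
            out = out + torch.einsum("bnf,fo->bno", Tx, self.theta[k])
        return out                                        # (batch, n, f_out)

# optimizer settings from the embodiment (assumed mapping to PyTorch calls):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.008, weight_decay=0.0005)
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
```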
Step 106: it is determined whether a training stop condition is reached and, in the event that the training stop condition is reached, the training process is stopped.
As shown in fig. 2, a discriminator is disposed after each of the first decoder and the second decoder. The discriminator judges the quality of the face decoded by the decoder: the closer the decoded face is to the input three-dimensional face, the more the result tends to be judged as true (real). The vertex coordinates of the decoded face are compared with the vertex coordinates of the reconstructed face, and when the difference is smaller than a preset threshold the result is judged as true (real), so that the generated result is pushed closer to the original input. During training, some three-dimensional face samples from the training set and some target face samples are input. The parameters changed while training on the three-dimensional face samples also apply to the target face samples; the encoder searches for one set of parameters that can encode the three-dimensional face samples and the target face samples at the same time, and this set of parameters is used when the model is applied. Because the encoder shared by the first three-dimensional face sample and the second three-dimensional face sample learns information common to different individuals, i.e. the network parameters of the hidden layer, while each decoder is responsible for reconstructing the corresponding individual during training and learns information specific to that individual (including different facial expressions), expression migration of an individual face is realized through the second decoder during application.
Optionally, training a first preset number of first three-dimensional face samples and training a second preset number of second three-dimensional face samples are performed alternately. That is, during training, the first batch may be three-dimensional face samples and the second batch may be target face samples. Alternatively, the samples may be trained in a shuffled order, or n three-dimensional face samples may be trained first, followed by n target face samples, alternating in this way. This is not limited by the present application.
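A short sketch of the alternating schedule described above, reusing the train_branch step sketched earlier (loader and optimizer names are assumptions):

```python
def train_epoch(encoder, decoder_1, decoder_2,
                loader_first, loader_second, opt_1, opt_2, adjacency):
    """Alternate one batch of first-face samples with one batch of second-face samples."""
    loss_a = loss_b = None
    for batch_a, batch_b in zip(loader_first, loader_second):
        loss_a = train_branch(encoder, decoder_1, batch_a, adjacency, opt_1)
        loss_b = train_branch(encoder, decoder_2, batch_b, adjacency, opt_2)
    return loss_a, loss_b
```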
Fig. 5 shows an expression migration method using an expression migration model, where the expression migration model includes an encoder, a first decoder, and a second decoder, and is trained in advance by the above training method, and the expression migration method includes:
step 502: acquiring a first three-dimensional face of an expression to be migrated;
the first three-dimensional face can be obtained by performing three-dimensional face reconstruction on the two-dimensional image, for example, capturing a plurality of frames of face images of the same face from the target video, and performing three-dimensional face reconstruction to obtain a plurality of first three-dimensional faces of the expression to be migrated. And three-dimensional face reconstruction can be carried out on a plurality of two-dimensional images obtained by direct shooting to obtain the three-dimensional face.
Step 504: and performing expression migration on the first three-dimensional face based on the encoder and the second decoder to obtain a second three-dimensional face.
The encoder comprises graph convolution neural network layers, down-sampling layers and a fully connected layer, the graph convolution neural network layers and the down-sampling layers being sequentially arranged at intervals; the first decoder and the second decoder comprise a fully connected layer, up-sampling layers and convolutional neural network layers, the up-sampling layers and the convolutional neural network layers being sequentially arranged at intervals. As shown in fig. 3, G represents a graph convolutional neural network layer, F represents a fully connected layer, and the length of a box represents the number of vertices of the face mesh. Except for the last layer, each graph convolutional layer is followed by a down-sampling operation, the graph convolutional layers and the down-sampling layers being sequentially arranged at intervals, and each down-sampling reduces the number of input vertices to 1/4; the last layer is a fully connected layer, and its output is Z. Each down-sampling could also reduce the number of vertices to one sixth, one eighth, etc. of the original number, with the number of graph convolution layers unchanged. The numbers of encoder and decoder layers can be adjusted and are not limited to 4 layers; they can be 6 layers, 8 layers, etc., which is not limited in this application.
The decoder part consists of two decoders, which decode the corresponding 3D face meshes respectively. The fully connected layer is followed by an up-sampling layer, which is followed by a graph convolutional neural network layer. Graph convolution and up-sampling are performed alternately; each up-sampling restores the number of vertices to 4 times the previous number, and finally the vertices of the complete 3D face mesh are output. The first decoder comprises a fully connected layer, up-sampling layers and convolutional neural network layers, the up-sampling layers and the convolutional neural network layers being sequentially arranged at intervals. The original vertex information can be recovered during up-sampling: up-sampling corresponds to down-sampling, vertices are deleted during down-sampling and the coordinates of the deleted points are recorded at the same time, so that during up-sampling the original vertex information can be recovered as far as possible from the recorded coordinates.
As shown in FIG. 4, part (a) is the original mesh; the down-sampling Q_d deletes some points while retaining their coordinates; part (c) is generated after convolution; during the up-sampling Q_u the deleted points are projected onto the nearest mesh according to their recorded coordinates and recovered, and the final number of points is unchanged compared with the original.
Optionally, performing expression migration on the first three-dimensional face based on the encoder and the second decoder may be implemented by:
inputting the initial vertex information and the adjacency matrix of the first three-dimensional face into the encoder to obtain a first encoding vector;
inputting the first encoded vector to the second decoder.
As shown in fig. 6, the initial vertex information and the adjacency matrix are input into the graph convolution neural network layers of the encoder, and the first encoding vector is output after the down-sampling layers and the fully connected layer. The first encoding vector is then input into the fully connected layer of the second decoder, a decoding vector is output after the up-sampling layers and the convolutional neural network layers, and the three-dimensional face after expression migration is obtained from the decoding vector.
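A minimal inference sketch of the migration step described above (function and module names are assumptions):

```python
import torch

@torch.no_grad()
def migrate_expression(encoder, decoder_2, vertices, adjacency):
    """vertices: (batch, n, 3) source face; returns the migrated target face vertices."""
    code = encoder(vertices, adjacency)   # first encoding vector
    return decoder_2(code)                # second three-dimensional face vertices
```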
The expression migration model is trained by the above training method. During training, the encoder shared by the first three-dimensional face sample and the second three-dimensional face sample learns information common to different individuals, i.e. the network parameters of the hidden layer, while each decoder is responsible for reconstructing the corresponding individual and learns information specific to that individual (including different facial expressions). Therefore, during application, expression migration of an individual face is realized for the first three-dimensional face through the second decoder: the expression of the first three-dimensional face is migrated to the target three-dimensional face, where the target three-dimensional face is the face reconstructed by the second decoder and obtained by training on a large number of second three-dimensional face samples.
In one embodiment, the expression migration method further includes:
and generating an animation based on the second three-dimensional face.
A plurality of first three-dimensional faces are input into the expression migration model, a plurality of corresponding second three-dimensional faces are output by the expression migration model, and an animation is generated based on the output second three-dimensional faces. For example, the animation may be generated using MPEG-4-based three-dimensional facial animation principles, and the application is not limited thereto. The expressions of the plurality of first three-dimensional faces are migrated to the target three-dimensional face to obtain a plurality of second three-dimensional faces, and the second three-dimensional faces are output as an animation, so that the expressions migrated from the original video can be displayed in animated form and an animation of a virtual character with these expressions is realized. Because facial expressions can be migrated to a virtual character to form the expressions of the virtual character, the resulting expressions of the virtual character are richer, which solves the problem that creating the expressions of game characters with traditional methods is time-consuming and labor-intensive.
The following describes the training method of the expression migration model with reference to fig. 7. Fig. 7 shows a processing flow chart of a training method for an expression migration model provided in an embodiment of the present specification, which specifically includes the following steps:
step 702: acquiring a first three-dimensional face sample and a three-dimensional face sample of a virtual character in a game picture;
the three-dimensional face sample of the virtual character may be an expressionless face sample or an expressive face sample of the same character, for example, an created virtual character face sample with an expression or an expressive virtual character face sample obtained by other methods or the expression migration method of this embodiment.
Step 704: inputting the initial vertex information of the first three-dimensional face sample into an encoder of the expression migration model to obtain a first encoding vector;
referring to fig. 3, the encoder includes a convolutional neural network layer, a downsampling layer, and a full connection layer, and the convolutional neural network layer and the downsampling layer are sequentially disposed at intervals.
Step 706: inputting the first coding vector into a first decoder of the expression migration model to obtain a first decoding vector, and obtaining a loss value according to the first decoding vector and the initial vertex information;
the first decoder comprises a full connection layer, an upper sampling layer and a convolutional neural network layer, wherein the upper sampling layer and the convolutional neural network layer are sequentially arranged at intervals.
Step 708: adjusting coefficient vectors of convolutional neural network layers in the encoder and the first decoder according to the loss values;
and adjusting the Chebyshev coefficient vector theta according to the loss value, and calculating a primary loss function and adjusting theta after the three-dimensional face sample of each batch is input.
Step 710: inputting the initial vertex information of the three-dimensional face sample of the virtual character into the encoder to obtain a second encoding vector;
and then training the three-dimensional face sample of the virtual character, inputting the three-dimensional face sample into an encoder, and obtaining a second encoding vector through a graph convolution neural network layer, a down-sampling layer and a connecting layer.
Step 712: inputting a second coding vector into a second decoder of the migration model to obtain a second decoding vector, and obtaining a loss value according to the second decoding vector and the initial vertex information of the three-dimensional face sample of the virtual character;
the second decoder comprises a full connection layer, an upper sampling layer and a convolutional neural network layer, wherein the upper sampling layer and the convolutional neural network layer are sequentially arranged at intervals.
Step 714: and adjusting coefficient vectors of the convolutional neural network layers in the encoder and the second decoder according to the loss value until a training stopping condition is reached, and stopping the training process.
The training stop condition is that the loss functions of the first decoder and the second decoder both converge. In one embodiment, stopping the training process until the training stop condition is reached may include:
judging whether the loss value is smaller than a preset threshold value or not;
if not, continuing training;
if so, determining that the training stop condition is reached.
The preset threshold is a critical value for the loss. When the loss value is greater than or equal to the preset threshold, there is still a certain deviation between the prediction of the initial model and the real result, the parameters of the initial model still need to be adjusted, and training continues; when the loss value is smaller than the preset threshold, the prediction of the initial model is sufficiently close to the real result and training can be stopped. The value of the preset threshold may be determined according to actual conditions, which is not limited by this specification.
The detailed parameters for each layer, which were specified in one embodiment, are shown in tables 1 and 2.
TABLE 1 Encoder architecture

Layer            Input size    Output size
Convolution      3791×3        3791×16
Down-sampling    3791×16       948×16
Convolution      948×16        948×16
Down-sampling    948×16        237×16
Convolution      237×16        237×16
Down-sampling    237×16        60×16
Convolution      60×16         60×32
Down-sampling    60×32         15×32
Fully connected  15×32         8

TABLE 2 Decoder architecture

Layer            Input size    Output size
Fully connected  8             15×32
Up-sampling      15×32         60×32
Convolution      60×32         60×32
Up-sampling      60×32         237×32
Convolution      237×32        237×16
Up-sampling      237×16        948×16
Convolution      948×16        948×16
Up-sampling      948×16        3791×16
Convolution      3791×16       3791×3
The input and output sizes in Tables 1 and 2 gave good experimental results, but they are not a limitation of the present application; the number of channels may be set as desired.
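For reference, the channel plan of Tables 1 and 2 could be written down as a configuration, e.g. a list of (stage type, number of vertices, channels) giving the output size after each stage; this is only an illustrative transcription of the tables, not code from the patent:

```python
# output size after each stage, per Tables 1 and 2
ENCODER_STAGES = [
    ("conv", 3791, 16), ("down", 948, 16),
    ("conv", 948, 16),  ("down", 237, 16),
    ("conv", 237, 16),  ("down", 60, 16),
    ("conv", 60, 32),   ("down", 15, 32),
]
CODE_DIM = 8            # output of the final fully connected layer (the code Z)
DECODER_STAGES = [
    ("fc", 15, 32),
    ("up", 60, 32),   ("conv", 60, 32),
    ("up", 237, 32),  ("conv", 237, 16),
    ("up", 948, 16),  ("conv", 948, 16),
    ("up", 3791, 16), ("conv", 3791, 3),
]
```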
The following describes an application method of the expression migration model with reference to fig. 8. Fig. 8 shows a processing flow chart of a method for three-dimensional facial expression migration provided in an embodiment of the present specification, which specifically includes the following steps:
Step 802: capturing each frame of image from a video, and carrying out face reconstruction to obtain three-dimensional faces;
Face reconstruction generates a corresponding three-dimensional face from the two-dimensional face in a two-dimensional picture; each captured frame is reconstructed into a three-dimensional face.
Step 804: inputting the three-dimensional faces into the expression migration model, performing three-dimensional facial expression migration, and migrating the expressions to the three-dimensional face of a virtual character;
The expression migration model is obtained by training with the method shown in fig. 7. The three-dimensional face is encoded by the encoder to output an encoding vector, the encoding vector is input into the second decoder and decoded to obtain a decoding vector, and expression migration of the three-dimensional face is thus realized: the expressions of the three-dimensional faces reconstructed from the multiple frames of the video are migrated to the three-dimensional face of the virtual character, for example an expressionless face of the virtual character.
Step 806: and generating animation based on the three-dimensional face after the expression of the virtual character is transferred, and displaying the expression transferred from the original video.
Because facial expressions can be transferred to virtual characters to form the expressions of the virtual characters, the resulting expressions of the virtual characters are richer, and the problem that creating the expressions of game characters with traditional methods is time-consuming and labor-intensive is solved.
Corresponding to the above method embodiment, an embodiment of a training device for an expression migration model is also provided in this specification, and fig. 9 shows a schematic structural diagram of the training device for an expression migration model provided in an embodiment of this specification. The expression migration model includes an encoder, a first decoder, and a second decoder, as shown in fig. 9, the apparatus includes:
a first obtaining module 902 configured to obtain a first three-dimensional face sample and a second three-dimensional face sample;
a training module 904 configured to train the first three-dimensional face sample based on the encoder and the first decoder, and train the second three-dimensional face sample based on the encoder and the second decoder;
a decision module 906 configured to decide whether a training stop condition is reached and, in case the training stop condition is reached, to stop the training process.
The above is a schematic scheme of the training apparatus for an expression migration model according to this embodiment. It should be noted that the technical solution of the training apparatus for expression migration models and the technical solution of the training method for expression migration models belong to the same concept, and details that are not described in detail in the technical solution of the training apparatus for expression migration models can be referred to the description of the technical solution of the training method for expression migration models.
Corresponding to the above method embodiment, the present specification further provides an expression migration apparatus using an expression migration model, and fig. 10 shows a schematic structural diagram of an expression migration apparatus provided in an embodiment of the present specification. The expression migration model includes an encoder, a first decoder and a second decoder and is trained in advance by the training method described in any one of the above, as shown in fig. 10, the apparatus includes:
a second obtaining module 1002, configured to obtain a first three-dimensional face of an expression to be migrated;
a migration module 1004 configured to perform expression migration on the first three-dimensional face based on the encoder and the second decoder, so as to obtain a second three-dimensional face.
The above is a schematic scheme of an expression migration apparatus according to this embodiment. It should be noted that the technical solution of the expression migration apparatus and the technical solution of the expression migration method using the expression migration model belong to the same concept, and details of the technical solution of the expression migration apparatus, which are not described in detail, can be referred to the description of the technical solution of the expression migration method using the expression migration model.
FIG. 11 illustrates a block diagram of a computing device 1100 provided in accordance with an embodiment of the present description. The components of the computing device 1100 include, but are not limited to, memory 1110 and a processor 1120. The processor 1120 is coupled to the memory 1110 via a bus 1130 and the database 1150 is used to store data.
The computing device 1100 also includes an access device 1140, the access device 1140 enabling the computing device 1100 to communicate via one or more networks 1160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 1140 may include one or more of any type of network interface, e.g., a Network Interface Card (NIC), wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 1100, as well as other components not shown in FIG. 11, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 11 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1100 can be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 1100 can also be a mobile or stationary server.
The processor 1120 is configured to execute computer-executable instructions, and the computer-executable instructions, when executed by the processor, implement the above-mentioned method for training expression migration models, or the above-mentioned operation steps of the expression migration method using expression migration models.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned expression migration model training method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above-mentioned expression migration model training method.
An embodiment of the present specification further provides a computer-readable storage medium, which stores computer instructions, which when executed by a processor, are configured to implement the method for training an expression migration model described above, or the operation steps of the expression migration method using an expression migration model described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the training method of the expression migration model belong to the same concept, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the training method of the expression migration model.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, etc. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present disclosure is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present disclosure. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for this description.
In the above embodiments, each embodiment is described with its own emphasis; for parts that are not described in detail in a given embodiment, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in explaining the specification. The alternative embodiments are not described exhaustively, and the specification is not limited to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, thereby enabling others skilled in the art to understand and make use of it. The specification is limited only by the claims and their full scope and equivalents.

Claims (15)

1. A training method of an expression migration model is characterized in that the expression migration model comprises an encoder, a first decoder and a second decoder, and the training method comprises the following steps:
acquiring a first three-dimensional face sample and a second three-dimensional face sample;
training the first three-dimensional face sample based on the encoder and the first decoder, and training the second three-dimensional face sample based on the encoder and the second decoder;
determining whether a training stop condition is reached, and stopping the training process in the event that the training stop condition is reached.
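For orientation, the following is a minimal sketch of the training scheme recited in claim 1, assuming a PyTorch-style implementation; the stand-in module sizes, the mean-squared reconstruction loss, and the stop condition (a step budget or a loss threshold) are illustrative assumptions, not the claimed implementation.

```python
# Illustrative sketch only: one shared encoder, two identity-specific decoders,
# trained until an (assumed) stop condition is reached.
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Stand-in encoder/decoder operating on flattened vertex coordinates."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, x):
        return self.net(x)

num_vertices = 1024                                  # assumed mesh resolution
encoder   = MLP(num_vertices * 3, 128)               # shared encoder
decoder_a = MLP(128, num_vertices * 3)               # first decoder (first identity)
decoder_b = MLP(128, num_vertices * 3)               # second decoder (second identity)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder_a.parameters())
                       + list(decoder_b.parameters()), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(sample, decoder):
    """Reconstruct one batch through the shared encoder and the given decoder."""
    opt.zero_grad()
    recon = decoder(encoder(sample))
    loss = loss_fn(recon, sample)                    # reconstruction loss
    loss.backward()
    opt.step()
    return loss.item()

max_steps, tol = 1000, 1e-4                          # assumed training stop condition
for step in range(max_steps):
    sample_a = torch.randn(4, num_vertices * 3)      # placeholder first 3D face samples
    sample_b = torch.randn(4, num_vertices * 3)      # placeholder second 3D face samples
    loss_a = train_step(sample_a, decoder_a)
    loss_b = train_step(sample_b, decoder_b)
    if max(loss_a, loss_b) < tol:                    # training stop condition reached
        break
```

Sharing one optimizer over the encoder and both decoders is only one possible design choice; what matters for the claim is that both sample sets pass through the same encoder while each keeps its own decoder.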
2. The training method of claim 1, wherein training the first three-dimensional face sample based on the encoder and the first decoder comprises:
inputting the initial vertex information and the adjacency matrix of the first three-dimensional face sample into the encoder to obtain a first encoding vector;
inputting the first coding vector into the first decoder to obtain a first decoding vector, and obtaining a loss value according to the first decoding vector and the initial vertex information;
adjusting a coefficient vector of a network layer in the encoder and the first decoder according to the loss value.
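A hedged sketch of the single training step described in claim 2, assuming a graph-convolutional encoder that consumes vertex positions together with a normalised adjacency matrix; the layer widths, latent size, identity-matrix adjacency placeholder and mean-squared loss are illustrative assumptions.

```python
# Hypothetical claim-2 step: vertex positions plus the mesh adjacency matrix go through
# a graph-convolutional encoder; the first decoder reconstructs the vertices and the
# loss updates the coefficients of both the encoder and the first decoder.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.weight = nn.Linear(in_feats, out_feats, bias=False)

    def forward(self, x, adj):
        # Aggregate neighbouring vertices with the (row-normalised) adjacency matrix.
        return torch.relu(self.weight(adj @ x))

class Encoder(nn.Module):
    def __init__(self, num_vertices, latent_dim=64):
        super().__init__()
        self.gc = GraphConv(3, 16)
        self.fc = nn.Linear(num_vertices * 16, latent_dim)

    def forward(self, verts, adj):
        h = self.gc(verts, adj)
        return self.fc(h.reshape(1, -1))             # first encoding vector

class Decoder(nn.Module):
    def __init__(self, num_vertices, latent_dim=64):
        super().__init__()
        self.fc = nn.Linear(latent_dim, num_vertices * 3)

    def forward(self, code):
        return self.fc(code).reshape(-1, 3)          # first decoding vector as vertices

num_vertices = 1024                                  # assumed sample size
verts = torch.randn(num_vertices, 3)                 # initial vertex information
adj = torch.eye(num_vertices)                        # placeholder normalised adjacency matrix

encoder, decoder_a = Encoder(num_vertices), Decoder(num_vertices)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder_a.parameters()), lr=1e-4)

code = encoder(verts, adj)                           # first encoding vector
recon = decoder_a(code)                              # first decoding vector
loss = nn.functional.mse_loss(recon, verts)          # loss against initial vertex information
opt.zero_grad(); loss.backward(); opt.step()         # adjust network-layer coefficients
```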
3. Training method according to claim 1 or 2, wherein training the second three-dimensional face sample based on the encoder and the second decoder comprises:
inputting the initial vertex information and the adjacency matrix of the second three-dimensional face sample into the encoder to obtain a second encoding vector;
inputting the second coding vector into the second decoder to obtain a second decoding vector, and obtaining a loss value according to the second decoding vector and the initial vertex information of the second three-dimensional face sample;
adjusting a coefficient vector of a network layer in the encoder and the second decoder according to the loss value.
4. The training method of claim 1, wherein the encoder comprises a convolutional neural network layer, a downsampling layer and a fully-connected layer, the convolutional neural network layer and the downsampling layer are sequentially arranged at intervals, the first decoder and the second decoder comprise a fully-connected layer, an upsampling layer and a convolutional neural network layer, and the upsampling layer and the convolutional neural network layer are sequentially arranged at intervals.
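One way to realise the alternating layer arrangement of claim 4 is sketched below, treating the vertex list as a one-dimensional signal so that standard Conv1d, pooling and upsampling layers can stand in for mesh convolution and mesh down-/up-sampling; the channel widths and latent size are assumptions, not the claimed network.

```python
# Sketch of the claim-4 layer arrangement with assumed channel widths.
import torch
import torch.nn as nn

def make_encoder(latent_dim=64):
    # Convolution and down-sampling layers arranged alternately, then a fully connected layer.
    return nn.Sequential(
        nn.Conv1d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AvgPool1d(2),                                   # down-sampling layer
        nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AvgPool1d(2),                                   # down-sampling layer
        nn.Flatten(),
        nn.LazyLinear(latent_dim),                         # fully connected layer
    )

class DecoderHead(nn.Module):
    # Fully connected layer, then up-sampling and convolution layers arranged alternately.
    def __init__(self, num_vertices, latent_dim=64):
        super().__init__()
        self.base = num_vertices // 4
        self.fc = nn.Linear(latent_dim, 32 * self.base)
        self.deconv = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv1d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv1d(16, 3, 3, padding=1),
        )

    def forward(self, z):
        return self.deconv(self.fc(z).view(-1, 32, self.base))

num_vertices = 1024
enc, dec = make_encoder(), DecoderHead(num_vertices)
verts = torch.randn(1, 3, num_vertices)      # batch of vertex coordinates as channels
print(dec(enc(verts)).shape)                 # torch.Size([1, 3, 1024])
```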
5. The training method of claim 3, wherein training a first preset number of first three-dimensional face samples and training a second preset number of second three-dimensional face samples are performed alternately.
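A minimal sketch of the alternation described in claim 5, under the assumption that train_step_a and train_step_b are per-sample training steps such as the one sketched for claim 2, and that the two preset numbers are hyperparameters chosen by the user.

```python
# Illustrative scheduling only: a first preset number of first-identity samples,
# then a second preset number of second-identity samples, repeated alternately.
first_preset, second_preset = 8, 8        # assumed preset numbers

def alternate_epoch(first_samples, second_samples, train_step_a, train_step_b):
    i = j = 0
    while i < len(first_samples) or j < len(second_samples):
        for s in first_samples[i:i + first_preset]:      # first preset number of first samples
            train_step_a(s)
        i += first_preset
        for s in second_samples[j:j + second_preset]:    # second preset number of second samples
            train_step_b(s)
        j += second_preset
```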
6. The training method of claim 1, wherein obtaining a first three-dimensional face sample comprises:
acquiring a plurality of face images, and performing face reconstruction on the face images to obtain the first three-dimensional face sample.
7. The training method of claim 6, wherein obtaining a first three-dimensional face sample further comprises:
after performing face reconstruction on the plurality of face images, spatially aligning the reconstructed three-dimensional face with the second three-dimensional face sample.
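Claim 7 leaves the alignment method open; one common choice is a rigid (similarity) Procrustes fit of the reconstructed vertices onto the second three-dimensional face sample, sketched below as an assumption rather than the claimed procedure.

```python
# Illustrative spatial alignment: similarity Procrustes fit of reconstructed vertices
# onto the second-sample vertices (both arrays of shape (N, 3) with matching order).
import numpy as np

def procrustes_align(src, dst):
    """Return src rotated, scaled and translated to best match dst."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, s, vt = np.linalg.svd(src_c.T @ dst_c)
    r = u @ vt                                   # optimal rotation
    if np.linalg.det(r) < 0:                     # avoid reflections
        u[:, -1] *= -1
        r = u @ vt
    scale = s.sum() / (src_c ** 2).sum()         # optimal uniform scale
    return scale * src_c @ r + dst.mean(0)

reconstructed = np.random.rand(100, 3)           # placeholder reconstructed vertices
template = np.random.rand(100, 3)                # placeholder second-sample vertices
aligned = procrustes_align(reconstructed, template)
```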
8. An expression migration method using an expression migration model, wherein the expression migration model includes an encoder, a first decoder, and a second decoder, and is pre-trained by the training method of any one of claims 1 to 7, the expression migration method comprising:
acquiring a first three-dimensional face of an expression to be migrated;
performing expression migration on the first three-dimensional face based on the encoder and the second decoder to obtain a second three-dimensional face.
9. The expression migration method according to claim 8, wherein performing expression migration on the first three-dimensional face based on the encoder and the second decoder comprises:
inputting the initial vertex information and the adjacency matrix of the first three-dimensional face into the encoder to obtain a first encoding vector;
inputting the first encoded vector to the second decoder.
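A hedged sketch of the inference path of claims 8 and 9: the source face is encoded with the shared encoder and decoded with the second (target-identity) decoder; encoder and decoder_b are assumed to be modules like the illustrative ones sketched for claim 2, not the patented implementation.

```python
# Hypothetical expression-transfer inference step.
import torch

@torch.no_grad()
def transfer_expression(encoder, decoder_b, verts, adj):
    """verts: (N, 3) source vertices; adj: (N, N) normalised adjacency matrix."""
    encoder.eval()
    decoder_b.eval()
    code = encoder(verts, adj)       # first encoding vector of the source face
    return decoder_b(code)           # second 3D face carrying the migrated expression
```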
10. The expression migration method according to claim 8 or 9, wherein acquiring the first three-dimensional face of the expression to be migrated comprises:
capturing multiple frames of face images of the same face from a target video, and performing face reconstruction on the multiple frames of face images to obtain a plurality of first three-dimensional faces of expressions to be migrated.
11. The expression migration method according to claim 10, further comprising:
generating an animation based on the second three-dimensional face.
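A hedged sketch of claims 10 and 11: frames are sampled from the target video, each frame is reconstructed into a source mesh and migrated onto the target identity, and the resulting meshes are kept in order as animation keyframes. Here reconstruct_face_mesh and transfer_expression are hypothetical placeholders for steps defined elsewhere in the specification; the frame stride is an assumption.

```python
# Illustrative video-to-animation pipeline using OpenCV frame capture.
import cv2

def video_to_keyframes(video_path, reconstruct_face_mesh, transfer_expression, stride=2):
    cap = cv2.VideoCapture(video_path)
    keyframes, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:                               # sample every stride-th frame
            source_mesh = reconstruct_face_mesh(frame)        # first 3D face to be migrated
            keyframes.append(transfer_expression(source_mesh))  # second 3D face per frame
        index += 1
    cap.release()
    return keyframes            # the ordered target-identity meshes drive the animation
```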
12. An apparatus for training an expression migration model, wherein the expression migration model includes an encoder, a first decoder, and a second decoder, the apparatus comprising:
a first obtaining module configured to obtain a first three-dimensional face sample and a second three-dimensional face sample;
a training module configured to train the first three-dimensional face sample based on the encoder and the first decoder, and train the second three-dimensional face sample based on the encoder and the second decoder;
a judging module configured to judge whether a training stop condition is reached and, in case the training stop condition is reached, stop the training process.
13. An expression migration apparatus using an expression migration model, wherein the expression migration model includes an encoder, a first decoder, and a second decoder, and is pre-trained by the training method according to any one of claims 1 to 7, the expression migration apparatus comprising:
a second acquisition module configured to acquire a first three-dimensional face of an expression to be migrated;
a migration module configured to perform expression migration on the first three-dimensional face based on the encoder and the second decoder to obtain a second three-dimensional face.
14. A computing device, comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor is used for executing the computer-executable instructions to implement the method for training the expression migration model according to any one of claims 1 to 7 or the operation steps of the expression migration method using the expression migration model according to any one of claims 8 to 11.
15. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method for training an expression migration model according to any one of claims 1 to 7 or the operation steps of the expression migration method using an expression migration model according to any one of claims 8 to 11.
CN202110560292.4A 2021-05-21 2021-05-21 Expression migration model training method and device and expression migration method and device Pending CN113205449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110560292.4A CN113205449A (en) 2021-05-21 2021-05-21 Expression migration model training method and device and expression migration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110560292.4A CN113205449A (en) 2021-05-21 2021-05-21 Expression migration model training method and device and expression migration method and device

Publications (1)

Publication Number Publication Date
CN113205449A true CN113205449A (en) 2021-08-03

Family

ID=77023024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110560292.4A Pending CN113205449A (en) 2021-05-21 2021-05-21 Expression migration model training method and device and expression migration method and device

Country Status (1)

Country Link
CN (1) CN113205449A (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376692A (en) * 2018-11-22 2019-02-22 河海大学常州校区 Migration convolution neural network method towards facial expression recognition
WO2020258668A1 (en) * 2019-06-26 2020-12-30 平安科技(深圳)有限公司 Facial image generation method and apparatus based on adversarial network model, and nonvolatile readable storage medium and computer device
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism
CN111401216A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
CN111488972A (en) * 2020-04-09 2020-08-04 北京百度网讯科技有限公司 Data migration method and device, electronic equipment and storage medium
CN111553267A (en) * 2020-04-27 2020-08-18 腾讯科技(深圳)有限公司 Image processing method, image processing model training method and device
CN111652121A (en) * 2020-06-01 2020-09-11 腾讯科技(深圳)有限公司 Training method of expression migration model, and expression migration method and device
CN111783603A (en) * 2020-06-24 2020-10-16 有半岛(北京)信息科技有限公司 Training method for generating confrontation network, image face changing method and video face changing method and device
CN111767842A (en) * 2020-06-29 2020-10-13 杭州电子科技大学 Micro-expression type distinguishing method based on transfer learning and self-encoder data enhancement
CN111767744A (en) * 2020-07-06 2020-10-13 北京猿力未来科技有限公司 Training method and device for text style migration system
CN112233012A (en) * 2020-08-10 2021-01-15 上海交通大学 Face generation system and method
CN112348739A (en) * 2020-11-27 2021-02-09 广州博冠信息科技有限公司 Image processing method, device, equipment and storage medium
CN112541958A (en) * 2020-12-21 2021-03-23 清华大学 Parametric modeling method and device for three-dimensional face
CN112767519A (en) * 2020-12-30 2021-05-07 电子科技大学 Controllable expression generation method combined with style migration
CN112633425A (en) * 2021-03-11 2021-04-09 腾讯科技(深圳)有限公司 Image classification method and device
CN113610989A (en) * 2021-08-04 2021-11-05 北京百度网讯科技有限公司 Method and device for training style migration model and method and device for style migration
CN114283051A (en) * 2021-12-09 2022-04-05 湖南大学 Face image processing method and device, computer equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
XUE F: "Transfer: Learning relation-aware facial expression representations with transformers", 《PROCEEDINGS OF THE IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION》, 31 December 2021 (2021-12-31), pages 3601 - 3610 *
中国电子学会编著: "《2020-2021智能科学与技术学科发展报告》", 30 November 2022, 中国科学技术出版社, pages: 90 - 92 *
刘伦豪杰;王晨辉;卢慧;王家豪;: "基于迁移卷积神经网络的人脸表情识别", 电脑知识与技术, no. 07, 5 March 2019 (2019-03-05), pages 1 *
张江宁: "基于深度学习的条件式视觉内容生成研究及应用", 《中国博士学位论文全文数据库信息科技辑》, no. 02, 15 February 2023 (2023-02-15), pages 138 - 217 *
陈军波;刘蓉;刘明;冯杨;: "基于条件生成式对抗网络的面部表情迁移模型", 计算机工程, no. 04, 15 April 2020 (2020-04-15), pages 1 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705368A (en) * 2021-08-09 2021-11-26 上海幻电信息科技有限公司 Facial expression migration method and device and computer equipment
CN113762147A (en) * 2021-09-06 2021-12-07 网易(杭州)网络有限公司 Facial expression migration method and device, electronic equipment and storage medium
CN113781616A (en) * 2021-11-08 2021-12-10 江苏原力数字科技股份有限公司 Facial animation binding acceleration method based on neural network
WO2023185395A1 (en) * 2022-03-30 2023-10-05 北京字跳网络技术有限公司 Facial expression capturing method and apparatus, computer device, and storage medium
CN115601485A (en) * 2022-12-15 2023-01-13 阿里巴巴(中国)有限公司(Cn) Data processing method of task processing model and virtual character animation generation method
CN117540789A (en) * 2024-01-09 2024-02-09 腾讯科技(深圳)有限公司 Model training method, facial expression migration method, device, equipment and medium
CN117540789B (en) * 2024-01-09 2024-04-26 腾讯科技(深圳)有限公司 Model training method, facial expression migration method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN113205449A (en) Expression migration model training method and device and expression migration method and device
US10593021B1 (en) Motion deblurring using neural network architectures
CN111091045B (en) Sign language identification method based on space-time attention mechanism
EP3678059B1 (en) Image processing method, image processing apparatus, and a neural network training method
CN111325851B (en) Image processing method and device, electronic equipment and computer readable storage medium
Chen et al. The face image super-resolution algorithm based on combined representation learning
CN113822969B (en) Training neural radiation field model, face generation method, device and server
Sun et al. Learning image compressed sensing with sub-pixel convolutional generative adversarial network
CN110111256B (en) Image super-resolution reconstruction method based on residual distillation network
CN113177882B (en) Single-frame image super-resolution processing method based on diffusion model
Tuzel et al. Global-local face upsampling network
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
KR102602112B1 (en) Data processing method, device, and medium for generating facial images
CN113762147B (en) Facial expression migration method and device, electronic equipment and storage medium
Chen et al. MICU: Image super-resolution via multi-level information compensation and U-net
US20230154111A1 (en) Method and apparatus for three-dimensional reconstruction of a human head for rendering a human image
WO2022156621A1 (en) Artificial intelligence-based image coloring method and apparatus, electronic device, computer readable storage medium, and computer program product
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
CN116664397B (en) TransSR-Net structured image super-resolution reconstruction method
CN113392791A (en) Skin prediction processing method, device, equipment and storage medium
WO2022166840A1 (en) Face attribute editing model training method, face attribute editing method and device
CN114581918A (en) Text recognition model training method and device
Gao et al. Tetgan: A convolutional neural network for tetrahedral mesh generation
CN112686830B (en) Super-resolution method of single depth map based on image decomposition
CN113128455A (en) Cell image reconstruction model training method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519000 Room 102, 202, 302 and 402, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province, Room 102 and 202, No. 327 and Room 302, No. 329

Applicant after: Zhuhai Jinshan Digital Network Technology Co.,Ltd.

Address before: 519000 Room 102, 202, 302 and 402, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province, Room 102 and 202, No. 327 and Room 302, No. 329

Applicant before: ZHUHAI KINGSOFT ONLINE GAME TECHNOLOGY Co.,Ltd.
