CN112308949A - Model training method, human face image generation device and storage medium - Google Patents

Model training method, human face image generation device and storage medium Download PDF

Info

Publication number
CN112308949A
Authority
CN
China
Prior art keywords
key point
sample
face
model
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010604310.XA
Other languages
Chinese (zh)
Inventor
申童
张炜
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010604310.XA priority Critical patent/CN112308949A/en
Publication of CN112308949A publication Critical patent/CN112308949A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T13/205 - 3D [Three Dimensional] animation driven by audio data
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 - Learning methods
    • G06T2200/00 - Indexing scheme for image data processing or generation, in general
    • G06T2200/04 - Indexing scheme for image data processing or generation, in general involving 3D image data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a model training method and device, a face image generation method and device, and a storage medium, relating to the technical field of deep learning. The method comprises the following steps: acquiring an audio sample and a face sample image corresponding to a video sample, and generating a sample sound feature matrix corresponding to the audio sample; generating a face key point deviation sequence according to a comparison result between a face reference image and the face sample image; constructing a training sample from the associated sample sound feature matrix and face key point deviation sequence; and training a face key point deviation model to be trained by using the training sample. The method, the device and the storage medium can describe the expression and posture of a speaker with face key points in a two-dimensional space; training data can be collected without special equipment, the mapping from the sound signal to the face key points is realized through sound encoding and the face key point deviation model, and the efficiency and accuracy of model training and use are improved.

Description

Model training method, human face image generation device and storage medium
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a model training method and apparatus, a face image generation method and apparatus, and a storage medium.
Background
With the development of Internet technology, many video platforms, live-streaming platforms and the like have appeared, on which virtual characters such as a virtual host can serve as the anchor. For the virtual-anchor function, a video of the virtual character's facial expressions and corresponding mouth shapes needs to be generated from a sound signal and played on the platform. In current techniques for generating a virtual character, a three-dimensional face model is obtained by three-dimensionally reconstructing the face images in the training data based on deep learning and 3D technology; the three-dimensional face model is then parameterized so that its various states can be controlled with a small number of parameters, rendered with graphics techniques, and finally re-projected into 2D space to generate the video. Existing virtual character generation schemes have complex pipelines and high resource consumption, the three-dimensional face model obtained by reconstruction is coarse and degrades the subsequent rendering, the facial expression of the generated virtual character deviates considerably from that of a real person, and the visual effect of the generated virtual character is poor.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a model training method and apparatus, a face image generation method and apparatus, and a storage medium.
According to a first aspect of the present disclosure, there is provided a model training method, comprising: carrying out separation processing on a video sample to obtain an audio sample and a face sample image corresponding to the video sample; generating a sample sound feature matrix corresponding to the audio sample; generating a human face key point deviation sequence according to a comparison result of the human face reference image and the human face sample image; constructing a training sample according to the associated sample sound feature matrix and the human face key point deviation sequence; and training a human face key point deviation model to be trained by using the training sample to obtain the trained human face key point deviation model.
Optionally, the generating a sample sound feature matrix corresponding to the audio sample comprises: obtaining a Mel Frequency Cepstrum Coefficient (MFCC) feature corresponding to the audio sample, and generating an initial feature matrix based on the MFCC feature; and performing convolution processing on the initial feature matrix by using an encoder to generate the sample sound feature matrix.
Optionally, the human face key point deviation model comprises a facial key point deviation model and an expression key point deviation model; the training samples comprise facial training samples and expression training samples; the training of the human face key point deviation model to be trained by using the training sample to obtain the trained human face key point deviation model comprises the following steps: training the facial key point deviation model to be trained by using the facial training sample to obtain the trained facial key point deviation model; and training the expression key point deviation model to be trained by using the expression training sample to obtain the trained expression key point deviation model.
Optionally, the face key points include facial key points and expression key points; the generating of the human face key point deviation sequence according to the comparison result of the human face reference image and the human face sample image comprises: acquiring facial key point reference positions and expression key point reference positions corresponding to the facial key points and the expression key points in the facial reference image respectively; acquiring facial key point sample positions and expression key point sample positions corresponding to facial key points and expression key points in the facial sample image respectively; generating a facial key point deviation sequence based on the position deviation between the facial key point reference position and the facial key point sample position, and generating an expression key point deviation sequence based on the position deviation between the expression key point reference position and the expression key point sample position.
Optionally, the constructing a training sample according to the associated sample sound feature matrix and the face keypoint bias sequence includes: constructing the facial training sample according to the associated sample sound feature matrix and the facial key point deviation sequence; and constructing the expression training sample according to the associated sample sound feature matrix and the expression key point deviation sequence.
Optionally, the facial keypoint bias model and the expression keypoint bias model comprise: an LSTM model; the LSTM model is composed of an input layer, a plurality of hidden layers and an output layer, the last hidden layer outputs a prediction result to the output layer, and the output layer limits the value of the prediction result to a preset value interval by adopting an activation function.
According to a second aspect of the present disclosure, there is provided a face image generation method, including: receiving audio information, and generating a target sound characteristic matrix corresponding to the audio information; acquiring human face key point deviation information corresponding to the target sound characteristic matrix by using a human face key point deviation model; acquiring a face reference image corresponding to the audio information, and generating a face image corresponding to the audio information based on the face reference image and the face key point deviation information; the human face key point deviation model is obtained by training through the model training method.
Optionally, the human face key point deviation model comprises a facial key point deviation model and an expression key point deviation model; the face key points comprise face key points and expression key points; the obtaining of the human face key point deviation information corresponding to the target sound feature matrix by using the human face key point deviation model includes: acquiring facial key point deviation information corresponding to the target sound feature matrix by using the facial key point deviation model; and obtaining the expression key point deviation information corresponding to the target sound characteristic matrix by using the expression key point deviation model.
Optionally, the generating a face image corresponding to the audio information based on the face reference image and the face key point deviation information includes: acquiring facial key point reference positions and expression key point reference positions corresponding to the facial key points and the expression key points in the facial reference image; generating a facial key point position map based on the facial key point reference position and the facial key point deviation information, and generating an expression key point position map based on the expression key point reference position and the expression key point deviation information; fusing the facial key point position graph and the expression key point position graph to generate a face key point position graph; and generating the face image based on the face key point position image.
Optionally, the face image includes: two-dimensional face images or three-dimensional face images.
According to a third aspect of the present disclosure, there is provided a model training apparatus comprising: the video separation module is used for separating a video sample to obtain an audio sample and a face sample image corresponding to the video sample; a sound sample generation module for generating a sample sound feature matrix corresponding to the audio sample; the image sample generation module is used for generating a human face key point deviation sequence according to a comparison result of a human face reference image and the human face sample image; the training sample construction module is used for constructing a training sample according to the associated sample sound characteristic matrix and the human face key point deviation sequence; and the model training module is used for training the human face key point deviation model to be trained by utilizing the training sample to obtain the trained human face key point deviation model.
According to a fourth aspect of the present disclosure, there is provided a face image generation apparatus comprising: the sound characteristic generating module is used for receiving audio information and generating a target sound characteristic matrix corresponding to the audio information; the deviation obtaining module is used for obtaining the deviation information of the human face key points corresponding to the target sound characteristic matrix by using a human face key point deviation model; and the image processing module is used for acquiring a face reference image corresponding to the audio information and generating a face image corresponding to the audio information based on the face reference image and the face key point deviation information. The human face key point deviation model is obtained by training through the model training method.
According to a fifth aspect of the present disclosure, there is provided a model training apparatus comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method as described above based on instructions stored in the memory.
According to a sixth aspect of the present disclosure, there is provided a face image generation apparatus comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method as described above based on instructions stored in the memory.
According to a seventh aspect of the present disclosure, there is provided a computer-readable storage medium storing computer instructions for a processor to perform the above model training method, and/or to perform the above face image generation method.
With the model training method and device, the face image generation method and device, and the storage medium described above, a sample sound feature matrix and a face key point deviation sequence are generated for an audio sample and a face sample image, and a training sample is constructed to train the face key point deviation model to be trained. No three-dimensional reconstruction of the facial structure is needed, and the expression and posture of the speaker can be described with face key points in a two-dimensional space. Training data can be collected without special equipment; the mapping from the sound signal to the face key points is realized through sound encoding and the face key point deviation model to obtain the face key point deviation information, which improves the efficiency and accuracy of model training and use.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present disclosure, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a model training method according to the present disclosure;
FIG. 2 is a schematic flow chart diagram of generating a sample acoustic feature matrix in one embodiment of a model training method according to the present disclosure;
FIG. 3 is a schematic diagram of an encoder for performing convolution processing on an initial feature matrix;
FIG. 4 is a schematic flow chart diagram illustrating the generation of a human face keypoint bias sequence in an embodiment of a model training method according to the present disclosure;
FIG. 5 is a schematic structural diagram of an LSTM model;
FIG. 6 is a schematic flow chart diagram illustrating one embodiment of a method for generating a face image according to the present disclosure;
FIG. 7 is a schematic flow chart diagram of a face image generation process by fusion processing according to an embodiment of a face image generation method of the present disclosure;
FIG. 8 is a schematic diagram of a generated face keypoint location map;
FIG. 9 is a block diagram illustration of one embodiment of a model training apparatus according to the present disclosure;
FIG. 10 is a block diagram illustration of another embodiment of a model training apparatus according to the present disclosure;
FIG. 11 is a block diagram of one embodiment of a face image generation apparatus according to the present disclosure;
FIG. 12 is a block diagram of an image processing module in an embodiment of a face image generation apparatus according to the present disclosure;
fig. 13 is a block diagram of another embodiment of a face image generation apparatus according to the present disclosure.
Detailed Description
The present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments of the disclosure are shown. The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to these drawings; the described embodiments are only a part of the embodiments of the present disclosure, not all of them. All other embodiments obtained by those of ordinary skill in the art from the disclosed embodiments without creative effort fall within the scope of protection of the present disclosure. The technical solution of the present disclosure is described from various aspects below with reference to the figures and embodiments.
Fig. 1 is a schematic flow chart diagram of an embodiment of a model training method according to the present disclosure, as shown in fig. 1:
step 101, performing separation processing on a video sample to obtain an audio sample and a face sample image corresponding to the video sample.
In one embodiment, the video samples may be a plurality of video files of a speaker (for example, a host), each containing the speaker's face image information and audio information. A video sample can be separated using existing methods, extracting from it an audio sample and an audio-free video.
A series of video frames containing face images is extracted from the audio-free video as a face image sequence; the video frame images in the face image sequence are the face sample images. For example, 3 or 5 video frames containing face images may be extracted from each second of the audio-free video as a face image sequence, the order of the face sample images in the sequence being the same as the order in which they were captured. Alternatively, a video and its corresponding audio may be acquired separately, the audio sample being taken from the audio and the face sample images extracted from the video.
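As a non-authoritative illustration of this separation step, the sketch below uses ffmpeg and OpenCV; the file names, the 16 kHz mono audio, and the choice of 5 frames per second are assumptions made for the example, not requirements of the disclosure.

```python
# Illustrative sketch (not the patented implementation): split a speaker video
# into an audio sample and per-second sequences of face sample frames.
# Assumes ffmpeg is installed and "speaker.mp4" exists; both are hypothetical.
import subprocess
import cv2

def separate_video(video_path="speaker.mp4", audio_path="speaker.wav",
                   frames_per_second=5):
    # Extract the audio track as a 16 kHz mono WAV file.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-ar", "16000", "-ac", "1", audio_path], check=True)

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(fps // frames_per_second), 1)

    sequences, current = [], []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            current.append(frame)          # face sample image (full frame here)
        if len(current) == frames_per_second:
            sequences.append(current)      # one face image sequence per second
            current = []
        index += 1
    cap.release()
    return audio_path, sequences
```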
Step 102, a sample sound feature matrix corresponding to the audio sample is generated.
In one embodiment, the audio sample may be processed using a variety of methods to generate the sample sound feature matrix. For example, a corresponding sample sound feature matrix is generated for each one-second segment of audio in the audio sample.
And 103, generating a human face key point deviation sequence according to the comparison result of the human face reference image and the human face sample image.
In one embodiment, a face reference image corresponding to the speaker in the video is acquired. The face reference image is an image of the speaker in a neutral state (a neutral expression, not speaking) and contains the speaker's face.
The face key points in the face reference image and in the face sample image can be obtained with an existing face key point detection method; the positions and number of the face key points can be set according to design requirements. The face key points are key points of the cheeks, eyes, eyebrows, nose, mouth and so on. The same coordinate system, which may be a two-dimensional coordinate system, is set for the face reference image and the face sample image, and the coordinates of the face key points in the face reference image and the face sample image are taken as the face key point positions. The face reference image is compared with the face sample image, and the coordinate difference between a face key point in the face reference image and the corresponding face key point in the face sample image is taken as the position deviation.
Since there are multiple face key points, a difference array or vector corresponding to a face sample image can be generated from the position deviations of its face key points and used as the face key point deviation sequence. Corresponding difference arrays or vectors are established for the face sample images in the face image sequence, generating a plurality of face key point deviation sequences.
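A minimal sketch of how such a deviation sequence could be computed is given below; the `detect_landmarks` function is a hypothetical placeholder for any existing 106-point face key point detector, and the 212-dimensional flattening mirrors the example given later in the description.

```python
# Illustrative sketch: build a face key point deviation sequence by comparing
# detected key point coordinates in each face sample image with those in the
# face reference image. `detect_landmarks` is a hypothetical stand-in for an
# existing 106-point landmark detector returning an (N, 2) coordinate array.
import numpy as np

def keypoint_deviation_sequence(reference_image, sample_images, detect_landmarks):
    ref_points = detect_landmarks(reference_image)       # (106, 2) reference positions
    deviations = []
    for sample in sample_images:
        sample_points = detect_landmarks(sample)          # (106, 2) sample positions
        delta = sample_points - ref_points                # per-keypoint position deviation
        deviations.append(delta.reshape(-1))              # flatten to a 212-d vector
    return np.stack(deviations)                           # (num_frames, 212)
```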
And 104, constructing a training sample according to the associated sample voice feature matrix and the human face key point deviation sequence.
In one embodiment, a sample sound feature matrix is generated for each second of audio in the audio sample, and a face image sequence containing a plurality of face sample images is generated for the corresponding second of video; a face key point deviation sequence corresponding to the face image sequence is then generated.
The association between a sample sound feature matrix and a face key point deviation sequence is determined based on the time information of the video sample; the associated sample sound feature matrix is labeled with the face key point deviation sequence, and the training sample is constructed from the labeled information.
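The pairing of per-second sound feature matrices with per-second deviation sequences could be sketched as follows; the dictionary layout and the assumption that both lists are already aligned second by second are choices made for illustration only.

```python
# Illustrative sketch: pair each one-second sample sound feature matrix with the
# face key point deviation sequence of the same second, using the timeline of the
# video sample as the association. Shapes follow the examples in this description
# (a 100x64 sound feature matrix per second); the exact pairing policy is assumed.
def build_training_samples(sound_matrices, deviation_sequences):
    # sound_matrices: list of (100, 64) arrays, one per second of audio
    # deviation_sequences: list of (frames_per_second, 212) arrays, one per second of video
    assert len(sound_matrices) == len(deviation_sequences)
    return [
        {"sound": sound, "label": deviation}   # the deviation sequence labels the sound matrix
        for sound, deviation in zip(sound_matrices, deviation_sequences)
    ]
```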
And 105, training the human face key point deviation model to be trained by using the training sample to obtain the trained human face key point deviation model.
In one embodiment, the face key point deviation model may be a neural network model or the like. After the face key point deviation model is trained, inputting a sound feature matrix yields a face key point deviation sequence. Face key points can describe the expression, mouth shape and so on of a face with few parameters (for example, 106 face key points need only 212 parameters), and a face image can be further generated from the face key point deviation sequence.
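As a hedged sketch of the training step, a simple regression loop is shown below; the Adam optimizer, the MSE loss, and the assumption that each deviation target has already been aligned to the time steps of its sound feature matrix are illustrative choices, not part of the disclosure.

```python
# Illustrative sketch: train a key point deviation network on the constructed
# training samples with a regression loss. `model` is any network mapping a
# sound feature matrix to a deviation sequence of the same temporal length;
# the optimizer and loss are assumptions for this example.
import torch
import torch.nn as nn

def train_deviation_model(model, training_samples, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for sample in training_samples:
            sound = torch.as_tensor(sample["sound"], dtype=torch.float32).unsqueeze(0)
            target = torch.as_tensor(sample["label"], dtype=torch.float32).unsqueeze(0)
            prediction = model(sound)            # predicted key point deviations
            loss = loss_fn(prediction, target)   # compare with labeled deviations
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```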
Fig. 2 is a schematic flowchart of generating a sample sound feature matrix in an embodiment of the model training method according to the present disclosure, as shown in fig. 2:
step 201, obtaining a Mel frequency cepstrum coefficient MFCC feature corresponding to the audio sample, and generating an initial feature matrix based on the MFCC feature.
And 202, performing convolution processing on the initial feature matrix by using an encoder to generate a sample sound feature matrix.
In one embodiment, the original sound signal is complex and redundant and is not suitable to be used directly as the network input; it needs to be encoded into a signal suitable for the network. A segment of the audio sample, i.e., a sound signal, is passed through a Mel filter bank to obtain a Mel spectrum, and cepstral analysis is performed on the Mel spectrum to obtain Mel-frequency cepstral coefficient (MFCC) features, i.e., an audio feature sequence, which is then further encoded with an encoder.
For example, the sound signal of the audio sample is sampled at a 16 kHz sampling rate and MFCCs are extracted as features, using a 25 ms sampling window with a 10 ms step and taking the last 12 of the 13 coefficients as the sound feature of each window. Each second of the sound signal thus yields 100 frames, resulting in a 100x12 feature matrix.
As shown in fig. 3, in order to take historical and future sound features into account when generating the current motion, a sliding window of length 15 is added, so that each time point corresponds to the 15 sound features in the window: the sound features (12 coefficients) of the current time point plus those of the 7 time points before and after it. The 1-second sound signal is then a 100x15x12 feature matrix (the initial feature matrix): each of the 100 time points in one second corresponds to the sound features of the current time point and of the 7 points before and after it, each feature being a 12-dimensional vector. Through a series of convolution modules of the encoder, the sound features at each time point are encoded into a 64-dimensional vector, i.e., the 1-second sound signal is encoded into a 100x64 matrix: 100 time points, each represented by a 64-dimensional vector.
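A sketch of this feature pipeline is given below. The use of librosa for MFCC extraction and the particular convolution stack are assumptions made for illustration; only the shapes (100x12 per second, a context window of 15, and 64-dimensional encoded vectors) follow the description above.

```python
# Illustrative sketch of the sound encoding described above. librosa and PyTorch
# are assumed choices; the disclosure does not prescribe particular libraries.
import numpy as np
import librosa
import torch
import torch.nn as nn

def initial_feature_matrix(wav_path, context=7):
    y, sr = librosa.load(wav_path, sr=16000)
    # 25 ms window, 10 ms hop -> about 100 frames per second; keep the last 12 of 13 coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    feats = mfcc[1:].T                                    # (T, 12)
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    # Sliding window of length 15: current frame plus 7 frames before and after it.
    windows = np.stack([padded[t:t + 2 * context + 1] for t in range(len(feats))])
    return windows                                         # (T, 15, 12), T about 100 per second

class SoundEncoder(nn.Module):
    """Convolutional encoder mapping each 15x12 window to a 64-d vector."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(12, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                       # pool over the 15 window positions
        )

    def forward(self, windows):                            # windows: (T, 15, 12)
        x = windows.transpose(1, 2)                        # (T, 12, 15) for Conv1d
        return self.conv(x).squeeze(-1)                    # (T, 64)

# Usage with an assumed file name:
# features = SoundEncoder()(torch.from_numpy(initial_feature_matrix("speaker.wav")).float())
```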
In one embodiment, the face key point deviation model includes a facial key point deviation model and an expression key point deviation model, and the training samples include facial training samples and expression training samples. Because a speaker's head may move as a whole while the mouth shape and expression change during speech, each face key point has a global displacement in addition to its local relative displacement.
The face key points include facial key points and expression key points. When the speaker speaks, the positions of the facial key points are relatively fixed; they include key points of the cheeks, eye corners, nose bridge and so on. The positions of the expression key points change considerably; they include key points around the lips and near the chin, as well as key points of the eyelids, facial muscles and so on. To decouple these two kinds of displacement, the face key point deviation model includes a facial key point deviation model, which can serve as a global network model, and an expression key point deviation model, which can serve as a local network model.
The face key point deviation model is used for predicting deviation information of relatively fixed face key points in the face and reflecting the overall motion of the head; the expression key point deviation model is used for predicting deviation information of the expression key points and reflecting the change of the mouth shape. Training a facial key point deviation model to be trained by using a facial training sample to obtain a trained facial key point deviation model; and training the expression key point deviation model to be trained by using the expression training sample to obtain the trained expression key point deviation model.
In an embodiment, fig. 4 is a schematic flowchart of generating a face keypoint bias sequence in an embodiment of a model training method according to the present disclosure, as shown in fig. 4:
step 401, acquiring reference positions of facial key points and expression key points corresponding to the facial key points and the expression key points in the facial reference image respectively.
Step 402, obtaining facial key point sample positions and expression key point sample positions corresponding to facial key points and expression key points in the facial sample image respectively.
Step 403, generating a facial key point deviation sequence based on the position deviation between the facial key point reference position and the facial key point sample position, and generating an expression key point deviation sequence based on the position deviation between the expression key point reference position and the expression key point sample position.
And constructing a facial training sample according to the associated sample sound feature matrix and the facial key point deviation sequence, and constructing an expression training sample according to the associated sample sound feature matrix and the expression key point deviation sequence.
In one embodiment, the facial key point deviation model and the expression key point deviation model may be recurrent neural networks, such as LSTM (Long Short-Term Memory) models. The LSTM model consists of an input layer, a plurality of hidden layers and an output layer; the output of each hidden layer is the input of the next, the last hidden layer outputs a prediction result to the output layer, and the output layer uses an activation function to limit the value of the prediction result to a preset interval, which may be [-1, 1].
As shown in fig. 5, X is the sample sound feature matrix, H is the hidden layer output, and O is the final output of the LSTM model. The LSTM model may have two hidden layers with 128-dimensional hidden states, and its output passes through an additional fully-connected layer to produce the final output. There are 106 face key points in total; their horizontal and vertical coordinates together form a 212-dimensional vector, so the final output of the network is also 212-dimensional.
For each face key point, rather than learning its coordinates directly, the model learns the difference (position deviation) between the absolute coordinates of the face key point in the face sample image (the key point sample position) and its absolute coordinates in the face reference image (the key point reference position); the output of the LSTM model is passed through a tanh activation function to produce values in the range [-1, 1] (the face key point deviation information).
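A minimal PyTorch sketch consistent with this description (two LSTM layers, 128-dimensional hidden states, a fully-connected layer producing 212 outputs, tanh bounding the values to [-1, 1]) might look as follows; it is an illustrative assumption rather than the exact network of the disclosure. The same class could be instantiated twice, once as the facial (global) model and once as the expression (local) model.

```python
# Illustrative sketch of the key point deviation network described above:
# a 2-layer LSTM with 128-d hidden states, a fully-connected layer producing
# 212 values (x and y deviations of 106 face key points), and tanh to bound
# the outputs to [-1, 1]. PyTorch is an assumed choice.
import torch
import torch.nn as nn

class KeypointBiasLSTM(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128, num_layers=2, num_keypoints=106):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_keypoints * 2)

    def forward(self, sound_features):
        # sound_features: (batch, 100, 64), one second of encoded sound per sample
        hidden, _ = self.lstm(sound_features)              # (batch, 100, 128)
        deviations = torch.tanh(self.fc(hidden))           # (batch, 100, 212), values in [-1, 1]
        return deviations
```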
The model training method in this embodiment can be carried out in 2D space: no three-dimensional reconstruction of the face structure is needed, modeling is performed directly on the face key points, and the expression and posture of the speaker are described by the face key points. The training data can be collected without special equipment; only video and audio need to be acquired, and the face key point deviation model learns the correspondence from the sound signal to the face key points, thereby generating the corresponding face key point deviation information.
Fig. 6 is a schematic flow chart of an embodiment of a face image generation method according to the present disclosure, as shown in fig. 6:
step 601, receiving the audio information, and generating a target sound feature matrix corresponding to the audio information. The formats and the generation methods of the target sound feature matrix and the sample sound feature matrix are the same.
Step 602, obtaining human face key point deviation information corresponding to the target sound feature matrix by using the human face key point deviation model.
In an embodiment, the face keypoint bias model is obtained by training through a training method as in any one of the above embodiments. And inputting the target sound characteristic matrix into the human face key point deviation model, and acquiring human face key point deviation information output by the human face key point deviation model.
For example, 10 target sound feature matrices are generated for 10 seconds of audio information, and 10 pieces of face key point deviation information corresponding to the 10 target sound feature matrices are acquired using the face key point deviation model. The face key point deviation information may be a face key point deviation sequence, the face key point deviation sequence includes a plurality of face key point deviation arrays or vectors corresponding to the face image, and each array or vector includes position deviation values of a plurality of face key points.
Step 603, a face reference image corresponding to the audio information is obtained, and a face image corresponding to the audio information is generated based on the face reference image and the face key point deviation information.
In one embodiment, the reference positions of the face key points in the face reference image corresponding to the audio information are acquired. One piece of face key point deviation information predicted by the face key point deviation model, i.e., one face key point deviation sequence, is obtained, and a plurality of face key point deviation arrays or vectors are extracted from it, each containing the deviation values of a plurality of face key points.
Determining the actual positions of the face key points in the face image based on the deviation values of the face key points in the face key point deviation array or vector and the reference positions of the face key points, generating a key point position diagram, and generating the face image based on the key point position diagram.
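As a hedged illustration of this step, the key point positions could be recovered and drawn into a position map as sketched below; rendering the points as small discs on a blank canvas and the scale used to map the [-1, 1] deviations back to pixels are assumptions made for the example.

```python
# Illustrative sketch: recover actual key point positions from reference positions
# plus predicted deviations, and draw them into a key point position map that a
# face image generation model could take as input. The deviation-to-pixel scale
# and the drawing style are assumptions for this example.
import numpy as np
import cv2

def keypoint_position_map(ref_points, deviation_vector, image_size=(256, 256), scale=32.0):
    # ref_points: (106, 2) reference positions in pixels
    # deviation_vector: (212,) model output in [-1, 1], rescaled to pixel offsets
    offsets = deviation_vector.reshape(-1, 2) * scale
    points = ref_points + offsets                          # actual key point positions
    canvas = np.zeros(image_size, dtype=np.uint8)
    for x, y in points.astype(int):
        cv2.circle(canvas, (int(x), int(y)), radius=2, color=255, thickness=-1)
    return canvas                                          # key point position map
```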
Various methods may be used to generate the face image corresponding to the audio information. For example, a trained face image generation model, such as a convolutional neural network model, may be used: the key point position map is input into the face image generation model to obtain the corresponding face image, which may be a two-dimensional face image or a three-dimensional face image.
For the plurality of face key point deviation arrays or vectors in one face key point deviation sequence, the corresponding key point position maps are generated, the face image generation model produces the face images corresponding to those maps, and a video of 1 second in length can be generated from the resulting face images.
In one embodiment, the face keypoint bias model comprises a face keypoint bias model and an expression keypoint bias model; the face key points include face key points and expression key points. And acquiring facial key point deviation information corresponding to the target sound feature matrix by using the facial key point deviation model, and acquiring expression key point deviation information corresponding to the target sound feature matrix by using the expression key point deviation model. The facial key point deviation information may be a facial key point deviation sequence, and the expression key point deviation information may be an expression key point deviation sequence.
Fig. 7 is a schematic flowchart of a process of generating a face image through fusion processing according to an embodiment of the face image generation method of the present disclosure, as shown in fig. 7:
step 701, acquiring reference positions of facial key points and expression key points corresponding to the facial key points and the expression key points in the facial reference image.
Step 702, generating a facial key point position map based on the facial key point reference positions and the facial key point deviation information, and generating an expression key point position map based on the expression key point reference positions and the expression key point deviation information.
And 703, fusing the facial key point position map and the expression key point position map to generate a facial key point position map.
Step 704, generating a face image based on the face key point position map.
As shown in fig. 8, the facial key point deviation model can serve as a global network model and the expression key point deviation model as a local network model. The audio signal (the target sound feature matrix) is input into the global network model and into the local network model to obtain the facial key point deviation information and the expression key point deviation information, respectively. The global network model learns the changes of the relatively fixed points in the face (such as key points of the eye corners, nose bridge and the like) and focuses on the overall motion of the head; the local network model places higher weight on the lips and chin and concentrates on learning the changes of the mouth shape.
The facial key point reference positions and the expression key point reference positions corresponding to the facial key points and the expression key points in the face reference image are acquired; a facial key point position map is generated based on the facial key point reference positions and the facial key point deviation information, and an expression key point position map is generated based on the expression key point reference positions and the expression key point deviation information.
The facial key point position map and the expression key point position map are fused to generate the face key point position map. The fusion may be performed in various ways: for example, a perspective transformation matrix between the two position maps is computed from specific points in the facial key point position map and the expression key point position map, and the two maps are fused with this matrix to obtain the face key point position map. The face key point position map can then be input into the face image generation model to obtain the corresponding face image.
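One possible sketch of the fusion step is given below; it uses cv2.findHomography to estimate the perspective transformation from corresponding anchor points and merges the warped maps by taking the element-wise maximum, both of which are assumptions made for illustration rather than the method of the disclosure.

```python
# Illustrative sketch of the fusion step: estimate a perspective transformation
# between corresponding anchor points of the facial and expression key point
# position maps and warp one map onto the other. Using cv2.findHomography for
# four or more correspondences is an assumption; the disclosure only states
# that a perspective transformation matrix between specific points is used.
import numpy as np
import cv2

def fuse_position_maps(facial_map, expression_map, facial_anchors, expression_anchors):
    # facial_anchors / expression_anchors: (K, 2) arrays of matching key points (K >= 4)
    matrix, _ = cv2.findHomography(expression_anchors.astype(np.float32),
                                   facial_anchors.astype(np.float32))
    h, w = facial_map.shape[:2]
    warped = cv2.warpPerspective(expression_map, matrix, (w, h))
    # Merge the two maps; taking the element-wise maximum keeps both sets of points.
    return np.maximum(facial_map, warped)
```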
In one embodiment, as shown in fig. 9, the present disclosure provides a model training apparatus 90 comprising: a video separation module 91, a sound sample generation module 92, an image sample generation module 93, a training sample construction module 94, and a model training module 95.
The video separation module 91 performs separation processing on the video sample to obtain an audio sample and a face sample image corresponding to the video sample. The sound sample generation module 92 generates a sample sound feature matrix corresponding to the audio samples. The image sample generating module 93 generates a human face key point deviation sequence according to the comparison result between the human face reference image and the human face sample image.
The training sample construction module 94 constructs training samples according to the associated sample voice feature matrix and the face keypoint bias sequence. The model training module 95 trains the face key point deviation model to be trained by using the training sample to obtain the trained face key point deviation model.
In one implementation, the sound sample generation module 92 obtains mel-frequency cepstrum coefficients MFCC features corresponding to the audio samples, generates an initial feature matrix based on the MFCC features, and performs convolution processing on the initial feature matrix using an encoder to generate a sample sound feature matrix.
The human face key point deviation model comprises a facial key point deviation model and an expression key point deviation model; the training samples include facial training samples and expression training samples. The model training module 95 trains the facial key point deviation model to be trained by using the facial training sample to obtain a trained facial key point deviation model; the model training module 95 trains the expression key point deviation model to be trained by using the expression training sample to obtain a trained expression key point deviation model.
The face key points include face key points and expression key points. The image sample generation module 93 obtains the reference positions of the facial key points and the expression key points corresponding to the facial key points and the expression key points in the face reference image. The image sample generation module 93 obtains facial key point sample positions and expression key point sample positions corresponding to facial key points and expression key points in the face sample image respectively.
The image sample generation module 93 generates a facial key point deviation sequence based on the positional deviation between the facial key point reference position and the facial key point sample position, and generates an expression key point deviation sequence based on the positional deviation between the expression key point reference position and the expression key point sample position.
The training sample construction module 94 constructs a facial training sample according to the associated sample sound feature matrix and facial key point deviation sequence, and constructs an expression training sample according to the associated sample sound feature matrix and expression key point deviation sequence.
In one embodiment, FIG. 10 is a block diagram illustration of another embodiment of a model training apparatus according to the present disclosure. As shown in fig. 10, the apparatus may include a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. The memory 1001 is used for storing instructions, the processor 1002 is coupled to the memory 1001, and the processor 1002 is configured to execute the model training method based on the instructions stored in the memory 1001.
The memory 1001 may be a high-speed RAM memory, a non-volatile memory, or the like, and may be a memory array. The memory 1001 may also be partitioned into blocks, and the blocks may be combined into virtual volumes according to certain rules. The processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the model training method of the present disclosure.
In one embodiment, as shown in fig. 11, the present disclosure provides a face image generation apparatus 1100, including: a sound feature generation module 1101, a deviation acquisition module 1102 and an image processing module 1103. The sound feature generation module 1101 receives the audio information and generates a target sound feature matrix corresponding to the audio information. The bias obtaining module 1102 obtains face keypoint bias information corresponding to the target sound feature matrix using a face keypoint bias model. The image processing module 1103 obtains a face reference image corresponding to the audio information, and generates a face image corresponding to the audio information based on the face reference image and the face key point deviation information.
The human face key point deviation model comprises a facial key point deviation model and an expression key point deviation model; the face key points include face key points and expression key points. The deviation obtaining module 1102 obtains the deviation information of the facial key points corresponding to the target sound feature matrix using the facial key point deviation model, and obtains the deviation information of the expression key points corresponding to the target sound feature matrix using the expression key point deviation model.
In one embodiment, as shown in fig. 12, the image processing module 1103 includes: a map generation unit 1104, a fusion processing unit 1105, and an image generation unit 1106. The position map generating unit 1104 acquires the reference positions of the facial key points and the expression key points corresponding to the facial key points and the expression key points in the face reference image; the position map generating unit 1104 generates a facial key point position map based on the facial key point reference position and the facial key point deviation information, and generates an expression key point position map based on the expression key point reference position and the expression key point deviation information.
The fusion processing unit 1105 performs fusion processing on the face key point position map and the expression key point position map to generate a face key point position map. The image generation unit 1106 generates a face image based on the face keypoint location map.
In one embodiment, fig. 13 is a block diagram of another embodiment of a face image generation apparatus according to the present disclosure. As shown in fig. 13, the apparatus may include a memory 131, a processor 132, a communication interface 133, and a bus 134. The memory 131 is used for storing instructions, the processor 132 is coupled to the memory 131, and the processor 132 is configured to execute the human face image generation method based on the instructions stored in the memory 131.
The memory 131 may be a high-speed RAM memory, a non-volatile memory, or the like, and may be a memory array. The memory 131 may also be partitioned into blocks, and the blocks may be combined into virtual volumes according to certain rules. The processor 132 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the face image generation method of the present disclosure.
In one embodiment, the present disclosure provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement a model training method as in any one of the above embodiments, and/or a face image generation method as in any one of the above embodiments.
The model training method and device, the face image generation method and device and the storage medium provided by the embodiment generate a sample voice feature matrix and a face key point deviation sequence for an audio sample and a face sample image, construct a training sample, and train a face key point deviation model to be trained; the three-dimensional structure reconstruction of the facial structure is not needed, and the expression and the posture of the speaker can be described by using the key points of the human face in a two-dimensional space; training data is collected without special equipment, only video and audio are needed to be acquired, the mapping relation from the sound signals to the key points of the human face is realized through the voice coding and the human face key point deviation model, the human face key point deviation information is acquired, the efficiency and the accuracy of model training and use are improved, the generated virtual character has vivid facial expression and good visual effect.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (15)

1. A model training method, comprising:
carrying out separation processing on a video sample to obtain an audio sample and a face sample image corresponding to the video sample;
generating a sample sound feature matrix corresponding to the audio sample;
generating a human face key point deviation sequence according to a comparison result of the human face reference image and the human face sample image;
constructing a training sample according to the associated sample sound feature matrix and the human face key point deviation sequence;
and training a human face key point deviation model to be trained by using the training sample to obtain the trained human face key point deviation model.
2. The method of claim 1, the generating a sample sound feature matrix corresponding to the audio samples comprising:
obtaining a Mel Frequency Cepstrum Coefficient (MFCC) feature corresponding to the audio sample, and generating an initial feature matrix based on the MFCC feature;
and performing convolution processing on the initial feature matrix by using an encoder to generate the sample sound feature matrix.
3. The method of claim 1, wherein the face keypoint bias models comprise a facial keypoint bias model and an expression keypoint bias model; the training samples comprise facial training samples and expression training samples; the training of the human face key point deviation model to be trained by using the training sample to obtain the trained human face key point deviation model comprises the following steps:
training the facial key point deviation model to be trained by using the facial training sample to obtain the trained facial key point deviation model;
and training the expression key point deviation model to be trained by using the expression training sample to obtain the trained expression key point deviation model.
4. The method of claim 3, the face keypoints comprising facial keypoints and expression keypoints; the generating of the human face key point deviation sequence according to the comparison result of the human face reference image and the human face sample image comprises:
acquiring facial key point reference positions and expression key point reference positions corresponding to the facial key points and the expression key points in the facial reference image respectively;
acquiring facial key point sample positions and expression key point sample positions corresponding to facial key points and expression key points in the facial sample image respectively;
generating a facial key point deviation sequence based on the position deviation between the facial key point reference position and the facial key point sample position, and generating an expression key point deviation sequence based on the position deviation between the expression key point reference position and the expression key point sample position.
5. The method of claim 4, the constructing training samples from the associated sample voice feature matrix and the face keypoint bias sequence comprising:
constructing the facial training sample according to the associated sample sound feature matrix and the facial key point deviation sequence;
and constructing the expression training sample according to the associated sample sound feature matrix and the expression key point deviation sequence.
6. The method of any one of claims 3 to 5,
the facial keypoint bias model and the expression keypoint bias model include: an LSTM model;
the LSTM model is composed of an input layer, a plurality of hidden layers and an output layer, the last hidden layer outputs a prediction result to the output layer, and the output layer limits the value of the prediction result to a preset value interval by adopting an activation function.
7. A face image generation method comprises the following steps:
receiving audio information, and generating a target sound characteristic matrix corresponding to the audio information;
acquiring human face key point deviation information corresponding to the target sound characteristic matrix by using a human face key point deviation model;
acquiring a face reference image corresponding to the audio information, and generating a face image corresponding to the audio information based on the face reference image and the face key point deviation information;
the human face key point deviation model is obtained by training through the model training method of any one of claims 1 to 6.
8. The method of claim 7, wherein the face keypoint bias models comprise a facial keypoint bias model and an expression keypoint bias model; the face key points comprise face key points and expression key points; the obtaining of the human face key point deviation information corresponding to the target sound feature matrix by using the human face key point deviation model includes:
acquiring facial key point deviation information corresponding to the target sound feature matrix by using the facial key point deviation model;
and obtaining the expression key point deviation information corresponding to the target sound characteristic matrix by using the expression key point deviation model.
9. The method of claim 8, wherein the generating a face image corresponding to the audio information based on the face reference image and the face keypoint bias information comprises:
acquiring facial key point reference positions and expression key point reference positions corresponding to the facial key points and the expression key points in the facial reference image;
generating a facial key point position map based on the facial key point reference position and the facial key point deviation information, and generating an expression key point position map based on the expression key point reference position and the expression key point deviation information;
fusing the facial key point position graph and the expression key point position graph to generate a face key point position graph;
and generating the face image based on the face key point position image.
10. The method of any one of claims 7 to 9,
the face image includes: two-dimensional face images or three-dimensional face images.
11. A model training apparatus comprising:
a video separation module for separating a video sample to obtain an audio sample and a face sample image corresponding to the video sample;
a sound sample generation module for generating a sample sound feature matrix corresponding to the audio sample;
an image sample generation module for generating a human face key point deviation sequence according to a comparison result of a human face reference image and the human face sample image;
a training sample construction module for constructing a training sample according to the associated sample sound feature matrix and the human face key point deviation sequence;
and a model training module for training a human face key point deviation model to be trained by using the training sample, to obtain the trained human face key point deviation model.
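
(Illustrative sketch.) The video separation module of claim 11 could be realized with ffmpeg for the audio track and OpenCV for the face sample frames; both tool choices and all names are assumptions. The remaining modules then correspond to the earlier sketches: sound sample generation to audio_to_feature_matrix, image sample generation to build_deviation_sequence, training sample construction to build_training_samples, and model training to an ordinary supervised loop over KeypointDeviationLSTM.

```python
import subprocess
import cv2

def separate_video(video_path, audio_path="sample_audio.wav"):
    """Video separation module: split a video sample into an audio sample and face sample frames."""
    # extract the audio track with ffmpeg (assumes ffmpeg is installed and on PATH)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_path], check=True)
    # extract per-frame face sample images with OpenCV
    frames, capture = [], cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return audio_path, frames
```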
12. A face image generation apparatus comprising:
a sound feature generation module for receiving audio information and generating a target sound feature matrix corresponding to the audio information;
a deviation acquisition module for acquiring human face key point deviation information corresponding to the target sound feature matrix by using a human face key point deviation model;
and an image processing module for acquiring a face reference image corresponding to the audio information and generating a face image corresponding to the audio information based on the face reference image and the human face key point deviation information;
wherein the human face key point deviation model is obtained by training with the model training method of any one of claims 1 to 6.
13. A model training apparatus comprising:
a memory; and a processor coupled to the memory, the processor configured to perform the method of any of claims 1-6 based on instructions stored in the memory.
14. A face image generation apparatus comprising:
a memory; and a processor coupled to the memory, the processor configured to perform the method of any of claims 7-10 based on instructions stored in the memory.
15. A non-transitory computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 6 and/or the method of any one of claims 7 to 10.
CN202010604310.XA 2020-06-29 2020-06-29 Model training method, human face image generation device and storage medium Pending CN112308949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010604310.XA CN112308949A (en) 2020-06-29 2020-06-29 Model training method, human face image generation device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010604310.XA CN112308949A (en) 2020-06-29 2020-06-29 Model training method, human face image generation device and storage medium

Publications (1)

Publication Number Publication Date
CN112308949A (en) 2021-02-02

Family

ID=74483124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010604310.XA Pending CN112308949A (en) 2020-06-29 2020-06-29 Model training method, human face image generation device and storage medium

Country Status (1)

Country Link
CN (1) CN112308949A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110277099A (en) * 2019-06-13 2019-09-24 北京百度网讯科技有限公司 Voice-based nozzle type generation method and device
CN110941332A (en) * 2019-11-06 2020-03-31 北京百度网讯科技有限公司 Expression driving method and device, electronic equipment and storage medium
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052962A (en) * 2021-04-02 2021-06-29 北京百度网讯科技有限公司 Model training method, information output method, device, equipment and storage medium
CN113179449A (en) * 2021-04-22 2021-07-27 清华珠三角研究院 Method, system, device and storage medium for driving image by voice and motion
CN113269066A (en) * 2021-05-14 2021-08-17 网易(杭州)网络有限公司 Speaking video generation method and device and electronic equipment
CN114245230A (en) * 2021-11-29 2022-03-25 网易(杭州)网络有限公司 Video generation method and device, electronic equipment and storage medium
WO2023132790A3 (en) * 2022-01-04 2023-10-26 脸萌有限公司 Expression driving method and device, and expression driving model training method and device
CN115359156A (en) * 2022-07-31 2022-11-18 荣耀终端有限公司 Audio playing method, device, equipment and storage medium
CN115359156B (en) * 2022-07-31 2023-12-05 荣耀终端有限公司 Audio playing method, device, equipment and storage medium
CN115330912A (en) * 2022-10-12 2022-11-11 中国科学技术大学 Training method for generating face speaking video based on audio and image driving
CN115393945A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Voice-based image driving method and device, electronic equipment and storage medium
CN115601484A * 2022-11-07 2023-01-13 广州趣丸网络科技有限公司 Virtual character face driving method and device, terminal equipment and readable storage medium
CN116071811A (en) * 2023-04-06 2023-05-05 中国工商银行股份有限公司 Face information verification method and device
CN116071811B (en) * 2023-04-06 2023-06-20 中国工商银行股份有限公司 Face information verification method and device

Similar Documents

Publication Publication Date Title
CN112308949A (en) Model training method, human face image generation device and storage medium
CN111243626B (en) Method and system for generating speaking video
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
CN112053690B (en) Cross-mode multi-feature fusion audio/video voice recognition method and system
CN110866968A (en) Method for generating virtual character video based on neural network and related equipment
CN112767554B (en) Point cloud completion method, device, equipment and storage medium
CN113378806B (en) Audio-driven face animation generation method and system integrating emotion coding
JP2023545642A (en) Target object movement driving method, device, equipment and computer program
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
Tian et al. Audio2face: Generating speech/face animation from single audio with attention-based bidirectional lstm networks
CN113256821B (en) Three-dimensional virtual image lip shape generation method and device and electronic equipment
CN112581569B (en) Adaptive emotion expression speaker facial animation generation method and electronic device
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
Yu et al. A video, text, and speech-driven realistic 3-D virtual head for human–machine interface
CN113228163A (en) Real-time text and audio based face reproduction
CN111459450A (en) Interactive object driving method, device, equipment and storage medium
WO2022242381A1 (en) Image generation method and apparatus, device, and storage medium
CN115631267A (en) Method and device for generating animation
CN114639374A (en) Real-time voice-driven photo-level realistic human face portrait video generation method
CN113838174A (en) Audio-driven face animation generation method, device, equipment and medium
CN111462274A Human body image synthesis method and system based on SMPL model
CN116597857A (en) Method, system, device and storage medium for driving image by voice
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
CN116645456A (en) Voice-driven digital human face animation generation method and device
CN114943746A (en) Motion migration method utilizing depth information assistance and contour enhancement loss

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination