CN112562722A - Audio-driven digital human generation method and system based on semantics - Google Patents
- Publication number
- CN112562722A (application number CN202011382282.8A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- face
- audio
- mouth
- face image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Abstract
The invention discloses a semantic-based, audio-driven digital human generation method and system. The generation method comprises the following steps: acquiring a target audio and a first face image sequence; extracting features of the target audio to obtain corresponding audio features; inputting the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence comprising a plurality of mouth semantic graphs; and acquiring face images to be rendered, equal in number to the mouth semantic graphs, based on the first face image sequence, masking the mouth region of each face image to be rendered, and performing face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence. The invention realizes the conversion between audio and facial semantics through the semantic conversion network, and uses the facial semantics to achieve accurate expression of the mouth shape.
Description
Technical Field
The invention relates to the field of machine learning, in particular to a semantic-based audio-driven digital human generation method and system.
Background
Audio-driven videos in which a digital human performs synchronized speaking motions are widely used in video-sharing scenarios such as news broadcasting, training and sharing, and advertising.
Publication CN1032188842 discloses a method for synchronously driving a three-dimensional face mouth shape and facial-pose animation with speech: MPEG-4-defined mouth-shape and facial-pose feature parameters corresponding to each initial and final (consonant and vowel unit) in a video frame are extracted; the difference Vel between each feature-point coordinate and the standard-frame coordinate is computed, together with the corresponding MPEG-4-defined scale reference P on the face; the facial motion parameters are then obtained from Vel and P.
the patent application adopts the constructed three-dimensional face as a digital person, and the face generated by modeling is greatly different from a real face, so that the method is not suitable for occasions requiring the consistency of the digital face and the real face, such as news broadcasting, training sharing and the like;
because the face movement and speaking are a very elaborate and complex process, the face movement can only be preliminarily represented by using the coordinates of the feature points, errors exist in the positioning of the face feature points, and the face movement and speaking have individual differences; the method associates each initial and final with the mouth shape face posture characteristic parameters, and the tone, language and speed of the sound are related to the face movement, so the method has large limitation.
Disclosure of Invention
To address the defects of the prior art, the invention provides a semantic-based, audio-driven digital person generation method and system that can express the face accurately and finely, and is suitable for occasions requiring the digital person to resemble a target person.
In order to solve the technical problem, the invention is solved by the following technical scheme:
a semantic-based audio-driven digital human generation method comprises the following steps:
acquiring a target audio and a target face image sequence, and after masking the mouth region of each target face image in the target face image sequence, acquiring a corresponding first face image sequence;
extracting the characteristics of the target audio to obtain corresponding audio characteristics;
inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and constructing a second face image sequence based on the first face image sequence, the second face image sequence containing face images to be rendered equal in number to the mouth semantic graphs, and generating a synthesized face sequence based on the mouth semantic graphs and the face images to be rendered, the synthesized face sequence containing synthesized faces in one-to-one correspondence with the mouth semantic graphs.
As an implementable embodiment, the semantic conversion network comprises a recurrent neural network and an upsampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
As an implementable embodiment:
respectively connecting the semantic graph of the mouth part with the corresponding face image to be rendered to obtain corresponding data to be synthesized;
and inputting the data to be synthesized into a preset neural rendering network, and synthesizing and rendering the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthesized face.
As an implementation manner, pre-training the semantic conversion network comprises the following steps:
acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a mouth semantic graph of the face, and taking the obtained mouth semantic graph as a sample semantic graph;
training the semantic conversion network based on the sample audio features and the sample semantic graph.
As an implementation manner, pre-training the neural rendering network comprises the following steps:
masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
and training the neural rendering network based on the sample data to be synthesized and the sample face image.
As an implementable embodiment:
the audio features are mel-frequency cepstral coefficients.
The invention also provides a semantic-based audio-driven digital human generation system, which comprises:
the data acquisition module is used for acquiring a target audio and a target face image sequence, and after masking processing is carried out on mouth regions of all target face images in the target face image sequence, a corresponding first face image sequence is obtained;
the characteristic extraction module is used for extracting the characteristics of the target audio to obtain corresponding audio characteristics;
the semantic conversion module is used for inputting the audio features to a pre-trained semantic conversion network, and the semantic conversion network performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and the synthetic rendering module is used for constructing a second face image sequence based on the first face image sequence, the second face image sequence comprises the face images to be rendered in the same quantity as the mouth semantic graphs, face synthesis is carried out based on the mouth semantic graphs and the face images to be rendered, a synthetic face sequence is generated, and synthetic faces corresponding to the mouth semantic graphs one by one are contained in the synthetic face sequence.
As an implementable embodiment:
the semantic conversion network comprises a cyclic neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
As one implementable way, the synthesis rendering module comprises:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the preceding claims.
Due to the adoption of the technical scheme, the invention has the remarkable technical effects that:
according to the invention, through a pre-trained semantic conversion network, the semantic is adopted to achieve the fine expression of the mouth shape, the semantic is essentially a binary image of the mouth shape of the digital human face, and compared with key points or parameters of the face, the expression of the face is more accurate and fine.
The invention performs synthesis rendering through the neural rendering network, which realizes audio-driven digital human generation more accurately and robustly, makes the synthesized face closer to the real face, and improves the viewing experience.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow diagram of a semantic-based audio-driven digital human generation method of the present invention;
FIG. 2 is a schematic diagram of a network architecture of a neural rendering network according to embodiment 1;
FIG. 3 is a schematic diagram of a neural rendering network generating a corresponding synthetic face based on a semantic graph of a mouth and a face image to be rendered in a case;
fig. 4 is a schematic diagram of module connection of a semantic-based audio-driven digital human generation system according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as being limited thereto.
Embodiment 1, a semantic-based audio-driven digital human generation method, as shown in fig. 1, includes the following steps:
s100, acquiring a target audio and a target face image sequence, and masking mouth regions of all target face images in the target face image sequence to obtain a corresponding first face image sequence;
after the mouth region of the target face image is subjected to mask processing, corresponding face images to be rendered are obtained, and the face images to be rendered corresponding to the target face images one to one form a first face image sequence.
S200, extracting the characteristics of the target audio to obtain corresponding audio characteristics;
s300, inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
s400, constructing a second face image sequence based on the first face image sequence, wherein the second face image sequence comprises the same number of face images to be rendered as the mouth semantic graphs, carrying out face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence, and the synthesized face sequence comprises synthesized faces corresponding to the mouth semantic graphs one by one.
When the number of mouth semantic graphs is less than or equal to the number of face images to be rendered in the first face image sequence, the corresponding number of face images to be rendered are extracted in order to form the second face image sequence.
When the number of mouth semantic graphs exceeds the number of face images to be rendered in the first face image sequence, the first face image sequence can be cycled to extend it to the required number of face images to be rendered, forming the second face image sequence.
For example, by playing the first face image sequence forward and then in reverse in a loop, an image sequence of unbounded length can be generated; the second face image sequence is obtained by extracting, in order, face images to be rendered matching the number of mouth semantic graphs.
The resulting synthesized face sequence is a picture sequence of the digital person speaking, generated from the target face, with speaking motions consistent with the target audio; a corresponding video can then be generated from the synthesized face sequence and the target audio.
In this embodiment, the conversion between audio and semantics is realized through the pre-trained semantic conversion network, and semantics are used to achieve fine expression of the mouth shape. The semantic graph is essentially a binary image of the digital human's mouth shape; compared with facial key points or parameters, it expresses the face more accurately and finely, making the method suitable for scenarios such as news broadcasting and teaching or training, where the digital human must bear the target person's face and speak realistically.
The specific way of obtaining the face image to be rendered in step S100 is as follows:
manually setting a mask on the target face image to shield its mouth region, and taking the resulting image as the face image to be rendered; a person skilled in the art can set the mask region according to the actual situation.
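A minimal numpy sketch of this masking step; the rectangle coordinates are hypothetical placeholders for the fixed region a practitioner would choose:

```python
import numpy as np

def mask_mouth(face, top, bottom, left, right):
    """Zero out a fixed rectangular mouth region of an H x W x C face
    image. The rectangle coordinates are hypothetical; the patent only
    says a fixed, manually chosen region is applied to both training
    frames and target frames."""
    masked = face.copy()
    masked[top:bottom, left:right, :] = 0
    return masked

face = np.ones((8, 8, 3), dtype=np.uint8) * 255
to_render = mask_mouth(face, top=5, bottom=8, left=2, right=6)
```

The same function (with the same coordinates) would be reused in step B1 when preparing sample images to be rendered, which is what keeps training and inference inputs consistent.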
The specific way of extracting the audio features in step S200 is as follows:
The audio features are mel-frequency cepstral coefficients (MFCCs). In this embodiment, the target audio is framed into 40-millisecond frames and the corresponding MFCCs are extracted; a person skilled in the art can choose the framing interval according to actual needs.
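The framing step can be sketched as below; this covers only the framing, while a full MFCC pipeline would additionally apply a window, a mel filterbank, a log, and a DCT (commonly done with a library such as `librosa.feature.mfcc`). Non-overlapping frames are an assumption here; the patent does not specify the hop length:

```python
import numpy as np

def frame_audio(signal, sample_rate, frame_ms=40):
    """Split a 1-D audio signal into non-overlapping frames of
    frame_ms milliseconds, dropping any trailing partial frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

audio = np.zeros(16000)             # 1 second of audio at 16 kHz
frames = frame_audio(audio, 16000)  # 25 frames of 640 samples each
```

Each 40 ms frame then yields one MFCC vector, so one second of 16 kHz audio produces 25 feature vectors to drive 25 mouth semantic graphs.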
Further, the pre-training mode of the semantic conversion network in step S300 is as follows:
a1, acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a semantic graph of the mouth of the face, and taking the obtained semantic graph of the mouth as a sample semantic graph;
in the embodiment, speaking videos of a target person are collected, and audio and video separation is carried out on each speaking video to obtain corresponding speaking audio and a plurality of video frames; framing the audio data according to 40 milliseconds and extracting corresponding Mel frequency cepstrum coefficients to obtain sample audio features; based on the existing face detection and face segmentation technology, the face in each video frame is detected, and a semantic graph of a mouth part corresponding to the face is extracted to be used as a sample semantic graph.
A2, training the semantic conversion network based on the sample audio features and the sample semantic graph, and iteratively training the semantic conversion network based on the following steps:
taking the sample audio features as an input of a semantic conversion network, and outputting a predicted mouth semantic graph, namely a predicted semantic graph, by the semantic conversion network;
performing loss calculation based on the corresponding sample semantic graph (real data) and predicted semantic graph (predicted data), backpropagating gradients based on the computed first loss value, and updating the parameters of the semantic conversion network, wherein the first loss value is the sum of the cross-entropy loss and the perceptual loss;
and finishing the training when the training times reach a preset iteration time threshold or the loss value is reduced to a preset loss threshold.
The model training step belongs to the conventional technical means in the field, so the model training step is not further detailed in this embodiment, and a person skilled in the art can also train to obtain a corresponding semantic conversion network.
Further, the semantic conversion network comprises a recurrent neural network and an upsampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
In this embodiment, the recurrent neural network comprises two GRU (gated recurrent unit) layers; this network extracts the temporal structure of the input audio, its output is averaged over the time dimension, and the result is fed into a Linear layer.
In this embodiment, Tanh is used as the activation function of the output layer of the upsampling convolutional neural network, which predicts the mouth semantic graph.
The network structure of the semantic conversion network is specifically shown in the following table:
TABLE 1
Here, kernel denotes the convolution kernel and stride the step size; the Linear layer is a fully connected layer, and the Reshape layer transforms the vector dimensions.
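The layer order described above (two GRU layers, time-average, Linear, Reshape, upsampling, Tanh) can be sketched shape-for-shape in numpy; all sizes below are assumptions standing in for Table 1, the weights are random and untrained, and nearest-neighbour upsampling stands in for the up-sampling convolutions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_layer(x, h_dim):
    """Minimal GRU over a (T, D) sequence with random (untrained)
    weights -- a shapes-only stand-in for the patent's GRU layers."""
    T, D = x.shape
    Wz, Wr, Wh = (rng.normal(0, 0.1, (D + h_dim, h_dim)) for _ in range(3))
    h, out = np.zeros(h_dim), np.zeros((T, h_dim))
    sigmoid = lambda a: 1 / (1 + np.exp(-a))
    for t in range(T):
        xh = np.concatenate([x[t], h])
        z, r = sigmoid(xh @ Wz), sigmoid(xh @ Wr)
        h_tilde = np.tanh(np.concatenate([x[t], r * h]) @ Wh)
        h = (1 - z) * h + z * h_tilde
        out[t] = h
    return out

def semantic_conversion(mfcc, h_dim=64, grid=8, up=4):
    """Hypothetical forward pass: two GRU layers -> mean over time ->
    Linear -> Reshape to a coarse map -> nearest-neighbour upsampling
    (standing in for the up-sampling convolutions) -> Tanh output."""
    h = gru_layer(gru_layer(mfcc, h_dim), h_dim)
    v = h.mean(axis=0)                              # expression vector
    W = rng.normal(0, 0.1, (h_dim, grid * grid))
    coarse = (v @ W).reshape(grid, grid)            # Linear + Reshape
    upsampled = np.kron(coarse, np.ones((up, up)))  # spatial upsampling
    return np.tanh(upsampled)                       # Tanh output layer

sem = semantic_conversion(rng.normal(size=(25, 13)))  # 25 MFCC frames
```

In practice this would be a trained network (e.g. in a deep-learning framework) producing one semantic map per audio frame; the sketch only demonstrates how the expression vector bridges the recurrent and convolutional halves.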
Further, in step S400, performing face synthesis based on the mouth semantic graph and the face image to be rendered, and generating a synthesized face sequence specifically includes:
s410, connecting the semantic graphs of the mouth parts with the corresponding face images to be rendered respectively to obtain corresponding data to be synthesized;
the connection refers to connecting two pieces of multidimensional data of the mouth semantic graph and the face image to be rendered on one channel (dimension), for example, connecting a 20-dimensional vector and a 30-dimensional vector to form a 50-dimensional vector.
And S420, inputting the data to be synthesized into a preset neural rendering network, and synthesizing and rendering the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthesized face.
In this embodiment, synthesis rendering is performed through the neural rendering network, which realizes audio-driven digital human generation more accurately and robustly, makes the synthesized face closer to the real face, and improves the viewing experience.
The pre-training mode of the neural rendering network is as follows:
b1, masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
the video frame is the video frame extracted in step a 1;
in this step, the mouth region is consistent with the mouth region in step S100, that is, after a person skilled in the art sets a fixed region for shielding the mouth according to actual needs, mask processing is performed on a target face image for synthesizing a digital person and a video frame serving as a training sample based on the fixed region.
B2, connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
the connection described in the above-described connection synchronization step S410 is not described in detail.
B3, training the neural rendering network based on the sample data to be synthesized and the sample face image.
Taking the sample data to be synthesized as the input of the neural rendering network, the neural rendering network outputs the predicted synthesized face, i.e., the predicted face image;
loss calculation is performed based on the corresponding sample face image (real data) and predicted face image (predicted data), gradients are backpropagated based on the computed second loss value, and the parameters of the neural rendering network are updated, wherein the second loss value is the sum of the L1 loss and the perceptual loss;
and finishing the training when the training times reach a preset iteration time threshold or the loss value is reduced to a preset loss threshold.
The model training steps are conventional in the art and are therefore not detailed further in this embodiment; a person skilled in the art can likewise train a corresponding neural rendering network.
Note that the cross-entropy loss, the L1 loss and the perceptual loss are all conventional loss functions in the art, and a person skilled in the art can calculate corresponding loss values according to actual situations without providing detailed formulas.
In this embodiment, the semantic conversion network and the neural rendering network compute the perceptual loss as follows:
the real data $Y$ and the predicted data $\hat{Y}$ are each fed into a pre-trained VGG network $V$, where $V_j$ denotes the activation of the $j$-th layer of the VGG network when processing the real or predicted data, with shape $(C_j, H_j, W_j)$. The squared L2 distance between the features of $Y$ and $\hat{Y}$, normalized by the activation shape and summed over the selected layers, is used as the perceptual loss:

$\mathcal{L}_{perc}(Y, \hat{Y}) = \sum_j \frac{1}{C_j H_j W_j} \left\lVert V_j(Y) - V_j(\hat{Y}) \right\rVert_2^2$
When training the semantic conversion network, the real data Y is the sample semantic graph and the prediction data Ŷ is the corresponding predicted semantic graph;
when training the neural rendering network, the real data Y is the sample face image and the prediction data Ŷ is the corresponding predicted face image.
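The perceptual loss above can be written out as a small sketch. A stand-in feature extractor replaces the pre-trained VGG network V here, so the toy layer and the function names are assumptions; in practice each V_j would be a VGG layer activation.

```python
import numpy as np

def perceptual_loss(y_real, y_pred, feature_layers):
    """Sum over layers j of ||V_j(Y) - V_j(Y_hat)||_2^2 / (C_j * H_j * W_j).

    feature_layers: list of callables, each mapping an image to a
    (C_j, H_j, W_j) activation tensor (stand-ins for VGG layers).
    """
    total = 0.0
    for vj in feature_layers:
        a, b = vj(y_real), vj(y_pred)
        c, h, w = a.shape
        total += np.sum((a - b) ** 2) / (c * h * w)  # normalised squared L2
    return total
```

With real VGG features, `feature_layers` would wrap the activations of selected convolutional blocks of the pre-trained network.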
In this embodiment, the network structure of the neural rendering network is specifically shown in the following table:
TABLE 2
In the above table, Mask M_t represents the face image with a mask, namely the face image to be rendered or the sample image to be rendered; Image Q represents the mouth semantic graph; Skip indicates a skip-layer connection structure. The network architecture diagram of the neural rendering network is shown in fig. 2, where the dotted lines indicate skip-layer connections.
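The channel-wise connection of the masked face M_t with the mouth semantic graph Q, and the skip-layer connection, can be illustrated schematically. The encoder/decoder stand-ins below are assumptions; the actual network uses the convolutional layers of Table 2.

```python
import numpy as np

def connect(face_masked, mouth_semantic):
    """Concatenate the masked face (H, W, 3) with the mouth semantic
    graph (H, W, C) along the channel axis, as input to the renderer."""
    return np.concatenate([face_masked, mouth_semantic], axis=-1)

def forward_with_skip(x, encode, decode):
    """Skip-layer connection: the decoder output is combined with the
    feature passed directly from the encoder side (dotted line in fig. 2)."""
    h = encode(x)
    return decode(h) + x  # the skip path re-injects the earlier features
```

This is only the wiring; in the real network `encode`/`decode` are stacks of (transposed) convolutions.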
Case (2):
and acquiring a speaking video of the target person, and training to obtain the semantic conversion network and the neural rendering network by using the audio data (MFCC) and image data of the speaking video according to the training steps above.
Referring to fig. 3, according to actual needs, a segment of speaking video is extracted from the pre-collected speaking videos, or a speaking video (a non-training video) designated by a user is selected; the video frames of this speaking video form the target face image sequence. The mouth region of the face in each video frame (i.e., each target face image) is masked to obtain a face image to be rendered, thereby generating the first face image sequence;
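Masking the mouth region can be sketched as zeroing a rectangle in each frame. The rectangle-from-detection interface is an assumption; any face or landmark detector that yields a mouth bounding box would serve.

```python
import numpy as np

def mask_mouth_region(frame, mouth_box):
    """Return a copy of the frame with the mouth bounding box
    (x0, y0, x1, y1) blacked out, giving a face image to be rendered."""
    x0, y0, x1, y1 = mouth_box
    out = frame.copy()
    out[y0:y1, x0:x1] = 0
    return out

def build_first_sequence(frames, mouth_boxes):
    """Mask every target face image to form the first face image sequence."""
    return [mask_mouth_region(f, b) for f, b in zip(frames, mouth_boxes)]
```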
acquiring a target audio, extracting a mel frequency cepstrum coefficient of the target audio, and acquiring corresponding audio characteristics; inputting the audio features into a semantic conversion network to obtain a plurality of corresponding mouth semantic graphs;
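Since one mouth semantic graph is produced per video frame, the MFCC features must be grouped per frame. A sketch of that alignment follows; the sample rate, hop length and frame rate are assumptions, as the patent does not fix them.

```python
def mfcc_frames_for_video_frame(i, fps=25.0, sr=16000, hop=160):
    """Indices of the MFCC frames (hop samples apart) that fall inside
    video frame i, so each frame gets its own slice of audio features."""
    mfcc_per_second = sr / hop                      # 100 MFCC frames/s here
    start = int(round(i * mfcc_per_second / fps))
    end = int(round((i + 1) * mfcc_per_second / fps))
    return list(range(start, end))
```

At 25 fps with a 10 ms hop, each video frame therefore corresponds to four consecutive MFCC frames.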
after the first face image sequence is extended forward and backward so that the length of the resulting image sequence matches the number of mouth semantic graphs, the face images to be rendered are extracted from the image sequence in order; each mouth semantic graph is connected with its corresponding face image to be rendered and then input into the neural rendering network to obtain the corresponding synthesized face;
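One common way to extend an image sequence forward and backward until it matches the number of mouth semantic graphs is ping-pong (back-and-forth) looping. Reading the forward/backward extension this way is an interpretation, sketched below.

```python
def extend_sequence(frames, target_len):
    """Extend frames to target_len by traversing them forward, then
    backward, then forward again (ping-pong), so playback stays smooth."""
    if not frames:
        raise ValueError("empty sequence")
    out, idx, step = [], 0, 1
    while len(out) < target_len:
        out.append(frames[idx])
        if idx + step < 0 or idx + step >= len(frames):
            step = -step          # bounce at either end of the sequence
        idx += step
    return out
```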
and generating a video based on the obtained synthesized face and the target audio, and synchronizing the mouth shape of the digital person corresponding to the target person with the target audio.
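Generating the final video then amounts to muxing the synthesized face frames with the target audio, for example via an ffmpeg command. The codec choices and the frame-file layout below are assumptions, not mandated by the patent.

```python
def ffmpeg_mux_command(frame_pattern, audio_path, out_path, fps=25):
    """Build an ffmpeg command that encodes the synthesized face frames
    and muxes in the target audio, keeping the mouth shapes in sync."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,  # e.g. face_%05d.png
        "-i", audio_path,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",                   # stop at shorter stream
        out_path,
    ]
```

The command list can be executed with `subprocess.run(ffmpeg_mux_command(...), check=True)`.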
Embodiment 2, a semantic-based audio-driven digital human generation system, as shown in fig. 4, includes:
the data acquisition module 100 is configured to acquire a target audio and a target face image sequence, and perform masking processing on a mouth region of each target face image in the target face image sequence to obtain a corresponding first face image sequence;
the feature extraction module 200 is configured to perform feature extraction on the target audio to obtain corresponding audio features;
a semantic conversion module 300, configured to input the audio features into a pre-trained semantic conversion network, where the audio features are subjected to semantic conversion by the semantic conversion network to obtain a corresponding semantic motion sequence, where the semantic motion sequence includes a plurality of mouth semantic graphs;
and a synthesis rendering module 400, configured to construct a second face image sequence based on the first face image sequence, where the second face image sequence includes the same number of face images to be rendered as the mouth semantic graphs, and perform face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence, where the synthesized face sequence includes synthesized faces corresponding to the mouth semantic graphs one to one.
The semantic conversion network comprises a cyclic neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
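The two-stage structure — a recurrent network mapping audio features to an expression vector, then an up-sampling stage expanding it into a mouth semantic graph — can be sketched with NumPy stand-ins. The dimensions and the nearest-neighbour up-sampling are assumptions; the real system uses trained recurrent and up-sampling convolution layers.

```python
import numpy as np

def upsample2x(grid):
    """Nearest-neighbour 2x up-sampling, standing in for an
    up-sampling convolution layer."""
    return grid.repeat(2, axis=0).repeat(2, axis=1)

def expression_to_semantic_map(expr_vec, base=4, stages=3):
    """Reshape an expression vector to a coarse grid, then up-sample
    repeatedly to produce a mouth semantic graph."""
    grid = np.asarray(expr_vec)[: base * base].reshape(base, base)
    for _ in range(stages):
        grid = upsample2x(grid)
    return grid
```

With `base=4` and three stages, a 16-dimensional expression vector expands to a 32x32 map, mirroring how the up-sampling CNN grows the spatial resolution.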
The composite rendering module 400 includes:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Embodiment 3 is a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of embodiment 1.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, it should be noted that the specific embodiments described in this specification may differ in the shapes of the components, the names of the components, and the like. All equivalent or simple changes made according to the structure, features and principles described in the inventive concept of this patent are included in the protection scope of this patent. Those skilled in the art may make various modifications, additions and substitutions to the specific embodiments described without departing from the scope of the invention as defined in the appended claims.
Claims (10)
1. A semantic-based audio-driven digital human generation method is characterized by comprising the following steps:
acquiring a target audio and a target face image sequence, and after masking the mouth region of each target face image in the target face image sequence, acquiring a corresponding first face image sequence;
extracting the characteristics of the target audio to obtain corresponding audio characteristics;
inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and constructing a second face image sequence based on the first face image sequence, wherein the second face image sequence contains the face images to be rendered in the same quantity as the mouth semantic images, and generating a synthesized face sequence based on the mouth semantic images and the face images to be rendered, and the synthesized face sequence contains synthesized faces corresponding to the mouth semantic images one to one.
2. The semantic-based audio driven digital human generation method of claim 1, wherein the semantic conversion network comprises a recurrent neural network and an upsampled convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
3. The semantic-based audio-driven digital human generation method of claim 1 or 2, wherein:
connecting the semantic graph of the mouth with the corresponding face image to be rendered to obtain corresponding data to be synthesized;
and inputting the data to be synthesized into a preset neural rendering network, and performing synthetic rendering on the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthetic face.
4. The semantic-based audio-driven digital human generation method of claim 3, wherein the pre-training of the semantic conversion network comprises:
acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a mouth semantic graph of the face, and taking the obtained mouth semantic graph as a sample semantic graph;
training the semantic conversion network based on the sample audio features and the sample semantic graph.
5. The semantic-based audio-driven digital human generation method of claim 4, wherein the pre-training of the neural rendering network comprises:
masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
and training the neural rendering network based on the sample data to be synthesized and the sample face image.
6. The semantic-based audio-driven digital human generation method of claim 5, wherein:
the audio features are mel-frequency cepstral coefficients.
7. A semantic-based audio-driven digital human generation system, comprising:
the data acquisition module is used for acquiring a target audio and a target face image sequence, and after masking processing is carried out on mouth regions of all target face images in the target face image sequence, a corresponding first face image sequence is obtained;
the characteristic extraction module is used for extracting the characteristics of the target audio to obtain corresponding audio characteristics;
the semantic conversion module is used for inputting the audio features to a pre-trained semantic conversion network, and the semantic conversion network performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and the synthetic rendering module is used for constructing a second face image sequence based on the first face image sequence, the second face image sequence contains the face images to be rendered in the same quantity as the mouth semantic graphs, and is also used for carrying out face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthetic face sequence, and the synthetic face sequence contains synthetic faces corresponding to the mouth semantic graphs one by one.
8. The semantic-based audio-driven digital human generation system of claim 7, wherein:
the semantic conversion network comprises a cyclic neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
9. The semantic-based audio-driven digital human generation system according to claim 7 or 8, wherein the synthesis rendering module comprises:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011382282.8A CN112562722A (en) | 2020-12-01 | 2020-12-01 | Audio-driven digital human generation method and system based on semantics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011382282.8A CN112562722A (en) | 2020-12-01 | 2020-12-01 | Audio-driven digital human generation method and system based on semantics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112562722A true CN112562722A (en) | 2021-03-26 |
Family
ID=75045817
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011382282.8A Pending CN112562722A (en) | 2020-12-01 | 2020-12-01 | Audio-driven digital human generation method and system based on semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562722A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105975952A (en) * | 2016-05-26 | 2016-09-28 | 天津艾思科尔科技有限公司 | Beard detection method and system in video image |
CN108229245A (en) * | 2016-12-14 | 2018-06-29 | 贵港市瑞成科技有限公司 | Method for detecting fatigue driving based on facial video features |
CN110866968A (en) * | 2019-10-18 | 2020-03-06 | 平安科技(深圳)有限公司 | Method for generating virtual character video based on neural network and related equipment |
CN111243626A (en) * | 2019-12-30 | 2020-06-05 | 清华大学 | Speaking video generation method and system |
CN111259875A (en) * | 2020-05-06 | 2020-06-09 | 中国人民解放军国防科技大学 | Lip reading method based on adaptive spatio-temporal graph convolutional network |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096242A (en) * | 2021-04-29 | 2021-07-09 | 平安科技(深圳)有限公司 | Virtual anchor generation method and device, electronic equipment and storage medium |
WO2022242381A1 (en) * | 2021-05-21 | 2022-11-24 | 上海商汤智能科技有限公司 | Image generation method and apparatus, device, and storage medium |
CN113299312A (en) * | 2021-05-21 | 2021-08-24 | 北京市商汤科技开发有限公司 | Image generation method, device, equipment and storage medium |
CN113269872A (en) * | 2021-06-01 | 2021-08-17 | 广东工业大学 | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization |
CN113380269A (en) * | 2021-06-08 | 2021-09-10 | 北京百度网讯科技有限公司 | Video image generation method, apparatus, device, medium, and computer program product |
CN113380269B (en) * | 2021-06-08 | 2023-01-10 | 北京百度网讯科技有限公司 | Video image generation method, apparatus, device, medium, and computer program product |
CN113674184A (en) * | 2021-07-19 | 2021-11-19 | 清华大学 | Virtual speaker limb gesture generation method, device, equipment and storage medium |
CN113628635B (en) * | 2021-07-19 | 2023-09-15 | 武汉理工大学 | Voice-driven speaker face video generation method based on teacher student network |
CN113628635A (en) * | 2021-07-19 | 2021-11-09 | 武汉理工大学 | Voice-driven speaking face video generation method based on teacher and student network |
CN113822969A (en) * | 2021-09-15 | 2021-12-21 | 宿迁硅基智能科技有限公司 | Method, device and server for training nerve radiation field model and face generation |
CN113851145A (en) * | 2021-09-23 | 2021-12-28 | 厦门大学 | Virtual human action sequence synthesis method combining voice and semantic key actions |
CN113851145B (en) * | 2021-09-23 | 2024-06-07 | 厦门大学 | Virtual human action sequence synthesis method combining voice and semantic key actions |
CN113747086A (en) * | 2021-09-30 | 2021-12-03 | 深圳追一科技有限公司 | Digital human video generation method and device, electronic equipment and storage medium |
CN113723385A (en) * | 2021-11-04 | 2021-11-30 | 新东方教育科技集团有限公司 | Video processing method and device and neural network training method and device |
WO2023077742A1 (en) * | 2021-11-04 | 2023-05-11 | 新东方教育科技集团有限公司 | Video processing method and apparatus, and neural network training method and apparatus |
WO2023088080A1 (en) * | 2021-11-22 | 2023-05-25 | 上海商汤智能科技有限公司 | Speaking video generation method and apparatus, and electronic device and storage medium |
CN113822968B (en) * | 2021-11-24 | 2022-03-04 | 北京影创信息科技有限公司 | Method, system and storage medium for driving virtual human in real time by voice |
CN113822968A (en) * | 2021-11-24 | 2021-12-21 | 北京影创信息科技有限公司 | Method, system and storage medium for driving virtual human in real time by voice |
CN114419702A (en) * | 2021-12-31 | 2022-04-29 | 南京硅基智能科技有限公司 | Digital human generation model, training method of model, and digital human generation method |
CN114419702B (en) * | 2021-12-31 | 2023-12-01 | 南京硅基智能科技有限公司 | Digital person generation model, training method of model, and digital person generation method |
CN115330913A (en) * | 2022-10-17 | 2022-11-11 | 广州趣丸网络科技有限公司 | Three-dimensional digital population form generation method and device, electronic equipment and storage medium |
CN115953521A (en) * | 2023-03-14 | 2023-04-11 | 世优(北京)科技有限公司 | Remote digital human rendering method, device and system |
CN116342835A (en) * | 2023-03-31 | 2023-06-27 | 华院计算技术(上海)股份有限公司 | Face three-dimensional surface grid generation method, device, computing equipment and storage medium |
CN117036555B (en) * | 2023-05-18 | 2024-06-21 | 无锡捷通数智科技有限公司 | Digital person generation method and device and digital person generation system |
CN117036555A (en) * | 2023-05-18 | 2023-11-10 | 无锡捷通数智科技有限公司 | Digital person generation method and device and digital person generation system |
CN116385604A (en) * | 2023-06-02 | 2023-07-04 | 摩尔线程智能科技(北京)有限责任公司 | Video generation and model training method, device, equipment and storage medium |
CN116385604B (en) * | 2023-06-02 | 2023-12-19 | 摩尔线程智能科技(北京)有限责任公司 | Video generation and model training method, device, equipment and storage medium |
CN116778040B (en) * | 2023-08-17 | 2024-04-09 | 北京百度网讯科技有限公司 | Face image generation method based on mouth shape, training method and device of model |
CN116778040A (en) * | 2023-08-17 | 2023-09-19 | 北京百度网讯科技有限公司 | Face image generation method based on mouth shape, training method and device of model |
CN117372553A (en) * | 2023-08-25 | 2024-01-09 | 华院计算技术(上海)股份有限公司 | Face image generation method and device, computer readable storage medium and terminal |
CN117372553B (en) * | 2023-08-25 | 2024-05-10 | 华院计算技术(上海)股份有限公司 | Face image generation method and device, computer readable storage medium and terminal |
CN116994600B (en) * | 2023-09-28 | 2023-12-12 | 中影年年(北京)文化传媒有限公司 | Method and system for driving character mouth shape based on audio frequency |
CN116994600A (en) * | 2023-09-28 | 2023-11-03 | 中影年年(北京)文化传媒有限公司 | Method and system for driving character mouth shape based on audio frequency |
CN117689783A (en) * | 2024-02-02 | 2024-03-12 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on super-parameter nerve radiation field |
CN117689783B (en) * | 2024-02-02 | 2024-04-30 | 湖南马栏山视频先进技术研究院有限公司 | Face voice driving method and device based on super-parameter nerve radiation field |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112562722A (en) | Audio-driven digital human generation method and system based on semantics | |
CN111325817B (en) | Virtual character scene video generation method, terminal equipment and medium | |
Cao et al. | Expressive speech-driven facial animation | |
CN103650002B (en) | Text based video generates | |
CN110853670B (en) | Music-driven dance generation method | |
CN112001992A (en) | Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning | |
JP2009533786A (en) | Self-realistic talking head creation system and method | |
KR20060090687A (en) | System and method for audio-visual content synthesis | |
US11847726B2 (en) | Method for outputting blend shape value, storage medium, and electronic device | |
JP2003529861A (en) | A method for animating a synthetic model of a human face driven by acoustic signals | |
CN110910479B (en) | Video processing method, device, electronic equipment and readable storage medium | |
CN113838173B (en) | Virtual human head motion synthesis method driven by combination of voice and background sound | |
CN113838174B (en) | Audio-driven face animation generation method, device, equipment and medium | |
CN116051692B (en) | Three-dimensional digital human face animation generation method based on voice driving | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
CN115578512A (en) | Method, device and equipment for training and using generation model of voice broadcast video | |
CN116828129B (en) | Ultra-clear 2D digital person generation method and system | |
CN117409121A (en) | Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving | |
CN116912375A (en) | Facial animation generation method and device, electronic equipment and storage medium | |
CN116417008A (en) | Cross-mode audio-video fusion voice separation method | |
CN115439614B (en) | Virtual image generation method and device, electronic equipment and storage medium | |
CN115223224A (en) | Digital human speaking video generation method, system, terminal device and medium | |
Deena et al. | Speech-driven facial animation using a shared Gaussian process latent variable model | |
Mahavidyalaya | Phoneme and viseme based approach for lip synchronization | |
CN114360491A (en) | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210326 |