CN116597859A - Speech driving speaker face video synthesis method containing head motion gesture - Google Patents

Speech driving speaker face video synthesis method containing head motion gesture

Info

Publication number
CN116597859A
CN116597859A (application number CN202310540049.5A)
Authority
CN
China
Prior art keywords
face
speaker
voice
network
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310540049.5A
Other languages
Chinese (zh)
Inventor
李永源
魏明强
祝阅兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202310540049.5A priority Critical patent/CN116597859A/en
Publication of CN116597859A publication Critical patent/CN116597859A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Abstract

The application discloses a voice-driven speaker face video synthesis method with head motion gestures, which comprises the following steps: acquiring the voice and image data required by the design; preprocessing the voice and image data; performing content decoupling on the extracted voice features and separating the content characterization with an AudioVC network to obtain speaker-related content information; and extracting key point information from the face image, aligning the obtained face key points, and removing speaker-related identity information. On the basis of neural-network encoding and decoding of data features, the application uses face key point information as an intermediate vector representation to decouple the content representation and the identity representation in the voice and the images respectively, designs multiple discriminators to constrain the reconstructed speaker face images in terms of mouth-shape consistency and inter-frame continuity, and synthesizes the images with a two-stage refined neural network structure, so that speaker face videos with high naturalness, synchronized mouth shapes and head motion gestures can be synthesized.

Description

Speech driving speaker face video synthesis method containing head motion gesture
Technical Field
The application relates to the technical field of image generation, in particular to a voice-driven speaker face video synthesis method with a head motion gesture.
Background
In daily life, hearing and vision are the most important modes of human communication, and the two signals are closely and inseparably connected, providing rich complementary feature information. Speaker face synthesis technology designs and trains a model on collected face data; at the application stage, news can be broadcast directly from input voice and text alone. The technology is widely used in the digital game industry, teaching, the metaverse, virtual anchors and other fields, saving a large amount of labor cost and enabling real-time broadcasting.
With the progress of technology, people's requirements on video definition have become higher, which means viewers are highly sensitive to subtle defects in synthesized face videos. In terms of audio-visual synchronization, humans can perceive a time difference of about 0.05 seconds, so the synchronization of mouth shapes in a synthesized face video is closely related to visual perception; how to generate videos in which the speaking face's mouth shape is synchronized with the speech is therefore a major difficulty.
In terms of pixel continuity, since face video synthesis is a process of stitching image frames, local pixel jitter appears clearly in face videos synthesized by some methods; the continuity between preceding and following frames is therefore crucial and determines whether the synthesized video exhibits pixel jitter. In addition, synthesizing natural and realistic face videos is an important sign of whether the technology is mature, and speaker face videos are required to carry different expressive styles. Most of the prior art focuses on improving mouth-shape synchronization while ignoring the uncanny valley effect in the synthesized videos; especially when modeling from a single picture, videos in which only the mouth opens and closes often look stiff and unsettling.
Disclosure of Invention
The application aims to provide a voice-driven speaker face video synthesis method, device, equipment and medium with head motion gestures, so as to solve the problems raised in the background art above.
In order to achieve the above purpose, the present application provides the following technical solution: a voice-driven speaker face video synthesis method with head motion gestures, comprising the following steps:
acquiring voice and image data required by design;
preprocessing voice and image data;
performing content decoupling on the extracted voice features, and separating the content characterization with an AudioVC network to obtain speaker-related content information;
extracting key point information from the face image, performing alignment operation on the obtained face key points, and removing relevant identity information of a speaker;
designing a network to obtain an implicit mapping function from audio to key points, and predicting the offset of the key points of the face through the network;
performing contour fitting on the predicted offset of the key point sequence to obtain the approximate contour of the face of the person;
reconstructing the face based on a generative adversarial network, and performing index analysis on the generated face.
Preferably, the acquiring the voice and image data required by the design includes:
obtaining a speaker face video containing head gesture movement from the existing public data set;
the video format is converted to a 25 fps frame rate, and the audio is separated from the video and converted to a 16 kHz sampling rate.
Preferably, the preprocessing of the voice and image data includes:
sequentially performing pre-emphasis, windowing, framing, Fourier transform, Mel filtering and related operations on the extracted voice signal to obtain its Mel cepstrum features;
and framing the processed video, aligning the video frames with the voice frames, obtaining the required face through the dlib face detection algorithm, extracting the face key points, and finally aligning the extracted face key points to a standardized face.
Preferably, the performing content decoupling on the extracted voice features and separating the content characterization with the AudioVC network to obtain speaker-related content information includes:
obtaining multi-scale neighborhood information of each vertex through a random-walk strategy, and modeling the samples at each scale with a Gaussian mixture model;
converting the speaker content features through the AudioVC network and separating the speaker content characterization:
wherein the two variable triples (U1, Z1, X1) and (U2, Z2, X2) are independent and identically distributed random samples, with (U1, Z1, X1) belonging to the source speaker and (U2, Z2, X2) to the target speaker; the voice converter produces a converted output X1-2 that retains the content of X1 but matches the voice characteristics of speaker U2.
Preferably, the extracting key point information from the face image, and performing alignment operation on the obtained face key point, and removing identity information related to the speaker, includes:
firstly, fixing the two outer eye corners in the first frame of each video to two fixed positions in the image coordinate system through a 6-degree-of-freedom affine transformation, and then transforming all key points in all video frames with the same transformation, so as to remove speaker-related identity information;
wherein, removing speaker related identity information specifically comprises:
the first step, calculating an average face shape by averaging all aligned landmark positions in the entire training set;
secondly, for each face key point sequence, calculating the affine transformation between the average shape and the first frame of the sequence;
thirdly, calculating the difference between the current frame and the first frame, and multiplying this difference by the scaling coefficient obtained in the second step;
and fourthly, adding the average shape to the result to obtain the identity-free face key point sequence.
Preferably, the designing the network to obtain an implicit mapping function from audio to key points, and predicting the offset of the key points of the face through the network includes:
splicing the content features separated from the voice signal with the identity-removed face key point sequence and inputting them into a long short-term memory network for further feature encoding, and iterating the processing several times to obtain a highly fitted implicit mapping function;
and predicting the offset of the key point sequence through an implicit mapping function from the audio to the key points.
Preferably, the reconstructing the face based on the generative adversarial network and performing index analysis on the generated face includes:
modeling the input face key points and the original image based on the generative adversarial network;
where x ~ p_data denotes the distribution of the real data and x ~ p_G denotes the distribution of the generated data;
the network structure adopts skip connections to fuse the fine-grained information of the image;
designing a temporal discriminator to discriminate the temporal consistency of the generated images;
judging the mouth-shape synchronization of the generated images with a pre-trained mouth-shape synchronization discriminator;
and analyzing the indices of the synthesized face, namely the structural similarity and the peak signal-to-noise ratio.
The application also provides a voice-driven speaker face synthesis device with the head movement gesture, which comprises:
the data acquisition and preprocessing module is used for acquiring voice and image data required by design;
the data acquisition and preprocessing module is also used for preprocessing voice and image data;
the data processing and generating module is used for performing content decoupling on the extracted voice characteristics and combining with the audioVC network to separate content characterization so as to obtain content information related to a speaker;
the data processing and generating module is also used for extracting key point information from the face image, performing alignment operation on the obtained face key points and removing relevant identity information of a speaker;
the data processing and generating module is also used for designing a network to obtain an implicit mapping function from the audio to the key points and predicting the offset of the key points of the face through the network;
the data processing and generating module is also used for performing contour fitting on the offset of the predicted key point sequence to obtain the rough contour of the face of the person;
the data processing and generating module is also used for reconstructing the face based on a generative adversarial network and performing index analysis on the generated face.
The application also provides an electronic device, which is a physical device, comprising:
the device comprises a processor and a memory, wherein the memory is in communication connection with the processor;
the memory is configured to store executable instructions that are executed by at least one of the processors, and the processor is configured to execute the executable instructions to implement the speech driven speaker face synthesis method including head motion gestures as described above.
The present application also provides a computer readable storage medium having stored therein a computer program which when executed by a processor implements a speech driven speaker face synthesis method including head motion gestures as described above.
Compared with the prior art, the application has the beneficial effects that:
according to the application, the content representation and the identity representation in voice and images are decoupled respectively by using the face key point information as an intermediate vector representation on the basis of the coding and decoding data characteristics of the neural network, and the speaker face images which are reconstructed by restricting the mouth shape consistency and the front and back image continuity respectively by designing a plurality of discriminators are synthesized by a two-stage refined neural network structure, so that the speaker face video which has high naturalness, synchronous mouth shape and head movement posture can be synthesized.
Drawings
FIG. 1 is a main flow chart of a method for synthesizing a voice-driven speaker face video with a head motion gesture according to an embodiment of the present application;
fig. 2 is a schematic diagram of the convolutional neural network that predicts key point offsets from audio in a speech-driven speaker face video synthesis method with head motion gestures according to an embodiment of the present application;
fig. 3 is a schematic outline fitting diagram of face key points in a speech driving speaker face video synthesis method with head motion gesture according to an embodiment of the present application;
fig. 4 is a schematic diagram of the generative adversarial network structure in a method for synthesizing a voice-driven speaker face video with a head motion gesture according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The main execution body of the method in this embodiment is a terminal, and the terminal may be a device such as a mobile phone, a tablet computer, a PDA, a notebook or a desktop, but of course, may be another device with a similar function, and this embodiment is not limited thereto.
Referring to fig. 1, the present application provides a method for synthesizing a voice-driven speaker face video with a head motion gesture, the method is applied to video image generation, and includes:
step 101, obtaining voice and image data required by design.
Specifically, the step 101 further includes:
and (3) finding the face video of the speaker from the open source data, or collecting the face video of the speaker from the existing public speech video, and obtaining a data conversion format through a ffmpeg library pair to obtain 25fps video format and audio data with a sampling rate of 16 k.
Step 102, preprocessing is performed on the voice and image data.
Specifically, the step 102 further includes:
sequentially performing pre-emphasis, windowing, framing, Fourier transform, Mel filtering and related operations on the extracted voice signal to obtain its Mel cepstrum features;
and framing the processed video, aligning the video frames with the voice frames, obtaining the required face through the dlib face detection algorithm, extracting the face key points, and finally aligning the extracted face key points to a standardized face.
Concretely, the Mel cepstrum features of the audio are extracted with the Librosa and python_speech_features libraries, the face is obtained with the Dlib face detection algorithm, and a strict alignment with the voice frames is performed.
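A minimal feature-extraction sketch, assuming the Librosa, dlib and OpenCV packages and the publicly available 68-landmark dlib model file (paths and parameter values are illustrative):

```python
import cv2
import dlib
import librosa
import numpy as np

def extract_mfcc(audio_path, sr=16000, n_mfcc=13):
    # librosa performs the windowing, framing, Fourier transform, Mel filtering
    # and DCT internally; pre-emphasis is applied beforehand here.
    y, _ = librosa.load(audio_path, sr=sr)
    y = librosa.effects.preemphasis(y)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(frame_bgr):
    # Detect the face in one video frame and return its 68 key points.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float32)  # (68, 2)
```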
Step 103, performing content decoupling on the extracted voice features, and separating the content characterization with the AudioVC network to obtain speaker-related content information.
Specifically, the step 103 includes:
Multi-scale neighborhood information of each vertex is obtained through a random-walk strategy, and the samples at each scale are modeled with a Gaussian mixture model;
the speaker content features are converted through the AudioVC network, and the speaker content characterization is separated:
wherein the two variable triples (U1, Z1, X1) and (U2, Z2, X2) are independent and identically distributed random samples, with (U1, Z1, X1) belonging to the source speaker and (U2, Z2, X2) to the target speaker; the voice converter produces a converted output X1-2 that retains the content of X1 but matches the voice characteristics of speaker U2.
The above expresses that, given the identity U2 = u2 of the target speaker and the content Z1 = z1 of the source speech X1, the converted speech should sound as if speaker U2 were uttering the content z1.
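The decoupling interface described above can be pictured, under assumed tensor shapes and module names (none of which are specified here), as a content encoder Ec(·), a target speaker embedding, and a decoder D(·); the sketch below only illustrates that interface, not the AudioVC network itself:

```python
# Interface sketch of the content/speaker decoupling described above
# (assumed shapes and module names; a simplified stand-in, not the actual network).
import torch
import torch.nn as nn

class ContentSpeakerDecoupler(nn.Module):
    def __init__(self, n_mfcc=13, content_dim=64, speaker_dim=32):
        super().__init__()
        self.content_encoder = nn.GRU(n_mfcc, content_dim, batch_first=True)        # Ec(.)
        self.decoder = nn.GRU(content_dim + speaker_dim, n_mfcc, batch_first=True)  # D(.)

    def forward(self, mfcc_source, speaker_embedding_target):
        # mfcc_source: (B, T, n_mfcc); speaker_embedding_target: (B, speaker_dim)
        content, _ = self.content_encoder(mfcc_source)                  # C1: speaker-independent content
        spk = speaker_embedding_target.unsqueeze(1).expand(-1, content.size(1), -1)
        converted, _ = self.decoder(torch.cat([content, spk], dim=-1))  # X1-2: content of X1, voice of U2
        return content, converted
```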
Step 104, extracting key point information from the face image, aligning the obtained face key points, and removing speaker-related identity information.
Specifically, to help the network capture audio-visual coordination rather than the shape variation of different faces, the key points of all training data are converted to the average face of all speakers in the training set. The two outer eye corners in the first frame of each video are fixed to two fixed positions in the image coordinate system by a 6-degree-of-freedom affine transformation, and all key points in all video frames are then transformed with the same transformation. After alignment, the faces of different speakers have similar sizes and general positions; however, their shapes and mouth positions still differ, and such identity-related variation may affect how the network learns the relationship between speech and lip movement. The identity information in the key points is therefore removed as follows:
step 1041, calculating an average face shape by averaging all aligned landmark positions over the entire training set;
step 1042, for each face key point sequence, calculating the affine transformation between the average shape and the first frame of the sequence;
step 1043, calculating the difference between the current frame and the first frame, and multiplying this difference by the scaling coefficient obtained in step 1042;
and step 1044, adding the average shape to the result to obtain the identity-free face key point sequence (a minimal numerical sketch of these steps is given below).
Step 105, designing a network to obtain an implicit mapping function from the audio to the key points, and predicting the offset of the key points of the face through the network.
Specifically, step 105 includes:
and extracting to obtain the audio characteristics, and obtaining the speaking content characterization through the existing audioVC network. The characterization information is then input into a long-short-term memory network (LSTM) in combination with the key point coordinates from which speaker identity information has been removed to learn the mapping function from audio to key points, the network being capable of receiving audio signals and face key points, extracting features by multiple convolutions, pooling and full-connection, and outputting face key point coordinates in the last layer to achieve audio to key point mapping.
Step 106, performing contour fitting on the predicted offsets of the key point sequence to obtain the approximate contour of the face.
Specifically, the predicted key points provide geometric structure information of the face, including the positions and shapes of the facial contour, eyes, nose, mouth and other parts, but key point information alone does not give the network enough prior information to restore a realistic face. The face key points are therefore shape-fitted with the matplotlib library, as shown in fig. 3, to obtain a simple overall contour of the face, which improves the accuracy and capability of the model's reconstruction.
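A contour-fitting sketch with matplotlib, assuming the common dlib 68-landmark ordering for the facial parts (the index groups below are an assumption, not a specification from this description):

```python
# Draw a simple face contour image from 68 predicted key points.
import matplotlib.pyplot as plt

FACE_PARTS = {
    "jaw": (range(0, 17), False),
    "right_brow": (range(17, 22), False),
    "left_brow": (range(22, 27), False),
    "nose": (range(27, 36), False),
    "right_eye": (range(36, 42), True),
    "left_eye": (range(42, 48), True),
    "outer_lip": (range(48, 60), True),
    "inner_lip": (range(60, 68), True),
}

def draw_face_contour(landmarks, out_path="contour.png", size=256):
    """landmarks: (68, 2) numpy array of predicted key points in image coordinates."""
    fig, ax = plt.subplots(figsize=(size / 100, size / 100), dpi=100)
    for indices, closed in FACE_PARTS.values():
        idx = list(indices)
        if closed:
            idx = idx + [idx[0]]          # close the eye and mouth loops
        ax.plot(landmarks[idx, 0], landmarks[idx, 1], color="black", linewidth=1)
    ax.set_xlim(0, size)
    ax.set_ylim(size, 0)                  # image coordinates: y grows downward
    ax.axis("off")
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```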
Step 107, reconstructing the face based on the generative adversarial network and performing index analysis on the generated face.
Specifically, the input face key points and the original image are modeled based on the generative adversarial network,
where x ~ p_data denotes the distribution of the real data and x ~ p_G denotes the distribution of the generated data. Through iterative optimization, the generator and the discriminator of the GAN reach a dynamic balance, so that the distribution of the generated data samples approaches that of the real data samples.
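The adversarial objective over these two distributions, written here in its standard textbook form (the specific formula of this description is not reproduced in this text), is:

```latex
\min_{G}\max_{D} V(D,G)=
\mathbb{E}_{x\sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
+\mathbb{E}_{x\sim p_{G}}\bigl[\log\bigl(1-D(x)\bigr)\bigr]
```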
The network structure adopts skip connections to fuse the fine-grained information of the image, ensuring that the network retains more detail.
A temporal discriminator is designed to discriminate the temporal consistency of the generated images, so that the images synthesized by the generator have better continuity between preceding and following frames.
The mouth-shape synchronization of the generated images is judged by a pre-trained mouth-shape synchronization discriminator, so that the mouth shapes of the images synthesized by the generator are consistent with the speech.
The synthesized face is then analyzed with respect to its indices, namely the structural similarity and the peak signal-to-noise ratio.
Following the idea of a conditional generative adversarial network, the contour-fitted image is concatenated with a single image randomly selected from the sequence, and new data are generated from this joint vector; the synthesized data preserve the identity characteristics of the face in the original picture.
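A sketch of this conditional concatenation, with assumed tensor shapes:

```python
# Build the conditional generator input from a contour image and a reference frame.
import torch

def build_generator_input(contour_img, reference_img):
    # contour_img: (B, 1, H, W) fitted facial contour; reference_img: (B, 3, H, W)
    # identity reference picked at random from the sequence. Concatenating them
    # channel-wise conditions the generator so synthesized frames keep the reference identity.
    return torch.cat([contour_img, reference_img], dim=1)   # (B, 4, H, W)
```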
In this embodiment, on the basis of neural-network encoding and decoding of data features, the application uses face key point information as an intermediate vector representation to decouple the content representation and the identity representation in the voice and the images respectively, constrains the reconstructed speaker face images in terms of mouth-shape consistency and inter-frame continuity with multiple designed discriminators, and synthesizes speaker face videos with high naturalness, synchronized mouth shapes and head motion gestures through a two-stage refined neural network structure.
For a better understanding of the above embodiment, please refer to fig. 2, fig. 3 and fig. 4: fig. 2 is a schematic diagram of the convolutional neural network that predicts key point offsets from audio, fig. 3 is a schematic diagram of the contour fitting of face key points, and fig. 4 is a schematic diagram of the generative adversarial network, all in the voice-driven speaker face video synthesis method with head motion gestures provided in an embodiment of the present application. In fig. 2, the AudioVC module consists of an autoencoder: X1 denotes the source voice information, which contains two kinds of information, speaker information (dark shading) and speaker-independent information (light shading), the latter being referred to as content information; C1 denotes the embedded information, S1 the speaker information of the target voice, and X1-1 the reconstructed information; Ec(·) and D(·) denote the encoder and the decoder, respectively.
in the design of the discriminant, the common fine control of the mouth-sync discriminant and the time-series discriminant allows the generation network to generate the required high-quality data. A synthetic image sequence and original audio are received by a pre-trained mouth shape synchronization discriminator, whether they are synchronous or not is judged, if so, a higher score is given, otherwise, a lower score is given, and the mouth shape synchronization discriminator is a process for encoding the characteristics of the audio and the image data into a high-dimensional vector space and reducing the distance between the mouth shape synchronization data by contrast loss. The time sequence discriminator adds one data dimension, namely the time dimension, and uses three-dimensional convolution operation in the synthesized image sequence to judge whether the generation of the whole sequence is consistent with the original sequence or not, so that the generated image sequence is constrained to have the continuity of the front pixel and the rear pixel.
Finally, whether the generated image resembles the original image is measured with the structural similarity (SSIM) and the peak signal-to-noise ratio (PSNR). The SSIM index defines structural information, from the perspective of image composition, as attributes that reflect the structure of objects in the scene independently of brightness and contrast, and models distortion as a combination of three factors: brightness, contrast and structure; SSIM values lie in the range [0, 1], and larger values indicate more similar images. The PSNR index measures image quality as the ratio between the maximum signal value and the background noise; the larger the value, the less the image distortion.
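These two metrics can be computed, for example, with scikit-image (assuming frames of equal size; the `channel_axis` argument requires a recent scikit-image version):

```python
# Compute SSIM and PSNR between a generated frame and its reference frame.
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_frame(generated, reference):
    # SSIM in [0, 1]: closer to 1 means more similar structure, brightness and contrast.
    ssim = structural_similarity(reference, generated, channel_axis=-1)
    # PSNR in dB: higher values indicate less distortion relative to the reference.
    psnr = peak_signal_noise_ratio(reference, generated)
    return ssim, psnr
```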
On the basis of the above embodiment, the application also provides a voice-driven speaker face video synthesis device with head motion gestures to support the voice-driven speaker face video synthesis method with head motion gestures of the above embodiment; the voice-driven speaker face video synthesis device with head motion gestures comprises:
the data acquisition and preprocessing module is used for acquiring voice and image data required by design;
the data acquisition and preprocessing module is also used for preprocessing voice and image data;
the data processing and generating module is used for performing content decoupling on the extracted voice characteristics and combining with the audioVC network to separate content characterization so as to obtain content information related to a speaker;
the data processing and generating module is also used for extracting key point information from the face image, performing alignment operation on the obtained face key points and removing relevant identity information of a speaker;
the data processing and generating module is also used for designing a network to obtain an implicit mapping function from the audio to the key points and predicting the offset of the key points of the face through the network;
the data processing and generating module is also used for performing contour fitting on the offset of the predicted key point sequence to obtain the rough contour of the face of the person;
the data processing and generating module is also used for reconstructing the face based on a generative adversarial network and performing index analysis on the generated face.
Furthermore, the voice-driven speaker face video synthesis device with the head motion gesture can operate the voice-driven speaker face video synthesis method with the head motion gesture, and specific implementation can be seen in method embodiments, which are not described herein.
On the basis of the embodiment, the application further provides electronic equipment, which comprises:
the device comprises a processor and a memory, wherein the processor is in communication connection with the memory;
in this embodiment, the memory may be implemented in any suitable manner, for example: the memory can be read-only memory, mechanical hard disk, solid state disk, USB flash disk or the like; the memory is used for storing executable instructions executed by at least one of the processors;
in this embodiment, the processor may be implemented in any suitable manner, e.g., the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable logic controllers, and embedded microcontrollers, etc.; the processor is configured to execute the executable instructions to implement the voice-driven speaker face video synthesis method including the head motion gesture as described above.
On the basis of the above embodiment, the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program, when executed by a processor, implements the speech driven speaker face video synthesis method including the head motion gesture as described above.
Those of ordinary skill in the art will appreciate that the modules and method steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and module described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or units may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or apparatuses, which may be in electrical, mechanical or other form.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program instructions, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In addition, it should be noted that the combination of the technical features described in the present application is not limited to the combination described in the claims or the combination described in the specific embodiments, and all the technical features described in the present application may be freely combined or combined in any manner unless contradiction occurs between them.
It should be noted that the above-mentioned embodiments are merely examples of the present application, and it is obvious that the present application is not limited to the above-mentioned embodiments, and many similar variations are possible. All modifications attainable or obvious from the present disclosure set forth herein should be deemed to be within the scope of the present disclosure.
The foregoing is merely illustrative of the preferred embodiments of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A voice-driven speaker face video synthesis method with head motion gesture is characterized by comprising the following steps:
acquiring voice and image data required by design;
preprocessing voice and image data;
performing content decoupling on the extracted voice features, and separating the content characterization with an AudioVC network to obtain speaker-related content information;
extracting key point information from the face image, performing alignment operation on the obtained face key points, and removing relevant identity information of a speaker;
designing a network to obtain an implicit mapping function from audio to key points, and predicting the offset of the key points of the face through the network;
performing contour fitting on the predicted offset of the key point sequence to obtain the approximate contour of the face of the person;
reconstructing the face based on a generative adversarial network, and performing index analysis on the generated face.
2. The method of claim 1, wherein the obtaining speech and image data required for a design comprises:
obtaining a speaker face video containing head gesture movement from the existing public data set;
the video format is converted to a 25 fps frame rate, and the audio is separated from the video and converted to a 16 kHz sampling rate.
3. The method of claim 1, wherein preprocessing the voice and image data comprises:
sequentially performing pre-emphasis, windowing, framing, Fourier transform, Mel filtering and related operations on the extracted voice signal to obtain its Mel cepstrum features;
and framing the processed video, aligning the video frames with the voice frames, obtaining the required face through the dlib face detection algorithm, extracting the face key points, and finally aligning the extracted face key points to a standardized face.
4. The method for synthesizing a voice-driven speaker face video with head motion pose according to claim 1, wherein said performing content decoupling on the extracted voice features and separating content characterization in combination with an AudioVC network to obtain content information related to the speaker comprises:
obtaining multi-scale neighborhood information of each vertex through a random-walk strategy, and modeling the samples at each scale with a Gaussian mixture model;
converting the speaker content features through the AudioVC network and separating the speaker content characterization:
wherein the two variable triples (U1, Z1, X1) and (U2, Z2, X2) are independent and identically distributed random samples, with (U1, Z1, X1) belonging to the source speaker and (U2, Z2, X2) to the target speaker; the voice converter produces a converted output X1-2 that retains the content of X1 but matches the voice characteristics of speaker U2.
5. The method for synthesizing a voice-driven speaker face video with head motion pose according to claim 1, wherein said extracting key point information from the face image and aligning the obtained face key points, removing speaker-related identity information, comprises:
firstly, fixing the two outer eye corners in the first frame of each video to two fixed positions in the image coordinate system through a 6-degree-of-freedom affine transformation, and then transforming all key points in all video frames with the same transformation, so as to remove speaker-related identity information;
wherein, removing speaker related identity information specifically comprises:
the first step, calculating an average face shape by averaging all aligned landmark positions in the entire training set;
secondly, for each face key point sequence, calculating the affine transformation between the average shape and the first frame of the sequence;
thirdly, calculating the difference between the current frame and the first frame, and multiplying this difference by the scaling coefficient obtained in the second step;
and fourthly, adding the average shape to the result to obtain the identity-free face key point sequence.
6. The method of claim 1, wherein the designing the network to obtain an implicit mapping function from audio to key points and predicting the offset of the key points of the face through the network comprises:
splicing the content features separated from the voice signal with the identity-removed face key point sequence and inputting them into a long short-term memory network for further feature encoding, and iterating the processing several times to obtain a highly fitted implicit mapping function;
and predicting the offset of the key point sequence through an implicit mapping function from the audio to the key points.
7. The method for synthesizing a voice-driven speaker face video with a head motion pose according to claim 1, wherein the reconstructing the face based on the generative adversarial network and performing index analysis on the generated face comprises:
modeling the input face key points and the original image based on the generative adversarial network;
where x ~ p_data denotes the distribution of the real data and x ~ p_G denotes the distribution of the generated data;
the network structure adopts skip connections to fuse the fine-grained information of the image;
designing a temporal discriminator to discriminate the temporal consistency of the generated images;
judging the mouth-shape synchronization of the generated images with a pre-trained mouth-shape synchronization discriminator;
and analyzing the indices of the synthesized face, namely the structural similarity and the peak signal-to-noise ratio.
8. A speech driven speaker face synthesis apparatus including a head motion pose, comprising:
the data acquisition and preprocessing module is used for acquiring voice and image data required by design;
the data acquisition and preprocessing module is also used for preprocessing voice and image data;
the data processing and generating module is used for performing content decoupling on the extracted voice characteristics and combining with the audioVC network to separate content characterization so as to obtain content information related to a speaker;
the data processing and generating module is also used for extracting key point information from the face image, performing alignment operation on the obtained face key points and removing relevant identity information of a speaker;
the data processing and generating module is also used for designing a network to obtain an implicit mapping function from the audio to the key points and predicting the offset of the key points of the face through the network;
the data processing and generating module is also used for performing contour fitting on the offset of the predicted key point sequence to obtain the rough contour of the face of the person;
the data processing and generating module is also used for reconstructing the face based on a generative adversarial network and performing index analysis on the generated face.
9. An electronic device, the electronic device comprising:
the device comprises a processor and a memory, wherein the memory is in communication connection with the processor;
the memory is configured to store executable instructions that are executed by at least one of the processors, the processor configured to execute the executable instructions to implement the speech driven speaker face synthesis method with head motion pose according to any of claims 1 to 7.
10. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, which when executed by a processor, implements the speech driven speaker face synthesis method with head motion pose according to any of claims 1 to 7.
CN202310540049.5A 2023-05-15 2023-05-15 Speech driving speaker face video synthesis method containing head motion gesture Pending CN116597859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310540049.5A CN116597859A (en) 2023-05-15 2023-05-15 Speech driving speaker face video synthesis method containing head motion gesture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310540049.5A CN116597859A (en) 2023-05-15 2023-05-15 Speech driving speaker face video synthesis method containing head motion gesture

Publications (1)

Publication Number Publication Date
CN116597859A (en) 2023-08-15

Family

ID=87600220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310540049.5A Pending CN116597859A (en) 2023-05-15 2023-05-15 Speech driving speaker face video synthesis method containing head motion gesture

Country Status (1)

Country Link
CN (1) CN116597859A (en)

Similar Documents

Publication Publication Date Title
Chen et al. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss
WO2022116977A1 (en) Action driving method and apparatus for target object, and device, storage medium, and computer program product
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
JP2009533786A (en) Self-realistic talking head creation system and method
Zhou et al. An image-based visual speech animation system
CN113077537A (en) Video generation method, storage medium and equipment
EP4344199A1 (en) Speech and image synchronization measurement method and apparatus, and model training method and apparatus
Hajarolasvadi et al. Generative adversarial networks in human emotion synthesis: A review
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
CN113395569B (en) Video generation method and device
Chen et al. Sound to visual: Hierarchical cross-modal talking face video generation
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
Sui et al. A 3D audio-visual corpus for speech recognition
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
CN116597859A (en) Speech driving speaker face video synthesis method containing head motion gesture
CN115223224A (en) Digital human speaking video generation method, system, terminal device and medium
Narwekar et al. PRAV: A Phonetically Rich Audio Visual Corpus.
CN114494930A (en) Training method and device for voice and image synchronism measurement model
Kakumanu et al. A comparison of acoustic coding models for speech-driven facial animation
CN114466179A (en) Method and device for measuring synchronism of voice and image
CN116402928B (en) Virtual talking digital person generating method
CN117153195B (en) Method and system for generating speaker face video based on adaptive region shielding
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Lehn-Schiøler et al. Mapping from speech to images using continuous state space models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination