CN116597857A - Method, system, device and storage medium for driving image by voice - Google Patents
Method, system, device and storage medium for driving image by voice
- Publication number: CN116597857A
- Application number: CN202310334646.2A
- Authority
- CN
- China
- Prior art keywords
- expression
- lip
- voice
- vector
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/18—Details of the transformation process
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a method, a system, a device and a storage medium for driving an image by voice. The method comprises the following steps: acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model; predicting the audio feature vector through a lip expression prediction model and a determined expression emotion vector to obtain a lip expression offset sequence; and obtaining a three-dimensional face base model, and synthesizing the base model with the lip expression offset sequence to obtain a three-dimensional facial lip expression animation. Embodiments of the application can drive an image from input voice to generate a three-dimensional animation comprising both lip shape and expression, with high efficiency and good stability, and can be widely applied in the field of computer technology.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, a system, an apparatus, and a storage medium for driving an image by voice.
Background
With the growing diversity of 3D video content and the rapid development of application scenarios for digital virtual humans, there is demand for higher-quality, more efficient production of 3D digital virtual human content. Lip movements and facial expressions generated for a 3D digital virtual human help the audience understand dialogue more vividly. Fusing the two modalities of visual animation and auditory sound not only improves the user's comprehension of the content, but also provides a more precise experience in interactive scenarios and enhances the artistry and watchability of the 3D virtual digital human.
Current approaches to producing 3D character lip and expression animation fall into two types: first, a professional animator listens to the audio content and manually keyframes an animation whose character lip shapes match the sound; second, a motion capture device records the facial lip expressions of professional actors, the captured data is manually retouched, and the result is imported into a rendering engine to drive the character's facial lip and expression movement. Both schemes require significant labor and time costs, and differences between people and equipment affect the stability of the final content.
Disclosure of Invention
Accordingly, an object of the embodiments of the present application is to provide a method, a system, a device, and a storage medium for driving an image by voice, which can drive an image from input voice to generate a three-dimensional animation comprising lip shape and expression, with high efficiency and good stability.
In a first aspect, an embodiment of the present application provides a method for driving an image by voice, including the steps of:
acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
predicting the audio feature vector through a lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and obtaining a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
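The three claimed steps can be sketched end to end as follows. This is a minimal numpy sketch with placeholder models: the function names, vertex count (468), and feature framing are illustrative assumptions, not the patent's actual CNN+BiLSTM extractor or Transformer predictor.

```python
import numpy as np

# Hypothetical stand-ins for the three models described in the claims; the
# patent's actual models are a CNN+BiLSTM feature extractor and a
# Transformer-based lip expression predictor.
def extract_audio_features(audio: np.ndarray) -> np.ndarray:
    # Placeholder: frame the 1-D audio into 20-sample windows, one feature
    # vector per frame (a real extractor would output PPG vectors).
    frames = audio[: len(audio) // 20 * 20].reshape(-1, 20)
    return frames  # shape: (num_frames, 20)

def predict_lip_expression_offsets(features: np.ndarray,
                                   emotion: np.ndarray) -> np.ndarray:
    # Placeholder: map each audio frame to a per-vertex 3-D offset,
    # modulated by the emotion vector's mean.
    num_vertices = 468
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((features.shape[1], num_vertices * 3)) * 0.01
    return (features @ proj).reshape(-1, num_vertices, 3) * (1 + emotion.mean())

def synthesize_animation(base_face: np.ndarray,
                         offsets: np.ndarray) -> np.ndarray:
    # Each animation frame = neutral three-dimensional base model + offset.
    return base_face[None, :, :] + offsets

audio = np.random.default_rng(1).standard_normal(16000)   # 1 s at 16 kHz
emotion = np.zeros(16)                                    # neutral emotion vector
base_face = np.zeros((468, 3))                            # neutral 3-D face vertices

features = extract_audio_features(audio)
offsets = predict_lip_expression_offsets(features, emotion)
animation = synthesize_animation(base_face, offsets)
print(animation.shape)  # (800, 468, 3): one mesh per audio frame
```

The synthesis step is a pure per-frame addition: the animation is the base mesh displaced by each frame's predicted offset.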
Optionally, the voice feature extraction model includes a convolutional neural network and a bidirectional long short-term memory network, and extracting, by the voice feature extraction model, the audio feature vector corresponding to the audio data specifically includes:
inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice features;
and inputting the high-level voice features into the bidirectional long short-term memory network to obtain an audio feature vector.
Optionally, the training process of the speech feature extraction model includes:
acquiring voice sample data and corresponding real voice sample feature vectors;
inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and adjusting model parameters of the initial model according to the error between the predicted voice sample feature vector and the real voice sample feature vector, until the error output by the initial model meets the training requirement, thereby obtaining the voice feature extraction model.
Optionally, the lip expression prediction model includes a Transformer neural network model comprising an encoder network and a decoder network, and predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence specifically includes:
inputting the audio feature vector into the encoder network to obtain an audio information characterization vector sequence;
and inputting the audio information characterization vector sequence and the determined expression emotion vector into the decoder network to obtain a lip expression offset sequence.
Optionally, the training process of the lip expression prediction model includes:
acquiring video sample data of a speaker from a plurality of viewing angles, establishing a three-dimensional point cloud face sequence from the video data, and determining a real facial lip expression offset from the three-dimensional point cloud face sequence;
extracting voice sample data from the video sample data, and matching and labeling the three-dimensional point cloud face sequence with the voice sample data to form sample data pairs;
inputting the voice sample data in a sample data pair into the encoder network to obtain an audio sample information characterization vector;
inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair, and a randomly generated expression emotion vector into the decoder network to obtain a predicted facial lip expression offset;
and calculating a loss value between the real facial lip expression offset and the predicted facial lip expression offset according to a target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
Optionally, the calculation formula of the target loss function is as follows:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
wherein Loss represents the loss value, L_lip represents the loss value of the lip region, S_l represents the influence coefficient of the lip region, L_face represents the loss value of the facial expression region other than the lip region, S_f represents the influence coefficient of the facial expression region other than the lip region, L_reg represents the loss value of the expression regularization term, and S_r represents the influence coefficient of the expression regularization term.
Optionally, the expression emotion vector is obtained by:
determining an expression emotion vector obtained by learning in the lip expression prediction model training process as an expression emotion vector;
or, obtaining expression information, and determining an expression emotion vector according to the expression information.
In a second aspect, an embodiment of the present application provides a system for driving an image by voice, including:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
In a third aspect, an embodiment of the present application provides an apparatus for driving an image by voice, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
In a fourth aspect, embodiments of the present application provide a storage medium having stored therein a processor-executable program for performing the above-described method when executed by a processor.
The embodiments of the application have the following beneficial effects: the audio feature vector corresponding to the audio data is extracted by the voice feature extraction model, so that the lip expression prediction model can adapt to different languages; the lip expression prediction model and the determined expression emotion vector then predict from the audio feature vector a lip expression offset sequence, i.e. the amount of change of the lips and face; finally, the three-dimensional facial lip expression animation is obtained from the three-dimensional face base model and the lip expression offset sequence. A three-dimensional animation comprising both lips and expression is thus generated from the voice-driven image, with high efficiency and good stability.
Drawings
FIG. 1 is a flowchart illustrating a method for driving an image by voice according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of another method for driving an image by voice according to an embodiment of the present application;
FIG. 3 is a block diagram of a speech feature extraction model provided by an embodiment of the present application;
FIG. 4 is a block diagram of a lip expression prediction model according to an embodiment of the present application;
FIG. 5 is a block diagram of a system for voice-driven imaging according to an embodiment of the present application;
fig. 6 is a block diagram of a voice-driven image device according to an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and specific embodiments. The step numbers in the following embodiments are set for convenience of illustration only; the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Referring to fig. 1 and 2, an embodiment of the present application provides a method for driving an image by voice, including the following steps:
s100, acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model.
The audio data refers to voice data for driving an image, and the voice data may be in any of a plurality of languages such as Chinese or English. The audio feature vector is used to characterize the audio features of the voice data. The voice feature extraction model obtains the output audio feature vector from the input audio data.
It will be appreciated by those skilled in the art that the specific type of the audio feature vector is determined according to the actual application, and this embodiment is not particularly limited. For example, the audio feature vector may be a PPG (Phonetic PosteriorGrams, phoneme posterior probability) feature vector; PPG feature vectors capture richer audio feature information, which improves adaptability to different languages in the subsequent prediction of 3D facial lip expression from voice.
It should be noted that the specific structure of the speech feature extraction model is determined according to the actual application, and this embodiment is not particularly limited. Referring to fig. 3, in one specific implementation, the speech feature extraction model includes a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM); the input of the model is a speech signal and the output is the corresponding speech feature vector, where the speech signal is a one-dimensional vector obtained by sampling the audio data at a fixed time interval. Specifically, the speech signal is input into a 1D-CNN (one-dimensional convolutional neural network), and high-level speech features are extracted through three 1D-CNN layers and a pooling layer; the CNN output is then fed to the BiLSTM, which captures the temporal information of the audio signal and further extracts speech features; the last layer is a fully connected output layer, mapping the BiLSTM output to the PPG feature vector.
Optionally, the training process of the speech feature extraction model includes:
s101, acquiring voice sample data and corresponding real voice sample feature vectors;
s102, inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and S103, according to the error between the predicted voice sample feature vector and the real voice sample feature vector, adjusting the model parameters of the initial model until the error between the predicted voice sample feature vector and the real voice sample feature vector output by the initial model meets the training requirement, and obtaining the voice feature extraction model.
The voice sample data comprises sample data in multiple languages, and the real voice sample feature vector is the ground-truth feature vector of the voice sample data. The initial model refers to the speech feature extraction model whose parameters are still to be determined. Specifically, the voice sample data is first input into the initial model to obtain a predicted voice sample feature vector; the model parameters of the initial model are then adjusted according to the error between the predicted and real voice sample feature vectors. During this adjustment the error is gradually reduced, and when the error output by the initial model meets the training requirement, the initial model with the corresponding parameters is taken as the voice feature extraction model.
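The adjust-until-the-error-meets-the-requirement loop of steps S101–S103 can be illustrated with a toy model. This sketch substitutes a one-layer linear model trained by gradient descent for the patent's CNN+BiLSTM network; the data, learning rate, and tolerance are all illustrative.

```python
import numpy as np

# Illustrative only: a linear "initial model" whose parameters are adjusted
# until the error between predicted and real sample feature vectors meets a
# training requirement, mirroring steps S101-S103.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 8))      # voice sample data (toy features)
W_true = rng.standard_normal((8, 4))
Y = X @ W_true                         # "real" voice sample feature vectors

W = np.zeros((8, 4))                   # initial model parameters
lr, tolerance = 0.05, 1e-3
for step in range(2000):
    Y_pred = X @ W                     # predicted voice sample feature vectors
    err = Y_pred - Y
    loss = (err ** 2).mean()           # error between predicted and real
    if loss < tolerance:               # error meets the training requirement
        break
    W -= lr * (2 / len(X)) * X.T @ err # adjust model parameters
print(f"final training loss: {loss:.6f}")
```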
Optionally, the voice feature extraction model includes a convolutional neural network and a bidirectional long short-term memory network, and extracting, by the voice feature extraction model, the audio feature vector corresponding to the audio data specifically includes:
S110, inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice features;
S120, inputting the high-level voice features into the bidirectional long short-term memory network to obtain an audio feature vector.
Specifically, referring to fig. 3, the one-dimensional vector corresponding to the audio data is first input to the CNN (convolutional neural network) in the speech feature extraction model to extract high-level speech features; the high-level speech features are then input to the BiLSTM (bidirectional long short-term memory network) in the speech feature extraction model to extract the audio feature vector.
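A minimal numpy sketch of the first stage (S110): a 1-D convolution with ReLU and max-pooling over the raw audio vector, producing "high-level" features. The BiLSTM stage (S120) and the final fully connected projection to PPG vectors are omitted; the kernel and pool size are illustrative.

```python
import numpy as np

# Toy 1D-CNN stage: valid convolution + ReLU + max pooling, standing in for
# the three 1D-CNN layers and pooling layer described for the real model.
def conv1d(signal: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    k = len(kernel)
    windows = np.lib.stride_tricks.sliding_window_view(signal, k)
    return windows @ kernel                      # valid 1-D convolution

def max_pool(x: np.ndarray, size: int) -> np.ndarray:
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

audio = np.sin(np.linspace(0, 100, 1600))        # one-dimensional audio vector
kernel = np.array([0.25, 0.5, 0.25])             # toy smoothing filter
features = max_pool(np.maximum(conv1d(audio, kernel), 0.0), size=4)
print(features.shape)
```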
S200, predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence.
The lip expression prediction model predicts a lip expression offset sequence from the audio feature vector and the expression emotion vector. The lip expression offset characterizes how far the lips and expression deviate from a non-speaking, expressionless face point cloud. It should be noted that the lip expression prediction model is determined according to the actual application, and this embodiment is not particularly limited. In a specific embodiment, referring to fig. 4, the lip expression prediction model includes an encoder and a decoder; the encoder consists of a feed-forward layer, several stacked multi-head self-attention and feed-forward layers, and a linear projection layer, while the decoder consists of a feed-forward layer, a multi-head self-attention layer, and a linear projection layer. The input of the encoder is the audio feature vector, the output of the encoder feeds the decoder, and the output of the decoder is the three-dimensional facial lip expression offset value.
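The self-attention building block that the described encoder and decoder stack can be sketched at the shape level. Single-head scaled dot-product attention is shown for clarity; the sequence length and feature dimension are illustrative, not taken from the patent.

```python
import numpy as np

# Scaled dot-product self-attention over a sequence of audio frame features,
# the core operation inside the multi-head self-attention layers of fig. 4.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over the key axis (numerically stabilised)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq = rng.standard_normal((50, 64))   # 50 audio frames, 64-dim features
out = scaled_dot_product_attention(seq, seq, seq)   # self-attention: Q=K=V
print(out.shape)   # (50, 64)
```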
Optionally, the training process of the lip expression prediction model includes:
s201, obtaining video sample data of multiple visual angles of a speaker, establishing a three-dimensional point cloud face sequence according to the video data, and determining a real face lip expression offset according to the three-dimensional point cloud face sequence.
The video sample data comprises multi-view video of different people speaking, collected through a multi-view camera array; the voice data in the video is mixed multi-person, multi-language speech, the image data is multi-view face data of different people speaking, and the video resolution of each view is at least 1080p. Each frame of the collected multi-view face data is aligned and reconstructed into a 3D point cloud, yielding a three-dimensional point cloud face sequence. Meanwhile, for each speaker a non-speaking 3D face model with the mouth closed in a natural state is selected from the reconstructed data and stored as the natural-expression base model, and the offset between the three-dimensional point cloud face sequence and the natural-expression base model is taken as the real facial lip expression offset.
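The offset computation described above reduces to a per-vertex subtraction. A sketch with an assumed vertex count (the patent does not specify one):

```python
import numpy as np

# Ground-truth offset = reconstructed 3-D point-cloud face sequence minus the
# speaker's neutral (non-speaking, mouth-closed) base model. 468 vertices and
# 120 frames are illustrative assumptions.
rng = np.random.default_rng(7)
neutral_base = rng.standard_normal((468, 3))                 # neutral 3-D face
sequence = neutral_base[None] + rng.normal(0, 0.02, (120, 468, 3))
true_offsets = sequence - neutral_base[None]                 # real offsets
print(true_offsets.shape)   # (120, 468, 3)
```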
S202, extracting voice sample data of video sample data, and carrying out matching labeling on the three-dimensional point cloud face sequence and the voice sample data to form a sample data pair.
The speech of the video sample data is extracted as voice sample data, the speaker's 3D face point cloud sequence is matched and labeled with the corresponding speech, and the 3D face point cloud sequence corresponding to each utterance is annotated. Through this labeling, the data is divided into pairs of speech matched with the speaker's 3D face point cloud, which are finally segmented into sequence fragments and stored.
S203, inputting the voice sample data in the sample data pair into an encoder network to obtain an audio sample information characterization vector.
Referring to fig. 4, the audio sample feature vector corresponding to the voice sample data in a sample data pair is extracted and input to the encoder network to obtain the audio sample information characterization vector. It should be noted that the number of stacked multi-head self-attention and feed-forward layers in the encoder network is determined according to the actual application and is not particularly limited in this embodiment; for example, five such layers may be stacked.
S204, inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair and the randomly generated expression emotion vector into a decoder network to obtain the predicted face lip expression offset.
During training, the expression emotion vectors are N-dimensional vectors sampled randomly from a Gaussian distribution. Because the training data include pronunciation-expression data pairs of speakers in different emotions, feeding this emotion-bearing training data into the lip expression prediction model and back-propagating the computed loss lets the model automatically learn the emotion vector contained in each emotion; the expression emotion vectors learned from the different emotion data are finally combined into an expression emotion vector matrix and stored. Referring to fig. 4, the audio sample information characterization vector output by the encoder, the three-dimensional point cloud face sequence in the sample data pair, and the randomly generated expression emotion vector are input to the decoder network, whose output is the predicted facial lip expression offset.
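The initialisation and storage of emotion vectors can be sketched as follows; the dimensionality N and number of emotions are assumptions, since the patent leaves both open.

```python
import numpy as np

# Expression emotion vectors start as random Gaussian samples (step S204) and,
# after training, the learned per-emotion vectors are stacked into a matrix.
N_DIM, NUM_EMOTIONS = 16, 6            # illustrative: e.g. neutral, happy, ...
rng = np.random.default_rng(3)
emotion_matrix = rng.standard_normal((NUM_EMOTIONS, N_DIM))
sampled = emotion_matrix[1]            # select one emotion's vector at inference
print(emotion_matrix.shape, sampled.shape)
```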
S205, calculating a loss value between the real facial lip expression offset and the predicted facial lip expression offset according to the target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
The target loss function computes the error between the model's predicted value and the actual target value; its specific form is determined according to the actual application and is not particularly limited in this embodiment. The smaller the loss value calculated by the target loss function, the more accurate the parameters of the resulting Transformer neural network model.
Optionally, the calculation formula of the target loss function is as follows:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
wherein Loss represents the loss value, L_lip represents the loss value of the lip region, S_l represents the influence coefficient of the lip region, L_face represents the loss value of the facial expression region other than the lip region, S_f represents the influence coefficient of the facial expression region other than the lip region, L_reg represents the loss value of the expression regularization term, and S_r represents the influence coefficient of the expression regularization term.
Specifically, the influence coefficient S of the lip region l The value of (2) and the influence coefficient S of the facial expression region outside the lip region f The value of S is adjusted according to the weight in practical application l And S is equal to f After adjustment, the influence coefficient S of the regular expression term is adjusted simultaneously r The model pays attention to the expression change in a longer time in training, so that the severe change of the model expression prediction in a short time can be avoided, and the expression change can be more natural. In the training process, minimizing the loss value of the target loss function through iteration continuously while adjusting S l 、S f And S is r And the coefficients enable more accurate and natural 3D face lip expression animation to be generated.
Optionally, the lip expression prediction model includes a Transformer neural network model comprising an encoder network and a decoder network, and predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence specifically includes:
S210, inputting the audio feature vector into the encoder network to obtain an audio information characterization vector sequence;
S220, inputting the audio information characterization vector sequence and the determined expression emotion vector into the decoder network to obtain a lip expression offset sequence.
The encoder network mainly encodes the audio features and extracts context-related audio characterization information: its input is the audio feature vector and its output is a context-related audio information characterization vector. The decoder network decodes the context-related audio information characterization vector output by the encoder network together with the expression emotion vector: its inputs are the encoder output, the 3D point cloud face and the determined expression emotion vector, and its output is the lip expression offset sequence.
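The encoder/decoder data flow can be sketched as follows, with single linear maps standing in for the actual Transformer stacks; every dimension, weight and name here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(audio_features, w_enc):
    # Stand-in for the Transformer encoder stack: maps the audio
    # feature sequence to a context-related characterization sequence.
    return np.tanh(audio_features @ w_enc)

def decoder(audio_repr, emotion_vec, w_dec):
    # Stand-in for the Transformer decoder stack: conditions every
    # time step on the expression emotion vector and maps the result
    # to per-frame lip expression offsets.
    cond = np.concatenate(
        [audio_repr, np.tile(emotion_vec, (audio_repr.shape[0], 1))], axis=1)
    return cond @ w_dec

T, d_audio, d_model, n_emotion = 8, 64, 32, 16
n_out = 30  # 3 coordinates x 10 mesh vertices (tiny, for illustration)
audio = rng.standard_normal((T, d_audio))
w_enc = 0.1 * rng.standard_normal((d_audio, d_model))
w_dec = 0.1 * rng.standard_normal((d_model + n_emotion, n_out))
emotion_vec = rng.standard_normal(n_emotion)
offsets = decoder(encoder(audio, w_enc), emotion_vec, w_dec)
print(offsets.shape)  # (8, 30): one offset row per audio frame
```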
Optionally, the expression emotion vector is obtained by:
S221, determining the expression emotion vector learned during training of the lip expression prediction model as the expression emotion vector;
S222, or, acquiring expression information and determining the expression emotion vector according to the expression information.
Specifically, in the prediction process, the expression emotion vector input to the decoder network can be an expression emotion vector learned during training, or a new expression emotion vector formed by the linear superposition of several learned expression emotion vectors; this input controls the emotion in the output 3D face lip expression vertex animation.
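The linear superposition of learned emotion vectors amounts to a weighted sum of rows of the stored matrix; the matrix values and blend weights below are made up purely for illustration:

```python
import numpy as np

# Rows of the learned expression emotion vector matrix (illustrative values).
emotion_matrix = np.eye(3, 8)          # 3 learned emotions, 8-dim vectors
weights = np.array([0.7, 0.3, 0.0])    # blend 70% of emotion 0 with 30% of emotion 1

# Linear superposition of the learned emotion vectors yields a new
# emotion vector that is fed to the decoder to control the output emotion.
blended = weights @ emotion_matrix
```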
S300, acquiring a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
The three-dimensional face basic model characterizes a non-speaking, expressionless three-dimensional face model, and the lip expression offset represents the vertex offset of the lips and expression of the three-dimensional face. The three-dimensional face basic model and the lip expression offset sequence are superimposed to obtain the three-dimensional face lip expression animation.
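The superposition step amounts to adding each frame's predicted vertex offsets to the neutral base mesh; the vertex count and offset values below are illustrative assumptions:

```python
import numpy as np

def synthesize(base_vertices, offset_sequence):
    """Superimpose each frame's predicted vertex offsets on the neutral
    base face to obtain the animated 3D face vertex sequence."""
    return base_vertices[None, :, :] + offset_sequence

base = np.zeros((468, 3))               # neutral, expressionless base mesh
offsets = 0.01 * np.ones((10, 468, 3))  # 10 frames of predicted offsets
animation = synthesize(base, offsets)
print(animation.shape)  # (10, 468, 3): 10 animated frames of the mesh
```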
The embodiment of the application has the following beneficial effects: the audio feature vector corresponding to the audio data is extracted through the voice feature extraction model, enabling the lip expression prediction model to adapt to different languages; the audio feature vector is then predicted through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence, i.e., the variation of the lips and facial expression; finally, the three-dimensional face lip expression animation is obtained from the three-dimensional face basic model and the lip expression offset sequence. A three-dimensional animation comprising both lips and expression is thus generated from the voice-driven image, with high efficiency and good stability.
Referring to fig. 5, an embodiment of the present application provides a system for driving an image by voice, including:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
It can be seen that the content of the above method embodiment applies to this system embodiment; the functions specifically implemented by the system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those of the method embodiment.
Referring to fig. 6, an embodiment of the present application provides a device for driving an image by voice, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
It can be seen that the content of the above method embodiment applies to this device embodiment; the functions specifically implemented by the device embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those of the method embodiment.
Furthermore, the embodiment of the application also discloses a computer program product or computer program stored in a computer-readable storage medium. A processor of a computer device may read the computer program from the computer-readable storage medium and execute it, causing the computer device to perform the method described above. Similarly, the content of the above method embodiment applies to this storage medium embodiment; the specific functions and beneficial effects achieved are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.
Claims (10)
1. A method of voice-driving an image, comprising:
acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
predicting the audio feature vector through a lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and obtaining a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
2. The method according to claim 1, wherein the speech feature extraction model includes a convolutional neural network and a two-way long-short memory network, and the extracting the audio feature vector corresponding to the audio data by the speech feature extraction model specifically includes:
inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice characteristics;
and inputting the high-level voice characteristics into the two-way long and short memory network to obtain an audio characteristic vector.
3. The method of claim 2, wherein the training process of the speech feature extraction model comprises:
acquiring voice sample data and corresponding real voice sample feature vectors;
inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and adjusting model parameters of the initial model according to the error between the predicted voice sample feature vector and the real voice sample feature vector until the error between the predicted voice sample feature vector and the real voice sample feature vector output by the initial model meets training requirements, so as to obtain the voice feature extraction model.
4. The method according to claim 1, wherein the lip expression prediction model comprises a Transformer neural network model, the Transformer neural network model comprises an encoder network and a decoder network, and the audio feature vector is predicted by the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence, specifically comprising:
inputting the audio feature vector into an encoder network to obtain an audio information characterization vector sequence;
and inputting the audio information representation vector sequence and the determined expression emotion vector into a decoder network to obtain a lip expression offset sequence.
5. The method of claim 4, wherein the training process of the lip expression prediction model comprises:
acquiring video sample data of a plurality of visual angles of a speaker, establishing a three-dimensional point cloud face sequence according to the video data, and determining a real face lip expression offset according to the three-dimensional point cloud face sequence;
extracting voice sample data of video sample data, and matching and labeling the three-dimensional point cloud face sequence and the voice sample data to form a sample data pair;
inputting voice sample data in the sample data pair into an encoder network to obtain an audio sample information characterization vector;
inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair and the expression emotion vector generated randomly into a decoder network to obtain a predicted face lip expression offset;
and calculating a loss value between the real face lip expression offset and the predicted face lip expression offset according to the target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
6. The method of claim 5, wherein the objective loss function is calculated as:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
where Loss represents the total loss value; L_lip represents the loss value of the lip region and S_l its influence coefficient; L_face represents the loss value of the facial expression region outside the lip region and S_f its influence coefficient; L_reg represents the loss value of the expression regularization term and S_r its influence coefficient.
7. The method of claim 1, wherein the expression emotion vector is obtained by:
determining an expression emotion vector obtained by learning in the lip expression prediction model training process as an expression emotion vector;
or, obtaining expression information, and determining an expression emotion vector according to the expression information.
8. A system for voice-driven imaging, comprising:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
9. An apparatus for voice-driven imaging, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any of claims 1-7.
10. A storage medium having stored therein a processor executable program, which when executed by a processor is adapted to carry out the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310334646.2A CN116597857A (en) | 2023-03-30 | 2023-03-30 | Method, system, device and storage medium for driving image by voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597857A true CN116597857A (en) | 2023-08-15 |
Family
ID=87603317
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310334646.2A Pending CN116597857A (en) | 2023-03-30 | 2023-03-30 | Method, system, device and storage medium for driving image by voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597857A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117218224A (en) * | 2023-08-21 | 2023-12-12 | 华院计算技术(上海)股份有限公司 | Face emotion image generation method and device, readable storage medium and terminal |
CN117372553A (en) * | 2023-08-25 | 2024-01-09 | 华院计算技术(上海)股份有限公司 | Face image generation method and device, computer readable storage medium and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111243626B (en) | Method and system for generating speaking video | |
CN112184858B (en) | Virtual object animation generation method and device based on text, storage medium and terminal | |
CN103650002B | Text-based video generation | |
CN113194348B (en) | Virtual human lecture video generation method, system, device and storage medium | |
CN110751708B (en) | Method and system for driving face animation in real time through voice | |
CN112562722A (en) | Audio-driven digital human generation method and system based on semantics | |
CN112465935A (en) | Virtual image synthesis method and device, electronic equipment and storage medium | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
JP2002507033A (en) | Face synthesis device and face synthesis method | |
JP2014519082A5 (en) | ||
CN113592985B (en) | Method and device for outputting mixed deformation value, storage medium and electronic device | |
JP2003529861A (en) | A method for animating a synthetic model of a human face driven by acoustic signals | |
CN111459450A (en) | Interactive object driving method, device, equipment and storage medium | |
CN113421547B (en) | Voice processing method and related equipment | |
CN113838174B (en) | Audio-driven face animation generation method, device, equipment and medium | |
CN113228163A (en) | Real-time text and audio based face reproduction | |
CN113077537A (en) | Video generation method, storage medium and equipment | |
CN114332318A (en) | Virtual image generation method and related equipment thereof | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN114882862A (en) | Voice processing method and related equipment | |
CN116051692A (en) | Three-dimensional digital human face animation generation method based on voice driving | |
Rastgoo et al. | A survey on recent advances in Sign Language Production | |
CN113395569A (en) | Video generation method and device | |
Filntisis et al. | Video-realistic expressive audio-visual speech synthesis for the Greek language | |
CN116758189A (en) | Digital human image generation method, device and storage medium based on voice driving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||