CN116597857A - Method, system, device and storage medium for driving image by voice - Google Patents

Method, system, device and storage medium for driving image by voice

Info

Publication number
CN116597857A
Authority
CN
China
Prior art keywords
expression
lip
voice
vector
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310334646.2A
Other languages
Chinese (zh)
Inventor
李�权
杨锦
彭绪坪
叶俊杰
王伦基
成秋喜
付玟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Sailingli Technology Co ltd
Original Assignee
Guangzhou Sailingli Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Sailingli Technology Co ltd filed Critical Guangzhou Sailingli Technology Co ltd
Priority to CN202310334646.2A priority Critical patent/CN116597857A/en
Publication of CN116597857A publication Critical patent/CN116597857A/en
Pending legal-status Critical Current

Classifications

    • G10L21/10: Transforming speech into visible information
    • G06N3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V40/174: Facial expression recognition
    • G10L21/18: Details of the transformation process
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L2021/105: Synthesis of the lips movements from speech, e.g. for talking heads
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method, a system, a device and a storage medium for driving an image by voice. The method comprises the following steps: acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model; predicting the audio feature vector through a lip expression prediction model and a determined expression emotion vector to obtain a lip expression offset sequence; and obtaining a three-dimensional face basic model, and synthesizing the three-dimensional face basic model with the lip expression offset sequence to obtain a three-dimensional face lip expression animation. The embodiment of the application can drive an image with input voice to generate a three-dimensional animation comprising lip shapes and expressions, has high efficiency and good stability, and can be widely applied in the field of computer technology.

Description

Method, system, device and storage medium for driving image by voice
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, a system, an apparatus, and a storage medium for driving an image by voice.
Background
With the growing diversity of 3D video content and the rapid development of digital virtual human application scenarios, the production of content for 3D digital virtual humans faces demands for higher quality and higher efficiency. Lip movements and facial expressions produced quickly for a 3D digital virtual person help the audience understand dialogue content more vividly. This bimodal fusion of visual animation and auditory sound not only improves the user's comprehension of the content, but also provides a more accurate experience in interactive scenarios and improves the artistry and watchability of the 3D virtual digital person.
Current techniques for producing 3D character lip expression animation fall into the following types: first, a professional animator listens to the audio content and manually produces key-frame animation in which the sound matches the character's lip movements; second, the facial and lip expressions of professional actors are captured with motion capture equipment, the captured data is manually retouched, and the result is finally imported into a rendering engine to drive the character's facial and lip movements. Both schemes require significant manpower and time, and differences between people and equipment affect the stability of the final content.
Disclosure of Invention
Accordingly, an object of the embodiments of the present application is to provide a method, a system, a device, and a storage medium for driving an image by voice, which can drive an image with input voice to generate a three-dimensional animation comprising lip shapes and expressions, with high efficiency and good stability.
In a first aspect, an embodiment of the present application provides a method for driving an image by voice, including the steps of:
acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
predicting the audio feature vector through a lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and obtaining a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
Optionally, the voice feature extraction model includes a convolutional neural network and a bidirectional long short-term memory network, and extracting, by the voice feature extraction model, the audio feature vector corresponding to the audio data specifically includes:
inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice features;
and inputting the high-level voice features into the bidirectional long short-term memory network to obtain an audio feature vector.
Optionally, the training process of the speech feature extraction model includes:
acquiring voice sample data and corresponding real voice sample feature vectors;
inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and adjusting model parameters of the initial model according to the error between the predicted voice sample feature vector and the real voice sample feature vector until the error between the predicted voice sample feature vector and the real voice sample feature vector output by the initial model meets training requirements, so as to obtain the voice feature extraction model.
Optionally, the lip expression prediction model includes a Transformer neural network model, where the Transformer neural network model includes an encoder network and a decoder network, and predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence specifically includes:
inputting the audio feature vector into an encoder network to obtain an audio information characterization vector sequence;
and inputting the audio information representation vector sequence and the determined expression emotion vector into a decoder network to obtain a lip expression offset sequence.
Optionally, the training process of the lip expression prediction model includes:
acquiring multi-view video sample data of a speaker, establishing a three-dimensional point cloud face sequence according to the video sample data, and determining a real face lip expression offset according to the three-dimensional point cloud face sequence;
extracting voice sample data from the video sample data, and matching and labeling the three-dimensional point cloud face sequence and the voice sample data to form a sample data pair;
inputting voice sample data in the sample data pair into an encoder network to obtain an audio sample information characterization vector;
inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair and the expression emotion vector generated randomly into a decoder network to obtain a predicted face lip expression offset;
and calculating a loss value between the real face lip expression offset and the predicted face lip expression offset according to the target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
Optionally, the calculation formula of the target loss function is as follows:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
where Loss denotes the total loss value, L_lip the loss value of the lip region, S_l the influence coefficient of the lip region, L_face the loss value of the facial expression region outside the lips, S_f the influence coefficient of the facial expression region outside the lips, L_reg the loss value of the expression regularization term, and S_r the influence coefficient of the expression regularization term.
Optionally, the expression emotion vector is obtained by:
determining an expression emotion vector obtained by learning in the lip expression prediction model training process as an expression emotion vector;
or, obtaining expression information, and determining an expression emotion vector according to the expression information.
In a second aspect, an embodiment of the present application provides a system for driving an image by voice, including:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
In a third aspect, an embodiment of the present application provides an apparatus for driving an image by voice, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
In a fourth aspect, embodiments of the present application provide a storage medium having stored therein a processor-executable program for performing the above-described method when executed by a processor.
The embodiment of the application has the following beneficial effects: the audio feature vector corresponding to the audio data is extracted through the voice feature extraction model, so that the lip expression prediction model can adapt to different languages; the audio feature vector is then predicted through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence, that is, the variation of the lips and expression; finally, the three-dimensional face lip expression animation is obtained from the three-dimensional face basic model and the lip expression offset sequence. A three-dimensional animation comprising lips and expressions is thus generated by driving an image with voice, with high efficiency and good stability.
Drawings
FIG. 1 is a flowchart illustrating a method for driving an image by voice according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of another method for driving an image by voice according to an embodiment of the present application;
FIG. 3 is a block diagram of a speech feature extraction model provided by an embodiment of the present application;
FIG. 4 is a block diagram of a lip expression prediction model according to an embodiment of the present application;
FIG. 5 is a block diagram of a system for voice-driven imaging according to an embodiment of the present application;
fig. 6 is a block diagram of a voice-driven image device according to an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Referring to fig. 1 and 2, an embodiment of the present application provides a method for driving an image by voice, including the following steps:
s100, acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model.
The audio data refers to voice data used to drive an image, and the voice data may be in any of several languages, such as Chinese or English. The audio feature vector is used to characterize the audio features of the voice data. The voice feature extraction model outputs an audio feature vector from the input audio data.
It will be appreciated by those skilled in the art that the specific type of audio feature vector is determined according to the practical application, and this embodiment is not particularly limited. For example, the audio feature vector may be a PPG (phonetic posteriorgram, i.e., phoneme posterior probability) feature vector; PPG feature vectors capture richer audio feature information, which improves adaptability to different languages when 3D face lip expressions are later predicted from voice.
It should be noted that the specific structure of the speech feature extraction model is determined according to the practical application, and this embodiment is not particularly limited. Referring to fig. 3, in one specific implementation, the speech feature extraction model includes a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM); the input of the model is a speech signal and the output is the corresponding speech feature vector, where the speech signal is a one-dimensional vector obtained by sampling the audio data at a fixed time interval. Specifically, the speech signal is input into a 1D-CNN (one-dimensional convolutional neural network), and high-level speech features are extracted through three 1D-CNN layers and a pooling layer; the output of the CNN layers is then fed into the BiLSTM, which captures the temporal information of the audio signal and further extracts speech features; the last layer is a fully connected output layer that maps the BiLSTM output to the PPG feature vector.
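For illustration, a minimal PyTorch sketch of such an extractor is given below (three 1D-CNN layers with a pooling layer, a BiLSTM, and a fully connected output layer mapping to PPG features). The channel widths, kernel sizes, pooling factor and PPG dimension are assumptions of the sketch and are not fixed by this embodiment.

```python
import torch
import torch.nn as nn

class PPGExtractor(nn.Module):
    """Sketch of the speech feature extraction model: 3 x 1D-CNN + pooling,
    BiLSTM, and a fully connected output layer producing PPG feature vectors.
    All layer sizes are illustrative assumptions."""
    def __init__(self, hidden=256, ppg_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(kernel_size=4),              # pooling layer
        )
        self.bilstm = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, ppg_dim)      # output layer -> PPG vector

    def forward(self, wav):                           # wav: (batch, samples)
        x = self.cnn(wav.unsqueeze(1))                # (batch, 256, frames)
        x, _ = self.bilstm(x.transpose(1, 2))         # (batch, frames, 2*hidden)
        return self.fc(x)                             # (batch, frames, ppg_dim)

# Example: one second of 16 kHz audio -> a sequence of PPG feature vectors.
feats = PPGExtractor()(torch.randn(1, 16000))
```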
Optionally, the training process of the speech feature extraction model includes:
S101, acquiring voice sample data and corresponding real voice sample feature vectors;
S102, inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and S103, according to the error between the predicted voice sample feature vector and the real voice sample feature vector, adjusting the model parameters of the initial model until the error between the predicted voice sample feature vector and the real voice sample feature vector output by the initial model meets the training requirement, and obtaining the voice feature extraction model.
The voice sample data includes sample data in multiple languages, and the real voice sample feature vector is the reference feature vector of the voice sample data. The initial model is the speech feature extraction model whose parameters are still to be determined. Specifically, the voice sample data is first input into the initial model to obtain a predicted voice sample feature vector; the model parameters of the initial model are then adjusted according to the error between the predicted and real voice sample feature vectors, so that this error decreases during the adjustment process; when the error between the predicted and real voice sample feature vectors output by the initial model meets the training requirement, the initial model with the corresponding parameters is taken as the voice feature extraction model.
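A minimal sketch of this adjust-until-the-error-meets-the-requirement procedure is shown below; the mean squared error measure, the Adam optimizer, the `loader` data source and the `tol` threshold are assumptions of the sketch rather than requirements of this embodiment.

```python
import torch
import torch.nn as nn

def train_extractor(model, loader, epochs=10, lr=1e-4, tol=1e-3):
    """Hypothetical training loop: `loader` yields (waveform, target_ppg) pairs
    of voice sample data and real voice sample feature vectors."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for wav, target in loader:
            pred = model(wav)                 # predicted voice sample feature vector
            loss = loss_fn(pred, target)      # error vs. the real feature vector
            optim.zero_grad()
            loss.backward()
            optim.step()
            total += loss.item()
        if total / max(len(loader), 1) < tol: # error meets the training requirement
            break
    return model
```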
Optionally, the voice feature extraction model includes a convolutional neural network and a bidirectional long short-term memory network, and extracting, by the voice feature extraction model, the audio feature vector corresponding to the audio data specifically includes:
S110, inputting a one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice features;
S120, inputting the high-level voice features into the bidirectional long short-term memory network to obtain an audio feature vector.
Specifically, referring to fig. 3, the one-dimensional vector corresponding to the audio data is first input into the CNN (convolutional neural network) of the speech feature extraction model to extract high-level speech features; these high-level speech features are then input into the BiLSTM (bidirectional long short-term memory network) of the speech feature extraction model to extract the audio feature vector.
S200, predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence.
The lip expression prediction model predicts a lip expression offset sequence from the audio feature vector and the expression emotion vector. The lip expression offset characterizes the deviation of the lips and expression from a non-speaking, expressionless face point cloud. It should be noted that the lip expression prediction model is determined according to the practical application, and this embodiment is not particularly limited. In a specific embodiment, referring to fig. 4, the lip expression prediction model includes an encoder and a decoder; the encoder includes a feed-forward layer, a stack of multi-head self-attention and feed-forward layers, and a linear projection layer, while the decoder includes a feed-forward layer, a multi-head self-attention mechanism and a linear projection layer. The input of the encoder is the audio feature vector, the output of the encoder is the input of the decoder, and the output of the decoder is the three-dimensional face lip expression offset.
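The sketch below shows one possible arrangement of such an encoder-decoder predictor in PyTorch. The feature dimensions, the number of vertices, the use of nn.TransformerEncoder layers, and the way the emotion vector is fused in the decoder are assumptions; the 3D point cloud base face, which the text also feeds to the decoder, is omitted here for brevity.

```python
import torch
import torch.nn as nn

class LipExpressionPredictor(nn.Module):
    """Sketch of the Transformer-style encoder-decoder that maps an audio
    feature sequence plus an expression emotion vector to per-vertex
    lip/expression offsets. Dimensions are illustrative assumptions."""
    def __init__(self, ppg_dim=128, d_model=256, emotion_dim=16,
                 n_vertices=5023, n_layers=5, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(ppg_dim, d_model)                 # feed-forward layer
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)  # stacked attention blocks
        self.enc_out = nn.Linear(d_model, d_model)                 # linear projection
        self.dec_in = nn.Linear(d_model + emotion_dim, d_model)    # fuse audio + emotion
        self.dec_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dec_out = nn.Linear(d_model, n_vertices * 3)          # per-vertex xyz offsets

    def forward(self, ppg, emotion):          # ppg: (B, T, ppg_dim), emotion: (B, emotion_dim)
        h = self.enc_out(self.encoder(self.in_proj(ppg)))    # audio characterization sequence
        e = emotion.unsqueeze(1).expand(-1, h.size(1), -1)   # repeat emotion for every frame
        d = self.dec_in(torch.cat([h, e], dim=-1))
        d, _ = self.dec_attn(d, d, d)                        # multi-head self-attention
        return self.dec_out(d)                               # (B, T, n_vertices * 3)
```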
Optionally, the training process of the lip expression prediction model includes:
s201, obtaining video sample data of multiple visual angles of a speaker, establishing a three-dimensional point cloud face sequence according to the video data, and determining a real face lip expression offset according to the three-dimensional point cloud face sequence.
The video sample data includes multi-view video of different persons speaking, collected by a multi-view camera array; the voice data in the video is mixed multi-person, multi-language speech, the image data in the video is multi-view face data of different persons speaking, and the video resolution of each view reaches 1080p. Each frame of multi-view face data in the collected video is reconstructed into an aligned 3D point cloud to obtain a three-dimensional point cloud face sequence; meanwhile, for each speaker, a 3D face model with the mouth closed in a natural, non-speaking state is selected from the reconstructed data and stored as the natural-expression basic model, and the offset between the three-dimensional point cloud face sequence and the natural-expression basic model is taken as the real face lip expression offset.
S202, extracting voice sample data from the video sample data, and matching and labeling the three-dimensional point cloud face sequence and the voice sample data to form a sample data pair.
The voice in the video sample data is extracted as voice sample data, and each speaker's 3D face point cloud sequence is matched and labeled with the corresponding voice, so that the data is divided by the labels into pairs in which the voice matches the speaker's 3D face point cloud; finally, the pairs are organized into sequence segments and stored.
S203, inputting the voice sample data in the sample data pair into an encoder network to obtain an audio sample information characterization vector.
Referring to fig. 4, the audio sample feature vector corresponding to the voice sample data in the sample data pair is extracted and input into the encoder network to obtain an audio sample information characterization vector. It should be noted that the number of stacked multi-head self-attention and feed-forward layers in the encoder network is determined according to the practical application, and this embodiment is not particularly limited; for example, the number of stacked multi-head self-attention and feed-forward layers is 5.
S204, inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair and the randomly generated expression emotion vector into a decoder network to obtain the predicted face lip expression offset.
During training, the expression emotion vectors are N-dimensional vectors sampled from a random Gaussian distribution. Since the training data contain pronunciation-expression data pairs of speakers with different emotions, feeding this emotion-labeled training data into the lip expression prediction model and backpropagating the computed loss allows the emotion vectors contained in the different emotions to be learned automatically; the expression emotion vectors learned for the different emotion data are finally combined into an expression emotion vector matrix and stored. Referring to fig. 4, the audio sample information characterization vector output by the encoder, the three-dimensional point cloud face sequence in the sample data pair, and a randomly generated expression emotion vector are input into the decoder network, whose output is the predicted face lip expression offset.
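The learned expression emotion vectors can be sketched as a small trainable codebook that is initialised by Gaussian sampling and updated by backpropagation together with the rest of the model; the number of emotion classes and the vector dimension below are assumptions.

```python
import torch
import torch.nn as nn

class EmotionCodebook(nn.Module):
    """Sketch of the expression emotion vector matrix: one N-dimensional vector
    per emotion, randomly Gaussian-initialised and learned during training."""
    def __init__(self, n_emotions=8, dim=16):
        super().__init__()
        self.vectors = nn.Parameter(torch.randn(n_emotions, dim))  # Gaussian init

    def forward(self, emotion_id):            # emotion_id: (batch,) tensor of class indices
        return self.vectors[emotion_id]       # (batch, dim) expression emotion vectors
```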
S205, calculating a loss value between the real face lip expression offset and the predicted face lip expression offset according to the target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
The target loss function computes the error between the model's prediction and the actual target value; its specific form is determined according to the practical application, and this embodiment is not particularly limited. The smaller the loss value computed from the target loss function, the more accurate the parameters of the resulting Transformer neural network model.
Optionally, the calculation formula of the target loss function is as follows:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
where Loss denotes the total loss value, L_lip the loss value of the lip region, S_l the influence coefficient of the lip region, L_face the loss value of the facial expression region outside the lips, S_f the influence coefficient of the facial expression region outside the lips, L_reg the loss value of the expression regularization term, and S_r the influence coefficient of the expression regularization term.
Specifically, the values of the lip-region influence coefficient S_l and of the influence coefficient S_f for the facial expression region outside the lips are adjusted according to their weights in the practical application; after S_l and S_f are adjusted, the influence coefficient S_r of the expression regularization term is adjusted at the same time, so that during training the model attends to expression changes over a longer time span, which prevents the predicted expression from changing abruptly within a short time and makes the expression change more natural. In the training process, the loss value of the target loss function is iteratively minimized while the coefficients S_l, S_f and S_r are adjusted, so that a more accurate and natural 3D face lip expression animation can be generated.
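A sketch of the target loss function under these assumptions is given below; the use of mean squared error for the lip and face terms, the frame-to-frame difference as the expression regularization term, and the coefficient values are illustrative choices, as the embodiment only fixes the weighted three-term form.

```python
import torch

def lip_expression_loss(pred, target, lip_idx, face_idx,
                        s_l=2.0, s_f=1.0, s_r=0.1):
    """Sketch of Loss = S_l x L_lip + S_f x L_face + S_r x L_reg.
    pred, target: (B, T, V, 3) vertex offsets; lip_idx, face_idx: vertex index tensors."""
    l_lip = torch.mean((pred[:, :, lip_idx] - target[:, :, lip_idx]) ** 2)
    l_face = torch.mean((pred[:, :, face_idx] - target[:, :, face_idx]) ** 2)
    # expression regularization term: penalise abrupt frame-to-frame changes
    l_reg = torch.mean((pred[:, 1:] - pred[:, :-1]) ** 2)
    return s_l * l_lip + s_f * l_face + s_r * l_reg
```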
Optionally, the lip expression prediction model includes a Transformer neural network model, where the Transformer neural network model includes an encoder network and a decoder network, and predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence specifically includes:
S210, inputting the audio feature vector into the encoder network to obtain an audio information characterization vector sequence;
S220, inputting the audio information characterization vector sequence and the determined expression emotion vector into the decoder network to obtain a lip expression offset sequence.
The encoder network mainly encodes the audio features and extracts context-related audio characterization information: its input is the audio feature vector and its output is a context-related audio information characterization vector. The decoder network decodes the context-related audio information characterization vector output by the encoder network together with the expression emotion vector: its input is the audio information characterization vector output by the encoder network, the 3D point cloud face and the determined expression emotion vector, and its output is the lip expression offset sequence.
Optionally, the expression emotion vector is obtained by:
S221, determining an expression emotion vector learned during training of the lip expression prediction model as the expression emotion vector;
S222, or, acquiring expression information, and determining the expression emotion vector according to the expression information.
Specifically, in the prediction process, the expression emotion vector input to the decoder network may be an expression emotion vector learned during training, or a new expression emotion vector may be formed as a linear combination of several learned expression emotion vectors and used as input to control the emotion expressed in the output 3D face lip expression vertex animation.
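Forming a new expression emotion vector by linear combination can be sketched as follows; normalising the combination weights is an assumption of the sketch.

```python
import torch

def blend_emotions(emotion_matrix, weights):
    """Sketch: blend learned expression emotion vectors into a new one.
    emotion_matrix: (n_emotions, dim); weights: (n_emotions,) non-negative weights."""
    weights = weights / weights.sum()         # assumed normalisation
    return weights @ emotion_matrix           # (dim,) blended expression emotion vector
```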
S300, acquiring a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
The three-dimensional face basic model characterizes a non-speaking, expressionless three-dimensional face, and the lip expression offsets are the vertex offsets of the lips and expression of the three-dimensional face. The three-dimensional face lip expression animation is obtained by superimposing the lip expression offset sequence on the three-dimensional face basic model.
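The final synthesis step can be sketched as a simple per-frame addition of the predicted offsets to the neutral basic mesh; the (T, V, 3) offset layout is an assumption (a flattened (T, V*3) output would first be reshaped).

```python
import torch

def synthesize_animation(base_vertices, offset_sequence):
    """Sketch: add each lip/expression offset frame to the non-speaking,
    expressionless basic mesh. base_vertices: (V, 3); offset_sequence: (T, V, 3)."""
    return base_vertices.unsqueeze(0) + offset_sequence   # (T, V, 3) animated vertices
```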
The embodiment of the application has the following beneficial effects: the audio feature vector corresponding to the audio data is extracted through the voice feature extraction model, so that the lip expression prediction model can adapt to different languages; the audio feature vector is then predicted through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence, that is, the variation of the lips and expression; finally, the three-dimensional face lip expression animation is obtained from the three-dimensional face basic model and the lip expression offset sequence. A three-dimensional animation comprising lips and expressions is thus generated by driving an image with voice, with high efficiency and good stability.
Referring to fig. 5, an embodiment of the present application provides a system for driving an image by voice, including:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
It can be seen that the content of the above method embodiment applies to this system embodiment: the functions specifically implemented by the system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those achieved by the method embodiment.
Referring to fig. 6, an embodiment of the present application provides a device for driving an image by voice, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
It can be seen that the content of the above method embodiment applies to this device embodiment: the functions specifically implemented by the device embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those achieved by the method embodiment.
Furthermore, the embodiment of the application also discloses a computer program product or a computer program stored in a computer-readable storage medium. A processor of a computer device may read the computer program from the computer-readable storage medium and execute it, causing the computer device to perform the method described above. Similarly, the content of the above method embodiment applies to this storage medium embodiment: its specific functions are the same as those of the method embodiment, and the beneficial effects achieved are the same.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A method of voice-driving an image, comprising:
acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
predicting the audio feature vector through a lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and obtaining a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
2. The method according to claim 1, wherein the speech feature extraction model includes a convolutional neural network and a bidirectional long short-term memory network, and the extracting the audio feature vector corresponding to the audio data by the speech feature extraction model specifically includes:
inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice features;
and inputting the high-level voice features into the bidirectional long short-term memory network to obtain an audio feature vector.
3. The method of claim 2, wherein the training process of the speech feature extraction model comprises:
acquiring voice sample data and corresponding real voice sample feature vectors;
inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and adjusting model parameters of the initial model according to the error between the predicted voice sample feature vector and the real voice sample feature vector until the error between the predicted voice sample feature vector and the real voice sample feature vector output by the initial model meets training requirements, so as to obtain the voice feature extraction model.
4. The method according to claim 1, wherein the lip expression prediction model comprises a Transformer neural network model, the Transformer neural network model comprises an encoder network and a decoder network, and predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence specifically comprises:
inputting the audio feature vector into an encoder network to obtain an audio information characterization vector sequence;
and inputting the audio information representation vector sequence and the determined expression emotion vector into a decoder network to obtain a lip expression offset sequence.
5. The method of claim 4, wherein the training process of the lip expression prediction model comprises:
acquiring multi-view video sample data of a speaker, establishing a three-dimensional point cloud face sequence according to the video sample data, and determining a real face lip expression offset according to the three-dimensional point cloud face sequence;
extracting voice sample data from the video sample data, and matching and labeling the three-dimensional point cloud face sequence and the voice sample data to form a sample data pair;
inputting voice sample data in the sample data pair into an encoder network to obtain an audio sample information characterization vector;
inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair and the expression emotion vector generated randomly into a decoder network to obtain a predicted face lip expression offset;
and calculating a loss value between the real face lip expression offset and the predicted face lip expression offset according to the target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
6. The method of claim 5, wherein the objective loss function is calculated as:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
where Loss denotes the total loss value, L_lip the loss value of the lip region, S_l the influence coefficient of the lip region, L_face the loss value of the facial expression region outside the lips, S_f the influence coefficient of the facial expression region outside the lips, L_reg the loss value of the expression regularization term, and S_r the influence coefficient of the expression regularization term.
7. The method of claim 1, wherein the expression emotion vector is obtained by:
determining an expression emotion vector obtained by learning in the lip expression prediction model training process as an expression emotion vector;
or, obtaining expression information, and determining an expression emotion vector according to the expression information.
8. A system for voice-driven imaging, comprising:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
9. An apparatus for voice-driven imaging, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any of claims 1-7.
10. A storage medium having stored therein a processor executable program, which when executed by a processor is adapted to carry out the method of any one of claims 1-7.
CN202310334646.2A 2023-03-30 2023-03-30 Method, system, device and storage medium for driving image by voice Pending CN116597857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310334646.2A CN116597857A (en) 2023-03-30 2023-03-30 Method, system, device and storage medium for driving image by voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310334646.2A CN116597857A (en) 2023-03-30 2023-03-30 Method, system, device and storage medium for driving image by voice

Publications (1)

Publication Number Publication Date
CN116597857A true CN116597857A (en) 2023-08-15

Family

ID=87603317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310334646.2A Pending CN116597857A (en) 2023-03-30 2023-03-30 Method, system, device and storage medium for driving image by voice

Country Status (1)

Country Link
CN (1) CN116597857A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218224A (en) * 2023-08-21 2023-12-12 华院计算技术(上海)股份有限公司 Face emotion image generation method and device, readable storage medium and terminal
CN117372553A (en) * 2023-08-25 2024-01-09 华院计算技术(上海)股份有限公司 Face image generation method and device, computer readable storage medium and terminal

Similar Documents

Publication Publication Date Title
CN111243626B (en) Method and system for generating speaking video
CN112184858B (en) Virtual object animation generation method and device based on text, storage medium and terminal
CN103650002B (en) Text based video generates
CN113194348B (en) Virtual human lecture video generation method, system, device and storage medium
CN110751708B (en) Method and system for driving face animation in real time through voice
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
CN112465935A (en) Virtual image synthesis method and device, electronic equipment and storage medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
JP2002507033A (en) Face synthesis device and face synthesis method
JP2014519082A5 (en)
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
JP2003529861A (en) A method for animating a synthetic model of a human face driven by acoustic signals
CN111459450A (en) Interactive object driving method, device, equipment and storage medium
CN113421547B (en) Voice processing method and related equipment
CN113838174B (en) Audio-driven face animation generation method, device, equipment and medium
CN113228163A (en) Real-time text and audio based face reproduction
CN113077537A (en) Video generation method, storage medium and equipment
CN114332318A (en) Virtual image generation method and related equipment thereof
CN113299312A (en) Image generation method, device, equipment and storage medium
CN114882862A (en) Voice processing method and related equipment
CN116051692A (en) Three-dimensional digital human face animation generation method based on voice driving
Rastgoo et al. A survey on recent advances in Sign Language Production
CN113395569A (en) Video generation method and device
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
CN116758189A (en) Digital human image generation method, device and storage medium based on voice driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination