CN116597857A - Method, system, device and storage medium for driving image by voice - Google Patents
Method, system, device and storage medium for driving image by voice
- Publication number: CN116597857A
- Application number: CN202310334646.2A
- Authority
- CN
- China
- Prior art keywords
- expression
- lip
- voice
- vector
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/18—Details of the transformation process
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a method, a system, a device and a storage medium for driving an image by voice. The method comprises the following steps: acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model; predicting the audio feature vector through a lip expression prediction model and a determined expression emotion vector to obtain a lip expression offset sequence; and obtaining a three-dimensional face base model, and synthesizing the base model with the lip expression offset sequence to obtain a three-dimensional facial lip expression animation. Embodiments of the application can drive an image from input voice to generate a three-dimensional animation comprising both lip shape and expression, with high efficiency and good stability, and can be widely applied in the field of computer technology.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, a system, an apparatus, and a storage medium for driving an image by voice.
Background
With the growing diversity of 3D video content and the rapid development of application scenarios for digital virtual humans, there is demand for higher-quality, more efficient production of 3D digital virtual human content. Lip movements and facial expressions generated for a 3D digital virtual human help the audience understand dialogue more vividly. Fusing the two modalities of visual animation and auditory sound not only improves the user's comprehension of the content, but also provides a more precise experience in interactive scenarios and enhances the artistry and watchability of the 3D virtual digital human.
Current approaches to producing 3D character lip and expression animation fall into two types: first, a professional animator listens to the audio content and manually keyframes an animation whose character lip shapes match the sound; second, a motion capture device records the facial lip expressions of professional actors, the captured data is manually retouched, and the result is imported into a rendering engine to drive the character's facial lip and expression movement. Both schemes require significant labor and time costs, and differences between people and equipment affect the stability of the final content.
Disclosure of Invention
Accordingly, an object of the embodiments of the present application is to provide a method, a system, a device, and a storage medium for driving an image by voice, which can drive an image from input voice to generate a three-dimensional animation comprising lip shape and expression, with high efficiency and good stability.
In a first aspect, an embodiment of the present application provides a method for driving an image by voice, including the steps of:
acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
predicting the audio feature vector through a lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and obtaining a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
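The three claimed steps can be sketched end to end as follows. This is a minimal numpy sketch with placeholder models: the function names, vertex count (468), and feature framing are illustrative assumptions, not the patent's actual CNN+BiLSTM extractor or Transformer predictor.

```python
import numpy as np

# Hypothetical stand-ins for the three models described in the claims; the
# patent's actual models are a CNN+BiLSTM feature extractor and a
# Transformer-based lip expression predictor.
def extract_audio_features(audio: np.ndarray) -> np.ndarray:
    # Placeholder: frame the 1-D audio into 20-sample windows, one feature
    # vector per frame (a real extractor would output PPG vectors).
    frames = audio[: len(audio) // 20 * 20].reshape(-1, 20)
    return frames  # shape: (num_frames, 20)

def predict_lip_expression_offsets(features: np.ndarray,
                                   emotion: np.ndarray) -> np.ndarray:
    # Placeholder: map each audio frame to a per-vertex 3-D offset,
    # modulated by the emotion vector's mean.
    num_vertices = 468
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((features.shape[1], num_vertices * 3)) * 0.01
    return (features @ proj).reshape(-1, num_vertices, 3) * (1 + emotion.mean())

def synthesize_animation(base_face: np.ndarray,
                         offsets: np.ndarray) -> np.ndarray:
    # Each animation frame = neutral three-dimensional base model + offset.
    return base_face[None, :, :] + offsets

audio = np.random.default_rng(1).standard_normal(16000)   # 1 s at 16 kHz
emotion = np.zeros(16)                                    # neutral emotion vector
base_face = np.zeros((468, 3))                            # neutral 3-D face vertices

features = extract_audio_features(audio)
offsets = predict_lip_expression_offsets(features, emotion)
animation = synthesize_animation(base_face, offsets)
print(animation.shape)  # (800, 468, 3): one mesh per audio frame
```

The synthesis step is a pure per-frame addition: the animation is the base mesh displaced by each frame's predicted offset.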
Optionally, the voice feature extraction model includes a convolutional neural network and a bidirectional long short-term memory network, and extracting, by the voice feature extraction model, the audio feature vector corresponding to the audio data specifically includes:
inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice features;
and inputting the high-level voice features into the bidirectional long short-term memory network to obtain an audio feature vector.
Optionally, the training process of the speech feature extraction model includes:
acquiring voice sample data and corresponding real voice sample feature vectors;
inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and adjusting model parameters of the initial model according to the error between the predicted voice sample feature vector and the real voice sample feature vector, until the error output by the initial model meets the training requirement, thereby obtaining the voice feature extraction model.
Optionally, the lip expression prediction model includes a Transformer neural network model comprising an encoder network and a decoder network, and predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence specifically includes:
inputting the audio feature vector into the encoder network to obtain an audio information characterization vector sequence;
and inputting the audio information characterization vector sequence and the determined expression emotion vector into the decoder network to obtain a lip expression offset sequence.
Optionally, the training process of the lip expression prediction model includes:
acquiring video sample data of a speaker from a plurality of viewing angles, establishing a three-dimensional point cloud face sequence from the video data, and determining a real facial lip expression offset from the three-dimensional point cloud face sequence;
extracting voice sample data from the video sample data, and matching and labeling the three-dimensional point cloud face sequence with the voice sample data to form sample data pairs;
inputting the voice sample data in a sample data pair into the encoder network to obtain an audio sample information characterization vector;
inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair, and a randomly generated expression emotion vector into the decoder network to obtain a predicted facial lip expression offset;
and calculating a loss value between the real facial lip expression offset and the predicted facial lip expression offset according to a target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
Optionally, the calculation formula of the target loss function is as follows:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
wherein Loss represents the loss value, L_lip represents the loss value of the lip region, S_l represents the influence coefficient of the lip region, L_face represents the loss value of the facial expression region other than the lip region, S_f represents the influence coefficient of the facial expression region other than the lip region, L_reg represents the loss value of the expression regularization term, and S_r represents the influence coefficient of the expression regularization term.
Optionally, the expression emotion vector is obtained by:
determining an expression emotion vector obtained by learning in the lip expression prediction model training process as an expression emotion vector;
or, obtaining expression information, and determining an expression emotion vector according to the expression information.
In a second aspect, an embodiment of the present application provides a system for driving an image by voice, including:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
In a third aspect, an embodiment of the present application provides an apparatus for driving an image by voice, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
In a fourth aspect, embodiments of the present application provide a storage medium having stored therein a processor-executable program for performing the above-described method when executed by a processor.
The embodiments of the application have the following beneficial effects: the audio feature vector corresponding to the audio data is extracted by the voice feature extraction model, so that the lip expression prediction model can adapt to different languages; the lip expression prediction model and the determined expression emotion vector then predict from the audio feature vector a lip expression offset sequence, i.e. the amount of change of the lips and face; finally, the three-dimensional facial lip expression animation is obtained from the three-dimensional face base model and the lip expression offset sequence. A three-dimensional animation comprising both lips and expression is thus generated from the voice-driven image, with high efficiency and good stability.
Drawings
FIG. 1 is a flowchart illustrating a method for driving an image by voice according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of another method for driving an image by voice according to an embodiment of the present application;
FIG. 3 is a block diagram of a speech feature extraction model provided by an embodiment of the present application;
FIG. 4 is a block diagram of a lip expression prediction model according to an embodiment of the present application;
FIG. 5 is a block diagram of a system for voice-driven imaging according to an embodiment of the present application;
fig. 6 is a block diagram of a voice-driven image device according to an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and specific embodiments. The step numbers in the following embodiments are set for convenience of illustration only; the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Referring to fig. 1 and 2, an embodiment of the present application provides a method for driving an image by voice, including the following steps:
s100, acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model.
The audio data refers to voice data for driving an image, and the voice data may be in any of a plurality of languages such as Chinese or English. The audio feature vector is used to characterize the audio features of the voice data. The voice feature extraction model obtains the output audio feature vector from the input audio data.
It will be appreciated by those skilled in the art that the specific type of the audio feature vector is determined according to the actual application, and this embodiment is not particularly limited. For example, the audio feature vector may be a PPG (Phonetic PosteriorGrams, phoneme posterior probability) feature vector; PPG feature vectors capture richer audio feature information, which improves adaptability to different languages in the subsequent prediction of 3D facial lip expression from voice.
It should be noted that the specific structure of the speech feature extraction model is determined according to the actual application, and this embodiment is not particularly limited. Referring to fig. 3, in one specific implementation, the speech feature extraction model includes a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM); the input of the model is a speech signal and the output is the corresponding speech feature vector, where the speech signal is a one-dimensional vector obtained by sampling the audio data at a fixed time interval. Specifically, the speech signal is input into a 1D-CNN (one-dimensional convolutional neural network), and high-level speech features are extracted through three 1D-CNN layers and a pooling layer; the CNN output is then fed to the BiLSTM, which captures the temporal information of the audio signal and further extracts speech features; the last layer is a fully connected output layer, mapping the BiLSTM output to the PPG feature vector.
Optionally, the training process of the speech feature extraction model includes:
s101, acquiring voice sample data and corresponding real voice sample feature vectors;
s102, inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and S103, according to the error between the predicted voice sample feature vector and the real voice sample feature vector, adjusting the model parameters of the initial model until the error between the predicted voice sample feature vector and the real voice sample feature vector output by the initial model meets the training requirement, and obtaining the voice feature extraction model.
The voice sample data comprises sample data in multiple languages, and the real voice sample feature vector is the ground-truth feature vector of the voice sample data. The initial model refers to the speech feature extraction model whose parameters are still to be determined. Specifically, the voice sample data is first input into the initial model to obtain a predicted voice sample feature vector; the model parameters of the initial model are then adjusted according to the error between the predicted and real voice sample feature vectors. During this adjustment the error is gradually reduced, and when the error output by the initial model meets the training requirement, the initial model with the corresponding parameters is taken as the voice feature extraction model.
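The adjust-until-the-error-meets-the-requirement loop of steps S101–S103 can be illustrated with a toy model. This sketch substitutes a one-layer linear model trained by gradient descent for the patent's CNN+BiLSTM network; the data, learning rate, and tolerance are all illustrative.

```python
import numpy as np

# Illustrative only: a linear "initial model" whose parameters are adjusted
# until the error between predicted and real sample feature vectors meets a
# training requirement, mirroring steps S101-S103.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 8))      # voice sample data (toy features)
W_true = rng.standard_normal((8, 4))
Y = X @ W_true                         # "real" voice sample feature vectors

W = np.zeros((8, 4))                   # initial model parameters
lr, tolerance = 0.05, 1e-3
for step in range(2000):
    Y_pred = X @ W                     # predicted voice sample feature vectors
    err = Y_pred - Y
    loss = (err ** 2).mean()           # error between predicted and real
    if loss < tolerance:               # error meets the training requirement
        break
    W -= lr * (2 / len(X)) * X.T @ err # adjust model parameters
print(f"final training loss: {loss:.6f}")
```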
Optionally, the voice feature extraction model includes a convolutional neural network and a bidirectional long short-term memory network, and extracting, by the voice feature extraction model, the audio feature vector corresponding to the audio data specifically includes:
S110, inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice features;
S120, inputting the high-level voice features into the bidirectional long short-term memory network to obtain an audio feature vector.
Specifically, referring to fig. 3, the one-dimensional vector corresponding to the audio data is first input to the CNN (convolutional neural network) in the speech feature extraction model to extract high-level speech features; the high-level speech features are then input to the BiLSTM (bidirectional long short-term memory network) in the speech feature extraction model to extract the audio feature vector.
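A minimal numpy sketch of the first stage (S110): a 1-D convolution with ReLU and max-pooling over the raw audio vector, producing "high-level" features. The BiLSTM stage (S120) and the final fully connected projection to PPG vectors are omitted; the kernel and pool size are illustrative.

```python
import numpy as np

# Toy 1D-CNN stage: valid convolution + ReLU + max pooling, standing in for
# the three 1D-CNN layers and pooling layer described for the real model.
def conv1d(signal: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    k = len(kernel)
    windows = np.lib.stride_tricks.sliding_window_view(signal, k)
    return windows @ kernel                      # valid 1-D convolution

def max_pool(x: np.ndarray, size: int) -> np.ndarray:
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

audio = np.sin(np.linspace(0, 100, 1600))        # one-dimensional audio vector
kernel = np.array([0.25, 0.5, 0.25])             # toy smoothing filter
features = max_pool(np.maximum(conv1d(audio, kernel), 0.0), size=4)
print(features.shape)
```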
S200, predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence.
The lip expression prediction model predicts a lip expression offset sequence from the audio feature vector and the expression emotion vector. The lip expression offset characterizes how far the lips and expression deviate from a non-speaking, expressionless face point cloud. It should be noted that the lip expression prediction model is determined according to the actual application, and this embodiment is not particularly limited. In a specific embodiment, referring to fig. 4, the lip expression prediction model includes an encoder and a decoder; the encoder consists of a feed-forward layer, several stacked multi-head self-attention and feed-forward layers, and a linear projection layer, while the decoder consists of a feed-forward layer, a multi-head self-attention layer, and a linear projection layer. The input of the encoder is the audio feature vector, the output of the encoder feeds the decoder, and the output of the decoder is the three-dimensional facial lip expression offset value.
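The self-attention building block that the described encoder and decoder stack can be sketched at the shape level. Single-head scaled dot-product attention is shown for clarity; the sequence length and feature dimension are illustrative, not taken from the patent.

```python
import numpy as np

# Scaled dot-product self-attention over a sequence of audio frame features,
# the core operation inside the multi-head self-attention layers of fig. 4.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over the key axis (numerically stabilised)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq = rng.standard_normal((50, 64))   # 50 audio frames, 64-dim features
out = scaled_dot_product_attention(seq, seq, seq)   # self-attention: Q=K=V
print(out.shape)   # (50, 64)
```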
Optionally, the training process of the lip expression prediction model includes:
s201, obtaining video sample data of multiple visual angles of a speaker, establishing a three-dimensional point cloud face sequence according to the video data, and determining a real face lip expression offset according to the three-dimensional point cloud face sequence.
The video sample data comprises multi-view video of different people speaking, collected through a multi-view camera array; the voice data in the video is mixed multi-person, multi-language speech, the image data is multi-view face data of different people speaking, and the video resolution of each view is at least 1080p. Each frame of the collected multi-view face data is aligned and reconstructed into a 3D point cloud, yielding a three-dimensional point cloud face sequence. Meanwhile, for each speaker a non-speaking 3D face model with the mouth closed in a natural state is selected from the reconstructed data and stored as the natural-expression base model, and the offset between the three-dimensional point cloud face sequence and the natural-expression base model is taken as the real facial lip expression offset.
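The offset computation described above reduces to a per-vertex subtraction. A sketch with an assumed vertex count (the patent does not specify one):

```python
import numpy as np

# Ground-truth offset = reconstructed 3-D point-cloud face sequence minus the
# speaker's neutral (non-speaking, mouth-closed) base model. 468 vertices and
# 120 frames are illustrative assumptions.
rng = np.random.default_rng(7)
neutral_base = rng.standard_normal((468, 3))                 # neutral 3-D face
sequence = neutral_base[None] + rng.normal(0, 0.02, (120, 468, 3))
true_offsets = sequence - neutral_base[None]                 # real offsets
print(true_offsets.shape)   # (120, 468, 3)
```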
S202, extracting voice sample data of video sample data, and carrying out matching labeling on the three-dimensional point cloud face sequence and the voice sample data to form a sample data pair.
The speech of the video sample data is extracted as voice sample data, the speaker's 3D face point cloud sequence is matched and labeled with the corresponding speech, and the 3D face point cloud sequence corresponding to each utterance is annotated. Through this labeling, the data is divided into pairs of speech matched with the speaker's 3D face point cloud, which are finally segmented into sequence fragments and stored.
S203, inputting the voice sample data in the sample data pair into an encoder network to obtain an audio sample information characterization vector.
Referring to fig. 4, the audio sample feature vector corresponding to the voice sample data in a sample data pair is extracted and input to the encoder network to obtain the audio sample information characterization vector. It should be noted that the number of stacked multi-head self-attention and feed-forward layers in the encoder network is determined according to the actual application and is not particularly limited in this embodiment; for example, five such layers may be stacked.
S204, inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair and the randomly generated expression emotion vector into a decoder network to obtain the predicted face lip expression offset.
During training, the expression emotion vectors are N-dimensional vectors sampled randomly from a Gaussian distribution. Because the training data include pronunciation-expression data pairs of speakers in different emotions, feeding this emotion-bearing training data into the lip expression prediction model and back-propagating the computed loss lets the model automatically learn the emotion vector contained in each emotion; the expression emotion vectors learned from the different emotion data are finally combined into an expression emotion vector matrix and stored. Referring to fig. 4, the audio sample information characterization vector output by the encoder, the three-dimensional point cloud face sequence in the sample data pair, and the randomly generated expression emotion vector are input to the decoder network, whose output is the predicted facial lip expression offset.
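The initialisation and storage of emotion vectors can be sketched as follows; the dimensionality N and number of emotions are assumptions, since the patent leaves both open.

```python
import numpy as np

# Expression emotion vectors start as random Gaussian samples (step S204) and,
# after training, the learned per-emotion vectors are stacked into a matrix.
N_DIM, NUM_EMOTIONS = 16, 6            # illustrative: e.g. neutral, happy, ...
rng = np.random.default_rng(3)
emotion_matrix = rng.standard_normal((NUM_EMOTIONS, N_DIM))
sampled = emotion_matrix[1]            # select one emotion's vector at inference
print(emotion_matrix.shape, sampled.shape)
```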
S205, calculating a loss value between the real facial lip expression offset and the predicted facial lip expression offset according to the target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
The target loss function computes the error between the model's predicted value and the actual target value; its specific form is determined according to the actual application and is not particularly limited in this embodiment. The smaller the loss value calculated by the target loss function, the more accurate the parameters of the resulting Transformer neural network model.
Optionally, the calculation formula of the target loss function is as follows:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
wherein Loss represents the loss value, L_lip represents the loss value of the lip region, S_l represents the influence coefficient of the lip region, L_face represents the loss value of the facial expression region other than the lip region, S_f represents the influence coefficient of the facial expression region other than the lip region, L_reg represents the loss value of the expression regularization term, and S_r represents the influence coefficient of the expression regularization term.
Specifically, the influence coefficient S of the lip region l The value of (2) and the influence coefficient S of the facial expression region outside the lip region f The value of S is adjusted according to the weight in practical application l And S is equal to f After adjustment, the influence coefficient S of the regular expression term is adjusted simultaneously r The model pays attention to the expression change in a longer time in training, so that the severe change of the model expression prediction in a short time can be avoided, and the expression change can be more natural. In the training process, minimizing the loss value of the target loss function through iteration continuously while adjusting S l 、S f And S is r And the coefficients enable more accurate and natural 3D face lip expression animation to be generated.
Optionally, the lip expression prediction model includes a Transformer neural network model comprising an encoder network and a decoder network, and predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence specifically includes:
S210, inputting the audio feature vector into the encoder network to obtain an audio information characterization vector sequence;
S220, inputting the audio information characterization vector sequence and the determined expression emotion vector into the decoder network to obtain a lip expression offset sequence.
The encoder network mainly encodes the audio features and extracts context-related audio characterization information: its input is the audio feature vector and its output is a context-related audio information characterization vector. The decoder network decodes the context-related audio information characterization vector output by the encoder network together with the expression emotion vector: its inputs are the encoder output, the 3D point cloud face and the determined expression emotion vector, and its output is the lip expression offset sequence.
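The encoder/decoder data flow can be sketched as follows, with single linear maps standing in for the actual Transformer stacks; every dimension, weight and name here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(audio_features, w_enc):
    # Stand-in for the Transformer encoder stack: maps the audio
    # feature sequence to a context-related characterization sequence.
    return np.tanh(audio_features @ w_enc)

def decoder(audio_repr, emotion_vec, w_dec):
    # Stand-in for the Transformer decoder stack: conditions every
    # time step on the expression emotion vector and maps the result
    # to per-frame lip expression offsets.
    cond = np.concatenate(
        [audio_repr, np.tile(emotion_vec, (audio_repr.shape[0], 1))], axis=1)
    return cond @ w_dec

T, d_audio, d_model, n_emotion = 8, 64, 32, 16
n_out = 30  # 3 coordinates x 10 mesh vertices (tiny, for illustration)
audio = rng.standard_normal((T, d_audio))
w_enc = 0.1 * rng.standard_normal((d_audio, d_model))
w_dec = 0.1 * rng.standard_normal((d_model + n_emotion, n_out))
emotion_vec = rng.standard_normal(n_emotion)
offsets = decoder(encoder(audio, w_enc), emotion_vec, w_dec)
print(offsets.shape)  # (8, 30): one offset row per audio frame
```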
Optionally, the expression emotion vector is obtained by:
S221, determining the expression emotion vector learned during training of the lip expression prediction model as the expression emotion vector;
S222, or, acquiring expression information and determining the expression emotion vector according to the expression information.
Specifically, in the prediction process, the expression emotion vector input to the decoder network can be an expression emotion vector learned during training, or a new expression emotion vector formed by the linear superposition of several learned expression emotion vectors; this input controls the emotion in the output 3D face lip expression vertex animation.
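The linear superposition of learned emotion vectors amounts to a weighted sum of rows of the stored matrix; the matrix values and blend weights below are made up purely for illustration:

```python
import numpy as np

# Rows of the learned expression emotion vector matrix (illustrative values).
emotion_matrix = np.eye(3, 8)          # 3 learned emotions, 8-dim vectors
weights = np.array([0.7, 0.3, 0.0])    # blend 70% of emotion 0 with 30% of emotion 1

# Linear superposition of the learned emotion vectors yields a new
# emotion vector that is fed to the decoder to control the output emotion.
blended = weights @ emotion_matrix
```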
S300, acquiring a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
The three-dimensional face basic model characterizes a non-speaking, expressionless three-dimensional face model, and the lip expression offset represents the vertex offset of the lips and expression of the three-dimensional face. The three-dimensional face basic model and the lip expression offset sequence are superimposed to obtain the three-dimensional face lip expression animation.
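The superposition step amounts to adding each frame's predicted vertex offsets to the neutral base mesh; the vertex count and offset values below are illustrative assumptions:

```python
import numpy as np

def synthesize(base_vertices, offset_sequence):
    """Superimpose each frame's predicted vertex offsets on the neutral
    base face to obtain the animated 3D face vertex sequence."""
    return base_vertices[None, :, :] + offset_sequence

base = np.zeros((468, 3))               # neutral, expressionless base mesh
offsets = 0.01 * np.ones((10, 468, 3))  # 10 frames of predicted offsets
animation = synthesize(base, offsets)
print(animation.shape)  # (10, 468, 3): 10 animated frames of the mesh
```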
The embodiment of the application has the following beneficial effects: the audio feature vector corresponding to the audio data is extracted through the voice feature extraction model, enabling the lip expression prediction model to adapt to different languages; the audio feature vector is then predicted through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence, i.e., the variation of the lips and facial expression; finally, the three-dimensional face lip expression animation is obtained from the three-dimensional face basic model and the lip expression offset sequence. A three-dimensional animation comprising both lips and expression is thus generated from the voice-driven image, with high efficiency and good stability.
Referring to fig. 5, an embodiment of the present application provides a system for driving an image by voice, including:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
It can be seen that the content of the above method embodiment applies to this system embodiment; the functions specifically implemented by the system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those of the method embodiment.
Referring to fig. 6, an embodiment of the present application provides a device for driving an image by voice, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
It can be seen that the content of the above method embodiment applies to this device embodiment; the functions specifically implemented by the device embodiment are the same as those of the method embodiment, and the beneficial effects achieved are the same as those of the method embodiment.
Furthermore, the embodiment of the application also discloses a computer program product or computer program stored in a computer-readable storage medium. A processor of a computer device may read the computer program from the computer-readable storage medium and execute it, causing the computer device to perform the method described above. Similarly, the content of the above method embodiment applies to this storage medium embodiment; the specific functions and beneficial effects achieved are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.
Claims (10)
1. A method of voice-driving an image, comprising:
acquiring audio data, and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
predicting the audio feature vector through a lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and obtaining a three-dimensional face basic model, and synthesizing the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
2. The method according to claim 1, wherein the speech feature extraction model includes a convolutional neural network and a two-way long-short memory network, and the extracting the audio feature vector corresponding to the audio data by the speech feature extraction model specifically includes:
inputting the one-dimensional vector corresponding to the audio data into the convolutional neural network to obtain high-level voice characteristics;
and inputting the high-level voice characteristics into the two-way long and short memory network to obtain an audio characteristic vector.
3. The method of claim 2, wherein the training process of the speech feature extraction model comprises:
acquiring voice sample data and corresponding real voice sample feature vectors;
inputting the voice sample data into an initial model, and extracting a predicted voice sample feature vector;
and adjusting model parameters of the initial model according to the error between the predicted voice sample feature vector and the real voice sample feature vector until the error between the predicted voice sample feature vector and the real voice sample feature vector output by the initial model meets training requirements, so as to obtain the voice feature extraction model.
4. The method according to claim 1, wherein the lip expression prediction model comprises a Transformer neural network model, the Transformer neural network model comprises an encoder network and a decoder network, and the audio feature vector is predicted by the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence, specifically comprising:
inputting the audio feature vector into an encoder network to obtain an audio information characterization vector sequence;
and inputting the audio information representation vector sequence and the determined expression emotion vector into a decoder network to obtain a lip expression offset sequence.
5. The method of claim 4, wherein the training process of the lip expression prediction model comprises:
acquiring video sample data of a plurality of visual angles of a speaker, establishing a three-dimensional point cloud face sequence according to the video data, and determining a real face lip expression offset according to the three-dimensional point cloud face sequence;
extracting voice sample data of video sample data, and matching and labeling the three-dimensional point cloud face sequence and the voice sample data to form a sample data pair;
inputting voice sample data in the sample data pair into an encoder network to obtain an audio sample information characterization vector;
inputting the audio sample information characterization vector, the three-dimensional point cloud face sequence in the sample data pair and the expression emotion vector generated randomly into a decoder network to obtain a predicted face lip expression offset;
and calculating a loss value between the real face lip expression offset and the predicted face lip expression offset according to the target loss function, and updating the encoder network, the decoder network and the target loss function according to the loss value to obtain the Transformer neural network model.
6. The method of claim 5, wherein the objective loss function is calculated as:
Loss = S_l × L_lip + S_f × L_face + S_r × L_reg
where Loss represents the total loss value; L_lip represents the loss value of the lip region and S_l its influence coefficient; L_face represents the loss value of the facial expression region outside the lip region and S_f its influence coefficient; L_reg represents the loss value of the expression regularization term and S_r its influence coefficient.
7. The method of claim 1, wherein the expression emotion vector is obtained by:
determining an expression emotion vector obtained by learning in the lip expression prediction model training process as an expression emotion vector;
or, obtaining expression information, and determining an expression emotion vector according to the expression information.
8. A system for voice-driven imaging, comprising:
the first module is used for acquiring audio data and extracting an audio feature vector corresponding to the audio data through a voice feature extraction model;
the second module is used for predicting the audio feature vector through the lip expression prediction model and the determined expression emotion vector to obtain a lip expression offset sequence;
and the third module is used for obtaining a three-dimensional face basic model, and carrying out synthesis processing on the three-dimensional face basic model and the lip expression offset sequence to obtain a three-dimensional face lip expression animation.
9. An apparatus for voice-driven imaging, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any of claims 1-7.
10. A storage medium having stored therein a processor executable program, which when executed by a processor is adapted to carry out the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310334646.2A CN116597857A (en) | 2023-03-30 | 2023-03-30 | Method, system, device and storage medium for driving image by voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597857A true CN116597857A (en) | 2023-08-15 |
Family
ID=87603317
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310334646.2A Pending CN116597857A (en) | 2023-03-30 | 2023-03-30 | Method, system, device and storage medium for driving image by voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597857A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117218224A (en) * | 2023-08-21 | 2023-12-12 | 华院计算技术(上海)股份有限公司 | Face emotion image generation method and device, readable storage medium and terminal |
CN117372553A (en) * | 2023-08-25 | 2024-01-09 | 华院计算技术(上海)股份有限公司 | Face image generation method and device, computer readable storage medium and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111243626B (en) | Method and system for generating speaking video | |
CN112184858B (en) | Virtual object animation generation method and device based on text, storage medium and terminal | |
CN103650002B | Text-based video generation | |
CN113194348B (en) | Virtual human lecture video generation method, system, device and storage medium | |
CN110751708B (en) | Method and system for driving face animation in real time through voice | |
CN112562722A (en) | Audio-driven digital human generation method and system based on semantics | |
CN112465935A (en) | Virtual image synthesis method and device, electronic equipment and storage medium | |
CN116597857A (en) | Method, system, device and storage medium for driving image by voice | |
JP2002507033A (en) | Face synthesis device and face synthesis method | |
JP2014519082A5 (en) | ||
CN113592985B (en) | Method and device for outputting mixed deformation value, storage medium and electronic device | |
JP2003529861A (en) | A method for animating a synthetic model of a human face driven by acoustic signals | |
CN111459450A (en) | Interactive object driving method, device, equipment and storage medium | |
CN113421547B (en) | Voice processing method and related equipment | |
CN113838174B (en) | Audio-driven face animation generation method, device, equipment and medium | |
CN113228163A (en) | Real-time text and audio based face reproduction | |
CN113077537A (en) | Video generation method, storage medium and equipment | |
CN114332318A (en) | Virtual image generation method and related equipment thereof | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN114882862A (en) | Voice processing method and related equipment | |
CN116051692A (en) | Three-dimensional digital human face animation generation method based on voice driving | |
Rastgoo et al. | A survey on recent advances in Sign Language Production | |
CN113395569A (en) | Video generation method and device | |
Filntisis et al. | Video-realistic expressive audio-visual speech synthesis for the Greek language | |
CN116758189A (en) | Digital human image generation method, device and storage medium based on voice driving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||