CN116997934A - Three-dimensional facial animation based on speech - Google Patents

Three-dimensional facial animation based on speech


Publication number
CN116997934A
Authority
CN
China
Prior art keywords
subject
face
audio
facial
mesh
Prior art date
Legal status
Pending
Application number
CN202280022450.3A
Other languages
Chinese (zh)
Inventor
Michael Zollhöfer
Fernando de la Torre
Yaser Sheikh
Alexander Richard
Current Assignee
Meta Platforms Technologies LLC
Original Assignee
Meta Platforms Technologies LLC
Priority date
Filing date
Publication date
Priority claimed from US 17/669,270 (US 11756250 B2)
Application filed by Meta Platforms Technologies LLC filed Critical Meta Platforms Technologies LLC
Publication of CN116997934A


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

A method for training a speech-based three-dimensional facial animation model is provided. The method includes the following steps: determining a first correlation value for a facial feature based on an audio waveform from a first subject; generating a first mesh for a lower portion of the face based on the facial feature and the first correlation value; updating the first correlation value when a difference between the first mesh and a ground-truth image of the first subject is greater than a preselected threshold; and providing a three-dimensional model of the face animated by speech, based on the difference between the first mesh and the ground-truth image of the first subject, to an immersive reality application accessed by a client device. A system, and a non-transitory computer-readable medium storing instructions that cause the system to perform the above method, are also provided.

Description

Three-dimensional facial animation based on speech
Background
Technical Field
The present disclosure relates generally to the field of generating three-dimensional computer models of subjects captured on video. More particularly, the present disclosure relates to generating three-dimensional (3D) full-face animation of a subject based on speech in a video capture.
Background
Existing audio-driven facial animation approaches exhibit uncanny or static facial animation, fail to generate accurate and plausible coarticulation, or rely on person-specific models, which limits their scalability.
Drawings
Fig. 1 illustrates an example architecture suitable for providing voice-based 3D facial animation for an immersive reality environment, according to some embodiments.
Fig. 2 is a block diagram illustrating an example server and client from the architecture in fig. 1, in accordance with certain aspects of the present disclosure.
Fig. 3 illustrates a block diagram of the mapping of a facial mesh and a speech signal to a categorical facial expression space, in accordance with some embodiments.
FIG. 4 illustrates a block diagram of an autoregressive model including preselected labels, according to some embodiments.
Fig. 5 illustrates a visualization of the latent space clustered according to the expression input, according to some embodiments.
Fig. 6A and 6B illustrate the effect of the audio input and the expression input on a facial mesh, according to some embodiments.
Fig. 7 illustrates different facial expressions for different identities under the same speech input, according to some embodiments.
Fig. 8 illustrates retargeting of facial expressions (e.g., lip shape, eye closure, and eyebrow height) relative to the neutral expressions of different identities, according to some embodiments.
Fig. 9 illustrates adjustment of facial expressions based on the language of the audio (English/Spanish), according to some embodiments.
Fig. 10 is a flow chart illustrating steps in a method for using a three-dimensional model of a face animated by speech in an immersive reality application according to some embodiments.
FIG. 11 is a flowchart illustrating steps in a method for generating a three-dimensional model of a face animated by speech, in accordance with some embodiments.
Fig. 12 is a block diagram illustrating an example computer system with which the clients and servers of fig. 1 and 2 and the methods of fig. 10 and 11 may be implemented.
In the drawings, elements referred to with the same or similar designations have the same or similar features and descriptions unless otherwise indicated.
Disclosure of Invention
In a first embodiment, a computer-implemented method includes: identifying audio-related facial features from an audio capture of a subject; generating a first mesh for a lower portion of the subject's face based on the audio-related facial features; and identifying expression-like facial features of the subject. The computer-implemented method further includes: generating a second mesh for an upper portion of the subject's face based on the expression-like facial features; forming a composite mesh using the first mesh and the second mesh; and determining a loss value for the composite mesh based on a ground-truth image of the subject. The computer-implemented method further includes: generating a three-dimensional model of the subject's face with the composite mesh, based on the loss value; and providing the three-dimensional model of the subject's face to a display in a client device running an immersive reality application that includes the subject.
In some embodiments of the first embodiment, one or more of the following features may be used. The method may further include receiving the audio capture of the subject from a virtual reality headset. Identifying the audio-related facial features may include identifying an intensity and a frequency of the audio capture from the subject and correlating the amplitude and frequency of the audio waveform with a geometry of the lower portion of the subject's face. Generating the first mesh may include adding a blink or an eyebrow motion of the subject. Identifying the expression-like facial features of the subject may include randomly selecting the expression-like facial features based on a prior sampling of facial expressions from multiple subjects. Identifying the expression-like facial features of the subject may include correlating the facial features with speech features from the audio capture of the subject. Identifying the expression-like facial features of the subject may include using a random sampling of facial expressions from multiple subjects collected during a training phase in which a second subject reads text or holds a conversation. Generating the second mesh may include accessing a three-dimensional model of the subject's face with a neutral expression. Forming the composite mesh may include smoothly blending a lip shape in the first mesh into an eye closure in the second mesh across the subject's face. The method may further include receiving the audio capture of the subject and an image capture of the subject's face, and generating the second mesh may include using the image capture.
In a second embodiment, a system includes a memory storing a plurality of instructions and one or more processors configured to execute the instructions to cause the system to perform operations. The operations include: identifying audio-related facial features from an audio capture of a subject; generating a first mesh for a lower portion of the subject's face based on the audio-related facial features; and identifying expression-like facial features of the subject. The operations further include: generating a second mesh for an upper portion of the subject's face based on the expression-like facial features; forming a composite mesh using the first mesh and the second mesh; and determining a loss value for the composite mesh based on a ground-truth image of the subject. The operations further include: generating a three-dimensional model of the subject's face with the composite mesh, based on the loss value; and providing the three-dimensional model of the subject's face to a display in a client device running an immersive reality application that includes the subject.
In some embodiments of the second embodiment, one or more of the following features may be used. The one or more processors may further execute the instructions to receive the audio capture of the subject from a virtual reality headset. To identify the expression-like facial features of the subject, the one or more processors may execute instructions to randomly select the expression-like facial features based on a prior sampling of facial expressions from multiple subjects. To identify the expression-like facial features of the subject, the one or more processors may execute instructions to correlate the facial features with speech features from the audio capture of the subject. To identify the expression-like facial features of the subject, the one or more processors may execute instructions to use a random sampling of facial expressions from multiple subjects collected during a training phase in which a second subject reads text or holds a conversation.
In a third embodiment, a computer-implemented method includes: determining a first correlation value for a facial feature based on an audio waveform from a first subject; generating a first mesh for a lower portion of the face based on the facial feature and the first correlation value; updating the first correlation value based on a difference between the first mesh and a ground-truth image of the first subject; and providing a three-dimensional model of the face animated by speech, based on the difference between the first mesh and the ground-truth image of the first subject, to an immersive reality application accessed by a client device.
In one or more embodiments of the third embodiment, one or more of the following features may be used. The method may further include: determining a second correlation value for an upper-face feature; generating a second mesh for an upper portion of the face based on the upper-face feature and the second correlation value; forming a composite mesh using the first mesh and the second mesh; and forming the three-dimensional model of the face animated by speech using the composite mesh. Determining the first correlation value for the facial feature may include identifying the facial feature based on an intensity and a frequency of the audio waveform. The method may further include determining a loss value for the first mesh based on the ground-truth image of the first subject. The method may further include updating the first correlation value for the facial feature based on an audio waveform from a second subject.
In another embodiment, a non-transitory computer-readable medium stores instructions that, when executed by a processor, cause a computer to perform a method. The method includes: identifying audio-related facial features from an audio capture of a subject; generating a first mesh for a lower portion of the subject's face based on the audio-related facial features; and identifying expression-like facial features of the subject. The method further includes: generating a second mesh for an upper portion of the subject's face based on the expression-like facial features; forming a composite mesh using the first mesh and the second mesh; and determining a loss value for the composite mesh based on a ground-truth image of the subject. The method further includes: generating a three-dimensional model of the subject's face with the composite mesh, based on the loss value; and providing the three-dimensional model of the subject's face to a display in a client device running an immersive reality application that includes the subject.
In yet another embodiment, a system includes means for storing instructions and means for executing the instructions to perform a method that includes: identifying audio-related facial features from an audio capture of a subject; generating a first mesh for a lower portion of the subject's face based on the audio-related facial features; and identifying expression-like facial features of the subject. The method further includes: generating a second mesh for an upper portion of the subject's face based on the expression-like facial features; forming a composite mesh using the first mesh and the second mesh; and determining a loss value for the composite mesh based on a ground-truth image of the subject. The method further includes: generating a three-dimensional model of the subject's face with the composite mesh, based on the loss value; and providing the three-dimensional model of the subject's face to a display in a client device running an immersive reality application that includes the subject.
It should be understood that any feature described herein as being suitable for incorporation into one or more aspects or embodiments of the present disclosure is intended to be generalized to any and all aspects and embodiments of the present disclosure. Other aspects of the disclosure will be understood by those skilled in the art from the description, claims, and drawings of the disclosure. The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.
Detailed Description
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the embodiments of the present disclosure may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail in order not to obscure the disclosure.
General overview
Speech-driven facial animation is a challenging technical problem with several applications, such as facial animation for computer games, electronic commerce, immersive virtual reality (VR) telepresence, and other augmented reality (AR) applications. The requirements on speech-driven facial animation vary from application to application. Applications such as speech therapy or entertainment (e.g., animated emoji (Animoji) or AR effects) may tolerate lower accuracy and realism in the animation. In contrast, for movie production, movie dubbing, driven avatars for e-commerce applications, or immersive telepresence, the quality of the speech animation requires a high degree of naturalness and realism, and must provide intelligibility comparable to that of a natural speaker. The human visual system is highly attuned to subtle facial motion and expression. Thus, poorly animated faces that lack realistic coarticulation effects, or whose lip motion is out of sync, are perceived as disturbing by users and hamper the commercial success of a device or application.
There is a large degree of dependency between speech and facial gestures. This dependency has been exploited by audio-driven facial animation methods developed in computer vision and graphics. With advances in deep learning, some audio-driven facial animation techniques use person-specific methods trained in a supervised manner on large amounts of paired audio and mesh data. Some of these methods achieve high-quality lip animation and synthesize plausible facial motion from audio alone. However, obtaining the required training data calls for high-quality, vision-based motion capture of the user, which makes these methods impractical for consumer-facing applications in real-world environments. Other methods generalize, or average, across different identities and are thus able to animate any user from a given audio stream and a static neutral 3D scan of that user. Although such methods are practical in real-world environments, they typically exhibit uncanny or static facial animation, because audio does not encode all aspects of facial expression. Typical available audio-driven facial animation models therefore attempt to learn a one-to-many mapping, i.e., multiple plausible outputs per input. This often leads to over-smoothed or implausible results (e.g., abnormal, unusual, or clearly artificial motion), especially in facial regions that are only weakly correlated, or not correlated at all, with the audio signal.
To address these technical problems, which arise in the fields of computer networks, computer simulation, and immersive reality applications, embodiments disclosed herein include technical features such as an audio-driven facial animation approach that enables highly realistic motion synthesis for the entire face and also generalizes to unseen identities. To this end, the machine-learning application includes a categorical latent space for facial animation that disentangles audio-correlated information from audio-uncorrelated information. For example, eye closure is not tied to a particular lip shape. The latent space is trained with a novel cross-modal loss that encourages the model to produce an accurate face reconstruction independent of the audio input, and an accurate mouth region that depends only on the provided audio input. This disentangles the motion of the lower and upper face regions and prevents over-smoothed results. Motion synthesis is based on an autoregressive sampling strategy of an audio-conditioned temporal model over the learned categorical latent space. The disclosed method ensures highly accurate lip motion while also being able to sample plausible animations of facial parts that are uncorrelated with the audio signal, such as blinks and eyebrow motion.
It is desirable to animate an arbitrary neutral face mesh using only speech, since this is fast to process (e.g., less than one second of audio waveform may be sufficient). Because speech does not encode all aspects of facial expression (e.g., blinks), many expressive features of the face are speech-independent. This causes most existing audio-driven methods to exhibit uncanny or static facial animation. To overcome this technical problem, embodiments disclosed herein include a categorical latent space for facial expressions stored in a training database. At inference time, some embodiments perform autoregressive sampling from a speech-conditioned temporal model over the categorical latent space to ensure accurate lip motion while synthesizing plausible animation of the speech-uncorrelated parts of the face. The categorical latent space may have the following properties. 1) Categorical: the space is partitioned into learned categories. 2) Expressive: the latent space can encode different facial expressions, including sparse facial events such as blinks. 3) Semantically disentangled: speech-correlated and speech-uncorrelated information are at least partially decoupled, e.g., eye closure should not be tied to a given lip shape or mouth pose.
Further, embodiments disclosed herein include a retargeting configuration in which a 3D speech animation model trained on one or more subjects is seamlessly applied to different subjects. In some embodiments, the 3D speech animation model disclosed herein may be used to dub speech from a given subject with multilingual speech from one or more different subjects.
Example System architecture
FIG. 1 illustrates an example architecture 100 suitable for accessing a 3D voice animation engine, according to some embodiments. Architecture 100 includes a server 130 communicatively coupled with a client device 110 and at least one database 152 via a network 150. One of the plurality of servers 130 is configured to host a memory comprising instructions that, when executed by a processor, cause the server 130 to perform at least some of the plurality of steps in the methods disclosed herein. In some embodiments, the processor is configured to control a graphical user interface (graphical user interface, GUI) for a user of one of the plurality of client devices 110 to access the 3D voice animation engine. The 3D speech animation engine may be configured to train a machine learning model to execute a particular application. Accordingly, the processor may include a dashboard tool configured to display components and graphical results to a user via the GUI. For load balancing purposes, the plurality of servers 130 may host memory including instructions to one or more processors, and the plurality of servers 130 may host a history log and database 152 including a plurality of training profiles for the 3D voice animation engine. Further, in some embodiments, multiple users of client device 110 may access the same 3D voice animation engine to run one or more machine learning models. In some embodiments, a single user with a single client device 110 may train multiple machine learning models running in parallel in one or more servers 130. Thus, multiple client devices 110 may communicate with each other via network 150 and by accessing one or more servers 130 and resources located in the one or more servers. In some embodiments, at least one or more client devices 110 may include a head-mounted device for a Virtual Reality (VR) application or smart glasses for an Augmented Reality (AR) application, as disclosed herein. In this regard, the headset or smart glasses may be paired with the smart phone for wireless communication with an AR/VR application installed in the smart phone, and the headset or smart glasses may communicate with the server 130 from the smart phone via the network 150.
Server 130 may include any device having suitable processor, memory, and communication capabilities for hosting a 3D voice animation engine and the multiple tools associated with it. The 3D voice animation engine may be accessed by various clients 110 over network 150. A client 110 may be, for example, a desktop computer, a mobile computer, a tablet computer (e.g., including an electronic book reader), a mobile device (e.g., a smartphone or a personal digital assistant (PDA)), or any other device having suitable processor, memory, and communication capabilities for accessing the 3D voice animation engine on one or more of the servers 130. Network 150 may include, for example, one or more of a local area network (LAN), a wide area network (WAN), the Internet, and the like. Further, network 150 may include, but is not limited to, any one or more of the following network topologies: a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, among others.
Fig. 2 is a block diagram 200 illustrating an example server 130 and client device 110 from architecture 100 in accordance with certain aspects of the present disclosure. Client device 110 and server 130 are communicatively coupled via a network 150 via respective communication modules 218-1 and 218-2 (hereinafter collectively referred to as "communication modules 218"). The communication module 218 is configured to connect with the network 150 to send and receive information (e.g., data, requests, responses, and commands) to and from other devices via the network 150. The communication module 218 may be, for example, a modem or an ethernet card. A user may interact with client device 110 via input device 214 and output device 216. The input device 214 may include: a mouse, a keyboard, a pointer, a touch screen, a microphone, a joystick, a wireless joystick, etc. The output device 216 may be a screen display, touch screen, speaker, etc. Client device 110 may include memory 220-1 and processor 212-1. Memory 220-1 may include an application 222 and a GUI 225, with application 222 and GUI 225 configured to run in client device 110 and coupled with input device 214 and output device 216. The application 222 may be downloaded by a user from the server 130 and may be hosted by the server 130. In some embodiments, as disclosed herein, the client device 110 may include a head-mounted device or smart glasses, and the application 222 may include an immersive reality environment in an AR/VR application. In running application 222, client device 110 and server 130 may communicate data packets 227-1 and 227-2 between each other via communication module 218 and network 150. For example, the client device 110 may provide the server 130 with a data packet 227-1 that includes a voice signal or sound file from the user. Accordingly, the server 130 may provide the client device 110 with the data packet 227-2 including the 3D animation model of the user based on the voice signal or sound file from the user.
The server 130 includes a memory 220-2, a processor 212-2, and a communication module 218-2. Hereinafter, the processors 212-1 and 212-2 and the memories 220-1 and 220-2 will be collectively referred to as "processor 212" and "memory 220", respectively. The processor 212 is configured to execute instructions stored in the memory 220. In some embodiments, memory 220-2 includes a 3D voice animation engine 232. The 3D voice animation engine 232 may share or provide features and resources to the GUI 225 that include a plurality of tools related to training and using 3D model animation of human faces for immersive reality applications including voice. The user may access the 3D voice animation engine 232 through the application 222 installed in the memory 220-1 of the client device 110. Accordingly, the application 222 may be installed by the server 130 and execute scripts and other routines provided by the server 130 through any of a number of tools. Execution of the application 222 may be controlled by the processor 212-1.
In this regard, the 3D voice animation engine 232 may be configured to create, store, update, and maintain a multimodal encoder 240, as disclosed herein. The multimodal encoder 240 may include an audio encoder 242, a facial expression encoder 244, a convolution tool 246, and a synthesis encoder 248. The 3D voice animation engine 232 may also include the synthesis encoder 248. In some embodiments, the 3D voice animation engine 232 may access one or more machine-learning models stored in a training database 252. Training database 252 includes training files and other data files that the 3D voice animation engine 232 may use to train a machine-learning model, based on user input received through application 222. Further, in some embodiments, at least one or more training files or machine-learning models may be stored in any of the memories 220. A user of client device 110 may access the training files through application 222.
The audio encoder 242 identifies audio-related facial features, according to a classification scheme learned through training, to generate a first mesh for a lower portion of the subject's face. To this end, the audio encoder 242 is capable of identifying the intensity and frequency of the acoustic waveform, or a portion thereof, in an audio capture from the subject. The audio capture may include a portion of speech from the subject, collected in real time by an AR/VR application (e.g., application 222) or collected and stored in training database 252 during a training phase. The audio encoder 242 may also correlate the intensity and frequency of the acoustic waveform with the geometry of the lower portions of the subject's face (e.g., the mouth and lips, and the chin and cheek portions). The facial expression encoder 244 identifies expression-like facial features of the subject to generate a second mesh for an upper portion of the subject's face. To do so, the facial expression encoder 244 may randomly select expression-like facial features based on a prior sampling of facial expressions from multiple subjects. In this regard, facial expressions of multiple subjects, collected during a training phase in which a second subject reads text or holds a conversation, may be stored in training database 252 and accessed by the facial expression encoder 244. In some embodiments, the facial expression encoder 244 correlates the facial features with speech features from the audio capture of the subject.
The convolution tool 246 may be part of a convolutional neural network (CNN) configured to reduce the dimensionality across multiple neural network layers in the 3D animation model. In some embodiments, convolution tool 246 provides a temporal convolution (e.g., a temporal CNN) for the 3D animation of a subject's face from speech. In some embodiments, convolution tool 246 provides an autoregressive convolution in which labels generated in later layers of the neural network are fed back to earlier layers to improve the classification passes in the CNN. The synthesis encoder 248 generates a composite mesh of the subject's entire face using the first mesh provided by the audio encoder 242 and the second mesh provided by the facial expression encoder 244. In doing so, the synthesis encoder 248 smoothly and seamlessly blends, across the subject's face, the lip shape in the first mesh provided by the audio encoder 242 into the eye closure in the second mesh provided by the facial expression encoder 244. In some embodiments, the synthesis encoder 248 may include additional skip connections to handle limited computational power by exploiting the generalization bias of the CNN.
The 3D speech animation engine 232 also includes a multimodal decoder 250 configured to generate a three-dimensional model of the subject's face using the composite mesh and to provide the three-dimensional model of the subject's face to a display in the client device 110 running application 222 (e.g., an immersive reality application that includes the subject).
The 3D voice animation engine 232 may include algorithms trained for the specific purposes of the engine and of the tools included in the engine. The algorithms may include machine learning or artificial intelligence algorithms using any linear or nonlinear algorithm, such as a neural network algorithm or a multiple regression algorithm. In some embodiments, the machine-learning model may include a neural network (NN), a convolutional neural network (CNN), a generative adversarial network (GAN), a deep reinforcement learning (DRL) algorithm, a deep recurrent neural network (DRNN), a classical machine learning algorithm (e.g., a random forest, a k-nearest neighbor (KNN) algorithm, or a k-means clustering algorithm), or any combination thereof. More generally, the machine-learning model may include any machine-learning model that involves a training step and an optimization step. In some embodiments, training database 252 may include a training archive used to modify coefficients according to a desired outcome of the machine-learning model. Accordingly, in some embodiments, the 3D voice animation engine 232 is configured to access training database 252 to retrieve documents and archives as input for the machine-learning model. In some embodiments, the 3D voice animation engine 232, the tools included in it, and at least a portion of training database 252 may be hosted in different servers accessible to server 130.
Fig. 3 illustrates a block diagram of a model 300 mapping a neutral face mesh 327 and a speech signal 328 to a speech-animated, expressive face mesh 351, according to some embodiments. The synthesis encoder 348 includes a fusion block 330 that maps a sequence of input animated face meshes 329 (the expression signal) and the speech signal 328 to an encoded expression 341 in the categorical latent space 340. The decoder 350 animates the neutral face mesh 327 according to the encoded expression 341.
To achieve high fidelity, in some embodiments, model 300 is trained on multiple subjects and on datasets that include detail such as the eyelids, facial hair, or eyebrows, so that high-fidelity full-face motion can be rendered from speech for arbitrary identities. In some embodiments, training uses an internal dataset of 250 subjects, each reading a total of 50 phonetically balanced sentences. The data are captured at 30 frames per second, and the face meshes (see neutral face mesh 327 and animated face meshes 329) are tracked using 80 synchronized cameras around the subject's head. In some embodiments, a face mesh may include 6172 vertices with a high level of detail (including the eyelids, upper-face structure, and different hairstyles). In some embodiments, this corresponds to 13 hours of paired audiovisual data, or 1.4 million frames of tracked 3D face meshes. Model 300 may be trained on the first 40 sentences of 200 subjects, with the remaining 10 sentences of the remaining 50 subjects used as a validation set (10 subjects) and a test set (40 subjects). In some embodiments, a subset of 16 subjects in the dataset may be used as a benchmark for comparison with model 300. The data are stored in a database (see training database 252).
In some embodiments, the speech signal 328 is recorded at 16 kilohertz (kHz). For each tracked mesh, a Mel spectrogram is generated from a 600 ms audio segment starting 500 milliseconds (ms) before and ending 100 ms after the corresponding visual frame. In some embodiments, the speech signal 328 is represented by 80-dimensional Mel-spectral features computed every 10 ms, using 1024 frequency bins and a window size of 800 for the underlying Fourier transform.
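A minimal sketch of this preprocessing is shown below, assuming the torchaudio library; the hop length of 160 samples follows from the stated 10 ms spacing, while the padding and normalization details are assumptions rather than part of the disclosure.

```python
# Sketch of the audio preprocessing described above, assuming torchaudio.
# The 600 ms window, 16 kHz sample rate, 80 mel bands, 1024-point FFT, and
# 800-sample window come from the text; everything else is an assumption.
import torch
import torchaudio

SAMPLE_RATE = 16_000          # speech signal 328 is recorded at 16 kHz
FRAME_RATE = 30               # visual frames per second
SEGMENT_BEFORE_MS = 500       # audio context before the visual frame
SEGMENT_AFTER_MS = 100        # audio context after the visual frame

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,               # 1024-point Fourier transform ("1024 frequency bins")
    win_length=800,           # window size of 800 samples
    hop_length=160,           # one feature vector every 10 ms
    n_mels=80,                # 80-dimensional Mel features
)

def mel_segment_for_frame(waveform: torch.Tensor, frame_index: int) -> torch.Tensor:
    """Extract the 600 ms Mel-spectrogram segment aligned to one tracked mesh frame."""
    center = int(frame_index / FRAME_RATE * SAMPLE_RATE)
    start = center - SEGMENT_BEFORE_MS * SAMPLE_RATE // 1000
    end = center + SEGMENT_AFTER_MS * SAMPLE_RATE // 1000
    segment = waveform[..., max(start, 0):end]
    if start < 0:             # pad if the segment starts before the recording begins
        segment = torch.nn.functional.pad(segment, (-start, 0))
    return mel_transform(segment)   # shape: (..., 80, ~61 time steps)
```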
To train the categorical latent space 340, let x_{1:T} = (x_1, ..., x_T), with x_t ∈ R^{V×3}, be a sequence of T face meshes 329, each represented by V vertices. Further, let a_{1:T} = (a_1, ..., a_T), with a_t ∈ R^D, be a sequence of T speech segments 328, each with D samples, aligned to the corresponding (visual) frame t. In addition, the template mesh 327 may be represented as h ∈ R^{V×3}.
To achieve high expressiveness, it is desirable that the categorical latent space 340 be large. However, for a single latent classification layer, this may result in an impractically large number of categories C. Thus, some embodiments model a smaller number H of latent classification heads 335, each performing a C-way classification. This allows a large expression space with a relatively small number of categories, because the number of configurations of the categorical latent space 340 is C^H and thus grows exponentially with H. In some embodiments, the values C=128 and H=64 may be sufficient to obtain accurate results for real-time applications.
The mapping from the expression and audio input signals to the multi-head categorical latent space is implemented by an encoder (e.g., fusion block 330), which maps an expression sequence 329 and an audio sequence 328 to a T × H × C dimensional encoding:

enc : R^{T×V×3} × R^{T×D} → R^{T×H×C}    (1)

In some embodiments, the continuous values in Equation 1 are converted into a categorical representation using a Gumbel-Softmax transform applied to each latent classification head,

c_{1:T,1:H} = [Gumbel(enc_{t,h,1:C})]_{1:T,1:H}    (2)

so that, at each time step t, each latent classification head h is assigned one of C categorical labels, c_{t,h} ∈ {1, ..., C}. The complete encoding function enc, followed by the categorization of Equation 2, is denoted E.
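As an illustration of Equations 1 and 2, the sketch below converts a T × H × C encoder output into H categorical labels per frame with a straight-through Gumbel-Softmax. The tensor names and the use of PyTorch are assumptions for illustration, not the patent's own code.

```python
# Sketch of the multi-head categorical quantization of Equations 1 and 2
# (assumed PyTorch implementation; names are illustrative).
import torch
import torch.nn.functional as F

T, H, C = 100, 64, 128        # sequence length, latent heads, categories per head

# enc(x_{1:T}, a_{1:T}) produced by the fusion block: one C-way score per head and frame.
logits = torch.randn(T, H, C)

# Equation 2: Gumbel-Softmax per head. hard=True returns one-hot samples while the
# straight-through estimator keeps the operation differentiable for training.
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)   # (T, H, C)

# Categorical labels c_{t,h} in {1, ..., C} for each time step t and head h.
labels = one_hot.argmax(dim=-1)                                  # (T, H)

# The latent space has C**H possible configurations per frame (128**64), which is
# why a moderate C and H already yield a very expressive space.
```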
The animation of the input template mesh 327 (h) is performed by the decoder 350 (D):

x̂_{1:T} = D(h, c_{1:T,1:H}),   with D : R^{V×3} × {1, ..., C}^{T×H} → R^{T×V×3}    (3)

which maps the encoded expression 341 onto the template mesh 327 (h). Decoder 350 generates an animation sequence 351 of face meshes x̂_{1:T} that looks like the person represented by template mesh 327 (h) but moves according to the expression codes c_{1:T,1:H}.
At training time, ground-truth correspondences are available in the following sense: (a) the template mesh 327, the speech signal 328, and the expression signal 329 come from the same subject, and (b) the desired output of decoder 350 (e.g., animation sequence 351) equals the expression input 329, i.e., x̂_{1:T} ≈ x_{1:T}. To complete the training, some embodiments include a cross-modal loss function L that ensures that information from both input modalities (e.g., speech signal 328 and expression signal 329) is used in the categorical latent space 340. Let x_{1:T} and a_{1:T} be a given expression sequence 329 and speech sequence 328, respectively. Further, let h_x denote the template mesh 327 of the subject represented in the signal x_{1:T}. In some embodiments, decoder 350 generates two different reconstructions instead of a single one:

x̂^(audio)_{1:T} = D(h_x, E(x̃_{1:T}, a_{1:T}))    (4)

x̂^(expr)_{1:T} = D(h_x, E(x_{1:T}, ã_{1:T}))    (5)

where x̃_{1:T} and ã_{1:T} are a randomly sampled expression sequence and audio sequence from a training database (e.g., training database 252). In some embodiments, x̂^(audio)_{1:T} is the reconstruction given the correct audio but a random expression sequence, and x̂^(expr)_{1:T} is the reconstruction given the correct expression sequence but random audio. Thus, the cross-modal loss L_xMod can be defined as:

L_xMod = Σ_{t=1}^{T} Σ_{v=1}^{V} ( M^(upper)_v · ‖x̂^(expr)_{t,v} − x_{t,v}‖² + M^(mouth)_v · ‖x̂^(audio)_{t,v} − x_{t,v}‖² )    (6)

where M^(upper)_v is a mask that assigns high weights to vertices v of the upper face and low weights to vertices around the mouth. Similarly, M^(mouth)_v assigns high weights to vertices around the mouth and low weights to the other vertices.
In some embodiments, the cross-modal loss L_xMod drives the model toward an accurate facial reconstruction independent of the audio input 328 and, likewise, an accurate audio-based reconstruction of the mouth region independent of the expression sequence 329. Since blinking is a fast and sparse event affecting only a few vertices, some embodiments include a loss L_eyelid that emphasizes the eyelid vertices during training:

L_eyelid = Σ_{t=1}^{T} Σ_{v=1}^{V} M^(eyelid)_v · ‖x̂^(expr)_{t,v} − x_{t,v}‖²    (7)

where M^(eyelid)_v is a binary mask that is one for eyelid vertices and zero for all other vertices. The final loss function L can then be optimized as L = L_xMod + L_eyelid. In some embodiments, weighting the two terms L_xMod and L_eyelid equally performs well in practice. Accordingly, other embodiments may use different weights for the L_xMod and L_eyelid losses.
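The following is a minimal sketch of how the combined loss of Equations 6 and 7 could be computed. The encoder/decoder callables, tensor shapes, and mask variables are placeholders, and equal weighting of the two terms is only one of the options described above.

```python
# Sketch of the cross-modal and eyelid losses (Equations 6 and 7); structure only,
# not the disclosed implementation.
import torch

def masked_l2(pred, target, mask):
    # pred, target: (T, V, 3); mask: (V,) per-vertex weights
    return (mask[None, :, None] * (pred - target) ** 2).sum()

def cross_modal_training_loss(decoder, encoder, template, expr, audio,
                              expr_rand, audio_rand,
                              mouth_mask, upper_mask, eyelid_mask):
    """expr: ground-truth expression meshes x_{1:T}; audio: aligned audio a_{1:T}.
    expr_rand / audio_rand: randomly sampled sequences from the training database."""
    # Equation 4: correct audio, random expression -> the mouth must still be right.
    recon_audio = decoder(template, encoder(expr_rand, audio))
    # Equation 5: correct expression, random audio -> the upper face must still be right.
    recon_expr = decoder(template, encoder(expr, audio_rand))

    # Equation 6: cross-modal loss with region-specific vertex masks.
    loss_xmod = (masked_l2(recon_expr, expr, upper_mask)
                 + masked_l2(recon_audio, expr, mouth_mask))
    # Equation 7: extra emphasis on the sparse eyelid vertices (blinks).
    loss_eyelid = masked_l2(recon_expr, expr, eyelid_mask)
    return loss_xmod + loss_eyelid       # equal weighting, one described option
```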
In some embodiments, the audio encoder 342 comprises a four-layer, one-dimensional (1D) temporal convolutional network. In some embodiments, the expression encoder 344 may include three fully connected layers followed by a single long short-term memory (LSTM) layer to capture temporal dependencies. The fusion block 330 may include a three-layer perceptron. Decoder 350 (D) may include an additional skip-connection architecture. This architectural inductive bias prevents the network from deviating too far from template mesh 327. In the bottleneck layer, the expression codes c_{1:T,1:H}, i.e., the encoded expression 341, are concatenated with the decoder's internal representation. In some embodiments, the bottleneck layer is followed by two LSTM layers to model the temporal dependencies between frames, followed by three fully connected layers that map the representation back to vertex space. By including both the audio signal 328 and the face-mesh expression input x_{1:T} (329) in the categorical latent space 340, together with a target signal that minimizes the loss function at the output of decoder 350 (see Equations 6 and 7), this approach avoids a problem that arises in many multimodal approaches, in which the "weaker" modality (e.g., audio, which generally has a lower data density) tends to be ignored.
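The following is a structural sketch, in PyTorch, of how these components could be arranged. Only the layer counts and types come from the text; hidden sizes, kernel sizes, activation functions, and the exact form of the skip connection are not specified and are assumptions, so this is an illustration rather than the disclosed implementation.

```python
# Structural sketch of the encoder/decoder described above (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D_AUDIO, H_HEADS, C_CATS, HID = 6172, 80, 64, 128, 256   # HID is assumed

class AudioEncoder(nn.Module):                 # four-layer 1D temporal conv network
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(D_AUDIO if i == 0 else HID, HID, kernel_size=3, padding=1)
             for i in range(4)])
    def forward(self, a):                      # a: (B, T, D_AUDIO)
        h = a.transpose(1, 2)
        for conv in self.convs:
            h = F.relu(conv(h))
        return h.transpose(1, 2)               # (B, T, HID)

class ExpressionEncoder(nn.Module):            # three FC layers + one LSTM layer
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(V * 3, HID), nn.ReLU(),
                                 nn.Linear(HID, HID), nn.ReLU(),
                                 nn.Linear(HID, HID))
        self.lstm = nn.LSTM(HID, HID, batch_first=True)
    def forward(self, x):                      # x: (B, T, V, 3)
        return self.lstm(self.mlp(x.flatten(2)))[0]          # (B, T, HID)

class Fusion(nn.Module):                       # three-layer perceptron (fusion block 330)
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * HID, HID), nn.ReLU(),
                                 nn.Linear(HID, HID), nn.ReLU(),
                                 nn.Linear(HID, H_HEADS * C_CATS))
    def forward(self, h_audio, h_expr):
        logits = self.mlp(torch.cat([h_audio, h_expr], dim=-1))
        return logits.unflatten(-1, (H_HEADS, C_CATS))        # (B, T, H, C)

class Decoder(nn.Module):                      # bottleneck + 2 LSTM layers + 3 FC layers
    def __init__(self):
        super().__init__()
        self.template_proj = nn.Linear(V * 3, HID)
        self.lstm = nn.LSTM(HID + H_HEADS * C_CATS, HID, num_layers=2,
                            batch_first=True)
        self.out = nn.Sequential(nn.Linear(HID, HID), nn.ReLU(),
                                 nn.Linear(HID, HID), nn.ReLU(),
                                 nn.Linear(HID, V * 3))
    def forward(self, template, codes):        # template: (B, V, 3); codes: (B, T, H, C)
        B, T = codes.shape[:2]
        t_feat = self.template_proj(template.flatten(1))      # (B, HID)
        bottleneck = torch.cat([t_feat[:, None].expand(B, T, -1),
                                codes.flatten(2)], dim=-1)
        h = self.lstm(bottleneck)[0]
        # Skip connection: predict per-vertex offsets from the neutral template mesh.
        return template[:, None] + self.out(h).unflatten(-1, (V, 3))
```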
In some embodiments, the categorical latent space 340 may be trained without the audio signal 328. The limited capacity of the categorical latent space 340 and the generalization bias of the decoder (e.g., its skip connections) ensure that, even in this case, sufficient information can be drawn from the template geometry. In some embodiments, this setup also results in a low reconstruction error, as shown in Table 1. In some embodiments, it is desirable to avoid strong entanglement between eye motion and mouth shape in the latent representation, in order to obtain accurate lip shapes while still generating temporally consistent and realistic facial motion.
TABLE 1
To quantify this effect ("perplexity"), given the categorical latent representations c_{1:T,1:H} of the test-set data, the perplexity may be calculated as follows:

Perplexity = ( Π_{t=1}^{T} Π_{h=1}^{H} p(c_{t,h} | c_{<t,1:H}, c_{t,<h}, a_{≤t}) )^(−1/(T·H))    (8)

Equation 8 is the inverse geometric mean of the likelihood of the latent representation under model 300. Intuitively, a low perplexity means that, at each prediction step, model 300 has only a small number of latent categories to choose from, while a high perplexity means that the model is less certain about which categorical representation to choose next. A perplexity of 1 means that the autoregressive model is fully deterministic, e.g., the latent embedding is fully defined by the audio input on which it is conditioned. In practice this rarely happens, because some facial motion is unrelated to the audio. In some embodiments (see the third row of Table 1), training the categorical latent space 340 on both audio and expression inputs results in a stronger and more confident model 300 than learning the latent space from the expression input alone.
The training loss of the decoder (Equations 6 and 7) determines how model 300 uses the different input modalities (audio/facial expression). Since the expression input (facial expression 329) is sufficient for accurate reconstruction, a simple loss on the desired output mesh would cause model 300 to ignore the audio input, with results similar to the case above in which audio is not given as encoder input (see rows 1-2 of Table 1). The cross-modal loss L_xMod (Equation 6) provides an effective solution by forcing model 300 to learn accurate lip shapes even when the expression input is swapped for a different, random expression. Similarly, the upper-face motion remains accurate independent of the audio input. The cross-modal loss does not hurt the expressivity of the learned latent space (see row 3 of Table 1), e.g., the reconstruction error is small for all latent-space variants, and it positively affects the perplexity of the autoregressive model (see Equation 8).
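The sketch below shows how the perplexity of Equation 8 could be computed from per-step log-likelihoods; the tensor layout is an assumption.

```python
# Sketch of the perplexity metric in Equation 8: the inverse geometric mean of the
# per-step likelihoods assigned by the autoregressive model to the test-set codes.
# log_probs[t, h] is assumed to hold log p(c_{t,h} | c_{<t,1:H}, c_{t,<h}, a_{<=t}).
import torch

def perplexity(log_probs: torch.Tensor) -> float:
    # log_probs: (T, H) log-likelihoods of the ground-truth categorical labels.
    T, H = log_probs.shape
    return torch.exp(-log_probs.sum() / (T * H)).item()

# A value of 1.0 would mean the model is fully deterministic given the audio;
# larger values indicate more remaining uncertainty about the next latent label.
```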
FIG. 4 illustrates a block diagram of an autoregressive model 400 including a preselected label 405, according to some embodiments. When the template mesh (e.g., mesh 327) is driven using only the audio input 428, the expression input x_{1:T} is not available. In the case where only one modality is given, the missing information that cannot be inferred from the audio input 428 is synthesized. Thus, some embodiments include an autoregressive temporal model 400 over the categorical latent space 440. The audio signal 428 is encoded by an audio encoder 442, and the categorical encoding space 440 is scanned in the temporal direction, head by head, with audio-conditioned latent codes 435. For each position c_{t,h} in the categorical expression space 440, an audio-conditioned latent code 435 is sampled, with the autoregressive block 445 having access to the preselected label 405.
The autoregressive temporal model 400 allows the categorical latent space 440 to be sampled to generate plausible expressions consistent with the audio input 428. According to Bayes' rule, the probability of a latent embedding c_{1:T,1:H} given an audio input a_{1:T} can be decomposed as:

p(c_{1:T,1:H} | a_{1:T}) = Π_{t=1}^{T} Π_{h=1}^{H} p(c_{t,h} | c_{<t,1:H}, c_{t,<h}, a_{≤t})    (9)

Equation 9 encodes a temporal causality in the decomposition, i.e., the category c_{t,h} at time t depends only on current and past audio information a_{≤t}, and not on the future context a_{1:T}. In some embodiments, the autoregressive block 445 is a temporal CNN comprising four convolutional layers with increasing dilation along the time axis. In some embodiments, the convolutions are masked such that, for c_{t,h}, the model has access only to all classification heads c_{<t,1:H} from the past and to the previous classification heads c_{t,<h} at the current time step (see the blocks preceding the selected block 405 in the timeline). To train the autoregressive block 445, the expression and audio sequences (x_{1:T}, a_{1:T}) are mapped to their categorical embedding (see Equations 1 and 2). A cross-entropy loss and teacher forcing on the latent classification labels are used to optimize the autoregressive block 445. At inference time, the autoregressive temporal model 400 is used to sequentially sample a categorical expression label for each position c_{t,h}.
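The sketch below illustrates the autoregressive sampling procedure implied by Equation 9; autoregressive_block is a placeholder for the masked temporal CNN described above, and its interface is an assumption.

```python
# Sketch of autoregressive sampling: latent labels are drawn head by head and frame
# by frame, conditioned on the encoded audio and on all previously sampled labels.
import torch

@torch.no_grad()
def sample_latent_codes(autoregressive_block, audio_features, T, H=64, C=128):
    # audio_features: (T, F) audio encoding from the audio encoder 442.
    labels = torch.zeros(T, H, dtype=torch.long)
    for t in range(T):
        for h in range(H):
            # The block only sees audio up to time t, labels from past frames,
            # and labels of earlier heads at the current frame (causal masking).
            logits = autoregressive_block(audio_features[: t + 1],
                                          labels[: t + 1].clone())[t, h]   # (C,)
            labels[t, h] = torch.distributions.Categorical(logits=logits).sample()
    return labels   # c_{1:T,1:H}, fed to the decoder together with the template mesh
```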
Fig. 5 illustrates a diagram 500 of the categorical latent space 540 (e.g., categorical latent spaces 340 and 440) with latent codes clustered according to the expression input, in accordance with some embodiments. The diagram 500 includes a lower face mesh 521A, a composite mesh 521B, and an upper face mesh 521C (hereinafter collectively referred to as "face meshes 521") in the categorical latent space 540. The composite mesh 521B successfully combines facial motion and lip sync from the different input modalities. In some embodiments, a categorical latent space 540 may be preferable to a continuous latent space to reduce computational complexity. In some embodiments, a continuous latent space may provide higher rendering fidelity.
Cross-modal decoupling results in a structured categorical latent space 540 in which each input modality has a different impact on the face mesh 521. In some embodiments, model 500 generates two different sets of latent representations, S_audio and S_expr. S_audio contains the latent codes obtained by fixing the expression input to the facial expression encoder (e.g., facial expression encoders 244 and 344) and varying the audio signal (lower face mesh 521A). Similarly, S_expr contains the latent codes obtained by fixing the audio signal and varying the expression input (upper face mesh 521C). In the extreme case of perfect cross-modal decoupling, S_audio and S_expr form two non-overlapping clusters 521A and 521C. A separating hyperplane 535 fitted to the points in S_audio ∪ S_expr helps visualize the resulting two-dimensional (2D) projection. Note that there is only minimal leakage between the clusters formed by S_audio and S_expr.
Fig. 6A and 6B illustrate model results 600 showing the effect of the audio input and the expression input on a lower face mesh 621A, an upper face mesh 621C, and composite meshes 621B-1 (continuous) and 621B-2 (categorical), hereinafter collectively referred to as "face meshes 621" and "composite meshes 621B", in accordance with some embodiments. The face meshes 621 include lower face vertices 610A, upper face vertices 610C, and transition vertices 610B (hereinafter collectively referred to as "vertices 610"). The face meshes 621 indicate which face vertices are moved the most by latent representations within cluster S_audio (e.g., lower face vertices 610A), by latent representations within cluster S_expr (e.g., upper face vertices 610C), and by latent representations near the decision boundary (e.g., transition vertices 610B). Although the audio primarily controls the mouth region (e.g., lower face mesh 621A) and the expression input controls the upper face mesh 621C, the latent representations near the decision boundary affect face vertices (vertices 610B) in all regions, reflecting the intuitive notion that some upper-face expressions (e.g., raising the eyebrows) are related to speech. For example, in some embodiments, the loss L_xMod (see Equation 6) results in significant cross-modal decoupling into upper-face and lower-face motion. Note, however, that in addition to its effect on the lips and chin, the audio also has a considerable effect on the eyebrow region (see vertices 611A).
Fig. 6B shows the variation of the vertices 610B of the composite face mesh 621B. Note how the facial motion collapses to an average expression for the continuous space 621B-1 (only a few vertices move; see vertices 611B-1), while the categorical space 621B-2 allows rich and diverse facial motion to be sampled (see vertices 611B-2).
To retain stochasticity in the continuous space (see mesh 621B-1), the model predicts a mean and variance for each frame, from which the representation is then sampled. At inference time, the autoregressive model predicts the mean and variance, for example, from the audio input and all past latent representations. The next embedding is then sampled based on these mean and variance predictions. In some embodiments, the lip error and the overall vertex error are greater for the continuous-space mesh 621B-1 than for the categorical latent space (see Table 2).
TABLE 2
To evaluate the quality of the lip sync generated by the embodiments disclosed herein, the lip error for a single frame may be defined as the maximal error over the lip vertices, and the average over all frames in the test set is reported. Since the upper lip and the mouth corners move much less than the lower lip, an average over all lip vertices tends to mask inaccurate lip shapes, whereas the maximal lip-vertex error per frame correlates better with perceived quality. Table 3 shows the lip-vertex errors for different models, including voice-operated character animation (VOCA), a variant in which the DeepSpeech features are replaced with Mel spectrograms, and the models disclosed herein (e.g., model 300 and autoregressive temporal model 400). Table 3 shows that the autoregressive model achieves a lower average per-frame lip error.
TABLE 3 Table 3
The quality of the models disclosed herein is largely independent of the identity selected for conditioning. Table 4 compares perceptual evaluation results for the different models disclosed herein against the tracked ground truth, judged by a total of 100 participants on three subtasks: a full-face comparison, a lip-sync comparison using only the region between the chin and the nose, and an upper-face comparison using the face from the nose upward. For each row, 400 pairs of short video clips are evaluated, each clip containing one sentence spoken by a subject from the test set. A participant may either prefer one of the two clips or rate them as equally good.
TABLE 4 Table 4
FIG. 7 illustrates different facial expressions for different subjects 727-1, 727-2, and 727-3 (hereinafter collectively referred to as "subjects 727") under the same speech input 728, according to some embodiments. The speech input 728 is an English sentence comprising three speech segments (A, B, C). Accordingly, the 3D speech animation engine disclosed herein generates facial animations 751A-1, 751B-1, and 751C-1 for subject 727-1 and speech segments A, B, and C, respectively; facial animations 751A-2, 751B-2, and 751C-2 for subject 727-2; and facial animations 751A-3, 751B-3, and 751C-3 for subject 727-3 (hereinafter collectively referred to as "facial animations 751").
The lip shapes conform to the individual speech segments A, B, and C for each of the subjects 727. In addition, unique and varied facial motions (e.g., eyebrow raises and blinks) are generated separately for each sequence (e.g., sequences 751A-1, 751A-2, 751A-3 ("sequences 751A"), sequences 751B-1, 751B-2, 751B-3 ("sequences 751B"), and sequences 751C-1, 751C-2, 751C-3 ("sequences 751C")).
FIG. 8 illustrates retargeting of facial animations 851A-1, 851B-1, and 851C-1 for a subject 827-1 (hereinafter collectively referred to as "facial animations 851-1") and facial animations 851A-2, 851B-2, and 851C-2 for a subject 827-2 (hereinafter collectively referred to as "facial animations 851-2"), jointly "facial animations 851", according to some embodiments. Retargeting is the process of mapping facial motion from one identity's face onto another identity's face. A typical application is a movie or a computer game in which an actor drives a face that is not their own.
The facial animations 851 are obtained by the 3D speech animation engine disclosed herein (see model 300) from speech segments of different subjects 827A, 827B, and 827C, respectively. It can be seen that the facial animations 851 preserve shared characteristics across the different subjects, such as lip shape, eye closure, and eyebrow height relative to the neutral expression.
The template mesh used in the model is that of the target subject 827. The 3D voice animation engine encodes the audio and the originally animated face mesh into categorical latent codes and decodes them into the facial animations 851. In some embodiments, the facial animations 851 may be obtained without an autoregressive model (e.g., autoregressive model 400).
Fig. 9 illustrates the adaptation ("mesh dubbing") of facial expressions based on audio input 927-1 (English) and audio input 927-2 (Spanish) (hereinafter collectively referred to as "multilingual audio inputs 927"), according to some embodiments: facial expressions 951A-1, 951B-1, 951C-1, 951D-1, and 951E-1 (hereinafter collectively referred to as "English facial expressions 951-1"); and facial expressions 951A-2, 951B-2, 951C-2, 951D-2, and 951E-2 (hereinafter collectively referred to as "Spanish facial expressions 951-2").
In some embodiments, the 3D voice animation engine disclosed herein (see 3D voice animation engine 232) may be applied to dubbing a video, so that the lip motion fully matches the translated speech in the multilingual audio input 927 rather than the original language. Facial expressions 951-1 and 951-2 (hereinafter collectively referred to as "facial expressions 951") have lip motion matching the multilingual audio inputs 927, while the remaining facial motion is left intact. Thus, the 3D speech animation engine re-synthesizes the lip motion for the new language 927-2. Since the categorical latent space is decoupled across modalities (see meshes 521 and 621), the lip motion adapts to the audio clip 927-2, but general facial motion, such as blinking, is preserved from the original video clip (see lower face meshes 521A and 621A and upper face meshes 521C and 621C).
Fig. 10 is a flowchart illustrating steps in a method 1000 for embedding a 3D voice animation model in a virtual reality environment, according to some embodiments. In some embodiments, method 1000 may be performed at least in part by a processor executing instructions in a client device or server disclosed herein (with reference to processor 212 and memory 220, client device 110, and server 130). In some embodiments, at least one or more steps of method 1000 may be performed by an application installed in the client device, or a 3D speech animation engine (e.g., application 222, 3D speech animation engine 232, multimodal encoder 240, and multimodal decoder 250) that includes a multimodal encoder and a multimodal decoder. As disclosed herein, a user may interact with an application in a client device via input and output elements and GUIs (referring to input device 214, output device 216, and GUI 225). As disclosed herein, the multi-modal encoder may include an audio encoder, a facial expression encoder, a convolution tool, and a synthesis encoder (e.g., audio encoder 242, facial expression encoder 244, convolution tool 246, and synthesis encoder 248). In some embodiments, methods consistent with the present disclosure may include at least one or more steps in method 1000 performed in a different order, concurrently, quasi-concurrently, or overlapping in time.
Step 1002 includes identifying audio-related facial features from an audio collection of a subject. In some embodiments, step 1002 further comprises receiving an audio acquisition of the object from the virtual reality headset. In some embodiments, step 1002 further comprises identifying the intensity and frequency of the audio acquisition from the subject and correlating the amplitude and frequency of the audio waveform with the geometry of the lower portion of the subject's face.
Step 1004 includes generating a first mesh for a lower portion of a face of the subject based on the audio-related facial features. In some embodiments, step 1004 further comprises adding a blink or eyebrow action of the subject.
Step 1006 includes identifying expression-like facial features of the subject. In some embodiments, step 1006 further includes randomly selecting the expression-like facial features based on a prior sampling of facial expressions of a plurality of subjects. In some embodiments, step 1006 further includes associating the facial features with speech features from the audio capture of the subject. In some embodiments, step 1006 further includes using a random sampling of facial expressions of multiple subjects collected during a training phase in which a second subject reads a text or holds a conversation.
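The prior sampling mentioned in step 1006 could, for example, be realized as a code-book of expression codes gathered from many subjects during training, from which codes are drawn at random at inference time. The code-book, its size, and the sampling distribution below are hypothetical placeholders, not structures defined by the disclosure.

```python
# Sketch of step 1006: drawing expression-like facial features from a prior
# built over many subjects' expressions. The code-book of expression codes and
# its size are assumptions; in practice they would come from training data.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical code-book: 512 expression codes of dimension 128, collected
# while other subjects read text or held conversations during training.
expression_codebook = rng.normal(size=(512, 128)).astype(np.float32)
code_frequencies = rng.dirichlet(np.ones(512))   # empirical prior over codes

def sample_expression_features(n_frames: int) -> np.ndarray:
    """Randomly select expression-like features, one code per animation frame."""
    idx = rng.choice(len(expression_codebook), size=n_frames, p=code_frequencies)
    return expression_codebook[idx]

upper_face_codes = sample_expression_features(n_frames=100)   # (100, 128)
```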
Step 1008 includes generating a second mesh for an upper portion of the subject's face based on the expression-like facial features. In some embodiments, step 1008 further comprises accessing a three-dimensional model of the face of the subject having a neutral expression.
Step 1010 includes forming a composite mesh using the first mesh and the second mesh. In some embodiments, step 1010 includes continuously blending lip shapes in the first mesh into eye closures in the second mesh across the face of the subject.
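One way to realize the continuous merge of step 1010 is a per-vertex blend weight that equals one near the mouth and decays to zero toward the eyes, so that lip shape comes from the first mesh and eye closure from the second. The height-based transition band below is an assumed heuristic, not a blending rule specified by the disclosure.

```python
# Sketch of step 1010: composing a face mesh from a lower-face mesh (lips from
# audio) and an upper-face mesh (eyes/brows from expression features) with a
# smooth per-vertex blend. The height-based weighting is an assumed heuristic.
import numpy as np

def composite_mesh(lower: np.ndarray, upper: np.ndarray,
                   template: np.ndarray) -> np.ndarray:
    """lower, upper, template: (N, 3) vertex arrays sharing the same topology."""
    y = template[:, 1]                                    # vertical coordinate
    lo, hi = np.percentile(y, 25), np.percentile(y, 75)   # assumed transition band
    w_lower = np.clip((hi - y) / (hi - lo), 0.0, 1.0)     # 1 near the chin, 0 near the eyes
    w_lower = w_lower[:, None]
    return w_lower * lower + (1.0 - w_lower) * upper

template = np.random.rand(6172, 3).astype(np.float32)
lower_mesh = template + 0.01 * np.random.randn(*template.shape).astype(np.float32)
upper_mesh = template + 0.01 * np.random.randn(*template.shape).astype(np.float32)
full_face = composite_mesh(lower_mesh, upper_mesh, template)
```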
Step 1012 includes determining a loss value for the composite mesh based on a ground truth image of the subject.
Step 1014 includes generating a three-dimensional model of the face of the subject using the composite mesh, based on the loss value.
Step 1016 includes providing the three-dimensional model of the face of the subject to a display in a client device running an immersive reality application that includes the subject. In some embodiments, step 1016 includes receiving the audio capture of the subject and an image capture of the subject's face, and generating the second mesh includes using the image capture.
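For step 1016, the animated mesh sequence must reach the display of the client device in some serialized form. The flat binary layout below (a small header followed by float32 vertices per frame) is an assumed transport format used only to make the hand-off concrete; the disclosure does not prescribe a serialization scheme.

```python
# Sketch of step 1016: packaging the animated face mesh for a client device.
# The header-plus-float32 layout is an assumed transport format.
import struct
import numpy as np

def pack_animation(frames: np.ndarray) -> bytes:
    """frames: (T, N, 3) float32 vertex positions of the animated face."""
    T, N, _ = frames.shape
    header = struct.pack("<II", T, N)            # frame count, vertex count
    return header + frames.astype(np.float32).tobytes()

def unpack_animation(blob: bytes) -> np.ndarray:
    T, N = struct.unpack_from("<II", blob, 0)
    verts = np.frombuffer(blob, dtype=np.float32, offset=8)
    return verts.reshape(T, N, 3)

animation = np.random.rand(100, 6172, 3).astype(np.float32)
payload = pack_animation(animation)              # would be streamed to the headset
restored = unpack_animation(payload)
assert np.allclose(animation, restored)
```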
FIG. 11 is a flowchart illustrating steps in a method 1100 for training a 3D model to create a real-time 3D speech animation of a subject, according to some embodiments. In some embodiments, method 1100 may be performed at least in part by a processor executing instructions in a client device or server disclosed herein (with reference to processor 212 and memory 220, client device 110, and server 130). In some embodiments, at least one or more steps of method 1100 may be performed by an application installed in the client device, or by a 3D speech animation engine that includes a multimodal encoder and a multimodal decoder (e.g., application 222, 3D speech animation engine 232, multimodal encoder 240, and multimodal decoder 250). As disclosed herein, a user may interact with the application in the client device via input and output elements and a GUI (referring to input device 214, output device 216, and GUI 225). As disclosed herein, the multimodal encoder may include an audio encoder, a facial expression encoder, a convolution tool, and a synthesis encoder (e.g., audio encoder 242, facial expression encoder 244, convolution tool 246, and synthesis encoder 248). In some embodiments, methods consistent with the present disclosure may include at least one or more steps in method 1100 performed in a different order, concurrently, quasi-concurrently, or overlapping in time.
Step 1102 includes determining a first correlation value of facial features based on an audio waveform from a first subject. In some embodiments, step 1102 further comprises determining a second correlation value for upper face features. In some embodiments, step 1102 includes identifying the facial features based on the intensity and frequency of the audio waveform.
Step 1104 includes generating a first mesh for a lower portion of the face based on the facial features and the first correlation value. In some embodiments, step 1104 further includes generating a second mesh for the upper portion of the face based on the upper face features and the second correlation value, and forming a composite mesh using the first mesh and the second mesh.
Step 1106 includes updating the first correlation value based on a difference between the first mesh and a ground truth image of the first subject.
Step 1108 includes providing, based on the difference between the first mesh and the ground truth image of the first subject, a three-dimensional model of the face animated by speech to an immersive reality application accessed by the client device. In some embodiments, step 1108 further comprises forming the three-dimensional model of the face animated by speech using the composite mesh. In some embodiments, step 1108 includes determining a loss value for the first mesh based on the ground truth image of the first subject. In some embodiments, step 1108 includes updating the first correlation value of the facial features based on an audio waveform from a second subject.
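Method 1100 can be summarized as a standard supervised loop: the correlation values play the role of learnable weights mapping audio features to a lower-face mesh, and they are updated from the difference between the predicted mesh and ground-truth geometry. The toy regressor, loss, and optimizer below are assumptions chosen for brevity, not the training setup disclosed above.

```python
# Sketch of method 1100: learn "correlation values" (here, the weights of a
# small regressor) that map audio-derived features to a lower-face mesh, and
# update them from the difference to ground-truth geometry. The architecture,
# loss, and optimizer are illustrative assumptions.
import torch
import torch.nn as nn

N_LOWER_VERTS, FEAT_DIM = 3000, 81

model = nn.Sequential(                   # weights act as the correlation values
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, N_LOWER_VERTS * 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def training_step(audio_features, ground_truth_verts):
    """audio_features: (B, FEAT_DIM); ground_truth_verts: (B, N_LOWER_VERTS, 3)."""
    pred = model(audio_features).reshape(-1, N_LOWER_VERTS, 3)   # first mesh
    loss = criterion(pred, ground_truth_verts)                    # difference to ground truth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # adjusts the correlation values to reduce the difference
    return loss.item()

for _ in range(5):                       # toy loop over random batches
    feats = torch.randn(8, FEAT_DIM)
    gt = torch.randn(8, N_LOWER_VERTS, 3)
    training_step(feats, gt)
```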
Hardware overview
Fig. 12 is a block diagram illustrating an exemplary computer system 1200 with which the clients and servers of Figs. 1 and 2, and the methods of Figs. 10 and 11, may be implemented. In some aspects, computer system 1200 may be implemented using hardware, or a combination of software and hardware, in a dedicated server, or integrated into another entity, or distributed across multiple entities.
Computer system 1200 (e.g., client 110 and server 130) includes a bus 1208 or other communication mechanism for communicating information, and a processor 1202 (e.g., processor 212) coupled with bus 1208 for processing information. By way of example, computer system 1200 may be implemented with one or more processors 1202. Processor 1202 may be a general-purpose microprocessor, a microcontroller, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a state machine, gated logic, discrete hardware components, or any other suitable entity that can perform calculations or other manipulations of information.
In addition to hardware, computer system 1200 may include code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them, stored in an included memory 1204 (e.g., memory 220), such as a Random Access Memory (RAM), a flash memory, a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable PROM (EPROM), registers, a hard disk, a removable disk, a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), or any other suitable storage device, coupled with bus 1208 for storing information and instructions to be executed by processor 1202. The processor 1202 and the memory 1204 can be supplemented by, or incorporated in, special purpose logic circuitry.
The instructions may be stored in the memory 1204 and may be implemented in one or more computer program products, e.g., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, the computer system 1200, and according to any method well known to those skilled in the art, including, but not limited to, computer languages such as data-oriented languages (e.g., SQL, dBase), system languages (e.g., C, Objective-C, C++, assembly), architectural languages (e.g., Java, .NET), and application languages (e.g., PHP, Ruby, Perl, Python). The instructions may also be implemented in computer languages such as array languages, aspect-oriented languages, assembly languages, authoring languages, command line interface languages, compiled languages, concurrent languages, curly-bracket languages, dataflow languages, data-structured languages, declarative languages, esoteric languages, extension languages, fourth-generation languages, functional languages, interactive mode languages, interpreted languages, iterative languages, list-based languages, little languages, logic-based languages, machine languages, macro languages, metaprogramming languages, multiparadigm languages, numerical analysis languages, non-English-based languages, object-oriented class-based languages, object-oriented prototype-based languages, off-side rule languages, procedural languages, reflective languages, rule-based languages, scripting languages, stack-based languages, synchronous languages, syntax handling languages, visual languages, Wirth languages, and XML-based languages. Memory 1204 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1202.
Computer programs as discussed herein do not necessarily correspond to files in a file system. A program can be stored in a portion of a file (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
Computer system 1200 also includes a data storage device 1206, such as a magnetic disk or optical disk, coupled to bus 1208 for storing information and instructions. Computer system 1200 may be coupled to a variety of devices through input/output module 1210. The input/output module 1210 may be any input/output module. Exemplary input/output modules 1210 include data ports such as Universal Serial Bus (USB) ports. The input/output module 1210 is configured to be connected to a communication module 1212. Exemplary communication modules 1212 (e.g., communications module 218) include networking interface cards, such as Ethernet cards and modems. In certain aspects, the input/output module 1210 is configured to connect to a plurality of devices, such as an input device 1214 (e.g., input device 214) and/or an output device 1216 (e.g., output device 216). Exemplary input devices 1214 include a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to computer system 1200. Other types of input devices 1214, such as tactile input devices, visual input devices, audio input devices, or brain-computer interface devices, may also be used to provide interaction with a user. For example, feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic input, speech input, tactile input, or brain wave input. Exemplary output devices 1216 include display devices, such as a liquid crystal display (LCD) monitor, for displaying information to the user.
According to one aspect of the disclosure, the client 110 and the server 130 may be implemented using the computer system 1200 in response to the processor 1202 executing one or more sequences of one or more instructions contained in the memory 1204. Such instructions may be read into memory 1204 from another machine-readable medium, such as data storage device 1206. Execution of the sequences of instructions contained in main memory 1204 causes processor 1202 to perform the process steps described herein. One or more processors in a multi-processing arrangement may also be employed to execute the sequences of instructions contained in memory 1204. In alternative aspects, hard-wired circuitry may be used in place of or in combination with software instructions to implement various aspects of the present disclosure. Thus, aspects of the disclosure are not limited to any specific combination of hardware circuitry and software.
Aspects of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification), or in any combination of one or more such back-end components, one or more such middleware components, or one or more such front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). The communication network (e.g., network 150) may include, for example, any one or more of a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, and the like. Further, the communication network may include, but is not limited to, for example, any one or more of the following network topologies: a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, or the like. The communication module may be, for example, a modem or an Ethernet card.
Computer system 1200 may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. Computer system 1200 may be, for example, but is not limited to, a desktop computer, a laptop computer, or a tablet computer. Computer system 1200 may also be embedded in another device, such as, but not limited to, a mobile phone, a Personal Digital Assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, a video game console, and/or a television set-top box.
The term "machine-readable storage medium" or "computer-readable medium" as used herein refers to any medium or media that participates in providing instructions to processor 1202 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as data storage device 1206. Volatile media includes dynamic memory, such as memory 1204. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1208. Common forms of machine-readable media include, for example, a floppy disk (floppy disk), a flexible disk (hard disk), magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH EPROM, any other memory chip or cartridge, or any other medium from which a computer can read. The machine-readable storage medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a combination of substances affecting a machine-readable propagated signal, or a combination of one or more of them.
To illustrate the interchangeability of hardware and software, various illustrative blocks, modules, components, methods, operations, instructions, and algorithms have been described generally in terms of their functionality. Whether such functionality is implemented as hardware, software, or a combination of hardware and software depends upon the particular application and design constraints imposed on the overall system. Those skilled in the art can implement the described functionality in varying ways for each particular application.
As used herein, the phrase "at least one of" after a series of items, together with the term "and" or "separating any of those items, modifies the list as a whole, rather than modifying each element of the list (e.g., each item). The phrase "at least one of" does not require that at least one item be selected; rather, the phrase is intended to include at least one of any of these items, and/or at least one of any combination of these items, and/or at least one of each of these items. As an example, the phrase "at least one of A, B and C" or "at least one of A, B or C" each refer to: only a, only B or only C; A. any combination of B and C; and/or, at least one of each of A, B and C.
To the extent that the terms "includes," "having," and the like are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. The word "exemplary" is used herein to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more. All structural and functional equivalents to the elements of the various configurations described throughout this disclosure that are known to those of ordinary skill in the art are intended to be encompassed by the subject technology. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the above description.
While this specification contains many specifics, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of particular embodiments of the subject matter. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
The subject matter of the present specification has been described in terms of particular aspects, but other aspects can be implemented and are within the scope of the following claims. For example, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. The actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the various aspects described above should not be understood as requiring such separation in all aspects, but rather, it should be understood that the described program components and systems can be generally integrated together in one software product or packaged into multiple software products. Other variations are within the scope of the following claims.

Claims (15)

1. A computer-implemented method, comprising:
identifying audio-related facial features from an audio capture of a subject;
generating a first mesh for a lower portion of the face of the subject based on the audio-related facial features;
identifying expression-like facial features of the subject;
generating a second mesh for an upper portion of the subject's face based on the expression-like facial features;
forming a composite mesh using the first mesh and the second mesh;
determining a loss value for the composite mesh based on a ground truth image of the subject;
generating a three-dimensional model of the face of the subject using the composite mesh based on the loss value; and
providing the three-dimensional model of the face of the subject to a display in a client device running an immersive reality application including the subject.
2. The computer-implemented method of claim 1, further comprising: receiving the audio capture of the subject from a virtual reality headset.
3. The computer-implemented method of claim 1 or 2, wherein identifying the audio-related facial features comprises: identifying the intensity and frequency of the audio capture from the subject and correlating the amplitude and frequency of the audio waveform with the geometry of the lower portion of the subject's face.
4. The computer-implemented method of any of the preceding claims, wherein generating the first mesh comprises: including blinking or eyebrow movements of the subject.
5. The computer-implemented method of any of the preceding claims, wherein identifying the expression-like facial features of the subject comprises one or more of:
randomly selecting the expression-like facial features based on a prior sampling of facial expressions of a plurality of subjects;
associating an upper face feature with a speech feature from the audio capture of the subject;
using a random sampling of facial expressions of multiple subjects collected during a training phase in which a second subject reads a text or holds a conversation.
6. The computer-implemented method of any of the preceding claims, wherein generating the second mesh comprises: accessing a three-dimensional model of the face of the subject having a neutral expression.
7. The computer-implemented method of any of the preceding claims, wherein forming the composite mesh comprises: continuously blending lip shapes in the first mesh into eye closures in the second mesh across the face of the subject.
8. The computer-implemented method of any of the preceding claims, further comprising receiving the audio capture of the subject and an image capture of the face of the subject, wherein generating the second mesh comprises using the image capture.
9. A system, comprising:
one or more processors; and
a memory storing instructions that, when executed by the one or more processors, cause the system to:
identify audio-related facial features from an audio capture of a subject;
generate a first mesh for a lower portion of the face of the subject based on the audio-related facial features;
identify expression-like facial features of the subject;
generate a second mesh for an upper portion of the subject's face based on the expression-like facial features;
form a composite mesh using the first mesh and the second mesh;
determine a loss value for the composite mesh based on a ground truth image of the subject;
generate a three-dimensional model of the face of the subject using the composite mesh based on the loss value; and
provide the three-dimensional model of the face of the subject to a display in a client device running an immersive reality application including the subject.
10. The system of claim 9, wherein the one or more processors further execute instructions to receive the audio capture of the subject from a virtual reality headset.
11. The system of claim 9 or 10, wherein, to identify the expression-like facial features of the subject, the one or more processors execute instructions to perform one or more of:
randomly selecting the expression-like facial features based on a prior sampling of facial expressions of a plurality of subjects;
associating an upper face feature with a speech feature from the audio capture of the subject;
using a random sampling of facial expressions of multiple subjects collected during a training phase in which a second subject reads a text or holds a conversation.
12. A computer-implemented method, comprising:
determining a first correlation value of facial features based on an audio waveform from a first subject;
generating a first mesh for a lower portion of a face based on the facial features and the first correlation value;
updating the first correlation value based on a difference between the first mesh and a ground truth image of the first subject; and
providing, based on the difference between the first mesh and the ground truth image of the first subject, a three-dimensional model of the face animated by speech to an immersive reality application accessed by a client device.
13. The computer-implemented method of claim 12, further comprising:
determining a second correlation value of upper face features;
generating a second mesh for an upper portion of the face based on the upper face features and the second correlation value;
forming a composite mesh using the first mesh and the second mesh; and
forming the three-dimensional model of the face animated by speech using the composite mesh.
14. The computer-implemented method of claim 12 or 13, wherein determining the first correlation value of the facial features comprises: identifying the facial features based on the intensity and frequency of the audio waveform.
15. The computer-implemented method of any of claims 12 to 14, further comprising:
determining a loss value of the first mesh based on the ground truth image of the first subject; and/or
updating the first correlation value of the facial features based on an audio waveform from a second subject.
CN202280022450.3A 2021-03-16 2022-03-13 Three-dimensional facial animation based on speech Pending CN116997934A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/161,848 2021-03-16
US17/669,270 2022-02-10
US17/669,270 US11756250B2 (en) 2021-03-16 2022-02-10 Three-dimensional face animation from speech
PCT/US2022/020089 WO2022197569A1 (en) 2021-03-16 2022-03-13 Three-dimensional face animation from speech

Publications (1)

Publication Number Publication Date
CN116997934A true CN116997934A (en) 2023-11-03

Family

ID=88534348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280022450.3A Pending CN116997934A (en) 2021-03-16 2022-03-13 Three-dimensional facial animation based on speech

Country Status (1)

Country Link
CN (1) CN116997934A (en)

Similar Documents

Publication Publication Date Title
Richard et al. Meshtalk: 3d face animation from speech using cross-modality disentanglement
Bhattacharya et al. Speech2affectivegestures: Synthesizing co-speech gestures with generative adversarial affective expression learning
US11210836B2 (en) Applying artificial intelligence to generate motion information
Ferreira et al. Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio
US20200151438A1 (en) Compact Language-Free Facial Expression Embedding and Novel Triplet Training Scheme
Nyatsanga et al. A Comprehensive Review of Data‐Driven Co‐Speech Gesture Generation
Sadoughi et al. Speech-driven expressive talking lips with conditional sequential generative adversarial networks
Chiu et al. How to train your avatar: A data driven approach to gesture generation
US11756250B2 (en) Three-dimensional face animation from speech
Chiu et al. Gesture generation with low-dimensional embeddings
US20230177384A1 (en) Attention Bottlenecks for Multimodal Fusion
US11544886B2 (en) Generating digital avatar
Fares et al. Zero-shot style transfer for gesture animation driven by text and speech using adversarial disentanglement of multimodal style encoding
WO2024100129A1 (en) Audio-driven body motion synthesis using a diffusion probabilistic model
He et al. LLMs Meet Multimodal Generation and Editing: A Survey
Kim et al. Co-Speech Gesture Generation via Audio and Text Feature Engineering
CN116997934A (en) Three-dimensional facial animation based on speech
Pham et al. Learning continuous facial actions from speech for real-time animation
Bhattacharya et al. Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs
Zhao et al. Generating diverse gestures from speech using memory networks as dynamic dictionaries
Lei et al. [Retracted] Dance Evaluation Based on Movement and Neural Network
Fares et al. I-Brow: Hierarchical and Multimodal Transformer Model for Eyebrows Animation Synthesis
Gibet et al. Challenges for the animation of expressive virtual characters: The standpoint of sign language and theatrical gestures
Adversarial Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks
Alonso de Apellániz Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination