WO2023146741A1 - Method, apparatus and computer program - Google Patents
Method, apparatus and computer program
- Publication number
- WO2023146741A1 (application PCT/US2023/010261)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- machine learning
- learning model
- avatar
- movements
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/205—Three-dimensional [3D] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/40—Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Definitions
- avatars animated only by audio signals may have visual movement animations ranging from the very simple, such as mouth flapping with no other motion, through to full motion.
- the mapping of sound to movements may be artificial (that is, unrelated to the sound except that the sound is present or absent), or based on a generic model (that is, using some known sounds to animate some known movements, such as rounding the mouth during an "O" sound), or some combination thereof.
- Current audio-driven animations focus on the mouth, or on the mouth and facial expression. Other head and/or body motion tends to be absent or wholly artificial and generic to all users of a service.
- Figure 8 shows an example computing device
- an avatar may be considered to comprise a visual digital representation of a user.
- the avatar may change pose during communication with other users.
- the pose changes may be made dependent on audio data received from the user being represented by the avatar.
- the pose changes may comprise changes in head position of the avatar.
- an avatar may be based on a user’s appearance.
- the avatar may be created based on image data corresponding to the user.
- the image data may be used to create a mesh corresponding to a user’s appearance and then image data may be used to overlay further details (e.g., skin, hair, makeup) onto the mesh to create an avatar corresponding to the user’s appearance.
- the avatar may therefore represent a close likeness of the user.
- the user may choose to customize or change the appearance of the avatar such that the avatar does not represent a close likeness of the user.
- An avatar may be considered to comprise a virtual representation of a user.
- the avatar may comprise a representation of the user from the shoulders up or from the neck up.
- mesh 212 may be proprietary geometry designed by a designer (e.g., an artist, etc.) and informed by three-dimensional scans of human heads. However, any mesh representation of a head geometry can be used.
- Figure 2A describes an example method 200 of training a base model.
- the example method can be used to determine a ‘base’ model f_0 that is considered ‘personalizable’ for a particular user.
- Base model f_0 is trained (or meta-trained) on a pre-existing training dataset.
- the pre-existing dataset will have many users, each with one or more corresponding videos of the user talking.
- the one or more corresponding videos for each user will comprise a small number of videos for each user.
- the small number of videos may be between 1 and 5.
- Each of the videos of the training dataset will have head vertices labelled.
- Each of the videos may comprise audio data synchronized with image data.
- User audio and image data 208 may be captured by image receiving equipment and audio receiving equipment of a user device e.g., of user device 102a. User audio and image data may also be downloaded or uploaded to user device 102a.
- user device 102a receives video data. This may be captured by user device 102a or may be uploaded or downloaded to user device 102a.
- the video data may comprise one or more videos.
- the videos may comprise audio data and synchronized image data.
- the videos may comprise videos of the user operating user device 102a.
- FIG. 6 shows an example neural network 600 that could be used in examples for any of neural networks g_s, g'_e, f'_e.
- the neural network is shown as an example only, and neural networks having different structures can be used in methods and systems as described herein. It should also be noted that neural networks having more layers and/or nodes may be used.
- Each node e.g., node 662a, node 662b, node 662c, node 664
- Each edge e.g., edge 666a, edge 666b, edge 666c, etc.
- to compute the value of a node in a given layer, take the value of each node that is connected to it from the previous layer, multiply it by its associated edge value/weight, then add these products together.
- a nonlinear function e.g., a sigmoid, or tanh
- This is repeated to get the value of each node in a given layer and is repeated for each layer until the output layer is reached.
- the output is a 3-dimensional vector, which could represent the probability the model predicts that the input belongs to 3 possible object classes.
- the neural network has 1 hidden layer of nodes. In the case of a deep neural network, which may be used in some examples as a neural network in the methods disclosed herein, there is typically more than 1 hidden layer.
- any neural network may be used as an ML model as disclosed herein and Figure 6 is shown as an example only.
- any of the following may be used: sequential variational autoencoders (VAEs); convolutional neural networks; generative models; discriminative models; etc.
- method 900 comprises using the predicted movements of the user to generate animation of an avatar of the user.
- According to a first aspect, there is provided a computer-implemented method comprising: receiving, from a user device, video data from a user; training a first machine learning model based on the video data to provide a second machine learning model, the second machine learning model being personalized to the user, wherein the second machine learning model is configured to predict movement of the user based on audio data; receiving further audio data from the user; determining predicted movements of the user based on the further audio data and the second machine learning model; and using the predicted movements of the user to generate animation of an avatar of the user.
- At least one of the first machine learning model, the second machine learning model, the third machine learning model and the fourth machine learning model comprises: a convolutional and/or a sequential neural network configured to operate on audio data to predict changes in head pose, expression, and lip movements.
- the instructions executable by a processor of the third aspect may be for performing any of the steps of the examples of the method of the first aspect.
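The layer-by-layer computation described in the definitions above (a weighted sum over the connected nodes of the previous layer, followed by a nonlinear function, repeated until the output layer) can be sketched in a few lines of Python. This is a minimal illustration only, not the network of FIG. 6: the weight values, the 2-3-3 layer sizes, the absence of bias terms, and the softmax used to turn the 3-dimensional output into class probabilities are all assumptions made for the example.

```python
import math


def sigmoid(x):
    """Nonlinear function applied to a node's weighted sum."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, activation=sigmoid):
    """Compute one layer; weights[j][i] is the edge weight from
    previous-layer node i to node j of this layer."""
    outputs = []
    for node_weights in weights:
        # Multiply each connected node value by its edge weight, then add.
        total = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(total))
    return outputs


def softmax(values):
    """Map raw outputs to probabilities summing to 1 (assumed output stage)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, hidden_weights, output_weights):
    """One hidden layer, then a 3-node output layer."""
    hidden = layer_forward(inputs, hidden_weights)
    raw = layer_forward(hidden, output_weights, activation=lambda t: t)
    return softmax(raw)


# Hypothetical weights: 2 input nodes, 3 hidden nodes, 3 output nodes.
HIDDEN_WEIGHTS = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
OUTPUT_WEIGHTS = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.6], [-0.4, 0.9, 0.5]]

probs = forward([1.0, 2.0], HIDDEN_WEIGHTS, OUTPUT_WEIGHTS)
print(probs)
```

A deeper network would simply call `layer_forward` once per additional hidden layer before the output stage.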
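The flow of the first aspect (start from a base model, train it on a user's own data to obtain a personalized model, then predict movements from further audio) can be illustrated with a deliberately tiny stand-in. The sketch below assumes a scalar audio feature and a scalar head angle, a linear model, and plain gradient descent on mean squared error; the record itself describes neural networks and labelled head vertices, so every name and number here (`LinearModel`, `personalize`, the sample values) is hypothetical.

```python
class LinearModel:
    """Toy stand-in for a movement predictor: angle = weight * feature + bias."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def predict(self, audio_feature):
        return self.weight * audio_feature + self.bias


def mse(model, samples):
    """Mean squared error over (audio_feature, head_angle) pairs."""
    return sum((model.predict(x) - y) ** 2 for x, y in samples) / len(samples)


def personalize(base, samples, steps=1000, lr=0.1):
    """Fine-tune a copy of the base model on one user's samples.

    The base model is left unchanged; a new personalized model is returned.
    """
    w, b = base.weight, base.bias
    n = len(samples)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return LinearModel(w, b)


# Hypothetical base model, as if trained beforehand on many users.
base_model = LinearModel(weight=1.0, bias=0.0)

# Hypothetical per-user data extracted from a short video:
# (audio feature, observed head angle). Here the user follows y = 2x + 0.5.
user_samples = [(0.0, 0.5), (0.5, 1.5), (1.0, 2.5)]

personal_model = personalize(base_model, user_samples)

# Further audio from the same user drives the predicted movements.
further_audio = [0.25, 0.75]
predicted = [personal_model.predict(x) for x in further_audio]
print(predicted)
```

The predicted values would then be passed to whatever component animates the avatar; that step is outside the scope of this sketch.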
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Processing Or Creating Images (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020247025950A KR20240142448A (ko) | 2022-01-31 | 2023-01-06 | Method, apparatus and computer program |
| JP2024536188A JP2025505340A (ja) | 2022-01-31 | 2023-01-06 | Method, apparatus and computer program |
| CN202380019541.6A CN118648026A (zh) | 2022-01-31 | 2023-01-06 | Method, apparatus and computer program |
| EP23704845.9A EP4473490A1 (en) | 2022-01-31 | 2023-01-06 | Method, apparatus and computer program |
| US18/726,789 US20250069308A1 (en) | 2022-01-31 | 2023-01-06 | Method, apparatus and computer program |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP22154373.9A EP4220565A1 (en) | 2022-01-31 | 2022-01-31 | Method, apparatus and computer program |
| EP22154373.9 | 2022-01-31 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023146741A1 (en) | 2023-08-03 |
Family
ID=80119033
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/010261 Ceased WO2023146741A1 (en) | 2022-01-31 | 2023-01-06 | Method, apparatus and computer program |
Country Status (6)
| Country | Link |
|---|---|
| US (1) | US20250069308A1 |
| EP (2) | EP4220565A1 |
| JP (1) | JP2025505340A |
| KR (1) | KR20240142448A |
| CN (1) | CN118648026A |
| WO (1) | WO2023146741A1 |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12518482B2 (en) * | 2023-08-10 | 2026-01-06 | Qualcomm Incorporated | Virtual representative conditioning system |
| US12367425B1 (en) | 2024-01-12 | 2025-07-22 | THIA ST Co. | Copilot customization with data producer(s) |
| US12242503B1 (en) | 2024-01-12 | 2025-03-04 | THIA ST Co. | Copilot architecture: network of microservices including specialized machine learning tools |
| US12536045B2 (en) | 2024-01-12 | 2026-01-27 | THIA ST Co. | Distribution of tasks among microservices in a copilot |
| US12367426B1 (en) * | 2024-01-12 | 2025-07-22 | THIA ST Co. | Customization of machine learning tools with occupation training |
| US20250267239A1 (en) * | 2024-02-15 | 2025-08-21 | Microsoft Technology Licensing, Llc | Generative communication session event effects |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160134840A1 (en) * | 2014-07-28 | 2016-05-12 | Alexa Margaret McCulloch | Avatar-Mediated Telepresence Systems with Enhanced Filtering |
| US20190122411A1 (en) * | 2016-06-23 | 2019-04-25 | LoomAi, Inc. | Systems and Methods for Generating Computer Ready Animation Models of a Human Head from Captured Data Images |
| US10755463B1 (en) * | 2018-07-20 | 2020-08-25 | Facebook Technologies, Llc | Audio-based face tracking and lip syncing for natural facial animation and lip movement |
| US20200302184A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
| US20210056348A1 (en) * | 2019-08-19 | 2021-02-25 | Neon Evolution Inc. | Methods and systems for image and voice processing |
| WO2021155140A1 (en) * | 2020-01-29 | 2021-08-05 | Google Llc | Photorealistic talking faces from audio |
| US11127225B1 (en) | 2020-06-01 | 2021-09-21 | Microsoft Technology Licensing, Llc | Fitting 3D models of composite objects |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3797404A4 (en) * | 2018-05-22 | 2022-02-16 | Magic Leap, Inc. | SKELETAL SYSTEMS FOR ANIMATION OF VIRTUAL AVATARS |
| WO2022103877A1 (en) * | 2020-11-13 | 2022-05-19 | Innopeak Technology, Inc. | Realistic audio driven 3d avatar generation |
| US11734888B2 (en) * | 2021-04-23 | 2023-08-22 | Meta Platforms Technologies, Llc | Real-time 3D facial animation from binocular video |
-
2022
- 2022-01-31 EP EP22154373.9A patent/EP4220565A1/en not_active Withdrawn
-
2023
- 2023-01-06 KR KR1020247025950A patent/KR20240142448A/ko active Pending
- 2023-01-06 EP EP23704845.9A patent/EP4473490A1/en active Pending
- 2023-01-06 CN CN202380019541.6A patent/CN118648026A/zh active Pending
- 2023-01-06 JP JP2024536188A patent/JP2025505340A/ja active Pending
- 2023-01-06 US US18/726,789 patent/US20250069308A1/en active Pending
- 2023-01-06 WO PCT/US2023/010261 patent/WO2023146741A1/en not_active Ceased
Non-Patent Citations (5)
| Title |
|---|
| DANIEL CUDEIRO ET AL: "Capture, Learning, and Synthesis of 3D Speaking Styles", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 May 2019 (2019-05-08), XP081270322 * |
| FINN ET AL.: "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", PROCEEDINGS OF THE 34TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, PMLR, vol. 70, 2017, pages 1126 - 1135 |
| REQUEIMA ET AL.: "Fast and Flexible Multi-Task Classification Using Conditional Neural Adaptive Processes", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 32, 2019, pages 7957 - 7968 |
| SNELL ET AL.: "Prototypical Networks for Few-Shot Learning", 19 June 2017 (2017-06-19) |
| YOSINSKI ET AL.: "How transferable are features in neural networks?", CORR ABS/1411.1792, 2014, Retrieved from the Internet <URL:https://arxiv.org/abs/1411.1792> |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118648026A (zh) | 2024-09-13 |
| US20250069308A1 (en) | 2025-02-27 |
| KR20240142448A (ko) | 2024-09-30 |
| EP4220565A1 (en) | 2023-08-02 |
| EP4473490A1 (en) | 2024-12-11 |
| JP2025505340A (ja) | 2025-02-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20250069308A1 (en) | | Method, apparatus and computer program |
| US12277640B2 (en) | | Photorealistic real-time portrait animation |
| KR102758381B1 (ko) | | Integrated input/output for a three-dimensional (3D) environment |
| KR102863164B1 (ko) | | Methods and systems for realistic head turns and face animation synthesis on a mobile device |
| US11410364B2 (en) | | Systems and methods for realistic head turns and face animation synthesis on mobile device |
| US11114086B2 (en) | | Text and audio-based real-time face reenactment |
| US10528801B2 (en) | | Method and system for incorporating contextual and emotional visualization into electronic communications |
| US20220392133A1 (en) | | Realistic head turns and face animation synthesis on mobile device |
| KR102509666B1 (ko) | | Text and audio-based real-time face reenactment |
| CN114766016A (zh) | | Device, method and program for enhancing output content through iterative generation |
| US11983808B2 (en) | | Conversation-driven character animation |
| KR20250088614A (ko) | | Stylizing a person's whole body |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23704845 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2024536188 Country of ref document: JP |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 18726789 Country of ref document: US |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 202417055724 Country of ref document: IN |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 202380019541.6 Country of ref document: CN |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023704845 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 2023704845 Country of ref document: EP Effective date: 20240902 |
|
| WWP | Wipo information: published in national office |
Ref document number: 18726789 Country of ref document: US |