WO2023146741A1 - Method, apparatus and computer program - Google Patents

Method, apparatus and computer program

Info

Publication number
WO2023146741A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
machine learning
learning model
avatar
movements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2023/010261
Other languages
English (en)
French (fr)
Inventor
Pashmina Jonathan Cameron
Cecily Peregrine Borgatti Morrison
Martin Philip GRAYSON
Daniela MASSICETI
Matthew Alastair JOHNSON
Edward Sean Lloyd RINTEL
Rita FAIA MARQUES
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to KR1020247025950A (published as KR20240142448A)
Priority to JP2024536188A (published as JP2025505340A)
Priority to CN202380019541.6A (published as CN118648026A)
Priority to EP23704845.9A (published as EP4473490A1)
Priority to US18/726,789 (published as US20250069308A1)
Publication of WO2023146741A1
Anticipated expiration
Legal status: Ceased (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 Three-dimensional [3D] animation
    • G06T13/205 Three-dimensional [3D] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 Three-dimensional [3D] animation
    • G06T13/40 Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Definitions

  • avatars animated only by audio signals may have visual movement animations ranging from the very simple, such as mouth flapping with no other motion, through to full motion.
  • the mapping of sound to movements may be artificial (that is, unrelated to the sound except that the sound is present or absent), or based on a generic model (that is, using some known sounds to animate some known movements, such as rounding the mouth during an "O" sound), or some combination thereof.
  • Current audio-driven animations focus on the mouth, or on the mouth and facial expression. Other head and/or body motion tends to be absent, or wholly artificial and generic to all users of a service.
  • Figure 8 shows an example computing device.
  • an avatar may be considered to comprise a visual digital representation of a user.
  • the avatar may change pose during communication with other users.
  • the pose changes may be made dependent on audio data received from the user being represented by the avatar.
  • the pose changes may comprise changes in head position of the avatar.
  • an avatar may be based on a user’s appearance.
  • the avatar may be created based on image data corresponding to the user.
  • the image data may be used to create a mesh corresponding to a user’s appearance and then image data may be used to overlay further details (e.g., skin, hair, makeup) onto the mesh to create an avatar corresponding to the user’s appearance.
  • the avatar may therefore represent a close likeness of the user.
  • the user may choose to customize or change the appearance of the avatar such that the avatar does not represent a close likeness of the user.
  • An avatar may be considered to comprise a virtual representation of a user.
  • the avatar may comprise a representation of the user from the shoulders up or from the neck up.
  • mesh 212 may be proprietary geometry designed by a designer (e.g., an artist, etc.) and informed by three dimensional scans of human heads. However, any mesh representation of a head geometry can be used.
  • Figure 2A describes an example method 200 of training a base model.
  • the example method can be used to determine a ‘base’ model f_0 that is considered ‘personalizable’ for a particular user.
  • Base model f_0 is trained (or meta-trained) on a pre-existing training dataset.
  • the pre-existing dataset will have many users, each with one or more corresponding videos of the user talking.
  • the one or more corresponding videos for each user will comprise a small number of videos for each user.
  • the small number of videos may be between 1 and 5.
  • Each of the videos of the training dataset will have head vertices labelled.
  • Each of the videos may comprise audio data synchronized with image data.
  • User audio and image data 208 may be captured by image receiving equipment and audio receiving equipment of a user device e.g., of user device 102a. User audio and image data may also be downloaded or uploaded to user device 102a.
  • user device 102a receives video data. This may be captured by user device 102a or may be uploaded or downloaded to user device 102a.
  • the video data may comprise one or more videos.
  • the videos may comprise audio data and synchronized image data.
  • the videos may comprise videos of the user operating user device 102a.
  • Figure 6 shows an example neural network 600 that could be used in examples for any of neural networks g_s, g'_e, f'_e.
  • the neural network is shown as an example only, and neural networks having different structures can be used in methods and systems as described herein. It should also be noted that neural networks having more layers and/or nodes may be used.
  • Each node e.g., node 662a, node 662b, node 662c, node 664
  • Each edge e.g., edge 666a, edge 666b, edge 666c, etc.
  • to compute the value of a node in a given layer, take each node that is connected to it from the previous layer, multiply that node's value by its associated edge value/weight, then add these products together.
  • a nonlinear function e.g., a sigmoid, or tanh
  • This is repeated to get the value of each node in a given layer and is repeated for each layer until the output layer is reached.
  • the output is a 3-dimensional vector, which could represent the probability the model predicts that the input belongs to 3 possible object classes.
  • the neural network has 1 hidden layer of nodes. In the case of a deep neural network, which may be used in some examples as a neural network in the methods disclosed herein, there is typically more than 1 hidden layer.
  • any neural network may be used as an ML model as disclosed herein and Figure 6 is shown as an example only.
  • any of the following may be used: sequential variational autoencoders (VAEs); convolutional neural networks; generative models; discriminative models; etc.
  • method 900 comprises using the predicted movements of the user to generate animation of an avatar of the user.
  • According to a first aspect, there is provided a computer-implemented method comprising: receiving, from a user device, video data from a user; training a first machine learning model based on the video data to provide a second machine learning model, the second machine learning model being personalized to the user, wherein the second machine learning model is configured to predict movement of the user based on audio data; receiving further audio data from the user; determining predicted movements of the user based on the further audio data and the second machine learning model; and using the predicted movements of the user to generate animation of an avatar of the user.
  • At least one of the first machine learning model, the second machine learning model, the third machine learning model and the fourth machine learning model comprises: a convolutional and/or a sequential neural network configured to operate on audio data to predict changes in head pose, expression, and lip movements.
  • the instructions executable by a processor of the third aspect may be for performing any of the steps of the examples of the method of the first aspect.
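
The layer-by-layer computation described above for Figure 6 (a weighted sum over incoming edges, a nonlinear function such as tanh, repeated until the output layer) can be sketched as follows. This is a minimal illustration rather than the network of the application: the layer sizes, the weight values, the helper names `dense`, `softmax` and `forward`, and the use of a softmax to turn the three outputs into class probabilities are all assumptions made for the example.

```python
import math

def dense(inputs, weights, activation=None):
    """One layer: each output node is the weighted sum of the connected input nodes."""
    outputs = []
    for node_weights in weights:  # one row of edge weights per output node
        total = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(total) if activation else total)
    return outputs

def softmax(values):
    """Assumed output normalisation so the 3 outputs can be read as probabilities."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, hidden_weights, output_weights):
    """Input layer -> 1 hidden layer (tanh) -> 3-dimensional output."""
    hidden = dense(x, hidden_weights, activation=math.tanh)
    return softmax(dense(hidden, output_weights))

# Hypothetical weights: 2 input nodes, 4 hidden nodes, 3 output nodes.
HIDDEN_WEIGHTS = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4], [0.7, 0.7]]
OUTPUT_WEIGHTS = [
    [0.2, -0.5, 0.3, 0.1],
    [-0.4, 0.6, 0.2, -0.1],
    [0.3, 0.1, -0.2, 0.5],
]

probs = forward([1.0, -1.0], HIDDEN_WEIGHTS, OUTPUT_WEIGHTS)
print(probs)  # three values that sum to 1
```

A deeper network would simply call `dense` once per additional hidden layer before the output layer.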

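The flow of the first aspect (personalize a base model on a user's video data, then predict movements from further audio and drive an avatar) can be sketched with a toy model. Everything below is hypothetical and chosen only for illustration: the class `LinearPoseModel`, the functions `personalize` and `animate`, the use of a single "loudness" feature per frame, and the sample values. Plain gradient-descent fine-tuning stands in for whatever training or meta-learning procedure the application actually uses.

```python
from dataclasses import dataclass

@dataclass
class LinearPoseModel:
    """Toy stand-in for a movement model: head pitch = weight * loudness + bias."""
    weight: float
    bias: float

    def predict(self, loudness: float) -> float:
        return self.weight * loudness + self.bias

def personalize(base, samples, lr=0.5, steps=200):
    """Fine-tune a copy of the base model on (loudness, head pitch) pairs.

    The pairs play the role of synchronized audio and labelled image data
    taken from the user's videos. The base model is left unchanged.
    """
    w, b = base.weight, base.bias
    n = len(samples)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return LinearPoseModel(w, b)

def animate(model, audio_frames):
    """Turn predicted movements into per-frame avatar pose keyframes."""
    return [
        {"frame": i, "head_pitch_deg": model.predict(loudness)}
        for i, loudness in enumerate(audio_frames)
    ]

# Hypothetical base model (the 'first' model) and user video samples.
base_model = LinearPoseModel(weight=1.0, bias=0.0)
user_video_samples = [(0.0, 2.0), (0.5, 4.0), (1.0, 6.0)]

# The 'second' model, personalized to this user.
personal_model = personalize(base_model, user_video_samples)

# Further audio from the user drives the avatar animation.
keyframes = animate(personal_model, [0.2, 0.8])
print(keyframes)
```

In this sketch only a handful of samples is needed because the model has two parameters; a real audio-to-movement network would need a personalization scheme suited to the small number of videos per user.
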
Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
PCT/US2023/010261, priority date 2022-01-31, filing date 2023-01-06: Method, apparatus and computer program. Ceased. WO2023146741A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
KR1020247025950A KR20240142448A (ko) 2022-01-31 2023-01-06 Method, apparatus and computer program
JP2024536188A JP2025505340A (ja) 2022-01-31 2023-01-06 Method, apparatus and computer program
CN202380019541.6A CN118648026A (zh) 2022-01-31 2023-01-06 Method, apparatus and computer program
EP23704845.9A EP4473490A1 (en) 2022-01-31 2023-01-06 Method, apparatus and computer program
US18/726,789 US20250069308A1 (en) 2022-01-31 2023-01-06 Method, apparatus and computer program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP22154373.9A EP4220565A1 (en) 2022-01-31 2022-01-31 Method, apparatus and computer program
EP22154373.9 2022-01-31

Publications (1)

Publication Number Publication Date
WO2023146741A1 (en) 2023-08-03

Family

ID=80119033

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/010261 Ceased WO2023146741A1 (en) 2022-01-31 2023-01-06 Method, apparatus and computer program

Country Status (6)

Country Link
US (1) US20250069308A1
EP (2) EP4220565A1
JP (1) JP2025505340A
KR (1) KR20240142448A
CN (1) CN118648026A
WO (1) WO2023146741A1

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12518482B2 (en) * 2023-08-10 2026-01-06 Qualcomm Incorporated Virtual representative conditioning system
US12367425B1 (en) 2024-01-12 2025-07-22 THIA ST Co. Copilot customization with data producer(s)
US12242503B1 (en) 2024-01-12 2025-03-04 THIA ST Co. Copilot architecture: network of microservices including specialized machine learning tools
US12536045B2 (en) 2024-01-12 2026-01-27 THIA ST Co. Distribution of tasks among microservices in a copilot
US12367426B1 (en) * 2024-01-12 2025-07-22 THIA ST Co. Customization of machine learning tools with occupation training
US20250267239A1 (en) * 2024-02-15 2025-08-21 Microsoft Technology Licensing, Llc Generative communication session event effects

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160134840A1 (en) * 2014-07-28 2016-05-12 Alexa Margaret McCulloch Avatar-Mediated Telepresence Systems with Enhanced Filtering
US20190122411A1 (en) * 2016-06-23 2019-04-25 LoomAi, Inc. Systems and Methods for Generating Computer Ready Animation Models of a Human Head from Captured Data Images
US10755463B1 (en) * 2018-07-20 2020-08-25 Facebook Technologies, Llc Audio-based face tracking and lip syncing for natural facial animation and lip movement
US20200302184A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US20210056348A1 (en) * 2019-08-19 2021-02-25 Neon Evolution Inc. Methods and systems for image and voice processing
WO2021155140A1 (en) * 2020-01-29 2021-08-05 Google Llc Photorealistic talking faces from audio
US11127225B1 (en) 2020-06-01 2021-09-21 Microsoft Technology Licensing, Llc Fitting 3D models of composite objects

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3797404A4 (en) * 2018-05-22 2022-02-16 Magic Leap, Inc. SKELETAL SYSTEMS FOR ANIMATION OF VIRTUAL AVATARS
WO2022103877A1 (en) * 2020-11-13 2022-05-19 Innopeak Technology, Inc. Realistic audio driven 3d avatar generation
US11734888B2 (en) * 2021-04-23 2023-08-22 Meta Platforms Technologies, Llc Real-time 3D facial animation from binocular video

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DANIEL CUDEIRO ET AL: "Capture, Learning, and Synthesis of 3D Speaking Styles", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 May 2019 (2019-05-08), XP081270322 *
FINN ET AL.: "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", PROCEEDINGS OF THE 34TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, PMLR, vol. 70, 2017, pages 1126 - 1135
REQUEIMA ET AL.: "Fast and Flexible Multi-Task Classification Using Conditional Neural Adaptive Processes", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 32, 2019, pages 7957 - 7968
SNELL ET AL.: "Prototypical Networks for Few-Shot Learning", 19 June 2017 (2017-06-19)
YOSINSKI ET AL.: "How transferable are features in neural networks?", CORR ABS/1411.1792, 2014, Retrieved from the Internet <URL:https://arxiv.org/abs/1411.1792>

Also Published As

Publication number Publication date
CN118648026A (zh) 2024-09-13
US20250069308A1 (en) 2025-02-27
KR20240142448A (ko) 2024-09-30
EP4220565A1 (en) 2023-08-02
EP4473490A1 (en) 2024-12-11
JP2025505340A (ja) 2025-02-26

Similar Documents

Publication Publication Date Title
US20250069308A1 (en) Method, apparatus and computer program
US12277640B2 (en) Photorealistic real-time portrait animation
KR102758381B1 (ko) Integrated input/output for a three-dimensional (3D) environment
KR102863164B1 (ko) Methods and systems for realistic head turns and face animation synthesis on a mobile device
US11410364B2 (en) Systems and methods for realistic head turns and face animation synthesis on mobile device
US11114086B2 (en) Text and audio-based real-time face reenactment
US10528801B2 (en) Method and system for incorporating contextual and emotional visualization into electronic communications
US20220392133A1 (en) Realistic head turns and face animation synthesis on mobile device
KR102509666B1 (ko) Text and audio-based real-time face reenactment
CN114766016A (zh) Device, method and program for enhancing output content through iterative generation
US11983808B2 (en) Conversation-driven character animation
KR20250088614A (ko) Stylizing the whole body of a person

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23704845

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024536188

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 18726789

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 202417055724

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 202380019541.6

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 2023704845

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2023704845

Country of ref document: EP

Effective date: 20240902

WWP Wipo information: published in national office

Ref document number: 18726789

Country of ref document: US