BR112023019971A2 - Adaptive visual speech recognition - Google Patents

Adaptive visual speech recognition

Info

Publication number
BR112023019971A2
Authority
BR
Brazil
Prior art keywords
speech recognition
video
visual speech
speaker
adaptive visual
Prior art date
Application number
BR112023019971A
Other languages
English (en)
Inventor
Brendan Shillingford
Alexandros Assael Ioannis
Ferdinando Gomes De Freitas Joao
Original Assignee
Deepmind Tech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-06-18
Filing date
2022-06-15
Publication date
2023-11-21
Application filed by Deepmind Tech Ltd
Publication of BR112023019971A2

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

Adaptive visual speech recognition. Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing video data using an adaptive visual speech recognition model. One of the methods includes receiving a video that includes a plurality of video frames that depict a first speaker; obtaining a first embedding characterizing the first speaker; and processing a first input comprising (i) the video and (ii) the first embedding using a visual speech recognition neural network having a plurality of parameters, wherein the visual speech recognition neural network is configured to process the video and the first embedding in accordance with trained values of the parameters to generate a speech recognition output that defines a sequence of one or more words being spoken by the first speaker in the video.
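As a rough illustration of the data flow summarized in the abstract (video frames plus a speaker embedding in, a word sequence out), the following Python sketch may help. Everything in it is a hypothetical stand-in rather than the claimed implementation: the class name `ToyVisualSpeechRecognizer`, the five-entry `VOCAB`, the single linear layer with random weights in place of trained parameters, the conditioning by concatenating the embedding to each frame, and the CTC-style greedy decoding are all assumptions made for the example.

```python
import random
from typing import List, Sequence

# Hypothetical vocabulary; index 0 is a "blank" (no word) symbol.
VOCAB = ["<blank>", "hello", "world", "yes", "no"]


class ToyVisualSpeechRecognizer:
    """Toy stand-in for a visual speech recognition network (assumption, not the patented model)."""

    def __init__(self, frame_dim: int, embed_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.in_dim = frame_dim + embed_dim
        # Stand-in for "trained values of the parameters": random weights, one row per vocab entry.
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(self.in_dim)] for _ in VOCAB
        ]

    def __call__(
        self, video: Sequence[Sequence[float]], speaker_embedding: Sequence[float]
    ) -> List[str]:
        words: List[str] = []
        prev = 0
        for frame in video:
            # Condition every frame on the speaker by concatenating the embedding (assumed scheme).
            x = list(frame) + list(speaker_embedding)
            if len(x) != self.in_dim:
                raise ValueError("frame and embedding sizes do not match the model")
            scores = [sum(w * v for w, v in zip(row, x)) for row in self.weights]
            best = max(range(len(VOCAB)), key=scores.__getitem__)
            # CTC-style greedy decoding: skip blanks and consecutive repeats.
            if best != 0 and best != prev:
                words.append(VOCAB[best])
            prev = best
        return words


# Usage: three made-up "frames" of 3 features each and a 2-dimensional speaker embedding.
video = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.9, 0.1, 0.4]]
embedding = [0.2, -0.1]
model = ToyVisualSpeechRecognizer(frame_dim=3, embed_dim=2)
words = model(video, embedding)
print(words)
```

Concatenating the embedding to every frame is only one simple way to condition a network on a speaker; the abstract does not say how the embedding is combined with the video.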
BR112023019971A 2021-06-18 2022-06-15 Adaptive visual speech recognition BR112023019971A2 (pt)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GR20210100402 2021-06-18
PCT/EP2022/066419 WO2022263570A1 (en) 2021-06-18 2022-06-15 Adaptive visual speech recognition

Publications (1)

Publication Number Publication Date
BR112023019971A2 (pt) 2023-11-21

Family

ID=82385592

Family Applications (1)

Application Number Title Priority Date Filing Date
BR112023019971A BR112023019971A2 (pt) 2021-06-18 2022-06-15 Adaptive visual speech recognition

Country Status (9)

Country Link
US (1) US20240265911A1 (pt)
EP (1) EP4288960A1 (pt)
JP (1) JP2024520985A (pt)
KR (1) KR102663654B1 (pt)
CN (1) CN117121099A (pt)
AU (1) AU2022292104B2 (pt)
BR (1) BR112023019971A2 (pt)
CA (1) CA3214170A1 (pt)
WO (1) WO2022263570A1 (pt)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240095449A1 (en) * 2022-09-16 2024-03-21 Verizon Patent And Licensing Inc. Systems and methods for adjusting a transcript based on output from a machine learning model

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101382504B1 (ko) 2007-05-21 2014-04-07 삼성전자주식회사 Apparatus and method for generating macros
US8407057B2 (en) * 2009-01-21 2013-03-26 Nuance Communications, Inc. Machine, system and method for user-guided teaching and modifying of voice commands and actions executed by a conversational learning system
KR102301880B1 (ko) * 2014-10-14 2021-09-14 삼성전자 주식회사 Electronic device and method for spoken interaction thereof
US10474883B2 (en) * 2016-11-08 2019-11-12 Nec Corporation Siamese reconstruction convolutional neural network for pose-invariant face recognition
US11919531B2 (en) * 2018-01-31 2024-03-05 Direct Current Capital LLC Method for customizing motion characteristics of an autonomous vehicle for a user
EP3766065A1 (en) * 2018-05-18 2021-01-20 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
WO2020048358A1 (en) * 2018-09-04 2020-03-12 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for recognizing speech using depth information
US10846522B2 (en) * 2018-10-16 2020-11-24 Google Llc Speaking classification using audio-visual data
US20210065712A1 (en) * 2019-08-31 2021-03-04 Soundhound, Inc. Automotive visual speech recognition
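JP7442631B2 (ja) * 2019-10-18 2024-03-04 グーグル エルエルシー End-to-end multi-speaker audio-visual automatic speech recognition
CN111723758B (zh) * 2020-06-28 2023-10-31 腾讯科技(深圳)有限公司 Video information processing method and apparatus, electronic device, and storage medium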

Also Published As

Publication number Publication date
AU2022292104A1 (en) 2023-09-21
KR102663654B1 (ko) 2024-05-10
WO2022263570A1 (en) 2022-12-22
US20240265911A1 (en) 2024-08-08
CN117121099A (zh) 2023-11-24
AU2022292104B2 (en) 2024-08-01
CA3214170A1 (en) 2022-12-22
EP4288960A1 (en) 2023-12-13
KR20230141932A (ko) 2023-10-10
JP2024520985A (ja) 2024-05-28

Similar Documents

Publication Publication Date Title
AU2019202382A1 (en) Robotic agent conversation escalation
SG10201707702YA (en) Collaborative Voice Controlled Devices
CN106528530A (zh) Method and apparatus for determining sentence type
DE112016005912T5 (de) Technologies for end-of-sentence detection using syntactic coherence
BR112016023920A2 (pt) Methods and systems for handling a dialog with a robot
Chen et al. Fine-grained style control in transformer-based text-to-speech synthesis
BR112023020614A2 (pt) Processing multimodal inputs using language models
BR112023019971A2 (pt) Adaptive visual speech recognition
CN112434514B (zh) Semantic matching method and apparatus based on a multi-granularity, multi-channel neural network, and computer device
WO2022115676A3 (en) Out-of-domain data augmentation for natural language processing
Zhu et al. Robust spoken language understanding with unsupervised asr-error adaptation
Zhou et al. Transferable positive/negative speech emotion recognition via class-wise adversarial domain adaptation
CN118043885A (zh) Contrastive Siamese network for semi-supervised speech recognition
EP3869505A3 (en) Method, apparatus, system, electronic device for processing information and storage medium
Montenegro et al. Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process
Jalal et al. Removing bias with residual mixture of multi-view attention for speech emotion recognition
JP2018205945A (ja) Artificial intelligence device for automatically creating dialogue response documents
Juszkiewicz Improving noise robustness of speech emotion recognition system
Naini et al. Generalization of Self-Supervised Learning-Based Representations for Cross-Domain Speech Emotion Recognition
Yao et al. Anchor voiceprint recognition in live streaming via RawNet-SA and gated recurrent unit
Verkholyak et al. Hierarchical two-level modelling of emotional states in spoken dialog systems
EP3825897A3 (en) Method, apparatus, device, storage medium and program for outputting information
US20230081543A1 (en) Method for synthetizing speech and electronic device
Sathya et al. Artificial intelligence for speech recognition
Wagner et al. Acted and spontaneous conversational prosody—Same or Different