EP3151239A1 - Methods and systems for text-to-speech synthesis - Google Patents

Methods and systems for text-to-speech synthesis

Info

Publication number
EP3151239A1
EP3151239A1 (application EP16190998.1A)
Authority
EP
European Patent Office
Prior art keywords
speech
attribute
text
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP16190998.1A
Other languages
German (de)
English (en)
Inventor
Ilya Vladimirovich Edrenkin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yandex Europe AG
Original Assignee
Yandex Europe AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from RU2015141342A (external priority: RU2632424C2)
Application filed by Yandex Europe AG filed Critical Yandex Europe AG
Publication of EP3151239A1
Legal status: Ceased

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers

Definitions

  • One or more speech attributes may be defined during the training steps. Similarly, one or more speech attributes may be selected during the conversion/speech synthesis steps.
  • Non-limiting examples of speech attributes include emotions, genders, languages, intonations, accents, speaking styles, dynamics, and speaker identities.
  • The one or more defined speech attributes comprise at least an emotion, a gender and a language.
  • The one or more defined speech attributes comprise at least one of an emotion, a gender and a language.
  • The training data includes training text data and respective training acoustic data in different languages.
  • The selected speech attribute includes a language, thereby allowing the synthetic speech to be provided in a specified language.
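As a hedged illustration of the bullets above (the record layout, field names and example values are assumptions for exposition, not taken from the patent), training data that mixes languages and other attributes might be organized as records pairing text, extracted acoustic features, and defined attributes:

```python
from dataclasses import dataclass

# Illustrative sketch only: TrainingRecord and its fields are assumed names.

@dataclass
class TrainingRecord:
    text: str               # training text data
    vocoder_features: list  # features extracted from the training acoustic data
    attributes: dict        # defined speech attributes for this sample

corpus = [
    TrainingRecord("hello world", [0.12, 0.40], {"language": "en", "emotion": "neutral"}),
    TrainingRecord("bonjour", [0.33, 0.08], {"language": "fr", "emotion": "happy"}),
]

# Because language is just another defined attribute, it can later be
# selected at synthesis time like any other attribute.
languages = sorted({record.attributes["language"] for record in corpus})
```

Here language sits alongside emotion in the same attribute dictionary, which reflects the idea that a multilingual corpus lets language be selected at synthesis time.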
  • Computer-readable instructions stored on the information storage medium 104, when executed, can further cause the processor 108 to receive a selection of a speech attribute 420, the speech attribute 420 having a selected attribute weight.
  • One or more speech attributes 420 may be received, each having one or more selected attribute weights.
  • The selected attribute weight defines the weight of the speech attribute 420 desired in the synthetic speech to be outputted.
  • The synthetic speech will have a weighted sum of speech attributes 420.
  • A speech attribute 420 may be variable over a continuous range, for example intermediate between "sad" and "happy" or "sad" and "angry".
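A minimal sketch of the idea of weighted, continuously variable attributes. The blend() helper and the attribute names are illustrative assumptions, not part of the patent:

```python
def blend(attribute_weights):
    """Normalize attribute weights so they describe a point on a continuous
    spectrum, e.g. an emotion intermediate between 'sad' and 'happy'."""
    total = sum(attribute_weights.values())
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return {name: weight / total for name, weight in attribute_weights.items()}

# 30% 'sad', 70% 'happy': an intermediate point on the continuous range.
mix = blend({"sad": 0.3, "happy": 0.7})
```

Any weighting along the range can be chosen, which is what allows a weighted sum of attributes rather than a hard either/or selection.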
  • Computer-readable instructions stored on the local memory 114, when executed, can cause the processor 116 to receive a text, receive one or more selected speech attributes, etc.
  • The instruction to perform TTS can be an instruction of the user 121 entered using the input module 113.
  • Responsive to the user 121 requesting that text messages be read out loud, the client device 112 can receive an instruction to perform TTS.
  • Figure 2 illustrates a computer-implemented method 200 for text-to-speech (TTS) synthesis, the method executable on a computing device (which can be either the client device 112 or the server 102) of the system 100 of Figure 1.
  • TTS: text-to-speech
  • The method 200 starts at step 202, where a computing device, being in this implementation of the present technology the server 102, receives an instruction for TTS, specifically to output a synthetic speech having a selected speech attribute.
  • Training text data 312 is received.
  • The form of the training text data 312 is not particularly limited. It may be part of a written text of any type, e.g., a book, an article, an e-mail, a text message, and the like.
  • The training text data 312 is received via the text input 130 and the input module 113. It may be received from an e-mail client, an e-book reader, a messaging system, a web browser, or within another application containing text content. Alternatively, the training text data 312 may be received from the operating system of the computing device (e.g., the server 102, or the client device 112).
  • The method 200 then proceeds to step 204.
  • Step 208 - using a deep neural network to determine interdependency factors between the speech attributes in the training data, the deep neural network generating a single, continuous acoustic space model based on the interdependency factors, the acoustic space model thereby taking into account a plurality of interdependent speech attributes and allowing for modelling of a continuous spectrum of the interdependent speech attributes
  • The input into the DNN 330 is the training data (not depicted), and the output from the DNN 330 is the acoustic space model 340.
  • The DNN 330 thus generates a single, continuous acoustic space model 340 based on the interdependency factors between the speech attributes 326, the acoustic space model 340 thereby taking into account a plurality of interdependent speech attributes and allowing for modelling of a continuous spectrum of the interdependent speech attributes.
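The key idea of step 208, a single network that consumes linguistic features together with an attribute vector so that attribute interdependencies live in one continuous model, can be sketched as follows. All sizes, the random weights and the forward() helper are illustrative assumptions, not the patent's implementation:

```python
import math
import random

random.seed(0)  # reproducible toy weights

LINGUISTIC_DIM = 8   # features derived from the text
ATTRIBUTE_DIM = 4    # e.g. emotion, gender, language, accent
HIDDEN_DIM = 16
VOCODER_DIM = 6      # acoustic (vocoder) features per frame

def make_layer(n_in, n_out):
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

W_HIDDEN = make_layer(LINGUISTIC_DIM + ATTRIBUTE_DIM, HIDDEN_DIM)
W_OUT = make_layer(HIDDEN_DIM, VOCODER_DIM)

def forward(linguistic, attributes):
    # One joint input vector: the network sees text features and attributes
    # together, which is what lets it capture their interdependencies.
    x = linguistic + attributes
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W_HIDDEN]
    return [sum(w * v for w, v in zip(row, hidden)) for row in W_OUT]

frame = forward([0.1] * LINGUISTIC_DIM, [1.0, 0.0, 0.0, 0.0])
```

Because the attribute vector is a continuous input rather than a discrete switch, nearby attribute vectors yield nearby acoustic outputs, which is one way to picture "a continuous spectrum of interdependent speech attributes".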
  • The acoustic space model 340 can now be used in the remaining steps 210-216 of the method 200.
  • The method 200 now continues with steps 210-216, in which text-to-speech synthesis is performed using the acoustic space model 340 generated in step 208.
  • Figure 4 depicts a schematic diagram 400 of text-to-speech synthesis (TTS) in accordance with non-limiting embodiments of the present technology.
  • The method 200 now continues with step 212.
  • Step 212 - receiving a selection of a speech attribute, the speech attribute having a selected attribute weight
  • A selection of a speech attribute 420 is received.
  • One or more speech attributes 420 may be selected and received.
  • Speech attribute 420 is not particularly limited and may correspond, for example, to an emotion (angry, happy, sad, etc.), the gender of the speaker, a language, an accent, an intonation, a dynamic, a speaker identity, a speaking style, etc.
  • The one or more speech attributes 326 are defined to allow an association or a correlation between vocoder features 324 of the acoustic data 322 and speech attributes 326 during training of the acoustic space model 340 (described further below).
  • The selection of the speech attribute 420 is received via the input module 113. In some non-limiting embodiments, it may be received with the text 410 via the text input 130. In alternative embodiments, the text 410 and the speech attribute 420 are received separately (e.g., at different times, from different applications, from different users, or in different files), via the input module 113.
  • The method 200 then proceeds to step 216, in which the synthetic speech 440 is outputted as audio having the selected speech attribute(s) 420.
  • The synthetic speech 440 produced by the acoustic space model 340 has perceivable characteristics 430, the perceivable characteristics 430 producing sound having the selected speech attribute(s) 420.
  • A synthetic speech can be outputted even if no respective training acoustic data with the selected attributes was received during training.
  • The text converted to synthetic speech need not correspond to the training text data, and a text can be converted to synthetic speech even though no respective acoustic data for that text was received during the training process. At least some of these technical effects are achieved through building an acoustic model that is based on interdependencies of the attributes of the acoustic data.
  • The present technology may provide synthetic speech that sounds like a natural human voice, having the selected speech attributes. Instead of the 'post-processing' approach seen in some of the prior art, where an 'average' voice is synthesized and then adapted according to desired criteria, embodiments of the present method and system provide one-step generation of the desired synthetic speech. Furthermore, embodiments of the present method and system can provide unique synthetic speech based on unique and different combinations, along a continuous spectrum, of many different speech attributes including language, emotion, and accent.
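Putting steps 210-216 together, the synthesis path can be sketched end to end. Every function below is a stand-in under assumed names (the patent does not prescribe this decomposition, and a real vocoder and front end are far more elaborate):

```python
def extract_linguistic_features(text):
    # Stand-in front end: one toy feature vector per character.
    return [[float(ord(ch) % 32) / 32.0] for ch in text]

def acoustic_space_model(features, attribute_vector):
    # Stand-in for the trained model 340: the selected attributes
    # condition every frame of vocoder features.
    return [frame + list(attribute_vector) for frame in features]

def vocoder(frames):
    # Stand-in vocoder: collapses each frame to one "sample".
    return [sum(frame) for frame in frames]

def synthesize(text, attribute_vector):
    features = extract_linguistic_features(text)               # receive/process text
    frames = acoustic_space_model(features, attribute_vector)  # apply selected attributes
    return vocoder(frames)                                     # output audio

audio = synthesize("hello", [0.3, 0.7])  # e.g. 30% 'sad', 70% 'happy'
```

The point of the sketch is the single conditioning step: the attribute vector enters the model directly, rather than an average voice being synthesized first and adapted afterwards.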

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
EP16190998.1A 2015-09-29 2016-09-28 Methods and systems for text-to-speech synthesis Ceased EP3151239A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2015141342A RU2632424C2 (ru) 2015-09-29 2015-09-29 Method and server for text-to-speech synthesis
US15/263,525 US9916825B2 (en) 2015-09-29 2016-09-13 Method and system for text-to-speech synthesis

Publications (1)

Publication Number Publication Date
EP3151239A1 2017-04-05

Family

ID=56997428

Family Applications (1)

Application Number Title Priority Date Filing Date
EP16190998.1A Ceased EP3151239A1 (fr) Methods and systems for text-to-speech synthesis

Country Status (1)

Country Link
EP (1) EP3151239A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8135591B2 (en) 2006-09-08 2012-03-13 At&T Intellectual Property Ii, L.P. Method and system for training a text-to-speech synthesis system using a specific domain speech database
US8886537B2 (en) 2007-03-20 2014-11-11 Nuance Communications, Inc. Method and system for text-to-speech synthesis with personalized voice
US20130262119A1 (en) 2012-03-30 2013-10-03 Kabushiki Kaisha Toshiba Text to speech system
EP2650874A1 (fr) * 2012-03-30 2013-10-16 Kabushiki Kaisha Toshiba Text-to-speech system
US8527276B1 (en) * 2012-10-25 2013-09-03 Google Inc. Speech synthesis using deep neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Vocoder - Wikipedia", 21 September 2015 (2015-09-21), XP055340197, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Vocoder&oldid=682020055> [retrieved on 20170130] *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3438972A4 (fr) * 2016-03-28 2019-07-10 Sony Corporation Information processing device and information processing method
CN107634898A (zh) * 2017-08-18 2018-01-26 Shanghai Yuncong Enterprise Development Co., Ltd. Realizing real-person voice message communication through a chat tool on an electronic communication device
WO2019191251A1 (fr) * 2018-03-28 2019-10-03 Telepathy Labs, Inc. Speech synthesis method and system
US11450307B2 (en) 2018-03-28 2022-09-20 Telepathy Labs, Inc. Text-to-speech synthesis system and method
CN111954903B (zh) * 2018-12-11 2024-03-15 Microsoft Technology Licensing, LLC Multi-speaker neural text-to-speech synthesis
CN111954903A (zh) * 2018-12-11 2020-11-17 Microsoft Technology Licensing, LLC Multi-speaker neural text-to-speech synthesis
WO2020145439A1 (fr) * 2019-01-11 2020-07-16 LG Electronics Inc. Method and device for speech synthesis based on emotion information
US11514886B2 (en) 2019-01-11 2022-11-29 Lg Electronics Inc. Emotion classification information-based text-to-speech (TTS) method and apparatus
CN110175262A (zh) * 2019-05-31 2019-08-27 Wuhan Douyu Yule Network Technology Co., Ltd. Clustering-based deep learning model compression method, storage medium and system
CN112885326A (zh) * 2019-11-29 2021-06-01 Alibaba Group Holding Limited Method and apparatus for personalized speech synthesis model creation, speech synthesis and testing
CN113192482A (zh) * 2020-01-13 2021-07-30 Beijing Horizon Robotics Technology R&D Co., Ltd. Speech synthesis method, and speech synthesis model training method, apparatus and device
CN113192482B (zh) * 2020-01-13 2023-03-21 Beijing Horizon Robotics Technology R&D Co., Ltd. Speech synthesis method, and speech synthesis model training method, apparatus and device
CN111276120A (zh) * 2020-01-21 2020-06-12 Huawei Technologies Co., Ltd. Speech synthesis method and apparatus, and computer-readable storage medium
CN111276120B (zh) * 2020-01-21 2022-08-19 Huawei Technologies Co., Ltd. Speech synthesis method and apparatus, and computer-readable storage medium
CN113314096A (zh) * 2020-02-25 2021-08-27 Alibaba Group Holding Limited Speech synthesis method, apparatus, device and storage medium
CN111681638A (zh) * 2020-04-20 2020-09-18 Shenzhen Aoni Electronic Co., Ltd. In-vehicle intelligent voice control method and system
CN112382267A (zh) * 2020-11-13 2021-02-19 Beijing Youzhuju Network Technology Co., Ltd. Method, apparatus, device and storage medium for accent conversion
CN112786006A (zh) * 2021-01-13 2021-05-11 Beijing Youzhuju Network Technology Co., Ltd. Speech synthesis method, synthesis model training method, apparatus, medium and device
WO2022151931A1 (fr) * 2021-01-13 2022-07-21 Beijing Youzhuju Network Technology Co., Ltd. Speech synthesis method and apparatus, synthesis model training method and apparatus, medium and device
CN112786006B (zh) * 2021-01-13 2024-05-17 Beijing Youzhuju Network Technology Co., Ltd. Speech synthesis method, synthesis model training method, apparatus, medium and device
CN113823257A (zh) * 2021-06-18 2021-12-21 Tencent Technology (Shenzhen) Co., Ltd. Speech synthesizer construction method, and speech synthesis method and apparatus
CN113823257B (zh) * 2021-06-18 2024-02-09 Tencent Technology (Shenzhen) Co., Ltd. Speech synthesizer construction method, and speech synthesis method and apparatus
WO2023215132A1 (fr) * 2022-05-04 2023-11-09 Cerence Operating Company Interactive speech style modification of synthesized speech

Similar Documents

Publication Publication Date Title
US9916825B2 (en) Method and system for text-to-speech synthesis
EP3151239A1 (fr) Methods and systems for text-to-speech synthesis
KR102582291B1 (ko) Emotion-information-based speech synthesis method and apparatus
US11727914B2 (en) Intent recognition and emotional text-to-speech learning
US11531819B2 (en) Text-to-speech adapted by machine learning
JP6802005B2 (ja) Speech recognition device, speech recognition method and speech recognition system
US10607595B2 (en) Generating audio rendering from textual content based on character models
CN111048062A (zh) Speech synthesis method and device
US10685644B2 (en) Method and system for text-to-speech synthesis
EP2634714A2 (fr) Emotional audio synthesis apparatus and method
Delgado et al. Spoken, multilingual and multimodal dialogue systems: development and assessment
JP2024505076A (ja) Generating diverse and natural text-to-speech samples
Fellbaum et al. Principles of electronic speech processing with applications for people with disabilities
US11176943B2 (en) Voice recognition device, voice recognition method, and computer program product
López-Ludeña et al. LSESpeak: A spoken language generator for Deaf people
Ifeanyi et al. Text–To–Speech Synthesis (TTS)
EP3534363A1 (fr) Information processing device and information processing method
US20230148275A1 (en) Speech synthesis device and speech synthesis method
CN116631434A (zh) Conversion-system-based video and speech synchronization method, apparatus and electronic device
Chaurasiya Cognitive hexagon-controlled intelligent speech interaction system
JP6289950B2 (ja) Read-aloud device, read-aloud method and program
US20190019497A1 (en) Expressive control of text-to-speech content
US11887583B1 (en) Updating models with trained model update objects
KR20220116660A (ko) Tumbler device equipped with an artificial intelligence speaker function
KR20230067501A (ko) Speech synthesis device and speech synthesis method thereof

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20171005

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20190830

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20210917