EP3151239A1 - Methods and systems for text-to-speech synthesis - Google Patents
- Publication number
- EP3151239A1 (application EP16190998.1A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- attribute
- text
- training
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
Definitions
- One or more speech attributes may be defined during the training steps. Similarly, one or more speech attributes may be selected during the conversion/speech synthesis steps.
- Non-limiting examples of speech attributes include emotions, genders, language, intonations, accents, speaking styles, dynamics, and speaker identities.
- the one or more defined speech attribute comprises at least an emotion, a gender and a language.
- the one or more defined speech attribute comprises at least one of an emotion, a gender and a language.
- the training data includes training text data and respective training acoustic data in different languages.
- the selected speech attribute includes a language, thereby allowing the synthetic speech to be provided in a specified language.
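As an illustration of how defined speech attributes such as emotion, gender and language might be represented numerically, here is a minimal sketch. The attribute inventories and function name are invented for this example; the patent does not prescribe any particular encoding:

```python
# Hypothetical encoding of defined speech attributes as a numeric vector.
# The attribute inventories below are illustrative, not from the patent.
EMOTIONS = ["neutral", "happy", "sad", "angry"]
GENDERS = ["female", "male"]
LANGUAGES = ["en", "ru"]

def encode_attributes(emotion, gender, language):
    """Concatenate one-hot encodings of emotion, gender and language."""
    vec = [1.0 if e == emotion else 0.0 for e in EMOTIONS]
    vec += [1.0 if g == gender else 0.0 for g in GENDERS]
    vec += [1.0 if l == language else 0.0 for l in LANGUAGES]
    return vec

print(encode_attributes("happy", "female", "en"))
# [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

A vector like this could accompany each training utterance, letting a model associate acoustic features with each attribute.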
- Computer-readable instructions stored on the information storage medium 104, when executed, can further cause the processor 108 to receive a selection of a speech attribute 420, the speech attribute 420 having a selected attribute weight.
- One or more speech attributes 420 may be received, each having one or more selected attribute weights.
- the selected attribute weight defines the weight of the speech attribute 420 desired in the synthetic speech to be outputted.
- the synthetic speech will have a weighted sum of speech attributes 420.
- a speech attribute 420 may be variable over a continuous range, for example intermediate between "sad" and "happy" or "sad" and "angry".
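The "weighted sum of speech attributes" idea above can be sketched as a simple mixture of pure attribute vectors. Everything here (the function name and the four-emotion inventory) is an illustrative assumption, not the patent's implementation:

```python
# Illustrative sketch: blending pure speech attributes by selected weights
# to obtain an intermediate point on the continuous attribute spectrum.
def blend(weighted_attrs, attribute_order):
    """Build an attribute vector that mixes pure attributes by weight."""
    vec = [0.0] * len(attribute_order)
    for attr, weight in weighted_attrs.items():
        vec[attribute_order.index(attr)] = weight
    return vec

order = ["neutral", "happy", "sad", "angry"]
# 70% happy, 30% sad: a point on the continuum between "sad" and "happy".
print(blend({"happy": 0.7, "sad": 0.3}, order))
# [0.0, 0.7, 0.3, 0.0]
```

Feeding such a blended vector to the acoustic model, instead of a pure one-hot vector, is one way a synthetic voice could land anywhere between two emotions.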
- computer-readable instructions stored on the local memory 114, when executed, can cause the processor 116 to receive a text, receive one or more selected speech attributes, etc.
- the instruction to perform TTS can be instructions of a user 121 entered using the input module 113.
- the client device 112, responsive to the user 121 requesting that text messages be read out loud, can receive an instruction to perform TTS.
- Figure 2 illustrates a computer-implemented method 200 for text-to-speech (TTS) synthesis, the method executable on a computing device (which can be either the client device 112 or the server 102) of the system 100 of Figure 1.
- the method 200 starts at step 202, where a computing device (in this implementation of the present technology, the server 102) receives an instruction for TTS, specifically to output synthetic speech having a selected speech attribute.
- training text data 312 is received.
- the form of the training text data 312 is not particularly limited. It may be part of a written text of any type, e.g., a book, an article, an e-mail, a text message, and the like.
- the training text data 312 is received via text input 130 and input module 113. It may be received from an e-mail client, an e-book reader, a messaging system, a web browser, or within another application containing text content. Alternatively, the training text data 312 may be received from the operating system of the computing device (e.g., the server 102, or the client device 112).
- The method 200 then proceeds to step 204.
- Step 208 - using a deep neural network to determine interdependency factors between the speech attributes in the training data, the deep neural network generating a single, continuous acoustic space model based on the interdependency factors, the acoustic space model thereby taking into account a plurality of interdependent speech attributes and allowing for modelling of a continuous spectrum of the interdependent speech attributes
- the input into the DNN 330 is the training data (not depicted), and the output from the DNN 330 is the acoustic space model 340.
- the DNN 330 thus generates a single, continuous acoustic space model 340 based on the interdependency factors between the speech attributes 326, the acoustic space model 340 thereby taking into account a plurality of interdependent speech attributes and allowing for modelling of a continuous spectrum of the interdependent speech attributes.
- the acoustic space model 340 can now be used in the remaining steps 210-216 of the method 200.
- the method 200 now continues with steps 210-216 in which text-to-speech synthesis is performed, using the acoustic space model 340 generated in step 208.
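To make the conditioning idea concrete, the following toy sketch shows a feed-forward network whose input concatenates linguistic features with an attribute vector, so that a single network covers all attribute combinations. The layer sizes, the two-layer topology and all names are invented for illustration; the patent does not disclose the DNN architecture at this level of detail:

```python
import math
import random

# Purely illustrative sketch of attribute conditioning: a feed-forward
# network whose input is [linguistic features + attribute vector], so one
# model serves every attribute combination. All dimensions are invented.
random.seed(0)

def layer(n_in, n_out):
    """Random weight matrix as a list of rows (stand-in for training)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def forward(weights, x):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

N_LINGUISTIC, N_ATTR, N_HIDDEN, N_ACOUSTIC = 16, 8, 32, 12

w1 = layer(N_LINGUISTIC + N_ATTR, N_HIDDEN)
w2 = layer(N_HIDDEN, N_ACOUSTIC)

linguistic = [0.5] * N_LINGUISTIC   # features extracted from the text
attributes = [0.0] * N_ATTR         # e.g. a blended attribute vector
acoustic = forward(w2, forward(w1, linguistic + attributes))
print(len(acoustic))  # one frame of vocoder-style acoustic features
```

Because the attribute vector is part of the input, varying it continuously moves the output through the acoustic space, which is the intuition behind a "single, continuous acoustic space model".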
- Figure 4 depicts a schematic diagram 400 of text-to-speech synthesis (TTS) in accordance with non-limiting embodiments of the present technology.
- The method 200 now continues with step 212.
- Step 212 - receiving a selection of a speech attribute, the speech attribute having a selected attribute weight
- a selection of a speech attribute 420 is received.
- One or more speech attributes 420 may be selected and received.
- Speech attribute 420 is not particularly limited and may correspond, for example, to an emotion (angry, happy, sad, etc.), the gender of the speaker, a language, an accent, an intonation, a dynamic, a speaker identity, a speaking style, etc.
- the one or more speech attribute 326 is defined, to allow an association or a correlation between vocoder features 324 of the acoustic data 322 and speech attributes 326 during training of the acoustic space model 340 (defined further below).
- the selection of the speech attribute 420 is received via the input module 113. In some non-limiting embodiments, it may be received with the text 410 via the text input 130. In alternative embodiments, the text 410 and the speech attribute 420 are received separately (e.g., at different times, or from different applications, or from different users, or in different files, etc.), via the input module 113.
- The method 200 then proceeds to step 216, in which the synthetic speech 440 is outputted as audio having the selected speech attribute(s) 420.
- the synthetic speech 440 produced by the acoustic space model 340 has perceivable characteristics 430, the perceivable characteristics 430 producing sound having the selected speech attribute(s) 420.
- a synthetic speech can be outputted, even if no respective training acoustic data with the selected attributes was received during training.
- the text converted to synthetic speech need not correspond to the training text data, and a text can be converted to synthetic speech even though no respective acoustic data for that text was received during the training process. At least some of these technical effects are achieved through building an acoustic model that is based on interdependencies of the attributes of the acoustic data.
- the present technology may provide synthetic speech that sounds like a natural human voice, having the selected speech attributes. Instead of the 'post-processing' approach seen in some of the prior art, where an 'average' voice is synthesized and then adapted according to desired criteria, embodiments of the present method and system provide one-step generation of the desired synthetic speech. Furthermore, embodiments of the present method and system can provide unique synthetic speech based on unique and different combinations, along a continuous spectrum, of many different speech attributes, including language, emotion, and accent.
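The one-step flow described above, from text plus selected weighted attributes directly to acoustic output, can be summarized in a self-contained sketch. Both functions are trivial stand-ins invented for illustration; only the data flow mirrors the description:

```python
# Self-contained sketch of the one-step synthesis flow (all functions are
# trivial stand-ins; only the data flow mirrors the description above):
# text -> linguistic features -> attribute-conditioned model -> audio frames.

def text_to_linguistic_features(text):
    # Stand-in for a real TTS front-end: one numeric feature per character.
    return [ord(c) / 128.0 for c in text.lower()[:8]]

def acoustic_model(linguistic, attribute_weights):
    # Stand-in for the trained acoustic space model: dummy frames whose
    # values depend on both the text and the selected attribute weights.
    bias = sum(attribute_weights.values())
    return [[f + bias for f in linguistic] for _ in range(3)]

frames = acoustic_model(text_to_linguistic_features("hello"),
                        {"happy": 0.7, "sad": 0.3})
print(len(frames))  # dummy acoustic frames a vocoder would render as audio
```

The point of the sketch is that no separate 'average voice' is synthesized and post-processed: the attribute weights enter the model together with the text, so the desired voice comes out in a single pass.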
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2015141342A RU2632424C2 (ru) | 2015-09-29 | 2015-09-29 | Method and server for text-to-speech synthesis |
US15/263,525 US9916825B2 (en) | 2015-09-29 | 2016-09-13 | Method and system for text-to-speech synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3151239A1 (fr) | 2017-04-05 |
Family
ID=56997428
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP16190998.1A Ceased EP3151239A1 (fr) | 2015-09-29 | 2016-09-28 | Methods and systems for text-to-speech synthesis |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP3151239A1 (fr) |
2016
- 2016-09-28: EP application EP16190998.1A filed (published as EP3151239A1 (fr)); status: not active (Ceased)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8135591B2 (en) | 2006-09-08 | 2012-03-13 | At&T Intellectual Property Ii, L.P. | Method and system for training a text-to-speech synthesis system using a specific domain speech database |
US8886537B2 (en) | 2007-03-20 | 2014-11-11 | Nuance Communications, Inc. | Method and system for text-to-speech synthesis with personalized voice |
US20130262119A1 (en) | 2012-03-30 | 2013-10-03 | Kabushiki Kaisha Toshiba | Text to speech system |
EP2650874A1 (fr) * | 2012-03-30 | 2013-10-16 | Kabushiki Kaisha Toshiba | Système de texte vers parole |
US8527276B1 (en) * | 2012-10-25 | 2013-09-03 | Google Inc. | Speech synthesis using deep neural networks |
Non-Patent Citations (1)
Title |
---|
ANONYMOUS: "Vocoder - Wikipedia", 21 September 2015 (2015-09-21), XP055340197, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Vocoder&oldid=682020055> [retrieved on 20170130] * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3438972A4 (fr) * | 2016-03-28 | 2019-07-10 | Sony Corporation | Information processing device and information processing method |
CN107634898A (zh) * | 2017-08-18 | 2018-01-26 | 上海云从企业发展有限公司 | Realizing real-person voice messaging through a chat tool on an electronic communication device |
WO2019191251A1 (fr) * | 2018-03-28 | 2019-10-03 | Telepathy Labs, Inc. | Speech synthesis method and system |
US11450307B2 (en) | 2018-03-28 | 2022-09-20 | Telepathy Labs, Inc. | Text-to-speech synthesis system and method |
CN111954903B (zh) * | 2018-12-11 | 2024-03-15 | 微软技术许可有限责任公司 | Multi-speaker neural text-to-speech synthesis |
CN111954903A (zh) * | 2018-12-11 | 2020-11-17 | 微软技术许可有限责任公司 | Multi-speaker neural text-to-speech synthesis |
WO2020145439A1 (fr) * | 2019-01-11 | 2020-07-16 | 엘지전자 주식회사 | Emotion-information-based speech synthesis method and device |
US11514886B2 (en) | 2019-01-11 | 2022-11-29 | Lg Electronics Inc. | Emotion classification information-based text-to-speech (TTS) method and apparatus |
CN110175262A (zh) * | 2019-05-31 | 2019-08-27 | 武汉斗鱼鱼乐网络科技有限公司 | Clustering-based deep learning model compression method, storage medium and system |
CN112885326A (zh) * | 2019-11-29 | 2021-06-01 | 阿里巴巴集团控股有限公司 | Personalized speech synthesis model creation, speech synthesis and testing methods and apparatus |
CN113192482A (zh) * | 2020-01-13 | 2021-07-30 | 北京地平线机器人技术研发有限公司 | Speech synthesis method, and speech synthesis model training method, apparatus and device |
CN113192482B (zh) * | 2020-01-13 | 2023-03-21 | 北京地平线机器人技术研发有限公司 | Speech synthesis method, and speech synthesis model training method, apparatus and device |
CN111276120A (zh) * | 2020-01-21 | 2020-06-12 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN111276120B (zh) * | 2020-01-21 | 2022-08-19 | 华为技术有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN113314096A (zh) * | 2020-02-25 | 2021-08-27 | 阿里巴巴集团控股有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN111681638A (zh) * | 2020-04-20 | 2020-09-18 | 深圳奥尼电子股份有限公司 | In-vehicle intelligent voice control method and system |
CN112382267A (zh) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Method, apparatus, device and storage medium for accent conversion |
CN112786006A (zh) * | 2021-01-13 | 2021-05-11 | 北京有竹居网络技术有限公司 | Speech synthesis method, synthesis model training method, apparatus, medium and device |
WO2022151931A1 (fr) * | 2021-01-13 | 2022-07-21 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, synthesis model training method and apparatus, medium and device |
CN112786006B (zh) * | 2021-01-13 | 2024-05-17 | 北京有竹居网络技术有限公司 | Speech synthesis method, synthesis model training method, apparatus, medium and device |
CN113823257A (zh) * | 2021-06-18 | 2021-12-21 | 腾讯科技(深圳)有限公司 | Speech synthesizer construction method, speech synthesis method and apparatus |
CN113823257B (zh) * | 2021-06-18 | 2024-02-09 | 腾讯科技(深圳)有限公司 | Speech synthesizer construction method, speech synthesis method and apparatus |
WO2023215132A1 (fr) * | 2022-05-04 | 2023-11-09 | Cerence Operating Company | Interactive speaking-style modification of synthesized speech |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9916825B2 (en) | Method and system for text-to-speech synthesis | |
EP3151239A1 (fr) | Methods and systems for text-to-speech synthesis | |
KR102582291B1 (ko) | Emotion-information-based speech synthesis method and apparatus | |
US11727914B2 (en) | Intent recognition and emotional text-to-speech learning | |
US11531819B2 (en) | Text-to-speech adapted by machine learning | |
JP6802005B2 (ja) | Speech recognition device, speech recognition method and speech recognition system | |
US10607595B2 (en) | Generating audio rendering from textual content based on character models | |
CN111048062A (zh) | Speech synthesis method and device | |
US10685644B2 (en) | Method and system for text-to-speech synthesis | |
EP2634714A2 (fr) | Apparatus and method for emotional audio synthesis | |
Delgado et al. | Spoken, multilingual and multimodal dialogue systems: development and assessment | |
JP2024505076A (ja) | Generating diverse and natural text-to-speech samples | |
Fellbaum et al. | Principles of electronic speech processing with applications for people with disabilities | |
US11176943B2 (en) | Voice recognition device, voice recognition method, and computer program product | |
López-Ludeña et al. | LSESpeak: A spoken language generator for Deaf people | |
Ifeanyi et al. | Text–To–Speech Synthesis (TTS) | |
EP3534363A1 (fr) | Information processing device and information processing method | |
US20230148275A1 (en) | Speech synthesis device and speech synthesis method | |
CN116631434A (zh) | Conversion-system-based video and speech synchronization method and apparatus, and electronic device | |
Chaurasiya | Cognitive hexagon-controlled intelligent speech interaction system | |
JP6289950B2 (ja) | Reading-aloud device, reading-aloud method and program | |
US20190019497A1 (en) | Expressive control of text-to-speech content | |
US11887583B1 (en) | Updating models with trained model update objects | |
KR20220116660A (ko) | Tumbler device equipped with an artificial-intelligence speaker function | |
KR20230067501A (ko) | Speech synthesis device and speech synthesis method thereof |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Original code: 0009012 |
| STAA | Information on the status of an EP patent application or granted EP patent | Status: the application has been published |
| AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| AX | Request for extension of the European patent | Extension state: BA ME |
| STAA | Information on the status of an EP patent application or granted EP patent | Status: request for examination was made |
2017-10-05 | 17P | Request for examination filed | |
| RBV | Designated contracting states (corrected) | Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| STAA | Information on the status of an EP patent application or granted EP patent | Status: examination is in progress |
2019-08-30 | 17Q | First examination report despatched | |
| STAA | Information on the status of an EP patent application or granted EP patent | Status: examination is in progress |
| STAA | Information on the status of an EP patent application or granted EP patent | Status: the application has been refused |
2021-09-17 | 18R | Application refused | |