WO2022039636A1 - Speech synthesis method with the transfer of reliable intonation of a cloned sample - Google Patents
- Publication number
- WO2022039636A1 (application PCT/RU2021/050284, RU2021050284W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- neural network
- text
- selected speaker
- voice
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 30
- 230000002194 synthesizing effect Effects 0.000 title abstract description 7
- 238000013528 artificial neural network Methods 0.000 claims abstract description 53
- 238000012549 training Methods 0.000 claims abstract description 26
- 238000013135 deep learning Methods 0.000 claims abstract description 24
- 230000015572 biosynthetic process Effects 0.000 claims abstract description 21
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 21
- 238000002360 preparation method Methods 0.000 claims description 15
- 238000012546 transfer Methods 0.000 claims description 12
- 239000012634 fragment Substances 0.000 claims description 11
- 230000008569 process Effects 0.000 claims description 9
- 238000012986 modification Methods 0.000 claims description 7
- 230000004048 modification Effects 0.000 claims description 7
- 238000006243 chemical reaction Methods 0.000 claims description 6
- 230000006978 adaptation Effects 0.000 claims description 5
- 230000003993 interaction Effects 0.000 claims description 4
- 238000013518 transcription Methods 0.000 claims description 4
- 230000035897 transcription Effects 0.000 claims description 4
- 238000012545 processing Methods 0.000 abstract description 5
- 238000004458 analytical method Methods 0.000 abstract description 2
- 238000007781 pre-processing Methods 0.000 abstract 1
- 230000009466 transformation Effects 0.000 description 4
- 238000010367 cloning Methods 0.000 description 3
- 238000001308 synthesis method Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- The invention relates to the field of methods and devices for recognizing, processing, analyzing, and synthesizing speech, and in particular to methods for synthesizing speech using artificial neural networks; it can be used for cloning and synthesizing the speech of a selected speaker with the transfer of reliable intonation of the cloned sample.
- Tacotron 2 and Waveglow can be singled out as the best-known and most advanced neural networks currently used for speech synthesis with the transfer of reliable intonation of the cloned sample.
- Tacotron 2 (accessed 07/29/2020) consists of two neural networks: the first converts text into a mel spectrogram, which is then transmitted to the second network (WaveNet) to generate the corresponding audio from the spectrogram.
- Waveglow ("WAVEGLOW: A FLOW-BASED GENERATIVE NETWORK FOR SPEECH SYNTHESIS" // Ryan Prenger, Rafael Valle, Bryan Catanzaro, NVIDIA Corporation // electronic resource URL: https://arxiv.org/pdf/1811.00002.pdf (accessed 07/27/2020)) is a flow-based network capable of generating high-quality speech from mel spectrograms. WaveGlow combines ideas from Glow and WaveNet to provide fast, efficient, and high-quality audio synthesis without the need for autoregression.
- Known related solutions include TR2018036413 A "EDUCATIONAL VOICE SYNTHESIS DEVICE, METHOD AND PROGRAM", Russian patent No. 268658 "MIXED SPEECH RECOGNITION", Russian patent No. 2720359 "METHOD AND EQUIPMENT FOR RECOGNITION OF EMOTIONS IN SPEECH", and Russian patent No. 2698153 "ADAPTIVE AUDIO ENHANCEMENT FOR MULTICHANNEL SPEECH RECOGNITION".
- Common to these solutions are: trainable artificial neural networks, including two neural networks operating simultaneously; preliminary preparation of a training database for the neural network; transformation of the input data into a mel spectrogram, with further processing of the mel spectrogram and its conversion into speech; the use of software; and the use of a convolutional neural network for deep learning.
- The closest technical solution is that of Russian patent No. 2632424 "METHOD AND SERVER FOR SPEECH SYNTHESIS BY TEXT" (priority date 09/29/2015).
- This solution is characterized in that it is a text-to-speech method that includes the steps of: obtaining training text data and corresponding training acoustic data; extracting one or more phonetic and linguistic characteristics of the training text data; extracting vocoder characteristics of the corresponding training acoustic data; correlating the vocoder features with the phonetic and linguistic features of the training text data and with one or more specific speech attributes; using a deep neural network to determine interdependence factors between speech attributes in the training data; receiving text; receiving a choice of speech attribute; converting the text into synthesized speech using an acoustic space model; and outputting the synthesized speech as audio with the selected speech attribute.
- The technical result is an increase in the naturalness of the human voice in synthesized speech.
- The shortcomings of the prototype prevent a high-quality, exact match between the intonation of the synthesized speech and the cloned speech sample of any speaker in any natural language, including complex ones such as Russian.
- None of the presented technical solutions in the indicated field offers a full-fledged hardware-software method for synthesizing any speech in any natural language, including Russian or another complex language, performed by any speaker, with the transfer of reliable intonation of the cloned sample in all its aspects and with the maximum correspondence of the synthesized voice to the voice of a real human speaker.
- The claimed method of speech synthesis with the transfer of reliable intonation of the cloned sample solves this technical problem: it is a full-fledged hardware-software method for synthesizing any speech in any natural language, including Russian or another complex language, performed by any speaker, with the transfer of reliable intonation of the cloned sample in all its aspects and with the maximum correspondence of the synthesized voice to the voice of a real human speaker. This is achieved by careful manual (mechanical) preparation of the training dataset for the neural networks; by the simultaneous use of the Tacotron2 and Waveglow neural networks, with deep learning and modification of the Tacotron2 network to maximally adapt it to the features of a particular language; by the use of software to control the operation of the neural networks; and by the use of a web service and a website for the interaction of any user with the software and computer equipment.
- The technical result of the proposed technical solution "Method of speech synthesis with the transfer of reliable intonation of the cloned sample" is that, owing to careful manual (mechanical) preparation of the training dataset and a qualitative change in the architecture of the artificial neural network used, aimed at its maximum adaptation to the characteristics of a particular language, the method achieves the transfer of reliable intonation of the cloned speech sample of any selected speaker in any natural language, including a complex one such as Russian. That is, all aspects of the intonation of speech synthesized from arbitrary text entered by a third-party user correspond maximally to the voice of any speaker in any natural language, as a result of which the synthesized speech becomes indistinguishable from natural speech; the arsenal of speech synthesis methods using artificial neural networks is also expanded overall.
- The method of speech synthesis with the transfer of reliable intonation of the cloned sample includes the steps of: preliminary preparation of a training dataset consisting of a text and the corresponding audio recording of the selected speaker's speech; deep learning of the neural network on the training dataset, producing at the output a mel spectrogram of the selected speaker's voice; conversion of the mel spectrogram by a vocoder, with an audio file in WAV format as output; and reuse of the already trained neural network and vocoder to convert arbitrary user-loaded text into the speech of the selected speaker as processed at the dataset preparation and deep learning stages, with an output audio file in WAV format voicing the arbitrary text in the selected speaker's voice. The method is characterized in that the audio recording of the selected speaker's speech is divided into fragments of no more than 16 seconds each, and the dataset is prepared manually, with a person carefully checking each fragment of the audio recording against the corresponding fragment of text for a complete match.
- The method of speech synthesis with the transfer of reliable intonation of the cloned sample includes the following steps.
- A training dataset is manually prepared, consisting of a text and the corresponding audio recording of the selected speaker's speech, divided into fragments no longer than 16 seconds each.
- Manual preparation of the dataset means that each fragment of the audio recording and the corresponding fragment of text are carefully checked by a person, who listens to the audio fragment while reading the corresponding text fragment to verify their complete coincidence. If the text does not match the audio recording, the person uses a computer to edit the text so that it corresponds as closely as possible to the transcription of the audio recording.
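The preparation step above can be sketched in code. The following is an illustration only, not code from the patent: the "wav_path|transcript" metadata format, the file names, and the external duration table are all assumptions; in practice durations could be measured from the audio files themselves (e.g. with the `wave` stdlib module).

```python
# Hypothetical dataset check mirroring the manual preparation step:
# every fragment must have a non-empty transcript and be at most 16 s long.
MAX_FRAGMENT_SECONDS = 16.0

def validate_dataset(metadata_lines, durations):
    """Split 'wav_path|transcript' lines into valid pairs and problem entries.

    durations: dict mapping wav_path -> fragment length in seconds.
    """
    valid, problems = [], []
    for line in metadata_lines:
        parts = line.strip().split("|")
        if len(parts) != 2 or not parts[1].strip():
            problems.append((line, "missing transcript"))
            continue
        path, text = parts
        if durations.get(path, 0.0) > MAX_FRAGMENT_SECONDS:
            problems.append((line, "fragment longer than 16 s"))
            continue
        valid.append((path, text))
    return valid, problems

meta = ["a.wav|Привет, мир", "b.wav|", "c.wav|Too long"]
durs = {"a.wav": 7.3, "b.wav": 4.0, "c.wav": 19.5}
ok, bad = validate_dataset(meta, durs)
# ok keeps only the correctly paired, under-16-second fragment
```

Entries flagged here would still be corrected by a person, as the method requires; the script only narrows down what needs listening.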
- The minimum dataset size for subsequent full-fledged training of a neural network is 20 hours of audio recording for satisfactory (test) quality, and 30 hours of speech for commercial operation of the selected speaker's voice.
- The process of modifying and deep learning the Tacotron2 artificial neural network (model) is carried out with regard to the specifics of a particular natural language, for example, Russian.
- The manually prepared training dataset and the Tacotron2 and Waveglow neural networks (models) are loaded onto the computer's graphics and central processors, and tensor calculations of the weights of the Tacotron2 and Waveglow models are performed; these weights determine the speech features of the selected speaker.
- At the encoding stage, text characters from the dataset are transformed into their numerical representation. The convolutional layers of the Tacotron2 neural network then determine the relationships between letters within a word and in the text as a whole. The result is passed to the bidirectional recurrent layer of the Tacotron2 neural network, which uses its internal memory to process sequences of arbitrary length, preserving the state of the "past" and the "future", that is, remembering the context of a particular piece of text and audio recording.
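The first step of the encoding stage, converting characters into a numerical representation, can be sketched as follows. This is an illustration under assumed conventions (index 0 reserved for padding), not the patent's actual code; in the real Tacotron2, these indices feed an embedding layer followed by the convolutional and bidirectional recurrent layers described above.

```python
# Hypothetical character-to-index vocabulary for the encoding stage.
def build_vocab(texts):
    # Index 0 is reserved for padding; real systems also add punctuation
    # and special tokens.
    chars = sorted({ch for t in texts for ch in t})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab):
    """Map each character of the text to its numeric index."""
    return [vocab[ch] for ch in text]

vocab = build_vocab(["речь", "речи"])   # five distinct Cyrillic characters
ids = encode("речь", vocab)             # numeric representation of the word
```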
- At the decoding stage, the result obtained at the encoding stage passes through the Tacotron2 "attention" layer, which computes a weighted average over all outputs of the encoding-stage network. The decoder consists of two unidirectional memory layers of the Tacotron2 neural network, a pre-net layer needed for learning attention, and a layer of linear transformation into a mel spectrogram.
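The weighted average computed by the attention layer can be illustrated with a toy example. This sketch uses plain softmax over made-up alignment scores; it is not Tacotron2's location-sensitive attention, and all values are assumptions for illustration.

```python
import math

def attention_context(scores, encoder_outputs):
    """Softmax the alignment scores, then return the weighted average
    of the encoder output vectors (the attention 'context' vector)."""
    m = max(scores)                                  # for numerical stability
    exp = [math.exp(s - m) for s in scores]
    total = sum(exp)
    weights = [e / total for e in exp]               # sum to 1
    dim = len(encoder_outputs[0])
    return [sum(w * vec[i] for w, vec in zip(weights, encoder_outputs))
            for i in range(dim)]

# Three encoder steps, two-dimensional outputs; the middle step dominates.
ctx = attention_context([0.1, 2.0, 0.3],
                        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the second score is largest, the context vector lies closest to the second encoder output, which is exactly how attention focuses the decoder on the relevant piece of the input text.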
- The result of the decoding stage passes through the five-layer convolutional post-net of the Tacotron2 neural network, which improves the quality of the mel spectrogram.
- The resulting processed mel spectrogram is transferred to the vocoder, the Waveglow neural network, which outputs an audio file in WAV format.
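The final packaging of vocoder output as a WAV file can be sketched with the Python standard library. The sine tone below stands in for real WaveGlow output, and the 22050 Hz sample rate is an assumption (a rate commonly used with Tacotron2/WaveGlow, but not stated in the patent).

```python
import io
import math
import struct
import wave

def samples_to_wav_bytes(samples, sample_rate=22050):
    """Pack float samples in [-1, 1] as 16-bit mono PCM WAV, returned as bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples)
        wav.writeframes(frames)
    return buf.getvalue()

# 0.1 s of a 440 Hz tone as a placeholder for vocoder output.
tone = [math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
wav_bytes = samples_to_wav_bytes(tone)
```

The resulting bytes begin with the standard RIFF/WAVE header and can be written to disk as the method's final `.wav` output.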
- The Tacotron2 model, modified at the previous deep learning stages, and the Waveglow network with its calculated weights are reloaded onto the computer's graphics and central processors, and the arbitrary text loaded by the user is converted into the speech of the speaker as processed at the dataset preparation and deep learning stages of the Tacotron2 model.
- The processes of modifying and deep learning the Tacotron2 model with the output of a mel spectrogram, of converting the mel spectrogram into a WAV audio file by the Waveglow network, and of further converting arbitrary user-loaded text into the speech of the speaker as processed at the dataset preparation and deep learning stages of the Tacotron2 model, are all controlled by software.
- The interaction of the user with the software and computer equipment, when the user uploads arbitrary text to be voiced in the selected speaker's voice and receives an audio file in WAV format as output, is carried out via a web service written in Java and a website.
- The novelty and inventive step of the presented invention lie in the fact that, in the described method of speech synthesis with the transfer of reliable intonation of the cloned sample, the training dataset for the Tacotron2 and Waveglow neural networks is thoroughly prepared by hand (mechanically), and the Tacotron2 neural network undergoes a modification process: the number of weights of its model is increased, the amount of its memory is expanded, and it then undergoes deep learning on the prepared training dataset using a larger number of "features" (specific software capabilities), in order to maximally adapt the neural network to the features of a particular language.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention relates to the field of speech recognition, processing, analysis, and synthesis, and in particular to methods of speech synthesis using an artificial neural network. The technical result of the invention is the transfer of reliable intonation of a cloned sample of a selected speaker in any natural language, including a complex language such as Russian, that is, the maximum correspondence of all aspects of the intonation, synthesized from arbitrary text entered by a third-party user, to the voice of any selected speaker in any natural language, so that the synthesized speech cannot be distinguished from natural speech. A training dataset consisting of a text and the corresponding audio recording of the selected speaker's voice is prepared in advance. Deep learning of the neural network is performed on the basis of the training dataset, and a mel spectrogram of the selected speaker's voice is obtained as output. The mel spectrogram is converted using a vocoder to obtain an audio file as output. The trained neural network and the vocoder are then reused to convert arbitrary text entered by the user into the speech of the selected speaker, so as to obtain as output an audio file voicing the arbitrary text with the selected speaker's voice.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2020127476A RU2754920C1 (ru) | 2020-08-17 | 2020-08-17 | Способ синтеза речи с передачей достоверного интонирования клонируемого образца |
RU2020127476 | 2020-08-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022039636A1 true WO2022039636A1 (fr) | 2022-02-24 |
Family
ID=77670309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RU2021/050284 WO2022039636A1 (fr) | 2020-08-17 | 2021-09-02 | Procédé de synthèse vocale avec attribution d'une intonation fiable d'un modèle à cloner |
Country Status (2)
Country | Link |
---|---|
RU (1) | RU2754920C1 (fr) |
WO (1) | WO2022039636A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116151832A (zh) * | 2023-04-18 | 2023-05-23 | 支付宝(杭州)信息技术有限公司 | 一种交互式风控系统及方法 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2632424C2 (ru) * | 2015-09-29 | 2017-10-04 | Общество С Ограниченной Ответственностью "Яндекс" | Способ и сервер для синтеза речи по тексту |
CN110335587A (zh) * | 2019-06-14 | 2019-10-15 | 平安科技(深圳)有限公司 | 语音合成方法、系统、终端设备和可读存储介质 |
CN108597492B (zh) * | 2018-05-02 | 2019-11-26 | 百度在线网络技术(北京)有限公司 | 语音合成方法和装置 |
JP6649210B2 (ja) * | 2016-08-30 | 2020-02-19 | 日本電信電話株式会社 | 音声合成学習装置、方法、及びプログラム |
CN110853616A (zh) * | 2019-10-22 | 2020-02-28 | 武汉水象电子科技有限公司 | 一种基于神经网络的语音合成方法、系统与存储介质 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9390712B2 (en) * | 2014-03-24 | 2016-07-12 | Microsoft Technology Licensing, Llc. | Mixed speech recognition |
KR102151682B1 (ko) * | 2016-03-23 | 2020-09-04 | 구글 엘엘씨 | 다중채널 음성 인식을 위한 적응성 오디오 강화 |
RU2720359C1 (ru) * | 2019-04-16 | 2020-04-29 | Хуавэй Текнолоджиз Ко., Лтд. | Способ и оборудование распознавания эмоций в речи |
-
2020
- 2020-08-17 RU RU2020127476A patent/RU2754920C1/ru active
-
2021
- 2021-09-02 WO PCT/RU2021/050284 patent/WO2022039636A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2632424C2 (ru) * | 2015-09-29 | 2017-10-04 | Общество С Ограниченной Ответственностью "Яндекс" | Способ и сервер для синтеза речи по тексту |
JP6649210B2 (ja) * | 2016-08-30 | 2020-02-19 | 日本電信電話株式会社 | 音声合成学習装置、方法、及びプログラム |
CN108597492B (zh) * | 2018-05-02 | 2019-11-26 | 百度在线网络技术(北京)有限公司 | 语音合成方法和装置 |
CN110335587A (zh) * | 2019-06-14 | 2019-10-15 | 平安科技(深圳)有限公司 | 语音合成方法、系统、终端设备和可读存储介质 |
CN110853616A (zh) * | 2019-10-22 | 2020-02-28 | 武汉水象电子科技有限公司 | 一种基于神经网络的语音合成方法、系统与存储介质 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116151832A (zh) * | 2023-04-18 | 2023-05-23 | 支付宝(杭州)信息技术有限公司 | 一种交互式风控系统及方法 |
CN116151832B (zh) * | 2023-04-18 | 2023-07-21 | 支付宝(杭州)信息技术有限公司 | 一种交互式风控系统及方法 |
Also Published As
Publication number | Publication date |
---|---|
RU2754920C1 (ru) | 2021-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7355306B2 (ja) | 機械学習を利用したテキスト音声合成方法、装置およびコンピュータ読み取り可能な記憶媒体 | |
JP7436709B2 (ja) | 非発話テキストおよび音声合成を使う音声認識 | |
WO2020215666A1 (fr) | Procédé et appareil de synthèse de la parole, dispositif informatique et support de stockage | |
CN112687259B (zh) | 一种语音合成方法、装置以及可读存储介质 | |
CN113439301A (zh) | 使用序列到序列映射在模拟数据与语音识别输出之间进行协调 | |
JP2023535230A (ja) | 2レベル音声韻律転写 | |
US20230036020A1 (en) | Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score | |
JP2023539888A (ja) | 声変換および音声認識モデルを使用した合成データ拡大 | |
US20220246132A1 (en) | Generating Diverse and Natural Text-To-Speech Samples | |
EP4205106A1 (fr) | Procédé et système de synthèse vocale, et procédé d'apprentissage d'un système de synthèse vocale | |
WO2023245389A1 (fr) | Procédé de gestion de chanson, appareil, dispositif électronique et support de stockage | |
Jain et al. | A text-to-speech pipeline, evaluation methodology, and initial fine-tuning results for child speech synthesis | |
Kaur et al. | Genetic algorithm for combined speaker and speech recognition using deep neural networks | |
CN113470622B (zh) | 一种可将任意语音转换成多个语音的转换方法及装置 | |
Shechtman et al. | Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture. | |
WO2022039636A1 (fr) | Procédé de synthèse vocale avec attribution d'une intonation fiable d'un modèle à cloner | |
Li et al. | End-to-end mongolian text-to-speech system | |
KR102639322B1 (ko) | 실시간 음색 및 운율 스타일 복제 가능한 음성합성 시스템 및 방법 | |
US20230146945A1 (en) | Method of forming augmented corpus related to articulation disorder, corpus augmenting system, speech recognition platform, and assisting device | |
CN115359775A (zh) | 一种端到端的音色及情感迁移的中文语音克隆方法 | |
JP7357518B2 (ja) | 音声合成装置及びプログラム | |
Eshghi et al. | An Investigation of Features for Fundamental Frequency Pattern Prediction in Electrolaryngeal Speech Enhancement | |
Nazir et al. | Multi speaker text-to-speech synthesis using generalized end-to-end loss function | |
Zhang et al. | A Non-Autoregressivee Network for Chinese Text to Speech and Voice Cloning | |
Wu et al. | VStyclone: Real-time Chinese voice style clone |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21858698 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 21858698 Country of ref document: EP Kind code of ref document: A1 |