WO2022039636A1 - Method of speech synthesis with transfer of reliable intonation of a cloned sample - Google Patents

Method of speech synthesis with transfer of reliable intonation of a cloned sample

Info

Publication number
WO2022039636A1
WO2022039636A1 (PCT/RU2021/050284)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
neural network
text
selected speaker
voice
Prior art date
Application number
PCT/RU2021/050284
Other languages
English (en)
Russian (ru)
Inventor
Петр Владимирович ТАГУНОВ
Владислав Александрович ГОНТА
Original Assignee
Автономная некоммерческая организация поддержки и развития науки, управления и социального развития людей в области разработки и внедрения искусственного интеллекта "ЦифровойТы"
Priority date
Filing date
Publication date
Application filed by Автономная некоммерческая организация поддержки и развития науки, управления и социального развития людей в области разработки и внедрения искусственного интеллекта "ЦифровойТы"
Publication of WO2022039636A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The invention relates to the field of methods and devices for recognizing, processing, analyzing and synthesizing speech, in particular to methods of speech synthesis using artificial neural networks, and can be used for cloning and synthesizing the speech of a selected speaker with transfer of reliable intonation of the cloned sample.
  • Tacotron 2 and WaveGlow can be singled out as the best-known and most advanced neural networks currently used for speech synthesis with transfer of reliable intonation of a cloned sample.
  • Tacotron 2 (accessed 07/29/2020) consists of two neural networks: the first converts text into a mel spectrogram, which is then passed to the second network (WaveNet), which reads the spectrogram like an image and generates the corresponding sound.
  • WaveGlow ("WaveGlow: A Flow-based Generative Network for Speech Synthesis", Ryan Prenger, Rafael Valle, Bryan Catanzaro, NVIDIA Corporation; URL: https://arxiv.org/pdf/1811.00002.pdf, accessed 07/27/2020) is a flow-based network capable of generating high-quality speech from mel spectrograms. WaveGlow combines ideas from Glow and WaveNet to provide fast, efficient, high-quality audio synthesis without the need for autoregression (a minimal sketch of this two-network division follows this list).
  • Similar solutions are also known: TR2018036413 A "EDUCATIONAL VOICE SYNTHESIS DEVICE, METHOD AND PROGRAM"; Russian patent No. 268658 "MIXED SPEECH RECOGNITION"; Russian patent No. 2720359 "METHOD AND EQUIPMENT FOR RECOGNITION OF EMOTIONS IN SPEECH"; and Russian patent No. 2698153 "ADAPTIVE AUDIO ENHANCEMENT FOR MULTICHANNEL SPEECH RECOGNITION".
  • These solutions share common features: trainable artificial neural networks, including the simultaneous use of two neural networks; preliminary preparation of a training database for the neural network; transformation of the initial data into a mel spectrogram, with further processing of the mel spectrogram and its conversion into speech; the use of software; and the use of a convolutional neural network for deep learning.
  • The closest technical solution is that of Russian patent No. 2632424, "METHOD AND SERVER FOR TEXT-TO-SPEECH SYNTHESIS" (priority date 09/29/2015).
  • That solution is a text-to-speech method comprising the steps of: obtaining training text data and corresponding training acoustic data; extracting one or more phonetic and linguistic features of the training text data; extracting vocoder features of the corresponding training acoustic data; correlating the vocoder features with the phonetic and linguistic features of the training text data and with one or more defined speech attributes; using a deep neural network to determine interdependence factors between the speech attributes in the training data; receiving text; receiving a selection of a speech attribute; converting the text into synthesized speech using an acoustic space model; and outputting the synthesized speech as audio with the selected speech attribute.
  • Its technical result is an increase in the naturalness of the human voice in synthesized speech.
  • The shortcomings of this prototype prevent a high-quality, exact match between the intonation of the synthesized speech and the cloned speech sample of an arbitrary speaker in an arbitrary natural language, including a complex one such as Russian.
  • None of the cited solutions offers a complete hardware-software method for synthesizing any speech in any natural language, including Russian or another complex language, in the voice of any speaker, with transfer of reliable intonation of the cloned sample in all its aspects and maximum correspondence of the synthesized voice to the voice of a real human speaker.
  • The claimed method of speech synthesis with transfer of reliable intonation of the cloned sample solves this technical problem: it is a complete hardware-software method for synthesizing any speech in any natural language, including Russian or another complex language, in the voice of any speaker, with transfer of reliable intonation of the cloned sample in all its aspects and maximum correspondence of the synthesized voice to the voice of a real human speaker. This is achieved by careful manual (mechanical) preparation of the training dataset for the neural networks; by the simultaneous use of the Tacotron2 and WaveGlow neural networks, with deep learning and modification of the Tacotron2 network to adapt it as closely as possible to the features of a particular language; by the use of software to control the operation of the neural networks; and by the use of a web service and a website for the interaction of any user with the software and hardware.
  • The technical result of the proposed solution, "Method of speech synthesis with transfer of reliable intonation of the cloned sample", is that, owing to careful manual (mechanical) preparation of the training dataset and a qualitative change in the architecture of the artificial neural network for its maximum adaptation to the characteristics of a particular language, speech synthesis by the proposed method transfers reliable intonation of the cloned speech sample of any selected speaker in any natural language, including a complex one such as Russian. That is, all aspects of the intonation of speech synthesized from arbitrary text entered by a third-party user correspond as closely as possible to the voice of the speaker, so that the synthesized speech becomes indistinguishable from natural speech; the solution also expands, in general, the arsenal of speech synthesis methods using artificial neural networks.
  • The method of speech synthesis with transfer of reliable intonation of the cloned sample includes the steps of: preliminary preparation of a training dataset consisting of text and the corresponding audio recording of the speech of the selected speaker; deep learning of the neural network on the training dataset, obtaining at the output a mel spectrogram of the voice of the selected speaker; conversion of the mel spectrogram by a vocoder, with output of an audio file in WAV format; and re-use of the already trained neural network and vocoder to convert arbitrary user-loaded text into the speech of the selected speaker processed at the dataset preparation and deep-learning stages, with output of a WAV audio file voicing the arbitrary text in the voice of the selected speaker. The method is characterized in that the audio recording of the speech of the selected speaker is divided into fragments of no more than 16 seconds each, and the dataset is prepared manually, with a person carefully checking each fragment of the audio recording and the corresponding fragment of the text for a complete match.
  • The method of speech synthesis with transfer of reliable intonation of the cloned sample includes the following steps.
  • First, a training dataset is manually prepared, consisting of text and the corresponding audio recording of the speech of the selected speaker, divided into fragments no longer than 16 seconds each.
  • Manual preparation of the dataset means that a person carefully checks each fragment of the audio recording against the corresponding fragment of text, listening to the audio fragment while reading the text, to confirm a complete match. If the text does not match the audio recording, the person uses a computer to edit the text so that it corresponds as closely as possible to a transcription of the audio recording (a sketch of assembling such a dataset follows this list).
  • The minimum dataset size for subsequent full training of a neural network on it is 20 hours of audio recording for satisfactory (test) quality and 30 hours of speech for commercial operation of the voice of the selected speaker.
  • The Tacotron2 artificial neural network (model) is then modified and deep-trained with respect to the specifics of a particular natural language, for example Russian.
  • The manually prepared training dataset and the Tacotron2 and WaveGlow neural networks (models) are loaded onto the graphics and central processors of the computer, and tensor calculations of the Tacotron2 and WaveGlow model weights, which capture the speech features of the selected speaker, are performed.
  • At the encoding stage, the text characters from the dataset are converted into their numerical representation. The convolutional layers of the Tacotron2 neural network then determine the relationships between letters within a word and within the text as a whole. The result then passes to the bidirectional layer of the Tacotron2 network, which uses its internal memory to process sequences of arbitrary length, preserving "past" and "future" state, that is, remembering the context of a particular piece of text and audio recording (see the encoder sketch after this list).
  • At the decoding stage, the result obtained at the encoding stage passes through the Tacotron2 "attention" layer, which computes a weighted average over all outputs of the encoding-stage network. The decoder consists of two unidirectional memory layers of the Tacotron2 network, a pre-net layer necessary for learning attention, and a layer of linear transformation into a mel spectrogram (see the decoder sketch after this list).
  • The result of the decoding stage passes through the five-convolution layer (post-net) of the Tacotron2 network to improve the quality of the mel spectrogram (see the post-net sketch after this list).
  • The resulting processed mel spectrogram is passed to the vocoder, the WaveGlow neural network, which outputs an audio file in WAV format (see the vocoder sketch after this list).
  • To synthesize new speech, the Tacotron2 model as modified at the preceding deep-learning stages and the WaveGlow network with its computed weights are reloaded onto the graphics and central processors of the computer, and arbitrary text loaded by the user is converted into the speech of the speaker processed at the dataset preparation and deep-learning stages (see the end-to-end sketch after this list).
  • The processes of modifying and deep-training the Tacotron2 model with output of a mel spectrogram, converting the mel spectrogram into a WAV audio file with the WaveGlow network, and then converting arbitrary user-loaded text into the speech of the speaker processed at the dataset preparation and deep-learning stages, are controlled by software.
  • The user interacts with the software and hardware, uploading arbitrary text to be voiced in the voice of the selected speaker and receiving an audio file in WAV format, through a web service written in Java and a website (see the service sketch after this list).
  • The novelty and inventive step of the presented invention lie in the following: in the described method of speech synthesis with transfer of reliable intonation of the cloned sample, thorough manual (mechanical) preparation of the training dataset for the Tacotron2 and WaveGlow neural networks is carried out, and the Tacotron2 network is modified by increasing the number of weights in its model and expanding its memory, with subsequent deep learning on the prepared training dataset using a larger number of "features" (specific software capabilities), in order to adapt the neural network as closely as possible to the features of a particular language.
  • "Features" here means specific software capabilities.
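The sketches below illustrate the steps described in this section; they are minimal reconstructions under stated assumptions, not the patented implementation. First, the division of labor between the two cited networks: Tacotron 2 maps text to a mel spectrogram, and WaveGlow turns the mel spectrogram into a waveform. This sketch assumes the publicly documented NVIDIA torch.hub entry points (nvidia_tacotron2, nvidia_waveglow, nvidia_tts_utils) and a CUDA-capable machine:

```python
import torch

# Entry-point names and call signatures follow NVIDIA's published
# DeepLearningExamples torch.hub README; treat them as assumptions.
HUB = 'NVIDIA/DeepLearningExamples:torchhub'
tacotron2 = torch.hub.load(HUB, 'nvidia_tacotron2', pretrained=True).to('cuda').eval()
waveglow = torch.hub.load(HUB, 'nvidia_waveglow', pretrained=True)
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()  # per the README
utils = torch.hub.load(HUB, 'nvidia_tts_utils')

# Text -> mel spectrogram (Tacotron 2), then mel spectrogram -> waveform (WaveGlow).
sequences, lengths = utils.prepare_input_sequence(["Hello, world."])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
```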
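Dataset preparation, as described above, yields manually verified pairs of an audio fragment (at most 16 seconds) and its exact transcript, 20 to 30 hours in total. A sketch of assembling verified pairs into a training filelist, assuming the `audio_path|transcript` one-line-per-fragment convention of the public Tacotron2 training code; the helper names are illustrative:

```python
import wave
from pathlib import Path

MAX_SECONDS = 16.0  # the method caps each fragment at 16 seconds

def fragment_seconds(path: Path) -> float:
    """Duration of a WAV fragment in seconds."""
    with wave.open(str(path), 'rb') as w:
        return w.getnframes() / w.getframerate()

def build_filelist(verified_pairs, out_path: Path) -> float:
    """Write 'audio_path|transcript' lines for human-verified pairs.

    `verified_pairs` is an iterable of (wav_path, transcript) that a person
    has already checked by listening while reading; overlong fragments are
    rejected rather than silently truncated. Returns total hours collected.
    """
    total_seconds = 0.0
    with out_path.open('w', encoding='utf-8') as out:
        for wav_path, transcript in verified_pairs:
            duration = fragment_seconds(wav_path)
            if duration > MAX_SECONDS:
                raise ValueError(f'{wav_path}: {duration:.1f}s fragment; split it first')
            out.write(f'{wav_path}|{transcript.strip()}\n')
            total_seconds += duration
    return total_seconds / 3600.0

# The method requires >= 20 h for test quality, >= 30 h for commercial use:
# hours = build_filelist(pairs, Path('train_filelist.txt'))
```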
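The encoding stage (character embedding, convolutional layers that capture letter relationships, and a bidirectional recurrent layer that keeps "past" and "future" context) can be sketched as follows; the layer sizes match the published Tacotron2 defaults, not the patent's modified configuration:

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Illustrative Tacotron2-style encoder: character ids -> contextual features."""

    def __init__(self, n_symbols: int = 148, dim: int = 512, n_convs: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, dim)  # numeric representation of characters
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            )
            for _ in range(n_convs)
        )
        # Bidirectional LSTM: internal memory over the whole sequence,
        # preserving "past" and "future" context of each position.
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:  # (batch, text_len)
        x = self.embedding(char_ids).transpose(1, 2)   # (batch, dim, text_len)
        for conv in self.convs:
            x = conv(x)                                # relationships between letters
        out, _ = self.lstm(x.transpose(1, 2))          # (batch, text_len, dim)
        return out

# enc_out = EncoderSketch()(torch.randint(0, 148, (1, 40)))
```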
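The decoding stage, in skeleton form: attention computed as a weighted average over all encoder outputs, a pre-net, two unidirectional memory (LSTM) layers, and a linear projection to a mel frame. The real Tacotron2 uses location-sensitive attention; it is simplified here to a plain content-based score:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStepSketch(nn.Module):
    """One illustrative decoding step: previous mel frame -> next mel frame."""

    def __init__(self, enc_dim=512, mel_dim=80, prenet_dim=256, rnn_dim=1024):
        super().__init__()
        # Pre-net: the bottleneck the description says is needed to learn attention.
        self.prenet = nn.Sequential(
            nn.Linear(mel_dim, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(),
        )
        self.rnn1 = nn.LSTMCell(prenet_dim + enc_dim, rnn_dim)  # unidirectional memory 1
        self.rnn2 = nn.LSTMCell(rnn_dim, rnn_dim)               # unidirectional memory 2
        self.attn_score = nn.Linear(rnn_dim + enc_dim, 1)       # simplified attention
        self.to_mel = nn.Linear(rnn_dim + enc_dim, mel_dim)     # linear projection to mel

    def forward(self, prev_mel, enc_out, state1, state2):
        # Attention: a weighted average over all encoder-stage outputs.
        query = state1[0].unsqueeze(1).expand(-1, enc_out.size(1), -1)
        weights = F.softmax(
            self.attn_score(torch.cat([query, enc_out], dim=-1)).squeeze(-1), dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)

        x = torch.cat([self.prenet(prev_mel), context], dim=-1)
        state1 = self.rnn1(x, state1)
        state2 = self.rnn2(state1[0], state2)
        mel_frame = self.to_mel(torch.cat([state2[0], context], dim=-1))
        return mel_frame, state1, state2
```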
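The five-convolution post-net that refines the predicted mel spectrogram, sketched with the residual connection used in the published Tacotron2 architecture:

```python
import torch
import torch.nn as nn

class PostNetSketch(nn.Module):
    """Five convolutions computing a residual correction to the mel spectrogram."""

    def __init__(self, mel_dim: int = 80, dim: int = 512):
        super().__init__()
        channels = [mel_dim, dim, dim, dim, dim, mel_dim]
        self.convs = nn.ModuleList(
            nn.Conv1d(channels[i], channels[i + 1], kernel_size=5, padding=2)
            for i in range(5)
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # (batch, mel_dim, frames)
        x = mel
        for conv in self.convs[:-1]:
            x = torch.tanh(conv(x))
        return mel + self.convs[-1](x)  # improved-quality mel spectrogram
```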
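Conversion of the finished mel spectrogram into a WAV file by the WaveGlow vocoder. The 22,050 Hz sample rate is the one used in the public Tacotron2/WaveGlow recipes and is assumed here:

```python
import torch
from scipy.io.wavfile import write as write_wav

SAMPLE_RATE = 22050  # assumed; must match the rate the models were trained at

def mel_to_wav(waveglow, mel: torch.Tensor, out_path: str = 'speech.wav') -> str:
    """Run a WaveGlow-style vocoder on a mel spectrogram and save a WAV file."""
    with torch.no_grad():
        audio = waveglow.infer(mel)  # (1, n_samples), floats in roughly [-1, 1]
    pcm = (audio[0].clamp(-1, 1).cpu().numpy() * 32767).astype('int16')
    write_wav(out_path, SAMPLE_RATE, pcm)
    return out_path
```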
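Re-use of the already trained models for arbitrary user text, the final step of the method. `text_to_ids` stands in for whatever text normalization and symbol mapping the trained model expects and is hypothetical:

```python
import torch

def synthesize(text, tacotron2, waveglow, text_to_ids, out_path='speech.wav'):
    """Arbitrary user text -> WAV file voiced by the cloned speaker."""
    ids = torch.LongTensor(text_to_ids(text)).unsqueeze(0)  # (1, text_len)
    lengths = torch.LongTensor([ids.size(1)])
    with torch.no_grad():
        mel, _, _ = tacotron2.infer(ids, lengths)  # trained text -> mel model
    return mel_to_wav(waveglow, mel, out_path)     # vocoder sketch above
```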
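The patent exposes the pipeline through a web service written in Java and a website. Purely to illustrate the request/response flow, and in the same language as the sketches above rather than Java, a minimal HTTP endpoint on Python's standard library; the port and the in-scope model objects are assumptions:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class TTSHandler(BaseHTTPRequestHandler):
    """POST arbitrary UTF-8 text; receive a WAV file in the cloned voice."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        text = self.rfile.read(length).decode('utf-8')
        # tacotron2, waveglow and text_to_ids come from the sketches above.
        wav_path = synthesize(text, tacotron2, waveglow, text_to_ids)
        with open(wav_path, 'rb') as f:
            wav = f.read()
        self.send_response(200)
        self.send_header('Content-Type', 'audio/wav')
        self.send_header('Content-Length', str(len(wav)))
        self.end_headers()
        self.wfile.write(wav)

# HTTPServer(('0.0.0.0', 8080), TTSHandler).serve_forever()
```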

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to the field of speech recognition, processing, analysis and synthesis, and in particular to methods of speech synthesis using an artificial neural network. The technical result of the invention is the transfer of reliable intonation of a cloned sample of a selected speaker in any natural language, including a complex language such as Russian, that is, maximum correspondence of all aspects of the intonation synthesized on the basis of arbitrary text entered by a third-party user to the voice of any selected speaker in any natural language, so that the synthesized speech cannot be distinguished from natural speech. A training dataset consisting of text and the corresponding audio recording of the voice of the selected speaker is prepared in advance. Deep learning of the neural network is carried out on the training dataset, and a mel spectrogram of the voice of the selected speaker is obtained at the output. The mel spectrogram is converted by means of a vocoder to obtain an audio file at the output. The trained neural network and the vocoder are then applied again to convert arbitrary text entered by the user into the speech of the selected speaker, obtaining at the output an audio file voicing the arbitrary text in the voice of the selected speaker.
PCT/RU2021/050284 2020-08-17 2021-09-02 Method of speech synthesis with transfer of reliable intonation of a cloned sample WO2022039636A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2020127476A RU2754920C1 (ru) 2020-08-17 2020-08-17 Method of speech synthesis with transfer of reliable intonation of a cloned sample
RU2020127476 2020-08-17

Publications (1)

Publication Number Publication Date
WO2022039636A1 (fr) 2022-02-24

Family

ID=77670309

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2021/050284 WO2022039636A1 (fr) 2020-08-17 2021-09-02 Method of speech synthesis with transfer of reliable intonation of a cloned sample

Country Status (2)

Country Link
RU (1) RU2754920C1 (fr)
WO (1) WO2022039636A1 (fr)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9390712B2 * 2014-03-24 2016-07-12 Microsoft Technology Licensing, LLC Mixed speech recognition
KR102151682B1 * 2016-03-23 2020-09-04 Google LLC Adaptive audio enhancement for multichannel speech recognition
RU2720359C1 * 2019-04-16 2020-04-29 Huawei Technologies Co., Ltd. Method and equipment for recognizing emotions in speech

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2632424C2 * 2015-09-29 2017-10-04 Limited Liability Company "Yandex" Method and server for text-to-speech synthesis
JP6649210B2 * 2016-08-30 2020-02-19 Nippon Telegraph and Telephone Corporation Speech synthesis training apparatus, method, and program
CN108597492B * 2018-05-02 2019-11-26 Baidu Online Network Technology (Beijing) Co., Ltd. Speech synthesis method and apparatus
CN110335587A * 2019-06-14 2019-10-15 Ping An Technology (Shenzhen) Co., Ltd. Speech synthesis method, system, terminal device, and readable storage medium
CN110853616A * 2019-10-22 2020-02-28 Wuhan Shuixiang Electronic Technology Co., Ltd. Neural-network-based speech synthesis method, system, and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116151832A (zh) * 2023-04-18 2023-05-23 Alipay (Hangzhou) Information Technology Co., Ltd. Interactive risk control system and method
CN116151832B (zh) * 2023-04-18 2023-07-21 Alipay (Hangzhou) Information Technology Co., Ltd. Interactive risk control system and method

Also Published As

Publication number Publication date
RU2754920C1 (ru) 2021-09-08

Similar Documents

Publication Publication Date Title
JP7355306B2 (ja) Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
JP7436709B2 (ja) Speech recognition using unspoken text and speech synthesis
WO2020215666A1 (fr) Speech synthesis method and apparatus, computer device, and storage medium
CN112687259B (zh) Speech synthesis method and apparatus, and readable storage medium
CN113439301A (zh) Reconciliation between simulated data and speech recognition output using sequence-to-sequence mapping
JP2023535230A (ja) Two-level speech prosody transfer
US20230036020A1 (en) Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score
JP2023539888A (ja) Synthetic data augmentation using voice conversion and speech recognition models
US20220246132A1 (en) Generating Diverse and Natural Text-To-Speech Samples
EP4205106A1 (fr) Text-to-speech synthesis method and system, and method of training a text-to-speech synthesis system
WO2023245389A1 (fr) Song management method and apparatus, electronic device, and storage medium
Jain et al. A text-to-speech pipeline, evaluation methodology, and initial fine-tuning results for child speech synthesis
Kaur et al. Genetic algorithm for combined speaker and speech recognition using deep neural networks
CN113470622B (zh) Conversion method and apparatus capable of converting arbitrary speech into multiple voices
Shechtman et al. Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture.
WO2022039636A1 (fr) Method of speech synthesis with transfer of reliable intonation of a cloned sample
Li et al. End-to-end Mongolian text-to-speech system
KR102639322B1 (ko) Speech synthesis system and method capable of real-time timbre and prosody style cloning
US20230146945A1 (en) Method of forming augmented corpus related to articulation disorder, corpus augmenting system, speech recognition platform, and assisting device
CN115359775A (zh) End-to-end Chinese voice cloning method with timbre and emotion transfer
JP7357518B2 (ja) Speech synthesis apparatus and program
Eshghi et al. An Investigation of Features for Fundamental Frequency Pattern Prediction in Electrolaryngeal Speech Enhancement
Nazir et al. Multi speaker text-to-speech synthesis using generalized end-to-end loss function
Zhang et al. A Non-Autoregressive Network for Chinese Text to Speech and Voice Cloning
Wu et al. VStyclone: Real-time Chinese voice style clone

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21858698

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21858698

Country of ref document: EP

Kind code of ref document: A1