WO2022039636A1

WO2022039636A1 - Method for synthesizing speech and transmitting the authentic intonation of a clonable sample

Info

Publication number: WO2022039636A1
Application number: PCT/RU2021/050284
Authority: WO
Inventors: Петр Владимирович ТАГУНОВ; Владислав Александрович ГОНТА
Priority date: 2020-08-17
Filing date: 2021-09-02
Publication date: 2022-02-24
Also published as: RU2754920C1

Abstract

The invention relates to the field of speech recognition, processing, analysis and synthesis, and more particularly to methods for synthesizing speech using artificial neural networks. The technical result of the invention consists in transmitting the authentic intonation of a clonable sample of the speech of a selected speaker in any natural language, including complex languages such as Russian, in other words maximally matching all aspects of the intonation of speech synthesized on the basis of an arbitrary text input by a third-party user to the voice of any selected speaker in a natural language, as a result of which the synthesized speech becomes indistinguishable from natural speech. A training data set consisting of a text and a corresponding audio recording of the speech of a selected speaker is subjected to pre-processing. Deep learning is performed on a neural network using the training data set, and a mel spectrogram of the voice of the selected speaker is obtained at the output. Said mel spectrogram is converted with the aid of a vocoder so that an audio file is obtained at the output. The trained neural network and the vocoder are reused to convert an arbitrary text input by a user into the speech of the selected speaker so that an audio file of the arbitrary text spoken in the voice of the selected speaker is obtained at the output.

Description

METHOD FOR SPEECH SYNTHESIS WITH TRANSMISSION OF RELIABLE INTONATION OF CLONED SAMPLE

The invention relates to the field of methods and devices for recognizing, processing, analyzing and synthesizing speech, and in particular to methods for synthesizing speech using artificial neural networks, and can be used for cloning and synthesizing the speech of a selected speaker with the transfer of reliable intonation of the cloned sample.

Various technical solutions for speech recognition, processing, analysis and synthesis are known from the prior art. Some of them use artificial neural networks to process, analyze and synthesize speech. The main task in speech synthesis is to transform text into audible speech. Artificial neural networks are capable of deep learning (by analogy with the human brain), so they make it possible to voice a text not in a mechanical, lifeless voice but in a "live", natural human voice, including the voice of a selected person (for example, a famous personality), by first training the neural network on the voice of the selected speaker.

Tacotron 2 and Waveglow can be singled out as the best-known and most advanced neural networks currently used for speech synthesis with the transfer of reliable intonation of the cloned sample. Tacotron 2 (accessed 07/29/2020) consists of two neural networks: the first converts text into a mel spectrogram, which is then passed to the second network (WaveNet), which reads this visual representation and generates the corresponding sound elements. Waveglow ("WAVEGLOW: A FLOW-BASED GENERATIVE NETWORK FOR SPEECH SYNTHESIS" // Ryan Prenger, Rafael Valle, Bryan Catanzaro, NVIDIA Corporation // electronic resource URL: https://arxiv.org/pdf/1811.00002.pdf (accessed 07/27/2020)) is a flow-based network capable of generating high-quality speech from mel spectrograms. Waveglow combines ideas from Glow and WaveNet to provide fast, efficient and high-quality audio synthesis without the need for autoregression.

Examples of patented technical solutions that use artificial neural networks for speech synthesis include foreign patent for invention No. CN110335587A "SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS SYSTEM, TERMINAL EQUIPMENT AND MACHINE READABLE STORAGE MEDIA", foreign patent for invention No. CN110853616A "METHOD AND SPEECH SYNTHESIS SYSTEM BASED ON A NEURAL NETWORK AND INFORMATION CARRIER", foreign patent for invention No. CN108597492A "METHOD AND DEVICE FOR VOICE SYNTHESIS", foreign patent for invention No. TR2018036413 A "EDUCATIONAL VOICE SYNTHESIS DEVICE, METHOD AND PROGRAM", Russian patent for invention No. 268658 "MIXED SPEECH RECOGNITION", Russian patent for invention No. 2720359 "METHOD AND EQUIPMENT FOR RECOGNITION OF EMOTIONS IN SPEECH", and Russian patent for invention No. 2698153 "ADAPTIVE AUDIO ENHANCEMENT FOR MULTICHANNEL SPEECH RECOGNITION". The features these technical solutions share with the proposed method of speech synthesis with the transfer of reliable intonation of the cloned sample are: the use of trainable artificial neural networks, including two neural networks used simultaneously; preliminary preparation of a training database for a neural network; conversion of the initial data into a mel spectrogram, with further processing of the mel spectrogram and its conversion to speech; the use of software; and the use of a convolutional neural network for deep learning.

Also publicly available are references to the RESEMBLE platform for voice cloning (site RESEMBLE PLATFORM // electronic resource URL: https://www.resemble.ai/ (accessed 07/28/2020)) and to the VeraVoice project (site VeraVoice // electronic resource URL: https://veravoice.ai/ (accessed 07/28/2020)). However, there is no technical description of these solutions.

The closest technical solution (prototype) is that of Russian patent for invention No. 2632424 "METHOD AND SERVER FOR SPEECH SYNTHESIS BY TEXT" (priority date 09/29/2015). It is a text-to-speech method that includes the steps of: obtaining training text data and corresponding training acoustic data; extracting one or more phonetic and linguistic features of the training text data; extracting vocoder features of the corresponding training acoustic data and correlating the vocoder features with the phonetic and linguistic features of the training text data and with one or more specific speech attributes; using a deep neural network to determine interdependence factors between speech attributes in the training data; receiving a text; receiving a selection of a speech attribute; converting the text into synthesized speech using an acoustic spatial model; and outputting the synthesized speech as audio with the selected speech attribute. The technical result is an increase in the naturalness of the human voice in synthesized speech. The features the prototype shares with the claimed technical solution are the use of a deep learning neural network and preliminary preparation of a training database consisting of text and acoustic data.

However, the prototype has several disadvantages:

- there is no technical description of the deep learning neural network or of the principle of its operation. The solution describes the hardware part of the text-to-speech method in great detail but omits a description of the neural network itself and its properties. Neural networks, however, differ significantly from one another in structure and properties, and to be used for speech cloning a neural network must have strictly defined properties (for example, it must be recurrent) and certain layers;

- there is no technical description of how to prepare a training database consisting of training text data and corresponding training acoustic data. The text and acoustic data must strictly correspond to each other: the transcription of the voice must match the text. As the amount of data grows, so does the risk of errors and inaccuracies, which lowers the quality of neural network training and hence the correspondence of the synthesized speech to the sample;

- converting text into synthesized speech using an acoustic spatial model, relying mainly on hardware and without using mel spectrograms, can also lead to errors and inaccuracies in the conversion and make the voice partially artificial and "lifeless", because the intonations of the real person's voice are not fully conveyed.

As a result, the shortcomings of the prototype prevent a high-quality, exact match between the intonation of the synthesized speech and the cloned speech sample of any speaker in any natural language, including a complex one such as Russian. Thus, none of the technical solutions presented from this field offers a full-fledged hardware-software method for synthesizing any speech in any natural language, including Russian or other complex languages, spoken by any speaker, with the transfer of reliable intonation of the cloned sample in all its aspects and with the maximum correspondence of the synthesized voice to the voice of a real human speaker.

Unlike the prototype and the other technical solutions, the claimed method of speech synthesis with the transfer of reliable intonation of the cloned sample solves this technical problem. It is a full-fledged hardware-software method for synthesizing any speech in any natural language, including Russian or another complex language, spoken by any speaker, with the transfer of reliable intonation of the cloned sample in all its aspects and with the maximum correspondence of the synthesized voice to the voice of a real human speaker. This is achieved by careful manual (mechanical) preparation of the training dataset for the neural networks; by using the Tacotron2 and Waveglow neural networks simultaneously, with deep learning and modification of the Tacotron2 network in order to maximize the adaptation of the neural network to the features of a particular language; by using software to control the operation of the neural networks; and by using a web service and a website for the interaction of any user with the software and the computer.

Accordingly, the technical result of the proposed technical solution "Method of speech synthesis with the transfer of reliable intonation of the cloned sample" is as follows. Owing to careful manual (mechanical) preparation of the training dataset and a qualitative change in the architecture of the artificial neural network used, made to adapt it as fully as possible to the characteristics of a particular language, speech synthesis by the proposed method transfers the reliable intonation of the cloned speech sample of any selected speaker in any natural language, including a complex language such as Russian. In other words, all aspects of the intonation of speech synthesized from an arbitrary text entered by a third-party user correspond as closely as possible to the voice of any speaker in any natural language, so that the synthesized speech becomes indistinguishable from natural speech. More generally, the technical result also consists in expanding the arsenal of speech synthesis methods that use artificial neural networks.

The technical result is achieved by the fact that the method of speech synthesis with the transfer of reliable intonation of the cloned sample includes the steps of: preliminary preparation of a training dataset consisting of a text and a corresponding audio recording of the speech of the selected speaker; deep learning of the neural network on the training dataset, with a mel spectrogram of the voice of the selected speaker obtained at the output; conversion of the mel spectrogram by a vocoder, with an audio file in WAV format obtained at the output; and re-use of the already trained neural network and vocoder to convert arbitrary text uploaded by a user into the speech of the selected speaker processed at the stages of dataset preparation and deep learning of the neural network, with an audio file in WAV format of the arbitrary text voiced by the selected speaker obtained at the output. The method is characterized in that the audio recording of the speech of the selected speaker is divided into fragments of no more than 16 seconds each; the dataset is prepared manually, by a person carefully checking each fragment of the audio recording and the corresponding fragment of the text for a complete match between the transcription of the audio recording and the text; the Tacotron2 network is used as the deep learning neural network, and the Waveglow neural network is used as the vocoder; in the process of deep learning on the prepared dataset, the Tacotron2 neural network is modified by increasing the number of weights of its model and expanding the amount of its memory in order to maximize the adaptation of the neural network to the features of a particular language; the processes of modification and deep learning of the Tacotron2 model with a mel spectrogram obtained at the output, the conversion of the mel spectrogram into a WAV audio file by the Waveglow network, and the further conversion of arbitrary text uploaded by the user into the speech of the speaker processed at the dataset preparation and deep learning stages of the Tacotron2 model are controlled by software; and the interaction of the user with the software and computer hardware, when the user uploads arbitrary text to be voiced by the selected speaker and receives an audio file in WAV format as output, is carried out using a web service in the Java language and a website.
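The 16-second limit on audio fragments can be checked mechanically before the manual review begins. The following is a minimal sketch in Python, assuming the fragments are stored as PCM WAV files in a single directory; the function names and the directory layout are hypothetical and are not part of the claimed method.

```python
import wave
from pathlib import Path

# Limit stated in the method: fragments of no more than 16 seconds each.
MAX_FRAGMENT_SECONDS = 16.0


def fragment_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_overlong_fragments(directory, limit=MAX_FRAGMENT_SECONDS):
    """List (file name, duration) pairs for fragments longer than the limit."""
    overlong = []
    for path in sorted(Path(directory).glob("*.wav")):
        duration = fragment_duration_seconds(path)
        if duration > limit:
            overlong.append((path.name, duration))
    return overlong
```

Fragments reported by `find_overlong_fragments` would need to be split further. The check covers only fragment length and does not replace the manual comparison of each fragment with its text.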

To obtain the technical result, the invention can be carried out in the following preferred manner, which does not exclude other ways of implementation within the scope of the claims.

The method of speech synthesis with the transfer of reliable intonation of the cloned sample includes the following steps. At the first stage, a training dataset is prepared manually. It consists of a text and the corresponding audio recording of the speech of the selected speaker, divided into fragments no longer than 16 seconds each. Manual preparation means that a person carefully checks each fragment of the audio recording against the corresponding fragment of text, listening to the audio while reading the text, to confirm that they coincide completely. If the text does not match the audio recording, the person uses a computer to edit the text so that it corresponds as closely as possible to the transcription of the audio recording. The minimum size of the dataset for full-fledged training of a neural network, for example for Russian speech, is 20 hours of audio recording for satisfactory (test) quality and 30 hours of speech for commercial operation of the voice of the selected speaker. Next, on the basis of the prepared dataset, the artificial neural network (model) Tacotron2 is modified and trained by deep learning with regard to the specifics of a particular natural language, for example Russian. The manually prepared training dataset and the Tacotron2 and Waveglow neural networks (models) are loaded into the graphics and central processors of the computer, and tensor calculations are performed of the weights of the Tacotron2 and Waveglow models, which determine the speech features of the selected speaker. This is followed by the encoding stage, in which text characters from the dataset are transformed into their numerical representation. The convolutional layers of the Tacotron2 neural network then determine the relationships between letters within a word and in the text as a whole.
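The transformation of text characters into a numerical representation at the start of the encoding stage can be illustrated as follows. This is a minimal sketch in Python, assuming a small hypothetical symbol inventory (a padding symbol, a space, basic punctuation and the lowercase Russian alphabet); the actual symbol set and the handling of unknown characters are not specified in the description.

```python
# Hypothetical symbol inventory: padding, space, punctuation, Russian letters.
PAD = "_"
SYMBOLS = [PAD] + list(" !,-.?") + list("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")

# Each symbol is identified by its position in the inventory.
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def text_to_sequence(text):
    """Map each known character to its integer ID, skipping unknown ones."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]
```

In this sketch, a sequence of integer IDs such as the one returned by `text_to_sequence` is what the subsequent network layers would receive in place of raw text.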
Then the output of the convolutional layers goes to the bidirectional layer of the Tacotron2 neural network, which uses its internal memory to process sequences of arbitrary length and stores the state of the "past" and the "future", that is, it remembers the context of a particular piece of text and audio recording. Next comes the decoding stage. The result obtained at the encoding stage passes through the Tacotron2 "attention" layer, which computes a weighted average over all outputs of the encoding stage, and then through the decoder network, which in turn consists of two unidirectional memory layers of the Tacotron2 neural network, a pre-net layer, which is necessary for learning attention, and a layer of linear transformation into a mel spectrogram. The result of the decoding stage passes through the five convolutional layers (post-net) of the Tacotron2 neural network to improve the quality of the mel spectrogram. The processed mel spectrogram is then passed to the vocoder, the Waveglow neural network, which outputs an audio file in WAV format. After that, the Tacotron2 model modified at the previous stages of deep learning and the Waveglow network with the calculated weights are loaded again onto the graphics and central processors of the computer, and arbitrary text uploaded by the user is converted into the speech of the speaker processed at the stages of dataset preparation and deep learning of the Tacotron2 model. The processes of modification and deep learning of the Tacotron2 model with the output of a mel spectrogram, the conversion of the mel spectrogram into a WAV audio file by the Waveglow network, and the further conversion of arbitrary text uploaded by the user into the speech of the speaker processed at the dataset preparation and deep learning stages of the Tacotron2 model are controlled by software.
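The mel spectrogram mentioned above represents sound energy in frequency bands spaced evenly on the mel scale, which is denser at low frequencies. The following minimal sketch in Python shows one commonly used conversion between hertz and mels (the HTK-style formula) and the resulting band centers. The choice of formula, the band count of 80 and the frequency range of 0 to 8000 Hz used in the usage note are illustrative assumptions and are not values stated in the description.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Convert a value in mels back to Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels bands spaced evenly in mels."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_mels)]
```

For example, `mel_band_centers(80, 0.0, 8000.0)` returns 80 increasing center frequencies whose spacing in hertz widens toward the upper end of the range.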
The interaction of the user with the software and computer hardware, when the user uploads arbitrary text to be voiced by the selected speaker and receives an audio file in WAV format as output, is carried out using a web service in the Java language and a website.

The novelty and inventive step of the presented invention lie in the fact that, in the described method of speech synthesis with the transfer of reliable intonation of the cloned sample, the training dataset for the Tacotron2 and Waveglow neural networks is prepared thoroughly by hand (mechanically), and the Tacotron2 neural network is modified by increasing the number of weights of its model and expanding the amount of its memory and is then trained by deep learning on the prepared training dataset using a larger number of "features" (specific software capabilities) in order to maximize the adaptation of the neural network to the features of a particular language. As a result of applying the proposed method, the sound of the synthesized speech closely corresponds to the voice of a real person (speaker) selected by the user, in any natural language.

Claims

1. A method of speech synthesis with the transfer of reliable intonation of the cloned sample, including the steps of preliminary preparation of a training dataset consisting of a text and the corresponding audio recording of the speech of the selected speaker, deep learning of the neural network on the training dataset with a mel spectrogram of the voice of the selected speaker obtained at the output, conversion of the mel spectrogram by a vocoder with an audio file in WAV format obtained at the output, and re-use of the already trained neural network and vocoder to convert arbitrary text uploaded by a user into the speech of the selected speaker processed at the stages of dataset preparation and deep learning of the neural network, with an audio file in WAV format of the arbitrary text voiced by the selected speaker obtained at the output, characterized in that the audio recording of the speech of the selected speaker is divided into fragments of no more than 16 seconds each; the dataset is prepared manually, by a person carefully checking each fragment of the audio recording and the corresponding fragment of text for complete coincidence of the transcription of the audio recording with the text; the Tacotron2 network is used as the deep learning neural network, and the Waveglow neural network is used as the vocoder; in the process of deep learning on the prepared dataset, the Tacotron2 neural network is modified by increasing the number of weights of its model and expanding the amount of its memory in order to maximize the adaptation of the neural network to the features of a particular language; the processes of modification and deep learning of the Tacotron2 model with the output of a mel spectrogram, the conversion of the mel spectrogram into an audio file in WAV format by the Waveglow network, and the further conversion of arbitrary text uploaded by the user into the speech of the speaker processed at the dataset preparation and deep learning stages of the Tacotron2 model are controlled by special software; and the interaction of the user with the software and computer hardware, when the user uploads arbitrary text to be voiced by the selected speaker and receives an audio file in WAV format as output, is carried out using a web service in Java and a website.