CN115359775A - End-to-end tone and emotion migration Chinese voice cloning method - Google Patents

Info

Publication number
CN115359775A
Authority
CN
China
Prior art keywords
voice
speaker
emotion
encoder
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210846358.0A
Other languages
Chinese (zh)
Inventor
刘丁玮
陈铧浚
毛爱华
刘江枫
郭勇彬
张柳坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210846358.0A
Publication of CN115359775A
Legal status: Pending

Classifications

    • G: Physics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 13/02: Speech synthesis; text to speech systems: methods for producing synthetic speech; speech synthesisers
    • G10L 17/04: Speaker identification or verification techniques: training, enrolment or model building
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state


Abstract

The invention discloses a Chinese voice cloning method with end-to-end timbre and emotion migration, which comprises the following steps: collecting Chinese speech recorded by users as training data and extracting the required speech features; training a voice clone synthesis model comprising a timbre emotion encoder, a synthesizer and a vocoder; using the trained voice clone synthesis model to generate speech in the voice of a speaker already known to the model from the speech or text content input by the user, or quickly cloning the timbre and emotion in the user's own voice from a short speech sample input by the user. The invention realizes end-to-end speech synthesis and cloning, and synthesizes speech with different emotions and timbres from the same multi-speaker model by switching the speaker embedding vector. By combining speaker embedding vectors generated from short speech with a generative model trained on a larger corpus, the invention achieves voice cloning that reproduces the timbre and emotion of a specific speaker.

Description

End-to-end tone and emotion migration Chinese voice cloning method
Technical Field
The invention relates to the technical field of computer speech synthesis, and in particular to a speech synthesis method with end-to-end timbre and emotion migration trained on short speech.
Background
Speech synthesis is a key technology for realizing human-machine spoken communication and for building spoken-language systems with listening and speaking capabilities. Giving computers a speaking ability comparable to that of humans is an important competitive market in today's information industry. Speech synthesis converts arbitrary text information into standard, fluent speech in real time and reads it aloud; the text can be ordinary language text or can include markup such as SSML. Sound is a continuous analog signal, and the synthesis a computer must perform approximates it with digital signals. In the speech synthesis process, pronunciation, alignment, prosody and intonation are all keys to the quality of the synthesized speech, and hard-to-handle phenomena in Chinese speech such as tone sandhi, polyphonic characters and complex prosody are the key points and difficulties of existing Chinese speech synthesis technology.
Voice cloning, as an extension of speech synthesis, aims to clone a speaker's timbre and emotion in real time, so that the speech synthesized by the system no longer carries only the timbre and emotion of the speakers in the model's training data, but instead varies from person to person and carries the timbre and emotion of the target speaker. A Chinese voice cloning method is therefore urgently needed that embeds the speaker's characteristics, namely timbre and emotional features, in the synthesized speech, so that the synthesized speech is natural, vivid and corresponds to the specified speaker.
Disclosure of Invention
The invention aims to reduce the amount of training speech required from a single speaker in speech synthesis, and provides an end-to-end speech synthesis technique trained on short speech as well as a speech synthesis method for timbre and emotion migration.
The purpose of the invention can be achieved by adopting the following technical scheme:
a Chinese voice cloning method for end-to-end tone and emotion migration comprises the following steps:
s1, voice data acquisition: collecting a plurality of Chinese short-sentence speech files from a plurality of speakers, each speaker recording a plurality of short-sentence utterances according to a given text, and establishing a corresponding text mark for each speech file, wherein each utterance is no more than 15 seconds long, the total speech duration is not less than 30 hours, and the speech is recorded in a quiet environment;
s2, data preprocessing: processing the voice file collected in the step S1, unifying the sampling rate, format, bit depth and channel number of the voice file to obtain a required audio file, and simultaneously generating a JSON file containing a recording file mark, a corresponding voice text mark and a speaker mark;
s3, constructing a Chinese voice clone synthesis model: the Chinese voice clone synthesis model comprises a tone emotion coder, a synthesizer and a vocoder;
s4, constructing a tone emotion encoder: the tone emotion encoder comprises three layers of LSTM networks which are sequentially connected, a frequency domain characteristic Mel frequency spectrum of the audio file is calculated to be used as the input of the tone emotion encoder, and a speaker embedding vector with fixed dimensionality is obtained to be used as the output of the tone emotion encoder;
s5, training a synthesizer: the synthesizer consists of 1 encoder and 1 decoder connected in sequence, wherein the encoder comprises, connected in sequence, a preprocessing network composed of fully connected layers, a word embedding module, 3 one-dimensional convolution layers and 1 bidirectional LSTM network, the JSON file is used as the input of the encoder, and the encoder hidden state is used as the output of the encoder; the decoder comprises, connected in sequence, 1 preprocessing network, 2 LSTM layers, 1 projection layer composed of linear mapping layers and 1 post-processing network, and the encoder hidden state and the speaker embedding vector output by the timbre emotion encoder are concatenated and used as the input of the decoder to obtain the Mel spectrum of the synthesized speech as the output of the decoder;
s6, training a vocoder: the vocoder is composed of parallel WaveRNN vocoder and Griffin-Lim vocoder, the Mel frequency spectrum of the synthesized voice output by the decoder is used as the input of the vocoder, and the waveform prediction of the synthesized voice is used as the output of the vocoder;
s7, generating cloned speech: the text input by the user, or the text obtained by speech recognition of the speech input by the user, is combined with the speaker embedding vector of the speaker specified by the user and passed through the synthesizer and the vocoder to obtain the output speech;
or fast voice cloning: the user's audio is preprocessed and input into the timbre emotion encoder to obtain a speaker embedding vector, and the speaker embedding vector is stored for generating cloned speech.
Further, the preprocessing process of step S2 is as follows:
s2.1, performing speech processing on the multiple short-sentence speech files, converting them into audio files with an audio sampling rate of 16000 Hz, wav format, a bit depth of 16 bits and a single channel. Unifying the relevant parameters and formats of the audio files speeds up the Chinese speech clone synthesis model's extraction and processing of the audio data, improves training efficiency and achieves a better result;
and S2.2, generating a JSON file containing the marks: the text marks, the speaker IDs and the audio file marks obtained from speech processing are spliced to obtain one or more files in JSON format, wherein a text mark is the Chinese text corresponding to the audio content, a speaker ID is a numbering mark for the speaker, and an audio file mark is the speaker together with the name of the audio file corresponding to that speaker's content. The generated JSON file provides the data set required for training the Chinese speech clone synthesis model, namely the speech and its text content, while putting the speech information and the speaker information in one-to-one correspondence.
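For illustration, a minimal sketch of this preprocessing (assuming the librosa and soundfile libraries; the file paths, example sentences and JSON field names are purely illustrative) could look as follows:

```python
import json
import librosa
import soundfile as sf

def preprocess_utterance(in_path, out_path, text, speaker_id, sr=16000):
    """Resample to 16 kHz mono and write 16-bit PCM wav (S2.1)."""
    audio, _ = librosa.load(in_path, sr=sr, mono=True)   # resample + downmix to one channel
    sf.write(out_path, audio, sr, subtype="PCM_16")       # 16-bit wav output
    # Metadata record pairing audio, text and speaker (S2.2); field names are illustrative.
    return {"audio_file": out_path, "text": text, "speaker_id": speaker_id}

records = [
    preprocess_utterance("raw/spk01_0001.wav", "wav/spk01_0001.wav", "今天天气很好", "spk01"),
    preprocess_utterance("raw/spk02_0001.wav", "wav/spk02_0001.wav", "欢迎使用语音克隆", "spk02"),
]
with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```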
Further, the working process of the timbre emotion encoder in step S4 is as follows:
s4.1, for a given set of short-sentence recordings, calculate the Mel spectrum, using the standard Mel-scale mapping:

m = 2595 × log10(1 + f / 700)

where f is the frequency of the short speech and m is the corresponding value on the Mel scale. The Mel spectrum retains, to a large extent, the information of the original speech as perceived by the human ear, so using it as the input of the timbre emotion encoder improves the encoder's accuracy;
s4.2, inputting the Mel frequency spectrum of the short-sentence voice into a timbre emotion coder, and outputting a speaker embedding vector with a fixed dimensionality, wherein the process is as follows:
s4.2.1, inputting the Mel frequency spectrum of the short speech into three layers of sequentially connected LSTM networks, and mapping the output of each frame of the last layer of LSTM network to a vector with 256-dimensional fixed length, wherein each frame refers to a fixed time unit. The tone emotional characteristics of the speaker in the voice can be effectively extracted through the LSTM network with 3 layers, and the extra cost of memory caused by excessive layers is avoided;
and S4.2.2, averaging and normalizing the output of all time units obtained in S4.2.1 to obtain a final speaker embedded vector with fixed dimensionality, wherein the speaker embedded vector is used for distinguishing a speaker corresponding to the voice from other speakers, and the speaker embedded vector can keep the tone and emotion of the speaker. The mean value and normalization process scales the data to the same interval, so that the calculated amount is reduced, and the training efficiency of the speaker encoder can be improved.
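A minimal PyTorch sketch of this encoder is given below: the three-layer LSTM, the per-frame mapping to a fixed 256-dimensional vector, and the averaging over time units follow S4.2.1 and S4.2.2, while the number of Mel channels, the hidden size and the use of L2 normalization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimbreEmotionEncoder(nn.Module):
    """3-layer LSTM speaker encoder (S4.2); layer sizes are illustrative assumptions."""
    def __init__(self, n_mels=40, hidden=256, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)    # maps each frame of the last LSTM layer to 256-d (S4.2.1)

    def forward(self, mel):                         # mel: (batch, frames, n_mels), Mel scale as in S4.1
        frames, _ = self.lstm(mel)                  # per-frame outputs of the last LSTM layer
        embeds = self.proj(frames)                  # (batch, frames, 256)
        embed = embeds.mean(dim=1)                  # average over all time units (S4.2.2)
        return F.normalize(embed, p=2, dim=-1)      # normalize to a fixed-length speaker embedding

mel = torch.randn(1, 180, 40)                       # a ~1.8 s utterance at a 10 ms hop (illustrative)
speaker_embedding = TimbreEmotionEncoder()(mel)     # shape (1, 256)
```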
Further, the training process of the timbre emotion encoder in step S4 is as follows:
comparing the speaker embedded vector output by the speaker timbre emotion coder in a certain training iteration with a corresponding speaker sample, and when the speaker can be distinguished according to a comparison result obtained by the speaker embedded vector, indicating that the timbre emotion coder extracts the characteristics capable of distinguishing the speaker, reserving the parameters of the timbre emotion coder, and otherwise, continuing the iterative training. By adopting an iterative training mode, the obtained speaker embedded vector can be ensured to represent the characteristics of tone and emotion of a certain speaker and can be distinguished from tone and emotion of other speakers.
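The text does not name a specific training loss. One common way to realize "compare the embedding with the corresponding speaker sample until speakers can be distinguished" is a softmax over cosine similarities between each utterance embedding and per-speaker centroids (a simplified GE2E-style criterion); the sketch below assumes that choice.

```python
import torch
import torch.nn.functional as F

def speaker_discrimination_loss(embeds):
    """embeds: (n_speakers, n_utts, dim), L2-normalized utterance embeddings.
    Softmax over cosine similarity to speaker centroids; a low loss means the
    embeddings separate speakers, which is the retention criterion described above."""
    n_spk, n_utt, dim = embeds.shape
    centroids = F.normalize(embeds.mean(dim=1), dim=-1)       # one centroid per speaker
    sims = embeds.reshape(-1, dim) @ centroids.t()            # (n_spk * n_utt, n_spk) similarities
    targets = torch.arange(n_spk).repeat_interleave(n_utt)    # true speaker index per utterance
    return F.cross_entropy(sims, targets)

# Training-loop sketch: keep the encoder parameters once this loss (i.e. the speaker
# confusion) is low enough, otherwise continue the iterative training.
```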
Further, the operation of the encoder of the synthesizer in step S5 is as follows:
s5.1.1, obtaining the input of the encoder: translating the text marks in the JSON file generated in step S2 into a phoneme sequence, wherein Chinese is converted into the corresponding pinyin, splicing the obtained phoneme sequence into the JSON file, and taking the resulting JSON file as the input of the encoder; converting the Chinese text data into a phoneme sequence allows the encoder to better retain the key information of the speech in the word vectors and improves the encoder's ability to extract information about the speech content.
S5.1.2, generating word embedding vectors: first, the JSON obtained in step S5.1.1 is input into a preprocessing network for analysis and transformation, and a preprocessed sequence is output; this preprocessing further converts the input sequence into a format from which the encoder can better extract the speech information, and the preprocessing network can adjust its own parameters to extract information more effectively. A word embedding operation is then performed on the preprocessed sequence, computing the weight of each phoneme in the phoneme sequence with respect to the remaining phonemes so that the correlation between phonemes is incorporated into training to improve accuracy, and finally a 512-dimensional word vector is output. The word vector contains the position information and content information of the speech;
s5.1.3, acquiring an intermediate state: inputting the word vectors obtained in the step S5.1.2 into 3 one-dimensional convolution layers which are connected in sequence for convolution operation, simultaneously performing BatchNorm operation and Dropout operation on the output after each convolution layer, and obtaining an intermediate state as the output on the last convolution layer; the BatchNorm operation keeps the same distribution of the input of each layer of neural network in the deep neural network training process, can accelerate the training speed, and the Dropout operation shields the neural network units temporarily with a certain probability to avoid the overfitting problem. Through the two operations, the voice information in the word vector can be better reserved;
s5.1.4, obtaining an encoder hidden state: and (4) inputting the intermediate state obtained in the step S5.1.3 into a bidirectional LSTM network, and taking the output of the bidirectional LSTM network as an encoder hidden state. The correlation of the time of the input data can be obtained by adopting a bidirectional LSTM network, and complete voice information is further obtained;
s5.1.5, embedding timbre and emotional characteristics: and splicing the speaker embedded vector output by the timbre emotion encoder with the encoder hidden state obtained in the S5.1.4 to obtain a final encoder hidden state. The encoder after splicing has hidden state with phonetic information and the tone and emotional information of the speaker.
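The encoder of S5.1 follows a Tacotron-2-style design. The sketch below mirrors the 512-dimensional phoneme embedding, the three-convolution stack with BatchNorm and Dropout, the bidirectional LSTM, and the concatenation with the speaker embedding vector; the exact placement of the fully connected preprocessing network, the phoneme vocabulary size and the kernel size are assumptions.

```python
import torch
import torch.nn as nn

class SynthesizerEncoder(nn.Module):
    """Tacotron-2-style text encoder sketch for S5.1; layer sizes are assumptions."""
    def __init__(self, n_phonemes=100, embed_dim=512, speaker_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, embed_dim)           # 512-d phoneme embedding (S5.1.2)
        self.prenet = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                    nn.Linear(embed_dim, embed_dim), nn.ReLU())
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv1d(embed_dim, embed_dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(embed_dim), nn.ReLU(), nn.Dropout(0.5))
            for _ in range(3)])                                         # 3 conv layers + BatchNorm + Dropout (S5.1.3)
        self.bilstm = nn.LSTM(embed_dim, embed_dim // 2, batch_first=True,
                              bidirectional=True)                       # bidirectional LSTM (S5.1.4)

    def forward(self, phoneme_ids, speaker_embedding):
        x = self.prenet(self.embedding(phoneme_ids))                    # (batch, T, 512)
        x = x.transpose(1, 2)                                           # to (batch, channels, T) for Conv1d
        for conv in self.convs:
            x = conv(x)
        x, _ = self.bilstm(x.transpose(1, 2))                           # encoder hidden states (batch, T, 512)
        spk = speaker_embedding.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast speaker vector over time
        return torch.cat([x, spk], dim=-1)                              # splice in timbre/emotion info (S5.1.5)

hidden = SynthesizerEncoder()(torch.randint(0, 100, (1, 20)), torch.randn(1, 256))  # (1, 20, 768)
```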
Further, the operation of the decoder of the synthesizer in step S5 is as follows:
s5.2.1, circularly operating a decoder, wherein each cycle is called a time step, in each time step, performing attention mechanism operation on the encoder hidden state obtained in S5.1.5, and measuring unit similarity in the encoder hidden state, wherein the attention mechanism is a matrix formed by context weight vectors, scoring each dimension of input data, weighting features according to the scores to highlight the influence of important features on a downstream model or module, and then performing normalization processing on the encoder hidden state subjected to attention mechanism operation to obtain a context vector, wherein the context vector contains information in the encoder hidden state;
s5.2.2, a preprocessing network in a decoder takes the context vector output in the last time step as input, analyzes and processes the context vector, inputs the processed context vector into two layers of LSTM networks which are sequentially connected in the decoder for processing, takes the output of the last layer of LSTM as a new context vector, and finally inputs the new context vector into a projection layer to obtain a sound spectrum frame and an end probability as output, wherein the sound spectrum frame comprises a predicted Mel frequency spectrum. The accuracy of the voice information extracted by the decoder is ensured by adopting the context vector transmission processing;
And S5.2.3, taking the sound spectrum frame obtained in the S5.2.2 as the input of a post-processing network in a decoder, wherein the post-processing network consists of three sequentially connected convolutional layers, and taking the output of the last convolutional layer as the Mel frequency spectrum of the synthesized voice. The post-processing network can improve the quality of the Mel frequency spectrum, and guarantees the subsequent use of the Mel frequency spectrum to synthesize high-fidelity voice.
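One decoder time step of S5.2 can be sketched as follows; the dot-product form of the attention, the layer widths and the stop-token handling are assumptions, and the three-convolution post-processing network of S5.2.3 would be applied to the full predicted Mel sequence after decoding finishes.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoder time step (S5.2.1-S5.2.2): attention -> context vector ->
    prenet -> 2 LSTM layers -> projection to a mel frame plus a stop probability."""
    def __init__(self, enc_dim=768, dec_dim=1024, n_mels=80):
        super().__init__()
        self.query = nn.Linear(dec_dim, enc_dim)
        self.prenet = nn.Sequential(nn.Linear(enc_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 256), nn.ReLU())
        self.lstm1 = nn.LSTMCell(256, dec_dim)
        self.lstm2 = nn.LSTMCell(dec_dim, dec_dim)
        self.proj = nn.Linear(dec_dim, n_mels + 1)            # projection layer: mel frame + end flag

    def forward(self, enc_hidden, state1, state2):
        # Score every encoder position against the current decoder state, then
        # normalize the weighted encoder states into a context vector (S5.2.1).
        scores = torch.bmm(enc_hidden, self.query(state1[0]).unsqueeze(-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_hidden).squeeze(1)
        state1 = self.lstm1(self.prenet(context), state1)     # first LSTM layer (S5.2.2)
        state2 = self.lstm2(state1[0], state2)                # second LSTM layer
        out = self.proj(state2[0])
        mel_frame, stop_prob = out[:, :-1], torch.sigmoid(out[:, -1])
        return mel_frame, stop_prob, state1, state2           # states (initialized to zeros) carry over steps
```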
Further, the operation process of the WaveRNN vocoder in step S6 is as follows:
the WaveRNN vocoder is composed of a single-layer RNN network and two softmax layers which are sequentially connected, the Mel frequency spectrum of the synthetic speech obtained in the step S5 is used as input, after single-layer RNN processing, the obtained output is divided into two parts which are respectively input into the corresponding softmax layers, and the two softmax outputs are spliced to obtain a predicted 16-bit audio sequence. The synthesis process of the WaveRNN guarantees the quality of the synthesized voice and simultaneously considers the speed of the synthesized voice.
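The split into two softmax outputs described above corresponds to predicting the upper and lower 8 bits of each 16-bit sample separately. The sketch below assumes a GRU as the single-layer RNN and a simplified conditioning on the Mel spectrum and the previous sample; it is an illustration of the coarse/fine scheme, not the exact network of the patent.

```python
import torch
import torch.nn as nn

class WaveRNNSketch(nn.Module):
    """Single-layer RNN with two softmax heads (S6): the hidden state is split in
    two, each half predicts 8 bits, and the concatenated coarse/fine values form
    one 16-bit audio sample. Sizes and conditioning are simplified assumptions."""
    def __init__(self, n_mels=80, hidden=896):
        super().__init__()
        self.rnn = nn.GRU(n_mels + 1, hidden, batch_first=True)    # conditioned on mel + previous sample
        self.coarse_head = nn.Linear(hidden // 2, 256)              # softmax over the upper 8 bits
        self.fine_head = nn.Linear(hidden // 2, 256)                # softmax over the lower 8 bits

    def forward(self, mel, prev_sample):
        x = torch.cat([mel, prev_sample.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)                                          # single-layer RNN pass
        h_coarse, h_fine = h.chunk(2, dim=-1)                       # split the output into two parts
        coarse = torch.softmax(self.coarse_head(h_coarse), dim=-1)  # (batch, T, 256)
        fine = torch.softmax(self.fine_head(h_fine), dim=-1)
        # Splice the two predictions: 16-bit sample = coarse * 256 + fine (argmax of each distribution).
        return coarse.argmax(-1) * 256 + fine.argmax(-1)
```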
Further, the operation of the Griffin-Lim vocoder in the step S6 is as follows:
The Griffin-Lim vocoder works iteratively: a phase is first initialized randomly; an inverse short-time Fourier transform is applied to the Mel spectrum of the synthesized speech obtained in step S5 together with the initialized phase to obtain a time-domain signal; a short-time Fourier transform is then applied to this signal to obtain a new phase and a new magnitude spectrum; the new magnitude spectrum is discarded, and the speech is re-synthesized using the new phase together with the known magnitude spectrum. This operation is iterated continuously, and the best synthesized speech obtained is taken as the output. Reconstructing the phase information from the given spectrogram in this iterative way reduces the mean square error between the spectrogram of the reconstructed signal and the given spectrogram, which improves the quality of the synthesized speech.
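With librosa, this iterative phase reconstruction can be approximated in a few lines; mapping the Mel spectrum back to a linear magnitude spectrogram before Griffin-Lim, and the STFT parameters, are assumptions.

```python
import librosa

def griffin_lim_from_mel(mel_power, sr=16000, n_fft=1024, hop_length=256, n_iter=60):
    """Reconstruct a waveform from a mel power spectrogram by Griffin-Lim:
    the phase is re-estimated iteratively while the known magnitude is kept."""
    # Map the mel spectrogram back to a linear-frequency magnitude spectrogram.
    stft_mag = librosa.feature.inverse.mel_to_stft(mel_power, sr=sr, n_fft=n_fft)
    # Iterative ISTFT/STFT phase reconstruction (random phase initialization happens inside).
    return librosa.griffinlim(stft_mag, n_iter=n_iter, hop_length=hop_length)
```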
Further, the process of generating the clone voice in step S7 is as follows:
s7.1.1, acquiring an input sequence: the input sequence is a Chinese text or a Chinese voice input by a user, wherein the Chinese voice is converted into the Chinese text by using the existing voice recognition method;
s7.1.2, obtaining the speaker embedded vector: after the data collected in S1 is preprocessed in S2, speaker embedded vectors corresponding to different speakers are extracted through a tone emotion coder trained in S3, and the speaker embedded vectors corresponding to the speakers are selected according to needs of a user to clone; the corresponding speaker embedded vector stores the tone emotion information of the speaker;
s7.1.3, generating a Mel frequency spectrum of the synthesized voice: the input sequence obtained in the S7.1.1 and the speaker embedded vector obtained in the S7.1.2 are used as the input of a synthesizer, and the output of the synthesizer is the Mel frequency spectrum of the synthesized voice; the input speaker embedded vector can enable the output synthesized voice to have the tone and the emotional characteristics of the speaker;
and S7.1.4, inputting the Mel frequency spectrum obtained in the S7.1.3 into a vocoder, and outputting a synthesized clone voice, wherein the synthesized clone voice has the tone and emotion of the speaker specified in the S7.1.2.
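Putting S7.1.1 to S7.1.4 together, the inference path can be sketched as below. `synthesizer`, `vocoder` and `speaker_embeddings` stand for the trained components and the stored embedding table and are placeholders, and the use of pypinyin for the text-to-pinyin step is an assumption.

```python
from pypinyin import lazy_pinyin  # Chinese text -> pinyin phoneme sequence

def generate_clone_voice(text, speaker_id, speaker_embeddings, synthesizer, vocoder):
    """S7.1: text (or recognized speech) plus a stored speaker embedding -> cloned waveform."""
    phonemes = lazy_pinyin(text)                    # S7.1.1: Chinese text to a pinyin sequence
    spk_embed = speaker_embeddings[speaker_id]      # S7.1.2: choose the speaker to clone
    mel = synthesizer(phonemes, spk_embed)          # S7.1.3: Mel spectrum carrying that timbre/emotion
    return vocoder(mel)                             # S7.1.4: waveform of the cloned voice
```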
Further, the process of fast cloning the speech in step S7 is as follows:
s7.2.1, acquiring input voice: collecting Chinese voice of a user;
s7.2.2, pretreatment: the Chinese speech obtained in the step S7.2.1 is processed by the preprocessing method in the S2 to obtain a corresponding JSON file and a preprocessed audio file, namely the JSON file and the preprocessed audio file are converted into an input format which can be received by a Chinese speech clone synthesis model;
and S7.2.3, inputting the audio file obtained in the S7.2.2 into a tone emotion encoder, wherein the output of the tone emotion encoder is a speaker embedded vector, and the speaker vector and a corresponding speaker identifier are stored and used for synthesizing subsequent cloned voices. The subsequent cloning of a certain speaker voice can directly call the stored speaker embedded vector corresponding to the speaker voice.
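The fast cloning path of S7.2 amounts to computing and caching a speaker embedding keyed by the speaker identifier. In the sketch below, `compute_mel` is a hypothetical helper that applies the S2 preprocessing and the Mel extraction of S4.1, and `encoder` is the trained timbre emotion encoder.

```python
import os
import torch

def enroll_speaker(wav_path, speaker_id, encoder, store_path="speaker_embeddings.pt"):
    """Extract and store the speaker embedding of a new user (S7.2.1-S7.2.3)."""
    mel = compute_mel(wav_path)                             # hypothetical helper: S2 preprocessing + Mel spectrum
    with torch.no_grad():
        embedding = encoder(mel.unsqueeze(0)).squeeze(0)    # speaker embedding vector
    store = torch.load(store_path) if os.path.exists(store_path) else {}
    store[speaker_id] = embedding                           # keyed by the speaker identifier (S7.2.3)
    torch.save(store, store_path)                           # reused later when generating cloned speech
    return embedding
```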
Compared with the prior art, the invention has the following advantages and effects:
(1) The invention realizes end-to-end voice synthesis and cloning and simplifies the synthesis process.
(2) The invention synthesizes the voices with different emotions and timbres by embedding the same model and different speaker vectors through a multi-speaker model.
(3) By using speaker embedding vectors generated from short speech, combined with a generative model trained on a larger corpus, the invention can realize voice cloning that reflects the emotion and timbre of a specific speaker from only a short recording, thereby solving the problem of the large corpus normally required for speech synthesis training.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
FIG. 1 is a process flow diagram of the Chinese phonetic clone synthesis model disclosed in the present invention;
FIG. 2 is a network architecture of an encoder in the synthesizer of the present invention;
FIG. 3 is a network architecture diagram of a decoder in the synthesizer of the present invention;
FIG. 4 is a flow chart of the process of the Chinese speech clone synthesis model of the removed timbre emotion coder of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the method in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
The embodiment discloses a Chinese voice cloning method based on end-to-end tone color and emotion migration, fig. 1 is a processing flow chart of a Chinese voice cloning synthesis model disclosed by the invention, fig. 2 is a coder structure in a synthesizer in the embodiment, fig. 3 is a decoder structure in the synthesizer in the embodiment, a user can input voice or text to determine the content of synthesized voice, and a corresponding speaker embedded vector is selected according to the tone color and emotion of a speaker to be simulated to synthesize target voice. In this embodiment, taking a computer as an example, a process flow for generating a clone voice is specifically described, which includes the following steps:
step 101, obtaining user input. A user can input Chinese text or Chinese voice; if the input is Chinese speech, it is converted into Chinese text by the existing speech recognition method.
Step 102, obtaining a speaker embedded vector of a speaker to be imitated.
The step 102 of obtaining the speaker embedding vector is as follows:
1) Voice data acquisition: the Chinese voice of the user is collected and recorded in a quiet environment.
2) Pretreatment: and processing the collected voice file, and converting the voice file into an audio file with an audio sampling rate of 16000Hz, an audio format of wav format, a bit depth of 16bits and a single sound channel number.
3) And inputting the audio file into a tone emotion coder, wherein the output of the tone emotion coder is the speaker embedded vector. And saving the speaker vector and the corresponding speaker identifier for synthesizing the subsequent clone voice.
Step 103, generating a mel spectrum of the synthesized speech: the user input obtained in step 101 and the speaker-embedded vector obtained in step 102 are used as the input of a synthesizer, and the output of the synthesizer is the Mel frequency spectrum of the synthesized voice.
The step 103 of obtaining the synthesized mel frequency spectrum includes the following steps:
1) Obtaining input of an encoder in a synthesizer: translating the input sequence obtained in the step 101 into a phoneme sequence, wherein Chinese is converted into corresponding pinyin, and taking the obtained phoneme sequence as the input of an encoder;
2) Generating a word embedding vector: firstly, inputting the phoneme sequence obtained in the step 1) into a preprocessing network for analysis and transformation, and outputting the preprocessed sequence; then, performing word embedding operation on the preprocessed sequence, calculating the weight of each phoneme in the phoneme sequence corresponding to the rest phonemes, and outputting a 512-dimensional word vector;
3) Acquiring an intermediate state: inputting the word vectors obtained in the step 2) into 3 one-dimensional convolutional layers which are sequentially connected for convolution operation, simultaneously performing BatchNorm operation and Dropout operation on the output after each convolutional layer, and obtaining an intermediate state as the output at the last convolutional layer; wherein the BatchNorm operation keeps the same distribution of the inputs of each layer of neural network during the deep neural network training process, and the Dropout operation temporarily masks the neural network units with a certain probability;
4) Obtaining an encoder hidden state: inputting the intermediate state obtained in the step 3) into a bidirectional LSTM network, and taking the output of the bidirectional LSTM network as an encoder hidden state;
5) Tone and emotional feature embedding: splicing the speaker embedded vector obtained in the step 102 with the encoder hidden state obtained in the step 4) to obtain a final encoder hidden state.
6) Inputting the encoder hidden state obtained in the step 5) into a decoder, wherein the decoder operates in a cycle, each cycle is called a time step, in each time step, attention mechanism operation is carried out on the encoder hidden state obtained in the step 5), unit similarity in the encoder hidden state is measured, wherein the attention mechanism is a matrix formed by context weight vectors, and then normalization processing is carried out on the encoder hidden state subjected to the attention mechanism operation to obtain a context vector;
7) A preprocessing network in a decoder takes the context vector output in the last time step as input, analyzes and processes the context vector, inputs the processed context vector into two layers of LSTM networks which are sequentially connected in the decoder for processing, the output of the last layer of LSTM is taken as a new context vector, and finally inputs the new context vector into a projection layer to obtain a sound spectrum frame and an end probability as output, wherein the sound spectrum frame comprises a predicted Mel frequency spectrum;
8) Taking the sound spectrum frame obtained in the step 7) as the input of a post-processing network in a decoder, wherein the post-processing network consists of three convolutional layers which are sequentially connected, and taking the output of the last convolutional layer as the Mel frequency spectrum of the synthesized voice.
Step 104, inputting the mel spectrum generated in step 103 into a vocoder to obtain a desired voice.
The step 104 of acquiring the required speech specifically comprises the following steps:
in the network architecture for speech synthesis, a WaveRNN vocoder and a Griffin-Lim vocoder are used, both of which receive the mel spectrum as input and predict the waveform for speech synthesis. The Mel frequency spectrum generated in step 103 is input into two vocoders, and the synthesized speech with high synthesis speed is output and has the voice and emotional characteristics of the specified speaker.
In summary, this embodiment works from the text or speech input by the user and the embedding vector of the speaker to be imitated. The content text is converted into a pinyin phoneme sequence and fed into the encoder to obtain the encoder hidden state, which is concatenated with the speaker embedding vector to obtain an encoder hidden state carrying timbre and emotion; this is then fed into the decoder to generate a Mel spectrum, which is input into the vocoder to obtain the target speech, so that the synthesized target speech carries the timbre and emotion of the specified speaker. By using speaker embedding vectors generated from short speech together with a generative model trained on a larger corpus for voice cloning, the invention solves the problem of the large training corpus normally required for speech synthesis.
Example 2
In this embodiment, the timbre emotion encoder in the Chinese speech clone synthesis model is removed, as shown in FIG. 4, to prove the effect of the timbre emotion encoder part in retaining the timbre and emotion information of the speaker in the present invention. In this embodiment, taking a computer as an example, a processing flow of the speech synthesis method is specifically introduced, which includes the following steps:
step 101 example 1 was identical to example 1 with reference to step 101.
Step 102, generating a mel spectrum of the synthesized speech: the user input obtained in step 101 is used as input to a synthesizer, the output of which is the mel spectrum of the synthesized speech.
The step 102 of obtaining the synthesized mel-frequency spectrum is specifically as follows:
The encoder hidden state is obtained as in steps 1) to 4) of step 103 in Embodiment 1. Because the timbre emotion encoder is removed, the generated encoder hidden state is not combined with a speaker embedding vector but is input directly into the decoder, and the synthesized Mel spectrum is then generated as in steps 6) to 8) of step 103 in Embodiment 1.
Step 103 synthesis of the desired speech refers to step 104 in embodiment 1. Because the tone emotion coder is removed, the Chinese speech clone synthesis model lacks the capability of extracting the tone and the emotional characteristics of the speaker in the speech, and the synthesizer and the vocoder structure can only complete the speech synthesis task.
For further verification, two groups of speech with the same content were synthesized, one with the original Chinese speech clone synthesis model and one with the model whose timbre emotion encoder was removed. These two groups, together with real speech, were placed on an online web page for a questionnaire survey. The survey was a blind test, i.e. the evaluators did not know the source of each sample, and scoring used the MOS score (full score 5), the most authoritative international standard for judging speech quality. The final scores, shown in Table 1 below, show that the speech synthesized by the model with the timbre emotion encoder removed scored lower.
TABLE 1. Comparison of MOS scores for different speech sources

Speech source | MOS score
Ground Truth | 4.78 ± 0.05
Original Chinese speech clone synthesis model | 4.62 ± 0.05
Chinese speech clone synthesis model with timbre emotion encoder removed | 4.36 ± 0.04
It can be seen from this embodiment that the Chinese speech clone synthesis of the present invention has the capability of migrating the tone and emotional characteristics of the speaker, and can better enhance the quality of the synthesized speech.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A Chinese voice cloning method for end-to-end timbre and emotion migration is characterized by comprising the following steps:
s1, voice data acquisition: collecting a plurality of Chinese short sentence voice files of a plurality of speakers, recording a plurality of short sentence voices by each speaker according to a given text, establishing a corresponding text mark for each voice file, wherein each voice does not exceed 15 seconds, the total voice duration is not less than 30 hours, and recording the voice in a quiet environment;
s2, data preprocessing: processing the voice file collected in the step S1, unifying the sampling rate, format, bit depth and channel number of the voice file to obtain a required audio file, and simultaneously generating a JSON file containing a recording file mark, a corresponding voice text mark and a speaker mark;
s3, constructing a Chinese voice clone synthesis model: the Chinese voice clone synthesis model comprises a tone emotion coder, a synthesizer and a vocoder;
s4, constructing a tone emotion encoder: the tone emotion encoder comprises three layers of LSTM networks which are sequentially connected, a frequency domain characteristic Mel frequency spectrum of the audio file is calculated to be used as the input of the tone emotion encoder, and a speaker embedding vector with fixed dimensionality is obtained to be used as the output of the tone emotion encoder;
s5, training a synthesizer: the synthesizer consists of 1 encoder and 1 decoder connected in sequence, wherein the encoder comprises, connected in sequence, a preprocessing network composed of fully connected layers, a word embedding module, 3 one-dimensional convolution layers and 1 bidirectional LSTM network, the JSON file is used as the input of the encoder, and the encoder hidden state is used as the output of the encoder; the decoder comprises, connected in sequence, 1 preprocessing network, 2 LSTM layers, 1 projection layer composed of linear mapping layers and 1 post-processing network, and the encoder hidden state and the speaker embedding vector output by the timbre emotion encoder are concatenated and used as the input of the decoder to obtain the Mel spectrum of the synthesized speech as the output of the decoder;
s6, training a vocoder: the vocoder is composed of parallel WaveRNN vocoder and Griffin-Lim vocoder, the Mel frequency spectrum of the synthesized voice output by the decoder is used as the input of the vocoder, and the waveform prediction of the synthesized voice is used as the output of the vocoder;
s7, generating cloned speech: the text input by the user, or the text obtained by speech recognition of the speech input by the user, is combined with the speaker embedding vector of the speaker specified by the user and passed through the synthesizer and the vocoder to obtain the output speech;
or fast voice cloning: the user's audio is preprocessed and input into the timbre emotion encoder to obtain a speaker embedding vector, and the speaker embedding vector is stored for generating cloned speech.
2. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the preprocessing of step S2 comprises the following steps:
s2.1, carrying out voice processing on the plurality of short sentence voice files, and converting the plurality of short sentence recording files into audio files with the audio sampling rate of 16000Hz, the audio format of wav format, the bit depth of 16bits and single sound track;
s2.2, JSON files containing marks are generated: one or more files in JSON format are obtained by splicing the text marks, the speaker IDs and the audio file marks obtained through speech processing, wherein a text mark is the Chinese text corresponding to the audio content, a speaker ID is a numbering mark for the speaker, and an audio file mark is the speaker together with the audio file name corresponding to that speaker's content.
3. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the working process of the timbre emotion encoder in step S4 is as follows:
s4.1, for a given multi-sentence short recording, calculating the Mel frequency spectrum according to the following formula:
m = 2595 × log10(1 + f / 700)

wherein f is the frequency of the short speech, and m is the corresponding value on the Mel scale;
s4.2, inputting the Mel frequency spectrum of the short-sentence voice into a timbre emotion coder, and outputting a speaker embedding vector with a fixed dimensionality, wherein the process is as follows:
s4.2.1, inputting the Mel frequency spectrum of the short speech into three layers of sequentially connected LSTM networks, and mapping the output of each frame of the last layer of LSTM network to a vector with 256-dimensional fixed length, wherein each frame refers to a fixed time unit;
and S4.2.2, averaging and normalizing the output of all time units obtained in S4.2.1 to obtain a final speaker embedding vector with fixed dimensionality, wherein the speaker embedding vector has the function of distinguishing a speaker corresponding to the voice from other speakers, and the speaker embedding vector can keep the tone and emotion of the speaker.
4. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the training process of the timbre emotion encoder in step S4 is as follows:
comparing the speaker embedded vector output by the speaker timbre emotion coder in a certain training iteration with a corresponding speaker sample, and when the speaker can be distinguished according to a comparison result obtained by the speaker embedded vector, indicating that the timbre emotion coder extracts the characteristics capable of distinguishing the speaker, reserving the parameters of the timbre emotion coder, and otherwise, continuing the iterative training.
5. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the encoder of the synthesizer in step S5 operates as follows:
s5.1.1, obtaining the input of the encoder: translating the text marks in the JSON file generated in step S2 into a phoneme sequence, wherein Chinese is converted into the corresponding pinyin, splicing the obtained phoneme sequence into the JSON file, and taking the resulting JSON file as the input of the encoder;
s5.1.2, generating word embedding vectors: firstly, inputting the JSON obtained in the step S5.1.1 into a preprocessing network for analysis and transformation, and outputting a preprocessed sequence; then, performing word embedding operation on the preprocessed sequence, calculating the weight of each phoneme in the phoneme sequence corresponding to the rest phonemes, and outputting a 512-dimensional word vector;
s5.1.3, acquiring an intermediate state: inputting the word vectors obtained in the step S5.1.2 into 3 one-dimensional convolutional layers which are sequentially connected for convolution operation, simultaneously performing BatchNorm operation and Dropout operation on the output after each convolutional layer, and obtaining an intermediate state as the output on the last convolutional layer; wherein the BatchNorm operation keeps the same distribution of the inputs of each layer of neural network during the deep neural network training process, and the Dropout operation temporarily masks the neural network units with a certain probability;
s5.1.4, obtaining the hidden state of the encoder: inputting the intermediate state obtained in the step S5.1.3 into a bidirectional LSTM network, and taking the output of the bidirectional LSTM network as an encoder hidden state;
s5.1.5, embedding timbre and emotional characteristics: and splicing the speaker embedded vector output by the timbre emotion encoder with the encoder hidden state obtained in the S5.1.4 to obtain a final encoder hidden state.
6. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the decoder of the synthesizer in step S5 operates as follows:
s5.2.1, circularly operating a decoder, wherein each cycle is called a time step, and in each time step, performing attention mechanism operation on the encoder hidden state obtained in S5.1.5 to measure unit similarity in the encoder hidden state, wherein the attention mechanism is a matrix formed by context weight vectors, and then performing normalization processing on the encoder hidden state subjected to attention mechanism operation to obtain the context vectors;
s5.2.2, a preprocessing network in a decoder takes the context vector output in the last time step as input, analyzes and processes the context vector, inputs the processed context vector into two layers of LSTM networks which are sequentially connected in the decoder for processing, takes the output of the last layer of LSTM as a new context vector, and finally inputs the new context vector into a projection layer to obtain a sound spectrum frame and an end probability as output, wherein the sound spectrum frame comprises a predicted Mel frequency spectrum;
and S5.2.3, taking the sound spectrum frame obtained in the S5.2.2 as the input of a post-processing network in a decoder, wherein the post-processing network consists of three sequentially connected convolutional layers, and taking the output of the last convolutional layer as the Mel frequency spectrum of the synthesized voice.
7. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the WaveRNN vocoder in step S6 operates as follows:
the WaveRNN vocoder consists of a single-layer RNN and two softmax layers which are sequentially connected, the Mel frequency spectrum of the synthetic voice obtained in the step S5 is used as input, after single-layer RNN processing, the obtained output is divided into two parts which are respectively input into the corresponding softmax layers, and the outputs of the two softmax layers are spliced to obtain a predicted 16-bit audio sequence.
8. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the working procedure of the Griffin-Lim vocoder in the step S6 is as follows:
the Griffin-Lim vocoder works iteratively: a phase is first initialized randomly; an inverse short-time Fourier transform is applied to the Mel spectrum of the synthesized speech obtained in step S5 together with the initialized phase to obtain a time-domain signal; a short-time Fourier transform is then applied to this signal to obtain a new phase and a new magnitude spectrum; the new magnitude spectrum is discarded, the speech is re-synthesized using the new phase together with the known magnitude spectrum, and this operation is iterated continuously to obtain the best synthesized speech as output.
9. The method for cloning Chinese speech with end-to-end timbre and emotion migration as claimed in claim 1, wherein the process of generating cloned speech in step S7 is as follows:
s7.1.1, acquiring an input sequence: the input sequence is a Chinese text or a Chinese voice input by a user, wherein the Chinese voice is converted into the Chinese text by using the existing voice recognition method;
s7.1.2, obtaining the speaker embedded vector: after the data collected in S1 is preprocessed in S2, speaker embedded vectors corresponding to different speakers are extracted through a tone emotion coder trained in S3, and the speaker embedded vectors corresponding to the speakers are selected according to needs of a user to clone;
s7.1.3, generating a Mel frequency spectrum of the synthesized voice: the input sequence obtained in the S7.1.1 and the speaker embedded vector obtained in the S7.1.2 are used as the input of a synthesizer, and the output of the synthesizer is the Mel frequency spectrum of the synthesized voice;
and S7.1.4, inputting the Mel frequency spectrum obtained in the S7.1.3 into a vocoder, and outputting a synthesized clone voice, wherein the synthesized clone voice has the tone and emotion of the speaker specified in the S7.1.2.
10. The method for cloning Chinese speech with end-to-end timbre and emotion migration according to claim 1, wherein the process of fast cloning speech in step S7 is as follows:
s7.2.1, acquiring input voice: collecting Chinese voice of a user;
s7.2.2, pretreatment: processing the Chinese speech obtained in S7.2.1 by using the preprocessing method in S2 to obtain a corresponding JSON file and a preprocessed audio file;
and S7.2.3, inputting the audio file obtained in the S7.2.2 into a tone emotion encoder, wherein the output of the tone emotion encoder is a speaker embedded vector, and the speaker vector and a corresponding speaker identifier are stored and used for synthesizing subsequent cloned voices.
CN202210846358.0A 2022-07-05 2022-07-05 End-to-end tone and emotion migration Chinese voice cloning method Pending CN115359775A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210846358.0A CN115359775A (en) 2022-07-05 2022-07-05 End-to-end tone and emotion migration Chinese voice cloning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210846358.0A CN115359775A (en) 2022-07-05 2022-07-05 End-to-end tone and emotion migration Chinese voice cloning method

Publications (1)

Publication Number Publication Date
CN115359775A true CN115359775A (en) 2022-11-18

Family

ID=84031581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210846358.0A Pending CN115359775A (en) 2022-07-05 2022-07-05 End-to-end tone and emotion migration Chinese voice cloning method

Country Status (1)

Country Link
CN (1) CN115359775A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496944A (en) * 2024-01-03 2024-02-02 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117496944B (en) * 2024-01-03 2024-03-22 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination