CN115547293A - Multi-language voice synthesis method and system based on layered prosody prediction - Google Patents

Multi-language voice synthesis method and system based on layered prosody prediction

Info

Publication number
CN115547293A
Authority
CN
China
Prior art keywords
voice
text
style
speaker
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211178621.XA
Other languages
Chinese (zh)
Inventor
王秋华
陈嘉怡
李逸佳
吴国华
任一支
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211178621.XA priority Critical patent/CN115547293A/en
Publication of CN115547293A publication Critical patent/CN115547293A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multilingual speech synthesis method based on hierarchical prosody prediction, comprising the following steps: S1, building a training set: acquire multilingual standard reference audio from different speakers together with the corresponding sample texts, preprocess the reference audio to obtain training samples, and assemble the training samples into a training set; S2, constructing and training a speech synthesis model: train the constructed speech synthesis model on the preprocessed training set; S3, synthesizing speech: the trained speech synthesis model generates multilingual speech in a specified voice style from the input text to be synthesized and the reference audio and outputs it through a vocoder. Prosodic features in the text and the reference audio are extracted effectively in scenarios where multiple languages are used alternately, the flexibility and controllability of the prosody of the synthesized speech are improved, prosody is regulated at fine granularity to improve the naturalness of the synthesized speech, and the functions of cloning any speaker's voice and transferring any speaking style are realized.

Description

Multi-language voice synthesis method and system based on layered prosody prediction
Technical Field
The invention belongs to the field of speech synthesis, relates to mixed multilingual speech synthesis, and in particular to a multilingual speech synthesis method and system based on hierarchical prosody prediction.
Background
Speech synthesis is the technology of converting text into speech by mechanical or electronic means. In recent years, neural-network-based speech synthesis has become mainstream; it can directly learn the mapping from text sequences to acoustic features, model the prosody of human speech (such as intonation, rhythm, speaking rate, and volume), and improve the quality and naturalness of the synthesized speech. However, the prosody control of such methods usually considers only the single-language case.
In people's daily communication today, multiple languages are often used alternately. In particular, in fields such as medicine and computing there are many cross-language technical terms, for example "work on NLP"; this phenomenon is known in linguistics as code-switching. To accommodate such usage, a speech synthesis system should therefore not be limited to a single language but should be extended to multiple languages.
However, building a multilingual speech synthesis system faces the following technical problems: (1) different languages have different graphemes and pronunciations, which increases the difficulty of multilingual speech synthesis; (2) multilingual corpora are sparse: some languages only have recordings of a small number of speakers in a single voice style, and corpora in which the same speaker speaks multiple languages are scarce, so a deep neural network cannot be trained sufficiently; (3) the speech style extraction modules and style feature clustering methods based on global style tokens provided in the related art can only regulate speech style at coarse granularity (sentence level) and cannot realize fine-grained (phrase-level, word-level, and phoneme-level) prosody variation; they also lack a style learning method oriented to the input text sequence, so such speech synthesis systems make insufficient use of the prosodic information in the text.
Disclosure of Invention
The invention aims to solve the above technical problems and provides a multilingual speech synthesis method and system based on hierarchical prosody prediction, so that prosodic features in the text and the reference audio can be extracted effectively in scenarios where multiple languages are used alternately, the flexibility and controllability of the prosody of the synthesized speech are improved, prosody is regulated at fine granularity to improve the naturalness of the synthesized speech, and the functions of cloning any speaker's voice and transferring any speaking style are realized.
In order to solve the above technical problems, the technical solution of the invention is as follows:
A multilingual speech synthesis method based on hierarchical prosody prediction comprises the following steps:
S1, building a training set
Acquire multilingual standard reference audio from different speakers and the corresponding sample texts, preprocess them to obtain training samples, and assemble the training samples into a training set;
S2, constructing and training a speech synthesis model
S21, construct a speech synthesis model comprising a generative convolutional encoder, a speaker encoder, a batch-instance-normalized global style token layer, a prosody module, an adversarial speaker classifier, an attention module, a generative adversarial network, and a decoder;
S22, train the constructed speech synthesis model on the preprocessed training set;
S3, speech synthesis
The trained speech synthesis model generates multilingual speech in a specified voice style from the input text to be synthesized and the reference audio, and outputs it through a vocoder.
The present invention relates to the following definitions:
definition 1: the synthesizer is a generator consisting of a convolution encoder, an attention mechanism and a decoder and is used for continuously outputting a synthesized Mel frequency spectrum diagram.
Definition 2: language ID, additional information to the text to distinguish multiple languages pronunciations. For example: when the text is English, the language ID is marked as "en"; in the case of Chinese, the symbol is "zh".
Definition 3: the speaker ID is a code for distinguishing a speaker to which speech belongs in a multi-speaker data set.
Definition 4: IPA, international phonetic symbol, is a standardized method of labeling spoken sounds to accurately record and distinguish pronunciations.
Preferably, in step S1 the preprocessing method is: extract the feature vectors of the sample text, and convert the standard reference audio into a mel spectrogram.
Preferably, the feature vectors include speaker ID feature vectors, word-level features, character vectors, and language ID feature vectors.
Preferably, the prosody module consists of word-level and IPA-level style extractors and style predictors; the generative convolutional encoder consists of a context parameter generator and a text encoder.
Preferably, in step S22 the training method of the speech synthesis model comprises the following sub-steps:
S221, input the language ID feature vector into the context parameter generator to obtain the parameters required by each network layer in the text encoder; the text encoder encodes the multilingual text to obtain IPA phonetic features, converts the character vectors into hidden-layer phonetic features, and outputs a phonetic-text feature vector;
S222, take the phonetic-text feature vector output by the generative convolutional encoder as the input of the adversarial speaker classifier; obtain the speaker feature information of the text through an adaptive average pooling layer, a fully connected layer, and L2-norm normalization; then perform the backward update, multiplying the gradient passed back to the generative convolutional encoder by a negative constant through a gradient reversal layer to achieve adversarial training, so that the speaker cannot be identified from the output of the generative convolutional encoder. This decouples the speaker features from the text content features, i.e., the text encoder learns speaker-independent text information, enabling the system to transfer a speaker's voice across languages;
S223, concatenate the multi-source features extracted from the mel spectrogram with the phonetic-text feature vector, respectively;
S224, the attention mechanism summarizes the phonetic-text feature vectors into a context weight vector for each decoding time step;
S225, the decoder predicts the corresponding mel spectrogram from the phonetic-text feature vector, the multi-source features, and the context weight vector;
S226, use the generative adversarial network to improve speech quality during training.
Preferably, the method of step S223 is:
input the mel spectrogram into the speaker encoder, extract the speaker voice features, and concatenate them with the phonetic-text feature vector output by the generative convolutional encoder;
input the mel spectrogram into the batch-instance-normalized global style token layer, extract the sentence-level speaker style features, and concatenate them with the phonetic-text feature vector output by the generative convolutional encoder;
input the word-level features, the IPA phonetic features, and the mel spectrogram into the prosody module, train it using the mean absolute error as the loss function, predict the IPA-level text style features hierarchically, and concatenate them with the phonetic-text feature vector output by the generative convolutional encoder.
In step S225, the decoder predicts the corresponding mel spectrogram from the phonetic-text feature vector, the speaker voice features, the sentence-level speaker style features, the IPA-level text style features, and the context vector.
Preferably, the generative adversarial network includes a generator and a discriminator; the generator is the synthesizer and the discriminator is a binary classifier.
Preferably, the method by which the generative adversarial network improves speech quality is: the input of the synthesizer is the text and the standard reference audio, and its output is a synthesized mel spectrogram; at the same time, a binary classifier is added as the discriminator to judge whether its input is a real or a synthesized mel spectrogram. After repeated iterative training, the synthesizer gradually produces mel spectrograms whose authenticity cannot be distinguished, i.e., the corpora output by the synthesizer become closer to real speech, further improving speech quality.
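By way of illustration only (not a limitation of the claimed method), this adversarial training can be sketched in PyTorch roughly as follows; the synthesizer and discriminator modules, the L1 reconstruction term, and the loss weighting are assumptions introduced for the example.

    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()

    def gan_step(synthesizer, discriminator, opt_g, opt_d, text, ref_audio, real_mel):
        # Discriminator step: real mel -> label 1, synthesized mel -> label 0.
        fake_mel = synthesizer(text, ref_audio).detach()
        d_real = discriminator(real_mel)
        d_fake = discriminator(fake_mel)
        d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator (synthesizer) step: fool the discriminator while matching the target mel.
        fake_mel = synthesizer(text, ref_audio)
        g_adv = bce(discriminator(fake_mel), torch.ones_like(d_fake))
        g_rec = nn.functional.l1_loss(fake_mel, real_mel)   # reconstruction term (assumed)
        g_loss = g_rec + 0.1 * g_adv                        # weighting is an assumption
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()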
Preferably, step S3 comprises the following steps:
S31, preprocess the text to be synthesized and the reference audio of an arbitrary speaker as the input of the trained speech synthesis model;
S32, the speech synthesis model outputs a mel spectrogram of the text content in the specified speaker's voice style;
S33, the vocoder converts the mel spectrogram into a speech signal, realizing on-the-fly speech generation.
Further, the adversarial speaker classifier is used only during training.
Further, the attention module assigns a score to each encoder state, computes attention weights from the scores, computes a context vector from the weights, and feeds the information gathered in the context vector to the decoder.
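A minimal sketch of this scoring-and-context computation is shown below, using additive (Bahdanau-style) scoring; the scoring form and layer sizes are assumptions, since the invention does not prescribe a particular attention variant.

    import torch
    import torch.nn as nn

    class AdditiveAttention(nn.Module):
        def __init__(self, enc_dim, dec_dim, attn_dim=128):
            super().__init__()
            self.w_enc = nn.Linear(enc_dim, attn_dim)
            self.w_dec = nn.Linear(dec_dim, attn_dim)
            self.v = nn.Linear(attn_dim, 1)

        def forward(self, enc_states, dec_state):
            # enc_states: (B, T, enc_dim); dec_state: (B, dec_dim)
            scores = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state).unsqueeze(1)))
            weights = torch.softmax(scores, dim=1)           # attention weight for each encoder state
            context = (weights * enc_states).sum(dim=1)      # context vector for this decoding step
            return context, weights.squeeze(-1)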
Further, the generative adversarial network consists of a generator and a discriminator. The generator continuously produces pseudo corpora and the discriminator continuously tries to tell real from fake; during adversarial training the generator is thus forced to output pseudo corpora ever closer to the real ones, until the discriminator gradually finds it difficult to distinguish them.
Further, the decoder is an autoregressive recurrent neural network consisting of a pre-processing network, two long short-term memory (LSTM) layers, a linear mapping network, and a post-processing network.
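An illustrative skeleton of one decoding step of such an autoregressive decoder (pre-processing network, two LSTM layers, linear mapping, and a post-processing network applied to the complete sequence) is sketched below; all dimensions and the single-convolution post-net are simplifying assumptions.

    import torch
    import torch.nn as nn

    class AutoregressiveDecoder(nn.Module):
        def __init__(self, n_mels=80, ctx_dim=512, prenet_dim=256, lstm_dim=1024):
            super().__init__()
            # pre-processing network
            self.prenet = nn.Sequential(nn.Linear(n_mels, prenet_dim), nn.ReLU(),
                                        nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
            # two long short-term memory layers
            self.lstm1 = nn.LSTMCell(prenet_dim + ctx_dim, lstm_dim)
            self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
            # linear mapping network producing one mel frame per step
            self.proj = nn.Linear(lstm_dim + ctx_dim, n_mels)
            # post-processing network, simplified here to a single residual convolution
            self.postnet = nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2)

        def step(self, prev_frame, context, states):
            (h1, c1), (h2, c2) = states
            x = torch.cat([self.prenet(prev_frame), context], dim=-1)
            h1, c1 = self.lstm1(x, (h1, c1))
            h2, c2 = self.lstm2(h1, (h2, c2))
            frame = self.proj(torch.cat([h2, context], dim=-1))
            return frame, ((h1, c1), (h2, c2))

        def refine(self, mel):
            # mel: (B, n_mels, T); residual refinement over the full predicted sequence
            return mel + self.postnet(mel)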
Further, the vocoder WaveGlow is a non-autoregressive vocoder built from a flow-based generative model combined with WaveNet-style components. Compared with WaveNet, WaveGlow synthesizes speech with lower computational complexity and can meet the requirements of a real-time system.
The invention further provides a multilingual speech synthesis system based on hierarchical prosody prediction, comprising:
a sample preprocessing module: used to convert the input text into a phoneme sequence and to encode the phoneme sequence, the language ID, and the speaker ID into a character vector, a language ID feature vector, and a speaker ID feature vector, respectively; to extract word-level features; and to convert the reference audio into a mel spectrogram;
a voice style migration module: consisting of the speaker encoder, the batch-instance-normalized global style token layer, and the hierarchical prosody module, used to obtain the speaker voice features and sentence-level speaker style features from the reference audio and the style features of the text to be synthesized; the prosody module consists of a word-level style extractor, a word-level style predictor, an IPA-level style extractor, and an IPA-level style predictor, and predicts the IPA-level text style features on the basis of first predicting the word-level text style features of the input text sequence;
a speech synthesis module: consisting of a generative convolutional encoder, an attention mechanism, and a decoder, used to obtain the mel spectrogram of multilingual speech in the specified voice style;
a vocoder: used to convert the mel spectrogram output by the decoder into a sound signal.
The invention has the following features and beneficial effects:
With the above technical solution, multilingual pronunciation is realized using phonetic features based on the International Phonetic Alphabet (IPA). The invention provides an input representation based on IPA phonetic features, so that when a new language is added to the speech synthesis system the closest pronunciation can be obtained without adding a phoneme set for that language; in other words, the phonemes of a new language do not need to be trained further when the language is added. This gives the multilingual speech synthesis model good language transfer learning and language extension capabilities and improves the efficiency of building it.
The adversarial speaker classifier overcomes the sparsity of multilingual corpora and improves the accuracy of cross-language encoding. The invention adds an adversarial speaker classifier to decouple speaker features from text content features, thereby guiding the text encoder to encode text in a speaker-independent way, so the system can transfer a speaker's voice across languages without any mixed multilingual training corpus.
The hierarchical prosody module obtains the prosodic information in the text more effectively. The prosody module provided by the invention consists of a word-level style extractor, a word-level style predictor, an IPA-level style extractor, and an IPA-level style predictor; it predicts the IPA-level text style features on the basis of first predicting the word-level text style features of the input text sequence, thus realizing hierarchical prosody prediction, providing more prosodic information for fine-grained regulation of the style features, and improving the naturalness of the synthesized speech.
The batch-instance-normalized global style token layer extracts the style information in speech, so that interfering prosody can be selectively normalized away while useful prosody is retained, and the neural network can learn the prosody of speech from the reference audio, achieving free control of the voice style and transfer of any speaking style.
The speaker encoder realizes the cloning of any speaker's voice. The invention provides an independently trained speaker encoder that extracts speaker information from the audio of any given speaker, so that the synthesizer can generate a mel spectrogram in that speaker's voice, achieving the effect of cloning any speaker's voice.
The generative adversarial network further improves speech quality. It consists of a generator and a discriminator; the generator continuously produces pseudo corpora and the discriminator continuously tries to tell real from fake, so that during adversarial training the generator is forced to output pseudo corpora ever closer to the real ones until the discriminator can hardly distinguish them. The invention designs the synthesizer as the generator, continuously outputting synthesized mel spectrograms, and adds a binary classifier as the speech discriminator, which continuously learns to distinguish the synthesized spectrograms output by the generator from the real spectrograms of the training data. Through this adversarial game, when the synthesizer eventually produces mel spectrograms whose authenticity the discriminator cannot determine, the corpora it outputs are closer to real speech, further improving the quality of the multilingual speech.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a block diagram of a multi-lingual speech synthesis system based on hierarchical prosody prediction according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an overall model for a multi-lingual speech synthesis system based on hierarchical prosody prediction according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for multi-lingual speech synthesis based on hierarchical prosody prediction according to an embodiment of the invention;
FIG. 4 is a block diagram of a sample preprocessing module of a hierarchical prosody prediction based multilingual speech synthesis system according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating the steps of a voice style migration module of a hierarchical prosody prediction based multilingual speech synthesis system according to an embodiment of the present invention;
FIG. 6 is a block diagram of a speech synthesis module of a hierarchical prosody prediction based multilingual speech synthesis system according to an embodiment of the present invention;
FIG. 7 is a block diagram of a vocoder for a multi-lingual speech synthesis system based on hierarchical prosody prediction according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The implementation of the invention is described below in more detail with reference to the drawings, taking the Chinese-English bilingual case as an example; the scope of the invention is not limited to the following description.
As shown in fig. 2, in an embodiment of the invention, the speech synthesis model adopted by the invention comprises a generative convolutional encoder, a speaker encoder, a batch-instance-normalized global style token layer, a prosody module, an adversarial speaker classifier, an attention module, a generative adversarial network, and a decoder, wherein the prosody module comprises word-level and IPA-level style extractors and style predictors, and the generative convolutional encoder comprises a context parameter generator and a text encoder;
Taking the Chinese-English bilingual case as an example, the multilingual speech synthesis method based on hierarchical prosody prediction of the invention, as shown in fig. 3, comprises the following steps:
S1, acquire and process the multilingual reference audio of different speakers and the corresponding sample texts;
Step S1 specifically comprises the following steps:
S11, for a reference audio, apply pre-emphasis, framing, a Hamming window, the short-time Fourier transform, a mel filter bank, and a logarithm to output a mel spectrogram;
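A possible implementation of this preprocessing chain using librosa is sketched below; the sampling rate, frame length, hop size, number of mel bands, and pre-emphasis coefficient are assumed values not fixed by the invention.

    import librosa
    import numpy as np

    def audio_to_log_mel(path, sr=22050, n_fft=1024, hop=256, n_mels=80):
        y, _ = librosa.load(path, sr=sr)
        y = librosa.effects.preemphasis(y, coef=0.97)                   # pre-emphasis
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,   # framing + Hamming window + STFT + mel filter bank
                                             hop_length=hop, win_length=n_fft,
                                             window="hamming", n_mels=n_mels)
        return np.log(mel + 1e-6)                                       # log mel spectrogram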
S12, for the text, obtain word-level feature information, convert the Chinese and English in the text into phonemes, and construct a mixed Chinese-English phoneme lexicon. English needs no special conversion; for Chinese, a unified word-break symbol is used to mark breath or pause positions in the speech. The Chinese text is further converted into Hanyu Pinyin format, so that Chinese characters are represented with Latin letters and their tones with the digits 1 to 5; for example, the original text "I cannot find the book you want." is converted, after word breaking and pinyin conversion, into "wo3 × zhao3 bu2 dao4 × ni3 × xiang3 yao × de5 × shu1 .". Character vectors are then used as the text features: a character inventory is built from all characters appearing in the training samples, including the letters a-z, the digits 1-5, and the punctuation marks. Taking "zhao3 bu2 dao4" as an example, the system numbers each character "z", "h", "a", "o", "3", "blank", etc., and randomly initializes each number to a 512-dimensional feature vector, i.e., a character vector;
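The character-level preprocessing above might be sketched as follows; the choice of the pypinyin library for grapheme-to-phoneme conversion, the "×" word-break symbol, and the exact character inventory are assumptions made for the example.

    import torch.nn as nn
    from pypinyin import lazy_pinyin, Style   # third-party G2P library for Chinese (an assumed choice)

    BREAK = "×"   # unified word-break symbol marking breath/pause positions

    def chinese_to_pinyin(words):
        # words: segmented Chinese words, e.g. ["我", "找不到", "你", "想要", "的", "书"]
        out = []
        for w in words:
            out.extend(lazy_pinyin(w, style=Style.TONE3))   # e.g. "zhao3", "bu4", "dao4"
            out.append(BREAK)
        return " ".join(out)

    # Character inventory over the letters a-z, digits 1-5, blank, punctuation, and the break symbol.
    charset = list("abcdefghijklmnopqrstuvwxyz12345 .,?!") + [BREAK]
    char2id = {ch: i for i, ch in enumerate(charset)}
    char_embedding = nn.Embedding(len(charset), 512)   # each numbered character -> 512-dim character vector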
S13, add a language ID and a speaker ID as additional information for each character:
S131, when the text is English the language ID is marked "en", and when it is Chinese it is marked "zh"; the language ID is encoded into a language ID feature vector of size 4 using the embedding module in PyTorch;
S132, similarly, the speaker ID is encoded into a speaker ID feature vector of size 32 by the embedding module. In actual use, only the speaker ID and the text need to be input, e.g. "I am engaged in work on the ComputerVision side.", which is preprocessed into "wo3 × cong2 shi4 × ComputerVision fang1 mian4 × de5 × gong1 zuo4 ." with its language ID sequence computed as "zh-19, en-16, zh".
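The two embeddings above can be sketched directly with PyTorch's nn.Embedding, matching the stated sizes of 4 and 32; the numbers of languages and speakers used here are placeholders.

    import torch
    import torch.nn as nn

    NUM_LANGUAGES, NUM_SPEAKERS = 2, 100               # zh/en plus a placeholder speaker count
    lang_embedding = nn.Embedding(NUM_LANGUAGES, 4)    # language ID -> language ID feature vector of size 4
    spk_embedding = nn.Embedding(NUM_SPEAKERS, 32)     # speaker ID  -> speaker ID feature vector of size 32

    lang_ids = torch.tensor([0, 0, 1, 0])    # e.g. zh, zh, en, zh for four successive characters
    l = lang_embedding(lang_ids)             # per-character language ID feature vectors, shape (4, 4)
    s = spk_embedding(torch.tensor([7]))     # feature vector for speaker number 7, shape (1, 32)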
S2, construct and train the speech synthesis model, and train the constructed model on the preprocessed samples; the speech synthesis model comprises a generative convolutional encoder, a speaker encoder, a batch-instance-normalized global style token layer, a prosody module, an adversarial speaker classifier, an attention module, a generative adversarial network, and a decoder;
In step S2, the training method of the speech synthesis model is as follows:
S21, use the character vector c and the language ID feature vector l as the input of the generative convolutional encoder:
S211, the generative convolutional encoder consists of a text encoder and 28 parameter generators; each parameter generator PG_i takes the same language ID feature vector l as input and produces the parameters required by one network layer of the text encoder, θ_i = PG_i(l) = w_i·l, 1 ≤ i ≤ 28, so that the text encoder can encode Chinese and English text at the same time, where w_i denotes trainable parameters;
S212, the text encoder consists of 14 groups of one-dimensional convolutional layers (CNN) and batch normalization layers (BN), where j denotes the group index. The character vector c is input to the text encoder. For the first group of CNN and BN, CNN_j extracts features from the input character vector c according to the parameters θ_(2j-1), and the phonemes are mapped to their phonetic features using an IPA-based dictionary. Specifically, ten multi-valued phonetic features are used, the first nine of which are read directly from the IPA: consonant/vowel, voicing (voiced/unvoiced), vowel height, vowel openness, vowel roundness, vowel stress, consonant place of articulation, consonant manner of articulation, and sound modification (e.g., nasal, retroflex); the tenth feature is the symbol type, used to integrate tokens such as silence, sentence ends, and word boundaries. A feed-forward neural network encodes each multi-valued phonetic feature into a different number of binary variables and constructs binary feature vectors; the resulting binary vectors are collectively called the IPA phonetic features. BN_j then normalizes the features output by the previous layer according to the parameters θ_(2j), yielding the first group of hidden-layer features h_j; the character vector c is replaced by h_j as the input of the next group of CNN and BN, and so on, until the phonetic-text feature vector is finally output;
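One possible realization of a single CNN_j/BN_j group whose parameters are generated from the language ID feature vector, as in θ_i = PG_i(l) = w_i·l, is sketched below; the channel count, kernel width, and the reshaping of the generated vector into convolution weights are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GeneratedConvGroup(nn.Module):
        """One CNN_j + BN_j group of the text encoder with language-conditioned parameters."""
        def __init__(self, lang_dim=4, channels=512, kernel=5):
            super().__init__()
            self.channels, self.kernel = channels, kernel
            # PG_(2j-1): generates the convolution weight and bias from l (theta = W * l, no bias term).
            self.pg_conv = nn.Linear(lang_dim, channels * channels * kernel + channels, bias=False)
            # PG_(2j): generates the scale/shift used by the normalization layer.
            self.pg_norm = nn.Linear(lang_dim, 2 * channels, bias=False)

        def forward(self, x, l):
            # x: (B, channels, T) character or hidden-layer features; l: (lang_dim,) language ID feature vector
            theta = self.pg_conv(l)
            n_w = self.channels * self.channels * self.kernel
            w = theta[:n_w].view(self.channels, self.channels, self.kernel)
            b = theta[n_w:]
            h = F.conv1d(x, w, b, padding=self.kernel // 2)
            gamma, beta = self.pg_norm(l).chunk(2)
            h = F.batch_norm(h, None, None, weight=gamma, bias=beta, training=True)
            return torch.relu(h)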
S22, in the training stage, the phonetic-text feature vector output by the generative convolutional encoder is used as the input of the adversarial speaker classifier: first, an adaptive average pooling layer reduces the time axis of the phonetic-text feature vector from two dimensions to one; next, a fully connected layer rescales the number of channels; finally, the speaker feature information of the text is output after L2-norm normalization. The backward update is then performed, and the gradient passed back to the generative convolutional encoder is multiplied by a negative constant through a gradient reversal layer, achieving adversarial training: the speaker cannot be identified from the output of the generative convolutional encoder, so the speaker features are decoupled from the text content features, i.e., the text encoder learns speaker-independent text information and the system can transfer a speaker's voice across languages;
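A minimal sketch of the gradient reversal layer and the adversarial speaker classifier described in this step is given below; the feature dimension, the number of speakers, and the reversal constant lam are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            # multiply the gradient flowing back to the encoder by a negative constant
            return -ctx.lam * grad_out, None

    class AdversarialSpeakerClassifier(nn.Module):
        def __init__(self, feat_dim=512, n_speakers=100, lam=1.0):
            super().__init__()
            self.lam = lam
            self.pool = nn.AdaptiveAvgPool1d(1)      # collapse the time axis
            self.fc = nn.Linear(feat_dim, n_speakers)

        def forward(self, text_feats):
            # text_feats: (B, feat_dim, T) output of the generative convolutional encoder
            x = GradReverse.apply(text_feats, self.lam)   # reversal applied at the classifier input
            x = self.pool(x).squeeze(-1)                  # adaptive average pooling
            x = self.fc(x)                                # fully connected layer rescales the channels
            return F.normalize(x, p=2, dim=-1)            # L2 normalization -> speaker feature information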
S23, take the word-level features, the IPA phonetic features, and the mel spectrogram obtained from the reference audio as the input of the voice style migration module, and extract the speaker voice features, the sentence-level speaker style features, and the IPA-level text style features for decoding:
S231, input the mel spectrogram into the speaker encoder and output the speaker voice features, so that the speech synthesis system can clone the voice of any speaker;
S232, input the mel spectrogram into the batch-instance-normalized global style token layer and output the sentence-level speaker style features, so that the speech synthesis system can transfer any speaking style;
S233, input the word-level features and the IPA phonetic features into the hierarchical prosody module; specifically, in the training stage the number of training epochs is preset to epoch = 30k and the batch size to batch_size = 16, where one sample comprises a standard reference audio and the corresponding sample text;
S234, for each sample, input the mel spectrogram of the standard reference audio into the word-level style extractor to obtain the sample's real word-level speech style features, and input the word-level features of the sample text into the word-level style predictor to obtain the sample's predicted word-level text style features; train with the mean absolute error between the two as the loss function, adjusting the parameters with an Adam optimizer;
S235, add the word-level speech style features from S234 to the IPA phonetic features;
S236, for each sample, input the mel spectrogram of the standard reference audio into the IPA-level style extractor to obtain the sample's real IPA-level speech style features, and input the expanded IPA phonetic features into the IPA-level style predictor to obtain the sample's predicted IPA-level text style features; train with the mean absolute error as the loss function and optimize the parameters with an Adam optimizer;
S237, in the inference stage, input the word-level features into the word-level style predictor to obtain the word-level text style features, add them to the IPA phonetic features, and input the expanded IPA phonetic features into the IPA-level style predictor to obtain the IPA-level text style features, thereby providing more prosodic information for fine-grained regulation of the style features and improving the naturalness of the synthesized speech.
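The hierarchical training of steps S234-S236 might be sketched as follows; the extractor and predictor modules are treated as black boxes, a single Adam optimizer over all four sub-modules is assumed, concatenation stands in for "adding" the word-level style to the IPA features, and alignment of the word-level style to the IPA sequence length is assumed to be handled upstream.

    import torch
    import torch.nn as nn

    mae = nn.L1Loss()   # mean absolute error

    def prosody_step(word_extractor, word_predictor, ipa_extractor, ipa_predictor,
                     optimizer, mel, word_feats, ipa_feats):
        # Word level: ground-truth style from the reference mel, prediction from word-level text features.
        word_style_true = word_extractor(mel)
        word_style_pred = word_predictor(word_feats)
        loss_word = mae(word_style_pred, word_style_true)

        # Expand the IPA features with the word-level style, then train the IPA-level predictor.
        ipa_in = torch.cat([ipa_feats, word_style_true], dim=-1)
        ipa_style_true = ipa_extractor(mel)
        ipa_style_pred = ipa_predictor(ipa_in)
        loss_ipa = mae(ipa_style_pred, ipa_style_true)

        loss = loss_word + loss_ipa
        optimizer.zero_grad(); loss.backward(); optimizer.step()   # Adam optimizer assumed
        return loss.item()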
S24, concatenate the speaker voice features, the sentence-level speaker style features, and the IPA-level text style features with the phonetic-text feature vector output by the generative convolutional encoder;
S25, the decoder predicts the acoustic features from the phonetic-text feature vector output by the generative convolutional encoder, the speaker voice features, sentence-level speaker style features, and IPA-level text style features output by the voice style migration module, and the context weight vector, and outputs the corresponding mel spectrogram, achieving the effects of cloning any speaker's voice, transferring any speaking style, and improving the naturalness of the speech.
S3, the speech synthesis model generates multilingual speech in the specified voice style from the input text to be synthesized and the reference audio, and outputs it through the vocoder;
Step S3 specifically comprises the following steps:
S31, preprocess the text to be synthesized and the reference audio of an arbitrary speaker as the input of the trained speech synthesis model;
S32, the speech synthesis model outputs a mel spectrogram of the text content in the specified speaker's voice style;
S33, the vocoder converts the mel spectrogram into a speech signal, realizing on-the-fly speech generation.
Based on the above embodiments, the present invention further provides a multi-language speech synthesis system based on hierarchical prosody prediction, as shown in fig. 1, including a sample preprocessing module 10, a voice style migration module 20, a speech synthesis module 30, and a vocoder 40.
Specifically, as shown in fig. 2, a sample preprocessing module 10 is configured to obtain and process input reference audio and text;
the voice style migration module 20 is configured to extract an IPA level text style feature, a speaker voice feature and a sentence level speaker style feature from the word level feature, the IPA voice feature and the mel frequency spectrogram corresponding to the reference audio;
the speech synthesis module 30 is connected with the sample preprocessing module 10 and the sound style migration module 20, and the speech synthesis module 30 is configured to generate a mel frequency spectrum diagram corresponding to a multi-language speech with a specified sound style according to an input text to be synthesized and a reference audio;
the vocoder 40 is for outputting synthesized speech.
While the embodiments of the present invention have been shown and described, the above embodiments are illustrative, and not to be construed as limiting the present invention, and those skilled in the art can make changes, modifications, substitutions and alterations to the above embodiments within the scope of the present invention, and such changes should be construed as being covered by the present invention.

Claims (9)

1. A multilingual speech synthesis method based on hierarchical prosody prediction, characterized by comprising the following steps:
S1, building a training set
acquiring multilingual standard reference audio from different speakers and the corresponding sample texts, preprocessing them to obtain training samples, and assembling the training samples into a training set;
S2, constructing and training a speech synthesis model
S21, constructing a speech synthesis model comprising a generative convolutional encoder, a speaker encoder, a batch-instance-normalized global style token layer, a prosody module, an adversarial speaker classifier, an attention module, a generative adversarial network, and a decoder;
S22, training the constructed speech synthesis model on the preprocessed training set;
S3, speech synthesis
the trained speech synthesis model generating multilingual speech in a specified voice style from the input text to be synthesized and the reference audio, and outputting it through a vocoder.
2. The multilingual speech synthesis method based on hierarchical prosody prediction according to claim 1, wherein the preprocessing in step S1 comprises: extracting the feature vectors of the sample text, and converting the standard reference audio into a mel spectrogram.
3. The multilingual speech synthesis method based on hierarchical prosody prediction according to claim 2, wherein the feature vectors include speaker ID feature vectors, word-level features, character vectors, and language ID feature vectors.
4. The multilingual speech synthesis method based on hierarchical prosody prediction according to claim 3, wherein the prosody module consists of word-level and IPA-level style extractors and style predictors, and the generative convolutional encoder consists of a context parameter generator and a text encoder.
5. The multilingual speech synthesis method based on hierarchical prosody prediction according to claim 4, wherein the training method of the speech synthesis model in step S22 comprises the following sub-steps:
S221, inputting the language ID feature vector into the context parameter generator to obtain the parameters required by each network layer in the text encoder, the text encoder encoding the multilingual text to obtain IPA phonetic features, converting the character vectors into hidden-layer phonetic features, and outputting a phonetic-text feature vector;
S222, taking the phonetic-text feature vector output by the generative convolutional encoder as the input of the adversarial speaker classifier, obtaining the speaker feature information of the text through an adaptive average pooling layer, a fully connected layer, and L2-norm normalization, then performing the backward update, multiplying the gradient passed back to the generative convolutional encoder by a negative constant through a gradient reversal layer, so that the speaker cannot be identified from the output of the generative convolutional encoder, thereby decoupling the speaker features from the text content features;
S223, concatenating the multi-source features extracted from the mel spectrogram with the phonetic-text feature vector, respectively;
S224, the attention mechanism summarizing the phonetic-text feature vectors into a context weight vector for each decoding time step;
S225, the decoder predicting the corresponding mel spectrogram from the phonetic-text feature vector, the multi-source features, and the context weight vector;
S226, using the generative adversarial network to improve speech quality during training.
6. The multilingual speech synthesis method based on hierarchical prosody prediction according to claim 5, wherein the method of step S223 comprises:
inputting the mel spectrogram into the speaker encoder, extracting the speaker voice features, and concatenating them with the phonetic-text feature vector output by the generative convolutional encoder;
inputting the mel spectrogram into the batch-instance-normalized global style token layer, extracting the sentence-level speaker style features, and concatenating them with the phonetic-text feature vector output by the generative convolutional encoder;
inputting the word-level features, the IPA phonetic features, and the mel spectrogram into the prosody module, training with the mean absolute error as the loss function, predicting the IPA-level text style features hierarchically, and concatenating them with the phonetic-text feature vector output by the generative convolutional encoder;
wherein in step S225 the decoder predicts the corresponding mel spectrogram from the phonetic-text feature vector, the speaker voice features, the sentence-level speaker style features, the IPA-level text style features, and the context vector.
7. The multilingual speech synthesis method based on hierarchical prosody prediction according to claim 5, wherein the generative adversarial network comprises a generator and a discriminator, the generator being the synthesizer and the discriminator being a binary classifier.
8. The multilingual speech synthesis method based on hierarchical prosody prediction according to claim 7, wherein the method by which the generative adversarial network improves speech quality comprises: the input of the synthesizer being the text and the standard reference audio, and its output being a synthesized mel spectrogram; a binary classifier being added as the discriminator to judge whether its input is a real or a synthesized mel spectrogram; and, after repeated iterative training, the synthesizer gradually producing mel spectrograms whose authenticity the discriminator cannot determine.
9. A multilingual speech synthesis system based on hierarchical prosody prediction, characterized by comprising:
a sample preprocessing module: used to convert the input text into a phoneme sequence and to encode the phoneme sequence, the language ID, and the speaker ID into a character vector, a language ID feature vector, and a speaker ID feature vector, respectively; to extract word-level features; and to convert the reference audio into a mel spectrogram;
a voice style migration module: consisting of the speaker encoder, the batch-instance-normalized global style token layer, and the hierarchical prosody module, used to obtain the speaker voice features and sentence-level speaker style features from the reference audio and the style features of the text to be synthesized; the prosody module consists of a word-level style extractor, a word-level style predictor, an IPA-level style extractor, and an IPA-level style predictor, and predicts the IPA-level text style features on the basis of first predicting the word-level text style features of the input text sequence;
a speech synthesis module: consisting of a generative convolutional encoder, an attention mechanism, and a decoder, used to obtain the mel spectrogram of multilingual speech in the specified voice style;
a vocoder: used to convert the mel spectrogram output by the decoder into a sound signal.
CN202211178621.XA 2022-09-27 2022-09-27 Multi-language voice synthesis method and system based on layered prosody prediction Pending CN115547293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211178621.XA CN115547293A (en) 2022-09-27 2022-09-27 Multi-language voice synthesis method and system based on layered prosody prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211178621.XA CN115547293A (en) 2022-09-27 2022-09-27 Multi-language voice synthesis method and system based on layered prosody prediction

Publications (1)

Publication Number Publication Date
CN115547293A true CN115547293A (en) 2022-12-30

Family

ID=84729364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211178621.XA Pending CN115547293A (en) 2022-09-27 2022-09-27 Multi-language voice synthesis method and system based on layered prosody prediction

Country Status (1)

Country Link
CN (1) CN115547293A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711374A (en) * 2024-02-01 2024-03-15 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method
CN117711374B (en) * 2024-02-01 2024-05-10 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination