Disclosure of Invention
The invention aims to solve two problems of conventional multi-speaker speech synthesis methods: the reliance on supervised speaker labels, and the blending of speaker timbres when the number of speakers is large. To this end, a variational autoencoder network is introduced and its output is sampled to obtain speaker tags, yielding a multi-speaker speech synthesis method based on a variational autoencoder.
In order to achieve the above object, the present invention provides a multi-speaker speech synthesis method based on a variational autoencoder, the method comprising:
extracting a phoneme-level duration parameter and frame-level acoustic parameters from clean speech of the speaker to be synthesized and normalizing them; inputting the normalized phoneme-level duration parameter into a first variational autoencoder, which outputs a duration speaker tag; inputting the normalized frame-level acoustic parameters into a second variational autoencoder, which outputs an acoustic speaker tag;
extracting frame-level linguistic features and phoneme-level linguistic features from the speech signal to be synthesized, which contains a plurality of speakers, and normalizing them;
inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network, which outputs the predicted duration of the current phoneme;
obtaining the frame-level linguistic features of the phoneme from the predicted duration of the current phoneme, inputting these frame-level linguistic features and the acoustic speaker tag into an acoustic parameter prediction network, which outputs the normalized acoustic parameters of the predicted speech;
the normalized acoustic parameters of the predicted speech are input to the vocoder, and the synthesized speech signal is output.
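For orientation only, the synthesis-stage data flow in the steps above could be wired together as sketched below. All function and argument names are illustrative, the component networks are assumed to already be trained, and each variational autoencoder is assumed to return a (speaker tag, loss) pair, as in the encoder sketch given further below.

```python
import torch

@torch.no_grad()
def synthesize(dur_vae, acou_vae, duration_net, acoustic_net, vocoder,
               norm_phoneme_durations,    # from the target speaker's clean speech
               norm_frame_acoustics,      # from the target speaker's clean speech
               norm_phoneme_linguistics,  # from the utterance to be synthesized
               expand_to_frames):         # maps predicted durations to frame-level features
    duration_tag, _ = dur_vae(norm_phoneme_durations)   # duration speaker tag
    acoustic_tag, _ = acou_vae(norm_frame_acoustics)    # acoustic speaker tag
    durations = duration_net(duration_tag, norm_phoneme_linguistics)
    frame_feats = expand_to_frames(norm_phoneme_linguistics, durations)
    acoustics = acoustic_net(acoustic_tag, frame_feats)
    return vocoder(acoustics)                           # synthesized speech waveform
```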
As an improvement of the above method, the first variational autoencoder and the second variational autoencoder each comprise 5 one-dimensional convolutional layers, 1 long short-term memory (LSTM) layer and 1 fully-connected layer, where each convolutional layer has a kernel size of 5, a stride of 2 and 128 channels, the fully-connected layer outputs the standard deviation and mean of the predicted Gaussian distribution, the LSTM layer contains 128 neurons, and the activation function of each neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)
The input of the first/second variational autoencoder is the normalized phoneme-level duration parameter/normalized frame-level acoustic parameters, and the output is the mean and standard deviation of a Gaussian distribution. The relative entropy between the predicted encoded distribution and the true distribution is calculated as:
L_KL = D_KL(p_θ(z) ‖ q(z)) = 1/2 · Σ_{n=1}^{N} (σ_n² + u_n² − 1 − ln σ_n²)
where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th dimensions of the standard deviation σ(x) and mean u(x) of the Gaussian distribution predicted by the variational autoencoder, q(z) is the true distribution of the hidden vector, and p_θ(z) is the hidden-vector distribution predicted by the variational autoencoder; the true distribution is assumed to be the standard Gaussian distribution;
the hidden vector is obtained by the reparameterization (resampling) trick:
z_N = u(x) + σ(x)·ε_N
where z_N is the hidden vector, x is the normalized phoneme-level duration parameter/normalized frame-level acoustic parameters, and ε_N ~ N(0, I) is a vector drawn from the standard Gaussian distribution. The gradients propagated back to the encoder outputs are then:
∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N
where e_N is an N-dimensional all-ones vector. The hidden vector z_N is converted by a fully-connected layer containing 256 neurons into a 128-dimensional duration speaker tag/acoustic speaker tag, which is the output of the first/second variational autoencoder.
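For illustration, the following is a minimal PyTorch sketch of such an encoder, assuming the hyper-parameters stated above (5 Conv1d layers with kernel size 5, stride 2 and 128 channels, a 128-unit LSTM, fully-connected outputs for the Gaussian parameters, and a 256-neuron projection to a 128-dimensional speaker tag). Predicting the log-variance instead of the raw standard deviation, and summarizing the sequence by the last LSTM state, are implementation choices not stated in the source.

```python
import torch
import torch.nn as nn

class SpeakerVAEEncoder(nn.Module):
    """Sketch of the variational autoencoder described above (illustrative only)."""

    def __init__(self, in_dim, latent_dim=128, tag_dim=128):
        super().__init__()
        convs, ch = [], in_dim
        for _ in range(5):
            convs += [nn.Conv1d(ch, 128, kernel_size=5, stride=2, padding=2), nn.ReLU()]
            ch = 128
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(128, 128, batch_first=True)
        self.to_mean = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)  # log-variance for numerical stability
        # One interpretation of "a fully-connected layer containing 256 neurons"
        # mapping the hidden vector to a 128-dimensional speaker tag:
        self.to_tag = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                    nn.Linear(256, tag_dim))

    def forward(self, x):                     # x: (batch, time, in_dim)
        h = self.convs(x.transpose(1, 2))     # Conv1d expects (batch, channels, time)
        h, _ = self.lstm(h.transpose(1, 2))
        h = h[:, -1]                          # summarize the sequence by its last state
        u, logvar = self.to_mean(h), self.to_logvar(h)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)         # eps ~ N(0, I)
        z = u + sigma * eps                   # reparameterization: z = u(x) + sigma(x)*eps
        kl = 0.5 * torch.sum(sigma**2 + u**2 - 1.0 - logvar, dim=-1).mean()
        return self.to_tag(z), kl             # speaker tag and the encoder loss L_KL
```

A call such as `tag, kl = SpeakerVAEEncoder(in_dim=187)(acoustic_frames)` would yield the acoustic speaker tag together with the encoder loss L_KL.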
As an improvement of the above method, the duration prediction network comprises one fully-connected layer and one bidirectional long short-term memory (BLSTM) layer;
the input of the fully-connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted duration parameter and the real duration parameter:
Loss_MSE = (d − d̂)²
where d is the duration parameter of the real speech and d̂ is the normalized predicted duration parameter;
the fully-connected layer and the BLSTM layer each contain 256 neurons; the activation function of every neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)。
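As an illustration, a sketch of such a duration prediction network under the stated sizes (256-unit fully-connected and BLSTM layers, 128-dimensional speaker tag) is given below; the final linear projection from the BLSTM output to the 1-dimensional duration is an assumption, since the text only says the BLSTM output is the predicted duration.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Illustrative duration prediction network: one FC layer plus one BLSTM layer."""

    def __init__(self, linguistic_dim, tag_dim=128, hidden=256, out_dim=1):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(linguistic_dim + tag_dim, hidden), nn.ReLU())
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_dim)   # projection to the 1-dim duration (assumed)

    def forward(self, speaker_tag, phoneme_feats):
        # speaker_tag: (batch, tag_dim); phoneme_feats: (batch, n_phonemes, linguistic_dim)
        tag = speaker_tag.unsqueeze(1).expand(-1, phoneme_feats.size(1), -1)
        h = self.fc(torch.cat([phoneme_feats, tag], dim=-1))
        h, _ = self.blstm(h)
        return self.out(h)                          # normalized predicted durations

# Loss_MSE between predicted and real normalized durations:
# loss = torch.nn.functional.mse_loss(predicted, real_durations)
```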
as an improvement of the above method, the acoustic parameter prediction network comprises one fully-connected layer and three bidirectional long short-term memory (BLSTM) layers;
the input of the fully-connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three BLSTM layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:
Loss_MSE = Σ_j (x_j − x̂_j)²
where x_j is the j-th dimension of the acoustic parameters of the real speech and x̂_j is the j-th dimension of the normalized acoustic parameters of the predicted speech;
the fully-connected layer and the BLSTM layers each contain 256 neurons; the activation function of every neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)。
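A corresponding sketch of the acoustic parameter prediction network follows, again under the stated sizes (one 256-unit fully-connected layer, three 256-unit BLSTM layers); the final linear projection to the 187-dimensional acoustic vector is an assumption.

```python
import torch
import torch.nn as nn

class AcousticPredictor(nn.Module):
    """Illustrative acoustic parameter prediction network: one FC layer plus three BLSTM layers."""

    def __init__(self, linguistic_dim, tag_dim=128, hidden=256, acoustic_dim=187):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(linguistic_dim + tag_dim, hidden), nn.ReLU())
        self.blstm = nn.LSTM(hidden, hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, acoustic_dim)  # output projection (assumed)

    def forward(self, speaker_tag, frame_feats):
        # speaker_tag: (batch, tag_dim); frame_feats: (batch, n_frames, linguistic_dim)
        tag = speaker_tag.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        h = self.fc(torch.cat([frame_feats, tag], dim=-1))
        h, _ = self.blstm(h)
        return self.out(h)   # normalized acoustic parameters of the predicted speech
```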
as an improvement of the above method, the method further comprises: extracting frame-level acoustic parameters, phoneme-level duration parameters, frame-level linguistic features and phoneme-level linguistic features from the recorded voice signals containing a plurality of speakers, and normalizing the frame-level acoustic parameters, the phoneme-level duration parameters, the frame-level linguistic features and the phoneme-level linguistic features respectively;
the frame-level acoustic parameters include: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
the phoneme-level duration parameter comprises 1-dimensional duration information;
the frame-level linguistic features include the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level acoustic parameters and the phoneme-level duration parameter are normalized to zero mean; the frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values.
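For illustration, a simple way to build the delta-augmented frame-level acoustic vector is sketched below; the delta window is an assumption, as the text does not specify one.

```python
import numpy as np

def add_deltas(x):
    """Append first- and second-order differences to a (frames, dims) feature matrix.
    The [-0.5, 0, 0.5] delta window is a common choice; the source does not specify one."""
    pad = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (pad[2:] - pad[:-2])
    pad_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = 0.5 * (pad_d[2:] - pad_d[:-2])
    return np.concatenate([x, delta, delta2], axis=1)

# e.g. 60-dim mel-cepstrum -> 180 dims; with 1-dim F0 (3 dims with deltas),
# 1-dim aperiodicity (3 dims) and the 1-dim voiced/unvoiced flag this gives
# the 187-dimensional frame-level acoustic vector.
```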
As an improvement of the above method, the method further comprises: the step of training the first variational self-encoder and the duration parameter prediction network specifically comprises the following steps:
inputting the normalized phoneme-level duration parameter into the first variational autoencoder, computing its loss function L_KL, and outputting the duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and computing its loss function Loss_MSE;
forming the optimization function C1 as the weighted sum of the loss function of the first variational autoencoder and the loss function of the duration parameter prediction network:
C1 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function and n is the number of passes over the whole training data;
gradients are back-propagated to reduce the value of the optimization function C1 and the network parameters are updated, yielding the trained first variational autoencoder and duration parameter prediction network.
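A sketch of this joint training step follows, assuming the variational autoencoder returns both the speaker tag and its loss L_KL (as in the encoder sketch above); the exact ω(n) schedule is not reproduced here, so it is passed in as a caller-supplied function, and all names are illustrative.

```python
import torch.nn.functional as F

def train_pass(n, vae, predictor, loader, optimizer, omega):
    """One pass (the n-th) over the whole training data for C1 = Loss_MSE + omega(n)*L_KL."""
    w = omega(n)                                  # weight of the encoder loss for this pass
    for duration_params, linguistic_feats, real_durations in loader:
        speaker_tag, l_kl = vae(duration_params)  # first VAE: duration speaker tag and L_KL
        predicted = predictor(speaker_tag, linguistic_feats)
        loss_mse = F.mse_loss(predicted, real_durations)
        c1 = loss_mse + w * l_kl                  # weighted-sum optimization function C1
        optimizer.zero_grad()
        c1.backward()                             # gradient back-propagation
        optimizer.step()
```

The same loop, fed with the second variational autoencoder, the acoustic parameter prediction network, frame-level acoustic parameters and frame-level linguistic features, would implement the optimization function C2 of the next improvement.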
As an improvement of the above method, the method further comprises a step of training the second variational autoencoder and the acoustic parameter prediction network, which comprises:
inputting the normalized frame-level acoustic parameters into the second variational autoencoder, computing its loss function L_KL, and outputting the acoustic speaker tag;
inputting the acoustic speaker tag and the normalized frame-level linguistic features into the acoustic parameter prediction network and computing its loss function Loss_MSE;
forming the optimization function C2 as the weighted sum of the loss function of the second variational autoencoder and the loss function of the acoustic parameter prediction network:
C2 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function and n is the number of passes over the whole training data;
gradients are back-propagated to reduce the value of the optimization function C2 and the network parameters are updated, yielding the trained second variational autoencoder and acoustic parameter prediction network.
As an improvement of the above method, extracting and normalizing the phoneme-level duration parameter and the frame-level acoustic parameters of the clean speech of the speaker to be synthesized specifically comprises:
extracting a phoneme-level duration parameter of a clean voice of a speaker to be synthesized, wherein the phoneme-level duration parameter comprises 1-dimensional duration information;
extracting the frame-level acoustic parameters of the clean speech of the speaker to be synthesized; the frame-level acoustic parameters include: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
and the frame-level acoustic parameters and the phoneme-level duration parameter are normalized to zero mean.
As an improvement of the above method, extracting and normalizing the frame-level linguistic features and phoneme-level linguistic features of the speech signal to be synthesized, which contains a plurality of speakers, specifically comprises:
extracting frame-level linguistic features from a speech signal to be synthesized, wherein the speech signal comprises a plurality of speakers, and the frame-level linguistic features comprise 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information;
extracting phoneme-level linguistic features from the speech signal to be synthesized containing a plurality of speakers, wherein the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level linguistic features and the phoneme-level linguistic features are normalized using a maximum and a minimum value.
As an improvement of the above method, obtaining the frame-level linguistic features of the phoneme from the predicted duration specifically includes: computing the relative and absolute positions of the current frame within the predicted duration and appending them to the phoneme-level linguistic features, thereby obtaining the frame-level linguistic features of the phoneme.
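For illustration, one possible expansion of phoneme-level features to frame level is sketched below; the exact four position features are not spelled out in the text, so forward/backward frame indices and their relative counterparts are used here as a plausible choice.

```python
import numpy as np

def expand_to_frames(phoneme_feats, durations_in_frames):
    """Repeat each phoneme's linguistic feature over its predicted duration and append
    4-dimensional frame position information (an assumed choice of position features)."""
    rows = []
    for feat, dur in zip(phoneme_feats, durations_in_frames):
        dur = max(int(round(float(dur))), 1)
        for i in range(dur):
            # absolute forward/backward indices and relative forward/backward positions
            pos = [i, dur - 1 - i, i / dur, (dur - 1 - i) / dur]
            rows.append(np.concatenate([feat, pos]))
    return np.asarray(rows)
```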
The invention has the advantages that:
the invention learns the tag information of the speaker without supervision through a variational self-encoder, obtains different implicit vector distributions by selecting voices from different speakers, and synthesizes the voices of different speakers by obtaining speaker tags through sampling the implicit vectors.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings.
The invention proposes a multi-speaker speech synthesis method based on a variational autoencoder; the method comprises a training stage and a synthesis stage;
as shown in fig. 1, the training phase includes:
Step 101) extracting frame-level acoustic parameters, phoneme-level duration parameters, and frame-level and phoneme-level linguistic features from recorded speech signals of a plurality of speakers, and normalizing each of them.
The frame-level acoustic parameters have 187 dimensions and comprise: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter. The phoneme-level linguistic features have 624 dimensions and comprise 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features. The frame-level linguistic features comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information. The duration parameter comprises 1-dimensional duration information.
The frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values, calculated as:
x̂_i = (x_i − min_i) / (max_i − min_i)
where x̂_i is the normalized value of the i-th dimension feature, x_i is its value before normalization, and max_i and min_i are the maximum and minimum values of the i-th dimension feature, respectively.
The acoustic parameters and the duration parameters are normalized to zero mean, calculated as:
x̂_i = (x_i − u_i) / σ_i
where x̂_i is the normalized value of the i-th dimension feature, x_i is its value before normalization, and u_i and σ_i are the mean and standard deviation of the i-th dimension feature, respectively.
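In code, the two normalization schemes amount to the following; the small epsilon guards are an implementation safeguard not mentioned in the text.

```python
import numpy as np

def zscore_normalize(x):
    """Zero-mean normalization used for the acoustic and duration parameters."""
    u = x.mean(axis=0)
    sigma = x.std(axis=0) + 1e-8        # guard against zero deviation
    return (x - u) / sigma

def minmax_normalize(x):
    """Max-min normalization used for the frame- and phoneme-level linguistic features."""
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn + 1e-8)  # epsilon avoids division by zero
```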
Step 102) constructing a variational autoencoder network that takes the normalized frame-level acoustic parameters as input; the encoded distribution is assumed to be Gaussian, the network outputs its mean and standard deviation, and the relative entropy between the predicted distribution and the true distribution is computed and used as the encoder loss function.
The encoder contains 5 one-dimensional convolutional layers, 1 long short-term memory (LSTM) layer and 1 fully-connected layer. Each convolutional layer has a kernel size of 5, a stride of 2 and 128 channels; the fully-connected layer outputs the standard deviation and mean of the Gaussian distribution predicted by the encoder; the LSTM layer contains 128 neurons; and the activation function of each neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)
The relative entropy between the predicted encoded distribution and the true distribution is calculated as:
L_KL = D_KL(p_θ(z) ‖ q(z)) = 1/2 · Σ_{n=1}^{N} (σ_n² + u_n² − 1 − ln σ_n²)
where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th dimensions of the standard deviation σ(x) and mean u(x) of the Gaussian distribution predicted by the variational autoencoder, q(z) is the true distribution of the hidden vector, and p_θ(z) is the hidden-vector distribution predicted by the variational autoencoder; the true distribution is assumed to be the standard Gaussian distribution. The relative entropy L_KL is taken as the loss function of the encoder.
Step 103) sampling from the distribution of step 102) to obtain a hidden vector as the acoustic speaker tag;
to avoid the problem that gradients cannot be back-propagated through direct sampling, the hidden vector is obtained by the reparameterization (resampling) trick:
z_N = u(x) + σ(x)·ε_N
where N is the dimension of the Gaussian distribution, z_N is the hidden vector, x is the input normalized frame-level acoustic parameters, and ε_N ~ N(0, I) is a vector drawn from the standard Gaussian distribution. The gradients propagated back to the encoder outputs can then be calculated as:
∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N
where e_N is an N-dimensional all-ones vector;
the hidden vector z_N is converted into a 128-dimensional acoustic speaker tag by a fully-connected layer containing 256 neurons.
Step 104) constructing an acoustic parameter prediction network that takes the normalized frame-level linguistic features and the acoustic speaker tag as input and outputs the normalized acoustic parameters of the predicted speech; the mean square error between the predicted and real acoustic parameters is computed and used as the loss function of the network.
As shown in fig. 2, the acoustic parameter prediction network includes a full connection layer and three bidirectional long and short term memory layers;
the input of the full connection layer is acoustic speaker labels and normalized frame level linguistic characteristics; the output of the three-layer bidirectional long-short time memory layer is normalized acoustic parameters of predicted voice and Loss function LossMSEFor the minimum mean square error of the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:
wherein x is
jThe j-th dimension of the acoustic parameter of the real voice,
a value of j dimension of the acoustic parameter for the normalized predicted speech;
the number of the nerve cells of the connecting layer and the bidirectional long-time and short-time memory layer is 256; the activation functions of all neurons use modified linear elements, whose expression is:
f(x)=max(0,x)。
Step 105) forming an optimization function as the weighted sum of the loss function of the variational autoencoder in step 102) and the loss function of the acoustic parameter prediction network in step 104), back-propagating gradients to reduce the value of this optimization function, and updating the network parameters to obtain the trained networks;
the optimization function obtained by the weighted summation of the encoder loss function and the acoustic parameter prediction network loss function is calculated as:
C2 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function, expressed as a function of n, the number of passes over the training data.
Step 106) constructing a variational autoencoder network with the same structure as in step 102), taking the normalized phoneme-level duration parameters as input and outputting a duration speaker tag; simultaneously constructing a duration prediction network that takes the duration speaker tag and the phoneme-level linguistic features as input and outputs the predicted duration parameter;
As shown in fig. 3, the duration prediction network includes one fully-connected layer and one bidirectional long short-term memory (BLSTM) layer;
the input of the fully-connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted duration parameter and the real duration parameter:
Loss_MSE = (d − d̂)²
where d is the duration parameter of the real speech and d̂ is the normalized predicted duration parameter;
the fully-connected layer and the BLSTM layer each contain 256 neurons; the activation function of every neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)。
Step 107) training the variational autoencoder and the duration parameter prediction network of step 106):
inputting the normalized phoneme-level duration parameter into the variational autoencoder, computing its loss function L_KL, and outputting the duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and computing its loss function Loss_MSE;
forming the optimization function C1 as the weighted sum of the loss function of the variational autoencoder and the loss function of the duration parameter prediction network:
C1 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function and n is the number of passes over the whole training data;
gradients are back-propagated to reduce the value of the optimization function C1 and the network parameters are updated, yielding the trained variational autoencoder and duration parameter prediction network of step 106).
The synthesis stage comprises:
Step 201) extracting the phoneme-level duration parameter and the frame-level acoustic parameters of the clean speech of the speaker to be synthesized and normalizing them; inputting the normalized phoneme-level duration parameter into the variational autoencoder trained in step 107) and outputting a duration speaker tag; inputting the normalized frame-level acoustic parameters into the variational autoencoder trained in step 105) and outputting an acoustic speaker tag;
the phoneme-level duration parameter comprises 1-dimensional duration information; the frame-level acoustic parameters include: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter; the frame-level acoustic parameters and the phoneme-level duration parameter are normalized to zero mean.
Step 202) extracting frame-level linguistic features and phoneme-level linguistic features from a voice signal to be synthesized, wherein the voice signal comprises a plurality of speakers, and normalizing the frame-level linguistic features and the phoneme-level linguistic features;
the frame-level linguistic features include the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features; the frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values.
Step 203) inputting the time length speaker label and the normalized phoneme level linguistic feature into the time length prediction network in the step 107) and outputting the predicted time length of the current phoneme;
Step 204) obtaining the frame-level linguistic features of the current phoneme from its predicted duration, inputting these frame-level linguistic features and the acoustic speaker tag into the acoustic parameter prediction network trained in step 105), and outputting the normalized acoustic parameters of the predicted speech.
The relative and absolute positions of the current frame within the predicted duration are obtained from the predicted duration, thereby giving the frame-level linguistic features of the phoneme.
Step 205) inputs the normalized acoustic parameters of the predicted speech into the vocoder and outputs a synthesized speech signal.
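The patent does not name a specific vocoder. As a hedged example only, assuming WORLD-style acoustic parameters (mel-cepstrum, log-F0, band aperiodicity and the voiced/unvoiced flag), the final vocoding step could look like the sketch below using the pyworld and pysptk packages; the sample rate, FFT size, warping coefficient alpha and the log-F0 assumption are all illustrative choices, not statements from the source.

```python
import numpy as np
import pyworld
import pysptk

def world_synthesize(mcc, lf0, bap, vuv, fs=16000, fft_size=1024, alpha=0.42):
    """Hedged sketch of step 205 with a WORLD-style vocoder; all settings are assumptions."""
    # Voiced frames take exp(log-F0); unvoiced frames are set to 0 Hz.
    f0 = np.where(vuv.ravel() > 0.5, np.exp(lf0.ravel()), 0.0).astype(np.float64)
    # Mel-cepstrum -> spectral envelope, coded band aperiodicity -> full aperiodicity.
    sp = pysptk.mc2sp(np.ascontiguousarray(mcc, dtype=np.float64),
                      alpha=alpha, fftlen=fft_size)
    ap = pyworld.decode_aperiodicity(
        np.ascontiguousarray(bap.reshape(len(bap), -1), dtype=np.float64), fs, fft_size)
    return pyworld.synthesize(f0, sp, ap, fs, frame_period=5.0)
```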
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.