Disclosure of Invention
The invention aims to solve two problems of conventional multi-speaker speech synthesis methods: the reliance on supervised speaker labels, and the blending of speaker timbres when the number of speakers is large. To this end, a variational autoencoder network is introduced and its output is sampled to obtain speaker tags, yielding a multi-speaker speech synthesis method based on a variational autoencoder.
In order to achieve the above object, the present invention provides a multi-speaker speech synthesis method based on a variational autoencoder, the method comprising:
extracting a phoneme-level duration parameter and frame-level acoustic parameters from clean speech of the speaker to be synthesized and normalizing them; inputting the normalized phoneme-level duration parameter into a first variational autoencoder, which outputs a duration speaker tag; inputting the normalized frame-level acoustic parameters into a second variational autoencoder, which outputs an acoustic speaker tag;
extracting frame-level linguistic features and phoneme-level linguistic features from the speech signal to be synthesized, which contains a plurality of speakers, and normalizing them;
inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network, which outputs the predicted duration of the current phoneme;
obtaining the frame-level linguistic features of the phoneme from the predicted duration of the current phoneme, inputting these frame-level linguistic features and the acoustic speaker tag into an acoustic parameter prediction network, which outputs the normalized acoustic parameters of the predicted speech;
the normalized acoustic parameters of the predicted speech are input to the vocoder, and the synthesized speech signal is output.
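For orientation only, the synthesis-stage data flow in the steps above could be wired together as sketched below. All function and argument names are illustrative, the component networks are assumed to already be trained, and each variational autoencoder is assumed to return a (speaker tag, loss) pair, as in the encoder sketch given further below.

```python
import torch

@torch.no_grad()
def synthesize(dur_vae, acou_vae, duration_net, acoustic_net, vocoder,
               norm_phoneme_durations,    # from the target speaker's clean speech
               norm_frame_acoustics,      # from the target speaker's clean speech
               norm_phoneme_linguistics,  # from the utterance to be synthesized
               expand_to_frames):         # maps predicted durations to frame-level features
    duration_tag, _ = dur_vae(norm_phoneme_durations)   # duration speaker tag
    acoustic_tag, _ = acou_vae(norm_frame_acoustics)    # acoustic speaker tag
    durations = duration_net(duration_tag, norm_phoneme_linguistics)
    frame_feats = expand_to_frames(norm_phoneme_linguistics, durations)
    acoustics = acoustic_net(acoustic_tag, frame_feats)
    return vocoder(acoustics)                           # synthesized speech waveform
```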
As an improvement of the above method, the first variational autoencoder and the second variational autoencoder each comprise 5 one-dimensional convolutional layers, 1 long short-term memory (LSTM) layer and 1 fully-connected layer, where each convolutional layer has a kernel size of 5, a stride of 2 and 128 channels, the fully-connected layer outputs the standard deviation and mean of the predicted Gaussian distribution, the LSTM layer contains 128 neurons, and the activation function of each neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)
The input of the first/second variational autoencoder is the normalized phoneme-level duration parameter/normalized frame-level acoustic parameters, and the output is the mean and standard deviation of a Gaussian distribution. The relative entropy between the predicted encoded distribution and the true distribution is calculated as:
L_KL = D_KL(p_θ(z) ‖ q(z)) = 1/2 · Σ_{n=1}^{N} (σ_n² + u_n² − 1 − ln σ_n²)
where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th dimensions of the standard deviation σ(x) and mean u(x) of the Gaussian distribution predicted by the variational autoencoder, q(z) is the true distribution of the hidden vector, and p_θ(z) is the hidden-vector distribution predicted by the variational autoencoder; the true distribution is assumed to be the standard Gaussian distribution;
the hidden vector is obtained by the reparameterization (resampling) trick:
z_N = u(x) + σ(x)·ε_N
where z_N is the hidden vector, x is the normalized phoneme-level duration parameter/normalized frame-level acoustic parameters, and ε_N ~ N(0, I) is a vector drawn from the standard Gaussian distribution. The gradients propagated back to the encoder outputs are then:
∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N
where e_N is an N-dimensional all-ones vector. The hidden vector z_N is converted by a fully-connected layer containing 256 neurons into a 128-dimensional duration speaker tag/acoustic speaker tag, which is the output of the first/second variational autoencoder.
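For illustration, the following is a minimal PyTorch sketch of such an encoder, assuming the hyper-parameters stated above (5 Conv1d layers with kernel size 5, stride 2 and 128 channels, a 128-unit LSTM, fully-connected outputs for the Gaussian parameters, and a 256-neuron projection to a 128-dimensional speaker tag). Predicting the log-variance instead of the raw standard deviation, and summarizing the sequence by the last LSTM state, are implementation choices not stated in the source.

```python
import torch
import torch.nn as nn

class SpeakerVAEEncoder(nn.Module):
    """Sketch of the variational autoencoder described above (illustrative only)."""

    def __init__(self, in_dim, latent_dim=128, tag_dim=128):
        super().__init__()
        convs, ch = [], in_dim
        for _ in range(5):
            convs += [nn.Conv1d(ch, 128, kernel_size=5, stride=2, padding=2), nn.ReLU()]
            ch = 128
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(128, 128, batch_first=True)
        self.to_mean = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)  # log-variance for numerical stability
        # One interpretation of "a fully-connected layer containing 256 neurons"
        # mapping the hidden vector to a 128-dimensional speaker tag:
        self.to_tag = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                    nn.Linear(256, tag_dim))

    def forward(self, x):                     # x: (batch, time, in_dim)
        h = self.convs(x.transpose(1, 2))     # Conv1d expects (batch, channels, time)
        h, _ = self.lstm(h.transpose(1, 2))
        h = h[:, -1]                          # summarize the sequence by its last state
        u, logvar = self.to_mean(h), self.to_logvar(h)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)         # eps ~ N(0, I)
        z = u + sigma * eps                   # reparameterization: z = u(x) + sigma(x)*eps
        kl = 0.5 * torch.sum(sigma**2 + u**2 - 1.0 - logvar, dim=-1).mean()
        return self.to_tag(z), kl             # speaker tag and the encoder loss L_KL
```

A call such as `tag, kl = SpeakerVAEEncoder(in_dim=187)(acoustic_frames)` would yield the acoustic speaker tag together with the encoder loss L_KL.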
As an improvement of the above method, the duration prediction network comprises one fully-connected layer and one bidirectional long short-term memory (BLSTM) layer;
the input of the fully-connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted duration parameter and the real duration parameter:
Loss_MSE = (d − d̂)²
where d is the duration parameter of the real speech and d̂ is the normalized predicted duration parameter;
the fully-connected layer and the BLSTM layer each contain 256 neurons; the activation function of every neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)。
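As an illustration, a sketch of such a duration prediction network under the stated sizes (256-unit fully-connected and BLSTM layers, 128-dimensional speaker tag) is given below; the final linear projection from the BLSTM output to the 1-dimensional duration is an assumption, since the text only says the BLSTM output is the predicted duration.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Illustrative duration prediction network: one FC layer plus one BLSTM layer."""

    def __init__(self, linguistic_dim, tag_dim=128, hidden=256, out_dim=1):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(linguistic_dim + tag_dim, hidden), nn.ReLU())
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_dim)   # projection to the 1-dim duration (assumed)

    def forward(self, speaker_tag, phoneme_feats):
        # speaker_tag: (batch, tag_dim); phoneme_feats: (batch, n_phonemes, linguistic_dim)
        tag = speaker_tag.unsqueeze(1).expand(-1, phoneme_feats.size(1), -1)
        h = self.fc(torch.cat([phoneme_feats, tag], dim=-1))
        h, _ = self.blstm(h)
        return self.out(h)                          # normalized predicted durations

# Loss_MSE between predicted and real normalized durations:
# loss = torch.nn.functional.mse_loss(predicted, real_durations)
```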
as an improvement of the above method, the acoustic parameter prediction network comprises one fully-connected layer and three bidirectional long short-term memory (BLSTM) layers;
the input of the fully-connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three BLSTM layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:
Loss_MSE = Σ_j (x_j − x̂_j)²
where x_j is the j-th dimension of the acoustic parameters of the real speech and x̂_j is the j-th dimension of the normalized acoustic parameters of the predicted speech;
the fully-connected layer and the BLSTM layers each contain 256 neurons; the activation function of every neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)。
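A corresponding sketch of the acoustic parameter prediction network follows, again under the stated sizes (one 256-unit fully-connected layer, three 256-unit BLSTM layers); the final linear projection to the 187-dimensional acoustic vector is an assumption.

```python
import torch
import torch.nn as nn

class AcousticPredictor(nn.Module):
    """Illustrative acoustic parameter prediction network: one FC layer plus three BLSTM layers."""

    def __init__(self, linguistic_dim, tag_dim=128, hidden=256, acoustic_dim=187):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(linguistic_dim + tag_dim, hidden), nn.ReLU())
        self.blstm = nn.LSTM(hidden, hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, acoustic_dim)  # output projection (assumed)

    def forward(self, speaker_tag, frame_feats):
        # speaker_tag: (batch, tag_dim); frame_feats: (batch, n_frames, linguistic_dim)
        tag = speaker_tag.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        h = self.fc(torch.cat([frame_feats, tag], dim=-1))
        h, _ = self.blstm(h)
        return self.out(h)   # normalized acoustic parameters of the predicted speech
```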
as an improvement of the above method, the method further comprises: extracting frame-level acoustic parameters, phoneme-level duration parameters, frame-level linguistic features and phoneme-level linguistic features from the recorded voice signals containing a plurality of speakers, and normalizing the frame-level acoustic parameters, the phoneme-level duration parameters, the frame-level linguistic features and the phoneme-level linguistic features respectively;
the frame-level acoustic parameters include: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
the phoneme-level duration parameter comprises 1-dimensional duration information;
the frame-level linguistic features include the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level acoustic parameters and the phoneme-level duration parameter are normalized to zero mean; the frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values.
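For illustration, a simple way to build the delta-augmented frame-level acoustic vector is sketched below; the delta window is an assumption, as the text does not specify one.

```python
import numpy as np

def add_deltas(x):
    """Append first- and second-order differences to a (frames, dims) feature matrix.
    The [-0.5, 0, 0.5] delta window is a common choice; the source does not specify one."""
    pad = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (pad[2:] - pad[:-2])
    pad_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = 0.5 * (pad_d[2:] - pad_d[:-2])
    return np.concatenate([x, delta, delta2], axis=1)

# e.g. 60-dim mel-cepstrum -> 180 dims; with 1-dim F0 (3 dims with deltas),
# 1-dim aperiodicity (3 dims) and the 1-dim voiced/unvoiced flag this gives
# the 187-dimensional frame-level acoustic vector.
```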
As an improvement of the above method, the method further comprises: the step of training the first variational self-encoder and the duration parameter prediction network specifically comprises the following steps:
inputting the normalized phoneme-level duration parameter into the first variational autoencoder, computing its loss function L_KL, and outputting the duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and computing its loss function Loss_MSE;
forming the optimization function C1 as the weighted sum of the loss function of the first variational autoencoder and the loss function of the duration parameter prediction network:
C1 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function and n is the number of passes over the whole training data;
gradients are back-propagated to reduce the value of the optimization function C1 and the network parameters are updated, yielding the trained first variational autoencoder and duration parameter prediction network.
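A sketch of this joint training step follows, assuming the variational autoencoder returns both the speaker tag and its loss L_KL (as in the encoder sketch above); the exact ω(n) schedule is not reproduced here, so it is passed in as a caller-supplied function, and all names are illustrative.

```python
import torch.nn.functional as F

def train_pass(n, vae, predictor, loader, optimizer, omega):
    """One pass (the n-th) over the whole training data for C1 = Loss_MSE + omega(n)*L_KL."""
    w = omega(n)                                  # weight of the encoder loss for this pass
    for duration_params, linguistic_feats, real_durations in loader:
        speaker_tag, l_kl = vae(duration_params)  # first VAE: duration speaker tag and L_KL
        predicted = predictor(speaker_tag, linguistic_feats)
        loss_mse = F.mse_loss(predicted, real_durations)
        c1 = loss_mse + w * l_kl                  # weighted-sum optimization function C1
        optimizer.zero_grad()
        c1.backward()                             # gradient back-propagation
        optimizer.step()
```

The same loop, fed with the second variational autoencoder, the acoustic parameter prediction network, frame-level acoustic parameters and frame-level linguistic features, would implement the optimization function C2 of the next improvement.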
As an improvement of the above method, the method further comprises a step of training the second variational autoencoder and the acoustic parameter prediction network, which comprises:
inputting the normalized frame-level acoustic parameters into the second variational autoencoder, computing its loss function L_KL, and outputting the acoustic speaker tag;
inputting the acoustic speaker tag and the normalized frame-level linguistic features into the acoustic parameter prediction network and computing its loss function Loss_MSE;
forming the optimization function C2 as the weighted sum of the loss function of the second variational autoencoder and the loss function of the acoustic parameter prediction network:
C2 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function and n is the number of passes over the whole training data;
gradients are back-propagated to reduce the value of the optimization function C2 and the network parameters are updated, yielding the trained second variational autoencoder and acoustic parameter prediction network.
As an improvement of the above method, extracting and normalizing the phoneme-level duration parameter and the frame-level acoustic parameters of the clean speech of the speaker to be synthesized specifically comprises:
extracting a phoneme-level duration parameter of a clean voice of a speaker to be synthesized, wherein the phoneme-level duration parameter comprises 1-dimensional duration information;
extracting the frame-level acoustic parameters of the clean speech of the speaker to be synthesized; the frame-level acoustic parameters include: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
and the frame-level acoustic parameters and the phoneme-level duration parameter are normalized to zero mean.
As an improvement of the above method, extracting and normalizing the frame-level linguistic features and phoneme-level linguistic features of the speech signal to be synthesized, which contains a plurality of speakers, specifically comprises:
extracting frame-level linguistic features from a speech signal to be synthesized, wherein the speech signal comprises a plurality of speakers, and the frame-level linguistic features comprise 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information;
extracting phoneme-level linguistic features from the speech signal to be synthesized containing a plurality of speakers, wherein the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level linguistic features and the phoneme-level linguistic features are normalized using a maximum and a minimum value.
As an improvement of the above method, obtaining the frame-level linguistic features of the phoneme from the predicted duration specifically includes: computing the relative and absolute positions of the current frame within the predicted duration and appending them to the phoneme-level linguistic features, thereby obtaining the frame-level linguistic features of the phoneme.
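For illustration, one possible expansion of phoneme-level features to frame level is sketched below; the exact four position features are not spelled out in the text, so forward/backward frame indices and their relative counterparts are used here as a plausible choice.

```python
import numpy as np

def expand_to_frames(phoneme_feats, durations_in_frames):
    """Repeat each phoneme's linguistic feature over its predicted duration and append
    4-dimensional frame position information (an assumed choice of position features)."""
    rows = []
    for feat, dur in zip(phoneme_feats, durations_in_frames):
        dur = max(int(round(float(dur))), 1)
        for i in range(dur):
            # absolute forward/backward indices and relative forward/backward positions
            pos = [i, dur - 1 - i, i / dur, (dur - 1 - i) / dur]
            rows.append(np.concatenate([feat, pos]))
    return np.asarray(rows)
```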
The invention has the advantages that:
the invention learns the tag information of the speaker without supervision through a variational self-encoder, obtains different implicit vector distributions by selecting voices from different speakers, and synthesizes the voices of different speakers by obtaining speaker tags through sampling the implicit vectors.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings.
The invention proposes a multi-speaker speech synthesis method based on a variational autoencoder; the method comprises a training stage and a synthesis stage;
as shown in fig. 1, the training phase includes:
Step 101) extracting frame-level acoustic parameters, phoneme-level duration parameters, and frame-level and phoneme-level linguistic features from recorded speech signals of a plurality of speakers, and normalizing each of them.
The frame-level acoustic parameters have 187 dimensions and comprise: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter. The phoneme-level linguistic features have 624 dimensions and comprise 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features. The frame-level linguistic features comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information. The duration parameter comprises 1-dimensional duration information.
The frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values, calculated as:
x̂_i = (x_i − min_i) / (max_i − min_i)
where x̂_i is the normalized value of the i-th dimension feature, x_i is its value before normalization, and max_i and min_i are the maximum and minimum values of the i-th dimension feature, respectively.
The acoustic parameters and the duration parameters are normalized to zero mean, calculated as:
x̂_i = (x_i − u_i) / σ_i
where x̂_i is the normalized value of the i-th dimension feature, x_i is its value before normalization, and u_i and σ_i are the mean and standard deviation of the i-th dimension feature, respectively.
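In code, the two normalization schemes amount to the following; the small epsilon guards are an implementation safeguard not mentioned in the text.

```python
import numpy as np

def zscore_normalize(x):
    """Zero-mean normalization used for the acoustic and duration parameters."""
    u = x.mean(axis=0)
    sigma = x.std(axis=0) + 1e-8        # guard against zero deviation
    return (x - u) / sigma

def minmax_normalize(x):
    """Max-min normalization used for the frame- and phoneme-level linguistic features."""
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn + 1e-8)  # epsilon avoids division by zero
```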
Step 102) constructing a variational autoencoder network that takes the normalized frame-level acoustic parameters as input; the encoded distribution is assumed to be Gaussian, the network outputs its mean and standard deviation, and the relative entropy between the predicted distribution and the true distribution is computed and used as the encoder loss function.
The encoder contains 5 one-dimensional convolutional layers, 1 long short-term memory (LSTM) layer and 1 fully-connected layer. Each convolutional layer has a kernel size of 5, a stride of 2 and 128 channels; the fully-connected layer outputs the standard deviation and mean of the Gaussian distribution predicted by the encoder; the LSTM layer contains 128 neurons; and the activation function of each neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)
The relative entropy between the predicted encoded distribution and the true distribution is calculated as:
L_KL = D_KL(p_θ(z) ‖ q(z)) = 1/2 · Σ_{n=1}^{N} (σ_n² + u_n² − 1 − ln σ_n²)
where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th dimensions of the standard deviation σ(x) and mean u(x) of the Gaussian distribution predicted by the variational autoencoder, q(z) is the true distribution of the hidden vector, and p_θ(z) is the hidden-vector distribution predicted by the variational autoencoder; the true distribution is assumed to be the standard Gaussian distribution. The relative entropy L_KL is taken as the loss function of the encoder.
Step 103) sampling from the distribution of step 102) to obtain a hidden vector as the acoustic speaker tag;
to avoid the problem that gradients cannot be back-propagated through direct sampling, the hidden vector is obtained by the reparameterization (resampling) trick:
z_N = u(x) + σ(x)·ε_N
where N is the dimension of the Gaussian distribution, z_N is the hidden vector, x is the input normalized frame-level acoustic parameters, and ε_N ~ N(0, I) is a vector drawn from the standard Gaussian distribution. The gradients propagated back to the encoder outputs can then be calculated as:
∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N
where e_N is an N-dimensional all-ones vector;
the hidden vector z_N is converted into a 128-dimensional acoustic speaker tag by a fully-connected layer containing 256 neurons.
Step 104) constructing an acoustic parameter prediction network that takes the normalized frame-level linguistic features and the acoustic speaker tag as input and outputs the normalized acoustic parameters of the predicted speech; the mean square error between the predicted and real acoustic parameters is computed and used as the loss function of the network.
As shown in fig. 2, the acoustic parameter prediction network includes a full connection layer and three bidirectional long and short term memory layers;
the input of the full connection layer is acoustic speaker labels and normalized frame level linguistic characteristics; the output of the three-layer bidirectional long-short time memory layer is normalized acoustic parameters of predicted voice and Loss function LossMSEFor the minimum mean square error of the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:
wherein x is
jThe j-th dimension of the acoustic parameter of the real voice,
a value of j dimension of the acoustic parameter for the normalized predicted speech;
the number of the nerve cells of the connecting layer and the bidirectional long-time and short-time memory layer is 256; the activation functions of all neurons use modified linear elements, whose expression is:
f(x)=max(0,x)。
Step 105) forming an optimization function as the weighted sum of the loss function of the variational autoencoder in step 102) and the loss function of the acoustic parameter prediction network in step 104), back-propagating gradients to reduce the value of this optimization function, and updating the network parameters to obtain the trained networks;
the optimization function obtained by the weighted summation of the encoder loss function and the acoustic parameter prediction network loss function is calculated as:
C2 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function, expressed as a function of n, the number of passes over the training data.
Step 106) constructing a variational autoencoder network with the same structure as in step 102), taking the normalized phoneme-level duration parameters as input and outputting a duration speaker tag; simultaneously constructing a duration prediction network that takes the duration speaker tag and the phoneme-level linguistic features as input and outputs the predicted duration parameter;
As shown in fig. 3, the duration prediction network includes one fully-connected layer and one bidirectional long short-term memory (BLSTM) layer;
the input of the fully-connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted duration parameter and the real duration parameter:
Loss_MSE = (d − d̂)²
where d is the duration parameter of the real speech and d̂ is the normalized predicted duration parameter;
the fully-connected layer and the BLSTM layer each contain 256 neurons; the activation function of every neuron is the rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)。
Step 107) training the variational autoencoder and the duration parameter prediction network of step 106):
inputting the normalized phoneme-level duration parameter into the variational autoencoder, computing its loss function L_KL, and outputting the duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and computing its loss function Loss_MSE;
forming the optimization function C1 as the weighted sum of the loss function of the variational autoencoder and the loss function of the duration parameter prediction network:
C1 = Loss_MSE + ω(n)·L_KL
where ω(n) is the weight of the encoder loss function and n is the number of passes over the whole training data;
gradients are back-propagated to reduce the value of the optimization function C1 and the network parameters are updated, yielding the trained variational autoencoder and duration parameter prediction network of step 106).
The synthesis stage comprises:
Step 201) extracting the phoneme-level duration parameter and the frame-level acoustic parameters of the clean speech of the speaker to be synthesized and normalizing them; inputting the normalized phoneme-level duration parameter into the variational autoencoder trained in step 107) and outputting a duration speaker tag; inputting the normalized frame-level acoustic parameters into the variational autoencoder trained in step 105) and outputting an acoustic speaker tag;
the phoneme-level duration parameter comprises 1-dimensional duration information; the frame-level acoustic parameters include: 60-dimensional mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter; the frame-level acoustic parameters and the phoneme-level duration parameter are normalized to zero mean.
Step 202) extracting frame-level linguistic features and phoneme-level linguistic features from a voice signal to be synthesized, wherein the voice signal comprises a plurality of speakers, and normalizing the frame-level linguistic features and the phoneme-level linguistic features;
the frame-level linguistic features include the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features; the frame-level and phoneme-level linguistic features are normalized using their maximum and minimum values.
Step 203) inputting the time length speaker label and the normalized phoneme level linguistic feature into the time length prediction network in the step 107) and outputting the predicted time length of the current phoneme;
Step 204) obtaining the frame-level linguistic features of the current phoneme from its predicted duration, inputting these frame-level linguistic features and the acoustic speaker tag into the acoustic parameter prediction network trained in step 105), and outputting the normalized acoustic parameters of the predicted speech.
The relative and absolute positions of the current frame within the predicted duration are obtained from the predicted duration, thereby giving the frame-level linguistic features of the phoneme.
Step 205) inputs the normalized acoustic parameters of the predicted speech into the vocoder and outputs a synthesized speech signal.
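The patent does not name a specific vocoder. As a hedged example only, assuming WORLD-style acoustic parameters (mel-cepstrum, log-F0, band aperiodicity and the voiced/unvoiced flag), the final vocoding step could look like the sketch below using the pyworld and pysptk packages; the sample rate, FFT size, warping coefficient alpha and the log-F0 assumption are all illustrative choices, not statements from the source.

```python
import numpy as np
import pyworld
import pysptk

def world_synthesize(mcc, lf0, bap, vuv, fs=16000, fft_size=1024, alpha=0.42):
    """Hedged sketch of step 205 with a WORLD-style vocoder; all settings are assumptions."""
    # Voiced frames take exp(log-F0); unvoiced frames are set to 0 Hz.
    f0 = np.where(vuv.ravel() > 0.5, np.exp(lf0.ravel()), 0.0).astype(np.float64)
    # Mel-cepstrum -> spectral envelope, coded band aperiodicity -> full aperiodicity.
    sp = pysptk.mc2sp(np.ascontiguousarray(mcc, dtype=np.float64),
                      alpha=alpha, fftlen=fft_size)
    ap = pyworld.decode_aperiodicity(
        np.ascontiguousarray(bap.reshape(len(bap), -1), dtype=np.float64), fs, fft_size)
    return pyworld.synthesize(f0, sp, ap, fs, frame_period=5.0)
```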
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.