CN112289304A - Multi-speaker voice synthesis method based on variational self-encoder - Google Patents

Multi-speaker voice synthesis method based on variational self-encoder

Info

Publication number
CN112289304A
Authority
CN
China
Prior art keywords
level
phoneme
frame
parameter
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910671050.5A
Other languages
Chinese (zh)
Other versions
CN112289304B (en)
Inventor
张鹏远 (Zhang Pengyuan)
蒿晓阳 (Hao Xiaoyang)
颜永红 (Yan Yonghong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201910671050.5A priority Critical patent/CN112289304B/en
Publication of CN112289304A publication Critical patent/CN112289304A/en
Application granted granted Critical
Publication of CN112289304B publication Critical patent/CN112289304B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a multi-speaker voice synthesis method based on a variational self-encoder, which comprises the following steps: extracting the phoneme-level duration parameters and frame-level acoustic parameters of the clean speech of the speaker to be synthesized, inputting the normalized phoneme-level duration parameters into a first variational self-encoder, and outputting a duration speaker tag; inputting the normalized frame-level acoustic parameters into a second variational self-encoder, and outputting an acoustic speaker tag; extracting frame-level linguistic features and phoneme-level linguistic features from the speech signal to be synthesized, which contains a plurality of speakers; inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network, and outputting the predicted duration of the current phoneme; obtaining the frame-level linguistic features of the phoneme from the predicted duration, inputting these frame-level linguistic features and the acoustic speaker tag into an acoustic parameter prediction network, and outputting the normalized acoustic parameters of the predicted speech; and inputting the normalized acoustic parameters of the predicted speech into a vocoder and outputting the synthesized speech signal.

Description

Multi-speaker voice synthesis method based on variational self-encoder
Technical Field
The invention relates to a voice synthesis method, in particular to a multi-speaker voice synthesis method based on a variational self-encoder.
Background
Speech synthesis is a key technology for converting input text into speech and an important research topic in the field of human-computer interaction.
Traditional speech synthesis algorithms require recording a corpus from a single speaker with comprehensive phoneme coverage so that speech can be synthesized for arbitrary text; this entails high recording cost and low efficiency, and only a single speaker's voice can be synthesized. Multi-speaker voice synthesis supports recording the voices of different speakers in parallel and can synthesize speech from different speakers. Traditional multi-speaker speech synthesis usually requires knowing the speaker identity of the current utterance and manually assigning a speaker tag, such as a one-hot encoding of the speaker; this is supervised learning, and when the number of speakers is large the synthesized speech often exhibits timbre overlap among several speakers. The present method introduces a variational self-encoder network and obtains the speaker tag by sampling from the network's output.
Disclosure of Invention
The invention aims to overcome the reliance on supervised speaker labels and the timbre overlap among speakers that occurs in traditional multi-speaker voice synthesis when the number of speakers is large. It therefore provides a multi-speaker voice synthesis method based on a variational self-encoder, which introduces a variational self-encoder network and obtains the speaker tags by sampling the network's output.
In order to achieve the above object, the present invention provides a multi-speaker voice synthesis method based on a variational self-encoder, the method comprising:
extracting a phoneme-level duration parameter and a frame-level acoustic parameter of a clean voice of a speaker to be synthesized, normalizing the phoneme-level duration parameter, inputting the normalized phoneme-level duration parameter into a first variational self-encoder, and outputting a duration speaker tag; inputting the normalized frame level acoustic parameters into a second variational self-encoder, and outputting an acoustic speaker tag;
extracting frame-level linguistic features and phoneme-level linguistic features from a voice signal to be synthesized, wherein the voice signal comprises a plurality of speakers, and normalizing the frame-level linguistic features and the phoneme-level linguistic features;
inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network, and outputting the predicted duration of the current phoneme;
obtaining the frame-level linguistic characteristics of the phoneme through the predicted duration of the current phoneme, inputting the frame-level linguistic characteristics of the phoneme and the acoustic speaker tag into an acoustic parameter prediction network, and outputting normalized acoustic parameters of predicted speech;
the normalized acoustic parameters of the predicted speech are input to the vocoder, and the synthesized speech signal is output.
As an improvement of the above method, the first variational self-encoder/the second variational self-encoder includes 5 one-dimensional convolutional layers, 1 long short-term memory layer and 1 fully-connected layer, where the convolution kernel size of the convolutional layers is 5, the stride is 2 and the number of kernels is 128; the fully-connected layer outputs the standard deviation and mean of the predicted Gaussian distribution; the long short-term memory layer includes 128 neurons; and the activation function of each neuron is a rectified linear unit (ReLU), whose expression is:
f(x)=max(0,x)
the input of the first variational auto-encoder/the second variational auto-encoder is the normalized phoneme-level duration parameters/the normalized frame-level acoustic parameters, and the output is the mean and standard deviation of a Gaussian distribution; the relative entropy between the predicted encoding distribution and the true distribution is calculated as:

L_KL = D_KL(q_φ(z|x) || p_θ(z)) = (1/2) Σ_{n=1}^{N} (u_n^2 + σ_n^2 - log σ_n^2 - 1)

where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th components of the standard deviation σ(x) and the mean u(x) of the Gaussian distribution predicted by the variational auto-encoder, q_φ(z|x) is the hidden-vector distribution predicted by the variational auto-encoder, and p_θ(z) is the true hidden-vector distribution, which is assumed to be a standard Gaussian distribution;
the hidden vector is obtained by reparameterized sampling:

z_N = u(x) + σ(x)·ε_N

where z_N is the hidden vector, x is the normalized phoneme-level duration parameters/the normalized frame-level acoustic parameters, and ε_N ~ N(0, I) is a vector sampled from a standard Gaussian distribution; the gradients back-propagated through the output of the variational auto-encoder are calculated as:

∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N

where e_N is an N-dimensional all-ones vector; the hidden vector z_N is converted by a fully-connected layer containing 256 neurons into a 128-dimensional duration speaker tag/acoustic speaker tag, which is the output of the first variational auto-encoder/the second variational auto-encoder.
As an improvement of the above method, the duration prediction network comprises a fully-connected layer and a bidirectional long short-term memory (BiLSTM) layer;
the input of the fully-connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BiLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted duration parameter and the duration parameter of the real speech:

Loss_MSE = (d - d̂)^2

where d is the duration parameter of the real speech and d̂ is the duration parameter of the normalized predicted speech;
the fully-connected layer and the BiLSTM layer each contain 256 neurons; the activation function of every neuron is a rectified linear unit, whose expression is:
f(x)=max(0,x)。
As an improvement of the above method, the acoustic parameter prediction network comprises a fully-connected layer and three bidirectional long short-term memory (BiLSTM) layers;
the input of the fully-connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three BiLSTM layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:

Loss_MSE = Σ_j (x_j - x̂_j)^2

where x_j is the value of the j-th dimension of the acoustic parameters of the real speech and x̂_j is the value of the j-th dimension of the acoustic parameters of the normalized predicted speech;
the fully-connected layer and the BiLSTM layers each contain 256 neurons; the activation function of every neuron is a rectified linear unit, whose expression is:
f(x)=max(0,x)。
as an improvement of the above method, the method further comprises: extracting frame-level acoustic parameters, phoneme-level duration parameters, frame-level linguistic features and phoneme-level linguistic features from the recorded voice signals containing a plurality of speakers, and normalizing the frame-level acoustic parameters, the phoneme-level duration parameters, the frame-level linguistic features and the phoneme-level linguistic features respectively;
the frame-level acoustic parameters include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
the phoneme-level duration parameter comprises 1-dimensional duration information;
the frame-level linguistic features include the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level acoustic parameters and the phoneme-level duration parameters are zero-mean normalized; the frame-level linguistic features and the phoneme-level linguistic features are normalized using the maximum and minimum values.
As an improvement of the above method, the method further comprises a step of training the first variational self-encoder and the duration parameter prediction network, which specifically comprises the following sub-steps:
inputting the normalized phoneme-level duration parameters into the first variational self-encoder, calculating its loss function L_KL, and outputting a duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and calculating its loss function Loss_MSE;
obtaining an optimization function C1 as the weighted sum of the loss function of the first variational self-encoder and the loss function of the duration parameter prediction network:

C1 = Loss_MSE + ω(n)·L_KL

where ω(n) is the weight of the encoder loss function and n is the number of iterations over the whole training data;
performing gradient back-propagation to reduce the value of the optimization function C1 and updating the network parameters to obtain the trained first variational self-encoder and duration parameter prediction network.
As an improvement of the above method, the method further comprises a step of training the second variational self-encoder and the acoustic parameter prediction network, which specifically comprises the following sub-steps:
inputting the normalized frame-level acoustic parameters into the second variational self-encoder, calculating its loss function L_KL, and outputting an acoustic speaker tag;
inputting the acoustic speaker tag and the normalized frame-level linguistic features into the acoustic parameter prediction network and calculating its loss function Loss_MSE;
obtaining an optimization function C2 as the weighted sum of the loss function of the second variational self-encoder and the loss function of the acoustic parameter prediction network:

C2 = Loss_MSE + ω(n)·L_KL

where ω(n) is the weight of the encoder loss function and n is the number of iterations over the whole training data;
performing gradient back-propagation to reduce the value of the optimization function C2 and updating the network parameters to obtain the trained second variational self-encoder and acoustic parameter prediction network.
As an improvement of the above method, the extracting and normalizing the phoneme-level duration parameter and the frame-level acoustic parameter of the clean speech of the speaker to be synthesized specifically includes:
extracting a phoneme-level duration parameter of a clean voice of a speaker to be synthesized, wherein the phoneme-level duration parameter comprises 1-dimensional duration information;
extracting the frame-level acoustic parameters of the clean speech of the speaker to be synthesized; the frame-level acoustic parameters include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
the frame-level acoustic parameters and the phoneme-level duration parameters are zero-mean normalized.
As an improvement of the above method, extracting and normalizing the frame-level linguistic features and phoneme-level linguistic features of the speech signal to be synthesized containing multiple speakers specifically comprises:
extracting frame-level linguistic features from the speech signal to be synthesized containing multiple speakers, wherein the frame-level linguistic features comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information;
extracting phoneme-level linguistic features from the speech signal to be synthesized containing multiple speakers, wherein the phoneme-level linguistic features comprise 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level linguistic features and the phoneme-level linguistic features are normalized using the maximum and minimum values.
As an improvement of the above method, obtaining the frame-level linguistic features of the phoneme from the predicted duration specifically includes: obtaining the relative position and the absolute position of the current frame within the predicted duration, and thereby obtaining the frame-level linguistic features of the phoneme.
The invention has the advantages that:
the invention learns the tag information of the speaker without supervision through a variational self-encoder, obtains different implicit vector distributions by selecting voices from different speakers, and synthesizes the voices of different speakers by obtaining speaker tags through sampling the implicit vectors.
Drawings
FIG. 1 is a flow chart of the method for multi-speaker speech synthesis based on variational auto-encoders of the present invention;
FIG. 2 is a block diagram of a variational self-encoder and acoustic parameter prediction network of the present invention;
FIG. 3 is a block diagram of the variational self-encoder and duration parameter prediction network of the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings.
The invention proposes a multi-speaker speech synthesis method based on a variational self-encoder; the method comprises a training stage and a synthesis stage.
as shown in fig. 1, the training phase includes:
step 101) extracting acoustic parameters at a frame level, duration parameters at a phoneme level, and linguistic characteristics at the frame level and the phoneme level from recorded voice signals containing a plurality of speakers, and normalizing the acoustic parameters, the duration parameters, the frame level and the linguistic characteristics at the phoneme level respectively.
The frame-level acoustic parameters have 187 dimensions, comprising: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter. The phoneme-level linguistic features have 624 dimensions, comprising 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features. The frame-level linguistic features consist of the 624-dimensional phoneme-level linguistic features plus 4-dimensional frame position information. The duration parameter comprises 1-dimensional duration information.
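To make the dimension bookkeeping above concrete (60 × 3 for the Mel-cepstrum and its differences, 1 × 3 for the fundamental frequency, 1 × 3 for the aperiodicity parameter, plus the 1-dimensional voiced/unvoiced flag, giving 187 dimensions per frame), the following Python sketch assembles such a frame-level acoustic vector. The placeholder random data and the simple finite-difference deltas are illustrative assumptions, not the exact computation used in the patent.

```python
import numpy as np

def with_deltas(x):
    """Append first- and second-order differences along the time axis
    (simple finite differences; the patent does not specify the delta window)."""
    d1 = np.diff(x, axis=0, prepend=x[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return np.concatenate([x, d1, d2], axis=-1)

frames = 500                                                 # placeholder utterance length
mcep = np.random.randn(frames, 60)                           # 60-dim Mel-cepstral coefficients
lf0 = np.random.randn(frames, 1)                             # 1-dim fundamental frequency parameter
bap = np.random.randn(frames, 1)                             # 1-dim aperiodicity parameter
vuv = np.random.randint(0, 2, (frames, 1)).astype(float)     # 1-dim voiced/unvoiced decision

acoustic = np.concatenate([with_deltas(mcep), with_deltas(lf0), with_deltas(bap), vuv], axis=-1)
assert acoustic.shape[1] == 60 * 3 + 1 * 3 + 1 * 3 + 1       # 187 dimensions per frame
```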
The frame-level and phoneme-level linguistic features are normalized using the maximum and minimum values, calculated as:

x̂_i = (x_i - min_i) / (max_i - min_i)

where x̂_i is the normalized value of the i-th dimension feature, x_i is its value before normalization, and max_i and min_i are respectively the maximum and minimum values of the i-th dimension feature.
The acoustic parameters and the duration parameters are zero-mean normalized, calculated as:

x̂_i = (x_i - u_i) / σ_i

where x̂_i is the normalized value of the i-th dimension feature, x_i is its value before normalization, and u_i and σ_i are respectively the mean and standard deviation of the i-th dimension feature.
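For illustration, the two normalization schemes just described can be sketched as follows; this is a minimal NumPy example, and the function names, the small epsilon guarding against zero ranges, and the placeholder data are additions for the sketch rather than part of the patent.

```python
import numpy as np

def minmax_normalize(x):
    """Maximum-minimum normalization per feature dimension (used for the linguistic features)."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + 1e-8), lo, hi

def zero_mean_normalize(x):
    """Zero-mean normalization per feature dimension (used for the acoustic and duration parameters)."""
    mean, std = x.mean(axis=0), x.std(axis=0)
    return (x - mean) / (std + 1e-8), mean, std

# Placeholder data: 187-dim frame-level acoustic parameters and 628-dim frame-level
# linguistic features (624 phoneme-level dimensions plus 4 frame position values).
acoustic = np.random.randn(1000, 187)
linguistic = np.random.rand(1000, 628)
acoustic_norm, ac_mean, ac_std = zero_mean_normalize(acoustic)
linguistic_norm, li_min, li_max = minmax_normalize(linguistic)
```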
Step 102) constructing a variational self-encoder network that takes the normalized frame-level acoustic parameters as input; the encoding distribution is assumed to be Gaussian, the output of the network is the mean and standard deviation of this Gaussian distribution, and the relative entropy between the predicted distribution and the true distribution is calculated and used as the loss function of the encoder.
The encoder contains 5 one-dimensional convolutional layers, 1 long short-term memory (LSTM) layer and 1 fully-connected layer; the convolution kernel size of the convolutional layers is 5, the stride is 2 and the number of kernels is 128; the fully-connected layer outputs the standard deviation and mean of the Gaussian distribution predicted by the encoder; the LSTM layer contains 128 neurons; and the activation function of each neuron is a rectified linear unit, expressed as:
f(x)=max(0,x)
the relative entropy between the predicted encoded distribution and the true distribution is calculated as:
L_KL = D_KL(q_φ(z|x) || p_θ(z)) = (1/2) Σ_{n=1}^{N} (u_n^2 + σ_n^2 - log σ_n^2 - 1)

where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th components of the standard deviation σ(x) and the mean u(x) of the Gaussian distribution predicted by the variational self-encoder, q_φ(z|x) is the hidden-vector distribution predicted by the variational self-encoder, and p_θ(z) is the true hidden-vector distribution, which is assumed to be a standard Gaussian distribution. The relative entropy L_KL is taken as the loss function of the encoder.
step 103) obtaining a hidden vector as an acoustic speaker label based on the distributed sampling of the step 102);
in order to avoid the problem that the gradient cannot be transmitted back due to direct sampling, the hidden vector is realized by resampling, and the formula of resampling is as follows:
z_N = u(x) + σ(x)·ε_N

where N is the dimension of the Gaussian distribution, z_N is the hidden vector, x is the input normalized frame-level acoustic parameters, and ε_N ~ N(0, I) is a vector sampled from a standard Gaussian distribution; the gradients back-propagated through the encoder output are then calculated as:

∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N

where e_N is an N-dimensional all-ones vector.
The hidden vector z_N is converted by a fully-connected layer containing 256 neurons into a 128-dimensional acoustic speaker tag.
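A minimal PyTorch sketch of the encoder and sampling of steps 102) and 103) is given below. The stated structure (five one-dimensional convolutions with kernel size 5, stride 2 and 128 channels, a 128-neuron LSTM layer, fully-connected outputs for the distribution parameters, reparameterized sampling, a KL loss against a standard Gaussian, and a 256-neuron fully-connected stage producing a 128-dimensional speaker tag) follows the text; the class and variable names, the use of a log-variance output for numerical stability, the way the 256-neuron layer is mapped to 128 dimensions, and the 187-dimensional input are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SpeakerVAEEncoder(nn.Module):
    """Sketch of the variational encoder branch: input (batch, frames, feat_dim)
    normalized acoustic (or duration) parameters; output a 128-dim speaker tag and L_KL."""
    def __init__(self, feat_dim=187, latent_dim=128):
        super().__init__()
        layers, in_ch = [], feat_dim
        for _ in range(5):                                   # 5 one-dimensional conv layers
            layers += [nn.Conv1d(in_ch, 128, kernel_size=5, stride=2, padding=2), nn.ReLU()]
            in_ch = 128
        self.convs = nn.Sequential(*layers)
        self.lstm = nn.LSTM(128, 128, batch_first=True)      # 128-neuron LSTM layer
        self.to_mean = nn.Linear(128, latent_dim)            # predicts u(x)
        self.to_logvar = nn.Linear(128, latent_dim)          # predicts log sigma(x)^2
        self.to_tag = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 128))     # 256-neuron FC stage -> 128-dim tag

    def forward(self, x):
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)    # (batch, frames', 128)
        _, (h_n, _) = self.lstm(h)
        summary = h_n[-1]                                    # utterance-level summary, (batch, 128)
        u, logvar = self.to_mean(summary), self.to_logvar(summary)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)                        # epsilon_N ~ N(0, I)
        z = u + sigma * eps                                  # reparameterized hidden vector z_N
        # KL divergence between N(u, sigma^2) and the standard Gaussian prior
        l_kl = 0.5 * torch.sum(u.pow(2) + sigma.pow(2) - 1.0 - logvar, dim=-1).mean()
        speaker_tag = self.to_tag(z)                         # 128-dim speaker tag
        return speaker_tag, l_kl
```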
Step 104) constructing an acoustic parameter prediction network that takes the normalized frame-level linguistic features and the acoustic speaker tag as input and outputs the normalized acoustic parameters of the predicted speech; the mean square error between the predicted acoustic parameters and the real acoustic parameters is calculated and used as the loss function of the network.
As shown in fig. 2, the acoustic parameter prediction network comprises a fully-connected layer and three bidirectional long short-term memory (BiLSTM) layers;
the input of the fully-connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three BiLSTM layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:

Loss_MSE = Σ_j (x_j - x̂_j)^2

where x_j is the value of the j-th dimension of the acoustic parameters of the real speech and x̂_j is the value of the j-th dimension of the acoustic parameters of the normalized predicted speech;
the fully-connected layer and the BiLSTM layers each contain 256 neurons; the activation function of every neuron is a rectified linear unit, expressed as:
f(x)=max(0,x)。
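A compact sketch of this prediction network is given below; written with a configurable number of BiLSTM layers, the same class also covers the duration network of step 106). The class name, the final linear projection onto the target dimension, the choice of 256 units per LSTM direction, and the way the utterance-level speaker tag is broadcast and concatenated are assumptions for illustration, not details fixed by the text.

```python
import torch
import torch.nn as nn

class ParameterPredictionNet(nn.Module):
    """Fully-connected layer followed by bidirectional LSTM layer(s), 256 neurons each.
    Acoustic network: num_bilstm=3, out_dim=187; duration network: num_bilstm=1, out_dim=1."""
    def __init__(self, in_dim, out_dim, num_bilstm, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=num_bilstm,
                              bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, out_dim)            # projection to the target parameters

    def forward(self, linguistic, speaker_tag):
        # Broadcast the utterance-level 128-dim speaker tag to every frame/phoneme and concatenate.
        tag = speaker_tag.unsqueeze(1).expand(-1, linguistic.size(1), -1)
        h = self.fc(torch.cat([linguistic, tag], dim=-1))
        h, _ = self.bilstm(h)
        return self.out(h)

# Acoustic network: 628-dim frame-level linguistic features (624 + 4 position values)
# concatenated with the 128-dim acoustic speaker tag; 187-dim acoustic output.
acoustic_net = ParameterPredictionNet(in_dim=628 + 128, out_dim=187, num_bilstm=3)
mse_loss = nn.MSELoss()   # Loss_MSE between predicted and real normalized parameters
```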
step 105) carrying out weighted summation on the loss function of the variational self-encoder in the step 102) and the loss function of the acoustic parameter prediction network in the step 104) to serve as an optimization function, carrying out gradient return transmission by reducing the value of the optimization function, and updating the parameters of the network to obtain a trained network;
the optimization function obtained by weighted summation of the loss function of the encoder and the loss function of the acoustic parameter prediction network is calculated as follows:
C2 = Loss_MSE + ω(n)·L_KL

where ω(n) is the weight of the encoder loss function, expressed as a function of n, the number of iterations over the whole training data.
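To make the joint optimization of step 105) concrete, the sketch below performs one training update on the weighted sum C2 = Loss_MSE + ω(n)·L_KL, reusing the SpeakerVAEEncoder and ParameterPredictionNet sketches above. Since the exact ω(n) schedule is not reproduced in this text, a simple linear ramp over the first passes through the training data is assumed.

```python
import torch
import torch.nn.functional as F

def kl_weight(n, ramp_iters=10):
    """Stand-in for omega(n): a linear ramp from 0 to 1 over the first ramp_iters
    iterations of the whole training data (this schedule is an assumption)."""
    return min(1.0, n / float(ramp_iters))

def train_step(vae_encoder, prediction_net, optimizer, linguistic, acoustic, n):
    """One joint update of the variational encoder and the acoustic parameter prediction
    network on the optimization function C2 = Loss_MSE + omega(n) * L_KL."""
    speaker_tag, l_kl = vae_encoder(acoustic)             # unsupervised speaker tag + KL term
    predicted = prediction_net(linguistic, speaker_tag)   # normalized predicted acoustic parameters
    loss_mse = F.mse_loss(predicted, acoustic)            # Loss_MSE against the real parameters
    c2 = loss_mse + kl_weight(n) * l_kl                   # weighted-sum optimization function
    optimizer.zero_grad()
    c2.backward()                                         # gradient back-propagation
    optimizer.step()                                      # update the network parameters
    return c2.item()
```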
step 106) building a variational self-encoder network with the same structure as the step 102), taking normalized phoneme-level duration parameters as input, outputting as a duration speaker tag, and building a duration prediction network at the same time, taking the duration speaker tag and the phoneme-level linguistic features as input; outputting the time length parameter as a prediction;
As shown in fig. 3, the duration prediction network comprises a fully-connected layer and a bidirectional long short-term memory (BiLSTM) layer;
the input of the fully-connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BiLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted duration parameter and the duration parameter of the real speech:

Loss_MSE = (d - d̂)^2

where d is the duration parameter of the real speech and d̂ is the duration parameter of the normalized predicted speech;
the fully-connected layer and the BiLSTM layer each contain 256 neurons; the activation function of every neuron is a rectified linear unit, expressed as:
f(x)=max(0,x)。
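Under the same assumptions as the ParameterPredictionNet sketch given after step 104), the duration branch of step 106) differs only in its single BiLSTM layer and 1-dimensional output, for example:

```python
# Duration network: 624-dim phoneme-level linguistic features concatenated with the
# 128-dim duration speaker tag; one BiLSTM layer; 1-dim predicted duration per phoneme.
duration_net = ParameterPredictionNet(in_dim=624 + 128, out_dim=1, num_bilstm=1)
```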
step 107) training the variational self-encoder and the duration parameter prediction network in the step 106);
inputting the normalized phoneme-level duration parameters into the variational self-encoder, calculating its loss function L_KL, and outputting a duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and calculating its loss function Loss_MSE;
obtaining an optimization function C1 as the weighted sum of the loss function of the variational self-encoder and the loss function of the duration parameter prediction network:

C1 = Loss_MSE + ω(n)·L_KL

where ω(n) is the weight of the encoder loss function and n is the number of iterations over the whole training data;
performing gradient back-propagation to reduce the value of the optimization function C1 and updating the network parameters to obtain the trained variational self-encoder and duration parameter prediction network.
The synthesis stage comprises:
step 201) extracting a phoneme level duration parameter and a frame level acoustic parameter of a clean voice of a speaker to be synthesized, normalizing the phoneme level duration parameter, inputting the normalized phoneme level duration parameter into the variation self-encoder in the step 105), and outputting a duration speaker label; inputting the normalized frame level acoustic parameters into the variational self-encoder in the step 107) and outputting an acoustic speaker label;
the phoneme-level duration parameter comprises 1-dimensional duration information; the frame-level acoustic parameters include: 60-dimensional Mel cepstrum coefficient and its first and second order difference, 1-dimensional fundamental frequency parameter and its first and second order difference, 1-dimensional non-periodic parameter and its first and second order difference, and 1-dimensional vowel consonant decision parameter; and the frame-level acoustic parameters and the phoneme-level duration parameters are normalized by 0 mean value.
Step 202) extracting frame-level linguistic features and phoneme-level linguistic features from a voice signal to be synthesized, wherein the voice signal comprises a plurality of speakers, and normalizing the frame-level linguistic features and the phoneme-level linguistic features;
the frame-level linguistic features include 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phone-level linguistic features include: 477-dimensional text pronunciation characteristics, 177-dimensional word segmentation and rhythm characteristics; the frame-level linguistic features and the phoneme-level linguistic features are normalized using a maximum and a minimum value.
Step 203) inputting the duration speaker tag and the normalized phoneme-level linguistic features into the duration prediction network trained in step 107), and outputting the predicted duration of the current phoneme;
step 204) obtaining the frame-level linguistic characteristics of the current phoneme through the predicted duration of the current phoneme, inputting the frame-level linguistic characteristics of the current phoneme and the acoustic parameter prediction network of the acoustic speaker tag input step 105), and outputting normalized acoustic parameters of predicted speech;
and obtaining the relative position and the absolute position of the current frame relative to the predicted duration through predicting the duration, thereby obtaining the frame-level linguistic feature of the phoneme.
Step 205) inputs the normalized acoustic parameters of the predicted speech into the vocoder and outputs a synthesized speech signal.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A method of multi-speaker speech synthesis based on a variational auto-encoder, the method comprising:
extracting a phoneme-level duration parameter and a frame-level acoustic parameter of a clean voice of a speaker to be synthesized, normalizing the phoneme-level duration parameter, inputting the normalized phoneme-level duration parameter into a first variational self-encoder, and outputting a duration speaker tag; inputting the normalized frame level acoustic parameters into a second variational self-encoder, and outputting an acoustic speaker tag;
extracting frame-level linguistic features and phoneme-level linguistic features from a voice signal to be synthesized, wherein the voice signal comprises a plurality of speakers, and normalizing the frame-level linguistic features and the phoneme-level linguistic features;
inputting the duration speaker tag and the normalized phoneme-level linguistic features into a duration prediction network, and outputting the predicted duration of the current phoneme;
obtaining the frame-level linguistic characteristics of the phoneme through the predicted duration of the current phoneme, inputting the frame-level linguistic characteristics of the phoneme and the acoustic speaker tag into an acoustic parameter prediction network, and outputting normalized acoustic parameters of predicted speech;
the normalized acoustic parameters of the predicted speech are input to the vocoder, and the synthesized speech signal is output.
2. The method of claim 1, wherein the first variational auto-encoder/the second variational auto-encoder comprises 5 one-dimensional convolutional layers, 1 long short-term memory layer and 1 fully-connected layer, wherein the convolutional layers have a convolution kernel size of 5, a stride of 2 and 128 kernels; the fully-connected layer outputs the standard deviation and mean of the predicted Gaussian distribution; the long short-term memory layer comprises 128 neurons; and the activation function of each neuron is a rectified linear unit (ReLU), expressed as:
f(x)=max(0,x)
the input of the first variational auto-encoder/the second variational auto-encoder is the normalized phoneme-level duration parameters/the normalized frame-level acoustic parameters, and the output is the mean and standard deviation of a Gaussian distribution; the relative entropy between the predicted encoding distribution and the true distribution is calculated as:

L_KL = D_KL(q_φ(z|x) || p_θ(z)) = (1/2) Σ_{n=1}^{N} (u_n^2 + σ_n^2 - log σ_n^2 - 1)

where N is the dimension of the Gaussian distribution, σ_n and u_n are the n-th components of the standard deviation σ(x) and the mean u(x) of the Gaussian distribution predicted by the variational auto-encoder, q_φ(z|x) is the hidden-vector distribution predicted by the variational auto-encoder, and p_θ(z) is the true hidden-vector distribution, which is assumed to be a standard Gaussian distribution;
the hidden vector is obtained by reparameterized sampling:

z_N = u(x) + σ(x)·ε_N

where z_N is the hidden vector, x is the normalized phoneme-level duration parameters/the normalized frame-level acoustic parameters, and ε_N ~ N(0, I) is a vector sampled from a standard Gaussian distribution; the gradients back-propagated through the output of the variational auto-encoder are calculated as:

∂z_N/∂u(x) = e_N,  ∂z_N/∂σ(x) = ε_N

where e_N is an N-dimensional all-ones vector; the hidden vector z_N is converted by a fully-connected layer containing 256 neurons into a 128-dimensional duration speaker tag/acoustic speaker tag, which is the output of the first variational auto-encoder/the second variational auto-encoder.
3. The method of claim 2, wherein the duration prediction network comprises a fully-connected layer and a bidirectional long short-term memory (BiLSTM) layer;
the input of the fully-connected layer is the duration speaker tag and the normalized phoneme-level linguistic features; the output of the BiLSTM layer is the predicted duration of the current phoneme, and the loss function Loss_MSE is the mean square error between the normalized predicted duration parameter and the duration parameter of the real speech:

Loss_MSE = (d - d̂)^2

where d is the duration parameter of the real speech and d̂ is the duration parameter of the normalized predicted speech;
the fully-connected layer and the BiLSTM layer each contain 256 neurons; the activation function of every neuron is a rectified linear unit, expressed as:
f(x)=max(0,x)。
4. The method of claim 2, wherein the acoustic parameter prediction network comprises a fully-connected layer and three bidirectional long short-term memory (BiLSTM) layers;
the input of the fully-connected layer is the acoustic speaker tag and the normalized frame-level linguistic features; the output of the three BiLSTM layers is the normalized acoustic parameters of the predicted speech, and the loss function Loss_MSE is the mean square error between the normalized acoustic parameters of the predicted speech and the acoustic parameters of the real speech:

Loss_MSE = Σ_j (x_j - x̂_j)^2

where x_j is the value of the j-th dimension of the acoustic parameters of the real speech and x̂_j is the value of the j-th dimension of the acoustic parameters of the normalized predicted speech;
the fully-connected layer and the BiLSTM layers each contain 256 neurons; the activation function of every neuron is a rectified linear unit, expressed as:
f(x)=max(0,x)。
5. The method of variational self-encoder based multi-speaker speech synthesis according to claim 3, wherein, before the above steps, the method further comprises: extracting frame-level acoustic parameters, phoneme-level duration parameters, frame-level linguistic features and phoneme-level linguistic features from the recorded speech signals containing a plurality of speakers, and normalizing each of them respectively;
the frame-level acoustic parameters include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
the phoneme-level duration parameter comprises 1-dimensional duration information;
the frame-level linguistic features include the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information; the phoneme-level linguistic features include 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level acoustic parameters and the phoneme-level duration parameters are zero-mean normalized; the frame-level linguistic features and the phoneme-level linguistic features are normalized using the maximum and minimum values.
6. The method of variational self-encoder based multi-speaker speech synthesis according to claim 5, wherein the method further comprises, beforehand, a step of training the first variational self-encoder and the duration parameter prediction network, which specifically comprises:
inputting the normalized phoneme-level duration parameters into the first variational self-encoder, calculating its loss function L_KL, and outputting a duration speaker tag;
inputting the normalized phoneme-level linguistic features and the duration speaker tag into the duration parameter prediction network and calculating its loss function Loss_MSE;
obtaining an optimization function C1 as the weighted sum of the loss function of the first variational self-encoder and the loss function of the duration parameter prediction network:

C1 = Loss_MSE + ω(n)·L_KL

where ω(n) is the weight of the encoder loss function and n is the number of iterations over the whole training data;
performing gradient back-propagation to reduce the value of the optimization function C1 and updating the network parameters to obtain the trained first variational self-encoder and duration parameter prediction network.
7. The method of variational self-encoder based multi-speaker speech synthesis according to claim 5, wherein the method further comprises, beforehand, a step of training the second variational self-encoder and the acoustic parameter prediction network, which specifically comprises:
inputting the normalized frame-level acoustic parameters into the second variational self-encoder, calculating its loss function L_KL, and outputting an acoustic speaker tag;
inputting the acoustic speaker tag and the normalized frame-level linguistic features into the acoustic parameter prediction network and calculating its loss function Loss_MSE;
obtaining an optimization function C2 as the weighted sum of the loss function of the second variational self-encoder and the loss function of the acoustic parameter prediction network:

C2 = Loss_MSE + ω(n)·L_KL

where ω(n) is the weight of the encoder loss function and n is the number of iterations over the whole training data;
performing gradient back-propagation to reduce the value of the optimization function C2 and updating the network parameters to obtain the trained second variational self-encoder and acoustic parameter prediction network.
8. The method according to claim 1, wherein extracting and normalizing the phoneme-level duration parameters and the frame-level acoustic parameters of the clean speech of the speaker to be synthesized specifically comprises:
extracting the phoneme-level duration parameters of the clean speech of the speaker to be synthesized, wherein the phoneme-level duration parameter comprises 1-dimensional duration information;
extracting the frame-level acoustic parameters of the clean speech of the speaker to be synthesized, wherein the frame-level acoustic parameters include: 60-dimensional Mel-cepstral coefficients with their first- and second-order differences, a 1-dimensional fundamental frequency parameter with its first- and second-order differences, a 1-dimensional aperiodicity parameter with its first- and second-order differences, and a 1-dimensional voiced/unvoiced decision parameter;
the frame-level acoustic parameters and the phoneme-level duration parameters are zero-mean normalized.
9. The method according to claim 1, wherein extracting and normalizing the frame-level linguistic features and phoneme-level linguistic features of the speech signal to be synthesized containing multiple speakers specifically comprises:
extracting frame-level linguistic features from the speech signal to be synthesized containing multiple speakers, wherein the frame-level linguistic features comprise the 624-dimensional phoneme-level linguistic features and 4-dimensional frame position information;
extracting phoneme-level linguistic features from the speech signal to be synthesized containing multiple speakers, wherein the phoneme-level linguistic features comprise 477-dimensional text pronunciation features and 177-dimensional word segmentation and prosody features;
the frame-level linguistic features and the phoneme-level linguistic features are normalized using the maximum and minimum values.
10. The method of claim 1, wherein obtaining the frame-level linguistic features of the phoneme from the predicted duration comprises: obtaining the relative position and the absolute position of the current frame within the predicted duration, and thereby obtaining the frame-level linguistic features of the phoneme.
CN201910671050.5A 2019-07-24 2019-07-24 Multi-speaker voice synthesis method based on variational self-encoder Active CN112289304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910671050.5A CN112289304B (en) 2019-07-24 2019-07-24 Multi-speaker voice synthesis method based on variational self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910671050.5A CN112289304B (en) 2019-07-24 2019-07-24 Multi-speaker voice synthesis method based on variational self-encoder

Publications (2)

Publication Number Publication Date
CN112289304A (en) 2021-01-29
CN112289304B CN112289304B (en) 2024-05-31

Family

ID=74418960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910671050.5A Active CN112289304B (en) 2019-07-24 2019-07-24 Multi-speaker voice synthesis method based on variational self-encoder

Country Status (1)

Country Link
CN (1) CN112289304B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066511A (en) * 2021-03-16 2021-07-02 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113450761A (en) * 2021-06-17 2021-09-28 清华大学深圳国际研究生院 Parallel speech synthesis method and device based on variational self-encoder
CN113488022A (en) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 Speech synthesis method and device
CN113707122A (en) * 2021-08-11 2021-11-26 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN114267331A (en) * 2021-12-31 2022-04-01 达闼机器人有限公司 Speaker coding method, device and multi-speaker voice synthesis system
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN118430512A (en) * 2024-07-02 2024-08-02 厦门蝉羽网络科技有限公司 Speech synthesis method and device for improving phoneme pronunciation time accuracy

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
JP2019109306A (en) * 2017-12-15 2019-07-04 日本電信電話株式会社 Voice conversion device, voice conversion method and program
WO2019138897A1 (en) * 2018-01-10 2019-07-18 ソニー株式会社 Learning device and method, and program
CN110047501A (en) * 2019-04-04 2019-07-23 南京邮电大学 Multi-to-multi phonetics transfer method based on beta-VAE

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019109306A (en) * 2017-12-15 2019-07-04 日本電信電話株式会社 Voice conversion device, voice conversion method and program
WO2019138897A1 (en) * 2018-01-10 2019-07-18 ソニー株式会社 Learning device and method, and program
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN110047501A (en) * 2019-04-04 2019-07-23 南京邮电大学 Multi-to-multi phonetics transfer method based on beta-VAE

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIN-CHENG HSU ET AL.: "Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder", arXiv:1610.04019v1, pages 1-6 *
YAJIE ZHANG ET AL.: "Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis", arXiv:1812.04342v2, pages 1-5 *
HUANG GUOJIE ET AL.: "Enhanced Variational Auto-encoder for Non-parallel Corpus Voice Conversion" (in Chinese), Journal of Signal Processing (信号处理), vol. 34, no. 10, pages 1246-1251 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066511A (en) * 2021-03-16 2021-07-02 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113409764B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113450761A (en) * 2021-06-17 2021-09-28 清华大学深圳国际研究生院 Parallel speech synthesis method and device based on variational self-encoder
CN113450761B (en) * 2021-06-17 2023-09-22 清华大学深圳国际研究生院 Parallel voice synthesis method and device based on variation self-encoder
CN113488022A (en) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 Speech synthesis method and device
CN113488022B (en) * 2021-07-07 2024-05-10 北京搜狗科技发展有限公司 Speech synthesis method and device
CN113707122A (en) * 2021-08-11 2021-11-26 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model
CN113707122B (en) * 2021-08-11 2024-04-05 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model
CN113782045B (en) * 2021-08-30 2024-01-05 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN114267329A (en) * 2021-12-24 2022-04-01 厦门大学 Multi-speaker speech synthesis method based on probability generation and non-autoregressive model
CN114267331A (en) * 2021-12-31 2022-04-01 达闼机器人有限公司 Speaker coding method, device and multi-speaker voice synthesis system
CN118430512A (en) * 2024-07-02 2024-08-02 厦门蝉羽网络科技有限公司 Speech synthesis method and device for improving phoneme pronunciation time accuracy
CN118430512B (en) * 2024-07-02 2024-10-22 厦门蝉羽网络科技有限公司 Speech synthesis method and device for improving phoneme pronunciation time accuracy

Also Published As

Publication number Publication date
CN112289304B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN112289304B (en) Multi-speaker voice synthesis method based on variational self-encoder
Purwins et al. Deep learning for audio signal processing
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN111081230B (en) Speech recognition method and device
CN113205792A (en) Mongolian speech synthesis method based on Transformer and WaveNet
CN115428066A (en) Synthesized speech processing
CN110930975B (en) Method and device for outputting information
WO2022105472A1 (en) Speech recognition method, apparatus, and electronic device
Kumar et al. A comprehensive review of recent automatic speech summarization and keyword identification techniques
Becerra et al. Speech recognition in a dialog system: From conventional to deep processing: A case study applied to Spanish
Garg et al. Survey on acoustic modeling and feature extraction for speech recognition
O’Shaughnessy Recognition and processing of speech signals using neural networks
Vegesna et al. Dnn-hmm acoustic modeling for large vocabulary telugu speech recognition
Sen et al. A convolutional neural network based approach to recognize bangla spoken digits from speech signal
CN117063228A (en) Mixed model attention for flexible streaming and non-streaming automatic speech recognition
Mei et al. A particular character speech synthesis system based on deep learning
Lee et al. Isolated word recognition using modular recurrent neural networks
Dudhrejia et al. Speech recognition using neural networks
Evrard Transformers in automatic speech recognition
Oprea et al. An artificial neural network-based isolated word speech recognition system for the Romanian language
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Galatang Syllable-Based Indonesian Automatic Speech Recognition.
Atanda et al. Yorùbá automatic speech recognition: A review
Djuraev et al. An In-Depth Analysis of Automatic Speech Recognition System
JP2002091480A (en) Acoustic model generator and voice recognition device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20241011

Address after: 100190, No. 21 West Fourth Ring Road, Beijing, Haidian District

Patentee after: INSTITUTE OF ACOUSTICS, CHINESE ACADEMY OF SCIENCES

Country or region after: China

Address before: 100190, No. 21 West Fourth Ring Road, Beijing, Haidian District

Patentee before: INSTITUTE OF ACOUSTICS, CHINESE ACADEMY OF SCIENCES

Country or region before: China

Patentee before: BEIJING KEXIN TECHNOLOGY Co.,Ltd.