CN114023300A - Chinese speech synthesis method based on diffusion probability model - Google Patents

Chinese speech synthesis method based on diffusion probability model Download PDF

Info

Publication number
CN114023300A
Authority
CN
China
Prior art keywords
diffusion
model
attention
probability
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111295924.5A
Other languages
Chinese (zh)
Inventor
王海舟
范润琦
吴英奡
许晋荣
张新悦
吴心宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202111295924.5A priority Critical patent/CN114023300A/en
Publication of CN114023300A publication Critical patent/CN114023300A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 2013/083 Special characters, e.g. punctuation marks

Abstract

The invention discloses a Chinese speech synthesis method based on a diffusion probability model, which first constructs a Chinese text front-end processing module, then builds an end-to-end spectrum generation network based on a forward attention mechanism, and finally synthesizes Chinese speech using a Diffwave vocoder based on the diffusion probability model. The invention adopts a forward attention mechanism to address problems such as poor speech-frame alignment when synthesizing long Chinese sentences, and uses a non-autoregressive Diffwave vocoder based on a diffusion probability model in the vocoder part, which significantly improves the quality and efficiency of the synthesized speech.

Description

Chinese speech synthesis method based on diffusion probability model
Technical Field
The invention relates to the technical field of artificial-intelligence speech synthesis, and in particular to a Chinese speech synthesis method based on a diffusion probability model.
Background
Speech synthesis technology generally refers to the conversion of text into speech. With the continuous development and maturation of internet technology, information technology, artificial intelligence and related fields, and with the popularization of intelligent terminals, new human-computer interaction modes represented by synthetic speech technology have quietly become widespread. Nowadays, speech synthesis is widely applied in scenarios such as map navigation, voice assistants, audiobook reading and short-video dubbing.
With the continuous development of deep learning, many speech synthesis models have achieved good results. The currently common deep-learning-based speech synthesis scheme mainly comprises two steps: first, acoustic features such as the mel spectrum are predicted from the text information; a vocoder is then used to convert the predicted acoustic features into the original audio waveform. The currently popular deep-learning-based speech synthesis models are mainly divided into autoregressive and non-autoregressive types. The main problem faced by autoregressive speech synthesis is slow synthesis speed: the WaveNet vocoder used by Tacotron2 is an autoregressive convolutional neural network, and in order to solve the long-range dependence problem WaveNet must balance the receptive field against the number of parameters. WaveNet stacks multiple layers of one-dimensional dilated convolutions with a kernel width of 2, so the receptive field grows exponentially with the number of layers, and the synthesis speed is slow. The main problem of traditional non-autoregressive speech synthesis is low synthesis quality: for example, the FastSpeech model generates mel spectrograms in parallel and thereby accelerates the synthesis process; FastSpeech is trained on a Transformer structure, but the extracted alignment is not accurate enough and the obtained target mel spectrum loses some information, so the resulting sound quality is poor.
Disclosure of Invention
In view of the above problems, the object of the present invention is to provide a Chinese speech synthesis method based on a diffusion probability model, which uses a forward attention mechanism in the decoder and a Diffwave vocoder based on a diffusion probability model to achieve more efficient and higher-quality Chinese speech synthesis. The technical scheme is as follows:
a Chinese speech synthesis method based on a diffusion probability model comprises the following steps:
s1: text front-end processing:
acquiring a text data set, constructing a Chinese text front-end processing module, and performing Mandarin text-to-phoneme conversion, text regularization and punctuation mark deletion or conversion on the text data set to obtain a phoneme sequence;
s2: constructing an end-to-end spectrum generation network based on a forward attention mechanism to encode and decode the processed text:
encoding: the encoder module processes the input phoneme sequence to obtain a hidden-layer sequence, and at each decoding time an attention mechanism performs a soft selection over the input sequence to obtain an attention context vector as the input of the decoder;
decoding: the decoder module passes the prediction of the previous time step through a pre-processing network; the output of the pre-processing network and the attention context vector are concatenated and passed through a stack of two unidirectional LSTM layers; the concatenation of the LSTM-layer output and the attention context vector is projected by a linear transformation to predict the target spectrogram frame; the predicted mel spectrogram then passes through a 5-layer convolutional post-processing network, which adds a prediction residual to the prediction to improve the overall reconstruction;
s3: chinese speech synthesis using a Diffwave vocoder based on a diffusion probability model:
the diffusion probability model divides the mapping relation between the noise and the target waveform into T steps to form a Markov chain, trains the diffusion process of the chain, namely from the target audio to the noise, and then decodes the chain through the reverse process, namely from the noise to the target audio.
Further, the Mandarin text-to-phoneme conversion is specifically: for the Chinese characters of each sentence of the text data set, taken in order from left to right, first search whether a word beginning with the current Chinese character exists in the word pinyin library and check whether the following Chinese characters in the text match that word; if they match, the pinyin of the word is obtained directly from the word pinyin library; if not, the pinyin of the single Chinese character is obtained from the character pinyin library.
Further, the encoder module includes: a character embedding layer, a 3-layer convolution and a bidirectional LSTM layer. The input characters are encoded into 128-dimensional character vectors and then passed through 3 convolution layers, each containing 256 convolution kernels of size 5×1, i.e. each kernel spans 5 characters; the convolution layers perform large-span context modelling on the input character sequence, each convolution layer is followed by batch normalization, and a ReLU activation function is used for activation; the output of the final convolution layer is passed to the bidirectional LSTM layer to generate the encoding features:

f_e = ReLU(F_3 * ReLU(F_2 * ReLU(F_1 * E(X))))   (1)

H = EncoderRecurrency(f_e)   (2)

where f_e denotes the encoding features, F_1, F_2, F_3 are the 3 convolution kernels, * denotes convolution, ReLU(·) denotes the nonlinear activation on each convolution layer, E(X) denotes the embedding of the character sequence X, EncoderRecurrency(·) denotes the recurrent neural network (bidirectional LSTM) in the encoder, and H is the output encoder hidden state.
Further, let the phoneme sequence input to the encoder be x = [x_1, x_2, …, x_N], where N is the length of the phoneme sequence; the encoder produces the hidden-layer sequence h = [h_1, h_2, …, h_N]. At each decoding time k, the attention mechanism performs a soft selection over the input sequence to obtain a context vector c_k as input to the decoder.

Let the query vector of the attention mechanism be s_k. The attention mechanism selects as input the encoder output at a position between 1 and N, the position being described by a random variable π_k ∈ {1, …, N}; the modelling target of the attention mechanism is the probability distribution of this position variable, p(π_k | h, s_k). The context vector is computed as:

c_k = Σ_{n=1}^N y_k(n) h_n   (3)

where y_k(n) = p(π_k = n | h, s_k) denotes the probability that the attention stays at encoder output position n at decoding time k.

The content-based attention mechanism is computed as:

e_{k,n} = v^T tanh(W s_k + V h_n + b),   y_k(n) = exp(e_{k,n}) / Σ_{m=1}^N exp(e_{k,m})   (4)

where W, V, b and v are model parameters, and e_{k,n} evaluates the degree of matching between s_k and h_n.

Assuming that the random variables π_k of the attention position at different times are conditionally independent given the encoder output h and the query vectors s_k, the probability of an alignment path π_{1:k} = {π_1, π_2, …, π_k} is:

p(π_{1:k} | h, s_{1:k}) = Π_{k'=1}^k y_{k'}(π_{k'})   (5)

where s_{1:k} is the set of query vectors {s_1, s_2, …, s_k}, and y_{k'}(π_{k'}) denotes the probability that the attention stays at encoder output position π_{k'} at any time k' before the current decoding time k.

Each path in the set P of legal attention paths satisfies monotonicity and continuity; given the constraint of monotonic paths, the conditional probability of the attention distribution is:

p(π_k | h, s_{1:k}, π_{0:k} ∈ P)   (6)

The forward variable a_k(n) is then defined as:

a_k(n) = Σ_{π_{1:k} ∈ P, π_k = n} Π_{k'=1}^k y_{k'}(π_{k'})   (7)

Using a dynamic programming algorithm, the forward variable at the current time is obtained recursively from the forward variable at the previous time:

a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) y_k(n)   (8)

A new attention probability is obtained from the forward variable:

ŷ_k(n) = a_k(n) / Σ_{m=1}^N a_k(m)   (9)

and ŷ_k(n) is used in place of y_k(n) in formula (3) to compute the context vector c_k:

c_k = Σ_{n=1}^N ŷ_k(n) h_n   (10)
Further, S3 specifically includes:

S31: Define q_data(x_0) as the data distribution on R^L, where L is the data dimension; define x_t ∈ R^L, t = 0, 1, …, T, as a sequence of variables of the same dimension, where t is the index of the diffusion step and T is the total number of diffusion steps. The diffusion probability model comprises a diffusion process and a reverse process.

The purpose of the diffusion process is to gradually map x_0 to a multidimensional normal distribution through a Markov chain, i.e.:

q(x_1, …, x_T | x_0) = Π_{t=1}^T q(x_t | x_{t-1})   (11)

where q(x_t | x_{t-1}) is defined as the Gaussian distribution N(x_t; √(1-β_t) x_{t-1}, β_t I) related to the constant β_t, and I is the identity matrix. The reverse process generates samples starting from a normal distribution:

p_latent(x_T) = N(0, I)   (12)

p_θ(x_0, …, x_{T-1} | x_T) = Π_{t=1}^T p_θ(x_{t-1} | x_t)   (13)

where p_latent(x_T) is an isotropic Gaussian distribution, and the transition probability p_θ(x_{t-1} | x_t) is parameterized as the Gaussian distribution N(x_{t-1}; μ_θ(x_t, t), σ_θ(x_t, t)^2 I).

The models μ_θ and σ_θ each take two inputs: the diffusion step t ∈ N and the variable x_t ∈ R^L, where L is the data dimension. The model μ_θ outputs an L-dimensional vector as the mean, and the model σ_θ outputs a real number as the standard deviation. The purpose of p_θ(x_{t-1} | x_t) is to gradually eliminate the Gaussian noise added in the diffusion process and finally generate data that follows the target distribution.
S32: Sampling

For the reverse process, the generation procedure first samples x_T ~ N(0, I), and then samples x_{t-1} ~ p_θ(x_{t-1} | x_t) for t = T, T-1, …, 1; the output x_0 is the generated sample.
S33: Training

Before training, the training target of the model is specified, namely the maximum likelihood p_θ(x_0); the model is trained by maximizing the variational lower bound:

E_{q_data(x_0)} log p_θ(x_0) ≥ E_q log [ p_latent(x_T) Π_{t=1}^T p_θ(x_{t-1} | x_t) / Π_{t=1}^T q(x_t | x_{t-1}) ] = ELBO   (14)

where the expectation E_q is taken over x_0 ~ q_data(x_0) and x_1, …, x_T ~ q(x_1, …, x_T | x_0), and ELBO is the evidence lower bound.

Constants based on the variance schedule of the diffusion process are defined:

ᾱ_t = Π_{s=1}^t α_s,   β̃_1 = β_1,   and for t > 1:   β̃_t = β_t (1 - ᾱ_{t-1}) / (1 - ᾱ_t)

where β_t is the forward-process variance; for ease of presentation, the substitute symbol α_t denotes α_t = 1 - β_t.
The parameterizations of μ_θ and σ_θ are then defined as:

μ_θ(x_t, t) = (1/√α_t) ( x_t - (β_t / √(1 - ᾱ_t)) ε_θ(x_t, t) )   (15)

where ε_θ: R^L × N → R^L is a neural network taking x_t and the diffusion step t as inputs; σ_θ(x_t, t) is fixed to the constant β̃_t^{1/2}.

Under this parameterization, a closed-form expression of the ELBO is given for each step as follows: given a fixed schedule β_1, …, β_T, let ε ~ N(0, I) and x_0 ~ q_data; then, under the above parameterization, the expectation E_q gives:

ELBO = c - Σ_{t=1}^T κ_t E_{x_0, ε} || ε - ε_θ( √ᾱ_t x_0 + √(1 - ᾱ_t) ε, t ) ||_2^2

for constants c and κ_t, where κ_1 = 1/(2α_1) and, for t > 1, κ_t = β_t / (2 α_t (1 - ᾱ_{t-1})).

The following unweighted variant of the ELBO is minimized to improve the generation quality:

L_unweighted = E_{x_0, ε, t} || ε - ε_θ( √ᾱ_t x_0 + √(1 - ᾱ_t) ε, t ) ||_2^2   (16)

where t is sampled uniformly from 1, …, T.
S34: Diffusion-step embedding:

Different diffusion steps t are taken as input, so that the model outputs a different ε_θ(·, t) for each t; a 128-dimensional encoding vector is used for each t:

t_embedding = [ sin(10^{0×4/63} t), …, sin(10^{63×4/63} t), cos(10^{0×4/63} t), …, cos(10^{63×4/63} t) ]

Three fully connected layers are then applied to this encoding, where the first two FC layers share parameters among the residual layers; the last FC layer maps the output of the second FC layer to a C-dimensional embedding vector. This vector is then broadcast and added to the input of each residual layer.
The invention has the following beneficial effects: the invention adopts a forward attention mechanism to address problems such as poor speech-frame alignment when synthesizing long Chinese sentences, and uses a non-autoregressive Diffwave vocoder based on a diffusion probability model in the vocoder part, which significantly improves the quality and efficiency of the synthesized speech.
Drawings
FIG. 1 is a diagram of a deep learning-based Chinese speech synthesis model according to the present invention.
FIG. 2 is a comparison of Mel frequency spectra; (a) real voice; (b) the model of the invention; (c) tacotron2+ Griffin-Lim; (d) tacotron2+ WaveRNN; (e) tacotron2+ MB-MelGAN; (f) FastSpeech2+ MB-MelGAN.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. As shown in FIG. 1, the whole framework of the deep learning-based Chinese speech synthesis model of the present invention mainly comprises three parts: text front-end processing, spectral generation networks (encoders and decoders), and vocoders.
1. Text front-end processing
(1) Putonghua text to phoneme (grapheme-to-phoneme, G2P)
For the Chinese characters of each sentence, taken in order from left to right, first search whether a word beginning with the current Chinese character exists in the word pinyin library (download address: https://github.com/mozillazg/phrase-pinyin-data) and check whether the following Chinese characters match that word; if so, the pinyin is obtained directly from the word library; if not, the pinyin of the single Chinese character is obtained from the character pinyin library (download address: https://github.com/mozillazg/pinyin-data).
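As an illustration, a minimal Python sketch of this word-first lookup is given below; the dictionary file format (one "word: pinyin" entry per line) and the function names are assumptions made for the example, not part of the published data files' specification.

# Minimal sketch of the word-first G2P lookup described above.
def load_dict(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" in line and not line.startswith("#"):
                key, value = line.split(":", 1)
                table[key.strip()] = value.strip().split()
    return table

def g2p(sentence, word_dict, char_dict, max_word_len=4):
    """Greedy left-to-right conversion: prefer the longest word match, fall back to single characters."""
    phonemes, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length > 1 and piece in word_dict:
                phonemes.extend(word_dict[piece]); i += length; break
            if length == 1:
                phonemes.extend(char_dict.get(piece, [piece])); i += 1
    return phonemes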
(2) Text regularization (text normalization, TN)
Chinese text regularization is the process of converting non-Chinese character strings into Chinese character strings to determine their pronunciations. In this embodiment, a regular expression is used to process a text, so as to realize NSW (Non-Standard-Word) normalization, and the rule is shown in table 1.
Table 1 text normalization rules table
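Since Table 1 is reproduced only as an image in the original publication, the following Python sketch illustrates regex-based NSW normalization in the same spirit; the two rules shown (percentages and digit strings, read digit by digit) are illustrative assumptions rather than the exact rule set of the embodiment.

import re

DIGITS = "零一二三四五六七八九"

def digits_to_hanzi(s):
    # Digit-by-digit reading, e.g. "2021" -> "二零二一"; "." is read as "点".
    return "".join("点" if ch == "." else DIGITS[int(ch)] for ch in s)

def normalize(text):
    # Percentages first, then remaining numbers (both simplified, digit-by-digit readings).
    text = re.sub(r"(\d+(?:\.\d+)?)%", lambda m: "百分之" + digits_to_hanzi(m.group(1)), text)
    text = re.sub(r"\d+(?:\.\d+)?", lambda m: digits_to_hanzi(m.group(0)), text)
    return text

print(normalize("温度是25.5度，湿度是60%"))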
(3) Punctuation mark
For Chinese punctuation, only the four symbols '，', '。', '？' and '！' are retained; the remaining symbols are converted into one of these four according to the following rules, detailed in Table 2.
TABLE 2 Symbol conversion rules
Before replacement → After replacement
Parentheses, quotation marks and other special symbols outside the set → ignored
Colons, dashes, pause marks, English commas → '，'
English exclamation marks → '！'
English question marks → '？'
English periods, semicolons, ellipses → '。'
Consecutive identical '，。？！' symbols → only one is retained
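A compact Python sketch of the symbol conversion of Table 2 is given below; the exact character sets handled are assumptions based on the rules above.

import re

PUNCT_MAP = {"：": "，", "——": "，", "、": "，", ",": "，",
             "!": "！", "?": "？", ".": "。", ";": "。", "……": "。"}

def convert_punct(text):
    for src, dst in PUNCT_MAP.items():          # map other marks onto the four retained symbols
        text = text.replace(src, dst)
    text = re.sub(r'[（）()"“”‘’《》…—]', "", text)   # drop brackets, quotes and remaining specials
    text = re.sub(r"([，。？！])\1+", r"\1", text)     # collapse consecutive identical symbols
    return text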
2. Encoder
The purpose of the encoder is to extract a robust sequence representation from the input text sequence. The encoder module contains a character embedding layer (Character Embedding), a 3-layer convolution and a bidirectional LSTM (Long Short-Term Memory) layer. The input characters are encoded into 128-dimensional character vectors and then passed through 3 convolution layers, each containing 256 convolution kernels of size 5×1, i.e. each kernel spans 5 characters; the convolution layers perform large-span context modelling (similar to N-grams) on the input character sequence, and convolution layers are used to capture context here mainly because recurrent neural networks find it hard to capture long-term dependencies in practice. Each convolution layer is followed by batch normalization and a ReLU (Rectified Linear Unit) activation function, ReLU(x) = max(0, x). The output of the final convolution layer is passed to a bidirectional LSTM layer containing 512 units (256 units in each direction) to generate the encoding features:

f_e = ReLU(F_3 * ReLU(F_2 * ReLU(F_1 * E(X))))   (1)

H = EncoderRecurrency(f_e)   (2)

where F_1, F_2, F_3 are the 3 convolution kernels, * denotes convolution, ReLU(·) is the nonlinear activation on each convolution layer, E(X) denotes the embedding of the character sequence X, EncoderRecurrency(·) denotes the recurrent neural network (bidirectional LSTM) in the encoder, and H is the output encoder hidden state. After the encoder hidden states are generated, the attention mechanism produces the encoding vectors. The encoder parameters are shown in Table 3.
Table 3 encoder partial parameter list
Model parameters Parameter value
embedding_dim 128
conv_layers_num 3
conv_kernel_size 5
conv_filters 256
lstm_units 256
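A minimal PyTorch sketch of the encoder described above, using the Table 3 hyper-parameters, is given below; the vocabulary size and class name are assumptions made for the example.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=128, filters=256, kernel=5, lstm_units=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        convs, in_ch = [], embed_dim
        for _ in range(3):
            convs += [nn.Conv1d(in_ch, filters, kernel, padding=kernel // 2),
                      nn.BatchNorm1d(filters), nn.ReLU()]
            in_ch = filters
        self.convs = nn.Sequential(*convs)
        self.bilstm = nn.LSTM(filters, lstm_units, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                      # (batch, N)
        x = self.embedding(phoneme_ids).transpose(1, 2)  # (batch, embed_dim, N)
        f_e = self.convs(x).transpose(1, 2)              # equation (1): convolutional features
        h, _ = self.bilstm(f_e)                          # equation (2): hidden states H, (batch, N, 512)
        return h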
3. Decoder
The decoder of the present invention adopts an autoregressive recurrent structure that predicts the mel spectrogram one frame at a time from the encoded input sequence. The decoder first passes the prediction of the previous time step through a small pre-processing network containing 2 fully connected layers of 256 hidden ReLU units each. Dropout in the pre-processing network acts as an information bottleneck that is important for learning attention and helps improve the generalization of the model. The output of the pre-processing network and the attention context vector are concatenated and passed through a stack of two unidirectional LSTM layers. The concatenation of the LSTM output and the attention context vector is projected by a linear transformation to predict the target spectrogram frame. Finally, the predicted mel spectrogram passes through a 5-layer convolutional post-processing network that adds a prediction residual to the prediction to improve the overall reconstruction. The decoder parameters are shown in Table 4.
Table 4 decoder partial parameter list
Model parameters Parameter value
prenet_layers [256,256]
decoder_layers 2
decoder_lstm_units 256
dropout_rate 0.5
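The following is a condensed PyTorch sketch of one decoder step (pre-net, two LSTM cells, linear projection) as described above; the attention computation is abstracted away and passed in as a context vector, and the dimensions follow Table 4.

import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, n_mels=80, prenet_dims=(256, 256), lstm_units=256, context_dim=512):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dims[0]), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dims[0], prenet_dims[1]), nn.ReLU(), nn.Dropout(0.5))
        self.lstm1 = nn.LSTMCell(prenet_dims[1] + context_dim, lstm_units)
        self.lstm2 = nn.LSTMCell(lstm_units, lstm_units)
        self.proj = nn.Linear(lstm_units + context_dim, n_mels)

    def forward(self, prev_frame, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)  # pre-net output + attention context
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        frame = self.proj(torch.cat([h2, context], dim=-1))        # predicted mel frame
        return frame, ((h1, c1), (h2, c2))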
(1) Forward attention mechanism
In the decoder, a forward attention mechanism is employed to improve the model's processing power for long text.
Suppose the phoneme input sequence is x = [x_1, x_2, …, x_N], where N denotes the length of the phoneme sequence. Feeding the sequence into the sequence-to-sequence encoder yields the hidden-layer sequence h = [h_1, h_2, …, h_N]. At each decoding time k, the attention mechanism performs a soft selection over the input sequence to obtain a context vector c_k as input to the decoder. Let the query vector of the attention mechanism be s_k; the state vector of the decoder RNN at the current time is typically used. The attention mechanism selects as input the encoder output at a position between 1 and N, which can be described by a random variable π_k ∈ {1, …, N}; the modelling target of the attention mechanism is then the probability distribution of this position variable, p(π_k | h, s_k). The context vector is computed as:

c_k = Σ_{n=1}^N y_k(n) h_n   (3)

where y_k(n) = p(π_k = n | h, s_k) denotes the probability that the attention stays at encoder output position n at decoding time k.

The content-based attention mechanism is computed as:

e_{k,n} = v^T tanh(W s_k + V h_n + b),   y_k(n) = exp(e_{k,n}) / Σ_{m=1}^N exp(e_{k,m})   (4)

where W, V, b and v are model parameters, and e_{k,n} evaluates the degree of matching between s_k and h_n.

Assume that the random variables π_k of the attention position at different times are conditionally independent given the encoder output h and the query vectors s_k. The probability of an alignment path π_{1:k} = {π_1, π_2, …, π_k} is then:

p(π_{1:k} | h, s_{1:k}) = Π_{k'=1}^k y_{k'}(π_{k'})   (5)

In the initial state, the method specifies π_0 = 1.

Consider a set of attention paths, denoted P. This set is the set of legal paths, i.e. every path in the set satisfies two properties. The first is monotonicity: the position at which the attention stays can only increase monotonically, π_k ≥ π_{k-1}. The second is continuity: no jump occurs between two temporally consecutive attention positions, π_k - π_{k-1} ∈ {0, 1}.

The invention considers the conditional probability of the attention distribution given the constraint of monotonic paths:

p(π_k | h, s_{1:k}, π_{0:k} ∈ P)   (6)

This conditional probability is used as the attention distribution in order to introduce a conditioning term into the probability formula. The conditioning term eliminates illegal paths in the speech generation task, i.e. all paths that violate the monotonicity rule, which greatly reduces the probability space and makes the speech synthesis task more reasonable, since in this task the attention alignment path is clearly monotonically increasing and does not jump. To describe the computation of the algorithm, the forward variable is first defined:

a_k(n) = Σ_{π_{1:k} ∈ P, π_k = n} Π_{k'=1}^k y_{k'}(π_{k'})   (7)

The forward variable in this algorithm is both similar to and different from the forward variable in the CTC (Connectionist Temporal Classification) algorithm. The similarity is that the forward variables are sums of "legal" path probabilities and the probability distributions at different times satisfy conditional independence. However, the CTC output at each time describes an output label probability, whereas the attention mechanism describes the probability distribution of the random variable of the attention position, and the definition of a "legal" path also differs: for the CTC algorithm, a legal path is one of the paths that can correspond to the correct label sequence, while for the forward attention mechanism, a legal path is one of the paths that satisfy monotonicity and continuity. As in the CTC algorithm, computing the forward variable does not require exhaustively summing over all legal paths, whose number grows exponentially and would make the computation infeasible. Instead, the forward variables can be computed by a forward algorithm whose core idea is dynamic programming: the forward variable at the current time is obtained recursively from the forward variable at the previous time:

a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) y_k(n)   (8)

A new attention probability can thus be obtained from the forward variable:

ŷ_k(n) = a_k(n) / Σ_{m=1}^N a_k(m)   (9)

After obtaining the new attention probability, ŷ_k(n) is used in place of y_k(n) in equation (3) to compute the context vector c_k. The modified computation is:

c_k = Σ_{n=1}^N ŷ_k(n) h_n   (10)
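A short PyTorch sketch of the forward-attention recursion of equations (8)-(10) is given below (batched); y_k is the content-based attention distribution of equation (4) at step k, and the initialization follows π_0 = 1.

import torch

def forward_attention_step(y_k, alpha_prev, h):
    """y_k, alpha_prev: (batch, N); h: (batch, N, dim)."""
    # a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) * y_k(n)   -- equation (8)
    shifted = torch.cat([torch.zeros_like(alpha_prev[:, :1]), alpha_prev[:, :-1]], dim=1)
    alpha = (alpha_prev + shifted) * y_k
    y_hat = alpha / (alpha.sum(dim=1, keepdim=True) + 1e-8)    # equation (9): renormalize
    context = torch.bmm(y_hat.unsqueeze(1), h).squeeze(1)      # equation (10): c_k = sum_n y_hat(n) h_n
    return context, alpha

# Initialization (pi_0 = 1): alpha_0 = [1, 0, ..., 0] for each sequence in the batch.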
some of the attention mechanism parameters are shown in table 5.
TABLE 5 attention mechanism part parameter List
Model parameters Parameter value
smoothing False
attention_dim 128
attention_filters 32
attention_kernel 31
cumulative_weights True
(2) Post-processing network
The goal of the post-processing network is to convert the sequence-to-sequence target output into a target representation that can be synthesized into a waveform, learning to predict the spectral amplitude sampled on a linear frequency scale. A further motivation for the post-processing network is that, unlike the ordinary sequence-to-sequence structure, which always runs sequentially from left to right, it sees the whole decoded sequence, so it can use both forward and backward information to correct single-frame prediction errors. The post-processing network is a 5-layer convolutional neural network; each layer consists of 256 convolution kernels of size 5×1 followed by batch normalization, and every layer except the last is followed by a tanh activation function after the batch normalization. The post-processing network parameters are shown in Table 6.
Table 6 post-processing network part parameter list
Model parameters Parameter value
postnet_layers_num 5
postnet_kernel_size 5
postnet_filters 256
4. Vocoder
The invention selects an audio generation model based on the diffusion probability model (Diffusion Probabilistic Model) to generate the speech waveform.
The diffusion probability model is a probability model based on a Markov chain; it divides the mapping between noise and the target waveform into T steps, forming a Markov chain. The diffusion process of the chain (from the target audio to noise) is trained, and decoding is then performed through the reverse process (from noise to the target audio).
First, define q_data(x_0) as the data distribution on R^L, where L is the data dimension; define x_t ∈ R^L, t = 0, 1, …, T, as a sequence of variables of the same dimension, where t is the index of the diffusion step and T is the total number of diffusion steps. The diffusion model consists of two processes: a diffusion process and a reverse process.
(1) Diffusion process:
The purpose of the diffusion process is to gradually map x_0 to a multidimensional normal distribution (Gaussian noise) through a Markov chain, i.e.:

q(x_1, …, x_T | x_0) = Π_{t=1}^T q(x_t | x_{t-1})   (11)

where q(x_t | x_{t-1}) is defined as the Gaussian distribution N(x_t; √(1-β_t) x_{t-1}, β_t I) related to the constant β_t. The process is equivalent to iteratively adding a small amount of Gaussian noise, finally converting the target into a multidimensional normal distribution whose dimensions are independent of one another.
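For illustration, the diffusion process can be sampled in closed form from x_0, as in the following sketch; the linear β_t schedule shown is the one mentioned later in the text, and the helper name is an assumption.

import torch

T = 200
betas = torch.linspace(1e-4, 0.02, T)          # linear variance schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative products of alpha_t

def diffuse(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    noise = torch.randn_like(x0) if noise is None else noise
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise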
(2) Reverse process:
The reverse process generates samples starting from a normal distribution:

p_latent(x_T) = N(0, I)   (12)

p_θ(x_0, …, x_{T-1} | x_T) = Π_{t=1}^T p_θ(x_{t-1} | x_t)   (13)

where p_latent(x_T) is an isotropic Gaussian distribution, and the transition probability p_θ(x_{t-1} | x_t) is parameterized as N(x_{t-1}; μ_θ(x_t, t), σ_θ(x_t, t)^2 I). The models μ_θ and σ_θ each take two inputs: the diffusion step t ∈ N and the variable x_t ∈ R^L; μ_θ outputs an L-dimensional vector as the mean, and σ_θ outputs a real number as the standard deviation. The purpose of p_θ(x_{t-1} | x_t) is to gradually eliminate the Gaussian noise added in the diffusion process and finally generate data that follows the target distribution.
(3) Sampling:
For the reverse process, the generation procedure first samples x_T ~ N(0, I), and then samples x_{t-1} ~ p_θ(x_{t-1} | x_t) for t = T, T-1, …, 1. The output x_0 is the generated sample.
(4) Training:
Before training, the training target of the model is specified, namely the maximum likelihood p_θ(x_0). The model is trained by maximizing the variational lower bound:

E_{q_data(x_0)} log p_θ(x_0) ≥ E_q log [ p_latent(x_T) Π_{t=1}^T p_θ(x_{t-1} | x_t) / Π_{t=1}^T q(x_t | x_{t-1}) ] = ELBO   (14)

where the expectation E_q is taken over x_0 ~ q_data(x_0) and x_1, …, x_T ~ q(x_1, …, x_T | x_0), and ELBO is the evidence lower bound.
Under certain parameterization conditions, the ELBO (Evidence Lower Bound) of the diffusion model can be computed in closed form. This not only speeds up computation but also avoids Monte Carlo estimates with excessive variance. The parameterization is motivated by its connection to denoising score matching with Langevin dynamics. To introduce this parameterization, constants based on the variance schedule of the diffusion process are defined:

ᾱ_t = Π_{s=1}^t α_s,   β̃_1 = β_1,   and for t > 1:   β̃_t = β_t (1 - ᾱ_{t-1}) / (1 - ᾱ_t)

where β_t is the forward-process variance; for ease of presentation, the symbol α_t = 1 - β_t is used.
Then, muθAnd σθParameterization of (c) defines:
Figure BDA0003336592710000143
wherein the content of the first and second substances,
Figure BDA0003336592710000144
is a same as xtAnd a neural network with the diffusion step number t as input; sigmaθ(xtT) is fixed to be constant
Figure BDA0003336592710000145
For each step under this parameterization, a closed form expression for ELBO is given as follows:
given a series of fixed schedules
Figure BDA0003336592710000146
And x0~qdata(ii) a Then at expectation EqWith parameterization of (a), we get:
Figure BDA0003336592710000147
for constants c and κtWherein
Figure BDA0003336592710000148
And for t>1, there are
Figure BDA0003336592710000149
Where c is independent of optimization objectives. The key idea demonstrated is to expand ELBO to the sum of KL divergence between controllable gaussian distributions with closed-form expressions.
Minimizing the following unweighted ELBO variables can improve the quality of generation:
Figure BDA00033365927100001410
wherein T is uniformly valued at 1. Therefore, this training target is also used in the model of the present invention.
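A sketch of the corresponding training step, minimizing the unweighted objective of equation (16), is shown below; eps_model and the mel conditioning argument are assumptions for the example.

import torch

def diffusion_loss(eps_model, x0, mel, alpha_bars):
    """x0: (batch, L) waveform segments; alpha_bars: precomputed cumulative products of alpha_t."""
    t = torch.randint(0, len(alpha_bars), (x0.size(0),))          # uniform diffusion step per sample
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps          # closed-form q(x_t | x_0)
    eps_pred = eps_model(x_t, t, mel)
    return torch.nn.functional.mse_loss(eps_pred, eps)            # || eps - eps_theta(...) ||^2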
(5) Diffusion-step embedding:
Different diffusion steps t are taken as input, so that the model outputs a different ε_θ(·, t) for each t. A 128-dimensional encoding vector is used for each t:

t_embedding = [ sin(10^{0×4/63} t), …, sin(10^{63×4/63} t), cos(10^{0×4/63} t), …, cos(10^{63×4/63} t) ]

Three fully connected (FC) layers are then applied to this encoding, where the first two FC layers share parameters among the residual layers. The last FC layer maps the output of the second FC layer to a C-dimensional (residual-channel) embedding vector. This vector is then broadcast and added to the input of each residual layer.
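The diffusion-step embedding can be sketched as follows; the sinusoidal frequency scaling and the fully connected dimensions follow the description above in the style of Diffwave, and the exact layer sizes are assumptions for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

def step_encoding(t, dim=128):
    half = dim // 2
    freqs = 10.0 ** (torch.arange(half) * 4.0 / (half - 1))   # 10^(j*4/63), j = 0..63
    angles = t.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class StepEmbedding(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fc1 = nn.Linear(128, 512)        # shared across residual layers
        self.fc2 = nn.Linear(512, 512)        # shared across residual layers
        self.fc3 = nn.Linear(512, channels)   # projection to C residual channels

    def forward(self, t):
        e = step_encoding(t)
        e = F.silu(self.fc2(F.silu(self.fc1(e))))
        return self.fc3(e)                    # broadcast-added to each residual layer's input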
The model has a conditioner (Conditioner) to encode conditioning information such as the mel spectrum and speaker labels. During training and decoding, the total number of diffusion steps T and the variances β_t are set in advance; for example, T = 200 with the best-performing schedule β_t spanning [1×10^-4, 0.02], i.e. starting at 1×10^-4 and increasing linearly to 0.02. The larger T is, the more iterations are performed and the better the generation effect.
(6) Conditioner: the neural vocoders are tested using an 80-band mel spectrum of the original audio as the conditioner. The FFT size is set to 1024, the hop size to 256 and the window size to 1024. The mel spectrogram is upsampled 256 times in time through two layers of transposed two-dimensional convolution (in time and frequency), each followed by a leaky ReLU (α = 0.4). For each layer, the upsampling stride in time is 16 and the two-dimensional filter size is [32, 3]. After upsampling, the 80 mel bands are mapped to 2× the residual channels using a layer-specific Conv1×1, and the conditioner output is then added as a bias term to the dilated convolution before the gated tanh nonlinearity of each residual layer. The vocoder parameters are shown in Table 7.
TABLE 7 partial parameter List of vocoders
5. Data set and training
Training is carried out on a server equipped with an Nvidia GTX 1080Ti. The data set is the free, open Chinese female speech synthesis database (BZNSYP) released by Databaker (Biaobei) Technology in November 2018 (download address: https://weixinxcxddb.oss-cn-beijing.aliyunncs.com/gwyinpinnKu/BZNSYP.rar), which contains 10000 sentences of Chinese female speech (about 12 hours in total) and the text label documents corresponding to all audio files. In the experiments, 95% of the data set is used as the training set and 5% as the test set.
The audio files are processed into mel-spectrum feature matrices to extract the acoustic features of the speech, and during training the pinyin labels are aligned with the corresponding spectrograms.
The text is converted into pinyin sequences; only the four symbols '，', '。', '？' and '！' are retained, and the remaining symbols are converted into one of these four according to the rules given earlier. A word embedding layer is used in the model, and the word vector of each token in the corpus is continuously learned during training.
When training the spectrum generation network, the batch size is set to 32 and the learning rate is decayed exponentially: the initial learning rate is 0.001, and once the number of iteration steps reaches 50k the learning rate decays exponentially down to 0.00001 (reached at approximately 310k steps).
When training the vocoder, to ensure data consistency the mel spectrograms generated by the model are used as input; the Adam optimizer is used with a batch size of 16 and a learning rate of 2×10^-4, and training runs for 1M steps.
6. Feature extraction:
(1) word embedding
The speech synthesis technique is to let the machine learn to map each character including spaces and punctuation to a certain number of frames of the mel-frequency spectrum.
Since plain text cannot be fed directly into a deep learning model, for Chinese the Hanzi sequence is first converted into a pinyin sequence (only the symbols '，', '。', '？' and '！' are retained). The word vector of each token in the corpus is continuously learned during training using a word embedding layer initialized from a truncated normal distribution with standard deviation stddev.
(2) Audio feature extraction
For audio, mel-spectrum features are mainly extracted. Mel-Frequency Cepstral Coefficients (MFCC) are a relatively common audio feature. Sound is a one-dimensional time-domain signal in which the frequency-domain behaviour is difficult to see intuitively; a Fourier transform can be used to obtain its frequency-domain information, but the time-domain information is then lost and the variation of the frequency content over time cannot be seen, so neither representation alone describes the sound well.
7. Experiment:
(1) introduction to evaluation methods
The model of the invention was evaluated by subjective and objective evaluation methods.
1) Subjective evaluation method:
The subjective evaluation method used is the Mean Opinion Score (MOS), which mainly concerns the naturalness and intelligibility of the synthesized speech. MOS values are graded on a five-level scale from 1 to 5; the higher the score, the better the speech quality. The evaluation criteria of the mean opinion score are shown in Table 8.
TABLE 8 evaluation criteria for mean opinion score
Grade → Score → Evaluation criteria
Excellent → 5.0 → Pronunciation is clear; delay is small, communication is smooth, overall listening impression is good; very similar
Good → 4.0 → Pronunciation is clear and intelligible; delay is small, communication slightly less smooth, slight noise; fairly similar
Fair → 3.0 → Basically intelligible; communication is possible with a certain delay, overall impression is not fluent; moderately similar
Poor → 2.0 → Very difficult to understand and hear; delay is large, communication needs to be repeated many times; slightly similar
Bad → 1.0 → Pronunciation is unclear and hard to understand; delay is large, communication is not smooth; completely dissimilar
Calculating the MOS value:
m sentences are selected to evaluate K speech synthesis systems, MK samples are formed jointly, N tested points are scored, and the average score mu of the system is expected to be obtained. To improve the random significance of the metric results, scores within the 95% confidence interval were used as the average score of the system, and the formula is as follows:
μmn=μ+xm+yn+zmn (20)
Figure BDA0003336592710000181
Figure BDA0003336592710000182
Figure BDA0003336592710000183
wherein the content of the first and second substances,
Figure BDA0003336592710000184
for modeling sentence quality, subject preferences and subjective uncertainty,
Figure BDA0003336592710000185
depending on the particular system under test and the test environment. Then calculate
Figure BDA0003336592710000186
The formula is as follows:
Figure BDA0003336592710000187
Figure BDA0003336592710000188
can be obtained from a least squares estimation, the formula is as follows:
Figure BDA0003336592710000189
the resulting estimate of the mean score variance is:
Figure BDA00033365927100001810
the confidence interval for obtaining the average score according to the t distribution by combining the above formula is as follows:
Figure BDA00033365927100001811
wherein, the degree of freedom of t distribution is min (N, M) -1, the confidence coefficient is selected to be 95%, and the value of t can be obtained by looking up a table.
2) Objective evaluation method:
In the objective evaluation method, the difference between the synthesized speech and the real speech is measured by the Mel Cepstral Distortion (MCD). The MCD represents the difference between the MFCC features of the converted speech and those of the reference speech; the smaller the distortion value, the better the sound quality of the synthesized speech.
The MCD is calculated as:

MCD = (α / M) Σ_{m=1}^M √( Σ_{l=1}^L ( s(l, m) - ŝ(l, m) )^2 )

where α is a scaling factor, typically taken as 10√2 / ln 10; l and m are the mel-cepstral index and the frame index respectively, M is the number of speech frames, L is the mel-cepstral dimension, and s(l, m) and ŝ(l, m) are the mel cepstra of the real speech and the synthesized speech, respectively.
(2) Comparative experiment of speech synthesis model
In order to demonstrate that the speech synthesis model provided by the invention has clear advantages in speech synthesis quality and naturalness, mainstream deep-learning-based speech synthesis models are selected for comparison, including the autoregressive model Tacotron2 (Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [C]. Proceedings of the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, Alberta, April 15-20, 2018, pp. 4779-4783.) and the non-autoregressive model FastSpeech2 (Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech [C]. Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), May 2021.).
According to the characteristics of these models, three popular vocoders are paired with them for the comparison experiments: Griffin-Lim (Perraudin N, Balazs P, Søndergaard P L. A Fast Griffin-Lim Algorithm [C]. Proceedings of the 14th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2013), New Paltz, New York, U.S.A., October 2013, pp. 1-4.), MB-MelGAN (Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie. Multi-Band MelGAN: Faster Waveform Generation For High-Quality Text-To-Speech [C]. Proceedings of the 8th IEEE Spoken Language Technology Workshop (SLT 2021), Shenzhen, China, January 2021, pp. 492-498.) and WaveRNN (Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, Koray Kavukcuoglu. Efficient Neural Audio Synthesis [C]. Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, July 2018, pp. 2410-2419.).
Griffin-Lim: the model generates speech under the condition of known magnitude spectrum and unknown phase spectrum. Griffin-Lim acquires the entire spectrogram, iteratively estimating the missing phase information by repeatedly switching between the frequency domain and the time domain. In this experiment, 60 iterations from the frequency to the time domain were used.
MB-MelGAN: i.e., Multi-Band MelGAN. On the basis of the MelGAN, the model increases the receptive field of a generator, and meanwhile, the multi-resolution short-time Fourier transform loss is used for replacing the characteristic matching loss, so that better training quality and better training stability are brought.
waveRNN: the model is a high-speed audio synthesis vocoder based on a neural network, and the high-speed audio synthesis vocoder has a single-layer RNN network architecture and uses two softmax layers. One advantage of WavRNN is the use of matrix sparsification, which will increase the synthesis speed by a factor of 10.
In the subjective evaluation, the quality and naturalness of the synthesized speech are evaluated using the Mean Opinion Score (MOS). First, 20 texts are randomly selected from the test set; the synthesized speech and the real speech are mixed together and randomly shuffled to form the speech set to be tested. Each utterance is scored by 15 testers in the same laboratory environment with noise below 30 decibels; the scores of all testers are collected, scores with large errors are removed, and the average score at 95% confidence is then computed as the final score.
In the objective evaluation, this embodiment measures the difference between synthesized speech and real speech using the mel cepstral distortion (MCD). 20 texts are selected for speech synthesis, and each text has a corresponding real recorded audio. The MCD of the 20 texts is calculated separately for the different models, and the average value is then computed for each model as its final score.
TABLE 9 comparison table of experimental results of speech synthesis model
Acoustic model MOS MCD
Tacotron2+Griffin-Lim 4.04±0.19 6.80
Tacotron2+MB-MelGAN 4.17±0.25 7.14
Tacotron2+WaveRNN 4.19±0.10 6.69
FastSpeech2+MB-MelGAN 3.96±0.15 6.97
Model of the invention 4.22±0.30 6.22
True speech 4.50±0.24
The results of the MOS and MCD experiments are shown in Table 9. As can be seen from Table 9, the MOS score of the model of the invention is 4.22, higher than that of the other models, showing that its speech synthesis quality and naturalness are better. Furthermore, the MCD score of the model of the invention is 6.22, lower than that of the other models, indicating that the difference between its synthesized speech and real speech is the smallest.
In addition to performing the quantitative comparison experiment of MCD, the present embodiment also graphically represents the difference between the synthesized speech and the real speech through the mel-frequency spectrogram. Fig. 2 is a comparison of the mel spectrum generated by the models and the real mel spectrum, wherein, the diagram (a) is the real mel spectrum, and other spectrograms are the mel spectra generated by the models. It can be seen from fig. 2 that the mel-frequency spectrogram generated by the model of the present invention is closer to the true mel-frequency spectrum than other speech synthesis models.

Claims (5)

1. A Chinese speech synthesis method based on a diffusion probability model is characterized by comprising the following steps:
s1: text front-end processing:
acquiring a text data set, constructing a Chinese text front-end processing module, and performing Mandarin text-to-phoneme conversion, text regularization and punctuation mark deletion or conversion on the text data set to obtain a phoneme sequence;
s2: constructing an end-to-end spectrum generation network based on a forward attention mechanism to encode and decode the processed text:
encoding: the encoder module processes the input phoneme sequence to obtain a hidden-layer sequence, and at each decoding time an attention mechanism performs a soft selection over the input sequence to obtain an attention context vector as the input of the decoder;
decoding: the decoder module passes the prediction of the previous time step through a pre-processing network; the output of the pre-processing network and the attention context vector are concatenated and passed through a stack of two unidirectional LSTM layers; the concatenation of the LSTM-layer output and the attention context vector is projected by a linear transformation to predict the target spectrogram frame; the predicted mel spectrogram then passes through a 5-layer convolutional post-processing network, which adds a prediction residual to the prediction to improve the overall reconstruction;
s3: chinese speech synthesis using a Diffwave vocoder based on a diffusion probability model:
the diffusion probability model divides the mapping relation between the noise and the target waveform into T steps to form a Markov chain, trains the diffusion process of the chain, namely from the target audio to the noise, and then decodes the chain through the reverse process, namely from the noise to the target audio.
2. The method as claimed in claim 1, wherein the Mandarin text-to-phoneme conversion is specifically as follows: for the Chinese characters of each sentence of the text data set, taken in order from left to right, first search whether a word beginning with the current Chinese character exists in the word pinyin library and check whether the following Chinese characters in the text match that word; if they match, the pinyin of the word is obtained directly from the word pinyin library; if not, the pinyin of the single Chinese character is obtained from the character pinyin library.
3. The method of Chinese speech synthesis based on the diffusion probability model of claim 1, wherein the encoder module comprises: a character embedding layer, a 3-layer convolution and a bidirectional LSTM layer; the input characters are encoded into 128-dimensional character vectors and then passed through 3 convolution layers, each containing 256 convolution kernels of size 5×1, i.e. each kernel spans 5 characters; the convolution layers perform large-span context modelling on the input character sequence, each convolution layer is followed by batch normalization, and a ReLU activation function is used for activation; the output of the final convolution layer is passed to the bidirectional LSTM layer to generate the encoding features;

f_e = ReLU(F_3 * ReLU(F_2 * ReLU(F_1 * E(X))))   (1)

H = EncoderRecurrency(f_e)   (2)

where f_e denotes the encoding features, F_1, F_2, F_3 are the 3 convolution kernels, * denotes convolution, ReLU(·) denotes the nonlinear activation on each convolution layer, E(X) denotes the embedding of the character sequence X, EncoderRecurrency(·) denotes the recurrent neural network (bidirectional LSTM) in the encoder, and H is the output encoder hidden state.
4. The method of Chinese speech synthesis based on the diffusion probability model of claim 1, wherein the forward attention mechanism comprises:
letting the phoneme sequence input to the encoder be x = [x_1, x_2, …, x_N], where N is the length of the phoneme sequence; the encoder produces the hidden-layer sequence h = [h_1, h_2, …, h_N]; at each decoding time k, the attention mechanism performs a soft selection over the input sequence to obtain a context vector c_k as input to the decoder;
letting the query vector of the attention mechanism be s_k; the attention mechanism selects as input the encoder output at a position between 1 and N, the position being described by a random variable π_k ∈ {1, …, N}; the modelling target of the attention mechanism is the probability distribution of this position variable, p(π_k | h, s_k); the context vector is computed as:

c_k = Σ_{n=1}^N y_k(n) h_n   (3)

where y_k(n) = p(π_k = n | h, s_k) denotes the probability that the attention stays at encoder output position n at decoding time k;
the content-based attention mechanism is computed as:

e_{k,n} = v^T tanh(W s_k + V h_n + b),   y_k(n) = exp(e_{k,n}) / Σ_{m=1}^N exp(e_{k,m})   (4)

where W, V, b and v are model parameters, and e_{k,n} evaluates the degree of matching between s_k and h_n;
assuming that the random variables π_k of the attention position at different times are conditionally independent given the encoder output h and the query vectors s_k, the probability of an alignment path π_{1:k} = {π_1, π_2, …, π_k} is:

p(π_{1:k} | h, s_{1:k}) = Π_{k'=1}^k y_{k'}(π_{k'})   (5)

where s_{1:k} is the set of query vectors {s_1, s_2, …, s_k}, and y_{k'}(π_{k'}) denotes the probability that the attention stays at encoder output position π_{k'} at any time k' before the current decoding time k;
each path in the set P of legal attention paths satisfies monotonicity and continuity; given the constraint of monotonic paths, the conditional probability of the attention distribution is:

p(π_k | h, s_{1:k}, π_{0:k} ∈ P)   (6)

the forward variable a_k(n) is then defined as:

a_k(n) = Σ_{π_{1:k} ∈ P, π_k = n} Π_{k'=1}^k y_{k'}(π_{k'})   (7)

using a dynamic programming algorithm, the forward variable at the current time is obtained recursively from the forward variable at the previous time:

a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) y_k(n)   (8)

a new attention probability is obtained from the forward variable:

ŷ_k(n) = a_k(n) / Σ_{m=1}^N a_k(m)   (9)

and ŷ_k(n) is used in place of y_k(n) in formula (3) to compute the context vector c_k:

c_k = Σ_{n=1}^N ŷ_k(n) h_n   (10)
5. The method for Chinese speech synthesis based on diffusion probability model according to claim 1, wherein the S3 specifically includes:
S31:definition of qdata(x0) Is composed of
Figure FDA0003336592700000041
Wherein L is the data dimension; definition of
Figure FDA0003336592700000042
T is a variable sequence with the same dimensionality, T is an index of diffusion step number, and T is total diffusion step number; the diffusion probability model comprises a diffusion process and a reverse process;
the purpose of the diffusion process is to propagate x through a Markov chain0Gradually mapping to a multidimensional normal distribution, i.e.:
Figure FDA0003336592700000043
wherein q (x)t|xt-1) Is defined as a sum constant betatRelated Gaussian distribution
Figure FDA0003336592700000044
I is an identity matrix; the reverse process is generated based on normally distributed samples:
platent(xT)=N(0,I) (12)
Figure FDA0003336592700000045
wherein p islatent(xT) Is an isotropic Gaussian distribution with a transition probability pθ(xt-1|xt) Parameterized as Gaussian distribution N (x)t-1;μθ(xt,t),σθ(xt,t)2I);
Wherein, the model muθAnd model σθThere are two inputs each: number of diffusion steps
Figure FDA0003336592700000046
And variables
Figure FDA0003336592700000047
Wherein L is the data dimension; model muθOutputting an L-dimensional vector as a mean value, model sigmaθOutputting a real number as a standard deviation; p is a radical ofθ(xt-1|xt) The purpose of the method is to gradually eliminate Gaussian noise in the diffusion process and finally generate data which accord with target distribution;
S32: sampling
for the reverse process, the generation process first samples x_T \sim N(0, I), and then samples x_{t-1} \sim p_\theta(x_{t-1} \mid x_t) for t = T, T-1, \dots, 1; the output x_0 is the generated sample;
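A minimal sketch of the ancestral sampling loop described in S32, assuming the parameterization of \mu_\theta and \sigma_\theta given later in S33; `eps_model` is a stand-in for the trained noise-prediction network \epsilon_\theta, and omitting the noise term at the final step is a common convention rather than part of the claim.

```python
import numpy as np

def reverse_sample(eps_model, betas, alphas, alphas_bar, shape, seed=0):
    """Ancestral sampling for the reverse process of S32."""
    rng = np.random.default_rng(seed)
    T = len(betas)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps = eps_model(x, t)
        # mean mu_theta(x_t, t) of formula (16)
        mean = (x - betas[t-1] / np.sqrt(1.0 - alphas_bar[t-1]) * eps) / np.sqrt(alphas[t-1])
        if t > 1:
            # fixed standard deviation beta_tilde_t ** 0.5
            sigma = np.sqrt((1.0 - alphas_bar[t-2]) / (1.0 - alphas_bar[t-1]) * betas[t-1])
            x = mean + sigma * rng.standard_normal(shape)
        else:
            x = mean                                      # no noise added at the final step
    return x                                              # x_0, the generated sample
```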
S33: training
before training, the training objective of the model is first analyzed, namely maximizing the likelihood p_\theta(x_0); the model is trained by maximizing the variational lower bound:

\mathbb{E}_{q_{data}(x_0)} \log p_\theta(x_0) \;\geq\; \mathbb{E}_{q_{data}(x_0)}\, \mathbb{E}_{q(x_1, \dots, x_T \mid x_0)} \log \frac{p_{latent}(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)}{\prod_{t=1}^{T} q(x_t \mid x_{t-1})} := \mathrm{ELBO}   (14)

wherein \mathbb{E}_{q_{data}(x_0)} denotes the expectation of x_0 over the distribution q_{data}(x_0), and \mathbb{E}_{q(x_1, \dots, x_T \mid x_0)} denotes the expectation over the distribution q(x_1, \dots, x_T \mid x_0); ELBO is the evidence lower bound;
constants are defined based on the variance schedule of the diffusion process:

\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t \ (t > 1), \qquad \tilde{\beta}_1 = \beta_1   (15)

wherein \beta_t is the forward-process variance; for ease of presentation, the symbol \alpha_t denotes \alpha_t = 1 - \beta_t;
the parameterizations of \mu_\theta and \sigma_\theta are then defined as:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)   (16)

wherein \epsilon_\theta : \mathbb{R}^L \times \mathbb{N} \rightarrow \mathbb{R}^L is a neural network taking x_t and the diffusion step t as input; \sigma_\theta(x_t, t) is fixed to the constant \tilde{\beta}_t^{1/2};
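The schedule constants of formula (15) and the fixed standard deviation can be precomputed once; the sketch below does so in NumPy, with a linear beta schedule chosen purely as an illustrative assumption.

```python
import numpy as np

def schedule_constants(betas):
    """Precompute the constants of formula (15) from a variance schedule
    betas = [beta_1, ..., beta_T]."""
    alphas = 1.0 - betas                                  # alpha_t = 1 - beta_t
    alphas_bar = np.cumprod(alphas)                       # \bar{alpha}_t
    beta_tilde = np.empty_like(betas)
    beta_tilde[0] = betas[0]                              # \tilde{beta}_1 = beta_1
    beta_tilde[1:] = (1.0 - alphas_bar[:-1]) / (1.0 - alphas_bar[1:]) * betas[1:]
    return alphas, alphas_bar, beta_tilde

# example: T = 50 steps with a linear beta schedule (an assumption, not claimed)
betas = np.linspace(1e-4, 0.05, 50)
alphas, alphas_bar, beta_tilde = schedule_constants(betas)
```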
under this parameterization, a closed-form expression for the ELBO is obtained as follows: given a fixed variance schedule \beta_1, \dots, \beta_T, let \epsilon \sim N(0, I) and x_0 \sim q_{data}; then, taking the expectation E_q under the above parameterization yields:

\mathrm{ELBO} = c - \sum_{t=1}^{T} \kappa_t\, \mathbb{E}_{x_0, \epsilon} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right) \right\|_2^2   (17)

for constants c and \kappa_t, wherein \kappa_1 = \frac{1}{2\alpha_1} and, for t > 1, \kappa_t = \frac{\beta_t}{2\alpha_t (1-\bar{\alpha}_{t-1})};
the following unweighted variant of the ELBO is minimized to improve generation quality:

\min_\theta L_{unweighted}(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right) \right\|_2^2   (18)

wherein t is sampled uniformly from 1, \dots, T;
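A single Monte-Carlo evaluation of the unweighted objective (18) might look like the following sketch; the closed-form corruption x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon is taken from formula (17), while the function signature and the single-sample estimate are assumptions for illustration.

```python
import numpy as np

def unweighted_loss(eps_model, x0, alphas_bar, seed=0):
    """One Monte-Carlo sample of the unweighted objective (18): draw t
    uniformly, corrupt x_0 to x_t in closed form, and measure how well
    eps_model recovers the injected noise."""
    rng = np.random.default_rng(seed)
    T = len(alphas_bar)
    t = int(rng.integers(1, T + 1))                       # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)                   # epsilon ~ N(0, I)
    x_t = np.sqrt(alphas_bar[t-1]) * x0 + np.sqrt(1.0 - alphas_bar[t-1]) * eps
    return np.sum((eps - eps_model(x_t, t)) ** 2)         # squared L2 distance
```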
S34: diffusion step embedding
different diffusion steps t are taken as input, so that the model can output a different \epsilon_\theta(\cdot, t) for each t; a 128-dimensional encoding vector is used for each t:

t_{embedding}(t) = \left[ \sin\!\left(10^{\frac{0 \times 4}{63}} t\right), \dots, \sin\!\left(10^{\frac{63 \times 4}{63}} t\right), \cos\!\left(10^{\frac{0 \times 4}{63}} t\right), \dots, \cos\!\left(10^{\frac{63 \times 4}{63}} t\right) \right]

three fully-connected (FC) layers are then applied to this encoding, wherein the first two FC layers share parameters across the residual layers; the last FC layer maps the output of the second FC layer into a C-dimensional embedding vector; this vector is then broadcast and added to the input of each residual layer.
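The 128-dimensional step encoding of S34 can be sketched as follows; the exact frequency spacing follows the cited DiffWave paper, and the helper name `step_embedding` is an assumption for illustration.

```python
import numpy as np

def step_embedding(t, dim=128):
    """Sinusoidal encoding of the diffusion step t described in S34:
    64 sine and 64 cosine terms with frequencies 10^(i*4/63), i = 0..63."""
    half = dim // 2
    freqs = 10.0 ** (np.arange(half) * 4.0 / (half - 1))
    angles = freqs * t
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = step_embedding(17)   # a 128-dimensional vector for diffusion step t = 17
```

The encoding is then passed through the three fully-connected layers described above before being added to each residual layer's input.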
CN202111295924.5A 2021-11-03 2021-11-03 Chinese speech synthesis method based on diffusion probability model Pending CN114023300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111295924.5A CN114023300A (en) 2021-11-03 2021-11-03 Chinese speech synthesis method based on diffusion probability model


Publications (1)

Publication Number Publication Date
CN114023300A true CN114023300A (en) 2022-02-08

Family

ID=80060249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111295924.5A Pending CN114023300A (en) 2021-11-03 2021-11-03 Chinese speech synthesis method based on diffusion probability model

Country Status (1)

Country Link
CN (1) CN114023300A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113678200A (en) * 2019-02-21 2021-11-19 谷歌有限责任公司 End-to-end voice conversion
CN111951778A (en) * 2020-07-15 2020-11-17 天津大学 Method for synthesizing emotion voice by using transfer learning under low resource
CN112652291A (en) * 2020-12-15 2021-04-13 携程旅游网络技术(上海)有限公司 Speech synthesis method, system, device and storage medium based on neural network
CN113345415A (en) * 2021-06-01 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JING-XUAN ZHANG 等: "FORWARD ATTENTION IN SEQUENCE-TO-SEQUENCE ACOUSTIC MODELING FOR SPEECH SYNTHESIS", 《ARXIV:1807.06736V1》 *
JONATHAN SHEN 等: "NATURAL TTS SYNTHESIS BY CONDITIONINGWAVENET ON MEL SPECTROGRAM PREDICTIONS", 《ICASSP 2018》 *
ZHIFENG KONG 等: "DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS", 《ARXIV:2009.09761V3》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116884495A (en) * 2023-08-07 2023-10-13 成都信息工程大学 Diffusion model-based long tail chromatin state prediction method
CN116884495B (en) * 2023-08-07 2024-03-08 成都信息工程大学 Diffusion model-based long tail chromatin state prediction method
CN116884391A (en) * 2023-09-06 2023-10-13 中国科学院自动化研究所 Multimode fusion audio generation method and device based on diffusion model
CN116884391B (en) * 2023-09-06 2023-12-01 中国科学院自动化研究所 Multimode fusion audio generation method and device based on diffusion model
CN116977652A (en) * 2023-09-22 2023-10-31 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation
CN116977652B (en) * 2023-09-22 2023-12-22 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation
CN117423329A (en) * 2023-12-19 2024-01-19 北京中科汇联科技股份有限公司 Model training and voice generating method, device, equipment and storage medium
CN117423329B (en) * 2023-12-19 2024-02-23 北京中科汇联科技股份有限公司 Model training and voice generating method, device, equipment and storage medium
CN117809621A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Van Den Oord et al. Wavenet: A generative model for raw audio
Le et al. Voicebox: Text-guided multilingual universal speech generation at scale
Huang et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation
US20200026760A1 (en) Enhanced attention mechanisms
CN113439301A (en) Reconciling between analog data and speech recognition output using sequence-to-sequence mapping
Liu et al. Towards unsupervised speech recognition and synthesis with quantized speech representation learning
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Jemine Real-time voice cloning
US20230036020A1 (en) Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score
Luo et al. Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
Kaur et al. Conventional and contemporary approaches used in text to speech synthesis: A review
KR102272554B1 (en) Method and system of text to multiple speech
Khanam et al. Text to speech synthesis: A systematic review, deep learning based architecture and future research direction
Yang et al. Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise
CN114495969A (en) Voice recognition method integrating voice enhancement
Schnell et al. Investigating a neural all pass warp in modern TTS applications
Tan Neural text-to-speech synthesis
Mei et al. A particular character speech synthesis system based on deep learning
US20230178069A1 (en) Methods and systems for synthesising speech from text
Zhao et al. Research on voice cloning with a few samples
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
Wen et al. Improving deep neural network based speech synthesis through contextual feature parametrization and multi-task learning
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Liu et al. A New Speech Encoder Based on Dynamic Framing Approach.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220208