CN114023300A - Chinese speech synthesis method based on diffusion probability model - Google Patents

Chinese speech synthesis method based on diffusion probability model Download PDF

Info

Publication number
CN114023300A
Authority
CN
China
Prior art keywords
diffusion
model
attention
probability
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111295924.5A
Other languages
Chinese (zh)
Inventor
王海舟
范润琦
吴英奡
许晋荣
张新悦
吴心宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202111295924.5A priority Critical patent/CN114023300A/en
Publication of CN114023300A publication Critical patent/CN114023300A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 2013/083 Special characters, e.g. punctuation marks

Abstract

The invention discloses a Chinese speech synthesis method based on a diffusion probability model, which first constructs a Chinese text front-end processing module, then builds an end-to-end spectrum generation network based on a forward attention mechanism, and finally synthesizes Chinese speech using a Diffwave vocoder based on the diffusion probability model. The invention adopts a forward attention mechanism to address problems such as poor speech-frame alignment when synthesizing long Chinese sentences, and uses a non-autoregressive Diffwave vocoder based on a diffusion probability model in the vocoder part, which significantly improves the quality and efficiency of the synthesized speech.

Description

Chinese speech synthesis method based on diffusion probability model
Technical Field
The invention relates to the technical field of artificial-intelligence speech synthesis, and in particular to a Chinese speech synthesis method based on a diffusion probability model.
Background
Speech synthesis technology generally refers to the conversion of text into speech. With the continuous development and maturation of internet technology, information technology, artificial intelligence and related fields, and with the popularization of intelligent terminals, new human-computer interaction modes represented by synthetic speech technology have quietly become widespread. Nowadays, speech synthesis is widely applied in scenarios such as map navigation, voice assistants, audiobook reading and short-video dubbing.
With the continuous development of deep learning, many speech synthesis models have achieved good results. The currently common deep-learning-based speech synthesis scheme mainly comprises two steps: first, acoustic features such as the mel spectrum are predicted from the text information; a vocoder is then used to convert the predicted acoustic features into the original audio waveform. The currently popular deep-learning-based speech synthesis models are mainly divided into autoregressive and non-autoregressive types. The main problem faced by autoregressive speech synthesis is slow synthesis speed: the WaveNet vocoder used by Tacotron2 is an autoregressive convolutional neural network, and in order to solve the long-range dependence problem WaveNet must balance the receptive field against the number of parameters. WaveNet stacks multiple layers of one-dimensional dilated convolutions with a kernel width of 2, so the receptive field grows exponentially with the number of layers, and the synthesis speed is slow. The main problem of traditional non-autoregressive speech synthesis is low synthesis quality: for example, the FastSpeech model generates mel spectrograms in parallel and thereby accelerates the synthesis process; FastSpeech is trained on a Transformer structure, but the extracted alignment is not accurate enough and the obtained target mel spectrum loses some information, so the resulting sound quality is poor.
Disclosure of Invention
In view of the above problems, the object of the present invention is to provide a Chinese speech synthesis method based on a diffusion probability model, which uses a forward attention mechanism in the decoder and a Diffwave vocoder based on a diffusion probability model to achieve more efficient and higher-quality Chinese speech synthesis. The technical scheme is as follows:
a Chinese speech synthesis method based on a diffusion probability model comprises the following steps:
s1: text front-end processing:
acquiring a text data set, constructing a Chinese text front-end processing module, and performing Mandarin text-to-phoneme conversion, text regularization and punctuation mark deletion or conversion on the text data set to obtain a phoneme sequence;
s2: constructing an end-to-end spectrum generation network based on a forward attention mechanism to encode and decode the processed text:
encoding: the encoder module processes the input phoneme sequence to obtain a hidden-layer sequence, and at each decoding time an attention mechanism performs a soft selection over the input sequence to obtain an attention context vector as the input of the decoder;
decoding: the decoder module passes the prediction of the previous time step through a pre-processing network; the output of the pre-processing network and the attention context vector are concatenated and passed through a stack of two unidirectional LSTM layers; the concatenation of the LSTM-layer output and the attention context vector is projected by a linear transformation to predict the target spectrogram frame; the predicted mel spectrogram then passes through a 5-layer convolutional post-processing network, which adds a prediction residual to the prediction to improve the overall reconstruction;
s3: chinese speech synthesis using a Diffwave vocoder based on a diffusion probability model:
the diffusion probability model divides the mapping relation between the noise and the target waveform into T steps to form a Markov chain, trains the diffusion process of the chain, namely from the target audio to the noise, and then decodes the chain through the reverse process, namely from the noise to the target audio.
Further, the Mandarin text-to-phoneme conversion is specifically: for the Chinese characters of each sentence of the text data set, taken in order from left to right, first search whether a word beginning with the current Chinese character exists in the word pinyin library and check whether the following Chinese characters in the text match that word; if they match, the pinyin of the word is obtained directly from the word pinyin library; if not, the pinyin of the single Chinese character is obtained from the character pinyin library.
Further, the encoder module includes: a character embedding layer, a 3-layer convolution and a bidirectional LSTM layer. The input characters are encoded into 128-dimensional character vectors and then passed through 3 convolution layers, each containing 256 convolution kernels of size 5×1, i.e. each kernel spans 5 characters; the convolution layers perform large-span context modelling on the input character sequence, each convolution layer is followed by batch normalization, and a ReLU activation function is used for activation; the output of the final convolution layer is passed to the bidirectional LSTM layer to generate the encoding features:

f_e = ReLU(F_3 * ReLU(F_2 * ReLU(F_1 * E(X))))   (1)

H = EncoderRecurrency(f_e)   (2)

where f_e denotes the encoding features, F_1, F_2, F_3 are the 3 convolution kernels, * denotes convolution, ReLU(·) denotes the nonlinear activation on each convolution layer, E(X) denotes the embedding of the character sequence X, EncoderRecurrency(·) denotes the recurrent neural network (bidirectional LSTM) in the encoder, and H is the output encoder hidden state.
Further, let the phoneme sequence input to the encoder be x = [x_1, x_2, …, x_N], where N is the length of the phoneme sequence; the encoder produces the hidden-layer sequence h = [h_1, h_2, …, h_N]. At each decoding time k, the attention mechanism performs a soft selection over the input sequence to obtain a context vector c_k as input to the decoder.

Let the query vector of the attention mechanism be s_k. The attention mechanism selects as input the encoder output at a position between 1 and N, the position being described by a random variable π_k ∈ {1, …, N}; the modelling target of the attention mechanism is the probability distribution of this position variable, p(π_k | h, s_k). The context vector is computed as:

c_k = Σ_{n=1}^N y_k(n) h_n   (3)

where y_k(n) = p(π_k = n | h, s_k) denotes the probability that the attention stays at encoder output position n at decoding time k.

The content-based attention mechanism is computed as:

e_{k,n} = v^T tanh(W s_k + V h_n + b),   y_k(n) = exp(e_{k,n}) / Σ_{m=1}^N exp(e_{k,m})   (4)

where W, V, b and v are model parameters, and e_{k,n} evaluates the degree of matching between s_k and h_n.

Assuming that the random variables π_k of the attention position at different times are conditionally independent given the encoder output h and the query vectors s_k, the probability of an alignment path π_{1:k} = {π_1, π_2, …, π_k} is:

p(π_{1:k} | h, s_{1:k}) = Π_{k'=1}^k y_{k'}(π_{k'})   (5)

where s_{1:k} is the set of query vectors {s_1, s_2, …, s_k}, and y_{k'}(π_{k'}) denotes the probability that the attention stays at encoder output position π_{k'} at any time k' before the current decoding time k.

Each path in the set P of legal attention paths satisfies monotonicity and continuity; given the constraint of monotonic paths, the conditional probability of the attention distribution is:

p(π_k | h, s_{1:k}, π_{0:k} ∈ P)   (6)

The forward variable a_k(n) is then defined as:

a_k(n) = Σ_{π_{1:k} ∈ P, π_k = n} Π_{k'=1}^k y_{k'}(π_{k'})   (7)

Using a dynamic programming algorithm, the forward variable at the current time is obtained recursively from the forward variable at the previous time:

a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) y_k(n)   (8)

A new attention probability is obtained from the forward variable:

ŷ_k(n) = a_k(n) / Σ_{m=1}^N a_k(m)   (9)

and ŷ_k(n) is used in place of y_k(n) in formula (3) to compute the context vector c_k:

c_k = Σ_{n=1}^N ŷ_k(n) h_n   (10)
Further, S3 specifically includes:

S31: Define q_data(x_0) as the data distribution on R^L, where L is the data dimension; define x_t ∈ R^L, t = 0, 1, …, T, as a sequence of variables of the same dimension, where t is the index of the diffusion step and T is the total number of diffusion steps. The diffusion probability model comprises a diffusion process and a reverse process.

The purpose of the diffusion process is to gradually map x_0 to a multidimensional normal distribution through a Markov chain, i.e.:

q(x_1, …, x_T | x_0) = Π_{t=1}^T q(x_t | x_{t-1})   (11)

where q(x_t | x_{t-1}) is defined as the Gaussian distribution N(x_t; √(1-β_t) x_{t-1}, β_t I) related to the constant β_t, and I is the identity matrix. The reverse process generates samples starting from a normal distribution:

p_latent(x_T) = N(0, I)   (12)

p_θ(x_0, …, x_{T-1} | x_T) = Π_{t=1}^T p_θ(x_{t-1} | x_t)   (13)

where p_latent(x_T) is an isotropic Gaussian distribution, and the transition probability p_θ(x_{t-1} | x_t) is parameterized as the Gaussian distribution N(x_{t-1}; μ_θ(x_t, t), σ_θ(x_t, t)^2 I).

The models μ_θ and σ_θ each take two inputs: the diffusion step t ∈ N and the variable x_t ∈ R^L, where L is the data dimension. The model μ_θ outputs an L-dimensional vector as the mean, and the model σ_θ outputs a real number as the standard deviation. The purpose of p_θ(x_{t-1} | x_t) is to gradually eliminate the Gaussian noise added in the diffusion process and finally generate data that follows the target distribution.
S32: Sampling

For the reverse process, the generation procedure first samples x_T ~ N(0, I), and then samples x_{t-1} ~ p_θ(x_{t-1} | x_t) for t = T, T-1, …, 1; the output x_0 is the generated sample.
S33: Training

Before training, the training target of the model is specified, namely the maximum likelihood p_θ(x_0); the model is trained by maximizing the variational lower bound:

E_{q_data(x_0)} log p_θ(x_0) ≥ E_q log [ p_latent(x_T) Π_{t=1}^T p_θ(x_{t-1} | x_t) / Π_{t=1}^T q(x_t | x_{t-1}) ] = ELBO   (14)

where the expectation E_q is taken over x_0 ~ q_data(x_0) and x_1, …, x_T ~ q(x_1, …, x_T | x_0), and ELBO is the evidence lower bound.

Constants based on the variance schedule of the diffusion process are defined:

ᾱ_t = Π_{s=1}^t α_s,   β̃_1 = β_1,   and for t > 1:   β̃_t = β_t (1 - ᾱ_{t-1}) / (1 - ᾱ_t)

where β_t is the forward-process variance; for ease of presentation, the substitute symbol α_t denotes α_t = 1 - β_t.
The parameterizations of μ_θ and σ_θ are then defined as:

μ_θ(x_t, t) = (1/√α_t) ( x_t - (β_t / √(1 - ᾱ_t)) ε_θ(x_t, t) )   (15)

where ε_θ: R^L × N → R^L is a neural network taking x_t and the diffusion step t as inputs; σ_θ(x_t, t) is fixed to the constant β̃_t^{1/2}.

Under this parameterization, a closed-form expression of the ELBO is given for each step as follows: given a fixed schedule β_1, …, β_T, let ε ~ N(0, I) and x_0 ~ q_data; then, under the above parameterization, the expectation E_q gives:

ELBO = c - Σ_{t=1}^T κ_t E_{x_0, ε} || ε - ε_θ( √ᾱ_t x_0 + √(1 - ᾱ_t) ε, t ) ||_2^2

for constants c and κ_t, where κ_1 = 1/(2α_1) and, for t > 1, κ_t = β_t / (2 α_t (1 - ᾱ_{t-1})).

The following unweighted variant of the ELBO is minimized to improve the generation quality:

L_unweighted = E_{x_0, ε, t} || ε - ε_θ( √ᾱ_t x_0 + √(1 - ᾱ_t) ε, t ) ||_2^2   (16)

where t is sampled uniformly from 1, …, T.
S34: Diffusion-step embedding:

Different diffusion steps t are taken as input, so that the model outputs a different ε_θ(·, t) for each t; a 128-dimensional encoding vector is used for each t:

t_embedding = [ sin(10^{0×4/63} t), …, sin(10^{63×4/63} t), cos(10^{0×4/63} t), …, cos(10^{63×4/63} t) ]

Three fully connected layers are then applied to this encoding, where the first two FC layers share parameters among the residual layers; the last FC layer maps the output of the second FC layer to a C-dimensional embedding vector. This vector is then broadcast and added to the input of each residual layer.
The invention has the following beneficial effects: the invention adopts a forward attention mechanism to address problems such as poor speech-frame alignment when synthesizing long Chinese sentences, and uses a non-autoregressive Diffwave vocoder based on a diffusion probability model in the vocoder part, which significantly improves the quality and efficiency of the synthesized speech.
Drawings
FIG. 1 is a diagram of a deep learning-based Chinese speech synthesis model according to the present invention.
FIG. 2 is a comparison of Mel frequency spectra; (a) real voice; (b) the model of the invention; (c) tacotron2+ Griffin-Lim; (d) tacotron2+ WaveRNN; (e) tacotron2+ MB-MelGAN; (f) FastSpeech2+ MB-MelGAN.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. As shown in FIG. 1, the whole framework of the deep learning-based Chinese speech synthesis model of the present invention mainly comprises three parts: text front-end processing, spectral generation networks (encoders and decoders), and vocoders.
1. Text front-end processing
(1) Putonghua text to phoneme (grapheme-to-phoneme, G2P)
For the Chinese characters of each sentence, taken in order from left to right, first search whether a word beginning with the current Chinese character exists in the word pinyin library (download address: https://github.com/mozillazg/phrase-pinyin-data) and check whether the following Chinese characters match that word; if so, the pinyin is obtained directly from the word library; if not, the pinyin of the single Chinese character is obtained from the character pinyin library (download address: https://github.com/mozillazg/pinyin-data).
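As an illustration, a minimal Python sketch of this word-first lookup is given below; the dictionary file format (one "word: pinyin" entry per line) and the function names are assumptions made for the example, not part of the published data files' specification.

# Minimal sketch of the word-first G2P lookup described above.
def load_dict(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" in line and not line.startswith("#"):
                key, value = line.split(":", 1)
                table[key.strip()] = value.strip().split()
    return table

def g2p(sentence, word_dict, char_dict, max_word_len=4):
    """Greedy left-to-right conversion: prefer the longest word match, fall back to single characters."""
    phonemes, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length > 1 and piece in word_dict:
                phonemes.extend(word_dict[piece]); i += length; break
            if length == 1:
                phonemes.extend(char_dict.get(piece, [piece])); i += 1
    return phonemes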
(2) Text regularization (text normalization, TN)
Chinese text regularization is the process of converting non-Chinese character strings into Chinese character strings to determine their pronunciations. In this embodiment, a regular expression is used to process a text, so as to realize NSW (Non-Standard-Word) normalization, and the rule is shown in table 1.
Table 1 text normalization rules table
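Since Table 1 is reproduced only as an image in the original publication, the following Python sketch illustrates regex-based NSW normalization in the same spirit; the two rules shown (percentages and digit strings, read digit by digit) are illustrative assumptions rather than the exact rule set of the embodiment.

import re

DIGITS = "零一二三四五六七八九"

def digits_to_hanzi(s):
    # Digit-by-digit reading, e.g. "2021" -> "二零二一"; "." is read as "点".
    return "".join("点" if ch == "." else DIGITS[int(ch)] for ch in s)

def normalize(text):
    # Percentages first, then remaining numbers (both simplified, digit-by-digit readings).
    text = re.sub(r"(\d+(?:\.\d+)?)%", lambda m: "百分之" + digits_to_hanzi(m.group(1)), text)
    text = re.sub(r"\d+(?:\.\d+)?", lambda m: digits_to_hanzi(m.group(0)), text)
    return text

print(normalize("温度是25.5度，湿度是60%"))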
(3) Punctuation mark
For Chinese punctuation, only the four symbols '，', '。', '？' and '！' are retained; the remaining symbols are converted into one of these four according to the following rules, detailed in Table 2.
TABLE 2 Symbol conversion rules
Before replacement → After replacement
Parentheses, quotation marks and other special symbols outside the set → ignored
Colons, dashes, pause marks, English commas → '，'
English exclamation marks → '！'
English question marks → '？'
English periods, semicolons, ellipses → '。'
Consecutive identical '，。？！' symbols → only one is retained
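A compact Python sketch of the symbol conversion of Table 2 is given below; the exact character sets handled are assumptions based on the rules above.

import re

PUNCT_MAP = {"：": "，", "——": "，", "、": "，", ",": "，",
             "!": "！", "?": "？", ".": "。", ";": "。", "……": "。"}

def convert_punct(text):
    for src, dst in PUNCT_MAP.items():          # map other marks onto the four retained symbols
        text = text.replace(src, dst)
    text = re.sub(r'[（）()"“”‘’《》…—]', "", text)   # drop brackets, quotes and remaining specials
    text = re.sub(r"([，。？！])\1+", r"\1", text)     # collapse consecutive identical symbols
    return text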
2. Encoder
The purpose of the encoder is to extract a robust sequence representation from the input text sequence. The encoder module contains a character embedding layer (Character Embedding), a 3-layer convolution and a bidirectional LSTM (Long Short-Term Memory) layer. The input characters are encoded into 128-dimensional character vectors and then passed through 3 convolution layers, each containing 256 convolution kernels of size 5×1, i.e. each kernel spans 5 characters; the convolution layers perform large-span context modelling (similar to N-grams) on the input character sequence, and convolution layers are used to capture context here mainly because recurrent neural networks find it hard to capture long-term dependencies in practice. Each convolution layer is followed by batch normalization and a ReLU (Rectified Linear Unit) activation function, ReLU(x) = max(0, x). The output of the final convolution layer is passed to a bidirectional LSTM layer containing 512 units (256 units in each direction) to generate the encoding features:

f_e = ReLU(F_3 * ReLU(F_2 * ReLU(F_1 * E(X))))   (1)

H = EncoderRecurrency(f_e)   (2)

where F_1, F_2, F_3 are the 3 convolution kernels, * denotes convolution, ReLU(·) is the nonlinear activation on each convolution layer, E(X) denotes the embedding of the character sequence X, EncoderRecurrency(·) denotes the recurrent neural network (bidirectional LSTM) in the encoder, and H is the output encoder hidden state. After the encoder hidden states are generated, the attention mechanism produces the encoding vectors. The encoder parameters are shown in Table 3.
Table 3 encoder partial parameter list
Model parameters Parameter value
embedding_dim 128
conv_layers_num 3
conv_kernel_size 5
conv_filters 256
lstm_units 256
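A minimal PyTorch sketch of the encoder described above, using the Table 3 hyper-parameters, is given below; the vocabulary size and class name are assumptions made for the example.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=128, filters=256, kernel=5, lstm_units=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        convs, in_ch = [], embed_dim
        for _ in range(3):
            convs += [nn.Conv1d(in_ch, filters, kernel, padding=kernel // 2),
                      nn.BatchNorm1d(filters), nn.ReLU()]
            in_ch = filters
        self.convs = nn.Sequential(*convs)
        self.bilstm = nn.LSTM(filters, lstm_units, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                      # (batch, N)
        x = self.embedding(phoneme_ids).transpose(1, 2)  # (batch, embed_dim, N)
        f_e = self.convs(x).transpose(1, 2)              # equation (1): convolutional features
        h, _ = self.bilstm(f_e)                          # equation (2): hidden states H, (batch, N, 512)
        return h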
3. Decoder
The decoder of the present invention adopts an autoregressive recurrent structure that predicts the mel spectrogram one frame at a time from the encoded input sequence. The decoder first passes the prediction of the previous time step through a small pre-processing network containing 2 fully connected layers of 256 hidden ReLU units each. Dropout in the pre-processing network acts as an information bottleneck that is important for learning attention and helps improve the generalization of the model. The output of the pre-processing network and the attention context vector are concatenated and passed through a stack of two unidirectional LSTM layers. The concatenation of the LSTM output and the attention context vector is projected by a linear transformation to predict the target spectrogram frame. Finally, the predicted mel spectrogram passes through a 5-layer convolutional post-processing network that adds a prediction residual to the prediction to improve the overall reconstruction. The decoder parameters are shown in Table 4.
Table 4 decoder partial parameter list
Model parameters Parameter value
prenet_layers [256,256]
decoder_layers 2
decoder_lstm_units 256
dropout_rate 0.5
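The following is a condensed PyTorch sketch of one decoder step (pre-net, two LSTM cells, linear projection) as described above; the attention computation is abstracted away and passed in as a context vector, and the dimensions follow Table 4.

import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, n_mels=80, prenet_dims=(256, 256), lstm_units=256, context_dim=512):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dims[0]), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dims[0], prenet_dims[1]), nn.ReLU(), nn.Dropout(0.5))
        self.lstm1 = nn.LSTMCell(prenet_dims[1] + context_dim, lstm_units)
        self.lstm2 = nn.LSTMCell(lstm_units, lstm_units)
        self.proj = nn.Linear(lstm_units + context_dim, n_mels)

    def forward(self, prev_frame, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)  # pre-net output + attention context
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        frame = self.proj(torch.cat([h2, context], dim=-1))        # predicted mel frame
        return frame, ((h1, c1), (h2, c2))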
(1) Forward attention mechanism
In the decoder, a forward attention mechanism is employed to improve the model's processing power for long text.
Suppose the phoneme input sequence is x = [x_1, x_2, …, x_N], where N denotes the length of the phoneme sequence. Feeding the sequence into the sequence-to-sequence encoder yields the hidden-layer sequence h = [h_1, h_2, …, h_N]. At each decoding time k, the attention mechanism performs a soft selection over the input sequence to obtain a context vector c_k as input to the decoder. Let the query vector of the attention mechanism be s_k; the state vector of the decoder RNN at the current time is typically used. The attention mechanism selects as input the encoder output at a position between 1 and N, which can be described by a random variable π_k ∈ {1, …, N}; the modelling target of the attention mechanism is then the probability distribution of this position variable, p(π_k | h, s_k). The context vector is computed as:

c_k = Σ_{n=1}^N y_k(n) h_n   (3)

where y_k(n) = p(π_k = n | h, s_k) denotes the probability that the attention stays at encoder output position n at decoding time k.

The content-based attention mechanism is computed as:

e_{k,n} = v^T tanh(W s_k + V h_n + b),   y_k(n) = exp(e_{k,n}) / Σ_{m=1}^N exp(e_{k,m})   (4)

where W, V, b and v are model parameters, and e_{k,n} evaluates the degree of matching between s_k and h_n.

Assume that the random variables π_k of the attention position at different times are conditionally independent given the encoder output h and the query vectors s_k. The probability of an alignment path π_{1:k} = {π_1, π_2, …, π_k} is then:

p(π_{1:k} | h, s_{1:k}) = Π_{k'=1}^k y_{k'}(π_{k'})   (5)

In the initial state, the method specifies π_0 = 1.

Consider a set of attention paths, denoted P. This set is the set of legal paths, i.e. every path in the set satisfies two properties. The first is monotonicity: the position at which the attention stays can only increase monotonically, π_k ≥ π_{k-1}. The second is continuity: no jump occurs between two temporally consecutive attention positions, π_k - π_{k-1} ∈ {0, 1}.

The invention considers the conditional probability of the attention distribution given the constraint of monotonic paths:

p(π_k | h, s_{1:k}, π_{0:k} ∈ P)   (6)

This conditional probability is used as the attention distribution in order to introduce a conditioning term into the probability formula. The conditioning term eliminates illegal paths in the speech generation task, i.e. all paths that violate the monotonicity rule, which greatly reduces the probability space and makes the speech synthesis task more reasonable, since in this task the attention alignment path is clearly monotonically increasing and does not jump. To describe the computation of the algorithm, the forward variable is first defined:

a_k(n) = Σ_{π_{1:k} ∈ P, π_k = n} Π_{k'=1}^k y_{k'}(π_{k'})   (7)

The forward variable in this algorithm is both similar to and different from the forward variable in the CTC (Connectionist Temporal Classification) algorithm. The similarity is that the forward variables are sums of "legal" path probabilities and the probability distributions at different times satisfy conditional independence. However, the CTC output at each time describes an output label probability, whereas the attention mechanism describes the probability distribution of the random variable of the attention position, and the definition of a "legal" path also differs: for the CTC algorithm, a legal path is one of the paths that can correspond to the correct label sequence, while for the forward attention mechanism, a legal path is one of the paths that satisfy monotonicity and continuity. As in the CTC algorithm, computing the forward variable does not require exhaustively summing over all legal paths, whose number grows exponentially and would make the computation infeasible. Instead, the forward variables can be computed by a forward algorithm whose core idea is dynamic programming: the forward variable at the current time is obtained recursively from the forward variable at the previous time:

a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) y_k(n)   (8)

A new attention probability can thus be obtained from the forward variable:

ŷ_k(n) = a_k(n) / Σ_{m=1}^N a_k(m)   (9)

After obtaining the new attention probability, ŷ_k(n) is used in place of y_k(n) in equation (3) to compute the context vector c_k. The modified computation is:

c_k = Σ_{n=1}^N ŷ_k(n) h_n   (10)
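A short PyTorch sketch of the forward-attention recursion of equations (8)-(10) is given below (batched); y_k is the content-based attention distribution of equation (4) at step k, and the initialization follows π_0 = 1.

import torch

def forward_attention_step(y_k, alpha_prev, h):
    """y_k, alpha_prev: (batch, N); h: (batch, N, dim)."""
    # a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) * y_k(n)   -- equation (8)
    shifted = torch.cat([torch.zeros_like(alpha_prev[:, :1]), alpha_prev[:, :-1]], dim=1)
    alpha = (alpha_prev + shifted) * y_k
    y_hat = alpha / (alpha.sum(dim=1, keepdim=True) + 1e-8)    # equation (9): renormalize
    context = torch.bmm(y_hat.unsqueeze(1), h).squeeze(1)      # equation (10): c_k = sum_n y_hat(n) h_n
    return context, alpha

# Initialization (pi_0 = 1): alpha_0 = [1, 0, ..., 0] for each sequence in the batch.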
some of the attention mechanism parameters are shown in table 5.
TABLE 5 attention mechanism part parameter List
Model parameters Parameter value
smoothing False
attention_dim 128
attention_filters 32
attention_kernel 31
cumulative_weights True
(2) Post-processing network
The goal of the post-processing network is to convert the sequence-to-sequence target output into a target representation that can be synthesized into a waveform, learning to predict the spectral amplitude sampled on a linear frequency scale. A further motivation for the post-processing network is that, unlike the ordinary sequence-to-sequence structure, which always runs sequentially from left to right, it sees the whole decoded sequence, so it can use both forward and backward information to correct single-frame prediction errors. The post-processing network is a 5-layer convolutional neural network; each layer consists of 256 convolution kernels of size 5×1 followed by batch normalization, and every layer except the last is followed by a tanh activation function after the batch normalization. The post-processing network parameters are shown in Table 6.
Table 6 post-processing network part parameter list
Model parameters Parameter value
postnet_layers_num 5
postnet_kernel_size 5
postnet_filters 256
4. Vocoder
The invention selects an audio generation model based on the diffusion probability model (Diffusion Probabilistic Model) to generate the speech waveform.
The diffusion probability model is a probability model based on a Markov chain; it divides the mapping between noise and the target waveform into T steps, forming a Markov chain. The diffusion process of the chain (from the target audio to noise) is trained, and decoding is then performed through the reverse process (from noise to the target audio).
First, define q_data(x_0) as the data distribution on R^L, where L is the data dimension; define x_t ∈ R^L, t = 0, 1, …, T, as a sequence of variables of the same dimension, where t is the index of the diffusion step and T is the total number of diffusion steps. The diffusion model consists of two processes: a diffusion process and a reverse process.
(1) Diffusion process:
The purpose of the diffusion process is to gradually map x_0 to a multidimensional normal distribution (Gaussian noise) through a Markov chain, i.e.:

q(x_1, …, x_T | x_0) = Π_{t=1}^T q(x_t | x_{t-1})   (11)

where q(x_t | x_{t-1}) is defined as the Gaussian distribution N(x_t; √(1-β_t) x_{t-1}, β_t I) related to the constant β_t. The process is equivalent to iteratively adding a small amount of Gaussian noise, finally converting the target into a multidimensional normal distribution whose dimensions are independent of one another.
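For illustration, the diffusion process can be sampled in closed form from x_0, as in the following sketch; the linear β_t schedule shown is the one mentioned later in the text, and the helper name is an assumption.

import torch

T = 200
betas = torch.linspace(1e-4, 0.02, T)          # linear variance schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative products of alpha_t

def diffuse(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    noise = torch.randn_like(x0) if noise is None else noise
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise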
(2) Reverse process:
The reverse process generates samples starting from a normal distribution:

p_latent(x_T) = N(0, I)   (12)

p_θ(x_0, …, x_{T-1} | x_T) = Π_{t=1}^T p_θ(x_{t-1} | x_t)   (13)

where p_latent(x_T) is an isotropic Gaussian distribution, and the transition probability p_θ(x_{t-1} | x_t) is parameterized as N(x_{t-1}; μ_θ(x_t, t), σ_θ(x_t, t)^2 I). The models μ_θ and σ_θ each take two inputs: the diffusion step t ∈ N and the variable x_t ∈ R^L; μ_θ outputs an L-dimensional vector as the mean, and σ_θ outputs a real number as the standard deviation. The purpose of p_θ(x_{t-1} | x_t) is to gradually eliminate the Gaussian noise added in the diffusion process and finally generate data that follows the target distribution.
(3) Sampling:
For the reverse process, the generation procedure first samples x_T ~ N(0, I), and then samples x_{t-1} ~ p_θ(x_{t-1} | x_t) for t = T, T-1, …, 1. The output x_0 is the generated sample.
(4) Training:
Before training, the training target of the model is specified, namely the maximum likelihood p_θ(x_0). The model is trained by maximizing the variational lower bound:

E_{q_data(x_0)} log p_θ(x_0) ≥ E_q log [ p_latent(x_T) Π_{t=1}^T p_θ(x_{t-1} | x_t) / Π_{t=1}^T q(x_t | x_{t-1}) ] = ELBO   (14)

where the expectation E_q is taken over x_0 ~ q_data(x_0) and x_1, …, x_T ~ q(x_1, …, x_T | x_0), and ELBO is the evidence lower bound.
Under certain parameterization conditions, the ELBO (Evidence Lower Bound) of the diffusion model can be computed in closed form. This not only speeds up computation but also avoids Monte Carlo estimates with excessive variance. The parameterization is motivated by its connection to denoising score matching with Langevin dynamics. To introduce this parameterization, constants based on the variance schedule of the diffusion process are defined:

ᾱ_t = Π_{s=1}^t α_s,   β̃_1 = β_1,   and for t > 1:   β̃_t = β_t (1 - ᾱ_{t-1}) / (1 - ᾱ_t)

where β_t is the forward-process variance; for ease of presentation, the symbol α_t = 1 - β_t is used.
Then, muθAnd σθParameterization of (c) defines:
Figure BDA0003336592710000143
wherein the content of the first and second substances,
Figure BDA0003336592710000144
is a same as xtAnd a neural network with the diffusion step number t as input; sigmaθ(xtT) is fixed to be constant
Figure BDA0003336592710000145
For each step under this parameterization, a closed form expression for ELBO is given as follows:
given a series of fixed schedules
Figure BDA0003336592710000146
And x0~qdata(ii) a Then at expectation EqWith parameterization of (a), we get:
Figure BDA0003336592710000147
for constants c and κtWherein
Figure BDA0003336592710000148
And for t>1, there are
Figure BDA0003336592710000149
Where c is independent of optimization objectives. The key idea demonstrated is to expand ELBO to the sum of KL divergence between controllable gaussian distributions with closed-form expressions.
Minimizing the following unweighted ELBO variables can improve the quality of generation:
Figure BDA00033365927100001410
wherein T is uniformly valued at 1. Therefore, this training target is also used in the model of the present invention.
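A sketch of the corresponding training step, minimizing the unweighted objective of equation (16), is shown below; eps_model and the mel conditioning argument are assumptions for the example.

import torch

def diffusion_loss(eps_model, x0, mel, alpha_bars):
    """x0: (batch, L) waveform segments; alpha_bars: precomputed cumulative products of alpha_t."""
    t = torch.randint(0, len(alpha_bars), (x0.size(0),))          # uniform diffusion step per sample
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps          # closed-form q(x_t | x_0)
    eps_pred = eps_model(x_t, t, mel)
    return torch.nn.functional.mse_loss(eps_pred, eps)            # || eps - eps_theta(...) ||^2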
(5) Diffusion-step embedding:
Different diffusion steps t are taken as input, so that the model outputs a different ε_θ(·, t) for each t. A 128-dimensional encoding vector is used for each t:

t_embedding = [ sin(10^{0×4/63} t), …, sin(10^{63×4/63} t), cos(10^{0×4/63} t), …, cos(10^{63×4/63} t) ]

Three fully connected (FC) layers are then applied to this encoding, where the first two FC layers share parameters among the residual layers. The last FC layer maps the output of the second FC layer to a C-dimensional (residual-channel) embedding vector. This vector is then broadcast and added to the input of each residual layer.
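The diffusion-step embedding can be sketched as follows; the sinusoidal frequency scaling and the fully connected dimensions follow the description above in the style of Diffwave, and the exact layer sizes are assumptions for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

def step_encoding(t, dim=128):
    half = dim // 2
    freqs = 10.0 ** (torch.arange(half) * 4.0 / (half - 1))   # 10^(j*4/63), j = 0..63
    angles = t.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class StepEmbedding(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fc1 = nn.Linear(128, 512)        # shared across residual layers
        self.fc2 = nn.Linear(512, 512)        # shared across residual layers
        self.fc3 = nn.Linear(512, channels)   # projection to C residual channels

    def forward(self, t):
        e = step_encoding(t)
        e = F.silu(self.fc2(F.silu(self.fc1(e))))
        return self.fc3(e)                    # broadcast-added to each residual layer's input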
The model has a conditioner (Conditioner) to encode conditioning information such as the mel spectrum and speaker labels. During training and decoding, the total number of diffusion steps T and the variances β_t are set in advance; for example, T = 200 with the best-performing schedule β_t spanning [1×10^-4, 0.02], i.e. starting at 1×10^-4 and increasing linearly to 0.02. The larger T is, the more iterations are performed and the better the generation effect.
(6) Conditioner: the neural vocoders are tested using an 80-band mel spectrum of the original audio as the conditioner. The FFT size is set to 1024, the hop size to 256 and the window size to 1024. The mel spectrogram is upsampled 256 times in time through two layers of transposed two-dimensional convolution (in time and frequency), each followed by a leaky ReLU (α = 0.4). For each layer, the upsampling stride in time is 16 and the two-dimensional filter size is [32, 3]. After upsampling, the 80 mel bands are mapped to 2× the residual channels using a layer-specific Conv1×1, and the conditioner output is then added as a bias term to the dilated convolution before the gated tanh nonlinearity of each residual layer. The vocoder parameters are shown in Table 7.
TABLE 7 partial parameter List of vocoders
5. Data set and training
Training is carried out on a server equipped with an Nvidia GTX 1080Ti. The data set is the free, open Chinese female speech synthesis database (BZNSYP) released by Databaker (Biaobei) Technology in November 2018 (download address: https://weixinxcxddb.oss-cn-beijing.aliyunncs.com/gwyinpinnKu/BZNSYP.rar), which contains 10000 sentences of Chinese female speech (about 12 hours in total) and the text label documents corresponding to all audio files. In the experiments, 95% of the data set is used as the training set and 5% as the test set.
The audio files are processed into mel-spectrum feature matrices to extract the acoustic features of the speech, and during training the pinyin labels are aligned with the corresponding spectrograms.
The text is converted into pinyin sequences; only the four symbols '，', '。', '？' and '！' are retained, and the remaining symbols are converted into one of these four according to the rules given earlier. A word embedding layer is used in the model, and the word vector of each token in the corpus is continuously learned during training.
When training the spectrum generation network, the batch size is set to 32 and the learning rate is decayed exponentially: the initial learning rate is 0.001, and once the number of iteration steps reaches 50k the learning rate decays exponentially down to 0.00001 (reached at approximately 310k steps).
When training the vocoder, to ensure data consistency the mel spectrograms generated by the model are used as input; the Adam optimizer is used with a batch size of 16 and a learning rate of 2×10^-4, and training runs for 1M steps.
6. Feature extraction:
(1) word embedding
The speech synthesis technique is to let the machine learn to map each character including spaces and punctuation to a certain number of frames of the mel-frequency spectrum.
Since plain text cannot be fed directly into a deep learning model, for Chinese the Hanzi sequence is first converted into a pinyin sequence (only the symbols '，', '。', '？' and '！' are retained). The word vector of each token in the corpus is continuously learned during training using a word embedding layer initialized from a truncated normal distribution with standard deviation stddev.
(2) Audio feature extraction
For audio, mel-spectrum features are mainly extracted. Mel-Frequency Cepstral Coefficients (MFCC) are a relatively common audio feature. Sound is a one-dimensional time-domain signal in which the frequency-domain behaviour is difficult to see intuitively; a Fourier transform can be used to obtain its frequency-domain information, but the time-domain information is then lost and the variation of the frequency content over time cannot be seen, so neither representation alone describes the sound well.
7. Experiment:
(1) introduction to evaluation methods
The model of the invention was evaluated by subjective and objective evaluation methods.
1) Subjective evaluation method:
The subjective evaluation method used is the Mean Opinion Score (MOS), which mainly concerns the naturalness and intelligibility of the synthesized speech. MOS values are graded on a five-level scale from 1 to 5; the higher the score, the better the speech quality. The evaluation criteria of the mean opinion score are shown in Table 8.
TABLE 8 evaluation criteria for mean opinion score
Grade → Score → Evaluation criteria
Excellent → 5.0 → Pronunciation is clear; delay is small, communication is smooth, overall listening impression is good; very similar
Good → 4.0 → Pronunciation is clear and intelligible; delay is small, communication slightly less smooth, slight noise; fairly similar
Fair → 3.0 → Basically intelligible; communication is possible with a certain delay, overall impression is not fluent; moderately similar
Poor → 2.0 → Very difficult to understand and hear; delay is large, communication needs to be repeated many times; slightly similar
Bad → 1.0 → Pronunciation is unclear and hard to understand; delay is large, communication is not smooth; completely dissimilar
Calculating the MOS value:
m sentences are selected to evaluate K speech synthesis systems, MK samples are formed jointly, N tested points are scored, and the average score mu of the system is expected to be obtained. To improve the random significance of the metric results, scores within the 95% confidence interval were used as the average score of the system, and the formula is as follows:
μmn=μ+xm+yn+zmn (20)
Figure BDA0003336592710000181
Figure BDA0003336592710000182
Figure BDA0003336592710000183
wherein the content of the first and second substances,
Figure BDA0003336592710000184
for modeling sentence quality, subject preferences and subjective uncertainty,
Figure BDA0003336592710000185
depending on the particular system under test and the test environment. Then calculate
Figure BDA0003336592710000186
The formula is as follows:
Figure BDA0003336592710000187
Figure BDA0003336592710000188
can be obtained from a least squares estimation, the formula is as follows:
Figure BDA0003336592710000189
the resulting estimate of the mean score variance is:
Figure BDA00033365927100001810
the confidence interval for obtaining the average score according to the t distribution by combining the above formula is as follows:
Figure BDA00033365927100001811
wherein, the degree of freedom of t distribution is min (N, M) -1, the confidence coefficient is selected to be 95%, and the value of t can be obtained by looking up a table.
2) Objective evaluation method:
In the objective evaluation method, the difference between the synthesized speech and the real speech is measured by the Mel Cepstral Distortion (MCD). The MCD represents the difference between the MFCC features of the converted speech and those of the reference speech; the smaller the distortion value, the better the sound quality of the synthesized speech.
The MCD is calculated as:

MCD = (α / M) Σ_{m=1}^M √( Σ_{l=1}^L ( s(l, m) - ŝ(l, m) )^2 )

where α is a scaling factor, typically taken as 10√2 / ln 10; l and m are the mel-cepstral index and the frame index respectively, M is the number of speech frames, L is the mel-cepstral dimension, and s(l, m) and ŝ(l, m) are the mel cepstra of the real speech and the synthesized speech, respectively.
(2) Comparative experiment of speech synthesis model
In order to demonstrate that the speech synthesis model provided by the invention has clear advantages in speech synthesis quality and naturalness, mainstream deep-learning-based speech synthesis models are selected for comparison, including the autoregressive model Tacotron2 (Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [C]. Proceedings of the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, Alberta, April 15-20, 2018, pp. 4779-4783.) and the non-autoregressive model FastSpeech2 (Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech [C]. Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), May 2021.).
According to the characteristics of these models, three popular vocoders are paired with them for the comparison experiments: Griffin-Lim (Perraudin N, Balazs P, Søndergaard P L. A Fast Griffin-Lim Algorithm [C]. Proceedings of the 14th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2013), New Paltz, New York, U.S.A., October 2013, pp. 1-4.), MB-MelGAN (Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie. Multi-Band MelGAN: Faster Waveform Generation For High-Quality Text-To-Speech [C]. Proceedings of the 8th IEEE Spoken Language Technology Workshop (SLT 2021), Shenzhen, China, January 2021, pp. 492-498.) and WaveRNN (Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, Koray Kavukcuoglu. Efficient Neural Audio Synthesis [C]. Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, July 2018, pp. 2410-2419.).
Griffin-Lim: the model generates speech under the condition of known magnitude spectrum and unknown phase spectrum. Griffin-Lim acquires the entire spectrogram, iteratively estimating the missing phase information by repeatedly switching between the frequency domain and the time domain. In this experiment, 60 iterations from the frequency to the time domain were used.
MB-MelGAN: i.e., Multi-Band MelGAN. On the basis of the MelGAN, the model increases the receptive field of a generator, and meanwhile, the multi-resolution short-time Fourier transform loss is used for replacing the characteristic matching loss, so that better training quality and better training stability are brought.
waveRNN: the model is a high-speed audio synthesis vocoder based on a neural network, and the high-speed audio synthesis vocoder has a single-layer RNN network architecture and uses two softmax layers. One advantage of WavRNN is the use of matrix sparsification, which will increase the synthesis speed by a factor of 10.
In the subjective evaluation, the quality and naturalness of the synthesized speech are evaluated using the Mean Opinion Score (MOS). First, 20 texts are randomly selected from the test set; the synthesized speech and the real speech are mixed together and randomly shuffled to form the speech set to be tested. Each utterance is scored by 15 testers in the same laboratory environment with noise below 30 decibels; the scores of all testers are collected, scores with large errors are removed, and the average score at 95% confidence is then computed as the final score.
In the objective evaluation, this embodiment measures the difference between synthesized speech and real speech using the mel cepstral distortion (MCD). 20 texts are selected for speech synthesis, and each text has a corresponding real recorded audio. The MCD of the 20 texts is calculated separately for the different models, and the average value is then computed for each model as its final score.
TABLE 9 comparison table of experimental results of speech synthesis model
Acoustic model MOS MCD
Tacotron2+Griffin-Lim 4.04±0.19 6.80
Tacotron2+MB-MelGAN 4.17±0.25 7.14
Tacotron2+WaveRNN 4.19±0.10 6.69
FastSpeech2+MB-MelGAN 3.96±0.15 6.97
Model of the invention 4.22±0.30 6.22
True speech 4.50±0.24
The results of the MOS and MCD experiments are shown in Table 9. As can be seen from Table 9, the MOS score of the model of the invention is 4.22, higher than that of the other models, showing that its speech synthesis quality and naturalness are better. Furthermore, the MCD score of the model of the invention is 6.22, lower than that of the other models, indicating that the difference between its synthesized speech and real speech is the smallest.
In addition to performing the quantitative comparison experiment of MCD, the present embodiment also graphically represents the difference between the synthesized speech and the real speech through the mel-frequency spectrogram. Fig. 2 is a comparison of the mel spectrum generated by the models and the real mel spectrum, wherein, the diagram (a) is the real mel spectrum, and other spectrograms are the mel spectra generated by the models. It can be seen from fig. 2 that the mel-frequency spectrogram generated by the model of the present invention is closer to the true mel-frequency spectrum than other speech synthesis models.

Claims (5)

1. A Chinese speech synthesis method based on a diffusion probability model is characterized by comprising the following steps:
s1: text front-end processing:
acquiring a text data set, constructing a Chinese text front-end processing module, and performing Mandarin text-to-phoneme conversion, text regularization and punctuation mark deletion or conversion on the text data set to obtain a phoneme sequence;
s2: constructing an end-to-end spectrum generation network based on a forward attention mechanism to encode and decode the processed text:
encoding: the encoder module processes the input phoneme sequence to obtain a hidden-layer sequence, and at each decoding time an attention mechanism performs a soft selection over the input sequence to obtain an attention context vector as the input of the decoder;
decoding: the decoder module passes the prediction of the previous time step through a pre-processing network; the output of the pre-processing network and the attention context vector are concatenated and passed through a stack of two unidirectional LSTM layers; the concatenation of the LSTM-layer output and the attention context vector is projected by a linear transformation to predict the target spectrogram frame; the predicted mel spectrogram then passes through a 5-layer convolutional post-processing network, which adds a prediction residual to the prediction to improve the overall reconstruction;
s3: chinese speech synthesis using a Diffwave vocoder based on a diffusion probability model:
the diffusion probability model divides the mapping relation between the noise and the target waveform into T steps to form a Markov chain, trains the diffusion process of the chain, namely from the target audio to the noise, and then decodes the chain through the reverse process, namely from the noise to the target audio.
2. The method as claimed in claim 1, wherein the Mandarin text-to-phoneme conversion is specifically as follows: for the Chinese characters of each sentence of the text data set, taken in order from left to right, first search whether a word beginning with the current Chinese character exists in the word pinyin library and check whether the following Chinese characters in the text match that word; if they match, the pinyin of the word is obtained directly from the word pinyin library; if not, the pinyin of the single Chinese character is obtained from the character pinyin library.
3. The method of Chinese speech synthesis based on the diffusion probability model of claim 1, wherein the encoder module comprises: a character embedding layer, a 3-layer convolution and a bidirectional LSTM layer; the input characters are encoded into 128-dimensional character vectors and then passed through 3 convolution layers, each containing 256 convolution kernels of size 5×1, i.e. each kernel spans 5 characters; the convolution layers perform large-span context modelling on the input character sequence, each convolution layer is followed by batch normalization, and a ReLU activation function is used for activation; the output of the final convolution layer is passed to the bidirectional LSTM layer to generate the encoding features;

f_e = ReLU(F_3 * ReLU(F_2 * ReLU(F_1 * E(X))))   (1)

H = EncoderRecurrency(f_e)   (2)

where f_e denotes the encoding features, F_1, F_2, F_3 are the 3 convolution kernels, * denotes convolution, ReLU(·) denotes the nonlinear activation on each convolution layer, E(X) denotes the embedding of the character sequence X, EncoderRecurrency(·) denotes the recurrent neural network (bidirectional LSTM) in the encoder, and H is the output encoder hidden state.
4. The method of Chinese speech synthesis based on the diffusion probability model of claim 1, wherein the forward attention mechanism comprises:
letting the phoneme sequence input to the encoder be x = [x_1, x_2, …, x_N], where N is the length of the phoneme sequence; the encoder produces the hidden-layer sequence h = [h_1, h_2, …, h_N]; at each decoding time k, the attention mechanism performs a soft selection over the input sequence to obtain a context vector c_k as input to the decoder;
letting the query vector of the attention mechanism be s_k; the attention mechanism selects as input the encoder output at a position between 1 and N, the position being described by a random variable π_k ∈ {1, …, N}; the modelling target of the attention mechanism is the probability distribution of this position variable, p(π_k | h, s_k); the context vector is computed as:

c_k = Σ_{n=1}^N y_k(n) h_n   (3)

where y_k(n) = p(π_k = n | h, s_k) denotes the probability that the attention stays at encoder output position n at decoding time k;
the content-based attention mechanism is computed as:

e_{k,n} = v^T tanh(W s_k + V h_n + b),   y_k(n) = exp(e_{k,n}) / Σ_{m=1}^N exp(e_{k,m})   (4)

where W, V, b and v are model parameters, and e_{k,n} evaluates the degree of matching between s_k and h_n;
assuming that the random variables π_k of the attention position at different times are conditionally independent given the encoder output h and the query vectors s_k, the probability of an alignment path π_{1:k} = {π_1, π_2, …, π_k} is:

p(π_{1:k} | h, s_{1:k}) = Π_{k'=1}^k y_{k'}(π_{k'})   (5)

where s_{1:k} is the set of query vectors {s_1, s_2, …, s_k}, and y_{k'}(π_{k'}) denotes the probability that the attention stays at encoder output position π_{k'} at any time k' before the current decoding time k;
each path in the set P of legal attention paths satisfies monotonicity and continuity; given the constraint of monotonic paths, the conditional probability of the attention distribution is:

p(π_k | h, s_{1:k}, π_{0:k} ∈ P)   (6)

the forward variable a_k(n) is then defined as:

a_k(n) = Σ_{π_{1:k} ∈ P, π_k = n} Π_{k'=1}^k y_{k'}(π_{k'})   (7)

using a dynamic programming algorithm, the forward variable at the current time is obtained recursively from the forward variable at the previous time:

a_k(n) = (a_{k-1}(n) + a_{k-1}(n-1)) y_k(n)   (8)

a new attention probability is obtained from the forward variable:

ŷ_k(n) = a_k(n) / Σ_{m=1}^N a_k(m)   (9)

and ŷ_k(n) is used in place of y_k(n) in formula (3) to compute the context vector c_k:

c_k = Σ_{n=1}^N ŷ_k(n) h_n   (10)
5. The method for Chinese speech synthesis based on diffusion probability model according to claim 1, wherein the S3 specifically includes:
S31:definition of qdata(x0) Is composed of
Figure FDA0003336592700000041
Wherein L is the data dimension; definition of
Figure FDA0003336592700000042
T is a variable sequence with the same dimensionality, T is an index of diffusion step number, and T is total diffusion step number; the diffusion probability model comprises a diffusion process and a reverse process;
the purpose of the diffusion process is to propagate x through a Markov chain0Gradually mapping to a multidimensional normal distribution, i.e.:
Figure FDA0003336592700000043
wherein q (x)t|xt-1) Is defined as a sum constant betatRelated Gaussian distribution
Figure FDA0003336592700000044
I is an identity matrix; the reverse process is generated based on normally distributed samples:
platent(xT)=N(0,I) (12)
Figure FDA0003336592700000045
wherein p islatent(xT) Is an isotropic Gaussian distribution with a transition probability pθ(xt-1|xt) Parameterized as Gaussian distribution N (x)t-1;μθ(xt,t),σθ(xt,t)2I);
Wherein, the model muθAnd model σθThere are two inputs each: number of diffusion steps
Figure FDA0003336592700000046
And variables
Figure FDA0003336592700000047
Wherein L is the data dimension; model muθOutputting an L-dimensional vector as a mean value, model sigmaθOutputting a real number as a standard deviation; p is a radical ofθ(xt-1|xt) The purpose of the method is to gradually eliminate Gaussian noise in the diffusion process and finally generate data which accord with target distribution;
S32: sampling
for the reverse process, the generation process first samples x_T \sim N(0, I), and then samples x_{t-1} \sim p_\theta(x_{t-1} \mid x_t) for t = T, T-1, \dots, 1; the output x_0 is the generated sample;
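A minimal sketch of the ancestral sampling loop described in S32, assuming the parameterization of \mu_\theta and \sigma_\theta given later in S33; `eps_model` is a stand-in for the trained noise-prediction network \epsilon_\theta, and omitting the noise term at the final step is a common convention rather than part of the claim.

```python
import numpy as np

def reverse_sample(eps_model, betas, alphas, alphas_bar, shape, seed=0):
    """Ancestral sampling for the reverse process of S32."""
    rng = np.random.default_rng(seed)
    T = len(betas)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps = eps_model(x, t)
        # mean mu_theta(x_t, t) of formula (16)
        mean = (x - betas[t-1] / np.sqrt(1.0 - alphas_bar[t-1]) * eps) / np.sqrt(alphas[t-1])
        if t > 1:
            # fixed standard deviation beta_tilde_t ** 0.5
            sigma = np.sqrt((1.0 - alphas_bar[t-2]) / (1.0 - alphas_bar[t-1]) * betas[t-1])
            x = mean + sigma * rng.standard_normal(shape)
        else:
            x = mean                                      # no noise added at the final step
    return x                                              # x_0, the generated sample
```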
S33: training
before training, the training objective of the model is first analyzed, namely maximizing the likelihood p_\theta(x_0); the model is trained by maximizing the variational lower bound:

\mathbb{E}_{q_{data}(x_0)} \log p_\theta(x_0) \;\geq\; \mathbb{E}_{q_{data}(x_0)}\, \mathbb{E}_{q(x_1, \dots, x_T \mid x_0)} \log \frac{p_{latent}(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)}{\prod_{t=1}^{T} q(x_t \mid x_{t-1})} := \mathrm{ELBO}   (14)

wherein \mathbb{E}_{q_{data}(x_0)} denotes the expectation of x_0 over the distribution q_{data}(x_0), and \mathbb{E}_{q(x_1, \dots, x_T \mid x_0)} denotes the expectation over the distribution q(x_1, \dots, x_T \mid x_0); ELBO is the evidence lower bound;
constants are defined based on the variance schedule of the diffusion process:

\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t \ (t > 1), \qquad \tilde{\beta}_1 = \beta_1   (15)

wherein \beta_t is the forward-process variance; for ease of presentation, the symbol \alpha_t denotes \alpha_t = 1 - \beta_t;
the parameterizations of \mu_\theta and \sigma_\theta are then defined as:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)   (16)

wherein \epsilon_\theta : \mathbb{R}^L \times \mathbb{N} \rightarrow \mathbb{R}^L is a neural network taking x_t and the diffusion step t as input; \sigma_\theta(x_t, t) is fixed to the constant \tilde{\beta}_t^{1/2};
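The schedule constants of formula (15) and the fixed standard deviation can be precomputed once; the sketch below does so in NumPy, with a linear beta schedule chosen purely as an illustrative assumption.

```python
import numpy as np

def schedule_constants(betas):
    """Precompute the constants of formula (15) from a variance schedule
    betas = [beta_1, ..., beta_T]."""
    alphas = 1.0 - betas                                  # alpha_t = 1 - beta_t
    alphas_bar = np.cumprod(alphas)                       # \bar{alpha}_t
    beta_tilde = np.empty_like(betas)
    beta_tilde[0] = betas[0]                              # \tilde{beta}_1 = beta_1
    beta_tilde[1:] = (1.0 - alphas_bar[:-1]) / (1.0 - alphas_bar[1:]) * betas[1:]
    return alphas, alphas_bar, beta_tilde

# example: T = 50 steps with a linear beta schedule (an assumption, not claimed)
betas = np.linspace(1e-4, 0.05, 50)
alphas, alphas_bar, beta_tilde = schedule_constants(betas)
```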
under this parameterization, a closed-form expression for the ELBO is obtained as follows: given a fixed variance schedule \beta_1, \dots, \beta_T, let \epsilon \sim N(0, I) and x_0 \sim q_{data}; then, taking the expectation E_q under the above parameterization yields:

\mathrm{ELBO} = c - \sum_{t=1}^{T} \kappa_t\, \mathbb{E}_{x_0, \epsilon} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right) \right\|_2^2   (17)

for constants c and \kappa_t, wherein \kappa_1 = \frac{1}{2\alpha_1} and, for t > 1, \kappa_t = \frac{\beta_t}{2\alpha_t (1-\bar{\alpha}_{t-1})};
the following unweighted variant of the ELBO is minimized to improve generation quality:

\min_\theta L_{unweighted}(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t \right) \right\|_2^2   (18)

wherein t is sampled uniformly from 1, \dots, T;
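A single Monte-Carlo evaluation of the unweighted objective (18) might look like the following sketch; the closed-form corruption x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon is taken from formula (17), while the function signature and the single-sample estimate are assumptions for illustration.

```python
import numpy as np

def unweighted_loss(eps_model, x0, alphas_bar, seed=0):
    """One Monte-Carlo sample of the unweighted objective (18): draw t
    uniformly, corrupt x_0 to x_t in closed form, and measure how well
    eps_model recovers the injected noise."""
    rng = np.random.default_rng(seed)
    T = len(alphas_bar)
    t = int(rng.integers(1, T + 1))                       # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)                   # epsilon ~ N(0, I)
    x_t = np.sqrt(alphas_bar[t-1]) * x0 + np.sqrt(1.0 - alphas_bar[t-1]) * eps
    return np.sum((eps - eps_model(x_t, t)) ** 2)         # squared L2 distance
```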
S34: diffusion step embedding
different diffusion steps t are taken as input, so that the model can output a different \epsilon_\theta(\cdot, t) for each t; a 128-dimensional encoding vector is used for each t:

t_{embedding}(t) = \left[ \sin\!\left(10^{\frac{0 \times 4}{63}} t\right), \dots, \sin\!\left(10^{\frac{63 \times 4}{63}} t\right), \cos\!\left(10^{\frac{0 \times 4}{63}} t\right), \dots, \cos\!\left(10^{\frac{63 \times 4}{63}} t\right) \right]

three fully-connected (FC) layers are then applied to this encoding, wherein the first two FC layers share parameters across the residual layers; the last FC layer maps the output of the second FC layer into a C-dimensional embedding vector; this vector is then broadcast and added to the input of each residual layer.
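The 128-dimensional step encoding of S34 can be sketched as follows; the exact frequency spacing follows the cited DiffWave paper, and the helper name `step_embedding` is an assumption for illustration.

```python
import numpy as np

def step_embedding(t, dim=128):
    """Sinusoidal encoding of the diffusion step t described in S34:
    64 sine and 64 cosine terms with frequencies 10^(i*4/63), i = 0..63."""
    half = dim // 2
    freqs = 10.0 ** (np.arange(half) * 4.0 / (half - 1))
    angles = freqs * t
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = step_embedding(17)   # a 128-dimensional vector for diffusion step t = 17
```

The encoding is then passed through the three fully-connected layers described above before being added to each residual layer's input.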
CN202111295924.5A 2021-11-03 2021-11-03 Chinese speech synthesis method based on diffusion probability model Pending CN114023300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111295924.5A CN114023300A (en) 2021-11-03 2021-11-03 Chinese speech synthesis method based on diffusion probability model


Publications (1)

Publication Number Publication Date
CN114023300A true CN114023300A (en) 2022-02-08

Family

ID=80060249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111295924.5A Pending CN114023300A (en) 2021-11-03 2021-11-03 Chinese speech synthesis method based on diffusion probability model

Country Status (1)

Country Link
CN (1) CN114023300A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113678200A (en) * 2019-02-21 2021-11-19 谷歌有限责任公司 End-to-end voice conversion
CN111951778A (en) * 2020-07-15 2020-11-17 天津大学 Method for synthesizing emotion voice by using transfer learning under low resource
CN112652291A (en) * 2020-12-15 2021-04-13 携程旅游网络技术(上海)有限公司 Speech synthesis method, system, device and storage medium based on neural network
CN113345415A (en) * 2021-06-01 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JING-XUAN ZHANG 等: "FORWARD ATTENTION IN SEQUENCE-TO-SEQUENCE ACOUSTIC MODELING FOR SPEECH SYNTHESIS", 《ARXIV:1807.06736V1》 *
JONATHAN SHEN 等: "NATURAL TTS SYNTHESIS BY CONDITIONINGWAVENET ON MEL SPECTROGRAM PREDICTIONS", 《ICASSP 2018》 *
ZHIFENG KONG 等: "DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS", 《ARXIV:2009.09761V3》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116884495A (en) * 2023-08-07 2023-10-13 成都信息工程大学 Diffusion model-based long tail chromatin state prediction method
CN116884495B (en) * 2023-08-07 2024-03-08 成都信息工程大学 Diffusion model-based long tail chromatin state prediction method
CN116884391A (en) * 2023-09-06 2023-10-13 中国科学院自动化研究所 Multimode fusion audio generation method and device based on diffusion model
CN116884391B (en) * 2023-09-06 2023-12-01 中国科学院自动化研究所 Multimode fusion audio generation method and device based on diffusion model
CN116977652A (en) * 2023-09-22 2023-10-31 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation
CN116977652B (en) * 2023-09-22 2023-12-22 之江实验室 Workpiece surface morphology generation method and device based on multi-mode image generation
CN117423329A (en) * 2023-12-19 2024-01-19 北京中科汇联科技股份有限公司 Model training and voice generating method, device, equipment and storage medium
CN117423329B (en) * 2023-12-19 2024-02-23 北京中科汇联科技股份有限公司 Model training and voice generating method, device, equipment and storage medium
CN117809621A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Van Den Oord et al. Wavenet: A generative model for raw audio
Le et al. Voicebox: Text-guided multilingual universal speech generation at scale
Huang et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation
US20200026760A1 (en) Enhanced attention mechanisms
CN113439301A (en) Reconciling between analog data and speech recognition output using sequence-to-sequence mapping
Liu et al. Towards unsupervised speech recognition and synthesis with quantized speech representation learning
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Jemine Real-time voice cloning
US20230036020A1 (en) Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score
Luo et al. Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
Kaur et al. Conventional and contemporary approaches used in text to speech synthesis: A review
KR102272554B1 (en) Method and system of text to multiple speech
Khanam et al. Text to speech synthesis: A systematic review, deep learning based architecture and future research direction
Yang et al. Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise
CN114495969A (en) Voice recognition method integrating voice enhancement
Schnell et al. Investigating a neural all pass warp in modern TTS applications
Tan Neural text-to-speech synthesis
Mei et al. A particular character speech synthesis system based on deep learning
US20230178069A1 (en) Methods and systems for synthesising speech from text
Zhao et al. Research on voice cloning with a few samples
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
Wen et al. Improving deep neural network based speech synthesis through contextual feature parametrization and multi-task learning
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Liu et al. A New Speech Encoder Based on Dynamic Framing Approach.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220208