Speech synthesis method
Technical Field
The present invention relates to speech synthesis technology, and more particularly, to a speech synthesis method.
Background
Against the background of the State Grid Corporation's drive to streamline staffing and raise efficiency, the contradiction between the shortage of professional personnel at the power materials company and the continuously growing number of suppliers has become increasingly prominent, and existing arrangements struggle to meet the demand for large-scale bidding and contract-performance information interaction. With the advent of the artificial intelligence era, speech recognition technology is developing continuously; intelligent voice technology can take over most manual telephone traffic, freeing up manpower and improving efficiency. Therefore, in the course of upgrading its intelligent supplier service hall, the Materials Branch of State Grid Chongqing Electric Power Company identified the key points that a materials specialist must convey in information notifications by sorting through the requirements of existing voice-call interaction scenarios, constructed conversational flows for common business notifications, and built an AI intelligent outbound-call system with a high degree of service fit and intelligence by relying on technologies such as speech recognition, semantic understanding, speech synthesis, and big-data analysis.
The AI intelligent outbound-call system can relieve workers of tedious, repetitive tasks and improve work efficiency; at the same time, because the system's voice carries no emotion, conflicts can be effectively avoided. Speech synthesis is a very important technology in the AI intelligent outbound-call system, and how to synthesize speech accurately is the technical problem the system must solve.
Disclosure of Invention
In view of the problems in the prior art, the technical problem to be solved by the invention is: how to synthesize speech accurately.
In order to solve the technical problems, the invention adopts the following technical scheme: a method of speech synthesis comprising the steps of:
s10: extracting text features and acoustic features;
text feature extraction: the text feature extraction module first applies character embedding to the input text data, i.e., each text character is represented by a vector of fixed dimensionality, and the embedded text then passes in turn through the two sub-networks Pre-Net and CBHG to obtain the text feature data;
acoustic feature extraction: the mel spectrum and the linear spectrum are used; pre-emphasis is first performed on the speech data by passing the original audio signal through a high-pass filter, and a short-time Fourier transform is then performed to obtain the linear spectrum;
s20, fusing the extracted text feature data and the acoustic features, which comprises the following steps:
a) constructing an encoder, wherein the encoder uses the encoder of the Tacotron framework; the text feature data obtained in S10 is input into the encoder, and the encoder outputs an encoded sequence;
b) constructing a position-sensitive attention mechanism, wherein its position features are obtained by convolution with 32 one-dimensional convolution kernels of length 31; after the encoded sequence output by step a) and the position features are projected to a 128-dimensional hidden representation, the attention weights, i.e., the attention context vector, are obtained;
c) constructing a decoder, which is an autoregressive recurrent neural network that predicts an output spectrogram from the encoded sequence output by the encoder, one frame at a time; the spectral frame predicted in the previous step is first passed into a two-layer fully connected preprocessing network (pre-net) of 256 hidden ReLU units per layer;
the output of the pre-net is concatenated with the attention context vector and passed to a stack of two unidirectional recurrent layers of 1024 units each; the output of this network is concatenated with the attention context vector again, and the target spectral frame is then predicted through a linear projection;
the predicted target spectral frame is passed through a 5-layer convolutional network that predicts a residual, which is superposed on the pre-convolution spectral frame; each layer of the network consists of 512 convolution kernels of size 5×1 followed by batch normalization, and except for the last convolutional layer, each batch normalization is followed by a tanh activation function;
in parallel with the spectral frame prediction, the decoder output is concatenated with the attention context vector, projected to a scalar, and passed through a sigmoid activation function to predict the probability that the output sequence has finished;
when this probability is greater than or equal to a preset ending threshold, the prediction is finished and the next step is carried out;
d) post-processing network and waveform synthesis, wherein the post-processing network consists of a CBHG module and a fully connected layer; the decoder output is converted into a linear spectrogram by the post-processing network, and the linear spectrogram is restored to a speech waveform for output by the Griffin-Lim algorithm.
Preferably, the specific method for extracting the acoustic features in S10 is as follows:
1) the original audio signal is passed through a high-pass filter to obtain pre-emphasized voice data, and formula (1) is adopted:
H(z) = 1 − μ·z⁻¹    (1);
wherein H(z) is the transfer function of the high-pass filter, z⁻¹ represents a delay of one sampling instant (so the filter subtracts μ times the previous sample from the current sample), and μ is the pre-emphasis coefficient;
2) then, a short-time Fourier transform is performed on the speech data obtained by formula (1) to obtain the linear spectrum, as shown in formula (2):
Z(t, f) = ∫ z(τ)·g(τ − t)·e^(−j2πfτ) dτ    (2);
where z(t) is the source signal, i.e., the output of the high-pass filter H(z), g(t) is a window function, and f is the frequency of the linear spectrum;
3) the linear spectrum is processed with a mel filter bank to obtain the mel spectrum, as shown in formula (3):
mel(f) = 2595·log₁₀(1 + f/700)    (3);
where f is the frequency of the linear spectrum and mel(f) is the corresponding mel-scale frequency.
Preferably, the encoder in S20 is composed of a Pre-net preprocessing network and a CBHG module, and the CBHG module consists, in order, of a one-dimensional convolutional filter bank, residual connections, a multi-layer highway network, and a bidirectional gated recurrent unit (GRU) network.
Preferably, the output of the decoder constructed with the position-sensitive attention mechanism in S20 is computed as follows:
the energy of the position-sensitive attention mechanism is calculated as in formula (4):
e_{i,j} = v_a^T · tanh(W·s_i + V·h_j + U·f_{i,j} + b)    (4);
wherein s_i is the hidden state of the decoder recurrent neural network at time i, h_j is the j-th output of the encoder, f_{i,j} is the j-th element of the convolution output of the attention weights accumulated before time i, b is a bias value that is initially the zero vector, W, V and U denote weight matrices of different network layers, v_a is a weight vector, and v_a^T denotes the transpose of v_a;
the convolution output f_i is obtained from the cumulative attention weights cα_{i−1}, with F the convolution kernel, as in formulas (5) and (6):
f_i = F ∗ cα_{i−1}    (5);
cα_{i−1} = Σ_{k=1}^{i−1} α_k    (6);
wherein α_k denotes the vector of attention weights at decoding step k.
compared with the prior art, the invention has at least the following advantages:
in the invention, a decoder is constructed by using an autoregressive recurrent neural network, and a position sensitive attention mechanism is introduced in the encoding process, so that an attention context vector and an encoder output encoding sequence are spliced together when the decoder is used, and because the sensitive attention mechanism can simultaneously consider the content and the position of an input phoneme, the accumulated attention weight after the previous decoding process can be used as an additional feature, so that the model keeps consistency when advancing along an input sequence, the problems of subsequence omission or repetition and the like which possibly occur in the decoding process are reduced, and the accuracy of the final synthesized speech is improved.
Detailed Description
The present invention is described in further detail below.
A method of speech synthesis comprising the steps of: s10: extracting text features and acoustic features;
the text feature extraction module firstly embeds characters into input text data, namely uses vectors with fixed dimensionality to represent text characters, and then sequentially passes through two sub-networks of Pre-Net and CBHG to obtain text feature data;
acoustic feature extraction: the extraction uses the mel spectrum and the linear spectrum; pre-emphasis is first performed on the speech data by passing the original audio signal through a high-pass filter to obtain the pre-emphasized speech data, and a short-time Fourier transform is then performed to obtain the linear spectrum.
As an improvement, the specific method for extracting the acoustic features in S10 is as follows:
1) the original audio signal is passed through a high-pass filter to obtain pre-emphasized voice data, and formula (1) is adopted:
H(z) = 1 − μ·z⁻¹    (1);
wherein H(z) is the transfer function of the high-pass filter, z⁻¹ represents a delay of one sampling instant (so the filter subtracts μ times the previous sample from the current sample), and μ is the pre-emphasis coefficient, typically between 0.9 and 1.0;
2) then, a short-time Fourier transform is performed on the speech data obtained by formula (1) to obtain the linear spectrum, as shown in formula (2):
Z(t, f) = ∫ z(τ)·g(τ − t)·e^(−j2πfτ) dτ    (2);
where z(t) is the source signal, i.e., the output of the high-pass filter H(z), g(t) is a window function, and f is the frequency of the linear spectrum;
3) the linear spectrum is processed with a mel filter bank to obtain the mel spectrum, as shown in formula (3):
mel(f) = 2595·log₁₀(1 + f/700)    (3);
where f is the frequency of the linear spectrum and mel(f) is the corresponding mel-scale frequency.
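As an illustrative sketch only (not part of the claimed method), the pre-emphasis of formula (1) and the mel-scale mapping of formula (3) can be written in a few lines of Python; the coefficient value 0.97 and the toy input signal are assumptions chosen for demonstration:

```python
import math

def pre_emphasis(x, mu=0.97):
    """Formula (1) in the time domain: y[n] = x[n] - mu * x[n-1]."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

def hz_to_mel(f):
    """Formula (3): map a linear-spectrum frequency f (Hz) to the mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# Toy demonstration: pre-emphasis attenuates the constant (low-frequency) part
# of the signal, boosting the relative weight of high frequencies.
signal = [1.0, 1.0, 1.0, 1.0]
print(pre_emphasis(signal))        # first sample kept, later samples near 1 - mu
print(round(hz_to_mel(700.0), 2))  # 781.17
```

The high-pass behaviour is visible directly: a constant input collapses to roughly 1 − μ after the first sample, while rapidly changing inputs pass through nearly unchanged.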
S20, fusing the extracted text feature data and the acoustic features, which comprises the following steps:
a) The encoder is constructed using the encoder of the Tacotron framework, which is prior art.
The text feature data obtained in S10 is input to the encoder, and the encoder outputs the encoded sequence.
The encoder is composed of a Pre-net preprocessing network and a CBHG module; the Pre-net preprocesses the input text. The CBHG module consists, in order, of a one-dimensional convolutional filter bank, residual connections, a multi-layer highway network, and a bidirectional gated recurrent unit (GRU) network. The one-dimensional convolutional filter bank is a convolutional layer composed of m one-dimensional filters of different sizes, the filter widths being 1, 2, 3, …, m. Residual connections alleviate the vanishing-gradient problem caused by overly deep networks, ensuring that not too much of the earlier input information is lost after many convolutional layers. The highway network mitigates the overfitting caused by deepening the network and reduces the training difficulty of deeper networks. Finally, a GRU is used to obtain a bidirectionally extracted feature sequence.
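A single highway unit of the kind used in the CBHG module can be sketched as follows; the scalar toy weights are assumptions chosen only to make the gating behaviour visible, not trained values:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def highway(x, w_h, b_h, w_t, b_t):
    """One element-wise highway unit: y = T(x)*H(x) + (1 - T(x))*x.

    H is a ReLU transform and T is a sigmoid "transform gate"; when T -> 0
    the input is carried through unchanged, which is what eases the training
    of deep stacks mentioned in the text.
    """
    h = max(0.0, w_h * x + b_h)   # candidate transform H(x)
    t = sigmoid(w_t * x + b_t)    # transform gate T(x)
    return t * h + (1.0 - t) * x

# With a strongly negative gate bias the layer behaves like an identity skip:
print(highway(2.0, w_h=1.5, b_h=0.0, w_t=0.0, b_t=-10.0))  # close to 2.0
```

Initializing the gate bias negative so that layers start near the identity is a common practical choice when stacking several highway layers.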
b) Constructing a position-sensitive attention mechanism. The position-sensitive attention mechanism considers both the content and the position of the input phonemes, and allows the attention weights accumulated over previous decoding steps to be used as an additional feature, so that the model stays consistent as it advances along the input sequence, reducing problems such as omission or repetition of subsequences that may occur during decoding.
Its position features are obtained by convolution with 32 one-dimensional convolution kernels of length 31; after the encoded sequence output by step a) and the position features are projected to a 128-dimensional hidden representation, the attention weights, i.e., the attention context vector, are obtained;
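The attention computation of step b) can be sketched with toy dimensions; for readability every quantity is kept scalar per encoder step, and all weights here are small assumed constants rather than learned parameters:

```python
import math

def attention_weights(s, H, F_prev, W, V, U, b, v_a):
    """Position-sensitive attention energies e_j = v_a*tanh(W*s + V*h_j + U*f_j + b),
    followed by a softmax over the encoder steps j to get the weights."""
    energies = [v_a * math.tanh(W * s + V * h_j + U * f_j + b)
                for h_j, f_j in zip(H, F_prev)]
    m = max(energies)                      # numerically stable softmax
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 4 encoder outputs, one decoder state s, and location features
# F_prev derived from earlier attention weights.
alpha = attention_weights(s=0.5, H=[0.1, 0.9, 0.3, -0.2],
                          F_prev=[0.0, 0.4, 0.1, 0.0],
                          W=1.0, V=1.0, U=1.0, b=0.0, v_a=1.0)
print(alpha)  # weights sum to 1; largest where content h_j plus location f_j is largest
```

Note how the location term F_prev biases the weights toward positions the model already attended to, which is what discourages skipping or repeating input phonemes.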
c) Constructing a decoder, which is an autoregressive recurrent neural network that predicts an output spectrogram from the encoded sequence output by the encoder, one frame at a time; the spectral frame predicted in the previous step is first passed into a two-layer fully connected preprocessing network (pre-net) of 256 hidden ReLU units per layer;
the output of the pre-net is spliced with the attention context vector and transmitted to a two-layer stacked unidirectional neural network consisting of 1024 units, the output of the neural network is spliced with the attention context vector again, and then the target frequency spectrum frame is predicted through a linear transformation projection.
The predicted target spectral frame is passed through a 5-layer convolutional network that predicts a residual, which is superposed on the pre-convolution spectral frame to improve the overall spectrum reconstruction. Each layer of the network consists of 512 convolution kernels of size 5×1 followed by batch normalization, and except for the last convolutional layer, each batch normalization is followed by a tanh activation function.
In parallel with the spectral frame prediction, the decoder output is concatenated with the attention context vector, projected to a scalar, and passed through the sigmoid activation function to predict the probability that the output sequence has finished.
When this probability is greater than or equal to the preset ending threshold, the prediction is finished and the next step is carried out;
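The stop-token branch described above can be sketched as follows; the weights, the bias, and the 0.5 threshold are illustrative assumptions (in a Tacotron-style system the projection is learned and the threshold is a tuning choice):

```python
import math

def stop_probability(decoder_out, context, w, b):
    """Concatenate the decoder output and the attention context vector,
    project the result to a scalar, and squash it with a sigmoid to get
    the probability that the output sequence has finished."""
    features = decoder_out + context                       # list concatenation
    z = sum(wi * fi for wi, fi in zip(w, features)) + b    # linear projection
    return 1.0 / (1.0 + math.exp(-z))                      # sigmoid

STOP_THRESHOLD = 0.5  # assumed preset ending threshold

p = stop_probability([0.2, -0.1], [0.4, 0.3], w=[1.0, 1.0, 1.0, 1.0], b=2.0)
print(p >= STOP_THRESHOLD)  # True: the decoder would stop predicting frames
```

Because the check runs in parallel with frame prediction, the decoder needs no fixed output length: synthesis simply halts at the first step whose probability clears the threshold.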
convolutional layers in the network are regularized using dropout with a probability of 0.5, and LSTM layers are regularized using zoneout with a probability of 0.1. To bring some variation to the output result at the time of inference, dropout with a probability of 0.5 is applied only to pre-net of the autoregressive decoder.
The model of the invention uses more compact building blocks, uses common LSTM and convolutional layers, and outputs only a single spectral frame per decoding step.
The output of the decoder constructed with the position-sensitive attention mechanism in S20 is computed as follows:
the energy of the position-sensitive attention mechanism is calculated as in formula (4):
e_{i,j} = v_a^T · tanh(W·s_i + V·h_j + U·f_{i,j} + b)    (4);
wherein s_i is the hidden state of the decoder recurrent neural network at time i, h_j is the j-th output of the encoder, f_{i,j} is the j-th element of the convolution output of the attention weights accumulated before time i, b is a bias value that is initially the zero vector, W, V and U denote weight matrices of different network layers, v_a is a weight vector, and v_a^T denotes the transpose of v_a;
the convolution output f_i is obtained from the cumulative attention weights cα_{i−1}, with F the convolution kernel, as in formulas (5) and (6):
f_i = F ∗ cα_{i−1}    (5);
cα_{i−1} = Σ_{k=1}^{i−1} α_k    (6);
wherein α_k denotes the vector of attention weights at decoding step k.
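Formulas (5) and (6) amount to a running sum of past attention weight vectors followed by a one-dimensional convolution. A minimal sketch (with an assumed 3-tap smoothing kernel standing in for the 32 kernels of length 31, and the convolution written in the cross-correlation form used by deep-learning libraries) is:

```python
def cumulative_weights(alphas):
    """Formula (6): element-wise sum of the attention vectors of steps 1..i-1."""
    acc = [0.0] * len(alphas[0])
    for a in alphas:
        acc = [x + y for x, y in zip(acc, a)]
    return acc

def conv1d_same(x, kernel):
    """Formula (5): 1-D convolution with zero padding ('same'-length output)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + x + [0.0] * pad
    return [sum(kernel[k] * padded[j + k] for k in range(len(kernel)))
            for j in range(len(x))]

# Attention weights from two earlier decoding steps over 5 encoder positions.
alphas = [[0.7, 0.2, 0.1, 0.0, 0.0],
          [0.1, 0.6, 0.2, 0.1, 0.0]]
f_i = conv1d_same(cumulative_weights(alphas), kernel=[0.25, 0.5, 0.25])
print(f_i)  # smoothed picture of where attention has already been spent
```

The result f_i tells the energy function of formula (4), per encoder position, how much attention has already been paid nearby, which is the "additional feature" that discourages omissions and repetitions.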
d) Post-processing network and waveform synthesis, wherein the post-processing network consists of a CBHG module and a fully connected layer; the decoder output is converted into a linear spectrogram by the post-processing network, and the linear spectrogram is restored to a speech waveform for output by the Griffin-Lim algorithm.
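The Griffin-Lim reconstruction used in step d) alternates between imposing the known spectral magnitudes and re-estimating phases from the current time-domain signal. A minimal pure-Python sketch follows; the naive DFT, the toy frame size of 8, the hop of 4, and the Hann window are assumptions for demonstration, not production parameters:

```python
import cmath
import math

FRAME, HOP = 8, 4
WIN = [0.5 - 0.5 * math.cos(2 * math.pi * n / FRAME) for n in range(FRAME)]

def dft(frame):
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(spec):
    N = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * n / N)
                 for k in range(N)) / N).real for n in range(N)]

def stft(x):
    """Window each frame and take its DFT (a discretized formula (2))."""
    return [dft([x[t + n] * WIN[n] for n in range(FRAME)])
            for t in range(0, len(x) - FRAME + 1, HOP)]

def istft(frames):
    """Least-squares overlap-add inverse of the windowed STFT."""
    length = (len(frames) - 1) * HOP + FRAME
    num, den = [0.0] * length, [1e-12] * length
    for m, spec in enumerate(frames):
        seg = idft(spec)
        for n in range(FRAME):
            num[m * HOP + n] += WIN[n] * seg[n]
            den[m * HOP + n] += WIN[n] ** 2
    return [a / b for a, b in zip(num, den)]

def griffin_lim(mag, iters=30):
    """Alternate between the target magnitudes and phases re-estimated from
    the current waveform, starting from a zero-phase spectrogram."""
    spec = [[m_k + 0j for m_k in frame] for frame in mag]
    x = istft(spec)
    for _ in range(iters):
        est = stft(x)
        spec = [[m_k * (c / abs(c)) if abs(c) > 1e-12 else m_k + 0j
                 for m_k, c in zip(mf, ef)] for mf, ef in zip(mag, est)]
        x = istft(spec)
    return x

# Recover a waveform from the magnitude spectrogram of a short sine wave.
ref = [math.sin(2 * math.pi * n / 8) for n in range(24)]
mag = [[abs(c) for c in frame] for frame in stft(ref)]
rec = griffin_lim(mag)
print(len(rec))  # 24: same length as the analysis grid
```

Because magnitudes from overlapping frames constrain each other, the iteration drives the phases toward a mutually consistent spectrogram; a production system would use an FFT-based STFT and far larger frames.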
Finally, the above embodiments are intended only to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the invention.