CN113838449A - Novel Mongolian speech synthesis method

Novel Mongolian speech synthesis method

Info

Publication number
CN113838449A
Authority
CN
China
Prior art keywords
mongolian
layer
vector
output
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110817588.XA
Other languages
Chinese (zh)
Inventor
仁庆道尔吉
张文静
张倩
刘馨远
张毕力格图
郎佳珺
萨和雅
吉亚图
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202110817588.XA priority Critical patent/CN113838449A/en
Publication of CN113838449A publication Critical patent/CN113838449A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a novel Mongolian speech synthesis method, which specifically comprises the following steps: S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, comprising an input layer, an attention layer, a BiLSTM layer and an output layer, and the input Mongolian word sequence is processed; specifically, the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word are given. The novel Mongolian speech synthesis method provides Mongolian prosody modeling that fuses morphological vectors and phonetic system vectors based on the BiLSTM neural network, processes the input Mongolian word sequence, uses a synthesizer to produce acoustic features from the input text, and uses a vocoder to generate the waveform output from the acoustic features, wherein an improvement on WaveGlow is incorporated, so that the computation and resource consumption are greatly reduced and the efficiency of the synthesizer is greatly improved.

Description

Novel Mongolian speech synthesis method
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a novel Mongolian speech synthesis method.
Background
Speech synthesis is a technology for generating artificial speech by mechanical and electronic means. TTS (text-to-speech) technology belongs to speech synthesis and converts text information generated by a computer, or input from the outside, into intelligible and fluent spoken-language output; speech synthesis assistant software developed with leading speech synthesis technology can already complete speech synthesis work well. Mongolian belongs to the Mongolic branch of the Altaic language family, and its main speakers live in the Mongolian-inhabited areas of China, in Mongolia, and in the Siberian Federal District of the Russian Federation. Owing to the influence of the Soviet Union in the 1940s and 1950s, the Mongolian used in Mongolia is mainly written with the Cyrillic alphabet, and the Kalmyk and Buryat languages of Russia are regarded as dialects of Mongolian, while the Mongolian used in Inner Mongolia, China still uses the traditional Mongolian script.
Existing high-quality speech synthesizers all consume considerable computing resources, which lowers the efficiency of the synthesizer in terms of computation and power consumption, and transmitting data to the cloud introduces hidden data-security risks. WaveGlow replaces autoregressive generation with a flow-based model, which makes parallelization possible, but it is still difficult to apply in a real-time system.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a novel Mongolian speech synthesis method, which solves the problems that existing high-quality speech synthesizers consume considerable computing resources, that the efficiency of the synthesizer is limited by computation and power consumption, that transmitting data to the cloud introduces hidden data-security risks, and that WaveGlow, although it replaces autoregressive generation with a flow-based model and thus makes parallelization possible, is still difficult to apply in a real-time system.
In order to achieve the purpose, the invention is realized by the following technical scheme: a novel Mongolian speech synthesis method specifically comprises the following steps:
S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer; the input Mongolian word sequence is processed as follows. Specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, the weights of the three vectors are predicted by two fully-connected neural networks, and the three vectors multiplied by their respective weights w1, w2 and w3 are then combined to form the final Mongolian word vector representation WE*. The formula group is:
w2 = σ(M1·tanh(M2·WE + M3·PE)),
w3 = σ(M1′·tanh(M2′·WE + M3′·ME)),
w1 = 1 − w2 − w3,
WE* = w1·WE + w2·PE + w3·ME.
The BiLSTM layer reads the input feature vectors WE* to extract richer high-level semantic features: the forward LSTM first reads the feature vectors WE* from left to right and obtains the hidden states h→1, h→2, …, h→T in turn, and the backward LSTM then reads them from right to left and obtains the hidden states h←1, h←2, …, h←T. The forward and backward hidden states have the same time length, and the hidden states of the corresponding time steps are finally summed to obtain the final hidden-state output H of each time step, i.e. Ht = h→t + h←t. The hidden-state output H obtained by the BiLSTM layer is sent to the output layer and decoded to obtain the final prosodic tag corresponding to each Mongolian word. The output layer decodes the hidden-state output H of the BiLSTM layer, and two kinds of output-layer functions can be selected for decoding. The first output layer is a Softmax function, which converts the output vector into probability values between 0 and 1 and normalizes them, so that the prosodic tag with the maximum probability value is taken as the final prosodic tag. The other output layer is a conditional random field (CRF), which takes maximizing the score of the correct complete tag sequence Y = [y1, y2, …, yT] as the optimization objective; the formulation is:
p(Y | X) = exp(s_Y) / Σ(Y′ ∈ Y_X) exp(s_Y′).
The coding network starts with an embedding layer, which converts characters or phonemes into a trainable vector representation he. The embedding he is first converted from the embedding dimension to the target dimension through a fully connected layer, then processed by a convolution block to extract the temporal dependencies of the textual information, and finally projected back to the embedding dimension to create the attention key vector hk; the attention value vector is computed from the attention key vector and the text embedding as hv = √0.5·(hk + he).
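As an illustrative aid only (not part of the claimed method), the following minimal Python/PyTorch sketch shows one way the attention-layer fusion of WE, ME and PE described above could be realized; the module name AttentionFusion, the concatenated inputs to the two gating networks, and all dimensions are assumptions rather than details taken from the patent.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses word (WE), morphological (ME) and phonetic system (PE) vectors."""
    def __init__(self, dim: int):
        super().__init__()
        # Two small fully-connected networks predict the weights w2 and w3.
        self.fc_pe = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1), nn.Sigmoid())
        self.fc_me = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, we, me, pe):
        # we, me, pe: (batch, seq_len, dim)
        w2 = self.fc_pe(torch.cat([we, pe], dim=-1))   # weight of the phonetic system vector
        w3 = self.fc_me(torch.cat([we, me], dim=-1))   # weight of the morphological vector
        w1 = 1.0 - w2 - w3                             # weight of the word vector
        return w1 * we + w2 * pe + w3 * me             # fused Mongolian word vector WE*

In this sketch the weighted sum plays the role of WE*; whether the gating networks take the concatenation [WE; PE] or two separate linear maps M2·WE + M3·PE is an implementation choice that the formulas above leave open.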
S2, using SqueezeWave to improve synthesizer efficiency: moving TTS from the cloud to the edge, a typical modern speech synthesis model mainly includes two parts: a synthesizer and a vocoder. A lightweight flow-based vocoder, SqueezeWave, is proposed for speech synthesis on edge devices; it redesigns the architecture of WaveGlow by reshaping the audio tensor and adopting depthwise separable convolutions and related optimizations, so that it consumes 61-214 times less computation than WaveGlow and can generate 123K-303K samples per second on a notebook computer. Unlike performing the convolution operation directly, WaveGlow first constructs a multi-channel input by grouping adjacent samples, where L is the length of the time-domain dimension and Cg is the number of samples grouped at each time step, so that the total number of samples in the waveform is L x Cg. The waveform is subsequently transformed by a series of bijective mappings, each of which uses its input to obtain its output. In each bijection, the input signal is first processed by an invertible point-wise convolution, and the result is then split along the channel dimension into two halves; one half is used to calculate the affine coupling coefficients applied to the other half, and the calculation applied is a WaveNet-like function that is also conditioned on the mel spectrum encoding the audio, where Lm is the time length of the mel spectrum and Cm is the number of frequency components. The two halves are finally combined in the channel direction to obtain the final output;
S3, producing the waveform from the acoustic features using the vocoder: the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 then continues to process the result; meanwhile, the mel spectrum is also fed into the network. The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times. A point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8, transforming the input audio to a smaller time-domain length and more channels while keeping the channel size in the WN function unchanged.
Preferably, in S1, for each Mongolian word to be input, the input layer finds the corresponding word vector, morphological vector and phonetic system vector by looking up a vocabulary; the attention layer takes the three Mongolian word feature vectors as input and integrates them together by weighted summation to obtain a new Mongolian word vector.
Preferably, in S2, the synthesizer is used to generate the acoustic features from the text input, and then the vocoder is used to generate the waveform output from the acoustic features.
Preferably, in S3, the start convolution increases the number of channels from a small number to a very large number, and the output dimension of start in WaveGlow is 256 dimensions.
Preferably, in S3, since the time domain length of the mel spectrum is much smaller than the waveform length, it needs to be up-sampled for dimension matching.
Preferably, in S3, when L is 64, the time-domain length is the same as that of the mel spectrum and no upsampling is needed; when L is 128, the mel spectrum only needs nearest-neighbor upsampling, so the computation overhead of cond_layer is further reduced, and depthwise separable convolution reduces the amount of computation.
Preferably, in S3, SqueezeWave, a lightweight vocoder based on an improvement of WaveGlow, can generate similar speech quality while running with 61x-214x fewer MACs, because the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation.
Advantageous effects
The invention provides a novel Mongolian speech synthesis method. Compared with the prior art, the method has the following beneficial effects:
(1) The novel Mongolian speech synthesis method processes the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer. The input Mongolian word sequence is processed; specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, their weights are respectively predicted by two fully-connected neural networks. On this basis, Mongolian prosody modeling fusing morphological vectors and phonetic system vectors is provided, the input Mongolian word sequence is processed, the synthesizer produces acoustic features from the input text, and the vocoder then generates the waveform output from the acoustic features, wherein the improvement on WaveGlow is incorporated, so that the computation and resource consumption are reduced and the efficiency of the synthesizer is greatly improved.
(2) In the novel Mongolian speech synthesis method, the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 then continues to process the result; meanwhile, the mel spectrum is also fed into the network. The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times. A point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8. The input audio is transformed to a smaller time-domain length and more channels while the channel size in the WN function is kept, and the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation, greatly improving performance on mobile devices such as mobile phones, greatly reducing the computation consumed in the cloud if deployed there, and making full use of Mongolian morphological knowledge and phonological knowledge to improve the performance of the Mongolian prosody modeling method.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a Mongolian prosody modeling method of the present invention;
FIG. 3 is a schematic diagram of the point-by-point convolution process of the present invention;
FIG. 4 is a schematic diagram of the network structure of WaveGlow according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-4, the present invention provides a technical solution: a novel Mongolian speech synthesis method specifically comprises the following steps:
S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer; the input Mongolian word sequence is processed as follows. Specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, the weights of the three vectors are respectively predicted by two fully-connected neural networks, and the three vectors multiplied by their respective weights w1, w2 and w3 are then combined to form the final Mongolian word vector representation WE*. The formula group is:
w2 = σ(M1·tanh(M2·WE + M3·PE)),
w3 = σ(M1′·tanh(M2′·WE + M3′·ME)),
w1 = 1 − w2 − w3,
WE* = w1·WE + w2·PE + w3·ME,
wherein M1, M2, M3 and M1′, M2′, M3′ are weight matrices and σ(·) is the Logistic function that normalizes the result of the computation to between 0 and 1. Through the information fusion of the attention layer, the final Mongolian word vector WE* can obtain the maximum information benefit from the different information sources, enhancing the robustness of the Mongolian word vector representation. The BiLSTM layer reads the input feature vectors WE* to extract richer high-level semantic features: the forward LSTM first reads the feature vectors WE* from left to right and obtains the hidden states h→1, h→2, …, h→T in turn, and the backward LSTM then reads them from right to left and obtains the hidden states h←1, h←2, …, h←T; the forward and backward hidden states have the same time length, and the hidden states of the corresponding time steps are summed to obtain the final hidden-state output H of each time step, i.e. Ht = h→t + h←t. The hidden-state output H obtained by the BiLSTM layer is sent to the output layer and decoded to obtain the final prosodic tag corresponding to each Mongolian word. The output layer decodes the hidden-state output H of the BiLSTM layer, and two kinds of output-layer functions can be selected for decoding. The first is the Softmax function, which converts the output vector into probability values between 0 and 1 and normalizes them, so that the prosodic tag with the maximum probability value is taken as the final prosodic tag. The other is a conditional random field (CRF): for the sequence labeling problem there are strong dependency relationships between tags, for example a 'pause' tag must be followed by a 'non-pause' tag and cannot be followed by another 'pause' tag; in order to better describe the dependency relationships of adjacent tags and obtain a globally optimal solution, the CRF output layer takes maximizing the score of the correct complete tag sequence Y = [y1, y2, …, yT] as the optimization objective, and the formulation is:
p(Y | X) = exp(s_Y) / Σ(Y′ ∈ Y_X) exp(s_Y′),
wherein s_Y is the score of the output sequence Y and Y_X denotes all possible output sequences corresponding to the input sequence X; model training uses a cross-entropy loss function as the optimization target. The coding network starts with an embedding layer, which converts characters or phonemes into a trainable vector representation he. The embedding he is first converted from the embedding dimension to the target dimension through a fully connected layer, then processed by a convolution block to extract the temporal dependencies of the textual information, and finally projected back to the embedding dimension to create the attention key vector hk. The attention value vector is computed from the attention key vector and the text embedding as hv = √0.5·(hk + he), so as to jointly consider the local information in he and the long-term context information in hk. The key vector hk is used by each attention block to compute attention weights, and the final context vector is computed as the weighted average of the value vectors hv;
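The BiLSTM layer and the Softmax output layer described above can be sketched as follows. This is a minimal illustration under assumed dimensions (the CRF output layer is omitted); the class and variable names are hypothetical and not taken from the patent.

import torch
import torch.nn as nn

class BiLSTMProsodyTagger(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_tags: int):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, we_star):                        # we_star: (batch, T, in_dim), the fused WE* vectors
        h, _ = self.bilstm(we_star)                    # (batch, T, 2 * hidden_dim)
        h_fwd, h_bwd = h.chunk(2, dim=-1)              # forward and backward hidden states
        H = h_fwd + h_bwd                              # sum the two directions per time step
        return torch.softmax(self.out(H), dim=-1)      # per-step prosodic-tag probabilities

In practice the argmax over the last dimension gives the prosodic tag with the maximum probability; replacing the Softmax head with a CRF layer would instead score whole tag sequences, as in the formulation above.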
S2, using SqueezeWave to improve synthesizer efficiency: moving TTS from the cloud to the edge, a typical modern speech synthesis model mainly includes two parts: a synthesizer and a vocoder, wherein the synthesizer is configured to generate acoustic features from the text input, and the vocoder then generates the waveform output from the acoustic features. Existing high-quality speech synthesizers consume quite considerable computing resources, and SqueezeWave mainly aims at improving the efficiency of the synthesizer: a lightweight flow-based vocoder is proposed for speech synthesis on edge devices, the architecture of WaveGlow is redesigned, and by reshaping the audio tensor and adopting depthwise separable convolutions and related optimizations it consumes 61-214 times less computation than WaveGlow and can generate 123K-303K samples per second on a notebook computer. Unlike performing the convolution operation directly, WaveGlow first constructs a multi-channel input by grouping adjacent samples, where L is the length of the time-domain dimension and Cg is the number of samples grouped at each time step, so that the total number of samples in the waveform is L x Cg. The waveform is subsequently transformed by a series of bijective mappings, each of which uses its input to obtain its output. In each bijection, the input signal is first processed by an invertible point-wise convolution, and the result is then split along the channel dimension into two halves; one half is used to calculate the affine coupling coefficients applied to the other half, and the calculation applied is a WaveNet-like function that is also conditioned on the mel spectrum encoding the audio, where Lm is the time length of the mel spectrum and Cm is the number of frequency components. The two halves are finally combined in the channel direction to obtain the final output;
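The reshaping of the waveform into a multi-channel tensor and a single affine-coupling bijection can be illustrated with the sketch below. It is a simplified illustration in the spirit of WaveGlow/SqueezeWave only: the wn argument stands in for the WaveNet-like function, and all shapes are illustrative assumptions.

import torch

def reshape_audio(wav: torch.Tensor, cg: int) -> torch.Tensor:
    # wav: (batch, n_samples); group Cg adjacent samples per time step.
    batch, n = wav.shape
    l = n // cg
    return wav[:, : l * cg].reshape(batch, l, cg).transpose(1, 2)  # (batch, Cg, L)

def affine_coupling(x: torch.Tensor, mel: torch.Tensor, wn) -> torch.Tensor:
    # Split along the channel dimension; one half conditions the other.
    x_a, x_b = x.chunk(2, dim=1)
    log_s, t = wn(x_a, mel).chunk(2, dim=1)   # WaveNet-like function predicts the coupling coefficients
    x_b = torch.exp(log_s) * x_b + t          # affine transformation of the second half
    return torch.cat([x_a, x_b], dim=1)       # recombine along the channel direction

Because each step only transforms x_b while passing x_a through unchanged, the mapping remains invertible, which is what allows the flow-based vocoder to be trained by maximum likelihood and sampled in parallel.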
S3, producing the waveform from the acoustic features using the vocoder: the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 (in_layer) then continues to process the result; meanwhile, the mel spectrum is also fed into the network (cond_layer). The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times, a point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8. In the WaveGlow source code the amount of computation per second is 229G MACs, of which in_layer accounts for 47%, cond_layer for 39% and res_skip_layer for 14%; for this reason the original network structure is improved to reduce the amount of computation and increase the computational efficiency, and the input audio is transformed to a smaller time-domain length and more channels while the channel size in the WN function is kept unchanged.
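A single, simplified layer of the WN function described above might look like the following sketch. The real WaveGlow WN stacks eight such dilated-convolution layers with residual and skip connections; the channel sizes, the single dilation factor and the module names used here are illustrative assumptions only.

import torch
import torch.nn as nn

class SimpleWN(nn.Module):
    def __init__(self, in_ch=4, n_ch=512, mel_ch=80, out_ch=8):
        super().__init__()
        self.start = nn.Conv1d(in_ch, n_ch, 1)                               # point-wise convolution
        self.in_layer = nn.Conv1d(n_ch, 2 * n_ch, 3, padding=2, dilation=2)  # dilated conv, kernel 3
        self.cond_layer = nn.Conv1d(mel_ch, 2 * n_ch, 1)                     # conditions on the mel spectrum
        self.res_skip_layer = nn.Conv1d(n_ch, n_ch, 1)
        self.end = nn.Conv1d(n_ch, out_ch, 1)                                # produces log s and t

    def forward(self, x_a, mel):
        # x_a: (batch, in_ch, L); mel must already be upsampled to time length L.
        h = self.start(x_a)
        acts = self.in_layer(h) + self.cond_layer(mel)
        t_act, s_act = acts.chunk(2, dim=1)
        h = torch.tanh(t_act) * torch.sigmoid(s_act)                         # WaveNet-style gate function
        h = self.res_skip_layer(h)
        return self.end(h)                                                   # split later into s_i and t_i

The sketch makes the cost analysis above concrete: in_layer and cond_layer each map to 2 x n_ch channels over the full time length L, which is why they dominate the MAC count and why reducing L and the channel width pays off.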
Analysis of WaveGlow reveals that the most significant amount of computation comes from the shape of the input audio waveform: WaveGlow uses an input of dimension (L = 2000, Cg = 8), which results in very high computational complexity, because WaveGlow is composed of one-dimensional convolutions whose computational complexity increases linearly with L, and because the mel spectrum needs to be upsampled in order to match this high temporal resolution.
In the invention, in S1, for each Mongolian word to be input, the input layer finds the corresponding word vector, morphological vector and phonetic system vector by looking up a vocabulary; the attention layer takes the three Mongolian word feature vectors as input and integrates them together by weighted summation to obtain a new Mongolian word vector.
In the present invention, in S2, the synthesizer is used to generate acoustic features from the text input, and then the vocoder is used to generate waveform output from the acoustic features.
In the present invention, in S3, the start convolution increases the number of channels from a small number to a very large number, and the output dimension of start in WaveGlow is 256 dimensions.
In the present invention, in S3, since the time domain length of the mel spectrum is much smaller than the waveform length, it is necessary to perform dimension matching by up-sampling it.
In the present invention, in S3, when L is 64, the time-domain length is the same as that of the mel spectrum and no upsampling is needed; when L is 128, the mel spectrum only needs nearest-neighbor upsampling, so the computation overhead of cond_layer is further reduced, and depthwise separable convolution reduces the amount of computation, as shown in the sketch below.
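The two optimizations just mentioned can be illustrated as follows: a standard one-dimensional convolution is contrasted with a depthwise separable one, and the mel spectrum is upsampled by nearest-neighbor interpolation to the reduced time length. All channel counts and lengths are illustrative assumptions, not values fixed by the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_ch, L, k = 256, 128, 3
standard = nn.Conv1d(n_ch, n_ch, k, padding=1)
separable = nn.Sequential(
    nn.Conv1d(n_ch, n_ch, k, padding=1, groups=n_ch),  # depthwise: one filter per channel
    nn.Conv1d(n_ch, n_ch, 1),                          # pointwise: mixes channels
)
# Approximate MACs per output position: standard ~ n_ch*n_ch*k, separable ~ n_ch*k + n_ch*n_ch
print(n_ch * n_ch * k, n_ch * k + n_ch * n_ch)         # 196608 vs 66304

mel = torch.randn(1, 80, 63)                            # (batch, Cm, Lm)
mel_up = F.interpolate(mel, size=L, mode="nearest")     # nearest-neighbor upsampling to length L = 128

The printed comparison shows roughly a 3x reduction in MACs for this layer; combined with the smaller time-domain length, this is the kind of saving that accumulates into the overall 61x-214x reduction reported for SqueezeWave.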
In the present invention, in S3, SqueezeWave, a lightweight vocoder based on an improvement of WaveGlow, can generate similar speech quality while running with 61x-214x fewer MACs, because the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation.
And those not described in detail in this specification are well within the skill of those in the art.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A novel Mongolian speech synthesis method is characterized in that: the method specifically comprises the following steps:
S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer; the input Mongolian word sequence is processed, specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, the weights of the three vectors are predicted by two fully-connected neural networks, and the three vectors multiplied by their respective weights w1, w2 and w3 are then combined to form the final Mongolian word vector representation WE*, the formula group being:
w2 = σ(M1·tanh(M2·WE + M3·PE)), w3 = σ(M1′·tanh(M2′·WE + M3′·ME)), w1 = 1 − w2 − w3, and WE* = w1·WE + w2·PE + w3·ME;
the BiLSTM layer reads the input feature vectors WE* to extract richer high-level semantic features: the forward LSTM first reads the feature vectors WE* from left to right and obtains the hidden states h→1, h→2, …, h→T in turn, the backward LSTM then obtains the hidden states h←1, h←2, …, h←T, the forward and backward hidden states have the same time length, and the hidden states of the corresponding time steps are summed to obtain the final hidden-state output H of each time step; the hidden-state output H obtained by the BiLSTM layer is sent to the output layer and decoded to obtain the final prosodic tag corresponding to each Mongolian word; the output layer decodes the hidden-state output of the BiLSTM layer, and two kinds of output-layer functions can be selected for decoding: the first output layer is a Softmax function, which converts the output vector into probability values between 0 and 1 and normalizes them, so that the prosodic tag with the maximum probability value is taken as the final prosodic tag; the other is a conditional random field, since for the sequence labeling problem there are strong dependencies between tags, and in order to better describe the dependencies of adjacent tags and obtain a globally optimal solution, the CRF output layer takes maximizing the score of the correct complete tag sequence Y = [y1, y2, …, yT] as the optimization objective, the formulation being:
p(Y | X) = exp(s_Y) / Σ(Y′ ∈ Y_X) exp(s_Y′);
the coding network starts with an embedding layer, which converts characters or phonemes into a trainable vector representation heEmbedded heFirst from the embedding dimension to the target dimension by fully connected layers, and then by convolutionBlock processing to extract temporal dependencies of textual information, which are finally projected into the embedding dimension to create an attention key vector hkThe value vector of attention is computed from the attention key vector and text embedding:
Figure RE-FDA0003353435550000021
S2, using SqueezeWave to improve synthesizer efficiency: moving TTS from the cloud to the edge, a typical modern speech synthesis model mainly includes two parts: a synthesizer and a vocoder. A lightweight flow-based vocoder, SqueezeWave, is proposed for speech synthesis on edge devices; it redesigns the architecture of WaveGlow by reshaping the audio tensor and adopting depthwise separable convolutions and related optimizations, so that it consumes 61-214 times less computation than WaveGlow and can generate 123K-303K samples per second on a notebook computer. Unlike performing the convolution operation directly, WaveGlow first constructs a multi-channel input by grouping adjacent samples, where L is the length of the time-domain dimension and Cg is the number of samples grouped at each time step, so that the total number of samples in the waveform is L x Cg. The waveform is subsequently transformed by a series of bijective mappings, each of which uses its input to obtain its output. In each bijection, the input signal is first processed by an invertible point-wise convolution, and the result is then split along the channel dimension into two halves; one half is used to calculate the affine coupling coefficients applied to the other half, and the calculation applied is a WaveNet-like function that is also conditioned on the mel spectrum encoding the audio, where Lm is the time length of the mel spectrum and Cm is the number of frequency components. The two halves are finally combined in the channel direction to obtain the final output;
S3, producing the waveform from the acoustic features using the vocoder: the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 then continues to process the result; meanwhile, the mel spectrum is also fed into the network. The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times. A point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8, transforming the input audio to a smaller time-domain length and more channels while keeping the channel size in the WN function unchanged.
2. The method of claim 1, wherein, in S1, for each Mongolian word to be input, the input layer finds the corresponding word vector, morphological vector and phonetic system vector by looking up a vocabulary, the attention layer takes the three Mongolian word feature vectors as input, and the three feature vectors are integrated together by weighted summation to obtain a new Mongolian word vector.
3. The method of claim 1, wherein the Mongolian speech synthesis method comprises: in S2, the synthesizer is used to generate acoustic features from the text input, and then the vocoder is used to generate waveform output from the acoustic features.
4. The method of claim 1, wherein, in S3, the start convolution increases the number of channels from a small number to a very large number, and the output dimension of start in WaveGlow is 256 dimensions.
5. The method of claim 1, wherein the Mongolian speech synthesis method comprises: in S3, since the time domain length of the mel spectrum is much smaller than the waveform length, it needs to be up-sampled for dimension matching.
6. The method of claim 1, wherein, in S3, when L is 64, the time-domain length is the same as that of the mel spectrum and no upsampling is needed; when L is 128, the mel spectrum only needs nearest-neighbor upsampling, so the computation overhead of cond_layer is further reduced, and depthwise separable convolution reduces the amount of computation.
7. The method of claim 1, wherein, in S3, SqueezeWave, a lightweight vocoder based on an improvement of WaveGlow, can generate similar speech quality while running with 61x-214x fewer MACs, because the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation.
CN202110817588.XA 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method Pending CN113838449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110817588.XA CN113838449A (en) 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110817588.XA CN113838449A (en) 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method

Publications (1)

Publication Number Publication Date
CN113838449A true CN113838449A (en) 2021-12-24

Family

ID=78962831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110817588.XA Pending CN113838449A (en) 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method

Country Status (1)

Country Link
CN (1) CN113838449A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420087A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Acoustic feature determination method, device, equipment, medium and product



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211224