CN113838449A - Novel Mongolian speech synthesis method

Novel Mongolian speech synthesis method

Info

Publication number
CN113838449A
Authority
CN
China
Prior art keywords
mongolian
layer
vector
output
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110817588.XA
Other languages
Chinese (zh)
Inventor
仁庆道尔吉
张文静
张倩
刘馨远
张毕力格图
郎佳珺
萨和雅
吉亚图
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202110817588.XA priority Critical patent/CN113838449A/en
Publication of CN113838449A publication Critical patent/CN113838449A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a novel Mongolian speech synthesis method, which specifically comprises the following steps: S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, comprising an input layer, an attention layer, a BiLSTM layer and an output layer, and the input Mongolian word sequence is processed; specifically, the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word are given. The novel Mongolian speech synthesis method provides Mongolian prosody modeling that fuses morphological vectors and phonetic system vectors based on the BiLSTM neural network, processes the input Mongolian word sequence, uses a synthesizer to produce acoustic features from the input text, and uses a vocoder to generate the waveform output from the acoustic features, wherein an improvement on WaveGlow is incorporated, so that the computation and resource consumption are greatly reduced and the efficiency of the synthesizer is greatly improved.

Description

Novel Mongolian speech synthesis method
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a novel Mongolian speech synthesis method.
Background
Speech synthesis is a technology for generating artificial speech by mechanical and electronic means. TTS (text-to-speech) technology belongs to speech synthesis and converts text information generated by a computer, or input from the outside, into intelligible and fluent spoken-language output; speech synthesis assistant software developed with leading speech synthesis technology can already complete speech synthesis work well. Mongolian belongs to the Mongolic branch of the Altaic language family, and its main speakers live in the Mongolian-inhabited areas of China, in Mongolia, and in the Siberian Federal District of the Russian Federation. Owing to the influence of the Soviet Union in the 1940s and 1950s, the Mongolian used in Mongolia is mainly written with the Cyrillic alphabet, and the Kalmyk and Buryat languages of Russia are regarded as dialects of Mongolian, while the Mongolian used in Inner Mongolia, China still uses the traditional Mongolian script.
Existing high-quality speech synthesizers all consume considerable computing resources, which lowers the efficiency of the synthesizer in terms of computation and power consumption, and transmitting data to the cloud introduces hidden data-security risks. WaveGlow replaces autoregressive generation with a flow-based model, which makes parallelization possible, but it is still difficult to apply in a real-time system.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a novel Mongolian speech synthesis method, which solves the problems that existing high-quality speech synthesizers consume considerable computing resources, that the efficiency of the synthesizer is limited by computation and power consumption, that transmitting data to the cloud introduces hidden data-security risks, and that WaveGlow, although it replaces autoregressive generation with a flow-based model and thus makes parallelization possible, is still difficult to apply in a real-time system.
In order to achieve the purpose, the invention is realized by the following technical scheme: a novel Mongolian speech synthesis method specifically comprises the following steps:
S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer; the input Mongolian word sequence is processed as follows. Specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, the weights of the three vectors are predicted by two fully-connected neural networks, and the three vectors multiplied by their respective weights w1, w2 and w3 are then combined to form the final Mongolian word vector representation WE*. The formula group is:
w2 = σ(M1·tanh(M2·WE + M3·PE)),
w3 = σ(M1′·tanh(M2′·WE + M3′·ME)),
w1 = 1 − w2 − w3,
WE* = w1·WE + w2·PE + w3·ME.
The BiLSTM layer reads the input feature vectors WE* to extract richer high-level semantic features: the forward LSTM first reads the feature vectors WE* from left to right and obtains the hidden states h→1, h→2, …, h→T in turn, and the backward LSTM then reads them from right to left and obtains the hidden states h←1, h←2, …, h←T. The forward and backward hidden states have the same time length, and the hidden states of the corresponding time steps are finally summed to obtain the final hidden-state output H of each time step, i.e. Ht = h→t + h←t. The hidden-state output H obtained by the BiLSTM layer is sent to the output layer and decoded to obtain the final prosodic tag corresponding to each Mongolian word. The output layer decodes the hidden-state output H of the BiLSTM layer, and two kinds of output-layer functions can be selected for decoding. The first output layer is a Softmax function, which converts the output vector into probability values between 0 and 1 and normalizes them, so that the prosodic tag with the maximum probability value is taken as the final prosodic tag. The other output layer is a conditional random field (CRF), which takes maximizing the score of the correct complete tag sequence Y = [y1, y2, …, yT] as the optimization objective; the formulation is:
p(Y | X) = exp(s_Y) / Σ(Y′ ∈ Y_X) exp(s_Y′).
The coding network starts with an embedding layer, which converts characters or phonemes into a trainable vector representation he. The embedding he is first converted from the embedding dimension to the target dimension through a fully connected layer, then processed by a convolution block to extract the temporal dependencies of the textual information, and finally projected back to the embedding dimension to create the attention key vector hk; the attention value vector is computed from the attention key vector and the text embedding as hv = √0.5·(hk + he).
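As an illustrative aid only (not part of the claimed method), the following minimal Python/PyTorch sketch shows one way the attention-layer fusion of WE, ME and PE described above could be realized; the module name AttentionFusion, the concatenated inputs to the two gating networks, and all dimensions are assumptions rather than details taken from the patent.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses word (WE), morphological (ME) and phonetic system (PE) vectors."""
    def __init__(self, dim: int):
        super().__init__()
        # Two small fully-connected networks predict the weights w2 and w3.
        self.fc_pe = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1), nn.Sigmoid())
        self.fc_me = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, we, me, pe):
        # we, me, pe: (batch, seq_len, dim)
        w2 = self.fc_pe(torch.cat([we, pe], dim=-1))   # weight of the phonetic system vector
        w3 = self.fc_me(torch.cat([we, me], dim=-1))   # weight of the morphological vector
        w1 = 1.0 - w2 - w3                             # weight of the word vector
        return w1 * we + w2 * pe + w3 * me             # fused Mongolian word vector WE*

In this sketch the weighted sum plays the role of WE*; whether the gating networks take the concatenation [WE; PE] or two separate linear maps M2·WE + M3·PE is an implementation choice that the formulas above leave open.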
S2, using SqueezeWave to improve synthesizer efficiency: moving TTS from the cloud to the edge, a typical modern speech synthesis model mainly includes two parts: a synthesizer and a vocoder. A lightweight flow-based vocoder, SqueezeWave, is proposed for speech synthesis on edge devices; it redesigns the architecture of WaveGlow by reshaping the audio tensor and adopting depthwise separable convolutions and related optimizations, so that it consumes 61-214 times less computation than WaveGlow and can generate 123K-303K samples per second on a notebook computer. Unlike performing the convolution operation directly, WaveGlow first constructs a multi-channel input by grouping adjacent samples, where L is the length of the time-domain dimension and Cg is the number of samples grouped at each time step, so that the total number of samples in the waveform is L x Cg. The waveform is subsequently transformed by a series of bijective mappings, each of which uses its input to obtain its output. In each bijection, the input signal is first processed by an invertible point-wise convolution, and the result is then split along the channel dimension into two halves; one half is used to calculate the affine coupling coefficients applied to the other half, and the calculation applied is a WaveNet-like function that is also conditioned on the mel spectrum encoding the audio, where Lm is the time length of the mel spectrum and Cm is the number of frequency components. The two halves are finally combined in the channel direction to obtain the final output;
S3, producing the waveform from the acoustic features using the vocoder: the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 then continues to process the result; meanwhile, the mel spectrum is also fed into the network. The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times. A point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8, transforming the input audio to a smaller time-domain length and more channels while keeping the channel size in the WN function unchanged.
Preferably, in S1, for each Mongolian word to be input, the input layer finds the corresponding word vector, morphological vector and phonetic system vector by looking up a vocabulary; the attention layer takes the three Mongolian word feature vectors as input and integrates them together by weighted summation to obtain a new Mongolian word vector.
Preferably, in S2, the synthesizer is used to generate the acoustic features from the text input, and then the vocoder is used to generate the waveform output from the acoustic features.
Preferably, in S3, the start convolution increases the number of channels from a small number to a very large number, and the output dimension of start in WaveGlow is 256 dimensions.
Preferably, in S3, since the time domain length of the mel spectrum is much smaller than the waveform length, it needs to be up-sampled for dimension matching.
Preferably, in S3, when L is 64, the time-domain length is the same as that of the mel spectrum and no upsampling is needed; when L is 128, the mel spectrum only needs nearest-neighbor upsampling, so the computation overhead of cond_layer is further reduced, and depthwise separable convolution reduces the amount of computation.
Preferably, in S3, SqueezeWave, a lightweight vocoder based on an improvement of WaveGlow, can generate similar speech quality while running with 61x-214x fewer MACs, because the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation.
Advantageous effects
The invention provides a novel Mongolian speech synthesis method. Compared with the prior art, the method has the following beneficial effects:
(1) The novel Mongolian speech synthesis method processes the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer. The input Mongolian word sequence is processed; specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, their weights are respectively predicted by two fully-connected neural networks. On this basis, Mongolian prosody modeling fusing morphological vectors and phonetic system vectors is provided, the input Mongolian word sequence is processed, the synthesizer produces acoustic features from the input text, and the vocoder then generates the waveform output from the acoustic features, wherein the improvement on WaveGlow is incorporated, so that the computation and resource consumption are reduced and the efficiency of the synthesizer is greatly improved.
(2) In the novel Mongolian speech synthesis method, the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 then continues to process the result; meanwhile, the mel spectrum is also fed into the network. The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times. A point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8. The input audio is transformed to a smaller time-domain length and more channels while the channel size in the WN function is kept, and the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation, greatly improving performance on mobile devices such as mobile phones, greatly reducing the computation consumed in the cloud if deployed there, and making full use of Mongolian morphological knowledge and phonological knowledge to improve the performance of the Mongolian prosody modeling method.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a Mongolian prosody modeling method of the present invention;
FIG. 3 is a schematic diagram of the point-by-point convolution process of the present invention;
FIG. 4 is a schematic diagram of the network structure of WaveGlow according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-4, the present invention provides a technical solution: a novel Mongolian speech synthesis method specifically comprises the following steps:
S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer; the input Mongolian word sequence is processed as follows. Specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, the weights of the three vectors are respectively predicted by two fully-connected neural networks, and the three vectors multiplied by their respective weights w1, w2 and w3 are then combined to form the final Mongolian word vector representation WE*. The formula group is:
w2 = σ(M1·tanh(M2·WE + M3·PE)),
w3 = σ(M1′·tanh(M2′·WE + M3′·ME)),
w1 = 1 − w2 − w3,
WE* = w1·WE + w2·PE + w3·ME,
wherein M1, M2, M3 and M1′, M2′, M3′ are weight matrices and σ(·) is the Logistic function that normalizes the result of the computation to between 0 and 1. Through the information fusion of the attention layer, the final Mongolian word vector WE* can obtain the maximum information benefit from the different information sources, enhancing the robustness of the Mongolian word vector representation. The BiLSTM layer reads the input feature vectors WE* to extract richer high-level semantic features: the forward LSTM first reads the feature vectors WE* from left to right and obtains the hidden states h→1, h→2, …, h→T in turn, and the backward LSTM then reads them from right to left and obtains the hidden states h←1, h←2, …, h←T; the forward and backward hidden states have the same time length, and the hidden states of the corresponding time steps are summed to obtain the final hidden-state output H of each time step, i.e. Ht = h→t + h←t. The hidden-state output H obtained by the BiLSTM layer is sent to the output layer and decoded to obtain the final prosodic tag corresponding to each Mongolian word. The output layer decodes the hidden-state output H of the BiLSTM layer, and two kinds of output-layer functions can be selected for decoding. The first is the Softmax function, which converts the output vector into probability values between 0 and 1 and normalizes them, so that the prosodic tag with the maximum probability value is taken as the final prosodic tag. The other is a conditional random field (CRF): for the sequence labeling problem there are strong dependency relationships between tags, for example a 'pause' tag must be followed by a 'non-pause' tag and cannot be followed by another 'pause' tag; in order to better describe the dependency relationships of adjacent tags and obtain a globally optimal solution, the CRF output layer takes maximizing the score of the correct complete tag sequence Y = [y1, y2, …, yT] as the optimization objective, and the formulation is:
p(Y | X) = exp(s_Y) / Σ(Y′ ∈ Y_X) exp(s_Y′),
wherein s_Y is the score of the output sequence Y and Y_X denotes all possible output sequences corresponding to the input sequence X; model training uses a cross-entropy loss function as the optimization target. The coding network starts with an embedding layer, which converts characters or phonemes into a trainable vector representation he. The embedding he is first converted from the embedding dimension to the target dimension through a fully connected layer, then processed by a convolution block to extract the temporal dependencies of the textual information, and finally projected back to the embedding dimension to create the attention key vector hk. The attention value vector is computed from the attention key vector and the text embedding as hv = √0.5·(hk + he), so as to jointly consider the local information in he and the long-term context information in hk. The key vector hk is used by each attention block to compute attention weights, and the final context vector is computed as the weighted average of the value vectors hv;
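The BiLSTM layer and the Softmax output layer described above can be sketched as follows. This is a minimal illustration under assumed dimensions (the CRF output layer is omitted); the class and variable names are hypothetical and not taken from the patent.

import torch
import torch.nn as nn

class BiLSTMProsodyTagger(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_tags: int):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, we_star):                        # we_star: (batch, T, in_dim), the fused WE* vectors
        h, _ = self.bilstm(we_star)                    # (batch, T, 2 * hidden_dim)
        h_fwd, h_bwd = h.chunk(2, dim=-1)              # forward and backward hidden states
        H = h_fwd + h_bwd                              # sum the two directions per time step
        return torch.softmax(self.out(H), dim=-1)      # per-step prosodic-tag probabilities

In practice the argmax over the last dimension gives the prosodic tag with the maximum probability; replacing the Softmax head with a CRF layer would instead score whole tag sequences, as in the formulation above.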
S2, using SqueezeWave to improve synthesizer efficiency: moving TTS from the cloud to the edge, a typical modern speech synthesis model mainly includes two parts: a synthesizer and a vocoder, wherein the synthesizer is configured to generate acoustic features from the text input, and the vocoder then generates the waveform output from the acoustic features. Existing high-quality speech synthesizers consume quite considerable computing resources, and SqueezeWave mainly aims at improving the efficiency of the synthesizer: a lightweight flow-based vocoder is proposed for speech synthesis on edge devices, the architecture of WaveGlow is redesigned, and by reshaping the audio tensor and adopting depthwise separable convolutions and related optimizations it consumes 61-214 times less computation than WaveGlow and can generate 123K-303K samples per second on a notebook computer. Unlike performing the convolution operation directly, WaveGlow first constructs a multi-channel input by grouping adjacent samples, where L is the length of the time-domain dimension and Cg is the number of samples grouped at each time step, so that the total number of samples in the waveform is L x Cg. The waveform is subsequently transformed by a series of bijective mappings, each of which uses its input to obtain its output. In each bijection, the input signal is first processed by an invertible point-wise convolution, and the result is then split along the channel dimension into two halves; one half is used to calculate the affine coupling coefficients applied to the other half, and the calculation applied is a WaveNet-like function that is also conditioned on the mel spectrum encoding the audio, where Lm is the time length of the mel spectrum and Cm is the number of frequency components. The two halves are finally combined in the channel direction to obtain the final output;
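The reshaping of the waveform into a multi-channel tensor and a single affine-coupling bijection can be illustrated with the sketch below. It is a simplified illustration in the spirit of WaveGlow/SqueezeWave only: the wn argument stands in for the WaveNet-like function, and all shapes are illustrative assumptions.

import torch

def reshape_audio(wav: torch.Tensor, cg: int) -> torch.Tensor:
    # wav: (batch, n_samples); group Cg adjacent samples per time step.
    batch, n = wav.shape
    l = n // cg
    return wav[:, : l * cg].reshape(batch, l, cg).transpose(1, 2)  # (batch, Cg, L)

def affine_coupling(x: torch.Tensor, mel: torch.Tensor, wn) -> torch.Tensor:
    # Split along the channel dimension; one half conditions the other.
    x_a, x_b = x.chunk(2, dim=1)
    log_s, t = wn(x_a, mel).chunk(2, dim=1)   # WaveNet-like function predicts the coupling coefficients
    x_b = torch.exp(log_s) * x_b + t          # affine transformation of the second half
    return torch.cat([x_a, x_b], dim=1)       # recombine along the channel direction

Because each step only transforms x_b while passing x_a through unchanged, the mapping remains invertible, which is what allows the flow-based vocoder to be trained by maximum likelihood and sampled in parallel.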
S3, producing the waveform from the acoustic features using the vocoder: the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 (in_layer) then continues to process the result; meanwhile, the mel spectrum is also fed into the network (cond_layer). The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times, a point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8. In the WaveGlow source code the amount of computation per second is 229G MACs, of which in_layer accounts for 47%, cond_layer for 39% and res_skip_layer for 14%; for this reason the original network structure is improved to reduce the amount of computation and increase the computational efficiency, and the input audio is transformed to a smaller time-domain length and more channels while the channel size in the WN function is kept unchanged.
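A single, simplified layer of the WN function described above might look like the following sketch. The real WaveGlow WN stacks eight such dilated-convolution layers with residual and skip connections; the channel sizes, the single dilation factor and the module names used here are illustrative assumptions only.

import torch
import torch.nn as nn

class SimpleWN(nn.Module):
    def __init__(self, in_ch=4, n_ch=512, mel_ch=80, out_ch=8):
        super().__init__()
        self.start = nn.Conv1d(in_ch, n_ch, 1)                               # point-wise convolution
        self.in_layer = nn.Conv1d(n_ch, 2 * n_ch, 3, padding=2, dilation=2)  # dilated conv, kernel 3
        self.cond_layer = nn.Conv1d(mel_ch, 2 * n_ch, 1)                     # conditions on the mel spectrum
        self.res_skip_layer = nn.Conv1d(n_ch, n_ch, 1)
        self.end = nn.Conv1d(n_ch, out_ch, 1)                                # produces log s and t

    def forward(self, x_a, mel):
        # x_a: (batch, in_ch, L); mel must already be upsampled to time length L.
        h = self.start(x_a)
        acts = self.in_layer(h) + self.cond_layer(mel)
        t_act, s_act = acts.chunk(2, dim=1)
        h = torch.tanh(t_act) * torch.sigmoid(s_act)                         # WaveNet-style gate function
        h = self.res_skip_layer(h)
        return self.end(h)                                                   # split later into s_i and t_i

The sketch makes the cost analysis above concrete: in_layer and cond_layer each map to 2 x n_ch channels over the full time length L, which is why they dominate the MAC count and why reducing L and the channel width pays off.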
Analysis of WaveGlow reveals that the most significant amount of computation comes from the shape of the input audio waveform: WaveGlow uses an input of dimension (L = 2000, Cg = 8), which results in very high computational complexity, because WaveGlow is composed of one-dimensional convolutions whose computational complexity increases linearly with L, and because the mel spectrum needs to be upsampled in order to match this high temporal resolution.
In the invention, in S1, for each Mongolian word to be input, the input layer finds the corresponding word vector, morphological vector and phonetic system vector by looking up a vocabulary; the attention layer takes the three Mongolian word feature vectors as input and integrates them together by weighted summation to obtain a new Mongolian word vector.
In the present invention, in S2, the synthesizer is used to generate acoustic features from the text input, and then the vocoder is used to generate waveform output from the acoustic features.
In the present invention, in S3, the start convolution increases the number of channels from a small number to a very large number, and the output dimension of start in WaveGlow is 256 dimensions.
In the present invention, in S3, since the time domain length of the mel spectrum is much smaller than the waveform length, it is necessary to perform dimension matching by up-sampling it.
In the present invention, in S3, when L is 64, the time-domain length is the same as that of the mel spectrum and no upsampling is needed; when L is 128, the mel spectrum only needs nearest-neighbor upsampling, so the computation overhead of cond_layer is further reduced, and depthwise separable convolution reduces the amount of computation, as shown in the sketch below.
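The two optimizations just mentioned can be illustrated as follows: a standard one-dimensional convolution is contrasted with a depthwise separable one, and the mel spectrum is upsampled by nearest-neighbor interpolation to the reduced time length. All channel counts and lengths are illustrative assumptions, not values fixed by the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_ch, L, k = 256, 128, 3
standard = nn.Conv1d(n_ch, n_ch, k, padding=1)
separable = nn.Sequential(
    nn.Conv1d(n_ch, n_ch, k, padding=1, groups=n_ch),  # depthwise: one filter per channel
    nn.Conv1d(n_ch, n_ch, 1),                          # pointwise: mixes channels
)
# Approximate MACs per output position: standard ~ n_ch*n_ch*k, separable ~ n_ch*k + n_ch*n_ch
print(n_ch * n_ch * k, n_ch * k + n_ch * n_ch)         # 196608 vs 66304

mel = torch.randn(1, 80, 63)                            # (batch, Cm, Lm)
mel_up = F.interpolate(mel, size=L, mode="nearest")     # nearest-neighbor upsampling to length L = 128

The printed comparison shows roughly a 3x reduction in MACs for this layer; combined with the smaller time-domain length, this is the kind of saving that accumulates into the overall 61x-214x reduction reported for SqueezeWave.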
In the present invention, in S3, SqueezeWave, a lightweight vocoder based on an improvement of WaveGlow, can generate similar speech quality while running with 61x-214x fewer MACs, because the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation.
And those not described in detail in this specification are well within the skill of those in the art.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A novel Mongolian speech synthesis method is characterized in that: the method specifically comprises the following steps:
S1, processing the Mongolian word sequence based on a BiLSTM: a Mongolian prosody modeling method fusing a morphological vector and a phonetic system vector is provided based on a BiLSTM neural network, and comprises an input layer, an attention layer, a BiLSTM layer and an output layer; the input Mongolian word sequence is processed, specifically, given the word vector WE, the morphological vector ME and the phonetic system vector PE of a Mongolian word, the weights of the three vectors are predicted by two fully-connected neural networks, and the three vectors multiplied by their respective weights w1, w2 and w3 are then combined to form the final Mongolian word vector representation WE*, the formula group being:
w2 = σ(M1·tanh(M2·WE + M3·PE)), w3 = σ(M1′·tanh(M2′·WE + M3′·ME)), w1 = 1 − w2 − w3, and WE* = w1·WE + w2·PE + w3·ME;
the BiLSTM layer reads the input feature vectors WE* to extract richer high-level semantic features: the forward LSTM first reads the feature vectors WE* from left to right and obtains the hidden states h→1, h→2, …, h→T in turn, the backward LSTM then obtains the hidden states h←1, h←2, …, h←T, the forward and backward hidden states have the same time length, and the hidden states of the corresponding time steps are summed to obtain the final hidden-state output H of each time step; the hidden-state output H obtained by the BiLSTM layer is sent to the output layer and decoded to obtain the final prosodic tag corresponding to each Mongolian word; the output layer decodes the hidden-state output of the BiLSTM layer, and two kinds of output-layer functions can be selected for decoding: the first output layer is a Softmax function, which converts the output vector into probability values between 0 and 1 and normalizes them, so that the prosodic tag with the maximum probability value is taken as the final prosodic tag; the other is a conditional random field, since for the sequence labeling problem there are strong dependencies between tags, and in order to better describe the dependencies of adjacent tags and obtain a globally optimal solution, the CRF output layer takes maximizing the score of the correct complete tag sequence Y = [y1, y2, …, yT] as the optimization objective, the formulation being:
p(Y | X) = exp(s_Y) / Σ(Y′ ∈ Y_X) exp(s_Y′);
the coding network starts with an embedding layer, which converts characters or phonemes into a trainable vector representation heEmbedded heFirst from the embedding dimension to the target dimension by fully connected layers, and then by convolutionBlock processing to extract temporal dependencies of textual information, which are finally projected into the embedding dimension to create an attention key vector hkThe value vector of attention is computed from the attention key vector and text embedding:
Figure RE-FDA0003353435550000021
S2, using SqueezeWave to improve synthesizer efficiency: moving TTS from the cloud to the edge, a typical modern speech synthesis model mainly includes two parts: a synthesizer and a vocoder. A lightweight flow-based vocoder, SqueezeWave, is proposed for speech synthesis on edge devices; it redesigns the architecture of WaveGlow by reshaping the audio tensor and adopting depthwise separable convolutions and related optimizations, so that it consumes 61-214 times less computation than WaveGlow and can generate 123K-303K samples per second on a notebook computer. Unlike performing the convolution operation directly, WaveGlow first constructs a multi-channel input by grouping adjacent samples, where L is the length of the time-domain dimension and Cg is the number of samples grouped at each time step, so that the total number of samples in the waveform is L x Cg. The waveform is subsequently transformed by a series of bijective mappings, each of which uses its input to obtain its output. In each bijection, the input signal is first processed by an invertible point-wise convolution, and the result is then split along the channel dimension into two halves; one half is used to calculate the affine coupling coefficients applied to the other half, and the calculation applied is a WaveNet-like function that is also conditioned on the mel spectrum encoding the audio, where Lm is the time length of the mel spectrum and Cm is the number of frequency components. The two halves are finally combined in the channel direction to obtain the final output;
S3, producing the waveform from the acoustic features using the vocoder: the largest part of the computation of WaveGlow comes from the WN function. The input is first processed by a point-wise convolution, and a one-dimensional dilated convolution with kernel size 3 then continues to process the result; meanwhile, the mel spectrum is also fed into the network. The in_layer and cond_layer outputs are then merged by a gate function in the manner of WaveNet, and the merged result is passed to res_skip_layer, whose output has length L = 2000 and 512 channels and is then split into two parts along the channel; this structure is repeated eight times. A point-wise convolution (end) is applied to the final res_skip_layer output to compute the transformation factors s_i, t_i, and the channels are compressed from 512 to 8, transforming the input audio to a smaller time-domain length and more channels while keeping the channel size in the WN function unchanged.
2. The method of claim 1, wherein, in S1, for each Mongolian word to be input, the input layer finds the corresponding word vector, morphological vector and phonetic system vector by looking up a vocabulary, the attention layer takes the three Mongolian word feature vectors as input, and the three feature vectors are integrated together by weighted summation to obtain a new Mongolian word vector.
3. The method of claim 1, wherein the Mongolian speech synthesis method comprises: in S2, the synthesizer is used to generate acoustic features from the text input, and then the vocoder is used to generate waveform output from the acoustic features.
4. The method of claim 1, wherein, in S3, the start convolution increases the number of channels from a small number to a very large number, and the output dimension of start in WaveGlow is 256 dimensions.
5. The method of claim 1, wherein the Mongolian speech synthesis method comprises: in S3, since the time domain length of the mel spectrum is much smaller than the waveform length, it needs to be up-sampled for dimension matching.
6. The method of claim 1, wherein, in S3, when L is 64, the time-domain length is the same as that of the mel spectrum and no upsampling is needed; when L is 128, the mel spectrum only needs nearest-neighbor upsampling, so the computation overhead of cond_layer is further reduced, and depthwise separable convolution reduces the amount of computation.
7. The method of claim 1, wherein, in S3, SqueezeWave, a lightweight vocoder based on an improvement of WaveGlow, can generate similar speech quality while running with 61x-214x fewer MACs, because the network structure of WaveGlow is redesigned, thereby greatly reducing the amount of computation.
CN202110817588.XA 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method Pending CN113838449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110817588.XA CN113838449A (en) 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110817588.XA CN113838449A (en) 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method

Publications (1)

Publication Number Publication Date
CN113838449A true CN113838449A (en) 2021-12-24

Family

ID=78962831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110817588.XA Pending CN113838449A (en) 2021-07-20 2021-07-20 Novel Mongolian speech synthesis method

Country Status (1)

Country Link
CN (1) CN113838449A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420087A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Acoustic feature determination method, device, equipment, medium and product



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211224