CN110797002B - Speech synthesis method, speech synthesis device, electronic equipment and storage medium - Google Patents
- Publication number
- CN110797002B (application CN202010006604.2A)
- Authority
- CN
- China
- Prior art keywords
- phase
- spectrum
- linear
- text data
- predicted value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a speech synthesis method, which relates to the field of speech synthesis and comprises the following steps: acquiring text data, obtaining target values of a linear spectrum and a phase from the text data, and converting the text data into a text vector; inputting the text vector into a neural network model to obtain predicted values of the linear spectrum and the phase, calculating an overall loss from the target and predicted values, training the neural network model on that loss, and obtaining a linear spectrum and an initial phase through the trained model; and inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain the audio signal corresponding to the text data. Because the Griffin-Lim vocoder starts from the predicted linear spectrum and initial phase, the method reduces the number of vocoder iterations, accelerates the vocoder's convergence, and speeds up the real-time audio synthesis process without degrading audio quality; it is suitable for any speech synthesis apparatus that uses the Griffin-Lim algorithm as its vocoder. The invention also discloses a speech synthesis apparatus, an electronic device and a computer storage medium.
Description
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a speech synthesis method, apparatus, electronic device, and storage medium.
Background
Speech synthesis is a leading-edge technology in the field of Chinese information processing: a given text input is decomposed, character by character or word by word, into feature vectors, the feature vectors are converted into audio features, and a vocoder finally restores the audio features to the corresponding audio file for output. With technologies such as WaveNet and LPCNet, many speech synthesis methods that adopt a neural network as the vocoder have appeared, but these methods remain difficult to commercialize in terms of synthesis performance or synthesis quality; at present, the Griffin-Lim algorithm is therefore widely used as the vocoder in speech synthesis. Griffin-Lim is an iterative algorithm that predicts phase from the spectrum: it takes the spectral magnitude as input, iterates from a randomly initialized phase for a certain number of rounds to obtain a suitable phase that joins the audio frames, and recovers the time-domain audio signal.
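For reference, the classic Griffin-Lim iteration described above can be sketched as follows. This is a minimal illustration built on the librosa library; the iteration count, FFT size and hop length are assumptions for the example, not values taken from this disclosure.

```python
import numpy as np
import librosa

def griffin_lim(magnitude, n_iter=60, n_fft=2048, hop_length=200):
    """Classic Griffin-Lim: start from a random phase and repeatedly
    enforce consistency between the magnitude and the signal's STFT."""
    # Random initial phase, as in the standard algorithm.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate...
        y = librosa.istft(magnitude * angles, hop_length=hop_length)
        # ...then re-analyze and keep only the phase of the result.
        spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(spec))
    return librosa.istft(magnitude * angles, hop_length=hop_length)
```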
Existing speech synthesis methods use a neural network model to convert text into a linear spectrum and then feed the linear spectrum to a Griffin-Lim vocoder, which produces an audio signal of reasonable quality only after many iterations. To improve overall performance, the Griffin-Lim vocoder is usually optimized from an engineering perspective so as to make each single iteration more efficient; what is neglected is providing the vocoder with a good initial phase from the neural network model, which would accelerate its convergence and fundamentally remove the burden of the many iterations.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, an object of the present invention is to provide a speech synthesis method that applies an overall loss to the target values and predicted values of the spectrum and the phase, so that the spectrum and the phase of the model are trained in a consistent direction; a linear spectrum and an initial phase are obtained from the trained model and input into a Griffin-Lim vocoder for iterative training, a joint phase connecting the audio frames is obtained, and the corresponding audio signal is recovered and output.
The first object of the invention is achieved by adopting the following technical solution:
acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
Further, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector, including:
acquiring audio data matched with the text data;
performing short-time Fourier transform on the audio data to obtain the linear spectrum target value and the phase target value;
and performing word segmentation on the text data to obtain a word segmentation result of the text data, and performing one-hot encoding on the word segmentation result to obtain the text vectors.
Further, the neural network model is a Tacotron model, and the step of inputting the text vector into the neural network model to obtain a linear spectrum predicted value and a phase predicted value includes:
and calculating the text vector through the Tacotron model to obtain a linear frequency spectrum predicted value and a phase predicted value with the same dimensionality.
Further, calculating an overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value, and the phase predicted value includes:
inputting the linear spectrum target value and the linear spectrum predicted value into a linear spectrum loss function for calculation to obtain linear spectrum loss;
inputting the phase target value and the phase predicted value into a phase loss function for calculation to obtain phase loss;
and adding the phase loss and the linear spectrum loss according to preset weights to obtain the overall loss.
Further, training the neural network model according to the overall loss comprises:
and when the overall loss is greater than or equal to a preset threshold value, training the neural network model based on the overall loss, and obtaining a linear spectrum predicted value and a phase predicted value output by the current training model until the overall loss is less than the preset threshold value, so as to obtain the trained neural network model.
Further, the linear spectrum and the initial phase are input to the Griffin-Lim vocoder through the same number of spectrum channels and phase channels, respectively.
Another object of the present invention is to provide a speech synthesis apparatus that applies an overall loss to the target values and predicted values of the spectrum and the phase, so that the spectrum and the phase of the model are trained in a consistent direction; a linear spectrum and an initial phase are obtained from the trained model and input into a Griffin-Lim vocoder for iterative training, a joint phase connecting the audio frames is obtained, and the corresponding audio signal is recovered and output.
The second object of the invention is achieved by adopting the following technical solution:
a speech synthesis apparatus, comprising:
the data processing module is used for acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
the model training module is used for inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
the audio output module is used for inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
It is a further object of the invention to provide an electronic device comprising a processor, a storage medium and a computer program, the computer program being stored in the storage medium, wherein the computer program, when executed by the processor, performs the speech synthesis method of the first object of the invention.
It is a fourth object of the present invention to provide a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the speech synthesis method of the first object of the invention.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, a frequency spectrum target value and a phase target value are obtained according to text data, the text data are converted into text vectors, the text vectors are input into a neural network model to predict frequency spectrums and phases, further, the overall loss is obtained through calculation, the frequency spectrums and the phases of the model are constrained to be trained in the consistent direction, linear frequency spectrums and initial phases are obtained based on the trained neural network model, the linear frequency spectrums and the initial phases are input into a Griffin-Lim vocoder to be trained, the iteration times of the vocoder can be reduced, the convergence speed of the vocoder is accelerated, further, joint phases connected with all audio frames are obtained, and audio signals corresponding to the text data are recovered and output according to the joint phases; the voice synthesis method can accelerate the real-time audio synthesis process under the condition of not reducing the audio quality, and is suitable for a voice synthesis device which uses the Griffin-Lim algorithm as a vocoder.
Drawings
FIG. 1 is a flowchart of a speech synthesis method according to the first embodiment of the present invention;
FIG. 2 is a flowchart illustrating the iterative training of the Griffin-Lim vocoder according to the first embodiment of the present invention;
FIG. 3 is a block diagram of a speech synthesis apparatus according to the fifth embodiment of the present invention;
FIG. 4 is a block diagram of an electronic device according to the sixth embodiment of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings; the description is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form further embodiments not shown in the description below.
Example One
The first embodiment provides a speech synthesis method. A spectrum target value and a phase target value are obtained from text data, the text data is converted into a text vector, the spectrum and the phase are predicted by a neural network model, and an overall loss is calculated from the target values and the predicted values so as to constrain the direction in which the model parameters are updated during training. A linear spectrum and an initial phase are obtained from the trained model and input into a Griffin-Lim vocoder for iterative training, yielding the corresponding audio signal. Without degrading audio quality, this speech synthesis method reduces the number of vocoder iterations and accelerates the vocoder's convergence, thereby speeding up the real-time audio synthesis process as a whole; it is suitable for speech synthesis apparatus that use the Griffin-Lim algorithm as the vocoder.
Referring to FIG. 1, a speech synthesis method includes the following steps:
s110, text data are obtained, a linear spectrum target value and a phase target value are obtained according to the text data, and the text data are converted into text vectors.
The text data may be obtained from a self-built text library, or from a product text library such as news broadcasts, weather forecasts or electronic book reading; this is not limited here.
In order to obtain time-frequency-domain information to serve as the target values with which the neural network model is trained, audio data matched with the text data is acquired and analyzed with a time-frequency analysis method such as the short-time Fourier transform, the wavelet transform or the Wigner distribution (not limited here). This yields the linear spectrum and phase corresponding to the text data, which serve as the linear spectrum target value and phase target value for training the neural network model. In this embodiment, the audio data is transformed by a short-time Fourier transform to obtain the linear spectrum and the phase.
Because text data varies in length, it cannot be input into the model directly for training; it must first be converted into corresponding text vectors, which serve as the input for model training. The text data may be segmented by characters or by words (not limited here), and the corresponding text vectors are then constructed from the segmentation result. In this embodiment, the text data is segmented into words, a dictionary is constructed from the word frequencies in the text library, each word is assigned an id through its dictionary subscript, and the segmented words are matched to the word vectors corresponding to their ids by traversing the dictionary, thereby obtaining the text vectors corresponding to the text data.
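A minimal sketch of this dictionary construction follows. The function names, the vocabulary cap and the unknown-word handling are illustrative assumptions, not part of the disclosure.

```python
from collections import Counter

def build_dictionary(segmented_corpus, max_size=10000):
    # Rank words by frequency in the text library; the dictionary
    # subscript of each word serves as its id.
    freq = Counter(w for sentence in segmented_corpus for w in sentence)
    return {w: i for i, (w, _) in enumerate(freq.most_common(max_size))}

def words_to_ids(words, dictionary):
    # Unknown words map to a reserved id one past the dictionary.
    unk = len(dictionary)
    return [dictionary.get(w, unk) for w in words]
```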
S120, inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model.
The text vector is input into the neural network model to obtain a linear spectrum predicted value and a phase predicted value of the same dimensionality. The phase target value and the phase predicted value are fed into a phase loss function to obtain the phase loss; the linear spectrum target value and the linear spectrum predicted value are fed into a linear spectrum loss function to obtain the linear spectrum loss. The phase loss and the linear spectrum loss are added according to preset weights to obtain the overall loss of the neural network model. Preferably, the phase loss and the linear spectrum loss carry equal weight in the overall loss.
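As an illustration, the overall loss just described might be computed as below (PyTorch). The equal weights follow the stated preference; the use of L1 losses anticipates the fourth embodiment, and the function name is an assumption.

```python
import torch.nn.functional as F

def overall_loss(spec_pred, spec_target, phase_pred, phase_target,
                 w_spec=0.5, w_phase=0.5):
    # Linear-spectrum loss and phase loss, added with preset weights;
    # equal weights match the stated preference.
    spec_loss = F.l1_loss(spec_pred, spec_target)
    phase_loss = F.l1_loss(phase_pred, phase_target)
    return w_spec * spec_loss + w_phase * phase_loss
```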
When the overall loss is greater than or equal to a preset threshold, the neural network model continues training, so that the phase loss and the linear spectrum loss jointly constrain the training direction of the model and the model parameters are updated toward both the predicted spectrum and the predicted phase. The gradient is calculated from the overall loss by the back-propagation algorithm, the model parameters are updated by gradient descent according to the calculated gradient, and the linear spectrum predicted value and the phase predicted value output by the model under training are adjusted accordingly, until the overall loss is smaller than the preset threshold; this yields the trained model together with the linear spectrum and phase it outputs.
The phase output by the trained model is recorded as φ̂. To obtain an initial phase with a value range of [−π, π], this embodiment applies a hyperbolic tangent operation to the phase φ̂, calculating the initial phase as:

φ₀ = π · tanh(φ̂)
Preferably, the gradient is calculated from the overall loss by the back-propagation algorithm and the model is trained with an Adam optimizer. During training, the batch size is 32; the learning rate starts at 0.001, drops to 0.0005 after 500,000 steps, to 0.0003 after 1,000,000 steps, and to 0.0001 after 2,000,000 steps. Once the overall loss falls below the preset threshold, training of the neural network model is complete and the linear spectrum and the initial phase are obtained.
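A sketch of this schedule in PyTorch is shown below. Here `model`, `loader` and `threshold` are hypothetical placeholders; only the optimizer, batch size, learning-rate steps and stopping rule come from the text.

```python
import torch
import torch.nn.functional as F

def learning_rate(step):
    # 0.001 to start, 0.0005 after 500,000 steps, 0.0003 after
    # 1,000,000 steps, 0.0001 after 2,000,000 steps.
    if step < 500_000:
        return 1e-3
    if step < 1_000_000:
        return 5e-4
    if step < 2_000_000:
        return 3e-4
    return 1e-4

def train(model, loader, threshold):
    # loader yields (text_vector, spectrum_target, phase_target)
    # batches of size 32; model, loader, threshold are placeholders.
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate(0))
    for step, (text_vec, spec_t, phase_t) in enumerate(loader):
        for group in optimizer.param_groups:
            group["lr"] = learning_rate(step)
        spec_p, phase_p = model(text_vec)
        # Equal-weight overall loss, as in the sketch above.
        loss = 0.5 * F.l1_loss(spec_p, spec_t) + 0.5 * F.l1_loss(phase_p, phase_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < threshold:  # overall loss below the preset threshold
            break
    return model
```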
Preferably, to carry the initial phase information, the number of channels is doubled: the linear spectrum and the initial phase are input into the Griffin-Lim vocoder through spectrum channels and phase channels of equal number, and the number of spectrum channels and phase channels matches the dimensionality of the linear spectrum and the initial phase.
S130, inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data.
Referring to FIG. 2, the spectral magnitude of the linear spectrum and the initial phase are input into the Griffin-Lim vocoder and passed through an inverse short-time Fourier transform to obtain a time-domain signal. The Griffin-Lim vocoder then begins its iterative training: the time-domain signal is put through a short-time Fourier transform to obtain a new phase, which replaces the current phase. Through repeated iterations a joint phase connecting the audio frames is found, and the time-domain audio signal corresponding to the text data is recovered and output.
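The warm-started loop of FIG. 2 can be sketched as follows, again with librosa. The STFT settings mirror the second embodiment, and the reduced iteration count is an assumption meant to contrast with the random-phase sketch given in the background.

```python
import numpy as np
import librosa

def griffin_lim_warm_start(magnitude, initial_phase, n_iter=10,
                           n_fft=2048, hop_length=200):
    """Griffin-Lim seeded with the model's predicted initial phase
    instead of a random one, so fewer iterations are needed."""
    angles = np.exp(1j * initial_phase)
    for _ in range(n_iter):
        # Inverse STFT to a time-domain signal with the current phase.
        y = librosa.istft(magnitude * angles, hop_length=hop_length)
        # STFT the signal and replace the phase with the new estimate.
        spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(spec))
    # The converged joint phase connects the frames; recover the audio.
    return librosa.istft(magnitude * angles, hop_length=hop_length)
```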
Because the Griffin-Lim vocoder starts its training from a good initial phase corresponding to the text data, it can reach the joint phase connecting the audio frames in far fewer iterations without degrading audio quality. This accelerates the convergence of the Griffin-Lim vocoder, speeds up the real-time audio synthesis process as a whole, and lowers the synthesis real-time factor.
Example Two
The second embodiment is an improvement on the first: a linear spectrum target value and a phase target value are obtained from the text data for use in the loss function of neural network training, and the text data is converted into text vectors for input into the neural network model.
Audio data matched with the text data is acquired and put through a short-time Fourier transform to obtain the linear spectrum target value and the phase target value. The short-time Fourier transform frames and windows the long audio signal, performs a Fourier transform on each frame, and stacks the result of each frame along a second dimension, yielding the linear spectrum and phase corresponding to the audio data; these serve as the linear spectrum target value and phase target value, the target-value parameters of the loss function for neural network training.
Preferably, a Hanning window is applied to the audio data, each frame is 50 milliseconds long with a 12.5-millisecond frame shift, and a 2048-point Fourier transform is performed to obtain the linear spectrum and compute the phase.
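With those parameters, extracting the two target values can be sketched as below. The 16 kHz sample rate is an assumption, since the disclosure does not state one.

```python
import numpy as np
import librosa

def spectrum_and_phase_targets(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # 50 ms Hanning windows, 12.5 ms frame shift, 2048-point FFT.
    stft = librosa.stft(y, n_fft=2048,
                        win_length=int(0.050 * sr),
                        hop_length=int(0.0125 * sr),
                        window="hann")
    # Linear-spectrum target (magnitude) and phase target.
    return np.abs(stft), np.angle(stft)
```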
In order to obtain the text vectors corresponding to the text data for input into model training, the text data is segmented and the segmentation result is one-hot encoded to obtain the text vectors.
The text data is first split into sentences, for example according to punctuation marks, producing a number of complete sentences. A word segmentation method, not limited to word-sense segmentation or string-matching segmentation, is then applied to the sentences to obtain the segmentation result. Word-sense segmentation uses syntactic and semantic information to resolve ambiguities while segmenting. String-matching segmentation includes, but is not limited to, forward maximum matching, backward maximum matching, shortest-path segmentation and bidirectional maximum matching; in this embodiment, forward maximum matching is selected to segment the text data.
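Forward maximum matching admits a short sketch, shown below; the dictionary and the maximum word length are assumptions.

```python
def forward_max_match(sentence, dictionary, max_word_len=6):
    """At each position, greedily take the longest dictionary word;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```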
The segmentation result is processed by one-hot encoding to obtain the text vector, so that the feature represented by each dimension of the text vector is a continuous feature suitable for input into neural network training. Preferably, the text vector is normalized, which facilitates subsequent model training. Preferably, when the number of texts is large, the segmentation result is processed with one-hot encoding followed by principal component analysis, yielding a dimensionality-reduced text vector.
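A sketch of the one-hot encoding with the optional principal component analysis step follows; the toy ids, vocabulary size and component count are assumptions (real settings would be far larger).

```python
import numpy as np
from sklearn.decomposition import PCA

def one_hot(ids, vocab_size):
    vectors = np.zeros((len(ids), vocab_size), dtype=np.float32)
    vectors[np.arange(len(ids)), ids] = 1.0
    return vectors

# For a large text collection, follow the one-hot encoding with PCA
# to obtain reduced-dimension text vectors.
vectors = one_hot([3, 17, 42], vocab_size=10000)
reduced = PCA(n_components=2).fit_transform(vectors)
```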
Example Three
The third embodiment is an improvement on the first and/or second embodiment: the neural network model of the speech synthesis method adopts the Tacotron model, and the text vectors are input into the Tacotron model to obtain the linear spectrum predicted value and the phase predicted value used in the loss function that constrains Tacotron training, so that the spectrum and the phase of the model are trained in a consistent direction.
The Tacotron model comprises an encoder, a decoder and a post-processing network connected in sequence; the post-processing network comprises a CBHG unit and a fully connected layer. The text vectors are computed by the encoder and the decoder in sequence to obtain a Mel spectrum, the Mel spectrum is computed by the CBHG unit to obtain Mel-spectrum features, and the fully connected layer maps these features to a linear spectrum predicted value and a phase predicted value of the same dimensionality.
The text vector is computed by the encoder to obtain audio features, which are input into the decoder. The decoder is a content-based attention decoder comprising a decoding pre-processing unit, an attention RNN unit and a decoder RNN unit. The attention RNN unit is a single RNN layer containing 256 GRUs; the decoder RNN unit is a two-layer GRU stack with vertical residual connections, each layer containing 256 GRUs.
The output of the decoding pre-processing unit and the output of the attention RNN unit are concatenated to form the input of the decoder RNN unit. During training, the decoder RNN unit predicts r non-overlapping output frames at each step; the input to its first step is an all-zero frame, the last frame of the step-t prediction serves as the input to step t+1, and the decoder RNN unit outputs a Mel spectrum with 80 bands.
The CBHG unit comprises a one-dimensional convolution filter bank, a highway network and a bidirectional gated recurrent unit. Preferably, the one-dimensional convolution filter bank comprises 8 groups of one-dimensional convolution kernels of width 3, uses the rectified linear unit (ReLU) as its activation function, and produces a 128-dimensional convolution result; the pooling has a stride of 1 and a width of 2; the highway network consists of 4 fully connected layers with ReLU activations and a 256-dimensional output feature; the bidirectional gated recurrent unit contains 128 GRUs.
The CBHG unit convolves the Mel spectrum with K sets of one-dimensional convolution kernels, the k-th set comprising convolution kernels of width k (k = 1, 2, 3, …, K), which extract contextual features of different lengths; the results of the K groups of convolutions of different widths are stacked together and max-pooled along time. Preferably, the pooling result is batch-normalized, which avoids gradient vanishing and gradient explosion during model training. The batch-normalized result is input into a multi-layer highway network to extract high-level features, and a bidirectional gated recurrent unit then extracts the Mel-spectrum features.
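A PyTorch sketch of this convolution bank, pooling and batch normalization is given below. The 80-band input, K = 8 and 128 channels follow the preferred values above; everything else is an assumption.

```python
import torch
import torch.nn as nn

class ConvBank(nn.Module):
    """CBHG front end: K sets of 1-D convolutions with widths 1..K,
    stacked on the channel axis, max-pooled along time (stride 1,
    width 2) and batch-normalized."""
    def __init__(self, in_dim=80, K=8, channels=128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, channels, kernel_size=k, padding=k // 2)
            for k in range(1, K + 1))
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        self.bn = nn.BatchNorm1d(K * channels)

    def forward(self, x):  # x: (batch, mel_bands, time)
        T = x.size(-1)
        # Even-width kernels produce one extra frame; trim back to T.
        stacked = torch.cat([conv(x)[..., :T] for conv in self.convs], dim=1)
        return self.bn(self.pool(stacked)[..., :T])
```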
The Mel-spectrum features are input into the fully connected layer and concatenated along the depth dimension, and a linear spectrum and a phase of the same dimensionality are predicted; these serve as the predicted values, the predicted-value parameters of the loss function that constrains model training.
Example Four
The fourth embodiment is an improvement on the third: the overall loss is calculated from the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, and is used to train the neural network model so that the spectrum and the phase of the model can be trained in a consistent direction.
The phase target value and the phase predicted value are fed into a phase loss function to obtain the phase loss. The linear spectrum target value and the linear spectrum predicted value are fed into a linear spectrum loss function to obtain the linear spectrum loss.

The same loss function is used for the phase loss and the linear spectrum loss; it includes, but is not limited to, one of the L1 loss, the L2 loss and the cross-entropy loss. The L1 loss is used in this embodiment.

The phase loss and the linear spectrum loss are added according to preset weights to obtain the overall loss of the neural network model. Preferably, the phase loss and the linear spectrum loss carry equal weight in the overall loss.
Preferably, the linear spectrum target value is passed through a Mel filter bank to obtain a Mel spectrum target value, the text vector is computed by the encoder and decoder of the Tacotron model in sequence to obtain a Mel spectrum predicted value, and the Mel spectrum target value and the Mel spectrum predicted value are fed into a Mel spectrum loss function to obtain the Mel spectrum loss. The Mel spectrum loss, the phase loss and the linear spectrum loss are added to obtain the overall loss of the model. Preferably, the Mel spectrum loss, the phase loss and the linear spectrum loss carry equal weight in the overall loss.
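For illustration, the Mel target can be derived from the linear-spectrum target with a Mel filter bank as below (librosa). The sample rate is again an assumption; the 80 bands match the decoder output described in the third embodiment.

```python
import numpy as np
import librosa

# Mel filter bank mapping the 2048-point linear spectrum to 80 bands.
mel_basis = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=80)

def mel_target(linear_spec_target):        # (1025, frames) magnitude
    return mel_basis @ linear_spec_target  # (80, frames)

def l1(a, b):
    return np.mean(np.abs(a - b))

# Equal-weight overall loss over Mel spectrum, phase and linear spectrum:
# overall = (l1(mel_p, mel_t) + l1(phase_p, phase_t) + l1(spec_p, spec_t)) / 3
```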
The neural network model is trained by gradient descent on the overall loss, so that the Mel spectrum loss, the phase loss and the linear spectrum loss jointly constrain the training direction of the model, and the linear spectrum and the initial phase are obtained through the trained neural network model. The Griffin-Lim vocoder is then trained with the linear spectrum and the initial phase, the joint phase connecting the audio frames is obtained through the trained Griffin-Lim vocoder, and the audio signal corresponding to the text data is recovered and output.
Example Five
The fifth embodiment discloses a speech synthesis apparatus corresponding to the foregoing embodiments; it is the virtual apparatus structure of those embodiments. As shown in FIG. 3, the speech synthesis apparatus includes:
the data processing module is used for acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
the model training module is used for inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
the audio output module is used for inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
Preferably, a doubled number of channels is used to carry the initial phase information: the model training module is connected to spectrum channels and phase channels of equal number, and the spectrum channels and phase channels are connected to the audio output module. The linear spectrum and the initial phase are input to the Griffin-Lim vocoder through the spectrum channels and the phase channels respectively.
Example Six
FIG. 4 is a schematic structural diagram of an electronic device according to the sixth embodiment of the present invention. As shown in FIG. 4, the electronic device includes a processor 310, a memory 320, an input device 330 and an output device 340. The number of processors 310 in the electronic device may be one or more (one processor 310 is taken as the example in FIG. 4), and the processor 310, the memory 320, the input device 330 and the output device 340 may be connected by a bus or by other means (connection by a bus is shown in FIG. 4).
The memory 320, as a computer-readable storage medium, is used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the speech synthesis method in the embodiments of the present invention (for example, the data processing module 210, the model training module 220 and the audio output module 230 of the speech synthesis apparatus). By running the software programs, instructions and modules stored in the memory 320, the processor 310 executes the various functional applications and data processing of the electronic device, that is, implements the speech synthesis method of the first through fourth embodiments.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive text data, preset thresholds, etc. The output device 340 may include a display device such as a display screen.
Example Seven
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a speech synthesis method comprising:
acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations of the speech synthesis method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above embodiment of the speech synthesis apparatus, the units and modules included are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.
Claims (9)
1. A speech synthesis method, comprising the steps of:
acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
2. The method of claim 1, wherein obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector comprises:
acquiring audio data matched with the text data;
performing short-time Fourier transform on the audio data to obtain the linear spectrum target value and the phase target value;
and performing word segmentation on the text data to obtain a word segmentation result of the text data, and performing one-hot encoding on the word segmentation result to obtain the text vectors.
3. A speech synthesis method as claimed in claim 1, characterized in that: the neural network model is a Tacotron model, and inputting the text vector into the neural network model to obtain a linear spectrum predicted value and a phase predicted value comprises:

calculating the text vector through the Tacotron model to obtain a linear spectrum predicted value and a phase predicted value of the same dimensionality.
4. A speech synthesis method according to claim 3, characterized by: calculating an overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, including:
inputting the linear spectrum target value and the linear spectrum predicted value into a linear spectrum loss function for calculation to obtain linear spectrum loss;
inputting the phase target value and the phase predicted value into a phase loss function for calculation to obtain phase loss;
and adding the phase loss and the linear spectrum loss according to preset weights to obtain the overall loss.
5. A speech synthesis method according to claim 4, characterized by: training the neural network model according to the overall loss, comprising:
and when the overall loss is greater than or equal to a preset threshold value, training the neural network model based on the overall loss, and obtaining a linear spectrum predicted value and a phase predicted value output by the current training model until the overall loss is less than the preset threshold value, so as to obtain the trained neural network model.
6. A speech synthesis method according to any one of claims 1-5, characterized by: the linear spectrum and the initial phase are respectively input to the Griffin-Lim vocoder through the spectrum channels and the phase channels with the same number of channels.
7. A speech synthesis apparatus, characterized in that it comprises:
the data processing module is used for acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
the model training module is used for inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
the audio output module is used for inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
8. An electronic device comprising a processor, a storage medium, and a computer program, the computer program being stored in the storage medium, wherein the computer program, when executed by the processor, performs the speech synthesis method of any one of claims 1 to 6.
9. A computer storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the speech synthesis method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010006604.2A CN110797002B (en) | 2020-01-03 | 2020-01-03 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010006604.2A CN110797002B (en) | 2020-01-03 | 2020-01-03 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110797002A (en) | 2020-02-14
CN110797002B (en) | 2020-05-19
Family
ID=69448507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010006604.2A Active CN110797002B (en) | 2020-01-03 | 2020-01-03 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110797002B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111048064B (en) * | 2020-03-13 | 2020-07-07 | 同盾控股有限公司 | Voice cloning method and device based on single speaker voice synthesis data set |
CN111627418B (en) * | 2020-05-27 | 2023-01-31 | 携程计算机技术(上海)有限公司 | Training method, synthesizing method, system, device and medium for speech synthesis model |
CN111833368B (en) * | 2020-07-02 | 2023-12-29 | 恩平市美高电子科技有限公司 | Speech restoration method based on phase consistency edge detection |
CN112634914B (en) * | 2020-12-15 | 2024-03-29 | 中国科学技术大学 | Neural network vocoder training method based on short-time spectrum consistency |
CN112712812B (en) * | 2020-12-24 | 2024-04-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio signal generation method, device, equipment and storage medium |
CN112820267B (en) * | 2021-01-15 | 2022-10-04 | 科大讯飞股份有限公司 | Waveform generation method, training method of related model, related equipment and device |
CN113436603B (en) * | 2021-06-28 | 2023-05-02 | 北京达佳互联信息技术有限公司 | Method and device for training vocoder and method and vocoder for synthesizing audio signals |
WO2023069805A1 (en) * | 2021-10-18 | 2023-04-27 | Qualcomm Incorporated | Audio signal reconstruction |
CN115424604B (en) * | 2022-07-20 | 2024-03-15 | 南京硅基智能科技有限公司 | Training method of voice synthesis model based on countermeasure generation network |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11080587B2 (en) * | 2015-02-06 | 2021-08-03 | Deepmind Technologies Limited | Recurrent neural networks for data item generation |
US10249289B2 (en) * | 2017-03-14 | 2019-04-02 | Google Llc | Text-to-speech synthesis using an autoencoder |
CN108899009B (en) * | 2018-08-17 | 2020-07-03 | 百卓网络科技有限公司 | Chinese speech synthesis system based on phoneme |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN110136692B (en) * | 2019-04-30 | 2021-12-14 | 北京小米移动软件有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN110136690B (en) * | 2019-05-22 | 2023-07-14 | 平安科技(深圳)有限公司 | Speech synthesis method, device and computer readable storage medium |
CN110473515A (en) * | 2019-08-29 | 2019-11-19 | 郝洁 | A kind of end-to-end speech synthetic method based on WaveRNN |
CN110600017B (en) * | 2019-09-12 | 2022-03-04 | 腾讯科技(深圳)有限公司 | Training method of voice processing model, voice recognition method, system and device |
- 2020-01-03: Application CN202010006604.2A filed; patent CN110797002B granted and active.
Also Published As
Publication number | Publication date |
---|---|
CN110797002A (en) | 2020-02-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| TR01 | Transfer of patent right | |
Effective date of registration: 20210924
Patentee after: Hangzhou Bodun Xiyan Technology Co.,Ltd., 311121 room 210, building 18, No. 998, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province
Patentee before: TONGDUN HOLDINGS Co.,Ltd., 311121 room 208, building 18, No. 998, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province