CN110797002B - Speech synthesis method, speech synthesis device, electronic equipment and storage medium - Google Patents
- Publication number
- CN110797002B (application CN202010006604.2A)
- Authority
- CN
- China
- Prior art keywords
- phase
- spectrum
- linear
- text data
- predicted value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a speech synthesis method, which relates to the field of speech synthesis and comprises the following steps: acquiring text data, obtaining target values of a linear spectrum and a phase from the text data, and converting the text data into a text vector; inputting the text vector into a neural network model to obtain predicted values of the linear spectrum and the phase, calculating an overall loss from the target and predicted values, training the neural network model on that loss, and obtaining a linear spectrum and an initial phase through the trained model; and inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain the audio signal corresponding to the text data. Because the Griffin-Lim vocoder starts from the predicted linear spectrum and initial phase, the method reduces the number of vocoder iterations, accelerates the vocoder's convergence, and speeds up the real-time audio synthesis process without degrading audio quality; it is suitable for any speech synthesis apparatus that uses the Griffin-Lim algorithm as its vocoder. The invention also discloses a speech synthesis apparatus, an electronic device and a computer storage medium.
Description
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a speech synthesis method, apparatus, electronic device, and storage medium.
Background
Speech synthesis is a leading-edge technology in the field of Chinese information processing: a given text input is decomposed, character by character or word by word, into feature vectors, the feature vectors are converted into audio features, and a vocoder finally restores the audio features to the corresponding audio file for output. With technologies such as WaveNet and LPCNet, many speech synthesis methods that adopt a neural network as the vocoder have appeared, but these methods remain difficult to commercialize in terms of synthesis performance or synthesis quality; at present, the Griffin-Lim algorithm is therefore widely used as the vocoder in speech synthesis. Griffin-Lim is an iterative algorithm that predicts phase from the spectrum: it takes the spectral magnitude as input, iterates from a randomly initialized phase for a certain number of rounds to obtain a suitable phase that joins the audio frames, and recovers the time-domain audio signal.
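For reference, the classic Griffin-Lim iteration described above can be sketched as follows. This is a minimal illustration built on the librosa library; the iteration count, FFT size and hop length are assumptions for the example, not values taken from this disclosure.

```python
import numpy as np
import librosa

def griffin_lim(magnitude, n_iter=60, n_fft=2048, hop_length=200):
    """Classic Griffin-Lim: start from a random phase and repeatedly
    enforce consistency between the magnitude and the signal's STFT."""
    # Random initial phase, as in the standard algorithm.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate...
        y = librosa.istft(magnitude * angles, hop_length=hop_length)
        # ...then re-analyze and keep only the phase of the result.
        spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(spec))
    return librosa.istft(magnitude * angles, hop_length=hop_length)
```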
Existing speech synthesis methods use a neural network model to convert text into a linear spectrum and then feed the linear spectrum to a Griffin-Lim vocoder, which produces an audio signal of reasonable quality only after many iterations. To improve overall performance, the Griffin-Lim vocoder is usually optimized from an engineering perspective so as to make each single iteration more efficient; what is neglected is providing the vocoder with a good initial phase from the neural network model, which would accelerate its convergence and fundamentally remove the burden of the many iterations.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, an object of the present invention is to provide a speech synthesis method that applies an overall loss to the target values and predicted values of the spectrum and the phase, so that the spectrum and the phase of the model are trained in a consistent direction; a linear spectrum and an initial phase are obtained from the trained model and input into a Griffin-Lim vocoder for iterative training, a joint phase connecting the audio frames is obtained, and the corresponding audio signal is recovered and output.
The first object of the invention is achieved by adopting the following technical solution:
acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
Further, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector, including:
acquiring audio data matched with the text data;
performing short-time Fourier transform on the audio data to obtain the linear spectrum target value and the phase target value;
and performing word segmentation on the text data to obtain a word segmentation result of the text data, and performing one-hot encoding on the word segmentation result to obtain the text vectors.
Further, the neural network model is a Tacotron model, and the step of inputting the text vector into the neural network model to obtain a linear spectrum predicted value and a phase predicted value includes:
and calculating the text vector through the Tacotron model to obtain a linear frequency spectrum predicted value and a phase predicted value with the same dimensionality.
Further, calculating an overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value, and the phase predicted value includes:
inputting the linear spectrum target value and the linear spectrum predicted value into a linear spectrum loss function for calculation to obtain linear spectrum loss;
inputting the phase target value and the phase predicted value into a phase loss function for calculation to obtain phase loss;
and adding the phase loss and the linear spectrum loss according to preset weights to obtain the overall loss.
Further, training the neural network model according to the overall loss comprises:
and when the overall loss is greater than or equal to a preset threshold value, training the neural network model based on the overall loss, and obtaining a linear spectrum predicted value and a phase predicted value output by the current training model until the overall loss is less than the preset threshold value, so as to obtain the trained neural network model.
Further, the linear spectrum and the initial phase are input to the Griffin-Lim vocoder through the same number of spectrum channels and phase channels, respectively.
Another object of the present invention is to provide a speech synthesis apparatus that applies an overall loss to the target values and predicted values of the spectrum and the phase, so that the spectrum and the phase of the model are trained in a consistent direction; a linear spectrum and an initial phase are obtained from the trained model and input into a Griffin-Lim vocoder for iterative training, a joint phase connecting the audio frames is obtained, and the corresponding audio signal is recovered and output.
The second object of the invention is achieved by adopting the following technical solution:
a speech synthesis apparatus, comprising:
the data processing module is used for acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
the model training module is used for inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
the audio output module is used for inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
It is a further object of the invention to provide an electronic device comprising a processor, a storage medium and a computer program, the computer program being stored in the storage medium, wherein the computer program, when executed by the processor, performs the speech synthesis method of the first object of the invention.
It is a fourth object of the present invention to provide a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the speech synthesis method of the first object of the invention.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, a frequency spectrum target value and a phase target value are obtained according to text data, the text data are converted into text vectors, the text vectors are input into a neural network model to predict frequency spectrums and phases, further, the overall loss is obtained through calculation, the frequency spectrums and the phases of the model are constrained to be trained in the consistent direction, linear frequency spectrums and initial phases are obtained based on the trained neural network model, the linear frequency spectrums and the initial phases are input into a Griffin-Lim vocoder to be trained, the iteration times of the vocoder can be reduced, the convergence speed of the vocoder is accelerated, further, joint phases connected with all audio frames are obtained, and audio signals corresponding to the text data are recovered and output according to the joint phases; the voice synthesis method can accelerate the real-time audio synthesis process under the condition of not reducing the audio quality, and is suitable for a voice synthesis device which uses the Griffin-Lim algorithm as a vocoder.
Drawings
FIG. 1 is a flowchart of a speech synthesis method according to the first embodiment of the present invention;
FIG. 2 is a flowchart illustrating the iterative training of the Griffin-Lim vocoder according to the first embodiment of the present invention;
FIG. 3 is a block diagram of a speech synthesis apparatus according to the fifth embodiment of the present invention;
FIG. 4 is a block diagram of an electronic device according to the sixth embodiment of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings; the description is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form further embodiments not shown in the description below.
Example One
The first embodiment provides a speech synthesis method. A spectrum target value and a phase target value are obtained from text data, the text data is converted into a text vector, the spectrum and the phase are predicted by a neural network model, and an overall loss is calculated from the target values and the predicted values so as to constrain the direction in which the model parameters are updated during training. A linear spectrum and an initial phase are obtained from the trained model and input into a Griffin-Lim vocoder for iterative training, yielding the corresponding audio signal. Without degrading audio quality, this speech synthesis method reduces the number of vocoder iterations and accelerates the vocoder's convergence, thereby speeding up the real-time audio synthesis process as a whole; it is suitable for speech synthesis apparatus that use the Griffin-Lim algorithm as the vocoder.
Referring to FIG. 1, a speech synthesis method includes the following steps:
s110, text data are obtained, a linear spectrum target value and a phase target value are obtained according to the text data, and the text data are converted into text vectors.
The text data may be obtained from a self-built text library, or from a product text library such as news broadcasts, weather forecasts or electronic book reading; this is not limited here.
In order to obtain time-frequency-domain information to serve as the target values with which the neural network model is trained, audio data matched with the text data is acquired and analyzed with a time-frequency analysis method such as the short-time Fourier transform, the wavelet transform or the Wigner distribution (not limited here). This yields the linear spectrum and phase corresponding to the text data, which serve as the linear spectrum target value and phase target value for training the neural network model. In this embodiment, the audio data is transformed by a short-time Fourier transform to obtain the linear spectrum and the phase.
Because text data varies in length, it cannot be input into the model directly for training; it must first be converted into corresponding text vectors, which serve as the input for model training. The text data may be segmented by characters or by words (not limited here), and the corresponding text vectors are then constructed from the segmentation result. In this embodiment, the text data is segmented into words, a dictionary is constructed from the word frequencies in the text library, each word is assigned an id through its dictionary subscript, and the segmented words are matched to the word vectors corresponding to their ids by traversing the dictionary, thereby obtaining the text vectors corresponding to the text data.
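A minimal sketch of this dictionary construction follows. The function names, the vocabulary cap and the unknown-word handling are illustrative assumptions, not part of the disclosure.

```python
from collections import Counter

def build_dictionary(segmented_corpus, max_size=10000):
    # Rank words by frequency in the text library; the dictionary
    # subscript of each word serves as its id.
    freq = Counter(w for sentence in segmented_corpus for w in sentence)
    return {w: i for i, (w, _) in enumerate(freq.most_common(max_size))}

def words_to_ids(words, dictionary):
    # Unknown words map to a reserved id one past the dictionary.
    unk = len(dictionary)
    return [dictionary.get(w, unk) for w in words]
```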
S120, inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model.
The text vector is input into the neural network model to obtain a linear spectrum predicted value and a phase predicted value of the same dimensionality. The phase target value and the phase predicted value are fed into a phase loss function to obtain the phase loss; the linear spectrum target value and the linear spectrum predicted value are fed into a linear spectrum loss function to obtain the linear spectrum loss. The phase loss and the linear spectrum loss are added according to preset weights to obtain the overall loss of the neural network model. Preferably, the phase loss and the linear spectrum loss carry equal weight in the overall loss.
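As an illustration, the overall loss just described might be computed as below (PyTorch). The equal weights follow the stated preference; the use of L1 losses anticipates the fourth embodiment, and the function name is an assumption.

```python
import torch.nn.functional as F

def overall_loss(spec_pred, spec_target, phase_pred, phase_target,
                 w_spec=0.5, w_phase=0.5):
    # Linear-spectrum loss and phase loss, added with preset weights;
    # equal weights match the stated preference.
    spec_loss = F.l1_loss(spec_pred, spec_target)
    phase_loss = F.l1_loss(phase_pred, phase_target)
    return w_spec * spec_loss + w_phase * phase_loss
```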
When the overall loss is greater than or equal to a preset threshold, the neural network model continues training, so that the phase loss and the linear spectrum loss jointly constrain the training direction of the model and the model parameters are updated toward both the predicted spectrum and the predicted phase. The gradient is calculated from the overall loss by the back-propagation algorithm, the model parameters are updated by gradient descent according to the calculated gradient, and the linear spectrum predicted value and the phase predicted value output by the model under training are adjusted accordingly, until the overall loss is smaller than the preset threshold; this yields the trained model together with the linear spectrum and phase it outputs.
The phase output by the trained model is recorded as φ̂. To obtain an initial phase with a value range of [−π, π], this embodiment applies a hyperbolic tangent operation to the phase φ̂, calculating the initial phase as:

φ₀ = π · tanh(φ̂)
Preferably, the gradient is calculated from the overall loss by the back-propagation algorithm and the model is trained with an Adam optimizer. During training, the batch size is 32; the learning rate starts at 0.001, drops to 0.0005 after 500,000 steps, to 0.0003 after 1,000,000 steps, and to 0.0001 after 2,000,000 steps. Once the overall loss falls below the preset threshold, training of the neural network model is complete and the linear spectrum and the initial phase are obtained.
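A sketch of this schedule in PyTorch is shown below. Here `model`, `loader` and `threshold` are hypothetical placeholders; only the optimizer, batch size, learning-rate steps and stopping rule come from the text.

```python
import torch
import torch.nn.functional as F

def learning_rate(step):
    # 0.001 to start, 0.0005 after 500,000 steps, 0.0003 after
    # 1,000,000 steps, 0.0001 after 2,000,000 steps.
    if step < 500_000:
        return 1e-3
    if step < 1_000_000:
        return 5e-4
    if step < 2_000_000:
        return 3e-4
    return 1e-4

def train(model, loader, threshold):
    # loader yields (text_vector, spectrum_target, phase_target)
    # batches of size 32; model, loader, threshold are placeholders.
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate(0))
    for step, (text_vec, spec_t, phase_t) in enumerate(loader):
        for group in optimizer.param_groups:
            group["lr"] = learning_rate(step)
        spec_p, phase_p = model(text_vec)
        # Equal-weight overall loss, as in the sketch above.
        loss = 0.5 * F.l1_loss(spec_p, spec_t) + 0.5 * F.l1_loss(phase_p, phase_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < threshold:  # overall loss below the preset threshold
            break
    return model
```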
Preferably, to carry the initial phase information, the number of channels is doubled: the linear spectrum and the initial phase are input into the Griffin-Lim vocoder through spectrum channels and phase channels of equal number, and the number of spectrum channels and phase channels matches the dimensionality of the linear spectrum and the initial phase.
S130, inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data.
Referring to FIG. 2, the spectral magnitude of the linear spectrum and the initial phase are input into the Griffin-Lim vocoder and passed through an inverse short-time Fourier transform to obtain a time-domain signal. The Griffin-Lim vocoder then begins its iterative training: the time-domain signal is put through a short-time Fourier transform to obtain a new phase, which replaces the current phase. Through repeated iterations a joint phase connecting the audio frames is found, and the time-domain audio signal corresponding to the text data is recovered and output.
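The warm-started loop of FIG. 2 can be sketched as follows, again with librosa. The STFT settings mirror the second embodiment, and the reduced iteration count is an assumption meant to contrast with the random-phase sketch given in the background.

```python
import numpy as np
import librosa

def griffin_lim_warm_start(magnitude, initial_phase, n_iter=10,
                           n_fft=2048, hop_length=200):
    """Griffin-Lim seeded with the model's predicted initial phase
    instead of a random one, so fewer iterations are needed."""
    angles = np.exp(1j * initial_phase)
    for _ in range(n_iter):
        # Inverse STFT to a time-domain signal with the current phase.
        y = librosa.istft(magnitude * angles, hop_length=hop_length)
        # STFT the signal and replace the phase with the new estimate.
        spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(spec))
    # The converged joint phase connects the frames; recover the audio.
    return librosa.istft(magnitude * angles, hop_length=hop_length)
```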
Because the Griffin-Lim vocoder starts its training from a good initial phase corresponding to the text data, it can reach the joint phase connecting the audio frames in far fewer iterations without degrading audio quality. This accelerates the convergence of the Griffin-Lim vocoder, speeds up the real-time audio synthesis process as a whole, and lowers the synthesis real-time factor.
Example Two
The second embodiment is an improvement on the first: a linear spectrum target value and a phase target value are obtained from the text data for use in the loss function of neural network training, and the text data is converted into text vectors for input into the neural network model.
Audio data matched with the text data is acquired and put through a short-time Fourier transform to obtain the linear spectrum target value and the phase target value. The short-time Fourier transform frames and windows the long audio signal, performs a Fourier transform on each frame, and stacks the result of each frame along a second dimension, yielding the linear spectrum and phase corresponding to the audio data; these serve as the linear spectrum target value and phase target value, the target-value parameters of the loss function for neural network training.
Preferably, a Hanning window is applied to the audio data, each frame is 50 milliseconds long with a 12.5-millisecond frame shift, and a 2048-point Fourier transform is performed to obtain the linear spectrum and compute the phase.
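With those parameters, extracting the two target values can be sketched as below. The 16 kHz sample rate is an assumption, since the disclosure does not state one.

```python
import numpy as np
import librosa

def spectrum_and_phase_targets(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # 50 ms Hanning windows, 12.5 ms frame shift, 2048-point FFT.
    stft = librosa.stft(y, n_fft=2048,
                        win_length=int(0.050 * sr),
                        hop_length=int(0.0125 * sr),
                        window="hann")
    # Linear-spectrum target (magnitude) and phase target.
    return np.abs(stft), np.angle(stft)
```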
In order to obtain the text vectors corresponding to the text data for input into model training, the text data is segmented and the segmentation result is one-hot encoded to obtain the text vectors.
The text data is first split into sentences, for example according to punctuation marks, producing a number of complete sentences. A word segmentation method, not limited to word-sense segmentation or string-matching segmentation, is then applied to the sentences to obtain the segmentation result. Word-sense segmentation uses syntactic and semantic information to resolve ambiguities while segmenting. String-matching segmentation includes, but is not limited to, forward maximum matching, backward maximum matching, shortest-path segmentation and bidirectional maximum matching; in this embodiment, forward maximum matching is selected to segment the text data.
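Forward maximum matching admits a short sketch, shown below; the dictionary and the maximum word length are assumptions.

```python
def forward_max_match(sentence, dictionary, max_word_len=6):
    """At each position, greedily take the longest dictionary word;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```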
The segmentation result is processed by one-hot encoding to obtain the text vector, so that the feature represented by each dimension of the text vector is a continuous feature suitable for input into neural network training. Preferably, the text vector is normalized, which facilitates subsequent model training. Preferably, when the number of texts is large, the segmentation result is processed with one-hot encoding followed by principal component analysis, yielding a dimensionality-reduced text vector.
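A sketch of the one-hot encoding with the optional principal component analysis step follows; the toy ids, vocabulary size and component count are assumptions (real settings would be far larger).

```python
import numpy as np
from sklearn.decomposition import PCA

def one_hot(ids, vocab_size):
    vectors = np.zeros((len(ids), vocab_size), dtype=np.float32)
    vectors[np.arange(len(ids)), ids] = 1.0
    return vectors

# For a large text collection, follow the one-hot encoding with PCA
# to obtain reduced-dimension text vectors.
vectors = one_hot([3, 17, 42], vocab_size=10000)
reduced = PCA(n_components=2).fit_transform(vectors)
```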
Example Three
The third embodiment is an improvement on the first and/or second embodiment: the neural network model of the speech synthesis method adopts the Tacotron model, and the text vectors are input into the Tacotron model to obtain the linear spectrum predicted value and the phase predicted value used in the loss function that constrains Tacotron training, so that the spectrum and the phase of the model are trained in a consistent direction.
The Tacotron model comprises an encoder, a decoder and a post-processing network connected in sequence; the post-processing network comprises a CBHG unit and a fully connected layer. The text vectors are computed by the encoder and the decoder in sequence to obtain a Mel spectrum, the Mel spectrum is computed by the CBHG unit to obtain Mel-spectrum features, and the fully connected layer maps these features to a linear spectrum predicted value and a phase predicted value of the same dimensionality.
The text vector is computed by the encoder to obtain audio features, which are input into the decoder. The decoder is a content-based attention decoder comprising a decoding pre-processing unit, an attention RNN unit and a decoder RNN unit. The attention RNN unit is a single RNN layer containing 256 GRUs; the decoder RNN unit is a two-layer GRU stack with vertical residual connections, each layer containing 256 GRUs.
The output of the decoding pre-processing unit and the output of the attention RNN unit are concatenated to form the input of the decoder RNN unit. During training, the decoder RNN unit predicts r non-overlapping output frames at each step; the input to its first step is an all-zero frame, the last frame of the step-t prediction serves as the input to step t+1, and the decoder RNN unit outputs a Mel spectrum with 80 bands.
The CBHG unit comprises a one-dimensional convolution filter bank, a highway network and a bidirectional gated recurrent unit. Preferably, the one-dimensional convolution filter bank comprises 8 groups of one-dimensional convolution kernels of width 3, uses the rectified linear unit (ReLU) as its activation function, and produces a 128-dimensional convolution result; the pooling has a stride of 1 and a width of 2; the highway network consists of 4 fully connected layers with ReLU activations and a 256-dimensional output feature; the bidirectional gated recurrent unit contains 128 GRUs.
The CBHG unit convolves the Mel spectrum with K sets of one-dimensional convolution kernels, the k-th set comprising convolution kernels of width k (k = 1, 2, 3, …, K), which extract contextual features of different lengths; the results of the K groups of convolutions of different widths are stacked together and max-pooled along time. Preferably, the pooling result is batch-normalized, which avoids gradient vanishing and gradient explosion during model training. The batch-normalized result is input into a multi-layer highway network to extract high-level features, and a bidirectional gated recurrent unit then extracts the Mel-spectrum features.
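A PyTorch sketch of this convolution bank, pooling and batch normalization is given below. The 80-band input, K = 8 and 128 channels follow the preferred values above; everything else is an assumption.

```python
import torch
import torch.nn as nn

class ConvBank(nn.Module):
    """CBHG front end: K sets of 1-D convolutions with widths 1..K,
    stacked on the channel axis, max-pooled along time (stride 1,
    width 2) and batch-normalized."""
    def __init__(self, in_dim=80, K=8, channels=128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, channels, kernel_size=k, padding=k // 2)
            for k in range(1, K + 1))
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        self.bn = nn.BatchNorm1d(K * channels)

    def forward(self, x):  # x: (batch, mel_bands, time)
        T = x.size(-1)
        # Even-width kernels produce one extra frame; trim back to T.
        stacked = torch.cat([conv(x)[..., :T] for conv in self.convs], dim=1)
        return self.bn(self.pool(stacked)[..., :T])
```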
The Mel-spectrum features are input into the fully connected layer and concatenated along the depth dimension, and a linear spectrum and a phase of the same dimensionality are predicted; these serve as the predicted values, the predicted-value parameters of the loss function that constrains model training.
Example Four
The fourth embodiment is an improvement on the third: the overall loss is calculated from the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, and is used to train the neural network model so that the spectrum and the phase of the model can be trained in a consistent direction.
The phase target value and the phase predicted value are fed into a phase loss function to obtain the phase loss. The linear spectrum target value and the linear spectrum predicted value are fed into a linear spectrum loss function to obtain the linear spectrum loss.

The same loss function is used for the phase loss and the linear spectrum loss; it includes, but is not limited to, one of the L1 loss, the L2 loss and the cross-entropy loss. The L1 loss is used in this embodiment.

The phase loss and the linear spectrum loss are added according to preset weights to obtain the overall loss of the neural network model. Preferably, the phase loss and the linear spectrum loss carry equal weight in the overall loss.
Preferably, the linear spectrum target value is passed through a Mel filter bank to obtain a Mel spectrum target value, the text vector is computed by the encoder and decoder of the Tacotron model in sequence to obtain a Mel spectrum predicted value, and the Mel spectrum target value and the Mel spectrum predicted value are fed into a Mel spectrum loss function to obtain the Mel spectrum loss. The Mel spectrum loss, the phase loss and the linear spectrum loss are added to obtain the overall loss of the model. Preferably, the Mel spectrum loss, the phase loss and the linear spectrum loss carry equal weight in the overall loss.
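For illustration, the Mel target can be derived from the linear-spectrum target with a Mel filter bank as below (librosa). The sample rate is again an assumption; the 80 bands match the decoder output described in the third embodiment.

```python
import numpy as np
import librosa

# Mel filter bank mapping the 2048-point linear spectrum to 80 bands.
mel_basis = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=80)

def mel_target(linear_spec_target):        # (1025, frames) magnitude
    return mel_basis @ linear_spec_target  # (80, frames)

def l1(a, b):
    return np.mean(np.abs(a - b))

# Equal-weight overall loss over Mel spectrum, phase and linear spectrum:
# overall = (l1(mel_p, mel_t) + l1(phase_p, phase_t) + l1(spec_p, spec_t)) / 3
```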
The neural network model is trained by gradient descent on the overall loss, so that the Mel spectrum loss, the phase loss and the linear spectrum loss jointly constrain the training direction of the model, and the linear spectrum and the initial phase are obtained through the trained neural network model. The Griffin-Lim vocoder is then trained with the linear spectrum and the initial phase, the joint phase connecting the audio frames is obtained through the trained Griffin-Lim vocoder, and the audio signal corresponding to the text data is recovered and output.
Example Five
The fifth embodiment discloses a speech synthesis apparatus corresponding to the foregoing embodiments; it is the virtual apparatus structure of those embodiments. As shown in FIG. 3, the speech synthesis apparatus includes:
the data processing module is used for acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
the model training module is used for inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
the audio output module is used for inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
Preferably, a doubled number of channels is used to carry the initial phase information: the model training module is connected to spectrum channels and phase channels of equal number, and the spectrum channels and phase channels are connected to the audio output module. The linear spectrum and the initial phase are input to the Griffin-Lim vocoder through the spectrum channels and the phase channels respectively.
Example Six
FIG. 4 is a schematic structural diagram of an electronic device according to the sixth embodiment of the present invention. As shown in FIG. 4, the electronic device includes a processor 310, a memory 320, an input device 330 and an output device 340. The number of processors 310 in the electronic device may be one or more (one processor 310 is taken as the example in FIG. 4), and the processor 310, the memory 320, the input device 330 and the output device 340 may be connected by a bus or by other means (connection by a bus is shown in FIG. 4).
The memory 320, as a computer-readable storage medium, is used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the speech synthesis method in the embodiments of the present invention (for example, the data processing module 210, the model training module 220 and the audio output module 230 of the speech synthesis apparatus). By running the software programs, instructions and modules stored in the memory 320, the processor 310 executes the various functional applications and data processing of the electronic device, that is, implements the speech synthesis method of the first through fourth embodiments.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive text data, preset thresholds, etc. The output device 340 may include a display device such as a display screen.
Example Seven
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a speech synthesis method comprising:
acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations of the speech synthesis method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above embodiment of the speech synthesis apparatus, the units and modules included are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.
Claims (9)
1. A speech synthesis method, comprising the steps of:
acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
2. The method of claim 1, wherein obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector comprises:
acquiring audio data matched with the text data;
performing short-time Fourier transform on the audio data to obtain the linear spectrum target value and the phase target value;
and performing word segmentation on the text data to obtain a word segmentation result of the text data, and performing one-hot encoding on the word segmentation result to obtain the text vectors.
3. A speech synthesis method as claimed in claim 1, characterized in that: the neural network model is a Tacotron model, and inputting the text vector into the neural network model to obtain a linear spectrum predicted value and a phase predicted value comprises:

calculating the text vector through the Tacotron model to obtain a linear spectrum predicted value and a phase predicted value of the same dimensionality.
4. A speech synthesis method according to claim 3, characterized by: calculating an overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, including:
inputting the linear spectrum target value and the linear spectrum predicted value into a linear spectrum loss function for calculation to obtain linear spectrum loss;
inputting the phase target value and the phase predicted value into a phase loss function for calculation to obtain phase loss;
and adding the phase loss and the linear spectrum loss according to preset weights to obtain the overall loss.
5. A speech synthesis method according to claim 4, characterized by: training the neural network model according to the overall loss, comprising:
and when the overall loss is greater than or equal to a preset threshold value, training the neural network model based on the overall loss, and obtaining a linear spectrum predicted value and a phase predicted value output by the current training model until the overall loss is less than the preset threshold value, so as to obtain the trained neural network model.
6. A speech synthesis method according to any one of claims 1-5, characterized by: the linear spectrum and the initial phase are respectively input to the Griffin-Lim vocoder through the spectrum channels and the phase channels with the same number of channels.
7. A speech synthesis apparatus, characterized in that it comprises:
the data processing module is used for acquiring text data, obtaining a linear spectrum target value and a phase target value according to the text data, and converting the text data into a text vector;
the model training module is used for inputting the text vector into a neural network model to obtain a linear spectrum predicted value and a phase predicted value, calculating the overall loss according to the linear spectrum target value, the linear spectrum predicted value, the phase target value and the phase predicted value, training the neural network model according to the overall loss, and obtaining a linear spectrum and an initial phase through the trained neural network model;
the audio output module is used for inputting the linear spectrum and the initial phase into a Griffin-Lim vocoder for iterative training to obtain an audio signal corresponding to the text data, wherein the iterative training comprises: performing an inverse short-time Fourier transform on the linear spectrum and the initial phase to obtain an audio signal, obtaining a joint phase connecting the audio frames through the iterations of the Griffin-Lim vocoder, and recovering and outputting the audio signal corresponding to the text data according to the joint phase.
8. An electronic device comprising a processor, a storage medium, and a computer program, the computer program being stored in the storage medium, wherein the computer program, when executed by the processor, performs the speech synthesis method of any one of claims 1 to 6.
9. A computer storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the speech synthesis method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010006604.2A CN110797002B (en) | 2020-01-03 | 2020-01-03 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010006604.2A CN110797002B (en) | 2020-01-03 | 2020-01-03 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110797002A (en) | 2020-02-14
CN110797002B (en) | 2020-05-19
Family
ID=69448507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010006604.2A Active CN110797002B (en) | 2020-01-03 | 2020-01-03 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110797002B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111048064B (en) * | 2020-03-13 | 2020-07-07 | 同盾控股有限公司 | Voice cloning method and device based on single speaker voice synthesis data set |
CN111627418B (en) * | 2020-05-27 | 2023-01-31 | 携程计算机技术(上海)有限公司 | Training method, synthesizing method, system, device and medium for speech synthesis model |
CN111833368B (en) * | 2020-07-02 | 2023-12-29 | 恩平市美高电子科技有限公司 | Speech restoration method based on phase consistency edge detection |
CN112634914B (en) * | 2020-12-15 | 2024-03-29 | 中国科学技术大学 | Neural network vocoder training method based on short-time spectrum consistency |
CN112712812B (en) * | 2020-12-24 | 2024-04-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio signal generation method, device, equipment and storage medium |
CN112820267B (en) * | 2021-01-15 | 2022-10-04 | 科大讯飞股份有限公司 | Waveform generation method, training method of related model, related equipment and device |
CN113436603B (en) * | 2021-06-28 | 2023-05-02 | 北京达佳互联信息技术有限公司 | Method and device for training vocoder and method and vocoder for synthesizing audio signals |
WO2023069805A1 (en) * | 2021-10-18 | 2023-04-27 | Qualcomm Incorporated | Audio signal reconstruction |
CN115424604B (en) * | 2022-07-20 | 2024-03-15 | 南京硅基智能科技有限公司 | Training method of voice synthesis model based on countermeasure generation network |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11080587B2 (en) * | 2015-02-06 | 2021-08-03 | Deepmind Technologies Limited | Recurrent neural networks for data item generation |
US10249289B2 (en) * | 2017-03-14 | 2019-04-02 | Google Llc | Text-to-speech synthesis using an autoencoder |
CN108899009B (en) * | 2018-08-17 | 2020-07-03 | 百卓网络科技有限公司 | Chinese speech synthesis system based on phoneme |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN110136692B (en) * | 2019-04-30 | 2021-12-14 | 北京小米移动软件有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN110136690B (en) * | 2019-05-22 | 2023-07-14 | 平安科技(深圳)有限公司 | Speech synthesis method, device and computer readable storage medium |
CN110473515A (en) * | 2019-08-29 | 2019-11-19 | 郝洁 | A kind of end-to-end speech synthetic method based on WaveRNN |
CN110600017B (en) * | 2019-09-12 | 2022-03-04 | 腾讯科技(深圳)有限公司 | Training method of voice processing model, voice recognition method, system and device |
- 2020-01-03: Application CN202010006604.2A filed; patent CN110797002B granted and active.
Also Published As
Publication number | Publication date |
---|---|
CN110797002A (en) | 2020-02-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| TR01 | Transfer of patent right | |
Effective date of registration: 20210924
Patentee after: Hangzhou Bodun Xiyan Technology Co.,Ltd., 311121 room 210, building 18, No. 998, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province
Patentee before: TONGDUN HOLDINGS Co.,Ltd., 311121 room 208, building 18, No. 998, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province