CN114783410A - Speech synthesis method, system, electronic device and storage medium - Google Patents

Speech synthesis method, system, electronic device and storage medium Download PDF

Info

Publication number
CN114783410A
CN114783410A (application number CN202210406843.6A)
Authority
CN
China
Prior art keywords
features
level
prosodic
discrete
prosody
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210406843.6A
Other languages
Chinese (zh)
Inventor
俞凯
杜晨鹏
郭奕玮
陈谐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202210406843.6A priority Critical patent/CN114783410A/en
Publication of CN114783410A publication Critical patent/CN114783410A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

An embodiment of the invention provides a speech synthesis method, a speech synthesis system, an electronic device and a storage medium. The method comprises: obtaining a hidden-layer representation of the speech synthesis data and inputting it into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction; inputting a mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predicting the discrete acoustic features of each frame with a classifier in the acoustic model, and predicting frame-level prosodic features with a convolutional neural network in the acoustic model; and inputting the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody. The embodiment replaces the traditional mel spectrum with a discretized speech representation, which greatly reduces error propagation. The quality of the synthesized speech is greatly improved while prosodic diversity is preserved: the prosody controller can generate different prosodies and thus diverse speech.

Description

Speech synthesis method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of intelligent speech, and in particular, to a speech synthesis method, system, electronic device, and storage medium.
Background
TTS (text-to-speech) synthesis converts text into the corresponding speech. Compared with traditional statistical parametric speech synthesis, neural TTS models based on deep neural networks perform better. Mainstream neural TTS systems are usually cascaded: the input text is first converted into a mel spectrum, which is then converted into audio. For this conversion, models such as Tacotron2, FastSpeech2 and GlowTTS can be used: Tacotron2 is an attention-based sequence-to-sequence model, FastSpeech2 is a parallel generation model based on the Transformer network, and GlowTTS uses an invertible network that maps the mel-spectrum distribution to a simple distribution and is optimized with a maximum-likelihood criterion.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
in a cascaded system, prediction errors of the acoustic model are passed downstream. The mel spectrum has complex correlations along both time and frequency; its distribution is relatively complex and is difficult to model accurately enough with a generic model. In particular, the high-frequency part of the spectrum is often blurred under the commonly used L1 or L2 loss criteria. Where the mel-spectrum prediction is inaccurate, the quality of the generated audio degrades.
Disclosure of Invention
The embodiments of the invention at least address the problem that, in prior-art cascaded systems, prediction errors of the acoustic model are propagated downstream, and inaccurate mel-spectrum prediction degrades the quality of the generated audio. In a first aspect, an embodiment of the present invention provides a speech synthesis method, including:
acquiring a hidden-layer representation of the speech synthesis data, and inputting the hidden-layer representation into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction;
inputting a mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predicting the discrete acoustic features of each frame with a classifier in the acoustic model, and predicting frame-level prosodic features with a convolutional neural network in the acoustic model;
and inputting the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody.
In a second aspect, an embodiment of the present invention provides a speech synthesis system, including:
a prosody prediction program module, configured to obtain a hidden-layer representation of the speech synthesis data and input it into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction;
a prosodic feature program module, configured to input the mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predict the discrete acoustic features of each frame with a classifier in the acoustic model, and predict frame-level prosodic features with a convolutional neural network in the acoustic model;
and a speech generation program module, configured to input the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the speech synthesis method of any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement the steps of the speech synthesis method according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects: the acoustic model and the vocoder are reconstructed, and a discretized speech representation replaces the traditional mel spectrum, which greatly reduces error propagation. The quality of the synthesized speech is greatly improved while prosodic diversity is preserved; the prosody controller can generate different prosodies and thus diverse speech.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a diagram of a phoneme-level prosody controller of a speech synthesis method according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating the voice reconstruction performance of a vocoder on a test set according to a voice synthesis method provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an evaluation of a text-to-speech synthesis system of a speech synthesis method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of pitch tracks of synthesized speech with different prosodies for a speech synthesis method according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the prediction accuracy of the phone-level prosody label of a speech synthesis method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating the prediction accuracy of discrete acoustic features of a speech synthesis method according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an embodiment of an electronic device for speech synthesis according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present invention, including the following steps:
S11: acquiring a hidden-layer representation of the speech synthesis data, and inputting the hidden-layer representation into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction;
S12: inputting a mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predicting the discrete acoustic features of each frame with a classifier in the acoustic model, and predicting frame-level prosodic features with a convolutional neural network in the acoustic model;
S13: inputting the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody.
In this embodiment, the discretized speech representation is used in place of the traditional mel spectrum, which greatly reduces error propagation. On this basis, as shown in fig. 2, the acoustic model and the vocoder are redesigned, and the resulting speech synthesis pipeline can be described as high-fidelity text-to-speech synthesis with self-supervised discrete acoustic features.
Reconstructing the waveform from vector-quantized acoustic features requires additional prosodic features. Therefore, the method uses three-dimensional prosodic features: logarithmic pitch, energy, and POV (probability of voicing). The prosodic features are normalized to zero mean and unit variance. For brevity, the combination of the discrete acoustic features V and the three-dimensional prosodic features P is abbreviated V&P in the following. The method has two parts: an acoustic model predicts V&P from the input phoneme sequence, and a vocoder generates the waveform from V&P.
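For illustration, a minimal sketch of how the three-dimensional prosodic features might be assembled and normalized is given below. It assumes per-frame pitch, energy and voicing-probability arrays have already been extracted (for example with Kaldi's pitch tools); the function and variable names are illustrative, not taken from the patent, and in practice the normalization statistics would typically be computed over the training corpus rather than per utterance.

```python
import numpy as np

def prosody_features(f0_hz, energy, pov):
    """Stack log-pitch, energy and POV into a (T, 3) matrix and z-normalize each dimension."""
    log_f0 = np.log(np.maximum(f0_hz, 1e-3))       # guard against log(0) on unvoiced frames
    p = np.stack([log_f0, energy, pov], axis=1)    # (T, 3) frame-level prosody
    mean, std = p.mean(axis=0), p.std(axis=0) + 1e-8
    return (p - mean) / std                        # zero mean, unit variance

# Toy example with random placeholder frames
T = 200
p = prosody_features(np.random.uniform(80, 300, T), np.random.rand(T), np.random.rand(T))
print(p.shape, p.mean(axis=0).round(3))            # (200, 3), approximately [0, 0, 0]
```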
For step S11, a hidden-layer representation of the speech synthesis data is obtained during synthesis. In one embodiment, obtaining the hidden representation for the speech synthesis data comprises: inputting a text or phoneme sequence as the speech synthesis data into a text encoder to obtain the hidden-layer representation of the speech synthesis data. TTS usually starts from text; when the front end allows, the text can first be converted into a phoneme sequence, and the phoneme sequence can then be used directly to synthesize speech. The left side of fig. 2 shows the acoustic model, in which a text encoder consisting of 6 Conformer blocks encodes the input phoneme sequence into the hidden state h. The hidden state h is then sent to a PL (phoneme-level) prosody controller that predicts the PL prosody label, and to a duration predictor that predicts the duration of each phoneme, yielding the discrete phoneme-level prosody prediction.
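A hedged PyTorch sketch of this front end is shown below: a phoneme embedding followed by a stack of encoder blocks (standard Transformer layers stand in for the Conformer blocks) produces the hidden state h, and a small duration predictor estimates a duration per phoneme. All class names and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, n_phones=100, d_model=256, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.duration_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, phonemes):                       # (B, L) phoneme ids
        h = self.blocks(self.embed(phonemes))          # hidden representation h: (B, L, d_model)
        dur = self.duration_predictor(h).squeeze(-1)   # predicted duration per phoneme
        return h, dur

h, dur = TextEncoder()(torch.randint(0, 100, (1, 12)))
print(h.shape, dur.shape)                              # (1, 12, 256), (1, 12)
```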
For step S12, the mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation is input into the acoustic model, whose decoder consists of 3 Conformer blocks. The decoder output passes through an LSTM (Long Short-Term Memory) layer followed by a Softmax activation for discrete acoustic-feature classification. The decoder output and the discrete acoustic features are then concatenated and passed to 4 convolutional layers, each followed by layer normalization and a dropout layer, for prosodic-feature prediction, yielding the predicted frame-level prosodic features.
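The sketch below illustrates, under the same assumptions, the two decoder heads described in this step: an LSTM plus a linear classifier (with Softmax applied through the cross-entropy loss or an argmax) for the discrete acoustic features, and a 4-layer convolutional stack with layer normalization and dropout that predicts the 3-dimensional frame-level prosody from the decoder output concatenated with the embedded discrete features. Sizes are illustrative, and during training the ground-truth codes would be fed in place of the argmax.

```python
import torch
import torch.nn as nn

class DecoderHeads(nn.Module):
    def __init__(self, d_model=256, n_codes=21500, d_code=128):
        super().__init__()
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.code_classifier = nn.Linear(d_model, n_codes)   # Softmax via cross-entropy / argmax
        self.code_embed = nn.Embedding(n_codes, d_code)
        convs, ch_in = [], d_model + d_code
        for _ in range(4):
            convs.append(nn.Conv1d(ch_in, d_model, kernel_size=5, padding=2))
            ch_in = d_model
        self.convs = nn.ModuleList(convs)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.drop = nn.Dropout(0.2)
        self.prosody_out = nn.Linear(d_model, 3)

    def forward(self, dec_out):                               # (B, T, d_model) decoder output
        h, _ = self.lstm(dec_out)
        code_logits = self.code_classifier(h)                 # (B, T, n_codes)
        codes = code_logits.argmax(-1)                        # ground-truth codes during training
        x = torch.cat([dec_out, self.code_embed(codes)], dim=-1).transpose(1, 2)
        for conv, norm in zip(self.convs, self.norms):
            x = self.drop(norm(conv(x).transpose(1, 2))).transpose(1, 2)
        return code_logits, self.prosody_out(x.transpose(1, 2))   # frame-level prosody (B, T, 3)

logits, pros = DecoderHeads()(torch.randn(2, 50, 256))
print(logits.shape, pros.shape)                               # (2, 50, 21500), (2, 50, 3)
```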
As explained above, the method computes the three-dimensional prosodic features p together with their dynamic features Δp and Δ²p. The 9-dimensional prosodic features [p, Δp, Δ²p] are averaged over the frames of each phoneme, so the prosody of each phoneme can be represented by a single vector. All PL (phoneme-level) prosody representations are then clustered into n classes using k-means, and the cluster index is used as the PL prosody label.
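A minimal sketch of this phoneme-level prosody labelling step follows: the 9-dimensional [p, Δp, Δ²p] features are averaged over each phoneme's frames and clustered with k-means, and the cluster index serves as the PL prosody label. scikit-learn is assumed for the clustering, and the simple first-order difference used here for the dynamic features is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def delta(x):
    """First-order difference along time, used as a stand-in dynamic feature."""
    return np.diff(x, axis=0, prepend=x[:1])

def phone_level_vectors(p, phone_spans):
    """p: (T, 3) prosody; phone_spans: list of (start, end) frame indices per phoneme."""
    feats = np.concatenate([p, delta(p), delta(delta(p))], axis=1)   # (T, 9)
    return np.stack([feats[s:e].mean(axis=0) for s, e in phone_spans])

p = np.random.randn(100, 3)
vecs = phone_level_vectors(p, [(0, 20), (20, 55), (55, 100)])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vecs)           # the patent uses n = 128
print(vecs.shape, labels)                                            # (3, 9), cluster indices
```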
The redesigned acoustic model also needs to be trained. Before training it, phoneme-level (PL) prosody labels are computed in advance for all training phonemes and fed into the prosody controller, whose architecture is shown in fig. 3. The controller uses an LSTM trained to predict the PL prosody labels from the text-encoder output h. The quantized PL prosody, i.e. the centre of the corresponding k-means cluster, is then projected and added to h to condition the subsequent acoustic-feature generation.
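A hedged sketch of such a phoneme-level prosody controller is given below: an LSTM predicts a PL prosody label for each phoneme from the text-encoder state h, and the corresponding k-means cluster centre is projected and added back to h to condition the rest of the model. All sizes and names are assumptions, and for brevity the autoregressive feedback of previously predicted labels is omitted.

```python
import torch
import torch.nn as nn

class PLProsodyController(nn.Module):
    def __init__(self, d_model=256, n_labels=128, d_prosody=9):
        super().__init__()
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.label_head = nn.Linear(d_model, n_labels)
        # k-means cluster centres, fixed after clustering the training-set prosody vectors
        self.centres = nn.Parameter(torch.randn(n_labels, d_prosody), requires_grad=False)
        self.project = nn.Linear(d_prosody, d_model)

    def forward(self, h):                                  # h: (B, L, d_model) from the text encoder
        out, _ = self.lstm(h)
        logits = self.label_head(out)                      # PL prosody label prediction
        quantized = self.centres[logits.argmax(-1)]        # (B, L, d_prosody) cluster centres
        return logits, h + self.project(quantized)         # prosody-conditioned hidden state

logits, h_cond = PLProsodyController()(torch.randn(1, 12, 256))
print(logits.shape, h_cond.shape)                          # (1, 12, 128), (1, 12, 256)
```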
Further, the phoneme durations and the prosodic features are trained with L2 and L1 losses respectively, and the PL prosody labels and the discrete acoustic features are trained with cross-entropy losses. The overall training criterion is:
L = L_PL_lab + L_dur + L_v + L_p
where L_PL_lab denotes the loss of the phoneme-level prosody labels, L_dur the loss of the phoneme durations, L_v the loss of the discrete acoustic features, and L_p the loss of the prosodic features.
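Assuming the overall criterion is an unweighted sum of the four terms, a minimal PyTorch sketch of this training objective could look as follows; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def acoustic_model_loss(label_logits, labels, dur_pred, dur_gt,
                        code_logits, codes, prosody_pred, prosody_gt):
    l_pl_lab = F.cross_entropy(label_logits.transpose(1, 2), labels)   # PL prosody labels
    l_dur = F.mse_loss(dur_pred, dur_gt)                               # L2 on phoneme durations
    l_v = F.cross_entropy(code_logits.transpose(1, 2), codes)          # discrete acoustic features
    l_p = F.l1_loss(prosody_pred, prosody_gt)                          # L1 on frame-level prosody
    return l_pl_lab + l_dur + l_v + l_p

# Toy shapes: B utterances, L phonemes, T frames
B, L, T, n_lab, n_codes = 2, 8, 40, 128, 100
loss = acoustic_model_loss(torch.randn(B, L, n_lab), torch.randint(0, n_lab, (B, L)),
                           torch.randn(B, L), torch.rand(B, L),
                           torch.randn(B, T, n_codes), torch.randint(0, n_codes, (B, T)),
                           torch.randn(B, T, 3), torch.randn(B, T, 3))
print(float(loss))
```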
As the acoustic model changes, its decoding is adjusted accordingly. The acoustic model contains two LSTMs, used for autoregressive prediction of the PL prosody labels and of the discrete acoustic features, respectively. During training, both LSTMs are conditioned on their inputs and on the ground-truth previous outputs. At inference, beam-search decoding is adopted: decoding starts from an all-zero vector, and K denotes the beam size. In each decoding step, the top K classes of all current hypotheses are considered, and the K best results are kept as the new hypotheses. Unlike greedy decoding, which always selects the best class at each step based only on the history, beam-search decoding also takes the future into account.
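The following toy sketch illustrates beam-search decoding over a generic autoregressive step function of the kind used here; `step_fn` stands in for one LSTM step returning log-probabilities over classes and is an assumption, not an interface from the patent.

```python
import torch

def beam_search(step_fn, n_steps, beam_size=5):
    hyps = [([], 0.0)]                                    # (token sequence, cumulative log-prob)
    for _ in range(n_steps):
        candidates = []
        for seq, score in hyps:
            prev = seq[-1] if seq else 0                  # decoding starts from an all-zero input
            log_probs = step_fn(prev)                     # (n_classes,) log-probabilities
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((seq + [ix], score + lp))
        hyps = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return hyps[0][0]                                     # best-scoring hypothesis

# Toy step function: a fixed log-probability table indexed by the previous token
torch.manual_seed(0)
table = torch.log_softmax(torch.randn(10, 10), dim=-1)
print(beam_search(lambda prev: table[prev], n_steps=5, beam_size=5))
```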
For step S13, the redesigned vocoder generates speech with diverse prosody from the discrete acoustic features and the frame-level prosodic features. The model architecture of the vocoder is shown in the right half of fig. 2. In one embodiment, the discrete acoustic features and the frame-level prosodic features are spliced and, after being processed by a convolutional layer and a feature encoder, input to an adversarial network generator to generate speech with diverse prosody, where the feature encoder is built from four Conformer blocks and is used to optimize the quality of the generated speech. In this embodiment, the discrete acoustic features and the prosodic features are each first transformed by a convolutional layer, with 92 and 32 channels respectively and a kernel size of 5. The two outputs are then concatenated and passed in sequence to a convolutional layer, the feature encoder, and the HifiGAN generator. The feature encoder is designed to smooth the discontinuous quantized acoustic features; it contains 4 Conformer blocks, each using 2 attention heads and 384-dimensional self-attention. The output of the HifiGAN generator is the corresponding waveform, and the HifiGAN training criteria are used to optimize the vocoder model.
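An illustrative sketch of this vocoder path is given below: the embedded discrete codes and the 3-dimensional prosody are each passed through a convolutional layer, concatenated, and sent through another convolutional layer and a feature encoder (approximated here with Transformer layers in place of Conformer blocks) before a generator produces the waveform. `ToyGenerator` is only a stand-in for the HifiGAN generator, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):                 # placeholder for the HifiGAN generator
    def __init__(self, d_in, upsample=160):
        super().__init__()
        self.net = nn.ConvTranspose1d(d_in, 1, kernel_size=upsample, stride=upsample)
    def forward(self, x):                      # (B, C, T) -> (B, 1, T * upsample) waveform
        return torch.tanh(self.net(x))

class Vocoder(nn.Module):
    def __init__(self, n_codes=21500, d_model=384):
        super().__init__()
        self.code_embed = nn.Embedding(n_codes, 92)
        self.code_conv = nn.Conv1d(92, 92, kernel_size=5, padding=2)
        self.pros_conv = nn.Conv1d(3, 32, kernel_size=5, padding=2)
        self.merge_conv = nn.Conv1d(92 + 32, d_model, kernel_size=5, padding=2)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True)
        self.feature_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.generator = ToyGenerator(d_model)

    def forward(self, codes, prosody):         # codes: (B, T), prosody: (B, T, 3)
        c = self.code_conv(self.code_embed(codes).transpose(1, 2))
        p = self.pros_conv(prosody.transpose(1, 2))
        x = self.merge_conv(torch.cat([c, p], dim=1))          # (B, d_model, T)
        x = self.feature_encoder(x.transpose(1, 2))            # smooth the quantized features
        return self.generator(x.transpose(1, 2))

wav = Vocoder()(torch.randint(0, 21500, (1, 50)), torch.randn(1, 50, 3))
print(wav.shape)      # (1, 1, 8000): 50 frames at a 10 ms hop and 16 kHz
```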
In one embodiment, the training of the feature encoder includes: in addition to training the feature encoder on the discrete acoustic training features and the frame-level prosodic training features, multi-task training with a mel-spectrum prediction task is applied to help training converge.
In this embodiment, besides comparing the predictions with the existing targets until training converges, or using teacher-student training or other training schemes, the following was also found: when the model is trained from scratch with only the HifiGAN loss, the vocoder is hard to converge. A multitask pre-training technique is therefore introduced, in which a linear projection layer additionally predicts a mel spectrum from the output of the feature encoder. Formally, the training criterion during pre-training can be written as:
L = L_HifiGAN + α · L_mel
where L_HifiGAN denotes the loss function for training HifiGAN and L_mel denotes the loss of the predicted mel spectrum.
After pre-training, the mel spectrogram prediction task is removed, which means that α is set to 0.
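A minimal sketch of this pre-training schedule, under the assumption that the auxiliary mel term is simply weighted by α and dropped to 0 after the pre-training phase, might look as follows; the HifiGAN loss is passed in as an already-computed scalar placeholder.

```python
import torch
import torch.nn.functional as F

def vocoder_loss(hifigan_loss, mel_pred, mel_gt, step, pretrain_steps=200_000, alpha=60.0):
    a = alpha if step < pretrain_steps else 0.0          # remove the mel task after pre-training
    return hifigan_loss + a * F.l1_loss(mel_pred, mel_gt)

loss = vocoder_loss(torch.tensor(1.3), torch.randn(1, 80, 50), torch.randn(1, 80, 50), step=1000)
print(float(loss))
```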
To briefly summarize the above steps as a whole: in fig. 2, the left half is the acoustic model and the right half is the vocoder. After a text or phoneme sequence is input into the model, the acoustic model obtains a hidden-layer representation through the text encoder. The discretized phoneme-level prosody is predicted first; it is then mixed with the hidden-layer representation and fed into the acoustic model, which classifies and predicts the discrete speech representation of each frame through Softmax. On this basis, the frame-level prosodic features are then predicted by a convolutional neural network.
The discrete acoustic features and prosodic features are then fed into the vocoder. The vocoder transforms the discrete speech representation and the frame-level prosodic features with convolutional layers, splices them together, and sends the result through another convolutional layer and a feature encoder into the HifiGAN generator to predict the audio. The feature encoder here consists of 4 Conformer blocks, and the optimization criterion of this model is kept consistent with that of HifiGAN. To ensure training convergence, a pre-training technique is used: an additional multi-task objective initially predicts the mel spectrum from the output of the feature encoder, and this extra task is removed after a period of training.
As can be seen from this embodiment, the acoustic model and the vocoder are reconstructed, and a discretized speech representation replaces the conventional mel spectrum, which greatly reduces error propagation. The quality of the synthesized speech is greatly improved while prosodic diversity is preserved; the prosody controller can generate different prosodies and thus diverse speech.
The method was tested experimentally on the LJSpeech dataset, an English dataset containing approximately 24 hours of speech recorded by a single female speaker. 100 utterances were held out for validation and 150 for testing. For simplicity, all speech data in this work was resampled to 16 kHz. Discrete acoustic features are extracted with a publicly available pre-trained model; the frame shift of the discrete features is 10 ms and the number of possible discrete acoustic vectors is 21.5k. The three-dimensional prosodic features are extracted with the Kaldi toolkit. Audio samples are provided online.
For vocoder reconstruction, the vocoder and HifiGAN are trained on the V&P training set, where α is set to 60 in the first 200k optimization iterations of pre-training. A HifiGAN with mel-spectrum input was also trained for comparison. Speech reconstruction performance on the test set given GT (ground-truth) acoustic features was then evaluated both subjectively and objectively. Specifically, an MOS (mean opinion score) listening test was performed, in which 15 listeners rated the speech quality of each sentence from 0 to 5. For objective evaluation, PESQ (Perceptual Evaluation of Speech Quality) was calculated, which measures the similarity between the synthesized speech and the corresponding recordings. GPE (Gross Pitch Error) was also analyzed, computed here as the proportion of voiced frames whose pitch differs by less than 20% between the recorded and the synthesized speech. The results are shown in fig. 4.
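For reference, a small sketch of the GPE computation as stated here (the proportion of voiced frames whose pitch differs by less than 20%) is shown below; note that GPE is also commonly defined as the complement, i.e. the proportion of frames deviating by more than 20%, in which case the comparison is simply flipped.

```python
import numpy as np

def gpe(f0_ref, f0_syn, voiced_mask, tol=0.2):
    """Fraction of voiced frames whose relative pitch difference is below `tol`."""
    ref, syn = f0_ref[voiced_mask], f0_syn[voiced_mask]
    rel_err = np.abs(syn - ref) / np.maximum(ref, 1e-8)
    return float(np.mean(rel_err < tol))

f0_ref = np.random.uniform(80, 300, 100)
f0_syn = f0_ref * (1 + np.random.normal(0, 0.1, 100))
print(gpe(f0_ref, f0_syn, np.ones(100, dtype=bool)))
```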
In the objective evaluation, it was found that, using V&P, the vocoder reconstructs the recording better than HifiGAN does. The PESQ value of the vocoder is, however, worse than that of HifiGAN with the mel spectrogram, which is largely due to the information lost in quantization. A closer reconstruction does not necessarily mean better speech quality, though: in practice, the difference between the speech produced by the vocoder and by HifiGAN with the mel spectrogram is almost imperceptible. In the subjective listening test, the vocoder outperforms HifiGAN using V&P and reaches a quality comparable to HifiGAN using the mel spectrogram. With the V&P HifiGAN, some unwanted artifacts can occasionally be heard, which may be caused by the discontinuous quantized input features.
For the naturalness of the synthesized speech, the whole discrete-feature-based text-to-speech system (i.e. the high-fidelity TTS with self-supervised discrete acoustic features described above) was trained, with the acoustic model optimized for 1000 steps using the Adam optimizer. For the PL prosody and discrete acoustic-feature prediction, the number of PL prosody clusters n is set to 128, and the beam sizes in beam-search decoding are set to 5 and 10, respectively. The method is then compared with other currently popular methods, including Tacotron2, GlowTTS, FastSpeech2, and the fully end-to-end TTS model VITS. In the first three baseline systems, an 80-dimensional mel spectrogram is used as the acoustic feature and HifiGAN as the vocoder. An MOS listening test is used to assess the naturalness of the synthesized speech. Utterances in the test set were presented to 15 listeners, and the results are shown in fig. 5 with 95% confidence intervals.
As expected, a quality degradation is observed in all cascaded baseline TTS systems compared with speech reconstruction from the GT mel spectrogram. Although the sound quality of the fully end-to-end model VITS is similar to that of the method, it sometimes exhibits unnatural prosody. In contrast, compared with speech reconstruction from GT V&P, the method generates high-fidelity and natural speech with little quality degradation. In addition, the method is a cascaded TTS system and is therefore more flexible than the fully end-to-end VITS.
Regarding prosodic diversity, text-to-speech is a one-to-many mapping, because speech carries different prosodies in addition to the text. The method models this diversity with the PL prosody controller, which makes it possible to control speech synthesis using different PL prosody hypotheses in the beam search. Here, a sentence from the test set is synthesized with 3 different prosody hypotheses; their pitch trajectories are shown in fig. 6, where the differences are clearly visible.
For the decoding algorithm, the effectiveness of beam-search decoding for PL prosody label and discrete acoustic-feature prediction is examined. To this end, greedy search and beam search with beam sizes 5 and 10 are used in the two tasks, respectively. The discrete acoustic-feature prediction is conditioned on the GT durations and PL prosody labels, so that the predicted features are exactly aligned with the GT features and the prediction accuracy can be computed. The results are shown in figs. 7 and 8.
It can be seen that the accuracy is not very high in any setting. Nevertheless, in both inference tasks, beam-search decoding is still slightly more accurate than greedy search. Furthermore, a beam size of 5 works better for PL prosody label prediction, while a beam size of 10 works better for discrete acoustic-feature prediction.
In general, the method uses self-supervised discrete acoustic features instead of the traditional mel spectrogram, which greatly reduces the quality gap between GT and predicted acoustic features and thus improves the performance of the entire TTS system. The vocoder in this method uses an additional feature encoder to smooth the discontinuous quantized input features and achieves better reconstruction performance than HifiGAN. It was also found that different PL prosody hypotheses in beam-search decoding generate different prosodies. In addition, beam-search decoding is superior to greedy search in both PL prosody and discrete acoustic-feature prediction.
Fig. 9 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present invention, where the system can execute the speech synthesis method according to any of the embodiments described above and is configured in a terminal.
The present embodiment provides a speech synthesis system 10 including: a prosody prediction program module 11, a prosody feature program module 12, and a speech generation program module 13.
The prosody prediction program module 11 is configured to obtain a hidden-layer representation of the speech synthesis data and input it into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction; the prosodic feature program module 12 is configured to input the mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predict the discrete acoustic features of each frame with a classifier in the acoustic model, and predict frame-level prosodic features with a convolutional neural network in the acoustic model; the speech generation program module 13 is configured to input the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody.
Further, the speech generation program module is configured to:
splice the discrete acoustic features and the frame-level prosodic features, process them with a convolutional layer and a feature encoder, and input the result into an adversarial network generator to generate speech with diverse prosody, wherein the feature encoder is built from four Conformer blocks and is used to optimize the quality of the generated speech.
Further, the training of the feature encoder includes:
in addition to training the feature encoder on discrete acoustic training features and frame-level prosodic training features, applying multi-task training with a mel spectrum to help training converge.
Further, the prosody prediction program module is configured to:
input a text or phoneme sequence as the speech synthesis data into a text encoder to obtain the hidden-layer representation of the speech synthesis data.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the voice synthesis method in any method embodiment;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
acquiring a hidden-layer representation of the speech synthesis data, and inputting the hidden-layer representation into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction;
inputting a mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predicting the discrete acoustic features of each frame with a classifier in the acoustic model, and predicting frame-level prosodic features with a convolutional neural network in the acoustic model;
and inputting the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the speech synthesis method in any of the method embodiments described above.
Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to a speech synthesis method provided in another embodiment of the present application, where as shown in fig. 10, the electronic device includes:
one or more processors 1010 and memory 1020, one processor 1010 being illustrated in fig. 10. The apparatus of the speech synthesis method may further include: an input device 1030 and an output device 1040.
The processor 1010, memory 1020, input device 1030, and output device 1040 may be connected by a bus or other means, such as by a bus connection in fig. 10.
The memory 1020, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speech synthesis method in the embodiments of the present application. The processor 1010 executes various functional applications of the server and data processing by operating nonvolatile software programs, instructions, and modules stored in the memory 1020, so as to implement the above-described method embodiment voice synthesis method.
The memory 1020 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory 1020 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 1020 may optionally include memory located remotely from processor 1010, which may be coupled to the mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 1030 may receive input numeric or character information. Output device 1040 may include a display device such as a display screen.
The one or more modules are stored in the memory 1020 and, when executed by the one or more processors 1010, perform the speech synthesis method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The non-volatile computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the speech synthesis method of any of the embodiments of the present invention.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on the understanding, the above technical solutions substantially or otherwise contributing to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of speech synthesis comprising:
acquiring a hidden-layer representation of the speech synthesis data, and inputting the hidden-layer representation into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction;
inputting a mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predicting the discrete acoustic features of each frame with a classifier in the acoustic model, and predicting frame-level prosodic features with a convolutional neural network in the acoustic model;
and inputting the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody.
2. The method of claim 1, wherein inputting the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody comprises:
splicing the discrete acoustic features and the frame-level prosodic features, processing them with a convolutional layer and a feature encoder, and inputting the result into an adversarial network generator to generate speech with diverse prosody, wherein the feature encoder is built from four Conformer blocks and is used to optimize the quality of the generated speech.
3. The method of claim 2, wherein the training of the feature encoder comprises:
in addition to training the feature encoder on discrete acoustic training features and frame-level prosodic training features, applying multi-task training with a mel spectrum, the multi-task training being used to help training converge.
4. The method of claim 1, wherein prior to said obtaining the hidden representation for the speech synthesis data, the method further comprises: a pre-trained phoneme-level prosodic controller, comprising:
and clustering prosodic features in a training set in advance, so that the phoneme-level prosodic controller outputs a phoneme-level prosodic tag.
5. The method of claim 1, wherein the obtaining a hidden representation for speech synthesis data comprises:
inputting a text or phoneme sequence as speech synthesis data into a text encoder to obtain a hidden layer representation of the speech synthesis data.
6. A speech synthesis system comprising:
a prosody prediction program module, configured to obtain a hidden-layer representation of the speech synthesis data and input it into a phoneme-level prosody controller to obtain a discrete phoneme-level prosody prediction;
a prosodic feature program module, configured to input the mixture of the discrete phoneme-level prosody prediction and the hidden-layer representation into an acoustic model, predict the discrete acoustic features of each frame with a classifier in the acoustic model, and predict frame-level prosodic features with a convolutional neural network in the acoustic model;
and a speech generation program module, configured to input the discrete acoustic features and the frame-level prosodic features into a vocoder to generate speech with diverse prosody.
7. The system of claim 6, wherein the speech generation program module is configured to:
splice the discrete acoustic features and the frame-level prosodic features, process them with a convolutional layer and a feature encoder, and input the result into an adversarial network generator to generate speech with diverse prosody, wherein the feature encoder is built from four Conformer blocks and is used to optimize the quality of the generated speech.
8. The system of claim 7, wherein the training of the feature encoder comprises:
in addition to training the feature encoder on discrete acoustic training features and frame-level prosodic training features, applying multi-task training with a mel spectrum to help training converge.
9. The system of claim 6, wherein the prosody predictor module is further to:
and clustering prosodic features in a training set in advance, so that the phoneme-level prosodic controller outputs a phoneme-level prosodic tag.
10. The system of claim 6, wherein the prosody predictor module is to:
inputting a text or phoneme sequence as speech synthesis data into a text encoder to obtain a hidden layer representation of the speech synthesis data.
11. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-5.
12. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202210406843.6A 2022-04-18 2022-04-18 Speech synthesis method, system, electronic device and storage medium Pending CN114783410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210406843.6A CN114783410A (en) 2022-04-18 2022-04-18 Speech synthesis method, system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210406843.6A CN114783410A (en) 2022-04-18 2022-04-18 Speech synthesis method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114783410A true CN114783410A (en) 2022-07-22

Family

ID=82431255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210406843.6A Pending CN114783410A (en) 2022-04-18 2022-04-18 Speech synthesis method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114783410A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination