CN117809622A - Speech synthesis method, device, storage medium and computer equipment - Google Patents

Speech synthesis method, device, storage medium and computer equipment

Info

Publication number
CN117809622A
Authority
CN
China
Prior art keywords
text
prosody
training
vector
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311863296.5A
Other languages
Chinese (zh)
Inventor
曾锐鸿
廖艳冰
马飞
兰翔
张政统
邓其春
黄祥康
黎子骏
吴文亮
盘子圣
王伟喆
马金龙
熊佳
徐志坚
陈光尧
谢睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Quyan Network Technology Co ltd
Original Assignee
Guangzhou Quyan Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Quyan Network Technology Co ltd filed Critical Guangzhou Quyan Network Technology Co ltd
Priority to CN202311863296.5A priority Critical patent/CN117809622A/en
Publication of CN117809622A publication Critical patent/CN117809622A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a speech synthesis method, apparatus, storage medium and computer device. A prosody prediction model is introduced so that the intonation and prosody distribution of the synthesized voice audio can be predicted from the text semantics of the text to be synthesized, improving the naturalness and fluency of the subsequently generated synthesized voice audio. Meanwhile, because a conditional variational autoencoder can accurately capture and predict the data distribution of voice audio, the method uses a speech synthesis model constructed based on a conditional variational autoencoder and takes the target prosody vector output by the prosody prediction model as one of the inputs of the speech synthesis model, enabling the model to generate higher-quality voice linear spectrum data. By combining prosody prediction with variational inference, the method and the device improve the prosodic expression of the synthesized voice audio, making it smoother and more natural and thereby improving the speech synthesis quality.

Description

Speech synthesis method, device, storage medium and computer equipment
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, storage medium, and computer device.
Background
Speech synthesis is the technology of automatically generating speech audio corresponding to an input text with a computer; it has been widely used in voice assistants, audiobooks, telephone customer service, and other fields. With the rise and development of deep learning, recently proposed speech synthesis schemes increasingly rely on deep neural networks to model the speech generation process and thereby produce smoother, more natural speech audio. Some existing schemes dynamically focus on different parts of the input text through an attention mechanism and a suitable neural network architecture to further improve the naturalness of the voice audio.
However, the inventors found that although the prior art improves the naturalness of the voice audio through the attention mechanism, it still struggles to capture the complex intonation and prosody of real human speech, resulting in poor prosodic performance of the synthesized voice audio, particularly in long-text speech synthesis scenarios. The prior art therefore suffers from low speech synthesis quality.
Disclosure of Invention
The object of the present application is to solve at least one of the above technical drawbacks, in particular the low speech synthesis quality of the prior art.
In a first aspect, an embodiment of the present application provides a method for synthesizing speech, where the method includes:
obtaining a text to be synthesized;
performing text preprocessing on the text to be synthesized, and obtaining a preprocessed text sequence;
inputting the preprocessed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model; the prosody prediction model is used for predicting prosody according to text semantics;
generating a model input vector according to the target prosody vector;
inputting the model input vector into a speech synthesis model constructed based on a conditional variational autoencoder to obtain speech linear spectrum data output by the speech synthesis model;
and generating synthesized voice audio corresponding to the text to be synthesized according to the voice linear spectrum data.
In one embodiment, the inputting the pre-processed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model includes:
inputting the preprocessed text sequence into a BERT model of the prosody prediction model to obtain N M x 1-dimensional prosody embedding vectors output by the BERT model; wherein N and M are positive integers;
combining the N prosody embedding vectors into a first intermediate vector of M x N dimensions;
determining a noise value, and inputting the noise value and the first intermediate vector into a diffusion model of the prosody prediction model to obtain a de-noised prosody vector output by the diffusion model;
and generating the target prosody vector according to the de-noised prosody vector.
In one embodiment, the de-noised prosody vector is a vector of dimension K×N, K being a positive integer;
the generating the target prosody vector from the de-noised prosody vector includes:
splitting the de-noised prosody vector into N K x 1-dimensional second intermediate vectors;
and adding the N second intermediate vectors to obtain the target prosody vector.
In one embodiment, the training process of the speech synthesis model is:
in the current training round, respectively generating training input vectors corresponding to training texts in each group of training samples according to the prosody prediction model; wherein each group of training samples comprises training texts and pre-collected training voice audios corresponding to the training texts;
respectively inputting each training input vector into a speech synthesis model constructed based on a conditional variational autoencoder, and obtaining training linear spectrum data corresponding to each training input vector;
calculating the KL divergence value corresponding to the current training round according to each training linear spectrum data;
according to each training linear spectrum data and each training voice audio, calculating a reconstruction loss value corresponding to the current training round;
updating the parameter weight of the speech synthesis model based on the KL divergence value corresponding to the current training round and the reconstruction loss value corresponding to the current training round;
and under the condition that the training ending condition is not met, entering the next training round.
In one embodiment, the calculating a reconstruction loss value corresponding to the current training round according to each training linear spectrum data and each training voice audio includes:
respectively calculating target Mel frequency spectrum data corresponding to each training voice audio;
respectively acquiring training Mel frequency spectrum data corresponding to each training linear frequency spectrum data;
and calculating a reconstruction loss value corresponding to the current training round based on each target Mel frequency spectrum data and each training Mel frequency spectrum data.
In one embodiment, the text preprocessing is performed on the text to be synthesized, and a preprocessed text sequence is obtained, including:
performing text cleaning and normalization processing on the text to be synthesized to obtain normalized text;
performing word segmentation and sentence segmentation on the normalized text to obtain an original text sequence;
and carrying out text regularization processing on the original text sequence to obtain the preprocessed text sequence.
In one embodiment, the generating the synthesized voice audio corresponding to the text to be synthesized according to the voice linear spectrum data includes:
the voice linear spectrum data is input into a HiFi-GAN model based on a generative adversarial network to obtain the synthesized voice audio corresponding to the text to be synthesized, which is output by the HiFi-GAN model.
In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, where the apparatus includes:
the text acquisition module is used for acquiring a text to be synthesized;
the text preprocessing module is used for preprocessing the text to be synthesized and obtaining a preprocessed text sequence;
the prosody prediction module is used for inputting the preprocessed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model; the prosody prediction model is used for predicting prosody according to text semantics;
the input vector generation module is used for generating a model input vector according to the target prosody vector;
the voice linear spectrum data acquisition module is used for inputting the model input vector into a speech synthesis model constructed based on a conditional variational autoencoder so as to obtain voice linear spectrum data output by the speech synthesis model;
and the audio generation module is used for generating synthesized voice audio corresponding to the text to be synthesized according to the voice linear spectrum data.
In a third aspect, embodiments of the present application provide a storage medium having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech synthesis method according to any of the embodiments described above.
In a fourth aspect, embodiments of the present application provide a computer device, comprising: one or more processors, and memory;
the memory has stored therein computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the speech synthesis method of any of the embodiments described above.
In the speech synthesis method, apparatus, storage medium and computer device provided by some embodiments of the application, a prosody prediction model is introduced so that the intonation and prosody distribution of the synthesized voice audio can be predicted from the text semantics of the text to be synthesized, improving the naturalness and fluency of the subsequently generated synthesized voice audio. Meanwhile, because a conditional variational autoencoder can accurately capture and predict the data distribution of voice audio, the method uses a speech synthesis model constructed based on a conditional variational autoencoder and takes the target prosody vector output by the prosody prediction model as one of the inputs of the speech synthesis model, enabling the model to generate higher-quality voice linear spectrum data. By combining prosody prediction with variational inference, the prosodic expression of the synthesized voice audio is improved, making it smoother and more natural and thereby improving the speech synthesis quality.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method of speech synthesis in one embodiment;
FIG. 2 is a schematic diagram of a cascade of neural network models in one embodiment;
FIG. 3 is a schematic diagram of a speech synthesis apparatus according to an embodiment;
FIG. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without inventive effort fall within the scope of protection of the present application.
In one embodiment, the present application provides a method of speech synthesis. The following examples illustrate the application of the method to a computer device. It is understood that the computer device herein refers to a device having a data acquisition function and a data processing function, and may be, but not limited to, various servers, personal computers, notebook computers, and the like.
As shown in fig. 1, the speech synthesis method provided in the present application may include the following steps:
s102: and obtaining the text to be synthesized.
The text to be synthesized refers to text content related to the synthesized voice audio to be generated. It is understood that the text to be synthesized may be obtained in any manner, and this is not a specific limitation herein. For example, a computer device may receive text to be synthesized entered by a user. For another example, the computer device may crawl text content in the currently displayed page and take the crawled text content as text to be synthesized.
S104: and carrying out text preprocessing on the text to be synthesized, and obtaining a preprocessed text sequence.
In this step, considering that the text format, the text expression conventions (for example, case and abbreviations), and the formats of specific fields such as numbers, dates and times may cause synthesis errors during speech synthesis and thus degrade the speech synthesis quality, once the text to be synthesized is obtained the computer device may perform text preprocessing on it, converting it into a preprocessed text sequence whose content is clear and fluent and which meets a preset speech synthesis standard.
It can be appreciated that the specific manner of text preprocessing can be determined according to practical factors such as the specific application field of the speech synthesis method (e.g., the fields of a speech assistant, a sound reading material, a telephone customer service, etc.), the manner of obtaining the text to be synthesized, and the like. In one embodiment, S104 may include the following sub-steps:
step A1: performing text cleaning and normalization processing on the text to be synthesized to obtain normalized text;
step A3: performing word segmentation and sentence segmentation on the normalized text to obtain an original text sequence;
step A5: and carrying out text regularization processing on the original text sequence to obtain a preprocessed text sequence.
Specifically, in view of the fact that the text content of the text to be synthesized may include irregular content such as punctuation marks, special characters and redundant spaces, and in order to ensure the consistency and fluency of the text and improve the stability of subsequent speech synthesis, the computer device may perform text cleaning and normalization processing on the text to be synthesized to remove unnecessary spaces, punctuation marks and special characters and to normalize case and abbreviations.
For the normalized text obtained after the text cleaning and normalization processing, the computer equipment can perform word segmentation and sentence breaking processing on the normalized text so as to split the long text without punctuation marks into a plurality of sub-texts with proper lengths and obtain an original text sequence consisting of the sub-texts. For example, in a word segmentation and sentence breaking process, the computer device may first split the normalized text into a plurality of words and break sentences according to each word to obtain a plurality of sub-texts, such as sentences or phrases.
In one example, prior to the word segmentation and sentence segmentation process, the computer device may perform text anomaly detection and correction on the normalized text, such as processing unusual words, detecting and correcting misspellings or ambiguous text, and so forth. After correction, the computer device may perform word segmentation and sentence segmentation processing on the corrected normalized text, thereby obtaining an original text sequence.
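To make the word segmentation and sentence-breaking step concrete, the following Python sketch segments a normalized Chinese text and groups the words into sub-texts of bounded length. The use of jieba as the segmenter and the length-8 grouping rule are illustrative assumptions, not requirements of the application.

```python
# A minimal word segmentation and sentence-breaking sketch. jieba is one
# common Chinese segmenter; choosing it, and the length-8 grouping rule,
# are illustrative assumptions rather than the application's requirements.
import jieba

normalized = "今天天气很好我们出去走走吧然后回家吃饭"
words = jieba.lcut(normalized)          # split the normalized text into words

# Break the word stream into sub-texts of roughly bounded length.
sub_texts, buf = [], ""
for w in words:
    buf += w
    if len(buf) >= 8:                   # simple illustrative sentence-break rule
        sub_texts.append(buf)
        buf = ""
if buf:
    sub_texts.append(buf)
print(sub_texts)                        # a list of short sub-texts
```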
In the case of an original text sequence, the computer device may perform text regularization on the original text sequence to convert numbers, dates, times, specific symbols (e.g., %), abbreviations and similar content in the original text sequence into linguistic words, and obtain the preprocessed text sequence.
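The sketch below illustrates the flavor of such regularization with two toy rules (digits and the "%" sign); a production front end would use a far richer rule set, and these rules are only assumptions for illustration.

```python
# A minimal text regularization sketch: expand "%" and digits into words.
# Digit strings are read out digit-by-digit, a deliberate simplification.
import re

_DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
           "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def regularize(text: str) -> str:
    text = text.replace("%", " percent")
    text = re.sub(r"\d", lambda m: " " + _DIGITS[m.group(0)] + " ", text)
    return re.sub(r"\s+", " ", text).strip()   # collapse extra spaces

print(regularize("Call 911, 50% off"))
# -> "Call nine one one , five zero percent off"
```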
Further, in one example, S104 may further include the sub-step of generating speech synthesis markup information from the preprocessed text sequence. This markup information may include phoneme information, accent identifiers, pitch information, and the like, to guide the speech synthesis model toward the correct pronunciation and intonation.
S106: the pre-processed text sequence is input into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model.
The prosody prediction model is used for predicting prosody according to text semantics. In other words, the prosody prediction model can analyze, from the grammatical and semantic information of the preprocessed text sequence, the prosodic information that the synthesized voice audio should conform to, and output a target prosody vector representing that prosodic information. The prosodic information may include speech emotion, accent position, pitch variation, and the like.
S108: and generating a model input vector according to the target prosody vector.
The model input vector is the vector to be fed into the back-end speech synthesis model; it may include the predicted prosodic information of the synthesized voice audio, so that the speech synthesis model performs synthesis in combination with the predicted prosody. For example, the computer device may use the target prosody vector directly as the model input vector, or generate the model input vector from the target prosody vector together with the speech synthesis markup information.
S110: the model input vector is input into a speech synthesis model constructed from an encoder based on a conditional variation to obtain speech linear spectrum data output by the speech synthesis model.
The voice linear spectrum data is the linear spectrum data of the synthesized voice audio that needs to be generated.
In this step, the speech synthesis model may be a conditional variational autoencoder trained against the evidence lower bound (ELBO). Because a conditional variational autoencoder can accurately capture and predict the data distribution of voice audio, the method uses a speech synthesis model constructed based on a conditional variational autoencoder and takes the target prosody vector output by the prosody prediction model as one of the inputs of the speech synthesis model, so that the model can generate higher-quality voice linear spectrum data based on the conditional variational autoencoder and the target prosody vector.
S112: synthetic speech audio corresponding to text to be synthesized is generated from the speech linear spectral data.
For example, the computer device may perform an inverse Fourier transform on the voice linear spectrum data to generate the synthesized voice audio. However, given the limited naturalness and fluency of signal-processing-based audio generation such as the inverse Fourier transform, and in order to make the synthesized voice audio closer to real human speech and further improve the speech synthesis quality, the present application may employ a HiFi-GAN model based on a generative adversarial network to generate the synthesized voice audio. During generation, the computer device may input the voice linear spectrum data into the HiFi-GAN model to achieve high-quality generation from speech features to speech signal through the neural network model, and obtain the synthesized voice audio output by the HiFi-GAN model.
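The following sketch contrasts the two routes on a toy linear spectrum. Griffin-Lim stands in for the signal-processing baseline mentioned above; the commented-out `hifigan.infer` call is a hypothetical interface, since concrete HiFi-GAN implementations expose their own loading and inference APIs.

```python
# Baseline phase reconstruction vs. a neural vocoder (sketch).
import numpy as np
import librosa
import soundfile as sf

# A toy linear magnitude spectrum from a 1-second 440 Hz tone.
linear_spec = np.abs(librosa.stft(librosa.tone(440, duration=1.0), n_fft=1024))

# Signal-processing route: Griffin-Lim phase reconstruction (limited naturalness).
audio_gl = librosa.griffinlim(linear_spec, n_iter=32, hop_length=256)
sf.write("synth_griffinlim.wav", audio_gl, 22050)

# Neural route (illustrative): hifigan.infer(...) is a placeholder name.
# audio_nn = hifigan.infer(linear_spec)
```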
In order to facilitate an understanding of the aspects of the present application, a specific example is described below. It should be understood that the descriptions of the various features mentioned in this example are not limiting of the application, but are merely for convenience of the skilled artisan in understanding the present solution.
Referring to FIG. 2, the computer device may input the text to be synthesized into a front-end text processing model to perform text preprocessing and obtain the preprocessed text sequence. The computer device may input the preprocessed text sequence into the prosody prediction model for prosody prediction, generate a model input vector from the target prosody vector output by the prosody prediction model, and input the model input vector into the speech synthesis model constructed based on the conditional variational autoencoder to generate voice linear spectrum data. The computer device may then input the voice linear spectrum data into a neural-network-based vocoder so that it is converted into the synthesized voice audio. In this example, the specific processes of text preprocessing, prosody prediction, speech synthesis, and synthesized-audio generation are as described in the other embodiments herein.
This example combines variational inference, text preprocessing, semantic prosody prediction, and a neural-network-based vocoder to synthesize higher-quality, more natural speech audio, thereby improving the performance and quality of speech synthesis.
By introducing a prosody prediction model, the present application can predict the intonation and prosody distribution of the synthesized voice audio from the text semantics of the text to be synthesized, improving the naturalness and fluency of the subsequently generated audio. Meanwhile, because a conditional variational autoencoder can accurately capture and predict the data distribution of voice audio, the application uses a speech synthesis model constructed based on a conditional variational autoencoder and takes the target prosody vector output by the prosody prediction model as one of its inputs, enabling the model to generate higher-quality voice linear spectrum data. By combining prosody prediction with variational inference, the prosodic expression of the synthesized voice audio is improved, making it smoother and more natural and thereby improving the speech synthesis quality.
In one embodiment, inputting the pre-processed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model, comprises:
inputting the preprocessed text sequence into a BERT model of the prosody prediction model to obtain N M x 1-dimensional prosody embedding vectors output by the BERT model; wherein N and M are positive integers;
combining the N prosody embedding vectors into a first intermediate vector of M x N dimensions;
determining a noise value, and inputting the noise value and the first intermediate vector into a diffusion model of the prosody prediction model to obtain a denoised prosody vector output by the diffusion model;
and generating a target prosody vector according to the denoised prosody vector.
In this embodiment, the target prosody vector may be generated using a BERT (Bidirectional Encoder Representations from Transformers) model. Specifically, when the preprocessed text sequence is (w_1, w_2, …, w_N), where w_1, w_2, …, w_N are the N sub-texts, the prosody embedding vector of each sub-text can be obtained after the affine transformation of the BERT model:
e_1, e_2, …, e_N = BERT(w_1, w_2, …, w_N)
where e_1 is the prosody embedding vector corresponding to w_1, e_2 is the prosody embedding vector corresponding to w_2, and so on up to e_N corresponding to w_N. Each prosody embedding vector is an M×1-dimensional vector.
The computer device can combine the N prosody embedding vectors output by the BERT model into a first intermediate vector z_cond of dimension M×N, z_cond = [e_1, e_2, …, e_N], and input z_cond together with a noise value into the diffusion model for denoising, so as to obtain the denoised prosody vector z_0. The noise value may be sampled from a standard Gaussian distribution.
Since diffusion models bring better diversity and richness as generative models, the z_0 output by the diffusion model carries more emotional diversity and richer prosodic representation than z_cond, which further improves the prosodic representation capability of the target prosody vector and thus the quality of the synthesized voice audio. In one example, the network structure of the diffusion model may be DC-UNet (Dual Channel U-Net), and the diffusion model may generate the denoised prosody vector over 1000 iterations.
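A minimal sketch of this prediction pipeline is given below: BERT produces one embedding per sub-text, the embeddings are stacked into z_cond, and a DDPM-style reverse loop denoises a Gaussian sample conditioned on z_cond. The mean-pooling, the untrained linear denoiser (a stand-in for the trained DC-UNet), and the dimensions K and T are illustrative assumptions.

```python
# Sketch: BERT prosody embeddings + conditional DDPM denoising.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sub_texts = ["今天天气很好", "我们出去走走"]        # N = 2 sub-texts from preprocessing
with torch.no_grad():
    embs = []
    for w in sub_texts:
        h = bert(**tokenizer(w, return_tensors="pt")).last_hidden_state  # (1, L, M)
        embs.append(h.mean(dim=1).squeeze(0))        # mean-pooled to (M,), M = 768
z_cond = torch.stack(embs, dim=1)                    # first intermediate vector, (M, N)

# DDPM-style reverse (denoising) process conditioned on z_cond.
K, N, T = 64, z_cond.shape[1], 1000                  # K is illustrative
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Untrained stand-in for the trained DC-UNet noise predictor.
denoiser = torch.nn.Linear(K * N + z_cond.numel() + 1, K * N)

with torch.no_grad():
    x = torch.randn(K, N)                            # noise from a standard Gaussian
    for t in reversed(range(T)):
        inp = torch.cat([x.flatten(), z_cond.flatten(), torch.tensor([t / T])])
        eps = denoiser(inp).reshape(K, N)            # predicted noise at step t
        x = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
z_0 = x                                              # denoised prosody vector, (K, N)
```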
In one embodiment, the denoised prosody vector is a vector of dimension K×N, K being a positive integer.
Generating the target prosody vector from the denoised prosody vector includes: splitting the denoised prosody vector into N K×1-dimensional second intermediate vectors; and adding the N second intermediate vectors to obtain the target prosody vector.
For example, if the prosody embedding vectors output by the BERT model are e_1, e_2, …, e_N, then z_0 = [ẑ_1, ẑ_2, …, ẑ_N], where ẑ_1 is the second intermediate vector output by the diffusion model corresponding to e_1, ẑ_2 is the second intermediate vector corresponding to e_2, and so on up to ẑ_N corresponding to e_N.
The computer device can split the denoised prosody vector z_0 into the second intermediate vectors ẑ_1, ẑ_2, …, ẑ_N and add these second intermediate vectors together to obtain the target prosody vector. This further improves the prosody representation capability of the target prosody vector while generating it in a simple manner, reducing the consumption of computing resources.
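Continuing the sketch above, splitting z_0 into its N columns and summing them reduces to a single sum over the column dimension:

```python
# Split the (K, N) denoised prosody vector into N second intermediate
# vectors of shape (K,) and add them to obtain the target prosody vector.
z_hats = torch.unbind(z_0, dim=1)                       # N vectors, each (K,)
target_prosody = torch.stack(z_hats).sum(dim=0)         # (K,)
assert torch.allclose(target_prosody, z_0.sum(dim=1))   # equivalent one-liner
```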
In one embodiment, the training process of the speech synthesis model is:
in the current training round, respectively generating training input vectors corresponding to training texts in each group of training samples according to the prosody prediction model; wherein each group of training samples comprises training texts and pre-collected training voice audios corresponding to the training texts;
respectively inputting each training input vector into a speech synthesis model constructed based on a conditional variational autoencoder, and obtaining training linear spectrum data corresponding to each training input vector;
calculating the KL divergence value corresponding to the current training round according to the training linear spectrum data;
according to the training linear spectrum data and the training voice audios, calculating a reconstruction loss value corresponding to the current training round;
updating the parameter weight of the speech synthesis model based on the KL divergence value corresponding to the current training round and the reconstruction loss value corresponding to the current training round;
and under the condition that the training ending condition is not met, entering the next training round.
In this embodiment, the computer device may train the speech synthesis model by variational inference, which enhances the stability of the speech synthesis model and helps it capture the distribution of the speech data more accurately, so that the model can generate more natural, higher-quality speech features.
In particular, the computer device may obtain sets of training samples for training the speech synthesis model, each set of training samples including training text and pre-acquired training speech audio, the training text in the same set of training samples corresponding to the training speech audio.
During training, in each training round, the computer device may generate a training input vector corresponding to each training text. The specific generation manner of the training input vector may refer to the generation manner of the model input vector described in other embodiments herein, and will not be described herein. After obtaining each training input vector, the computer device may input each training input vector into the speech synthesis model, respectively, to obtain training linear spectrum data that the speech synthesis model outputs for each training input vector, respectively. The computer device may calculate a KL divergence value corresponding to the current training round based on each training linear spectrum data output by the speech synthesis model in the current training round. And generating a reconstruction loss value corresponding to the current training round according to the training voice audio and the training linear spectrum data output by the voice synthesis model in the current training round. The computer device may perform parameter adjustment based on the KL divergence value corresponding to the current training round and the reconstruction loss value corresponding to the current training round, and determine whether a training end condition is satisfied. If not, the next training round is entered, and the process is repeated. And if the training ending condition is met, obtaining a trained voice synthesis model.
In one embodiment, when calculating the KL divergence value, the computer device may calculate the linear spectrum data corresponding to each training voice audio and compute a posterior distribution value from it. The computer device may also compute a prior distribution value from each set of training texts, and take the difference between the posterior distribution value and the prior distribution value as the KL divergence value.
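As a sketch of how such a KL term is typically computed for a conditional VAE with Gaussian posterior and prior, assuming the model's encoders emit means and log-variances (the names and the dimension 192 are illustrative, not the application's specification):

```python
# KL divergence between diagonal Gaussian posterior q(z|x,c) and prior p(z|c).
import torch
from torch.distributions import Normal, kl_divergence

mu_q, logvar_q = torch.zeros(192), torch.zeros(192)   # from the posterior encoder
mu_p, logvar_p = torch.zeros(192), torch.zeros(192)   # from the prior encoder

q = Normal(mu_q, (0.5 * logvar_q).exp())
p = Normal(mu_p, (0.5 * logvar_p).exp())
kl_value = kl_divergence(q, p).sum()                  # KL term for this sample
```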
In one embodiment, calculating a reconstruction loss value corresponding to the current training round according to each training linear spectrum data and each training voice audio comprises:
respectively calculating target Mel frequency spectrum data corresponding to each training voice audio;
respectively acquiring training mel frequency spectrum data corresponding to each training linear frequency spectrum data;
and calculating a reconstruction loss value corresponding to the current training round based on each target mel frequency spectrum data and each training mel frequency spectrum data.
In this embodiment, the reconstruction loss value may be calculated according to the mel spectrum, so that the speech synthesis model may capture the distribution of the speech data more accurately, and generate more natural and higher quality speech features. Specifically, the computer device may generate training mel spectrum data corresponding to each training linear spectrum data according to each training linear spectrum data output by the speech synthesis model. For example, for each training linear spectral data, the computer device may convert the training linear spectral data into speech audio and then calculate training mel spectral data based on the converted speech audio.
The computer device can also extract from each training voice audio the target mel frequency spectrum data corresponding to that audio, and calculate the reconstruction loss value corresponding to the current training round using the target mel frequency spectrum data and each training mel frequency spectrum data. For example, the computer device may calculate the reconstruction loss value L_Reconstruction based on the following formula:
L_Reconstruction = ‖x_mel − x̂_mel‖_1
where x_mel is the target mel frequency spectrum data and x̂_mel is the training mel frequency spectrum data. When the current training round generates a plurality of training linear spectrum data, the computer device calculates a reconstruction loss value for each training linear spectrum data according to the above formula and weights the individual reconstruction loss values to obtain the reconstruction loss value corresponding to the current training round.
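The sketch below computes such a mel-based L1 reconstruction loss with librosa. Turning the training linear spectrum into audio via Griffin-Lim before extracting its mel spectrum is one possible realization, an assumption rather than the application's mandated route; the STFT/mel parameters are likewise illustrative.

```python
# Sketch: L1 reconstruction loss between target and training mel spectra.
import numpy as np
import librosa

def mel_from_audio(y: np.ndarray, sr: int = 22050) -> np.ndarray:
    return librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                          hop_length=256, n_mels=80)

def reconstruction_loss(train_linear_spec: np.ndarray,
                        target_audio: np.ndarray, sr: int = 22050) -> float:
    # Convert the training linear spectrum to audio, then to a mel spectrum.
    y_hat = librosa.griffinlim(train_linear_spec, hop_length=256)
    mel_hat = mel_from_audio(y_hat, sr)            # training mel spectrum data
    mel_tgt = mel_from_audio(target_audio, sr)     # target mel spectrum data
    T = min(mel_hat.shape[1], mel_tgt.shape[1])    # align frame counts
    return float(np.abs(mel_tgt[:, :T] - mel_hat[:, :T]).mean())  # L1 loss
```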
The following describes a speech synthesis apparatus provided in the embodiments of the present application, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to correspondingly to each other.
In one embodiment, the present application provides a speech synthesis apparatus 200. As shown in fig. 3, the speech synthesis apparatus 200 may include a text acquisition module 210, a text preprocessing module 220, a prosody prediction module 230, an input vector generation module 240, a speech linear spectrum data acquisition module 250, and an audio generation module 260. Wherein:
a text obtaining module 210, configured to obtain a text to be synthesized;
a text preprocessing module 220, configured to perform text preprocessing on the text to be synthesized, and obtain a preprocessed text sequence;
a prosody prediction module 230 for inputting the preprocessed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model; the prosody prediction model is used for predicting prosody according to text semantics;
an input vector generation module 240 for generating a model input vector from the target prosody vector;
a voice linear spectrum data obtaining module 250, configured to input the model input vector into a speech synthesis model constructed based on a conditional variational autoencoder, so as to obtain voice linear spectrum data output by the speech synthesis model;
an audio generating module 260, configured to generate synthesized voice audio corresponding to the text to be synthesized according to the voice linear spectrum data.
In one embodiment, the prosody prediction module 230 includes a prosody embedding vector obtaining unit, a vector combining unit, a denoising unit, and a target prosody vector generating unit. The prosody embedding vector obtaining unit is used for inputting the preprocessed text sequence into a BERT model of the prosody prediction model to obtain N m×1-dimensional prosody embedding vectors output by the BERT model; wherein N and M are positive integers. The vector combining unit is used for combining the N prosody embedding vectors into a first intermediate vector of M multiplied by N dimensions. The denoising unit is used for determining a noise value and inputting the noise value and the first intermediate vector into a diffusion model of the prosody prediction model to obtain a denoising prosody vector output by the diffusion model. The target prosody vector generating unit is configured to generate the target prosody vector from the denoising prosody vector.
In one embodiment, the de-noised prosody vector is a vector of dimension k×n, K being a positive integer.
The target prosody vector generating unit of the present application includes a vector splitting unit and a vector adding unit. The vector splitting unit is used for splitting the denoising prosody vector into N second intermediate vectors with K multiplied by 1 dimensions. The vector adding unit is used for adding the N second intermediate vectors to obtain the target prosody vector.
In one embodiment, the speech synthesis apparatus 200 of the present application further includes a speech synthesis model training module including a training input vector generation unit, a training linear spectrum data acquisition unit, a KL divergence calculation unit, a reconstruction loss calculation unit, and a parameter adjustment unit. The training input vector generation unit is used for respectively generating training input vectors corresponding to training texts in each group of training samples according to the prosody prediction model in the current training round; wherein each set of training samples comprises training text and pre-collected training voice audio corresponding to the training text. The training linear spectrum data acquisition unit is used for respectively inputting each training input vector into a speech synthesis model constructed based on a conditional variational autoencoder, and obtaining training linear spectrum data corresponding to each training input vector. The KL divergence calculation unit is used for calculating the KL divergence value corresponding to the current training round according to each training linear spectrum data. The reconstruction loss calculation unit is used for calculating a reconstruction loss value corresponding to the current training round according to each training linear spectrum data and each training voice audio. The parameter adjustment unit is used for updating the parameter weights of the speech synthesis model based on the KL divergence value corresponding to the current training round and the reconstruction loss value corresponding to the current training round, and for entering the next training round when the training ending condition is not met.
In one embodiment, the reconstruction loss calculation unit of the present application includes a first mel-spectrum acquisition unit, a second mel-spectrum acquisition unit, and a loss value calculation unit. The first mel frequency spectrum acquisition unit is used for respectively calculating target mel frequency spectrum data corresponding to each training voice audio. The second mel frequency spectrum acquisition unit is used for respectively acquiring training mel frequency spectrum data corresponding to each piece of training linear frequency spectrum data. The loss value calculation unit is used for calculating a reconstruction loss value corresponding to the current training round based on each target mel frequency spectrum data and each training mel frequency spectrum data.
In one embodiment, the text preprocessing module 220 of the present application includes a first text processing unit, a second text processing unit, and a third text processing unit. The first text processing unit is used for performing text cleaning and normalization processing on the text to be synthesized to obtain normalized text. And the second text processing unit is used for carrying out word segmentation and sentence segmentation processing on the normalized text so as to obtain an original text sequence. And the third text processing unit is used for carrying out text regularization processing on the original text sequence so as to obtain the preprocessed text sequence.
In one embodiment, the audio generation module 260 of the present application includes an audio generation unit. The audio generation unit is used for inputting the voice linear spectrum data into a HiFi-GAN model based on a generative adversarial network so as to obtain the synthesized voice audio corresponding to the text to be synthesized, which is output by the HiFi-GAN model.
In one embodiment, the present application also provides a storage medium having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech synthesis method as in any embodiment.
In one embodiment, the present application also provides a computer device having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech synthesis method as in any embodiment.
Fig. 4 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application. Referring to FIG. 4, computer device 900 includes a processing component 902 that further includes one or more processors, and memory resources represented by memory 901, for storing instructions, such as applications, executable by processing component 902. The application program stored in the memory 901 may include one or more modules each corresponding to a set of instructions. Further, the processing component 902 is configured to execute instructions to perform the steps of the speech synthesis method of any of the embodiments described above.
The computer device 900 may also include a power component 903 configured to perform power management of the computer device 900, a wired or wireless network interface 904 configured to connect the computer device 900 to a network, and an input/output (I/O) interface 905. The computer device 900 may operate based on an operating system stored in the memory 901, such as Windows Server™, Mac OS X™, Unix, Linux, FreeBSD™, or the like.
It will be appreciated by those skilled in the art that the internal structure of the computer device shown in the present application is merely a block diagram of some of the structures related to the aspects of the present application and does not constitute a limitation of the computer device to which the aspects of the present application apply, and that a particular computer device may include more or less components than those shown in the figures, or may combine some of the components, or have a different arrangement of the components.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Herein, "a," "an," and "the" may also include plural forms unless the context clearly indicates otherwise. "Plural" means at least two, for example 2, 3, 5 or 8. "And/or" includes any and all combinations of the associated listed items.
In the present specification, each embodiment is described in a progressive manner, and each embodiment focuses on the difference from other embodiments, and may be combined according to needs, and the same similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech synthesis, the method comprising:
obtaining a text to be synthesized;
performing text preprocessing on the text to be synthesized, and obtaining a preprocessed text sequence;
inputting the preprocessed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model; the prosody prediction model is used for predicting prosody according to text semantics;
generating a model input vector according to the target prosody vector;
inputting the model input vector into a speech synthesis model constructed based on a conditional variational autoencoder to obtain speech linear spectrum data output by the speech synthesis model;
and generating synthesized voice audio corresponding to the text to be synthesized according to the voice linear spectrum data.
2. The method of claim 1, wherein the inputting the pre-processed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model comprises:
inputting the preprocessed text sequence into a BERT model of the prosody prediction model to obtain N M x 1-dimensional prosody embedding vectors output by the BERT model; wherein N and M are positive integers;
combining the N prosody embedding vectors into a first intermediate vector of M x N dimensions;
determining a noise value, and inputting the noise value and the first intermediate vector into a diffusion model of the prosody prediction model to obtain a de-noised prosody vector output by the diffusion model;
and generating the target prosody vector according to the de-noised prosody vector.
3. The method according to claim 2, wherein the de-noised prosody vector is a vector of dimension K x N, K being a positive integer;
the generating the target prosody vector from the de-noised prosody vector includes:
splitting the de-noised prosody vector into N K x 1-dimensional second intermediate vectors;
and adding the N second intermediate vectors to obtain the target prosody vector.
4. The method of claim 1, wherein the training process of the speech synthesis model is:
in the current training round, respectively generating training input vectors corresponding to training texts in each group of training samples according to the prosody prediction model; wherein each group of training samples comprises training texts and pre-collected training voice audios corresponding to the training texts;
respectively inputting each training input vector into a speech synthesis model constructed based on a conditional variational autoencoder, and obtaining training linear spectrum data corresponding to each training input vector;
calculating the KL divergence value corresponding to the current training round according to each training linear spectrum data;
according to each training linear spectrum data and each training voice audio, calculating a reconstruction loss value corresponding to the current training round;
updating the parameter weight of the speech synthesis model based on the KL divergence value corresponding to the current training round and the reconstruction loss value corresponding to the current training round;
and under the condition that the training ending condition is not met, entering the next training round.
5. The method of claim 4, wherein said calculating a reconstruction loss value corresponding to a current training round based on each of said training linear spectrum data and each of said training voice audios comprises:
respectively calculating target Mel frequency spectrum data corresponding to each training voice audio;
respectively acquiring training Mel frequency spectrum data corresponding to each training linear frequency spectrum data;
and calculating a reconstruction loss value corresponding to the current training round based on each target Mel frequency spectrum data and each training Mel frequency spectrum data.
6. The method according to claim 1, wherein the text pre-processing the text to be synthesized and obtaining a pre-processed text sequence comprises:
performing text cleaning and normalization processing on the text to be synthesized to obtain normalized text;
performing word segmentation and sentence segmentation on the normalized text to obtain an original text sequence;
and carrying out text regularization processing on the original text sequence to obtain the preprocessed text sequence.
7. The method according to any one of claims 1 to 6, wherein said generating synthetic speech audio corresponding to the text to be synthesized from the speech linear spectral data comprises:
the voice linear spectrum data is input into a HiFi-GAN model based on a generative adversarial network to obtain the synthesized voice audio corresponding to the text to be synthesized, which is output by the HiFi-GAN model.
8. A speech synthesis apparatus, the apparatus comprising:
the text acquisition module is used for acquiring a text to be synthesized;
the text preprocessing module is used for preprocessing the text to be synthesized and obtaining a preprocessed text sequence;
the prosody prediction module is used for inputting the preprocessed text sequence into a prosody prediction model to obtain a target prosody vector output by the prosody prediction model; the prosody prediction model is used for predicting prosody according to text semantics;
the input vector generation module is used for generating a model input vector according to the target prosody vector;
the voice linear spectrum data acquisition module is used for inputting the model input vector into a speech synthesis model constructed based on a conditional variational autoencoder so as to obtain voice linear spectrum data output by the speech synthesis model;
and the audio generation module is used for generating synthesized voice audio corresponding to the text to be synthesized according to the voice linear spectrum data.
9. A storage medium having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech synthesis method of any of claims 1 to 7.
10. A computer device, comprising: one or more processors, and memory;
stored in the memory are computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the speech synthesis method of any one of claims 1 to 7.
CN202311863296.5A 2023-12-29 2023-12-29 Speech synthesis method, device, storage medium and computer equipment Pending CN117809622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311863296.5A CN117809622A (en) 2023-12-29 2023-12-29 Speech synthesis method, device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311863296.5A CN117809622A (en) 2023-12-29 2023-12-29 Speech synthesis method, device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN117809622A true CN117809622A (en) 2024-04-02

Family

ID=90431728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311863296.5A Pending CN117809622A (en) 2023-12-29 2023-12-29 Speech synthesis method, device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN117809622A (en)

Similar Documents

Publication Publication Date Title
Tan et al. A survey on neural speech synthesis
Liu et al. Diffsinger: Singing voice synthesis via shallow diffusion mechanism
Sisman et al. An overview of voice conversion and its challenges: From statistical modeling to deep learning
US7996222B2 (en) Prosody conversion
US20190362703A1 (en) Word vectorization model learning device, word vectorization device, speech synthesis device, method thereof, and program
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
Sisman et al. Group sparse representation with wavenet vocoder adaptation for spectrum and prosody conversion
EP1447792B1 (en) Method and apparatus for modeling a speech recognition system and for predicting word error rates from text
US20120143611A1 (en) Trajectory Tiling Approach for Text-to-Speech
Luo et al. Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Wu et al. Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
Adiga et al. Acoustic features modelling for statistical parametric speech synthesis: a review
Liu et al. Maximizing mutual information for tacotron
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
Feng et al. [Retracted] English Audio Language Retrieval Based on Adaptive Speech‐Adjusting Algorithm
Liu et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2
Viacheslav et al. System of methods of automated cognitive linguistic analysis of speech signals with noise
Mei et al. A particular character speech synthesis system based on deep learning
Wen et al. Improving deep neural network based speech synthesis through contextual feature parametrization and multi-task learning
CN117809622A (en) Speech synthesis method, device, storage medium and computer equipment
Zhao et al. Multi-speaker Chinese news broadcasting system based on improved Tacotron2
Hwang et al. PauseSpeech: Natural speech synthesis via pre-trained language model and pause-based prosody modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination