CN113053356A - Voice waveform generation method, device, server and storage medium - Google Patents


Info

Publication number
CN113053356A
Authority
CN
China
Prior art keywords
waveform
natural
voice
condition
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911382443.0A
Other languages
Chinese (zh)
Inventor
伍宏传
江源
胡国平
胡郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911382443.0A
Publication of CN113053356A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018 - Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 - Speech enhancement by changing the amplitude
    • G10L21/0324 - Details of processing therefor
    • G10L21/0332 - Details of processing therefor involving modification of waveforms
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/12 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L25/18 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The embodiment of the application provides a voice waveform generation method, a voice waveform generation device, a server and a storage medium, wherein the method comprises the following steps: acquiring an input text; extracting condition features from the input text; inputting the condition characteristics into a waveform generation model obtained by training, and processing the condition characteristics to obtain a voice waveform; the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning the coding information of the natural voice waveform in the training stage, and the waveform generation network is used for generating the voice waveform according to the condition characteristics and the output result of the prior distribution estimation network. The embodiment of the application can improve the waveform generation efficiency.

Description

Voice waveform generation method, device, server and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a server, and a storage medium for generating a speech waveform.
Background
In recent years, with the rapid development of deep learning, research on speech waveform generation methods has advanced greatly, and various waveform generation models based on deep neural networks (DNNs) have emerged, such as WaveNet based on a convolutional neural network (CNN) and WaveRNN based on a recurrent neural network (RNN). Waveform generation methods such as WaveNet generate the speech waveform point by point in an autoregressive manner, so for a waveform sequence with high time-domain resolution such as speech, the waveform generation efficiency is very low.
Disclosure of Invention
The embodiment of the application provides a voice waveform generation method, a voice waveform generation device, a server and a storage medium, which can improve waveform generation efficiency.
A first aspect of an embodiment of the present application provides a speech waveform generation method, including:
acquiring an input text;
extracting condition features from the input text;
inputting the condition characteristics into a waveform generation model obtained by training, and processing the condition characteristics to obtain a voice waveform; the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning coding information of a natural speech waveform in a training phase, and the waveform generation network is used for generating the speech waveform according to the condition characteristics and an output result of the prior distribution estimation network.
A second aspect of an embodiment of the present application provides a model training method, including:
acquiring a voice training sample, wherein the voice training sample comprises a natural voice waveform and a text corresponding to the natural voice waveform;
extracting natural condition features from the natural voice waveform or a text corresponding to the natural voice waveform;
inputting the natural voice waveform and the natural condition characteristics into the waveform generation model to obtain a training result;
and optimizing the model parameters of the waveform generation model according to the training result.
A third aspect of an embodiment of the present application provides a model training apparatus, including:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a voice training sample, and the voice training sample comprises a natural voice waveform and a text corresponding to the natural voice waveform;
a first extraction unit, configured to extract a natural condition feature from the natural speech waveform or a text corresponding to the natural speech waveform;
the training unit is used for inputting the natural voice waveform and the natural condition characteristics into the waveform generation model to obtain a training result;
and the optimization unit is used for optimizing the model parameters of the waveform generation model according to the training result.
A fourth aspect of the embodiments of the present application provides a speech waveform generation apparatus, including:
an acquisition unit configured to acquire an input text;
an extraction unit configured to extract a condition feature from the input text;
the waveform generating unit is used for inputting the condition characteristics into a waveform generating model obtained by training and processing the condition characteristics to obtain a voice waveform;
the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning coding information of a natural speech waveform in a training phase, and the waveform generation network is used for generating the speech waveform according to the condition characteristics and an output result of the prior distribution estimation network.
A fifth aspect of embodiments of the present application provides a server comprising a processor and a memory, the memory being configured to store a computer program, the computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the steps of the method according to the first aspect of embodiments of the present application.
A sixth aspect of embodiments of the present application provides a server comprising a processor and a memory, the memory being configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the steps of the method according to the second aspect of embodiments of the present application.
A seventh aspect of embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program makes a computer perform part or all of the steps as described in the first aspect of embodiments of the present application.
An eighth aspect of embodiments of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform some or all of the steps as described in the second aspect of embodiments of the present application.
A ninth aspect of an embodiment of the present application provides a computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps as described in the first aspect of an embodiment of the present application. The computer program product may be a software installation package.
A tenth aspect of embodiments of the present application provides a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps as described in the second aspect of embodiments of the present application. The computer program product may be a software installation package.
In the embodiment of the application, firstly, an input text is obtained; extracting condition features from the input text; inputting the condition characteristics into a waveform generation model obtained by training, and processing the condition characteristics to obtain a voice waveform; the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning coding information of a natural voice waveform in a training stage, and the waveform generation network is used for generating the voice waveform according to the condition characteristics and an output result of the prior distribution estimation network. The method comprises the steps of extracting condition characteristics from an input text, inputting the condition characteristics into a prior distribution estimation network to obtain an output result of the prior distribution estimation network, and generating a voice waveform by a waveform generation network according to the condition characteristics and the output result of the prior distribution estimation network. Compared with the mode of point-by-point generation of the voice waveform based on autoregression, the waveform generation model can directly generate the voice waveform according to the condition characteristics, and the waveform generation efficiency of the waveform generation model can be improved. Because the prior distribution estimation network can learn the coding information of the natural voice waveform, the quality of the voice waveform generated by the waveform generation model can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech waveform generation method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a waveform generation model provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a VAE-GAN model provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a multi-scale discriminator provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a speech waveform generating apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following describes embodiments of the present application in detail.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present disclosure. As shown in fig. 1, the system architecture includes a server 100 and at least one electronic device 101 communicatively connected to the server 100. The user holds the electronic device 101. The electronic device 101 may have a client installed thereon, and the server 100 may have a server program installed thereon. The client is a program that corresponds to the server program and provides local services to the user. The server program runs on the server and serves the client; the services include, for example, providing computation or application services to the client, providing resources to the client, and saving client data. The server 100 may establish a communication connection with the electronic device 101 directly through the internet, or through another server over the internet.
The server related to the embodiment of the application may include a cloud server or a cloud virtual machine. The electronic devices involved in the embodiments of the present application may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem with wireless communication functions, as well as various forms of User Equipment (UE), Mobile Stations (MS), terminal equipment (terminal device), and so on.
The client in the embodiment of the application can provide voice synthesis service, voice playing service, content display service and the like for the user. For example, a speech synthesis client may provide speech synthesis services to a user. For example, a user may send a speech synthesis instruction to a speech synthesis client, the speech synthesis client sends a speech synthesis request to a server, the speech synthesis request carries a condition feature (e.g., a text feature or an acoustic parameter), the server may input the condition feature into a trained waveform generation model, a speech waveform is generated by the trained waveform generation model, the server sends the generated speech waveform to the speech synthesis client, and the speech synthesis client may play speech corresponding to the speech waveform. The embodiment of the application can adopt the trained waveform generation model to generate the voice waveform, and can improve the waveform generation efficiency of the waveform generation model. Because the prior distribution estimation network can learn the coding information of the natural voice waveform, the quality of the voice waveform generated by the waveform generation model can be improved.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for generating a speech waveform according to an embodiment of the present application. As shown in fig. 2, the voice waveform generation method is applied to a server, and includes the following steps.
201, the server side obtains an input text.
The speech waveform generation method in the embodiment of the application can be applied to the field of Text To Speech (TTS). For a text to be synthesized, a speech waveform can be synthesized by adopting the speech waveform generation method. The input text of the present application may be a piece of text to be synthesized. The input text may be chinese text or text in other languages.
202, the server side extracts the condition features from the input text.
The condition features may include text features or acoustic parameters. Text features and acoustic features are features that can be directly recognized by the waveform generation model.
Text features may include phoneme sequences, syllables, words, and the like. A phoneme sequence is a sequence consisting of a string of phonemes. For example, if the input text is the Chinese word for Mandarin ("putonghua"), the corresponding text feature may include the eight phonemes "p, u, t, o, ng, h, u, a", and the ordered sequence of these eight phonemes is a phoneme sequence. A sketch of this mapping is given below.
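For illustration only, a lexicon-lookup sketch of this text-to-phoneme mapping is shown here; the lexicon entries, function name, and greedy matching strategy are assumptions and are not part of the application, which relies on a text analysis tool.

```python
# Minimal sketch: map input text to a phoneme sequence via a hypothetical lexicon.
# A real text analysis tool would also handle word segmentation, polyphones,
# prosody labels, and so on.
LEXICON = {
    "putonghua": ["p", "u", "t", "o", "ng", "h", "u", "a"],  # Mandarin ("putonghua")
}

def text_to_phonemes(text: str) -> list:
    """Return the phoneme sequence for a piece of text using greedy lexicon lookup."""
    phonemes = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):        # longest match first
            word = text[i:j]
            if word in LEXICON:
                phonemes.extend(LEXICON[word])
                i = j
                break
        else:
            i += 1                               # skip characters not in the lexicon
    return phonemes

print(text_to_phonemes("putonghua"))  # ['p', 'u', 't', 'o', 'ng', 'h', 'u', 'a']
```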
The acoustic parameters may include parameters such as the fundamental frequency, cepstrum, filter bank (FB) features, Fast Fourier Transform (FFT) spectrum, and frequency spectrum.
Optionally, if the condition feature includes the first text feature, step 202 may include the following steps:
the server side extracts a first text feature from the input text through a text analysis tool.
The text analysis tool may be a set of software programs for analyzing the input text and outputting text features. For example, the server may input the input text to the text analysis tool, and the text analysis tool outputs a phoneme sequence corresponding to the input text.
If the condition characteristics include acoustic parameters, step 202 may include the steps of:
and the server extracts a second text characteristic from the input text through a text analysis tool, inputs the second text characteristic into the acoustic model, and predicts to obtain the acoustic parameter.
The acoustic model may be modeled using a hidden Markov model (HMM) or a neural network (e.g., a CNN, an RNN, or an LSTM).
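As a hedged illustration of such a neural acoustic model (not the specific model of the application), a simple recurrent network mapping phoneme IDs to frame-level acoustic parameters might look like the sketch below; all dimensions are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SimpleAcousticModel(nn.Module):
    """Illustrative RNN acoustic model: phoneme IDs -> frame-level acoustic parameters.
    The 80-dim output and 64-dim embeddings are arbitrary assumptions."""
    def __init__(self, num_phonemes=100, acoustic_dim=80):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, 64)
        self.rnn = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(256, acoustic_dim)

    def forward(self, phoneme_ids):               # (batch, seq_len)
        h, _ = self.rnn(self.embed(phoneme_ids))  # (batch, seq_len, 256)
        return self.proj(h)                       # (batch, seq_len, acoustic_dim)

model = SimpleAcousticModel()
acoustic_params = model(torch.randint(0, 100, (1, 8)))   # e.g. 8 phonemes
print(acoustic_params.shape)                              # torch.Size([1, 8, 80])
```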
203, inputting the condition characteristics into a waveform generation model obtained by training by the server, and processing the condition characteristics to obtain a voice waveform; the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning the coding information of the natural speech waveform in the training stage, and the waveform generation network is used for generating the speech waveform according to the condition characteristics and the output result of the prior distribution estimation network.
In the embodiment of the application, the server inputs the condition characteristics into the waveform generation model to obtain the voice waveform. If the conditional features are text features, the waveform generation model may behave as a speech synthesizer. If the condition features are acoustic parameters, the waveform generation model may behave as a vocoder.
The waveform generation model may include a variational auto-encoder (VAE) and a generative adversarial network (GAN), which may be referred to as a VAE-GAN model for short; that is, the VAE-GAN model may be composed of a VAE network and a GAN. The waveform generation model obtained by training is referred to below as the trained waveform generation model.
Optionally, in step 203, the server inputs the condition features into a waveform generation model obtained by training, and processes the condition features to obtain a speech waveform, which may specifically include the following steps:
(11) the server side obtains an output result of the prior distribution estimation network according to the condition characteristics by using the prior distribution estimation network, and determines hidden variables of the condition characteristics from the output result of the prior distribution estimation network;
(12) and the server side generates a voice waveform by using the waveform generation network according to the condition characteristics and the hidden variables of the condition characteristics.
In the embodiment of the application, the condition features can be extracted from the input text, the condition features are input into the prior distribution estimation network to obtain the output result of the prior distribution estimation network, the hidden variables of the condition features are determined from the output result of the prior distribution estimation network, and the waveform generation network can generate the voice waveforms according to the condition features and the hidden variables of the condition features. Compared with the mode of point-by-point generation of the voice waveform based on autoregression, the waveform generation model can directly generate the voice waveform according to the condition characteristics, and the waveform generation efficiency of the waveform generation model can be improved. Because the prior distribution estimation network can learn the coding information of the natural voice waveform, the quality of the voice waveform generated by the waveform generation model can be improved.
Optionally, in step (11), the server obtains an output result of the prior distribution estimation network according to the condition characteristics by using the prior distribution estimation network, and specifically includes the following steps:
and the server side estimates the prior distribution of the hidden variables of the condition characteristics according to the condition characteristics by using the prior distribution estimation network.
Wherein the prior distribution of the hidden variables may be a gaussian distribution, and the prior distribution may include a mean and a variance of the gaussian distribution.
Optionally, in step (11), the determining, by the server, the hidden variable of the condition characteristic from the output result of the prior distribution estimation network may specifically include the following steps:
and the server samples from the prior distribution of the hidden variables of the condition characteristics to obtain the hidden variables of the condition characteristics.
The server may sample from the prior distribution of the hidden variable of the condition feature by using the inverse transform method (ITM) to obtain the hidden variable of the condition feature. The server may also sample from the prior distribution by using any one of rejection sampling, sampling-importance-resampling (SIR), or Markov Chain Monte Carlo (MCMC) sampling to obtain the hidden variable of the condition feature.
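For intuition, inverse transform sampling from a Gaussian prior can be sketched as follows; this is a generic illustration using SciPy's inverse CDF, not the application's exact sampling routine.

```python
import numpy as np
from scipy.stats import norm

def inverse_transform_sample(mu, sigma, size=1, rng=None):
    """Sample from N(mu, sigma^2) by pushing uniform samples through the inverse CDF."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(0.0, 1.0, size=size)      # u ~ Uniform(0, 1)
    return norm.ppf(u, loc=mu, scale=sigma)   # inverse CDF (percent-point function)

z = inverse_transform_sample(mu=0.0, sigma=1.0, size=5)
print(z)
```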
The prior distribution estimation network may be composed of a deconvolution network, which is used to up-sample the input condition feature to obtain the prior distribution of the hidden variable of the condition feature.
For example, denote the condition feature as y and the hidden variable as z. The trained prior distribution estimation network, denoted T_prior_net, is used to estimate the prior distribution p(z|y) of the hidden variable z. It first up-samples the input condition feature y through the deconvolution network and then outputs the prior distribution of the hidden variable z of the condition feature y; the hidden variable z can then be sampled from the prior distribution p(z|y):
μ_t, σ_t = T_prior_net(y)
p(z|y) = N(μ_t, σ_t)
z = μ_t + σ_t * N(0,1)
where μ_t is the mean and σ_t the variance of the prior distribution of the hidden variable z, the prior distribution is a Gaussian distribution, p(z|y) is the prior distribution of the hidden variable z, and N(0,1) is a Gaussian distribution with mean 0 and variance 1.
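A minimal PyTorch sketch of such a prior distribution estimation network is given below, assuming transposed convolutions for up-sampling and a mean/log-variance parameterization with reparameterized sampling; the layer sizes, activations, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """Illustrative prior distribution estimation network (T_prior_net):
    up-sample condition features y and predict mean/variance of the hidden variable z."""
    def __init__(self, cond_dim=80, z_dim=1, strides=(4, 4, 5)):
        super().__init__()
        layers, ch = [], cond_dim
        for s in strides:                                   # total up-sampling = 4*4*5 = 80
            layers += [nn.ConvTranspose1d(ch, 64, kernel_size=s, stride=s), nn.ReLU()]
            ch = 64
        self.upsample = nn.Sequential(*layers)
        self.to_stats = nn.Conv1d(64, 2 * z_dim, kernel_size=1)  # -> [mu_t, log_var_t]

    def forward(self, y):                                   # y: (batch, cond_dim, frames)
        h = self.upsample(y)
        mu_t, log_var = self.to_stats(h).chunk(2, dim=1)
        sigma_t = torch.exp(0.5 * log_var)
        z = mu_t + sigma_t * torch.randn_like(sigma_t)      # z = mu_t + sigma_t * N(0, 1)
        return z, mu_t, sigma_t

y = torch.randn(1, 80, 10)                                  # 10 frames of condition features
z, mu_t, sigma_t = PriorNet()(y)
print(z.shape)                                              # (1, 1, 800): 80x up-sampled
```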
Optionally, in step (12), the server generates a speech waveform according to the condition characteristic and the hidden variable of the condition characteristic by using the waveform generation network, which may specifically include the following steps:
(121) the server side inputs the hidden variables of the condition characteristics and the condition characteristics into a waveform generation network to obtain prior generated waveform distribution;
(122) and the server performs probability distribution transformation on the priori generated waveform distribution to obtain a voice waveform.
The probability distribution transformation may include a gaussian transformation, a laplacian transformation, and the like.
Fig. 3 is a schematic structural diagram of a waveform generation model provided in an embodiment of the present application, and as shown in fig. 3, the waveform generation model includes an a priori distribution estimation network and a waveform generation network. The a priori distribution estimation network may be a trained a priori distribution estimation network and the waveform generation network may be a trained waveform generation network.
The waveform generation network may be composed of a convolutional neural network built from multiple dilated convolutional layers (convolutional layers with holes, DCLs). The hidden variable of the condition feature and the condition feature are input together into the waveform generation network to obtain the prior generated waveform distribution, and the server performs a Gaussian transformation on the prior generated waveform distribution to obtain the speech waveform. For example, the waveform generation network may employ an Inverse Autoregressive Flow (IAF) model to decode the input hidden variable into the speech waveform.
For example, the server may input the hidden variable z and the condition feature y into the waveform generation network, which predicts the mean μ_p and variance σ_p of the Gaussian distribution of the prior generated waveform x_p. The mean μ_p and variance σ_p can be determined according to the following formula:
μ_p, σ_p = IAF_decoder(z, y)
Thus, the waveform distribution of the prior generated waveform x_p is p(x_p|z) = N(μ_p, σ_p). The prior generated waveform x_p can be obtained by a Gaussian transformation: applying the Gaussian transformation to the waveform distribution yields the speech waveform (i.e., the prior generated waveform x_p):
x_p = μ_p + σ_p * z
because the trained prior distribution estimation network can learn the coding information of the natural voice waveform, the output of the trained prior distribution estimation network is used as the input of the trained waveform generation network, and further, a high-quality voice waveform is generated.
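The generation step can be pictured with the following hedged PyTorch sketch: a stand-in convolutional decoder predicts per-sample mean and scale from (z, y), and the speech waveform is obtained by the Gaussian transformation x_p = μ_p + σ_p * z. The architecture and class name below are assumptions for illustration, not the application's exact IAF decoder.

```python
import torch
import torch.nn as nn

class ToyIAFDecoder(nn.Module):
    """Illustrative stand-in for the IAF-based waveform generation network:
    maps (hidden variable z, up-sampled condition y) to per-sample mean and scale."""
    def __init__(self, hidden=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.stack = nn.ModuleList([
            nn.Conv1d(2 if i == 0 else hidden, hidden, kernel_size=3,
                      dilation=d, padding=d)
            for i, d in enumerate(dilations)])
        self.out = nn.Conv1d(hidden, 2, kernel_size=1)   # -> [mu_p, log_sigma_p]

    def forward(self, z, y):                 # z, y: (batch, 1, samples)
        h = torch.cat([z, y], dim=1)
        for conv in self.stack:
            h = torch.relu(conv(h))
        mu_p, log_sigma_p = self.out(h).chunk(2, dim=1)
        return mu_p, torch.exp(log_sigma_p)

decoder = ToyIAFDecoder()
z = torch.randn(1, 1, 800)                   # hidden variable sampled from the prior
y = torch.randn(1, 1, 800)                   # condition features up-sampled to sample rate
mu_p, sigma_p = decoder(z, y)
x_p = mu_p + sigma_p * z                     # Gaussian transformation -> speech waveform
print(x_p.shape)                             # torch.Size([1, 1, 800])
```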
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a model training method according to an embodiment of the present disclosure. As shown in fig. 4, the model training method is applied to a server, and the model training method is used for training a waveform generation model, and includes the following steps.
401, a server obtains a voice training sample, where the voice training sample includes a natural voice waveform and a text corresponding to the natural voice waveform.
The model training method in the embodiment of the present application may be performed before the waveform generation method of fig. 2.
In the embodiment of the application, the server may obtain the speech training samples from a training data set. The training data set may include a large number of speech training samples, and the speech training samples in the training data set are different from each other. For example, the training data set may include 20 hours of natural speech data, and each speech training sample may be 5 seconds, 8 seconds, 10 seconds long, and so on.
The speech training samples in the training data set are supervisory data, and each speech training sample may include a natural speech waveform and text corresponding to the natural speech waveform.
402, the server extracts natural condition features from the natural speech waveform or the text corresponding to the natural speech waveform.
In the embodiment of the present application, the natural condition feature may include a natural text feature or a natural acoustic parameter. Natural text features and natural acoustic features are features that can be directly recognized by the waveform generation model.
The natural text features may include a sequence of phonemes. A phoneme sequence is a sequence consisting of a string of phonemes. The natural acoustic parameters may include parameters such as a fundamental frequency.
Optionally, the natural condition feature includes a natural text feature; step 402 may specifically include the following steps: and the server extracts natural text features from the text corresponding to the natural voice waveform through a text analysis tool.
The text analysis tool may be a set of software programs for analyzing the input text and outputting text features.
Optionally, the natural condition features include natural acoustic parameters; step 402 may specifically include the following steps: the server extracts natural acoustic parameters from the natural speech waveform.
The natural acoustic parameters may include parameters such as a fundamental frequency. The server can extract natural acoustic parameters from the natural speech waveform through an acoustic parameter extraction algorithm.
And 403, inputting the natural voice waveform and the natural condition characteristics into the waveform generation model by the server side to obtain a training result.
In an embodiment of the present application, the training result may include a training loss.
And 404, the server side optimizes the model parameters of the waveform generation model according to the training result.
In the embodiment of the present application, the model parameters of the waveform generation model may include parameters of the prior distribution estimation network and parameters of the waveform generation network. For example, the parameters of the prior distribution estimation network may include deconvolution network parameters in the prior distribution estimation network (the deconvolution network parameters may include weight matrices for convolutional layers of the deconvolution network, weight matrices for fully-connected layers of the deconvolution network, etc.). The parameters of the waveform generation network may include convolutional neural network parameters of the waveform generation network (the convolutional neural network parameters may include weight matrices of convolutional layers of the convolutional neural network, weight matrices of fully-connected layers of the convolutional neural network, etc.).
In the embodiment of the application, when the waveform generation model is trained, the priori distribution estimation network can learn the coding information of the natural speech waveform, and the quality of the speech waveform generated by the waveform generation model in the speech waveform generation stage can be improved. Compared with the mode of adopting a teacher model and a student model for training, the training process can be simplified.
Optionally, the waveform generation model further includes an encoder and a discriminator, and step 403 may include the following steps:
(21) the server side inputs the natural voice waveform into the coder, inputs the natural condition characteristics into the prior distribution estimation network, and calculates the prior loss function according to the output result of the coder and the output result of the prior distribution estimation network;
(22) the server side determines a first hidden variable according to the output result of the encoder and determines a second hidden variable according to the output result of the prior distribution estimation network;
(23) the server inputs the first hidden variable, the second hidden variable and the natural condition characteristic into a waveform generation network, and calculates a likelihood loss function according to a waveform result output by the waveform generation network;
(24) the server inputs the natural speech waveform and the waveform result output by the waveform generation network into the discriminator, and calculates the discrimination loss function and the adversarial loss function according to the output result of the discriminator.
Optionally, step 404 may include the following steps:
(31) in a first training stage, the server side updates the model parameters of the encoder, the model parameters of the prior distribution estimation network and the model parameters of the waveform generation network according to a method of minimizing a first loss function; the first loss function is determined based on the prior loss function and the likelihood loss function;
(32) in a second training stage, with the model parameters of the encoder and of the prior distribution estimation network updated in the first training stage held fixed, the server updates the model parameters of the discriminator according to a method of minimizing the discrimination loss function, and updates the model parameters of the waveform generation network according to a method of minimizing a second loss function, wherein the second loss function is determined based on the likelihood loss function and the adversarial loss function.
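A schematic of this two-stage optimization is sketched below, assuming the four loss terms are computed elsewhere for each batch; the optimizer choices, learning rates, and the compute_losses placeholder are illustrative assumptions, not the application's exact training procedure.

```python
import torch

def train_two_stages(encoder, prior_net, generator, discriminator,
                     data_loader, compute_losses, epochs=1):
    """Illustrative two-stage optimization for the VAE-GAN waveform generation model.
    `compute_losses` is a hypothetical callable assumed to return the tensors
    (L_prior, L_like, L_disc, L_adv) for one batch."""
    # Stage 1 (VAE training): update encoder, prior net, and waveform generation
    # network by minimizing the first loss function L_like + L_prior.
    opt_vae = torch.optim.Adam(list(encoder.parameters()) +
                               list(prior_net.parameters()) +
                               list(generator.parameters()), lr=1e-4)
    for _ in range(epochs):
        for batch in data_loader:
            L_prior, L_like, _, _ = compute_losses(batch)
            loss1 = L_like + L_prior
            opt_vae.zero_grad()
            loss1.backward()
            opt_vae.step()

    # Stage 2 (adversarial training): freeze encoder and prior net, then alternately
    # minimize the discrimination loss (discriminator) and the second loss function
    # L_like + L_adv (waveform generation network).
    for p in list(encoder.parameters()) + list(prior_net.parameters()):
        p.requires_grad_(False)
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    for _ in range(epochs):
        for batch in data_loader:
            _, _, L_disc, _ = compute_losses(batch)
            opt_d.zero_grad()
            L_disc.backward()
            opt_d.step()
            _, L_like, _, L_adv = compute_losses(batch)
            loss2 = L_like + L_adv
            opt_g.zero_grad()
            loss2.backward()
            opt_g.step()
```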
In the embodiment of the present application, the waveform generation model for training may include an encoder, a waveform generation network, an a priori distribution estimation network, and a discriminator as follows.
The waveform generation model is described by taking a VAE-GAN model as an example.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a VAE-GAN model according to an embodiment of the present disclosure, and as shown in fig. 5, the VAE-GAN model includes a VAE model and a GAN model, the VAE model includes an encoder, an a priori distribution estimation network, and a waveform generation network, and the GAN model includes a waveform generation network and a discriminator. Wherein the a priori distribution estimation network may comprise an a priori network and the waveform generation network may comprise a decoder.
In the VAE training stage, that is, the first training stage, the model parameters of the encoder, the prior distribution estimation network, and the waveform generation network need to be optimized; in the adversarial training stage, that is, the second training stage, the model parameters of the encoder and the prior distribution estimation network are fixed, and the model parameters of the waveform generation network and the discriminator need to be optimized. The waveform generation network belongs to both the VAE model and the GAN model, so its parameters are optimized in both training stages.
The encoder may be composed of a convolutional neural network built from multiple dilated convolutional layers (convolutional layers with holes, DCLs). The encoder may encode the input speech waveform into a hidden variable space. For example, the encoder may employ a WaveNet model.
The prior distribution estimation network may be composed of a deconvolution network, which is used to up-sample the input condition features to obtain the hidden variables of the condition features. For example, if the speech waveform input to the encoder has a sampling rate of 16 kHz and the acoustic parameters input to the prior distribution estimation network (e.g., with a frame shift of 5 ms) have a sampling rate of 200 Hz, the deconvolution network of the prior distribution estimation network acts as an up-sampling network that up-samples the 200 Hz acoustic parameters to 16 kHz (e.g., the strides of the deconvolution network can be set to [4,4,5], so that the condition features are up-sampled by a factor of 80 to 16 kHz by these three deconvolution layers), making the sampling rate of the acoustic parameters the same as that of the speech waveform.
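A quick numeric check of this arrangement, using the frame shift and strides given above; the snippet is simple arithmetic, not the network itself.

```python
# Sanity check: three transposed-convolution layers with strides [4, 4, 5]
# up-sample 200 Hz acoustic-parameter frames (5 ms frame shift) to 16 kHz samples.
frame_rate_hz = 200          # 1 / 5 ms
strides = [4, 4, 5]

upsample_factor = 1
for s in strides:
    upsample_factor *= s     # 4 * 4 * 5 = 80

print(upsample_factor)                    # 80
print(frame_rate_hz * upsample_factor)    # 16000 -> matches the 16 kHz waveform
```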
The waveform generation network may be composed of a convolutional neural network, likewise formed by stacking DCLs. The hidden variable of the speech, the hidden variable of the condition features, and the condition features are input together into the waveform generation network to generate the speech waveform. For example, the waveform generation network may employ an Inverse Autoregressive Flow (IAF) model to decode the input hidden variables into the speech waveform.
The discriminator may be formed of convolutional layers for down-sampling the input speech waveform to predict the probability that each segment of the input speech waveform is natural speech. Natural speech is speech that is uttered by a natural person and has not been processed by a speech model. For example, the natural speech may be speech recorded via a microphone.
The discriminator in fig. 5 may be a multi-scale discriminator. Because the sampling rate of the speech waveform is high, the waveform varies richly and contains multi-scale information. In the embodiment of the application, a multi-scale discriminator is adopted to distinguish whether the input waveform is a natural waveform; the multi-scale discriminator outputs discrimination probabilities at multiple scales, so the discrimination output can both capture long-term correlations of the waveform and attend to the local fine structure, and training with the multi-scale discriminator drives the waveform output by the waveform generation network close to the natural speech waveform.
For example, referring to fig. 6, fig. 6 is a schematic structural diagram of a multi-scale discriminator according to an embodiment of the present disclosure. Fig. 6 illustrates a two-scale discriminator, which includes 4 convolutional layers with a stride of 2 and 2 convolutional layers with a stride of 1. In fig. 6, s = 2 indicates that the stride of the convolutional layer is 2, so each such layer down-samples its input by a factor of 2. D1 and D2 are the discrimination probabilities obtained after down-sampling the input waveform by factors of 4 and 16, respectively: the probability obtained after 4x down-sampling can focus on local information of the waveform, while the probability obtained after 16x down-sampling can focus on the overall information of the waveform, so training acts on both the local and the overall waveform, making the waveform output by the waveform generation network closer to the natural speech waveform. The discriminator may use the output of D1, the output of D2, or both when calculating the probability, which may be determined according to the sampling rate of the input waveform; this is not limited in the embodiment of the present application.
It should be noted that fig. 6 is only an example of the multi-scale discriminator provided in the embodiment of the present application, and the number of convolutional layers included in the multi-scale discriminator and the step size of each convolutional layer may be set according to the sampling rate of the input waveform and the actual training effect, which is not limited in the embodiment of the present application.
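In the spirit of fig. 6, a two-scale discriminator could be sketched as below, with discrimination outputs D1 and D2 taken after 4x and 16x down-sampling; the channel counts, kernel sizes, and sigmoid output heads are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class TwoScaleDiscriminator(nn.Module):
    """Illustrative multi-scale discriminator: per-segment probabilities at two scales."""
    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv1d(1, ch, kernel_size=4, stride=2, padding=1),    # x2
            nn.Conv1d(ch, ch, kernel_size=4, stride=2, padding=1),   # x4
            nn.Conv1d(ch, ch, kernel_size=4, stride=2, padding=1),   # x8
            nn.Conv1d(ch, ch, kernel_size=4, stride=2, padding=1),   # x16
        ])
        self.head1 = nn.Conv1d(ch, 1, kernel_size=3, stride=1, padding=1)  # D1 at x4
        self.head2 = nn.Conv1d(ch, 1, kernel_size=3, stride=1, padding=1)  # D2 at x16

    def forward(self, x):                          # x: (batch, 1, samples)
        h = x
        feats = []
        for conv in self.down:
            h = torch.relu(conv(h))
            feats.append(h)
        d1 = torch.sigmoid(self.head1(feats[1]))   # probabilities after 4x down-sampling
        d2 = torch.sigmoid(self.head2(feats[3]))   # probabilities after 16x down-sampling
        return d1, d2

d1, d2 = TwoScaleDiscriminator()(torch.randn(1, 1, 16000))
print(d1.shape, d2.shape)                          # (1, 1, 4000) and (1, 1, 1000)
```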
In order to intuitively understand the technical solution in the present application, the VAE-GAN model in the following embodiments is described with reference to fig. 5 as an example.
Optionally, the step (21) may specifically include the following steps:
(211) the server inputs the natural voice waveform into the encoder to obtain the posterior distribution of a first hidden variable of the natural voice waveform;
(212) the server inputs the natural condition characteristics into the prior distribution estimation network to obtain prior distribution of a second hidden variable of the natural condition characteristics;
(213) and the server calculates a prior loss function according to the posterior distribution of the first hidden variable and the prior distribution of the second hidden variable.
In the embodiment of the present application, the natural speech waveform is denoted as x^1 and the first hidden variable as z^1. The encoder models the waveform sequence probability with multiple layers of dilated convolutions, and its receptive field is assumed to be N. The natural speech waveform x^1 is input to the encoder, and based on the history information of x^1 (i.e., the N waveform points of x^1 before time t) the encoder predicts the posterior distribution p(z^1|x^1) of the first hidden variable z^1 at time t:
μ_t^1, σ_t^1 = encoder(x_{t-N}^1, ..., x_{t-1}^1)
p(z^1|x^1) = N(μ_t^1, σ_t^1)
z_t^1 = μ_t^1 + σ_t^1 * N(0,1)
where x_{t-N}^1, ..., x_{t-1}^1 are the N waveform points of the natural speech waveform x^1 before time t, and encoder denotes the encoder. μ_t^1 is the mean and σ_t^1 the variance of the posterior distribution of the first hidden variable z_t^1 at time t predicted by the encoder, p(z^1|x^1) is the posterior distribution of the first hidden variable z_t^1 at time t, N(μ_t^1, σ_t^1) denotes a normal distribution (also called a Gaussian distribution) with mean μ_t^1 and variance σ_t^1, and N(0,1) denotes a normal distribution with mean 0 and variance 1. z_t^1 can be obtained by sampling from the posterior distribution of the first hidden variable z_t^1 at time t; the first hidden variable z^1 includes the first hidden variable z_t^1 at time t.
The natural condition feature is denoted as y^1, where y^1 is the condition feature corresponding to the natural speech waveform x^1, and the second hidden variable is denoted as z^2. The prior distribution estimation network, denoted prior_net, is used to estimate the prior distribution p(z^2|y^1) of the second hidden variable z^2. The prior distribution estimation network first up-samples the input natural condition feature y^1 through a deconvolution network and then outputs the mean μ_t^2 and variance σ_t^2 of the prior distribution of the second hidden variable z^2 of the natural condition feature; the second hidden variable z^2 can be sampled from this prior distribution:
μ_t^2, σ_t^2 = prior_net(y^1)
p(z^2|y^1) = N(μ_t^2, σ_t^2)
z^2 = μ_t^2 + σ_t^2 * N(0,1)
where μ_t^2 is the mean and σ_t^2 the variance of the prior distribution of the second hidden variable z^2, p(z^2|y^1) is the prior distribution of the second hidden variable z^2, and z^2 can be obtained by sampling from the prior distribution of the second hidden variable z^2.
Optionally, in step (213), the server calculates the prior loss function according to the posterior distribution of the first hidden variable and the prior distribution of the second hidden variable. The prior loss function can be calculated as follows:
L_prior = KL(p(z^1|x^1) || p(z^2|y^1))
where L_prior is the prior loss function, and KL(p(z|x) || p(z|y)) denotes the KL distance between the posterior distribution of the first hidden variable z^1 and the prior distribution of the second hidden variable z^2. For p(z^1|x^1) and p(z^2|y^1), reference may be made to the detailed description of step (21) above, which is not repeated here.
In order to enable the prior distribution estimation network to learn the hidden-variable-space coding information in the encoder, the prior distribution estimation network is constrained so that the distribution p(z^2|y^1) it outputs approximates, as closely as possible, the distribution p(z^1|x^1) of the natural speech waveform x^1. In the first training stage, the KL distance between these two distributions (the distribution p(z^2|y^1) output by the prior distribution estimation network and the distribution p(z^1|x^1) of the natural speech waveform x^1 output by the encoder) is introduced as a loss function to measure the difference between them; when the KL distance is small, i.e., when the value of L_prior is small, the two distributions are very close.
The embodiment of the application provides a prior distribution estimation network to model the prior distribution, condition characteristics are input into the prior distribution estimation network to estimate the distribution of hidden variables, KL distance is adopted to constrain the prior distribution to approximate the posterior distribution estimated by an encoder, and the correlation between the hidden variables and the condition characteristics is considered. The method and the device ensure that the prior distribution estimation network can learn the coding information of the coder on the natural voice, thereby ensuring that the output of the prior distribution estimation network is used as the input of the waveform generation network in the voice waveform generation stage, and further generating the voice with higher quality.
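Since both distributions are Gaussian, the KL distance has a closed form; a short sketch of the prior loss under that assumption is given below (the function name and tensor shapes are illustrative).

```python
import torch

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians,
    summed over dimensions and averaged over the batch."""
    kl = (torch.log(sigma_p / sigma_q)
          + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
          - 0.5)
    return kl.sum(dim=-1).mean()

# Posterior from the encoder vs. prior from the prior distribution estimation network.
mu_post, sigma_post = torch.zeros(4, 16), torch.ones(4, 16)
mu_prior, sigma_prior = torch.zeros(4, 16) + 0.1, torch.ones(4, 16) * 1.2
L_prior = gaussian_kl(mu_post, sigma_post, mu_prior, sigma_prior)
print(L_prior)
```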
Optionally, the output result of the encoder includes the posterior distribution of the first hidden variable, and the output result of the prior distribution estimation network includes the prior distribution of the second hidden variable; the step (22) may specifically include the following steps:
(221) the server samples from the posterior distribution of the first hidden variable to obtain the first hidden variable;
(222) the server samples from the prior distribution of the second hidden variable to obtain the second hidden variable.
The server may sample from the posterior distribution of the first hidden variable by a random sampling method to obtain the first hidden variable, and may sample from the prior distribution of the second hidden variable by a random sampling method to obtain the second hidden variable.
Optionally, the step (23) may specifically include the following steps:
(231) the server inputs the first hidden variable, the second hidden variable and the natural condition feature into the waveform generation network, and the waveform generation network is used for generating a reconstructed waveform corresponding to the natural voice waveform according to the first hidden variable and the natural condition feature and generating a priori generated waveform corresponding to the natural condition feature according to the second hidden variable and the natural condition feature;
(232) and the server calculates a likelihood loss function according to the waveform distribution of the natural voice waveform and the reconstructed waveform.
In this embodiment of the application, the waveform generation network may be formed by an Inverse Autoregressive Flow (IAF) network, and the IAF network implements mapping from input distribution to output distribution. The IAF network converts the hidden variables obeying the Gaussian distribution into the Gaussian distribution of the voice waveform, and the IAF network can predict the mean value and the variance of the Gaussian distribution of the voice waveform.
The server may input the first hidden variable z^1 and the natural condition feature y^1 into the waveform generation network, which predicts the mean μ_r and variance σ_r of the Gaussian distribution of the reconstructed waveform x_r. The mean μ_r and variance σ_r of the Gaussian distribution of the reconstructed waveform x_r can be determined according to the following formula:
μ_r, σ_r = IAF_decoder(z^1, y^1)
In the formula, IAF_decoder refers to the waveform generation network adopting an IAF network; inputting z^1 and y^1 into IAF_decoder yields the mean μ_r and variance σ_r of the Gaussian distribution of the reconstructed waveform x_r. Thus, the waveform distribution of the reconstructed waveform x_r is p(x_r|z^1) = N(μ_r, σ_r). The reconstructed waveform x_r can be obtained by a Gaussian transformation: applying the Gaussian transformation to the waveform distribution of the reconstructed waveform x_r gives
x_r = μ_r + σ_r * z^1
where, for z^1 and y^1, reference may be made to the description of step (21) above, which is not repeated here.
In step (232), the server may calculate the likelihood loss function according to the waveform distributions of the natural speech waveform and the reconstructed waveform. The likelihood loss function L_like can be calculated according to the following formula:
L_like = -E_{q(z^1|x^1)}[log p(x^1|z^1)]
where p(x^1|z^1) is the likelihood probability obtained by substituting the natural speech waveform x^1 into the waveform distribution of the reconstructed waveform x_r, and E_{q(z^1|x^1)}[·] is the expectation given the hidden variable z^1. L_like measures the similarity between the waveform distribution of the natural speech waveform and that of the reconstructed waveform: the smaller the value of L_like, the closer the reconstructed waveform output by the waveform generation network is to the natural speech waveform.
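A minimal sketch of L_like as the Gaussian negative log-likelihood of the natural waveform under the predicted reconstruction distribution is shown below; writing it this way is an assumption consistent with the formula above, and the placeholder tensors are illustrative.

```python
import math
import torch

def likelihood_loss(x_natural, mu_r, sigma_r):
    """Negative log-likelihood of the natural waveform x under N(mu_r, sigma_r^2),
    averaged over samples: an illustrative form of L_like."""
    nll = (torch.log(sigma_r)
           + 0.5 * math.log(2 * math.pi)
           + 0.5 * ((x_natural - mu_r) / sigma_r) ** 2)
    return nll.mean()

x = torch.randn(1, 1, 800)           # natural speech waveform (placeholder)
mu_r = torch.zeros_like(x)           # reconstruction mean predicted by the decoder
sigma_r = torch.ones_like(x)         # reconstruction scale predicted by the decoder
print(likelihood_loss(x, mu_r, sigma_r))
```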
Wherein, the server may input the second hidden variable z2 and the natural condition feature y1 into the waveform generation network, and the waveform generation network can predict the mean μp1 and the variance σp1 of the Gaussian distribution of the a priori generated waveform xp1 corresponding to the natural condition feature. The mean μp1 and the variance σp1 of the Gaussian distribution of the a priori generated waveform xp1 can be determined according to the following formula:

μp1, σp1 = IAF_decoder(z2, y1);

thus, the waveform distribution of the a priori generated waveform xp1 is p(xp1|z2) = N(μp1, σp1). The a priori generated waveform xp1 can be obtained by a Gaussian transformation; performing the Gaussian transformation on the waveform distribution of xp1 yields xp1 as follows:
xp1 = μp1 + σp1 * (z2 - μt2) / σt2;

wherein, μt2 is the mean of the prior distribution of the second hidden variable z2, and σt2 is the variance of the prior distribution of the second hidden variable z2.
Optionally, the waveform result output by the waveform generation network includes a reconstructed waveform corresponding to the natural speech waveform and a priori generated waveform corresponding to the natural condition feature; the step (24) may specifically include the following steps:
(241) the server inputs the natural voice waveform, the reconstructed waveform corresponding to the natural voice waveform and the priori generated waveform corresponding to the natural condition characteristics into the discriminator to obtain the probability that the natural voice waveform is natural voice, the probability that the reconstructed waveform is natural voice and the probability that the priori generated waveform is natural voice;
(242) the server calculates a discrimination loss function according to the probability that the natural voice waveform is natural voice, the probability that the reconstructed waveform is natural voice and the probability that the prior generated waveform is natural voice;
(243) and the server calculates the countermeasure loss function according to the probability that the reconstructed waveform is natural voice and the probability that the prior generated waveform is natural voice.
In the embodiment of the application, the discriminator is used for calculating the probability that the input voice is natural voice. The discriminator may be a multi-scale discriminator shown in fig. 6.
Optionally, step (241) may include the steps of:
(2411) the server inputs the natural voice waveform into the multi-scale discriminator to obtain N scale discrimination probabilities that the natural voice waveform is natural voice, and the probability that the natural voice waveform is natural voice is determined based on the N scale discrimination probabilities that the natural voice waveform is natural voice; n is an integer greater than or equal to 2;
(2412) the server inputs the reconstructed waveform corresponding to the natural voice waveform into the multi-scale discriminator to obtain N scale discrimination probabilities that the reconstructed waveform corresponding to the natural voice waveform is natural voice, and the probability that the reconstructed waveform is the natural voice is determined based on the N scale discrimination probabilities that the reconstructed waveform corresponding to the natural voice waveform is the natural voice;
(2413) and the server inputs the prior generated waveform corresponding to the natural condition features into the discriminator to obtain N scale discrimination probabilities that the prior generated waveform corresponding to the natural condition features is natural voice, and determines the probability that the prior generated waveform is the natural voice based on the N scale discrimination probabilities that the prior generated waveform corresponding to the natural condition features is the natural voice.
In the embodiment of the application, the multi-scale discriminator can obtain N scale discrimination probabilities for the input waveform. For example, for an input natural speech waveform x1, N scale discrimination probabilities D1(x1), D2(x1), ..., DN(x1) can be obtained. The probability that the natural speech waveform is natural speech can be calculated according to the following formula:
D(x1)=λ1*D1(x1)+λ2*D2(x1)+…+λN*DN(x1);
wherein, D(x1) is the probability that the natural speech waveform is natural speech, and λ1, λ2, ..., λN are weighting coefficients. D1(x1) is the probability of being natural speech output by the first (shallowest) downsampling layer after the natural speech waveform is input, and D2(x1), ..., DN(x1) are the probabilities of being natural speech output by progressively deeper downsampling layers. These probabilities reflect correlation information of the natural speech waveform at different scales: a shallow downsampling layer describes detail information of the waveform, while a deep downsampling layer describes overall information of the waveform, such as the overall envelope of the waveform. For example, D1(x1) is the probability that the natural speech waveform obtained with downsampling rate Sx1 is natural speech, D2(x1) is the probability that the natural speech waveform obtained with downsampling rate Sx2 is natural speech, and DN(x1) is the probability that the natural speech waveform obtained with downsampling rate SxN is natural speech, where Sx1 < Sx2 < ... < SxN.
Wherein, the sum of λ1, λ2, ..., λN may be equal to 1. The larger λ1 is, the more the discriminator focuses on the fine structure of the input waveform; the smaller λ1 is, the more the discriminator focuses on long-term variations of the input waveform.
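A minimal sketch of a multi-scale discriminator of this kind is shown below; the sub-discriminator layers, the average-pooling downsampling and the default equal weights are illustrative assumptions, not the structure of the discriminator in fig. 6.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubDiscriminator(nn.Module):
    """Scores one resolution of the waveform (illustrative layers only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 64, kernel_size=15, stride=2, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # One probability per utterance that the input is natural speech.
        return torch.sigmoid(self.net(x)).mean(dim=(1, 2))

class MultiScaleDiscriminator(nn.Module):
    """N sub-discriminators applied to progressively downsampled waveforms;
    their probabilities are combined with weights lambda_1..lambda_N."""
    def __init__(self, num_scales=3, weights=None):
        super().__init__()
        self.subs = nn.ModuleList([SubDiscriminator() for _ in range(num_scales)])
        self.weights = weights or [1.0 / num_scales] * num_scales  # sums to 1

    def scale_probs(self, x):
        probs = []
        for sub in self.subs:
            probs.append(sub(x))                                   # D_i(x)
            x = F.avg_pool1d(x, kernel_size=4, stride=2, padding=1)  # coarser scale
        return probs

    def forward(self, x):
        # Weighted combination D(x) = sum_i lambda_i * D_i(x).
        return sum(w * p for w, p in zip(self.weights, self.scale_probs(x)))
```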
Similarly, for an input reconstructed waveform x̂1, N scale discrimination probabilities D1(x̂1), D2(x̂1), ..., DN(x̂1) can be obtained. The probability that the reconstructed waveform is natural speech can be calculated according to the following formula:

D(x̂1)=λ1*D1(x̂1)+λ2*D2(x̂1)+…+λN*DN(x̂1);
For an input a priori generated waveform xp1, N scale discrimination probabilities D1(xp1), D2(xp1), ..., DN(xp1) can be obtained. The probability that the a priori generated waveform is natural speech can be calculated according to the following formula:
D(xp1)=λ1*D1(xp1)+λ2*D2(xp1)+…+λN*DN(xp1);
In step (242), in the embodiment of the present application, a Least Squares GAN (LSGAN) may be used to calculate the discrimination loss function. The discrimination loss function LD can be calculated according to the following formula:

LD = E[(D(x1) - 1)^2] + E[D(x̂1)^2] + E[D(xp1)^2];

in the above formula, D(x1), D(x̂1) and D(xp1) are respectively the prediction probabilities obtained by inputting the natural speech waveform, the reconstructed waveform and the a priori generated waveform into the discriminator. LD is the expectation of the squared error between these prediction probabilities and their true probabilities (1 for the natural speech waveform, 0 for the generated waveforms); it can be seen that the smaller LD is, the stronger the discrimination ability of the discriminator.
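Under this LSGAN formulation (natural waveforms targeted to 1, generated waveforms to 0), the discrimination loss could be sketched as follows; D is assumed to be a callable returning the combined probability described above, and the generated waveforms are detached so this step only updates the discriminator.

```python
def discriminator_loss(D, x_natural, x_reconstructed, x_prior):
    """LSGAN-style L_D: squared error between the discriminator's predictions
    and the true labels (1 for the natural waveform, 0 for generated ones)."""
    d_real = D(x_natural)
    d_rec = D(x_reconstructed.detach())  # detach: do not backprop into the generator
    d_pri = D(x_prior.detach())
    return ((d_real - 1.0) ** 2).mean() + (d_rec ** 2).mean() + (d_pri ** 2).mean()
```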
In step (243), in the embodiment of the present application, the countermeasure loss function Ladv can be calculated according to the following formula:

Ladv = E[(D(x̂1) - 1)^2] + E[(D(xp1) - 1)^2];

wherein, D(x̂1) and D(xp1) are respectively the prediction probabilities obtained by inputting the reconstructed waveform and the a priori generated waveform into the discriminator. Ladv measures the ability of the speech waveforms (x̂1, xp1) generated by the waveform generation network to fool the discriminator; the smaller Ladv is, the closer the speech waveform generated by the waveform generation network is to the natural speech waveform.
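The corresponding countermeasure (generator) loss under the same LSGAN formulation could be sketched as below; again, the function name and interface are assumptions.

```python
def adversarial_loss(D, x_reconstructed, x_prior):
    """LSGAN-style L_adv for the waveform generation network: push the
    discriminator's scores on generated waveforms towards 1 (i.e. fool it)."""
    return (((D(x_reconstructed) - 1.0) ** 2).mean()
            + ((D(x_prior) - 1.0) ** 2).mean())
```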
Wherein, step (31) is the training process of the first training stage, and step (32) is the training process of the second training stage. The first training stage updates the model parameters of the encoder, the model parameters of the prior distribution estimation network, and the model parameters of the waveform generation network. The second training stage, after the first training stage is finished, updates the model parameters of the discriminator and the model parameters of the waveform generation network while keeping fixed the model parameters of the encoder and of the prior distribution estimation network updated in the first training stage.
In the embodiment of the present application, the loss functions may be minimized by an adaptive gradient descent method (for example, the Adam optimization method).
Wherein, the first loss function L1 is determined based on the prior loss function and the likelihood loss function. In the embodiment of the present application, the first loss function L1 can be calculated according to the following formula:
L1=λ*Lprior+Llike
wherein, in order to avoid mode collapse of the encoder in the early training stage, the weight λ of Lprior is gradually increased from 0 to 1.
When the number of training steps in the first training stage reaches a preset number of training steps, or the first loss function L1 is relatively stable (that is, the variation of the first loss function L1 within a certain number of training steps is less than a threshold), it can be determined that the first training stage is finished.
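A sketch of the first training stage, including the 0-to-1 warm-up of the Lprior weight λ and Adam optimization, might look as follows; the encoder/prior-network/decoder interfaces and the encode_fn helper are hypothetical and only illustrate the update rule L1 = λ*Lprior + Llike.

```python
import torch

def first_training_stage(encoder, prior_net, decoder, train_loader,
                         encode_fn, likelihood_loss_fn,
                         warmup_steps=10000, max_steps=100000, lr=1e-4):
    """Stage 1 sketch: jointly update the encoder, the prior distribution
    estimation network and the waveform generation network by minimising
    L1 = lambda * L_prior + L_like.
    encode_fn is a hypothetical helper returning (z1, prior_loss)."""
    params = (list(encoder.parameters()) + list(prior_net.parameters())
              + list(decoder.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    for step, (x_natural, y_cond) in enumerate(train_loader):
        lam = min(1.0, step / warmup_steps)        # weight of L_prior grows 0 -> 1
        z1, prior_loss = encode_fn(encoder, prior_net, x_natural, y_cond)
        mean_hat, std_hat = decoder(z1, y_cond)
        loss_1 = lam * prior_loss + likelihood_loss_fn(x_natural, mean_hat, std_hat)
        optimizer.zero_grad()
        loss_1.backward()
        optimizer.step()
        if step + 1 >= max_steps:                  # or stop once L1 is stable
            break
```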
In step (32), in a second training stage, in a case where the updated model parameters of the encoder and the model parameters of the prior distribution estimation network in the first training stage are fixed, the server updates the model parameters of the discriminator according to a method of minimizing the discriminant loss function, and updates the model parameters of the waveform generation network according to a method of minimizing a second loss function.
The second loss function L2 is determined based on the likelihood loss function and the countermeasure loss function. The second loss function L2 can be calculated according to the following formula:
L2=Llike+Ladv
For the discrimination loss function LD, reference may be made to the detailed description of step (242) above; for Llike, reference may be made to the detailed description of step (232) above; for Ladv, reference may be made to the detailed description of step (243) above, which is not repeated here.
The step of "the server side updates the model parameters of the discriminator according to the method of minimizing the discrimination loss function", and the step of "the server side updates the model parameters of the waveform generation network according to the method of minimizing the second loss function" may be alternately performed.
It should be noted that, when step (32) is executed, the model parameters of the encoder and the prior distribution estimation network are trained, and the model parameters of the encoder and the prior distribution estimation network are fixed.
According to the embodiment of the application, the countermeasure training is continuously performed by continuously updating the model parameters of the discriminator and the model parameters of the waveform generation network, and the countermeasure training can help the voice waveform generated by the waveform generation network to be closer to natural voice.
Wherein, when the two loss functions LD and L2 are both relatively stable (that is, the variation of the two loss functions within a certain number of training steps is less than a certain threshold), it can be determined that the second training stage is finished and the whole training process is finished, so that the trained waveform generation model is obtained.
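A corresponding sketch of the second training stage, with the encoder and the prior distribution estimation network frozen and the discriminator/generator updates alternating, might look as follows; generate_fn is a hypothetical helper returning the reconstructed waveform, the a priori generated waveform and the predicted Gaussian parameters, and all other interfaces are assumptions.

```python
import torch

def second_training_stage(encoder, prior_net, decoder, discriminator,
                          train_loader, generate_fn, likelihood_loss_fn,
                          disc_loss_fn, adv_loss_fn, lr=1e-4):
    """Stage 2 sketch: freeze the encoder and the prior distribution estimation
    network, then alternate discriminator updates (minimising L_D) with
    waveform generation network updates (minimising L2 = L_like + L_adv)."""
    for module in (encoder, prior_net):
        for p in module.parameters():
            p.requires_grad_(False)

    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    opt_g = torch.optim.Adam(decoder.parameters(), lr=lr)

    for x_natural, y_cond in train_loader:
        # (a) discriminator step: minimise L_D (disc_loss_fn is assumed to
        #     detach the generated waveforms internally).
        x_rec, x_prior, mean_hat, std_hat = generate_fn(
            encoder, prior_net, decoder, x_natural, y_cond)
        loss_d = disc_loss_fn(discriminator, x_natural, x_rec, x_prior)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # (b) waveform generation network step: minimise L2 = L_like + L_adv.
        x_rec, x_prior, mean_hat, std_hat = generate_fn(
            encoder, prior_net, decoder, x_natural, y_cond)
        loss_2 = (likelihood_loss_fn(x_natural, mean_hat, std_hat)
                  + adv_loss_fn(discriminator, x_rec, x_prior))
        opt_g.zero_grad()
        loss_2.backward()
        opt_g.step()
```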
It should be noted that the model training method of fig. 4 is used for training a waveform generation model (e.g., VAE-GAN model), and the model training method of fig. 4 may be performed before the speech waveform generation method of fig. 2. After the waveform generation model is trained, the method described in FIG. 2 may be performed.
In accordance with the above, please refer to fig. 7, fig. 7 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application, and applied to a server, the model training apparatus 700 includes a first obtaining unit 701, a first extracting unit 702, a training unit 703 and an optimizing unit 704, where:
a first obtaining unit 701, configured to obtain a speech training sample, where the speech training sample includes a natural speech waveform and a text corresponding to the natural speech waveform;
a first extraction unit 702, configured to extract natural condition features from the natural speech waveform or a text corresponding to the natural speech waveform;
a training unit 703, configured to input the natural speech waveform and the natural condition features into the waveform generation model to obtain a training result;
an optimizing unit 704, configured to optimize a model parameter of the waveform generation model according to the training result.
Optionally, the waveform generation model further includes an encoder and a discriminator, and the training unit 703 inputs the natural speech waveform and the natural condition features into the waveform generation model to obtain a training result, specifically: inputting the natural voice waveform into the encoder, inputting the natural condition features into the prior distribution estimation network, and calculating a prior loss function according to the output result of the encoder and the output result of the prior distribution estimation network; determining a first hidden variable according to the output result of the encoder, and determining a second hidden variable according to the output result of the prior distribution estimation network; inputting the first hidden variable, the second hidden variable and the natural condition feature into the waveform generation network, and calculating a likelihood loss function according to a waveform result output by the waveform generation network; and inputting the natural voice waveform and a waveform result output by the waveform generation network into the discriminator, and calculating a discrimination loss function and a countermeasure loss function according to an output result of the discriminator.
Optionally, the training unit 703 inputs the natural speech waveform into the encoder, inputs the natural condition features into the prior distribution estimation network, and calculates a prior loss function according to an output result of the encoder and an output result of the prior distribution estimation network, specifically: inputting the natural voice waveform into the encoder to obtain the posterior distribution of a first hidden variable of the natural voice waveform; inputting the natural condition features into the prior distribution estimation network to obtain prior distribution of second hidden variables of the natural condition features; and calculating a prior loss function according to the posterior distribution of the first hidden variable and the prior distribution of the second hidden variable.
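If, as in a standard VAE, the prior loss function is taken to be the KL divergence between the posterior Gaussian of the first hidden variable and the prior Gaussian of the second hidden variable (an assumption of this sketch; the patent only names it a prior loss function), it could be computed as:

```python
from torch.distributions import Normal, kl_divergence

def prior_loss(post_mean, post_std, prior_mean, prior_std):
    """KL( q(z1 | natural waveform) || p(z2 | condition features) ), assuming
    both the posterior and the prior are diagonal Gaussians."""
    return kl_divergence(Normal(post_mean, post_std),
                         Normal(prior_mean, prior_std)).mean()
```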
Optionally, the output result of the encoder includes a posterior distribution of the first hidden variable, and the output result of the prior distribution estimation network includes a prior in-distribution sample of the second hidden variable; the training unit 703 determines a first hidden variable according to the output result of the encoder, and determines a second hidden variable according to the output result of the prior distribution estimation network, specifically: sampling from the posterior distribution of the first hidden variable to obtain the first hidden variable; and sampling from the prior distribution of the second hidden variables to obtain the second hidden variables.
Optionally, the training unit 703 inputs the first hidden variable, the second hidden variable, and the natural condition feature into the waveform generation network, and calculates a likelihood loss function according to a waveform result output by the waveform generation network, specifically: inputting the first hidden variable, the second hidden variable and the natural condition feature into the waveform generation network, generating a reconstructed waveform corresponding to the natural speech waveform according to the first hidden variable and the natural condition feature, and generating a priori generated waveform corresponding to the natural condition feature according to the second hidden variable and the natural condition feature; and calculating a likelihood loss function according to the waveform distribution of the natural voice waveform and the reconstructed waveform.
Optionally, the waveform result output by the waveform generation network includes a reconstructed waveform corresponding to the natural speech waveform and a priori generated waveform corresponding to the natural condition feature; the training unit 703 inputs the natural speech waveform and the waveform result output by the waveform generation network into the discriminator, and calculates a discrimination loss function and a countermeasure loss function according to the output result of the discriminator, specifically: inputting the natural voice waveform, the reconstructed waveform and the priori generated waveform into the discriminator to obtain the probability that the natural voice waveform is natural voice, the probability that the reconstructed waveform is natural voice and the probability that the priori generated waveform is natural voice; calculating a discrimination loss function according to the probability that the natural voice waveform is natural voice, the probability that the reconstructed waveform is natural voice and the probability that the prior generated waveform is natural voice; and calculating a countermeasure loss function according to the probability that the reconstructed waveform is natural voice and the probability that the prior generated waveform is natural voice.
Optionally, the optimizing unit 704 optimizes the model parameters of the waveform generation model according to the training result, specifically: in a first training stage, updating the model parameters of the coder, the model parameters of the prior distribution estimation network and the model parameters of the waveform generation network according to a method of minimizing a first loss function; the first loss function is determined based on the prior loss function and the likelihood loss function; in a second training phase, in a case where the model parameters of the encoder and the model parameters of the prior distribution estimation network updated in the first training phase are fixed, the model parameters of the discriminator are updated in a manner of minimizing the discrimination loss function, and the model parameters of the waveform generation network are updated in a manner of minimizing a second loss function determined based on the likelihood loss function and the countermeasure loss function.
Optionally, the discriminator comprises a multi-scale discriminator.
Optionally, the training unit 703 inputs the natural speech waveform, the reconstructed waveform corresponding to the natural speech waveform, and the priori generated waveform corresponding to the natural condition feature into the discriminator to obtain the probability that the natural speech waveform is natural speech, the probability that the reconstructed waveform is natural speech, and the probability that the priori generated waveform is natural speech, specifically:
inputting the natural voice waveform into the multi-scale discriminator to obtain N scale discrimination probabilities that the natural voice waveform is natural voice, and determining the probability that the natural voice waveform is natural voice based on the N scale discrimination probabilities that the natural voice waveform is natural voice; n is an integer greater than or equal to 2;
inputting the reconstructed waveform corresponding to the natural voice waveform into the multi-scale discriminator to obtain N scale discrimination probabilities that the reconstructed waveform corresponding to the natural voice waveform is natural voice, and determining the probability that the reconstructed waveform is natural voice based on the N scale discrimination probabilities that the reconstructed waveform corresponding to the natural voice waveform is natural voice;
inputting the prior generated waveform corresponding to the natural condition features into the discriminator to obtain N scale discrimination probabilities that the prior generated waveform corresponding to the natural condition features is natural voice, and determining the probability that the prior generated waveform is natural voice based on the N scale discrimination probabilities that the prior generated waveform corresponding to the natural condition features is natural voice.
Optionally, the condition feature includes a text feature or an acoustic parameter.
Optionally, the natural condition feature includes a natural text feature; the first extraction unit 702 extracts natural condition features from the natural speech waveform or the text corresponding to the natural speech waveform, specifically: and extracting natural text features from the text corresponding to the natural voice waveform through a text analysis tool.
Optionally, the natural condition features include natural acoustic parameters; the first extraction unit 702 extracts natural condition features from the natural speech waveform or the text corresponding to the natural speech waveform, specifically: natural acoustic parameters are extracted from the natural speech waveform.
In the embodiment of the application, when the waveform generation model is trained, the priori distribution estimation network can learn the coding information of the natural speech waveform, and the quality of the speech waveform generated by the waveform generation model in the speech waveform generation stage can be improved. Compared with the mode of adopting a teacher model and a student model for training, the training process can be simplified.
In accordance with the above, please refer to fig. 8, fig. 8 is a schematic structural diagram of a speech waveform generation apparatus provided in an embodiment of the present application, where the speech waveform generation apparatus 800 includes an obtaining unit 801, an extracting unit 802, and a waveform generating unit 803, where:
an acquisition unit 801 for acquiring an input text;
an extracting unit 802, configured to extract a condition feature from the input text;
a waveform generating unit 803, configured to input the conditional features into a trained waveform generating model, and process the conditional features to obtain a speech waveform;
the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning coding information of a natural speech waveform in a training phase, and the waveform generation network is used for generating the speech waveform according to the condition characteristics and an output result of the prior distribution estimation network.
Optionally, the waveform generating unit 803 inputs the condition feature into a waveform generating model obtained by training, and processes the condition feature to obtain a speech waveform, specifically: obtaining an output result of the prior distribution estimation network according to the condition characteristics by using the prior distribution estimation network, and determining a hidden variable of the condition characteristics from the output result of the prior distribution estimation network; and generating the voice waveform according to the condition characteristic and the hidden variable of the condition characteristic by utilizing the waveform generation network.
Optionally, the waveform generating unit 803 obtains an output result of the prior distribution estimation network according to the condition characteristic by using the prior distribution estimation network, and determines the hidden variable of the condition characteristic from the output result of the prior distribution estimation network, specifically:
obtaining the prior distribution of the hidden variables of the condition characteristics according to the condition characteristics by utilizing the prior distribution estimation network; and sampling from the prior distribution of the hidden variables of the condition characteristics to obtain the hidden variables of the condition characteristics.
Optionally, the waveform generating unit 803 generates the speech waveform according to the condition feature and the hidden variable of the condition feature by using the waveform generating network, specifically: inputting the hidden variables of the condition characteristics and the condition characteristics into the waveform generation network to obtain prior generated waveform distribution; and carrying out probability distribution transformation on the prior generated waveform distribution to obtain the voice waveform.
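Putting these steps together, the generation (inference) path could be sketched as below; the prior_net/decoder interfaces are assumptions, and whether the final waveform is drawn from the prior generated waveform distribution or taken as its mean is a design choice not fixed by this sketch.

```python
import torch

@torch.no_grad()
def generate_waveform(prior_net, decoder, cond_features):
    """Inference sketch: estimate the hidden-variable prior from the condition
    features, sample a hidden variable, and map it (with the condition
    features) to a speech waveform via the waveform generation network."""
    prior_mean, prior_std = prior_net(cond_features)            # prior distribution
    z = prior_mean + prior_std * torch.randn_like(prior_mean)   # sample hidden variable
    mean_p, std_p = decoder(z, cond_features)                   # prior generated waveform distribution
    return mean_p + std_p * torch.randn_like(mean_p)            # probability distribution transformation
```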
Optionally, the condition feature includes a text feature or an acoustic parameter.
In the embodiment of the application, the condition features can be extracted from the input text, the condition features are input into the prior distribution estimation network to obtain the output result of the prior distribution estimation network, and the waveform generation network can generate the voice waveform according to the condition features and the output result of the prior distribution estimation network. Compared with the mode of point-by-point generation of the voice waveform based on autoregression, the waveform generation model can directly generate the voice waveform according to the condition characteristics, and the waveform generation efficiency of the waveform generation model can be improved. Because the prior distribution estimation network can learn the coding information of the natural voice waveform, the quality of the voice waveform generated by the waveform generation model can be improved.
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 900 may have a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 902 (e.g., one or more processors) and a memory 908, and one or more storage media 907 (e.g., one or more mass storage devices) for storing applications 906 or data 905. Memory 908 and storage medium 907 may be, among other things, transient or persistent storage. The program stored on the storage medium 907 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 902 may be configured to communicate with the storage medium 907 to execute a series of instruction operations in the storage medium 907 on the server 900. The server 900 may be a software running device as provided herein.
The server 900 may also include one or more power supplies 903, one or more wired or wireless network interfaces 909, one or more input-output interfaces 910, and/or one or more operating systems 904, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so forth.
The steps performed by the software running device in the above embodiment may be based on the server structure shown in fig. 9. Specifically, the central processing unit 902 may implement the functions of each unit in fig. 7 and 8.
Embodiments of the present application also provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps of any one of the voice waveform generation methods as described in the above method embodiments.
Embodiments of the present application also provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute part or all of the steps of any one of the model training methods as described in the above method embodiments.
Embodiments of the present application also provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program causes a computer to execute part or all of the steps of any one of the speech waveform generation methods as described in the above method embodiments.
Embodiments of the present application also provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program causes a computer to execute part or all of the steps of any one of the model training methods as described in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: various media capable of storing program codes, such as a usb disk, a read-only memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and the like.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash memory disks, read-only memory, random access memory, magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (14)

1. A speech waveform generation method, comprising:
acquiring an input text;
extracting condition features from the input text;
inputting the condition characteristics into a waveform generation model obtained by training, and processing the condition characteristics to obtain a voice waveform; the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning coding information of a natural speech waveform in a training phase, and the waveform generation network is used for generating the speech waveform according to the condition characteristics and an output result of the prior distribution estimation network.
2. The method of claim 1, wherein the processing the conditional features to obtain a speech waveform comprises:
obtaining an output result of the prior distribution estimation network according to the condition characteristics by using the prior distribution estimation network, and determining a hidden variable of the condition characteristics from the output result of the prior distribution estimation network;
and generating the voice waveform according to the condition characteristic and the hidden variable of the condition characteristic by utilizing the waveform generation network.
3. The method according to claim 2, wherein the obtaining, by the prior distribution estimation network, an output result of the prior distribution estimation network according to the condition feature, and determining the hidden variable of the condition feature from the output result of the prior distribution estimation network, comprises:
obtaining the prior distribution of the hidden variables of the condition characteristics according to the condition characteristics by utilizing the prior distribution estimation network; and sampling from the prior distribution of the hidden variables of the condition characteristics to obtain the hidden variables of the condition characteristics.
4. The method of claim 2, wherein the generating the speech waveform from the condition feature and the hidden variable of the condition feature using the waveform generation network comprises:
inputting the hidden variables of the condition characteristics and the condition characteristics into the waveform generation network to obtain prior generated waveform distribution; and carrying out probability distribution transformation on the prior generated waveform distribution to obtain the voice waveform.
5. The method according to any one of claims 1 to 4, wherein before inputting the condition features into a trained waveform generation model and processing the condition features to obtain a speech waveform, the method further comprises:
acquiring a voice training sample, wherein the voice training sample comprises a natural voice waveform and a text corresponding to the natural voice waveform;
extracting natural condition features from the natural voice waveform or a text corresponding to the natural voice waveform;
inputting the natural voice waveform and the natural condition characteristics into the waveform generation model to obtain a training result;
and optimizing the model parameters of the waveform generation model according to the training result.
6. The method of claim 5, wherein the waveform generation model further comprises an encoder and a discriminator, and the inputting the natural speech waveform and the natural condition features into the waveform generation model to obtain a training result comprises:
inputting the natural voice waveform into the encoder to obtain the posterior distribution of a first hidden variable of the natural voice waveform;
inputting the natural condition features into the prior distribution estimation network to obtain prior distribution of second hidden variables of the natural condition features;
calculating a prior loss function according to the posterior distribution of the first hidden variable and the prior distribution of the second hidden variable;
sampling from the posterior distribution of the first hidden variable to obtain the first hidden variable, and sampling from the prior distribution of the second hidden variable to obtain the second hidden variable;
inputting the first hidden variable, the second hidden variable and the natural condition feature into the waveform generation network, and calculating a likelihood loss function according to a waveform result output by the waveform generation network;
and inputting the natural voice waveform and a waveform result output by the waveform generation network into the discriminator, and calculating a discrimination loss function and a countermeasure loss function according to an output result of the discriminator.
7. The method of claim 6, wherein inputting the first hidden variable, the second hidden variable, and the natural condition feature into the waveform generation network, and calculating a likelihood loss function from a waveform result output by the waveform generation network comprises:
inputting the first hidden variable, the second hidden variable and the natural condition feature into the waveform generation network, generating a reconstructed waveform corresponding to the natural speech waveform according to the first hidden variable and the natural condition feature, and generating a priori generated waveform corresponding to the natural condition feature according to the second hidden variable and the natural condition feature; and calculating a likelihood loss function according to the waveform distribution of the natural voice waveform and the reconstructed waveform.
8. The method according to claim 7, wherein the waveform result output by the waveform generation network includes a reconstructed waveform corresponding to the natural speech waveform and an a priori generated waveform corresponding to the natural condition feature; the inputting the natural voice waveform and the waveform result output by the waveform generation network into the discriminator, and calculating a discrimination loss function and a countermeasure loss function according to the output result of the discriminator includes:
inputting the natural voice waveform, a reconstructed waveform corresponding to the natural voice waveform and a priori generated waveform corresponding to the natural condition characteristics into the discriminator to obtain the probability that the natural voice waveform is natural voice, the probability that the reconstructed waveform is natural voice and the probability that the priori generated waveform is natural voice;
calculating a discrimination loss function according to the probability that the natural voice waveform is natural voice, the probability that the reconstructed waveform is natural voice and the probability that the prior generated waveform is natural voice;
and calculating a countermeasure loss function according to the probability that the reconstructed waveform is natural voice and the probability that the prior generated waveform is natural voice.
9. The method of claim 8, wherein the discriminator comprises a multi-scale discriminator.
10. The method according to claim 9, wherein the inputting the natural speech waveform, the reconstructed waveform corresponding to the natural speech waveform, and the a priori generated waveform corresponding to the natural condition feature into the discriminator to obtain a probability that the natural speech waveform is natural speech, a probability that the reconstructed waveform is natural speech, and a probability that the a priori generated waveform is natural speech includes:
inputting the natural voice waveform into the multi-scale discriminator to obtain N scale discrimination probabilities that the natural voice waveform is natural voice, and determining the probability that the natural voice waveform is natural voice based on the N scale discrimination probabilities that the natural voice waveform is natural voice; n is an integer greater than or equal to 2;
inputting the reconstructed waveform corresponding to the natural voice waveform into the multi-scale discriminator to obtain N scale discrimination probabilities that the reconstructed waveform corresponding to the natural voice waveform is natural voice, and determining the probability that the reconstructed waveform is natural voice based on the N scale discrimination probabilities that the reconstructed waveform corresponding to the natural voice waveform is natural voice;
inputting the prior generated waveform corresponding to the natural condition features into the discriminator to obtain N scale discrimination probabilities that the prior generated waveform corresponding to the natural condition features is natural voice, and determining the probability that the prior generated waveform is natural voice based on the N scale discrimination probabilities that the prior generated waveform corresponding to the natural condition features is natural voice.
11. The method according to any one of claims 6 to 10, wherein the optimizing model parameters of the waveform generation model according to the training result comprises:
in a first training stage, updating the model parameters of the coder, the model parameters of the prior distribution estimation network and the model parameters of the waveform generation network according to a method of minimizing a first loss function; the first loss function is determined based on the prior loss function and the likelihood loss function;
in a second training phase, in a case where the model parameters of the encoder and the model parameters of the prior distribution estimation network updated in the first training phase are fixed, the model parameters of the discriminator are updated in a manner of minimizing the discrimination loss function, and the model parameters of the waveform generation network are updated in a manner of minimizing a second loss function determined based on the likelihood loss function and the countermeasure loss function.
12. A speech waveform generation apparatus, comprising:
an acquisition unit configured to acquire an input text;
an extraction unit configured to extract a condition feature from the input text;
the waveform generating unit is used for inputting the condition characteristics into a waveform generating model obtained by training and processing the condition characteristics to obtain a voice waveform;
the waveform generation model comprises a prior distribution estimation network and a waveform generation network, wherein the prior distribution estimation network is used for learning coding information of a natural speech waveform in a training phase, and the waveform generation network is used for generating the speech waveform according to the condition characteristics and an output result of the prior distribution estimation network.
13. A server comprising a processor and a memory, the memory for storing a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1 to 11.
14. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 11.
CN201911382443.0A 2019-12-27 2019-12-27 Voice waveform generation method, device, server and storage medium Pending CN113053356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911382443.0A CN113053356A (en) 2019-12-27 2019-12-27 Voice waveform generation method, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN113053356A true CN113053356A (en) 2021-06-29

Family

ID=76507146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911382443.0A Pending CN113053356A (en) 2019-12-27 2019-12-27 Voice waveform generation method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113053356A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116660A3 (en) * 2021-12-22 2023-08-17 广州市百果园网络科技有限公司 Model training and tone conversion method and apparatus, device, and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011002501A (en) * 2009-06-16 2011-01-06 Mitsubishi Electric Corp Speech synthesizer and speech synthesizing method
US20120253794A1 (en) * 2011-03-29 2012-10-04 Kabushiki Kaisha Toshiba Voice conversion method and system
JP2015045755A (en) * 2013-08-28 2015-03-12 日本電信電話株式会社 Speech synthesis model learning device, method, and program
CN109119067A (en) * 2018-11-19 2019-01-01 苏州思必驰信息科技有限公司 Phoneme synthesizing method and device
JP2019070965A (en) * 2017-10-10 2019-05-09 日本電信電話株式会社 Learning device, learning method, and program
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech
CN110457457A (en) * 2019-08-02 2019-11-15 腾讯科技(深圳)有限公司 Dialogue generates the training method, dialogue generation method and device of model
CN110473516A (en) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUNDAN KUMAR ET AL.: "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis", ARXIV, pages 3 - 9 *

Similar Documents

Publication Publication Date Title
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN109147758B (en) Speaker voice conversion method and device
CN105976812B (en) A kind of audio recognition method and its equipment
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
CN111402891B (en) Speech recognition method, device, equipment and storage medium
JP6802958B2 (en) Speech synthesis system, speech synthesis program and speech synthesis method
Fazel et al. Synthasr: Unlocking synthetic data for speech recognition
US10957303B2 (en) Training apparatus, speech synthesis system, and speech synthesis method
CN107093422B (en) Voice recognition method and voice recognition system
CN113506562B (en) End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
Gu et al. Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension.
CN113076847A (en) Multi-mode emotion recognition method and system
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
Rajesh Kumar et al. Optimization-enabled deep convolutional network for the generation of normal speech from non-audible murmur based on multi-kernel-based features
CN109326278B (en) Acoustic model construction method and device and electronic equipment
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
JP5807921B2 (en) Quantitative F0 pattern generation device and method, model learning device for F0 pattern generation, and computer program
CN114203154A (en) Training method and device of voice style migration model and voice style migration method and device
KR100897555B1 (en) Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
CN113053356A (en) Voice waveform generation method, device, server and storage medium
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
JP6468519B2 (en) Basic frequency pattern prediction apparatus, method, and program
CN111862931A (en) Voice generation method and device
JP6137708B2 (en) Quantitative F0 pattern generation device, model learning device for F0 pattern generation, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination