CN112767957B - Method for obtaining prediction model, prediction method of voice waveform and related device - Google Patents


Info

Publication number
CN112767957B
CN112767957B (application CN202011627633.7A)
Authority
CN
China
Prior art keywords
waveform
sample
subsequences
predicted
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011627633.7A
Other languages
Chinese (zh)
Other versions
CN112767957A (en)
Inventor
伍宏传
胡亚军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, iFlytek Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN202011627633.7A priority Critical patent/CN112767957B/en
Publication of CN112767957A publication Critical patent/CN112767957A/en
Application granted granted Critical
Publication of CN112767957B publication Critical patent/CN112767957B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses a method for obtaining a prediction model, a method for predicting a speech waveform, an electronic device, and a computer-readable storage medium. The waveform values of the current waveform points in a plurality of sample subsequences are input into the prediction model simultaneously, so the predicted waveform values of the next waveform points in all of the sample subsequences are obtained simultaneously. The application therefore reduces the computation needed to predict and generate a speech waveform and improves generation efficiency, enabling real-time waveform generation with little risk of stuttering.

Description

Method for obtaining prediction model, prediction method of voice waveform and related device
Technical Field
The present application relates to the field of speech signal processing technology, and in particular, to a method for obtaining a prediction model, a method for predicting a speech waveform, an electronic device, and a computer-readable storage medium.
Background
With the development of deep learning, neural networks such as WaveNet, SampleRNN, WaveRNN, Parallel WaveNet, and WaveGlow are widely used to predict speech waveforms. However, predicting and generating a speech waveform with such a neural network is computationally expensive, so waveform generation is inefficient and difficult to perform in real time; in practical applications stuttering easily occurs and the user experience suffers.
Disclosure of Invention
The application mainly solves the technical problem of providing a method for obtaining a prediction model, a method for predicting a speech waveform, an electronic device, and a computer-readable storage medium, which can improve the efficiency of generating speech waveforms.
In order to solve the technical problems, the application adopts a technical scheme that:
there is provided a method of obtaining a predictive model comprising:
dividing a sample voice waveform into a plurality of sample subsequences, and performing time delay processing on the plurality of sample subsequences;
constructing a prediction model, wherein a predicted waveform value of the next waveform point in the plurality of sample subsequences is obtained, based on the prediction model, from sample acoustic parameters and the waveform values of the current waveform points in the plurality of sample subsequences; wherein the sample acoustic parameters are extracted from the sample speech waveform;
training the prediction model using the sample acoustic parameters and the plurality of sample subsequences after delay processing.
In order to solve the technical problems, the application adopts another technical scheme that:
Provided is a method for predicting a speech waveform, including:
acquiring text acoustic parameters based on text information, and setting a plurality of initialization waveform values;
inputting the text acoustic parameters and the initialization waveform values into a prediction model, and obtaining a plurality of prediction subsequences using the prediction model; wherein the prediction subsequences correspond one-to-one to the initialization waveform values;
and obtaining predicted voice waveforms corresponding to the text information according to the plurality of predicted subsequences.
In order to solve the technical problems, the application adopts another technical scheme that:
There is provided an electronic device comprising a memory and a processor coupled to each other, the memory storing program instructions, the processor being capable of executing the program instructions to implement a method of obtaining a predictive model as described in the above technical solution and/or a method of predicting a speech waveform as described in the above technical solution.
In order to solve the technical problems, the application adopts another technical scheme that:
There is provided a computer readable storage medium having stored thereon program instructions executable by a processor to implement a method of obtaining a predictive model as described in the above technical solutions and/or a method of predicting a speech waveform as described in the above technical solutions.
The beneficial effects of the application are as follows: compared with the prior art, the method for obtaining a prediction model provided by the application first divides a sample speech waveform into a plurality of sample subsequences and applies time delay processing to them, then constructs an initial prediction model and trains it with the plurality of sample subsequences and the acoustic parameters of the sample speech waveform, thereby obtaining the prediction model. The waveform values of the current waveform points in the plurality of sample subsequences are input into the prediction model simultaneously, so the predicted waveform values of the next waveform points in all subsequences are obtained simultaneously. The application therefore reduces the computation needed to predict and generate a speech waveform and improves generation efficiency, enabling real-time waveform generation with little risk of stuttering.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort. Wherein:
FIG. 1 is a flow chart of an embodiment of a method for obtaining a predictive model according to the present application;
FIG. 2 is a flow chart of an embodiment of the step S11 in FIG. 1;
FIG. 3 is an exemplary diagram of one embodiment of a number of sample subsequences;
FIG. 4 is a diagram illustrating an embodiment of performing a time delay process on the sub-sequence of samples in FIG. 3;
FIG. 5 is a flowchart illustrating the step S13 of FIG. 1 according to an embodiment;
FIG. 6 is a flowchart of the step S12 of FIG. 1;
FIG. 7 is a schematic diagram of a prediction model according to an embodiment;
FIG. 8 is a flowchart illustrating an embodiment of a method for predicting a speech waveform according to the present application;
FIG. 9 is a flowchart illustrating the step S53 in FIG. 8;
FIG. 10 is a flowchart illustrating the step S52 of FIG. 8;
FIG. 11 is a flowchart illustrating the step S72 in FIG. 10;
FIG. 12 is a schematic diagram of an electronic device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an embodiment of a computer readable storage medium according to the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a method for obtaining a prediction model according to the present application, which includes the following steps.
Step S11, dividing the sample voice waveform into a plurality of sample subsequences, and performing time delay processing on the plurality of sample subsequences.
In this embodiment, a sample speech waveform is first prepared as training data for the prediction model. Specifically, recording data of a speaker is collected and preprocessed (e.g., denoising, energy normalization) to obtain the sample speech waveform. Each waveform point in the sample speech waveform is characterized by an original amplitude value, which generally varies within [-1, 1]. Before dividing the sample speech waveform into a plurality of sample subsequences, it is therefore preferable to pre-emphasize the sample speech waveform and to quantize the original amplitude value of each waveform point into the waveform value of that point. Quantization introduces quantization noise into the sample speech waveform, which the pre-emphasis processing compensates for. In particular, µ-law 8-bit quantization may be adopted, representing each waveform point by one of 256 integer waveform values within [0, 255].
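The µ-law 8-bit quantization mentioned above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function names and the use of µ = 255 are assumptions (255 is the standard µ-law parameter for 256 levels):

```python
import math

MU = 255  # standard mu-law parameter for 8-bit (256-level) quantization

def mu_law_quantize(x, mu=MU):
    """Map an amplitude x in [-1, 1] to an integer waveform value in [0, 255]."""
    # Compand: logarithmically compress the dynamic range, preserving sign.
    companded = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, 1], then scale to the 256 integer levels.
    return int(round((companded + 1.0) / 2.0 * mu))

def mu_law_dequantize(q, mu=MU):
    """Approximate inverse: integer in [0, 255] back to an amplitude in [-1, 1]."""
    companded = q / mu * 2.0 - 1.0
    return math.copysign(math.expm1(abs(companded) * math.log1p(mu)) / mu, companded)
```

Amplitudes -1.0 and 1.0 map to the extreme levels 0 and 255, and small amplitudes get proportionally more of the 256 levels, which is why companding is preferred over uniform quantization for speech.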
Referring to fig. 2, fig. 2 is a flow chart of an embodiment of step S11 in fig. 1. The number of sample subsequences is a first value, denoted B, i.e., there are B sample subsequences. The number of waveform points contained in the sample speech waveform is an integer multiple of the first value, and the multiple is a second value, denoted N. That is, the sample speech waveform contains B×N waveform points, and each sample subsequence contains N waveform points.
Step S21, dividing the sample speech waveform into a plurality of sample subsequences each containing the second value (N) of waveform points, such that the waveform points with the same sequence number across the first to last sample subsequences are the first value (B) of consecutively arranged waveform points in the sample speech waveform, and any two adjacent waveform points within a sample subsequence correspond to two waveform points whose original sequence numbers in the sample speech waveform are the first value (B) apart.
Referring specifically to fig. 3, fig. 3 is an exemplary diagram of an embodiment of a plurality of sample subsequences, where the numerals 1, 2, …, 36 are the original sequence numbers of the 36 (B = 4, N = 9) waveform points in the sample speech waveform. The 9 waveform points in the 4th row of the figure form the first sample subsequence, corresponding to the waveform points with original sequence numbers 1, 5, 9, 13, 17, 21, 25, 29, 33 in the sample speech waveform; any two adjacent points in it correspond to two waveform points whose original sequence numbers are 4 apart. The second, third, and fourth sample subsequences, corresponding to rows 3, 2, and 1, follow by analogy. As a result, the first waveform points (sequence number 1) of the first to fourth sample subsequences are the 4 waveform points with original sequence numbers 1, 2, 3, 4 in the sample speech waveform, the second waveform points are those with original sequence numbers 5, 6, 7, 8, and so on. The values of B and N can be set reasonably according to the actual situation; of course, the number of waveform points in practical applications is far greater than shown in fig. 3.
In summary, the above process of dividing the sample speech waveform into a plurality of sample subsequences places the waveform point with original sequence number B×n + b in the sample speech waveform into the b-th sample subsequence, where b ranges over the B integers from 1 to B and n ranges over the N integers from 0 to N-1. For example, for b = 1, the N waveform points with original sequence numbers B×n + 1 (n = 0, 1, …, N-1) are placed in the first sample subsequence.
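The division rule above can be sketched in Python (illustrative helper names, not from the patent; original sequence numbers are 1-based as in fig. 3):

```python
def split_into_subsequences(wave, B):
    """Divide a waveform of B*N points into B subsequences of N points each.

    The point with (1-based) original sequence number B*n + b goes into
    subsequence b, so subsequence b holds every B-th point starting at b.
    """
    assert len(wave) % B == 0, "waveform length must be a multiple of B"
    N = len(wave) // B
    return [[wave[B * n + (b - 1)] for n in range(N)] for b in range(1, B + 1)]

# The 36-point example of fig. 3 (B = 4, N = 9):
subs = split_into_subsequences(list(range(1, 37)), B=4)
```

Here `subs[0]` is `[1, 5, 9, 13, 17, 21, 25, 29, 33]`, matching the first sample subsequence of fig. 3.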
Step S22, sequentially delaying the first to last sample subsequences by preset multiples of a preset duration, where the preset multiples corresponding to the first to last sample subsequences are the integers from 0 to the first value minus 1 (i.e., 0 to B-1).
After the sample speech waveform is divided into a plurality of sample subsequences in the manner described in step S21, the waveform points with the same sequence number across the first to last sample subsequences are consecutively arranged waveform points in the sample speech waveform and are therefore strongly correlated with one another, which makes them unsuitable as training data for the subsequent prediction model; hence, time delay processing of the plurality of sample subsequences is required.
Still taking the four sample subsequences shown in fig. 3 as an example, the delay procedure is described. Referring to fig. 4, fig. 4 is an exemplary diagram of an embodiment of performing time delay processing on the sample subsequences in fig. 3, where the first value B = 4. The first to fourth sample subsequences are sequentially delayed by 0, 1, 2, and 3 times the preset duration d, i.e., shifted by the durations of 0, 2, 4, and 6 waveform points respectively, as shown in fig. 4; the positions vacated by the delay may be filled with 0.
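Continuing the fig. 3 / fig. 4 example, the delay step can be sketched as follows (illustrative names; subsequence b, counted from 0, is shifted right by b·d positions with d = 2 waveform points, and the vacated positions are zero-filled):

```python
def delay_subsequences(subs, d):
    """Delay subsequence b (0-based) by b*d waveform-point positions,
    zero-filling the vacated leading positions and dropping the overflow."""
    delayed = []
    for b, seq in enumerate(subs):
        shift = b * d
        delayed.append([0] * shift + seq[: len(seq) - shift])
    return delayed

# The four subsequences of fig. 3, written by original sequence number:
subs = [[1, 5, 9, 13, 17, 21, 25, 29, 33],
        [2, 6, 10, 14, 18, 22, 26, 30, 34],
        [3, 7, 11, 15, 19, 23, 27, 31, 35],
        [4, 8, 12, 16, 20, 24, 28, 32, 36]]
delayed = delay_subsequences(subs, d=2)
```

After the delay, the 7th waveform points of the four subsequences are (25, 18, 11, 4) and the 8th are (29, 22, 15, 8), as in the discussion of fig. 4.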
After the time delay, take the 7th waveform points (25, 18, 11, 4) and the 8th waveform points (29, 22, 15, 8) of the four sample subsequences in fig. 4 as an example: the 8th waveform points are predicted from the 7th. When predicting the 8th waveform points (22, 15, 8) of the second, third, and fourth sample subsequences, both earlier and later context is visible. Taking original sequence number 22 as an example, the first 7 waveform points across the four subsequences include original sequence numbers both larger and smaller than 22, so the waveform value at original sequence number 22 can be predicted more accurately.
In addition, after the time delay, the waveform values at the waveform points with the same sequence number in the plurality of sample subsequences may be expressed as (x_t, x_{t-D}, x_{t-2D}, …, x_{t-(B-1)D}), and those at the next sequence number as (x_{t+B}, x_{t-D+B}, x_{t-2D+B}, …, x_{t-(B-1)D+B}), where t is the original sequence number, in the sample speech waveform, of the point belonging to the first sample subsequence, and t-(B-1)×D is that of the point belonging to the last sample subsequence. The waveform values at the points with sequence number 7 in the four sample subsequences of fig. 4 can thus be expressed as (x_25, x_18, x_11, x_4), with t = 25, B = 4, D = 7.
Step S12, constructing a prediction model, wherein a predicted waveform value of the next waveform point in the plurality of sample subsequences is obtained, based on the prediction model, from the sample acoustic parameters and the waveform values of the current waveform points in the plurality of sample subsequences; the sample acoustic parameters are extracted from the sample speech waveform.
After the sample speech waveform is obtained, sample acoustic parameters such as the mel spectrum, cepstrum, and fundamental frequency can be extracted from it. The sample acoustic parameters and the waveform values of the current waveform points in the plurality of sample subsequences are input into the prediction model, and the predicted waveform value of the next waveform point in each sample subsequence is obtained based on the prediction model. That is, the predicted waveform value of the next waveform point is predicted from the waveform value of the current waveform point, and in this way the predicted waveform value of each waveform point is obtained. The predicted waveform values of the next waveform points in the plurality of sample subsequences are obtained simultaneously, i.e., the plurality of sample subsequences share the model parameters.
Step S13, training a prediction model by using the sample acoustic parameters and a plurality of sample subsequences after time delay processing.
Specifically, referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of step S13 in fig. 1, and the prediction model may be trained by the following steps.
Step S31, sequentially taking the first waveform point to the last waveform point in the plurality of sample subsequences as a current waveform point, and obtaining the predicted waveform value of the next waveform point in the plurality of sample subsequences based on a prediction model by utilizing the sample acoustic parameters and the waveform values of the current waveform point in the plurality of sample subsequences.
As described above, the prediction model obtains the predicted waveform value of the next waveform point from the waveform value of the current one, point by point, and does so for all sample subsequences simultaneously. Therefore, the first to last waveform points in the plurality of sample subsequences are used as the current waveform point in turn, and the predicted waveform values of the next waveform points are obtained point by point with the prediction model. For example, the 7th waveform points of the delayed sample subsequences are input into the prediction model to obtain the predicted waveform values of the 8th waveform points; then the 8th waveform points are input to obtain the predicted waveform values of the 9th waveform points.
Step S32, updating the prediction model by using a loss function between the predicted waveform value and the actual waveform value of the corresponding waveform point to obtain a trained prediction model.
After the prediction model is constructed, it needs to be trained to obtain optimal model parameters, so that it can subsequently be used to predict speech waveforms. After the predicted waveform values of the next waveform points in the plurality of sample subsequences are obtained, and since the actual waveform values of those points are known, a cross-entropy loss function between the predicted and actual waveform values of the corresponding waveform points is first calculated, and the model parameters of the prediction model are then updated with it based on the back-propagation algorithm, yielding the trained prediction model.
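A minimal sketch of the per-point training loss (illustrative only; actual training would use a deep-learning framework's cross-entropy and optimizer): each of the B output heads yields a 256-way probability distribution, and the loss sums the negative log-probability of the true quantized waveform values.

```python
import math

def cross_entropy_loss(predicted_dists, actual_values):
    """Sum of -log p(actual) over the B subsequences.

    predicted_dists: B probability distributions over the 256 waveform values.
    actual_values:   B true (quantized) waveform values in [0, 255].
    """
    return sum(-math.log(dist[v]) for dist, v in zip(predicted_dists, actual_values))

# A fully confident, correct prediction gives zero loss:
one_hot = [0.0] * 256
one_hot[42] = 1.0
```

With uniform (maximally uncertain) distributions over 256 values, the loss per subsequence is log 256, which back-propagation then drives down by concentrating probability on the actual values.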
In this embodiment, the waveform values of the current waveform points in the plurality of sample subsequences are input into the prediction model simultaneously, so the predicted waveform values of the next waveform points in all subsequences are obtained simultaneously. Compared with a prediction model trained directly on the sample speech waveform as in the prior art, the prediction model of this embodiment requires only 1/B of the original amount of computation when predicting a speech waveform, where B is the number of sample subsequences. The application therefore reduces the computation needed to predict and generate a speech waveform and improves generation efficiency, enabling real-time waveform generation with little risk of stuttering.
In some embodiments, referring to fig. 6, fig. 6 is a flowchart illustrating an embodiment of step S12 in fig. 1, where the prediction model may be constructed by the following steps.
Step S41, sequentially connecting a plurality of convolution layers and a plurality of first fully connected layers in series to form a condition network; sequentially connecting a plurality of recurrent neural network layers in series, and connecting the last recurrent neural network layer to a plurality of second fully connected layers to form a point-level network; wherein the second fully connected layers correspond one-to-one to the sample subsequences.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of the prediction model. The prediction model is formed by a condition network 710 and a point-level network 720: a plurality of convolution layers 711 and a plurality of first fully connected layers 712 (FC) are connected in series to form the condition network 710, while a plurality of recurrent neural network layers 721 (gated recurrent units, GRU) are connected in series and the last GRU is connected to a plurality of second fully connected layers 722, forming the point-level network 720. Fig. 7 schematically illustrates the case where two one-dimensional convolution layers (Conv1d, kernel size = 3) and two first fully connected layers FC form the condition network, and 2 GRU layers and 4 second fully connected layers FC form the point-level network; the second fully connected layers FC correspond one-to-one to the sample subsequences, i.e., their number is the first value B.
Step S42, connecting the condition network and the point-level network in series to form the prediction model, wherein the first convolution layer takes the sample acoustic parameters as input, the first recurrent neural network layer takes the output of the last first fully connected layer and the waveform values of the current waveform points of the plurality of sample subsequences as input, each remaining recurrent neural network layer takes the output of the last first fully connected layer and the output of the previous recurrent neural network layer as input, and the plurality of second fully connected layers are the output layers of the prediction model.
With continued reference to fig. 7, after the condition network and the point-level network are formed, the two are connected in series to form the prediction model. The condition network takes the sample acoustic parameters (acoustic features, AF) as input and outputs the corresponding hidden-layer variable f: specifically, the sample acoustic parameters are fed into the first convolution layer, and the last first fully connected layer outputs f. The hidden-layer variable f is input into every recurrent neural network layer GRU to strengthen the control exerted by the acoustic parameters; the first GRU additionally receives the waveform values of the current waveform points of the plurality of sample subsequences, e.g., (x_t, x_{t-D}, x_{t-2D}, …, x_{t-(B-1)D}), and every other GRU additionally receives the output of the previous GRU. Each second fully connected layer takes the output of the last GRU as input and serves as an output layer of the prediction model. In fig. 7 the waveform values of the current waveform points are written as (x_t, x_{t-7}, x_{t-14}, x_{t-21}), i.e., B = 4, D = 7.
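The wiring just described can be sketched in PyTorch. This is an illustrative sketch, not the patent's implementation: the channel widths, the number of acoustic-parameter dimensions, and the frame-averaging of the conditioning output are assumptions made to keep the example self-contained; only the topology (Conv1d ×2 + FC ×2 condition network, GRU ×2 + B output heads point-level network) follows fig. 7.

```python
import torch
import torch.nn as nn

class WaveformPredictor(nn.Module):
    """Sketch of the fig. 7 topology; all layer sizes are illustrative."""

    def __init__(self, n_acoustic=80, hidden=512, hidden2=16, B=4, n_levels=256):
        super().__init__()
        # Condition network: two Conv1d layers (kernel size 3) + two FC layers.
        self.conv = nn.Sequential(
            nn.Conv1d(n_acoustic, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(128, 128), nn.ReLU(),
                                nn.Linear(128, 128), nn.ReLU())
        # Point-level network: first GRU sees f plus the B current values,
        # second GRU sees f plus the first GRU's output.
        self.gru1 = nn.GRUCell(128 + B, hidden)
        self.gru2 = nn.GRUCell(128 + hidden, hidden2)
        # One output head (256-way distribution) per sample subsequence.
        self.heads = nn.ModuleList([nn.Linear(hidden2, n_levels) for _ in range(B)])

    def forward(self, acoustic, current, h1, h2):
        # acoustic: (batch, n_acoustic, frames); current: (batch, B) values.
        f = self.conv(acoustic).mean(dim=2)   # crude frame pooling (assumption)
        f = self.fc(f)                        # hidden-layer variable f
        h1 = self.gru1(torch.cat([f, current], dim=1), h1)
        h2 = self.gru2(torch.cat([f, h1], dim=1), h2)
        # B next-point distributions, produced simultaneously.
        return [head(h2) for head in self.heads], h1, h2
```

A forward pass with batch size 2 and B = 4 returns four (2, 256) logit tensors, one per subsequence, plus the updated GRU states for the next autoregressive step.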
In addition, each second fully connected layer outputs a probability distribution over the predicted waveform values of the next waveform point in the corresponding sample subsequence. For example, after µ-law 8-bit quantization, the predicted waveform value can take 256 integer values within [0, 255], so the second fully connected layer outputs a probability distribution over those 256 values. The known actual waveform value defines the target distribution, with probability 1 for the actual value and 0 for all others, and the cross-entropy loss function is calculated between the two distributions. That is, each element of the output (x_{t+4}, x_{t-3}, x_{t-10}, x_{t-17}) of the second fully connected layers in fig. 7 stands for a probability distribution.
The first GRU layer uses a larger number of hidden nodes (e.g., 512) so that it can memorize historical waveform-point information; that is, the prediction of the next waveform value can draw on all previously input waveform-point information, which improves prediction accuracy. The second GRU layer uses a smaller number of hidden nodes (e.g., 16) to abstract the hidden information of the first GRU layer into a lower dimension. In other embodiments, a single recurrent neural network layer GRU may also be employed.
Furthermore, the last first fully connected layer produces output at a first time-domain resolution, while the plurality of second fully connected layers produce output at a second time-domain resolution, and the first time-domain resolution is lower than the second. That is, between two successive outputs of hidden-layer information by the condition network, the point-level network performs multiple computations, during which the current hidden-layer information is fed to the point-level network. In other words, before the next output of the last first fully connected layer, its current output is input into the plurality of recurrent neural network layers.
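The two time-domain resolutions can be illustrated with a small sketch (illustrative names, not from the patent): the condition network emits one conditioning vector per frame, and that vector is reused for every waveform point the frame covers, until the next frame's vector arrives.

```python
def broadcast_frames(frame_outputs, points_per_frame):
    """Repeat each frame-level conditioning vector once per waveform point
    it covers, matching the point-level network's higher time resolution."""
    return [f for f in frame_outputs for _ in range(points_per_frame)]

# Two condition-network frames, each covering 3 point-level steps:
point_inputs = broadcast_frames(["f0", "f1"], points_per_frame=3)
```

This is why the condition network is cheap relative to the point-level network: it runs once per frame, while the GRUs run once per waveform point.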
In this embodiment, a prediction model is constructed from convolution layers, fully connected layers, and recurrent neural networks, and the waveform values of the current waveform points in the plurality of sample subsequences are input into the prediction model simultaneously, so the predicted waveform values of the next waveform points in all subsequences are obtained simultaneously. The application therefore reduces the computation needed to predict and generate a speech waveform and improves generation efficiency, enabling real-time waveform generation with little risk of stuttering.
Based on the same inventive concept, the present application also provides a method for predicting a speech waveform, referring to fig. 8, fig. 8 is a flowchart of an embodiment of the method for predicting a speech waveform according to the present application, where the method for predicting a speech waveform includes the following steps.
Step S51, acquiring text acoustic parameters based on the text information, and setting a plurality of initialization waveform values.
When text information needs to be converted into speech information, a text analysis model in a text-to-speech (TTS) system first converts the text information into a phoneme sequence, and an acoustic model in the TTS system then converts the phoneme sequence into text acoustic parameters, so that the text acoustic parameters can be input into the prediction model to obtain the speech waveform. In addition to acquiring the text acoustic parameters, a plurality of initialization waveform values must be set for the subsequent prediction.
Step S52, inputting the text acoustic parameters and the initialized waveform values into a prediction model, and obtaining a plurality of prediction subsequences by using the prediction model; wherein the predicted subsequences are in one-to-one correspondence with the initialization waveform values.
In some embodiments, before the text acoustic parameters and the initialization waveform values are input into the prediction model, the prediction model may be obtained by the technical scheme described in any of the above embodiments of the method for obtaining a prediction model, and then applied in this method for predicting a speech waveform. For the specific method of obtaining the prediction model, reference may be made to any of the above embodiments; it is not repeated here.
The acquired text acoustic parameters and the initialization waveform values are input into the prediction model, and a plurality of prediction subsequences are obtained through autoregressive generation by the prediction model, where the prediction subsequences are in one-to-one correspondence with the initialization waveform values. That is, each initialization waveform value gives rise to one prediction subsequence, and the prediction model obtains the plurality of prediction subsequences simultaneously.
Step S53, obtaining the predicted voice waveform corresponding to the text information according to the plurality of predicted subsequences.
After the plurality of prediction subsequences are obtained using the prediction model, they need to be combined into the complete predicted speech waveform corresponding to the text information for subsequent processing. Referring to fig. 9, fig. 9 is a flowchart illustrating an embodiment of step S53 in fig. 8; the predicted speech waveform corresponding to the text information may be obtained by the following steps.
In step S61, all waveform points respectively included in the plurality of prediction subsequences are aligned in the time domain.
With continued reference to fig. 3 and fig. 4, according to the embodiments of the method for obtaining the prediction model, the plurality of prediction subsequences obtained using the prediction model carry time delays and are not aligned in the time domain. All waveform points included in the plurality of prediction subsequences therefore need to be aligned in the time domain, which is the inverse of the time-delay processing, i.e., the transformation from fig. 4 back to fig. 3. After alignment, the waveform points corresponding to the same serial number in the plurality of prediction subsequences are waveform points arranged consecutively in the predicted speech waveform.
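As an illustrative sketch only (assuming, consistent with the delay processing described in the embodiments above, that the i-th subsequence was delayed by i steps of front padding), the alignment step might look like:

```python
def align_subsequences(delayed):
    """Undo the time-delay processing: the i-th subsequence was delayed by
    i steps, so drop its first i (padding) points and trim all subsequences
    to a common length so same-index points line up in the time domain."""
    shifted = [seq[i:] for i, seq in enumerate(delayed)]
    n = min(len(s) for s in shifted)
    return [s[:n] for s in shifted]

# 3 delayed subsequences; subsequence i begins with i padding zeros
delayed = [[11, 12, 13, 14],
           [0, 21, 22, 23],
           [0, 0, 31, 32]]
aligned = align_subsequences(delayed)
assert aligned == [[11, 12], [21, 22], [31, 32]]
```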
Step S62, sequentially taking the first waveform point to the last waveform point of the plurality of aligned predicted sub-sequences as the current waveform point, and sequentially arranging the predicted waveform values of the current waveform points in the plurality of predicted sub-sequences according to the serial numbers of the predicted sub-sequences to obtain a predicted voice waveform.
After all waveform points included in the plurality of prediction subsequences are aligned in the time domain, the waveform points corresponding to the same serial number in the plurality of prediction subsequences are waveform points arranged consecutively in the predicted speech waveform. Therefore, the first to last waveform points of the aligned prediction subsequences can be taken in turn as the current waveform point, and the predicted waveform values of the current waveform point in the plurality of prediction subsequences can be arranged in order of the serial numbers of the prediction subsequences to obtain the predicted speech waveform.
For example, if the initialization waveform values set in step S51 are (128, 128, 128, 128), 4 prediction subsequences are obtained, and the 1st waveform points of the first to fourth prediction subsequences become the 1st to 4th waveform points of the predicted speech waveform.
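A minimal sketch of this interleaving step, with invented data (the function name and values are illustrative, not taken from the patent):

```python
def interleave(subsequences):
    """For each time index, take the predicted value from subsequence
    1, 2, ..., B in order of subsequence serial number."""
    waveform = []
    for values_at_t in zip(*subsequences):
        waveform.extend(values_at_t)
    return waveform

# B = 4 aligned prediction subsequences, 2 waveform points each:
# point 1 of each subsequence becomes points 1-4 of the waveform, etc.
subs = [[10, 14], [11, 15], [12, 16], [13, 17]]
assert interleave(subs) == [10, 11, 12, 13, 14, 15, 16, 17]
```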
After the predicted speech waveform is obtained, if the waveform value of each waveform point is represented by a quantized value and the sample speech waveform was pre-emphasized beforehand, the method further comprises the following step:
inversely quantizing the predicted waveform value of each waveform point in the predicted speech waveform into the amplitude value of each waveform point, and performing inverse pre-emphasis processing on the inversely quantized predicted speech waveform.
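A sketch of these two post-processing steps, assuming standard 8-bit μ-law quantization and a conventional first-order pre-emphasis filter y[n] = x[n] − α·x[n−1] (the constants and function names are assumptions for illustration, not specified by the patent):

```python
import math

MU = 255.0  # mu-law companding constant for 8-bit quantization

def mu_law_expand(q, bits=8):
    """Inverse quantization: map an 8-bit value in [0, 255] back to an
    amplitude in (-1, 1) by inverting the mu-law companding curve."""
    levels = 2 ** bits
    x = 2.0 * (q + 0.5) / levels - 1.0               # recenter to (-1, 1)
    return math.copysign((math.pow(1.0 + MU, abs(x)) - 1.0) / MU, x)

def de_emphasize(samples, alpha=0.97):
    """Inverse pre-emphasis: pre-emphasis computes y[n] = x[n] - alpha*x[n-1],
    so its inverse is the recursion x[n] = y[n] + alpha*x[n-1]."""
    out, prev = [], 0.0
    for y in samples:
        prev = y + alpha * prev
        out.append(prev)
    return out

amps = [mu_law_expand(q) for q in [0, 127, 128, 255]]
assert all(-1.0 < a < 1.0 for a in amps)
assert amps[0] < 0 < amps[-1]
waveform = de_emphasize(amps)
```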
In this embodiment, a plurality of prediction subsequences can be obtained simultaneously using the prediction model and combined into the predicted speech waveform corresponding to the text information. Compared with a prior-art prediction model trained directly on the sample speech waveform, the prediction model of this embodiment requires only 1/B of the original amount of computation when predicting the speech waveform, where B is the number of prediction subsequences. The present application therefore reduces the amount of computation needed to predict and generate the speech waveform and improves generation efficiency, making real-time speech waveform generation feasible with little risk of stuttering.
In some embodiments, referring to fig. 10, fig. 10 is a flowchart illustrating an embodiment of step S52 in fig. 8, where a plurality of prediction subsequences may be obtained by using a prediction model as follows.
In step S71, the plurality of initialization waveform values are respectively used as the waveform values of the first waveform point in the plurality of prediction subsequences.
As described above, the initialized waveform values are in one-to-one correspondence with the predicted sub-sequences, so that before prediction, the initialized waveform values are used as the waveform values of the first waveform point in the predicted sub-sequences, respectively, so as to start the prediction process in combination with the text acoustic parameters.
Step S72, repeatedly executing the steps of inputting the waveform value of the previous waveform point in the plurality of prediction subsequences and the text acoustic parameters into the prediction model, and obtaining the predicted waveform value of the current waveform point in the plurality of prediction subsequences by using the prediction model so as to obtain the predicted waveform value of each waveform point in the plurality of prediction subsequences.
The prediction model obtains the predicted waveform values point by point, and can obtain a plurality of predicted waveform values at the same time. Therefore, the steps of inputting the waveform values of the previous waveform points in the plurality of prediction subsequences and the text acoustic parameters into the prediction model, and obtaining the predicted waveform values of the current waveform points in the plurality of prediction subsequences using the prediction model, are executed repeatedly, so that the predicted waveform value of each waveform point in the plurality of prediction subsequences is obtained autoregressively and the plurality of prediction subsequences are thereby obtained. The initialization waveform values are used when obtaining the predicted waveform values of the second waveform points, the predicted waveform values of the second waveform points are used when obtaining those of the third waveform points, and so on.
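The repeated-execution loop of steps S71 and S72 can be sketched as follows; `model` is any callable standing in for the trained prediction model, and the toy model used below is purely illustrative:

```python
def predict_subsequences(model, acoustic_params, init_values, num_points):
    """Autoregressive generation: at each step the model maps the previous
    waveform values of all B subsequences (plus the acoustic condition)
    to the next B predicted values, one per subsequence."""
    subsequences = [[v] for v in init_values]   # step S71: first points
    current = list(init_values)
    for _ in range(num_points - 1):             # step S72: repeat
        current = model(current, acoustic_params)
        for seq, value in zip(subsequences, current):
            seq.append(value)
    return subsequences

# toy stand-in model: next value = previous value + 1, condition ignored
toy_model = lambda prev, cond: [v + 1 for v in prev]
subs = predict_subsequences(toy_model, None, [128, 128], num_points=3)
assert subs == [[128, 129, 130], [128, 129, 130]]
```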
Specifically, referring to fig. 11, fig. 11 is a flowchart illustrating an embodiment of step S72 in fig. 10, a predicted waveform value of a current waveform point in a plurality of predicted sub-sequences may be obtained by the following steps.
Step S81, inputting the waveform value of the previous waveform point in the plurality of prediction subsequences and the text acoustic parameters into the prediction model, and obtaining a plurality of probability distributions of the current waveform point output by the prediction model, wherein the plurality of probability distributions comprise probabilities corresponding to all the prediction waveform values.
In this embodiment, the prediction model outputs a probability distribution over all possible predicted waveform values. After the waveform values of the previous waveform points in the plurality of prediction subsequences and the text acoustic parameters are input into the prediction model, a probability distribution over all possible predicted waveform values of each current waveform point is obtained. For example, after 8-bit μ-law quantization, the possible predicted waveform values are the 256 integers in [0, 255], so each probability distribution consists of 256 probabilities that sum to 1.
Step S82, randomly sampling a plurality of probability distributions to obtain respective corresponding predicted waveform values.
After the probability distributions of the plurality of current waveform points, which correspond one-to-one to the initialization waveform values, are obtained, each probability distribution is randomly sampled to obtain the corresponding predicted waveform value. That is, a predicted waveform value is drawn at random according to the 256 probabilities, and predicted waveform values equal in number to the initialization waveform values are finally obtained at the same time.
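A minimal sketch of this random-sampling step (a toy 4-value distribution stands in for the 256-value μ-law distribution; the helper name is illustrative):

```python
import random

def sample_waveform_value(probabilities, rng=random):
    """Randomly draw one quantized waveform value, where probabilities[k]
    is the model's probability that the value equals k."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# toy 4-value distribution standing in for the 256-value output
assert sample_waveform_value([0.0, 0.0, 1.0, 0.0]) == 2   # all mass on value 2
value = sample_waveform_value([0.1, 0.2, 0.3, 0.4])
assert 0 <= value <= 3
```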
In this embodiment, a plurality of prediction subsequences can be obtained simultaneously using the prediction model, which reduces the amount of computation needed to predict and generate the predicted speech waveform and improves generation efficiency, making real-time speech waveform generation feasible with little risk of stuttering.
In addition, referring to fig. 12, fig. 12 is a schematic structural diagram of an embodiment of an electronic device according to the present application. The electronic device includes a memory 1210 and a processor 1220 coupled to each other; the memory 1210 stores program instructions, and the processor 1220 can execute the program instructions to implement the method for obtaining a prediction model of the above embodiments and/or the method for predicting a speech waveform of the above embodiments. For details, reference may be made to any of the above embodiments; they are not repeated here.
The present application further provides a computer-readable storage medium. Referring to fig. 13, fig. 13 is a schematic structural diagram of an embodiment of the computer-readable storage medium of the present application. The storage medium 1300 stores program instructions 1310, and the program instructions 1310 can be executed by a processor to implement the method for obtaining a prediction model of the above embodiments and/or the method for predicting a speech waveform of the above embodiments. For details, reference may be made to any of the above embodiments; they are not repeated here.
The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and the drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the present application.

Claims (13)

1. A method of obtaining a predictive model, comprising:
dividing a sample voice waveform into a plurality of sample subsequences, and performing time delay processing on the plurality of sample subsequences;
Constructing a prediction model; the prediction model takes a sample acoustic parameter and waveform values of current waveform points of the plurality of sample subsequences as inputs, and takes predicted waveform values of next waveform points of the plurality of sample subsequences as outputs; wherein the sample acoustic parameters are extracted from the sample speech waveform;
training the prediction model using the sample acoustic parameters and the plurality of sample subsequences after the time delay processing.
2. The method of claim 1, wherein the step of constructing a predictive model comprises:
sequentially connecting a plurality of convolutional layers and a plurality of first fully-connected layers in series to form a condition network; sequentially connecting a plurality of recurrent neural network layers in series, and connecting the last recurrent neural network layer respectively in series with a plurality of second fully-connected layers, to form a point-level network; wherein the second fully-connected layers are in one-to-one correspondence with the sample subsequences;
connecting the condition network and the point-level network in series to form the prediction model; wherein the first convolutional layer takes the sample acoustic parameters as input, the first recurrent neural network layer takes the output of the last first fully-connected layer and the waveform values of the current waveform points of the plurality of sample subsequences as input, the remaining recurrent neural network layers each take the output of the last first fully-connected layer and the output of the preceding recurrent neural network layer as input, and the plurality of second fully-connected layers are the output layers of the prediction model.
3. The method of claim 2, wherein the last first fully-connected layer outputs at a first time-domain resolution, the plurality of second fully-connected layers output at a second time-domain resolution, and the first time-domain resolution is lower than the second time-domain resolution; and before the next output of the last first fully-connected layer, the current output of the last first fully-connected layer is input to the plurality of recurrent neural network layers.
4. The method of claim 1, wherein the number of the sample subsequences is a first value, the sample speech waveform comprises waveform points whose number is an integer multiple of the first value, the multiple being a second value, and the steps of dividing the sample speech waveform into a plurality of sample subsequences and performing time delay processing on the plurality of sample subsequences comprise:
dividing the sample speech waveform into a plurality of sample subsequences each comprising the second value of waveform points, wherein the waveform points corresponding to the same serial number in the first to last sample subsequences are the first value of waveform points arranged consecutively in the sample speech waveform, and any two adjacent waveform points in each sample subsequence correspond to two waveform points whose original serial numbers in the sample speech waveform are spaced by the first value;
and sequentially delaying the first to last sample subsequences by preset multiples of a preset duration, wherein the preset multiples respectively corresponding to the first to last sample subsequences are integers from 0 to the first value minus 1.
5. The method of claim 1, wherein the step of training the predictive model using the sample acoustic parameters and the number of sample subsequences after delay processing comprises:
Sequentially taking the first waveform point to the last waveform point in the plurality of sample subsequences as a current waveform point, and obtaining a predicted waveform value of a next waveform point in the plurality of sample subsequences based on the prediction model by utilizing the sample acoustic parameters and the waveform values of the current waveform point in the plurality of sample subsequences;
And updating the prediction model by using a loss function between the predicted waveform value and the actual waveform value of the corresponding waveform point so as to obtain the trained prediction model.
6. The method of claim 1, wherein prior to the step of dividing the sample speech waveform into a number of sample subsequences, further comprising:
and pre-emphasis processing is carried out on the sample voice waveform, and the original amplitude value of each waveform point in the sample voice waveform is quantized into the waveform value of each waveform point.
7. A method for predicting a speech waveform, comprising:
acquiring text acoustic parameters based on text information, and setting a plurality of initialization waveform values;
inputting the text acoustic parameters and the initialization waveform values into a prediction model, and obtaining a plurality of prediction subsequences using the prediction model; wherein the prediction subsequences are in one-to-one correspondence with the initialization waveform values; the prediction model is trained using sample acoustic parameters and a plurality of sample subsequences after time delay processing, the plurality of sample subsequences being obtained by dividing a sample speech waveform; the prediction model takes the sample acoustic parameters and the waveform values of the current waveform points of the plurality of sample subsequences as input, and the predicted waveform values of the next waveform points of the plurality of sample subsequences as output; wherein the sample acoustic parameters are extracted from the sample speech waveform;
and obtaining predicted voice waveforms corresponding to the text information according to the plurality of predicted subsequences.
8. The method of claim 7, wherein the step of inputting the text acoustic parameters and the initialization waveform values into a predictive model and obtaining a number of predicted subsequences using the predictive model comprises:
Respectively taking the plurality of initialization waveform values as waveform values of a first waveform point in the plurality of prediction subsequences;
And repeatedly executing the steps of inputting the waveform value of the previous waveform point in the plurality of prediction subsequences and the text acoustic parameter into the prediction model, and obtaining the predicted waveform value of the current waveform point in the plurality of prediction subsequences by utilizing the prediction model so as to obtain the predicted waveform value of each waveform point in the plurality of prediction subsequences.
9. The method of claim 8, wherein the step of inputting the waveform value of the previous waveform point in the plurality of predicted sub-sequences and the text acoustic parameter into the prediction model and obtaining the predicted waveform value of the current waveform point in the plurality of predicted sub-sequences using the prediction model comprises:
inputting the waveform value of the previous waveform point in the plurality of prediction subsequences and the text acoustic parameters into the prediction model, and obtaining a plurality of probability distributions of the current waveform point output by the prediction model, wherein the plurality of probability distributions comprise probabilities corresponding to all the predicted waveform values;
And randomly sampling the probability distributions to obtain a plurality of corresponding predicted waveform values.
10. The method of claim 8, wherein the step of obtaining predicted speech waveforms corresponding to the text information from the plurality of predicted sub-sequences comprises:
aligning all waveform points respectively included by the plurality of prediction subsequences in a time domain;
Sequentially taking the first waveform point to the last waveform point of the plurality of the predicted subsequences after alignment as a current waveform point, and sequentially arranging predicted waveform values of the current waveform points in the plurality of the predicted subsequences according to sequence numbers of the predicted subsequences to obtain the predicted voice waveform.
11. The method of claim 8, further comprising, after the step of obtaining predicted speech waveforms corresponding to the text information from the plurality of predicted subsequences:
And inversely quantizing the predicted waveform value of each waveform point in the predicted voice waveform into the amplitude value of each waveform point, and performing inverse pre-emphasis processing on the predicted voice waveform after inverse quantization.
12. An electronic device comprising a memory and a processor coupled to each other, the memory storing program instructions that are executable by the processor to implement the method of obtaining a predictive model according to any one of claims 1-6 and/or the method of predicting a speech waveform according to any one of claims 7-11.
13. A computer readable storage medium having stored thereon program instructions executable by a processor to implement the method of obtaining a predictive model according to any of claims 1-6 and/or the method of predicting a speech waveform according to any of claims 7-11.
CN202011627633.7A 2020-12-31 2020-12-31 Method for obtaining prediction model, prediction method of voice waveform and related device Active CN112767957B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011627633.7A CN112767957B (en) 2020-12-31 2020-12-31 Method for obtaining prediction model, prediction method of voice waveform and related device

Publications (2)

Publication Number Publication Date
CN112767957A CN112767957A (en) 2021-05-07
CN112767957B true CN112767957B (en) 2024-05-31

Family

ID=75699304

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004233774A (en) * 2003-01-31 2004-08-19 Nippon Telegr & Teleph Corp <Ntt> Speech synthesizing method, speech synthesizing device and speech synthesizing program
JP2007318691A (en) * 2006-05-29 2007-12-06 Nippon Telegr & Teleph Corp <Ntt> Apparatus and method for determining linear prediction model degree, program thereof and recording medium
CN104112444A (en) * 2014-07-28 2014-10-22 中国科学院自动化研究所 Text message based waveform concatenation speech synthesis method
CN106782577A (en) * 2016-11-11 2017-05-31 陕西师范大学 A kind of voice signal coding and decoding methods based on Chaotic time series forecasting model
CN107945786A (en) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 Phoneme synthesizing method and device
CN108573694A (en) * 2018-02-01 2018-09-25 北京百度网讯科技有限公司 Language material expansion and speech synthesis system construction method based on artificial intelligence and device
CN108806665A (en) * 2018-09-12 2018-11-13 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109036371A (en) * 2018-07-19 2018-12-18 北京光年无限科技有限公司 Audio data generation method and system for speech synthesis
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110634476A (en) * 2019-10-09 2019-12-31 深圳大学 Method and system for rapidly building robust acoustic model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11462209B2 (en) * 2018-05-18 2022-10-04 Baidu Usa Llc Spectrogram to waveform synthesis using convolutional networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Udaya Bhaskar et al., "Low Bit-Rate Voice Compression Based on Frequency Domain Interpolative Techniques," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 2, March 2006. *
He Yanping, "Speech Signal Synthesis Based on Linear Prediction," Journal of Northwest Minzu University (Natural Science Edition), vol. 31, no. 4. *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230523

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.

GR01 Patent grant