CN112562655A - Residual network training and speech synthesis method, device, equipment and medium

Info

Publication number: CN112562655A
Application number: CN202011406146.8A
Authority: CN (China)
Legal status: Pending
Prior art keywords: Gaussian, sampling point, signal value, residual, frame
Other languages: Chinese (zh)
Inventors: 朱晓旭, 张大成
Applicant and assignee: Beijing Orion Star Technology Co Ltd

Classifications

    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 13/02 - Methods for producing synthetic speech; speech synthesisers
    • G10L 19/04 - Speech or audio signal analysis-synthesis for redundancy reduction using predictive techniques
    • G10L 19/16 - Vocoder architecture


Abstract

The invention discloses a residual network training and speech synthesis method, device, equipment and medium. The sampling-point subnetwork in the trained residual network directly obtains the Gaussian parameter vector of a speech frame at the current sampling point from the condition vector and the first residual signal value at the previous sampling point. The Gaussian parameter vector comprises the weight value, Gaussian mean, and Gaussian standard deviation corresponding to each of a set number of Gaussian algorithms, so the second residual signal value of the speech frame at the current sampling point can subsequently be obtained from the Gaussian parameter vector. This simplifies and speeds up the determination of the second residual signal value, removes the need for data type and dimension transformations, and greatly reduces the resources those transformations consume.

Description

Residual network training and speech synthesis method, device, equipment and medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular to a residual network training and speech synthesis method, apparatus, device, and medium.
Background
Vocoders are important tools for synthesizing speech signals from acoustic features such as the spectrum, and are mainly based on source-filter models. From the source-filter model and the obtained acoustic features, a sound-source excitation signal and a vocal-tract response signal can be constructed to simulate the sound produced when airflow generated by the lungs passes through the whole vocal tract system (including the trachea, vocal cords, and oral cavity) during human phonation.
In the prior art, to obtain the speech signal corresponding to a text to be processed, at least one acoustic feature vector corresponding to the text is first obtained through a speech synthesis model based on the text features of the text. Then, for each acoustic feature vector, the predicted speech signal of the corresponding speech frame is obtained, and a residual network determines the residual signal value of that speech frame at each sampling point. The residual signal value at each sampling point is then converted from integer (int) data to floating-point (float) data through an inverse mu-law transformation, and the synthesized speech signal corresponding to the text is determined from the converted residual signal values and the predicted signal values of each speech frame at each sampling point.
In the above speech synthesis method, after the residual signal values of the speech frame corresponding to any acoustic feature vector are obtained through the residual network at each sampling point, they must still be converted to another data type through the inverse mu-law transformation. This makes the speech synthesis process very complicated, reduces synthesis efficiency, and consumes a large amount of time on the data type conversion.
Disclosure of Invention
The embodiments of the invention provide a residual network training and speech synthesis method, device, equipment and medium, to solve the problems that the existing speech synthesis process is very complicated and inefficient and that the transformation of data types and dimensions consumes a large amount of resources.
An embodiment of the invention provides a residual network training method, comprising:
for each speech frame in any speech sample in a sample set, each sampling point of the speech frame corresponds to a first residual signal value, the first residual signal value being determined from the true signal value and the predicted signal value of the speech frame at that sampling point, where the predicted signal value of the speech frame at the sampling point is determined from the acoustic feature vector of the speech frame and a physical vocoder;
for each speech frame contained in the speech sample, obtaining a condition vector of the speech frame through a frame-level subnetwork in an original residual network based on the acoustic feature vector of the speech frame; through a sampling-point subnetwork in the original residual network, for each sampling point, obtaining a Gaussian parameter vector of the speech frame at the current sampling point from the condition vector and the first residual signal value of the speech frame at the previous sampling point; and determining a second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, the Gaussian parameter vector comprising the weight value, Gaussian mean, and Gaussian standard deviation corresponding to each of a set number of Gaussian algorithms; and
training the original residual network according to the first and second residual signal values of each speech frame contained in the speech sample at each sampling point.
An embodiment of the invention provides a speech synthesis method based on the trained residual network, comprising:
obtaining, through a speech synthesis model, at least one acoustic feature vector corresponding to a text to be processed based on the text features of the text;
for each acoustic feature vector, obtaining through a physical vocoder the predicted speech signal of the speech frame corresponding to the acoustic feature vector; obtaining through the residual network the second residual signal value of that speech frame at each sampling point; and, for each sampling point, determining the corresponding predicted signal value in the predicted speech signal and determining the synthesized signal value of the speech frame at that sampling point from the predicted signal value and the second residual signal value of the frame at that sampling point; and
determining the synthesized speech signal corresponding to the text to be processed from the synthesized signal values of the speech frames corresponding to the acoustic feature vectors at each sampling point, in sequence.
An embodiment of the invention provides a residual network training device, comprising:
a determining module, configured to, for each speech frame in any speech sample in a sample set, correspond a first residual signal value to each sampling point of the speech frame, the first residual signal value being determined from the true signal value and the predicted signal value of the speech frame at that sampling point, where the predicted signal value is determined from the acoustic feature vector of the speech frame and a physical vocoder;
a prediction module, configured to obtain, for each speech frame contained in the speech sample, the condition vector of the frame through a frame-level subnetwork in an original residual network based on the acoustic feature vector of the frame; to obtain, through a sampling-point subnetwork in the original residual network, for each sampling point, the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector and the first residual signal value of the speech frame at the previous sampling point; and to determine the second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, the Gaussian parameter vector comprising the weight value, Gaussian mean, and Gaussian standard deviation corresponding to each of a set number of Gaussian algorithms; and
a training module, configured to train the original residual network according to the first and second residual signal values of each speech frame contained in the speech sample at each sampling point.
An embodiment of the invention provides a speech synthesis device based on the residual network obtained by the above training method, comprising:
an acquiring unit, configured to obtain, through a speech synthesis model, at least one acoustic feature vector corresponding to a text to be processed based on the text features of the text;
a prediction unit, configured to, for each acoustic feature vector, obtain through a physical vocoder the predicted speech signal of the speech frame corresponding to the acoustic feature vector; obtain through the residual network the second residual signal value of that speech frame at each sampling point; and, for each sampling point, determine the corresponding predicted signal value in the predicted speech signal and determine the synthesized signal value of the speech frame at that sampling point from the predicted signal value and the second residual signal value of the frame at that sampling point; and
a determining unit, configured to determine the synthesized speech signal corresponding to the text to be processed from the synthesized signal values of the speech frames corresponding to the acoustic feature vectors at each sampling point, in sequence.
An embodiment of the present invention provides an electronic device comprising a processor configured, when executing a computer program stored in a memory, to implement the steps of the residual network training method described above or the steps of the speech synthesis method described above.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the residual network training method described above or the steps of the speech synthesis method described above.
With the above scheme, the sampling-point subnetwork in the trained residual network directly obtains the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector and the first residual signal value at the previous sampling point, the Gaussian parameter vector comprising the weight value, Gaussian mean, and Gaussian standard deviation corresponding to each of a set number of Gaussian algorithms. The second residual signal value of the speech frame at the current sampling point can then be obtained from this Gaussian parameter vector, which simplifies and speeds up its determination, eliminates the transformation of data types and dimensions, and greatly reduces the resources such transformations consume.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a residual network training process according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a residual network training process according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a speech synthesis process according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a residual network training apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of another electronic device according to an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
To improve the efficiency of obtaining a predicted speech signal, embodiments of the present invention provide a residual network training and speech synthesis method, apparatus, device, and medium.
Example 1: Fig. 1 is a schematic diagram of a residual network training process according to an embodiment of the present invention, the process comprising:
S101: for each speech frame in any speech sample in the sample set, each sampling point of the speech frame corresponds to a first residual signal value, determined from the true signal value and the predicted signal value of the speech frame at that sampling point, where the predicted signal value is determined from the acoustic feature vector of the speech frame and a physical vocoder.
The residual network training method provided by the embodiment of the invention is applied to an electronic device, which may be an intelligent device such as a robot, or a server.
To facilitate prediction of the residual signal present in a synthesized speech signal, a sample set for training the residual network needs to be collected in advance, so that the residual network can subsequently be trained on each speech sample in the set. The sample set includes a plurality of speech samples. For each speech frame contained in any speech sample, the speech frame has a first residual signal value at each sampling point, determined from the true signal value and the predicted signal value of the speech frame at that sampling point; that is, the difference between the true signal value and the predicted signal value of the speech frame at the sampling point is determined as the first residual signal value.
The true signal value of the speech frame at a sampling point is determined from the speech frame and a preconfigured sampling rate: given the sampling rate and the speech frame, the true signal value at each sampling point can be determined. The predicted signal value of the speech frame at a sampling point is determined from the acoustic feature vector of the speech frame and a physical vocoder: the physical vocoder obtains the predicted speech signal of the speech frame from its acoustic feature vector, and the predicted signal value at each sampling point is determined from that predicted speech signal and the preconfigured sampling rate. The vocoder may be, for example, an LPC vocoder or a WORLD vocoder.
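As a minimal sketch of this step (illustrative, not part of the patent), the first residual signal values of one frame are simply the per-sampling-point differences between the true and vocoder-predicted signal values; the frame length of 160 assumes a 10 ms frame at a 16 kHz sampling rate:

    import numpy as np

    def first_residual(true_frame, predicted_frame):
        """First residual signal value at each sampling point: true minus predicted."""
        true_frame = np.asarray(true_frame)
        predicted_frame = np.asarray(predicted_frame)
        assert true_frame.shape == predicted_frame.shape
        return true_frame - predicted_frame

    rng = np.random.default_rng(0)
    true_frame = rng.uniform(-1, 1, 160)                     # true signal values of one frame
    predicted_frame = true_frame + rng.normal(0, 0.05, 160)  # stand-in vocoder prediction
    e1 = first_residual(true_frame, predicted_frame)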
It should be noted that the acoustic feature vector of any speech frame may be obtained through a model, such as a speech synthesis (text-to-speech, TTS) model, or through a physical algorithm. The specific method of obtaining the acoustic feature vector can be set flexibly according to actual requirements and is not specifically limited here.
S102: for each speech frame contained in the speech sample, obtaining the condition vector of the frame through a frame-level subnetwork in the original residual network based on the acoustic feature vector of the frame; through a sampling-point subnetwork in the original residual network, for each sampling point, obtaining the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector and the first residual signal value of the speech frame at the previous sampling point; and determining the second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, the Gaussian parameter vector comprising the weight value, Gaussian mean, and Gaussian standard deviation corresponding to each of a set number of Gaussian algorithms.
In a specific implementation, any speech sample in the sample set is obtained, and the acoustic feature vectors of the speech frames it contains are input into the original residual network in sequence. Based on the acoustic feature vector of the currently input speech frame, the frame-level subnetwork in the original residual network obtains the condition vector of that frame, so that the subnetworks after the frame-level subnetwork can process the condition vector to obtain the second residual signal values.
In one possible implementation, the acoustic feature vector of a speech frame is input into the original residual network, and features are extracted from it by a set number of consecutive convolutional layers in the frame-level subnetwork, for example two convolutional layers. The feature vector output by the last convolutional layer is then upsampled by a set number of consecutive fully connected layers, for example two fully connected layers, and the vector of a set dimension output by the last fully connected layer is determined as the condition vector, for example a 128-dimensional condition vector.
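A hedged PyTorch sketch of such a frame-level subnetwork (the kernel width, activations, and the 63-dimensional input are illustrative assumptions; only the two-conv, two-FC, 128-dimensional-output structure follows the text above):

    import torch
    import torch.nn as nn

    class FrameLevelSubnet(nn.Module):
        """Frame-level subnetwork: two conv layers, then two fully connected layers."""

        def __init__(self, feat_dim=63, cond_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(feat_dim, cond_dim, kernel_size=3, padding=1), nn.Tanh(),
                nn.Conv1d(cond_dim, cond_dim, kernel_size=3, padding=1), nn.Tanh(),
            )
            self.fc = nn.Sequential(
                nn.Linear(cond_dim, cond_dim), nn.Tanh(),
                nn.Linear(cond_dim, cond_dim),
            )

        def forward(self, acoustic_feats):
            # acoustic_feats: (batch, n_frames, feat_dim)
            h = self.conv(acoustic_feats.transpose(1, 2)).transpose(1, 2)
            return self.fc(h)  # (batch, n_frames, cond_dim) condition vectors

    cond = FrameLevelSubnet()(torch.randn(1, 4, 63))  # 4 frames -> four 128-dim condition vectors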
After the condition vector is obtained by the frame-level subnetwork in the original residual network, the subsequent processing is performed on it by the sampling-point subnetwork in the original residual network.
In a specific implementation, after the sampling-point subnetwork receives the condition vector output by the frame-level subnetwork, it obtains, for each sampling point, the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector and the first residual signal value at the previous sampling point, and then processes that Gaussian parameter vector to determine the second residual signal value of the speech frame at the current sampling point.
In the embodiment of the present invention, the obtained Gaussian parameter vector of the speech frame at the current sampling point contains the weight value, Gaussian mean, and Gaussian standard deviation corresponding to each of a set number of Gaussian algorithms. The weight value of a Gaussian algorithm indicates that algorithm's influence on the second residual signal value of the speech frame at the current sampling point. The Gaussian mean is the mean of the residual signal values the speech frame may take at the current sampling point, i.e. their mathematical expectation. The Gaussian standard deviation is the degree of dispersion of those possible residual signal values.
S103: training the original residual network according to the first and second residual signal values of each speech frame contained in the speech sample at each sampling point.
Since the speech sample contains multiple speech frames, the steps in the above embodiment are performed for each frame to obtain the second residual signal value of every speech frame in the sample at every sampling point.
In a specific implementation, a loss value is calculated from the second and first residual signal values of each speech frame in the speech sample at each sampling point, and the original residual network is trained on this loss value so as to update the parameter values of its parameters.
In specific implementations, when the original residual network is trained on the loss value, a gradient descent algorithm can be used to back-propagate the gradients of the parameters in the original residual network, thereby updating their parameter values.
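The patent does not fix a loss function; the sketch below assumes, purely for illustration, an L1 loss between first and second residual signal values and a plain gradient-descent update:

    import torch

    def train_step(residual_net, optimizer, acoustic_feats, first_residuals):
        """One gradient-descent update of the original residual network (illustrative)."""
        # residual_net is assumed to return second residual signal values with the
        # same shape as first_residuals; the L1 loss is an assumption, not from the patent.
        second_residuals = residual_net(acoustic_feats, first_residuals)
        loss = torch.mean(torch.abs(second_residuals - first_residuals))
        optimizer.zero_grad()
        loss.backward()   # back-propagate the gradients of the network parameters
        optimizer.step()  # update the parameter values
        return loss.item()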
In the embodiment of the present invention, the sample set contains a large number of speech samples; the above steps are performed for each speech sample, and training of the residual network is complete when a preset convergence condition is satisfied.
The preset convergence condition may be, for example, that the loss value calculated from the second and first residual signal values of each speech frame in the speech sample at each sampling point is less than a set loss threshold, or that the number of training iterations of the residual network reaches a set maximum. The concrete condition can be set flexibly and is not specifically limited here.
As one possible implementation, when training the residual network, the speech samples in the sample set may be divided into training samples and test samples: the original residual network is trained on the training samples, and the reliability of the trained residual network is then verified on the test samples.
Example 2: To accurately obtain the second residual signal value of the speech frame at the current sampling point, on the basis of the above embodiment, in the embodiment of the present invention, determining the second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector includes:
for each Gaussian algorithm, determining parameter information corresponding to the Gaussian algorithm in a Gaussian parameter vector;
and determining a second residual signal value of the speech frame at the current sampling point based on the parameter information corresponding to each Gaussian algorithm.
When speech synthesis is performed through a residual network in the prior art, the 256-dimensional prediction vector that the sampling-point subnetwork obtains for the current sampling point is of integer (int) type, so the second residual signal value determined from it is also int-type data. That value must be inverse-transformed through an inverse mu-law transformation from int-type to floating-point (float) data, and the inverse-transformed value is then mapped to a signal value in the bit-depth mapping interval before the synthesized speech signal corresponding to the text to be processed can be determined. This inverse mu-law transformation amplifies any deviation in the second residual signal value at the current sampling point, which makes the subsequently determined synthesized speech signal inaccurate, easily produces click noise, and degrades the audio quality of the synthesized speech signal.
To reduce the influence of the inverse mu-law transformation on the residual signal determined at each sampling point, one approach exploits the short-time invariance of speech signals and the fact that the fundamental frequency changes little between adjacent frames: when the 256-dimensional prediction vector is normalized with a softmax function, the pitch and gain of the speech frame are used to adjust the sampling probabilities over the normalized prediction vector. However, this does not noticeably reduce the influence of the inverse mu-law transformation on the second residual signal value at each sampling point.
In addition, before the sampling-point subnetwork in the residual network obtains the 256-dimensional prediction vector for the current sampling point, a preset embedding matrix must map the residual signal of the previous sampling point and the pitch and gain of the speech frame from 1-dimensional int-type vectors to float-type vectors of a set dimension, so a very large amount of data has to be processed when computing the prediction vector for the current sampling point from the condition vector and the mapped float-type vectors.
In the embodiment of the present invention, the second residual signal value of the speech frame at the current sampling point is instead determined through a Gaussian algorithm, so the second residual signal value obtained through the residual network at each sampling point is float-type data directly; no inverse mu-law transformation is needed, which improves the accuracy of the obtained second residual signal values. To determine the second residual signal value through the Gaussian algorithm, the sampling-point subnetwork in the original residual network determines the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector and the first residual signal value of the previous sampling point. Since this vector only needs to store the weight values, Gaussian means, and Gaussian standard deviations of a set number of Gaussian algorithms, its dimension depends only on the number of preconfigured Gaussian algorithms; that number is generally small (even a single Gaussian algorithm can realize the process of determining the second residual signal value of the speech frame at the current sampling point), which greatly reduces the amount of computation required of the sampling-point subnetwork.
In one possible implementation, since any speech frame may contain multiple peaks, in the embodiment of the present invention the second residual signal value of the speech frame at the current sampling point may be determined through multiple Gaussian algorithms. After the Gaussian parameter vector is obtained as in the above embodiment, the parameter information corresponding to each Gaussian algorithm can be determined from the Gaussian parameter vector.
The parameter information corresponding to any Gaussian algorithm includes the weight value, Gaussian mean, and Gaussian standard deviation of that algorithm.
On the basis of the above embodiment, the parameter information corresponding to each Gaussian algorithm is determined and processed accordingly to determine the second residual signal value of the speech frame at the current sampling point.
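For a set number K of Gaussian algorithms, the Gaussian parameter vector therefore holds 3K values. A small sketch of splitting it into per-algorithm parameter information (the [weights | means | standard deviations] layout is an assumption for illustration; the patent does not fix an ordering):

    import numpy as np

    def split_gaussian_params(theta, k):
        """Split a 3K-dim Gaussian parameter vector into (weight, mean, std) per algorithm."""
        w, mu, sigma = theta[:k], theta[k:2 * k], theta[2 * k:]
        return list(zip(w, mu, sigma))

    # A 15-dimensional Gaussian parameter vector corresponds to K = 5 Gaussian algorithms.
    params = split_gaussian_params(np.arange(15, dtype=float), k=5)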
Specifically, determining the second residual signal value of the speech frame at the current sampling point based on the parameter information corresponding to each Gaussian algorithm includes:
for each Gaussian algorithm, determining the Gaussian residual value corresponding to that algorithm from the Gaussian mean and Gaussian standard deviation in its parameter information, and determining the second residual signal value of the speech frame at the current sampling point from the weight value in the parameter information of each Gaussian algorithm and the Gaussian residual value corresponding to each Gaussian algorithm; or
splicing the weight values in the pieces of parameter information into a weight vector; determining a normalized weight vector from the weight vector and a normalization function; determining a target feature element from the feature elements contained in the normalized weight vector; determining the Gaussian algorithm corresponding to the target feature element; determining the Gaussian residual value corresponding to that algorithm from the Gaussian mean and Gaussian standard deviation in its parameter information; and determining that Gaussian residual value as the second residual signal value of the speech frame at the current sampling point.
In the embodiment of the present invention, after the parameter information corresponding to each Gaussian algorithm is obtained, the second residual signal value of the speech frame at the current sampling point may be determined directly from the parameter information of all Gaussian algorithms, or a target Gaussian algorithm may be selected from them and the second residual signal value determined from the parameter information of the target algorithm alone.
Specifically, determining the second residual signal value of the speech frame at the current sampling point proceeds in one of the following two ways.
First way: for the parameter information of each Gaussian algorithm, the Gaussian mean and Gaussian standard deviation are taken from it and processed accordingly to determine the Gaussian residual value corresponding to that algorithm.
After the Gaussian residual value of each Gaussian algorithm has been determined in this way, the second residual signal value of the speech frame at the current sampling point is determined from the weight value in each piece of parameter information and the Gaussian residual value of each Gaussian algorithm.
In one possible implementation, the second residual signal value of the speech frame at the current sampling point is determined from the weight value in each piece of parameter information and the Gaussian residual value corresponding to each Gaussian algorithm by the following formula:

$$e(t)=\sum_{k=1}^{K} w_k^t\, g\!\left(\mu_k^t,\sigma_k^t\right)$$

where $e(t)$ is the second residual signal value of the speech frame at the $t$-th sampling point; $w_k^t$ is the weight value in the parameter information of the $k$-th Gaussian algorithm at the $t$-th sampling point; $g(\mu_k^t,\sigma_k^t)$ is the Gaussian residual value of the $k$-th Gaussian algorithm, determined from the Gaussian mean $\mu_k^t$ and the Gaussian standard deviation $\sigma_k^t$ in that parameter information; and $K$ is the set number of Gaussian algorithms.
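As a concrete illustration of this first way (a minimal sketch, not part of the patent; the Gaussian residual values use the mean-plus-random-times-standard-deviation rule given below):

    def second_residual_weighted(params, j):
        """First way: weighted sum of the Gaussian residual values of all K algorithms.

        params: list of (weight, gaussian_mean, gaussian_std), one tuple per Gaussian algorithm.
        j:      preconfigured random value in [0, 1] (see the rule below).
        """
        # Gaussian residual value of algorithm k: mean + random value * standard deviation.
        return sum(w * (mu + j * sigma) for w, mu, sigma in params)

    e_t = second_residual_weighted([(0.7, 0.01, 0.02), (0.3, -0.05, 0.04)], j=0.42)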
Second way: the weight value is taken from the parameter information of each Gaussian algorithm, and the weight values of all Gaussian algorithms are spliced into a weight vector. The weight vector is normalized with a preset normalization function, such as a softmax function, giving the normalized weight vector; each feature element it contains represents the normalized weight value of the corresponding Gaussian algorithm. Any feature element of the normalized weight vector is then determined as the target feature element, for example chosen at random with a multinomial-distribution sampling algorithm. The Gaussian algorithm corresponding to the target feature element is identified from the preset correspondence between the feature elements of the normalized weight vector and the Gaussian algorithms, the Gaussian mean and Gaussian standard deviation are taken from that algorithm's parameter information, the corresponding Gaussian residual value is determined from them, and that Gaussian residual value is directly taken as the second residual signal value of the speech frame at the current sampling point.
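A corresponding sketch of this second way (illustrative; the softmax normalization and the multinomial sampling of the target element follow the description above):

    import numpy as np

    def second_residual_sampled(params, j, rng=None):
        """Second way: sample one target Gaussian algorithm and use its residual value.

        params: list of (weight, gaussian_mean, gaussian_std), one tuple per Gaussian algorithm.
        """
        rng = rng or np.random.default_rng()
        w = np.array([p[0] for p in params])
        w = np.exp(w - w.max())
        w /= w.sum()                      # softmax: normalized weight vector
        k = rng.choice(len(params), p=w)  # multinomial sampling of the target feature element
        _, mu, sigma = params[k]
        return mu + j * sigma             # Gaussian residual value of the target algorithm

    e_t = second_residual_sampled([(0.7, 0.01, 0.02), (0.3, -0.05, 0.04)], j=0.42)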
In either of the above ways of determining the second residual signal value of the speech frame at the current sampling point, determining the Gaussian residual value corresponding to a Gaussian algorithm from the Gaussian mean and Gaussian standard deviation in its parameter information includes:
determining the Gaussian residual value corresponding to the Gaussian algorithm from the Gaussian mean and Gaussian standard deviation in the parameter information corresponding to the Gaussian algorithm and a preconfigured random value, the random value being any value in a preconfigured random range.
To ensure the randomness of the subsequently predicted second residual signal values, in the embodiment of the present invention a random value is preconfigured as any value in a preconfigured random range, for example [0, 1]. After the Gaussian mean and Gaussian standard deviation in the parameter information of a Gaussian algorithm are obtained as in the above embodiments, they are processed together with the preconfigured random value to determine the Gaussian residual value corresponding to that algorithm.
In a possible implementation manner, determining a gaussian residual value corresponding to the gaussian algorithm according to the gaussian mean, the gaussian standard deviation and the random value in the parameter information corresponding to the gaussian algorithm includes:
determining the product of the Gaussian standard deviation and the random value in the parameter information corresponding to the Gaussian algorithm;
and determining the sum of the determined product and the Gaussian mean value in the parameter information corresponding to the Gaussian algorithm as the Gaussian residual value corresponding to the Gaussian algorithm.
In the embodiment of the invention, the product of the Gaussian standard deviation in the parameter information of the Gaussian algorithm and the preconfigured random value is determined; this product is then added to the Gaussian mean in that parameter information, and the resulting sum is determined as the Gaussian residual value corresponding to the Gaussian algorithm.
Specifically, the Gaussian residual value corresponding to the $k$-th Gaussian algorithm may be determined as:

$$g\!\left(\mu_k^t,\sigma_k^t\right)=\mu_k^t+j\cdot\sigma_k^t$$

where $g(\mu_k^t,\sigma_k^t)$ is the Gaussian residual value of the $k$-th Gaussian algorithm; $\mu_k^t$ is the Gaussian mean in the parameter information of the $k$-th Gaussian algorithm at the $t$-th sampling point; $\sigma_k^t$ is the Gaussian standard deviation in that parameter information; and $j$ is any random value in the range $[0, 1]$.
To facilitate subsequent speech synthesis, in the embodiment of the present invention the second residual signal value of a speech frame at each sampling point obtained through the residual network is float-type data, and the Gaussian residual values obtained by the method of this embodiment all lie within the range [-1, 1]. A correspondence between the values within this Gaussian-residual range and the values within a preset range is therefore preconfigured, and after a Gaussian residual value is obtained, it can be mapped into the preset range according to that correspondence.
For example, the Gaussian residual values lie in the range [-1, 1], and the preset range is [-2^15, 2^15]. A correspondence between the values in [-1, 1] and the values in [-2^15, 2^15] is preconfigured. If the obtained Gaussian residual value is 1, it is mapped, according to this correspondence, to the value 2^15 = 32768 in the preset range.
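A one-line sketch of such a mapping, assuming (the patent does not say) that the correspondence is linear:

    import numpy as np

    def map_residual(g):
        """Map Gaussian residual values from [-1, 1] into the preset range [-2**15, 2**15]."""
        return np.clip(g, -1.0, 1.0) * 2 ** 15

    print(map_residual(np.array([1.0, -0.5, 0.0])))  # [ 32768. -16384.      0.]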
Example 3: To make the Gaussian parameter vector of the residual signal at the current sampling point of the speech frame more accurate, and hence the predicted second residual signal value at the current sampling point more accurate, on the basis of the above embodiments, obtaining the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector and the first residual signal value of the speech frame at the previous sampling point includes:
obtaining the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector, the first residual signal value of the speech frame at the previous sampling point, and the acoustic parameters of the speech frame at the current sampling point, the acoustic parameters including at least one of the predicted signal value of the speech frame at the current sampling point, the fundamental frequency information of the speech frame, and the voiced/unvoiced information of the speech frame.
In practical applications, the acoustic parameters of a speech frame at each sampling point have a certain influence on the predicted second residual signal value at that sampling point, so taking them into account can improve the accuracy of the predicted second residual signal values to a certain extent. Therefore, in the embodiment of the present invention, any speech frame contained in a speech sample has acoustic parameters at each sampling point, the acoustic parameters at the current sampling point including at least one of the predicted signal value of the frame at the current sampling point, the fundamental frequency information of the frame, and the voiced/unvoiced information of the frame. In a specific implementation, after the sampling-point subnetwork receives the condition vector from the frame-level subnetwork, it obtains, for each sampling point, the Gaussian parameter vector of the speech frame at the current sampling point from the condition vector, the first residual signal value of the previous sampling point, and the acoustic parameters of the frame at the current sampling point, and then processes that Gaussian parameter vector to determine the second residual signal value of the speech frame at the current sampling point.
Example 4: To accurately obtain the condition vector of the speech frame, and thus improve the accuracy of the second residual signal values predicted by the residual network, on the basis of the above embodiments, in the embodiment of the present invention the acoustic feature vector includes at least one of the spectral parameters of the speech frame, the fundamental frequency information of the speech frame, the voiced/unvoiced information of the speech frame, and the aperiodic parameters of the speech frame.
In practice, different intensities of the airflow generated by the lungs during phonation correspond to different excitation sources, and the speech signals produced by different excitation sources can be divided into unvoiced and voiced sounds. The voiced/unvoiced information of a speech frame therefore aids speech signal synthesis. On this basis, in order to accurately obtain the condition vector of the speech frame and thus improve the accuracy of the second residual signal values obtained by the residual network, in the embodiment of the present invention the voiced/unvoiced information of the speech frame may be taken as one of the acoustic features contained in its acoustic feature vector.
Human speech also carries certain prosodic features such as pauses and tone. To make the synthesized speech signal more natural and closer to the voice of a real person, the fundamental frequency information of the speech frame is also considered during synthesis: the onset of vocal-cord vibration can be determined from it, which helps determine characteristics such as the tone and rhythm of the synthesized speech. On this basis, for the same reason, the fundamental frequency information of the speech frame may be taken as one of the acoustic features contained in its acoustic feature vector.
A human speech signal may also be determined by several excitation sources and can contain both periodic and aperiodic components. The aperiodic information of a speech frame helps, to a certain extent, to determine the type and strength of the sound sources involved in producing the speech signal. On this basis, for the same reason, the aperiodic information of the speech frame may also be taken as one of the acoustic features contained in its acoustic feature vector.
In addition, the spectral parameters of a speech frame provide the time-frequency characteristics of the synthesized speech signal, for example the Mel-generalized cepstrum (MGC), Bark-frequency cepstral coefficients (BFCC), or Mel-frequency cepstral coefficients (MFCC). On this basis, for the same reason, the spectral parameters of the speech frame may also be taken as acoustic features contained in its acoustic feature vector, the spectral parameters of any speech frame including at least one of MGC, BFCC, and MFCC.
In one possible implementation, since extracting BFCC from speech samples is complicated and not conducive to training the residual network, the spectral parameters of any speech frame are preferably MGC or MFCC.
It should be noted that the dimension of the spectral parameters can be set differently for different scenarios: to reduce the amount of computation, it may be set smaller, for example 30 dimensions; to improve the accuracy of the obtained condition vector and hence of the second residual signal values predicted by the residual network at each sampling point, it may be set larger, for example 128 dimensions. It should be neither too large nor too small; preferably, the spectral parameters may be 60-dimensional.
Of course, information such as the pitch and gain of the speech frame may also be taken as acoustic features contained in its acoustic feature vector. The acoustic features actually contained in the acoustic feature vector can be set flexibly according to actual requirements and are not specifically limited here.
For example, the spectral parameters of the speech frame, the fundamental frequency information of the speech frame, the voiced/unvoiced information of the speech frame, and the aperiodic parameters of the speech frame may each be taken as acoustic features, and the acoustic feature vector determined from them.
For example, the 60-dimensional MGC of a speech frame, the fundamental frequency information (f0), the voiced/unvoiced information (uv), and the aperiodic parameters (ap) of the frame are taken as the acoustic features 60MGC, f0, uv, and ap respectively, and the 63-dimensional acoustic feature vector [60MGC, f0, uv, ap] is determined from them.
In one possible implementation, to facilitate training of the residual network and reduce its amount of computation, the fundamental frequency information of the speech frame may be logarithm-processed and the frame's fundamental frequency information updated to the processed value. Reducing the magnitude of the values in the fundamental frequency information lets the residual network perform its computations on an acoustic feature vector containing the processed fundamental frequency information.
For example, the 60-dimensional MGC of the speech frame, the processed fundamental frequency information log(f0), the voiced/unvoiced information uv, and the aperiodic parameter ap are taken as the acoustic features 60MGC, log(f0), uv, and ap respectively, and the 63-dimensional acoustic feature vector [60MGC, log(f0), uv, ap] is determined from them.
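A small sketch of assembling this 63-dimensional acoustic feature vector (the epsilon guard for log(0) on unvoiced frames is an illustrative assumption, not part of the patent):

    import numpy as np

    def acoustic_feature_vector(mgc, f0, uv, ap):
        """Assemble [60MGC, log(f0), uv, ap] into a 63-dimensional feature vector."""
        mgc = np.asarray(mgc)
        assert mgc.shape == (60,)
        return np.concatenate([mgc, [np.log(max(f0, 1e-10)), uv, ap]])

    feat = acoustic_feature_vector(np.zeros(60), f0=220.0, uv=1, ap=0.3)  # shape (63,)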
After the acoustic feature vector of a speech frame is determined, the condition vector of the frame, for example a 128-dimensional condition vector, is determined in the manner of the above embodiments for each speech frame of a speech sample through the frame-level subnetwork of the original residual network, based on the acoustic feature vector of the frame, for example the 63-dimensional vector determined from the 60-dimensional MGC, the processed fundamental frequency information log(f0), the voiced/unvoiced information uv, and the aperiodic parameter ap.
Since the sampling-point subnetwork must process the condition vector when predicting the second residual signal value of the speech frame at each sampling point, the frame-level subnetwork of the original residual network, to ease that subsequent processing, determines the target number of sampling points from the preconfigured sampling rate, copies the condition vector to obtain the target number of condition vectors, assembles the copies into a condition matrix, and outputs the matrix. For example, for a target number of 160, the 128-dimensional condition vector is copied to finally determine a 160 x 128 condition matrix.
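For instance (illustrative), copying one frame's 128-dimensional condition vector across 160 sampling points:

    import numpy as np

    cond = np.random.randn(128)            # condition vector of one speech frame
    cond_matrix = np.tile(cond, (160, 1))  # one copy per sampling point -> shape (160, 128)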
A splicing layer in the sampling-point subnetwork receives the condition matrix output by the frame-level subnetwork and determines the number of sampling points of the speech frame from the matrix dimensions. For each sampling point, it then slices the condition vector of the current sampling point from the condition matrix and splices it with the first residual signal value of the previous sampling point and the acoustic parameters of the current sampling point to form the splicing vector of the current sampling point. For example, the 128-dimensional condition vector of the current sampling point, a 1-dimensional vector formed by the first residual signal value of the previous sampling point, a 1-dimensional vector formed by the predicted signal value of the current sampling point, a 1-dimensional vector formed by the fundamental frequency information of the speech frame, and a 1-dimensional vector formed by the voiced/unvoiced information of the speech frame are spliced into a 132-dimensional splicing vector for the current sampling point. A Sparse Gated Recurrent Unit (Sparse GRU) in the sampling-point subnetwork then processes the splicing vector; a Gated Recurrent Unit (GRU) further processes the Sparse GRU output; and a dual fully connected (Dual FC) layer upsamples the GRU output through a set number of fully connected layers and combines their outputs, for example by element-wise summation, into the Gaussian parameter vector of the residual signal at the current sampling point, for example a 15-dimensional Gaussian parameter vector. Finally, a Gaussian mixture (GMM) sampling layer in the sampling-point subnetwork determines the second residual signal value of the speech frame at the current sampling point from the obtained Gaussian parameter vector by the method provided by the invention.
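A hedged PyTorch sketch of this sampling-point pipeline (a plain GRUCell stands in for the Sparse GRU, whose sparsification is omitted; the 132-dimensional splice and the 15-dimensional output for K = 5 Gaussian algorithms follow the example above, everything else is assumed):

    import torch
    import torch.nn as nn

    class SamplePointSubnet(nn.Module):
        """Splice -> (Sparse) GRU -> GRU -> dual FC -> Gaussian parameter vector."""

        def __init__(self, cond_dim=128, acoustic_dim=3, hidden=256, k=5):
            super().__init__()
            splice_dim = cond_dim + 1 + acoustic_dim     # +1 for the previous first residual
            self.gru_a = nn.GRUCell(splice_dim, hidden)  # stand-in for the Sparse GRU
            self.gru_b = nn.GRUCell(hidden, hidden)
            self.fc1 = nn.Linear(hidden, 3 * k)          # dual fully connected, branch 1
            self.fc2 = nn.Linear(hidden, 3 * k)          # dual fully connected, branch 2

        def step(self, cond, prev_residual, acoustics, h_a, h_b):
            # 128 + 1 + 3 = 132-dimensional splicing vector for the current sampling point.
            splice = torch.cat([cond, prev_residual, acoustics], dim=-1)
            h_a = self.gru_a(splice, h_a)
            h_b = self.gru_b(h_a, h_b)
            theta = self.fc1(h_b) + self.fc2(h_b)        # element-wise sum -> 15-dim parameters
            w, mu, log_sigma = theta.chunk(3, dim=-1)    # per-Gaussian weights, means, log-stds
            return w, mu, log_sigma.exp(), h_a, h_b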
Example 5: To facilitate the training of the original residual network, in the embodiment of the present invention, on the basis of the above embodiments, the predicted signal value of the speech frame at each sampling point is determined according to the acoustic feature vector of the speech frame and a physical vocoder.
In practical application scenarios, the predicted speech signal of any speech frame can be obtained, based on the extracted acoustic feature vector of the speech frame, by a conventional physical vocoder, such as Griffin-Lim, or by a neural network vocoder, such as WaveNet. A conventional physical vocoder obtains the predicted speech signal mainly through the spectrum. In doing so it must estimate the phase of each frame of the predicted speech signal, and a certain deviation exists between the estimated phase and the actual phase of the speech signal, which easily makes the acquired predicted speech signal unclear and noticeably different from speech produced by a real person. A neural network vocoder, in contrast, is formed mainly by a neural network and obtains the predicted speech signal of a speech frame from the extracted acoustic features of that frame. Although a high-quality predicted speech signal can be obtained in this way, the huge autoregressive neural network it contains must compute the feature value of the speech frame at each sampling point in time-sequence order, so acquiring the predicted speech signal through a neural network vocoder consumes a lot of time, is very inefficient, and cannot meet the requirement of acquiring the predicted speech signal of a speech frame from its acoustic features in real time.
Based on this, in the embodiment of the present invention, the predicted speech signal of any speech frame can be obtained through the physical vocoder and the acoustic feature vector of the speech frame. Wherein the physical vocoder may be an LPC vocoder, a WORLD vocoder, or the like.
In one possible implementation, the physical vocoder used to obtain the predicted speech signal of a speech frame may be a WORLD vocoder. Through the WORLD vocoder, the predicted speech signal of the speech frame can be obtained in real time and with high quality based on the acoustic feature vector of the speech frame; its prediction speed is more than ten times that of a conventional physical vocoder, which is sufficient for real-time speech prediction. However, while the WORLD vocoder achieves fast acquisition of the predicted speech signal, it cannot fully eliminate the mechanical sound typical of physical vocoders, so its naturalness is poor. Therefore, the predicted speech signal produced by the WORLD vocoder can be corrected through the trained residual network, so that the synthesized speech signal is more natural and the effect is better.
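For context, the WORLD vocoder is available in Python as the pyworld package; the sketch below shows an analysis-synthesis round trip with it. Note that it works from the raw spectral envelope rather than from the MGC-derived features used in this embodiment (converting MGC back to a spectrum would require an extra tool such as pysptk), so it only illustrates how a predicted speech signal is produced from WORLD parameters.

```python
# Hedged sketch: obtain a WORLD-synthesized (predicted) signal from a
# reference signal via pyworld's analysis/synthesis functions.
import numpy as np
import pyworld as pw

def world_resynthesize(x: np.ndarray, fs: int) -> np.ndarray:
    x = x.astype(np.float64)               # pyworld expects float64 input
    f0, t = pw.dio(x, fs)                  # coarse fundamental frequency track
    f0 = pw.stonemask(x, f0, t, fs)        # f0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)       # spectral envelope
    ap = pw.d4c(x, f0, t, fs)              # aperiodic parameter (ap)
    return pw.synthesize(f0, sp, ap, fs)   # predicted speech signal

fs = 16000
predicted = world_resynthesize(np.random.randn(fs) * 0.01, fs)  # dummy 1 s signal
```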
In one possible implementation, for each speech frame in any speech sample in the sample set, the predicted speech signal of the speech frame may be determined in advance through the physical vocoder and the acoustic feature vector of the speech frame. The predicted signal value of the speech frame at each sampling point is then determined according to the preconfigured sampling rate and the predicted speech signal, and the real signal value of the speech frame at each sampling point is determined according to the preconfigured sampling rate and the speech frame itself. For each sampling point of the speech frame, the difference between the real signal value and the predicted signal value of the speech frame at that sampling point is determined as the first residual signal value.
In another possible implementation, during the training of the original residual network, the acoustic feature vector of the speech frame about to be input into the original residual network is processed by the physical vocoder in real time to determine the predicted speech signal of the speech frame. The predicted signal value and the real signal value of the speech frame at each sampling point are then determined according to the preconfigured sampling rate as above, and for each sampling point the difference between the real signal value and the predicted signal value is determined as the first residual signal value.
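In either implementation, the residual computation itself is a per-sample subtraction; a minimal sketch:

```python
# First residual signal values of one speech frame: real minus predicted,
# sampling point by sampling point.
import numpy as np

def first_residual(real_frame: np.ndarray, predicted_frame: np.ndarray) -> np.ndarray:
    assert real_frame.shape == predicted_frame.shape
    return real_frame - predicted_frame    # r(t) = s(t) - p(t)
```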
Because adjacent speech signals are correlated to a certain degree, the real signal value and the predicted signal value of any speech frame at each sampling point can also be pre-emphasized. That is, for each sampling point, the pre-emphasized signal value of the speech frame at the current sampling point is determined from the difference between the signal value (both the real signal value and the predicted signal value) of the speech frame at the current sampling point and the signal value at the previous sampling point, together with a preset weight value. Based on the pre-emphasized real signal values and pre-emphasized predicted signal values of any speech frame at each sampling point, the first residual signal value of the speech frame at each sampling point is determined in the manner described above, and the original residual network is then trained based on the acoustic feature vector and the first residual signal values of each speech frame of any speech sample in the sample set.
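A sketch of that pre-emphasis step follows; the weight 0.85 is an assumed illustrative value, since the text only specifies a preset weight value.

```python
# Pre-emphasis: each emphasized sample is the current sample minus a
# weighted copy of the previous one; the first sample is kept unchanged.
import numpy as np

def pre_emphasize(signal: np.ndarray, weight: float = 0.85) -> np.ndarray:
    out = signal.astype(np.float32).copy()
    out[1:] = signal[1:] - weight * signal[:-1]   # x'(t) = x(t) - w * x(t-1)
    return out
```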
It should be noted that, in the embodiment of the present invention, because both the second residual signal value and the first residual signal value are float-type data, the sampling point sub-network does not need to store a huge embedding matrix for converting the data type and dimension of the first residual signal value of the previous sampling point, nor does it need to apply an inverse μ-law transformation to convert the data type of the second residual signal value after it is obtained, which improves the efficiency of obtaining the second residual signal value.
Example 6: Fig. 2 is a schematic diagram of the training process of a residual network according to an embodiment of the present invention, where the process includes:
Because the sample set comprises a large number of speech samples and each speech sample comprises a plurality of speech frames, the following steps are executed for each speech frame contained in any speech sample in the sample set:
inputting the determined 63-dimensional acoustic feature vector dimensions [60MGC, f0, uv and ap ] of a certain voice frame, fundamental frequency information f0 of the voice frame, unvoiced and voiced information uv of the voice frame and aperiodic parameters ap of the voice frame into a WORLD vocoder, and obtaining a predicted voice signal of the voice frame.
The predicted signal value of the speech frame at each sampling point, denoted p(t), is determined based on the preconfigured sampling rate and the predicted speech signal of the speech frame, and the real signal value of the speech frame at each sampling point is determined according to the preconfigured sampling rate and the speech frame itself. For each sampling point of the speech frame, the difference between the real signal value and the predicted signal value at that sampling point is determined as the first residual signal value.
The determined 63-dimensional acoustic feature vector [60-dimensional MGC, f0, uv, ap] is input into the original residual network. Features of the acoustic feature vector are extracted through two consecutive 1 × 3 convolutional layers (Conv 1 × 3) contained in the frame-level sub-network (frame rate network) of the original residual network, giving the feature vector output by the last convolutional layer after convolution. The obtained feature vector is then upsampled through two consecutive fully connected layers (FC) contained in the frame-level sub-network, and the last fully connected layer outputs a 128-dimensional condition vector.
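A hedged PyTorch sketch of this frame-level sub-network; the tanh activations and any widths other than those stated (63-dimensional input, 128-dimensional condition vector) are assumptions.

```python
# Frame rate network sketch: two 1x3 convolutions over the frame axis
# followed by two fully connected layers producing the condition vector.
import torch
import torch.nn as nn

class FrameRateNetSketch(nn.Module):
    def __init__(self, feat_dim=63, cond_dim=128):
        super().__init__()
        self.conv1 = nn.Conv1d(feat_dim, cond_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(cond_dim, cond_dim, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(cond_dim, cond_dim)
        self.fc2 = nn.Linear(cond_dim, cond_dim)

    def forward(self, feats):                        # feats: (batch, frames, 63)
        h = feats.transpose(1, 2)                    # -> (batch, 63, frames)
        h = torch.tanh(self.conv2(torch.tanh(self.conv1(h))))
        h = h.transpose(1, 2)                        # -> (batch, frames, 128)
        return self.fc2(torch.tanh(self.fc1(h)))     # condition vector per frame

cond = FrameRateNetSketch()(torch.randn(1, 10, 63))
print(cond.shape)                                    # torch.Size([1, 10, 128])
```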
To facilitate subsequent processing by the sampling point sub-network of the original residual network, the frame-level sub-network replicates (rep) the acquired 128-dimensional condition vector 160 times. A 160 × 128 condition matrix is determined from these 128-dimensional condition vectors, and the 160 × 128 condition matrix is then input into the sampling point sub-network of the original residual network.
After a sampling point sub-network (sample rate network) acquires a condition matrix input by a frame level sub-network, the number of sampling points corresponding to the voice frame is determined according to the dimension of the condition matrix. For each sample point, the following steps are performed:
The condition vector corresponding to the current sampling point of the speech frame is divided from the condition matrix through the splicing (concat) layer in the sampling point sub-network. The 128-dimensional condition vector corresponding to the current sampling point, a 1-dimensional vector formed by the first residual signal value of the previous sampling point, a 1-dimensional vector formed by the predicted signal value corresponding to the current sampling point, a 1-dimensional vector formed by the fundamental frequency information of the speech frame, and a 1-dimensional vector formed by the unvoiced/voiced information of the speech frame are spliced, and the 132-dimensional splicing vector corresponding to the current sampling point of the speech frame is acquired and output.
The 132-dimensional splicing vector output by the splicing layer is correspondingly processed and output through the Sparse GRU in the sampling point sub-network, based on its preconfigured 384 neurons.
The processing result output by the Sparse GRU is further processed and output through the GRU in the sampling point sub-network, based on its preconfigured 16 neurons.
The processing result output by the GRU is acquired and upsampled by the two preconfigured fully connected layers of the Dual FC in the sampling point sub-network, and element-wise summation is then performed on the upsampling results output by the two fully connected layers to obtain the 15-dimensional Gaussian parameter vector of the speech frame at the current sampling point.
Through the GMM layer in the sampling point sub-network, the parameter information corresponding to each Gaussian algorithm is determined in the Gaussian parameter vector; the parameter information of any Gaussian algorithm includes the weight value, the Gaussian mean and the Gaussian standard deviation of that Gaussian algorithm. For the parameter information of each Gaussian algorithm, the Gaussian mean and the Gaussian standard deviation are acquired from that parameter information, corresponding processing is performed on them, and the Gaussian residual value corresponding to the Gaussian algorithm is determined. After the Gaussian residual value corresponding to each Gaussian algorithm has been determined, the second residual signal value e(t) of the speech frame at the current sampling point is determined according to the weight value included in each piece of parameter information and the Gaussian residual value corresponding to each Gaussian algorithm.
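The Gaussian sampling step can be sketched as follows. The softmax normalization of the weights matches the weight-vector normalization described in the apparatus embodiments below, while the standard-normal draw is an assumption: the text only requires a random value from a preset random range.

```python
# Hedged GMM-layer sketch: split the 15-dim parameter vector into 5 weights,
# 5 means and 5 standard deviations, then form the second residual value as
# the weighted sum of per-component Gaussian residuals mu_i + sigma_i * z_i.
import numpy as np

def gmm_sample(gauss_params: np.ndarray, n_components: int = 5) -> float:
    weights, means, stds = np.split(gauss_params, 3)
    weights = np.exp(weights) / np.exp(weights).sum()   # softmax normalization
    z = np.random.randn(n_components)                   # assumed random values
    gaussian_residuals = means + np.abs(stds) * z       # one residual per Gaussian
    return float(np.dot(weights, gaussian_residuals))   # second residual e(t)

e_t = gmm_sample(np.random.randn(15))
```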
After the second residual signal values of each speech frame contained in the speech sample at each sampling point are obtained through the original residual network based on the above steps, the original residual network is trained according to the first residual signal value and the second residual signal value of each speech frame contained in the speech sample at each sampling point.
Example 7: Fig. 3 is a schematic diagram of a speech synthesis process according to an embodiment of the present invention, including:
S301: Acquiring, through a speech synthesis model, at least one acoustic feature vector corresponding to the text to be processed based on the text features of the text to be processed.
S302: For each acoustic feature vector, acquiring, through a physical vocoder, the predicted speech signal of the speech frame corresponding to the acoustic feature vector based on the acoustic feature vector; acquiring, through a residual network, the second residual signal value of the speech frame corresponding to the acoustic feature vector at each sampling point based on the acoustic feature vector; and, for each sampling point, determining the corresponding predicted signal value of the sampling point in the predicted speech signal, and determining the synthesized signal value of the speech frame corresponding to the acoustic feature vector at the sampling point according to the predicted signal value and the second residual signal value of the speech frame corresponding to the acoustic feature vector at the sampling point.
S303: Determining, in sequence, the synthesized speech signal corresponding to the text to be processed according to the synthesized signal values of the speech frames corresponding to the acoustic feature vectors at each sampling point.
The speech synthesis method provided by the embodiment of the invention is applied to an electronic device, which may be a smart device such as a robot, or a server.
It should be noted that the electronic device performing speech synthesis may be the same as or different from the electronic device performing residual network training in the above embodiments. When they are different, the residual network may be trained on the training device based on the training method described above, after which the trained residual network is sent to the speech synthesis device for storage; in the subsequent speech synthesis process, the speech synthesis device performs the corresponding processing based on the stored residual network.
If a synthesized voice signal corresponding to the text to be processed needs to be synthesized, at least one acoustic feature vector corresponding to the text to be processed is acquired based on the text features of the text to be processed through a voice synthesis model. Subsequent processing is performed based on each acoustic feature vector.
In the specific implementation process, the text features of the text to be processed are obtained. They may be obtained through a text analysis algorithm, such as syntactic or grammatical analysis, or determined manually; this can be configured flexibly according to requirements and is not specifically limited here. Then, according to the obtained text features, at least one acoustic feature vector corresponding to the text to be processed can be obtained through a pre-trained deep learning model, such as a Tacotron model.
After each acoustic feature vector is determined based on the above embodiments, the acoustic feature vectors are input in order into the physical vocoder and the residual network respectively. Each acoustic feature vector is processed by the physical vocoder to obtain the predicted speech signal of the corresponding speech frame, and at the same time the second residual signal value of that speech frame at each sampling point is obtained through the residual network based on the acoustic feature vector. For each sampling point, the corresponding predicted signal value of the sampling point in the predicted speech signal is determined, and the corresponding synthesized signal value of the sampling point is determined according to the predicted signal value and the second residual signal value of the sampling point.
The synthesized speech signal corresponding to the text to be processed is then determined, in sequence, according to the synthesized signal values of the speech frames corresponding to the acoustic feature vectors at each sampling point.
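At its core, this combination step is a per-sample addition of the vocoder prediction and the network residual, concatenated in frame order; a minimal sketch:

```python
# Synthesized signal values: vocoder prediction plus second residual,
# per sampling point, then frames concatenated in order.
import numpy as np

def synthesize_utterance(pred_frames, resid_frames):
    return np.concatenate([p + r for p, r in zip(pred_frames, resid_frames)])
```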
In practical applications, there is also a requirement on the bit depth of the synthesized speech signal to guarantee its quality. Bit depth refers to the dynamic range of a speech signal, i.e. how many bits represent the amplitude of the speech signal at a given time point; the more bits, the more accurately the amplitude can be represented, for example a bit depth of 16 bits or 24 bits. Therefore, in determining the synthesized signal value of a sampling point from its predicted signal value and second residual signal value, the sum of the predicted signal value and the second residual signal value of the sampling point is first determined; this sum is then mapped to a signal value within the preconfigured bit depth mapping range, for example 16-bit or 24-bit, and the mapped signal value is determined as the synthesized signal value of the speech frame corresponding to the acoustic feature vector at the sampling point. The mapped signal value is integer data.
In one possible implementation, if the second residual signal values of any speech frame acquired through the residual network at each sampling point are already floating-point data within a preset range, and the preset range is equal in size to the bit depth mapping range, the signal sum of any sampling point can simply be rounded to map it to a signal value within the preconfigured bit depth mapping range.
For example, suppose any value in the bit depth mapping range is an integer in [-2^15, 2^15], and the second residual signal value and the predicted signal value of a certain sampling point are floating-point data in the preset range [-2^15, 2^15]. If the sum of the second residual signal value and the predicted signal value obtained for the sampling point lies in [-2^15, 2^15], rounding the obtained signal sum directly yields a signal value within the bit depth mapping range, and the mapped signal value is determined as the synthesized signal value of the speech frame corresponding to the acoustic feature vector at that sampling point.
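Under that assumption, the mapping reduces to rounding plus a guard clip, as in this 16-bit sketch:

```python
# Map floating-point signal sums to 16-bit integer synthesized values.
import numpy as np

def to_int16(signal_sum: np.ndarray) -> np.ndarray:
    rounded = np.rint(signal_sum)                          # round to nearest integer
    return np.clip(rounded, -2**15, 2**15 - 1).astype(np.int16)

print(to_int16(np.array([-40000.2, -12.7, 0.4, 32767.9])))  # [-32768 -13 0 32767]
```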
The process of obtaining the second residual signal value of the speech frame corresponding to the acoustic feature vector at each sampling point through the residual network based on the acoustic feature vector comprises the following steps:
acquiring a condition vector of a voice frame corresponding to the acoustic feature vector based on the acoustic feature vector through a frame level sub-network in a residual error network;
obtaining, through a sampling point sub-network in the residual network, for each sampling point, a Gaussian parameter vector of the speech frame at the current sampling point according to the condition vector and the second residual signal value of the previous sampling point, and determining the second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector.
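Note the difference from training: at synthesis time no ground-truth first residual exists, so the network feeds back its own second residual value from the previous sampling point. A hedged sketch of this autoregressive loop, in which both callables are placeholders (for instance, the sampling-point subnetwork step and the gmm_sample routine sketched earlier):

```python
# Autoregressive inference over one frame: the previous second residual is
# fed back as input when predicting the next sampling point.
import numpy as np

def infer_frame_residuals(cond_matrix, subnet_step, sample_from_params, n_samples=160):
    prev_residual = 0.0
    residuals = np.zeros(n_samples, dtype=np.float32)
    for t in range(n_samples):
        gauss_params = subnet_step(cond_matrix[t], prev_residual)  # 15-dim vector
        residuals[t] = sample_from_params(gauss_params)            # second residual
        prev_residual = residuals[t]                               # feedback
    return residuals
```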
After the synthesized signal value of the speech frame corresponding to each acoustic feature vector at each sampling point is obtained based on the above embodiment, the synthesized speech signal corresponding to the text to be processed can be determined according to each synthesized signal value.
Example 8: An embodiment of the present invention provides a training apparatus for a residual network, and Fig. 4 is a schematic structural diagram of the training apparatus for a residual network provided in the embodiment of the present invention, where the apparatus includes:
a determining module 41, configured to determine, for each speech frame in any speech sample in a sample set, a first residual signal value corresponding to each sampling point of the speech frame, where the first residual signal value is determined according to the real signal value and the predicted signal value of the speech frame at the sampling point, and the predicted signal value of the speech frame at the sampling point is determined according to the acoustic feature vector of the speech frame and a physical vocoder;
a prediction module 42, configured to, for each speech frame included in the speech sample, obtain a condition vector of the speech frame through a frame-level sub-network in an original residual network based on the acoustic feature vector of the speech frame; acquire, through a sampling point sub-network in the original residual network, for each sampling point, a Gaussian parameter vector of the speech frame at the current sampling point according to the condition vector and the first residual signal value of the speech frame at the previous sampling point; and determine a second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, where the Gaussian parameter vector comprises the weight value, Gaussian mean and Gaussian standard deviation respectively corresponding to each of a set number of Gaussian algorithms;
and a training module 43, configured to train the original residual network according to the first residual signal value and the second residual signal value of each speech frame included in the speech sample at each sampling point.
In a possible implementation, the prediction module 42 is specifically configured to: determine, for each Gaussian algorithm, the parameter information corresponding to the Gaussian algorithm in the Gaussian parameter vector; and determine the second residual signal value of the speech frame at the current sampling point based on the parameter information corresponding to each Gaussian algorithm.
In a possible implementation, the prediction module 42 is specifically configured to: aiming at each Gaussian algorithm, determining a Gaussian residual value corresponding to the Gaussian algorithm according to the Gaussian mean value and the Gaussian standard deviation in the parameter information corresponding to the Gaussian algorithm; determining a second residual signal value of the speech frame at the current sampling point according to the weight value in the parameter information corresponding to each Gaussian algorithm and the Gaussian residual value corresponding to each Gaussian algorithm; or, the weight values in each parameter information are spliced into weight vectors; determining a normalization weight vector according to the weight vector and a normalization function; determining target feature elements from each feature element contained in the normalized weight vector; determining a Gaussian algorithm corresponding to the target characteristic element, determining a Gaussian residual value corresponding to the Gaussian algorithm according to a Gaussian mean value and a Gaussian standard deviation in parameter information corresponding to the Gaussian algorithm, and determining the Gaussian residual value as a second residual signal value of the speech frame at the current sampling point.
In a possible implementation, the prediction module 42 is specifically configured to: and determining a Gaussian residual value corresponding to the Gaussian algorithm according to the Gaussian mean value, the Gaussian standard deviation and a preset random value in the parameter information corresponding to the Gaussian algorithm, wherein the random value is any value in a preset random range.
In a possible implementation, the prediction module 42 is specifically configured to:
determining the product of the Gaussian standard deviation and the random value in the parameter information corresponding to the Gaussian algorithm; and determining the sum of the product and the Gaussian mean value in the corresponding parameter information as the Gaussian residual value corresponding to the Gaussian algorithm.
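In formula form, with $z$ drawn from the preset random range, the Gaussian residual of the $k$-th Gaussian algorithm and the weighted-combination form of the second residual signal value read as follows (the notation is ours):

```latex
e_k = \mu_k + \sigma_k \cdot z, \qquad
e(t) = \sum_{k=1}^{K} w_k \, e_k
```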
In a possible implementation, the prediction module 42 is specifically configured to:
and acquiring a Gaussian parameter vector of the voice frame at the current sampling point according to the condition vector, the first residual signal value of the voice frame at the last sampling point and the acoustic parameter of the voice frame at the current sampling point, wherein the acoustic parameter comprises at least one of a predicted signal value of the voice frame at the current sampling point, fundamental frequency information of the voice frame and voiced and unvoiced information of the voice frame.
The sampling point subnetwork in the trained residual error network directly acquires the Gaussian parameter vector of the voice frame at the current sampling point according to the condition vector and the first residual error signal value of the last sampling point, and the Gaussian parameter vector comprises the weighted value, the Gaussian mean value and the Gaussian standard deviation which are respectively corresponding to the Gaussian algorithms with the set number, so that the second residual error signal value of the voice frame at the current sampling point can be acquired based on the Gaussian parameter vector subsequently, the process of determining the second residual error signal value is simplified, the efficiency of determining the second residual error signal value is improved, the transformation of data types and dimensions is not needed, and the resources consumed by the transformation of the data types and the dimensions are greatly reduced.
Example 9: An embodiment of the present invention provides a speech synthesis apparatus, and Fig. 5 is a schematic structural diagram of the speech synthesis apparatus provided in the embodiment of the present invention, where the apparatus includes:
the acquiring unit 51 is configured to acquire, through a speech synthesis model, at least one acoustic feature vector corresponding to a text to be processed based on text features of the text to be processed;
a prediction unit 52, configured to, for each acoustic feature vector, obtain a predicted speech signal of the speech frame corresponding to the acoustic feature vector through a physical vocoder based on the acoustic feature vector; acquire a second residual signal value of the speech frame corresponding to the acoustic feature vector at each sampling point through a residual network based on the acoustic feature vector; and determine, for each sampling point, the corresponding predicted signal value of the sampling point in the predicted speech signal, and determine the synthesized signal value of the speech frame corresponding to the acoustic feature vector at the sampling point according to the predicted signal value and the second residual signal value of the speech frame corresponding to the acoustic feature vector at the sampling point;
and a determining unit 53, configured to determine, in sequence, the synthesized speech signal corresponding to the text to be processed according to the synthesized signal values of the speech frames corresponding to the acoustic feature vectors at each sampling point.
In a possible implementation, the prediction unit 52 is specifically configured to:
acquiring a signal sum of a predicted signal value and a second residual signal value of a speech frame corresponding to the acoustic feature vector at the sampling point; and mapping the signal sum to a signal value in a preset bit depth mapping range, and determining the mapped signal value as a synthesized signal value of a voice frame corresponding to the acoustic feature vector at the sampling point.
Example 10: Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. On the basis of the foregoing embodiments, the electronic device includes: a processor 61, a communication interface 62, a memory 63 and a communication bus 64, where the processor 61, the communication interface 62 and the memory 63 communicate with each other through the communication bus 64; the memory 63 stores a computer program which, when executed by the processor 61, causes the processor 61 to perform the following steps:
determining, for each speech frame in any speech sample in the sample set, a first residual signal value corresponding to each sampling point of the speech frame, where the first residual signal value is determined according to the real signal value and the predicted signal value of the speech frame at the sampling point, and the predicted signal value of the speech frame at the sampling point is determined according to the acoustic feature vector of the speech frame and a physical vocoder;
for each speech frame contained in the speech sample, acquiring a condition vector of the speech frame through a frame-level sub-network in the original residual network based on the acoustic feature vector of the speech frame; acquiring, through a sampling point sub-network in the original residual network, for each sampling point, a Gaussian parameter vector of the speech frame at the current sampling point according to the condition vector and the first residual signal value of the speech frame at the previous sampling point; and determining a second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, where the Gaussian parameter vector comprises the weight value, Gaussian mean and Gaussian standard deviation respectively corresponding to each of a set number of Gaussian algorithms;
and training the original residual network according to the first residual signal value and the second residual signal value of each speech frame contained in the speech sample at each sampling point.
Because the principle of the electronic device for solving the problems is similar to the training method of the residual error network, the implementation of the electronic device can refer to the implementation of the method, and repeated details are not repeated.
Example 11: Fig. 7 is a schematic structural diagram of another electronic device according to an embodiment of the present invention. On the basis of the foregoing embodiments, the electronic device includes: a processor 71, a communication interface 72, a memory 73 and a communication bus 74, where the processor 71, the communication interface 72 and the memory 73 communicate with each other through the communication bus 74; the memory 73 stores a computer program which, when executed by the processor 71, causes the processor 71 to perform the following steps:
acquiring, through a speech synthesis model, at least one acoustic feature vector corresponding to the text to be processed based on the text features of the text to be processed; for each acoustic feature vector, acquiring, through a physical vocoder, the predicted speech signal of the speech frame corresponding to the acoustic feature vector based on the acoustic feature vector; acquiring the second residual signal value of the speech frame corresponding to the acoustic feature vector at each sampling point through a residual network based on the acoustic feature vector; determining, for each sampling point, the corresponding predicted signal value of the sampling point in the predicted speech signal, and determining the synthesized signal value of the speech frame corresponding to the acoustic feature vector at the sampling point according to the predicted signal value and the second residual signal value of the speech frame corresponding to the acoustic feature vector at the sampling point; and determining, in sequence, the synthesized speech signal corresponding to the text to be processed according to the synthesized signal values of the speech frames corresponding to the acoustic feature vectors at each sampling point.
Because the principle of the electronic device for solving the problems is similar to the speech synthesis method, the implementation of the electronic device can be referred to the implementation of the method, and repeated details are not repeated.
The communication bus mentioned in the above electronic device embodiments may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figures, but this does not mean that there is only one bus or one type of bus. The communication interface 72 is used for communication between the above electronic device and other devices. The memory may include a random access memory (RAM) and may also include a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may be at least one storage device located remotely from the processor. The processor may be a general-purpose processor, including a central processing unit or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Example 12: on the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium, in which a computer program executable by a processor is stored, and when the program is run on the processor, the processor is caused to execute the following steps:
determining, for each speech frame in any speech sample in the sample set, a first residual signal value corresponding to each sampling point of the speech frame, where the first residual signal value is determined according to the real signal value and the predicted signal value of the speech frame at the sampling point, and the predicted signal value of the speech frame at the sampling point is determined according to the acoustic feature vector of the speech frame and a physical vocoder;
for each speech frame contained in the speech sample, acquiring a condition vector of the speech frame through a frame-level sub-network in the original residual network based on the acoustic feature vector of the speech frame; acquiring, through a sampling point sub-network in the original residual network, for each sampling point, a Gaussian parameter vector of the speech frame at the current sampling point according to the condition vector and the first residual signal value of the speech frame at the previous sampling point; and determining a second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, where the Gaussian parameter vector comprises the weight value, Gaussian mean and Gaussian standard deviation respectively corresponding to each of a set number of Gaussian algorithms;
and training the original residual network according to the first residual signal value and the second residual signal value of each speech frame contained in the speech sample at each sampling point.
Because the principle of solving the problem by the computer-readable storage medium is similar to the training method of the residual error network in the above embodiment, the specific implementation may refer to the implementation of the training method of the residual error network, and repeated details are not repeated.
Example 13: on the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium, in which a computer program executable by a processor is stored, and when the program is run on the processor, the processor is caused to execute the following steps:
acquiring, through a speech synthesis model, at least one acoustic feature vector corresponding to the text to be processed based on the text features of the text to be processed; for each acoustic feature vector, acquiring, through a physical vocoder, the predicted speech signal of the speech frame corresponding to the acoustic feature vector based on the acoustic feature vector; acquiring the second residual signal value of the speech frame corresponding to the acoustic feature vector at each sampling point through a residual network based on the acoustic feature vector; determining, for each sampling point, the corresponding predicted signal value of the sampling point in the predicted speech signal, and determining the synthesized signal value of the speech frame corresponding to the acoustic feature vector at the sampling point according to the predicted signal value and the second residual signal value of the speech frame corresponding to the acoustic feature vector at the sampling point; and determining, in sequence, the synthesized speech signal corresponding to the text to be processed according to the synthesized signal values of the speech frames corresponding to the acoustic feature vectors at each sampling point.
Since the principle of the computer-readable storage medium to solve the problem is similar to the speech synthesis method in the above-described embodiment, specific implementation can be referred to implementation of the speech synthesis method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of training a residual network, the method comprising:
determining, for each speech frame in any speech sample in a sample set, a first residual signal value corresponding to each sampling point of the speech frame, wherein the first residual signal value is determined according to the real signal value and the predicted signal value of the speech frame at the sampling point, and the predicted signal value of the speech frame at the sampling point is determined according to the acoustic feature vector of the speech frame and a physical vocoder;
acquiring a condition vector of each speech frame contained in the speech sample through a frame-level sub-network in an original residual network based on the acoustic feature vector of the speech frame; through a sampling point sub-network in the original residual network, for each sampling point, acquiring a Gaussian parameter vector of the speech frame at the current sampling point according to the condition vector and the first residual signal value of the speech frame at the previous sampling point; and determining a second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, wherein the Gaussian parameter vector comprises the weight value, Gaussian mean and Gaussian standard deviation respectively corresponding to each of a set number of Gaussian algorithms;
and training the original residual network according to the first residual signal value and the second residual signal value of each speech frame contained in the speech sample at each sampling point.
2. The method of claim 1, wherein determining the second residual signal value of the speech frame at the current sample point based on the Gaussian parameter vector comprises:
for each Gaussian algorithm, determining parameter information corresponding to the Gaussian algorithm in the Gaussian parameter vector;
and determining a second residual signal value of the speech frame at the current sampling point based on the parameter information corresponding to each Gaussian algorithm.
3. The method according to claim 2, wherein said determining the second residual signal value of the speech frame at the current sampling point based on the parameter information corresponding to each of the gaussian algorithms comprises:
aiming at each Gaussian algorithm, determining a Gaussian residual value corresponding to the Gaussian algorithm according to the Gaussian mean value and the Gaussian standard deviation in the parameter information corresponding to the Gaussian algorithm; determining a second residual signal value of the speech frame at the current sampling point according to the weight value in the parameter information corresponding to each Gaussian algorithm and the Gaussian residual value corresponding to each Gaussian algorithm; or
Splicing the weight values in each parameter information into weight vectors; determining a normalization weight vector according to the weight vector and a normalization function; determining a target feature element from each feature element contained in the normalized weight vector; and determining a Gaussian algorithm corresponding to the target characteristic element, determining a Gaussian residual value corresponding to the Gaussian algorithm according to the Gaussian mean value and the Gaussian standard deviation in the parameter information corresponding to the Gaussian algorithm, and determining the Gaussian residual value as a second residual signal value of the speech frame at the current sampling point.
4. The method according to claim 3, wherein determining the residual value of the Gaussian corresponding to the Gaussian algorithm according to the Gaussian mean and the Gaussian standard deviation in the parameter information corresponding to the Gaussian algorithm comprises:
and determining a Gaussian residual value corresponding to the Gaussian algorithm according to the Gaussian mean value, the Gaussian standard deviation and a preset random value in the parameter information corresponding to the Gaussian algorithm, wherein the random value is any one value in a preset random range.
5. The method according to any of claims 1-4, wherein the first residual signal value and the second residual signal value are both floating point type data.
6. A speech synthesis method based on a residual network obtained by training according to any one of claims 1 to 5, characterized in that the method comprises:
acquiring at least one acoustic feature vector corresponding to a text to be processed based on text features of the text to be processed through a speech synthesis model;
for each acoustic feature vector, acquiring a predicted speech signal of the speech frame corresponding to the acoustic feature vector through a physical vocoder based on the acoustic feature vector; acquiring a second residual signal value of the speech frame corresponding to the acoustic feature vector at each sampling point through the residual network based on the acoustic feature vector; for each sampling point, determining a corresponding predicted signal value of the sampling point in the predicted speech signal, and determining a synthesized signal value of the speech frame corresponding to the acoustic feature vector at the sampling point according to the predicted signal value and the second residual signal value of the speech frame corresponding to the acoustic feature vector at the sampling point;
and determining a synthesized voice signal corresponding to the text to be processed according to the synthesized signal value of the voice frame corresponding to each acoustic feature vector at each sampling point in sequence.
7. An apparatus for training a residual network, the apparatus comprising:
a determining module, configured to determine, for each speech frame in any speech sample in a sample set, a first residual signal value corresponding to each sampling point of the speech frame, wherein the first residual signal value is determined according to the real signal value and the predicted signal value of the speech frame at the sampling point, and the predicted signal value of the speech frame at the sampling point is determined according to the acoustic feature vector of the speech frame and a physical vocoder;
a prediction module, configured to acquire a condition vector of each speech frame contained in the speech sample through a frame-level sub-network in an original residual network based on the acoustic feature vector of the speech frame; acquire, through a sampling point sub-network in the original residual network, for each sampling point, a Gaussian parameter vector of the speech frame at the current sampling point according to the condition vector and the first residual signal value of the speech frame at the previous sampling point; and determine a second residual signal value of the speech frame at the current sampling point based on the Gaussian parameter vector, wherein the Gaussian parameter vector comprises the weight value, Gaussian mean and Gaussian standard deviation respectively corresponding to each of a set number of Gaussian algorithms;
and a training module, configured to train the original residual network according to the first residual signal value and the second residual signal value of each speech frame contained in the speech sample at each sampling point.
8. A speech synthesis apparatus based on a residual network trained by the method of any one of claims 1-5, the apparatus comprising:
the acquiring unit is used for acquiring at least one acoustic feature vector corresponding to a text to be processed based on the text feature of the text to be processed through a voice synthesis model;
a prediction unit, configured to, for each acoustic feature vector, obtain a predicted speech signal of the speech frame corresponding to the acoustic feature vector through a physical vocoder based on the acoustic feature vector; acquire a second residual signal value of the speech frame corresponding to the acoustic feature vector at each sampling point through the residual network based on the acoustic feature vector; and determine, for each sampling point, a corresponding predicted signal value of the sampling point in the predicted speech signal, and determine a synthesized signal value of the speech frame corresponding to the acoustic feature vector at the sampling point according to the predicted signal value and the second residual signal value of the speech frame corresponding to the acoustic feature vector at the sampling point;
and the determining unit is used for sequentially determining the synthesized voice signal corresponding to the text to be processed according to the synthesized signal value of the voice frame corresponding to each acoustic feature vector at each sampling point.
9. An electronic device, characterized in that the electronic device comprises a processor for implementing the steps of the training method of the residual network according to any one of claims 1-5 or the steps of the speech synthesis method according to claim 6 when executing a computer program stored in a memory.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, carries out the steps of the training method of the residual network according to any one of claims 1 to 5 or the steps of the speech synthesis method according to claim 6.
CN202011406146.8A 2020-12-03 2020-12-03 Residual error network training and speech synthesis method, device, equipment and medium Pending CN112562655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011406146.8A CN112562655A (en) 2020-12-03 2020-12-03 Residual error network training and speech synthesis method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112562655A true CN112562655A (en) 2021-03-26

Family

ID=75048216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011406146.8A Pending CN112562655A (en) 2020-12-03 2020-12-03 Residual error network training and speech synthesis method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112562655A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080177546A1 (en) * 2007-01-19 2008-07-24 Microsoft Corporation Hidden trajectory modeling with differential cepstra for speech recognition
US20140222421A1 (en) * 2013-02-05 2014-08-07 National Chiao Tung University Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech synthesizing
US20190180732A1 (en) * 2017-10-19 2019-06-13 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech
US20200066253A1 (en) * 2017-10-19 2020-02-27 Baidu Usa Llc Parallel neural text-to-speech
CN111223475A (en) * 2019-11-29 2020-06-02 北京达佳互联信息技术有限公司 Voice data generation method and device, electronic equipment and storage medium
WO2021216299A1 (en) * 2020-04-20 2021-10-28 Qualcomm Incorporated Voice characteristic machine learning modelling
CN112837669A (en) * 2020-05-21 2021-05-25 腾讯科技(深圳)有限公司 Voice synthesis method and device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MU R, et al.: "A review of deep learning research", Transactions on Internet and Information Systems (TIIS), 31 December 2019 (2019-12-31) *
ZHAO Haifeng, et al.: "Design of a speech synthesis conversion system based on personal characteristics", Computer and Information Technology (《电脑与信息技术》), 15 June 2012 (2012-06-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294955A (en) * 2021-04-19 2022-11-04 北京猎户星空科技有限公司 Model training and voice synthesis method, device, equipment and medium
CN113782042A (en) * 2021-09-09 2021-12-10 腾讯科技(深圳)有限公司 Speech synthesis method, vocoder training method, device, equipment and medium
CN113782042B (en) * 2021-09-09 2023-09-19 腾讯科技(深圳)有限公司 Speech synthesis method, vocoder training method, device, equipment and medium
CN115035354A (en) * 2022-08-12 2022-09-09 江西省水利科学院 Reservoir water surface floater target detection method based on improved YOLOX

Similar Documents

Publication Publication Date Title
Liu et al. Diffsinger: Singing voice synthesis via shallow diffusion mechanism
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
WO2021128256A1 (en) Voice conversion method, apparatus and device, and storage medium
CN112562655A (en) Residual error network training and speech synthesis method, device, equipment and medium
Tachibana et al. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation
CN111179905A (en) Rapid dubbing generation method and device
EP4295353A1 (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
US11393452B2 (en) Device for learning speech conversion, and device, method, and program for converting speech
CN105654939A (en) Voice synthesis method based on voice vector textual characteristics
CN112634866B (en) Speech synthesis model training and speech synthesis method, device, equipment and medium
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
WO2023116660A2 (en) Model training and tone conversion method and apparatus, device, and medium
JP2023546098A (en) Audio generator, audio signal generation method, and audio generator learning method
JP2019215500A (en) Voice conversion learning device, voice conversion device, method, and program
Popov et al. Gaussian LPCNet for multisample speech synthesis
CN109326278B (en) Acoustic model construction method and device and electronic equipment
CN110648655B (en) Voice recognition method, device, system and storage medium
Singh et al. Spectral modification based data augmentation for improving end-to-end ASR for children's speech
CN111599339A (en) Speech splicing synthesis method, system, device and medium with high naturalness
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
Liu et al. Using bidirectional associative memories for joint spectral envelope modeling in voice conversion
CN113436607A (en) Fast voice cloning method
Rao et al. SFNet: A computationally efficient source filter model based neural speech synthesis
US11915714B2 (en) Neural pitch-shifting and time-stretching
WO2022039636A1 (en) Method for synthesizing speech and transmitting the authentic intonation of a clonable sample

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination