CN111899715B - Speech synthesis method

Speech synthesis method

Info

Publication number
CN111899715B
CN111899715B (application CN202010672761.7A)
Authority
CN
China
Prior art keywords
spectrum
prediction model
sobel
calculating
feature
Prior art date
Legal status
Active
Application number
CN202010672761.7A
Other languages
Chinese (zh)
Other versions
CN111899715A (en)
Inventor
袁熹 (Yuan Xi)
Current Assignee
Shengzhi Information Technology Nanjing Co ltd
Original Assignee
Shengzhi Information Technology Nanjing Co ltd
Priority date
Filing date
Publication date
Application filed by Shengzhi Information Technology Nanjing Co ltd
Priority to CN202010672761.7A
Publication of CN111899715A
Application granted
Publication of CN111899715B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The invention discloses a speech synthesis method that innovatively introduces the spectral-gradient Sobel operator into the loss function design of a speech synthesis model, thereby improving the feature prediction model's ability to describe detail in speech synthesis; the invention improves the sound quality of synthesized speech.

Description

Speech synthesis method
Technical Field
The invention relates to the technical field of speech, and in particular to a speech synthesis method.
Background
Speech synthesis is a typical interdisciplinary technology that gives computers (or various end devices) the ability to speak like a person. TTS (text-to-speech) technology converts text information, generated by a computer itself or input externally, into intelligible, fluent speech output.
The common way to evaluate speech synthesis today is to judge a synthesis method by the sound quality of the speech it produces, and this evaluation strategy makes sound quality central to speech synthesis research. Speech synthesis and sound quality evaluation are combined: speech is first produced by the synthesis method, the quality of the synthesized speech is then judged by sound quality evaluation, the evaluation result reflects the quality of the synthesis method, and the factors that degrade the sound quality of the synthesized speech are identified and corrected so that speech with better sound quality can be synthesized. Therefore, to effectively advance speech synthesis technology, a high-quality speech synthesis algorithm is particularly important.
The mainstream speech synthesis methods at the present stage are based on parametric modeling and generally consist of two parts, a feature prediction model and a vocoder model, which are trained separately. The feature prediction model maps the input text sequence to acoustic features, which the vocoder model receives and restores to real speech. Before training a model, a loss function (also called an objective function) is defined to express the difference between the predicted result and the real sample, and is used to adjust the model parameters. The design of the loss function has a large impact on model training.
The acoustic feature selections and corresponding loss functions of commonly used feature prediction models are as follows:
The acoustic feature is the fundamental frequency (F0), and the loss is the mean absolute error (MAE), computed as a one-norm (L1) distance. The duration of each phoneme is calculated first, and then the F0 distance to the corresponding real audio; the loss function is calculated as:

$$loss_{F0} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{F0}_i - F0_i\right|$$

The acoustic feature is the linear spectrogram, and the loss is MAE or the mean square error (MSE), computed as an L1 or two-norm (L2) distance. The distance between the predicted linear spectrum $\hat{S}$ and the real linear spectrum $S$ is calculated as:

$$loss = \mathrm{MAE}(\hat{S}, S) = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{S}_i - S_i\right| \quad \text{or} \quad loss = \mathrm{MSE}(\hat{S}, S) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{S}_i - S_i\right)^2$$

The acoustic feature is the Mel spectrogram, and the loss is MAE or MSE, computed as an L1 or L2 distance between the predicted Mel spectrum and the real Mel spectrum, analogous to the linear-spectrum case above.
The acoustic features may also be combinations of the above features, with the loss designed as the corresponding combination of the above losses (a minimal sketch of these baseline losses is given below).
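For illustration only — a minimal sketch, assuming PyTorch, with predicted and real spectrograms as tensors of shape (batch, frequency, frames); the function name is ours, not the patent's:

```python
import torch
import torch.nn.functional as F

def baseline_spectral_losses(pred_spec: torch.Tensor, true_spec: torch.Tensor):
    """MAE (L1 distance) and MSE (L2 distance) between predicted and real spectra."""
    mae = F.l1_loss(pred_spec, true_spec)   # mean absolute error
    mse = F.mse_loss(pred_spec, true_spec)  # mean square error
    return mae, mse
```

The same calls apply unchanged whether the tensors hold F0 contours, linear spectrograms, or Mel spectrograms.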
The paper [Deep Voice: Real-Time Neural Text-to-Speech] discloses a loss design that applies MAE loss to F0. F0 carries the most energy, and fitting F0 correctly can basically restore the timbre of the target speaker. But the mid-to-high-frequency part of speech carries the details that are closely tied to sound quality; an F0-based loss design ignores the mid-to-high frequencies, which can seriously degrade the sound quality of the synthesized speech.
In speech signals, the low-frequency part usually has high energy while the mid-to-high-frequency part has less. With MAE or MSE loss, the model tends to fit the low-frequency part (the high low-frequency energy produces large gradients), so the mid-to-high frequencies come out muddy, mid-to-high-frequency texture is lost, and the synthesized timbre sounds muffled. In addition, MAE loss "sharpens" the spectrum, giving the synthesized sound a "mechanical" quality.
The paper [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions] discloses a loss design that applies MSE loss to the Mel spectrum. As described above, mid-to-high-frequency detail still cannot be characterized; moreover, MSE loss "blurs" the spectrogram, and the resulting sound has a "muddy" quality.
Disclosure of Invention
The invention aims to overcome the defects of the prior art described above and provide a speech synthesis method.
The invention adopts the following technical scheme to solve this technical problem:
The speech synthesis method provided by the invention comprises the following steps:
step 1, the acoustic feature is selected to be the Mel spectrum or the linear spectrum as the output of a feature prediction model, and the text is forward-calculated through the feature prediction model to obtain a predicted spectrum $\hat{S}$;
step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;
step 3, calculating the spectrum $S$ of the real audio;
step 4, calculating the Sobel operator $S_{sobel}$ of $S$;
step 5, calculating the mean square error $\mathrm{MSE}(\hat{S}, S)$ of $\hat{S}$ and $S$;
step 6, calculating the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the Sobel operators of $\hat{S}$ and $S$;
step 7, determining a balance coefficient $\alpha$;
step 8, constructing the following loss function:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel}) \quad (1)$$

wherein the loss consists of two parts, the mean square error $\mathrm{MSE}(\hat{S}, S)$ from step 5 and the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the feature-spectrum Sobel operators from step 6, and $\alpha$ is a balance coefficient that weights the two parts;
step 9, based on the loss calculated in step 8, back-propagating and updating the parameters of the feature prediction model;
step 10, repeating steps 1-9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;
step 11, during speech synthesis, inputting text to the feature prediction model, calculating and outputting the predicted spectrum $\hat{S}$ of step 1 through the feature prediction model, and then inputting $\hat{S}$ to a vocoder to obtain speech.
As a further optimization scheme of the speech synthesis method of the present invention, in step 2, calculating $\hat{S}_{sobel}$ means calculating the Sobel features of $\hat{S}$, including the x-direction and the y-direction. The Sobel operator derives from image processing, where an image is in fact a two-dimensional array; the spectrum of an acoustic feature is similar to an image and can likewise be understood as a two-dimensional array, the x-direction being the transverse direction of the array and the y-direction the longitudinal direction (a sketch of this computation is given below).
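A minimal sketch of this Sobel feature computation, assuming PyTorch and a spectrogram stored as a (batch, frequency, time) tensor; the 3×3 kernels are the standard Sobel kernels from image processing, and the helper name `sobel_features` is ours:

```python
import torch
import torch.nn.functional as F

# Standard 3x3 Sobel kernels; the y kernel is the transpose of the x kernel.
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_features(spec: torch.Tensor) -> torch.Tensor:
    """Sobel features of a (batch, freq, time) spectrogram, x- and y-direction."""
    s = spec.unsqueeze(1)                 # -> (batch, 1, freq, time)
    gx = F.conv2d(s, SOBEL_X, padding=1)  # gradient along the x-direction
    gy = F.conv2d(s, SOBEL_Y, padding=1)  # gradient along the y-direction
    return torch.cat([gx, gy], dim=1)     # both directions stacked
```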
As a further optimization scheme of the speech synthesis method of the present invention, in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a Mel spectrum but must be consistent with the spectrum selected in step 1.
As a further optimization scheme of the speech synthesis method of the present invention, in step 4, calculating $S_{sobel}$ means calculating the Sobel features of $S$, including the x-direction and the y-direction.
As a further optimization scheme of the speech synthesis method of the present invention, in step 7, the balance coefficient $\alpha$ ranges from 0 to 1.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
(1) the spectral-gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model, improving the feature prediction model's ability to describe detail in speech synthesis;
(2) the invention improves the sound quality of speech synthesis.
Drawings
FIG. 1 shows the calculation of the loss function incorporating the Sobel operator.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Sobel is a gradient feature operator that derives from image processing and characterizes the texture features of an image. The loss of mid-to-high-frequency detail is an important cause of degraded synthesized sound quality. In this technique, the Sobel operator is introduced into the loss design of the feature prediction model so that the model attends to the details of the acoustic features, thereby improving the synthesized sound quality.
A method for determining the loss function of a speech synthesis model in combination with the Sobel operator comprises the following steps:
step 1, the acoustic feature is selected to be the Mel spectrogram or the linear spectrogram, and forward calculation yields the predicted spectrum $\hat{S}$;
step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;
step 3, calculating the spectrum $S$ of the real audio;
step 4, calculating the Sobel operator $S_{sobel}$ of $S$;
step 5, calculating the MSE of the two spectra;
step 6, calculating the MSE of the two spectral Sobel operators;
step 7, determining the balance coefficient $\alpha$;
step 8, constructing the following loss:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel}) \quad (1)$$
the forward process of step 1 refers to spectral output during model trainingConsidering that the vocoder model can restore sound well through linear spectrum or mel spectrum at present, the characteristic prediction model characteristic selection can be linear spectrum or mel spectrum;
in the step 2 of the process, the process is carried out,calculation means +.>Sobel feature calculation of (2) including x-direction and y-direction;
in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a mel spectrum, but is consistent with the spectrum selection in step 1;
in step 4, S sobel The calculation refers to Sobel feature calculation of S, and comprises an x direction and a y direction;
in step 5, the MSE of the spectrum is calculated;
in step 6, the MSE of the Sobel operator of the spectrum is calculated;
in step 7, the balance coefficient is used to control the weight of the two parts, and the range is 0 to 1;
in step 9, a loss function is constructed finally, which consists of an MSE of a spectrum and an MSE of a Sobel operator of the spectrum, and the balance coefficient controls the weights of the two parts.
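Putting steps 5-8 together, a minimal sketch of the loss of equation (1), reusing the hypothetical `sobel_features` helper sketched above; the default value of `alpha` is an arbitrary choice within the 0-to-1 range the patent requires:

```python
import torch.nn.functional as F

def sobel_loss(pred_spec, true_spec, alpha: float = 0.5):
    """loss = alpha * MSE(spectra) + (1 - alpha) * MSE(spectral Sobel operators)."""
    mse_spec = F.mse_loss(pred_spec, true_spec)               # step 5
    mse_sobel = F.mse_loss(sobel_features(pred_spec),
                           sobel_features(true_spec))         # step 6
    return alpha * mse_spec + (1.0 - alpha) * mse_sobel       # step 8, eq. (1)
```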
The invention focuses on a Sobel-operator-based loss design method within the feature prediction model of speech synthesis; the spectral-gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model.
FIG. 1 is a schematic diagram of the loss function. The feature prediction model is forward-calculated to obtain the predicted spectrum (component 101, corresponding to $\hat{S}$ in equation (1)), from which its Sobel operator is calculated (component 102, corresponding to $\hat{S}_{sobel}$ in equation (1)). The real audio is processed to obtain the real spectrum (component 103, corresponding to $S$ in equation (1)), from which its Sobel operator is calculated (component 104, corresponding to $S_{sobel}$ in equation (1)). The MSE of components 101 and 103 gives the spectral MSE (component 105, corresponding to $\mathrm{MSE}(\hat{S}, S)$ in equation (1)); the MSE of components 102 and 104 gives the MSE of the spectral Sobel operators (component 106, corresponding to $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ in equation (1)). With the balance coefficient $\alpha$ specified, components 105 and 106 are dot-multiplied with $[\alpha, (1-\alpha)]$ to obtain the final loss (component 107).
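For completeness, a hedged sketch of how steps 9-11 could use this loss during training and inference; `model`, `vocoder`, the optimizer, and the learning rate are placeholders, not components disclosed by the patent:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder settings

def train_step(text_batch, true_spec, alpha=0.5):
    pred_spec = model(text_batch)                   # step 1: forward calculation
    loss = sobel_loss(pred_spec, true_spec, alpha)  # steps 2-8: build the loss
    optimizer.zero_grad()
    loss.backward()                                 # step 9: back-propagate
    optimizer.step()                                # step 9: update parameters
    return loss.item()

def synthesize(text):
    with torch.no_grad():
        pred_spec = model(text)    # step 11: predict the spectrum
        return vocoder(pred_spec)  # the vocoder restores the waveform
```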
The foregoing is merely illustrative of the present invention and is not limiting; any changes or substitutions easily conceived by those skilled in the art within the scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (5)

1. A speech synthesis method, comprising the following steps:
step 1, the acoustic feature is selected to be the Mel spectrum or the linear spectrum as the output of a feature prediction model, and the text is forward-calculated through the feature prediction model to obtain a predicted spectrum $\hat{S}$;
step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;
step 3, calculating the spectrum $S$ of the real audio;
step 4, calculating the Sobel operator $S_{sobel}$ of $S$;
step 5, calculating the mean square error $\mathrm{MSE}(\hat{S}, S)$ of $\hat{S}$ and $S$;
step 6, calculating the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the Sobel operators of $\hat{S}$ and $S$;
step 7, determining a balance coefficient $\alpha$;
step 8, constructing the following loss function:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel}) \quad (1)$$

wherein the loss consists of two parts, the mean square error $\mathrm{MSE}(\hat{S}, S)$ from step 5 and the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the feature-spectrum Sobel operators from step 6, and $\alpha$ is a balance coefficient that weights the two parts;
step 9, based on the loss calculated in step 8, back-propagating and updating the parameters of the feature prediction model;
step 10, repeating steps 1-9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;
step 11, during speech synthesis, inputting text to the feature prediction model, calculating and outputting the predicted spectrum $\hat{S}$ of step 1 through the feature prediction model, and then inputting $\hat{S}$ to a vocoder to obtain speech.
2. The speech synthesis method according to claim 1, wherein in step 2, calculating $\hat{S}_{sobel}$ means calculating the Sobel features of $\hat{S}$, including the x-direction and the y-direction; the Sobel operator derives from image processing, where an image is in fact a two-dimensional array; the spectrum of an acoustic feature is similar to an image and can be understood as a two-dimensional array, the x-direction being the transverse direction of the array and the y-direction the longitudinal direction of the array.
3. The speech synthesis method according to claim 1, wherein in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a Mel spectrum but must be consistent with the spectrum selected in step 1.
4. The speech synthesis method according to claim 1, wherein in step 4, calculating $S_{sobel}$ means calculating the Sobel features of $S$, including the x-direction and the y-direction.
5. The speech synthesis method according to claim 1, wherein in step 7, the balance coefficient ranges from 0 to 1.
CN202010672761.7A 2020-07-14 2020-07-14 Speech synthesis method Active CN111899715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010672761.7A 2020-07-14 2020-07-14 Speech synthesis method


Publications (2)

Publication Number Publication Date
CN111899715A CN111899715A (en) 2020-11-06
CN111899715B (en) 2024-03-29

Family

Family ID: 73192553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010672761.7A Speech synthesis method 2020-07-14 2020-07-14 (Active)

Country Status (1)

Country Link
CN (1) CN111899715B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8583438B2 (en) * 2007-09-20 2013-11-12 Microsoft Corporation Unnatural prosody detection in speech synthesis
JP5337608B2 (en) * 2008-07-16 2013-11-06 本田技研工業株式会社 Beat tracking device, beat tracking method, recording medium, beat tracking program, and robot
GB2508417B (en) * 2012-11-30 2017-02-08 Toshiba Res Europe Ltd A speech processing system
US9484015B2 (en) * 2013-05-28 2016-11-01 International Business Machines Corporation Hybrid predictive model for enhancing prosodic expressiveness

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281744A (en) * 2007-04-04 2008-10-08 International Business Machines Corporation Method and apparatus for analyzing and synthesizing voice
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis
US10510358B1 (en) * 2017-09-29 2019-12-17 Amazon Technologies, Inc. Resolution enhancement of speech signals for speech synthesis
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ziyue Zhao, "A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement," 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019-11-23, full text. *
Hu Yajun (胡亚军), "Research on Statistical Parametric Speech Synthesis Methods Based on Neural Networks" (基于神经网络的统计参数语音合成方法研究), China Master's Theses Full-text Database, 2018-10-15, full text. *

Also Published As

Publication number Publication date
CN111899715A (en) 2020-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant