CN111899715B - Speech synthesis method

Speech synthesis method

Info

Publication number
CN111899715B
CN111899715B (application CN202010672761.7A)
Authority
CN
China
Prior art keywords
spectrum
prediction model
sobel
calculating
feature
Prior art date
Legal status
Active
Application number
CN202010672761.7A
Other languages
Chinese (zh)
Other versions
CN111899715A (en)
Inventor
袁熹 (Yuan Xi)
Current Assignee
Shengzhi Information Technology Nanjing Co ltd
Original Assignee
Shengzhi Information Technology Nanjing Co ltd
Priority date
Filing date
Publication date
Application filed by Shengzhi Information Technology Nanjing Co ltd
Priority to CN202010672761.7A
Publication of CN111899715A
Application granted
Publication of CN111899715B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The invention discloses a speech synthesis method that innovatively introduces the spectral-gradient Sobel operator into the loss function design of a speech synthesis model, thereby improving the feature prediction model's ability to describe detail in speech synthesis; the invention improves the sound quality of synthesized speech.

Description

Speech synthesis method
Technical Field
The invention relates to the technical field of speech, and in particular to a speech synthesis method.
Background
Speech synthesis is a typical interdisciplinary technology that gives computers (or various end devices) the ability to speak like a person. TTS (text-to-speech) technology converts text information, generated by a computer itself or input externally, into intelligible, fluent speech output.
The common way to evaluate speech synthesis today is to judge a synthesis method by the sound quality of the speech it produces, and this evaluation strategy makes sound quality central to speech synthesis research. Speech synthesis and sound quality evaluation are combined: speech is first produced by the synthesis method, the quality of the synthesized speech is then judged by sound quality evaluation, the evaluation result reflects the quality of the synthesis method, and the factors that degrade the sound quality of the synthesized speech are identified and corrected so that speech with better sound quality can be synthesized. Therefore, to effectively advance speech synthesis technology, a high-quality speech synthesis algorithm is particularly important.
The mainstream speech synthesis methods at the present stage are based on parametric modeling and generally consist of two parts, a feature prediction model and a vocoder model, which are trained separately. The feature prediction model maps the input text sequence to acoustic features, which the vocoder model receives and restores to real speech. Before training a model, a loss function (also called an objective function) is defined to express the difference between the predicted result and the real sample, and is used to adjust the model parameters. The design of the loss function has a large impact on model training.
The acoustic feature selections and corresponding loss functions of commonly used feature prediction models are as follows:
The acoustic feature is the fundamental frequency (F0), and the loss is the mean absolute error (MAE), computed as a one-norm (L1) distance. The duration of each phoneme is calculated first, and then the F0 distance to the corresponding real audio; the loss function is calculated as:

$$loss_{F0} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{F0}_i - F0_i\right|$$

The acoustic feature is the linear spectrogram, and the loss is MAE or the mean square error (MSE), computed as an L1 or two-norm (L2) distance. The distance between the predicted linear spectrum $\hat{S}$ and the real linear spectrum $S$ is calculated as:

$$loss = \mathrm{MAE}(\hat{S}, S) = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{S}_i - S_i\right| \quad \text{or} \quad loss = \mathrm{MSE}(\hat{S}, S) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{S}_i - S_i\right)^2$$

The acoustic feature is the Mel spectrogram, and the loss is MAE or MSE, computed as an L1 or L2 distance between the predicted Mel spectrum and the real Mel spectrum, analogous to the linear-spectrum case above.
The acoustic features may also be combinations of the above features, with the loss designed as the corresponding combination of the above losses (a minimal sketch of these baseline losses is given below).
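For illustration only — a minimal sketch, assuming PyTorch, with predicted and real spectrograms as tensors of shape (batch, frequency, frames); the function name is ours, not the patent's:

```python
import torch
import torch.nn.functional as F

def baseline_spectral_losses(pred_spec: torch.Tensor, true_spec: torch.Tensor):
    """MAE (L1 distance) and MSE (L2 distance) between predicted and real spectra."""
    mae = F.l1_loss(pred_spec, true_spec)   # mean absolute error
    mse = F.mse_loss(pred_spec, true_spec)  # mean square error
    return mae, mse
```

The same calls apply unchanged whether the tensors hold F0 contours, linear spectrograms, or Mel spectrograms.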
The paper [Deep Voice: Real-Time Neural Text-to-Speech] discloses a loss design that applies MAE loss to F0. F0 carries the most energy, and fitting F0 correctly can basically restore the timbre of the target speaker. But the mid-to-high-frequency part of speech carries the details that are closely tied to sound quality; an F0-based loss design ignores the mid-to-high frequencies, which can seriously degrade the sound quality of the synthesized speech.
In speech signals, the low-frequency part usually has high energy while the mid-to-high-frequency part has less. With MAE or MSE loss, the model tends to fit the low-frequency part (the high low-frequency energy produces large gradients), so the mid-to-high frequencies come out muddy, mid-to-high-frequency texture is lost, and the synthesized timbre sounds muffled. In addition, MAE loss "sharpens" the spectrum, giving the synthesized sound a "mechanical" quality.
The paper [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions] discloses a loss design that applies MSE loss to the Mel spectrum. As described above, mid-to-high-frequency detail still cannot be characterized; moreover, MSE loss "blurs" the spectrogram, and the resulting sound has a "muddy" quality.
Disclosure of Invention
The invention aims to overcome the defects of the prior art described above and provide a speech synthesis method.
The invention adopts the following technical scheme to solve this technical problem:
The speech synthesis method provided by the invention comprises the following steps:
step 1, the acoustic feature is selected to be the Mel spectrum or the linear spectrum as the output of a feature prediction model, and the text is forward-calculated through the feature prediction model to obtain a predicted spectrum $\hat{S}$;
step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;
step 3, calculating the spectrum $S$ of the real audio;
step 4, calculating the Sobel operator $S_{sobel}$ of $S$;
step 5, calculating the mean square error $\mathrm{MSE}(\hat{S}, S)$ of $\hat{S}$ and $S$;
step 6, calculating the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the Sobel operators of $\hat{S}$ and $S$;
step 7, determining a balance coefficient $\alpha$;
step 8, constructing the following loss function:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel}) \quad (1)$$

wherein the loss consists of two parts, the mean square error $\mathrm{MSE}(\hat{S}, S)$ from step 5 and the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the feature-spectrum Sobel operators from step 6, and $\alpha$ is a balance coefficient that weights the two parts;
step 9, based on the loss calculated in step 8, back-propagating and updating the parameters of the feature prediction model;
step 10, repeating steps 1-9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;
step 11, during speech synthesis, inputting text to the feature prediction model, calculating and outputting the predicted spectrum $\hat{S}$ of step 1 through the feature prediction model, and then inputting $\hat{S}$ to a vocoder to obtain speech.
As a further optimization scheme of the speech synthesis method of the present invention, in step 2, calculating $\hat{S}_{sobel}$ means calculating the Sobel features of $\hat{S}$, including the x-direction and the y-direction. The Sobel operator derives from image processing, where an image is in fact a two-dimensional array; the spectrum of an acoustic feature is similar to an image and can likewise be understood as a two-dimensional array, the x-direction being the transverse direction of the array and the y-direction the longitudinal direction (a sketch of this computation is given below).
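A minimal sketch of this Sobel feature computation, assuming PyTorch and a spectrogram stored as a (batch, frequency, time) tensor; the 3×3 kernels are the standard Sobel kernels from image processing, and the helper name `sobel_features` is ours:

```python
import torch
import torch.nn.functional as F

# Standard 3x3 Sobel kernels; the y kernel is the transpose of the x kernel.
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_features(spec: torch.Tensor) -> torch.Tensor:
    """Sobel features of a (batch, freq, time) spectrogram, x- and y-direction."""
    s = spec.unsqueeze(1)                 # -> (batch, 1, freq, time)
    gx = F.conv2d(s, SOBEL_X, padding=1)  # gradient along the x-direction
    gy = F.conv2d(s, SOBEL_Y, padding=1)  # gradient along the y-direction
    return torch.cat([gx, gy], dim=1)     # both directions stacked
```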
As a further optimization scheme of the speech synthesis method of the present invention, in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a Mel spectrum but must be consistent with the spectrum selected in step 1.
As a further optimization scheme of the speech synthesis method of the present invention, in step 4, calculating $S_{sobel}$ means calculating the Sobel features of $S$, including the x-direction and the y-direction.
As a further optimization scheme of the speech synthesis method of the present invention, in step 7, the balance coefficient $\alpha$ ranges from 0 to 1.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
(1) the spectral-gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model, improving the feature prediction model's ability to describe detail in speech synthesis;
(2) the invention improves the sound quality of speech synthesis.
Drawings
FIG. 1 shows the calculation of the loss function incorporating the Sobel operator.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Sobel is a gradient feature operator that derives from image processing and characterizes the texture features of an image. The loss of mid-to-high-frequency detail is an important cause of degraded synthesized sound quality. In this technique, the Sobel operator is introduced into the loss design of the feature prediction model so that the model attends to the details of the acoustic features, thereby improving the synthesized sound quality.
A method for determining the loss function of a speech synthesis model in combination with the Sobel operator comprises the following steps:
step 1, the acoustic feature is selected to be the Mel spectrogram or the linear spectrogram, and forward calculation yields the predicted spectrum $\hat{S}$;
step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;
step 3, calculating the spectrum $S$ of the real audio;
step 4, calculating the Sobel operator $S_{sobel}$ of $S$;
step 5, calculating the MSE of the two spectra;
step 6, calculating the MSE of the two spectral Sobel operators;
step 7, determining the balance coefficient $\alpha$;
step 8, constructing the following loss:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel}) \quad (1)$$
the forward process of step 1 refers to spectral output during model trainingConsidering that the vocoder model can restore sound well through linear spectrum or mel spectrum at present, the characteristic prediction model characteristic selection can be linear spectrum or mel spectrum;
in the step 2 of the process, the process is carried out,calculation means +.>Sobel feature calculation of (2) including x-direction and y-direction;
in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a mel spectrum, but is consistent with the spectrum selection in step 1;
in step 4, S sobel The calculation refers to Sobel feature calculation of S, and comprises an x direction and a y direction;
in step 5, the MSE of the spectrum is calculated;
in step 6, the MSE of the Sobel operator of the spectrum is calculated;
in step 7, the balance coefficient is used to control the weight of the two parts, and the range is 0 to 1;
in step 9, a loss function is constructed finally, which consists of an MSE of a spectrum and an MSE of a Sobel operator of the spectrum, and the balance coefficient controls the weights of the two parts.
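Putting steps 5-8 together, a minimal sketch of the loss of equation (1), reusing the hypothetical `sobel_features` helper sketched above; the default value of `alpha` is an arbitrary choice within the 0-to-1 range the patent requires:

```python
import torch.nn.functional as F

def sobel_loss(pred_spec, true_spec, alpha: float = 0.5):
    """loss = alpha * MSE(spectra) + (1 - alpha) * MSE(spectral Sobel operators)."""
    mse_spec = F.mse_loss(pred_spec, true_spec)               # step 5
    mse_sobel = F.mse_loss(sobel_features(pred_spec),
                           sobel_features(true_spec))         # step 6
    return alpha * mse_spec + (1.0 - alpha) * mse_sobel       # step 8, eq. (1)
```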
The invention focuses on a Sobel-operator-based loss design method within the feature prediction model of speech synthesis; the spectral-gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model.
FIG. 1 is a schematic diagram of the loss function. The feature prediction model is forward-calculated to obtain the predicted spectrum (component 101, corresponding to $\hat{S}$ in equation (1)), from which its Sobel operator is calculated (component 102, corresponding to $\hat{S}_{sobel}$ in equation (1)). The real audio is processed to obtain the real spectrum (component 103, corresponding to $S$ in equation (1)), from which its Sobel operator is calculated (component 104, corresponding to $S_{sobel}$ in equation (1)). The MSE of components 101 and 103 gives the spectral MSE (component 105, corresponding to $\mathrm{MSE}(\hat{S}, S)$ in equation (1)); the MSE of components 102 and 104 gives the MSE of the spectral Sobel operators (component 106, corresponding to $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ in equation (1)). With the balance coefficient $\alpha$ specified, components 105 and 106 are dot-multiplied with $[\alpha, (1-\alpha)]$ to obtain the final loss (component 107).
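For completeness, a hedged sketch of how steps 9-11 could use this loss during training and inference; `model`, `vocoder`, the optimizer, and the learning rate are placeholders, not components disclosed by the patent:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder settings

def train_step(text_batch, true_spec, alpha=0.5):
    pred_spec = model(text_batch)                   # step 1: forward calculation
    loss = sobel_loss(pred_spec, true_spec, alpha)  # steps 2-8: build the loss
    optimizer.zero_grad()
    loss.backward()                                 # step 9: back-propagate
    optimizer.step()                                # step 9: update parameters
    return loss.item()

def synthesize(text):
    with torch.no_grad():
        pred_spec = model(text)    # step 11: predict the spectrum
        return vocoder(pred_spec)  # the vocoder restores the waveform
```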
The foregoing is merely illustrative of the present invention and is not limiting; any changes or substitutions easily conceived by those skilled in the art within the scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (5)

1. A speech synthesis method, comprising the following steps:
step 1, the acoustic feature is selected to be the Mel spectrum or the linear spectrum as the output of a feature prediction model, and the text is forward-calculated through the feature prediction model to obtain a predicted spectrum $\hat{S}$;
step 2, calculating the Sobel operator $\hat{S}_{sobel}$ of $\hat{S}$;
step 3, calculating the spectrum $S$ of the real audio;
step 4, calculating the Sobel operator $S_{sobel}$ of $S$;
step 5, calculating the mean square error $\mathrm{MSE}(\hat{S}, S)$ of $\hat{S}$ and $S$;
step 6, calculating the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the Sobel operators of $\hat{S}$ and $S$;
step 7, determining a balance coefficient $\alpha$;
step 8, constructing the following loss function:

$$loss = \alpha \cdot \mathrm{MSE}(\hat{S}, S) + (1 - \alpha) \cdot \mathrm{MSE}(\hat{S}_{sobel}, S_{sobel}) \quad (1)$$

wherein the loss consists of two parts, the mean square error $\mathrm{MSE}(\hat{S}, S)$ from step 5 and the mean square error $\mathrm{MSE}(\hat{S}_{sobel}, S_{sobel})$ of the feature-spectrum Sobel operators from step 6, and $\alpha$ is a balance coefficient that weights the two parts;
step 9, based on the loss calculated in step 8, back-propagating and updating the parameters of the feature prediction model;
step 10, repeating steps 1-9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;
step 11, during speech synthesis, inputting text to the feature prediction model, calculating and outputting the predicted spectrum $\hat{S}$ of step 1 through the feature prediction model, and then inputting $\hat{S}$ to a vocoder to obtain speech.
2. The speech synthesis method according to claim 1, wherein in step 2, calculating $\hat{S}_{sobel}$ means calculating the Sobel features of $\hat{S}$, including the x-direction and the y-direction; the Sobel operator derives from image processing, where an image is in fact a two-dimensional array; the spectrum of an acoustic feature is similar to an image and can be understood as a two-dimensional array, the x-direction being the transverse direction of the array and the y-direction the longitudinal direction of the array.
3. The speech synthesis method according to claim 1, wherein in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a Mel spectrum but must be consistent with the spectrum selected in step 1.
4. The speech synthesis method according to claim 1, wherein in step 4, calculating $S_{sobel}$ means calculating the Sobel features of $S$, including the x-direction and the y-direction.
5. The speech synthesis method according to claim 1, wherein in step 7, the balance coefficient ranges from 0 to 1.
CN202010672761.7A 2020-07-14 2020-07-14 Speech synthesis method Active CN111899715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010672761.7A 2020-07-14 2020-07-14 Speech synthesis method


Publications (2)

Publication Number Publication Date
CN111899715A CN111899715A (en) 2020-11-06
CN111899715B (en) 2024-03-29

Family

Family ID: 73192553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010672761.7A Speech synthesis method 2020-07-14 2020-07-14 (Active)

Country Status (1)

Country Link
CN (1) CN111899715B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8583438B2 (en) * 2007-09-20 2013-11-12 Microsoft Corporation Unnatural prosody detection in speech synthesis
JP5337608B2 (en) * 2008-07-16 2013-11-06 本田技研工業株式会社 Beat tracking device, beat tracking method, recording medium, beat tracking program, and robot
GB2508417B (en) * 2012-11-30 2017-02-08 Toshiba Res Europe Ltd A speech processing system
US9484015B2 (en) * 2013-05-28 2016-11-01 International Business Machines Corporation Hybrid predictive model for enhancing prosodic expressiveness

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281744A (en) * 2007-04-04 2008-10-08 International Business Machines Corporation Method and apparatus for analyzing and synthesizing voice
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis
US10510358B1 (en) * 2017-09-29 2019-12-17 Amazon Technologies, Inc. Resolution enhancement of speech signals for speech synthesis
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ziyue Zhao, "A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement," 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019-11-23, full text. *
Hu Yajun (胡亚军), "Research on Statistical Parametric Speech Synthesis Methods Based on Neural Networks" (基于神经网络的统计参数语音合成方法研究), China Master's Theses Full-text Database, 2018-10-15, full text. *

Also Published As

Publication number Publication date
CN111899715A (en) 2020-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant