CN111899715B - Speech synthesis method - Google Patents
- Publication number
- CN111899715B (granted publication; application CN202010672761.7A)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- prediction model
- sobel
- calculating
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The invention discloses a speech synthesis method that introduces the spectral-gradient Sobel operator into the loss function design of a speech synthesis model, improving the feature prediction model's ability to describe spectral detail in speech synthesis and thereby improving the sound quality of the synthesized speech.
Description
Technical Field
The invention relates to the technical field of voice, in particular to a voice synthesis method.
Background
Speech synthesis technology is a typical interdisciplinary technology that gives computers (or various end devices) the ability to speak like a person. TTS (text-to-speech) technology converts text information, generated by a computer itself or input externally, into intelligible, fluent speech output.
The common way to evaluate a speech synthesis method is to assess the sound quality of the speech it produces, so sound quality is central to speech synthesis research. Speech synthesis and sound quality evaluation are combined: speech is first synthesized by the synthesis method, the quality of the synthesized speech is then judged by a sound quality evaluation, the evaluation result reflects the quality of the synthesis method, and the factors degrading the sound quality of the synthesized speech are identified and corrected so that better-sounding speech can be synthesized. A high-quality speech synthesis algorithm is therefore particularly important for advancing speech synthesis technology.
The mainstream speech synthesis methods at the present stage are based on modeling parameters and generally consist of two parts, a feature prediction model and a vocoder model, trained separately. The feature prediction model maps the input text sequence into acoustic features, which the vocoder model receives and restores to real speech. Before training a model, a loss function (also called an objective function) is defined to express the difference between the predicted result and the real sample, so that the model parameters can be adjusted. The design of the loss function has a large impact on model training.
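As a hedged illustration of how a loss function drives parameter adjustment (a toy example, not the patent's feature prediction model), the sketch below fits a single-parameter linear predictor by gradient descent on an MSE loss:

```python
import numpy as np

# Toy illustration: a linear "predictor" y = w * x is fitted to a target
# by minimizing MSE loss. The loss expresses the prediction/target gap,
# and its gradient drives the parameter update, as described above.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_true = 3.0 * x          # targets generated with the "real" parameter w = 3

w = 0.0                   # model parameter, to be learned
lr = 0.1                  # learning rate
for _ in range(100):
    y_pred = w * x
    loss = np.mean((y_pred - y_true) ** 2)      # MSE loss
    grad = np.mean(2 * (y_pred - y_true) * x)   # d(loss)/dw
    w -= lr * grad                              # gradient-descent update

print(round(w, 3))  # converges toward 3.0
```

The same mechanism, with a neural network in place of `w * x` and a spectrum in place of `y`, is what the loss designs discussed below control.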
The acoustic feature selections and corresponding loss functions of commonly used feature prediction models are as follows:
The acoustic feature is the fundamental frequency (F0); the loss is the mean absolute error (MAE), with distance computed as a one-norm (L1) distance. The duration of each phoneme is calculated first, then the fundamental-frequency distance from the corresponding real audio; the loss function is calculated as follows:
loss_F0 = MAE(F̂0, F0) = (1/N) Σ_{i=1}^{N} |F̂0_i − F0_i|
The acoustic feature is a linear spectrum (Linear Spectrogram); the loss is MAE or the mean square error (MSE), with distance computed as an L1 or two-norm (L2) distance. The distance between the predicted linear spectrum Ŝ and the true linear spectrum S is calculated as follows:
MAE(Ŝ, S) = (1/N) Σ_{i=1}^{N} |Ŝ_i − S_i|   or   MSE(Ŝ, S) = (1/N) Σ_{i=1}^{N} (Ŝ_i − S_i)²
The acoustic feature is a mel spectrum (Mel Spectrogram); the loss is MAE or MSE, with distance computed as an L1 or L2 distance between the predicted mel spectrum and the true mel spectrum, calculated as above.
The acoustic features may also be combinations of the above features, with the loss designed as a combination of the above losses.
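The MAE and MSE distances listed above can be sketched as follows (a minimal illustration assuming the predicted and target spectra are already aligned 2D arrays; the example values are made up for demonstration):

```python
import numpy as np

def mae_loss(pred, target):
    """Mean absolute error (L1 distance), as used for F0 or spectra."""
    return np.mean(np.abs(pred - target))

def mse_loss(pred, target):
    """Mean square error (L2 distance), as used for linear / mel spectra."""
    return np.mean((pred - target) ** 2)

pred = np.array([[1.0, 2.0], [3.0, 4.0]])      # illustrative predicted spectrum
target = np.array([[1.0, 2.5], [2.0, 4.0]])    # illustrative true spectrum
print(mae_loss(pred, target))  # 0.375
print(mse_loss(pred, target))  # 0.3125
```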
The paper [Deep Voice: Real-time Neural Text-to-Speech] discloses a loss design employing an MAE loss on F0. F0 carries the most energy, and fitting F0 correctly can basically restore the timbre of the target speaker. But the mid- and high-frequency parts of speech represent details that are closely related to sound quality; an F0-based loss design does not consider the mid- and high-frequency parts, and the sound quality of the synthesized speech is seriously degraded;
in speech signals, the low-frequency part usually carries high energy while the mid- and high-frequency parts carry less. With an MAE or MSE loss the model tends to fit the low-frequency part (because the high low-frequency energy produces large gradients), so the mid- and high-frequency parts come out muddy, mid- and high-frequency texture is lost, and the synthesized sound is muffled. In addition, an MAE loss "sharpens" the spectrum, and the resulting sound quality feels "mechanical";
the paper [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions] discloses a loss design using an MSE loss on mel spectra; as described above, mid- and high-frequency detail still cannot be characterized, and an MSE loss "blurs" the spectrogram, so the resulting sound quality feels "muddy".
Disclosure of Invention
The invention aims to solve the technical problem of overcoming the defects of the prior art and providing a speech synthesis method.
The invention adopts the following technical scheme to solve the technical problem:
The speech synthesis method provided by the invention comprises the following steps:
Step 1, the acoustic feature is selected to be a mel spectrum or a linear spectrum as the output of a feature prediction model, and a text is forward-calculated through the feature prediction model to obtain a predicted spectrum Ŝ;
Step 2, calculating the Sobel operator Ŝ_sobel of Ŝ;
Step 3, calculating the spectrum S of the real audio;
Step 4, calculating the Sobel operator S_sobel of S;
Step 5, calculating the mean square error MSE(Ŝ, S) of Ŝ and S;
Step 6, calculating the mean square error MSE(Ŝ_sobel, S_sobel) of the Sobel operators of Ŝ and S;
Step 7, determining a balance coefficient α;
Step 8, constructing the following loss function loss:
loss = α · MSE(Ŝ, S) + (1 − α) · MSE(Ŝ_sobel, S_sobel)   (1)
wherein loss consists of two parts, the mean square error MSE(Ŝ, S) from step 5 and the mean square error MSE(Ŝ_sobel, S_sobel) of the spectral Sobel operators from step 6, and α is a balance coefficient balancing the two parts;
Step 9, based on the loss calculated in step 8, back-propagating the derivatives and updating the parameters of the feature prediction model;
Step 10, repeating steps 1-9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;
Step 11, during speech synthesis, inputting a text to the feature prediction model, which calculates and outputs the predicted spectrum Ŝ of step 1; Ŝ is then input to a vocoder to obtain speech.
As a further optimization scheme of the speech synthesis method of the present invention, in step 2, the Ŝ_sobel calculation refers to the Sobel feature calculation of Ŝ, including the x direction and the y direction. Sobel derives from image processing, where an image is in fact a two-dimensional array; the spectrum of an acoustic feature is similar to an image and can likewise be understood as a two-dimensional array, with the x direction being the transverse direction of the array and the y direction the longitudinal direction.
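The x- and y-direction Sobel feature calculation described above can be sketched as follows (an illustrative implementation using the standard 3×3 Sobel kernels; the edge-replication padding is an assumption, not specified in the patent):

```python
import numpy as np

# Standard 3x3 Sobel kernels: KX responds to horizontal (x-direction)
# changes, its transpose KY to vertical (y-direction) changes.
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
KY = KX.T

def conv2d_same(img, k):
    """Naive 'same'-size 2D correlation of img with a 3x3 kernel k."""
    p = np.pad(img, 1, mode="edge")          # replicate borders (assumption)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

def sobel_features(spec):
    """Return (x-direction, y-direction) Sobel features of a 2D spectrum."""
    return conv2d_same(spec, KX), conv2d_same(spec, KY)

# A spectrum-like 2D array whose values grow down the rows only:
spec = np.outer(np.arange(4.0), np.ones(5))
gx, gy = sobel_features(spec)   # gx is all zero; gy is 8 in the interior
```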
As a further optimization scheme of the speech synthesis method of the present invention, in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio; it is a linear spectrum or a mel spectrum, consistent with the spectrum selection in step 1.
As a further optimization scheme of the speech synthesis method of the present invention, in step 4, the S_sobel calculation refers to the Sobel feature calculation of S, including the x direction and the y direction.
As a further optimization scheme of the speech synthesis method of the present invention, in step 7, the balance coefficient ranges from 0 to 1.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
(1) The spectral-gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model, improving the feature prediction model's ability to describe detail in speech synthesis;
(2) The invention improves the sound quality of the synthesized speech.
Drawings
FIG. 1 shows the calculation of the loss function incorporating the Sobel operator.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Sobel is a gradient feature operator derived from image processing that characterizes the texture features of an image. The loss of mid- and high-frequency detail is an important cause of degraded synthesized sound quality. In this invention, the Sobel operator is introduced into the loss design of the feature prediction model, so that the model focuses on the details of the acoustic features and the synthesized sound quality is improved.
A method for determining the loss function of a speech synthesis model in combination with the Sobel operator comprises the following steps:
Step 1, the acoustic feature is selected to be a mel spectrum (Mel Spectrogram) or a linear spectrum (Linear Spectrogram), and the predicted spectrum Ŝ is obtained by forward calculation;
Step 2, calculating the Sobel operator Ŝ_sobel of Ŝ;
Step 3, calculating the spectrum S of the real audio;
Step 4, calculating the Sobel operator S_sobel of S;
Step 5, calculating the MSE of the two spectra;
Step 6, calculating the MSE of the Sobel operators of the two spectra;
Step 7, determining the balance coefficient α;
Step 8, constructing the following loss:
loss = α · MSE(Ŝ, S) + (1 − α) · MSE(Ŝ_sobel, S_sobel)   (1)
the forward process of step 1 refers to spectral output during model trainingConsidering that the vocoder model can restore sound well through linear spectrum or mel spectrum at present, the characteristic prediction model characteristic selection can be linear spectrum or mel spectrum;
in the step 2 of the process, the process is carried out,calculation means +.>Sobel feature calculation of (2) including x-direction and y-direction;
in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which may be a linear spectrum or a mel spectrum, but is consistent with the spectrum selection in step 1;
in step 4, S sobel The calculation refers to Sobel feature calculation of S, and comprises an x direction and a y direction;
in step 5, the MSE of the spectrum is calculated;
in step 6, the MSE of the Sobel operator of the spectrum is calculated;
in step 7, the balance coefficient is used to control the weight of the two parts, and the range is 0 to 1;
in step 9, a loss function is constructed finally, which consists of an MSE of a spectrum and an MSE of a Sobel operator of the spectrum, and the balance coefficient controls the weights of the two parts.
The invention focuses on the Sobel-operator-based loss design in the feature prediction model of speech synthesis; the spectral-gradient Sobel operator is innovatively introduced into the loss function design of the speech synthesis model.
FIG. 1 is a schematic diagram of the loss function. The feature prediction model is forward-calculated to obtain the predicted spectrum (part 101, corresponding to Ŝ in equation (1)), from which the Sobel operator is calculated (part 102, corresponding to Ŝ_sobel in equation (1)). The real audio is processed to obtain the real audio spectrum (part 103, corresponding to S in equation (1)), from which the Sobel operator is further calculated (part 104, corresponding to S_sobel in equation (1)). The MSE of parts 101 and 103 gives the spectral MSE (part 105, corresponding to MSE(Ŝ, S) in equation (1)); the MSE of parts 102 and 104 gives the MSE of the spectral Sobel operators (part 106, corresponding to MSE(Ŝ_sobel, S_sobel) in equation (1)). Given the balance coefficient α, parts 105 and 106 are dot-multiplied with [α, (1 − α)] to obtain the final loss (part 107).
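Under stated assumptions (a hand-rolled 3×3 Sobel with edge-replication padding, and the x- and y-direction Sobel MSEs summed into one term), the FIG. 1 pipeline can be sketched as:

```python
import numpy as np

# Standard 3x3 Sobel kernels; KY is the transpose of KX.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = KX.T

def sobel(spec, k):
    """'Same'-size correlation of a 2D spectrum with a 3x3 kernel."""
    p = np.pad(spec, 1, mode="edge")
    return np.array([[np.sum(p[i:i + 3, j:j + 3] * k)
                      for j in range(spec.shape[1])]
                     for i in range(spec.shape[0])])

def mse(a, b):
    return np.mean((a - b) ** 2)

def sobel_loss(pred, target, alpha=0.5):
    """loss = alpha * MSE(S_hat, S) + (1 - alpha) * MSE(S_hat_sobel, S_sobel)."""
    spec_mse = mse(pred, target)                           # parts 101/103 -> 105
    sobel_mse = (mse(sobel(pred, KX), sobel(target, KX)) +
                 mse(sobel(pred, KY), sobel(target, KY)))  # parts 102/104 -> 106
    return alpha * spec_mse + (1 - alpha) * sobel_mse      # part 107

rng = np.random.default_rng(1)
s_true = rng.normal(size=(8, 8))           # stand-in for a real spectrum
assert sobel_loss(s_true, s_true) == 0.0   # identical spectra give zero loss
```

A constant offset between spectra leaves the Sobel term near zero (gradients are unchanged), so only the α-weighted spectral MSE contributes, which is exactly the detail-versus-level balance the coefficient α controls.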
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.
Claims (5)
1. A method of speech synthesis comprising the steps of:
step 1, an acoustic feature is selected to be a mel spectrum or a linear spectrum as the output of a feature prediction model, and a text is forward-calculated through the feature prediction model to obtain a predicted spectrum Ŝ;
step 2, calculating the Sobel operator Ŝ_sobel of Ŝ;
step 3, calculating the spectrum S of the real audio;
step 4, calculating the Sobel operator S_sobel of S;
step 5, calculating the mean square error MSE(Ŝ, S) of Ŝ and S;
step 6, calculating the mean square error MSE(Ŝ_sobel, S_sobel) of the Sobel operators of Ŝ and S;
step 7, determining a balance coefficient α;
step 8, constructing the following loss function loss:
loss = α · MSE(Ŝ, S) + (1 − α) · MSE(Ŝ_sobel, S_sobel)
wherein loss consists of two parts, the mean square error MSE(Ŝ, S) from step 5 and the mean square error MSE(Ŝ_sobel, S_sobel) of the spectral Sobel operators from step 6, and α is a balance coefficient balancing the two parts;
step 9, based on the loss calculated in step 8, back-propagating the derivatives and updating the parameters of the feature prediction model;
step 10, repeating steps 1-9 to train the feature prediction model until it converges, finally obtaining a fully trained feature prediction model;
step 11, during speech synthesis, inputting a text to the feature prediction model, which calculates and outputs the predicted spectrum Ŝ of step 1; Ŝ is then input to a vocoder to obtain speech.
2. The speech synthesis method according to claim 1, wherein, in step 2, the Ŝ_sobel calculation refers to the Sobel feature calculation of Ŝ, including the x direction and the y direction; Sobel derives from image processing, where an image is in fact a two-dimensional array; the spectrum of an acoustic feature is similar to an image and can be understood as a two-dimensional array, with the x direction being the transverse direction of the array and the y direction the longitudinal direction.
3. The speech synthesis method according to claim 1, wherein, in step 3, the spectrum calculation of the real audio refers to calculating the spectrum of the target audio, which is a linear spectrum or a mel spectrum, consistent with the spectrum selection in step 1.
4. The speech synthesis method according to claim 1, wherein, in step 4, the S_sobel calculation refers to the Sobel feature calculation of S, including the x direction and the y direction.
5. The speech synthesis method according to claim 1, wherein, in step 7, the balance coefficient ranges from 0 to 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010672761.7A CN111899715B (en) | 2020-07-14 | 2020-07-14 | Speech synthesis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111899715A (en) | 2020-11-06
CN111899715B (en) | 2024-03-29
Family
ID=73192553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010672761.7A Active CN111899715B (en) | 2020-07-14 | 2020-07-14 | Speech synthesis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111899715B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101281744A (en) * | 2007-04-04 | 2008-10-08 | 国际商业机器公司 | Method and apparatus for analyzing and synthesizing voice |
CN103531196A (en) * | 2013-10-15 | 2014-01-22 | 中国科学院自动化研究所 | Sound selection method for waveform concatenation speech synthesis |
CN110136690A (en) * | 2019-05-22 | 2019-08-16 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device and computer readable storage medium |
CN110534089A (en) * | 2019-07-10 | 2019-12-03 | 西安交通大学 | A kind of Chinese speech synthesis method based on phoneme and rhythm structure |
US10510358B1 (en) * | 2017-09-29 | 2019-12-17 | Amazon Technologies, Inc. | Resolution enhancement of speech signals for speech synthesis |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8583438B2 (en) * | 2007-09-20 | 2013-11-12 | Microsoft Corporation | Unnatural prosody detection in speech synthesis |
JP5337608B2 (en) * | 2008-07-16 | 2013-11-06 | 本田技研工業株式会社 | Beat tracking device, beat tracking method, recording medium, beat tracking program, and robot |
GB2508417B (en) * | 2012-11-30 | 2017-02-08 | Toshiba Res Europe Ltd | A speech processing system |
US9484015B2 (en) * | 2013-05-28 | 2016-11-01 | International Business Machines Corporation | Hybrid predictive model for enhancing prosodic expressiveness |
Non-Patent Citations (2)
Title |
---|
A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement; Ziyue Zhao; 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 2019-11-23; full text *
Research on Statistical Parametric Speech Synthesis Methods Based on Neural Networks; Hu Yajun; China Master's Theses Full-text Database; 2018-10-15; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||