CN109036370B - Adaptive training method for speaker voice - Google Patents
Adaptive training method for speaker voice
- Publication number: CN109036370B
- Application number: CN201810576452.2A
- Authority
- CN
- China
- Prior art keywords
- model
- speaker
- voice
- adaptive
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Abstract
The invention discloses a speaker-adaptive voice training method in the technical field of speech synthesis, comprising the following steps: given training emotional speech data and target-speaker emotional speech data, the acoustic parameters are characterized and their state output distributions and duration distributions are estimated and modeled; the difference between the state output distribution of each training speech data model and that of the average voice model is normalized to obtain an average voice model for the target speaker's emotional speech data; and a speaker-adaptive transformation is applied to the average voice model to obtain a speaker-dependent adaptive model. Using the resulting adaptive model for speech synthesis reduces the influence of speaker differences in the speech corpus and improves the emotional similarity of the synthesized speech, so that emotional speech with good naturalness, fluency and emotional similarity can be synthesized from only a small amount of target emotional corpus.
Description
Technical Field
The invention belongs to the technical field of speech synthesis, and particularly relates to a speaker-adaptive voice training method.
Background
In recent years, with the continuous development of speech synthesis technology, the quality of synthesized speech has improved significantly, from the early physical-mechanism and source-filter synthesis methods, through waveform concatenation and the now-mature statistical parametric synthesis, to the deep-learning-based synthesis methods under active study. However, traditional speech synthesis only converts written text into plain spoken output and ignores the emotional information that speakers carry in natural speech. How to improve the expressiveness of synthesized speech has therefore become a central topic of emotional speech synthesis research and an inevitable direction for future work in speech signal processing.
Disclosure of Invention
The invention aims to provide a speaker-adaptive voice training method that yields an adaptive model for speech synthesis and improves the emotional similarity of the synthesized speech.
The technical scheme adopted by the invention is as follows:
a speaker voice adaptive training method is provided, comprising the following steps:
given training emotional speech data and target-speaker emotional speech data;
characterizing the acoustic parameters, and estimating and modeling their state output distributions and duration distributions;
normalizing, with a linear regression equation, the difference between the state output distribution of each training speech data model and that of the average voice model, to obtain an average voice model for the target speaker's emotional speech data;
under the guidance of the target speaker's emotional speech data, applying a speaker-adaptive transformation to the average voice model to obtain a speaker-dependent adaptive model.
Further, the acoustic parameters at least include a fundamental frequency parameter, a frequency spectrum parameter and a duration parameter.
Further, after the training emotional speech data and the target-speaker emotional speech data are given, the method further comprises:
estimating the linear transformation between the two using a maximum likelihood criterion, and obtaining a covariance matrix for adjusting the model distributions.
Further, estimating and modeling the state output distribution and duration distribution of the acoustic parameters comprises: jointly modeling the state output and duration distributions with a hidden semi-Markov model (HSMM).
Further, the linear regression equations comprise:
μ̄_i^(s) = W ξ_i = A ō_i + b   (2.1)
m̄_i^(s) = X ψ_i = α d̄_i + β   (2.2)
wherein equation (2.1) is the state output distribution transformation equation: μ̄_i^(s) is the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix for the differences between the state output distributions of model s and the average voice model, ō_i is its average observation vector, and ξ_i = [ō_i^T, 1]^T. Equation (2.2) is the state duration distribution transformation equation: m̄_i^(s) is the mean vector of the state durations of model s, X = [α, β] is the transformation matrix for the differences between the state duration distributions of model s and the average voice model, and d̄_i is its average duration, with ψ_i = [d̄_i, 1]^T.
Further, performing the speaker-adaptive transformation on the average voice model comprises: performing the transformation with a CMLLR adaptation algorithm, using the target speaker's emotional sentences to be synthesized.
Further, the adaptive transformation comprises: transforming the fundamental frequency, spectrum and duration parameters of the mixed-language average voice model into the feature parameters of the speech to be synthesized, using the target speaker's state output and duration probability distribution means and covariance matrices.
Further, the adaptive model is corrected and updated with a maximum a posteriori probability algorithm.
Compared with the prior art, the invention has the following beneficial effects: the speaker-adaptive voice training method yields an adaptive model for adaptive training in the speech synthesis process, reduces the influence of speaker differences in the speech corpus, and improves the emotional similarity of the synthesized speech.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of the operation of an embodiment of the present invention;
FIG. 2 is a flowchart of a speaker adaptive algorithm according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and not restrictive of it. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
The embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments and the attached drawings.
As shown in fig. 1, an embodiment of the present invention provides a method for adaptive training of speaker voice, comprising:
S1: given training emotional speech data and target-speaker emotional speech data;
S2: characterizing the acoustic parameters, and estimating and modeling their state output distributions and duration distributions;
S3: normalizing, with a linear regression equation, the difference between the state output distribution of each training speech data model and that of the average voice model, to obtain an average voice model for the target speaker's emotional speech data;
S4: under the guidance of the target speaker's emotional speech data, applying a speaker-adaptive transformation to the average voice model to obtain a speaker-dependent adaptive model.
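The S1-S4 flow can be sketched end to end with toy stand-ins. This is an illustrative skeleton only: per-state mean vectors play the role of the full HSMM distributions, removing each speaker's global offset stands in for the per-speaker SAT transforms, and a least-squares affine fit stands in for CMLLR; none of these names or simplifications come from the patent.

```python
import numpy as np

def train_average_voice(speaker_means):
    """S1-S3 (toy): each speaker model is an array of per-state mean
    vectors. Removing each speaker's global offset before averaging is a
    crude stand-in for the per-speaker linear-regression SAT transforms."""
    centered = [m - m.mean(axis=0) for m in speaker_means]
    global_mean = np.mean([m.mean(axis=0) for m in speaker_means], axis=0)
    return np.mean(centered, axis=0) + global_mean

def adapt_to_target(avg_model, target_means, omega=5.0, occupancy=10.0):
    """S4 (toy): fit an affine transform avg -> target by least squares
    (a stand-in for CMLLR), then MAP-smooth the transformed means toward
    the target-speaker statistics."""
    X = np.hstack([avg_model, np.ones((len(avg_model), 1))])  # [mu, 1]
    Wt, *_ = np.linalg.lstsq(X, target_means, rcond=None)     # affine fit
    transformed = X @ Wt
    return (omega * transformed + occupancy * target_means) / (omega + occupancy)
```

Feeding two toy speaker models and a target-speaker mean set through `train_average_voice` and then `adapt_to_target` mirrors the S1-S4 flow: build an average voice, then pull it toward the target speaker.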
In S1, after the training emotional speech data and the target-speaker emotional speech data are given, the method further comprises: estimating the linear transformation between the two using a maximum likelihood criterion, and obtaining a covariance matrix for adjusting the model distributions.
In S2, the acoustic parameters include at least a fundamental frequency parameter, a spectrum parameter and a duration parameter. Estimating and modeling their state output and duration distributions comprises jointly modeling both with a hidden semi-Markov model (HSMM).
In S4, the speaker-adaptive transformation of the average voice model comprises: performing the transformation with a CMLLR adaptation algorithm, using the target speaker's emotional sentences to be synthesized. The adaptive transformation comprises transforming the fundamental frequency, spectrum and duration parameters of the mixed-language average voice model into the feature parameters of the speech to be synthesized, using the target speaker's state output and duration probability distribution means and covariance matrices.
The adaptive model obtained in this embodiment is corrected and updated with a maximum a posteriori probability algorithm.
In this system, a constrained maximum likelihood linear regression algorithm is first used to perform speaker-adaptive training on the multi-speaker emotional speech data models, yielding an average voice model of the multi-speaker emotional speech data. Then, under the guidance of the target speaker's emotional speech data, the average voice model undergoes a speaker-adaptive transformation, again with the constrained maximum likelihood linear regression algorithm, to obtain a speaker-dependent adaptive model; finally, the adaptive model is corrected and updated by maximum a posteriori estimation.
To improve the quality of the synthesized emotional speech, this embodiment trains an average voice model from the speech data of several emotional speakers; differences among these speakers in gender, personality, emotional expression and so on would otherwise introduce large bias into the acoustic model. To avoid the influence of speaker variation on the trained model, this embodiment uses Speaker Adaptive Training (SAT) to normalize speaker differences, improving model accuracy and hence the quality of the synthesized emotional speech. Since the unvoiced segments of Chinese carry no fundamental frequency, fundamental frequency modeling is implemented with a multi-space probability distribution HMM (MSD-HMM). Based on context-dependent MSD-HSMM synthesis units, this embodiment performs speaker-adaptive training on the multi-speaker emotional corpus with the Constrained Maximum Likelihood Linear Regression (CMLLR) algorithm to obtain the average voice model of the multi-speaker emotional speech.
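The multi-space distribution idea can be illustrated with a minimal two-space (voiced/unvoiced) state likelihood for F0. This is a schematic sketch of MSD, not the patent's implementation; the weight and Gaussian parameters are invented for illustration.

```python
import math

def msd_f0_likelihood(obs, w_voiced, mu, var):
    """Two-space MSD likelihood for one HMM state: an F0 observation is
    either unvoiced (zero-dimensional space, weight 1 - w_voiced) or a
    real-valued log-F0 value modeled by a Gaussian (weight w_voiced)."""
    if obs is None:  # unvoiced frame carries no F0 value
        return 1.0 - w_voiced
    gauss = math.exp(-0.5 * (obs - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return w_voiced * gauss
```

An unvoiced frame contributes the unvoiced-space weight directly, while a voiced frame contributes the voiced-space weight times the Gaussian density of its log-F0 value; this is what lets one stream model both kinds of frames.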
Referring to fig. 2, which shows the flow of the speaker adaptation algorithm in this embodiment: first, given the training emotional speech data and the target speaker's emotional speech data, a maximum likelihood criterion is used to estimate the linear transformation between the two models, reflecting their differences, and a covariance matrix for adjusting the model distributions is obtained. During adaptive training, acoustic parameters such as fundamental frequency, spectrum and duration must be characterized, and their state output and duration distributions estimated and modeled. Because the initial hidden Markov model does not describe the duration distribution accurately, this embodiment uses a hidden semi-Markov model (HSMM), which models duration explicitly, to model the state output and duration distributions jointly, and normalizes the speaker model differences with a set of linear regression equations, shown as formulas (2.1) and (2.2):
μ̄_i^(s) = W ξ_i = A ō_i + b   (2.1)
m̄_i^(s) = X ψ_i = α d̄_i + β   (2.2)
wherein equation (2.1) is the state output distribution transformation equation: μ̄_i^(s) is the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix for the differences between the state output distributions of model s and the average voice model, ō_i is its average observation vector, and ξ_i = [ō_i^T, 1]^T. Equation (2.2) is the state duration distribution transformation equation: m̄_i^(s) is the mean vector of the state durations of model s, X = [α, β] is the transformation matrix for the differences between the state duration distributions of model s and the average voice model, and d̄_i is its average duration, with ψ_i = [d̄_i, 1]^T.
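Equations (2.1) and (2.2) are plain affine maps of the average-voice means. A minimal numeric reading follows; the function names and array shapes are assumptions for illustration, not from the patent.

```python
import numpy as np

def transform_output_mean(A, b, o_bar):
    """Eq. (2.1): state-output mean of speaker model s as W @ xi with
    W = [A, b] and xi = [o_bar^T, 1]^T, i.e. A @ o_bar + b."""
    W = np.hstack([A, b.reshape(-1, 1)])
    xi = np.append(o_bar, 1.0)
    return W @ xi

def transform_duration_mean(alpha, beta, d_bar):
    """Eq. (2.2): state-duration mean as X @ psi with X = [alpha, beta]
    and psi = [d_bar, 1]^T, i.e. alpha * d_bar + beta (scalar case)."""
    return alpha * d_bar + beta
```

In SAT, one such (W, X) pair is estimated per training speaker, so that what remains after normalization is a speaker-independent average voice.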
Then, after the speaker-adaptive training is finished, a small number of emotional sentences of the target speaker to be synthesized can be used, with the CMLLR adaptation algorithm, to perform the speaker-adaptive transformation on the average voice model, yielding the adaptive model representing the target speaker. In this transformation, the target speaker's state output and duration probability distribution means and covariance matrices are used to transform the fundamental frequency, spectrum and duration parameters of the mixed-language average voice model into the feature parameters of the speech to be synthesized. The transformation equation of feature vector o in state i is shown in equation (2.3), and that of state duration d in state i in equation (2.4):
b_i(o) = N(o; Aμ_i − b, AΣ_iA^T) = |A^{-1}| N(Wξ; μ_i, Σ_i)   (2.3)
p_i(d) = N(d; αm_i − β, α²σ_i²) = |α^{-1}| N(Xψ; m_i, σ_i²)   (2.4)
wherein ξ = [o^T, 1]^T, ψ = [d, 1]^T, μ_i is the mean of the state output distribution, m_i is the mean of the duration distribution, Σ_i is the diagonal covariance matrix, and σ_i² is the duration variance. W = [A^{-1}, b^{-1}] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α^{-1}, β^{-1}] is the transformation matrix of the state duration probability density distribution.
Through this HSMM-based adaptive transformation algorithm, the acoustic feature parameters of the speech can be normalized and processed. For adaptation data O of length T, maximum likelihood estimation of the transforms Λ = (W, X) is performed as
Λ̂ = argmax_Λ p(O | λ, Λ)
where λ is the parameter set of the HSMM.
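The equivalence in equation (2.3), evaluating the transformed Gaussian on the original observation versus evaluating the original Gaussian on the transformed observation scaled by the Jacobian, can be checked numerically. The sketch below uses the convention W = [A^{-1}, A^{-1}b], which is my reading of the patent's notation, so treat the exact parameterization as an assumption.

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate normal density with a dense covariance matrix."""
    d = len(x)
    diff = x - mean
    return (np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
            / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov)))

def cmllr_both_sides(A, b, mu, Sigma, o):
    """Both sides of eq. (2.3): N(o; A mu - b, A Sigma A^T) versus
    |det A^{-1}| * N(W xi; mu, Sigma) with W = [A^{-1}, A^{-1} b]."""
    lhs = gauss_pdf(o, A @ mu - b, A @ Sigma @ A.T)
    Ainv = np.linalg.inv(A)
    W = np.hstack([Ainv, (Ainv @ b).reshape(-1, 1)])
    xi = np.append(o, 1.0)
    rhs = abs(np.linalg.det(Ainv)) * gauss_pdf(W @ xi, mu, Sigma)
    return lhs, rhs
```

The two sides agree because o = A u − b is a change of variables for u ~ N(μ, Σ), whose Jacobian contributes the |det A^{-1}| factor.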
When the amount of target-speaker data is too limited for each model distribution to have its own transformation matrix estimated, several distributions must share one transformation matrix, i.e., the regression matrices are tied; in this way a good adaptation effect can still be achieved with little data, as shown in fig. 2.
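The regression-matrix tying described above can be sketched as a lookup from each distribution to a shared transform slot. The class-assignment rule used here (thresholding on per-state adaptation frames) is an invented illustration; real systems typically tie via a regression class tree.

```python
import numpy as np

def tie_transforms(state_occupancies, min_frames=100):
    """Give each state its own transform slot when it has enough
    adaptation data, otherwise fall back to a shared 'global' slot.
    Returns a mapping: state -> regression-class key."""
    return {s: s if n >= min_frames else "global"
            for s, n in state_occupancies.items()}

def apply_tied_transforms(means, tying, transforms):
    """Apply the per-class affine transforms (A, b) to the state means."""
    out = {}
    for s, mu in means.items():
        A, b = transforms[tying[s]]
        out[s] = A @ mu + b
    return out
```

Tying trades per-state precision for robustness: sparsely observed states borrow the transform estimated from the pooled data instead of an unreliable one of their own.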
The present embodiment adopts a Maximum A Posteriori (MAP) algorithm to correct and update the model. For a given HSMM parameter set λ, let α_t(i) and β_t(i) denote the forward and backward probabilities. The probability χ_t^d(i) of generating the observation subsequence o_{t−d+1}, …, o_t while occupying state i for d consecutive frames is:
χ_t^d(i) = (1 / P(O|λ)) α_{t−d}(i) p_i(d) [∏_{s=t−d+1}^{t} b_i(o_s)] β_t(i)
The maximum a posteriori probability estimation is then described as follows:
μ̂_i = (ω μ̄_i + Σ_{t,d} χ_t^d(i) Σ_{s=t−d+1}^{t} o_s) / (ω + Σ_{t,d} d·χ_t^d(i))
m̂_i = (τ m̄_i + Σ_{t,d} d·χ_t^d(i)) / (τ + Σ_{t,d} χ_t^d(i))
wherein μ̄_i and m̄_i are the mean vectors after the linear regression transformation, ω is the MAP estimation parameter of the state output, and τ that of the duration distribution; μ̂_i and m̂_i are the weighted-average MAP estimates of the adapted mean vectors μ̄_i and m̄_i.
Experiments show that, compared with a traditional hidden-Markov-model-based speech synthesis system, the adaptive-model-based emotional speech synthesis system adds a speaker-adaptive training step in the training stage to obtain an average voice model of the emotional speech of multiple speakers.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Other technical features than those described in the specification are known to those skilled in the art, and are not described herein in detail in order to highlight the innovative features of the present invention.
Claims (7)
1. A speaker voice adaptive training method, characterized by comprising the following steps:
given training emotional speech data and target-speaker emotional speech data;
characterizing the acoustic parameters, and estimating and modeling their state output distributions and duration distributions;
normalizing, with a linear regression equation, the difference between the state output distribution of each training speech data model and that of the average voice model, to obtain an average voice model for the target speaker's emotional speech data;
under the guidance of the target speaker's emotional speech data, applying a speaker-adaptive transformation to the average voice model to obtain a speaker-dependent adaptive model;
after the training emotional speech data and the target-speaker emotional speech data are given, the method further comprises: estimating the linear transformation between the two using a maximum likelihood criterion, and obtaining a covariance matrix for adjusting the model distributions.
2. The method as claimed in claim 1, wherein the acoustic parameters include at least a fundamental frequency parameter, a frequency spectrum parameter and a duration parameter.
3. The speaker voice adaptive training method according to claim 1, wherein estimating and modeling the state output distribution and duration distribution of the acoustic parameters comprises: jointly modeling the state output and duration distributions with a hidden semi-Markov model (HSMM).
4. The speaker voice adaptive training method according to claim 1, wherein the linear regression equations comprise:
μ̄_i^(s) = W ξ_i = A ō_i + b   (2.1)
m̄_i^(s) = X ψ_i = α d̄_i + β   (2.2)
wherein equation (2.1) is the state output distribution transformation equation: μ̄_i^(s) is the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix for the differences between the state output distributions of model s and the average voice model, ō_i is its average observation vector, and ξ_i = [ō_i^T, 1]^T; equation (2.2) is the state duration distribution transformation equation: m̄_i^(s) is the mean vector of the state durations of model s, X = [α, β] is the transformation matrix for the differences between the state duration distributions of model s and the average voice model, and d̄_i is its average duration, with ψ_i = [d̄_i, 1]^T.
5. The speaker voice adaptive training method according to claim 1, wherein performing the speaker-adaptive transformation on the average voice model comprises: performing the transformation with a CMLLR adaptation algorithm, using the target speaker's emotional sentences to be synthesized.
6. The method of adaptive speaker speech training according to claim 5, wherein the adaptive transformation comprises: and transforming the parameters of the fundamental frequency, the frequency spectrum and the duration in the mixed language average sound model into the characteristic parameters of the voice to be synthesized by utilizing the state output of the speaker, the probability distribution mean value of the duration and the covariance matrix.
7. The method according to claim 1, wherein the adaptive model is modified and updated using a maximum a posteriori probability algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576452.2A CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109036370A CN109036370A (en) | 2018-12-18 |
CN109036370B true CN109036370B (en) | 2021-07-20 |
Family
ID=64612408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810576452.2A Active CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036370B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112837674B (en) * | 2019-11-22 | 2024-06-11 | 阿里巴巴集团控股有限公司 | Voice recognition method, device, related system and equipment |
CN111627420B (en) * | 2020-04-21 | 2023-12-08 | 升智信息科技(南京)有限公司 | Method and device for synthesizing emotion voice of specific speaker under extremely low resource |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2524505A (en) * | 2014-03-24 | 2015-09-30 | Toshiba Res Europ Ltd | Voice conversion |
CN106531150A (en) * | 2016-12-23 | 2017-03-22 | 上海语知义信息技术有限公司 | Emotion synthesis method based on deep neural network model |
CN107039033A (en) * | 2017-04-17 | 2017-08-11 | 海南职业技术学院 | A kind of speech synthetic device |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9183830B2 (en) * | 2013-11-01 | 2015-11-10 | Google Inc. | Method and system for non-parametric voice conversion |
CN104217713A (en) * | 2014-07-15 | 2014-12-17 | 西北师范大学 | Tibetan-Chinese speech synthesis method and device |
US20170213542A1 (en) * | 2016-01-26 | 2017-07-27 | James Spencer | System and method for the generation of emotion in the output of a text to speech system |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
CN107895582A (en) * | 2017-10-16 | 2018-04-10 | 中国电子科技集团公司第二十八研究所 | Towards the speaker adaptation speech-emotion recognition method in multi-source information field |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||