CN108831435B - Emotional voice synthesis method based on multi-emotion speaker self-adaption - Google Patents
- Publication number
- CN108831435B (application CN201810576165.1A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- emotion
- voice
- model
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The invention discloses an emotional voice synthesis method based on multi-emotion speaker self-adaptation, belonging to the technical field of voice synthesis. The method comprises the following steps: carrying out speaker self-adaptive training on the multi-speaker emotion voice data model to obtain an average voice model of the multi-speaker emotion voice data; carrying out speaker self-adaptive transformation on the average voice model to obtain a speaker-dependent self-adaptive model; obtaining a context-dependent label file for the target text and generating the corresponding voice parameters; and synthesizing the voice parameters to obtain the voice of the target emotion of the target speaker. The speech synthesis method of the embodiment of the invention can reduce the influence caused by speaker differences in the speech library and improve the emotional similarity of the synthesized speech; using only a small amount of emotional corpus to be synthesized, it can synthesize emotional speech with good naturalness, fluency and emotional similarity.
Description
Technical Field
The invention belongs to the technical field of voice synthesis, and particularly relates to an emotional voice synthesis method based on multi-emotional speaker self-adaptation.
Background
In recent years, speech synthesis technology has developed continuously, and the tone quality of synthesized speech has improved significantly: from the early physical-mechanism and source-filter synthesis methods, through waveform concatenation, to the now-mature statistical parametric methods and the actively studied methods based on deep learning. However, traditional speech synthesis methods only convert written text into simple spoken output, ignoring the emotional information carried by speakers in the process of speech expression. How to improve the expressiveness of synthesized speech has therefore become an important topic of emotional speech synthesis research and an inevitable trend of future research in the field of speech signal processing.
Disclosure of Invention
The invention aims to provide an emotional voice synthesis method based on multi-emotion speaker self-adaptation, which can reduce the influence caused by the difference of speakers in a voice library and improve the emotional similarity of synthesized voice.
The technical scheme adopted by the invention is as follows:
the emotional voice synthesis method based on the self-adaption of the multi-emotional speaker comprises the following steps:
extracting acoustic parameter files required by a training model from a first emotion voice database of target emotion of a multi-speaker;
obtaining a label file from a target text file;
performing HMM training on the primitive models to obtain an HMM model library;
carrying out speaker self-adaptive training on the multi-speaker emotion voice data model to obtain an average voice model of the multi-speaker emotion voice data;
under the guidance of the emotion voice data of the target speaker, performing speaker self-adaptive transformation on the average voice model to obtain a speaker-related self-adaptive model;
performing text analysis on target emotion and voice text of a target speaker to be synthesized to obtain a context-related label file of the target text;
under the guidance of an adaptive model, obtaining a context-dependent HMM decision sequence of the target voice through decision analysis, and generating corresponding voice parameters;
and synthesizing the voice parameters to obtain the voice of the target emotion of the target speaker.
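The steps above can be sketched as a simple pipeline. The sketch below is purely illustrative: every function and stage name is a hypothetical placeholder standing in for the STRAIGHT analysis, HMM/HSMM training, CMLLR adaptation and STRAIGHT synthesis that the method actually uses.

```python
# Illustrative pipeline for the eight steps of the method. Each stage
# only records that it ran and the artifact it would produce; the real
# system performs STRAIGHT extraction, HMM training, SAT, CMLLR
# transformation, decision analysis and STRAIGHT synthesis here.

def make_stage(name, produces):
    def stage(artifacts):
        artifacts[produces] = "<" + produces + " from " + name + ">"
        return artifacts
    return stage

PIPELINE = [
    ("S1_extract_acoustic_params", make_stage("S1", "acoustic_params")),
    ("S2_generate_labels", make_stage("S2", "label_files")),
    ("S3_train_hmm_library", make_stage("S3", "hmm_library")),
    ("S4_speaker_adaptive_training", make_stage("S4", "average_voice_model")),
    ("S5_speaker_adaptive_transform", make_stage("S5", "adapted_model")),
    ("S6_analyze_target_text", make_stage("S6", "context_labels")),
    ("S7_generate_parameters", make_stage("S7", "speech_params")),
    ("S8_synthesize_waveform", make_stage("S8", "waveform")),
]

def run_pipeline():
    artifacts, trace = {}, []
    for name, stage in PIPELINE:
        artifacts = stage(artifacts)
        trace.append(name)
    return artifacts, trace
```

The ordering matters: the average voice model (S4) must exist before the speaker-dependent transformation (S5), which in turn guides parameter generation (S7).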
Further, the acoustic parameter file required by the training model is extracted through STRAIGHT parameters.
Further, the acoustic parameter file at least comprises fundamental frequency and spectral parameters.
Further, obtaining the label file from the target text file includes: after the text file is subjected to text analysis, a single-phoneme label file containing phoneme information and a context-dependent label file containing context information are obtained by a label generation program.
Further, the HMM training is performed on the primitive models under the guidance of the context attributes and the question set.
Further, the HMM model library is obtained through decision tree clustering.
Furthermore, the speaker self-adaptive training is carried out on the multi-speaker emotion voice data model through a constrained maximum likelihood linear regression algorithm; and/or
and carrying out speaker self-adaptive transformation on the average voice model through a constrained maximum likelihood linear regression algorithm.
Further, the speaker dependent adaptive model is modified and updated using a maximum a posteriori probability.
Furthermore, the label file related to the context of the target text is generated through a label generating program.
Further, the synthesizing of the speech parameters is performed by using a STRAIGHT speech synthesizer.
Further, the performing speaker adaptive training on the multi-speaker emotion voice data model comprises:
giving training emotional voice data and target speaker emotional voice data;
the acoustic parameters are characterized, and the state output distribution and the duration distribution of the acoustic parameters are estimated and modeled;
and carrying out normalization processing on the difference between the state output distribution of the training voice data model and the state output distribution of the average voice model by using a linear regression equation.
Further, after the given training emotion voice data and the target speaker emotion voice data, the method further comprises:
and estimating linear transformation between the two by adopting a maximum likelihood criterion, and obtaining a covariance matrix for adjusting the distribution of the model.
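As a toy illustration of what speaker adaptive training normalizes, the one-dimensional sketch below estimates a per-speaker affine transform by simple moment matching rather than by the constrained maximum likelihood linear regression the method actually uses; it only shows how per-speaker differences are folded into a single average voice model.

```python
import statistics

def estimate_speaker_transform(data, avg_mean, avg_std):
    # Closed-form 1-D affine transform o -> a*o + b that maps this
    # speaker's sample mean / standard deviation onto the current
    # average voice model (moment matching, not true CMLLR).
    m, s = statistics.mean(data), statistics.pstdev(data)
    a = avg_std / s
    b = avg_mean - a * m
    return a, b

def speaker_adaptive_training(speakers, iterations=2):
    # speakers: name -> list of 1-D observations. Alternate between
    # estimating per-speaker transforms and re-estimating the average
    # voice model from the normalized data.
    pooled = [o for data in speakers.values() for o in data]
    mean, std = statistics.mean(pooled), statistics.pstdev(pooled)
    for _ in range(iterations):
        normalized = []
        for data in speakers.values():
            a, b = estimate_speaker_transform(data, mean, std)
            normalized.extend(a * o + b for o in data)
        mean, std = statistics.mean(normalized), statistics.pstdev(normalized)
    return mean, std
```

After training, the transforms map every speaker's mean onto the same average voice model mean, which is exactly the normalization of speaker differences described above.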
Compared with the prior art, the invention has the beneficial effects that:
1. Experiments prove that, compared with the traditional speech synthesis system based on the hidden Markov model, the speech synthesis method disclosed by the invention adds a speaker self-adaptive training process in the training stage to obtain an average voice model of the emotional speech of a plurality of speakers. This reduces the influence caused by speaker differences in the speech library and improves the emotional similarity of the synthesized speech. On the basis of the average voice model, through the speaker self-adaptive transformation algorithm, emotional speech with good naturalness, fluency and emotional similarity can be synthesized using only a small amount of emotional corpus to be synthesized.
2. The voice synthesis method disclosed by the invention adopts a plurality of speakers to jointly build the emotion voice database, so that the feasibility is improved, and the emotion content of the database is richer.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of the operation of an embodiment of the present invention;
FIG. 2 is a flowchart of a speaker adaptive algorithm according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, an embodiment of the present invention provides an emotion speech synthesis method based on multi-emotion speaker adaptation, including:
s1: extracting acoustic parameter files required by a training model from a first emotion voice database of target emotion of a multi-speaker;
s2: obtaining a label file from a target text file;
s3: performing HMM training on the primitive models to obtain an HMM model library;
s4: carrying out speaker self-adaptive training on the multi-speaker emotion voice data model to obtain an average voice model of the multi-speaker emotion voice data;
s5: under the guidance of the emotion voice data of the target speaker, performing speaker self-adaptive transformation on the average voice model to obtain a speaker-related self-adaptive model;
s6: performing text analysis on target emotion and voice text of a target speaker to be synthesized to obtain a context-related label file of the target text;
s7: under the guidance of an adaptive model, obtaining a context-dependent HMM decision sequence of the target voice through decision analysis, and generating corresponding voice parameters;
s8: and synthesizing the voice parameters to obtain the voice of the target emotion of the target speaker.
In this embodiment, the primitive model refers to an initial model, and the training model refers to a model obtained after training.
Specifically, the method can be divided into a training stage, an adaptive stage and a synthesis stage, and comprises the following steps:
A training stage: a first emotion voice database of the target emotion of multiple speakers and a second voice database of the target emotion of the target speaker are given. The voice data files undergo a STRAIGHT parameter extraction process to obtain the acoustic parameter files required by the training model, such as fundamental frequency and spectrum parameters; the text files undergo a text analysis process, and a label generation program produces a single-phoneme label file containing phoneme information and a context-dependent label file containing context information. HMM training is then carried out on the primitive models under the guidance of the context attributes and the question set, and an HMM model library is obtained through decision tree clustering. The acoustic parameter files extracted in the training stage (STRAIGHT parameters, fundamental frequency, spectrum parameters and the like) are used subsequently, and the HMM model library obtained in this stage is applied throughout the whole HMM training process.
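The two kinds of label files of the training stage can be illustrated with a toy generator. The phone set and the triphone-style context format below are invented for illustration only; real systems attach far richer context attributes (tone, position in syllable and phrase, emotion, and so on).

```python
def monophone_labels(phones):
    # Single-phoneme label file: one label per phone, identity only.
    return list(phones)

def context_labels(phones):
    # Context-dependent label file, reduced here to a triphone
    # "left-current+right" format with silence padding. Real systems
    # encode many more context features per label line.
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]
```

For a toy phone sequence like `["n", "i", "h", "ao"]`, the context labels encode each phone together with its neighbors, which is what decision-tree clustering later asks questions about.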
An adaptive stage: first, a constrained maximum likelihood linear regression algorithm is adopted to carry out speaker self-adaptive training on the multi-speaker emotion voice data model, so as to obtain an average voice model of the multi-speaker emotion voice data. Then, under the guidance of the target emotion voice data of the target speaker, the average voice model is subjected to speaker self-adaptive transformation with the constrained maximum likelihood linear regression algorithm to obtain a speaker-dependent self-adaptive model; finally, the self-adaptive model is corrected and updated using the maximum a posteriori probability.
A synthesis stage: this stage follows the same principle as HMM-based statistical parametric speech synthesis. First, the speech text of the target emotion of the target speaker to be synthesized is input, and a label generation program generates the context-dependent label file of the target text through a text analysis process. Under the guidance of the adaptive model, a context-dependent HMM decision sequence of the target voice is obtained through decision analysis, and the corresponding voice parameters are generated. Finally, a STRAIGHT voice synthesizer is adopted to synthesize the target emotion voice of the target speaker.
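A crude sketch of the parameter-generation step: each context-dependent state in the decision sequence emits its mean parameter vector for its expected duration. Real HMM-based synthesis uses maximum-likelihood parameter generation with dynamic features; this hypothetical stand-in only shows the data flow from decision sequence to parameter frames.

```python
def generate_parameters(state_sequence, model):
    # model: state id -> (mean parameter vector, mean duration in
    # frames). For each state in the context-dependent decision
    # sequence, emit its mean vector for its expected duration.
    frames = []
    for state in state_sequence:
        mean, duration = model[state]
        frames.extend([mean] * duration)
    return frames
```

The resulting frame sequence is what a vocoder (STRAIGHT in this method) would turn into a waveform.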
In a traditional emotional voice synthesis system based on hidden-Markov-model statistical parameters, the requirements on emotional voice data are strict in order to train a high-quality emotional voice model library. If a single speaker is used to record the emotional voice data, a large amount of time and energy is consumed, the quality of the data cannot be guaranteed, and the feasibility is low. If, however, a plurality of speakers are used to build the emotional voice database together, the feasibility is improved and the emotional content of the database is richer. Therefore, in this embodiment, a plurality of emotional speakers are selected to build the emotional corpus.
In order to improve the quality of the synthesized emotional speech, this embodiment trains an average voice model from the speech data of a plurality of emotional speakers; because these speakers differ in gender, character, emotional expression and so on, the acoustic model would otherwise show large deviations. To avoid the influence of speaker variation on the training model, this embodiment employs Speaker Adaptive Training (SAT) to normalize the speaker differences, so as to improve the accuracy of the model and further improve the quality of the synthesized emotional voice. Considering that the unvoiced segments of Chinese have no fundamental frequency, this embodiment implements fundamental frequency modeling using a multi-space probability distribution HMM (MSD-HMM). Based on context-dependent MSD-HSMM synthesis units, this embodiment performs speaker adaptive training on the multi-speaker emotion corpus with the Constrained Maximum Likelihood Linear Regression (CMLLR) algorithm, thereby obtaining an average voice model of the multi-speaker emotional speech.
First, given the training emotional voice data and the target speaker's emotional voice data, this embodiment adopts the maximum likelihood criterion to estimate the linear transformation between the two, reflecting the difference between the two models, and obtains the covariance matrix used to adjust the model distributions. In the adaptive training process, acoustic parameters such as fundamental frequency, spectrum and duration must be characterized, and their state output distributions and duration distributions estimated and modeled; but the standard hidden Markov model does not describe the duration distribution accurately, so this embodiment uses a hidden semi-Markov model (HSMM), which models duration explicitly, to model the state output and duration distributions simultaneously. A set of linear regression equations, shown as formulas (2.1) and (2.2), is used to normalize the differences between the speaker speech models:

μi(s) = W ξi = A oi + b    (2.1)

mi(s) = X φi = α di + β    (2.2)

Here formula (2.1) is the state output distribution transformation equation: μi(s) denotes the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix of the differences between the state output distribution of model s and that of the average voice model, oi is its average observation vector, and ξi = [oiT, 1]T. Formula (2.2) is the state duration distribution transformation equation: mi(s) denotes the mean state duration of training speech data model s, X = [α, β] is the transformation matrix of the differences between the state duration distribution of model s and that of the average voice model, di is its average duration, and φi = [di, 1]T.
Then, after the speaker adaptive training is finished, a small number of emotional sentences of the target speaker to be synthesized can be used to perform speaker adaptive transformation of the average voice model with the CMLLR adaptive algorithm, obtaining a speaker-dependent adaptive model representing the target speaker. In the speaker adaptive transformation, the means and covariance matrices of the speaker's state output and duration probability distributions are used to transform the fundamental frequency, spectrum and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized. The transformation equation of the feature vector o in state i is shown in equation (2.3), and the transformation equation of the state duration d in state i is shown in equation (2.4):
bi(o) = N(o; Aμi − b, AΣiAT) = |A−1| N(Wξ; μi, Σi)    (2.3)

pi(d) = N(d; αmi − β, α2σi2) = |α−1| N(Xψ; mi, σi2)    (2.4)

wherein ξ = [oT, 1]T, ψ = [d, 1]T, μi is the mean of the state output distribution, mi is the mean of the duration distribution, Σi is a diagonal covariance matrix, and σi2 is the duration variance. W = [A−1, A−1b] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α−1, α−1β] is the transformation matrix of the state duration probability density distribution.
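In one dimension, the identity asserted by equation (2.3) can be checked numerically. This is only a sanity check that the model-space and feature-space forms of the transform give the same density; it is not part of the estimation procedure.

```python
import math

def gauss(x, mean, var):
    # Univariate normal density N(x; mean, var).
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def transformed_model_density(o, mu, var, A, b):
    # Left-hand side of (2.3) in 1-D: N(o; A*mu - b, A*Sigma*A^T).
    return gauss(o, A * mu - b, A * var * A)

def feature_space_density(o, mu, var, A, b):
    # Right-hand side of (2.3) in 1-D: |A^-1| N(W xi; mu, Sigma),
    # with W = [A^-1, A^-1 b] and xi = [o, 1]^T, so W xi = A^-1 (o + b).
    A_inv = 1.0 / A
    return abs(A_inv) * gauss(A_inv * (o + b), mu, var)
```

Both forms agree for any nonzero A, which is why CMLLR can be applied either to the model parameters or to the observations.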
Through this HSMM-based adaptive transformation algorithm, the speech acoustic feature parameters can be normalized and processed. For adaptive data O of length T, the transformation set Λ = (W, X) is estimated by maximum likelihood:

Λ̂ = argmaxΛ p(O | λ, Λ)

where λ is the parameter set of the HSMM.
When the data volume of the target speaker is limited and cannot support estimating one transformation matrix per model distribution, several distributions must share one transformation matrix, i.e. the regression matrices are tied; in this way a good adaptation effect can still be achieved with little data, as shown in fig. 2.
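Regression-matrix tying can be illustrated with a toy assignment rule: distributions that received enough adaptation data get a private transform, and the rest share one. The threshold and the single shared class are simplifications invented for illustration; real systems tie transforms hierarchically with a regression-class tree.

```python
def assign_transforms(occupancy, threshold):
    # occupancy: distribution id -> number of adaptation frames it
    # received. A distribution with at least `threshold` frames can
    # estimate its own transformation matrix; the rest are tied to a
    # single shared regression-class transform.
    return {dist: (dist if count >= threshold else "shared")
            for dist, count in occupancy.items()}
```

With scarce target-speaker data most distributions end up in the shared class, which is exactly how a good adaptation effect is achieved from few sentences.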
This embodiment employs a Maximum A Posteriori (MAP) algorithm to modify and update the model. For a given set of HSMM parameters, let the forward probability be αt(i) and the backward probability be βt(i); the probability χtd(i) that state i continuously generates the observation sequence ot−d+1, ..., ot is:

χtd(i) = (1/P) αt−d(i) pi(d) [∏s=t−d+1..t bi(os)] βt(i)

where P is the likelihood of the whole observation sequence.
the maximum a posteriori probability estimate is described as follows:
in the formula (I), the compound is shown in the specification,andrepresents the mean vector after the linear regression transformation, ω represents the MAP estimated parameters of the state output, and τ represents the MAP estimated parameters of its time duration distribution.Andrepresenting adaptive mean vectorAndweighted average MAP estimate of (a).
Experiments prove that, compared with the traditional speech synthesis system based on the hidden Markov model, the emotional speech synthesis system of this embodiment adds a speaker self-adaptive training process in the training stage to obtain an average voice model of the emotional speech of a plurality of speakers, which reduces the influence of speaker differences in the speech library and improves the emotional similarity of the synthesized speech.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Other technical features than those described in the specification are known to those skilled in the art, and are not described herein in detail in order to highlight the innovative features of the present invention.
Claims (10)
1. A multi-emotion speaker self-adaptive emotion voice synthesis method is characterized by comprising the following steps:
extracting acoustic parameter files required by a training model from a first emotion voice database of target emotion of a multi-speaker;
obtaining a label file from a target text file;
performing HMM training on the primitive models to obtain an HMM model library;
carrying out speaker self-adaptive training on the multi-speaker emotion voice data model to obtain an average voice model of the multi-speaker emotion voice data;
under the guidance of the emotion voice data of the target speaker, performing speaker self-adaptive transformation on the average voice model to obtain a speaker-related self-adaptive model;
performing text analysis on target emotion and voice text of a target speaker to be synthesized to obtain a context-related label file of the target text;
under the guidance of an adaptive model, obtaining a context-dependent HMM decision sequence of the target voice through decision analysis, and generating corresponding voice parameters;
and synthesizing the voice parameters to obtain the voice of the target emotion of the target speaker.
2. The method for synthesizing emotional speech based on multi-emotion speaker adaptation according to claim 1, wherein the acoustic parameter files required by the training models are extracted through STRAIGHT parameter extraction;
and/or the acoustic parameter file comprises at least fundamental frequency and spectral parameters.
3. The method as claimed in claim 1, wherein said obtaining a markup document from a target text document comprises: after the text file is subjected to text analysis, a single-phone labeling file containing phone information and a context-related labeling file containing context information are obtained by a labeling generation program.
4. The method for synthesizing emotional speech based on multi-emotional speaker adaptation according to claim 1, wherein the HMM training is performed on the primitive models under the guidance of context attributes and question sets;
and/or the HMM model library is obtained through decision tree clustering.
5. The method according to claim 1, wherein the speaker adaptive training is performed on the multi-speaker emotion voice data model by a constrained maximum likelihood linear regression algorithm; and/or
and carrying out speaker self-adaptive transformation on the average voice model through a constrained maximum likelihood linear regression algorithm.
6. The method of claim 1, wherein said adaptive model associated with a speaker is modified and updated using a maximum a posteriori probability.
7. The method as claimed in claim 1, wherein the context-dependent markup document of the target text is generated by a markup generation program.
8. The method of claim 1, wherein the synthesizing of speech parameters is performed using a STRAIGHT speech synthesizer.
9. The method of claim 5, wherein the performing speaker adaptive training on the multi-speaker emotion voice data model comprises:
giving training emotional voice data and target speaker emotional voice data;
the acoustic parameters are characterized, and the state output distribution and the duration distribution of the acoustic parameters are estimated and modeled;
and carrying out normalization processing on the difference between the state output distribution of the training voice data model and the state output distribution of the average voice model by using a linear regression equation.
10. The method of claim 9, wherein the given training emotion speech data and target speaker emotion speech data are followed by further comprising:
and estimating linear transformation between the two by adopting a maximum likelihood criterion, and obtaining a covariance matrix for adjusting the distribution of the model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576165.1A CN108831435B (en) | 2018-06-06 | 2018-06-06 | Emotional voice synthesis method based on multi-emotion speaker self-adaption |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108831435A CN108831435A (en) | 2018-11-16 |
CN108831435B (en) | 2020-10-16
Family
ID=64143538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810576165.1A Active CN108831435B (en) | 2018-06-06 | 2018-06-06 | Emotional voice synthesis method based on multi-emotion speaker self-adaption |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108831435B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109658917A (en) * | 2019-01-17 | 2019-04-19 | 深圳壹账通智能科技有限公司 | E-book chants method, apparatus, computer equipment and storage medium |
CN109949791A (en) * | 2019-03-22 | 2019-06-28 | 平安科技(深圳)有限公司 | Emotional speech synthesizing method, device and storage medium based on HMM |
CN110379407B (en) * | 2019-07-22 | 2021-10-19 | 出门问问(苏州)信息科技有限公司 | Adaptive speech synthesis method, device, readable storage medium and computing equipment |
CN110232907B (en) * | 2019-07-24 | 2021-11-02 | 出门问问(苏州)信息科技有限公司 | Voice synthesis method and device, readable storage medium and computing equipment |
CN111627420B (en) * | 2020-04-21 | 2023-12-08 | 升智信息科技(南京)有限公司 | Method and device for synthesizing emotion voice of specific speaker under extremely low resource |
CN112185345A (en) * | 2020-09-02 | 2021-01-05 | 电子科技大学 | Emotion voice synthesis method based on RNN and PAD emotion models |
CN117496944B (en) * | 2024-01-03 | 2024-03-22 | 广东技术师范大学 | Multi-emotion multi-speaker voice synthesis method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101226742A (en) * | 2007-12-05 | 2008-07-23 | 浙江大学 | Method for recognizing sound-groove based on affection compensation |
CN101452699A (en) * | 2007-12-04 | 2009-06-10 | 株式会社东芝 | Rhythm self-adapting and speech synthesizing method and apparatus |
CN102610236A (en) * | 2012-02-29 | 2012-07-25 | 山东大学 | Method for improving voice quality of throat microphone |
CN103456302A (en) * | 2013-09-02 | 2013-12-18 | 浙江大学 | Emotion speaker recognition method based on emotion GMM model weight synthesis |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5158022B2 (en) * | 2009-06-04 | 2013-03-06 | トヨタ自動車株式会社 | Dialog processing device, dialog processing method, and dialog processing program |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
- 2018-06-06: application CN201810576165.1A, granted as CN108831435B (Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831435B (en) | Emotional voice synthesis method based on multi-emotion speaker self-adaption | |
US10140972B2 (en) | Text to speech processing system and method, and an acoustic model training system and method | |
Morgan | Deep and wide: Multiple layers in automatic speech recognition | |
CN1835074B (en) | Speaking person conversion method combined high layer discription information and model self adaption | |
CN106688034A (en) | Text-to-speech with emotional content | |
CN107103900A (en) | A kind of across language emotional speech synthesizing method and system | |
JP2013171196A (en) | Device, method and program for voice synthesis | |
JPWO2006134736A1 (en) | Speech synthesis apparatus, speech synthesis method and program | |
Nose et al. | An intuitive style control technique in HMM-based expressive speech synthesis using subjective style intensity and multiple-regression global variance model | |
CN109036370B (en) | Adaptive training method for speaker voice | |
JP5807921B2 (en) | Quantitative F0 pattern generation device and method, model learning device for F0 pattern generation, and computer program | |
CN101178895A (en) | Model self-adapting method based on generating parameter listen-feel error minimize | |
Chen et al. | Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features | |
Toman et al. | Unsupervised and phonologically controlled interpolation of Austrian German language varieties for speech synthesis | |
Lee et al. | A comparative study of spectral transformation techniques for singing voice synthesis | |
JP4945465B2 (en) | Voice information processing apparatus and method | |
Yamagishi et al. | Adaptive training for hidden semi-Markov model [speech synthesis applications] | |
Liao et al. | Speaker adaptation of SR-HPM for speaking rate-controlled Mandarin TTS | |
Chunwijitra et al. | A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis | |
Savargiv et al. | Study on unit-selection and statistical parametric speech synthesis techniques | |
Yoshimura et al. | Cross-lingual speaker adaptation based on factor analysis using bilingual speech data for HMM-based speech synthesis. | |
JP6137708B2 (en) | Quantitative F0 pattern generation device, model learning device for F0 pattern generation, and computer program | |
Suzić et al. | Style-code method for multi-style parametric text-to-speech synthesis | |
Sung et al. | Factored maximum penalized likelihood kernel regression for HMM-based style-adaptive speech synthesis | |
Ijima et al. | Statistical model training technique based on speaker clustering approach for HMM-based speech synthesis |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |