CN109036370A - Speaker voice adaptive training method - Google Patents
Speaker voice adaptive training method
- Publication number
- CN109036370A (Application CN201810576452.2A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- distribution
- model
- duration
- adaptive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a speaker voice adaptive training method, belonging to the field of speech synthesis, comprising: providing training emotional speech data and target-speaker emotional speech data; extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters; normalizing the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data; and performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model. When the adaptive model obtained by this speaker voice adaptive training method is used for speech synthesis, it reduces the influence of speaker differences in the speech corpus and improves the emotional similarity of the synthesized speech; with only a small amount of emotional corpus for the target, it can synthesize emotional speech with good naturalness, fluency, and emotional similarity.
Description
Technical field
The invention belongs to the field of speech synthesis technology, and specifically concerns a speaker voice adaptive training method.
Background technique
In recent years, speech synthesis technology has developed continuously: from the earliest physical-mechanism and source-filter speech synthesis methods, to today's mature waveform-concatenation and statistical parametric methods, and on to the deep-learning-based methods now under active study, the quality of synthesized speech has improved markedly. However, with traditional speech synthesis methods, researchers have only achieved a simple conversion of written text into spoken output, ignoring the speaker's emotional information carried during verbal expression. Improving the expressiveness of synthesized speech will therefore become an important topic of emotional speech synthesis research and an inevitable trend in the field of speech signal processing.
Summary of the invention
The purpose of the present invention is to provide a speaker voice adaptive training method that yields an adaptive model for speech synthesis and improves the emotional similarity of synthesized speech.
The technical scheme adopted by the invention is as follows:

A speaker voice adaptive training method is provided, comprising:

providing training emotional speech data and target-speaker emotional speech data;

extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;

normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data;

under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
Further, the acoustic parameters include at least fundamental frequency parameters, spectrum parameters, and duration parameters.

Further, after the training emotional speech data and the target speaker's emotional speech data are given, the method further includes: estimating the linear transformation between the two using the maximum likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.

Further, estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output distributions and the duration distributions simultaneously.
Further, the linear regression equations include:

$$\hat{\mu}_s = W\xi_i = A o_i + b \tag{2.1}$$

$$\hat{m}_s = X\psi_i = \alpha d_i + \beta \tag{2.2}$$

where formula (2.1) is the transformation equation of the state output distribution: $\hat{\mu}_s$ denotes the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and $o_i$ is the corresponding average observation vector; formula (2.2) is the transformation equation of the state duration distribution: $\hat{m}_s$ denotes the mean of the state duration of training speech data model s, X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, and $d_i$ is the corresponding average duration, with $\xi_i = [o_i^T, 1]^T$ and $\psi_i = [d_i, 1]^T$.
Further, performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptive algorithm to perform the speaker-adaptive transformation on the average voice model.

Further, the adaptive transformation comprises: using the means and covariance matrices of the speaker's state output and duration probability distributions, transforming the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized.
Further, the adaptive model is modified and updated using a maximum a posteriori probability algorithm.
Compared with the prior art, the invention has the following benefits: the exemplary speaker voice adaptive training method of the invention yields an adaptive model for adaptive training during speech synthesis, which reduces the influence of speaker differences in the speech corpus and improves the emotional similarity of the synthesized speech; on the basis of the average voice model, through the speaker-adaptive transformation method, only a small amount of emotional corpus for the target is needed to synthesize emotional speech with good naturalness, fluency, and emotional similarity.
Detailed description of the invention
Other features, objects, and advantages of the application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the attached drawings:
Fig. 1 is the work flow diagram of the embodiment of the present invention;
Fig. 2 is speaker adaptation of embodiment of the present invention algorithm flow chart.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described here serve only to explain the related invention, not to restrict it. It should also be noted that, for convenience of description, only the parts relevant to the invention are illustrated in the drawings.

It should be noted that, in the absence of conflict, the embodiments of the application and the features in the embodiments may be combined with each other. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Figure 1, an embodiment of the present invention provides a speaker voice adaptive training method, comprising:

S1: providing training emotional speech data and target-speaker emotional speech data;

S2: extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;

S3: normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data;

S4: under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
In S1, after the training emotional speech data and the target speaker's emotional speech data are given, the method further includes: estimating the linear transformation between the two using the maximum likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.

In S2, the acoustic parameters include at least fundamental frequency parameters, spectrum parameters, and duration parameters. Estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output distributions and the duration distributions simultaneously.
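The simultaneous modeling of state output and duration in S2 can be sketched as follows. This is an illustrative single-state, one-dimensional Gaussian sketch (the class and function names are our own, not from the patent; a real system estimates these distributions for every context-dependent HSMM state):

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

class HSMMState:
    """One HSMM state: a Gaussian output distribution plus an explicit
    Gaussian duration distribution, estimated jointly from aligned segments."""

    def fit(self, segments):
        # segments: list of observation sequences aligned to this state
        obs = [o for seg in segments for o in seg]
        durs = [len(seg) for seg in segments]
        self.out_mean = sum(obs) / len(obs)
        self.out_var = sum((o - self.out_mean) ** 2 for o in obs) / len(obs) or 1e-6
        self.dur_mean = sum(durs) / len(durs)
        self.dur_var = sum((d - self.dur_mean) ** 2 for d in durs) / len(durs) or 1e-6
        return self

    def likelihood(self, seg):
        """p(d) * prod_t b(o_t): duration and output are scored together,
        which is what simultaneous modeling buys over a plain HMM."""
        p = gaussian_pdf(len(seg), self.dur_mean, self.dur_var)
        for o in seg:
            p *= gaussian_pdf(o, self.out_mean, self.out_var)
        return p
```

The key point of the HSMM is visible in `likelihood`: the segment duration has its own explicit density rather than being implied by a state self-transition probability.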
In S4, performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptive algorithm to perform the speaker-adaptive transformation on the average voice model. The adaptive transformation comprises: using the means and covariance matrices of the speaker's state output and duration probability distributions, transforming the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized.

The adaptive model obtained in this embodiment is modified and updated using a maximum a posteriori probability algorithm.
Systematically, the constrained maximum likelihood linear regression algorithm is first applied to the multi-speaker emotional speech data models for speaker-adaptive training, yielding an average voice model of the multi-speaker emotional speech data. Then, under the guidance of the target speaker's emotional speech data, the same constrained maximum likelihood linear regression algorithm is used to apply a speaker-adaptive transformation to the average voice model, yielding a speaker-dependent adaptive model; finally, the adaptive model is modified and updated using maximum a posteriori probability.
To improve the quality of synthesized emotional speech, this embodiment trains an average voice model from the speech data of multiple emotional speakers. Because the speakers differ in gender, personality, emotional expression, and so on, their acoustic models deviate considerably from one another. To avoid the influence of speaker variation on the trained model, this embodiment adopts the method of speaker adaptive training (Speaker Adaptive Training, SAT), which normalizes speaker differences, thereby improving the accuracy of the model and, in turn, the quality of synthesized emotional speech. Considering that unvoiced and silent segments of Chinese speech have no fundamental frequency, multi-space probability distribution HMMs (Multi-space probability distribution, MSD-HMM) are used here to model the fundamental frequency. Based on context-dependent MSD-HSMM speech synthesis units, this embodiment applies the constrained maximum likelihood linear regression (constrained maximum likelihood linear regression, CMLLR) algorithm to the multi-speaker emotional corpus for speaker-adaptive training, yielding an average voice model of the multi-speaker emotional speech.
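The multi-space probability distribution used for fundamental frequency can be illustrated with a minimal two-space sketch: one discrete space for unvoiced frames and one Gaussian space for voiced log-F0. The weights and parameter values below are made-up illustrations, not values from the patent:

```python
import math

class MSDStream:
    """Multi-space distribution for F0: one discrete space for unvoiced
    frames (weight w_uv) and one Gaussian space for voiced log-F0."""

    def __init__(self, w_unvoiced, mean, var):
        self.w_uv = w_unvoiced        # P(frame is unvoiced)
        self.w_v = 1.0 - w_unvoiced   # P(frame is voiced)
        self.mean, self.var = mean, var

    def pdf(self, f0):
        """f0 is None for an unvoiced frame, else an observed log-F0 value."""
        if f0 is None:
            return self.w_uv
        g = math.exp(-(f0 - self.mean) ** 2 / (2 * self.var))
        return self.w_v * g / math.sqrt(2 * math.pi * self.var)
```

The design point is that a single stream can score both kinds of frame: unvoiced frames carry probability mass in a zero-dimensional space, so no artificial F0 value has to be invented for them.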
Figure 2 illustrates the speaker adaptation algorithm flow of this embodiment. First, training emotional speech data and target-speaker emotional speech data are given; to reflect the difference between the two models, this embodiment estimates the linear transformation between the two model data sets using the maximum likelihood criterion and adjusts the covariance matrices of the model distributions accordingly. During adaptive training, acoustic parameters such as fundamental frequency, spectrum, and duration must be extracted, and their state output distributions and duration distributions estimated and modeled; however, the initial hidden Markov model does not describe the duration distribution accurately, so this embodiment uses a hidden semi-Markov model (hidden semi-Markov model, HSMM), with accurate duration distributions, to model the state output and duration distributions simultaneously. This embodiment uses a group of linear regression equations, shown as formulas (2.1) and (2.2), to normalize the speaker speech model differences:

$$\hat{\mu}_s = W\xi_i = A o_i + b \tag{2.1}$$

$$\hat{m}_s = X\psi_i = \alpha d_i + \beta \tag{2.2}$$

where formula (2.1) is the transformation equation of the state output distribution: $\hat{\mu}_s$ denotes the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and $o_i$ is the corresponding average observation vector; formula (2.2) is the transformation equation of the state duration distribution: $\hat{m}_s$ denotes the mean of the state duration of training speech data model s, X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, and $d_i$ is the corresponding average duration, with $\xi_i = [o_i^T, 1]^T$ and $\psi_i = [d_i, 1]^T$.
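The normalization transforms of formulas (2.1) and (2.2) are plain affine maps; a minimal sketch in small dimensions (helper names are illustrative):

```python
def output_mean_transform(A, b, o):
    """Formula (2.1): mu_s = W xi = A o + b, mapping the average observed
    vector o to the state-output mean of training-speaker model s
    (pure-Python matrix-vector product plus bias)."""
    return [sum(A[i][j] * o[j] for j in range(len(o))) + b[i]
            for i in range(len(A))]

def duration_mean_transform(alpha, beta, d):
    """Formula (2.2): m_s = X psi = alpha * d + beta, the scalar affine
    map applied to the average state duration d."""
    return alpha * d + beta
```

In SAT these transforms are estimated per training speaker, so that the residual model after normalization is the speaker-independent average voice model.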
Then, after the speaker-adaptive training has been performed, a small number of emotional sentences of the target speaker to be synthesized can be used, with the CMLLR adaptive algorithm, to apply a speaker-adaptive transformation to the average voice model, obtaining a speaker-adaptive model that represents the target speaker. In the speaker-adaptive transformation, the means and covariance matrices of the speaker's state output and duration probability distributions are mainly used to transform the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized. Formula (2.3) shows the transformation equation of the feature vector o in state i, and formula (2.4) the transformation equation of the state duration d in state i:

$$b_i(o) = N(o;\, A\mu_i - b,\, A\Sigma_i A^T) = |A^{-1}|\, N(W\xi;\, \mu_i, \Sigma_i) \tag{2.3}$$

$$p_i(d) = N(d;\, \alpha m_i - \beta,\, \alpha^2\sigma_i^2) = |\alpha^{-1}|\, N(X\psi;\, m_i, \sigma_i^2) \tag{2.4}$$

where $\xi = [o^T, 1]^T$, $\psi = [d, 1]^T$, $\mu_i$ is the mean of the state output distribution, $m_i$ is the mean of the duration distribution, $\Sigma_i$ is the diagonal covariance matrix, and $\sigma_i^2$ is the variance. $W = [A^{-1}, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution.
Through the HSMM-based adaptive transformation algorithm, the speech acoustic characteristic parameters can be normalized and processed. For adaptation data O of length T, a maximum likelihood estimate of the transformation Λ = (W, X) can be obtained:

$$\hat{\Lambda} = (\hat{W}, \hat{X}) = \arg\max_{\Lambda} P(O \mid \lambda, \Lambda) \tag{2.5}$$

where λ is the parameter set of the HSMM.
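Formula (2.3) says the CMLLR-transformed likelihood can be computed equivalently in feature space or in model space. The one-dimensional sketch below checks that identity numerically (function names are our own; the sign convention follows the simple form |a|·N(a·o + c; μ, σ²)):

```python
import math

def gauss(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def cmllr_feature_space(o, a, c, mean, var):
    """Constrained transform applied to the observation:
    |a| * N(a*o + c; mean, var)."""
    return abs(a) * gauss(a * o + c, mean, var)

def cmllr_model_space(o, a, c, mean, var):
    """Same density written as a transform of the model parameters:
    N(o; (mean - c)/a, var/a**2)."""
    return gauss(o, (mean - c) / a, var / a ** 2)
```

Because the same matrix acts on both mean and covariance ("constrained"), the transform can be pushed onto the features once per frame instead of onto every Gaussian, which is what makes CMLLR cheap with many distributions.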
When the amount of target-speaker data is limited, a separate transformation matrix cannot be estimated for each model distribution; multiple distributions must therefore share one transformation matrix, that is, the regression matrices are bound, so that a good adaptation effect can ultimately be achieved with relatively little data, as shown in Figure 2.
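The binding of regression matrices described above can be sketched as a simple occupancy-threshold rule. The threshold, names, and greedy scheme here are illustrative assumptions; production systems typically bind transforms via a regression class tree:

```python
def bind_regression_matrices(state_occupancy, min_occupancy):
    """Give a distribution its own CMLLR transform only if enough
    adaptation frames fall in it; otherwise bind it to a shared
    'global' transform class."""
    return {state: (state if occ >= min_occupancy else "global")
            for state, occ in state_occupancy.items()}
```

With very little target data, almost every distribution falls back to the global class, which is exactly the regime the patent describes: few transforms, robustly estimated.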
This embodiment modifies and updates the model using the maximum a posteriori probability (Maximum A Posteriori, MAP) algorithm. For a given HSMM parameter set, let the forward probability be $\alpha_t(i)$ and the backward probability $\beta_t(i)$; the generation probability $\chi_t^d(i)$ of the continuous observation sequence $o_{t-d+1} \ldots o_t$ in state i is:

$$\chi_t^d(i) = \frac{1}{P(O \mid \lambda)}\, \alpha_{t-d}(i)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i) \tag{2.6}$$

The maximum a posteriori estimation is described as follows:

$$\hat{\mu} = \frac{\omega \bar{\mu} + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} d\, \chi_t^d(i)} \tag{2.7}$$

$$\hat{m} = \frac{\tau \bar{m} + \sum_{t=1}^{T}\sum_{d=1}^{t} d\, \chi_t^d(i)}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)} \tag{2.8}$$

In these formulas, $\bar{\mu}$ and $\bar{m}$ represent the mean vectors after the linear regression transformation, ω represents the MAP estimation parameter of the state output, and τ represents the MAP estimation parameter of the duration distribution; $\hat{\mu}$ and $\hat{m}$ are the weighted-average MAP estimates of the adapted mean vectors $\bar{\mu}$ and $\bar{m}$.
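The MAP correction of an adapted mean is, at heart, a weighted interpolation between the regression-transformed prior mean and the target speaker's data statistics. A one-dimensional sketch (names are illustrative; `weight` plays the role of ω for state output and τ for duration):

```python
def map_update_mean(prior_mean, data_sum, data_count, weight):
    """MAP re-estimate of a mean: interpolate the linear-regression prior
    mean with the target speaker's sufficient statistics (sum and count
    of occupied observations)."""
    return (weight * prior_mean + data_sum) / (weight + data_count)
```

With no adaptation data the prior is returned unchanged; as `data_count` grows, the estimate approaches the data mean, which is the "modify and update" behavior of the MAP step.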
The emotional speech synthesis system based on the adaptive model described in this embodiment was built alongside a traditional speech synthesis system based on hidden Markov models. Experiments confirm that, compared with the traditional HMM-based speech synthesis system, adding the speaker-adaptive training process at the training stage to obtain an average voice model of multiple speakers' emotional speech reduces the influence of speaker differences in the corpus and improves the emotional similarity of the synthesized speech; on the basis of the average voice model, through the speaker-adaptive transformation algorithm, only a small amount of emotional corpus for the target is needed to synthesize emotional speech with good naturalness, fluency, and emotional similarity.
The above description is only a preferred embodiment of the application and an explanation of the technical principles applied. Those skilled in the art should appreciate that the scope of the invention involved in the application is not limited to technical schemes formed by the specific combination of the above technical features; without departing from the inventive concept, it also covers other technical schemes formed by any combination of the above technical features or their equivalents, for example schemes in which the above features are replaced by (but not limited to) technical features with similar functions disclosed herein.

Apart from the technical features described in the specification, the remaining technical features are known to those skilled in the art; to highlight the innovative features of the invention, those remaining technical features are not described in detail here.
Claims (8)
1. A speaker voice adaptive training method, characterized by comprising:

providing training emotional speech data and target-speaker emotional speech data;

extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;

normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data;

under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
2. The speaker voice adaptive training method according to claim 1, characterized in that the acoustic parameters include at least fundamental frequency parameters, spectrum parameters, and duration parameters.

3. The speaker voice adaptive training method according to claim 1, characterized in that, after the training emotional speech data and the target speaker's emotional speech data are given, the method further includes: estimating the linear transformation between the two using the maximum likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.

4. The speaker voice adaptive training method according to claim 1, characterized in that estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output distributions and the duration distributions simultaneously.
5. The speaker voice adaptive training method according to claim 1, characterized in that the linear regression equations include:

$$\hat{\mu}_s = W\xi_i = A o_i + b \tag{2.1}$$

$$\hat{m}_s = X\psi_i = \alpha d_i + \beta \tag{2.2}$$

where formula (2.1) is the transformation equation of the state output distribution: $\hat{\mu}_s$ denotes the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and $o_i$ is the corresponding average observation vector; formula (2.2) is the transformation equation of the state duration distribution: $\hat{m}_s$ denotes the mean of the state duration of training speech data model s, X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, and $d_i$ is the corresponding average duration, with $\xi_i = [o_i^T, 1]^T$ and $\psi_i = [d_i, 1]^T$.
6. The speaker voice adaptive training method according to claim 3, characterized in that performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptive algorithm to perform the speaker-adaptive transformation on the average voice model.

7. The speaker voice adaptive training method according to claim 6, characterized in that the adaptive transformation comprises: using the means and covariance matrices of the speaker's state output and duration probability distributions, transforming the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized.

8. The speaker voice adaptive training method according to claim 1, characterized in that the adaptive model is modified and updated using a maximum a posteriori probability algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576452.2A CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576452.2A CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109036370A true CN109036370A (en) | 2018-12-18 |
CN109036370B CN109036370B (en) | 2021-07-20 |
Family
ID=64612408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810576452.2A Active CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036370B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109949791A (en) * | 2019-03-22 | 2019-06-28 | 平安科技(深圳)有限公司 | Emotional speech synthesizing method, device and storage medium based on HMM |
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
CN112837674A (en) * | 2019-11-22 | 2021-05-25 | 阿里巴巴集团控股有限公司 | Speech recognition method, device and related system and equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217713A (en) * | 2014-07-15 | 2014-12-17 | 西北师范大学 | Tibetan-Chinese speech synthesis method and device |
US20150127350A1 (en) * | 2013-11-01 | 2015-05-07 | Google Inc. | Method and System for Non-Parametric Voice Conversion |
GB2524505A (en) * | 2014-03-24 | 2015-09-30 | Toshiba Res Europ Ltd | Voice conversion |
CN106531150A (en) * | 2016-12-23 | 2017-03-22 | 上海语知义信息技术有限公司 | Emotion synthesis method based on deep neural network model |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
US20170213542A1 (en) * | 2016-01-26 | 2017-07-27 | James Spencer | System and method for the generation of emotion in the output of a text to speech system |
CN107039033A (en) * | 2017-04-17 | 2017-08-11 | 海南职业技术学院 | A kind of speech synthetic device |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
CN107895582A (en) * | 2017-10-16 | 2018-04-10 | 中国电子科技集团公司第二十八研究所 | Towards the speaker adaptation speech-emotion recognition method in multi-source information field |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150127350A1 (en) * | 2013-11-01 | 2015-05-07 | Google Inc. | Method and System for Non-Parametric Voice Conversion |
GB2524505A (en) * | 2014-03-24 | 2015-09-30 | Toshiba Res Europ Ltd | Voice conversion |
CN104217713A (en) * | 2014-07-15 | 2014-12-17 | 西北师范大学 | Tibetan-Chinese speech synthesis method and device |
US20170213542A1 (en) * | 2016-01-26 | 2017-07-27 | James Spencer | System and method for the generation of emotion in the output of a text to speech system |
CN106531150A (en) * | 2016-12-23 | 2017-03-22 | 上海语知义信息技术有限公司 | Emotion synthesis method based on deep neural network model |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
CN107039033A (en) * | 2017-04-17 | 2017-08-11 | 海南职业技术学院 | A kind of speech synthetic device |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
CN107895582A (en) * | 2017-10-16 | 2018-04-10 | 中国电子科技集团公司第二十八研究所 | Towards the speaker adaptation speech-emotion recognition method in multi-source information field |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109949791A (en) * | 2019-03-22 | 2019-06-28 | 平安科技(深圳)有限公司 | Emotional speech synthesizing method, device and storage medium based on HMM |
CN112837674A (en) * | 2019-11-22 | 2021-05-25 | 阿里巴巴集团控股有限公司 | Speech recognition method, device and related system and equipment |
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
WO2021212954A1 (en) * | 2020-04-21 | 2021-10-28 | 升智信息科技(南京)有限公司 | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources |
CN111627420B (en) * | 2020-04-21 | 2023-12-08 | 升智信息科技(南京)有限公司 | Method and device for synthesizing emotion voice of specific speaker under extremely low resource |
Also Published As
Publication number | Publication date |
---|---|
CN109036370B (en) | 2021-07-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831435B (en) | Emotional voice synthesis method based on multi-emotion speaker self-adaption | |
JP6092293B2 (en) | Text-to-speech system | |
Yamagishi et al. | Robust speaker-adaptive HMM-based text-to-speech synthesis | |
Yamagishi | Average-voice-based speech synthesis | |
KR102311922B1 (en) | Apparatus and method for controlling outputting target information to voice using characteristic of user voice | |
Qian et al. | Improved prosody generation by maximizing joint probability of state and longer units | |
Inoue et al. | An investigation to transplant emotional expressions in DNN-based TTS synthesis | |
CN109036370A (en) | A kind of speaker's voice adaptive training method | |
Yamagishi et al. | Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis | |
Yin et al. | Modeling F0 trajectories in hierarchically structured deep neural networks | |
Toda | Augmented speech production based on real-time statistical voice conversion | |
Shirota et al. | Integration of speaker and pitch adaptive training for HMM-based singing voice synthesis | |
Chen et al. | Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features | |
Gao et al. | Articulatory copy synthesis using long-short term memory networks | |
Lee et al. | A comparative study of spectral transformation techniques for singing voice synthesis | |
Yamagishi et al. | Adaptive training for hidden semi-Markov model [speech synthesis applications] | |
Ling et al. | Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge | |
Liao et al. | Speaker adaptation of SR-HPM for speaking rate-controlled Mandarin TTS | |
Chunwijitra et al. | A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis | |
Zen | Statistical parametric speech synthesis: from HMM to LSTM-RNN | |
Karhila et al. | Creating synthetic voices for children by adapting adult average voice using stacked transformations and VTLN | |
Sung et al. | Factored MLLR adaptation for singing voice generation | |
Coto-Jiménez et al. | Speech Synthesis Based on Hidden Markov Models and Deep Learning. | |
Suzić et al. | Style-code method for multi-style parametric text-to-speech synthesis | |
Kaya et al. | Effectiveness of Speech Mode Adaptation for Improving Dialogue Speech Synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||