CN109036370A - A speaker voice adaptive training method - Google Patents

A speaker voice adaptive training method

Info

Publication number
CN109036370A
CN109036370A (application CN201810576452.2A)
Authority
CN
China
Prior art keywords
speaker
distribution
model
duration
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810576452.2A
Other languages
Chinese (zh)
Other versions
CN109036370B (en)
Inventor
赵峰
徐海青
吴立刚
章爱武
潘子春
李葵
李明
张引强
黄影
陈是同
徐唯耀
秦浩
王文清
郑娟
王维佳
秦婷
梁翀
浦正国
张天奇
余江斌
韩涛
杨维
张才俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Anhui Jiyuan Software Co Ltd
Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Anhui Jiyuan Software Co Ltd
Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, Anhui Jiyuan Software Co Ltd, Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201810576452.2A priority Critical patent/CN109036370B/en
Publication of CN109036370A publication Critical patent/CN109036370A/en
Application granted granted Critical
Publication of CN109036370B publication Critical patent/CN109036370B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a speaker voice adaptive training method, belonging to the field of speech synthesis technology, comprising: providing training emotional speech data and target-speaker emotional speech data; characterizing acoustic parameters, and estimating and modeling their state output distributions and duration distributions; normalizing the differences between the state output distributions of the training speech data models and those of the average voice model, to obtain an average voice model for the target speaker's emotional speech data; and performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model. When the adaptive model obtained by this exemplary speaker voice adaptive training method is used for speech synthesis, it can reduce the influence of speaker differences in the speech corpus and improve the emotional similarity of the synthesized speech; with only a small amount of emotional corpus to be synthesized, emotional speech with good naturalness, fluency, and emotional similarity can be synthesized.

Description

A speaker voice adaptive training method
Technical field
The invention belongs to the field of speech synthesis technology, and specifically relates to a speaker voice adaptive training method.
Background art
In recent years, with the continuous development of speech synthesis technology, from the early physical-mechanism and source-filter synthesis methods, through the now mature waveform-concatenation and statistical parametric synthesis methods, to the deep-learning-based synthesis methods currently being studied, the quality of synthesized speech has improved markedly. However, with traditional speech synthesis methods, researchers have only converted written text into spoken output, ignoring the speaker's emotional information carried in verbal expression. How to improve the expressiveness of synthesized speech will therefore become an important topic in emotional speech synthesis research and an inevitable trend of future research in the field of speech signal processing.
Summary of the invention
The purpose of the present invention is to provide a speaker voice adaptive training method that yields an adaptive model usable for speech synthesis and improves the emotional similarity of the synthesized speech.
The technical scheme adopted by the invention is as follows:
A speaker voice adaptive training method is provided, comprising:
providing training emotional speech data and target-speaker emotional speech data;
characterizing acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;
normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and those of the average voice model, to obtain an average voice model for the target speaker's emotional speech data;
under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
Further, the acoustic parameters include at least fundamental frequency parameters, spectral parameters, and duration parameters.
Further, after the training emotional speech data and the target-speaker emotional speech data are provided, the method further includes:
estimating the linear transformation between the two using the maximum-likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.
Further, estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output and duration distributions simultaneously.
Further, the linear regression equations comprise:
Wherein, formula (2.1) is the transformation equation of the state output distribution, which gives the mean vector of the state output of training speech data model s; W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and o_i is its mean observation vector. Formula (2.2) is the transformation equation of the state duration distribution, which gives the mean vector of the state duration of training speech data model s; X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, d_i is its mean duration, and ξ = [o^T, 1].
Further, performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptation algorithm to perform a speaker-adaptive transformation on the average voice model.
Further, the adaptive transformation includes: using the means and covariance matrices of the speaker's state output and duration probability distributions to transform the fundamental frequency, spectral, and duration parameters in the mixed-language average voice model into the characteristic parameters of the speech to be synthesized.
Further, the adaptive model is corrected and updated using a maximum a posteriori probability algorithm.
Compared with the prior art, the invention has the following beneficial effects: the exemplary speaker voice adaptive training method obtains an adaptive model that can be used for adaptive training during speech synthesis; it can reduce the influence of speaker differences in the speech corpus and improve the emotional similarity of the synthesized speech. On the basis of the average voice model, through the speaker-adaptive transformation method, emotional speech with good naturalness, fluency, and emotional similarity can be synthesized using only a small amount of emotional corpus to be synthesized.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is a workflow diagram of an embodiment of the present invention;
Fig. 2 is a flowchart of the speaker adaptation algorithm of an embodiment of the present invention.
Detailed description of the embodiments
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the related invention, not to limit it. It should also be noted that, for convenience of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that, unless they conflict, the embodiments of the present application and the features therein may be combined with one another. The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
As shown in Fig. 1, an embodiment of the present invention provides a speaker voice adaptive training method, comprising:
S1: provide training emotional speech data and target-speaker emotional speech data;
S2: characterize the acoustic parameters, and estimate and model their state output distributions and duration distributions;
S3: normalize, with linear regression equations, the differences between the state output distributions of the training speech data models and those of the average voice model, obtaining an average voice model for the target speaker's emotional speech data;
S4: under the guidance of the target speaker's emotional speech data, perform a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
In S1, after the training emotional speech data and the target-speaker emotional speech data are provided, the method further includes: estimating the linear transformation between the two using the maximum-likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.
In S2, the acoustic parameters include at least fundamental frequency parameters, spectral parameters, and duration parameters. Estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output and duration distributions simultaneously.
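As one concrete way to obtain the acoustic parameters named above, the sketch below extracts an F0 track and a spectral envelope with the WORLD analysis functions exposed by the pyworld package; this front end is an assumption for illustration only, and the frame-level duration parameters, which in practice would come from state-level forced alignment, are reduced here to a simple frame count.

```python
import numpy as np
import soundfile as sf
import pyworld as pw

def extract_acoustic_parameters(wav_path, frame_period_ms=5.0):
    """Extract F0 and spectral-envelope frames for one utterance (illustrative)."""
    x, fs = sf.read(wav_path)                        # mono waveform expected
    x = np.ascontiguousarray(x, dtype=np.float64)    # pyworld requires float64
    f0, t = pw.dio(x, fs, frame_period=frame_period_ms)   # coarse F0 track
    f0 = pw.stonemask(x, f0, t, fs)                        # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)                       # spectral envelope
    # Duration parameters would normally come from forced alignment of the
    # transcription; here only the utterance frame count is returned.
    return f0, sp, len(f0)

# Hypothetical usage (the file name is a placeholder):
# f0, sp, n_frames = extract_acoustic_parameters("target_sentence.wav")
```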
In S4, performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptation algorithm to the average voice model. The adaptive transformation includes: using the means and covariance matrices of the speaker's state output and duration probability distributions to transform the fundamental frequency, spectral, and duration parameters in the mixed-language average voice model into the characteristic parameters of the speech to be synthesized.
The adaptive model obtained in this embodiment is corrected and updated using a maximum a posteriori probability algorithm.
In terms of the overall system, constrained maximum likelihood linear regression is first applied to the multi-speaker emotional speech data models for speaker-adaptive training, obtaining an average voice model of the multi-speaker emotional speech data. Then, under the guidance of the target speaker's emotional speech data, the same constrained maximum likelihood linear regression algorithm is applied to the average voice model for speaker-adaptive transformation, obtaining a speaker-dependent adaptive model. Finally, the adaptive model is corrected and updated using maximum a posteriori estimation.
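The overall train, adapt, and update sequence can be condensed into a short driver routine; in the sketch below every function is a hypothetical stub standing in for the SAT training, CMLLR adaptation, and MAP update stages described above, so the outline runs but performs no real modeling.

```python
# Hypothetical stage functions; real implementations would wrap the SAT, CMLLR,
# and MAP steps described in the text.  They are stubs so the outline executes.
def train_average_voice_sat(multi_speaker_corpora):
    return {"average_voice": "parameters trained with SAT + CMLLR"}

def cmllr_adapt(average_voice, target_corpus):
    return {**average_voice, "transform": "CMLLR transform from target data"}

def map_update(adapted_model, target_corpus):
    return {**adapted_model, "map": "means corrected with MAP weighting"}

def build_speaker_adaptive_model(multi_speaker_corpora, target_corpus):
    average_voice = train_average_voice_sat(multi_speaker_corpora)  # step 1: SAT average voice
    adapted = cmllr_adapt(average_voice, target_corpus)             # step 2: CMLLR adaptation
    return map_update(adapted, target_corpus)                       # step 3: MAP correction

print(build_speaker_adaptive_model(["speaker A data", "speaker B data"],
                                   "target emotional sentences"))
```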
To improve the quality of the synthesized emotional speech, this embodiment trains an average voice model with speech data from multiple emotional speakers. Because these speakers differ in gender, personality, emotional expression, and so on, the acoustic model would otherwise show considerable deviation. To avoid the influence of speaker variation on the trained model, this embodiment uses the method of speaker adaptive training (Speaker Adaptive Training, SAT) to normalize speaker differences, improving the accuracy of the model and thereby the quality of the synthesized emotional speech. Considering that unvoiced and silent segments of Chinese have no fundamental frequency, multi-space probability distribution HMMs (Multi-space probability distribution, MSD-HMM) are used here to model the fundamental frequency. Based on context-dependent MSD-HSMM speech synthesis units, this embodiment applies constrained maximum likelihood linear regression (constrained maximum likelihood linear regression, CMLLR) to the multi-speaker emotional corpora for speaker-adaptive training, obtaining an average voice model of the multi-speaker emotional speech.
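The multi-space probability distribution view of F0 mentioned above treats each frame as belonging either to a one-dimensional continuous space (voiced, carrying an F0 value) or to a zero-dimensional discrete space (unvoiced); a minimal sketch of that representation, assuming log-F0 as the continuous observation, is:

```python
import numpy as np

def to_msd_observations(f0):
    """Split an F0 track into MSD-style observations.

    Voiced frames (f0 > 0) map to the continuous space with value log(f0);
    unvoiced frames map to the discrete 'unvoiced' space with no value.
    """
    obs = []
    for v in np.asarray(f0, dtype=float):
        if v > 0.0:
            obs.append(("voiced", np.log(v)))   # continuous 1-D space
        else:
            obs.append(("unvoiced", None))      # discrete 0-D space
    return obs

print(to_msd_observations([0.0, 120.0, 0.0, 180.5]))
```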
Fig. 2 illustrates the speaker adaptation algorithm flow of this embodiment. First, training emotional speech data and target-speaker emotional speech data are provided. To reflect the difference between the two models, this embodiment estimates the linear transformation between the two sets of model data using the maximum-likelihood criterion, and adjusts the covariance matrices of the model distributions accordingly. During adaptive training, acoustic parameters such as fundamental frequency, spectral, and duration parameters must be characterized, and the state output distributions and duration distributions of these parameters must be estimated and modeled. However, the original hidden Markov model does not describe the duration distribution accurately, so this embodiment uses a hidden semi-Markov model (hidden semi-Markov model, HSMM), with an explicit duration distribution, to model the state output and duration distributions simultaneously. This embodiment uses a set of linear regression equations, as shown in formulas (2.1) and (2.2), to normalize the differences between the speaker speech models.
Wherein, formula (2.1) is the transformation equation of the state output distribution, which gives the mean vector of the state output of training speech data model s; W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and o_i is its mean observation vector. Formula (2.2) is the transformation equation of the state duration distribution, which gives the mean vector of the state duration of training speech data model s; X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, d_i is its mean duration, and ξ = [o^T, 1].
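From the definitions given for formulas (2.1) and (2.2), the normalization amounts to affine transforms of the model means, roughly mean_out = A·o_i + b for the state output and mean_dur = α·d_i + β for the state duration; the sketch below applies transforms of that assumed shape and is an illustration under that reading, not the patent's own procedure.

```python
import numpy as np

def transform_output_mean(A, b, o_mean):
    """Affine transform of a state-output mean vector, i.e. W = [A, b] applied to [o^T, 1]^T."""
    return np.asarray(A, float) @ np.asarray(o_mean, float) + np.asarray(b, float)

def transform_duration_mean(alpha, beta, d_mean):
    """Affine transform of a scalar state-duration mean, i.e. X = [alpha, beta] applied to [d, 1]^T."""
    return alpha * d_mean + beta

# Toy example: normalize one state's means of a training-speaker model toward
# the average voice model with speaker-specific transform parameters.
A = np.eye(3) * 0.9
b = np.array([0.1, -0.2, 0.0])
print(transform_output_mean(A, b, [1.0, 2.0, 3.0]))
print(transform_duration_mean(1.1, -0.5, 8.0))
```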
Then, after speaker-adaptive training has been performed, a small number of emotional sentences of the target speaker to be synthesized can be used to perform a speaker-adaptive transformation on the average voice model with the CMLLR adaptation algorithm, obtaining a speaker adaptation model that represents the target speaker. In the speaker-adaptive transformation, the means and covariance matrices of the speaker's state output and duration probability distributions are mainly used to transform the fundamental frequency, spectral, and duration parameters in the mixed-language average voice model into the characteristic parameters of the speech to be synthesized. Formula (2.3) gives the transformation equation of the feature vector o under state i, and formula (2.4) gives the transformation equation of the state duration d under state i:
b_i(o) = N(o; Aμ_i − b, A Σ_i A^T) = |A⁻¹| N(Wξ; μ_i, Σ_i)    (2.3)
Wherein, ξ = [o^T, 1], ψ = [d, 1]^T, μ_i is the mean of the state output distribution, m_i is the mean of the duration distribution, Σ_i is the diagonal covariance matrix, and σ_i² is the variance. W = [A⁻¹, b⁻¹] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α⁻¹, β⁻¹] is the transformation matrix of the state duration probability density distribution.
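As a concrete reading of formula (2.3) and of the duration counterpart referred to as formula (2.4), the sketch below evaluates the adapted state-output and state-duration likelihoods directly from the transformed means and (co)variances; the numbers are toy values, and the duration form N(d; α·m_i − β, α²·σ_i²) is assumed by symmetry with (2.3) rather than quoted from the patent.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def adapted_output_likelihood(o, mu_i, sigma_i, A, b):
    """Formula (2.3): N(o; A*mu_i - b, A*Sigma_i*A^T) for one state i."""
    mean = A @ mu_i - b
    cov = A @ sigma_i @ A.T
    return multivariate_normal.pdf(o, mean=mean, cov=cov)

def adapted_duration_likelihood(d, m_i, var_i, alpha, beta):
    """Assumed formula (2.4): N(d; alpha*m_i - beta, alpha^2 * sigma_i^2)."""
    return norm.pdf(d, loc=alpha * m_i - beta, scale=abs(alpha) * np.sqrt(var_i))

# Toy example for a 2-dimensional state-output distribution.
mu = np.array([1.0, -0.5])
Sigma = np.diag([0.3, 0.2])
A = np.array([[1.1, 0.0], [0.0, 0.9]])
b = np.array([0.05, -0.1])
print(adapted_output_likelihood(np.array([1.0, -0.4]), mu, Sigma, A, b))
print(adapted_duration_likelihood(7.0, m_i=6.5, var_i=2.0, alpha=1.05, beta=-0.3))
```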
Through the HSMM-based adaptive transformation algorithm, the speech acoustic feature parameters can be normalized and processed. For adaptation data O of length T, maximum-likelihood estimation of the transforms Λ = (W, X) can be carried out, where λ denotes the parameter set of the HSMM.
When the amount of target-speaker data is limited, it is not possible to estimate a separate transformation matrix for each model distribution; instead, several distributions must share one transformation matrix, that is, regression-matrix tying, so that a good adaptation effect can ultimately be achieved with relatively little data, as shown in Fig. 2.
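Regression-matrix tying can be pictured as pooling distributions into regression classes and estimating one shared transform per class; the occupancy-threshold grouping below is an illustrative assumption, not the patent's specific tying scheme.

```python
def tie_regression_classes(states, min_occupancy=1000.0):
    """Group states into regression classes with enough adaptation data each.

    `states` is a list of (state_id, occupancy) pairs; states are greedily
    pooled until the accumulated occupancy reaches the threshold, and every
    state in a pool shares one CMLLR transform.  Purely illustrative.
    """
    classes, current, acc = [], [], 0.0
    for state_id, occ in sorted(states, key=lambda s: s[1]):
        current.append(state_id)
        acc += occ
        if acc >= min_occupancy:
            classes.append(current)
            current, acc = [], 0.0
    if current:                          # leftover states join the last class
        (classes[-1].extend(current) if classes else classes.append(current))
    return classes

# Example: states with little data end up sharing a transform.
print(tie_regression_classes([("s1", 300), ("s2", 250), ("s3", 900), ("s4", 1500)]))
```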
This embodiment corrects and updates the model using the maximum a posteriori (Maximum A Posteriori, MAP) algorithm. For a given HSMM parameter set, let the forward probability be α_t(i) and the backward probability be β_t(i); then, at state i, the probability of generating the continuous observation sequence o_{t−d+1} ... o_t is:
Maximum a posteriori estimation is described as follows:
In the formulas, the transformed mean vectors are those obtained after the linear regression transformation, ω denotes the MAP estimation parameter of the state output, and τ denotes the MAP estimation parameter of its duration distribution; the resulting quantities are the weighted-average MAP estimates of the adapted mean vectors.
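A common concrete form of such a MAP update is a weighted average between the transformed (prior) mean and the accumulated adaptation statistics, with ω acting as the prior weight; the sketch below uses that standard form as an assumption about the update rule, not as a transcription of the formulas referenced above.

```python
import numpy as np

def map_update_mean(prior_mean, frame_sum, occupancy, omega):
    """MAP estimate of a state-output mean (standard weighted-average form).

    prior_mean : transformed (CMLLR-adapted) mean vector, acting as the prior
    frame_sum  : sum of observation vectors assigned to this state
    occupancy  : total state occupancy (sum of posteriors)
    omega      : MAP weight of the prior
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    frame_sum = np.asarray(frame_sum, dtype=float)
    return (omega * prior_mean + frame_sum) / (omega + occupancy)

# Example: with little adaptation data the estimate stays close to the prior.
print(map_update_mean([1.0, 2.0], frame_sum=[5.5, 10.4], occupancy=5.0, omega=20.0))
```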
The emotional speech synthesis system based on the adaptive model described in this embodiment was built alongside a traditional speech synthesis system based on hidden Markov models. Experiments confirm that, compared with the traditional HMM-based speech synthesis system, adding the speaker-adaptive training process in the training stage to obtain an average voice model of multiple speakers' emotional speech reduces the influence of speaker differences in the speech corpus and improves the emotional similarity of the synthesized speech. On the basis of the average voice model, through the speaker-adaptive transformation algorithm, emotional speech with good naturalness, fluency, and emotional similarity can be synthesized using only a small amount of emotional corpus to be synthesized.
The above description is only a preferred embodiment of the application and an explanation of the technical principles applied. Those skilled in the art should appreciate that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features, and also covers, without departing from the inventive concept, other technical solutions formed by any combination of the above technical features or their equivalents, for example, technical solutions formed by mutually substituting the above features with (but not limited to) technical features with similar functions disclosed in this application.
Except for the technical features described in the specification, the remaining technical features are known to those skilled in the art; to highlight the innovative features of the invention, those remaining technical features are not described in detail here.

Claims (8)

1. A speaker voice adaptive training method, characterized by comprising:
providing training emotional speech data and target-speaker emotional speech data;
characterizing acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;
normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and those of the average voice model, to obtain an average voice model for the target speaker's emotional speech data;
under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
2. The speaker voice adaptive training method according to claim 1, characterized in that the acoustic parameters include at least fundamental frequency parameters, spectral parameters, and duration parameters.
3. The speaker voice adaptive training method according to claim 1, characterized in that, after the training emotional speech data and the target-speaker emotional speech data are provided, the method further comprises:
estimating the linear transformation between the two using the maximum-likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.
4. The speaker voice adaptive training method according to claim 1, characterized in that estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output and duration distributions simultaneously.
5. The speaker voice adaptive training method according to claim 1, characterized in that the linear regression equations comprise:
Wherein, formula (2.1) is the transformation equation of the state output distribution, which gives the mean vector of the state output of training speech data model s; W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and o_i is its mean observation vector. Formula (2.2) is the transformation equation of the state duration distribution, which gives the mean vector of the state duration of training speech data model s; X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, d_i is its mean duration, and ξ = [o^T, 1].
6. The speaker voice adaptive training method according to claim 3, characterized in that performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptation algorithm to perform a speaker-adaptive transformation on the average voice model.
7. The speaker voice adaptive training method according to claim 6, characterized in that the adaptive transformation comprises: using the means and covariance matrices of the speaker's state output and duration probability distributions to transform the fundamental frequency, spectral, and duration parameters in the mixed-language average voice model into the characteristic parameters of the speech to be synthesized.
8. The speaker voice adaptive training method according to claim 1, characterized in that the adaptive model is corrected and updated using a maximum a posteriori probability algorithm.
CN201810576452.2A 2018-06-06 2018-06-06 Adaptive training method for speaker voice Active CN109036370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810576452.2A CN109036370B (en) 2018-06-06 2018-06-06 Adaptive training method for speaker voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810576452.2A CN109036370B (en) 2018-06-06 2018-06-06 Adaptive training method for speaker voice

Publications (2)

Publication Number Publication Date
CN109036370A (en) 2018-12-18
CN109036370B CN109036370B (en) 2021-07-20

Family

ID=64612408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810576452.2A Active CN109036370B (en) 2018-06-06 2018-06-06 Adaptive training method for speaker voice

Country Status (1)

Country Link
CN (1) CN109036370B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150127350A1 (en) * 2013-11-01 2015-05-07 Google Inc. Method and System for Non-Parametric Voice Conversion
GB2524505A (en) * 2014-03-24 2015-09-30 Toshiba Res Europ Ltd Voice conversion
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
US20170213542A1 (en) * 2016-01-26 2017-07-27 James Spencer System and method for the generation of emotion in the output of a text to speech system
CN106531150A (en) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 Emotion synthesis method based on deep neural network model
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN107039033A (en) * 2017-04-17 2017-08-11 海南职业技术学院 A kind of speech synthetic device
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN107895582A (en) * 2017-10-16 2018-04-10 中国电子科技集团公司第二十八研究所 Towards the speaker adaptation speech-emotion recognition method in multi-source information field

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109949791A (en) * 2019-03-22 2019-06-28 平安科技(深圳)有限公司 Emotional speech synthesizing method, device and storage medium based on HMM
CN112837674A (en) * 2019-11-22 2021-05-25 阿里巴巴集团控股有限公司 Speech recognition method, device and related system and equipment
CN111627420A (en) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Specific-speaker emotion voice synthesis method and device under extremely low resources
WO2021212954A1 (en) * 2020-04-21 2021-10-28 升智信息科技(南京)有限公司 Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN111627420B (en) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for synthesizing emotion voice of specific speaker under extremely low resource

Also Published As

Publication number Publication date
CN109036370B (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN108831435B (en) Emotional voice synthesis method based on multi-emotion speaker self-adaption
JP6092293B2 (en) Text-to-speech system
Yamagishi et al. Robust speaker-adaptive HMM-based text-to-speech synthesis
Yamagishi Average-voice-based speech synthesis
KR102311922B1 (en) Apparatus and method for controlling outputting target information to voice using characteristic of user voice
Qian et al. Improved prosody generation by maximizing joint probability of state and longer units
Inoue et al. An investigation to transplant emotional expressions in DNN-based TTS synthesis
CN109036370A (en) A kind of speaker's voice adaptive training method
Yamagishi et al. Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
Toda Augmented speech production based on real-time statistical voice conversion
Shirota et al. Integration of speaker and pitch adaptive training for HMM-based singing voice synthesis
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Gao et al. Articulatory copy synthesis using long-short term memory networks
Lee et al. A comparative study of spectral transformation techniques for singing voice synthesis
Yamagishi et al. Adaptive training for hidden semi-Markov model [speech synthesis applications]
Ling et al. Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge
Liao et al. Speaker adaptation of SR-HPM for speaking rate-controlled Mandarin TTS
Chunwijitra et al. A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis
Zen Statistical parametric speech synthesis: from HMM to LSTM-RNN
Karhila et al. Creating synthetic voices for children by adapting adult average voice using stacked transformations and VTLN
Sung et al. Factored MLLR adaptation for singing voice generation
Coto-Jiménez et al. Speech Synthesis Based on Hidden Markov Models and Deep Learning.
Suzić et al. Style-code method for multi-style parametric text-to-speech synthesis
Kaya et al. Effectiveness of Speech Mode Adaptation for Improving Dialogue Speech Synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant