CN109036370A - Speaker voice adaptive training method - Google Patents
Speaker voice adaptive training method
- Publication number
- CN109036370A (Application CN201810576452.2A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- distribution
- model
- duration
- adaptive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a speaker voice adaptive training method, belonging to the field of speech synthesis, comprising: providing training emotional speech data and target-speaker emotional speech data; extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters; normalizing the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data; and performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model. When the adaptive model obtained by this speaker voice adaptive training method is used for speech synthesis, it reduces the influence of speaker differences in the speech corpus and improves the emotional similarity of the synthesized speech; with only a small amount of emotional corpus for the target, it can synthesize emotional speech with good naturalness, fluency, and emotional similarity.
Description
Technical field
The invention belongs to the field of speech synthesis technology, and specifically concerns a speaker voice adaptive training method.
Background technique
In recent years, speech synthesis technology has developed continuously: from the earliest physical-mechanism and source-filter speech synthesis methods, to today's mature waveform-concatenation and statistical parametric methods, and on to the deep-learning-based methods now under active study, the quality of synthesized speech has improved markedly. However, with traditional speech synthesis methods, researchers have only achieved a simple conversion of written text into spoken output, ignoring the speaker's emotional information carried during verbal expression. Improving the expressiveness of synthesized speech will therefore become an important topic of emotional speech synthesis research and an inevitable trend in the field of speech signal processing.
Summary of the invention
The purpose of the present invention is to provide a speaker voice adaptive training method that yields an adaptive model for speech synthesis and improves the emotional similarity of synthesized speech.
The technical scheme adopted by the invention is as follows:

A speaker voice adaptive training method is provided, comprising:

providing training emotional speech data and target-speaker emotional speech data;

extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;

normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data;

under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
Further, the acoustic parameters include at least fundamental frequency parameters, spectrum parameters, and duration parameters.

Further, after the training emotional speech data and the target speaker's emotional speech data are given, the method further includes: estimating the linear transformation between the two using the maximum likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.

Further, estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output distributions and the duration distributions simultaneously.
Further, the linear regression equations include:

$$\hat{\mu}_s = W\xi_i = A o_i + b \tag{2.1}$$

$$\hat{m}_s = X\psi_i = \alpha d_i + \beta \tag{2.2}$$

where formula (2.1) is the transformation equation of the state output distribution: $\hat{\mu}_s$ denotes the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and $o_i$ is the corresponding average observation vector; formula (2.2) is the transformation equation of the state duration distribution: $\hat{m}_s$ denotes the mean of the state duration of training speech data model s, X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, and $d_i$ is the corresponding average duration, with $\xi_i = [o_i^T, 1]^T$ and $\psi_i = [d_i, 1]^T$.
Further, performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptive algorithm to perform the speaker-adaptive transformation on the average voice model.

Further, the adaptive transformation comprises: using the means and covariance matrices of the speaker's state output and duration probability distributions, transforming the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized.
Further, the adaptive model is modified and updated using a maximum a posteriori probability algorithm.
Compared with the prior art, the invention has the following benefits: the exemplary speaker voice adaptive training method of the invention yields an adaptive model for adaptive training during speech synthesis, which reduces the influence of speaker differences in the speech corpus and improves the emotional similarity of the synthesized speech; on the basis of the average voice model, through the speaker-adaptive transformation method, only a small amount of emotional corpus for the target is needed to synthesize emotional speech with good naturalness, fluency, and emotional similarity.
Detailed description of the invention
Other features, objects, and advantages of the application will become more apparent upon reading the following detailed description of non-restrictive embodiments with reference to the attached drawings:
Fig. 1 is the work flow diagram of the embodiment of the present invention;
Fig. 2 is speaker adaptation of embodiment of the present invention algorithm flow chart.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described here serve only to explain the related invention, not to restrict it. It should also be noted that, for convenience of description, only the parts relevant to the invention are illustrated in the drawings.

It should be noted that, in the absence of conflict, the embodiments of the application and the features in the embodiments may be combined with each other. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Figure 1, an embodiment of the present invention provides a speaker voice adaptive training method, comprising:

S1: providing training emotional speech data and target-speaker emotional speech data;

S2: extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;

S3: normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data;

S4: under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
In S1, after the training emotional speech data and the target speaker's emotional speech data are given, the method further includes: estimating the linear transformation between the two using the maximum likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.

In S2, the acoustic parameters include at least fundamental frequency parameters, spectrum parameters, and duration parameters. Estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output distributions and the duration distributions simultaneously.
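The simultaneous modeling of state output and duration in S2 can be sketched as follows. This is an illustrative single-state, one-dimensional Gaussian sketch (the class and function names are our own, not from the patent; a real system estimates these distributions for every context-dependent HSMM state):

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

class HSMMState:
    """One HSMM state: a Gaussian output distribution plus an explicit
    Gaussian duration distribution, estimated jointly from aligned segments."""

    def fit(self, segments):
        # segments: list of observation sequences aligned to this state
        obs = [o for seg in segments for o in seg]
        durs = [len(seg) for seg in segments]
        self.out_mean = sum(obs) / len(obs)
        self.out_var = sum((o - self.out_mean) ** 2 for o in obs) / len(obs) or 1e-6
        self.dur_mean = sum(durs) / len(durs)
        self.dur_var = sum((d - self.dur_mean) ** 2 for d in durs) / len(durs) or 1e-6
        return self

    def likelihood(self, seg):
        """p(d) * prod_t b(o_t): duration and output are scored together,
        which is what simultaneous modeling buys over a plain HMM."""
        p = gaussian_pdf(len(seg), self.dur_mean, self.dur_var)
        for o in seg:
            p *= gaussian_pdf(o, self.out_mean, self.out_var)
        return p
```

The key point of the HSMM is visible in `likelihood`: the segment duration has its own explicit density rather than being implied by a state self-transition probability.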
In S4, performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptive algorithm to perform the speaker-adaptive transformation on the average voice model. The adaptive transformation comprises: using the means and covariance matrices of the speaker's state output and duration probability distributions, transforming the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized.

The adaptive model obtained in this embodiment is modified and updated using a maximum a posteriori probability algorithm.
Systematically, the constrained maximum likelihood linear regression algorithm is first applied to the multi-speaker emotional speech data models for speaker-adaptive training, yielding an average voice model of the multi-speaker emotional speech data. Then, under the guidance of the target speaker's emotional speech data, the same constrained maximum likelihood linear regression algorithm is used to apply a speaker-adaptive transformation to the average voice model, yielding a speaker-dependent adaptive model; finally, the adaptive model is modified and updated using maximum a posteriori probability.
To improve the quality of synthesized emotional speech, this embodiment trains an average voice model from the speech data of multiple emotional speakers. Because the speakers differ in gender, personality, emotional expression, and so on, their acoustic models deviate considerably from one another. To avoid the influence of speaker variation on the trained model, this embodiment adopts the method of speaker adaptive training (Speaker Adaptive Training, SAT), which normalizes speaker differences, thereby improving the accuracy of the model and, in turn, the quality of synthesized emotional speech. Considering that unvoiced and silent segments of Chinese speech have no fundamental frequency, multi-space probability distribution HMMs (Multi-space probability distribution, MSD-HMM) are used here to model the fundamental frequency. Based on context-dependent MSD-HSMM speech synthesis units, this embodiment applies the constrained maximum likelihood linear regression (constrained maximum likelihood linear regression, CMLLR) algorithm to the multi-speaker emotional corpus for speaker-adaptive training, yielding an average voice model of the multi-speaker emotional speech.
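The multi-space probability distribution used for fundamental frequency can be illustrated with a minimal two-space sketch: one discrete space for unvoiced frames and one Gaussian space for voiced log-F0. The weights and parameter values below are made-up illustrations, not values from the patent:

```python
import math

class MSDStream:
    """Multi-space distribution for F0: one discrete space for unvoiced
    frames (weight w_uv) and one Gaussian space for voiced log-F0."""

    def __init__(self, w_unvoiced, mean, var):
        self.w_uv = w_unvoiced        # P(frame is unvoiced)
        self.w_v = 1.0 - w_unvoiced   # P(frame is voiced)
        self.mean, self.var = mean, var

    def pdf(self, f0):
        """f0 is None for an unvoiced frame, else an observed log-F0 value."""
        if f0 is None:
            return self.w_uv
        g = math.exp(-(f0 - self.mean) ** 2 / (2 * self.var))
        return self.w_v * g / math.sqrt(2 * math.pi * self.var)
```

The design point is that a single stream can score both kinds of frame: unvoiced frames carry probability mass in a zero-dimensional space, so no artificial F0 value has to be invented for them.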
Figure 2 illustrates the speaker adaptation algorithm flow of this embodiment. First, training emotional speech data and target-speaker emotional speech data are given; to reflect the difference between the two models, this embodiment estimates the linear transformation between the two model data sets using the maximum likelihood criterion and adjusts the covariance matrices of the model distributions accordingly. During adaptive training, acoustic parameters such as fundamental frequency, spectrum, and duration must be extracted, and their state output distributions and duration distributions estimated and modeled; however, the initial hidden Markov model does not describe the duration distribution accurately, so this embodiment uses a hidden semi-Markov model (hidden semi-Markov model, HSMM), with accurate duration distributions, to model the state output and duration distributions simultaneously. This embodiment uses a group of linear regression equations, shown as formulas (2.1) and (2.2), to normalize the speaker speech model differences:

$$\hat{\mu}_s = W\xi_i = A o_i + b \tag{2.1}$$

$$\hat{m}_s = X\psi_i = \alpha d_i + \beta \tag{2.2}$$

where formula (2.1) is the transformation equation of the state output distribution: $\hat{\mu}_s$ denotes the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and $o_i$ is the corresponding average observation vector; formula (2.2) is the transformation equation of the state duration distribution: $\hat{m}_s$ denotes the mean of the state duration of training speech data model s, X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, and $d_i$ is the corresponding average duration, with $\xi_i = [o_i^T, 1]^T$ and $\psi_i = [d_i, 1]^T$.
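The normalization transforms of formulas (2.1) and (2.2) are plain affine maps; a minimal sketch in small dimensions (helper names are illustrative):

```python
def output_mean_transform(A, b, o):
    """Formula (2.1): mu_s = W xi = A o + b, mapping the average observed
    vector o to the state-output mean of training-speaker model s
    (pure-Python matrix-vector product plus bias)."""
    return [sum(A[i][j] * o[j] for j in range(len(o))) + b[i]
            for i in range(len(A))]

def duration_mean_transform(alpha, beta, d):
    """Formula (2.2): m_s = X psi = alpha * d + beta, the scalar affine
    map applied to the average state duration d."""
    return alpha * d + beta
```

In SAT these transforms are estimated per training speaker, so that the residual model after normalization is the speaker-independent average voice model.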
Then, after the speaker-adaptive training has been performed, a small number of emotional sentences of the target speaker to be synthesized can be used, with the CMLLR adaptive algorithm, to apply a speaker-adaptive transformation to the average voice model, obtaining a speaker-adaptive model that represents the target speaker. In the speaker-adaptive transformation, the means and covariance matrices of the speaker's state output and duration probability distributions are mainly used to transform the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized. Formula (2.3) shows the transformation equation of the feature vector o in state i, and formula (2.4) the transformation equation of the state duration d in state i:

$$b_i(o) = N(o;\, A\mu_i - b,\, A\Sigma_i A^T) = |A^{-1}|\, N(W\xi;\, \mu_i, \Sigma_i) \tag{2.3}$$

$$p_i(d) = N(d;\, \alpha m_i - \beta,\, \alpha^2\sigma_i^2) = |\alpha^{-1}|\, N(X\psi;\, m_i, \sigma_i^2) \tag{2.4}$$

where $\xi = [o^T, 1]^T$, $\psi = [d, 1]^T$, $\mu_i$ is the mean of the state output distribution, $m_i$ is the mean of the duration distribution, $\Sigma_i$ is the diagonal covariance matrix, and $\sigma_i^2$ is the variance. $W = [A^{-1}, b^{-1}]$ is the linear transformation matrix of the target speaker's state output probability density distribution, and $X = [\alpha^{-1}, \beta^{-1}]$ is the transformation matrix of the state duration probability density distribution.
Through the HSMM-based adaptive transformation algorithm, the speech acoustic characteristic parameters can be normalized and processed. For adaptation data O of length T, a maximum likelihood estimate of the transformation Λ = (W, X) can be obtained:

$$\hat{\Lambda} = (\hat{W}, \hat{X}) = \arg\max_{\Lambda} P(O \mid \lambda, \Lambda) \tag{2.5}$$

where λ is the parameter set of the HSMM.
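Formula (2.3) says the CMLLR-transformed likelihood can be computed equivalently in feature space or in model space. The one-dimensional sketch below checks that identity numerically (function names are our own; the sign convention follows the simple form |a|·N(a·o + c; μ, σ²)):

```python
import math

def gauss(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def cmllr_feature_space(o, a, c, mean, var):
    """Constrained transform applied to the observation:
    |a| * N(a*o + c; mean, var)."""
    return abs(a) * gauss(a * o + c, mean, var)

def cmllr_model_space(o, a, c, mean, var):
    """Same density written as a transform of the model parameters:
    N(o; (mean - c)/a, var/a**2)."""
    return gauss(o, (mean - c) / a, var / a ** 2)
```

Because the same matrix acts on both mean and covariance ("constrained"), the transform can be pushed onto the features once per frame instead of onto every Gaussian, which is what makes CMLLR cheap with many distributions.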
When the amount of target-speaker data is limited, a separate transformation matrix cannot be estimated for each model distribution; multiple distributions must therefore share one transformation matrix, that is, the regression matrices are bound, so that a good adaptation effect can ultimately be achieved with relatively little data, as shown in Figure 2.
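The binding of regression matrices described above can be sketched as a simple occupancy-threshold rule. The threshold, names, and greedy scheme here are illustrative assumptions; production systems typically bind transforms via a regression class tree:

```python
def bind_regression_matrices(state_occupancy, min_occupancy):
    """Give a distribution its own CMLLR transform only if enough
    adaptation frames fall in it; otherwise bind it to a shared
    'global' transform class."""
    return {state: (state if occ >= min_occupancy else "global")
            for state, occ in state_occupancy.items()}
```

With very little target data, almost every distribution falls back to the global class, which is exactly the regime the patent describes: few transforms, robustly estimated.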
This embodiment modifies and updates the model using the maximum a posteriori probability (Maximum A Posteriori, MAP) algorithm. For a given HSMM parameter set, let the forward probability be $\alpha_t(i)$ and the backward probability $\beta_t(i)$; the generation probability $\chi_t^d(i)$ of the continuous observation sequence $o_{t-d+1} \ldots o_t$ in state i is:

$$\chi_t^d(i) = \frac{1}{P(O \mid \lambda)}\, \alpha_{t-d}(i)\, p_i(d) \prod_{s=t-d+1}^{t} b_i(o_s)\, \beta_t(i) \tag{2.6}$$

The maximum a posteriori estimation is described as follows:

$$\hat{\mu} = \frac{\omega \bar{\mu} + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i) \sum_{s=t-d+1}^{t} o_s}{\omega + \sum_{t=1}^{T}\sum_{d=1}^{t} d\, \chi_t^d(i)} \tag{2.7}$$

$$\hat{m} = \frac{\tau \bar{m} + \sum_{t=1}^{T}\sum_{d=1}^{t} d\, \chi_t^d(i)}{\tau + \sum_{t=1}^{T}\sum_{d=1}^{t} \chi_t^d(i)} \tag{2.8}$$

In these formulas, $\bar{\mu}$ and $\bar{m}$ represent the mean vectors after the linear regression transformation, ω represents the MAP estimation parameter of the state output, and τ represents the MAP estimation parameter of the duration distribution; $\hat{\mu}$ and $\hat{m}$ are the weighted-average MAP estimates of the adapted mean vectors $\bar{\mu}$ and $\bar{m}$.
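The MAP correction of an adapted mean is, at heart, a weighted interpolation between the regression-transformed prior mean and the target speaker's data statistics. A one-dimensional sketch (names are illustrative; `weight` plays the role of ω for state output and τ for duration):

```python
def map_update_mean(prior_mean, data_sum, data_count, weight):
    """MAP re-estimate of a mean: interpolate the linear-regression prior
    mean with the target speaker's sufficient statistics (sum and count
    of occupied observations)."""
    return (weight * prior_mean + data_sum) / (weight + data_count)
```

With no adaptation data the prior is returned unchanged; as `data_count` grows, the estimate approaches the data mean, which is the "modify and update" behavior of the MAP step.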
The emotional speech synthesis system based on the adaptive model described in this embodiment was built alongside a traditional speech synthesis system based on hidden Markov models. Experiments confirm that, compared with the traditional HMM-based speech synthesis system, adding the speaker-adaptive training process at the training stage to obtain an average voice model of multiple speakers' emotional speech reduces the influence of speaker differences in the corpus and improves the emotional similarity of the synthesized speech; on the basis of the average voice model, through the speaker-adaptive transformation algorithm, only a small amount of emotional corpus for the target is needed to synthesize emotional speech with good naturalness, fluency, and emotional similarity.
The above description is only a preferred embodiment of the application and an explanation of the technical principles applied. Those skilled in the art should appreciate that the scope of the invention involved in the application is not limited to technical schemes formed by the specific combination of the above technical features; without departing from the inventive concept, it also covers other technical schemes formed by any combination of the above technical features or their equivalents, for example schemes in which the above features are replaced by (but not limited to) technical features with similar functions disclosed herein.

Apart from the technical features described in the specification, the remaining technical features are known to those skilled in the art; to highlight the innovative features of the invention, those remaining technical features are not described in detail here.
Claims (8)
1. A speaker voice adaptive training method, characterized by comprising:

providing training emotional speech data and target-speaker emotional speech data;

extracting acoustic parameters, and estimating and modeling the state output distributions and duration distributions of the acoustic parameters;

normalizing, with linear regression equations, the differences between the state output distributions of the training speech data models and the state output distribution of the average voice model, to obtain an average voice model of the target speaker's emotional speech data;

under the guidance of the target speaker's emotional speech data, performing a speaker-adaptive transformation on the average voice model to obtain a speaker-dependent adaptive model.
2. The speaker voice adaptive training method according to claim 1, characterized in that the acoustic parameters include at least fundamental frequency parameters, spectrum parameters, and duration parameters.

3. The speaker voice adaptive training method according to claim 1, characterized in that, after the training emotional speech data and the target speaker's emotional speech data are given, the method further includes: estimating the linear transformation between the two using the maximum likelihood criterion, and adjusting the covariance matrices of the model distributions accordingly.

4. The speaker voice adaptive training method according to claim 1, characterized in that estimating and modeling the state output distributions and duration distributions of the acoustic parameters comprises: using a hidden semi-Markov model to model the state output distributions and the duration distributions simultaneously.
5. The speaker voice adaptive training method according to claim 1, characterized in that the linear regression equations include:

$$\hat{\mu}_s = W\xi_i = A o_i + b \tag{2.1}$$

$$\hat{m}_s = X\psi_i = \alpha d_i + \beta \tag{2.2}$$

where formula (2.1) is the transformation equation of the state output distribution: $\hat{\mu}_s$ denotes the mean vector of the state output of training speech data model s, W = [A, b] is the transformation matrix of the difference between the state output distribution of training speech data model s and that of the average voice model, and $o_i$ is the corresponding average observation vector; formula (2.2) is the transformation equation of the state duration distribution: $\hat{m}_s$ denotes the mean of the state duration of training speech data model s, X = [α, β] is the transformation matrix of the difference between the state duration distribution of training speech data model s and that of the average voice model, and $d_i$ is the corresponding average duration, with $\xi_i = [o_i^T, 1]^T$ and $\psi_i = [d_i, 1]^T$.
6. The speaker voice adaptive training method according to claim 3, characterized in that performing the speaker-adaptive transformation on the average voice model comprises: using emotional sentences of the target speaker to be synthesized, applying the CMLLR adaptive algorithm to perform the speaker-adaptive transformation on the average voice model.

7. The speaker voice adaptive training method according to claim 6, characterized in that the adaptive transformation comprises: using the means and covariance matrices of the speaker's state output and duration probability distributions, transforming the fundamental frequency, spectrum, and duration parameters of the average voice model into the characteristic parameters of the speech to be synthesized.

8. The speaker voice adaptive training method according to claim 1, characterized in that the adaptive model is modified and updated using a maximum a posteriori probability algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576452.2A CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576452.2A CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109036370A true CN109036370A (en) | 2018-12-18 |
CN109036370B CN109036370B (en) | 2021-07-20 |
Family
ID=64612408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810576452.2A Active CN109036370B (en) | 2018-06-06 | 2018-06-06 | Adaptive training method for speaker voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036370B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109949791A (en) * | 2019-03-22 | 2019-06-28 | 平安科技(深圳)有限公司 | Emotional speech synthesizing method, device and storage medium based on HMM |
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
CN112837674A (en) * | 2019-11-22 | 2021-05-25 | 阿里巴巴集团控股有限公司 | Speech recognition method, device and related system and equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217713A (en) * | 2014-07-15 | 2014-12-17 | 西北师范大学 | Tibetan-Chinese speech synthesis method and device |
US20150127350A1 (en) * | 2013-11-01 | 2015-05-07 | Google Inc. | Method and System for Non-Parametric Voice Conversion |
GB2524505A (en) * | 2014-03-24 | 2015-09-30 | Toshiba Res Europ Ltd | Voice conversion |
CN106531150A (en) * | 2016-12-23 | 2017-03-22 | 上海语知义信息技术有限公司 | Emotion synthesis method based on deep neural network model |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
US20170213542A1 (en) * | 2016-01-26 | 2017-07-27 | James Spencer | System and method for the generation of emotion in the output of a text to speech system |
CN107039033A (en) * | 2017-04-17 | 2017-08-11 | 海南职业技术学院 | A kind of speech synthetic device |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
CN107895582A (en) * | 2017-10-16 | 2018-04-10 | 中国电子科技集团公司第二十八研究所 | Towards the speaker adaptation speech-emotion recognition method in multi-source information field |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150127350A1 (en) * | 2013-11-01 | 2015-05-07 | Google Inc. | Method and System for Non-Parametric Voice Conversion |
GB2524505A (en) * | 2014-03-24 | 2015-09-30 | Toshiba Res Europ Ltd | Voice conversion |
CN104217713A (en) * | 2014-07-15 | 2014-12-17 | 西北师范大学 | Tibetan-Chinese speech synthesis method and device |
US20170213542A1 (en) * | 2016-01-26 | 2017-07-27 | James Spencer | System and method for the generation of emotion in the output of a text to speech system |
CN106531150A (en) * | 2016-12-23 | 2017-03-22 | 上海语知义信息技术有限公司 | Emotion synthesis method based on deep neural network model |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
CN107039033A (en) * | 2017-04-17 | 2017-08-11 | 海南职业技术学院 | A kind of speech synthetic device |
CN107103900A (en) * | 2017-06-06 | 2017-08-29 | 西北师范大学 | A kind of across language emotional speech synthesizing method and system |
CN107895582A (en) * | 2017-10-16 | 2018-04-10 | 中国电子科技集团公司第二十八研究所 | Towards the speaker adaptation speech-emotion recognition method in multi-source information field |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109949791A (en) * | 2019-03-22 | 2019-06-28 | 平安科技(深圳)有限公司 | Emotional speech synthesizing method, device and storage medium based on HMM |
CN112837674A (en) * | 2019-11-22 | 2021-05-25 | 阿里巴巴集团控股有限公司 | Speech recognition method, device and related system and equipment |
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources |
WO2021212954A1 (en) * | 2020-04-21 | 2021-10-28 | 升智信息科技(南京)有限公司 | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources |
CN111627420B (en) * | 2020-04-21 | 2023-12-08 | 升智信息科技(南京)有限公司 | Method and device for synthesizing emotion voice of specific speaker under extremely low resource |
Also Published As
Publication number | Publication date |
---|---|
CN109036370B (en) | 2021-07-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831435B (en) | Emotional voice synthesis method based on multi-emotion speaker self-adaption | |
JP6092293B2 (en) | Text-to-speech system | |
Yamagishi et al. | Robust speaker-adaptive HMM-based text-to-speech synthesis | |
Yamagishi | Average-voice-based speech synthesis | |
KR102311922B1 (en) | Apparatus and method for controlling outputting target information to voice using characteristic of user voice | |
Qian et al. | Improved prosody generation by maximizing joint probability of state and longer units | |
Inoue et al. | An investigation to transplant emotional expressions in DNN-based TTS synthesis | |
CN109036370A (en) | A kind of speaker's voice adaptive training method | |
Yamagishi et al. | Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis | |
Yin et al. | Modeling F0 trajectories in hierarchically structured deep neural networks | |
Toda | Augmented speech production based on real-time statistical voice conversion | |
Shirota et al. | Integration of speaker and pitch adaptive training for HMM-based singing voice synthesis | |
Chen et al. | Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features | |
Gao et al. | Articulatory copy synthesis using long-short term memory networks | |
Lee et al. | A comparative study of spectral transformation techniques for singing voice synthesis | |
Yamagishi et al. | Adaptive training for hidden semi-Markov model [speech synthesis applications] | |
Ling et al. | Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge | |
Liao et al. | Speaker adaptation of SR-HPM for speaking rate-controlled Mandarin TTS | |
Chunwijitra et al. | A tone-modeling technique using a quantized F0 context to improve tone correctness in average-voice-based speech synthesis | |
Zen | Statistical parametric speech synthesis: from HMM to LSTM-RNN | |
Karhila et al. | Creating synthetic voices for children by adapting adult average voice using stacked transformations and VTLN | |
Sung et al. | Factored MLLR adaptation for singing voice generation | |
Coto-Jiménez et al. | Speech Synthesis Based on Hidden Markov Models and Deep Learning. | |
Suzić et al. | Style-code method for multi-style parametric text-to-speech synthesis | |
Kaya et al. | Effectiveness of Speech Mode Adaptation for Improving Dialogue Speech Synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||