GB2537908A - Speech synthesis using linear dynamical modelling - Google Patents


Info

Publication number
GB2537908A
GB2537908A (application GB1507422.2A, also referenced as GB201507422A)
Authority
GB
United Kingdom
Prior art keywords
speech
linear dynamical
vectors
models
hidden
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1507422.2A
Other versions
GB2537908B (en)
GB201507422D0 (en)
Inventor
Tsiaras Vassilis
Stylianou Ioannis
Maia Ranniery
Digalakis Vassilis
Diakoloukas Vassilis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd
Priority to GB1507422.2A
Publication of GB201507422D0
Publication of GB2537908A
Application granted
Publication of GB2537908B
Legal status: Expired - Fee Related
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Abstract

A text-to-speech (TTS) system is trained according to a constrained higher-order (e.g. second order) parametric linear dynamical model (LDM) whereby text is converted to a sequence of linguistic units (e.g. phonemes, sub-phonemes), each state of which is looked up in an acoustic model table to produce a sequence of speech vectors which is output as speech. A predefined number T of hidden vectors x_t evolve according to a state equation involving an observation matrix H, a state transformation matrix F, covariance matrices Q and R, and mean vectors μ. The LDMs may be constrained to be critically damped towards a target q.

Description

SPEECH SYNTHESIS USING LINEAR DYNAMICAL MODELLING
FIELD
Embodiments described herein relate generally to a system and method of speech processing and a system and method of training a model for a text-to-speech system.
BACKGROUND
Text to speech systems are systems where audio speech or audio speech files are outputted in response to reception of a text file.
Text to speech systems are used in a wide variety of applications such as electronic games, E-book readers, E-mail readers, satellite navigation, automated telephone systems and automated warning systems.
There is a continuing need to make efficient systems which sound more like a human voice.
BRIEF DESCRIPTION OF THE FIGURES
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures, in which: Figure 1 shows a text to speech system; Figure 2 shows a text-to-speech method; Figure 3 shows how a phoneme relates to linear dynamical models; Figure 4 shows a representation of a hidden Markov chain representing an LDM of an embodiment; Figure 5 shows how speech is synthesised using a linear dynamical model of an embodiment; Figure 6 shows a method of training an acoustic model; Figure 7 shows a method of clustering similar linguistic units together; Figure 8 shows the mel-cepstral distance as a function of hidden space dimension; and Figure 9 shows the state space trajectory for a first order LDM and a second order critically damped LDM.
DETAILED DESCRIPTION
According to a first embodiment there is provided a method of speech processing, the method comprising receiving one or more linguistic units, converting said one or more linguistic units into a sequence of speech vectors for synthesising speech, said conversion using one or more corresponding constrained higher order parametric linear dynamical models, and outputting said sequence of speech vectors.
By utilising higher order linear dynamical models the resultant speech vectors have a reduced number of artefacts relative to first order linear dynamical models. This is due to the reduction in the number of discontinuities in the state trajectories of the linear dynamical models. By utilising linear dynamical models, the footprint of the model is reduced, as LDMs capture a greater range of dynamics than Hidden Markov Model (HMM) systems, which require the modelling of not only the speech parameters but also the second and third derivatives of the speech parameters. In contrast, LDMs require only a single observation equation to obtain similar results. In addition, each phoneme may be segmented into a smaller number of states as the linear dynamical models are more effective at modelling dynamics over time. This ensures a reduction in the number of models per phoneme and therefore a reduction in the number of parameters required to model speech. Constraining the linear dynamical models ensures that the system is stable. In one embodiment, hidden states are used to determine the speech vectors and the evolution of the hidden states over time is modelled using the one or more higher order linear dynamical models.
A linguistic unit may be a phoneme or a grapheme, or may be a segment of a phoneme or a grapheme, such as a sub-phoneme or a sub-grapheme. The speech processing may be a text to speech method which comprises receiving text and determining a sequence of linguistic units from the text. In one embodiment, the one or more constrained higher order linear dynamical models (LDMs) are second order linear dynamical models. Second order LDMs better represent the movement of the articulators which produce speech, as the movement of articulators also follows second order equations. Accordingly, second order LDMs provide a more accurate model of speech synthesis.
In one embodiment, hidden states are used to determine the speech vectors and the evolution of the hidden states is modelled using the second order critically damped LDMs. In one embodiment, the one or more linear dynamical models describe critically damped task dynamic gestures towards targets. As the LDMs are constrained to be critically damped, they are stable and evolve over a long period of time towards target values. Without constraining the LDMs there would be divergence away from the target values.
In one embodiment, the conversion comprises, for each of the one or more linguistic units: selecting an associated linear dynamical model; determining a predefined number T of hidden vectors x_t according to a state evolution equation wherein the hidden vectors x_t for frame t are:

x_1 ~ N(μ_1, Q_1)
x_t = F x_{t−1} + q + w;  w ~ N(0, Q)

and determining a sequence of speech vectors y_t based on the hidden vectors x_t according to the observation equation:

y_t = H x_t + μ_y + v;  v ~ N(0, R)

wherein each hidden vector x_t is a vector representing hidden parameters z_t:

x_t = [z_t; z_{t−1}]

H is an observation matrix, Q_1, Q and R are covariance matrices, μ_1 and μ_y are mean vectors, F is a state transformation matrix, q is a target vector for the hidden states, and T, F, R, q, μ_1 and μ_y are defined by the respective linear dynamical model. By utilising x_t to represent the hidden parameters z_t and z_{t−1}, the second order linear dynamical model can be represented as a first order system.
Hidden vectors x_t are inherently hidden parameters in their own right. The hidden vectors x_t allow the second order dynamics of the hidden parameters z_t to be represented in a first order format.
In one embodiment the state transformation matrix F obeys:

F = [ 2S  −S² ]
    [  I    0 ]

where S is a matrix which determines the rate of the critically damped dynamics of the hidden vectors towards the target vector q. This specific form of F effectively models speech dynamics whilst also being solvable exactly during training. This means that the system may be trained more efficiently and more effectively than other systems. Previous systems have required numerical methods to train the system; this is less efficient as it involves iterative calculations and ultimately results in approximations. In contrast, the above formulation of F allows an exact solution to be found, resulting in a much faster training process.
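As an illustration of this structure, a minimal NumPy sketch (not from the patent; the stiffness values s_k are illustrative) builds the block companion form of F and confirms that each s_k appears as a repeated, stable eigenvalue, as expected for critical damping:

```python
import numpy as np

def build_F(s):
    """Build the 2n x 2n block companion matrix F = [[2S, -S^2], [I, 0]]
    from the diagonal entries s of the stiffness matrix S."""
    S = np.diag(s)
    n = len(s)
    top = np.hstack([2.0 * S, -S @ S])
    bottom = np.hstack([np.eye(n), np.zeros((n, n))])
    return np.vstack([top, bottom])

s = np.array([0.9, 0.7, 0.5])   # illustrative stiffness values, |s_k| < 1
F = build_F(s)

# Each s_k appears as a repeated eigenvalue of F, the double pole of a
# critically damped second order system, and the spectral radius is
# below one, so the model is stable.
eigs = np.sort(np.linalg.eigvals(F).real)
print(eigs)                          # each s_k appears twice
print(np.max(np.abs(eigs)) < 1.0)    # True
```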
In one embodiment the one or more linear dynamical models comprise a plurality of linear dynamical models and the observation equations have parameters which are globally tied across all linear dynamical models. This reduces the footprint of the model by requiring fewer parameters to be stored. In one embodiment the observation matrix H and/or the covariance matrix R are the same for all linear dynamical models. In one embodiment Q and/or Q_1 are set to be equal to the identity matrix. By globally tying parameters across the linear dynamical models, the footprint of the system is greatly reduced without reducing the quality of the synthesised speech.

According to a further embodiment there is provided a method of training a model for a text-to-speech system, wherein said model is for converting a sequence of linguistic units into a sequence of speech vectors for synthesising speech. The method comprises receiving speech data comprising training speech vectors and associated linguistic units, modelling speech for each linguistic unit using one or more constrained higher order parametric linear dynamical models, and training the linear dynamical models, said training comprising estimating parameters of the linear dynamical models to fit the models to the associated speech data.
As mentioned above, utilising constrained higher order parametric linear dynamical models provides a model which produces speech vectors with a reduced number of artefacts.
Estimating parameters may comprise finding locally optimum values for the parameters. This may involve finding locally maximum likelihood estimates of the parameters.
In one embodiment the constrained higher order linear dynamical models are second order linear dynamical models. In a further embodiment the one or more linear dynamical models describe critically damped task dynamic gestures towards targets.
In one embodiment each linear dynamical model comprises a state evolution equation which describes a number T of hidden vectors x_t for frame t according to:

x_1 ~ N(μ_1, Q_1)
x_t = F x_{t−1} + q + w;  w ~ N(0, Q)

and an observation equation which describes speech parameters y_t for frame t according to:

y_t = H x_t + μ_y + v;  v ~ N(0, R).

H is an observation matrix, Q_1, Q and R are covariance matrices, and μ_1 and μ_y are mean vectors. Each hidden vector x_t is a vector representing hidden parameters z_t:

x_t = [z_t; z_{t−1}].

F is a state transformation matrix according to:

F = [ 2S  −S² ]
    [  I    0 ]

q is a target vector for the hidden states and S is a matrix, and wherein T, S, R, q, μ_1 and μ_y are defined by the respective linear dynamical model.
In one embodiment S determines the rate of critically damped dynamics of the hidden vectors towards the target vector q.
In one embodiment fitting each of the models to the associated speech data comprises an expectation maximisation method comprising: a) an expectation step comprising obtaining sufficient statistics for the linear dynamical model via a Kalman filter followed by a Kalman smoother; b) a maximisation step comprising using the sufficient statistics to determine estimates for parameters S, H, R, q, μ_1 and μ_y and updating the linear dynamical model with these estimates; and c) repeating steps a) and b) until a local maximum is reached.
In one embodiment, the expectation step obtains the sufficient statistics x̂_{t|T} and P̂_{t|T} for t = 1, …, T and P̂_{t,t−1|T} for t = 2, …, T, where x̂_{t|T} is the Kalman smoother estimate of the expected value of the hidden vector x_t, P̂_{t|T} is the Kalman smoother estimate of the expected value of the moment x_t x_t^T, and P̂_{t,t−1|T} is the Kalman smoother estimate of the expected value of the moment x_t x_{t−1}^T. The maximisation step comprises using the sufficient statistics to: solve a third order equation for s_k, for k = 1 : n, where n is the dimension of z_t, under the constraint that s_k has an absolute value of less than one, where s_k is the kth diagonal component of matrix S; determine q(k) from s_k and the sufficient statistics; and determine the remaining observation parameters, where m is the dimension of the speech vectors y_t.
As the linear dynamical models are parametric, they may be solved exactly and therefore result in a more efficient method of training. In one embodiment, the maximisation step further comprises determining μ_1 and Q_1 from the sufficient statistics, or Q_1 is set to the identity matrix across the whole set of linear dynamical models. The above methods may be implemented on a system or device.
In a further embodiment there is provided a system for speech processing, the system comprising an input configured to receive one or more linguistic units and a processor. The processor is configured to convert said one or more linguistic units into a sequence of speech vectors for synthesising speech, said conversion using one or more corresponding constrained higher order parametric linear dynamical models, and output said sequence of speech vectors.
In an additional embodiment there is provided a system for training a model for a text-to-speech system, wherein said model is for converting a sequence of linguistic units into a sequence of speech vectors for synthesising speech. The system comprises an input configured to receive speech data comprising training speech vectors and associated linguistic units, and a processor.
The processor is configured to model speech for each linguistic unit using one or more constrained higher order parametric linear dynamical models; and train the linear dynamical models, said training comprising estimating parameters of the linear dynamical models to fit the models to the associated speech data.
In one embodiment there is provided a carrier medium comprising computer readable code configured to cause a computer to perform one or more of the above methods.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
There has been great interest in statistical parametric speech synthesis during recent years, particularly with approaches based on standard hidden Markov models (HMMs) and their variations. Natural sounding speech has been synthesized by using HMMs and the quality of the best HMM-based synthesis systems is close to the quality of the best unit selection synthesis systems. However, although HMMs can be a relatively efficient modelling scheme for speech, they suffer from a number of limitations that have been pointed out in the literature. The HMM limitations derive from assumptions such as: a) conditional independence of observations given the state sequence and b) the speech statistics of each state not changing dynamically.
A simple mechanism for capturing time dependence is to augment the observation space with feature derivatives and use their relationships to produce smoother trajectories during synthesis. However, the standard HMM parameter estimation algorithm can only be used under the assumption that the static and dynamic feature sequences are independent. The inconsistency of this mechanism was solved by the trajectory HMM, which imposes relationships between static and dynamic feature vector sequences during training as well. Although the trajectory HMM improved the quality of synthesized speech, the challenge remains to make further progress with models which can easily and consistently be used for both parameter estimation and synthesis. Two such models which also explicitly capture the dynamics of speech are the autoregressive models and the linear dynamical models (LDMs). Both models have low computational requirements in synthesis, and are suitable for applications with low-latency and real-time requirements.
LDMs have been used in speech recognition but there are only preliminary efforts concerning their use in speech synthesis, and these were based on segmentation and clustering produced by HMM systems. From this point of view, these systems are mainly LDM-HMM hybrids rather than LDM-based synthesizers.
Most parametric speech synthesis systems are based on the HMM-based speech synthesis toolkit (HTS), which is a publicly available toolkit. Programmatically, an LDM-based system may be built as an extension of the HTS; however, incorporating LDMs in the HTS framework requires extended modifications which are not trivial and results in a final system which is difficult to maintain and extend. For this reason, and in order to have the flexibility to experiment with alternative state-space models, a new LDM-based speech synthesis system has been developed from scratch. In practice, training of the LDMs using the present system is performed in two phases.
First the decision trees are constructed for each segment and parameter type. The linguistic questions as well as the full context labelling of the training examples are provided as input files to the LDM-synthesis system. Then, the LDM models associated with the leaves of the trees can be used for synthesis. At this point, the system provides the flexibility to retrain new LDMs with variable model configurations as far as the structure of the parameters is concerned, as well as using different variations of the Expectation Maximisation (EM) algorithm. Synthesis may be performed by producing a trajectory of speech parameters which maximises the likelihood given the LDMs under the constraint of global variance. Accordingly, embodiments of the present invention make a major step towards building a complete LDM-based speech synthesis system.
Figure 1 shows a text to speech system 1. The text to speech system 1 comprises a processor 3 which executes a program 5. Text to speech system 1 further comprises storage 7. The storage 7 stores data which is used by program 5 to convert text to speech. The text to speech system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a text input 15. Text input 15 receives text. The text input 15 may be for example a keyboard. Alternatively, text input 15 may be a means for receiving text data from an external storage medium or a network.
Connected to the output module 13 is an output for audio 17. The audio output 17 is used for outputting a speech signal converted from text which is input into text input 15. The audio output 17 may be for example a direct audio output, e.g. a speaker, or an output for an audio data file which may be sent to a storage medium, a network, etc. In use, the text to speech system 1 receives text through text input 15. The program 5 executed on processor 3 converts the text into speech data using data stored in the storage 7. The speech is output via the output module 13 to audio output 17.
A simplified process will now be described with reference to Figure 2. In a first step, S101, text is inputted. The text may be inputted via a keyboard, touch screen, text predictor or the like. The text is then converted into a sequence of linguistic units. These linguistic units may be phonemes or graphemes, or may be segments of phonemes or graphemes, such as sub-phonemes or sub-graphemes. The units may be context dependent, e.g. triphones which take into account not only the phoneme which has been selected but also the preceding and following phonemes. The linguistic units may be a sequence of phonetic and prosodic contextual units (full context labels). The text is converted into the sequence of linguistic units using techniques which are well-known in the art (such as the Festival Speech Synthesis System from the University of Edinburgh).
If the text is divided into phonemes, each phoneme is divided into a predefined number of sub-phonemes. HMM schemes typically utilise 5 sub-phonemes. Due to the improved temporal dynamics of LDMs, a reduced number of segments may be used. In one embodiment the number of segments per phoneme is 3. Each segment may, for instance, be a sub-phoneme. Each segment has its own corresponding acoustic model.
In one embodiment, the linguistic units are sub-phonemes, as shown in more detail in Figure 3.
In step S105, the corresponding acoustic model for each linguistic unit is looked up. This may be achieved via a phonetic-to-acoustic map which is predetermined, e.g. via training of the system in order to fit models to linguistic units.
In step S109 each acoustic model is used to produce a sequence of speech parameters or speech vectors over time. Traditionally, the acoustic model is a Hidden Markov Model (HMM); however, this requires a large number of parameters to be trained in order to fit the model to normal speech. In embodiments of the present invention, as shall be described below, a second order critically damped linear dynamical model is used. This requires a much smaller number of parameters than an HMM while producing synthesised speech of a similar quality. Accordingly, the system has a much smaller footprint than equivalent HMM systems.
Once a sequence of speech vectors has been determined, synthesised speech is output in step S111. The output speech signal is represented by speech parameters, or speech vectors. These speech vectors relate to a number of features of the output speech signal, such as the fundamental frequency (F0), the band aperiodicity (BAP) and the mel-cepstrum coefficients. The output vectors can be used to generate an output speech waveform using a vocoder.
Figure 3 shows how a phoneme relates to linear dynamical models. A phoneme 201 with context is split into a predefined number of sub-phonemes (211, 213, 215). In this embodiment, each phoneme is split into three sub-phonemes. Each sub-phoneme is a linguistic unit and has its own corresponding LDM (221, 223, 225). The three sub-phonemes share the same context. Each LDM models speech to output a predetermined number of frames T of speech vectors y (231, 233, 235). Each frame represents a period of speech. The number of frames, T, for each LDM may vary depending on a duration model. The duration model is learned from the training data. LDM1 models T1 frames, LDM2 models T2 frames and LDM3 models T3 frames. The three sequences of speech vectors (231, 233, 235) are concatenated to form an output set of speech vectors (241) which may be used to synthesise speech for the phoneme.
Linear Dynamical Models

As mentioned above, embodiments of the present invention utilise linear dynamical models (LDMs). LDMs are the simplest dynamical models with continuous state vectors. The state evolution process is a linear first-order Gauss-Markov random process while the observation process is a factor analyser. The output of the process follows a time varying multivariate Gaussian distribution.
Table 1 shows the symbols which are used throughout the application. Figure 4 shows a representation of a hidden Markov chain representing an LDM of an embodiment. Observations 301, 303, 305 are the final output data, with T observations for a given LDM. This may be synthesised speech or components of synthesised speech such as mel-cepstral coefficients (mcep), fundamental frequencies (F0), band aperiodicity parameters (bap) or sinusoidal parameters. This may also be raw speech which is input to train the system.
The LDM is broken up into a number of states (311, 313, 315), with hidden variables.
These represent the position of hidden articulators which cause the output (synthesised speech).

Table 1: Table of Symbols

T: number of observations
x_t: state (hidden) vector at time t
y_t: observation vector at time t
X = [x_1, …, x_T]: trajectory of state vectors
Y = [y_1, …, y_T]: trajectory of observation vectors
p(·): probability density function
x ~ N(μ, Σ): vector x is normally distributed with mean μ and covariance Σ
i: iteration of the EM algorithm
θ: set of parameters
μ_1: mean value of the initial state
Q_1: covariance of the initial state
F: state evolution matrix
q: mean value of hidden states
Q: covariance of hidden states
μ_y: mean value of observations
R: covariance of observations
x̂_{t|T}: Kalman smoother estimate of the expected value of x_t
P̂_{t|T}: Kalman smoother estimate of the expected value of the moment x_t x_t^T
P̂_{t,t−1|T}: Kalman smoother estimate of the expected value of the moment x_t x_{t−1}^T

There is one observation (301, 303, 305) which is output for each state (311, 313, 315). Each state represents a different time segment or frame. There are T frames associated with the output for each LDM. T varies dependent on the LDM and is determined when the LDM is trained. Each frame has its own set of hidden variables which form the basis for the calculation of the respective observation 301, 303, 305.
To calculate the observations (301, 303, 305) the hidden variables need to be determined. The LDM first determines these hidden variables by transitioning between states using the state evolution models (331, 333). The hidden variables are then used in corresponding observation models (321, 323, 325) to determine the observations (301, 303, 305). The structure of this graph allows the decomposition of the joint probability of observations and hidden states into simpler factors. To show that, the trajectory of hidden variables is defined:

X = [x_1, …, x_T]   (1)

along with the trajectory of observation variables:

Y = [y_1, …, y_T]   (2)

then the joint probability distribution factors are defined as:

P(X, Y) = P(x_1) ∏_{t=2}^{T} P(x_t | x_{t−1}) ∏_{t=1}^{T} P(y_t | x_t)   (3)

This distribution is Gaussian since, by assumption, all individual factors are Gaussians.
Each observation (speech vector) (301, 303, 305) can be calculated via a corresponding observation model (321, 323, 325) which utilises the hidden parameters:

Observation model: P(y_t | x_t) = N(y_t; h(x_t), R)

* x_t = hidden variables for state t: abstract state, articulators, sinusoidals, etc.
* y_t = observation for state t: (mceps, F0, bap), sinusoidal parameters, raw speech
* h(x) = an affine transformation or non-linear map, e.g. h(x_t) = H x_t + μ_y
* R = covariance matrix for the Gaussian distribution

The hidden parameters for a given state are dependent on the hidden parameters for the previous state. To transition between states (to find the relevant hidden variables for the next state), the state evolution models (331, 333) are utilised:

State evolution model: P(x_t | x_{t−1}) = N(x_t; f(x_{t−1}), Q)

* x_t = hidden variables for frame t: abstract state, articulators, sinusoidals, etc.
* f(x_{t−1}) = an affine transformation, e.g. f(x_{t−1}) = F x_{t−1} + q
* Q = covariance matrix for the Gaussian distribution

and N(x; μ, Q) is the normal distribution:

N(x; μ, Q) = (2π)^{−n/2} |Q|^{−1/2} exp(−½ (x − μ)^T Q^{−1} (x − μ))

Methods in accordance with embodiments model the state evolution in a parametric way and consider tasks over time, through which the state trajectories should pass.
Given one or more measurement sequences and a model, there are three basic tasks that may be performed: 1. Classification: Compute the probability that a measurement sequence Y came from this model.
2. Inference: Compute the probability that the system was in state z at time t, P(x_t = z | Y).
3. Learning: Determine the parameter settings which maximize the probability of the measurement sequences.
Due to the Markov assumption, the classification and inference tasks can be performed with an exact algorithm. The classification task can be solved with a forward pass through the chain, while an additional backward pass is needed for the inference task. On the other hand, it is computationally very hard to find the optimum parameter settings without knowing the hidden state. For this reason, numerical methods are used to estimate locally optimum parameter settings. In one embodiment, the Expectation Maximization (EM) method is used for the parameter estimation. The EM algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which estimates the hidden state trajectory given both the observations and the parameter values, and a maximization (M) step, which involves system identification using the state estimates from the E-step.
An LDM is characterised by a state evolution equation (4b) and an observation space equation (4c) as shown below, where the trajectories are also provided, as well as the complete LDM model. The simplest state space models are the LDMs, where both the transition and observation models are described by stochastic affine transformations:

x_1 ~ N(μ_1, Q_1)   (4a)
x_t = F x_{t−1} + q + w;  w ~ N(0, Q)   (4b)
y_t = H x_t + μ_y + v;  v ~ N(0, R)   (4c)

where x ∈ R^n and y ∈ R^m. The set of parameters θ is defined as:

θ = {F, q, H, μ_y, μ_1, Q_1, Q, R}   (5)

F is an n × n state transition matrix and H is an m × n observation matrix. The state x is an n-dimensional vector which evolves according to the linear difference equation (4b), with initial condition x_1 defined by equation (4a). The initial condition x_1 follows a Gaussian distribution N with an initial mean value μ_1 and an initial covariance matrix Q_1. The hidden state x cannot be observed directly. Instead, m-dimensional measurements y are available at discrete sampling times as described by (4c). The vectors w and v are called state evolution noise and observation noise respectively and are independent of each other. w and v are Gaussian with a mean of 0 and covariance matrices Q and R respectively. q and μ_y are the state evolution mean and the observation mean respectively.
The estimation of the parameters is obtained by maximising the likelihood:

L(θ | Y) = p(Y | θ)   (6)

One of the advantages of an LDM-based speech synthesis system is that, by tying some parameters across all LDM models, a speech synthesis system having a much smaller number of parameters than a similarly performing state-of-the-art HMM-based speech synthesizer can be achieved with a similar quality of output.
Specifically, in the above LDM system, matrix F and vector q in equation (4b) have n × n and n elements respectively. In this embodiment, covariance matrices Q and Q_1 are set to the identity matrix I and so do not contribute to the number of parameters. In the observation equation, vector μ_y has m elements. Matrices H and R are globally tied and have m × n and m × m elements respectively. As matrices H and R are globally tied, all states share the same H and R. Accordingly, all LDMs have the same observation matrix H and covariance matrix R but have their own value for the observation mean μ_y. Each LDM has its own values for the parameters in the state evolution equation, F, q, μ_1. By globally tying matrices H and R and by setting Q and Q_1 to equal the identity matrix, the footprint of the LDMs is greatly reduced.
In embodiments where the observations to be synthesised are mel-cepstra, band aperiodicities or phase information then matrix R is constrained to be diagonal. For sinusoidal features R is not constrained to be diagonal.
Although the covariance matrix R may be constrained to be diagonal, which drastically limits the number of its parameters, in this analysis it is considered to be a full matrix. The mean value μ_1 has n elements. Let L_LDM denote the total number of LDM tree leaves of the decision tree used to train the LDM. Then the total number of parameters is L_LDM(n² + 2n + m) + m × n + m × m + n × n. Since L_LDM >> m > n, the contribution of the terms that are not multiplied by L_LDM is negligible.
In order to compare the footprint of LDMs with the footprint of HMMs, the number of parameters of a typical HMM system is calculated. For each tree leaf there are m + m + m elements for the means of the features and of the corresponding first and second order derivatives. Also, there are m + m + m elements for the diagonal covariance matrices. If L_HMM is the total number of HMM tree leaves then the total number of parameters is L_HMM × 6m.
The values m and n used in one embodiment for each speech parameter type are as follows. For mel-cepstrum (MCEP), m = 40 and n = 6. For continuous lnF0, m = 1 and n = 1. For band aperiodicity, m = 2 and n = 2. For phase features, m = 20 and n = 4. In this case the ratio L_LDM / L_HMM for mceps is approximately 0.62. This provides the following estimate:

MCEP parameter ratio = (number of LDM parameters) / (number of HMM parameters) ≈ 0.23

Accordingly, by globally tying some of the parameters across all LDMs a much smaller footprint for the speech synthesiser can be obtained.
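This estimate can be reproduced with a short calculation; the sketch below (Python, using the leaf ratio and per-leaf counts given above, dominant terms only) is illustrative:

```python
# Dominant parameter counts per decision tree leaf, as derived above:
# an LDM leaf needs n^2 + 2n + m parameters (full F), an HMM leaf needs
# 6m (static, delta and delta-delta means plus diagonal variances).
m, n = 40, 6          # MCEP observation and hidden dimensions
leaf_ratio = 0.62     # L_LDM / L_HMM reported above for mceps

ldm_per_leaf = n * n + 2 * n + m   # 88
hmm_per_leaf = 6 * m               # 240

print(round(leaf_ratio * ldm_per_leaf / hmm_per_leaf, 2))   # 0.23
```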
Second order LDMs

The movement of the articulators which produce speech may be calculated using critically damped spring-mass models:

d²x/dt² + 2S dx/dt + S²(x(t) − g) = w   (7a)

where x(t) is the position of the articulator at a given time t, g is the equilibrium position (the point attractor or target parameter of the system) where the spring is neither stretched nor compressed, S is a mass normalised stiffness parameter which controls the rate of movement of x(t), and w is random noise that extends the spring-mass model to a statistical model. w represents a random force applied to the system.
According to an embodiment, the state evolution equation is modelled using a second order difference equation that has similar dynamics to the spring-mass model. An approximation of the spring-mass model is used. This allows the system to be solved, allowing the models to be more effectively fitted to training speech parameters.
The discrete-time version of the spring-mass model is:

x_t = F x_{t−1} + (I − F) q + w,  w ~ N(0, Q)   (7b)

F = e^{−S} [ I + S     I    ]
           [ −S²     I − S ]   (7c)

q = [g; 0]

To make the estimation of the parameters more robust, we propose to substitute equation (7b) with a simpler equation that has similar dynamics. The state evolution (4b) is modelled using linear second order task-dynamics, as shown below. A second order recursion is used:

z_t = 2S z_{t−1} − S² z_{t−2} + q + w,  w ~ N(0, Q)   (7d)

q = (I − S)² g

where z, g, w ∈ R^n and Q ∈ R^{n×n}, and where

S = diag(s_1, …, s_n)   (8)

z_t is the hidden variable for frame t, and s_k determines the rate of the critically damped dynamics of the state trajectories towards the targets g. For simplicity, and also due to the physical characteristics of the articulators' dynamics, the matrix S is assumed to be diagonal. The above equation can be written in the first order canonical form:

x_t = F x_{t−1} + q̄ + w,  w ~ N(0, Q)   (9)

The augmented state x_t is a hidden vector allowing the conversion from the second order system of equation (7d) to the first order system of equation (9). The hidden vector x_t ∈ R^{2n} and the system matrix F are:

x_t = [z_t; z_{t−1}],   F = [ 2S  −S² ]
                            [  I    0 ]   (10)

respectively. Whilst the actual hidden parameters are z_t, for simplicity the hidden vectors x_t shall be referred to from here on as hidden parameters. The system input (or control) is:

q̄ = [(I − S)² g; 0]   (11)

The form of F in the present embodiment (compared to that in equation (7c)) results in a simpler, more robust speech synthesis system. The discrete equations (9, 10 and 11) have similar dynamics to the spring-mass model. This is the simplest possible difference equation that has dynamics which are critically damped towards a target. This allows the equation to be solved in a maximisation step, thereby allowing the LDM to be effectively fitted to a set of training speech parameters. If the spring-mass model were discretised directly then the resulting difference equation could not be reliably solved and would therefore result in a system which is more difficult to train. Accordingly, embodiments provide an improved form of the state evolution equation which ensures more accurate and efficient training of the system.
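To illustrate the behaviour of the recursion of equation (7d), a minimal scalar simulation is given below (Python; the values of s and g are illustrative, and the relation q = (1 − s)² g follows from requiring the noise-free fixed point to equal the target):

```python
import numpy as np

def simulate_second_order(s, g, T, noise_std=0.0, seed=0):
    """Simulate the scalar recursion of eq. (7d):
    z_t = 2s z_{t-1} - s^2 z_{t-2} + q + w, with q = (1 - s)^2 g so
    that the noise-free fixed point is the target g."""
    rng = np.random.default_rng(seed)
    q = (1.0 - s) ** 2 * g
    z = np.zeros(T)                  # z_1 = z_2 = 0 as initial conditions
    for t in range(2, T):
        w = rng.normal(0.0, noise_std)
        z[t] = 2.0 * s * z[t - 1] - s ** 2 * z[t - 2] + q + w
    return z

z = simulate_second_order(s=0.9, g=1.0, T=100)
# z approaches g = 1.0 smoothly, without the oscillation or overshoot
# an underdamped system would show.
print(z[[1, 10, 50, 99]])
```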
The observation model is described by an equation that has the same structure as equation (4c):

y_t = H x_t + μ_y + v,  v ~ N(0, R)   (12)

where H ∈ R^{m×2n}.
The initial state is given by:

x_1 ~ N(μ_1, Q_1)   (13)

Therefore, a second order critically damped LDM is also described by a system of equations that have the form of ordinary LDMs (equations 4a-4c).
In embodiments of the invention, the observation and state-evolution equations follow a second order critically damped linear dynamical model as described with reference to equations 7-13. Unlike previous LDMs, the state evolution matrix F has a specific form which reduces the footprint of the LDMs by up to 90% compared to a full matrix F, since F now has n parameters instead of the n² parameters that a full matrix F has.
In addition, the LDMs of the present embodiment are fully parametric. Accordingly, exact solutions can be found during training (as shown below) which is more efficient than optimising a non-parametric model.
Figure 5 shows how speech is synthesised using a linear dynamical model of an embodiment. In step S401 a linguistic unit to be modelled is received. The acoustic model associated with the linguistic unit is then chosen in step S403. The association between linguistic units and acoustic models is predefined during the training of the system, as shall be discussed in detail below. All acoustic models obey the linear dynamical model of equations 8-13, with each acoustic model having associated values for the parameters F, q, μ_y, H, μ_1, Q_1, Q, R. These parameters are set according to the LDM in step S405. Each acoustic model outputs T frames, where T may be different depending on the LDM. T is set according to the duration model of the acoustic model, which is determined during training. In step S405, the value of T is set according to the LDM. This sets the total number of hidden states and the total number of speech vectors which are calculated for the LDM.
The hidden parameter x_t for each state t is then calculated, allowing the speech parameter y_t for each state to be calculated.
In step S407, it is determined whether the current linguistic unit is the first linguistic unit. If it is, then t is set to 1 and the initial hidden parameter x_1 is calculated via equation 13 (S409).

If the linguistic unit is not the first linguistic unit in the current utterance, then t is set to 1 and the initial hidden parameter x_1 is set to the value of the hidden parameter of the last state of the previous linguistic unit (S411). This provides continuity when transitioning between models.

Once the initial hidden parameter x_1 has been set, the initial speech parameter y_1 for the first state is calculated via equation 12 (S413).
At step S415 the method moves to the next state (t = t + 1). At step S417 the hidden parameter for the state is calculated using the value of the hidden variable for the previous state (x_{t−1}) via equations 8-11. The speech variable for this state is then calculated using equation 12.
In step S419 it is checked whether the maximum number of states (T) has been reached. If T has not been reached, then the method loops back to step S415 so that the hidden and speech variables for the next state may be calculated. If T has been reached, then the speech vectors Y for the linguistic unit are output so that they may be used to synthesise speech (S421).
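A minimal sketch of this generation loop for one linguistic unit is given below (Python; the parameter container and its field names are hypothetical, but the recursions are those of equations 9 and 12, with the noise terms omitted so that the mean trajectory is produced):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class LDMParams:
    F: np.ndarray      # 2n x 2n state transition matrix
    q: np.ndarray      # 2n augmented input vector (eq. 11)
    H: np.ndarray      # m x 2n observation matrix (globally tied)
    mu_y: np.ndarray   # m observation mean
    mu_1: np.ndarray   # 2n initial state mean
    T: int             # number of frames, from the duration model

def synthesise_unit(p: LDMParams, x_prev=None):
    """Generate the mean speech vectors for one linguistic unit.
    If x_prev (the last hidden state of the previous unit) is given,
    start from it so the trajectory is continuous across units
    (steps S409/S411); noise terms are omitted."""
    x = p.mu_1.copy() if x_prev is None else x_prev.copy()
    Y = []
    for t in range(p.T):
        if t > 0:
            x = p.F @ x + p.q          # state evolution, eq. (9)
        Y.append(p.H @ x + p.mu_y)     # observation, eq. (12)
    return np.stack(Y), x              # T x m speech vectors, final state
```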
By repeating the method of Figure 5 for each linguistic unit, sequences of speech vectors can be output to synthesise whole utterances.

Training the LDMs (Expectation Maximisation)

To fit the LDMs to linguistic units the models must be trained. This involves determining the optimum values of the parameters F, q, μ_y, H, μ_1, Q_1, Q, R for each acoustic unit. This can be obtained via Expectation Maximisation based on a training set of speech vectors associated with their corresponding linguistic units.
The parameters and the hidden state of system (1) can be jointly estimated with Expectation Maximization (EM). The EM iteration alternates between performing an expectation (E) step, which estimates the hidden state trajectory given both the observations and the parameter values, and a maximization (M) step, which involves system identification using the state estimates from the E-step. Each one of these steps is efficiently calculated. When the state is given then the parameters are estimated from closed form algebraic formulas. On the other hand, when the parameters of an LDM are known then the marginal probabilities and the sufficient statistics (used in the equations for estimating the parameters) are calculated by a forward pass (Kalman filter) followed by a backward pass through the Markov chain of the probabilistic interactions (Kalman smoother).
Figure 6 shows a method of training the acoustic model. In step S501 the method receives the training speech parameters y_t and initiates with initial estimates of the model parameters F, H, μ_1, Q_1, Q, R, q and μ_y. In one embodiment, the covariance matrices Q and Q_1 are fixed (Q = Q_1 = I). This removes a degeneracy of the model and does not restrict its generality. Also, in order to ensure the stability of the model, the transition matrix F is constrained to have spectral radius less than or equal to one.
The initial estimates for the model parameters and the training speech trajectories are used in an expectation step (S503 and S505). A Kalman filter is implemented in a first part of the expectation step (S503) to calculate the marginal probabilities and the statistics x̂_{t|t}, P̂_{t|t} for t ∈ {1, …, T} and x̂_{t|t−1}, P̂_{t|t−1} for t ∈ {2, …, T} (see Algorithm 1).
Algorithm 1: Kalman Filter

Data: observations y_{1:T} and model parameters F, q, H, μ_y, μ_1, Q_1, Q, R
Result: logL = log(p(y_{1:T})) and statistics x̂_{t|t}, P̂_{t|t}, t ∈ {1, …, T} and x̂_{t|t−1}, P̂_{t|t−1}, t ∈ {2, …, T}

Initialisation: x̂_{1|0} = μ_1, P̂_{1|0} = Q_1, logL = 0
for t = 1 : T do
  // Prediction
  if t > 1 then
    x̂_{t|t−1} = F x̂_{t−1|t−1} + q
    P̂_{t|t−1} = F P̂_{t−1|t−1} F^T + Q
  // Update
  e_t = y_t − H x̂_{t|t−1} − μ_y
  Σ_t = H P̂_{t|t−1} H^T + R
  K_t = P̂_{t|t−1} H^T Σ_t^{−1}
  x̂_{t|t} = x̂_{t|t−1} + K_t e_t
  P̂_{t|t} = P̂_{t|t−1} − K_t H P̂_{t|t−1}
  logL = logL − ½ (log|Σ_t| + e_t^T Σ_t^{−1} e_t + m log 2π)

Algorithm 1 computes recursively the probability of each hidden variable x_t given the set of speech vectors from frame 1 to t−1, p(x_t | y_{1:t−1}), and the probability of each hidden variable x_t given the set of speech vectors from frame 1 to t, p(x_t | y_{1:t}), and evaluates the probability density function p(Y). To prevent underflows, Algorithm 1 returns log(p(Y)), which is interpreted as the log-likelihood of the model parameters given the data (log(L(θ|Y)) = log(p(Y|θ))) in the parameter estimation phase.
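For concreteness, an illustrative NumPy implementation of this forward pass as reconstructed above (not the patent's code) might read:

```python
import numpy as np

def kalman_filter(Y, F, q, H, mu_y, mu_1, Q1, Q, R):
    """Forward pass over a T x m observation array Y. Returns log p(Y)
    and the predicted/filtered first and second moments used by the
    smoother."""
    T, m = Y.shape
    x_pred, P_pred, x_filt, P_filt = [], [], [], []
    logL = 0.0
    for t in range(T):
        if t == 0:
            xp, Pp = mu_1, Q1                     # initialisation
        else:
            xp = F @ x_filt[-1] + q               # prediction
            Pp = F @ P_filt[-1] @ F.T + Q
        e = Y[t] - H @ xp - mu_y                  # innovation
        Sig = H @ Pp @ H.T + R                    # innovation covariance
        K = Pp @ H.T @ np.linalg.inv(Sig)         # Kalman gain
        x_filt.append(xp + K @ e)                 # update
        P_filt.append(Pp - K @ H @ Pp)
        x_pred.append(xp)
        P_pred.append(Pp)
        logL -= 0.5 * (np.linalg.slogdet(Sig)[1]
                       + e @ np.linalg.solve(Sig, e)
                       + m * np.log(2.0 * np.pi))
    return logL, x_pred, P_pred, x_filt, P_filt
```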
In step S505, the statistics calculated in the Kalman filter (x̂_{t|t}, P̂_{t|t}, x̂_{t|t−1} and P̂_{t|t−1}) are used in a second part of the expectation step. A Kalman smoother is used to obtain the sufficient statistics x̂_{t|T}, P̂_{t|T}, t ∈ {1, …, T} and P̂_{t,t−1|T}, t ∈ {2, …, T} (see Algorithm 2).
Algorithm 2: Kalman Smoother

Data: statistics x̂_{t|t}, P̂_{t|t}, x̂_{t|t−1}, P̂_{t|t−1} calculated by the Kalman filter, and model parameter F
Result: statistics x̂_{t|T}, P̂_{t|T}, t ∈ {1, …, T} and P̂_{t,t−1|T}, t ∈ {2, …, T}

Initialisation: x̂_{T|T} and P̂_{T|T} are taken from the Kalman filter
for t = T − 1 : 1 do
  J_t = P̂_{t|t} F^T P̂_{t+1|t}^{−1}
  x̂_{t|T} = x̂_{t|t} + J_t (x̂_{t+1|T} − x̂_{t+1|t})
  P̂_{t|T} = P̂_{t|t} + J_t (P̂_{t+1|T} − P̂_{t+1|t}) J_t^T
  P̂_{t+1,t|T} = P̂_{t+1|T} J_t^T

In step S507 a maximisation step is enacted using the sufficient statistics and the training speech parameters. The maximisation step computes values for the parameters which maximise the expected log-likelihood found in the E step. It can be shown that the maximum values of the parameters for a first order LDM can be obtained via the sufficient statistics using the equations in Table 2 below (an illustrative sketch of the smoothing pass is given first).
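An illustrative NumPy sketch of this backward pass, consuming the outputs of the filter sketch above (again, not the patent's code):

```python
import numpy as np

def kalman_smoother(F, x_pred, P_pred, x_filt, P_filt):
    """Backward (RTS) pass. Returns the smoothed means and covariances
    and the lag-one cross covariances; the smoothed second moments used
    in the text are recovered by adding the outer products of the means."""
    T = len(x_filt)
    x_sm = [None] * T
    P_sm = [None] * T
    P_cross = [None] * T      # P_cross[t]: Cov(x_t, x_{t-1} | Y)
    x_sm[T - 1], P_sm[T - 1] = x_filt[T - 1], P_filt[T - 1]
    for t in range(T - 2, -1, -1):
        J = P_filt[t] @ F.T @ np.linalg.inv(P_pred[t + 1])  # smoother gain
        x_sm[t] = x_filt[t] + J @ (x_sm[t + 1] - x_pred[t + 1])
        P_sm[t] = P_filt[t] + J @ (P_sm[t + 1] - P_pred[t + 1]) @ J.T
        P_cross[t + 1] = P_sm[t + 1] @ J.T
    return x_sm, P_sm, P_cross
```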
Table 2: Update equations of first order LDMs in the M-step of the EM method.

μ_1 = x̂_{1|T}
Q_1 = P̂_{1|T} − x̂_{1|T} x̂_{1|T}^T
F = (Γ_2 − (1/(T−1)) δ_2 δ_1^T) (Γ_1 − (1/(T−1)) δ_1 δ_1^T)^{−1}
q = (δ_2 − F δ_1) / (T − 1)
H = (Γ_y − (1/T) δ_y δ_3^T) (Γ_3 − (1/T) δ_3 δ_3^T)^{−1}
μ_y = (δ_y − H δ_3) / T
R = (1/T) Σ_{t=1}^{T} ((y_t − μ_y)(y_t − μ_y)^T − H x̂_{t|T} (y_t − μ_y)^T)

where the accumulated sufficient statistics are:

δ_1 = Σ_{t=2}^{T} x̂_{t−1|T},   δ_2 = Σ_{t=2}^{T} x̂_{t|T},   δ_3 = Σ_{t=1}^{T} x̂_{t|T},   δ_y = Σ_{t=1}^{T} y_t
Γ_1 = Σ_{t=2}^{T} P̂_{t−1|T},   Γ_2 = Σ_{t=2}^{T} P̂_{t,t−1|T},   Γ_3 = Σ_{t=1}^{T} P̂_{t|T},   Γ_y = Σ_{t=1}^{T} y_t x̂_{t|T}^T   (14)

For the feature streams noted above, R may additionally be restricted to its diagonal.
The situation is slightly different for second order LDMs. To find the estimates of the parameters S and q, the following method is used (see Appendix A for the derivations).

For k = 1 : n, where n is the dimension of z_t, a third order polynomial equation in s_k, with coefficients formed from the sufficient statistics, is obtained (Eq. 14a). This equation is of 3rd order and can therefore always be solved to obtain at least one real value of s_k. A value of s_k is sought which has an absolute value of less than one (|s_k| < 1). This ensures that the second order LDM is stable. A method of how to solve this equation can be found in "Numerical Recipes 3rd Edition: The Art of Scientific Computing", William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery, Cambridge University Press, New York, NY, USA, 3rd edition, 2007. After determining s_k, this value can be substituted into Eq. 14b to obtain q(k).
The update equations for a second order LDM are shown in Table 3. Appendix A contains the derivations of these equations.

Table 3: Update equations of second order critically damped LDMs in the M-step of the EM method.

s_k = the solution of Eq. 14a satisfying |s_k| < 1, for k = 1 : n
q(k) = the solution of Eq. 14b, for k = 1 : n
μ_1 = x̂_{1|T}
H, μ_y and R: as in Table 2, where H and R may be globally tied across all models
Once the updated parameter values have been calculated, it is then determined whether a local maximum has been reached (S509).

If no local maximum has been reached then the method loops back to step S503 to repeat the expectation (S503 and S505) followed by the maximisation (S507) based on the updated model parameters. This obtains new estimates for the model parameters.

If a local maximum has been reached then the updated model parameters are stored (S511) so that they may be retrieved at a later date in order to synthesise speech for the corresponding linguistic unit.
Accordingly, the Expectation and Maximisation steps are repeated until the system converges on a local maximum for the parameters of the LDM. This allows the system to determine optimum values for F, H, μ_1, Q_1, Q, R, q and μ_y based on observed speech data for a given linguistic unit.

Tree-based clustering for LDM-TTS

When training models on speech data, there are often not enough examples of a given linguistic unit for the data to be modelled accurately. To achieve high quality synthesised speech in statistical parametric speech synthesis, it is important to robustly model the acoustic and linguistic contexts. Typical parametric speech synthesizers consist of a huge number of context-dependent models, many of which cannot be robustly trained since there are a limited number of observations in the training set.
Sometimes, there is a complete lack of samples for a given linguistic unit which would result in the system being unable to fit a model to the unsampled linguistic unit.
To address this problem, top-down decision tree based context clustering is usually used. The decision trees do not only contribute to addressing the data sparsity problem; they are also used to model acoustic contexts unseen in the training data. Accordingly, some linguistic units which are similar are clustered together and assigned the same model.
A phonetic decision tree is a binary tree in which a boolean (yes/no) decision is made based on a phonetic question associated with each node. Initially, all states (equivalently, all associated training data) are placed at the root node of a tree. Depending on each answer, the pool of states is successively split until the LDM-likelihood increment is less than the increase in the complexity of the models. Model complexity can be measured with the minimum description length (MDL) criterion:

ℓ = ½ ρ k log N   (15)

where k is the number of free parameters per LDM model, N is the total number of frames associated with a tree node, and ρ is a heuristic scaling factor which in this embodiment is set to 1. Previous experiments with LDM-based speech synthesizers were based on decision trees and states derived from corresponding HMM systems. However, this is not optimal for LDMs. Typically a context dependent unit is split into five segments (linguistic units) when modelled using an HMM. Given the duration of most of these units and the number of segments considered, it is common to find segments with associated sub-phoneme segments that consist of only a single frame. Furthermore, the small time duration of each segment does not allow the LDM to exhibit its ability to better model the temporal dynamics of speech.
In an effort to overcome the above inefficiencies, a simple suboptimal phoneme segmentation rule is adopted. Each phoneme is split into two equally sized segments (linguistic units), the left and right segment. This simple rule works in practice since LDMs adequately model speech dynamics within each segment.
In order to robustly estimate the parameters of the model that corresponds to each linguistic unit, it is necessary to cluster the units based on their acoustic and linguistic context. In practice, a different top-down LDM-based phonetic decision tree is built for each of the segments (left and right) and each type of speech parameters considered (mel-cepstrum, lnF0, band-aperiodicity and phase features). Thus, a total of eight decision trees are constructed. Finally, all the observations associated with each leaf of a decision tree are used to estimate the corresponding duration model as a Gaussian distribution.
The computational complexity of a decision tree-based clustering approach is higher than in the HMM case. In HMMs, and autoregressive HMMs, the length of segments (number of frames per segment) is considered to be fixed. This assumption results in parameter estimation formulae that are direct functions of the sufficient statistics collected initially from the training data. In practice, this means that when a cluster is split into two clusters, the likelihoods of the new clusters can be efficiently calculated by accumulating the relevant sufficient statistics collected once, without reference to the training data. However, for models such as LDMs and trajectory HMMs it is not possible to apply the same approach. For LDMs the sufficient statistics depend on the model, which in turn depends on the training data. When a cluster is split, the model of the parent node cannot be used for either of the two children. Therefore, new model parameters and the corresponding likelihoods have to be iteratively estimated from the data for each new cluster.
Algorithm 3 shows the pseudo-code of LDM-based decision tree clustering. As can be seen, an LDM model is estimated for every new child cluster as we move down the hierarchy. The clustering process may be accelerated using approximation algorithms. Algorithm 3 relies on coarse grain parallelism to remain practical; i.e., the search for the best question is performed in parallel using all cores of modern processors. In Algorithm 3, L_y and L_n denote the likelihoods of the data samples that satisfy and those that do not satisfy question q respectively, and ℓ is the MDL threshold of (15). The algorithm performs a breadth first construction of the tree using a task queue to store the tree nodes that are candidates for a split. The parallel section performs hypothetical splits and the actual split is done for the best question of list Q only if this list is not empty. If list Q is empty then the current node, v, is a leaf node.
Algorithm 3: Decision Tree Clustering

Data: training examples and linguistic questions
Result: the decision tree

Create the root node, which has pointers to all examples
taskQueue.put(root)
while taskQueue.isNotEmpty() do
  v = taskQueue.pop()
  for q ∈ Questions do in parallel
    Split the examples of node v according to q
    Fit an LDM to the "yes" examples and calculate L_y
    Fit an LDM to the "no" examples and calculate L_n
    if log L_y + log L_n − log L_v > ℓ then
      Store (q, log L_y + log L_n) in list Q
  if Q.isNotEmpty() then
    Choose the question q* of Q with the largest log L_y + log L_n value and set v.q = q*
    Split the examples of node v according to q*
    Create node y with pointers to the "yes" examples
    Create node n with pointers to the "no" examples
    Connect y and n as children of v
    taskQueue.put(y); taskQueue.put(n)

Figure 7 shows a method of clustering similar linguistic units together. In step S701 a set of training speech vectors and associated full context labelling (including linguistic units) is received. In step S703 a single LDM is fitted to all speech vectors using the above method of training (see Figure 6). The value of the log likelihood of the model, calculated during the Kalman filtering step of the EM method, is stored.

The set of questions Q is then applied, in parallel, to all linguistic units in the model (S705). The group of linguistic units is then split into "yes" and "no" clusters based on the answer to each question (S707). The log likelihoods for the "yes" and "no" clusters are calculated in step S711. In step S713, if the cumulative log-likelihood of the "yes" and "no" clusters is greater than the log likelihood of the parent cluster plus the MDL threshold ℓ, then the question q and the cumulative log likelihood of the "yes" and "no" clusters are stored. In step S715 it is checked whether any questions were stored during the above parallel processing of questions. If not, then the parent model is the best model for all of the received linguistic units and it is assigned to the unsplit cluster (S717). This model is then used when synthesising speech for any of the received linguistic units. If one or more questions have been stored then the question with the highest cumulative log-likelihood of the "yes" and "no" clusters is chosen (S719). The models fitted to the "yes" and "no" clusters are assigned to the respective clusters (S721) and the above method is then repeated for the "yes" and "no" clusters (S723). Through this method, a decision tree is iteratively formed until all linguistic units are assigned an LDM (even if this is shared between linguistic units).
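A sketch of the split test performed at a single node is given below (Python; fit_ldm and num_frames are hypothetical placeholders for the EM training and frame counting described above):

```python
import numpy as np

def mdl_threshold(k, N, rho=1.0):
    """MDL penalty of eq. (15): l = 0.5 * rho * k * log(N), where k is
    the number of free parameters per LDM and N the frames at the node."""
    return 0.5 * rho * k * np.log(N)

def best_split(examples, questions, fit_ldm, num_frames, k):
    """Hypothetical split test for one node: try every question, keep
    the one with the largest log-likelihood gain over the parent model,
    provided the gain exceeds the MDL penalty. fit_ldm(examples) is
    assumed to train an LDM by EM and return its log-likelihood."""
    parent_logL = fit_ldm(examples)
    best = None
    for q in questions:
        yes = [e for e in examples if q(e)]
        no = [e for e in examples if not q(e)]
        if not yes or not no:
            continue
        gain = fit_ldm(yes) + fit_ldm(no) - parent_logL
        if gain > mdl_threshold(k, num_frames(examples)):
            if best is None or gain > best[1]:
                best = (q, gain, yes, no)
    return best   # None: the node stays a leaf
```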
After clustering, the duration model of each leaf cluster (or leaf model) is determined. This provides the number of frames (T) for each LDM. The number of frames differs for each LDM. Each cluster contains a number of training examples (segments of speech parameters). Each of these segments consists of a number of frames. The duration (number of frames) for each leaf cluster (associated with a given LDM) is modelled with a Gaussian with mean and standard deviation calculated from the number of frames of the associated training segments.
That is, the duration of this cluster (this LDM) follows a normal distribution with mean:

μ_LDM = (T_1 + T_2 + ⋯ + T_N) / N   (16)

and variance:

σ²_LDM = Σ_{n=1}^{N} (T_n − μ_LDM)² / N   (17)

where T_n is the number of frames in the nth example of the N speech parameter segments assigned to the leaf cluster (to the LDM).

Figure 8 shows the mel-cepstral distance as a function of hidden space dimension. The cepstral distance in dB between two sequences of mel-cepstral coefficient sets is given by:

d(c¹, c²) = (1/T) Σ_{t=1}^{T} (10 / ln 10) √(2 Σ_{i=1}^{m} (c¹_t(i) − c²_t(i))²)   (18)

where c¹_t(i) and c²_t(i) are the i-th mel-cepstral coefficients for the t-th frame of the natural and generated sequences of coefficient sets, respectively, with T being the number of frames and m the cepstrum order. Smaller distances correspond to better modelling. It can be seen from the diagram that the mel-cepstral distance between the original and synthetic speech cepstrum approaches its minimum for a relatively small hidden space dimension (6 to 8). This is why, in one embodiment, state vectors consisting of 6 components are chosen.
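A short sketch of this distance computation (Python; c1 and c2 are aligned T × m arrays of mel-cepstral coefficients, and the constant follows the dB formulation of equation 18 as reconstructed above):

```python
import numpy as np

def mel_cepstral_distance(c1, c2):
    """Mean mel-cepstral distance in dB between two aligned T x m
    arrays of coefficient vectors, following eq. (18)."""
    diff = np.asarray(c1) - np.asarray(c2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())     # average over the T frames
```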
Figure 9 shows the state space trajectory for a first order LDM and a second order critically damped LDM. Both the first order (810) and the second order (820) LDMs have first (830) and second (840) target values towards which the state space trajectories are urged. Both are initially urged towards the first target value (830) before transitioning to being urged towards the second target value (840). This represents the transition between different models over a single utterance. The first order LDM (810) has a discontinuity (850) at the transition between the two target values (830, 840). In contrast, the second order LDM (820) has no discontinuity. The resultant speech synthesised based on the state parameters of the second order LDM (820) would therefore have a smaller number of artefacts and therefore sound more natural. Accordingly, second order LDMs synthesise speech with a smaller number of artefacts than first order LDMs.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Appendix A

Maximum Likelihood Estimation of the Model Parameters

To find the maximum likelihood (ML) estimates for the model parameters, the joint log-likelihood of the data log P(Y|θ) has to be maximised:

L(θ) = log P(Y|θ) = log ( ∫_X P(X, Y|θ) dX )    (A1)

Using any distribution Q(X) over the hidden variables, we can obtain a lower bound on L:

L(θ) = log ( ∫_X Q(X) [P(X, Y|θ) / Q(X)] dX )    (A2)

Using Jensen's inequality, which can be proved from the concavity of the log function, we have

L(θ) ≥ ∫_X Q(X) log [P(X, Y|θ) / Q(X)] dX = ∫_X Q(X) log P(X, Y|θ) dX - ∫_X Q(X) log Q(X) dX    (A3)

We set

F(Q, θ) = ∫_X Q(X) log P(X, Y|θ) dX - ∫_X Q(X) log Q(X) dX    (A4)

The EM method alternates between maximising F with respect to the distribution Q and the parameters θ, respectively, holding the other fixed. Starting from some initial set of parameters we alternately apply:

E-step: Q_{i+1} ← arg max_Q F(Q, θ_i)    (A5a)
M-step: θ_{i+1} ← arg max_θ F(Q_{i+1}, θ)    (A5b)

The maximum in the E-step (Eq. A5a) is attained when the conditional distribution is chosen to be the posterior of the state sequence given the observation sequence and the old model parameters, Q_{i+1}(X) = P(X|Y, θ_i). With this choice, the lower bound (A3) holds with equality: F(Q_{i+1}, θ_i) = L(θ_i).
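As a non-authoritative illustration of the alternation in Eqs. (A5a) and (A5b), an EM training loop might be organised as follows, with `e_step` and `m_step` assumed helpers (for the present model, the E-step would be realised by a Kalman filter and smoother):

```python
# Hypothetical skeleton of the EM alternation in Eqs. (A5a)/(A5b).
# e_step(y, theta) returns the posterior statistics (the optimal Q) and
# the log-likelihood; m_step(y, stats) re-estimates the parameters.

def train_ldm(y, theta, e_step, m_step, tol=1e-4, max_iters=100):
    """Alternate E- and M-steps until the log-likelihood converges."""
    prev_ll = -float("inf")
    for _ in range(max_iters):
        stats, ll = e_step(y, theta)   # Eq. (A5a): Q_{i+1} = P(X | Y, theta_i)
        if ll - prev_ll < tol:         # monotone increase; stop at local max
            break
        theta = m_step(y, stats)       # Eq. (A5b): maximise F over theta
        prev_ll = ll
    return theta
```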
The maximum in the M-step (Eq. A5b) is obtained by maximising the first term in (A3), since the second term does not depend on θ:

M-step: θ_{i+1} ← arg max_θ ∫_X P(X|Y, θ_i) log P(X, Y|θ) dX    (A6)

Auxiliary Function

Based on equation A6, we define the following auxiliary function:

Q(θ_i, θ) = E[log P(X, Y|θ) | Y, θ_i] = ∫_X P(X|Y, θ_i) log P(X, Y|θ) dX    (A7)

From Eq. 3 we have

log P(X, Y|θ) = log P(x_1) + Σ_{t=2}^{T} log P(x_t|x_{t-1}) + Σ_{t=1}^{T} log P(y_t|x_t)    (A8)

The first term can be written as:

log P(x_1) = -n log 2π - (1/2) log|Q_1| - (1/2) (x_1 - μ_1)^T Q_1^{-1} (x_1 - μ_1)    (A9)

The second term can be written as:

Σ_{t=2}^{T} log P(x_t|x_{t-1}) = -n(T - 1) log 2π - ((T - 1)/2) log|Q| - (1/2) Σ_{t=2}^{T} (x_t - F x_{t-1} - q̄)^T Q^{-1} (x_t - F x_{t-1} - q̄)    (A10)

where q̄ = ((I - S)^2 q; 0) denotes the constant term of the evolution equation. The third term can be written as:

Σ_{t=1}^{T} log P(y_t|x_t) = -(mT/2) log 2π - (T/2) log|R| - (1/2) Σ_{t=1}^{T} (y_t - H x_t - μ_y)^T R^{-1} (y_t - H x_t - μ_y)    (A11)

To simplify the derivation of the update equations for matrix S and vector q we assume that the covariance Q of the evolution equation and the initial covariance Q_1 are equal to the identity matrix. Setting Q = Q_1 = I ∈ R^{2n×2n}, the auxiliary function can be written as:

Q(θ_i, θ) = c - (1/2) E[(x_1 - μ_1)^T (x_1 - μ_1) | Y, θ_i] - (1/2) Σ_{t=2}^{T} E[(x_t - F x_{t-1} - q̄)^T (x_t - F x_{t-1} - q̄) | Y, θ_i] - (T/2) log|R| - (1/2) Σ_{t=1}^{T} E[(y_t - H x_t - μ_y)^T R^{-1} (y_t - H x_t - μ_y) | Y, θ_i]    (A12)

where c is a constant that does not depend on the parameters.

M-Step

The optimum parameters can be found by differentiating the auxiliary function Q(θ_i, θ) with respect to the parameters and setting the derivatives to zero. Note that the auxiliary function Q(θ_i, θ) is distinct from the covariance matrices Q and Q_1.

M-Step: Update Equations for the Target Vector q

To find the new target vector q, the auxiliary function in Eq. A12 can be differentiated with respect to q(k) as follows.
With F = ( 2S  -S^2 ; I  0 ) and S diagonal, component k (k = 1, ..., n) of the evolution residual at frame t is

e_t(k) = x_t(k) - 2 s_k x_{t-1}(k) + s_k^2 x_{t-1}(n + k) - (1 - s_k)^2 q(k)

Setting

∂Q(θ_i, θ)/∂q(k) = (1 - s_k)^2 Σ_{t=2}^{T} E[e_t(k) | Y, θ_i] = 0

gives

q(k) = [ Σ_{t=2}^{T} ( x̂_{t|T}(k) - 2 s_k x̂_{t-1|T}(k) + s_k^2 x̂_{t-1|T}(n + k) ) ] / [ (T - 1)(1 - s_k)^2 ]    (A13)

M-Step: Update Equations for the Damping Parameter S

To find the new damping parameter S, the auxiliary function in Eq. A12 can be differentiated with respect to s(k) as follows:

∂Q(θ_i, θ)/∂s(k) = Σ_{t=2}^{T} E[ e_t(k) ( -2 x_{t-1}(k) + 2 s_k x_{t-1}(n + k) + 2 (1 - s_k) q(k) ) | Y, θ_i ] = 0    (A14)

After substituting q(k) from Eq. A13 into Eq. A14, a third-order polynomial equation in s_k is obtained, whose coefficients are formed from the smoothed first- and second-order moments of the hidden vectors.    (A15)

As mentioned with regard to Equation 14a, this 3rd order equation can be solved using a known method to obtain at least one real value of s_k ("Numerical Recipes 3rd Edition: The Art of Scientific Computing", William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery, Cambridge University Press, New York, NY, USA, 3rd edition, 2007). A value of s_k is sought which has an absolute value of less than one (|s_k| < 1); a sketch of this root selection is given after Eq. A18 below.

M-Step: Update Equations for Vector μ_1

To find the new mean vector of the initial state, the auxiliary function in Eq. A12 can be differentiated with respect to μ_1 as follows:

∂Q(θ_i, θ)/∂μ_1 = Q_1^{-1} ( E[x_1 | Y, θ_i] - μ_1 ) = 0  ⇒  μ_1 = E[x_1 | Y, θ_i] = x̂_{1|T}    (A16)

M-Step: Update Equations for Covariance Q_1

To find the new covariance of the initial state, the auxiliary function in Eq. A12 can be differentiated with respect to Q_1^{-1} as follows:

∂Q(θ_i, θ)/∂Q_1^{-1} = (1/2) Q_1 - (1/2) E[ (x_1 - μ_1)(x_1 - μ_1)^T | Y, θ_i ] = 0

Therefore

Q_1 = E[ (x_1 - μ_1)(x_1 - μ_1)^T | Y, θ_i ]

and by taking into account Eq. A16 we have

Q_1 = M̂_{1|T} - μ_1 μ_1^T    (A17)

M-Step: Update Equations for Covariance Q

To find the new covariance Q, the auxiliary function in Eq. A12 can be differentiated with respect to Q^{-1} as follows:

∂Q(θ_i, θ)/∂Q^{-1} = ((T - 1)/2) Q - (1/2) Σ_{t=2}^{T} E[ (x_t - F x_{t-1} - q̄)(x_t - F x_{t-1} - q̄)^T | Y, θ_i ] = 0

Therefore

Q = (1/(T - 1)) Σ_{t=2}^{T} E[ (x_t - F x_{t-1} - q̄)(x_t - F x_{t-1} - q̄)^T | Y, θ_i ]    (A18)
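Purely as a sketch of the root selection for the third-order equation of Eq. A15, with placeholder coefficients standing in for the values assembled from the sufficient statistics:

```python
import numpy as np

def solve_damping(c3, c2, c1, c0):
    """Solve c3*s^3 + c2*s^2 + c1*s + c0 = 0 for the damping parameter s_k.

    Returns a real root with |s| < 1 (the stable, critically damped case),
    or None if no such root exists. The coefficients are assumed to come
    from the smoothed moments accumulated in the E-step.
    """
    roots = np.roots([c3, c2, c1, c0])
    real = roots[np.abs(roots.imag) < 1e-9].real
    admissible = real[np.abs(real) < 1.0]
    return admissible.max() if admissible.size else None

# Placeholder coefficients, for illustration only
s_k = solve_damping(1.0, -2.5, 1.9, -0.4)
```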
Update Equations for Vector μ_y

To find the new vector μ_y, the auxiliary function in Eq. A12 can be differentiated with respect to μ_y as follows:

∂Q(θ_i, θ)/∂μ_y = R^{-1} Σ_{t=1}^{T} E[ y_t - H x_t - μ_y | Y, θ_i ] = 0

Therefore, using the notation of the sufficient statistics:

μ_y = (1/T) Σ_{t=1}^{T} ( y_t - H x̂_{t|T} )    (A19)

Update Equations for Matrix H

To find the new matrix H, the auxiliary function in Eq. A12 can be differentiated with respect to H as follows:

∂Q(θ_i, θ)/∂H = R^{-1} Σ_{t=1}^{T} E[ (y_t - H x_t - μ_y) x_t^T | Y, θ_i ] = 0

Therefore, using the notation of the sufficient statistics we have:

H = [ Σ_{t=1}^{T} (y_t - μ_y) x̂_{t|T}^T ] [ Σ_{t=1}^{T} M̂_{t|T} ]^{-1}    (A20)

If we substitute Eq. A19 into the above equation we have:

H = [ Σ_{t=1}^{T} y_t x̂_{t|T}^T - (1/T) ( Σ_{t=1}^{T} y_t ) ( Σ_{t=1}^{T} x̂_{t|T} )^T ] [ Σ_{t=1}^{T} M̂_{t|T} - (1/T) ( Σ_{t=1}^{T} x̂_{t|T} ) ( Σ_{t=1}^{T} x̂_{t|T} )^T ]^{-1}    (A21)

Since H = [H_z, 0_n], we set H(:, n + 1 : 2n) = 0.

Update Equations for Matrix R

To find the new matrix R, the auxiliary function in Eq. A12 can be differentiated with respect to R^{-1} as follows:

∂Q(θ_i, θ)/∂R^{-1} = (T/2) R - (1/2) Σ_{t=1}^{T} E[ (y_t - H x_t - μ_y)(y_t - H x_t - μ_y)^T | Y, θ_i ] = 0

Therefore, using the notation of the sufficient statistics:

R = (1/T) ( Σ_{t=1}^{T} y_t y_t^T - H Σ_{t=1}^{T} x̂_{t|T} y_t^T - μ_y Σ_{t=1}^{T} y_t^T )    (A22)
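As a non-authoritative sketch of how the globally tied observation parameters of Eqs. A19, A21 and A22 might be re-estimated, assuming the smoothed means and moments have already been produced by a Kalman filter and smoother (array names and layouts are assumptions):

```python
import numpy as np

def m_step_observation(y, xhat, M):
    """M-step updates for mu_y, H and R (Eqs. A19, A21, A22).

    y:    (T, m) observed speech vectors
    xhat: (T, 2n) smoothed means E[x_t | Y]
    M:    (T, 2n, 2n) smoothed second moments E[x_t x_t^T | Y]
    All inputs are assumed outputs of the Kalman filter/smoother E-step.
    """
    T, m = y.shape
    n = xhat.shape[1] // 2

    sum_y = y.sum(axis=0)            # sum_t y_t
    sum_x = xhat.sum(axis=0)         # sum_t xhat_t
    sum_yx = y.T @ xhat              # sum_t y_t xhat_t^T
    sum_M = M.sum(axis=0)            # sum_t M_t

    # Eq. A21: H from statistics with mu_y eliminated
    A = sum_yx - np.outer(sum_y, sum_x) / T
    B = sum_M - np.outer(sum_x, sum_x) / T
    H = A @ np.linalg.inv(B)
    H[:, n:] = 0.0                   # H = [H_z, 0]: only z_t is observed

    # Eq. A19: mu_y given the new H
    mu_y = (sum_y - H @ sum_x) / T

    # Eq. A22: R at the stationary point
    R = (y.T @ y - H @ (xhat.T @ y) - np.outer(mu_y, sum_y)) / T
    return mu_y, H, R
```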

Claims (16)

  1. CLAIMS: 1. A method of speech processing, the method comprising: receiving one or more linguistic units; converting said one or more linguistic units into a sequence of speech vectors for synthesising speech, said conversion using one or more corresponding constrained higher order parametric linear dynamical models; and outputting said sequence of speech vectors.
  2. 2. A method according to claim 1 wherein the one or more constrained higher order linear dynamical models are second order linear dynamical models.
  3. 3. A method according to claim 1 wherein the one or more linear dynamical models describe critically damped task dynamic gestures towards targets.
  4. 4. The method of claim 1 wherein said conversion comprises, for each of the one or more linguistic units: selecting an associated linear dynamical model; determining a predefined number T of hidden vectors x_t according to a state evolution equation wherein the hidden vectors x_t for frame t are: x_1 ~ N(μ_1, Q_1), x_t = F x_{t-1} + q + w, w ~ N(0, Q); determining a sequence of speech vectors y_t based on the hidden vectors x_t according to the observation equation: y_t = H x_t + μ_y + v, v ~ N(0, R); wherein each hidden vector x_t is a vector representing hidden parameters z_t: x_t = (z_t; z_{t-1}); and wherein H is an observation matrix, Q_1, Q and R are covariance matrices, μ_1 and μ_y are mean vectors, F is a state transformation matrix, q is a target vector for the hidden states, and T, F, R, q, μ_1 and μ_y are defined by the respective linear dynamical model.
  5. 5. A method according to claim 4 wherein the state transformation matrix F obeys: F = ( 2S  -S^2 ; I  0 ), where S is a matrix which determines the rate of critically damped dynamics of the hidden vectors towards the target vector q.
  6. 6. A method according to claim 5 wherein S is diagonal.
  7. 7. A method according to claim 4 wherein the one or more linear dynamical models comprise a plurality of linear dynamical models and the observation equations have parameters which are globally tied across all linear dynamical models.
  8. 8. A method according to claim 7 wherein the observation matrix H and/or the covariance matrix R are the same for all linear dynamical models.
  9. 9. A method according to claim 4 wherein Q and/or Q_1 are set to be equal to the identity matrix.
  10. 10. A method of training a model for a text-to-speech system, wherein said model is for converting a sequence of linguistic units into a sequence of speech vectors for synthesising speech, the method comprising: receiving speech data comprising training speech vectors and associated linguistic units; modelling speech for each linguistic unit using one or more constrained higher order parametric linear dynamical models; and training the linear dynamical models, said training comprising estimating parameters of the linear dynamical models to fit the models to the associated speech data.
  11. 11. A method according to claim 10 wherein the constrained higher order linear dynamical models are second order linear dynamical models.
  12. 12. A method according to claim 10 wherein the one or more linear dynamical models describe critically damped task dynamic gestures towards targets.
  13. 13. The method of claim 10 wherein each linear dynamical model comprises: a state evolution equation which describes a number T of hidden vectors x_t for frame t according to: x_1 ~ N(μ_1, Q_1), x_t = F x_{t-1} + q + w, w ~ N(0, Q); and an observation equation which describes speech parameters y_t for frame t according to: y_t = H x_t + μ_y + v, v ~ N(0, R); wherein H is an observation matrix, Q_1, Q and R are covariance matrices, μ_1 and μ_y are mean vectors, each hidden vector x_t is a vector representing hidden parameters z_t: x_t = (z_t; z_{t-1}), F is a state transformation matrix according to: F = ( 2S  -S^2 ; I  0 ), q is a target vector for the hidden states and S is a matrix, and wherein T, S, R, q, μ_1 and μ_y are defined by the respective linear dynamical model.
  14. 14. A method according to claim 13 wherein S determines the rate of critically damped dynamics of the hidden vectors towards the target vector q.
  15. 15. A method according to claim 13 wherein fitting each of the models to the associated speech data comprises an expectation maximisation method comprising: a) an expectation step comprising obtaining sufficient statistics for the linear dynamical model via a Kalman filter followed by a Kalman smoother; b) a maximisation step comprising using the sufficient statistics to determine estimates for parameters S, H, R, q, μ_1 and μ_y and updating the linear dynamical model with these estimates; and c) repeating steps a) and b) until a local maximum is reached.
  16. 16. A method according to claim 13 wherein the expectation step obtains sufficient statistics comprising sums over frames of x̂_{t|T}, M̂_{t|T} and M̂_{t,t-1|T}, where x̂_{t|T} is the Kalman smoother estimate of the expected value of the hidden vector x_t, M̂_{t|T} is the Kalman smoother estimate of the expected value of the moment x_t x_t^T and M̂_{t,t-1|T} is the Kalman smoother estimate of the expected value of the moment x_t x_{t-1}^T; and the maximisation step comprises using the sufficient statistics to: solve a third-order polynomial equation for s_k, where k = 1 : n and n is the dimension of z_t, under the constraint that s_k has an absolute value of less than one, where s_k is the k-th diagonal component of the matrix S; determine q(k); and determine μ_1, μ_y, H and R, where m is the dimension of the speech vectors y_t.
  17. 17. A method according to claim 16 wherein: the maximisation step further comprises determining Q_1 from the sufficient statistics; or Q_1 is set to the identity matrix across the whole linear dynamical model.
  18. 18. A system for speech processing, the system comprising: an input configured to receive one or more linguistic units; and a processor configured to: convert said one or more linguistic units into a sequence of speech vectors for synthesising speech, said conversion using one or more corresponding constrained higher order parametric linear dynamical models; and output said sequence of speech vectors.
  19. 19. A system for training a model for a text-to-speech system, wherein said model is for converting a sequence of linguistic units into a sequence of speech vectors for synthesising speech, the system comprising: an input configured to receive speech data comprising training speech vectors and associated linguistic units; and a processor configured to: model speech for each linguistic unit using one or more constrained higher order parametric linear dynamical models; and train the linear dynamical models, said training comprising estimating parameters of the linear dynamical models to fit the models to the associated speech data.
  20. 20. A carrier medium comprising computer readable code configured to cause a computer to perform the method of any of claims 1-17.
GB1507422.2A 2015-04-30 2015-04-30 Speech synthesis using linear dynamical modelling Expired - Fee Related GB2537908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1507422.2A GB2537908B (en) 2015-04-30 2015-04-30 Speech synthesis using linear dynamical modelling

Publications (3)

Publication Number Publication Date
GB201507422D0 GB201507422D0 (en) 2015-06-17
GB2537908A true GB2537908A (en) 2016-11-02
GB2537908B GB2537908B (en) 2021-09-15

Family

ID=53488951

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1507422.2A Expired - Fee Related GB2537908B (en) 2015-04-30 2015-04-30 Speech synthesis using linear dynamical modelling

Country Status (1)

Country Link
GB (1) GB2537908B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060053008A1 (en) * 2004-09-03 2006-03-09 Microsoft Corporation Noise robust speech recognition with a switching linear dynamic model
US20080046245A1 (en) * 2006-08-21 2008-02-21 Microsoft Corporation Using a discretized, higher order representation of hidden dynamic variables for speech recognition


Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230430