GB2537907A - Speech synthesis using dynamical modelling with global variance - Google Patents

Speech synthesis using dynamical modelling with global variance

Info

Publication number
GB2537907A
GB2537907A GB1507420.6A GB201507420A
Authority
GB
United Kingdom
Prior art keywords
speech
parameters
vectors
hidden
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1507420.6A
Other versions
GB201507420D0 (en)
GB2537907B (en)
Inventor
Maia Ranniery
Digalakis Vassilis
Diakoloukas Vassilis
Tsiaras Vassilis
Stylianou Ioannis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1507420.6A priority Critical patent/GB2537907B/en
Publication of GB201507420D0 publication Critical patent/GB201507420D0/en
Publication of GB2537907A publication Critical patent/GB2537907A/en
Application granted granted Critical
Publication of GB2537907B publication Critical patent/GB2537907B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A text-to-speech (TTS) system is trained according to a linear dynamic model (LDM) whereby text is converted to a sequence of linguistic units (e.g. phonemes, sub-phonemes), each state of which is looked up in an acoustic model table to produce a sequence of speech vectors, which is adjusted to increase the variance v(d) of the speech vectors based on a predefined global variance v before being output as speech. A predefined number T of hidden vectors xt evolve according to a state equation involving an observation matrix H, a state transformation matrix F, covariance matrices Q and R, and mean vectors. Second order LDMs may be constrained to be critically damped towards a target q, and speech parameter trajectories Y may be calculated according to a steepest ascent method.

Description

SPEECH SYNTHESIS USING LINEAR DYNAMICAL MODELLING WITH GLOBAL
VARIANCE
FIELD
Embodiments described herein relate generally to a system and method of speech processing.
BACKGROUND
Text to speech systems are systems where audio speech or audio speech files are outputted in response to reception of a text file.
Text to speech systems are used in a wide variety of applications such as electronic games, E-book readers, E-mail readers, satellite navigation, automated telephone systems and automated warning systems.
There is a continuing need to make efficient systems which sound more like a human voice.
BRIEF DESCRIPTION OF THE FIGURES
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures, in which:
Figure 1 shows a text to speech system;
Figure 2 shows a text-to-speech method;
Figure 3 shows how a phoneme relates to linear dynamical models;
Figure 4 shows a representation of a hidden Markov chain representing an LDM of an embodiment;
Figure 5 shows how speech is synthesised using a linear dynamical model of an embodiment;
Figure 6 shows a method of training an acoustic model;
Figure 7 shows a method of clustering similar linguistic units together;
Figure 8 shows the mel-cepstral distance as a function of hidden space dimension;
Figure 9 shows the state space trajectory for a first order LDM and a second order critically damped LDM;
Figure 10 shows a method of calculating the global variance of speech vectors;
Figure 11 shows a method of generating speech parameters with global variance; and
Figure 12 shows, for a sample waveform, the natural and synthesized trajectories of the 32nd mel-cepstral coefficient of a given utterance.
DETAILED DESCRIPTION
Higher order Linear Dynamical Models
According to a first embodiment there is provided a method of speech processing, the method comprising receiving one or more linguistic units, converting said one or more linguistic units into a sequence of speech vectors for synthesising speech, said conversion using one or more corresponding constrained higher order parametric linear dynamical models, and outputting said sequence of speech vectors.
By utilising higher order linear dynamical models the resultant speech vectors have a reduced number of artefacts relative to first order linear dynamical models. This is due to the reduction in the number of discontinuities in the state trajectories of the linear dynamical models. By utilising linear dynamical models, the footprint of the model is reduced, as LDMs capture a greater range of dynamics than Hidden Markov Model (HMM) systems, which require the modelling of not only the speech parameters but also the first and second derivatives of the speech parameters. In contrast, LDMs require only a single observation equation to obtain similar results. In addition, each phoneme may be segmented into a smaller number of states as the linear dynamical models are more effective at modelling dynamics over time. This ensures a reduction in the number of models per phoneme and therefore a reduction in the number of parameters required to model speech. Constraining the linear dynamical models ensures that the system is stable. In one embodiment, hidden states are used to determine the speech vectors and the evolution of the hidden states over time is modelled using the one or more higher order linear dynamical models.
A linguistic unit may be a phoneme or a grapheme or may be a segment of a phoneme or a grapheme, such as a sub-phoneme or a sub-grapheme. The speech processing may be a text to speech method which comprises receiving text and determining a sequence of linguistic units from the text.
In one embodiment, the one or more constrained higher order linear dynamical models (LDMs) are second order linear dynamical models. Second order LDMs better represent the movement of the articulators which produce speech, since the movement of the articulators also follows second order equations. Accordingly, second order LDMs provide a more accurate model of speech synthesis.
In one embodiment, hidden states are used to determine the speech vectors and the evolution of the hidden states is modelled using the second order critically damped LDMs. In one embodiment, the one or more linear dynamical models describe critically damped task dynamic gestures towards targets. As the LDMs are constrained to be critically damped, they are stable and evolve over a long period of time towards target values. Without constraining the LDMs there would be divergence away from the target values.
In one embodiment, the conversion comprises, for each of the one or more linguistic units: selecting an associated linear dynamical model; determining a predefined number T of hidden vectors x_t according to a state evolution equation wherein the hidden vectors x_t for frame t are:

x_1 ~ N(mu_1, Q_1)
x_t = F x_{t-1} + q + w; w ~ N(0, Q)

and determining a sequence of speech vectors y_t based on the hidden vectors x_t according to the observation equation:

y_t = H x_t + mu_y + v; v ~ N(0, R),

wherein each hidden vector x_t is a vector representing hidden parameters z_t:

x_t = [z_t; z_{t-1}],

H is an observation matrix, Q_1, Q and R are covariance matrices, mu_1 and mu_y are mean vectors, F is a state transformation matrix, q is a target vector for the hidden states, and T, F, R, q, mu_1 and mu_y are defined by the respective linear dynamical model. By utilising x_t to represent the further hidden parameters z_t and z_{t-1}, the second order linear dynamical model can be represented as a first order system.
Hidden vectors x_t are inherently hidden parameters in their own right. The hidden vectors x_t allow the second order dynamics of the hidden parameters z_t to be represented in a first order format.
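For illustration only, the following minimal sketch (Python with NumPy; all numeric values, dimensions and the random observation matrix are invented placeholders, not values from this disclosure) generates a short trajectory under the stacked first order form of the equations above:

import numpy as np

n, m, T = 2, 3, 50                          # hidden dimension, speech vector dimension, frames
rng = np.random.default_rng(0)

S = np.diag([0.9, 0.8])                     # rate matrix S (placeholder diagonal values)
F = np.block([[2 * S, -S @ S],              # F = [[2S, -S^2], [I, 0]]
              [np.eye(n), np.zeros((n, n))]])
q = np.concatenate([np.array([0.1, -0.05]), np.zeros(n)])   # target enters the z_t block only
H = rng.standard_normal((m, 2 * n))         # observation matrix (globally tied in this system)
mu_y = np.zeros(m)                          # observation mean

x = np.zeros(2 * n)                         # initial hidden vector x_1 (taking mu_1 = 0 here)
Y = np.empty((T, m))
for t in range(T):
    w = np.concatenate([0.01 * rng.standard_normal(n), np.zeros(n)])  # noise on the z_t block
    x = F @ x + q + w                       # state evolution: x_t = F x_{t-1} + q + w
    v = 0.01 * rng.standard_normal(m)       # observation noise v ~ N(0, R)
    Y[t] = H @ x + mu_y + v                 # observation: y_t = H x_t + mu_y + v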
In one embodiment the state transformation matrix F obeys:

F = [ 2S  -S^2 ]
    [ I    0   ]
where S is a matrix which determines the rate of critically damped dynamics of the hidden vectors towards the target vector q. This specific form of F effectively models speech dynamics whilst also being able to be solved exactly during training. This means that the system may be trained more efficiently and more effectively than other systems. Previous systems have required numerical methods to train the models. This is less efficient as it involves iterative calculations and ultimately results in approximations. In contrast, the above formulation of F allows an exact solution to be found, resulting in a much faster training process.

In one embodiment the one or more linear dynamical models comprise a plurality of linear dynamical models and the observation equations have parameters which are globally tied across all linear dynamical models. This reduces the footprint of the model by requiring fewer parameters to be stored. In one embodiment the observation matrix H and/or the covariance matrix R are the same for all linear dynamical models. In one embodiment Q and/or Q_1 are set to be equal to the identity matrix. By globally tying parameters across the linear dynamical models, the footprint of the system is greatly reduced without reducing the quality of the synthesised speech.
According to a further embodiment there is provided a method of training a model for a text-to speech system, wherein said model is for converting a sequence of linguistic units into a sequence of speech vectors for synthesising speech. The method comprises receiving speech data comprising training speech vectors and associated linguistic units, modelling speech for each linguistic unit using one or more constrained higher order parametric linear dynamical models, and training the linear dynamical models, said training comprising estimating parameters of the linear dynamical models to fit the models to the associated speech data.
As mentioned above, utilising constrained higher order parametric linear dynamical models provides a model which produces speech vectors with a reduced number of artefacts.
Estimating parameters may comprise finding locally optimum values for the parameters. This may involve finding locally maximum likelihood estimates of the parameters.
In one embodiment the constrained higher order linear dynamical models are second order linear dynamical models. In a further embodiment the one or more linear dynamical models describe critically damped task dynamic gestures towards targets.
In one embodiment each linear dynamical model comprises a state evolution equation which describes a number T of hidden vectors x_t for frame t according to:

x_1 ~ N(mu_1, Q_1)
x_t = F x_{t-1} + q + w; w ~ N(0, Q)

and an observation equation which describes speech parameters y_t for frame t according to:

y_t = H x_t + mu_y + v; v ~ N(0, R).

H is an observation matrix, Q_1, Q and R are covariance matrices, and mu_1 and mu_y are mean vectors.

Each hidden vector x_t is a vector representing hidden parameters z_t:

x_t = [z_t; z_{t-1}],

F is a state transformation matrix according to

F = [ 2S  -S^2 ]
    [ I    0   ]

q is a target vector for the hidden states and S is a matrix, and wherein T, S, R, q, mu_1 and mu_y are defined by the respective linear dynamical model.
In one embodiment S determines the rate of critically damped dynamics of the hidden vectors towards the target vector q.
In one embodiment fitting each of the models to the associated speech data comprises an expectation maximisation method comprising: a) an expectation step comprising obtaining sufficient statistics for the linear dynamical model via a Kalman filter followed by a Kalman smoother; b) a maximisation step comprising using the sufficient statistics to determine estimates for the parameters S, H, R, q, mu_1 and mu_y and updating the linear dynamical model with these estimates; and c) repeating steps a) and b) until a local maximum is reached.
In one embodiment, the expectation step obtains the sufficient statistics x̂_{t|T}, P̂_{t|T} for t = 1, ..., T and P̂_{t,t-1|T} for t = 2, ..., T, where x̂_{t|T} is the Kalman smoother estimate of the expected value of hidden vector x_t, P̂_{t|T} is the Kalman smoother estimate of the expected value of the moment x_t x_t^T, and P̂_{t,t-1|T} is the Kalman smoother estimate of the expected value of the moment x_t x_{t-1}^T. The maximisation step comprises using the sufficient statistics to: solve for s_k, for k = 1:n where n is the dimension of z_t, a third order equation (equation 14a below) under the constraint that s_k has an absolute value of less than one, where s_k is the kth diagonal component of matrix S; determine q(k) according to equation 14b below; and determine the remaining parameters of the observation equation, where m is the dimension of the speech vectors y_t.
As the linear dynamical models are parametric, they may be solved and therefore result in a more efficient method of training.
In one embodiment, the maximisation step further comprises determining the covariance matrices Q and Q_1 from the sufficient statistics; alternatively, Q and/or Q_1 are set to the identity matrix across the whole linear dynamical model.

The above methods may be implemented on a system or device.
In a further embodiment there is provided a system for speech processing, the system comprising an input configured to receive one or more linguistic units and a processor. The processor is configured to convert said one or more linguistic units into a sequence of speech vectors for synthesising speech, said conversion using a one or more corresponding constrained higher order parametric linear dynamical models; and output said sequence of speech vectors.
In an additional embodiment there is provided a system for training a model for a text-to speech system, wherein said model is for converting a sequence of linguistic units into a sequence of speech vectors for synthesising speech. The system comprises an input configured to receive speech data comprising training speech vectors and associated linguistic units and a processor.
The processor is configured to model speech for each linguistic unit using one or more constrained higher order parametric linear dynamical models; and train the linear dynamical models, said training comprising estimating parameters of the linear dynamical models to fit the models to the associated speech data.
In one embodiment there is provided a carrier medium comprising computer readable code configured to cause a computer to perform one or more of the above methods.
Global Variance
According to a first embodiment there is provided a method of speech processing comprising receiving one or more linguistic units, and converting said one or more linguistic units into a sequence of speech vectors for synthesising speech using one or more linear dynamical models. The method further comprises adjusting the speech vectors to increase the variance of the speech vectors based on a predefined global variance and outputting the adjusted speech vectors.
Linear dynamical models better describe the dynamics of speech. This allows fewer models to be used and fewer parameters to be utilised thereby reducing the footprint of the model. By adjusting the speech vectors based on a predefined global variance the resulting speech vectors produce more natural sounding speech.
A linguistic unit may be a phoneme or a grapheme or may be a segment of a phoneme or a grapheme, such as a sub-phoneme or a sub-grapheme. The speech processing may be a text to speech method which comprises receiving text and determining a sequence of linguistic units from the text.
Embodiments described herein apply equally to first order linear dynamical models as well as to higher order linear dynamical models (such as those described in the section "Higher Order Linear Dynamical Models"). The predefined global variance may be obtained through measuring the global variance of a training set of speech vectors.
In one embodiment adjusting the speech vectors comprises a gradient ascent method to find a set of speech vectors which increases a likelihood function which is based on the one or more linear dynamical models and a predefined global variance probability distribution. This provides a quick and efficient method of applying the predefined global variance to the speech vectors. The method of speech processing may be applied to speech synthesis. It is important for the method of speech synthesis to be quick as speech may need to be produced in real-time.
In one embodiment the gradient ascent method comprises determining an adjusted speech trajectory Y by: a) for each speech vector y_t of the speech trajectory Y, determining an updated value for the speech vector y_t according to

y_t = y_t + a * dL/dy_t

where a is a predefined step size parameter and L is the likelihood function; and b) repeating step a) a predetermined number of times.
The predetermined number of times may depend on the level of global variance which is required and the desired speed of the method. In one embodiment the predefined number is between 5 and 10. This has been found to provide an acceptable level of global variance without significantly slowing the method down.
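As an illustration only, the adjustment loop can be written as the following minimal sketch (Python; grad_L is a stand-in for the gradient of the likelihood function, one assumed form of which is sketched further below; the step size and iteration count are placeholders):

def apply_global_variance(Y, grad_L, alpha=0.01, n_iters=8):
    # Y is a (T x m) array of speech vectors; step a) updates every frame,
    # and step b) repeats the update a predetermined number of times (5-10).
    Y = Y.copy()
    for _ in range(n_iters):
        G = grad_L(Y)            # per-frame gradients dL/dy_t, stacked as a (T x m) array
        Y = Y + alpha * G        # y_t = y_t + a * dL/dy_t for every frame t
    return Y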
In one embodiment each of the one or more linear dynamical models obeys a state evolution equation for hidden parameters x_t for frame t:

x_1 ~ N(mu_1, Q_1)
x_{t+1} = F x_t + q + w; w ~ N(0, Q)

and each hidden parameter is used to determine speech vector y_t for frame t according to the observation equation:

y_t = H x_t + mu_y + v; v ~ N(0, R).

H is an observation matrix, Q_1, Q and R are covariance matrices, mu_1 and mu_y are mean vectors, F is a state transformation matrix, and q is a target vector for the hidden states, wherein F, H, R, q, mu_1 and mu_y are defined by the respective linear dynamical model.
This embodiment may apply to first order linear dynamical models, or to second order linear dynamical models (as described herein), where the hidden parameters x_t are hidden vectors which represent further hidden parameters z_t:

x_t = [z_t; z_{t-1}].

In one embodiment, a constrained higher order linear dynamical model is used. Constraining the linear dynamical models ensures that the system is stable.
In one embodiment the gradient of the likelihood function for a given speech vector y_t is determined by:

dL/dy_t = -R^{-1}(y_t - H x̂_t - mu_y) - (2/T'') Sigma_v^{-1}(v - mu_v) .* (y_t - ȳ)

where T'' is the total number of frames t to be synthesised, .* denotes the element-wise multiplication of two vectors, v is the variance of the set of speech vectors Y, ȳ is the mean of the speech vectors, Sigma_v is a covariance matrix for the global variance and mu_v is a mean of the global variance. This provides a general method of adding global variance to linear dynamical models.

In one embodiment the one or more linear dynamical models comprise a plurality of linear dynamical models and the observation equations have parameters which are globally tied across all linear dynamical models. This reduces the footprint of the model by requiring fewer parameters to be stored. In one embodiment the observation matrix H and/or the covariance matrix R are the same for all linear dynamical models. In one embodiment Q and/or Q_1 are set to be equal to the identity matrix. By globally tying parameters across the linear dynamical models, the footprint of the system is greatly reduced without reducing the quality of the synthesised speech.
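Since the formula above has been reconstructed from an imperfect source, the following sketch (Python with NumPy) should be read as one plausible realisation rather than the definitive gradient; the model-predicted frame means H x̂_t + mu_y are assumed to be supplied as Y_pred:

import numpy as np

def grad_L(Y, Y_pred, R_inv, Sigma_v_inv, mu_v):
    # Y          : (T x m) current speech trajectory
    # Y_pred     : (T x m) per-frame LDM predictions H x_t + mu_y
    # R_inv      : (m x m) inverse observation covariance
    # Sigma_v_inv: (m x m) inverse global-variance covariance
    # mu_v       : (m,)   global-variance mean
    T = Y.shape[0]
    y_bar = Y.mean(axis=0)                       # per-dimension mean of the trajectory
    v = Y.var(axis=0)                            # per-dimension variance of the trajectory
    model_term = -(Y - Y_pred) @ R_inv.T         # -R^{-1}(y_t - H x_t - mu_y) for each frame
    gv_term = -(2.0 / T) * (Sigma_v_inv @ (v - mu_v)) * (Y - y_bar)  # element-wise GV term
    return model_term + gv_term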
The above methods may be implemented on a system or device.
According to one embodiment there is provided a system for speech processing, the system comprising an input configured to receive one or more linguistic units and a processor. The processor is configured to convert said one or more linguistic units into a sequence of speech vectors for synthesising speech using one or more linear dynamical models; adjust the speech vectors to increase the variance of the speech vectors based on a predefined global variance; and output the adjusted speech vectors.
In one embodiment adjusting the speech vectors comprises a gradient ascent method to find a set of speech vectors which increases a likelihood function which is based on the one or more linear dynamical models and a predefined global variance probability distribution.
In one embodiment the gradient ascent method comprises determining an adjusted speech trajectory Y by: a) for each speech vector y_t of the speech trajectory Y, determining:

y_t = y_t + a * dL/dy_t

where t is the respective frame of the speech vector y_t, a is a predefined step size parameter and L is the likelihood function; and b) repeating step a) a predetermined number of times.
In one embodiment each of the one or more linear dynamical models obeys a state evolution equation for hidden parameters x_t for frame t:

x_1 ~ N(mu_1, Q_1)
x_{t+1} = F x_t + q + w; w ~ N(0, Q)

and each hidden parameter is used to determine speech vector y_t for frame t according to the observation equation:

y_t = H x_t + mu_y + v; v ~ N(0, R).

H is an observation matrix, Q_1, Q and R are covariance matrices, mu_1 and mu_y are mean vectors, F is a state transformation matrix, and q is a target vector for the hidden states, wherein F, H, R, q, mu_1 and mu_y are defined by the respective linear dynamical model.
In one embodiment the gradient of the likelihood function for a given speech vector y_t is determined by:

dL/dy_t = -R^{-1}(y_t - H x̂_t - mu_y) - (2/T'') Sigma_v^{-1}(v - mu_v) .* (y_t - ȳ)

where T'' is the total number of frames t to be synthesised, .* denotes the element-wise multiplication of two vectors, v is the variance of the set of speech vectors Y, ȳ is the mean of the speech vectors, Sigma_v is a covariance matrix for the global variance and mu_v is a mean of the global variance.
In one embodiment the one or more linear dynamical models comprise a plurality of linear dynamical models and the observation equations have parameters which are globally tied across all linear dynamical models. In one embodiment Q and/or Q_1 are set to be equal to the identity matrix.
In one embodiment there is provided a carrier medium comprising computer readable code configured to cause a computer to perform one or more of the above methods.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
Text to Speech
There has been great interest in statistical parametric speech synthesis during the last years, particularly with approaches based on standard hidden Markov models (HMMs) and their variations. Natural sounding speech has been synthesized by using HMMs and the quality of the best HMM-based synthesis systems is close to the quality of the best unit selection synthesis systems. However, although HMMs can be a relatively efficient modelling scheme for speech, they suffer from a number of limitations that have been pointed out in the literature. The HMM limitations derive from assumptions such as: a) conditional independence of observations given the state sequence and b) the speech statistics of each state do not change dynamically.
A simple mechanism for capturing time dependence is to augment the observation space with feature derivatives and use their relationships to produce smoother trajectories during synthesis. However, the standard HMM parameter estimation algorithm can only be used under the assumption that the static and dynamic feature sequences are independent. The inconsistency of this mechanism was solved by the trajectory HMM, which imposes relationships between static and dynamic feature vector sequences during training as well. Although the trajectory HMM improved the quality of synthesized speech, the challenge remains to make further progress with models which can easily and consistently be used for both parameter estimation and synthesis. Two such models which also explicitly capture the dynamics of speech are the autoregressive models and the linear dynamical models (LDMs). Both models have low computational requirements in synthesis, and are suitable for applications with low-latency and real-time requirements.
LDMs have been used in speech recognition but there are only preliminary efforts concerning their use in speech synthesis, and these were based on segmentation and clustering produced by HMM systems. From this point of view, these systems are mainly LDM-HMM hybrids rather than LDM-based synthesizers.
Most parametric speech synthesis systems are based on the HMM-based speech synthesis toolkit (HTS), which is a publicly available toolkit. Programmatically, an LDM-based system may be built as an extension of the HTS; however, incorporating LDMs in the HTS framework requires extended modifications which are not trivial and results in a final system which is difficult to maintain and extend. For this reason, and in order to have the flexibility to experiment with alternative state-space models, a new LDM-based speech synthesis system has been developed from scratch.
In practice, training of the LDMs using the present system is performed in two phases.
First, the decision trees are constructed for each segment and parameter type. The linguistic questions as well as the full context labelling of the training examples are provided as input files to the LDM-synthesis system. Then, the LDM models associated with the leaves of the trees can be used for synthesis. At this point, the system provides the flexibility to retrain new LDMs with variable model configurations, as far as the structure of the parameters is concerned, as well as using different variations of the Expectation Maximisation (EM) algorithm. Synthesis is then performed by producing a trajectory of speech parameters which maximizes the likelihood given the LDMs under the constraint of global variance.
Accordingly, embodiments of the present invention make a major step towards building a complete LDM-based speech synthesis system.

Figure 1 shows a text to speech system 1. The text to speech system 1 comprises a processor 3 which executes a program 5. Text to speech system 1 further comprises storage 7. The storage 7 stores data which is used by program 5 to convert text to speech. The text to speech system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a text input 15. Text input 15 receives text. The text input 15 may be for example a keyboard. Alternatively, text input 15 may be a means for receiving text data from an external storage medium or a network.
Connected to the output module 13 is an output for audio 17. The audio output 17 is used for outputting a speech signal converted from text which is input into text input 15. The audio output 17 may be for example a direct audio output, e.g. a speaker, or an output for an audio data file which may be sent to a storage medium, a network, etc. In use, the text to speech system 1 receives text through text input 15. The program 5 executed on processor 3 converts the text into speech data using data stored in the storage 7. The speech is output via the output module 13 to audio output 17.
A simplified process will now be described with reference to Figure 2. In the first step, S101, text is inputted. The text may be inputted via a keyboard, touch screen, text predictor or the like. The text is then converted into a sequence of linguistic units. These linguistic units may be phonemes or graphemes or may be segments of phonemes or graphemes, such as sub-phonemes or sub-graphemes. The units may be context dependent, e.g. triphones which take into account not only the phoneme which has been selected but also the preceding and following phonemes. The linguistic units may be a sequence of phonetic and prosodic contextual units (full context labels). The text is converted into the sequence of linguistic units using techniques which are well known in the art (such as the Festival Speech Synthesis System from the University of Edinburgh).
If the text is divided into phonemes, each phoneme is divided into a predefined number of sub-phonemes. HMM schemes typically utilise 5 sub-phonemes. Due to the improved temporal dynamics of LDMs, a reduced number of segments may be used. In one embodiment the number of segments per phoneme is 3. Each segment may, for instance, be a sub-phoneme. Each segment has its own corresponding acoustic model.
In one embodiment, the linguistic units are sub-phonemes, as shown in more detail in Figure 3. In step S105, the corresponding acoustic model for each linguistic unit is looked up.
This may be achieved via a phonetic-to-acoustic map which is predetermined, e.g. via training of the system in order to fit models to linguistic units. In step S109 each acoustic model is used to produce a sequence of speech parameters or speech vectors over time. Traditionally, the acoustic model is a Hidden Markov Model (HMM); however, this requires a large number of parameters to be trained in order to fit the model to normal speech. In embodiments of the present invention, as shall be described below, a second order critically damped linear dynamical model is used. This provides a much smaller number of parameters than an HMM while producing synthesised speech of a similar quality. Accordingly, the system has a much smaller footprint than equivalent HMM systems.
Once a sequence of speech vectors has been determined, synthesised speech is output in step S111. The output speech signal is represented by speech parameters, or speech vectors. These speech vectors relate to a number of features of the output speech signal, such as the fundamental frequency (F0), the band aperiodicity (BAP) and the mel-cepstrum coefficients. The output vectors can be used to generate an output speech waveform using a vocoder.
Figure 3 shows how a phoneme relates to linear dynamical models. A phoneme 201 with context is split into a predefined number of sub-phonemes (211, 213, 215). In this embodiment, each phoneme is split into three sub-phonemes. Each sub-phoneme is a linguistic unit and has its own corresponding LDM (221, 223, 225). The three sub-phonemes share the same context. Each LDM models speech to output a predetermined number of frames T of speech vectors y (231, 233, 235). Each frame represents a period of speech. The number of frames, T, for each LDM may vary depending on a duration model. The duration model is learned from the training data. LDM1 models T1 frames, LDM2 models T2 frames and LDM3 models T3 frames. The three sequences of speech vectors (231, 233, 235) are concatenated to form an output set of speech vectors (241) which may be used to synthesise speech for the phoneme, as sketched below.
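A minimal sketch of this concatenation follows (Python with NumPy); the duration attribute T and the synthesise(T, x_init) method are a hypothetical interface, not names defined in this disclosure:

import numpy as np

def synthesise_phoneme(sub_models, x_init):
    # sub_models: the three sub-phoneme LDMs of the phoneme, in order
    # x_init    : hidden vector carried over from the previous linguistic unit
    chunks, x = [], x_init
    for ldm in sub_models:
        Y_i, x = ldm.synthesise(ldm.T, x)   # T_i frames from LDM i, continuing the state
        chunks.append(Y_i)
    return np.vstack(chunks), x             # (T1 + T2 + T3) x m speech vectors, plus final state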
Linear Dynamical Models
As mentioned above, embodiments of the present invention utilise linear dynamical models (LDMs). LDMs are the simplest dynamical models with continuous state vectors. The state evolution process is a linear first-order Gauss-Markov random process while the observation process is a factor analyser. The output of the process follows a time varying multivariate Gaussian distribution.
Table 1 shows symbols which are used throughout the application. Figure 4 shows a representation of a hidden Markov chain representing an LDM of an embodiment. Observations 301, 303, 305 are the final output data, with T observations for a given LDM. This may be synthesised speech or components of synthesised speech such as mel-cepstral coefficients (mcep), fundamental frequencies (F0), band aperiodicity parameters (bap) or sinusoidal parameters. This may also be raw speech which is input to train the system.
The LDM is broken up into a number of states (311, 313, 315), with hidden variables. These represent the position of hidden articulators which cause the output (synthesised speech).

Table 1: Table of Symbols
T                     Number of observations
x_t                   State vector at time t
y_t                   Observation vector at time t
X = [x_1, ..., x_T]   Trajectory of state vectors
Y = [y_1, ..., y_T]   Trajectory of observation vectors
P(.)                  Probability density function
x ~ N(mu, Sigma)      Vector x is normally distributed with mean mu and covariance Sigma
i                     Iteration of the EM algorithm
theta                 Set of parameters
mu_1                  Mean value of the initial state
Q_1                   Covariance of the initial state
F                     State evolution matrix
q                     Mean value of hidden states
Q                     Covariance of hidden states
mu_y                  Mean value of observations
R                     Covariance of observations
x̂_{t|T}               Kalman smoothing estimate of the expected value of x_t
P̂_{t|T}               Kalman smoothing estimate of the expected value of moment x_t x_t^T
P̂_{t,t-1|T}           Kalman smoothing estimate of the expected value of moment x_t x_{t-1}^T

There is one observation (301, 303, 305) which is output for each state (311, 313, 315). Each state represents a different time segment or frame. There are T frames associated with the output for each LDM. T varies dependent on the LDM and is determined when the LDM is trained. Each frame has its own set of hidden variables which form the basis for the calculation of the respective observation 301, 303, 305.
To calculate the observations (301, 303, 305) the hidden variables need to be determined. The LDM first determines these hidden variables by transitioning between states using a state evolution model (331, 333). The hidden variables are then used in corresponding observation models (321, 323, 325) to determine the observations (301, 303, 305). The structure of this graph allows the decomposition of the joint probability of observations and hidden states into simpler factors. To show that, the trajectory of hidden variables is defined:

X = [x_1, ..., x_T]    (1)

along with the trajectory of observation variables:

Y = [y_1, ..., y_T]    (2)

then the joint probability distribution factors are defined as:

P(X, Y) = P(x_1) prod_{t=2}^{T} P(x_t | x_{t-1}) prod_{t=1}^{T} P(y_t | x_t)    (3)

This distribution is Gaussian since, by assumption, all individual factors are Gaussians.
Each observation (speech vector) (301, 303, 305) can be calculated via a corresponding observation model (321, 323, 325) which utilises the hidden parameters:

Observation model: P(y_t | x_t) = N(y_t; h(x_t), R)

1. x_t = hidden variables for state t: abstract state, articulators, sinusoidals, etc.
2. y_t = observation for state t: mceps, F0, bap, sinusoidal parameters, raw speech
3. h(x_t) = an affine transformation or non-linear map, e.g. h(x_t) = H x_t + mu_y
4. R = covariance matrix for the Gaussian distribution

The hidden parameters for a given state are dependent on the hidden parameters for the previous state. To transition between states (to find the relevant hidden variables for the next state) state evolution models (331, 333) are utilised:

State evolution model: P(x_t | x_{t-1}) = N(x_t; f(x_{t-1}), Q)

* x_t = hidden variables for frame t: abstract state, articulators, sinusoidals, etc.
* f(x_{t-1}) = an affine transformation, e.g. f(x_{t-1}) = F x_{t-1} + q
* Q = covariance matrix for the Gaussian distribution

and N(x; mu, Q) is the normal distribution:

N(x; mu, Q) = (2 pi)^{-n/2} |Q|^{-1/2} exp[-(1/2)(x - mu)^T Q^{-1} (x - mu)]

Methods in accordance with embodiments model the state evolution in a parametric way and consider tasks over time, through which the state trajectories should pass. Given one or more measurement sequences and a model, there are three basic tasks that may be performed:

1. Classification: Compute the probability that a measurement sequence Y came from this model.
2. Inference: Compute the probability that the system was in state z at time t, P(x_t = z | Y).
3. Learning: Determine the parameter settings which maximize the probability of the measurement sequences.
Due to the Markov assumption, the classification and inference tasks can be performed with an exact algorithm. The classification task can be solved with a forward pass through the chain, while an additional backward pass is needed for the inference task. On the other hand, it is computationally very hard to find the optimum parameter settings without knowing the hidden state. For this reason, numerical methods are used to estimate locally optimum parameter settings. In one embodiment, the Expectation Maximization (EM) method is used for the parameter estimation. The EM algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which estimates the hidden state trajectory given both the observations and the parameter values, and a maximization (M) step, which involves system identification using the state estimates from the E-step.
An LDM is characterized by a state evolution equation (4b) and an observation space equation (4c) as shown below. The simplest state space models are the LDMs, where both the transition and observation models are described by stochastic affine transformations:

x_1 ~ N(mu_1, Q_1)    (4a)
x_t = F x_{t-1} + q + w; w ~ N(0, Q)    (4b)
y_t = H x_t + mu_y + v; v ~ N(0, R)    (4c)

where x ∈ R^n and y ∈ R^m. The set of parameters theta is defined:

theta = {F, H, q, mu_1, mu_y, Q_1, Q, R}    (5)

F is an n x n state transition matrix and H is an m x n observation matrix. The state x is an n-dimensional vector which evolves according to the linear difference equation (4b), with initial condition x_1 defined by equation (4a). The initial condition x_1 follows a Gaussian distribution (N) with an initial mean value mu_1 and an initial covariance matrix Q_1.
The hidden state x cannot be observed directly. Instead, m-dimensional measurements y are available at discrete sampling times as described by (4c). The vectors w and v are called the state evolution noise and the observation noise respectively and are independent of each other. w and v are Gaussian with a mean of 0 and covariance matrices Q and R respectively. q and mu_y are the state evolution mean and the observation mean respectively.
The estimation of parameters is obtained by maximizing the likelihood:

P(Y | theta) = ∫ P(X, Y | theta) dX    (6)

One of the advantages of an LDM-based speech synthesis system is that, by tying some parameters across all LDM models, a speech synthesis system having a much smaller number of parameters than a similarly performing state-of-the-art HMM-based speech synthesizer can be achieved with a similar quality of output.
Specifically, in the above LDM system, matrix F and vector q in equation (4b) have n x n and n elements respectively. In this embodiment, covariance matrices Q and Q_1 are set to the identity matrix I and so do not contribute to the number of parameters. In the observation equation, vector mu_y has m elements. Matrices H and R are globally tied and have m x n and m x m elements respectively. As matrices H and R are globally tied, all states share the same H and R. Accordingly, all LDMs have the same observation matrix H and covariance matrix R but have their own value for the observation mean mu_y. Each LDM has its own values for the parameters in the state evolution equation, F, q, mu_1. By globally tying matrices H and R and by setting Q and Q_1 to equal the identity matrix, the footprint of the LDMs is greatly reduced.
In embodiments where the observations to be synthesised are mel-cepstra, band aperiodicities or phase information, matrix R is constrained to be diagonal. For sinusoidal features R is not constrained to be diagonal.
Although the covariance matrix R may be constrained to be diagonal, which drastically limits the number of its parameters, in this analysis it is considered to be a full matrix. The mean value mu_1 has n elements. Let L_LDM denote the total number of leaves of the decision tree used to train the LDM. Then the total number of parameters is L_LDM(n^2 + 2n + m) + m x n + m x m + n x n. Since L_LDM >> m > n, the contribution of the terms that are not multiplied by L_LDM is negligible.
In order to compare the footprint of LDMs with the footprint of HMMs, the number of parameters of a typical HMM system is calculated. For each tree leaf there are m + m + m elements for the means of the features and of the corresponding first and second order derivatives. Also, there are m + m + m elements for the diagonal covariance matrices. If L_HMM is the total number of HMM tree leaves then the total number of parameters is L_HMM x 6m.
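As a rough check on this comparison, the two counts can be evaluated with the mel-cepstrum settings quoted in the next paragraph (m = 40, n = 6 and a leaf ratio L_LDM / L_HMM of about 0.62), neglecting the globally tied terms as the text suggests; a minimal sketch in Python:

m, n = 40, 6                       # observation and hidden dimensions for mel-cepstra
leaf_ratio = 0.62                  # measured ratio of LDM tree leaves to HMM tree leaves
ldm_per_leaf = n**2 + 2 * n + m    # per-leaf LDM parameters (from the formula above)
hmm_per_leaf = 6 * m               # per-leaf HMM parameters (means plus diagonal covariances)
print(round(leaf_ratio * ldm_per_leaf / hmm_per_leaf, 2))   # prints 0.23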
The values m and n used in one embodiment for each speech parameter type are: for mel-cepstrum (MCEP), m = 40 and n = 6; for continuous lnF0, m = 1 and n = 1; for band aperiodicity, m = 2 and n = 2; for phase features, m = 20 and n = 4. In this case the ratio L_LDM / L_HMM for mceps is approximately 0.62. This provides the following estimate:

MCEP parameter ratio = (number of LDM parameters) / (number of HMM parameters) ≈ 0.23

Accordingly, by globally tying some of the parameters across all LDMs a much smaller footprint for the speech synthesiser can be obtained.

Second order LDMs
The movement of the articulators which produce speech may be calculated using critically damped spring-mass models:

d^2 x(t)/dt^2 + 2S dx(t)/dt + S^2 (x(t) - g) = w    (7a)

where x(t) is the position of the articulator at a given time t, g is the equilibrium position (the point attractor or target parameter of the system) where the spring is neither stretched nor compressed, S is a mass normalised stiffness parameter which controls the rate of movement of x(t), and w is random noise that extends the spring-mass model to a statistical model. w represents a random force applied to the system.
According to an embodiment, the state evolution equation is modelled using a second order difference equation that has dynamics similar to the spring-mass model. An approximation of the spring-mass model is used. This allows the system to be solved, allowing the models to be more effectively fit to training speech parameters.
The discrete-time version of the spring-mass model is:

x_t = F x_{t-1} + (I - F) q + w    (7b)

where

F = [ e^{-S}(I + S)    e^{-S}        ]
    [ -e^{-S} S^2      e^{-S}(I - S) ],   q = (g; 0),   w ~ N(0, Q)    (7c)

To make the estimation of the parameters more robust, we propose to substitute equation (7b) with a simpler equation that has similar dynamics. The state evolution (4b) is modelled using linear second order task-dynamics, as shown below. A second order recursion is used:

z_t = 2S z_{t-1} - S^2 z_{t-2} + q + w,   w ~ N(0, Q),   q = (I - S)^2 g    (7d)

where z, g, w ∈ R^n, Q ∈ R^{n x n} and

S = diag(s_1, ..., s_n)    (8)

z_t is the hidden variable for frame t, and s_k determines the rate of the critically-damped dynamics of the state trajectories towards the targets g. For simplicity, and also due to the physical characteristics of the articulators' dynamics, the matrix S is assumed to be diagonal.
The above equation can be written in the first order canonical form:

x_t = F x_{t-1} + q + w,   w ~ N(0, Q)    (9)

The augmented state x_t is a hidden vector allowing the conversion from the second order system of equation (7d) to the first order system of equation (9). The hidden vector x_t and the system matrix F are

x_t = [ z_t     ]        F = [ 2S  -S^2 ]
      [ z_{t-1} ]            [ I    0   ]    (10)

respectively. Whilst the actual hidden parameters are z_t, for simplicity the hidden vectors x_t shall be referred to from hereon as hidden parameters. The system input (or control), q, is

q = [ (I - S)^2 g ]
    [ 0           ]    (11)

The form of F in the present embodiment (compared to that in equation (7c)) results in a simpler, more robust speech synthesis system. The discrete equations (9, 10 and 11) have dynamics similar to the spring-mass model. This is the simplest possible difference equation that has dynamics which are critically damped towards a target. This allows the equation to be solved in a maximisation step, thereby allowing the LDM to be effectively fit to a set of training speech parameters. If the spring-mass model were discretised directly, the resulting difference equation could not be reliably solved and would therefore result in a system which is more difficult to train. Accordingly, embodiments provide an improved form of the state evolution equation which ensures more accurate and efficient training of the system.
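The convergence property can be checked with a small sketch (Python with NumPy; the diagonal entries of S and the target g are invented values, and q = (I - S)^2 g is taken from the reconstruction of equation (7d) above):

import numpy as np

n = 2
S = np.diag([0.85, 0.7])                        # diagonal rate matrix (placeholder values)
g = np.array([1.0, -2.0])                       # target towards which trajectories are damped
q_z = (np.eye(n) - S) @ (np.eye(n) - S) @ g     # q = (I - S)^2 g makes g the fixed point
F = np.block([[2 * S, -S @ S],
              [np.eye(n), np.zeros((n, n))]])
q = np.concatenate([q_z, np.zeros(n)])          # system input enters the z_t block only

x = np.zeros(2 * n)
for _ in range(200):                            # noise-free state evolution of equation (9)
    x = F @ x + q
print(np.allclose(x[:n], g))                    # True: critically damped convergence to g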
The observation model is described by an equation that has the same structure as equation (4c):

y_t = H x_t + mu_y + v;   v ~ N(0, R)    (12)

where H is the observation matrix.
The initial state is given by:

x_1 ~ N(mu_1, Q_1)    (13)
Therefore, a second order critically damped LDM is also described by a system of equations that have the form of ordinary LDMs (equations 4a-4c). In embodiments of the invention, the observation and state-evolution equations follow a second order critically damped linear dynamical model as described with reference to equations 7-13. Unlike previous LDMs, the state evolution matrix F has a specific form which reduces the footprint of the LDMs by up to 90% compared to a full matrix F, since F now has n parameters instead of the n^2 parameters that a full matrix F has.
In addition, the LDMs of the present embodiment are fully parametric. Accordingly, exact solutions can be found during training (as shown below), which is more efficient than optimising a non-parametric model.

Figure 5 shows how speech is synthesised using a linear dynamical model of an embodiment. In step S401 a linguistic unit to be modelled is received. The acoustic model associated with the linguistic unit is then chosen in step S403. The association between linguistic units and acoustic models is predefined during the training of the system, as shall be discussed in detail below. All acoustic models obey the linear dynamical model of equations 8-13, with each acoustic model having associated values for the parameters F, q, mu_1, mu_y, Q_1, Q and R. These parameters are set according to the LDM in step S405. Each acoustic model outputs T frames, where T may be different depending on the LDM. T is set according to the duration model of the acoustic model.
T is set during training of the acoustic model. In step S405, the value of T is set according to the LDM. This sets the total number of hidden states and the total number of speech vectors which are calculated for the LDM.
The hidden parameter x_t for each state t is then calculated, allowing the speech parameters y_t for each state to be calculated.
In step S407, it is determined whether the current linguistic unit is the first linguistic unit. If it is, then t is set to 1 and the initial hidden parameter x1 is calculated via equation 13 (S409).
If the linguistic unit is not the first linguistic unit in the current utterance, then t is set to 1 and the initial hidden parameter x_1 is set to the value of the hidden parameter of the last state of the previous linguistic unit (S411). This provides continuity when transitioning between models. Once the initial hidden parameter x_1 has been set, the initial speech parameter y_1 for the first state is calculated via equation 12 (S413).
At step S415 the method moves to the next state (t = t + 1). At step S417 the hidden parameter for the state is calculated using the value of the hidden variable for the previous state (x_{t-1}) via equations 8-11. The speech variable for this state is then calculated using equation 12. In step S419 it is checked whether the maximum number of states (T) has been reached. If T has not been reached, then the method loops back to step S415 so that the hidden and speech variables for the next state may be calculated. If T has been reached, then the speech vectors Y for the linguistic unit are output so that they may be used to synthesise speech (S421). By repeating the method of Figure 5 for each linguistic unit, sequences of speech vectors can be output to synthesise whole utterances.
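The control flow of Figure 5 can be summarised in the following sketch (Python with NumPy); the attributes mu1, Q1, F, q, H, mu_y and T on each model object are a hypothetical interface, and state evolution noise is omitted for brevity:

import numpy as np

def synthesise_utterance(models, rng=np.random.default_rng(0)):
    Y, x = [], None
    for i, mdl in enumerate(models):
        if i == 0:
            x = rng.multivariate_normal(mdl.mu1, mdl.Q1)   # S409: draw x_1 via equation 13
        # otherwise x is inherited from the last state of the previous unit (S411)
        Y.append(mdl.H @ x + mdl.mu_y)                     # S413: y_1 via equation 12
        for _ in range(mdl.T - 1):                         # S415/S419: advance until t = T
            x = mdl.F @ x + mdl.q                          # S417: state evolution
            Y.append(mdl.H @ x + mdl.mu_y)                 # speech vector via equation 12
    return np.array(Y)                                     # S421: speech vectors for synthesis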
Training the LDMs (Expectation Maximisation)
To fit the LDMs to linguistic units the models must be trained. This involves determining the optimum values for the parameters F, q, mu_y, mu_1, Q_1, Q and R for each acoustic unit. This can be achieved via Expectation Maximisation based on a training set of speech vectors associated with their corresponding linguistic units.
The parameters and the hidden state of the system can be jointly estimated with Expectation Maximization (EM). The EM iteration alternates between performing an expectation (E) step, which estimates the hidden state trajectory given both the observations and the parameter values, and a maximization (M) step, which involves system identification using the state estimates from the E-step. Each one of these steps is efficiently calculated. When the state is given, the parameters are estimated from closed form algebraic formulas. On the other hand, when the parameters of an LDM are known, the marginal probabilities and the sufficient statistics (used in the equations for estimating the parameters) are calculated by a forward pass (Kalman filter) followed by a backward pass through the Markov chain of the probabilistic interactions (Kalman smoother).

Figure 6 shows a method of training the acoustic model. In step S501 the method receives the training speech parameters y and initiates with initial estimates of the model parameters F, H, mu_1, Q_1, Q, R, q and mu_y. In one embodiment, the covariance matrices Q and Q_1 are fixed as Q = Q_1 = I. This removes a degeneracy of the model and does not restrict its generality. Also, in order to ensure the stability of the model, the transition matrix F is constrained to have spectral radius less than or equal to one.
The initial estimates for the model parameters and the training speech trajectories are used in an expectation step (S503 and S505). A Kalman filter is implemented in the first part of the expectation step (S503) to calculate the marginal probabilities and the statistics x̂_{t|t}, P̂_{t|t}, t ∈ {1, ..., T} and x̂_{t|t-1}, P̂_{t|t-1}, t ∈ {2, ..., T} (see Algorithm 1). Algorithm 1 computes recursively the probability of each hidden variable x_t given the set of speech vectors from frame 1 to t-1 (p(x_t | y_{1:t-1})) and the probability of each hidden variable x_t given the set of speech vectors from frame 1 to t (p(x_t | y_{1:t})), and evaluates the probability density function p(Y). To prevent underflows, Algorithm 1 returns log(p(Y)), which is interpreted as the log-likelihood of the model parameters given the data (log L(theta | Y) = log p(Y | theta)) in the parameter estimation phase.
In step S505, the statistics calculated in the Kalman filter (x̂_{t|t}, P̂_{t|t}, x̂_{t|t-1} and P̂_{t|t-1}) are used in the second part of the expectation step. A Kalman smoother is used to obtain the sufficient statistics x̂_{t|T}, P̂_{t|T}, t ∈ {1, ..., T} and P̂_{t,t-1|T}, t ∈ {2, ..., T} (see Algorithm 2).
Algorithm 1: Kalman Filter
Data: Observations y_{1:T} and model parameters F, Q, H, mu_1, Q_1, R, mu_y
Result: logL = log p(Y) and statistics x̂_{t|t}, P̂_{t|t}, t ∈ {1, ..., T}, and x̂_{t|t-1}, P̂_{t|t-1}, t ∈ {2, ..., T}
/* initialization */
x̂_{1|0} = mu_1; P̂_{1|0} = Q_1; logL = 0
for t = 1:T do
    /* prediction */
    if t > 1 then
        x̂_{t|t-1} = F x̂_{t-1|t-1} + q
        P̂_{t|t-1} = F P̂_{t-1|t-1} F^T + Q
    /* update */
    e_t = y_t - H x̂_{t|t-1} - mu_y
    Sigma_t = H P̂_{t|t-1} H^T + R
    K_t = P̂_{t|t-1} H^T Sigma_t^{-1}
    x̂_{t|t} = x̂_{t|t-1} + K_t e_t
    P̂_{t|t} = P̂_{t|t-1} - K_t H P̂_{t|t-1}
    logL = logL + log N(e_t; 0, Sigma_t)

Algorithm 2: Kalman Smoother
Data: Statistics calculated by the Kalman filter and model parameter F
Result: Statistics x̂_{t|T}, P̂_{t|T}, t ∈ {1, ..., T}, and P̂_{t,t-1|T}, t ∈ {2, ..., T}
x̂_{T|T} and P̂_{T|T} are taken from the filter
for t = T-1:1 do
    J_t = P̂_{t|t} F^T P̂_{t+1|t}^{-1}
    x̂_{t|T} = x̂_{t|t} + J_t (x̂_{t+1|T} - x̂_{t+1|t})
    P̂_{t|T} = P̂_{t|t} + J_t (P̂_{t+1|T} - P̂_{t+1|t}) J_t^T
    P̂_{t+1,t|T} = P̂_{t+1|T} J_t^T

In step S507 a maximisation step is enacted using the sufficient statistics and the training speech parameters. The maximisation step computes values for the parameters which maximise the expected log-likelihood found in the E step. It can be shown that the maximum values for the parameters of a first order LDM can be obtained via the sufficient statistics using the equations in Table 2.
Table 2: Update equations of first order LDMs in the M-step of the EM algorithm, expressing the parameter estimates mu_1, Q_1, F, q, Q, H, mu_y and R in closed form in terms of the sufficient statistics x̂_{t|T}, P̂_{t|T} and P̂_{t,t-1|T} accumulated over the T frames (14).

The situation is slightly different for second order LDMs. To find the estimates for the parameters S and q, the following method is used (see Appendix A for derivations).
For k = 1:n, where n is the dimension of z_t, a third order polynomial equation in s_k is obtained (14a). This equation can always be solved to obtain at least one real value of s_k. A value of s_k is sought which has an absolute value of less than one (|s_k| < 1).
This ensures that the second order LDM is stable. A method of solving this equation can be found in "Numerical Recipes 3rd Edition: The Art of Scientific Computing", William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery, Cambridge University Press, New York, NY, USA, 3rd edition, 2007. After determining s_k, this value can be substituted into equation (14b) to determine q(k).
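By way of example, selecting a stable real root can be done with a standard polynomial solver (Python with NumPy); the coefficients c3..c0 of equation (14a) are assumed to have been accumulated from the sufficient statistics and are not derived here:

import numpy as np

def solve_sk(c3, c2, c1, c0):
    # Roots of the third order equation c3*s^3 + c2*s^2 + c1*s + c0 = 0 (Eq. 14a).
    roots = np.roots([c3, c2, c1, c0])
    real = roots[np.abs(roots.imag) < 1e-9].real     # keep the real solutions
    stable = real[np.abs(real) < 1.0]                # enforce the constraint |s_k| < 1
    if stable.size == 0:
        raise ValueError("no real root with |s_k| < 1 was found")
    return stable[0]   # if several roots qualify, further selection may be applied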
Table 3: Update equations of second order critically damped LDMs in the M-step of the EM method. = = - -
):', Met:u a globa and, --I-
-
8 f z is a real sU1UUO.II of Eq. 14a. sr should he constrained. daft is the of Eq_ 1413_ k = 1: ri Q = H = --. t - :I: -tHcc Acl H e Rmx" H is gle..AT 1 5,_ -A -1115. i Once the updated parameter values have been calculated, it is then determined whether a local maximum has been reached (S509).
If no local maximum has been reached then the method loops back to step S503 to repeat the expectation (S503 and S505) followed by maximisation (S507) based on the updated model parameters. This obtains new estimates for the model parameters.
If a local maximum has been reached then the updated model parameters are stored (S511) so that they may be retrieved at a later date in order to synthesise speech for the corresponding linguistic unit.
Accordingly, the Expectation and Maximisation steps are repeated until the system converges on a local maximum for the parameters of the LDM. This allows the system to determine optimum values for F, H, mu_1, Q_1, Q, R, q and mu_y based on observed speech data Y for a given linguistic unit, as summarised in the sketch below.
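A skeleton of the loop of Figure 6 in Python is given below; kalman_filter, kalman_smoother and m_step are hypothetical stand-ins for Algorithms 1 and 2 and the Table 2/Table 3 update equations:

def train_ldm(Y, params, tol=1e-4, max_iters=100):
    # Y      : (T x m) training speech vectors for the linguistic unit (S501)
    # params : initial estimates of F, H, mu_1, Q_1, Q, R, q and mu_y
    prev_logL = -float("inf")
    for _ in range(max_iters):
        logL, filt_stats = kalman_filter(Y, params)       # E-step, first part (S503)
        suff_stats = kalman_smoother(filt_stats, params)  # E-step, second part (S505)
        params = m_step(Y, suff_stats)                    # M-step update (S507)
        if logL - prev_logL < tol:                        # local maximum reached? (S509)
            break
        prev_logL = logL
    return params                                         # stored for later synthesis (S511)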
Tree-based clustering for LDM-TTS
When training models on speech data, there are often not enough examples of a given linguistic unit for the data to be modelled accurately. To achieve high quality synthesised speech in statistical parametric speech synthesis, it is important to robustly model the acoustic and linguistic contexts. Typical parametric speech synthesizers consist of a huge number of context-dependent models, many of which cannot be robustly trained since there is a limited number of observations in the training set. Sometimes, there is a complete lack of samples for a given linguistic unit, which would result in the system being unable to fit a model to the unsampled linguistic unit.
To address this problem, top-down decision tree based context clustering is usually used. The decision trees do not only contribute to addressing the data sparsity problem, they are also used to model acoustic contexts unseen in the training data. Accordingly, some linguistic units which are similar are clustered together and assigned the same model.
A phonetic decision tree is a binary tree in which a boolean (yes/no) decision is made based on a phonetic question associated with each node. Initially, all states (equivalently, all associated training data) are placed at the root node of a tree. Depending on each answer, the pool of states is successively split until the LDM-likelihood increment is less than the increase of the complexity of the models. Model complexity can be measured with the minimum description length (MDL) criterion:

l = (1/2) rho k log N    (15)

where k is the number of free parameters per LDM model, N is the total number of frames associated with a tree node, and rho is a heuristic scaling factor which in this embodiment is set to 1.
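As a small illustration, the resulting stopping rule can be written directly from equation (15) (Python; the function names are illustrative):

import math

def mdl_threshold(k, N, rho=1.0):
    # l = (1/2) * rho * k * log(N), the complexity penalty of equation (15)
    return 0.5 * rho * k * math.log(N)

def should_split(logL_yes, logL_no, logL_node, k, N, rho=1.0):
    # Split only while the LDM log-likelihood increment exceeds the
    # increase in model complexity from adding one more LDM.
    return (logL_yes + logL_no) - logL_node > mdl_threshold(k, N, rho)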
Previous experiments with LDM-based speech synthesizers were based on decision trees and states derived from corresponding HMM systems. However, this is not optimal for LDMs. Typically a context dependent unit is split into five segments (linguistic units) when modelling using an HMM. Given the duration of most of these units and the number of segments considered, it is common to find sub-phoneme segments that consist of only a single frame. Furthermore, the small time duration of each segment does not allow the LDM to exhibit its ability to better model the temporal dynamics of speech.
In an effort to overcome the above inefficiencies, a simple suboptimal phoneme segmentation rule is adopted. Each phoneme is split into two equally sized segments (linguistic units), the left and the right segment. This simple rule works in practice since LDMs adequately model speech dynamics within each segment.
In order to robustly estimate the parameters of the model that corresponds to each linguistic unit, it is necessary to cluster the units based on their acoustic and linguistic context.
In practice, a different top-down LDM-based phonetic decision tree is built for each of the segments (left and right) and each type of speech parameters considered (mel-cepstrum, lnF0, band-aperiodicity and phase features). Thus, a total of eight decision trees are constructed. Finally, all the observations associated with each leaf of a decision tree are used to estimate the corresponding duration model as a Gaussian distribution.
The computational complexity of a decision tree-based clustering approach is higher than in the HMM case. In HMMs, and autoregressive HMMs, the length of segments (number of frames per segment) is considered to be fixed. This assumption results in parameter estimation formulae that are direct functions of the sufficient statistics collected initially from the training data. In practice, this means that when a cluster is split into two clusters, the likelihoods of the new clusters can be efficiently calculated by accumulating the relevant sufficient statistics, collected once, without reference to the training data. However, for models such as LDMs and trajectory HMMs it is not possible to apply the same approach. For LDMs the sufficient statistics depend on the model, which in turn depends on the training data. When a cluster is split, the model of the parent node cannot be used for either of the two children. Therefore, new model parameters and the corresponding likelihoods have to be iteratively estimated from the data for each new cluster.
Algorithm 3 shows the pseudo-code of LDM-based decision tree clustering. As can be seen, an LDM is estimated for every new child cluster as we move down the hierarchy. The clustering process may be accelerated using approximation algorithms.
Algorithm 3 relies on coarse grain parallelism to remain practical; i.e., the search for the best question is performed in parallel using all cores of modern processors. In Algorithm 3, L_y and L_n denote the likelihood of the data samples that satisfy and those that do not satisfy question q, respectively. ℓ_MDL is the MDL threshold of (15). The algorithm performs a breadth first construction of the tree using a task queue to store the tree nodes that are candidates for a split. The parallel section performs hypothetical splits, and the actual split is done for the best question of list Q only if this list is not empty. If list Q is empty then the current node, v, is a leaf node.
Algorithm 3: Decision tree clustering
  Data: Training examples and linguistic questions
  Result: The decision tree
  Create the root node, which has pointers to all examples
  taskQueue.push(root)
  while taskQueue.isNotEmpty() do
    v = taskQueue.pop()
    for q in Questions do in parallel
      Split the examples of node v according to q
      Fit an LDM to "yes" examples and calculate L_y
      Fit an LDM to "no" examples and calculate L_n
      if log L_y + log L_n > log L_v + ℓ_MDL then
        Store (q, log L_y + log L_n) into list Q
    if Q.isNotEmpty() then
      Choose the question q* of Q with the largest log L_y + log L_n value
      Split the examples of node v according to q*
      Create node y with pointers to "yes" examples
      Create node n with pointers to "no" examples
      Connect y and n as children of v
      taskQueue.push(y); taskQueue.push(n)

Figure 7 shows a method of clustering similar linguistic units together. In step S701 a set of training speech vectors and associated full context labelling (including linguistic units) is received. In step S703 a single LDM is fit to all speech vectors using the above method of training (see Figure 6). The value of the log likelihood for the model, log L_v, calculated during the Kalman filtering step of the EM method, is stored.
The set of questions is then applied, in parallel, to all linguistic units in the model (S705). The group of linguistic units is then split into "yes" and "no" clusters based on the answer to each question (S707). The log likelihoods for the "yes" and "no" clusters are calculated in step S711.
In step S713, if the cumulative log-likelihood of the "yes" and "no" clusters is greater than the log likelihood for the parent cluster in addition to the MDL threshold ℓ_MDL, then the question q and the cumulative log likelihood of the "yes" and "no" clusters are stored. In step S715 it is checked whether any questions were stored from the above parallel processing of questions. If not, then the parent model is the best model for all of the received linguistic units and it is assigned to the unsplit cluster (S717). This model is then used when synthesising speech for any of the received linguistic units. If one or more questions have been stored, then the question with the highest cumulative log-likelihood of the "yes" and "no" clusters is chosen (S719). The models fitted to the "yes" and "no" clusters are assigned to the respective clusters (S721) and the above method is then repeated for these "yes" and "no" clusters (S723). Through this method, a decision tree is iteratively formed until all linguistic units are assigned an LDM (even if this is shared between linguistic units).
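By way of illustration, a minimal Python sketch of this greedy splitting loop is given below; fit_ldm, log_likelihood, mdl_penalty and the question callables are assumed helpers, not part of the embodiment:

from collections import deque

def build_tree(examples, questions, fit_ldm, log_likelihood, mdl_penalty):
    # Breadth-first construction with a task queue, as in Algorithm 3.
    root = {"examples": examples, "model": fit_ldm(examples)}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        log_lv = log_likelihood(node["model"], node["examples"])
        candidates = []
        for q in questions:  # evaluated in parallel in practice (S705-S711)
            yes = [e for e in node["examples"] if q(e)]
            no = [e for e in node["examples"] if not q(e)]
            if not yes or not no:
                continue
            # Refit an LDM per child; sufficient statistics cannot be cached.
            m_yes, m_no = fit_ldm(yes), fit_ldm(no)
            gain = log_likelihood(m_yes, yes) + log_likelihood(m_no, no)
            if gain > log_lv + mdl_penalty:  # S713
                candidates.append((gain, yes, no, m_yes, m_no))
        if candidates:  # S715/S719: keep the best question, else leaf (S717)
            _, yes, no, m_yes, m_no = max(candidates, key=lambda c: c[0])
            node["yes"] = {"examples": yes, "model": m_yes}
            node["no"] = {"examples": no, "model": m_no}
            queue.append(node["yes"])
            queue.append(node["no"])
    return root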
After clustering, the duration model of each leaf cluster (or leaf model) is determined.
This provides the number of frames (T) for each LDM. The number of frames differs for each LDM. Each cluster contains a number of training examples (segments of speech parameters). Each of these segments consists of a number of frames. The duration (number of frames) for each leaf cluster (associated with a given LDM) is modelled with a Gaussian with mean and standard deviation calculated from the number of frames of the associated training segments.
That is, the duration of this cluster (this LDM) follows a normal distribution with mean

T̄_LDM = (T_1 + T_2 + ... + T_N) / N    (16)

and variance

σ²_LDM = Σ_{n=1}^{N} (T_n − T̄_LDM)² / N    (17)

where T_n is the number of frames in the nth example of the N speech parameter segments assigned to the leaf cluster (to the LDM).
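As a sketch, the Gaussian duration model of Eqs. 16 and 17 can be estimated as follows (numpy-based; the names are illustrative):

import numpy as np

def duration_model(frame_counts):
    # Eqs. 16-17: Gaussian over the frame counts of the training
    # segments assigned to a leaf cluster (biased variance, divided by N).
    T = np.asarray(frame_counts, dtype=float)
    mean = T.mean()                   # Eq. 16
    var = ((T - mean) ** 2).mean()    # Eq. 17
    return mean, var

# e.g. four training segments of 12, 15, 9 and 14 frames:
mean_T, var_T = duration_model([12, 15, 9, 14])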
Figure 8 shows the mel-cepstral distance as a function of hidden space dimension. The cepstral distance in dB between two sequences of mel-cepstral coefficient sets is given by

CD = (10 / ln 10) (1/T) Σ_{t=1}^{T} √( 2 Σ_{i=1}^{m} ( c_{t,1}(i) − c_{t,2}(i) )² )    (18)

where c_{t,1}(i) and c_{t,2}(i) are the i-th mel-cepstral coefficient for the t-th frame of the natural and generated sequences of coefficient sets, respectively, with T being the number of frames and m the cepstrum order. Smaller distances correspond to better modelling. It can be seen from the diagram that the mel-cepstral distance between the original and synthetic speech cepstrum approaches its minimum for a relatively small hidden space dimension (6 to 8). This is why, in one embodiment, state vectors consisting of 6 components are chosen.
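For illustration, Eq. 18 can be computed as below (a numpy sketch; the [T, m] array layout, one row of coefficients per frame, is an assumption):

import numpy as np

def cepstral_distance_db(c_nat, c_gen):
    # Eq. 18: mean over frames of the per-frame cepstral distance in dB
    # between natural (c_nat) and generated (c_gen) coefficient sets.
    diff = np.asarray(c_nat) - np.asarray(c_gen)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * per_frame.mean()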
Figure 9 shows the state space trajectory for a first order LDM and a second order critically damped LDM. Both the first order (810) and the second order (820) LDMs have first (830) and second (840) target values towards which the state space trajectories are urged. Both are initially urged towards the first target value (830) before transitioning to being urged towards the second target value (840). This represents the transition between different models over a single utterance. The first order LDM (810) has a discontinuity (850) at the transition between the two target values (830, 840). In contrast, the second order LDM (820) has no discontinuity. The resultant speech synthesised based on the state parameters of the second order LDM (820) would therefore have fewer artefacts and therefore sound more natural.
Accordingly, second order LDMs synthesise speech with a smaller number of artefacts than first order LDMs.
Global Variance

Whilst LDMs offer flexibility, a small footprint and time-domain dependence, the generated speech often has a muffled quality. The quality of the synthesised speech may be improved by incorporating global variance to increase the variability of the speech and make it sound more life-like.
Global Variance (GV) is defined as the intra-utterance variance of a speech parameter trajectory and is typically modelled by a Gaussian distribution. The GV-based parameter generation method constrains the synthesized trajectories to have the same GV as the GV of the corresponding training data. This improves the quality of synthetic speech generated from statistical parametric synthesizers. In an embodiment, global variance is used in combination with LDMs, providing a speech synthesiser with a reduced footprint compared to HMM-based synthesisers but improved speech synthesis compared to non-GV LDMs. In one embodiment, GV likelihood is used as a penalty at synthesis (GV-based post-filtering).
Global variance v is calculated by:

v = [v(1), v(2), ..., v(m)]^T    (19a)

v(d) = (1/T_u) Σ_{t=1}^{T_u} ( y_t(d) − ȳ(d) )²    (19b)

ȳ(d) = (1/T_u) Σ_{t=1}^{T_u} y_t(d)    (19c)

where v(d) is the variance of the d-th component of an utterance with speech parameters Y = [y_1, ..., y_{T_u}] with T_u frames in the utterance, and ȳ(d) is the mean of the d-th component of the speech parameters over the utterance. The global variance v has a mean μ_v across all utterances and a covariance Σ_v across all utterances.
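By way of illustration, the GV statistics of Eqs. 19 and the Gaussian parameters used in Eq. 22 below can be estimated as follows (a numpy sketch; each utterance is assumed to be a [T_u, m] array, one row per frame):

import numpy as np

def utterance_gv(Y):
    # Eqs. 19b-c: per-component intra-utterance variance v(d).
    return Y.var(axis=0)

def gv_statistics(utterances):
    # Mean mu_v and covariance Sigma_v of the GVs over all training
    # utterances (np.cov returns the sample covariance).
    V = np.stack([utterance_gv(Y) for Y in utterances])  # [N, m]
    return V.mean(axis=0), np.cov(V, rowvar=False)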
Figure 10 shows a method of calculating the global variance of speech vectors. In step S901, a set of training speech parameters is received for a set of N utterances Y_1, Y_2, ..., Y_N. In step S903, the counter for the component of each utterance, d, is set to 1. In step S905, the number of the utterance, i, is set to 1.
In step S907, the variance of the d-th component of the i-th utterance is calculated using equations (19b) and (19c). That is, the mean of component d, ȳ(d), over the frames of the utterance is calculated using equation (19c). This is used to calculate the variance v(d).
In step S909 it is determined whether the last utterance has been reached (i.e. whether i = N). If not, then i is increased by 1 (S911) and the method loops back to step S907 to calculate the variance for the d-th component of the next utterance. If the last utterance has been reached, then the mean variance μ_v(d) and the covariance Σ_v(d) for the d-th component across all N utterances are calculated (S913).
It is then determined whether the last component has been reached (i.e. whether d = m) (S915). If not, then d is increased by 1 (S917) and the method loops back to step S905 to calculate the variance of the next component for the N utterances and the mean and covariance of these variances. If the last component has been reached, then the mean variance μ_v(d) and the covariance Σ_v(d) are stored for all m components (S919) so that they may be used in synthesising speech (see below).

The optimisation criterion used in LDMs is extended to increase the variability of the time-domain trajectories of the LDMs.
As mentioned above, LDMs follow state evolution and observation equations according to equations 4a-c. Global variance may also be applied to the second order critically damped LDM described in equations 8-13. The estimation of parameters is obtained by maximising the likelihood shown in equation 6. In one embodiment, this optimisation criterion is modified so as to include as a weight a posterior probability P(v|θ_v), which modifies accordingly the trajectories of speech parameters generated by the LDM, according to:

P(Y|θ, θ_v) = ∫ p(Y|X, θ)^ω p(v|θ_v) dX    (20)

The contribution of the original optimization criterion is controlled by a parameter ω.
O" is the set of parameters for the distribution of the global variance and v is the variance of Y. In synthesis, the optimum parameter sequence is determined so as to maximize an objective function consisting of the LDM and GV log probability density functions L = In p(YIX, 0) + In p(v10v) T" (21a) where Y are the trajectories of speech parameters (e.g. Cepstrum) and vector v has the variances of the Y trajectories. T" is the duration of Y and hidden state Xis defined as X = arg max p(X10) (21b) and may be obtained via equations 9 and 13.
The objective function L is maximized by a steepest ascent method.
The posterior probability P(v|θ_v) is modelled as a single Gaussian with mean vector μ_v and covariance matrix Σ_v:

p(v|θ_v) = N(v; μ_v, Σ_v)    (22)

The global variance parameters θ_v and the LDM parameters θ are independently trained on the training speech data. The constant ω denotes the weight for controlling a balance between the two likelihoods. In one embodiment, ω is set to 1/T_u.
The following log-scaled likelihood is maximized with respect to Y under the condition of a determined X̂.
In a specific embodiment, for a specific value of ω and considering the posterior probability to have been estimated by a normal distribution, it can be shown that the likelihood has the form:

L = log( p(Y|X̂, θ)^ω p(v|θ_v) )
  = −(ω/2) ( T_u log|R| + Σ_{t=1}^{T_u} (y_t − H x̂_t − μ_y)^T R^{−1} (y_t − H x̂_t − μ_y) )
    − (1/2) ( log|Σ_v| + (v − μ_v)^T Σ_v^{−1} (v − μ_v) )    (23)

For ω = 1/T_u, the above equation can be written as:

L = −(1/2) ( log|R| + (1/T_u) tr( (Y − HX̂ − M_y)^T R^{−1} (Y − HX̂ − M_y) ) )
    − (1/2) ( log|Σ_v| + (v − μ_v)^T Σ_v^{−1} (v − μ_v) )    (24)

where M_y = 1^T ⊗ μ_y = [μ_y, ..., μ_y], and 1 is the all-ones vector of dimension T_u. A Y that maximizes L is needed. For this, the derivative dL/dY is calculated from equation 23. It can be shown (see Appendix B) that:

dL/dY = −(1/T_u) R^{−1}(Y − HX̂ − M_y) + dL_4/dY    (25)

where L_4 is the fourth term of equation 23:

L_4 = −(1/2) (v − μ_v)^T Σ_v^{−1} (v − μ_v)    (26)

From the definition of the scalar-by-matrix derivative (in denominator layout notation), it can be shown (see Appendix B) that:

∂L_4/∂y_t = −(2/T_u) Σ_v^{−1}(v − μ_v) .* (y_t − ȳ)    (27)

where .* denotes the element-wise multiplication of two vectors. The term −(1/T_u) R^{−1}(Y − HX̂ − M_y) can be calculated frame-by-frame. This can be achieved whilst switching between different LDMs (different parameters) during an utterance, while the global variance, which appears in the term dL_4/dY, is calculated for the whole utterance. The maximisation of the likelihood can be achieved via a steepest ascent method.
Steepest Ascent Method: To determine the speech vectors, Y is iteratively updated with the gradient in accordance with a steepest ascent method:

Y^{(i+1)} = Y^{(i)} + α (dL/dY)|_{Y = Y^{(i)}}    (28)

where α is a step size parameter.
There are two possible settings of the initial trajectory Y^{(0)}. In a first embodiment, the trajectory is calculated using the observation equation for the LDM (equations 4c or 12). In a second embodiment, a scaled trajectory Y' is used. The d-th components of the t-th speech vectors y'_t(d) of Y' are linearly converted from those of Y as follows:

y'_t(d) = √( μ_v(d) / v(d) ) ( y_t(d) − ȳ(d) ) + ȳ(d)    (29)

This amplifies the variance of the speech parameters so that it is equal to the mean variance μ_v of the training speech parameters. The speech vectors y'_t(d) with amplified variance are then used as the input for the steepest ascent method. The steepest ascent method aims to make the trajectory optimal according to the log-likelihood of the LDM models whilst ensuring that the trajectory has a global variance close to that of natural speech parameter trajectories.
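A minimal sketch of the scaling of Eq. 29 (numpy; the [T_u, m] layout is an assumption):

import numpy as np

def scale_to_mean_gv(Y, mu_v):
    # Eq. 29: rescale each component so its intra-utterance variance
    # equals the training mean GV mu_v(d); the mean y_bar is preserved.
    y_bar = Y.mean(axis=0)   # Eq. 19c
    v = Y.var(axis=0)        # Eq. 19b
    return np.sqrt(mu_v / v) * (Y - y_bar) + y_bar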
The steepest ascent method is shown in Algorithm 4 and Figure 11.

Algorithm 4: Steepest ascent
  Data: Trajectories of states X̂ and of speech parameters Y given by 4a, 4b and 4c, the corresponding sequence of LDM parameters, and the GV parameters μ_v and Σ_v
  Result: Speech parameters Y close to a local optimum of L
  Scale Y according to Eq. 29; α = 0.025
  for iter = 1 to MaxIter do
    Calculate v(Y) and ȳ; set B = Σ_v^{−1}(v − μ_v)
    for t = 1 to T_u do
      Calculate dL/dy_t (Eq. 31); y_t = y_t + α dL/dy_t (Eq. 32)

Figure 11 shows a method of generating speech parameters with global variance. In step S1001, the sequence of linguistic units for which speech is to be synthesised is received along with the global variance mean μ_v, the covariance Σ_v, and the step size parameter α. The global variance mean μ_v and covariance Σ_v are predefined and stored in memory; they are originally determined from the training speech data (see Figure 10). α may also be predefined and stored in memory. Alternatively, α may be input by the user to set the amount of variance in the synthesised speech.
In step S1003 the appropriate sequence of LDMs is chosen according to the stored association with the received linguistic units (the phonetic to acoustic mapping). The duration T of each LDM and the parameters F, H, μ_1, Q_1, Q, R, q and S for each LDM are selected. By determining the duration T of each LDM, the total duration of the utterance, T_u, is determined. T_u is the total number of frames for the whole utterance.
In step S1005 a sequence of hidden parameters X̂ and speech parameters Y is calculated for the linguistic units according to the LDMs. In one embodiment, a state evolution equation is used to determine the hidden parameters of the LDMs before an observation equation is used to determine the speech parameters (see Figure 5). This may be done separately for each LDM, or may be done in one process, with the parameters of the LDMs changing dynamically as the process passes between LDMs.
In the first embodiment, the hidden parameters and speech parameters of one LDM are calculated before moving on to calculate the parameters for the next LDM. In the latter embodiment, the hidden parameters for multiple LDMs (up to all of the selected LDMs) are calculated before the corresponding speech parameters are calculated.

In step S1006, the speech parameters are scaled according to equation 29. This ensures that the speech vectors have the same variance as the mean variance of the training speech vectors, μ_v.

In step S1007 the variance v(Y) and mean ȳ are calculated and B is set to

B = Σ_v^{−1} (v − μ_v)    (30)

v(Y) and ȳ are calculated using the equations of 19. B is utilised later in the steepest ascent method to calculate dL/dY.
In step S1009, t is set to 1. In step S1011 the t-th values of X̂ and Y are selected. dL/dy_t is then calculated according to:

dL/dy_t = −(1/T_u) R^{−1}( y_t − H x̂_t − μ_y ) − (2/T_u) B .* ( y_t − ȳ )    (31)

and the t-th value of Y is set to:

y_t ← y_t + α dL/dy_t    (32)

where Y(:, t) denotes the t-th value of Y (i.e. y_t). This updates the t-th value of Y based on the gradient of the log likelihood function at that point.
At step S1015 it is checked whether the total number of frames in the utterance, T_u, has been reached. If T_u has not been reached, the method loops back to step S1011 to update the next value of y_t. If T_u has been reached, then it is determined whether the maximum number of iterations of the steepest ascent method (the maximum number of times Y has been updated according to equation 28) has been reached (S1017). This maximum number of iterations may be predefined or may be set by the user. If the maximum number of iterations has not been reached, then the method loops back to step S1007 to calculate a further updated version of Y. If the maximum number of iterations has been reached, then the speech parameters Y are output (S1019).
In an alternative embodiment of Figure 11, step S1006 is omitted and so the speech parameters are not scaled according to equation 29.
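By way of illustration, the loop of Algorithm 4 / Figure 11 can be sketched as follows (a numpy sketch; a single LDM with fixed H, mu_y and R_inv is assumed for brevity, whereas in practice the model parameters change as the utterance passes between LDMs):

import numpy as np

def gv_steepest_ascent(Y, X_hat, H, mu_y, R_inv, mu_v, Sigma_v_inv,
                       alpha=0.025, max_iter=50):
    # Y: [Tu, m] speech parameters (already scaled by Eq. 29 in S1006).
    # X_hat: [Tu, 2n] hidden state trajectory from the state equation.
    Tu = Y.shape[0]
    Y_model = X_hat @ H.T + mu_y          # predicted trajectory H x_t + mu_y
    for _ in range(max_iter):             # outer loop, S1007-S1017
        v = Y.var(axis=0)                 # GV of Eq. 19
        y_bar = Y.mean(axis=0)
        B = Sigma_v_inv @ (v - mu_v)      # Eq. 30
        for t in range(Tu):               # frame loop, S1009-S1015
            grad = (-(1.0 / Tu) * R_inv @ (Y[t] - Y_model[t])
                    - (2.0 / Tu) * B * (Y[t] - y_bar))   # Eq. 31
            Y[t] = Y[t] + alpha * grad                   # Eq. 32
    return Y                              # S1019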
GV expressed as a Product of Experts (PoE)

A product of experts (PoE) combines multiple models (experts) by taking their product and normalizing the result. Each expert can be an unnormalized model P̃(x|θ_l) over the input space. A PoE is expressed as:

P(x|{θ_l}_{l=1}^{L}) = (1/Z) Π_{l=1}^{L} P̃(x|θ_l)    (33)

where Z is a normalisation constant computed as:

Z = ∫ Π_{l=1}^{L} P̃(x|θ_l) dx    (34)

The PoE for speech parameter generation including the GV term is written as:

p(Y|θ, θ_v) = (1/Z_LDM) p(Y|X̂, θ)^ω p(v|θ_v)    (35)

where the GV, v, is determined by the equations of 19.
In synthesis, the normalization constant Z_LDM is ignored, as the maximization of the likelihood is independent of this constant.
Results show that adding global variance significantly increases the subjective score of synthesized speech. By combining global variance with linear dynamical models, embodiments of the invention produce a speech synthesiser with a greatly reduced footprint compared to HMM based synthesisers but which produces speech of comparable quality and which scores better in subjective testing.
LDM Training Considering GV

The log-likelihood of the LDM (Eq. 6) is augmented with the GV log-probability term (Eq. 36). In the following, in order to simplify the notation, θ̃ is used to represent (θ, θ_v). The auxiliary function then becomes the LDM auxiliary function of Eq. A7 with the GV term appended:

Q(θ̃_i, θ̃) = Q(θ_i, θ) − (1/2) ( log|Σ_v| + (v − μ_v)^T Σ_v^{−1} (v − μ_v) )    (37)
As can be seen, the final term of Eq. 37 (equivalent to the final term of Eq. 23) is independent of the earlier terms. This allows the global variance to be trained independently of the LDMs.
Figure 12 shows, for a sample waveform, the natural and synthesized trajectories of the 32nd mel-cepstral coefficient of a given utterance. As can be seen, the synthesized trajectory (dashed line) is over-smoothed compared to the natural one (dashed and dotted line), and this is one of the factors that cause synthesised speech to sound artificial compared to natural speech. However, when global variance from the training samples is used to generate the synthesized trajectory (solid black line), the global variance of the synthetic speech is forced to stay close to that of the natural speech. This results in a more natural trajectory and therefore produces natural sounding synthesised speech.
It should be noted that the method described herein applies to all linear dynamical models, including the first and second order LDMs described herein.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.
Appendix A

Maximum Likelihood Estimation of the Model Parameters

To find the maximum likelihood (ML) estimates for the model parameters, the joint log-likelihood of the data log P(Y|θ) has to be maximised:

L(θ) = log P(Y|θ) = log ( ∫ P(X, Y|θ) dX )    (A1)

Using any distribution Q(X) over the hidden variables, we can obtain a lower bound on L:

L(θ) = log ( ∫ Q(X) [ P(X, Y|θ) / Q(X) ] dX )    (A2)

And using Jensen's inequality, which can be proved by the concavity of the log function, we have:

L(θ) ≥ ∫ Q(X) log ( P(X, Y|θ) / Q(X) ) dX = ∫ Q(X) log P(X, Y|θ) dX − ∫ Q(X) log Q(X) dX    (A3)

We set:

F(Q, θ) = ∫ Q(X) log P(X, Y|θ) dX − ∫ Q(X) log Q(X) dX    (A4)

The EM method alternates between maximising F with respect to the distribution Q and the parameters θ, respectively, holding the other fixed. Starting from some initial set of parameters θ_0 we alternately apply:

E-step: Q_{i+1} = arg max_Q F(Q, θ_i)    (A5a)
M-step: θ_{i+1} = arg max_θ F(Q_{i+1}, θ)    (A5b)

The maximum in the E-step (Eq. A5a) is attained when the conditional distribution is chosen to be the posterior of the state sequence given the observation sequence and the old model parameters, Q_{i+1}(X) = P(X|Y, θ_i). With this choice, the lower bound (A3) holds with equality, F(Q_{i+1}, θ_i) = L(θ_i).
The maximum in the M-step (Eq. A5b) is obtained by maximizing the first term in (A3), since the second term does not depend on θ:

M-step: θ_{i+1} = arg max_θ ∫ P(X|Y, θ_i) log P(X, Y|θ) dX    (A6)

Auxiliary Function

Based on equation A6, we define the following auxiliary function:

Q(θ_i, θ) = E[ log P(X, Y|θ) | Y, θ_i ] = ∫ P(X|Y, θ_i) log P(X, Y|θ) dX    (A7)

From Eq. 3 we have:

log P(X, Y|θ) = log P(x_1) + Σ_{t=2}^{T} log P(x_t|x_{t−1}) + Σ_{t=1}^{T} log P(y_t|x_t)    (A8)

The first term can be written as:

log P(x_1) = −(n/2) log 2π − (1/2) log|Q_1| − (1/2)(x_1 − μ_1)^T Q_1^{−1} (x_1 − μ_1)    (A9)

The second term can be written as:

Σ_{t=2}^{T} log P(x_t|x_{t−1}) = −(n(T−1)/2) log 2π − ((T−1)/2) log|Q| − (1/2) Σ_{t=2}^{T} (x_t − F x_{t−1} − g)^T Q^{−1} (x_t − F x_{t−1} − g)    (A10)

where g denotes the constant term of the state evolution equation. And the third term can be written as:

Σ_{t=1}^{T} log P(y_t|x_t) = −(mT/2) log 2π − (T/2) log|R| − (1/2) Σ_{t=1}^{T} (y_t − H x_t − μ_y)^T R^{−1} (y_t − H x_t − μ_y)    (A11)

To simplify the derivation of update equations for matrix S and vector q, we assume that the covariance Q of the evolution equation and the initial covariance Q_1 are equal to the identity matrix. Setting Q = Q_1 = I ∈ R^{2n×2n}, the auxiliary function can be written in an expanded form (Eq. A12).

M-Step

The optimum parameters can be found by differentiating the auxiliary function Q(θ_i, θ) with respect to the parameters and by setting the derivative to zero. Note that the auxiliary function Q(θ_i, θ) is distinct from the covariance matrices Q and Q_1.

M-Step: Update Equations for the Target Vector q

To find the new target vector q, the auxiliary function in Eq. A12 can be differentiated with respect to q(k) and the derivative set to zero, yielding a closed-form update for q(k) (Eq. A13).

M-Step: Update Equations for the Damping Parameter S

To find the new damping parameter S, the auxiliary function in Eq. A12 can be differentiated with respect to s(k) and the derivative set to zero (Eq. A14). After substituting q(k) from Eq. A13 into Eq. A14, a third order polynomial equation in s(k) is obtained (Eq. A15). As mentioned with regard to Equation 14a, this 3rd order equation can be solved using a known method to obtain at least one real value of s_k ("Numerical Recipes 3rd Edition: The Art of Scientific Computing", William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery, Cambridge University Press, New York, NY, USA, 3rd edition, 2007). A value of s_k is sought which has an absolute value of less than one (|s_k| < 1).

Update Equations for Vector μ_1

To find the new mean vector of the initial state, the auxiliary function in Eq. A12 can be differentiated with respect to μ_1 as follows:

∂Q(θ_i, θ)/∂μ_1 = 2 Q_1^{−1} ( E[x_1|Y, θ_i] − μ_1 ) = 0  ⟹  μ̂_1 = E[x_1|Y, θ_i]    (A16)

Update Equations for Covariance Q_1

To find the new covariance of the initial state, the auxiliary function in Eq. A12 can be differentiated with respect to Q_1^{−1} and set to zero. Taking into account Eq. A16, we have:

Q̂_1 = E[x_1 x_1^T | Y, θ_i] − μ̂_1 μ̂_1^T    (A17)

Update Equations for Covariance Q

To find the new covariance Q, the auxiliary function in Eq. A12 can be differentiated with respect to Q^{−1} and set to zero, which gives:

Q̂ = (1/(T−1)) Σ_{t=2}^{T} E[ (x_t − F x_{t−1} − g)(x_t − F x_{t−1} − g)^T | Y, θ_i ]    (A18)

Update Equations for Vector μ_y

To find the new vector μ_y, the auxiliary function in Eq. A12 can be differentiated with respect to μ_y as follows:

∂Q(θ_i, θ)/∂μ_y = Σ_{t=1}^{T} R^{−1} ( y_t − H E[x_t|Y, θ_i] − μ_y ) = 0

Therefore, using the notation of the sufficient statistics:

μ̂_y = (1/T) Σ_{t=1}^{T} ( y_t − H E[x_t|Y, θ_i] )    (A19)

Update Equations for Matrix H

To find the new matrix H, the auxiliary function in Eq. A12 can be differentiated with respect to H and set to zero. Using the notation of the sufficient statistics we have:

Ĥ = ( Σ_{t=1}^{T} (y_t − μ_y) E[x_t|Y, θ_i]^T ) ( Σ_{t=1}^{T} E[x_t x_t^T|Y, θ_i] )^{−1}    (A20)

If we substitute Eq. A19 into the above equation we obtain the update of Eq. A21. Since H = [H_x, 0_n], we set H(:, n+1:2n) = 0.

Update Equations for Matrix R

To find the new matrix R, the auxiliary function in Eq. A12 can be differentiated with respect to R^{−1} and set to zero. Therefore, using the notation of the sufficient statistics:

R̂ = (1/T) Σ_{t=1}^{T} ( y_t y_t^T − Ĥ E[x_t|Y, θ_i] y_t^T − μ̂_y y_t^T )    (A22)

Appendix B

A Y that maximizes L is needed. For this, the derivative dL/dY is calculated from equation 23. The terms log|R| and log|Σ_v| are independent of Y. Assuming that X̂ is fixed, the derivative of the term tr( (Y − HX̂ − M_y)^T R^{−1} (Y − HX̂ − M_y) ) is:

∂ tr( (Y − HX̂ − M_y)^T R^{−1} (Y − HX̂ − M_y) ) / ∂Y = 2 R^{−1} (Y − HX̂ − M_y)    (B1)

The term (v − μ_v)^T Σ_v^{−1} (v − μ_v) can be expanded as:

v^T Σ_v^{−1} v − 2 μ_v^T Σ_v^{−1} v + μ_v^T Σ_v^{−1} μ_v    (B2)

The differential of the term v^T Σ_v^{−1} v with respect to the matrix element Y(k, r) = y_r(k) (at row k and column r) is:

∂( v^T Σ_v^{−1} v )/∂Y(k, r) = 2 v^T Σ_v^{−1} ∂v/∂Y(k, r)    (B3)

and, using the equations of (19):

∂v/∂Y(k, r) = (2/T_u) ( y_r(k) − ȳ(k) ) e_k    (B4)

where e_k = [0, ..., 1, ..., 0]^T is the unit vector that has 1 in the k-th position. Therefore:

∂( v^T Σ_v^{−1} v )/∂Y(k, r) = 2 v^T Σ_v^{−1} e_k (2/T_u) ( y_r(k) − ȳ(k) )    (B5)

The differential of the term μ_v^T Σ_v^{−1} v is:

∂( μ_v^T Σ_v^{−1} v )/∂Y(k, r) = μ_v^T Σ_v^{−1} e_k (2/T_u) ( y_r(k) − ȳ(k) )    (B6)

To simplify the presentation, L_4, the fourth term of L, is used:

L_4 = −(1/2) (v − μ_v)^T Σ_v^{−1} (v − μ_v)    (B7)

Combining equations B2, B5 and B6:

∂L_4/∂Y(k, r) = −(2/T_u) ( y_r(k) − ȳ(k) ) (v − μ_v)^T Σ_v^{−1} e_k    (B8)

where the product Σ_v^{−1} e_k is equal to the k-th column of Σ_v^{−1}. Combining equations B8 for k = 1:m, and taking into account that Σ_v is symmetric, we can write:

∂L_4/∂y_t = −(2/T_u) Σ_v^{−1} (v − μ_v) .* ( y_t − ȳ )    (B9)

where .* denotes the element-wise multiplication of two vectors. From the definition of the scalar-by-matrix derivative (in denominator layout notation):

∂L_4/∂Y = [ ∂L_4/∂y_1, ∂L_4/∂y_2, ..., ∂L_4/∂y_{T_u} ]    (B10)

Finally:

dL/dY = −(1/T_u) R^{−1} ( Y − HX̂ − M_y ) + dL_4/dY    (B11)

Derivatives

Some useful matrix derivatives used above are:

∂ tr(A X)/∂x = tr( A ∂X/∂x ), where x is a scalar; and ∂ tr(X^T A X)/∂X = (A + A^T) X, where A is not a function of X    (B12)
GB1507420.6A 2015-04-30 2015-04-30 Speech synthesis using linear dynamical modelling with global variance Expired - Fee Related GB2537907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1507420.6A GB2537907B (en) 2015-04-30 2015-04-30 Speech synthesis using linear dynamical modelling with global variance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1507420.6A GB2537907B (en) 2015-04-30 2015-04-30 Speech synthesis using linear dynamical modelling with global variance

Publications (3)

Publication Number Publication Date
GB201507420D0 GB201507420D0 (en) 2015-06-17
GB2537907A true GB2537907A (en) 2016-11-02
GB2537907B GB2537907B (en) 2020-05-27

Family

ID=53488950

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1507420.6A Expired - Fee Related GB2537907B (en) 2015-04-30 2015-04-30 Speech synthesis using linear dynamical modelling with global variance

Country Status (1)

Country Link
GB (1) GB2537907B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060053008A1 (en) * 2004-09-03 2006-03-09 Microsoft Corporation Noise robust speech recognition with a switching linear dynamic model
US20140114650A1 (en) * 2012-10-22 2014-04-24 Mitsubishi Electric Research Labs, Inc. Method for Transforming Non-Stationary Signals Using a Dynamic Model

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147146A (en) * 2018-08-21 2019-01-04 平安科技(深圳)有限公司 The method and terminal device of the voice number of taking

Also Published As

Publication number Publication date
GB201507420D0 (en) 2015-06-17
GB2537907B (en) 2020-05-27

Similar Documents

Publication Publication Date Title
EP2846327B1 (en) Acoustic model training method and system
JP6092293B2 (en) Text-to-speech system
JP6246777B2 (en) Speech synthesis method, apparatus and program
JP5768093B2 (en) Speech processing system
Lee et al. High-level feature representation using recurrent neural network for speech emotion recognition
US8825485B2 (en) Text to speech method and system converting acoustic units to speech vectors using language dependent weights for a selected language
CN103971393A (en) Computer generated head
Henter et al. Gaussian process dynamical models for nonparametric speech representation and synthesis
GB2524505A (en) Voice conversion
Deng et al. Deep dynamic models for learning hidden representations of speech features
Yamagishi An introduction to hmm-based speech synthesis
CN106157948B (en) A kind of fundamental frequency modeling method and system
GB2537907A (en) Speech synthesis using dynamical modelling with global variance
GB2508411A (en) Speech synthesis by combining probability distributions from different linguistic levels
JP2012058343A (en) Voice synthesizing apparatus, voice synthesizing method and voice synthesizing program
GB2537908A (en) Speech synthesis using linear dynamical modelling
CN114270433A (en) Acoustic model learning device, speech synthesis device, method, and program
Coto-Jiménez et al. Speech Synthesis Based on Hidden Markov Models and Deep Learning.
Shinoda Speaker adaptation techniques for speech recognition using probabilistic models
JP2004279454A (en) Method for speech generation model speaker adaptation, and its device, its program, and its recording medium
Ling Deep learning for statistical parametric speech synthesis
Ostendorf Segmental acoustic modeling for speech recognition
JP5345967B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
홍두화 On Applying Nonlinear Regression Models to Statistical Parametric Speech Synthesis
JP2012242693A (en) Feature parameter generation device, feature parameter generation method, and feature parameter generation program

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230430