GB2508411A - Speech synthesis by combining probability distributions from different linguistic levels - Google Patents


Publication number
GB2508411A
GB2508411A
Authority
GB
United Kingdom
Prior art keywords
speech
text
sequence
acoustic units
linguistic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB201221625A
Other versions
GB2508411B (en)
Inventor
Javier Latorre-Martinez
Mark John Francis Gales
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1221625.5A priority Critical patent/GB2508411B/en
Publication of GB2508411A publication Critical patent/GB2508411A/en
Application granted granted Critical
Publication of GB2508411B publication Critical patent/GB2508411B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

In a text-to-speech synthesiser which converts text input (15, fig. 1) to audio output (17), a multi-level probability model relating all potential speech vectors to a specific utterance is generated at a range of linguistic levels. Text is first converted into linguistic units of different levels (e.g. syllables on one level, words on another), each having a duration of several frames, and linguistic context (e.g. phonetic, prosodic, semantic or syntactic information) associated with each unit. Each unit is then related to linear parameters of a speech signal contour according to probability distributions (fig. 3) in a model of speech vectors (e.g. fundamental frequency F0, lsp, aperiodicity, S405 fig. 4), whose mean and variance are determined during the training of the system (figs. 6-9). The probability distributions of all the different levels are finally combined, using Bayes' rule and Expectation Maximization algorithms, to give a total Gaussian distribution for the speech vector x having a mean x̄ and a variance P.

Description

Speech Synthesis.
FIELD
Embodiments of the present invention as generally described herein relate to a text-to-speech system and method.
BACKGROUND
Text to speech systems are systems where audio speech or audio speech files are outputted in response to reception of a text file.
Text to speech systems are used in a wide variety of applications such as electronic games, e-book readers, e-mail readers, satellite navigation, automated telephone systems and automated warning systems.
There is a continuing need to make systems sound more like a human voice.
BRIEF DESCRIPTION OF THE DRAWINGS
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures in which:
Figure 1 is a schematic of a text to speech system;
Figure 2 is a flow diagram of a speech synthesis method in accordance with an embodiment of the present invention;
Figure 3 is a schematic of a Gaussian probability function;
Figure 4 is a flow diagram showing the steps performed by a speech processing system;
Figure 5 is a schematic of a text to speech system which can be trained;
Figure 6 is a flow diagram demonstrating a method of creating a decision tree during training of a text to speech system;
Figure 7 is a flow diagram demonstrating a method of training a speech processing system in accordance with an embodiment of the present invention;
Figure 8 is a flow diagram demonstrating a method of training a speech processing system in accordance with a further embodiment of the present invention;
Figure 9 is a flow diagram demonstrating a method of training a speech processing system in accordance with yet a further embodiment of the present invention; and
Figure 10 is a schematic of the correlation matrix corresponding to three different methods of training a speech processing system.
DETAILED DESCRIPTION
A standard text-to-speech system consists of three modules: a front end that transforms the input text into a sequence of segments of normalized features that can be understood by the computer; a prosody module that predicts the duration and intonation for each one of these segments; and a waveform generation module that produces the final speech signal for each segment based on their features and their predicted prosodic values. In statistical parametric synthesis the predicted fundamental frequency values (F0) are used because they are needed to construct the excitation signal for the vocoder. The goal of a prosody module that predicts intonation and duration is to create a trajectory of F0 values that conveys the prosodic information required by the input text as unambiguously and naturally as possible. This involves two contradictory goals. On the one hand, the generated F0 trajectories for each segment of the sentence should be as close as possible to the canonical ones so that the ambiguity is minimized. On the other hand, the whole trajectory needs to be continuous and smooth.
In an embodiment, a text-to-speech method is provided, said method comprising: inputting text; dividing said text into a first sequence of acoustic units corresponding to a first linguistic level, the duration of each acoustic unit being equal to a plurality of frames; obtaining linguistic context features from said text; mapping said acoustic units to probability distributions that relate said linguistic context features for each of said acoustic units to speech parameters, wherein said speech parameters correspond to a linear parameterization of a speech signal contour over said plurality of frames according to a speech vector model; estimating the duration of each of the said acoustic units using a duration model; converting said first sequence of acoustic units into a sequence of speech vectors by combining said probability distributions into a probability distribution of output coefficients, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distribution of output coefficients; and outputting said sequence of speech vectors as audio.
The speech parameters are observation coefficients which are themselves dependent on component parameters corresponding to a linear parameterization of a speech signal contour over the plurality of frames corresponding to an acoustic unit at the linguistic level. The parameters on which the observation coefficients are dependent are static coefficients and concatenation coefficients. Static coefficients are dependent on a single acoustic unit.
Concatenation coefficients may be dependent on the same acoustic unit as the corresponding static coefficient and/or surrounding acoustic units.
The observation coefficients and all components of the observation coefficients may be expressed as a linear combination of static coefficients for the acoustic unit and/or surrounding acoustic units at the linguistic level. The observation coefficients and all components of the observation coefficients may also be expressed as a linear combination of output coefficients.
The output coefficients may be static coefficients for acoustic units at the linguistic level or at a different linguistic level. The different linguistic level may be a lower linguistic level than that of the observation vector. The different linguistic level may be the frame level.
In an embodiment, once the probability distributions for the observation parameters have been determined, they are expressed as a linear combination of static coefficients at the linguistic level of the acoustic units into which the text was divided. These are then re-expressed as a linear combination of output coefficients which, in an embodiment, are static coefficients corresponding to acoustic units at a linguistic level lower than that of the acoustic units into which the text was divided.
Linguistic context features can be any information that is obtained from the text. Linguistic context features may comprise phonetic information (first phone, last phone), prosodic information (position of the syllable in the accent group), or any other form of information. The linguistic context features may further comprise semantic (e.g. positive vs. negative words) and/or syntactic (verbs, nouns, etc.) information. The linguistic context features extracted from the text may vary depending on the speech vector to be calculated.
The first linguistic level is a real linguistic level and is selected from phones, diphones, syllables, moras, words, accent feet, intonational phrases, sentences, phrases, breath groups, paragraphs or any other linguistic level.
The first linguistic level may be selected from syllables, moras, words, accent feet, intonational phrases, sentences, phrases, breath groups and paragraphs.
Mapping between the text and speech parameters may be done using a decision tree, neural network or linear model. In the case of a decision tree, speech parameters are arranged into clusters corresponding to different branches of the decision tree.
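As a rough illustration of the decision-tree option, the sketch below walks a binary tree of yes/no questions on linguistic context features down to a leaf holding a mean and variance. The question names, feature names and numeric values are hypothetical and chosen only for the example; they are not taken from the patent.

```python
# A minimal sketch of mapping linguistic context features to a leaf Gaussian
# via a binary decision tree.  The questions and feature names are hypothetical.
def tree_lookup(node, context):
    """Walk a binary tree of yes/no questions until a leaf (mean, variance) is reached."""
    while "question" in node:
        node = node["yes"] if node["question"](context) else node["no"]
    return node["mean"], node["variance"]

tree = {
    "question": lambda c: c["stressed"],                    # hypothetical question
    "yes": {"question": lambda c: c["position_in_word"] == 0,
            "yes": {"mean": [5.3, 0.2], "variance": [0.02, 0.01]},
            "no":  {"mean": [5.1, 0.0], "variance": [0.03, 0.01]}},
    "no":  {"mean": [4.9, -0.1], "variance": [0.04, 0.02]},
}
context = {"stressed": True, "position_in_word": 1}
print(tree_lookup(tree, context))     # -> ([5.1, 0.0], [0.03, 0.01])
```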
The probability distributions may be Gaussian distributions or other distributions such as Poisson, Student-t, Laplacian or Gamma distributions. The distributions may or may not be described by a mean and variance.
The speech vector may be selected from the fundamental frequency, the spectrum coefficients, band aperiodicity or any other speech vector. Spectrum coefficients may be selected from lsp, mel-lsp, cepstral, mel-cepstral, generalized mel-cepstral and harmonics amplitude or any other spectrum coefficient.
The duration can be estimated at any time prior to the computing of the probability distribution of output coefficients. In an embodiment, the duration is calculated prior to mapping the acoustic units to probability distributions that relate linguistic context features for each of the acoustic units to speech parameters.
The method comprises sampling, from the trajectory distribution of the output coefficients, one possible trajectory of the speech signal.
In another embodiment, a text-to-speech method is provided, comprising: inputting text; dividing said text into a first sequence of acoustic units and at least one further sequence of acoustic units, each of said sequences corresponding to a different linguistic level, the duration of each acoustic unit being equal to a plurality of frames; obtaining linguistic context features from said text; mapping said acoustic units to probability distributions that relate said linguistic context features for each of said acoustic units to speech parameters, wherein said speech parameters correspond to a linear parameterization of a speech signal contour over said plurality of frames according to a speech vector model; estimating the duration of each of the said acoustic units using a duration model; converting said first sequence of acoustic units and said at least one further sequence of acoustic units into said sequence of speech vectors by combining said probability distributions into a probability distribution of output coefficients, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distribution of output coefficients; and outputting said sequence of speech vectors as audio.
In the multi-level embodiment above, the observation coefficients and all components of the observation coefficients for all linguistic levels may be expressed as a linear combination of output coefficients, thus enabling the combination of probability distributions for all levels into a single probability distribution of output coefficients. The output coefficients may be static coefficients for acoustic units at any of the considered linguistic levels. In an embodiment, the output coefficients are static coefficients for the lowest linguistic level. The output coefficients may be static coefficients for acoustic units at a different linguistic level not corresponding to the linguistic level of any of the acoustic units into which the text was divided. The different linguistic level may be a lower linguistic level than the levels corresponding to any of the acoustic units into which the text was divided. The different linguistic level may be the frame level.
In an embodiment, once the probability distributions for the observation parameters have been determined, they are expressed as a linear combination of static coefficients at the linguistic levels of the acoustic units into which the text was divided. They are then re-expressed as a linear combination of output coefficients which, in an embodiment, are static coefficients corresponding to acoustic units at the lowest linguistic level into which the text was divided.
At least one of the sequences of acoustic units comprises acoustic units spanning multiple frames such that the multiple frames encompass the linguistic level. The probability distributions model the statistics of a linear parameterization of the speech signal contour formed by the frames encompassing the linguistic level.
In another embodiment, a method of training a speech vector model is provided, said method comprising inputting an audio sample and a corresponding text; dividing said audio sample into a first sequence of acoustic units corresponding to a first linguistic level according to said corresponding text; extracting a first speech vector at said linguistic level from said audio sample; identifying regions of said speech vector that are discontinuous; obtaining said speech vector model by applying maximum-likelihood criteria; wherein said discontinuous regions of said speech vector are computed as measurements with maximum unreliability.
The speech vector model may comprise one or more Gaussian distributions. The speech vector model may comprise other distributions such as Poisson, Student-t, Laplacian or Gamma distributions. The distributions may or may not be described by a mean and variance.
Discontinuous regions of the speech vector have "missing" values and are computed as the model is trained by treating such values as measurements with maximum unreliability. The degree of reliability or unreliability of regions of the speech vector may be modelled as having a Gaussian probability distribution and the probability distribution may be computed as part of the speech vector model. The missing data of discontinuous speech signals are simply assumed to be measurements with maximum uncertainty so that no signal reconstruction or interpolation process is required other than the intrinsic one provided by the model training itself.
The data used for training is assumed to consist of measurements of a real value with a certain degree of unbiased Gaussian noise which depends on the nature of the signal and the tools used for the measurement. An estimation of the probability distribution of such uncertainty is used to train the statistical model.
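As a rough numerical illustration of why a measurement with maximum unreliability needs no explicit interpolation, the hypothetical sketch below fuses per-frame measurements with per-frame noise variances; a frame flagged with a very large variance contributes essentially zero precision and therefore drops out of the estimate automatically. This is a simplified stand-in, not the training procedure described later in this document.

```python
import numpy as np

def fuse_measurements(values, variances, prior_mean, prior_var):
    """Posterior mean/variance of a scalar given noisy measurements.

    Each measurement i is modelled as values[i] ~ N(true, variances[i]).
    A 'missing' frame is represented by a huge variance, i.e. maximum
    unreliability: its precision 1/variance is ~0, so it is ignored.
    """
    precisions = 1.0 / np.asarray(variances, dtype=float)
    post_prec = 1.0 / prior_var + precisions.sum()
    post_mean = (prior_mean / prior_var + (precisions * values).sum()) / post_prec
    return post_mean, 1.0 / post_prec

# Three frames: the middle one is unvoiced/"missing", flagged by a huge variance.
values = np.array([5.0, 0.0, 5.2])
variances = np.array([0.01, 1e12, 0.01])
mean, var = fuse_measurements(values, variances, prior_mean=5.1, prior_var=1.0)
print(mean, var)   # close to the average of the two reliable frames
```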
In an embodiment, during training, neural network, decision tree or linear model structures are computed which relate speech parameters to linguistic context features. The speech parameters are arranged into clusters within these structures.
In another embodiment, a method of training a speech vector model is provided, said method comprising inputting an audio sample and a corresponding text; dividing said audio sample into a first sequence of acoustic units and at least one further sequence of acoustic units, each of said sequences corresponding to a different linguistic level according to said corresponding text; extracting a first speech vector and at least one further speech vector at said different linguistic levels from said audio sample; identifying regions of said speech vector that are discontinuous; obtaining said speech vector model by applying maximum-likelihood criteria; wherein said discontinuous regions of said speech vector are computed as measurements with maximum unreliability.
The method may be used to train one or more of the speech vector models employed in speech synthesis. The method may be used to train all of the speech vector models employed in speech synthesis. The method may be used to train models for the fundamental frequency, the spectrum coefficients and band aperiodicity or any other speech vector. The spectrum coefficients trained in this way may be lsp, mel-lsp, cepstral, mel-cepstral, generalized mel-cepstral and harmonics amplitude or any other spectrum coefficient. The extracted speech vectors may or may not have missing values.
The speech vector model may be obtained as a product of experts of speech vector sub-models each defined at different linguistic levels, wherein the likelihood probabilities of a speech vector trajectory over models defined at different linguistic levels are combined as a normalized product. Each level model produces a prediction over the whole trajectory of the speech signal.
One or more or all of the sub-models may also be individually trained using the above method.
One of the sub-models may be a model of the mel-cepstrum coefficients also individually trained using the above method and the other sub-models may be individually trained using another method.
The speech vector model may be obtained as a superposition of speech vector sub-models defined at different linguistic levels. The models are combined as a sum such that a lower linguistic level model models the errors of a higher level model. Each sub-model produces a prediction of the differences between the real trajectory and the trajectory predicted by the upper level sub-models. During training of the model, the probability of the error between the observed data and their most likely representation according to the parameters of the sub-model at the lowest level is calculated.
The sub-models may comprise one or more Gaussian distributions. The sub-models may comprise other distributions such as Poisson, Student-t, Laplacian or Gamma distributions. The distributions may or may not be described by a mean and variance. The sub-models may all comprise Gaussian distributions and may be expressed as a linear transformation of the variables of the distributions of the lowest level sub-model so that the combination of all the sub-models is also a Gaussian distribution.
The sub-models model the trajectory of the speech signal over a span of the multiple frames encompassing the linguistic level, said linguistic level being that of a syllable, word, sentence or multiple sentences such as a paragraph. The distributions comprising at least one of the sub-levels model the statistics of the speech signal contour formed by the frames encompassed at the linguistic level at which the sub-model is defined.
In an embodiment, speech synthesis using models trained according to the above embodiments comprises sampling from the trajectory distribution defined by the combination of all the sub-models one possible trajectory of the speech signal.
In another embodiment, a text-to-speech system is provided, said system comprising: a text input configured to receive inputted text; a processor configured to: divide said text into a first sequence of acoustic units corresponding to a first linguistic level, the duration of each acoustic unit being equal to a plurality of frames; obtain linguistic context features from said text; map said acoustic units to probability distributions that relate said linguistic context features for each of said acoustic units to speech parameters, wherein said speech parameters correspond to a linear parameterization of a speech signal contour over said plurality of frames according to a speech vector model; estimate the duration of each of the said acoustic units using a duration model; convert said first sequence of acoustic units into a sequence of speech vectors by combining said probability distributions into a probability distribution of output coefficients, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distribution of output coefficients; and output said sequence of speech vectors as audio.
Since the present invention can be implemented by software, the present invention encompasses computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal, e.g. an electrical, optical or microwave signal.
In another embodiment, a carrier medium is provided comprising computer readable code configured to cause a computer to perform a text-to-speech method, said method comprising: inputting text; dividing said text into a first sequence of acoustic units corresponding to a first linguistic level; mapping a plurality of parameters to each acoustic unit according to linguistic context features obtained from said text; estimating the duration of each of the said acoustic units using a duration model; converting said first sequence of acoustic units into a sequence of speech vectors using a speech vector model, wherein said speech vector model comprises probability distributions which relate said acoustic units to said speech vectors according to said parameters, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distributions; and outputting said sequence of speech vectors as audio.
In another embodiment, a carrier medium comprising computer readable code configured to cause a computer to perform a method of training a speech vector model is provided, said method comprising: inputting an audio sample and a corresponding text; dividing said audio sample into a first sequence of acoustic units corresponding to a first linguistic level according to said corresponding text; extracting a first speech vector at said linguistic level from said audio sample; identifying regions of said speech vector that are discontinuous; obtaining said speech vector model by applying maximum-likelihood criteria; wherein said discontinuous regions of said speech vector are computed as measurements with maximum unreliability.
In standard HMM-based text to speech synthesis, F0 is modelled at frame level using a sub-division of the phone into states so that for each state the mean and variance of the logF0 values observed during that state are modelled. The mean and variance of the logF0 first and second derivatives are also modelled. The boundaries of the states are related to phones but are not constrained by any supra-segmental structure.
In an embodiment, a parametric F0 model approach (DCT-F0) is used. Parametric F0 model approaches are a technique to model speech intonation (F0) by training statistical models of the pitch contours at supra-segmental linguistic levels, e.g. the syllable. To train those models, the pitch contour of supra-segmental units in the training data is parameterized with a transform.
One example of such a transform is the discrete cosine transform (DCT) with a fixed number of coefficients. Once the transform is performed, the pitch contour of each syllable is represented by the same number of dimensions and probability distributions can be computed. Parametric-based F0 models model F0 "contours" at well defined linguistic units such as the syllable, intonation phrase, etc. Such an approach models long-term correlations between the elements of an utterance characteristic of supra-segmental structures.
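To illustrate the idea that a fixed number of DCT coefficients gives every syllable the same dimensionality regardless of its duration, here is a minimal sketch built from a plain numpy DCT basis; the contours, order and values are assumed purely for illustration and do not come from the patent.

```python
import numpy as np

def dct_basis(num_coeffs, num_frames):
    """Rows are the first `num_coeffs` orthonormal type-II DCT basis vectors."""
    n = np.arange(num_frames)
    k = np.arange(num_coeffs)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / num_frames)
    basis[0] *= 1.0 / np.sqrt(2.0)            # scaling for the 0th row (orthonormality)
    return basis * np.sqrt(2.0 / num_frames)

def parameterize(logf0_contour, num_coeffs=5):
    """Fixed-dimension coefficients for a variable-length syllable contour."""
    D = dct_basis(num_coeffs, len(logf0_contour))
    return D @ logf0_contour                   # least-squares fit (rows are orthonormal)

# Two syllables of different duration map to vectors of the same dimension.
short_syll = np.log(220 + 5 * np.sin(np.linspace(0, np.pi, 12)))
long_syll  = np.log(180 + 8 * np.sin(np.linspace(0, np.pi, 35)))
print(parameterize(short_syll).shape, parameterize(long_syll).shape)  # (5,) (5,)
```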
In an embodiment, a factor analysis approach is used to define and train a parametric-based F0 model using maximum likelihood criteria. This approach is known as FA-DCT. Systems and methods in accordance with embodiments of the invention do not require an interpolation of the F0 values obtained from the F0 extractor algorithm. Instead, both parameterization coefficients and un-observed F0 values are treated as hidden variables and estimated from the observed ones using an Expectation Maximization (EM) algorithm. Expectation Maximization algorithms are well known in the art. Using this approach, the structure of the model decision tree is not affected by any artificially created values which are external to the input data. The decision-based clustering process requires the computation of an EM algorithm to estimate the model for each split and each question during the clustering procedure. To reduce the computational complexity two approaches to this are employed. First, the clustering structure of the model is computed only for the static features so that a closed form solution for both the mean and the variance of the parameterization coefficients can be obtained. Second, instead of evaluating all possible questions at each split in the decision tree, a more reduced sub-set of them is pre-selected using a Minimum Generation Error-based algorithm.
Past experiments have shown that the intonation generated with this approach sounds more natural and stable than the one generated by other methods.
Systems and methods in accordance with embodiments of the present invention allow:
* the models to be trained independently for the combination of their clusters;
* the logF0 contour to be parameterized without styling or interpolating logF0. Styling and interpolating are processes which are completely external to the data but have a very strong effect on the parameters that are extracted and on the way they are then clustered. They are defined heuristically and affect the parameterization and therefore the quality of the generated information; and
* the final model to be a proper generative model. As a result, no heuristic adjustments are required when integrating the model with models at other linguistic levels as a "product-of-experts". A product of experts approach will be described below.
The goal of using factor analysis is to put the DCT-F0 model in a proper probabilistic framework that allows:
* the interpolation process to be integrated within the training process so that the interpolator is derived from the data directly instead of imposing any heuristic function (spline, linear, etc.);
* the integration of models at multiple levels in a proper statistical framework as a product-of-experts; and
* the implementation of adaptation schemes such as MLLR or CMLLR.
The approach may be used for any application of text-to-speech synthesis, and is especially useful in those scenarios that require stable but not monotonous intonation with a strong coherence across long time spans, such as synthesising speech for e-books or web pages.
Systems and methods in accordance with embodiments of the present invention will now be described with reference to figures 1 to 10.
Figure 1 shows a text to speech system 1. The text to speech system 1 comprises a processor 3 which executes a program 5. The text to speech system 1 further comprises storage 7. The storage 7 stores data which is used by program 5 to convert text to speech. The text to speech system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a text input 15. Text input 15 receives text. The text input 15 may be for example a keyboard. Alternatively, text input 15 may be a means for receiving text data from an external storage medium or a network.
Connected to the output module 13 is an output for audio 17. The audio output 17 is used for outputting a speech signal converted from text which is input into text input 15. The audio output 17 may be for example a direct audio output, e.g. a speaker, or an output for an audio data file which may be sent to a storage medium, networked, etc. In use, the text to speech system 1 receives text through text input 15. The program 5 executes on processor 3 and converts the text into speech data using data stored in the storage 7. The speech is output via the output module 13 to audio output 17.
A simplified process for speech synthesis will now be described with reference to figure 2. This process can be applied to any speech vector describing the fundamental frequency; the spectrum coefficients such as lsp, mel-lsp, cepstral, mel-cepstral, generalized mel-cepstral, harmonics amplitude; the band aperiodicity; or any other speech vector.
In the first step, S101, text is input. The text may be input via a keyboard, touch screen, text predictor or the like. The text is then converted into several sequences of different levels of linguistic units (supra-segmental units). In the present embodiment, these supra-segmental units are proper linguistic units, such as a syllable. For example, one linguistic level is a syllable and one is any level other than syllable. Other examples of proper linguistic units include a phone, diphone, mora, word, accent foot, intonational phrase, sentence, phrase, breath group or paragraph. All of these linguistic units have a duration lasting several frames. The text is converted into the sequences of linguistic units using techniques which are well-known in the art and will not be explained further here. Because the linguistic units are proper linguistic units, they will be of variable duration.
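For a flavour of what dividing the same text into units at two linguistic levels can look like, here is a toy sketch; the naive vowel-group syllabifier is a hypothetical stand-in for the lexicon- or rule-based front end such a system would actually use.

```python
import re

def to_words(text):
    return re.findall(r"[a-zA-Z']+", text.lower())

def to_syllables(word):
    # Naive vowel-group syllabifier, purely illustrative; a real front-end
    # would use a lexicon or grapheme-to-phoneme rules.
    groups = re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]+$)?", word)
    return groups if groups else [word]

text = "Speech synthesis sounds natural"
words = to_words(text)
levels = {
    "word": words,                                             # one linguistic level
    "syllable": [s for w in words for s in to_syllables(w)],   # a lower level
}
print(levels)
```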
In step S103, linguistic information in the text, including linguistic context features, is associated with each linguistic unit. Linguistic context features can be any information that is obtained from the text. Linguistic context features may be phonetic information (for example first phone or last phone), prosodic information (for example the position of the syllable in the accent group), or any other form of information. The linguistic context features may further comprise semantic (for example, positive as opposed to negative words) and/or syntactic (for example verbs and nouns, etc.) information.
In step S105, the linguistic context features are used to look up the probability distributions relating each acoustic unit to speech parameters. These speech parameters correspond to a linear parameterization of a speech signal contour over the frames encompassed by the acoustic unit according to a speech vector model. The process of parameterization during the training of the speech vector model will be discussed below.
In an embodiment, the mapping from linguistic context features to speech parameters is carried out using a decision tree, which will be described later. In another embodiment the mapping is done by employing a neural network model. For example, Bishop, C.M. (1995) Neural Networks for Pattern Recognition, Clarendon Press, Chapter 6, describes a suitable model. In yet another embodiment, the mapping is done using a linear model.
In this embodiment, the probability distributions will be Gaussian distributions which are defined by means and variances. However, it is possible to use other distributions such as the Poisson, Student-t, Laplacian or Gamma distributions, some of which are defined by variables other than the mean and variance.
During synthesis, it is assumed that each acoustic unit does not have a definitive one-to-one correspondence to a speech vector or "observation" to use the terminology of the art. Many acoustic units are pronounced in a similar manner, are affected by surrounding acoustic units, their location in a word or sentence, or are pronounced differently by different speakers. Thus, each acoustic unit only has a probability of being related to a speech vector and text-to-speech systems calculate many probabilities and choose a sequence of observations given a sequence of acoustic units.
A Gaussian distribution is shown in Figure 3. Figure 3 can be thought of as being the probability distribution of an acoustic unit relating to a speech vector. For example, the speech vector shown as X has a probability P1 of corresponding to the phoneme or other acoustic unit which has the distribution shown in figure 3.
The shape and the position of the Gaussian is defined by its mean and variance. These parameters are determined during the training of the system.
In step S107, a duration is predicted for each unit of each linguistic level and associated with the relevant probability distribution for that level. Models for predicting duration are well known in the art and will not be discussed here.
In step S109, the duration data and the probability distributions relating to the individual units for the different linguistic levels are combined to build a multi-level acoustic trajectory model.
This acoustic trajectory model comprises a probability distribution of output coefficients.
In a multi-level model, the probability of all potential speech vectors relating to a specific utterance must be considered at a range of linguistic levels. This implies a global optimization over all the acoustic units at each linguistic level of the utterance. A probability distribution for the utterance is constructed, therefore, by combining the probability distributions of all the levels being considered. In this embodiment, the probability distributions at the individual linguistic levels are Gaussians and the relationships between the parameters of different levels are defined by linear equations. As a result, the total probability distribution of the speech vector x is also a Gaussian p(x; x̄, P) defined by its mean x̄ and variance P. In an embodiment, the total probability distribution is constructed by expressing the speech parameters for the individual linguistic levels as linear combinations of output coefficients. The output coefficients may or may not correspond to the linear parameterization of acoustic units at one of the linguistic levels. In an embodiment, the output coefficients correspond to the linear parameterization of acoustic units at the lowest linguistic level.
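Because each level's parameters are linear functions of the output coefficients and each level's distribution is Gaussian, the levels can be combined by adding their suitably transformed precisions. The sketch below is a simplified, assumed illustration of that combination rather than the patent's exact construction; the matrices and numbers are invented for the example.

```python
import numpy as np

def combine_levels(level_models, dim):
    """Combine Gaussian models defined at several linguistic levels.

    Each model is (A, mu, Sigma): the level's parameters are assumed to be a
    linear function A @ x of the output coefficients x, with a Gaussian
    N(mu, Sigma) over those parameters.  The product of the level likelihoods
    is then a single Gaussian N(x_bar, P) over x.
    """
    precision = np.zeros((dim, dim))
    info = np.zeros(dim)
    for A, mu, Sigma in level_models:
        Sinv = np.linalg.inv(Sigma)
        precision += A.T @ Sinv @ A        # add the transformed precision
        info += A.T @ Sinv @ mu            # add the transformed information vector
    P = np.linalg.inv(precision)
    x_bar = P @ info
    return x_bar, P

# Toy example: 4 frame-level output coefficients and two "levels".
rng = np.random.default_rng(0)
A_frame = np.eye(4)                              # frame level: identity mapping
A_syll = np.array([[0.5, 0.5, 0.0, 0.0],         # syllable level: per-syllable means
                   [0.0, 0.0, 0.5, 0.5]])
models = [
    (A_frame, rng.normal(size=4), np.eye(4) * 0.2),
    (A_syll, np.array([1.0, -1.0]), np.eye(2) * 0.05),
]
x_bar, P = combine_levels(models, dim=4)
print(x_bar)
```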
In step S111 a sequence of speech vectors for the utterance is determined. In this embodiment, the speech vectors are selected using sampling, whereby each speech vector is selected randomly from the globally optimized probability distribution p(x; x̄, P). In this embodiment, the degree of randomization scales with a function of the covariance of the probability distribution, P. In an embodiment, the level of "randomness" is controlled by scaling the covariance. In other words, a sample is taken from the distribution p(x; x̄, P') where P' can be any function of P and the input text, for example

P' = α P    (Eqn 1)

where α is just a positive scalar factor.
In an embodiment, the factor α is provided by the user. In another embodiment, the factor α is controlled by another model. For example, the factor α may be determined according to a model of the global variance. The global variance across the elements x_i of the speech vector x is

v(x) = (1/N) Σ_{i=0}^{N-1} x_i x_i^T − E[x] E[x]^T    (Eqn 2)

where v(x) is the global variance and

E[x] = (1/N) Σ_{i=0}^{N-1} x_i    (Eqn 3)

is the expectation value of x. For samples extracted from p(x; x̄, αP), the expected value of the global variance is therefore

E[v(x)] = ∫ v(x) p(x; x̄, αP) dx
        = (1/N) trace( ∫ x x^T p(x; x̄, αP) dx ) − μ
        = (1/N) trace( α P + x̄ x̄^T ) − μ
        = ( α trace(P) + trace(x̄ x̄^T) ) / N − μ    (Eqn 4)

where μ denotes the expectation of the second term of Eqn 2, E[x] E[x]^T. In an embodiment, α is adjusted to target a specific value for the global variance. For example, α may be chosen such that the global variance calculated using Eqn 4 is equal to the average global variance computed over all the utterances of the data used to train the model.
In another embodiment, the scaling factor α is set to zero and there is no randomization in the selection. In this case, the speech vector is the most likely speech vector of the distribution, i.e. the mean of the distribution.
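A minimal sketch of this sampling step, assuming the combined mean x̄ and covariance P are already available (the numbers below are invented): scaling the covariance by α trades variation against stability, and α = 0 falls back to the mean trajectory.

```python
import numpy as np

def generate(x_bar, P, alpha=1.0, rng=None):
    """Draw one trajectory from N(x_bar, alpha * P).

    alpha scales the amount of randomness (as in Eqn 1); alpha = 0 simply
    returns the mean, i.e. the most likely trajectory.
    """
    if alpha == 0.0:
        return np.array(x_bar, dtype=float)
    rng = rng or np.random.default_rng()
    return rng.multivariate_normal(x_bar, alpha * P)

x_bar = np.array([5.0, 5.1, 5.2, 5.1])
P = 0.01 * np.eye(4)
print(generate(x_bar, P, alpha=0.0))   # deterministic: the mean
print(generate(x_bar, P, alpha=0.3))   # mildly randomized sample
```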
In an embodiment, once a sequence of speech vectors has been determined, a speech signal is output in step S113.
The above embodiments correspond to the case where text is divided into acoustic units at more than one linguistic level. In another embodiment, the text may be divided into acoustic units at only one linguistic level. Otherwise, the method proceeds in accordance with figure 2 as described above. In an embodiment, in the case of a single linguistic level, the output coefficients do not correspond to a linear parameterization of acoustic units at the linguistic level. Instead, they correspond to a linear parameterization of acoustic units at a lower linguistic level. In an embodiment, the output coefficients correspond to a linear parameterization of acoustic units at the frame level.
In the embodiment of Figure 2, the duration is calculated in step S107. However, the duration can be calculated at any time prior to building a statistical distribution of output coefficients (step S109 in Figure 2). In another embodiment, the duration is calculated immediately after linguistic context features are obtained from the text (step S103 in Figure 2).
In a further embodiment, the sequence of speech vectors determined according to the steps indicated in Figure 2 and described above is combined with other speech vectors selected from the set of fundamental frequency; spectrum coefficients such as lsp, mel-lsp, cepstral and mel-cepstral; or band aperiodicity using a vocal tract filter.
Figure 4 shows an embodiment where all three types of speech vector are combined.
In Step S401 text is input.
In Step S403, the duration of each acoustic unit contained within the text being considered is estimated using a duration model. Duration models are well known in the art and will not be discussed here.
Note that in this embodiment, the duration is estimated before computing the speech vectors.
However, the duration can be estimated at a later stage, during the computation of the speech vectors, as in the embodiment of Figure 2.
In Step S405, the spectrum coefficient, fundamental frequency (F0) and band aperiodicity vectors are computed. In an embodiment, one or more of the F0, band aperiodicity and spectrum coefficient vectors is calculated following the sequence of steps S103 to S113 of figure 2 described above. In another embodiment, all three of the speech vectors are calculated following the sequence described in figure 2.
In Step S407, a pulse sequence generated according to the F0 vector and a noise signal are mixed according to the ratios described by the band aperiodicity vectors. This process is known as excitation generation and the resulting vector as the excitation vector.
In Step S409, the excitation vector and spectrum coefficient vector are input into a vocal tract filter.
In Step S411, a speech signal is output from the vocal tract filter according to the spectrum coefficient and excitation vectors.
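The sketch below is a deliberately crude stand-in for steps S407 to S411, assuming frame-wise mixing of a pulse train and noise by an aperiodicity ratio and a toy one-pole filter in place of a real spectral-envelope filter; frame lengths, sample rate and values are invented for illustration only.

```python
import numpy as np

def excitation(f0_per_frame, aperiodicity, frame_len=80, sr=16000, rng=None):
    """Mix a pulse train (from F0) with noise according to per-frame aperiodicity."""
    rng = rng or np.random.default_rng(0)
    out, phase = [], 0.0
    for f0, ap in zip(f0_per_frame, aperiodicity):
        frame = np.zeros(frame_len)
        if f0 > 0:                                # voiced frame: place pulses
            period = sr / f0
            while phase < frame_len:
                frame[int(phase)] = 1.0
                phase += period
            phase -= frame_len
        noise = rng.normal(scale=0.3, size=frame_len)
        out.append((1.0 - ap) * frame + ap * noise)
    return np.concatenate(out)

def vocal_tract(excit, a=0.95):
    """Toy one-pole filter standing in for the spectral-envelope (vocal tract) filter."""
    y = excit.copy()
    for n in range(1, len(y)):
        y[n] += a * y[n - 1]
    return y

f0 = [120.0, 125.0, 0.0, 130.0]        # 0 marks an unvoiced frame
ap = [0.1, 0.1, 1.0, 0.2]
speech = vocal_tract(excitation(f0, ap))
print(speech.shape)
```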
Next, the training of the model in accordance with an embodiment of the present invention will be described with reference to figures 5 to 9. In the description below we refer to the fundamental frequency F0 but this algorithm is equally applicable to the spectrum coefficients, the band aperiodicity or any other signal derived from speech audio.
We also refer to the syllable level but the algorithm is equally applicable to any real linguistic level. We begin by discussing a general approach to the training of the model before describing specific embodiments.
In addition to the features described with reference to figure 1, figure 5 also comprises an audio input 23 and an audio input module 21. When training a system, it is necessary to have an audio input which matches the text being inputted via text input 15.
The aim when training a conventional text-to-speech system is to estimate the model parameter set λ which maximizes the likelihood for a given observation sequence o. In the present embodiment λ consists of a set of context-dependent Gaussian distributions N(o; μ_s, Σ_s) where o_s is the observation vector and μ_s and Σ_s are the mean vector and diagonal covariance matrix associated with each syllable s of an input sentence. Given such a model, the probability of the observation vector for a whole sentence o = [o_1^T, ..., o_S^T]^T is

P(o | λ) = N(o; μ, Σ)    (Eqn 5)

where μ and Σ are the total mean vector and diagonal covariance matrix created by the concatenation of the μ_s and Σ_s of the Gaussian distributions associated with each syllable s of the input sentence. The mapping between the syllables of an input text and their associated distribution is implemented by means of decision tree clustering.
In the present embodiment, a training database contains an audio input and corresponding text.
From the training data a set of fundamental frequencies F0 is extracted from the audio input and each one is associated with the corresponding text. x is defined as the log of the fundamental frequency, logF0.
x_s = [x_1, ..., x_{d_s}]^T is the logF0 vector associated with a syllable s of the training data, where d_s is the duration of the syllable in frames. If the duration of all the syllables in the training database were the same, models could be trained directly on the x_s vectors. Unfortunately, this is not the case. Instead, an approach known as factor analysis (FA) is employed. In factor analysis, the pitch contours of the syllables in the training data are parameterized with a linear transformation such that

x_s = N^{d_s} c_s + e_s    (Eqn 6)

where c_s is the coefficients vector for syllable s, N^{d_s} a deterministic linear transformation which depends only on the duration of the syllable, d_s, and e_s the parameterization error. In an embodiment, N^{d_s} is the inverse of the 5th order discrete cosine transform (DCT). Other linear transformations could be used. Now instead of x_s the variable to be modelled is c_s, which has the same dimension for all the syllables.
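A brief sketch of the linear transformation in Eqn 6, assuming a 5th order inverse-DCT matrix built with numpy; the pseudo-inverse fit used here to obtain c_s is only an illustration and stands in for the statistical estimation described later.

```python
import numpy as np

def inverse_dct_matrix(num_frames, order=5):
    """N^{d}: maps `order` DCT coefficients to a contour of `num_frames` frames."""
    n = np.arange(num_frames)[:, None]
    k = np.arange(order)[None, :]
    N = np.cos(np.pi * (n + 0.5) * k / num_frames) * np.sqrt(2.0 / num_frames)
    N[:, 0] /= np.sqrt(2.0)
    return N

# x_s = N^{d_s} c_s + e_s (Eqn 6): the same 5 coefficients regardless of duration d_s.
d_s = 27
x_s = np.log(200 + 10 * np.sin(np.linspace(0, np.pi, d_s)))   # a toy syllable contour
N = inverse_dct_matrix(d_s)
c_s = np.linalg.pinv(N) @ x_s          # least-squares coefficients (illustration only)
e_s = x_s - N @ c_s                    # parameterization error
print(c_s.shape, float(np.abs(e_s).max()))
```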
If only the DCT coefficients c_s are modelled, the x generated from such a DCT trajectory will not necessarily be continuous at the transition between syllables. To avoid such gaps, the manner in which the logF0 of one syllable relates to that of its neighbours has also to be modelled. This can be achieved by means of concatenation coefficients similar to the Δ and Δ² features used in standard HMM-based speech synthesis. HMM-based synthesis is known in the art and will not be discussed here. To obtain a smooth spectrum the spectral envelopes of consecutive frames have to be similar to each other. However, in order to obtain a continuous logF0 trajectory, it is not necessary to ensure similarity between the logF0 contours of consecutive syllables. Instead, the logF0 trajectory in the transition between two syllables must be continuous. In an embodiment, therefore, the following continuity constraints, which ensure the continuity of the generated logF0 trajectory, are applied to the model:
1. The delta of the 0th DCT coefficient (Δc_s^0), calculated as

Δc_s^0 = c_s^0 − c_{s−1}^0    (Eqn 7)

This coefficient guarantees that the average logF0 of two consecutive syllables does not differ too greatly.
2. The gradient of logF0 at the junction with the previous and next syllables: Δx_s^− and Δx_s^+. The definition of the gradients with respect to the logF0 trajectories is

Δx_s^+ = Σ_{i=1}^{W} w_i x_s^{(d_s − W + i)} + Σ_{i=1}^{W} w_{W+i} x_{s+1}^{(i)}    (Eqn 8)

where W is a window of a fixed number of frames around the syllable boundary. Using the linear relationship of Eqn 6 and neglecting the e terms, Eqn 8 can be rewritten as

Δx_s^+ = H_s^+ c_s + H_{s+1}^− c_{s+1}    (Eqn 9)

where

H_s^+ = Σ_{i=1}^{W} w_i N_{(d_s − W + i)}^{d_s}    (Eqn 10)

H_{s+1}^− = Σ_{i=1}^{W} w_{W+i} N_{(i)}^{d_{s+1}}    (Eqn 11)

and N_{(w)}^{d} is the w-th row of N^{d}. Similarly,

Δx_s^− = H_{s−1}^+ c_{s−1} + H_s^− c_s    (Eqn 12)

With the addition of the continuity coefficients, an 8-dimensional DCT observation vector o_s is obtained:

o_s = [c_s^T, Δc_s^0, Δx_s^−, Δx_s^+]^T    (Eqn 13)

Given this model, it can be shown that all the components of the observation vector o_s can be expressed as linear transformations of c_s for the current and surrounding syllables. Therefore o can be expressed as o = Mc, where M is a transformation matrix that depends only on the syllable durations and c = [c_1^T, ..., c_S^T]^T. Since o is Gaussian, it follows that

P(c | λ) = N(c; ĉ, P)    (Eqn 14)

where

ĉ = L μ    (Eqn 15)

P = (M^T Σ^{-1} M)^{-1}    (Eqn 16)

L = P M^T Σ^{-1}    (Eqn 17)

For a sentence with S syllables, the general form for the whole utterance can be written

x = N c + e    (Eqn 18)

where x = [x_1^T, x_2^T, ..., x_S^T]^T and N is a block diagonal matrix formed by the concatenation of the N^{d_s} matrices for each one of the S syllables in the sentence.
In the present embodiment, it is assumed that e follows a distribution

e ~ N(0, V)    (Eqn 19)

where

V = diag( [A^{(e)} σ] )    (Eqn 20)

with A^{(e)} a matrix that selects from the general error model σ the appropriate value for each frame.
On extraction of logF0 from the training database, there may be regions of logF0 that are unvoiced or discontinuous and regions that are voiced but may be unreliable.
In an embodiment the "reliable" and "unreliable" regions of logF0 are treated heuristically. The "real" values of x are the values that the regions of F0 would have if there was no unreliability in the system; the greater the disparity between the observed and "real" values, the less "reliable" the observation. It is assumed that the relationship between the observed values of x and their "real" values can be expressed as

x^{obs} ~ N(x^{real}, V^{obs})    (Eqn 21)

with x^{obs} the observed value of x, x^{real} the trajectory modelled by Eqn 18 and V^{obs} a diagonal matrix representing the "reliability" of each observed frame. In this embodiment there is no need to classify F0 observations according to their "reliability". Instead, the variance V of e in Eqn 19 is assumed to be dependent on two factors: one for the modelling error, V^{c}, and another one for the F0 extraction, V^{x}.
For each frame the F0 extraction factor V^{x} is assumed to be a function of the frame F0 aperiodicity and/or the spectrum, a_t and s_t respectively, and of the specific F0 extractor.
In an embodiment, a direct linear function of the aperiodicity is used. V^{x} is then written as

V^{x} = diag(g a)    (Eqn 22)

where a = [a_1, ..., a_T] and g is a vector dependent on the F0 extraction method employed. In an embodiment, the vector g is integrated in the model such that σ and A^{(e)} of Eqn 20 are replaced by σ̃ = [σ^T, g^T]^T and the variance selection matrix Ã_u^{(e)} = [A_u^{(e)}, A_u^{(a)}] respectively, where u is the utterance index. g is then estimated together with the variance of the modelling error by means of the modified variance selection matrices.
In an embodiment, more than one F0 extraction method is used and a different g is used for each method.
In the above embodiments, the approach for handling the unreliable regions of logF0 is integrated into the model. The discontinuous unvoiced ("missing") regions of logF0, denoted x_m, must still be dealt with, however. The approach taken in the present embodiment is to calculate the model λ using an expectation-maximization approach, also known as the Baum-Welch algorithm. The DCT coefficients c, as well as the unvoiced parts of the fundamental frequency x_m, are treated as latent variables and an iterative approach is used to compute them.
Thus methods and systems in accordance with the above embodiments treat discontinuous regions of F0 as measurements with maximum unreliability so that no signal reconstruction or interpolation process is required other than the intrinsic one provided by the model training itself.
Methods and systems in accordance with the above embodiments handle unobserved, missing and unreliable values in a consistent statistical framework; there is no need to interpolate and filter the extracted logF0 trajectories. This means that neither the model tying structure nor its parameters depend on any artificially created values. Additionally, the degree of reliability of each observed F0 value can be learned during the training with little or no heuristic. Further, multiple F0 extraction methods can be combined and weighted according to their reliability.
This approach takes into consideration the reliability of each particular method and moreover does not require the F0 vectors extracted by different methods to be perfectly synchronized.
An overview of the basic Baum-Welch algorithm employed in the present embodiment is now provided. In order to simplify some later expressions,

x_o = S_o x    (Eqn 23)

x_m = S_m x    (Eqn 24)

are defined as the observed and missing (discontinuous) parts of x respectively, with S_o and S_m the corresponding set of orthogonal selection matrices. Similarly, for the transformation and the error matrices,

N_o = S_o N    (Eqn 25)

N_m = S_m N    (Eqn 26)

V_o = S_o V S_o^T    (Eqn 27)

V_m = S_m V S_m^T    (Eqn 28)

Note that, since V is diagonal,

V_o^{-1} = S_o V^{-1} S_o^T    (Eqn 29)

To train the model for logF0, two sets of latent variables are needed: a) the vector of DCT coefficients c and b) the logF0 values of the unvoiced regions. These two sets of latent variables have to be obtained from the observed logF0 values of the reliable and non-reliable voiced regions.
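A small sketch of what the selection matrices of Eqns 23 to 29 look like in practice, assuming a simple boolean voicing mask; the mask and variances below are invented for the example.

```python
import numpy as np

def selection_matrices(voiced_mask):
    """S_o and S_m (Eqns 23-24): pick the observed and missing rows of x."""
    voiced_mask = np.asarray(voiced_mask, dtype=bool)
    I = np.eye(len(voiced_mask))
    S_o = I[voiced_mask]       # selects observed (voiced) frames
    S_m = I[~voiced_mask]      # selects missing (unvoiced) frames
    return S_o, S_m

voiced = [1, 1, 0, 0, 1]
S_o, S_m = selection_matrices(voiced)
V = np.diag([0.1, 0.2, 0.3, 0.4, 0.5])         # a diagonal error covariance
V_o = S_o @ V @ S_o.T                           # Eqn 27
# For a diagonal V the observed sub-block inverts independently (cf. Eqn 29):
assert np.allclose(np.linalg.inv(V_o), S_o @ np.linalg.inv(V) @ S_o.T)
print(V_o)
```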
For a single utterance, the two latent variables are the sequences of parameters c and the non-observed F0 frames x_m. A new variable y is defined as

y = [c^T, x_m^T]^T    (Eqn 30)

Using this variable, the auxiliary function for an Expectation Maximization algorithm can be expressed as

Q(λ, λ̂) = ∫ P(y | x_o, λ) log(P(y, x_o | λ̂)) dy    (Eqn 31)

Expectation Maximization algorithms are well known in the art. The posterior probability P(y | x_o, λ) can be decomposed as

P(y | x_o, λ) = P(c | x_m, x_o, λ) P(x_m | x_o, λ)    (Eqn 32)

Applying Bayes, it follows that

P(c | x_m, x_o, λ) = P(c | x, λ) = P(x | c, λ) P(c | λ) / P(x | λ)    (Eqn 33)

and

P(x_m | x_o, λ) = P(x | λ) / P(x_o | λ)    (Eqn 34)

Substituting Eqn 33 into Eqn 32 it follows that

P(y | x_o, λ) = P(x | c, λ) P(c | λ) / P(x_o | λ)    (Eqn 35)

From Eqn 18 it can be shown that

P(x | c, λ) = N(x; N c, V)    (Eqn 36)

Therefore, the prior of x

P(x | λ) = ∫ P(x | c, λ) P(c | λ) dc    (Eqn 37)

can be expressed, using Eqns 14 and 18, as

P(x | λ) = N(x; x̄, U)    (Eqn 38)

with

x̄ = N ĉ    (Eqn 39)

U = V + N P N^T    (Eqn 40)

and consequently

P(x_o | λ) = N(x_o; x̄_o, U_o)    (Eqn 41)

where

x̄_o = S_o N ĉ    (Eqn 42)

U_o = S_o (V + N P N^T) S_o^T    (Eqn 43)

Substituting Eqns 14, 36 and 41 in Eqn 35 it follows that

P(y | x_o, λ) = N(x; N c, V) N(c; ĉ, P) / N(x_o; x̄_o, U_o)    (Eqn 44)

This can be summarised as

P(y | x_o, λ) = N(y; ȳ, Γ)    (Eqn 45)

where

Γ^{-1} = [[ N^T V^{-1} N + P^{-1}, −N_m^T V_m^{-1} ], [ −V_m^{-1} N_m, V_m^{-1} ]]
       = [[ N^T V^{-1} N + M^T Σ^{-1} M, −N_m^T V_m^{-1} ], [ −V_m^{-1} N_m, V_m^{-1} ]]    (Eqn 46)

and

ȳ = [[ I_{d_c} ], [ N_m ]] c̄    (Eqn 47)

where I_{d_c} is an identity matrix with d_c the dimension of c and

c̄ = (N^T V^{-1} N + P^{-1})^{-1} (P^{-1} ĉ + N_o^T V_o^{-1} x_o)
  = (N^T V^{-1} N + P^{-1})^{-1} (M^T Σ^{-1} μ + N_o^T V_o^{-1} x_o)    (Eqn 48)

The log-scaled joint probability also follows, using Eqns 36 and 14, as

log(P(y, x_o | λ)) = log(P(x | c, λ)) + log(P(c | λ))
                   = −0.5 ( log|V| + log|P| + (x − N c)^T V^{-1} (x − N c) + (c − ĉ)^T P^{-1} (c − ĉ) ) + K    (Eqn 49)

where K is a constant. Defining

H = [[ I_{d_c × d_c} ], [ 0_{d_m × d_c} ]]    (Eqn 50)

where d_m is the dimension of x_m, so that c = H^T y, we also define the mean and variance associated with one utterance as

μ_u = A_u η    and    Σ_u = diag(A_u ν)    (Eqn 50(a))

respectively, where η is the supervector made of the concatenation of all the mean vectors μ_j of λ, ν the supervector made up of the concatenation of the leading diagonal of all the covariance matrices Σ_j of λ, and where A_u is the selection matrix for the mean value of the given utterance.
Using Eqn 50(a), Eqn 49 can be written as

log(P(y, x_o | λ̂)) = −0.5 ( log|V̂| + log|P̂| + (x_o − N_o H^T y)^T V̂_o^{-1} (x_o − N_o H^T y) + (y − ŷ)^T Ĝ (y − ŷ) ) + K    (Eqn 51)

where

Ĝ = [[ N_m^T V̂_m^{-1} N_m + P̂^{-1}, −N_m^T V̂_m^{-1} ], [ −V̂_m^{-1} N_m, V̂_m^{-1} ]]    (Eqn 52)

and

ŷ = [[ I_{d_c} ], [ N_m ]] P̂ M^T Σ̂^{-1} A_u η̂    (Eqn 53)

where A_u is the selection matrix for the mean value of the given utterance.
Expectation step
In the expectation step, the following quantities are calculated:

ȳ = ∫ P(y | x_o, λ) y dy    (Eqn 54)

R = ∫ P(y | x_o, λ) y y^T dy    (Eqn 55)

and the auxiliary vector

ν̃ = ν̃_o + ν̃_m    (Eqn 56)

where

ν̃_o = on-diag( S_o ( N H^T Γ H N^T ) S_o^T ) + on-diag( (x_o − N_o H^T ȳ)(x_o − N_o H^T ȳ)^T )    (Eqn 57)

and

ν̃_m = on-diag( G (Γ + ȳ ȳ^T) G^T )    (Eqn 58)

with

G = [ N_m, −I ]    (Eqn 59)

From Eqn 46, Γ can be written as

Γ = [[ K_11, K_11 N_m^T ], [ N_m K_11, K_22 ]]    (Eqn 60)

with

K_11 = (N_o^T V_o^{-1} N_o + P^{-1})^{-1}    (Eqn 61)

K_22 = V_m + N_m K_11 N_m^T    (Eqn 62)

Note that

H^T Γ H = K_11    (Eqn 63)

G Γ G^T = V_m    (Eqn 64)

so that equations 57 and 58 can be simplified as

ν̃_o = on-diag( N_o K_11 N_o^T ) + on-diag( (x_o − N_o c̄)(x_o − N_o c̄)^T )    (Eqn 65)

ν̃_m = on-diag( V_m )    (Eqn 66)

Maximization step
In the maximization step, an auxiliary function is defined over all utterances in the training data:

Q_total(λ, λ̂) = Σ_{∀u} Q(λ, λ̂)    (Eqn 67)

where u represents the utterance index. The maximization step consists of maximizing Q_total by finding the derivative of Q_total(λ, λ̂) with respect to λ̂ and setting it to zero. This is done while holding the values of Γ and ȳ calculated in the expectation step above constant. Such an approach yields the following so-called update equations for the parameters of the model:

η̂ = ( Σ_{∀u} A_u^T Σ̂_u^{-1} M_u P_u M_u^T Σ̂_u^{-1} A_u )^{-1} ( Σ_{∀u} A_u^T Σ̂_u^{-1} M_u c̄_u )    (Eqn 68)

σ̂ = ( Σ_{∀u} A_u^{(e)T} A_u^{(e)} )^{-1} Σ_{∀u} A_u^{(e)T} ν̃_u    (Eqn 69)

Due to the concatenation coefficients, however, there is no closed-form solution for the model covariance ν̂. In an embodiment, a solution is obtained with an iterative method following the gradient of Q_total with respect to ν̂ (Eqns 70 to 72), which is a function of the statistics Γ_u and ȳ_u accumulated during the expectation step. Using Eqn 63 and the fact that

H^T ȳ ȳ^T H = c̄ c̄^T    (Eqn 73)

this gradient can be simplified (Eqn 74). Note that the model error σ_m for the missing regions of logF0 is invariant, and there is therefore no need to compute its update. Further, if σ_m ≫ σ, then for any σ_m, σ the following approximation can be made:

N^T V^{-1} N ≈ N_o^T V_o^{-1} N_o    (Eqn 75)

Under this approximation, the update equations of the model are exactly the same as those that are obtained when the latent variable consists just of c. Note that if this approximation is not made, σ_m will act as a form of variance flooring.
The model λ̂ that maximizes Q_total is obtained during the maximization step. This model λ̂ is input as λ into the expectation step and the cycle of expectation-maximization recommences.
As the function Q is convex, this iterative process will produce an optimum value for λ. Thus, the model is calculated by applying maximum-likelihood criteria. The coefficients c and the missing values x_m are also calculated as part of the training. The coefficients c calculated in this way are known as maximum a-posteriori (MAP) coefficients.
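For orientation, here is a heavily simplified sketch of such an expectation-maximization cycle, assuming a single cluster, no concatenation coefficients and one shared scalar error variance; it is not the patent's full algorithm, only an illustration of how missing (unvoiced) frames are skipped in the E-step so that no interpolation is needed.

```python
import numpy as np

def idct_matrix(num_frames, order=5):
    """Truncated inverse-DCT matrix N^{d} mapping `order` coefficients to a contour."""
    n = np.arange(num_frames)[:, None]
    k = np.arange(order)[None, :]
    N = np.cos(np.pi * (n + 0.5) * k / num_frames) * np.sqrt(2.0 / num_frames)
    N[:, 0] /= np.sqrt(2.0)
    return N

def em_fa_dct(contours, voiced_masks, order=5, n_iter=10):
    """Toy EM for x_s = N_s c_s + e_s with unvoiced frames treated as missing."""
    mu, Sigma, sigma2 = np.zeros(order), np.eye(order), 0.1
    mats = [idct_matrix(len(x), order) for x in contours]
    for _ in range(n_iter):
        c_bars, K11s, err, n_obs = [], [], 0.0, 0
        for x, m, N in zip(contours, voiced_masks, mats):
            m = np.asarray(m, dtype=bool)
            N_o, x_o = N[m], np.asarray(x, dtype=float)[m]
            # E-step: Gaussian posterior of c_s given only the observed (voiced) frames.
            K11 = np.linalg.inv(N_o.T @ N_o / sigma2 + np.linalg.inv(Sigma))
            c_bar = K11 @ (np.linalg.solve(Sigma, mu) + N_o.T @ x_o / sigma2)
            c_bars.append(c_bar)
            K11s.append(K11)
            r = x_o - N_o @ c_bar
            err += float(r @ r + np.trace(N_o @ K11 @ N_o.T))
            n_obs += len(x_o)
        # M-step: re-estimate the cluster mean/covariance and the error variance.
        mu = np.mean(c_bars, axis=0)
        Sigma = sum(K + np.outer(c - mu, c - mu) for K, c in zip(K11s, c_bars)) / len(c_bars)
        sigma2 = err / n_obs
    return mu, Sigma, sigma2

# Toy data: three "syllables" of different lengths, one with unvoiced leading frames.
rng = np.random.default_rng(0)
contours = [np.log(200 + 15 * np.sin(np.linspace(0, np.pi, d))) + rng.normal(0, 0.01, d)
            for d in (12, 20, 33)]
masks = [np.ones(12, bool), np.r_[np.zeros(4, bool), np.ones(16, bool)], np.ones(33, bool)]
mu, Sigma, sigma2 = em_fa_dct(contours, masks)
print(mu.round(3), sigma2)
```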
Methods and systems in accordance with the above embodiments do not require smoothing and interpolation during the parameterization of the F0 trajectory because missing values are trained together with the model itself. Therefore, the structure of the intonation model does not depend on any artificially created values. Further, the reliability of the observed F0 values is determined automatically by the model and the F0 vectors extracted by different methods can be combined according to their reliability in a consistent probabilistic framework. By providing a sound mathematical framework, the FA-DCT enables the integration of different models into a product of experts as well as the usage of other techniques such as speaker adaptation, cluster-adaptive-training (CAT), etc. This framework produces models of the composed correlation matrix that model long-term correlations well enough to allow for sampling-based parameter generation.
To run the EM algorithm some initial values for the model are required. The initialisation is non-trivial for the following reasons:
* the dimension of the logF0 of each syllable/utterance, x_s, is variable and so are the dimensions of the transformation matrix N;
* logF0 is discontinuous, therefore in some cases neither the static c_s nor the continuity coefficients Δx_s can be obtained; and
* since for some syllables o_s cannot be obtained, it is impossible to cluster syllables directly on the o space when building the decision tree.
In order to overcome the problems associated with initializing the concatenation coefficients, in an embodiment, the following solution is adopted:
1. the DCT of the logF0 of the syllables are clustered using a factor analysis (FA) model without any concatenation parameters;
2. the MAP distribution of the DCT coefficients and missing points of each syllable, N(ȳ_s, Γ_s), are obtained;
3. the probability density function of the observation vector is computed using the MAP distributions;
4. (optionally) the models are re-clustered on the o space; and
5. a full FA analysis is run over the models of o.
The implementation of the above steps in the training of the model is now described with reference to figure 7.
In Step S701, the audio input is segmented into linguistic units according to its corresponding text. In an embodiment, the audio input is segmented into units at a plurality of linguistic levels, such as syllable, phrase, etc. In a further embodiment, the audio input is segmented into units at only a single linguistic level.
In Step S703 the fundamental frequency is extracted from the audio to give a vector of logF0 values x = [x_1, x_2, ..., x_T]. In Step S705 each F0 frame from the input audio is classified according to its reliability level, i.e. as missing (unvoiced), reliable or unreliable. In the embodiment described above (see Eqns 21 to 22), it is not necessary to classify frames as reliable and unreliable; instead they are classified only as missing or observed, x_m and x_o respectively.
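A minimal sketch of this classification step is given below, assuming the common convention that unvoiced frames are marked with a non-positive F0 value; the function name and that convention are illustrative and not taken from the patent.

```python
import numpy as np

def classify_frames(f0):
    """Label each F0 frame as observed (voiced) or missing (unvoiced), and return
    the logF0 vector with missing frames left as NaN (sketch of steps S703/S705)."""
    f0 = np.asarray(f0, dtype=float)
    observed = f0 > 0.0                                          # assumed unvoiced marker: f0 <= 0
    logf0 = np.where(observed, np.log(np.where(observed, f0, 1.0)), np.nan)
    return logf0, observed
```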
In Step S707 the linguistic context features are obtained from the text and associated with each of the corresponding segments of x.
In Step S709 a model for the static coefficients is computed. Models of static coefficients neglect correlations between syllables and treat them as independent.
In one embodiment, the Baum-Welch algorithm described above is employed to find the static DCT coefficients of each linguistic unit. When the concatenation coefficients are neglected, each leaf of the decision tree can be optimised independently. For each syllable s, the model is still the one of Eqn 6 but the update equations are simpler. In order not to confuse the notation in this section with that of the general approach, the prime symbol will be used to denote the values and models of a single syllable. For all the syllables s associated with one leaf node j of a decision tree the variance of the error is (cf. Eqn 20)

V' = diag([σ'_r, σ'_n, σ'_m])   (Eqn 76)

where σ' = [σ'_r, σ'_n, σ'_m]^T, with r, n and m the indices for reliable, non-reliable and missing.
Since there are no concatenation coefficients, it follows from the equation o = Mc that M = I. Inserting this into equations 14 and 16 gives P as a block diagonal matrix with one block associated with each syllable. Each syllable model j can therefore be optimized independently, such that Eqn 50(a) simplifies and consequently A' = I. Without concatenation coefficients, the MAP estimation of the DCT coefficients becomes (from Eqn 48)

ĉ' = (N'^T V'^{-1} N' + Σ'^{-1})^{-1} (N'^T V'^{-1} x' + Σ'^{-1} μ')   (Eqn 77)

and the update equations follow simply (cf. Eqns 68 and 69) as
the expressions of Eqns 78 and 79. Without concatenation coefficients, it can be shown from Eqn 71 that the gradient expression simplifies, from which it follows that the covariance update of Eqn 80 now has a closed-form solution.
In order to accelerate the computation, the following values can be pre-computed for each syllable:

α'_κ = N'^T_κ N'_κ   (Eqn 81)
β'_κ = N'^T_κ x'_κ   (Eqn 82)

with κ ∈ {r, n, m}. Using Eqns 81 and 82, it can be shown that K' and μ̂' take the compact forms of Eqns 83 and 84. For a 5th order DCT, the dimensions of α'_κ and β'_κ are 5x5 and 5x1 respectively.
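For illustration, the sketch below computes a per-syllable MAP estimate of the static DCT coefficients in the spirit of Eqn 77, combining a Gaussian prior N(μ', Σ') with the Gaussian observation model x'_o = N'_o c' + error. The function and variable names are assumptions; the patent's pre-computed quantities of Eqns 81 to 84 are not reproduced here.

```python
import numpy as np

def map_dct_coefficients(x_obs, N_obs, V_obs, mu_prior, Sigma_prior):
    """Posterior mean (MAP estimate) and covariance of the static DCT coefficients
    of one syllable when the concatenation coefficients are neglected."""
    Vinv = np.linalg.inv(V_obs)              # observation error precision
    Sinv = np.linalg.inv(Sigma_prior)        # prior precision
    A = N_obs.T @ Vinv @ N_obs + Sinv
    b = N_obs.T @ Vinv @ x_obs + Sinv @ mu_prior
    K = np.linalg.inv(A)                     # posterior covariance
    return K @ b, K                          # MAP coefficients and their covariance
```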
In Step S709, the static coefficients are clustered into decision trees.
The clustering of the static coefficients obtained with the above embodiment is described with reference to figure 6.
In step S601 a root node is trained by running the Baum-Welch algorithm described above over every syllable x^s. This gives a root node model λ_0.
In step S603 an optimum splitting question, i.e. a question to split the syllables into two clusters, is determined. In maximum likelihood clustering, the goal is to find the split that maximises the likelihood difference ΔL between the two cluster models and the single parent model (Eqn 85), where λ_1 and λ_2 are the models for the two clusters. In an embodiment λ_1 and λ_2 are determined by running the Baum-Welch algorithm over their respective syllables.
To obtain the likelihood of the observed part x_o, Eqn 38 is used, which, in the absence of concatenation coefficients, becomes

log(P(x_o^s | λ)) = log(N(x_o^s; x̂_o^s, U_s))   (Eqn 86)

where now (cf. Eqns 39 and 40) x̂_o^s is given by Eqn 87 and

U_s = N_o^s Σ N_o^{sT} + V_o^s   (Eqn 88)

To determine each splitting question, the FA-clustering requires:
1. an FA-model for λ_1 and λ_2 to be trained by running an EM algorithm; and
2. ΔL to be computed.
Given the large number of possible questions, it is not practical to use the Baum-Welch algorithm directly. In an embodiment, an approximation is made by pre-selecting a reduced sub-set of questions with a faster algorithm, and then using the Baum-Welch algorithm to select the optimum question from that subset. Following the JEMA method, which is described in Aguero et al. (2008), A study of JEMA for intonation modeling, Proc. ICASSP, April 2008, Las Vegas, USA, a fast clustering algorithm based on RMSE using the previously defined α' quantities is employed. RMSE is well known in the art. We give a brief overview of one possible approach below.
The total time-domain error e_j of a node j, introduced when the observed logF0 values x_o^s of the syllables s associated with that node are substituted by a mean vector N_o^s μ_j, is given by

e_j(μ_j) = Σ_{∀s∈j} (x_o^s - N_o^s μ_j)^T (x_o^s - N_o^s μ_j) = μ_j^T a_j μ_j - 2 μ_j^T b_j + q_j   (Eqn 89)

where

a_j = Σ_{∀s∈j} N_o^{sT} N_o^s   (Eqn 90)
b_j = Σ_{∀s∈j} N_o^{sT} x_o^s   (Eqn 91)
q_j = Σ_{∀s∈j} x_o^{sT} x_o^s   (Eqn 92)

and N_o^s and x_o^s are defined as before (Eqns 93 and 94).
The mean vector that minimizes this error can be shown to be μ*_j = a_j^{-1} b_j, which gives

e*_j = q_j - b_j^T a_j^{-1} b_j   (Eqn 95)

Therefore, the total error reduction produced by a given split follows as

Δe = b_y^T a_y^{-1} b_y + b_n^T a_n^{-1} b_n - b_p^T a_p^{-1} b_p   (Eqn 96)

where y, n and p refer to the yes-child, the no-child and the parent nodes respectively. This equation allows for a very efficient pre-selection of the N-best splitting questions.
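The sketch below illustrates this pre-selection, using the reconstructed error-reduction criterion of Eqn 96 on per-syllable statistics a_s = N_o^{sT} N_o^s and b_s = N_o^{sT} x_o^s. The function names, the data layout and the way questions are answered are assumptions made for the example.

```python
import numpy as np

def node_gain(a, b):
    """b' a^-1 b for one node (cf. Eqns 95 and 96)."""
    return float(b @ np.linalg.solve(a, b))

def preselect_questions(questions, syllable_stats, answers, n_best=10):
    """Rank candidate splitting questions by the error reduction of Eqn 96 and keep
    the N best. `syllable_stats[s]` is assumed to be the pair (a_s, b_s) for syllable s,
    and `answers[q][s]` True if syllable s answers question q with 'yes'."""
    def accumulate(sylls):
        a = sum(syllable_stats[s][0] for s in sylls)
        b = sum(syllable_stats[s][1] for s in sylls)
        return a, b
    all_sylls = list(syllable_stats)
    parent_gain = node_gain(*accumulate(all_sylls))
    scored = []
    for q in questions:
        yes = [s for s in all_sylls if answers[q][s]]
        no = [s for s in all_sylls if not answers[q][s]]
        if not yes or not no:                      # the question must actually split the node
            continue
        gain = node_gain(*accumulate(yes)) + node_gain(*accumulate(no)) - parent_gain
        scored.append((gain, q))
    scored.sort(key=lambda t: t[0], reverse=True)  # largest error reduction first
    return [q for _, q in scored[:n_best]]
```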
In step S605, it is determined if the split is valid according to pre-determined validity conditions. In an embodiment, the split is valid if ΔL is larger than a certain threshold. In another embodiment, the split is valid if all clusters of syllables are split by the splitting question. In another embodiment, a split is valid if the total number of splits arising from the splitting question is below a certain threshold.
If the split is valid then each new cluster becomes a node and the calculation returns to step S603.
If the split is not valid, then the node becomes a leaf in the decision tree and the calculation progresses to step S607. In step S609 the Baum-Welch algorithm is re-run on the x values associated with each node.
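Steps S601 to S609 amount to a greedy tree-growing loop; the recursive sketch below illustrates its shape. The callables split_fn and is_valid_fn are assumptions standing in for the question search (e.g. Eqn 85 or the pre-selection of Eqn 96) and the validity conditions described above.

```python
def grow_tree(node_syllables, questions, split_fn, is_valid_fn):
    """Greedy decision-tree growing: keep splitting while the best split is valid,
    otherwise turn the node into a leaf (sketch of steps S601-S609)."""
    question, yes, no, delta_L = split_fn(node_syllables, questions)
    if question is None or not is_valid_fn(delta_L, yes, no):
        return {"leaf": True, "syllables": node_syllables}     # node becomes a leaf (S607)
    return {                                                    # valid split: recurse on each child (S603)
        "leaf": False,
        "question": question,
        "yes": grow_tree(yes, questions, split_fn, is_valid_fn),
        "no": grow_tree(no, questions, split_fn, is_valid_fn),
    }
```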
Even without concatenation coefficients, the EM algorithm requires an initial value of the model parameters at each node. In an embodiment, an initialization value of the model λ is obtained using the RMSE approximation. An RMSE approximation is known in the art and will not be described here. The root node initial values can be obtained from the RMSE approximation as μ'_0 = a_0^{-1} b_0 (Eqn 97), with the corresponding initial values for the covariance and the error variances given by Eqns 98 and 99 for κ ∈ {r, n}. In an embodiment an initial value for σ'_m is also fixed at this point. For the rest of the splits, the mean vector μ'_j is initialised as μ'_j = a_j^{-1} b_j. Initial values of σ'_κ and Σ' are taken to be those of the parent node.
In step S711, the concatenation coefficients of the model are initialized using the static DCT coefficients ĉ calculated in step S709.
For each observation s, the concatenation parameters that maximize the likelihood given the data and the model are to be obtained. That is, for each concatenation coefficient,

Δ̂ = argmax_Δ P(Δ | x^{s-1}, x^s, x^{s+1}, λ)   (Eqn 100)

is calculated, where λ is the model obtained for the static coefficients ĉ in step S709 above and Δ is one of the concatenation coefficients Δc_s, Δx_o^s or Δẍ_o^s. Using Eqn 7 and Eqn 8, this posterior probability can be written as an integral over the static coefficients c_{s-1}, c_s and c_{s+1} of the neighbouring syllables (Eqn 101). For a static model without concatenation coefficients, it can be shown that

P(c_s | x^s, λ) = N(c_s; ĉ_s, K_s)   (Eqn 102)

where ĉ_s = μ̂' and K_s = K' for a given syllable s, and K' and μ̂' were given in Eqns 83 and 84 respectively.
The corresponding distributions for the concatenation coefficients therefore follow as Gaussians over Δc_s, Δx_o^s and Δẍ_o^s (Eqns 103 to 105), with means and variances given by Eqns 106 to 111 (derived from Eqns 7 to 12), where K_s^{0,0} is the 0-th row, 0-th column element of K_s. Thus, the initial value for the concatenation coefficients is obtained as a linear combination of the ĉ_s. Since this produces a set of real distributions on the observation-vector space o, there are three options for how to proceed: a) keep the same clustering structure of the static space c and create an initial model of the extended vector using the decision tree obtained for the static features; b) re-cluster the models over the o space; or c) write them down to a file as untied models and re-cluster them using a decision tree, thus obtaining a new clustering just over the concatenation coefficients. These options are discussed in detail below.
In step S712, in one embodiment, the concatenation coefficients are calculated for each syllable, treating each syllable as independent from adjacent syllables. The concatenation coefficients are then clustered using the approach described in figure 6. Thus two decision trees are produced: one for the concatenation coefficients and one for the static DCT coefficients.
Once the concatenation coefficients are obtained in this way, the extended observation vector o is calculated. This is done by assuming that each of the parameters is independent. Using Eqns 102 to 111, an approximate posterior probability for the extended observation vector defined in Eqn 13 can be written as

P(o_s | x^s, λ) = N(o_s; ô_s, K_s^o)   (Eqn 112)

where ô_s and K_s^o are given by Eqns 113 and 114. In order to calculate this extended vector, the decision trees for each coefficient are followed according to the context of each syllable.
In another embodiment, the concatenation coefficients are not clustered separately. Instead, the models of the extended vector o given in Eqns 112 to 114 are calculated by treating the coefficients as untied. The models of the o are then clustered using the algorithm described in figure 6. In this case the mean and variance of the extended model can be shown to become the expressions of Eqns 115 and 116 respectively (cf. Eqns 78 and 80).
In yet another embodiment, the decision tree for the static coefficients calculated above is imposed on the concatenation coefficients. The models of the extended vector o given in equations 20-22 are then calculated according to a single decision tree. The advantage in this case is that the initial model required for the static coefficients does not need to be very large.
In step S713 the models of the extended observation vector o calculated in step S712 above are retrained by running the full EM algorithm described in Equations 30 to 74 above. The model calculated is input as λ into the auxiliary function Q of Eqn 31. Whereas, in step S709, concatenation coefficients modelling the relationship between adjacent linguistic units were ignored, in this step the model is trained so that such relationships are included.
Steps S701 to S713 are repeated for each level of linguistic unit being considered.
In another embodiment, the initialization of the static coefficients c' (the model obtained in step S707) is not done using an FA algorithm, but instead using regularization-based model initialization. Regularization methods use a technique to guarantee the smoothness of the inverse transform of the static coefficients. In order to obtain c', the weighted least squares error between the observed values of x' and the model values is minimized and a smoothness criterion is imposed on c' such that x' is smooth.
The inverse discrete cosine transform (DCT) of c' into x', x̂'(t), is given by

x̂'(t) = N(c', t) = c'_0 + 2 Σ_{i=1}^{p} c'_i cos(π i (t + 0.5) / T)   (Eqn 117)

where p is the dimension of the DCT vector and T the dimension of x'. Eqn 117 can be expressed as a linear equation

x̂' = N' c'   (Eqn 118)

The RMSE estimation of c' given x' is the one that minimizes

ε = (x' - N'c')^T W (x' - N'c')   (Eqn 119)

where W is an optional diagonal weighting matrix. RMSE approximations are well known in the art and will not be discussed further here.
Minimization of Eqn 119 with respect to c' yields

ĉ' = (N'^T W N')^{-1} N'^T W x'   (Eqn 120)

If x' is complete, i.e. there are no missing (discontinuous) values of x' or, equivalently, no element of W is zero, this equation is usually stable. However, when there are missing x' values, a smoothing term that acts as a penalty when the values of c' become meaningless is introduced. Eqn 119 becomes:

ε_R = (x' - N'c')^T W_o (x' - N'c') + λ R(c')   (Eqn 121)

In an embodiment, R is a function which is small if N'c' is smooth and large otherwise. In an embodiment, R is the integral of the square value of the derivative:

R(c') = ∫ (d x̂'(t) / dt)² dt = c'^T R c'   (Eqn 122)

Since R is already weighted by λ, this reduces, up to a constant factor, to a diagonal matrix

R ∝ diag([0, 1², 2², ..., p²])   (Eqn 123)

Minimizing ε_R with respect to c' now yields

ĉ' = (N'^T W_o N' + λR)^{-1} N'^T W_o x'   (Eqn 124)

an equation for the static coefficients. Through the introduction of values in the diagonal of the denominator via R, this equation is stable. Note that λ is just a small value that guarantees stability without a significant distortion to the error.
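A small sketch of this regularized parameterization is given below, assuming a per-frame voiced/unvoiced mask as the diagonal of W_o and an illustrative value for the penalty weight; these assumptions and the function names are not taken from the patent.

```python
import numpy as np

def dct_basis(T, p):
    """Inverse-DCT basis N' of Eqn 117: column i holds 2*cos(pi*i*(t+0.5)/T), column 0 is 1."""
    t = np.arange(T)[:, None] + 0.5
    i = np.arange(p + 1)[None, :]
    N = 2.0 * np.cos(np.pi * i * t / T)
    N[:, 0] = 1.0
    return N

def regularized_dct_fit(logf0, voiced, p=4, lam=1e-3):
    """Weighted least squares with a smoothness penalty (cf. Eqns 121-124);
    unvoiced (missing) frames receive zero weight."""
    logf0 = np.asarray(logf0, dtype=float)
    w = np.asarray(voiced, dtype=float)                    # diagonal of W_o
    N = dct_basis(len(logf0), p)
    R = np.diag(np.arange(p + 1, dtype=float) ** 2)        # smoothness penalty, up to a constant
    A = N.T @ (w[:, None] * N) + lam * R
    b = N.T @ (w * np.nan_to_num(logf0))
    c = np.linalg.solve(A, b)
    return c, N @ c                                        # static coefficients and smooth contour
```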
Once the static coefficients are computed in this way, the concatenation coefficients are computed from them.
The decision tree clustering of step S709 of coefficients obtained in this way is done using standard procedures, either assuming that each syllable is a point distribution or assigning a very low covariance value. Standard procedures for decision tree clustering are well known in the art and will not be discussed here.
In an embodiment, once the model structure is defined in this way, the error covariance associated to each leaf node and type of input (missing, reliable, non-reliable) is initialized as the average square error of each cluster with respect to the training data.
Steps S710 to S713 then proceed in the manner described above.
Product of Experts

One of the advantages of the described framework is that it allows the combination of models at multiple levels, either as a product-of-experts (PoE) or as a superpositional model. If the model consists of a PoE with multiple levels, the concatenation coefficients are not needed because the continuity of the logF0 between different segments of one level is already guaranteed by the model at the upper level.
An embodiment using a product of experts approach is now described with reference to figure 8.
In Step S801, the audio input is segmented into linguistic units according to its corresponding text. In an embodiment, the audio input is segmented into units at a plurality of linguistic levels, such as syllable, phrase, etc. In Step S803 the fundamental frequency is extracted from the audio. In statistical intonation methods, the fundamental frequency F0 or logF0 signal is treated as a random variable. For an utterance with T linguistic units a vector of logF0 values x = [x_1, x_2, ..., x_T] is extracted.
In Step S805 each F0 frame from the input audio is classified according to its reliability level, i.e. as missing, reliable or unreliable. In the embodiment described above (see Eqns 18 to 74), it is not necessary to classify frames as reliable and unreliable; instead they are classified only as missing or observed, x_m and x_o respectively.
In Step S807 the linguistic context features are obtained from the associated text and associated with each of the x values.
In Step S809 the Baum-Welch algorithm described above is employed to find and cluster the DCT coefficients of each linguistic unit into a decision tree as described above. In this step the concatenation coefficients are neglected, as in Step S709.
In step S811 a product of experts (PoE) model is defined. In a product of experts, instead of computing the likelihood of a given trajectory x over one single model λ, it is computed over a set of models Λ = {λ_1, ..., λ_l, ..., λ_L} as

P(x | Λ) = Π_{∀l} P(x | λ_l)^{γ_l} / Z   (Eqn 125)

where l represents the model level, e.g. syllable, accent group, etc., and Z is the normalising term. Thus each expert in the product of experts is a model for the static coefficients alone. The experts are then retrained using the full FA as a product of experts. Because each expert is defined at a different linguistic level, correlations between adjacent acoustic units are calculated during this training process.
The auxiliary function (Eqn 31) becomes Q(λ̂, Λ). In the general case the product of experts model would require the whole auxiliary function to be recomputed. In an embodiment, however, all the experts are Gaussian and their variables o_l can be expressed as a linear combination of the main model parameters c, i.e., o_l = W_l c. In this case, the total product-of-experts becomes a standard trajectory model (Eqns 126 to 129), where F_l is the matrix that transforms the static coefficients corresponding to level l into coefficients at the output level. In an embodiment, the models are updated with respect to the variance and the weighting terms are ignored. In this embodiment, the training is identical to the general case described in equations 30 to 74 above.
Note that the main difference between the formulation of the problem as a PoE and the previous one is that the PoE enables the use of different context dependencies for each expert. In this way, by combining the models of the different decision trees it is possible to have a higher number of effective models with the same or even fewer free parameters.
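Because every expert here is Gaussian, their normalised product is again Gaussian, with a precision equal to the weighted sum of the expert precisions. The sketch below shows that combination; the expert weights and the function name are illustrative assumptions.

```python
import numpy as np

def product_of_gaussian_experts(means, covs, weights=None):
    """Combine Gaussian experts (cf. Eqn 125): the normalised product is Gaussian with
    precision = sum of weighted expert precisions and a precision-weighted mean."""
    if weights is None:
        weights = [1.0] * len(means)
    dim = means[0].shape[0]
    precision = np.zeros((dim, dim))
    eta = np.zeros(dim)
    for mu, cov, g in zip(means, covs, weights):
        P = g * np.linalg.inv(cov)     # weighted precision of this expert
        precision += P
        eta += P @ mu
    cov_poe = np.linalg.inv(precision)
    return cov_poe @ eta, cov_poe      # mean and covariance of the combined model
```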
In step S813 the model Λ is trained by running the full EM algorithm as described in equations 30 to 74 above.
In the above embodiment, the product of experts is constructed from models for static coefficients calculated using the FA-DCT approach described in equations 30 to 74 above. In another embodiment, one or more of the individual experts is a model for static coefficients calculated using another approach such as an HMM or any other statistical model. In yet another embodiment, one of the experts is calculated as a superpositional model.
Superpositional models will be described below.
For example, the main expert may be defined at frame level, e.g., as a frame-level HMM model, such that the only required latent variable is the logF0 trajectory, so that c = y and N = I. HMM models are well known in the art and are not described here. In the HMM case it follows from Eqns 60, 48 and 49 that the simplifications of Eqns 130 and 131 can be made, and

log(P(y, x_o | λ)) = log(P(x_o | y, λ)) + log(P(y | λ)) = -0.5 (log|V_o| + log|P| + (x_o - S_o y)^T V_o^{-1} (x_o - S_o y) + (y - ŷ)^T P^{-1} (y - ŷ))   (Eqn 132)

Superpositional Model

Based on the FA analysis, there is yet another possibility to create a multi-level model as a superpositional model.
In this embodiment, instead of leaving each model to completely model the whole pitch contour, a low-scale model models the errors of the higher-scale one. The model in this embodiment is

x = N_Λ c_Λ + ε_Λ = Σ_{∀l} N_l c_l + ε_Λ   (Eqn 133)
N_Λ = [N_0, ..., N_l, ..., N_L]   (Eqn 134)
c_Λ = [c_0^T, ..., c_l^T, ..., c_L^T]^T   (Eqn 135)

where l corresponds to a linguistic level, with l = 0 the highest level and l = L the lowest level under consideration, and ε_Λ = N(0, Σ_Λ), in which Σ_Λ gets modelled at the lowest level L. In the superpositional model the model parameters can easily be trained together; however, the same is not feasible for the model structures. A possible solution is to iteratively estimate the model parameters and the tree structures for each level. In one embodiment, model initialization proceeds as follows (a sketch is given below):
1. the model at the highest level, λ^0, is clustered considering only the parameters of λ^0; l is set to l = 1;
2. λ^l is clustered, taking the parameters of the previously trained levels as constant;
3. λ^0 to λ^l are trained jointly; and
4. l is increased in unit steps and the process repeated from step 2) until l = L.
Once there is an initial structure for each level, an iterative approach of re-clustering and re-training is possible. The contribution of each cluster/level is modulated by a pre-defined matrix N which does not need to be learned.
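The sketch below illustrates the superpositional decomposition of Eqn 133 and the level-by-level residual fitting suggested by the initialization steps above. The helper fit_fn (how one level's coefficients are estimated from a residual) is an assumption; the least-squares example in the final comment is only one possible choice.

```python
import numpy as np

def superpositional_reconstruction(bases, coeffs):
    """x = sum_l N_l @ c_l (Eqn 133, without the error term)."""
    x = np.zeros(bases[0].shape[0])
    for N_l, c_l in zip(bases, coeffs):
        x += N_l @ c_l
    return x

def fit_levels_sequentially(x, bases, fit_fn):
    """Fit the highest level first, then let each lower level model the residual
    left by the levels above it (in the spirit of initialization steps 1-4)."""
    coeffs, residual = [], np.asarray(x, dtype=float).copy()
    for N_l in bases:
        c_l = fit_fn(residual, N_l)        # assumed per-level estimator
        coeffs.append(c_l)
        residual = residual - N_l @ c_l    # the next level models what is left
    return coeffs

# Example of an assumed per-level fit: ordinary least squares on the residual.
# fit_fn = lambda r, N: np.linalg.lstsq(N, r, rcond=None)[0]
```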
This embodiment will now be described with regard to figure 9. In Step 901, the audio input is segmented into linguistic units according to its corresponding text. In an embodiment, the audio input is segmented into units at a plurality of linguistic levels, such as syllable, phrase, etc. In Step 903 the fundamental frequency is extracted from the audio. In statistical intonation methods, the fundamental frequency F0 or logF0 signal is treated as a random variable. For an utterance with T linguistic units a vector of logF0 values x = [x_1, x_2, ..., x_T] is extracted.
In Step 905 each F0 frame from the input audio is classified according to its reliability level, i.e. as missing, reliable or unreliable. In the embodiment described above (see Eqns 18 to 74), it is not necessary to classify frames as reliable and unreliable; instead they are classified only as missing or observed, x_m and x_o respectively.
In Step 907 the linguistic context features are obtained from the associated text and associated with each of the x values.
In Step 909 the Baum-Welch algorithm described above is employed to train the model λ^0 of the highest linguistic level being considered. In this step the concatenation coefficients are neglected, as in Step S709. Following this step l is set as 1.
In Step 911 the Baum-Welch algorithm described above is employed to train the model λ^l of linguistic level l using the superpositional model of Eqn 133 restricted to levels 0, ..., l. This is done by holding the parameters of models λ^0, ..., λ^{l-1} constant.
In Step 913 the Baum-Welch algorithm described above is employed to jointly train all models using the model in Eqn 133.
Steps 911 and 913 are repeated until l=L.
The above described methods can model long-term correlations as they model explicitly the pitch contours of supra-segmental structures such as syllables, intonation phrases, etc. A better temporal correlation between the different parts of an utterance allows one to use sampling-based synthesis, as in step S111 of Figure 2. In sampling-based synthesis, instead of always producing the mean value of the probability distribution for the global logF0 contour, a contour is produced which is generated randomly from that distribution. This method can reduce the monotony of the generated intonation but it requires a long-term model of the temporal correlation. Figure 10 shows a schematic of the correlation matrix U of equation 13, U = V + N P N^T, using three different approaches to modelling speech.
Figure 10(a) shows the correlation matrix for a standard HMM approach with dynamic features. HMM approaches are known in the art and will not be discussed here. In standard HMM-based synthesis, the model directly models the probability of the observed logF0 at the frame level. Intonation is mainly supra-segmental. Statistically, this means that the logF0 values of frames belonging to the same supra-segmental unit, e.g. a syllable, tend to be more strongly correlated than those belonging to different units.
The Δx_t and Δ²x_t coefficients used in standard HMMs produce a band correlation matrix of the values within the windows used to calculate them. This is represented in Figure 10(a) by dark grey and light grey shaded squares. The diagonal terms (shaded dark grey) represent terms corresponding to individual acoustic units. The off-diagonal terms (light grey) arise from the Δ and Δ² coefficients describing the concatenation between acoustic units. The length of such windows is fixed and their boundaries do not match any linguistic unit. Supra-segmental information in standard HMMs appears only implicitly, as questions in the decision tree-based clustering of the state-level logF0 models.
Figure 10(b) shows the correlation matrix for the DCT-F0 model, the training of which was described above using a factor analysis approach with reference to figure 7. The main idea of the parametric F0 model is to make supra-segmental information explicit by defining statistical models that represent the logF0 contour of proper linguistic units. Mathematically this is equivalent to using a block diagonal covariance matrix with variable-width blocks where each block represents the frames associated with each supra-segmental unit. These block-diagonal elements of variable width are represented in figure 10(b). The main diagonal blocks (dark grey) are terms corresponding to the individual acoustic units. The upper and lower diagonal blocks (light grey) represent terms arising from the concatenation coefficients.
Figure 10(c) shows the correlation matrix of a product of four DCT-F0 models using the product of experts (PoE) approach, as described above. Unlike the other approaches, PoE allows for different context dependencies for each expert. The F0 can be obtained with different methods and, depending on the nature of the experts, with different frame shifts, taking into account the uncertainty associated with each method.
The block diagonal matrices corresponding to the different units of each model are represented in figure 10(c). The schematic of the correlation matrix is shaded a different colour for each linguistic level. In the example shown in figure 10(c), the lowest level model is defined at the frame level, as in the HMM case of figure 10(a), and the diagonal terms are of equal width. The long-term correlations of this model are modelled by all the other models in the product of experts. The other levels are defined at real linguistic levels and hence the diagonal blocks have variable width, as the acoustic units are of variable duration.
A better temporal correlation between the different parts of an utterance allows one to use sampling-based synthesis, as described in an above embodiment. In sampling-based synthesis, instead of always producing the mean value of the probability distribution for the global logF0 contour, a contour is produced which is generated randomly from that distribution. This method can reduce the monotony of the generated intonation but it requires a long-term model of the temporal correlation.
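A minimal sketch of sampling-based generation is shown below: given the mean and covariance of the global logF0 distribution, a random contour is drawn instead of (or as well as) the mean. The function name and the optional seed are illustrative assumptions.

```python
import numpy as np

def generate_logf0(mean, cov, sample=True, seed=None):
    """Return either the mean logF0 contour or a random contour drawn from N(mean, cov)."""
    if not sample:
        return np.asarray(mean, dtype=float)       # conventional mean-based generation
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov)      # sampling-based generation
```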
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods, systems and carrier media described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims (18)

  1. CLAIMS: 1. A text-to-speech method, said method comprising: inputting text; dividing said text into a first sequence of acoustic units corresponding to a first linguistic level, the duration of each of said acoustic units being equal to a plurality of frames; obtaining linguistic context features from said text; mapping said acoustic units to probability distributions that relate said linguistic context features for each of said acoustic units to speech parameters, wherein said speech parameters correspond to a linear parameterization of a speech signal contour over said plurality of frames according to a speech vector model; estimating the duration of each of the said acoustic units using a duration model; converting said first sequence of acoustic units into a sequence of speech vectors by combining said probability distributions into a probability distribution of output coefficients, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distribution of output coefficients; and outputting said sequence of speech vectors as audio.
  2. 2. The text-to-speech method of claim 1, wherein dividing said text comprises dividing said text into a first sequence of acoustic units and at least one further sequence of acoustic units, each of said sequences corresponding to a different linguistic level; and converting said first sequence of acoustic units and said at least one further sequence of acoustic units into said sequence of speech vectors using said speech vector model.
  3. 3. The text-to-speech method of claim 1, wherein said speech vectors comprise one or more of the fundamental frequency, the spectrum coefficients and band aperiodicity.
  4. 4. The text-to-speech method of claim 3, wherein said spectrum coefficients are selected from lsp, mel-lsp, cepstral, mel-cepstral, generalized mel-cepstral and harmonics amplitude.
  5. 5. The text-to-speech method of claim 1, wherein said first linguistic level is selected from phones, diphones, syllables, moras, words, accent feet, intonational phrases, sentences, phrases, breath groups, or paragraphs.
  6. 6. The text-to-speech method of claim 1, wherein said mapping from said text to said plurality of probability distributions that relate said acoustic units to speech parameters is performed using one or more of decision trees, neural networks or linear models.
  7. 7. The text-to-speech method of claim 1, wherein said probability distributions are Gaussian distributions.
  8. 8. A method of training a speech vector model, said method comprising: inputting an audio sample and a corresponding text; dividing said audio sample into a first sequence of acoustic units corresponding to a first linguistic level according to said corresponding text; extracting a first speech vector at said first linguistic level from said audio sample; identifying regions of said speech vector that are discontinuous; obtaining said speech vector model by applying maximum-likelihood criteria, wherein said discontinuous regions of said speech vector are computed as measurements with maximum unreliability.
  9. 9. The method of training a speech vector model of claim 8, wherein dividing said audio sample comprises dividing said audio sample into a first sequence of acoustic units and at least one further sequence of acoustic units, each of said sequences corresponding to a different linguistic level, and wherein extracting said speech vector comprises extracting a first speech vector and at least one further speech vector at said different linguistic levels from said audio sample.
  10. 10. The method of training a speech vector model of claim 9, wherein said speech vector model is obtained as a product of experts of speech vector sub-models each defined at said different linguistic levels.
  11. 11. The method of training a speech vector model of claim 9, wherein said speech vector model is obtained as a superposition of speech vector sub-models defined at said different linguistic levels.
  12. 12. The method of training a speech vector model of claim 8, wherein said speech vector model comprises a set of one or more Gaussian distributions.
  13. 13. The method of training a speech vector model of claim 9 wherein said sub-models comprise a set of one or more Gaussian distributions.
  14. 14. The method of training a speech vector model of claim 8 wherein vectors for discontinuous regions of said speech vector are not computed using interpolation techniques.
  15. 15. The method of training a speech vector model of claim 8, wherein the degree of reliability or unreliability of regions of said speech vector is modelled as having a Gaussian probability distribution and said probability distribution is computed as part of said speech vector model.
  16. 16. A text-to-speech system, said system comprising: a text input configured to receive inputted text; a processor configured to: divide said text into a first sequence of acoustic units corresponding to a first linguistic level, the duration of each acoustic unit being equal to a plurality of frames; obtain linguistic context features from said text; map said acoustic unit to probability distributions that relate said linguistic context features for each of said acoustic units to speech parameters, wherein said speech parameters correspond to a linear parameterization of a speech signal contour over said plurality of frames according to a speech vector model; estimate the duration of each of the said acoustic units using a duration model; convert said first sequence of acoustic units into a sequence of speech vectors by combining said probability distributions into a probability distribution of output coefficients, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distribution of output coefficients; and output said sequence of speech vectors as audio.
  17. 17. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 1.
    18. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 8.

Amendments to the Claims have been filed as follows

CLAIMS:

1. A text-to-speech method, said method comprising: inputting text; dividing said text into a first sequence of acoustic units corresponding to a first linguistic level, the duration of each of said acoustic units being equal to a plurality of frames; obtaining linguistic context features from said text; mapping said acoustic units to probability distributions that relate said linguistic context features for each of said acoustic units to speech parameters, wherein said speech parameters are generated by a linear parameterization of a speech signal contour over said plurality of frames according to a speech vector model; estimating the duration of each of the said acoustic units using a duration model; converting said first sequence of acoustic units into a sequence of speech vectors by combining said probability distributions into a probability distribution of output coefficients, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distribution of output coefficients; and outputting said sequence of speech vectors as audio.

2. The text-to-speech method of claim 1, wherein dividing said text comprises dividing said text into a first sequence of acoustic units and at least one further sequence of acoustic units, each of said sequences corresponding to a different linguistic level; and converting said first sequence of acoustic units and said at least one further sequence of acoustic units into said sequence of speech vectors using said speech vector model.

3. The text-to-speech method of claim 1, wherein said speech vectors comprise one or more of the fundamental frequency, the spectrum coefficients and band aperiodicity.

4. The text-to-speech method of claim 3, wherein said spectrum coefficients are selected from lsp, mel-lsp, cepstral, mel-cepstral, generalized mel-cepstral and harmonics amplitude.

5. The text-to-speech method of claim 1, wherein said first linguistic level is selected from phones, diphones, syllables, moras, words, accent feet, intonational phrases, sentences, phrases, breath groups, or paragraphs.

6. The text-to-speech method of claim 1, wherein said mapping from said text to said plurality of probability distributions that relate said acoustic units to speech parameters is performed using one or more of decision trees, neural networks or linear models.

7. The text-to-speech method of claim 1, wherein said probability distributions are Gaussian distributions.

8. A method of training a speech vector model, said method comprising: inputting an audio sample and a corresponding text; dividing said audio sample into a first sequence of acoustic units corresponding to a first linguistic level according to said corresponding text; extracting a first speech vector at said first linguistic level from said audio sample; identifying regions of said speech vector that are discontinuous; obtaining said speech vector model by applying maximum-likelihood criteria, wherein said discontinuous regions of said speech vector are computed as measurements with maximum unreliability.

9.
The method of training a speech vector model of claim 8, wherein dividing said audio sample comprises dividing said audio sample into a first sequence of acoustic units and at least one further sequence of acoustic units, each of said sequences corresponding to a different linguistic level, and wherein extracting said speech vector comprises extracting a first speech vector and at least one further speech vector at said different linguistic levels from said audio sample.

10. The method of training a speech vector model of claim 9, wherein said speech vector model is obtained as a product of experts of speech vector sub-models each defined at said different linguistic levels.

11. The method of training a speech vector model of claim 9, wherein said speech vector model is obtained as a superposition of speech vector sub-models defined at said different linguistic levels.

12. The method of training a speech vector model of claim 8, wherein said speech vector model comprises a set of one or more Gaussian distributions.

13. The method of training a speech vector model of claim 9 wherein said sub-models comprise a set of one or more Gaussian distributions.

14. The method of training a speech vector model of claim 9 wherein vectors for discontinuous regions of said speech vector are not computed using interpolation techniques.

15. The method of training a speech vector model of claim 8, wherein the degree of reliability or unreliability of regions of said speech vector is modelled as having a Gaussian probability distribution and said probability distribution is computed as part of said speech vector model.

16. A text-to-speech system, said system comprising: a text input configured to receive inputted text; a processor configured to: divide said text into a first sequence of acoustic units corresponding to a first linguistic level, the duration of each acoustic unit being equal to a plurality of frames; obtain linguistic context features from said text; map said acoustic unit to probability distributions that relate said linguistic context features for each of said acoustic units to speech parameters, wherein said speech parameters are generated by a linear parameterization of a speech signal contour over said plurality of frames according to a speech vector model; estimate the duration of each of the said acoustic units using a duration model; convert said first sequence of acoustic units into a sequence of speech vectors by combining said probability distributions into a probability distribution of output coefficients, wherein said converting of said first sequence of acoustic units into a sequence of speech vectors comprises producing a random sample from said probability distribution of output coefficients; and output said sequence of speech vectors as audio.

17. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 1.
  18. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 8.
GB1221625.5A 2012-11-30 2012-11-30 Speech synthesis Active GB2508411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1221625.5A GB2508411B (en) 2012-11-30 2012-11-30 Speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1221625.5A GB2508411B (en) 2012-11-30 2012-11-30 Speech synthesis

Publications (2)

Publication Number Publication Date
GB2508411A true GB2508411A (en) 2014-06-04
GB2508411B GB2508411B (en) 2015-10-28

Family

ID=50683752

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1221625.5A Active GB2508411B (en) 2012-11-30 2012-11-30 Speech synthesis

Country Status (1)

Country Link
GB (1) GB2508411B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724765A (en) * 2020-06-30 2020-09-29 上海优扬新媒信息技术有限公司 Method and device for converting text into voice and computer equipment
CN113453072A (en) * 2021-06-29 2021-09-28 王瑶 Method, system and medium for splicing and playing multi-language video and audio files according to levels

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104392716B (en) * 2014-11-12 2017-10-13 百度在线网络技术(北京)有限公司 The phoneme synthesizing method and device of high expressive force
CN108630190B (en) * 2018-05-18 2019-12-10 百度在线网络技术(北京)有限公司 Method and apparatus for generating speech synthesis model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0805433A2 (en) * 1996-04-30 1997-11-05 Microsoft Corporation Method and system of runtime acoustic unit selection for speech synthesis
WO2010142928A1 (en) * 2009-06-10 2010-12-16 Toshiba Research Europe Limited A text to speech method and system
US20120035917A1 (en) * 2010-08-06 2012-02-09 At&T Intellectual Property I, L.P. System and method for automatic detection of abnormal stress patterns in unit selection synthesis

Also Published As

Publication number Publication date
GB2508411B (en) 2015-10-28
