GB2524503A - Speech synthesis - Google Patents

Speech synthesis

Info

Publication number
GB2524503A
GB2524503A GB1405253.4A GB201405253A GB2524503A GB 2524503 A GB2524503 A GB 2524503A GB 201405253 A GB201405253 A GB 201405253A GB 2524503 A GB2524503 A GB 2524503A
Authority
GB
United Kingdom
Prior art keywords
accent
speech
text
sequence
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1405253.4A
Other versions
GB201405253D0 (en)
GB2524503B (en)
Inventor
Balakrishna Venkata Jagannadha Kolluru
Vincent Ping Leung Wan
Javier Lattore-Martinez
Kayoko Yanagisawa
Mark John Francis Gales
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1405253.4A priority Critical patent/GB2524503B/en
Publication of GB201405253D0 publication Critical patent/GB201405253D0/en
Publication of GB2524503A publication Critical patent/GB2524503A/en
Application granted granted Critical
Publication of GB2524503B publication Critical patent/GB2524503B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation

Abstract

A text to speech system outputs speech with a particular accent by converting input text S101 into a sequence of semantic units (e.g. graphemes) S105 and further into a sequence of accent-dependent acoustic units (e.g. phones) S107 using an accent model comprising sub-models based on probability distributions relating semantic and acoustic units for a given accent with an interpolation weight, such that selecting an accent (253, figs 4 & 5) selects the values of these interpolation weights. The accent model may be trained (fig. 6) by receiving speech data from speakers having different accents in order to derive an initial set of interpolation weights for each accent.

Description

Speech synthesis
FIELD
Embodiments of the present invention as generally described herein relate to a text to speech system and method.
BACKGROUND
Text to speech systems are systems where audio speech or audio speech files are outputted in response to reception of a text file.
Text to speech systems may output speech with particular attributes such as accent. Accent may affect speech both at the acoustic level and at the phonetic level. For example, a consonant forming part of a word may be pronounced differently in one accent compared to another. In some accents consonants may be omitted entirely.
There is a continuing need to create systems which can produce natural sounding speech in a range of accents.
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 shows a schematic of a text to speech system; Figure 2 is a schematic of a co-segmentation of canonical and surface phones; Figure 3 is a flow chart showing a text to speech system in accordance with an embodiment; Figure 4 is a schematic of a system showing how the accent may be selected; Figure 5 is a variation on the system of figure 4; Figure 6 is a flow diagram of a method for adapting the text to speech system in response to a target voice; Figure 7 is a schematic of a system for training an accent and acoustic model; Figure 8 is a flow chart of a method for training a text to speech system in accordance with an embodiment; Figure 9 is a flow chart of a method for initialising an acoustic model; Figure 10 is a flow chart of a method for initialising a joint sequence model; Figure 11 is a flow diagram showing the calculation of phone to phone mapping; Figure 12 is a graph showing the variation in phonetic sequence with interpolation weights.
DETAILED DESCRIPTION
In an embodiment, a text to speech method for outputting speech with a particular accent is provided, said method comprising: inputting text; selecting an accent with which to output said text; converting said text into a sequence of semantic units; converting said sequence of semantic units into a sequence of accent dependent acoustic units using an accent model; converting said sequence of accent dependent acoustic units into a sequence of speech vectors using an acoustic model; outputting said sequence of speech vectors as speech with said selected accent, wherein said accent model comprises a plurality of accent sub-models, wherein each accent sub-model comprises parameters describing probability distributions which relate semantic units to acoustic units of a specific accent, and wherein an interpolation weight is associated with each accent sub-model, such that selecting an accent comprises selecting the values of said interpolation weights.
The semantic units comprise a representation common to all accents. In an embodiment, the semantic units are phones. In another embodiment, the semantic units are graphemes.
Converting the text into the sequence of semantic units may comprise using a lexicon.
Converting the text into the sequence of semantic units may comprise using a canonical lexicon.
Converting the sequence of semantic units into a sequence of accent dependent units does not comprise the use of a lexicon.
In an embodiment, converting said sequence of semantic units into a sequence of accent dependent units comprises converting a common representation into an accent dependent phonetic sequence.
The selected accent may be an accent for which the system is specifically trained or it may be a different accent. The selected accent may be different from the accents modelled by the individual accent sub-models. The selected accent may comprise a combination of accents, for example an accent which is 50% British English and 50% US English.
The acoustic units may be phones. The accent models may be joint sequence models.
The accent models may be phone to phone joint sequence models. The accent models may be grapheme to phone joint sequence models.
In an embodiment the acoustic units are treated as hidden variables in the accent sub-models.
The semantic units may not correspond to the selected accent. The semantic units may comprise phones which do not correspond to the selected accent. The semantic units may comprise phones which correspond to an accent which is different to the selected accent.
In an embodiment the interpolation weights are continuous such that the selected accent can be varied on a continuous scale. The interpolation weights may be defined using audio, text, an external agent or any combination thereof. The accent sub-models and their interpolation are employed to generate pronunciation variation in the speech synthesis.
In an embodiment, the acoustic model may comprise parameters relating to speaker voice and speaker attributes. In an embodiment, the acoustic model comprises parameters relating to accent.
In another embodiment, a method of training an accent model for a text-to-speech system is provided, wherein said accent model converts a sequence of semantic units into a sequence of accent dependent acoustic units, the method comprising: receiving speech data from a plurality of speakers speaking with different accents; training a plurality of accent sub-models, said training comprising deriving parameters describing probability distributions which relate semantic units to acoustic units of a specific accent; and deriving a set of interpolation weights, wherein said interpolation weights are varied to allow the accent model to accommodate different accents such that selecting an accent comprises selecting the values of said interpolation weights.
In a further embodiment, a method of training a text-to-speech system is provided, the method comprising: receiving speech data from a plurality of speakers speaking with different accents; training a plurality of accent sub-models, said training comprising deriving parameters describing probability distributions which relate semantic units to acoustic units of a specific accent; deriving a set of interpolation weights, wherein said interpolation weights are varied to allow the accent model to accommodate different accents such that selecting an accent comprises selecting the values of said interpolation weights; and training an acoustic model, said training comprising deriving parameters describing probability distributions which relate acoustic units to speech vectors.
The accent model and acoustic model are trained together.
Methods in accordance with embodiments of the present invention can be implemented either in hardware or in software in a general purpose computer. Further, methods in accordance with embodiments of the present invention can be implemented in a combination of hardware and software.
Methods in accordance with embodiments of the present invention can also be implemented by a single processing apparatus or a distributed network of processing apparatuses.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal, e.g. an electrical, optical or microwave signal.
Figure 1 shows a text to speech system 1. The text to speech system 1 comprises a processor 3 which executes a program 5. Text to speech system 1 further comprises storage 7. The storage 7 stores data which is used by program 5 to convert text to speech. The text to speech system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a text input 15. Text input 15 receives text. The text input 15 may be, for example, a keyboard. Alternatively, text input 15 may be a means for receiving text data from an external storage medium or a network.
Connected to the output module 13 is an output for audio 17. The audio output 17 is used for outputting a speech signal converted from text which is input into text input 15. The audio output 17 may be, for example, a direct audio output, e.g. a speaker, or an output for an audio data file which may be sent to a storage medium, networked, etc. In use, the text to speech system 1 receives text through text input 15. The program 5 executed on processor 3 converts the text into speech data using data stored in the storage 7. The speech is output via the output module 13 to audio output 17.
Text to speech systems calculate the best observation sequence O for a given word sequence w. This observation sequence is given by argmax_O P(O | w). In practice, text to speech systems convert input text into a sequence of acoustic units q = {φ_1^(q), ..., φ_|q|^(q)}, where |q| is the number of units in the sequence. An acoustic model is then used to determine the best observation sequence from those acoustic units. In an embodiment, the sequence of acoustic units is treated as a hidden variable and O is calculated using

    argmax_O P(O | w) = argmax_O { Σ_q P(O | q) P(q | w) }    (1)

In an embodiment, the optimum value of q may be employed in Eqn (1) as a first approximation to the marginalization of q, i.e. Eqn (1) becomes

    argmax_O P(O | w) ≈ argmax_{O,q} { P(O | q) P(q | w) } = argmax_O { P(O | q̂) }    (2)

In an embodiment, the acoustic units are phones. In another embodiment, the units may be context dependent, e.g. triphones, which take into account not only the phone which has been selected but also the preceding and following phones.
In an embodiment, the problem of generating the phonetic sequence q is treated as a phone to phone (p2p) problem. In this embodiment, a given word sequence is first converted into a canonical phonetic sequence g using a pre-defined lexicon. The surface phonetic sequence q is then generated from the canonical sequence of phones using phone-to-phone joint sequence models. In an embodiment, a number of joint sequence models are provided, each one corresponding to a specific accent. Depending on which combination of these joint sequence models is employed, and the weights associated with these models (see below), the surface phonetic sequence will vary. The variation in surface phonetic sequences enables a variety of accents for the output speaker to be accommodated by the system and for the accent of the output speech to be varied at the phonetic level.
Variations in the phonetic sequence enable the system to accommodate the effects of accent at the phonetic level. For example, a British English speaker would typically pronounce a long vowel in the word "bird" instead of pronouncing the "r". A Scottish English speaker, in contrast, would pronounce the "r". This is an example of the influence of accent at the phonetic level.
Associated with each joint sequence model is an interpolation weight λ_p. The value of each λ_p varies according to the accent of the desired speaker. Because the interpolation weights are continuous, the system enables continuous interpolation between different accents.
This is representative of the variation of accents with geography; upon crossing the Scottish border into England there is not a step change in accent but rather a gradual decrease in the influence of Scottish pronunciation. Thus, the interpolation weights enable the system to accommodate the fluid nature of accents.
This dynamic scheme facilitates control of the voice synthesis system, enabling the user to choose any combination of accents to be included in a single voice. Blending of different accents makes it possible to generate subtle variations within an accent. Further, only a single (canonical) lexicon is required for synthesis: separate lexica are not needed for each surface accent. New accents may also be generated by interpolating an existing set of accents: a pre-built lexicon for new accents is not required.
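As a simple illustration of how interpolation weights might blend the accent-specific phone-to-phone sub-models, the following Python sketch mixes the surface-phone distributions of two accents for a single canonical phone. The probability tables, accent labels and the deletion symbol are invented toy values, not data from the patent; the sketch only shows the weighted-mixture idea.

```python
# Illustrative sketch: blending accent-specific phone-to-phone sub-models
# with interpolation weights. The probability tables are invented toy
# values for the canonical phone "r" in post-vocalic position.

def interpolate_p2p(sub_models, weights, canonical_phone):
    """Return P(surface phone | canonical phone) under the weighted mixture."""
    blended = {}
    for accent, model in sub_models.items():
        for surface, prob in model.get(canonical_phone, {}).items():
            blended[surface] = blended.get(surface, 0.0) + weights[accent] * prob
    return blended

# Toy sub-models: P(surface | canonical "r") for two accents (assumed values).
sub_models = {
    "enUK": {"r": {"r": 0.1, "<del>": 0.9}},    # non-rhotic: "r" usually dropped
    "enSC": {"r": {"r": 0.95, "<del>": 0.05}},  # rhotic: "r" usually realised
}

# Equal blend of the two accents.
weights = {"enUK": 0.5, "enSC": 0.5}
print(interpolate_p2p(sub_models, weights, "r"))
# {'r': 0.525, '<del>': 0.475} -- a gradual shift rather than a step change
```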
In this embodiment, the estimation of all model parameters is based on maximizing

    L(Θ_am, Θ_p2p, λ^(1), ..., λ^(S); O, w) = Σ_s log p(O^(s) | q̂^(s), Θ_am^(s))    (3)

where Θ_am^(s) is the acoustic model for a speaker s and where q̂^(s) is estimated from

    q̂^(s) = argmax_q { p(O^(s) | q, Θ_am^(s)) P(q | λ^(s), w, Θ_p2p) }    (4)

In one embodiment, the phone to phone joint segmentation model Θ_p2p acts at the word level, i.e.

    P(q | λ^(s), w, Θ_p2p) = Π_{i=1}^{|w|} P(q_i | λ^(s), w_i, Θ_p2p)
                           = Π_{i=1}^{|w|} Σ_{g_i ∈ G} P(q_i | λ^(s), g_i, Θ_p2p) P(g_i | w_i)    (5)

In this embodiment, the complete phonetic sequence is partitioned into contributions from each individual word, i.e.

    q = {q_1, ..., q_|w|}    (6)

with a similar partition for the canonical phonetic sequence g.
The probability P(g_i | w_i) of a canonical phonetic sequence g_i given a word w_i is obtained from a canonical lexicon. Pronunciation lexica are well known in the art. In an embodiment, a single canonical lexicon is used such that there is a unique mapping w → g. The same canonical lexicon is used whichever accent is selected. This lexicon can be in the form of a look-up table; an algorithm for grapheme-to-phoneme conversion, e.g. as a set of rules; or a combination of both.
In an embodiment, the probability of each of the phone realisations q_i of the word is summed over all possible joint segmentations of the word's canonical phonetic sequence g_i and the surface phonetic sequence q_i. A joint segmentation φ specifies:

    * a phone realisation: q = {q̄_1, ..., q̄_|φ|}    (7)
    * a canonical phonetic sequence: g = {ḡ_1, ..., ḡ_|φ|}    (8)

where, as before, |φ| is the number of segments in φ. The size of each of the segments need not be the same provided that the combination of canonical sub-sequences and realisation sub-sequences yields g and q respectively. The set of all possible combinations is denoted by G(g, q).
It is therefore possible to write

    P(q_i | λ^(s), w_i, Θ_p2p) = Σ_{g_i} P(g_i | w_i) Σ_{φ ∈ G(g_i, q_i)} P(φ | λ^(s), Θ_p2p) P(q_i, g_i | φ, λ^(s), Θ_p2p)
                               ∝ Σ_{φ ∈ G(g_i, q_i)} Π_j Σ_p λ_p^(s) P(q̄_j, ḡ_j | φ, Θ_p2p^(p))    (9)

Using equations (5) and (9), equation (4) gives

    q̂^(s) = argmax_q { p(O^(s) | q, Θ_am^(s))^C · Π_{i=1}^{|w|} P(g_i | w_i) · (1 / N_p) Σ_{φ ∈ G(g_i, q_i)} Π_j Σ_p λ_p^(s) P(q̄_j, ḡ_j | φ, Θ_p2p^(p)) }    (10)

where C is a normalising phone to phone scaling factor, required because the acoustic probability produced by p(O | q, Θ_am) and the co-segmentation probability generated by the interpolated joint sequence models are to different scales, and N_p is a normalising factor.
The normalising factor can be viewed as a transition probability which gives the probability of a specific joint sequence of phones φ corresponding to a word given all the possible valid sequences for that word obtainable from all the joint sequence models of the model. For example, Figure 2 shows a schematic representation of three possible joint segmentations of canonical and surface phones given three canonical phones g1, g2 and g3 and four surface phones q1, q2, q3 and q4. The three joint segmentations shown in Figure 2 are

    φ1 = {…}
    φ2 = {<q1, g1>, <q2, g2>, <q4, g3>}
    φ3 = {<q1, g1>, <q2, g2>, <q3, g3>}    (11)

The normalising factor determines the probability of one of these joint segmentations given all of the possible segmentations. In an embodiment, it is assumed that N_p is subsumed in the computation as all possible and valid joint segmentations are considered.
In an embodiment, the approach to computing q̂ using equation (10) comprises a two-step process: 1) Generate an N-best list of phonetic sequences for a given word from a joint segmentation of surface and canonical phones using the joint sequence models, i.e. determine the phonetic sequences q that give the N-best values of P(q | λ^(s), w, Θ_p2p). 2) Rescore the N-best list of phonetic sequences using the respective acoustic models.
Using these N-best phonetic sequences, determine q from equation (4).
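The following Python sketch illustrates the shape of this two-step process under stated assumptions: the candidate surface sequences, their phone-to-phone probabilities and the acoustic scoring function are all invented stand-ins, not the patent's actual models.

```python
# Illustrative two-step sketch: (1) keep the N-best surface phonetic
# sequences under the interpolated joint sequence models, then
# (2) rescore them with an acoustic score. Both scoring functions here
# are placeholders (assumptions), not the patent's actual models.
import heapq
import math
import random

def n_best(candidates, p2p_score, n=5):
    """Step 1: keep the N candidates with the highest P(q | lambda, w, p2p)."""
    return heapq.nlargest(n, candidates, key=p2p_score)

def rescore(nbest, acoustic_loglik):
    """Step 2: pick the sequence maximising the acoustic log likelihood."""
    return max(nbest, key=acoustic_loglik)

# Toy candidate surface sequences for the word "bird" (assumed).
candidates = [("b", "er", "d"), ("b", "er", "r", "d"), ("b", "ir", "d")]

p2p_probs = {candidates[0]: 0.6, candidates[1]: 0.3, candidates[2]: 0.1}
p2p_score = lambda q: math.log(p2p_probs[q])

random.seed(0)
acoustic_loglik = lambda q: -random.random() * len(q)  # placeholder acoustic model

shortlist = n_best(candidates, p2p_score, n=2)
best = rescore(shortlist, acoustic_loglik)
print(shortlist, best)
```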
A flow chart of a text to speech system according to this embodiment is shown in Figure 3.
In step SlOl, text is received. The text may be inputted via a keyboard, touch screen, text predictor or the like.
In S103, the desired accent is selected. This may be done using a number of different methods.
Examples of some possible methods for selecting the accent are explained with reference to figures 4 and 5. Ultimately, this comprises selecting the accent interpolation weights λ_p, with one weight for each accent for which a joint segmentation model has been trained.
For example, if joint segmentation models have been trained for British English (enUK) and US English (enUS) then it is possible to select any combination of these accents. For example, if the user desires an output speaker having equal contributions from both accents, then λ_enUK, the coefficient corresponding to the enUK phone to phone joint sequence model, takes the value 0.5 and λ_enUS is also given the value 0.5.
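A minimal sketch of this weight selection is given below; the accent labels and raw values are examples only, and the helper simply normalises whatever non-negative weights are supplied so that they sum to one.

```python
# Minimal sketch of selecting accent interpolation weights. The accent
# labels and values are illustrative; any non-negative weights can be
# supplied and are normalised so that they sum to one.

def select_accent(**raw_weights):
    total = sum(raw_weights.values())
    return {accent: w / total for accent, w in raw_weights.items()}

# Equal contribution from British and US English.
print(select_accent(enUK=0.5, enUS=0.5))   # {'enUK': 0.5, 'enUS': 0.5}

# A predominantly Scottish blend with a little US English.
print(select_accent(enSC=0.8, enUS=0.2))   # {'enSC': 0.8, 'enUS': 0.2}
```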
In Step S105, the input text is converted into the sequence of canonical acoustic units (phones) g = {g_1, ..., g_|w|} using the canonical phonetic lexicon. As described above, the same canonical lexicon is used whichever accent is selected.
In Step S107, the probability distributions P(q | λ^(s), w, Θ_p2p) relating surface phonetic sequences to the input text are determined using the accent interpolation weights for the desired speaker λ^(s) and the joint sequence models Θ_p2p. In Step S109, probability distributions p(O^(s)) of acoustic observations for the N-best phonetic sequences determined in step S107 are computed using the acoustic model Θ_am. In this embodiment the N-best phonetic sequences are the N phonetic sequences which correspond to the input text with the highest probability. N-best methods are well known in the art; N is a task-specific empirical value. In an embodiment, N = 5. In Step S111, the sequence which maximizes the acoustic log likelihood is selected from the N-best phonetic sequences determined in step S109, i.e. argmax_q { p(O^(s) | q_N-best, Θ_am) }. Thus the N-best list is "rescored" using the acoustic model. Speech parameters are then determined from argmax { P(O) }.
Note that in the case that N = 1, i.e. that there is only one suitable phonetic sequence, q̂ in Eqn (2) can be simply calculated from

    q̂ = argmax_q P(q | λ^(s), w, Θ_p2p)
      = argmax_q Π_i Σ_p λ_p^(s) P(q_i | w_i, Θ_p2p^(p))    (12)

In step S113, the speech parameters determined in step S111 are input into a vocoder that converts the acoustic parameters used for modelling back into audio. Any standard vocoder with parameters suitable for statistical modelling may be employed.
In step S115 accented speech is output as audio.
Figure 4 shows a possible method of selecting the accent. Here, a user directly selects the accent interpolation weights λ using, for example, a mouse to drag and drop a point on the screen, a keyboard to input a figure, etc. In Figure 4, a selection unit 251, which comprises a mouse, keyboard or the like, selects the weightings using display 253. Display 253, in this example, has a radar chart which shows the weightings. The user can use the selecting unit 251 in order to change the dominance of the various accents via the radar chart. Note that the sum of all λ is equal to one and therefore the radar chart of Figure 4 shows the relative weights of different accents. It will be appreciated by those skilled in the art that other display methods may be used.
In a further embodiment, the system is provided with a memory which saves predetermined sets of accent interpolation weights. Each vector may be designed to allow the text to be output with a different accent, for example English with a British accent (enUK), a US accent (enUS), or a Scottish accent (enSC), etc. A system in accordance with such an embodiment is shown in Figure 5. Here, the display 253 shows different accents which may be selected by selecting unit 251.
Figures 4 and 5 show two possible interfaces for selecting the accent of the output speech in systems according to embodiments. However, other interfaces or methods of selecting accent interpolation weights may be employed with methods and systems according to embodiments.
In an embodiment, the system may adjust the weightings in response to speech input from a speaker. In this embodiment, the system is used to synthesise an accent where the system is given an input of a target voice with the same accent.
Figure 6 shows one example. First, the input target voice is received at step S501. Next, the accent interpolation weights, i.e. the weightings of the joint sequence models which have been previously trained, are adjusted to match the target voice in step S503.
In step S505, audio is then outputted using the new weightings derived in step S503.
Thus, if a particular accent is sought it is possible to adjust the accent interpolation weights in response to an input of natural speech. This enables the use of the text to speech system as a conversational system which can adapt the accent of its responses to that of the user.
In the embodiments described above, the text is converted to a canonical phonetic sequence.
The canonical phonetic sequence is then converted into an accent phonetic sequence using the phone to phone models. For languages such as English, in which pronunciation is often irregular, this has the advantage of minimising text to phone mapping, which is computationally difficult.
In another embodiment, the text is converted to a grapheme sequence rather than a canonical phonetic sequence. The grapheme sequence is then converted to an accent phonetic sequence. In this embodiment, the method described in the flow chart of figure 3 is unchanged. However, a grapheme sequence is obtained in step S105 and the joint sequence models describe grapheme to phone conversion. In yet another embodiment, the text is converted to a grapheme sequence which is then converted to a canonical phonetic sequence. The canonical phonetic sequence is then converted to an accent phonetic sequence as described above. This method may be employed, for example, when the text to canonical phone mapping for a particular word is not known but the grapheme to canonical phone mapping is known. For example, a piece of text may include a rare word whose canonical form has not been included in the lexicon. Techniques for text to grapheme mapping are well known in the art.
In an embodiment, the acoustic model is also configured to accommodate accent. In this embodiment, phonetic level accent effects are synthesised by the joint sequence models while acoustic level accent effects are handled by the acoustic model. Examples of acoustic models capable of handling accent effects at the acoustic level include Cluster Adaptive Training (CAT) models, Constrained Maximum Likelihood Linear Regression (CMLLR) models, etc. In a further embodiment, the acoustic model may allow factorization of speaker and emotion. In this embodiment, the phone to phone models described above provide the necessary handle to control accent for text to speech synthesis in a three dimensional factorisation of speaker, emotion and accent. Examples of acoustic models which operate by factorization of speaker and emotion components include Cluster Adaptive Training (CAT) models and Constrained Maximum Likelihood Linear Regression (CMLLR) models. The training of the text to speech system according to an embodiment will now be described with reference to Figures 7-11.
The system of figure 7 is similar to that described with reference to figure 1. Therefore, to avoid any unnecessary repetition, like reference numerals will be used to denote like features.
In addition to the features described with reference to figure 1, figure 7 also comprises an audio input 23 and an audio input module 21. When training a system, it is necessary to have an audio input which matches the text being inputted via text input 15.
The flowchart of Figure 8 shows the training of the system in accordance with an embodiment.
In Step S701, training speech is input into the system. These are the observations O in Equation (3) during the training of the system. In an embodiment, speech corresponding to a variety of speakers and a variety of accents is input. In an embodiment, input speech corresponding to each speaker speaking in a variety of accents is input. In another embodiment, data comprising speech from a variety of speakers, each speaking in a single accent, is input.
In Step S703, text corresponding to the input training speech is input.
In step S705, the joint segmentation Θ_p2p and acoustic Θ_am models are initialized.
In an embodiment, the acoustic model may be any speaker-independent, accent-independent statistical model which is itself based on Hidden Markov Models (HMMs). Suitable acoustic models are well known in the art and include models trained with Cluster Adaptive Training (CAT) techniques, Speaker Adaptive Training (SAT), deep neural networks (DNNs), linear dynamic models, etc. However, other models could also be used.
In speech processing systems which are based on Hidden Markov Models (HMMs), the HMM is often expressed as:

    M = (A, B, Π)    (13)

where A = {a_ij}_{i,j=1..N} is the state transition probability distribution, B = {b_j(o)}_{j=1..N} is the state output probability distribution and Π = {π_i}_{i=1..N} is the initial state probability distribution, and where N is the number of states in the HMM.
How an HMM is used in a speech modelling system is well known in the art and will not be described here.
In the current embodiment, the state transition probability distribution A and the initial state probability distribution Π are determined in accordance with procedures well known in the art.
Therefore, the remainder of this description will be concerned with the state output probability distribution.
A flowchart showing the initialization of the acoustic model according to an embodiment is shown in Figure 9.
As described in relation to the flow chart of Figure 8, in step S701, audio data comprising samples of speech from all speakers and for all accents is input.
In step S203, acoustic features are extracted from the audio files. Such acoustic features may comprise any features suitable for speech recognition, such as mel-cepstral features, perceptual linear predictive (PLP) features, etc. How to extract such features is well known in the art and will not be further explained here.
In Steps S205-S209 the initial acoustic model is constructed.
Generally in text to speech systems the state output vector or speech vector o(t) from an m-th Gaussian component in a model set M is

    p(o(t) | m, M) = N(o(t); μ_m, Σ_m)    (14)

where μ_m and Σ_m are the mean and covariance of the m-th Gaussian component.
The aim when training a conventional text-to-speech system is to estimate the model parameter set M which maximises the likelihood for a given observation sequence.
As it is not possible to obtain the above model set based on so-called Maximum Likelihood (ML) criteria purely analytically, the problem is conventionally addressed by using an iterative approach known as the expectation maximisation (EM) algorithm, which is often referred to as the Baum-Welch algorithm. Here, an auxiliary function (the "Q" function) is derived:

    Q(M, M') = Σ_m Σ_t γ_m(t) log p(o(t), m | M)    (15)

where γ_m(t) is the posterior probability of component m generating the observation o(t) given the current model parameters M', and M is the new parameter set. After each iteration, the parameter set M' is replaced by the new parameter set M which maximises Q(M, M').
In Step S205, a uniform Gaussian distribution is estimated from all the input data and employed for all initial distributions in the acoustic model. This initial distribution is known as a "flat start" and is a procedure well known in the art.
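A minimal sketch of a flat start is shown below, assuming NumPy and a random feature matrix standing in for real acoustic frames; the feature dimensionality and state count are illustrative choices, not values from the patent.

```python
# Sketch of a "flat start": a single Gaussian estimated from all of the
# training frames is used as the initial output distribution of every
# HMM state. The random feature matrix stands in for real acoustic data.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(10000, 39))   # e.g. 39-dim mel-cepstral frames (assumed)

global_mean = features.mean(axis=0)
global_var = features.var(axis=0)         # diagonal covariance is typical

n_states = 5                              # illustrative number of HMM states
initial_states = [{"mean": global_mean.copy(), "var": global_var.copy()}
                  for _ in range(n_states)]
# Every state starts identical; Baum-Welch re-estimation then differentiates them.
```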
In Step S207, the initial distributions are updated using the Baum-Welch algorithm described above. At this stage, usually context-independent monophone HMMs are trained.
In Step S209, an embedded re-estimation is performed. Whereas in Step S207 each monophone HMM was trained, in this re-estimation the monophone models are usually first expanded into context-dependent triphones or quinphones. Since the training data may not contain all the possible context-dependent combinations, the context-dependent models are usually clustered together with a decision tree. Embedded re-estimation of hidden Markov models as well as model clustering with decision trees are well known in the art and will not be discussed further here.
The model output from the embedded re-estimation is employed as the initialised acoustic model. The initialisation of the joint sequence models comprises determining the phone-to-phone joint segmentation.
A flowchart showing the steps involved in the initialisation of a joint sequence model is shown in Figure 10.
In Step S301 a lexicon comprising the mapping from words to canonical phones (201) is combined with a lexicon comprising the mapping from words to the surface phones to determine the canonical phone g to surface phone q mappings <q, g>, as described in relation to Figure 2 above. Suitable lexica are well known in the art and will not be described in detail here.
A schematic representation of this process is shown in Figure 11. In the example of Figure 11, the mapping from words to US English (enUS) pronunciation phones is taken to be the canonical lexicon 201 while the surface lexicon supplies the mapping from words to British English (enUK). The lexica are combined to give a direct phone to phone mapping 205. A comparable mapping keeping the same canonical lexicon is performed for a number of other surface accents. In principle there is no restriction as to which lexicon should be considered the canonical one. However, it is convenient for the canonical lexicon to be the accent for which a richer and more accurate grapheme-to-phone conversion exists.
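The sketch below illustrates the basic idea of Step S301 under stated assumptions: the lexicon entries are invented, and the pairing is a naive positional alignment of equal-length entries; real lexica and the joint segmentation of unequal-length sequences (Figure 2) are handled by the joint sequence model software.

```python
# Illustrative sketch of Step S301: combining a canonical (enUS) lexicon
# with a surface (enUK) lexicon to obtain word-level canonical-to-surface
# phone mappings. The lexicon entries are invented examples.

canonical_lexicon = {"better": ["b", "eh", "t", "axr"]}   # enUS-style, assumed
surface_lexicon = {"better": ["b", "eh", "t", "ax"]}      # enUK-style, assumed

def phone_to_phone_mapping(word):
    g = canonical_lexicon[word]   # canonical phones
    q = surface_lexicon[word]     # surface phones
    return list(zip(g, q))        # naive positional pairing for equal lengths

print(phone_to_phone_mapping("better"))
# [('b', 'b'), ('eh', 'eh'), ('t', 't'), ('axr', 'ax')]
```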
Once the mapping is determined, standard software may be employed to build the joint sequence models. One example of suitable software is Sequitur; however, other suitable software is widely available and well known in the art.
In an embodiment, the algorithm described in steps S303 to S309 is employed to build the joint sequence models.
In Step S303 the mapping data determined in step S301 is partitioned into a training set and a development set.
The available data is split into training and development sets such that 90% of the available data is used for training and 5% is used for development. The other 5% is retained for testing the validity of the model.
In step S305 the JSM models are initialised. In this step the basic structure of the model is created by defining the number of nodes in the model and their pronunciation. Each node contains information about a word's pronunciation, its history, its parent, its child, etc. At initialization, all of these are set to 0 and then updated subsequently during training.
In Step S307 a context estimation is performed. Within each context, different statistics about a pronunciation, such as counts of different phones, the maximum number of phones and their neighbours, are accumulated. This step prepares the model to be trained iteratively.
In Step S309 a joint sequence model is built using the context estimation performed in step S307. An expectation-maximisation algorithm is applied. In an embodiment, the expectation-maximisation algorithm is applied for a fixed number of iterations. In another embodiment it is applied until convergence. The model is trained iteratively until the difference between successive iterations of the training stages is below a certain threshold or until a certain minimum number of iterations has been performed.
Once the joint sequence and acoustic models are initialised, the training algorithm shown in Figure 8 proceeds to step S707.
In Step S707, the text in the training corpus is converted into canonical phones g using the canonical phonetic lexicon. In Step S709, the initial values of λ for each of the training samples are determined. In an embodiment, the initial values of λ are determined from the descriptions of the input training speech. In an embodiment, each joint sequence model corresponds to an accent used for training. For example, if one of the training speakers is described as having an accent which is British English (enUK), then one of the joint sequence models will correspond to enUK and the coefficient corresponding to this phone to phone joint sequence model, λ_enUK, takes the initial value 1.0. For this speaker, the coefficients of all other models (for example λ_enUS) are set to 0.
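A minimal sketch of this one-hot initialisation is given below; the accent inventory and speaker labels are illustrative assumptions.

```python
# Sketch of Step S709: initialising the interpolation weights for each
# training speaker from the accent label of that speaker. Accent labels
# and speaker names are illustrative.

accents = ["enUK", "enUS", "enSC"]

def initial_weights(speaker_accent):
    """One-hot initialisation: weight 1.0 on the speaker's own accent."""
    return {a: (1.0 if a == speaker_accent else 0.0) for a in accents}

training_speakers = {"speaker_01": "enUK", "speaker_02": "enUS"}
lambdas = {spk: initial_weights(acc) for spk, acc in training_speakers.items()}
print(lambdas["speaker_01"])   # {'enUK': 1.0, 'enUS': 0.0, 'enSC': 0.0}
```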
Thus, the accents of the speakers used for training become the "vertices" of the accent space as defined by the accent interpolation weights.
In Step S711, the values of λ for each speaker are re-estimated using both the joint sequence and acoustic models.
An auxiliary function for the models is given by

    Q(λ^(s)) = Σ_q p(q | O^(s), λ^(s), w, Θ) log p(q, O^(s) | λ^(s), w, Θ)
             = Σ_q p(q | O^(s), λ^(s), w, Θ) [ log P(q | λ^(s), w, Θ_p2p) + log P(O^(s) | q, w, λ^(s), Θ_am) ]
             ≈ max_q p(q | O^(s), λ^(s), w, Θ) [ log P(q | λ^(s), w, Θ_p2p) + log P(O^(s) | q, w, λ^(s), Θ_am) ]    (16)

The acoustic probability P(O^(s) | q, w, λ^(s), Θ_am) is independent of λ^(s) and therefore can be treated as a constant, C_am. Thus Q becomes

    Q = max_q p(q | O^(s), λ^(s), w, Θ) log[ Π_{i=1}^{|w|} P(q_i | λ^(s), w_i, Θ_p2p) ] + C_am
      = max_q p(q | O^(s), λ^(s), w, Θ) log[ Π_{i=1}^{|w|} Σ_{g_i ∈ G} P(q_i | λ^(s), g_i, Θ_p2p) P(g_i | w_i) ] + C_am

Since g_i is deterministic with respect to w_i, it follows that

    Q = max_q p(q | O^(s), λ^(s), w, Θ) log[ Π_{i=1}^{|w|} P(q_i | λ^(s), g_i, Θ_p2p) ] + C_am

For a single canonical model,

    Q ∝ max_q p(q | O^(s), λ^(s), w, Θ) log[ Π_{i=1}^{|w|} Σ_{φ ∈ G(g_i, q_i)} Π_j Σ_p λ_p^(s) P(q̄_j, ḡ_j | φ, Θ_p2p^(p)) ] + C_am

Using Jensen's inequality,

    Q ∝ max_q p(q | O^(s), λ^(s), w, Θ) Σ_i Σ_j log[ Σ_p λ_p^(s) P(q̄_j, ḡ_j | φ, Θ_p2p^(p)) ] + C_am    (17)

A Lagrange multiplier is added to ensure that the λ_p^(s) sum to 1. Equating the derivative to zero gives the update equation for λ_p^(s):

    λ̂_p^(s) = max_q p(q | O^(s), λ^(s), w, Θ) Σ_j [ λ_p^(s) P(q̄_j, ḡ_j | φ, Θ_p2p^(p)) / Σ_{k=1}^{P} λ_k^(s) P(q̄_j, ḡ_j | φ, Θ_p2p^(k)) ]    (18)

Corresponding update equations for the acoustic model and the joint sequence models are obtained from Eqn (17).
The initialized value of λ is input into Eqn (18) and the updated value is calculated while the joint sequence and acoustic models are held constant.
In an embodiment, this step is repeated until convergence.
In step S713, the joint sequence models are updated. This is done using Equation (17) and the relevant update equation for the joint sequence model. In this calculation λ and the acoustic model are held constant. Note that, in contrast to the initialisation step of S705, in which lexica were used to build the initial joint sequence models, in this update step the acoustic training data itself is employed. Such a step comprises updating soft statistics, such as the number of occurrences of the co-segmentations of the canonical phones g and the surface phones q (see e.g. Eqn (11) and the accompanying description), for each iteration of the optimum phonetic sequence.
Thus, the joint sequence models are re-estimated using speech data.
In an embodiment, step S713 is repeated until convergence.
In step S715 the acoustic model is updated using Eqn (17) and the corresponding update equation for the acoustic model. λ and Θ_p2p are held fixed.
Updating the acoustic model comprises updating the parameters of the model. The precise nature of these parameters will naturally depend on the exact acoustic model employed. For example, updating a CAT model comprises also updating the cluster weights, whereas updating an average voice model (AVM) comprises also updating the Constrained Maximum Likelihood Linear Regression (CMLLR) transforms.
In an embodiment, step S715 is repeated until convergence.
In an embodiment, the system returns to step S711 and steps S711-S715 are repeated until convergence. By updating the acoustic and joint sequence models and the interpolation weights together in an iterative fashion, the phone pronunciation is bound to the acoustic models, thereby ensuring the accuracy of the system.
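The control flow of this alternating schedule could be sketched as follows; the three update functions and the likelihood function are placeholders standing in for Eqns (17) and (18), so only the iteration structure is illustrated, not the actual re-estimation mathematics.

```python
# High-level sketch of the alternating training schedule of steps
# S711-S715. The update functions below are placeholders (assumptions)
# standing in for the Eqn (17)/(18) re-estimation formulas.

def update_weights(lambdas, jsm, acoustic, data): return lambdas            # placeholder for Eqn (18)
def update_joint_sequence_models(jsm, lambdas, acoustic, data): return jsm  # placeholder
def update_acoustic_model(acoustic, lambdas, jsm, data): return acoustic    # placeholder
def total_log_likelihood(lambdas, jsm, acoustic, data): return 0.0          # placeholder

def train(lambdas, jsm, acoustic, data, outer_iters=10, tol=1e-4):
    prev_loglik = float("-inf")
    for _ in range(outer_iters):
        # S711: re-estimate interpolation weights, models held fixed.
        lambdas = update_weights(lambdas, jsm, acoustic, data)
        # S713: re-estimate joint sequence models, weights and acoustic model fixed.
        jsm = update_joint_sequence_models(jsm, lambdas, acoustic, data)
        # S715: re-estimate acoustic model, weights and joint sequence models fixed.
        acoustic = update_acoustic_model(acoustic, lambdas, jsm, data)

        loglik = total_log_likelihood(lambdas, jsm, acoustic, data)
        if loglik - prev_loglik < tol:   # stop once improvement is negligible
            break
        prev_loglik = loglik
    return lambdas, jsm, acoustic

print(train({}, {}, {}, data=None))
```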
Figure 12 shows experimental results showing the effect of varying the interpolation weights of enUK and enUS on the phonetic sequence. The x-axis indicates the interpolation weight λ_enUK for enUK; equivalently, (1 − λ_enUK) is the weight for enUS. The y-axis indicates the cumulative Levenshtein distance over the entire set. The Levenshtein distance was used to measure the difference due to interpolating the different accent-specific joint sequence models. To measure the efficacy of an accent model, the Levenshtein distance between the observed phonetic sequence after the interpolation and the actual phonetic sequence derived from the relevant lexicon was computed. The blue line indicates the Levenshtein distance relative to the enUS lexicon and the red line the distance from the enUK lexicon. This difference was measured at different levels of interpolation from 0.1 to 0.9 with an increment of 0.1. As shown in Figure 12, as the interpolation weight varies on the x-axis, so does the Levenshtein distance. At higher weights, as the model trained on enUK has a higher contribution, it has fewer errors, and as its contribution is lowered the phonetic transcription moves away from enUK towards enUS pronunciation.
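For reference, the Levenshtein distance between two phone sequences is the standard dynamic-programming edit distance; the sketch below applies it to invented example sequences for "bird", not to the experimental data of Figure 12.

```python
# Standard dynamic-programming Levenshtein distance between two phone
# sequences, of the kind used to score interpolated pronunciations
# against a reference lexicon. The example sequences are invented.

def levenshtein(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)]

print(levenshtein(["b", "er", "r", "d"], ["b", "er", "d"]))   # 1
```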
Systems and methods according to the above embodiments enable accent-specific voices for commercial applications such as for use in information kiosks and audiobooks, etc. They provide a large amount of control for end users by enabling them to select and adjust the accent of the output speech very precisely. Further, the systems can be used to respond to users in their own accents.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims (18)

  1. CLAIMS: 1. A text to speech method for outputting speech with a particular accent, said method comprising: inputting text; selecting an accent with which to output said text; converting said text into a sequence of semantic units; converting said sequence of semantic units into a sequence of accent dependent acoustic units using an accent model; converting said sequence of accent dependent acoustic units into a sequence of speech vectors using an acoustic model; outputting said sequence of speech vectors as speech with said selected accent, wherein said accent model comprises a plurality of accent sub-models, wherein each accent sub-model comprises parameters describing probability distributions which relate semantic units to acoustic units of a specific accent, and wherein an interpolation weight is associated with each accent sub-model, such that selecting an accent comprises selecting the values of said interpolation weights.
  2. 2. The text to speech method of claim 1, wherein said semantic units are phones.
  3. 3. The text to speech method of claim 1, wherein said acoustic units are phones.
  4. 4. The text to speech method of claim 1, wherein said semantic units are graphemes.
  5. 5. The text to speech method of claim 2, wherein said phones are phones corresponding to an accent which is different from said selected accent.
  6. 6. The text to speech method of claim 1, wherein said accent sub-models are joint sequence models.
  7. 7. The text to speech method of claim 1, wherein said interpolation weights are continuous.
  8. 8. The text to speech method of claim 1, wherein said selected accent comprises a mixture of a plurality of accents.
  9. 9. The text to speech method of claim 1, wherein said interpolation weights are defined using audio, text, an external agent or any combination thereof.
  10. 10. The text to speech method of claim 1, wherein said selected accent is different from the accents modelled by the individual accent sub-models.
  11. 11. The text to speech method of claim 1, wherein said acoustic model comprises parameters relating to speaker voice and speaker attributes.
  12. 12. A method of training an accent model for a text-to speech system, wherein said accent model converts a sequence of semantic units into a sequence of accent dependent acoustic units, the method comprising: receiving speech data from a plurality of speakers speaking with different accents; training a plurality of accent sub-models, said training comprising deriving parameters describing probability distributions which relate semantic units to acoustic units of a specific accent; and deriving a set of interpolation weights, wherein said interpolation weights are varied to allow the accent model to accommodate different accents such that selecting an accent comprises selecting the values of said interpolation weights.
  13. 13. A method of training a text to speech system, the method comprising: receiving speech data from a plurality of speakers speaking with different accents; training a plurality of accent sub-models, said training comprising deriving parameters describing probability distributions which relate semantic units to acoustic units of a specific accent, deriving a set of interpolation weights, wherein said interpolation weights are varied to allow the accent model to accommodate different accents such that selecting an accent comprises selecting the values of said interpolation weights, training an acoustic model, said training comprising deriving parameters describing probability distributions which relate acoustic units to speech vectors.
  14. 14. A text-to-speech system for outputting speech with a particular accent, said system comprising: a text input for receiving inputted text; a processor configured to: allow selection of an accent with which to output said text; convert said text into a sequence of semantic units; convert said sequence of semantic units into a sequence of accent dependent acoustic units using an accent model; convert said sequence of accent dependent acoustic units into a sequence of speech vectors using an acoustic model; output said sequence of speech vectors as speech with said selected accent, wherein said accent model comprises a plurality of accent sub-models, wherein each accent sub-model comprises parameters describing probability distributions which relate semantic units to acoustic units of a specific accent, and wherein an interpolation weight is associated with each accent sub-model, such that selecting an accent comprises selecting the values of said interpolation weights.
  15. 15. A system for training an accent model for a text-to speech system, wherein said accent model converts a sequence of semantic units into a sequence of accent dependent acoustic units, the system comprising: an input for speech data from a plurality of speakers speaking with different accents; a processor configured to: train a plurality of accent sub-models, said training comprising deriving parameters describing probability distributions which relate semantic units to acoustic units of a specific accent; and derive a set of interpolation weights, wherein said interpolation weights are varied to allow the accent model to accommodate different accents such that selecting an accent comprises selecting the values of said interpolation weights.
  16. 16. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 1.
  17. 17. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 12.
  18. 18. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 13.
GB1405253.4A 2014-03-24 2014-03-24 Speech synthesis Expired - Fee Related GB2524503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1405253.4A GB2524503B (en) 2014-03-24 2014-03-24 Speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1405253.4A GB2524503B (en) 2014-03-24 2014-03-24 Speech synthesis

Publications (3)

Publication Number Publication Date
GB201405253D0 GB201405253D0 (en) 2014-05-07
GB2524503A true GB2524503A (en) 2015-09-30
GB2524503B GB2524503B (en) 2017-11-08

Family

ID=50686816

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1405253.4A Expired - Fee Related GB2524503B (en) 2014-03-24 2014-03-24 Speech synthesis

Country Status (1)

Country Link
GB (1) GB2524503B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454348B1 (en) * 2004-01-08 2008-11-18 At&T Intellectual Property Ii, L.P. System and method for blending synthetic voices
US20110313767A1 (en) * 2010-06-18 2011-12-22 At&T Intellectual Property I, L.P. System and method for data intensive local inference
US20120173241A1 (en) * 2010-12-30 2012-07-05 Industrial Technology Research Institute Multi-lingual text-to-speech system and method
EP2650874A1 (en) * 2012-03-30 2013-10-16 Kabushiki Kaisha Toshiba A text to speech system

Also Published As

Publication number Publication date
GB201405253D0 (en) 2014-05-07
GB2524503B (en) 2017-11-08

Similar Documents

Publication Publication Date Title
KR102581346B1 (en) Multilingual speech synthesis and cross-language speech replication
EP2846327B1 (en) Acoustic model training method and system
Chiu et al. State-of-the-art speech recognition with sequence-to-sequence models
Zeyer et al. Improved training of end-to-end attention models for speech recognition
Wong et al. Sequence student-teacher training of deep neural networks
Renduchintala et al. Multi-modal data augmentation for end-to-end ASR
Zen et al. Statistical parametric speech synthesis using deep neural networks
CN106971709B (en) Statistical parameter model establishing method and device and voice synthesis method and device
JP5242724B2 (en) Speech processor, speech processing method, and speech processor learning method
Hojo et al. DNN-based speech synthesis using speaker codes
CN107924678B (en) Speech synthesis device, speech synthesis method, and storage medium
JP6293912B2 (en) Speech synthesis apparatus, speech synthesis method and program
JP5398909B2 (en) Text-to-speech synthesis method and system
GB2546981B (en) Noise compensation in speaker-adaptive systems
KR20070077042A (en) Apparatus and method of processing speech
CN103366733A (en) Text to speech system
Qian et al. Improved prosody generation by maximizing joint probability of state and longer units
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
US20110276332A1 (en) Speech processing method and apparatus
US20160189705A1 (en) Quantitative f0 contour generating device and method, and model learning device and method for f0 contour generation
Baby et al. Deep Learning Techniques in Tandem with Signal Processing Cues for Phonetic Segmentation for Text to Speech Synthesis in Indian Languages.
Kotani et al. Voice conversion based on deep neural networks for time-variant linear transformations
Prabhavalkar et al. A factored conditional random field model for articulatory feature forced transcription
JP6350935B2 (en) Acoustic model generation apparatus, acoustic model production method, and program
Xie et al. Voice conversion with SI-DNN and KL divergence based mapping without parallel training data

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230324