GB2524505A - Voice conversion - Google Patents

Voice conversion

Info

Publication number
GB2524505A
GB2524505A GB1405255.9A GB201405255A
Authority
GB
United Kingdom
Prior art keywords
speech
speaker
sequence
parameters
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1405255.9A
Other versions
GB2524505B (en)
GB201405255D0 (en)
Inventor
Javier Lattore-Martinez
Vincent Ping Leung Wan
Balakrishina Venkata Jagannadha Kolluru
Ioannis Stylianou
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1405255.9A
Publication of GB201405255D0
Publication of GB2524505A
Application granted
Publication of GB2524505B
Expired - Fee Related
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Abstract

A text-to-speech (TTS) system modifies the synthesised speech according to the voice/speech attributes of a speaker by dividing text (from e.g. Automatic Speech Recognition S208) into a sequence of acoustic units (S211), and speech segments (S205) into probability distributions which are modified (S219) using an acoustic model (S215, e.g. Hidden Markov HMM) relating acoustic units (e.g. words, graphemes or phonemes) to output speech vectors. The modification (D) may be selected from e.g. expression, emotion, accent etc., and the probability distributions for speech parameters such as the spectrum, log fundamental frequency (Log F0) or its derivatives, duration etc. modified while others are held constant.

Description

Voice conversion
FIELD
Embodiments of the present invention as generally described herein relate to a voice conversion system and method.
BACKGROUND
Text to speech systems are systems where audio speech or audio speech files are outputted in response to reception of a text file.
Unit selection text to speech systems employ a speech corpus comprising samples of real speech. Voice conversion is used to obtain voices or speaker attributes not contained within the corpus.
There is a continuing need for systems employing voice conversion to sound natural while allowing for a variety of speaker attributes.
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 is a voice conversion system according to an embodiment.
Figure 2 is a flow diagram showing a voice conversion method in accordance with an embodiment.
Figure 3 is a schematic of a segmented waveform.
Figure 4 is a schematic of a Gaussian probability function.
Figure 5 is a schematic of a system showing how a modification can be selected.
Figure 6 is a schematic of another system showing how a modification can be selected.
Figure 7 is a plot showing how emotions can be transplanted between different speakers.
Figure 8 is a plot of acoustic space showing the transplant of emotional speech.
Figure 9 is a flow diagram showing a voice conversion method in accordance with another embodiment.
Figure 10 is schematic of a text to speech system in accordance with an embodiment.
Figure 11 is a flow diagram showing a text to speech method in accordance with an embodiment.
Figure 12 is a flow diagram showing the training of a speech processing system.
Figure 13 is a flow diagram showing in more detail some of the steps for training the speaker clusters of Figure 12.
Figure 14 is a flow diagram showing in more detail some of the steps for training the clusters relating to attributes of Figure 12.
Figure 15 is a schematic of decision trees used by an embodiment.
Figure 16 is a schematic showing a collection of different types of data suitable for training a system using a method of Figure 12.
Figure 17 shows the results of a DMOS expression similarity test.
Figure 18 shows the results of a DMOS speaker similarity test.
DETAILED DESCRIPTION
In an embodiment, a speech conversion method for modifying speech obtained from a speaker is provided, said method comprising: inputting text corresponding to said speech; dividing said text into a sequence of acoustic units; determining a sequence of speech segments obtained from said speaker; determining a sequence of probability distributions corresponding to said sequence of speech segments; selecting a modification with which to modify said speech; determining a modification factor; modifying said probability distributions using said modification factor; and outputting modified speech as audio, wherein said probability distributions are derived directly from said speech segments, and said modification factor is calculated using an acoustic model.
The method permits a user to modify the speech of a speaker so that it is output with a different expression or voice style. Input speech is modified according to a selected modification and modified speech is output. The modification may relate to any speaker attribute such as expression, emotion, accent, etc. A database of the speaker speaking in the intended expression or voice style is not required in order to output expressive speech.
The method may or may not be used as part of a text to speech method. Probability distributions may be determined for some or all of the components of the input speech. Probability distributions may be determined for one or more of spectral parameters (spectrum), log of fundamental frequency (Log F0), first differential of Log F0 (Delta Log F0), second differential of Log F0 (Delta-Delta Log F0), band aperiodicity parameters (BAP), duration or any other speech component.
The output speech may be recognisable as belonging to the speaker.
The modification may comprise altering the speech such that it exhibits an expression such as angry or sad, which is not exhibited by the input speech. The modification may comprise keeping the expression exhibited by the input speech but modifying the speaker voice. The modification may comprise altering the speaker style or accent of the speaker.
The modification may comprise modifying only one single attribute of speech while all other attributes (including speaker) are held constant.
The acoustic model is capable of synthesizing speech in the style of both the original and modified speech. The acoustic model may or may not employ factorisation of the speaker voice and attributes. The speaker voice and attributes may be varied independently such that an attribute can be combined with a range of different speakers.
The modification may be continuous such that the speech can be modified over a continuous range. Continuous control allows not just expressions such as "sad" or "angry" but also any intermediate expression.
The modification may be chosen by a user via a user interface. The modification may be defined using audio, text, an external agent or any combination thereof. In an embodiment, the input speech is parameterized before being divided into speech segments. The probability distributions may be pseudo models of the speech segments.
In an embodiment, one probability distribution is determined for each segment of the input speech. In another embodiment, one probability distribution is determined for each frame of each segment of the input speech.
The input speech may be provided directly by the speaker or may comprise recorded speech. The input speech may comprise speech output from a text to speech system.
The text may comprise the text of just some or all of the input speech. The modification may be applied to all or just some of said speech. Different modifications may be applied to different parts of the speech. Some parts of the speech may be unmodified.
In an embodiment, the acoustic model has a plurality of model parameters describing acoustic model probability distributions which relate an acoustic unit to a speech vector.
In an embodiment, the acoustic model may comprise a first set of parameters relating to speaker voice and a second set of parameters relating to speaker attribute, wherein the first and second set of parameters do not overlap and wherein the first and second parameters modify said acoustic model probability distributions. The values of the first and second sets of parameters may be defined using audio, text, an external agent or any combination thereof. In an embodiment, the sequence of modified speech distributions is a function of model parameters and the sequence of probability distributions corresponding to the sequence of speech segments.
In an embodiment, the model parameters relate to speaker voice and speaker attribute.
The model parameters modify the acoustic model probability distributions.
In an embodiment, the acoustic model comprises a first set of parameters relating to speaker voice and a second set of parameters relating to speaker attribute, wherein the first and second set of parameters explicitly overlap but an implicit set of speaker attributes that match those of the input speech can be found.
Selecting a modification may comprise selecting parameters from the second set of parameters relating to speaker attribute. Selecting a modification may comprise selecting parameters from the first set of parameters relating to speaker voice. Selecting a modification may comprise selecting parameters from both the first set of parameters relating to speaker voice and the second set of parameters relating to speaker attribute.
In an embodiment determining a modification factor comprises: converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model and first parameters obtained from the speech of said speaker; determining a difference between the selected second parameters and second parameters obtained from the speech of said speaker; and determining the modification factor from said difference.
In another embodiment, determining a modification factor comprises: converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; determining a first difference between the selected first parameters and first parameters obtained from the speech of said speaker; determining a second difference between the selected second parameters and second parameters obtained from the speech of said speaker; and determining the modification factor from said first and second difference.
Modifying said probability distributions using said modification factor may comprise adding said modification factor to the mean of said probability distributions. The modification factor may comprise a shift to be added to the mean of the distributions of the original speech. The shift may be obtained as the difference between the means of the distribution that correspond to the output style and that of the original speaker style.
The means of the distributions of the speech model may be defined by a transform over a canonical model. The transform may be trained by a method that permits factorization of speaker and attribute. The transform may be trained using Cluster Adaptive Training (CAT). The transform may be trained by Maximum-Likelihood Linear Regression (MLLR) or Constrained Maximum-Likelihood Linear Regression (CMLLR). The acoustic model may be trained by a method that does not permit factorization of speaker and attribute. There may be an overlap between parameters describing speaker and parameters describing attributes.
In an embodiment, the text is input from an automatic speech recognition device.
In an embodiment, determining a sequence of speech segments comprises: converting said sequence of acoustic units into a sequence of speech segments using a recorded speech model, wherein said recorded speech model comprises a corpus of recorded speech segments. The recorded speech model may be a text to speech model. The recorded speech model may be a unit selection text to speech model.
In an embodiment, determining a sequence of speech segments comprises: inputting speech; parameterizing said speech; and segmenting said parameterized speech.
In an embodiment, outputting modified speech as audio comprises: determining a sequence of modified speech vectors from said modified probability distributions; outputting said sequence of modified speech vectors as audio. A vocoder may be used to convert the modified speech vectors to audio.
In another embodiment, outputting modified speech as audio comprises: determining a sequence of modified speech vectors from said modified probability distributions; calculating a difference between said inputted speech and said modified speech vectors; shifting said input speech according to said calculated difference; outputting said shifted speech as audio.
The duration of the original speech may also be modified to match the desired output style.
Spectral modification may be implemented as a time-domain filter applied to the original speech whereby the coefficients of said filter are obtained from the shift in the trajectories of the spectral coefficients of the original speech and those of the modified speech.
In this embodiment, the duration and F0 modifications may be obtained by overlap-add methods such as TD-PSOLA.
In one embodiment, each frame of the original speech may be represented by its own probability distribution and the modification of the duration achieved by rescaling segments of the original speech according to the duration modification assigned to it.
Segment rescaling may be achieved by means of a linear parameterization of the trajectories of the coefficients of original speech segments.
In another embodiment, each of said speech segments may comprise a plurality of frames and the probability distribution of each frame within a segment may be considered to be the same. That implies that each frame of the original speech associated with a sub-segment (state) may share the same distribution. In this embodiment, the duration modification may be obtained by modifying the number of frames in the output speech that are associated with said state.
Shifting the input speech may comprise modifying the fundamental frequency of the input speech. Shifting said input speech may comprise rescaling segments of said input speech in order to modify their duration.
In an embodiment, a voice conversion system is provided. The voice conversion system is configured to modify speech obtained from a speaker, said voice conversion system comprising: a processor configured to: receive input text corresponding to said speech; divide said text into a sequence of acoustic units; determine a sequence of speech segments obtained from said speaker; determine a sequence of probability distributions corresponding to said sequence of speech segments; select a modification with which to modify said speech; determine a modification factor; modify said probability distributions using said modification factor; and output modified speech as audio, wherein said probability distributions are derived directly from said speech segments, and said modification factor is calculated using an acoustic model.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
Figure 1 shows a voice conversion system 1 according to an embodiment. The voice conversion system 1 comprises a processor 3 which executes a program 5. The voice conversion system 1 further comprises storage 7. The voice conversion system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a text input 15. Text input 15 receives text. The text input 15 may be for example a keyboard. Alternatively, text input 15 may be a means for receiving text data from an external storage medium or a network. Text input 15 may be a means for receiving text data output from an automatic speech recognition device. The voice conversion system 1 further comprises an audio input 23 and an audio input module 21.
The audio input 23 receives speech. In an embodiment the received text data is the text of the input speech.
Connected to the output module 13 is an output for audio 17. The audio output 17 is used for outputting a speech signal converted from the input speech signal. The audio output 17 may be for example a direct audio output, e.g. a speaker, or an output for an audio data file which may be sent to a storage medium, networked etc.
In use, the voice conversion system 1 receives speech from a speaker through speech input 23. The text of the speech is received through text input 15. The program 5 executed on processor 3 converts input speech data using data stored in the storage 7.
The converted speech is output via the output module 13 to audio output 17.
The converted speech comprises speech modified to exhibit different characteristics.
The converted speech may comprise a new voice style or correspond to a different expression. For example, the input speech may be neutral in expression but the converted speech may correspond to happy speech.
In an embodiment, the data stored in the storage 7 does not comprise a database of the speaker speaking in the desired style. In a further embodiment, the data stored in the storage 7 does not comprise any data obtained from the speaker of the input speech.
Figure 2 is a flow chart for a voice conversion process in accordance with an embodiment. In step S201, speech is input. The speech may be direct audio input, e.g. via a microphone, or it may comprise an audio data file.
In Step S203, the speech waveform is parameterized and a sequence of speech vectors x for each component of the speech is determined. A variety of approaches can be used to parameterize the speech waveform. These techniques are well known in the art and will not be described in detail here.
In Step S205, the speech vectors are segmented into sub-segments x_s known as states.
In an embodiment, when the full text of the speech is known, segmentation may be achieved using a Viterbi alignment between acoustic models (described below) and the input speech parameters. Viterbi alignment is well known in the art. The alignment determines the sequence of acoustic models with the highest log-likelihood with respect to the input sequence. All of the input frames associated with one model define a sub-segment. This procedure is well known in the art. In another embodiment, minimum phone error is employed instead of log-likelihood. Other segmentation methods known in the art may also be employed, including manual segmentation. In an embodiment, segmentation is performed when not all of the text of the speech is known or the text does not exactly match the speech.
Figure 3 shows an example of a segmented speech vector x. The waveform x is segmented into states. The speech vector associated with each state s of the input speech data is given by x_s, where d_s is the duration of the state in frames.
In Step S207, parametric models are generated for each input state. The probability distributions of each speech component (such as F0, mel-cepstrum, etc.) are obtained for each state. In an embodiment, the probability distributions are normal distributions defined by their mean and variance. Other distributions may also be employed.
In an embodiment, a probability distribution (μ_t, Σ_t) is generated for each frame t of the input speech.
In another embodiment, the input speech is modelled by simplified "pseudo models" in which the frames of the input speech associated with a state all share the same distribution, i.e.

P(x_t | x_s) = N(x_t; f(x_s, d_s, t), Σ_s)    (Eqn 1)

where

f(x_s, d_s, t) = (1/d_s) Σ_{t'∈s} x_{t'}, ∀ t ∈ s    (Eqn 2.a)

or

f(x_s, d_s, t) = median(x_s)    (Eqn 2.b)

In optional step S208, the input speech is input into an automatic speech recognition device which generates the text of the input speech. Such devices are well known in the art and will not be described here.
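The per-state pseudo models of Eqns 1 to 2.b reduce to simple statistics over the frames of each state. Below is a minimal sketch, assuming the parameterized speech is held as a NumPy array of frame vectors and the segmentation as a list of (start, end) frame indices; both container choices are assumptions of the sketch, not part of the embodiment.

```python
import numpy as np

def pseudo_models(frames, segments, use_median=False):
    """Build one (centre, covariance) pair per state.

    frames   : (T, D) array of parameterized speech vectors, one row per frame.
    segments : list of (start, end) frame indices, one pair per state s.
    Returns a list of (centre, cov) tuples: the shared distribution that every
    frame of state s is assumed to follow (Eqn 1 with Eqn 2.a or 2.b).
    """
    models = []
    for start, end in segments:
        x_s = frames[start:end]                              # frames of state s
        centre = (np.median(x_s, axis=0) if use_median       # Eqn 2.b
                  else x_s.mean(axis=0))                     # Eqn 2.a
        cov = (np.cov(x_s, rowvar=False) if len(x_s) > 1
               else np.zeros((frames.shape[1], frames.shape[1])))
        models.append((centre, cov))
    return models
```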
In Step S209, text is input. The text is that of the input speech. The text may be inputted via a keyboard, touch screen, text predictor or the like. If an automatic speech recognition device is employed in Step S208, the text is obtained directly from the input speech and does not need to be provided separately.
The text is then converted into a sequence of acoustic units. These acoustic units may be phonemes or may be context dependent, e.g. triphones which take into account not only the phoneme which has been selected but the preceding and following phonemes, or more complex units taking into account not only phonetic information but also information about the part-of-speech, position of stress accents, etc. These extended context-dependent acoustic units are well known in the art and will not be further explained here. The text is converted into the sequence of acoustic units using techniques which are well-known in the art and will not be explained further here.
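For illustration only, a minimal sketch of turning a phoneme sequence into triphone-style context-dependent labels; the left-centre+right label format and the silence padding are common conventions assumed here, not requirements of the embodiment.

```python
def to_triphones(phonemes, boundary="sil"):
    """Map a phoneme sequence to left-centre-right context labels."""
    padded = [boundary] + list(phonemes) + [boundary]
    units = []
    for i in range(1, len(padded) - 1):
        left, centre, right = padded[i - 1], padded[i], padded[i + 1]
        units.append(f"{left}-{centre}+{right}")
    return units

# Example: to_triphones(["h", "eh", "l", "ow"])
# -> ['sil-h+eh', 'h-eh+l', 'eh-l+ow', 'l-ow+sil']
```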
In Step S211, the input speech is projected into text to speech space by applying an acoustic model to the corresponding text and adapting that model to the characteristics of the input speech. In this description, the acoustic model is a Hidden Markov Model (HMM). However, other models could also be used.
In an embodiment, the model comprises many probability density functions relating an acoustic unit, i.e. phoneme, grapheme, word or part thereof, to speech parameters. In this embodiment, the probability distributions will be Gaussian distributions and these are generally referred to as Gaussians or components. Gaussian distributions are defined by means and variances. However, it is possible to use other distributions such as the Poisson, Student-t, Laplacian or Gamma distributions, some of which are defined by variables other than the mean and variance.
A Gaussian distribution is shown in figure 4. Figure 4 can be thought of as being the probability distribution of an acoustic unit relating to a speech vector. For example, the speech vector shown as X has a probability Pt of corresponding to the phoneme or other acoustic unit which has the distribution shown in figure 4.
The shape and position of the Gaussian is defined by its mean and variance. These parameters are determined during the training of the acoustic model.
In some embodiments, there will be a plurality of different states which will each be modelled using a Gaussian. For example, in an embodiment, the text-to-speech system comprises multiple streams. Such streams may be selected from one or more of spectral parameters (Spectrum), log of fundamental frequency (Log F0), first differential of Log F0 (Delta Log F0), second differential of Log F0 (Delta-Delta Log F0), band aperiodicity parameters (BAP), duration etc. The streams may also be further divided into classes such as silence (sil), short pause (pau) and speech (spe) etc. In an embodiment, the Gaussian components are modified by parameters corresponding to speaker and expression. Projecting the speech into text to speech space comprises adjusting these parameters in order to determine the closest representation of the input speech in that space. The acoustic model will be described in detail below.
In step S213, the attributes of the output speech are selected. In the embodiment of Figure 2, only the expression of the speech is altered. For example, the voice may be altered to sound happy, sad, angry, nervous, calm, commanding, etc., while keeping the same speaker voice identity.
Note that in other embodiments other attributes of the speaker may be altered, for example accent. Further, the speaker voice itself may also be altered. For example, the speaker may be selected from a range of potential speaking voices such as a male voice, young female voice etc. Any speaker or speaker attribute for which the model has been trained may be selected independently from each other.
In Step S215, the projection of the input speech into text to speech space determined as described above and the expression selected in step S213 are employed to determine a modification factor Δ. In an embodiment, the modification factor is a shift to be added to the means of the distributions of the original speech.
In Step S219, the modification factors are applied to the distributions obtained in step S207. The resulting output distributions will be those of the input speech with the desired expression.
In an embodiment, the modified distribution of an output frame x'_t at time t, given the input speech x and the desired expression e, is given generally as

P(x'_t | x, e) = N(x'_t; f(x_s, d_s, d', t) + Δ_e, Σ_s)    (Eqn 3)

where, as above, x_s is the segment of x to which the original x_t belongs; d_s is the original length of x_s in frames; d' is the modified duration, which follows the distribution P(d' | x_s, e) = N(d_s + δ_e, σ_s), where δ_e is the duration modification factor; and Δ_e is the expression modification factor. The calculation of the modification factors will be described in detail below.
As described in relation to step S207 above, in one embodiment a probability distribution was generated for each frame t of the input speech. In this embodiment, f(x_s, d_s, d', t) is given by applying a transform to the segment vector and taking the t-th component:

f(x_s, d_s, d', t) = (T_{d'} T_{d_s}^{+} x_s)[t]    (Eqn 4)

where T_d is a linear transform dependent only on duration. In this embodiment, the modification of the duration is achieved by re-sampling each segment of original speech according to the duration modification assigned to it. The segment re-sampling is achieved by means of a linear parameterization of the trajectories of the original speech segments.
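One way to realise the duration-dependent linear transform T_d of Eqn 4 is to parameterize each segment trajectory on a fixed number of cosine basis functions and re-generate it at the new length. The cosine basis below is an illustrative assumption of the sketch, not the specific parameterization of the embodiment.

```python
import numpy as np

def cosine_basis(d, n_coef):
    """T_d: maps n_coef trajectory coefficients to a d-frame trajectory."""
    t = (np.arange(d) + 0.5) / d
    return np.cos(np.pi * np.outer(t, np.arange(n_coef)))    # shape (d, n_coef)

def resample_segment(x_s, d_new, n_coef=8):
    """Re-sample a (d_s, D) segment trajectory to d_new frames (cf. Eqn 4)."""
    d_s = x_s.shape[0]
    n = min(n_coef, d_s)
    T_ds = cosine_basis(d_s, n)
    coef, *_ = np.linalg.lstsq(T_ds, x_s, rcond=None)   # pseudo-inverse: the linear parameterization
    T_dnew = cosine_basis(d_new, n)
    return T_dnew @ coef                                 # trajectory with the modified duration
```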
In another embodiment, a state-wise distribution is adopted, as described in relation to equation 2.a or 2.b above. In this approach, using equation 2.a, f(x_s, d_s, d', t) is given by

f(x_s, d_s, d', t) = (1/d_s) Σ_{t'∈s} x_{t'}, ∀ t ∈ s    (Eqn 5)

which is independent of both t and d'. In this embodiment, the frames of the original speech associated with a sub-segment (state) all share the same distribution. The duration modification is achieved by modifying the number of frames in the output speech that should be associated with each state.
The speech parameters for the output speech are then synthesized using these modified distributions and a sequence of output speech vectors is generated. These synthetic output speech vectors replace those obtained in step S203 from the input speech. In an embodiment, only part of the input speech is modified and the remainder is left unchanged. In an embodiment, only modified sections of the input speech are replaced by their synthetic counterparts.
In Step S221, the speech waveform is reconstructed from the modified speech vectors using a vocoder. This step is the inverse of the parameterization process of step S203. In an embodiment, any standard invertible parametric vocoder may be used for the steps S203 and S221.
In an embodiment, a source excitation signal is also input into the vocoder. In this embodiment, a source excitation signal is obtained from the speech segments determined in step S215 and modified according to the selected modification. In an embodiment, the source excitation parameters of the input speech are estimated in step S211 and parameters corresponding to modified speech are computed using a source excitation model. A source excitation modification factor is then calculated from the difference between the estimated source excitation parameters of the input speech and the source excitation parameters of the modified speech. The source excitation modification factor is applied to the source excitation signal before it is input to the vocoder in step S221, whereby such a modification factor may consist of just a modification of pitch and duration of the excitation according to the modified duration and pitch trajectory obtained from S219. Suitable source excitation models are known in the art and will not be described here.
Thus, according to the embodiment of Figure 2, if the text of the speech of a speaker is known, the voice conversion system can modify the speech so as to change the expression of the speech or the voice style in which it is spoken. No database of recordings of the speaker speaking with the intended expression or in the intended voice style is required. Instead, statistical distributions that represent the original speech are modified by a factor obtained from a text to speech model that can synthesize speech both in the style of the original voice as well as in the target style.
Methods and systems according to the above embodiment permit the user to obtain a voice which is identifiable as that of the original speaker but with some attributes that are produced by the text to speech system. The acoustic model and modification factor according to embodiments will now be described.
In one embodiment, the acoustic model is a constrained maximum likelihood linear regression (CMLLR) model. In this model, taking o_tn, o_rn and o_re as three sets of data comprising data of the target speaker t speaking with the original expression n (o_tn); the reference speaker r speaking with original expression n (o_rn); and the same reference speaker r speaking with target expression e (o_re), it is assumed that there exists a canonical space ô which is independent of both speaker and expression. In this case the target speaker is the speaker of the input speech and the reference speaker is a speaker for which the acoustic model is trained. The canonical space can be obtained from the observed space by means of a linear transformation such that

ô = A_tn o_tn + b_tn = A_rn o_rn + b_rn = A_re o_re + b_re    (Eqn 6)

The inverses of these transforms also exist so that

o_m = A_m^{-1}(ô - b_m),  m ∈ {tn, rn, re}    (Eqn 7)

In order to determine the modification factor, it is necessary to find a set of transforms that can change the expression without modifying the speaker identity so that

o_te = A_te^{-1}(ô - b_te)    (Eqn 8)

It is assumed that the linear transforms {A, b} comprise a factorization of speaker and expression transforms (both of them invertible) such that

ô = A_t(A_n o_tn + b_n) + b_t = A_t A_n o_tn + A_t b_n + b_t    (Eqn 9)

Identifying terms in Eqn 9 gives

A_tn = A_t A_n,    b_tn = A_t b_n + b_t    (Eqn 10, 11)

and likewise for the other transforms.
Substituting Eqn 10 and Eqn 11 in Eqn 8 gives

o_te = A_e^{-1} A_t^{-1} (A_t A_n o_tn + A_t(b_n - b_e)) = A_te^{-1} A_tn o_tn + A_te^{-1}(b_tn - b_te)    (Eqn 12)

With a factorized model, A_e, A_t, b_e and b_t can be accessed directly. If the model is not factorized, only the combined transforms A_tn, b_tn are accessible. However, even in this case, if it is possible to assume that there exists one transform for the reference speaker, A_rn', b_rn', for which it is possible to assume that n' ≈ n, then A_te^{-1} A_tn can be rewritten as

A_te^{-1} A_tn = A_e^{-1} A_t^{-1} A_t A_n = A_e^{-1} A_r^{-1} A_r A_n = A_re^{-1} A_rn    (Eqn 13)

It follows from Eqn 11 that

b_rn - b_re = A_r(b_n - b_e)    (Eqn 14)

and therefore that

A_te^{-1}(b_tn - b_te) = A_e^{-1} A_t^{-1} A_t(b_n - b_e) = A_e^{-1}(b_n - b_e) = A_re^{-1}(b_rn - b_re)    (Eqn 15)

Thus, Eqn 12 can be written as

o_te = A_re^{-1} A_rn o_tn + A_re^{-1}(b_rn - b_re)    (Eqn 16)

It then follows that, given o_tn for the input speech, the modified output speech can be obtained from

o_te ≈ g(o_tn)    (Eqn 17)

where

g(o) = A_re^{-1}(A_rn o + (b_rn - b_re))    (Eqn 18)

Thus, to modify the speech of speaker t from its original expression n to the intended expression e, it is enough to know the transforms for another reference speaker r for those two expressions in a canonical model such as an average voice model (AVM) that normalizes both speaker and expression. This can be done without a proper factorization of the speaker and expression, i.e. there is no need to compute {A_t, b_t} and {A_r, b_r}.
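A minimal sketch of Eqns 17 and 18: applying the reference speaker's two expression transforms to the target speaker's observation vectors. The transforms are assumed to be available as NumPy arrays; how they are estimated from the reference speaker's data is outside this sketch.

```python
import numpy as np

def convert_observations(o_tn, A_rn, b_rn, A_re, b_re):
    """Approximate o_te = g(o_tn) = A_re^{-1} (A_rn o_tn + (b_rn - b_re))  (Eqns 17, 18).

    o_tn       : (T, D) observations of the target speaker with the original expression.
    A_rn, b_rn : reference-speaker transform for the original expression.
    A_re, b_re : reference-speaker transform for the target expression.
    """
    shifted = o_tn @ A_rn.T + (b_rn - b_re)       # A_rn o + (b_rn - b_re), applied row-wise
    return np.linalg.solve(A_re, shifted.T).T     # left-multiply by A_re^{-1}
```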
Relating back now to Figure 2, in this embodiment, Step S211 comprises determining A_tn and b_tn from the input speech. A_re and b_re are then determined in Step S215. The modification factor is then defined by Eqns 17 and 18.
Note that this technique can be applied, not only to voice conversion, but also to modify a speaker and expression dependent model built for target speaker t with style n. In this case, the distributions of the input speech, defined by μ_tn and Σ_tn, are the mean and covariance for such a speaker-and-expression dependent model or AVM model, since

μ_tn = A_tn^{-1}(μ̂ - b_tn)    (Eqn 19)

Σ_tn = A_tn^{-1} Σ̂ A_tn^{-T}    (Eqn 20)

where μ̂ and Σ̂ are the mean and covariance over the canonical space ô, respectively.
Note that, in equations 6-20, the model parameters μ, Σ refer equivalently to the observation space model or to the duration model.
In an embodiment, the acoustic model is one which has been trained using a cluster adaptive training (CAT) method. The standard CAT approach is a special case of CMLLR where all the A matrices are identity matrices and where different speakers and speaker attributes are modelled by the bias terms b, which are defined by applying weights λ to model parameters which have been arranged into clusters. The CAT approach will now be described in detail.
In a CAT based method, the mean of a Gaussian for a selected speaker and attribute is expressed as a weighted sum of independent means of the Gaussians. Thus:

μ_m^{(s,e)} = Σ_i λ_i^{(s,e)} μ_{c(m,i)}    (Eqn 21)

where μ_m^{(s,e)} is the mean of Gaussian component m with speaker voice s and attributes e ∈ F, i ∈ {1, ..., P} is the index for a cluster with P the total number of clusters, λ_i^{(s,e)} is the speaker and attribute dependent interpolation weight of the i-th cluster for the speaker s and attributes e, and μ_{c(m,i)} is the mean for component m in cluster i. For one of the clusters, usually cluster i=1, all the weights are always set to 1.0. This cluster is called the 'bias cluster'.
In order to obtain an independent control of each factor the weights are defined as

λ^{(s,e∈F)} = [1, λ^{(s)T}, λ^{(e_1)T}, ..., λ^{(e_F)T}]^T    (Eqn 22)

so that Eqn 21 can be rewritten as

μ_m^{(s,e∈F)} = μ_{c(m,1)} + Σ_i λ_i^{(s)} μ^{(s)}_{c(m,i)} + Σ_f Σ_j λ_j^{(e_f)} μ^{(e_f)}_{c(m,j)}    (Eqn 23)

where μ_{c(m,1)} represents the mean associated with the bias cluster, μ^{(s)}_{c(m,i)} are the means for the speaker clusters, and μ^{(e_f)}_{c(m,j)} are the means for the f-th attribute.
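A minimal sketch of Eqn 21, assuming the cluster means for one component are stacked in a single array; the array layout and the example weight ordering are assumptions of the sketch.

```python
import numpy as np

def cat_mean(cluster_means, weights):
    """mu_m^{(s,e)} = sum_i lambda_i^{(s,e)} mu_{c(m,i)}   (Eqn 21).

    cluster_means : (P, D) array, one mean per cluster for component m
                    (row 0 is the bias cluster).
    weights       : (P,) interpolation weights; weights[0] is fixed to 1.0.
    """
    return weights @ cluster_means

# Example: with a bias cluster, one speaker cluster and two expression clusters,
# weights = np.array([1.0, lam_speaker, lam_happy, lam_sad]).
```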
The means of the Gaussians are clustered. In an embodiment, each cluster comprises at least one decision tree; the decisions used in said trees are based on linguistic, phonetic and prosodic variations. In an embodiment, there is a decision tree for each component which is a member of a cluster. Prosodic, phonetic, and linguistic contexts affect the final speech waveform. Phonetic contexts typically affect the vocal tract, and prosodic (e.g. syllable) and linguistic (e.g. part of speech of words) contexts affect prosody such as duration (rhythm) and fundamental frequency (tone). Each cluster may comprise one or more sub-clusters where each sub-cluster comprises at least one of the said decision trees.
The above can either be considered to retrieve a weight for each sub-cluster or a weight vector for each cluster, the components of the weight vector being the weightings for each sub-cluster.
The following configuration shows a standard embodiment. To model this data, in this embodiment, 5 state HMMs are used. The data is separated into three classes for this example: silence, short pause, and speech. In this particular embodiment, the allocation of decision trees and weights per sub-cluster are as follows.
In this particular embodiment the following streams are used per cluster:
Spectrum: 1 stream, 5 states, 1 tree per state x 3 classes
LogF0: 3 streams, 5 states per stream, 1 tree per state and stream x 3 classes
BAP: 1 stream, 5 states, 1 tree per state x 3 classes
Duration: 1 stream, 5 states, 1 tree x 3 classes (each tree is shared across all states)
Total: 3x26 = 78 decision trees
For the above, the following weights are applied to each stream per voice characteristic e.g. speaker:
Spectrum: 1 stream, 5 states, 1 weight per stream x 3 classes
LogF0: 3 streams, 5 states per stream, 1 weight per stream x 3 classes
BAP: 1 stream, 5 states, 1 weight per stream x 3 classes
Duration: 1 stream, 5 states, 1 weight per state and stream x 3 classes
Total: 3x10 = 30 weights
As shown in this example, it is possible to allocate the same weight to different decision trees (spectrum) or more than one weight to the same decision tree (duration) or any other combination. As used herein, decision trees to which the same weighting is to be applied are considered to form a sub-cluster.
In an embodiment, the mean of a Gaussian distribution of a particular speaker and attribute is expressed as a weighted sum of the means of a Gaussian component, where the summation uses one mean from each cluster. The mean is selected on the basis of the prosodic, linguistic and phonetic context of the acoustic unit which is currently being processed.
While the means are selected according to the acoustic units derived from the text, the weightings, which give the speaker and expression characteristics of the original speech, are determined according to an external input.
Referring back to Figure 2, therefore, in this embodiment Step S211 comprises adjusting the weightings in order to determine the closest representation of the input speech in text to speech space. The result is a set of weightings λ which, along with the Gaussian means determined from the corresponding text, comprise the projection of the input speech in CAT text to speech space.
Step S213 then comprises selecting weightings corresponding to the desired output voice characteristics. Figure 5 shows a possible method of selecting the voice characteristics. Here, a user directly selects the weighting using, for example, a mouse to drag and drop a point on the screen, a keyboard to input a figure etc. In Figure 5, a selection unit 251, which comprises a mouse, keyboard or the like, selects the weightings using display 253. Display 253, in this example, has a radar chart which shows the weightings. The user can use the selecting unit 251 in order to change the dominance of the various clusters via the radar chart. It will be appreciated by those skilled in the art that other display methods may be used.
In some embodiments, the weightings can be projected onto their own space, a "weights space", with initially a weight representing each dimension. This space can be re-arranged into a different space whose dimensions represent different voice attributes.
For example, if the modelled voice characteristic is expression, one dimension may indicate happy voice characteristics, another nervous, etc. The user may select to increase the weighting on the happy voice dimension so that this voice characteristic dominates. In that case the number of dimensions of the new space is lower than that of the original weights space. The weights vector on the original space can then be obtained as a function of the coordinates vector of the new space. In one embodiment, this projection of the original weight space onto a reduced dimension weight space is formed using a linear equation of the type λ = Hu, where H is a projection matrix. In one embodiment, matrix H is defined to set on its columns the original λ for d representative speakers selected manually, where d is the desired dimension of the new space. Other techniques could be used to either reduce the dimensionality of the weight space or, if the values of λ are pre-defined for several speakers, to automatically find the function that maps the control u space to the original λ weight space.
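A small sketch of the λ = Hu projection, where the columns of H hold the full weight vectors of d manually chosen representative voices; the numbers below are placeholders, not trained weights.

```python
import numpy as np

# Columns of H: pre-computed CAT weight vectors of two representative voices
# (placeholder values; a convex combination of the columns keeps the bias weight at 1).
H = np.array([[1.0, 1.0],     # bias weight, always 1
              [0.9, 0.1],     # cluster 2 weight for each representative
              [0.2, 0.8]])    # cluster 3 weight for each representative

def control_to_weights(u):
    """Map a low-dimensional control vector u onto the original weight space (lambda = H u)."""
    return H @ np.asarray(u)

# A point halfway between the two representatives:
# control_to_weights([0.5, 0.5]) -> array([1. , 0.5, 0.5])
```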
In a further embodiment, the system is provided with a memory which saves predetermined sets of weighting vectors. Each vector may be designed to allow the text to be output with a different voice characteristic. For example, a happy voice, furious voice, etc. A system in accordance with such an embodiment is shown in Figure 6. Here, the display 253 shows different voice attributes which may be selected by selecting unit 251.
In Step S215, the modification factor Δ is calculated.
Figure 7 shows a plot useful for visualising how the speaker voices and attributes are related. The plot of Figure 7 is shown in 3 dimensions but can be extended to higher dimension orders.
Speakers are plotted along the z axis. In this simplified plot, the speaker weightings are defined as a single dimension; in practice, there are likely to be 2 or more speaker weightings represented on a corresponding number of axes.
Expression is represented on the x-y plane. With expression 1 along the x axis and expression 2 along the y axis, the weightings corresponding to angry and sad are shown.
Using this arrangement it is possible to generate the weightings required for an "Angry" speaker a and a "Sad" speaker b. By deriving the point on the x-y plane which corresponds to a new emotion or attribute, it can be seen how a new emotion or attribute can be applied to the existing speakers.
Figure 8 shows the principles explained above with reference to acoustic space. A 2-dimension acoustic space is shown here to allow a transform to be visualised. However, in practice, the acoustic space will extend in many dimensions.
In an expression CAT, the mean vector for a given expression is

μ_xpr = Σ_k λ_k^{xpr} μ_k    (Eqn 24)

where μ_xpr is the mean vector representing a speaker speaking with expression xpr, λ_k^{xpr} is the CAT weighting for component k for expression xpr, and μ_k is the mean vector of component k.
The only part which is emotion-dependent is the weights. Therefore, the difference between two different expressions (xpr1 and xpr2) is just a shift of the mean vectors

μ_xpr2 = μ_xpr1 + Δ_{xpr1,xpr2}    (Eqn 25)

Δ_{xpr1,xpr2} = Σ_k (λ_k^{xpr2} - λ_k^{xpr1}) μ_k    (Eqn 26)

This is shown in Figure 8.
In this embodiment, λ^{xpr1} are the weightings λ determined from the input as described above. λ^{xpr2} are the weightings λ corresponding to the target expression for which the text-to-speech system has been trained.
From Eqn 3 and the related discussion above, it is clear that, in this embodiment, the modification factor Δ comprises two components: one for expression and one for duration.
In the CAT approach, the modification factor Δ_e is defined as

Δ_e = Σ_{i=2} (λ_i^{xpr,e} - λ_i^{xpr,n}) μ_{c(m,i)}    (Eqn 27)

and the duration modification δ_e is defined as

δ_e = Σ_{i=2} (λ_i^{dur,e} - λ_i^{dur,n}) μ^{dur}_{c(m,i)}    (Eqn 28)

where λ^{dur,e} and λ^{dur,n} are the CAT weights for the duration of the target expression e and those associated with the input speech x.
Once the modification factor is calculated, it is applied to the input speech distributions in step S219.
Note that, by comparing Eqn 27 to Eqn 17, it is clear that the CAT transformation is a special case of the CMLLR transformations in which the matrices A are all equal to the identity matrix and

b_rn = b_r - Σ_{i=2} λ_i^{xpr,n} μ_{c(m,i)},    b_re = b_r - Σ_{i=2} λ_i^{xpr,e} μ_{c(m,i)}    (Eqn 29)

with b_r depending only on the speaker. Consequently, it follows that the modification factor is given by

b_rn - b_re = Σ_{i=2} (λ_i^{xpr,e} - λ_i^{xpr,n}) μ_{c(m,i)}    (Eqn 30)

Since in the CAT approach all conversion matrices are the identity matrix, Eqn 18 has no effect on the variance. Therefore, the variances of the observation space and of the duration respectively remain unmodified. It should be noted, however, that obtaining these covariances from the input speech might be problematic due to a sparseness of observation points. For that reason, in one embodiment they are obtained from the model according to Eqn 20.
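A minimal sketch of Eqns 27 to 30: the modification factor is the weighted difference of cluster means (excluding the bias cluster, whose weight is fixed), and it is simply added to the means of the input-speech distributions. The array layouts are assumptions of the sketch.

```python
import numpy as np

def cat_shift(cluster_means, weights_target, weights_input):
    """Delta = sum_{i>=2} (lambda_i^e - lambda_i^n) mu_{c(m,i)}   (Eqns 27, 30)."""
    diff = weights_target[1:] - weights_input[1:]    # bias cluster (i = 1) cancels
    return diff @ cluster_means[1:]

def modify_state_models(models, shifts):
    """Shift the mean of each per-state distribution; variances stay unchanged (cf. Eqn 30).

    models : list of (mean, cov) per state, e.g. the pseudo models built earlier.
    shifts : list of per-state shift vectors, one cat_shift() result per state
             (each state uses the cluster means of its own model component).
    """
    return [(mean + d, cov) for (mean, cov), d in zip(models, shifts)]
```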
The flowchart in Figure 9 shows a process for modifying speech according to another embodiment. In this embodiment, instead of modifying the speech vector to generate synthesized speech, the original speech is directly modified by shifting components of it.
Steps S601 to S615 proceed as steps S201 to S215 as described in relation to Figure 2 above.
In step S617, the modified duration d' is calculated from the distribution P(d' | x_s, e) = N(d_s + δ_e, σ_s), where δ_e was defined in equation 28 above. The modified duration is calculated for each segment of the input speech. In this case, a duration transform for each segment from the input speech to the desired expression is calculated as the difference between the duration of each segment of the input speech and the calculated modified duration for each segment.
Similarly, in step S619, new trajectories for the spectrum and fundamental frequency F0 are calculated by applying the modification factor calculated in step S615 to the input speech models. The trajectories can be calculated either by applying Δ directly to the segmented speech or by altering the speech model distributions calculated in step S607.
The modified distributions are calculated according to the process described in relation to equations 3 to 5 above. Modified speech parameters for the spectrum and fundamental frequency, F0, are then synthesized using these modified distributions and a trajectory of modified speech vectors is generated.
In Step S621, the difference between the trajectories of the spectrum and F0 of the original, parameterized, speech and the new trajectory of modified speech vectors calculated in step S619 is computed.
In Step S623, the original speech is fed through a time domain filter into which the difference between the trajectories of the spectrum computed in Step S621 is also input as filter coefficients. In the embodiment of Figure 9, the filter is a mel log spectrum approximation (MLSA) filter. MLSA filters are well known in the art and will not be described in detail here. In an embodiment, the filter is configured to shift the mel-cepstrum spectrum of the original speech according to the calculated difference of trajectories. The spectrum of the shifted speech corresponds to that of speech comprising the desired output voice characteristic.
In Step S625, an F0 shift and a duration shift corresponding to the desired voice characteristic are applied to the filtered speech. The F0 of the input speech waveform is shifted according to the difference of F0 trajectories calculated in step S621 and the durations of the original speech segments are shifted according to the duration transform calculated in step S617. In an embodiment, the shifts are achieved by overlap-add methods such as time-domain pitch synchronous overlap add (TD-PSOLA).
Overlap-add methods are well known in the art and will not be described here. Once the shift has been applied, the F0 and duration of the shifted speech correspond to that of speech comprising the desired output characteristic.
In Step S627, the shifted speech waveform is output.
Thus, whereas the output speech in the embodiment of Figure 2 is vocoded speech, the output speech of the present embodiment is the original speech which has been altered by filtering and shifting the speech waveform to exhibit the desired voice characteristic.
Further, in this embodiment, all components of the speech, spectrum and prosody, are modified together.
The embodiments described above comprise modifying input speech. In another embodiment, speech modification is employed as part of text-to-speech synthesis. Figure 10 shows a text to speech system according to an embodiment. The system of Figure 10 comprises the same features as the system of Figure 1. However, unlike the system of Figure 1, neither an audio input (23) nor an audio input module (21) is required. To avoid any unnecessary repetition, therefore, like reference numerals will be used to denote like features. A flow chart showing a text to speech process according to an embodiment is shown in Figure 11.
In step S901, text is input. The text may be inputted via a keyboard, touch screen, text predictor or the like.
In Step S903, the text is then converted into a sequence of acoustic units. These acoustic units may be phonemes or graphemes. The units may be context dependent, e.g. triphones which take into account not only the phoneme which has been selected but the preceding and following phonemes. The text is converted into the sequence of acoustic units using techniques which are well-known in the art and will not be explained further here.
In Step S905, the desired expression for the output speech is selected. Methods for selecting expression according to embodiments were described in relation to Figures 5 and 6 above.
In Step S907, a database of recorded speech segments is searched and a sequence of segments which most closely matches both the sequence of acoustic units obtained from the text and the selected expression is determined. In an embodiment, the corpus of recorded speech segments does not comprise segments exhibiting the selected expression. Nevertheless, it is advantageous to select a sequence as close as possible to the desired characteristic; the greater the modification of the speech segments in order to introduce expression into the speech, the greater the reduction in naturalness. The selection of the speech segments is performed by a standard unit selection speech model. Unit selection text to speech models are well known in the art and will not be discussed in detail here. The distributions are stored according to information relating to context such as phonetic environment etc. In an embodiment, the probability distributions associated with each segment of recorded speech are also stored. In an embodiment, a probability distribution (μ_t, Σ_t) is stored for each frame of each recorded segment. In another embodiment, a "pseudo model" such as those described in relation to equation 2 above is stored for each recorded speech segment.
In Step S909, an acoustic model is used to calculate the modification factor Δ. In the embodiment of Figure 9, the acoustic model is an expression CAT model. However, other acoustic models, such as the CMLLR model described in relation to Eqns 6 to 18, could be used. In order to calculate the modification factor, the sequence of speech segments selected in step S907 is projected into CAT space. This process is the same as that described above in relation to step S211 of Figure 2; the Gaussian means are obtained by applying the CAT decision trees to the acoustic units obtained from the text in step S903. The CAT cluster weightings are then determined by adjustment in order to determine the closest representation of the speech segments selected in step S907 in CAT space.
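A minimal sketch of the weight adjustment: given the cluster means retrieved from the decision trees for the component aligned with each frame, and the corresponding segment observations, the weights that best represent the segments can be found by least squares. This is one plausible estimator under the stated array-layout assumptions, not necessarily the estimator of the embodiment; a maximum-likelihood update would additionally weight by the component covariances and occupancies.

```python
import numpy as np

def estimate_cat_weights(frame_cluster_means, observations):
    """Find lambda minimising sum_t || o_t - M_t lambda ||^2.

    frame_cluster_means : (T, D, P) array; M_t = frame_cluster_means[t] stacks the
                          P cluster means (as columns) selected by the decision
                          trees for the component aligned with frame t.
    observations        : (T, D) array of observed speech vectors o_t.
    Returns the (P,) weight vector, with the bias weight constrained to 1.
    """
    T, D, P = frame_cluster_means.shape
    bias = frame_cluster_means[:, :, 0]                       # bias-cluster contribution
    A = frame_cluster_means[:, :, 1:].reshape(T * D, P - 1)
    b = (observations - bias).reshape(T * D)
    lam_rest, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.concatenate(([1.0], lam_rest))
```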
In another embodiment, the unit selection speech model stores the CAT weights of each of the segments in the speech corpus, which have been pre-calculated and stored during the training of the model.
Once the sequence of speech segments has been projected into CAT space, the modification factor Δ is calculated as described in relation to step S215 of Figure 2 above.
In Step S911, the modification factor Δ is applied to the distributions associated with each segment of speech selected in step S907. The modified parameters are calculated as described in relation to step S219 of Figure 2 above. A modified speech trajectory is thus obtained which corresponds to the input text and the selected expression.
In Step S913, the modified parameter trajectories are converted into speech by a vocoder. In an embodiment, segment residual information associated with each speech segment, such as phase information, is also input into the vocoder.
In Step S915, the speech is output as audio.
Thus, the method according to the above embodiment modifies pre-recorded and stored segments of real speech so that the speech exhibits a desired expression characteristic.
New voices can be obtained with a unit selection text to speech system without requiring new recordings or the use of voice conversion. Thus, the methods can be employed to improve the versatility of unit-selection text to speech systems by permitting a variety of voices and expressions. Methods and systems according to the above embodiments therefore combine the high quality speech produced by unit selection text to speech systems with the versatility of HMM text to speech. The original voice of the unit selection TTS can be retained but the expression can be modified.
The training of the system will now be described. For the speech modification systems according to the embodiments described in relation to Figures 2 and 9, it is only necessary to train a CAT model. For the text to speech system according to the embodiment described above, however, it is necessary both to build a unit selection TTS model and to train a CAT TTS model. The building of unit selection TTS models is well known in the art and will not be described here. The training of the CAT model, suitable for use in both text to speech systems and speech modification systems according to embodiments, is described below. In order to train acoustic models for use with an embodiment, a system such as that shown in Figure 1 may also be used for training the CAT model. When training a system, it is necessary to have an audio input 23 which matches the text being inputted via text input 15. In speech processing systems which are based on Hidden Markov Models (HMMs), the HMM is often expressed as:

M = (A, B, π)    (Eqn 31)

where A = {a_ij}_{i,j=1}^{N} is the state transition probability distribution, B = {b_j(o)} is the state output probability distribution and π is the initial state probability distribution and where N is the number of states in the HMM.
How an HMM is used in a text-to-speech system is well known in the art and will not be described here.
In acoustic models for use with the current embodiment, the state transition probability distribution A and the initial state probability distribution are determined in accordance with procedures well known in the art. Therefore, the remainder of this description will be concerned with the state output probability distribution.
Generally in text to speech systems the state output vector or speech vector o(t) from an m-th Gaussian component in a model set M is

P(o(t) | m, s, e, M) = N(o(t); μ_m^{(s,e)}, Σ_m^{(s,e)})    (Eqn 32)

where μ_m^{(s,e)} and Σ_m^{(s,e)} are the mean and covariance of the m-th Gaussian component for speaker s and expression e.
The aim when training a conventional text-to-speech model is to estimate the model parameter set M which maximises likelihood for a given observation sequence. In the conventional model, there is one single speaker and expression; therefore the model parameter set is μ_m and Σ_m for all components m.
As it is not possible to obtain the above model set based on so-called Maximum Likelihood (ML) criteria purely analytically, the problem is conventionally addressed by using an iterative approach known as the expectation maximisation (EM) algorithm, which is often referred to as the Baum-Welch algorithm. Here, an auxiliary function (the "Q" function) is derived:

Q(M, M') = Σ_{m,t} γ_m(t) log p(o(t), m | M)    (Eqn 33)

where γ_m(t) is the posterior probability of component m generating the observation o(t) given the current model parameters M', and M is the new parameter set. After each iteration, the parameter set M' is replaced by the new parameter set M which maximises Q(M, M'). p(o(t), m | M) is a generative model such as a GMM, HMM etc. In the present embodiment an HMM is used which has a state output vector of:

P(o(t) | m, s, e, M) = N(o(t); μ̂_m^{(s,e)}, Σ̂_m^{(s,e)})    (Eqn 34)

where m ∈ {1, ..., MN}, t ∈ {1, ..., T}, s ∈ {1, ..., S} and e ∈ {1, ..., E} are indices for component, time, speaker and expression respectively and where MN, T, S and E are the total number of components, frames, speakers and expressions respectively.
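For the auxiliary function of Eqn 33, the posterior occupancies γ_m(t) can be computed as in a standard E-step. Below is a minimal sketch for a mixture of Gaussians; a full HMM implementation would obtain them with the forward-backward algorithm instead.

```python
import numpy as np
from scipy.stats import multivariate_normal

def component_posteriors(observations, means, covs, priors):
    """gamma_m(t): posterior probability of component m generating o(t).

    observations : (T, D) array; means : (M, D); covs : (M, D, D); priors : (M,).
    """
    T, M = observations.shape[0], means.shape[0]
    log_p = np.empty((T, M))
    for m in range(M):
        log_p[:, m] = np.log(priors[m]) + multivariate_normal.logpdf(
            observations, mean=means[m], cov=covs[m])
    log_p -= log_p.max(axis=1, keepdims=True)          # numerical stabilisation
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)            # (T, M) responsibilities
```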
The exact form of μ̂_m^{(s,e)} and Σ̂_m^{(s,e)} depends on the type of speaker and expression dependent transforms that are applied. In the most general way the speaker dependent transforms include:
-a set of speaker-expression dependent weights λ^{(s,e)}
-a speaker-expression-dependent cluster μ^{(s,e)}_{c(m,x)}
-a set of linear transforms [A_d^{(s,e)}, b_d^{(s,e)}]
whereby these transforms could depend just on the speaker, just on the expression or on both.
After applying all the possible speaker dependent transforms in step 211, the mean veelor and eovaHance matrix of the probability disiribulion tn for speaker s and expression e become (S.ø1 = (sM + -Eqn 35 -(AI:s.;TL_1 A't' jr -(ni) .(/?J) r(,ir)) liqn 36 where are the means of cluster I for component in as described in Eqn. 21, is the mean vector for component in of the additional cluster for speaker s expression s, which, will be described later, and A4:and b are the linear transformation matrix and the bias vector associated with re-ession class r(n) for tile speakers, expression e.
R is the total number of regression classes and $r(m) \in \{1, \ldots, R\}$ denotes the regression class to which the component m belongs.
If no linear transformation is applied, $A_{r(m)}^{(s,e)}$ and $b_{r(m)}^{(s,e)}$ become an identity matrix and zero vector respectively.
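A short sketch of the adaptation of equations 35 and 36: the adapted mean is the weighted sum of the cluster means for the component plus the additional-cluster mean, and the adapted covariance follows from the canonical covariance and the CMLLR matrix. This is a minimal NumPy illustration under assumed shapes; the variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def adapt_component(cluster_means, weights, extra_mean, sigma_canon, A):
    """Return the speaker/expression adapted mean and covariance of one component.

    cluster_means : (P, D) cluster mean vectors mu_{c(m,i)} for this component
    weights       : (P,)   CAT interpolation weights lambda_i^{(s,e)}
    extra_mean    : (D,)   additional-cluster mean mu_{c(m,x)}^{(s,e)}
    sigma_canon   : (D, D) canonical covariance Sigma_{v(m)}
    A             : (D, D) CMLLR matrix for the component's regression class
    """
    # Eqn 35: weighted sum of cluster means plus the additional-cluster mean
    mu_hat = weights @ cluster_means + extra_mean
    # Eqn 36: covariance implied by the CMLLR transform of the observations
    sigma_hat = np.linalg.inv(A.T @ np.linalg.inv(sigma_canon) @ A)
    return mu_hat, sigma_hat

# identity transform (no CMLLR) leaves the canonical covariance unchanged
D, P = 3, 4
rng = np.random.default_rng(1)
mu_hat, sigma_hat = adapt_component(
    cluster_means=rng.normal(size=(P, D)),
    weights=np.array([1.0, 1.0, 0.0, 0.0]),
    extra_mean=np.zeros(D),
    sigma_canon=np.eye(D),
    A=np.eye(D))
print(mu_hat, np.allclose(sigma_hat, np.eye(D)))
```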
For reasons which will be explained later, for models trained for use with an embodiment, the covariances are clustered and arranged into decision trees, where $v(m) \in \{1, \ldots, V\}$ denotes the leaf node in a covariance decision tree to which the covariance matrix of the component m belongs and V is the total number of variance decision tree leaf nodes.
Using the above, the auxiliary function can be expressed as:

$Q(\mathcal{M}, \mathcal{M}') = -\frac{1}{2} \sum_{m,t,s,e} \gamma_m(t,s,e) \left\{ \log \left| \hat{\Sigma}_m^{(s,e)} \right| + \left( \mathbf{o}(t) - \hat{\mu}_m^{(s,e)} \right)^{\top} \hat{\Sigma}_m^{(s,e)-1} \left( \mathbf{o}(t) - \hat{\mu}_m^{(s,e)} \right) \right\} + C$   Eqn. 37

where C is a constant independent of $\mathcal{M}$. Thus, using the above and substituting equations 35 and 36 in equation 37, the auxiliary function shows that the model parameters may be split into four distinct parts.
The first part comprises the parameters of the canonical model, i.e. the speaker and expression independent means $\{\mu_n\}$ and the speaker and expression independent covariances $\{\Sigma_k\}$; the above indices n and k indicate leaf nodes of the mean and variance decision trees which will be described later. The second part comprises the speaker-expression dependent weights $\{\lambda_i^{(s,e)}\}_{s,e,i}$ where s indicates speaker, e indicates expression and i the cluster index parameter. The third part comprises the means of the speaker-expression dependent cluster $\mu_{c(m,x)}^{(s,e)}$, and the fourth part comprises the constrained maximum likelihood linear regression (CMLLR) transforms $\{A_d^{(s,e)}, b_d^{(s,e)}\}_{s,e,d}$ where s indicates speaker, e expression and d indicates component or speaker-expression regression class to which component m belongs.
Once the auxiliary function is expressed in the above manner, it is then maximized with respect to each of the variables in turn in order to obtain the ML values of the speaker and voice characteristic parameters, the speaker dependent parameters and the voice characteristic dependent parameters.
In detail, for determining the ML estimate of the mean, the following procedure is performed. To simplify the following equations it is assumed that no linear transform is applied.
If a linear transform is applied, the original observation vectors $\{\mathbf{o}(t)\}$ have to be substituted by the transformed ones

$\hat{\mathbf{o}}_{m}^{(s,e)}(t) = A_{r(m)}^{(s,e)} \mathbf{o}(t) + b_{r(m)}^{(s,e)}$   Eqn. 38

Similarly, it will be assumed that there is no additional cluster. The inclusion of that extra cluster during the training is just equivalent to adding a linear transform for which $A^{(s,e)}$ is the identity matrix and $b^{(s,e)} = \mu_{c(m,x)}^{(s,e)}$.

First, the auxiliary function of equation 33 is differentiated with respect to $\mu_n$ as follows:

$\frac{\partial Q(\mathcal{M}, \mathcal{M}')}{\partial \mu_n} = \mathbf{k}_n - \mathbf{G}_{nn} \mu_n - \sum_{v \neq n} \mathbf{G}_{nv} \mu_v$   Eqn. 39

where

$\mathbf{G}_{nv} = \sum_{m,i,j : c(m,i)=n,\, c(m,j)=v} \mathbf{G}_{ij}^{(m)}, \qquad \mathbf{k}_n = \sum_{m,i : c(m,i)=n} \mathbf{k}_i^{(m)}$   Eqn. 40

with the accumulated statistics

$\mathbf{G}_{ij}^{(m)} = \sum_{t,s,e} \gamma_m(t,s,e)\, \lambda_{i,q(m)}^{(s,e)} \Sigma_{v(m)}^{-1} \lambda_{j,q(m)}^{(s,e)}, \qquad \mathbf{k}_i^{(m)} = \sum_{t,s,e} \gamma_m(t,s,e)\, \lambda_{i,q(m)}^{(s,e)} \Sigma_{v(m)}^{-1} \mathbf{o}(t)$   Eqn. 41

By maximizing the equation in the normal way by setting the derivative to zero, the following formula is achieved for the ML estimate of $\mu_n$, i.e. $\hat{\mu}_n$:

$\hat{\mu}_n = \mathbf{G}_{nn}^{-1} \left( \mathbf{k}_n - \sum_{v \neq n} \mathbf{G}_{nv} \mu_v \right)$   Eqn. 42

It should be noted that the ML estimate of $\mu_n$ also depends on $\mu_k$ where k does not equal n. The index n is used to represent leaf nodes of decision trees of mean vectors, whereas the index k represents leaf nodes of covariance decision trees. Therefore, it is necessary to perform the optimization by iterating over all $\mu_n$ until convergence.
This can be performed by optimizing all $\mu_n$ simultaneously by solving the following equations.
$\begin{bmatrix} \mathbf{G}_{11} & \cdots & \mathbf{G}_{1N} \\ \vdots & \ddots & \vdots \\ \mathbf{G}_{N1} & \cdots & \mathbf{G}_{NN} \end{bmatrix} \begin{bmatrix} \hat{\mu}_1 \\ \vdots \\ \hat{\mu}_N \end{bmatrix} = \begin{bmatrix} \mathbf{k}_1 \\ \vdots \\ \mathbf{k}_N \end{bmatrix}$   Eqn. 43

However, if the training data is small or N is quite large, the coefficient matrix of equation 43 cannot have full rank. This problem can be avoided by using singular value decomposition or other well-known matrix factorization techniques.
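The joint solve of equation 43 with an SVD-based solver can be sketched as follows. The block accumulators below are random placeholders; the point of the sketch is only that assembling the blocks into one matrix and using a least-squares solve (which relies on an SVD internally) handles a rank-deficient coefficient matrix gracefully.

```python
import numpy as np

def solve_cluster_means(G, k):
    """Solve the block system of Eqn 43 for the stacked leaf-node means.

    G : (N, N, D, D) blocks G_{nv}
    k : (N, D)       right-hand-side vectors k_n
    Returns mu_hat : (N, D)
    """
    N, _, D, _ = G.shape
    A = G.transpose(0, 2, 1, 3).reshape(N * D, N * D)   # assemble the block matrix
    b = k.reshape(N * D)
    # lstsq uses an SVD internally, so a rank-deficient A is handled gracefully
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu.reshape(N, D)

# toy accumulators: positive semi-definite diagonal blocks
N, D = 3, 2
G = np.zeros((N, N, D, D))
for n in range(N):
    G[n, n] = (n + 1) * np.eye(D)
k = np.arange(N * D, dtype=float).reshape(N, D)
print(solve_cluster_means(G, k))
```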
The same process is then performed in order to perform an ML estimate of the covariances, i.e. the auxiliary function shown in equation 37 is differentiated with respect to $\Sigma_k$ to give:

$\hat{\Sigma}_k = \frac{\sum_{t,s,e,m} \gamma_m(t,s,e)\, \bar{\mathbf{o}}(t)\, \bar{\mathbf{o}}(t)^{\top}}{\sum_{t,s,e,m} \gamma_m(t,s,e)}$   Eqn. 44

where

$\bar{\mathbf{o}}(t) = \mathbf{o}(t) - \hat{\mu}_m^{(s,e)}$   Eqn. 45

The ML estimate for speaker dependent weights and the speaker dependent linear transform can also be obtained in the same manner, i.e. differentiating the auxiliary function with respect to the parameter for which the ML estimate is required and then setting the value of the differential to 0.
For the expression dependent weights this yields = ( y ,tn s q I in) E (t, s.
t tn,R qi in Eqn46 Where = 0(t) -f-c(rn1) Eqn47 And similarly, for the speaker-dependent weights = ( q(m)=q y(r,scr%I Eto (t' in q(in=t }Iqn4S Where aLci) = o(t) --AIA Eqn49 In the training of models for use with an embodimenL, Ihe process is performed in an iterative manner. This basic system is explained with reference to the flow diagrams of figures l2to 14.
In step S401, a plurality of inputs of audio speech is received. In this illustrative example, 4 speakers are used.
Next, in step S403, an acoustic model is trained and produced for each of the 4 voices, each speaking with neutral emotion. In the training of a model for use in an embodiment, each of the 4 models is only trained using data from one voice. S403 will be explained in more detail with reference to the flow chart of figure 13.
In step S305 of figure 13, the number of clusters P is set to V + 1, where V is the number of voices (4).
In step S307, one cluster (cluster 1) is determined as the bias cluster. The decision trees for the bias cluster and the associated cluster mean vectors are initialised using the voice which in step S303 produced the best model. In this example, each voice is given a tag "Voice A", "Voice B", "Voice C" and "Voice D"; here Voice A is assumed to have produced the best model. The covariance matrices, space weights for multi-space probability distributions (MSD) and their parameter sharing structure are also initialised to those of the voice A model.
Each binary decision tree is constructed in a locally optimal fashion starting with a single root node representing all contexts. In this embodiment, by context, the following bases are used: phonetic, linguistic and prosodic. As each node is created, the next optimal question about the context is selected. The question is selected on the basis of which question causes the maximum increase in likelihood and the terminal nodes generated in the training examples.
Then, the set of terminal nodes is searched to find the one which can be split using its optimum question to provide the largest increase in the total likelihood to the training data. Providing that this increase exceeds a threshold, the node is divided using the optimal question and two new terminal nodes are created. The process stops when no new terminal nodes can be formed since any further splitting will not exceed the threshold applied to the likelihood split.
This process is shown for example in figure 15. The nth terminal node in a mean decision tree is divided into two new terminal nodes $n^{+q}$ and $n^{-q}$ by a question q. The likelihood gain achieved by this split can be calculated as follows:

$\mathcal{L}(n) = -\frac{1}{2} \mu_n^{\top} \left( \sum_{m \in S(n)} \mathbf{G}_{ii}^{(m)} \right) \mu_n + \mu_n^{\top} \sum_{m \in S(n)} \left( \mathbf{k}_i^{(m)} - \sum_{v \neq n} \mathbf{G}_{iv}^{(m)} \mu_{c(m,v)} \right) + C$   Eqn. 50

where $S(n)$ denotes a set of components associated with node n and C is a constant term independent of $\mu_n$. Note that the terms which are constant with respect to $\mu_n$ are not included. The maximum likelihood of $\mu_n$ is given by equation 13. Thus, the above can be written as:

$\mathcal{L}(n) = \frac{1}{2} \hat{\mu}_n^{\top} \left( \sum_{m \in S(n)} \mathbf{G}_{ii}^{(m)} \right) \hat{\mu}_n$   Eqn. 51

Thus, the likelihood gained by splitting node n into $n^{+q}$ and $n^{-q}$ is given by:

$\Delta\mathcal{L}(n; q) = \mathcal{L}(n^{+q}) + \mathcal{L}(n^{-q}) - \mathcal{L}(n)$   Eqn. 52

Thus, using the above, it is possible to construct a decision tree for each cluster where the tree is arranged so that the optimal question is asked first in the tree and the decisions are arranged in hierarchical order according to the likelihood of splitting. A weighting is then applied to each cluster.
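The greedy splitting loop described above can be summarised in a short sketch: keep a pool of leaf nodes, evaluate the gain of the best question at each leaf, and split the leaf with the largest gain until no split exceeds the threshold. The gain callback below is a stand-in for equations 50 to 52 (a simple variance-reduction score is used in the toy example), and the context representation and question set are purely illustrative.

```python
def grow_tree(root_items, questions, gain_fn, threshold):
    """Greedy decision-tree construction by likelihood gain.

    root_items : list of context items initially attached to the root
    questions  : list of predicates, each mapping an item to True/False
    gain_fn    : gain_fn(parent, yes, no) -> gain obtained by the split
    threshold  : minimum gain required to accept a split
    """
    leaves = [root_items]
    while True:
        best = None  # (gain, leaf_index, yes_items, no_items)
        for i, items in enumerate(leaves):
            for q in questions:
                yes = [x for x in items if q(x)]
                no = [x for x in items if not q(x)]
                if not yes or not no:
                    continue
                gain = gain_fn(items, yes, no)
                if best is None or gain > best[0]:
                    best = (gain, i, yes, no)
        if best is None or best[0] <= threshold:
            return leaves            # no further split exceeds the threshold
        _, i, yes, no = best
        leaves[i:i + 1] = [yes, no]  # replace the leaf by its two children

# toy usage: split integers by simple questions, gain = variance reduction
def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

items = [1, 2, 3, 10, 11, 12, 30, 31]
questions = [lambda x, t=t: x < t for t in (5, 20)]
print(grow_tree(items, questions, lambda p, y, n: var(p) - var(y) - var(n), 1.0))
```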
Decision trees might also be constructed for variance. The covariance decision trees are constructed as follows: if the kth terminal node in a covariance decision tree is divided into two new terminal nodes $k^{+q}$ and $k^{-q}$ by question q, the cluster covariance matrix and the gain by the split are expressed as follows:

$\hat{\Sigma}_k = \frac{\sum_{t,s,e,m \in S(k)} \gamma_m(t,s,e)\, \bar{\mathbf{o}}(t)\, \bar{\mathbf{o}}(t)^{\top}}{\sum_{t,s,e,m \in S(k)} \gamma_m(t,s,e)}$   Eqn. 53

$\mathcal{L}(k) = -\frac{1}{2} \sum_{t,s,e,m \in S(k)} \gamma_m(t,s,e) \log \left| \hat{\Sigma}_k \right| + D$   Eqn. 54

where D is a constant independent of $\{\Sigma_k\}$. Therefore the increment in likelihood is

$\Delta\mathcal{L}(k; q) = \mathcal{L}(k^{+q}) + \mathcal{L}(k^{-q}) - \mathcal{L}(k)$   Eqn. 55

In step S309, a specific voice tag is assigned to each of clusters 2, ..., P, e.g. clusters 2, 3, 4 and 5 are for speakers B, C, D and A respectively. Note, because voice A was used to initialise the bias cluster, it is assigned to the last cluster to be initialised.
In step S311, a set of CAT interpolation weights are simply set to 1 or 0 according to the assigned voice tag as:

$\lambda_i^{(s)} = \begin{cases} 1.0 & \text{if } i = 1 \\ 1.0 & \text{if } \mathrm{voicetag}(s) = i \\ 0.0 & \text{otherwise} \end{cases}$

In a model for use with an embodiment, there are global weights per speaker, per stream.
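The step S311 initialisation can be written out directly: each speaker's weight vector has weight 1 for the bias cluster, 1 for the cluster carrying that speaker's voice tag, and 0 elsewhere. The sketch below uses the example tag-to-cluster assignment of step S309; the dictionary names are illustrative.

```python
import numpy as np

def init_cat_weights(voice_tags, cluster_of_tag, num_clusters):
    """Binary CAT interpolation weights per speaker (cluster 1 is the bias)."""
    weights = {}
    for speaker, tag in voice_tags.items():
        lam = np.zeros(num_clusters)
        lam[0] = 1.0                         # bias cluster (i = 1)
        lam[cluster_of_tag[tag] - 1] = 1.0   # cluster assigned to this voice tag
        weights[speaker] = lam
    return weights

# example assignment from step S309: clusters 2..5 for voices B, C, D, A
cluster_of_tag = {"Voice B": 2, "Voice C": 3, "Voice D": 4, "Voice A": 5}
voice_tags = {"speaker1": "Voice A", "speaker2": "Voice B",
              "speaker3": "Voice C", "speaker4": "Voice D"}
for speaker, lam in init_cat_weights(voice_tags, cluster_of_tag, 5).items():
    print(speaker, lam)
```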
In step S313, for each cluster 2, ..., (P-1) in turn, the clusters are initialised as follows.
The voice data for the associated voice, e.g. voice B for cluster 2, is aligned using the mono-speaker model for the associated voice trained in step S303. Given these alignments, the statistics are computed and the decision tree and mean values for the cluster are estimated. The mean values for the cluster are computed as the normalised weighted sum of the cluster means using the weights set in step S311, i.e. in practice this results in the mean values for a given context being the weighted sum (weight 1 in both cases) of the bias cluster mean for that context and the voice B model mean for that context in cluster 2.
In step S315, the decision trees are then rebuilt for the bias cluster using all the data from all 4 voices, and the associated means and variance parameters re-estimated.
After adding the clusters for voices B, C and D the bias cluster is re-estimated using all 4 voices at the same time.
In step S317, cluster P (voice A) is now initialised as for the other clusters, described in step S313, using data only from voice A.

Once the clusters have been initialised as above, the CAT model is then updated/trained as follows: in step S319 the decision trees are re-constructed cluster-by-cluster from cluster 1 to P, keeping the CAT weights fixed. In step S321, new means and variances are estimated in the CAT model. Next, in step S323, new CAT weights are estimated for each cluster. In an embodiment, the process loops back to S321 until convergence. The parameters and weights are estimated using maximum likelihood calculations performed by using the auxiliary function of the Baum-Welch algorithm, to obtain a better estimate of said parameters.
As previously described, the parameters are estimated via an iterative process.
In training a model for use with a further embodiment, at step S323, the process loops back to step S319 so that the decision trees are reconstructed during each iteration until convergence.
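The update schedule of steps S319 to S323 can be summarised as a control-flow skeleton; the callbacks below stand in for the ML re-estimation formulas given earlier and the convergence test is a simple likelihood-change check. This is a sketch of the loop structure only, not the patent's implementation.

```python
def train_cat(model, data, rebuild_trees, update_means_vars, update_weights,
              log_likelihood, max_iters=20, tol=1e-4):
    """Skeleton of the iterative CAT update (steps S319 to S323)."""
    prev = float("-inf")
    for _ in range(max_iters):
        rebuild_trees(model, data)       # S319: rebuild trees, CAT weights fixed
        update_means_vars(model, data)   # S321: re-estimate means and variances
        update_weights(model, data)      # S323: re-estimate CAT weights
        cur = log_likelihood(model, data)
        if cur - prev < tol:             # stop when the likelihood has converged
            break
        prev = cur
    return model

# dummy callbacks just to exercise the control flow; the likelihood saturates at 2.0
state = {"ll": 0.0}
train_cat(state, None,
          rebuild_trees=lambda m, d: None,
          update_means_vars=lambda m, d: None,
          update_weights=lambda m, d: m.update(ll=min(m["ll"] + 0.5, 2.0)),
          log_likelihood=lambda m, d: m["ll"])
print(state["ll"])
```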
The process then returns to step S405 of figure 12, where the model is then trained for different attributes. In this particular example, the attribute is emotion.
In this training, emotion in a speaker's voice is modelled using cluster adaptive training in the same manner as described for modelling the speaker's voice in step S403. First, "emotion clusters" are initialised in step S405. This will be explained in more detail with reference to figure 14.
Data is then collected for at least one of the speakers where the speaker's voice is emotional. It is possible to collect data from just one speaker, where the speaker provides a number of data samples each exhibiting a different emotion, or from a plurality of the speakers providing speech data samples with different emotions. In this training, it will be presumed that the speech samples provided to train the system to exhibit emotion come from the speakers whose data was collected to train the initial CAT model in step S403. However, the system can also train to exhibit emotion using data from a speaker whose data was not used in S403, and this will be described later.
In step S451, the non-neutral emotion data is then grouped into $N_e$ groups. In step S453, $N_e$ additional clusters are added to model emotion. A cluster is associated with each emotion group; for example, a cluster is associated with "Happy", etc. These emotion clusters are provided in addition to the neutral speaker clusters formed in step S403.
In step S455, a binary vector is initialised for the emotion cluster weighting such that, if speech data exhibiting one emotion is to be used for training, the cluster associated with that emotion is set to "1" and all other emotion clusters are weighted at "0".
During this initialisation phase the neutral emotion speaker clusters are set to the weightings associated with the speaker for the data.
Next, the decision trees are built for each emotion cluster in step S457. Finally, the weights are re-estimated based on all of the data in step S459.
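Steps S451 to S459 can be pictured as extending each training utterance's weight vector: the neutral speaker clusters keep the speaker's existing weights, and the new emotion clusters receive a one-hot weighting for the utterance's emotion group. The following is a schematic sketch; the cluster ordering, dictionary layout and names are assumptions for illustration.

```python
import numpy as np

def init_emotion_weights(speaker_weights, emotions, training_data):
    """One-hot emotion-cluster weights appended to the speaker weight vector.

    speaker_weights : dict speaker -> neutral CAT weight vector from step S403
    emotions        : ordered list of emotion groups, one new cluster each
    training_data   : list of (speaker, emotion) pairs for the emotional data
    """
    utterance_weights = []
    for speaker, emotion in training_data:
        emo = np.zeros(len(emotions))
        emo[emotions.index(emotion)] = 1.0      # cluster for this emotion set to 1
        utterance_weights.append(np.concatenate([speaker_weights[speaker], emo]))
    return utterance_weights

speaker_weights = {"spk1": np.array([1.0, 1.0, 0.0]),
                   "spk2": np.array([1.0, 0.0, 1.0])}
emotions = ["Happy", "Angry"]
data = [("spk1", "Happy"), ("spk2", "Angry"), ("spk2", "Happy")]
for w in init_emotion_weights(speaker_weights, emotions, data):
    print(w)
```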
After the emotion clusters have been initialised as explained above, the Gaussian means and variances are re-estimated for all clusters, bias, speaker and emotion in step S407.
Next, the weights for the emotion clusters are re-estimated as described above in step S409. The decision trees are then re-computed in step S411. Next, the process loops back to step S407, and re-estimating the model parameters, followed by the weightings in step S409, followed by reconstructing the decision trees in step S411, is performed until convergence. In an embodiment, the loop S407-S409 is repeated several times.
Next, in step S413, the model variances and means are re-estimated for all clusters, bias, speaker and emotion. In step S415 the weights are re-estimated for the speaker clusters, and the decision trees are rebuilt in step S417. The process then loops back to step S413 and this loop is repeated until convergence. Then the process loops back to step S407 and the loop concerning emotions is repeated until convergence. The process continues until convergence is reached for both loops jointly.
Figure 15 shows clusters 1 to P which are in the form of decision trees. In this simplified example, there are just four terminal nodes in cluster 1 and three terminal nodes in cluster P. It is important to note that the decision trees need not be symmetric, i.e. each decision tree can have a different number of terminal nodes. The number of terminal nodes and the number of branches in the tree is determined purely by the log likelihood splitting, which achieves the maximum split at the first decision and then asks the questions in order of the question which causes the larger split. Once the split achieved is below a threshold, the splitting of a node terminates.
The above produces a canonical model which allows the following synthesis to be performed: 1. Any of the 4 voices can be synthesised using the final set of weight vectors corresponding to that voice in combination with any attribute such as emotion for which the system has been trained. Thus, in the case that only "happy" data exists for speaker 1, providing that the system has been trained with "angry" data for at least one of the other voices, it is possible for the system to output the voice of speaker 1 with the "angry" emotion.
2. A random voice can be synthesised from the acoustic space spanned by the CAT model by setting the weight vectors to arbitrary positions, and any of the trained attributes can be applied to this new voice.
3. The system may also be used to output a voice with 2 or more different attributes. For example, a speaker voice may be outputted with 2 different attributes, such as an emotion and an accent.
To model different attributes which can be combined such as accent and emotion, the two different attributes to be combined are incorporated as described in relation to equation 3 above.
In such an arrangement, one set of clusters will be for different speakers, another set of clusters for emotion and a final set of clusters for accent. The emotion clusters will be initialised as explained with reference to figure 14; the accent clusters will also be initialised as an additional group of clusters, as explained with reference to figure 14 for emotion. Figure 12 shows that there is a separate loop for training emotion and then a separate loop for training speaker. If the voice attribute is to have 2 components such as accent and emotion, there will be a separate loop for accent and a separate loop for emotion.
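At synthesis time, combining attributes amounts to concatenating the weight sub-vectors for the speaker, emotion and accent cluster groups and forming the component means as in equation 35. The sketch below builds an adapted mean for one component from three such sub-vectors; the cluster counts, ordering and weight values are illustrative assumptions only.

```python
import numpy as np

def combined_mean(cluster_means, speaker_w, emotion_w, accent_w):
    """Adapted mean for one component from three independent cluster groups.

    cluster_means : (P, D) means ordered as [speaker clusters | emotion | accent]
    speaker_w, emotion_w, accent_w : weight sub-vectors for each cluster group
    """
    weights = np.concatenate([speaker_w, emotion_w, accent_w])
    assert len(weights) == cluster_means.shape[0]
    return weights @ cluster_means

# 3 speaker clusters (incl. bias), 2 emotion clusters, 2 accent clusters, D = 4
rng = np.random.default_rng(3)
cluster_means = rng.normal(size=(7, 4))
mu = combined_mean(cluster_means,
                   speaker_w=np.array([1.0, 1.0, 0.0]),   # a given speaker
                   emotion_w=np.array([0.0, 1.0]),        # e.g. "angry"
                   accent_w=np.array([1.0, 0.0]))         # e.g. "American"
print(mu)
```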
The framework of the above training allows the models to be trained jointly, thus enhancing both the controllability and the quality of the generated speech trajectories.
The above also allows the requirements for the range of training data to be more relaxed. For example, the training data configuration shown in figure 16 could be used, where there are 3 female speakers (fs1, fs2 and fs3) and 3 male speakers (ms1, ms2 and ms3). Here fs1 and fs2 have an American accent and are recorded speaking with neutral emotion; fs3 has a Chinese accent and is recorded speaking for 3 lots of data, where one data set shows neutral emotion, one data set shows happy emotion and one data set angry emotion. Male speaker ms1 has an American accent and is recorded only speaking with neutral emotion; male speaker ms2 has a Scottish accent and is recorded for 3 data sets speaking with the emotions of angry, happy and sad. The third male speaker ms3 has a Chinese accent and is recorded speaking with neutral emotion. The above system allows voice data to be output with any of the 6 speaker voices with any of the recorded combinations of accent and emotion.
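The figure 16 configuration can be written down as simple metadata, which is enough to enumerate the speaker/accent/emotion combinations that such a model could reproduce. The representation below is purely illustrative and is not part of the patent's system.

```python
# training-data coverage as described for figure 16 (illustrative representation)
corpus = [
    {"speaker": "fs1", "accent": "American", "emotions": ["neutral"]},
    {"speaker": "fs2", "accent": "American", "emotions": ["neutral"]},
    {"speaker": "fs3", "accent": "Chinese",  "emotions": ["neutral", "happy", "angry"]},
    {"speaker": "ms1", "accent": "American", "emotions": ["neutral"]},
    {"speaker": "ms2", "accent": "Scottish", "emotions": ["angry", "happy", "sad"]},
    {"speaker": "ms3", "accent": "Chinese",  "emotions": ["neutral"]},
]

# any of the 6 speaker voices can be combined with any recorded accent/emotion pair
recorded_pairs = {(entry["accent"], e) for entry in corpus for e in entry["emotions"]}
combinations = [(entry["speaker"], accent, emotion)
                for entry in corpus for (accent, emotion) in sorted(recorded_pairs)]
print(len(combinations), "speaker/accent/emotion combinations available")
```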
In an embodiment, there is overlap between the voice attributes and speakers such that the grouping of the data used for training the clusters is unique for each voice characteristic.
In the above embodiments, factorized models are employed to accommodate expression transformation. In another embodiment, expression transformation is performed using non-factorized models. In this embodiment, the training data includes one reference speaker r speaking with an expression n which is sufficiently close to neutral to assume that $A_{r,n} \approx A_r$ and $b_{r,n} \approx b_r$. In this embodiment
$A_{r,n}^{-1} A_{r,e} = A_{r}^{-1} A_{r} A_{e} = A_{e}$

Similarly, since $b_{r,e} - b_{r} = A_{r} b_{e}$, it follows that

$A_{r,n}^{-1} (b_{r,e} - b_{r,n}) = A_{r}^{-1} A_{r} b_{e} = b_{e}$

Consequently, the target acoustic space can be obtained as

$A_{t,n} A_{r,n}^{-1} \left( A_{r,e}\, x + (b_{r,e} - b_{r,n}) \right) + b_{t,n}$

In order to demonstrate the above, experiments were conducted using CMLLR, according to embodiments. The results of the experiments are shown in Figures 17 and 18.
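Before turning to the experimental figures, the transplantation just described can be sketched as a composition of affine transforms: the expression part is recovered from the reference speaker's expressive and neutral transforms and re-applied on top of the target speaker's neutral transform. The composition order follows the reconstruction given above and should be read as an assumption; all matrices below are random placeholders.

```python
import numpy as np

def transplant_expression(A_re, b_re, A_rn, b_rn, A_tn, b_tn):
    """Compose a target-speaker expressive transform from reference-speaker ones.

    (A_re, b_re): reference speaker, expressive    (A_rn, b_rn): reference, neutral
    (A_tn, b_tn): target speaker, neutral
    Returns (A_te, b_te) such that
        A_te x + b_te = A_tn ( A_rn^{-1} ( A_re x + (b_re - b_rn) ) ) + b_tn
    """
    A_rn_inv = np.linalg.inv(A_rn)
    A_te = A_tn @ A_rn_inv @ A_re
    b_te = A_tn @ A_rn_inv @ (b_re - b_rn) + b_tn
    return A_te, b_te

# sanity check: if the reference expressive transform equals its neutral one,
# the target transform reduces to the target speaker's own neutral transform
D = 3
rng = np.random.default_rng(4)
A_rn = rng.normal(size=(D, D)) + 3 * np.eye(D)
b_rn = rng.normal(size=D)
A_tn = rng.normal(size=(D, D)) + 3 * np.eye(D)
b_tn = rng.normal(size=D)
A_te, b_te = transplant_expression(A_rn, b_rn, A_rn, b_rn, A_tn, b_tn)
print(np.allclose(A_te, A_tn), np.allclose(b_te, b_tn))
```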
Figure 17 shows the results of a Differential Mean Opinion Score (DMOS) test evaluating the subjective expression similarity. In the experiment the reference samples comprise original expressive speech uttered by different speakers in different expressions XXX (in Figure 17, the expressions are denoted by the terminal three letter code). VCnF_XXX and VCF_XXX correspond to samples generated using methods according to non-factorized and factorized embodiments described above, respectively, whereby neutral input speech is transformed into the XXX target expression.
PM_neu_XXX corresponds to samples generated directly from neutral input speech without modification and compared to the expression XXX. PM_ref_XXX corresponds to the upper bound of the experiment, defined as samples generated from real speech with the same XXX expression as the DMOS references but different sentences.
Figure 18 shows the results of a speaker similarity DMOS test averaged over all the expressions. VCnF, VCF and PM_neu are the same as in Figure 17. In this case the DMOS reference samples correspond to speech uttered by the target speaker in the input neutral style. TTS indicates samples generated with a CAT text to speech model adapted to the target speaker.
The results shown in Figures 17 and 18 indicate that even with its simplest implementation the methods according to embodiments described above can modify the input expression while preserving the speaker identity.
Systems and methods according to the above embodiments may be used in the scenic arts such as animation, video games and films, etc. For example, the voice of one actor may be combined with the rendering of another. Further, systems and methods according to the above embodiments may be used as part of computer assisted learning in the fields of acting, language and discourse, etc. For example, a student's voice may be modified to exhibit a particular style in order to provide instruction or feedback. The student can determine how close their rendering is to the intended one and adapt their voice accordingly.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.

Claims (11)

  1. CLAIMS: 1. A speech conversion method for modifying speech obtained from a speaker, said method comprising: inputting text corresponding to said speech; dividing said text into a sequence of acoustic units; determining a sequence of speech segments obtained from said speaker; determining a sequence of probability distributions corresponding to said sequence of speech segments; selecting a modification with which to modify said speech; determining a modification factor; modifying said probability distributions using said modification factor; and outputting modified speech as audio, wherein said probability distributions are derived directly from said speech segments, and said modification factor is calculated using an acoustic model.
  2. 2. The speech conversion method of claim 1, wherein said acoustic model has a plurality of model parameters describing acoustic model probability distributions which relate an acoustic unit to a speech vector.
  3. 3. The speech conversion method of claim 2, wherein said sequence of modified speech distributions is a function of said model parameters and said sequence of probability distributions corresponding to said sequence of speech segments.
  4. 4. The speech conversion method of claim 3, wherein said model parameters relate to speaker voice and speaker attribute, and wherein the model parameters modify said acoustic model probability distributions.
  5. 5. The speech conversion method of claim 3, wherein said acoustic model comprises a first set of parameters relating to speaker voice and a second set of parameters relating to speaker attribute, wherein the first and second set of parameters do not overlap and wherein the first and second parameters modify said acoustic model probability distributions.
  6. 6. The speech conversion method of claim 5, wherein selecting a modification comprises selecting parameters from the second set of parameters relating to speaker attribute.
  7. 7. The speech conversion method of claim 6, wherein determining a modification factor comprises: converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model and first parameters obtained from the speech of said speaker; determining a difference between the selected second parameters and second parameters obtained from the speech of said speaker; and determining the modification factor from said difference.
  8. 8. The speech conversion method of claim 5, wherein selecting a modification comprises selecting parameters from the first set of parameters relating to speaker voice and the second set of parameters relating to speaker attribute.
  9. 9. The speech conversion method of claim 8, wherein determining a modification factor comprises: converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; determining a first difference between the selected first parameters and first parameters obtained from the speech of said speaker; determining a second difference between the selected second parameters and second parameters obtained from the speech of said speaker; and determining the modification factor from said first and second difference.
  10. 10. The speech conversion method of claim 9, wherein modifying said probability distributions using said modification factor comprises adding said modification factor to the mean of said probability distributions.
  11. 11. The voice conversion method of claim 1, wherein said text is input from an automatic speech recognition device.

12. The voice conversion method of claim 1, wherein determining a sequence of speech segments comprises: converting said sequence of acoustic units into a sequence of speech segments using a recorded speech model, wherein said recorded speech model comprises a corpus of recorded speech segments.

13. The voice conversion method of claim 1, wherein determining a sequence of speech segments comprises: inputting speech; parameterizing said speech; and segmenting said parameterized speech.

14. The voice conversion method of claim 13, wherein outputting modified speech as audio comprises: determining a sequence of modified speech vectors from said modified probability distributions; outputting said sequence of modified speech vectors as audio.

15. The voice conversion method of claim 13, wherein outputting modified speech as audio comprises: determining a sequence of modified speech vectors from said modified probability distributions; calculating a difference between said inputted speech and said modified speech vectors; shifting said input speech according to said calculated difference; outputting said shifted speech as audio.

16. The voice conversion method of claim 15, wherein shifting said input speech comprises applying a time-domain filter directly to the input speech.

17. The voice conversion method of claim 15, wherein shifting said input speech comprises modifying the fundamental frequency of the input speech.

18. The voice conversion method of claim 15, wherein shifting said input speech comprises rescaling segments of said input speech in order to modify their duration.

19. The voice conversion method of claim 1, wherein each of said speech segments comprises a plurality of frames and wherein the probability distribution for each frame of a segment is the same.

20. A voice conversion system, configured to modify speech obtained from a speaker, said voice conversion system comprising: a processor configured to: receive input text corresponding to said speech; divide said text into a sequence of acoustic units; determine a sequence of speech segments obtained from said speaker; determine a sequence of probability distributions corresponding to said sequence of speech segments; select a modification with which to modify said speech; determine a modification factor; modify said probability distributions using said modification factor; and output modified speech as audio, wherein said probability distributions are derived directly from said speech segments, and said modification factor is calculated using an acoustic model.
GB1405255.9A 2014-03-24 2014-03-24 Voice conversion Expired - Fee Related GB2524505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1405255.9A GB2524505B (en) 2014-03-24 2014-03-24 Voice conversion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1405255.9A GB2524505B (en) 2014-03-24 2014-03-24 Voice conversion

Publications (3)

Publication Number Publication Date
GB201405255D0 GB201405255D0 (en) 2014-05-07
GB2524505A true GB2524505A (en) 2015-09-30
GB2524505B GB2524505B (en) 2017-11-08

Family

ID=50686818

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1405255.9A Expired - Fee Related GB2524505B (en) 2014-03-24 2014-03-24 Voice conversion

Country Status (1)

Country Link
GB (1) GB2524505B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018153359A1 (en) * 2017-02-27 2018-08-30 华为技术有限公司 Emotion state prediction method and robot
CN108630190A (en) * 2018-05-18 2018-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating phonetic synthesis model
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109192225A (en) * 2018-09-28 2019-01-11 清华大学 The method and device of speech emotion recognition and mark
CN110767209A (en) * 2019-10-31 2020-02-07 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
EP3618060A4 (en) * 2017-04-26 2020-04-22 Sony Corporation Signal processing device, method, and program
US20220335928A1 (en) * 2019-08-19 2022-10-20 Nippon Telegraph And Telephone Corporation Estimation device, estimation method, and estimation program


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090070115A1 (en) * 2007-09-07 2009-03-12 International Business Machines Corporation Speech synthesis system, speech synthesis program product, and speech synthesis method
EP2650874A1 (en) * 2012-03-30 2013-10-16 Kabushiki Kaisha Toshiba A text to speech system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018153359A1 (en) * 2017-02-27 2018-08-30 华为技术有限公司 Emotion state prediction method and robot
US11670324B2 (en) 2017-02-27 2023-06-06 Huawei Technologies Co., Ltd. Method for predicting emotion status and robot
EP3618060A4 (en) * 2017-04-26 2020-04-22 Sony Corporation Signal processing device, method, and program
CN108630190A (en) * 2018-05-18 2018-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating phonetic synthesis model
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109036370B (en) * 2018-06-06 2021-07-20 安徽继远软件有限公司 Adaptive training method for speaker voice
CN109192225A (en) * 2018-09-28 2019-01-11 清华大学 The method and device of speech emotion recognition and mark
US20220335928A1 (en) * 2019-08-19 2022-10-20 Nippon Telegraph And Telephone Corporation Estimation device, estimation method, and estimation program
CN110767209A (en) * 2019-10-31 2020-02-07 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN110767209B (en) * 2019-10-31 2022-03-15 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium

Also Published As

Publication number Publication date
GB2524505B (en) 2017-11-08
GB201405255D0 (en) 2014-05-07


Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230324