WO2022263806A1 - Text-to-speech system - Google Patents
- Publication number
- WO2022263806A1 (PCT/GB2022/051491)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- emotion
- gmm
- attention
- component
- scores
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Abstract
A text-to-speech method, comprising a training phase in which the system is trained with a plurality of emotionally tagged data, tagged with a plurality of different emotions, which are applied to a GST model to estimate emotion-dependent style embeddings; generating a Gaussian mixture model (GMM) on said emotion-dependent style embeddings, each Gaussian component representative of one emotion; and, at the time of synthesis, sampling said emotion-dependent style embeddings from the GMM and applying these as an input for controlled speech synthesis.
Description
Text-to-Speech System
This invention relates to a text-to-speech (TTS) system. In particular, it relates to a TTS system which can convey emotion.
TTS systems are well known, and can receive text as an input and output it as synthesised spoken speech. Early TTS systems outputted speech in a rather robotic, monotone manner. There is, however, an increasing desire for TTS systems which mimic the human voice more closely, including expressing emotion associated with the text: expressing text in an "angry" voice where appropriate, in a "sad" voice, and so on. Ultimately the aim is to generate a TTS system that a listener cannot distinguish from a human reading out the text.
A system which uses global style tokens (GST) for expressive speech synthesis has been proposed in:
[1] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, (Stockholm, Sweden), pp. 5180-5189, PMLR, 10-15 Jul 2018.
Since then, these have been used as a way of conveying different styles and emotions in neural TTS systems like that known as "Tacotron" and described in:
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (Calgary, Canada), pp. 4779-4783, Apr. 2018.
An additive multi-head attention module system building on this is described in:
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
GSTs are tokens that can be used to control various styles or emotions of speaking. They are typically a set of vectors that contain a target speaker's prosody and speaking-style information. Their goal was to represent and, to some extent, control speech style by using a large expressive speech database such as audio books. The assumption made by the authors of these techniques was that underlying speech expression can be factorised as a weighted sum of a set of style tokens, or global style tokens, by means of a style model which is trained with the TTS system in a machine-learning environment.
Although these techniques are generally effective, they have several drawbacks. Firstly, GSTs are difficult to interpret in terms of specific styles, and secondly, it is difficult to synthesise speech having a specific style or mix of specific styles. Whilst some attempts have been made to control speech styles at the time of synthesis using GST, each of these has required a specific style to be used at the time of synthesis.
In particular, normal human speech generally uses a mix of emotions. Thus a human reader may express a degree of happiness but also, simultaneously, some fear and some excitement in a certain situation, or other mixed emotions. This is very difficult or impossible to achieve with existing GST-type methods, which are based on single emotions.
The present invention arose in an attempt to provide an improved method and system for expressive TTS and speech synthesis in which a wide range of emotions can be portrayed by the synthesised speech.
According to the present invention, in a first aspect there is provided a text-to-speech method, comprising a training phase in which the system is trained with a plurality of emotionally tagged data tagged with a plurality of different emotions, which are applied to a GST model to estimate emotion-dependent style embeddings; generating a Gaussian mixture model (GMM) on said emotion-dependent style embeddings, with one Gaussian component for each of the plurality of emotions; and, at the time of synthesis, sampling said emotion-dependent style embeddings from each Gaussian component of the GMM to obtain combined mixed emotion scores and applying these as an input for controlled speech synthesis.
Thus the speech synthesis output represents a combined emotion, which was not possible with previously proposed systems.
Preferably (but not exclusively) the plurality of emotions are selected from:
Anger
Happiness
Sadness
Excitement
Surprise
Fear
Disgust
The training step of the expressive TTS system preferably comprises providing a first set of training data of relatively large amount and diversity, and a second set of training data which can be of relatively small amount (compared to the first set) and which is tagged according to the predominant emotion.
The invention further comprises a TTS system comprising apparatus configured to use the method.
Other aspects and features of the invention are disclosed in the dependent claims, description and drawings.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 shows a conventional global style token (GST) emotion modelling system;
Figure 2 shows an overall diagram of a text to speech (TTS) system with emotion control;
Figure 3 shows a training method of an emotion control module; and
Figure 4 shows a synthesis/inference method of an emotion control module.
As described, global style tokens (GST) for expressive speech synthesis have been proposed and used to convey different styles and emotions in TTS systems.
Essentially, a given speech signal is first compressed into a compact vector, known as a reference embedding, by an encoder. The reference embedding is then fed into an attention layer which determines scores of similarity between the embedding and entries from the set of style tokens. These then go through a softmax operation (as known in the art) which results in a so called "condition vector" (CV) which represents how similar the reference embedding is to each token. The combination of the condition vector and the GSTs results in a style embedding.
The style embedding construction process is shown schematically in Figure 1, where a plurality of tokens 1a, 1b to 1k are generated and input to the attention layer 2 to create a series of attention scores 2a, 2b to 2k. After a softmax operation the scores comprise a condition vector 3. The combination of the condition vector (CV) 3 and the GSTs 1a to 1k results in a style embedding, which is essentially a weighted summation of the GSTs where the weights are elements of the condition vector. This final style embedding is then used to condition the TTS onto a specific style or emotion. Note that this will relate to a single emotion - Fear, say - or more generally a single style provided from the reference embedding.
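By way of illustration only, the weighted-summation step described above can be sketched as follows. This is a minimal numerical sketch assuming a bank of K style tokens of dimension D and a vector of raw attention scores; the function and variable names are illustrative and not part of the disclosure:

```python
# Minimal sketch: softmax the attention scores into a condition vector and
# form the style embedding as the CV-weighted sum of the GST bank entries.
import numpy as np

def style_embedding(attention_scores, gst_bank):
    """attention_scores: (K,) similarity scores; gst_bank: (K, D) style tokens."""
    shifted = attention_scores - attention_scores.max()  # shift for numerical stability
    cv = np.exp(shifted) / np.exp(shifted).sum()         # softmax -> condition vector
    return cv @ gst_bank                                 # weighted sum of GSTs, shape (D,)
```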
The style layers may be implemented as an additive multi-head attention (MHA) model, as described in reference [3] above, i.e. a plurality of heads (or sub-vectors), each relating to a particular emotional style.
In more detail, in the GST approach a given speech signal is first compressed into a compact vector, a reference embedding, by an encoder. The reference embedding is then fed into an attention layer where the goal is not to align but to determine scores of similarity between the embedding and entries from a set of style tokens. After going through a softmax operation these scores compose a so-called condition vector (CV), and represent how similar the reference embedding is to each token of the GST bank. The combination of a CV and GSTs results in a style embedding, given by

$$ s = \sum_{k=1}^{K} c_k \, g_k, $$

where $g_k$ and $c_k$ are respectively the entries of the GST bank and the components of the CV, with $K$ being the number of tokens, and each GST being a $D$-dimensional vector, i.e. $g_k \in \mathbb{R}^{D}$.

The style embedding $s$ is then used to condition the TTS onto a specific style or emotion.
In [1] the style layer is implemented as an additive multi-head attention (MHA) module [3]. In this case the style embedding is a concatenation of individual head-dependent style embeddings

$$ s_h = \sum_{k=1}^{K} c_{h,k} \, g_{h,k}, $$

where $c_{h,k}$ and $g_{h,k}$ are respectively the CV components and GST bank entries for head $h$. In this case the dimension of each token is divided by the number of heads, i.e. $g_{h,k} \in \mathbb{R}^{D/H}$. If the MHA has $H$ heads, the final style embedding becomes:

$$ s = [\, s_1 ; s_2 ; \ldots ; s_H \,]. $$

The MHA realization of GST is a specific case under the style factorization emotion control proposed by [1].
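As a hedged illustration of the multi-head case (the names and array shapes below are assumptions for the sketch, not the claimed implementation), the concatenation of head-dependent embeddings can be written as:

```python
# Minimal sketch: each head h has its own condition vector and token sub-bank
# of dimension D/H; the final style embedding concatenates the per-head sums.
import numpy as np

def mha_style_embedding(cv_per_head, gst_per_head):
    """cv_per_head: (H, K) condition vectors; gst_per_head: (H, K, D // H) token sub-banks."""
    heads = [c @ g for c, g in zip(cv_per_head, gst_per_head)]  # each head: (D // H,)
    return np.concatenate(heads)                                # concatenated embedding, shape (D,)
```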
In embodiments of the present invention, a plurality of emotional training samples are used, in different emotional styles. These may, for example, be: anger, happiness, sadness, excitement, surprise, fear and disgust. A training piece is read by a reader using a particular emotion and this is labelled. Thus a user may read a piece in an "angry" voice and it is labelled as such, similarly with a "happy" voice, a "sad" voice, and so on. This generates a plurality of speech audio signals as training samples which are labelled with appropriate emotions. One emotion is associated with each labelled sample.
The training data for the TTS system may also comprise one or more typically longer samples of people reading text in a neutral and not over-expressive voice. A typical example may be around four hours long, but it may be any length.
For training the entire system, both emotionally tagged and neutral data will preferably be used. For style control training it may be possible to use just the emotionally tagged data.
The general technique is shown in Figure 2 and more specific training and synthesis parts are shown in more detail in Figures 3 and 4.
Referring to Figure 2, the training data, which may include text 5 and audio 6 samples, is stored in a database 7. This is then applied to a training file 8 where a style model is trained together with the TTS system, driven by the TTS loss. This means that the TTS system and the emotion control are trained together in such a way as to reproduce a synthesised speech signal that is as natural as possible.
Or, in other words, the goal of the joint training is to provide a speech signal that can be as close as possible to its natural version.
Thus, the training phase includes TTS training 9 and style model training 10, which provides style embeddings 11 to the TTS training. This results in trained models 12, including a TTS model 13 and a style model 14. As described below, at the training stage emotion condition vectors (CVs) are acquired and accumulated and a Gaussian Mixture Model (GMM) is fitted on these, where each GMM component is expected to represent one of the different emotions. These represent the trained models 12.
At synthesis time 15 the text to be output as TTS 16 is applied to a synthesis system together with the emotional scores 17 (see below). These are then mixed in the synthesis model 15 using TTS inference 18 and style embedding creation 19, which provides style embeddings to the TTS inference 18, in order to generate synthetic speech (i.e. the TTS output) with emotions. As described, this may have a mix of emotions and may have different degrees of each emotion.
Turning now to the training phase. As described, a plurality of training samples are obtained, each labelled with a different emotion. From these, a GMM is fitted to the samples, resulting in one Gaussian component representing each emotion.
Referring to Figure 3, the labelled emotional data samples 20, comprising emotional data 1, emotional data 2, ..., emotional data J, are applied to a style model 21. This results in a series of attention weights for each emotion. Thus, there will be a first attention weight for emotion 1 (say sadness), another attention weight for emotion 2 (say happiness), and so on. These are then used to initialise a Gaussian component 23 for each emotion. After this, the GMM is trained by maximum likelihood using an expectation-maximisation (EM) algorithm. The procedure below describes an embodiment of how the GMM is estimated, by way of example only:

1. initialise the means by taking the average of the corresponding emotion-dependent samples,

$$ \mu^{(j)} = \frac{1}{N_j} \sum_{n=1}^{N_j} e_n^{(j)}, $$

where $N_j$ is the number of samples in emotional dataset $j$;

2. initialise the remaining GMM parameters, for example the component covariances as $\Sigma^{(j)} = I$, where $I$ is an identity matrix, and the mixture coefficients as $w_j = 1/J$;

3. perform the E step of the expectation-maximisation algorithm, by estimating the posterior probabilities

$$ \gamma_n^{(j)} = \frac{w_j \, \mathcal{N}\!\left(e_n \mid \mu^{(j)}, \Sigma^{(j)}\right)}{\sum_{i=1}^{J} w_i \, \mathcal{N}\!\left(e_n \mid \mu^{(i)}, \Sigma^{(i)}\right)}; $$

4. perform the M step of the expectation-maximisation algorithm by estimating the new component means, covariances and coefficients by making

$$ \mu^{(j)} = \frac{\sum_n \gamma_n^{(j)} e_n}{\sum_n \gamma_n^{(j)}}, \qquad \Sigma^{(j)} = \frac{\sum_n \gamma_n^{(j)} \left(e_n - \mu^{(j)}\right)\left(e_n - \mu^{(j)}\right)^{\top}}{\sum_n \gamma_n^{(j)}}, \qquad w_j = \frac{1}{N} \sum_n \gamma_n^{(j)}; $$

5. repeat steps 3 and 4 several times until there is not much change in the log-likelihood cost below:

$$ \mathcal{L} = \sum_{n=1}^{N} \log \sum_{j=1}^{J} w_j \, \mathcal{N}\!\left(e_n \mid \mu^{(j)}, \Sigma^{(j)}\right). $$
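By way of example only, the per-emotion initialisation and EM fitting described above could be sketched as follows; the use of scikit-learn's GaussianMixture and the array layout are assumptions made for illustration and are not part of the disclosed embodiment:

```python
# Illustrative sketch: fit a GMM over pre-softmax attention-score vectors,
# one Gaussian component per emotion, with each component mean initialised
# from the average of that emotion's samples (step 1 above).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_emotion_gmm(scores_by_emotion):
    """scores_by_emotion: list of J arrays, one per emotion, each of shape (N_j, K)."""
    X = np.concatenate(scores_by_emotion, axis=0)                        # all samples, shape (N, K)
    means_init = np.stack([e.mean(axis=0) for e in scores_by_emotion])   # emotion-dependent means
    gmm = GaussianMixture(
        n_components=len(scores_by_emotion),  # one component per emotion
        covariance_type="full",
        means_init=means_init,                # step 1: initialise means per emotion
        max_iter=100,                         # steps 3-5: EM iterations until convergence
    )
    gmm.fit(X)
    return gmm                                # component j is tied to emotion j by initialisation
```

In this sketch the label-to-component association is kept simply by the ordering of the initial means, mirroring the statement below that each GMM component is inherently linked to one emotion.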
This results in a plurality of GMM components 24, one for each emotion; that is, GMM component 1, GMM component 2, ..., GMM component J. In a preferred embodiment there are seven emotions, but different numbers and types of emotions may be used.
As the GMM components are initialised from the samples, each one is given one emotional label (sadness, happiness, etc.) so that each GMM component set 24 represents one emotion. The set of components is then applied during synthesis time.
At this time the actual text to be synthesised is analysed and the emotional content of the text is determined. This will typically comprise a plurality of emotions, with different degrees of "intensity" for each emotion. The amount of each emotion (from the relevant GMM component) in the text to be synthesised is determined. Typically, this results in a score between a zero value and a maximum value, for example between 0 and 1 for each emotion for the text, where 0 represents a situation where the text has none of that particular emotion and 1 where it is entirely of that emotion.
Thus, a passage which is entirely happy without any other emotion may have a score of 1 for the happiness emotion and a score of 0 for every other emotion. A text by a user who is angry but has no other emotion will have a score of 1 for the anger model, and 0 for the others. In practice, there will be a degree of each emotion, and thus a typical text may have a score of, say, 0.5 for happiness, 0.75 for anger, 0.24 for disgust, and so on, and so represents and includes a range and plurality of emotions. By using all or some of the GMM components, one for each emotion, and combining these, the actual synthesis text can be provided with this range of emotions.
Referring more particularly to Figure 4, the emotional control model 24 (from the training stage) is shown comprising the GMM components, one for each emotion. These are then sampled 25 to provide an attention weight for each emotion (representing the degree of information that should be taken from the GST bank to be used in a particular text). After that, these attention weights are multiplied by a corresponding emotional score. These emotional scores, which come from the frontend or the user, may be, for example, 0.2 for happiness, 0.8 for excitement, and so on. These are then combined at stage 28, a softmax process 29 is applied, and a CV 13 is generated which is used for the actual TTS output.
To train the GMM, in effect, all of the samples of one emotion (e.g. all the "happy" samples) are used to calculate the mean vector of the "happy" component. This is then used as the initial mean of the corresponding Gaussian component of the GMM relating to happiness. The same is done for all the different emotions. Once this initial mean is set in the initialising step, the GMM can be trained and its means are iteratively updated during the GMM training. The samples have already been labelled with a particular emotion, and therefore each emotion is inherently linked to a particular GMM component 24.
In Figure 4, the set of emotional scores (one for each GMM component, and hence for each emotion) is provided either by a user or by a front end of a system. An attention score vector is sampled 25 from each component of the trained GMM and these are then combined with the provided emotional scores to generate the synthetic CV 30 used for synthesis.
Once this synthetic CV 30 is obtained, style embedding can be constructed using a process as shown in Figure 2 above, and speech can be generated using a general process (Tacotron/GST) as shown in Figure 1.
As is known in neural network and machine learning technologies, an embedding is a vector that represents specific information. An embedding can, for example, represent a speaker, resulting in a set of vectors or embeddings in which each represents a specific speaker. In the present invention, the embeddings may represent styles.
Thus, in embodiments of the invention, emotion control can be divided into training and inference stages. At training time emotional CVs are accumulated and a GMM is fitted on them. At synthesis time CVs from each Gaussian component are sampled and mixed, based on the scores provided by the TTS front end.
Training
Starting from a fully trained Tacotron and style model [1], attention scores are collected from emotionally tagged data before being applied to the softmax layer, to obtain emotion-dependent attention scores $e_n^{(j)}$, where $j$ and $n$ are respectively emotion and sample indices, and the scores prior to softmax are

$$ e_n^{(j)} = \left[\, e_{n,1}^{(j)}, \ldots, e_{n,K}^{(j)} \,\right], $$

where $e_{n,k}^{(j)}$ means the k-th attention score before softmax of the n-th sample of emotional dataset $j$.

After that, a GMM is fitted on $\{ e_n^{(j)} \}$, where $J$ is the number of intended styles or emotions. To enforce emotion controllability each component mean $\mu^{(j)}$ is initialised by making

$$ \mu^{(j)} = \frac{1}{N_j} \sum_{n=1}^{N_j} e_n^{(j)}, $$

where $N_j$ is the number of samples in emotional dataset $j$. In order to enable interpretable emotion control at synthesis it is assumed that each component represents one emotion.
Synthesis
At synthesis time, first a set of emotional scores is provided by the user or the TTS frontend:

$$ \alpha = \left[\, \alpha_1, \ldots, \alpha_J \,\right]. $$

Then an attention score vector is sampled independently from each component of the trained GMM,

$$ \hat{e}^{(j)} \sim \mathcal{N}\!\left(\mu^{(j)}, \Sigma^{(j)}\right), $$

with $\Sigma^{(j)}$ being the covariance matrix of component $j$ and $\mathcal{N}(\cdot,\cdot)$ meaning a normal distribution. After that the frontend emotion scores $\alpha_j$ are combined with the sampled attention scores,

$$ \tilde{e} = \sum_{j=1}^{J} \alpha_j \, \hat{e}^{(j)}, $$

where $\tilde{e} = [\tilde{e}_1, \ldots, \tilde{e}_K]$ become the mixed emotion scores. The adjusted CV components are calculated as

$$ \tilde{c}_k = \frac{\exp(\tilde{e}_k)}{\sum_{i=1}^{K} \exp(\tilde{e}_i)}, $$

and the final style embedding is given by

$$ \tilde{s} = \sum_{k=1}^{K} \tilde{c}_k \, g_k. $$
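A minimal sketch of this synthesis-time mixing, under the assumption that the GMM parameters and GST bank are available as plain arrays (the function name and shapes are illustrative only):

```python
# Illustrative sketch: sample a pre-softmax attention vector from each
# emotion component, blend with the frontend emotion scores, softmax into
# an adjusted condition vector, and form the final style embedding.
import numpy as np

def mixed_style_embedding(gmm_means, gmm_covs, emotion_scores, gst_bank, rng=None):
    """gmm_means: (J, K); gmm_covs: (J, K, K); emotion_scores: (J,); gst_bank: (K, D)."""
    rng = np.random.default_rng() if rng is None else rng
    # One sampled attention-score vector per emotion component.
    sampled = np.stack([rng.multivariate_normal(m, c) for m, c in zip(gmm_means, gmm_covs)])
    mixed = emotion_scores @ sampled              # mixed emotion scores, shape (K,)
    shifted = mixed - mixed.max()                 # shift for numerical stability
    cv = np.exp(shifted) / np.exp(shifted).sum()  # adjusted condition vector
    return cv @ gst_bank                          # final style embedding, shape (D,)
```

For example, emotion_scores could be [0.2, 0.8, 0.0, ...] for a text that is mildly happy and strongly excited, matching the scores discussed earlier.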
Claims
1. A text-to-speech method, comprising a training phase in which the system is trained with a plurality of emotionally tagged data tagged with a plurality of different emotions which are applied to a GST model to estimate emotion-dependent style embeddings; generating a Gaussian mixture model (GMM) on said emotion-dependent style embeddings, one GMM component for each of the plurality of emotions; and, at the time of synthesis, sampling said emotion-dependent style embeddings from the GMM components to obtain combined mixed emotion scores and applying these as an input for controlled speech synthesis.
2. A method as claimed in claim 1 wherein one GMM component is generated for each one of the plurality of different emotions.
3. A method as claimed in claim 1 and claim 2 wherein the plurality of emotions are selected from anger, happiness, sadness, excitement, surprise, fear and disgust.
4. A method as claimed in any preceding claim wherein the training step comprises providing a first set of training data of relatively large amount and a second set of training data of relatively small amount compared to the first set of training data, each of which is tagged according to a predominant emotion.
5. A method as claimed in any preceding claim wherein, in the training phase, emotionally tagged data is applied to an attention weights estimation module and attention weights are generated for each emotion.
6. A method as claimed in claim 5 wherein the attention weight on each emotion is used to generate a GMM component for that emotion.
7. A method as claimed in claim 6 wherein the number of GMM components corresponds to the number of intended emotions.
8. A method as claimed in any preceding claim wherein during the training phase each GMM component is initialised with a particular emotion.
9. A method as claimed in any preceding claim wherein, at the time of synthesis, the text to be synthesised is analysed, and the emotional content of the text is determined.
10. A method as claimed in claim 9 wherein at the time of synthesis, a score is allocated to each GMM component representative of the amount of each emotion in the text, ranging from a minimum value, where it has none of that particular emotion, to a maximum value (where the text is wholly of that emotion).
11. A method as claimed in claim 10 wherein a value is given to each GMM component between a minimum and maximum value, inclusive, establishing the amount of that particular emotion in the text, to provide an emotion score for each GMM component and therefore each emotion, and using these to obtain a condition vector for synthesis comprising desired amounts of the respective emotions.
12. A method as claimed in claim 11 wherein the GMM components are sampled to provide an attention weight for each emotion.
13. A method as claimed in claim 12 wherein the attention weights are used to generate an emotional score which is then combined, and further, comprising applying a softmax process and generating a condition vector.
14. A method as claimed in any preceding claim, wherein, in the training phase, attention scores are collected from emotionally tagged data before being applied to a softmax layer, to obtain emotion-dependent attention scores $e_n^{(j)} = [e_{n,1}^{(j)}, \ldots, e_{n,K}^{(j)}]$, where $j$ and $n$ are respectively emotion and sample indices and $e_{n,k}^{(j)}$ means the k-th attention score before softmax of the n-th sample of emotional dataset $j$; and wherein a GMM is fitted on $\{ e_n^{(j)} \}$, where $J$ is the number of intended styles or emotions, and, to enforce emotion controllability, each component mean $\mu^{(j)}$ is initialised by making $\mu^{(j)} = \frac{1}{N_j} \sum_{n=1}^{N_j} e_n^{(j)}$, where $N_j$ is the number of samples in emotional dataset $j$.
15. A method as claimed in any preceding claim wherein, during the synthesis stage, after a set of emotional scores $\alpha = [\alpha_1, \ldots, \alpha_J]$ is provided, an attention score vector is sampled independently from each component of the trained GMM, $\hat{e}^{(j)} \sim \mathcal{N}(\mu^{(j)}, \Sigma^{(j)})$, with $\Sigma^{(j)}$ being the covariance matrix of component $j$ and $\mathcal{N}(\cdot,\cdot)$ meaning a normal distribution; the frontend emotion scores $\alpha_j$ are combined with the sampled attention scores, $\tilde{e} = \sum_{j=1}^{J} \alpha_j \hat{e}^{(j)}$, where $\tilde{e}$ become the mixed emotion scores; a set of adjusted CV components is calculated as $\tilde{c}_k = \exp(\tilde{e}_k) / \sum_{i=1}^{K} \exp(\tilde{e}_i)$; and a final style embedding is given by $\tilde{s} = \sum_{k=1}^{K} \tilde{c}_k g_k$.
16. A TTS system comprising apparatus configured to use the method of any of the preceding claims.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2108468.6 | 2021-06-14 | ||
GB2108468.6A GB2607903B (en) | 2021-06-14 | 2021-06-14 | Text-to-speech system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022263806A1 true WO2022263806A1 (en) | 2022-12-22 |
Family
ID=76954504
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2022/051491 WO2022263806A1 (en) | 2021-06-14 | 2022-06-14 | Text-to-speech system |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2607903B (en) |
WO (1) | WO2022263806A1 (en) |
-
2021
- 2021-06-14 GB GB2108468.6A patent/GB2607903B/en active Active
-
2022
- 2022-06-14 WO PCT/GB2022/051491 patent/WO2022263806A1/en unknown
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160093289A1 (en) * | 2014-09-29 | 2016-03-31 | Nuance Communications, Inc. | Systems and methods for multi-style speech synthesis |
US20210035551A1 (en) * | 2019-08-03 | 2021-02-04 | Google Llc | Controlling Expressivity In End-to-End Speech Synthesis Systems |
WO2021034786A1 (en) * | 2019-08-21 | 2021-02-25 | Dolby Laboratories Licensing Corporation | Systems and methods for adapting human speaker embeddings in speech synthesis |
Non-Patent Citations (6)
Title |
---|
A. VASWANI, N. SHAZEER, N. PARMAR, J. USZKOREIT, L. JONES, A. N. GOMEZ, L. U. KAISER, AND I. POLOSUKHIN: "Advances in Neural Information Processing Systems", vol. 30, 2017, CURRAN ASSOCIATES, INC., article "Attention is all you need" |
AN XIAOCHUN ET AL: "Effective and direct control of neural TTS prosody by removing interactions between different attributes", NEURAL NETWORKS, ELSEVIER SCIENCE PUBLISHERS, BARKING, GB, vol. 143, 11 June 2021 (2021-06-11), pages 250 - 260, XP086810988, ISSN: 0893-6080, [retrieved on 20210611], DOI: 10.1016/J.NEUNET.2021.06.006 * |
J. SHEN, R. PANG, R. J. WEISS, M. SCHUSTER, N. JAITLY, Z. YANG, Z. CHEN, Y. ZHANG, Y. WANG, R. SKERRY-RYAN: "Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), (CALGARY, CANADA), April 2018 (2018-04-01), pages 4779 - 4783 |
KWON OHSUNG ET AL: "Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems", 21 May 2019 (2019-05-21), XP055889982, Retrieved from the Internet <URL:https://arxiv.org/pdf/1905.08486.pdf> [retrieved on 20220210] * |
UM SE-YUN ET AL: "Emotional Speech Synthesis with Rich and Granularized Control", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 7254 - 7258, XP033793390, DOI: 10.1109/ICASSP40776.2020.9053732 * |
Y. WANG, D. STANTON, Y. ZHANG, R. SKERRY-RYAN, E. BATTENBERG, J. SHOR, Y. XIAO, Y. JIA, F. REN, R. A. SAUROUS: "Proceedings of the 35th International Conference on Machine Learning", vol. 80, article "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis", pages: 5180 - 5189 |
Also Published As
Publication number | Publication date |
---|---|
GB2607903B (en) | 2024-06-19 |
GB2607903A (en) | 2022-12-21 |
GB202108468D0 (en) | 2021-07-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22737943 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16/04/2024) |