GB2546981A - Noise compensation in speaker-adaptive systems - Google Patents

Noise compensation in speaker-adaptive systems

Info

Publication number
GB2546981A
GB2546981A GB1601828.5A GB201601828A GB2546981A GB 2546981 A GB2546981 A GB 2546981A GB 201601828 A GB201601828 A GB 201601828A GB 2546981 A GB2546981 A GB 2546981A
Authority
GB
United Kingdom
Prior art keywords
speech
noise
parameters
values
speech factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1601828.5A
Other versions
GB2546981B (en)
GB201601828D0 (en)
Inventor
Latorre-Martinez Javier
Ping Leung Wan Vincent
Yanagisawa Kayoko
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1601828.5A priority Critical patent/GB2546981B/en
Publication of GB201601828D0 publication Critical patent/GB201601828D0/en
Priority to JP2017017624A priority patent/JP6402211B2/en
Priority to US15/422,583 priority patent/US10373604B2/en
Publication of GB2546981A publication Critical patent/GB2546981A/en
Application granted granted Critical
Publication of GB2546981B publication Critical patent/GB2546981B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065Adaptation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065Adaptation
    • G10L15/07Adaptation to the speaker
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065Adaptation
    • G10L15/07Adaptation to the speaker
    • G10L15/075Adaptation to the speaker supervised, i.e. under machine guidance
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/20Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Abstract

An adaptive acoustic model (eg. a Deep Neural Network, DNN) relating acoustic units to speech vectors for eg. text to speech (TTS) synthesis takes noise-corrupted speech samples as an input and uses noise characterisation parameters to map noise-corrupted speech factor parameter values (eg. Maximum Likelihood Linear Regression transforms or Cluster Adaptive Training weights) to clean speech factor parameter values. The adaptive model may be trained by adjusting the weights of the DNN so as to minimize error (S507-509).

Description

Noise compensation in speaker-adaptive systems
FIELD
The present disclosure relates to a system and method for adapting an acoustic model which relates acoustic units to speech vectors. The system and method may be employed in a text-to-speech system and method.
BACKGROUND
Text to speech systems are systems where audio speech or audio speech files are output in response to reception of a text file.
Voice cloning or custom voice building systems are systems which can synthesize speech with the voice of a given speaker with just a few samples of the voice of that speaker. Voice cloning systems are often combined with text to speech systems.
Text to speech systems and voice cloning systems are used in a wide variety of applications such as electronic games, E-book readers, E-mail readers, satellite navigation, automated telephone systems, automated warning systems.
There is a continuing need to make systems more robust to noise.
SUMMARY OF THE INVENTION
According to aspects of the present invention there are provided a method for adapting an acoustic model relating acoustic units to speech vectors according to claim 1, and an apparatus according to claim 13.
The method and apparatus can be employed respectively in a method of text to speech synthesis according to claim 9, and a text to speech apparatus according to claim 14.
BRIEF DESCRIPTION OF THE DRAWINGS
Systems and methods in accordance with non-limiting examples of the invention will now be described with reference to the accompanying figures in which:
Figure 1 shows a schematic of a text to speech system in accordance with an example of the invention;
Figure 2 is a flow diagram of a text to speech process in accordance with an example of the invention;
Figure 3 shows a Gaussian function;
Figure 4 is a flow diagram of acoustic model adaptation using clean speech;
Figure 5 is a flow diagram of acoustic model adaptation using noisy speech;
Figure 6 is a schematic representation of the effect of noise levels on speaker parameters;
Figure 7 is a schematic representation of mapping from noisy to clean speech parameters in accordance with an example of the invention;
Figure 8 is a schematic representation of a system for mapping noisy speech parameters to clean speech parameters in accordance with an example of the invention; and
Figure 9 is a flow diagram of neural network training in accordance with an example of the invention.
DETAILED DESCRIPTION
Figure 1 shows a text to speech system in accordance with an example of the present invention.
The text to speech system 1 comprises a processor 3 which executes a program 5. The program 5 comprises an acoustic model. The acoustic model will be described in more detail below. Text to speech system 1 further comprises storage 7. The storage 7 stores data which is used by program 5 to convert text to speech. The text to speech system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a text input 15. Text input 15 receives text. The text input 15 may be for example a keyboard. Alternatively, text input 15 may be a means for receiving text data from an external storage medium or a network.
Connected to the output module 13 is an output for audio 17. The audio output 17 is used for outputting a speech signal converted from text which is input into text input 15. The audio output 17 may be for example a sound generation unit e.g. a speaker, or an output for an audio data file which may be sent to a storage medium, networked, etc.
The text to speech system 1 also comprises an audio input 23 (such as a microphone) and an audio input module 21. In an example of the invention, the output of the audio input module 21 when the audio input module 21 receives a certain input audio signal (“input audio”) from the audio input 23, is used to adapt the speech output from the text to speech system. For example, the system may be configured to output speech with the same speaker voice as that of a sample of speech captured by the audio input 23.
In this example of the invention, the text to speech system 1 receives an audio sample through audio input module 21 and text through text input 15. The program 5 executes on processor 3 and converts the text into speech data using data stored in the storage 7 and the input audio sample. The speech is output via the output module 13 to audio output 17. A simplified text to speech process will now be described with reference to Figure 2.
In a first step, S3101, text is input. The text may be input via a keyboard, touch screen, text predictor or the like. The text is then converted into a sequence of acoustic units in step S3103. These acoustic units may be phonemes or graphemes. The units may be context dependent e.g. triphones which take into account not only the phoneme which has been selected but also the preceding and following phonemes. The text is converted into the sequence of acoustic units using techniques which are well-known in the art and will not be explained further here.
In step S3105, the probability distributions are looked up which relate acoustic units to speech parameters. In this example of the invention, the probability distributions will be Gaussian distributions which are defined by means and variances. However, it is possible to use other distributions such as the Poisson, Student-t, Laplacian or Gamma distributions some of which are defined by variables other than the mean and variance.
It is impossible for each acoustic unit to have a definitive one-to-one correspondence to a speech vector or “observation” to use the terminology of the art. Many acoustic units are pronounced in a similar manner, are affected by surrounding acoustic units, their location in a word or sentence, or are pronounced differently by different speakers. Thus, each acoustic unit only has a probability of being related to a speech vector and text-to-speech systems calculate many probabilities and choose the most likely sequence of observations given a sequence of acoustic units. A Gaussian distribution is shown in figure 3. Figure 3 can be thought of as being the probability distribution of an acoustic unit relating to a speech vector. For example, the speech vector shown as X has a probability P of corresponding to the phoneme or other acoustic unit which has the distribution shown in figure 3.
The shape and position of the Gaussian is defined by its mean and variance. These parameters are determined during the training of the system.
These parameters are then used in the acoustic model in step S3107. In this description, the acoustic model is a Hidden Markov Model (HMM). However, other models could also be used.
The memory of the text to speech system will store many probability density functions relating an acoustic unit i.e. phoneme, grapheme, word or part thereof to speech parameters. As the Gaussian distribution is generally used, these are generally referred to as Gaussians or components.
In a Hidden Markov Model or other type of acoustic model, the probability of all potential speech vectors relating to a specific acoustic unit must be considered. Then the sequence of speech vectors which most likely corresponds to the sequence of acoustic units will be taken into account. This implies a global optimization over all the acoustic units of the sequence taking into account the way in which two units affect each other. As a result, it is possible that the most likely speech vector for a specific acoustic unit is not the best speech vector when a sequence of acoustic units is considered.
In some examples of the invention, there will be a plurality of different states which will each be modelled using a Gaussian. For example, in an example of the invention, the text-to-speech system comprises multiple streams. Such streams may be selected from one or more of spectral parameters (Spectrum), Log of fundamental frequency (Log F0), first differential of Log F0 (Delta Log F0), second differential of Log F0 (Delta-Delta Log F0), Band aperiodicity parameters (BAP), duration etc. The streams may also be further divided into classes such as silence (sil), short pause (pau) and speech (spe) etc. In an example of the invention, the data from each of the streams and classes will be modelled using a HMM. The HMM may comprise different numbers of states. For example, in an example of the invention, 5 state HMMs may be used to model the data from some of the above streams and classes. A Gaussian component is determined for each HMM state.
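A minimal sketch, assuming diagonal-covariance Gaussians and the stream and class names listed above, of one way the per-stream, per-class, per-state Gaussians might be organised; the dimensionality and data layout are illustrative assumptions rather than the system's actual storage format.

```python
import numpy as np

# Streams, classes and state count as listed above; the dimensionality is invented.
streams = ["spectrum", "logF0", "deltaLogF0", "deltaDeltaLogF0", "bap", "duration"]
classes = ["sil", "pau", "spe"]
N_STATES, DIM = 5, 40

# One diagonal-covariance Gaussian per HMM state, per (stream, class) pair.
model = {
    (stream, cls): [{"mean": np.zeros(DIM), "var": np.ones(DIM)}
                    for _ in range(N_STATES)]
    for stream in streams
    for cls in classes
}
print(len(model), "stream/class HMMs, each with", N_STATES, "state Gaussians")
```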
Once a sequence of speech vectors has been determined, speech is output in step S3109.
As noted above, in an example of the invention, the parameters employed in step S3105 may be adapted according to input audio which is input into the system via audio input module 21 of Figure 1. This adaptation enables speech with the speaker voice (and/or other speech factors, as appropriate) of the input audio to be output from the text to speech system.
In an example of the invention, the acoustic model is one which is able to capture a range of factors of speech data such as speaker, expression or some other factor of speech data. In a further example of the invention, the output speech comprises the same value of a speaker factor as that of the input audio sample, i.e. the text to speech system 1 adapts the acoustic model such that the output speech matches that of the audio sample.
For example, the output speech is output with the same speaker voice as the input speech.
In an example of the invention, the acoustic model is a Hidden Markov Model (HMM). However, other models could also be used. Hidden Markov Models and their training are described in Appendix 1. The modelling of speech factors such as speaker voice is explained in relation to three methods of training HMMs: Maximum Likelihood Linear Regression (MLLR), Cluster Adaptive Training (CAT) and Constrained Maximum Likelihood Linear Regression (CMLLR).
Although any speech factor, or multiple speech factors can be used as a basis for adapting the output speech of a text to speech system, in the below description, it is assumed that the speech factor is that of speaker voice.
Figure 4 shows a flow chart of the process of adaptation of such models in response to input audio where the audio was recorded under “clean” conditions, i.e. the input audio is not corrupted by noise. In step S201, the input audio (input speech) is received. In Step S203, the speaker transforms of the acoustic model are modified until a set of speaker transforms which most closely matches the input speech is determined. The nature of the speaker transforms and the method employed to obtain the set of speaker transforms will depend on the nature of the acoustic model employed.
The training of three suitable acoustic models which capture speaker variation is described in Appendix 1. During training, an Auxiliary function for each model (e.g. Equation (0) for MLLR) is maximised for each training speaker in order to obtain the corresponding speaker transform. In an example of the invention, during adaptation, the Auxiliary function is maximised for the input adaptation data in order to obtain the speaker transform of the input speaker. Adaptation therefore comprises effectively retraining the acoustic model using the input data. If the exact context of the adaptation data is not found in the training data, decision trees are employed to find similar contexts within the training data instead.
In Step S205, text to speech synthesis is performed employing the speaker transforms calculated in step S203 according to the process described above in relation to Figure 2. Thus, the speech of the input text is output with the same voice as that of the speech received in step S201.
The above method is suitable for adapting acoustic models and obtaining speech which matches speech received under clean conditions, i.e. when the speech sample was obtained in a controlled sound environment, such as in a studio using high quality equipment. However, the input audio sample may be corrupted with additive or convolutional noise. If the method of Figure 4 were employed for a sample of speech obtained under noisy conditions, the speaker transform obtained during adaptation would be adapted to the corrupted speech and therefore itself be corrupted by noise. Consequently, the corresponding audio output would also be affected by the noise corrupting the original speech signal. Depending on the degree of noise corruption, the quality of match between the voice of the person to whom the speech sample belonged and the output audio may be poor. In an example of the invention, the system according to Figure 1 is configured to compensate for the noise in the input signal and therefore the output speech is substantially unaffected by the degree of noise corruption that the input audio sample is subjected to.
The effect of noise on a speech signal can be modelled as follows. In an example of the invention, it is assumed that a speech signal s(τ) is corrupted by an additive noise signal n(τ) and a convolutional channel noise h(τ). The observed signal s_o(τ) will be
s_o(τ) = (s(τ) + n(τ)) * h(τ) (0)
Ignoring the effect of the window, the spectrum of this signal for frequency ω on a frame around time τ will be
S_o(ω) = [S(ω) + N(ω)] H(ω) (0)
Note that to simplify the notation, τ has been omitted from the above equation. The power spectrum for the frame is
|S_o(ω)|² = |S(ω) + N(ω)|² |H(ω)|²
and the logarithm of the amplitude becomes
log|S_o(ω)| = log|S(ω) + N(ω)| + log|H(ω)| (0)
The term |S(ω) + N(ω)|² can be expanded as
|S(ω) + N(ω)|² = |S(ω)|² + |N(ω)|² + 2|S(ω)||N(ω)| cos θ
wherein θ represents the phase difference between the speech and the noise. The first term on the right hand side of Equation (0) can be shown to be
log|S(ω) + N(ω)| = log|S(ω)| + 0.5 log(1 + |N(ω)|²/|S(ω)|² + 2(|N(ω)|/|S(ω)|) cos θ)
In an example of the invention, the speech signal is represented as a Cepstrum, which is a Discrete Cosine Transform (DCT) of the log-spectrum. Equivalently, the DCT of the log-spectrum is a linear transformation of the log spectral vectors given by Equation (0). In other words,
C_o = T L_o (0)
where T is the DCT transformation matrix and
L_o = [log|S_o(ω_1)|, ..., log|S_o(ω_F)|]^T
where f is the frequency index. From Equations (0) and (0) it follows that
C_o = C_s + 0.5 C_{n/s} + C_h (0)
where
C_s = T L_s and C_h = T L_h
with
L_s = [log|S(ω_1)|, ..., log|S(ω_F)|]^T and L_h = [log|H(ω_1)|, ..., log|H(ω_F)|]^T
and
C_{n/s} = T g, with g_f = log(1 + |N(ω_f)|²/|S(ω_f)|² + 2(|N(ω_f)|/|S(ω_f)|) cos θ_f) (0)
The Taylor series logarithmic expansion is given by
log x = Σ_{k≥1} (−1)^{k+1} (x − 1)^k / k
for |x − 1| < 1, which is a reasonable assumption as long as the signal-to-noise ratio (SNR) is positive. Therefore, each element of C_{n/s} becomes a power series in the noise-to-signal ratio |N(ω_f)|/|S(ω_f)|, which depends in a non-linear way on both the speech signal and the additive noise. Obtaining a clean speaker transform from a speaker transform determined under noisy conditions is therefore non-trivial.
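The additive-plus-convolutional model above can be exercised numerically. The following sketch assumes synthetic white-noise stand-ins for the speech and noise signals and a short hypothetical channel impulse response; it computes the observed log-spectrum and its DCT (the noisy cepstrum C_o) as an illustration of the equations, not as an implementation taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(1024)           # stand-in for one frame of clean speech
n = 0.3 * rng.standard_normal(1024)     # additive noise n(t)
h = np.array([1.0, 0.5, 0.25])          # hypothetical channel impulse response h(t)

# Observed signal s_o(t) = (s(t) + n(t)) * h(t)  (time-domain convolution)
s_o = np.convolve(s + n, h, mode="full")[: len(s)]

# Frequency domain: S_o(w) = [S(w) + N(w)] H(w), then the log-amplitude spectrum
S = np.fft.rfft(s, 1024)
N = np.fft.rfft(n, 1024)
H = np.fft.rfft(h, 1024)                # zero-padded to the same length
log_S_o = np.log(np.abs((S + N) * H) + 1e-12)

# Noisy cepstrum C_o = T L_o, with T a DCT matrix applied to the log-spectrum
F = log_S_o.shape[0]
K = 13                                  # number of cepstral coefficients kept
k = np.arange(K)[:, None]
f = np.arange(F)[None, :]
T = np.sqrt(2.0 / F) * np.cos(np.pi * k * (2 * f + 1) / (2 * F))  # DCT-II rows
c_o = T @ log_S_o
print(c_o[:5])
```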
When adapting an acoustic model to produce speech corresponding to a speaker to which an audio input corresponds, the goal is to transform a canonical model so that it can generate speech with the characteristics of the target speaker. When employing clean speech, the adaptation parameters model the differences between the speakers. However, when the adaptation data is noisy, instead of just modelling the difference between speakers, they must also adjust for the contribution of the C_{n/s} and C_h terms as given in Equation (0). In addition, the precise effect that the noise has depends on the form of the canonical model employed.
The effect of noise on the adaptation parameters for MLLR, CAT and CMLLR is discussed in detail in Appendix 2. The approach of the present example, however, may be applied to whatever type of acoustic model is employed in the system as will now be explained.
Figure 5 shows a flow chart of a process according to an example of the invention in which an acoustic model is adapted to speech obtained in a noisy environment and compensation of the noise is then performed. In Step S301, speech is received. In Step S303, the speaker transforms of the acoustic model are modified until a set of speaker transforms which most closely matches the input speech is determined. This process is entirely equivalent to that of S203 in Figure 4.
In Step S305 features relating to the characteristics of the noise are determined from the input speech. This gives a noise characterization defined by one or more noise parameters. A variety of noise features may be used to calculate the characteristics of noise in the speech sample. Examples include the average spectrum and mel-cepstrum of non-speech segments of the sample or the ratio of noise to speech. In another example of the invention, a segment (e.g. of several seconds) of audio without speech is input and the analyser calculates the noise parameters from that sample without speech.
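As one possible illustration of Step S305, the sketch below derives crude noise parameters from the quietest frames of a sample (an energy-threshold stand-in for non-speech detection) together with a noise-to-speech energy ratio. The function name, frame sizes and quantile threshold are assumptions; a real analyser could equally use a voice-activity detector and a mel filterbank as mentioned above.

```python
import numpy as np

def noise_features(audio, frame_len=400, hop=160, quiet_quantile=0.2):
    """Crude noise characterisation from the quietest frames of a sample.

    Returns the average log-amplitude spectrum of the presumed non-speech
    frames together with a simple noise-to-speech energy ratio."""
    frames = np.array([audio[i:i + frame_len]
                       for i in range(0, len(audio) - frame_len, hop)])
    energy = (frames ** 2).mean(axis=1)
    quiet = energy <= np.quantile(energy, quiet_quantile)   # assumed non-speech
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    avg_noise_logspec = np.log(spec[quiet] + 1e-12).mean(axis=0)
    noise_to_speech = energy[quiet].mean() / (energy[~quiet].mean() + 1e-12)
    return np.concatenate([avg_noise_logspec, [noise_to_speech]])
```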
In Step S307, the speaker transform calculated in step S303 and the noise parameters calculated in Step S305 are employed to determine a “clean” speaker transform, that is, the speaker transform that would have been obtained if the received speech had been obtained under clean conditions. In an example of the invention, this step comprises mapping a “noisy” speaker transform to a “clean” speaker transform. This process will now be described in detail.
Figure 6 is a schematic diagram of the speaker transform space according to an example of the invention. The effect of noise on the speaker transform is indicated by contour lines. The positions of the transforms corresponding to three speakers, Spk1, Spk2 and Spk3, are indicated at four different noise levels, NoiseLevel0, NoiseLevel1, NoiseLevel2 and NoiseLevel3. The speaker transforms at NoiseLevel0 are those obtained under clean conditions. The level of noise increases on going from NoiseLevel0 to NoiseLevel3. Figure 6 shows that as the noise level increases, the difference between the transforms of different speakers decreases. At high enough levels of noise (i.e. the centre of the contour lines), there will be no difference in the speaker transform obtained from samples of speech from different speakers.
In an example of the invention, obtaining a clean speaker transform comprises mapping a speaker transform obtained at one noise level to the speaker transform which would have been obtained at another noise level within the speaker space. This mapping is shown schematically in Figure 7. A speaker transform is obtained from a sample of speech from speaker 3 at NoiseLevel1. This is indicated by the solid triangle 1101 in Figure 7. Using knowledge of the level of noise within the sample, the speaker transform is mapped from NoiseLevel1 to NoiseLevel0, i.e. to position 1103.
In an example of the invention, the mapping from noisy to clean space is performed using an adaptive model trained by a machine learning method, e.g. an artificial deep neural network (DNN). In an example of the invention, a deep neural network is employed to determine a function f such that
f(x_n^spk, y_n) = x_0^spk (0)
where x_n^spk is the transform obtained under noise, y_n is a vector that characterises the noise and x_0^spk is the transform which would be obtained without noise.
Figure 8 shows a schematic of a system for performing the mapping. An analyser 2001 determines the speaker transforms from the noisy data. The analyser also determines the noise features of the noisy data. The analyser calculates the value of the speaker transform in the presence of noise, i.e. the position indicated by 1101 in Figure 7 (the “speaker features”). The analyser also calculates the noise parameters characterizing noise in the speech. The skilled person will appreciate that the calculation of noise may be performed in a number of ways.
Both the speaker transforms and the noise features are input into the neural network 2003. The neural network maps the "noisy" speaker transforms to the clean speaker transforms using Equation (0). The clean speaker transform x_0^spk is then output to the acoustic model.
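A minimal sketch of the mapping network 2003, assuming a single hidden layer and that the speaker transform is supplied as a flat supervector; the class name, layer width and initialisation are hypothetical and stand in for whatever DNN architecture is actually used.

```python
import numpy as np

class TransformCleaner:
    """One-hidden-layer network mapping a noisy speaker-transform supervector
    plus a noise-feature vector to an estimate of the clean supervector."""

    def __init__(self, dim_transform, dim_noise, hidden=256, seed=0):
        rng = np.random.default_rng(seed)
        d_in = dim_transform + dim_noise
        self.W1 = rng.standard_normal((d_in, hidden)) * np.sqrt(2.0 / d_in)
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, dim_transform)) * np.sqrt(2.0 / hidden)
        self.b2 = np.zeros(dim_transform)

    def __call__(self, noisy_transform, noise_feats):
        x = np.concatenate([noisy_transform, noise_feats])
        hidden = np.maximum(0.0, x @ self.W1 + self.b1)   # ReLU hidden layer
        return hidden @ self.W2 + self.b2                 # estimated clean transform
```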
Finally, in step S309 of the method of Figure 5, the clean speaker transform obtained in step S307 is employed to output audio using the process described above in relation to Figure 2.
Figure 9 shows a generalised process for training a neural network for use in an example of the invention. In this process, training data is employed to train the mapping. The training data can be created by adding different types of noise to existing clean data for a number of different speakers.
In Step S501, the acoustic model or models are trained with samples of clean speech for each of a plurality of speakers ("training speakers"). The training of an acoustic model according to an example of the invention is described in Appendix 1.
Subsequently, the set of steps S502 to S504 is performed separately for each of the training speakers, thereby producing respective results for each training speaker.
In Step S502, for a given one of the training speakers, speaker parameters λ^(i) for the clean speech training data for that training speaker are obtained.
In Step S503, noisy speech data is created. In an example of the invention, this is done by adding different types of noise to some existing clean data for the training speaker.
In Step S504, adaptation parameters denoted as λ̃^(i) are obtained for the created noisy data using the process of adaptation described above.
In Step S505, the neural network is initialised. The skilled person will appreciate that a number of standard methods may be employed to initialise the neural network. In an example of the invention, random initialisation is employed.
Steps S506 and S507 are performed using the results obtained in steps S502 and S504 for all the training speakers. That is, the speaker parameters λ^(i) and adaptation parameters λ̃^(i) obtained in steps S502 and S504 for the respective multiple training speakers are combined into a single dataset which is used in steps S506 and S507.
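The assembly of this combined dataset (steps S502 to S504 for every training speaker) might look as follows. The helpers estimate_transform and extract_noise_features are hypothetical stand-ins for the adaptation and noise-analysis steps described above, and the noise samples are assumed to be at least as long as the clean audio.

```python
import numpy as np

def build_training_set(clean_speech_by_speaker, noise_samples,
                       estimate_transform, extract_noise_features):
    """Assemble the parallel dataset of steps S502-S504 for all training speakers.

    estimate_transform and extract_noise_features are hypothetical stand-ins
    for the adaptation (S502/S504) and noise analysis steps described above."""
    noisy_params, noise_feats, clean_params = [], [], []
    for speaker, clean_audio in clean_speech_by_speaker.items():
        lam_clean = estimate_transform(clean_audio)            # S502
        for noise in noise_samples:                            # S503: add noise
            noisy_audio = clean_audio + noise[: len(clean_audio)]
            lam_noisy = estimate_transform(noisy_audio)        # S504
            noisy_params.append(lam_noisy)
            noise_feats.append(extract_noise_features(noisy_audio))
            clean_params.append(lam_clean)
    return np.array(noisy_params), np.array(noise_feats), np.array(clean_params)
```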
In Step S506, the neural network error is calculated. The noisy adaptation parameters λ̃^(i), together with a vector of parameters φ_n which characterize the noise (noise features), are used as input for the DNN. The neural network then outputs "clean" speaker transforms λ̂^(i).
The output of the DNN will be a vector of adaptation parameters λ̂^(i), which should be as close as possible to the parameters λ^(i) that were obtained when using the clean speech. The neural network error is calculated by comparing these two sets of parameters.
In an example of the invention, the neural network error function can be calculated as the difference between the predicted speaker transforms and the target coefficients, i.e. the root mean square error (RMSE):
E_RMSE(ω) = sqrt( (1/I) Σ_i ||λ̂^(i) − λ^(i)||² )
where ω are the parameters of the deep neural network (DNN) and I is the number of training examples. The derivative of E_RMSE(ω) with respect to ω (the "back propagation error") is computed by standard back-propagation and used to update the network.
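A sketch of the RMSE measure described above, assuming the predicted and target transforms are stored one training example per row; the exact normalisation used in the patent is not specified here.

```python
import numpy as np

def rmse_error(predicted, target):
    """Root-mean-square error between the DNN outputs and the transforms
    obtained from clean speech, one training example per row."""
    diff = np.asarray(predicted) - np.asarray(target)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))
```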
In another example of the invention, the error function is given by the negative log-likelihood over the clean training data, computed using the Auxiliary function over the clean data with the state occupancies γ_m(t) obtained by using the clean transform λ^(i). The derivative of the Auxiliary function is given by Equations (0) and (0) in the Appendix for MLLR and CMLLR respectively. In this example of the invention, the back propagation error can be expressed in terms of k and G, the sufficient statistics necessary to estimate the transformation parameters.
In step S507, the neural network is modified using stochastic gradient descent to reduce the error calculated in Step S506 above. In an example of the invention, this comprises varying the parameters ω until the back propagation error is zero. In other words, the DNN is trained, upon receiving noisy speech and parameters characterizing the noise, to produce an estimate of the parameters which would have been obtained by performing step S502 on the noisy speech if the noisy speech had not been corrupted by noise.
The process then moves back to step S506 and repeats until the NN error reaches convergence.
Thus, the parameters of the neural network are updated to learn the mapping from noisy speaker space to clean speaker space.
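Steps S506 and S507 can be illustrated with the following self-contained sketch of stochastic-gradient training of a one-hidden-layer mapping network under a squared-error loss; the network size, learning rate and batch size are arbitrary assumptions, and the loop is a stand-in for whatever optimiser and schedule are actually employed.

```python
import numpy as np

def train_mapping(noisy_params, noise_feats, clean_params,
                  hidden=128, lr=1e-3, epochs=50, batch=32, seed=0):
    """Stochastic-gradient training of a one-hidden-layer mapping network
    under a squared-error loss (a compact stand-in for the DNN of Figure 9)."""
    rng = np.random.default_rng(seed)
    X = np.hstack([noisy_params, noise_feats])
    d_in, d_out = X.shape[1], clean_params.shape[1]
    W1 = rng.standard_normal((d_in, hidden)) * np.sqrt(2.0 / d_in)
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, d_out)) * np.sqrt(2.0 / hidden)
    b2 = np.zeros(d_out)
    for _ in range(epochs):                      # repeat S506/S507 until converged
        order = rng.permutation(len(X))
        for i in range(0, len(X), batch):
            idx = order[i:i + batch]
            x, y = X[idx], clean_params[idx]
            h = np.maximum(0.0, x @ W1 + b1)     # forward pass
            pred = h @ W2 + b2
            err = (pred - y) / len(idx)          # gradient of 0.5 * MSE (S506)
            gW2, gb2 = h.T @ err, err.sum(axis=0)
            dh = (err @ W2.T) * (h > 0)          # back-propagated error
            gW1, gb1 = x.T @ dh, dh.sum(axis=0)
            W1 -= lr * gW1; b1 -= lr * gb1       # gradient descent step (S507)
            W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2
```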
In an example of the invention, a noise feature vector that provides a good estimation of the noise-dependent terms λ_{n/s} and λ_h is determined by the analyser 2001 for input into the DNN 2003.
Methods and systems according to the above examples of the invention can be applied to any adaptable acoustic model. Because the neural network is capable of pattern recognition, the relationship between the speaker transforms of a particular model and noise can be learnt by the network, regardless of the model employed. As shown in
Appendix 2, when an MLLR (or CAT) model is employed, the function f of Equation (0) is given by Equation (0).
Methods and systems according to the examples of the invention described above enable speech to be synthesised in a target voice without the need for high quality, low noise samples of speech in that voice. Instead, samples of the target voice may be provided without the use of a studio or high quality equipment. Speaker similarity is preserved without the use of signal processing to compensate or clean the noise.
While certain arrangements have been described, these arrangements have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made.
Appendix 1
In speech processing systems which are based on Hidden Markov Models (HMMs), the HMM is often expressed as:
M = (A, B, Π) (0)
where A = {a_ij}_{i,j=1}^N is the state transition probability distribution, B = {b_j(o)}_{j=1}^N is the state output probability distribution and Π = {π_i}_{i=1}^N is the initial state probability distribution, and where N is the number of states in the HMM.
In the examples of the invention described below, the state transition probability distribution A and the initial state probability distribution Π are determined in accordance with procedures well known in the art. Therefore, this section will be concerned with the state output probability distribution.
The aim when training a conventional text-to-speech system is to estimate the model parameter set which maximises likelihood for a given observation sequence. The same system as that shown in Figure 1 may be used for training an acoustic model for use in a text-to-speech system. However, when training an acoustic model, it is necessary to have an audio input which matches the text being input via text input 15. Training the model therefore comprises estimating the model parameter set which maximises the likelihood of the audio input.
Generally in text to speech systems the state output vector or speech vector o(t) from an m-th Gaussian component in a model set M is
P(o(t) | m, M) = N(o(t); μ_m, Σ_m) (0)
where μ_m and Σ_m are the mean and covariance of the m-th Gaussian component.
As it is not possible to obtain the above model set based on so called Maximum Likelihood (ML) criteria purely analytically, the problem is conventionally addressed by using an iterative approach known as the expectation maximisation (EM) algorithm which is often referred to as the Baum-Welch algorithm. Here, an auxiliary function (the “Q” function) is derived:
Q(M, M') = Σ_m Σ_t γ_m(t) log p(o(t), m | M) (0)
where γ_m(t) is the posterior probability of component m generating the observation o(t) given the current model parameters M' and M is the new parameter set. After each iteration, the parameter set M' is replaced by the new parameter set M which maximises Q(M, M'). p(o(t), m | M) is a generative model such as a GMM, HMM etc.
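For illustration, the posterior occupancies γ_m(t) and the value of the auxiliary function can be computed for a simple diagonal-covariance Gaussian mixture, used here as a stand-in for the full HMM case; the function and variable names are hypothetical and the auxiliary value is evaluated at the current parameter set (M = M').

```python
import numpy as np

def gmm_posteriors_and_Q(obs, weights, means, variances):
    """Posterior occupancies gamma_m(t) and the EM auxiliary value for a
    diagonal-covariance Gaussian mixture (a stand-in for the full HMM case)."""
    obs = np.atleast_2d(obs)                      # (T, D) observations
    log_joint = np.empty((obs.shape[0], len(weights)))
    for m, (w, mu, var) in enumerate(zip(weights, means, variances)):
        diff = obs - mu
        log_joint[:, m] = (np.log(w) - 0.5 * np.sum(
            np.log(2.0 * np.pi * var) + diff ** 2 / var, axis=1))
    log_norm = np.logaddexp.reduce(log_joint, axis=1)
    gamma = np.exp(log_joint - log_norm[:, None])  # gamma_m(t)
    Q = np.sum(gamma * log_joint)                  # auxiliary value, here with M = M'
    return gamma, Q
```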
The model may incorporate factors of speech data such as speaker, expression, or noise environment, or some other factor of speech data. The model then has a state output vector of:
p(o(t) | m, s, M) = N(o(t); μ̂_m^(s), Σ̂_m^(s)) (0)
where m ∈ {1, ..., MN}, t and s are indices for component, time and target speaker (or expression or some other factor of speech data) respectively, and where MN, T and S are the total number of components, frames and speakers (or expressions, or noise environments, or some other factor of speech data) respectively. In the discussion below, it is assumed the speech factor is speaker.
The exact form of the adapted mean μ̂_m^(s) and covariance Σ̂_m^(s) will depend on any speaker dependent transforms that are applied. Three approaches to such speaker dependent transformations are discussed below. However, others are also possible.
MLLR
When employing a Maximum-likelihood linear regression (MLLR) transform, the goal is to find an affine transformation of the canonical parameters that maximises the probability with respect to the adaptation data of a target speaker. This transformation is independent for the mean and the covariances of the model. Considering only the effect on the mean, the MLLR mean transformation can be written as
μ̂_m^(s) = A_{q(m)} μ_m + b_{q(m)} (0)
where {A_{q(m)}, b_{q(m)}} represents the affine transformation which is applied to the mean vector of the Gaussian component m for regression class q(m). This can be expressed in a more compact form as
μ̂_m^(s) = M_{q(m)} ξ_m
where M_{q(m)} = [b_{q(m)}, A_{q(m)}] and ξ_m = [1, μ_m^T]^T is the extended mean vector. Each dimension d of μ̂_m^(s) can also be represented as
μ̂_{m,d}^(s) = ξ_m^T m_{q(m),d}
where m_{q(m),d} is the d-th row of the matrix M_{q(m)}, which can also be expressed as
m_{q(m),d} = K_{q(m),d} λ^(s)
where λ^(s) is a supervector with all the adaptation parameters and K_{q(m),d} is a selection matrix for the Gaussian m and dimension d.
Now, Equation (0) can be rewritten as
μ̂_m^(s) = H(m) λ^(s)
where H(m) is a matrix whose rows are defined as
ξ_m^T K_{q(m),d}
The adaptation coefficients are obtained by maximising the auxiliary function
Q(λ^(s)) = −0.5 Σ_m Σ_t γ_m(t) (o(t) − μ̂_m^(s))^T Σ_{v(m)}^{−1} (o(t) − μ̂_m^(s)) + const (0)
where Σ_{v(m)} is the covariance of component m, and o is an observation vector which is usually built by joining the static and dynamic spectral coefficients, e.g., o(t) = [c(t)^T, Δc(t)^T, Δ²c(t)^T]^T. The Δ and Δ² operators are usually linear and can be summarized as o = Wc, with W a fixed matrix.
The derivative of (0) with respect to λ^(s) is
∂Q/∂λ^(s) = k − G λ^(s)
where
k = Σ_m Σ_t γ_m(t) H(m)^T Σ_{v(m)}^{−1} o(t) and G = Σ_m Σ_t γ_m(t) H(m)^T Σ_{v(m)}^{−1} H(m)
which, equating Equation (0) to zero, yields the solution
λ^(s) = G^{−1} k
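A sketch of this estimation for a single mean transform with diagonal covariances, assuming the occupancies γ_m(t) have already been computed; it accumulates per-dimension statistics G_d and k_d and solves each row, which is one standard way of realising the solution above under a single global regression class.

```python
import numpy as np

def estimate_mllr_mean_transform(obs, gamma, means, variances):
    """ML estimate of a single (global regression class) MLLR mean transform,
    solved row by row: for each dimension d, G_d w_d = k_d.

    obs: (T, D) adaptation observations; gamma: (T, M) occupancies gamma_m(t);
    means, variances: (M, D) diagonal-covariance component parameters.
    Returns W of shape (D, D + 1) so that the adapted mean is W @ [1, mu]."""
    T_frames, D = obs.shape
    M = means.shape[0]
    xi = np.hstack([np.ones((M, 1)), means])       # extended mean vectors [1, mu]
    W = np.zeros((D, D + 1))
    for d in range(D):
        G = np.zeros((D + 1, D + 1))
        k = np.zeros(D + 1)
        for m in range(M):
            occ = gamma[:, m]
            inv_var = 1.0 / variances[m, d]
            G += occ.sum() * inv_var * np.outer(xi[m], xi[m])
            k += inv_var * (occ @ obs[:, d]) * xi[m]
        W[d] = np.linalg.solve(G, k)
    return W
```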
CAT
Cluster adaptive training (CAT) is a special case of MLLR in which different speakers are accommodated by applying weights to model parameters which have been arranged into clusters. In this model the mean value is defined as
μ_m^(s) = Σ_{p=1}^{P} λ_p^(s) μ_{m,p}
where P is the total number of clusters.
Consequently, all the Equations described above in relation to MLLR apply equally to CAT, with M_{q(m)} substituted by the matrix whose columns are the cluster mean vectors μ_{m,1}, ..., μ_{m,P} of component m.
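The CAT adapted mean for a component is then simply a weighted combination of its cluster means; a minimal sketch, with hypothetical variable names:

```python
import numpy as np

def cat_adapted_mean(cluster_means, weights):
    """CAT mean for one component: a weighted sum of its P cluster mean
    vectors, with weights lambda_p estimated for the target speaker."""
    return np.asarray(weights) @ np.asarray(cluster_means)   # (P,) @ (P, D) -> (D,)
```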
CMLLR
When using a constrained maximum-likelihood linear regression (CMLLR) transform, the transformation is applied in the feature space. The goal of the adaptation is to find an affine transformation of the observed feature vector that maximizes the probability with respect to a model trained on a canonical space. The state output vector is given by:
p(o(t) | m, M) = |A_{q(m)}| N(ô(t)_{q(m)}; μ_m, Σ_m)
Similarly to MLLR, the CMLLR transform can be expressed as
ô(t)_{q(m)} = A_{q(m)} o(t) + b_{q(m)}
with A_{q(m)} and b_{q(m)} the rotation and bias of the feature-space transform for regression class q(m).
As in Equation (0), each dimension of ô(t)_{q(m)} can be expressed as
ô(t)_{q(m),d} = ζ(t)^T K_{q(m),d} λ^(s)
where ζ(t) = [1, o(t)^T]^T is the extended observation vector. Therefore, as with MLLR this can be written as
ô(t)_{q(m)} = O_{q(m)}(t) λ^(s)
where λ^(s) is the supervector with all the CMLLR coefficients and O_{q(m)}(t) is a matrix whose rows are given by
ζ(t)^T K_{q(m),d}
with K_{q(m),d} the selection matrix for CMLLR.
The derivative of Q(λ^(s); λ̂^(s)) with respect to λ^(s) can again be expressed in terms of sufficient statistics k and G, where these statistics are now accumulated from the observed feature vectors rather than the model means, and equating the derivative to zero yields the corresponding solution for the CMLLR supervector λ^(s).
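Applying an estimated CMLLR transform at run time reduces to an affine map of each observation vector; a minimal sketch, assuming a single regression class:

```python
import numpy as np

def apply_cmllr(obs, A, b):
    """Apply a CMLLR feature-space transform to a sequence of observation
    vectors: o_hat(t) = A o(t) + b, so that a single canonical model can
    score speech from the target speaker."""
    return np.asarray(obs) @ np.asarray(A).T + np.asarray(b)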
Appendix 2
The effect of noise on transforms according to the acoustic models discussed in Appendix 1 will now be described.
In an example of the invention, the acoustic model employed is a Maximum-likelihood Linear Regression model (MLLR). For MLLR, it follows from Equations (0) and (0) (see Appendix 1) that
o = o_s + 0.5 o_{n/s} + o_h (0)
Substituting in (0) gives the auxiliary function evaluated on the noisy observations, and consequently the transform estimated under noise takes the form
λ̃^(s) = λ^(s) + λ_{n/s} + λ_h
where λ_{n/s} and λ_h are bias terms determined by the additive-noise term 0.5 o_{n/s} and the channel term o_h respectively.
In other words, the effect of additive and convolutional noise in an MLLR transform is to add a bias to that transform, which depends in a non-linear way on the noise-to-signal ratio.
From the above, it can be seen that in the case of MLLR, the function f in Equation (0) is therefore given (from Equation (0)) as the removal of this noise-dependent bias, i.e.
x_0^spk = x_n^spk − (λ_{n/s} + λ_h) (0)
In another example of the invention, a CAT acoustic model is employed. The main difference between CAT and general MLLR is that due to its simplicity, CAT models trained on clean speech are strictly constrained to a space of clean speech. As a result, if o_{n/s} or o_h are orthogonal to clean speech, their projection into that space will be zero.
In yet another example of the invention, a CMLLR acoustic model is employed. In CMLLR, it follows from Equations (0) and (0) that the noisy terms now appear in both the numerator and the denominator of Equation (0). The relationship between the noisy and the clean transform is therefore much more complex than for MLLR.

Claims (15)

Claims
1. A method of adapting an acoustic model relating acoustic units to speech vectors, wherein said acoustic model comprises a set of speech factor parameters related to a given speech factor and which enable the acoustic model to output speech vectors with different values of the speech factor, the method comprising: inputting a sample of speech with a first value of the speech factor; determining values of the set of speech factor parameters which enable the acoustic model to output speech with said first value of the speech factor; and employing said determined values of the set of speech factor parameters in said acoustic model, wherein said sample of speech is corrupted by noise, and wherein said step of determining the values of the set of speech factor parameters comprises: (i) obtaining noise characterization parameters characterising the noise; (ii) performing a speech factor parameter generation algorithm on the sample of speech, thereby generating corrupted values of the set of speech factor parameters; (iii) using the noise characterization parameters to map said corrupted values of the set of speech factor parameters to clean values of the set of speech factor parameters, wherein the clean values of the set of speech factor parameters are estimates of the speech factor parameters which would be obtained by performing the speech factor parameter generation algorithm on the sample of speech if the sample of speech were not corrupted by the noise; and (iv) employing said clean values of the set of speech factor parameters as said determined values of the set of speech factor parameters.
2. The method of claim 1, wherein the speech factor is speaker voice.
3. The method of claim 1 or claim 2, wherein said mapping is performed by an adaptive model trained using a machine learning algorithm.
4. The method of claim 3 further wherein said adaptive model was trained by iteratively adjusting weight values of the deep neural network to minimise a measure of the difference between (a) the values of the set of speech factor parameters without noise corruption, and (b) the output of the deep neural network when the set of speech factor parameters corrupted by noise and the information regarding the noise corruption are input to the neural network.
5. The method of claim 3 or 4, wherein the adaptive model is a deep neural network.
6. The method of any preceding claim, wherein said speech factor parameters comprise parameters of Maximum Likelihood Linear Regression transforms.
7. The method of any of claims 1 to 5, wherein said speech factor parameters comprise Cluster Adaptive Training weights.
8. The method of any preceding claim, wherein said speech factor parameters comprise parameters of Constrained Maximum Likelihood Linear Regression transforms.
9. A method of text to speech synthesis, comprising: inputting text; dividing said input text into a sequence of acoustic units; adapting an acoustic model according to the method of any of claims 1-8, converting said sequence of acoustic units to a sequence of speech vectors using said adapted acoustic model; and outputting said sequence of speech vectors as audio with said first value of the speech factor.
10. A method of training a deep neural network for use in the method of claim 5, the method comprising: (i) receiving training data, said training data comprising: values of a set of speech factor parameters corrupted by noise, values of the same set of speech factor parameters without noise corruption, and data values characterizing the noise; (ii) training the deep neural network by iteratively adjusting weight values of the deep neural network to minimise a measure of the difference between (a) the values of the set of speech factor parameters without noise corruption, and (b) the output of the deep neural network when the set of speech factor parameters corrupted by noise and the data values characterizing the noise are input to the deep neural network.
11. A method according to claim 10 in which said measure is inversely related to the log likelihood of the set of speech factor parameters without noise corruption being obtained as the output of the deep neural network.
12. A method according to claim 10 in which said measure is a root mean squared difference between the set of speech factor parameters without noise corruption and the output of the deep neural network.
13. Apparatus for adapting an acoustic model relating acoustic units to speech vectors, wherein said acoustic model comprises a set of speech factor parameters related to a given speech factor and which enable the acoustic model to output speech vectors with different values of the speech factor, the system comprising: a receiver for receiving a sample of speech with a first value of the speech factor; and a processor configured to: (a) determine values of the set of speech factor parameters which enable the acoustic model to output speech with said first value of the speech factor; and (b) employ said determined values of the set of speech factor parameters in said acoustic model, wherein said sample of speech is corrupted by noise, and wherein said determination of the values of the set of speech factor parameters comprises: (i) obtaining noise characterization parameters characterising the noise with which the sample of speech is corrupted; (ii) performing a speech factor parameter generation algorithm on the sample of speech, thereby generating corrupted values of the set of speech factor parameters; (iii) using the noise characterisation parameters to map said corrupted values of the set of speech factor parameters to clean values of the set of speech factor parameters, wherein the clean values of the set of speech factor parameters are estimates of the speech factor parameters which would be obtained by performing the speech factor parameter generation algorithm on the sample of speech if the sample of speech were not corrupted by the noise.
14. A text to speech apparatus, said apparatus comprising: a receiver for receiving input text; the apparatus of claim 13; a processor configured to: (i) divide said input text into a sequence of acoustic units; and (ii) convert said sequence of acoustic units to a sequence of speech vectors using said adapted acoustic model; and an output configured to output said sequence of speech vectors as audio with said first value of the speech factor.
15. A carrier medium comprising computer readable code configured when performed by a processor of a computer to cause the processor to perform the method of any of claims 1 to 12.
GB1601828.5A 2016-02-02 2016-02-02 Noise compensation in speaker-adaptive systems Expired - Fee Related GB2546981B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1601828.5A GB2546981B (en) 2016-02-02 2016-02-02 Noise compensation in speaker-adaptive systems
JP2017017624A JP6402211B2 (en) 2016-02-02 2017-02-02 Noise compensation in speaker adaptation systems.
US15/422,583 US10373604B2 (en) 2016-02-02 2017-02-02 Noise compensation in speaker-adaptive systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1601828.5A GB2546981B (en) 2016-02-02 2016-02-02 Noise compensation in speaker-adaptive systems

Publications (3)

Publication Number Publication Date
GB201601828D0 GB201601828D0 (en) 2016-03-16
GB2546981A true GB2546981A (en) 2017-08-09
GB2546981B GB2546981B (en) 2019-06-19

Family

ID=55590526

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1601828.5A Expired - Fee Related GB2546981B (en) 2016-02-02 2016-02-02 Noise compensation in speaker-adaptive systems

Country Status (3)

Country Link
US (1) US10373604B2 (en)
JP (1) JP6402211B2 (en)
GB (1) GB2546981B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11842722B2 (en) 2020-07-21 2023-12-12 Ai Speech Co., Ltd. Speech synthesis method and system

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657961B (en) * 2017-09-25 2020-09-25 四川长虹电器股份有限公司 Noise elimination method based on VAD and ANN
US20200279549A1 (en) * 2019-02-28 2020-09-03 Starkey Laboratories, Inc. Voice cloning for hearing device
JP6993376B2 (en) * 2019-03-27 2022-01-13 Kddi株式会社 Speech synthesizer, method and program
KR102294638B1 (en) * 2019-04-01 2021-08-27 한양대학교 산학협력단 Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
KR20190092326A (en) * 2019-07-18 2019-08-07 엘지전자 주식회사 Speech providing method and intelligent computing device controlling speech providing apparatus
EP3809410A1 (en) * 2019-10-17 2021-04-21 Tata Consultancy Services Limited System and method for reducing noise components in a live audio stream
CN113409809B (en) * 2021-07-07 2023-04-07 上海新氦类脑智能科技有限公司 Voice noise reduction method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446038B1 (en) * 1996-04-01 2002-09-03 Qwest Communications International, Inc. Method and system for objectively evaluating speech
EP1536414A2 (en) * 2003-11-26 2005-06-01 Microsoft Corporation Method and apparatus for multi-sensory speech enhancement

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089182B2 (en) * 2000-04-18 2006-08-08 Matsushita Electric Industrial Co., Ltd. Method and apparatus for feature domain joint channel and additive noise compensation
EP1197949B1 (en) * 2000-10-10 2004-01-07 Sony International (Europe) GmbH Avoiding online speaker over-adaptation in speech recognition
JP2002123285A (en) * 2000-10-13 2002-04-26 Sony Corp Speaker adaptation apparatus and speaker adaptation method, recording medium and speech recognizing device
JP2002244689A (en) * 2001-02-22 2002-08-30 Rikogaku Shinkokai Synthesizing method for averaged voice and method for synthesizing arbitrary-speaker's voice from averaged voice
CN1453767A (en) * 2002-04-26 2003-11-05 日本先锋公司 Speech recognition apparatus and speech recognition method
JP4033299B2 (en) * 2003-03-12 2008-01-16 株式会社エヌ・ティ・ティ・ドコモ Noise model noise adaptation system, noise adaptation method, and speech recognition noise adaptation program
JP2010078650A (en) * 2008-09-24 2010-04-08 Toshiba Corp Speech recognizer and method thereof
US8700394B2 (en) * 2010-03-24 2014-04-15 Microsoft Corporation Acoustic model adaptation using splines
GB2482874B (en) * 2010-08-16 2013-06-12 Toshiba Res Europ Ltd A speech processing system and method
JP5949553B2 (en) * 2010-11-11 2016-07-06 日本電気株式会社 Speech recognition apparatus, speech recognition method, and speech recognition program
KR20120054845A (en) * 2010-11-22 2012-05-31 삼성전자주식회사 Speech recognition method for robot
US8738376B1 (en) * 2011-10-28 2014-05-27 Nuance Communications, Inc. Sparse maximum a posteriori (MAP) adaptation
GB2501067B (en) * 2012-03-30 2014-12-03 Toshiba Kk A text to speech system
JP6266372B2 (en) * 2014-02-10 2018-01-24 株式会社東芝 Speech synthesis dictionary generation apparatus, speech synthesis dictionary generation method, and program
US20160055846A1 (en) * 2014-08-21 2016-02-25 Electronics And Telecommunications Research Institute Method and apparatus for speech recognition using uncertainty in noisy environment
US9721559B2 (en) * 2015-04-17 2017-08-01 International Business Machines Corporation Data augmentation method based on stochastic feature mapping for automatic speech recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446038B1 (en) * 1996-04-01 2002-09-03 Qwest Communications International, Inc. Method and system for objectively evaluating speech
EP1536414A2 (en) * 2003-11-26 2005-06-01 Microsoft Corporation Method and apparatus for multi-sensory speech enhancement

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11842722B2 (en) 2020-07-21 2023-12-12 Ai Speech Co., Ltd. Speech synthesis method and system

Also Published As

Publication number Publication date
US10373604B2 (en) 2019-08-06
GB2546981B (en) 2019-06-19
JP2017138596A (en) 2017-08-10
JP6402211B2 (en) 2018-10-10
US20170221479A1 (en) 2017-08-03
GB201601828D0 (en) 2016-03-16

Similar Documents

Publication Publication Date Title
GB2546981B (en) Noise compensation in speaker-adaptive systems
Chou et al. One-shot voice conversion by separating speaker and content representations with instance normalization
JP5242724B2 (en) Speech processor, speech processing method, and speech processor learning method
EP1995723B1 (en) Neuroevolution training system
JP5242782B2 (en) Speech recognition method
JP5717097B2 (en) Hidden Markov model learning device and speech synthesizer for speech synthesis
JPH0850499A (en) Signal identification method
US7552049B2 (en) Noise adaptation system of speech model, noise adaptation method, and noise adaptation program for speech recognition
KR101026632B1 (en) Method and apparatus for formant tracking using a residual model
Elshamy et al. DNN-supported speech enhancement with cepstral estimation of both excitation and envelope
JP5670298B2 (en) Noise suppression device, method and program
Giacobello et al. Stable 1-norm error minimization based linear predictors for speech modeling
Liu Environmental adaptation for robust speech recognition
JP5740362B2 (en) Noise suppression apparatus, method, and program
Bawa et al. Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions
Sarikaya Robust and efficient techniques for speech recognition in noise
Shahnawazuddin et al. A fast adaptation approach for enhanced automatic recognition of children’s speech with mismatched acoustic models
Sathiarekha et al. A survey on the evolution of various voice conversion techniques
JP6000094B2 (en) Speaker adaptation device, speaker adaptation method, and program
JP7333878B2 (en) SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND SIGNAL PROCESSING PROGRAM
Han et al. Switching linear dynamic transducer for stereo data based speech feature mapping
JP6376486B2 (en) Acoustic model generation apparatus, acoustic model generation method, and program
KR20120040649A (en) Front-end processor for speech recognition, and apparatus and method of speech recognition using the same
JP5885686B2 (en) Acoustic model adaptation apparatus, acoustic model adaptation method, and program
BabaAli et al. A model distance maximizing framework for speech recognizer-based speech enhancement

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230202