GB2582572A - A speech processing system and method - Google Patents

A speech processing system and method

Info

Publication number
GB2582572A
Authority
GB
United Kingdom
Prior art keywords
fmllr
transform
speaker
model
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1904100.3A
Other versions
GB201904100D0 (en)
GB2582572B (en)
Inventor
Doddipatla Rama
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to GB1904100.3A priority Critical patent/GB2582572B/en
Publication of GB201904100D0 publication Critical patent/GB201904100D0/en
Publication of GB2582572A publication Critical patent/GB2582572A/en
Application granted granted Critical
Publication of GB2582572B publication Critical patent/GB2582572B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 — Adaptation
    • G10L 15/07 — Adaptation to the speaker
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)

Abstract

A method of speech processing extracts filter bank coefficients "FBANK" from input speech frames, applies a Discrete Cosine Transform (DCT) to produce Mel frequency cepstral coefficients (MFCCs), applies a feature space maximum likelihood linear regression (FMLLR) transform and a correlating transform (e.g. an Inverse Discrete Cosine/Fourier Transform) to the speaker-adapted MFCCs, and inputs these to a Deep Neural Network (DNN) acoustic model having at least one convolutional layer which outputs text. Vocal Tract Length Normalisation may be applied in addition to FMLLR, which may be estimated by a trained Hidden Markov Model (HMM) Gaussian Mixture Model (GMM).

Description

A Speech Processing System and Method
FIELD
Embodiments described herein relate to a speech processing system and method.
BACKGROUND
Normalising speaker variability is known to improve the performance of automatic speech recognition (ASR). Approaches like vocal tract length normalisation (VTLN) and transforming features with feature space maximum likelihood linear regression (FMLLR) have been shown to improve the ASR performance in deep neural network (DNN) acoustic models (AM).
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 is a schematic of a system in accordance with an embodiment.
Figure 2 is a flow chart showing a method of speech recognition using the FMLLR transform derived in accordance with the method of figure 3 (below);
Figure 3 is a flowchart showing a method of estimating an FMLLR transform in accordance with an embodiment, where the FMLLR transform is to be applied to correlated features;
Figure 4 is a flow chart showing a method of speech recognition using the FMLLR transform derived in accordance with the method of figure 5 (below);
Figure 5 is a flowchart showing a method of estimating an FMLLR transform in accordance with an embodiment where the FMLLR transform is to be applied directly to the FBANK features;
Figure 6 is a diagram showing a CNN that can be used as an acoustic model in accordance with an embodiment;
Figure 7 is a flowchart showing a variation of the method of figure 2 in combination with VTLN;
Figure 8 is a flowchart showing how the VTLN warping and FMLLR used in the method of figure 7 are trained;
Figure 9 is a flow chart showing how the VTLN warping and FMLLR transform are estimated for a test speaker;
Figure 10 is a flow chart showing a variation on the method of figure 2 in combination with layer adaptation;
Figure 11 is a flow chart showing how the layer adaptation and FMLLR used in the method of figure 10 are trained;
Figure 12 is a flow chart showing how the layer adaptation and FMLLR transform are estimated for a test speaker; and
Figure 13 is a schematic showing an acoustic model that is a combination of a CNN and a DNN.
DETAILED DESCRIPTION
In an embodiment, a method of speech processing is provided comprising: receiving input speech from a speaker; dividing said speech into frames; extracting filter bank coefficients "FBANK" from said frames; applying a discrete cosine transform to the extracted FBANK coefficients to produce Mel frequency cepstral coefficients "MFCCs"; applying a feature space maximum likelihood linear regression "FMLLR" transform to the MFCCs; applying a correlating transform having a closed form solution to the speaker adapted transformed MFCCs to form speaker adapted correlated coefficients; inputting said speaker adapted correlated coefficients into an acoustic model, said acoustic model being adapted to convert MFCC coefficients to which a correlating transform has been applied into text, wherein said acoustic model comprises a neural network having at least one convolutional layer; and outputting said text. The closed form solution allows a tractable method of implementing a correlating transform. Examples of transforms that can provide a closed form solution are an inverse discrete cosine transform and an inverse discrete Fourier transform.
In a further embodiment, a method of speech processing is provided comprising: receiving input speech from a speaker; dividing said speech into frames; extracting filter bank "FBANK" coefficients from said frames; applying a feature space maximum likelihood linear regression "FMLLR" transform to the FBANK coefficients to obtain speaker adapted FBANK coefficients; inputting said speaker adapted FBANK coefficients into an acoustic model, said acoustic model being adapted to convert FBANK coefficients into text, wherein said acoustic model comprises a neural network having at least one convolutional layer; and outputting said text. The above embodiments describe two approaches to perform feature space maximum likelihood linear regression (FMLLR) based speaker adaptation in convolutional neural networks (CNN) for automatic speech recognition (ASR). The above methods provide speaker adaptation on correlated features for use with a CNN.
In the above embodiments, correlations between the input features are preserved for use with a CNN acoustic model. In one of the embodiments, an inverse discrete cosine transformation (IDCT) is applied on the FMLLR transformed MFCC features, while the other embodiment applies FMLLR directly on the log Mel filter-bank (FBANK) features.
In an embodiment, the FMLLR is estimated as a full covariance transformation.
The above embodiments allow a CNN to be used with relatively few layers and this improves computation time and hence the speed of recognition. In an embodiment, fewer than 10 CNN layers can be used, in other embodiments 5 or fewer. In further embodiments, just a single CNN layer can be used. The embodiments described herein focus on the front-end processing to perform speaker adaptation and thus the improvement to ASR is independent of how many CNN layers there are in the network. In one embodiment, the CNN layers are divided into blocks with each block comprising two CNN layers followed by a max pooling layer. In a further embodiment, each block might have a single CNN layer. There is no restriction on how many CNN layers there should be in a block.
The embodiments described herein perform speaker adaptation which can be used to reduce the complexity of network architecture. The use of 10 CNN layers was sufficient to reach the performance of top performing systems.
Thus the embodiments described herein achieve a good performance without a large number of CNN layers.
In a further embodiment, vocal tract length normalisation (VTLN) and layer adaptation (LA) are used in combination with the above.
Convolutional neural networks (CNN) perform convolutions on the input features and are known to be robust to variabilities in the input features. They model local correlations in time and frequency and might implicitly normalise speaker variability. In an embodiment, the need for speaker adaptation should be minimised.
CNNs operate on the input features. Log Mel filterbank (FBANK) features are correlated whereas MFCC features are decorrelated.
FMLLR transformed FBANK features can be determined by: projecting the features with linear discriminant analysis (LDA) followed by maximum likelihood linear transformation (MLLT) and then transforming the features with FMLLR.
FMLLR transforms can also be estimated in the decorrelated feature space by projecting the FBANK features using semi-tied covariance (STC) transforms and further projecting the FMLLR transformed features back to correlated space using inverse STC (ISTC).
In an embodiment, estimating said FMLLR transform comprises estimating said FMLLR transform in an unsupervised manner from a trained Hidden Markov Model-Gaussian mixture model "HMM-GMM".
In a further embodiment, estimating said FMLLR transform comprises estimating said FMLLR transform on said FBANK features by projecting the features with linear discriminant analysis followed by maximum likelihood linear transformation and then transforming the features with FMLLR.
In a yet further embodiment, a method of estimating a speaker adaptation transform is provided for a first acoustic model, the method comprising: receiving input speech from a speaker; and estimating an FMLLR transform for the speaker for the FBANK coefficients of a second acoustic model, wherein the first acoustic model comprises a neural network having at least one convolutional layer, the second acoustic model being a Hidden Markov Model-Gaussian mixture model "HMM-GMM" model, the first and second models being speaker independent. Estimating the FMLLR transform may comprise estimating said FMLLR transform in an unsupervised manner from a trained Hidden Markov Model-Gaussian mixture model "HMM-GMM".
In a yet further embodiment, a method of estimating a speaker adaptation transform is provided for a first acoustic model, the method comprising: receiving input speech from a speaker; estimating an FMLLR transform for the speaker for the MFCC coefficients of a second acoustic model, wherein the first acoustic model comprises a neural network having at least one convolutional layer, the second acoustic model being a hidden markov model-Gaussian mixture model "HMM-GMM" model, the first and second models being speaker independent, the first acoustic model being trained on MFCC features that have been subjected to a correlating transform having a closed form solution.
Estimating said FMLLR transform may comprise estimating said FMLLR transform on said FBANK features by projecting the features with linear discriminant analysis followed by maximum likelihood linear transformation and then transforming the features with FMLLR.
In a further embodiment, a method of speech processing is provided, the method comprising: receiving input speech from a speaker; dividing said speech into frames; extracting speech coefficients from said frames; applying a feature space maximum likelihood linear regression "FMLLR" transform to the speech coefficients, the speech coefficients being selected from filter bank "FBANK" and Mel frequency cepstral "MFCC" coefficients; inputting FBANK coefficients and FMLLR transformed speech coefficients into an acoustic model, wherein the acoustic model comprises a first branch, a second branch and a common branch, wherein the first and second branches run parallel to one another and meet to form said common branch, said first branch comprising a neural network having at least one convolutional layer and said second branch comprising neural network layers, wherein the FBANK coefficients are inputted to the first branch and the FMLLR transformed speech coefficients are inputted into the second branch; and outputting text from said common branch.
The second branch may comprise a Deep Neural Network. In a further embodiment, a fully connected network is provided for the second branch.
In a further embodiment, a system for speech processing is provided comprising: a receiver for receiving input speech from a speaker; a memory; and a processor adapted to: divide said speech into frames; extract filter bank coefficients "FBANK" from said frames; apply a discrete cosine transform to the extracted FBANK coefficients to produce Mel frequency cepstral coefficients "MFCCs"; apply a feature space maximum likelihood linear regression "FMLLR" transform to the MFCCs; apply a correlating transform having a closed form solution to the speaker adapted transformed MFCCs to form speaker adapted correlated coefficients; and input said speaker adapted correlated coefficients into an acoustic model retrieved from said memory, said acoustic model being adapted to convert MFCC coefficients to which a correlating transform has been applied into text, wherein said acoustic model comprises a neural network having at least one convolutional layer, the system further comprising an output for outputting said text. In a further embodiment, a system for speech processing is provided comprising: a receiver for receiving input speech from a speaker; a memory; and a processor adapted to: divide said speech into frames; extract filter bank coefficients "FBANK" from said frames; apply a feature space maximum likelihood linear regression "FMLLR" transform to the FBANK coefficients to obtain speaker adapted FBANK coefficients; and input said speaker adapted FBANK coefficients into an acoustic model retrieved from said memory, said acoustic model being adapted to convert FBANK coefficients into text, wherein said acoustic model comprises a neural network having at least one convolutional layer, the system further comprising an output for outputting said text.
In a further embodiment, a system for estimating a speaker adaptation transform for a first acoustic model is provided, the system comprising: a receiver adapted to receive input speech from a speaker; and a processor, said processor being adapted to: estimate an FMLLR transform for the speaker for the FBANK coefficients of a second acoustic model, wherein the first acoustic model comprises a neural network having at least one convolutional layer, the second acoustic model being a Hidden Markov Model-Gaussian mixture model "HMM-GMM" model, the first and second models being speaker independent.
In a further embodiment, a system for estimating a speaker adaptation transform for a first acoustic model is provided, the system comprising: a receiver adapted to receive input speech from a speaker; and a processor, said processor being adapted to: estimate an FMLLR transform for the speaker for the MFCC coefficients of a second acoustic model, wherein the first acoustic model comprises a neural network having at least one convolutional layer, the second acoustic model being a hidden markov model-Gaussian mixture model "HMM-GMM" model, the first and second models being speaker independent, the first acoustic model being trained on MFCC features that have been subjected to a correlating transform having a closed form solution.
In a further embodiment, a system for speech processing is provided comprising: a receiver for receiving input speech from a speaker; and a processor adapted to: divide said speech into frames; extract speech coefficients from said frames; apply a feature space maximum likelihood linear regression "FMLLR" transform to the speech coefficients, the speech coefficients being selected from filter bank "FBANK" and Mel frequency cepstral "MFCC" coefficients; input FBANK coefficients and FMLLR transformed speech coefficients into an acoustic model, wherein the acoustic model comprises a first branch, a second branch and a common branch, wherein the first and second branches run parallel to one another and meet to form said common branch, said first branch comprising a neural network having at least one convolutional layer and said second branch comprising neural network layers, wherein the FBANK coefficients are inputted to the first branch and the FMLLR transformed speech coefficients are inputted into the second branch, the system further comprising an output for outputting text.
Embodiments discussed herein derive speaker dependent features that can be presented as input to CNN AMs. In this context, two approaches are explored:
Approach 1 (referred to as IDCT) projects the FMLLR transformed MFCC using an inverse discrete cosine transform (IDCT).
Approach 2 (referred to as DIRECT) estimates the FMLLR transforms on the log Mel filter-bank (FBANK) features directly.
In an embodiment, FMLLR is estimated as a full covariance transformation for CNN AMs. In yet further embodiments, the above two approaches are combined with VTLN and layer adaptation (LA). A 10 layer CNN and a baseline LM may be used.
Feature space maximum likelihood linear regression (FMLLR) is a Gaussian mixture model-hidden Markov model (GMM-HMM) based approach to perform speaker adaptation.
FMLLR estimates a transformation on the means and covariances of the GMM model using data from a specific speaker. The transformation is given by:

μ̂ = Aμ + b,   Σ̂ = AΣAᵀ   (1)

where μ and Σ are the mean and covariance of the GMM model and A is the FMLLR transform. The same matrix is used to transform both means and covariances, and the transformation is also known as the constrained MLLR (CMLLR) transformation. Equivalently, the inverse transformation can be applied on the features (X) to perform a similar normalisation and is given by:

L(X; μ, Σ, A) = L(A⁻¹X; μ, Σ) + log(|A⁻¹|)   (2)

The FMLLR transform can be estimated in both diagonal and full modes. The diagonal mode estimates a diagonal matrix to transform the mean and covariances of the GMM model, while the full mode estimates a full matrix to transform the mean and covariances of the GMM model.
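By way of illustration only, the following numpy sketch (not part of the patent text) shows how a per-speaker matrix B = A⁻¹, estimated as in equation (2), could be applied to a matrix of feature vectors with one frame per row; the optional bias term is an assumption covering the common affine form of the transform.

import numpy as np

def apply_fmllr(features, B, bias=None):
    # Apply an (inverse) FMLLR transform B = A^-1 to features with one
    # frame per row, as in equation (2); bias is an optional affine offset.
    adapted = features @ B.T
    if bias is not None:
        adapted = adapted + bias
    return adapted

# toy usage: 100 frames of 13-dimensional MFCCs and an identity transform
mfcc = np.random.randn(100, 13)
adapted_mfcc = apply_fmllr(mfcc, np.eye(13))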
In an embodiment, decorrelated features such as MFCC are used and, from these, speaker dependent FBANK features are derived. In the MFCC feature extraction pipeline, the relation between the FBANK features (F) and the MFCC features (X) is the DCT transformation (D) and is given by:

X = DF   (3)

Thus, applying an IDCT on the MFCC will return the analysis to the FBANK space. The proposed approach exploits these relations in the feature extraction pipeline and applies an IDCT on the FMLLR transformed MFCC features. Assuming that B (= A⁻¹) represents the FMLLR transform derived on the MFCC features, the speaker dependent FBANK features (F̂) are obtained as follows:

F̂ = D⁻¹BX   (4)

F̂ is used as input for the CNN AMs. The main difference from the STC approach is the use of a DCT instead. Since the DCT is a data independent transform, it makes the adaptation pipeline simpler and avoids the need to estimate a decorrelating transform.
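As a minimal sketch of equations (3) and (4), assuming an orthonormal DCT-II matrix of the same dimension as the features, the speaker adapted correlated features F̂ = D⁻¹BX might be computed as follows; the 64-dimensional toy data and the identity matrix standing in for B are placeholders, not values from the patent.

import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix D such that X = D @ F for a single frame F.
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    D = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2.0 * n))
    D[0, :] = np.sqrt(1.0 / n)
    return D

def idct_adapted_fbank(mfcc_frames, B):
    # Equation (4): F_hat = D^-1 B X, giving speaker adapted correlated
    # (FBANK-like) features for the CNN acoustic model, one frame per row.
    n = mfcc_frames.shape[1]
    D_inv = np.linalg.inv(dct_matrix(n))
    return mfcc_frames @ (D_inv @ B).T

# toy usage with a 64-dimensional identity FMLLR transform
X = np.random.randn(200, 64)              # MFCC frames
F_hat = idct_adapted_fbank(X, np.eye(64))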
In a further embodiment, the FMLLR transform is applied directly on the FBANK features.
Since FBANK features are correlated, training a GMM-HMM model with diagonal mean and covariance might not be a suitable fit. Having noted these limitations, we still train the GMM-HMM model using diagonal mean and covariances, which is further used for estimating FMLLR in the conventional approach without making any changes to the FBANK features.
Figure 1 is a schematic of an automatic speech recognition "ASR" system 1. The ASR system 1 comprises a processor 3 which executes a program 5. ASR system 1 further comprises storage 7. The storage 7 stores data which is used by program 5 to convert speech to text. The ASR system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to a voice input 15. Voice input 15 receives an audio input. The voice input 15 may be for example a microphone. Alternatively, voice input 15 may be a means for receiving speech data from an external storage medium or a network.
Connected to the output module 13 is output for text 17. The text output 17 is used for outputting text converted from a speech signal which is input into voice input 15. The text output 17 may be for example a direct text output e.g. a display or an output for a text data file which may be sent to a storage medium, networked etc. In use, the ASR system 1 receives speech through voice input 15. The program 5 executed on processor 3 converts the speech into text data using data stored in the storage 7. The text is output via the output module 13 to text output 17.
The ASR system 1 can be part of a larger system. For example, it may be included within a mobile telephone, within a dictation system on a laptop, et cetera. The text output may be configured to connect to a further instruction system, for example, to allow the user to either control the functions of their mobile telephone or interact with the functions provided by an app within the telephone.
The system of figure 1 uses speaker adaptation to allow the ASR to be optimised for a speaker. In an embodiment, the program 5 comprises an acoustic model that allows features derived from the input speech to be converted to text. In this embodiment, the acoustic model comprises a neural network. The acoustic model can be trained using data from multiple speakers. However, using speaker adaptation, it is possible to use the model for a speaker who was not used during the training. In an embodiment, this speaker adaptation is performed using a feature space maximum likelihood regression "FMLLR" transform that transforms speech features derived from the speech input via voice input 15 to be more similar to the speech used to train the acoustic model.
Figure 2 is a schematic flow diagram illustrating a method of speech recognition using a method in accordance with an embodiment.
The user inputs speech at step S201. Next, the speech signal is divided into frames, typically 25 msec long with an overlap of 10 msec. Next, a windowing step is performed on the speech signal in step S203. The windowing step applies a window function, for example a Hamming window, to each frame.
After the windowing step S203, a fast Fourier transform "FFT" S205 is taken of each windowed frame and a power spectrum is obtained.
The next step is then a filter bank step S207 where a triangular filter is applied on a Mel scale to the power spectrum to extract frequency bands and compute filter bank coefficients to obtain FBANK coefficients. These FBANK coefficients are correlated.
The FBANK coefficients are then subjected to a discrete cosine transform (DCT) to decorrelate the filter bank coefficients and produce Mel frequency cepstral coefficients "MFCC" in step S211.
In step S211, a feature space maximum likelihood linear regression "FMLLR" transform is applied to the decorrelated MFCC coefficients. FMLLR estimates a single transformation matrix for both means and covariances and is also known in the literature as constrained maximum likelihood linear regression (CMLLR). How the FMLLR transform is estimated for this speaker will be described with reference to the method of figure 3.
Next, in step S213, an inverse discrete cosine transform is applied to the FMLLR transformed MFCC features to transfer them back to a correlated space.
These transformed features that have been returned to the correlated space are then used as the input into an acoustic model for deriving text in step S215. The acoustic model used here is a speaker independent (SI) model, i.e. a general model that is not optimised for the target speaker. The FMLLR transform allows the SI model to be adapted to the target speaker.
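For illustration, a simplified numpy sketch of the front-end of figure 2 (framing, Hamming windowing, FFT power spectrum, Mel filter bank and DCT) is given below; the FFT size, the exact Mel filter construction and the small flooring constant are assumptions and are not taken from the patent.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters on a Mel scale, one filter per row.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def fbank_and_mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_filters=64, n_fft=512):
    # Framing, Hamming window, FFT power spectrum, log Mel filter bank (FBANK)
    # and DCT (MFCC), following the steps of figure 2.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len) + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    fbank = np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    i = np.arange(n_filters)
    D = np.sqrt(2.0 / n_filters) * np.cos(np.pi * np.outer(i, 2 * i + 1) / (2.0 * n_filters))
    D[0, :] = np.sqrt(1.0 / n_filters)          # orthonormal DCT-II decorrelates FBANK
    return fbank, fbank @ D.T                   # (FBANK, MFCC)

# toy usage: one second of random audio at 16 kHz
fbank, mfcc = fbank_and_mfcc(np.random.randn(16000))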
Figure 3 is a flow chart showing schematically how the FMLLR is estimated that is used in the method of figure 2. Although the FMLLR transform is used for transforming features that are to be processed by a neural network based acoustic model, the FMLLR transforms themselves are derived from a Gaussian Mixture model-hidden Markov Model "GMM-HMM" model as FMLLR is a GMM-HMM approach to transform the means and covariances of the model to perform speaker adaptation.
This part of the process can be performed in advance or off-line. Once the FMLLR has been estimated for a speaker (herein referred to as a "target speaker"), the FMLLR transform can be stored for that speaker and re-used when necessary.
In step S121, the input speech is received from a target speaker. As this method is used for deriving the FMLLR transform, in an embodiment the input speech from the target speaker will relate to known text. However, this is not a requirement and it is also possible for the target speaker to speak text that is not already known by the system. In this case, the speech from the target speaker is first passed through a trained speaker independent acoustic model to obtain an estimate of the associated text. This estimate of the text can then be used in the same way as known text. In exactly the same way as described with reference to figure 2, the FBANK features are then extracted from that input speech in step S123. Similarly, MFCC features are then extracted from the FBANK features in step S125 exactly as described with reference to figure 2.
Next, in step S129, a GMM-HMM model that has been trained using the MFCC features from training speakers is used. How to train a GMM-HMM is known in the art. As noted above, in an embodiment, the speech input by the target speaker is first passed through the GMM-HMM to obtain a transcript. This transcript is then used to estimate the FMLLR using a forward-backward algorithm in step S131. The estimated FMLLR can then be used to decode again and produce a better transcript, and the FMLLR can be re-estimated using the better transcript. In the above, as the FMLLR is estimated directly from the MFCC features, it is not necessary to perform alignments using LDA, but this can be done if required.
In a further embodiment, the FMLLR is applied directly to the FBANK features. An ASR method based on this is shown in figure 4.
Similar to the method described with reference to figure 2, the user inputs speech at step S251. Next, the speech signal is divided into frames, typically 25 msec long with an overlap of 10 msec. Next, a windowing step is performed on the speech signal in step S253. The windowing step applies a window function, for example a Hamming window, to each frame.
After the windowing step S253, a fast Fourier transform "FFT" S255 is taken of each windowed frame and a power spectrum is obtained.
The next step is then a filter bank step S257 where a triangular filter is applied on a Mel scale to the power spectrum to extract frequency bands and compute filter bank coefficients to obtain FBANK coefficients. These FBANK coefficients are correlated.
In step S259, a feature space maximum likelihood linear regression "FMLLR" transform is applied directly to the correlated FBANK coefficients. The FMLLR transform is applied to just the static components. How the FMLLR transform is estimated for this speaker will be described with reference to the method of figure 5.
These transformed correlated features are then used as the input into an acoustic model for deriving text in step S261. The acoustic model used here is a speaker independent (SI) model, i.e. a general model that is not optimised for the target speaker. The FMLLR transform allows the canonical (SI) model to be adapted to the target speaker.
Figure 5 is a flow chart of a method showing schematically how the FMLLR that is used in the method of figure 4 is estimated. Again, although the FMLLR transform is used for transforming features that are to be processed by a neural network based acoustic model, the FMLLR transforms themselves are derived from a Gaussian Mixture model-hidden Markov Model "GMM-HMM" model as FMLLR is a GMM-HMM approach to transform the means and covariances of the model to perform speaker adaptation.
This part of the process can be performed in advance or off-line. Once the FMLLR has been estimated for a speaker (herein referred to as a "target speaker"), the FMLLR transform can be stored for that speaker and re-used when necessary.
In step S101, the input speech is received from a target speaker. As this method is used for deriving the FMLLR transform, the input speech from the target speaker will relate to known text.
In exactly the same way as described with reference to figure 2, the FBANK features are then extracted from that input speech in step S103. Similarly, MFCC features are then extracted from the FBANK features in step S105 exactly as described with reference to figure 2.
Next, in step S107, a first GMM-HMM model is trained using the MFCC features. Before this training, the MFCC features are transformed with linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT). The alignments generated using this first model are used in later steps.
In step S109, a second GMM-HMM model is trained on the static FBANK features from step S105. The alignments generated in step S107 are used for training the second GMM-HMM model. During the training of the second model, the alignments determined from the first model are kept and no re-alignment is performed. This is because the estimates of the alignments from the first model are expected to be better than the alignments from the second model.
Next, in step S111, the FMLLR transforms are estimated as described above in relation to the flow chart of figure 3. Here, the first transcript of the input speech is determined using the model trained in step S107 and the alignments generated from this step are used with the model trained in step S109 using a forward-backward algorithm to estimate the FMLLR. Once the FMLLR transforms are estimated they can be used with a speaker independent neural net. In one embodiment this comprises at least one convolutional layer. However, other types of neural net may be used.
One possible example of a neural net is shown in figure 6. The neural net here comprises 10 convolutional layers with batch normalisation and ReLU activations.
In an embodiment, both time and frequency padding are used in each CNN layer. Max-pooling is applied after every 2 CNN layers and also has dropout of 0.2. The CNN layers are followed by two fully connected (FC) layers before the output layer. The fully connected layers also have dropout. The alignments for the CNN AM are obtained from the GMM-HMM model trained using speaker adaptive training (SAT).
In an embodiment, the network is trained using a cross-entropy training criterion. For training the acoustic models, the development set is used as the cross-validation set. FBANK features having 64 dimensions and a context of 8 frames are used for all the experiments presented herein. The input to the CNN is composed as an image of size 64 x 17. All the CNN layers use a filter kernel of size 3x3. In an embodiment, only static features are used for all the experiments, and delta and acceleration features are not included. A tri-gram language model is used during recognition.
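A minimal PyTorch sketch of such an acoustic model is given below for illustration. The experiments described later use CNTK; the choice of PyTorch here, the channel widths, the hidden layer sizes, the number of output targets and the frequency-only pooling (chosen so that the 64 x 17 input remains valid after five pooling operations) are all assumptions rather than details from the patent.

import torch
import torch.nn as nn

class CNNAcousticModel(nn.Module):
    # Ten 3x3 convolutional layers with batch normalisation and ReLU,
    # max pooling and dropout after every two layers, then two fully
    # connected layers before the output layer.
    def __init__(self, n_targets=2000, dropout=0.2):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in [64, 128, 256, 512, 512]:           # assumed channel widths
            for _ in range(2):
                layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                           nn.BatchNorm2d(out_ch),
                           nn.ReLU(inplace=True)]
                in_ch = out_ch
            layers += [nn.MaxPool2d(kernel_size=(2, 1)),  # pool frequency only
                       nn.Dropout(dropout)]
        self.cnn = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 2 * 17, 1024), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(1024, n_targets))                   # HMM-state (senone) targets

    def forward(self, x):                                 # x: (batch, 1, 64, 17)
        return self.classifier(self.cnn(x))

model = CNNAcousticModel()
posteriors = model(torch.randn(8, 1, 64, 17))             # (8, n_targets)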
The above methods can be combined with other techniques for speaker adaptation such as Vocal tract length normalisation (VTLN) and layer adaptation.
Figure 7 is a flow diagram showing the method of figure 2 also incorporating VTLN. In summary, the method is similar to that explained with reference to figure 2. However, in the method of figure 7, MFCC coefficients are derived using VTLN warping in step S271.
In full, the user inputs speech at step S201. Next, the speech signal is divided into frames, typically 25 msec long with an overlap of 10 msec. Next, a windowing step is performed on the speech signal in step S203. The windowing step applies a window function, for example a Hamming window, to each frame.
After the windowing step S203, a fast Fourier transform "FFT" S205 is taken of each windowed frame and a power spectrum is obtained.
The next step is then a filter bank step S271 where a triangular filter is applied on a Mel scale to the VTLN warped power spectrum to extract frequency bands and compute filter bank coefficients to obtain FBANK coefficients. These FBANK coefficients are correlated. How the VTLN warping is estimated will be described with reference to figures 8 and 9.
The FBANK coefficients are then subjected to a discrete cosine transform (DCT) to decorrelate the filter bank coefficients and produce Mel frequency cepstral coefficients "MFCC" in step S211.
In step S211, a feature space maximum likelihood linear regression "FMLLR" transform is applied to the decorrelated MFCC coefficients. FMLLR estimates a single transformation matrix for both means and covariances and is also known in the literature as constrained maximum likelihood linear regression (CMLLR). How the FMLLR transform is estimated for this speaker will be described with reference to the methods of figures 8 and 9.
Next, in step S213, an inverse discrete cosine transform is applied to the FMLLR (and VTLN) transformed MFCC features to transfer them back to a correlated space.
These transformed features that have been returned to the correlated space are then used as the input into an acoustic model for deriving text in step S215. The acoustic model used here is a speaker independent (SI) model, i.e. a general model that is not optimised for the target speaker. The combined FMLLR and VTLN transform allows the SI model to be adapted to the target speaker. Figure 8 is a flow diagram showing the training of the method of figure 2, adapted for also estimating VTLN warping.
For completeness, the entire training will be discussed. In step S301, a GMM-HMM model is trained using input data from many different training speakers to serve as a first speaker independent model.
In step S303, VTLN warping is estimated for each training speaker. This is performed by taking the data from each training speaker separately and estimating the VTLN for that speaker against the first speaker independent model trained in step S301.
In step S305, a new model is then trained using as input the data from each of the training speakers with the VTLN warping specific to each speaker.
Once this has been trained, the FMLLR transform for each training speaker is then estimated in step S307. Finally, the GMM-HMM is retrained for a second time in step S309 using the FMLLR and VTLN transformed speaker data.
Using the alignments from the GMM-HMM in step S309, a CNN is then trained in step S311.
If the CNN is to be used on FMLLR and VTLN transformed data, then the CNN will be trained using this data.
In the above description, the VTLN transform is estimated prior to the FMLLR transform. However, it is possible to estimate the FMLLR transform prior to the VTLN transform.
The method of figure 8 relates to the application of both VTLN and FMLLR using IDCT. However, as noted above, VTLN and FMLLR can also be combined in, for example, the system described with reference to figure 4 where the FMLLR transform is applied directly to the FBANK features. In this situation, the method of figure 8 will remain largely unchanged.
However, the features used to train the models will be different.
For example, where FMLLR is applied to the MFCC features and then an IDCT is applied, the CNN is trained on data where the input features are MFCC features with an IDCT applied (with FMLLR and/or VTLN applied as appropriate). However, where the FMLLR transform is applied directly to the FBANK features, the CNN will be trained using FBANK features (with FMLLR and/or VTLN applied as appropriate).
Figure 9 shows a method for estimating the FMLLR and VTLN transforms for the test speaker (figure 8 shows the methods of training where these are estimated for the training speakers).
In figure 9, test speaker data is received in step S351. In an embodiment, the test speaker will not be speaking known text. Here, in step S353, the speaker independent GMM-HMM trained with VTLN in step S305 of figure 8 is used to estimate the alignments using MFCC features. In other words, it is used to perform a first recognition of the text. Once this is determined, this estimated text can then be used with the test speaker data to derive the transforms.
In step S355, VTLN is estimated for the test speaker. Next, in step S357, the FMLLR transform is estimated using the VTLN warped test speaker data and the method described with reference to step S131 of figure 3.
Figure 10 is a flow diagram showing the method of figure 2 also incorporating layer adaptation. In summary the method is similar to that explained with reference to figure 2.
However, in the method of figure 10, there is a final step, S281, where a trained neural net is adapted to the test speaker. In an embodiment, the final layer of the neural net would be the adaptation layer and the other layers of the neural net have their weightings kept fixed. When the system is used for the test speaker, test speaker specific weightings are used for the final layer of the neural network.
In full, the user inputs speech at step S201. Next, the speech signal is divided into frames, typically 25 msec long with an overlap of 10 msec. Next, a windowing step is performed on the speech signal in step S203. The windowing step applies a window function, for example a Hamming window, to each frame.
After the windowing step S203, a fast Fourier transform "FFT" S205 is taken of each windowed frame and a power spectrum is obtained.
The next step is then a filter bank step S271 where a triangular filter is applied on a Mel scale to the power spectrum to extract frequency bands and compute filter bank coefficients to obtain FBANK coefficients. These FBANK coefficients are correlated.
The FBANK coefficients are then subjected to a discrete cosine transform (DCT) to decorrelate the filter bank coefficients and produce Mel frequency cepstral coefficients "MFCC" in step S211.
In step S211, a feature space maximum likelihood linear regression "FMLLR" transform is applied to the decorrelated MFCC coefficients. FMLLR estimates a single transformation matrix for both means and covariances and is also known in the literature as constrained maximum likelihood linear regression (CMLLR). How the FMLLR transform is estimated for this speaker will be described with reference to the methods of figures 11 and 12.
Next, in step S213, an inverse discrete cosine transform is applied to the FMLLR transformed MFCC features to transfer them back to a correlated space.
In step S281, the inverse discrete cosine transformed signal is then directed into a trained CNN. The weights of the final layer of the CNN in step S281 are specific to the speaker. How this CNN is trained will be described with reference to figures 11 and 12.
Figure 11 is a flowchart showing how the system described with reference to figure 10 is trained. In step S401, a speaker independent GMM-HMM is trained using data from the training speakers. The FMLLR transform is then estimated for each training speaker and the GMM-HMM is retrained using the FMLLR transformed data from each speaker in step S405. In step S407, the CNN is trained using the alignments derived from the GMM-HMM trained in step S405.
Figure 12 is a flowchart showing how the method of figure 10 is adapted to a specific speaker.
In step S451, speaker data is received. In an embodiment, the test speaker will not be speaking known text. Here, the speaker independent GMM-HMM is used to estimate the alignments.
Next, in step S457, the FMLLR transform is estimated using the test speaker data and the method described with reference to step S131 of figure 3.
In step S459, the CNN is trained using layer adaptation to the test speaker data. In an embodiment, this is performed by using the trained model of step S407 of figure 11. The weights of the trained model are then held fixed in all layers except for the hidden layer before the output layer. This layer will be termed the adapted layer. The CNN is then trained using the FMLLR transformed data from the test speaker. The alignments are taken from the GMM-HMM of step S405. The weights of the adapted layer are then trained using the test speaker FMLLR transformed data. This speaker adapted CNN then provides the CNN for step S281 of figure 10.
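A hedged PyTorch sketch of this layer adaptation step is shown below: every parameter of a trained acoustic model is frozen except those of a chosen hidden layer, which is then fine-tuned on the test speaker's FMLLR transformed data against the fixed alignments. The optimiser, learning rate, number of epochs and the way the adapted layer is selected are assumptions for illustration only.

import torch
import torch.nn as nn

def adapt_layer(model, adapted_layer, speaker_loader, epochs=3, lr=1e-4):
    # Freeze all parameters, then unfreeze and fine-tune only `adapted_layer`
    # on batches of (FMLLR features, alignment targets) from the test speaker.
    for p in model.parameters():
        p.requires_grad = False
    for p in adapted_layer.parameters():
        p.requires_grad = True
    optimiser = torch.optim.SGD(adapted_layer.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()      # cross-entropy training criterion
    model.train()
    for _ in range(epochs):
        for feats, targets in speaker_loader:
            optimiser.zero_grad()
            loss = criterion(model(feats), targets)
            loss.backward()
            optimiser.step()
    return model

# usage sketch: adapt the hidden layer before the output layer of a trained
# model, e.g. adapt_layer(model, model.classifier[-4], speaker_loader)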
Figure 13 shows a system in accordance with a further embodiment. Here, a joint CNN/DNN model is provided.
The model comprises two branches 501, 503 which merge to form a common branch 505. On the first branch 501 of the two branches, a CNN is provided. The CNN can be a CNN as previously described. The input into this CNN is the FBANK features. These features may be VTLN warped FBANK features.
On the second branch 503 of the two branches, a deep neural network (DNN) is provided. The input into this branch 503 may be FBANK features with an FMLLR transform applied or MFCC features with an FMLLR transformation applied. The DNN may comprise a fully connected network architecture.
These two branches are then combined together to form combined branch 505 with further neural net layers.
The FMLLR transform for the FBANK features can be determined using known methods. By combining the output of the CNN and DNN, it is possible for the FMLLR transform provided in the DNN branch to influence the correlated features from the CNN branch.
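The two-branch topology of figure 13 might be sketched in PyTorch as follows; the layer sizes, the concatenation-based merge and the assumed 40-dimensional FMLLR features with a 17 frame context are illustrative assumptions rather than details taken from the patent.

import torch
import torch.nn as nn

class JointCNNDNN(nn.Module):
    # CNN branch (501) over FBANK 'images', fully connected DNN branch (503)
    # over FMLLR transformed features, merged into a common branch (505).
    def __init__(self, fmllr_dim=40 * 17, n_targets=2000):
        super().__init__()
        self.cnn_branch = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Flatten(),
            nn.Linear(128 * 16 * 17, 512), nn.ReLU())
        self.dnn_branch = nn.Sequential(
            nn.Linear(fmllr_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU())
        self.common = nn.Sequential(
            nn.Linear(512 + 512, 1024), nn.ReLU(),
            nn.Linear(1024, n_targets))

    def forward(self, fbank_image, fmllr_feats):
        merged = torch.cat([self.cnn_branch(fbank_image),
                            self.dnn_branch(fmllr_feats)], dim=1)
        return self.common(merged)

model = JointCNNDNN()
out = model(torch.randn(4, 1, 64, 17), torch.randn(4, 40 * 17))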
The methods described with reference to figures 8 to 12 are described as variations on the method of figure 2, which uses an IDCT to transform the FMLLR transformed MFCC features back to a correlated space. However, VTLN and layer adaptation can also be used in combination with the method of figures 4 and 5, which directly applies the FMLLR transform to the correlated FBANK features.
Also, VTLN and layer adaptation can be used in combination with each other and either of the methods of figures 2 and 4.
To demonstrate the above embodiments, ASR experiments are reported herein on the CHiME4 single and multi-channel tasks.
The CHiME4 corpus is derived from the WSJ0 corpus. It is recorded using a multi-microphone tablet device in both real and simulated noisy environments. The environments where the recordings were made include cafe, street, bus and restaurant. The data includes both real and simulated noise recordings. The training data has 7138 simulated noise utterances from 83 speakers and 1600 real noisy utterances from 4 speakers. The test set provides development and evaluation sets with 4 speakers each. Results are presented on the real noise evaluation set (Real ET05), which has 1320 utterances with 330 utterances for each speaker. The total amount of training data from each microphone channel corresponds to 18 hours.
The experiments are performed using a deep convolutional neural network (DCNN) AM for ASR. The acoustic model uses 10 CNN layers having batch normalisation and ReLU activations. A schematic of the architecture used in the experiments is illustrated in Fig. 6.
Both time and frequency padding are used in each CNN layer. Max-pooling is applied after every 2 CNN layers and also has dropout of 0.2. The CNN layers are followed by two fully connected (FC) layers before the output layer. The alignments for the CNN AM are obtained from the GMM-HMM model trained using speaker adaptive training (SAT). The network is trained using cross-entropy training criterion. For training the AMs, the development set is used as the cross-validation set. FBANK features having 64 dimensions and a context of 8 frames are used for all the experiments described herein. The input to the CNN is composed as an image of size 64 x 17. The CNN layers use a filter kernel of size 3x3. Only static features are used in all the experiments. The DCNN AM is trained using CNTK.
In these experiments, the input to the CNNs uses only static features and does not include delta or acceleration features as input. In order to estimate FMLLR, a GMM-HMM model is also trained using only the static features, which is bootstrapped with alignments generated for the SAT model trained using MFCC. 64 dimensional FBANK features are used for the DIRECT approach, while 64 dimensional MFCC features are used for the IDCT approach.
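As an illustration of how the 64 x 17 input 'image' might be composed, the following numpy sketch splices each frame together with a context of +/- 8 frames; the edge padding at utterance boundaries is an assumption.

import numpy as np

def splice_context(feats, context=8):
    # For each centre frame, stack the surrounding +/- context frames into a
    # (n_dims x (2 * context + 1)) patch, e.g. 64 x 17 for 64-dimensional
    # features and a context of 8 frames.
    n_frames, n_dims = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].T for t in range(n_frames)])

fbank = np.random.randn(300, 64)      # toy speaker adapted FBANK features
images = splice_context(fbank)        # shape (300, 64, 17), one image per frame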
Table 1 - Performance (%WER) of various adaptation approaches on the single channel CHiME4 real noise evaluation set.

Features        Real ET05 %WER
FBANK           14.8
VTLN-FBANK      13.3

FMLLR TYPE      Diag.   Full
DIRECT          14.7    13.4
IDCT            14.2    12.6

Initial experiments present results only on the single-channel (1ch) CHiME4 task to understand the ASR performance of various adaptation approaches. For the 1ch task, the AM is trained using data recorded on all the microphone channels except the 2nd channel. The total training data for the 1ch task amounts to 90 hours. All the results presented herein use the RNN language model, which is the baseline LM for the CHiME4 task.
Table 1 presents the results of various adaptation strategies using DCNN AMs on the 1ch CHiME4 task. All the results report word error rate (%WER) on the task. The table also includes the performance of vocal tract length normalisation (VTLN) based speaker normalisation, which has been shown to be effective with CNNs. The table presents results where the FMLLR transform is estimated in both diagonal and full modes. The following observations are made:
* Performing speaker normalisation with VTLN improves the ASR performance of CNN AMs. VTLN is estimated following a two-pass approach.
* All the adaptation approaches have better performance when FMLLR is estimated in the full mode than when estimated in the diagonal mode. The full mode estimation has more parameters to estimate and requires more adaptation data. For the experiments here, all the data from the test speaker is used for estimating the FMLLR in both modes. FMLLR is estimated following a two-pass approach.
* When the FMLLR transform is estimated in the diagonal mode:
- The performance of all the FMLLR approaches is inferior when compared with the performance of VTLN adaptation. This indicates that VTLN adaptation is better than FMLLR for CNN AMs.
- DIRECT gives marginal gains in ASR performance, while IDCT improves the ASR performance when compared with FBANK without any adaptation.
* When the FMLLR transform is estimated in the full mode:
- The performance of IDCT is better when compared with the performance of VTLN adaptation, while the rest of the approaches have inferior performance. They both also have similar ASR performance. This indicates that a data independent transform (like IDCT) can be used to perform FMLLR adaptation in CNN AMs.
The above results also suggest that it might be better to provide features transformed with FMLLR as input to CNNs rather than providing them as auxiliary features.
From the observations presented above, it is clear that FMLLR estimated in the full mode is beneficial for CNN AMs. Using a GMM-HMM with diagonal mean and covariance for estimating the FMLLR transform benefits from using de-correlated features transformed with IDCT. The DIRECT approach seems to always perform worse because it attempts to estimate FMLLR directly on correlated FBANK features. For the rest of the discussion, FMLLR is always estimated in the full mode.
Next, results are presented on the single (1ch) and multi (6ch) channel CHiME4 tasks, where FMLLR based speaker adaptation is applied in combination with VTLN and layer adaptation (LA). For the 6ch task, GEV enhancement is applied on the multi-channel data for both the train and test sets. The GEV enhanced data, along with data from all the channels except the 2nd channel (108 hours), is used for model training.
Layer adaptation (LA) is a network adaptation approach, where the weights of a specific layer are tuned using the data from the test speaker while the weights in the rest of the layers are kept fixed. For the experiments presented here, the last layer before the output layer is used for performing speaker adaptation. The targets used for tuning the weights are obtained by performing recognition using the previous best model.
Table 2 presents the results of various FMLLR adaptation approaches in combination with VTLN and LA. It also includes the performance of VTLN as a reference. All the adaptation approaches on the test speaker follow a two-pass approach. The performance is reported using RNN LM rescoring, which is the baseline LM for the challenge.
Table 2 - Performance (%WER) of various FMLLR adaptation approaches along with VTLN and LA on both single and multi-channel CHiME4 tasks on the real noise evaluation set.

                LA   VTLN (V)   FMLLR (F)   1ch    6ch
VTLN-FBANK       -      +           -       13.3   4.4
DIRECT           -      -           +       13.4   4.3
IDCT             -      -           +       12.6   4.3
DIRECT           -      +           +       12.4   4.1
IDCT             -      +           +       11.8   4.1
DIRECT           +      +           +       11.3   3.7
IDCT             +      +           +       10.9   3.8

The following observations are made:
* The adaptation approaches (VTLN, FMLLR and LA) provide complementary gains in ASR performance, with the best performance achieved when all of them are applied together.
* The differences in performance between the DIRECT and IDCT approaches that exist on the 1ch task disappear, and the two approaches have similar performance on the 6ch task.
Compared with the performance of systems reported using the baseline LM in the challenge, the system presented herein reaches the second position in the challenge on both 1ch and 6ch tasks. The performance of the system described herein is based on the single best system without system combination. The best performing system uses 25 CNN layers and is obtained after system combination. The second best performing system uses 22 CNN layers followed by BLSTM layers for the AM. The system described above uses 10 CNN layers and has comparatively fewer parameters than the best performing systems in the challenge.
The above embodiments perform speaker adaptation in CNNs using FMLLR. Since CNNs require the input features to be correlated, any transformation that can alter these relations might degrade the ASR performance.
The above describes two ways (IDCT and DIRECT) to perform FMLLR adaptation in CNNs. ASR experiments on the CHiME4 task are used to evaluate the performance of the proposed approaches.
The above shows that FMLLR, when applied with VTLN and LA, provides complementary gains in ASR performance. Using a 10 layer CNN AM, which has a simpler architecture than the top performing systems in the challenge, and using the baseline LM, the second best performance was achieved on both single and multi-channel CHiME4 tasks. The performance is based on a single best performing system without system combination. This shows that FMLLR adaptation can be successfully applied to CNNs and improve the ASR performance. The proposed methods can also be applied to complex acoustic models that have a combination of CNN and LSTM architectures.
Whilst certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel devices, and methods described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the devices, methods and products described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
GB1904100.3A 2019-03-25 2019-03-25 A speech processing system and method Active GB2582572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1904100.3A GB2582572B (en) 2019-03-25 2019-03-25 A speech processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1904100.3A GB2582572B (en) 2019-03-25 2019-03-25 A speech processing system and method

Publications (3)

Publication Number Publication Date
GB201904100D0 GB201904100D0 (en) 2019-05-08
GB2582572A true GB2582572A (en) 2020-09-30
GB2582572B GB2582572B (en) 2022-04-06

Family

ID=66381557

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1904100.3A Active GB2582572B (en) 2019-03-25 2019-03-25 A speech processing system and method

Country Status (1)

Country Link
GB (1) GB2582572B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040117183A1 (en) * 2002-12-13 2004-06-17 Ibm Corporation Adaptation of compound gaussian mixture models
US20080010057A1 (en) * 2006-07-05 2008-01-10 General Motors Corporation Applying speech recognition adaptation in an automated speech recognition system of a telematics-equipped vehicle
US20150161993A1 (en) * 2013-12-06 2015-06-11 International Business Machines Corporation Systems and methods for applying speaker adaption techniques to correlated features

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114169291A (en) * 2021-11-29 2022-03-11 天津大学 Text-to-speech method and device based on convolutional neural and generation countermeasure network
CN114169291B (en) * 2021-11-29 2024-04-26 天津大学 Text-to-speech method and device based on convolutional neural and generating countermeasure network

Also Published As

Publication number Publication date
GB201904100D0 (en) 2019-05-08
GB2582572B (en) 2022-04-06
