GB2560174A - A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of training an automatic speech recognition system - Google Patents


Info

Publication number
GB2560174A
GB2560174A (application GB1703310.1A / GB201703310A)
Authority
GB
United Kingdom
Prior art keywords
filter
time domain
filter bank
signal
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1703310.1A
Other versions
GB2560174B (en)
GB201703310D0 (en)
Inventor
Thanh Do Cong
Stylianou Ioannis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to GB1703310.1A priority Critical patent/GB2560174B/en
Priority to GB1913909.6A priority patent/GB2577997B/en
Publication of GB201703310D0 publication Critical patent/GB201703310D0/en
Publication of GB2560174A publication Critical patent/GB2560174A/en
Application granted granted Critical
Publication of GB2560174B publication Critical patent/GB2560174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An automatic speech recognition system, comprising: an input for receiving a speech signal; and a processor configured to: filter an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter; extract a temporal envelope from the output time domain signal from each filter in the filter bank; frame the temporal envelopes; extract a feature vector for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank; input the feature vectors into a deep neural network based classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal. The filter bank is a Gammatone filter bank. Extracting the temporal envelope from the output time domain signal comprises full wave rectifying the output time domain signal from each filter in the filter bank and low pass filtering each of the rectified signals. A time delay neural network (TDNN) de-noising auto encoder is used.

Description

(71) Applicant(s): KABUSHIKI KAISHA TOSHIBA, 1-1, Shibaura 1-chome, Minato-ku, Tokyo 105-8001, Japan
(72) Inventor(s): Cong Thanh Do; Ioannis Stylianou
(74) Agent and/or Address for Service: Marks & Clerk LLP, Long Acre, LONDON, WC2E 9RA, United Kingdom
(56) Documents Cited: US 8442821 B1; US 20160240190 A1; US 5185848 A; US 20140257804 A1
(58) Field of Search: INT CL G10L; Other: WPI EPODOC TXTE INTERNET
(54) Title of the Invention: A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of training an automatic speech recognition system
Abstract Title: Training an automatic speech recognition system
(57) Abstract: as reproduced above.
Figure 9 (drawings GB2560174A_D0001 and GB2560174A_D0002): DNN training; TDNN DAE; DNN Training Tool; state alignment (from HMM-GMM system); Figure 9(c): enhanced features, DNN training.
At least one drawing originally filed was informal and the print reproduced here is taken from a later filed formal copy.
Drawing sheets 1/10 to 10/10 (drawings GB2560174A_D0003 to GB2560174A_D0020): Figures 1 to 10. Figure 2 shows steps S201 to S205; Figure 4 plots gain (dB) against frequency (Hz) up to 8000 Hz; Figure 7 shows steps S701 to S706; Figure 10 relates to transition probabilities.
A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of training an automatic speech recognition system
FIELD
The present disclosure relates to a feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of training an automatic speech recognition system.
BACKGROUND
Automatic speech recognition (ASR) systems are used in many applications, including hands free computing, in-car systems, meeting transcription, natural human-machine interfaces, automated call-centres, speech-to-speech translation, spoken data annotations and transcription systems for example. It is often necessary for the ASR system to operate in a noisy environment, for example, when using a mobile telephone in a crowded place. Environmental noise can significantly degrade speech recognition performance.
There is a continuing need to make ASR systems more robust to environmental noise.
BRIEF DESCRIPTION OF THE DRAWINGS
Systems and methods in accordance with non-limiting arrangements will now be described with reference to the accompanying figures in which:
Figure 1 is a schematic illustration of an ASR system comprising a feature extraction system;
Figure 2 is a flow diagram of an ASR method, comprising a feature extraction method;
Figure 3 is a flow diagram of a method of feature extraction which may be used in the
ASR method;
Figure 4 shows an example frequency response of a Gammatone filter-bank;
Figure 5 shows an example signal output from one filter in the filter bank and the corresponding extracted sub-band temporal envelope;
Figure 6 is a schematic illustration of an ASR system comprising a time-delay neural network (TDNN) denoising autoencoder;
Figure 7 is a flow chart showing a method of training an automatic speech recognition system;
Figure 8 is a flow chart showing a method of training an automatic speech recognition system comprising a TDNN denoising autoencoder;
Figures 9(a), (b) and (c) are schematic illustrations of training of TDNNs and deep neural networks (DNNs);
Figure 10 is a schematic illustration of an automatic speech recognition system during the testing stage comprising a TDNN denoising autoencoder.
DETAILED DESCRIPTION
There is provided an automatic speech recognition system, comprising: an input for receiving a speech signal; and a processor configured to:
filter an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extract a temporal envelope from the output time domain signal from each filter in the filter bank;
frame the temporal envelopes;
extract a feature vector for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
input the feature vectors into a deep neural network based classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
Extracting the temporal envelope may comprise:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals.
There is also provided an automatic speech recognition system, comprising: an input for receiving a speech signal; and a processor configured to:
filter an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extract a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals; frame the temporal envelopes;
extract a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
input the feature vectors into a classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
Each filter in the filter bank is a band-pass filter and each band pass filter has a different centre frequency. The feature vector for each frame comprises a feature coefficient corresponding to each band in the filter bank. Each feature vector therefore comprises at least as many feature coefficients as there are bands in the filter bank. Each feature vector may further comprise additional feature coefficients.
The classifier is a speech recognition classifier. The feature vectors inputted into the speech recognition classifier comprise information describing the temporal envelopes. The feature vector comprises feature coefficients which each describe the temporal envelope for the corresponding filter band. Optionally, the feature coefficient for each filter band is a representative value of the temporal envelope, which may be generated using summary statistics, for example it may be a power average of the temporal envelope. In this case, extracting the feature vector for a frame comprises generating a power signal using each temporal envelope and, for each power signal, averaging the power signal values over the frame. The feature coefficient may be a compressed value, for example, extracting the feature vector for a frame may further comprise root compressing each average power value.
The centre frequencies for the band pass filters may be linearly spaced on an equivalent rectangular bandwidth scale.
The filter bank may be a Gammatone filter bank. Each filter may be implemented as a cascade of two or more infinite impulse response filters.
The extraction of the feature vectors from the speech signal is performed entirely in the time domain. Each filter in the filter bank modifies the input speech signal by performing a convolution of the input speech signal with a filter impulse response function in the time domain. Where a low-pass filter is applied to generate the temporal envelope, the low-pass filter is applied by performing a convolution of the rectified signal with a filter impulse response function in the time domain.
A pre-emphasis filter may be applied to the input speech signal before the input speech signal is filtered using the filter bank. The pre-emphasis filter is also applied in the time domain, by performing a convolution of the input speech signal with a filter impulse response function in the time domain.
A de-noising neural network may be used to enhance the features before inputting them into the classifier. The de-noising neural network may be a time delay neural network.
The system may further comprise an output for outputting a text signal, wherein the processor is further configured to determine a sequence of text from the one or more automatic speech recognition hypotheses and output the text at the output.
Framing the temporal envelopes may comprise applying a window function, for example a Hamming window.
There is further provided a feature extraction system, comprising: an input for receiving a speech signal; and a processor configured to:
filter an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extract a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals; frame the temporal envelopes;
extract a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank.
There is further provided a method of training an automatic speech recognition system for recognition of speech data, the method comprising:
obtaining a corpus of data comprising a plurality of speech signals in which the text corresponding to parts of the speech signal is labelled;
filtering each input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
training a deep neural network based classifier using the feature vectors and the text labels.
There is also provided a method of training an automatic speech recognition system for recognition of speech data, the method comprising:
obtaining a corpus of data comprising a plurality of speech signals in which the text corresponding to parts of the speech signal is labelled;
filtering each speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals to remove a high frequency component;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
training a classifier using the feature vectors and the text labels.
There is also provided a feature extraction method, comprising:
filtering an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals to remove a high frequency component;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank.
There is also provided an automatic speech recognition method, comprising:
filtering an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
inputting the feature vectors into a deep neural network based classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
There is also provided an automatic speech recognition method, comprising:
filtering an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals to remove a high frequency component;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
inputting the feature vectors into a classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
The method may further comprise training a HMM/GMM system to generate state alignment information, using a corpus of data.
The method may further comprise training a de-noising neural network for feature enhancement.
There is also provided a carrier medium comprising computer readable code configured to cause a computer to perform any of the above methods.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise a storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal, e.g. an electrical, optical or microwave signal. The carrier medium may comprise a non-transitory computer readable storage medium.
Figure 1 is a schematic illustration of an ASR system 1 comprising a feature extraction system. The system 1 comprises a processor 3, and takes input speech signals. The system may output text signals. A computer program 5 is stored in non-volatile memory. The non-volatile memory is accessed by the processor 3 and the stored computer program code is retrieved and executed by the processor 3. The storage 7 stores data that is used by the program 5.
The system 1 further comprises an input module 11. The input module 11 is connected to an input 15 for receiving data relating to a speech signal. The input 15 may be an interface that allows a user to directly input data, for example a microphone. Alternatively, the input 15 may be a receiver for receiving data from an external storage medium or a network.
The system 1 may further comprise an output module 13. Connected to the output module 13 may be an output 17. The output 17 may be an interface that displays data to the user, for example a screen. Alternatively, the output 17 may be a transmitter for transmitting data to an external storage medium or a network.
Alternatively, the ASR system may be simply part of a system, for example, the ASR system may be part of a spoken dialogue system, in which the output of the ASR is used to generate a system action, e.g. a response to the input speech signal. In this case, the ASR does not output to an interface, but provides output information to a further functional part of the system.
In use, the system 1 receives speech signals through the input 15. The program 5 is executed on processor 3 in the manner which will be described with reference to the following figures. It may output a text signal at the output 17. The system 1 may be configured and trained in the manner which will be described with reference to the following figures.
The ASR system filters an input speech signal using a time domain filter bank. Subband temporal envelopes are then extracted, before framing and extraction of a feature vector for each frame. Framing is thus performed after the sub-band temporal envelopes are extracted for the utterance, meaning that temporal context information is preserved. Such contextual information is beneficial for neural network based speech recognition classifiers for example. Since neural network based classifiers take multiple frames for each input, extracting the sub-band temporal envelopes on a time scale greater than one frame preserves relevant information.
Sub-band temporal envelopes (STEs) may be extracted by full-wave rectification and low-pass filtering. The temporal resolution of STEs can thus be controlled by the cut-off frequency of the low-pass filter. For example, the low pass filter may have a cut-off frequency of the same order of magnitude as the word rate, meaning that only information relevant to the classifier is preserved.
Figure 2 is a flow chart of a method of ASR. The method may be performed by the ASR system 1 shown in Figure 1 on input speech signals. There may be a series of input speech signals, corresponding to a sequence of utterances. The start and end points of each utterance may be found by automatically segmenting a long continuous speech signal for example.
The input speech signal is a sampled speech signal s(n), where n is a discrete time index. The input speech signal may be generated by sampling an analogue speech signal s(t). Alternatively, a sampled speech signal may simply be input from an external storage medium or a network for example. For example, the sampling frequency may be 16 kHz.
A pre-emphasis filter may be applied to the input speech signal at this stage, as shown in the example method of feature extraction shown in Figure 3. Figure 3 is a flow chart of an example method of sub-band temporal envelope feature extraction, corresponding to steps S201 to S204 of Figure 2 and will be explained in further detail below.
For example, the speech signal s(n) may be pre-emphasized using a filter having a transfer function H(z) = 1 - 0.97z^(-1). The filter is applied in the time domain. The filter performs a convolution of the input speech signal with a filter impulse response function in the time domain and outputs a time domain signal. Applying a pre-emphasis filter boosts the high frequency energy in the speech signal.
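By way of illustration only, this pre-emphasis step might be sketched in Python as follows. This is a minimal sketch, assuming the sampled signal is held in a NumPy array; the function name and the use of SciPy's lfilter are illustrative choices rather than details taken from the patent.

```python
import numpy as np
from scipy.signal import lfilter

def pre_emphasize(s, alpha=0.97):
    """Apply the pre-emphasis filter H(z) = 1 - alpha*z^(-1) in the time domain.

    s is a 1-D NumPy array of speech samples; alpha = 0.97 matches the
    example transfer function given above.
    """
    # lfilter realises the difference equation y[n] = s[n] - alpha*s[n-1],
    # i.e. a time-domain convolution with the impulse response [1, -alpha].
    return lfilter([1.0, -alpha], [1.0], s)
```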
In step S201 the input speech signal, which may be pre-emphasized as described above, is filtered using a filter bank. The filter bank comprises a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter.
The filter bank may be any time domain band pass filter bank. The input speech signal is inputted to each filter in the filter-bank. Each filter performs a convolution of the input speech signal with a filter impulse response function in the time domain and outputs a time domain signal. The filter bank is thus configured to decompose the input speech signal into M sub-band signals sm(n), m = 1, ..., M. The output signal for each band, sm(n), is given by:

sm(n) = s(n) * hm(n)

where hm(n) is the impulse response function for the filter m and * denotes convolution in the time domain.
An example impulse response for the filter m is:

hm(n) = a n^(v-1) e^(-λn) cos(2π fc n + φ)

where a is the amplitude and v is the filter order. The damping factor λ is defined as λ = 2πb ERB(fc), where ERB denotes the equivalent rectangular bandwidth and fc is the centre frequency. The centre frequency fc varies for each m. The parameter b controls the bandwidth of the filter in proportion to the ERB of a human auditory filter. φ (in radians) is the phase of the carrier.
For example, the filter bank may comprise M Gammatone band-pass filters as shown in Figure 3. In this case, the speech signal is decomposed into M sub-band signals sm(n), m = 1, ..., M using a filter-bank of M Gammatone band-pass filters. For example, M may be equal to 40.
Optionally, the centre frequencies of the Gammatone filters are linearly spaced on the ERB (equivalent rectangular bandwidth) scale with the centre frequency of the first filter being at 100 Hz. These filter bands track energy peaks in perceptual frequency bands, and thus reflect the resonant properties of the vocal tract. Figure 4 shows the frequency response of a Gammatone filter-bank which may be used in the method. Each line in the figure corresponds to the frequency response of the respective filter in the filter bank.
Each Gammatone band-pass filter in the filter-bank may be implemented as a cascade of four separate second order IIR (infinite impulse response) filters. The transfer functions of the four IIR filters share the same denominators (poles) but have different numerators (zeros). This reduces round-off errors.
Alternatively, the filter bank may be a perceptual filter-bank, comprising band-pass filters linearly spaced on perceptual frequency bands, for example.
The band-pass filtering is performed in the time domain, as described above, and thus the method provides time-domain feature extraction.
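A rough sketch of such a time-domain Gammatone filter bank is given below, built directly from the impulse response formula above. The ERB and ERB-rate formulas, the value b = 1.019, the filter order v = 4 and the truncated FIR convolution are common textbook assumptions rather than details from the patent, which implements each filter as a cascade of second-order IIR filters.

```python
import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth (Hz) of a human auditory filter centred
    # at fc; Glasberg-Moore approximation (an assumption, not from the patent).
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def erb_spaced_centres(n_filters=40, f_low=100.0, f_high=7500.0):
    # Centre frequencies linearly spaced on the ERB-rate scale, with the first
    # filter at 100 Hz as described above; the ERB-rate formula is an assumption.
    def hz_to_erb_rate(f):
        return 21.4 * np.log10(0.00437 * f + 1.0)
    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) / 0.00437
    e = np.linspace(hz_to_erb_rate(f_low), hz_to_erb_rate(f_high), n_filters)
    return erb_rate_to_hz(e)

def gammatone_ir(fc, fs, v=4, b=1.019, duration=0.05, a=1.0, phi=0.0):
    # hm(n) = a n^(v-1) e^(-lambda n) cos(2 pi fc n + phi), lambda = 2 pi b ERB(fc),
    # with n expressed here in seconds (t = n / fs) and truncated to `duration`.
    t = np.arange(int(duration * fs)) / fs
    lam = 2.0 * np.pi * b * erb(fc)
    return a * t ** (v - 1) * np.exp(-lam * t) * np.cos(2.0 * np.pi * fc * t + phi)

def gammatone_filter_bank(s, fs=16000, n_filters=40):
    # Decompose s(n) into M sub-band signals sm(n) = s(n) * hm(n) by direct
    # time-domain convolution (an FIR approximation of the IIR cascade).
    centres = erb_spaced_centres(n_filters)
    return np.stack([np.convolve(s, gammatone_ir(fc, fs), mode="same")
                     for fc in centres])
```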
In step S202, a temporal envelope is extracted from the output time domain signal from each filter in the filter bank, i.e. for each sub-band. The resulting sub-band temporal envelopes (STEs) are temporal envelopes of sub-band signals resulting from bandpass filtering of the original speech signal. Thus for an input speech signal s(n), M STE signals em(n), m = 1, ..., M of s(n) are extracted.
Figure 5 shows an example signal sm(n) output from one filter m in the filter bank and the extracted sub-band temporal envelope em(n). The filter-bank output is shown by the lighter line, the sub-band temporal envelope by the darker line. The sub-band temporal envelope is extracted from the sub-band signal resulting from the 8th Gammatone band-pass filter in the analysis filter-bank, i.e. where m=8. The STE is extracted in this case using a low-pass filter having a cut-off frequency of 50 Hz.
Optionally, each sub-band temporal envelope is extracted by full-wave rectifying the output time domain signal from the respective filter in the filter bank and then low pass filtering the rectified signal to remove a high frequency component, as shown in Figure 3. The example STE shown in Figure 5 was extracted by this method.
In this case, the STEs em(n), m = 1, ..., M of the sub-band signals sm(n), m = 1, ..., M are extracted by, first, full-wave rectifying the sub-band signals, followed by low-pass filtering of the resulting signals. For example, the low-pass filter for extracting STEs may be a fourth-order elliptic low-pass filter. The filter may have a 2 dB peak-to-peak ripple. The minimum stop-band attenuation of the filter may be 50 dB. The low-pass filter is applied in the time domain. The low-pass filter performs a convolution of the rectified signal with a filter impulse response function in the time domain and outputs a time domain signal.
The cut-off frequency of the low-pass filter controls the band width of the sub-band temporal envelopes. For example the cut-off frequency of the low-pass filter may be of the order of magnitude of the word rate of typical human speech. For example, the cutoff frequency of the low-pass filter may be in the range 5-100 Hz. The cut-off frequency of the low-pass filter may be 50 Hz. The cut-off frequency is selected to give a reasonable bandwidth for human and machine speech recognition.
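A minimal sketch of this envelope extraction step is given below, using the example parameter values quoted above (fourth-order elliptic low-pass filter, 2 dB ripple, 50 dB stop-band attenuation, 50 Hz cut-off); the use of SciPy's ellip and lfilter is an implementation choice, not prescribed by the patent.

```python
import numpy as np
from scipy.signal import ellip, lfilter

def sub_band_temporal_envelopes(sub_bands, fs=16000, cutoff=50.0):
    """Extract STEs em(n): full-wave rectify each sub-band signal, then
    low-pass filter it in the time domain.

    sub_bands is an (M, n_samples) array such as the filter-bank output;
    the 4th-order elliptic design, 2 dB ripple, 50 dB stop-band attenuation
    and 50 Hz cut-off follow the example values given above.
    """
    b, a = ellip(4, 2.0, 50.0, cutoff, btype="low", fs=fs)
    rectified = np.abs(sub_bands)             # full-wave rectification
    return lfilter(b, a, rectified, axis=-1)  # time-domain low-pass filtering
```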
The output of step S202 is a plurality of time domain signals, each corresponding to one sub-band of the filter bank. Each of the signals is the length of the input speech signal, i.e. the length of an utterance. This may be of the order of 10 seconds for example. Each utterance inputted into the system thus generates a plurality of time domain envelope signals, each corresponding to a sub-band.
In S203, each sub-band temporal envelope is framed. The STEs may also be windowed in this step.
In this step, from the STEs em(n), m = 1, ..., M extracted from the whole utterance in S202, frames are extracted. For example, frames of 25 ms may be extracted every 10 ms. Optionally, the frames are multiplied by window functions. This emphasizes the samples in the middle of the analysis frames. For example, the frames may be multiplied with Hamming windows.
The output of S203 thus comprises a sequence of frames, wherein for each frame, there are a plurality of sub-band temporal envelope signals, each corresponding to one sub-band of the filter-bank.
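As an illustrative sketch of S203, the envelopes can be cut into 25 ms frames every 10 ms and windowed with Hamming windows; the array shapes and helper name are assumptions made for illustration.

```python
import numpy as np

def frame_envelopes(envelopes, fs=16000, frame_ms=25, hop_ms=10):
    """Split each sub-band temporal envelope into 25 ms frames every 10 ms
    and multiply each frame by a Hamming window.

    envelopes has shape (M, n_samples); the result has shape
    (n_frames, M, frame_len), so frames[k, m] corresponds to em,k(n).
    """
    frame_len = int(fs * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 160 samples at 16 kHz
    window = np.hamming(frame_len)          # emphasizes mid-frame samples
    n_frames = 1 + (envelopes.shape[-1] - frame_len) // hop
    return np.stack([envelopes[:, k * hop:k * hop + frame_len] * window
                     for k in range(n_frames)])
```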
In S204, a feature vector is extracted for each frame. Each feature vector comprises a feature coefficient extracted from the frame of the sub-band temporal envelope for each filter in the filter bank.
For each frame k, the signals em,k(n) are the STEs obtained after framing. A feature vector yk = [y1,k, y2,k, ..., yM,k] is extracted for each frame k. The feature vector for the frame k comprises a feature coefficient corresponding to each sub-band m. Each feature coefficient ym,k is extracted from the STE em,k(n) corresponding to the band m and the frame k, where k is the frame index.
For example, each feature coefficient may be computed as:

ym,k = (1/N) Σ n=1..N [em,k(n)]²

where N is the number of samples in a frame. Thus a power average is taken over the frame to give the feature coefficient.
Optionally, the feature coefficients are each compressed. For example, each feature coefficient may be root compressed. For example, each feature coefficient may be root compressed with the 1/r-th root such that:

ym,k(compressed) = (ym,k)^(1/r)
Optionally, 3 < r < 15. This is a range of perceptual compression used in speech processing. Optionally, r = 15 for example. This may provide improved speech recognition.
Further feature coefficients can be included in the feature vector. For example, the frame’s total energy may be used as an additional feature coefficient.
The dimension of the static STE features for each frame, i.e. the dimension of the feature vectors, is equal to the number of bands M, plus the number of additional feature coefficients.
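A sketch of S204 under these definitions is given below. The power average and 1/r-th root compression follow the formulas above; taking the frame's total energy as the summed power across all sub-bands is an assumption made here for illustration only.

```python
import numpy as np

def ste_feature_vectors(frames, r=15):
    """Turn framed STEs into feature vectors yk = [y1,k, ..., yM,k, energy].

    frames has shape (n_frames, M, N).  Each coefficient is the power average
    (1/N) * sum_n em,k(n)^2, root-compressed with the 1/r-th root (r = 15 by
    default).  The appended frame energy is computed here as the summed power
    over all sub-bands, which is an illustrative assumption.
    """
    power = frames ** 2
    y = power.mean(axis=-1)            # (n_frames, M) power averages
    y_compressed = y ** (1.0 / r)      # root compression
    energy = power.sum(axis=(-2, -1))  # one total-energy value per frame
    return np.concatenate([y_compressed, energy[:, None]], axis=-1)
```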
As described above, Figure 3 is a flow chart of an example method of sub-band temporal envelope feature extraction, corresponding to steps S201 to S204 of Figure 2. In Figure 3, the speech signal is input in S301, before the “pre-emphasis” step S302. The input signal is then filtered in the “Gammatone filters” step S303 of Figure 3. This corresponds to S201 in Figure 2. The sub-band temporal envelope extraction is then performed in the “full-wave rectification” step S304 and “low pass filtering” step S305 of Figure 3. This corresponds to S202 of Figure 2. The framing is then performed in the “framing + windowing” step S306 of Figure 3. This corresponds to S203 of Figure 2.
The feature vector is then extracted in the “power averaging” step S307 and “root compression” step S308 of Figure 3. This corresponds to S204 of Figure 2.
The feature extraction is performed in the time domain. The filter bank is applied in the time domain as described above. The low-pass filter used in the envelope extraction stage is also applied in the time domain. If a pre-emphasis filter is used, this is also applied in the time domain. Applying these filters in the time domain means that the sub-band temporal envelopes are extracted before framing. This allows long temporal context to be retained. In addition, extracting time-domain features means that no discrete Fourier transform (DFT) or equivalent is applied to transform the speech signal into the frequency domain.
Returning to Figure 2, once the feature vectors have been extracted, they are then inputted into a speech recognition classifier in S205, which is configured to generate one or more automatic speech recognition hypotheses corresponding to the input speech signal, i.e. corresponding to the utterance. The automatic speech recognition hypotheses may be generated with corresponding probabilities. The speech recognition classifier may be a neural network for example.
The classifier is trained prior to implementation to generate automatic speech recognition hypotheses based on the input feature vectors. The training methods are described below. During implementation, feature vectors are simply inputted into the trained classifier.
Optionally, the classifier is a deep neural network (DNN) based classifier. In this specification, the term deep neural network based classifier is used to refer to a classifier employing a feed forward neural network with a certain number of hidden layers. DNNs do not require uncorrelated data, thus, for example, no decorrelation step is included within the feature extraction method. The feature vectors inputted into the DNN based classifier may therefore be, to some degree, correlated.
For each frame of interest, a DNN based classifier also takes a number of adjacent frames as input. The number of frames for each input is referred to as the context window. For example, a context window of 11 may be used. This means that the DNN takes 11 frames at a time as input, rolling forward one frame at a time, where the middle frame is the frame of interest each time.
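The context window can be illustrated with the short sketch below, which splices each frame with its neighbours before presentation to the DNN; padding the edges by repeating the first and last frames is a common convention assumed here, not specified in the patent.

```python
import numpy as np

def splice_context(features, context=11):
    """Stack each frame with its neighbours so the classifier sees `context`
    frames at a time, the middle frame being the frame of interest.

    features has shape (n_frames, dim); edge frames are padded by repeating
    the first/last frame (an assumed convention).
    """
    half = context // 2                                  # 5 frames per side
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[k:k + context].reshape(-1)   # (context * dim,)
                     for k in range(features.shape[0])])
```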
Optionally, the ASR is a hybrid HMM/DNN based ASR. In a hybrid HMM/DNN based ASR, a HMM/GMM system is included as well as the speech recognition DNN, and is used to generate state alignment information. State alignments contain information about the corresponding HMM state of each speech frame extracted from training data. The HMM/GMM system is trained initially. It is then used during training of the speech recognition DNN. The state alignment information generated by the HMM/GMM is used to train the speech recognition DNN. The posterior probability that a feature vector belongs to a phonetic class (or HMM state) is computed using the HMM/GMM system during training of the DNN.
Optionally, before being input into the speech recognition DNN, the feature vectors are enhanced using a second neural network. The second neural network acts as a denoising auto-encoder. In this case, extracting the feature vector for a frame comprises using the second neural network to enhance the features before inputting them into the speech recognition classifier. The feature vectors are inputted into the second neural network, which outputs enhanced feature vectors, which are then inputted into the first neural network for recognition.
Feature enhancement may further improve the ASR noise robustness. It aims at enhancing the features extracted from noisy input speech, the enhanced features then being used for recognition. A neural network which attempts to estimate a cleaner version of noisy input features is known as a de-noising auto-encoder (DAE).
Figure 6 shows an example architecture of such a system. Feature extraction is performed in S601 as described in S201 to S204 above in the step “feature extraction”. The features are then inputted into a trained DAE in S602, which outputs enhanced features. The enhanced features are then inputted into the trained speech recognition DNN classifier in S603, which corresponds to S205 above, which in turn outputs one or more automatic speech recognition hypotheses.
Neural network architectures used for feature enhancement often take into account long temporal context. Such systems are able to model the temporal evolution of speech and noise over a long period of time. Use of a DAE together with the above described feature extraction, in which framing is performed after the sub-band temporal envelopes are extracted for the utterance, can therefore be beneficial. Since temporal context information from speech is extracted by the STE features, combining these features with a DAE based on neural network architectures which learn long-term temporal contexts is beneficial.
The second neural network may be a time delay neural network (TDNN). The time-delay neural network is applied as a de-noising auto-encoder to improve speech recognition performance. The time delay neural network architecture can represent relationships between events in time. In this case, the events in time are the extracted feature vectors. In a TDNN architecture, the initial transforms are learned on narrower contexts and the deeper layers process the hidden activations from a wider temporal context. Thus the higher layers have the ability to learn wider temporal relationships. In this architecture, the time delay of speech frames is explicitly modelled thanks to the delay units. As described above, in neural network computation, the current speech frame at time instant t is not the only speech frame taken into account. Speech frames at t+1, t+2 ... and/or t-1, t-2 ... are used as well. For example, 11 frames may be taken as input at each time instant, as described above. The speech frames at t+1, t+2 ... are delayed compared to that at t. Those at t-1, t-2 ... can be considered as “delayed” as well. In a TDNN, the delays are modelled explicitly by a delay unit.
In a TDNN structure, the delayed frames of the current frame at time t (for instance at time t+1, t+2 ...) are modelled explicitly by a delay unit. Therefore, this modelling is taken into account explicitly in the computation method. The computation knows the delay of each input frame when these frames are used in the computation. The weights of each frame in the computation are thus optimized based not only on their values but also on their delay. In a DNN, on the other hand, delayed frames are used but their use in the computation is implicit: the current frame and delayed frames are batched together and used to compute hidden activations. The computation does not know which frame is the current frame and which are the delayed frames. In a TDNN DAE, the enhanced features are generated in the training method by minimizing the square error between the features predicted from noisy input and the corresponding clean features. When noisy features are introduced to a trained TDNN DAE, the output enhanced features resemble a clean version of the input features. Context information can improve the output. If the delay (partly context information) is modelled explicitly, the predicted output (enhanced features) may be improved. Hence, temporal contexts are better modelled. For example, a TDNN is able to learn long-term temporal contexts from an input speech signal. In a TDNN, an activation in a hidden layer is computed from a limited number of nodes in the previous hidden/input layer. In a TDNN, an activation in a higher hidden layer is computed from a wider context compared to one in a lower hidden layer because the context is expanded at every layer.
This system processes the noisy input speech signal to produce the feature vectors, which are then processed by the second neural network to produce cleaner input features for the speech recognition DNN classifier.
Thus the ASR system may be based on a hybrid HMM/DNN ASR or a hybrid HMM/DNN ASR using a time-delay neural network (TDNN) de-noising auto-encoder (DAE) for feature enhancement for example.
The classifier generates a list of one or more automatic speech recognition hypotheses. Each hypothesis may have a corresponding probability.
The hypotheses may be used to generate a text signal corresponding to the input utterance. For example, the text signal may simply be the automatic speech recognition hypothesis having the highest probability. The text signal may be output to an interface that displays text to a user, for example a screen. Alternatively, the text signal may be transmitted to an external storage medium or a network.
Alternatively, the list of one or more automatic speech recognition hypotheses and corresponding probabilities may be inputted into a further stage of, e.g. a spoken dialogue system for example.
Prior to implementation, the classifier is trained for speech recognition. Figure 7 is a flow chart of a method of training an automatic speech recognition system for recognition of speech data, performed before implementation.
A corpus of data comprising a plurality of speech signals, in which the text corresponding to parts of the speech signal is labelled, is obtained. Parts of the speech may refer to e.g. words or phones (monophones or triphones). The data may be hand-labelled for example. The labels may be extracted from a transcription for example.
For example, the system could be trained using the Aurora-4 corpus. Aurora-4 is a medium vocabulary task based on the Wall Street Journal (WSJ0) corpus. A TDNN DAE may be trained on the Aurora-4 corpus for example.
Optionally, a multi-condition training set is used as well as a clean training set. Both clean and multi-condition training sets may be used to train a TDNN for example. The multi-condition data set may then be used to train the DNN for speech recognition, where the TDNN is used to enhance the features before they are inputted into the DNN. The multi-condition training set may be created by keeping some of the clean training set and replacing the rest by the same speech utterances which were simultaneously recorded by one of a number of different secondary microphones. A portion of the utterances for each may then be corrupted by the inclusion of different noises (e.g. airport, babble, car, restaurant, street, train) at a selected signal to noise ratio (SNR). Thus optionally, the training method comprises a prior step of generating the multi-condition training set by adding noise to part of the utterances in the clean training set.
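For illustration, corrupting an utterance with noise at a selected SNR can be sketched as below; the scaling rule is the standard SNR definition, and which utterances, noises and microphones are used follows the description above rather than this sketch.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Corrupt a clean utterance with a noise recording at a chosen SNR.

    The noise is tiled or truncated to the utterance length and scaled so
    that 10*log10(P_speech / P_noise) equals snr_db.
    """
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```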
For each utterance in the training set, the feature vectors are then extracted in the same manner as they are extracted during implementation, i.e. as described in steps S201 to S204 above.
Thus in S702, the input speech signal from the training set is filtered using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal by a different frequency dependent gain and outputs a time domain signal, such that there is an output time domain signal from each filter in the filter bank.
S703 comprises extracting a sub-band temporal envelope from the output time domain signal from each filter in the filter bank. This may involve full wave rectification of each sub-band signal, followed by low pass filtering.
S704 comprises framing the sub-band temporal envelopes.
In S705, for each utterance in the corpus, a feature vector is extracted for each frame in the utterance. Each feature vector comprises a feature coefficient extracted from the frame of the sub-band temporal envelope of the output time domain signal from each filter in the filter bank. Additional feature coefficients may be included.
Extracting the feature vector may comprise generating a power signal from each subband temporal envelope and for each power signal, averaging the power signal values over the frame. These power average values may then be root compressed.
As described above, extracting the feature vector for a frame may further comprise using a de-noising neural network, for example a time delay neural network, to enhance the features before inputting them into the classifier, for example a deep neural network based classifier. In this case, the de-noising neural network is trained initially, before being used to enhance the features from the output of S705 during training of the classifier.
S706 comprises training the speech recognition classifier using the feature vectors and the text labels. The classifier may be a DNN based classifier. The training may be performed using a DNN training tool.
Figure 9(a) shows the training of a TDNN DAE, which can then be used during training of a DNN for speech recognition. Figure 9(b) shows the training of a DNN for speech recognition. The training data may be clean training data. Figure 9(c) shows the training of a DNN for speech recognition using features extracted from multi-condition training data. In Figure 9(c), enhanced features are used to train the DNN. The features are enhanced using an already trained TDNN, i.e. one trained as shown in Figure 9(a).
Optionally, a training recipe such as that disclosed in G. Dahl et al., “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012, the contents of which are hereby incorporated by reference, is used to train the DNN. The training may be performed using the Kaldi speech recognition toolkit disclosed in D. Povey et al., “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU 2011, Hawaii, USA, December 2011, the contents of which are hereby incorporated by reference. A DNN architecture which includes 7 hidden layers, each layer comprising 2048 nodes, may be used.
The classifier may be a DNN-HMM based classifier, in which a HMM-GMM system is used to generate state alignment information. Figure 9(b) shows training of a DNN based on the extracted feature vectors for such a system. The state alignment is obtained from the speaker adaptive training (SAT) HMM-GMM system, which has been trained beforehand on multi-condition training data using, e.g., MFCC features. Optionally, the HMM-GMM system comprises context-dependent HMMs with 2298 senones and 16 Gaussians per state, trained using maximum likelihood estimation. The input features are 39-dimensional Mel frequency cepstral coefficient (MFCC) features (static plus first and second order delta features). Cepstral mean normalization is performed on the features. The state alignments indicate which HMM state a feature vector belongs to. This information is used in the training of the DNN for speech recognition, and particularly in training the acoustic models. The state alignments may be obtained by automatic forced alignment on the training data. The training corpus may provide transcriptions which are used in the forced alignment. The DNN and the HMM/GMM system may be trained on the same training data.
Where a DAE is used for feature enhancement, the DAE may also be trained beforehand. Figure 8 shows a flow chart of an example training method for such a system. The DAE is trained for feature enhancement initially. The DAE is trained using multi-condition and clean training data. The HMM-GMM is also trained initially and separately to the DAE. The HMM-GMM is trained to generate state alignment information. The speech recognition DNN based classifier is then trained using the state alignment information from the trained HMM-GMM, where the trained DAE is used to enhance the features extracted from the multi-condition data before they are inputted into the speech recognition DNN. The HMMs provide a topology (transition and emission probabilities) for generating an acoustic unit (word, sub-word, phones, etc). The transition probabilities are obtained during training of the DNN using the trained HMM-GMM system. The emission probabilities are obtained with the DNN.
The DAE may be a TDNN and may be trained using a back-propagation learning algorithm based on square error criterion, given input features extracted from noisy training speech and output features extracted from clean training speech. Optionally, training a TDNN DAE comprises extracting the feature vectors from both clean and multi-condition training data as described above. The input features are those extracted as described previously, e.g. in S705. The output features are the same type of features, but are enhanced. Figure 9(a) shows training of a TDNN DAE.
The back propagation algorithm adjusts the TDNN’s link weights to realize the feature enhancement mapping. The cost function that is used by the back-propagation algorithm may be a square error measure between referenced clean output features and the actual TDNN’s output. The actual TDNN’s output may be computed from noisy input features using actual TDNN weights. On every presentation of learning samples, each weight may be updated in an attempt to decrease this square error measure.
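The square-error training of the DAE can be illustrated with the simplified sketch below. It uses a plain feed-forward network in PyTorch as a stand-in for the TDNN; the layer sizes, optimiser and learning rate are illustrative assumptions, whereas the patent's DAE is a TDNN trained with back-propagation on a square-error criterion as described above.

```python
import torch
import torch.nn as nn

# Simplified stand-in for the TDNN DAE: a feed-forward network trained to map
# features extracted from noisy (multi-condition) speech to the corresponding
# clean features by minimising the square error.  Dimensions, optimiser and
# learning rate are illustrative assumptions.
feat_dim = 123 * 11          # e.g. 41 static + delta features, 11-frame splice
dae = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 123),    # enhanced (clean-like) centre-frame features
)
optimiser = torch.optim.SGD(dae.parameters(), lr=0.01)
criterion = nn.MSELoss()     # square-error criterion

def train_step(noisy_batch, clean_batch):
    # noisy_batch: (B, feat_dim) features from multi-condition training data;
    # clean_batch: (B, 123) features from the parallel clean recordings.
    optimiser.zero_grad()
    loss = criterion(dae(noisy_batch), clean_batch)
    loss.backward()          # back-propagation adjusts the link weights
    optimiser.step()         # each update decreases the square-error measure
    return loss.item()
```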
The trained TDNN is then used during the training of the DNN based recognition classifier, to enhance the features extracted from the training corpus. The enhanced features are then used to train the DNN classifier. Such a system is shown in Figure 9(c). Thus features extracted from multi-condition training data are enhanced by TDNN DAE before being used for training DNN. Figure 10 shows this in more detail.
During DNN training, the DNN may be initialized using layer-by-layer generative pre-training and then discriminatively trained using back-propagation based on a cross-entropy criterion. The DNNs may comprise 7 hidden layers, each layer having 2048 nodes. The activation function may be sigmoid. The softmax layer may comprise 2298 senones. The DNN training may be done in up to 18 epochs and stopped when the relative reduction of the cost function is lower than 0.001, for example. The initial learning rate may be 0.008 and be halved every time the relative improvement is lower than 0.01.
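The stopping rule and learning-rate schedule described above can be expressed as the following illustrative driver, where train_one_epoch is a hypothetical callback that trains for one epoch at the given learning rate and returns the value of the cost function.

```python
def run_training_schedule(train_one_epoch, max_epochs=18, initial_lr=0.008,
                          halve_below=0.01, stop_below=0.001):
    """Start at lr = 0.008, halve the rate whenever the relative improvement
    of the cost falls below 0.01, and stop after 18 epochs or when the
    relative reduction falls below 0.001, as described above."""
    lr, previous_cost = initial_lr, None
    for epoch in range(max_epochs):
        cost = train_one_epoch(lr)
        if previous_cost is not None:
            relative = (previous_cost - cost) / previous_cost
            if relative < stop_below:
                break              # converged: stop training
            if relative < halve_below:
                lr /= 2.0          # halve the learning rate
        previous_cost = cost
    return lr
```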
Examples
ASR experiments performed with Aurora-4 corpus are described in the following examples. All the data were sampled at 16 kHz.
The ASR systems were trained using Aurora-4 corpus. Two training sets were used: clean and multi-condition. Each set comprised 7138 utterances from 83 speakers. All the utterances in the clean training set were recorded by a primary Sennheiser microphone which is a close-talking microphone. The multi-condition training set was created by keeping half of the clean training set and replacing the other half by the same speech utterances, simultaneously recorded by one of a number of different secondary microphones. Seventy five percent of the utterances in each half were corrupted by six different noises (airport, babble, car, restaurant, street, and train) at 10-20 dB signal to noise ratio (SNR).
An evaluation set was derived from WSJ0 5K-word closed vocabulary test set which comprised 330 utterances spoken by 8 speakers. This test set was recorded by the primary microphone and a secondary microphone. 14 test sets were created by corrupting these two sets by the same six noises used in the training set at 5-15 dB SNR. Thus the types of noises were matched across training and test sets but the SNRs of the data were partially mismatched. These 14 test sets were grouped into 4 subsets: clean, noisy, clean with channel distortion, and noisy with channel distortion, which will be referred to as A, B, C, and D, respectively.
Example 1
A DNN was trained according to a standard training recipe using a speech recognition toolkit. The state alignment was obtained from a speaker adaptive training (SAT) HMM/GMM system, trained on the clean training data using MFCC features. STE features were extracted from the clean training data as described in relation to Figure 3 above. 40 band-pass filters were used in a Gammatone filter-bank. A cut-off frequency of 50 Hz was used for the low-pass filter. The frame’s total energy was used as an additional coefficient. The dimension of the static STE features was thus 41. The DNN trained in this manner is referred to as STE (DNNC).
In the DNN training, a context window of 11 frames was used. In the training of both the HMM/GMM and HMM/DNN systems, utterance-level mean normalization was performed on the static features. First and second-order delta features were then appended. The DNNs were initialized using layer-by-layer generative pre-training and then discriminatively trained using back-propagation based on the cross-entropy criterion. The DNNs comprised 7 hidden layers, each layer having 2048 nodes. The activation function was sigmoid. The softmax layer had 2298 senones. The DNN training was run for up to 18 epochs and was stopped when the relative reduction of the cost function was lower than 0.001. The initial learning rate was 0.008 and was halved every time the relative improvement was lower than 0.01. Decoding was performed with the task-standard WSJ0 bi-gram language model.
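For concreteness, the utterance-level mean normalization, the appended delta features and the 11-frame context window can be sketched in Python as follows; the delta regression window of ±2 frames and the edge padding are assumptions, and the helper names are hypothetical.

import numpy as np

def mean_normalize(feats):
    # Utterance-level mean normalization of static features (T x D).
    return feats - feats.mean(axis=0, keepdims=True)

def deltas(feats, window=2):
    # Standard delta regression over an assumed window of +/-2 frames.
    T = feats.shape[0]
    padded = np.pad(feats, ((window, window), (0, 0)), mode='edge')
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    return sum(n * (padded[window + n:window + n + T]
                    - padded[window - n:window - n + T])
               for n in range(1, window + 1)) / denom

def splice(feats, context=5):
    # Stack +/-context neighbouring frames: context=5 gives the 11-frame window.
    T = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[c:c + T] for c in range(2 * context + 1)])

# Typical preparation of the DNN input described above:
# static = mean_normalize(static_features)
# d1, d2 = deltas(static), deltas(deltas(static))
# dnn_input = splice(np.hstack([static, d1, d2]), context=5)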
Reference Example 1
The DNN was trained using FBANK features extracted from the clean training data. The DNN trained in this manner is referred to as FBANK (DNNC). The FBANK features were extracted as follows: the speech signal was first pre-emphasized using a filter with the transfer function H(z) = 1 − 0.97z⁻¹. Speech frames of 25 ms were then extracted every 10 ms and multiplied with Hamming windows. A discrete Fourier transform (DFT) was used to transform the speech frames into the spectral domain. Sums of the element-wise multiplication between the magnitude spectrum and a Mel-scale filter-bank were computed. The FBANK coefficients were obtained by taking the logarithm of these sums. A 40-channel Mel-scale filter-bank was used and the frame’s total energy coefficient was also appended to the feature vector, resulting in 41-dimensional static FBANK features. These FBANK features were extracted using the HTK toolkit.
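The FBANK pipeline just described can be sketched in Python as follows; the pre-emphasis coefficient, the 25 ms/10 ms Hamming-windowed framing, the magnitude spectrum, the 40-channel Mel-scale filter bank, the logarithm and the appended total energy follow the text, while the 512-point FFT, the 0-8 kHz placement of the Mel triangles and the use of log energy are assumptions. This is a sketch, not the HTK implementation, and the helper names are hypothetical.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_low=0.0, f_high=None):
    # Triangular Mel-scale filters on the rfft bins.
    f_high = f_high or fs / 2.0
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def fbank_features(x, fs=16000, n_filters=40, frame_len=0.025,
                   frame_shift=0.010, n_fft=512, preemph=0.97):
    x = np.append(x[0], x[1:] - preemph * x[:-1])        # H(z) = 1 - 0.97 z^-1
    flen, fshift = int(frame_len * fs), int(frame_shift * fs)
    window = np.hamming(flen)
    fb = mel_filterbank(n_filters, n_fft, fs)
    feats = []
    for start in range(0, len(x) - flen + 1, fshift):
        frame = x[start:start + flen] * window
        mag = np.abs(np.fft.rfft(frame, n_fft))           # magnitude spectrum
        logmel = np.log(fb @ mag + 1e-12)                  # log filter-bank sums
        energy = np.log(np.sum(frame ** 2) + 1e-12)        # appended total energy
        feats.append(np.append(logmel, energy))
    return np.array(feats)                                 # (frames, 41)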
Example 2
In this example, the system included a TDNN DAE. STEs were extracted from both clean and multi-condition training data in the same manner as described for Example 1 to train the TDNN DAE. The TDNN DAE training was done using back-propagation based on the square error criterion. A TDNN architecture for the DAE was used which allows flexible selection of the input contexts of each layer required to compute the output features at one time step. The TDNN architecture included the use of a p-norm non-linearity, which is a dimension-reducing non-linearity. The TDNN used a group size of 10 and the 2-norm. In all hidden layers the p-norm input and output dimensions were 3000 and 300, respectively.
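A minimal sketch of the p-norm non-linearity with a group size of 10 and p = 2, which maps a 3000-dimensional hidden activation to 300 outputs, is shown below in Python; the function name is hypothetical.

import numpy as np

def p_norm(x, group_size=10, p=2):
    # Split the activation vector into groups of `group_size` and replace
    # each group by its p-norm, reducing the dimension by that factor.
    return np.linalg.norm(x.reshape(-1, group_size), ord=p, axis=1)

h = np.random.randn(3000)   # pre-non-linearity activations of one hidden layer
print(p_norm(h).shape)      # (300,) -- the 3000 -> 300 reduction quoted above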
The input contexts of each layer required to compute an output activation define the TDNN architecture. Symmetric input contexts, with balanced left-right context, were used. A 6 hidden layer TDNN architecture with an input temporal context of [t-17, t+17] was applied. In this configuration, the highest hidden layer covered a context of 17 frames on the left and 17 frames on the right of the current frame. The layer-wise contexts of this architecture were {-3, 3}, {0}, {-2, 2}, {0}, {-4, 4}, {0}, {-8, 8}, where {-3, 3} means that the 7 frames at offsets -3, -2, -1, 0, 1, 2, 3 relative to the current frame are spliced together for the computation of hidden activations in the first hidden layer. Assuming that t1 and t2 are two positive integers, {-t1, t2} means that the two frames at offsets -t1 and t2 relative to the current frame are spliced to compute hidden activations in the corresponding hidden layer. {0} denotes a non-splicing hidden layer.
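As a quick check of the arithmetic, the total input context follows by summing the per-layer splicing extremes, for example:

# Layer-wise splicing offsets quoted above; each layer widens the temporal
# context by the extremes of its offsets, so the one-sided context of the
# whole network is the sum of the per-layer maxima.
contexts = [(-3, 3), (0, 0), (-2, 2), (0, 0), (-4, 4), (0, 0), (-8, 8)]
left = sum(-lo for lo, _ in contexts)
right = sum(hi for _, hi in contexts)
print(left, right)   # 17 17, i.e. an input context of [t-17, t+17]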
The DNN was then trained using STE features extracted from the clean training data as described in relation to Example 1 above, and inputted into the TDNN DAE. The features outputted from the TDNN DAE were then inputted to the DNN for training of the DNN. The DNN trained in this manner is referred to as STE (TDNN + DNNC).
Reference Example 2
This system was trained in the same manner as for Example 2, except that FBANK features were extracted instead of STE features. The DNN trained in this manner is referred to as FBANK (TDNN + DNNC).
Example 3
This system was trained in the same manner as for Example 1, except that features were extracted from the multi-condition data set. The DNN trained in this manner is referred to as STE (DNNM).
Reference Example 3
This system was trained in the same manner as for Reference Example 1, except that features were extracted from the multi-condition data set. The DNN trained in this manner is referred to as FBANK (DNNM).
Example 4
This system was trained in the same manner as for Example 2, except that features were extracted from the multi-condition data set before being inputted to the TDNN. Features extracted from the multi-condition training data can thus be enhanced by the TDNN DAE before being used to train the DNN. The DNN trained in this manner is referred to as STE (TDNN + DNNE).
Reference Example 4
This system was trained in the same manner as for Reference Example 2, except that features were extracted from the multi-condition data set before being inputted to the TDNN. Features extracted from the multi-condition training data can thus be enhanced by the TDNN DAE before being used to train the DNN. The DNN trained in this manner is referred to as FBANK (TDNN + DNNE).
Word error rates (WERs) for the above examples are shown in Tables 1 and 2 below.
Table 1 (Figure GB2560174A_D0022): word error rates for the systems trained on the clean training data.
Table 2 (Figure GB2560174A_D0023): word error rates for the systems trained on the multi-condition training data.
From Table 1, it can be seen that the average WER obtained with STE features is lower than that obtained with FBANK features in a hybrid HMM/DNNC system. The relative reduction of the average WER obtained with STE features is 30.8%. Compared to the respective HMM/DNNC system, a TDNN DAE + HMM/DNNC system reduces the WERs by a relative 52.3% and 47.7% for FBANK and STE features, respectively. In the TDNN DAE + HMM/DNNC system, combining the TDNN DAE and STE features provides a 24.2% relative reduction of WER compared to the combination of the TDNN DAE and FBANK features.
In Table 2, where multi-condition training data was used, the relative WER reduction obtained with STE features compared to FBANK features is 2.2% in the HMM/DNNM system (abbreviated as DNNM). The TDNN DAE reduces the WERs by a relative 4.6% and 10.3% for FBANK and STE features, respectively, in the TDNN DAE + HMM/DNNE system. Combining the TDNN DAE and STE features provides a 7.8% relative reduction of WER compared to the combination of the TDNN DAE and FBANK features.
STE features alone gave relative WER reductions of between 2.2% and 30.8% compared to FBANK features. The TDNN DAE provided relative WER reductions of between 4.6% and 52.3%. The relative WER reductions obtained by combining the TDNN DAE with STE features, compared to combining it with FBANK features, range from 7.8% to 24.2%.
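For clarity, the relative reductions quoted in this section are assumed to follow the usual convention (baseline WER minus system WER, divided by the baseline WER), which the small Python helper below makes explicit.

def relative_reduction(wer_baseline, wer_system):
    # Relative WER reduction of `wer_system` over `wer_baseline`, in percent.
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

# e.g. a drop from 10.0% to 9.0% WER is a 10.0% relative reduction.
print(relative_reduction(10.0, 9.0))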
The above described methods use sub-band temporal envelope features for automatic speech recognition, and in particular, for deep neural network based automatic speech recognition. These features are robust to environmental noise and therefore provide improved ASR performance.
Energy peaks in frequency bands of the speech signal reflect the resonant properties of the vocal tract and thus provide acoustic information on the production of speech sounds. This information can be extracted in the time domain from the sub-band temporal envelopes (STEs), so the extracted features carry that acoustic information.
The STEs may be extracted by full-wave rectification and low-pass filtering of band-passed speech using a Gammatone filter-bank. By using full-wave rectification and low-pass filtering to extract the STEs, the bandwidths of the STEs, as well as their temporal resolution, can be controlled by the cut-off frequency of the low-pass filter.
Temporal context information from speech is extracted by the STE features. This contextual information is useful for neural networks which model long temporal context.
Sub-band temporal envelope features are thus effective in deep neural network-based automatic speech recognition, especially when combined with a time-delay neural network de-noising auto-encoder.
While certain arrangements have been described, these arrangements have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made.

Claims (20)

CLAIMS:
1. An automatic speech recognition system, comprising:
an input for receiving a speech signal; and a processor configured to:
filter an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extract a temporal envelope from the output time domain signal from each filter in the filter bank;
frame the temporal envelopes;
extract a feature vector for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
input the feature vectors into a deep neural network based classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
2. The system of claim 1, wherein extracting the temporal envelope comprises:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals.
3. An automatic speech recognition system, comprising:
an input for receiving a speech signal; and a processor configured to:
filter an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extract a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals;
frame the temporal envelopes;
extract a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
input the feature vectors into a classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
4. The system of any preceding claim, wherein each filter in the filter bank is a bandpass filter and wherein each band pass filter has a different centre frequency.
5. The system of claim 4, wherein the centre frequencies for the band pass filters are linearly spaced on an equivalent rectangular bandwidth scale.
6. The system of any preceding claim, wherein the filter bank is a Gammatone filter bank.
7. The system of any preceding claim, wherein each filter in the filter bank modifies the input speech signal by performing a convolution of the input speech signal with a filter impulse response function in the time domain.
8. The system of any preceding claim, wherein extracting the feature vector for a frame comprises:
generating a power signal using each temporal envelope;
for each power signal, averaging the power signal values over the frame.
9. The system of claim 8, wherein extracting the feature vector for a frame further comprises:
root compressing each average power value.
10. The system of any preceding claim, wherein the processor is further configured to:
apply a pre-emphasis filter to the input speech signal before the input speech signal is filtered using the filter bank.
11. The system of any preceding claim, wherein extracting the feature vector for a frame further comprises using a de-noising neural network to enhance the features before inputting them into the classifier.
12. The system of claim 11, wherein the de-noising neural network is a time delay neural network.
13. The system of any preceding claim, further comprising:
an output for outputting a text signal, wherein the processor is further configured to determine a sequence of text from the one or more automatic speech recognition hypotheses and output the text at the output.
14. A feature extraction system, comprising:
an input for receiving a speech signal; and a processor configured to:
filter an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extract a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals;
frame the temporal envelopes;
extract a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank.
15. A method of training an automatic speech recognition system for recognition of speech data, the method comprising:
obtaining a corpus of data comprising a plurality of speech signals in which the text corresponding to parts of the speech signal is labelled;
filtering each input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
training a deep neural network based classifier using the feature vectors and the text labels.
16. A method of training an automatic speech recognition system for recognition of speech data, the method comprising:
obtaining a corpus of data comprising a plurality of speech signals in which the text corresponding to parts of the speech signal is labelled;
filtering each speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals to remove a high frequency component;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
training a classifier using the feature vectors and the text labels.
17. A feature extraction method, comprising:
filtering an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals to remove a high frequency component;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank.
18. An automatic speech recognition method, comprising:
filtering an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
inputting the feature vectors into a deep neural network based classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
19. An automatic speech recognition method, comprising:
filtering an input speech signal using a filter bank comprising a plurality of filters, wherein each filter in the filter bank modifies the input speech signal in the time domain by a different frequency dependent gain, the filter bank outputting a time domain signal from each filter;
extracting a temporal envelope from the output time domain signal from each filter in the filter bank, extracting a temporal envelope comprising:
full-wave rectifying the output time domain signal from each filter in the filter bank;
low pass filtering each of the rectified signals to remove a high frequency component;
framing the temporal envelopes;
extracting a feature vector from the temporal envelopes for each frame, wherein each feature vector comprises a feature coefficient extracted from the frame of the temporal envelope of the output time domain signal from each filter in the filter bank;
inputting the feature vectors into a classifier, the classifier generating one or more automatic speech recognition hypotheses corresponding to the input speech signal.
20. A carrier medium comprising computer readable code configured to cause a computer to perform the method of any of claims 14 to 19.
GB1703310.1A 2017-03-01 2017-03-01 Training an automatic speech recognition system Active GB2560174B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1703310.1A GB2560174B (en) 2017-03-01 2017-03-01 Training an automatic speech recognition system
GB1913909.6A GB2577997B (en) 2017-03-01 2017-03-01 A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1703310.1A GB2560174B (en) 2017-03-01 2017-03-01 Training an automatic speech recognition system

Publications (3)

Publication Number Publication Date
GB201703310D0 GB201703310D0 (en) 2017-04-12
GB2560174A true GB2560174A (en) 2018-09-05
GB2560174B GB2560174B (en) 2020-09-23

Family

ID=58544220

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1703310.1A Active GB2560174B (en) 2017-03-01 2017-03-01 Training an automatic speech recognition system

Country Status (1)

Country Link
GB (1) GB2560174B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065033A (en) * 2018-09-19 2018-12-21 华南理工大学 A kind of automatic speech recognition method based on random depth time-delay neural network model
CN109256127A (en) * 2018-11-15 2019-01-22 江南大学 A kind of Robust feature extracting method based on non-linear power transformation Gammachirp filter
CN110265053A (en) * 2019-06-29 2019-09-20 联想(北京)有限公司 Signal de-noising control method, device and electronic equipment
GB2577909A (en) * 2018-10-10 2020-04-15 Symetrica Ltd Gamma-ray spectrum classification
CN111653267A (en) * 2020-03-31 2020-09-11 因诺微科技(天津)有限公司 Rapid language identification method based on time delay neural network
TWI719385B (en) * 2019-01-11 2021-02-21 緯創資通股份有限公司 Electronic device and voice command identification method thereof
CN114629515A (en) * 2020-12-14 2022-06-14 通用汽车环球科技运作有限责任公司 High resolution radio using neural networks
GB2617366A (en) * 2022-04-06 2023-10-11 Nokia Technologies Oy Apparatus, methods and computer programs for noise suppression

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689906A (en) * 2019-11-05 2020-01-14 江苏网进科技股份有限公司 Law enforcement detection method and system based on voice processing technology
CN113192491B (en) * 2021-04-28 2024-05-03 平安科技(深圳)有限公司 Acoustic model generation method, acoustic model generation device, computer equipment and storage medium
CN115099294A (en) * 2022-03-21 2022-09-23 昆明理工大学 Flower image classification algorithm based on feature enhancement and decision fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4032710A (en) * 1975-03-10 1977-06-28 Threshold Technology, Inc. Word boundary detector for speech recognition equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5185848A (en) * 1988-12-14 1993-02-09 Hitachi, Ltd. Noise reduction system using neural network
US8442821B1 (en) * 2012-07-27 2013-05-14 Google Inc. Multi-frame prediction for hybrid neural network/hidden Markov models
US20140257804A1 (en) * 2013-03-07 2014-09-11 Microsoft Corporation Exploiting heterogeneous data in deep neural network-based speech recognition systems
US20160240190A1 (en) * 2015-02-12 2016-08-18 Electronics And Telecommunications Research Institute Apparatus and method for large vocabulary continuous speech recognition

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065033B (en) * 2018-09-19 2021-03-30 华南理工大学 Automatic speech recognition method based on random deep time delay neural network model
CN109065033A (en) * 2018-09-19 2018-12-21 华南理工大学 A kind of automatic speech recognition method based on random depth time-delay neural network model
GB2577909B (en) * 2018-10-10 2020-11-18 Symetrica Ltd Gamma-ray spectrum classification
US11500112B2 (en) 2018-10-10 2022-11-15 Symetrica Limited Gamma-ray spectrum classification
GB2577909A (en) * 2018-10-10 2020-04-15 Symetrica Ltd Gamma-ray spectrum classification
CN109256127B (en) * 2018-11-15 2021-02-19 江南大学 Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN109256127A (en) * 2018-11-15 2019-01-22 江南大学 A kind of Robust feature extracting method based on non-linear power transformation Gammachirp filter
TWI719385B (en) * 2019-01-11 2021-02-21 緯創資通股份有限公司 Electronic device and voice command identification method thereof
CN110265053A (en) * 2019-06-29 2019-09-20 联想(北京)有限公司 Signal de-noising control method, device and electronic equipment
CN110265053B (en) * 2019-06-29 2022-04-19 联想(北京)有限公司 Signal noise reduction control method and device and electronic equipment
CN111653267A (en) * 2020-03-31 2020-09-11 因诺微科技(天津)有限公司 Rapid language identification method based on time delay neural network
CN114629515A (en) * 2020-12-14 2022-06-14 通用汽车环球科技运作有限责任公司 High resolution radio using neural networks
US20220190943A1 (en) * 2020-12-14 2022-06-16 GM Global Technology Operations LLC High-resolution radio using neural networks
US11848748B2 (en) * 2020-12-14 2023-12-19 GM Global Technology Operations LLC High-resolution radio using neural networks
GB2617366A (en) * 2022-04-06 2023-10-11 Nokia Technologies Oy Apparatus, methods and computer programs for noise suppression

Also Published As

Publication number Publication date
GB2560174B (en) 2020-09-23
GB201703310D0 (en) 2017-04-12

Similar Documents

Publication Publication Date Title
GB2560174A (en) A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of train
Tan et al. Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios
Dave Feature extraction methods LPC, PLP and MFCC in speech recognition
Hermansky et al. RASTA processing of speech
Hirsch et al. A new approach for the adaptation of HMMs to reverberation and background noise
KR100908121B1 (en) Speech feature vector conversion method and apparatus
EP1891624B1 (en) Multi-sensory speech enhancement using a speech-state model
JP5127754B2 (en) Signal processing device
US20070276662A1 (en) Feature-vector compensating apparatus, feature-vector compensating method, and computer product
KR20160125984A (en) Systems and methods for speaker dictionary based speech modeling
Ganapathy et al. Temporal envelope compensation for robust phoneme recognition using modulation spectrum
CN108198566B (en) Information processing method and device, electronic device and storage medium
US20070055519A1 (en) Robust bandwith extension of narrowband signals
Di Persia et al. Objective quality evaluation in blind source separation for speech recognition in a real room
Kaur et al. Optimizing feature extraction techniques constituting phone based modelling on connected words for Punjabi automatic speech recognition
Han et al. Reverberation and noise robust feature compensation based on IMM
Marti et al. Automatic speech recognition in cocktail-party situations: A specific training for separated speech
Acero et al. Speech/noise separation using two microphones and a VQ model of speech signals.
GB2577997A (en) A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method
Sarikaya Robust and efficient techniques for speech recognition in noise
Akter et al. A tf masking based monaural speech enhancement using u-net architecture
Hirsch et al. A new HMM adaptation approach for the case of a hands-free speech input in reverberant rooms
Mammone et al. Robust speech processing as an inverse problem
Chen et al. Robust MFCCs derived from differentiated power spectrum
Oh et al. Preprocessing of independent vector analysis using feed-forward network for robust speech recognition