GB2463908A - Speech recognition utilising a hybrid combination of probabilities output from a language model and an acoustic model. - Google Patents


Info

Publication number
GB2463908A
Authority
GB
Grant status
Application
Patent type
Prior art keywords
codewords
gaussians
vector
speech
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0817821A
Other versions
GB0817821D0 (en)
GB2463908B (en)
Inventor
Anton Ragni
Kean Kheong Chin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Research Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules

Abstract

A method, apparatus and computer program for speech recognition comprising receiving speech input and using it to predict sequences of words likely to match the received speech input using an acoustic model and a language model; the likelihoods of the predicted sequences of words matching the received speech input, output from the two models, are combined to output a sequence of identified words. The speech input comprises a sequence of observations, and the likelihood of a sequence of words arising from the sequence of observations is determined using an acoustic model. In the acoustic model each observation is converted into an input vector in acoustic space. The acoustic model represents words as points in acoustic space, each point representing a probability density function (pdf) of a word; the pdfs are usually Gaussians. Within the acoustic model, the pdfs/Gaussians are clustered to form codewords using a Gaussian selection method known as vector quantization. The present invention selects the codewords closest to each input vector by comparing the distance of each codeword from the input vector with a predetermined distance and retaining codewords which are closer to the input vector than the predetermined distance.

Description

Speech Recognition Apparatus and Method The present invention relates to methods and systems for performing speech recognition. More specifically, the present invention is concerned with a speech recognition method and apparatus suitable for use in a large vocabulary continuous speech recognition (LVCSR) system.

In order to achieve real-time performance, conventional speech recognition systems make extensive use of various pruning strategies during decoding to limit the number of states for which likelihoods have to be computed. In the nearest neighbour approach, a state likelihood is approximated by the largest value of the probability density functions (pdfs) composing the state output distribution. Even though evaluation of a single pdf is completely sufficient for any given state, all pdfs are still computed and the largest one is selected.

To overcome this problem a technique called Gaussian Selection (GS) has been proposed (since the pdfs in most recognition systems are Gaussians). Upon receiving the next observation vector, a GS system explicitly marks which pdfs will be calculated and which will be skipped. One popular approach to GS is based on the well-known technique called vector quantization (Bocchieri, "Vector quantization for efficient computation of continuous density likelihoods", Proceedings ICASSP, Vol. II, pp. 692-695, 1993). In this technique, Gaussians are clustered to form one or more codewords.

It is necessary to select the closest codewords in order to ensure that only the Gaussians which are most likely to give the best results are calculated. Generally, this has been performed by sorting. However, when the number of codewords exceeds a few thousand (as is typical for LVCSR), determining the N closest codewords for a given observation vector incurs an O(n log n) average-time complexity which becomes so time consuming that the benefit of using GS is practically lost.

The present invention attempts to address the above problem and in a first aspect provides a speech recognition method, comprising: receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, comprising: converting each observation into an input vector in acoustic space; determining the closest codewords to an input vector in said acoustic space, wherein each codeword comprises a cluster of points in said acoustic space, and each point represents the probability distribution of a word or part thereof being related to an observation; and determining the likelihood of a sequence of words arising from the sequence of observations using probabilities determined from said probability distributions; determining the likelihood of a sequence of observations occurring in a given language using a language model; and combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein determining said closest codewords comprises comparing the distance of each codeword from the observation vector with a predetermined distance, wherein codewords which are closer to the input vector than said predetermined distance are retained.

Thus, since the above method does not sort the results, it is computationally quicker.

Given an array of distances between the observation vector and each of the codewords, the problem of determining the top N distances by means of sorting has on average an O(n log n) time complexity. This complexity is reduced to just O(n) by redefining the underlying problem into an equivalent one: given an array containing n distances, find a value which is larger than N of the distances but smaller than the remaining n-N distances.

Further, the remaining N distances are not sorted; sorting is not used in any form to determine the top N codewords. The above method not only has smaller time complexity but also avoids time-consuming memory move operations: the operations counted in O(n) are simple additions and subtractions.
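The idea above can be sketched in Python (a hypothetical illustration, not code from the patent): the N smallest distances are collected by repeatedly splitting the current group at an estimated midpoint, keeping or discarding whole halves, so the retained values are never sorted. For clarity this sketch uses an exact median; the patent's preferred embodiment would substitute the O(n) streaming estimate described later.

```python
import statistics

def n_closest(distances, N):
    """Collect the N smallest distances, in no particular order,
    by repeated median splitting; the winners are never sorted."""
    group = list(distances)
    keep = []                                 # distances already known to be winners
    while len(keep) + len(group) > N:
        m = statistics.median(group)          # threshold estimate for this round
        lower = [d for d in group if d < m]
        upper = [d for d in group if d >= m]
        if not lower:                         # degenerate split (heavy ties)
            keep.extend(upper[:N - len(keep)])
            return keep
        if len(keep) + len(lower) >= N:
            group = lower                     # all remaining winners lie below m
        else:
            keep.extend(lower)                # the whole lower half is retained
            group = upper                     # search for the rest above m
    keep.extend(group)
    return keep
```

Each pass only compares values against a threshold, which is the source of the claimed speed advantage over a full sort.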

In an embodiment, the N closest codewords are determined and the predetermined distance is estimated to be the distance which allows the Nth closest codewords to be selected.

The N closest codewords may be determined by estimating the distance of the closest n/2 codewords from the input vector, where n is the total number of codewords and N < n/2, and splitting the codewords into two groups where the codewords with distances smaller than the distance of the n/2 closest codewords are retained, the process being repeated with the n/2 closest codewords until the closest N codewords are identified.

A preferred method of estimating the distance of the closest n/2 codewords from the input vector is by estimating the median.

There are many possible ways of estimating a median; one is to set the median value to the distance of one of said n codewords and then to compare the median value with each of the n codeword distances in turn, adjusting the median value as it is compared with each codeword dependent on the difference between the median and the codeword distance.

The above method is of particular use when N and n can be expressed as an integer power of 2. However, it may also be used when N or n cannot be expressed as an integer power of 2; in that case, said N values are determined by recursively dividing the n codewords until a sample of N' codewords is established, where N' is less than N, and then performing the process on the last group to be discarded to establish the N-N' codewords with the smallest distances out of this group to add to the N' codewords.

Other methods may also be used for selecting the top N, for example the method of Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T., "Numerical Recipes in C: The Art of Scientific Computing", Cambridge University Press, 2nd ed., pp. 341-345, 1992.

Generally, the acoustic model is a Hidden Markov Model, but other models are possible. Also, the probability distributions are usually Gaussian probability distributions.

When considering distance measures, it is preferable if the distance between a codeword mean c and the input vector o is based on a likelihood measure. More preferably, the distance between a codeword and an input vector is determined by:

d(o, c) = A + B Σ_i ((o_i - c_i) / σ_c,i)²

where σ_c is the codeword variance. A and B are constant parameters, where usually A is 0 and B is 1.

The clustering of points also preferably uses a distance measure based on likelihood values:

d(μ, c) = D + E Σ_i ((μ_i - c_i) / σ_c,i)²

where μ is the mean of a Gaussian or other pdf and σ_c,i is the codeword variance. D and E are constant parameters, where D is usually 0 and E is 1.

Other methods are discussed in M.J.F. Gales, K.M. Knill and S.J. Young, "State-Based Gaussian Selection in Large Vocabulary Continuous Speech Recognition using HMMs", IEEE Transactions on Speech and Audio Processing, 1999.

During the recognition process the decoder will usually request a likelihood value which requires a Gaussian that has not been marked for calculation. Pruning a state left without selected Gaussians generally leads to a degradation in accuracy. If a constant value is estimated and used as a back-off value, then if there is a mismatch between development and testing conditions the system can behave unpredictably.

If the likelihood of a random Gaussian is used as a back-off value, then in most cases it will be underestimated. Forcing each codeword to contain at least one Gaussian from each state leads to a significant increase in the memory used to store Gaussian membership lists. In addition, if more than a single codeword is selected then in some cases all Gaussians will be calculated, reducing the efficiency of GS, which aims to calculate as few Gaussians as possible.

When a state likelihood is requested but the Gaussians belonging to it have not been marked or calculated, a state flooring occurs and a back-off strategy has to be employed to avoid state pruning.

In a preferred embodiment, codewords which are not in the closest codeword group but which are within a second predefined distance from the input vector are marked to be calculated on demand, if state likelihood scores needed by the acoustic model cannot be computed from codewords in the closest codeword group and require the probability from a probability density function in these further codewords.

Using the above, a state likelihood which comprises one or more probability distributions from said first group of codewords is calculated using only probability distributions from said first group, and a state likelihood which comprises one or more probability distributions from said back-off group of codewords and none from said first group is calculated using only probability distributions from said back-off group. The second predefined distance is preferably estimated to allow the Mth closest codewords to be identified, where M > N.

Thus, all pdfs (which will be referred to as Gaussians from here on) which lie within a first distance from the input vector will be marked for calculation. These Gaussians will then be calculated if a state likelihood is calculated which requires Gaussians from the first group. In a preferred embodiment, the largest of these Gaussians will be selected as the state likelihood.

Pdfs or Gaussians which are in the second or back-off group will only be calculated if they are required for a state which has no Gaussians in the first group. If there is a state which has both Gaussians in the first group and the second group, then only Gaussians in the first group will be calculated and Gaussians belonging to the second group will be ignored.

Thus the above method extends the group of Gaussians which are marked for calculation if the state contains no Gaussians which lie in the first group.
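The two-level back-off described above can be sketched as a small Python function (a hypothetical illustration; the function name and the `log_lik` callable for evaluating a single Gaussian are assumptions, not the patent's API):

```python
def state_likelihood(state_gaussians, first_group, backoff_group,
                     log_lik, floor=-1e10):
    """Score a state under the two-level back-off: use first-group
    Gaussians when the state has any; otherwise fall back to its
    back-off-group Gaussians; otherwise return a constant floor."""
    marked = [g for g in state_gaussians if g in first_group]
    if not marked:
        marked = [g for g in state_gaussians if g in backoff_group]
    if not marked:
        return floor                       # constant back-off value
    # nearest-neighbour approximation: largest component wins
    return max(log_lik(g) for g in marked)
```

Note that a state with Gaussians in both groups scores from the first group only, matching the behaviour described above.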

By carefully tuning the top N and top M values in the double-level state back-off it is possible to achieve: the smallest number of codewords that needs to be selected; the smallest number of codewords that needs to participate in the back-off stage; the largest number of states that can be floored free of any costs (word accuracy, etc.); and the smallest number of Gaussians that needs to be calculated; all in the framework of the new technology described so far.

The above method may be used for speech to speech translation. Such a method comprises recognising a speech signal as described above; translating said recognised speech signal into a different language; and outputting said recognised speech in said different language. Outputting said speech may comprise using a text to speech conversion method.

In a second aspect, the present invention provides a speech recognition apparatus, said apparatus comprising: a receiver for a speech input, said speech input comprising a sequence of observations; a processor adapted to determine the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said processor being adapted to convert each observation into an input vector in acoustic space; determine the closest codewords to an input vector in said acoustic space, wherein each codeword comprises a cluster of points in said acoustic space, and each point represents the probability distribution of a word or part thereof being related to an observation; and determine the likelihood of a sequence of words arising from the sequence of observations using probabilities determined from said probability distributions; the processor being further adapted to determine the likelihood of a sequence of observations occurring in a given language using a language model, and to combine the likelihoods determined by the acoustic model and the language model; the apparatus further comprising an output adapted to output a sequence of words identified from said speech input signal, wherein determining said closest codewords comprises comparing the distance of each codeword from the observation vector with a predetermined distance, wherein codewords which are closer to the input vector than said predetermined distance are retained.

In a third aspect, the present invention provides a method of determining an indication of the probability that a speech input corresponds to a word or sequence of words, the method comprising: receiving a speech input which comprises a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, comprising: converting each observation into an input vector in acoustic space; determining the closest codewords to an input vector in said acoustic space, wherein each codeword comprises a cluster of points in said acoustic space, and each point represents the probability distribution of a word or part thereof being related to an observation; and determining the likelihood of a sequence of words arising from the sequence of observations using probabilities determined from said probability distributions, wherein determining said closest codewords comprises comparing the distance of each codeword from the observation vector with a predetermined distance wherein codewords which are closer to the input vector than said pre-determined distance are retained.

In a fourth aspect, the present invention provides a computer running a computer program configured to cause the computer to perform any of the above methods.

The present invention has been described with reference to a speech system, but the method of determining N extreme values has wider implications. Therefore, in a fifth aspect, the present invention provides a computer-implemented method of determining N extreme values from a sample of n values, where N < n/2, the method comprising: splitting the values into two groups dependent on whether a sample value is higher or lower than an estimated midpoint of said sample values; selecting the group of samples dependent on whether the extreme values required are the highest N values or the lowest N values; and repeating the process until the sample size reaches N values.

The present invention will now be described with reference to the following non-limiting embodiments, in which:

Figure 1 is a schematic of a speech recognition system;

Figure 2 is a schematic of a processor for use with the speech recognition system of figure 1;

Figure 3 is a schematic of a Gaussian distribution;

Figure 4 is a plot of acoustic space indicating the relationship of an input vector with Gaussians;

Figure 5 is a schematic of an acoustic space where the Gaussians are clustered;

Figure 6 is a flow diagram showing the preparation stage in a Gaussian selection procedure;

Figure 7 is a flow diagram showing the steps in the operation stage in accordance with an embodiment of the present invention;

Figure 8 is a schematic of a median method in accordance with an embodiment of the present invention;

Figure 9 is a schematic illustrating a back-off strategy in accordance with an embodiment of the present invention;

Figure 10 is a plot of an acoustic space showing a back-off strategy in accordance with an embodiment of the present invention; and

Figure 11 is a schematic plot of the relative advantages of a distance measure in accordance with an embodiment of the present invention.

Figure 1 is a schematic of a very basic speech recognition system. A user (not shown) speaks into microphone 1 or another collection device for an audio system. The device 1 could be substituted by a memory which contains audio data previously recorded, or the device 1 may be a network connection for receiving audio data from a remote location.

The speech signal is then directed into a speech processor 3 which will be described in more detail with reference to figure 2.

The speech processor 3 takes the speech signal and turns it into text corresponding to the speech signal. Many different forms of output are available. For example, the output may be in the form of a display 5 which outputs to a screen. Alternatively, the output could be directed to a printer or the like. Also, the output could be in the form of an electronic signal which is provided to a further system 9. For example, the further system 9 could be part of a speech translation system which takes the outputted text from processor 3, converts it into a different language, and outputs it via a further text or speech system.

Alternatively, the text outputted by the processor 3 could be used to operate different types of equipment, for example, it could be part of a mobile phone, car etc. where the user controls various functions via speech.

Figure 2 is a block diagram of the standard components of a speech recognition processor 3 of the type shown in figure 1. The speech signal received from the microphone, through a network, or from a recording medium 1 is directed into front-end unit 11.

Front end unit 11 digitises the received speech signal and splits it into frames of equal length. The speech signals are then subjected to a spectral analysis to determine various parameters which are plotted in an "acoustic space". The parameters which are derived will be discussed in more detail later.

The front end unit also removes signals which are not believed to be speech signals and other irrelevant information. Popular front end units comprise apparatus which use filter bank (FBANK) parameters, Mel-frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) parameters. The output of the front end unit is in the form of an input vector in n-dimensional acoustic space.

The input vector is then fed into decoder 13 which cooperates with both an acoustic model section 15 and a language model section 17. The acoustic model section 15 will generally operate using Hidden Markov Models. However, it is also possible to use acoustic models based on connectionist models and hybrid models.

The acoustic model unit 15 derives the likelihood of a sequence of observations corresponding to a word or part thereof on the basis of the acoustic input alone.

The language model section 17 contains information concerning probabilities of a certain sequence of words or parts of words following each other in a given language.

Generally a static model is used. The most popular method is the N-gram model.

The decoder 13 then traditionally uses a dynamic programming (DP) approach to find the best transcription for a given speech utterance using the results from the acoustic model 15 and the language model 17.
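The patent only states that the two likelihoods are combined; a conventional way HMM decoders do this is a log-linear combination with a language model scale factor and word insertion penalty. The sketch below illustrates that convention (the function name and default weights are illustrative assumptions, not taken from the patent):

```python
import math

def combined_score(acoustic_log_lik, lm_prob, lm_weight=10.0,
                   word_penalty=0.0):
    """Log-linear combination of an acoustic log-likelihood with a
    language model probability, as conventionally used in decoding."""
    return acoustic_log_lik + lm_weight * math.log(lm_prob) + word_penalty
```

The decoder would then pick the word sequence maximising the sum of such scores along a path.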

This is then output via the output device 19 which allows the text to be displayed, presented or converted for further use e.g. in speech to speech translation or to control a voice activated device.

This description will be mainly concerned with the use of an acoustic model which is a Hidden Markov Model (HMM). However, the approach could also be used with other models.

The actual model used in this embodiment is a standard model, the details of which are outside the scope of this patent application. However, the model will require the provision of probability density functions (pdfs) which relate to the probability of an observation represented by an acoustic vector being related to a word or part thereof.

Generally, this probability distribution will be a Gaussian distribution in n-dimensional space.

A schematic example of a generic Gaussian distribution is shown in figure 3. Here, the horizontal axis corresponds to a parameter of the input vector in one dimension and the probability distribution is for a particular word or part thereof relating to the observation. For example, in figure 3, an observation corresponding to acoustic vector x has a probability p1 of corresponding to the word whose probability distribution is shown in figure 3. The shape and position of the Gaussian are defined by its mean and variance. These parameters are determined during training of the acoustic model for the vocabulary; they will be referred to as the "model parameters".

In an HMM, once the model parameters have been determined, the model can be used to determine the likelihood of a sequence of observations corresponding to a sequence of words. Figure 4 is a schematic plot of acoustic space where an observation is represented by an observation vector or feature vector x1. The open circles g correspond to the means of Gaussians or other probability distribution functions plotted in acoustic space.

During decoding, the acoustic model will calculate a number of different likelihoods that the feature vector x1 corresponds to a word or part thereof represented by the Gaussians. These likelihoods are then used in the acoustic model and combined with probabilities from the language model to determine the text spoken.

However, in a real-time speech recognition system with a large vocabulary, it is not possible to calculate the probability of the utterance expressed by acoustic vector x1 corresponding to each of the words or parts thereof represented by the Gaussians in figure 4. Therefore, various pruning strategies have been used which allow calculation of only certain Gaussians.

It is not feasible from a computing point of view to establish the Gaussians which are actually closest to the input vector x1. A particularly successful method of Gaussian selection is shown in figure 5. Here, during the preparation stage, the Gaussians are clustered together into a plurality of different codewords. Figure 5 schematically shows two codewords c1 and c2 which lie reasonably close to acoustic vector x1. The Gaussians in codeword c1 are clustered together and a centroid o1 of the codeword is calculated, represented by the filled middle circle. This centroid is a Gaussian pdf with the mean and variance estimated from the means and variances of the underlying Gaussians assigned to the given codeword. Another codeword c2 with centroid o2 is also shown in figure 5.

In a typical clustering algorithm, a Gaussian is assigned to a codeword only if it gives the smallest distortion out of all the codewords. Thus, each Gaussian belongs to only one codeword. Once the clustering procedure converges, the Gaussian assignment along with the mean and variance is stored for each codeword separately. It should be noted that this process takes place before the system is used for speech recognition.
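The assignment step can be sketched as follows (hypothetical Python; `distance` stands for whichever distortion measure is chosen, such as the likelihood-based one given later):

```python
def assign_to_codewords(gaussian_means, codeword_means, distance):
    """Assign each Gaussian to the single codeword giving the smallest
    distortion, producing the membership lists stored at preparation
    time. Returns a dict: codeword index -> list of Gaussian indices."""
    members = {c: [] for c in range(len(codeword_means))}
    for g, mu in enumerate(gaussian_means):
        best = min(range(len(codeword_means)),
                   key=lambda c: distance(mu, codeword_means[c]))
        members[best].append(g)
    return members
```

In a full clustering loop this assignment would alternate with re-estimating each codeword's mean and variance until convergence.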

Figure 6 is a flow diagram showing the preparation stage in more detail. As mentioned above, this preparation takes place before the user speaks into the system.

In step S21, all the probability density functions or Gaussians are pooled together in a single codebook. Next, in step S23, a number of codewords is selected. Generally, this is based on previous experience, common practice, results from experiments, development data sets, etc. A clustering algorithm is then used to assign each Gaussian to a single codeword in step S25; the assignment is made on the basis of the distortion between the codeword and the Gaussian. This leads to step S27, where each pdf is assigned to a single codeword. The codeword mean and variance, which are based on the Gaussians belonging to that codeword, are then saved in step S29, and finally the pdf membership lists are saved in step S31 so that each Gaussian associated with a codeword is known.

The operation stage is shown in figure 7. At step S41, the next observation is inputted into the system. The observation is in the form of an acoustic vector, as explained with reference to figure 1. In the next stage, the distances of the codeword centroids from the acoustic vector are calculated in step S43.

Many systems exist for measuring distances and establishing the closest codewords, such as Euclidean, Mahalanobis, Bhattacharyya, and varieties of Kullback-Leibler divergence. However, since the ultimate goal is to select a Gaussian with the highest likelihood, it is advantageous to use a distance measure which is based on a likelihood calculation. A typical log-likelihood with a diagonal covariance can be expressed as:

log N(o; μ, σ) = -(1/2) Σ_i [ log(2π σ_i²) + ((o_i - μ_i) / σ_i)² ]    (1)

Using the assumption of orthogonality, the above summation can be re-written as:

Σ_i ((o_i - μ_i) / σ_c,i)² ≈ Σ_i ((μ_i - c_i) / σ_c,i)² + Σ_i ((o_i - c_i) / σ_c,i)²    (2)

where σ_c is a codeword variance and c a codeword mean. The first term on the right hand side is controlled and minimised by the clustering process and the second term by the quantisation. It is further assumed that the disjoint minimisation of both terms leads to the joint minimisation of their cross product, which would be required if (2) above were written as a strict equality and not an estimation.

Thus, the distance measures used for clustering and quantisation respectively are:

d(μ, c) = Σ_i ((μ_i - c_i) / σ_c,i)²    (3)

d(o, c) = Σ_i ((o_i - c_i) / σ_c,i)²    (4)

The above distance measures have a stronger correlation to the likelihoods than other methods for measuring distances in GS.

The above may also be expressed as:

d(μ, c) = A + B Σ_i ((μ_i - c_i) / σ_c,i)²

d(o, c) = D + E Σ_i ((o_i - c_i) / σ_c,i)²

where A, B, D and E are constants. Usually, A and D are 0 and B and E are 1.
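A minimal Python sketch of this likelihood-based distance (the function and parameter names are illustrative assumptions; `c_var` holds the codeword variances σ_c,i²):

```python
def vq_distance(o, c_mean, c_var, offset=0.0, scale=1.0):
    """Likelihood-based distance of equation (4):
    offset + scale * sum_i ((o_i - c_i)^2 / sigma_c_i^2).
    The same form serves equation (3) when o is a Gaussian mean."""
    return offset + scale * sum(
        (oi - ci) ** 2 / vi for oi, ci, vi in zip(o, c_mean, c_var)
    )
```

With offset 0 and scale 1 this reduces to the per-dimension squared error normalised by the codeword variance, which tracks the dominant term of the log-likelihood.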

In step S45, the N closest codewords to the input vector are identified.

Conventionally, this was performed by sorting the distances of the codewords into ascending order and selecting the top N. However, the present invention achieves this in a new, more efficient way which will be described with reference to figure 8.

Finally, in step S47, the Gaussians of the codewords which have been identified as the closest N in step S45 are calculated for use in the acoustic model.

As mentioned above, in conventional systems the distances of the codewords from the input vector are sorted into ascending order during the operation stage. This is computationally quite a heavy task and thus reduces the run speed of the operation.

In the present invention, the sorting procedure is not performed; instead, just the N closest codewords to the input vector are determined.

The method of achieving this is shown in figure 8, which is a simple example with just ten values. In reality, there may be 4000 or more codewords. In line (i), the ten values are shown. From these values, a median is estimated. There are many known methods for estimating medians.

In one method which may be used (Feldman, D. and Shavitt, Y., "An optimal median calculation algorithm for estimating internet link delays from active measurements", Proc. IEEE E2EMON, pp. 1-7, 2007), the median to be estimated is initialised with the first sample value (sample[0]). A constant is used to define the minimal initial step (minStep). If half of the first sample value is larger than the minimal step, then the current step (step) is set to 0.5 multiplied by sample[0]. Each subsequent sample value is compared with the median estimate, and the median estimate is increased or decreased by "step" depending on the outcome of the comparison. If the difference between a given sample value and the median estimate is smaller than the current step, then the step is halved. In the end, the estimated median value is stored in the median variable. The run-time complexity of such a median estimation algorithm is O(n).

Code for such a median selection program is as follows:

    median ← sample[0]
    minStep ← C
    step ← max(|sample[0]| * 0.5, minStep)
    for i from 1 to n-1
        if median > sample[i] then
            median ← median - step
        else if median < sample[i] then
            median ← median + step
        endif
        if |sample[i] - median| < step then
            step ← step * 0.5
        endif
    endfor

By comparing this median value with all of the sample values, it is possible to easily identify which samples have a value lower than the median and which have a value higher than the median. In the example of figure 8, the median is 4.5. If N is exactly 50% of the codewords, then the lowest N values may be selected purely by comparing each value with the median. The selected values are the light values in line (ii) of figure 8. If it is necessary to select the lowest 25%, then the median of the light values in line (ii) is calculated and this median (2 in figure 8) is then compared with the selected values to select the lowest 25%. These are the light values in line (iii) of figure 8. The schematic outcome of an arbitrary sorting procedure is given in line (iv) of figure 8. In both cases the final result is the same; however, the underlying principle and the number of required steps are completely different.
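The estimator just described can be written as runnable Python (a sketch following the pseudocode above; the function name and the default minimum step are illustrative choices). Being a one-pass approximation, it trades exactness for O(n) time and O(1) memory:

```python
def estimate_median(sample, min_step=0.01):
    """One-pass running median estimate in the style of
    Feldman & Shavitt: nudge the estimate toward each new value,
    halving the step size whenever the estimate is already close."""
    median = sample[0]
    step = max(abs(sample[0]) * 0.5, min_step)
    for x in sample[1:]:
        if median > x:
            median -= step                 # nudge the estimate down
        elif median < x:
            median += step                 # nudge it up
        if abs(x - median) < step:
            step *= 0.5                    # refine once we are close
    return median
```

On well-mixed data the returned value is only an estimate of the median, which is sufficient here because it is used as a splitting threshold rather than an exact statistic.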

The method is not limited to finding values of N which are obtained by repeated division by 2 of the total number of codewords. For example, if it is necessary to find the lowest 37.5%, it is possible to identify the lowest 25% as explained above, then identify the median of the 25% of values which were not selected in the final step and select the lower half of these values using the median method. Other values of N may be achieved by subdividing the group of values further.
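The subdivision procedure can be sketched as follows. For clarity this sketch computes an exact median with Python's statistics module rather than the approximate streaming estimate, and the function name is an illustrative assumption; a three-way partition handles ties and values of N that are not a power-of-2 fraction of n.

```python
import statistics

def lowest_n(values, k):
    """Return the k smallest values (in no particular order) by
    repeatedly splitting on the median instead of fully sorting."""
    if k <= 0:
        return []
    if k >= len(values):
        return list(values)
    m = statistics.median(values)
    lower = [v for v in values if v < m]
    equal = [v for v in values if v == m]
    upper = [v for v in values if v > m]
    if len(lower) >= k:
        return lowest_n(lower, k)  # keep halving the lower group
    if len(lower) + len(equal) >= k:
        return lower + equal[:k - len(lower)]
    # Recurse into the discarded group for the remaining values,
    # as in the 37.5% example above.
    return lower + equal + lowest_n(upper, k - len(lower) - len(equal))
```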

The above method has shown how to identify the Gaussians which are to be calculated.

However, during the decoding process, the decoder may require Gaussians which have not been calculated or otherwise marked; in this case a state flooring occurs and a back-off strategy is employed. Generally, a constant value is used as the back-off value.

When a state is left without selected Gaussians, instead of approximating its likelihood value using a constant, a second group of Gaussians is identified and marked for calculation if required by the decoder.

This group of Gaussians, or "back-off" Gaussians, is selected on the basis of the distance each Gaussian has to the observation vector. An estimate of this distance can be obtained from the distance to the codeword to which a given Gaussian has been assigned.

If the decoder requires a Gaussian which is neither selected as being part of the top N codewords nor part of the back-off group of Gaussians, then a constant back-off value is given. Selection of back-off Gaussians is made in the following way. During vector quantization as described above, the Gaussians belonging to the top N codewords are selected. A further group of Gaussians is also selected belonging to the M-N closest codewords, i.e. the closest M codewords which lie outside the top N closest codewords.

The remaining Gaussians, i.e. those outside the closest M codewords, will be marked as unselected.
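The selection of the two groups can be sketched as below. The sketch ranks codewords by sorting for clarity, where the described method would instead use the median technique; all names are illustrative assumptions.

```python
def mark_gaussians(codeword_dists, codeword_to_gaussians, n, m):
    """Mark Gaussians of the n closest codewords as selected and those
    of the next m - n closest as back-off; the rest stay unselected."""
    # Rank codeword ids by their distance to the observation vector.
    ranked = sorted(codeword_dists, key=codeword_dists.get)
    selected = {g for c in ranked[:n] for g in codeword_to_gaussians[c]}
    backoff = {g for c in ranked[n:m] for g in codeword_to_gaussians[c]}
    return selected, backoff
```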

During the recognition process, if a state is selected which comprises Gaussians in the top N codewords, then all of the Gaussians in that state which are in the closest N codewords are calculated and the state likelihood is determined using the Gaussian which gives the largest likelihood value.

When a state appears to be floored (i.e. no Gaussians are in the top N codewords), the system checks to see if Gaussians belonging to this state have been marked as back-off Gaussians. If so, their likelihoods are calculated and the state likelihood is determined from the Gaussian which gives the largest likelihood score.

It should be noted that if a state is selected which contains both Gaussians from the top N codewords and the top M-N codewords, only Gaussians from the top N codewords are calculated. Gaussians from the top M-N codewords are only used if a state is selected which contains no Gaussians in the top N codewords.

However, if for a given state none of the Gaussians have been marked as selected or backed off, then the state likelihood is given a small constant value or the state may be pruned. Figure 9 illustrates the double level state back-off strategy in a two dimensional acoustic space. For a given observation vector x, the first level comprises all Gaussians belonging to the top N codewords (1L). The second level is composed of back-off Gaussians which belong to the top M-N codewords. The remaining Gaussians are considered to be unimportant for the current speech frame and their likelihood calculation is skipped.
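The double level decision just described can be summarised in a short sketch. The flooring constant and all names are illustrative assumptions, not values from the embodiment.

```python
FLOOR = -1e10  # hypothetical small constant for floored states

def state_likelihood(state_gaussians, selected, backoff, loglik):
    """Two-level back-off: prefer Gaussians in the top N codewords,
    then marked back-off Gaussians, otherwise floor the state."""
    first = [g for g in state_gaussians if g in selected]
    if first:   # level 1: at least one Gaussian in the top N codewords
        return max(loglik(g) for g in first)
    second = [g for g in state_gaussians if g in backoff]
    if second:  # level 2: back-off Gaussians (top M-N codewords)
        return max(loglik(g) for g in second)
    return FLOOR  # neither selected nor backed off: floor (or prune)
```

Note that, as in the text, a state containing both selected and back-off Gaussians uses only the selected ones.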

Figure 10 is a further schematic of the back-off strategy. Here the centroid of codeword Z is within the N closest codewords and therefore all Gaussians within this codeword are selected for calculation and shown as dark circles.

Codewords X and Y have centroids which lie in the range M-N and hence all Gaussians in these codewords are marked as back-off Gaussians. The codeword W has a centroid which lies outside the top M codewords and hence none of the Gaussians in this codeword are selected for calculation.

Thus, all Gaussians lying sufficiently close to the observation vector O will be calculated, or at least considered during the back-off. Figure 10 also shows that the new strategy is very close to the ideal case, in which Gaussians lying between the M circle and the N circle are used for the state back-off and those lying outside the M circle are not calculated. Such an ideal case would require knowledge of the position of each Gaussian with respect to the observation vector, which is at present impractical for real-time operation.

Thus, in summary, as compared to previously employed back-off strategies, the double level strategy does not approximate the likelihood of a state to be floored by a constant, task and environment dependent value. Nor does it make a random guess by calculating the likelihood of the m-th Gaussian. If a state is left without Gaussians because the relevant codeword is not within the top N codewords, this restriction is weakened and the boundary is extended to the top M codewords, where M is larger than N. If none of the Gaussians belong even to the top M codewords, then it is assumed that the likelihood of the given state is too small and hence it can be floored.

The above method was evaluated on two large vocabulary continuous speech recognition (LVCSR) tasks. In both cases the Gaussian selection system was built in the same way. A codebook with 4096 codewords was prepared using 60,000 Gaussians, the distance measure of equation (4) above and the Linde-Buzo-Gray clustering procedure (Linde et al., "An algorithm for vector quantizer design", IEEE Trans. Communications, Vol. 28, Iss. 1, pp. 84-95 (1980)). None of the codewords were empty or shared Gaussians with any other codeword.

The baseline system is a single-pass Viterbi recognizer with a hidden Markov model (HMM) based acoustic model (AM) with 60,000 Gaussians and a trigram language model (LM). The baseline model uses the nearest neighbour approximation when calculating state likelihoods. This means that the state likelihood is approximated by the value of the Gaussian giving the largest likelihood among the Gaussians composing the state output distribution. The contribution of the likelihood calculation stage to the overall recognition time is 24% in the first task and 14% in the second task.

a) Task 1

The first task was an internal speech dictation task. The training database contains 637 hours of recorded male and female data. Context-independent HMMs with 3 emitting states and 14 Gaussians per state output distribution were trained using the Baum-Welch training procedure. The trigram LM was built from a Gigaword corpus and in-domain adaptation data. The vocabulary of the system was fixed at 27,000 words. The testing set contained 1 hour and 25 minutes of male and female data. The acoustic data was preprocessed by an MFCC (Mel-Frequency Cepstral Coefficients) front-end with the first and second order derivatives appended to the basic observation vector. Consecutive MFCC, delta and delta-delta vectors were projected into a 33 dimensional space by means of an HLDA (Heteroscedastic Linear Discriminant Analysis) transformation.

b) Task 2

The second task was a Wall Street Journal (WSJ) task with a vocabulary fixed at 20,000 words. The first two sections of WSJ data were used to train context-independent HMMs with 3 emitting states and 10 Gaussians per state output distribution. A standard trigram LM was trained with the vocabulary of 20,000 words. MFCC preprocessing with the first and second order derivatives applied gives the final observation vector with 33 dimensions. A H2PO testing set was used to assess the performance of the above method.

Figure 11 summarizes the relative performance of different distance metrics in the first task. The hit-rate shows how frequently, for a given state, the Gaussian giving the highest likelihood is selected by the vector quantizer. The proposed metric finds those Gaussians correctly in 56.6% of cases and is at least twice as accurate as the Euclidean and Bhattacharyya distance metrics. Word and sentence error rate (WER, SER) figures follow the same pattern.

Tables 1 and 2 below summarize the relative performance of different back-off strategies in the first and the second task. No back-off strategy (No) means that when a state is left without Gaussians but its likelihood value is requested, a very small constant number is used as its likelihood (0 in this example case). Another possibility for state back-off is to calculate the likelihood of the first Gaussian (1st) composing the state's output distribution and use this value as an estimate of the true state likelihood.

The most accurate back-off strategy (All) follows the baseline system: all Gaussians from the output distribution are calculated and the one giving the largest value is used as the state likelihood. Double level state back-off (2L) verifies whether any of the Gaussians belong to the top M codewords, instead of the tighter top N assumption (M>N) made for selected Gaussians.

In both tasks the double level state back-off shows a good compromise between doing nothing (No) and doing everything (All) to avoid the state flooring problem. Keeping recognition accuracy at the baseline level, the double level state back-off skips calculation of 14% and 15.7% of states, giving 13.3% and 16.9% relative speed-up over the most accurate state back-off strategy (All). Note that in the second task SER is not as important as WER and xRT since each sentence in the WSJ data is more likely a passage than a sentence. Therefore, a relative increase of 1.4% in SER over the baseline model for the double level state back-off can be tolerated.

          SF      GC      WER      SER      xRT
Baseline  0%      100%    26.53%   56.37%   0.90
No        26.1%   27.3%   26.59%   58.15%   0.71
1st       0%      26.2%   26.27%   56.39%   0.79
All       0%      58.8%   26.53%   56.57%   0.90
2L        14.0%   27.9%   26.52%   56.34%   0.78

Table 1 - Comparative performance of different back-off strategies in the first task. SF is the number of floored states, GC is the number of calculated Gaussians, WER is the word error rate, SER the sentence error rate and xRT the real time factor.

          SF      GC      WER      SER      xRT
Baseline  0%      100%    10.52%   64.65%   0.72
No        25.4%   32.0%   12.18%   69.95%   0.52
1st       0%      31.9%   11.53%   69.63%   0.59
All       0%      61.8%   10.88%   65.12%   0.71
2L        15.7%   32.2%   10.93%   65.58%   0.59

Table 2 - Comparative performance of different back-off strategies in the second task.

SF is the number of floored states. GC is the number of calculated Gaussians, WER is the word error rate, SER the sentence error rate and xRT the real time factor.

Table 3 provides performance details of two GS systems: one using the quicksort procedure to rank codewords and one making use of the method of medians to accomplish the same task. The method of medians gives an 8.2% relative speed-up with 0.3% and 0.4% relative increases in WER and SER due to its approximate nature. Profiling both systems using Valgrind Version 2 on a random testing utterance shows that the contribution of quicksort is 2.4% to the overall recognition time and 25.6% to the time spent on Gaussian selection. The method of medians contributes 1.8% to the overall recognition time and 22.0% to GS. Hence, the method of medians provides a 25.0% relative speed-up to the speech recognizer and a 14.1% relative speed-up to the GS subsystem.

           WER      SER      xRT
Quicksort  26.45%   56.14%   0.85
MM         26.52%   56.34%   0.78

Table 3 - Comparative accuracy and speed results for a GS system using the known quicksort method and the method of medians (MM) to perform selection of the top N codewords for each observation vector.

Tables 4 and 5 provide comparative results for the baseline and GS systems on the first and second recognition task. GS systems have N=512 and M=768 in the first task, and N=768 and M=1248 in the second task. The first rows in both tables describe the baseline system (Baseline). The second row describes a simple GS system (GS) with no back-off strategy and the quicksort procedure used to select the top N codewords. The third row stands for the GS system with double level back-off strategy (GS+2L) and the fourth row for the system which uses the method of medians instead of sorting (GS+2L+MM). The final row shows relative reductions obtained by the last and most advanced GS system described in the table. The second column in both tables gives the average number of Gaussians calculated at each time frame with respect to the baseline model. For example, if on average 18,000 Gaussians are computed by the baseline and 7,000 by the GS system, then GC would be equal to 38.9%. The third column shows the average number of states floored at each time frame. The fourth and fifth columns give accuracy results in terms of WER and SER figures. The final column gives the real time factor for each system.
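The GC figure from the worked example in the preceding paragraph can be checked directly:

```python
# GC: Gaussians calculated by the GS system as a percentage of those
# calculated by the baseline (figures from the example in the text).
baseline_gaussians = 18_000
gs_gaussians = 7_000
gc = 100 * gs_gaussians / baseline_gaussians
print(round(gc, 1))  # 38.9, matching the text
```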

                 GC      SF      WER      SER      xRT
Baseline         100%    0%      26.53%   56.37%   0.90
GS               27.8%   25.4%   26.37%   57.65%   0.74
GS+2L            28.5%   13.3%   26.45%   56.14%   0.85
GS+2L+MM         24.8%   14.0%   26.52%   56.34%   0.78
Total reduction  75.2%   14.0%   0.04%    0.05%    13.3%

Table 4 - Evaluation results for baseline and different GS systems in the first task

                 GC      SF      WER      SER      xRT
Baseline         100%    0%      10.52%   64.65%   0.72
GS               27.7%   21.3%   11.37%   66.51%   0.58
GS+2L            36.6%   12.2%   10.91%   65.12%   0.67
GS+2L+MM         26.2%   15.7%   10.93%   65.58%   0.59
Total reduction  73.8%   15.7%   -3.9%    -1.4%    18.1%

Table 5 - Evaluation results for baseline and different GS systems in the second task.

Claims (20)

  1. CLAIMS: 1. A speech recognition method, comprising: receiving a speech input comprising a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, comprising: converting each observation into an input vector in acoustic space; determining the closest codewords to an input vector in said acoustic space, wherein each codeword comprises a cluster of points in said acoustic space, and each point represents the probability distribution of a word or part thereof being related to an observation; and determining the likelihood of a sequence of words arising from the sequence of observations using probabilities determined from said probability distributions; determining the likelihood of a sequence of observations occurring in a given language using a language model; combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein determining said closest codewords comprises comparing the distance of each codeword from the observation vector with a predetermined distance, wherein codewords which are closer to the input vector than said predetermined distance are retained.
  2. 2. A method according to claim 1, wherein the N closest codewords are determined and the predetermined distance is estimated to be the distance which allows the Nth closest codewords to be selected.
  3. 3. A method according to claim 2, wherein the Nth closest codewords are determined by estimating the distance of the closest n/2 codewords from the input vector, where n is the total number of codewords and N<n/2, and splitting the codewords into two groups where the codewords with distances smaller than the distance of the n/2 closest codewords are retained, the process being repeated with the n/2 closest codewords until the closest N codewords are identified.
  4. 4. A method according to claim 3, wherein estimating the distance of the closest n/2 codewords from the input vector is determined by estimating the median.
  5. 5. A method according to claim 4, wherein said median is estimated by setting the median value to the distance of one of said n codewords and then comparing the median value with each of the n codeword distances in turn and adjusting the median value as it is compared with each codeword dependent on the difference between the median and the codeword distance.
  6. 6. A method according to any of claims 2 to 5, wherein N and n can each be expressed as an integer power of 2.
  7. 7. A method according to any of claims 2 to 5, wherein either N or n cannot be expressed as an integer power of 2, and said N values are determined by recursively dividing the n codewords until a sample of N' codewords is established, where N' is less than N, and then performing the process on the last group to be discarded to establish the N-N' codewords with the smallest distances out of this group, to add to the N' codewords.
  8. 8. A method according to any preceding claim, wherein the acoustic model is a Hidden Markov Model.
  9. 9. A method according to any preceding claim, wherein the probability distributions are Gaussian probability distributions.
  10. 10. A method according to any preceding claim, wherein the distance between a codeword and the input vector is based on a likelihood measure.
  11. 11. A method according to claim 10, wherein the distance between a codeword and an input vector is determined by:

d(o, c0) = A + B Σ_i (o_i − c0_i)² / σ0_i²

where o is the input vector, c0 is the codeword mean, σ0 the codeword variance and A and B are constants.
  12. 12. A method according to any preceding claim, wherein the clustering of points uses the distance measure:

d(μ, c0) = D + E Σ_i (μ_i − c0_i)² / (σμ_i² + σ0_i²)

where μ is the mean of the probability distribution, σμ is the variance of the probability distribution, c0 is the codeword mean, σ0 the codeword variance and D and E are constants.
  13. 13. A method according to any preceding claim, wherein likelihood scores are calculated which require the probability of the observation being related to a word or part thereof for points which are not in the selected closest codewords, and wherein a back-off probability is used for said points.
  14. 14. A method according to any preceding claim, wherein the probability densities which belong to codewords which are not considered to be in the closest codeword group, but which are within a second predefined distance from the input vector are marked to be calculated on demand if a state likelihood score is required which does not contain probability densities belonging to the closest N codewords.
  15. 15. A method according to claim 14, when dependent on claim 2, wherein the second predefined distance is estimated to allow the Mth closest codewords to be identified, where M>N.
  16. 16. A method of determining an indication of the probability that a speech input corresponds to a word or sequence of words, the method comprising: receiving a speech input which comprises a sequence of observations; determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, comprising: converting each observation into an input vector in acoustic space; determining the closest codewords to an input vector in said acoustic space, wherein each codeword comprises a cluster of points in said acoustic space, and each point represents the probability distribution of a word or part thereof being related to an observation; and determining the likelihood of a sequence of words arising from the sequence of observations using probabilities determined from said probability distributions, wherein determining said closest codewords comprises comparing the distance of each codeword from the observation vector with a predetermined distance wherein codewords which are closer to the input vector than said pre-determined distance are retained.
  17. 17. A speech recognition apparatus, said apparatus comprising: a receiver for a speech input, said speech input comprising a sequence of observations; a processor adapted to determine the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said processor being adapted to: convert each observation into an input vector in acoustic space; determine the closest codewords to an input vector in said acoustic space, wherein each codeword comprises a cluster of points in said acoustic space, and each point represents the probability distribution of a word or part thereof being related to an observation; and determine the likelihood of a sequence of words arising from the sequence of observations using probabilities determined from said probability distributions; the processor being further adapted to determine the likelihood of a sequence of observations occurring in a given language using a language model, and to combine the likelihoods determined by the acoustic model and the language model; the apparatus further comprising an output adapted to output a sequence of words identified from said speech input signal, wherein determining said closest codewords comprises comparing the distance of each codeword from the observation vector with a predetermined distance, wherein codewords which are closer to the input vector than said predetermined distance are retained.
  18. 18. A computer running a computer program configured to cause a computer to perform the method of any of claims 1 to 16.
  19. 19. A computer implemented method of determining N extreme values from a sample of n values where N<n/2, the method comprising: splitting the sample values into two groups dependent on whether a sample value is higher or lower than an estimated midpoint of said sample values; selecting the group of samples dependent on whether the extreme values required are the highest N values or the lowest N values; and repeating the process until the sample size reaches N values.
  20. 20. A speech translation method comprising: recognising a speech input signal according to any of claims 1 to 15; translating said recognised speech signal into a different language; and outputting said recognised speech in said different language.
GB0817821A 2008-09-29 2008-09-29 Speech recognition apparatus and method Active GB2463908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0817821A GB2463908B (en) 2008-09-29 2008-09-29 Speech recognition apparatus and method


Publications (3)

Publication Number Publication Date
GB0817821D0 GB0817821D0 (en) 2008-11-05
GB2463908A true true GB2463908A (en) 2010-03-31
GB2463908B GB2463908B (en) 2011-02-16

Family

ID=40019745

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0817821A Active GB2463908B (en) 2008-09-29 2008-09-29 Speech recognition apparatus and method

Country Status (1)

Country Link
GB (1) GB2463908B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5535305A (en) * 1992-12-31 1996-07-09 Apple Computer, Inc. Sub-partitioned vector quantization of probability density functions
US6374217B1 (en) * 1999-03-12 2002-04-16 Apple Computer, Inc. Fast update implementation for efficient latent semantic language modeling
WO2008044582A1 (en) * 2006-09-27 2008-04-17 Sharp Kabushiki Kaisha Method and apparatus for locating speech keyword and speech recognition system


Also Published As

Publication number Publication date Type
GB0817821D0 (en) 2008-11-05 grant
GB2463908B (en) 2011-02-16 grant

Similar Documents

Publication Publication Date Title
Rohlicek et al. Continuous hidden Markov modeling for speaker-independent word spotting
US5794197A (en) Senone tree representation and evaluation
Soltau et al. The IBM 2004 conversational telephony system for rich transcription
US5579436A (en) Recognition unit model training based on competing word and word string models
US6539353B1 (en) Confidence measures using sub-word-dependent weighting of sub-word confidence scores for robust speech recognition
US5729656A (en) Reduction of search space in speech recognition using phone boundaries and phone ranking
Ajmera et al. A robust speaker clustering algorithm
US4837831A (en) Method for creating and using multiple-word sound models in speech recognition
US6076057A (en) Unsupervised HMM adaptation based on speech-silence discrimination
US5745873A (en) Speech recognition using final decision based on tentative decisions
Woodland et al. The 1998 HTK broadcast news transcription system: Development and results
US5684925A (en) Speech representation by feature-based word prototypes comprising phoneme targets having reliable high similarity
Rabiner et al. HMM clustering for connected word recognition
US5621859A (en) Single tree method for grammar directed, very large vocabulary speech recognizer
US5839105A (en) Speaker-independent model generation apparatus and speech recognition apparatus each equipped with means for splitting state having maximum increase in likelihood
US5963903A (en) Method and system for dynamically adjusted training for speech recognition
US5842163A (en) Method and apparatus for computing likelihood and hypothesizing keyword appearance in speech
US5710866A (en) System and method for speech recognition using dynamically adjusted confidence measure
US20110077943A1 (en) System for generating language model, method of generating language model, and program for language model generation
US20060287856A1 (en) Speech models generated using competitive training, asymmetric training, and data boosting
US5606644A (en) Minimum error rate training of combined string models
US7457745B2 (en) Method and apparatus for fast on-line automatic speaker/environment adaptation for speech/speaker recognition in the presence of changing environments
US20050010412A1 (en) Phoneme lattice construction and its application to speech recognition and keyword spotting
US5857169A (en) Method and system for pattern recognition based on tree organized probability densities
US6542866B1 (en) Speech recognition method and apparatus utilizing multiple feature streams