WO2001031636A2 - Speech recognition on GSM encoded data - Google Patents

Speech recognition on GSM encoded data

Info

Publication number
WO2001031636A2
Authority
WO
WIPO (PCT)
Prior art keywords
speech
template
features
gsm
lar
Prior art date
Application number
PCT/IB2000/001679
Other languages
French (fr)
Other versions
WO2001031636A3 (en)
Inventor
Martine Lapere
Original Assignee
Lernout & Hauspie Speech Products N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lernout & Hauspie Speech Products N.V. filed Critical Lernout & Hauspie Speech Products N.V.
Priority to AU10496/01A priority Critical patent/AU1049601A/en
Publication of WO2001031636A2 publication Critical patent/WO2001031636A2/en
Publication of WO2001031636A3 publication Critical patent/WO2001031636A3/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142: Hidden Markov Models [HMMs]
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Definitions

  • Automatic speech recognition is a complicated technology that is rapidly entering daily life in an increasing number of applications.
  • Digital mobile telephony is another fast-growing technology.
  • Several significant technical challenges must be met to provide automatic speech recognition in a digital mobile telephone system.
  • digital mobile telephones operate from limited capacity batteries, but automatic speech recognition uses computer processors that perform a significant number of calculations, thereby consuming a relatively substantial amount of power.
  • the physical size of a digital mobile telephone handset is very limited.
  • Digital storage memory and digital signal processors required for automatic speech recognition also represent a significant additional cost beyond that of the telephone handset.
  • automatic speech recognition is technically more difficult in the digital mobile telephone environment which includes operating in noisy environments such as public places, automobiles, etc., distortion effects related to the digital encoding of speech, and transmission errors due to the radio channel.
  • the GSM full rate codec (GSM 06.10) samples input speech at an 8 kHz rate and generates a 13-bit digital signal which is converted into 260-bit blocks that represent 160 of the original samples.
  • the nominal bit rate of the GSM encoding algorithm is 13 kbps
  • the actual transmitted data stream includes error recovery and packet information which increases the total bit rate.
  • the GSM codec uses the technique of linear predictive analysis-by-synthesis to encode the speech as a combination of linear prediction coefficients (LPC) containing spectral information and a residual pulse excitation signal.
  • the LPC filter information is in the form of quantized log area ratios (Q-LARs), while the residual pulse signal is in the form of quantized RPE-LTP parameters.
  • the quantization and compression performed in encoding the input speech creates noise and distortion that degrade the signal.
  • the pulse excitation signal is reconstructed and then input to a digital filter defined by the LPC parameters.
  • automatic speech recognition has operated in the cepstral domain by converting a digitized speech signal input into a cepstral domain signal and then performing speech recognition.
  • One automatic speech recognition system designed to operate in a GSM digital mobile telephone environment reconverts digital GSM features back into cepstral component factors and then performs the recognizing process.
  • Another speech recognition system converts the GSM parameters into linear predictive components, and then into a 256-point spectrum of each speech frame followed by a Mel-filter weighting and conversion into cepstrum.
  • a representative embodiment also includes a method of speech recognition using Global System for Mobile Communications (GSM)-encoded digital data.
  • the method includes providing a plurality of templates, each template modeling a word in a recognition vocabulary using time domain GSM Quantized Log Area Ratio (Q-LAR) features; and comparing with a recognizer module Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and producing a recognition output.
  • the Q-LAR features of the input GSM signal may be smoothed over time and have a zero mean.
  • the Q-LAR features of the input GSM signal may be generated by bandpass filtering, and may include at least one time derivative.
  • the at least one time derivative may be used by a speech detector to determine a speech begin point and a speech end point for the input GSM signal.
  • the recognizer module may also use a dynamic time warping (DTW) algorithm, and a pruned matrix to compare the template representing the input GSM signal to the at least one of the plurality of templates.
  • the pruned matrix may be generated based on a threshold distance from a matrix diagonal.
  • the recognizer module may also use variable frame rate discrimination.
  • one of the plurality of templates may be a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary. Such a composite representation may be based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word.
  • the template representing the input GSM signal may be a vector quantized template in one embodiment.
  • Hidden Markov models (HMMs), for example fenonic HMMs, may be used as templates.
  • a representative embodiment also includes an apparatus and method for recognizing speech including providing a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and comparing with a recognizer module a template representing an input speech signal to at least one of the plurality of templates, and producing a recognition output.
  • each vector may represent Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard.
  • the path may represent a minimal distance path within a threshold distance of the matrix diagonal.
  • At least one of the plurality of templates may represent at least three repetitions of a word.
  • the input speech signal may be a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features, which may be smoothed over time and have a zero mean, and/or be generated by bandpass filtering and may include at least one time derivative.
  • the recognizer module may use a dynamic time warping (DTW) algorithm.
  • the template representing an input speech signal may be a vector quantized template.
  • the templates may be hidden Markov models (HMMs), e.g., fenonic models.
  • Another embodiment of the invention includes apparatus and method for detecting when speech is present in an input acoustic signal including converting, with an input preprocessor, an input acoustic signal into a sequence of frames containing representative features; and analyzing, with a speech detection module, at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence.
  • the representative features may be Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream.
  • FIG. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment of the present invention.
  • Fig. 2 is an illustration of the three possible path ancestors for the matrix element i,j.
  • Various embodiments of the present invention are directed to techniques for a small vocabulary speaker dependent automatic speech recognizer to be used in a relatively low resource environment, e.g., a GSM digital mobile telephone handset.
  • a relatively low complexity dynamic time warping (DTW) algorithm requires a limited number of computations based on speech features extracted from the GSM signal.
  • GSM signal features represent quantized log area ratio parameters (Q-LARs) in which the quantization bins are set to convert a full range of speech with minimal distortion.
  • conventional speech recognizers operate in the cepstral domain. Converting from GSM Q-LARs to cepstrum parameters for speech recognition requires a significant computational effort. The Q-LARs must be converted into continuous LARs, which then must be converted into linear predictive coefficients (LPC) from which cepstral coefficients may be calculated. This conventional conversion to cepstrum is done because speaker dependent pitch information is lost if only the lowest cepstral coefficients are considered. For speaker dependent "single microphone" speech recognition purposes, however, it is not necessary to convert to a cepstrum representation or to filter out pitch. Instead, the extracted features may simply have a zero mean and be smoothed over time (for the highly quantized high-order coefficients).
  • a representative embodiment operates using speech features directly derived from the Q-LARs without the intermediate step of decoding into continuous LARs. This approach minimizes the feature extraction effort.
  • a weighted bandpass filtering of the GSM Q-LARs drops DC and high-frequency variations; then time derivatives are calculated, but energy is not reconstructed.
  • the speech recognizer of a representative embodiment also includes a speech detector that operates by monitoring time derivatives of the filtered Q-LARs.
  • the spectrum (or bank) of Q-LAR coefficients is much more stable during noise than during organized speech.
  • a norm of the time derivative of the Q-LARs can be used for determining when speech is present. For example, a speech begin point may be determined by when the local integration of the time derivative is increasing and greater than a first selected value. A speech end point may be determined by when the local integration of the time derivative decreases below a second selected value.
  • the time derivative of the Q-LARs may also serve as a variable frame rate control signal that only retransmits frames that differ significantly from the previous ones.
  • Fig. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment.
  • Speech processor 10 provides a spoken input signal to GSM frame coder 11 which converts the input speech into a sequence of GSM encoded frames.
  • the GSM frames are then processed by the GSM channel coder 12 and output to a GSM network 13.
  • the GSM frames from the GSM frame coder 11 also are available as an input to an automatic speech recognition GSM pre-processor 141 which removes DC and high-frequency variations by performing weighted bandpass filtering of the GSM Q-LARs and also calculates signal time derivatives.
  • Speech recognition engine 14 compares the output of the ASR GSM pre-processor 141 to GSM acoustic models 15 using DTW. This comparison also uses speech detector 142 that monitors the time derivatives of the filtered Q-LARs from the ASR GSM pre-processor 141.
  • the recognition output of the speech recognition engine 14 may be further processed; for example, to control an automatic dialing feature.
  • a received GSM signal from the GSM network 13 may be decoded by a GSM channel decoder 16 into a sequence of GSM frames which are then further processed into an audio output signal by a GSM frame decoder 17 and speech output processor 18.
  • the GSM frames from the GSM channel decoder may be provided via the ASR GSM pre-processor 141 to the speech recognition engine 14.
  • the recognition engine 14 uses a dynamic time warping (DTW) algorithm in which a multi-dimensional vector template representing the input speech signal is compared to one or more reference templates representing words in a recognition vocabulary.
  • a relatively low complexity algorithm scores two such templates against each other by time warping of their extracted features.
  • the technique of DTW is described, for example, in Chapter 11 of Deller, Proakis & Hansen, Discrete-Time Processing of Speech Signals (Prentice Hall, 1987), which is incorporated herein by reference.
  • a standard dynamic time warping algorithm determines the degree of match between two p-dimensional vector templates A and B of length m and n respectively.
  • An m*n matrix L of local scores is generated wherein L(i,j) is the distance (typically, the Euclidean distance) between the i-th component of A and the j-th component of B.
  • another matrix G(i,j) of global scores is also generated representing the cumulative sum of the elements L(k ≤ i, l ≤ j) over the minimal path.
  • FIG. 2 is an illustration of the three possible path ancestors for the matrix element i,j: (1) a horizontal path 21 from i,j-1 to i,j that reflects the degree of match of several successive vectors of template B on one vector of template A, (2) a vertical path 22 from i-1,j to i,j that reflects the degree of match of several successive vectors of template A on one vector of B, or (3) a diagonal path 23 from i-1,j-1 to i,j that reflects a one to one correspondence of a vector point of A to a single vector point of B.
  • the DTW algorithm can evaluate in either top-down or left-to-right order.
  • the score G(m,n) is the total score of the degree of match between templates A and B.
  • a standard DTW algorithm has some disadvantages, however.
  • local distances are calculated between all the feature vectors of an input utterance and all the feature vectors of the model template.
  • in the specific case of GSM coding with 20 msec frames, a 2 second sample of input speech needs 100 frames, and a corresponding matrix of local distances takes a cumbersome 10K of memory.
  • Several methods can be used to prune possible paths of the DTW algorithm in order to reduce the number of evaluations that have to be made. For example, the pruning may be based on a maximum number of active states where a fixed maximum of n-best states is kept active. There will be n horizontal states in top- down evaluations or n vertical states in left-right evaluation. This pruning method is not symmetric since the total beam of active states can vary from top-down to left-right evaluation.
  • a representative embodiment uses pruning around the diagonal.
  • the diagonal from (1,1) to (m,n) is calculated, and the states with a topological distance smaller than a preset threshold from the diagonal are evaluated.
  • This pruning method is symmetric in top-down or left to right evaluation.
  • pruning around the diagonal is much more demanding on the overall match of the two templates.
  • the local match must be within the threshold on each cut through the diagonal so both templates must have good local correspondence all along the diagonal.
  • with pruning on a fixed number of active states, a good local match might compensate for another local mismatch. This could lead to an optimal path that is far away from the diagonal. Imposing diagonal pruning on two utterances with different acoustical content forces higher scores and thereby enables better rejection.
  • Pruning on the diagonal does demand that both input templates match locally well all along the length. This requirement is not a significant issue for single-word utterances, but does become problematic for multiple word utterances with varying inter-word silence lengths, or with utterances with leading or trailing silence.
  • the silence regions are stripped by a begin-end point detector, and then variable frame rate discrimination is employed. This approach works very well with pruning on the diagonal.
  • the m-by-n matrix of global scores is not fully calculated.
  • One additional memory element is used for the local score, and one or more elements are used for beam boundaries, plus an additional 14 bytes are used to track path length so that altogether, a representative embodiment uses a mere 46 bytes of memory at recognition time.
  • backtrace information is also kept in memory; for a two second speech sample, this is 14x100x2 bits, or 350 bytes.
  • word templates may also be kept in memory to improve system speed.
  • a standard DTW system performs a one to one evaluation of two feature vector templates in which a new incoming feature vector is matched to each template feature vector stored during training.
  • multiple repetitions of a given utterance may be provided during training, and the DTW system scores a given test utterance against each of the stored repetitions.
  • the score for a single word in the recognition vocabulary then is a combination of the scores for each of the different stored repetitions of that word that were trained.
  • a representative embodiment combines multiple repetition templates during training to form a single "glued" template for each word that represents the "average" of the various repetitions.
  • a first repetition of a training utterance is stored in temporary memory.
  • a second repetition of the same training utterance is requested and stored in temporary memory.
  • a check is then made on the consistency of the length of the two noise-stripped canonical form utterances. If the lengths don't match, the first utterance is overwritten by the second, and a new input utterance is taken to replace the second. If the first and second utterances match in length, they are scored against each other by the diagonal pruned DTW algorithm. This is done not with the aim of getting the score, but in order to get the optimal path of the best scores around the diagonal.
  • a similarity check is performed: if the score is too high and the template represents only a single utterance, then the template is overwritten by the second utterance; if the template is already a glued form of two or more utterances, the last utterance is neglected. Since the templates are already very similar, most of the optimal path will be a concatenation of diagonal transitions, although some horizontal and vertical transitions will remain. All the horizontal and vertical transitions are packed, since they correspond to a local mismatch of the two templates; then the two templates are averaged over the path. When a local diagonal path exists, each element of the glued template will be an average of the corresponding elements of the first and the second templates.
  • a representative embodiment uses three utterance repetitions for training.
  • each new training utterance is merged into the primary glued template, generating a new secondary glued template.
  • the new secondary glued template represents the weighted mean of the new utterance (that is, the first new utterance that passes the length consistency check and score check), and the previously stored glued template.
  • This algorithm has the secondary advantages that a glued template sublimates to a minimal length representation, and that any possible remaining trailing or ending noise in one of the repetition templates gets compressed to a maximum of one state, since the chance of having similar remaining trailing or ending noise states in all training utterances is insignificant.
  • features derived from the input speech may be vector quantized (VQ) in order to have a lower resolution representation.
  • Vector quantization reduces the size of the data stream, and so also reduces the amount of data memory that is needed. This benefit is maximized if the VQ can be done in real time.
  • a VQ system also may need fewer calculations to perform recognition because the distances between different points are predefined.
  • One disadvantage of classical VQ systems is the need to store codebooks. For example, a codebook for an eight-dimensional feature vector system is typically around 1K which, for real-time recognition, should be loaded in (expensive) RAM.
  • the feature vector dimensions may be reduced by using a regular grid of codewords, in which case, there is no need to store a codebook, but rather only to define an appropriate quantization scheme.
  • the modified DTW approach of a representative embodiment appears to be the most appropriate pattern matching routine to be used in a GSM environment.
  • alternative embodiments may employ other approaches.
  • as alternatives for the local distance, Euclidean, abs(diff), or "-improduct" calculations on the vectors may be used.
  • two or more reference templates can be "glued" together by an appropriate modification of the DTW algorithm. Neighboring states within a template can also be combined, in order to reduce the dimensionality of the templates. These compressed templates can eventually also be used for single Gaussian Viterbi scoring, since they are fully compatible in origin.
  • An alternative embodiment may also be based on stochastic Hidden Markov Models (HMMs) trained by Viterbi iteration. For instance, each word in the recognition vocabulary may be represented as a single continuous density (or Gaussian) HMM. This implies a training procedure with different iterations on the feature vectors, and the feature vectors would be kept in system RAM. Evaluation is fast in such an embodiment, but not necessarily superior to the glued DTW patterns approach (which takes less RAM at training).
  • An embodiment could also be based on the use of fenonic discrete density HMMs to represent words in the recognition vocabulary.
  • This approach implies the storage of phonetic reference models which could be stored in flash memory.
  • One disadvantage would be the problem of storage of the flash data.
  • Advantages would include relatively small storage templates, low RAM requirements, and more potential for noise robustness and speaker independent solutions.
  • a CPU for a typical GSM DSP operates at 50 MIPS, more than enough processing power for a representative embodiment.
  • the CPU is used (a) at training time, where feature extraction is kept as low as possible, and (b) at recognition/verification time, to score an input phrase against a number of words in the recognition vocabulary.
  • One workable approach is to aim for a one second response time, of which about 0.5 seconds is used for speech end-point detection, leaving around 25 msec per word (on average). Thus, the number of active states per word also should be minimized.
  • ROM code memory should also be kept as small as possible since code memory will have to be shared between the normal GSM functionality and the speech recognizer. Flash memory is less of an issue; an adequate working target is about 1K per word.
  • a speech recognizer according to a representative embodiment may be found in the ASR100 small footprint isolated word recognizer made by Lernout & Hauspie Speech Products N.V. of Ieper, Belgium.
  • the ASR100 is intended to be used in handheld consumer devices, for example, in a GSM mobile telephone for providing access to a personal address or telephone book by speaking the name of the addressee.
  • the ASR100 uses part of the digital GSM frame data as input, and the appropriate code is called at the normal frame rate, but not all items composing a frame have to be computed during enrollment or at recognition time.
  • the basic footprint figures used herein do not include the requirements for the GSM frame encoding process; the figures refer to a recognizer running in the digital GSM domain. If, however, the engine runs in the GSM phone itself, it is possible to share some RAM data with the RAM reserved for the GSM encoding.
  • Total flash memory size for 30 words of average duration of 1 second is 24 Kbyte without playback functionality, 84 Kbyte with playback functionality, and a 10 MIPS DSP gives an average sub-second recognition latency measured from end of utterance.
  • the recognition engine normally operates in push-to-talk mode, although automatic speech detection can be used in some applications.
  • the training procedure for new entries takes three repetitions of the new entry (the minimum number of repetitions is two).
  • a consistency check is made of the repetitions during training, and additional training utterances are adaptively requested if required. If a user chooses a standard training with only two sample utterances, there is an option to adapt the template at recognition time.
  • a confusability check is also made at training time, preventing the generation of confusable word pairs in the vocabulary. Since longer utterances (first + last names) are less confusable than short nicknames, the user is encouraged not to use nickname entries.
  • the input utterance is checked against all vocabulary entries. If the utterance cannot be found in the recognition vocabulary, it is rejected. Otherwise, the template reference or the vocal playback of the recognized word is given for confirmation.
  • An embodiment may be employed in various alternative configurations provided the various tradeoffs are considered and accommodated since RAM space and CPU resources directly compete with system performance.
  • a minimal RAM implementation could use only 1 KWord of RAM with, however, an increase in code size of about 20%.
  • a minimal Flash storage embodiment would decrease the template storage of a word for recognition by a factor of two, at the expense, however, of CPU load and eventually a slight decrease in performance.
  • a minimal CPU embodiment would require some 20% increase in code, and could result in a slightly decreased performance.
  • Embodiments of the invention may be implemented in any conventional computer programming language. For example, representative embodiments may be implemented in a procedural programming language (e.g., "C") or an object oriented programming language (e.g., "C++"). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components. Embodiments can be implemented as a computer program product for use with a computer system.
  • Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium.
  • the medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques).
  • the series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems.
  • Such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies.
  • a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web).
  • some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech recognizer uses Global System for Mobile Communications (GSM)-encoded digital data. Templates model words in a recognition vocabulary using GSM Quantized Log Area Ratio (Q-LAR) features. A recognizer module compares Q-LAR features of a template representing an input GSM signal to recognition vocabulary templates and produces a recognition output.

Description

Small Vocabulary Speaker Dependent Speech Recognition
Field of the Invention
The invention relates to automatic speech recognition in a low resource environment.
Background Art
Automatic speech recognition is a complicated technology that is rapidly entering daily life in an increasing number of applications. Digital mobile telephony is another fast-growing technology. Several significant technical challenges must be met to provide automatic speech recognition in a digital mobile telephone system. For example, digital mobile telephones operate from limited capacity batteries, but automatic speech recognition uses computer processors that perform a significant number of calculations, thereby consuming a relatively substantial amount of power. In addition, the physical size of a digital mobile telephone handset is very limited. Digital storage memory and digital signal processors required for automatic speech recognition also represent a significant additional cost beyond that of the telephone handset. In addition, automatic speech recognition is technically more difficult in the digital mobile telephone environment, which includes operating in noisy environments such as public places and automobiles, distortion effects related to the digital encoding of speech, and transmission errors due to the radio channel.
One widely employed digital mobile telephone system uses the GSM standard. The GSM full rate codec (GSM 06.10) samples input speech at an 8 kHz rate and generates a 13-bit digital signal which is converted into 260-bit blocks that represent 160 of the original samples. Although the nominal bit rate of the GSM encoding algorithm is 13 kbps, the actual transmitted data stream includes error recovery and packet information which increases the total bit rate.
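These figures are mutually consistent, as a quick calculation shows: 160 samples at 8 kHz span 20 msec, and 260 bits every 20 msec give the 13 kbps nominal rate. The following minimal C sketch verifies the arithmetic (the constant names are illustrative, not from the patent):

    #include <assert.h>

    enum {
        SAMPLE_RATE_HZ    = 8000, /* input sampling rate          */
        SAMPLES_PER_FRAME = 160,  /* speech samples per GSM frame */
        BITS_PER_FRAME    = 260   /* encoded bits per GSM frame   */
    };

    int main(void)
    {
        /* 160 samples at 8 kHz => 20 ms per frame. */
        int frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ;
        assert(frame_ms == 20);

        /* 260 bits every 20 ms => 13000 bits/sec nominal rate. */
        int bits_per_sec = BITS_PER_FRAME * (1000 / frame_ms);
        assert(bits_per_sec == 13000);
        return 0;
    }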
The GSM codec uses the technique of linear predictive analysis-by-synthesis to encode the speech as a combination of linear prediction coefficients (LPC) containing spectral information and a residual pulse excitation signal. The LPC filter information is in the form of quantized log area ratios (Q-LARs), while the residual pulse signal is in the form of quantized RPE-LTP parameters. The quantization and compression performed in encoding the input speech creates noise and distortion that degrade the signal. To decode the digital signal back into speech, the pulse excitation signal is reconstructed and then input to a digital filter defined by the LPC parameters.
Traditionally, automatic speech recognition has operated in the cepstral domain by converting a digitized speech signal input into a cepstral domain signal and then performing speech recognition. One automatic speech recognition system designed to operate in a GSM digital mobile telephone environment reconverts digital GSM features back into cepstral component factors and then performs the recognizing process. Another speech recognition system converts the GSM parameters into linear predictive components, and then into a 256-point spectrum of each speech frame followed by a Mel-filter weighting and conversion into cepstrum.
Summary of the Invention
A representative embodiment of the present invention includes a speech recognizer using Global System for Mobile Communications (GSM)-encoded digital data. The recognizer includes a plurality of templates, each template modeling a word in a recognition vocabulary using GSM Quantized Log Area Ratio (Q-LAR) features; and a recognizer module that compares Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and produces a recognition output.
A representative embodiment also includes a method of speech recognition using Global System for Mobile Communications (GSM)-encoded digital data. The method includes providing a plurality of templates, each template modeling a word in a recognition vocabulary using time domain GSM Quantized Log Area Ratio (Q-LAR) features; and comparing with a recognizer module Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and producing a recognition output.
In a further embodiment of either of the above, the Q-LAR features of the input GSM signal may be smoothed over time and have a zero mean. In such a case, the Q-LAR features of the input GSM signal may be generated by bandpass filtering, and may include at least one time derivative. The at least one time derivative may be used by a speech detector to determine a speech begin point and a speech end point for the input GSM signal. The recognizer module may also use a dynamic time warping (DTW) algorithm, and a pruned matrix to compare the template representing the input GSM signal to the at least one of the plurality of templates. The pruned matrix may be generated based on a threshold distance from a matrix diagonal. The recognizer module may also use variable frame rate discrimination. In another embodiment, one of the plurality of templates may be a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary. Such a composite representation may be based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word. The template representing the input GSM signal may be a vector quantized template in one embodiment. Hidden Markov models (HMMs), for example fenonic HMMs, may be used as templates.
A representative embodiment also includes an apparatus and method for recognizing speech including providing a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and comparing with a recognizer module a template representing an input speech signal to at least one of the plurality of templates, and producing a recognition output. In such an embodiment, each vector may represent Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard. The path may represent a minimal distance path within a threshold distance of the matrix diagonal. At least one of the plurality of templates may represent at least three repetitions of a word. The input speech signal may be a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features, which may be smoothed over time and have a zero mean, and/or be generated by bandpass filtering and may include at least one time derivative. The recognizer module may use a dynamic time warping (DTW) algorithm. The template representing an input speech signal may be a vector quantized template. The templates may be hidden Markov models (HMMs), e.g., fenonic models.
Another embodiment of the invention includes apparatus and method for detecting when speech is present in an input acoustic signal including converting, with an input preprocessor, an input acoustic signal into a sequence of frames containing representative features; and analyzing, with a speech detection module, at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence. In such an embodiment, the representative features may be Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream. Analyzing with a speech detection module further includes determining a speech begin point in the sequence when a local integration of the at least one feature time derivative is increasing and greater than a first selected value, and determining a speech end point in the sequence when the local integration of the at least one feature time derivative is less than a second selected value.
Brief Description of the Drawings
The present invention will be more readily understood by reference to the following detailed description taken with the accompanying drawings, in which: Fig. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment of the present invention.
Fig. 2 is an illustration of the three possible path ancestors for the matrix element i,j.
Detailed Description of Specific Embodiments
Various embodiments of the present invention are directed to techniques for a small vocabulary speaker dependent automatic speech recognizer to be used in a relatively low resource environment, e.g., a GSM digital mobile telephone handset. A relatively low complexity dynamic time warping (DTW) algorithm requires a limited number of computations based on speech features extracted from the GSM signal.
GSM signal features represent quantized log area ratio parameters (Q-LARs) in which the quantization bins are set to convert a full range of speech with minimal distortion. However, conventional speech recognizers operate in the cepstral domain. Converting from GSM Q-LARs to cepstrum parameters for speech recognition requires a significant computational effort. The Q-LARs must be converted into continuous LARs, which then must be converted into linear predictive coefficients (LPC) from which cepstral coefficients may be calculated. This conventional conversion to cepstrum is done because speaker dependent pitch information is lost if only the lowest cepstral coefficients are considered. For speaker dependent "single microphone" speech recognition purposes, however, it is not necessary to convert to a cepstrum representation or to filter out pitch. Instead, the extracted features may simply have a zero mean and be smoothed over time (for the highly quantized high-order coefficients).
Thus, in contrast to a conventional speech recognizer, a representative embodiment operates using speech features directly derived from the Q-LARs without the intermediate step of decoding into continuous LARs. This approach minimizes the feature extraction effort. A weighted bandpass filtering of the GSM Q-LARs drops DC and high-frequency variations; then time derivatives are calculated, but energy is not reconstructed.
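As an illustration of such a front end, the following C sketch derives zero-mean, smoothed features and their time derivatives from the per-frame Q-LARs. It assumes 8 Q-LAR coefficients per frame, as in the GSM 06.10 codec; the particular filter weights are placeholders, since the patent does not disclose the actual weighting, and no energy term is computed.

    #define NUM_LAR 8  /* Q-LAR coefficients per GSM 06.10 frame */

    typedef struct {
        float lar[NUM_LAR];   /* filtered, zero-mean features */
        float dlar[NUM_LAR];  /* time derivatives             */
    } Features;

    /* One frame of the front end: a slow running mean removes DC, an
     * average with the previous filtered frame smooths high-frequency
     * variation, and a first difference gives the time derivative.
     * The weights (0.99, 0.5) are assumptions; the patent only states
     * that a weighted bandpass filter is used, and energy is never
     * reconstructed.                                                 */
    void extract_features(const float qlar[NUM_LAR], float mean[NUM_LAR],
                          float prev[NUM_LAR], Features *out)
    {
        for (int k = 0; k < NUM_LAR; k++) {
            mean[k] = 0.99f * mean[k] + 0.01f * qlar[k];      /* DC estimate */
            float f = 0.5f * ((qlar[k] - mean[k]) + prev[k]); /* smoothing   */
            out->dlar[k] = f - prev[k];  /* delta, used by speech detection  */
            out->lar[k]  = f;
            prev[k] = f;                 /* filter state for the next frame  */
        }
    }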
The speech recognizer of a representative embodiment also includes a speech detector that operates by monitoring time derivatives of the filtered Q-LARs. The spectrum (or bank) of Q-LAR coefficients is much more stable during noise than during organized speech. Thus, a norm of the time derivative of the Q-LARs can be used for determining when speech is present. For example, a speech begin point may be determined by when the local integration of the time derivative is increasing and greater than a first selected value. A speech end point may be determined by when the local integration of the time derivative decreases below a second selected value. In addition, the time derivative of the Q-LARs may also serve as a variable frame rate control signal that only retransmits frames that differ significantly from the previous ones.
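A minimal sketch of such a detector in C, reusing NUM_LAR and the delta features from the sketch above; the L1 norm, the leaky integration constant, and both thresholds are assumptions:

    #define T_BEGIN 4.0f  /* assumed begin threshold */
    #define T_END   1.0f  /* assumed end threshold   */

    typedef struct { float integ; int in_speech; } SpeechDetector;

    /* Per-frame decision from the Q-LAR delta norm.  Returns +1 at a
     * speech begin point, -1 at an end point, 0 otherwise.           */
    int detect_speech(SpeechDetector *sd, const float dlar[NUM_LAR])
    {
        float norm = 0.0f;  /* L1 norm of the feature derivatives */
        for (int k = 0; k < NUM_LAR; k++)
            norm += (dlar[k] < 0.0f) ? -dlar[k] : dlar[k];

        float prev = sd->integ;
        sd->integ = 0.9f * sd->integ + norm;  /* leaky local integration */

        if (!sd->in_speech && sd->integ > prev && sd->integ > T_BEGIN) {
            sd->in_speech = 1;   /* rising and above the first threshold */
            return +1;
        }
        if (sd->in_speech && sd->integ < T_END) {
            sd->in_speech = 0;   /* fell below the second threshold      */
            return -1;
        }
        return 0;
    }

The same per-frame norm can drive the variable frame rate control: a frame whose delta norm falls below a threshold may be treated as a near-repeat of the previous frame and dropped.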
Fig. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment. Speech processor 10 provides a spoken input signal to GSM frame coder 11 which converts the input speech into a sequence of GSM encoded frames. The GSM frames are then processed by the GSM channel coder 12 and output to a GSM network 13. The GSM frames from the GSM frame coder 11 also are available as an input to an automatic speech recognition GSM pre-processor 141 which removes DC and high-frequency variations by performing weighted bandpass filtering of the GSM Q-LARs and also calculates signal time derivatives. Speech recognition engine 14 compares the output of the ASR GSM pre-processor 141 to GSM acoustic models 15 using DTW. This comparison also uses speech detector 142 that monitors the time derivatives of the filtered Q-LARs from the ASR GSM pre-processor 141. The recognition output of the speech recognition engine 14 may be further processed; for example, to control an automatic dialing feature.
Alternatively, or in addition, a received GSM signal from the GSM network 13 may be decoded by a GSM channel decoder 16 into a sequence of GSM frames which are then further processed into an audio output signal by a GSM frame decoder 17 and speech output processor 18. The GSM frames from the GSM channel decoder may be provided via the ASR GSM pre-processor 141 to the speech recognition engine 14. The recognition engine 14 uses a dynamic time warping (DTW) algorithm in which a multi-dimensional vector template representing the input speech signal is compared to one or more reference templates representing words in a recognition vocabulary. A relatively low complexity algorithm scores two such templates against each other by time warping of their extracted features. The technique of DTW is described, for example, in Chapter 11 of Deller, Proakis & Hansen, Discrete-Time Processing of Speech Signals (Prentice Hall, 1987), which is incorporated herein by reference.
A standard dynamic time warping algorithm determines the degree of match between two p-dimensional vector templates A and B of length m and n respectively. An m*n matrix L of local scores is generated wherein L(i,j) is the distance (typically, the Euclidean distance) between the i-th component of A and the j-th component of B. Corresponding with the matrix L(i,j), another matrix G(i,j) of global scores is also generated representing the cumulative sum of the elements L(k ≤ i, l ≤ j) over the minimal path. Fig. 2 is an illustration of the three possible path ancestors for the matrix element i,j: (1) a horizontal path 21 from i,j-1 to i,j that reflects the degree of match of several successive vectors of template B on one vector of template A, (2) a vertical path 22 from i-1,j to i,j that reflects the degree of match of several successive vectors of template A on one vector of B, or (3) a diagonal path 23 from i-1,j-1 to i,j that reflects a one to one correspondence of a vector point of A to a single vector point of B. The DTW algorithm can evaluate in either top-down or left-to-right order. The score G(m,n) is the total score of the degree of match between templates A and B.
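The recurrence can be written compactly; the following C sketch fills the global score matrix in left-to-right order using the three-ancestor rule and a squared Euclidean local distance (one of several possible local distances, as the text notes later):

    #include <float.h>
    #include <math.h>

    /* Squared Euclidean local distance between two p-dimensional frames. */
    static float local_dist(const float *a, const float *b, int p)
    {
        float d = 0.0f;
        for (int k = 0; k < p; k++) { float t = a[k] - b[k]; d += t * t; }
        return d;
    }

    /* Standard DTW: A has m frames, B has n frames, p features each; G is
     * caller-supplied m*n scratch space (row-major) for global scores.   */
    float dtw(const float *A, int m, const float *B, int n, int p, float *G)
    {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                float best = 0.0f;       /* G(1,1) has no ancestor       */
                if (i > 0 || j > 0) {
                    best = FLT_MAX;      /* min over the three ancestors */
                    if (j > 0)          best = fminf(best, G[i*n + j-1]);     /* horizontal */
                    if (i > 0)          best = fminf(best, G[(i-1)*n + j]);   /* vertical   */
                    if (i > 0 && j > 0) best = fminf(best, G[(i-1)*n + j-1]); /* diagonal   */
                }
                G[i*n + j] = local_dist(A + i*p, B + j*p, p) + best;
            }
        }
        return G[m*n - 1];  /* G(m,n): total score of the match */
    }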
A standard DTW algorithm has some disadvantages, however. First, local distances are calculated between all the feature vectors of an input utterance and all the feature vectors of the model template. In the specific case of GSM coding with 20 msec frames, a 2 second sample of input speech needs 100 frames, and the corresponding matrix of local distances takes a cumbersome 10K of memory. Several methods can be used to prune possible paths of the DTW algorithm in order to reduce the number of evaluations that have to be made. For example, the pruning may be based on a maximum number of active states where a fixed maximum of n-best states is kept active. There will be n horizontal states in top-down evaluation or n vertical states in left-to-right evaluation. This pruning method is not symmetric since the total beam of active states can vary from top-down to left-to-right evaluation.
A representative embodiment uses pruning around the diagonal. The diagonal from (1,1) to (m,n) is calculated, and only the states with a topological distance smaller than a preset threshold from the diagonal are evaluated. This pruning method is symmetric in top-down or left-to-right evaluation. Compared to pruning on a maximum number of active states, pruning around the diagonal is much more demanding on the overall match of the two templates. The local match must be within the threshold on each cut through the diagonal, so both templates must have good local correspondence all along the diagonal. With pruning on a fixed number of active states, a good local match might compensate for another local mismatch. This could lead to an optimal path that is far away from the diagonal. Imposing diagonal pruning on two utterances with different acoustical content forces higher scores and thereby enables better rejection.
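One way to express the diagonal beam is as a predicate over matrix cells; in this sketch the beam width parameter and its units (frames of template A) are assumptions:

    /* Diagonal-beam test: evaluate cell (i,j) only when it lies within a
     * fixed topological distance of the (0,0)-(m-1,n-1) diagonal.  The
     * cross-multiplication avoids a division per cell.                  */
    static int in_beam(int i, int j, int m, int n, int beam)
    {
        long d = (long)i * n - (long)j * m;  /* signed offset from diagonal */
        if (d < 0) d = -d;
        return d <= (long)beam * n;
    }

In the DTW loops above, a cell failing this test is simply skipped, i.e. treated as having an infinite global score, so only the states inside the beam are ever evaluated.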
Pruning on the diagonal, however, does demand that both input templates match locally well all along the length. This requirement is not a significant issue for single-word utterances, but does become problematic for multiple word utterances with varying inter-word silence lengths, or with utterances with leading or trailing silence. In representative embodiments, the silence regions are stripped by a begin-end point detector, and then variable frame rate discrimination is employed. This approach works very well with pruning on the diagonal.
In a representative embodiment, the m-by-n matrix of global scores is not fully calculated. By restricting the path to the specified beam around the diagonal, only 14 elements (28 bytes) need to be kept in memory. One additional memory element is used for the local score, and one or more elements are used for beam boundaries, plus an additional 14 bytes are used to track path length, so that altogether a representative embodiment uses a mere 46 bytes of memory at recognition time. During system training, backtrace information is also kept in memory; for a two second speech sample, this is 14x100x2 bits, or 350 bytes. Thus, the total memory needed for training is 46 bytes + 350 bytes = 396 bytes, a very modest number. In one embodiment where the extra memory is available, word templates may also be kept in memory to improve system speed.
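The byte counts can be reproduced with simple constants, assuming 2-byte score elements (consistent with 14 elements occupying 28 bytes):

    /* Recognition-time memory budget; the constants reproduce the
     * arithmetic given in the text.                                */
    enum {
        BAND_ELEMENTS = 14,                 /* active band of global scores */
        BAND_BYTES    = BAND_ELEMENTS * 2,  /* 28 bytes                     */
        PATHLEN_BYTES = 14,                 /* path-length tracking         */
        /* + one local-score element and beam-boundary elements -> 46 bytes */
        TRACE_BITS    = 14 * 100 * 2,       /* training backtrace, 2 s      */
        TRACE_BYTES   = TRACE_BITS / 8      /* 350 bytes                    */
    };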
A standard DTW system performs a one to one evaluation of two feature vector templates in which a new incoming feature vector is matched to each template feature vector stored during training. In some conventional DTW systems, multiple repetitions of a given utterance may be provided during training, and the DTW system scores a given test utterance against each of the stored repetitions. The score for a single word in the recognition vocabulary then is a combination of the scores for each of the different stored repetitions of that word that were trained. There are several drawbacks to this approach including excessive storage requirements for a single word in recognition vocabulary and excessive CPU processing needed to evaluate all of the various matches.
A representative embodiment combines multiple repetition templates during training to form a single "glued" template for each word that represents the "average" of the various repetitions. A first repetition of a training utterance is stored in temporary memory. Next, a second repetition of the same training utterance is requested and stored in temporary memory. A check is then made on the consistency of the length of the two noise-stripped canonical form utterances. If the lengths don't match, the first utterance is overwritten by the second, and a new input utterance is taken to replace the second. If the first and second utterances match in length, they are scored against each other by the diagonal pruned DTW algorithm. This is done not with the aim of getting the score, but in order to get the optimal path of the best scores around the diagonal. Next, a similarity check is performed: if the score is too high and the template represents only a single utterance, then the template is overwritten by the second utterance; if the template is already a glued form of two or more utterances, the last utterance is neglected. Since the templates are already very similar, most of the optimal path will be a concatenation of diagonal transitions, although some horizontal and vertical transitions will remain. All the horizontal and vertical transitions are packed, since they correspond to a local mismatch of the two templates; then the two templates are averaged over the path. Where a local diagonal path exists, each element of the glued template will be an average of the corresponding elements of the first and the second templates. In the case of a local vertical path, the corresponding element in the glued template will be the average of all successive frames of template A and a single frame of template B. In the case of a local horizontal path, the corresponding element in the glued template will be the average of all successive frames of template B and the single corresponding frame of template A. The resulting glued template is stored for use during the recognition process.
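A sketch of the gluing step in C, assuming the optimal path has already been backtraced into an array of step codes; the equal weighting applied when averaging a packed run is an assumption, since the text leaves the exact weighting open:

    /* Glue two templates A and B (p features per frame) along the optimal
     * pruned-DTW path.  step[s] encodes the transition into the (s+1)-th
     * path cell: 0 = diagonal, 1 = horizontal (B advances), 2 = vertical
     * (A advances); recovered by DTW backtracing (not shown).  counts[]
     * is caller-supplied scratch space.                                  */
    int glue_templates(const float *A, const float *B, int p,
                       const int *step, int nsteps, float *out, int *counts)
    {
        int i = 0, j = 0, g = -1;
        for (int s = 0; s <= nsteps; s++) {
            if (s == 0 || step[s - 1] == 0) {   /* diagonal opens a state */
                g++;
                counts[g] = 0;
                for (int k = 0; k < p; k++) out[g * p + k] = 0.0f;
            }
            /* Pack this cell into the open state: horizontal and vertical
             * runs (local mismatches) collapse into a single state.      */
            for (int k = 0; k < p; k++)
                out[g * p + k] += 0.5f * (A[i * p + k] + B[j * p + k]);
            counts[g]++;
            if (s < nsteps) {
                if (step[s] != 1) i++;  /* vertical or diagonal advances A   */
                if (step[s] != 2) j++;  /* horizontal or diagonal advances B */
            }
        }
        for (int t = 0; t <= g; t++)    /* close the states: average them */
            for (int k = 0; k < p; k++)
                out[t * p + k] /= (float)counts[t];
        return g + 1;                   /* length of the glued template   */
    }

A diagonal-only path reproduces the element-by-element average described above, while a horizontal or vertical run collapses into a single averaged state.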
A representative embodiment uses three utterance repetitions for training. Each new training utterance is merged into the primary glued template, generating a new secondary glued template. The new secondary glued template represents the weighted mean of the new utterance (that is, the first new utterance that passes the length consistency check and score check) and the previously stored glued template. This algorithm has the secondary advantages that a glued template sublimates to a minimal length representation, and that any possible remaining trailing or ending noise in one of the repetition templates gets compressed to a maximum of one state, since the chance of having similar remaining trailing or ending noise states in all training utterances is insignificant.
In a further embodiment, features derived from the input speech may be vector quantized (VQ) in order to have a lower resolution representation. Vector quantization reduces the size of the data stream, and so also reduces the amount of data memory that is needed. This benefit is maximized if the VQ can be done in real time. A VQ system also may need fewer calculations to perform recognition because the distances between different points are predefined. One disadvantage of classical VQ systems is the need to store codebooks. For example, a codebook for an eight-dimensional feature vector system is typically around 1K which, for real-time recognition, should be loaded in (expensive) RAM. Alternatively, the feature vector dimensions may be reduced by using a regular grid of codewords, in which case there is no need to store a codebook, but rather only to define an appropriate quantization scheme (see the sketch following this passage).
On the whole, the modified DTW approach of a representative embodiment appears to be the most appropriate pattern matching routine to be used in a GSM environment. However, alternative embodiments may employ other approaches. As alternatives for the local distance, Euclidean, abs(diff), or "-improduct" calculations on the vectors may be used. In addition, two or more reference templates can be "glued" together by an appropriate modification of the DTW algorithm. Neighboring states within a template can also be combined, in order to reduce the dimensionality of the templates. These compressed templates can eventually also be used for single Gaussian Viterbi scoring, since they are fully compatible in origin.
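A minimal sketch of the regular-grid quantization referred to above, reusing NUM_LAR from the front-end sketch; the step size and the number of levels per dimension are assumptions:

    #define GRID_STEP   0.25f  /* assumed quantization step per dimension */
    #define GRID_LEVELS 16     /* assumed number of levels per dimension  */

    /* Regular-grid vector quantization: each dimension is quantized
     * independently onto a uniform grid, so only the grid parameters,
     * not a codebook, need to be stored.                              */
    void grid_quantize(const float in[NUM_LAR], unsigned char out[NUM_LAR])
    {
        for (int k = 0; k < NUM_LAR; k++) {
            int q = (int)(in[k] / GRID_STEP) + GRID_LEVELS / 2;
            if (q < 0) q = 0;                       /* clamp to the grid */
            if (q >= GRID_LEVELS) q = GRID_LEVELS - 1;
            out[k] = (unsigned char)q;  /* 4-bit code per dimension here */
        }
    }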
An alternative embodiment may also be based on stochastic Hidden Markov Models (HMMs) trained by Viterbi iteration. For instance, each word in the recognition vocabulary may be represented as a single continuous density (or Gaussian) HMM. This implies a training procedure with different iterations on the feature vectors, and the feature vectors would be kept in system RAM. Evaluation is fast in such an embodiment, but not necessarily superior to the glued DTW patterns approach (which takes less RAM at training).
An embodiment could also be based on the use of fenonic discrete density HMMs to represent words in the recognition vocabulary. This approach implies the storage of phonetic reference models which could be stored in flash memory. One disadvantage would be the problem of storage of the flash data. Advantages would include relatively small storage templates, low RAM requirements, and more potential for noise robustness and speaker independent solutions.
Of course, the cost of the required hardware is quite important in such an application. In a representative embodiment, the speech recognizer uses the existing GSM digital signal processor and does not need significant additional hardware. Hardware cost considerations also include making efficient use of Random Access Memory (RAM), CPU processing power, Read-Only Memory (ROM), and flash memory. RAM is relatively expensive; therefore, the amount of RAM required by a representative embodiment is kept as low as possible, typically 1-2 Kbytes with a maximum of 4 Kbytes. The amount of RAM necessary is not, however, independent of CPU processing power. CPU processing, especially for feature extraction, is kept as low as possible in order to run in real-time. Otherwise, data must be buffered, which in turn means additional relatively large data buffers. RAM used for generating reference templates during training is also kept as low as possible. Because templates are typically block-loaded into RAM for speed considerations, RAM usage is also minimized.
A CPU for a typical GSM DSP operates at 50 MIPS, more than enough processing power for a representative embodiment. In a representative embodiment, the CPU is used (a) at training time, where feature extraction is kept as low as possible, and (b) at recognition/verification time, to score an input phrase against a number of words in the recognition vocabulary. For speed considerations, it is advisable to load comparison templates into RAM since flash memory access may be relatively slow. Due to RAM limitations, this means that candidate words will have to be scored sequentially, or block-buffered. One workable approach is to aim for a one second response time, of which about 0.5 seconds is used for speech end-point detection, leaving around 25 msec per word (on average). Thus, the number of active states per word also should be minimized.
ROM code memory should also be kept as small as possible, since code memory will have to be shared between the normal GSM functionality and the speech recognizer. Flash memory is less of an issue; an adequate working target is about 1K per word. A speech recognizer according to a representative embodiment may be found in the ASR100 small footprint isolated word recognizer made by Lernout & Hauspie Speech Products N.V. of Ieper, Belgium. The ASR100 is intended for use in handheld consumer devices, for example, in a GSM mobile telephone for providing access to a personal address or telephone book by speaking the name of the addressee. The ASR100 uses part of the digital GSM frame data as input, and the appropriate code is called at the normal frame rate, but not all items composing a frame have to be computed during enrollment or at recognition time. The basic footprint figures used herein do not include the requirements for the GSM frame encoding process; the figures refer to a recognizer running in the digital GSM domain. If, however, the engine runs in the GSM phone itself, it is possible to share some RAM data with the RAM reserved for the GSM encoding.
By storing only the GSM frames that contain speech (using Begin-Endpoint detection; a sketch of such a detector follows the footprint figures below), playback functionality requires memory for 13 Kbit per second of effective speech duration. An average speech duration of 1 second is assumed; for other average speech lengths, the storage needed changes proportionally. Such an embodiment is designed with the following system characteristics:
• 3 kWord of RAM, some of which can be shared with the GSM processing,
• 4 kWord of ROM,
• 0.8 kByte (on average) in flash memory per enrolled word for recognition (1 second),
• 1.7 kByte (on average) in flash memory per enrolled word for playback (1 second),
• Total flash memory size for 30 words of average 1 second duration is 75 Kbyte,
• 10 MIPS DSP for an average sub-second recognition latency measured from end of utterance, and
• 8 kHz sampling rate.
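As a check on these figures (the arithmetic is supplied here, not taken from the specification): GSM full-rate speech occupies 13 Kbit/sec, so one second of stored speech frames amounts to 13,000 / 8 ≈ 1,625 bytes, consistent with the 1.7 kByte per-word playback figure once per-word overhead is allowed for. Adding the 0.8 kByte recognition template gives about 2.5 kByte per word, and 30 words × 2.5 kByte = 75 Kbyte, matching the total above.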
When the ASR100 is used in handheld devices other than mobile telephones, its footprint is increased slightly. The extra resources are used for the partial calculation of the GSM frames, estimated as 2 kWord of ROM with no extra RAM needed. The input data is assumed to be given by the codec at 13 bits PCM linear, 8 kHz. In the absence of a full GSM coder, an alternative speech compression algorithm could be added for the playback function, for example, ADPCM at 16 Kbit/sec at the expense of 1 kWord of extra code. The total footprint in this configuration is:
• 3 kWord of RAM,
• 6 kWord of ROM (without playback functionality) / 7 kWord of ROM (with playback functionality),
• 0.8 kByte (on average) in flash memory per enrolled word for recognition (1 second),
• 2 kByte (on average) in flash memory per enrolled word for playback (1 second),
• Total flash memory size for 30 words of average duration of 1 second is 24 Kbyte without playback functionality, 84 Kbyte with playback functionality,
• 10 MIPS DSP for an average sub-second recognition latency measured from end of utterance, and
• 8 kHz sampling rate @ 13 bits PCM linear.
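The Begin-Endpoint detection referred to above can be illustrated as follows: a local integration (a running sum over a short window) of the feature time-derivative norm is compared against a begin threshold while it is increasing, and against a lower end threshold once speech has begun. This is a minimal sketch in C; the window length and threshold values are illustrative assumptions, not values from this specification.

```c
#define EP_WIN   8       /* assumed local integration window, in frames */
#define T_BEGIN  4.0f    /* assumed begin threshold                     */
#define T_END    1.5f    /* assumed end threshold (T_END < T_BEGIN)     */

typedef struct {
    float win[EP_WIN];   /* last EP_WIN derivative norms     */
    int   pos;           /* circular buffer position         */
    float sum;           /* local integration (running sum)  */
    float prev_sum;      /* previous value of the sum        */
    int   in_speech;     /* current state: 0 silence, 1 speech */
} endpoint_t;

/* Feed one frame's feature time-derivative norm.  Returns +1 at a
 * speech begin point, -1 at a speech end point, 0 otherwise. */
int endpoint_update(endpoint_t *ep, float deriv_norm)
{
    ep->sum += deriv_norm - ep->win[ep->pos];   /* update running sum */
    ep->win[ep->pos] = deriv_norm;
    ep->pos = (ep->pos + 1) % EP_WIN;

    int event = 0;
    if (!ep->in_speech && ep->sum > T_BEGIN && ep->sum > ep->prev_sum) {
        ep->in_speech = 1;        /* integral increasing and large */
        event = 1;
    } else if (ep->in_speech && ep->sum < T_END) {
        ep->in_speech = 0;        /* integral has fallen away      */
        event = -1;
    }
    ep->prev_sum = ep->sum;
    return event;
}
```

A zero-initialized detector (e.g., endpoint_t ep = {0};) starts in the silence state; only frames between the +1 and -1 events need to be stored for playback.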
The recognition engine normally operates in push-to-talk mode, although automatic speech detection can be used in some applications. In a representative embodiment, the training procedure for new entries takes three repetitions of the new entry (the minimum number of repetitions is two). A consistency check is made on the repetitions during training, and additional training utterances are adaptively requested if required. If a user chooses a standard training with only two sample utterances, there is an option to adapt the template at recognition time. A confusability check is also made at training time, preventing the generation of confusable word pairs in the vocabulary. Since longer utterances (first + last names) are less confusable than short nicknames, the user is encouraged not to use nickname entries. At recognition time, the input utterance is checked against all vocabulary entries. If the utterance cannot be found in the recognition vocabulary, it is rejected. Otherwise, the template reference or the vocal playback of the recognized word is given for confirmation.
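The merging of training repetitions into a single reference template can be sketched as follows: frames of the existing template and of the new repetition that are paired by the DTW alignment path are averaged, with the existing template weighted by the number of repetitions it already represents. This is a minimal sketch in C; the path encoding, the weighting scheme, and the dimensions are illustrative assumptions, not the exact update rule of this specification.

```c
#define DIM 8   /* assumed features per frame */

/* path[k][0] indexes a frame of the existing template; path[k][1]
 * indexes the aligned frame of the new training repetition. */
void merge_templates(float tmpl[][DIM], const float rep[][DIM],
                     const int path[][2], int path_len, int n_reps)
{
    float w = (float)n_reps;    /* repetitions already in the template */

    for (int k = 0; k < path_len; ++k) {
        float       *t = tmpl[path[k][0]];
        const float *r = rep[path[k][1]];
        for (int d = 0; d < DIM; ++d)      /* weighted running average */
            t[d] = (w * t[d] + r[d]) / (w + 1.0f);
    }
}
```

Weighting by the repetition count keeps each training utterance contributing equally to the final composite, whether two or three repetitions are used.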
An embodiment may be employed in various alternative configurations, provided the various tradeoffs are considered and accommodated, since RAM space and CPU resources directly compete with system performance. A minimal-RAM implementation could use only 1 kWord of RAM, at the cost, however, of an increase in code size of about 20%. A minimal flash storage embodiment would halve the template storage per word for recognition, at the expense, however, of CPU load and possibly a slight decrease in performance. A minimal-CPU embodiment would require some 20% increase in code, and could result in slightly decreased performance.
Embodiments of the invention may be implemented in any conventional computer programming language. For example, representative embodiments may be implemented in a procedural programming language (e.g., "C") or an object-oriented programming language (e.g., "C++"). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components. Embodiments can be implemented as a computer program product for use with a computer system. Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).
Although various exemplary embodiments of the invention have been disclosed, it should be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the true scope of the invention.

Claims

What is claimed is:
1. A speech recognizer using Global System for Mobile Communications (GSM)-encoded digital data, the recognizer comprising: a plurality of templates, each template modeling a word in a recognition vocabulary using GSM Quantized Log Area Ratio (Q-LAR) features; and a recognizer module that compares Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and produces a recognition output.
2. A speech recognizer according to claim 1, wherein the Q-LAR features of the input GSM signal are smoothed over time and have a zero mean.
3. A speech recognizer according to claim 2, wherein the Q-LAR features of the input GSM signal are generated by bandpass filtering.
4. A speech recognizer according to claim 2, wherein the Q-LAR features of the input GSM signal include at least one time derivative.
5. A speech recognizer according to claim 4, further including a speech detector that uses the at least one time derivative to determine a speech begin point and a speech end point for the input GSM signal.
6. A speech recognizer according to claim 1, wherein the recognizer module uses a dynamic time warping (DTW) algorithm.
7. A speech recognizer according to claim 6, wherein the recognizer module uses a pruned matrix to compare the template representing the input GSM signal to the at least one of the plurality of templates.
8. A speech recognizer according to claim 7, wherein the pruned matrix is generated based on a threshold distance from a matrix diagonal.
9. A speech recognizer according to claim 8, wherein the recognizer module uses variable frame rate discrimination.
10. A speech recognizer according to claim 1, wherein at least one of the plurality of templates is a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary.
11. A speech recognizer according to claim 10, wherein the composite representation is based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word.
12. A speech recognizer according to claim 1, wherein the template representing the input GSM signal is a vector quantized template.
13. A speech recognizer according to claim 1, wherein the speech recognizer uses hidden Markov models (HMMs) as templates.
14. A speech recognizer according to claim 13, wherein the HMMs are fenonic models.
15. A method of speech recognition using Global System for Mobile Communications (GSM)-encoded digital data, the method comprising: providing a plurality of templates, each template modeling a word in a recognition vocabulary using time domain GSM Quantized Log Area Ratio (Q-LAR) features; comparing Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and producing a recognition output.
16. A method according to claim 15, wherein the Q-LAR features of the input GSM signal are smoothed over time and have a zero mean.
17. A method according to claim 16, wherein the Q-LAR features of the input GSM signal are generated by bandpass filtering.
18. A method according to claim 16, wherein the Q-LAR features of the input GSM signal include at least one time derivative.
19. A method according to claim 18, further including using the at least one time derivative to determine a speech begin point and a speech end point for the input GSM signal.
20. A method according to claim 17, wherein the comparing uses a dynamic time warping (DTW) algorithm.
21. A method according to claim 20, wherein the comparing uses a pruned matrix.
22. A method according to claim 21, wherein the pruned matrix is generated based on a threshold distance from a matrix diagonal.
23. A method according to claim 22, wherein the comparing includes using variable frame rate discrimination.
24. A method according to claim 15, wherein at least one of the plurality of templates is a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary.
25. A method according to claim 24, wherein the composite representation is based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word.
26. A method according to claim 15, wherein the template representing the input GSM signal is a vector quantized template.
27. A method according to claim 15, wherein hidden Markov models (HMMs) are used as templates.
28. A method according to claim 27, wherein the HMMs are fenonic models.
29. A speech recognizer comprising: a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and a recognizer module that compares a template representing an input speech signal to at least one of the plurality of templates, and produces a recognition output.
30. A speech recognizer according to claim 29, wherein each vector represents Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard.
31. A speech recognizer according to claim 29, wherein the path represents a minimal distance path within a threshold distance of the matrix diagonal.
32. A speech recognizer according to claim 29, wherein at least one of the plurality of templates represents at least three repetitions of a word.
33. A speech recognizer according to claim 29, wherein the input speech signal is a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features.
34. A speech recognizer according to claim 33, wherein the Q-LAR features are smoothed over time and have a zero mean.
35. A speech recognizer according to claim 34, wherein the Q-LAR features are generated by bandpass filtering.
36. A speech recognizer according to claim 34, wherein the Q-LAR features include at least one time derivative.
37. A speech recognizer according to claim 29, wherein the recognizer module uses a dynamic time warping (DTW) algorithm.
38. A speech recognizer according to claim 29, wherein the template representing an input speech signal is a vector quantized template.
39. A speech recognizer according to claim 29, wherein the templates are hidden Markov models (HMMs).
40. A speech recognizer according to claim 39, wherein the HMMs are fenonic models.
41. A method of recognizing speech, the method comprising: providing a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and comparing a template representing an input speech signal to at least one of the plurality of templates, and producing a recognition output.
42. A method according to claim 41, wherein each vector represents Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard.
43. A method according to claim 41, wherein the path represents a minimal distance path within a threshold distance of the matrix diagonal.
44. A method according to claim 41, wherein at least one of the plurality of templates represents at least three repetitions of a word.
45. A method according to claim 41, wherein the input speech signal is a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features.
46. A method according to claim 45, wherein the Q-LAR features are smoothed over time and have a zero mean.
47. A method according to claim 46, wherein the Q-LAR features are generated by bandpass filtering.
48. A method according to claim 46, wherein the Q-LAR features include at least one time derivative.
49. A method according to claim 41, wherein the comparing uses a dynamic time warping (DTW) algorithm.
50. A method according to claim 41, wherein the template representing an input speech signal is a vector quantized template.
51. A method according to claim 41, wherein the templates are hidden Markov models (HMMs).
52. A method according to claim 51, wherein the HMMs are fenonic models.
53. A speech detector for detecting when speech is present in an input acoustic signal, the detector comprising: an input preprocessor that converts an input acoustic signal into a sequence of frames containing representative features; and a speech detection module that analyzes at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence.
54. A speech detector according to claim 53, wherein the representative features are Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream.
55. A speech detector according to claim 53, wherein the speech detection module further determines a speech begin point in the sequence when a local integration of the at least one feature time derivative is increasing and greater than a first selected value, and determines a speech end point in the sequence when the local integration of the at least one feature time derivative is less than a second selected value.
56. A method of detecting when speech is present in an input acoustic signal, the method comprising: converting an input acoustic signal into a sequence of frames containing representative features; and analyzing at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence.
57. A method according to claim 56, wherein the representative features are Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream.
58. A method according to claim 56, wherein the analyzing further includes determining a speech begin point in the sequence when a local integration of the at least one feature time derivative is increasing and greater than a first selected value, and determining a speech end point in the sequence when the local integration of the at least one feature time derivative is less than a second selected value.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU10496/01A AU1049601A (en) 1999-10-25 2000-10-24 Small vocabulary speaker dependent speech recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16133399P 1999-10-25 1999-10-25
US60/161,333 1999-10-25

Publications (2)

Publication Number Publication Date
WO2001031636A2 true WO2001031636A2 (en) 2001-05-03
WO2001031636A3 WO2001031636A3 (en) 2001-11-01

Family

ID=22580769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2000/001679 WO2001031636A2 (en) 1999-10-25 2000-10-24 Speech recognition on gsm encoded data

Country Status (2)

Country Link
AU (1) AU1049601A (en)
WO (1) WO2001031636A2 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4688256A (en) * 1982-12-22 1987-08-18 Nec Corporation Speech detector capable of avoiding an interruption by monitoring a variation of a spectrum of an input signal
EP0527535A2 (en) * 1991-08-14 1993-02-17 Philips Patentverwaltung GmbH Apparatus for transmission of speech
US5632004A (en) * 1993-01-29 1997-05-20 Telefonaktiebolaget Lm Ericsson Method and apparatus for encoding/decoding of background sounds

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DATABASE INSPEC [Online] INSTITUTE OF ELECTRICAL ENGINEERS, STEVENAGE, GB; ZHANG HAIYAN ET AL: "CELP-based implementation of the GSM half-rate speech codec" Database accession no. 6254867 XP002159432 & JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS, DEC. 1998, EDITORIAL DEPARTMENT, J. CHINA UNIV. OF POSTS & TELECOMMUNICATIONS, CHINA, vol. 5, no. 2, pages 72-75, ISSN: 1005-8885 *
GALLARDO-ANTOLIN A ET AL: "AVOIDING DISTORTIONS DUE TO SPEECH CODING AND TRANSMISSION ERRORS IN GSM ASR TASKS" PHOENIX, AZ, MARCH 15 - 19, 1999,NEW YORK, NY: IEEE,US, 15 March 1999 (1999-03-15), pages 277-280, XP000900112 ISBN: 0-7803-5042-1 *
SALONIDIS T ET AL: "ROBUST SPEECH RECOGNITION FOR MULTIPLE TOPOLOGICAL SCENARIOS OF THE GSM MOBILE PHONE SYSTEM" SEATTLE, WA, MAY 12 - 15, 1998,NEW YORK, NY: IEEE,US, vol. CONF. 23, 12 May 1998 (1998-05-12), pages 101-104, XP000854525 ISBN: 0-7803-4429-4 *
SEUNG HO CHOI ET AL: "Performance evaluation of speech coders for speech recognition in adverse communication environments" 1999 DIGEST OF TECHNICAL PAPERS. INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS (CAT. NO.99CH36277), 1999 DIGEST OF TECHNICAL PAPERS. INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, LOS ANGELES, CA, USA, 22-24 JUNE 1999, pages 318-319, XP002159431 1999, Piscataway, NJ, USA, IEEE, USA ISBN: 0-7803-5123-1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003100372A1 (en) * 2002-05-29 2003-12-04 Nokia Corporation Method in a digital network system for controlling the transmission of terminal equipment
CN100361117C (en) * 2002-05-29 2008-01-09 诺基亚有限公司 Method in a digital network system for controlling the transmission of terminal equipment

Also Published As

Publication number Publication date
WO2001031636A3 (en) 2001-11-01
AU1049601A (en) 2001-05-08

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP