WO2001031636A2 - Speech recognition on GSM encoded data - Google Patents

Speech recognition on GSM encoded data

Info

Publication number
WO2001031636A2
Authority
WO
WIPO (PCT)
Prior art keywords
speech
template
features
gsm
lar
Prior art date
Application number
PCT/IB2000/001679
Other languages
French (fr)
Other versions
WO2001031636A3 (en)
Inventor
Martine Lapere
Original Assignee
Lernout & Hauspie Speech Products N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lernout & Hauspie Speech Products N.V. filed Critical Lernout & Hauspie Speech Products N.V.
Priority to AU10496/01A priority Critical patent/AU1049601A/en
Publication of WO2001031636A2 publication Critical patent/WO2001031636A2/en
Publication of WO2001031636A3 publication Critical patent/WO2001031636A3/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142: Hidden Markov Models [HMMs]
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Definitions

  • Automatic speech recognition is a complicated technology that is rapidly entering daily life in an increasing number of applications.
  • Digital mobile telephony is another fast-growing technology.
  • Several significant technical challenges must be met to provide automatic speech recognition in a digital mobile telephone system.
  • digital mobile telephones operate from limited capacity batteries, but automatic speech recognition uses computer processors that perform a significant number of calculations, thereby consuming a relatively substantial amount of power.
  • the physical size of a digital mobile telephone handset is very limited.
  • Digital storage memory and digital signal processors required for automatic speech recognition also represent a significant additional cost beyond that of the telephone handset.
  • automatic speech recognition is technically more difficult in the digital mobile telephone environment which includes operating in noisy environments such as public places, automobiles, etc., distortion effects related to the digital encoding of speech, and transmission errors due to the radio channel.
  • the GSM full rate codec (GSM 06.10) samples input speech at an 8 kHz rate and generates a 13-bit digital signal which is converted into 260-bit blocks that represent 160 of the original samples.
  • the nominal bit rate of the GSM encoding algorithm is 13 kbps
  • the actual transmitted data stream includes error recovery and packet information which increases the total bit rate.
  • the GSM codec uses the technique of linear predictive analysis-by-synthesis to encode the speech as a combination of linear prediction coefficients (LPC) containing spectral information and a residual pulse excitation signal.
  • the LPC filter information is in the form of quantized log area ratios (Q-LARs), while the residual pulse signal is in the form of quantized RPE-LTP parameters.
  • the quantization and compression performed in encoding the input speech creates noise and distortion that degrade the signal.
  • the pulse excitation signal is reconstructed and then input to a digital filter defined by the LPC parameters.
  • automatic speech recognition has operated in the cepstral domain by converting a digitized speech signal input into a cepstral domain signal and then performing speech recognition.
  • One automatic speech recognition system designed to operate in a GSM digital mobile telephone environment reconverts digital GSM features back into cepstral component factors and then performs the recognizing process.
  • Another speech recognition system converts the GSM parameters into linear predictive components, and then into a 256-point spectrum of each speech frame followed by a Mel-filter weighting and conversion into cepstrum.
  • a representative embodiment also includes a method of speech recognition using Global System for Mobile Communications (GSM)-encoded digital data.
  • the method includes providing a plurality of templates, each template modeling a word in a recognition vocabulary using time domain GSM Quantized Log Area Ratio (Q-LAR) features; and comparing with a recognizer module Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and producing a recognition output.
  • the Q-LAR features of the input GSM signal may be smoothed over time and have a zero mean.
  • the Q-LAR features of the input GSM signal may be generated by bandpass filtering, and may include at least one time derivative.
  • the at least one time derivative may be used by a speech detector to determine a speech begin point and a speech end point for the input GSM signal.
  • the recognizer module may also use a dynamic time warping (DTW) algorithm, and a pruned matrix to compare the template representing the input GSM signal to the at least one of the plurality of templates.
  • the pruned matrix may be generated based on a threshold distance from a matrix diagonal.
  • the recognizer module may also use variable frame rate discrimination.
  • one of the plurality of templates may be a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary. Such a composite representation may be based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word.
  • the template representing the input GSM signal may be a vector quantized template in one embodiment.
  • Hidden Markov models (HMMs), for example fenonic HMMs, may be used as templates.
  • a representative embodiment also includes an apparatus and method for recognizing speech including providing a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and comparing with a recognizer module a template representing an input speech signal to at least one of the plurality of templates, and producing a recognition output.
  • each vector may represent Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard.
  • the path may represent a minimal distance path within a threshold distance of the matrix diagonal.
  • At least one of the plurality of templates may represent at least three repetitions of a word.
  • the input speech signal may be a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features, which may be smoothed over time and have a zero mean, and/or be generated by bandpass filtering and may include at least one time derivative.
  • the recognizer module may use a dynamic time warping (DTW) algorithm.
  • the template representing an input speech signal may be a vector quantized template.
  • the templates may be hidden Markov models (HMMs), e.g., fenonic models.
  • Another embodiment of the invention includes apparatus and method for detecting when speech is present in an input acoustic signal including converting, with an input preprocessor, an input acoustic signal into a sequence of frames containing representative features; and analyzing, with a speech detection module, at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence.
  • the representative features may be Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream.
  • FIG. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment of the present invention.
  • Fig. 2 is an illustration of the three possible path ancestors for the matrix element i,j.
  • Various embodiments of the present invention are directed to techniques for a small vocabulary speaker dependent automatic speech recognizer to be used in a relatively low resource environment, e.g., a GSM digital mobile telephone handset.
  • a relatively low complexity dynamic time warping (DTW) algorithm requires a limited number of computations based on speech features extracted from the GSM signal.
  • GSM signal features represent quantized log area ratio parameters (Q-LARs) in which the quantization bins are set to convert a full range of speech with minimal distortion.
  • conventional speech recognizers operate in the cepstral domain. Converting from GSM Q-LARs to cepstrum parameters for speech recognition requires a significant computational effort. The Q-LARs must be converted into continuous LARs, which then must be converted into linear predictive coefficients (LPC) from which cepstral coefficients may be calculated. This conventional conversion to cepstrum is done because speaker dependent pitch information is lost if only the lowest cepstral coefficients are considered. For speaker dependent "single microphone" speech recognition purposes, however, it is not necessary to convert to a cepstrum representation or to filter out pitch. Instead, the extracted features may simply have a zero mean and be smoothed over time (for the highly quantized high-order coefficients).
  • a representative embodiment operates using speech features directly derived from the Q-LARs without the intermediate step of decoding into continuous LARs. This approach minimizes the feature extraction effort.
  • a weighted bandpass filtering of the GSM Q-LARs drops DC and high-frequency variations; then time derivatives are calculated, but energy is not reconstructed.
  • the speech recognizer of a representative embodiment also includes a speech detector that operates by monitoring time derivatives of the filtered Q-LARs.
  • the spectrum (or bank) of Q-LAR coefficients is much more stable during noise than during organized speech.
  • a norm of the time derivative of the Q-LARs can be used for determining when speech is present. For example, a speech begin point may be determined by when the local integration of the time derivative is increasing and greater than a first selected value. A speech end point may be determined by when the local integration of the time derivative decreases below a second selected value.
  • the time derivative of the Q-LARs may also serve as a variable frame rate control signal that only retransmits frames that differ significantly from the previous ones.
  • Fig. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment.
  • Speech processor 10 provides a spoken input signal to GSM frame coder 11 which converts the input speech into a sequence of GSM encoded frames.
  • the GSM frames are then processed by the GSM channel coder 12 and output to a GSM network 13.
  • the GSM frames from the GSM frame coder 11 also are available as an input to an automatic speech recognition GSM pre-processor 141 which removes DC and high-frequency variations by performing weighted bandpass filtering of the GSM Q-LARs and also calculates signal time derivatives.
  • Speech recognition engine 14 compares the output of the ASR GSM pre-processor 141 to GSM acoustic models 15 using DTW. This comparison also uses speech detector 142 that monitors the time derivatives of the filtered Q-LARs from the ASR GSM pre-processor 141.
  • the recognition output of the speech recognition engine 14 may be further processed; for example, to control an automatic dialing feature.
  • a received GSM signal from the GSM network 13 may be decoded by a GSM channel decoder 16 into a sequence of GSM frames which are then further processed into an audio output signal by a GSM frame decoder 17 and speech output processor 18.
  • the GSM frames from the GSM channel decoder may be provided via the ASR GSM pre-processor 141 to the speech recognition engine 14.
  • the recognition engine 14 uses a dynamic time warping (DTW) algorithm in which a multi-dimensional vector template representing the input speech signal is compared to one or more reference templates representing words in a recognition vocabulary.
  • a relatively low complexity algorithm scores two such templates against each other by time warping of their extracted features.
  • the technique of DTW is described, for example, in Chapter 11 of Deller, Proakis & Hansen, Discrete-Time Processing of Speech Signals (Prentice Hall, 1987), which is incorporated herein by reference.
  • a standard dynamic time warping algorithm determines the degree of match between two p-dimensional vector templates A and B of length m and n respectively.
  • An m*n matrix L of local scores is generated wherein L(i,j) is the distance (typically, the Euclidean distance) between the i-th component of A and the j-th component of B.
  • another matrix G(i,j) of global scores is also generated representing the cumulative sum of the elements L(k ≤ i, l ≤ j) over the minimal path.
  • FIG. 2 is an illustration of the three possible path ancestors for the matrix element i,j: (1) a horizontal path 21 from i,j-1 to i,j that reflects the degree of match of several successive vectors of template B on one vector of template A, (2) a vertical path 22 from i-1,j to i,j that reflects the degree of match of several successive vectors of template A on one vector of B, or (3) a diagonal path 23 from i-1,j-1 to i,j that reflects a one to one correspondence of a vector point of A to a single vector point of B.
  • the DTW algorithm can evaluate in either top-down or left-to-right order.
  • the score G(m,n) is the total score of the degree of match between templates A and B.
  • a standard DTW algorithm has some disadvantages, however.
  • local distances are calculated between all the feature vectors of an input utterance and all the feature vectors of the model template.
  • in the specific case of GSM coding with 20 msec frames, a 2 second sample of input speech needs 100 frames, and a corresponding matrix of local distances takes a cumbersome 10K of memory.
  • Several methods can be used to prune possible paths of the DTW algorithm in order to reduce the number of evaluations that have to be made. For example, the pruning may be based on a maximum number of active states where a fixed maximum of n-best states is kept active. There will be n horizontal states in top- down evaluations or n vertical states in left-right evaluation. This pruning method is not symmetric since the total beam of active states can vary from top-down to left-right evaluation.
  • a representative embodiment uses pruning around the diagonal.
  • the diagonal from (1,1) to (m,n) is calculated, and the states with a topological distance smaller than a preset threshold from the diagonal are evaluated.
  • This pruning method is symmetric in top-down or left to right evaluation.
  • pruning around the diagonal is much more demanding on the overall match of the two templates.
  • the local match must be within the threshold on each cut through the diagonal so both templates must have good local correspondence all along the diagonal.
  • with pruning on a fixed number of active states, a good local match might compensate for another local mismatch. This could lead to an optimal path that is far away from the diagonal. Imposing diagonal pruning on two utterances with different acoustical content forces higher scores and thereby enables better rejection.
  • Pruning on the diagonal does demand that both input templates match locally well all along the length. This requirement is not a significant issue for single-word utterances, but does become problematic for multiple word utterances with varying inter-word silence lengths, or with utterances with leading or trailing silence.
  • the silence regions are stripped by a begin-end point detector, and then variable frame rate discrimination is employed. This approach works very well with pruning on the diagonal.
  • the m-by-n matrix of global scores is not fully calculated.
  • One additional memory element is used for the local score, and one or more elements are used for beam boundaries, plus an additional 14 bytes are used to track path length so that altogether, a representative embodiment uses a mere 46 bytes of memory at recognition time.
  • backtrace information is also kept in memory; for a two second speech sample, this is 14x100x2 bits, or 350 bytes.
  • word templates may also be kept in memory to improve system speed.
  • a standard DTW system performs a one to one evaluation of two feature vector templates in which a new incoming feature vector is matched to each template feature vector stored during training.
  • multiple repetitions of a given utterance may be provided during training, and the DTW system scores a given test utterance against each of the stored repetitions.
  • the score for a single word in the recognition vocabulary then is a combination of the scores for each of the different stored repetitions of that word that were trained.
  • a representative embodiment combines multiple repetition templates during training to form a single "glued" template for each word that represents the "average" of the various repetitions.
  • a first repetition of a training utterance is stored in temporary memory.
  • a second repetition of the same training utterance is requested and stored in temporary memory.
  • a check is then made on the consistency of the length of the two noise-stripped canonical form utterances. If the lengths don't match, the first utterance is overwritten by the second, and a new input utterance is taken to replace the second. If the first and second utterances match in length, they are scored against each other by the diagonal pruned DTW algorithm. This is done not with the aim of getting the score, but in order to get the optimal path of the best scores around the diagonal.
  • a similarity check is performed: if the score is too high and the template represents only a single utterance, then the template is overwritten by the second utterance; if the template is already a glued form of two or more utterances, the last utterance is neglected. Since the templates are already very similar, most of the optimal path will be a concatenation of diagonal transitions, although some horizontal and vertical transitions will remain. All the horizontal and vertical transitions are packed, since they correspond to a local mismatch of the two templates; then the two templates are averaged over the path. When a local diagonal path exists, each element of the glued template will be an average of the corresponding elements of the first and the second templates.
  • a representative embodiment uses three utterance repetitions for training.
  • each new training utterance is merged into the primary glued template, generating a new secondary glued template.
  • the new secondary glued template represents the weighted mean of the new utterance (that is, the first new utterance that passes the length consistency check and score check), and the previously stored glued template.
  • This algorithm has the secondary advantages that a glued template sublimates to a minimal length representation, and that any possible remaining trailing or ending noise in one of the repetition templates gets compressed to a maximum of one state, since the chance of having similar remaining trailing or ending noise states in all training utterances is insignificant.
  • features derived from the input speech may be vector quantized (VQ) in order to have a lower resolution representation.
  • Vector quantization reduces the size of the data stream, and so also reduces the amount of data memory that is needed. This benefit is maximized if the VQ can be done in real time.
  • a VQ system also may need fewer calculations to perform recognition because the distances between different points are predefined.
  • One disadvantage of classical VQ systems is the need to store codebooks. For example, a codebook for an eight-dimensional feature vector system is typically around 1K which, for real-time recognition, should be loaded in (expensive) RAM.
  • the feature vector dimensions may be reduced by using a regular grid of codewords, in which case, there is no need to store a codebook, but rather only to define an appropriate quantization scheme.
  • the modified DTW approach of a representative embodiment appears to be the most appropriate pattern matching routine to be used in a GSM environment.
  • alternative embodiments may employ other approaches.
  • as alternatives for the local distance, Euclidean, abs(diff), or "-improduct" calculations on the vectors may be used.
  • two or more reference templates can be "glued" together by an appropriate modification of the DTW algorithm. Neighboring states within a template can also be combined, in order to reduce the dimensionality of the templates. These compressed templates can eventually also be used for single Gaussian Viterbi scoring, since they are fully compatible in origin.
  • An alternative embodiment may also be based on stochastic Hidden Markov Models (HMMs) trained by Viterbi iteration. For instance, each word in the recognition vocabulary may be represented as a single continuous density (or Gaussian) HMM. This implies a training procedure with different iterations on the feature vectors, and the feature vectors would be kept in system RAM. Evaluation is fast in such an embodiment, but not necessarily superior to the glued DTW patterns approach (which takes less RAM at training).
  • An embodiment could also be based on the use of fenonic discrete density HMMs to represent words in the recognition vocabulary.
  • This approach implies the storage of phonetic reference models which could be stored in flash memory.
  • One disadvantage would be the problem of storage of the flash data.
  • Advantages would include relatively small storage templates, low RAM requirements, and more potential for noise robustness and speaker independent solutions.
  • a CPU for a typical GSM DSP operates at 50 MIPS, more than enough processing power for a representative embodiment.
  • the CPU is used (a) at training time, where feature extraction is kept as low as possible, and (b) at recognition/verification time, to score an input phrase against a number of words in the recognition vocabulary.
  • One workable approach is to aim for a one second response time, of which about 0.5 seconds is used for speech end-point detection, leaving around 25 msec per word (on average). Thus, the number of active states per word also should be minimized.
  • ROM code memory should also be kept as small as possible since code memory will have to be shared between the normal GSM functionality and the speech recognizer. Flash memory is less of an issue; an adequate working target is about 1K per word.
  • a speech recognizer according to a representative embodiment may be found in the ASR100 small footprint isolated word recognizer made by Lernout & Hauspie Speech Products N.V. of Ieper, Belgium.
  • the ASR100 is intended to be used in handheld consumer devices, for example, in a GSM mobile telephone for providing access to a personal address or telephone book by speaking the name of the addressee.
  • the ASR100 uses part of the digital GSM frame data as input, and the appropriate code is called at the normal frame rate, but not all items composing a frame have to be computed during enrollment or at recognition time.
  • the basic footprint figures used herein do not include the requirements for the GSM frame encoding process; the figures refer to a recognizer running in the digital GSM domain. If, however, the engine runs in the GSM phone itself, it is possible to share some RAM data with the RAM reserved for the GSM encoding.
  • Total flash memory size for 30 words of average duration of 1 second is 24 Kbyte without playback functionality, 84 Kbyte with playback functionality, and a 10 MIPS DSP gives an average sub-second recognition latency measured from end of utterance.
  • the recognition engine normally operates in push-to-talk mode, although automatic speech detection can be used in some applications.
  • the training procedure for new entries takes three repetitions of the new entry (the minimum number of repetitions is two).
  • a consistency check is made of the repetitions during training, and additional training utterances are adaptively requested if required. If a user chooses a standard training with only two sample utterances, there is an option to adapt the template at recognition time.
  • a confusability check is also made at training time, preventing the generation of confusable word pairs in the vocabulary. Since longer utterances (first + last names) are less confusable than short nicknames, the user is encouraged not to use nickname entries.
  • the input utterance is checked against all vocabulary entries. If the utterance cannot be found in the recognition vocabulary, it is rejected. Otherwise, the template reference or the vocal playback of the recognized word is given for confirmation.
  • An embodiment may be employed in various alternative configurations provided the various tradeoffs are considered and accommodated since RAM space and CPU resources directly compete with system performance.
  • a minimal RAM implementation could use only 1 KWord of RAM with, however, an increase in code size of about 20%.
  • a minimal Flash storage embodiment would decrease the template storage of a word for recognition by a factor of two, at the expense, however, of CPU load and eventually a slight decrease in performance.
  • a minimal CPU embodiment would require some 20% increase in code, and could result in a slightly decreased performance.
  • Embodiments of the invention may be implemented in any conventional computer programming language. For example, representative embodiments may be implemented in a procedural programming language (e.g., "C") or an object oriented programming language (e.g., "C++"). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components. Embodiments can be implemented as a computer program product for use with a computer system.
  • Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium.
  • the medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques).
  • the series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems.
  • Such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies.
  • a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web).
  • some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech recognizer uses Global System for Mobile Communications (GSM)-encoded digital data. Templates model words in a recognition vocabulary using GSM Quantized Log Area Ratio (Q-LAR) features. A recognizer module compares Q-LAR features of a template representing an input GSM signal to recognition vocabulary templates and produces a recognition output.

Description

Small Vocabulary Speaker Dependent Speech Recognition
Field of the Invention
The invention relates to automatic speech recognition in a low resource environment.
Background Art
Automatic speech recognition is a complicated technology that is rapidly entering daily life in an increasing number of applications. Digital mobile telephony is another fast-growing technology. Several significant technical challenges must be met to provide automatic speech recognition in a digital mobile telephone system. For example, digital mobile telephones operate from limited capacity batteries, but automatic speech recognition uses computer processors that perform a significant number of calculations, thereby consuming a relatively substantial amount of power. In addition, the physical size of a digital mobile telephone handset is very limited. Digital storage memory and digital signal processors required for automatic speech recognition also represent a significant additional cost beyond that of the telephone handset. In addition, automatic speech recognition is technically more difficult in the digital mobile telephone environment, which includes operating in noisy environments such as public places and automobiles, distortion effects related to the digital encoding of speech, and transmission errors due to the radio channel.
One widely employed digital mobile telephone system uses the GSM standard. The GSM full rate codec (GSM 06.10) samples input speech at an 8 kHz rate and generates a 13-bit digital signal which is converted into 260-bit blocks that represent 160 of the original samples. Although the nominal bit rate of the GSM encoding algorithm is 13 kbps, the actual transmitted data stream includes error recovery and packet information which increases the total bit rate.
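These figures are mutually consistent, as a quick calculation shows: 160 samples at 8 kHz span 20 msec, and 260 bits every 20 msec give the 13 kbps nominal rate. The following minimal C sketch verifies the arithmetic (the constant names are illustrative, not from the patent):

    #include <assert.h>

    enum {
        SAMPLE_RATE_HZ    = 8000, /* input sampling rate          */
        SAMPLES_PER_FRAME = 160,  /* speech samples per GSM frame */
        BITS_PER_FRAME    = 260   /* encoded bits per GSM frame   */
    };

    int main(void)
    {
        /* 160 samples at 8 kHz => 20 ms per frame. */
        int frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ;
        assert(frame_ms == 20);

        /* 260 bits every 20 ms => 13000 bits/sec nominal rate. */
        int bits_per_sec = BITS_PER_FRAME * (1000 / frame_ms);
        assert(bits_per_sec == 13000);
        return 0;
    }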
The GSM codec uses the technique of linear predictive analysis-by-synthesis to encode the speech as a combination of linear prediction coefficients (LPC) containing spectral information and a residual pulse excitation signal. The LPC filter information is in the form of quantized log area ratios (Q-LARs), while the residual pulse signal is in the form of quantized RPE-LTP parameters. The quantization and compression performed in encoding the input speech creates noise and distortion that degrade the signal. To decode the digital signal back into speech, the pulse excitation signal is reconstructed and then input to a digital filter defined by the LPC parameters.
Traditionally, automatic speech recognition has operated in the cepstral domain by converting a digitized speech signal input into a cepstral domain signal and then performing speech recognition. One automatic speech recognition system designed to operate in a GSM digital mobile telephone environment reconverts digital GSM features back into cepstral component factors and then performs the recognizing process. Another speech recognition system converts the GSM parameters into linear predictive components, and then into a 256-point spectrum of each speech frame followed by a Mel-filter weighting and conversion into cepstrum.
Summary of the Invention
A representative embodiment of the present invention includes a speech recognizer using Global System for Mobile Communications (GSM)-encoded digital data. The recognizer includes a plurality of templates, each template modeling a word in a recognition vocabulary using GSM Quantized Log Area Ratio (Q-LAR) features; and a recognizer module that compares Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and produces a recognition output.
A representative embodiment also includes a method of speech recognition using Global System for Mobile Communications (GSM)-encoded digital data. The method includes providing a plurality of templates, each template modeling a word in a recognition vocabulary using time domain GSM Quantized Log Area Ratio (Q-LAR) features; and comparing with a recognizer module Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and producing a recognition output.
In a further embodiment of either of the above, the Q-LAR features of the input GSM signal may be smoothed over time and have a zero mean. In such a case, the Q-LAR features of the input GSM signal may be generated by bandpass filtering, and may include at least one time derivative. The at least one time derivative may be used by a speech detector to determine a speech begin point and a speech end point for the input GSM signal. The recognizer module may also use a dynamic time warping (DTW) algorithm, and a pruned matrix to compare the template representing the input GSM signal to the at least one of the plurality of templates. The pruned matrix may be generated based on a threshold distance from a matrix diagonal. The recognizer module may also use variable frame rate discrimination. In another embodiment, one of the plurality of templates may be a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary. Such a composite representation may be based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word. The template representing the input GSM signal may be a vector quantized template in one embodiment. Hidden Markov models (HMMs), for example fenonic HMMs, may be used as templates.
A representative embodiment also includes an apparatus and method for recognizing speech including providing a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and comparing with a recognizer module a template representing an input speech signal to at least one of the plurality of templates, and producing a recognition output. In such an embodiment, each vector may represent Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard. The path may represent a minimal distance path within a threshold distance of the matrix diagonal. At least one of the plurality of templates may represent at least three repetitions of a word. The input speech signal may be a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features, which may be smoothed over time and have a zero mean, and/or be generated by bandpass filtering and may include at least one time derivative. The recognizer module may use a dynamic time warping (DTW) algorithm. The template representing an input speech signal may be a vector quantized template. The templates may be hidden Markov models (HMMs), e.g., fenonic models.
Another embodiment of the invention includes apparatus and method for detecting when speech is present in an input acoustic signal including converting, with an input preprocessor, an input acoustic signal into a sequence of frames containing representative features; and analyzing, with a speech detection module, at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence. In such an embodiment, the representative features may be Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream. Analyzing with a speech detection module further includes determining a speech begin point in the sequence when a local integration of the at least one feature time derivative is increasing and greater than a first selected value, and determining a speech end point in the sequence when the local integration of the at least one feature time derivative is less than a second selected value.
Brief Description of the Drawings
The present invention will be more readily understood by reference to the following detailed description taken with the accompanying drawings, in which: Fig. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment of the present invention.
Fig. 2 is an illustration of the three possible path ancestors for the matrix element i,j.
Detailed Description of Specific Embodiments
Various embodiments of the present invention are directed to techniques for a small vocabulary speaker dependent automatic speech recognizer to be used in a relatively low resource environment, e.g., a GSM digital mobile telephone handset. A relatively low complexity dynamic time warping (DTW) algorithm requires a limited number of computations based on speech features extracted from the GSM signal.
GSM signal features represent quantized log area ratio parameters (Q-LARs) in which the quantization bins are set to convert a full range of speech with minimal distortion. However, conventional speech recognizers operate in the cepstral domain. Converting from GSM Q-LARs to cepstrum parameters for speech recognition requires a significant computational effort. The Q-LARs must be converted into continuous LARs, which then must be converted into linear predictive coefficients (LPC) from which cepstral coefficients may be calculated. This conventional conversion to cepstrum is done because speaker dependent pitch information is lost if only the lowest cepstral coefficients are considered. For speaker dependent "single microphone" speech recognition purposes, however, it is not necessary to convert to a cepstrum representation or to filter out pitch. Instead, the extracted features may simply have a zero mean and be smoothed over time (for the highly quantized high-order coefficients).
Thus, in contrast to a conventional speech recognizer, a representative embodiment operates using speech features directly derived from the Q-LARs without the intermediate step of decoding into continuous LARs. This approach minimizes the feature extraction effort. A weighted bandpass filtering of the GSM Q-LARs drops DC and high-frequency variations; then time derivatives are calculated, but energy is not reconstructed.
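As an illustration of such a front end, the following C sketch derives zero-mean, smoothed features and their time derivatives from the per-frame Q-LARs. It assumes 8 Q-LAR coefficients per frame, as in the GSM 06.10 codec; the particular filter weights are placeholders, since the patent does not disclose the actual weighting, and no energy term is computed.

    #define NUM_LAR 8  /* Q-LAR coefficients per GSM 06.10 frame */

    typedef struct {
        float lar[NUM_LAR];   /* filtered, zero-mean features */
        float dlar[NUM_LAR];  /* time derivatives             */
    } Features;

    /* One frame of the front end: a slow running mean removes DC, an
     * average with the previous filtered frame smooths high-frequency
     * variation, and a first difference gives the time derivative.
     * The weights (0.99, 0.5) are assumptions; the patent only states
     * that a weighted bandpass filter is used, and energy is never
     * reconstructed.                                                 */
    void extract_features(const float qlar[NUM_LAR], float mean[NUM_LAR],
                          float prev[NUM_LAR], Features *out)
    {
        for (int k = 0; k < NUM_LAR; k++) {
            mean[k] = 0.99f * mean[k] + 0.01f * qlar[k];      /* DC estimate */
            float f = 0.5f * ((qlar[k] - mean[k]) + prev[k]); /* smoothing   */
            out->dlar[k] = f - prev[k];  /* delta, used by speech detection  */
            out->lar[k]  = f;
            prev[k] = f;                 /* filter state for the next frame  */
        }
    }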
The speech recognizer of a representative embodiment also includes a speech detector that operates by monitoring time derivatives of the filtered Q-LARs. The spectrum (or bank) of Q-LAR coefficients is much more stable during noise than during organized speech. Thus, a norm of the time derivative of the Q-LARs can be used for determining when speech is present. For example, a speech begin point may be determined by when the local integration of the time derivative is increasing and greater than a first selected value. A speech end point may be determined by when the local integration of the time derivative decreases below a second selected value. In addition, the time derivative of the Q-LARs may also serve as a variable frame rate control signal that only retransmits frames that differ significantly from the previous ones.
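A minimal sketch of such a detector in C, reusing NUM_LAR and the delta features from the sketch above; the L1 norm, the leaky integration constant, and both thresholds are assumptions:

    #define T_BEGIN 4.0f  /* assumed begin threshold */
    #define T_END   1.0f  /* assumed end threshold   */

    typedef struct { float integ; int in_speech; } SpeechDetector;

    /* Per-frame decision from the Q-LAR delta norm.  Returns +1 at a
     * speech begin point, -1 at an end point, 0 otherwise.           */
    int detect_speech(SpeechDetector *sd, const float dlar[NUM_LAR])
    {
        float norm = 0.0f;  /* L1 norm of the feature derivatives */
        for (int k = 0; k < NUM_LAR; k++)
            norm += (dlar[k] < 0.0f) ? -dlar[k] : dlar[k];

        float prev = sd->integ;
        sd->integ = 0.9f * sd->integ + norm;  /* leaky local integration */

        if (!sd->in_speech && sd->integ > prev && sd->integ > T_BEGIN) {
            sd->in_speech = 1;   /* rising and above the first threshold */
            return +1;
        }
        if (sd->in_speech && sd->integ < T_END) {
            sd->in_speech = 0;   /* fell below the second threshold      */
            return -1;
        }
        return 0;
    }

The same per-frame norm can drive the variable frame rate control: a frame whose delta norm falls below a threshold may be treated as a near-repeat of the previous frame and dropped.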
Fig. 1 illustrates an automatic speech recognizer in a GSM environment according to a representative embodiment. Speech processor 10 provides a spoken input signal to GSM frame coder 11 which converts the input speech into a sequence of GSM encoded frames. The GSM frames are then processed by the GSM channel coder 12 and output to a GSM network 13. The GSM frames from the GSM frame coder 11 also are available as an input to an automatic speech recognition GSM pre-processor 141 which removes DC and high-frequency variations by performing weighted bandpass filtering of the GSM Q-LARs and also calculates signal time derivatives. Speech recognition engine 14 compares the output of the ASR GSM pre-processor 141 to GSM acoustic models 15 using DTW. This comparison also uses speech detector 142 that monitors the time derivatives of the filtered Q-LARs from the ASR GSM pre-processor 141. The recognition output of the speech recognition engine 14 may be further processed; for example, to control an automatic dialing feature.
Alternatively, or in addition, a received GSM signal from the GSM network 13 may be decoded by a GSM channel decoder 16 into a sequence of GSM frames which are then further processed into an audio output signal by a GSM frame decoder 17 and speech output processor 18. The GSM frames from the GSM channel decoder may be provided via the ASR GSM pre-processor 141 to the speech recognition engine 14. The recognition engine 14 uses a dynamic time warping (DTW) algorithm in which a multi-dimensional vector template representing the input speech signal is compared to one or more reference templates representing words in a recognition vocabulary. A relatively low complexity algorithm scores two such templates against each other by time warping of their extracted features. The technique of DTW is described, for example, in Chapter 11 of Deller, Proakis & Hansen, Discrete-Time Processing of Speech Signals (Prentice Hall, 1987), which is incorporated herein by reference.
A standard dynamic time warping algorithm determines the degree of match between two p-dimensional vector templates A and B of length m and n respectively. An m*n matrix L of local scores is generated wherein L(i,j) is the distance (typically, the Euclidean distance) between the i-th component of A and the j-th component of B. Corresponding with the matrix L(i,j), another matrix G(i,j) of global scores is also generated representing the cumulative sum of the elements L(k ≤ i, l ≤ j) over the minimal path. Fig. 2 is an illustration of the three possible path ancestors for the matrix element i,j: (1) a horizontal path 21 from i,j-1 to i,j that reflects the degree of match of several successive vectors of template B on one vector of template A, (2) a vertical path 22 from i-1,j to i,j that reflects the degree of match of several successive vectors of template A on one vector of B, or (3) a diagonal path 23 from i-1,j-1 to i,j that reflects a one to one correspondence of a vector point of A to a single vector point of B. The DTW algorithm can evaluate in either top-down or left-to-right order. The score G(m,n) is the total score of the degree of match between templates A and B.
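The recurrence can be written compactly; the following C sketch fills the global score matrix in left-to-right order using the three-ancestor rule and a squared Euclidean local distance (one of several possible local distances, as the text notes later):

    #include <float.h>
    #include <math.h>

    /* Squared Euclidean local distance between two p-dimensional frames. */
    static float local_dist(const float *a, const float *b, int p)
    {
        float d = 0.0f;
        for (int k = 0; k < p; k++) { float t = a[k] - b[k]; d += t * t; }
        return d;
    }

    /* Standard DTW: A has m frames, B has n frames, p features each; G is
     * caller-supplied m*n scratch space (row-major) for global scores.   */
    float dtw(const float *A, int m, const float *B, int n, int p, float *G)
    {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                float best = 0.0f;       /* G(1,1) has no ancestor       */
                if (i > 0 || j > 0) {
                    best = FLT_MAX;      /* min over the three ancestors */
                    if (j > 0)          best = fminf(best, G[i*n + j-1]);     /* horizontal */
                    if (i > 0)          best = fminf(best, G[(i-1)*n + j]);   /* vertical   */
                    if (i > 0 && j > 0) best = fminf(best, G[(i-1)*n + j-1]); /* diagonal   */
                }
                G[i*n + j] = local_dist(A + i*p, B + j*p, p) + best;
            }
        }
        return G[m*n - 1];  /* G(m,n): total score of the match */
    }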
A standard DTW algorithm has some disadvantages, however. First, local distances are calculated between all the feature vectors of an input utterance and all the feature vectors of the model template. In the specific case of GSM coding with 20 msec frames, a 2 second sample of input speech needs 100 frames, and the corresponding matrix of local distances takes a cumbersome 10K of memory. Several methods can be used to prune possible paths of the DTW algorithm in order to reduce the number of evaluations that have to be made. For example, the pruning may be based on a maximum number of active states where a fixed maximum of n-best states is kept active. There will be n horizontal states in top-down evaluation or n vertical states in left-to-right evaluation. This pruning method is not symmetric since the total beam of active states can vary from top-down to left-to-right evaluation.
A representative embodiment uses pruning around the diagonal. The diagonal from (1,1) to (m,n) is calculated, and only the states with a topological distance smaller than a preset threshold from the diagonal are evaluated. This pruning method is symmetric in top-down or left-to-right evaluation. Compared to pruning on a maximum number of active states, pruning around the diagonal is much more demanding on the overall match of the two templates. The local match must be within the threshold on each cut through the diagonal, so both templates must have good local correspondence all along the diagonal. With pruning on a fixed number of active states, a good local match might compensate for another local mismatch. This could lead to an optimal path that is far away from the diagonal. Imposing diagonal pruning on two utterances with different acoustical content forces higher scores and thereby enables better rejection.
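One way to express the diagonal beam is as a predicate over matrix cells; in this sketch the beam width parameter and its units (frames of template A) are assumptions:

    /* Diagonal-beam test: evaluate cell (i,j) only when it lies within a
     * fixed topological distance of the (0,0)-(m-1,n-1) diagonal.  The
     * cross-multiplication avoids a division per cell.                  */
    static int in_beam(int i, int j, int m, int n, int beam)
    {
        long d = (long)i * n - (long)j * m;  /* signed offset from diagonal */
        if (d < 0) d = -d;
        return d <= (long)beam * n;
    }

In the DTW loops above, a cell failing this test is simply skipped, i.e. treated as having an infinite global score, so only the states inside the beam are ever evaluated.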
Pruning on the diagonal, however, does demand that both input templates match locally well all along the length. This requirement is not a significant issue for single-word utterances, but does become problematic for multiple word utterances with varying inter-word silence lengths, or with utterances with leading or trailing silence. In representative embodiments, the silence regions are stripped by a begin-end point detector, and then variable frame rate discrimination is employed. This approach works very well with pruning on the diagonal.
In a representative embodiment, the m-by-n matrix of global scores is not fully calculated. By restricting the path to the specified beam around the diagonal, only 14 elements (28 bytes) need to be kept in memory. One additional memory element is used for the local score, and one or more elements are used for beam boundaries, plus an additional 14 bytes are used to track path length, so that altogether a representative embodiment uses a mere 46 bytes of memory at recognition time. During system training, backtrace information is also kept in memory; for a two second speech sample, this is 14x100x2 bits, or 350 bytes. Thus, the total memory needed for training is 46 bytes + 350 bytes = 396 bytes, a very modest number. In one embodiment where the extra memory is available, word templates may also be kept in memory to improve system speed.
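The byte counts can be reproduced with simple constants, assuming 2-byte score elements (consistent with 14 elements occupying 28 bytes):

    /* Recognition-time memory budget; the constants reproduce the
     * arithmetic given in the text.                                */
    enum {
        BAND_ELEMENTS = 14,                 /* active band of global scores */
        BAND_BYTES    = BAND_ELEMENTS * 2,  /* 28 bytes                     */
        PATHLEN_BYTES = 14,                 /* path-length tracking         */
        /* + one local-score element and beam-boundary elements -> 46 bytes */
        TRACE_BITS    = 14 * 100 * 2,       /* training backtrace, 2 s      */
        TRACE_BYTES   = TRACE_BITS / 8      /* 350 bytes                    */
    };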
A standard DTW system performs a one to one evaluation of two feature vector templates in which a new incoming feature vector is matched to each template feature vector stored during training. In some conventional DTW systems, multiple repetitions of a given utterance may be provided during training, and the DTW system scores a given test utterance against each of the stored repetitions. The score for a single word in the recognition vocabulary then is a combination of the scores for each of the different stored repetitions of that word that were trained. There are several drawbacks to this approach including excessive storage requirements for a single word in recognition vocabulary and excessive CPU processing needed to evaluate all of the various matches.
A representative embodiment combines multiple repetition templates during training to form a single "glued" template for each word that represents the "average" of the various repetitions. A first repetition of a training utterance is stored in temporary memory. Next, a second repetition of the same training utterance is requested and stored in temporary memory. A check is then made on the consistency of the length of the two noise-stripped canonical form utterances. If the lengths don't match, the first utterance is overwritten by the second, and a new input utterance is taken to replace the second. If the first and second utterances match in length, they are scored against each other by the diagonal pruned DTW algorithm. This is done not with the aim of getting the score, but in order to get the optimal path of the best scores around the diagonal. Next, a similarity check is performed: if the score is too high and the template represents only a single utterance, then the template is overwritten by the second utterance; if the template is already a glued form of two or more utterances, the last utterance is neglected. Since the templates are already very similar, most of the optimal path will be a concatenation of diagonal transitions, although some horizontal and vertical transitions will remain. All the horizontal and vertical transitions are packed, since they correspond to a local mismatch of the two templates; then the two templates are averaged over the path. Where a local diagonal path exists, each element of the glued template will be an average of the corresponding elements of the first and the second templates. In the case of a local vertical path, the corresponding element in the glued template will be the average of all successive frames of template A and a single frame of template B. In the case of a local horizontal path, the corresponding element in the glued template will be the average of all successive frames of template B and the single corresponding frame of template A. The resulting glued template is stored for use during the recognition process.
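A sketch of the gluing step in C, assuming the optimal path has already been backtraced into an array of step codes; the equal weighting applied when averaging a packed run is an assumption, since the text leaves the exact weighting open:

    /* Glue two templates A and B (p features per frame) along the optimal
     * pruned-DTW path.  step[s] encodes the transition into the (s+1)-th
     * path cell: 0 = diagonal, 1 = horizontal (B advances), 2 = vertical
     * (A advances); recovered by DTW backtracing (not shown).  counts[]
     * is caller-supplied scratch space.                                  */
    int glue_templates(const float *A, const float *B, int p,
                       const int *step, int nsteps, float *out, int *counts)
    {
        int i = 0, j = 0, g = -1;
        for (int s = 0; s <= nsteps; s++) {
            if (s == 0 || step[s - 1] == 0) {   /* diagonal opens a state */
                g++;
                counts[g] = 0;
                for (int k = 0; k < p; k++) out[g * p + k] = 0.0f;
            }
            /* Pack this cell into the open state: horizontal and vertical
             * runs (local mismatches) collapse into a single state.      */
            for (int k = 0; k < p; k++)
                out[g * p + k] += 0.5f * (A[i * p + k] + B[j * p + k]);
            counts[g]++;
            if (s < nsteps) {
                if (step[s] != 1) i++;  /* vertical or diagonal advances A   */
                if (step[s] != 2) j++;  /* horizontal or diagonal advances B */
            }
        }
        for (int t = 0; t <= g; t++)    /* close the states: average them */
            for (int k = 0; k < p; k++)
                out[t * p + k] /= (float)counts[t];
        return g + 1;                   /* length of the glued template   */
    }

A diagonal-only path reproduces the element-by-element average described above, while a horizontal or vertical run collapses into a single averaged state.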
A representative embodiment uses three utterance repetitions for training. Each new training utterance is merged into the primary glued template, generating a new secondary glued template. The new secondary glued template represents the weighted mean of the new utterance (that is, the first new utterance that passes the length consistency check and score check) and the previously stored glued template. This algorithm has the secondary advantages that a glued template sublimates to a minimal length representation, and that any possible remaining trailing or ending noise in one of the repetition templates gets compressed to a maximum of one state, since the chance of having similar remaining trailing or ending noise states in all training utterances is insignificant.
In a further embodiment, features derived from the input speech may be vector quantized (VQ) in order to have a lower resolution representation. Vector quantization reduces the size of the data stream, and so also reduces the amount of data memory that is needed. This benefit is maximized if the VQ can be done in real time. A VQ system also may need fewer calculations to perform recognition because the distances between different points are predefined. One disadvantage of classical VQ systems is the need to store codebooks. For example, a codebook for an eight-dimensional feature vector system is typically around 1K which, for real-time recognition, should be loaded in (expensive) RAM. Alternatively, the feature vector dimensions may be reduced by using a regular grid of codewords, in which case there is no need to store a codebook, but rather only to define an appropriate quantization scheme (see the sketch following this passage).
On the whole, the modified DTW approach of a representative embodiment appears to be the most appropriate pattern matching routine to be used in a GSM environment. However, alternative embodiments may employ other approaches. As alternatives for the local distance, Euclidean, abs(diff), or "-improduct" calculations on the vectors may be used. In addition, two or more reference templates can be "glued" together by an appropriate modification of the DTW algorithm. Neighboring states within a template can also be combined, in order to reduce the dimensionality of the templates. These compressed templates can eventually also be used for single Gaussian Viterbi scoring, since they are fully compatible in origin.
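A minimal sketch of the regular-grid quantization referred to above, reusing NUM_LAR from the front-end sketch; the step size and the number of levels per dimension are assumptions:

    #define GRID_STEP   0.25f  /* assumed quantization step per dimension */
    #define GRID_LEVELS 16     /* assumed number of levels per dimension  */

    /* Regular-grid vector quantization: each dimension is quantized
     * independently onto a uniform grid, so only the grid parameters,
     * not a codebook, need to be stored.                              */
    void grid_quantize(const float in[NUM_LAR], unsigned char out[NUM_LAR])
    {
        for (int k = 0; k < NUM_LAR; k++) {
            int q = (int)(in[k] / GRID_STEP) + GRID_LEVELS / 2;
            if (q < 0) q = 0;                       /* clamp to the grid */
            if (q >= GRID_LEVELS) q = GRID_LEVELS - 1;
            out[k] = (unsigned char)q;  /* 4-bit code per dimension here */
        }
    }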
An alternative embodiment may also be based on stochastic Hidden Markov Models (HMMs) trained by Viterbi iteration. For instance, each word in the recognition vocabulary may be represented as a single continuous density (or Gaussian) HMM. This implies a training procedure with different iterations on the feature vectors, and the feature vectors would be kept in system RAM. Evaluation is fast in such an embodiment, but not necessarily superior to the glued DTW patterns approach (which takes less RAM at training).
An embodiment could also be based on the use of fenonic discrete density HMMs to represent words in the recognition vocabulary. This approach implies the storage of phonetic reference models which could be stored in flash memory. One disadvantage would be the problem of storage of the flash data. Advantages would include relatively small storage templates, low RAM requirements, and more potential for noise robustness and speaker independent solutions.
Of course, the cost of the required hardware is quite important in such an application. In a representative embodiment, the speech recognizer uses the existing GSM digital signal processor and does not need significant additional hardware. Hardware cost considerations also include making efficient use of Random Access Memory (RAM), CPU processing power, Read-Only Memory (ROM), and flash memory. RAM is relatively expensive; therefore, the amount of RAM required by a representative embodiment is kept as low as possible, typically 1-2 Kbytes with a maximum of 4 Kbytes. The amount of RAM necessary is not, however, independent of CPU processing power. CPU processing, especially for feature extraction, is kept as low as possible in order to run in real-time. Otherwise, data must be buffered, which in turn means additional relatively large data buffers. RAM used for generating reference templates during training is also kept as low as possible. Because templates are typically block-loaded into RAM for speed considerations, RAM usage is also minimized.
A CPU for a typical GSM DSP operates at 50 MIPS, more than enough processing power for a representative embodiment. In a representative embodiment, the CPU is used (a) at training time, where feature extraction is kept as low as possible, and (b) at recognition/verification time, to score an input phrase against a number of words in the recognition vocabulary. For speed considerations, it is advisable to load comparison templates into RAM since flash memory access may be relatively slow. Due to RAM limitations, this means that candidate words will have to be scored sequentially, or block-buffered. One workable approach is to aim for a one second response time, of which about 0.5 seconds is used for speech end-point detection, leaving around 25 msec per word (on average). Thus, the number of active states per word also should be minimized.
ROM code memory should also be kept as small as possible, since code memory will have to be shared between the normal GSM functionality and the speech recognizer. Flash memory is less of an issue; an adequate working target is about 1K per word. A speech recognizer according to a representative embodiment may be found in the ASR100 small footprint isolated word recognizer made by Lernout & Hauspie Speech Products N.V. of Ieper, Belgium. The ASR100 is intended for use in handheld consumer devices, for example, in a GSM mobile telephone for providing access to a personal address or telephone book by speaking the name of the addressee. The ASR100 uses part of the digital GSM frame data as input, and the appropriate code is called at the normal frame rate, but not all items composing a frame have to be computed during enrollment or at recognition time. The basic footprint figures used herein do not include the requirements for the GSM frame encoding process; the figures refer to a recognizer running in the digital GSM domain. If, however, the engine runs in the GSM phone itself, it is possible to share some RAM data with the RAM reserved for the GSM encoding.
By storing only the GSM frames that contain speech (using Begin-Endpoint detection; a sketch of such a detector follows the footprint figures below), playback functionality requires memory for 13 Kbit per second of effective speech duration. An average speech duration of 1 second is assumed; for other average speech lengths, the storage needed changes proportionally. Such an embodiment is designed with the following system characteristics:
• 3 kWord of RAM, some of which can be shared with the GSM processing,
• 4 kWord of ROM,
• 0.8 kByte (on average) in flash memory per enrolled word for recognition (1 second),
• 1.7 kByte (on average) in flash memory per enrolled word for playback (1 second),
• Total flash memory size for 30 words of average 1 second duration is 75 Kbyte,
• 10 MIPS DSP for an average sub-second recognition latency measured from end of utterance, and
• 8 kHz sampling rate.
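As a check on these figures (the arithmetic is supplied here, not taken from the specification): GSM full-rate speech occupies 13 Kbit/sec, so one second of stored speech frames amounts to 13,000 / 8 ≈ 1,625 bytes, consistent with the 1.7 kByte per-word playback figure once per-word overhead is allowed for. Adding the 0.8 kByte recognition template gives about 2.5 kByte per word, and 30 words × 2.5 kByte = 75 Kbyte, matching the total above.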
When the ASR100 is used in handheld devices other than mobile telephones, its footprint is increased slightly. The extra resources are used for the partial calculation of the GSM frames, estimated as 2 kWord of ROM with no extra RAM needed. The input data is assumed to be given by the codec at 13 bits PCM linear, 8 kHz. In the absence of a full GSM coder, an alternative speech compression algorithm could be added for the playback function, for example, ADPCM at 16 Kbit/sec at the expense of 1 kWord of extra code. The total footprint in this configuration is:
• 3 kWord of RAM,
• 6 kWord of ROM (without playback functionality) / 7 kWord of ROM (with playback functionality),
• 0.8 kByte (on average) in flash memory per enrolled word for recognition (1 second),
• 2 kByte (on average) in flash memory per enrolled word for playback (1 second),
• Total flash memory size for 30 words of average duration of 1 second is 24 Kbyte without playback functionality, 84 Kbyte with playback functionality,
• 10 MIPS DSP for an average sub-second recognition latency measured from end of utterance, and
• 8 kHz sampling rate @ 13 bits PCM linear.
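The Begin-Endpoint detection referred to above can be illustrated as follows: a local integration (a running sum over a short window) of the feature time-derivative norm is compared against a begin threshold while it is increasing, and against a lower end threshold once speech has begun. This is a minimal sketch in C; the window length and threshold values are illustrative assumptions, not values from this specification.

```c
#define EP_WIN   8       /* assumed local integration window, in frames */
#define T_BEGIN  4.0f    /* assumed begin threshold                     */
#define T_END    1.5f    /* assumed end threshold (T_END < T_BEGIN)     */

typedef struct {
    float win[EP_WIN];   /* last EP_WIN derivative norms     */
    int   pos;           /* circular buffer position         */
    float sum;           /* local integration (running sum)  */
    float prev_sum;      /* previous value of the sum        */
    int   in_speech;     /* current state: 0 silence, 1 speech */
} endpoint_t;

/* Feed one frame's feature time-derivative norm.  Returns +1 at a
 * speech begin point, -1 at a speech end point, 0 otherwise. */
int endpoint_update(endpoint_t *ep, float deriv_norm)
{
    ep->sum += deriv_norm - ep->win[ep->pos];   /* update running sum */
    ep->win[ep->pos] = deriv_norm;
    ep->pos = (ep->pos + 1) % EP_WIN;

    int event = 0;
    if (!ep->in_speech && ep->sum > T_BEGIN && ep->sum > ep->prev_sum) {
        ep->in_speech = 1;        /* integral increasing and large */
        event = 1;
    } else if (ep->in_speech && ep->sum < T_END) {
        ep->in_speech = 0;        /* integral has fallen away      */
        event = -1;
    }
    ep->prev_sum = ep->sum;
    return event;
}
```

A zero-initialized detector (e.g., endpoint_t ep = {0};) starts in the silence state; only frames between the +1 and -1 events need to be stored for playback.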
The recognition engine normally operates in push-to-talk mode, although automatic speech detection can be used in some applications. In a representative embodiment, the training procedure for new entries takes three repetitions of the new entry (the minimum number of repetitions is two). A consistency check is made on the repetitions during training, and additional training utterances are adaptively requested if required. If a user chooses a standard training with only two sample utterances, there is an option to adapt the template at recognition time. A confusability check is also made at training time, preventing the generation of confusable word pairs in the vocabulary. Since longer utterances (first + last names) are less confusable than short nicknames, the user is encouraged not to use nickname entries. At recognition time, the input utterance is checked against all vocabulary entries. If the utterance cannot be found in the recognition vocabulary, it is rejected. Otherwise, the template reference or the vocal playback of the recognized word is given for confirmation.
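The merging of training repetitions into a single reference template can be sketched as follows: frames of the existing template and of the new repetition that are paired by the DTW alignment path are averaged, with the existing template weighted by the number of repetitions it already represents. This is a minimal sketch in C; the path encoding, the weighting scheme, and the dimensions are illustrative assumptions, not the exact update rule of this specification.

```c
#define DIM 8   /* assumed features per frame */

/* path[k][0] indexes a frame of the existing template; path[k][1]
 * indexes the aligned frame of the new training repetition. */
void merge_templates(float tmpl[][DIM], const float rep[][DIM],
                     const int path[][2], int path_len, int n_reps)
{
    float w = (float)n_reps;    /* repetitions already in the template */

    for (int k = 0; k < path_len; ++k) {
        float       *t = tmpl[path[k][0]];
        const float *r = rep[path[k][1]];
        for (int d = 0; d < DIM; ++d)      /* weighted running average */
            t[d] = (w * t[d] + r[d]) / (w + 1.0f);
    }
}
```

Weighting by the repetition count keeps each training utterance contributing equally to the final composite, whether two or three repetitions are used.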
An embodiment may be employed in various alternative configurations, provided the various tradeoffs are considered and accommodated, since RAM space and CPU resources directly compete with system performance. A minimal-RAM implementation could use only 1 kWord of RAM, at the cost, however, of an increase in code size of about 20%. A minimal flash storage embodiment would halve the template storage per word for recognition, at the expense, however, of CPU load and possibly a slight decrease in performance. A minimal-CPU embodiment would require some 20% increase in code, and could result in slightly decreased performance.
Embodiments of the invention may be implemented in any conventional computer programming language. For example, representative embodiments may be implemented in a procedural programming language (e.g., "C") or an object-oriented programming language (e.g., "C++"). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components. Embodiments can be implemented as a computer program product for use with a computer system. Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).
Although various exemplary embodiments of the invention have been disclosed, it should be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the true scope of the invention.

Claims

What is claimed is:
1. A speech recognizer using Global System for Mobile Communications (GSM)-encoded digital data, the recognizer comprising: a plurality of templates, each template modeling a word in a recognition vocabulary using GSM Quantized Log Area Ratio (Q-LAR) features; and a recognizer module that compares Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and produces a recognition output.
2. A speech recognizer according to claim 1, wherein the Q-LAR features of the input GSM signal are smoothed over time and have a zero mean.
3. A speech recognizer according to claim 2, wherein the Q-LAR features of the input GSM signal are generated by bandpass filtering.
4. A speech recognizer according to claim 2, wherein the Q-LAR features of the input GSM signal include at least one time derivative.
5. A speech recognizer according to claim 4, further including a speech detector that uses the at least one time derivative to determine a speech begin point and a speech end point for the input GSM signal.
6. A speech recognizer according to claim 1, wherein the recognizer module uses a dynamic time warping (DTW) algorithm.
7. A speech recognizer according to claim 6, wherein the recognizer module uses a pruned matrix to compare the template representing the input GSM signal to the at least one of the plurality of templates.
8. A speech recognizer according to claim 7, wherein the pruned matrix is generated based on a threshold distance from a matrix diagonal.
9. A speech recognizer according to claim 8, wherein the recognizer module uses variable frame rate discrimination.
10. A speech recognizer according to claim 1, wherein at least one of the plurality of templates is a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary.
11. A speech recognizer according to claim 10, wherein the composite representation is based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word.
12. A speech recognizer according to claim 1, wherein the template representing the input GSM signal is a vector quantized template.
13. A speech recognizer according to claim 1, wherein the speech recognizer uses hidden Markov models (HMMs) as templates.
14. A speech recognizer according to claim 13, wherein the HMMs are fenonic models.
15. A method of speech recognition using Global System for Mobile Communications (GSM)-encoded digital data, the method comprising: providing a plurality of templates, each template modeling a word in a recognition vocabulary using time domain GSM Quantized Log Area Ratio (Q-LAR) features; comparing Q-LAR features of a template representing an input GSM signal to at least one of the plurality of templates, and producing a recognition output.
16. A method according to claim 15, wherein the Q-LAR features of the input GSM signal are smoothed over time and have a zero mean.
17. A method according to claim 16, wherein the Q-LAR features of the input GSM signal are generated by bandpass filtering.
18. A method according to claim 16, wherein the Q-LAR features of the input GSM signal include at least one time derivative.
19. A method according to claim 18, further including using the at least one time derivative to determine a speech begin point and a speech end point for the input GSM signal.
20. A method according to claim 17, wherein the comparing uses a dynamic time warping (DTW) algorithm.
21. A method according to claim 20, wherein the comparing uses a pruned matrix.
22. A method according to claim 21, wherein the pruned matrix is generated based on a threshold distance from a matrix diagonal.
23. A method according to claim 22, wherein the comparing includes using variable frame rate discrimination.
24. A method according to claim 15, wherein at least one of the plurality of templates is a composite representation of multiple repetitions of the corresponding word in the recognition vocabulary.
25. A method according to claim 24, wherein the composite representation is based on a path along a matrix diagonal of a matrix representing a prior template for the corresponding word and a template for a repetition of the corresponding word.
26. A method according to claim 15, wherein the template representing the input GSM signal is a vector quantized template.
27. A method according to claim 15, wherein hidden Markov models (HMMs) are used as templates.
28. A method according to claim 27, wherein the HMMs are fenonic models.
29. A speech recognizer comprising: a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and a recognizer module that compares a template representing an input speech signal to at least one of the plurality of templates, and produces a recognition output.
30. A speech recognizer according to claim 29, wherein each vector represents Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard.
31. A speech recognizer according to claim 29, wherein the path represents a minimal distance path within a threshold distance of the matrix diagonal.
32. A speech recognizer according to claim 29, wherein at least one of the plurality of templates represents at least three repetitions of a word.
33. A speech recognizer according to claim 29, wherein the input speech signal is a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features.
34. A speech recognizer according to claim 33, wherein the Q-LAR features are smoothed over time and have a zero mean.
35. A speech recognizer according to claim 34, wherein the Q-LAR features are generated by bandpass filtering.
36. A speech recognizer according to claim 34, wherein the Q-LAR features include at least one time derivative.
37. A speech recognizer according to claim 29, wherein the recognizer module uses a dynamic time warping (DTW) algorithm.
38. A speech recognizer according to claim 29, wherein the template representing an input speech signal is a vector quantized template.
39. A speech recognizer according to claim 29, wherein the templates are hidden Markov models (HMMs).
40. A speech recognizer according to claim 39, wherein the HMMs are fenonic models.
41. A method of recognizing speech, the method comprising: providing a plurality of templates, each template including a plurality of multi-dimensional vectors that represent a path along a matrix diagonal of a matrix representing a prior template for the word and a template for a repetition of the word; and comparing a template representing an input speech signal to at least one of the plurality of templates, and producing a recognition output.
42. A method according to claim 41, wherein each vector represents Quantized Log Area Ratio (Q-LAR) features according to the Global System for Mobile Communications (GSM) standard.
43. A method according to claim 41, wherein the path represents a minimal distance path within a threshold distance of the matrix diagonal.
44. A method according to claim 41, wherein at least one of the plurality of templates represents at least three repetitions of a word.
45. A method according to claim 41, wherein the input speech signal is a Global System for Mobile Communications (GSM)-encoded digital data signal having Quantized Log Area Ratio (Q-LAR) features.
46. A method according to claim 45, wherein the Q-LAR features are smoothed over time and have a zero mean.
47. A method according to claim 46, wherein the Q-LAR features are generated by bandpass filtering.
48. A method according to claim 46, wherein the Q-LAR features include at least one time derivative.
49. A method according to claim 41, wherein the comparing uses a dynamic time warping (DTW) algorithm.
50. A method according to claim 41, wherein the template representing an input speech signal is a vector quantized template.
51. A method according to claim 41, wherein the templates are hidden Markov models (HMMs).
52. A method according to claim 51, wherein the HMMs are fenonic models.
53. A speech detector for detecting when speech is present in an input acoustic signal, the detector comprising: an input preprocessor that converts an input acoustic signal into a sequence of frames containing representative features; and a speech detection module that analyzes at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence.
54. A speech detector according to claim 53, wherein the representative features are Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream.
55. A speech detector according to claim 53, wherein the speech detection module further determines a speech begin point in the sequence when a local integration of the at least one feature time derivative is increasing and greater than a first selected value, and determines a speech end point in the sequence when the local integration of the at least one feature time derivative is less than a second selected value.
56. A method of detecting when speech is present in an input acoustic signal, the method comprising: converting an input acoustic signal into a sequence of frames containing representative features; and analyzing at least one time derivative of the representative features with respect to a feature time derivative norm to determine when speech is present in a portion of the sequence.
57. A method according to claim 56, wherein the representative features are Quantized Log Area Ratio (Q-LAR) features in a Global System for Mobile Communications (GSM) data stream.
58. A method according to claim 56, wherein the analyzing further includes determining a speech begin point in the sequence when a local integration of the at least one feature time derivative is increasing and greater than a first selected value, and determining a speech end point in the sequence when the local integration of the at least one feature time derivative is less than a second selected value.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU10496/01A AU1049601A (en) 1999-10-25 2000-10-24 Small vocabulary speaker dependent speech recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16133399P 1999-10-25 1999-10-25
US60/161,333 1999-10-25

Publications (2)

Publication Number Publication Date
WO2001031636A2 true WO2001031636A2 (en) 2001-05-03
WO2001031636A3 WO2001031636A3 (en) 2001-11-01

Family

ID=22580769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2000/001679 WO2001031636A2 (en) 1999-10-25 2000-10-24 Speech recognition on gsm encoded data

Country Status (2)

Country Link
AU (1) AU1049601A (en)
WO (1) WO2001031636A2 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4688256A (en) * 1982-12-22 1987-08-18 Nec Corporation Speech detector capable of avoiding an interruption by monitoring a variation of a spectrum of an input signal
EP0527535A2 (en) * 1991-08-14 1993-02-17 Philips Patentverwaltung GmbH Apparatus for transmission of speech
US5632004A (en) * 1993-01-29 1997-05-20 Telefonaktiebolaget Lm Ericsson Method and apparatus for encoding/decoding of background sounds

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DATABASE INSPEC [Online] INSTITUTE OF ELECTRICAL ENGINEERS, STEVENAGE, GB; ZHANG HAIYAN ET AL: "CELP-based implementation of the GSM half-rate speech codec" Database accession no. 6254867 XP002159432 & JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS, DEC. 1998, EDITORIAL DEPARTMENT, J. CHINA UNIV. OF POSTS & TELECOMMUNICATIONS, CHINA, vol. 5, no. 2, pages 72-75, ISSN: 1005-8885 *
GALLARDO-ANTOLIN A ET AL: "AVOIDING DISTORTIONS DUE TO SPEECH CODING AND TRANSMISSION ERRORS IN GSM ASR TASKS" PHOENIX, AZ, MARCH 15 - 19, 1999,NEW YORK, NY: IEEE,US, 15 March 1999 (1999-03-15), pages 277-280, XP000900112 ISBN: 0-7803-5042-1 *
SALONIDIS T ET AL: "ROBUST SPEECH RECOGNITION FOR MULTIPLE TOPOLOGICAL SCENARIOS OF THE GSM MOBILE PHONE SYSTEM" SEATTLE, WA, MAY 12 - 15, 1998,NEW YORK, NY: IEEE,US, vol. CONF. 23, 12 May 1998 (1998-05-12), pages 101-104, XP000854525 ISBN: 0-7803-4429-4 *
SEUNG HO CHOI ET AL: "Performance evaluation of speech coders for speech recognition in adverse communication environments" 1999 DIGEST OF TECHNICAL PAPERS. INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS (CAT. NO.99CH36277), 1999 DIGEST OF TECHNICAL PAPERS. INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, LOS ANGELES, CA, USA, 22-24 JUNE 1999, pages 318-319, XP002159431 1999, Piscataway, NJ, USA, IEEE, USA ISBN: 0-7803-5123-1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003100372A1 (en) * 2002-05-29 2003-12-04 Nokia Corporation Method in a digital network system for controlling the transmission of terminal equipment
CN100361117C (en) * 2002-05-29 2008-01-09 诺基亚有限公司 Method in a digital network system for controlling the transmission of terminal equipment

Also Published As

Publication number Publication date
WO2001031636A3 (en) 2001-11-01
AU1049601A (en) 2001-05-08

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP