WO1991011696A1 - Method and apparatus for recognizing spoken commands in noisy environments - Google Patents

Method and apparatus for recognizing spoken commands in noisy environments

Info

Publication number
WO1991011696A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
utterance
distance
determining
reference samples
Prior art date
Application number
PCT/US1991/000053
Other languages
English (en)
Inventor
Kamyar Rohani
R. Mark Harrison
Original Assignee
Motorola, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola, Inc. filed Critical Motorola, Inc.
Publication of WO1991011696A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • This invention relates generally to the field of word recognizers and in particular to those word recognizers which are capable of recognizing command words in noisy environments.
  • For example, a policeman in a police car may activate numerous functions, such as turning on the siren, simply by speaking an utterance which contains a word command.
  • A word recognizer, after receiving and processing the utterance, recognizes the word command and effectuates the desired function.
  • A word recognizer recognizes the word command by extracting features which adequately represent the utterance and making a decision as to whether these features meet particular criteria. These criteria may comprise correspondence to a set of pre-stored features representing the command words to be recognized.
  • the word recognizer may be speaker dependent or speaker independent.
  • a speaker independent word recognizer is designed to recognize the commands of potentially any number of users regardless of the differences in speech patterns, accents, and other variations in spoken words.
  • The speaker independent word recognizer requires significantly more sophisticated processing capability and hence has been constrained to recognizing a limited number of command words.
  • A speaker dependent word recognizer is designed to recognize the command words of a limited number of users by comparing the utterance to prestored voice templates which contain the voice features of those users. Therefore, it is necessary to train the word recognizer to recognize the voice features of each individual user. Training is commonly understood to be a process by which the individual user repeats a predetermined set of word commands a sufficient number of times so that an acceptable number of their voice features are extracted and stored as reference features.
  • One of the important characteristics of a word recognizer is its capability to accurately recognize a word command under various noise conditions. Typically, word recognizers provide error rates of less than 1% in quiet environments.
  • However, the error rate may be degraded by as much as 40% in environments where there is a 20 dB peak signal-to-noise ratio (SNR).
  • One of the factors contributing to poor noise performance is the difference between the training condition under which the reference features are derived and the operating condition under which the utterance features are derived. Due to this difference, comparison of the reference features and the input utterance features may produce substantially erroneous results.
  • Many word recognizers incorporate noise compensation techniques in the means utilized to derive the reference features.
  • A background noise estimator provides the ambient noise characteristics, and the prestored reference features are temporarily modified according to the characteristics of the ambient noise. The modified reference features and the input utterance features are then compared, and the reference sample having features most similar to those of the input utterance is declared the recognized word.
  • The features of the input utterance may be represented by the amount of energy contained within a predetermined number of frequency bands.
  • This technique is known as the filter banks method.
  • Noise compensation is achieved by determining the background noise energy in every frequency band and subtracting it from the energy in the corresponding frequency band of the input utterance.
  • The resulting features are then compared to the corresponding reference features and, again, the reference sample having the features most similar to those of the input utterance is declared the recognized word.
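A minimal sketch of the filter-bank compensation just described, assuming per-band energies have already been computed; the function names and the Euclidean comparison are illustrative rather than taken from the patent:

```python
import numpy as np

def compensate_filter_bank(utterance_energies, noise_energies):
    """Subtract the background noise energy from the utterance energy in
    each predetermined frequency band, flooring at zero so that the
    subtraction cannot produce negative energies."""
    return np.maximum(np.asarray(utterance_energies) - np.asarray(noise_energies), 0.0)

def nearest_reference(compensated, references):
    """Return the index of the reference sample whose band energies are
    most similar (smallest Euclidean distance) to the compensated input."""
    distances = [np.linalg.norm(compensated - np.asarray(r)) for r in references]
    return int(np.argmin(distances))
```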
  • However, this type of system suffers an inherent drawback in that the number of predetermined frequency bands is critical to the proper operation of the word recognizer. That is, dividing the voice spectrum into a high number of frequency bands degrades recognition accuracy for high-pitched voices, while dividing it into a low number of frequency bands causes a smearing effect on the voice signal.
  • Other noise compensation approaches in speech recognition utilize noise reduction techniques, wherein the signal-to-noise ratio is increased using various filtering techniques. However, practical improvements in SNR typically fall short of achieving substantial accuracy in recognizing word commands.
  • Another method of noise compensation for a system utilized in a severe noise environment is to train the system in a comparable noise environment.
  • However, certain types of noise, such as acoustical background noise, are time-variant in nature. Accordingly, it is not possible to predict or otherwise reproduce, during training, the actual time-variant noise which will exist during a subsequent speech recognition mode.
  • the word recognizer of the invention comprises a voice processing means for receiving an input utterance and determining features which adequately represent the utterance.
  • a template means provides the pre-stored features of a set of reference samples which represent the recognizable command words.
  • a noise analysis means determines ambient noise characteristics.
  • a comparison means determines the distance between the features of the utterance and the reference samples. The comparison means is responsive to the ambient noise characteristics for modifying the determined distance.
  • The word recognizer apparatus includes means for determining the minimum distance and selecting the reference sample based thereon.
  • FIG. 1 shows a block diagram of the word recognizer of the invention.
  • FIG. 2 shows a block diagram of the voice processor of FIG. 1.
  • FIG. 3 is the flow chart for extracting CSM features of an input utterance.
  • FIG. 4 shows the block diagram of the noise analyzer of FIG. 1.
  • FIG. 5 shows a portion of the word recognizer of the invention which includes the block diagram of the template means of FIG. 1.
  • FIG. 6 shows the graph of the power distribution of the reference sample and the input utterance for a command word.
  • FIG. 7 shows a portion of the word recognizer of the invention which includes the block diagram of the comparison means of FIG. 1.
  • FIG. 8 is the flow chart of the steps taken according to the invention to recognize the word command in noisy environments.
  • The word recognizer 100 comprises an isolated word recognizer which is capable of recognizing more than one spoken word command when a pause exists therebetween.
  • the word recognizer 100 includes a voice processor 110 for processing an input utterance containing one or more word commands.
  • the input utterance is received through a microphone 103 which produces a voice signal representing the input utterance.
  • A well known audio filter 105 is used to limit the frequency spectrum of the input utterance to a predetermined range; in the preferred embodiment of the invention, this range is confined to 200 Hz to 3200 Hz.
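The patent states only the 200 Hz to 3200 Hz pass band, not the realization of audio filter 105. As a hedged illustration, a digital Butterworth band-pass at the 8000 Hz sampling rate used later in the document might look like this (the filter order is an arbitrary choice):

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Hypothetical stand-in for audio filter 105: a 4th-order Butterworth
# band-pass confining the signal to 200-3200 Hz at an 8 kHz sample rate.
SOS = butter(4, [200, 3200], btype="bandpass", fs=8000, output="sos")

def band_limit(voice_signal: np.ndarray) -> np.ndarray:
    """Limit the frequency spectrum of the voice signal to 200-3200 Hz."""
    return sosfilt(SOS, voice_signal)
```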
  • The voice processor 110 divides the input utterance into frames of predetermined duration.
  • the voice processor 110 provides, in each frame, those features of the input utterance which adequately characterize the input utterance. The detailed process by which these features are produced is described later. These features comprise frequency components and corresponding amplitudes as well as the power of the input utterance in each frame.
  • A background noise analyzer 120 provides the characteristics of the ambient noise. These characteristics comprise the signal-to-noise ratio across the frequency spectrum and the level of the ambient noise floor. Because the word recognizer 100 is an isolated word recognizer, the beginning and the end of the input utterance must be determined. In the preferred embodiment of the invention, this determination is made by comparing the power of the input utterance to the power of the ambient noise floor.
  • When the power of the input utterance exceeds the power of the ambient noise floor, a comparator 130 closes a switch 140, thereby allowing the features of the input utterance to be stored in a temporary feature storage means 150.
  • Otherwise, the switch 140 is opened, preventing features from being stored in the storage means 150. Accordingly, the end points of the input utterance are determined by comparing the ambient noise floor to the power of the input utterance, as sketched below.
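A sketch of the comparator 130 / switch 140 gating, assuming frame power is the sum of squared samples as in equation (3) below; the decision rule is exactly the power-versus-noise-floor comparison described above:

```python
import numpy as np

def frame_power(frame):
    """Power of one frame as the sum of squared samples (equation (3))."""
    return float(np.sum(np.asarray(frame) ** 2))

def gate_frames(frames, noise_floor_power):
    """Comparator 130 / switch 140: pass a frame's features on to the
    temporary storage only while its power exceeds the noise floor."""
    return [f for f in frames if frame_power(f) > noise_floor_power]
```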
  • a template means 160 provides the features of a set of prestored reference samples. The features of the prestored reference samples are generated, during training, utilizing the same process as that which provides the features of the input utterance.
  • the template means 160 aligns the end points of the reference sample with the end points of the input utterance.
  • A comparison means 170, primarily comprising a microcomputer/controller, provides the distance between the features of the input utterance and the reference samples. The details of the process by which the distance between the features of the input utterance and the reference sample is produced are described later. The comparison means 170 then selects the reference sample having the minimum distance from the features of the input utterance and, based thereon, declares the word command. Noise compensation in the word recognizer of the invention is achieved by eliminating or modifying the distance between the features of the input utterance and the features of the reference sample when noise characteristics exceed a predetermined threshold.
  • The block diagram of the voice processor 110 comprises an A/D converter 102 which samples the voice signal provided by microphone 103 of FIG. 1 at a suitable sampling rate, such as 8000 samples per second.
  • a frame buffer 104 buffers the sampled signal and provides frames which consist of a predetermined number of consecutive voice samples.
  • The framing technique utilized by the frame buffer 104 is well known in the art, and the frames provided by the preferred embodiment of the invention comprise 160 samples, which corresponds to a frame duration of 20 msec. It may be appreciated that, depending on the duration of each input utterance, a variable number of frames (designated N) may be generated by the frame buffer 104.
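A sketch of the A/D and framing stages under the stated parameters (8000 samples per second, 160-sample frames of 20 msec); non-overlapping frames are an assumption, as the patent does not mention overlap:

```python
import numpy as np

SAMPLE_RATE = 8000  # samples per second (A/D converter 102)
FRAME_LEN = 160     # 160 samples = 20 msec per frame (frame buffer 104)

def frame_signal(samples: np.ndarray) -> np.ndarray:
    """Split the sampled voice signal into N consecutive frames of 160
    samples each, discarding any trailing partial frame."""
    n_frames = len(samples) // FRAME_LEN
    return samples[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
```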
  • The features characterizing each utterance frame may be parametric or discrete.
  • the discrete features of the utterance frames may be provided by such known techniques as the filter banks method.
  • The preferred embodiment of the present invention utilizes a technique which provides the parametric features of the utterance frames.
  • The parametric features of the utterance may be provided by such known techniques as linear predictive coding (LPC) or composite sinusoidal modeling (CSM).
  • In the preferred embodiment, the features of the utterance frames are provided utilizing conventional CSM analysis techniques as described in S. Sagayama and F. Itakura, "Duality Theory of Composite Sinusoidal Modeling and Linear Prediction", ICASSP '86 Proceedings, vol. 3, pp. 1261-1264, the disclosure of which is hereby incorporated by reference.
  • The purpose of CSM analysis is to determine a set of CSM features which adequately characterize each utterance frame.
  • The CSM features comprise CSM frequencies {f_i} and the corresponding amplitudes {m_i}.
  • The number of CSM features (designated M) of each frame of the input utterance is related to the frequency range of the utterance. In utterances confined to the 200 Hz to 3200 Hz frequency range, there usually exist four formant resonant frequencies below 3200 Hz. Thus, it is usually sufficient to utilize four CSM frequencies and amplitudes to characterize the input utterance frames; therefore, in the preferred embodiment of the invention, M is equal to 4.
  • A feature extractor 106 executes a feature extraction process utilizing conventional CSM techniques, as shown in the flow chart of FIG. 3.
  • The feature extractor 106 receives the input utterance frames and computes the autocorrelation of each frame, block 320.
  • The interpolative correlation terms are then computed, block 330.
  • The feature extractor 106 also provides the power content of each input utterance frame, derived from the following equation:

  P(n) = Σ_k s_n(k)²   (3)

  where s_n(k), k = 1, ..., 160, denotes the k-th sample of the n-th frame.
  • Accordingly, the features of the n-th utterance frame may be expressed as the vector:

  T(n) = ( m_1(n), m_2(n), ..., m_M(n), f_1(n), f_2(n), ..., f_M(n), P(n) )   (4)
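The CSM analysis itself (blocks 320-330 of FIG. 3) is beyond a short sketch, but packing the per-frame features into the vector T(n) of equation (4) is mechanical. The amplitudes and frequencies are assumed to come from a CSM routine elsewhere:

```python
import numpy as np

def feature_vector(amplitudes, frequencies, frame):
    """Assemble T(n) per equation (4): M amplitudes, M frequencies, and
    the frame power P(n) of equation (3)."""
    assert len(amplitudes) == len(frequencies)      # both of length M
    power = float(np.sum(np.asarray(frame) ** 2))   # equation (3)
    return np.concatenate([amplitudes, frequencies, [power]])
```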
  • The voice processor 110 described in FIG. 2 and FIG. 3 may be implemented by means of any suitable digital signal processor (DSP), such as the 56000 series family of DSPs manufactured by Motorola, Inc.
  • the noise analyzer 120 continually monitors the background noise and provides characteristics thereof.
  • The noise analyzer 120 includes a noise processing means 122 for producing the noise power across the desired frequency spectrum.
  • The noise processing means 122 utilizes well known analysis techniques, such as Fast Fourier Transform (FFT) analysis, to provide the noise power at the desired CSM frequencies.
  • The noise processor 122 also receives the corresponding CSM amplitudes of the input utterance frames and produces the signal-to-noise ratios SNR(f) at the CSM frequencies.
  • The noise analyzer 120 also includes a well known noise averaging means 124 which provides the power of the noise floor, Rn; techniques for estimating the ambient noise floor are well known in the art.
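A sketch of the noise analyzer, estimating noise power at the CSM frequencies from an FFT of a background-noise frame and forming SNR(f) from the corresponding CSM amplitudes; the dB scaling and nearest-bin mapping are assumptions, since the patent only names FFT analysis:

```python
import numpy as np

def snr_at_csm_frequencies(noise_frame, csm_freqs, csm_amps, fs=8000):
    """Noise processor 122: SNR(f) at each CSM frequency of a frame."""
    spectrum = np.abs(np.fft.rfft(noise_frame)) ** 2
    bins = np.fft.rfftfreq(len(noise_frame), d=1.0 / fs)
    # noise power taken from the FFT bin nearest each CSM frequency
    noise_power = np.array(
        [spectrum[np.argmin(np.abs(bins - f))] for f in csm_freqs])
    signal_power = np.asarray(csm_amps, dtype=float) ** 2
    return 10.0 * np.log10(signal_power / np.maximum(noise_power, 1e-12))

def noise_floor(noise_frames):
    """Noise averaging means 124: average frame power Rn of the noise."""
    return float(np.mean([np.sum(np.asarray(f) ** 2) for f in noise_frames]))
```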
  • The template storage means 162 stores the features of a set of reference samples representing word commands recognizable by the word recognizer 100. These reference features are obtained during a training process, in which a user repeats each of the desired word commands a number of times. Preferably, the training of the word recognizer is performed in a quiet environment. The features of the user's voice are extracted and stored in the template storage means 162 as the reference samples. During training, the utterances are processed identically to the input utterance; in fact, the voice processor 110 is used to generate the reference sample features during training of the word recognizer 100.
  • The number of reference sample frames (designated J) may differ from the number N of corresponding input utterance frames. It should be noted that the power of each frame, as derived from equation (3), is also included in the features of the reference sample. Accordingly, the features of each reference sample may be stored in the template storage means 162 as vectors:
  • R(j) = ( m_1(j), m_2(j), ..., m_M(j), f_1(j), f_2(j), ..., f_M(j), P(j) )   (6)
  • Each of these reference samples is selected and compared to the input utterance.
  • the end points of the reference sample under comparison and the input utterance must be aligned.
  • an end point aligner 164 is included in the template means 160 to alleviate end point misalignments.
  • FIG. 6 shows, in the time domain, the power contour 610 of a reference sample for a word command.
  • The power contour 610 of the reference sample is actually represented by a number of discrete powers, one corresponding to each frame; for the sake of simplicity, however, the power distribution of the reference sample is shown as a solid line 610. Similarly, the power contour of an input utterance substantially corresponding to that of the reference sample is shown by a dotted line 620. As shown, the end points of the reference sample, recorded against a quiet background, and of the input utterance, received in a noisy environment, are separated from each other by the ambient noise floor power Rn.
  • If the end points of the reference sample are readjusted by a number of frames such that the subsequent frames have powers above the noise floor power, the end points of the reference sample and the input utterance may be realigned. Therefore, the noise floor power Rn provided by the noise analyzer 120 constitutes a threshold by which the end points of the reference sample are readjusted. Referring back to FIG. 5, the end point aligner 164 skips those candidate end points whose power is below the noise floor power, as sketched below.
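A sketch of the end point aligner 164, skipping candidate end-point frames of the reference sample whose power falls below the noise floor Rn:

```python
import numpy as np

def realign_endpoints(ref_frame_powers, noise_floor_power):
    """Return the [start, end) frame range of the reference sample whose
    frame powers exceed the noise floor, so that its end points line up
    with those detected for the noisy input utterance."""
    above = np.nonzero(np.asarray(ref_frame_powers) > noise_floor_power)[0]
    if len(above) == 0:
        return 0, len(ref_frame_powers)  # nothing above the floor; keep all
    return int(above[0]), int(above[-1]) + 1
```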
  • the end point aligner 164 may be implemented by means of any suitable microcomputer or DSP executing a suitable program for achieving the intended purpose thereof.
  • The comparison means 170 comprises a well known microcomputer/controller, such as the 68000 family of microcomputers manufactured by Motorola, Inc.
  • the comparison means 170 includes a controller 172, a computer 174, a RAM 176 and a ROM 178.
  • the controller 172 performs several functions which include controlling the operation of the comparison means 170 and the template means 160 as well as interacting with the temporary storage means 150 and noise analyzer 120.
  • the computer 174 performs the computational functions of the comparison means 170.
  • the RAM 176 provides a temporary information storage for the computer 174 and the controller 172.
  • the program containing the operational steps of the computer 174 and the controller 172 is stored in the ROM 178.
  • the controller 172 receives the features of the input utterance from the temporary storage means 150.
  • the features of the first reference sample after endpoint alignment are received from the template means 160.
  • The computer 174 determines the distance between the features of the reference sample and the input utterance. In the preferred embodiment of the invention, only the frequency features of the frames of the utterance and the reference sample are utilized for computing the distance. The determined distance is called a local distance metric and takes the form of a sum over the M frequency features:

  d(n,j) = Σ_i ( T(i,n) − R(i,j) )²,  i = 1, ..., M

  where T(i,n) represents the i-th composite sinusoidal frequency in the n-th frame of said utterance and R(i,j) represents the i-th composite sinusoidal frequency in the j-th frame of said reference sample.
  • To achieve noise compensation, the local distance metric d is modified by a function W(i,n) of the signal-to-noise ratio SNR(f) provided by the noise analyzer 120: each frequency term of the sum is weighted by W(i,n), scaled by a normalization constant K which compensates for the number of terms retained.
  • In the preferred embodiment, W(i,n) comprises a discrete function: W(i,n) = 1 when the SNR(f) at the i-th CSM frequency of the n-th frame is at or above a signal-to-noise ratio threshold, and W(i,n) = 0 otherwise. Accordingly, the i-th frequency feature of the n-th frame is eliminated if the SNR(f) at that frequency is below the SNR threshold, as sketched below.
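A sketch of the SNR-weighted local distance. The discrete W(i,n) below implements the zero/one elimination rule just described; the squared-difference form of the base metric and the exact normalization constant K are assumptions consistent with, but not recoverable from, this text:

```python
import numpy as np

SNR_THRESHOLD_DB = 10.0  # illustrative threshold; the patent leaves it open

def local_distance(T_freqs, R_freqs, snr_db):
    """Weighted local distance between the M CSM frequencies of utterance
    frame n (T_freqs) and reference frame j (R_freqs), given SNR(f) in dB
    at each of the utterance frequencies."""
    T, R = np.asarray(T_freqs, float), np.asarray(R_freqs, float)
    W = (np.asarray(snr_db) >= SNR_THRESHOLD_DB).astype(float)  # discrete W(i,n)
    kept = W.sum()
    if kept == 0:
        return 0.0                 # every frequency feature eliminated by noise
    K = len(T) / kept              # assumed normalization over retained terms
    return K * float(np.sum(W * (T - R) ** 2))
```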
  • Alternatively, W(i,n) may comprise a continuously differentiable limiting function, such as the well known sigmoidal or hyperbolic tangent functions. It may be appreciated that for each frame of the input utterance there is a total of at most J local distances. The minimum legal local distance of each input utterance frame is added to the subsequent local distances; an accumulated distance is thus determined for each reference sample frame, block 840.
  • A minimum accumulated distance may be obtained utilizing well known dynamic time warping techniques; the minimum distance computed with such a technique is stored for the reference sample, as sketched below.
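A minimal dynamic time warping accumulation over the N x J grid of local distances, using the standard match/insert/delete step pattern; the patent does not specify its exact path constraints:

```python
import numpy as np

def dtw_distance(utt_frames, ref_frames, local_dist):
    """Accumulate local distances (block 840) and return the minimum
    total distance aligning the N utterance frames with the J reference
    frames. `local_dist` is any per-frame-pair distance function, such
    as the SNR-weighted metric sketched above."""
    N, J = len(utt_frames), len(ref_frames)
    D = np.full((N + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for j in range(1, J + 1):
            d = local_dist(utt_frames[n - 1], ref_frames[j - 1])
            D[n, j] = d + min(D[n - 1, j], D[n, j - 1], D[n - 1, j - 1])
    return float(D[N, J])
```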
  • A decision is then made as to whether more reference samples are to be processed. After all of the stored reference samples have been compared, block 870, the reference sample having the minimum distance is selected.
  • The command word contained in the input utterance is recognized based on a decision on the selected reference sample. The decision also takes predetermined criteria into consideration before the recognized command word is declared. Such criteria may comprise a threshold distance below which the recognized word is valid. These criteria prevent an invalid input utterance which nevertheless produces a minimum distance from being declared the recognized word command.
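A sketch of the final decision step, combining minimum-distance selection (block 870) with the validity threshold described above; representing the vocabulary as (word, frames) pairs is an assumption:

```python
def recognize(utt_frames, references, distance_fn, reject_threshold):
    """Pick the reference with the minimum accumulated distance and
    declare its word command only if that distance is below the validity
    threshold; otherwise reject the utterance as an invalid input."""
    best_word, best_dist = None, float("inf")
    for word, ref_frames in references:
        d = distance_fn(utt_frames, ref_frames)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist < reject_threshold else None
```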
  • the local distances between the features of the input utterance and the reference sample are relied upon in recognizing the command word.
  • The local distances are modified as a function of the signal-to-noise ratio. Accordingly, the accuracy of the word recognizer under severe noise conditions is improved by eliminating or lessening the contribution of those local distances which have an undesirable noise characteristic.

Abstract

An input voice utterance containing a command to be recognized is processed (110), and features which adequately represent the utterance are determined. Pre-stored features of a set of reference command samples (160) are compared (170) with the features of the utterance. Recognition of commands in noisy environments is improved by determining the distance between the features of the utterance and the features of the reference samples, and by modifying that distance (120) in response to the background noise. The reference sample exhibiting the minimum distance is selected as the recognized command.
PCT/US1991/000053 1990-02-02 1991-01-02 Method and apparatus for recognizing spoken commands in noisy environments WO1991011696A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US47443590A 1990-02-02 1990-02-02
US474,435 1990-02-02

Publications (1)

Publication Number Publication Date
WO1991011696A1 true WO1991011696A1 (fr) 1991-08-08

Family

ID=23883519

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/000053 WO1991011696A1 (fr) 1990-02-02 1991-01-02 Method and apparatus for recognizing spoken commands in noisy environments

Country Status (1)

Country Link
WO (1) WO1991011696A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2330677A (en) * 1997-10-21 1999-04-28 Lothar Rosenbaum Phonetic control apparatus
KR100450787B1 (ko) * 1997-06-18 2005-05-03 Samsung Electronics Co., Ltd. Apparatus and method for extracting speech features by dynamic range normalization of the spectrum
US6983245B1 (en) 1999-06-07 2006-01-03 Telefonaktiebolaget Lm Ericsson (Publ) Weighted spectral distance calculator
KR100714721B1 (ko) 2005-02-04 2007-05-04 Samsung Electronics Co., Ltd. Method and apparatus for detecting speech intervals

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2137791A (en) * 1982-11-19 1984-10-10 Secr Defence Noise Compensating Spectral Distance Processor
US4897878A (en) * 1985-08-26 1990-01-30 Itt Corporation Noise compensation in speech recognition apparatus
US4852181A (en) * 1985-09-26 1989-07-25 Oki Electric Industry Co., Ltd. Speech recognition for recognizing the catagory of an input speech pattern
US4918732A (en) * 1986-01-06 1990-04-17 Motorola, Inc. Frame comparison method for word recognition in high noise environments
US4829578A (en) * 1986-10-02 1989-05-09 Dragon Systems, Inc. Speech detection and recognition apparatus for use with background noise of varying levels
US4933973A (en) * 1988-02-29 1990-06-12 Itt Corporation Apparatus and methods for the selective addition of noise to templates employed in automatic speech recognition systems

Similar Documents

Publication Publication Date Title
US4933973A (en) Apparatus and methods for the selective addition of noise to templates employed in automatic speech recognition systems
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
EP0691024B1 (fr) Method and device for speaker identification
US7756700B2 (en) Perceptual harmonic cepstral coefficients as the front-end for speech recognition
US7877254B2 (en) Method and apparatus for enrollment and verification of speaker authentication
US5459815A (en) Speech recognition method using time-frequency masking mechanism
KR19990043998A (ko) Pattern recognition system
US20060165202A1 (en) Signal processor for robust pattern recognition
JP3451146B2 (ja) Noise removal system and method using spectral subtraction
JP2745535B2 (ja) Speech recognition apparatus
EP1141942A1 (fr) Noise compensation and normalization in a dynamic time warping transformation method
Hautamäki et al. Improving speaker verification by periodicity based voice activity detection
WO2005020212A1 (fr) Signal analysis device, signal processing device, speech recognition device, signal analysis program, signal processing program, speech recognition program, recording medium, and electronic device
CN117672201A (zh) A control system for speech recognition in driverless agricultural machinery
FI111572B (fi) Method for processing speech in the presence of acoustic interference
WO1994022132A1 (fr) Method and device for speaker identification
WO1991011696A1 (fr) Method and apparatus for recognizing spoken commands in noisy environments
Tazi A robust speaker identification system based on the combination of GFCC and MFCC methods
US20080228477A1 (en) Method and Device For Processing a Voice Signal For Robust Speech Recognition
JPH0449952B2 (fr)
Kumar et al. Effective preprocessing of speech and acoustic features extraction for spoken language identification
JP3046029B2 (ja) Apparatus and method for selectively adding noise to templates used in speech recognition systems
US20070124143A1 (en) Adaptation of environment mismatch for speech recognition systems
Hilger et al. Noise level normalization and reference adaptation for robust speech recognition
JPS6039695A (ja) Automatic voice activity detection method and apparatus

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

NENP Non-entry into the national phase

Ref country code: CA