EP0255529A4 - Frame comparison method for word recognition in high noise environments - Google Patents
Frame comparison method for word recognition in high noise environments
- Publication number
- EP0255529A4 (application EP19870900768 / EP87900768A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- energy
- energy level
- channel
- frame
- representative
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
Definitions
- the present invention relates generally to the practice of word recognition in a speech recognition system and, more particularly, to the recognition of words in the presence of high noise.
- a hands-free system is extremely practical in cases where the operator is otherwise occupied.
- the incoming speech is split into frames before the matching process begins.
- Each frame from the incoming speech is then compared to frames from the template memory. A match is indicated by a sequence of incoming-speech frames that closely corresponds to the sequence of template frames.
- spectral subtraction usually requires that an estimate of the background noise be subtracted from the incoming speech before matching to the template.
- the present invention teaches an improved method of matching an input frame to a word template frame for speech recognition in a high noise environment. The method employs spectral channels to represent both the word template frames and the input frames. Specific steps of this method include determining three energy levels for each channel used: a first level representative of background noise energy, a second level representative of the input frame energy, and a third level representative of the word template frame energy. Values are assigned at each channel, including a constant value when the second level is less than the first level at one or more channels. These values are then used to generate a distance score between the input frame and the word template frame.
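The three-level comparison summarized above can be sketched as follows. This is an illustrative reading only, with hypothetical function and variable names; the flowcharts of Figures 4 and 5 give the full procedure.

```python
# Hypothetical sketch of the per-channel comparison: when the input energy at a
# channel falls below the background-noise level, a constant value is assigned
# for that channel instead of a level difference.  Names are illustrative.

def frame_distance(input_frame, template_frame, noise_floor, constant=1.0):
    """Distance between an input frame and a word template frame.

    Each argument is a list of per-channel log energies ('k' channels).
    """
    score = 0.0
    for x, t, n in zip(input_frame, template_frame, noise_floor):
        if x < n:
            # Input energy is masked by noise: use the constant value.
            score += constant
        else:
            # Input energy is usable: accumulate the level difference.
            score += abs(t - x)
    return score


# Example: channel 0 compares levels, channel 1 is below the noise floor.
print(frame_distance([5.0, 1.0], [6.0, 4.0], [2.0, 2.0]))
```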
- Figure 1 is a block diagram of a speech recognition system illustrating an arrangement of blocks pertinent to the present invention.
- Figure 2 is an illustration of a prior art state model for a word template used by a speech recognition system.
- Figure 3 is a graph illustrating an example of an input frame being compared to a word template frame, according to the present invention.
- Figure 4 is a general flowchart, illustrating steps for generating a distance score between an input frame and word template frame, according to the present invention.
- Figure 5a is an expanded flowchart showing, more specifically, the steps illustrated in Figure 4.
- Figure 5b is a continuation of the flowchart of Figure 5a.
- FIG. 1 is a block diagram of a speech recognition system illustrating blocks pertinent to the present invention.
- input speech is converted into channel bank information by the acoustic processor 2 for training the system, i.e. establishing a system vocabulary.
- the acoustic processor converts the speech into filter bank information.
- One method for generating filter bank information is described by R.W. Schafer, L.R. Rabiner and O. Herrmann in "FIR Digital Filter Banks for Speech Analysis", Bell System Technical Journal, Vol. 54, No. 3, pp. 531-544, March 1975.
- the training processor 4 detects endpoints from the converted speech and generates word templates for storage in template memory 6.
- the training processor 4 is disabled and the system configures itself for speech recognition.
- Speech which is processed by the acoustic processor 2 is used by the distance calculator 10 and the background noise estimator 8.
- the background noise estimator 8 calculates an approximation of noise which is input with the speech.
- Such an estimator is described by Robert J. McAulay and Marilyn L. Malpass in a paper entitled "Speech Enhancement Using a Soft-Decision Noise Suppression Filter", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-28, No. 2, April 1980.
- the distance calculator 10 employs the converted speech from the acoustic processor 2, the word templates from the template memory 6, and background noise levels from the estimator 8 to generate a measure of similarity, or a distance, between each input frame and frames from the word template.
- the recognizer 12 provides the distance calculator 10 with the appropriate frames from template memory 6 for comparison.
- There are a number of well-known recognition frameworks that can be used to provide this control, one of which is described by Bridle, Brown and Chamberlain in "An Algorithm for Connected Word Recognition", Proceedings of the 1982 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 899-902.
- the distances calculated by the distance calculator 10 are critical for whichever recognizer is used by the system. Regardless of the manner in which the word templates are controlled or located, if the calculated distances are not accurate, the recognizer cannot identify an accurate match to a word template. Subsequent discussion is directed to frame comparison performed within the distance calculator 10.
- Improvements in isolated word searching techniques have recently used silence states in recognition models for matching an input speech utterance, represented as a series of frames, to a word template in a frame-by-frame manner.
- a speech recognition system will represent speech utterances using feature data such as LPC parameters, channel bank information or line spectral pairs.
- a vocabulary is formed by representing the spoken words as word templates.
- a distance score is generated. For a number of word templates compared to a series of input frames representing an input word, the template with the lowest distance score usually indicates that the input word is matched to that word template.
- Silence states are employed in a typical word searching technique by matching input frames, representative of silence, to silence frames within a word model using a suitable distance measure.
- Figure 2 illustrates a word template model with an initial silence state 13 and an ending silence state 15 appended.
- the inner states 14 represent prestored frames of the actual word template.
- the use of silence states can greatly increase the performance of a word recognition system operated in a high noise environment. In high noise it is very difficult to accurately determine the endpoints of words since the beginning and/or ending of the words may be buried in the noise.
- the silence states 13 and 15 can help spot the beginning and ending of the spoken word to which the model is being compared, employing a state sequence estimation method such as that described by G.D. Forney, Jr. in "The Viterbi Algorithm", Proceedings of the IEEE, Vol. 61, No. 3, March 1973.
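The alignment of input frames against the state model of Figure 2 can be sketched with a minimal Viterbi-style dynamic program. This is an illustrative example, not the patent's implementation; `local_dist` stands in for whatever frame-pair distance the system uses (such as the one taught by this patent), and each state allows either a self-loop or an advance to the next state.

```python
import math

def viterbi_align(input_frames, model_frames, local_dist):
    """Accumulated distance of the best path aligning input_frames to the
    left-to-right state sequence model_frames (silence, word frames, silence).
    """
    n_states = len(model_frames)
    best = [math.inf] * n_states
    # The first input frame must align with the first (silence) state.
    best[0] = local_dist(input_frames[0], model_frames[0])
    for frame in input_frames[1:]:
        prev = best
        best = [math.inf] * n_states
        for s in range(n_states):
            stay = prev[s]                                 # self-loop
            move = prev[s - 1] if s > 0 else math.inf      # advance one state
            best[s] = min(stay, move) + local_dist(frame, model_frames[s])
    # Best path must end in the final (silence) state.
    return best[-1]
```

With a trivial scalar distance, a perfectly matching sequence accumulates zero distance: `viterbi_align([0, 1, 2, 0], [0, 1, 2, 0], lambda a, b: abs(a - b))` returns `0`.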
- Figure 2 illustrates one model which can be used for a variety of methods for comparing input frames to a word template.
- using a frame-by-frame comparison technique requires that a distance score be generated for each pair of frames compared.
- the method must generate a distance score accurate enough for the word template searching technique to distinguish between many potential word templates.
- this method must still be able to generate a distinguishable distance score.
- FIG. 3 is a graph illustrating an example of such a method in accordance with the present invention.
- each frame contains channel bank information pertaining to energy levels in each of 'k' frequency channels. These channels are seen on the horizontal axis. On the vertical axis, depicted is a relative log magnitude scale. Plotted in the graph are three energy levels 16, 18 and 20 representing an input frame 16, a word template frame 18 and an estimated noise floor 20, respectively.
- a dotted line represents a buffer level 22.
- the buffer level 22 is determined from the noise floor at a particular channel i plus a constant. Though this constant may vary for different applications, in the present embodiment it is preferred to be a value of approximately 3 dB.
- the input frame level 16 is compared to the buffer level 22.
- the comparison result Ci 24 is used to determine which input frame channels, if any, are substantially above the noise floor 20. "Substantially" refers to the 3 dB buffer.
- a value is assigned. This value depends on the difference between the word template frame level 18 and the input frame level 16. If the difference between the word template level 18 and the input frame level 16 is not greater than a preselected nominal differential, then the nominal differential becomes the assigned value for that particular channel. Otherwise, the assigned value is equal to the subtracted difference.
- This nominal difference is a predetermined constant, used to prevent the partial distance score from accumulating nonrepresentative values. This occurs when input frame levels very close to the noise floor 20 are predominantly noise rather than the actual speech itself.
- a total accumulation of assigned values from every channel is used to form a total distance score. This total distance score represents a measure of acoustic similarity between the input frame and the word template frame. A relatively low total distance score indicates two similarly acoustic frames. A total distance score of zero indicates a perfect match.
- FIG. 4 shows a general flowchart of the frame comparison method illustrated above, where:
- T = word template frame channel level minus input frame channel level
- DIST = total distance score
- Block 30 of Figure 4 is expanded into blocks 40 through 50 of Figure 5a.
- Block 59 indicates the normalization at each channel.
- the input frame is normalized at each channel by subtracting its mean level from the input frame level.
- Block 56 illustrates the parallel normalization for the word template frame level. This special mean level determination is important because in the instance where the input frame level exceeds the buffer level at only a few channels, then the remaining channels should not significantly weigh the resulting normalization level. If they did significantly weigh the resulting level, then the normalization process would be affected by the background noise rather than depending only on the speech energy.
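One plausible reading of this normalization step, with illustrative names: the mean level is computed only over the channels flagged as exceeding the buffer level, so channels dominated by background noise do not weigh into the normalization of either frame.

```python
def normalize(levels, speech_mask):
    """Normalize per-channel log energies by the mean of the speech channels.

    levels      -- per-channel log energies of a frame
    speech_mask -- True where the input frame exceeds the buffer level (Ci = 1)
    """
    active = [lv for lv, m in zip(levels, speech_mask) if m]
    # Only speech-bearing channels contribute to the normalization level.
    mean = sum(active) / len(active) if active else 0.0
    return [lv - mean for lv in levels]
```

Applied to both the input frame and the word template frame with the same mask, this keeps the subsequent level comparison dependent on speech energy rather than on the background noise.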
- the first measurement is the difference between the normalized word template frame level and the normalized input frame level. This difference is denoted 'T', as seen in block 62.
- the Ci test is illustrated in block 64 and the absolute value of T is shown accumulated into the distance score, DIST, in block 66.
- the second measurement tests if T is greater than zero, block 68, only for those channels where the buffer level is greater than the input frame level, i.e. where Ci = 0.
- if T is not greater than zero, the word template level is at or below the input frame level, and it follows that the word template frame is also below the buffer level. This is important because if both levels are below the buffer level, no true level comparison can be made. Therefore, the expected value 'e' is assigned.
- the total distance score is shown accumulating 'e' in block 70.
- if the word template frame level is not detected below the buffer level (T > 0), then it is conceivable that a somewhat accurate measurement can be made between this level and the input frame level, which is somewhere below the buffer level. If this measurement is not greater than a preselected nominal differential, then this differential becomes the assigned value at that channel. Although this value may differ depending on the application, it is preferred that it be the value 'e'. If the measurement is greater than the differential, then the value of 'T' becomes the value assigned to the channel. This is illustrated in block 72, with the greater of either 'e' or the value of 'T' accumulated into the total distance score, DIST. This value assignment process and accumulation is done at each channel, as indicated by blocks 70, 74 and 76. These accumulated values become the total distance score, DIST, representing an accurate distance measure between a frame from a spoken word and a frame from a potentially matching word template.
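Putting the flowchart of Figures 4, 5a and 5b together, the full per-channel value assignment and accumulation might be sketched as below. The names, defaults, and exact handling are illustrative assumptions, not the patent's code.

```python
def total_distance(input_frame, template_frame, noise_floor, d=3.0, e=1.0):
    """Sketch of the frame comparison flowchart (illustrative names).

    d -- buffer differential above the noise floor (about 3 dB here)
    e -- predetermined expected / nominal channel value
    """
    # Ci = 1 where the input frame is substantially above the noise floor.
    C = [x > n + d for x, n in zip(input_frame, noise_floor)]

    def norm(levels):
        # Mean level over speech-bearing channels only (blocks 56/59).
        active = [lv for lv, c in zip(levels, C) if c]
        mean = sum(active) / len(active) if active else 0.0
        return [lv - mean for lv in levels]

    x_n, t_n = norm(input_frame), norm(template_frame)

    dist = 0.0
    for c, x, t in zip(C, x_n, t_n):
        T = t - x                      # block 62
        if c:                          # input well above noise: full compare
            dist += abs(T)             # block 66
        elif T <= 0:                   # both levels below the buffer
            dist += e                  # block 70
        else:                          # template above, input buried in noise
            dist += max(e, T)          # block 72
    return dist
```

A total of zero indicates a perfect match; noise-buried channels contribute at least `e`, preventing nonrepresentative accumulation.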
- the buffer differential, d, and the predetermined expected channel value, e, may be different for different channels.
- the above described technique can be modified to work with Euclidean or weighted Euclidean distance measures by making appropriate changes to blocks 66 and 72.
- this method may be applied to any speech recognition system which uses channel bank type information in representing the speech utterances; however, a problem arises when this method is used with a truncated searching technique.
- a truncated searching technique, such as Beam Decoding, only extends decoding paths whose accumulated distance is within a threshold of the accumulated distance for the best current path. This searching strategy reduces searching time and is well known.
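The beam pruning described here can be illustrated with a minimal sketch; the function name and path representation are hypothetical.

```python
def prune(paths, beam_width):
    """Keep only decoding paths within beam_width of the best current path.

    paths -- mapping of path identifier to accumulated distance
    """
    best = min(paths.values())
    return {p: d for p, d in paths.items() if d <= best + beam_width}
```

For example, with accumulated distances `{'a': 1.0, 'b': 3.0, 'c': 10.0}` and a beam width of `5.0`, only paths `'a'` and `'b'` survive to be extended.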
- the input frame will be considered dissimilar to a word template frame if the word template frame's energy is about equal to the peak energy of the entire word template. This offers an additional similarity measure for the truncated search strategy. If the distance scores for each frame of the different word templates are not distinguishing, then energy thresholding can be used. This is because an input frame from a spoken word, with relatively low energy, cannot correspond to a matching word template frame if the latter frame has relatively high energy.
- the preferred energy threshold test is as follows: if the average energy of the word template frame is within 12 dB of the peak energy in the word template, and the average energy of the input frame is less than the "valley" plus 6 dB, then the word template frame does not correspond to the input frame. The term "valley" is used to represent the last previously detected minimum energy level of speech relative to the present frame. For further information regarding valley detectors, reference may be made to U.S. Patent No. 4,378,603.
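The preferred energy threshold test can be written directly as a predicate. The 12 dB and 6 dB figures are those stated above; the function name and signature are illustrative.

```python
def frames_incompatible(template_energy, template_peak, input_energy, valley):
    """True when the word template frame cannot correspond to the input frame.

    template_energy -- average energy of the word template frame (dB)
    template_peak   -- peak energy in the entire word template (dB)
    input_energy    -- average energy of the input frame (dB)
    valley          -- last detected minimum speech energy level (dB)
    """
    # High-energy template frame vs. low-energy input frame: no match possible.
    return (template_energy >= template_peak - 12.0 and
            input_energy < valley + 6.0)
```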
- this frame comparison method can be used in an unlimited number of word template searching techniques.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Time-Division Multiplex Systems (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US81659886A | 1986-01-06 | 1986-01-06 | |
US816598 | 1986-01-06 |
Publications (2)
Publication Number | Publication Date |
---|---|
EP0255529A1 EP0255529A1 (en) | 1988-02-10 |
EP0255529A4 true EP0255529A4 (en) | 1988-06-08 |
Family
ID=25221081
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19870900768 Withdrawn EP0255529A4 (en) | 1986-01-06 | 1986-12-29 | Frame comparison method for word recognition in high noise environments
Country Status (5)
Country | Link |
---|---|
EP (1) | EP0255529A4 (fi) |
JP (1) | JPS63502304A (fi) |
CA (1) | CA1301338C (fi) |
FI (1) | FI873567A (fi) |
WO (1) | WO1987004294A1 (fi) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4805218A (en) * | 1987-04-03 | 1989-02-14 | Dragon Systems, Inc. | Method for speech analysis and speech recognition |
EP0459362B1 (en) * | 1990-05-28 | 1997-01-08 | Matsushita Electric Industrial Co., Ltd. | Voice signal processor |
JP3033061B2 (ja) * | 1990-05-28 | 2000-04-17 | Matsushita Electric Industrial Co., Ltd. | Speech/noise separation device
KR950013551B1 (ko) * | 1990-05-28 | 1995-11-08 | Matsushita Electric Industrial Co., Ltd. | Noise signal prediction device
DE69132749T2 (de) * | 1990-05-28 | 2002-07-04 | Matsushita Electric Industrial Co., Ltd. | Speech signal processing device for determining a speech signal in a noisy speech signal
DE69133085T2 (de) * | 1990-05-28 | 2003-05-15 | Matsushita Electric Industrial Co., Ltd. | Speech coder
US6937674B2 (en) * | 2000-12-14 | 2005-08-30 | Pulse-Link, Inc. | Mapping radio-frequency noise in an ultra-wideband communication system |
WO2012176199A1 (en) * | 2011-06-22 | 2012-12-27 | Vocalzoom Systems Ltd | Method and system for identification of speech segments |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2137791A (en) * | 1982-11-19 | 1984-10-10 | Secr Defence | Noise Compensating Spectral Distance Processor |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4052568A (en) * | 1976-04-23 | 1977-10-04 | Communications Satellite Corporation | Digital voice switch |
GB1569450A (en) * | 1976-05-27 | 1980-06-18 | Nippon Electric Co | Speech recognition system |
JPS58130393A (ja) * | 1982-01-29 | 1983-08-03 | Kabushiki Kaisha Toshiba | Speech recognition device
-
1986
- 1986-12-29 WO PCT/US1986/002826 patent/WO1987004294A1/en not_active Application Discontinuation
- 1986-12-29 JP JP87500806A patent/JPS63502304A/ja active Pending
- 1986-12-29 EP EP19870900768 patent/EP0255529A4/en not_active Withdrawn
- 1986-12-30 CA CA000526489A patent/CA1301338C/en not_active Expired - Lifetime
-
1987
- 1987-08-18 FI FI873567A patent/FI873567A/fi not_active IP Right Cessation
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2137791A (en) * | 1982-11-19 | 1984-10-10 | Secr Defence | Noise Compensating Spectral Distance Processor |
Non-Patent Citations (1)
Title |
---|
See also references of WO8704294A1 * |
Also Published As
Publication number | Publication date |
---|---|
CA1301338C (en) | 1992-05-19 |
FI873567A0 (fi) | 1987-08-18 |
WO1987004294A1 (en) | 1987-07-16 |
FI873567A (fi) | 1987-08-18 |
JPS63502304A (ja) | 1988-09-01 |
EP0255529A1 (en) | 1988-02-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US4918732A (en) | Frame comparison method for word recognition in high noise environments | |
US5131043A (en) | Method of and apparatus for speech recognition wherein decisions are made based on phonemes | |
EP1159737B1 (en) | Speaker recognition | |
AU2001273410A1 (en) | Method and apparatus for constructing voice templates for a speaker-independent voice recognition system | |
CN108335699A (zh) | Voiceprint recognition method based on dynamic time warping and voice activity detection | |
JP3298858B2 (ja) | Partition-based similarity method for a low-complexity speech recognizer | |
Hogg et al. | Speaker change detection using fundamental frequency with application to multi-talker segmentation | |
US5023911A (en) | Word spotting in a speech recognition system without predetermined endpoint detection | |
CA1301338C (en) | Frame comparison method for word recognition in high noise environments | |
Rabiner et al. | Some preliminary experiments in the recognition of connected digits | |
Sriskandaraja et al. | A model based voice activity detector for noisy environments. | |
EP1488410B1 (en) | Distortion measure determination in speech recognition | |
Ahmad et al. | An isolated speech endpoint detector using multiple speech features | |
Demuynck et al. | Feature versus model based noise robustness. | |
JPH034918B2 (fi) | ||
CA2013263C (en) | Rejection method for speech recognition | |
JPH0242238B2 (fi) | ||
Vroomen et al. | Robust speaker-independent hidden Markov model based word spotter | |
EP0551374A1 (en) | Boundary relaxation for speech pattern recognition | |
JP3100208B2 (ja) | Speech recognition device | |
JPH0361957B2 (fi) | ||
JPH0247758B2 (fi) | ||
JPS6129897A (ja) | Pattern comparison device | |
JPH01185599A (ja) | Speech recognition device | |
JPH04254897A (ja) | Speech recognition method | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 19870813 |
|
AK | Designated contracting states |
Kind code of ref document: A1 |
Designated state(s): DE FR GB IT NL SE |
|
A4 | Supplementary search report drawn up and despatched |
Effective date: 19880608 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN |
|
18W | Application withdrawn |
Withdrawal date: 19880808 |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: GERSON, IRA, ALAN |
Inventor name: LINDSLEY, BRETT, LOUIS |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230522 |