EP0143161A1 - Einrichtung zum automatischen Feststellen einer Sprachsignalaktivität - Google Patents

Einrichtung zum automatischen Feststellen einer Sprachsignalaktivität (Apparatus for automatically detecting speech signal activity)

Info

Publication number
EP0143161A1
EP0143161A1 (application EP84107846A)
Authority
EP
European Patent Office
Prior art keywords
speech
signals
frames
noise
detection threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP84107846A
Other languages
English (en)
French (fr)
Inventor
Sandra E. Hutchins
Steven F. Boll
George Vensko
Lawrence Carlin
Allen R. Smith
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Standard Electric Corp
Original Assignee
International Standard Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Standard Electric Corp filed Critical International Standard Electric Corp
Publication of EP0143161A1 publication Critical patent/EP0143161A1/de
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • This invention relates to an apparatus and method for speaker independent speech activity detection in an environment of relatively high level noise, and to automatic speech recognizers which use such speaker independent speech activity detection.
  • This invention also relates to U.S. application Serial No. 473,422, filed March 9, 1983, entitled "Apparatus and Method for Automatic Speech Recognition", assigned along with this application to a common assignee, and hereby incorporated by reference as if specifically set forth herein.
  • Automatic speech recognition systems provide a means for man to interface with communication equipment, computers and other machines in a human's most natural and convenient mode of communication. Where required, this will enable operators of telephones, computers, etc. to call others, enter data, request information and control systems when their hands and eyes are busy, when they are in the dark, or when they are unable to be stationary at a terminal.
  • One known approach to automatic speech recognition involves the following: periodically sampling a bandpass filtered (BPF) audio speech input signal to create frames of data and then preprocessing the data to convert them to processed frames of parametric values which are more suitable for speech processing; storing a plurality of templates (each template is a plurality of previously created processed frames of parametric values representing a word, which when taken together form the reference vocabulary of the automatic speech recognizer); and comparing the processed frames of speech with the templates in accordance with a predetermined algorithm, such as the dynamic programming algorithm (DPA) described in an article by F. Itakura, entitled "Minimum prediction residual principle applied to speech recognition", IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-23, pp. 67-72, February 1975, to find the best time alignment path or match between a given template and the spoken word.
  • BPF bandpass filtered
  • Automatic Speech Recognizers depend on detecting the end points of speech based on measurements of energy.
  • Prior art speech activity detectors discriminate between energy, assumed to be speech, and lack of energy, assumed to be silence. Therefore, prior art Automatic Speech Recognizers require a relatively quiet environment in which to operate; otherwise, performance in terms of recognition accuracy drops drastically. Requiring a quiet environment restricts the uses to which a Speech Recognizer can be put; for example, prior art recognizers would have difficulty operating on a noisy factory floor or in the cockpit of a tactical aircraft.
  • Such noisy environments as these can be characterized as having background noise present whether or not speech is present and noise events occurring when speech is not present, the noise events sometimes having signal levels equal to or greater than the speech signal levels.
  • the input signals are frequency filtered to provide a plurality of filter output signals which are then digitized.
  • the frames are created from the digitized filter output signals.
  • a linear transformation is applied to the frames of digital signal values to create a scalar feature for each frame whose magnitude will be larger for speech signals than for noise event signals.
  • a detection threshold value is created for the scalar feature magnitudes and repeatedly updated. Scalar features are compared with the detection threshold value, and the results of a plurality of successive comparisons are stored. The stored results are combined in a predetermined manner to obtain an indication of when speech signals are present.
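  • As a hedged illustration of the scalar feature and threshold comparison just described, the sketch below (in C, the language the patent itself uses for subtask 404) computes the inner product of a frame with a predefined feature vector F and tests it against the current detection threshold; NUM_COEF, feature_vec and the placeholder weights are assumptions, not values taken from the specification.

        #define NUM_COEF 15                      /* coefficients per processed frame */

        /* predefined feature vector F; real weights would come from training */
        static const int feature_vec[NUM_COEF] = { 0 /* trained weights here */ };

        /* inner product of a frame with F gives the scalar feature */
        int scalar_feature(const int frame[NUM_COEF])
        {
            int i, sum = 0;
            for (i = 0; i < NUM_COEF; i++)
                sum += feature_vec[i] * frame[i];
            return sum;
        }

        /* per-frame comparison; several successive results are combined later */
        int above_threshold(const int frame[NUM_COEF], int detect_threshold)
        {
            return scalar_feature(frame) > detect_threshold;
        }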
  • frames are further preprocessed before being compared with stored templates representing the vocabulary of recognizable words.
  • the comparison is based on the dynamic programming algorithm (DPA).
  • Fig. 1 is a block diagram of an automatic speech recognizer apparatus designated generally 100. It comprises a microphone 102; a microphone preamplifier circuit 104; a bandpass filter bank circuit 108 for providing a digital spectrum sampling of the audio output of circuit 104; a pair of processors 110 and 112 interconnected by inter-processor communication circuits 114 and 116; and an external non-volatile memory device 118.
  • processors 110 and 112 are Motorola MC68000 microprocessors and inter-processor communication circuits 114 and 116 are conventionally designed circuits for handling interrupts and data transfers between MC68000 microprocessors. Interrupt procedures for the MC68000 are adequately described in the MC68000 specification.
  • the speech recognition algorithm is stored in the EPROM memory portions 122 and 124 of the processors 110 and 112, respectively, while the predefined vocabulary is stored as previously created templates in the external non-volatile memory device 118, which in the preferred embodiment is an Intel bubble memory, Model No. 7110, capable of storing one million bits. In the preferred embodiment, there are only 36 words in the vocabulary and, hence, 36 templates, with 4000 bits required per template on the average. Hence, the bubble memory is capable of storing approximately 250 templates. When templates are needed for comparison with incoming frames of speech data from BPF circuit 108, they are brought from memory 118 into working memory 126 in processor 112.
  • In Fig. 2, a more detailed block diagram of the bandpass filter bank circuit 108 is shown.
  • the output from preamp 104 on lead 130 from Fig. 1 is transmitted to an input amplifier stage 200 which has a 3 dB bandwidth of 10 kHz. This is followed by a 6 dB per octave preemphasis amplifier 202 having selectable cut-in frequencies of 500 or 5000 Hz. It is conventional practice to provide more gain at the higher frequencies than at the lower frequencies, since the higher frequencies are generally lower in amplitude in speech data.
  • the signal splits and is provided to the inputs of anti-aliasing filters 204 (with a cutoff frequency of 1.4 kHz) and 206 (with a cutoff frequency of 10.5 kHz). These are provided to eliminate aliasing which may result because of subsequent sampling.
  • BPF 208 includes channels 1-9 while BPF 210 includes channels 10-19.
  • Each of channels 1-18 contains a one-third octave filter.
  • Channel 19 contains a full octave filter.
  • the channel filters are implemented in a conventional manner using Reticon Model Numbers R5604 and R5606 switched-capacitor devices.
  • Fig. 3 gives the clock input frequency, center frequency and 3 dB bandwidth of the 19 channels of the BPF circuits 208 and 210.
  • the bandpass filter clock frequency inputs required for the BPF circuits 208 and 210 are generated in a conventional manner from a clock generator circuit 212 driven by a 1.632 MHz clock 213.
  • the 19 channel samples are then multiplexed through multiplexers 216 and 218 (Siliconix Model No. DG506) and converted from analog to digital signals in log A/D converter 220, a Siliconix device, Model No. DF331.
  • the converter 220 has an 8 bit serial output which is converted to a parallel format in serial to parallel register 222 (National Semiconductor Model No. DM86LS62) for input to processor 110 via bus 132.
  • a 2 MHz clock 224 generates various timing signals for the circuitry 214, multiplexers 216 and 218 and for A/D converter 220.
  • a sample and hold command is sent to circuitry 214 once every 10 milliseconds over lead 215. Then each of the sample and hold circuits is multiplexed sequentially (one every 500 microseconds) in response to a five-bit selection signal transmitted via bus 217 to circuits 216 and 218 from timing circuit 226. Four bits are used by each circuit while one bit is used to select which circuit. It therefore takes 10 milliseconds to A/D convert 19 sampled channels plus a ground reference sample. These 20 8-bit digital signals are called a frame of data and they are transmitted over bus 132 at appropriate times to microprocessor 110.
  • Once every frame a status signal is generated by timing generator circuit 226 and provided to processor 110 via lead 228. This signal serves to sync the filter circuit 108 timing to the processor 110 input. Timing generator circuit 226 further provides a 2 kHz data ready strobe via lead 230 to processor 110. This provides 20 interrupt signals per frame to processor 110.
  • In Fig. 4, a block diagram of the automatic speech recognition algorithm 400 of the present invention is presented. It can be divided into four subtasks: bandpass filter data transformation 402; speech activity detection 404; variable frame rate encoding and normalized mel-cepstral transformation 406; and recognition 408.
  • the speech activity detection subtask 404 has been implemented in C language for use on a VAX 11/780 and in assembly language for use on an MC68000.
  • C language is a higher order language commonly used in the technical community and available from Western Electric.
  • the C language version of subtask 404 will be described in more detail in connection with a description of Fig. 7.
  • the microprocessor 110 is interrupted by the circuit 108 via lead 230.
  • the software which handles that interrupt is the BPF transformation subtask 402.
  • the new 8-bit filter value from bus 132 is stored into a buffer on each interrupt, but every 10 milliseconds (on the 20th interrupt) a new frame signal is sent via lead 228.
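  • A minimal sketch of such an interrupt handler is given below, assuming the 2 kHz data-ready strobe drives it and that a completed frame is handed to subtask 402 after 20 samples; read_bus_132() and bpf_transform_frame() are hypothetical helper names, not taken from the specification.

        #define CHANNELS_PER_FRAME 20            /* 19 filter channels + ground reference */

        extern unsigned char read_bus_132(void);                 /* hypothetical bus read   */
        extern void bpf_transform_frame(const unsigned char *);  /* hands frame to 402      */

        static unsigned char frame_buf[CHANNELS_PER_FRAME];
        static int sample_idx = 0;

        /* called on each 2 kHz data-ready strobe (lead 230) */
        void bpf_data_ready_isr(void)
        {
            frame_buf[sample_idx++] = read_bus_132();
            if (sample_idx == CHANNELS_PER_FRAME) {   /* 20th interrupt: frame complete */
                sample_idx = 0;
                bpf_transform_frame(frame_buf);
            }
        }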
  • the BPF transformation subtask 402 takes the 19 8-bit filter values that were buffered, combines the first three values as the first coefficient and the next two values as the second coefficient, and discards the 19th value because it has been found to contain little if any useful information, especially in a noisy environment.
  • the resulting 15 coefficients characterize one 10 ms frame of the input signal.
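  • The 19-to-15 reduction can be sketched as follows; the patent only says the leading values are "combined", so the use of simple averaging here is an assumption.

        /* 19 buffered 8-bit filter values -> 15 coefficients for one 10 ms frame */
        void bpf_transform(const unsigned char filt[19], int coef[15])
        {
            int i;
            coef[0] = (filt[0] + filt[1] + filt[2]) / 3;   /* channels 1-3  -> coefficient 1 */
            coef[1] = (filt[3] + filt[4]) / 2;             /* channels 4-5  -> coefficient 2 */
            for (i = 5; i < 18; i++)                       /* channels 6-18 -> coefficients 3-15 */
                coef[i - 3] = filt[i];
            /* channel 19 (filt[18]) is discarded: little useful information in noise */
        }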
  • the transformed frame of speech is passed on to buffer 410 and then to the VFRE and mel-cepstral transformation subtask 406 if the speech activity detector subtask 404 has indicated that speech is present.
  • the speech activity detector subtask 404 will be explained in more detail later. Assuming for the moment that subtask 404 indicates that speech is present, then in subtask 406 the Euclidean distance between a previously stored frame and the current frame in buffer 410 is determined. If the distance is small (large similarity) and not more than two frames of data have been skipped, the current frame is passed over; otherwise it is stored for future comparison and passed on to the next step of normalized mel-cepstral transformation. On the average, one-half of the data frames from the circuit 108 are passed on (i.e. 50 frames per second).
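  • A sketch of this variable frame rate encoding step, assuming 15-coefficient frames; DIST_THRESH and the exact skip limit are illustrative choices, not values given in the specification.

        #include <math.h>

        #define NUM_COEF    15
        #define DIST_THRESH 100.0     /* "small" Euclidean distance (assumed value) */
        #define MAX_SKIP    2         /* at most two consecutive frames skipped     */

        static double last_frame[NUM_COEF];
        static int    skipped = 0;

        /* returns 1 if the frame should be passed on to the mel-cepstral step */
        int pass_frame(const double frame[NUM_COEF])
        {
            double d = 0.0;
            int i;

            for (i = 0; i < NUM_COEF; i++) {
                double diff = frame[i] - last_frame[i];
                d += diff * diff;
            }
            d = sqrt(d);

            if (d < DIST_THRESH && skipped < MAX_SKIP) {
                skipped++;                      /* similar frame: pass it over */
                return 0;
            }
            for (i = 0; i < NUM_COEF; i++)      /* keep for future comparisons */
                last_frame[i] = frame[i];
            skipped = 0;
            return 1;
        }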
  • the 15 filter coefficients are reduced to 5 coefficients by a linear transformation matrix.
  • a commonly used matrix comprises a family of 5 "mel-cosine" vectors that transform the bandpass filter data into an approximation of "mel-cepstral” coefficients.
  • Mel-cosine linear transformations are discussed in (1) Davis, S.B. and Mermelstein, P., "Evaluation of Acoustic Parameters for Monosyllable Word Identification", Journal Acoust. Soc. Am., Vol. 64, Suppl. 1, pp. S180-181, Fall 1978 (Abstract) and (2) S. Davis and P. Mermelstein, "Comparison of Parameter Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. Acoustics, Speech and Signal Processing.
  • normalized mel-cepstral transformation: the raw BPF data is normalized to zero mean, normalized to zero net slope above 500 Hz, and mel-cosine transformed in one step.
  • the first mel-cepstral coefficient (which is very sensitive to spectral slope) is not used.
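  • The 15-to-5 reduction is a plain matrix-vector product, sketched below; the mel-cosine rows are placeholders, and in the actual one-step implementation the zero-mean and zero-slope normalizations would be folded into the matrix.

        #define NUM_COEF 15
        #define NUM_CEP   5

        /* five "mel-cosine" rows; the row for the slope-sensitive first
           mel-cepstral coefficient would be omitted from this set */
        static const double melcos[NUM_CEP][NUM_COEF] = { { 0.0 } /* trained rows here */ };

        void mel_cepstrum(const double coef[NUM_COEF], double cep[NUM_CEP])
        {
            int i, j;
            for (i = 0; i < NUM_CEP; i++) {
                cep[i] = 0.0;
                for (j = 0; j < NUM_COEF; j++)
                    cep[i] += melcos[i][j] * coef[j];
            }
        }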
  • Each frame which has undergone mel-cepstral transformation is then compared with each of the templates representing the vocabulary, which are now stored in the processor's working memory 126.
  • the comparison is done in accordance with a recognition portion 408 of the algorithm described in the above-mentioned patent application, Serial No: 473,422, filed March 9, 1983 and based on the well-known dynamic programming algorithm (DPA) which is described in an article by F. Itakura entitled “Minimum Prediction Residual Principle Applied to Speech Recognition", IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-23, pp. 67-72, February 1975.
  • DPA dynamic programming algorithm
  • a modified version of the DPA is used, called a windowed DPA with path boundary control.
  • a summary of the DPA is provided in connection with a description of Fig. 5.
  • a template is placed on the y-axis 502 and the input word to be recognized is placed on the x-axis 504 to form a DPA matrix 500. Every cell in the matrix corresponds to a one-to-one mapping of a template frame with a word frame. Any time alignment between the frames of these patterns can be represented by a path through the matrix from the lower-left corner to the upper-right corner.
  • a typical alignment path 506 is shown.
  • the DPA function finds the locally optimal path through the matrix by progressively finding the best path to each cell, D, in the matrix by extending the best path ending in the three adjacent cells labeled by variables A, B, and C.
  • the path that has the minimum score is selected to be extended to D subject to the local path constraint: every horizontal or vertical step must be followed by a diagonal step. For example, if a vertical step was made into cell C, the path at cell C cannot be chosen as the best path to cell D.
  • the path score at cell D is updated with the previous path score (from A, B, or C) plus the frame-to-frame distance at cell D. This distance is doubled before adding if a diagonal step was chosen to aid in path score normalization.
  • the movement of the DPA function is along the template axis for each utterance frame.
  • the function just described is repeated in the innermost loop of the recognition algorithm by resetting the B variable to cell D's score, the A variable to cell C's score and retrieving from storage a new value for C.
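  • A hedged sketch of the cell update performed in this inner loop is given below; which adjacent cell corresponds to a horizontal step and which to a vertical step is an assumption about the Fig. 5 labeling.

        /* best score at cell D given scores ending at A (diagonal), B (horizontal)
           and C (vertical); b_was_horiz / c_was_vert enforce the rule that a
           horizontal or vertical step must be followed by a diagonal one */
        int dpa_cell(int scoreA,
                     int scoreB, int b_was_horiz,
                     int scoreC, int c_was_vert,
                     int dist, int *step_diag)
        {
            int best = scoreA;                     /* diagonal step from A */
            *step_diag = 1;

            if (!b_was_horiz && scoreB < best) {   /* horizontal step allowed? */
                best = scoreB;
                *step_diag = 0;
            }
            if (!c_was_vert && scoreC < best) {    /* vertical step allowed? */
                best = scoreC;
                *step_diag = 0;
            }
            /* frame-to-frame distance is doubled on a diagonal step
               to aid path score normalization */
            return best + (*step_diag ? 2 * dist : dist);
        }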
  • to create a good feature vector F, frames of BPF data from a plurality of speakers, together with noise events occurring when speech is not present, are collected and modified as described above.
  • the data is divided into sets of speech frames [S] and noise event frames [N].
  • S speech frames
  • N noise event frames
  • the resultant scalar features formed by the inner product operation with the modified frames are collected and formed into a histogram designated generally 710 in Fig. 7.
  • the x-axis 712 is the magnitude of the scalar feature while the y-axis 714 is the number of times a particular magnitude occurs. Jet noise 716 and regulator sounds 718 occur below a threshold 720 while voice 722 occurs above the threshold 720.
  • When the speech recognizer is being used, e.g., in flight in an aircraft cockpit, the speech activity detection subtask 404 initially selects a detection threshold but thereafter continually gathers statistics and updates the histogram on the feature 726. Every 1000 frames, the detection threshold is adjusted based on the statistics in the histogram. For example, the peak 750 is located in the histogram 710, and a search is conducted forward from the peak 750 to locate the low point 720. The threshold is set to the low point value plus some bias such as one or two. Finally, each histogram entry is divided by two to keep the histogram values from growing too large.
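  • The periodic threshold update can be sketched as follows; HIST_SIZE, the bias of two, the assumption that each histogram bin spans one unit of scalar-feature magnitude, and the assumption that the noise peak is the global maximum are all illustrative choices.

        #define HIST_SIZE 256        /* number of scalar-feature magnitude bins (assumed) */
        #define BIAS      2          /* "plus some bias such as one or two" */

        /* run once every 1000 frames; returns the new detection threshold */
        int update_threshold(unsigned int hist[HIST_SIZE])
        {
            int i, peak = 0, low;

            for (i = 1; i < HIST_SIZE; i++)          /* locate the (noise) peak */
                if (hist[i] > hist[peak])
                    peak = i;

            low = peak;                              /* search forward for the low point */
            for (i = peak + 1; i < HIST_SIZE && hist[i] <= hist[low]; i++)
                low = i;

            for (i = 0; i < HIST_SIZE; i++)          /* keep counts from growing too large */
                hist[i] /= 2;

            return low + BIAS;
        }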
  • the magnitude of the detection threshold 708 is subtracted from the magnitude of the scalar feature 706 at block 730 for each frame.
  • a weighting function 732 is applied to the output value of block 730 to smooth out the values before they are filtered and clamped at 734.
  • the weighting function reduces large negative values from block 730 and reduces small positive values. Large positive values are left substantially unaffected.
  • the weighting function cooperates with the integration process performed by the filter and clamp function 734 to provide sharp cutoff points between the beginning and end of speech detection. Large negative values provide no better indication of non-speech than smaller values, but will distort and delay the integration process from indicating when speech is present. Small positive values create uncertainty as to whether speech is present and are better left undetected.
  • Examples of the preferred embodiment weighting function and of the filter and clamp function are provided in C language on page 19 of the specification.
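  • The page-19 listing is not reproduced in this text; the sketch below only illustrates the behaviour described above, and every constant in it is an assumption.

        /* x is the scalar feature minus the detection threshold (block 730) */
        int weight(int x)
        {
            if (x < 0)
                return (x < -8) ? -8 : x;     /* limit large negative values */
            if (x < 4)
                return 0;                     /* suppress small positive values */
            return x;                         /* large positive values pass through */
        }

        /* leaky integration of weighted values with clamping (block 734) */
        int filter_and_clamp(int w)
        {
            static int integ = 0;

            integ += w;
            if (integ < 0)   integ = 0;       /* silence cannot build up "debt" */
            if (integ > 100) integ = 100;     /* bound the speech-side build-up */
            return integ;                     /* large values indicate speech */
        }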
  • multi-frame decision logic 738 is employed to make a decision whether speech is present. For example, if no speech were present and if all four buffers provide a positive indication, then a decision is made that speech is present, and this is passed on to block 410 in Fig. 4, otherwise a decision is made that speech still is not present. On the other hand, if speech is currently present, a decision is made that speech is still present if any one of the buffers indicates that a speech signal is present. Only if all four buffers indicate no speech signals present will a decision be made that speech is now over.
  • the above-described decision logic is provided in C language at pages 19 and 20 of the specification.
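  • The pages 19-20 listing is likewise not reproduced here; the following sketch restates the four-buffer decision rule in C, with buf[] assumed to hold the last four filtered and clamped comparison results.

        #define NUM_BUF 4

        /* returns the new speech/no-speech state given the four buffered
           indications and the current state */
        int speech_decision(const int buf[NUM_BUF], int speech_now)
        {
            int i, positives = 0;

            for (i = 0; i < NUM_BUF; i++)
                if (buf[i])
                    positives++;

            if (!speech_now)
                return (positives == NUM_BUF);   /* all four needed to declare onset */
            return (positives > 0);              /* any one buffer keeps speech present */
        }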
  • subtasks 402, 404 and 406 are performed in processor 110 while subtask 408 is performed in processor 112.
  • the speech activity detector could not be used with larger vocabulary continuous speech recognition machines.
  • speech activity detection through the use of the inner product between a predefined feature vector and frames of speech can be performed on frames of speech provided directly from the bandpass filter transformation subtask 402 even though this frame is proportional to the log of the value of the digital signals.
  • the inner product could be performed using frames whose digital signals are proportional to the magnitude of the digital signals and not the magnitude squared.
  • Results to date on the performance of the recognizer indicate recognition accuracy of 85 to 95% for worst cases of cockpit sound pressure level of 115 dB and acceleration forces of 5G. In fact, the system shows no degradation from low level ambient noise performance (95+% accuracy) to noise levels of approximately 106 dB. It should be pointed out, however, that the 115 dB sound levels at 5G acceleration forces are often simulated.
  • the pilot is speaking into an oxygen regulator which partially seals off the ambient cockpit noise. However, the stress of the noise and acceleration forces causes the pilot to speak in a less than normal speaking manner. In addition, the noise events caused by the stressed breathing of the pilot into the oxygen regulator are present.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Complex Calculations (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
EP84107846A 1983-07-08 1984-07-05 Einrichtung zum automatischen Feststellen einer Sprachsignalaktivität Withdrawn EP0143161A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US51206883A 1983-07-08 1983-07-08
US512068 1983-07-08

Publications (1)

Publication Number Publication Date
EP0143161A1 true EP0143161A1 (de) 1985-06-05

Family

ID=24037538

Family Applications (1)

Application Number Title Priority Date Filing Date
EP84107846A Withdrawn EP0143161A1 (de) 1983-07-08 1984-07-05 Einrichtung zum automatischen Feststellen einer Sprachsignalaktivität

Country Status (3)

Country Link
EP (1) EP0143161A1 (de)
JP (1) JPS6039695A (de)
CA (1) CA1218458A (de)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2278984A (en) * 1993-06-11 1994-12-14 Redifon Technology Limited Speech presence detector
GB2422279A (en) * 2004-09-29 2006-07-19 Fluency Voice Technology Ltd Determining Pattern End-Point in an Input Signal
CN108242236A (zh) * 2016-12-26 2018-07-03 现代自动车株式会社 对话处理装置及其车辆和对话处理方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9959887B2 (en) 2016-03-08 2018-05-01 International Business Machines Corporation Multi-pass speech activity detection strategy to improve automatic speech recognition

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0008551A2 (de) * 1978-08-17 1980-03-05 Thomson-Csf Anordnung zur Feststellung der Anwesenheit von Sprachsignalen und ihre Verwendung
GB2109205A (en) * 1981-10-31 1983-05-25 Tokyo Shibaura Electric Co Apparatus for detecting the duration of voice

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS56135898A (en) * 1980-03-26 1981-10-23 Sanyo Electric Co Voice recognition device
JPS5797599A (en) * 1980-12-10 1982-06-17 Matsushita Electric Ind Co Ltd System of detecting final end of each voice section
JPS57177197A (en) * 1981-04-24 1982-10-30 Hitachi Ltd Pick-up system for sound section

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0008551A2 (de) * 1978-08-17 1980-03-05 Thomson-Csf Anordnung zur Feststellung der Anwesenheit von Sprachsignalen und ihre Verwendung
GB2109205A (en) * 1981-10-31 1983-05-25 Tokyo Shibaura Electric Co Apparatus for detecting the duration of voice

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, vol. IE-30, no. 2, May 1983, pages 150-155, IEEE, New York, USA; N. KISHI et al.: "A voice input system for automobiles using a microprocessor" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2278984A (en) * 1993-06-11 1994-12-14 Redifon Technology Limited Speech presence detector
GB2422279A (en) * 2004-09-29 2006-07-19 Fluency Voice Technology Ltd Determining Pattern End-Point in an Input Signal
CN108242236A (zh) * 2016-12-26 2018-07-03 现代自动车株式会社 对话处理装置及其车辆和对话处理方法
CN108242236B (zh) * 2016-12-26 2023-12-15 现代自动车株式会社 对话处理装置及其车辆和对话处理方法

Also Published As

Publication number Publication date
JPS6039695A (ja) 1985-03-01
CA1218458A (en) 1987-02-24

Similar Documents

Publication Publication Date Title
US4624008A (en) Apparatus for automatic speech recognition
US4811399A (en) Apparatus and method for automatic speech recognition
US4933973A (en) Apparatus and methods for the selective addition of noise to templates employed in automatic speech recognition systems
US5054085A (en) Preprocessing system for speech recognition
US5794196A (en) Speech recognition system distinguishing dictation from commands by arbitration between continuous speech and isolated word modules
US6850887B2 (en) Speech recognition in noisy environments
US6133904A (en) Image manipulation
EP0077194B1 (de) Spracherkennungssystem
US4665548A (en) Speech analysis syllabic segmenter
US20070118364A1 (en) System for generating closed captions
JPH11502953A (ja) 厳しい環境での音声認識方法及びその装置
CN111508498A (zh) 对话式语音识别方法、系统、电子设备和存储介质
JPH0312319B2 (de)
CN113192535A (zh) 一种语音关键词检索方法、系统和电子装置
JPS6128998B2 (de)
KR101122590B1 (ko) 음성 데이터 분할에 의한 음성 인식 장치 및 방법
EP0248593A1 (de) Vorverarbeitungssystem zur Spracherkennung
KR101122591B1 (ko) 핵심어 인식에 의한 음성 인식 장치 및 방법
EP0143161A1 (de) Einrichtung zum automatischen Feststellen einer Sprachsignalaktivität
JPH0797279B2 (ja) 音声認識装置
JP3046029B2 (ja) 音声認識システムに使用されるテンプレートに雑音を選択的に付加するための装置及び方法
EP0177854B1 (de) Schlüsselworterkennungssystem unter Anwendung eines Sprachmusterverkettungsmodels
JPS63502304A (ja) 高雑音環境における言語認識のためのフレ−ム比較法
JPS60114900A (ja) 有音・無音判定法
JP2992324B2 (ja) 音声区間検出方法

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Designated state(s): AT DE FR GB

17P Request for examination filed

Effective date: 19860124

17Q First examination report despatched

Effective date: 19900912

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19910123

RIN1 Information on inventor provided before grant (corrected)

Inventor name: BOLL, STEVEN F.

Inventor name: SMITH, ALLEN R.

Inventor name: VENSKO, GEORGE

Inventor name: HUTCHINS, SANDRA E.

Inventor name: CARLIN, LAWRENCE