EP0254409B1 - Vorrichtung und Verfahren zur Spracherkennung


Info

Publication number
EP0254409B1
Authority
EP
European Patent Office
Prior art keywords
output
microphone
pattern matching
speaker
speech recognition
Prior art date
Legal status
Expired - Lifetime
Application number
EP19870305087
Other languages
English (en)
French (fr)
Other versions
EP0254409A1 (de)
Inventor
Michael Robinson Taylor
Current Assignee
Smiths Group PLC
Original Assignee
Smiths Group PLC
Priority date
Filing date
Publication date
Application filed by Smiths Group PLC filed Critical Smiths Group PLC
Publication of EP0254409A1 publication Critical patent/EP0254409A1/de
Application granted granted Critical
Publication of EP0254409B1 publication Critical patent/EP0254409B1/de
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/24: Speech recognition using non-acoustical features
    • G10L15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • This invention relates to speech recognition apparatus of the kind including an optical device mounted to view a part at least of the mouth of a speaker, the optical device providing an output that varies with movement of the speaker's mouth, a microphone that derives an output in respect of the sound produced by the speaker and a processing unit that derives from the output of the optical device and the microphone information as to the speech sounds made by the speaker.
  • Such an apparatus is known, for example, from PROCEEDINGS OF THE IEEE COMPUTER VISION AND PATTERN RECOGNITION, San Francisco, 19th - 23rd June 1985, pages 40-47; E. D. PETAJAN: "Automatic lipreading to enhance speech recognition".
  • Speech signal processing can also be used in communication systems, where the speech input is degraded by noise, to improve the quality of speech output. This generally involves filtering and signal enhancement but usually results in some loss of the speech information where high noise is present.
  • According to the present invention there is provided speech recognition apparatus characterised in that it also includes a laryngograph, responsive to changes in impedance to electromagnetic radiation during movement of the vocal folds so as thereby to derive information regarding the voiced sounds, the output from the laryngograph being used in said processing unit, in combination with said outputs of the optical device and the microphone, to improve identification of the speech sounds.
  • The outputs of the microphone and the laryngograph are preferably combined in order to identify sound originating from the speaker and sound originating from external sources, the apparatus including a first store containing information of a reference vocabulary of sound signal information, and a first pattern matching unit connected to the store and connected to receive the output of the microphone after rejection of signals associated with sounds originating from external sources.
  • The apparatus may include a second store containing information of a reference vocabulary of visual characteristics of the speaker's mouth, a second pattern matching unit that selects the closest match between the output of the optical device and the vocabulary in the second store, and a comparator that receives the outputs of the first and second pattern matching units and provides an output representative of the most probable speech sounds made in accordance therewith.
  • The apparatus may include a circuit that modifies the output of the microphone by the output of the second pattern matching unit, the output of the microphone being supplied to the first pattern matching unit after modification by the second pattern matching unit.
  • With reference first to Figures 1 and 2, there is shown a speaker wearing a breathing mask 1 having an air supply line 2 that opens into the mask on one side.
  • An exhaust valve, not shown, is provided on the other side in the conventional way.
  • The mask 1 also supports a microphone 5 that is located to detect speech by the user and to supply electrical output signals on line 50, in accordance with the speech and other sounds within the mask, to a speech recognition unit 10.
  • The mask also supports a small, lightweight CCD television camera 6 which is directed to view the region of the user's mouth, including an area immediately around the mouth, represented by the pattern in Figure 2.
  • The camera 6 is responsive to infra-red radiation so that it does not require additional illumination.
  • Alternatively, the camera 6 may be responsive to visible or ultraviolet radiation if suitable illumination is provided. Signals from the camera 6 are supplied via line 60 to the speech recognition unit 10.
  • The speech recognition apparatus also includes a laryngograph 20 of conventional construction, such as described in ASHA Reports 11, 1981, p 116 - 127.
  • The laryngograph 20 includes two electrodes 21 and 22 secured to the skin of the user's throat by means of a neck band 23.
  • The electrodes 21 and 22 are located on opposite sides of the neck, level with the thyroid cartilage.
  • Each electrode 21 and 22 is flat and circular in shape, being between 15 and 30mm in diameter, with a central circular plate and a surrounding annular guard ring insulated from the central plate.
  • One electrode 21 is connected via a coaxial cable 24 to a supply unit 25 which applies a 4MHz transmitting voltage between the central plate and guard ring of the electrode.
  • The other electrode 22 serves as a current pick-up.
  • Current flow through the user's neck will vary according to movement of the user's vocal folds. More particularly, current flow increases (that is, impedance decreases) when the area of contact between the vocal folds increases, although movement of the vocal folds which does not vary the area of contact will not necessarily produce any change in current flow.
  • The output from the second electrode 22 is supplied on line 26 to a processing unit 27.
  • The output signal is modulated according to the frequency of excitation of the vocal tract and thereby provides information about phonation, or voiced speech, of the user. This signal is unaffected by external noise and by movement of the user's mouth and tongue.
  • The processing unit 27 provides an output signal on line 28 in accordance with the occurrence and frequency of voiced speech, this signal being in a form that can be handled by the speech recognition unit 10.
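The voicing decision derived by a processing unit of this kind can be illustrated with a toy sketch. This is not the patent's implementation: the function name, frame length and threshold are all invented for illustration, and the idea is simply that frames of the demodulated laryngograph waveform are marked as voiced when the impedance-derived signal shows appreciable excursion (the vocal folds are opening and closing).

```python
import numpy as np

def detect_voicing(lx_signal, sample_rate, frame_len=0.01, threshold=0.1):
    """Mark frames of a demodulated laryngograph (Lx) waveform as voiced.

    A frame is treated as voiced when the peak-to-peak excursion of the
    impedance-derived waveform exceeds `threshold`. Frame length and
    threshold are illustrative, not taken from the patent.
    """
    n = int(frame_len * sample_rate)
    voiced = []
    for start in range(0, len(lx_signal) - n + 1, n):
        frame = lx_signal[start:start + n]
        voiced.append(bool(frame.max() - frame.min() > threshold))
    return voiced

# Synthetic test signal: 100 Hz "vocal fold" oscillation, then silence.
sr = 8000
t = np.arange(sr) / sr
lx = np.concatenate([np.sin(2 * np.pi * 100 * t[:sr // 2]), np.zeros(sr // 2)])
flags = detect_voicing(lx, sr)   # voiced in the first half, unvoiced after
```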
  • Signals from the microphone 5 are first supplied to a spectral analysis unit 51 which produces output signals in accordance with the frequency bands within which the sound falls. These signals are supplied to a spectral correction and noise adaptation unit 52 which improves the signal-to-noise ratio or eliminates, or marks, those speech signals that have been corrupted by noise.
  • The spectral correction unit 52 also receives input signals from the laryngograph 20 on line 28. These signals are used to improve the identification of speech sounds.
  • The microphone 5 receives signals which may have arisen from voiced speech (that is, speech with sound produced by vibration of the vocal folds) or from external noise that produces sounds similar to phonemes.
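As a rough illustration of the short-term spectral analysis performed by a unit such as 51, a speech frame can be reduced to a vector of band energies. The band count, frame size and window choice below are arbitrary assumptions, not details from the patent.

```python
import numpy as np

def band_energies(frame, sample_rate, n_bands=8):
    """Short-term spectral analysis sketch: energy in n_bands bands.

    One windowed frame is transformed with an FFT and its power spectrum
    is summed into contiguous frequency bands, yielding the kind of
    frequency-band output described for the spectral analysis unit.
    """
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    bands = np.array_split(power, n_bands)
    return np.array([b.sum() for b in bands])

# A 500 Hz tone at 8 kHz sampling: its energy lands in the lowest band.
sr = 8000
t = np.arange(256) / sr
tone = np.sin(2 * np.pi * 500 * t)
e = band_energies(tone, sr)
```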
  • Output signals from the unit 52 are supplied to one input of a pattern matching unit 53.
  • The other input to the pattern matching unit 53 is taken from a store 54 containing information of a reference vocabulary of sound signal information in the form of pattern templates or word models of the frequency/time patterns or state descriptions of different words.
  • The pattern matching unit 53 compares the frequency/time patterns derived from the microphone 5 with the stored vocabulary and produces an output on line 55 in accordance with the word which is the most likely fit for the sound received by the microphone.
  • The output may include information as to the probability that the word selected from the vocabulary is the actual word spoken.
  • The output may also include signals representing a plurality of the most likely words actually spoken, together with their associated probabilities.
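The patent does not specify the matching algorithm; dynamic time warping is one classical technique for comparing a frequency/time pattern against stored word templates, as the pattern matching unit is described as doing. A minimal sketch under that assumption, with a toy two-band vocabulary (the words and feature values are invented):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    Each row is one short-term feature vector (e.g. band energies);
    a smaller distance means a better fit between the utterance and
    a stored template.
    """
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def best_match(utterance, vocabulary):
    """Return the vocabulary word whose template is closest under DTW."""
    scores = {w: dtw_distance(utterance, tpl) for w, tpl in vocabulary.items()}
    return min(scores, key=scores.get)

vocab = {
    "yes": np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]]),
    "no":  np.array([[0.0, 1.0], [0.2, 0.8], [0.9, 0.1]]),
}
heard = np.array([[0.9, 0.1], [0.9, 0.1], [0.7, 0.3], [0.0, 1.0]])
word = best_match(heard, vocab)
```

Note that DTW tolerates the time-scale variation between the spoken word and the template, which is why the four-frame utterance can still match a three-frame template.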
  • The part of the unit 10 which processes the optical information from the camera 6 includes a visual processing unit 61 which receives the camera outputs.
  • The visual processing unit 61 analyses the input signals to identify key characteristics of the optical speech patterns or optical model states from the visual field of the camera, such as, for example, lip and teeth separation and lip shape. In this respect, well-known optical recognition techniques can be used.
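One of the named key characteristics, lip separation, can be illustrated with a toy measurement on a binary image. The segmentation into a mouth-opening mask, and the function itself, are assumptions for illustration only; the patent leaves the optical recognition technique to well-known methods.

```python
import numpy as np

def lip_separation(mouth_mask):
    """Estimate vertical lip separation from a binary mouth-opening mask.

    `mouth_mask` is a 2-D array where 1 marks pixels of the open mouth
    cavity; the separation is taken as the largest count of open pixels
    in any image column. A toy stand-in for the visual feature extraction
    attributed to the visual processing unit.
    """
    return int(mouth_mask.sum(axis=0).max())

mask = np.zeros((6, 8), dtype=int)
mask[2:5, 3:6] = 1            # a 3-pixel-high, 3-pixel-wide opening
sep = lip_separation(mask)
```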
  • The visual processing unit 61 supplies output signals on line 62 to one input of a pattern matching unit 63.
  • A second input to the pattern matching unit 63 is taken from a store 64 containing information of a reference vocabulary in the form of templates of the key visual characteristics of the mouth.
  • Signals from the laryngograph 20 on line 28 are supplied to a third input of the pattern matching unit 63 to improve identification of the word spoken.
  • The output of the laryngograph 20 is used to resolve situations where there is ambiguity of the sound produced from observation of mouth movement alone. For example, the sounds
  • The pattern matching unit 63 provides an output on line 65 in accordance with the word that best fits the observed mouth movement.
  • The output may also include information as to the probability that the word selected from the vocabulary is the actual word spoken.
  • The output may also include signals representing a plurality of the most likely words actually spoken, with their associated probabilities.
  • The outputs from the two pattern matching circuits 53 and 63 are supplied to a comparison unit 70 which may function in various ways. If both inputs to the comparison unit 70 indicate the same word, the comparison unit produces output signals representing that word on line 71 to a control unit 80 or other utilisation means. If the inputs to the comparison unit 70 indicate different words, the unit responds by selecting the word with the highest associated probability. Where the pattern matching units 53 and 63 produce outputs in respect of a plurality of the most likely words spoken, the comparison unit acts to select the word with the highest total probability.
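The selection rule described for the comparison unit, including the optional per-channel weighting, can be sketched as follows. The function, the weights and the candidate words are illustrative assumptions, not taken from the patent.

```python
def combine(acoustic, visual, acoustic_weight=1.0, visual_weight=1.0):
    """Comparison-unit sketch: fuse word probabilities from the acoustic
    and visual pattern matchers.

    Each input maps candidate words to probabilities; the word with the
    highest weighted total score across both channels is selected, so a
    word supported by both channels beats one supported by only one.
    """
    words = set(acoustic) | set(visual)
    score = {w: acoustic_weight * acoustic.get(w, 0.0)
              + visual_weight * visual.get(w, 0.0) for w in words}
    return max(score, key=score.get)

# Both channels favour "climb", resolving the acoustic ambiguity.
chosen = combine({"climb": 0.55, "five": 0.45}, {"climb": 0.7, "nine": 0.3})
```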
  • The comparison unit 70 may be arranged to give signals from one or other of the pattern matching units 53 or 63 a higher weighting than the other when selecting between conflicting inputs.
  • If the comparison unit 70 fails to identify a word with sufficiently high probability, it supplies a feedback output on line 72 to a feedback device 82 giving information to the user which, for example, prompts him to repeat the word, or asks him to verify that a selected word was the word spoken.
  • The feedback device 82 may generate an audible or visual signal.
  • A third output on line 73 may be provided to an optional syntax selection unit (not shown) which is used in a known way to reduce the size of the reference vocabulary for subsequent words.
  • The output signals on line 71 are supplied to the control unit 80 which effects control of the selected function in accordance with the words spoken.
  • The user first establishes the reference vocabulary in stores 54 and 64 by speaking a list of words.
  • The apparatus then stores information derived from the sound, voicing and mouth movements produced by the spoken list of words for use in future comparison.
  • A modification of the apparatus of Figure 3 is shown in Figure 4.
  • A spectrum substitution unit 74 is interposed between the spectral correction and noise adaptation unit 52 and the pattern matching unit 53.
  • This modification operates to substitute only short-term corrupted speech spectra with a 'most-likely' description of uncorrupted short-term spectra.
  • A clean spectrum most likely to be associated with the visual pattern detected by the pattern matching unit 63 is supplied to the input of the unit 53 via the spectrum substitution unit 74, in place of the noisy spectrum otherwise supplied by the unit 52.
  • The spectrum substitution unit 74 transforms the optical pattern recognised by the pattern matching unit 63 into an acoustic pattern with the same structure as the patterns produced at the outputs of the units 51 and 52.
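The substitution step can be sketched as below. All names are illustrative: the corruption flags stand in for the marking performed by the noise adaptation unit, and the lookup from visual label to clean spectrum stands in for the optical-to-acoustic transformation performed by the substitution unit.

```python
import numpy as np

def substitute_corrupted(spectra, corrupted, clean_templates, visual_labels):
    """Spectrum-substitution sketch: replace short-term spectra flagged as
    noise-corrupted with the clean template spectrum associated with the
    visually recognised pattern for that frame.

    `spectra` is frames x bands; `corrupted[i]` flags frame i for
    replacement; `visual_labels[i]` names the lip pattern matched for
    frame i; `clean_templates` maps that label to an uncorrupted spectrum.
    Uncorrupted frames pass through unchanged.
    """
    out = spectra.copy()
    for i, bad in enumerate(corrupted):
        if bad:
            out[i] = clean_templates[visual_labels[i]]
    return out

templates = {"aa": np.array([4.0, 1.0]), "ee": np.array([1.0, 4.0])}
frames = np.array([[4.1, 0.9], [9.0, 9.0]])      # second frame is noisy
clean = substitute_corrupted(frames, [False, True], templates, ["aa", "ee"])
```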
  • The optical output derived from the camera 6 and the output of the laryngograph 20 will not be affected by noise, and this can be used to make a positive recognition.
  • The invention is therefore particularly useful in noisy environments such as factories, vehicles, quarries, underwater locations, and commodity or financial dealing markets.
  • Various alternative optical means could be used to view the user's mouth.
  • For example, the end of a fibre-optic cable may be located in the breathing mask and a television camera mounted remotely at the other end of the cable.
  • Alternatively, an array of radiation detectors may be mounted in the breathing mask, or coupled remotely via fibre-optic cables, to derive signals in accordance with the position and movement of the user's mouth.
  • Other voicing sensing means could be used; for example, it may be possible to sense voicing by ultrasound techniques.
  • The optical device can be mounted with the user's head by other means, such as in a helmet, or on the microphone boom of a headset. It is not essential for the optical device to be mounted with the user's head, although this does make it easier to view the mouth since the optical field will be independent of head movement. Where the optical device is not mounted on the user's head, additional signal processing will be required to identify the location of the user's mouth.
  • The laryngograph electrodes could be mounted on an extended collar of the user's helmet.


Claims (5)

  1. Speech recognition apparatus including an optical device (6) mounted so as to view at least a part of the speaker's mouth and to produce an output signal that varies with movement of the speaker's mouth, a microphone (5) that produces an output signal in respect of the sound produced by the speaker, and a processing unit (10) that derives from the output signals of the optical device (6) and the microphone (5) information as to the speech sounds made by the speaker, characterised in that it also includes a laryngograph (20) responsive to changes in impedance to electromagnetic radiation during movement of the vocal folds so as thereby to derive information regarding the voiced sounds, the output signal of the laryngograph (20) being used in said processing unit (10) to improve identification of the speech sounds in combination with the output signals of the optical device (6) and the microphone (5).
  2. Speech recognition apparatus according to Claim 1, characterised in that the output signals of the microphone (5) and the laryngograph (20) are combined in order to identify sounds originating from the speaker and sounds originating from external sources, and in that the apparatus includes a first store (54) containing information of a reference vocabulary of sound signal information, and a first pattern matching unit (53) connected to the store (54) and connected to receive the output signal of the microphone (5) after signals originating from external sources have been rejected.
  3. Speech recognition apparatus according to Claim 2, characterised in that the apparatus includes a second store (64) containing information of a reference vocabulary of visual characteristics of the speaker's mouth, a second pattern matching unit (63) that selects the closest match between the output signal of the optical device (6) and the vocabulary in the second store (64), and a comparison unit (70) that receives the output signals of the first and second pattern matching units (53 and 63) and produces an output signal representative of the most probable speech sounds made in accordance therewith.
  4. Speech recognition apparatus according to any one of the preceding claims, characterised in that the apparatus includes a circuit (63, 74) that modifies the output signal of the microphone (5).
  5. Speech recognition apparatus according to Claim 4, characterised in that the output signal of the microphone (5) is modified by the output signal of the second pattern matching unit (63), and in that the output signal of the microphone (5) is supplied to the first pattern matching unit (53) after modification by the second pattern matching unit (63).
EP19870305087 1986-07-25 1987-06-09 Vorrichtung und Verfahren zur Spracherkennung Expired - Lifetime EP0254409B1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB8618193 1986-07-25
GB8618193A GB8618193D0 (en) 1986-07-25 1986-07-25 Speech recognition apparatus

Publications (2)

Publication Number Publication Date
EP0254409A1 EP0254409A1 (de) 1988-01-27
EP0254409B1 true EP0254409B1 (de) 1991-10-30

Family

ID=10601684

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19870305087 Expired - Lifetime EP0254409B1 (de) 1986-07-25 1987-06-09 Vorrichtung und Verfahren zur Spracherkennung

Country Status (3)

Country Link
EP (1) EP0254409B1 (de)
DE (1) DE3774200D1 (de)
GB (2) GB8618193D0 (de)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014173325A1 (zh) * 2013-04-27 2014-10-30 华为技术有限公司 喉音识别方法及装置
US9626001B2 (en) 2014-11-13 2017-04-18 International Business Machines Corporation Speech recognition candidate selection based on non-acoustic input
US9632589B2 (en) 2014-11-13 2017-04-25 International Business Machines Corporation Speech recognition candidate selection based on non-acoustic input
CN107369449A (zh) * 2017-07-14 2017-11-21 上海木爷机器人技术有限公司 一种有效语音识别方法及装置

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0336032A1 (de) * 1988-04-07 1989-10-11 Research Triangle Institute Akustische und optische Spracherkennung
DE4212907A1 (de) * 1992-04-05 1993-10-07 Drescher Ruediger Spracherkennungsverfahren für Datenverarbeitungssysteme u.s.w.
US6471420B1 (en) 1994-05-13 2002-10-29 Matsushita Electric Industrial Co., Ltd. Voice selection apparatus voice response apparatus, and game apparatus using word tables from which selected words are output as voice selections
NL9400888A (nl) * 1994-05-31 1996-01-02 Meijer Johannes Leonardus Jozef Drs Werkwijze voor het vergroten van de verstaanbaarheid van het gesproken woord en inrichting daarvoor.
US5905972A (en) * 1996-09-30 1999-05-18 Microsoft Corporation Prosodic databases holding fundamental frequency templates for use in speech synthesis
US7081915B1 (en) * 1998-06-17 2006-07-25 Intel Corporation Control of video conferencing using activity detection
US10741182B2 (en) 2014-02-18 2020-08-11 Lenovo (Singapore) Pte. Ltd. Voice input correction using non-audio based input
US9881610B2 (en) 2014-11-13 2018-01-30 International Business Machines Corporation Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities
CN114241594A (zh) * 2020-07-31 2022-03-25 南宁富联富桂精密工业有限公司 唇语识别方法及电子装置


Also Published As

Publication number Publication date
GB8618193D0 (en) 1986-11-26
DE3774200D1 (de) 1991-12-05
GB2193024A (en) 1988-01-27
GB8713682D0 (en) 1987-07-15
GB2193024B (en) 1990-05-30
EP0254409A1 (de) 1988-01-27


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE ES FR IT NL

17P Request for examination filed

Effective date: 19880208

17Q First examination report despatched

Effective date: 19901022

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE ES FR IT NL

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT (WARNING: LAPSES OF ITALIAN PATENTS WITH EFFECTIVE DATE BEFORE 2007 MAY HAVE OCCURRED AT ANY TIME BEFORE 2007. THE CORRECT EFFECTIVE DATE MAY BE DIFFERENT FROM THE ONE RECORDED.)

Effective date: 19911030

Ref country code: NL

Effective date: 19911030

REF Corresponds to:

Ref document number: 3774200

Country of ref document: DE

Date of ref document: 19911205

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 19920210

ET Fr: translation filed
NLV1 Nl: lapsed or annulled due to failure to fulfill the requirements of art. 29p and 29m of the patents act
PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed
PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 19930629

Year of fee payment: 7

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 19930803

Year of fee payment: 7

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FR

Effective date: 19950228

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DE

Effective date: 19950301

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST