AU613904B2 - Audio visual speech recognition - Google Patents

Audio visual speech recognition

Info

Publication number
AU613904B2
AU613904B2 AU13360/88A AU1336088A AU613904B2 AU 613904 B2 AU613904 B2 AU 613904B2 AU 13360/88 A AU13360/88 A AU 13360/88A AU 1336088 A AU1336088 A AU 1336088A AU 613904 B2 AU613904 B2 AU 613904B2
Authority
AU
Australia
Prior art keywords
phonemes
phoneme
output
signal
indicating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU13360/88A
Other versions
AU1336088A (en)
Inventor
Robert L. Beadles
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Triangle Institute
Original Assignee
Research Triangle Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Triangle Institute
Publication of AU1336088A
Application granted
Publication of AU613904B2
Anticipated expiration
Ceased

Landscapes

  • Machine Translation (AREA)

Description

COMMONWEALTH OF AUSTRALIA
PATENTS ACT 1952
COMPLETE SPECIFICATION (ORIGINAL)
FOR OFFICE USE
Class: Int. Class: Application Number: Lodged: Complete Specification Lodged: Accepted: Published: Priority: Related Art:
NAME OF APPLICANT: RESEARCH TRIANGLE INSTITUTE
ADDRESS OF APPLICANT: 1 Cornwallis Drive, Research Triangle Park, North Carolina 27709, United States of America.
NAME(S) OF INVENTOR(S): Robert L. BEADLES
ADDRESS FOR SERVICE: DAVIES COLLISON, Patent Attorneys, 1 Little Collins Street, Melbourne, 3000.
COMPLETE SPECIFICATION FOR THE INVENTION ENTITLED: "AUDIO VISUAL SPEECH RECOGNITION"
The following statement is a full description of this invention, including the best method of performing it known to us:
The invention relates to a method and apparatus for producing an output indicating at least some spoken phonemes.
The primary method by which men communicate is speech. Communication between people by speech and hearing has many advantages over written communication. A person can speak at least ten times as fast as he can write, and at least four times as fast as a skilled typist can work. Because of the many advantages and myriad uses of speech, the capability to recognize speech by an apparatus has long been recognized as an extremely desirable technological goal. For example, a reasonable-cost, limited-vocabulary speech recognizer could replace the now-existing inputs as the interface between man and the digital computer. Such an apparatus would revolutionize modern office practices by providing typewritten copy from voice input. Many military applications exist in command, control, intelligence and in electronic communication where such an apparatus would prove invaluable.
Another great need for such an apparatus is in assisting communications between hearing impaired or deaf people and hearing people. The difficulties in such communication have long handicapped deaf people in full integration into their community and in achieving the same levels of education, employment, and social advancement which they would otherwise achieve. The use of hand signals, although slower than spoken speech, can be used between those hearing impaired persons who are sufficiently motivated to learn the signs, but is impractical as a mode of communication with the general public. By observation of the movements of the lips of the speaking person, a hearing impaired or deaf person can discern that each sound is one of a limited number of the possible speech sounds called phonemes. Unfortunately, the ambiguities in lipreading for a totally deaf person are too great for effective understanding by most people using only lipreading.
Previous attempts to recognize phonemes by analysis of speech sounds have not been successful in producing sufficiently accurate indices to be an effective aid to the deaf in communication. The best computer speech recognition to date has required a modest recognized vocabulary, a speaker especially trained in the pronunciation of phonemes, and training of the system to the idiosyncrasies of each new speaker's voice. Even when adapting for individual speakers, women's and children's voices are typically recognized less well. Furthermore, recognition of each word, except for a limited vocabulary, has required many times as long as the speaking of the word, precluding real-time speech recognition. These limitations have made such speech recognition devices unsatisfactory as general purpose devices and of extremely limited use to the deaf community.
While more and more sophisticated techniques have been developed for analyzing and determining the identity of a specific phoneme, such techniques have not been successful in telling apart considerable numbers of phonemes which in fact sound very similar. Resolution of the identity of these phonemes by a hearer is often done on the basis of visual confirmation, context and familiarity with the speaker, operations which are difficult at best in previous machine implemented systems. In fact, visual information in some circumstances is given greater weight by the human brain than acoustic information in normal speech perception.
Fortunately, those ambiguities which are very difficult to differentiate from the sounds of the phonemes can often be differentiated by the appearance of the lips and palate. This has been recognised in the technique called manual cuing, in which communication with the deaf is expedited utilizing hand cues to remove sufficient ambiguities to make lipreading practical.
In accordance with the present invention there is provided an apparatus for producing an output indicating at least some of a sequence of spoken phonemes from a human speaker comprising:
means for detecting sounds and converting said sounds into an electrical signal;
means for analyzing said signal to detect said phonemes to produce an electrical acoustic output signal indicating for each of at least some of said detected phonemes one group of a plurality of phoneme groups including the detected phoneme, each of said phoneme groups including at least one phoneme;
means for optically scanning the face of said speaker and producing an electrical lipshape signal representing the visual manifestation for at least some of said spoken phonemes indicating one of a plurality of lipshapes, each lipshape being associated with at least one phoneme; and
means for receiving and correlating said lipshape signal and said acoustic output signal to produce said output.
The present invention also provides a method of producing an output indicating at least some of a sequence of spoken phonemes from a human speaker comprising the steps of:
detecting sounds and converting said sounds into an electrical signal;
analyzing said signal to detect said phonemes to produce an electrical acoustic output signal indicating for each of at least some of said detected phonemes one group of a plurality of phoneme groups including the detected phoneme, each of said phoneme groups including at least one phoneme;
optically scanning the face of said speaker and producing an electrical lipshape signal representing the visual manifestation for at least some of said spoken phonemes indicating one of a plurality of lipshapes, each lipshape being associated with at least one phoneme; and
correlating said lipshape signal and said acoustic output signal to produce said output.
In the present invention, a sufficient number of phonemes in a sequence of spoken phonemes can be recognized to provide effective and practical communication. This is achieved by combining analysis of the phoneme sounds, which determines to which of a number of groups of phonemes each of the sounds belongs, with optical scanning, which determines which of a number of lip shapes is being made by the speaker in association with production of that phoneme.
Correlating signals produced by the sounds and the optical scanning will produce sufficient indications of the spoken sounds to be practical. The number of recognized phonemes will depend upon the sophistication of the optical and sound analysis, the precision of expression of the speaker, and how much guessing and how many wrong phonemes are appropriate for any given application. The present invention is particularly useful in that it is amenable to many different applications and can be carried out at different levels of sophistication depending on desired accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
A preferred embodiment of the present invention is hereinafter described, by way of example only, with reference to the accompanying drawings, wherein:
Figure 1 shows a chart of phonemes separated into lip shape and sound groups; and
Figure 2 shows a block diagram of a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF THE DRAWINGS
Reference is now made to Figure 1, which shows a chart of one possible selection of sound groups and lip-shape groups suitable for use in the present invention. It will be understood that while the phonemes shown in Figure 1 are those of the English language, similar groupings can be made for other human languages as well. In Figure 1, the phoneme sounds are separated into five groups: vowels, nasals, glides and semi-vowels, stops, and fricatives. These are well known groupings of phonemes. The voiced sounds are below the slash line with the unvoiced sounds shown above. The lip shapes are classified into three shapes which can be readily differentiated by analysis of signals produced by optical scanning. The classes are of a flattened, rounded and open lip shape.
These lip shapes can be readily differentiated using conventional optical scanning techniques, but additional lip-shapes or facial or palate positions can be utilized.
Some of the groups of sounds can be completely defined by reference to the lip shape, for example, the phoneme while other sounds can only be resolved to an ambiguity of several sounds. Resolution of these choices can sometimes be made by comparison of preceding and succeeding phonemes, or symbols can be visually or otherwise displayed to permit a viewer to resolve ambiguities in the same way that ambiguities are resolved by the mind in analysing speech sounds normally heard.
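The way a sound group and a lip shape jointly narrow the choice of phoneme can be sketched as a set intersection. The following is a minimal Python illustration, not the patented implementation: the group memberships are hypothetical placeholders (the actual assignments are those of Figure 1, which is not reproduced in the text), and only three of the five sound groups are shown.

```python
# Hypothetical memberships for illustration only; the real assignments
# are given in Figure 1 of the specification.
SOUND_GROUPS = {
    "nasal": {"m", "n", "ng"},
    "stop": {"p", "b", "t", "d", "k", "g"},
    "fricative": {"f", "v", "s", "z", "sh"},
}
LIP_SHAPES = {
    "flattened": {"m", "p", "b", "f", "v"},
    "rounded": {"sh", "w"},
    "open": {"n", "ng", "t", "d", "k", "g", "s", "z"},
}


def candidate_phonemes(sound_group, lip_shape):
    """Return the phonemes consistent with both the acoustic and visual evidence."""
    return SOUND_GROUPS[sound_group] & LIP_SHAPES[lip_shape]


# With these placeholder sets, one combination resolves to a single phoneme
# while another leaves several candidates for later disambiguation.
print(sorted(candidate_phonemes("nasal", "flattened")))
print(sorted(candidate_phonemes("stop", "open")))
```

When the intersection holds more than one phoneme, the remaining ambiguity would have to be resolved from neighbouring phonemes or by the viewer, as the paragraph above describes.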
Reference is now made to Figure 2, which illustrates a block diagram of the present invention. As noted, an audio preprocessor 20 detects the sounds produced by a human speaker and those sounds are converted into an electrical signal by a conventional microphone or similar device. The electrical signal thus produced is applied to a spectrum shaping amplification and automatic level-control circuit 24.
The output of circuit 24 is applied to both low pass filter 26 and high pass filter 28. The outputs of the filters are applied to zero crossing counters 30 and 32 and peak to peak detectors 34 and 36. The output of the low pass filter in addition is applied to a circuit 38 for detecting the difference between voiced and unvoiced sounds. These circuits are known in the art, and are discussed further in the specification of US Patent 4,972,486, which is hereby incorporated by reference.
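The band-splitting measurements described above can be sketched digitally. The Python below is a hedged illustration only: the specification describes circuits, not software, and the one-pole filter, the `alpha` coefficient, the 16 kHz sample rate and all function names are assumptions introduced here.

```python
import math


def low_pass(samples, alpha):
    """One-pole low pass filter; alpha in (0, 1], smaller means a lower cutoff."""
    out, state = [], 0.0
    for x in samples:
        state += alpha * (x - state)
        out.append(state)
    return out


def high_pass(samples, alpha):
    """Complementary high pass: the input minus its low-passed version."""
    return [x - y for x, y in zip(samples, low_pass(samples, alpha))]


def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))


def peak_to_peak(samples):
    """Difference between the largest and smallest sample in the frame."""
    return max(samples) - min(samples)


def band_features(frame, alpha=0.05):
    """Return (zero crossings, peak-to-peak) for the low and high bands."""
    bands = {"low": low_pass(frame, alpha), "high": high_pass(frame, alpha)}
    return {name: (zero_crossings(b), peak_to_peak(b)) for name, b in bands.items()}


# Synthetic 50 ms frame at an assumed 16 kHz rate: a 200 Hz tone plus a weaker 3 kHz tone.
RATE = 16000
frame = [math.sin(2 * math.pi * 200 * n / RATE) + 0.3 * math.sin(2 * math.pi * 3000 * n / RATE)
         for n in range(800)]
print(band_features(frame))
```

The zero-crossing count is a rough indicator of the dominant frequency in each band, and the peak-to-peak value indicates its energy; together they give four numbers per frame from which a sound group might be estimated.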
Other methods of acoustic analysis such as linear prediction and short time spectral analysis can alternatively be employed in either analog, digital or combination forms.
Visual preprocessor 40 includes a conventional optical scanner 42, for example, a television camera, which produces a sequence of electrical signals indicating at a plurality of discrete locations the intensity of light received.
Selective optical filtering before input in scanner 42 enhances the contrast of various mouth features with respect to other features. Light level is detected and compensation therefor carried out. Scanner 42 is positioned to view the face of the speaker, particularly the lips, and can be in the form of a portable or a permanent installation. The electrical output of the scanner in the nature of a sequence of digital signals or the like is applied to a speaker normalization circuit 44 which in effect magnifies or reduces the size of the image to a standard. One normalization technique is to store in an analog or digital memory a standard face template and compare the stored template with the scanner image. An electrically controlled zoom lens is then operated to normalize the scanner-to-speaker distance.
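The normalization step amounts to finding the magnification that brings the scanned face to the template size. A minimal Python sketch follows, assuming that a single face-width measurement in pixels is available for both the template and the scanned image; the function names and the choice of width as the reference dimension are illustrative assumptions, not details from the specification.

```python
def zoom_factor(template_width_px, measured_width_px):
    """Magnification that scales the scanned face to the stored template size."""
    if measured_width_px <= 0:
        raise ValueError("measured width must be positive")
    return template_width_px / measured_width_px


def normalize_lengths(lengths_px, factor):
    """Apply the magnification to measurements taken from the scanned image."""
    return [length * factor for length in lengths_px]


# A face that appears half as wide as the template needs a 2x zoom.
factor = zoom_factor(template_width_px=200, measured_width_px=100)
print(factor, normalize_lengths([30, 12], factor))
```

In the described apparatus the same ratio would drive the zoom lens rather than rescale measurements in software.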
The standard scan image is next analyzed by circuit 46 to determine the size of the open mouth, for example, by determining lip length and contour and then integrating. The length and contour of the lips are determined by circuit 48. Standard techniques for optical image line enhancement, such as differentiation of the optical image, can be used to facilitate extraction of both lip contour and mouth area. The tongue and teeth positions are also detected by tongue/teeth detector 49, for example to determine if the teeth and tongue are visible. The teeth can be detected by their characteristic shape and reflectivity relative to the lips and tongue. It will be recognized by one skilled in the art that the functions performed by circuits 46, 48 and 49 can be performed by analog or digital techniques or appropriate combinations thereof.
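Determining the open-mouth size "by determining lip length and contour and then integrating" can be illustrated numerically. The Python sketch below assumes the upper and lower lip contours have already been extracted as heights sampled at the same horizontal positions, with the vertical axis increasing upward; the trapezoidal rule and all names are choices made for this illustration.

```python
def mouth_area(xs, upper, lower):
    """Integrate the gap between upper and lower lip contours (trapezoidal rule)."""
    if not (len(xs) == len(upper) == len(lower)) or len(xs) < 2:
        raise ValueError("need at least two matching samples per contour")
    area = 0.0
    for i in range(len(xs) - 1):
        gap_left = upper[i] - lower[i]
        gap_right = upper[i + 1] - lower[i + 1]
        area += 0.5 * (gap_left + gap_right) * (xs[i + 1] - xs[i])
    return area


def lip_length(xs):
    """Horizontal extent of the lips, from one mouth corner to the other."""
    return xs[-1] - xs[0]


# A diamond-shaped opening: 4 units wide and 4 units tall at the centre.
xs = [0, 1, 2, 3, 4]
upper = [0, 1, 2, 1, 0]
lower = [0, -1, -2, -1, 0]
print(lip_length(xs), mouth_area(xs, upper, lower))  # 4 8.0
```

Lip length and mouth area computed this way are the kind of features from which a flattened, rounded or open lip shape might be distinguished.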
The output signals from preprocessors 20 and 40 are applied to multiplexer 50 and from there applied to a digital computer 52 directly for digital outputs and via an analog-to-digital converter 54 for analog outputs. Computer 52 carries out time-aligned correlation between the audio and visual signals and produces an output, for example in visual or typewritten form, indicating at least some of the individual phonemes being spoken.
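One simple way to time-align the two streams is to pair each acoustic event with the visual frame nearest to it in time. The Python sketch below is an assumption about how such alignment could be done, since the specification does not state the method used by computer 52; the tuple layout and function name are hypothetical.

```python
from bisect import bisect_left


def align(acoustic, visual):
    """Pair each acoustic event with the visual frame nearest in time.

    acoustic: list of (time_s, sound_group)
    visual:   list of (time_s, lip_shape), sorted by time
    """
    if not visual:
        raise ValueError("no visual frames to align against")
    times = [t for t, _ in visual]
    pairs = []
    for t, group in acoustic:
        i = bisect_left(times, t)
        # Only the frames just before and just after t can be the nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        nearest = min(candidates, key=lambda j: abs(times[j] - t))
        pairs.append((t, group, visual[nearest][1]))
    return pairs


visual = [(0.0, "open"), (0.1, "flattened"), (0.2, "flattened"), (0.3, "open")]
acoustic = [(0.12, "nasal"), (0.29, "vowel")]
print(align(acoustic, visual))
```

Each resulting (time, sound group, lip shape) triple is then ready for the group and lip-shape correlation that selects the candidate phonemes.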
Many changes and modifications in the above-described embodiment of the invention can, of course, be carried out without departing from the scope thereof, that scope being intended to be limited only by the scope of the appended claims.

Claims (7)

1. An apparatus for producing an output indicating at least some of a sequence of spoken phonemes from a human speaker comprising:
means for detecting sounds and converting said sounds into an electrical signal;
means for analyzing said signal to detect said phonemes to produce an electrical acoustic output signal indicating for each of at least some of said detected phonemes one group of a plurality of phoneme groups including the detected phoneme, each of said phoneme groups including at least one phoneme;
means for optically scanning the face of said speaker and producing an electrical lipshape signal representing the visual manifestation for at least some of said spoken phonemes indicating one of a plurality of lipshapes, each lipshape being associated with at least one phoneme; and
means for receiving and correlating said lipshape signal and said acoustic output signal to produce said output.
2. An apparatus as in claim 1 wherein said receiving and correlating means includes a multiplexer for receiving signals from said scanning and analyzing means, an analog to digital converter connected to the output of said multiplexer and a digital computer connected to the output of said converter.
3. An apparatus as in claim 1 or 2 wherein said scanning means includes an optical scanner, means for normalizing the distance between said scanner and the speaker's lips, means for extracting the mouth area, means for extracting the lip contour and means for detecting teeth and tongue positions.
4. An apparatus as in claim 1 or 2 wherein said analyzing means includes a low pass filter, means for analyzing the output of said low pass filter, a high pass filter and means for analyzing the output of said high pass filter.
5. A method of producing an output indicating at least some of a sequence of spoken phonemes from a human speaker comprising the steps of:
detecting sounds and converting said sounds into an electrical signal;
analyzing said signal to detect said phonemes to produce an electrical acoustic output signal indicating for each of at least some of said detected phonemes one group of a plurality of phoneme groups including the detected phoneme, each of said phoneme groups including at least one phoneme;
optically scanning the face of said speaker and producing an electrical lipshape signal representing the visual manifestation for at least some of said spoken phonemes indicating one of a plurality of lipshapes, each lipshape being associated with at least one phoneme; and
correlating said lipshape signal and said acoustic output signal to produce said output.
6. An apparatus for producing an output indicating at least some of a sequence of spoken phonemes substantially as hereinbefore described with reference to the drawings.
7. A method of producing an output indicating at least some of a sequence of spoken phonemes substantially as hereinbefore described with reference to the drawings.
Dated this 22nd day of March 1988
RESEARCH TRIANGLE INSTITUTE
By its Patent Attorneys
DAVIES COLLISON
AU13360/88A 1985-11-05 1988-03-22 Audio visual speech recognition Ceased AU613904B2 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US79460285A 1985-11-05 1985-11-05

Publications (2)

Publication Number Publication Date
AU1336088A AU1336088A (en) 1989-09-28
AU613904B2 AU613904B2 (en) 1991-08-15

Family

ID=25163117

Family Applications (1)

Application Number Title Priority Date Filing Date
AU13360/88A Ceased AU613904B2 (en) 1985-11-05 1988-03-22 Audio visual speech recognition

Country Status (1)

Country Link
AU (1) AU613904B2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3805238A (en) * 1971-11-04 1974-04-16 R Rothfjell Method for identifying individuals using selected characteristic body curves
US3896266A (en) * 1971-08-09 1975-07-22 Nelson J Waterbury Credit and other security cards and card utilization systems therefore
US3919479A (en) * 1972-09-21 1975-11-11 First National Bank Of Boston Broadcast signal identification system

Also Published As

Publication number Publication date
AU1336088A (en) 1989-09-28

Similar Documents

Publication Publication Date Title
US4757541A (en) Audio visual speech recognition
US4284846A (en) System and method for sound recognition
US4181813A (en) System and method for speech recognition
US5220639A (en) Mandarin speech input method for Chinese computers and a mandarin speech recognition machine
US4783807A (en) System and method for sound recognition with feature selection synchronized to voice pitch
Krull Acoustic properties as predictors of perceptual responses: A study of Swedish voiced stops
US5806036A (en) Speechreading using facial feature parameters from a non-direct frontal view of the speaker
RU2466468C1 (en) System and method of speech recognition
Uchanski et al. Automatic speech recognition to aid the hearing impaired: prospects for the automatic generation of cued speech.
WO1996003741A1 (en) System and method for facilitating speech transcription
US4707857A (en) Voice command recognition system having compact significant feature data
Jeyalakshmi et al. Efficient speech recognition system for hearing impaired children in classical Tamil language
EP0336032A1 (en) Audio visual speech recognition
AU613904B2 (en) Audio visual speech recognition
Broad Formants in automatic speech recognition
Pols Flexible, robust, and efficient human speech processing versus present-day speech technology
JPH03273280A (en) Voice synthesizing system for vocal exercise
Clapper Automatic word recognition
JPH01259414A (en) Audio-visual speech recognition equipment
RU2119196C1 (en) Method and system for lexical interpretation of fused speech
Pols Flexible, robust, and efficient human speech recognition
Pavel et al. Temporal masking in automatic speech recognition
Warren Perceptual bases for the evolution of speech
De Oliveira et al. Speech aid for the deaf based on a representation of the vocal tract: the vowel module
EP0245252A1 (en) System and method for sound recognition with feature selection synchronized to voice pitch