WO2001059741A1 - Sign language to speech converting method and apparatus - Google Patents
- Publication number
- WO2001059741A1 WO2001059741A1 PCT/EP2001/000478 EP0100478W WO0159741A1 WO 2001059741 A1 WO2001059741 A1 WO 2001059741A1 EP 0100478 W EP0100478 W EP 0100478W WO 0159741 A1 WO0159741 A1 WO 0159741A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- words
- speech
- gestures
- natural language
- speech synthesizer
- Prior art date
Classifications
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B21/00—Teaching, or communicating with, the blind, deaf or mute
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Definitions
- the invention relates to sign language translators and specifically to such translators that convert sign language directly to spoken words using a portable computer.
- Data gloves have been used for classification of sign language.
- static finger-spelling is translated into letters or words, while dynamic gestures (movement) are ignored.
- Discrete Hidden Markov models with data glove inputs allow interactive learning, which has been used successfully to train a series of gestures. This technology is described in "On-line, interactive learning of gestures for human/robot interfaces," Christopher Lee and Yangsheng Xu, IEEE Int'l. Conf. on Robotics and Automation, vol. 4, pp. 2982-2987, 1996.
- a neural network trained specifically by a user has been shown to be able to recognize small sets of letters signed by dynamic finger-spelling. This technology is described in "A multi-stage approach to fingerspelling and gesture recognition," R. Erenshteyn and P. Laskov, Proc. Workshop on the Integration of Gesture in Language and Speech, Wilmington, DE, 1996.
- Another prior art system tracks gestures continuously using colored gloves and camera-based image processing techniques.
- the system allows no fingerspelling and encumbers the user with a video input system and the requirement of wearing specially colored gloves as well as the need to remain in the field of view of one or more cameras.
- This technology is described in "Visual recognition of American Sign Language using Hidden Markov models," Thad Starner, Master's thesis, The Media Laboratory, MIT, 1995.
- Data gloves have been proposed for mapping hand gestures into text using neural networks.
- This technology is described in "Glove-Talk II: Mapping hand gestures to speech using neural networks - an approach to building adaptive interfaces," Sidney Fels, PhD thesis, Univ. Toronto, 1994.
- Real-time processing using neural networks requires tremendous processing power.
- a portable appliance converts gesture-based inputs from a signer to audible speech in real time.
- the device employs a portable main processor, for example, one of the portable computers now in common use.
- Dynamic and static gestures are classified by a Continuous Hidden Markov model (CHMM) which is capable of robust and rapid real time classification of both static and dynamic gestures.
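The classification step described above can be sketched in outline: each gesture class gets its own continuous (Gaussian-emission) HMM, and an incoming sequence of glove readings is assigned to the class whose model scores it highest. The following Python sketch is illustrative only — the single-state toy models, class names, and two-dimensional feature layout are assumptions, not taken from the patent — but the forward algorithm in log space is the standard scoring step such a classifier would use:

```python
import numpy as np

def log_gauss(x, means, variances):
    # Diagonal-Gaussian log-density of one observation x under each hidden state.
    return -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (x - means) ** 2 / variances, axis=-1)

def forward_loglik(obs, pi, A, means, variances):
    # Log-likelihood of an observation sequence under a Gaussian-emission HMM,
    # computed with the forward algorithm in log space (log-sum-exp for stability).
    alpha = np.log(pi) + log_gauss(obs[0], means, variances)
    for x in obs[1:]:
        m = alpha.max()
        alpha = m + np.log(np.exp(alpha - m) @ A) + log_gauss(x, means, variances)
    m = alpha.max()
    return m + np.log(np.sum(np.exp(alpha - m)))

def classify(obs, models):
    # Score the sequence under every gesture's model; return the best class.
    # models maps gesture name -> (pi, A, means, variances).
    return max(models, key=lambda name: forward_loglik(obs, *models[name]))
```

Because each gesture's model is scored independently, a low winning likelihood can also serve as the "low index of confidence" signal mentioned later in the description.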
- a natural language processor is used to transform the gesture classes into grammatically correct sequences of words.
- a speech synthesizer converts the word sequences into audible speech.
- the invention achieves gains in both portability and utility by its use of HMM to classify gestures.
- Such models are forgiving and relatively computationally undemanding. Thus, they can handle variation in the form of an input and still generate a proper classification. In addition, they are far more sparing in their use of computational resources than, say, neural networks.
- the use of a data-glove as an input and a speaker as an output offers a high degree of portability of the appliance. Additionally, the use of a data-glove allows a relatively small-bandwidth port to be used.
- alternatively, the output may be text or another symbolic form delivered through a port to a speech engine, with the speech synthesized by an inexpensive external processor system.
- the processing unit could already have a sound card with speech synthesis capability, as do many personal digital assistants (PDAs).
- Fig. 1 is an illustration of a portable sign language-to-speech converter according to an embodiment of the invention.
- data gloves 130 and position sensors 110 apply hand- position and configuration data to a gesture recognition processor 120.
- the gesture recognition processor 120 classifies hand gestures into discrete symbols identifiable with words and generates outputs in real time indicating the words classified. Where classifications produce a low index of confidence, this information may also be output.
- the classification information is applied in turn to a natural language processor 140 that converts the words into full grammatical sentences and phrases, which may be output as text or as some other more compact symbolic form.
- the output of the natural language processor 140 is applied to a speech synthesizer 150.
- the speech synthesizer 150 generates a sound signal that may be output to a speaker 195.
- the sound signal may be generated at a port 160 connectable to, for example, headphones (not shown), to allow private use or use in a noisy environment. This might be particularly useful where the signer is a good lip-reader, because conversations can remain completely private from non-lip-readers.
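The data flow of Fig. 1 — glove and position sensors feeding the gesture recognition processor 120, then the natural language processor 140, then the speech synthesizer 150 — can be sketched as a composition of three stages. The class name, stage signatures, and byte-valued audio stub below are illustrative assumptions, not part of the patent:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SignToSpeechPipeline:
    # Stage names mirror the blocks of Fig. 1; signatures are illustrative.
    recognize: Callable[[List[float]], str]   # one frame of glove data -> word class
    rephrase: Callable[[List[str]], str]      # word stream -> grammatical sentence
    synthesize: Callable[[str], bytes]        # sentence -> audio (stubbed as bytes)

    def run(self, frames: List[List[float]]) -> bytes:
        words = [self.recognize(f) for f in frames]  # gesture recognition 120
        sentence = self.rephrase(words)              # natural language proc. 140
        return self.synthesize(sentence)             # speech synthesizer 150
```

Keeping the stages behind plain callables reflects the description's point that each block (e.g. the synthesizer, or even the language processor) is replaceable or omissible without disturbing the rest of the chain.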
- the data glove 130 and position sensor 110 may be any electro-mechanical device effective to generate signals responsively to fingerspelling and sign language gestures.
- inertial sensors with direct and integrated signals may provide velocity and position information for various parts of the hand, such as the wrist, some or all fingertips, etc.
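As a toy illustration of the "direct and integrated signals" mentioned above, an accelerometer reading can be numerically integrated once to obtain velocity and again to obtain position. The sampling rate, duration, and units here are assumptions chosen for illustration:

```python
import numpy as np

def integrate(signal, dt):
    # Trapezoidal cumulative integration of a uniformly sampled signal.
    return np.concatenate(([0.0], np.cumsum((signal[1:] + signal[:-1]) * dt / 2.0)))

dt = 0.01                           # assumed 100 Hz sampling
accel = np.ones(101)                # constant 1 m/s^2 held for one second
velocity = integrate(accel, dt)     # ramps linearly from 0 to 1 m/s
position = integrate(velocity, dt)  # reaches 0.5 m after one second
```

In practice such integration drifts with sensor bias, which is one reason the description pairs inertial sensing with direct position sensors 110.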
- data-gloves currently on the market and used for control applications may be utilized.
- the types of inputs required to form a practical device for this application are becoming clearer as research continues in this area.
- various prototypes discussed above have proven that hand configuration, position, and velocity information can be distilled into a manageable dataspace (a reasonable number of independent inputs) and these inputs applied to various types of recognition processors to classify sign-language-type gestures.
- the gesture recognition processor 120 can be based on various different technologies effective to classify the gesture inputs.
- Present technology in software and hardware makes a Continuous Hidden Markov Model (CHMM) strategy the preferred approach.
- Another advantage of CHMM classification technology is that such classifiers tend to be tolerant of variation in the input values and relative values.
- as processor speed, integration scale, size, and cost of computing hardware evolve, other classification technologies may prove appropriate, for example, neural-network-based classifiers.
- the gesture recognition processor 120 outputs a class indicator for each recognized gesture.
- a stream of such indicators is applied to the natural language processor, which adds missing words to form grammatical sentences and phrases. Since sign language does not necessarily include all elements of normal speech - obvious and essential components of grammar, such as subjects and articles, may be omitted - the natural language processor may insert these before application to the speech synthesizer 150.
- the natural language processor 140 identifies ungrammatical usage and corrects it. Such techniques are well-developed for word processors and can be applied directly in the instant context. Note that the natural language processor 140 is not essential, since the ungrammatical speech corresponding to sign language may still be recognizable.
- it may be best, therefore, for the natural language processor to make no modifications where the confidence corresponding to a change is low. That is, the natural language processor may be tuned to make changes only when a confidence measure for the contemplated change is high, since comprehensible speech may be derived directly from the output of the gesture recognition processor.
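A minimal sketch of such a confidence gate, assuming a hypothetical rule table that maps a sign gloss lacking subject and auxiliary verb to a fuller English sentence (the rule, gloss, and threshold value are invented for illustration): an expansion is applied only when its confidence clears the threshold, and otherwise the gloss passes through unchanged.

```python
def expand_gloss(words, confidence, threshold=0.8):
    # Hypothetical rule table: gloss word sequence -> expanded sentence.
    rules = {("store", "go"): ["i", "am", "going", "to", "the", "store"]}
    key = tuple(w.lower() for w in words)
    if key in rules and confidence >= threshold:
        return rules[key]
    # Low confidence (or no matching rule): pass the gloss through unchanged,
    # since the ungrammatical word stream is usually still intelligible.
    return [w.lower() for w in words]
```

A production system would derive the confidence from the parser or classifier itself rather than take it as an argument; the point here is only the gating behavior.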
- the speech synthesizer 150 may be any word-to-audio conversion device, such as a text-to-speech converter. Preferably the speech is output to a small speaker or other audio transducer. Note that text need not be an intermediate product in the instant invention; however, it may facilitate the use of off-the-shelf devices such as existing text-to-speech converters.
Abstract
Description
Claims
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2001558982A JP2003522978A (en) | 2000-02-10 | 2001-01-17 | Method and apparatus for converting sign language into speech |
EP01900465A EP1181679A1 (en) | 2000-02-10 | 2001-01-17 | Sign language to speech converting method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US50189400A | 2000-02-10 | 2000-02-10 | |
US09/501,894 | 2000-02-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001059741A1 true WO2001059741A1 (en) | 2001-08-16 |
Family
ID=23995449
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2001/000478 WO2001059741A1 (en) | 2000-02-10 | 2001-01-17 | Sign language to speech converting method and apparatus |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1181679A1 (en) |
JP (1) | JP2003522978A (en) |
WO (1) | WO2001059741A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5114871B2 (en) * | 2006-05-31 | 2013-01-09 | 沖電気工業株式会社 | Video providing device |
CN108229318A (en) * | 2017-11-28 | 2018-06-29 | 北京市商汤科技开发有限公司 | The training method and device of gesture identification and gesture identification network, equipment, medium |
WO2023166557A1 (en) * | 2022-03-01 | 2023-09-07 | 日本電気株式会社 | Speech recognition system, speech recognition method, and recording medium |
- 2001
- 2001-01-17 JP JP2001558982A patent/JP2003522978A/en not_active Withdrawn
- 2001-01-17 WO PCT/EP2001/000478 patent/WO2001059741A1/en not_active Application Discontinuation
- 2001-01-17 EP EP01900465A patent/EP1181679A1/en not_active Withdrawn
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5047952A (en) * | 1988-10-14 | 1991-09-10 | The Board Of Trustees Of The Leland Stanford Junior University | Communication system for deaf, deaf-blind, or non-vocal individuals using instrumented glove |
EP0560587A2 (en) * | 1992-03-10 | 1993-09-15 | Hitachi, Ltd. | Sign language translation system and method |
US6141643A (en) * | 1998-11-25 | 2000-10-31 | Harmon; Steve | Data input glove having conductive finger pads and thumb pad, and uses therefor |
Non-Patent Citations (2)
Title |
---|
FELS S S ET AL: "GLOVE-TALK: A NEURAL NETWORK INTERFACE BETWEEN A DATA-GLOVE AND A SPEECH SYNTHESIZER", IEEE TRANSACTIONS ON NEURAL NETWORKS,US,IEEE INC, NEW YORK, vol. 4, no. 1, 1993, pages 2 - 8, XP000331412, ISSN: 1045-9227 * |
LEE C ET AL: "ONLINE, INTERACTIVE LEARNING OF GESTURES FOR HUMAN/ROBOT INTERFACES", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION,US,NEW YORK, IEEE, vol. CONF. 13, 22 April 1996 (1996-04-22), pages 2982 - 2987, XP000773139, ISBN: 0-7802-2989-8 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002047099A1 (en) * | 2000-12-09 | 2002-06-13 | Energy Storage Systems Pty Ltd | A connection between a conductive substrate and a laminate |
WO2004114107A1 (en) * | 2003-06-20 | 2004-12-29 | Nadeem Mohammad Qadir | Human-assistive wearable audio-visual inter-communication apparatus. |
EP2825938A1 (en) * | 2012-03-15 | 2015-01-21 | Ibrahim Farid Cherradi El Fadili | Extending the free fingers typing technology and introducing the finger taps language technology |
US10296085B2 (en) | 2014-03-05 | 2019-05-21 | Markantus Ag | Relatively simple and inexpensive finger operated control device including piezoelectric sensors for gesture input, and method thereof |
CN104064187A (en) * | 2014-07-09 | 2014-09-24 | 张江杰 | Sign language conversion voice system |
US10424224B2 (en) | 2014-08-20 | 2019-09-24 | Robert Bosch Gmbh | Glove for use in collecting data for sign language recognition |
US10334103B2 (en) | 2017-01-25 | 2019-06-25 | International Business Machines Corporation | Message translation for cognitive assistance |
US10902743B2 (en) | 2017-04-14 | 2021-01-26 | Arizona Board Of Regents On Behalf Of Arizona State University | Gesture recognition and communication |
CN111428802A (en) * | 2020-03-31 | 2020-07-17 | 上海市计量测试技术研究院 | Sign language translation method based on support vector machine |
CN111428802B (en) * | 2020-03-31 | 2023-02-07 | 上海市计量测试技术研究院 | Sign language translation method based on support vector machine |
Also Published As
Publication number | Publication date |
---|---|
JP2003522978A (en) | 2003-07-29 |
EP1181679A1 (en) | 2002-02-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Vijayalakshmi et al. | Sign language to speech conversion | |
Mehdi et al. | Sign language recognition using sensor gloves | |
KR101229034B1 (en) | Multimodal unification of articulation for device interfacing | |
JP6815899B2 (en) | Output statement generator, output statement generator and output statement generator | |
Yousaf et al. | A novel technique for speech recognition and visualization based mobile application to support two-way communication between deaf-mute and normal peoples | |
CN113748462A (en) | Determining input for a speech processing engine | |
KR20080023030A (en) | On-line speaker recognition method and apparatus for thereof | |
WO2001059741A1 (en) | Sign language to speech converting method and apparatus | |
Swee et al. | Wireless data gloves Malay sign language recognition system | |
Mian Qaisar | Isolated speech recognition and its transformation in visual signs | |
Priya et al. | Indian and english language to sign language translator-an automated portable two way communicator for bridging normal and deprived ones | |
Raut et al. | Hand sign interpreter | |
Swee et al. | Malay sign language gesture recognition system | |
Riad et al. | Signsworld; deeping into the silence world and hearing its signs (state of the art) | |
EP4131256A1 (en) | Voice recognition system and method using accelerometers for sensing bone conduction | |
Hatwar et al. | Home automation system based on gesture recognition system | |
Khambaty et al. | Cost effective portable system for sign language gesture recognition | |
Kou et al. | Design by talking with computers | |
Lin et al. | Acoustical implicit communication in human-robot interaction | |
Hernandez-Rebollar | Gesture-driven American sign language phraselator | |
Jayapriya et al. | Development of MEMS sensor-based double handed gesture-to-speech conversion system | |
Dhal | Controlling Devices Through Voice Based on AVR Microcontroller | |
US20230386491A1 (en) | Artificial intelligence device | |
Jian | Gesture recognition using windowed dynamic time warping | |
Huang et al. | Office presence detection using multimodal context information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): JP |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2001900465 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref country code: JP Ref document number: 2001 558982 Kind code of ref document: A Format of ref document f/p: F |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWP | Wipo information: published in national office |
Ref document number: 2001900465 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 2001900465 Country of ref document: EP |