JP2003522978A - Method and apparatus for converting sign language into speech

Info

Publication number: JP2003522978A
Application number: JP2001558982A
Authority: JP
Inventors: Gandhimathi Vaithilingam (ヴァイティリンガム，ガンディマティ)
Original assignee: Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2000-02-10
Filing date: 2001-01-17
Publication date: 2003-07-29
Also published as: WO2001059741A1; EP1181679A1

Abstract

(57) [Abstract] A portable device converts gesture-based input from a signer into audible speech in real time. The device employs a portable main processor, for example one of the portable computers in common use today. It uses a data glove for input and a speaker for output. Dynamic and static gestures are classified by a continuous hidden Markov model (CHMM), which permits robust, fast, real-time classification of both kinds of gesture. A natural language processor converts the gesture classes into a grammatically correct sequence of words. A speech synthesizer converts the word sequence into audible speech.

Description

Detailed Description of the Invention

[0001] The present invention relates to sign language translators, and more particularly to a translator that uses a portable computer to convert sign language directly into spoken language.

[0002] Data gloves have been used to classify sign language. In one prior-art approach, static signs are translated into letters or words, and gestures (movement) are ignored. A discrete hidden Markov model with data-glove input permits interactive learning and has been used successfully to train a series of gestures. This technique is disclosed in "On-line, interactive learning of gestures for human/robot interfaces," Christopher Lee and Yangsheng Xu, IEEE International Conference on Robotics and Automation (1996), vol. 4, pp. 2982-2987.

[0003] Neural networks trained specifically by the user have been shown to recognize a small group of letters presented in dynamic sign language. This technique is disclosed in "A multi-stage approach to fingerspelling and gesture recognition," R. Erenshteyn and P. Laskov, Proc. Workshop on the Integration of Gesture in Language and Speech (Wilmington, DE, 1996).

[0004] Another prior-art system tracks gestures continuously using colored gloves and camera-based image processing. This system does not accommodate sign language, and it encumbers the user's movement with a video input system, the need to wear special colored gloves, and the need to stay within view of one or more cameras. This technique is disclosed in "Visual recognition of American Sign Language using hidden Markov models," Thad Starner, Master's thesis, The Media Laboratory, MIT, 1995. It has also been proposed to use a data glove with neural networks to map hand gestures to text. This technique is disclosed in "Glove-Talk II: Mapping hand gesture to speech using neural networks - an approach to building adaptive interfaces," Sidney Fels, PhD thesis, University of Toronto, 1994. Real-time processing with neural networks demands a great deal of processing power. At present, the prior art offers no system that is ergonomically suited to portable use, converts continuous gestures to speech in real time, and operates conveniently enough for a signer to communicate with non-signers.

[0005] A portable device converts gesture-based input from a signer into audible speech in real time. The device employs a portable main processor, for example one of the portable computers in common use today. It uses a data glove for input and a speaker for output. Dynamic and static gestures are classified by a continuous hidden Markov model (CHMM), which permits robust, fast, real-time classification of both kinds of gesture. A natural language processor converts the gesture classes into a grammatically correct sequence of words. A speech synthesizer converts the word sequence into audible speech.

[0006] The invention gains in both portability and usability by using HMMs to classify gestures. Such models are forgiving and undemanding of computing power: they can handle variation in the form of the input and still produce a proper classification, and they use substantially fewer computing resources than, for example, neural networks. Using a data glove for input and a speaker for output makes the device highly portable.

[0007] Moreover, the use of a data glove allows a port of relatively low bandwidth to be used. The same applies on the output side: the speech engine receives text or other symbolic output passed through a port, and the speech can be synthesized by an inexpensive external processor system. Alternatively, the processing unit may already have a sound card capable of speech synthesis, as many personal digital assistants (PDAs) do. The system can thus be formed as a retrofit for existing and future PDA units, given appropriate software.

[0008] The invention will be described in connection with certain preferred embodiments, with reference to the following schematic drawing, so that it may be more fully understood. With reference to the drawing, it is stressed that the particulars shown are by way of example, serve only to illustrate the preferred embodiments of the invention, and are presented to provide what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of it; the description taken with the drawing makes apparent to those skilled in the art how the several forms of the invention may be embodied in practice.

[0009] Refer to FIG. 1. A data glove 130 and a position sensor 110 apply hand position and configuration data to a gesture recognition processor 120. The gesture recognition processor 120 classifies hand gestures into discrete symbols identifiable with words and generates, in real time, an output indicating the classified words. Where the classification yields a low confidence index, that information may also be output. The classification information is then applied to a natural language processor 140, which converts the words into fully grammatical sentences and phrases. The sentences and phrases may be output as text or in another, more compact symbolic form. The output of the natural language processor 140 is applied to a speech synthesizer 150, which generates an audio signal that can be output to a speaker 195. Alternatively, the audio signal may be provided at a port 160 connectable to, for example, headphones (not shown), allowing private use or use in noisy environments. This may be particularly useful when the signer is skilled at lip reading, because the conversation can then be kept entirely private from those who cannot lip-read.
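The data flow of FIG. 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the stage functions, the `Classified` record, the 0-to-1 confidence scale, the `low_confidence` threshold, and the toy lookup table are all assumptions introduced here, since the description does not specify any of them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Classified:
    word: str          # word associated with the recognized gesture class
    confidence: float  # classifier confidence in [0, 1] (assumed scale)


def run_pipeline(
    frames: Iterable[str],
    classify: Callable[[str], Optional[Classified]],
    to_sentence: Callable[[List[str]], str],
    synthesize: Callable[[str], str],
    low_confidence: float = 0.5,
) -> Tuple[str, List[str]]:
    """Glove frames -> words -> sentence -> audio, plus low-confidence words."""
    words: List[str] = []
    flagged: List[str] = []
    for frame in frames:
        result = classify(frame)          # stands in for processor 120
        if result is None:                # frame not recognized as a sign
            continue
        words.append(result.word)
        if result.confidence < low_confidence:
            flagged.append(result.word)   # low-confidence information is also output
    sentence = to_sentence(words)         # stands in for processor 140
    return synthesize(sentence), flagged  # stands in for synthesizer 150


# Toy stand-ins so the sketch runs; none of these come from the patent.
TOY_CLASSES = {
    "frame-a": Classified("I", 0.9),
    "frame-b": Classified("go", 0.4),
    "frame-c": Classified("home", 0.8),
}


def toy_classify(frame: str) -> Optional[Classified]:
    return TOY_CLASSES.get(frame)


def toy_to_sentence(words: List[str]) -> str:
    return " ".join(words) + "."


def toy_synthesize(sentence: str) -> str:
    return f"<audio: {sentence}>"


audio, flagged = run_pipeline(
    ["frame-a", "frame-b", "noise", "frame-c"],
    toy_classify, toy_to_sentence, toy_synthesize,
)
```

Passing the stages in as functions mirrors the modular arrangement in the description, where any stage (for example the speech synthesizer) could be an off-the-shelf component.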

[0010] The data glove 130 and position sensor 110 may be any electromechanical devices capable of generating signals in response to sign-language gestures. For example, inertial sensors whose signals are directly integrated can provide velocity and position information for various parts of the hand, such as the wrist and some or all of the fingers. Alternatively, data gloves currently on the market and used for control applications may be used. The kinds of input needed to build a practical device for this application will become clearer as research in the field continues. The various prototypes mentioned above have already demonstrated that hand configuration, position, and velocity information can be reduced to a manageable data space (a reasonable number of independent inputs) and that these inputs can be applied to various kinds of recognition processors to classify sign-language gestures. Combinations of inertial and encoder-based techniques may of course also be used.
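As a rough illustration of how an integrated inertial signal yields velocity and position, the sketch below applies trapezoidal integration twice to a sampled acceleration trace. The sampling interval, the single-axis signal, and the `integrate` helper are assumptions made for this example; the description does not say how the integration is carried out, and a real sensor would also need drift and bias handling that is omitted here.

```python
from typing import List, Sequence


def integrate(samples: Sequence[float], dt: float, initial: float = 0.0) -> List[float]:
    """Cumulative trapezoidal integral of evenly spaced samples."""
    out = [initial]
    for prev, cur in zip(samples, samples[1:]):
        out.append(out[-1] + 0.5 * (prev + cur) * dt)
    return out


# Hypothetical single-axis acceleration samples, one per second.
acceleration = [0.0, 1.0, 1.0, 0.0]
velocity = integrate(acceleration, dt=1.0)  # first integration: velocity
position = integrate(velocity, dt=1.0)      # second integration: position
```

Each output list has the same length as its input, so velocity and position stay aligned sample for sample with the raw signal and can be fed together to a recognition processor.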

[0011] The gesture recognition processor 120 may be based on any of a variety of techniques capable of classifying the gesture inputs described above. Given the current state of the art in software and handsets, the continuous hidden Markov model (CHMM) is the preferred approach. A further advantage of CHMM classification is that it tends to tolerate variation in input values and in their relative values. Note that as processor speed, scale of integration, size, and the cost of computing hardware evolve, other classification techniques, such as classification based on neural networks, may also become suitable.
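One common way to classify with continuous HMMs is to keep one model per gesture class and pick the class whose model assigns the observation sequence the highest likelihood. The sketch below does this with the forward algorithm in log space and one-dimensional Gaussian emissions. It is a minimal sketch under stated assumptions: the patent does not give the model topology, the features, or the parameters, so the single scalar feature (say, one finger-bend value per frame), the two toy models `hold` and `rise`, and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class GaussianHMM:
    start: List[float]        # initial state probabilities
    trans: List[List[float]]  # trans[i][j] = P(next state j | state i)
    means: List[float]        # per-state emission mean
    stds: List[float]         # per-state emission standard deviation


def log_gauss(x: float, mean: float, std: float) -> float:
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def safe_log(p: float) -> float:
    return math.log(p) if p > 0 else -math.inf


def logsumexp(values: Sequence[float]) -> float:
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(model: GaussianHMM, obs: Sequence[float]) -> float:
    """Forward algorithm: log P(obs | model)."""
    n = len(model.start)
    alpha = [safe_log(model.start[i]) + log_gauss(obs[0], model.means[i], model.stds[i])
             for i in range(n)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[i] + safe_log(model.trans[i][j]) for i in range(n)])
                 + log_gauss(x, model.means[j], model.stds[j])
                 for j in range(n)]
    return logsumexp(alpha)


def classify(models: Dict[str, GaussianHMM], obs: Sequence[float]) -> str:
    """Return the name of the model that best explains the observations."""
    return max(models, key=lambda name: log_likelihood(models[name], obs))


# Hypothetical models: a static posture and a left-to-right dynamic gesture.
MODELS = {
    "hold": GaussianHMM(start=[1.0], trans=[[1.0]], means=[0.5], stds=[0.2]),
    "rise": GaussianHMM(start=[1.0, 0.0], trans=[[0.5, 0.5], [0.0, 1.0]],
                        means=[0.0, 1.0], stds=[0.2, 0.2]),
}
```

Because the emissions are Gaussian densities over real-valued inputs, small deviations from the trained values lower the likelihood gradually instead of causing a hard mismatch, which is consistent with the tolerance to input variation noted above. Static and dynamic gestures are handled by the same scoring procedure; only the state structure differs.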

[0012] The gesture recognition processor 120 outputs a class indicator for each recognized gesture. The stream of such indicators is applied to the natural language processor, which adds missing words to form grammatical sentences and phrases. Because sign language does not necessarily include every element of ordinary speech (obvious and basic grammatical components such as subjects and articles may be omitted), the natural language processor can insert these before the output is applied to the speech synthesizer 150. The natural language processor 140 identifies non-grammatical usages and corrects them. Such technology is well developed for word processors and can be applied straightforwardly in the present case. Note that the natural language processor 140 is not essential, since non-grammatical speech corresponding directly to the signs may still be understandable. Accordingly, in training the natural language processor, it may be best to make no correction when confidence in the corresponding change is low. That is, the natural language processor can be tuned to make a change only when the confidence measure for the contemplated change is high, because intelligible speech can be derived directly from the output of the gesture recognition processor.
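The confidence-gated behavior described for the natural language processor can be sketched as follows. The `Edit` record, the word-insertion-only edit model, and the `threshold` value are assumptions made for illustration; the description states only that a change should be made when the confidence in it is high and otherwise the classifier's words should pass through unchanged.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Edit:
    position: int      # index in the word list at which to insert
    word: str          # word proposed for insertion (e.g. a subject or article)
    confidence: float  # confidence in the proposed change, assumed in [0, 1]


def apply_confident_edits(
    words: Sequence[str], edits: Sequence[Edit], threshold: float = 0.8
) -> List[str]:
    """Insert only the proposed words whose confidence meets the threshold."""
    result = list(words)
    # Apply from right to left so earlier insert positions remain valid.
    for edit in sorted(edits, key=lambda e: e.position, reverse=True):
        if edit.confidence >= threshold:
            result.insert(edit.position, edit.word)
    return result


# Hypothetical classifier output and proposed grammatical insertions.
signed_words = ["go", "store"]
proposed = [Edit(0, "I", 0.95), Edit(1, "to the", 0.6)]
```

With the default threshold only the high-confidence subject is inserted; lowering the threshold lets the second insertion through as well. Leaving low-confidence positions untouched keeps the output no worse than the raw word sequence, which the description notes may already be understandable.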

[0013] The speech synthesizer 150 may be any word-to-audio converter, such as a text-to-speech converter. The speech is preferably output to a small speaker or other audio transducer. Note that text need not be an intermediate product in the present embodiment; using text, however, makes it easier to use off-the-shelf components such as existing text-to-speech converters.

[0014] It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that it may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein.

Brief Description of the Drawings

FIG. 1 is a diagram of a portable sign-language-to-speech converter according to one embodiment of the present invention.

Continuation of front page. F-terms (reference): 5D045 AB30; 5L096 CA27 DA05

Claims

1. A portable device for converting sign language to audible speech, comprising: an electromechanical transducer capable of converting at least one gesture of a human hand into a first data stream representing the gesture; a gesture classifier; and a speech synthesizer, wherein the classifier is connected to receive the first data stream and is able to generate, in response to the first data stream, a second data stream indicating a first word corresponding to a gesture that the gesture classifier recognizes as a sign-language gesture, and wherein the classifier is connected to apply the second data stream to the speech synthesizer so that an audio signal corresponding to the word is produced by the speech synthesizer.

2. The device of claim 1, wherein the electromechanical transducer is at least partially wearable on the human hand.

3. The device of claim 2, wherein the electromechanical transducer comprises a data glove.

4. The device of claim 1, further comprising a sound-emitting transducer connected to receive the audio signal so that the word is converted into audible speech.

5. The device of any one of claims 1 to 4, further comprising a natural language processor connected between the gesture classifier and the speech synthesizer, the natural language processor being a programmable processor programmed to insert at least one second word between a set of the first words to repair a non-grammatical structure of the sequence of the first words.

6. The device of claim 5, formed as a modular unit, for example a unit wearable by the user.

7. The device of claim 1, formed as a modular unit and further comprising: a natural language processor connected between the gesture classifier and the speech synthesizer; and a sound-emitting transducer connected to receive the audio signal so that the word is converted into audible speech, wherein the natural language processor is a programmable processor programmed to insert at least one second word between a set of the first words to repair a non-grammatical structure of the sequence of the first words, the electromechanical transducer is at least partially wearable on the human hand, and the sound-emitting transducer is sufficiently powered to be heard by a person standing at a normal speaking distance from a wearer of the modular unit.

8. The device of claim 1, formed as a modular unit and further comprising: a natural language processor connected between the gesture classifier and the speech synthesizer; and a port that receives the audio signal and is connectable to a sound-emitting transducer so that the word is converted into audible speech, wherein the natural language processor is a programmable processor programmed to insert at least one second word between a set of the first words to repair a non-grammatical structure of the sequence of the first words, and the electromechanical transducer is at least partially wearable on the human hand.

9. A method for converting sign language into audible speech, comprising: generating a first signal corresponding to a gesture of a human hand; classifying the first signal as a gesture and generating a second signal indicating a class; indicating a corresponding word according to the class; and outputting the word in a human-perceptible form, wherein the classifying step comprises applying a continuous hidden Markov model.

10. The method of claim 9, wherein the outputting step comprises applying the second signal to a speech synthesizer.