JPH1173297A

JPH1173297A - Recognition method using timely relation of multi-modal expression with voice and gesture

Info

Publication number: JPH1173297A
Application number: JP23461197A
Authority: JP
Inventors: Shigeki Nagaya; 茂喜長屋; Kiyoshi Furukawa; 清古川; Ryuichi Oka; 隆一岡
Original assignee: GIJUTSU KENKYU KUMIAI SHINJOHO; GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; NIPPON TEKKO RENMEI; Hitachi Ltd
Current assignee: GIJUTSU KENKYU KUMIAI SHINJOHO; GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; NIPPON TEKKO RENMEI; Hitachi Ltd
Priority date: 1997-08-29
Filing date: 1997-08-29
Publication date: 1999-03-16

Abstract

PROBLEM TO BE SOLVED: To associate a voice that indicates the meaning of a gesture with the gesture performed together with that voice. SOLUTION: Stop positions T1 to T3 of the motion are detected from a moving image. The voice is subjected to word recognition, and the utterance start and end positions T11 and T12 of the word are detected. The stop positions T1 and T2 closest to the utterance start and end positions T11 and T12 are defined as the gesture start and end positions, and the voice and the gesture are associated with each other through the start and end positions T1, T2, T11, and T12.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a gesture recognition method that photographs a human gesture and identifies the content of the gesture from the captured images, and more particularly to a recognition method that uses the temporal relationship of multimodal expressions combining voice and gesture.

[0002]

[Description of the related art] To make modalities such as voice and gesture usable in the way they are in human-to-human communication, it is not enough to raise the recognition rate of a speech recognition device or a gesture recognition device for each modality separately. For example, as noted in a classification of gestures (Ekman, P., Friesen, W. V., "The repertoire of nonverbal behavior: Categories, origins, usage, and coding", Semiotica, pp. 49-98 (1969)), a speaker often indicates a position or a degree with a gesture while saying something like "this way" or "about this much". There are many such cases that can be interpreted only through a combination of modalities.

[0003] To recognize such combined modal expressions automatically with a recognition device, the corresponding portions of the voice and the gesture must be found.

[0004]

[Problem to be solved by the invention] Speech recognition devices that recognize voice, and gesture recognition devices that recognize gestures from captured images, have been proposed, but no device or method has been proposed that takes the combination of voice and gesture described above as its recognition target. When trying to recognize such a multimodal expression, in the example above the spoken "this way" can be recognized and converted into a character string or the like, but where in the moving image the gesture corresponding to "this way" begins and ends cannot be determined unless a human checks it visually. This is the problem to be solved.

[0005] In view of the above, an object of the present invention is to provide a recognition method that, given audio data related to a gesture and moving image data capturing the gesture motion, can automatically associate the voice with the gesture portion.

[0006]

[Means for solving the problem] To achieve this object, the invention of claim 1 is a recognition method for recognizing a gesture accompanied by a voice indicating the content of the gesture, characterized by: performing word recognition on the voice; detecting, in the word recognition, the utterance start time and the utterance end time of a word; detecting, from a moving image capturing the gesture, stop times at which the gesture motion stops; and associating the gesture with the corresponding voice by detecting, among the detected stop times, the stop times closest to the utterance start time and the utterance end time, respectively.

[0007] The invention of claim 2 is the recognition method of claim 1, characterized in that a difference value between two adjacent frame images among the consecutive frame images constituting the moving image is obtained, and a local minimum position in the time-series change of the difference value is taken as the stop time.

[0008] The invention of claim 3 is the recognition method of claim 1, characterized in that the voice is continuous speech and the words in the continuous speech are recognized by RIFCDP.

[0009]

[Embodiments of the invention] Embodiments of the present invention are described below in detail with reference to the drawings.

[0010] A representative example of gesture classification is Ekman's classification (see Table 1).

[0011]

[Table 1]

[0012] The illustrators in Table 1 are a category that explicitly supplements the spoken content and correspond to so-called combined modal expressions. Among them, deictic (pointing), baton, spatial, and pictographic gestures are of a type in which the corresponding spoken expression, such as "this way" or "about this much", occurs simultaneously with the gesture. The present inventors focused on this simultaneity and devised a method for searching for corresponding portions across modalities (between voice and moving image).

[0013] As a classification separate from that of Table 1, there is the approach of dividing a gesture in time into phases (McNeill, D., Psycholinguistics, Harper & Row (1987); see Table 2).

[0014]

[Table 2]

[0015] Research in cognitive psychology has already reported that these phases are related in timing to features of the co-occurring speech, such as the power of the fundamental frequency and the phonemes.

[0016] In the present invention, it was found that, among the gesture characteristics appearing in Table 2, the gesture motion stops at the start and at the end of a gesture. Based on this finding, the invention is characterized in that the motion-stop portions of the gesture images closest to the start point and end point of the speech portion are regarded as the start point and end point of the gesture, and the respective start and end points of the voice and of the gesture in the moving image are associated with each other.

[0017] Most gestures in the "illustrator" category are collections of relatively simple linear motions. It is therefore assumed that (1) the change of hand position over time consists of simple acceleration and deceleration, and (2) the change in hand motion is roughly proportional to the temporal change of the image. Under these assumptions, the moving image is automatically segmented into the phases of the McNeill classification from its temporal change pattern (see Nagaya, Seki, Oka, "A Proposal of Gesture Trajectory Feature for Gesture Spotting Recognition", Technical Report of IEICE, Vol. PRU95 (142), pp. 45-50 (1995)).

[0018] FIG. 1 schematically shows an example of the time series of inter-frame difference values, its moving average, and the frame images at the local-minimum times (reference signs T1, T2, T3).

[0019] In practical speech recognition, features such as the power of the fundamental frequency and phoneme peaks can be detected, but detecting them stably is generally difficult. In this embodiment, therefore, a technique called RIFCDP (see Itoh, Kiyama, Kojima, Seiki, Oka: "Reference Interval-Free Continuous Dynamic Programming (RIFCDP) for spotting speech waves by arbitrary parts of a reference pattern", IEICE Tech. Report, Vol. SP95-34 (1995)), which yields the time interval of a word at the same time as it recognizes the word in continuous speech, was used to examine the relationship between those time intervals and these phases. As a result, the following relationships between the timing of the two were confirmed (see FIG. 2).

[0020] (Rule 1) The gesture stroke stops immediately after, or at the same time as, the end of the key spoken expression.

[0021] (Rule 2) The start point of the gesture is the one that gives the shortest stroke interval containing the key spoken expression.

[0022] To verify the temporal relationship between voice and gesture, an experiment was conducted using data actually collected for such combined voice and gesture expressions. The data came from the multimodal database created by RWC. Men and women ranging from their late teens to their early fifties were asked to perform multimodal expressions, and the recorded voice and moving images were compiled into the database. It contains 22 subjects and 25 kinds of expressions, with four trials recorded per subject.

[0023] In the experiment, the results agreed closely with manual judgments (visual association of gesture images with audio data), with an agreement rate of 93.4%. Disagreement was limited to cases in which the subject's behavior was clearly unnatural.

[0024] Next, an experiment tested whether the gesture stroke interval can be determined correctly once the key spoken word has been extracted. The experimental data were combined expressions captured into a movie file immediately beforehand. The RIFCDP technique was used to extract the spoken word, and the time interval (start and end points) of the movie (moving image) matching the speech input in real time was used as the key for searching for the gesture stroke interval. An accuracy of 96% was obtained for 10 subjects.

[0025] The recognition method described above is now explained more concretely. FIG. 3 shows the system configuration of the recognition device. In FIG. 3, reference numeral 1 denotes an information processing device such as a personal computer. By executing the processing program of FIG. 4, the personal computer 1 associates the individual modalities (voice and gesture) of a multimodal expression with each other.

[0026] Reference numeral 2 denotes a microphone, which receives the voice uttered by the subject and outputs an audio signal to the personal computer 1. Reference numeral 3 denotes a video camera, which captures the subject's gesture and outputs frame-by-frame moving image data (a so-called movie) to the personal computer 1 in the form of an electrical signal.

[0027] The personal computer 1 executes the following processing according to the processing program of FIG. 4. The audio data output from the microphone 2 and the frame images output from the video camera 3 are stored in the device (step S10). FIG. 1 shows frame images at specific points in time when the subject performs the gesture corresponding to "no" while saying "no", and FIG. 2 shows the audio waveform (reference sign 200).

[0028] Once audio data and frame images for a predetermined period have been acquired, the personal computer 1 performs word recognition on the audio data using the RIFCDP technique (step S20). More specifically, features are extracted from the audio data and compared, using a matching technique called continuous DP, with multiple sets of features prepared in advance as reference patterns; speech recognition is performed by detecting the reference pattern whose features are most similar to those of the audio data. This matching process also detects the word interval of the target word in the continuous speech ("no" in this case), that is, its start position (start time, T11 in FIG. 2) and end position (end time T12) (step S30).
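The idea of finding a word's start and end inside a longer input while matching it against a reference pattern can be sketched as below. This is only an illustration under stated assumptions: it uses plain subsequence dynamic time warping with a free start point as a simplified stand-in, not RIFCDP or the continuous DP recurrence itself (the text does not give either), and it treats features as single numbers. The name `spot_word` is hypothetical.

```python
def spot_word(reference, stream):
    """Find the interval of `stream` that best matches `reference`.

    Returns (start_index, end_index, cost); both indices are inclusive.
    Assumption: features are scalars and the local distance is |a - b|.
    """
    if not reference or not stream:
        raise ValueError("reference and stream must be non-empty")
    m, n = len(reference), len(stream)
    # cost[i][j]: best cost of aligning reference[0..i] with a span ending at stream[j]
    # start[i][j]: index in stream where that best span begins
    cost = [[0.0] * n for _ in range(m)]
    start = [[0] * n for _ in range(m)]
    for j in range(n):
        # Free start: the first reference element may match any stream position.
        cost[0][j] = abs(reference[0] - stream[j])
        start[0][j] = j
    for i in range(1, m):
        for j in range(n):
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i][j - 1], start[i][j - 1]))
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
            best_cost, best_start = min(candidates)
            cost[i][j] = best_cost + abs(reference[i] - stream[j])
            start[i][j] = best_start
    # Free end: pick the stream position where the whole reference matches best.
    end = min(range(n), key=lambda j: cost[m - 1][j])
    return start[m - 1][end], end, cost[m - 1][end]


# Toy input: the reference pattern occurs at stream positions 2..4.
word_start, word_end, word_cost = spot_word([1, 3, 2], [0, 0, 1, 3, 2, 0, 0])
```

In this toy case `word_start` and `word_end` play the roles of T11 and T12 as frame indices; converting them to times would require the feature frame rate, which is not stated in the text.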

[0029] Next, the personal computer 1 detects the stop positions (times) of the subject's motion from the moving image (a sequence of consecutive frame images). To do so, it computes, for each pair of adjacent frame images, the difference between the image data at the same pixel position (step S40). In stationary regions, the difference between the two image data values is very close to 0 (zero).
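The per-pixel comparison of step S40 can be sketched as follows. This is a minimal illustration, not the implementation used in the embodiment: it assumes grayscale frames stored as nested lists of integers and takes the absolute difference, which the text does not specify. The name `pixel_differences` is hypothetical.

```python
def pixel_differences(prev_frame, next_frame):
    """Per-pixel absolute difference between two equally sized grayscale frames."""
    if len(prev_frame) != len(next_frame):
        raise ValueError("frames must have the same height")
    diff = []
    for row_a, row_b in zip(prev_frame, next_frame):
        if len(row_a) != len(row_b):
            raise ValueError("frames must have the same width")
        diff.append([abs(a - b) for a, b in zip(row_a, row_b)])
    return diff


# Two tiny 3x3 frames in which only the top-right pixel changes.
frame_t0 = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame_t1 = [[10, 10, 90], [10, 10, 10], [10, 10, 10]]
diff_map = pixel_differences(frame_t0, frame_t1)
```

Pixels that did not change produce 0 in `diff_map`, which is the property the paragraph relies on for stationary regions. With real camera input, sensor noise would keep the values near zero but not exactly zero.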

[0030] The difference values of all pixels are summed; this sum is the inter-frame difference value mentioned above. The average of the inter-frame difference values over a fixed period starting at a given point in time is taken as the moving average value at that point, and plotting the moving average values in time series yields the curve denoted by reference sign 100 in FIGS. 1 and 2. At a gesture stop position the movement (stroke) of the subject's hand or head comes to rest, so the inter-frame difference value is close to zero. Using this property, the positions (times) at which the moving average reaches a local minimum over time are judged to be gesture stop positions (Rule 1 above; step S50). Computing inter-frame difference values and moving averages and detecting local minima are well-known techniques and need no detailed explanation.
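The smoothing and minimum search of step S50 can be sketched as follows, starting from a series of inter-frame difference values that are assumed to have been summed already. The window length, the exact definition of a local minimum, and the names `moving_average` and `local_minima` are assumptions made for illustration; the text leaves these details open.

```python
def moving_average(values, window):
    """Average of `window` consecutive values starting at each index."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def local_minima(values):
    """Indices whose value is below the previous one and not above the next one."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]


# Hypothetical inter-frame difference values: two bursts of motion with rests between.
diffs = [0, 2, 8, 9, 3, 0, 1, 7, 9, 4, 1, 0, 3, 8]
smoothed = moving_average(diffs, window=3)
stop_indices = local_minima(smoothed)
```

Each index in `stop_indices` is a frame position; dividing by the camera frame rate would turn it into a time such as T1 or T2. This simple test ignores the first and last samples, so a rest at the very beginning or end of the recording would need separate handling.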

[0031] Times T1, T2, and T3 in FIG. 2 are obtained as the result of this gesture stop position detection. Note that only the stop positions of the gesture motion can be detected from the frame images; whether a given stop position is the start or the end of the gesture cannot be determined from them. In this embodiment, therefore, the utterance start time of the word obtained together with the recognition result in word recognition (time T11 in FIG. 2) is compared with each of the stroke stop times T1, T2, and T3, and the stroke stop time with the shortest interval to the utterance start time T11 is detected. In the example of FIG. 2, time T1 is detected as the closest, and T1 is determined to be the start time of the gesture (Rule 2 above).

[0032] Similarly, detecting the stroke stop time closest to the utterance end time T12 of the word yields time T2 (see FIG. 2), which is determined to be the end time of the gesture (step S60). When a word in the speech has been recognized in this way, the speech interval of that word (start time T11 and end time T12) is associated with its gesture interval (start time T1 and end time T2), and these time data are stored in the device. If necessary, the speech recognition result is stored together with the time data.
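The nearest-stop matching of step S60 reduces to a closest-value search, sketched below. The numeric times are invented stand-ins for T1 to T3, T11, and T12, and the names `nearest_stop` and `align_gesture` are hypothetical.

```python
def nearest_stop(stop_times, target):
    """Return the stop time closest to `target`."""
    if not stop_times:
        raise ValueError("no stop times detected")
    return min(stop_times, key=lambda t: abs(t - target))


def align_gesture(stop_times, utter_start, utter_end):
    """Pair a word's speech interval with the gesture interval bounded by the nearest stops."""
    return {
        "speech": (utter_start, utter_end),
        "gesture": (nearest_stop(stop_times, utter_start),
                    nearest_stop(stop_times, utter_end)),
    }


# Invented times in seconds: three motion stops and one recognized word.
stops = [0.4, 1.9, 3.1]
result = align_gesture(stops, utter_start=0.6, utter_end=1.7)
```

For a very short word, both utterance times could map to the same stop time, giving an empty gesture interval. The text does not discuss that case, so the sketch leaves it unhandled.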

[0033] In addition to the embodiment described above, the following variation can be implemented.

[0034] 1) Besides RIFCDP, a known way to detect the utterance start and end times is to detect the point at which the voltage level of the audio signal rises from below a threshold to above it, and the point at which it does the reverse. This method can be used when the subject utters only a single word. The RIFCDP technique, by contrast, can detect the interval of each word even in continuous speech consisting of multiple consecutive words (including phrases). When multiple gestures are performed in succession the speech is also continuous, so using RIFCDP makes it possible to associate each gesture with the speech indicating its meaning. Moreover, even when the subject performs several kinds of gestures, the motion stops at the boundary between one gesture and the next, so the present invention can also recognize multiple consecutive gestures.
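The threshold-based alternative for a single word can be sketched as follows. It assumes the signal level has already been reduced to a non-negative envelope sampled at regular intervals; the threshold value and the name `utterance_interval` are illustrative choices, not taken from the text.

```python
def utterance_interval(levels, threshold):
    """Return (start_index, end_index) of the first span with level >= threshold.

    end_index is the first sample that falls back below the threshold.
    Returns None if the level never reaches the threshold.
    """
    start = None
    for i, level in enumerate(levels):
        if start is None and level >= threshold:
            start = i            # rising crossing: utterance begins
        elif start is not None and level < threshold:
            return start, i      # falling crossing: utterance ends
    if start is not None:
        return start, len(levels)  # still above threshold when the signal ends
    return None


# Invented envelope of one short word surrounded by near-silence.
envelope = [0.01, 0.02, 0.30, 0.45, 0.40, 0.05, 0.02]
interval = utterance_interval(envelope, threshold=0.1)
```

Because the function stops at the first falling crossing, it finds only one utterance, which matches the stated limitation that this method suits input containing a single word.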

[0035]

[Effects of the invention] As described above, according to the invention of claim 1, the start and end times of a gesture can be detected from the utterance start time and utterance end time of a word in the speech, and these start and end times make it possible to automatically associate the gesture with the speech interval indicating its meaning.

[0036] According to the invention of claim 2, the times at which the gesture motion stops can be detected by using difference values between frame images. In addition, by detecting local minima in the temporal change of the difference values, the boundaries between gestures can be detected reliably even when several kinds of gestures are performed in sequence.

[0037] According to the invention of claim 3, word recognition by RIFCDP can detect the utterance interval of each word in continuous speech, so the method can also handle recognition of a sequence of several kinds of gestures.

[Brief description of the drawings]

FIG. 1 is a diagram showing the relationship between frame images at specific points in time and image features in an embodiment of the present invention.

FIG. 2 is a diagram showing the correspondence between image features and speech in an embodiment of the present invention.

FIG. 3 is a block diagram showing the system configuration of an embodiment of the present invention.

FIG. 4 is a flowchart showing the processing procedure of an embodiment of the present invention.

[Explanation of symbols]

1 personal computer
2 microphone
3 video camera

Continuation of the front page
(51) Int.Cl.6 Identification code FI G10L 3/00 571 G06F 15/62 380
(72) Inventor Shigeki Nagaya, Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor Kiyoshi Furukawa, The Japan Iron and Steel Federation, Keidanren Kaikan, 1-9-4 Otemachi, Chiyoda-ku, Tokyo
(72) Inventor Ryuichi Oka, Tsukuba Research Center, Technology Research Association New Information Processing Development Organization, Tsukuba Mitsui Building, 1-6-1 Takezono, Tsukuba-shi, Ibaraki

Claims

[Claims]

1. A recognition method for recognizing a gesture accompanied by a voice indicating the content of the gesture, characterized by: performing word recognition on the voice; detecting, in the word recognition, an utterance start time and an utterance end time of a word; detecting, from a moving image capturing the gesture, stop times at which the motion of the gesture stops; and associating the gesture with the corresponding voice by detecting, among the detected stop times, the stop times closest to the utterance start time and the utterance end time, respectively.

2. The recognition method according to claim 1, wherein a difference value between two adjacent frame images among the consecutive frame images constituting the moving image is obtained, and a local minimum position in the time-series change of the difference value is taken as the stop time.

3. The recognition method according to claim 1, wherein the voice is continuous speech, and words in the continuous speech are recognized by RIFCDP.