JP2004192150A

JP2004192150A - Image recognition device, its characteristic extraction method, storage medium and program

Info

Publication number: JP2004192150A
Application number: JP2002357079A
Authority: JP
Inventors: Masahiro Kawagoe; 正弘河越; Hiroshi Aoyama; 宏青山
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2002-12-09
Filing date: 2002-12-09
Publication date: 2004-07-08

Abstract

PROBLEM TO BE SOLVED: To perform gesture recognition quickly and without being restricted by the motion of the subject.
SOLUTION: Each of two images taken at different points in time is divided into a plurality of blocks, and the image data of the two time points are compared within each block, whereby a CPU 10 detects the change in position of the same image data group. The changes in position for all the blocks (velocity vectors) are treated as the feature of the gesture, and the gesture is recognized using this feature.
COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、撮影画像に基づいて被写体の、特に人間が好適であるが、動作の意味内容を認識する画像認識装置、その特徴抽出方法、記録媒体およびプログラムに関する。
【０００２】
【従来の技術】
被写体、たとえば、人間の手や顔を撮影し、その撮影画像から手や顔によるジェスチャの意味内容を認識しようとする試みが行われている。
【０００３】
撮影画像の中の認識対象の部位を目視で確認することは容易であるがコンピュータが判定することは容易ではない。このため、従来この種のジェスチャ認識方法では次のような方法で、認識すべき被写体のジェスチャの特徴を検出している。
【０００４】
認識すべき被写体（手など）の固定の撮影領域を予め定めておき画面中の固定領域内での画像データの変化に基づいてジェスチャを認識する。（例えば、特許文献１参照。）
異なる２つの時点の１画面分の画像データの差分を取り、０（ゼロ）ではない差分値を持つ画像データ、すなわち、被写体の移動により差分が生じた画像データの時系列変化をジェスチャ−の特徴とみなす。（例えば、特許文献２参照。）
【０００５】
【特許文献１】
特開２００２−０８３３０２号公報
【０００６】
【特許文献２】
特開平１０−１６２１５１号公報
【０００７】
【発明が解決しようとする課題】
特許文献１の方法では、撮影領域が固定化されるので、大きい動きや早い動きを伴うジェスチャなどの認識には向かないという欠点がある。
【０００８】
また、特許文献２の方法では、所定時間の間被写体を撮影しなければならず、ジェスチャの認識に時間がかかるという欠点がある。
【０００９】
そこで、本発明の目的は、短い時間でジェスチャ認識が可能であり、被写体の動きの制限を受けない画像認識装置、その特徴抽出方法、記録媒体およびプログラムを提供することにある。
【００１０】
【課題を解決するための手段】
このような目的を達成するために、本発明は、撮影画像を使用して認識対象のジェスチャの特徴を抽出し、予めジェスチャの意味内容が判明している特徴と抽出された特徴とを比較することによりジェスチャの認識を行う画像認識装置において、現在時点および前の時点の２つの撮影画像をそれぞれ複数のブロックに分割する手段と、当該分割された２つの時点の撮影画像をブロックごとに比較することにより所定数の画像データ群の前時点から現時点へのブロック中の位置の変化を検出する手段を有し、当該検出された複数のブロックについての画像データ群の位置の変化を前記認識対象のジェスチャの特徴となすことを特徴とする。
【００１１】
【発明の実施の形態】
以下、図面を参照して、本発明の実施形態を詳細に説明する。
【００１２】
図１は本発明を適用したジェスチャ認識装置（画像認識装置）のシステム構成を示す。
【００１３】
ジェスチャ認識装置としては汎用のパソコンや周知の情報処理装置を使用することができる。
【００１４】
図１において、１０はジェスチャ認識プログラム（図２および図３）を実行してジェスチャ認識を行うＣＰＵである。ＣＰＵ１０が本発明の画像処理手段として動作する。１１は上記ジェスチャ認識プログラムやこのプログラムに実行する種々のデータを記憶するメモリ（ＲＯＭやＲＡＭ）である。メモリ１１が本発明の記憶手段として動作する。
【００１５】
１２はジェスチャプログラムおよびこのプログラムで使用するデータを保存するためのハードディスクである。ジェスチャプログラムで使用するデータの中には予めジェスチャの意味内容が判明しており、ジェスチャの特徴を表す特徴データ（後で説明する基本動作のベクトルテンプレート）とその特徴データの意味内容を表す識別データが含まれている。
【００１６】
特徴データについては後で説明する。識別データは、たとえば、「はい」、「いいえ」といった文字列を使用してもよいし、「Ａ」、「Ｂ」や数字などの識別文字を使用すればよく、ジェスチャ認識装置の用途に応じて適宜、定めればよい。
【００１７】
１３はビデオカメラ１４と接続し、ビデオカメラ１４から被写体の画像データを入力するインターフェースである。１５はフロッピー（登録商標）ディスクやＣＤＲＯＭなどの記憶媒体（本発明の記録媒体）１６を受付け、その記憶媒体からジェスチャ認識プログラムや、予め撮影した被写体の動画像（いわゆるビデオ）を読み取るドライブである。
【００１８】
１７はジェスチャ認識結果を表示するためのディスプレイである。なお、ジェスチャ認識結果を出力するための手段としては、その他に、プリンタ、記憶媒体への書き込みを行うドライブ、他の情報処理装置と通信を行うための通信装置を使用することができる。
【００１９】
１８はＣＰＵ１０に対して動作命令や、ジェスチャ認識装置で使用するデータを入力する入力装置で、キーボードやマウスなどのポインティングデバイスを使用することができる。
【００２０】
このようなシステム構成において、ハードディスク１２に記憶されたジェスチャ認識プログラムがメモリ１１にロードされた後、ＣＰＵ１０により実行されて、ジェスチャ認識処理が行われる。
【００２１】
次に本発明に係わり、ビデオカメラ１４あるいは記憶媒体１６から取得した撮影画像（以下、単に撮影画像と称する）から被写体（ジェスチャの認識対象となる対象部）の画像特徴を抽出する方法を図４を参照して説明する。
【００２２】
１画面分の撮影画像データはＮ×Ｍのブロック２０１（図４参照）に分割される。この例ではＮ＝Ｍ＝９の例を説明する。また、分割された各ブロックは２４×２４の画素を持つものとする。２０２は現時点で取得した画像データの中の比較のために使用する領域で以下、フィルタあるいはフィルタ領域と称する。フィルタ領域はブロックよりも小さければよく、この形態では１６×１６画素の大きさを有する。フィルタ領域２０２内の、現時点で取得した画像データ群と一致する前時点の画像データ群の位置の変化を検出する。このためにフィルター領域と同じ大きさの領域を前時点で取得したブロック２０１内を移動させてこの領域に含まれる画像データ群と現時点で上記フィルタ領域２０２に含まれる画像データ群との一致比較を行うことで、同じデータ群の位置の変化の検出を行う。
【００２３】
被写体が動いた場合、ＣＰＵの処理速度のオーダでは、画面中の被写体の移動量と方向（傾き）変化は微小であり、ブロック２０１から外れることはない。また、画像データ、たとえば、輝度データが大きく変動することもない。このような画像の特質を利用して本発明では、各ブロックごとに現時点のフィルタ領域内の画像データ群と一致する前時点の画像データ群の位置の変化すなわち、同じ画像データ群の移動方向とその大きさ（移動量）を検出する。移動方向とその大きさを総称して本実施形態では以後、速度ベクトルと呼ぶことにする。全てのブロック（９×９＝８１個）から取得した速度ベクトルがジェスチャの特徴を表す特徴データとして扱われる。図４の例ではフィルタ２０２の位置はブロック２０１の左上に示しているが、実際はブロック２０１の中央に位置させるとよい。
【００２４】
以上の抽出を行うための数式を以下に示しておく。
元の濃淡画像から線分抽出を行ない、線分の部分を１、そうでない部分を０とした線画２値画像ＦおよびＧに対して、
ｆ（ｉ，ｊ）＝０ｏｒ１：画像Ｆのｉ行ｊ列目の要素
ｇ（ｉ，ｊ）＝０ｏｒ１：画像Ｇのｉ行ｊ列目の要素
と表記するとき、関数ｍａｔｃｈを、

と定義する。ここでＸＯＲはｅｘｃｌｕｓｉｖｅＯＲを表すものである。
【００２５】
つぎに、
Ｆ：１６×１６画素のフィルタ、
Ｇｐｑ：ブロックの左上端からｐ行ｑ列分ずれた位置から始まる１６×１６画素の領域、とするとき、下式で、ブロック内において、ズレｐ，ｑをブロックの行と列のサイズｗｘとｗｙ内で変化させたときの、下記で与えられる最小値となるｘ、ｙを、

とし、ｉ行ｊ列目のブロックに対応する速度ベクトルを次のように定義する。
Ｖｉｊ＝（ｘ，ｙ）：要素ｘ，ｙからなる速度ベクトル
【００２６】
上述の特徴抽出方法は、ジェスチャ認識時と、意味内容が判明しているジェスチャについての撮影画像から特徴データを取得する際に使用される。判明している意味内容の識別情報と、上記特徴抽出方法により取得された全ブロックの速度ベクトルを総称して本実施形態では基本動作のベクトルテンプレートと呼ぶことにする。
【００２７】
基本動作のベクトルテンプレートは、説明の便宜上、「はい」、「いいえ」および「わかりません」の３つのジェスチャについて予め作成され、ハードディスク１２にインストール（搭載）されているものとする。
【００２８】
以上の点を踏まえ、図１のジェスチャ認識装置により行われるジェスチャ認識処理を図２および図３を参照して説明する。
【００２９】
図２はジェスチャ認識プログラムのメイン処理手順を示し、図３は図２のステップ３０の詳細処理手順を示す。
【００３０】
ビデオカメラ１４により取得された被写体の画像（１画面分（たとえば、３２０×２４０ｐｉｘｅｌ（画素））の静止画像）は、ＣＰＵ１０の制御により一定周期（１０ｆｒａｍｅ／ｓｅｃ）でインタフェース１３を介してメモリ１１上に取り込まれる（いわゆる入力、図２のＳ１０）。このための制御手順は周知であり、詳細な説明を要しないであろう。
【００３１】
ＣＰＵ１０は取り込まれ、メモリ１１に記憶された現時点の画像と１つ前の時点の２つの時点の静止画像データをそれぞれ複数のＮ×Ｍ個のブロックに分割する。さらにフィルタ領域２０１（速度ベクトル算出領域）を各ブロックの中央に定義（設定）する。設定したフィルタ領域２０１から１６×１６画素の画像データをメモリ１１から切り出（取り込む）してメモリ１１内の作業領域に一時記憶する（図２のＳ２０）。
【００３２】
次にＣＰＵ１０はフィルタ領域内の現時点の画像データ群と一致する前時点の画像データ群の位置を各ブロックごとに検出する。検出された位置とフィルタの位置（画面中央）から速度ベクトルが算出される（図２のＳ３０）。この例では前時点の位置が速度ベクトルの始点となる。
【００３３】
全ブロックについて取得した（認識対象のジェスチャの）速度ベクトルについてハードディスク１２に保存されている基本動作のベクトルテンプレートの中からもっとも類似している速度ベクトルを有する基本動作のベクトルテンプレートが検出される。
【００３４】
具体的には各ベクトルテンプレートのベクトルと上記認識対象のジェスチャの速度ベクトルとの類似度の計算を行い、類似度の計算結果の中でもっとも類似度が高い値を持つベクトルテンプレートが検出され、その識別情報が単発の基本動作、すなわち、瞬間のジェスチャの意味内容の認識結果としてディスプレイ１７に表示される（図２のステップＳ４０）。
【００３５】
上記の類似度計算の式の例を下記に示す。
Ｐｋｉｊをｋ番目のベクトルテンプレートのｉ行ｊ列の速度ベクトル、
Ｋをパターンの数、ＮとＭをブロックの行と列の数（今回は各々９）とするとき、ｋ番目のパターンとの類似度を次のように定義する。

ここで、・はベクトル内積、｜｜を絶対値を表す。
【００３６】
本実施形態では、連続的なジェスチャの認識を行えるよう、複数時点の連続のベクトルテンプレートもハードディスク１２に用意されており、複数時点分の認識対象の速度ベクトルを取得したときに、その複数時点の速度ベクトルの組み合わせと、ハードディスク１２に保存されたベクトルテンプレートの速度ベクトルの組み合わせと比較して、もっとも類似するベクトルテンプレートの識別情報が連続動作のジェスチャ認識結果として表示される（図２のＳ５０）。
【００３７】
フィルタ領域２０２と同じ画像データの位置を検出するための処理内容を図３を参照して説明する。分割されたブロックのフィルタ領域２０１や前時点の同じブロックの読取領域の位置を初期設定した後（図３のＳ１００）、ＣＰＵ１０は現時点のブロックの中央部１６×１６個の画像データをメモリ１１から読み出し、作業領域に記憶する。また、初期設定された前時点の読取領域（たとえば、ブロックの左上１６×１６個）から前時点の画像データをメモリ１１から読み出し、作業領域に記憶する。読み出した２組の画像データ群の一致比較を行う（図３のＳ１２０）。比較の方法としては、
（１）簡便な方法としては読み出した画像データ群の平均値を計算し、平均値の比較を行う。
（２）詳細な方法としては、同じ画素位置の画像データの値の差分値が許容範囲内にあるか否かの判定を行い、許容範囲内にある場合には一致と判定する。１６×１６の画素について一致判定を行って、所定画素数以上の一致判定が得られた場合には、画像データ群は一致とみなす。
というような比較方法を使用するとよい。
【００３８】
一致の判定が得られた場合には、前時点の画像データの読取領域の位置を始点、ブロックの中央位置を終点とする速度ベクトル（実際には始点と終点の座標位置の形態）を取得する。また、速度ベクトルの始点と終点の間の距離を速度ベクトルの大きさとして計算し、ベクトルおよびその大きさをメモリ１１に記憶する（図３のＳ１２５）。
【００３９】
一方、ステップＳ１２０で一致判定が得られなかった場合には前時点の画像データの読取領域を１画素分Ｘ方向（水平右側）に移して（インクリメントして）画像データの読取を行う（図３のＳ１３５）。以下、現時点の画像データ群と前時点の画像データ群との比較が上述と同様にして行われる。以下、一致判定が得られるまで、ステップＳ１２０〜Ｓ１３０→Ｓ１３５→Ｓ１２０のループ処理が繰り返される。読取領域がＸ方向の最終位置まで到達すると、そのことがステップＳ１３０で検出される。これにより手順はステップＳ１３０からＳ１４５へと進み、読取領域のＹ方向（垂直方向）の位置が１画素だけインクリメントされ、また、読取領域のＸ方向の位置が初期位置（ブロックのもっとも左）に変更される。このようにして新たに設定された読取領域の前時点の画像データ群が読み出されて、現時点の画像データ群との比較に使用される（図３のＳ１４５→Ｓ１２０）。
【００４０】
以上のようにして読み取り領域をＸ方向およびＹ方向に移動させると、現時点および前時点の画像データ群が一致するので、そのときの前時点の読取領域の位置から速度ベクトルが計算される（ステップＳ１２５）。
【００４１】
（ジェスチャ認識装置の用途）
Ａ）用途１
上述のジェスチャ認識方法を使用するとインターネットでのユーザのパソコンとホームページを提供するサーバとの間の情報伝達にジェスチャを使用することができる。この場合には図５にユーザが使用するパソコン１０００にＣＣＤカメラ１００１を搭載し、ユーザのジェスチャによりパソコン１０００を介してサーバ（不図示）に「はい」、「いいえ」などの識別情報を送信することができる。ジェスチャ認識プログラムはサーバに搭載してもよいし、パソコン１０００に搭載してもよい。
【００４２】
Ｂ）用途２
図６に示すよう商品陳列用のショウウィンドウ２００２にビデオカメラ２００１を設置し、ショウウィンドウ２００２を通過する人のジェスチャをビデオカメラ２００１により撮影し、その撮影結果をジェスチャ認識装置（不図示）により認識する。
【００４３】
Ｃ）用途３
図７に示すようにパソコン（ジェスチャ認識装置）３００２の表示装置に複数の画像、たとえば、人物写真３００１を順次に表示させる。ユーザ３００３のジェスチャをビデオカメラ３００４で撮影して、ジェスチャ認識を行う。そのジェスチャ認識結果（「はい」、「いいえ」）に応じて、表示された画像３００１を取捨選択する。選択された画像３００５は表示装置にまとめて別途表示される。
このような情報処理はタレントのオーディションや結婚相談所のお見合い写真の選択に利用することができる。
【００４４】
Ｄ）用途４
図８に示すように学校の教室にビデオカメラ４００１および大画面のディスプレイ４００２を設置する、また、ディスプレイ４００２には仮想教師が授業（講義）を行うための映像をシステム４１００から送る。システム４１００はビデオカメラ４００１から取得した動画像をジェスチャ認識し（４１０１）、その認識結果から授業進行コントーラ（ＤＶＤのチャプター制御コントローラ）を使用して、授業のための映像の表示を制御して、適宜、補足説明用の画面に飛ばしたり、説明を早めたりすることができる。
【００４５】
Ｅ）用途５
図９に示すように学校などの教室にビデオカメラ５００１および５００２を設置する。ビデオカメラ５００１は教師５００３を撮影し、ビデオカメラ５００２は生徒５００４を撮影する。ビデオカメラの撮影方向は遠隔操作可能とすることが好ましい。撮影された画像は通信回線を解してジェスチャ認識装置５１００に送られてジェスチャ認識する。この形態では教師および生徒双方の画像から速度ベクトルを取得して（Ｓ１０１）、教師５００３および生徒５００４のジェスチャの意味内容を認識する（Ｓ１０２）。ジェスチャの認識結果は教師の熱意（刺激）を表す識別情報、生徒の反応を表す識別情報で出力され、その出力結果に対応する教師の力量を示す言葉あるいは数値が計算される。このような処理を行うことにより、ジェスチャ認識装置を教師の力量を診断することに使用することができる。
【００４６】
上述の実施形態の他に次の形態を実施することができる。
１）撮影画面の大きさや分割する個数は使用するデバイスに応じて適宜定めればよい。
２）ビデオカメラや記憶媒体の種類に制約を受けることはないが、ジェスチャ認識装置に好適なものを適宜使用すればよい。
３）ジェスチャ認識処理を単体のコンピュータで実行してももよいし、複数のコンピュータで、分散処理してもよい。
【００４７】
上述の実施形態は発明を説明するための例示である。特許請求の範囲に記載された技術思想に従えば、種々の改良形態が存在するが、これらの改良形態は本発明の技術的範囲内となる。
【００４８】
【発明の効果】
以上、説明したように、本発明によれば、撮影画像を複数ブロックに分割し、２つの異なる時点の同一の画像データ群の位置の変化を取得する。複数のブロックすべての位置の変化がジェスチャの特徴とみなされる。本発明では、取得画像は２つの時点で取得する竹でよいので、時系列的な画像を必要とする従来技術と比べるとジェスチャ認識に要する時間が短縮される。また、迅速にジェスチャ認識結果を出力することが可能となる。
【００４９】
また、撮影画面の一部領域の画像データが使用されるのではなく、画面内の全データが特徴抽出に使用されるので、被写体が速く動いてもその動きに対応することができ、従来技術のように特徴抽出のための固定領域を被写体画像がはみ出ることはない。
【００５０】
さらに連続的に得られる撮影画像についてもおなじように特徴抽出を行うことにより連続的なジェスチャの意味内容を認識することが可能となる。
【図面の簡単な説明】
【図１】本発明実施形態のシステム構成を示すブロック図である。
【図２】ジェスチャ認識処理プログラムの内容を示すフローチャートである。
【図３】ジェスチャ認識処理プログラムの部分的な詳細手順を示すフローチャートである。
【図４】本発明実施形態の特徴抽出方法を説明するための説明図である。
【図５】ジェスチャ認識装置の用途を説明するための説明図である。
【図６】ジェスチャ認識装置の他の用途を説明するための説明図である。
【図７】ジェスチャ認識装置の他の用途を説明するための説明図である。
【図８】ジェスチャ認識装置の他の用途を説明するための説明図である。
【図９】ジェスチャ認識装置の他の用途を説明するための説明図である。
【符号の説明】
１０ＣＰＵ
１１メモリ
１２ハードディスク
１４ビデオカメラ
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an image recognition device that recognizes, from captured images, the meaning of the motion of a subject (preferably a human), and to a feature extraction method therefor, a recording medium, and a program.
[0002]
[Prior art]
Attempts have been made to capture a subject, for example, a human hand or face, and recognize the meaning of the gesture by the hand or face from the captured image.
[0003]
It is easy to visually confirm the part to be recognized in the captured image, but it is not easy for the computer to determine it. For this reason, in the conventional gesture recognition method of this type, the feature of the gesture of the subject to be recognized is detected by the following method.
[0004]
A fixed shooting region of a subject (such as a hand) to be recognized is determined in advance, and a gesture is recognized based on a change in image data in the fixed region on the screen. (For example, refer to Patent Document 1.)
The difference between one screen of image data at two different points in time is taken, and the time-series change of the image data having a difference value other than 0 (zero), that is, the image data in which a difference arose from the movement of the subject, is regarded as the feature of the gesture. (For example, see Patent Document 2.)
[0005]
[Patent Document 1]
JP-A-2002-083302
[0006]
[Patent Document 2]
JP-A-10-162151
[0007]
[Problems to be solved by the invention]
The method of Patent Document 1 has a drawback that the imaging region is fixed, and thus is not suitable for recognizing a gesture involving a large movement or a fast movement.
[0008]
In addition, the method of Patent Document 2 has a drawback in that the subject must be photographed for a predetermined time, and it takes time to recognize a gesture.
[0009]
Therefore, an object of the present invention is to provide an image recognition device that can perform gesture recognition in a short time and is not restricted by the movement of a subject, a feature extraction method, a recording medium, and a program.
[0010]
[Means for Solving the Problems]
To achieve this object, the present invention provides an image recognition device that extracts a feature of a gesture to be recognized using captured images and recognizes the gesture by comparing the extracted feature with features whose semantic content is known in advance. The device has means for dividing each of two captured images, one at the current time point and one at the previous time point, into a plurality of blocks, and means for detecting, by comparing the divided captured images of the two time points block by block, the change in position within the block of a predetermined number of image data groups from the previous time point to the current time point. The detected changes in position of the image data groups for the plurality of blocks are used as the feature of the gesture to be recognized.
[0011]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0012]
FIG. 1 shows a system configuration of a gesture recognition device (image recognition device) to which the present invention is applied.
[0013]
A general-purpose personal computer or a known information processing device can be used as the gesture recognition device.
[0014]
In FIG. 1, reference numeral 10 denotes a CPU that executes a gesture recognition program (FIGS. 2 and 3) to perform gesture recognition. The CPU 10 operates as the image processing means of the present invention. Reference numeral 11 denotes a memory (ROM and RAM) that stores the gesture recognition program and various data used by the program. The memory 11 operates as the storage means of the present invention.
[0015]
Reference numeral 12 denotes a hard disk for storing the gesture program and the data used by the program. The data used by the gesture program include feature data representing the features of gestures whose semantic content is known in advance (the vector templates of the basic operation described later) and identification data representing the semantic content of that feature data.
[0016]
The feature data will be described later. As the identification data, a character string such as “Yes” or “No” may be used, or an identification character such as “A”, “B”, or a number may be used; it may be determined as appropriate for the application of the gesture recognition device.
[0017]
Reference numeral 13 denotes an interface that is connected to a video camera 14 and inputs image data of a subject from the video camera 14. Reference numeral 15 denotes a drive that accepts a storage medium (the recording medium of the present invention) 16 such as a floppy (registered trademark) disk or a CD-ROM, and reads from it the gesture recognition program or a previously captured moving image (so-called video) of a subject.
[0018]
Reference numeral 17 denotes a display for displaying a gesture recognition result. In addition, as a means for outputting a gesture recognition result, a printer, a drive for writing to a storage medium, and a communication device for communicating with another information processing device can be used.
[0019]
Reference numeral 18 denotes an input device for inputting operation commands to the CPU 10 and data used by the gesture recognition device; a keyboard or a pointing device such as a mouse can be used.
[0020]
In such a system configuration, the gesture recognition program stored in the hard disk 12 is loaded into the memory 11 and then executed by the CPU 10 to perform the gesture recognition processing.
[0021]
Next, the method according to the present invention for extracting image features of a subject (the target portion whose gesture is to be recognized) from a captured image acquired from the video camera 14 or the storage medium 16 (hereinafter simply referred to as a captured image) will be described with reference to FIG. 4.
[0022]
The captured image data for one screen is divided into N × M blocks 201 (see FIG. 4). This example uses N = M = 9. Each block contains 24 × 24 pixels. Reference numeral 202 denotes the area of the currently acquired image data that is used for comparison, hereinafter called the filter or filter area. The filter area need only be smaller than the block; in this embodiment it is 16 × 16 pixels. The device detects the change in position of the previous-time image data group that matches the currently acquired image data group in the filter area 202. To do so, an area of the same size as the filter area is moved within the block 201 acquired at the previous time point, and the image data group in that area is compared for a match with the image data group currently in the filter area 202, whereby the change in position of the same data group is detected.
[0023]
When the subject moves, the amount of movement and the change in direction (inclination) of the subject on the screen are very small on the time scale of the CPU's processing speed, so the subject does not leave the block 201. Nor does the image data, for example the luminance data, vary greatly. Exploiting these properties of images, the present invention detects, for each block, the change in position of the previous-time image data group that matches the image data group in the current filter area, that is, the direction of movement of the same image data group and its magnitude (amount of movement). In the present embodiment, the direction of movement and its magnitude are hereinafter collectively called a velocity vector. The velocity vectors acquired from all the blocks (9 × 9 = 81) are treated as feature data representing the feature of the gesture. In the example of FIG. 4 the filter 202 is shown at the upper left of the block 201, but in practice it is preferably placed at the center of the block 201.
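The block division and the central filter area can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the list-of-rows frame representation, and anchoring the 9 × 9 grid at the top-left of the frame are assumptions made for the example (the text does not say how the grid is placed on the frame).

```python
BLOCK = 24   # block side in pixels (from the text)
FILTER = 16  # filter side in pixels (from the text)

def split_into_blocks(frame, n_rows=9, n_cols=9, block=BLOCK):
    """Map (i, j) to the pixel grid of block (i, j).

    `frame` is a list of pixel rows; the grid is anchored at the
    top-left corner of the frame (an assumption for this sketch).
    """
    blocks = {}
    for i in range(n_rows):
        for j in range(n_cols):
            top, left = i * block, j * block
            blocks[(i, j)] = [row[left:left + block]
                              for row in frame[top:top + block]]
    return blocks

def central_filter(block_pixels, size=FILTER):
    """Cut the size-by-size filter area from the centre of one block."""
    offset = (len(block_pixels) - size) // 2  # 4 for a 24-pixel block
    return [row[offset:offset + size]
            for row in block_pixels[offset:offset + size]]
```

With 24-pixel blocks and a 16-pixel filter, the centred filter leaves a 4-pixel margin on every side of the block.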
[0024]
Formulas for performing the above extraction are shown below.
For line-drawing binary images F and G, obtained by extracting line segments from the original grayscale image and setting line-segment pixels to 1 and all other pixels to 0, write
f(i, j) = 0 or 1: the element in row i, column j of image F
g(i, j) = 0 or 1: the element in row i, column j of image G
and define the function match as

Here, XOR denotes exclusive OR.
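The defining equation for match is not reproduced in this text. One reconstruction that is consistent with the surrounding definitions (binary elements f and g, XOR, and a value that is later minimized) is a count of the pixel positions at which the two images differ. This form is an assumption, not the patent's confirmed formula:

```latex
\mathrm{match}(F, G) \;=\; \sum_{i}\sum_{j} \mathrm{XOR}\bigl(f(i,j),\, g(i,j)\bigr)
```

Under this reading, match(F, G) = 0 when the two binary images are identical, and it grows by one for every differing pixel.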
[0025]
Next, let
F: the 16 × 16 pixel filter,
Gpq: the 16 × 16 pixel region starting at the position shifted by p rows and q columns from the upper-left corner of the block.
When the shifts p and q are varied within the block over its row and column sizes wx and wy, let x and y be the shifts that give the minimum value of the expression below,

and define the velocity vector corresponding to the block in row i, column j as follows.
Vij = (x, y): velocity vector composed of the elements x and y
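The minimization over shifts can be sketched as below. The sketch assumes that match counts XOR mismatches between two binary images (the equation itself is missing from this text) and that only shifts keeping the 16 × 16 region inside the block are tried; the names `match`, `region`, and `best_shift` are hypothetical.

```python
def match(f, g):
    """Count the pixels at which two equal-sized binary images differ."""
    return sum(a ^ b
               for row_f, row_g in zip(f, g)
               for a, b in zip(row_f, row_g))

def region(block, p, q, size=16):
    """G_pq: the region starting p rows and q columns from the upper left."""
    return [row[q:q + size] for row in block[p:p + size]]

def best_shift(filter_f, prev_block, size=16):
    """Return the shift (x, y) = (p, q) that minimises match(F, G_pq).

    Only shifts that keep G_pq inside the block are tried (an assumption);
    ties go to the first shift in raster order.
    """
    limit = len(prev_block) - size
    candidates = [(p, q)
                  for p in range(limit + 1)
                  for q in range(limit + 1)]
    return min(candidates,
               key=lambda s: match(filter_f, region(prev_block, s[0], s[1], size)))
```

For a 24 × 24 block this tries 9 × 9 = 81 shifts per block, so an exhaustive search stays cheap.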
[0026]
The feature extraction method described above is used both at gesture recognition time and when acquiring feature data from captured images of a gesture whose semantic content is known. In the present embodiment, the identification information of the known semantic content together with the velocity vectors of all the blocks obtained by the feature extraction method is called a vector template of the basic operation.
[0027]
For convenience of explanation, it is assumed that vector templates of the basic operation have been created in advance for three gestures, “Yes”, “No”, and “I don't know”, and installed (mounted) on the hard disk 12.
[0028]
Based on the above, the gesture recognition processing performed by the gesture recognition device of FIG. 1 will be described with reference to FIGS. 2 and 3.
[0029]
FIG. 2 shows the main processing procedure of the gesture recognition program, and FIG. 3 shows the detailed processing procedure of step S30 in FIG. 2.
[0030]
An image of the subject acquired by the video camera 14 (a still image of one screen, for example 320 × 240 pixels) is taken into the memory 11 via the interface 13 at a constant rate (10 frames/sec) under the control of the CPU 10 (the so-called input, S10 in FIG. 2). The control procedure for this is well known and needs no detailed explanation.
[0031]
The CPU 10 divides the still image data of the two time points stored in the memory 11, namely the current image and the image of the immediately preceding time point, each into N × M blocks. It then defines (sets) a filter area 202 (velocity vector calculation area) at the center of each block. The 16 × 16 pixels of image data in the set filter area 202 are cut out (read) from the memory 11 and temporarily stored in a work area in the memory 11 (S20 in FIG. 2).
[0032]
Next, the CPU 10 detects, for each block, the position of the image data group at the previous time point that matches the current image data group in the filter area. A velocity vector is calculated from the detected position and the position of the filter (the center of the screen) (S30 in FIG. 2). In this example, the position at the previous time is the start point of the velocity vector.
[0033]
For the velocity vectors (of the gesture to be recognized) obtained for all the blocks, the vector template of the basic operation having the most similar velocity vectors is detected from among the vector templates of the basic operation stored on the hard disk 12.
[0034]
Specifically, the similarity between the vectors of each vector template and the velocity vectors of the gesture to be recognized is calculated, and the vector template with the highest similarity is detected. Its identification information is displayed on the display 17 as the recognition result for a single basic operation, that is, for the meaning of the instantaneous gesture (step S40 in FIG. 2).
[0035]
An example of the equation for calculating the similarity is shown below.
Let Pkij be the velocity vector at row i and column j of the k-th vector template,
K be the number of patterns, and N and M be the numbers of rows and columns of blocks (9 each in this case). The similarity to the k-th pattern is then defined as follows.

Here, · denotes the vector inner product and | | denotes the absolute value.
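The similarity formula itself is not reproduced in this text. Because it is described in terms of an inner product and absolute values, one plausible form is the sum over all blocks of the normalized inner product (cosine) between the template vector and the observed vector. The sketch below implements that assumed form; the function names, the dictionary layout, and the treatment of zero vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Normalised inner product of two 2-D vectors.

    Returns 0.0 when either vector has zero length (an assumption; the
    text does not say how stationary blocks are handled).
    """
    norm = math.hypot(*u) * math.hypot(*v)
    return (u[0] * v[0] + u[1] * v[1]) / norm if norm else 0.0

def similarity(template, observed):
    """Sum the normalised inner products over all blocks (i, j)."""
    return sum(cosine(template[key], observed[key]) for key in observed)

def recognise(templates, observed):
    """Return the label of the template with the highest similarity."""
    return max(templates,
               key=lambda label: similarity(templates[label], observed))
```

Here `templates` maps an identification string such as "yes" to a dictionary of per-block vectors keyed by (i, j), and `observed` has the same per-block layout.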
[0036]
In the present embodiment, vector templates covering a sequence of time points are also prepared on the hard disk 12 so that continuous gestures can be recognized. When velocity vectors of the recognition target have been acquired for a plurality of time points, the combination of those velocity vectors is compared with the combinations of velocity vectors in the vector templates stored on the hard disk 12, and the identification information of the most similar vector template is displayed as the gesture recognition result for the continuous motion (S50 in FIG. 2).
[0037]
The processing for detecting the position of the image data that is the same as that in the filter area 202 will be described with reference to FIG. 3. After initially setting the positions of the filter area 202 of the divided block and of the reading area in the same block at the previous time point (S100 in FIG. 3), the CPU 10 reads the 16 × 16 image data at the center of the current block from the memory 11 and stores them in the work area. It also reads the previous-time image data from the memory 11 for the initially set previous-time reading area (for example, the 16 × 16 pixels at the upper left of the block) and stores them in the work area. The two sets of read image data are then compared for a match (S120 in FIG. 3). As the comparison method,
(1) as a simple method, the average value of each read image data group is calculated and the average values are compared;
(2) as a detailed method, it is determined whether the difference between the image data values at the same pixel position is within an allowable range, and if so, the pixels are judged to match. This judgment is made for the 16 × 16 pixels, and if at least a predetermined number of pixels are judged to match, the image data groups are regarded as matching.
Comparison methods such as these may be used.
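The two comparison methods can be sketched as follows. The tolerance and the required number of matching pixels are hypothetical values chosen for the example; the text only says that an allowable range and a predetermined pixel count exist.

```python
def mean_match(cur, prev, tolerance=2.0):
    """Method (1): compare the average values of the two regions."""
    def mean(area):
        values = [v for row in area for v in row]
        return sum(values) / len(values)
    return abs(mean(cur) - mean(prev)) <= tolerance

def pixelwise_match(cur, prev, pixel_tolerance=8, min_matching=200):
    """Method (2): match if enough same-position pixels agree.

    A pixel pair agrees when its difference is within pixel_tolerance;
    the regions match when at least min_matching pairs agree.
    """
    matching = sum(
        1
        for row_c, row_p in zip(cur, prev)
        for c, p in zip(row_c, row_p)
        if abs(c - p) <= pixel_tolerance
    )
    return matching >= min_matching
```

Method (1) needs only one subtraction per candidate position once the averages are known, whereas method (2) is stricter because it checks the pixels position by position.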
[0038]
If a match is obtained, a velocity vector is acquired whose start point is the position of the previous-time reading area and whose end point is the center position of the block (in practice, in the form of the coordinates of the start and end points). The distance between the start point and the end point is calculated as the magnitude of the velocity vector, and the vector and its magnitude are stored in the memory 11 (S125 in FIG. 3).
[0039]
On the other hand, if no match is determined in step S120, the previous-time reading area is moved (incremented) by one pixel in the X direction (horizontally to the right) and the image data are read (S135 in FIG. 3). The current image data group and the previous-time image data group are then compared in the same manner as described above. The loop of steps S120 to S130 → S135 → S120 is repeated until a match is obtained. When the reading area reaches the final position in the X direction, this is detected in step S130. The procedure then proceeds from step S130 to S145, where the position of the reading area in the Y direction (vertical direction) is incremented by one pixel and its position in the X direction is reset to the initial position (the leftmost position in the block). The previous-time image data group of the newly set reading area is then read and compared with the current image data group (S145 → S120 in FIG. 3).
[0040]
When the reading area is moved in the X and Y directions in this way, the image data groups of the current and previous time points come to match, and the velocity vector is calculated from the position of the previous-time reading area at that moment (step S125).
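The scan of FIG. 3 as described above can be sketched as a raster loop over the previous-time block. The function names are hypothetical, `is_match` stands for whichever comparison method is chosen, and returning `None` when no position matches is an assumption (the text does not describe that case). The vector is measured between the upper-left corners of the matched reading area and of the centred filter, which is also an assumption about the reference points.

```python
def find_previous_position(cur_filter, prev_block, is_match, size=16):
    """Scan the reading area over the previous-time block in raster order.

    Returns the (x, y) upper-left corner of the first reading area that
    is_match accepts, or None if no position matches.
    """
    limit = len(prev_block) - size
    y = 0                                   # S100: start at the upper left
    while y <= limit:
        x = 0
        while x <= limit:
            area = [row[x:x + size] for row in prev_block[y:y + size]]
            if is_match(cur_filter, area):  # S120: match comparison
                return (x, y)               # S125 uses this position
            x += 1                          # S135: one pixel in X
        y += 1                              # S145: one pixel in Y, reset X
    return None

def velocity_vector(start, block_size=24, size=16):
    """Vector from the matched position to the centred filter position."""
    centre = (block_size - size) // 2
    dx, dy = centre - start[0], centre - start[1]
    return (dx, dy), (dx * dx + dy * dy) ** 0.5
```

A caller would run `find_previous_position` once per block and pass the result, when it is not `None`, to `velocity_vector` to obtain the vector and its magnitude.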
[0041]
(Use of gesture recognition device)
A) Application 1
With the gesture recognition method described above, gestures can be used to convey information between a user's personal computer and a server providing a homepage on the Internet. In this case, as shown in FIG. 5, a CCD camera 1001 is mounted on the personal computer 1000 used by the user, and identification information such as “Yes” or “No” can be transmitted to the server (not shown) via the personal computer 1000 by the user's gesture. The gesture recognition program may be installed on the server or on the personal computer 1000.
[0042]
B) Application 2
As shown in FIG. 6, a video camera 2001 is installed in a show window 2002 for displaying goods, the gestures of people passing the show window 2002 are captured by the video camera 2001, and the captured result is recognized by a gesture recognition device (not shown).
[0043]
C) Application 3
As shown in FIG. 7, a plurality of images, for example portrait photographs 3001, are displayed one after another on the display device of a personal computer (gesture recognition device) 3002. The gesture of a user 3003 is captured by a video camera 3004 and gesture recognition is performed. Each displayed image 3001 is kept or discarded according to the gesture recognition result (“Yes” or “No”). The selected images 3005 are displayed together separately on the display device.
Such information processing can be used for talent auditions and for selecting matchmaking photographs at a marriage agency.
[0044]
D) Application 4
As shown in FIG. 8, a video camera 4001 and a large-screen display 4002 are installed in a school classroom, and video in which a virtual teacher gives a class (lecture) is sent from a system 4100 to the display 4002. The system 4100 performs gesture recognition on the moving image acquired from the video camera 4001 (4101) and, based on the recognition result, uses a lesson progress controller (a DVD chapter controller) to control the display of the lesson video, so that it can jump to a supplementary explanation screen or speed up the explanation as appropriate.
[0045]
E) Application 5
As shown in FIG. 9, video cameras 5001 and 5002 are installed in a classroom, for example at a school. The video camera 5001 captures images of a teacher 5003, and the video camera 5002 captures images of students 5004. It is preferable that the shooting direction of the video cameras can be remotely controlled. The captured images are sent via a communication line to a gesture recognition device 5100, which performs gesture recognition. In this embodiment, velocity vectors are acquired from the images of both the teacher and the students (S101), and the meanings of the gestures of the teacher 5003 and the students 5004 are recognized (S102). The gesture recognition result is output as identification information indicating the teacher's enthusiasm (stimulus) and identification information indicating the students' response, and words or numerical values indicating the teacher's ability corresponding to the output result are calculated. Through such processing, the gesture recognition device can be used to assess the competence of the teacher.
[0046]
The following embodiments can be implemented in addition to the above-described embodiments.
1) The size of the photographing screen and the number of divisions may be appropriately determined according to the device used.
2) There is no restriction on the type of video camera or storage medium, but any suitable one for the gesture recognition device may be used as appropriate.
3) The gesture recognition processing may be executed by a single computer, or may be distributed by a plurality of computers.
[0047]
The above embodiment is an example for describing the invention. According to the technical idea described in the claims, various improvements exist, but these improvements fall within the technical scope of the present invention.
[0048]
[Effects of the Invention]
As described above, according to the present invention, a captured image is divided into a plurality of blocks, and the change in position of the same image data group between two different points in time is obtained. The changes in position across all of the blocks are regarded as the feature of the gesture. In the present invention, images need only be acquired at two points in time, so the time required for gesture recognition is reduced compared with the related art, which requires a time series of images. In addition, the gesture recognition result can be output quickly.
[0049]
In addition, because all the data in the screen, rather than the image data of only a partial area of the shooting screen, are used for feature extraction, the device can follow the subject even when it moves fast, and the subject image does not move outside a fixed feature extraction area as it does in the related art.
[0050]
In addition, by performing the same feature extraction on continuously obtained captured images, it is possible to recognize the meaning of continuous gestures.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a system configuration according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the contents of a gesture recognition processing program.
FIG. 3 is a flowchart showing a partial detailed procedure of a gesture recognition processing program.
FIG. 4 is an explanatory diagram for explaining a feature extraction method according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram for explaining an application of the gesture recognition device.
FIG. 6 is an explanatory diagram for explaining another application of the gesture recognition device.
FIG. 7 is an explanatory diagram for explaining another application of the gesture recognition device.
FIG. 8 is an explanatory diagram for explaining another application of the gesture recognition device.
FIG. 9 is an explanatory diagram for explaining another application of the gesture recognition device.
[Explanation of symbols]
10 CPU
11 Memory
12 Hard disk
14 Video camera

Claims

In an image recognition device that extracts a feature of a gesture to be recognized using a captured image, and performs gesture recognition by comparing a feature whose semantic content of the gesture is known in advance with the extracted feature,
Means for dividing each of the two captured images at the current time point and the previous time point into a plurality of blocks,
Means for detecting a change in the position in the block from the previous time point to the current time point of the predetermined number of image data groups by comparing the captured images at the two divided time points for each block,
An image recognition device, wherein a change in the position of an image data group for the plurality of detected blocks is a feature of the gesture to be recognized.

In a feature extraction method of an image recognition device that extracts features of a gesture to be recognized using a captured image and performs gesture recognition by comparing the extracted features with features whose semantic content is known in advance,
The image recognition device has image processing means and storage means for storing two captured images at the current time point and the previous time point,
The image processing unit divides each of the two captured images at the current time point and the previous time point stored in the storage unit into a plurality of blocks,
By comparing the captured images at the two divided time points for each block, a change in the position in the block from the previous time point to the current time point of the predetermined number of image data groups is detected by the image processing means,
A feature extraction method for an image recognition apparatus, wherein a change in the position of an image data group for the plurality of detected blocks is used as a feature of the recognition target gesture.

In a recording medium on which is recorded a gesture recognition program of an image recognition device that extracts features of a gesture to be recognized using a captured image and performs gesture recognition by comparing the extracted features with features whose semantic content is known in advance,
The image recognition device has image processing means and storage means for storing two captured images at the current time point and the previous time point, and the program comprises:
Dividing, by the image processing means, each of the two captured images at the current time point and the previous time point stored in the storage means into a plurality of blocks; and
Detecting, by the image processing means, a change in the position within the block of the predetermined number of image data groups from the previous time point to the current time point by comparing the captured images at the two divided time points for each block,
A recording medium characterized in that a change in the position of the image data group for the plurality of detected blocks is a feature of the gesture to be recognized.

In a gesture recognition program of an image recognition device that extracts features of a gesture to be recognized using a captured image and performs gesture recognition by comparing the extracted features with features whose semantic content is known in advance,
The image recognition device has image processing means and storage means for storing two captured images at the current time point and the previous time point, and the program comprises:
Dividing, by the image processing means, each of the two captured images at the current time point and the previous time point stored in the storage means into a plurality of blocks; and
Detecting, by the image processing means, a change in the position within the block of the predetermined number of image data groups from the previous time point to the current time point by comparing the captured images at the two divided time points for each block,
A gesture recognition program, wherein a change in the position of an image data group for a plurality of detected blocks is a feature of the gesture to be recognized.