JP2004303014A

JP2004303014A - Gesture recognition device, its method and gesture recognition program

Info

Publication number: JP2004303014A
Application number: JP2003096520A
Authority: JP
Inventors: Nobuo Higaki; 信男檜垣
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2003-03-31
Filing date: 2003-03-31
Publication date: 2004-10-28
Anticipated expiration: 2023-03-31
Also published as: JP4153819B2

Abstract

PROBLEM TO BE SOLVED: To provide a gesture recognition device capable of reducing the arithmetic processing required for posture recognition or gesture recognition.

SOLUTION: A gesture recognition device 4 is provided with: a face and finger position detection means 41 for detecting a face position and finger positions of a target person in real space; and a posture/gesture recognition means 42 for recognizing the posture or gesture of the target person on the basis of the face position and finger positions detected by the face and finger position detection means 41. The posture/gesture recognition means 42 recognizes the posture or gesture of the target person by using the "Bayes method", which is one of the statistical methods.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、カメラによって対象人物を撮像した画像から、対象人物のポスチャ（姿勢）又はジェスチャ（動作）を認識するための装置、方法及びプログラムに関する。
【０００２】
【従来の技術】
従来、カメラによって対象人物を撮像した画像から、対象人物の動きの特徴を示す点（特徴点）を検出し、その特徴点に基づいて対象人物のジェスチャを推定するジェスチャ認識手法が数多く提案されている（例えば、特許文献１参照）。
【０００３】
【特許文献１】
特開２０００−１４９０２５号公報（第３−６頁、第１図）
【０００４】
【発明が解決しようとする課題】
しかし、従来のジェスチャ認識手法では、対象人物のジェスチャを認識する際に、前記特徴点を一々検出する必要があるため、計算量が多くなり演算処理の負担が増加するという問題があった。
【０００５】
本発明は、以上のような問題点に鑑みてなされたものであり、ポスチャ認識又はジェスチャ認識を行う際の演算処理を軽減することができるジェスチャ認識装置、ジェスチャ認識方法及びジェスチャ認識プログラムを提供することを目的とする。
【０００６】
【課題を解決するための手段】
請求項１に記載のジェスチャ認識装置は、カメラによって対象人物を撮像した画像から、前記対象人物のポスチャ又はジェスチャを認識するための装置であって、前記画像から生成した前記対象人物の輪郭情報と肌色領域情報に基づいて、前記対象人物の実空間上における顔位置と手先位置を検出する顔・手先位置検出手段と、前記顔位置と前記手先位置から、前記顔位置を基準とした際の前記手先位置の所定数フレームにおける平均及び分散を特徴ベクトルとして求め、前記特徴ベクトルに基づいて、統計的手法を用いて、全てのポスチャ及びジェスチャの確率変数の事後分布の確率密度を計算し、各フレームにおける前記事後分布の確率密度が最大であるポスチャ又はジェスチャを、そのフレームにおけるポスチャ又はジェスチャと認識するポスチャ・ジェスチャ認識手段とを備えたことを特徴とする。
【０００７】
この装置は、まず、顔・手先位置検出手段によって、画像から生成した対象人物の輪郭情報と肌色領域情報に基づいて、対象人物の実空間上における顔位置と手先位置を検出する。次に、ポスチャ・ジェスチャ認識手段によって、顔位置と手先位置から、顔位置を基準とした際の手先位置の所定数フレーム（例えば５フレーム）における平均及び分散を「特徴ベクトル」として求める。そして、求めた「特徴ベクトル」に基づいて、統計的手法を用いて、全てのポスチャ及びジェスチャの確率変数の「事後分布の確率密度」を計算し、各フレームにおける「事後分布の確率密度」が最大であるポスチャ又はジェスチャを、そのフレームにおけるポスチャ又はジェスチャと認識する。
【０００８】
なお、ポスチャ・ジェスチャ認識手段は、所定数フレームにおいて、同一のポスチャ又はジェスチャを所定回数以上認識できた場合にのみ、前記同一のポスチャ又はジェスチャを認識できたと判断する（請求項２）。
【０００９】
請求項３に記載のジェスチャ認識方法は、カメラによって対象人物を撮像した画像から、前記対象人物のポスチャ又はジェスチャを認識するための方法であって、前記画像から生成した前記対象人物の輪郭情報と肌色領域情報に基づいて、前記対象人物の実空間上における顔位置と手先位置を検出する顔・手先位置検出ステップと、前記顔位置と前記手先位置から、前記顔位置を基準とした際の前記手先位置の所定数フレームにおける平均及び分散を特徴ベクトルとして求め、前記特徴ベクトルに基づいて、統計的手法を用いて、全てのポスチャ及びジェスチャの確率変数の事後分布の確率密度を計算し、各フレームにおける前記事後分布の確率密度が最大であるポスチャ又はジェスチャを、そのフレームにおけるポスチャ又はジェスチャと認識するポスチャ・ジェスチャ認識ステップとを含むことを特徴とする。
【００１０】
この方法は、まず、顔・手先位置検出ステップにおいて、画像から生成した対象人物の輪郭情報と肌色領域情報に基づいて、対象人物の実空間上における顔位置と手先位置を検出する。次に、ポスチャ・ジェスチャ認識ステップにおいて、顔位置と手先位置から、顔位置を基準とした際の手先位置の所定数フレームにおける平均及び分散を「特徴ベクトル」として求める。そして、求めた「特徴ベクトル」に基づいて、統計的手法を用いて、全てのポスチャ及びジェスチャの確率変数の「事後分布の確率密度」を計算し、各フレームにおける「事後分布の確率密度」が最大であるポスチャ又はジェスチャを、そのフレームにおけるポスチャ又はジェスチャと認識する。
【００１１】
請求項４に記載のジェスチャ認識プログラムは、カメラによって対象人物を撮像した画像から、前記対象人物のポスチャ又はジェスチャを認識するために、コンピュータを、前記画像から生成した前記対象人物の輪郭情報と肌色領域情報に基づいて、前記対象人物の実空間上における顔位置と手先位置を検出する顔・手先位置検出手段、前記顔位置と前記手先位置から、前記顔位置を基準とした際の前記手先位置の所定数フレームにおける平均及び分散を特徴ベクトルとして求め、前記特徴ベクトルに基づいて、統計的手法を用いて、全てのポスチャ及びジェスチャの確率変数の事後分布の確率密度を計算し、各フレームにおける前記事後分布の確率密度が最大であるポスチャ又はジェスチャを、そのフレームにおけるポスチャ又はジェスチャと認識するポスチャ・ジェスチャ認識手段、として機能させることを特徴とする。
【００１２】
このプログラムは、まず、顔・手先位置検出手段によって、画像から生成した対象人物の輪郭情報と肌色領域情報に基づいて、対象人物の実空間上における顔位置と手先位置を検出する。次に、ポスチャ・ジェスチャ認識手段によって、顔位置と手先位置から、顔位置を基準とした際の手先位置の所定数フレーム（例えば５フレーム）における平均及び分散を「特徴ベクトル」として求める。そして、求めた「特徴ベクトル」に基づいて、統計的手法を用いて、全てのポスチャ及びジェスチャの確率変数の「事後分布の確率密度」を計算し、各フレームにおける「事後分布の確率密度」が最大であるポスチャ又はジェスチャを、そのフレームにおけるポスチャ又はジェスチャと認識する。
【００１３】
【発明の実施の形態】
以下、本発明の実施の形態について、適宜図面を参照して詳細に説明する。ここでは、まず、本発明に係るジェスチャ認識装置を含むジェスチャ認識システムの構成について図１〜図１３を参照して説明し、その後、ジェスチャ認識システムの動作について図１４及び図１５を参照して説明する。
【００１４】
（ジェスチャ認識システムＡの構成）
まず、本発明に係るジェスチャ認識装置４を含むジェスチャ認識システムＡの全体構成について図１を参照して説明する。図１はジェスチャ認識システムＡの全体構成を示すブロック図である。
【００１５】
図１に示すように、ジェスチャ認識システムＡは、図示しない対象人物を撮像する２台のカメラ１（１ａ，１ｂ）と、カメラ１で撮像された画像（撮像画像）を解析して各種情報を生成する撮像画像解析装置２と、撮像画像解析装置２で生成された各種情報に基づいて対象人物の輪郭を抽出する輪郭抽出装置３と、撮像画像解析装置２で生成された各種情報及び輪郭抽出装置３で抽出された対象人物の輪郭（輪郭情報）に基づいて、対象人物のポスチャ（姿勢）又はジェスチャ（動作）を認識するジェスチャ認識装置４とから構成されている。以下、カメラ１、撮像画像解析装置２、輪郭抽出装置３、ジェスチャ認識装置４について、順に説明する。
【００１６】
（カメラ１）
カメラ１（１ａ，１ｂ）はカラーＣＣＤカメラであり、右カメラ１ａと左カメラ１ｂは、左右に距離Ｂだけ離れて並設されている。ここでは、右カメラ１ａを基準カメラとしている。カメラ１ａ，１ｂで撮像された画像（撮像画像）は、フレーム毎に図示しないフレームグラバに記憶された後、撮像画像解析装置２に同期して入力される。
【００１７】
なお、カメラ１ａ，１ｂで撮像した画像（撮像画像）は、図示しない補正機器によりキャリブレーション処理とレクティフィケーション処理を行い、画像補正した後に撮像画像解析装置２に入力される。
【００１８】
（撮像画像解析装置２）
撮像画像解析装置２は、カメラ１ａ，１ｂから入力された画像（撮像画像）を解析して、「距離情報」、「動き情報」、「エッジ情報」、「肌色領域情報」を生成する装置である（図１参照）。
【００１９】
図２は、図１に示したジェスチャ認識システムＡに含まれる撮像画像解析装置２と輪郭抽出装置３の構成を示すブロック図である。図２に示すように、撮像画像解析装置２は、「距離情報」を生成する距離情報生成部２１と、「動き情報」を生成する動き情報生成部２２と、「エッジ情報」を生成するエッジ情報生成部２３と、「肌色領域情報」を生成する肌色領域情報生成部２４とから構成されている。
【００２０】
（距離情報生成部２１）
距離情報生成部２１は、同時刻にカメラ１ａ，１ｂで撮像された２枚の撮像画像の視差に基づいて、各画素についてカメラ１からの距離を検出する。具体的には、基準カメラであるカメラ１ａで撮像された第１の撮像画像と、カメラ１ｂで撮像された第２の撮像画像とからブロック相関法を用いて視差を求め、その視差から三角法を用いて、カメラ１から「各画素に撮像された物」までの距離を求める。そして、求めた距離を第１の撮像画像の各画素に対応付けて、距離を画素値で表現した距離画像Ｄ１（図３（ａ）参照）を生成する。この距離画像Ｄ１が距離情報となる。図３（ａ）の例では、同一の距離に対象人物Ｃが存在している。
【００２１】
なお、ブロック相関法とは、第１の撮像画像と第２の撮像画像とで特定の大きさの同一ブロック（例えば８×３画素）を比較し、第１の撮像画像と第２の撮像画像とでブロック内の被写体が何画素分ずれているかを調べることにより視差を検出する方法である。
【００２２】
（動き情報生成部２２）
動き情報生成部２２は、基準カメラであるカメラ１ａで時系列に撮像した「時刻ｔ」における「撮像画像（ｔ）」と、「時刻ｔ＋Δｔ」における「撮像画像（ｔ＋Δｔ）」との差分に基づいて、対象人物の動きを検出する。具体的には、「撮像画像（ｔ）」と「撮像画像（ｔ＋Δｔ）」との差分をとり、各画素の変位を調べる。そして、調べた変位に基づいて変位ベクトルを求め、求めた変位ベクトルを画素値で表わした差分画像Ｄ２（図３（ｂ）参照）を生成する。この差分画像Ｄ２が動き情報となる。図３（ｂ）の例では、対象人物Ｃの左腕に動きが検出されている。
【００２３】
（エッジ情報生成部２３）
エッジ情報生成部２３は、基準カメラであるカメラ１ａで撮像された画像（撮像画像）における各画素の濃淡情報又は色情報に基づいて、その撮像画像内に存在するエッジを抽出したエッジ画像を生成する。具体的には、撮像画像における各画素の輝度に基づいて、輝度が大きく変化する部分をエッジとして検出し、そのエッジのみからなるエッジ画像Ｄ３（図３（ｃ）参照）を生成する。このエッジ画像Ｄ３がエッジ情報となる。
【００２４】
エッジの検出は、例えばＳｏｂｅｌオペレータを画素毎に乗算し、行又は列単位で、隣の線分と所定の差がある線分をエッジ（横エッジ又は縦エッジ）として検出する。なお、Ｓｏｂｅｌオペレータとは、ある画素の近傍領域の画素に対して重み係数を持つ係数行列のことである。
【００２５】
（肌色領域情報生成部２４）
肌色領域情報生成部２４は、基準カメラであるカメラ１ａで撮像された画像（撮像画像）から、その撮像画像内に存在する対象人物の肌色領域を抽出する。具体的には、撮像画像における全画素のＲＧＢ値を、色相、明度、彩度からなるＨＬＳ空間に変換し、色相、明度、彩度が予め設定された閾値の範囲内にある画素を肌色領域として抽出する（図３（ｄ）参照）。図３（ｄ）の例では、対象人物Ｃの顔が肌色領域Ｒ１として抽出され、手が肌色領域Ｒ２として抽出されている。この肌色領域Ｒ１，Ｒ２が肌色領域情報となる。
【００２６】
撮像画像解析装置２で生成された「距離情報（距離画像Ｄ１）」、「動き情報（差分画像Ｄ２）」、「エッジ情報（エッジ画像Ｄ３）」は、輪郭抽出装置３に入力される。また、撮像画像解析装置２で生成された「距離情報（距離画像Ｄ１）」と「肌色領域情報（肌色領域Ｒ１，Ｒ２）」は、ジェスチャ認識装置４に入力される。
【００２７】
（輪郭抽出装置３）
輪郭抽出装置３は、撮像画像解析装置２で生成された「距離情報（距離画像Ｄ１）」、「動き情報（差分画像Ｄ２）」、「エッジ情報（エッジ画像Ｄ３）」に基づいて、対象人物の輪郭を抽出する装置である（図１参照）。
【００２８】
図２に示すように、輪郭抽出装置３は、対象人物が存在する距離である「対象距離」を設定する対象距離設定部３１と、「対象距離」に基づいた「対象距離画像」を生成する対象距離画像生成部３２と、「対象距離画像内」における「対象領域」を設定する対象領域設定部３３と、「対象領域内」から「対象人物の輪郭」を抽出する輪郭抽出部３４とから構成されている。
【００２９】
（対象距離設定部３１）
対象距離設定部３１は、撮像画像解析装置２で生成された距離画像Ｄ１（図３（ａ）参照）と、差分画像Ｄ２（図３（ｂ）参照）とに基づいて、対象人物が存在する距離である「対象距離」を設定する。具体的には、距離画像Ｄ１における同一の画素値を有する画素を一群（画素群）として、差分画像Ｄ２における前記画素群の画素値を累計する。そして、画素値の累計値が所定値よりも大きい、かつ、カメラ１に最も近い距離にある領域に、最も動き量の多い移動物体、即ち対象人物が存在しているとみなし、その距離を対象距離とする（図４（ａ）参照）。図４（ａ）の例では、対象距離は２．２ｍに設定されている。対象距離設定部３１で設定された対象距離は、対象距離画像生成部３２に入力される。
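As a rough illustration of the target distance setting described above, the following Python sketch accumulates the difference-image values for each distance value and picks the nearest distance whose total exceeds a threshold. The function name `set_target_distance`, the parameter `min_motion`, and the nested-list image representation are assumptions made for this example and do not appear in the specification.

```python
from collections import defaultdict


def set_target_distance(distance_img, diff_img, min_motion):
    """Return the nearest distance whose accumulated motion exceeds min_motion.

    distance_img and diff_img are equally sized nested lists (hypothetical
    stand-ins for the distance image D1 and the difference image D2).
    """
    motion_by_distance = defaultdict(int)
    for dist_row, diff_row in zip(distance_img, diff_img):
        for dist, motion in zip(dist_row, diff_row):
            # Pixels sharing the same distance value form one pixel group.
            motion_by_distance[dist] += motion
    moving = [d for d, total in motion_by_distance.items() if total > min_motion]
    # Among the groups with enough motion, the one closest to the camera wins.
    return min(moving) if moving else None
```

Returning `None` when no group exceeds the threshold is a choice made for this sketch; the specification does not say what happens in that case.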
【００３０】
（対象距離画像生成部３２）
対象距離画像生成部３２は、撮像画像解析装置２で生成された距離画像Ｄ１（図３（ａ）参照）を参照し、対象距離設定部３１で設定された対象距離±αｍに存在する画素に対応する画素をエッジ画像Ｄ３（図３（ｃ）参照）から抽出した「対象距離画像」を生成する。具体的には、距離画像Ｄ１における対象距離設定部３１から入力された対象距離±αｍに対応する画素を求める。そして、求められた画素のみをエッジ情報生成部２３で生成されたエッジ画像Ｄ３から抽出し、対象距離画像Ｄ４（図４（ｂ）参照）を生成する。したがって、対象距離画像Ｄ４は、対象距離に存在する対象人物をエッジで表現した画像になる。対象距離画像生成部３２で生成された対象距離画像Ｄ４は、対象領域設定部３３と輪郭抽出部３４に入力される。
【００３１】
（対象領域設定部３３）
対象領域設定部３３は、対象距離画像生成部３２で生成された対象距離画像Ｄ４（図４（ｂ）参照）内における「対象領域」を設定する。具体的には、対象距離画像Ｄ４の縦方向の画素値を累計したヒストグラムＨを生成し、ヒストグラムＨにおける度数が最大となる位置を、対象人物Ｃの水平方向における中心位置として特定する（図５（ａ）参照）。そして、特定された中心位置の左右に特定の大きさ（例えば０．５ｍ）の範囲を対象領域Ｔとして設定する（図５（ｂ）参照）。なお、対象領域Ｔの縦方向の範囲は、特定の大きさ（例えば２ｍ）に設定される。また、対象領域Ｔを設定する際は、カメラ１のチルト角や高さ等のカメラパラメータを参照して、対象領域Ｔの設定範囲を補正する。対象領域設定部３３で設定された対象領域Ｔは、輪郭抽出部３４に入力される。
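The histogram step above can be sketched as follows. This is a minimal example under stated assumptions: the target distance image D4 is a nested list of 0/1 edge values, and the half-width is already given in pixels (converting the 0.5 m example into pixels depends on the target distance and the camera parameters, which are not modelled here). The names `target_center_column` and `target_region` are hypothetical.

```python
def target_center_column(target_distance_img):
    """Sum pixel values down each column and return (peak column, histogram)."""
    width = len(target_distance_img[0])
    histogram = [sum(row[c] for row in target_distance_img) for c in range(width)]
    # The column with the largest count is taken as the person's horizontal center.
    center = max(range(width), key=histogram.__getitem__)
    return center, histogram


def target_region(center, half_width_px, width):
    """Horizontal range of the target region, clipped to the image."""
    return max(0, center - half_width_px), min(width - 1, center + half_width_px)
```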
【００３２】
（輪郭抽出部３４）
輪郭抽出部３４は、対象距離画像生成部３２で生成された対象距離画像Ｄ４（図４（ｂ）参照）において、対象領域設定部３３で設定された対象領域Ｔ内から対象人物Ｃの輪郭Ｏを抽出する（図５（ｃ）参照）。具体的には、対象人物Ｃの輪郭Ｏを抽出する際は、「Ｓｎａｋｅｓ」と呼ばれる閉曲線からなる動的輪郭モデルを用いた手法（以下、「スネーク手法」という）を用いる。なお、スネーク手法とは、動的輪郭モデルである「Ｓｎａｋｅｓ」を、予め定義されたエネルギ関数が最小となるように収縮変形させることにより、対象物の輪郭を抽出する手法である。輪郭抽出部３４で抽出された対象人物Ｃの輪郭Ｏは、「輪郭情報」としてジェスチャ認識装置４に入力される（図１参照）。
【００３３】
（ジェスチャ認識装置４）
ジェスチャ認識装置４は、撮像画像解析装置２で生成された「距離情報」及び「肌色領域情報」と、輪郭抽出装置３で生成された「輪郭情報」とに基づいて、対象人物のポスチャ又はジェスチャを認識し、その認識結果を出力する装置である（図１参照）。
【００３４】
図６は、図１に示したジェスチャ認識システムＡに含まれるジェスチャ認識装置４の構成を示すブロック図である。図６に示すように、ジェスチャ認識装置４は、対象人物の実空間上における顔位置と手先位置を検出する顔・手先位置検出手段４１と、顔・手先位置検出手段４１によって検出された顔位置と手先位置に基づいて、対象人物のポスチャ又はジェスチャを認識するポスチャ・ジェスチャ認識手段４２とを備えている。
【００３５】
（顔・手先位置検出手段４１）
顔・手先位置検出手段４１は、実空間上における対象人物の「頭頂部の位置（頭頂部位置）」を検出する頭位置検出部４１Ａと、対象人物の「顔の位置（顔位置）」を検出する顔位置検出部４１Ｂと、対象人物の「手の位置（手位置）」を検出する手位置検出部４１Ｃと、対象人物の「手先の位置（手先位置）」を検出する手先位置検出部４１Ｄとから構成されている。なお、ここでいう「手」とは、腕（ａｒｍ）と手（Ｈａｎｄ）とからなる部位のことであり、「手先」とは、手（Ｈａｎｄ）の指先のことである。
【００３６】
（頭位置検出部４１Ａ）
頭位置検出部４１Ａは、輪郭抽出装置３で生成された輪郭情報に基づいて、対象人物Ｃの「頭頂部位置」を検出する。頭頂部位置の検出方法について図７（ａ）を参照して説明すると、まず、輪郭Ｏで囲まれた領域における重心Ｇを求める（１）。次に、頭頂部位置を探索するための領域（頭頂部位置探索領域）Ｆ１を設定する（２）。頭頂部位置探索領域Ｆ１の横幅（Ｘ軸方向の幅）は、重心ＧのＸ座標を中心にして、予め設定されている人間の平均肩幅Ｗとなるようにする。なお、人間の平均肩幅Ｗは、撮像画像解析装置２で生成された距離情報を参照して設定される。また、頭頂部位置探索領域Ｆ１の縦幅（Ｙ軸方向の幅）は、輪郭Ｏを覆うことができるような幅に設定される。そして、頭頂部位置探索領域Ｆ１内における輪郭Ｏの上端点を、頭頂部位置ｍ１とする（３）。頭位置検出部４１Ａで検出された頭頂部位置ｍ１は、顔位置検出部４１Ｂに入力される。
【００３７】
（顔位置検出部４１Ｂ）
顔位置検出部４１Ｂは、頭位置検出部４１Ａで検出された頭頂部位置ｍ１と、撮像画像解析装置２で生成された肌色領域情報とに基づいて、対象人物Ｃの「顔位置」を検出する。顔位置の検出方法について図７（ｂ）を参照して説明すると、まず、顔位置を探索するための領域（顔位置探索領域）Ｆ２を設定する（４）。顔位置探索領域Ｆ２の範囲は、頭頂部位置ｍ１を基準にして、予め設定されている「おおよそ人間の頭部を覆う大きさ」となるようにする。なお、顔位置探索領域Ｆ２の範囲は、撮像画像解析装置２で生成された距離情報を参照して設定される。
【００３８】
次に、顔位置探索領域Ｆ２内における肌色領域Ｒ１の重心を、画像上における顔位置ｍ２とする（５）。肌色領域Ｒ１については、撮像画像解析装置２で生成された肌色領域情報を参照する。そして、画像上における顔位置ｍ２（Ｘｆ，Ｙｆ）から、撮像画像解析装置２で生成された距離情報を参照して、実空間上における顔位置ｍ２ｔ（Ｘｆｔ，Ｙｆｔ，Ｚｆｔ）を求める。
【００３９】
顔位置検出部４１Ｂで検出された「画像上における顔位置ｍ２」は、手位置検出部４１Ｃと手先位置検出部４１Ｄに入力される。また、顔位置検出部４１Ｂで検出された「実空間上における顔位置ｍ２ｔ」は、図示しない記憶手段に記憶され、ポスチャ・ジェスチャ認識手段４２のポスチャ・ジェスチャ認識部４２Ｂ（図６参照）において対象人物Ｃのポスチャ又はジェスチャを認識する際に使用される。
【００４０】
（手位置検出部４１Ｃ）
手位置検出部４１Ｃは、撮像画像解析装置２で生成された肌色領域情報と、輪郭抽出装置３で生成された輪郭情報とに基づいて、対象人物Ｃの「手位置」を検出する。なお、ここでは、肌色領域情報は、顔位置ｍ２周辺を除いた領域の情報を用いる。手位置の検出方法について図８（ａ）を参照して説明すると、まず、手位置を探索するための領域（手位置探索領域）Ｆ３（Ｆ３Ｒ，Ｆ３Ｌ）を設定する（６）。手位置探索領域Ｆ３は、顔位置検出部４１Ｂで検出された顔位置ｍ２を基準にして、予め設定されている「手が届く範囲（左右の手の届く範囲）」となるようにする。なお、手位置探索領域Ｆ３の大きさは、撮像画像解析装置２で生成された距離情報を参照して設定される。
【００４１】
次に、手位置探索領域Ｆ３内における肌色領域Ｒ２の重心を、画像上における手位置ｍ３とする（７）。肌色領域Ｒ２については、撮像画像解析装置２で生成された肌色領域情報を参照する。なお、ここでは、肌色領域情報は、顔位置ｍ２周辺を除いた領域の情報を用いる。図８（ａ）の例では、肌色領域は手位置探索領域Ｆ３（Ｌ）においてのみ存在しているので、手位置ｍ３は手位置探索領域Ｆ３（Ｌ）においてのみ検出される。また、図８（ａ）の例では、対象人物は長袖の服を着ており、手首より先しか露出していないので、手（ＨＡＮＤ）の位置が手位置ｍ３となる。手位置検出部４１Ｃで検出された「画像上における手位置ｍ３」は、手先位置検出部４１Ｄに入力される。
【００４２】
（手先位置検出部４１Ｄ）
手先位置検出部４１Ｄは、顔位置検出部４１Ｂで検出された顔位置ｍ２と、手位置検出部４１Ｃで検出された手位置ｍ３とに基づいて、対象人物Ｃの「手先位置」を検出する。手先位置の検出方法について図８（ｂ）を参照して説明すると、まず、手位置探索領域Ｆ３Ｌ内において、手先位置を探索するための領域（手先位置探索範囲）Ｆ４を設定する（８）。手先位置探索範囲Ｆ４は、手位置ｍ３を中心にして、予め設定されている「おおよそ手を覆う大きさ」となるようにする。なお、手先位置探索範囲Ｆ４の範囲は、撮像画像解析装置２で生成された距離情報を参照して設定される。
【００４３】
続いて、手先位置探索範囲Ｆ４における肌色領域Ｒ２の上下左右の端点ｍ４ａ〜ｍ４ｄを検出する（９）。肌色領域Ｒ２については、撮像画像解析装置２で生成された肌色領域情報を参照する。そして、上下端点間（ｍ４ａ、ｍ４ｂ間）の垂直方向距離ｄ１と、左右端点間（ｍ４ｃ、ｍ４ｄ間）の水平方向距離ｄ２とを比較し、距離が長い方を手が伸びている方向と判断する（１０）。図８（ｂ）の例では、垂直方向距離ｄ１の方が水平方向距離ｄ２よりも距離が長いので、手先は上下方向に伸びていると判断される。
【００４４】
次に、画像上における顔位置ｍ２と、画像上における手位置ｍ３との位置関係に基づいて、上下端点ｍ４ａ，ｍ４ｂのどちら（もしくは左右端点ｍ４ｃ，ｍ４ｄのどちらか）が手先位置であるかを判断する。具体的には、手位置ｍ３が顔位置ｍ２から遠い場合は、手は伸びているとみなし、顔位置ｍ２から遠い方の端点を手先位置（画像上における手先位置）ｍ４と判断する。逆に、手位置ｍ３が顔位置ｍ２に近い場合は、肘を曲げているとみなし、顔位置ｍ２に近い方の端点を手先位置ｍ４と判断する。図８（ｂ）の例では、手位置ｍ３が顔位置ｍ２から遠く、上端点ｍ４ａが下端点ｍ４ｂよりも顔位置ｍ２から遠いので、上端点ｍ４ａが手先位置ｍ４であると判断する（１１）。
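Steps (10) and (11) above amount to a small decision rule, sketched below in Python. Points are `(x, y)` tuples in image coordinates. The function name `select_fingertip` and the parameter `far_threshold` (the distance beyond which the hand counts as "far" from the face) are assumptions for this example; the specification gives no numeric criterion, and it does not say what to do when the two extents are equal.

```python
import math


def select_fingertip(face, hand, top, bottom, left, right, far_threshold):
    """Pick the end point of the skin-color region that is taken as the fingertip."""
    # Step (10): the longer extent gives the direction in which the hand extends.
    d1 = abs(bottom[1] - top[1])   # vertical distance between top and bottom end points
    d2 = abs(right[0] - left[0])   # horizontal distance between left and right end points
    # Assumption: a tie (d1 == d2) falls back to the horizontal pair.
    candidates = (top, bottom) if d1 > d2 else (left, right)
    # Step (11): hand far from the face -> arm extended -> take the far end point;
    # hand near the face -> elbow bent -> take the near end point.
    pick = max if math.dist(face, hand) > far_threshold else min
    return pick(candidates, key=lambda p: math.dist(face, p))
```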
【００４５】
そして、画像上における手先位置ｍ４（Ｘｈ，Ｙｈ）から、撮像画像解析装置２で生成された距離情報を参照して、実空間上における手先位置ｍ４ｔ（Ｘｈｔ，Ｙｈｔ，Ｚｈｔ）を求める。手先位置検出部４１Ｄで検出された「実空間上における手先位置ｍ４ｔ」は、図示しない記憶手段に記憶され、ポスチャ・ジェスチャ認識手段４２のポスチャ・ジェスチャ認識部４２Ｂ（図６参照）において対象人物Ｃのポスチャ又はジェスチャを認識する際に使用される。
【００４６】
（ポスチャ・ジェスチャ認識手段４２）
ポスチャ・ジェスチャ認識手段４２は、ポスチャデータとジェスチャデータを記憶するポスチャ・ジェスチャデータ記憶部４２Ａと、顔・手先位置検出手段４１によって検出された「実空間上における顔位置ｍ２ｔ」と「実空間上における手先位置ｍ４ｔ」に基づいて、対象人物のポスチャ又はジェスチャを認識するポスチャ・ジェスチャ認識部４２Ｂとから構成されている（図６参照）。
【００４７】
（ポスチャ・ジェスチャデータ記憶部４２Ａ）
ポスチャ・ジェスチャデータ記憶部４２Ａは、ポスチャデータＰ１〜Ｐ４（図９参照）とジェスチャデータＪ１〜Ｊ４（図１０参照）を記憶している。ポスチャデータＰ１〜Ｐ４とジェスチャデータＪ１〜Ｊ４は、「顔位置を基準とした際の手先位置と、その手先位置の変動」に対応するポスチャ又はジェスチャを記したデータである。ポスチャデータＰ１〜Ｐ４とジェスチャデータＪ１〜Ｊ４は、ポスチャ・ジェスチャ認識部４２Ｂにおいて対象人物のポスチャ又はジェスチャを認識する際に使用される。
【００４８】
ポスチャデータＰ１〜Ｐ４について図９を参照して説明すると、図９（ａ）に示す「ポスチャＰ１：ＦＡＣＥＳＩＤＥ」は「こんにちは」、図９（ｂ）に示す「ポスチャＰ２：ＨＩＧＨＨＡＮＤ」は「追従開始」、図９（ｃ）に示す「ポスチャＰ３：ＳＩＤＥＨＡＮＤ」は「手の方向を見よ」、図９（ｄ）に示す「ポスチャＰ４：ＬＯＷＨＡＮＤ」は「手の方向に曲がれ」を意味するポスチャである。
【００４９】
また、ジェスチャＪ１〜Ｊ４について図１０を参照して説明すると、図１０（ａ）に示す「ジェスチャＪ１：ＨＡＮＤＳＷＩＮＧ」は「注意せよ」、図１０（ｂ）に示す「ジェスチャＪ２：ＢＹＥＢＹＥ」は「さようなら」、図１０（ｃ）に示す「ジェスチャＪ３：ＣＯＭＥＨＥＲＥ」は「接近せよ」、図１０（ｄ）に示す「ジェスチャＪ４：ＨＡＮＤＣＩＲＣＬＩＮＧ」は「旋回せよ」を意味するジェスチャである。
【００５０】
なお、本実施の形態では、ポスチャ・ジェスチャデータ記憶部４２Ａ（図６参照）は、ポスチャデータＰ１〜Ｐ４（図９参照）とジェスチャデータＪ１〜Ｊ４（図１０参照）を記憶しているが、ポスチャ・ジェスチャデータ記憶部４２Ａに記憶させるポスチャデータとジェスチャデータは任意に設定することができる。また、各ポスチャと各ジェスチャの意味も任意に設定することができる。
【００５１】
ポスチャ・ジェスチャ認識部４２Ｂは、統計的手法の一つである「ベイズの手法」を用いて、対象人物のポスチャ又はジェスチャを認識する。具体的には、まず、顔・手先位置検出手段４１によって検出された「実空間上における顔位置ｍ２ｔ」と「実空間上における手先位置ｍ４ｔ」から、「顔位置ｍ２ｔを基準とした際の手先位置の所定数フレーム（例えば５フレーム）における平均及び分散」を特徴ベクトルｘとして求める。そして、求めた特徴ベクトルｘに基づいて、ベイズの手法を用いて、全てのポスチャ及びジェスチャｉの確率変数ωｉの「事後分布の確率密度」を計算し、各フレームにおける「事後分布の確率密度」が最大であるポスチャ又はジェスチャを、そのフレームにおけるポスチャ又はジェスチャと認識する。
【００５２】
次に、図１１及び図１２に示すフローチャートを参照して、ポスチャ・ジェスチャ認識部４２Ｂにおけるポスチャ又はジェスチャの認識方法について詳しく説明する。ここでは、まず、図１１に示すフローチャートを参照してポスチャ・ジェスチャ認識部４２Ｂでの処理の概略について説明し、その後、図１２に示すフローチャートを参照して図１１に示したフローチャートにおける「ステップＳ１：ポスチャ・ジェスチャ認識処理」について説明する。
【００５３】
（ポスチャ・ジェスチャ認識部４２Ｂでの処理の概略）
図１１は、ポスチャ・ジェスチャ認識部４２Ｂでの処理の概略を説明するためのフローチャートである。図１１に示すフローチャートを参照して、まず、ステップＳ１では、ポスチャ又はジェスチャの認識を試みる。次に、ステップＳ２では、ステップＳ１においてポスチャ又はジェスチャを認識できたかどうかを判断する。ここで、ポスチャ又はジェスチャを認識できたと判断された場合はステップＳ３に進み、ポスチャ又はジェスチャを認識できなかったと判断された場合はステップＳ５に進む。
【００５４】
ステップＳ３では、過去の所定数のフレーム（例えば１０フレーム）において、同一のポスチャ又はジェスチャを所定回数（例えば５回）以上認識できたかどうかを判断する。ここで、同一のポスチャ又はジェスチャを所定回数以上認識できたと判断された場合はステップＳ４に進み、同一のポスチャ又はジェスチャを所定回数以上認識できなかったと判断された場合はステップＳ５に進む。
【００５５】
そして、ステップＳ４では、ステップＳ１において認識されたポスチャ又はジェスチャを認識結果として出力し、処理を終了する。また、ステップＳ５では、ポスチャ又はジェスチャを認識できなかった、即ち認識不能であると出力し、処理を終了する。
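The flow of steps S2 to S5 can be sketched as a small filter over the per-frame results of step S1. This is only an illustration: the class name `RecognitionFilter` is hypothetical, the window of 10 frames and the count of 5 are the examples given in the text, and counting the current frame as part of the window is an assumption.

```python
from collections import deque


class RecognitionFilter:
    """Accept a posture/gesture only if it recurs often enough in recent frames."""

    def __init__(self, window=10, min_count=5):
        self.history = deque(maxlen=window)  # oldest result drops out automatically
        self.min_count = min_count

    def update(self, label):
        """label is the per-frame result of step S1, or None if nothing was recognized."""
        self.history.append(label)
        if label is None:                                 # S2 -> S5: not recognizable
            return None
        if self.history.count(label) >= self.min_count:   # S3 -> S4: output the result
            return label
        return None                                       # S3 -> S5: not recognizable
```

With these settings a label that appears in only three frames is never output.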
【００５６】
（ステップＳ１：ポスチャ・ジェスチャ認識処理）
図１２は、図１１に示したフローチャートにおける「ステップＳ１：ポスチャ・ジェスチャ認識処理」について説明するためのフローチャートである。図１２に示すフローチャートを参照して、まず、ステップＳ１１では、顔・手先位置検出手段４１によって検出された「実空間上における顔位置ｍ２ｔ（Ｘｆｔ，Ｙｆｔ，Ｚｆｔ）」と「実空間上における手先位置ｍ４ｔ（Ｘｈｔ，Ｙｈｔ，Ｚｈｔ）」から、「顔位置ｍ２ｔを基準とした際の手先位置の所定数フレーム（例えば５フレーム）における平均及び分散」を特徴ベクトルｘとして求める。
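One plausible way to compute the feature vector of step S11 is sketched below. The layout of the vector (three means followed by three variances), the use of the population variance, and the function name `feature_vector` are assumptions for this example; the specification only states that the mean and variance of the fingertip position relative to the face position are taken over a predetermined number of frames.

```python
from statistics import mean, pvariance


def feature_vector(face_positions, fingertip_positions):
    """Mean and variance of the fingertip position relative to the face position.

    Both arguments are lists of (X, Y, Z) real-space positions, one per frame
    (for example the last 5 frames).
    """
    relative = [tuple(h - f for h, f in zip(hand, face))
                for face, hand in zip(face_positions, fingertip_positions)]
    axes = list(zip(*relative))  # (all X values), (all Y values), (all Z values)
    return [mean(a) for a in axes] + [pvariance(a) for a in axes]
```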
【００５７】
次のステップＳ１２では、ステップＳ１１で求めた特徴ベクトルｘに基づいて、ベイズの手法を用いて、全てのポスチャ及びジェスチャｉの確率変数ωｉの「事後分布の確率密度」を計算する。
【００５８】
ステップＳ１２における「事後分布の確率密度」の求め方について詳しく説明すると、まず、特徴ベクトルｘが与えられたときに、それがポスチャ又はジェスチャｉである確率密度Ｐ（ωｉ｜ｘ）は、下記の「ベイズの定理」と呼ばれる式（１）により求められる。なお、確率変数ωｉは各ポスチャ及び各ジェスチャ毎に予め設定されている。
【００５９】
【数１】
$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)} \quad (1)$$
【００６０】
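式（１）におけるＰ（ｘ｜ωｉ）は、ポスチャ又はジェスチャｉが与えられたときに、画像が特徴ベクトルｘを含む「条件付き確率密度」であり、下記の式（２）で表わされる。なお、特徴ベクトルｘは、共分散行列Σを持ち、期待値 $\bar{X}$ の正規分布に従うものとする（ｎは特徴ベクトルｘの次元数）。
【００６１】
【数２】
$$P(x \mid \omega_i) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \bar{X})^{T}\,\Sigma^{-1}\,(x - \bar{X})\right] \quad (2)$$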
【００６２】
また、式（１）におけるＰ（ωｉ）は、確率変数ωｉの「事前分布の確率密度」であり、下記の式（３）で表わされる。なお、Ｐ（ωｉ）は、期待値ωｉ_０、分散Ｖ［ωｉ_０］の正規分布であるとする。
【００６３】
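【数３】
$$P(\omega_i) = \frac{1}{\sqrt{2\pi V[\omega_{i\_0}]}} \exp\left[-\frac{(\omega_i - \omega_{i\_0})^{2}}{2V[\omega_{i\_0}]}\right] \quad (3)$$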
【００６４】
式（１）の右項の分母はωｉに依存しないので、式（２），式（３）より、確率変数ωｉの「事後分布の確率密度」は、下記の式（４）で表わされる。
【００６５】
【数４】
$$P(\omega_i \mid x) \propto P(x \mid \omega_i)\,P(\omega_i) \quad (4)$$
【００６６】
図１２のフローチャートに戻り、次のステップＳ１３では、各フレームにおける「事後分布の確率密度」が最大であるポスチャ又はジェスチャを求め、続くステップＳ１４では、ステップＳ１３で求められたポスチャ又はジェスチャが各フレームにおけるポスチャ又はジェスチャであるという認識結果を出力し、処理を終了する。
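Steps S12 to S14 can be illustrated with a simplified maximum-a-posteriori classifier. The sketch below is not the computation of the specification: it assumes a diagonal covariance (independent components) instead of a full covariance matrix, and it represents the prior of each posture or gesture as a single log value rather than as a normal density over the random variable. The names `log_gaussian`, `recognize`, and `models` are hypothetical. Working with logarithms does not change which candidate is largest.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with independent components."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))


def recognize(x, models):
    """Return (best label, scores) for feature vector x.

    models maps each posture/gesture label to
    (mean vector, variance vector, log prior); all values are assumed to be
    prepared in advance.
    """
    # Unnormalized log posterior: log P(x | w_i) + log P(w_i).
    scores = {label: log_gaussian(x, mean, var) + log_prior
              for label, (mean, var, log_prior) in models.items()}
    # Step S13: the candidate with the largest posterior is the per-frame result.
    return max(scores, key=scores.get), scores
```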
【００６７】
図１３は、フレーム１〜１００における、ポスチャＰ１〜Ｐ４及びジェスチャＪ１〜Ｊ４についての確率変数ωｉの「事後分布の確率密度」を示すグラフである。なお、ここでは、ポスチャＰ１〜Ｐ４及びジェスチャＪ１〜Ｊ４を、「ｉ（ｉ＝１〜８）」としている。
【００６８】
図１３に示すように、フレーム１〜２６では「ジェスチャＪ２：ＢＹＥＢＹＥ」の確率密度が最大となるので、フレーム１〜２６では、対象人物のポスチャ又はジェスチャは「ジェスチャＪ２：ＢＹＥＢＹＥ」（図１０（ｂ）参照）であると認識される。また、フレーム２７〜４３では「ポスチャＰ１：ＦＡＣＥＳＩＤＥ」の確率密度が最大となるので、フレーム２７〜４３では対象人物のポスチャ又はジェスチャは「ポスチャＰ１：ＦＡＣＥＳＩＤＥ」（図９（ａ）参照）であると認識される。
【００６９】
また、フレーム４４〜７６では「ジェスチャＪ３：ＣＯＭＥＨＥＲＥ」の確率密度が最大となるので、フレーム４４〜７６では対象人物のポスチャ又はジェスチャは「ジェスチャＪ３：ＣＯＭＥＨＥＲＥ」（図１０（ｃ）参照）であると認識される。そして、フレーム８０〜１００では「ジェスチャＪ１：ＨＡＮＤＳＷＩＮＧ」の確率密度が最大となるので、フレーム８０〜１００では対象人物のポスチャ又はジェスチャは「ジェスチャＪ１：ＨＡＮＤＳＷＩＮＧ」（図１０（ａ）参照）であると認識される。
【００７０】
なお、フレーム７７〜７９では「ジェスチャＪ４：ＨＡＮＤＣＩＲＣＬＩＮＧ」の確率密度が最大となるが、「ジェスチャＪ４：ＨＡＮＤＣＩＲＣＬＩＮＧ」は３回しか認識されていないため、対象人物のポスチャ又はジェスチャは「ジェスチャＪ４：ＨＡＮＤＣＩＲＣＬＩＮＧ」であるとは認識されない。これは、前記したように、ポスチャ・ジェスチャ認識部４２Ｂでは、過去の所定数のフレーム（例えば１０フレーム）において、同一のポスチャ又はジェスチャを所定回数（例えば５回）以上認識できた場合にのみ、前記同一のポスチャ又はジェスチャを認識できたと判断するためである（図１１に示すフローチャートのステップＳ３〜ステップＳ５参照）。
【００７１】
以上のようにして、ポスチャ・ジェスチャ認識部４２Ｂは、ベイズの手法を用いて、全てのポスチャ及びジェスチャｉ（ｉ＝１〜８）の確率変数ωｉの「事後分布の確率密度」を計算し、各フレームにおける「事後分布の確率密度」が最大であるポスチャ又はジェスチャを調べることにより、対象人物のポスチャ又はジェスチャを認識することができる。
【００７２】
（ジェスチャ認識システムＡの動作）
次に、ジェスチャ認識システムＡの動作について図１に示すジェスチャ認識システムＡの全体構成を示すブロック図と、図１４及び図１５に示すフローチャートを参照して説明する。図１４は、ジェスチャ認識システムＡの動作における「撮像画像解析ステップ」と「輪郭抽出ステップ」を説明するために示すフローチャートであり、図１５は、ジェスチャ認識システムＡの動作における「顔・手先位置検出ステップ」と「ポスチャ・ジェスチャ認識ステップ」を説明するために示すフローチャートである。
【００７３】
＜撮像画像解析ステップ＞
図１４に示すフローチャートを参照して、撮像画像解析装置２では、カメラ１ａ，１ｂから撮像画像が入力されると（ステップＳ１０１）、距離情報生成部２１において、撮像画像から距離情報である距離画像Ｄ１（図３（ａ）参照）を生成し（ステップＳ１０２）、動き情報生成部２２において、撮像画像から動き情報である差分画像Ｄ２（図３（ｂ）参照）を生成する（ステップＳ１０３）。また、エッジ情報生成部２３において、撮像画像からエッジ情報であるエッジ画像Ｄ３（図３（ｃ）参照）を生成し（ステップＳ１０４）、肌色領域情報生成部２４において、撮像画像から肌色領域情報である肌色領域Ｒ１，Ｒ２（図３（ｄ）参照）を抽出する（ステップＳ１０５）。
【００７４】
＜輪郭抽出ステップ＞
引き続き図１４に示すフローチャートを参照して、輪郭抽出装置３では、まず、対象距離設定部３１において、ステップＳ１０２とステップＳ１０３で生成された距離画像Ｄ１と差分画像Ｄ２から、対象人物が存在する距離である対象距離を設定する（ステップＳ１０６）。続いて、対象距離画像生成部３２において、ステップＳ１０４で生成されたエッジ画像Ｄ３からステップＳ１０６で設定された対象距離に存在する画素を抽出した対象距離画像Ｄ４（図４（ｂ）参照）を生成する（ステップＳ１０７）。
【００７５】
次に、対象領域設定部３３において、ステップＳ１０７で生成された対象距離画像Ｄ４内における対象領域Ｔ（図５（ｂ）参照）を設定する（ステップＳ１０８）。そして、輪郭抽出部３４において、ステップＳ１０８で設定された対象領域Ｔ内から、対象人物Ｃの輪郭Ｏ（図５（ｃ）参照）を抽出する（ステップＳ１０９）。
【００７６】
＜顔・手先位置検出ステップ＞
図１５に示すフローチャートを参照して、ジェスチャ認識装置４の顔・手先位置検出手段４１では、まず、頭位置検出部４１Ａにおいて、ステップＳ１０９で生成された輪郭情報に基づいて、対象人物Ｃの頭頂部位置ｍ１（図７（ａ）参照）を検出する（ステップＳ１１０）。
【００７７】
続いて、顔位置検出部４１Ｂにおいて、ステップＳ１１０で検出された頭頂部位置ｍ１と、ステップＳ１０５で生成された肌色領域情報とに基づいて、「画像上における顔位置ｍ２」（図７（ｂ）参照）を検出し、検出された「画像上における顔位置ｍ２（Ｘｆ，Ｙｆ）」から、ステップＳ１０２で生成された距離情報を参照して、「実空間上における顔位置ｍ２ｔ（Ｘｆｔ，Ｙｆｔ，Ｚｆｔ）」を求める（ステップＳ１１１）。
【００７８】
次に、手位置検出部４１Ｃにおいて、ステップＳ１１１で検出された「画像上における顔位置ｍ２」から、「画像上における手位置ｍ３」（図８（ａ）参照）を検出する（ステップＳ１１２）。
【００７９】
そして、手先位置検出部４１Ｄにおいて、顔位置検出部４１Ｂで検出された「画像上における顔位置ｍ２」と、手位置検出部４１Ｃで検出された手位置ｍ３とに基づいて、「画像上における手先位置ｍ４」（図８（ｂ）参照）を検出し、検出された「画像上における手先位置ｍ４（Ｘｈ，Ｙｈ）」から、ステップＳ１０２で生成された距離情報を参照して、「実空間上における手先位置ｍ４ｔ（Ｘｈｔ，Ｙｈｔ，Ｚｈｔ）」を求める（ステップＳ１１３）。
【００８０】
＜ポスチャ・ジェスチャ認識ステップ＞
引き続き図１５に示すフローチャートを参照して、ジェスチャ認識装置４のポスチャ・ジェスチャ認識部４２Ｂでは、統計的手法の一つである「ベイズの手法」を用いて、対象人物のポスチャ又はジェスチャを認識する。ポスチャ・ジェスチャ認識部４２Ｂにおけるポスチャ又はジェスチャの認識方法の詳細については、前記したのでここでは省略する。
【００８１】
以上、ジェスチャ認識システムＡについて説明したが、このジェスチャ認識システムＡに含まれるジェスチャ認識装置４は、コンピュータにおいて各手段を各機能プログラムとして実現することも可能であり、各機能プログラムを結合してジェスチャ認識プログラムとして動作させることも可能である。
【００８２】
また、このジェスチャ認識システムＡは、例えば自律ロボットに適用することができる。その場合、自律ロボットは、例えば人が手を上げるとそのポスチャを「ポスチャＰ２：ＨＩＧＨＨＡＮＤ」（図９（ｂ）参照）と認識することや、人が手を振るとそのジェスチャを「ジェスチャＪ１：ＨＡＮＤＳＷＩＮＧ」（図１０（ａ）参照）と認識することが可能となる。
【００８３】
なお、ポスチャやジェスチャによる指示は、音声による指示と比べて、周囲の騒音により左右されない、音声が届かないような状況でも指示が可能である、言葉では表現が難しい（又は冗長になる）指示を簡潔に行うことができる、という利点がある。
【００８４】
【発明の効果】
以上、詳細に説明したように、本発明によれば、対象人物のジェスチャを認識する際に、特徴点（対象人物の動きの特徴を示す点）を一々検出する必要が無いため、従来のジェスチャ認識手法と比べて、ポスチャ認識又はジェスチャ認識を行う際の演算処理を軽減することができる。
【図面の簡単な説明】
【図１】ジェスチャ認識システムＡの全体構成を示すブロック図である。
【図２】図１に示したジェスチャ認識システムＡに含まれる撮像画像解析装置２と輪郭抽出装置３の構成を示すブロック図である。
【図３】（ａ）は距離画像Ｄ１、（ｂ）は差分画像Ｄ２、（ｃ）はエッジ画像Ｄ３、（ｄ）は肌色領域Ｒ１，Ｒ２を示す図である。
【図４】対象距離を設定する方法を説明するための図である。
【図５】対象領域Ｔを設定する方法と、対象領域Ｔ内から対象人物Ｃの輪郭Ｏを抽出する方法を説明するための図である。
【図６】図１に示したジェスチャ認識システムＡに含まれるジェスチャ認識装置４の構成を示すブロック図である。
【図７】（ａ）は頭頂部位置ｍ１の検出方法を説明するための図であり、（ｂ）は顔位置ｍ２の検出方法を説明するための図である。
【図８】（ａ）は手位置ｍ３の検出方法を説明するための図であり、（ｂ）は手先位置ｍ４の検出方法を説明するための図である。
【図９】ポスチャデータＰ１〜Ｐ４を示す図である。
【図１０】ジェスチャデータＪ１〜Ｊ４を示す図である。
【図１１】ポスチャ・ジェスチャ認識部４２Ｂでの処理の概略を説明するためのフローチャートである。
【図１２】図１１に示したフローチャートにおける「ステップＳ１：ポスチャ・ジェスチャ認識処理」について説明するためのフローチャートである。
【図１３】フレーム１〜１００における、ポスチャＰ１〜Ｐ４及びジェスチャＪ１〜Ｊ４についての確率変数ωｉの「事後分布の確率密度」を示すグラフである。
【図１４】ジェスチャ認識システムＡの動作における「撮像画像解析ステップ」と「輪郭抽出ステップ」を説明するために示すフローチャートである。
【図１５】ジェスチャ認識システムＡの動作における「顔・手先位置検出ステップ」と「ポスチャ・ジェスチャ認識ステップ」を説明するために示すフローチャートである。
【符号の説明】
Ａ　ジェスチャ認識システム
１　カメラ
２　撮像画像解析装置
３　輪郭抽出装置
４　ジェスチャ認識装置
４１　顔・手先位置検出手段
４１Ａ　頭位置検出部
４１Ｂ　顔位置検出部
４１Ｃ　手位置検出部
４１Ｄ　手先位置検出部
４２　ポスチャ・ジェスチャ認識手段
４２Ａ　ポスチャ・ジェスチャデータ記憶部
４２Ｂ　ポスチャ・ジェスチャ認識部

[0001]
[Technical Field of the Invention]
The present invention relates to an apparatus, a method, and a program for recognizing a posture (static pose) or a gesture (motion) of a target person from an image of the target person captured by a camera.
[0002]
[Prior art]
Conventionally, many gesture recognition methods have been proposed that detect points indicating characteristics of the target person's movement (feature points) from an image of the target person captured by a camera, and estimate the target person's gesture from those feature points (see, for example, Patent Document 1).
[0003]
[Patent Document 1]
JP-A-2000-149025 (page 3-6, FIG. 1)
[0004]
[Problems to be solved by the invention]
However, conventional gesture recognition methods must detect the feature points one by one in order to recognize the target person's gesture, which increases the amount of calculation and therefore the processing load.
[0005]
The present invention has been made in view of the above problem, and its object is to provide a gesture recognition device, a gesture recognition method, and a gesture recognition program that can reduce the arithmetic processing required for posture recognition or gesture recognition.
[0006]
[Means for Solving the Problems]
The gesture recognition device according to claim 1 is a device for recognizing a posture or a gesture of a target person from an image of the target person captured by a camera, and comprises: face/fingertip position detection means for detecting a face position and a fingertip position of the target person in real space, based on contour information and skin color region information of the target person generated from the image; and posture/gesture recognition means for obtaining, from the face position and the fingertip position, the mean and variance over a predetermined number of frames of the fingertip position relative to the face position as a feature vector, calculating from the feature vector, by a statistical method, the probability density of the posterior distribution of the random variable of every posture and gesture, and recognizing the posture or gesture whose posterior probability density is largest in each frame as the posture or gesture in that frame.
[0007]
In this device, the face/fingertip position detection means first detects the face position and the fingertip position of the target person in real space, based on the contour information and skin color region information of the target person generated from the image. Next, the posture/gesture recognition means obtains, from the face position and the fingertip position, the mean and variance over a predetermined number of frames (for example, five frames) of the fingertip position relative to the face position as a “feature vector”. Then, based on the obtained “feature vector”, it uses a statistical method to calculate the “probability density of the posterior distribution” of the random variable of every posture and gesture, and recognizes the posture or gesture whose “probability density of the posterior distribution” is largest in each frame as the posture or gesture in that frame.
[0008]
Note that the posture / gesture recognition means determines that the same posture or gesture has been recognized only when the same posture or gesture has been recognized a predetermined number of times or more in a predetermined number of frames (claim 2).
[0009]
The gesture recognition method according to claim 3 is a method for recognizing a posture or a gesture of a target person from an image of the target person captured by a camera, and includes: a face/fingertip position detection step of detecting a face position and a fingertip position of the target person in real space, based on contour information and skin color region information of the target person generated from the image; and a posture/gesture recognition step of obtaining, from the face position and the fingertip position, the mean and variance over a predetermined number of frames of the fingertip position relative to the face position as a feature vector, calculating from the feature vector, by a statistical method, the probability density of the posterior distribution of the random variable of every posture and gesture, and recognizing the posture or gesture whose posterior probability density is largest in each frame as the posture or gesture in that frame.
[0010]
In this method, the face/fingertip position detection step first detects the face position and the fingertip position of the target person in real space, based on the contour information and skin color region information of the target person generated from the image. Next, the posture/gesture recognition step obtains, from the face position and the fingertip position, the mean and variance over a predetermined number of frames of the fingertip position relative to the face position as a “feature vector”. Then, based on the obtained “feature vector”, a statistical method is used to calculate the “probability density of the posterior distribution” of the random variable of every posture and gesture, and the posture or gesture whose “probability density of the posterior distribution” is largest in each frame is recognized as the posture or gesture in that frame.
[0011]
The gesture recognition program according to claim 4 causes a computer, in order to recognize a posture or a gesture of a target person from an image of the target person captured by a camera, to function as: face/fingertip position detection means for detecting a face position and a fingertip position of the target person in real space, based on contour information and skin color region information of the target person generated from the image; and posture/gesture recognition means for obtaining, from the face position and the fingertip position, the mean and variance over a predetermined number of frames of the fingertip position relative to the face position as a feature vector, calculating from the feature vector, by a statistical method, the probability density of the posterior distribution of the random variable of every posture and gesture, and recognizing the posture or gesture whose posterior probability density is largest in each frame as the posture or gesture in that frame.
[0012]
In this program, the face/fingertip position detection means first detects the face position and the fingertip position of the target person in real space, based on the contour information and skin color region information of the target person generated from the image. Next, the posture/gesture recognition means obtains, from the face position and the fingertip position, the mean and variance over a predetermined number of frames (for example, five frames) of the fingertip position relative to the face position as a “feature vector”. Then, based on the obtained “feature vector”, it uses a statistical method to calculate the “probability density of the posterior distribution” of the random variable of every posture and gesture, and recognizes the posture or gesture whose “probability density of the posterior distribution” is largest in each frame as the posture or gesture in that frame.
[0013]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings as appropriate. First, the configuration of a gesture recognition system including the gesture recognition device according to the present invention will be described with reference to FIGS. 1 to 13, and then the operation of the gesture recognition system will be described with reference to FIGS. 14 and 15.
[0014]
(Configuration of Gesture Recognition System A)
First, an overall configuration of a gesture recognition system A including a gesture recognition device 4 according to the present invention will be described with reference to FIG. FIG. 1 is a block diagram showing the overall configuration of the gesture recognition system A.
[0015]
As shown in FIG. 1, the gesture recognition system A comprises: two cameras 1 (1a, 1b) that capture a target person (not shown); a captured image analysis device 2 that analyzes the images (captured images) taken by the cameras 1 and generates various kinds of information; a contour extraction device 3 that extracts the contour of the target person based on the information generated by the captured image analysis device 2; and a gesture recognition device 4 that recognizes the posture (static pose) or gesture (motion) of the target person based on the information generated by the captured image analysis device 2 and the contour of the target person (contour information) extracted by the contour extraction device 3. The camera 1, the captured image analysis device 2, the contour extraction device 3, and the gesture recognition device 4 are described below in that order.
[0016]
(Camera 1)
The cameras 1 (1a, 1b) are color CCD cameras, and the right camera 1a and the left camera 1b are arranged side by side, separated horizontally by a distance B. Here, the right camera 1a is used as the reference camera. The images (captured images) taken by the cameras 1a and 1b are stored frame by frame in a frame grabber (not shown) and then input synchronously to the captured image analysis device 2.
[0017]
Note that the images (captured images) taken by the cameras 1a and 1b are subjected to calibration processing and rectification processing by a correction device (not shown), and are input to the captured image analysis device 2 after this image correction.
[0018]
(Captured image analysis device 2)
The captured image analysis device 2 is a device that analyzes the images (captured images) input from the cameras 1a and 1b and generates “distance information”, “motion information”, “edge information”, and “skin color region information” (see FIG. 1).
[0019]
FIG. 2 is a block diagram showing a configuration of the captured image analysis device 2 and the contour extraction device 3 included in the gesture recognition system A shown in FIG. As shown in FIG. 2, the captured image analysis device 2 includes a distance information generation unit 21 that generates “distance information”, a motion information generation unit 22 that generates “motion information”, and an edge that generates “edge information”. It comprises an information generation unit 23 and a skin color region information generation unit 24 that generates “skin color region information”.
[0020]
(Distance information generation unit 21)
The distance information generation unit 21 detects, for each pixel, the distance from the camera 1 based on the parallax between the two captured images taken by the cameras 1a and 1b at the same time. Specifically, the parallax is obtained by the block correlation method from the first captured image, taken by the camera 1a (the reference camera), and the second captured image, taken by the camera 1b, and the distance from the camera 1 to the “object imaged at each pixel” is then obtained from the parallax by triangulation. The obtained distance is associated with each pixel of the first captured image to generate a distance image D1 (see FIG. 3A) in which distance is expressed as a pixel value. This distance image D1 is the distance information. In the example of FIG. 3A, the target person C is present at a single distance.
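For a rectified stereo pair, the triangulation step is commonly written as Z = f * B / d, where B is the camera separation, d the parallax in pixels, and f the focal length in pixels. The Python sketch below shows this relation; the focal length parameter and the function names `disparity_to_distance` and `distance_image` are assumptions for illustration, since the specification only names the distance B and the use of triangulation.

```python
def disparity_to_distance(disparity_px, focal_length_px, baseline_m):
    """Distance Z = f * B / d for one pixel of a rectified stereo pair."""
    if disparity_px <= 0:
        # No measurable parallax: treat the point as infinitely far away.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


def distance_image(disparity_map, focal_length_px, baseline_m):
    """Convert a nested list of per-pixel disparities into per-pixel distances."""
    return [[disparity_to_distance(d, focal_length_px, baseline_m) for d in row]
            for row in disparity_map]
```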
[0021]
Note that the block correlation method detects parallax by comparing a block of the same fixed size (for example, 8 × 3 pixels) in the first captured image and the second captured image, and checking by how many pixels the subject inside the block is shifted between the two images.
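A minimal sketch of such a block comparison is given below. It uses the sum of absolute differences (SAD) as the matching cost, which is an assumption: the specification does not name the cost function. The function name `block_disparity`, the search range `max_disp`, and the reading of "8 × 3" as 8 pixels wide by 3 pixels high are likewise hypothetical. Because the reference image comes from the right camera, the matching block in the left image is searched at the same row, shifted toward larger column indices.

```python
def block_disparity(ref, other, row, col, block_h=3, block_w=8, max_disp=16):
    """Disparity of the block whose top-left corner is (row, col) in `ref`.

    ref and other are grayscale images as nested lists of equal size.
    """
    def sad(shift):
        # Sum of absolute differences between the reference block and the
        # block shifted by `shift` pixels in the other image.
        total = 0
        for r in range(row, row + block_h):
            for c in range(col, col + block_w):
                total += abs(ref[r][c] - other[r][c + shift])
        return total

    width = len(other[0])
    # Only test shifts that keep the shifted block inside the image.
    candidates = range(0, min(max_disp, width - block_w - col) + 1)
    return min(candidates, key=sad)
```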
[0022]
(Motion information generation unit 22)
The motion information generation unit 22 detects the movement of the target person based on the difference between the “captured image (t)” at “time t” and the “captured image (t + Δt)” at “time t + Δt”, both taken in time series by the camera 1a (the reference camera). Specifically, the difference between the “captured image (t)” and the “captured image (t + Δt)” is taken and the displacement of each pixel is examined. A displacement vector is then obtained from the examined displacement, and a difference image D2 (see FIG. 3B) in which the displacement vector is represented by pixel values is generated. This difference image D2 is the motion information. In the example of FIG. 3B, motion is detected in the left arm of the target person C.
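The simplest form of this idea is plain frame differencing, sketched below. Note that this is a simplification: the specification speaks of displacement vectors, whereas the sketch only records the absolute intensity change per pixel. The function name `difference_image` and the noise threshold are assumptions.

```python
def difference_image(frame_t, frame_t_dt, threshold=10):
    """Per-pixel absolute difference between two grayscale frames.

    Changes smaller than `threshold` are treated as noise and set to 0.
    """
    return [[abs(b - a) if abs(b - a) >= threshold else 0
             for a, b in zip(row_t, row_dt)]
            for row_t, row_dt in zip(frame_t, frame_t_dt)]
```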
[0023]
(Edge information generation unit 23)
The edge information generation unit 23 generates an edge image by extracting an edge present in the captured image based on density information or color information of each pixel in the image (captured image) captured by the camera 1a as the reference camera. I do. Specifically, based on the brightness of each pixel in the captured image, a portion where the brightness changes greatly is detected as an edge, and an edge image D3 (see FIG. 3C) including only the edge is generated. This edge image D3 becomes edge information.
[0024]
For edge detection, for example, a Sobel operator is applied to each pixel, and a line segment having a predetermined difference from an adjacent line segment is detected as an edge (horizontal edge or vertical edge) in row or column units. Note that the Sobel operator is an example of a coefficient matrix holding weight coefficients for the pixels in the neighborhood of a given pixel.
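A minimal Python sketch of Sobel-based edge detection is given here for illustration. It simplifies the row and column processing described above into a single threshold on the sum of the absolute horizontal and vertical responses. The function name `sobel_edges` and the `threshold` parameter are assumptions, not part of this description.

```python
# Standard 3x3 Sobel coefficient matrices.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # responds to horizontal edges


def sobel_edges(gray, threshold):
    """Return a binary edge image (1 = edge) from a 2-D list of brightness values."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):            # border pixels are left as non-edges
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = gray[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            # Assumed criterion: a large brightness change in either direction.
            if abs(gx) + abs(gy) >= threshold:
                edges[y][x] = 1
    return edges
```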
[0025]
(Skin color area information generation unit 24)
The skin color area information generation unit 24 extracts the skin color regions of the target person existing in the captured image from the image (captured image) captured by the camera 1a serving as the reference camera. Specifically, the RGB values of all the pixels in the captured image are converted into the HLS space consisting of hue, lightness, and saturation, and the pixels whose hue, lightness, and saturation fall within preset threshold ranges are extracted as skin color regions (see FIG. 3D). In the example of FIG. 3D, the face of the target person C is extracted as the skin color region R1, and the hand is extracted as the skin color region R2. The skin color regions R1 and R2 serve as the skin color region information.
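The HLS thresholding step can be sketched in Python with the standard `colorsys` module. The threshold ranges in `SKIN_RANGES` are hypothetical values chosen only to make the example run, since this description does not state the preset thresholds.

```python
import colorsys

# Hypothetical threshold ranges for hue, lightness, saturation (all in [0, 1]).
SKIN_RANGES = {"h": (0.0, 0.14), "l": (0.2, 0.85), "s": (0.15, 0.9)}


def skin_mask(rgb_image, ranges=SKIN_RANGES):
    """Return a binary mask (1 = skin color) for a 2-D list of (R, G, B) tuples (0..255)."""
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            # colorsys expects channel values in [0, 1] and returns (h, l, s).
            h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
            inside = (ranges["h"][0] <= h <= ranges["h"][1]
                      and ranges["l"][0] <= l <= ranges["l"][1]
                      and ranges["s"][0] <= s <= ranges["s"][1])
            mask_row.append(1 if inside else 0)
        mask.append(mask_row)
    return mask
```

Connected groups of 1-pixels in such a mask would correspond to regions such as R1 and R2.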
[0026]
The “distance information (distance image D1)”, “motion information (difference image D2)”, and “edge information (edge image D3)” generated by the captured image analysis device 2 are input to the contour extraction device 3. The “distance information (distance image D1)” and “skin color region information (skin color regions R1, R2)” generated by the captured image analysis device 2 are input to the gesture recognition device 4.
[0027]
(Contour extraction device 3)
The contour extraction device 3 extracts the contour of the target person based on the “distance information (distance image D1)”, “motion information (difference image D2)”, and “edge information (edge image D3)” generated by the captured image analysis device 2 (see FIG. 1).
[0028]
As illustrated in FIG. 2, the contour extraction device 3 includes a target distance setting unit 31 that sets a “target distance”, which is the distance at which the target person exists; a target distance image generation unit 32 that generates a “target distance image” based on the “target distance”; a target area setting unit 33 that sets a “target area” in the “target distance image”; and a contour extraction unit 34 that extracts the “target person contour” from within the “target area”.
[0029]
(Target distance setting unit 31)
The target distance setting unit 31 sets the “target distance”, which is the distance at which the target person exists, based on the distance image D1 (see FIG. 3A) and the difference image D2 (see FIG. 3B) generated by the captured image analysis device 2. Specifically, pixels having the same pixel value in the distance image D1 are regarded as one group (pixel group), and the pixel values of the difference image D2 are accumulated for each pixel group. Then, the moving object with the largest amount of motion, that is, the target person, is considered to be present in the group whose accumulated pixel value is larger than a predetermined value and which is closest to the camera 1, and the distance of that group is set as the target distance (see FIG. 4A). In the example of FIG. 4A, the target distance is set to 2.2 m. The target distance set by the target distance setting unit 31 is input to the target distance image generation unit 32.
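The target distance selection can be sketched as follows. This is a minimal Python illustration that assumes the distance image and the difference image are nested lists of equal size and that motion is a scalar per pixel. The function name and the `motion_threshold` parameter are assumptions.

```python
from collections import defaultdict


def target_distance(distance_image, difference_image, motion_threshold):
    """Return the nearest distance whose accumulated motion exceeds the threshold.

    Pixels with the same distance value form one group; the motion values of
    the difference image are summed per group. None is returned when no group
    shows enough motion.
    """
    motion_by_distance = defaultdict(float)
    for dist_row, diff_row in zip(distance_image, difference_image):
        for d, m in zip(dist_row, diff_row):
            motion_by_distance[d] += m
    candidates = [d for d, total in motion_by_distance.items()
                  if total > motion_threshold]
    return min(candidates) if candidates else None
```

With a real distance image, distances would typically be quantized into bins before grouping; exact equality is used here only to mirror the wording above.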
[0030]
(Target distance image generation unit 32)
The target distance image generation unit 32 refers to the distance image D1 generated by the captured image analysis device 2 (see FIG. 3A), and generates a “target distance image” by extracting, from the edge image D3 (see FIG. 3C), the pixels corresponding to the pixels existing at the target distance ± α m set by the target distance setting unit 31. Specifically, the pixels in the distance image D1 that correspond to the target distance ± α m input from the target distance setting unit 31 are obtained. Then, only the obtained pixels are extracted from the edge image D3 generated by the edge information generation unit 23, and a target distance image D4 (see FIG. 4B) is generated. Therefore, the target distance image D4 is an image in which the target person existing at the target distance is represented by edges. The target distance image D4 generated by the target distance image generation unit 32 is input to the target area setting unit 33 and the contour extraction unit 34.
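As a minimal Python sketch, the extraction amounts to masking the edge image with a distance band. The function name and the use of 0 for suppressed pixels are assumptions for illustration.

```python
def target_distance_image(distance_image, edge_image, target, alpha):
    """Keep edge pixels whose distance lies within target +/- alpha; zero the rest."""
    return [[e if abs(d - target) <= alpha else 0
             for d, e in zip(dist_row, edge_row)]
            for dist_row, edge_row in zip(distance_image, edge_image)]
```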
[0031]
(Target area setting unit 33)
The target area setting unit 33 sets a “target area” in the target distance image D4 (see FIG. 4B) generated by the target distance image generation unit 32. Specifically, a histogram H is generated by accumulating the pixel values of the target distance image D4 in the vertical direction, and the position where the frequency in the histogram H is maximum is specified as the horizontal center position of the target person C (see FIG. 5A). Then, a range of a specific size (for example, 0.5 m) to the left and right of the specified center position is set as the target area T (see FIG. 5B). Note that the vertical range of the target area T is set to a specific size (for example, 2 m). When the target area T is set, its range is corrected with reference to camera parameters such as the tilt angle and the height of the camera 1. The target area T set by the target area setting unit 33 is input to the contour extraction unit 34.
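The histogram step can be sketched in Python as follows. The sketch works in pixels and takes the half width as a parameter, so the conversion from 0.5 m to pixels and the camera parameter correction are left out. The function name and `half_width_px` are assumptions.

```python
def target_region(target_distance_image, half_width_px):
    """Return (center_x, left, right) of the horizontal target range in pixels."""
    width = len(target_distance_image[0])
    # Histogram H: accumulate pixel values column by column (vertical direction).
    histogram = [sum(row[x] for row in target_distance_image)
                 for x in range(width)]
    # The column with the highest count is taken as the person's center.
    center_x = max(range(width), key=histogram.__getitem__)
    left = max(0, center_x - half_width_px)
    right = min(width - 1, center_x + half_width_px)
    return center_x, left, right
```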
[0032]
(Contour extraction unit 34)
The contour extraction unit 34 extracts the contour O of the target person C from within the target area T set by the target area setting unit 33 in the target distance image D4 (see FIG. 4B) generated by the target distance image generation unit 32 (see FIG. 5C). Specifically, the contour O of the target person C is extracted using a method based on an active contour model consisting of a closed curve called “Snakes” (hereinafter referred to as the “snake method”). The snake method extracts the contour of an object by contracting and deforming “Snakes”, the active contour model, so that a predefined energy function is minimized. The contour O of the target person C extracted by the contour extraction unit 34 is input to the gesture recognition device 4 as “contour information” (see FIG. 1).
[0033]
(Gesture recognition device 4)
The gesture recognition device 4 recognizes the posture or gesture of the target person based on the “distance information” and “skin color region information” generated by the captured image analysis device 2 and the “contour information” generated by the contour extraction device 3, and outputs the recognition result (see FIG. 1).
[0034]
FIG. 6 is a block diagram showing the configuration of the gesture recognition device 4 included in the gesture recognition system A shown in FIG. 1. As shown in FIG. 6, the gesture recognition device 4 includes face / hand position detection means 41 that detects the face position and the hand tip position of the target person in the real space, and posture / gesture recognition means 42 that recognizes the posture or gesture of the target person based on the face position and the hand tip position detected by the face / hand position detection means 41.
[0035]
(Face / hand position detecting means 41)
The face / hand position detection means 41 includes a head position detection unit 41A that detects the “top position” (top of the head) of the target person in the real space, a face position detection unit 41B that detects the “face position” of the target person, a hand position detection unit 41C that detects the “hand position” of the target person, and a hand tip position detection unit 41D that detects the “hand tip position” of the target person. Note that the “hand” here refers to the part consisting of the arm and the hand, and the “hand tip” refers to the fingertips of the hand.
[0036]
(Head position detector 41A)
The head position detection unit 41A detects the “top position” of the target person C based on the contour information generated by the contour extraction device 3. The method of detecting the top position will be described with reference to FIG. 7A. First, the center of gravity G of the region enclosed by the contour O is obtained (1). Next, an area (top position search area) F1 for searching for the top position is set (2). The horizontal width (width in the X-axis direction) of the top position search area F1 is set to a preset average human shoulder width W centered on the X coordinate of the center of gravity G. The average human shoulder width W is set with reference to the distance information generated by the captured image analysis device 2. The vertical width (width in the Y-axis direction) of the top position search area F1 is set to a width that can cover the contour O. Then, the upper end point of the contour O within the top position search area F1 is taken as the top position m1 (3). The top position m1 detected by the head position detection unit 41A is input to the face position detection unit 41B.
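The search for the top position can be sketched in Python. The sketch assumes the contour O is a list of (x, y) pixel coordinates with y growing downward, and it approximates the center of gravity G by the mean of the contour points rather than of the enclosed region. The function name and `shoulder_width_px` are assumptions.

```python
def top_of_head(contour, shoulder_width_px):
    """Return the highest contour point within a shoulder-wide band around G.

    contour: list of (x, y) points; image y is assumed to grow downward,
    so the "upper end point" is the point with the smallest y.
    """
    # Approximation (assumed): X of G taken as the mean X of the contour points.
    gx = sum(x for x, _ in contour) / len(contour)
    half = shoulder_width_px / 2.0
    in_window = [(x, y) for x, y in contour if abs(x - gx) <= half]
    if not in_window:
        return None
    return min(in_window, key=lambda p: p[1])
```

Restricting the search to the band around G is what keeps a raised hand, which may be higher than the head, from being mistaken for the top position.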
[0037]
(Face position detector 41B)
The face position detection unit 41B detects the “face position” of the target person C based on the top position m1 detected by the head position detection unit 41A and the skin color region information generated by the captured image analysis device 2. The method of detecting the face position will be described with reference to FIG. 7B. First, an area (face position search area) F2 for searching for the face position is set (4). The range of the face position search area F2 is set to a preset size “approximately covering a human head”, with reference to the top position m1. Note that the range of the face position search area F2 is set with reference to the distance information generated by the captured image analysis device 2.
[0038]
Next, the center of gravity of the skin color area R1 in the face position search area F2 is set as the face position m2 on the image (5). For the skin color region R1, the skin color region information generated by the captured image analysis device 2 is referred to. Then, the face position m2t (Xft, Yft, Zft) in the real space is obtained from the face position m2 (Xf, Yf) on the image with reference to the distance information generated by the captured image analysis device 2.
[0039]
The “face position m2 on the image” detected by the face position detection unit 41B is input to the hand position detection unit 41C and the hand tip position detection unit 41D. The “face position m2t in the real space” detected by the face position detection unit 41B is stored in a storage unit (not shown), and is used when the posture / gesture recognition unit 42B of the posture / gesture recognition means 42 (see FIG. 6) recognizes the posture or gesture of the target person C.
[0040]
(Hand position detector 41C)
The hand position detection unit 41C detects the “hand position” of the target person C based on the skin color region information generated by the captured image analysis device 2 and the outline information generated by the outline extraction device 3. Here, as the skin color area information, information on an area excluding the periphery of the face position m2 is used. The method for detecting the hand position will be described with reference to FIG. 8A. First, an area (hand position search area) F3 (F3R, F3L) for searching the hand position is set (6). The hand position search area F3 is set to be a preset “range of reach of hands (range of reach of left and right hands)” based on the face position m2 detected by the face position detection unit 41B. Note that the size of the hand position search area F3 is set with reference to the distance information generated by the captured image analysis device 2.
[0041]
Next, the center of gravity of the skin color region R2 in the hand position search area F3 is taken as the hand position m3 on the image (7). For the skin color region R2, the skin color region information generated by the captured image analysis device 2 is referred to. Here, as the skin color region information, information on the area excluding the periphery of the face position m2 is used. In the example of FIG. 8A, since a skin color region exists only in the hand position search area F3L, the hand position m3 is detected only in the hand position search area F3L. Also, in the example of FIG. 8A, the target person is wearing long-sleeved clothes and only the part beyond the wrist is exposed, so the position of the hand itself becomes the hand position m3. The “hand position m3 on the image” detected by the hand position detection unit 41C is input to the hand tip position detection unit 41D.
[0042]
(Hand tip position detection unit 41D)
The hand tip position detection unit 41D detects the “hand tip position” of the target person C based on the face position m2 detected by the face position detection unit 41B and the hand position m3 detected by the hand position detection unit 41C. The method of detecting the hand tip position will be described with reference to FIG. 8B. First, an area (hand tip position search range) F4 for searching for the hand tip position is set within the hand position search area F3L (8). The hand tip position search range F4 is set to a preset range “approximately covering the hand”, centered on the hand position m3. The range of the hand tip position search range F4 is set with reference to the distance information generated by the captured image analysis device 2.
[0043]
Subsequently, the upper, lower, left, and right end points m4a to m4d of the skin color region R2 in the hand tip position search range F4 are detected (9). For the skin color region R2, the skin color region information generated by the captured image analysis device 2 is referred to. The vertical distance d1 between the upper and lower end points (between m4a and m4b) and the horizontal distance d2 between the left and right end points (between m4c and m4d) are compared, and the direction with the longer distance is determined to be the direction in which the hand extends (10). In the example of FIG. 8B, since the vertical distance d1 is longer than the horizontal distance d2, it is determined that the hand extends vertically.
[0044]
Next, based on the positional relationship between the face position m2 on the image and the hand position m3 on the image, it is determined which of the upper and lower end points m4a and m4b (or which of the left and right end points m4c and m4d) is the hand tip position. Specifically, when the hand position m3 is far from the face position m2, the arm is regarded as being extended, and the end point farther from the face position m2 is determined to be the hand tip position (hand tip position on the image) m4. Conversely, when the hand position m3 is close to the face position m2, the elbow is regarded as being bent, and the end point closer to the face position m2 is determined to be the hand tip position m4. In the example of FIG. 8B, since the hand position m3 is far from the face position m2 and the upper end point m4a is farther from the face position m2 than the lower end point m4b, the upper end point m4a is determined to be the hand tip position m4 (11).
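Steps (9) to (11) can be sketched in Python as follows. The sketch assumes the skin color region inside the search range F4 is given as a list of (x, y) pixel coordinates. The function name and the `far_threshold_px` parameter, which decides whether the hand position m3 counts as "far" from the face position m2, are hypothetical, since this description gives no numeric criterion.

```python
import math


def hand_tip(skin_pixels, face, hand, far_threshold_px):
    """Pick the end point of the skin region that is taken as the hand tip m4."""
    # (9) End points of the skin color region (image y grows downward).
    top = min(skin_pixels, key=lambda p: p[1])
    bottom = max(skin_pixels, key=lambda p: p[1])
    left = min(skin_pixels, key=lambda p: p[0])
    right = max(skin_pixels, key=lambda p: p[0])
    # (10) The longer extent gives the direction in which the hand extends.
    d1 = bottom[1] - top[1]
    d2 = right[0] - left[0]
    candidates = (top, bottom) if d1 > d2 else (left, right)

    def distance_to_face(p):
        return math.dist(p, face)

    # (11) Arm extended: farther end point; elbow bent: nearer end point.
    if math.dist(hand, face) > far_threshold_px:
        return max(candidates, key=distance_to_face)
    return min(candidates, key=distance_to_face)
```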
[0045]
Then, the hand tip position m4t (Xht, Yht, Zht) in the real space is obtained from the hand tip position m4 (Xh, Yh) on the image with reference to the distance information generated by the captured image analysis device 2. The “hand tip position m4t in the real space” detected by the hand tip position detection unit 41D is stored in a storage unit (not shown), and is used when the posture / gesture recognition unit 42B of the posture / gesture recognition means 42 (see FIG. 6) recognizes the posture or gesture of the target person C.
[0046]
(Posture / gesture recognition means 42)
The posture / gesture recognition means 42 includes a posture / gesture data storage unit 42A that stores posture data and gesture data, and a posture / gesture recognition unit 42B that recognizes the posture or gesture of the target person based on the “face position m2t in the real space” and the “hand tip position m4t in the real space” detected by the face / hand position detection means 41 (see FIG. 6).
[0047]
(Posture / gesture data storage unit 42A)
The posture / gesture data storage unit 42A stores posture data P1 to P4 (see FIG. 9) and gesture data J1 to J4 (see FIG. 10). The posture data P1 to P4 and the gesture data J1 to J4 are data describing the posture or gesture that corresponds to “the hand tip position relative to the face position, and the fluctuation of that hand tip position”. The posture data P1 to P4 and the gesture data J1 to J4 are used when the posture / gesture recognition unit 42B recognizes the posture or gesture of the target person.
[0048]
The posture data P1 to P4 will be described with reference to FIG. 9. “Posture P1: FACE SIDE” shown in FIG. 9A is a posture meaning “hello”, “Posture P2: HIGH HAND” shown in FIG. 9B is a posture meaning “start following”, “Posture P3: SIDE HAND” shown in FIG. 9C is a posture meaning “look in the direction of the hand”, and “Posture P4: LOW HAND” shown in FIG. 9D is a posture meaning “turn in the direction of the hand”.
[0049]
The gesture data J1 to J4 will be described with reference to FIG. 10. “Gesture J1: HAND SWING” shown in FIG. 10A is a gesture meaning “be careful”, “Gesture J2: BYE BYE” shown in FIG. 10B is a gesture meaning “goodbye”, “Gesture J3: COME HERE” shown in FIG. 10C is a gesture meaning “approach”, and “Gesture J4: HAND CIRCLING” shown in FIG. 10D is a gesture meaning “turn”.
[0050]
In the present embodiment, the posture / gesture data storage unit 42A (see FIG. 6) stores the posture data P1 to P4 (see FIG. 9) and the gesture data J1 to J4 (see FIG. 10). The posture data and the gesture data stored in the posture / gesture data storage unit 42A can be set arbitrarily. Further, the meaning of each posture and each gesture can be arbitrarily set.
[0051]
The posture / gesture recognition unit 42B recognizes the posture or gesture of the target person using the “Bayesian method”, which is one of the statistical methods. Specifically, first, from the “face position m2t in the real space” and the “hand tip position m4t in the real space” detected by the face / hand position detection means 41, the average and variance, over a predetermined number of frames (for example, 5 frames), of the hand tip position relative to the face position m2t are obtained as the feature vector x. Then, based on the obtained feature vector x, the “probability density of the posterior distribution” of the random variable ωi is calculated for every posture and gesture i using the Bayesian method, and the posture or gesture for which the “probability density of the posterior distribution” is maximum in each frame is recognized as the posture or gesture in that frame.
[0052]
Next, the method of recognizing a posture or gesture in the posture / gesture recognition unit 42B will be described in detail with reference to the flowcharts shown in FIGS. 11 and 12. Here, the outline of the processing in the posture / gesture recognition unit 42B is first described with reference to the flowchart shown in FIG. 11, and then “Step S1: posture / gesture recognition processing” in the flowchart shown in FIG. 11 is described with reference to the flowchart shown in FIG. 12.
[0053]
(Outline of Processing in Posture / Gesture Recognition Unit 42B)
FIG. 11 is a flowchart for explaining the outline of the processing in the posture / gesture recognition unit 42B. Referring to the flowchart shown in FIG. 11, first, in step S1, an attempt is made to recognize a posture or a gesture. Next, in step S2, it is determined whether or not the posture or the gesture has been recognized in step S1. Here, if it is determined that the posture or the gesture has been recognized, the process proceeds to step S3. If it is determined that the posture or the gesture has not been recognized, the process proceeds to step S5.
[0054]
In step S3, it is determined whether or not the same posture or gesture has been recognized a predetermined number of times (for example, 5 times) or more in a predetermined number of past frames (for example, 10 frames). If it is determined that the same posture or gesture has been recognized the predetermined number of times or more, the process proceeds to step S4. If it is determined that the same posture or gesture has not been recognized the predetermined number of times or more, the process proceeds to step S5.
[0055]
In step S4, the posture or gesture recognized in step S1 is output as the recognition result, and the process ends. In step S5, a result indicating that the posture or gesture could not be recognized, that is, that recognition was not possible, is output, and the process ends.
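The flow of steps S2 to S5 can be sketched in Python as a small filter over per-frame results. The class name is hypothetical, and the sketch assumes that the current frame counts as one of the 10 "past frames" and that an unrecognized frame is represented by `None`.

```python
from collections import Counter, deque


class RecognitionFilter:
    """Accept a posture or gesture only if it recurs often enough recently."""

    def __init__(self, window=10, min_count=5):
        self.history = deque(maxlen=window)   # results of the last `window` frames
        self.min_count = min_count

    def update(self, label):
        """label: per-frame result of step S1, or None if nothing was recognized."""
        self.history.append(label)
        if label is None:                     # S2: not recognized -> S5
            return None
        if Counter(self.history)[label] >= self.min_count:   # S3
            return label                      # S4: output the recognition result
        return None                           # S5: treated as not recognized
```

A label that appears in only three frames is therefore never output, because its count in the window stays below `min_count`.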
[0056]
(Step S1: Posture / gesture recognition processing)
FIG. 12 is a flowchart for explaining “Step S1: posture / gesture recognition processing” in the flowchart shown in FIG. 11. Referring to the flowchart shown in FIG. 12, first, in step S11, from the “face position m2t (Xft, Yft, Zft) in the real space” and the “hand tip position m4t (Xht, Yht, Zht) in the real space” detected by the face / hand position detection means 41, the “average and variance, over a predetermined number of frames (for example, 5 frames), of the hand tip position relative to the face position m2t” are obtained as the feature vector x.
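Step S11 can be sketched in Python with the standard `statistics` module. The sketch assumes positions are (X, Y, Z) tuples, one per frame, and that the variance is the population variance computed separately per axis. The function name and the ordering of the six components are assumptions.

```python
from statistics import fmean, pvariance


def feature_vector(face_positions, hand_tip_positions, window=5):
    """Mean and variance of the hand tip position relative to the face.

    Both arguments are per-frame lists of (X, Y, Z) tuples in real space;
    only the last `window` frames are used. Returns
    [mean_x, mean_y, mean_z, var_x, var_y, var_z].
    """
    pairs = list(zip(face_positions, hand_tip_positions))[-window:]
    # Hand tip position expressed relative to the face position, per frame.
    relative = [tuple(h - f for f, h in zip(face, tip)) for face, tip in pairs]
    axes = list(zip(*relative))               # one sequence per coordinate axis
    means = [fmean(axis) for axis in axes]
    variances = [pvariance(axis) for axis in axes]
    return means + variances
```

Intuitively, a static posture gives small variances and is distinguished by its mean, whereas a gesture such as a waving hand shows up as a large variance along the axis of motion.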
[0057]
In the next step S12, based on the feature vector x obtained in step S11, the “probability density of the posterior distribution” of the random variable ωi is calculated for every posture and gesture i using the Bayesian method.
[0058]
The method of calculating the “probability density of the posterior distribution” in step S12 will be described in detail. First, the probability density P(ωi | x) that the posture or gesture is i when the feature vector x is given is obtained by the following equation (1), known as “Bayes' theorem”. Note that the random variable ωi is set in advance for each posture and each gesture.
[0059]
(Equation 1)
$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)} \qquad (1)$$
[0060]
P(x | ωi) in equation (1) is the “conditional probability density” of observing the feature vector x in the image when the posture or gesture i is given, and is expressed by the following equation (2). Note that the feature vector x is assumed to follow a normal distribution with covariance matrix Σ about its expected value.
[0061]
(Equation 2)

[0062]
Further, P(ωi) in equation (1) is the “probability density of the prior distribution” of the random variable ωi, and is expressed by the following equation (3). P(ωi) is assumed to be a normal distribution with expected value ωi₀ and variance V[ωi₀].
[0063]
(Equation 3)

[0064]
Since the denominator on the right-hand side of equation (1) does not depend on ωi, the “probability density of the posterior distribution” of the random variable ωi is expressed by the following equation (4), using equations (2) and (3).
[0065]
(Equation 4)
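Equations (2) to (4) are not reproduced in this text. The following LaTeX sketch gives the standard forms implied by the surrounding definitions: a normal density for x with covariance matrix Σ, a normal prior for ωi with expected value ωi₀ and variance V[ωi₀], and a posterior proportional to their product. The dimension n of x and the symbol \bar{x}_i for the expected value of x under posture or gesture i are notation assumed here for illustration, and the exact form printed in the original equations may differ.

```latex
% Assumed notation: n is the dimension of x; \bar{x}_i is the expected value
% of x for posture or gesture i (symbol chosen for illustration only).
\begin{align}
P(x \mid \omega_i) &= \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
  \exp\!\left[-\tfrac{1}{2}\,(x-\bar{x}_i)^{\mathsf T}\,\Sigma^{-1}\,(x-\bar{x}_i)\right] \tag{2}\\
P(\omega_i) &= \frac{1}{\sqrt{2\pi\,V[\omega_{i0}]}}
  \exp\!\left[-\frac{(\omega_i-\omega_{i0})^{2}}{2\,V[\omega_{i0}]}\right] \tag{3}\\
P(\omega_i \mid x) &\propto P(x \mid \omega_i)\,P(\omega_i) \tag{4}
\end{align}
```

Because only the maximum over i is needed for recognition, the factor P(x) dropped in (4) does not affect the result.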

[0066]
Returning to the flowchart of FIG. 12, in the next step S13, the posture or gesture for which the “probability density of the posterior distribution” in each frame is maximum is found. In the subsequent step S14, a recognition result indicating that the posture or gesture found in step S13 is the posture or gesture in that frame is output, and the process ends.
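Steps S12 and S13 can be sketched in Python as a maximum a posteriori choice over the candidate postures and gestures. The sketch simplifies the model: it assumes a diagonal covariance per class and represents the prior of each class by a single positive number, and it works with logarithms for numerical stability. The function names and the `models` dictionary layout are assumptions for illustration.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with diagonal covariance (assumed)."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))


def recognize(x, models):
    """Return the label whose (unnormalized) posterior density is largest.

    models: {label: (mean, var, prior)} with one entry per posture or gesture.
    The denominator P(x) is omitted because it is the same for every label.
    """
    scores = {label: log_gaussian(x, mean, var) + math.log(prior)
              for label, (mean, var, prior) in models.items()}
    return max(scores, key=scores.get)
```

For example, with two hypothetical models whose mean hand tip height is above and below the face respectively, a feature vector with the hand above the face selects the first one.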
[0067]
FIG. 13 is a graph showing the “probability density of the posterior distribution” of the random variables ωi for the postures P1 to P4 and the gestures J1 to J4 in the frames 1 to 100. Here, the postures P1 to P4 and the gestures J1 to J4 are set to “i (i = 1 to 8)”.
[0068]
As shown in FIG. 13, since the probability density of “gesture J2: BYE BYE” is maximum in frames 1 to 26, the posture or gesture of the target person is recognized as “gesture J2: BYE BYE” in frames 1 to 26 (see FIG. 10B). Likewise, since the probability density of “posture P1: FACE SIDE” is maximum in frames 27 to 43, the posture or gesture of the target person is recognized as “posture P1: FACE SIDE” in frames 27 to 43 (see FIG. 9A).
[0069]
Since the probability density of “gesture J3: COME HERE” is maximum in frames 44 to 76, the posture or gesture of the target person is recognized as “gesture J3: COME HERE” in frames 44 to 76 (see FIG. 10C). Since the probability density of “gesture J1: HAND SWING” is maximum in frames 80 to 100, the posture or gesture of the target person is recognized as “gesture J1: HAND SWING” in frames 80 to 100 (see FIG. 10A).
[0070]
In frames 77 to 79, the probability density of “gesture J4: HAND CIRCLING” is maximum, but since “gesture J4: HAND CIRCLING” is recognized only three times, the posture or gesture of the target person is not recognized as “gesture J4: HAND CIRCLING”. This is because, as described above, the posture / gesture recognition unit 42B determines that a posture or gesture has been recognized only when the same posture or gesture is recognized a predetermined number of times (for example, 5 times) or more in a predetermined number of past frames (for example, 10 frames) (see steps S3 to S5 in the flowchart shown in FIG. 11).
[0071]
As described above, the posture / gesture recognition unit 42B calculates the “probability density of the posterior distribution” of the random variable ωi for every posture and gesture i (i = 1 to 8) using the Bayesian method, and can recognize the posture or gesture of the target person by finding the posture or gesture for which the “probability density of the posterior distribution” in each frame is maximum.
[0072]
(Operation of Gesture Recognition System A)
Next, the operation of the gesture recognition system A will be described with reference to the block diagram of FIG. 1, which shows the overall configuration of the gesture recognition system A, and the flowcharts shown in FIGS. 14 and 15. FIG. 14 is a flowchart for explaining the “captured image analysis step” and the “contour extraction step” in the operation of the gesture recognition system A. FIG. 15 is a flowchart for explaining the “face / hand position detection step” and the “posture / gesture recognition step” in the operation of the gesture recognition system A.
[0073]
<Captured image analysis step>
Referring to the flowchart shown in FIG. 14, in the captured image analysis device 2, when captured images are input from the cameras 1a and 1b (step S101), the distance information generation unit 21 generates the distance image D1 (see FIG. 3A), which is the distance information, from the captured images (step S102), and the motion information generation unit 22 generates the difference image D2 (see FIG. 3B), which is the motion information, from the captured images (step S103). Further, the edge information generation unit 23 generates the edge image D3 (see FIG. 3C), which is the edge information, from the captured image (step S104), and the skin color area information generation unit 24 extracts the skin color regions R1 and R2 (see FIG. 3D), which are the skin color region information, from the captured image (step S105).
[0074]
<Contour extraction step>
Continuing with reference to the flowchart shown in FIG. 14, in the contour extraction device 3, first, the target distance setting unit 31 sets the target distance, which is the distance at which the target person exists, from the distance image D1 and the difference image D2 generated in steps S102 and S103 (step S106). Subsequently, the target distance image generation unit 32 generates the target distance image D4 (see FIG. 4B) by extracting, from the edge image D3 generated in step S104, the pixels existing at the target distance set in step S106 (step S107).
[0075]
Next, the target area setting unit 33 sets a target area T (see FIG. 5B) in the target distance image D4 generated in step S107 (step S108). Then, the contour extraction unit 34 extracts a contour O (see FIG. 5C) of the target person C from the target area T set in step S108 (step S109).
[0076]
<Face / hand position detection step>
Referring to the flowchart shown in FIG. 15, in the face / hand position detection means 41 of the gesture recognition device 4, first, the head position detection unit 41A detects the top position m1 (see FIG. 7A) of the target person C based on the contour information generated in step S109 (step S110).
[0077]
Subsequently, the face position detection unit 41B detects the “face position m2 on the image” (see FIG. 7B) based on the top position m1 detected in step S110 and the skin color region information generated in step S105, and obtains the “face position m2t (Xft, Yft, Zft) in the real space” from the detected “face position m2 (Xf, Yf) on the image” with reference to the distance information generated in step S102 (step S111).
[0078]
Next, the hand position detection unit 41C detects “hand position m3 on the image” (see FIG. 8A) from “face position m2 on the image” detected in step S111 (step S112).
[0079]
Then, the hand tip position detection unit 41D detects the “hand tip position m4 on the image” (see FIG. 8B) based on the “face position m2 on the image” detected by the face position detection unit 41B and the hand position m3 detected by the hand position detection unit 41C, and obtains the “hand tip position m4t (Xht, Yht, Zht) in the real space” from the detected “hand tip position m4 (Xh, Yh) on the image” with reference to the distance information generated in step S102 (step S113).
[0080]
<Posture / gesture recognition step>
With continued reference to the flowchart shown in FIG. 15, the posture / gesture recognition unit 42B of the gesture recognition device 4 recognizes the posture or gesture of the target person using the “Bayesian method”, which is one of the statistical methods. The details of the method of recognizing the posture or gesture in the posture / gesture recognition unit 42B have been described above and are therefore omitted here.
[0081]
Although the gesture recognition system A has been described above, the gesture recognition device 4 included in the gesture recognition system A can also be realized by implementing each of its means as a function program on a computer, and these function programs can be combined to operate as a gesture recognition program.
[0082]
The gesture recognition system A can be applied to, for example, an autonomous robot. In this case, for example, the autonomous robot can recognize the posture as “posture P2: HIGH HAND” (see FIG. 9B) when a person raises a hand, or recognize the gesture as “gesture J1: HAND SWING” (see FIG. 10A) when a person waves a hand.
[0083]
Note that, compared with an instruction by voice, an instruction by posture or gesture has the advantages that it is not affected by ambient noise, that it can be issued even in situations where a voice would not carry, and that an instruction that is difficult (or long-winded) to express in words can be given concisely.
[0084]
[Effects of the Invention]
As described above in detail, according to the present invention, it is not necessary to detect each individual feature point (a point indicating a feature of the motion of the target person) when recognizing a gesture of the target person. Therefore, compared with conventional gesture recognition methods, the arithmetic processing required for posture recognition or gesture recognition can be reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of a gesture recognition system A.
FIG. 2 is a block diagram showing a configuration of a captured image analysis device 2 and a contour extraction device 3 included in the gesture recognition system A shown in FIG.
3A is a diagram showing a distance image D1, FIG. 3B is a diagram showing a difference image D2, FIG. 3C is a diagram showing an edge image D3, and FIG. 3D is a diagram showing skin color regions R1 and R2.
FIG. 4 is a diagram for explaining a method of setting a target distance.
FIG. 5 is a diagram for explaining a method of setting a target area T and a method of extracting a contour O of a target person C from within the target area T;
6 is a block diagram showing a configuration of a gesture recognition device 4 included in the gesture recognition system A shown in FIG.
FIG. 7A is a diagram for explaining a method for detecting a head-top position m1, and FIG. 7B is a diagram for explaining a method for detecting a face position m2.
FIG. 8A is a diagram for explaining a method for detecting a hand position m3, and FIG. 8B is a diagram for explaining a method for detecting a hand position m4.
FIG. 9 is a diagram showing posture data P1 to P4.
FIG. 10 is a diagram showing gesture data J1 to J4.
FIG. 11 is a flowchart for explaining an outline of processing in a posture / gesture recognition unit 42B.
FIG. 12 is a flowchart for describing “Step S1: posture / gesture recognition processing” in the flowchart shown in FIG. 11.
FIG. 13 is a graph showing the “probability density of the posterior distribution” of the random variables ωi for the postures P1 to P4 and the gestures J1 to J4 in the frames 1 to 100.
FIG. 14 is a flowchart illustrating a “captured image analysis step” and a “contour extraction step” in the operation of the gesture recognition system A.
FIG. 15 is a flowchart illustrating a “face / hand position detection step” and a “posture / gesture recognition step” in the operation of the gesture recognition system A.
[Explanation of symbols]
A gesture recognition system
1 camera
2 captured image analysis device
3 contour extraction device
4 gesture recognition device
41 face / hand position detection means
41A head position detection unit
41B face position detection unit
41C hand position detection unit
41D hand position detection unit
42 posture / gesture recognition means
42A posture / gesture data storage unit
42B posture / gesture recognition unit

Claims

A gesture recognition device for recognizing a posture or gesture of a target person from an image of the target person captured by a camera, the device comprising:
face / hand position detection means for detecting a face position and a hand position of the target person in real space based on contour information and skin color region information of the target person generated from the captured image; and
posture / gesture recognition means for obtaining, from the face position and the hand position, the mean and variance of the hand position relative to the face position over a predetermined number of frames as a feature vector, calculating, based on the feature vector and using a statistical method, the probability density of the posterior distribution of the random variable of every posture and gesture, and recognizing the posture or gesture whose posterior probability density is largest in each frame as the posture or gesture in that frame.

The gesture recognition device according to claim 1, wherein the posture / gesture recognition means determines that a posture or gesture has been recognized only when the same posture or gesture has been recognized more than a predetermined number of times within a predetermined number of frames.

A gesture recognition method for recognizing a posture or gesture of a target person from an image of the target person captured by a camera, the method comprising:
a face / hand position detection step of detecting a face position and a hand position of the target person in real space based on contour information and skin color region information of the target person generated from the captured image; and
a posture / gesture recognition step of obtaining, from the face position and the hand position, the mean and variance of the hand position relative to the face position over a predetermined number of frames as a feature vector, calculating, based on the feature vector and using a statistical method, the probability density of the posterior distribution of the random variable of every posture and gesture, and recognizing the posture or gesture whose posterior probability density is largest in each frame as the posture or gesture in that frame.

A gesture recognition program for recognizing a posture or gesture of a target person from an image of the target person captured by a camera, the program causing a computer to function as:
face / hand position detection means for detecting a face position and a hand position of the target person in real space based on contour information and skin color region information of the target person generated from the captured image; and
posture / gesture recognition means for obtaining, from the face position and the hand position, the mean and variance of the hand position relative to the face position over a predetermined number of frames as a feature vector, calculating, based on the feature vector and using a statistical method, the probability density of the posterior distribution of the random variable of every posture and gesture, and recognizing the posture or gesture whose posterior probability density is largest in each frame as the posture or gesture in that frame.
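
The confirmation rule recited in the dependent device claim above (a posture or gesture counts as recognized only when the same result appears more than a predetermined number of times within a predetermined number of frames) can be sketched in Python as follows. The class name `RecognitionFilter` and the parameters `window` and `threshold` are hypothetical names chosen for illustration. Ties between equally frequent labels are resolved in favor of the label seen first in the window, which is also an assumption.

```python
from collections import Counter, deque


class RecognitionFilter:
    """Confirm a per-frame label only if it recurs often enough."""

    def __init__(self, window, threshold):
        # Keep only the per-frame results of the last `window` frames.
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def update(self, label):
        """Add the result for the newest frame; return a confirmed label or None."""
        self.history.append(label)
        candidate, count = Counter(self.history).most_common(1)[0]
        # "More than a predetermined number of times" is read as count > threshold.
        return candidate if count > self.threshold else None
```

For example, with `window=5` and `threshold=2`, a label is reported only once it has been the per-frame result in at least three of the last five frames. Isolated misrecognitions in single frames are therefore suppressed.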