JP3732757B2

JP3732757B2 - Image recognition method and image recognition apparatus

Info

Publication number: JP3732757B2
Application number: JP2001174574A
Authority: JP
Inventors: 功雄三原; 俊一沼崎; 美和子土井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-06-08
Filing date: 2001-06-08
Publication date: 2006-01-11
Anticipated expiration: 2021-06-08
Also published as: JP2002366958A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば、距離画像のような被写体の３次元情報の表れた画像から被写体の３次元的な動きを認識する画像認識方法およびそれを用いた画像認識装置に関する。
【０００２】
【従来の技術】
従来、ビデオカメラなどの撮像装置を用いて、認識対象物の動きを認識しようとした場合、以下のような方法が取られていた。
【０００３】
まず１つ目は、オプティカルフローと呼ばれる方法である。これは、所定のブロック画像に着目し、隣り合うフレーム画像間で、ある着目画像領域が平面内でどの方角に動いたかを計測し、その方向を推定するものである。次フレームにおける対象画像の移動方向を特定するには、時系列的に隣り合うフレーム間で類似度を算出する方法が代表的である。対象画像領域近傍で同じサイズのブロック画像を対象に前フレームにおける着目ブロック画像との相関係数を計算し、その係数の最も高いブロックへの方向が動きベクトルとして推定される。
【０００４】
この技術は人間の顔のトラッキングなどロボットビジョンの分野では広く利用されている。この手法は、着目ブロック画像が２次元的に大きく変化しない場合、かなりの精度で平面方向の動きを検出することが可能であるが、対象画像はビデオカメラなどで取得した２次元画像であるため、奥行き方向を含めた３次元的な動きの検出は不可能である。
【０００５】
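Another approach estimates motion from feature points of the recognition target. Several feature points are defined on the target in advance, and the motion is estimated from changes in the positional relationship between those points as the target moves. For example, to recognize a face being turned left and right (rotated horizontally), several facial feature points such as the eyes and nose are defined. The method then infers that the face is being turned to the right from changes such as the eye feature points moving to the right, the distance between the feature points of the two eyes narrowing, or the right-eye feature point disappearing (because the right eye has moved to a position the camera cannot see).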
【０００６】
しかし、この方法を用いた場合、対応点をカメラ画像内で安定して得るためには、顔の特徴点の位置にマーカーなどを貼付しなければならないため、使用できる環境が限られているなどの問題があった。マーカーを用いない実現方法もあるが、この場合は画像内から特徴点を自動的に認識する必要があり、特徴点の抽出を安定的に行えない可能性がある上に、特徴点を得るために多大な計算コストも必要としてしまう。この手法も、対象画像はビデオカメラなどで取得した２次元画像であるため、奥行き方向を含めた３次元的な動きは、２次元画像から推定するしかない。
【０００７】
また、別の方法として、運動モーメントの変化を求めることで推測する方法がある。これは、例えば、手を縦軸周りに回転させる動きの場合、手の横方向の前方投影面積が著しく変化するのに対し、縦方向はあまり変化しないというような性質を利用しており、このような場合、手の横方向の運動モーメントのみの変化が激しいことより、手を縦軸周りに回転させているのではないかと推測される。
【０００８】
この方法は、確かに３次元的な動きを推測する一手法ではあるが、認識に使用できる対象物の形状に制限があったり、平面的な別の動きとの区別がつき難いため、誤認識をする可能性があるなどの問題点があった。
【０００９】
ここで挙げた以外にも様々な方法があるであろうが、何れにしても、ビデオカメラなどの撮像装置を用いたこれらの方法では、あくまでも平面的な情報のみしか持たない２次元画像から、３次元的な動きを推測しているに過ぎず、３次元的な動きの認識を安定的に、高精度で行うことは困難である。もともと３次元的な形状の対象物をカメラで平面情報として取得した時点で、かなりの情報が欠落しているからである。
【００１０】
これらの問題を回避するために、複数のビデオカメラを用いて、数カ所から同時に撮像し、各カメラの対応点を求めることで、複数の撮像画像から立体情報を計算し、３次元画像を構成して、それを用いて動作の認識を行う方法がある。
【００１１】
この方法は、ステレオ法と呼ばれ、実際に複数の撮像画像から立体情報を構成しているため、上述したような平面情報から３次元情報を推測するという問題点は解決されるが、複数のカメラからの画像を立体的に融合するための対応点の計算に大変計算時間を必要とするため、リアルタイム処理に不向きであった。また、対応点を求めるためにカメラの位置情報を必要とするため、カメラの位置に制約があったり、カメラ位置のキャリブレーションの必要があったりした。そのため、一般ユーザが容易に使用することは困難であった。
【００１２】
また、動きを特徴づける関節などの部位にあらかじめセンサを装着し、撮像した画像からセンサ部位を抽出し、２次元的あるいは３次元的な動きを計測するモーションキャプチャと呼ばれる手法も存在する。この手法では、上記で紹介した手法に比べ、特徴点の抽出や対応づけ処理は軽くなるが、システム全体のコストが高くつき、システムを稼働する上での制約も多い。さらに煩わしい特定のセンサデバイスを装着する必要があり、とても一般ユーザが使えるものにはなっていない。
【００１３】
以上のように、従来方法では、画像から奥行き情報を含む３次元的な動きの認識を行う方法には様々な問題点があった。
【００１４】
【発明が解決しようとする課題】
従来の手法では、ビデオカメラなどを用いて認識対象物を２次元情報しかもたない画像として取得していたため、対象物の３次元的動きの認識を、２次元情報のみから行うしかなく、安定して、高精度で奥行き方向を含めた３次元的な動きの認識を行うことは困難であった。
【００１５】
そこで、本発明は、３次元的な動きを容易に、しかも安定的かつ高精度で認識できる画像認識方法およびそれを用いた画像認識装置を提供することを目的とする。
【００１６】
【課題を解決するための手段】
本発明は、被写体の３次元情報を持つ画像を取得し、取得した複数の画像の差分データを求め、この差分データから前記被写体の動きに伴い画素値の減少した領域と増加した領域とを抽出し、これらの３次元的な位置関係から前記被写体の３次元的な動きの特徴量を抽出して、この特徴量を基に前記被写体の動きを認識することにより、前記画像中の３次元的な動きを容易にしかも安定的かつ高精度に認識することができる。
【００１７】
被写体の３次元情報を持つ画像を取得し、取得した複数の画像から前記被写体の動きを検知し、動きの検知された画像領域対応の前記複数の画像の差分データから前記画像領域毎に前記被写体の動きに伴い画素値の減少した領域と増加した領域とを抽出し、これらの３次元的な位置関係から前記被写体の３次元的な動きの特徴量を抽出し、前記画像領域毎に、それぞれの画像領域から抽出された特徴量を基に前記検知された動きを認識することにより、前記画像中に複数の動きが存在する場合も、その複数の３次元的な動きのそれぞれを容易にしかも安定的かつ高精度に認識することができる。
【００１８】
好ましくは、前記３次元的な動きの特徴量のｘ方向、ｙ方向、ｚ方向の各成分値のうち、認識すべき動きに応じて選択された少なくとも１つの成分値に基づき、前記被写体の動きを認識する。その際、好ましくは、前記認識すべき動きの特徴的な動き方向に基づき、前記特徴量の各成分値のうち少なくとも１つの成分値を選択する。あるいは、前記認識すべき動きの特徴的な動き方向と、その動き方向と相関関係のある方向とに基づき、前記特徴量の各成分値のうち少なくとも１つの成分値を選択する。
【００１９】
好ましくは、前記画像として距離画像を用いる。
【００２０】
【発明の実施の形態】
以下、本発明の実施形態について、図面を参照しながら説明する。
【００２１】
（第１の実施形態）
まず、本発明の第１の実施形態について説明する。
【００２２】
図１は、第１の実施形態に係る画像認識装置の全体構成図である。本実施形態の画像認識装置は、距離画像または奥行き方向の情報を持った画像を取得するための撮像手段を備えた画像取得部１と、画像取得部１で取得された任意の２枚の奥行き方向の情報を持った画像（例えば、距離画像）の差を計算するための差分計算部２と、差分計算部２で結果得られた差分画像から特徴量を検出するための検出部３と、検出部３で得られた特徴量を基に画像内に含まれる対象物の動作を認識するための認識部４とから構成される。
【００２３】
まず、画像取得部１について説明する。
【００２４】
画像取得部１は、認識対象物体（例えば、人間の手、顔、全身など）を被写体として、所定時間毎（例えば１／３０秒毎など）に、その３次元形状を反映した奥行き方向の値を持つ画像の１つである例えば距離画像として取得するものである。例えば、距離画像は、特開平１０−１７７４４９号に開示されている手法を用いて取得することができる。
【００２５】
所定時間毎に距離画像が取得されてゆくため、これらをメモリなどを用いて、画像取得部１の内部または外部で逐次保持することで、対象物の距離画像による動画像（以降、距離画像ストリームと呼ぶ）をも得ることができる。このとき、距離画像ストリームは、距離画像の取得間隔をｔ秒としたとき、「最新の距離画像」、「最新からｔ秒前（以降、１フレーム前と呼ぶ）の距離画像」、「最新から２ｔ秒前（２フレーム前、以下同様）の距離画像」、…、といった複数フレームの距離画像の集合体として得られることになる。
【００２６】
ここで、距離画像を取得する画像取得部１（以下、距離画像を取得するための画像取得部を距離画像取得部１と呼ぶ）および距離画像について説明する。距離画像取得部１は、対象物としての人物が本装置の所定位置についたとき、当該人物の手腕や顔、全身などが撮像できるように、予め位置決めされている。
【００２７】
距離画像取得部１の外観を図２に示す。中央部には円形レンズとその後部にあるエリアセンサ（図示せず）から構成される受光部１０３が配置され、円形レンズの周囲にはその輪郭に沿って、赤外線などの光を照射するＬＥＤから構成される発光部１０１が複数個（例えば８個）等間隔に配置されている。
【００２８】
発光部１０１から照射された光が物体に反射され、受光部１０３のレンズにより集光され、レンズの後部にあるエリアセンサで受光される。エリアセンサは、例えば２５６×２５６のマトリックス状に配列されたセンサで、マトリックス中の各センサにて受光された反射光の強度がそれぞれ画素値となる。このようにして取得された画像が、図４に示すような反射光の強度分布としての距離画像である。
【００２９】
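FIG. 3 shows a configuration example of the distance image acquisition unit 1, which mainly comprises a light emitting unit 101, a light receiving unit 103, a reflected light extraction unit 102, and a timing signal generation unit 104.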
【００３０】
発光部１０１は、タイミング信号生成部１０４にて生成されたタイミング信号に従って時間的に強度変動する光を発光する。この光は発光部前方にある対象物体に照射される。
【００３１】
受光部１０３は、発光部１０１が発した光の対象物体による反射光の量を検出する。
【００３２】
反射光抽出部１０２は、受光部１０３にて受光された反射光の空間的な強度分布を抽出する。この反射光の空間的な強度分布は画像として捉えることができるので、以下、これを距離画像と呼ぶ。
【００３３】
受光部１０３は一般的に発光部１０１から発せられる光の対象物による反射光だけでなく、照明光や太陽光などの外光も同時に受光する。そこで、反射光抽出部１０２は発光部１０１が発光しているときに受光した光の量と、発光部１０１が発光していないときに受光した光の量の差をとることによって、発光部１０１からの光の対象物体による反射光成分だけを取り出す。
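The ambient-light removal described above can be sketched as a per-pixel subtraction of two exposures. This is a minimal illustration, not the apparatus's actual circuit: the function name `extract_reflected`, the list-of-lists image layout, and the clamping of negative differences to 0 are all assumptions made for the example.

```python
def extract_reflected(lit_frame, unlit_frame):
    """Return the reflected-light component of each pixel.

    lit_frame:   light received while the light emitting unit is on
    unlit_frame: light received while it is off (ambient light only)
    Negative differences (assumed to be sensor noise) are clamped to 0.
    """
    return [
        [max(lit - unlit, 0) for lit, unlit in zip(lit_row, unlit_row)]
        for lit_row, unlit_row in zip(lit_frame, unlit_frame)
    ]


lit = [[120, 200], [90, 60]]
unlit = [[20, 30], [90, 70]]
reflected = extract_reflected(lit, unlit)
```

Pixels where the two exposures are equal, or where the unlit exposure is brighter, come out as 0, which matches the interpretation of "0" as "no reflected light reaches the sensor".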
【００３４】
反射光抽出部１０２では、受光部１０３にて受光された反射光から、その強度分布、すなわち、図４に示すような距離画像のデータを抽出する。
【００３５】
For simplicity, FIG. 4 shows 8 × 8 pixels of distance image data, which is part of a 256 × 256 pixel distance image.
【００３６】
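Reflected light from an object decreases sharply as the distance to the object increases. When the surface of the object scatters light uniformly, the amount of light received per pixel of the distance image decreases in inverse proportion to the square of the distance to the object.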
【００３７】
図４において、行列中のセルの値（画素値）は、取得した反射光の強さを２５６階調（８ビット）で示したものである。例えば、「２５５」の値があるセルは、距離画像取得部１に最も接近した状態、「０」の値があるセルは、距離画像取得部１から遠くにあり、反射光が距離画像取得部１にまで到達しないことを示している。
【００３８】
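Each pixel value of the distance image represents the amount of reflected light received by the unit light receiving element corresponding to that pixel. Reflected light is affected by the properties of the object (whether it reflects light specularly, scatters it, absorbs it, and so on), the orientation of the object, the distance to the object, and other factors. When the whole object scatters light uniformly, however, the amount of reflected light is closely related to the distance to the object. A hand has this property, so when a hand is held out in front of the distance image acquisition unit 1, the resulting distance image is a three-dimensional image like that shown in FIG. 5, reflecting the distance to the hand, the inclination of the hand (distance differing from part to part), and so on.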
【００３９】
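The intensity of light reflected from an object decreases in inverse proportion to the square of the distance d to the object. That is, letting Q(i, j) be the representative pixel value of the image of the object, the relationship can be expressed as
$$Q(i, j) = \frac{K}{d^2} \tag{1}$$
【００４０】
Here, K is a coefficient adjusted so that, for example, the pixel value Q(i, j) is "255" when d = 0.5 m. Solving equation (1) for d, that is, d = \sqrt{K / Q(i, j)}, gives the distance d.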
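As a worked example of equation (1), the sketch below inverts the relation to recover a distance from a pixel value. It assumes the calibration given in the text (pixel value 255 at d = 0.5 m); the function name `pixel_to_distance` and the choice to return `None` for a pixel value of 0 are illustrative assumptions.

```python
import math

# Coefficient K chosen so that Q = 255 at d = 0.5 m (calibration from the text).
K = 255 * 0.5 ** 2


def pixel_to_distance(q):
    """Invert Q = K / d^2 to get the distance d in meters.

    Returns None when q is 0 or negative, since no reflected light
    means the distance cannot be determined from this relation.
    """
    if q <= 0:
        return None
    return math.sqrt(K / q)
```

Under this calibration a pixel value of 255 maps back to 0.5 m, and a pixel value one quarter as large corresponds to twice that distance.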
【００４１】
このように、図４に示したような反射光の強度分布を表した距離画像の各画素値は、そのまま画像取得部１からの距離（奥行き方向の値）に対応する情報である。距離画像は奥行き情報を有する３次元画像である。なお、距離画像の各画素値は、画像取得部１からの距離（奥行き方向の値）に対応する情報であるが、この画素値を上記式（１）を用いて、画像取得部１からの距離値に変換したものであってもよいし、このような絶対的な距離値に限らず、相対的な値に変換して、それを画素値としてもよい。また、画像取得部１からの距離に対応する情報は、上述したような２次元行列形式だけではなく、他の方法を取ることも可能である。
【００４２】
なお、距離画像の取得方法は、上述した特開平１０−１７７４４９号の画像取得方法に限定されるものではなく、これに準じる、あるいは別の手段を用いて取得するものでも構わない。例えば、レンジファインダと呼ばれるレーザー光を用いた距離画像取得方法や、ステレオ法と呼ばれる２台のカメラを用いて同時に撮像した２枚の画像の視差情報を用いて距離画像を取得する方法などがそれにあたる。
【００４３】
図６は、画像取得部１により取得された手の距離画像の表示イメージを示したもので、例えば、ｘ軸（横）方向６４画素、ｙ軸（縦）方向６４画素、ｚ軸（奥行き）方向２５６階調の画像になっている。図６は、距離画像の奥行き値、すなわちｚ軸方向の階調（画素値）をグレースケールで表現したもので、この場合、色が黒に近いほど距離が近く、白に近くなるほど距離が遠いことを示している。また、色が完全に白のところは、画像がない、あるいはあっても遠方でないのと同じであることを示している。
【００４４】
次に、図７に示すフローチャートを参照して、図１の画像認識装置の処理動作について説明する。
【００４５】
まず、画像取得部１は、認識対象物体の距離画像ストリームを取得し、その中に含まれる任意の２フレームの距離画像（以降、距離画像Ａ、距離画像Ｂ）を差分計算部２へ渡す（ステップＳ１）。
【００４６】
差分計算部２は、画像取得部１によって取得された認識対象物体の距離画像ストリーム中に含まれる任意の２フレームの距離画像（以降、距離画像Ａ、距離画像Ｂ）に差分処理を施し、差分画像を生成する（ステップＳ２）。
【００４７】
任意の２フレームは、リアルタイムに認識を行いたい場合は、通常、最新フレーム（時刻ｔ）の距離画像Ａ、および、それから数フレーム前（時刻ｔ−ｎ、ｎは任意の正定数）の距離画像Ｂが選択される。ここで、何フレーム前の距離画像を用いるかは、画像取得部１の距離画像取得間隔（フレームレート）や、対象物の動作速度などの情報を基に決定する。
【００４８】
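The difference processing performed by the difference calculation unit 2 is now described in detail.
【００４９】
The difference image D between distance image A (captured at time t) and distance image B (captured at time t−n) is calculated by applying equation (2) to every pixel (i, j).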
【００５０】
ここで、時刻ｔにおける距離画像の各画素位置（ｉ，ｊ）の距離値をＦ^（ｔ）（ｉ，ｊ）、時刻ｔにおける差分画像をＤ^（ｔ）、その各画素位置（ｉ，ｊ）の値をＤ^（ｔ）（ｉ，ｊ）と表現する。
【００５１】
つまり、距離画像Ａの画素位置（ｉ，ｊ）での距離値はＦ^（ｔ）（ｉ，ｊ）、距離画像Ｂの画素位置（ｉ，ｊ）での距離値はＦ^{（ｔ−ｎ）}（ｉ，ｊ）、距離画像Ａと距離画像Ｂとの差分画像Ｄ^（ｔ）（ｉ、ｊ）は、式（２）から生成することができる。
【００５２】
【数１】
$$D^{(t)}(i, j) = F^{(t)}(i, j) - F^{(t-n)}(i, j) \tag{2}$$
【００５３】
差分画像について、図１４を参照して、具体的に説明する。図１４（ａ）は、距離画像Ｂの一部のデータであり、画素値が「２００」と「１５０」の２つの画素Ｐ１、Ｐ２があったとする。また、図１４（ｂ）は、距離画像Ａの図１４（ａ）に示した２つの画素Ｐ１、Ｐ２と同じ位置にある２つの画素を示したもので、画素値がそれぞれ「１５０」と「２００」であったとする。この場合、式（２）を用いることにより、距離画像Ａと距離画像Ｂとの間の画素Ｐ１、Ｐ２の画素値の変化量は、それぞれ「−５０」「５０」となり、この値が、図１４（ｃ）に示すように、差分画像上の画素Ｐ１、Ｐ２の画素値となる。すなわち、距離画像Ｂでは、画素Ｐ１の位置にあったものが、当該対象物が動作した結果、距離画像Ａでは、画素Ｐ２に移動し、その結果、差分画像上では、画素Ｐ１の画素値が「−」の値を持ち、画素Ｐ２が「＋」の値をもつこととなる。
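The worked example with pixels P1 and P2 can be reproduced with a few lines of Python. This is only a sketch: the function name `difference_image` and the list-of-lists image layout are assumptions, and the subtraction order (newer frame A minus older frame B) follows the change amounts "−50" and "50" given in the text.

```python
def difference_image(frame_a, frame_b):
    """Per-pixel difference D(i, j) = A(i, j) - B(i, j).

    frame_a is the newer distance image (time t) and frame_b the older
    one (time t - n); both must have the same dimensions.
    """
    return [
        [a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


frame_b = [[200, 150]]  # pixels P1, P2 at time t - n
frame_a = [[150, 200]]  # pixels P1, P2 at time t
diff = difference_image(frame_a, frame_b)  # [[-50, 50]]
```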
【００５４】
差分画像で得られたものは、距離画像Ａと距離画像Ｂで変化のあった部分、つまり、時刻ｔ−ｎと時刻ｔでそれぞれの距離画像に撮像されているもののうち、変化のあった部分である。距離画像Ａと距離画像Ｂが時系列的に同じものを撮像した画像の場合、動きのあった部分のみが変化するため、差分画像によって得られるものは、撮像された対象物のうち、動きのあった部分であるといえる。
【００５５】
例えば、図８に示すように、人間の上半身が撮像されている際に、その人間が手振り動作をしている時には、距離画像Ａとしての図８（ｂ）と距離画像Ｂとしての図８（ａ）とから、実際に動いた腕の部分の領域が差分画像として得られる。図８（ｃ）は、図８（ａ）と図８（ｂ）とから生成される差分画像の表示イメージを示したものである。差分画像のデータ中「−」の値を持つ画素値の画素は、その画素値の絶対値をとって、グレースケールで表現したものである。
【００５６】
図７の説明に戻る。次に、検出部３では、差分計算部２によって生成された差分画像から対象物の動きの特徴量を検出する（図７のステップＳ３〜ステップＳ５）。
【００５７】
それでは、検出部３で実際にどのようにして特徴量の検出を行うのかを主に、図９〜図１３を参照して具体的に説明する。
【００５８】
まず、得られた差分画像から流入領域と流出領域とを抽出する（ステップＳ３）。
【００５９】
対象物の動きにより、距離画像Ｂの時点（時刻ｔ−ｎ）では物体が存在せずに、距離画像Ａの時点（時刻ｔ）で新たに物体が存在するようになった領域（以降、流入領域Ｄ_ＩＮと呼ぶ）と、逆に、距離画像Ｂの時点（時刻ｔ−ｎ）では物体が存在し、距離画像Ａの時点（時刻ｔ）で既に物体が存在しなくなった領域（以降、流出領域Ｄ_ＯＵＴと呼ぶ）が生じる。
【００６０】
例えば、図９（ａ）（ｂ）に示すように、対象物が時刻ｔ−ｎから時刻ｔの間に、移動した場合を考える。この場合、時刻ｔ−ｎに取得された距離画像Ｂと、時刻ｔに取得された距離画像Ａとの差分画像の表示イメージは、図１０（ａ）に示したようなものとなる。実際の差分画像のデータでは、図１０（ｂ）に示すように、流入領域に対応する部分の画素の画素値（ｚ軸方向の値）は「＋」の値であり、流出領域に対応する部分の画素の画素値は「−」の値である。
【００６１】
すなわち、流入領域は、差分画像中「＋」の値の画素値を持つ画素からなる領域であって、流出領域は、差分画像中「−」の値の画素値を持つ画素からなる領域であり、時刻ｔにおける流入領域Ｄ_ＩＮ ^（ｔ）、流出領域Ｄ_ＯＵＴ ^（ｔ）は、それぞれ式（３）、（４）で表すことができる。
【００６２】
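【数２】
$$D_{IN}^{(t)}(i, j) = \begin{cases} D^{(t)}(i, j) & \text{if } D^{(t)}(i, j) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
$$D_{OUT}^{(t)}(i, j) = \begin{cases} -D^{(t)}(i, j) & \text{if } D^{(t)}(i, j) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$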
【００６３】
例えば、図１４（ｃ）に示した差分画像（の一部）からは、画素値「５０」の画素Ｐ２が流入領域（の一部）として抽出され、画素値「−５０」の画素Ｐ１が流出領域の（一部）として抽出される。
【００６４】
図１０（ａ）に示した差分画像から抽出される流入領域の画像を図１１（ａ）に、流出領域の画像を図１２（ａ）に示す。なお、図１２（ａ）に示すように、流出領域の画像は、式（４）からも明らかなように、各画素値は絶対値に変換されている。
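The split of a difference image into inflow and outflow regions can be sketched as follows. The function name `split_inflow_outflow` is an assumption; the behavior follows the description above, with positive pixels forming the inflow region and negative pixels forming the outflow region after conversion to absolute values.

```python
def split_inflow_outflow(diff):
    """Split a difference image into inflow and outflow region images.

    inflow:  pixels with positive difference, kept as they are
    outflow: pixels with negative difference, stored as absolute values
    Pixels that do not belong to a region are set to 0.
    """
    inflow = [[v if v > 0 else 0 for v in row] for row in diff]
    outflow = [[-v if v < 0 else 0 for v in row] for row in diff]
    return inflow, outflow


inflow, outflow = split_inflow_outflow([[-50, 50]])
```

For the P1/P2 example, pixel P2 (value 50) lands in the inflow image and pixel P1 (value −50) lands in the outflow image as 50.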
【００６５】
次に、流入領域Ｄ_ＩＮ ^（ｔ）、流出領域Ｄ_ＯＵＴ ^（ｔ）の位置を求める（ステップＳ４）。本実施形態では、両領域の位置を重心点で代表し（図１１，図１２参照）、流入領域Ｄ_ＩＮ ^（ｔ）の重心位置をＧ_ＩＮ ^（ｔ）、流出領域Ｄ_ＯＵＴ ^（ｔ）の重心位置をＧ_ＯＵＴ ^（ｔ）を計算する。
【００６６】
重心位置Ｇ＝（Ｇｘ，Ｇｙ，Ｇｚ）は式（５）を用いて計算する。
【００６７】
【数３】

【００６８】
なお、ここに示した重心の計算方法は一例で、これに限定されるものではなく、他の定義を用いて計算することが可能である。
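Equation (5) is not reproduced in this text, so the following is only one possible centroid definition, chosen for illustration. It assumes that the x and y coordinates are weighted by pixel value and that the z coordinate is the mean pixel value of the region; the function name `centroid` and the use of column index as x and row index as y are likewise assumptions.

```python
def centroid(region):
    """Centroid (Gx, Gy, Gz) of a region image, or None if it is empty.

    region: 2D list in which 0 means "pixel not in the region".
    Gx, Gy: pixel-value-weighted mean of column and row indices (assumed).
    Gz:     mean pixel value over the pixels in the region (assumed).
    """
    total = 0
    sum_x = 0
    sum_y = 0
    count = 0
    for y, row in enumerate(region):
        for x, value in enumerate(row):
            if value > 0:
                total += value
                sum_x += x * value
                sum_y += y * value
                count += 1
    if count == 0:
        return None
    return (sum_x / total, sum_y / total, total / count)
```

Since the text notes that other centroid definitions may be used, an unweighted mean of the coordinates would be an equally valid substitute here.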
【００６９】
さらに、図１３に示すように、ステップＳ４で得られた重心位置Ｇ_ＯＵＴ ^（ｔ）からＧ_ＩＮ ^（ｔ）へのベクトルＶ^（ｔ）＝（Ｖ^（ｔ）ｘ，Ｖ^（ｔ）ｙ，Ｖ^（ｔ）ｚ）を求め、これを特徴量として得る（ステップＳ５）。この特徴量を以降、ディファレンシャル・フロー（ＤｉｆｆｅｒｅｎｔｉａｌＦｌｏｗ）と呼ぶ。時刻ｔにおけるディファレンシャル・フローは、式（６）で得られる。
【００７０】
【数４】
$$V^{(t)} = G_{IN}^{(t)} - G_{OUT}^{(t)} \tag{6}$$
【００７１】
なお、以上で説明したディファレンシャル・フローの計算方法は一例であり、これに限定されるものではない。また、特徴量は、ディファレンシャル・フローに限定されるものではない。
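Given the two centroids, the differential flow is the vector from the outflow centroid to the inflow centroid. The sketch below assumes centroids are 3-tuples (x, y, z); the function name `differential_flow` and the choice to return a zero vector when either region is missing are illustrative assumptions.

```python
def differential_flow(g_out, g_in):
    """Vector from the outflow centroid to the inflow centroid.

    g_out, g_in: (x, y, z) tuples, or None when the region is empty.
    A missing region is treated as "no motion" and yields a zero vector
    (an assumption made for this sketch).
    """
    if g_out is None or g_in is None:
        return (0.0, 0.0, 0.0)
    return tuple(i - o for o, i in zip(g_out, g_in))


v = differential_flow((0.0, 2.0, 50.0), (3.0, 1.0, 80.0))
```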
【００７２】
図７の説明に戻る。次に、認識部４は、検出部３で得られた特徴量、すなわち、ディファレンシャル・フローを基に、画像内に含まれる対象物の動きを認識する。
【００７３】
それでは、認識部４で実際にどのようにして認識処理を行うのかを人間の上半身における手振り動作の例を用いて具体的に説明する。手振り動作は、手挙げ／手下げ動作と、手の左右振りという一連の複数の動作から構成されているが、ここでは、この一連の複数の動作のうち、まず、人間の手挙げ／手下げ動作を認識する場合を例にとり説明する。なお、以下の説明では、「動作」という用語も「動き」という用語も同じ意味合いで用いている。
【００７４】
図１５に人間の手挙げ／手下げ動作の様子を示し、図１６（ａ）〜（ｃ）は、この動作中のディファレンシャル・フローＶ^（ｔ）＝（Ｖ^（ｔ）ｘ，Ｖ^（ｔ）ｙ，Ｖ^（ｔ）ｚ）の時間変化の様子を各成分毎に示したものである。なお、図１６（ａ）〜（ｃ）では、横軸方向に時間、縦軸にディファレンシャル・フローの各成分の値を示し、縦軸方向の値は、動きの大きさ（量）の大小を表すための適当な値である。
【００７５】
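FIG. 16 shows how the differential flow values, obtained as described above from the distance images, changed over time when a (randomly chosen) person actually performed the hand-raising and hand-lowering motions. In FIG. 16, the portions corresponding to the hand-raising and hand-lowering motions are enclosed by dotted lines. Where there was motion, the differential flow values change greatly, whereas in the other portions, where there was no motion (a stationary state), the values are close to "0". Motion can thus be recognized by analyzing the differential flow values.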
【００７６】
以降では、より具体的にディファレンシャル・フローの値の解析方法について説明する。
【００７７】
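For example, in the human "hand raising" motion, the hand is raised as shown in FIGS. 15(a) and 15(b), so the motion is characterized by movement in the y-axis direction. In addition, a person generally raises a hand while moving the arm toward the front (in the z-axis direction). When the movements in the y-axis and z-axis directions are both characteristic in this way, the product of the two movement amounts brings out the magnitude and the timing of the "hand raising" motion more clearly. Based on this analysis of a typical human "hand raising" motion, the motion can be recognized from the y and z components of the differential flow V^(t) = (V^(t)x, V^(t)y, V^(t)z) using equation (7) below.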
【００７８】
【数５】
$$\left| V_y \times V_z \right| > TH1 \tag{7}$$
【００７９】
式（７）において、ＴＨ１は閾値で、任意の正定数である。得られたディファレンシャル・フローの成分Ｖｙ、Ｖｚが式（７）の関係を満たすとき、「手挙げ」動作が行われたと認識する。
【００８０】
図１７に｜Ｖｙ×Ｖｚ｜の変化の様子を示す。なお、図１７において、横軸方向に時間、縦軸に｜Ｖｙ×Ｖｚ｜の値を示し、縦軸方向の値は、動きの量（大きさ）の大小を表すための適当な値である。式（７）の関係を満たし、｜Ｖｙ×Ｖｚ｜の値が閾値ＴＨ１を越える時点で、「手挙げ」動作が行われたと認識するわけである。
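The threshold test for the "hand raising" motion can be written directly from the description. In this sketch the function name `is_hand_raise` and the example threshold value are assumptions; the condition itself, |Vy × Vz| exceeding TH1, is the one stated in the text.

```python
def is_hand_raise(flow, th1):
    """Return True when |Vy * Vz| exceeds the threshold TH1.

    flow: differential flow as an (Vx, Vy, Vz) tuple.
    th1:  positive threshold; its value must be tuned for the setup.
    """
    _, vy, vz = flow
    return abs(vy * vz) > th1


# Strong motion in both y and z: 4.0 * 3.0 = 12.0 exceeds the threshold.
raised = is_hand_raise((0.0, 4.0, -3.0), th1=10.0)
# Mostly horizontal motion: 4.0 * 0.5 = 2.0 does not.
not_raised = is_hand_raise((5.0, 4.0, 0.5), th1=10.0)
```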
【００８１】
このように、例えば、人間の動作を認識する場合、実際の人間の動きの３次元性を利用する。人間が手を動かす際、その平面方向（ｘｙ平面方向）の動きと、奥行き方向（ｚ方向）の動きは、独立して生じることはない。つまり、例えば、「手挙げ」動作を行うときには、単に手が上方向に動いているだけではなく、奥行き方向の値も、従属して変化している訳である。つまり、平面方向の動きの成分と奥行き方向の成分には相関関係が存在する。そこで、平面方向の成分と奥行き方向の成分を同時に見ることで、このような３次元的な動きを安定して認識することが可能であるという訳である。
【００８２】
そこで、式（７）で示したように、「手挙げ」動作の場合には、ディファレンシャル・フローの各成分のうち、その動作を特徴付ける動きの方向（例えば、ここでは、ｙ軸方向）の成分と、この動き方向と相関関係のある方向の成分とを用いて、例えば、Ｖｙ×Ｖｚというような平面方向と奥行き方向の成分の積を得ることで、「手挙げ」動作といった認識が可能となる。
【００８３】
さらに、ディファレンシャル・フローを用いた、人間の「手による否定表現（手振り）」動作の認識手法について説明する。
【００８４】
「手振り」動作は、手を何回か横方向に動かす動作と考える。図１８に示すように、最少の手振り回数は４回である。手挙げ時（図１８（ｂ）参照）に１回、横方向（図１８（ｃ）、（ｄ）参照）に２回（一往復で左右に１回ずつ）、手下げ時（図１８（ｅ）参照）に１回である。そこで、横方向に４回以上の運動があった場合、「手振り」動作であるとする。
【００８５】
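Thus the human "hand waving" motion is characterized above all by movement in the x-axis direction, and movement in the x-axis direction is always accompanied by movement in the z-axis direction (so the x-axis and z-axis directions are correlated). The motion can therefore be recognized by looking at, for example, the value of |Vx × Vz|, and the left-right waving motion can be detected by equation (8). Here, TH2 is a threshold that takes an arbitrary positive constant value.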
【００８６】
【数６】
$$\left| V_x \times V_z \right| > TH2 \tag{8}$$
【００８７】
式（８）の条件を、一連の動作中に４回以上満たす場合、その動作を「手振り」動作と認識する。
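Counting how often the condition of equation (8) is met during a sequence might look like the sketch below. The text does not say how one "occurrence" is delimited, so this example assumes that each rise of |Vx × Vz| from at or below TH2 to above TH2 counts once; the function names are also assumptions.

```python
def count_crossings(values, threshold):
    """Count how many times the series rises above the threshold."""
    count = 0
    above = False
    for value in values:
        now_above = value > threshold
        if now_above and not above:
            count += 1
        above = now_above
    return count


def is_hand_wave(flows, th2, min_count=4):
    """Return True when |Vx * Vz| exceeds TH2 at least min_count times.

    flows: sequence of differential flows (Vx, Vy, Vz) over one motion.
    """
    values = [abs(vx * vz) for vx, _, vz in flows]
    return count_crossings(values, th2) >= min_count


flows = [(3, 0, 3), (0, 0, 0), (-3, 0, 3), (0, 0, 0),
         (3, 0, -3), (0, 0, 0), (-3, 0, -3)]
```

With these seven samples the product peaks four times, so `is_hand_wave(flows, 5)` is satisfied, while the first five samples alone contain only three peaks.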
【００８８】
図１９は、実際に人間が一般的に普通の早さで「手振り」動作を行った場合の、｜Ｖｘ×Ｖｚ｜の値の変化の様子を示したものである。なお、図１９において、横軸方向に時間、縦軸に｜Ｖｘ×Ｖｚ｜の値を示し、縦軸方向の値は、動き量の大小を表すための適当な値である。
【００８９】
図１９に示した例の場合、一連の動作中に６回の横方向の運動が検出され、この動作は、「手振り」動作であると認識された。
【００９０】
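In the above description, a motion is recognized using two of the three components of the differential flow: the component in the characteristic direction of the motion to be recognized and the component in a direction correlated with that direction. The invention is not limited to this case. Only the component in the characteristic direction of the motion to be recognized may be used, and the motion may be recognized when that component value exceeds a predetermined threshold. Alternatively, all three components may be used, and the motion may be recognized when the product of the component values exceeds a predetermined threshold. In this way, a motion can be recognized using at least one of the three components of the differential flow, selected according to the type of motion to be recognized. The selected components are preferably either the component in the characteristic direction of the motion alone, or that component together with the component in a direction correlated with it.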
【００９１】
また、認識部４は、動きの種類を認識するだけでなく、その動作を行う際の動きの早さ、動きの量（大きさ）などの動きの状態も認識することができる。
【００９２】
例えば、図１９に示したような手の振り方よりも早く手を左右に振った場合の「手振り」動作の｜Ｖｘ×Ｖｚ｜の値の時間的な変化を図２０に示す。なお、図２０において、横軸方向に時間、縦軸に｜Ｖｘ×Ｖｚ｜の値を示し、縦軸方向の値は、動きの量（大きさ）の大小を表すための適当な値である。
【００９３】
図１９と図２０を比較することにより明らかなように、図２０では、動作の開始時刻と終了時刻が図１９の場合より早くなり、しかも一連の動作中に検出される、６回の横方向の運動の間隔は狭くなっていることがわかる。そこで、例えば、認識すべき動きに含まれる一連の動きの検出間隔が所定時間より短い場合には、「早い動き」であると判定するようにしてもよい。
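One way to turn the "detection interval shorter than a predetermined time" rule into code is shown below. The text does not specify how the intervals of a sequence are aggregated, so this sketch assumes the mean interval is compared with the limit; the function name `is_fast_motion` and the time unit (seconds) are assumptions.

```python
def is_fast_motion(detection_times, max_interval):
    """Return True when the mean interval between detections is short.

    detection_times: ascending times (in seconds) at which the lateral
                     movements of one motion were detected.
    max_interval:    predetermined limit; shorter means "fast motion".
    """
    if len(detection_times) < 2:
        return False  # a single detection has no interval to measure
    intervals = [
        later - earlier
        for earlier, later in zip(detection_times, detection_times[1:])
    ]
    return sum(intervals) / len(intervals) < max_interval
```

Comparing the longest interval instead of the mean would be a stricter alternative under the same description.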
【００９４】
また、図１９に示したような手の振り方よりも大振りで手を左右に振った場合の「手振り」動作の｜Ｖｘ×Ｖｚ｜の値は、図１９の場合よりも大きくなる。従って、｜Ｖｘ×Ｖｚ｜の値に、横方向の動きを検出するための第１の閾値（この場合、ＴＨ２）の他に、「大きな動き」であると判定するための第２の閾値を設け、例えば、この値を超えるような場合には、「大きな動き」であると判定するようにしてもよい。
【００９５】
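In general, "hand waving" includes the wave that means "goodbye" and the wave that expresses denial, as in "no, no"; the difference between the two is presumably the speed of the wave. A hand is usually waved faster for "no, no" than for "bye-bye". The recognition unit 4 therefore recognizes not only the type of motion, such as "hand raising", "hand lowering", or a "hand waving" motion composed of these and "waving the hand left and right", but also the state of the motion as described above. For example, when a fast "hand waving" motion is recognized, it can be judged to mean "no", and when an ordinary "hand waving" motion that is not fast is recognized, it can be judged to mean "goodbye". In other words, the meaning expressed by the recognized motion can also be recognized.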
【００９６】
なお、以上で述べた解析手法は、あくまでも一例であり、これに限定されるものではない。Ｖｘ、Ｖｙ、Ｖｚに関する他の計算方法を用いてもよいし、ＦＦＴやＷａｖｅｌｅｔ変換に代表されるような信号処理の手法を用いることも可能である。人工知能における知識処理的な手法でも構わない。あるいは、その他の考えられるあらゆる手法を取ることができる。
【００９７】
また、以上で述べた「手挙げ」、「手の左右振り」といった動作は、あくまでも一例であり、これに限定されることなく、あらゆる動作を解析することが可能である。動作主体も人間に限定されるものではなく、あらゆる物体に関して、本手法を適用可能である。
【００９８】
さらに、ディファレンシャル・フローを用いた解析は、一例であり、これとはことなる特徴量を解析しても構わない。
【００９９】
以上で説明したように、上記第１の実施形態では、対象物を撮影した２枚の距離画像間の差を用いることで、対象物の動きに関する３次元的な特徴量を算出し、それを利用して、対象物の動きの３次元的な認識を実現している。
【０１００】
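If one tried to recognize motion from two-dimensional feature quantities alone, using two-dimensional images rather than distance images that also carry depth information, then for a motion such as a person "turning the head sideways", the difference between the head regions in two two-dimensional images would reveal that the head had moved, but could not reliably establish that the movement was a "sideways turn". The first embodiment differs from recognition methods that infer three-dimensional motion from the two-dimensional information in conventional two-dimensional images, which, unlike distance images, carry no depth information (for example, inferring that a hand has rotated about the y-axis because its projected area in the x-axis (horizontal) direction has decreased). Because recognition uses a feature quantity (the differential flow) that actually expresses the three-dimensional nature of the distance image, three-dimensional motion can be recognized more reliably and more stably than with the conventional methods.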
【０１０１】
以下、第１の実施形態のいくつかの変形例を示す。
【０１０２】
（第１実施形態の変形例１）
画像取得部１で、所定時間毎に距離画像を取得するのではなく、任意のタイミングで距離画像を取得するようにしてもよい。動きの速い物体を撮像している際には速い間隔毎に、遅い物体を撮像している際には遅い間隔毎になどといったように、撮像物に応じて取得間隔をダイナミックに変化させてもよいし、例えば、ユーザの指示などを用いて、任意のタイミングで取得するようにしてもよい。また、それ以外の方法でも構わない。
【０１０３】
このようにすることにより、例えばユーザが開始時と終了時をスイッチで指示し、その間に特定の動きが行われたかどうかといったような任意の時間間隔内での３次元的な動き認識を行うことが可能である。また、認識したい物体の動作速度に応じて、動作認識に適した取得間隔に制御するようにしてもよい。
【０１０４】
（第１実施形態の変形例２）
差分計算部２で、最新のフレームではなく、過去の特定のフレーム（時刻ｔ（現在）よりも前の任意の時刻ｔ’）を距離画像Ａとし、そこから数フレーム前（例えば、時刻ｔ’−ｎのフレーム）を距離画像Ｂとして差分画像を生成するようにしてもよい。
【０１０５】
このようにすることにより、過去の特定の時点での３次元的な動き認識を行うことが可能である。
【０１０６】
つまり、第１の実施形態で説明したように、リアルタイムの動き認識だけではなく、任意の時点の動き認識を行うことが可能である。これにより、ビデオテープ、ハードディスクなどの記録装置に記録された距離画像ストリームのオフライン認識を行うことができる。
【０１０７】
（第１実施形態の変形例３）
第１の実施形態および上記変形例２で、差分計算部２において、距離画像Ａは、距離画像Ｂよりも時刻的に新しい画像として説明したが、これに限られるものではなく、時刻関係が逆転しても同様である。
【０１０８】
（第１実施形態の変形例４）
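As described in the first embodiment, the recognition unit 4 analyzes the feature quantity (the differential flow, as one example) to recognize whether a given motion is being performed. By also analyzing the magnitude of the feature quantity and the range over which it varies, the unit can recognize how large the motion is.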
【０１０９】
例えば、第１の実施形態では、「手の左右振り」動作の認識の例で、横方向の動きを検出する際に、｜Ｖｘ×Ｖｚ｜の値がある閾値を越えたかどうかをみていたが、これを押し進めて、閾値を１つだけではなく、ＴＨ１、ＴＨ２、ＴＨ３（これらは任意の正定数で、ＴＨ１＜ＴＨ２＜ＴＨ３を満たすものとする）などと言ったように例えば３つ用意して、この値の大きさがどの閾値を超えたかによって動きの大きさを３段階に分けることができる。このように、複数の閾値を用意することで、動きが行われたかどうかだけでなく、その動きの大きさのレベルをも知ることが可能である。また、閾値処理ではなく、その値自体をアナログ量として見て、動きの大きさをアナログ量として表現することも可能である。
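The multi-threshold grading can be sketched as counting how many ascending thresholds a value exceeds. The function name `motion_level` and the numeric thresholds in the usage note are assumptions; only the idea of three ordered thresholds comes from the text.

```python
def motion_level(value, thresholds):
    """Return how many of the ascending thresholds the value exceeds.

    0 means no motion was detected; len(thresholds) is the largest level.
    thresholds must be sorted in ascending order (TH1 < TH2 < TH3).
    """
    level = 0
    for threshold in thresholds:
        if value > threshold:
            level += 1
        else:
            break
    return level
```

With hypothetical thresholds (2, 5, 10), a value of 7 is graded as level 2 and a value of 11 as level 3, while a value of 1 yields level 0, meaning no motion.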
【０１１０】
なお、ここで説明した方法は一例であり、これに限定されるものではない。どの値を解析するかも自由に選べるし、その選んだ値からどのように動きの大きさを判別するかも、各種の方法を取ることができる。
【０１１１】
（第１実施形態の変形例５）
画像取得部１で、取得する距離画像は、第１の実施形態で表現した画像に限られない。例えば、モーションキャプチャ法により得られた物体の特徴点データと物体の３次元モデルを組み合わせることで得られた物体の３次元形状データや、ＣＧなどで用いられるために作成された３次元データなどは、通常画像と呼ばないことが多いが、データの持つ性質は、３次元的な形状を表現しているため、第１の実施形態で説明した距離画像に準じる性質を持つ。そこで、これらは本実施形態における距離画像と同等とみなすことができる。
【０１１２】
このように、通常画像と呼ばれないデータに関しても、３次元の形状データを持つものを画像取得部１で取得することで、同様に、その物体の動きの認識を行うことが可能である。
【０１１３】
（第１実施形態の変形例６）
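The recognition unit 4 may output not only the recognition result, that is, whether the motion was performed, but also a confidence level for that recognition. The confidence level is determined from, for example, the numerical margin by which the recognition condition is satisfied. For example, when the "hand raising" motion of the first embodiment is recognized, the decision is made using equation (7), and the value of |Vy × Vz| − TH1 (the margin over the threshold) or the value of Vy can be used as the confidence level. These values may also be combined to calculate the confidence level, or other values may be used.
【０１１４】
This makes it possible to know with what degree of confidence a given motion has been recognized. For example, if "hand raising" is recognized with high confidence, the user can place great trust in the result, whereas if the confidence is low, the result can be treated as no more than a guide.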
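A recognizer that reports the threshold margin alongside its decision could look like this. The sketch uses |Vy × Vz| − TH1 as the confidence value, which is one of the options named in the text; the function name `recognize_with_confidence` and the tuple return type are assumptions.

```python
def recognize_with_confidence(flow, th1):
    """Return (recognized, margin) for the hand-raising test.

    margin is |Vy * Vz| - TH1: positive when the motion is recognized,
    and larger values indicate a more confident recognition.
    """
    _, vy, vz = flow
    margin = abs(vy * vz) - th1
    return margin > 0, margin
```

A caller can then treat small positive margins as tentative results and large margins as reliable ones.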
【０１１５】
（第２の実施形態）
上記第１の実施形態で説明した画像認識装置およびその手法は、距離画像から対象物の３次元的な動きの特徴量（ディファレンシャル・フロー）を検出し、それを用いて距離画像内に含まれる対象物の動きを認識するものであり、距離画像内の１つの動きの特徴量を求めて、その１つの動きの認識のみを行う場合について説明した。次に、第２の実施形態では、距離画像に含まれる複数の動きのそれぞれを認識する場合について説明する。
【０１１６】
図２１は、第２の実施形態に係る画像認識装置の全体構成図である。なお、図２１において、図１と同一部分には同一符号を付し、異なる部分についてのみ説明する。すなわち、図２１の画像認識装置は、差分計算部２で得られた差分画像から、対象物の動作認識のための認識領域を抽出する領域抽出部５が新たに追加され、検出部３は、領域抽出部５で差分画像から抽出された認識領域毎に特徴量を検出するようになっている。
【０１１７】
画像取得部１および差分計算部２に関しては、第１の実施形態とまったく同様である。
【０１１８】
次に、領域抽出部５について、図２２に示すフローチャートを参照して説明する。
【０１１９】
領域抽出部５は、画像取得部１から送られてきた、例えば、図２３（ａ）（ｂ）に示したような距離画像中に複数の動きが同時に混在している場合に、図２３（ｃ）に示したように、差分画像から、各動きを認識するための複数の領域を抽出するようになっている。
【０１２０】
まず、図２３（ａ）、（ｂ）に示した距離画像Ａ（時刻ｔに撮像されたもの）、距離画像Ｂ（時刻ｔ−ｎに撮像されたもの）に含まれる対象物（動き）の領域を抽出する（ステップＳ１０１）。ここで、１つの対象物は連続する領域で占められた領域であると定義し、対象物の画像の外接矩形領域を抽出するものとする。なお、外接矩形領域に限らず、対象物の存在する領域が抽出されれば、他の形状の領域であってもよい。この場合、図２３（ａ）に示した距離画像Ａからは、図２４（ａ）に示すように、対象物の領域Ｒ１、Ｒ２が抽出される。また、図２３（ｂ）に示した距離画像Ｂからは、図２４（ｂ）に示すように、対象物の領域Ｒ１´、Ｒ２´が抽出される。
【０１２１】
次に、距離画像Ａ、Ｂ中の対応する２つの領域（好ましくは、同じ対象物が含まれる２つの領域）を合成して認識領域を生成する（ステップＳ１０２）。例えば、図２３（ａ）の距離画像Ａ中の領域Ｒ１と図２３（ｂ）の距離画像Ｂ中の領域Ｒ１´とが対応し、図２３（ａ）の距離画像Ａ中の領域Ｒ２と図２３（ｂ）の距離画像Ｂ中の領域Ｒ２´とが対応するのであれば、図２５に示したように、領域Ｒ１とＲ１´とを合成して動きを認識するための認識領域ＣＲ１が生成され、また、領域Ｒ２とＲ２´とを合成して認識領域ＣＲ２が生成される。
【０１２２】
例えば、距離画像ＡとＢとを重ね合わせたときに、領域Ｒ１とＲ１´の重なり合う領域と、それ以外の両者の全ての領域とを認識領域ＣＲ１とする。認識領域ＣＲ２も同様に、距離画像ＡとＢとを重ね合わせたときに、領域Ｒ２とＲ２´の重なり合う領域と、それ以外の両者の全ての領域とを認識領域ＣＲ２とする。
【０１２３】
ここで、対応の求め方に関しては本発明では特に限定しないが、一番近い領域同士が同じ対象物の領域であると判断し、それらを対応させても良いし、何らかの知識を用いて同じ対象物だと判別される領域を求め、それらを対応させてもよい。他の方法でも構わない。
【０１２４】
さらに、領域抽出部５は、差分計算部２で求めた差分画像から複数の認識領域を抽出する（ステップＳ１０３）。すなわち、例えば、図２３（ａ）に示した距離画像Ａと図２３（ｂ）に示した距離画像Ｂとから、差分計算部２にて、図２６（ａ）に示すような差分画像が生成されたとする。このような差分画像から図２５に示した認識領域ＣＲ１、ＣＲ２のそれぞれに対応する部分を認識領域ＣＲ１´、ＣＲ２´として抽出する。例えば、距離画像ＡとＢとを重ね合わせて認識領域ＣＲ１、ＣＲ２を生成したが、さらに、その上に差分画像を重ね合わせたときの、差分画像中の認識領域ＣＲ１、ＣＲ２のそれぞれに対応する領域を認識領域ＣＲ１´、ＣＲ２´として抽出する。
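Steps S102 and S103 can be sketched with axis-aligned rectangles. This example simplifies the combined recognition region to the smallest rectangle that contains both object regions, which is an assumption; the box format (left, top, right, bottom) with inclusive pixel coordinates and the function names `merge_boxes` and `crop` are also illustrative.

```python
def merge_boxes(box_a, box_b):
    """Smallest rectangle containing both boxes (left, top, right, bottom)."""
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def crop(image, box):
    """Cut the recognition region out of an image (inclusive bounds)."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Region R1 in distance image A and the corresponding R1' in image B.
cr1 = merge_boxes((2, 3, 5, 8), (4, 1, 7, 6))
```

Applying `crop` to the difference image with each merged box gives the per-region difference data from which a separate feature quantity can be computed.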
【０１２５】
なお、領域抽出部５は、ステップＳ１０１において、距離画像中から１つの対象物の領域のみが抽出されたときでも、ステップＳ１０２，ステップＳ１０３の処理を行って、距離画像Ａと距離画像Ｂ中の当該対象物の含まれる対応する領域を合成して認識領域を生成し、差分画像から当該認識領域を抽出する。
【０１２６】
次に検出部３について説明する。
【０１２７】
検出部３では、領域抽出部５で差分画像から抽出された複数の認識領域のそれぞれについて、特徴量（例えば、ここでは、ディファレンシャル・フロー）を求める（図２７参照）。
【０１２８】
特徴量の検出処理に関しては、第１の実施形態の検出部３と同様である。
【０１２９】
認識部４では、検出部３で検出された複数の認識領域毎の特徴量をそれぞれ解析し、動きの認識を行う。具体的な個々の動作の認識方法に関しては、第１の実施形態の認識部４と同様である。
【０１３０】
この際、認識のための解析は、それぞれの特徴量の値に関して独立して行ってもよいし、それぞれの値を相互参照して解析してもよい。
【０１３１】
このように、距離画像中に複数の動きが存在する場合には、差分画像から各動きの存在位置に対応する複数の認識領域を抽出して、この認識領域毎に複数の動きのそれぞれに対応した特徴量を求めて動作を認識することにより、単一の動きの認識にとどまらず、複数の動きの認識を同時に行うことが可能となり、しかも、複数の３次元的な動きのそれぞれを、安定的かつ高精度に認識することができる。
【０１３２】
なお、以上で説明した領域抽出部における差分画像からの認識領域の抽出手法は一例であり、これに限定されるものではない。
【０１３３】
（第３の実施形態）
第１の実施形態では、認識部４において、ある動きに関する認識を行っていた。第３の実施形態では、これを推し進め、複数の動きの識別を含んだ動き認識を可能とするものである。
【０１３４】
例えば、第１の実施形態では、「手振り」動作を例にとり説明したが、この「手振り」動作は、「手挙げ」「手下げ」「手の左右振り」という動きからなる。このように、１つの認識対象の動きには、複数種類の動きから構成される場合もある。そこで、第３の実施形態では、複数種類の動きをそれぞれ認識して、それらの関連性から１つの動きを識別する事も可能な画像認識装置について説明する。
【０１３５】
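FIG. 28 is an overall configuration diagram of the image recognition apparatus according to the third embodiment. In FIG. 28, the same parts as in FIG. 1 are given the same reference numerals, and only the differences are described. The image recognition apparatus of FIG. 28 has a plurality of recognition units (here, for example, x units, where x is an arbitrary integer): a first recognition unit 4a, a second recognition unit 4b, …, and an x-th recognition unit 4x. Each recognizes the motion of an object contained in the image from the feature quantity obtained by the detection unit 3 (here, for example, the differential flow). A motion identification unit 6, which identifies the motion of the object from the recognition results obtained by the recognition units 4a to 4x, is newly added.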
【０１３６】
画像取得部１、差分計算部２および検出部３に関しては、第１の実施形態とまったく同様である。
【０１３７】
次に、複数の認識部４ａ〜４ｘについて説明する。各認識部では、その認識部に予め定められた特定の動きを認識する。
【０１３８】
例えば、第１の認識部４ａは、「手挙げ」動作の認識を行う。認識の方法に関しては、第１の実施形態と同様である。第２の認識部４ｂでは、第１の認識部４ａとは異なる特定の動きの認識を行う。例えば、「手の左右振り」動作の認識を行う。認識の方法に関しては、第１の実施形態と同様である。
【０１３９】
以下、同様にして、第ｘの認識部４ｘでは、それ以外の認識部とは異なる特定の動きの認識を行う。例えば、「首の上下振り」動作の認識を行う。認識の方法に関しては、第１の実施形態と同様である。
【０１４０】
次に、動作識別部６について説明する。動作識別部６では、複数の認識部４ａから４ｘで得られた認識結果をもとに、対象物の動きの種類を最終的に識別（弁別）する。
【０１４１】
例えば、「首の上下振り」動作のみが認識成功の結果が得られており、他の動きに関する認識が失敗している場合、対象物の動作は、「首の上下振り」であると識別することができる。このように、複数の認識部４ａ〜４ｘのうちの１つの認識部での認識結果のみが成功している場合は、動作識別部６は、その認識された動きをそのまま識別結果として出力する。
【０１４２】
複数の認識部４ａ〜４ｘでの認識結果に複数の成功が含まれる場合の動作識別部６の処理動作について説明する。第１の実施形態で説明したように、人間が「手振り」動作を行う場合、通常、人間は手を体の前ぐらいまで挙げて、それから左右方向に手を振る。そして、最後には、手を降ろす。そこで、このような動作の場合、「手挙げ」、「手の左右振り」、「手下げ」の３つの動きの認識が成功し、この順番に動作が行われているのであれば、「手振り」という動作が識別（弁別）されることとなる。
【０１４３】
このような場合、複数の認識部４ａ〜４ｘのいずれか３つで、上記３つの動作のそれぞれを認識するようにし、人間の「手振り」動作に関する知識として、上述したような３つの動作が包含されるという知識を予め動作識別部６に記憶させておけばよい。
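The knowledge-based identification can be illustrated with a small lookup of ordered motion sequences. This is a sketch under stated assumptions: the dictionary representation of the stored knowledge, the English motion labels, and the function names are all hypothetical, since the text leaves the knowledge representation open.

```python
def contains_in_order(sequence, pattern):
    """True when every item of pattern appears in sequence, in order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def identify_action(recognized, knowledge):
    """Identify the overall action from the recognized motions.

    recognized: motion names in the order they were recognized.
    knowledge:  maps a composite action name to its ordered motions.
    A composite action is preferred; otherwise a single recognized
    motion is returned as is, and None means "not identified".
    """
    for action, pattern in knowledge.items():
        if contains_in_order(recognized, pattern):
            return action
    if len(set(recognized)) == 1:
        return recognized[0]
    return None


knowledge = {
    "hand wave": ["raise hand", "wave hand sideways", "lower hand"],
}
```

With this knowledge, a raise, one or more sideways waves, and a lower are identified as "hand wave", while a lone "nod" is passed through unchanged.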
【０１４４】
なお、知識の表現方法、記憶方法などは、本発明では特に問わない。考えられる任意の方法をとることが可能である。また、知識は、予め記憶しておいたもので固定されているわけではなく、動作中に任意に入れ替えたり、更新したりすることも可能である。
【０１４５】
なお、上述した弁別の手法はあくまでも一例であり、これに限定されるものではない。第１の実施形態の第６の変形例の項で説明した信頼度などをもとに弁別を行ってもよいし、これ以外の方法でも構わない。
【０１４６】
また、上記第３の実施形態では、１つの対象物の動きを認識する場合を説明したが、この手法を第２の実施形態で説明した画像認識装置にも適用する事も可能である。すなわち、距離画像中に複数の動きが存在する場合には、領域抽出部５で差分画像から各動きの存在位置に対応する複数の認識領域を抽出し、検出部３で抽出された認識領域毎に、複数の動きのそれぞれに対応した特徴量を求めれば、各認識対象領域のそれぞれについて、複数の認識部４ａ〜４ｘで動きの種類を認識して、動作識別部６で最終的に各認識対象領域でどのような動作が行われていたのかを識別する。また、動作識別部６は、各認識対象領域から認識された各動きから、全体で、どのような動きが行われていたのかを識別することもできる。
（第４の実施形態）
図２９は、本発明の第４の実施形態に係る画像認識装置の全体構成図である。なお、図２９において、図１と同一部分には同一符号を付し、異なる部分についてのみ説明する。すなわち、図２９に示す画像認識装置には、画像取得部１で取得された距離画像から、その画像中に含まれる動作認識の対象物の形状を認識するための形状認識部７がさらに追加されている。
【０１４７】
形状認識部７での対象物の形状の識別手法に関しては本発明では特に言及しないが、考えられるあらゆる手段を用いることができる。例えば、その一手法として、テンプレートマッチング法が挙げられる。これは、テンプレートと呼ばれる形状の雛形を多数用意し、画像に含まれる物体と一番類似しているテンプレートを検出し、そのテンプレートが表現している形状を結果として得るというものである。具体的には、丸、三角、四角、手の形状…などといったようなテンプレートを形状認識部７に予め記憶しておき、距離画像内の物体が三角のテンプレートに最も類似している場合には、距離画像内の対象物の形状は三角形状であると認識する。
【０１４８】
そのために、形状認識部７は、例えば、画像取得部１から取得した距離画像から対象物の輪郭情報を抽出するようにしてもよい。すなわち、図６に示したような距離画像から画素値が予め定められた所定値以下のセルを除き、図３０に示すような撮像された対象物の輪郭情報を抽出する。
【０１４９】
図３０のような輪郭情報を抽出するには、隣り合う画素の画素値を比較し、画素値が一定値α以上のところだけに定数値を入れて、同じ定数値が割り振られた連続した画像領域の画素を抽出すればよい。
【０１５０】
すなわち、例えば図４に示したような距離画像データのマトリックス上の座標位置（ｉ、ｊ）にある画素値をＰ（ｉ、ｊ）とし、輪郭情報の画素値をＲ（ｉ、ｊ）とすると、
・｛Ｐ（ｉ、ｊ）−Ｐ（ｉ−１、ｊ）｝＞α、かつ
｛Ｐ（ｉ、ｊ）−Ｐ（ｉ、ｊ−１）｝＞α、かつ
｛Ｐ（ｉ、ｊ）−Ｐ（ｉ＋１、ｊ）｝＞α、かつ
｛Ｐ（ｉ、ｊ）−Ｐ（ｉ、ｊ＋１）｝＞α
のとき、Ｒ（ｉ、ｊ）＝２５５
・上記以外のとき、Ｒ（ｉ、ｊ）＝０
とすることにより、図３０のような対象物の輪郭情報を得ることができる。
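The pixel rule above can be sketched as follows. The code implements the four conditions joined by "and", exactly as stated; the `require_all` flag, which switches to "any of the four conditions", is an illustrative addition that is not in the source, as are the function name `extract_contour` and the treatment of out-of-range neighbors as 0.

```python
def extract_contour(p, alpha, require_all=True):
    """Build the contour image R from the distance image P.

    R(i, j) = 255 where P(i, j) exceeds its four neighbors by more than
    alpha (all four when require_all is True, at least one otherwise),
    and 0 elsewhere. Neighbors outside the image count as 0 (assumed).
    """
    height = len(p)
    width = len(p[0])

    def value(i, j):
        return p[i][j] if 0 <= i < height and 0 <= j < width else 0

    combine = all if require_all else any
    r = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            diffs = (
                p[i][j] - value(i - 1, j),
                p[i][j] - value(i, j - 1),
                p[i][j] - value(i + 1, j),
                p[i][j] - value(i, j + 1),
            )
            if combine(d > alpha for d in diffs):
                r[i][j] = 255
    return r
```

With the conditions joined by "and", only pixels that stand out from all four neighbors are marked, so the relaxed `require_all=False` variant may be closer to an outline when the object is more than one pixel wide.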
【０１５１】
このようにして抽出された対象物の輪郭情報と、予め記憶されたテンプレートとを比較し、対象物の輪郭情報と一番類似しているテンプレートを検出し、そのテンプレートが表現している形状を対象物の形状の認識結果として出力すればよい。
【０１５２】
なお、上記のような輪郭を用いた対象物の形状の認識手法は、一例であって、距離画像から輪郭を求めることなく、テンプレート自体が距離画像であって、取得した距離画像をそのままテンプレートである距離画像と比較して、対象物の形状を認識するようにしてもよい。
【０１５３】
このように、対象物の動作の認識だけではなく、その形状の認識も同時に行い、対象物の動作の認識の際に、認識された形状を参照することにより、例えば、手をどのような形状にどのように動かしたかなども認識することができる。さらに、上記手法は、手話認識にも適用可能である。
【０１５４】
以上の各実施形態やその変形例は、適宜組み合わせて実施することが可能である。また、本発明の手法は、与えられた距離画像もしくはそのストリームに基づいて、動作を認識し、あるいはさらにその認識結果をもとに各種の処理を行うような装置に適用可能である。
【０１５５】
図１、図２１、図２８，図２９に示した各構成部は、画像取得部１を除いて、ソフトウェアとしても実現可能である。また、上記した本発明の手法は、コンピュータに実行させるためのプログラムを記録した機械読みとり可能な媒体として実行することもできる。
【０１５６】
本発明の実施の形態に記載した本発明の手法は、コンピュータに実行させることのできるプログラムとして、磁気ディスク（フロッピーディスク、ハードディスクなど）、光ディスク（ＣＤ−ＲＯＭ、ＤＶＤなど）、半導体メモリなどの記録媒体に格納して頒布することもできる。
【０１５７】
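The present invention is not limited to the above embodiments and can be modified in various ways at the implementation stage without departing from its gist. Furthermore, the above embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent features. For example, even if some constituent features are deleted from all those shown in an embodiment, the configuration from which they have been deleted can be extracted as an invention, provided that (at least one of) the problems described in the section on problems to be solved by the invention can be solved and (at least one of) the effects described in the section on effects of the invention is obtained.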
【０１５８】
【発明の効果】
以上説明したように、本発明によれば、３次元的な動きの認識を容易にしかも安定して、高精度で行うことができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態に係る画像認識装置の構成例を概略的に示す図。
【図２】距離画像を取得する画像取得部の外観の一例を示した図。
【図３】距離画像を取得する画像取得部の構成例を示した図。
【図４】反射光の強度を画素値とする距離画像の一例を示した図。
【図５】A diagram showing a three-dimensional representation of a matrix-format distance image such as that shown in FIG. 4.
【図６】画像取得部により取得された手の距離画像の表示イメージを示した図。
【図７】図１の画像認識装置の処理動作を説明するためのフローチャート。
【図８】差分画像について説明するための図。
【図９】特徴量について説明するための図。
【図１０】特徴量について説明するための図で、特に、流入領域と流出領域について説明するための図。
【図１１】特徴量について説明するための図で、特に、流入領域とその代表点（ここでは、重心）について説明するための図。
【図１２】特徴量について説明するための図で、特に、流出領域とその代表点（ここでは、重心）について説明するための図。
【図１３】特徴量としてのディファレンシャル・フローについて説明するための図。
【図１４】差分画像、流入領域、流出領域の画像データについて説明するための図。
【図１５】距離画像を用いた、手挙げ／手下げ動作について説明するための図。
【図１６】特徴量（ディファレンシャル・フロー）の時間的変化の様子を示した図。
【図１７】｜Ｖｙ×Ｖｚ｜の時間的変化の様子を示した図。
【図１８】手動作における横方向の動きを説明するための図。
【図１９】｜Ｖｘ×Ｖｚ｜の時間的変化の様子を示した図。
【図２０】速い動きで手振り動作を行った場合の｜Ｖｘ×Ｖｚ｜の時間的変化の様子を示した図。
【図２１】本発明の第２の実施形態に係る画像認識装置の構成例を概略的に示す図。
【図２２】図２１の領域抽出部５の処理動作を説明するためのフローチャート。
【図２３】２枚の距離画像に複数の（例えば、ここでは、２つの）動きが存在する場合を説明するための図。
【図２４】距離画像から対象物の外接矩形を抽出する処理を説明するための図。
【図２５】動きを認識するための認識領域を生成する処理を説明するための図。
【図２６】差分画像から認識領域を抽出する処理を説明するための図。
【図２７】差分画像から抽出された認識領域から求めた特徴量（ディファレンシャル・フロー）を説明するための図。
【図２８】本発明の第３の実施形態に係る画像認識装置の構成例を概略的に示す図。
【図２９】本発明の第４の実施形態に係る画像認識装置の構成例を概略的に示す図。
【図３０】距離画像から抽出された物体の輪郭画像の一例を示した図。
【符号の説明】
１…画像取得部
２…差分計算部
３…検出部
４…認識部
４ａ…第１の認識部
４ｂ…第２の認識部
４ｘ…第ｘの認識部
５…領域抽出部
６…動作識別部
７…形状認識部
[0001]
[Technical field to which the invention belongs]
The present invention relates to an image recognition method for recognizing a three-dimensional movement of a subject from an image in which three-dimensional information of the subject such as a distance image appears, and an image recognition apparatus using the same.
[0002]
[Prior art]
Conventionally, when an attempt is made to recognize the movement of an object to be recognized using an imaging device such as a video camera, the following method has been employed.
[0003]
The first is a method called optical flow. In this method, attention is paid to a predetermined block image, and the direction in which a certain image area has moved in a plane is measured between adjacent frame images, and the direction thereof is estimated. A typical method for specifying the moving direction of the target image in the next frame is to calculate the similarity between adjacent frames in time series. A correlation coefficient with the block image of interest in the previous frame is calculated for block images of the same size in the vicinity of the target image area, and the direction to the block with the highest coefficient is estimated as a motion vector.
[0004]
This technique is widely used in robot vision, for example in tracking human faces. When the block image of interest does not change greatly in two dimensions, the method can detect in-plane motion with considerable accuracy. However, because the target image is a two-dimensional image acquired by a video camera or the like, it cannot detect three-dimensional motion that includes the depth direction.
[0005]
Another approach estimates motion from feature points of the recognition target. Several feature points are defined on the target in advance, and the motion is estimated from changes in the positional relationship between those points as the target moves. For example, to recognize a face being turned left and right (rotated horizontally), several facial feature points such as the eyes and nose are defined. The method then infers that the face is being turned to the right from changes such as the eye feature points moving to the right, the distance between the feature points of the two eyes narrowing, or the right-eye feature point disappearing (because the right eye has moved to a position the camera cannot see).
[0006]
However, with this method, markers or the like must be attached at the positions of the facial feature points in order to obtain corresponding points stably in the camera image, so the environments in which it can be used are limited. There are also implementations that do not use markers, but these must recognize the feature points automatically from the image; feature point extraction may not be stable, and obtaining the feature points requires a large computational cost. With this method too, the target image is a two-dimensional image acquired by a video camera or the like, so three-dimensional motion including the depth direction can only be estimated from two-dimensional images.
[0007]
Yet another method estimates motion from changes in the moment of motion. It exploits the property that, for example, when a hand is rotated about the vertical axis, the forward projected area of the hand changes markedly in the horizontal direction but hardly at all in the vertical direction. In such a case, since only the horizontal moment of the hand changes sharply, the hand is inferred to be rotating about the vertical axis.
[0008]
Although this method is certainly one way of estimating three-dimensional motion, it has problems: the shapes of objects that can be used for recognition are limited, and because the motion is hard to distinguish from other planar motions, misrecognition can occur.
[0009]
There are no doubt various other methods besides those listed here, but in every case these methods, which use an imaging device such as a video camera, merely infer three-dimensional motion from two-dimensional images that carry only planar information, and it is difficult to recognize three-dimensional motion stably and with high accuracy. This is because a considerable amount of information is already lost at the moment an inherently three-dimensional object is captured by a camera as planar information.
[0010]
To avoid these problems, there is a method in which a plurality of video cameras capture images simultaneously from several positions, corresponding points between the cameras are found, stereoscopic information is calculated from the captured images to form a three-dimensional image, and motion is recognized using that image.
[0011]
This method, called the stereo method, actually constructs stereoscopic information from a plurality of captured images, so it resolves the problem described above of inferring three-dimensional information from planar information. However, calculating the corresponding points needed to merge the images from the cameras three-dimensionally takes a great deal of computation time, which makes the method unsuitable for real-time processing. In addition, camera position information is needed to find the corresponding points, so the camera positions are constrained and must be calibrated. It was therefore difficult for general users to use the method easily.
[0012]
There is also a technique called motion capture, in which sensors are attached in advance to parts that characterize the motion, such as joints, the sensor positions are extracted from the captured images, and two-dimensional or three-dimensional motion is measured. Compared with the methods introduced above, feature point extraction and matching are lighter, but the overall system is expensive and there are many constraints on operating it. Moreover, the user must wear cumbersome dedicated sensor devices, so the technique is far from usable by general users.
[0013]
As described above, the conventional method has various problems in the method of recognizing a three-dimensional motion including depth information from an image.
[0014]
[Problems to be solved by the invention]
In the conventional methods, the recognition target is acquired with a video camera or the like as an image carrying only two-dimensional information, so the three-dimensional motion of the target has to be recognized from two-dimensional information alone. It was therefore difficult to recognize three-dimensional motion including the depth direction stably and with high accuracy.
[0015]
Accordingly, an object of the present invention is to provide an image recognition method, and an image recognition apparatus using the method, that can recognize three-dimensional motion easily, stably, and with high accuracy.
[0016]
[Means for Solving the Problems]
The present invention acquires images having three-dimensional information of a subject, obtains difference data between a plurality of the acquired images, extracts from the difference data a region in which the pixel value has decreased and a region in which the pixel value has increased with the movement of the subject, extracts a feature quantity of the three-dimensional movement of the subject from the three-dimensional positional relationship between these regions, and recognizes the movement of the subject based on the feature quantity. Three-dimensional movement in the images can thereby be recognized easily, stably, and with high accuracy.
[0017]
Alternatively, images having three-dimensional information of a subject are acquired, and image regions in which the subject moves are detected from the plurality of acquired images. For each detected image region, a region in which the pixel value has decreased and a region in which the pixel value has increased with the movement of the subject are extracted from the difference data of the plurality of images corresponding to that image region, and a feature quantity of the three-dimensional movement of the subject is extracted from their three-dimensional positional relationship. By recognizing, for each image region, the detected movement based on the feature quantity extracted from that region, each of a plurality of three-dimensional movements can be recognized easily, stably, and with high accuracy even when the images contain a plurality of movements.
[0018]
Preferably, the movement of the subject is recognized based on at least one component value, selected according to the movement to be recognized, from among the x-, y-, and z-direction component values of the feature quantity of the three-dimensional movement. In this case, the at least one component value is preferably selected based on the characteristic direction of the movement to be recognized, or based on both that characteristic direction and a direction correlated with it.
[0019]
Preferably, a distance image is used as the image.
[0020]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0021]
(First embodiment)
First, a first embodiment of the present invention will be described.
[0022]
FIG. 1 is an overall configuration diagram of an image recognition apparatus according to the first embodiment. The image recognition apparatus of this embodiment comprises: an image acquisition unit 1 including imaging means for acquiring distance images or other images having information in the depth direction; a difference calculation unit 2 for calculating the difference between any two images having depth information (for example, distance images) acquired by the image acquisition unit 1; a detection unit 3 for detecting a feature amount from the difference image obtained by the difference calculation unit 2; and a recognition unit 4 for recognizing the motion of the target object contained in the images based on the feature amount obtained by the detection unit 3.
[0023]
First, the image acquisition unit 1 will be described.
[0024]
The image acquisition unit 1 takes a recognition target object (for example, a human hand, face, or whole body) as a subject and acquires, at predetermined time intervals (for example, every 1/30 second), an image whose values in the depth direction reflect the three-dimensional shape of the subject, for example a distance image. The distance image can be acquired using, for example, the technique disclosed in Japanese Patent Laid-Open No. 10-177449.
[0025]
Since distance images are acquired at predetermined time intervals, holding them sequentially in a memory or the like inside or outside the image acquisition unit 1 also yields a moving image of the object composed of distance images (hereinafter referred to as a distance image stream). When the distance image acquisition interval is t seconds, the distance image stream is obtained as a set of distance images of a plurality of frames, such as “the latest distance image”, “the distance image t seconds before the latest (hereinafter, one frame before)”, and “the distance image 2t seconds before the latest (two frames before; and so on)”.
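As an illustration of holding the distance image stream in memory, the following is a minimal sketch that keeps only the most recent frames in a fixed-size buffer. The class name `DistanceImageStream`, its methods, and the buffer size are hypothetical and do not appear in this description; frames are represented by arbitrary Python objects.

```python
from collections import deque


class DistanceImageStream:
    """Holds the most recent distance images (hypothetical helper)."""

    def __init__(self, max_frames):
        # A deque with maxlen silently discards the oldest frame when full.
        self._frames = deque(maxlen=max_frames)

    def push(self, frame):
        """Store the frame acquired at the latest acquisition time."""
        self._frames.append(frame)

    def frames_back(self, n):
        """Return the frame n acquisitions before the latest (n=0 is the latest)."""
        if n < 0 or n >= len(self._frames):
            raise IndexError("no frame stored %d acquisitions back" % n)
        return self._frames[-1 - n]


stream = DistanceImageStream(max_frames=2)
for name in ("f0", "f1", "f2"):
    stream.push(name)
```

With a buffer of two frames, `stream.frames_back(0)` is the latest frame and `stream.frames_back(1)` is the frame one acquisition interval earlier; older frames have been discarded.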
[0026]
Here, the image acquisition unit 1 that acquires the distance image (hereinafter, the image acquisition unit for acquiring the distance image is referred to as the distance image acquisition unit 1) and the distance image will be described. The distance image acquisition unit 1 is positioned in advance so that when a person as an object reaches a predetermined position of the apparatus, the hand, face, whole body, etc. of the person can be imaged.
[0027]
The appearance of the distance image acquisition unit 1 is shown in FIG. 2. A light receiving unit 103, composed of a circular lens and an area sensor (not shown) behind it, is arranged in the center, and a plurality of (for example, eight) light emitting units 101, each composed of an LED that emits light such as infrared rays, are arranged at equal intervals around the circular lens along its outline.
[0028]
The light emitted from the light emitting units 101 is reflected by the object, collected by the lens of the light receiving unit 103, and received by the area sensor behind the lens. The area sensor consists of, for example, sensor elements arranged in a 256 × 256 matrix, and the intensity of the reflected light received by each element becomes a pixel value. The image acquired in this way is a distance image representing the intensity distribution of the reflected light, as shown in FIG. 4.
[0029]
FIG. 3 shows a configuration example of the distance image acquisition unit 1, which mainly includes a light emitting unit 101, a light receiving unit 103, a reflected light extraction unit 102, and a timing signal generation unit 104.
[0030]
The light emitting unit 101 emits light whose intensity varies with time in accordance with the timing signal generated by the timing signal generating unit 104. This light is applied to the target object in front of the light emitting unit.
[0031]
The light receiving unit 103 detects the amount of light that was emitted from the light emitting unit 101 and reflected by the target object.
[0032]
The reflected light extraction unit 102 extracts a spatial intensity distribution of the reflected light received by the light receiving unit 103. Since the spatial intensity distribution of the reflected light can be captured as an image, this is hereinafter referred to as a distance image.
[0033]
In general, the light receiving unit 103 receives not only the light emitted from the light emitting unit 101 and reflected by the object but also, at the same time, external light such as illumination light and sunlight. Therefore, the reflected light extraction unit 102 takes the difference between the amount of light received while the light emitting unit 101 is emitting and the amount received while it is not, thereby extracting only the component of the light from the light emitting unit 101 that was reflected by the target object.
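The subtraction of external light can be sketched per pixel as follows. This is only an illustration: the function name `extract_reflected` and the sample values are hypothetical, and clamping negative results to zero is an assumption made here for sensor noise, not something stated in this description.

```python
def extract_reflected(lit, unlit):
    """Per-pixel difference between a frame captured while the emitter is on
    (lit) and one captured while it is off (unlit).

    Clamping at 0 is an assumption of this sketch, added so that noise
    cannot produce negative intensities.
    """
    return [
        [max(on - off, 0) for on, off in zip(row_on, row_off)]
        for row_on, row_off in zip(lit, unlit)
    ]


# Hypothetical 2 x 2 frames: external light contributes to both captures.
lit = [[120, 90], [60, 40]]
unlit = [[20, 30], [60, 45]]
reflected = extract_reflected(lit, unlit)
```

Pixels where the two captures are equal, such as the lower-left one, contain only external light and end up at zero.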
[0034]
The reflected light extraction unit 102 extracts the intensity distribution, that is, distance image data as shown in FIG. 4 from the reflected light received by the light receiving unit 103.
[0035]
For the sake of simplicity, FIG. 4 shows the data of an 8 × 8 pixel portion of a 256 × 256 pixel distance image.
[0036]
The reflected light from the object decreases significantly as the distance of the object increases. When the surface of the object uniformly scatters light, the amount of light received per pixel in the distance image decreases in inverse proportion to the square of the distance to the object.
[0037]
In FIG. 4, each cell value (pixel value) in the matrix indicates the intensity of the acquired reflected light in 256 gradations (8 bits). For example, a cell with the value “255” is closest to the distance image acquisition unit 1, while a cell with the value “0” is so far from the distance image acquisition unit 1 that the reflected light does not reach it.
[0038]
Each pixel value of the distance image represents the amount of reflected light received by the unit light receiving element corresponding to that pixel. Reflected light is affected by the properties of the object (whether it reflects specularly, scatters, or absorbs light), the orientation of the object, the distance to the object, and so on. When the entire object scatters light uniformly, the amount of reflected light is closely related to the distance to the object. Since a hand and similar objects have this property, the distance image obtained when a hand is held out in front of the distance image acquisition unit 1 reflects the distance to the hand, the inclination of the hand (which makes the distance differ from part to part), and so on, and a three-dimensional image such as that shown in FIG. 5 can be obtained.
[0039]
The intensity of the reflected light from the object decreases in inverse proportion to the square of the distance d to the object. That is, if the representative pixel value of the image of the object is Q(i, j), it can be expressed as
Q(i, j) = K / d² ... (1)
[0040]
Here, K is, for example, a coefficient adjusted so that the pixel value Q(i, j) becomes “255” when d = 0.5 m. The distance d can be obtained by solving equation (1) for d.
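Solving equation (1) for d gives d = sqrt(K / Q(i, j)). The following sketch works this through with the example calibration from the text (pixel value 255 at d = 0.5 m, so K = 255 × 0.5² = 63.75). The function and constant names are hypothetical, and returning infinity for a pixel value of 0 is an assumption of this sketch.

```python
import math

D_REF = 0.5                # reference distance in metres (example from the text)
Q_REF = 255                # pixel value observed at the reference distance
K = Q_REF * D_REF ** 2     # coefficient of equation (1); 63.75 for this example


def pixel_to_distance(q):
    """Invert equation (1), Q = K / d^2, to get d = sqrt(K / Q).

    A pixel value of 0 means no reflected light was received; treating that
    as an infinite distance is an assumption of this sketch.
    """
    if q <= 0:
        return math.inf
    return math.sqrt(K / q)


nearest = pixel_to_distance(255)   # the calibration point, 0.5 m
farther = pixel_to_distance(64)    # a darker pixel lies farther away
```

Because the intensity falls with the square of the distance, a pixel value roughly one quarter of the reference value corresponds to roughly twice the reference distance.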
[0041]
As described above, each pixel value of the distance image representing the intensity distribution of the reflected light, as illustrated in FIG. 4, directly corresponds to the distance from the image acquisition unit 1 (the value in the depth direction), so the distance image is a three-dimensional image having depth information. Each pixel value may be converted into a distance value from the image acquisition unit 1 using equation (1) above; it is not limited to such an absolute distance value and may instead be converted into a relative value and used as the pixel value. In addition, the information corresponding to the distance from the image acquisition unit 1 is not limited to the two-dimensional matrix format described above, and other formats can be used.
[0042]
The distance image acquisition method is not limited to that disclosed in Japanese Patent Laid-Open No. 10-177449 described above; a comparable method or other means may be used. Examples include a method of acquiring distance images using a laser beam, called a range finder, and a method of acquiring distance images using the parallax between two images captured simultaneously by two cameras, called the stereo method.
[0043]
FIG. 6 shows a display image of a hand distance image acquired by the image acquisition unit 1. The image has, for example, 64 pixels in the x-axis (horizontal) direction, 64 pixels in the y-axis (vertical) direction, and 256 gradations in the z-axis (depth) direction. FIG. 6 renders the depth value of the distance image, that is, the gradation (pixel value) in the z-axis direction, in gray scale: the closer the color is to black, the nearer the distance, and the closer to white, the farther the distance. A completely white area indicates either that there is no image there or that, even if there is, it is so far away that it is equivalent to there being none.
[0044]
Next, the processing operation of the image recognition apparatus in FIG. 1 will be described with reference to the flowchart shown in FIG. 7.
[0045]
First, the image acquisition unit 1 acquires a distance image stream of the recognition target object and passes any two frames of distance images included in it (hereinafter, distance image A and distance image B) to the difference calculation unit 2 (step S1).
[0046]
The difference calculation unit 2 performs difference processing on any two frames of distance images (hereinafter, the distance image A and the distance image B) included in the distance image stream of the recognition target object acquired by the image acquisition unit 1 to obtain a difference. An image is generated (step S2).
[0047]
When recognition is to be performed in real time, the two arbitrary frames are usually the distance image A of the latest frame (time t) and the distance image B from several frames earlier (time t−n, where n is an arbitrary positive constant). How many frames earlier the distance image B is taken from is determined based on information such as the distance image acquisition interval (frame rate) of the image acquisition unit 1 and the speed of the object's motion.
[0048]
Now, a difference processing method in the difference calculation unit 2 will be specifically described.
[0049]
The calculation of the difference image D between the distance image A (captured at time t) and the distance image B (captured at time t−n) applies Equation (2) for all pixels (i, j).
[0050]
Here, the distance value at each pixel position (i, j) of the distance image at time t is denoted F^(t)(i, j), the difference image at time t is denoted D^(t), and its value at each pixel position (i, j) is denoted D^(t)(i, j).
[0051]
That is, when the distance value at pixel position (i, j) of the distance image A is F^(t)(i, j) and that of the distance image B is F^(t−n)(i, j), the difference image D^(t)(i, j) between the distance image A and the distance image B can be generated from Expression (2).
[0052]
[Expression 1]
$$D^{(t)}(i,j) = F^{(t)}(i,j) - F^{(t-n)}(i,j) \quad (2)$$
[0053]
The difference image will be specifically described with reference to FIG. 14. FIG. 14A shows part of the data of the distance image B, in which two pixels P1 and P2 have the pixel values “200” and “150”. FIG. 14B shows the two pixels of the distance image A at the same positions as P1 and P2 in FIG. 14A; their pixel values are “150” and “200”. In this case, Expression (2) gives the changes in the pixel values of P1 and P2 between the distance image A and the distance image B as “−50” and “50”, respectively, and these become the pixel values of P1 and P2 in the difference image, as shown in FIG. 14C. That is, as a result of the movement of the target object, the object located at the position of pixel P1 in the distance image B has moved to pixel P2 in the distance image A. Consequently, on the difference image, pixel P1 has a “−” value and pixel P2 has a “+” value.
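The two-pixel example of FIG. 14 can be reproduced with a short sketch. It assumes the difference is taken as distance image A minus distance image B at each pixel, which is the sign convention implied by the values “−50” and “50” above; the function name `difference_image` and the nested-list representation are hypothetical.

```python
def difference_image(frame_a, frame_b):
    """Per-pixel difference A - B between two frames of equal size.

    frame_a is the newer frame (time t), frame_b the older one (time t-n).
    """
    return [
        [a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


# Pixels P1 and P2 from the FIG. 14 example.
frame_b = [[200, 150]]   # distance image B (time t-n)
frame_a = [[150, 200]]   # distance image A (time t)
diff = difference_image(frame_a, frame_b)
```

Here `diff` holds −50 for P1, where the object left, and 50 for P2, where it arrived.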
[0054]
What is obtained as the difference image is the portion that has changed between the distance image A and the distance image B, that is, between the images captured at time t−n and time t. When the distance image A and the distance image B belong to the same time series, only the portions that moved change. The difference image can therefore be said to show the portions of the captured object in which motion occurred.
[0055]
For example, as shown in FIG. 8, when the upper body of a person making a hand gesture is imaged, the region of the arm that actually moved is obtained as the difference image from FIG. 8B as the distance image A and FIG. 8A as the distance image B. FIG. 8C shows a display image of the difference image generated from FIGS. 8A and 8B. Pixels whose value in the difference image data is “−” are rendered in gray scale using the absolute value of the pixel value.
[0056]
Returning to the description of FIG. 7, the detection unit 3 next detects the feature amount of the movement of the target object from the difference image generated by the difference calculation unit 2 (steps S3 to S5 in FIG. 7).
[0057]
Then, how the feature amount is actually detected by the detection unit 3 will be specifically described mainly with reference to FIGS. 9 to 13.
[0058]
First, an inflow region and an outflow region are extracted from the obtained difference image (step S3).
[0059]
A region where, owing to the movement of the target object, no object was present at the time of the distance image B (time t−n) but an object is newly present at the time of the distance image A (time t) is hereinafter called the inflow region D_IN. Conversely, a region where an object was present at the time of the distance image B (time t−n) but is no longer present at the time of the distance image A (time t) is hereinafter called the outflow region D_OUT.
[0060]
For example, as shown in FIGS. 9A and 9B, consider a case where an object moves between time t−n and time t. In this case, the display image of the difference image between the distance image B acquired at time t−n and the distance image A acquired at time t is as shown in FIG. 10A. In the actual difference image data, as shown in FIG. 10B, the pixel values (values in the z-axis direction) of the pixels corresponding to the inflow region are “+” values, and those of the pixels corresponding to the outflow region are “−” values.
[0061]
That is, the inflow region consists of the pixels whose value in the difference image is “+”, and the outflow region consists of the pixels whose value in the difference image is “−”. The inflow region D_IN^(t) and the outflow region D_OUT^(t) at time t can be expressed by Expressions (3) and (4), respectively.
[0062]
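[Expression 2]
$$D_{IN}^{(t)}(i,j) = \begin{cases} D^{(t)}(i,j) & \text{if } D^{(t)}(i,j) > 0 \\ 0 & \text{otherwise} \end{cases} \quad (3)$$
$$D_{OUT}^{(t)}(i,j) = \begin{cases} \left| D^{(t)}(i,j) \right| & \text{if } D^{(t)}(i,j) < 0 \\ 0 & \text{otherwise} \end{cases} \quad (4)$$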
[0063]
For example, from the (partial) difference image shown in FIG. 14C, the pixel P2 with the pixel value “50” is extracted as part of the inflow region, and the pixel P1 with the pixel value “−50” is extracted as part of the outflow region.
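The extraction of step S3 can be sketched as a split of the difference image by sign: positive pixels form the inflow region and negative pixels, stored as absolute values, form the outflow region. The function name `split_regions` is hypothetical, and using zero as the dividing value (rather than some noise threshold) is an assumption of this sketch.

```python
def split_regions(diff):
    """Split a difference image into inflow and outflow region images.

    Inflow keeps the positive pixel values; outflow keeps the absolute value
    of the negative ones. All other pixels are set to 0.
    """
    inflow = [[v if v > 0 else 0 for v in row] for row in diff]
    outflow = [[-v if v < 0 else 0 for v in row] for row in diff]
    return inflow, outflow


# P1 = -50 and P2 = 50 as in the FIG. 14 example, plus one unchanged pixel.
inflow, outflow = split_regions([[-50, 50, 0]])
```

P2 appears only in `inflow` and P1 only in `outflow`, while the unchanged pixel belongs to neither region.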
[0064]
The inflow region image extracted from the difference image shown in FIG. 10A is shown in FIG. 11A, and the outflow region image is shown in FIG. 12A. As is apparent from Expression (4), each pixel value in the outflow region image is converted to its absolute value.
[0065]
Next, the positions of the inflow region D_IN^(t) and the outflow region D_OUT^(t) are determined (step S4). In this embodiment, the position of each region is represented by its center of gravity (see FIGS. 11 and 12), and the center of gravity G_IN^(t) of the inflow region D_IN^(t) and the center of gravity G_OUT^(t) of the outflow region D_OUT^(t) are calculated.
[0066]
The gravity center position G = (Gx, Gy, Gz) is calculated using the equation (5).
[0067]
[Equation 3]

[0068]
The calculation method of the center of gravity shown here is an example, and the calculation method is not limited to this, and can be calculated using other definitions.
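Since the text allows other definitions of the center of gravity, the following sketch shows one possible definition, which is an assumption made here and not necessarily the one given by Expression (5): Gx and Gy are the pixel-value-weighted means of the column and row indices, and Gz is the mean pixel value over the non-zero pixels. The function name `centroid` is hypothetical.

```python
def centroid(region):
    """One possible center of gravity (Gx, Gy, Gz) of a region image.

    Assumed definition for illustration only:
      Gx, Gy = pixel-value-weighted mean of column index i and row index j
      Gz     = mean pixel value over the non-zero pixels
    Returns None when the region is empty.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    count = 0
    for j, row in enumerate(region):
        for i, value in enumerate(row):
            if value > 0:
                total += value
                sum_x += i * value
                sum_y += j * value
                count += 1
    if count == 0:
        return None
    return (sum_x / total, sum_y / total, total / count)


g = centroid([[0, 50], [0, 50]])
```

For this two-pixel region the result lies in column 1, halfway between rows 0 and 1, with a depth component of 50.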
[0069]
Furthermore, as shown in FIG. 13, the vector V^(t) = (V^(t)x, V^(t)y, V^(t)z) from the center of gravity G_OUT^(t) to the center of gravity G_IN^(t) obtained in step S4 is calculated and used as the feature amount (step S5). This feature amount is hereinafter referred to as the differential flow. The differential flow at time t is obtained by Expression (6).
[0070]
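[Expression 4]
$$V^{(t)} = \left( V_x^{(t)}, V_y^{(t)}, V_z^{(t)} \right) = G_{IN}^{(t)} - G_{OUT}^{(t)} \quad (6)$$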
[0071]
The differential flow calculation method described above is an example, and the present invention is not limited to this. Further, the feature quantity is not limited to the differential flow.
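As a small illustration of step S5, the differential flow is the vector pointing from the outflow centroid to the inflow centroid, computed component by component. The function name `differential_flow` and the sample centroid values are hypothetical.

```python
def differential_flow(g_in, g_out):
    """Vector from the outflow centroid g_out to the inflow centroid g_in.

    Both arguments are (x, y, z) tuples; the result is (Vx, Vy, Vz).
    """
    return tuple(a - b for a, b in zip(g_in, g_out))


# Hypothetical centroids: the object moved right, up in the image, and nearer.
g_out = (10.0, 20.0, 100.0)
g_in = (12.0, 15.0, 130.0)
flow = differential_flow(g_in, g_out)
```

When nothing moves, both regions are empty and no flow can be computed, which corresponds to the near-zero values seen in the stationary portions of the plots.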
[0072]
Returning to the description of FIG. 7, the recognition unit 4 next recognizes the movement of the object included in the image based on the feature amount obtained by the detection unit 3, that is, the differential flow.
[0073]
Now, how the recognition unit 4 actually performs the recognition process will be described specifically, using as an example a hand-waving operation by the upper body of a human. The hand-waving operation consists of a series of movements, such as raising and lowering the hand and swinging the hand left and right, and examples of recognizing these movements will be described. In the following description, the terms “operation” and “motion” are used interchangeably.
[0074]
FIG. 15 shows a human hand raising and lowering operation. FIGS. 16A to 16C show, for each component, how the differential flow V^(t) = (V^(t)x, V^(t)y, V^(t)z) changes over time during this operation. In FIGS. 16A to 16C, the horizontal axis indicates time and the vertical axis indicates the value of each component of the differential flow, in units suitable for representing the magnitude (amount) of the movement.
[0075]
FIG. 16 shows the temporal change in the value of the differential flow obtained as described above from distance images captured while an actual (arbitrary) person raised and lowered a hand. In FIG. 16, the portion corresponding to the hand raising and lowering operation is surrounded by a dotted line. It can be seen that the differential flow value changes greatly in the portions where there is movement and takes a value close to “0” in the portions where there is no movement (stationary state). Thus, motion can be recognized by analyzing the value of the differential flow.
[0076]
Hereinafter, a differential flow value analysis method will be described more specifically.
[0077]
For example, in a human “hand raising” operation, the hand is raised as shown in FIGS. 15A and 15B, so the movement is characterized in the y-axis direction. Further, in a “hand raising” operation, a human generally raises the hand while moving the arm toward the front (z-axis direction). When the movement is thus characterized in both the y-axis and z-axis directions, the product of the two movement amounts indicates the amount and timing of the “hand raising” operation more prominently. Therefore, based on this analysis of the typical human “hand raising” motion, the motion can be recognized from Expression (7) below, using the y and z components of the differential flow V^(t) = (V^(t)x, V^(t)y, V^(t)z).
[0078]
[Equation 5]
$$\left| V_y^{(t)} \times V_z^{(t)} \right| > TH1 \quad (7)$$
[0079]
In Expression (7), TH1 is a threshold value and is an arbitrary positive constant. When the obtained differential flow components Vy and Vz satisfy the relationship of Expression (7), it is recognized that the “hand raising” operation has been performed.
[0080]
FIG. 17 shows how |Vy × Vz| changes. In FIG. 17, the horizontal axis indicates time and the vertical axis indicates |Vy × Vz|, in units suitable for representing the magnitude of the movement. When the relationship of Expression (7) is satisfied, that is, when the value of |Vy × Vz| exceeds the threshold value TH1, it is recognized that the “hand raising” operation has been performed.
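The threshold test on |Vy × Vz| can be sketched as follows. The function name `is_hand_raise`, the sample flow vectors, and the threshold value 100.0 are hypothetical; the text only requires TH1 to be some positive constant.

```python
def is_hand_raise(flow, th1):
    """Return True when |Vy * Vz| exceeds the threshold TH1.

    flow is a differential flow (Vx, Vy, Vz); th1 is a positive constant
    chosen by the implementer.
    """
    _, vy, vz = flow
    return abs(vy * vz) > th1


TH1 = 100.0                        # hypothetical threshold
moving = (0.5, -4.0, 30.0)         # large vertical and depth components
still = (0.5, 0.2, 1.0)            # near-stationary frame
raised = is_hand_raise(moving, TH1)
idle = is_hand_raise(still, TH1)
```

Taking the product means that a frame with a large component in only one of the two directions stays well below the threshold, which is what makes the combined test more selective than checking Vy alone.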
[0081]
Thus, for example, when recognizing human movement, the three-dimensionality of actual human movement is used. When a human moves his / her hand, the movement in the plane direction (xy plane direction) and the movement in the depth direction (z direction) do not occur independently. That is, for example, when performing a “hand raising” operation, the hand is not only moving upward, but the value in the depth direction is also dependently changing. That is, there is a correlation between the motion component in the plane direction and the component in the depth direction. Therefore, it is possible to stably recognize such a three-dimensional movement by simultaneously viewing the planar component and the depth component.
[0082]
Therefore, as shown in Expression (7), the “hand raising” operation can be recognized by using, among the components of the differential flow, the component in the movement direction that characterizes the operation (here, for example, the y-axis direction) together with the component in a direction correlated with that movement direction, for example by taking the product of a planar component and the depth component, such as Vy × Vz.
[0083]
Next, a method for recognizing, using the differential flow, a human “hand-waving” operation (a hand gesture expressing negation) will be described.
[0084]
The “hand-waving” operation is regarded as an operation of moving the hand several times in the lateral direction. As shown in FIG. 18, the minimum number of such movements in a hand-waving operation is four: one when raising the hand (see FIG. 18(b)), two when swinging it sideways (see FIGS. 18(c) and 18(d)), that is, one round trip to the left and right, and one when lowering the hand (see FIG. 18(e)). Therefore, when four or more movements in the lateral direction occur, the operation is assumed to be “hand-waving”.
[0085]
Thus, the human “hand-waving” movement is characterized in particular by movement in the x-axis direction, and movement in the x-axis direction is always accompanied by movement in the z-axis direction (that is, the x-axis and z-axis directions are correlated). It can therefore be recognized by looking at, for example, the value of |Vx × Vz|, and the left-right swinging motion can be detected by Expression (8). Here, TH2 is a threshold value and is an arbitrary positive constant.
[0086]
[Formula 6]
$$\left| V_x^{(t)} \times V_z^{(t)} \right| > TH2 \quad (8)$$
[0087]
When the condition of Expression (8) is satisfied four or more times during a series of operations, the operation is recognized as a “hand-waving” operation.
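Counting how often the condition on |Vx × Vz| is met could be sketched as below. Treating each contiguous run of frames above the threshold as a single lateral movement is an assumption of this sketch, since the text only says the condition must be satisfied four or more times; the function names and sample values are hypothetical.

```python
def count_lateral_movements(flows, th2):
    """Count the runs of consecutive frames in which |Vx * Vz| > TH2.

    Each run is counted once (assumption of this sketch), so a movement that
    spans several frames is not counted repeatedly.
    """
    count = 0
    above = False
    for vx, _, vz in flows:
        now_above = abs(vx * vz) > th2
        if now_above and not above:
            count += 1
        above = now_above
    return count


def is_hand_wave(flows, th2, min_movements=4):
    """Recognize hand-waving when at least min_movements runs are found."""
    return count_lateral_movements(flows, th2) >= min_movements


TH2 = 10.0   # hypothetical threshold
flows = [
    (3, 0, 4), (0, 0, 0),
    (-3, 0, 4), (-3, 0, 5), (0, 0, 0),
    (3, 0, -4), (0, 0, 0),
    (-3, 0, -4),
]
movements = count_lateral_movements(flows, TH2)
```

In this sequence the third and fourth frames belong to the same movement, so four movements are counted in total.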
[0088]
FIG. 19 shows how the value of |Vx × Vz| changes when a human performs a “hand-waving” operation at an ordinary speed. In FIG. 19, the horizontal axis indicates time and the vertical axis indicates |Vx × Vz|, in units suitable for representing the amount of motion.
[0089]
In the example shown in FIG. 19, six lateral movements were detected during the series of movements, so the movement was recognized as a “hand-waving” movement.
[0090]
In the above description, two of the three components of the differential flow are used: the component in the characteristic motion direction of the motion to be recognized and the component in a direction correlated with that motion direction. However, the present invention is not limited to this case. Only the component in the characteristic direction of the motion to be recognized may be used, and the motion may be recognized when that component value exceeds a predetermined threshold value. Alternatively, all three components of the differential flow may be used, and the motion may be recognized when the product of the component values exceeds a predetermined threshold value. Thus, depending on the type of movement to be recognized, the movement can be recognized using at least one of the three components of the differential flow. In this case, it is desirable that the component selected from the three be either only the component in the characteristic motion direction of the motion to be recognized, or that component together with a component in a direction correlated with that motion direction.
[0091]
Further, the recognition unit 4 can recognize not only the type of movement but also the state of movement such as the speed of movement and the amount (size) of movement when performing the movement.
[0092]
For example, FIG. 20 shows the temporal change in the value of |Vx × Vz| for a “hand-waving” operation in which the hand is swung left and right faster than in the waving shown in FIG. 19. In FIG. 20, the horizontal axis indicates time and the vertical axis indicates |Vx × Vz|, in units suitable for representing the magnitude of the movement.
[0093]
As is clear from comparing FIG. 19 and FIG. 20, in FIG. 20 the operation starts and ends earlier than in FIG. 19, and the intervals between the six lateral movements detected during the series of operations are narrower. Therefore, for example, when the detection interval between the series of movements included in the movement to be recognized is shorter than a predetermined time, the movement may be determined to be a “fast movement”.
[0094]
Also, when the hand is swung left and right more widely than in the waving shown in FIG. 19, the value of |Vx × Vz| in the “hand-waving” operation is larger than in FIG. 19. Therefore, in addition to the first threshold value for detecting lateral movement (here, TH2), a second threshold value for determining a “large movement” may be set for the value of |Vx × Vz|, and when this value is exceeded, the movement may be determined to be a “large movement”.
[0095]
In general, “hand-waving” actions include one meaning “goodbye” and one expressing denial, such as “that is wrong” or “no”. When waving “no, no”, the hand is usually waved faster than when waving “bye-bye”. Therefore, by recognizing not only the type of movement, such as “hand raising”, “hand lowering”, “swinging the hand left and right”, or the “hand-waving” composed of these, but also the state of the movement as described above, the recognition unit 4 can determine, for example, that a fast “hand-waving” motion means “no” and that a “hand-waving” motion at normal speed means “goodbye”. That is, the meaning expressed by the recognized movement can also be recognized.
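The judgments of speed, size, and meaning described above could be combined as in the following sketch. It is only an illustration: the function name `describe_wave`, the parameter names, the sample data, and all threshold values are hypothetical, and counting each upward crossing of TH2 as the start of one lateral movement is an assumption.

```python
def describe_wave(samples, th2, th_large, fast_gap):
    """Classify a series of (time_in_seconds, |Vx * Vz|) samples.

    th2      first threshold: detects a lateral movement
    th_large second threshold: marks a "large movement"
    fast_gap movements closer together than this are a "fast movement"
    """
    onsets = []
    peak = 0.0
    above = False
    for t, value in samples:
        now_above = value > th2
        if now_above and not above:
            onsets.append(t)          # start of a new lateral movement
        above = now_above
        peak = max(peak, value)
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    is_wave = len(onsets) >= 4
    fast = bool(gaps) and max(gaps) < fast_gap
    large = peak > th_large
    meaning = None
    if is_wave:
        meaning = "no" if fast else "goodbye"
    return {"wave": is_wave, "fast": fast, "large": large, "meaning": meaning}


samples = [(0.0, 12), (0.1, 0), (0.2, 14), (0.3, 0),
           (0.4, 30), (0.5, 0), (0.6, 13)]
result = describe_wave(samples, th2=10, th_large=25, fast_gap=0.3)
```

With these values the four movements start about 0.2 s apart and one of them exceeds the second threshold, so the sequence is classified as a fast, large hand-waving meaning “no”. Raising the speed requirement (for example `fast_gap=0.1`) would classify the same sequence as “goodbye”.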
[0096]
The analysis method described above is merely an example, and the present invention is not limited to it. Other calculations on Vx, Vy, and Vz may be used, as may signal processing techniques such as the FFT and the wavelet transform, or knowledge processing techniques from artificial intelligence. Any other feasible approach may also be taken.
[0097]
In addition, the operations such as “hand raising” and “hand swinging left and right” described above are merely examples, and the present invention is not limited to this, and any operation can be analyzed. The moving subject is not limited to human beings, and this method can be applied to any object.
[0098]
Furthermore, the analysis using the differential flow is an example, and a feature quantity different from this may be analyzed.
[0099]
As described above, in the first embodiment, a three-dimensional feature amount related to the movement of an object is calculated from the difference between two distance images of the object, and three-dimensional recognition of the movement of the object is realized using this feature amount.
[0100]
If one attempts to recognize motion from a two-dimensional image using only two-dimensional feature amounts, without a distance image that also carries depth information, then for an operation such as “turning the head to the side”, the fact that the head has moved can be detected from the difference between the head image regions in two two-dimensional images, but the movement cannot be accurately recognized as a “turning sideways” movement. The first embodiment, by contrast, differs from conventional methods that estimate three-dimensional motion from two-dimensional information in images lacking depth information (for example, inferring that because the projected area of the hand in the x-axis (lateral) direction has decreased, the hand has probably rotated around the y-axis). Because recognition uses a feature amount (the differential flow) that actually represents three-dimensional properties obtained from the distance image, three-dimensional movement can be recognized more reliably and more stably than with the conventional methods.
[0101]
Hereinafter, some modified examples of the first embodiment will be described.
[0102]
(Modification 1 of the first embodiment)
The image acquisition unit 1 may acquire distance images at arbitrary timing instead of at predetermined intervals. For example, the acquisition interval may be changed dynamically according to the imaged object, with a short interval when imaging a fast-moving object and a long interval when imaging a slow-moving object, or images may be acquired at arbitrary timing in response to a user instruction or the like. Other methods may also be used.
[0103]
In this way, for example, the user can indicate the start and end times with a switch, and three-dimensional motion recognition can be performed within an arbitrary time interval, such as determining whether a specific motion was performed during that time. The acquisition interval can also be controlled to suit the speed of the motion to be recognized.
[0104]
(Modification 2 of the first embodiment)
In the difference calculation unit 2, instead of the latest frame, a specific past frame (at an arbitrary time t′ before the current time t) may be used as the distance image A, and the difference image may be generated using a frame several frames earlier (for example, at time t′-n) as the distance image B.
[0105]
By doing so, it is possible to perform three-dimensional motion recognition at a specific point in the past.
[0106]
That is, in addition to the real-time motion recognition described in the first embodiment, motion recognition at an arbitrary point in time can be performed. This enables offline recognition of a distance image stream recorded on a recording device such as a video tape or a hard disk.
[0107]
(Modification 3 of the first embodiment)
In the first embodiment and Modification 2, the difference calculation unit 2 was described with the distance image A being newer in time than the distance image B. However, the present invention is not limited to this; the same applies even if the temporal relationship is reversed.
[0108]
(Modification 4 of the first embodiment)
As described in the first embodiment, the recognition unit 4 recognizes whether a certain movement is being performed by analyzing a feature amount (the differential flow, as an example). By also analyzing the magnitude and fluctuation range of the feature amount, it can recognize how large the movement is.
[0109]
For example, in the first embodiment's example of recognizing the “hand swinging” movement, lateral movement was detected according to whether the value of |Vx × Vz| exceeds a threshold. Taking this further, not just one threshold but three, TH1, TH2, and TH3 (arbitrary positive constants with TH1 < TH2 < TH3), can be prepared, and the magnitude of the motion can be divided into three levels according to which thresholds the value exceeds. By preparing a plurality of thresholds in this way, it is possible to know not only whether a movement has been performed but also the level of its magnitude. Alternatively, instead of threshold processing, the value itself can be treated as an analog quantity that expresses the magnitude of the movement.
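The multi-level thresholding described in this modification could look something like the following sketch. The function name `motion_level` and the example threshold values are hypothetical; the only elements taken from the text are the quantity |Vx × Vz| and the ordered thresholds TH1 < TH2 < TH3.

```python
import bisect


def motion_level(vx, vz, thresholds):
    """Return how many thresholds the value |Vx * Vz| exceeds.

    0 means no motion was detected; len(thresholds) is the largest level.
    """
    value = abs(vx * vz)
    # bisect_left counts the thresholds strictly below value,
    # that is, the thresholds that value exceeds.
    return bisect.bisect_left(sorted(thresholds), value)


# Hypothetical thresholds TH1 < TH2 < TH3.
TH = (10, 20, 30)
print(motion_level(5, 3, TH))
```

Returning `abs(vx * vz)` directly, instead of a level, would correspond to treating the magnitude as an analog quantity.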
[0110]
The method described here is an example, and the invention is not limited to it. Which value to analyze, and how to determine the magnitude of the movement from that value, can be chosen freely.
[0111]
(Modification 5 of the first embodiment)
The distance image acquired by the image acquisition unit 1 is not limited to the kind of image described in the first embodiment. For example, three-dimensional shape data of an object obtained by combining feature point data acquired by a motion capture method with a three-dimensional model of the object, or three-dimensional data created for use in CG, is often not called an image in the ordinary sense. However, because such data represents a three-dimensional shape, its properties conform to those of the distance image described in the first embodiment. Such data can therefore be regarded as equivalent to the distance image in the present embodiment.
[0112]
As described above, even with respect to data that is not referred to as a normal image, it is possible to recognize the movement of the object in the same manner by acquiring the data having the three-dimensional shape data with the image acquisition unit 1.
[0113]
(Modification 6 of the first embodiment)
The recognition unit 4 may output not only the recognition result indicating whether or not the movement was performed but also a reliability for that recognition. The reliability is determined from the numerical margin by which the conditions for recognition are satisfied. For example, when recognizing the “hand raising” motion in the first embodiment, the discrimination for recognition is performed using equation (7), and the value of |Vy × Vz| - TH1 (the difference from the threshold) or the value of Vy can be used as the reliability. The reliability may also be calculated by combining these values, or other values may be used.
[0114]
In this way, it is possible to know how reliable the recognition of a certain movement is. For example, if “hand raising” is recognized with high reliability, the user can place great confidence in the recognition result, whereas if the reliability is low, the result can be treated as being for reference only.
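As a rough illustration of this modification, the sketch below returns the recognition result together with a reliability value. Equation (7) is not reproduced in this part of the description, so the condition used here, |Vy × Vz| > TH1, is an assumption standing in for it; the function name `recognize_with_reliability` is likewise hypothetical. The reliability is the margin |Vy × Vz| - TH1 mentioned in the text.

```python
def recognize_with_reliability(vy, vz, th1):
    """Return (recognized, reliability) for a "hand raising" style test.

    Assumption: the movement counts as recognized when |Vy * Vz| exceeds
    th1; the reliability is the margin by which the threshold is exceeded.
    """
    value = abs(vy * vz)
    recognized = value > th1
    reliability = value - th1 if recognized else 0.0
    return recognized, reliability
```

A caller could, for instance, treat a recognized movement with a small margin as a reference-level result and act only on results with a large margin.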
[0115]
(Second Embodiment)
The image recognition apparatus and method described in the first embodiment detect a feature amount (the differential flow) of the three-dimensional motion of an object from distance images and use it to recognize the motion of the object included in the distance images. That description covered the case where the feature amount of a single motion in the distance image is obtained and only that one motion is recognized. The second embodiment describes the case where each of a plurality of movements included in a distance image is recognized.
[0116]
FIG. 21 is an overall configuration diagram of an image recognition apparatus according to the second embodiment. In FIG. 21, the same parts as those in FIG. 1 are denoted by the same reference numerals, and only the different parts will be described. That is, the image recognition apparatus of FIG. 21 newly adds a region extraction unit 5 that extracts, from the difference image obtained by the difference calculation unit 2, recognition regions for recognizing the motion of objects. A feature amount is detected for each recognition region extracted from the difference image by the region extraction unit 5.
[0117]
The image acquisition unit 1 and the difference calculation unit 2 are exactly the same as those in the first embodiment.
[0118]
Next, the region extraction unit 5 will be described with reference to the flowchart shown in FIG. 22.
[0119]
When a plurality of movements are mixed in the distance images sent from the image acquisition unit 1, as shown for example in FIGS. 23(a) and 23(b), the region extraction unit 5 extracts from the difference image a plurality of regions, one for recognizing each movement, as illustrated in part (c) of the corresponding figure.
[0120]
First, the region of each object (movement) included in the distance image A (captured at time t) and the distance image B (captured at time t-n) shown in FIGS. 23(a) and 23(b) is extracted (step S101). Here, one object is defined as the area occupied by one continuous region, and the circumscribed rectangular region of the image of the object is extracted. The region is not limited to a circumscribed rectangle; a region of another shape may be used as long as the region where the object exists is extracted. In this case, regions R1 and R2 of the objects are extracted from the distance image A shown in FIG. 23(a), as shown in FIG. 24. Likewise, regions R1′ and R2′ of the objects are extracted from the distance image B shown in FIG. 23(b), as shown in FIG. 24(b).
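Step S101 could be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment: a pixel is treated as belonging to an object when its value is above a hypothetical `background` level, "continuous" is taken to mean 4-connected, and the function name `circumscribed_rectangles` is invented for the example.

```python
from collections import deque


def circumscribed_rectangles(image, background=0):
    """Return (x_min, y_min, x_max, y_max) for each 4-connected region.

    image is a list of rows of pixel values; pixels with a value greater
    than background are treated as part of an object.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    rects = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or image[y0][x0] <= background:
                continue
            # Breadth-first flood fill over one continuous region.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and image[ny][nx] > background):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            rects.append((x_min, y_min, x_max, y_max))
    return rects
```

Applied to each of the two distance images, this yields one rectangle per object, playing the role of R1, R2 and R1′, R2′.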
[0121]
Next, corresponding regions in the distance images A and B (preferably, two regions containing the same object) are combined to generate a recognition region (step S102). For example, if the region R1 in the distance image A of FIG. 23(a) corresponds to the region R1′ in the distance image B of FIG. 23(b), and the region R2 in the distance image A of FIG. 23(a) corresponds to the region R2′ in the distance image B of FIG. 23(b), a recognition region CR1 for recognizing motion is generated by combining the regions R1 and R1′, as shown in FIG. 25. The recognition region CR2 is likewise generated by combining the regions R2 and R2′.
[0122]
For example, when the distance images A and B are superimposed, the recognition region CR1 consists of the area where the regions R1 and R1′ overlap together with all the remaining parts of those regions. Similarly, the recognition region CR2 consists of the area where the regions R2 and R2′ overlap together with all the remaining parts of those regions.
[0123]
The method of obtaining the correspondence is not particularly limited in the present invention. For example, the regions closest to each other may be judged to be regions of the same object and made to correspond, or regions judged to contain the same object on the basis of some knowledge may be found and made to correspond. Other methods may also be used.
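One way to picture step S102 with the "closest regions" correspondence is the sketch below. It assumes each region is given as a rectangle tuple (x_min, y_min, x_max, y_max), measures closeness by the distance between rectangle centers, and takes the recognition region to be the rectangle enclosing both corresponding regions. These choices, and the function name `merge_corresponding_regions`, are assumptions for illustration; the text leaves the correspondence method open.

```python
def merge_corresponding_regions(rects_a, rects_b):
    """Pair each region of image A with the closest region of image B
    and return the rectangle enclosing each pair."""
    if not rects_b:
        return list(rects_a)  # nothing to pair with; keep A's regions

    def center(r):
        return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)

    def dist2(r, s):
        (cx, cy), (dx, dy) = center(r), center(s)
        return (cx - dx) ** 2 + (cy - dy) ** 2

    merged = []
    for ra in rects_a:
        rb = min(rects_b, key=lambda r: dist2(ra, r))  # closest region in B
        merged.append((min(ra[0], rb[0]), min(ra[1], rb[1]),
                       max(ra[2], rb[2]), max(ra[3], rb[3])))
    return merged
```

With two objects in each image, the two returned rectangles would correspond to the recognition regions CR1 and CR2.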
[0124]
Furthermore, the region extraction unit 5 extracts the plurality of recognition regions from the difference image obtained by the difference calculation unit 2 (step S103). Suppose, for example, that the difference calculation unit 2 generates a difference image as shown in FIG. 26(a) from the distance image A shown in FIG. 23(a) and the distance image B shown in FIG. 23(b). From this difference image, the portions corresponding to the recognition regions CR1 and CR2 shown in FIG. 25 are extracted as recognition regions CR1′ and CR2′. Because the recognition regions CR1 and CR2 were generated by superimposing the distance images A and B, the areas of the difference image that coincide with CR1 and CR2 when the difference image is superimposed on the distance images A and B are extracted as the recognition regions CR1′ and CR2′, respectively.
[0125]
Note that even when only one object region is extracted from each distance image in step S101, the region extraction unit 5 still performs steps S102 and S103: it generates a recognition region by combining the corresponding regions that contain the object in the distance image A and the distance image B, and extracts that recognition region from the difference image.
[0126]
Next, the detection unit 3 will be described.
[0127]
The detection unit 3 obtains a feature amount (for example, a differential flow here) for each of a plurality of recognition regions extracted from the difference image by the region extraction unit 5 (see FIG. 27).
[0128]
The feature amount detection process is the same as that of the detection unit 3 of the first embodiment.
[0129]
The recognition unit 4 analyzes the feature amounts for each of the plurality of recognition regions detected by the detection unit 3 and performs motion recognition. A specific method for recognizing individual actions is the same as that of the recognition unit 4 of the first embodiment.
[0130]
At this time, the analysis for recognition may be performed independently for each feature value, or may be analyzed by cross-referencing each value.
[0131]
As described above, when a plurality of movements exist in the distance image, a plurality of recognition regions corresponding to the positions of the movements are extracted from the difference image, and a feature amount corresponding to each movement is obtained from each recognition region to recognize the movements. This makes it possible to recognize not just a single movement but a plurality of movements simultaneously, and each of the plurality of three-dimensional movements can be recognized stably and accurately.
[0132]
Note that the method described above by which the region extraction unit extracts recognition regions from the difference image is an example, and the invention is not limited to it.
[0133]
(Third embodiment)
In the first embodiment, the recognition unit 4 recognizes a certain movement. The third embodiment takes this further to enable motion recognition that includes identification among a plurality of motions.
[0134]
For example, in the first embodiment, the “hand shaking” operation has been described as an example. However, the “hand shaking” operation includes “hand raising”, “hand lowering”, and “hand swinging left and right” movements. As described above, a single recognition target motion may be composed of a plurality of types of motion. Therefore, in the third embodiment, an image recognition apparatus capable of recognizing a plurality of types of movements and identifying one movement from their relevance will be described.
[0135]
FIG. 28 is an overall configuration diagram of an image recognition apparatus according to the third embodiment. In FIG. 28, the same parts as those in FIG. 1 are denoted by the same reference numerals, and only the different parts will be described. That is, the image recognition apparatus of FIG. 28 includes a plurality of recognition units (here x units, where x is an arbitrary integer: a first recognition unit 4a, a second recognition unit 4b, ..., an x-th recognition unit 4x) that recognize the movement of an object included in the image using the feature amount (for example, the differential flow) obtained by the detection unit 3, and a newly added motion identification unit 6 that identifies the movement of the object based on the recognition results obtained by the recognition units 4a to 4x.
[0136]
The image acquisition unit 1, the difference calculation unit 2, and the detection unit 3 are exactly the same as those in the first embodiment.
[0137]
Next, the plurality of recognition units 4a to 4x will be described. Each recognition unit recognizes a specific movement predetermined for that unit.
[0138]
For example, the first recognition unit 4a recognizes the “hand raising” operation. The recognition method is the same as in the first embodiment. The second recognizing unit 4b recognizes a specific movement different from that of the first recognizing unit 4a. For example, the “hand swinging” motion is recognized. The recognition method is the same as in the first embodiment.
[0139]
In the same way, the x-th recognition unit 4x recognizes a specific movement different from those of the other recognition units, for example an “up and down movement of the neck”. The recognition method is the same as in the first embodiment.
[0140]
Next, the motion identification unit 6 will be described. The motion identification unit 6 finally identifies (discriminates) the type of movement of the object based on the recognition results obtained from the plurality of recognition units 4a to 4x.
[0141]
For example, when recognition succeeds only for the “up and down movement of the neck” and fails for all other movements, the movement of the object can be identified as an “up and down movement of the neck”. In this way, when only one of the recognition units 4a to 4x reports success, the motion identification unit 6 outputs the recognized movement as the identification result as it is.
[0142]
Next, the processing of the motion identification unit 6 when the recognition results of the recognition units 4a to 4x include a plurality of successes will be described. As described in the first embodiment, when a person performs a “hand shaking” motion, the person usually raises the hand in front of the body, swings it left and right, and finally lowers it. Therefore, if the movements “hand raising”, “hand swinging left and right”, and “hand lowering” have each been recognized successfully and were performed in this order, the motion is identified (discriminated) as “hand shaking”.
[0143]
In such a case, it suffices that any three of the recognition units 4a to 4x recognize the three movements above, and that the motion identification unit 6 stores in advance, as knowledge about the human “hand shaking” motion, the fact that the motion consists of these three movements performed in this order.
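The identification rule described in the preceding paragraphs might be sketched as follows. The knowledge is represented here as a dictionary from a motion name to the ordered list of movements that make it up; this representation and the function name `identify_motion` are assumptions, since the description does not fix how the knowledge is expressed or stored.

```python
def identify_motion(recognized, knowledge):
    """Identify a motion from the movements recognized, in time order.

    recognized: names of the movements whose recognition succeeded,
                listed in the order in which they were performed.
    knowledge:  dict mapping a motion name to its ordered movement list.
    Returns the identified motion name, or None if nothing matches.
    """
    if len(recognized) == 1:
        return recognized[0]  # single success: output it as it is
    for name, sequence in knowledge.items():
        if list(recognized) == list(sequence):
            return name
    return None


# Hypothetical stored knowledge about the "hand shaking" motion.
KNOWLEDGE = {
    "hand shaking": [
        "hand raising",
        "hand swinging left and right",
        "hand lowering",
    ],
}
print(identify_motion(KNOWLEDGE["hand shaking"], KNOWLEDGE))
```

Because the knowledge is an ordinary dictionary in this sketch, entries can be replaced or added while the program is running.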
[0144]
Note that the method of expressing the knowledge, the method of storing it, and so on are not particularly limited in the present invention; any conceivable method can be used. The knowledge stored in advance is not fixed and can be replaced or updated as desired during operation.
[0145]
The above-described discrimination method is merely an example, and the present invention is not limited to this. The discrimination may be performed based on the reliability described in the section of the sixth modification of the first embodiment, or a method other than this may be used.
[0146]
In the third embodiment, the case of recognizing the movement of one object has been described. However, this method can also be applied to the image recognition apparatus described in the second embodiment. That is, when a plurality of motions exist in the distance image, the region extraction unit 5 extracts from the difference image a plurality of recognition regions corresponding to the positions of the respective motions, and the detection unit 3 obtains, for each extracted recognition region, the feature amount corresponding to each motion. Each of the recognition units 4a to 4x then recognizes the type of movement in each recognition region, and the motion identification unit 6 finally identifies what motion was performed in each recognition region. The motion identification unit 6 can further identify what kind of motion was performed as a whole from the motions recognized in the individual recognition regions.
(Fourth embodiment)
FIG. 29 is an overall configuration diagram of an image recognition apparatus according to the fourth embodiment of the present invention. In FIG. 29, the same parts as those in FIG. 1 are denoted by the same reference numerals, and only the different parts will be described. That is, the image recognition apparatus of FIG. 29 further includes a shape recognition unit 7 that recognizes, from the distance image acquired by the image acquisition unit 1, the shape of the object whose motion is to be recognized.
[0147]
The method by which the shape recognition unit 7 identifies the shape of the object is not specifically addressed in the present invention; any conceivable means can be used. One example is template matching. In this method, a large number of shape templates are prepared, the template most similar to the object included in the image is detected, and the shape represented by that template is taken as the result. Specifically, templates such as a circle, a triangle, a square, and a hand shape are stored in the shape recognition unit 7 in advance, and if the object in the distance image is most similar to the triangle template, the object is recognized as a triangle.
[0148]
For this purpose, the shape recognition unit 7 may, for example, extract contour information of the object from the distance image acquired from the image acquisition unit 1. That is, from a distance image as shown in FIG. 6, cells whose pixel values are equal to or smaller than a predetermined value are excluded, and the contour information of the imaged object as shown in FIG. 30 is extracted.
[0149]
To extract the contour information as shown in FIG. 30, the pixel values of adjacent pixels are compared, a constant value is assigned only where the difference exceeds a certain value α, and the pixels of the continuous region assigned the same constant value are extracted.
[0150]
That is, for example, let P(i, j) be the pixel value at the coordinate position (i, j) on the matrix of distance image data as shown in FIG. 4, and let R(i, j) be the pixel value of the contour information. Then:
・ If
    P(i, j) - P(i-1, j) > α, and
    P(i, j) - P(i, j-1) > α, and
    P(i, j) - P(i+1, j) > α, and
    P(i, j) - P(i, j+1) > α,
  then R(i, j) = 255
・ Otherwise, R(i, j) = 0
In this way, the contour information of the object as shown in FIG. 30 can be obtained.
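The contour rule above can be written out directly, as in the following sketch. It applies the four conditions exactly as stated, with all four joined by "and". How pixels on the image border are handled is not stated, so this sketch simply leaves them at 0; that choice and the function name `extract_contour` are assumptions.

```python
def extract_contour(p, alpha):
    """Return R where R[i][j] = 255 if P(i, j) exceeds each of its four
    neighbours by more than alpha, and 0 otherwise.

    p is a list of rows of pixel values. Border pixels are left at 0
    because they lack one or more neighbours (an assumption).
    """
    height, width = len(p), len(p[0])
    r = [[0] * width for _ in range(height)]
    for i in range(1, height - 1):
        for j in range(1, width - 1):
            if (p[i][j] - p[i - 1][j] > alpha
                    and p[i][j] - p[i][j - 1] > alpha
                    and p[i][j] - p[i + 1][j] > alpha
                    and p[i][j] - p[i][j + 1] > alpha):
                r[i][j] = 255
    return r
```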
[0151]
The contour information of the object extracted in this way is compared with the templates stored in advance, the template most similar to the contour information is detected, and the shape represented by that template is output as the recognition result for the shape of the object.
[0152]
Note that recognizing the shape of an object using its contour as described above is only an example. The templates themselves may be distance images, and the shape of the object may be recognized by comparing the acquired distance image directly with the template distance images, without extracting a contour.
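Template matching of this kind could be sketched as below. The similarity measure is not specified in the description, so the sum of absolute pixel differences is used here purely as an assumption, and the images and templates are assumed to have the same size. The function name `recognize_shape` and the template names in the usage example are hypothetical.

```python
def recognize_shape(image, templates):
    """Return the name of the template most similar to image.

    image:     list of rows of pixel values (a contour image or a
               distance image).
    templates: dict mapping a shape name to an image of the same size.
    Similarity is measured by the sum of absolute pixel differences
    (smaller means more similar); this measure is an assumption.
    """
    def distance(a, b):
        return sum(abs(x - y)
                   for row_a, row_b in zip(a, b)
                   for x, y in zip(row_a, row_b))

    return min(templates, key=lambda name: distance(image, templates[name]))


# Tiny hypothetical templates, only to show the call.
TEMPLATES = {
    "square": [[255, 255], [255, 255]],
    "diagonal": [[255, 0], [0, 255]],
}
print(recognize_shape([[250, 10], [0, 240]], TEMPLATES))
```

The same function works whether the stored templates are contour images or distance images, since it compares pixel values only.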
[0153]
In this way, not only the motion of the object but also its shape is recognized, and by referring to the recognized shape when recognizing the motion, it is also possible to recognize, for example, what shape the hand had and how it moved. The above method can also be applied to sign language recognition.
[0154]
The above embodiments and their modifications can be implemented in combination as appropriate. The technique of the present invention can be applied to an apparatus that recognizes an operation based on a given distance image or its stream, or performs various processes based on the recognition result.
[0155]
Each component shown in FIGS. 1, 21, 28, and 29, except for the image acquisition unit 1, can be realized as software. The above-described method of the present invention can also be implemented as a machine-readable medium storing a program to be executed by a computer.
[0156]
The technique of the present invention described in the embodiments can be stored, as a program executable by a computer, on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory, and distributed.
[0157]
The present invention is not limited to the above embodiments, and can be modified in various ways at the implementation stage without departing from its gist. Further, the above embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining the plurality of disclosed constituent elements. For example, if at least one of the problems described in the section on the problems to be solved by the invention can still be solved, and at least one of the effects described in the section on the effects of the invention can still be obtained, even when some constituent elements are deleted from all the constituent elements shown in the embodiments, the configuration from which those constituent elements are deleted can be extracted as an invention.
[0158]
[Effects of the Invention]
As described above, according to the present invention, three-dimensional motion recognition can be performed easily and stably with high accuracy.
[Brief description of the drawings]
FIG. 1 is a diagram schematically showing a configuration example of an image recognition apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of an appearance of an image acquisition unit that acquires a distance image.
FIG. 3 is a diagram illustrating a configuration example of an image acquisition unit that acquires a distance image.
FIG. 4 is a diagram illustrating an example of a distance image in which the intensity of reflected light is a pixel value.
FIG. 5 is a three-dimensional representation of a matrix format distance image as shown in FIG.
FIG. 6 is a diagram showing a display image of a distance image of a hand acquired by an image acquisition unit.
FIG. 7 is a flowchart for explaining the processing operation of the image recognition apparatus in FIG. 1.
FIG. 8 is a diagram for explaining a difference image.
FIG. 9 is a diagram for explaining feature amounts.
FIG. 10 is a diagram for explaining a feature amount, and in particular, a diagram for explaining an inflow region and an outflow region.
FIG. 11 is a diagram for explaining feature amounts, and more particularly, a diagram for explaining an inflow region and its representative point (here, the center of gravity).
FIG. 12 is a diagram for explaining a feature amount, and in particular, a diagram for explaining an outflow region and its representative point (in this case, the center of gravity).
FIG. 13 is a diagram for explaining a differential flow as a feature amount.
FIG. 14 is a diagram for explaining image data of a difference image, an inflow region, and an outflow region.
FIG. 15 is a diagram for explaining a hand raising / lowering operation using a distance image.
FIG. 16 is a diagram illustrating a temporal change in a feature amount (differential flow).
FIG. 17 is a diagram showing a state of time change of | Vy × Vz |.
FIG. 18 is a diagram for explaining a lateral movement in a hand motion.
FIG. 19 is a diagram illustrating a temporal change in | Vx × Vz |.
FIG. 20 is a diagram showing a temporal change state of | Vx × Vz | when a hand movement is performed with a fast movement.
FIG. 21 is a diagram schematically showing a configuration example of an image recognition apparatus according to a second embodiment of the present invention.
FIG. 22 is a flowchart for explaining the processing operation of the region extraction unit 5 in FIG. 21.
FIG. 23 is a diagram for explaining a case where a plurality of (for example, two here) motions exist in two distance images.
FIG. 24 is a diagram for explaining processing for extracting a circumscribed rectangle of an object from a distance image.
FIG. 25 is a diagram for explaining processing for generating a recognition area for recognizing motion.
FIG. 26 is a diagram for explaining processing for extracting a recognition area from a difference image.
FIG. 27 is a diagram for explaining a feature amount (differential flow) obtained from a recognition area extracted from a difference image.
FIG. 28 is a diagram schematically showing a configuration example of an image recognition apparatus according to a third embodiment of the present invention.
FIG. 29 is a diagram schematically showing a configuration example of an image recognition apparatus according to a fourth embodiment of the present invention.
FIG. 30 is a diagram showing an example of a contour image of an object extracted from a distance image.
[Explanation of symbols]
1 ... Image acquisition unit
2 ... Difference calculation unit
3 ... Detection unit
4 ... Recognition unit
4a ... First recognition unit
4b ... Second recognition unit
4x ... x-th recognition unit
5 ... Region extraction unit
6 ... Motion identification unit
7 ... Shape recognition unit

Claims

An image recognition method comprising:
a step in which distance image generating means, which generates a distance image in which each pixel value indicates a distance to an object, acquires a plurality of time-series distance images of the object;
a step of calculating difference data of pixel values between two distance images of the plurality of distance images, and extracting from the difference data an outflow region in which pixel values have decreased and an inflow region in which pixel values have increased with the movement of the object;
a step of calculating the amount of change in the x-axis, y-axis, and z-axis directions from the centroid position of the outflow region to the centroid position of the inflow region; and
a step of recognizing the movement of the object based on the obtained amounts of change in the x-axis, y-axis, and z-axis directions.

An image recognition method comprising:
a first step in which distance image generating means, which generates a distance image in which each pixel value indicates a distance to each object, acquires a plurality of time-series distance images of each object;
a second step of obtaining pixel value difference data between the image areas of each object in two distance images of the plurality of distance images, and extracting, from the difference data corresponding to each object, an outflow region in which pixel values have decreased and an inflow region in which pixel values have increased with the movement of that object;
a third step of calculating, for each object, the amount of change in the x-axis, y-axis, and z-axis directions from the centroid position of the outflow region to the centroid position of the inflow region; and
a fourth step of recognizing the movement of each object based on the obtained amounts of change in the x-axis, y-axis, and z-axis directions of each object.

The image recognition method according to claim 3, wherein, in the second step, the outflow region and the inflow region are extracted from difference data of pixel values between the regions, other than the overlapping region, of the image area of each object in the first distance image and the image area corresponding to that object in the second distance image that has an area overlapping it.

The image recognition method according to claim 1 or 2, wherein the movement of the object is recognized based on at least one component value selected, according to the movement to be recognized, from among the component values of the change amount in the x direction, the y direction, and the z direction.

The image recognition method according to claim 4, wherein the at least one component value is selected from among the component values of the change amount based on a characteristic movement direction of the movement to be recognized.

The image recognition method according to claim 4, wherein the at least one component value is selected from among the component values of the change amount based on a characteristic movement direction of the movement to be recognized and a direction correlated with that movement direction.

An image recognition apparatus provided with distance image generation means for generating a distance image in which each pixel value indicates a distance to an object, the apparatus comprising:
first calculation means for calculating difference data of pixel values between two distance images among a plurality of time-series distance images of the object obtained by the distance image generation means;
extraction means for extracting, from the difference data, an outflow region in which pixel values have decreased and an inflow region in which pixel values have increased with the movement of the object;
second calculation means for calculating the amount of change in the x-axis, y-axis, and z-axis directions from the centroid position of the outflow region to the centroid position of the inflow region; and
recognition means for recognizing the movement of the object based on the obtained amounts of change in the x-axis, y-axis, and z-axis directions.

An image recognition apparatus provided with distance image generation means for generating a distance image in which each pixel value indicates a distance to each object, the apparatus comprising:
first calculation means for calculating pixel value difference data between the image areas of each object in two distance images among a plurality of time-series distance images obtained by the distance image generation means;
extraction means for extracting, from the difference data corresponding to each object, an outflow region in which pixel values have decreased and an inflow region in which pixel values have increased with the movement of that object;
second calculation means for calculating, for each object, the amount of change in the x-axis, y-axis, and z-axis directions from the centroid position of the outflow region to the centroid position of the inflow region; and
recognition means for recognizing the movement of each object based on the obtained amounts of change in the x-axis, y-axis, and z-axis directions of each object.

The image recognition apparatus according to claim 8, wherein the extraction means extracts the outflow region and the inflow region from difference data of pixel values between the regions, other than the overlapping region, of the image area of each object in the first distance image and the image area corresponding to that object in the second distance image that has an area overlapping it.

The image recognition apparatus according to claim 7 or 8, wherein the movement of the object is recognized based on at least one component value selected, according to the movement to be recognized, from among the component values of the change amount in the x direction, the y direction, and the z direction.

The image recognition apparatus according to claim 10, wherein the at least one component value is selected from among the component values of the change amount based on a characteristic movement direction of the movement to be recognized.

The image recognition apparatus according to claim 10, wherein the at least one component value is selected from among the component values of the change amount based on a characteristic movement direction of the movement to be recognized and a direction correlated with that movement direction.