JP2004280376A

JP2004280376A - Method and system for recognition of subject's behavior

Info

Publication number: JP2004280376A
Application number: JP2003069913A
Authority: JP
Inventors: Kunio Fukunaga; 邦雄福永
Original assignee: Japan Science and Technology Agency
Current assignee: Japan Science and Technology Agency
Priority date: 2003-03-14
Filing date: 2003-03-14
Publication date: 2004-10-07

Abstract

PROBLEM TO BE SOLVED: To provide a method and system for recognition of a subject's behavior that can continuously observe a person's indoor and outdoor behavior and can manage a plurality of persons individually.

SOLUTION: The system for recognition of a subject's behavior comprises a camera 6 mounted on a moving body 2, a radio 7 for transmitting a moving image of a subject 13, including an object 14 or a part 12 of the moving body, imaged by the camera 6 as a radio signal 16, and a behavior analysis device 20 for receiving the radio signal 16 transmitted from the radio 7 via a network 18. The behavior analysis device 20 has an image analysis part 24 for processing the moving image to analyze the subject's behavior, and a text generation part 28 for outputting the behavior of the subject 13 analyzed by the image analysis part 24 as text information. An image analysis method and a text generation method are also provided. Advantageously, the behavior of a person or the like can be individually managed in real time and converted to text via the network.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

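[0001]
[Technical field of the invention]
The present invention relates to a method of recognizing the behavior of a subject using a camera. More specifically, it relates to a subject behavior recognition method and apparatus in which a camera is mounted on a moving body such as a person, an animal, an object, or a vehicle, and the moving image captured by the camera is analyzed to recognize the behavior of the moving body or of a target object.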
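[0002]
[Prior art]
Unmanned notification systems that manage the behavior of care recipients in hospitals and similar facilities, or that monitor suspicious persons entering a building from outside, are known. A typical system of this kind arranges infrared sensors, detects the infrared radiation emitted by a person, and either reports the intrusion or displays an infrared image of the intruder.
[0003]
Such an infrared notification system can report only a single, limited kind of information, namely infrared radiation, and it obtains information only for the specific position and direction in which the sensor is installed. It cannot, for example, manage the behavior of a person who is outside the sensor's range. A further drawback is that an administrator must watch continuously in order to check the infrared image of an intruder.
[0004]
Systems that fix a video camera at a required position in a building and record during unattended hours are also in common use. No one needs to be present during recording, but the video must be played back to determine whether anything abnormal occurred. Because the blind spots of a video camera cannot be monitored, cameras have to be installed at many locations, which makes the system expensive. Monitoring from a remote location also requires transmitting the video, so the volume of transmitted data is large and the communication cost excessive.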
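[0005]
[Problems to be solved by the invention]
To reduce the amount of transmitted information, the present inventors disclosed an "unmanned notification system using text information" in Japanese Unexamined Patent Publication No. H10-40482. In that system, a video camera and a microphone are fixed inside a building, a specific person is filmed continuously, the resulting video and audio information is converted into text information according to case grammar, and the text is sent to an administrator, who uses it to observe the person's behavior.
[0006]
Because that invention converts video and audio information, which carry an enormous amount of data, into text information, which needs very little, the communication cost of sending it to the administrator is low; and because only text is recorded, an inexpensive storage device suffices.
[0007]
However, since the video camera and microphone are fixed at specific locations in the building, that invention too can film only those locations, and only in specific directions. The camera's blind spots are large, and the behavior of a person inside a blind spot is entirely unknown.
[0008]
In particular, a person who moves far from the video camera can no longer be filmed. Cameras must therefore be placed at many locations in the building, and constructing the video system becomes very expensive.
[0009]
Management by a video system is needed not only for intrusion monitoring but also for managing the behavior of care recipients in hospitals, nursing homes, and the like. Suppose a video system were completed, regardless of cost, so that no blind spots remained inside the building. Even then, once a care recipient leaves the building, he or she is outside the area the system covers, and behavior management becomes impossible.
[0010]
Thus, in conventional systems with fixed video cameras, the area in which the behavior of care recipients can be managed is limited to the interior of the building. Moreover, to manage several care recipients, the administrator has no choice but to tell them apart by eye, and cannot escape the burden of watching the video continuously.
[0011]
Combining this fixed video camera system with a text conversion system, and adding an alarm function to the text notification function, does have the advantages that excessive storage capacity is no longer needed and that constant observation becomes unnecessary when there is only one care recipient. Even so, the behavior of several care recipients cannot be managed individually, and the administrator still bears the heavy burden of constant visual observation.
[0012]
Accordingly, the present invention abandons the practice of fixing the video camera to the building, which had persisted like a spell, and adopts an entirely new point of view. Its object is to provide a subject behavior recognition method and apparatus that enable continuous observation of a person's behavior both inside and outside a building and that can manage several persons individually. A further object is that, by adding a text conversion system, storage and communication capacity can be sharply reduced and, by adding an alarm function, a report is issued only in an emergency so that it can be dealt with without constant visual observation.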
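[0013]
[Means for solving the problems]
The present invention was made to solve the above problems. The first invention is a subject behavior recognition apparatus comprising: a camera mounted on a moving body; a radio that transmits, as a radio signal, a moving image of a subject consisting of a target object photographed by the camera or a part of the moving body; and a behavior analysis device that receives the radio signal, the behavior analysis device having an image analysis unit that processes the moving image to analyze the behavior of the subject and a text generation unit that outputs the analyzed behavior as text information. The invention is characterized in that a camera, such as a video camera or a camera built into a mobile phone, is mounted on a moving body such as a person, an animal, or a vehicle, so that the camera moves together with the moving body. The inventors call this camera a wearable camera (accompanying camera). Various objects in the target world are photographed, and when the moving body is a person, the person's hands are photographed as well. The care recipient's actions can therefore be recognized at all times from the movement of the hands, and the environment in which he or she is acting can be grasped at all times from the image of the target world. If a camera and a radio are attached to each of several care recipients, each moving image is received separately, so several persons can be managed individually and simultaneously. Since the only requirement is that the radio signal carrying the moving image be received, a configuration in which a radio transceiver receives the signal directly can be used when the person acts near the management center, and a network such as the Internet or a mobile phone system can be used when the person acts at a remote location.
Furthermore, because the moving image signals are analyzed and output as text information, storage and communication capacity can be reduced; and if an alarm device is added to the text information, abnormal behavior of the moving body can be recognized in real time and dealt with immediately, without anyone watching the images continuously. In addition, because text information is generated in correspondence with the moving image, text reflecting the characteristics of the person wearing the camera accumulates, a text database tailored to each individual can be built, and management information on persons can be systematized.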
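[0014]
The second invention is a subject behavior recognition apparatus comprising: a camera mounted on a moving body; a radio that transmits, as a radio signal, a moving image of a subject consisting of a target object photographed by the camera or a part of the moving body; and a behavior analysis device that receives, via a network, the radio signal transmitted from the radio, the behavior analysis device having an image analysis unit that processes the moving image to analyze the behavior of the subject and a text generation unit that outputs the analyzed behavior as text information. This invention differs from the first only in that the moving image signal is sent and received via a network, so it has the same operation and effects as the first invention. In particular, because the radio signal is sent and received via a network, the moving image signal can be exchanged almost instantly over a wide-area or local network even when the person or other moving body is far away, and behavior management can be established over a wide area.
[0015]
The third invention is a subject behavior recognition apparatus comprising: a camera mounted on a moving body; and a behavior analysis device and a radio attached to the camera, the behavior analysis device having an image analysis unit that processes a moving image of a subject, consisting of a target object photographed by the camera or a part of the moving body, to analyze the behavior of the subject, and a text generation unit that outputs the analyzed behavior as text information, wherein the radio transmits the text information wirelessly to the site that needs it. The feature of this invention is that the camera, the behavior analysis device, and the radio are mounted as one unit on a moving body such as a person, an animal, an object, or a vehicle. Building the behavior analysis device from an ultra-compact computer makes the whole apparatus compact, so that it can, for example, be worn by a person as he or she moves about, and each individual can be managed separately. Moreover, because the radio transmits text information, the amount of data is small; the text can be sent to the required site via a network or transmitted directly to a management center or the like.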
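[0016]
The fourth invention is a subject behavior recognition method in which a camera mounted on a moving body photographs the target world to capture a moving image; one of the time-series image frames making up the moving image is set as a reference frame; an image frame that follows the reference frame in time is taken as an input frame; the input frame is transformed so as to approximate the reference frame as closely as possible; movement parameters indicating how far the input frame has moved from the reference frame are derived from this transformation; and the amount of movement of the moving body is estimated from the movement parameters. If, for example, an affine transformation is applied to the input frame to bring it back to the reference frame, the translation in the X, Y, and Z directions, the rotation, and the scaling factor can be derived as movement parameters. Because the input frame moves in the direction opposite to the camera, the parameter values obtained from this restoring transformation give the amount of movement in the direction in which the camera moved together with the moving body. Besides the affine transformation, a wide range of mathematical transformations from which camera movement can be estimated may be used.
[0017]
The fifth invention is a subject behavior recognition method in which a camera mounted on a moving body photographs the target world to capture a moving image; one of the time-series image frames making up the moving image is set as a reference frame; the sequence of image frames that follows the reference frame in time is taken as an input frame sequence; the input frame sequence is transformed so as to approximate the reference frame as closely as possible, forming a transformed image sequence; movement parameters indicating how far each input frame has moved from the reference frame are derived from the transformation; when the amount of movement of the moving body estimated from the movement parameters exceeds a reference movement amount, that input frame is set as the new reference frame, thereby updating the reference frame; and these operations are repeated so that the behavior of the moving body is judged from the update frequency (update rate) of the reference frame. Taking a person as the moving body, the input frame departs from the reference frame even when the person, while seated, stretches up and down, bends to the left or right, or rocks back and forth. The movement parameters produced by the transformation do not become very large in these cases. When the person walks, however, he or she moves steadily in the X or Y direction, so the movement parameters are expected to grow in one direction. This invention judges whether the person is stationary or moving from the rate at which the reference frame is updated, that is, the update frequency (also called the update rate). When the reference frame is updated repeatedly in one direction, the person is judged to be walking (moving); when the update frequency is low, the person is judged to be stationary, either seated or standing.
[0018]
The sixth invention is a subject behavior recognition method in which a camera mounted on a moving body photographs the target world to capture a moving image; one of the time-series image frames making up the moving image is set as a reference frame; the sequence of image frames that follows the reference frame in time is taken as an input frame sequence; the input frame sequence is transformed so as to approximate the reference frame as closely as possible, forming a transformed image sequence; attention is paid to a specific region, representing a specific target object, within the background region of each transformed image; and the behavior of the specific target object is judged from the motion of the specific region across the transformed image sequence. In the transformed images obtained by transforming the input frames to the reference frame, the background region, which occupies most of the image, is common to all of them when the person (moving body) is stationary. It is therefore possible to focus, within this common background, on the movement of a specific target object such as another person's head, the person's own hand, or a cup held by the person, and to judge the behavior of the subject (the person or the other person) from the motion of that object.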
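[0019]
The seventh invention is a subject behavior recognition method in which, when the moving body is a person, the specific target object is the person's hand, the hand region is extracted as the specific region from at least skin-color information and motion information, and the person's behavior is judged from the motion of the hand region. When a camera is worn by a person and one or both hands move in front of the camera, the hands naturally become subjects too. In the transformed image, hands are first selected as skin-colored regions and are then confirmed as hands from their motion information. For example, solving an equation of motion for the hand and checking whether a hand is found at the predicted position provides supporting evidence that the region is a hand. The hand is recognized when skin-color information and motion information agree; if the hand portion is then colored in the transformed image, the action or behavior of the person, the moving body, can be recognized from the movement of the hand.
[0020]
The eighth invention is a subject behavior recognition method in which the specific target object is the face of another person photographed by the camera, the face region is extracted as the specific region from at least skin-color information and contour information, and the other person's behavior is judged from the motion of the face region. In the transformed image, the other person's face is recognized when skin-color information and contour information, for example an elliptical shape, agree. Once the face has been identified from both, it is displayed in color, and what the other person is doing can be judged from the movement of the face.
[0021]
The ninth invention is a subject behavior recognition method in which a camera mounted on a moving body photographs the target world to capture a moving image; one of the time-series image frames making up the moving image is set as a reference frame; the sequence of image frames that follows the reference frame in time is taken as an input frame sequence; the input frame sequence is transformed so as to approximate the reference frame as closely as possible, forming a transformed image sequence; attention is paid to a specific region, representing a specific target object, within the background region of each transformed image; the image of the specific region is compared with stored template models; and the specific target object is concretely identified. For example, when the person, the moving body, is holding a cup, the motion of the hand can be judged by the method described above. When the hand is holding something, the shapes of one or more concrete objects, such as a cup, a watch, or a book, are stored in memory in advance, and the object held in the hand is compared with these template models to identify it. In other words, if the object is recognized as a cup at the same time as the hand movement is judged, the action can be recognized as the person being about to drink from the cup. Combining the person's motion with object recognition in this way allows more advanced recognition of the subject's behavior.
[0022]
The tenth invention is a subject behavior recognition method in which the behavior of the subject is converted into text information. Once the behavior has been recognized, it can be expressed concisely as text using case grammar or the like; the amount of information is reduced from image information to text information, and storing and communicating text sharply reduces storage and communication costs. In addition, because the behavior of the person, the moving body, can be converted into text, behavior records can be compiled into a text database; a behavior database for a particular person can be built, and safety management of people in hospitals, nursing homes, schools, and the like can be carried out efficiently.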
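[0023]
[Embodiments of the invention]
Embodiments of the subject behavior recognition method and apparatus according to the present invention are described in detail below with reference to the accompanying drawings.
[0024]
Fig. 1 is a schematic configuration diagram of a first embodiment of the subject behavior recognition apparatus according to the present invention. The moving body 2 is something mobile, such as a person, an animal, an object, or a vehicle; here it is assumed to be a person whose behavior is to be recognized and managed. When the aim is to recognize the target world, however, the moving body 2 (sometimes called the person 2) may be an animal or a vehicle such as a car, bicycle, or motorcycle; anything that can move freely while the camera images the target world will do.
[0025]
A camera 6 with a built-in radio 7 is fixed to the moving body 2. The camera is also equipped with a microphone 4, so that sound from the person 2 and the target world 8 can be recorded as well. Concretely, the camera 6 may be a video camera fitted with the radio 7, or a camera-equipped mobile phone. The arrangement allows the video information from the camera 6 and the audio information from the microphone 4 to be transmitted by the radio 7.
[0026]
The camera 6 captures a moving image of the target world 8. With a consumer video camera the moving image normally consists of 30 frames per second, although this may differ between consumer and professional equipment. If one frame in every six from a consumer video camera is used, the frame rate becomes 5 frames per second. The number of frames per second (frame rate) can thus be chosen freely.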
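As a quick check of the frame-rate arithmetic above, the following minimal sketch keeps one frame in six from one second of a 30 frames-per-second sequence. The names `subsample`, `SOURCE_FPS`, and `STEP` are illustrative assumptions and do not come from the patent.

```python
SOURCE_FPS = 30   # consumer video camera rate quoted in the text
STEP = 6          # keep one frame in every six


def subsample(frames, step):
    """Return every `step`-th frame, starting with the first."""
    return frames[::step]


# One second of video, represented here only by frame indices.
frames = list(range(SOURCE_FPS))
kept = subsample(frames, STEP)
effective_fps = SOURCE_FPS / STEP
```

Choosing a larger `STEP` lowers the processing load at the cost of larger frame-to-frame displacements.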
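[0027]
The target world 8 imaged by the camera 6 is the region enclosed by the imaginary line 10. It contains the hand 12 of the moving body (person) 2 and the target object 14, which are together called the subject 13. A moving image and sound of the subject 13 are thus obtained, and the radio 7 transmits a radio signal 16 consisting of a video signal and an audio signal.
[0028]
The radio signal 16 is carried over a wide area through a network such as the Internet. At the management center 34, which observes the target world 8, the behavior analysis device 20 receives the radio signal via the network 18.
[0029]
The behavior analysis device 20 is, for example, a computer such as a personal computer, or an electronic circuit device. It consists of an input unit 22, an image analysis unit 24, an audio analysis unit 26, a text generation unit 28, and a text database unit 30 formed within the text generation unit 28.
[0030]
The input unit 22 receives an input signal 19 from the network 18. The input signal 19 consists of a video signal and an audio signal. The video signal is fed to the image analysis unit 24 and the audio signal to the audio analysis unit 26.
[0031]
The operation and functions of the image analysis unit 24 are described in detail later with reference to Figs. 4 to 13. Briefly, the video signal is input as a time series of image frames, and the motion of specific regions in the image is estimated by mathematically transforming each frame and analyzing the transformed images.
[0032]
The audio analysis unit 26 analyzes the audio signal picked up by the microphone 4. The hidden Markov model (HMM) technique, for example, can be used for this analysis. When the subject is a person, sound is produced along with the motion, so the sound can be used as an aid to make the identified action more certain rather than relying on the image alone. When the content of the action and the content of the sound agree, the action can be determined with high probability.
[0033]
The text generation unit 28 converts the actions obtained from image and sound, and from the image in particular, into text information. That is, it converts image information into text information while using audio information as an aid. This conversion turns image information, which requires large memory, into text information, which needs little, and so slims down the amount of information.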
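[0034]
Case grammar can be used in the text generation unit 28 as one way of generating text from images. First, the verb (PRED) best suited to the behavior shown in the image is selected. Next, the cases of the phrases that depend on this verb, such as the nominative, objective, and instrumental cases, are determined around it, and a text (sentence) is formed by combining the verb and its cases.
[0035]
Concretely, the agent (AG) that performs the action and the object (OBJ) on which it is performed are selected. The start time (SO-TIME) and end time (GO-TIME) of the action are also given. The result is an action representation of the following form.
[PRED: verb, AG: agent, OBJ: object, SO-TIME: time1, GO-TIME: time2]
[0036]
Ultimately a text representation in natural language is preferable. The action representation obtained as above is converted into a natural-language sentence by case-structure conversion, for example as follows.
[PRED: sousa-suru, AG: man1, OBJ: ws1, SO-TIME: t1, GO-TIME: t2]
"From time t1 to t2, user man1 operated workstation ws1."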
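The case-structure conversion described above can be sketched roughly as follows. This is a minimal illustration, assuming a fixed sentence pattern and a small hand-written lexicon; the class `CaseFrame`, the function `render`, and the tables `VERBS` and `NOUNS` are hypothetical names introduced here, not part of the patent.

```python
from dataclasses import dataclass


@dataclass
class CaseFrame:
    pred: str      # PRED: verb chosen for the observed behavior
    ag: str        # AG: agent performing the action
    obj: str       # OBJ: object the action is performed on
    so_time: str   # SO-TIME: start time of the action
    go_time: str   # GO-TIME: end time of the action


# Hypothetical lexicon mapping internal symbols to surface wording.
VERBS = {"sousa-suru": "operated"}
NOUNS = {"man1": "user man1", "ws1": "workstation ws1"}


def render(frame: CaseFrame) -> str:
    """Fill a fixed English sentence pattern from the case frame."""
    verb = VERBS.get(frame.pred, frame.pred)
    agent = NOUNS.get(frame.ag, frame.ag)
    obj = NOUNS.get(frame.obj, frame.obj)
    return f"From time {frame.so_time} to {frame.go_time}, {agent} {verb} {obj}."


frame = CaseFrame(pred="sousa-suru", ag="man1", obj="ws1", so_time="t1", go_time="t2")
sentence = render(frame)
```

A real system would need further cases (instrument, location) and grammatical agreement; the single pattern here only mirrors the one example given in the text.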
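[0037]
In short, the text generation unit analyzes the moving image, generates many action representations one after another, and translates each into a natural-language sentence, producing text that anyone can understand. The text generation method is not limited to case grammar structures or case-structure conversion, however; any of the text generation methods currently being developed may be adopted.
[0038]
The text database unit 30 is a memory unit that arranges the generated text according to given rules and stores it. When the subject 13 is the person 2 wearing the camera, that person's actions are converted to text one after another, so an action database characteristic of the person can be built.
[0039]
If the system is adopted in a hospital, for example, a camera 6 is attached to each patient and a behavior database is built for each, which makes patient management very smooth. In a nursing home, a camera 6 is attached to each elderly resident, a behavior database is created for each, and on that basis each resident can be assisted quickly and safely. The system is therefore particularly effective where the members of a group are to be managed individually.
[0040]
The communication unit 32 receives a text signal 31 from the text generation unit 28 and transmits the text information to the management center 34. Because text data is very small, little storage or communication capacity is needed and communication is fast. An ordinary communication device and method therefore suffice for the communication unit 32, which keeps the cost low.
[0041]
The management center 34 is a facility that, on the basis of the text data, manages the person wearing the camera or the persons photographed by the camera. Since the data it receives is text, its storage devices can also be small. The management center 34 receives the text database 30 created for each person and uses it as the basic data for individual management.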
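[0042]
Fig. 2 is a schematic configuration diagram of a second embodiment of the subject behavior recognition apparatus according to the present invention. This apparatus does not use a network; it receives the radio signal directly with an antenna to manage behavior. Most of it is the same as the apparatus of Fig. 1, so parts identical to those in Fig. 1 carry the same reference numerals and are not described again; only the parts with different numerals are described.
[0043]
The radio signal 16, consisting of image and audio signals, is received directly by the receiving antenna 21 and passed to the input unit 22 as the input signal 19. Subsequent processing is the same as in Fig. 1.
[0044]
In areas where the network 18 is available, the behavior analysis device 20 of Fig. 1 is used; in areas where it is not, a system that links the behavior analysis device 20 and the camera 6 directly by radio is effective.
[0045]
Fig. 3 is a schematic configuration diagram of a third embodiment of the subject behavior recognition apparatus according to the present invention. In this apparatus the behavior analysis device 20 and the radio 7 are integrated with the camera 6 and mounted on the moving body 2, and text information is transmitted wirelessly to the site that needs it. Parts identical to those in Fig. 1 carry the same reference numerals and are not described again; only the differences are described.
[0046]
If the behavior analysis device 20 is built from an ultra-compact computer, integrated with the camera 6 and radio 7, and mounted on a moving body 2 such as a person, text information describing the analyzed behavior of the subject can be sent straight from the moving body 2 to the required site.
[0047]
That is, the moving image of the subject 13 captured by the camera 6 is input to the behavior analysis device 20 mounted on the moving body 2, together with the audio signal detected by the microphone 4. The motion of the subject is analyzed from the moving image and sound in the behavior analysis device 20, and the characteristics of the motion are output as text information by the text generation unit 28.
[0048]
The text information is input to the radio 7 as the text signal 31 and transmitted by the radio 7 as the radio signal 16. The radio signal 16 is received at the management center 34 as the input signal 19, for example via the network 18. Alternatively, as indicated by the imaginary line, the radio signal 16 is picked up directly by an antenna and received at the management center 34.
[0049]
The first and second embodiments transmit image and audio signals as the radio signal 16, whereas the third embodiment transmits a text signal as the radio signal 16. The difference arises from whether the behavior analysis device 20 is located at the site or on the moving body 2.
[0050]
The concrete analysis method of the image analysis unit 24 and the concrete generation method of the text generation unit 28 are described below case by case. When the behavior analysis device 20 is a computer, the methods are carried out by a program; when it is an electronic circuit device, they are carried out according to the procedure of the electronic circuit.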
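[0051]
Fig. 4 is a process diagram for estimating the amount of movement of a moving body wearing a camera by transforming the moving image. In (4A), a camera capable of capturing moving images is mounted on the moving body. The moving body may be any moving object, such as a person, animal, car, or bicycle, but a person is the usual target of behavior management. The moving body is therefore assumed to be a person below.
[0052]
In (4B), the camera photographs the target world and a moving image of it is captured. The moving image contains a variety of content, such as the hands of the person wearing the camera and other people in the target world. Because the moving body moves around, the moving image changes in various ways over time.
[0053]
When the moving body walks (moving state), the background in the moving image changes considerably. Conversely, when the moving body (person) is seated or standing upright, it is in a stationary state and the background changes little. Even so, if the person turns to the left or right or shifts slightly back and forth, the camera moves too and the moving image changes somewhat. These large and small changes in the moving image are recognized in order to estimate the amount of movement of the moving body, that is, of the camera.
[0054]
In (4C), the many frames making up the moving image are captured in time series. One of them is set as the reference frame. The reference frame may be regarded as the first frame of the group of frames captured from then on.
[0055]
In (4D), the image frames that follow the reference frame in time are captured one after another. These are called input frames. A large group of input frames thus follows the reference frame.
[0056]
In (4E), a mathematical transformation is applied to the input frame to make it match the reference frame as closely as possible. The values of the movement parameters obtained from this transformation can be taken as an estimate of the camera's movement.
[0057]
Because the camera moves with the moving body, the input frame is somewhat displaced from the reference frame. For example, when the camera moves to the right, the input frame moves to the left relative to the reference frame. The camera and the input frame thus move in opposite directions. Consequently, if the input frame is moved in the direction that brings it into agreement with the reference frame, the amount it is moved should equal the amount the camera moved.
[0058]
A suitable transformation for obtaining the movement parameters is the affine transformation. It performs translation, rotation, scaling, shear, and similar operations; the translation, rotation, and scaling parameters in particular make up the movement parameter group.
[0059]
In (4F), the translation, rotation, and scaling parameters are obtained as the movement parameter group, for example by affine transformation. In (4G), the amount of movement of the moving body, that is, of the camera, is estimated from the values of these parameters.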
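One common way to write a transformation restricted to translation, rotation, and scaling is given below. This is an assumed parameterization for illustration only; the description does not state an explicit formula. Here (x, y) is a pixel position in the input frame, (x', y') is its position after transformation toward the reference frame, s is the scaling factor, θ is the rotation angle, and (d_x, d_y) is the translation.

```latex
\begin{pmatrix} x' \\ y' \end{pmatrix}
= s
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
+
\begin{pmatrix} d_x \\ d_y \end{pmatrix}
```

With the y axis pointing downward, as is usual for image coordinates, a positive θ in this form appears as a clockwise rotation on screen. A general affine transformation would add shear and independent x and y scaling, which this four-parameter form leaves out.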
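[0060]
Fig. 5 is a concrete process diagram for deriving the amount of movement by transforming the moving image. (5A) shows a reference frame with a personal computer at its center. The x direction is to the right and the y direction is downward.
[0061]
(5B) shows an example input frame. In it the personal computer at the center has shifted slightly to the right. From the camera's standpoint, the camera moved to the left and, as a result, the subject moved to the right within the frame. The direction in which the camera moves is opposite to the direction in which the subject, that is, the frame content, moves.
[0062]
In (5C), an affine transformation is applied to the input frame so that it matches the reference frame. Since the amount of transformation needed is not known in advance, the Condensation algorithm, for example, is used to raise the degree of match through random approximation.
[0063]
(5D) shows the transformed image obtained by applying the affine transformation to the input frame. The position of the personal computer agrees closely with its position in the reference frame. Put simply, shifting the image in the input frame to the left yields the transformed image. Regions that fall outside the frame are discarded, and regions left without image data are filled in black.
[0064]
(5E) shows the movement parameter values obtained from the affine transformation. dx = -52 indicates that the input frame was moved 52 to the left, and this value is the actual amount the camera moved. dy = 13 indicates that the input frame was moved 13 downward and gives the camera's movement in the y direction.
[0065]
θ = 2.6 indicates that the input frame was rotated clockwise by 2.6 about the origin, and this value gives the camera's rotation. scale = 0.94 indicates that multiplying the input frame by 0.94 produced the transformed image, which means that the camera has moved slightly forward relative to the reference frame.
[0066]
Accordingly, as in (5F), these parameter values show that the moving body (person), that is, the camera, translated toward the lower left, rotated slightly to the right, and moved slightly forward. The amounts are the values given above; the movement of the moving body can thus be derived from the movement parameter group.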
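The reading of the Fig. 5 parameters can be sketched as below. This is a minimal illustration under stated assumptions: the transformation is modeled as rotation about the origin, then scaling, then translation; θ is taken to be in degrees (the text gives no unit); and the function names `warp_point` and `describe_camera_motion` are hypothetical.

```python
import math


def warp_point(x, y, dx, dy, theta_deg, scale):
    """Map an input-frame point into reference-frame coordinates.

    Assumed model: rotate about the origin, scale, then translate.
    """
    t = math.radians(theta_deg)
    xr = scale * (x * math.cos(t) - y * math.sin(t)) + dx
    yr = scale * (x * math.sin(t) + y * math.cos(t)) + dy
    return xr, yr


def describe_camera_motion(dx, dy, theta_deg, scale):
    """Interpret the parameters as in the Fig. 5 walkthrough (x right, y down)."""
    return {
        "horizontal": "left" if dx < 0 else "right" if dx > 0 else "none",
        "vertical": "down" if dy > 0 else "up" if dy < 0 else "none",
        "rotation": "clockwise" if theta_deg > 0
        else "counterclockwise" if theta_deg < 0 else "none",
        # scale < 1 means the input had to be shrunk, i.e. the camera moved closer.
        "depth": "forward" if scale < 1 else "backward" if scale > 1 else "none",
    }


params = dict(dx=-52, dy=13, theta_deg=2.6, scale=0.94)
motion = describe_camera_motion(**params)
origin_in_reference = warp_point(0, 0, **params)
```

The origin is moved only by the translation, which is why `origin_in_reference` equals the (dx, dy) pair; other points also pick up the rotation and scaling.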
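[0067]
This result, however, does not settle whether the moving body is seated (stationary) and merely shifting slightly, or is walking (moving). The method of judging whether the moving body is stationary or moving is described next.
[0068]
Fig. 6 is a flowchart giving the criterion for judging whether the moving body (person) is moving or stationary. When the person is walking, the movement parameters grow in one direction and the background of the input frame becomes completely different from that of the reference frame. In that case a new input frame must be set as the reference frame, updating it.
[0069]
When the reference frame has to be updated again and again in this way, the person is considered to be walking. That is, moving (walking) and stationary states are distinguished by the update frequency (update rate) of the reference frame: how many frames pass between updates. The threshold rate is called the reference update rate; the state is judged to be moving when the reference-frame update rate exceeds it and stationary when it is lower. Each step is described below.
[0070]
In step n1, the first frame of the input frame sequence is set as the reference frame. In step n2, subsequent image frames are captured continuously as input frames. In step n3, each input frame is transformed to match the reference frame as closely as possible.
[0071]
In step n4, the movement parameter group is derived, for example by affine transformation, and the person's amount of movement is estimated (step n5). In step n6, the amount of movement is compared with the reference movement amount; if it is smaller, the state is judged to be stationary (step n7) and processing returns to step n2.
[0072]
When the amount of movement exceeds the reference movement amount, that input frame is set as the new reference frame (step n8) and the update rate of the reference frame is calculated (step n9). This update rate is compared with the reference update rate (step n10); if it is higher, the person is judged to be moving (step n11). If it is lower, the person is judged to be stationary (step n12) and processing returns to step n2.
[0073]
In this way, while input frames are captured continuously, the update rate (update frequency) of the reference frame is calculated, and whether the person (moving body) is moving or stationary is judged reliably and quantitatively.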
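[0074]
Fig. 7 is a concrete process diagram for judging whether the moving body (person) is moving or stationary. (7A) to (7E) show the input frame sequence, and (7a) to (7e) show the sequence of transformed images obtained from (7A) to (7E) by affine transformation. The arrows indicate the direction of time.
[0075]
The reference movement amount and reference update rate can be set freely to suit the situation. In this example the reference movement amount is set to dx = 20. The reference update rate is defined as the reference frame being updated once within three frames, with such updates occurring twice in succession.
[0076]
(7A) is set as the reference frame, and the input frames are captured one after another. The input frames show that the camera is moving toward the upper right. Transforming (7B) gives (7b), for which dx = 15, within the reference movement amount.
[0077]
Transforming (7C) gives (7c), for which dx = 21, exceeding the reference movement amount of dx = 20. It is therefore judged to be outside the reference movement amount, and (7C) becomes the new reference frame. This is the first update of the reference frame.
[0078]
With (7C) now the reference frame, the affine transformation of (7D) gives dx = 10, within the reference movement amount. Added to the dx = 21 of (7c), this gives dx = 31 for (7d). Over (7b) to (7d) the update rate is within the reference update rate, so the person, that is, the camera, is judged to be stationary.
[0079]
Next, the affine transformation of (7E) gives (7e) with dx = 13. Taking (7C) as the dx = 0 reference, dx = 23, which exceeds the reference movement amount of dx = 20. (7E) therefore becomes the new reference frame.
[0080]
At (7e) the reference frame has been updated twice in succession, so the update rate has exceeded the reference update rate, and the person, that is, the camera, is judged to be moving.
[0081]
The following conclusions can be drawn. Over (7b) to (7d), dx increases continuously although the state is stationary; this means that the person wearing the camera bent to the right. Only at (7e) is the person judged to be moving (walking) to the right.
[0082]
In this way, whether the person is moving or stationary is judged by whether the reference update rate is exceeded, and on that basis the person's behavior is determined from the changes in the movement parameter values.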
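The judgment loop of Fig. 6, applied to the numbers of Fig. 7, can be sketched as follows. This is one possible reading, assuming that each displacement is measured against the current reference frame and that the reference update rate means "an update within three frames, twice in a row"; the function name `classify_states` and its parameters are hypothetical.

```python
def classify_states(dx_vs_reference, ref_movement=20, window=3, needed_updates=2):
    """Return 'stationary' or 'moving' for each input frame.

    dx_vs_reference[i] is the displacement of the i-th input frame measured
    against whichever frame is the reference at that moment.
    """
    states = []
    frames_since_update = 0   # input frames seen since the reference was last set
    quick_updates = 0         # consecutive updates that each came within `window` frames
    for dx in dx_vs_reference:
        frames_since_update += 1
        if abs(dx) > ref_movement:
            # Steps n8 and n9: this frame becomes the reference; count the update.
            quick_updates = quick_updates + 1 if frames_since_update <= window else 0
            frames_since_update = 0
        elif frames_since_update > window:
            # No update within the window: the run of quick updates is broken.
            quick_updates = 0
        # Steps n10 to n12: compare the update run with the reference update rate.
        states.append("moving" if quick_updates >= needed_updates else "stationary")
    return states


# Displacements for frames 7B, 7C, 7D, 7E relative to the current reference.
fig7 = [15, 21, 10, 23]
states = classify_states(fig7)
```

With these inputs the reference frame is replaced at 7C and again at 7E, and only the second replacement tips the judgment to "moving", matching the walkthrough.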
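[0083]
Fig. 8 is a process diagram for estimating the hand motion of the person (moving body) wearing the camera. As a simple example, a stationary person operating the keyboard of a personal computer is analyzed.
[0084]
In (8A), the camera is fixed to the person. In (8B), the first frame of the input frame sequence is set as the reference frame. In (8C), subsequent input frames are transformed one after another to the state of the reference frame. In (8D), the update rate (update frequency) of the reference frame is confirmed to be below the reference update rate, and the person is judged to be stationary.
[0085]
In (8E), the transformed image is regarded as the sum of a background region and a hand region, and the hand region is separated and extracted from the background. Extraction takes place in two stages. Skin color is chosen as the first indicator, and skin-colored regions are extracted from the transformed image. Skin-colored regions may exist other than the hands, and the first indicator extracts those as well.
[0086]
As the second indicator, the movement of the hand is predicted with an equation of motion, and a skin-colored region is judged to be a hand region if it has moved to the predicted position. Linear prediction or a Kalman filter, for example, is used as the equation of motion. Skin-colored regions that do not move are eliminated at this second stage.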
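The second indicator can be illustrated with a simple constant-velocity linear prediction, used here as a stand-in for the equation of motion; a Kalman filter would be a more robust alternative. The function names and the thresholds `gate` and `min_motion` (in pixels) are assumptions made for this sketch, not values from the patent.

```python
import math


def predict_next(p_prev, p_curr):
    """Constant-velocity prediction: continue the last displacement one more step."""
    return (2 * p_curr[0] - p_prev[0], 2 * p_curr[1] - p_prev[1])


def is_hand_candidate(p_prev, p_curr, p_observed, gate=15.0, min_motion=2.0):
    """Accept a skin-colored region as a hand if it moves and lands near the prediction.

    p_prev, p_curr: region centroids in the two previous transformed images.
    p_observed:     centroid of the candidate region in the newest image.
    """
    moved = math.hypot(p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]) >= min_motion
    px, py = predict_next(p_prev, p_curr)
    near = math.hypot(p_observed[0] - px, p_observed[1] - py) <= gate
    return moved and near


moving_region = is_hand_candidate((100, 100), (110, 100), (121, 101))
static_region = is_hand_candidate((50, 50), (50, 50), (50, 50))
jumping_region = is_hand_candidate((100, 100), (110, 100), (200, 100))
```

The static region is rejected because it does not move, and the jumping region is rejected because it lands far from the predicted position.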
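[0087]
When a specific region is extracted using several indicators in this way, the Dempster-Shafer method, for example, can be used. Given the confidence of the first indicator and that of the second, this method derives an overall confidence, from which the reliability of the extracted hand region is calculated.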
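Dempster's rule of combination for two indicators can be sketched as below. This is a minimal illustration over a two-hypothesis frame, "hand" versus "other", with "either" standing for uncommitted belief; the mass values are invented for the example and are not taken from the patent.

```python
def combine(m1, m2):
    """Combine two mass assignments over {'hand', 'other', 'either'} by Dempster's rule."""

    def meet(a, b):
        # Intersection of focal sets; 'either' is the whole frame.
        if a == "either":
            return b
        if b == "either":
            return a
        return a if a == b else None   # None marks an empty intersection (conflict)

    combined = {"hand": 0.0, "other": 0.0, "either": 0.0}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            c = meet(a, b)
            if c is None:
                conflict += wa * wb
            else:
                combined[c] += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Illustrative confidences from the skin-color and motion indicators.
color = {"hand": 0.6, "other": 0.1, "either": 0.3}
motion = {"hand": 0.7, "other": 0.1, "either": 0.2}
fused = combine(color, motion)
```

When both indicators lean toward "hand", the fused belief in "hand" is higher than either indicator gives alone, which is the behavior the text relies on.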
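[0088]
In (8F), after the hand region has been extracted, it is colored in the transformed image to form the post-extraction image. In (8G), if colored points are scattered as noise outside the hand region, this noise must be removed. Here a median filter operation is used to remove the scattered noise.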
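A 3x3 median filter on a binary mask can be sketched as follows. For 0/1 values the median reduces to a majority vote over the window, which is how it is written here; the window size and the border handling (the window is clipped at the image edge) are assumptions of this sketch.

```python
def median_filter3(mask):
    """Apply a 3x3 median (majority) filter to a binary mask given as a list of rows."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            window = [
                mask[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            # Keep the pixel only if more than half of the window is set.
            out[y][x] = 1 if 2 * sum(window) > len(window) else 0
    return out


# A 4x4 colored block (the hand region) plus one isolated noise pixel.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
cleaned = median_filter3(mask)
```

The isolated pixel disappears while the interior of the block survives; the block's corners are rounded off, which is a normal side effect of median filtering.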
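[0089]
In (8H), the movement of the colored hand region is detected from the sequence of post-clustering images. Reading this movement gives an estimate of the hand's motion, such as movement to the left or right or up or down.
[0090]
Fig. 9 is a concrete process diagram for extracting the hand region. The original image is an example of a transformed image. Three methods of extracting the hand region are shown. The top image is obtained by removing from the original image the background image, which has a specific background coloring, leaving the hand region. The middle image is obtained by extracting only the skin-colored regions. Both methods can be seen to extract the hand region as a whole.
[0091]
The bottom image replaces each skin-colored region with an elliptical region and judges whether the ellipse moves from frame to frame. A hand is naturally expected to move, and whether the elliptical region moves to the position predicted by the equation of motion provides a more advanced test for the hand region. "DS theory" means Dempster-Shafer theory.
[0092]
Fig. 10 is a concrete process diagram for estimating hand motion. Clustering is applied to the post-extraction image to remove noise, giving the post-clustering image. Placing four of these images side by side reveals the motion of the colored hand region in detail.
[0093]
The left-hand region moves from right to left, so the text information is "moved the left hand to the left." To estimate the actions of the person wearing the camera, at least part of the person must be captured by the camera, and that part is most likely to be a hand. The person's actions are therefore estimated by focusing on the hands.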
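[0094]
Fig. 11 is a process diagram for estimating the actions of another person photographed by the camera. The other person is extracted by focusing on his or her face. In (11A), the camera is mounted on the moving body. In (11B), the first frame of the input frame sequence is set as the reference frame. In (11C), the subsequent input frames are transformed to match the reference frame as closely as possible.
[0095]
In (11D), the face region is extracted from the transformed image sequence. The transformed image is separated into a background region and a face region. Two criteria are used to extract the face region of the photographed person: a skin-colored region and an elliptical region.
[0096]
Extracting skin-colored regions from the transformed image yields several regions, including face and hand regions. An elliptical shape is therefore introduced as the second criterion, with the result that only the face region is extracted. The Dempster-Shafer method is used here.
[0097]
In (11E), the extracted face region is colored, and the colored face region is merged into the original transformed image to form the post-extraction image. In (11F), clustering is performed to remove noise, giving the image after the median filter operation.
[0098]
In (11G), the post-clustering images are placed side by side and compared to analyze the movement of the face region and estimate the actions of the photographed person. Although this example analyzes the behavior of a photographed person, any target object photographed by the camera, such as a car or bicycle, can be the subject of behavior analysis.
[0099]
Fig. 12 is a process diagram for recognizing and identifying an object photographed by the camera. In (12A), the camera is mounted on the moving body. In (12B), the first frame of the input frame sequence is set as the reference frame. In (12C), subsequent input frames are transformed to the reference frame.
[0100]
In (12D), the target object is extracted from the transformed image. The transformed image is regarded as the sum of a background region forming the large background, a skin-colored region belonging to the person, and the region of the target object of interest. Removing the regions with high background probability and those with high skin-color probability therefore leaves only the target object region.
[0101]
In (12E), the extracted target object is compared with a large number of stored template models. Color, shape, and other properties are compared, the closest template model is selected, and the target object is judged to be the selected template model. This is how the target object is recognized.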
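The template comparison in (12E) can be sketched as a nearest-neighbor search over simple feature vectors. The features used here (a normalized hue and a height-to-width ratio) and every value in `TEMPLATES` are hypothetical, chosen only to illustrate selecting the closest stored model by color and shape.

```python
import math

# Hypothetical stored templates: name -> (normalized hue, height/width ratio).
TEMPLATES = {
    "cup": (0.10, 1.2),
    "watch": (0.60, 1.0),
    "book": (0.35, 1.4),
}


def identify(features, templates=TEMPLATES):
    """Return the name of the closest template and its Euclidean distance."""
    best = min(templates, key=lambda name: math.dist(features, templates[name]))
    return best, math.dist(features, templates[best])


# Features measured from the extracted object region (illustrative values).
name, distance = identify((0.12, 1.25))
```

In practice a distance threshold would also be needed so that an object resembling none of the templates is reported as unknown instead of being forced onto the nearest one.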
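[0102]
Fig. 13 is a concrete process diagram showing the method of recognizing an object photographed by the camera. From the transformed image at the top, the region where both the background probability and the skin-color probability are low is extracted as the object region. As a result, the target object held in the hand is extracted.
[0103]
This target object is compared with a large number of template models. Among them the cup, which has the highest match probability, is selected, and at this stage the target object is judged to be a cup. The invention can thus also determine what a photographed object is.
[0104]
Fig. 14 is a hierarchical diagram for recognizing the behavior of the person wearing the camera and expressing it as text. First, whether the person is moving or stationary is judged from the update rate (update frequency) of the reference frame: the person is judged to be moving if the update rate exceeds the reference update rate and stationary if it does not.
[0105]
When the person is moving, the behavior is recognized from the movement parameter values. For example, dx > 0 is judged as "turned right," dx < 0 as "turned left," dy < 0 as "stood up," dy > 0 as "sat down," scale > 1 as "moved forward," and scale < 1 as "moved backward." These judgments change with the choice of coordinate axes.
[0106]
When the person is stationary and neither a hand nor an object is extracted, the person's body movement is recognized from the movement parameter values. For example, dx > 0 is judged as "looked right," dx < 0 as "looked left," dy < 0 as "looked up," and dy > 0 as "looked down." These judgments change with the choice of coordinate axes.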
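The rule table for the moving state and for the stationary state without a hand or object can be sketched as below. The sign conventions follow the two paragraphs above exactly as written, and the text itself notes that they depend on the choice of axes; the dead-band thresholds `dead_band` and `scale_band` are assumptions added so that tiny parameter values produce no phrase.

```python
RULES = {
    "moving": {
        "dx+": "turned right", "dx-": "turned left",
        "dy-": "stood up", "dy+": "sat down",
        "scale+": "moved forward", "scale-": "moved backward",
    },
    "stationary": {
        "dx+": "looked right", "dx-": "looked left",
        "dy-": "looked up", "dy+": "looked down",
    },
}


def describe(state, dx=0.0, dy=0.0, scale=1.0, dead_band=5.0, scale_band=0.02):
    """Map movement parameters to the phrases listed for the given state."""
    rules = RULES[state]   # raises KeyError for an unknown state
    phrases = []
    if dx > dead_band:
        phrases.append(rules["dx+"])
    elif dx < -dead_band:
        phrases.append(rules["dx-"])
    if dy < -dead_band:
        phrases.append(rules["dy-"])
    elif dy > dead_band:
        phrases.append(rules["dy+"])
    if "scale+" in rules:
        if scale > 1 + scale_band:
            phrases.append(rules["scale+"])
        elif scale < 1 - scale_band:
            phrases.append(rules["scale-"])
    return phrases


walking_left = describe("moving", dx=-30)
glance_up = describe("stationary", dy=-12)
```

The resulting phrases would then be placed in the PRED slot of an action representation before being rendered as a sentence.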
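[0107]
When the person is stationary and a hand is extracted, the hand's action is judged from the movement of the hand in the transformed image. Examples are "raised the right hand," "lowered the right hand," "raised the left hand," and "lowered the left hand."
[0108]
When the person is stationary, a hand is extracted, and a held cup is recognized, more detailed actions are recognized from the movement of the hand in the transformed image. Examples are "drank with the right hand," "drank with the left hand," "held it in the right hand," and "held it in the left hand."
[0109]
When the person is stationary, a hand is extracted, and a book is recognized in contact with the hand, detailed actions such as the following can be recognized from the movement of the hand in the transformed image: "read the book," "turned a page," "opened the book," and "closed the book."
[0110]
In this way, if a person's actions are recognized from the moving image obtained by the camera and expressed as text by case grammar, the image representation is converted into a text representation. This conversion sharply reduces memory and communication capacity, lowers the cost of storage and communication devices, and at the same time achieves a dramatic improvement in communication speed.
[0111]
In particular, if attention is focused on an individual, his or her actions are expressed as text, and the text is stored according to predetermined rules, a behavior database can be created for each person. With this database, several persons can be managed individually.
[0112]
Fig. 15 is a diagram of a behavior experiment in which a person wearing the camera stands up and walks around a laboratory. The person walks from position 1 to position 7 following the arrows. The camera's moving image was analyzed by computer and converted to text, and the textual description was checked against the actual behavior.
[0113]
The judgments were: at position 1, "takes the cup from the desk" and "stands up from the chair and starts walking"; at position 2, "turns right"; at position 3, "turns right"; at position 4, "turns right"; at position 5, "turns right"; at position 6, "turns left"; and at position 7, "turns left and sits on the chair" and "puts the cup on the desk and drinks." The text representation was confirmed to agree with the actual behavior.
[0114]
The present invention is not limited to the embodiments above; it goes without saying that various modifications and design changes that do not depart from its technical idea fall within its technical scope.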
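[0115]
[Effects of the invention]
According to the first invention, a camera is mounted on a moving body such as a person, animal, or vehicle and moves together with it, and an apparatus is provided that transmits, as a radio signal, moving images of the wide range of objects photographed by the camera (including the moving body itself) and analyzes the moving images to convert them into text. Because the image information is transmitted by radio signal, the moving body can be managed properly whether it moves indoors or outdoors. Because image information is converted into text information, storage and communication capacity are reduced, costs are lowered, and communication speed is dramatically improved. Moreover, if the camera is worn by the person to be managed, that person can be managed in real time wherever he or she is, from information such as the hands, and accumulating the text information automatically creates a behavior database for each individual.
[0116]
According to the second invention, if a camera-equipped mobile phone is used as the camera and radio, behavior analysis can easily be performed over an existing network. By going through a network such as the Internet, moving images can be analyzed in real time regardless of how near or far the person is, and the actions can be converted into text and sent to the management center. Since it differs from the first invention only in going through a network, it has the same operation and effects as the first invention.
[0117]
According to the third invention, the camera, the behavior analysis device, and the radio are mounted as one unit on a moving body such as a person, an animal, an object, or a vehicle. Building the behavior analysis device from an ultra-compact computer makes the whole apparatus compact, so that it can, for example, be worn by a person as he or she moves about, and each individual can be managed separately. Moreover, because the radio transmits text information, the amount of data is small; the text can be sent to the required site via a network or transmitted directly to a management center or the like.
[0118]
According to the fourth invention, if, for example, an affine transformation is applied to the input frame to bring it back to the reference frame, the translation in the X, Y, and Z directions, the rotation, and the scaling factor can be derived as movement parameters. Because the input frame moves in the direction opposite to the camera, the parameter values obtained from this restoring transformation give the amount of movement in the direction in which the camera moved together with the moving body. Besides the affine transformation, a wide range of mathematical transformations from which camera movement can be estimated may be used.
[0119]
According to the fifth invention, taking a person as the moving body, the input frame departs from the reference frame even when the person, while seated, stretches up and down, bends to the left or right, or rocks back and forth. In these cases the movement parameters do not become very large. When the person walks, however, he or she moves steadily in the X or Y direction, so the movement parameters are expected to grow in one direction. This invention judges whether the person is stationary or moving from the rate at which the reference frame is updated, that is, the update frequency (also called the update rate). When the reference frame is updated repeatedly in one direction, the person can be judged to be walking (moving); when the update frequency is low, the person can be judged to be stationary, either seated or standing.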
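[0120]
According to the sixth invention, in the transformed images obtained by transforming the input frames to the reference frame, the background region, which occupies most of the image, is common to all of them when the person (moving body) is stationary. It is therefore possible to focus, within this common background, on the movement of a specific target object such as another person's head, the person's own hand, or a cup held by the person, and to judge the behavior of the subject (the person or the other person) from the motion of that object.
[0121]
According to the seventh invention, when a camera is worn by a person and one or both hands move in front of the camera, the hands naturally become subjects too. In the transformed image, hands are first selected as skin-colored regions and are then confirmed as hands from their motion information. For example, solving an equation of motion for the hand and checking whether a hand is found at the predicted position provides supporting evidence that the region is a hand. The hand is recognized when skin-color information and motion information agree; if the hand portion is then colored in the transformed image, the behavior of the person, the moving body, can be recognized from the movement of the hand.
[0122]
According to the eighth invention, in the transformed image, the other person's face is recognized when skin-color information and contour information, for example an elliptical shape, agree. Once the face has been identified from both, it is displayed in color, and what the other person is doing can be judged from the movement of the face.
[0123]
According to the ninth invention, when, for example, the person, the moving body, is holding a cup, the motion of the hand can be judged by the method described above. When the hand is holding something, the shapes of one or more concrete objects, such as a cup, a watch, or a book, are stored in memory in advance, and the object held in the hand is compared with these template models to identify it. In other words, if the object is recognized as a cup at the same time as the hand movement is judged, the action can be recognized as the person being about to drink from the cup. Combining the person's motion with object recognition in this way allows more advanced recognition of the subject's behavior.
[0124]
According to the tenth invention, once the behavior of the subject has been recognized, it can be expressed concisely as text using case grammar or the like; the amount of information is reduced from image information to text information, and storing and communicating text sharply reduces storage and communication costs. In addition, because the behavior of the person, the moving body, can be converted into text, behavior records can be compiled into a text database; a behavior database for a particular person can be built, and safety management of people in hospitals, nursing homes, schools, and the like can be carried out efficiently.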
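[Brief description of the drawings]
[Fig. 1] A schematic configuration diagram of the first embodiment of the subject behavior recognition apparatus according to the present invention.
[Fig. 2] A schematic configuration diagram of the second embodiment of the subject behavior recognition apparatus according to the present invention.
[Fig. 3] A schematic configuration diagram of the third embodiment of the subject behavior recognition apparatus according to the present invention.
[Fig. 4] A process diagram for estimating the amount of movement of a moving body wearing a camera by transforming the moving image.
[Fig. 5] A concrete process diagram for deriving the amount of movement by transforming the moving image.
[Fig. 6] A flowchart giving the criterion for judging whether the moving body (person) is moving or stationary.
[Fig. 7] A concrete process diagram for judging whether the moving body (person) is moving or stationary.
[Fig. 8] A process diagram for estimating the hand motion of the person (moving body) wearing the camera.
[Fig. 9] A concrete process diagram for extracting the hand region. The original image is an example of a transformed image.
[Fig. 10] A concrete process diagram for estimating hand motion.
[Fig. 11] A process diagram for estimating the actions of another person photographed by the camera.
[Fig. 12] A process diagram for recognizing and identifying an object photographed by the camera.
[Fig. 13] A concrete process diagram showing the method of recognizing an object photographed by the camera.
[Fig. 14] A hierarchical diagram for recognizing the behavior of the person wearing the camera and expressing it as text.
[Fig. 15] A diagram of a behavior experiment in which a person wearing the camera stands up and walks around a laboratory.
[Explanation of reference numerals]
2: moving body (person); 4: microphone; 6: camera; 8: target world; 10: imaging region; 12: hand; 13: subject; 14: target object; 16: radio signal; 18: network; 19: input signal; 20: behavior analysis device; 21: receiving antenna; 22: input unit; 24: image analysis unit; 26: audio analysis unit; 28: text generation unit; 30: text database; 31: text signal; 32: communication unit; 34: management center.
[0001]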
[Technical field of the invention]
The present invention relates to a method of recognizing the behavior of a subject using a camera. More specifically, it relates to a subject behavior recognition method and apparatus in which a camera is mounted on a moving body such as a person, an animal, an object, or a vehicle, and the moving image captured by the camera is analyzed to recognize the behavior of the moving body or of a target object.
[0002]
[Prior art]
Unmanned notification systems that manage the behavior of care recipients in hospitals and similar facilities, or that monitor suspicious persons entering a building from outside, are known. A typical system of this kind arranges infrared sensors, detects the infrared radiation emitted by a person, and either reports the intrusion or displays an infrared image of the intruder.
[0003]
Such an infrared notification system can report only a single, limited kind of information, namely infrared radiation, and it obtains information only for the specific position and direction in which the sensor is installed. It cannot, for example, manage the behavior of a person who is outside the sensor's range. A further drawback is that an administrator must watch continuously in order to check the infrared image of an intruder.
[0004]
Systems that fix a video camera at a required position in a building and record during unattended hours are also in common use. No one needs to be present during recording, but the video must be played back to determine whether anything abnormal occurred. Because the blind spots of a video camera cannot be monitored, cameras have to be installed at many locations, which makes the system expensive. Monitoring from a remote location also requires transmitting the video, so the volume of transmitted data is large and the communication cost excessive.
[0005]
[Problems to be solved by the invention]
To reduce the amount of transmitted information, the present inventors disclosed an "unmanned notification system using text information" in Japanese Unexamined Patent Publication No. H10-40482. In that system, a video camera and a microphone are fixed inside a building, a specific person is filmed continuously, the resulting video and audio information is converted into text information according to case grammar, and the text is sent to an administrator, who uses it to observe the person's behavior.
[0006]
Because that invention converts video and audio information, which carry an enormous amount of data, into text information, which needs very little, the communication cost of sending it to the administrator is low; and because only text is recorded, an inexpensive storage device suffices.
[0007]
However, since the video camera and microphone are fixed at specific locations in the building, that invention too can film only those locations, and only in specific directions. The camera's blind spots are large, and the behavior of a person inside a blind spot is entirely unknown.
[0008]
In particular, a person who moves far from the video camera can no longer be filmed. Cameras must therefore be placed at many locations in the building, and constructing the video system becomes very expensive.
[0009]
Management by a video system is needed not only for intrusion monitoring but also for managing the behavior of care recipients in hospitals, nursing homes, and the like. Suppose a video system were completed, regardless of cost, so that no blind spots remained inside the building. Even then, once a care recipient leaves the building, he or she is outside the area the system covers, and behavior management becomes impossible.
[0010]
Thus, in conventional systems with fixed video cameras, the area in which the behavior of care recipients can be managed is limited to the interior of the building. Moreover, to manage several care recipients, the administrator has no choice but to tell them apart by eye, and cannot escape the burden of watching the video continuously.
[0011]
Combining this fixed video camera system with a text conversion system, and adding an alarm function to the text notification function, does have the advantages that excessive storage capacity is no longer needed and that constant observation becomes unnecessary when there is only one care recipient. Even so, the behavior of several care recipients cannot be managed individually, and the administrator still bears the heavy burden of constant visual observation.
[0012]
Accordingly, the present invention abandons the practice of fixing the video camera to the building, which had persisted like a spell, and adopts an entirely new point of view. Its object is to provide a subject behavior recognition method and apparatus that enable continuous observation of a person's behavior both inside and outside a building and that can manage several persons individually. A further object is that, by adding a text conversion system, storage and communication capacity can be sharply reduced and, by adding an alarm function, a report is issued only in an emergency so that it can be dealt with without constant visual observation.
[0013]
[Means for solving the problems]
The present invention was made to solve the above problems. The first invention is a subject behavior recognition apparatus comprising: a camera mounted on a moving body; a radio that transmits, as a radio signal, a moving image of a subject consisting of a target object photographed by the camera or a part of the moving body; and a behavior analysis device that receives the radio signal, the behavior analysis device having an image analysis unit that processes the moving image to analyze the behavior of the subject and a text generation unit that outputs the analyzed behavior as text information. The invention is characterized in that a camera, such as a video camera or a camera built into a mobile phone, is mounted on a moving body such as a person, an animal, or a vehicle, so that the camera moves together with the moving body. The inventors call this camera a wearable camera (accompanying camera). Various objects in the target world are photographed, and when the moving body is a person, the person's hands are photographed as well. The care recipient's actions can therefore be recognized at all times from the movement of the hands, and the environment in which he or she is acting can be grasped at all times from the image of the target world. If a camera and a radio are attached to each of several care recipients, each moving image is received separately, so several persons can be managed individually and simultaneously. Since the only requirement is that the radio signal carrying the moving image be received, a configuration in which a radio transceiver receives the signal directly can be used when the person acts near the management center, and a network such as the Internet or a mobile phone system can be used when the person acts at a remote location.
Furthermore, because the moving image signals are analyzed and output as text information, storage and communication capacity can be reduced; and if an alarm device is added to the text information, abnormal behavior of the moving body can be recognized in real time and dealt with immediately, without anyone watching the images continuously. In addition, because text information is generated in correspondence with the moving image, text reflecting the characteristics of the person wearing the camera accumulates, a text database tailored to each individual can be built, and management information on persons can be systematized.
[0014]
A second invention is a subject behavior recognition apparatus comprising a camera mounted on a moving body, a wireless device that transmits, as a wireless signal, a moving image of a subject consisting of an object photographed by the camera or a part of the moving body, and a behavior analysis device that receives the wireless signal transmitted from the wireless device via a network, wherein the behavior analysis device has an image analysis unit that processes the moving image to analyze the behavior of the subject and a text generation unit that outputs the analyzed behavior as text information. This invention differs from the first invention only in that moving image signals are transmitted and received via a network, and it therefore has the same operational effects as the first invention. In particular, because the wireless signals pass through a network, moving image signals can be transmitted and received instantaneously over a wide area network or a local area network even when a moving body such as a person moves to a remote location, so that behavior management of persons and the like can be established over a wide area.
[0015]
A third invention is a subject behavior recognition apparatus comprising a camera mounted on a moving body, and a behavior analysis device and a wireless device attached to the camera, wherein the behavior analysis device has an image analysis unit that processes a moving image of a subject consisting of an object photographed by the camera or a part of the moving body to analyze the behavior of the subject, and a text generation unit that outputs the behavior analyzed by the image analysis unit as text information, and the text information is transmitted wirelessly to a required site. A feature of this invention is that the camera, the behavior analysis device, and the wireless device are mounted integrally on a moving body such as a person, an animal, a thing, or a car. If the behavior analysis device is built from a microcomputer, the entire apparatus can be made compact, so that, for example, it can be attached to each person whose behavior is to be managed individually. In addition, since only text information is transmitted by the wireless device, the amount of information is small, and the text can be sent to a required site via a network or transmitted directly to a management center or the like.
[0016]
A fourth invention is a subject behavior recognition method in which a moving image is captured by photographing a target world with a camera mounted on a moving body, one of the time-series image frames constituting the moving image is set as a reference frame, an image frame temporally subsequent to the reference frame is set as an input frame, the input frame is subjected to a conversion process that brings it as close as possible to the reference frame, movement parameters indicating how far the input frame has moved from the reference frame are derived from the conversion process, and the amount of movement of the moving body is estimated from the movement parameters. If, for example, an affine transformation is applied to the input frame to return it to the reference frame, the translation amounts in the XYZ directions, the rotation amount, and the enlargement/reduction ratio can be derived as movement parameters. Since the input frame and the camera move in opposite directions, the movement parameter values obtained by this return process give the amount by which the camera has moved together with the moving body. Besides the affine transformation, any mathematical transformation from which the camera movement can be estimated may be used.
[0017]
A fifth invention is a subject behavior recognition method in which a moving image is captured by photographing a target world with a camera mounted on a moving body, one of the time-series image frames constituting the moving image is set as a reference frame, the image frames temporally subsequent to the reference frame are set as an input frame sequence, the input frame sequence is subjected to a conversion process to form a converted image sequence that is as close as possible to the reference frame, movement parameters indicating how far each input frame has moved from the reference frame are derived, the input frame is reset as the reference frame when the movement amount of the moving body estimated from the movement parameters exceeds a reference movement amount, and this operation is repeated so that the behavior of the moving body is determined from the update frequency (update rate) of the reference frame. Taking a person as the moving body, the input frame departs from the reference frame even while the person is seated, for instance when stretching up and down, bending left and right, or rocking back and forth, but the movement parameters obtained by the conversion process do not become very large. When the person walks, however, he or she moves steadily in the X or Y direction, so the movement parameter grows in one direction. According to this invention, whether a person is stationary or moving is determined from the rate at which the reference frame is updated, that is, the update frequency (or update rate). If the reference frame is updated frequently in one direction, the person is judged to be walking (moving). If the update frequency is low, the person is judged to be sitting or standing, that is, stationary.
[0018]
A sixth invention is a subject behavior recognition method in which a moving image is captured by photographing a target world with a camera mounted on a moving body, one of the time-series image frames constituting the moving image is set as a reference frame, the image frames temporally subsequent to the reference frame are set as an input frame sequence, the input frame sequence is subjected to a conversion process to form a converted image sequence that is as close as possible to the reference frame, attention is focused on a specific region that indicates a specific object within the background region of each converted image, and the behavior of the specific object is determined from the motion of the specific region across the converted image sequence. In the converted images obtained by converting input frames toward the reference frame, the background region, which occupies most of each converted image, is common to all of them while the person (moving body) is stationary. Against this common background, attention is paid to the motion of a specific object such as another person's head, the wearer's hand, or a cup held by the wearer, and the behavior of the wearer or of the other person can be determined from that motion.
[0019]
In a seventh invention, the moving body is a person and the specific object is that person's hand. The hand region is extracted as the specific region from at least skin color information and motion information, and the behavior of the person is determined from the motion of the hand region. When a camera is attached to a person and one or both of the person's hands move in front of the camera, the hand naturally becomes a subject. In the converted image, hand candidates are selected from skin-colored regions, and the hand is reliably confirmed from its motion information. For example, if the hand motion is solved with an equation of motion and the region is found at the predicted position, this confirms that the region is a hand. The hand is thus recognized from the consistency of the skin color information and the motion information, and once the hand is displayed in color in the converted image, its motion makes it possible to recognize what motion or action the wearer is performing.
[0020]
In an eighth invention, the specific object is the face of another person photographed by the camera. The face region is extracted as the specific region from at least skin color information and contour information, and the behavior of the other person is determined from the motion of the face region. In the converted image, the face of another person is recognized from the consistency between the skin color information and contour information such as an ellipse. When both agree that the region is a face, the face is displayed in color, and what the other person is doing can be determined from the motion of the face.
[0021]
A ninth invention is a subject behavior recognition method in which a moving image is captured by photographing a target world with a camera mounted on a moving body, one of the time-series image frames constituting the moving image is set as a reference frame, the image frames temporally subsequent to the reference frame are set as an input frame sequence, the input frame sequence is subjected to a conversion process to form a converted image sequence that is as close as possible to the reference frame, attention is focused on a specific region that indicates a specific object within the background region of each converted image, and the image of the specific region is compared with stored template models to identify the specific object concretely. For example, when the person who is the moving body is holding a cup, the motion of the person's hand can be determined by the method described above. When the hand is holding something, one or more concrete objects such as a cup, a clock, or a book are stored in memory as template models, and the object in the hand is compared with these template models and identified. In other words, when the hand motion has been determined and the object has been recognized as a cup, it can be recognized that the person is about to drink from the cup. By combining the motion of the person with the recognition of the object in this way, the behavior of the subject can be recognized at a higher level.
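The comparison with stored template models can be pictured with a small sketch. The specification does not state which matching measure is used. The sum of squared differences below, the 2x2 grayscale patches, and the names `TEMPLATES` and `identify_object` are assumptions chosen only for illustration.

```python
# Hypothetical template models: tiny grayscale patches keyed by object name.
TEMPLATES = {
    "cup":   [[200, 200], [50, 200]],
    "clock": [[30, 220], [220, 30]],
    "book":  [[90, 90], [90, 90]],
}


def squared_difference(patch, template):
    """Sum of squared pixel differences between two equally sized patches."""
    return sum(
        (p - t) ** 2
        for patch_row, template_row in zip(patch, template)
        for p, t in zip(patch_row, template_row)
    )


def identify_object(patch, templates=TEMPLATES):
    """Return the name of the stored template closest to the observed patch."""
    return min(templates, key=lambda name: squared_difference(patch, templates[name]))


# Made-up pixel values standing in for the specific region held in the hand.
held_region = [[190, 205], [60, 195]]
print(identify_object(held_region))
```

A practical system would compare larger, normalized image regions, but the selection step (pick the stored model with the smallest mismatch) has the same shape.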
[0022]
A tenth invention is a subject behavior recognition method in which the behavior of the subject is converted into text information. Once the behavior of the subject has been recognized, it can be converted into simple text using case grammar or the like. Reducing image information to text information shrinks the amount of information, and storing and communicating text sharply reduces the cost of storage devices and of communication. In addition, since the behavior of a moving person can be converted into text information, a text database can be created as an activity record, and a behavior database for a specific person can be built and used in hospitals, nursing homes, schools, and the like, so that safety management can be performed efficiently.
[0023]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, an embodiment of a method and an apparatus for recognizing a behavior of a subject according to the present invention will be described in detail with reference to the accompanying drawings.
[0024]
FIG. 1 is a schematic configuration diagram of a first embodiment of a subject behavior recognition apparatus according to the present invention. The moving body 2 may be a person, an animal, a thing, a car, or the like, and is assumed here to be a person whose behavior is to be recognized and managed. However, when the aim is to recognize the target world, the moving body 2 (also referred to as the person 2) may be an animal or a vehicle such as a car, a bicycle, or a motorcycle. All that is required is that the target world can be imaged by the camera.
[0025]
A camera 6 having a built-in wireless device 7 is attached to the moving body 2 in a fixed state. This camera is also equipped with a microphone 4, and can also record sounds emitted by the person 2 and the target world 8. Specifically, the camera 6 corresponds to a video camera provided with a wireless device 7, a mobile phone with a camera, or the like. A mechanism capable of transmitting moving image information from the camera 6 and audio information from the microphone 4 through the wireless device 7 is employed.
[0026]
This camera 6 can capture a moving image of the target world 8, and this moving image is usually composed of 30 frames per second in a home video camera, but may be different depending on whether it is for home use or for business use. Further, if one frame is used for every six frames using a home video camera, the frame rate can be set to 5 frames per second. Therefore, the number of frames per second (frame rate) is arbitrarily determined.
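The frame-rate arithmetic above can be checked with a short sketch. The function name `subsample` and the use of integer indices in place of real image frames are assumptions made for illustration.

```python
def subsample(frames, step):
    """Keep every `step`-th frame, starting with the first one."""
    return frames[::step]


source_rate = 30                        # frames per second of a home video camera
one_second = list(range(source_rate))   # frame indices standing in for images
kept = subsample(one_second, 6)         # use one frame out of every six
effective_rate = source_rate / 6

print(kept, effective_rate)
```

Keeping one frame in six leaves five frames per second, which matches the figure given in the text.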
[0027]
The target world 8 imaged by the camera 6 is an area surrounded by an imaginary line 10, in which a hand 12 and a target 14 of the moving body (person) 2 exist, and these are referred to as a subject 13. Accordingly, a moving image and sound of the subject 13 are obtained, and the wireless device 7 transmits a wireless signal 16 including a moving image signal and an audio signal.
[0028]
The wireless signal 16 is transmitted over a wide area through a network such as the Internet. In the management center 34 that observes the target world 8, the behavior analysis device 20 receives the wireless signal via the network 18.
[0029]
The behavior analysis device 20 is configured by a computer such as a personal computer or an electronic circuit device, for example. The behavior analysis device 20 includes an input unit 22, an image analysis unit 24, a voice analysis unit 26, a text generation unit 28, and a text database unit 30 formed in the text generation unit 28.
[0030]
The input unit 22 receives an input signal 19 from the network 18. This input signal 19 is composed of a moving image signal and an audio signal. The moving image signal is input to the image analysis unit 24, and the audio signal is input to the voice analysis unit 26.
[0031]
The specific operation and function of the image analysis unit 24 will be described later in detail with reference to FIGS. Simply put, a moving image signal is input as a time-series signal of image frames, and the motion of a specific region in the image is estimated by mathematically converting each image frame or analyzing the converted image.
[0032]
In the voice analysis unit 26, the audio signal picked up by the microphone 4 is analyzed. For this analysis, a hidden Markov model (HMM) method can be used, for example. When the subject is a person, sound is produced along with the motion, so the sound can be used as a supplement to the image analysis to make the determined motion more reliable. When the content of the motion agrees with the content of the sound, the motion can be determined with high probability.
[0033]
The text generation unit 28 converts images and sounds, in particular the motions obtained from the images, into text information. That is, the image information is converted into text information with the audio information used as a supplement. This conversion turns image information, which requires a large amount of memory, into text information that fits in a small amount of memory, greatly reducing the amount of information.
[0034]
In the text generation unit 28, case grammar can be used as one method of generating text from an image. First, the verb (PRED) best suited to the action shown in the image is selected. Next, with the verb at the center, the cases of the phrases related to the verb, for example the agent, object, and instrument cases, are determined.
[0035]
More specifically, an operator (AG) who performs the operation and an object (OBJ) on which the operation is performed are selected. Further, a start time (SO-TIME) and an end time (GO-TIME) of this operation are given. As a result, the following operation expression is provided.
[PRED: verb, AG: agent, OBJ: object, SO-TIME: time1, GO-TIME: time2]
[0036]
Finally, a text expression composed of natural language sentences is preferable. The motion expression obtained as described above is converted into a natural language sentence by, for example, a case structure conversion method as described below.
[PRED: sousa-suru, AG: man1, OBJ: ws1, SO-TIME: t1, GO-TIME: t2]
"From time t1 to t2, user man1 has operated workstation ws1."
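The conversion from a motion expression to a sentence can be sketched as follows. The specification does not describe the lexicon or the sentence pattern, so the dictionaries `VERB_PHRASES` and `ENTITY_KINDS` and the function `render_sentence` are hypothetical and cover only the single example given above.

```python
# Hypothetical lexicon mapping a predicate to an English verb phrase.
VERB_PHRASES = {"sousa-suru": "has operated"}

# Hypothetical table giving the kind of each entity identifier.
ENTITY_KINDS = {"man1": "user", "ws1": "workstation"}


def render_sentence(expression):
    """Turn a case-frame style motion expression into an English sentence."""
    verb = VERB_PHRASES[expression["PRED"]]
    agent = f'{ENTITY_KINDS[expression["AG"]]} {expression["AG"]}'
    target = f'{ENTITY_KINDS[expression["OBJ"]]} {expression["OBJ"]}'
    start, end = expression["SO-TIME"], expression["GO-TIME"]
    return f"From time {start} to {end}, {agent} {verb} {target}."


motion_expression = {
    "PRED": "sousa-suru",
    "AG": "man1",
    "OBJ": "ws1",
    "SO-TIME": "t1",
    "GO-TIME": "t2",
}
print(render_sentence(motion_expression))
```

Keeping the expression as a small keyed record is what makes the text form compact: only the verb, the participants, and two timestamps are stored per action.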
[0037]
In other words, the text generation unit 28 analyzes a moving image, continuously generates a large number of motion expressions, and translates them one after another into natural language sentences, producing text that anyone can understand. However, the text generation method is not limited to case grammar or the case structure conversion method, and any of the various text conversion methods that have been developed can be adopted.
[0038]
The text database unit 30 is a memory unit that arranges and stores generated texts under given rules. When the subject 13 is the person 2 wearing the camera, the motion of the person 2 is sequentially converted into text, so that a motion database characteristic of the person can be constructed.
[0039]
For example, if this system is adopted in a hospital, a camera 6 can be attached to each patient, and a behavior database for each patient can be configured, so that management of the patient becomes extremely smooth. In the nursing home, a camera 6 is attached to each elderly person to create a behavior database for each elderly person, and it is possible to quickly and safely assist each elderly person based on this behavior database. Therefore, this system is particularly effective when managing each member individually in a group of a plurality of persons.
[0040]
The communication unit 32 has a role of receiving the text signal 31 from the text generation unit 28 and transmitting text information to the management center 34. Since the text data has a very small capacity, the storage capacity and the communication capacity are small and the communication speed can be increased. Therefore, since the communication unit 32 may be a normal communication device and communication system, it is inexpensive.
[0041]
The management center 34 is a facility that uses text data to manage the persons wearing cameras and the persons photographed by the cameras. Since the data obtained is text data, the storage device of the management center 34 can have a small capacity. Further, the management center 34 receives the text database 30 created for each person and uses it as basic data for individual management.
[0042]
FIG. 2 is a schematic configuration diagram of a second embodiment of the subject behavior recognition apparatus according to the present invention. This device is a device that manages behavior by receiving a wireless signal directly with an antenna without using a network. Therefore, since many parts are the same as those of the apparatus of FIG. 1, the same parts as those of FIG. 1 are denoted by the same reference numerals, description thereof will be omitted, and different parts will be described.
[0043]
The wireless signal 16 including the image signal and the audio signal is directly received by the receiving antenna 21. The wireless signal 16 is sent to the input unit 22 as an input signal 19. Subsequent processing is the same as in FIG.
[0044]
The apparatus of FIG. 1 is used in areas where the network 18 is available, whereas in areas without the network 18, a system that connects the behavior analysis device 20 and the camera 6 directly by radio is effective.
[0045]
FIG. 3 is a schematic configuration diagram of a third embodiment of the subject behavior recognition apparatus according to the present invention. In this apparatus, the behavior analysis device 20 and the wireless device 7 are mounted on the moving body 2 integrally with the camera 6, and the text information is wirelessly transmitted to a required site. The same parts as those in the apparatus of FIG. 1 are denoted by the same reference numerals and their description is omitted. Only the differences are described.
[0046]
If the behavior analysis device 20 is built from a very small computer, it can be integrated with the camera 6 and the wireless device 7 and attached to the moving body 2 such as a person, so that text information can be transmitted immediately from the moving body 2 to the required site.
[0047]
That is, the moving image of the subject 13 captured by the camera 6 is input to the behavior analysis device 20 mounted on the moving body 2, and the audio signal detected by the microphone 4 is input at the same time. The behavior analysis device 20 analyzes the motion of the subject from the moving image and the audio, and the text generation unit 28 outputs the characteristics of the motion as text information.
[0048]
The text information is input to the wireless device 7 as a text signal 31 and is transmitted from the wireless device 7 as a wireless signal 16. The wireless signal 16 is received by the management center 34 as an input signal 19 via the network 18, for example. Alternatively, as shown by the imaginary line, the wireless signal 16 may be received directly by an antenna and passed to the management center 34.
[0049]
The first and second embodiments transmit an image signal and an audio signal as a wireless signal 16, whereas the third embodiment transmits a text signal as a wireless signal 16. This difference is attributable to whether the behavior analysis device 20 is provided on the site side or the mobile body 2.
[0050]
Hereinafter, a specific analysis method of the image analysis unit 24 and a specific generation method of the text generation unit 28 will be described according to individual cases. When the behavior analysis device 20 is configured by a computer device, the method proceeds by a program. When the behavior analysis device 20 is configured by an electronic circuit device, the method proceeds according to the procedure of the electronic circuit.
[0051]
FIG. 4 is a process diagram for estimating the moving amount of a moving body equipped with a camera by a moving image conversion process in the present invention. In (4A), a camera capable of capturing a moving image is mounted on a moving object. The moving object may be any moving object such as a person, an animal, a car, a bicycle, and the like, but a person is usually used as an object whose behavior is managed. Therefore, in the following, it is assumed that the moving object is a person.
[0052]
In (4B), a target world is photographed by a camera, and a moving image of the target world is captured. This moving image includes various images such as a hand of a person wearing the camera and others in the target world. Since the moving object moves around on the ground, the moving image also changes variously over time.
[0053]
When the moving object walks (moves), the background image in the moving image also changes considerably. Conversely, when the moving body (person) is in a sitting state or an upright state, it is in a stationary state, and the background image in the moving image does not change much. However, when a person turns his body left and right or moves his body slightly back and forth, the camera moves in the same way, so that the moving image slightly changes. By recognizing a large change or a small change in the moving image, the moving amount of the moving object, that is, the camera is estimated.
[0054]
In (4C), a large number of moving image frames constituting a moving image are chronologically captured. One of these time-series moving image frames is set as a reference frame. This reference frame may be considered as the first frame of a group of frames to be captured later.
[0055]
In (4D), image frames temporally succeeding the reference frame are successively captured. These image frames are called input frames. Therefore, there are many input frames after the reference frame.
[0056]
In (4E), an input frame is subjected to mathematical conversion processing to convert the input frame so as to match the reference frame as much as possible. The value of the movement parameter group obtained by this conversion processing can be estimated to be the movement amount of the camera.
[0057]
Since the camera moves with the moving object, the input frame is slightly shifted from the reference frame. For example, when the camera moves to the right, the input frame moves to the left from the reference frame. That is, the moving direction of the camera and the moving direction of the input frame have an opposite relationship. Therefore, if the input frame is moved in a direction that matches the reference frame, the movement amount should match the movement amount of the camera.
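The idea of moving the input frame until it matches the reference frame can be sketched for the translation-only case. The exhaustive search below is a stand-in chosen for brevity, not the conversion process of the specification, and the synthetic 8x8 image and the names `shift_error` and `estimate_shift` are assumptions.

```python
def shift_error(reference, frame, dx, dy):
    """Mean squared difference after moving `frame` by (dx, dy) pixels."""
    height, width = len(reference), len(reference[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            sx, sy = x - dx, y - dy  # pixel of the unshifted input frame
            if 0 <= sx < width and 0 <= sy < height:
                diff = reference[y][x] - frame[sy][sx]
                total += diff * diff
                count += 1
    return total / count if count else float("inf")


def estimate_shift(reference, frame, max_shift=3):
    """Find the (dx, dy) that best realigns `frame` with `reference`."""
    candidates = [
        (dx, dy)
        for dx in range(-max_shift, max_shift + 1)
        for dy in range(-max_shift, max_shift + 1)
    ]
    return min(candidates, key=lambda s: shift_error(reference, frame, s[0], s[1]))


# Synthetic reference frame with a distinctive intensity pattern.
reference = [[x * x + 10 * y for x in range(8)] for y in range(8)]
# Input frame: the scene appears 2 pixels further right (camera moved left).
frame = [[reference[y][x - 2] if x >= 2 else 0 for x in range(8)] for y in range(8)]

print(estimate_shift(reference, frame))
```

The recovered shift moves the input frame to the left, which corresponds to the camera itself having moved to the left, in line with the opposite-direction relationship described above.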
[0058]
An appropriate transformation to obtain this movement parameter is an affine transformation. This affine transformation is a transformation that performs processing such as translation, rotation, scaling, and shearing. In particular, the parameters of translation, rotation, and scaling constitute a movement parameter group.
[0059]
In (4F), a translation parameter, a rotation parameter, and a scaling parameter are obtained as a group of movement parameters by, for example, affine transformation. In (4G), the moving amount of the moving object, that is, the camera is estimated based on the values of these moving parameter groups.
[0060]
FIG. 5 is a specific process diagram for deriving a moving amount by a moving image conversion process in the present invention. (5A) shows a reference frame centered on a personal computer. The right direction gives the x direction, and the down direction gives the y direction.
[0061]
(5B) shows an example of the input frame. In this input frame, the central PC has moved slightly to the right. From the viewpoint of the camera, it is considered that the subject has moved right in the frame as a result of the camera moving left. The moving direction of the camera and the moving direction of the subject, that is, the frame, have an opposite relationship.
[0062]
In (5C), an affine transformation is performed on the input frame to convert it so that it matches the reference frame. Since the required amount of transformation is not known in advance, the degree of matching is raised step by step through randomized approximation using, for example, the Condensation algorithm.
[0063]
(5D) shows the transformed image after the affine transformation of the input frame. The position of the personal computer almost matches its position in the reference frame. Put simply, moving the image in the input frame to the left yields the transformed image. The part that falls outside the frame is discarded, and the area left without image data is painted black.
[0064]
(5E) shows the values of the movement parameter group obtained by the affine transformation. dx = −52 indicates that the input frame has been moved leftward by 52, and this value is actually the movement amount of the camera. dy = 13 indicates that the input frame has been moved downward by 13 and indicates the amount of movement of the camera in the y direction.
[0065]
θ = 2.6 indicates that the input frame has been rotated clockwise 2.6 around the origin, and this value gives the amount of rotational movement of the camera. scale = 0.94 indicates that a converted image has been obtained by multiplying the input frame by 0.94, and indicates that the camera has advanced slightly from the reference frame.
[0066]
Therefore, as shown in (5F), the moving body (person), that is, the camera, has translated toward the lower left, rotated slightly to the right, and moved slightly forward. The amounts of movement are the values given above, and in this way the moving amount of the moving body can be derived from the movement parameter group.
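The reading of the parameter values in (5E) can be written down as a small lookup of signs. The function `describe_camera_motion` is hypothetical. The sign conventions for the left, down, clockwise, and scale-below-one cases follow the text, while the opposite cases (right, up, counterclockwise, backward) are assumed by symmetry.

```python
def describe_camera_motion(dx, dy, theta, scale):
    """Describe camera motion from the movement parameter group."""
    steps = []
    if dx != 0:
        steps.append(f"translate {'left' if dx < 0 else 'right'} by {abs(dx)}")
    if dy != 0:
        steps.append(f"translate {'down' if dy > 0 else 'up'} by {abs(dy)}")
    if theta != 0:
        direction = "clockwise" if theta > 0 else "counterclockwise"
        steps.append(f"rotate {direction} by {abs(theta)}")
    if scale != 1:
        # A scale below 1 means the input had to be shrunk, so the camera advanced.
        steps.append("move forward" if scale < 1 else "move backward")
    return steps


# Values from (5E): dx = -52, dy = 13, theta = 2.6, scale = 0.94.
for step in describe_camera_motion(-52, 13, 2.6, 0.94):
    print(step)
```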
[0067]
However, from the above results, it cannot be concluded whether the moving body is in the sitting state (stationary state) and the body is slightly moved, or whether the moving body is in the walking state (moving state). Next, a method for determining whether the moving object is in a stationary state or a moving state will be described.
[0068]
FIG. 6 is a discrimination flowchart that provides a criterion for determining whether the moving body (person) is moving or stationary. When a person is walking, the movement parameter increases in one direction, and a background image of the input frame is completely different from the reference frame. In such a case, it is necessary to reset the reference frame by resetting a new input frame as the reference frame.
[0069]
When the reference frame has to be updated again and again, the person is considered to be walking. In other words, the moving (walking) or stationary state is determined from the reference frame update frequency (update rate), that is, from how many frames pass between updates of the reference frame. The threshold for this rate is called the reference update rate. When the update rate of the reference frame exceeds the reference update rate, the moving body is judged to be moving, and when it is lower, the moving body is judged to be stationary. Each step is described below.
[0070]
In step n1, the first frame of the input frame sequence is set as a reference frame. In step n2, the subsequent image frames are continuously taken as input frames. In step n3, a conversion process is performed on each input frame to match the reference frame as much as possible.
[0071]
In step n4, a group of movement parameters is specifically derived by, for example, affine transformation processing, and the movement amount of the person is estimated (step n5). In step n6, the movement amount is compared with the reference movement amount. If the movement amount is small, it is determined that the person is in a stationary state (step n7), and processing returns to step n2.
[0072]
When the movement amount becomes larger than the reference movement amount, the input frame is reset as the reference frame (step n8), and the update rate of the reference frame is calculated (step n9). The update rate of the reference frame is compared with the reference update rate (step n10). If the update rate is larger than the reference update rate, it is determined that the person is in a moving state (step n11). On the other hand, if it is smaller than the reference update rate, it is determined that the person is in a stationary state (step n12), and processing returns to step n2.
[0073]
As described above, the update rate (update frequency) of the reference frame is calculated while input frames are taken in continuously, and the moving or stationary state of the person (moving body) is determined reliably and quantitatively.
[0074]
FIG. 7 is a specific process diagram for determining whether the moving body (person) is moving or stationary. (7A) to (7E) show an input frame sequence, and (7a) to (7e) show converted image sequences by affine transformation of (7A) to (7E). The direction of the arrow is the time direction.
[0075]
The reference movement amount and the reference update rate are set freely according to the situation. In this example, the reference movement amount is set to dx = 20. The reference update rate is regarded as exceeded when the reference frame is updated within three frames and such an update occurs twice in succession.
[0076]
(7A) is set as a reference frame, and an input frame sequence is fetched one after another. From the input frame, it can be seen that the camera is moving in the upper right direction. When (7B) is converted, it becomes (7b), and since dx = 15, it is within the range of the reference movement amount.
[0077]
When (7C) is converted, it becomes (7c), and since dx = 21, it exceeds the reference movement amount dx = 20. Therefore, it is determined that the reference movement amount is out of the range, and (7C) is updated as the reference frame. Here, the first update of the reference frame was performed.
[0078]
Now (7C) is the reference frame, and when affine transformation is performed on (7D), dx = 10, which is within the range of the reference movement amount. Added to dx = 21 in (7c), this gives a cumulative dx = 31 in (7d). In (7b) to (7d), since the values are within the range of the reference update rate, it is determined that the person, that is, the camera, is in a stationary state.
[0079]
Next, when (7E) is affine-transformed, (7e) of dx = 13 is obtained. Taking (7C) as the reference for dx = 0, dx = 23, which exceeds the reference movement amount dx = 20. Therefore, (7E) is updated again as the reference frame.
[0080]
At stage (7e), since the reference frame has been updated twice consecutively, the update rate of the reference frame has exceeded the reference update rate, and it is determined that the person, that is, the camera, is in a moving state.
[0081]
From the above, the following conclusions are derived. In (7b) to (7d), dx continuously increases while in the stationary state. This means that the person wearing the camera bent his body to the right. Only at the stage (7e), it is determined that the person is moving (walking) rightward.
[0082]
In this way, whether the person is moving or stationary is determined based on whether or not the reference update rate is exceeded, and under this determination, what kind of action the person is taking is determined from a change in the value of the movement parameter.
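The determination walked through for FIG. 7 can be sketched as follows. This is only an illustrative sketch under stated assumptions: each offset is taken to be the dx of an input frame measured against the reference frame in force at that moment, and "exceeding the reference update rate" is interpreted as two successive reference updates that each arrive within three frames. The function name `classify_states` and its parameters are hypothetical and do not appear in the original.

```python
def classify_states(offsets, ref_move=20, window=3, needed=2):
    """Label each input frame as 'moving' or 'stationary'.

    offsets[i] is the offset dx of an input frame relative to the current
    reference frame (an assumed reading of the FIG. 7 example).
    """
    states = []
    frames_since_update = 0  # frames seen since the reference was last reset
    quick_updates = 0        # successive resets that each came within `window` frames
    for dx in offsets:
        frames_since_update += 1
        if abs(dx) > ref_move:
            # Movement exceeds the reference amount: this frame becomes the reference.
            quick_updates = quick_updates + 1 if frames_since_update <= window else 0
            frames_since_update = 0
        elif frames_since_update > window:
            # Too many frames without an update: the streak is broken.
            quick_updates = 0
        states.append("moving" if quick_updates >= needed else "stationary")
    return states


# Offsets read from the walk-through for (7b), (7c), (7d) and (7e).
print(classify_states([15, 21, 10, 23]))
```

With these inputs the first three frames are labelled stationary and only the last is labelled moving, which agrees with the conclusion drawn for (7b) to (7e).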
[0083]
FIG. 8 is a process diagram for estimating the motion of the hand of the person (moving body) wearing the camera. As a simple example, let us analyze the case where a stationary person is operating the keyboard of a personal computer.
[0084]
In (8A), the camera is mounted and fixed on a person. In (8B), the first frame of the input frame sequence is set as a reference frame. In (8C), the subsequent input frames are successively converted into the state of the reference frame. In (8D), by confirming that the update rate (update frequency) of the reference frame is smaller than the reference update rate, it is determined that the person is in a stationary state.
[0085]
In (8E), the converted image is regarded as the sum of a background region and a hand region, and the hand region is separated and extracted from the background region. Separation and extraction are performed in two stages. Skin color is selected as the first index, and skin color regions are extracted from the converted image. At this stage, skin-colored areas may exist other than the hand, so the first index also extracts skin-colored areas that are not the hand.
[0086]
As the second index, the motion of the hand is predicted by an equation of motion, and if a skin color region has moved to the predicted position, that region is determined to be the hand region. For the prediction, for example, linear prediction or a Kalman filter is used. Skin color areas that do not move are removed in this second stage.
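The second index can be illustrated with a constant-velocity linear prediction, one of the options named above. The sketch below is an assumption-laden illustration, not the method of the specification: the function names, the pixel tolerance `tol`, and the minimum movement `min_move` are hypothetical.

```python
import math


def predict_next(prev, curr):
    """Constant-velocity linear prediction: next = curr + (curr - prev)."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def is_hand_candidate(track, observed, tol=5.0, min_move=1.0):
    """Return True if a skin-colored region behaves like a moving hand.

    track    : earlier centroids of the region, oldest first (at least two)
    observed : centroid of the region in the newest converted image
    tol      : allowed distance from the predicted position (hypothetical)
    min_move : smallest displacement that counts as movement (hypothetical)
    """
    prev, curr = track[-2], track[-1]
    px, py = predict_next(prev, curr)
    moved = math.hypot(observed[0] - curr[0], observed[1] - curr[1]) >= min_move
    near_prediction = math.hypot(observed[0] - px, observed[1] - py) <= tol
    return moved and near_prediction


# A region drifting left is kept; a region that never moves is rejected.
print(is_hand_candidate([(100, 50), (90, 50)], (81, 51)))
print(is_hand_candidate([(40, 40), (40, 40)], (40, 40)))
```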
[0087]
When a specific area is extracted using a plurality of indices in this way, the Dempster-Shafer method, for example, can be used. In this method, given the certainty factor of the first index and the certainty factor of the second index, an overall certainty factor is derived, and this gives the reliability of the extracted hand region.
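As a rough illustration of how two certainty factors can be fused, the following sketch applies Dempster's rule of combination to a two-hypothesis frame. The hypothesis names and the mass values are invented for illustration; the specification does not give concrete numbers.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to its mass.
    """
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


HAND = frozenset({"hand"})
OTHER = frozenset({"other"})
EITHER = HAND | OTHER  # mass left uncommitted

# Hypothetical certainty factors from the skin color index and the motion index.
skin = {HAND: 0.6, EITHER: 0.4}
motion = {HAND: 0.7, OTHER: 0.1, EITHER: 0.2}

fused = combine(skin, motion)
print(round(fused[HAND], 3))
```

In this example the fused belief in "hand" is higher than either index gives on its own, which is the behavior one would expect when two independent indices agree.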
[0088]
In (8F), after the hand region is extracted, the hand region of the converted image is colored to form an image after hand region extraction. In (8G), when colored points are scattered as noise outside the hand region, this noise must be removed. Here, a median filter operation is used to remove the scattered noise.
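A median filter of the kind mentioned here can be sketched on a binary mask as follows. The 3x3 window and the choice to leave border pixels untouched are assumptions made for this sketch.

```python
def median_filter3(mask):
    """Apply a 3x3 median filter to a binary mask given as a list of rows.

    Border pixels are copied unchanged (an assumption of this sketch).
    """
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [mask[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sorted(window)[4]  # middle of the nine values
    return out


# An isolated colored point is treated as noise and removed.
noisy = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
cleaned = median_filter3(noisy)
print(cleaned[2][2])
```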
[0089]
In (8H), the movement of the colored hand region is detected from the sequence of clustered images. By reading this movement, the motion of the hand is estimated. The motion of the hand includes left-right movement and up-down movement.
[0090]
FIG. 9 is a specific process diagram for extracting a hand region. The original image is an example of the converted image. Three methods for extracting a hand region are shown. The upper image is obtained by removing the background image having a specific background coloring from the original image to derive the hand region. The middle image is obtained by extracting only the skin color area. It can be seen that both methods extract the entire hand region.
[0091]
In the lower image, the skin color area is replaced with an elliptical area, and it is determined for each frame whether or not the ellipse is moving. A hand is naturally expected to move, so the certainty that the area is a hand is determined from whether or not the elliptical area moves to the position predicted by the equation of motion. "DS theory" means Dempster-Shafer theory.
[0092]
FIG. 10 is a specific process diagram for estimating the hand motion. A clustering process is performed on the image after the hand region extraction to remove noise and obtain a clustered image. When four images after this clustering are arranged, the details of the operation of the colored hand region become clear.
[0093]
The left hand area is moving from right to left, and the text information is “left hand moved left”. When estimating the motion of a person wearing a camera, at least a part of the person needs to be photographed by the camera, and the part is likely to be a hand. Therefore, the movement of the person is estimated by paying attention to the hand.
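Reading the direction of the colored hand region across the clustered images could look like the sketch below. It assumes that the centroid of the hand region has already been computed for each image, that x increases to the right and y increases downward, and that a small threshold `min_shift` separates movement from jitter. These names and the "stayed still" wording are hypothetical.

```python
def describe_hand_motion(centroids, hand="left", min_shift=2.0):
    """Turn a sequence of hand-region centroids into a short text.

    centroids : (x, y) positions of the hand region, oldest first
    """
    (x0, y0), (x1, y1) = centroids[0], centroids[-1]
    dx, dy = x1 - x0, y1 - y0
    if max(abs(dx), abs(dy)) < min_shift:
        return f"{hand} hand stayed still"
    if abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"
    return f"{hand} hand moved {direction}"


# Four clustered images in which the hand region drifts from right to left.
print(describe_hand_motion([(80, 40), (70, 41), (58, 40), (45, 42)]))
```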
[0094]
FIG. 11 is a process diagram for estimating the motion of another person captured by the camera. When extracting another person, the other person is extracted by focusing on the face of the other person. In (11A), the camera is mounted on the moving body. In (11B), the first frame of the input frame sequence is set as a reference frame. In (11C), the conversion process is performed so that the subsequent input frame sequence matches the reference frame as much as possible.
[0095]
In (11D), a face area is extracted from the converted image sequence. The converted image is separated into a background area and a face area. Two criteria are used to extract the face area of the subject: the skin color area and the elliptical area.
[0096]
If skin color regions are extracted from the converted image, a plurality of regions such as a face region and a hand region are obtained. An elliptical shape is therefore introduced as a second criterion. As a result, only the face region is extracted. The Dempster-Shafer method is used here as well.
[0097]
In (11E), the extracted face area is colored, and the colored face area is incorporated into the original converted image to form a face area extracted image. In (11F), noise is removed by performing clustering, and an image after the median filter operation is formed.
[0098]
In (11G), the motion of the face area is analyzed by arranging and comparing the images after clustering, and the motion of the subject is estimated. In this example the behavior of a person is analyzed, but any object photographed by the camera, such as a car or a bicycle, can likewise be subjected to behavior analysis.
[0099]
FIG. 12 is a process diagram of recognizing and specifying an object captured by the camera. In (12A), the camera is mounted on the moving body. In (12B), the first frame of the input frame sequence is set as a reference frame. In (12C), a subsequent input frame is converted to match the reference frame.
[0100]
In (12D), a target object is extracted from the converted image. The converted image is considered to be the sum of a background region forming a large background, a skin color region as a person, and a region of the target object of interest. Therefore, when a region having a high background probability and a region having a high skin color probability are removed from the converted image, only the target object region is extracted.
[0101]
In (12E), the many stored template models are compared with the extracted target object. The most similar template model is selected by comparing colors and shapes, and the target object is determined to be the selected template model. In this way, the target object is recognized.
[0102]
FIG. 13 is a specific process chart showing a method of recognizing an object photographed by a camera. A region having a low background probability and a low skin color probability is extracted as an object region from the above converted image. As a result, the target object held by the hand is extracted.
[0103]
This target object and many template models are compared with each other. Among them, the cup with the highest matching probability is selected. At this stage, it is determined that the target object is a cup. In this way, the present invention can also determine what the target object to be photographed is.
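The comparison against stored template models can be sketched as a weighted score over a color feature and a shape feature. Everything concrete here is an assumption for illustration: the three-bin color histogram, the use of aspect ratio as the shape feature, the equal weighting, and the feature values themselves.

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized color histograms (1.0 means identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def shape_similarity(aspect1, aspect2):
    """Similarity of two width/height ratios, in the range (0, 1]."""
    return min(aspect1, aspect2) / max(aspect1, aspect2)


def best_template(obj, templates, w_color=0.5):
    """Return the name and score of the template most similar to obj."""
    scores = {
        name: w_color * histogram_intersection(obj["hist"], t["hist"])
        + (1.0 - w_color) * shape_similarity(obj["aspect"], t["aspect"])
        for name, t in templates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical stored template models and an extracted target object.
templates = {
    "cup": {"hist": [0.1, 0.2, 0.7], "aspect": 0.8},
    "book": {"hist": [0.6, 0.3, 0.1], "aspect": 1.4},
    "clock": {"hist": [0.3, 0.4, 0.3], "aspect": 1.0},
}
target = {"hist": [0.15, 0.2, 0.65], "aspect": 0.85}

name, score = best_template(target, templates)
print(name)
```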
[0104]
FIG. 14 is a hierarchical structure diagram for recognizing the behavior of the person wearing the camera and expressing it in text. First, whether the person is moving or stationary is determined based on the reference frame update rate (update frequency). That is, if the update rate exceeds the reference update rate, the person is determined to be moving, and if not, the person is determined to be stationary.
[0105]
When the person is in the moving state, the action is recognized from the values of the movement parameter group. For example, if dx > 0, “turn right”; if dx < 0, “turn left”; if dy < 0, “stand up”; if dy > 0, “sit down”; if scale > 1, “forward”; and if scale < 1, “retreat”. These determinations change depending on how the coordinate axes are taken.
[0106]
If neither the hand nor an object is extracted while the person is at rest, the body motion of the person is recognized from the values of the movement parameter group. For example, if dx > 0, “turn right”; if dx < 0, “turn left”; if dy < 0, “turn up”; and if dy > 0, “turn down”. These determinations change depending on how the coordinate axes are taken.
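The two rule sets above can be written down as a small lookup, sketched below. The labels are the ones quoted in the text; the function name and the dead bands `eps` and `scale_eps`, which suppress tiny parameter values, are hypothetical additions. As the text notes, the signs depend on how the coordinate axes are taken.

```python
def describe_body_motion(state, dx=0.0, dy=0.0, scale=1.0, eps=0.0, scale_eps=0.0):
    """Map movement parameters to the action labels quoted in the text.

    state is 'moving' or 'stationary', as decided from the update rate.
    """
    labels = []
    if dx > eps:
        labels.append("turn right")
    elif dx < -eps:
        labels.append("turn left")
    if state == "moving":
        if dy < -eps:
            labels.append("stand up")
        elif dy > eps:
            labels.append("sit down")
        if scale > 1.0 + scale_eps:
            labels.append("forward")
        elif scale < 1.0 - scale_eps:
            labels.append("retreat")
    else:
        if dy < -eps:
            labels.append("turn up")
        elif dy > eps:
            labels.append("turn down")
    return labels


print(describe_body_motion("moving", dx=5.0, scale=1.2))
print(describe_body_motion("stationary", dy=-3.0))
```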
[0107]
When the hand is extracted while the person is in a stationary state, the hand motion is determined from the movement of the hand in the converted image. Examples are "right hand raised", "right hand lowered", "left hand raised", and "left hand lowered".
[0108]
When the person is in a stationary state, the hand is extracted, and a cup held in the hand is recognized, a more detailed motion is recognized from the hand movement in the converted image. Examples are "drank with right hand", "drank with left hand", "held with right hand", and "held with left hand".
[0109]
When the person is in a stationary state, the hand is extracted, and a book in contact with the hand is recognized, the following detailed motions can be recognized from the hand movement in the converted image: for example, "read book", "turn page", "open book", and "close book".
[0110]
As described above, when the motion of a person is recognized from a moving image obtained by a camera and expressed in text by case grammar, the image expression is converted into a text expression. This conversion sharply reduces the required memory capacity and communication capacity, so the cost of the storage device and the communication device can be reduced and, at the same time, the communication speed can be greatly improved.
[0111]
In particular, by focusing on individual persons and expressing their actions in text, and storing this text group in accordance with a predetermined rule, an action database for each person can be created. With this behavior database, it is possible to manage a plurality of persons individually.
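One way to picture a per-person behavior database is sketched below. The specification does not define the storage rule or the case-grammar frame, so the slot names (`agent`, `action`, `object`, `instrument`), the `BehaviorDatabase` class, and the sample record are all hypothetical.

```python
from collections import defaultdict


def render(frame):
    """Render a simple case frame as a text expression."""
    parts = [frame["agent"], frame["action"]]
    if "object" in frame:
        parts.append(frame["object"])
    if "instrument" in frame:
        parts.append("with " + frame["instrument"])
    return " ".join(parts)


class BehaviorDatabase:
    """Keeps a time-ordered list of text records for each person."""

    def __init__(self):
        self._records = defaultdict(list)

    def add(self, person_id, timestamp, frame):
        self._records[person_id].append((timestamp, render(frame)))

    def history(self, person_id):
        return sorted(self._records.get(person_id, []))


db = BehaviorDatabase()
db.add("person_A", 2, {"agent": "person_A", "action": "drank from",
                       "object": "cup", "instrument": "right hand"})
db.add("person_A", 1, {"agent": "person_A", "action": "sat down"})
print(db.history("person_A"))
```

Each stored record is a short string, which is the source of the storage and communication savings the text describes relative to keeping the moving image itself.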
[0112]
FIG. 15 is an experimental diagram of a behavior in which a person wearing a camera walks in a laboratory. The person walks from position 1 to position 7 according to the arrows. The video of the camera was analyzed by a computer and converted into text, and it was confirmed whether the sentence expression and the action corresponded.
[0113]
At position 1, the determinations were "take a cup on the desk" and "stand up from the chair"; at position 2, "turn right"; at position 3, "turn right"; at position 4, "turn right"; at position 5, "turn right"; at position 6, "turn left"; and at position 7, "turn left and sit on a chair" and "drink from the cup". It was confirmed that the actual behavior and the text expression matched.
[0114]
The present invention is not limited to the above embodiment, and it goes without saying that various modifications and design changes without departing from the technical idea of the present invention are included in the technical scope thereof.
[0115]
[Effects of the invention]
According to the first aspect of the invention, the camera is mounted on a moving object such as a person, an animal, an object, or a car, and moves together with the moving object. The moving image captured by the camera is transmitted as a wireless signal, analyzed, and converted to text. Since the image information is transmitted by the wireless signal, the moving object can be appropriately managed whether it moves indoors or outdoors. Since image information is converted into text information, storage capacity and communication capacity can be reduced, which lowers cost and greatly improves communication speed. Also, if the camera is attached to the person to be managed, information such as the motion of the person's hand can be managed in real time wherever the person is located, and by accumulating text information, a behavior database for each individual can be created automatically.
[0116]
According to the second aspect, if a camera-equipped mobile phone is used as the camera and the wireless device, the behavior analysis can be easily performed using an existing network. Also, through a network such as the Internet, a moving image can be analyzed in real time regardless of the location of the person, and the behavior can be converted to text and transmitted to the management center. Since this aspect differs from the first aspect only in passing through a network, it has the same operation and effect as the first aspect.
[0117]
According to the third aspect, the camera, the behavior analysis device, and the wireless device are integrally mounted on a moving object such as a person, an animal, a thing, or a car. By configuring the behavior analysis device with a microcomputer, the entire device can be made compact. For example, the behavior analysis device can be attached to a person who behaves and individually managed. In addition, since the text information is transmitted by the wireless device, the information capacity can be small, communication can be performed with a required site via a network, and the text information can be directly transmitted to a management center or the like.
[0118]
According to the fourth aspect, if the input frame is subjected to, for example, affine transformation to return to the reference frame, the translation amount, the rotational amount, and the enlargement / reduction ratio in the XYZ directions can be derived as movement parameters. Since the moving direction of the input frame and the moving direction of the camera are opposite, the value of the moving parameter obtained by the above-described return processing gives the moving amount in the direction in which the camera has moved together with the moving object. In addition to the affine transformation, a mathematical transformation that can estimate the amount of movement of the camera can be widely used.
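The movement parameters named here can be read off a 2x3 affine matrix as sketched below. The sketch assumes the estimated transform is a similarity (uniform scale plus rotation plus translation) that maps input-frame coordinates back to the reference frame; how the matrix itself is estimated from the images is outside this sketch, and the function name is hypothetical.

```python
import math


def decompose_similarity(matrix):
    """Extract movement parameters from [[a, b, tx], [c, d, ty]].

    Assumes a similarity transform, so b and d carry no extra information
    (b = -c and d = a) and are not used.
    """
    (a, _b, tx), (c, _d, ty) = matrix
    return {
        "dx": tx,                                      # translation in X
        "dy": ty,                                      # translation in Y
        "scale": math.hypot(a, c),                     # enlargement / reduction ratio
        "rotation_deg": math.degrees(math.atan2(c, a)),
    }


# A transform built from known values: scale 1.1, rotation 10 degrees, shift (21, -4).
s, theta = 1.1, math.radians(10.0)
matrix = [
    [s * math.cos(theta), -s * math.sin(theta), 21.0],
    [s * math.sin(theta), s * math.cos(theta), -4.0],
]
params = decompose_similarity(matrix)
print({key: round(value, 3) for key, value in params.items()})
```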
[0119]
According to the fifth aspect, when a person is considered as the moving object, the input frame also deviates from the reference frame when the seated person stretches the body up and down, bends it left and right, or rocks it back and forth. In this case, the movement parameter does not become very large. When a person walks, however, the person moves straight in the X direction or the Y direction, so the movement parameter is considered to increase in one direction. According to the present invention, the stationary or moving state of a person is determined from the rate at which the reference frame is updated, that is, the update frequency (update rate). If the reference frame is updated frequently in one direction, it is determined that the person is walking (moving); if the update frequency (update rate) is low, the person, whether sitting or standing, can be determined to be stationary.
[0120]
According to the sixth aspect, in the converted images obtained by converting the input frames to the reference frame, the background area occupying a large part of each converted image is common when the person (moving body) is in a stationary state. Therefore, against this common background region, attention is paid to the movement of a specific object such as the head of another person, the hand of the person, or a cup held by the person, and the behavior of the subject (the person or another person) can be determined from the movement of that specific object.
[0121]
According to the seventh aspect, when a camera is mounted on a person and one or both of the person's hands move in front of the camera, the hand naturally becomes a subject. In the converted image, hand candidates are selected based on the skin color area, and the hand is reliably recognized based on hand movement information. For example, if the motion of the hand is predicted by an equation of motion and the region is found at the predicted position, this confirms that the region is a hand. The hand is thus recognized from the agreement of skin color information and motion information, and if the hand part is colored in the converted image, the action of the person who is the moving body can be recognized from the movement of the hand.
[0122]
According to the eighth aspect, in the converted image, the face of another person is recognized from the agreement of skin color information and contour information such as an ellipse. If the face of the other person can be determined from both, the face is displayed in color, and how the other person is behaving can be determined from the movement of the face.
[0123]
According to the ninth aspect, for example, when a person who is a moving object holds a cup, the motion of the person's hand can be determined by the above-described method. When the hand is holding something, one or more concrete objects such as a cup, a clock, or a book are stored in memory as template models, and the object in the hand is compared with these template models to identify it. In other words, when the movement of the hand is determined and the object is recognized as a cup, it can be recognized that the person is about to drink from the cup. By combining the motion of the person with the recognition of the object in this way, the behavior of the subject can be recognized at a higher level.
[0124]
According to the tenth aspect, once the behavior of the subject has been recognized, the behavior can be converted into text simply by case grammar or the like, the amount of information can be reduced from image information to text information, and the text information can be stored and communicated. This makes it possible to sharply reduce the cost of the storage device and the communication cost. In addition, since the behavior of a moving person can be converted into text information, a text database can be created as an activity record, and a behavior database for a specific person can be constructed and used in hospitals, nursing homes, schools, and the like, so that safety management can be performed efficiently.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram of a first embodiment of a subject behavior recognition apparatus according to the present invention.
FIG. 2 is a schematic configuration diagram of a second embodiment of the subject behavior recognition apparatus according to the present invention.
FIG. 3 is a schematic configuration diagram of a third embodiment of the subject behavior recognition apparatus according to the present invention.
FIG. 4 is a process diagram for estimating a moving amount of a moving body equipped with a camera by a moving image conversion process in the present invention.
FIG. 5 is a specific process diagram for deriving a moving amount by a moving image conversion process in the present invention.
FIG. 6 is a discrimination flowchart for providing a criterion of a moving state or a stationary state of a moving body (person).
FIG. 7 is a specific process diagram for determining whether the moving body (person) is moving or stationary.
FIG. 8 is a process diagram for estimating a motion of a hand of a person (moving body) wearing a camera.
FIG. 9 is a specific process diagram for extracting a hand region. The original image is an example of the converted image.
FIG. 10 is a specific process chart for estimating a hand motion.
FIG. 11 is a process chart for estimating a motion of another person captured by a camera.
FIG. 12 is a process diagram of recognizing and specifying an object captured by a camera.
FIG. 13 is a specific process chart showing a method of recognizing an object captured by a camera.
FIG. 14 is a hierarchical structure diagram for recognizing the behavior of a person wearing a camera and expressing it in text.
FIG. 15 is an experiment diagram of a behavior of a person wearing a camera walking in a laboratory.
[Explanation of symbols]
2 is a moving body (person), 4 is a microphone, 6 is a camera, 8 is a target world, 10 is a photographing area, 12 is a hand, 13 is a subject, 14 is a target, 16 is a radio signal, 18 is a network, 19 is an input signal, 20 is a behavior analysis device, 21 is a receiving antenna, 22 is an input unit, 24 is an image analysis unit, 26 is a voice analysis unit, 28 is a text generation unit, 30 is a text database, 31 is a text signal, 32 is a communication unit, and 34 is a management center.

Claims

A behavior recognition device for a subject, comprising: a camera mounted on a moving body; a radio device for transmitting, as a radio signal, a moving image of a subject consisting of an object photographed by the camera or a part of the moving body; and a behavior analysis device for receiving the radio signal, wherein the behavior analysis device includes an image analysis unit that processes the moving image to analyze the behavior of the subject, and a text generation unit that outputs the behavior of the subject analyzed by the image analysis unit as text information.

A behavior recognition device for a subject, comprising: a camera mounted on a moving object; a radio device for transmitting, as a radio signal, a moving image of a subject consisting of an object photographed by the camera or a part of the moving object; and a behavior analysis device that receives the radio signal transmitted from the radio device via a network, wherein the behavior analysis device includes an image analysis unit that processes the moving image to analyze the behavior of the subject, and a text generation unit that outputs the behavior of the subject analyzed by the image analysis unit as text information.

A behavior recognition device for a subject, comprising: a camera attached to a moving object; and a behavior analysis device and a wireless device attached to the camera, wherein the behavior analysis device includes an image analysis unit that processes a moving image of a subject consisting of an object photographed by the camera or a part of the moving object to analyze the behavior of the subject, and a text generation unit that outputs the behavior of the subject analyzed by the image analysis unit as text information, and wherein the wireless device wirelessly transmits the text information to a required site.

A behavior recognition method for a subject, wherein a moving image is captured by photographing the target world with a camera attached to a moving object, one of a plurality of time-series image frames constituting the moving image is set as a reference frame, a subsequent image frame is used as an input frame, a conversion process is performed on the input frame so as to approximate the reference frame as closely as possible, a movement parameter indicating how much the input frame has moved from the reference frame is derived by this conversion process, and a movement amount of the moving object is estimated from the movement parameter.

A behavior recognition method for a subject, wherein a moving image is captured by photographing the target world with a camera attached to a moving object, one of a plurality of time-series image frames constituting the moving image is set as a reference frame, a subsequent image frame sequence is used as an input frame sequence, a conversion process is performed on the input frame sequence to form a converted image sequence that approximates the reference frame as closely as possible, a movement parameter indicating how much the input frame has moved from the reference frame is derived by the conversion process, the input frame is reset as the reference frame to update the reference frame when the movement amount of the moving object estimated from the movement parameter becomes larger than a reference movement amount, and the behavior of the moving object is determined from the update frequency (update rate) of the reference frame by repeating this operation.

A behavior recognition method for a subject, wherein a moving image is captured by photographing the target world with a camera attached to a moving object, one of a plurality of time-series image frames constituting the moving image is set as a reference frame, a subsequent image frame sequence is set as an input frame sequence, a conversion process is performed on the input frame sequence to form a converted image sequence that is as close as possible to the reference frame, a specific target object is set as a specific area within the background region in each converted image, and the behavior of the specific object is recognized from the motion of the specific area in the converted image sequence.

The method of recognizing a subject's behavior according to claim 5, wherein, when the moving object is a person, the specific target is the hand of the person, the hand region is extracted as a specific region from at least skin color information and motion information, and the behavior of the person is determined from the motion of the hand region.

The specific object is the face of another person photographed by the camera, the face area is extracted as a specific area from at least skin color information and outline information, and the action of the other person is judged from the motion of the face area. 6. The method for recognizing a behavior of a subject according to 5.

A behavior recognition method for a subject, wherein the target world is photographed by a camera mounted on a moving object to capture a moving image, one of a plurality of time-series image frames constituting the moving image is set as a reference frame, a subsequent image frame sequence is set as an input frame sequence, a conversion process is performed on the input frame sequence to form a converted image sequence that is as close as possible to the reference frame, a specific target object is set as a specific area within the background region in each converted image, and the specific object is recognized by comparing the image of the specific area with stored template models.

The method for recognizing a behavior of a subject according to claim 4, wherein the behavior of the subject is converted into text information.