JP3639520B2

JP3639520B2 - Moving image learning method and recording medium recording this program

Info

Publication number: JP3639520B2
Application number: JP2000338629A
Authority: JP
Inventors: ライチェフビセル; 洋村瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-11-07
Filing date: 2000-11-07
Publication date: 2005-04-20
Anticipated expiration: 2020-11-07
Also published as: JP2002150297A

Description

【０００１】
【発明の属する技術分野】
本発明は、人物顔動画像などの動画認織のための学習技術にかかわり、特に実環境条件の下で、カテゴリ（人の名前）に対する事前情報が与えられていない長時間にわたる連続映像データを用いて自己組織化的に画像を学習する方法に関するものである。
【０００２】
【従来の技術】
顔画像認識を例に説明する。顔画像認識は、学習段階で作成された顔画像モデルと、未知の入力顔画像とのパターン照合（距離値または類似度の計算）により、実現される。
【０００３】
従来の顔認識法における学習では、制限された環境条件で撮影され、かつ人手によって人名などのカテゴリを付けられた少数の顔画像から学習を進めるという方法がとられていた。
【０００４】
しかし、このアプローチでは、人手によるデータの準備作業が必要であるので、多量の顔データを学習に利用することが困難である。
【０００５】
このアプローチを用いて実環境で顔認識を行う場合には、限られた数の顔画像しか学習に利用できないために、実環境で発生する様々な変動（例えば、照明条件の変動、カメラ（視点）に対する顔の角度と距離の変動、顔表情や眼鏡、ヘアスタイルなどによる顔の形の変動）に単純に対処するだけの十分の学習データを蓄積することは困難である。
【０００６】
従来は、それに対処するために場当たり的に、前処理を工夫したり、特徴の工夫がなされていたが、十分な精度は得られていない。つまり、従来の顔認識は、複雑な変化に富んだ実環境に対応するための柔軟性が備わっていなかった。
【０００７】
一方、人間が行う顔認識の場合では、学習システムがいつでも動作していて、新しいデータが入力されると、システムの内部状態を更新し、いつでも新しいカテゴリを事前情報なしに追加できる。
【０００８】
【発明が解決しようとする課題】
前記のように、顔画像モデルと、未知の入力顔画像とのパターン照合による認識方法では、人手によるデータの準備作業が必要とし、多量の顔データを学習に利用することが困難であった。また、十分な学習データを蓄積すること及び十分な精度を得るのが困難であった。
【０００９】
本発明では、人間によって事前に用意された情報を用いることなく、実環境でコンピューターが自律的に長時間の連続データを自分のタスクの目的に応じて自己組織化的に学習し、学習途中で新しいカテゴリを自然に追加できる方法を提案するものである。つまり、本発明の目的は、長時間環境映像データに含まれる多様な変動を人間の介在なしに教師なしで追加学習する方法を提案することである。
【００１０】
また、本発明の目的は、すべての過程（データ収集、格納、顔領域抽出、学習、認識）が自動的に行われ、人手による作業は必要としない学習方法を提案することである。
【００１１】
【課題を解決するための手段】
本発明は、上記の目的を達成するために、長時間環境映像データを用いて、その中に現れる移動物体を自動的に抽出すると共に、複数の抽出された移動物体をその情報を利用して自動的に関連付けることにより、人為的にカテゴリを与えることなく自動的に移動物体のカテゴリ分けを行うようにしたものである。また、カテゴリ分けの際に、対象情報がどのカテゴリに含まれるかを映像データに対して補足的に人手で一部だけカテゴリ付けできるようにするものであり、以下の学習方法および記録媒体を特徴とする。
【００１２】
（学習方法の発明）
連続的に撮影した環境映像データから、映像内で移動する対象物体を教師なしで自動的に学習する動画像学習方法において、
事前に対象物体のカテゴリ情報が与えられていないデータを撮影する段階と、
前記対象物体の動画像を時空間的に切り出し動画像配列を作成する段階と、
全ての組み合わせの前記動画像配列間で動画像配列の中に含まれる画像間の距離をもとに動画像配列間の距離を計算する段階と、
前記動画像配列間の距離を用いて初期サブクラスタを形成する段階と、
ｃｏｎｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジを用いて全てのサブクラスタを結合する段階と、
前記結合された各カテゴリー毎の動画像辞書を作成する段階と、
から構成されることを特徴とする。
【００１３】
また、前記切り出した動画像の一部に関してカテゴリ分けの事前情報を与える段階を備えたことを特徴とする。
【００１４】
また、前記ｃｏｎｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジを用いて全てのサグクラスタを結合する段階で、カテゴリ情報を与える段階と、与えられたカテゴリ情報を基に前記ｃｏｎｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジを変更する段階とを備えたことを特徴とする。
【００１５】
（記録媒体の発明）
連続的に撮影した環境映像データから、映像内で移動する対象物体を教師なしで自動的に学習する動画像学習方法を、コンピュータに実行させるためのプログラムを、該コンピュータが読み取り可能な記録媒体に記録した記録媒体であって、
事前に対象物体のカテゴリ情報が与えられていないデータを撮影する過程と、前記対象物体の動画像を時空間的に切り出し動画像配列を作成する過程と、全ての組み合わせの前記動画像配列間で動画像配列の中に含まれる画像間の距離をもとに動画像配列間の距離を計算する過程と、前記動画像配列間の距離を用いて初期サブクラスタを形成する過程と、ｃｏｎｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジを用いて全てのサブクラスタを結合する過程と、前記結合された各カテゴリー毎の動画像辞書を作成する過程とをプログラムで記録したことを特徴とする。
【００１６】
【発明の実施の形態】
図１は、本発明の実施形態を示すブロック構成図である。本実施形態では３つのユニット、顔領域抽出装置（ユニット１）、学習装置（ユニット２）、認識装置（ユニット３）から構成される。
【００１７】
顔領域抽出装置１は、長時間連続の環境映像データに出現する人物の顔領域を自動的に抽出し、その大きさを正規化し、画像に人物が出現してから消失するまでの一連の顔画像系列（これを動画像配列と呼ぶ）を作成し、ファイルに格納し、後段の学習／認識装置２、３に入力情報を提供する。
【００１８】
学習装置２は、後述の学習アルゴリズムを用いて、事前のカテゴリ情報を含まない多数の動画像配列をカテゴリごとに組織化する。その結果に従って、システムの内部状態も適切に更新される。ここで、内部状態とは、動画像配列間のどれとどれが同じカテゴリーに属するかを示す表現のことを言う。
【００１９】
認識装置３は、システムの最新の動画像辞書（登録された動画像配列）と認識すべき入力データ（入力動画像から顔領域抽出装置により切り出された動画像配列）との照合に基づいて顔認識を行い、認識結果を出力する。
【００２０】
はじめに、学習や認識のための準備段階としての顔領域抽出装置の一例について、図２のフローチャートを用いて説明する。
【００２１】
ステップ２０１では、固定したカメラから連続的に環境映像データが入力されるが、隣り合った２枚の画像を引算し、閾値をとってバイナリ差分画像を作成する。
【００２２】
ステップ２０２では、作成された差分画像から、観察されている環境は変化があったかどうかを判断する。人物の登場に相当する変化がない場合、変化があるまでステップ２０１とステップ２０２を繰り返すが、変化があった場合にはステップ２０３へ進む。
【００２３】
ステップ２０３では、前ステップで作ったバイナリ差分画像からサブサンプリングを繰り返しながら多解像度画像ピラミッドを作成する。サブサンプリングは、元の画像の各２×２ピクセル領域を（領域内の「１」の数に基づいて）「１」または「０」値のピクセルに置き換える。なお、多解像度画像を用いることによって、顔領域の座標をもっと安定で正確に計算することができる。すなわち、低い解像度の画像がノイズの影響を受けにくいので比較的安定に顔領域を決めることができるが、正確ではない。逆に、高い解像度の画像から顔領域を正確に決めることができるがノイズの影響を受けやすい。以下で説明する顔領域抽出アルゴリズムは各解像度の画像でパラレルに行われ、それぞれの計算結果か後で述べるように、ステップ２０７で統合される。
【００２４】
次のステップ２０４では、照明条件の変動、衣服のテキスチャーなどの原因によるノイズの影響で人物領域内にできている「０」値の「穴」を埋めるために全画像に、次のようなフィルターを掛ける。
【００２５】
画像全体に３×３ピクセルの窓を１ピクセルステップで移動させながら、窓の中に少なくとも２つの異なった行と２つの異なった列には同時に値「１」のピクセルが存在すれば、窓の中心にあるピクセルを「１」に置き換える。この操作を左右、右左、上下、下上方向で何回か（あるいは変化がなくなるまで）行う。
【００２６】
次のステップ２０５では、前段階で作成された画像の重心とｘヒストグラム（画像の各列に含まれる「１」の数を表す関数）を計算して、重心からｘヒストグラムの全エネルギーの例えば９０％が含まれる領域を人物領域として切り出す。つまり、人物の面積が、背景に動く他の物体、陰、ノイズなどの面積よりずっと大きいと仮定している。
【００２７】
ステップ２０６では、ステップ２０５で抽出された人物領域上に、改めてｘヒストグラムとｙヒストグラムを計算し、それらの内容の解析を行うことによって顔座標を決定する。
【００２８】
ステップ２０７では、各解像度で計算された顔領域の座標の中央値（ｍｅｄｉａｎ）が求められ、最終的な顔領域の座標を得る。
【００２９】
ステップ２０８では、上記のように抽出された顔画像を一定のサイズに正規化する。実際の入力データから得られた最終的な動画像の一例を図４に示す。
【００３０】
その後、ステップ２１０では、カメラから次の画像が読み取り、背景画像とのバイナリ差分画像を作成する。
【００３１】
ステップ２１１では、ステップ２１０で作成した差分画像から人物がカメラの視野からいなくなったかどうかを検出する。この検出により、人物がまだいる場合にはステップ２０３に戻り、収得した画像を処理する。
【００３２】
ステップ２１２では、人物がカメラの視野からいなくなった場合、その時点までの一連の顔画像の系列を動画像配列として、動画像ファイルに記憶（格納）する。
【００３３】
次に、本発明の重要な要素である学習装置２の動作を、図３のフローチャートを用いて以下に説明する。
【００３４】
先ず、最初のステップ３０１では、前段階で（ｍｏｖｉｅファイルとして）得られた動画像配列（顔画像が連なったもので、１つの連続した動画像に対応）をコンピューターのメモリに読み込む。前記のとおり、それぞれの動画像配列に関するカテゴリ情報は事前に与えられていない。例えば、Ｘ個の異なった人物に対するＮ個の動画像配列（Ｎ≧Ｘ）が与えられるが、どの配列がどの人物に対応するかという情報は予め与えられない。
【００３５】
ステップ３０２では、蓄積された２つの動画像配列間の距離を計算するために、一方の動画像配列のｉ番目の顔画像と、もう一方の動画像配列に含まれるｊ番目の顔画像との間で距離値が計算され、行列Ｄ_ijに納める。そのとき、２枚の顔画像間の距離の計算の仕方としては、さまざまのものが考えられるが、ここではその１例として動画像をピクセル毎に引き算し、絶対値をとり、結果がある閾値より大きければ「１」と見なし、小さければ「０」とする。このように得られたバイナリ画像の中に含まれる「１」の総数を両顔画像の間の距離値として定義される。
【００３６】
ステップ３０３では、行列Ｄ_ijを用いて各動画像配列と他の全ての配列との間の最短距離を選んで最短距離行列Ｍに納める。例えば、行列Ｍの中のＭ_ijは配列ｉと配列ｊとの間の最短距離を意味している。両動画像配列には多数の顔画像が入っているが、その中の一番近い顔の対を代表として選び、それらの間の距離がＭ_ijとなる。
【００３７】
ステップ３０４では、行列Ｍの値に基づいて多数の初期サブクラスターグラフを形成する。各配列が一点のノードとして表示され、各ノードをそれに一番近いノードだけと（エッジで）結合され、このようにして出来上がった各々のグラフを初期サブクラスターグラフと名付ける。二つのノードをつなぐエッジを２種類定義し、片方はｃｏｎｓｉｓｔｅｎｔエッジと呼んで同じカテゴリのノード（動画像配列）をつなぐのに用いて、もう一方はｉｎｃｏｎｓｉｓｔｅｎｔエッジと呼んで異なるカテゴリのノードをつなぐのに用いる。但し、ステップ３０４で形成される初期サブクラスターグラフの中のエッジは全てｃｏｓｉｓｔｅｎｔエッジである。
【００３８】
次のステップ３０５では、初期サブクラスターをノードの数によってソートし、小さいサブクラスター（含まれるノードの数が少ない）から開始して、他のサブクラスターの内一番近いサブクラスターとエッジでつなげられる。そのときののエッジの種類は後述するルールによって決められる。二つのサブクラスターが結合されると同時に併合され、新しいサブクラスターが生成される。この過程は再帰的に、全てのサブクラスターが一つだけの大きいクラスターに併合されるまで繰り返される。二つのサブクラスターを最短距離エッジ（例えばノードＡとノードＢの間のエッジ）でつなぐときに、エッジの種類は次のようなルール（ｃｏｎｓｉｓｔｅｎｃｙ基準）で決定される。ノードＡとノードＢとの間の長さＬのエッジがｃｏｓｉｓｔｅｎｔであるために同時に満たすべき二つの条件を次に示す。
【００３９】
条件１「ノードＡとｃｏｎｓｉｓｔｅｎｔエッジで直接につながっているノードの内、ノードＡから最も遠いノード（ＦＮ）との間の距離Ｌ１がＣｘＬより小さいこと。ただし、ここでＣは定数である。」
条件２「ノードＢとｃｏｎｓｉｓｔｅｎｔエッジで直接につながっているノードの内、ノードＢから最も遠いノード（ＦＮ）との間の距離Ｌ２がＣｘＬより小さいこと。ただし、ここでＣは定数である。」
図５は、ノードＡ、ノードＢとそれぞれのＦＮノードの関係を示した一例である。条件１と条件２を同時に満たさないノードはｉｎｃｏｎｓｉｓｔｅｎｔエッジによって結合される。
【００４０】
最後のステップ３０６では、上記の過程によって形成されたグラフをトラバースしながらｃｏｓｉｓｔｅｎｔエッジでつながっているノードによって表示されている動画像配列を同じ顔カテゴリとして出力する。ここで、ｉｎｃｏｓｉｓｔｅｎｔエッジの役割は、異なったカテゴリに属するサブクラスターを分離することである。図６は、実際の入力データを使用したとき、上記の過程で形成されたグラフの一例を示したものである。図６では、各ノードが一つの動画像配列に相当し、文字は配列に映されている実際の人物名の頭文字であり、番号は配列番号である。実線で表示されているエッジがｃｏｎｓｉｓｔｅｎｔエッジであり、点線はｉｎｃｏｎｓｉｓｔｅｎｔエッジを表し、数字はエッジの長さ（２つのノードの間の最短距離）を表している。
【００４１】
このグラフはシステムの現在の内部状態を表し、認識過程ではその内部状態を一時的に固定したままで入力に相当するノードをステップ３０５で説明したｃｏｓｉｓｔｅｎｃｙ基準に基づいて、ｃｏｎｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジで内部状態グラフと結合し、そのエッジがｃｏｓｉｓｔｅｎｔエッジの場合、入力のカテゴリがエッジの他方にあるノードと同じカテゴリであると見なす。一方、ｉｎｃｏｓｉｓｔｅｎｔエッジの場合、新しい（まだ登録されていない）カテゴリに属することとなる。
【００４２】
追加学習は下記のアルゴリズムによって行われる。但し、システムの現在の内部状態がＮ個のノードから構成されたグラフによって表現されると仮定し、以下では、図７に示す内部状態で入力されるＮ＋１個目のノードに対して追加学習を行うこととする。
【００４３】
ステップ１：新しく入力されたノードに対して、行列Ｍで新しい（Ｎ＋１）個目の列を計算し追加する。
【００４４】
ステップ２：新しいノードの最も近いノードｋを見つけて、前述のｃｏｎｓｉｓｔｅｎｃｙ基準に基づいていて両ノードをｃｏｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジでつなぐ。ｉｎｃｏｎｓｉｓｔｅｎｔエッジの場合、ステップ４に進む。
【００４５】
ステップ３：ノードｋと同じクラスターに属する全てのノードｌに対して、ノードｌと新しいノードとの間の距離Ｄ₁、そしてノードｌとそれから最も近いノードｎとの間の距離Ｄ₂を計算する。もし、Ｄ１＜Ｄ２であれば、ノードｌと新しいノードをｃｏｎｓｉｓｔｅｎｔエッジでつなぐ。その後、ノードｌと直接ｃｏｓｉｓｔｅｎｔエッジでつながっているすべてのノードｍに対して（ノードｎも含めて）、ｌとｍとの間の距離Ｄ（ｌ，ｍ）とｍと新しいノードとの間の距離Ｄ（Ｎ＋１，ｍ）を計算し、Ｄ（ｌ，ｍ）＜Ｄ（Ｎ＋１，ｍ）であれば、ｍとｌとの間のエッジを削除し、ｍと新しいノードとの間に新しいエッジ（もしそんなエッジがまだ存在しなければ）を入れる。
【００４６】
ステップ４：新しいノードが属するクラスターＣ_k（図８参照）と、ｉｎｃｏｎｓｉｓｔｅｎｔエッジＥ_ikによってつながっているクラスターＣ_i（そういうクラスターが存在すれば）のすべてのノードＰ_iに対して、新しいノードとの間の距離Ｄ（Ｐ_i，Ｎ＋１）を計算し、Ｄ（Ｐ_i，Ｎ＋１）＜Ｅ_ikであれば、Ｅ_ikを削除し、Ｐ_iと新しいノードとの間に新しいエッジＥ_ikを挿入する（そのエッジの種類をｃｏｓｉｓｔｅｎｃｙ基準に基づいて決める）。その後、ｉｎｃｏｎｓｉｓｔｅｎｔエッジＥ_ijでつながっているクラスターＣ_iとＣ_jが存在したら（但し、Ｃ_iがＣ_kとつながってもいいが、Ｃ_jはＣ_kとはつながっていない場合に限る）、クラスターＣ_jに属するすべてのノードＰ_jに対して、新しいノードとの間の距離Ｄ（Ｐ_j，Ｎ＋１）を計算し、Ｄ（Ｐ_i，Ｎ＋１）＜Ｅ_ijが満たされば、Ｅ_ijを削除し、Ｐ_jと新しいノードの間に新しいエッジＥ_jkを入れる（そのエッジの種類をｃｏｎｓｉｓｔｅｎｃｙ基準に基づいて決める）。しかし、Ｃ_iがＣ_kにつながっていなければ、この操作をＣ_iまたはＣ_jが他のクラスターから切断されない場合に限って行う。
【００４７】
上述の追加学習法は、最初の段階である程度データが集まってから学習を行い、その後、追加学習を行う仕組みになっているが、最初から逐次的に学習を行うこと（所謂オンライン学習）も考えられる。オンラインバージョンの場合、最初は２つのノードだけが与えられ、それらがいくら離れていてもｃｏｎｓｉｓｔｅｎｔエッジでつなぎ、その後、新しく入力されるノードが上記の追加学習法に基づいてグラフに挿入される。ある程度データが集まってから定期的に全てのノードに対してｃｏｎｓｉｓｔｅｎｃｙチェックを行い、ｃｏｎｓｉｓｔｅｎｃｙ基準を満たさないノードの間のエッジをｉｎｃｏｎｓｉｓｔｅｎｔエッジに置き換える。
【００４８】
提案する学習法の特徴としては、追加学習の場合システムの内部状態を最初から再学習する必要がなく、新しい部分だけに関連ある結合が更新・追加され、関連のない部分がそのままに残る。これによって、生物などが行うような自然な追加学習が実現できるだけではなく、莫大な計算量の節約も可能である。特に、実環境では、システムの内部状態が長時間のデータに基づいて形成されるので、本方法の利用が有利である。
【００４９】
なお、上記の実施形態における学習装置２は、例えば図６の状態で新たな対象情報が入力されたときに、そのカテゴリ分けには全てのノードとの比較を行うことで対象情報のグラフでの位置を自動計算によって決定する場合を示すが、入力される対象情報があらかじめどのカテゴリに含まれるかの情報を人手で一部与えることにより、学習をより正確にすることができる。例えば、与えられた対象情報がＦで表されているカテゴリである場合には、自動的なカテゴリ分けをすることなく、直接に図６におけるＦ１〜Ｆ６のノードとの間の距離を計算し、正確なグラフ位置を決定することができる。
【００５０】
また、ｃｏｎｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジを用いてサブクラスタを結合するにおいて、自動学習によるカテゴリ情報または人手で与えるカテゴリ情報を基にｃｏｎｓｉｓｔｅｎｔ／ｉｎｃｏｎｓｉｓｔｅｎｔエッジを変更するステップを追加することにより、生成されたカテゴリを示すグラフが間違って繋がれた場合にその修正が可能となる。
【００５１】
また、以上までの説明では顔画像を対象とする場合を示すが、本発明はそれ以外の対象物体についても同様な処理が可能である。
【００５２】
また、図２及び図３に示した方法又は図１に示した装置の一部又は全部をコンピュータプログラムで記載してそれを実行できるようにし、それをコンピュータが読み取り可能な記録媒体、例えば、フロッピーディスク（登録商標）や、ＭＯ、ＲＯＭ、メモリカード、ＣＤ、ＤＶＤ、リムーバブルディスクなどに記録して提供し、配布することが可能である。
【００５３】
【発明の効果】
以上説明したように、本発明によれば以下の効果が得られる。
【００５４】
（１）実環境条件下で、人間によって事前に用意された情報を用いずに、コンピューターが自律的に長時間の連続データを自分のタスクの目的に応じて自己組織化的に学習し、認識することが可能になる。
【００５５】
（２）入力データから連続的に抽出される動画像の顔画像１枚１枚をテンプレートとして使用することによって、照明条件、サイズや視点角度の変動の影響が受けにくくなるので、学習／認識過程の効率を向上させることができる。
【００５６】
（３）学習途中で新しいカテゴリのデータが与えられた場合、システムを最初から再学習する必要がなく、柔軟性のある自然な追加学習や認識が可能になる。
【００５７】
（４）本発明の学習方法を利用することにより、クラス内のサンプルデータが連続的に分布し、クラス間の距離がクラス内の距離より大きい場合でも認識が可能である。
【００５８】
（５）データ収集、格納、顔領域抽出、学習、認識の全ての過程が自動的に行われるため、人手による作業を大幅に節約できる。
【００５９】
（６）本発明の学習法の特徴としては、各顔カテゴリのサンプル数が極端に異なっても問題にはならない。つまり、本発明方法はサンプルの分布の形には依存しないという利点がある。
【００６０】
（７）本発明の学習方法を用いて物体認識で使用されるさまざまな特徴の評価を行うことができる。
【図面の簡単な説明】
【図１】本発明の構成例を示すフロック図。
【図２】本発明の顔領域抽出法を説明するためのフローチャート。
【図３】本発明の学習方法を説明するためのフローチャート。
【図４】カメラから入力される画像配列から抽出された顔画像配列の例
【図５】ｃｏｓｉｓｔｅｎｃｙ基準を説明するための補助図。
【図６】本発明の学習法によって形成されたシステムの内部状態の一例。
【図７】本発明における追加学習法（ステップ１〜３）を説明するための補助図。
【図８】本発明における追加学習法（ステップ４）を説明するための補助図。
【符号の説明】
１…顔領域抽出装置（ユニット１）
２…学習装置（ユニット２）
３…認識装置（ユニット３）
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a learning technique for recognizing moving images, such as moving images of human faces, and in particular to a method for learning images in a self-organizing manner under real-environment conditions, using long-duration continuous video data for which no prior information on categories (persons' names) is given.
[0002]
[Prior art]
Face image recognition is taken as an example. It is realized by pattern matching (calculation of a distance value or similarity) between a face image model created in the learning stage and an unknown input face image.
[0003]
In conventional face recognition methods, learning has proceeded from a small number of face images that were captured under restricted environmental conditions and manually labeled with a category such as the person's name.
[0004]
However, since this approach requires manual data preparation, it is difficult to use a large amount of face data for learning.
[0005]
When face recognition is performed in a real environment using this approach, only a limited number of face images can be used for learning. It is therefore difficult to accumulate enough learning data to cope in a straightforward way with the various variations that occur in a real environment (for example, variations in lighting conditions, variations in the angle and distance of the face relative to the camera (viewpoint), and variations in the shape of the face due to facial expressions, glasses, hairstyles, and the like).
[0006]
Conventionally, this has been handled ad hoc by devising preprocessing and features, but sufficient accuracy has not been obtained. In other words, conventional face recognition has lacked the flexibility to cope with a complex and highly variable real environment.
[0007]
In face recognition as performed by humans, by contrast, the learning system is always running: whenever new data is input, it updates its internal state, and new categories can be added at any time without prior information.
[0008]
[Problems to be solved by the invention]
As described above, the recognition method based on pattern matching between a face image model and an unknown input face image requires manual data preparation, which has made it difficult to use a large amount of face data for learning. It has also been difficult to accumulate sufficient learning data and to obtain sufficient accuracy.
[0009]
The present invention proposes a method by which a computer, without using information prepared in advance by humans, autonomously learns long-duration continuous data in a real environment in a self-organizing manner according to the purpose of its own task, and can naturally add new categories in the course of learning. In other words, an object of the present invention is to propose a method for additionally learning, without supervision and without human intervention, the diverse variations contained in long-duration environmental video data.
[0010]
Another object of the present invention is to propose a learning method in which all the processes (data collection, storage, face area extraction, learning, recognition) are automatically performed and no manual work is required.
[0011]
[Means for Solving the Problems]
To achieve the above objects, the present invention uses long-duration environmental video data, automatically extracts the moving objects that appear in it, and automatically associates the extracted moving objects with one another using their information, so that the moving objects are categorized automatically without categories being assigned by hand. It also allows a part of the video data to be labeled manually, as a supplement, with the category to which the target information belongs during categorization. The invention is characterized by the following learning method and recording medium.
[0012]
(Invention of learning method)
A moving image learning method for automatically learning, without supervision, target objects that move within the video from continuously captured environmental video data, the method comprising:
a step of capturing data for which no category information on the target objects is given in advance;
a step of cutting out moving images of the target objects spatio-temporally to create moving image arrays;
a step of calculating, for all combinations of the moving image arrays, the distance between moving image arrays on the basis of the distances between the images contained in them;
a step of forming initial sub-clusters using the distances between the moving image arrays;
a step of joining all the sub-clusters using consistent/inconsistent edges; and
a step of creating a moving image dictionary for each of the joined categories.
[0013]
The method is further characterized by comprising a step of giving prior categorization information for some of the cut-out moving images.
[0014]
The method is further characterized in that the step of joining all the sub-clusters using the consistent/inconsistent edges comprises a step of giving category information and a step of changing the consistent/inconsistent edges on the basis of the given category information.
[0015]
(Invention of recording medium)
A computer-readable recording medium on which is recorded a program for causing a computer to execute a moving image learning method for automatically learning, without supervision, target objects that move within the video from continuously captured environmental video data,
the program recording: a process of capturing data for which no category information on the target objects is given in advance; a process of cutting out moving images of the target objects spatio-temporally to create moving image arrays; a process of calculating, for all combinations of the moving image arrays, the distance between moving image arrays on the basis of the distances between the images contained in them; a process of forming initial sub-clusters using the distances between the moving image arrays; a process of joining all the sub-clusters using consistent/inconsistent edges; and a process of creating a moving image dictionary for each of the joined categories.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a block diagram showing an embodiment of the present invention. The embodiment consists of three units: a face area extraction device (unit 1), a learning device (unit 2), and a recognition device (unit 3).
[0017]
The face area extraction device 1 automatically extracts the face area of each person who appears in long-duration continuous environmental video data, normalizes its size, creates the series of face images from the moment the person appears in the image until the person disappears (this series is called a moving image array), stores it in a file, and provides it as input to the learning and recognition devices 2 and 3 in the subsequent stages.
[0018]
The learning device 2 uses the learning algorithm described later to organize, category by category, a large number of moving image arrays that carry no prior category information. The internal state of the system is updated accordingly. Here, the internal state means a representation that indicates which moving image arrays belong to the same category.
[0019]
The recognition device 3 performs face recognition by matching the system's latest moving image dictionary (the registered moving image arrays) against the input data to be recognized (a moving image array cut out from the input moving image by the face area extraction device), and outputs the recognition result.
[0020]
First, an example of the face area extraction device, which serves as the preparation stage for learning and recognition, is described with reference to the flowchart of FIG. 2.
[0021]
In step 201, environmental video data is input continuously from a fixed camera; two consecutive images are subtracted and the result is thresholded to create a binary difference image.
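The differencing in step 201 and the change test in step 202 can be sketched as follows. This is a minimal illustration, not the patented implementation: frames are assumed to be lists of rows of grayscale values, and the names `binary_difference`, `has_changed`, `threshold`, and `min_changed_pixels` are hypothetical, since the text gives no concrete thresholds.

```python
def binary_difference(prev_frame, next_frame, threshold):
    """Return a 0/1 image: 1 where the absolute pixel change exceeds threshold."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(prev_frame, next_frame)
    ]


def has_changed(diff, min_changed_pixels):
    """Assumed change test: enough '1' pixels in the binary difference image."""
    return sum(sum(row) for row in diff) >= min_changed_pixels


prev_frame = [[10, 10, 10], [10, 10, 10]]
next_frame = [[10, 200, 10], [12, 10, 90]]
diff = binary_difference(prev_frame, next_frame, threshold=30)
print(diff, has_changed(diff, min_changed_pixels=2))
```

How many changed pixels correspond to "the appearance of a person" depends on the camera and scene, so `min_changed_pixels` would have to be tuned in practice.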
[0022]
In step 202, it is determined from the created difference image whether the observed environment has changed. If there is no change corresponding to the appearance of a person, steps 201 and 202 are repeated until there is a change, but if there is a change, the process proceeds to step 203.
[0023]
In step 203, a multi-resolution image pyramid is created by repeatedly subsampling the binary difference image created in the previous step. Subsampling replaces each 2 × 2 pixel region of the original image with a single pixel of value “1” or “0” (depending on the number of “1”s in the region). Using multi-resolution images makes it possible to calculate the coordinates of the face area more stably and accurately: a low-resolution image is less affected by noise, so the face area can be determined relatively stably but not precisely, whereas a high-resolution image allows the face area to be determined precisely but is susceptible to noise. The face area extraction algorithm described below is run in parallel on the image at each resolution, and the individual results are integrated in step 207, as described later.
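The 2 × 2 subsampling of step 203 might look like the sketch below. The text only says that the output pixel depends on the number of “1”s in the region, so the rule used here (at least `min_ones` of the four pixels are “1”, default 2) is an assumption, as are the function names.

```python
def subsample(binary, min_ones=2):
    """Halve the resolution: each 2x2 block becomes 1 if it holds >= min_ones ones."""
    rows, cols = len(binary) // 2, len(binary[0]) // 2
    return [
        [
            1
            if (binary[2 * r][2 * c] + binary[2 * r][2 * c + 1]
                + binary[2 * r + 1][2 * c] + binary[2 * r + 1][2 * c + 1]) >= min_ones
            else 0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(binary, levels):
    """Return [full resolution, half, quarter, ...] with at most `levels` entries."""
    pyramid = [binary]
    for _ in range(levels - 1):
        top = pyramid[-1]
        if len(top) < 2 or len(top[0]) < 2:
            break  # image too small to subsample again
        pyramid.append(subsample(top))
    return pyramid


image = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
for level in build_pyramid(image, levels=3):
    print(level)
```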
[0024]
In the next step 204, the following filter is applied to all the images in order to fill the “holes” of value “0” that form inside the person area under the influence of noise caused by variations in lighting conditions, the texture of clothing, and the like.
[0025]
A window of 3 × 3 pixels is moved over the whole image in steps of one pixel; if pixels of value “1” are present in at least two different rows and, at the same time, in at least two different columns of the window, the pixel at the center of the window is replaced with “1”. This operation is performed several times (or until no further change occurs) in the left-to-right, right-to-left, top-to-bottom, and bottom-to-top directions.
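The hole-filling filter can be sketched as below. Several details are assumptions made only for illustration: the image is updated in place during each scan (so the scan direction matters), border pixels are left untouched, the four directions are interpreted as four scan orders, and `max_passes` caps the number of repetitions.

```python
def window_has_support(img, r, c):
    """True if the 3x3 window centred on (r, c) has ones in >= 2 rows and >= 2 columns."""
    ones = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if img[r + dr][c + dc]]
    return len({dr for dr, _ in ones}) >= 2 and len({dc for _, dc in ones}) >= 2


def fill_holes(binary, max_passes=10):
    """Return a copy of the binary image with '0' holes filled by the 3x3 rule."""
    img = [row[:] for row in binary]
    inner_rows = range(1, len(img) - 1)
    inner_cols = range(1, len(img[0]) - 1)
    scans = [
        [(r, c) for r in inner_rows for c in inner_cols],            # left to right
        [(r, c) for r in inner_rows for c in reversed(inner_cols)],  # right to left
        [(r, c) for c in inner_cols for r in inner_rows],            # top to bottom
        [(r, c) for c in inner_cols for r in reversed(inner_rows)],  # bottom to top
    ]
    for _ in range(max_passes):
        changed = False
        for order in scans:
            for r, c in order:
                if img[r][c] == 0 and window_has_support(img, r, c):
                    img[r][c] = 1
                    changed = True
        if not changed:
            break  # stable: no pixel changed in a full set of scans
    return img


ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
for row in fill_holes(ring):
    print(row)
```

Note that the rule fills holes but can also grow a region outward where two diagonal neighbours are set, which is one reason the number of passes is bounded here.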
[0026]
In the next step 205, the center of gravity and the x histogram (a function giving the number of “1”s in each column of the image) of the image created in the previous step are calculated, and the region around the center of gravity that contains, for example, 90% of the total energy of the x histogram is cut out as the person area. This assumes that the area of the person is much larger than the area of other objects moving in the background, shadows, noise, and the like.
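One possible reading of step 205 is sketched below. The text does not say how the region is grown around the center of gravity, so the strategy here (start at the centroid column and repeatedly extend toward the neighbouring column with the larger count until the requested fraction is covered) is an assumption, and `person_columns` and `fraction` are hypothetical names.

```python
def x_histogram(binary):
    """Number of '1' pixels in each column of a binary image."""
    return [sum(column) for column in zip(*binary)]


def person_columns(binary, fraction=0.9):
    """Return (left, right) column bounds holding `fraction` of the histogram mass."""
    hist = x_histogram(binary)
    total = sum(hist)
    if total == 0:
        return None  # nothing moved in this image
    centroid = round(sum(x * n for x, n in enumerate(hist)) / total)
    left = right = centroid
    covered = hist[centroid]
    while covered < fraction * total:
        grow_left = hist[left - 1] if left > 0 else -1
        grow_right = hist[right + 1] if right < len(hist) - 1 else -1
        if grow_left >= grow_right:
            left -= 1
            covered += grow_left
        else:
            right += 1
            covered += grow_right
    return left, right


image = [
    [1, 0, 0, 1, 1, 1, 0, 0, 0, 0],  # the lone 1 in column 0 plays the role of noise
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
]
print(x_histogram(image), person_columns(image))
```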
[0027]
In step 206, an x histogram and a y histogram are newly calculated on the person region extracted in step 205, and the contents are analyzed to determine the face coordinates.
[0028]
In step 207, the median of the face area coordinates calculated at each resolution is obtained to obtain the final face area coordinates.
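The median fusion of step 207 can be illustrated as follows. The sketch assumes that each pyramid level halves the resolution, so coordinates found at level k are multiplied by 2^k before the per-coordinate median is taken; the text does not spell out this rescaling, and `fuse_face_boxes` is a hypothetical name.

```python
from statistics import median


def fuse_face_boxes(boxes_per_level):
    """boxes_per_level[k] is (left, top, right, bottom) found at pyramid level k."""
    scaled = [
        tuple(value * 2 ** level for value in box)
        for level, box in enumerate(boxes_per_level)
    ]
    # Median of each coordinate across all resolutions.
    return tuple(median(coords) for coords in zip(*scaled))


boxes = [(40, 20, 80, 70), (21, 10, 40, 36), (10, 6, 20, 17)]
print(fuse_face_boxes(boxes))
```

Taking the median rather than the mean keeps a single badly wrong resolution from pulling the final coordinates away from the others.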
[0029]
In step 208, the face image extracted as described above is normalized to a fixed size. FIG. 4 shows an example of the final moving image obtained from actual input data.
[0030]
Thereafter, in step 210, the next image is read from the camera, and a binary difference image with the background image is created.
[0031]
In step 211, the difference image created in step 210 is used to detect whether the person has left the camera's field of view. If the person is still present, the process returns to step 203 and the acquired image is processed.
[0032]
In step 212, once the person has left the camera's field of view, the series of face images up to that point is stored in a moving image file as a moving image array.
[0033]
Next, the operation of the learning device 2, which is a key element of the present invention, is described below with reference to the flowchart of FIG. 3.
[0034]
First, in step 301, the moving image arrays obtained in the previous stage as movie files (each a sequence of face images corresponding to one continuous moving image) are read into the computer's memory. As described above, no category information is given in advance for any of the moving image arrays. For example, N moving image arrays (N ≧ X) of X different persons are given, but which array corresponds to which person is not known in advance.
[0035]
In step 302, in order to calculate the distance between two stored moving image arrays, a distance value is calculated between the i-th face image of one moving image array and the j-th face image of the other, and stored in the matrix D_ij. The distance between two face images can be calculated in various ways; as one example, the two images are subtracted pixel by pixel, the absolute value is taken, and the result is regarded as “1” if it is larger than a certain threshold and as “0” if it is smaller. The total number of “1”s in the binary image thus obtained is defined as the distance between the two face images.
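The example image distance of step 302 can be written compactly as below. Face images are assumed to be equally sized lists of rows of grayscale values (they were normalized in step 208); `face_distance`, `frame_distance_matrix`, and the `threshold` value are illustrative names and parameters, not taken from the source.

```python
def face_distance(face_a, face_b, threshold):
    """Count the pixels whose absolute difference exceeds the threshold."""
    return sum(
        1
        for row_a, row_b in zip(face_a, face_b)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > threshold
    )


def frame_distance_matrix(seq_a, seq_b, threshold):
    """D[i][j]: distance between face i of one array and face j of the other."""
    return [[face_distance(fa, fb, threshold) for fb in seq_b] for fa in seq_a]


face_1 = [[0, 100], [50, 50]]
face_2 = [[0, 0], [50, 90]]
print(face_distance(face_1, face_2, threshold=30))
print(frame_distance_matrix([face_1, face_2], [face_2], threshold=30))
```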
[0036]
In step 303, the matrix D_ij is used to select the shortest distance between each moving image array and every other array, and these are stored in the shortest-distance matrix M. For example, the element M_ij of the matrix M is the shortest distance between array i and array j. Both moving image arrays contain many face images; the closest pair of faces among them is chosen as the representative, and the distance between that pair becomes M_ij.
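Step 303 reduces each pairwise matrix D to its minimum entry. A minimal sketch follows, in which `image_distance` is any two-argument distance function supplied by the caller (for instance the pixel-count distance of step 302); the function names are assumptions made for illustration.

```python
def sequence_distance(d_matrix):
    """Shortest image-to-image distance found in one pairwise matrix D."""
    return min(min(row) for row in d_matrix)


def shortest_distance_matrix(sequences, image_distance):
    """M[i][j]: distance between the closest pair of images of arrays i and j."""
    n = len(sequences)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [[image_distance(a, b) for b in sequences[j]] for a in sequences[i]]
            m[i][j] = m[j][i] = sequence_distance(d)
    return m


# Toy data: each "image" is a single number and the distance is |a - b|.
toy_sequences = [[1, 2, 3], [10, 5], [4]]
print(shortest_distance_matrix(toy_sequences, lambda a, b: abs(a - b)))
```

Because every image of one array is compared with every image of the other, the cost of one entry of M grows with the product of the two array lengths.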
[0037]
In step 304, a number of initial sub-cluster graphs are formed on the basis of the values of the matrix M. Each array is represented as a node, and each node is joined (by an edge) only to the node nearest to it; each of the graphs that results is called an initial sub-cluster graph. Two kinds of edge are defined for joining two nodes: a consistent edge, which joins nodes (moving image arrays) of the same category, and an inconsistent edge, which joins nodes of different categories. All the edges in the initial sub-cluster graphs formed in step 304, however, are consistent edges.
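The formation of initial sub-clusters in step 304 can be sketched as nearest-neighbour linking followed by a connected-components pass. The union-find bookkeeping and the tie-breaking rule (the lowest-numbered nearest node wins) are implementation choices assumed here, not details given in the text.

```python
def initial_subclusters(m):
    """Link every node to its nearest node; return (edges, sub-clusters)."""
    n = len(m)
    edges = set()
    for i in range(n):
        nearest = min((j for j in range(n) if j != i), key=lambda j: m[i][j])
        edges.add((min(i, nearest), max(i, nearest)))

    parent = list(range(n))  # union-find forest over the nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(edges), sorted(clusters.values())


# Toy distances: five nodes placed on a line at positions 0, 1, 2, 10, 11.
positions = [0, 1, 2, 10, 11]
m = [[abs(p - q) for q in positions] for p in positions]
print(initial_subclusters(m))
```

All edges returned here would be consistent edges, as stated for step 304.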
[0038]
In the next step 305, the initial sub-clusters are sorted by number of nodes and, starting from the smallest sub-cluster (the one containing the fewest nodes), each is joined by an edge to the nearest of the other sub-clusters. The type of that edge is determined by the rule described below. As soon as two sub-clusters are joined they are merged, producing a new sub-cluster. This process is repeated recursively until all the sub-clusters have been merged into a single large cluster. When two sub-clusters are joined by their shortest-distance edge (for example, the edge between node A and node B), the type of the edge is determined by the following rule (the consistency criterion). For the edge of length L between node A and node B to be consistent, the following two conditions must both be satisfied.
[0039]
Condition 1: “Among the nodes directly connected to node A by consistent edges, the distance L1 between node A and the node farthest from node A (FN) is smaller than C × L, where C is a constant.”
Condition 2: “Among the nodes directly connected to node B by consistent edges, the distance L2 between node B and the node farthest from node B (FN) is smaller than C × L, where C is a constant.”
FIG. 5 shows an example of the relationship between node A, node B, and their respective FN nodes. Nodes for which condition 1 and condition 2 are not both satisfied are joined by an inconsistent edge.
[0040]
In the final step 306, the graph formed by the above process is traversed, and the moving image arrays represented by nodes connected through consistent edges are output as belonging to the same face category. The role of the inconsistent edges here is to separate sub-clusters that belong to different categories. FIG. 6 shows an example of the graph formed by the above process from actual input data. In FIG. 6, each node corresponds to one moving image array; the letter is the initial of the name of the person actually shown in the array, and the number is the array number. Edges drawn as solid lines are consistent edges, dotted lines are inconsistent edges, and the figure on each edge is its length (the shortest distance between the two nodes).
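The traversal of step 306 amounts to finding the connected components of the graph when only consistent edges are followed. The sketch below assumes a hypothetical edge representation, `(a, b, kind)` with `kind` equal to `"consistent"` or `"inconsistent"`; the source does not prescribe any data structure.

```python
from collections import defaultdict


def categories(num_nodes, edges):
    """Group nodes that are reachable from one another through consistent edges."""
    adjacency = defaultdict(list)
    for a, b, kind in edges:
        if kind == "consistent":  # inconsistent edges separate categories
            adjacency[a].append(b)
            adjacency[b].append(a)

    seen, result = set(), []
    for start in range(num_nodes):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one category
            node = stack.pop()
            group.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(sorted(group))
    return result


graph_edges = [
    (0, 1, "consistent"),
    (1, 2, "consistent"),
    (2, 3, "inconsistent"),
    (3, 4, "consistent"),
]
print(categories(5, graph_edges))
```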
[0041]
This graph represents the current internal state of the system. In the recognition process, the internal state is temporarily held fixed, and the node corresponding to the input is joined to the internal-state graph by a consistent or inconsistent edge according to the consistency criterion described for step 305. If the edge is consistent, the input is regarded as belonging to the same category as the node at the other end of the edge; if it is inconsistent, the input belongs to a new (not yet registered) category.
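The recognition decision can be outlined as follows. To keep the sketch independent of how the consistency criterion is implemented, the criterion is passed in as a caller-supplied function `edge_is_consistent(node, length)`; that signature, the `node_category` list, and the simple length threshold used in the example are all hypothetical stand-ins rather than the criterion defined in the text.

```python
def recognize(distances_to_nodes, node_category, edge_is_consistent):
    """Return the category of the nearest registered node, or None for a new category."""
    nearest = min(range(len(distances_to_nodes)), key=distances_to_nodes.__getitem__)
    if edge_is_consistent(nearest, distances_to_nodes[nearest]):
        return node_category[nearest]
    return None  # inconsistent edge: the input belongs to an unregistered category


labels = ["F", "F", "T"]


def short_enough(node, length):
    # Stand-in test only; a real system would apply the consistency criterion here.
    return length < 10


print(recognize([12, 3, 40], labels, short_enough))
print(recognize([50, 60, 45], labels, short_enough))
```

The internal-state graph itself is not modified here, matching the statement that the internal state is held fixed during recognition.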
[0042]
Additional learning is performed by the following algorithm. It is assumed that the current internal state of the system is represented by a graph consisting of N nodes; below, additional learning is performed for an (N+1)-th node that is input in the internal state shown in FIG. 7.
[0043]
Step 1: A new (N + 1) th column is calculated and added in the matrix M for the newly input node.
[0044]
Step 2: Find the closest node k of the new node and connect both nodes with a consistent / inconsistent edge based on the above consistency criteria. If it is an inconsistent edge, go to step 4.
[0045]
Step 3: For every node l belonging to the same cluster as node k, calculate the distance D1 between node l and the new node, and the distance D2 between node l and its nearest node n. If D1 < D2, join node l and the new node with a consistent edge. Then, for every node m directly connected to node l by a consistent edge (including node n), calculate the distance D(l, m) between l and m and the distance D(N+1, m) between m and the new node; if D(l, m) < D(N+1, m), delete the edge between m and l and insert a new edge between m and the new node (unless such an edge already exists).
[0046]
Step 4: For every node P_i of a cluster C_i (if such a cluster exists) that is connected by an inconsistent edge E_ik to the cluster C_k to which the new node belongs (see FIG. 8), calculate the distance D(P_i, N+1) to the new node; if D(P_i, N+1) < E_ik, delete E_ik and insert a new edge E_ik between P_i and the new node (the type of the edge is determined by the consistency criterion). Then, if there exist clusters C_i and C_j connected by an inconsistent edge E_ij (C_i may be connected to C_k, but only in the case where C_j is not connected to C_k), calculate, for every node P_j belonging to cluster C_j, the distance D(P_j, N+1) to the new node; if D(P_i, N+1) < E_ij is satisfied, delete E_ij and insert a new edge E_jk between P_j and the new node (the type of the edge is determined by the consistency criterion). However, if C_i is not connected to C_k, this operation is performed only when it does not disconnect C_i or C_j from the other clusters.
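Steps 1 and 2 above start by growing the matrix M and locating the nearest node k. A minimal sketch of that bookkeeping is given below; it covers only the matrix update and the nearest-node search, not the edge rewiring of steps 3 and 4, and the function names are assumptions made for illustration.

```python
def add_node(m, distances_to_existing):
    """Append row and column N+1 to the symmetric matrix M; return the new index."""
    if len(distances_to_existing) != len(m):
        raise ValueError("one distance per existing node is required")
    for row, distance in zip(m, distances_to_existing):
        row.append(distance)
    m.append(list(distances_to_existing) + [0])
    return len(m) - 1


def nearest_existing_node(m, new_index):
    """Index k of the node closest to the newly added node."""
    return min(
        (j for j in range(len(m)) if j != new_index),
        key=lambda j: m[new_index][j],
    )


m = [[0, 2], [2, 0]]
new_index = add_node(m, [5, 1])
print(m, nearest_existing_node(m, new_index))
```

Only one new row and column are computed, so the existing entries of M are reused unchanged, in line with the idea that unrelated parts of the internal state are left as they are.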
[0047]
The additional learning method described above first learns after a certain amount of data has been collected and then performs additional learning, but learning sequentially from the very beginning (so-called online learning) is also conceivable. In the online version, only two nodes are given at first and are joined by a consistent edge no matter how far apart they are; each newly input node is then inserted into the graph by the additional learning method above. Once a certain amount of data has been collected, a consistency check is performed periodically on all nodes, and edges between nodes that do not satisfy the consistency criterion are replaced with inconsistent edges.
[0048]
A feature of the proposed learning method is that additional learning does not require the internal state of the system to be relearned from scratch: only the connections relevant to the new part are updated or added, and the unrelated parts are left as they are. This not only realizes natural additional learning of the kind performed by living creatures but also saves an enormous amount of computation. The method is particularly advantageous in a real environment, where the internal state of the system is built up from long-duration data.
[0049]
In the above embodiment, when new target information is input in the state of FIG. 6, for example, the learning device 2 categorizes it by comparing it with all the nodes, so that the position of the target information in the graph is determined by automatic calculation. Learning can be made more accurate, however, by manually supplying, for part of the input, information on which category the target information belongs to. For example, when the given target information belongs to the category denoted F, the distances to the nodes F1 to F6 in FIG. 6 can be calculated directly, without automatic categorization, and the correct graph position determined.
[0050]
In addition, when sub-clusters are joined using consistent/inconsistent edges, adding a step that changes the consistent/inconsistent edges on the basis of category information obtained by automatic learning or supplied manually makes it possible to correct the graph representing the generated categories if it has been connected incorrectly.
[0051]
Further, the above description shows a case where a face image is a target, but the present invention can perform the same processing for other target objects.
[0052]
Further, part or all of the methods shown in FIG. 2 and FIG. 3, or of the device shown in FIG. 1, can be written as a computer program so that it can be executed, and the program can be recorded on a computer-readable recording medium, such as a floppy disk (registered trademark), MO, ROM, memory card, CD, DVD, or removable disk, and provided and distributed.
[0053]
[Effect of the invention]
As described above, according to the present invention, the following effects can be obtained.
[0054]
(1) Under real-environment conditions, the computer can autonomously learn and recognize long-duration continuous data according to the purpose of its task, without using information prepared in advance by humans.
[0055]
(2) Since each face image of the moving images extracted continuously from the input data is used as a template, the method is less susceptible to variations in illumination conditions, size, and viewpoint angle, so the efficiency of the learning and recognition processes can be improved.
[0056]
(3) When data of a new category is given in the middle of learning, it is not necessary to re-learn the system from the beginning, and flexible, natural additional learning and recognition are possible.
[0057]
(4) By using the learning method of the present invention, it is possible to recognize even when sample data in a class is continuously distributed and the distance between classes is larger than the distance in the class.
[0058]
(5) Since all the processes of data collection, storage, face area extraction, learning, and recognition are automatically performed, it is possible to save a lot of manual work.
[0059]
(6) As a feature of the learning method of the present invention, there is no problem even if the number of samples of each face category is extremely different. In other words, the method of the present invention has the advantage that it does not depend on the shape of the sample distribution.
[0060]
(7) Various features used in object recognition can be evaluated using the learning method of the present invention.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration example of the present invention.
FIG. 2 is a flowchart for explaining a face area extraction method according to the present invention.
FIG. 3 is a flowchart for explaining the learning method of the present invention.
FIG. 4 is an example of a face image array extracted from an image array input from a camera.
FIG. 5 is an auxiliary diagram for explaining consistency standards.
FIG. 6 shows an example of an internal state of a system formed by the learning method of the present invention.
FIG. 7 is an auxiliary diagram for explaining an additional learning method (steps 1 to 3) in the present invention.
FIG. 8 is an auxiliary diagram for explaining an additional learning method (step 4) in the present invention.
[Explanation of symbols]
1 ... Face area extraction device (unit 1)
2 ... Learning device (unit 2)
3 ... Recognition device (unit 3)

Claims

A moving image learning method that automatically learns, without a teacher, a target object moving within an image from continuously captured environment video data, the method comprising:
capturing data for which category information of the target object is not given in advance;
cutting out moving images of the target object in time and space to create moving image arrays;
calculating, for all combinations of the moving image arrays, a distance between moving image arrays based on the distances between the images included in the moving image arrays;
forming initial sub-clusters using the distances between the moving image arrays;
combining all sub-clusters using consistent / inconsistent edges; and
creating a moving image dictionary for each of the combined categories.

The moving image learning method according to claim 1, further comprising a step of giving prior information for categorization for a part of the cut-out moving images.

The moving image learning method according to claim 1 or 2, wherein the step of combining all sub-clusters using the consistent / inconsistent edges comprises providing category information and changing the consistent / inconsistent edges based on the given category information.

A computer-readable recording medium on which is recorded a program for causing a computer to execute a moving image learning method that automatically learns, without a teacher, a target object moving within an image from continuously captured environment video data,
the method comprising: a process of capturing data for which category information of the target object is not given in advance; a process of cutting out moving images of the target object in time and space to create moving image arrays; a process of calculating, for all combinations of the moving image arrays, a distance between moving image arrays based on the distances between the images included in the moving image arrays; a process of forming initial sub-clusters using the distances between the moving image arrays; a process of combining all sub-clusters using consistent / inconsistent edges; and a process of creating a moving image dictionary for each of the combined categories,
the recording medium being characterized in that these processes are recorded on it as the program.