JP4281338B2

JP4281338B2 - Image detection apparatus and image detection method

Info

Publication number: JP4281338B2
Application number: JP2002339654A
Authority: JP
Inventors: 寛司三原
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-11-22
Filing date: 2002-11-22
Publication date: 2009-06-17
Anticipated expiration: 2022-11-22
Also published as: JP2004171490A

Description

【０００１】
【発明の属する技術分野】
本発明は，画像内の所定の対象物を検出する画像検出装置及び画像検出方法に関し，例えばビデオ映像から人間の顔を検出するのに好適な画像検出装置及び画像検出方法に関する。
【０００２】
【従来の技術】
従来，ビデオ映像等の画像から，所定の対象物，例えば人間の顔等を検出・認識する技術が提案されており（例えば，特許文献１参照），監視システム，ロボット装置などへの応用が考えられている。この分野の検出・認識方法としては，サポートベクタマシン（ＳＶＭ：ＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅ）のようなテンプレートマッチングの手法を使用する方法が知られている。なお，本願発明に関連する先行技術文献情報には，次のものがある。
【０００３】
【特許文献１】
特開２００１−２１６５１５号公報
【０００４】
【発明が解決しようとする課題】
ところで，上記のような顔検出・認識を行う装置においては，パターン認識アルゴリズムに要する演算量が膨大になる。そこで，演算処理を軽減しつつ，実用上十分な検出精度を実現することが重要となり，要望されている。従来，ビデオ映像から，縮小スケール画像からなる所定サイズのウィンドウ画像を作り，大まかに顔画像であるか否かを判断して，明らかに顔画像でないウィンドウ画像を除去することにより全体の演算量の軽減を図る手法が考案されている。しかしながら，この手法だけでは，連続した動画像から顔を検出するような場合，時間方向の冗長度を利用しておらず，演算量の軽減が十分とはいえなかった。
【０００５】
本発明は上述した問題に鑑みてなされたもので，画像内の所定の対象物を検出するにあたり，演算量の軽減を推進可能な画像検出装置及び画像検出方法を提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記課題を解決するために，本発明の第１の観点によれば，画像の中における所定の対象物の位置を検出する画像検出装置であって，連続した複数のフレームのなかで，前記所定の対象物を全く探索しない非探索フレームと，前記所定の対象物をフレーム内全ての領域にわたって探索する全探索フレームと，を設定し，前記非探索フレームの間に前記全探索フレームを所定の周期で設ける検出手段を具備することを特徴とする画像検出装置が提供される。
【０００７】
本発明では，全てのフレームに対して対象物を探索するのではなく，対象物を全く探索しない非探索フレームを設定している。非探索フレームでは対象物を全く探索しないため，この分の演算量を大幅に軽減できる。非探索フレームの間に全探索フレームを所定の周期で設けることで，所定の対象物の位置を検出しつつ，演算量の軽減を達成できる。
【０００８】
ここで，検出手段はさらに，前記所定の対象物の位置が検出されたフレームの後には，前記所定の対象物の位置の近傍を中心に探索する近傍探索フレームを設けることが好ましい。
【０００９】
対象物が瞬時的に極端に位置を変更することは通常起こりにくいため，対象物の検出後は，対象物の位置の近傍を中心に探索すればよく，これにより，探索する画像を減らすことができる。
【００１０】
その際に，近傍探索フレームにおける探索範囲は，検出された所定の対象物の大きさや，画像を撮影している撮影手段のズーム量および移動角度，動きベクトル等に応じて，決定，調整されることが好ましく，これらの情報を用いることにより，探索する画像を減らすことができ，また高精度な検出が可能になる。ここで，移動角度は，例えば，後述の本実施の形態にかかるパン・チルト等に該当する。
【００１１】
また，検出手段は，所定のフレームにわたって全く動かない対象物を静止物体であるとして検出対象から除外するように構成してもよい。これにより，誤検出を排除でき，演算量を軽減できる。
【００１２】
また，本発明の第２の観点によれば，画像内の所定の対象物を検出する画像検出方法であって，連続したフレームのなかで，前記所定の対象物を全く探索しない非探索フレームと，前記所定の対象物をフレーム内全ての領域にわたって探索する全探索フレームと，を設定し，前記非探索フレームの間に前記全探索フレームを所定の周期で設けることを特徴とする画像検出方法が提供される。
【００１３】
本発明では，全てのフレームに対して対象物を探索するのではなく，対象物を全く探索しない非探索フレームを設定している。非探索フレームでは対象物を全く探索しないため，この分の演算量を大幅に軽減できる。非探索フレームの間に全探索フレームを所定の周期で設けることで，所定の対象物の位置を検出しつつ，演算量の軽減を達成できる。
【００１４】
ここで，前記所定の対象物の位置が検出されたフレームの後には，前記所定の対象物の位置の近傍を中心に探索する近傍探索フレームを設けることが好ましい。
【００１５】
対象物が瞬時的に極端に位置を変更することは通常起こりにくいため，対象物の検出後は，対象物の位置の近傍を中心に探索すればよく，これにより，探索する画像を減らすことができる。
【００１６】
【発明の実施の形態】
以下，添付図面を参照しながら，本発明の好適な実施の形態にかかる画像検出装置および画像検出方法について詳細に説明する。
【００１７】
まず，図１を参照しながら，本実施の形態にかかる画像検出装置の構成について説明する。なおここでは，検出する対象物を人間の顔とした場合を例にとり説明する。図１に示すように，本実施の形態にかかる画像検出装置は，画像入力手段としてのＣＣＤ（ＣｈａｒｇｅＣｏｕｐｌｅｄＤｅｖｉｃｅ）カメラ１０と，画像処理を行って顔を検出する顔検出部２０とを主要構成部として有する。
【００１８】
本装置はさらに，画像を圧縮して伝送する画像圧縮・伸張部３０，ＴＶ（テレビジョン）モニター４０，マイク５０，音声方向検出部６０，音声圧縮・伸張部７０，スピーカー８０，多重・ネットワークインタフェース９０のなかからそれぞれを必要に応じて組み合わせて構成することも可能である。
【００１９】
ＣＣＤカメラ１０は，映像入力デバイスからの動画像を取り込む画像入力手段であり，取り込んだ動画像を顔検出部２０へ送出する。ＣＣＤカメラ１０は，電動のＰＴＺ装置（パン・チルト・ズーム装置）等により自由に向きを変えることができることが好ましく，その場合には例えば，顔だと認識した領域が画像の中央に来るように制御することが容易になる。
【００２０】
顔検出部２０は，ＣＣＤカメラ１０で取り込んだ画像信号をフレーム単位で不図示の内部メモリに記憶し，取り込んだ映像から人間の顔画像を検出する。場合によっては，それに加えて人物の識別を処理する機能を持つよう構成してもよい。
【００２１】
マイク５０は，複数のマイクを配列したマイクアレーで構成することが好ましく，その場合は後述のように音声方向検出が可能になり，探索範囲の縮小に寄与できる。多重・ネットワークインタフェース９０は電話回線等のネットワークと接続されている。
【００２２】
図１に示す構成における全体的な情報の流れとしては，ＣＣＤカメラ１０で撮影された映像が顔検出部２０へ入力され，顔検出が行われる。ＣＣＤカメラ１０がＰＴＺ装置を装備している場合は，顔検出部２０での処理結果に応じて顔検出部２０からＣＣＤカメラ１０へＰＴＺ制御の指示が出される。映像データは顔検出部２０で画像処理を施された後，画像圧縮・伸張部３０へ送出され，必要に応じ圧縮・伸張された後，ＴＶモニター４０及び多重・ネットワークインタフェース９０へ送出される。なお場合に応じて，画像圧縮・伸張部３０から顔検出部２０へは動きベクトルの情報が送出される。
【００２３】
一方，マイク５０に集音された音声は音声方向検出部６０により音声の方向が検出される。検出された音声方向のデータは顔検出部２０へ送出される。また，音声は音声方向検出部６０から音声圧縮・伸張部７０へ送出され，必要に応じ圧縮・伸張された後，スピーカー８０及び多重・ネットワークインタフェース９０へ送出される。
【００２４】
図２は，顔検出部２０の処理内容を説明するための機能ブロック図である。図２に示すように，入力画像スケール変換部２０２，ウィンドウ切出部２０４，テンプレートマッチング部２０６，前処理部２０８，パターン識別部２１０，重なり判定部２１２に分けることができる。以下，各部の機能について概略的に説明する。
【００２５】
入力画像スケール変換部２０２は，ＣＣＤカメラ１０（図１）からの画像信号に基づくフレーム画像を不図示の内部メモリから読み出して，フレーム画像を縮小率が相異なる複数のスケール画像に変換する。例えば，２５３４４（＝１７６×１４４）画素からなるフレーム画像に対して，これを０．８倍ずつ順次縮小して５段階（１倍，０．８倍，０．６４倍，０．５１倍，０．４１倍）のスケール画像に変換することが考えられる。
【００２６】
ウィンドウ切出部２０４は，これらの複数のスケール画像のうち，まず１番目のスケール画像に対して，所定の画素量の矩形領域を順次切り出す。以下，この切り出した領域をウィンドウ画像と呼ぶ。
【００２７】
そして，ウィンドウ切出部２０４は，１番目のスケール画像から切り出した複数のウィンドウ画像のうち先頭のウィンドウ画像を後段のテンプレートマッチング部２０６に送出する。
【００２８】
テンプレートマッチング部２０６は，ウィンドウ切出部２０４から得られた先頭のウィンドウ画像について，当該ウィンドウ画像が顔画像か否かを判断する。ここで，テンプレートマッチング部２０６では，例えば１００人程度の人物の平均的な顔画像をテンプレートとして，当該ウィンドウ画像との大まかなマッチングをとり得るようになされている。
【００２９】
テンプレートマッチング部２０６で顔画像であると判断されたウィンドウ画像はスコア画像として後段の前処理部２０８に送出され，顔画像でないと判断された当該ウィンドウ画像はそのまま後段の重なり判定部２１２に送出される。
【００３０】
前処理部２０８では，スコア画像について人間の顔画像と無関係である背景部分に相当する領域を除去し，撮影時の照明による濃淡，コントラスト等を補正する。さらに前処理部２０８では，スコア画像をベクトル変換して，パターン識別部２１０に送出する。
【００３１】
パターン識別部２１０では，ここではサポートベクタマシンを用いてベクトルとして得られたスコア画像に対して顔データが存在するか否かを判断する。顔データが存在する場合は，画像の位置や大きさ，縮小率等をリスト化し，リストデータとして内部メモリに格納する。
【００３２】
また，パターン識別部２１０は，ウィンドウ切出部２０４に対して先頭のウィンドウ画像について顔検出が終了した旨を通知する。この通知によりウィンドウ切出部２０４は次のウィンドウ画像テンプレートマッチング部２０６に送出する。パターン識別部２１０は，入力画像スケール変換部２０２に対して１番目のスケール画像について顔検出が終了した旨を通知する。この通知により入力画像スケール変換部２０２は２番目のスケール画像をウィンドウ切出部２０４に送出する。
【００３３】
重なり判定部２１２は，内部メモリに格納されている複数のリストデータを読み出して，リストデータに含まれるスコア画像同士を比較して，重なり合う部分を含むか否かを判定し，その判定結果に基づいてスコア画像同士で重なり合う部分を除去し，各スケール画像において，複数のスコア画像から最終的に重なることなく寄せ集めた単一の画像領域を得，画像領域を顔決定データとして新たに内部メモリに格納する。
【００３４】
なお，重なり判定部２１２は，テンプレートマッチング部２０６において顔画像でないと判断された場合には，そのまま何もすることなく，内部メモリの格納も行わない。
【００３５】
このようにして，元のフレーム画像から顔画像を検出することができる。上記のような操作は演算量が膨大である。そこで，本実施の形態にかかる顔検出部２０は，演算量を軽減するために，上記の機能に加えて，以下に説明する種種の機能を有する。
【００３６】
（１）フレーム飛ばし探索機能
連続した動画像の中に含まれる人間の顔を認識する場合，毎フレーム，画像のすべての領域をパターンマッチングするのは非常に計算時間を要する。そこで，フレーム内すべての領域にわたってパターンマッチングによる顔探索を行うフレーム（以下，このフレームを全探索フレームと呼ぶ）と，全く顔の探索を行わないフレーム（以下，このフレームを非探索フレームと呼ぶ）と，を設定する。
【００３７】
また，認識対象が人間の顔の場合，ひとつのビデオカメラで撮影された連続した動画においては，フレーム間で人間の顔の位置が動く範囲は通常の人間の移動速度などから判定して限られた範囲である。したがって突然画面の端から端に人間の位置が飛ぶことはほとんどありえず，画面内の上下左右のある限られた範囲内に顔が移動している方が多い。
【００３８】
よって，上記２種類のフレームに加え，さらに前の画像で顔の場所が特定されたフレームを基準にして周辺領域を探索するフレーム，すなわち近傍探索するフレーム（以下，このフレームを近傍探索フレームと呼ぶ）を設定する。このように，全探索フレーム，非探索フレーム，近傍探索フレームの３種類のフレームを設けて探索を行うことにする。
【００３９】
探索の仕方としてはまず，連続したフレームのうち，複数の非探索フレームの間に全探索フレームを所定の周期で設ける。そして，一度顔の存在を検出したら，その近傍のみを探索範囲として定義し，近傍探索フレームを設けることにする。
【００４０】
図３に上記３種類のフレームを設けた場合の概念図を示す。図３では区別するために，全探索フレームＡは黒塗り，非探索フレームＢは白抜き，近傍探索フレームＣは斜線付き，で示している。横方向は時間を示し，各フレームが図３に示すように時系列で設けられている様子を示す。
【００４１】
すなわち，多数の非探索フレームＢの中に全探索フレームＡを一定の周期で設け，先頭の全探索フレームＡで顔の存在を検出した場合として，その後に近傍探索フレームＣを設けている。
【００４２】
例えばＮＴＳＣ方式のように３０フレーム／秒の動画像を入力している場合，３０枚に１枚だけ全探索フレームとし，残りを非探索フレームとする。この場合，１秒に一回のみ全探索を行えばよく，毎フレームを全探索する場合と比較して１／３０の計算量にすることができる。最初に顔を検出するまではこのように全探索フレームを一定周期で設けることで処理する。顔検出や顔認識に伴う計算処理量を削減するためには，連続する動画像に非探索フレームを多数設けることが好ましい。
【００４３】
図４に近傍探索フレームの探索範囲の例を示す。なお，この例ではカメラは静止しているものとする。図４では，前の探索フレームで顔検出された範囲ｆ１，次の近傍探索フレームで探索する範囲ｆ２，次の近傍探索フレームで探索しない範囲ｆ３が示されている。範囲ｆ３は，画像全体（図４における最外枠で示される範囲）から範囲ｆ２を除いた範囲を指す。範囲ｆ１は人間の顔の部分を示し，範囲ｆ２は範囲ｆ１を中心としてその近傍を含む範囲となっている。
【００４４】
上記のように近傍探索フレームを設けることにより，前述の画像のスケーリング処理の回数を削減するとともに，パターンマッチングの処理を削減することが可能になる。例えば従来では前述の入力画像スケール変換部２０２において，フレーム内すべての領域にわたって顔探索を行い０．８倍ずつの５段階のスケーリング映像を作っていた。これに対して本実施の形態では，近傍探索フレームで前回の探索フレームで検出されたスケーリング段階とその前後１段階ずつの計３段階のスケーリング画像に減らすといったことが可能である。
【００４５】
また，テンプレートマッチングに使うウィンドウ画像の切り出しにおいて，通常スケーリング映像の範囲すべてについて行うところを，前回の検出座標の近傍範囲のみに限定して行うことで計算量を大幅に削減することが可能になる。
【００４６】
このような近傍探索フレームを例えば５枚に１枚挿入することで，人間の動きになめらかに追随することができるようになるとともに，全探索フレームの頻度を減らして計算量を削減することが可能になる。
【００４７】
また，近傍探索のスケーリング処理において，顔がスケーリング画像の中央にくるようなスケーリング画像の切り出しをすることによって，スケーリング画像の境目に顔がかかる確率を減らすことが可能になる。
【００４８】
（２）対象物の大きさに応じた探索範囲の限定機能
近傍探索において，検出・認識する対象物の大きさに依存して探索範囲を限定する。人間の顔画像が画面内に大きく写っている場合と，小さく写っている場合では，人間が顔を移動させた移動量が同程度であっても，画面に映る移動の範囲が異なるという特性を利用することを考える。
【００４９】
例えば，顔画像が大きく写っている場合は，顔やカメラの移動量が小量であっても，隣接する探索フレーム間では画面上の顔の位置が大きく変化することがあり，探索範囲を比較的広くとる必要がある。一方，顔画像が小さく写っている場合は，隣接する探索フレーム間で画面上の顔の位置はさほど変わらないため，探索範囲は比較的狭くて良い。この特性と，スケーリングアルゴリズムを組み合わせることで，探索するスケーリング画像を減らすことが可能となる。
【００５０】
（３）カメラとの連動による探索範囲の調整機能
カメラ自体が左右にパンされた場合などは，画面内の顔画像もカメラの動きに応じて移動することが予想されるので，その特性を応用することができる。近傍探索において，探索する領域を決定する際に，画像を撮影するＣＣＤカメラ１０（図１）の動き情報と連動することで，さらに探索範囲を狭めたり，探索精度を向上させることが可能になる。
【００５１】
例えば電動ＰＴＺ機構を有する首振りカメラを使用したＴＶ会議システムを例にとると，カメラを右にパンした場合，映像に含まれる顔画像は左に動くことが予想される。また，その際の動き量は，顔画像の大きさとズーム量（画角）から予想することが可能であり，その動き予測を用いることで精度を向上することができる。なお，パンした場合だけでなく，カメラをチルトした場合も同様である。
【００５２】
（４）音声方向検出との組み合わせによる探索範囲の縮小機能
２個あるいは３個程度のマイクアレーを使用して，そのマイクアレーに到達する音声の時間差から音源の方向を検出する音声方向検出技術が知られている。このような公知技術を利用して，マイク５０（図１）をマイクアレーで構成し，音声方向検出部６０（図１）に音声方向検出回路を持たせて，組み合わせて使用することにより，音源の方向を検出できる。
【００５３】
例えばＴＶ会議における話者にカメラを向けるアプリケーションにおいて，音声がする方向を大まかに音声方向検出回路により検出し，その検出結果を顔検出回路に伝達することにより，音源方向と思われる方向の近傍だけをパターンマッチング探索することが可能になる。これにより，パターンマッチングの処理が軽減される。
【００５４】
（５）動きベクトルの利用による探索範囲の限定機能
画像圧縮・伸張部３０での処理により，前フレームから現フレームまでの間に，対象物である顔が移動した方向と距離を表す動きベクトルが得られる。この動きベクトルを戻すことにより，カメラのズーム量や移動角度等の情報を用いずに探索範囲を限定でき，顔を検出することが可能である。
【００５５】
（６）静止物体の排除機能
壁に人物の写真を含むポスターが貼ってあり，それも含めて画像内に取り込んだ場合等，人物の写真が画面の中に映っている場合は，通常のパターンマッチングによる顔検出手法では人間であると認識してしまい，アプリケーション上支障が出る場合がある。また，たまたま人間の顔に似た特徴を持つ模様があり，それを画像内に取り込んだ場合等，顔検出アルゴリズムが誤検出する場合もある。
【００５６】
本来の検出対象物は生身の人間の顔であり，上記のようなものは検出対象物とは異なる。このような誤検出するのを防ぐために，「生きている人間は普通じっとしていることはない」という特性を利用する。仮にカメラの向きや倍率が固定されているときに，毎回画面上のまったく同じ場所に，同じ大きさの顔画像が検出されている場合，それは静止物体であると判定して検出対象から除外するアルゴリズムを追加することで，このような誤検出を排除することができる。例えば，連続する１０枚の探索フレームの全てにおいて，毎回同じスケーリング倍率の同じ画素位置に顔画像が検出された場合は，これは静止物体であると判別し，顔として検出しないことにする。
【００５７】
以上述べたように，本実施の形態によれば，画像から人間の顔を検出するにあたり，計算処理量を大幅に削減することが可能になる。これにより，安価なデバイスを使用してシステム構築ができたり，ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）の負荷が減ることによる消費電力の低減などの効果がある。また，低い計算処理量でありながら，誤検出を低減することができ，検出精度を向上させることが出来る。
【００５８】
本実施の形態の画像検出装置及び画像検出方法は，ＴＶ会議システム以外にも，ロボット，監視システム等に適用可能であり，検出装置に限定されず，認識装置等にも適用可能なことは言うまでもない。また，上記説明では，検出する対象物を人間の顔とした場合を例にとり説明したが，必ずしもこれに限定するものではなく，他の物体を検出・認識対象とする検索システムにおいて同様の応用をすることが可能である。例えば検出・認識する対象物を車として，駐車場管理システムに本発明を適用することも考えられる。
【００５９】
なお，上記説明では，非探索フレーム，全探索フレーム，近傍探索フレームのようにフレーム単位で設定した例を挙げて説明したが，フレームをフィールドに置き換えて，非探索フィールド，全探索フィールド，近傍探索フィールドのようにフィールド単位で設定することも当然考えられる。
【００６０】
以上，添付図面を参照しながら本発明にかかる好適な実施形態について説明したが，本発明はかかる例に限定されないことは言うまでもない。当業者であれば，特許請求の範囲に記載された技術的思想の範疇内において，各種の変更例または修正例に想到し得ることは明らかであり，それらについても当然に本発明の技術的範囲に属するものと了解される。
【００６１】
【発明の効果】
以上，詳細に説明したように本発明によれば，画像内の所定の対象物を検出するにあたり，演算量の軽減を推進可能な画像検出装置及び画像を提供することができる。
【図面の簡単な説明】
【図１】本発明の１実施の形態にかかる画像検出装置の構成図である。
【図２】顔検出部の処理内容を説明するための機能ブロック図である。
【図３】各種フレームを設定した場合の概念図である。
【図４】近傍探索フレームの探索範囲の例を示す図である。
【符号の説明】
１０ＣＣＤカメラ
２０顔検出部
３０画像圧縮・伸張部
５０マイク
６０音声方向検出部
２０２入力画像スケール変換部
２０４ウィンドウ切出部
２０６テンプレートマッチング部
２０８前処理部
２１０パターン識別部
２１２重なり判定部
Ａ全探索フレーム
Ｂ非探索フレーム
Ｃ近傍探索フレーム
[0001]
[Technical field of the invention]
The present invention relates to an image detection apparatus and an image detection method for detecting a predetermined object in an image, for example an apparatus and a method suitable for detecting a human face in a video image.
[0002]
[Prior art]
Techniques for detecting and recognizing a predetermined object, such as a human face, in images such as video have been proposed (see, for example, Patent Document 1), and applications to monitoring systems, robot apparatuses, and the like are being considered. As a detection and recognition method in this field, a method using a template matching technique such as a support vector machine (SVM) is known. Prior art document information related to the present invention includes the following.
[0003]
[Patent Document 1]
JP-A-2001-216515
[0004]
[Problems to be solved by the invention]
In an apparatus that performs face detection and recognition as described above, the amount of computation required by the pattern recognition algorithm becomes enormous. It is therefore important, and has been demanded, to achieve detection accuracy sufficient for practical use while reducing the computational load. One technique devised so far creates window images of a predetermined size from reduced-scale versions of the video image, roughly judges whether each window image is a face image, and discards window images that are clearly not face images, thereby reducing the overall amount of computation. With this technique alone, however, the redundancy in the time direction is not exploited when faces are detected in continuous moving images, and the reduction in computation is not sufficient.
[0005]
The present invention has been made in view of the above-described problems, and an object of the present invention is to provide an image detection apparatus and an image detection method that can promote reduction in the amount of calculation in detecting a predetermined object in an image.
[0006]
[Means for Solving the Problems]
In order to solve the above problem, according to a first aspect of the present invention, there is provided an image detection apparatus for detecting the position of a predetermined object in an image, comprising detection means that sets, among a plurality of consecutive frames, non-search frames in which the predetermined object is not searched for at all and full search frames in which the predetermined object is searched for over the entire area of the frame, and that provides the full search frames at a predetermined cycle between the non-search frames.
[0007]
In the present invention, the object is not searched for in every frame; instead, non-search frames in which the object is not searched for at all are set. Because no search is performed in a non-search frame, the amount of computation is greatly reduced by that amount. By providing full search frames at a predetermined cycle between the non-search frames, the amount of computation can be reduced while the position of the predetermined object is still detected.
[0008]
Here, it is preferable that the detection means further provides, after a frame in which the position of the predetermined object has been detected, a neighborhood search frame in which the search is centered on the vicinity of that position.
[0009]
Because an object rarely changes its position drastically in an instant, once the object has been detected it suffices to search mainly in the vicinity of its position, which reduces the number of images to be searched.
[0010]
In that case, the search range in the neighborhood search frame is preferably determined and adjusted according to the size of the detected object, the zoom amount and movement angle of the imaging means capturing the image, motion vectors, and the like. Using this information reduces the number of images to be searched and enables highly accurate detection. Here, the movement angle corresponds to, for example, the pan and tilt described later in the present embodiment.
[0011]
The detection means may also be configured to treat an object that does not move at all over a predetermined number of frames as a stationary object and exclude it from the detection targets. This eliminates false detections and reduces the amount of computation.
[0012]
According to a second aspect of the present invention, there is provided an image detection method for detecting a predetermined object in an image, in which non-search frames in which the predetermined object is not searched for at all and full search frames in which the predetermined object is searched for over the entire area of the frame are set among consecutive frames, and the full search frames are provided at a predetermined cycle between the non-search frames.
[0013]
In the present invention, the object is not searched for in every frame; instead, non-search frames in which the object is not searched for at all are set. Because no search is performed in a non-search frame, the amount of computation is greatly reduced by that amount. By providing full search frames at a predetermined cycle between the non-search frames, the amount of computation can be reduced while the position of the predetermined object is still detected.
[0014]
Here, it is preferable that a neighborhood search frame for searching around the position of the predetermined object is provided after the frame in which the position of the predetermined object is detected.
[0015]
Because an object rarely changes its position drastically in an instant, once the object has been detected it suffices to search mainly in the vicinity of its position, which reduces the number of images to be searched.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an image detection apparatus and an image detection method according to preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0017]
First, the configuration of the image detection apparatus according to the present embodiment will be described with reference to FIG. 1. Here, the case where the object to be detected is a human face is taken as an example. As shown in FIG. 1, the main components of the image detection apparatus according to the present embodiment are a CCD (Charge Coupled Device) camera 10 serving as image input means and a face detection unit 20 that detects faces by performing image processing.
[0018]
The apparatus may further be configured by combining, as necessary, any of the following: an image compression/decompression unit 30 that compresses and transmits images, a TV (television) monitor 40, a microphone 50, a voice direction detection unit 60, a voice compression/decompression unit 70, a speaker 80, and a multiplexing/network interface 90.
[0019]
The CCD camera 10 is image input means that captures a moving image from a video input device and sends the captured moving image to the face detection unit 20. The CCD camera 10 is preferably able to change its orientation freely by means of a motorized PTZ (pan/tilt/zoom) device or the like, in which case it becomes easy, for example, to control the camera so that a region recognized as a face comes to the center of the image.
[0020]
The face detection unit 20 stores the image signal captured by the CCD camera 10 in an internal memory (not shown) in units of frames, and detects a human face image from the captured video. In some cases, it may be configured to have a function of processing identification of a person in addition to that.
[0021]
The microphone 50 is preferably composed of a microphone array in which a plurality of microphones are arranged. In this case, the voice direction can be detected as will be described later, and the search range can be reduced. The multiplex / network interface 90 is connected to a network such as a telephone line.
[0022]
The overall flow of information in the configuration shown in FIG. 1 is as follows. Video captured by the CCD camera 10 is input to the face detection unit 20, where face detection is performed. When the CCD camera 10 is equipped with a PTZ device, the face detection unit 20 issues PTZ control instructions to the CCD camera 10 according to its processing results. After image processing in the face detection unit 20, the video data is sent to the image compression/decompression unit 30, compressed or decompressed as necessary, and then sent to the TV monitor 40 and the multiplexing/network interface 90. Where applicable, motion vector information is sent from the image compression/decompression unit 30 to the face detection unit 20.
[0023]
Meanwhile, the direction of the voice picked up by the microphone 50 is detected by the voice direction detection unit 60, and the detected voice direction data is sent to the face detection unit 20. The voice itself is sent from the voice direction detection unit 60 to the voice compression/decompression unit 70, compressed or decompressed as necessary, and then sent to the speaker 80 and the multiplexing/network interface 90.
[0024]
FIG. 2 is a functional block diagram for explaining the processing performed by the face detection unit 20. As shown in FIG. 2, the face detection unit 20 can be divided into an input image scale conversion unit 202, a window cutout unit 204, a template matching unit 206, a preprocessing unit 208, a pattern identification unit 210, and an overlap determination unit 212. The function of each unit is outlined below.
[0025]
The input image scale conversion unit 202 reads a frame image based on the image signal from the CCD camera 10 (FIG. 1) from an internal memory (not shown) and converts it into a plurality of scale images with different reduction ratios. For example, a frame image of 25,344 (= 176 × 144) pixels may be reduced successively by a factor of 0.8 and converted into scale images at five levels (1×, 0.8×, 0.64×, 0.51×, and 0.41×).
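The five reduction ratios quoted above are successive powers of 0.8. The following is a minimal sketch of that arithmetic only; the function name and the rule for rounding scaled sizes to whole pixels are assumptions, not taken from the specification.

```python
def scale_pyramid(width, height, factor=0.8, levels=5):
    """Return (scale, scaled_width, scaled_height) for each pyramid level."""
    pyramid = []
    for level in range(levels):
        scale = factor ** level
        # Rounding to whole pixels is an assumption made for illustration.
        pyramid.append((round(scale, 2), round(width * scale), round(height * scale)))
    return pyramid


for scale, w, h in scale_pyramid(176, 144):
    print(f"{scale:.2f}x -> {w} x {h} pixels")
```

The rounded scales reproduce the 1, 0.8, 0.64, 0.51, and 0.41 figures given in the text.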
[0026]
The window cutout unit 204 sequentially cuts out a rectangular area having a predetermined pixel amount from the first scale image among the plurality of scale images. Hereinafter, this cut-out area is referred to as a window image.
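Sequential cutting of fixed-size rectangles can be pictured as a sliding window over one scale image. The sketch below is illustrative only: the 20-pixel window size and 2-pixel step are hypothetical values, since the specification says only that the window has a predetermined pixel amount.

```python
def window_origins(img_w, img_h, win=20, step=2):
    """Yield the top-left (x, y) of every win x win window inside the image.

    win and step are hypothetical values chosen for illustration.
    """
    for y in range(0, img_h - win + 1, step):
        for x in range(0, img_w - win + 1, step):
            yield (x, y)


origins = list(window_origins(176, 144))
print(len(origins), "window images in the first scale image")
```

Each yielded position corresponds to one window image that would be passed on to template matching.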
[0027]
The window cutout unit 204 then sends the first of the window images cut out from the first scale image to the template matching unit 206 at the subsequent stage.
[0028]
The template matching unit 206 determines whether the first window image obtained from the window cutout unit 204 is a face image. The template matching unit 206 uses, for example, an average face image of about 100 persons as a template, so that it can perform a rough match against the window image.
[0029]
A window image judged to be a face image by the template matching unit 206 is sent as a score image to the preprocessing unit 208 at the subsequent stage, whereas a window image judged not to be a face image is sent as it is to the overlap determination unit 212.
[0030]
The pre-processing unit 208 removes an area corresponding to the background portion that is irrelevant to the human face image from the score image, and corrects shading, contrast, and the like due to illumination during shooting. Further, the preprocessing unit 208 vector-converts the score image and sends it to the pattern identification unit 210.
[0031]
The pattern identification unit 210, here using a support vector machine, determines whether face data exists in the score image obtained as a vector. If face data exists, the position, size, reduction ratio, and so on of the image are compiled into a list and stored in the internal memory as list data.
[0032]
The pattern identification unit 210 also notifies the window cutout unit 204 that face detection has been completed for the first window image. In response to this notification, the window cutout unit 204 sends the next window image to the template matching unit 206. The pattern identification unit 210 further notifies the input image scale conversion unit 202 that face detection has been completed for the first scale image. In response to this notification, the input image scale conversion unit 202 sends the second scale image to the window cutout unit 204.
[0033]
The overlap determination unit 212 reads the plural list data stored in the internal memory, compares the score images contained in the list data with one another, and determines whether they include overlapping portions. Based on the result, it removes the overlapping portions between score images so that, for each scale image, a single image region is finally assembled from the score images without overlap, and it newly stores that image region in the internal memory as face determination data.
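The specification does not say how overlapping portions are removed. As one possible illustration, the sketch below swaps in a simple greedy suppression: candidates are visited from highest to lowest score and any candidate that overlaps an already kept one is dropped. The numeric score attached to each candidate and both function names are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles a and b, each (x, y, w, h), share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def remove_overlaps(candidates):
    """candidates: list of (score, (x, y, w, h)); keep the best non-overlapping ones."""
    kept = []
    for score, rect in sorted(candidates, key=lambda c: c[0], reverse=True):
        if not any(overlaps(rect, kept_rect) for _, kept_rect in kept):
            kept.append((score, rect))
    return kept


faces = remove_overlaps([
    (0.9, (10, 10, 20, 20)),
    (0.8, (15, 15, 20, 20)),   # overlaps the first candidate, so it is dropped
    (0.7, (100, 100, 20, 20)),
])
print(faces)
```

This is a stand-in technique, not necessarily what the overlap determination unit 212 does; it only shows the kind of comparison between list entries that the paragraph describes.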
[0034]
For a window image that the template matching unit 206 has judged not to be a face image, the overlap determination unit 212 does nothing and stores nothing in the internal memory.
[0035]
In this way, a face image can be detected from the original frame image. The operation as described above requires a large amount of calculation. Therefore, the face detection unit 20 according to the present embodiment has various functions described below in addition to the above functions in order to reduce the calculation amount.
[0036]
(1) Frame-skipping search function
When recognizing human faces contained in a continuous moving image, pattern matching every region of the image in every frame takes a very long calculation time. Therefore, two kinds of frame are set: frames in which a face search by pattern matching is performed over the entire area of the frame (hereinafter called full search frames) and frames in which no face search is performed at all (hereinafter called non-search frames).
[0037]
When the recognition target is a human face, in continuous video shot by a single video camera the range over which the face position moves between frames is limited, judging from normal human movement speed and the like. It is therefore almost impossible for a person's position to jump suddenly from one edge of the screen to the other; in most cases the face moves within a limited range up, down, left, or right on the screen.
[0038]
Therefore, in addition to the above two types of frame, a third type is set: a frame that searches the surrounding region with reference to the frame in which the face location was identified in an earlier image, that is, a frame that performs a neighborhood search (hereinafter called a neighborhood search frame). The search is thus performed using three types of frame: full search frames, non-search frames, and neighborhood search frames.
[0039]
As for how the search proceeds, first, full search frames are provided at a predetermined cycle between the non-search frames in the sequence of consecutive frames. Then, once the presence of a face has been detected, only its neighborhood is defined as the search range and neighborhood search frames are provided.
[0040]
FIG. 3 is a conceptual diagram of the case where the above three types of frame are provided. To distinguish them, FIG. 3 shows full search frames A in solid black, non-search frames B in white, and neighborhood search frames C with hatching. The horizontal direction represents time, and the frames are arranged in time series as shown.
[0041]
That is, full search frames A are provided at a constant cycle among a large number of non-search frames B, and, for the case where the presence of a face is detected in the first full search frame A, neighborhood search frames C are provided after it.
[0042]
For example, when a moving image of 30 frames per second is input, as in the NTSC system, only one frame in every 30 is made a full search frame and the rest are non-search frames. In this case a full search needs to be performed only once per second, and the amount of calculation can be reduced to 1/30 of that required when every frame is fully searched. Until a face is first detected, processing proceeds in this way with full search frames provided at a constant cycle. To reduce the amount of calculation associated with face detection and face recognition, it is preferable to provide a large number of non-search frames in the continuous moving image.
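The 1/30 figure follows directly from the schedule. The sketch below labels frames before any face has been detected, using the letters A and B from FIG. 3; the function name and the choice of frame index 0 as the first full search frame are assumptions.

```python
def frame_type(index, full_period=30):
    """'A' (full search) once per period, otherwise 'B' (non-search)."""
    return "A" if index % full_period == 0 else "B"


# Three seconds of 30 frames/second video.
types = [frame_type(i) for i in range(90)]
full_fraction = types.count("A") / len(types)
print(types.count("A"), "full search frames out of", len(types))
```

With a period of 30, three of the 90 frames are fully searched, so the search workload is one thirtieth of searching every frame.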
[0043]
FIG. 4 shows an example of the search range of a neighborhood search frame. In this example the camera is assumed to be stationary. FIG. 4 shows a range f1 in which a face was detected in the previous search frame, a range f2 that is searched in the next neighborhood search frame, and a range f3 that is not searched in the next neighborhood search frame. Range f3 is the entire image (the range indicated by the outermost frame in FIG. 4) minus range f2. Range f1 is the human face portion, and range f2 is a range centered on range f1 that includes its vicinity.
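The relationship between f1 and f2 can be sketched as a rectangle grown by a margin and clipped to the frame. This is an illustrative sketch only: the specification does not give a margin, so `margin` is a hypothetical parameter, and rectangles are assumed to be stored as (x, y, width, height).

```python
def neighborhood_range(f1, margin, frame_w, frame_h):
    """Grow face rectangle f1 = (x, y, w, h) by margin pixels; clip to the frame."""
    x, y, w, h = f1
    left = max(0, x - margin)
    top = max(0, y - margin)
    right = min(frame_w, x + w + margin)
    bottom = min(frame_h, y + h + margin)
    return (left, top, right - left, bottom - top)


f2 = neighborhood_range((80, 60, 20, 20), margin=10, frame_w=176, frame_h=144)
print("search range f2:", f2)
```

Everything outside the returned rectangle corresponds to range f3 and is skipped in the neighborhood search frame.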
[0044]
Providing neighborhood search frames as described above reduces both the number of image scaling operations described earlier and the amount of pattern matching. For example, the input image scale conversion unit 202 described above has conventionally searched for faces over the entire area of the frame, creating scaled images at five levels, each 0.8 times the previous one. In the present embodiment, by contrast, a neighborhood search frame can use only three levels of scaled image: the scaling level at which the face was detected in the previous search frame, plus one level on either side of it.
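Selecting the detected scaling level and its two neighbors can be sketched as follows. Levels are assumed to be indexed 0 to 4 for the five-level example; the function name and the clipping at the ends of the pyramid are assumptions.

```python
def neighborhood_levels(detected_level, total_levels=5):
    """Pyramid levels to search: the detected level plus one on either side."""
    lowest = max(0, detected_level - 1)
    highest = min(total_levels - 1, detected_level + 1)
    return list(range(lowest, highest + 1))


print(neighborhood_levels(2))  # face previously found at the 0.64x level
```

At either end of the pyramid only two levels remain, since there is no further level on one side.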
[0045]
In addition, the window images used for template matching, which would normally be cut out over the entire range of each scaled image, are cut out only from the vicinity of the previously detected coordinates, which greatly reduces the amount of calculation.
[0046]
By inserting such a neighborhood search frame once in every five frames, for example, it becomes possible to follow human movement smoothly while reducing the frequency of full search frames and thus the amount of calculation.
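Putting the three frame types together, a schedule in the style of FIG. 3 might be generated as in the sketch below. It assumes full search frames keep their 30-frame cycle after detection and that neighborhood search frames are counted from the detecting frame; neither detail is fixed by the text, and loss of the face is not modelled. All names are hypothetical.

```python
def schedule(num_frames, face_found_at=None, full_period=30, near_period=5):
    """Return one letter per frame: A = full, B = non-search, C = neighborhood."""
    types = []
    for i in range(num_frames):
        if i % full_period == 0:
            types.append("A")
        elif (face_found_at is not None and i > face_found_at
              and (i - face_found_at) % near_period == 0):
            # After a detection, one frame in every near_period is searched
            # around the previously detected position.
            types.append("C")
        else:
            types.append("B")
    return "".join(types)


print(schedule(12, face_found_at=0))
```

For a face found in frame 0, the first twelve frames come out as one full search frame followed by neighborhood search frames at frames 5 and 10, with the remaining frames skipped.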
[0047]
Further, in the scaling process for the neighborhood search, cutting out the scaled image so that the face comes to its center reduces the probability that the face falls on the boundary of the scaled image.
[0048]
(2) Function for limiting the search range according to the size of the object
In the neighborhood search, the search range is limited depending on the size of the object to be detected and recognized. The idea is to exploit the property that, even when a person moves their face by about the same amount, the range of movement that appears on the screen differs depending on whether the face appears large or small in the frame.
[0049]
For example, when the face image appears large, the position of the face on the screen can change greatly between adjacent search frames even if the face or the camera moves only slightly, so the search range must be made relatively wide. When the face image appears small, on the other hand, the position of the face on the screen does not change much between adjacent search frames, so the search range can be relatively narrow. Combining this property with the scaling algorithm makes it possible to reduce the number of scaled images to be searched.
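One simple way to express "wider range for larger faces" is to make the search margin proportional to the detected face width. The sketch below is an assumption-laden illustration: the proportionality constant `ratio` and the floor `min_margin` are hypothetical and do not come from the specification.

```python
def search_margin(face_width, ratio=0.5, min_margin=4):
    """Margin in pixels around the face, growing with its on-screen width.

    ratio and min_margin are hypothetical tuning values.
    """
    return max(min_margin, round(face_width * ratio))


for width in (4, 20, 80):
    print(f"face width {width:3d} px -> margin {search_margin(width)} px")
```

A large face gets a wide margin and a small face a narrow one, which is the behavior the paragraph above calls for.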
[0050]
(3) Function for adjusting the search range in conjunction with the camera
When the camera itself is panned left or right, for example, the face image on the screen can be expected to move in accordance with the camera movement, and this property can be exploited. When the region to be searched is determined in the neighborhood search, linking the decision to the motion information of the CCD camera 10 (FIG. 1) that captures the image makes it possible to narrow the search range further or to improve the search accuracy.
[0051]
For example, taking a video conference system using a swing camera having an electric PTZ mechanism as an example, when the camera is panned to the right, the face image included in the video is expected to move to the left. In addition, the amount of motion at that time can be predicted from the size of the face image and the zoom amount (view angle), and the accuracy can be improved by using the motion prediction. The same applies not only to panning but also to tilting the camera.
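As a rough illustration of predicting on-screen motion from a pan, the sketch below uses a pinhole-camera model for a static face that starts at the image center. The model, the sign convention (positive pan means the camera turns right), and the function name are all assumptions; the specification only states that the motion can be predicted from the face size and zoom amount (angle of view).

```python
import math


def predicted_shift_px(pan_deg, hfov_deg, image_width):
    """Horizontal shift in pixels of a centered, static face after a pan.

    Positive pan_deg means the camera pans right, so the face moves left
    (negative x). hfov_deg is the horizontal angle of view set by the zoom.
    """
    focal_px = (image_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    return -focal_px * math.tan(math.radians(pan_deg))


shift = predicted_shift_px(pan_deg=5, hfov_deg=60, image_width=176)
print(f"expected shift: {shift:.1f} px")
```

Zooming in narrows the angle of view, which increases the focal length in pixels and therefore the predicted shift for the same pan angle. The same reasoning applies vertically for tilt.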
[0052]
(4) Function for reducing the search range in combination with voice direction detection
A voice direction detection technique is known that uses a microphone array of about two or three microphones and detects the direction of a sound source from the difference in the arrival times of the voice at the microphones. Using this known technique, the direction of the sound source can be detected by configuring the microphone 50 (FIG. 1) as a microphone array and providing the voice direction detection unit 60 (FIG. 1) with a voice direction detection circuit.
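For a pair of microphones and a distant source, the arrival-time difference maps to a bearing through the standard far-field relation sin θ = c·Δt / d. The sketch below illustrates that relation only; it is not the circuit of the voice direction detection unit 60, and the function name, the 343 m/s sound speed, and the sign convention are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at about 20 degrees C


def source_angle_deg(delay_s, mic_spacing_m):
    """Far-field bearing, measured from the array broadside, for two microphones."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp so measurement noise cannot break asin
    return math.degrees(math.asin(ratio))


angle = source_angle_deg(delay_s=0.1 / SPEED_OF_SOUND, mic_spacing_m=0.2)
print(f"estimated bearing: {angle:.1f} degrees")
```

A zero delay corresponds to a source directly in front of the array; the sign of the delay tells which side the source is on.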
[0053]
For example, in an application that points the camera at the speaker in a TV conference, the voice direction detection circuit roughly detects the direction from which the voice comes and passes the result to the face detection circuit, so that pattern matching can be restricted to the vicinity of the presumed sound source direction. This reduces the amount of pattern matching.
[0054]
(5) Function for limiting the search range using motion vectors
Processing in the image compression/decompression unit 30 yields a motion vector representing the direction and distance that the face, which is the object, has moved between the previous frame and the current frame. By feeding this motion vector back, the search range can be limited and the face detected without using information such as the zoom amount or movement angle of the camera.
[0055]
(6) Stationary object exclusion function: When a poster containing a photograph of a person is pasted on a wall and the poster, including the photograph, appears on the screen, an ordinary face detection method based on pattern matching recognizes the photograph as a human, which may hinder the application. In addition, when a pattern with features similar to a human face appears in the image, the face detection algorithm may detect it erroneously.
[0056]
The intended detection object is a real human face, and the cases above are not. To prevent such false detections, the characteristic that "living humans are not usually completely still" is used. With the camera orientation and magnification fixed, if a face image of the same size is detected at exactly the same location on the screen every time, it is judged to be a stationary object and excluded from the detection target. Adding such an algorithm eliminates these false detections. For example, if a face image is detected at the same pixel position with the same scaling magnification in all of 10 consecutive search frames, it is judged to be a stationary object and is not detected as a face.
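The 10-frame rule in the example above might be sketched as follows. The class name `StationaryFilter`, the detection tuple `(x, y, scale)`, and the `reset` hook for camera movement are assumptions for illustration; the source specifies only the criterion itself.

```python
class StationaryFilter:
    """Drop detections that repeat at the same position and scale."""

    def __init__(self, limit=10):
        self.limit = limit  # consecutive search frames before exclusion
        self.streak = {}    # (x, y, scale) -> consecutive search frames seen

    def reset(self):
        # Call when the camera orientation or magnification changes,
        # since the rule applies only while both are fixed.
        self.streak.clear()

    def update(self, detections):
        """Take the detections of one search frame, return the live ones."""
        detections = list(detections)
        # A detection missing from this frame loses its streak.
        self.streak = {d: self.streak.get(d, 0) + 1 for d in set(detections)}
        return [d for d in detections if self.streak[d] < self.limit]

f = StationaryFilter(limit=10)
for i in range(9):
    out = f.update([(100, 50, 2), (200 + i, 80, 1)])  # poster and moving face
ninth = out
tenth = f.update([(100, 50, 2), (209, 80, 1)])
```

After nine search frames the poster at `(100, 50, 2)` is still reported; on the tenth it is excluded, while the moving face continues to be reported.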
[0057]
As described above, according to the present embodiment, the amount of calculation required to detect a human face from an image can be greatly reduced. As a result, a system can be built with inexpensive devices, and power consumption can be lowered because the load on the CPU (Central Processing Unit) is reduced. In addition, detection errors can be reduced and detection accuracy improved while keeping the amount of calculation low.
[0058]
It goes without saying that the image detection apparatus and image detection method of the present embodiment can be applied to robots, monitoring systems, and the like in addition to TV conference systems, and that they are not limited to detection apparatuses but can also be applied to recognition apparatuses and the like. Further, although the above description takes the case where the object to be detected is a human face as an example, the present invention is not necessarily limited to this and can be applied in the same way to search systems that detect and recognize other objects. For example, the present invention may be applied to a parking lot management system in which the object to be detected and recognized is a vehicle.
[0059]
In the above description, an example was given in which non-search frames, full search frames, and neighborhood search frames are set in units of frames. However, it is of course also conceivable to replace frames with fields and set them in units of fields, as non-search fields, full search fields, and neighborhood search fields.
[0060]
Although a preferred embodiment of the present invention has been described above with reference to the accompanying drawings, it goes without saying that the present invention is not limited to this example. It is obvious that those skilled in the art can conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these naturally belong to the technical scope of the present invention.
[0061]
[Effect of the invention]
As described above in detail, according to the present invention, it is possible to provide an image detection apparatus and an image detection method that can further reduce the amount of calculation required to detect a predetermined object in an image.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of an image detection apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram for explaining the processing contents of the face detection unit.
FIG. 3 is a conceptual diagram when various frames are set.
FIG. 4 is a diagram illustrating an example of a search range of a neighborhood search frame.
[Explanation of symbols]
10 CCD camera
20 Face detection unit
30 Image compression/decompression unit
50 Microphone
60 Voice direction detection unit
202 Input image scale conversion unit
204 Window extraction unit
206 Template matching unit
208 Preprocessing unit
210 Pattern identification unit
212 Overlap determination unit
A Full search frame
B Non-search frame
C Neighborhood search frame

Claims

  An image detection apparatus comprising detecting means that:
  sets, among consecutive frame images, non-search frames in which a predetermined object in the frame image is not searched for at all, and full search frames that are provided between the non-search frames at a predetermined cycle and in which the predetermined object is searched for over the entire region of the frame image;
  detects the predetermined object in the full search frame;
  after the frame in which the predetermined object is detected in the full search frame, sets a neighborhood search frame in which the frame image is converted into scale images of a plurality of stages having different reduction ratios and the search is centered on the vicinity of the position of the predetermined object in the plurality of scale images, and detects the predetermined object in the scale images in the neighborhood search frame; and
  when the predetermined object is detected in the neighborhood search frame, converts the frame image into a plurality of scale images having fewer stages than in the previous conversion, including the stage of the scale image in which the predetermined object was detected, and detects the predetermined object in those scale images in the next neighborhood search frame.

The image detection apparatus according to claim 1, wherein the detecting means determines the search range in the neighborhood search frame in accordance with the size of the detected predetermined object.

The image detection apparatus according to claim 1, wherein the detecting means adjusts the search range in the neighborhood search frame based on the zoom amount and the movement angle of the photographing means that captures the image.

The image detection apparatus according to claim 1, wherein the detecting means determines the search range in the neighborhood search frame using a motion vector.

The image detection apparatus according to claim 1, wherein the detecting means excludes from the detection target an object that does not move at all over a predetermined number of frames.

The image detection apparatus according to claim 1, further comprising voice direction detecting means for detecting the direction of a sound source,
wherein the detecting means sets only the vicinity of the detected direction of the sound source as the search range in the neighborhood search frame.

  An image detection method comprising:
  setting, among consecutive frame images, non-search frames in which a predetermined object in the frame image is not searched for at all, and full search frames that are provided between the non-search frames at a predetermined cycle and in which the predetermined object is searched for over the entire region of the frame image;
  detecting the predetermined object in the full search frame;
  after the frame in which the predetermined object is detected in the full search frame, setting a neighborhood search frame in which the frame image is converted into scale images of a plurality of stages having different reduction ratios and the search is centered on the vicinity of the position of the predetermined object in the plurality of scale images, and detecting the predetermined object in the scale images in the neighborhood search frame; and
  when the predetermined object is detected in the neighborhood search frame, converting the frame image into a plurality of scale images having fewer stages than in the previous conversion, including the stage of the scale image in which the predetermined object was detected, and detecting the predetermined object in those scale images in the next neighborhood search frame.

The image detection method according to claim 7, wherein the search range in the neighborhood search frame is determined in accordance with the size of the detected predetermined object.

The image detection method according to claim 7, wherein the search range in the neighborhood search frame is adjusted based on the zoom amount and the movement angle of the photographing means that captures the image.

The image detection method according to claim 7, wherein the search range in the neighborhood search frame is determined using a motion vector.

The image detection method according to claim 7, wherein an object that does not move at all over a predetermined number of frames is excluded from the detection target.

The image detection method according to claim 7, wherein the direction of a sound source is detected,
and only the vicinity of the detected direction of the sound source is set as the search range in the neighborhood search frame.