JP3725996B2 - Image processing device

Info

Publication number: JP3725996B2
Application number: JP16393599A
Authority: JP
Inventors: 斎藤　　修; 守小田
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1999-06-10
Filing date: 1999-06-10
Publication date: 2005-12-14
Anticipated expiration: 2019-06-10
Also published as: JP2000354247A

Description

【０００１】
【発明の属する技術分野】
この発明は画像処理装置に関し、特に、動画像符号化技術を使用し、動画像符号化技術の中でもＴＶ電話などのような対象画像内に人物などの顔画像が含まれるような画像を処理する画像処理装置に関する。
【０００２】
【従来の技術】
画像処理装置による顔画像特徴抽出技術としては、たとえば特開平７−２９０１４号公報に記載されているように、動きオブジェクトを人物の領域として矩形領域を抽出し、抽出した矩形領域内から顔画像の特徴によりさらに顔領域の肌色領域を抽出するものがある。
【０００３】
【発明が解決しようとする課題】
しかしながら、上述の特開平７−２９０１４号公報に記載された発明では、動き要素判定による動き領域抽出というアルゴリズムから生じる、人物領域以外の動き成分を誤って抽出してしまうおそれがある。また、顔画像内の縦エッジ成分を顔の両幅とすることから、鼻領域の縦エッジを顔の右頬もしくは左頬として誤って認識してしまうこともある。さらに、ハードウェアで顔画像領域抽出技術を実現する場合でなく、プロセッサなどのソフトウェアで実現する場合の不要演算によりシステム全体のパフォーマンスが低下するなどの問題点があった。
【０００４】
それゆえに、この発明の主たる目的は、動き画素による動き矩形領域抽出や矩形領域内の顔特徴抽出による顔の精密座標抽出技術において、抽出性能の向上と誤抽出の低減を図られるような画像処理装置を提供することである。
【０００５】
【課題を解決するための手段】
請求項１に係る発明は、入力される動画像から人物の顔画像を抽出する画像処理装置であって、入力される動画像に基づいて動き物体領域からなる矩形領域を抽出する矩形領域抽出手段と、抽出された矩形領域内の顔画像の特徴による精密座標を抽出する顔画像特徴抽出手段とを備え、矩形領域抽出手段は前フレームと現フレームの画像の差に基づいて矩形領域を抽出するとき、前フレーム矩形領域外の動き成分を判定するためのしきい値を矩形領域内の動き成分を判定するためのしきい値よりも感度を低く設定することにより、矩形領域の判定精度を向上させる。
【０００６】
請求項２に係る発明では、請求項１の矩形領域抽出手段は前フレームと現フレーム間の動き画素量が少ないときは前フレームの画像により矩形領域を抽出して出力することにより、対象人物領域外に矩形領域が移動しにくくすることで矩形領域の判定精度を向上させる。
【０００７】
請求項３に係る発明では、請求項１の顔画像特徴抽出手段は、矩形領域内の顔画像の特徴を抽出するために顔幅を判定するとき、顔幅検索範囲を頭頂座標から画面最下座標までの１／２の位置の矩形座標の左右幅に設定することにより、精密座標検索の精度と演算の高効率化を図る。
【０００８】
請求項４に係る発明では、請求項１の顔画像特徴抽出手段は顔幅を判定するとき、鼻成分を避けて実際の顔幅よりも狭い領域を精密座標として判定しないようにする。
【０００９】
請求項５に係る発明では、請求項１の顔画像特徴抽出手段は顔画像特徴抽出による精密座標抽出が失敗した場合、ある決められた大きさの領域を矩形領域とすることにより、矩形領域外に顔画像が存在することを低減する。
【００１０】
請求項６に係る発明では、顔画像特徴抽出による精密座標抽出が失敗した場合、画面全体を矩形領域とすることで矩形領域外に顔画像が存在することをなくす。
【００１１】
請求項７に係る発明では、矩形領域抽出手段による現フレームで動き画素より算出された矩形領域座標と、前フレームの矩形領域座標の値にローパスフィルタをかけ、動き物体が画面内で急激に移動した場合でも滑らかに矩形領域が動き物体に追従するようにする。
【００１２】
請求項８に係る発明では、画面内の端の領域には有効な顔画像が存在しないとして求められた矩形領域座標値に対してクリッピング処理を行なうことによって、後の精密座標抽出時の精度向上と演算量削減を行なう。
【００１３】
請求項９に係る発明では、顔画像特徴抽出による精密座標判定のための顔幅判定時に、前フレームで求めた精密座標と現フレームで求めた精密座標の値の差があるしきい値内であれば前フレームで算出した精密座標を現フレームの精密座標として採用することにより、精密座標が見失われることを避ける。
【００１４】
請求項１０に係る発明では、顔画像特徴抽出による精密座標判定のための顔幅を判定するとき、現フレームで求めた精密座標の値と過去数フレームの精密座標の値の平均値の差があるしきい値内であれば、過去数フレームの精密座標の値の平均値を現フレームの精密座標値として採用することにより、精密座標が見失われるのを避ける。
請求項１１に係る発明では、顔画像特徴抽出手段によって抽出された精密座標内の色分布標準偏差に基づいて入力される動画像の肌色領域を人物の顔画像として抽出する肌色領域抽出手段をさらに備え、顔画像特徴抽出手段によって抽出できなかった肌色領域をも含む顔画像を抽出できる。
【００１５】
【発明の実施の形態】
図１はＨ．２６１／Ｈ．２６３画像エンコーダと顔領域抽出部の関係を示すブロック図である。図１において、符号化器に入力されたＣＩＦ画像データは、顔領域抽出部１０１と減算器１０３とスイッチ１０４の一方の入力端ａと予測メモリ１１２とにそれぞれ与えられる。減算器１０３は入力された画像データから、ループ内フィルタ１１１の出力するビデオ信号を減算し、その差分データはスイッチ１０４の他方の入力端ｂに与えられる。スイッチ１０５は２つの入力端ａとｂとを有し、入力端ｂにもループ内フィルタ１１１の出力が与えられる。スイッチ１０５はスイッチ１０４とともに符号化制御部１０２の切換制御信号によって同期して切換えられる。
【００１６】
変換器１０７はスイッチ１０４で切換えられたフレーム内のビデオ信号とフレーム間のビデオ信号のいずれかをＤＣＴ（Discrete Cosine Transform ：離散コサイン変換）し、その出力は量子化器１０８に与えられる。量子化器１０８は変換器１０７のデータを量子化し、量子化インデックスｑを出力するとともに、この量子化インデックスｑを逆量子化器１０９に出力する。
【００１７】
逆量子化器１０９は生成された量子化インデックスｑを逆量子化し、逆変換器１１０に与える。逆変換器１１０は逆量子化器１０９で逆量子化されたデータを逆変換し、その出力を加算器１０６に与える。
【００１８】
加算器１０６はスイッチ１０５を介して与えられるループ内フィルタ１１１の出力である前フレームのビデオ信号に差分データを加算し、その出力を予測メモリ１１２に与える。予測メモリ１１２は数フレーム分の画像データを保持するとともに、前フレームの画像に対する各ブロックの画像の動きを動きベクトルｖとして出力する。ループ内フィルタ１１１は予測メモリ１１２に保持された画像における歪みをスムージングにより除去するフィルタであり、その動作の有無を示すオン／オフ信号ｆを出力する。
【００１９】
符号化制御部１０２はスイッチ１０４，１０５に切換制御信号を出力するとともに、量子化器１０８に対して量子化特性指定情報ｑｚを指示し、フレーム間／フレーム内符号化識別フラグｐと伝送／非伝送識別フラグｔをそれぞれ発生する。顔領域抽出部１０１では、フレーム内の画像データから顔領域の特徴となる領域を抽出し、座標を符号化制御部１０２に出力する。
【００２０】
以下に、顔領域の判定方法について説明する。顔領域抽出は大きく分けて物体の動きに着目した動きベースによる顔抽出と、顔の色からの色ベースによる肌色抽出の２つの処理で構成される。
【００２１】
図２は顔領域抽出の大まかな流れを示すフローチャートである。図２において、まず始めにステップ（図示ではＳＰと略称する）ＳＰ１において、動きベースによる顔領域で抽出する Still Countを０に初期化する。また、PreNoMv を１０００に初期化する。ステップＳＰ２において画像フレームが取込まれ、ステップＳＰ３で顔・肌色抽出処理の第１次段階として、動きを利用して顔領域が抽出される。ここで得られた顔領域の情報が次の処理の色ベースによる肌色抽出で使用される。そして、ステップＳＰ４で肌色の抽出の実施であるか否かが判別され、そうであればステップＳＰ５で色ベースによる肌色が抽出され、ステップＳＰ６でプリフィルタ処理がされた後、ステップＳＰ７で１フレームの画像符号化が行なわれる。
【００２２】
次に、動きベースによる顔領域の位置をもとに、動きベースによる顔抽出において抽出できなかった顔以外の肌色領域、たとえば、手，腕，首などの領域を動き物体の中から抽出する動作について説明する。
【００２３】
ここで、動きベースによる顔領域抽出の具体的な動作について説明すると、抽出対象となる顔領域は動き物体であり、動き領域の頂上が頭頂であるという特徴を用いて顔領域の抽出が行なわれる。
【００２４】
図３および図４は動きベースによる顔抽出の動作を示すフローチャートであり、図５は縮小画像の作成を示す図である。
【００２５】
はじめに、処理の高速化および細かな動きを除去して有効な働きを得るために、図３のステップＳＰ１１において現フレームおよび前フレームの入力画像の輝度成分からなるＹ画像を縮小する。縮小画像の大きさは入力画像の大きさにかかわらず、表１に示すように４４×３６画素である。
【００２６】
【表１】

【００２７】
図５に示すように、入力画像がＣＩＦのときは８×８画素の平均、ＱＣＩＦおよびＳＱＣＩＦのときは４×４画素の平均を求め、縮小画像の１画素とする。ただし、ＳＱＣＩＦの場合は縮小後の大きさが３２×２４画素であるので、その外側に相当する部分に０を入れて縮小画像の大きさを４４×３６画素にする。現在のフレームの１つ前のフレームを前フレームとすることで、前フレームの縮小画像とする。すなわち、現フレームの縮小画像と前フレームの縮小画像はダブルバッファ構成でピンポン動作する。
【００２８】
次に、前フレームと現フレームの縮小画像の差を検出することで、動き画素を抽出する。雑音による動き画素の誤抽出を防ぐために、過去の数フレームの動き画素の履歴を参照し、動き画素と判定された回数があるしきい値を超えたとき有意と判断し、その画素を動き画素として抽出する。
【００２９】
図６は図３のステップＳＰ１２に示すオブジェクトマスク作成の動作をより具体的に示すフローチャートであり、図７は動き履歴のアップデートを示す図であり、図８は動き画素判定しきい値適応領域を示す図であり、図９は３×３画素のウィンドウによる拡大処理を示す図であり、図１０は３×３画素のウィンドウによる縮小処理を示す図である。
【００３０】
次に、図７〜図１０を参照して、オブジェクトマスクの作成について説明する。図６のステップＳＰ４１において、前フレームの縮小画像の輝度値を PrevY_i,jとし、現フレームの縮小画像の輝度値を CurrY_i,jとしたとき、次の第（１）式のようにそれぞれの画素の差の絶対値 ABS_i,jを求める。
【００３１】
ABS_i,j＝｜ PrevY_i,j− CurrY_i,j｜ …（１）
次に、ステップＳＰ４２において、縮小画像における画素の位置ｉ，ｊに対する各Ｎフレームの動きの履歴 HIS_i,jをアップデートする。 HIS_i,jが過去の１フレーム単位で１ビットごとに記憶されているとしてアップデート前の HIS_i,jを図７（ａ）とする。このとき、現フレームＴに最も時間的に近い過去のフレームはＴ−１であり、Ｎフレームの過去のフレームはＴ−Ｎとなる。ここで、動き画素を１とし、静止画素を０とすると、Ｔ−ＮフレームからＴ−１フレームまでの履歴は順に、“１，０，１，…，０，１，１”となる。
【００３２】
アップデートは履歴を左へ１ビットシフトすることで行なう。その結果、図７（ｂ）に示すように、これから処理を行なう現フレームのビット位置Ｔに新しい動き情報を書込むための空きができ、最も古いＴ−Ｎフレーム目の履歴は捨てられて、Ｔ−Ｎ＋１フレームからＴフレームまでの“０，１，０，…，１，１，Ｘ”のビット列となる。なお、この動きの履歴は各画素につき８ビットとする。
【００３３】
このようにして得られた ABS_i,jをもとに、ステップＳＰ４３でしきい値ＴＨｙによりその画素が動き画素か静止画素かを判定する。
【００３４】
この発明の実施形態では、画面内の動き画素に、符号化すべき最も重要な要素（すなわち話者）があるとしている。よって、話者の背後で、話者とは別な動き対象があった場合は、話者とは別な対象にトラッキングしてしまう。
【００３５】
これを避けるために、以下の２つの操作が行なわれる。
すなわち、図８に示すように、画面の周囲には通話者がいることはないとし、画面の周囲のオブジェクトマスクを作成しない（上下左右とも、約１／１０の領域）。
【００３６】
動き判定しきい値を２種設ける。そして、話者の含まれる領域には感度の良いしきい値を適用し、それ以外の領域は感度を落としたしきい値を適用する。また、話者の含まれる領域とは、前フレームにて抽出された矩形領域をＸ方向にexp const X ，Y 方向にexp const Y だけ拡げた領域とする。ここで、exp const X ，exp const Y は例として、それぞれexp const X=4, exp const Y=8とする。すなわち、次の第（２）式を満足するときは、ステップＳＰ４４で CurrY_i,jを動き画素とし、満足しない場合はステップＳＰ４５で静止画素とする。
【００３７】
ABS_i,j> TH_ysense （前フレーム矩形領域内画素）
ABS_i,j> TH_yinsense （前フレーム矩形領域外画素）
…（２）
なお、TH_ysense,TH_yinsense は例として、実際のフレームレートが７より小さいときはTH_ysense を３とし、TH_yinsense を１２とし、７以上のときはTH_ysense を５とし、TH_yinsense を２０とする。
【００３８】
判定された結果は、動きの履歴の現フレームの位置（LSB の位置、図７（ｂ）のＸで示す部分）に１あるいは０を書込む。
【００３９】
次に、 CurrY_i,jの画素が過去Ｎフレームにおいて動き画素と判定された数 MvFrame_i,jを求めるために HIS_i,jのＮフレーム分の和を算出する。これは図７（ｂ）においてＴからＴ−Ｎ＋１までの HIS_i,jの位置の合計を算出して MvFrame_i,jとする。ここで、Ｎは８である。
【００４０】
次に、過去Ｎフレームにおける MvFrame_i,jの値からオブジェクト画素抽出のしきい値TH_objにより、
MvFrame_i,j> TH_obj …（３）
第（３）式を満足する場合は、有意な動きがあると判定してオブジェクト画素として抽出する。なお、オブジェクト画素には、ステップＳＰ４８で１を与え、それ以外にはステップＳＰ４９で０を与える。また、TH_objは例として３とする。ステップＳＰ５０で１フレーム分の処理が終了したことを判別すると、図３のステップＳＰ１３に戻り、オブジェクトマスク抽出において抽出されたオブジェクト画素の数をカウントしてNoMvとする。
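The per-pixel update described by equations (1) to (3) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `update_pixel` and its argument layout are assumptions, the history is held in a plain integer, and the exclusion of the screen border from the object mask is left out.

```python
N = 8         # history length in frames, one bit per frame
TH_OBJ = 3    # example object threshold from the text

def update_pixel(prev_y, curr_y, history, inside_prev_rect, frame_rate):
    """Return (new_history, is_object) for one pixel of the reduced image."""
    # Example thresholds from the text: (3, 12) below 7 fps, (5, 20) otherwise.
    th_sense, th_insense = (3, 12) if frame_rate < 7 else (5, 20)
    threshold = th_sense if inside_prev_rect else th_insense
    moving = abs(prev_y - curr_y) > threshold                 # eqs. (1) and (2)
    # Shift left by one bit, drop the oldest frame, write the new result at the LSB.
    history = ((history << 1) | int(moving)) & ((1 << N) - 1)
    mv_frames = bin(history).count("1")                       # MvFrame over N frames
    return history, mv_frames > TH_OBJ                        # eq. (3)
```

A pixel therefore becomes an object pixel only after it has been judged as moving in more than `TH_OBJ` of the last eight frames, which is what filters out isolated noise.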
【００４１】
たとえば、ＴＶ電話などの利用形態において、顔領域は通常は画面のほぼ中央に位置し、ＴＶ電話の利用中には大きく位置が変化することはない。よって、画面全体の中で画素の動き量NoMvが小さくなった場合には、顔領域抽出処理のオブジェクトマスクの拡大処理以降のラベル付けやエッジ抽出処理などといった演算をスキップし、矩形領域座標，オブジェクトマスク，エッジ画像を前回のフレームで求めた値を使うことで演算量の大幅な削減ができ、フレームレートの向上を図ることができる。ただし、始めの１０フレーム目までは顔位置が完全に定まっていないとし、スキップを行なわない。すなわち、ステップＳＰ１４で次の第（４）式を満足することを判別した場合は、オブジェクトマスク拡大処理以降をスキップする。
【００４２】
NoMv < THjudge and FrameCount > 10 …（４）
ここで、THjudge は例として１００とする。
【００４３】
また、矩形座標抽出処理以降の精密座標抽出，FaceMap 作成，HueMap作成処理のため、スキップ演算判定フラグ“Bypass flag ”を設ける。すなわち、
演算スキップ判別されたとき Bypass flag = 1
演算スキップ判定されなかったとき Bypass flag = 0
…（５）
ステップＳＰ１５でBypass flag が１の場合、矩形座標抽出処理，精密座標抽出，FaceMap 作成，HueMap作成処理では、前フレームで求めた値を返り値として使用する。
【００４４】
ステップＳＰ１６でBypass flag を０に設定する。オブジェクトマスクにはホールや欠けが存在するため、それらを埋める処理として、ステップＳＰ１７において３×３画素のウィンドウを用いた拡大を実施する。この拡大は、図９に示すように、オブジェクトマスクの各画素に３×３画素のウィンドウを設定し、中心の画素がオブジェクト画素であるときにその８近傍をすべてオブジェクト画素とすることにより行なわれる。
【００４５】
また、SubQCIF の縮小画像は３２×２４であり、ＣＩＦ／ＱＣＩＦの縮小画像（４４×３６）よりも小さい。このため、SubQCIF のオブジェクト画素１画素が実画像で対応する領域は、ＣＩＦ／ＱＣＩＦのオブジェクト画素の１画素が実画像で対応する領域よりも大きくなる。よって、SubQCIF の場合だけ、拡大処理後に３×３画素ウィンドウを用いた縮小を実施する。
【００４６】
この縮小は、図１０に示すように、オブジェクトマスクの各画素に３×３画素のウィンドウを設定し、中心の画素がオブジェクト画素でなかったときに、その８近傍すべてを非オブジェクト画素とすることにより行なわれる。
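The 3×3 enlargement and reduction described above correspond to binary dilation and erosion. The sketch below follows the wording of the text; the function names and the handling of pixels at the image border (neighbors outside the image are simply ignored) are assumptions.

```python
def _neighbors(j, i, h, w):
    """Yield the in-image coordinates of the 3x3 window centered on (j, i)."""
    for dj in (-1, 0, 1):
        for di in (-1, 0, 1):
            if 0 <= j + dj < h and 0 <= i + di < w:
                yield j + dj, i + di

def dilate(mask):
    """Enlargement: every object pixel turns its 8 neighbors into object pixels."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for j in range(h):
        for i in range(w):
            if mask[j][i]:
                for y, x in _neighbors(j, i, h, w):
                    out[y][x] = 1
    return out

def erode(mask):
    """Reduction: every non-object pixel turns its 8 neighbors into non-object pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for j in range(h):
        for i in range(w):
            if not mask[j][i]:
                for y, x in _neighbors(j, i, h, w):
                    out[y][x] = 0
    return out
```

Applying `dilate` and then `erode`, as the text does for SubQCIF only, fills small holes while returning the outline to roughly its original size.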
【００４７】
この発明の一実施形態のアルゴリズムでは、動き物体を顔領域と定義しているため、顔が静止したときにはこれらの領域の抽出ができなくなる。したがって、その場合には前フレームの領域を顔領域として使用することで動きがなくなった場合でも顔領域の消失を防ぐようにする必要がある。そのために、前フレームの動き量と現フレームの動き量との関係から動きしきい値TH_mvを算出する。
【００４８】
図１１は図３に示すステップＳＰ１８のTH_mv算出の動作をより具体的に示すフローチャートであり、図１２は最大領域の抽出を示す図であり、図１３は頭頂の検出を示す図である。
【００４９】
TH_mvからオブジェクトマスクの論理和処理判定として動きが少なくなったときのオブジェクトマスクの消失を防ぐために、NoMvの値から次の第（６）式によってオブジェクトマスクの論理和を判定する。
【００５０】
NoMv < TH_mv …（６）
ステップＳＰ１９において第（６）式を満足するとき、ステップＳＰ２０で前フレームのオブジェクトマスクと現フレームのオブジェクトマスクの論理和をとり、これを新たな現フレームのオブジェクトマスクとする。第（６）式を満足しない場合は、論理和処理を行なわず、次に示すオブジェクトマスクの抽出処理に進む。
【００５１】
画面には多くの動物体が存在する場合があり、その中から抽出の対象だけを選択する必要がある。たとえば、テレビ電話などにおいては、対象者がカメラに最も近い位置に立つ確率が高いので、ここでは最も大きい連結領域をオブジェクトマスクから切り離す。
【００５２】
この発明の一実施形態では、図１２に示すように、縦方向に４画素ごとに区切ったスリット内における画素の合計のｘ軸への投影を行なってラベル付けを実施することにより、ステップＳＰ２１で最大領域を抽出する。さらに、最大領域の左端，右端のスリットにおいて左端のスリットのときはそのスリットの左端の座標を求め、右端のスリットのときはそのスリットの右端の座標をそれぞれXa，Xbとして求めておく。このとき、最大領域の画素数が同じものが複数あった場合は、一番左端のものを最大領域とする。
【００５３】
現フレームのオブジェクトマスクに存在するホールや欠けを埋めるために、ステップＳＰ２２で３×３画素のウィンドウによる拡大処理を実施する。拡大の方法は前述の図９と同様である。なお、拡大前のオブジェクトマスクは次のフレームでの処理に利用するため、拡大されたオブジェクトマスクを別のエリアに書込む。したがって、拡大前のオブジェクトマスクが次フレーム処理で前フレームのオブジェクトマスクとなる。そのために、現フレームのオブジェクトマスクと前フレームのオブジェクトマスクは、縮小画像と同様にダブルバッファ構成でピンポン動作となる。
【００５４】
前述のごとく求めた最大領域の左端，右端の座標Xa，Xbの間においてステップＳＰ２３で頭頂を検出する。図１３において頭頂がオブジェクトマスクの最上部なので、まずXaとXbに挟まれるオブジェクトマスクの横方向の４画素ごとの画素の合計をＹ軸方向に投影する。次いで、投影した画像を上から調べ、最初にしきい値THtop 以上となるスリットにおいて、そのスリットの左上から右下へ走査し、最初に検出されるＹ座標を頭頂のＹ座標HeadTopYとする。さらに、HeadTopYにおいて、左側から走査して最初に検出されるオブジェクトのＸ座標をＸ１とする。同様に、HeadTopYにおいて、右側から走査して最初に検出されるオブジェクトの座標をＸ２とする。そして、Ｘ１とＸ２の中心を頭頂のＸ座標HeadTopXとする。なお、THtop は、ここでは４とする。
【００５５】
図１４は図４のステップＳＰ２４における顔幅検出の具体的な動作を示すフローチャートであり、図１５は顔幅検出範囲の決定を示す図であり、図１６は顔幅の検出を示す図である。
【００５６】
次に、図１４〜図１６を参照して、顔幅の検出処理について説明する。顔領域は通常、画面のほぼ中央に位置し、画面の下方には肩の領域がある。これを利用し、ステップＳＰ７１で顔領域の検出範囲を定める。
【００５７】
頭頂位置HeadTopYから画面の下までの領域の１／２のラインに注目し、このラインを検出ラインとする。すなわち、検出ラインのＹ座標sLineYは以下のようになる。
【００５８】
sLineY = HeadTopY+（Ymax-HeadTopY ）／２
Ymax = 35 : CIF, QCIF
Ymax = 23 : SubQCIF
…（７）
検出ラインの左から右へオブジェクト画素を検索する。最初にオブジェクト画素が見つかったＸ座標を左側の検出範囲の最大maxXa とする。同様に、検出ラインの右から左へオブジェクト画素を検索し、最初にオブジェクト画素が見つかったＸ座標を右側の検出範囲の最大maxXb とする。
【００５９】
ここで、求めたmaxXa が第（８）式を満たすときに、maxXa を左側の最大検出範囲とする。これを満たさないときは、Xaを左側の最大検出範囲とする。
【００６０】
Xa < maxXa < HeadTopX …（８）
また、同様にmaxXb が第（９）式を満たすときに、図１５に示すように、maxXb を右側の最大検出範囲とする。これを満たさないときは、Xbを右側の最大検出範囲とする。
【００６１】
HeadTopX < maxXb < Xb …（９）
次に、ステップＳＰ７２において画面を４画素ごとに縦方向のスリットに分割し、図１６に示すようにHeadTopXを含むエリアを求める。図１６ではエリアｂにHeadTopXが含まれている。次に、ステップＳＰ７３およびＳＰ７４においてエリアｂとその左方向に隣接するエリアａとの画素数の合計の差を求める。この図１６では、エリアｂの画素数が１２０であり、エリアａの画素数が８０である。したがって、その差PelSubは、第（１０）式のようになる。
【００６２】
PelSub = 120-80 = 40 …（１０）
ここで、ステップＳＰ７５でPelSubを画素数の差のしきい値THsub と比較し、第（１１）式を満足したとき、ステップＳＰ７６およびＳＰ７７の処理により比較エリアの左端の座標を左の顔幅の座標FaceWideXlとする。
【００６２】
PelSub > THsub …（１１）
この例では、エリアｂとエリアａとの画素数の差が第（１１）式を満足するので、比較エリアであるエリアａの左端の座標がFaceWideXlとなる。
【００６４】
左の顔幅検出終了後、右方向についても同様にステップＳＰ７９〜ＳＰ８３の処理を行ない、右の顔幅の座標FaceWideXrを得る。
【００６５】
図１７は顔幅補正の流れを示すフローチャートである。次に、図１７を参照して、顔幅の補正処理について説明する。上述のごとく検出された顔幅は、雑音や動き不十分により検出を失敗している可能性がある。そのため、顔幅が小さ過ぎたり、顔幅がHeadTopXを基準として偏り過ぎたりしている場合がある。それを補正するために、FaceWideXl，FaceWideXrに対して図１７のフローチャートのステップＳＰ９１〜ＳＰ１０１に従って補正処理を行なう。
【００６６】
次に、頭頂座標HeadTopYと顔幅の座標FaceWideXl，FaceWideXrから、顔の下の座標FaceBottomY を次の第（１２）式により求める。
【００６７】
FaceBottomY = HeadTopY +（FaceWideXr- FaceWideXl）* 1.5 …（１２）
さらに、これらの値から顔領域を表わす矩形の座標を顔矩形領域として、左上および右下の座標を
（FaceWideXl，HeadTopY），（FaceWideXr，FaceBottomY ）
と定義する。
【００６８】
以上のようにして、動きベースによる矩形領域が求められる。
次に、ステップＳＰ２５およびＳＰ２６で矩形領域内の顔領域抽出による精密座標の検出処理を行なう。抽出対象となる顔領域は頬部分の縦線が強く、目のまわりの横線が強いという特徴を使い、顔領域の抽出を行なう。ただし、処理のスキップ判定がされていた場合（Bypass flag = 1 ）は、この処理を行なわずに前フレームで求めた精密座標を現フレームの精密座標とする。
【００６９】
図１８は精密座標の検出処理の動作を示すフローチャートである。
人間の顔において頬は縦線が強く、目，鼻，口は横線が強いという特徴がある。この性質を利用することで顔の判定が可能となる。したがって、図１８のステップＳＰ１１１〜ＳＰ１２０の処理により縦および横エッジを顔矩形領域内のオブジェクトマスク内に存在する現画像のＹの画素から抽出する。その際、顔矩形領域は縮小画像の大きさであるため、それぞれの画像フォーマットの大きさに対応させて使用する。ただし、処理量を軽減するという観点からＣＩＦの場合は縦横１画素おきの処理とする。
【００７０】
図１９はエッジの検出動作を示す図であり、図２０は顔幅精密座標検出動作を示す図であり、図２１は横エッジの特徴量を示す図である。
【００７１】
まず、顔矩形領域内においてオブジェクトマスクに存在する画素を注目画素として、図１９に示すように３×３画素のウィンドウを設定する。このとき、縦エッジUX，横エッジUYは第（１３）式で表わされる。
【００７２】
DX =｜C+2F+I-A-2D-G ｜
DY =｜G+2H+I-A-2B-C ｜
UX = fix（DX-DY ）
UY = fix（DY-DX ）
ただし、 fix（a ）= a （a ≧０），０（a<０）
…（１３）
さらに、UX，UYをしきい値THedgeで２値化し、縦エッジ画素VEdge ，横エッジ画素HEdge を第（１４）式によって得る。なお、THedgeは例として６０とする。
【００７３】
UX > THedge →VEdge = 1
UX≦ THedge →VEdge = 0
UY > THedge →HEdge = 1
UY≦ THedge →HEdge = 0
…（１４）
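Equations (13) and (14) can be sketched for a single 3×3 window as follows. The drawing that labels the window is not reproduced here, so the row-major layout A B C / D E F / G H I is an assumption; under that layout DX and DY are the absolute Sobel responses. The function name `edge_flags` is also hypothetical.

```python
TH_EDGE = 60   # example threshold from the text

def edge_flags(win):
    """Return (VEdge, HEdge) for a 3x3 luma window [[A, B, C], [D, E, F], [G, H, I]]."""
    (a, b, c), (d, _e, f), (g, h, i) = win
    dx = abs(c + 2 * f + i - a - 2 * d - g)   # horizontal gradient, strong on vertical edges
    dy = abs(g + 2 * h + i - a - 2 * b - c)   # vertical gradient, strong on horizontal edges
    ux = max(dx - dy, 0)                      # fix(DX - DY)
    uy = max(dy - dx, 0)                      # fix(DY - DX)
    return int(ux > TH_EDGE), int(uy > TH_EDGE)
```

Because UX and UY are each clipped differences of DX and DY, at most one of the two flags can be set for a given pixel, so a diagonal edge with similar responses in both directions is counted as neither.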
次に、縦・横エッジ画像における顔幅の長さXlenを顔幅の座標FaceWideXl，FaceWideXrから第（１５）式で求める。
【００７４】
Xlen =（FaceWideXr - FaceWideXl +1）* 4 …（１５）
次に、一辺がXlenの正方形を縦エッジ画像の上部に設定して、図２０に示すようにこの内部を縦エッジの探索領域とする。このとき、中心付近には鼻が存在するとし、鼻の幅を以下の第（１６）式で求める。
【００７５】
NoseWidth = Xlen/4 …（１６）
また、エッジ画像の座標系における頭頂HeadCenterを、矩形座標の頭頂HeadTopXから第（１７）式により求める。
【００７６】
HeadCenter = HeadTopX ×4 …（１７）
そして、頭頂位置HeadCenterからNoseWidth の分だけを避けた左の位置を探索開始位置として、ここから左の領域を順次探索し、縦方向の画素の累積値が最初にしきい値THpeakより大きくなる位置をPeakL とする。
【００７７】
また、同様に、頭頂位置HeadCenterからNoseWidth の分だけを避けた右の位置を探索開始位置として、ここから右の領域を順次探索し、縦方向の画素の累積値が最初にしきい値THpeakより大きくなる位置をPeakR とする。
【００７８】
ここで、THpeakは次の第（１８）式より求めることができる。
THpeak = Xlen × PeakRathio / 10
ただし、
PeakRathio = 2 …（１８）
PeakL ，PeakR の累積値がともにTHpeak以上であり、かつPeakL とPeakR のＸ座標の値の差がemnTHsub以上であったとき、顔幅が存在したと判断して精密座標抽出成功とし、後述する横エッジからの特徴抽出を行なう。また、PeakL ，PeakR のＸ座標をそれぞれEMNl，EMNrとする。
【００７９】
もし、上記の条件を満足しない場合は、顔幅精密座標抽出失敗と判断して、前フレームで求めたEMNl，EMNrを現フレームのEMNl，EMNrとする。ただし、前フレームで求めたEMNl，EMNrは現フレームの矩形座標に対して大きく外れている場合がある。この場合は、前フレームのEMNl，EMNrは不適切と判断する。現フレームの矩形座標から大きく外れているかどうかの判定には、現フレームの矩形領域の幅の左右parm_x％の領域に前フレームのEMNl，EMNrが入っているかを判定する。すなわち、前フレームのEMNl，EMNrをそれぞれpreEMNl ，preEMNr としたとき、第（１９）式を判定する。
【００８０】
preEMNl > FaceWideXl×4 -（FaceWideXr - FaceWideXl）×4×parm_x÷100
かつ
preEMNr < FaceWideXr×4 +（FaceWideXr - FaceWideXl）×4×parm_x÷100
…（１９）
第（１９）式が満たされるときは、preEMNl ，preEMNr を現フレームのEMNl，EMNrとし、満たされない場合は、精密座標抽出失敗とし、後述する顔検出の失敗処理を行なう。ここで、parm_xは例として１２とする。
【００８１】
人間の顔は顔幅の間に横線特徴をもつ目・鼻・口が存在することを利用し、顔幅間で横エッジの分布を調べることで顔らしさの判断を行なう。図２１のように太線で示す横エッジ探索領域を横エッジ画像の大きさに対応させた顔矩形領域の下部に設定し、この探索領域内においてEMNlとEMNrの間を３等分した領域を決定する。各領域ごとの横エッジの画素数の累積を求め、それぞれSY0, SY1, SY2とする。
【００８２】
SY0, SY1, SY2 の関係が第（２０）式を満足していたとき顔領域と判定し、縦方向の精密座標検出処理を行なう。満足しない場合は後述する顔検出の失敗処理を行なう。
【００８３】
SY0 < SY1 かつSY2 < SY1 …（２０）
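The face-likeness test of equation (20) can be sketched as below. The function name, the half-open coordinate convention, and the way a width that is not divisible by three is split (the remainder goes to the right-hand third) are assumptions made for illustration.

```python
def looks_like_face(hedge, emn_l, emn_r, y0, y1):
    """Apply eq. (20) to a binary horizontal-edge image.

    The search area covers columns [emn_l, emn_r) and rows [y0, y1).
    """
    third = (emn_r - emn_l) // 3
    sums = []
    for k in range(3):
        x0 = emn_l + k * third
        x1 = emn_r if k == 2 else x0 + third
        sums.append(sum(hedge[y][x] for y in range(y0, y1) for x in range(x0, x1)))
    sy0, sy1, sy2 = sums
    # Eyes, nose and mouth concentrate horizontal edges in the middle third.
    return sy0 < sy1 and sy2 < sy1
```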
顔領域の判定に成功した場合は、縦方向の精密座標の検出処理を行なう。図２２に示すように横エッジ画素をＹ軸方向に累積したときの分布を求め、上方から走査したとき初めてしきい値THemn 以上となるＹ座標を顔面の上部の座標EMNtopとする。また、下方から走査して初めてTHemn 以上となるＹ座標を顔面の下部の座標EMNbottom とする。
【００８４】
ここで、もしもEMNtopとEMNbottom の差がemnTHsubより小さかった場合は、精密座標抽出失敗とし、後述する顔検出の失敗処理を行なう。
【００８５】
顔領域の検出に成功した場合は、顔領域検出成功フラグFindFaceを第（２１）式として精密座標検出処理を終了する。
【００８６】
FindFace = 1 …（２１）
また、顔領域の検出に失敗した場合は、顔領域検出成功フラグFindFaceを第（２２）式とし、精密座標検出処理を終了する。
【００８７】
FindFace = 0 …（２２）
顔検出失敗時は、抽出された領域以外に顔の存在する可能性がある。そのときは、顔領域抽出処理の後段で行なわれる符号化制御処理やフィルタ処理で、実際の顔の部分に対して顔領域以外としての処理が行なわれ、顔の画質が低下するおそれがある。これを避けるために、顔検出失敗時には、矩形領域を実際より大きな領域とするかまたは画面全体を領域とすることで顔が顔領域処理の対象領域から外れないようにする。
【００８８】
まず、現フレームで求められた矩形領域Ｘ座標の中心を第（２３）式のようにFaceCenterとする。
【００８９】
FaceCenter =（FaceWideXr - FaceWideXl ）/2 + FaceWideXl …（２３）
画面の中に３つの大きな領域（area0, area1, area2 ）を設定し、FaceCenterの位置より、area0, area1, area2 のうちのどれかを現フレームの矩形領域とする。すなわち、第（２４）式により現フレームの置換えを行なう。
【００９０】
0 ≦FaceCenter < Border01 → 矩形領域をarea0 に置換え
Border01 < FaceCenter ≦ Border12 → 矩形領域をarea1 に置換え
Border12 < FaceCenter ≦Ｘ座標の最大→矩形領域をarea2 に置換え
…（２４）
表２に置換え領域（area0, area1, area2 ）選択の判定値Border01, Border12の値と、（area0, area1, area2 ）の置換え値を示す。
【００９１】
【表２】

【００９２】
顔検出失敗時の処理は、顔領域抽出関数をコールする符号化制御部により、FindFaceフラグにより判定されて行なわれる。これは、顔領域抽出部でフレーム間処理の連続性を断ち切らないためである。これにより、次フレームで顔検出に成功した場合は、直ちに正しい矩形領域に復帰することができる。
【００９３】
次に、前記矩形領域の抽出処理で求めた矩形領域の拡大処理を行なう。動きが少なくなったときには全体的な符号量が減少するので、その分、顔領域を拡大することができる。その判定には第（２５）式からどれだけ動きが少ないフレームが連続したかをカウントするStill Count を求め、ステップＳＰ２７、ＳＰ２８およびＳＰ２９の処理を行なう。
【００９４】
NoMv < TH_still
then : StillCount ++
NoMv≧TH_still
then : StillCount = 0
StillCount > TH_still_max
then : StillCount = 0
…（２５）
そして、ステップＳＰ３０で次のようにして領域の拡大領域RgnExpを求め、FaceWideXl，FaceWideXr，HeadTopYを拡大し、それぞれFaceWideXle, FaceWideXre，HeadTopYe とする。なお、THstill はここでは２０，THexp はＣＩＦのとき２０とし、ＱＣＩＦのとき５０とし、ＳＱＣＩＦのとき７０とした。
【００９５】
RgnExp = StillCount / TH_exp
FaceWideXle = FaceWideXl - RgnExp
FaceWideXre = FaceWideXr + RgnExp
HeadTopYe = HeadTopY - RgnExp
ただし、
FaceWideXle < 0 のとき、FaceWideXle = 0
FaceWideXre > 43のとき、FaceWideXre = 43
HeadTopYe < 0 のとき、HeadTopYe = 0
…（２６）
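Equations (25) and (26) can be sketched as two small functions. The text gives no value for the upper limit of the still counter, so `th_still_max` is passed in as a parameter; integer division in RgnExp and the function names are likewise assumptions.

```python
TH_STILL = 20                                    # example value from the text
TH_EXP = {"CIF": 20, "QCIF": 50, "SQCIF": 70}    # example values from the text

def update_still_count(still_count, no_mv, th_still_max):
    """Eq. (25): count consecutive low-motion frames, resetting at the limit."""
    still_count = still_count + 1 if no_mv < TH_STILL else 0
    if still_count > th_still_max:
        still_count = 0
    return still_count

def expand_region(x_l, x_r, top_y, still_count, fmt):
    """Eq. (26): widen the rectangle by RgnExp, clamped to the 44-pixel-wide image."""
    rgn_exp = still_count // TH_EXP[fmt]
    return max(x_l - rgn_exp, 0), min(x_r + rgn_exp, 43), max(top_y - rgn_exp, 0)
```

The longer the scene stays still, the larger RgnExp becomes, so the spare bit budget of a quiet scene is spent on a wider face region.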
また、前フレームの顔幅PreFaceWideXle, PreFaceWideXre，頭頂PreHeadTopYeと比較し、差がTHdif より小さいときは頻繁な変動を防ぐために前フレームの値を使用する。なお、THdif はここでは３とした。
【００９６】
｜PreFaceWideXle - FaceWideXle｜< TH_dif
then: FaceWideXle = PreFaceWideXle
｜PreFaceWideXre - FaceWideXre｜< TH_dif
then: FaceWideXre = PreFaceWideXre
｜PreHeadTopYe - HeadTopYe｜< TH_dif
then: HeadTopYe = PreHeadTopYe
…（２７）
さらに、移動量に第（２８）式に示すローパスフィルタをかけ、前フレーム位置からの急峻な変動を抑制する。
【００９７】
FaceWideXle = PreFaceWideXle× k + FaceWideXle×（1-k ）
FaceWideXre = PreFaceWideXre× k + FaceWideXre×（1-k ）
HeadTopYe = PreHeadTopYe× k + HeadTopYe×（1-k ）
k = 1/2
…（２８）
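Equations (27) and (28) apply the same two steps to each of the three coordinates, which can be sketched for a single coordinate as follows. The function name `smooth` is hypothetical, and the result is left as a float because the text does not say how it is rounded.

```python
TH_DIF = 3   # example value from the text
K = 0.5      # filter coefficient k = 1/2

def smooth(prev, curr):
    """Return the filtered coordinate given the previous and newly measured values."""
    if abs(prev - curr) < TH_DIF:     # eq. (27): ignore small changes
        curr = prev
    return prev * K + curr * (1 - K)  # eq. (28): first-order low-pass filter
```

With k = 1/2 a sudden jump of the measured coordinate is followed by half of the remaining distance in each frame, for example `smooth(10, 20)` gives 15.0.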
顔領域は、一般に画面に対して中心付近に位置する。そこで、ステップＳＰ３１で顔の長さを算出する。すなわち、画面の上下左右の位置に対して第（２９）式に示すクリッピングを施し、画面の端に矩形領域が片寄らないようにする。
【００９８】
FaceWideXle < minX
then: FaceWideXle = minX
FaceWideXre > maxX
then: FaceWideXre = maxX
HeadTopYe < minY
then: HeadTopYe = minY
ただし、
minX = 縮小画像の横幅×1/10
maxX = 縮小画像の横幅−（縮小画像の横幅×1/10）
minY = 縮小画像の縦幅×1/12
FaceWideXre < FaceWideXle
then: FaceWideXre = FaceWideXle
…（２９）
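The clipping of equation (29) can be sketched as below. It assumes that the second clamp limits the right edge FaceWideXre, that the top coordinate being clipped is HeadTopYe, and that the 1/10 and 1/12 margins are computed with integer division; none of this is stated explicitly in the text, and the function name is hypothetical.

```python
def clip_rect(x_l, x_r, top_y, width=44, height=36):
    """Keep the rectangle away from the screen edges, following eq. (29)."""
    min_x = width // 10            # 4 for the 44-pixel-wide reduced image
    max_x = width - width // 10    # 40
    min_y = height // 12           # 3 for the 36-pixel-high reduced image
    x_l = max(x_l, min_x)
    x_r = min(x_r, max_x)
    top_y = max(top_y, min_y)
    x_r = max(x_r, x_l)            # the right edge never ends up left of the left edge
    return x_l, x_r, top_y
```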
最後に、次のフレームのために顔幅と頭頂を記憶しておく。
【００９９】
PreFaceWideXle = FaceWideXle
PreFaceWideXre = FaceWideXre
PreHeadTopYe = HeadTopYe
…（３０）
顔幅と頭頂位置から顔の下の座標FaceBottomYeを推定する。
【０１００】
FaceBottomYe = HeadTopYe +（FaceWideXre - FaceWideXle +1）* 1.5
…（３１）
FaceWideXle, FaceWideXre, HeadTopYe と同様に第（３２）式によりクリッピングを施し、画面の端に矩形領域が偏らないようにする。
【０１０１】
FaceBottomYe > maxY
then: FaceBottomYe = maxY
ただし、
maxY =縮小画像の縦幅−（縮小画像の縦幅×1/12）
FaceBottomYe < HeadTopYe
then: FaceBottomYe = HeadTopYe
…（３２）
次に、ステップＳＰ３２で抽出された顔幅、頭頂、顔の長さにより得られる顔領域の矩形領域から符号化制御の際に使用するFaceMap を作成する。顔幅，頭頂，顔の長さの各座標は４４×３６画素の縮小画像に対応しているので、これを各画像フォーマットのマクロブロックに対応させ、この領域内に５０％以上含まれるマクロブロックには１を、それ以外のマクロブロックには０を表わすラベルを付ける。FaceMap は画像フォーマットごとに大きさが異なり、ＣＩＦのとき２２×１８となり、ＱＣＩＦのとき１１×９となり、ＳＱＣＩＦのとき８×６の大きさとなる。
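The FaceMap construction can be sketched as follows, working directly in reduced-image coordinates. A 16×16 macroblock covers 2×2 reduced pixels for CIF (8×8 averaging) and 4×4 reduced pixels for QCIF and SQCIF (4×4 averaging), which yields the map sizes given in the text. Treating the rectangle bounds as inclusive and the function name are assumptions.

```python
MB_SPAN = {"CIF": 2, "QCIF": 4, "SQCIF": 4}   # reduced pixels per macroblock side
MAP_SIZE = {"CIF": (22, 18), "QCIF": (11, 9), "SQCIF": (8, 6)}   # (columns, rows)

def make_face_map(x_l, x_r, top_y, bottom_y, fmt):
    """Label each macroblock 1 if at least 50% of it lies inside the face rectangle."""
    span = MB_SPAN[fmt]
    cols, rows = MAP_SIZE[fmt]
    face_map = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            covered = sum(
                1
                for y in range(r * span, (r + 1) * span)
                for x in range(c * span, (c + 1) * span)
                if x_l <= x <= x_r and top_y <= y <= bottom_y
            )
            if 2 * covered >= span * span:
                face_map[r][c] = 1
    return face_map
```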
【０１０２】
なお、動きベースによる顔抽出は動きの検出に数フレームの履歴を使用しているため、顔領域抽出処理の開始フレームからTHtrack まではFaceMap がオールゼロとする。
【０１０３】
また、処理のスキップ判定がなされていた場合（Bypass flag =1）は、この処理を行なわずに前フレームで求めたFaceMap を現フレームのFaceMap とする。
【０１０４】
次に、色ベースによる肌色抽出処理を行なう。色ベースによる肌色抽出は、前記の動きベースによる顔抽出において抽出できなかった他の肌の領域、たとえば、手，腕，首などの領域を抽出するために実施され、HueMapを作成する。
【０１０５】
ただし、処理のスキップ判定がされていた場合（Bypass flag =1）は、この処理を行なわずに前フレームで求めたHueMapを現フレームのHueMapとする。
【０１０６】
図２３は色ベースによる肌色抽出の処理の流れを示すフローチャートである。図２３において、ステップＳＰ１２１でBypass flag が１でないことを判別し、ステップＳＰ１２２において顔面の精密座標から肌色サンプル領域を設定する。顔面の精密座標EMNtop, EMNbottom, EMNl, EMNr から得られる矩形領域をCb, Crに設定して肌色サンプル領域とする。ただし、ＱＣＩＦ，ＳＱＣＩＦの場合は精密座標を半分にした値を使用する。
【０１０７】
次に、ステップＳＰ１２３において、肌色サンプル領域内の平坦領域におけるCb, Crの平均および標準偏差の算出を行なう。すなわち、前述した動きベースによる顔抽出で作成した縦および横エッジ画像のエッジ画素を除いた肌色サンプル領域内のCb, Crの画素の値から、Cb, Crのそれぞれの平均μu,μv,標準偏差σu,σv を算出する。
【０１０８】
色ベースによる肌色抽出は動きベースによる顔抽出結果に基づいて実施されるが、動きベースによる抽出が１００％正確でないため、顔領域を誤って抽出していた場合には肌色抽出に悪影響を与える。したがって、サンプルしたCb, Crの分布により肌色抽出を実施するかどうかを判断する必要がある。これは、標準偏差σu,σv から第（３３）式を満足したときに単峰性のピークがあると考え、ステップＳＰ１２４で肌色抽出処理を実施し、肌色画素の抽出処理を行なう。
【０１０９】
σ_u< THσかつσ_v< THσ …（３３）
第（３３）式を満足したときに単峰性のピークがあると考え、ステップＳＰ１２６〜ステップＳＰ１２８による肌色抽出処理を実施し、肌色画素の抽出処理を行なう。第（３３）式を満足しないときには色にばらつきがあり安定した肌色抽出ができないので、ステップＳＰ１２５で肌色抽出処理を中止し、FaceMap と同じ構成でHueMapを作成してオールゼロとし、処理を終了する。なお、THσはここでは２０とする。
【０１１０】
ステップＳＰ１２６〜ＳＰ１２８の肌色画素の抽出では、始めにステップＳＰ１２６でCb, Crの抽出範囲の値［Cbl, Cbh］，［Crl, Crh］を次のように決定する。
【０１１１】
［Cbl, Cbh］= ［μ_u- σ_u, μ_u+ σ_u］
［Crl, Crh］= ［μ_v- σ_v, μ_v+ σ_v］
…（３４）
次に、ステップＳＰ１２６で得られた抽出範囲に従い、Cb, Crがともにその範囲に属している画素をステップＳＰ１２７で抽出する。さらに、現フレームのオブジェクトマスク内の画素のみを抽出して肌色領域の数を作成する。
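Equations (33) and (34) can be sketched as follows. The text does not say whether the population or the sample standard deviation is meant; the population form (`statistics.pstdev`) is assumed here, and the function names are hypothetical. The additional restriction to pixels inside the object mask is omitted.

```python
from statistics import mean, pstdev

TH_SIGMA = 20   # example value from the text

def skin_ranges(cb_samples, cr_samples):
    """Return ((Cbl, Cbh), (Crl, Crh)), or None when the sample is too scattered."""
    mu_u, mu_v = mean(cb_samples), mean(cr_samples)
    sd_u, sd_v = pstdev(cb_samples), pstdev(cr_samples)
    if not (sd_u < TH_SIGMA and sd_v < TH_SIGMA):    # eq. (33) not satisfied
        return None
    return (mu_u - sd_u, mu_u + sd_u), (mu_v - sd_v, mu_v + sd_v)   # eq. (34)

def is_skin(cb, cr, ranges):
    """A pixel is skin-colored when both Cb and Cr fall inside their ranges."""
    (cb_l, cb_h), (cr_l, cr_h) = ranges
    return cb_l <= cb <= cb_h and cr_l <= cr <= cr_h
```

Returning `None` corresponds to step SP125, where skin color extraction is abandoned and HueMap is set to all zeros.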
【０１１２】
このようにして作成された肌色領域をそれぞれ画像フォーマット（ＣＩＦ，ＱＣＩＦ，ＳＱＣＩＦ）に対応させ、マクロブロック内に５０％以上肌色画素を含むとき、そのマクロブロックを１とし、そうでないときは０のラベルを持つHueMapをステップＳＰ１２８で作成する。なお、顔領域抽出処理の開始フレームからTHtrack まではHueMapをオールゼロとする。
【０１１３】
今回開示された実施の形態はすべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は上記した説明ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。
【０１１４】
【発明の効果】
以上のように、この発明では、矩形領域抽出時に、前フレーム矩形領域外の動き成分判定しきい値を矩形領域内動き成分判定しきい値よりも感度を落とすことにより、対象人物周囲の動き成分雑音を矩形領域と判定しにくくし、対象人物顔領域外に矩形領域が移動しにくくすることで矩形領域の判定精度を向上できる。
【０１１５】
また、前フレームと現フレーム間の動き画素量により動き画素量が少ないときには、動き画素量算出以降の演算を行なわないことで、矩形領域抽出処理の全体の演算量を軽減できる。
【０１１６】
さらに、顔画像特徴抽出による精密座標判定のための顔幅判定時に、顔幅検索範囲を頭頂座標から画面最下座標までの１／２の位置の矩形座標の左右幅とすることで、精密座標検索の精度向上と演算の高効率化を行なうことができる。
【０１１７】
顔画像特徴抽出による精密座標判定のための顔幅判定時に、鼻成分を避けることにより、実際の顔幅よりも狭い領域を精密座標と判定しないようにする。
【０１１８】
さらに、顔画像特徴抽出による精密座標抽出が失敗した場合、ある決められた大きさの領域を矩形領域とすることで矩形領域外に顔画像が存在することを低減できる。
【０１１９】
さらに、ある決められた大きさの領域を矩形領域とすることに代えて、画面全体を矩形領域とすることで矩形領域外に顔画像が存在することを低減できる。
【０１２０】
さらに、現フレームで動き画素より算出された矩形領域座標と、前フレームの矩形領域座標の値にローパスフィルタをかけ、動き物体が画面内で急激に移動した場合でも滑らかに矩形領域が動き物体に追従することができる。
【０１２１】
さらに、画面内の端の領域には有効な顔画像が存在しないようにし、求められた矩形領域座標値に対してクリッピング処理を行なうことで、後の精密座標抽出時の精度向上と演算量削減を行なうことができる。
【０１２２】
さらに、顔画像特徴抽出による精密座標判定のための顔幅判定時に、前フレームで求めた精密座標と現フレームで求めた精密座標の値の差があるしきい値内であれば前フレームで算出した精密座標値を現フレームの精密座標として採用することにより、精密座標が見失われることを避けることができる。このとき、前フレームで求めた精密座標を利用することに代えて、過去数フレームの精密座標の値の平均値を用いることで同様の作用を得ることができる。
【図面の簡単な説明】
【図１】Ｈ．２６１／Ｈ．２６３画像エンコーダと顔領域抽出部の関係を示すブロック図である。
【図２】顔領域抽出の大まかな流れを示すフローチャートである。
【図３】動きベースによる顔抽出を示す前半のフローチャートである。
【図４】動きベースによる顔抽出を示す後半のフローチャートである。
【図５】縮小画像の作成例を示す図である。
【図６】オブジェクトマスク作成の流れを示すフローチャートである。
【図７】動き履歴のアップデートを示す図である。
【図８】動き画素判定しきい値適応領域を示す図である。
【図９】３×３画素のウィンドウによる拡大処理を示す図である。
【図１０】３×３画素のウィンドウによる縮小処理を示す図である。
【図１１】 TH_mv算出の流れを示すフローチャートである。
【図１２】最大領域の抽出を示す図である。
【図１３】頭頂の検出例を示す図である。
【図１４】顔幅検出の流れを示すフローチャートである。
【図１５】顔幅検出範囲の決定を説明するための図である。
【図１６】顔幅の検出例を示す図である。
【図１７】顔幅補正の流れを示すフローチャートである。
【図１８】精密座標の検出処理の流れを示すフローチャートである。
【図１９】エッジ検出オペレータを示す図である。
【図２０】顔幅精密座標検出例を示す図である。
【図２１】横エッジの特徴量を示す図である。
【図２２】顔面の縦方向の精密座標算出例を示す図である。
【図２３】色ベースによる肌色抽出の処理の流れを示すフローチャートである。
【符号の説明】
１０１顔領域抽出部
１０２符号化制御部
１０３減算器
１０４，１０５スイッチ
１０６加算器
１０７変換器
１０８量子化器
１０９逆量子化器
１１０逆変換器
１１１ループ内フィルタ
１１２予測メモリ
[0001]
[Technical field to which the invention belongs]
The present invention relates to an image processing apparatus, and in particular to an image processing apparatus that uses moving image encoding technology to process images in which the target image contains a face image of a person or the like, as in a TV phone.
[0002]
[Prior art]
One facial image feature extraction technique for image processing apparatuses, described for example in Japanese Patent Laid-Open No. 7-29014, extracts a rectangular region by treating a moving object as the region of a person, and then extracts the skin color area of the face region from within that rectangular region according to the features of the face image.
[0003]
[Problems to be solved by the invention]
However, the invention described in the above-mentioned Japanese Patent Laid-Open No. 7-29014 may erroneously extract motion components other than the person region, a consequence of its algorithm of extracting the motion region by motion element determination. In addition, because it takes the vertical edge components in the face image as the two sides of the face, the vertical edge of the nose region may be mistaken for the right or left cheek. Furthermore, when the face image region extraction is implemented not in hardware but in software on a processor or the like, unnecessary computation degrades the performance of the entire system.
[0004]
Therefore, the main object of the present invention is to provide an image processing apparatus that improves extraction performance and reduces false extraction in techniques that extract a moving rectangular area from moving pixels and extract the precise coordinates of a face from facial features within that rectangular area.
[0005]
[Means for Solving the Problems]
  The invention according to claim 1 is an image processing apparatus that extracts a human face image from an input moving image, comprising rectangular area extracting means for extracting a rectangular area consisting of a moving object area based on the input moving image, and face image feature extraction means for extracting precise coordinates based on the features of the face image in the extracted rectangular area. When the rectangular area extracting means extracts the rectangular area based on the difference between the images of the previous frame and the current frame, it sets the threshold for determining motion components outside the rectangular area of the previous frame to a lower sensitivity than the threshold for determining motion components inside that rectangular area, thereby improving the determination accuracy of the rectangular area.
[0006]
In the invention according to claim 2, when the amount of moving pixels between the previous frame and the current frame is small, the rectangular area extracting means of claim 1 extracts and outputs the rectangular area from the image of the previous frame. This makes the rectangular area less likely to move outside the target person area and so improves the determination accuracy of the rectangular area.
[0007]
In the invention according to claim 3, when the face image feature extraction means of claim 1 determines the face width in order to extract the features of the face image in the rectangular area, it sets the face width search range to the left and right extent of the rectangular coordinates at the position halfway between the top-of-head coordinate and the bottom coordinate of the screen, thereby improving the accuracy of the precise coordinate search and the efficiency of the computation.
[0008]
In the invention according to claim 4, when the face image feature extraction means of claim 1 determines the face width, it avoids the nose component so that a region narrower than the actual face width is not determined as the precise coordinates.
[0009]
In the invention according to claim 5, when precise coordinate extraction by face image feature extraction fails, the face image feature extraction means of claim 1 sets an area of a predetermined size as the rectangular area, thereby reducing the chance that the face image lies outside the rectangular area.
[0010]
In the invention according to claim 6, when the precise coordinate extraction by the face image feature extraction fails, the entire screen is made a rectangular area so that the face image does not exist outside the rectangular area.
[0011]
In the invention according to claim 7, a low-pass filter is applied to the rectangular area coordinates calculated from the moving pixels in the current frame by the rectangular area extracting means and the rectangular area coordinates of the previous frame, so that the rectangular area follows the moving object smoothly even when the object moves rapidly within the screen.
[0012]
In the invention according to claim 8, clipping is applied to the obtained rectangular area coordinate values on the assumption that no valid face image exists in the edge areas of the screen, thereby improving the accuracy of the subsequent precise coordinate extraction and reducing the amount of computation.
[0013]
In the invention according to claim 9, when the face width is determined for precise coordinate determination by face image feature extraction, if the difference between the precise coordinates obtained in the previous frame and those obtained in the current frame is within a certain threshold, the precise coordinates calculated in the previous frame are adopted as the precise coordinates of the current frame, which avoids losing track of the precise coordinates.
[0014]
  In the invention according to claim 10, when the face width is determined for precise coordinate determination by face image feature extraction, if the difference between the precise coordinates obtained in the current frame and the average of the precise coordinates over the past several frames is within a certain threshold, that average is adopted as the precise coordinates of the current frame, which avoids losing track of the precise coordinates.
  The invention according to claim 11 further comprises skin color area extracting means for extracting, as the face image of a person, the skin color area of the input moving image based on the standard deviation of the color distribution within the precise coordinates extracted by the face image feature extraction means. This makes it possible to extract a face image that also includes skin color areas the face image feature extraction means could not extract.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a block diagram showing the relationship between an H.261 / H.263 image encoder and a face area extraction unit. In FIG. 1, the CIF image data input to the encoder is provided to the face area extraction unit 101, the subtractor 103, one input terminal a of the switch 104, and the prediction memory 112. The subtractor 103 subtracts the video signal output from the in-loop filter 111 from the input image data, and the difference data is given to the other input terminal b of the switch 104. The switch 105 has two input terminals a and b, and the output of the in-loop filter 111 is also given to its input terminal b. The switch 105 is switched in synchronization with the switch 104 by a switching control signal from the encoding control unit 102.
[0016]
The converter 107 applies a DCT (Discrete Cosine Transform) to either the intra-frame video signal or the inter-frame video signal selected by the switch 104, and its output is given to the quantizer 108. The quantizer 108 quantizes the data from the converter 107, outputs a quantization index q, and also passes this quantization index q to the inverse quantizer 109.
[0017]
The inverse quantizer 109 inversely quantizes the generated quantization index q, and provides it to the inverse transformer 110. The inverse transformer 110 inversely transforms the data inversely quantized by the inverse quantizer 109 and provides the output to the adder 106.
[0018]
The adder 106 adds the difference data to the video signal of the previous frame, which is the output of the in-loop filter 111 given via the switch 105, and gives the output to the prediction memory 112. The prediction memory 112 holds image data for several frames and outputs the motion of the image of each block with respect to the image of the previous frame as a motion vector v. The in-loop filter 111 is a filter that removes distortion in the image held in the prediction memory 112 by smoothing, and outputs an on / off signal f indicating the presence or absence of the operation.
[0019]
The encoding control unit 102 outputs a switching control signal to the switches 104 and 105, specifies the quantization characteristic designation information qz to the quantizer 108, and generates an interframe / intraframe coding identification flag p and a transmission / non-transmission identification flag t. The face area extraction unit 101 extracts the area that characterizes the face area from the image data in the frame and outputs its coordinates to the encoding control unit 102.
[0020]
The face area determination method is described below. Face area extraction consists broadly of two processes: motion-based face extraction, which focuses on the movement of an object, and color-based skin color extraction, which uses the color of the face.
[0021]
FIG. 2 is a flowchart showing the rough flow of face area extraction. In FIG. 2, first, in step SP1 (steps are abbreviated as SP in the drawings), Still Count, which is used in motion-based face area extraction, is initialized to 0, and PreNoMv is initialized to 1000. In step SP2 an image frame is captured, and in step SP3 the face area is extracted using motion as the first stage of the face / skin color extraction process. The face area information obtained here is used in the color-based skin color extraction that follows. In step SP4 it is determined whether skin color extraction is to be performed. If so, color-based skin color extraction is performed in step SP5; prefiltering is then applied in step SP6, and one frame of the image is encoded in step SP7.
[0022]
Next, the operation of extracting from the moving object, based on the position of the motion-based face area, skin color areas other than the face that could not be extracted by motion-based face extraction, such as the hands, arms, and neck, will be described.
[0023]
Here, the specific operation of motion-based face area extraction is described. The face area to be extracted is a moving object, and it is extracted using the feature that the top of the motion area is the top of the head.
[0024]
FIGS. 3 and 4 are flowcharts showing a motion-based face extraction operation, and FIG. 5 is a diagram showing creation of a reduced image.
[0025]
First, in order to speed up the processing and to suppress fine movement, the Y images composed of the luminance components of the input images of the current frame and the previous frame are reduced in step SP11. The size of the reduced image is 44 × 36 pixels as shown in Table 1 regardless of the size of the input image.
[0026]
[Table 1]

[0027]
As shown in FIG. 5, when the input image is CIF, an average of 8 × 8 pixels is obtained, and when the input image is QCIF and SQCIF, an average of 4 × 4 pixels is obtained and set as one pixel of the reduced image. However, in the case of SQCIF, since the size after reduction is 32 × 24 pixels, 0 is put in the portion corresponding to the outside to make the size of the reduced image 44 × 36 pixels. By setting the frame immediately before the current frame as the previous frame, a reduced image of the previous frame is obtained. In other words, the reduced image of the current frame and the reduced image of the previous frame perform a ping-pong operation with a double buffer configuration.
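The reduction described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `reduce_luma`, the table `BLOCK_SIZE`, the list-of-rows image layout, and the use of integer (truncating) averaging are all assumptions, since the text does not specify rounding.

```python
# Block size used for averaging, per input format (from the description).
BLOCK_SIZE = {"CIF": 8, "QCIF": 4, "SQCIF": 4}


def reduce_luma(y, block, out_w=44, out_h=36):
    """Average block x block luminance tiles into one reduced pixel.

    y is a list of rows of luminance values. The result is always
    out_w x out_h; positions not covered by the input (the SQCIF case,
    which only fills 32 x 24) stay 0.
    """
    red_h, red_w = len(y) // block, len(y[0]) // block
    reduced = [[0] * out_w for _ in range(out_h)]
    for j in range(red_h):
        for i in range(red_w):
            total = 0
            for dy in range(block):
                row = y[j * block + dy]
                for dx in range(block):
                    total += row[i * block + dx]
            # Assumption: truncating integer average.
            reduced[j][i] = total // (block * block)
    return reduced
```

For a 176 × 144 QCIF frame this fills the whole 44 × 36 array, while a 128 × 96 SQCIF frame fills only the upper-left 32 × 24 part and leaves zeros elsewhere.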
[0028]
Next, a motion image is extracted by detecting a difference between the reduced image of the previous frame and the current frame. In order to prevent extraction of motion pixels due to noise, the history of motion pixels in the past several frames is referred to, and when it is above a certain threshold value, it is determined to be significant, and the pixel is extracted as a motion pixel.
[0029]
FIG. 6 is a flowchart showing more specifically the operation of creating the object mask in step SP12 of FIG. 3, FIG. 7 is a diagram showing the update of the motion history, FIG. 8 is a diagram showing the regions to which the motion pixel determination thresholds are applied, FIG. 9 is a diagram illustrating enlargement processing using a 3 × 3 pixel window, and FIG. 10 is a diagram illustrating reduction processing using a 3 × 3 pixel window.
[0030]
Next, creation of an object mask will be described with reference to the drawings. In step SP41 of FIG. 6, with the luminance value of the reduced image of the previous frame denoted PrevY_{i,j} and the luminance value of the reduced image of the current frame denoted CurrY_{i,j}, the absolute difference ABS_{i,j} of each pixel is obtained by the following expression (1).
[0031]
ABS_{i,j} = |PrevY_{i,j} - CurrY_{i,j}| ... (1)
Next, in step SP42, the motion history HIS_{i,j} of the past N frames is updated for each pixel position (i, j) in the reduced image. HIS_{i,j} stores one bit per past frame, and its state before the update is shown in the drawing. Here, the past frame closest in time to the current frame T is T-1, and the frame N frames in the past is T-N. Taking a moving pixel as 1 and a still pixel as 0, the history from frame T-N to frame T-1 is “1, 0, 1,..., 0, 1, 1” in order.
[0032]
Updating, on the other hand, means shifting the history one bit to the left. As a result, as shown in FIG. 7B, new motion information can be written at the bit position of the current frame T to be processed. The history bit of frame T-N is shifted out, leaving the bit string “0, 1, 0,..., 1, 1, X” for frames T-N+1 through T, where X is the bit of the current frame. The motion history uses 8 bits per pixel.
[0033]
In step SP43, whether each pixel is a moving pixel or a still pixel is determined by comparing the ABS_{i,j} obtained in this way with the threshold value THy.
[0034]
In the embodiment of the present invention, it is assumed that the moving pixels in the screen belong to the most important element to be encoded (that is, the speaker). Consequently, if something other than the speaker moves behind the speaker, a target other than the speaker may be tracked.
[0035]
In order to avoid this, the following two measures are taken.
First, as shown in FIG. 8, the speaker is assumed not to be at the periphery of the screen, so no object mask is created there (bands of about 1/10 of the screen along the top, bottom, left, and right edges).
[0036]
Second, two motion determination thresholds are provided. A sensitive threshold is applied to the area including the speaker, and a less sensitive threshold is applied to the other areas. The area including the speaker is the rectangular area extracted in the previous frame, expanded by exp const X in the X direction and exp const Y in the Y direction. Here, as an example, exp const X = 4 and exp const Y = 8. That is, when the following expression (2) is satisfied, CurrY_{i,j} is determined to be a moving pixel in step SP44; otherwise, it is determined to be a still pixel in step SP45.
[0037]
ABS_{i,j} > TH_ysense (pixel inside the rectangular area of the previous frame)
ABS_{i,j} > TH_yinsense (pixel outside the rectangular area of the previous frame)
... (2)
As an example, when the actual frame rate is less than 7, TH_ysense is set to 3 and TH_yinsense to 12; when it is 7 or more, TH_ysense is set to 5 and TH_yinsense to 20.
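The two-threshold decision of expression (2) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the tuple layout of `prev_rect`, the reading of i as the X index and j as the Y index, and the inclusive treatment of the expanded rectangle are not given in the text. The threshold values and the expansion constants are the example values quoted above.

```python
EXP_CONST_X = 4  # example expansion in X from the description
EXP_CONST_Y = 8  # example expansion in Y from the description


def motion_thresholds(frame_rate):
    """Return (TH_ysense, TH_yinsense) for the given actual frame rate."""
    return (3, 12) if frame_rate < 7 else (5, 20)


def is_moving_pixel(abs_diff, i, j, prev_rect, frame_rate):
    """Expression (2): sensitive threshold inside the expanded previous
    rectangle, less sensitive threshold outside it.

    prev_rect = (x_left, y_top, x_right, y_bottom) in reduced-image
    coordinates (assumed layout, bounds treated as inclusive).
    """
    x0, y0, x1, y1 = prev_rect
    inside = (x0 - EXP_CONST_X <= i <= x1 + EXP_CONST_X
              and y0 - EXP_CONST_Y <= j <= y1 + EXP_CONST_Y)
    th_sense, th_insense = motion_thresholds(frame_rate)
    return abs_diff > (th_sense if inside else th_insense)
```

With these example values, a luminance difference of 4 counts as motion inside the speaker area at a low frame rate but is ignored in the background, which is the intended bias toward the speaker.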
[0038]
As a result of the determination, 1 or 0 is written in the position of the current frame of the motion history (the position of the LSB, the portion indicated by X in FIG. 7B).
[0039]
Next, MvFrame_{i,j}, the number of frames among the past N in which CurrY_{i,j} was determined to be a moving pixel, is obtained by summing HIS_{i,j} over the N frames. That is, the bits of HIS_{i,j} at positions T through T-N+1 in the drawing are summed, and the result is taken as MvFrame_{i,j}. Here, N is 8.
[0040]
Next, the value of MvFrame_{i,j} over the past N frames is compared with the object pixel extraction threshold TH_obj:
MvFrame_{i,j} > TH_obj ... (3)
If the expression (3) is satisfied, the pixel is judged to have significant movement and is extracted as an object pixel. The object pixel is given 1 in step SP48, and any other pixel is given 0 in step SP49. TH_obj is 3 as an example. When it is determined in step SP50 that the processing for one frame has been completed, the process returns to step SP13 in FIG. 3, where the number of object pixels extracted in the object mask extraction is counted as NoMv.
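The per-pixel motion history and the object pixel test of expression (3) can be sketched as follows. The function names are hypothetical; the 8-bit history, N = 8, and TH_obj = 3 are the example values from the description, and the sketch assumes the newest frame occupies the least significant bit as stated for the LSB position.

```python
N_FRAMES = 8  # history length N (8 bits per pixel)
TH_OBJ = 3    # example object pixel threshold


def update_history(his, moving):
    """Shift the 8-bit history left by one bit, dropping the oldest
    frame, and write the current frame's result into the LSB."""
    return ((his << 1) & 0xFF) | (1 if moving else 0)


def is_object_pixel(his):
    """Expression (3): MvFrame is the number of set bits in the history;
    the pixel is an object pixel when MvFrame > TH_obj."""
    mv_frame = bin(his).count("1")
    return mv_frame > TH_OBJ
```

Because a pixel must be flagged as moving in more than three of the last eight frames, an isolated noise hit in a single frame does not produce an object pixel.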
[0041]
For example, in a usage form such as a TV phone, the face area is usually located at the approximate center of the screen, and its position does not change greatly during use. Therefore, when the amount of pixel movement NoMv over the entire screen becomes small, the operations of the face region extraction processing that follow the object mask enlargement, such as labeling and edge extraction, are skipped, and the values obtained in the previous frame are used for the rectangular region coordinates, the object mask, and the edge images. This greatly reduces the amount of calculation and improves the frame rate. However, because the face position is not yet fully determined during the first 10 frames, skipping is not performed there. That is, when it is determined in step SP14 that the following expression (4) is satisfied, the processing after the object mask enlargement process is skipped.
[0042]
NoMv < THjudge and FrameCount > 10 ... (4)
Here, THjudge is set to 100 as an example.
[0043]
In addition, a skip determination flag “Bypass flag” is provided for the precise coordinate extraction, FaceMap creation, and HueMap creation processes that follow the rectangular coordinate extraction process. That is,
Bypass flag = 1 when skipping is determined
Bypass flag = 0 when skipping is not determined
... (5)
When Bypass flag is 1 in step SP15, the values obtained in the previous frame are used as the return values of the rectangular coordinate extraction, precise coordinate extraction, FaceMap creation, and HueMap creation processes.
[0044]
Otherwise, Bypass flag is set to 0 in step SP16. Since the object mask contains holes and chips, enlargement using a 3 × 3 pixel window is performed in step SP17 to fill them. As shown in FIG. 9, a window of 3 × 3 pixels is set for each pixel of the object mask, and when the center pixel is an object pixel, all eight neighbors are set as object pixels.
[0045]
The subQCIF reduced image is 32 × 24, which is smaller than the CIF / QCIF reduced image (44 × 36). For this reason, the region where one pixel of the SubQCIF object pixel corresponds in the real image is larger than the region where one pixel of the CIF / QCIF object pixel corresponds in the real image. Therefore, only in the case of SubQCIF, reduction using a 3 × 3 pixel window is performed after enlargement processing.
[0046]
In the reduction, as shown in FIG. 10, a window of 3 × 3 pixels is set for each pixel of the object mask, and when the center pixel is not an object pixel, all eight neighbors are set as non-object pixels.
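The 3 × 3 enlargement and reduction can be sketched as follows. This is a minimal illustration in which the function names and the border handling (window positions outside the mask are simply ignored) are assumptions; the mask is taken to be a list of rows of 0/1 values.

```python
def _neighbors(y, x, h, w):
    """Yield the coordinates of the 3 x 3 window around (y, x),
    clipped to the mask bounds (assumed border handling)."""
    for ny in range(max(0, y - 1), min(h, y + 2)):
        for nx in range(max(0, x - 1), min(w, x + 2)):
            yield ny, nx


def dilate3x3(mask):
    """Enlargement: every object pixel turns its 8 neighbors into
    object pixels."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny, nx in _neighbors(y, x, h, w):
                    out[ny][nx] = 1
    return out


def erode3x3(mask):
    """Reduction: every non-object pixel turns its 8 neighbors into
    non-object pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                for ny, nx in _neighbors(y, x, h, w):
                    out[ny][nx] = 0
    return out
```

Applying `erode3x3` after `dilate3x3`, as described for SubQCIF, fills small holes while bringing the outline of the mask back toward its original size.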
[0047]
In the algorithm according to the embodiment of the present invention, since the moving object is defined as the face area, the area cannot be extracted when the face is stationary. In this case, the area of the previous frame must be used as the face area so that the face area is not lost even when the movement stops. For this purpose, the motion threshold TH_mv is calculated from the relationship between the amount of motion of the previous frame and the amount of motion of the current frame.
[0048]
FIG. 11 is a flowchart showing more specifically the TH_mv calculation operation of step SP18, FIG. 12 is a diagram showing extraction of the maximum area, and FIG. 13 is a diagram showing detection of the top of the head.
[0049]
Using TH_mv and the NoMv value, whether to take the logical sum of the object masks is determined by the following expression (6), in order to prevent the object mask from disappearing when the motion decreases.
[0050]
NoMv < TH_mv ... (6)
When the expression (6) is satisfied in step SP19, the logical sum of the object mask of the previous frame and the object mask of the current frame is calculated in step SP20, and this is used as the object mask of the new current frame. If the expression (6) is not satisfied, the logical sum process is not performed and the process proceeds to the object mask extraction process shown below.
[0051]
There may be many moving objects on the screen, and it is necessary to select only the extraction target from among them. For example, in a video phone or the like, since there is a high probability that the subject will be closest to the camera, the largest connected area is separated from the object mask here.
[0052]
In one embodiment of the present invention, as shown in FIG. 12, the object mask is divided into vertical slits four pixels wide, the total number of pixels in each slit is projected onto the X axis, and labeling is performed on this projection to extract the maximum area in step SP21. Further, the left end coordinate of the leftmost slit of the maximum area and the right end coordinate of its rightmost slit are obtained as Xa and Xb, respectively. At this time, when several areas have the same maximum number of pixels, the leftmost one is taken as the maximum area.
[0053]
In order to fill a hole or a chip existing in the object mask of the current frame, an enlargement process using a 3 × 3 pixel window is performed in step SP22. The enlargement method is the same as that described above. Since the object mask before enlargement is used for processing in the next frame, the enlarged object mask is written to a separate area. The object mask before enlargement thus becomes the object mask of the previous frame in the next frame processing. In other words, the object mask of the current frame and the object mask of the previous frame perform a ping-pong operation with a double buffer configuration, like the reduced images.
[0054]
The top of the head is detected in step SP23 between the left and right end coordinates Xa and Xb of the maximum region obtained as described above. As shown in FIG. 13, the top of the object mask is taken to be the top of the head. First, the totals of the object mask between Xa and Xb, taken every four pixels in the horizontal direction, are projected in the Y-axis direction. Next, the projection is checked from the top, and in the first slit that exceeds the threshold value THtop, scanning is performed from the upper left to the lower right of the slit; the Y coordinate detected first is the Y coordinate HeadTopY of the top of the head. Further, at HeadTopY, the X coordinate of the object pixel first detected by scanning from the left side is X1. Similarly, at HeadTopY, the X coordinate of the object pixel first detected by scanning from the right side is X2. The center of X1 and X2 is the X coordinate HeadTopX of the top of the head. THtop is 4 here.
[0055]
FIG. 14 is a flowchart showing a specific operation of face width detection in step SP24 of FIG. 4, FIG. 15 is a view showing determination of a face width detection range, and FIG. 16 is a view showing detection of face width.
[0056]
Next, face width detection processing will be described with reference to the drawings. The face area is usually located approximately in the center of the screen, with a shoulder area in the lower part of the screen. Using this, the detection range of the face area is determined in step SP71.
[0057]
Attention is paid to the line halfway between the head top position HeadTopY and the bottom of the screen, and this line is used as the detection line. That is, the Y coordinate sLineY of the detection line is as follows.
[0058]
  sLineY = HeadTopY + (Ymax-HeadTopY) / 2
Ymax = 35: CIF, QCIF
Ymax = 23: SubQCIF
... (7)
The object pixel is searched from the left to the right of the detection line. The X coordinate at which the object pixel is first found is defined as the maximum maxXa of the left detection range. Similarly, the object pixel is searched from the right to the left of the detection line, and the X coordinate where the object pixel is first found is set as the maximum maxXb of the detection range on the right side.
[0059]
Here, when the obtained maxXa satisfies the expression (8), maxXa is set as the left maximum detection range. When this is not satisfied, Xa is set as the maximum detection range on the left side.
[0060]
Xa < maxXa < HeadTopX ... (8)
Similarly, when maxXb satisfies the expression (9), maxXb is set as the maximum detection range on the right side, as shown in FIG. 15. When this is not satisfied, Xb is set as the maximum detection range on the right side.
[0061]
HeadTopX < maxXb < Xb ... (9)
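The determination of the face width search range by expressions (7) to (9) can be sketched as follows. The function name and the return layout are hypothetical, the object mask is assumed to be a list of rows of 0/1 values, and the division in expression (7) is assumed to be an integer division.

```python
def face_width_search_range(mask, head_top_x, head_top_y, xa, xb, y_max=35):
    """Return (sLineY, left_limit, right_limit).

    y_max is 35 for CIF / QCIF and 23 for SubQCIF (expression (7)).
    xa and xb are the left and right ends of the maximum area.
    """
    # Expression (7); integer division is an assumption.
    s_line_y = head_top_y + (y_max - head_top_y) // 2
    xs = [x for x, v in enumerate(mask[s_line_y]) if v]
    left, right = xa, xb
    if xs:
        max_xa, max_xb = xs[0], xs[-1]  # first hits from left and right
        if xa < max_xa < head_top_x:    # expression (8)
            left = max_xa
        if head_top_x < max_xb < xb:    # expression (9)
            right = max_xb
    return s_line_y, left, right
```

When the detection line contains no object pixel, or the hits fall outside the ranges of expressions (8) and (9), the sketch falls back to Xa and Xb, as the description specifies.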
Next, in step SP72, the screen is divided into vertical slits every four pixels, and the area including HeadTopX is obtained as shown in FIG. 16. In FIG. 16, area b includes HeadTopX. Next, in steps SP73 and SP74, the difference between the total number of pixels in area b and that in area a, which adjoins area b on the left, is obtained. In FIG. 16, the number of pixels in area b is 120, and the number of pixels in area a is 80. Therefore, the difference PelSub is given by the following expression (10).
[0062]
PelSub = 120 - 80 = 40 ... (10)
Here, PelSub is compared with the pixel difference threshold value THsub in step SP75, and when the expression (11) is satisfied, the coordinate of the left face width is set to FaceWideXl by the processing in steps SP76 and SP77.
[0063]
PelSub > THsub ... (11)
In this example, since the difference in the number of pixels between area b and area a satisfies the expression (11), the coordinate of the left end of area a, which is the comparison area, becomes FaceWideXl.
[0064]
After the detection of the left face width, the processing in steps SP79 to SP83 is performed in the same way in the right direction to obtain the coordinate FaceWideXr of the right face width.
[0065]
FIG. 17 is a flowchart showing the flow of face width correction. Next, face width correction processing will be described with reference to this flowchart. The face width detected as described above may be incorrect because of noise or insufficient motion: it may be too small, or it may be too biased with respect to HeadTopX. To correct this, FaceWideXl and FaceWideXr are processed according to steps SP91 to SP101 in the flowchart.
[0066]
Next, the face bottom coordinate FaceBottomY is obtained by the following expression (12) from the head top coordinate HeadTopY and the face width coordinates FaceWideXl and FaceWideXr.
[0067]
FaceBottomY = HeadTopY + (FaceWideXr - FaceWideXl) * 1.5 ... (12)
Furthermore, from these values, the face rectangle area representing the face area is defined by its upper left and lower right coordinates:
(FaceWideXl, HeadTopY), (FaceWideXr, FaceBottomY)
[0068]
As described above, a rectangular area based on motion is obtained.
Next, in steps SP25 and SP26, precise coordinate detection is performed by extracting the face area within the rectangular area. The face area is extracted using the features that the cheeks have strong vertical lines and the area around the eyes has strong horizontal lines. However, when skipping has been determined (Bypass flag = 1), this process is not performed, and the precise coordinates obtained in the previous frame are used as the precise coordinates of the current frame.
[0069]
FIG. 18 is a flowchart showing the operation of the precision coordinate detection process.
In human faces, cheeks have strong vertical lines and eyes, nose and mouth have strong horizontal lines. By using this property, the face can be determined. Accordingly, the vertical and horizontal edges are extracted from the Y pixels of the current image existing in the object mask in the face rectangular area by the processing of steps SP111 to SP120 in FIG. At that time, since the face rectangular area is the size of the reduced image, it is used corresponding to the size of each image format. However, from the viewpoint of reducing the processing amount, in the case of CIF, processing is performed every other pixel in the vertical and horizontal directions.
[0070]
FIG. 19 is a diagram illustrating an edge detection operation, FIG. 20 is a diagram illustrating a face width precise coordinate detection operation, and FIG. 21 is a diagram illustrating a feature amount of a horizontal edge.
[0071]
First, a window of 3 × 3 pixels is set as shown in FIG. 20 with a pixel existing in the object mask in the face rectangular area as a target pixel. At this time, the vertical edge UX and the horizontal edge UY are expressed by Expression (13).
[0072]
DX = |C + 2F + I - A - 2D - G|
DY = |G + 2H + I - A - 2B - C|
UX = fix(DX - DY)
UY = fix(DY - DX)
where fix(a) = a (a ≥ 0), 0 (a < 0)
... (13)
Further, UX and UY are binarized with the threshold value THedge, and the vertical edge pixel VEdge and the horizontal edge pixel HEdge are obtained by the expression (14). Note that THedge is 60 as an example.
[0073]
UX > THedge → VEdge = 1
UX ≤ THedge → VEdge = 0
UY > THedge → HEdge = 1
UY ≤ THedge → HEdge = 0
... (14)
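Expressions (13) and (14) can be sketched for a single 3 × 3 window as follows. The function names are hypothetical, and the sketch assumes that the letters A to I label the window in row-major order (A B C on the top row, D E F in the middle, G H I at the bottom), which is the layout under which DX and DY match the usual Sobel operators; the drawing itself is not reproduced in the text. THedge = 60 is the example value.

```python
TH_EDGE = 60  # example threshold from the description


def fix(a):
    """fix(a) = a for a >= 0, otherwise 0 (expression (13))."""
    return a if a >= 0 else 0


def edge_flags(win):
    """Return (VEdge, HEdge) for a 3 x 3 luminance window.

    win is [[A, B, C], [D, E, F], [G, H, I]] (assumed layout).
    """
    (a, b, c), (d, _e, f), (g, h, i) = win
    dx = abs(c + 2 * f + i - a - 2 * d - g)  # horizontal gradient
    dy = abs(g + 2 * h + i - a - 2 * b - c)  # vertical gradient
    ux = fix(dx - dy)  # vertical edge strength
    uy = fix(dy - dx)  # horizontal edge strength
    return int(ux > TH_EDGE), int(uy > TH_EDGE)
```

Subtracting the opposite gradient before thresholding means that a diagonal edge, where DX and DY are equal, is counted as neither a vertical nor a horizontal edge.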
Next, the length Xlen of the face width in the vertical / horizontal edge images is obtained from the face width coordinates FaceWideXl and FaceWideXr by the expression (15).
[0074]
Xlen = (FaceWideXr - FaceWideXl + 1) * 4 ... (15)
Next, a square with sides of length Xlen is set at the top of the vertical edge image, and its interior is used as the vertical edge search area. At this time, it is assumed that the nose is near the center, and the width of the nose is obtained by the following expression.
[0075]
NoseWidth = Xlen / 4 ... (16)
Also, the head top position HeadCenter in the coordinate system of the edge image is obtained from the head top X coordinate HeadTopX of the rectangular coordinates by the expression (17).
[0076]
HeadCenter = HeadTopX × 4 ... (17)
Then, the position NoseWidth to the left of the head top position HeadCenter is set as the search start position, the area to its left is searched in order from there, and the first position where the cumulative value of the pixels in the vertical direction exceeds the threshold THpeak is taken as PeakL.
[0077]
Similarly, the position NoseWidth to the right of the head top position HeadCenter is set as the search start position, the area to its right is searched in order from there, and the first position where the cumulative value of the pixels in the vertical direction exceeds the threshold THpeak is taken as PeakR.
[0078]
Here, THpeak is obtained from the following expression (18).
THpeak = Xlen × PeakRathio / 10
where
PeakRathio = 2 ... (18)
When both PeakL and PeakR are THpeak or more and the X coordinate values of PeakL and PeakR are emnTHsub or more, it is judged that the face width exists and that precise coordinate extraction has succeeded, and the feature extraction from the horizontal edges described later is performed. The X coordinates of PeakL and PeakR are taken as EMNl and EMNr, respectively.
[0079]
If the above condition is not satisfied, it is determined that face width precise coordinate extraction has failed, and EMNl and EMNr obtained in the previous frame are used as EMNl and EMNr of the current frame. However, EMNl and EMNr of the previous frame may deviate greatly from the rectangular coordinates of the current frame, in which case they are judged inappropriate. To check for such a deviation, it is determined whether EMNl and EMNr of the previous frame lie within parm_x% of the width of the rectangular area of the current frame on its left and right. That is, with EMNl and EMNr of the previous frame denoted preEMNl and preEMNr, respectively, the expression (19) is evaluated.
[0080]
preEMNl > FaceWideXl × 4 - (FaceWideXr - FaceWideXl) × 4 × parm_x ÷ 100
and
preEMNr > FaceWideXr × 4 + (FaceWideXr - FaceWideXl) × 4 × parm_x ÷ 100
... (19)
When the expression (19) is satisfied, preEMNl and preEMNr are used as EMNl and EMNr of the current frame. When it is not satisfied, precise coordinate extraction is judged to have failed, and the face detection failure processing described later is performed. Here, parm_x is 12 as an example.
[0081]
Since a human face has eyes, a nose, and a mouth, which have horizontal-line features, between the two sides of the face, the likelihood that the region is a face is judged by examining the distribution of the horizontal edges between the face widths. As shown in FIG. 21, a horizontal edge search area, indicated by a bold line, is set in the lower part of the face rectangular area corresponding to the size of the horizontal edge image, and within this search area the span between EMNl and EMNr is divided equally into three regions. The number of horizontal edge pixels is accumulated for each region, and the totals are denoted SY0, SY1, and SY2, respectively.
[0082]
When the relationship between SY0, SY1, and SY2 satisfies the expression (20), the region is determined to be a face region, and vertical precise coordinate detection is performed. If it is not satisfied, the face detection failure processing described later is performed.
[0083]
SY0 < SY1 and SY2 < SY1 ... (20)
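The face likelihood test of expression (20) can be sketched as follows. The function name is hypothetical, and the vertical extent of the search area (`y_top`, `y_bottom`, with `y_bottom` exclusive) is passed in as a parameter because the text defines it only by reference to FIG. 21. The horizontal edge image is assumed to be a list of rows of 0/1 values.

```python
def looks_like_face(h_edge, emn_l, emn_r, y_top, y_bottom):
    """Split [emn_l, emn_r) into three equal parts, count horizontal
    edge pixels in each (SY0, SY1, SY2), and apply expression (20)."""
    width = emn_r - emn_l
    bounds = [emn_l + width * k // 3 for k in range(4)]
    sy = []
    for k in range(3):
        sy.append(sum(h_edge[y][x]
                      for y in range(y_top, y_bottom)
                      for x in range(bounds[k], bounds[k + 1])))
    # The center third, where the eyes, nose, and mouth lie, must hold
    # more horizontal edge pixels than either side.
    return sy[0] < sy[1] and sy[2] < sy[1]
```

An area with no horizontal edges at all fails the test, since the strict inequalities require the center count to exceed both side counts.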
When the determination of the face area is successful, the vertical precise coordinate detection process is performed. As shown in FIG. 22, the distribution obtained by accumulating the horizontal edge pixels in the Y-axis direction is computed, and the Y coordinate at which it first reaches the threshold value THemn or more when scanning from above is set as the coordinate EMNtop of the top of the face. Likewise, the Y coordinate at which it first reaches THemn or more when scanning from below is set as the coordinate EMNbottom of the lower part of the face.
[0084]
Here, if the difference between EMNtop and EMNbottom is smaller than emnTHsub, it is determined that the precise coordinate extraction has failed and the face detection failure process described later is performed.
[0085]
If the detection of the face area is successful, the face area detection success flag FindFace is set as in the expression (21), and the precise coordinate detection process is terminated.
[0086]
FindFace = 1 ... (21)
If the detection of the face area fails, the flag FindFace is set as in the expression (22), and the precise coordinate detection process is terminated.
[0087]
FindFace = 0 ... (22)
When face detection fails, the face may lie outside the extracted region. In that case, the encoding control process and the filter process performed after the face area extraction process would treat the actual face portion as a non-face area, and the image quality of the face could deteriorate. To avoid this, when face detection fails, the rectangular area is made larger than the actual area, or is set to the entire screen, so that the face does not fall outside the area processed as the face area.
[0088]
First, the center of the X coordinates of the rectangular region obtained in the current frame is set as FaceCenter, as shown in the expression (23).
[0089]
FaceCenter = (FaceWideXr - FaceWideXl) / 2 + FaceWideXl ... (23)
Three large areas (area0, area1, area2) are set in the screen, and one of them is selected as the rectangular area of the current frame according to the FaceCenter position. That is, the rectangular area of the current frame is replaced according to the expression (24).
[0090]
0 ≤ FaceCenter < Border01 → replace the rectangular area with area0
Border01 < FaceCenter ≤ Border12 → replace the rectangular area with area1
Border12 < FaceCenter ≤ X coordinate maximum → replace the rectangular area with area2
... (24)
Table 2 shows the determination values Border01 and Border12 for replacement area (area0, area1, area2) selection, and the replacement values for (area0, area1, area2).
[0091]
[Table 2]

[0092]
The processing at the time of face detection failure is performed by the encoding control unit, which calls the face area extraction function and checks the FindFace flag. This is so that the continuity of inter-frame processing within the face area extraction unit is not interrupted. As a result, when face detection succeeds in the next frame, the correct rectangular area can be restored immediately.
[0093]
Next, the rectangular area obtained by the rectangular area extraction process is enlarged. When the movement decreases, the overall code amount decreases, and the face area can be enlarged accordingly. For this determination, StillCount, which counts how many consecutive frames have had little motion, is obtained from the expression (25), and the processes of steps SP27, SP28, and SP29 are performed.
[0094]
NoMv < TH_still
then: StillCount++
NoMv ≥ TH_still
then: StillCount = 0
StillCount > TH_{still max}
then: StillCount = 0
... (25)
In step SP30, an area expansion region RgnExp is obtained as follows, and FaceWideXl, FaceWideXr, and HeadTopY are expanded to be FaceWideXle, FaceWideXre, and HeadTopYe, respectively. Here, THstill is 20 and THexp is 20 for CIF, 50 for QCIF, and 70 for SQCIF.
[0095]
RgnExp = StillCount / TH_exp
FaceWideXle = FaceWideXl - RgnExp
FaceWideXre = FaceWideXr + RgnExp
HeadTopYe = HeadTopY - RgnExp
where
FaceWideXle = 0 when FaceWideXle < 0
FaceWideXre = 43 when FaceWideXre > 43
HeadTopYe = 0 when HeadTopYe < 0
... (26)
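The still-frame counter of expression (25) and the area expansion of expression (26) can be sketched together as follows. The function names are hypothetical, the division is assumed to be an integer division, and the default `th_still_max=100` is a placeholder chosen only for illustration, because the text gives no value for that threshold. TH_still = 20 and the TH_exp values are the examples from the description.

```python
TH_EXP = {"CIF": 20, "QCIF": 50, "SQCIF": 70}  # example values


def update_still_count(still_count, no_mv, th_still=20, th_still_max=100):
    """Expression (25). th_still_max is a placeholder value."""
    still_count = still_count + 1 if no_mv < th_still else 0
    if still_count > th_still_max:
        still_count = 0
    return still_count


def expand_region(xl, xr, top_y, still_count, fmt, x_max=43):
    """Expression (26): widen the rectangle by RgnExp on the left,
    right, and top, clamped to the reduced image."""
    rgn_exp = still_count // TH_EXP[fmt]  # integer division assumed
    return (max(0, xl - rgn_exp),
            min(x_max, xr + rgn_exp),
            max(0, top_y - rgn_exp))
```

With the CIF value of TH_exp, for example, 45 consecutive still frames widen the rectangle by two reduced-image pixels on each side.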
Also, when compared with the face widths PreFaceWideXle, PreFaceWideXre, and head top PreHeadTopYe of the previous frame, when the difference is smaller than THdif, the value of the previous frame is used to prevent frequent fluctuations. THdif is 3 here.
[0096]
|PreFaceWideXle - FaceWideXle| < TH_dif
then: FaceWideXle = PreFaceWideXle
|PreFaceWideXre - FaceWideXre| < TH_dif
then: FaceWideXre = PreFaceWideXre
|PreHeadTopYe - HeadTopYe| < TH_dif
then: HeadTopYe = PreHeadTopYe
... (27)
Further, the movement amount is subjected to a low-pass filter expressed by the equation (28) to suppress a steep fluctuation from the previous frame position.
[0097]
  FaceWideXle = PreFaceWideXle x k + FaceWideXle x (1-k)
  FaceWideXre = PreFaceWideXre x k + FaceWideXre x (1-k)
  HeadTopYe = PreHeadTopYe x k + HeadTopYe x (1-k)
k = 1/2
... (28)
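Expressions (27) and (28) apply the same two steps to each coordinate, which can be sketched as follows. The function name is hypothetical; THdif = 3 and k = 1/2 are the values given in the description. The sketch returns a floating-point value because the text does not say how the filtered coordinate is rounded.

```python
TH_DIF = 3  # value from the description
K = 0.5     # low-pass coefficient k = 1/2


def stabilize(prev, curr):
    """Hold the previous value for small changes (expression (27)),
    then low-pass filter toward the current value (expression (28))."""
    if abs(prev - curr) < TH_DIF:
        curr = prev
    return prev * K + curr * (1 - K)
```

A change smaller than THdif is suppressed entirely, and a larger jump is halved, so the rectangle follows the moving object smoothly instead of jumping.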
The face area is generally located near the center with respect to the screen. Therefore, the face length is calculated in step SP31. That is, the clipping shown in Equation (29) is applied to the top, bottom, left, and right positions of the screen so that the rectangular area does not shift to the edge of the screen.
[0098]
FaceWideXle < minX
then: FaceWideXle = minX
FaceWideXre > maxX
then: FaceWideXre = maxX
HeadTopY < minY
then: HeadTopY = minY
where
minX = reduced image width x 1/10
maxX = reduced image width - (reduced image width x 1/10)
minY = reduced image height x 1/12
FaceWideXre < FaceWideXle
then: FaceWideXre = FaceWideXle
... (29)
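The clipping of expression (29) can be sketched as follows. The function name is hypothetical, integer division is assumed for the 1/10 and 1/12 fractions, and the sketch assumes that the clip against maxX is meant to limit the right coordinate FaceWideXre.

```python
def clip_rectangle(xle, xre, top_y, width=44, height=36):
    """Keep the rectangle away from the screen edges (expression (29)).

    width and height are those of the reduced image.
    """
    min_x = width // 10           # integer division assumed
    max_x = width - width // 10
    min_y = height // 12
    xle = max(xle, min_x)
    xre = min(xre, max_x)         # assumed to target the right edge
    top_y = max(top_y, min_y)
    xre = max(xre, xle)           # right edge never left of left edge
    return xle, xre, top_y
```

For the 44 × 36 reduced image this gives minX = 4, maxX = 40, and minY = 3, so a rectangle covering the whole width is pulled in to the range 4 to 40.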
Finally, the face width and head top are stored for the next frame.
[0099]
PreFaceWideXle = FaceWideXle
PreFaceWideXre = FaceWideXre
PreHeadTopYe = HeadTopYe
... (30)
The face bottom coordinate FaceBottomYe is estimated from the face width and head top position.
[0100]
FaceBottomYe = HeadTopYe + (FaceWideXre - FaceWideXle + 1) * 1.5
... (31)
  As with FaceWideXle, FaceWideXre, and HeadTopYe, clipping is performed by the expression (32) so that the rectangular area is not biased to the edge of the screen.
[0101]
FaceBottomYe> maxY
then: FaceBottomYe = maxY
However,
  maxY = reduced image height-(reduced image height x 1/12)
FaceBottomYe <HeadTopYe
then: FaceBottomYe = HeadTopYe
... (32)
Next, in step SP32, a FaceMap used for encoding control is created from the rectangular face area given by the extracted face width, head top, and face length. The coordinates of the face width, head top, and face length refer to the reduced image of 44 × 36 pixels; a macroblock of which 50% or more is included in this region is labeled 1, and every other macroblock is labeled 0. FaceMap has a different size for each image format, and is 22 × 18 for CIF, 11 × 9 for QCIF, and 8 × 6 for SQCIF.
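The FaceMap labeling can be sketched as follows. The function name and the table `MAP_SHAPE` are hypothetical, and the rectangle corners are assumed to be inclusive reduced-image coordinates. The number of reduced-image pixels per macroblock side (2 for CIF, 4 for QCIF and SQCIF) is derived from the map sizes and the reduced image sizes stated in the description.

```python
# (columns, rows, reduced-image pixels per macroblock side)
MAP_SHAPE = {"CIF": (22, 18, 2), "QCIF": (11, 9, 4), "SQCIF": (8, 6, 4)}


def make_face_map(rect, fmt):
    """Label a macroblock 1 when at least 50% of it lies inside rect.

    rect = ((x_left, y_top), (x_right, y_bottom)), inclusive (assumed).
    """
    (xl, yt), (xr, yb) = rect
    cols, rows, cell = MAP_SHAPE[fmt]
    face_map = [[0] * cols for _ in range(rows)]
    for my in range(rows):
        for mx in range(cols):
            x0, y0 = mx * cell, my * cell
            # Overlap of the macroblock with the rectangle, per axis.
            ox = max(0, min(xr, x0 + cell - 1) - max(xl, x0) + 1)
            oy = max(0, min(yb, y0 + cell - 1) - max(yt, y0) + 1)
            if 2 * ox * oy >= cell * cell:
                face_map[my][mx] = 1
    return face_map
```

The same labeling rule, applied to skin color pixels instead of the rectangle, would yield a map with the same configuration.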
[0102]
Note that since motion-based face extraction uses a history of several frames for motion detection, FaceMap is set to all zero from the start frame of the face area extraction process to THtrack.
[0103]
In addition, when skipping has been determined (Bypass flag = 1), this process is not performed, and the FaceMap obtained in the previous frame is used as the FaceMap of the current frame.
[0104]
Next, skin color extraction processing based on color is performed. The color-based skin color extraction is performed to extract other skin regions that could not be extracted in the motion-based face extraction, for example, regions such as hands, arms, and necks, and create a HueMap.
[0105]
However, when skipping has been determined (Bypass flag = 1), this process is not performed, and the HueMap obtained in the previous frame is used as the HueMap of the current frame.
[0106]
FIG. 23 is a flowchart showing the flow of color-based skin color extraction. In FIG. 23, when it is determined in step SP121 that Bypass flag is not 1, a skin color sample region is set from the precise coordinates of the face in step SP122. The rectangular area given by the precise face coordinates EMNtop, EMNbottom, EMNl, and EMNr is set on Cb and Cr as the skin color sample area. However, in the case of QCIF and SQCIF, the precise coordinates are halved before use.
[0107]
Next, in step SP123, the average and standard deviation of Cb and Cr in the flat part of the skin color sample region are calculated. That is, the averages μ_u, μ_v and standard deviations σ_u, σ_v of Cb and Cr are calculated from the Cb and Cr pixels of the skin color sample area, excluding the edge pixels of the vertical and horizontal edge images created in the motion-based face extraction described above.
[0108]
The color-based skin color extraction relies on the result of the motion-based face extraction, but the motion-based extraction is not 100% accurate. If the face region has been extracted incorrectly, the skin color extraction is adversely affected. It is therefore necessary to decide whether to perform skin color extraction based on the distribution of the sampled Cb and Cr. In step SP124, this decision is made from the standard deviations σu and σv using expression (33).
[0109]
σ_u < THσ and σ_v < THσ ... (33)
When expression (33) is satisfied, the distribution is considered to have a single (unimodal) peak, and the skin color pixel extraction of steps SP126 to SP128 is performed. When expression (33) is not satisfied, the color varies too much for stable skin color extraction. In that case the skin color extraction is stopped in step SP125, a HueMap with the same dimensions as FaceMap is created with all entries set to zero, and the process ends. Note that THσ is 20 here.
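The decision of expression (33) reduces to two comparisons. The sketch below uses THσ = 20 as stated in the text; the function names and the fallback helper are hypothetical.

```python
TH_SIGMA = 20  # value of THσ given in the description


def is_unimodal(sigma_u, sigma_v, th_sigma=TH_SIGMA):
    """Expression (33): both chroma standard deviations must be below THσ."""
    return sigma_u < th_sigma and sigma_v < th_sigma


def zero_hue_map(cols, rows):
    """All-zero HueMap used when extraction is stopped (step SP125)."""
    return [[0] * cols for _ in range(rows)]
```

For example, `is_unimodal(5.0, 19.9)` allows extraction to continue, while `is_unimodal(20.0, 5.0)` stops it because the comparison is strict.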
[0110]
In the extraction of skin color pixels in steps SP126 to SP128, first, in step SP126, the values [Cbl, Cbh] and [Crl, Crh] of the extraction range of Cb and Cr are determined as follows.
[0111]
[Cbl, Cbh] = [μ_u − σ_u, μ_u + σ_u]
[Crl, Crh] = [μ_v − σ_v, μ_v + σ_v] ... (34)
Next, in step SP127, the pixels whose Cb and Cr values both fall within the extraction ranges obtained in step SP126 are extracted. Of these, only the pixels inside the object mask of the current frame are kept, and the result forms the skin color area.
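Steps SP126 and SP127 can be sketched as a range test followed by masking. This is an illustrative sketch: it assumes the object mask has the same resolution as the Cb and Cr planes and treats the ranges of expression (34) as closed intervals, as the bracket notation suggests. The function name is hypothetical.

```python
def extract_skin_pixels(cb, cr, object_mask, mu_u, mu_v, sigma_u, sigma_v):
    """Return a binary skin mask: 1 where Cb and Cr are both in range and the
    pixel lies inside the object mask, otherwise 0."""
    cb_lo, cb_hi = mu_u - sigma_u, mu_u + sigma_u  # [Cbl, Cbh] of expression (34)
    cr_lo, cr_hi = mu_v - sigma_v, mu_v + sigma_v  # [Crl, Crh] of expression (34)
    skin = []
    for cb_row, cr_row, mask_row in zip(cb, cr, object_mask):
        skin.append([
            1 if m and cb_lo <= u <= cb_hi and cr_lo <= v <= cr_hi else 0
            for u, v, m in zip(cb_row, cr_row, mask_row)
        ])
    return skin
```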
[0112]
The skin color area created in this way is mapped onto the macroblocks of each image format (CIF, QCIF, SQCIF). In step SP128, a HueMap is created in which a macroblock containing 50% or more skin color pixels is labeled 1 and every other macroblock is labeled 0. Note that HueMap is set to all zero from the start frame of the face area extraction process to THtrack.
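The macroblock labeling of step SP128 can be sketched as a block-wise count. The sketch assumes that the skin mask has chroma-plane resolution, so that one macroblock corresponds to 8 × 8 mask pixels (which gives 22 × 18 blocks for a 176 × 144 CIF chroma plane); this block size and the function name are assumptions of the example.

```python
def make_hue_map(skin, block=8):
    """Label a macroblock 1 when at least 50% of its pixels are skin pixels.

    skin  -- 2-D list of 0/1 values at chroma resolution
    block -- macroblock edge length in mask pixels (assumed 8)
    """
    rows = len(skin) // block
    cols = len(skin[0]) // block
    hue_map = []
    for r in range(rows):
        row = []
        for c in range(cols):
            count = sum(
                skin[y][x]
                for y in range(r * block, (r + 1) * block)
                for x in range(c * block, (c + 1) * block)
            )
            row.append(1 if 2 * count >= block * block else 0)
        hue_map.append(row)
    return hue_map
```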
[0113]
The embodiment disclosed this time should be considered as illustrative in all points and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.
[0114]
【Effects of the Invention】
As described above, according to the present invention, when a rectangular area is extracted, the threshold for determining motion components outside the rectangular area of the previous frame is made less sensitive than the threshold for determining motion components inside the rectangular area. Motion components and noise around the target person are therefore less likely to be judged as part of the rectangular area, and the rectangular area is less likely to move outside the face area of the target person, so the determination accuracy of the rectangular area is improved.
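The two-threshold idea can be sketched on a frame difference. This is a minimal illustration: the threshold values are supplied by the caller and are not taken from this passage, "less sensitive" is modeled as a larger threshold outside the previous rectangle, and the function name is hypothetical.

```python
def moving_pixel_mask(prev, curr, rect, th_inside, th_outside):
    """Mark a pixel as moving when its frame difference exceeds the local threshold.

    prev, curr -- 2-D lists of luminance values for the previous and current frames
    rect       -- (top, bottom, left, right) of the previous frame's rectangle, inclusive
    th_outside should be larger than th_inside to give lower sensitivity outside.
    """
    top, bottom, left, right = rect
    mask = []
    for y, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        row = []
        for x, (p, c) in enumerate(zip(prev_row, curr_row)):
            inside = top <= y <= bottom and left <= x <= right
            threshold = th_inside if inside else th_outside
            row.append(1 if abs(c - p) > threshold else 0)
        mask.append(row)
    return mask
```

With the same frame difference everywhere, only pixels inside the previous rectangle are flagged when the difference lies between the two thresholds.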
[0115]
Also, when the amount of moving pixels between the previous frame and the current frame is small, the processing that follows the calculation of the moving pixel amount is not performed, which reduces the total amount of computation of the rectangular area extraction processing.
[0116]
Furthermore, at the time of face width determination for precise coordinate determination by face image feature extraction, the face width search range is set to the left-right width of the rectangular coordinates at the position halfway from the top-of-head coordinate to the bottom coordinate of the screen, so that both the search accuracy and the calculation efficiency can be improved.
[0117]
At the time of face width determination for precise coordinate determination by facial image feature extraction, a nose component is avoided so that an area narrower than the actual face width is not determined as precise coordinates.
[0118]
Further, when the precise coordinate extraction by face image feature extraction fails, an area of a predetermined size is used as the rectangular area, which reduces the cases in which the face image lies outside the rectangular area.
[0119]
Furthermore, by making the entire screen the rectangular area instead of an area of a predetermined size, the case in which the face image lies outside the rectangular area can be eliminated.
[0120]
Furthermore, a low-pass filter is applied to the rectangular area coordinates calculated from the moving pixels in the current frame and the rectangular area coordinates of the previous frame, so that the rectangular area follows the moving object smoothly even if the moving object moves suddenly on the screen.
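One simple form of such a low-pass filter is a weighted average of the previous and current coordinates. The filter actually used is not specified in this passage, so the weight `alpha`, the rounding to integer coordinates, and the function name are all assumptions of this sketch.

```python
def smooth_rect(prev_rect, curr_rect, alpha=0.5):
    """Blend the current rectangle with the previous one, coordinate by coordinate.

    alpha close to 0 follows slowly (strong smoothing); alpha = 1 disables smoothing.
    """
    return tuple(
        round(alpha * c + (1 - alpha) * p)
        for p, c in zip(prev_rect, curr_rect)
    )
```

For example, if the rectangle jumps by 10 pixels between frames, `alpha=0.5` moves the output rectangle only 5 pixels in that frame.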
[0121]
In addition, on the assumption that no valid face image exists in the edge regions of the screen, clipping is applied to the obtained rectangular area coordinate values, which improves accuracy and reduces the amount of computation in the later precise coordinate extraction.
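Clipping of the rectangle coordinates can be sketched as clamping each coordinate to an inner window. The sketch assumes the coordinates live on the 44 × 36 reduced image; the margin width is a hypothetical value, since this passage does not give the size of the excluded edge region.

```python
def clip_rect(rect, width=44, height=36, margin=2):
    """Clamp (top, bottom, left, right) so the rectangle stays `margin` pixels
    away from every screen edge. The margin value is an assumption."""
    top, bottom, left, right = rect

    def clamp(value, low, high):
        return max(low, min(value, high))

    y_lo, y_hi = margin, height - 1 - margin
    x_lo, x_hi = margin, width - 1 - margin
    return (
        clamp(top, y_lo, y_hi),
        clamp(bottom, y_lo, y_hi),
        clamp(left, x_lo, x_hi),
        clamp(right, x_lo, x_hi),
    )
```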
[0122]
Furthermore, when determining the face width for precise coordinate determination by face image feature extraction, if the difference between the precise coordinate value obtained in the previous frame and the precise coordinate value obtained in the current frame is within a threshold value, the precise coordinate value calculated in the previous frame is adopted as the precise coordinate of the current frame, which makes it possible to avoid losing the precise coordinates. The same effect can be obtained by using the average of the precise coordinate values of the past several frames instead of the precise coordinates obtained in the previous frame.
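Both variants can be sketched in a few lines. The threshold value is supplied by the caller, "within a threshold" is read here as an inclusive comparison, and the function names are hypothetical.

```python
def hold_previous(prev_value, curr_value, threshold):
    """Keep the previous frame's precise coordinate when the change is small."""
    if abs(curr_value - prev_value) <= threshold:
        return prev_value
    return curr_value


def hold_average(history, curr_value, threshold):
    """Variant using the average of the past several frames instead of the previous frame."""
    average = sum(history) / len(history)
    if abs(curr_value - average) <= threshold:
        return average
    return curr_value
```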
[Brief description of the drawings]
FIG. 1 is a block diagram showing the relationship between an H.261/H.263 image encoder and the face area extraction unit.
FIG. 2 is a flowchart showing a rough flow of face area extraction.
FIG. 3 is a first half flowchart showing face extraction based on motion.
FIG. 4 is a latter half flowchart showing motion-based face extraction.
FIG. 5 is a diagram illustrating an example of creating a reduced image.
FIG. 6 is a flowchart showing a flow of creating an object mask.
FIG. 7 is a diagram showing an update of a motion history.
FIG. 8 is a diagram illustrating a motion pixel determination threshold adaptive region.
FIG. 9 is a diagram illustrating an enlargement process using a 3 × 3 pixel window.
FIG. 10 is a diagram illustrating a reduction process using a 3 × 3 pixel window.
FIG. 11 is a flowchart showing the flow of TH_mv calculation.
FIG. 12 is a diagram illustrating extraction of a maximum area.
FIG. 13 is a diagram illustrating an example of detection of the top of the head.
FIG. 14 is a flowchart showing a flow of face width detection.
FIG. 15 is a diagram for explaining determination of a face width detection range.
FIG. 16 is a diagram illustrating an example of detection of a face width.
FIG. 17 is a flowchart showing a flow of face width correction.
FIG. 18 is a flowchart showing a flow of a precision coordinate detection process.
FIG. 19 is a diagram showing an edge detection operator.
FIG. 20 is a diagram illustrating an example of face width precision coordinate detection.
FIG. 21 is a diagram illustrating a feature amount of a horizontal edge.
FIG. 22 is a diagram showing an example of precise coordinate calculation in the vertical direction of the face.
FIG. 23 is a flowchart showing the flow of the color-based skin color extraction process.
[Explanation of symbols]
101 Face region extraction unit
102 Coding control unit
103 Subtractor
104, 105 Switches
106 Adder
107 Converter
108 Quantizer
109 Inverse quantizer
110 Inverse converter
111 In-loop filter
112 Prediction memory

Claims

1. An image processing apparatus for extracting a human face image from an input moving image, comprising:
rectangular area extracting means for extracting a rectangular area consisting of a moving object area based on the input moving image; and
face image feature extraction means for extracting precise coordinates based on features of the face image within the rectangular area extracted by the rectangular area extracting means,
wherein, when extracting the rectangular area based on the difference between the image of the previous frame and the image of the current frame, the rectangular area extracting means sets the threshold value for determining a motion component outside the rectangular area of the previous frame to a lower sensitivity than the threshold value for determining a motion component inside the rectangular area.

2. The image processing apparatus according to claim 1, wherein the rectangular area extracting means extracts and outputs the rectangular area from the image of the previous frame when the amount of moving pixels between the previous frame and the current frame is small.

3. The image processing apparatus according to claim 1, wherein, when determining the face width in order to extract features of the face image in the rectangular area, the face image feature extraction means sets the face width search range to the left-right width of the rectangular coordinates at the position halfway from the top-of-head coordinate to the bottom coordinate of the screen.

4. The image processing apparatus according to claim 3, wherein, when determining the face width, the face image feature extraction means avoids the nose component so that a region narrower than the actual face width is not determined as the precise coordinates.

5. The image processing apparatus according to claim 1, wherein the face image feature extraction means sets a region of a predetermined size as the rectangular area when precise coordinate extraction by face image feature extraction cannot be performed.

6. The image processing apparatus according to claim 1, wherein the face image feature extraction means sets the entire screen as the rectangular area when precise coordinate extraction by face image feature extraction cannot be performed.

7. The image processing apparatus according to claim 1, wherein a low-pass filter is applied to the rectangular area coordinates calculated from the moving pixels in the current frame by the rectangular area extracting means and to the rectangular area coordinates of the previous frame, so that the rectangular area follows the moving object even when the moving object moves rapidly within the screen.

8. The image processing apparatus according to claim 1, wherein the rectangular area extracting means performs clipping processing on the obtained rectangular area coordinate values on the assumption that no valid face image exists in the edge regions of the screen.

9. The image processing apparatus according to claim 1, wherein, when determining the face width for precise coordinate determination by face image feature extraction, the face image feature extraction means uses the precise coordinates calculated in the previous frame as the precise coordinates of the current frame if the difference between the precise coordinate value obtained in the previous frame and the precise coordinate value obtained in the current frame is within a certain threshold value.

10. The image processing apparatus according to claim 1, wherein, when determining the face width for precise coordinate determination by face image feature extraction, the face image feature extraction means sets the average of the precise coordinate values of the past several frames as the precise coordinate value of the current frame if the difference between the precise coordinate value obtained in the current frame and that average is within a certain threshold value.

11. The image processing apparatus according to any one of claims 1 to 10, further comprising skin color area extracting means for extracting a skin color area of the input moving image as the face image of the person based on the standard deviation of the color distribution within the precise coordinates extracted by the face image feature extraction means.