JP3612227B2

JP3612227B2 - Object extraction device

Info

Publication number: JP3612227B2
Application number: JP00189199A
Authority: JP
Inventors: 陽子三本杉; 敏明渡邊; 孝井田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-01-07
Filing date: 1999-01-07
Publication date: 2005-01-19
Anticipated expiration: 2019-01-07
Also published as: JP2000082145A

Description

【０００１】
【発明の属する技術分野】
本発明は画像の物体抽出装置に関し、特に入力動画像から目的とする物体の位置を検出して動物体の追跡／抽出を行う物体抽出装置に関する。
【０００２】
【従来の技術】
従来より、動画像中の物体を追跡／抽出するためのアルゴリズムが考えられている。これは、様々な物体と背景が混在する画像からある物体だけを抽出するための技術である。この技術は、動画像の加工や編集に有用であり、例えば、動画像から抽出した人物を別の背景に合成することなどができる。
【０００３】
物体抽出に使用される方法としては、時空間画像領域分割（越後、飯作、「ビデオモザイクのための時空間画像領域分割」、１９９７年電子情報通信学会情報・システムソサイエティ大会、Ｄ−１２−８１、ｐ．２７３、１９９７年９月）を利用した領域分割技術が知られている。
【０００４】
この時空間画像領域分割を用いた領域分割方法では、動画像の１フレーム内のカラーテクスチャによる小領域の分割を行ない、フレーム間の動きの関係を使ってその領域を併合する。フレーム内の画像を分割する際には、初期分割を与える必要があり、それによって分割結果が大きく左右されるという問題がある。そこで、これを逆に利用して、この時空間画像領域分割を用いた領域分割法では、別のフレームで初期分割を変えて、結果的に異なる分割結果を得て、フレーム間の動きで矛盾する分割を併合するという手法をとっている。
【０００５】
しかし、この手法を動画像中の物体の追跡および抽出にそのまま適用すると、動きベクトルが、目的とする動物体以外の余分な動きに影響されてしまい、信頼度が十分でないことが多く、誤った併合を行なう点が問題となる。
【０００６】
また、特開平８−２４１４１４号公報には、複数の動物体検出装置を併用した動物体検出・追跡装置が開示されている。この従来の動物体検出・追跡装置は、例えば監視カメラを用いた監視システムなどに用いられるものであり、入力動画像から動物体を検出してその追跡を行う。この動物体検出・追跡装置においては、入力動画像は、画像分割部、フレーム間差分型動物体検出部、背景差分型動物体検出部、動物体追跡部にそれぞれ入力される。画像分割部では、入力動画像が予め定められた大きさのブロックに分割される。分割結果は、フレーム間差分型動物体検出部、および背景差分型動物体検出部にそれぞれ送られる。フレーム間差分型動物体検出部では、分割結果毎に、フレーム間差分を用いて入力動画像中の動物体が検出される。この場合、フレーム間差分を取る際のフレーム間隔は、動物体の移動速度に影響されずにその動物体を検出できるようにするために、背景差分型動物体検出部の検出結果に基づいて設定される。背景差分型動物体検出部では、これまでに入力された動画像を用いて分割結果毎に作成した背景画像と動物体との差分を取ることにより、動物体が検出される。統合処理部では、フレーム間差分型動物体検出部および背景差分型動物体検出部それぞれの検出結果が統合されて、動物体の動き情報が抽出される。各フレームで物体を抽出した後、動物体追跡部では、フレーム間において対応する動物体同士の対応付けが行われる。
【０００７】
この構成においては、フレーム間差分のみならず、背景差分をも用いて動物体の検出を行っているため、フレーム間差分だけを用いる場合に比べ検出精度は高くなる。しかし、入力動画像全体を対象としてその画像の中から動きのある物体をフレーム間差分および背景差分によって検出する仕組みであるため、フレーム間差分および背景差分それぞれの検出結果は、目的とする動物体以外の余分な動きに影響されてしまい、背景に複雑な動きがある画像ではうまく目的とする動物体を抽出・追跡できないという問題がある。
【０００８】
また、別の物体抽出技術としては、複数のフレームを用いてまず背景画像を生成し、その背景画像と入力画像の画素値の差分が大きい領域を物体として抽出する方法も知られている。
【０００９】
この背景画像を用いる物体抽出の既存の技術一例が、例えば、特開平８−５５２２２号公報の「移動物体検出装置および背景抽出装置ならびに放置物体検出装置」に開示されている。
【００１０】
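The image signal of the current frame is input to a frame memory that stores one frame of image data, to a first motion detection means, to a second motion detection means, and to a switch. The image signal of the preceding frame is read from the frame memory and input to the first motion detection means. Meanwhile, the background image signal generated up to that point is read from a separate frame memory provided to hold the background image, and is input to the second motion detection means and to the switch. The first and second motion detection means each extract an object region, using for example the difference between their two input image signals, and both object regions are sent to a logic operation circuit. The logic operation circuit takes the logical AND of its two inputs and outputs the result as the final object region. The object region is also sent to the switch. Under control of the object region, the switch selects the background pixel signal for pixels that belong to the object region and the image signal of the current frame for pixels that do not. The selected signal is sent to the frame memory as an overwrite signal, and the pixel values of the frame memory are overwritten.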
【００１１】
この手法では、特開平８−５５２２２号公報に示されている様に、処理が進行するにつれて、次第に背景画像が正しくなっていき、やがては、物体が正しく抽出されるようになる。しかし、動画像シーケンスの初めの部分においては背景画像に物体が混入しているために、物体の抽出精度が悪い。また、物体の動きが小さい場合には、いつまでたっても、その物体の画像が背景画像の中に残り、抽出精度は高くならない。
【００１２】
【発明が解決しようとする課題】
上述したように、従来の物体抽出／追跡方法では、入力動画像全体を対象としてその画像の中から動きのある物体を検出する仕組みであるため、目的とする動物体以外の余分な動きに影響を受けてしまい、目的の動物体を精度良く抽出・追跡することができないという問題があった。
【００１３】
また、背景画像を用いる物体抽出法では、動画像シーケンスの初めの部分において抽出精度が悪く、また、物体の動きが小さい場合には、いつまでも背景画像が完成しないために抽出精度が良くならないという問題点があった。
【００１４】
本発明は、目的の物体以外の周囲の余分な動きに影響を受けずにその物体の抽出／追跡を精度良く行うことが可能な動画像の物体抽出装置を提供することを目的とする。
【００１５】
また、本発明は、背景画像を精度良く決定できるようにして、物体の動きの大小によらずに、且つ動画像シーケンスの初めの部分も最後の部分と同様に高い抽出精度が得られる物体抽出装置を提供することを目的とする。
【００１６】
【課題を解決するための手段】
本発明は、物体抽出対象となる現フレームと、この現フレームに対し時間的に異なる第１の参照フレームとの差分に基づいて、前記現フレームと前記第１の参照フレームに共通の第１の背景領域を決定し、前記現フレームと、この現フレームに対し時間的に異なる第２の参照フレームとの差分に基づいて、前記現フレームと前記第２の参照フレームに共通の第２の背景領域を決定する背景領域決定手段と、前記現フレームの図形内画像の中で、前記第１の背景領域と前記第２の背景領域のどちらにも属さない領域を、物体領域として抽出する手段と、静止している物体領域を検出する物体静止検出手段とを具備する物体抽出装置を提供する。
【００１７】
この物体抽出装置においては、物体抽出対象の現フレーム毎に二つの参照フレームが用意され、その現フレームと第１の参照フレームとの間の第１の差分画像により、現フレームと第１の参照フレームとで共通に用いられている第１の共通背景領域が決定され、また現フレームと第２の参照フレームとの間の第２の差分画像により、現フレームと第２の参照フレームとで共通に用いられている第２の共通背景領域が決定される。第１および第２のどちらの差分画像にも現フレーム上の物体領域が共通に含まれているため、第１の共通背景領域と第２の共通背景領域のどちらにも属さない領域の中で、現フレームの図形内画像に含まれる領域を検出することにより、現フレーム上の物体領域が抽出される。この物体領域が静止物体に相当する場合には、前の物体領域と現物体領域とに差分が存在しないとき静止物体領域が検出される。
【００１８】
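In this way, a region that belongs to none of the multiple common background regions, each determined from a reference frame at a different time, is taken as the object to be extracted, and the object is tracked on that basis. This makes it possible to extract and track the target object accurately without being affected by extraneous motion around it.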
【００１９】
また、前記第１および第２の各参照フレームと前記現フレームとの間で背景の動きが相対的に零となるように、前記各参照フレームまたは現フレームの背景の動きを補正する背景補正手段をさらに具備することが好ましい。この背景補正手段を図形設定手段の入力段、あるいは背景領域決定手段の入力段のいずれかに設けることにより、例えばカメラをパンした時などのように背景映像が連続するフレーム間で徐々に変化するような場合であってもそれらフレーム間で背景映像を擬似的に一定にすることができる。よって、現フレームと第１または第２の参照フレームとの差分を取ることによって、それらフレーム間で背景を相殺することが可能となり、背景変化に影響されない共通背景領域検出処理および物体領域抽出処理を行うことができる。背景補正手段は動き補償処理によって実現できる。
【００２０】
また、前記背景領域決定手段は、前記現フレームと前記第１または第２の参照フレームとの差分画像の中で、前記現フレームの図形内画像または前記第１または第２の参照フレームの図形内画像に属する領域の輪郭線近傍における各画素の差分値を検出する手段と、前記輪郭線近傍の各画素の差分値を用いて、前記共通の背景領域と判定すべき差分値を決定する手段とを具備し、この決定された差分値を背景／物体領域判定のためのしきい値として使用して、前記差分画像から前記共通の背景領域を決定するように構成することが好ましい。このように輪郭線近傍における各画素の差分値に着目することにより、差分画像全体を調べることなく、容易にしきい値を決定することが可能となる。
【００２１】
また、前記図形設定手段は、前記参照フレームの図形内画像を複数のブロックに分割する手段と、複数のブロックそれぞれについて、前記入力フレームとの誤差が最小となる前記入力フレーム上の領域を探索する手段と、探索された複数の領域を囲む図形を前記入力フレームに設定する手段とから構成することが好ましい。これにより、初期設定された図形の形状や大きさによらず、対象となる入力フレームに最適な新たな図形を設定することが可能となる。
【００２２】
また、本発明は、既に物体領域を抽出したフレームから、物体抽出対象となる現フレーム上の物体の位置または形状を予測する予測手段と、この予測手段によって予測された現フレーム上の物体の位置または形状に基づいて、前記背景領域決定手段によって使用すべき前記第１および第２の参照フレームを選択する手段とをさらに具備する。
【００２３】
このように、使用すべき参照フレームとして適切なフレームを選択することにより、常に良好な抽出結果を得ることが可能となる。
【００２４】
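Here, let O_i, O_j, and O_curr denote the objects in the reference frames f_i and f_j and in the current frame f_curr to be extracted, respectively. The optimal reference frames f_i and f_j for extracting the object shape correctly are frames that satisfy
(O_i ∩ O_j) ⊆ O_curr
that is, frames f_i and f_j for which the intersection of O_i and O_j lies within O_curr.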
【００２５】
また、本発明は、互いに異なる方法によって物体抽出を行う複数の物体抽出手段を設け、これら物体抽出手段を選択的に切替ながら物体抽出を行うことを特徴とする。この場合、現フレームと、この現フレームとは時間的にずれた少なくとも２つの参照フレームそれぞれの差分を用いて物体抽出を行う第１の物体抽出手段と、フレーム間予測を使用して既に物体抽出が行われたフレームから現フレームの物体領域を予測することにより物体抽出を行う第２の物体抽出手段とを組み合わせて使用することが望ましい。これにより、物体が部分的に静止していて参照フレームとの差分が検出できないときでも、フレーム間予測を用いた物体抽出手段によってそれを補うことが可能となる。
【００２６】
また、複数の物体抽出手段を設けた場合には、物体抽出対象となる現フレームから、その少なくとも一部の領域についての画像の特徴量を抽出する手段をさらに具備し、前記抽出された特徴量に基づいて、前記複数の物体抽出手段を切り替えることが好ましい。
【００２７】
例えば、背景の動きがあるかどうか予めわかるならば、その性質を使った方がよい。背景の動きがある場合は、背景動き補償を行なうが、完全に補償できるとは限らない。複雑な動きをするフレームではほとんど補償できないこともある。このようなフレームは、背景動き補償の補償誤差量によって予め選別できるので、参照フレーム候補にしないなどの工夫が可能である。しかし、背景の動きがない場合は、この処理は不必要である。別の物体が動いていると、誤った背景動き補償を行なったり、参照フレーム候補から外れたりして参照フレーム選択条件に最適なフレームであっても選ばれず、抽出精度が落ちることがあるからである。また、一つの画像中にも多様な性質が混在していることがある。物体の動きやテクスチャも部分的に異なり、同じ追跡・抽出方法及び装置やパラメータではうまく抽出できないことがある。従って、ユーザが画像中の特殊な性質を持つ一部を指定したり、画像中の違いを自動的に特徴量として検出し、その特徴量に基づいて、例えばフレーム内のブロック単位などで部分的に追跡・抽出方法を切替えて抽出したり、パラメータを変更した方がよい。
【００２８】
このようにして、画像の特徴量に基づいて、複数の物体抽出手段を切替えれば、様々な画像中の物体の形状を精度良く抽出することが可能になる。
【００２９】
また、現フレームと、この現フレームと時間的にずれた少なくとも２つの参照フレームとの差分を用いた第１の物体抽出手段と、フレーム間予測を使用した第２の物体抽出手段とを組み合わせて使用する場合には、第２の物体抽出手段による予測誤差が所定の範囲内であるときは、第２の物体抽出手段による抽出結果が物体領域として使用され、予測誤差が所定の範囲を越えたときは第１の物体抽出手段による抽出結果が物体領域として使用されるように、予測誤差量に基づいて、フレーム内のブロック単位で第１および第２の物体抽出手段を選択的に切り替えて使用することが望ましい。
【００３０】
また、第２の物体抽出手段は、参照フレームと物体抽出対象となる現フレームとの間のフレーム間隔が所定フレーム以上あくように入力フレーム順とは異なる順序でフレーム間予測を行うことを特徴とする。これにより、入力フレーム順でフレーム間予測を順次行う場合に比べフレーム間の動き量が大きくなるため、予測精度を向上でき、結果的に抽出精度を高めることが可能となる。
【００３１】
すなわち、フレームの間隔によっては動きが小さ過ぎたり、複雑過ぎて、フレーム間予測による形状予測手法では対応できないことがある。従って、例えば形状予測の誤差が閾値以下にならない場合は、予測に用いる抽出済みフレームとの間隔をあけることにより、予測精度が上がり、結果的に抽出精度が向上する。また、背景に動きがある場合は、参照フレーム候補は抽出フレームとの背景動きを求め補償するが、背景の動きがフレームの間隔によっては小さ過ぎたり複雑過ぎたりして、背景動き補償が精度良くできない場合がある。この場合もフレーム間隔をあけることによって動き補償精度を上げることができる。このようにして抽出フレームの順序を適応的に制御すれば、より確実に物体の形状を抽出することが可能になる。
【００３２】
また、本発明は、動画像データと、その動画像データを構成する複数フレーム内の所定フレーム上における物体領域を表すシェイプデータとを入力し、そのシェイプデータを用いて前記動画像データから物体領域を抽出する物体抽出装置において、前記動画像データが記録されている記憶装置から前記動画像データを読み出し、前記シェイプデータを動き補償することにより、前記読み出した動画像データを構成する各フレーム毎にシェイプデータを生成する手段と、前記生成されたシェイプデータによって決定される各フレームの背景領域の画像データを背景メモリ上に逐次上書きすることによって、前記動画像データの背景画像を生成する手段と、前記動画像データが記録されている記憶装置から前記動画像データを再度読み出し、前記読み出した動画像データを構成する各フレーム毎に前記背景メモリ上に蓄積されている背景画像の対応する画素との差分を求め、差分の絶対値が所定のしきい値よりも大きい画素を物体領域として決定する手段とを具備する。
【００３３】
この物体抽出装置においては、記憶装置からの動画像データを読み出す１回目のスキャン処理にて、背景画像が背景メモリ上に生成される。次いで、２回目のスキャン処理が行われ、１回目のスキャンで完成された背景画像を用いた物体領域の抽出が行われる。このようにして、動画像データが記憶装置に蓄積されていることを利用して動画像データを２回スキャンすることにより、動画像シーケンスの最初から十分に高い精度で物体領域を抽出することが可能となる。
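The two-scan procedure described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of the apparatus: frames are small grayscale images stored as lists of rows, per-frame shape data is assumed to be already available (0 for background, non-zero for object), and the function names `build_background` and `extract_objects` are hypothetical. Pixels that are never seen as background in the first scan are treated as object pixels, which is an assumption of this sketch.

```python
def build_background(frames, shapes):
    """First scan: overwrite the background memory with every pixel that the
    shape data marks as background (shape value 0)."""
    h, w = len(frames[0]), len(frames[0][0])
    background = [[None] * w for _ in range(h)]  # None = not yet observed
    for frame, shape in zip(frames, shapes):
        for y in range(h):
            for x in range(w):
                if shape[y][x] == 0:
                    background[y][x] = frame[y][x]
    return background


def extract_objects(frames, background, threshold):
    """Second scan: a pixel is an object pixel when its absolute difference
    from the completed background exceeds the threshold.
    Assumption: pixels with unknown background are treated as object."""
    masks = []
    for frame in frames:
        mask = [[1 if b is None or abs(p - b) > threshold else 0
                 for p, b in zip(frame_row, bg_row)]
                for frame_row, bg_row in zip(frame, background)]
        masks.append(mask)
    return masks
```

Because the background memory is complete before the second scan starts, the first frame of the sequence is compared against the same background as the last one.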
【００３４】
また、本発明は、前記各フレームのシェイプデータによって決定される物体領域と、前記背景画像との差分の絶対値に基づいて決定される物体領域のいずれかを物体抽出結果として選択的に出力する手段をさらに具備する。画像によっては、１スキャン目で得たシェイプデータによって決定される物体領域の方が、２スキャン目で、背景画像との差分を利用して得た物体領域よりも抽出精度が高い場合がある。したがって、１スキャン目で得た物体領域と２スキャン目で得た物体領域とを選択的に出力できるようにすることにより、さらに抽出精度の向上を図ることが可能となる。
【００３５】
また、本発明は、動画像データと、この動画像データを構成する複数のフレーム内の所定フレーム上の物体領域を表すシェイプデータとを入力し、前記シェイプデータが与えられたフレームあるいは既にシェイプデータを求めたフレームを参照フレームとして使用することにより、前記各フレームのシェイプデータを逐次求めていく物体抽出装置であって、現処理フレームをブロックに分割する手段と、前記ブロック毎に、画像データの図柄が相似であり、且つ面積が現処理ブロックよりも大きい相似ブロックを前記参照フレームから探索する手段と、前記参照フレームから相似ブロックのシェイプデータを切り出して縮小したものを、前記現処理フレームの各ブロックに貼り付ける手段と、前記貼り付けられたシェイプデータを現処理フレームのシェイプデータとして出力する手段とを具備する。
【００３６】
この物体抽出装置においては、物体抽出対象の現フレームの各ブロック毎に、画像データ（テクスチャ）の図柄が相似であり、且つ面積が現処理ブロックよりも大きい相似ブロックの探索処理と、探索された相似ブロックのシェイプデータを切り出して縮小したものを現処理フレームのブロックに貼り付ける処理とが行われる。このように現処理ブロックよりも大きい相似ブロックのシェイプデータを縮小して張り付けることにより、シェイプデータで与えられる物体領域の輪郭線がずれていてもそれを正しい位置に補正することが可能となる。したがって、例えばユーザがマウスなどで最初のフレーム上の物体領域の輪郭を大まかになぞったものをシェイプデータとして与えるだけで、以降の入力フレーム全てにおいて物体領域を高い精度で抽出することが可能となる。
【００３７】
また、本発明は、画像データと、その画像の物体領域を表すシェイプデータを入力し、そのシェイプデータを用いて前記画像データから物体領域を抽出する物体抽出装置において、前記シェイプデータの輪郭部分にブロックを設定し、各ブロック毎に、前記画像データの図柄が相似であり、且つ前記ブロックよりも大きい相似ブロックを同じ画像の中から探索する手段と、前記各ブロックのシェイプデータを各々の前記相似ブロックのシェイプデータを縮小したもので置き換える手段と、前記置き換えを所定の回数だけ繰り返す手段と、前記置き換えを繰り返されたシェイプデータを補正されたシェイプデータとして出力する手段とを具備する。
【００３８】
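Performing replacement with similar blocks found by intra-frame block matching in this way makes it possible to correct the contour given by the shape data to its proper position. Moreover, because the block matching is intra-frame, the search for a similar block and the replacement can be repeated for the same block, which further improves the correction accuracy.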
【００３９】
【発明の実施の形態】
図１には、本発明の第１実施形態に係る動画像の物体追跡／抽出装置の全体の構成が示されている。この物体追跡／抽出装置は、入力動画像信号から目的とする物体の動きを追跡するためのものであり、初期図形設定部１と、物体追跡・抽出部２とから構成されている。初期図形設定部１は、外部からの初期図形設定指示信号ａ０に基づいて、追跡／抽出対象となる目的の物体を囲むような図形を入力動画像信号ａ１に対して初期設定するために使用され、初期図形設定指示信号ａ０によって例えば長方形、円、楕円などの任意の形状の図形が目的の物体を囲むように入力動画像信号ａ１の初期フレーム上に設定される。初期図形設定指示信号ａ０の入力方法としては、例えば入力動画像信号ａ１を表示する画面上に利用者がペンやマウスなどのポインティングデバイスを用いて図形そのものを直接書き込んだり、あるいはそれらポインティングデバイスを用いて入力する図形の位置や大きさを指定するなどの手法を用いることができる。これにより、目的の物体が現れる初期フレーム画像上において、追跡／抽出対象となる物体を外部から容易に指示することが可能となる。
【００４０】
また、ユーザによる図形入力ではなく、図形の初期設定は、通常のフレーム画像を解析する処理によって例えば人や動物の顔、体の輪郭などを検出し、それを囲むように図形を自動設定することによっても実現できる。
【００４１】
物体追跡・抽出部２は、初期図形設定部１で設定された図形内に含まれる図形内画像を基準として物体の追跡および抽出を行う。この場合、動物体の追跡・抽出処理では、図形で指定された物体に着目してその物体の動きが追跡される。従って、目的とする動物体以外の周囲の余分な動きに影響を受けずに目的とする動物体の抽出／追跡を行える。
【００４２】
図２には、物体追跡・抽出部２の好ましい構成の一例が示されている。
【００４３】
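As illustrated, this object tracking/extraction unit comprises memories (M) 10 and 14, a figure setting unit 11, a background region determination unit 12, and an object extraction unit 13.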
【００４４】
図形設定部１１は、これまでに入力および図形設定した任意のフレームを参照フレームとして使用しながら入力フレームに対して順次図形を設定するために使用される。この図形設定部１１は、現フレーム画像１０１、参照フレームの図形内画像およびその位置１０３と、参照フレームの物体抽出結果１０６を入力し、現フレームの任意の図形で囲まれた領域内を表す画像１０２を出力する。すなわち、図形設定部１１による図形設定処理では、参照フレームの図形内画像１０３と現フレーム画像１０１との相関に基づいて、参照フレームの図形内画像１０３との誤差が最小となる現フレーム画像上の領域が探索され、その領域を囲むような図形が現フレーム画像１０１に対して設定される。設定する図形は、長方形、円、楕円、エッジで囲まれた領域、など、何でも良い。以下では、簡単のために長方形の場合について述べる。また、図形設定部１１の具体的な構成については、図５を参照して後述する。なお、物体を囲む図形を用いない場合は、図形内画像は全画像とし、位置を入出力する必要がない。
【００４５】
メモリ１０には、これまでに入力および図形設定されたフレームが少なくとも３つ程度保持される。保持される情報は、図形設定されたフレームの画像、設定された図形の位置や形状、図形内の画像などである。また、入力フレームの画像全体ではなく、その図形内画像だけを保持するようにしても良い。
【００４６】
背景領域決定部１２は、物体抽出対象の現フレーム毎にその現フレームとは時間的に異なるフレームの中の少なくとも二つの任意のフレームを参照フレームとして使用し、各参照フレーム毎に現フレームとの差分を取ることによってそれら各参照フレームと現フレーム間の共通の背景領域を決定する。この背景領域決定部１２は、メモリ１０で保持された、現フレームの任意の図形内画像およびその位置１０２と、少なくとも２つのフレームの任意の図形内画像およびその位置１０３と、該少なくとも２つのフレームの物体抽出結果１０６を入力し、現フレームと該少なくとも２つのフレームそれぞれの図形内画像との共通の背景領域１０４を出力する。すなわち、参照フレームとして第１および第２の２つのフレームを使用する場合には、現フレームと第１の参照フレームとの間のフレーム間差分を取ることなどによって得られた第１の差分画像により、現フレームと第１の参照フレームのどちらにおいても背景領域として用いられている共通の第１の背景領域が決定されると共に、現フレームと第２の参照フレームとの間のフレーム間差分を取ることなどによって得られた第２の差分画像により、現フレームと第２の参照フレームのどちらにおいても背景領域として用いられている共通の第２の背景領域が決定されることになる。この背景領域決定部１２の具体的な構成は、図４を参照して後述する。また、背景メモリを使って、共通の背景を得る手法もある。
【００４７】
なお、物体を囲む図形を用いない場合は、図形内画像は全画像とし、位置を入出力する必要がない。
【００４８】
物体抽出部１３は、背景領域決定部１２にて決定された共通の背景領域を用いて現フレームの図形内画像から物体領域のみを抽出するために使用され、該現フレームと少なくとも２つのフレームそれぞれとの共通の背景領域１０４を入力し、現フレームの物体抽出結果１０６を出力する。第１および第２の差分画像のどちらにも現フレーム上の物体領域が共通に含まれているため、第１の共通背景領域と第２の共通背景領域のどちらにも属さない領域の中で、現フレームの図形内画像に含まれる領域を検出することにより、現フレーム上の物体領域が抽出される。これは、共通の背景領域以外の領域が物体領域の候補となることを利用している。つまり、第１の差分画像上においては第１の共通背景領域以外の領域が物体領域候補となり、第２の差分画像上においては第２の共通背景領域以外の領域が物体領域候補となるので、２つの物体領域候補の重複する領域を、現フレームの物体領域と判定することができる。物体抽出結果１０６としては、物体領域の位置や形状を示す情報を使用することができる。また、その情報を用いて実際に現フレームから物体領域の画像を取り出すようにしても良い。
【００４９】
The memory 14 holds at least two object extraction results and is used to feed back already extracted results in order to raise the extraction accuracy.
【００５０】
ここで、図８を参照して、本実施形態で用いられる物体抽出・追跡処理の方法について説明する。
【００５１】
ここでは、時間的に連続する３つのフレームｆ（ｉ−１），ｆ（ｉ），ｆ（ｉ＋１）を用いて、現フレームｆ（ｉ）から物体を抽出する場合を例示して説明する。
【００５２】
まず、前述の図形設定部１１によって図形設定処理が行われる。３つのフレームｆ（ｉ−１），ｆ（ｉ），ｆ（ｉ＋１）についてもそれぞれ任意の参照フレームを使用することにより図形設定処理が行われ、そのフレーム上の物体を囲むように長方形Ｒ（ｉ−１），Ｒ（ｉ），Ｒ（ｉ＋１）が設定される。なお、長方形の図形Ｒ（ｉ−１），Ｒ（ｉ），Ｒ（ｉ＋１）は位置および形状の情報であり、画像として存在するものではない。
【００５３】
次に、背景領域決定部１２にて共通背景領域が決定される。
【００５４】
この場合、まず、現フレームｆ（ｉ）と第１の参照フレームｆ（ｉ−１）との間のフレーム間差分が取られ、第１の差分画像ｆｄ（ｉ−１，ｉ）が求められる。同様にして、現フレームｆ（ｉ）と第２の参照フレームｆ（ｉ＋１）との間のフレーム間差分も取られ、第２の差分画像ｆｄ（ｉ，ｉ＋１）が求められる。
【００５５】
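Obtaining the first difference image fd(i−1, i) cancels the pixel values wherever the current frame f(i) and the first reference frame f(i−1) have the same pixel value, so the difference at those pixels is zero. Therefore, if the backgrounds of frames f(i−1) and f(i) are nearly the same, what basically remains in fd(i−1, i) is an image corresponding to the OR of the in-figure image of rectangle R(i−1) and the in-figure image of rectangle R(i). As shown in the figure, the figure enclosing this residual image is the polygon Rd(i−1, i) = R(i−1) OR R(i). The common background region of the current frame f(i) and the first reference frame f(i−1) is the entire region other than the actual object region inside polygon Rd(i−1, i) (here, a figure-eight-shaped region obtained by partially overlapping two circles).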
【００５６】
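Likewise, in the second difference image fd(i, i+1), what remains is an image corresponding to the OR of the in-figure image of rectangle R(i) and the in-figure image of rectangle R(i+1). As shown in the figure, the figure enclosing this residual image is the polygon Rd(i, i+1) = R(i) OR R(i+1). The common background region of the current frame f(i) and the second reference frame f(i+1) is the entire region other than the actual object region inside polygon Rd(i, i+1) (here, a figure-eight-shaped region obtained by partially overlapping two circles).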
【００５７】
この後、第１の差分画像ｆｄ（ｉ−１，ｉ）から現フレームｆ（ｉ）と第１の参照フレームｆ（ｉ−１）の共通の背景領域を決定する処理が行われる。
【００５８】
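A difference value serving as the threshold for distinguishing the common background region from the object region is required. The user may supply it, or it may be set automatically by detecting the noise and characteristics of the image. In the latter case, there need not be a single threshold per picture; thresholds may be determined locally according to local image characteristics. Possible image characteristics include edge strength and the variance of the difference pixels. The threshold can also be obtained by using the figure that tracks the object.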
【００５９】
この場合、共通背景領域／物体領域の判定のためのしきい値となる差分値が求められ、しきい値以下の差分値を持つ画素の領域が共通背景領域として決定される。このしきい値は、第１の差分画像ｆｄ（ｉ−１，ｉ）の多角形Ｒｄ（ｉ−１，ｉ）の外側の一ライン、つまり多角形Ｒｄ（ｉ−１，ｉ）の輪郭線上に沿った各画素の差分値のヒストグラムを用いて決定することができる。ヒストグラムの横軸は画素値（差分値）、縦軸はその差分値を持つ画素数である。たとえば、画素数が多角形Ｒｄ（ｉ−１，ｉ）からなる枠線上に存在する全画素数の半分となるような差分値が、前述のしきい値として決定される。このようにしきい値を決定することにより、第１の差分画像ｆｄ（ｉ−１，ｉ）全体にわたって画素値の分布を調べることなく、容易にしきい値を決定することが可能となる。
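One way to read the histogram rule above is as a cumulative count: take the smallest difference value at which at least half of the contour pixels are covered. The sketch below follows that reading, which is an interpretation rather than something the text states exactly. It assumes the enclosing figure is an axis-aligned rectangle `(x0, y0, x1, y1)` with inclusive corners, and the helper names are hypothetical.

```python
def border_pixels(rect):
    """Yield the (x, y) coordinates on the outline of an inclusive rectangle."""
    x0, y0, x1, y1 = rect
    for x in range(x0, x1 + 1):
        yield (x, y0)
        if y1 != y0:
            yield (x, y1)
    for y in range(y0 + 1, y1):
        yield (x0, y)
        if x1 != x0:
            yield (x1, y)


def threshold_from_border(diff, rect):
    """Smallest absolute difference d such that at least half of the outline
    pixels have |diff| <= d (assumed reading of the histogram rule)."""
    values = sorted(abs(diff[y][x]) for x, y in border_pixels(rect))
    half = (len(values) + 1) // 2
    return values[half - 1]
```

Only the outline pixels are visited, so the cost grows with the perimeter of the figure rather than with the area of the difference image.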
【００６０】
次に、このしきい値を用いて、第１の差分画像ｆｄ（ｉ−１，ｉ）の多角形Ｒｄ（ｉ−１，ｉ）内における共通背景領域が決定される。共通背景領域以外の領域はオクルージョンを含んだ物体領域となる。これにより、多角形Ｒｄ（ｉ−１，ｉ）内の領域は背景領域と物体領域に２分され、背景領域の画素値が“０”、物体領域の画素値が“１”の２値画像に変換される。
【００６１】
第２の差分画像ｆｄ（ｉ，ｉ＋１）についても、同様にして、現フレームｆ（ｉ）と第２の参照フレームｆ（ｉ＋１）の共通の背景領域を決定する処理が行われ、多角形Ｒｄ（ｉ，ｉ＋１）内の領域が画素値“０”の背景領域と、画素値“１”の物体領域に変換される。
【００６２】
After this, the object extraction unit 13 performs object extraction.
【００６３】
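Here, the binary image inside polygon Rd(i−1, i) and the binary image inside polygon Rd(i, i+1), obtained from the first and second difference images, are ANDed pixel by pixel. This yields the intersection of the two object regions that include occlusion, and the object O(i) on the current frame f(i) is extracted.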
【００６４】
なお、ここでは、フレーム差分画像内の物体領域以外の他の全ての領域を共通背景領域として求める場合について説明したが、各フレームから図形内画像だけを取り出し、フレーム上での各図形内画像の位置を考慮してそれら図形内画像同士の差分演算を行うようにしてもよく、この場合には、図形外の背景領域を意識することなく、多角形Ｒｄ（ｉ−１，ｉ）内、および多角形Ｒｄ（ｉ，ｉ＋１）内の共通背景領域だけが決定されることになる。
【００６５】
このように、本実施形態では、
１）現フレームとこの現フレームに対し時間的に異なる第１および第２の少なくとも２つの参照フレームそれぞれとの差分画像を求めることにより、現フレームと第１参照フレーム間の図形内画像のＯＲと、現フレームと第２参照フレーム間の図形内画像のＯＲとを求め、
２）それら図形内画像のＯＲ処理により得られた差分画像をＡＮＤ処理し、これによって、現フレームの図形内画像から目的の物体領域を抽出するという、図形内画像に着目したＯＲＡＮＤ法による物体抽出が行われる。
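The two steps of the ORAND method summarized above can be illustrated with a small sketch. It assumes grayscale frames stored as lists of rows, a single fixed threshold, and a static background; the function names `binarize` and `orand_extract` are hypothetical and the figure restriction is omitted for brevity.

```python
def binarize(cur, ref, threshold):
    """1 where the two frames differ by more than the threshold (object
    candidate, including occlusion), 0 where they agree (common background)."""
    return [[1 if abs(c - r) > threshold else 0 for c, r in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(cur, ref)]


def orand_extract(cur, ref1, ref2, threshold):
    """Each difference image holds the OR of the object in both frames;
    ANDing the two binary images leaves the object of the current frame."""
    m1 = binarize(cur, ref1, threshold)
    m2 = binarize(cur, ref2, threshold)
    return [[a & b for a, b in zip(row1, row2)] for row1, row2 in zip(m1, m2)]
```

For a one-row example in which an object of value 100 moves from column 0 to column 2 to column 4, `orand_extract([[0, 0, 100, 0, 0]], [[100, 0, 0, 0, 0]], [[0, 0, 0, 0, 100]], 10)` keeps only column 2, the object position in the current frame.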
【００６６】
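The temporal relationship between the current frame and the two reference frames is not limited to the example above. For example, two frames f(i−m) and f(i−n) that precede the current frame f(i) in time may be used as the reference frames, or two frames f(i+m) and f(i+n) that follow it in time may be used.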
【００６７】
例えば、図８において、フレームｆ（ｉ−１），ｆ（ｉ）を参照フレームとして使用し、これら参照フレームそれぞれとフレームｆ（ｉ＋１）との差分を取って、それら差分画像に対して同様の処理を行えば、フレームｆ（ｉ＋１）から物体を抽出することができる。
【００６８】
図３には、物体追跡・抽出部２の第２の構成例が示されている。
【００６９】
図２の構成との主な違いは、背景動き削除部２１が設けられている点である。この背景動き削除部２１は、各参照フレームと現フレームとの間で背景の動きが相対的に零となるように背景の動きを補正するために使用される。
【００７０】
以下、図３の装置について、具体的に説明する。
【００７１】
背景動き削除部２１は、現フレーム２０１と時間的にずれた少なくとも２つのフレームの任意の図形内画像およびその位置２０６を入力し、時間的にずれた少なくとも２つのフレームの背景の動きを削除した画像２０２を出力する。この背景動き削除部２１の具体的な構成例については、図６で後述する。
【００７２】
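The figure setting unit 22 corresponds to the figure setting unit 11 of FIG. 2. It receives the current frame 201, the at least two images 202 from which the background motion has been removed, and the object extraction result 206 for the images 202, and outputs images 203 representing the regions enclosed by an arbitrary figure in the current frame and in the at least two images 202.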
【００７３】
メモリ２６は、任意の図形内画像とその位置を保持する。
【００７４】
背景領域決定部２３は、図２の背景領域決定部１２に対応し、該任意の図形内画像およびその位置２０３と、画像２０２の物体抽出結果２０６を入力し、現フレームと該少なくとも２つの画像２０２との共通の背景領域２０４を出力する。物体抽出部２４は、図２の物体抽出部１３に対応し、該現フレームと少なくとも２つの画像との共通の背景領域２０４を入力し、現フレームの物体抽出結果２０５を出力する。メモリ２５は、少なくとも２つの物体抽出結果を保持する。これは、図２のメモリ１４に相当する。
【００７５】
このように背景動き削除部２１を設けることにより、例えばカメラをパンした時などのように背景映像が連続するフレーム間で徐々に変化するような場合であってもそれらフレーム間で背景映像を擬似的に一定にすることができる。よって、現フレームと参照フレームとの差分を取った時に、それらフレーム間で背景を相殺することが可能となり、背景変化に影響されない共通背景領域検出処理および物体領域抽出処理を行うことができる。
【００７６】
なお、背景動き削除部２１を背景領域決定部２３の入力段に設け、これにより参照フレームの背景の動きを現フレームに合わせて削除するようにしても良い。
【００７７】
FIG. 4(a) shows an example of a specific configuration of the background region determination unit 12 (or 23).
【００７８】
変化量検出部３１は、現フレームと前述の第１および第２の参照フレームとの差分を取るために使用され、現フレームと、時間的にずれたフレームの任意の図形内画像およびその位置３０２と、時間的にずれたフレームの物体抽出結果３０１を入力し、現フレームと時間的にずれたフレームの任意の図形内画像間の変化量３０３を出力する。変化量は、例えば、フレーム間の輝度差分や色の変化、オプティカルフロー、などを用いることができる。時間的にずれたフレームの物体抽出結果を使えば、フレーム間で物体が変化しない場合でも物体は抽出できる。例えば、変化量をフレーム間差分とすると、物体に属するフレーム間差分ゼロの部分は、物体が静止しているということなので、時間的にずれたフレームの物体抽出結果と同じになる。
【００７９】
代表領域決定部３２では、現フレームの任意の図形内画像およびその位置３０２を入力し、任意の図形内画像の背景を代表領域３０４として出力する。代表領域は、任意の図形内で最も背景が多いと予想される領域を選ぶ。例えば、図８で説明した差分画像上の図形の輪郭線などのように、図形内の最も外側に帯状の領域を設定する。図形は物体を囲むように設定されるので、背景となる可能性が高い。
【００８０】
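The background change amount determination unit 33 receives the representative region 304 and the change amount 303, and outputs the change amount 305 used to judge the background. As described with reference to FIG. 8, the background change amount is determined by taking a histogram of the change amounts (difference values) in the representative region. For example, a region having the change amount, that is, the difference value, that corresponds to at least half (a majority) of the total number of pixels is determined to be the background region.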
【００８１】
代表領域の背景決定部３４では、背景の変化量３０５を入力し、代表領域の背景３０６を判定し、出力する。代表領域の背景領域の決定は、先に決定した背景変化量かどうかで判定する。背景領域決定部３５では、変化量３０３と、背景判定のしきい値３０５と、代表領域の背景３０６を入力し、代表領域以外の領域の背景３０７を出力する。代表領域以外の背景領域の決定は、代表領域から成長法で行う。例えば、決定済みの画素と図形の内部方向に隣接する未決定画素が、背景変化量と一致すれば、背景と決定する。背景と隣接しない画素や、背景変化量と一致しない画素は、背景以外と判定される。また、単純に先に決定した背景変化量かどうかで判定してもよい。このようにして、差分画像上の図形の輪郭線から内周に向かって絞り込みを行うことにより、図形内画像の中でどこまでが背景領域であるかを決定することができる。
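The inward growing step described above can be sketched as a breadth-first flood fill. The sketch assumes that the difference image has already been cropped to the figure, that "matching the background change amount" means an absolute difference at or below a threshold, and that 4-neighbour adjacency is used; none of these details is fixed by the text, and the function name is hypothetical.

```python
from collections import deque


def grow_background(diff, threshold):
    """Mark background pixels by growing inward from the outermost ring of
    the figure. Pixels not connected to the ring stay non-background even
    when their difference is small."""
    h, w = len(diff), len(diff[0])
    is_bg = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed with the representative region: the outermost ring of the figure.
    for y in range(h):
        for x in range(w):
            on_ring = y in (0, h - 1) or x in (0, w - 1)
            if on_ring and abs(diff[y][x]) <= threshold:
                is_bg[y][x] = True
                queue.append((x, y))
    # Grow toward the inside through adjacent pixels that match the background.
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < w and 0 <= ny < h and not is_bg[ny][nx]
                    and abs(diff[ny][nx]) <= threshold):
                is_bg[ny][nx] = True
                queue.append((nx, ny))
    return is_bg
```

A zero-difference pixel that is fully enclosed by object pixels is therefore not classified as background, which matches the rule that pixels not adjacent to the background are judged non-background.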
【００８２】
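Conversely, object regions that protrude from the figure are detected by working outward from the contour of the figure. For example, if an undetermined pixel that is adjacent, in the outward direction of the figure, to a pixel already determined to be non-background does not match the background change amount, it is determined to be non-background. Pixels not adjacent to a non-background pixel, and pixels that match the background change amount, are judged to be background. By expanding outward from the contour of the figure on the difference image in this way, it is possible to determine how far the non-background region extends in the image outside the figure. In this case the change amount must also be obtained outside the figure, so either a new figure is set by thickening the arbitrary figure by several pixels until the object no longer protrudes and the change amount is obtained inside it, or the change amount is simply obtained over the whole frame. Alternatively, the change amount may be obtained only inside the figure beforehand and computed on demand while the outside of the figure is being judged. Naturally, when the object does not protrude from the figure, for example when there are no non-background pixels on the contour, the processing outside the figure is unnecessary.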
【００８３】
ところで、現フレームと参照フレームとの間で物体または物体の一部が静止している場合、現フレームと参照フレームとの差分が検出されず、物体の形状が正しく抽出されない場合がある。そこで、既に抽出されている参照フレームを使って現フレームの物体を検出する方法を図４（ｂ）を参照して説明する。
【００８４】
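FIG. 4(b) shows the background region determination unit 12 (or 23) provided with a stationary object region determination unit 37. In this configuration, the change amount detection unit 31 receives the in-figure image of the current frame and its position, together with the arbitrary in-figure images and positions 311 of at least two temporally shifted frames, and detects the change amount 313 between the in-figure images of the current frame and the reference frames.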
【００８５】
形状予測部３６は、現フレームの図形内画像及びその位置と、時間的にずれた少なくとも２つのフレームの任意の図形内画像及びその位置３１１と、既に抽出されたフレームの画像及び物体形状３１７とを入力とし、現フレームと時間的にずれたフレームのうち未だ物体が抽出されていないフレームについては物体の形状３１２を予測し、出力する。
【００８６】
静止物体領域決定部３７は予測した物体形状３１２と、参照フレームと現フレームとの変化量３１３と、既に抽出されたフレームの物体形状３１７を入力とし、少なくとも２つのフレームから現フレームに対して静止している物体領域３１４を決定する。
【００８７】
背景領域決定部３５は少なくとも２つのフレームに関する、現フレームに対して静止している物体領域３１４と、参照フレームと現フレームとの変化量３１３とを入力とし、少なくとも２つのフレームと現フレームとのそれぞれの共通の背景領域３１６を決定し、出力する。
【００８８】
まず、参照フレームの物体が抽出されている場合は、現フレームで参照フレームとのフレーム間差分がゼロの領域について、参照フレームでは同じ位置の領域が物体の一部であれば、現フレームでのその領域は静止した物体の一部として抽出できる。逆に、その領域が参照フレームでは、背景の一部であれば、現フレームでのその領域は背景となる。
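The rule above for a reference frame whose object has already been extracted can be written down directly. The sketch assumes a difference image and a reference object mask of the same size (non-zero meaning object), and a tolerance `threshold` for "zero difference"; the function name is hypothetical.

```python
def static_object_mask(diff, ref_object, threshold=0):
    """1 where the inter-frame difference is (near) zero and the same
    position in the reference frame is part of the object, i.e. where the
    object is stationary; 0 elsewhere."""
    return [[1 if abs(d) <= threshold and o else 0
             for d, o in zip(diff_row, obj_row)]
            for diff_row, obj_row in zip(diff, ref_object)]
```

Zero-difference pixels that were background in the reference frame stay 0 and are treated as background of the current frame; pixels with a non-zero difference are left to the ordinary difference-based decision.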
【００８９】
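However, when the object in the reference frame has not yet been extracted, a stationary object or a stationary part of an object cannot be extracted by the above method. In that case, the object shape in the not-yet-extracted reference frame can be predicted from another frame in which the object has already been extracted, and the prediction used to judge whether a region is part of the object. Prediction methods commonly used in image coding, such as block matching and affine transformation, are used.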
【００９０】
一例として、図１３で示すようなブロックマッチング法が考えられる。このようにして物体の形状が予測されれば、フレーム間差分が検出されない領域について静止した物体の一部か、背景かを判定することが可能となる。
【００９１】
物体を囲む図形を用いない場合は、図形内画像は全画像とし、位置を入出力する必要がない。この形状予測は、参照フレームを選択する場合の形状予測と同じものを使うことができる。また、別の物体抽出法と切り換える実施例では、別の物体抽出法で得た物体形状を用いることができる。
【００９２】
図５には、図形設定部１１（または２２）の具体的な構成の一例が示されている。
【００９３】
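The division unit 41 receives the arbitrary in-figure image of a frame temporally shifted from the current frame and its position 402, and outputs the divided image 403. The in-figure image may be divided into two or four equal parts, or edges may be detected and the division made along them. For simplicity, division into two equal parts is assumed below, and each divided figure is called a block. The motion detection unit 42 receives the divided in-figure image and its position 403 and the arbitrary in-figure image of the current frame and its position 401, and outputs the motion and error 404 of the divided image. Here, the position in the current frame corresponding to each block is searched for so as to minimize the error, and the motion and the error are obtained. The division judgment unit 43 receives the motion and error 404 and the object extraction result 407 of the temporally shifted frame, and outputs the judgment result 406 on whether to divide the in-figure image of the temporally shifted frame further and, when it is not divided, the motion 405. Here, if a divided block contains none of the object extraction result of the temporally shifted frame, that block is removed from the figure. Otherwise, if the obtained error is at or above a threshold, the block is divided further and its motion is obtained again; if not, the motion of the block is fixed. The figure determination unit 44 receives the motion 405 and outputs the in-figure image of the current frame and its position 407. Here, the figure determination unit 44 finds the position in the current frame corresponding to each block and determines a new figure so that it contains all the blocks at their corresponding positions. The new figure may be the union of all the blocks, or a rectangle or circle that contains all the blocks.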
【００９４】
このようにして、参照フレームの図形内画像を複数のブロックに分割し、複数のブロックそれぞれについて現フレームとの誤差が最小となる領域を探索し、そして探索された複数の領域を囲む図形を現フレームに設定することにより、初期設定された図形の形状や大きさによらず、図形設定対象となる入力フレームに対して最適な形状の新たな図形を設定することが可能となる。
【００９５】
なお、図形設定に用いる参照フレームは既に図形が設定されていて且つ現フレームと時間的にずれたフレームであればよく、通常の符号化技術で前方向予測と後方向予測とが使用されていることと同様、現フレームよりも時間的に後のフレームを図形設定のための参照フレームとして用いることも可能である。
【００９６】
図６には、背景動き削除部２１の具体的な構成の一例が示されている。
【００９７】
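The representative background region setting unit 51 receives a temporally shifted arbitrary in-figure image and its position 501, and outputs a representative background region 503. The representative background region is a region that represents the global motion within the arbitrary figure, that is, the motion of the background within the figure. For example, when the arbitrary figure is a rectangle, a band-shaped frame region several pixels wide surrounding the rectangle is set, as shown in FIG. 7. The outermost several pixels inside the figure may be used instead. The motion detection unit 52 receives the current frame 502 and the representative background region 503, and outputs a motion 504. In the example above, the motion of the band-shaped frame region around the rectangle relative to the current frame is detected. The frame region may be treated as a single region for detection. Alternatively, as in FIG. 7, it may be divided into multiple blocks whose motions are obtained individually, and either their average motion or the most frequent motion may be output.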
【００９８】
動き補償部５３では、該時間的にずれたフレーム５０１と、該動き５０４を入力とし、動き補償画像５０５を出力する。先に求めた動きを使って、時間的にずれたフレームの動きを現フレームに合わせて削除する。動き補償は、ブロックマッチング動き補償またはアフィン変換を使った動き補償でもよい。
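A minimal sketch of background motion removal follows, under simplifying assumptions that go beyond the text: the background motion is a pure integer translation, the band region is treated as a single region given as a list of `(x, y)` coordinates, and the match criterion is the mean absolute difference. The function names and the `fill` parameter are hypothetical.

```python
def estimate_background_motion(ref, cur, band, search=2):
    """Return the displacement (dx, dy) that best maps the band pixels of the
    reference frame onto the current frame (exhaustive search)."""
    h, w = len(cur), len(cur[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            total, count = 0, 0
            for x, y in band:
                cx, cy = x + dx, y + dy
                if 0 <= cx < w and 0 <= cy < h:
                    total += abs(ref[y][x] - cur[cy][cx])
                    count += 1
            if count == 0:
                continue  # displacement moves the whole band off the frame
            score = total / count
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best[1], best[2]


def compensate(ref, dx, dy, fill=0):
    """Shift the reference frame by (dx, dy) so that its background lines up
    with the current frame; uncovered pixels receive the fill value."""
    h, w = len(ref), len(ref[0])
    return [[ref[y - dy][x - dx] if 0 <= x - dx < w and 0 <= y - dy < h else fill
             for x in range(w)]
            for y in range(h)]
```

After compensation, the inter-frame difference between the shifted reference frame and the current frame cancels the background in the same way as for a static camera.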
【００９９】
なお、動き削除は、図形内の背景に対してのみならず、フレーム全体に対して行うようにしても良い。
【０１００】
以上のように、本実施形態においては、（１）物体の輪郭ではなく、その物体を大まかに囲む図形を用いて物体を追跡していくこと、（２）現フレームに任意の図形を設定し、現フレームと少なくとも２つのフレームそれぞれの図形内画像との共通の背景領域を決定し、現フレームの物体を抽出すること、（３）該時間的にずれた少なくとも２つのフレームの背景の動きを削除すること、（４）任意の図形内画像間の変化量を検出し、代表領域を決定し、現フレームと少なくとも２つのフレームの図形内画像とその位置の、背景に対応する変化量を決定し、変化量と代表領域との関係から背景かどうかを判定すること、（５）図形内画像を分割し、任意の図形内画像又は分割された図形内画像の一部の動きを検出し、任意の図形内画像又は分割された図形内画像の一部を分割するか否かを判定し、現フレームの任意の図形内画像と位置を決定すること、（６）背景を代表する領域を設定し、背景の動きを検出し、該時間的にずれたフレームの背景の動きを削除した画像を作ることにより、目的の物体以外の周囲の余分な動きに影響を受けずに、且つ比較的簡単な処理で、目的の物体を精度良く抽出・追跡することが可能となる。
【０１０１】
また、本実施形態の物体抽出・追跡処理の手順はソフトウェア制御によって実現することもできる。この場合でも、基本的な手順は全く同じであり、図形の初期設定を行った後に、入力フレームに対して順次図形設定処理を行い、この図形設定処理と並行してあるいは図形設定処理完了後に、背景領域決定処理、物体抽出処理を行えばよい。
【０１０２】
次に、本発明の第２実施形態を説明する。
【０１０３】
前述の第１実施形態では、ＯＲＡＮＤ法による一つの物体抽出手段のみを備えていたが、入力画像によってはその手段だけでは十分な抽出性能が得られないこともある。また、第１実施形態のＯＲＡＮＤ法では、物体抽出対象となる現フレームと、この現フレームに対し時間的に異なる第１の参照フレームとの差分に基づいて共通の背景を設定し、また別の現フレームと時間的に異なる第２の参照フレームとの差分に基づいて共通の背景を設定している。しかし、この第１及び第２の参照フレームの選択方法は特に与えられていない。第１及び第２の参照フレームの選択によっては、物体の抽出結果に大きな差が生まれ、良好な結果を得られないことがあり得る。
【０１０４】
そこで、本第２実施形態では、入力画像によらずに物体を高精度に抽出できるように第１実施形態を改良している。
【０１０５】
まず、図９のブロック図を用いて、第２実施形態に係る物体追跡／抽出装置の第１の構成例について説明をする。
【０１０６】
以下では、第１実施形態の物体追跡／抽出部２に対応する構成のみについて説明する。
【０１０７】
図形設定部６０は、図２で説明した第１実施形態の図形設定部１１と同じものであり、フレーム画像６０１と、初期フレームまたは既に他の入力フレームに対して設定した図形６０２とを入力とし、フレーム画像６０１に図形を設定して出力する。スイッチ部ＳＷ６１は、既に行われた物体抽出の結果６０５を入力とし、それに基づいて、使用すべき物体抽出部を切り替えるための信号６０４を出力する。
【０１０８】
物体追跡・抽出部６２は、図示のように第１乃至第Ｋの複数の物体追跡・抽出部から構成されている。これら物体追跡・抽出部はそれぞれ異なる手法で物体抽出を行う。物体追跡・抽出部６２の中には、第１実施形態で説明したＯＲＡＮＤ法を用いるものが少なくとも含まれている。また、別の方法による物体追跡・抽出部としては、例えばブロックマッチングによる形状予測法を用いたものや、アフィン変換による物体形状予測などを使用することができる。これら形状予測では、既に物体抽出されたフレームと現フレームとのフレーム間予測により現フレーム上の物体領域の位置または形状が予測され、その予測結果に基づいて現フレームの図形内画像６０３から物体領域が抽出される。
【０１０９】
ブロックマッチングによる形状予測の一例を図１３に示す。現フレームの図形内画像は図示のように同じ大きさのブロックに分割される。各ブロック毎に絵柄（テクスチャ）が最も類似したブロックが、既に物体の形状及び位置が抽出されている参照フレームから探索される。この参照フレームについては、物体領域を表すシェイプデータが既に生成されている。シェイプデータは、物体領域に属する画素の画素値を“２５５”、それ以外の画素値を“０”で表したものである。この探索されたブロックに対応するシェイプデータが、現フレームの対応するブロック位置に張り付けられる。このようなテクスチャの探索およびシェイプデータの張り付け処理を現フレームの図形内画像を構成する全ブロックについて行うことにより、現フレームの図形内画像は、物体領域と背景領域を区別するシェイプデータによって埋められる。よって、このシェイプデータを用いることにより、物体領域に対応する画像（テクスチャ）を抽出することができる。
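The block-matching shape prediction described above might look like the following sketch. It assumes square blocks of a fixed size, a sum-of-absolute-differences match criterion, an exhaustive search within `search` pixels, and frame dimensions that are multiples of the block size; these choices and the function names are illustrative assumptions.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def predict_shape(cur, ref, ref_shape, size, search):
    """For every block of the current frame, find the most similar block in
    the reference frame and paste that block's shape data (0 or 255).
    Returns the predicted shape and the matching error of each block."""
    h, w = len(cur), len(cur[0])
    shape = [[0] * w for _ in range(h)]
    errors = {}
    for by in range(0, h - size + 1, size):
        for bx in range(0, w - size + 1, size):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    rx, ry = bx + dx, by + dy
                    if 0 <= rx <= w - size and 0 <= ry <= h - size:
                        err = sad(cur, ref, bx, by, rx, ry, size)
                        if best is None or err < best[0]:
                            best = (err, rx, ry)
            err, rx, ry = best  # zero displacement is always a valid candidate
            errors[(bx, by)] = err
            for j in range(size):
                for i in range(size):
                    shape[by + j][bx + i] = ref_shape[ry + j][rx + i]
    return shape, errors
```

The per-block errors are returned alongside the shape because later parts of the description use the matching error to decide, block by block, whether the predicted shape can be trusted.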
【０１１０】
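The switch unit SW61, for example, performs the same operation as the first object tracking/extraction unit; when the extraction accuracy is good it switches to select the first object tracking/extraction unit, and otherwise it switches to select another object tracking/extraction unit. For example, if the first object tracking/extraction unit is an object shape prediction means based on block matching, the switching can be controlled by the magnitude of the matching error. If it uses object shape prediction by affine transformation, the switching can be based on the magnitude of the estimation error of the affine transformation coefficients. The unit of switching in the switch unit SW61 is not the frame but a small region within the frame, for example a block or a region obtained by segmentation based on luminance or color. This allows the extraction method to be selected at a finer granularity and raises the extraction accuracy.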
【０１１１】
図１０には、第２実施形態に係る動画像物体追跡／抽出装置の第２の例が示されている。
【０１１２】
図形設定部７０は図２で説明した第１実施形態の図形設定部１１と同じものであり、画像７０１と、初期フレームまたは既に他の入力フレームに対して設定した図形７０２とを入力し、フレーム画像７０１に図形を設定して出力する。
【０１１３】
第２の物体追跡・抽出部７１は、ブロックマッチング法やアフィン変換などの形状予測によって物体領域を抽出するために使用され、図形設定部７０から出力される現フレームの図形内画像７０３と、既に抽出されている別の参照フレーム上の物体の形状及び位置７０７を入力とし、現フレームの図形内画像７０３から物体の形状及び位置を予測する。
【０１１４】
参照フレーム選択部７２は、第２の物体追跡・抽出部７１によって予測された現フレームの物体の予測形状及び位置７０４と、既に抽出されている物体の形状及び位置７０７とを入力し、少なくとも２つの参照フレームを選択する。ここで、参照フレームの選択方法について説明する。
【０１１５】
Ｏ_ｉ，Ｏ_ｊ，Ｏ_ｃｕｒｒは各々フレームｉ，ｊ及び抽出中のフレームｃｕｒｒの物体とする。２つの時間的に異なる参照フレームｆ_ｉ，ｆ_ｊとの差分ｄ_ｉ，ｄ_ｊを取って、これら差分をＡＮＤ処理して現フレームｆ_ｃｕｒｒの物体を抽出すると、抽出したい物体Ｏ_ｃｕｒｒ以外に、物体Ｏ_ｉ，Ｏ_ｊの重なり部分が時間的に異なるフレームのＡＮＤ処理により抽出される。勿論、Ｏ_ｉ∩ Ｏ_ｊ＝φ、つまり物体Ｏ_ｉ，Ｏ_ｊの重なり部分が存在せず、物体Ｏ_ｉ，Ｏ_ｊの重なりが空集合となる場合には問題ない。
【０１１６】
しかし物体Ｏ_ｉ，Ｏ_ｊの重なり部分が存在し（Ｏ_ｉ∩ Ｏ_ｊ≠φ）、かつ、この重なり部分が抽出したい物体の外部に存在する場合は、Ｏ_ｃｕｒｒと、Ｏ_ｉ∩ Ｏ_ｊの二つが抽出結果として残る。
【０１１７】
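In this case, as in FIG. 14(a), there is no problem if the background region of O_curr (its complement, written Ō_curr) has no region in common with both objects O_i and O_j, that is, if Ō_curr ∩ (O_i ∩ O_j) = φ. However, as in FIG. 14(b), if such a common region exists, that is, if Ō_curr ∩ (O_i ∩ O_j) ≠ φ, then O_curr is extracted with an erroneous shape such as the one shown hatched.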
【０１１８】
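Therefore, the optimal reference frames f_i and f_j for extracting the object shape correctly are frames that satisfy
(O_i ∩ O_j) ⊆ O_curr …(1)
that is, frames f_i and f_j for which the overlap of O_i and O_j lies within O_curr (FIG. 14(a)).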
【０１１９】
When two or more reference frames are selected, the condition generalizes to
(O_i ∩ O_j ∩ … ∩ O_k) ⊆ O_curr …(2)
【０１２０】
したがって、物体抽出対象となる現フレーム上の物体の位置または形状の予測結果に基づいて、（１）式または（２）式を満足するような参照フレームを選択することにより、確実に物体の形状を抽出することが可能になる。
【０１２１】
第１の物体追跡・抽出部７３では、参照フレーム選択部７２で選択された少なくとも２つの参照フレーム７０５と、現画像７０１を入力し、ＯＲＡＮＤ法により物体を抽出してその形状及び位置７０６を出力する。
【０１２２】
メモリ７４には、抽出された物体形状及び位置７０６が保持されている。
【０１２３】
図１１には、第２実施形態に係る物体追跡／抽出装置の第３の構成例が示されている。
【０１２４】
この物体追跡／抽出装置は、図示のように、図形設定部８０、第２の物体追跡・抽出部８１、スイッチ部ＳＷ８２、および第１の物体抽出部８３から構成されている。図形設定部８０、第２の物体追跡・抽出部８１、および第１の物体抽出部８３は、それぞれ図１０の図形設定部７０、第２の物体追跡・抽出部７１、および第１の物体抽出部７３に対応している。本例では、スイッチ部ＳＷ８２によって、第２の物体追跡・抽出部８１の抽出結果と第１の物体抽出部８３の抽出結果が選択的に使用される。
【０１２５】
すなわち、図形設定部８０では、画像８０１と初期図形の形状及び位置８０２を入力し、図形の形状及び位置８０３を出力する。第２の物体追跡・抽出部８１では、図形の形状及び位置８０３と、既に抽出されている物体の形状及び位置８０６を入力し、未だ抽出されていない物体の予測形状及び位置８０４を予測し、出力する。スイッチ部ＳＷ８２では、第２の物体抽出部で予測された物体の形状及び位置８０４を入力し、第１の物体追跡・抽出部を行なうかどうかを切り替える信号８０５を出力する。物体追跡・抽出部８３では、既に抽出されている物体の形状及び位置８０６と、未だ抽出されていない物体の予測形状及び位置８０４を入力し、物体の形状及び位置８０５を決定し、出力する。
【０１２６】
スイッチ部ＳＷ８２での切替の単位は、上記で述べた例と同様にブロック毎に切り替えても良いし、輝度や色に基づいて分割した領域毎に切り替えても良い。切替を判断する方法として、例えば物体予測したときの予測誤差を用いることができる。すなわち、フレーム間予測を用いて物体抽出を行う第２の物体追跡・抽出部８１における予測誤差が所定のしきい値以下の場合には、第２の物体追跡・抽出部８１によって得られた予測形状が抽出結果として使用されるようにスイッチ部ＳＷ８２による切替が行われ、第２の物体追跡・抽出部８１における予測誤差が所定のしきい値を越えた場合には、第１の物体追跡・抽出部８３によってＯＲＡＮＤ法にて物体抽出が行われるようにスイッチ部ＳＷ８２による切替が行われ、その抽出結果が外部に出力される。
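The per-block switching described above can be sketched as follows. The sketch assumes that both extraction results are available as binary masks of the same size, that the prediction errors are keyed by the top-left corner of each block, and that `max_error` plays the role of the predetermined threshold; the function name is hypothetical.

```python
def select_per_block(predicted, orand, block_errors, size, max_error):
    """Use the predicted shape where the prediction error is within the
    threshold, and the ORAND result for blocks whose error exceeds it."""
    h, w = len(predicted), len(predicted[0])
    out = [row[:] for row in predicted]
    for (bx, by), err in block_errors.items():
        if err > max_error:
            for y in range(by, min(by + size, h)):
                for x in range(bx, min(bx + size, w)):
                    out[y][x] = orand[y][x]
    return out
```

In practice the ORAND extraction would only need to be run for the blocks that fail the error test; the sketch takes a full ORAND mask simply to keep the example short.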
【０１２７】
図１５は、予測の単位となるブロック毎にマッチング誤差に基づいて、使用する抽出部を切り替えた場合の抽出結果の例を示している。
【０１２８】
ここで、網目で示した領域は第２の物体追跡・抽出部８１による予測で得られた物体形状であり、斜線で示した領域は第１の物体追跡・抽出部８３によって得られた物体形状である。
【０１２９】
図１２には、本第２実施形態に係る動画像物体追跡／抽出装置の第４の構成例が示されている。
【０１３０】
この物体追跡／抽出装置は、図１１の構成に加え、図１０の参照フレーム選択部を追加したものである。
【０１３１】
図形設定部９０では、画像９０１と初期図形の形状及び位置９０２を入力し、図形の形状及び位置９０３を出力する。第２の物体追跡・抽出部９１では、図形の形状及び位置９０３と、既に抽出されている物体の形状及び位置９０８を入力し、未だ抽出されていない物体の予測形状及び位置９０４を予測し、出力する。スイッチ部ＳＷ９２は、その予測した物体の形状及び位置９０４を入力とし、予測物体の精度が良いか否かを判断し、第２の物体抽出部で抽出された物体の出力を切り替える信号９０５を出力する。参照フレーム選択部９３は、未だ抽出されていない物体の予測形状及び位置９０４と既に抽出されている物体の形状及び位置９０８を入力とし、少なくとも２つの参照フレームの物体または予測物体の形状及び位置９０６を選択し、出力する。物体追跡・抽出部９４は、現画像９０１と、少なくとも２つの参照フレームの物体又は予測物体の形状及び位置９０６を入力とし、物体を抽出して、その形状及び位置９０７を出力する。メモリ９５は、抽出した物体の形状及び位置９０７と、その予測した物体の形状及び位置９０４いずれかを保持する。
【０１３２】
以下、図１６を参照して、本例における物体追跡／抽出方法の手順を説明する。
【０１３３】
（ステップＳ１）
参照フレームの候補としては、現フレームと時間的にずれたフレームが予め設定される。これは現フレーム以外の全てのフレームでも良いし、現フレーム前後の数フレームと限定してもよい。例えば、初期フレームと、現フレームより前３フレーム、現フレームより後１フレームの合計５フレームに限定する。ただし、前フレームが３フレームない場合はその分後のフレームの候補を増やし、後フレームが１フレームない場合はその分前４フレームを候補とする。
【０１３４】
（ステップＳ２）
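First, the user sets a figure, for example a rectangle, that encloses the object to be extracted in the initial frame. For subsequent frames, the initially set figure is divided into blocks, matching is performed, and the blocks are pasted at the corresponding positions. The object is tracked by setting a new rectangle that contains all the pasted blocks. An object tracking figure is set for every reference frame candidate. Recomputing the object tracking figures of later frames each time an object is extracted helps prevent extraction errors. The user also inputs the object shape for the initial frame.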
【０１３５】
以下、抽出するフレームは現フレームとし、現フレームより前のフレームは物体が既に抽出されており、先のフレームは抽出されていないとする。
【０１３６】
（ステップＳ３）
参照フレーム候補の図形の周辺に適当な領域を設定し、現フレームとの背景の動きを検出して参照フレームの図形内の背景を削除する。背景の動きを検出する方法として、図形の周囲数画素の幅の領域を設定し、この領域を現フレームに対してマッチングを取り、マッチング誤差が最小となる動きベクトルを背景の動きとする。
【０１３７】
（ステップＳ４）
背景動き削除時の動きベクトル検出誤差が大きい参照フレームは候補から外すことにより、背景動き削除が適当でない場合の抽出エラーを防ぐことができる。また、参照フレーム候補が減った場合、新たに参照フレーム候補を選び直してもよい。新たに付け加えた参照フレーム候補の図形設定や背景動き削除が行なわれていない場合は、新たに図形設定および背景動き削除を行なう必要がある。
【０１３８】
（ステップＳ５）
次に、未だ物体が抽出されていない現フレームと、現フレームより先の参照フレームの候補の物体形状を予測する。現フレーム又は先の参照フレームの候補に設定された長方形を例えばブロックに分割して既に物体が抽出されているフレーム（前のフレーム）とマッチングを取り、対応する物体形状を張り付けて物体形状を予測する。物体が抽出される度にそれを使って先のフレームの物体予測をやり直すほうが抽出エラーを防ぐことができる。
【０１３９】
（ステップＳ６）
この時、予測誤差が小さいブロックは、予測した形状を抽出結果としてそのまま出力する。また物体形状の予測をブロック単位で処理を行なうとマッチング誤差によりブロック歪みが生じる場合があるので、それを消去するようなフィルターをかけ、全体の物体形状を滑らかにしてもよい。
【０１４０】
物体追跡及び物体形状予測の時に行なう長方形の分割は、固定ブロックサイズで行なっても良いし、マッチング閾値によって階層的ブロックマッチングによって行なっても良い。
【０１４１】
予測誤差が大きいブロックについては、以下の処理を行う。
【０１４２】
（ステップＳ７）
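Provisional reference frames are chosen from the reference frame candidates, and among the combinations, a set of reference frames that satisfies expression (1) or expression (2) is selected. If no combination of the candidates satisfies expression (1) or expression (2), it is preferable to select the combination for which the number of pixels in O_i ∩ O_j is smallest. It is also better to weigh the combinations so that frames with as small a motion vector detection error during background motion removal as possible are chosen. Specifically, when several reference frame sets satisfy the condition of expression (1) or expression (2) equally, the set with the smaller motion vector detection error during background motion removal may be chosen, for example. In the following, two reference frames are assumed to have been selected.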
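The selection in step S7 can be sketched for the two-frame case. Object regions are represented here as Python sets of `(x, y)` pixel coordinates, with the shapes of not-yet-extracted frames taken from prediction; this representation and the function name are assumptions, and the tie-break on motion vector detection error is omitted.

```python
from itertools import combinations


def choose_reference_pair(candidates, predicted_current):
    """candidates maps a frame id to the set of object pixels in that frame.
    Return the first pair whose overlap lies inside the predicted current
    object, i.e. (O_i ∩ O_j) ⊆ O_curr; if none qualifies, fall back to the
    pair whose overlap contains the fewest pixels."""
    best_pair, best_size = None, None
    for i, j in combinations(sorted(candidates), 2):
        overlap = candidates[i] & candidates[j]
        if overlap <= predicted_current:  # subset test on sets
            return (i, j)
        if best_size is None or len(overlap) < best_size:
            best_pair, best_size = (i, j), len(overlap)
    return best_pair
```

A pair with an empty overlap always passes the subset test, which corresponds to the case described earlier in which O_i and O_j do not overlap at all.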
【０１４３】
（ステップＳ８）
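Once the reference frames are selected, the inter-frame difference from the current frame is computed, and attention is paid to the inter-frame difference inside the set figure. A histogram of the absolute differences of the one-pixel line just outside the set figure is computed, the frequently occurring absolute difference values are taken as the difference values of the background region, and the background pixels on that outer line are determined. Starting from those background pixels and moving inward, adjacent pixels having a background difference value are determined to be background pixels, and this continues until a pixel is judged not to be background. These background pixels form the common background region of the current frame and one reference frame. Because noise may make the boundary between the background region and the rest look unnatural, a filter that smooths the boundary or one that removes spurious noise regions may be applied.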
【０１４４】
（ステップＳ９）
各々の参照フレームに対して共通の背景領域が求まると、二つの共通背景領域に含まれない領域を検出し、それを物体領域として抽出する。先に予測した物体の形状を用いない部分について、ここでの結果を出力し、物体全体の形状を出力する。
【０１４５】
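When the part that uses the shape obtained from the common background and the part that uses the previously predicted object shape are not consistent with each other, a filter can be applied at the end to make the output visually acceptable.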
【０１４６】
以上説明したように、本第２実施形態によれば、入力画像によらずに物体を精度良く抽出できる。又は、物体抽出に適した参照フレームを選択することができる。
【０１４７】
次に、本発明の第３実施形態を説明する。
【０１４８】
まず、図１７のブロック図を用いて、第３実施形態に係る物体追跡／抽出装置の第１の例を説明をする。
【０１４９】
ここでは、物体抽出対象となる現フレームから、その少なくとも一部の領域についての画像の特徴量を抽出し、その特徴量に基づいて複数の物体抽出手段を切り替える構成が採用されている。
【０１５０】
すなわち、本物体追跡／抽出装置は、図示のように、図形設定部１１０、特徴量抽出部１１１、スイッチ部ＳＷ１１２、複数の物体追跡・抽出部１１３、およびメモリ１１４から構成されている。図形設定部１１０、スイッチ部ＳＷ１１２、複数の物体追跡・抽出部１１３は、それぞれ第２実施形態で説明した図９の図形設定部６０、スイッチ部ＳＷ６１、および複数の物体追跡・抽出部６２と同じであり、特徴量抽出部１１１によって抽出された現フレームの画像の特徴量に基づいて、使用する物体追跡・抽出部の切替が行われる点が異なっている。
【０１５１】
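The figure setting unit 110 receives the extraction frame 1101, the initial figure 1102 set by the user, and the extraction result 1106 of already extracted frames, sets a figure on the extraction frame, and outputs that figure. The figure may be a geometric figure such as a rectangle, circle, or ellipse, or the user may input the object shape to the figure setting unit 110. In that case the figure need not be a precise shape; a rough shape suffices. The feature extraction unit 111 receives the extraction frame 1103 on which the figure has been set and the extraction result 1106 of already extracted frames, and outputs a feature quantity 1104. The switch unit SW112 receives the feature quantity 1104 and the extraction result 1106 of already extracted frames, and controls which object tracking/extraction unit receives the extraction frame 1103 on which the figure has been set.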
【０１５２】
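When a feature quantity has been obtained for the whole image, the switch unit SW112 can detect the nature of the image and use it to route the input to the object tracking/extraction unit appropriate for that image. The inside of the figure may be divided into parts of suitable size and a feature quantity given for each part. Feature quantities include variance, luminance gradient, and edge strength, all of which can be computed automatically. Properties of the object that a person recognizes visually may also be given to the switch unit SW112 by the user. For example, if the target object is a person, the user may designate the hair, whose edges are indistinct, so that special extraction parameters are chosen for it and edge correction is applied as preprocessing before extraction.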
【０１５３】
特徴量は、設定された図形内部（物体及びその周辺）に関してだけでなく、図形外部（背景部）に関する特徴量でも良い。
【０１５４】
複数（第１〜ｋ）の物体追跡・抽出部１１３の各々では、図形が設定された抽出フレーム１１０３と、既に抽出されたフレームの抽出結果１１０６とを入力とし、物体を追跡・抽出した結果１１０５を出力する。
【０１５５】
複数の物体追跡・抽出部１１３は、ＯＲＡＮＤ法を使用して物体を抽出するもの、クロマキーを使用して物体を抽出するもの、ブロックマッチングやアフィン変換によって物体を抽出するものなどを含む。
【０１５６】
なお、実施形態１では、設定された図形の周囲の画素値のフレーム間差分のヒストグラムを用いて、背景画素が決定されているが、単純に、フレーム間差分が閾値以下の画素が背景画素と決定されても良い。また、実施形態１では、設定された図形から図形内部に向かって背景画素（差分値が一定値以下）が決定されているが、図形から図形外部へ向けて物体画素（差分値が一定値以上）も決定されても良いし、任意の操作順でもよい。
【０１５７】
メモリ１１４は、物体を追跡・抽出した結果１１０５を入力とし、それを保持する。
【０１５８】
以下、画像の性質を示す特徴量によって、追跡／抽出方法を切替えると、よりよい抽出結果が得られる理由について説明する。
【０１５９】
例えば、背景の動きがあるかどうか予め分かるならば、その性質を使った方がよい。背景の動きがある場合は、背景動き補償が行なわれるが、完全に補償できるかわからない。複雑な動きをするフレームではほとんど動き補償できない。このようなフレームは、背景動き補償の補償誤差で予め分かるので、参照フレーム候補にしないなど工夫が可能である。しかし、背景の動きがない場合は、この処理は不必要である。別の物体が動いていると、誤った背景動き補償が行なわれたり、そのフレームは参照フレーム候補から外れたりして参照フレーム選択条件に最適なフレームであっても選ばれず、抽出精度が落ちることがある。
【０１６０】
また、一つの画像中にも多様な性質が混在している。物体の動きやテクスチャも部分的に異なり、同じ追跡・抽出方法及び装置やパラメータではうまく物体が抽出できないことがある。従って、ユーザが画像中の特殊な性質を持つ一部を指定したり、画像中の違いを自動的に特徴量として検出して、部分的に追跡・抽出方法を切替えて物体を抽出したり、パラメータを変更した方がよい。
【０１６１】
このようにして、複数の物体の追跡・抽出手段を切替えれば、様々な画像中の物体の形状を精度良く抽出することが可能になる。
【０１６２】
次に、図１８のブロック図を用いて、第３実施形態に係る動画像物体追跡／抽出装置の第２の構成例について説明する。
【０１６３】
図形設定部１２０は、抽出フレーム１２０１と、ユーザ設定による初期図形１２０２と、既に抽出されたフレームの抽出結果１２０７を入力とし、抽出フレームに図形を設定して出力する。第２の物体追跡・抽出部１２１は、ブロックマッチング法やアフィン変換などの形状予測によって物体領域を抽出するために使用され、図形が設定された抽出フレーム１２０３と、既に抽出されたフレームの抽出結果１２０７を入力とし、物体の追跡・抽出結果１２０４を出力する。
【０１６４】
特徴量抽出部１２２は、物体の追跡・抽出結果１２０４を入力とし、物体の特徴量１２０５をスイッチ部ＳＷ１２３に出力する。スイッチ部ＳＷ１２３は、物体の特徴量１２０５を入力として、第一の物体追跡・抽出部への物体の追跡・抽出結果１２０４の入力を制御する。例えば、第２の物体追跡・抽出部１２１でブロックマッチング法により物体形状が追跡・抽出された場合、特徴量をマッチング誤差として、このマッチング誤差が小さい部分は第２の物体追跡・抽出部１２１による予測形状の抽出結果として出力される。また他の特徴量として、ブロック毎に輝度勾配や分散、テクスチャの複雑さを表すパラメータ（フラクタル次元など）がある。輝度勾配を用いた場合、輝度勾配がほとんどないブロックに対しては、ＯＲＡＮＤ法による第１の物体追跡・抽出部１２４の結果が使用されるように第一の物体追跡・抽出部への入力が制御される。またエッジ検出をして、エッジの有無や強度を特徴量とした場合、エッジのない所、弱い所では第１の物体追跡・抽出部１２４の結果が使用されるように第一の物体追跡・抽出部への入力が制御される。このように、画像の一部分であるブロック単位や領域単位で、切替の制御が変えられる。切替の閾値を大きくしたり小さくしたりすることで、適応的な制御ができる。
【０１６５】
第１の物体追跡・抽出部１２４は、抽出フレーム１２０１と、物体の追跡・抽出結果１２０４と、既に抽出されたフレームの抽出結果１２０７を入力とし、抽出フレームの追跡・抽出結果１２０６をメモリ１２５に出力する。
メモリ１２５は、抽出フレームの追跡・抽出結果１２０６を入力とし、保持する。
【０１６６】
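Next, a third configuration example of the object tracking/extraction apparatus according to the third embodiment is described with reference to the block diagram of FIG. 19.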
【０１６７】
この物体追跡／抽出装置は、図１８の構成に加え、第２実施形態で説明した参照フレーム選択部を追加したものである。すなわち、この物体追跡／抽出装置は、図示のように、図形設定部１３０、第２の物体追跡・抽出部１３１、特徴量抽出部１３２、スイッチ部ＳＷ１３３、参照フレーム選択部１３４、第１の物体追跡・抽出部１３５、およびメモリ１３６から構成されている。
【０１６８】
図形設定部１３０では、抽出フレーム１３０１と、ユーザ設定による初期図形１３０２と、既に抽出されたフレームの抽出結果１３０８を入力とし、抽出フレームに図形を設定して出力する。第２の物体追跡・抽出部１３１は、ブロックマッチング法やアフィン変換などの形状予測によって物体領域を抽出するためのものであり、図形を設定された抽出フレーム１３０３と、既に抽出されたフレームの抽出結果１３０８を入力とし、物体の追跡・抽出結果１３０４を出力する。
【０１６９】
特徴量抽出部１３２は、物体の追跡・抽出結果１３０４を入力とし、物体の特徴量１３０５を出力する。スイッチ部ＳＷ１３３は、物体の特徴量１３０５を入力とし、第１の物体追跡・抽出部１３５への物体の追跡・抽出結果１３０４の入力を制御する。
【０１７０】
参照フレーム選択部１３４は、第１の物体追跡・抽出部１３５への物体の追跡・抽出結果１３０４と、既に抽出されたフレームの抽出結果１３０８を入力とし、参照フレーム１３０６を出力する。
【０１７１】
物体の特徴の一例として、動きの複雑さがある。第２の物体追跡・抽出部１３１でブロックマッチング法により物体を追跡・抽出する場合、マッチング誤差が大きい部分に対して、第１の物体抽出結果が出力される。部分的に複雑な動きがあると、その部分はマッチング誤差が大きくなり、第１の物体追跡・抽出部１３５で抽出されることになる。従って、このマッチング誤差を特徴量として第１の物体追跡・抽出部１３５で用いる参照フレームの選択方法が切替えられる。具体的には物体形状全体ではなく、第１の物体追跡・抽出部１３５で抽出する部分だけについて第２実施形態で説明した式（１）または（２）の選択条件を満たすように参照フレームの選択方法を選ぶのがよい。
【０１７２】
背景の特徴量の例は、１）背景が静止している画像である、２）ズームがある、３）パーンがあるという情報等である。この特徴量はユーザが入力しても良いし、カメラから得たパラメータが特徴量として入力されても良い。背景の特徴量としては、背景の動きベクトル、背景動き補正画像の精度、背景の輝度分布、テクスチャ、エッジなどがある。例えば背景動き補正画像の精度を背景動き補正画像と補正前画像との差分平均を特徴量として、参照フレーム選択方法が制御できる。制御例としては、差分平均が非常に多い場合、そのフレームは参照フレームの候補にしなかったり、そのフレームの選択順位を下げてフレームを選んだりできる。背景が静止している場合や、背景動き補正がすべてのフレームについて完全であると差分がゼロとなる。参照フレーム選択法は、第２実施形態と同じ方法を用いることができる。
【０１７３】
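The first object tracking/extraction unit 135 receives the extraction frame 1301, the reference frame 1306, and the extraction results 1308 of already extracted frames, and outputs the tracking/extraction result 1307 of the extraction frame, obtained by the ORAND method, to the memory 136. The memory 136 receives and holds the tracking/extraction result 1307 of the extraction frame.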
【０１７４】
次に、図２２を用いて、先に挙げた例のうち、第２の物体追跡・抽出部からの出力から特徴量を得て、それによって複数の参照フレーム選択部を切替える例を第４の構成例として説明する。
【０１７５】
図形設定部１６０は、抽出フレーム１６０１と、ユーザが設定した初期図形１６０２と、既に物体が抽出されたフレーム１６０８を入力とし、設定図形１６０３を出力する。第２の物体追跡・抽出部１６１は、ブロックマッチング法やアフィン変換などの形状予測によって物体領域を抽出するために使用され、設定図形１６０３と、既に物体が抽出されたフレーム１６０８を入力とし、物体追跡・抽出結果１６０４を出力する。特徴量検出部１６３は、物体追跡・抽出結果１６０４を入力とし、特徴量１６０５をスイッチ部ＳＷ１６４に出力する。スイッチ部ＳＷ１６４は、特徴量１６０５を入力し、参照フレーム選択部への物体追跡・抽出結果１６０４の入力を制御する。
【０１７６】
複数の参照フレーム選択部１６５は、物体追跡・抽出結果１６０４と、既に物体が抽出されたフレーム１６０８を入力とし、少なくとも２つの参照フレーム１６０６を出力する。
【０１７７】
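The first object tracking/extraction unit 166 is used to extract the object by the ORAND method. It receives the reference frames 1606 and the extraction frame 1601, and outputs the object tracking/extraction result 1607 to the memory 167. The memory 167 receives and holds the object tracking/extraction result 1607.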
【０１７８】
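Next, with reference to the block diagram of FIG. 23, an example from among those mentioned above will be described in which background information is obtained and the input to the plurality of reference frame selection units is controlled according to the error of the background motion correction.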
【０１７９】
図形設定部１７０は、抽出フレーム１７０１と、ユーザが設定した初期図形１７０２と、既に物体が抽出されたフレーム１７１０を入力とし、設定図形１７０３を出力する。第２の物体追跡・抽出部１７１は、設定図形１７０３と、既に物体が抽出されたフレーム１７１０を入力とし、物体の追跡・抽出結果１７０４を出力する。スイッチ部ＳＷ１７２では、ユーザが指定した背景の情報１７０５を入力し、背景動き補正部１７３への抽出フレーム１７０１の入力を制御する。
【０１８０】
背景動き補正部１７３は、抽出フレーム１７０１と、既に物体が抽出されたフレーム１７１０を入力とし、背景動きを補正したフレーム１７０６を出力する。
【０１８１】
背景特徴量検出部１７４は、抽出フレーム１７０１と、背景動きを補正したフレーム１７０６を入力とし、背景の特徴量１７０７をスイッチ部ＳＷ１７５へ出力する。このスイッチ部ＳＷ１７５は、背景の特徴量１７０７を受け、参照フレーム選択部１７６への物体の追跡・抽出結果１７０４の入力を制御する。参照フレーム選択部１７６は、物体の追跡・抽出結果１７０４と、既に物体が抽出されたフレーム１７１０を入力し、少なくとも２つの参照フレーム１７０８を出力する。
【０１８２】
第１の物体追跡・抽出部１７７は、少なくとも２つの参照フレーム１７０８と、抽出フレーム１７０１を入力とし、物体の追跡・抽出結果１７０９をメモリ１７８に出力する。メモリ１７８は、物体の追跡・抽出結果１７０９を受け、保持する。
【０１８３】
次に、図２０のブロック図を用いて、本第３実施形態に係る物体追跡／抽出装置の第５の構成例を説明する。
【０１８４】
抽出フレーム出力制御部１４０は、画像１４０１と抽出するフレームの順序１４０５を入力とし、抽出フレーム１４０２を出力する。フレーム順序制御部１４１は、ユーザが与えたフレーム順序に関する情報１４０５を入力とし、フレーム順序１４０６を出力する。物体追跡・抽出装置１４２は、動画像信号から目的とする物体の抽出／追跡を行なう物体追跡／抽出方法及び装置であり、抽出フレーム１４０２を入力とし、追跡・抽出結果１４０３を追跡・抽出結果出力制御部１４３に出力する。追跡・抽出結果出力制御部１４３は、追跡・抽出結果１４０３と、フレーム順序１４０６を入力とし、フレーム順序を画像１４０１の順序に並び変えて出力する。
【０１８５】
フレームの順序は、ユーザが与えても良いし、物体の動きに応じて適応的に決定しても良い。物体の動きが検出しやすいフレーム間隔が決定され、物体が抽出される。すなわち、参照フレームと物体抽出対象となる現フレームとの間のフレーム間隔が少なくとも２フレーム以上となるように入力フレーム順とは異なる順序で物体抽出処理が行われるようにフレーム順の制御が行われる。これにより、入力フレーム順にフレーム間予測による形状予測や、ＯＲＡＮＤ演算を行う場合に比べ、予測精度を向上でき、結果的に抽出精度を高めることが可能となる。ＯＲＡＮＤ法の場合には適切な参照フレームを選択することによって抽出精度を高めることが可能となるため、ブロックマッチングなどによるフレーム間予測による形状予測法について特に効果がある。
【０１８６】
すなわち、フレームの間隔によっては動きが小さ過ぎたり、複雑過ぎて、フレーム間予測による形状予測手法では対応できないことがある。従って、例えば形状予測の誤差が閾値以下にならない場合は、予測に用いる抽出済みフレームとの間隔をあけることにより、予測精度が上がり、結果的に抽出精度が向上する。また、背景に動きがある場合は、参照フレーム候補は抽出フレームとの背景動きを求め補償するが、背景の動きがフレームの間隔によっては小さ過ぎたり複雑過ぎたりして、背景動き補償が精度良くできない場合がある。この場合もフレーム間隔をあけることによって動き補償精度を上げることができる。このようにして抽出フレームの順序を適応的に制御すれば、より確実に物体の形状を抽出することが可能になる。
【０１８７】
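Next, a sixth configuration example of the object tracking/extraction apparatus according to the third embodiment will be described with reference to the block diagram of FIG. 21.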
【０１８８】
抽出フレーム出力制御部１５０は、画像１５０１と抽出するフレームの順序１５０５を入力とし、抽出フレーム１５０２を出力する。フレーム順序制御部１５１は、ユーザが与えたフレーム順序に関する情報１５０５を入力とし、フレーム順序１５０６を出力する。すなわち、フレーム順序制御部１５１は、フレーム間隔が与えられ、フレームの抽出順序を決定する。複数の物体追跡／抽出装置１５２は、動画像信号から目的とする物体の抽出／追跡を行なう物体追跡／抽出方法及び装置であり、フレーム順序１５０６にしたがって抽出フレーム１５０２の入力が制御され、追跡・抽出結果１５０３を出力する。追跡・抽出結果出力制御部１５３は、追跡・抽出結果１５０３と、フレーム順序１５０６を入力とし、それを画像１５０１の順序に並び変えて出力する。
【０１８９】
飛ばされた間のフレームは、既に抽出されたフレームから内挿しても良いし、参照フレーム候補の選び方を変えて同じアルゴリズムで抽出しても良い。
【０１９０】
ここで、図２５を用いて、図２１の物体の追跡／抽出装置の処理の例について説明する。
【０１９１】
図２５で、斜線で示すフレームは２フレーム間隔開けて先に抽出するフレームである。飛ばされたフレームは第２の物体追跡・抽出装置によって抽出される。図２５のように両脇のフレームが抽出された後に、両脇のフレームの抽出結果から内挿して物体形状を求めてもよい。また、閾値などのパラメータを変えたり、これら両脇のフレームを参照フレーム候補に加えて、両脇のフレームと同じ方法で抽出してもよい。
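The frame-order control described for FIG. 21 and FIG. 25 can be illustrated with a small sketch. It assumes that "two-frame intervals" means every third frame is extracted first (the `step=3` default encodes that assumption), and the function names `extraction_order` and `restore_input_order` are hypothetical, not taken from the patent.

```python
def extraction_order(num_frames, step=3):
    """Return frame indices in processing order: key frames first, skipped frames after."""
    key_frames = list(range(0, num_frames, step))
    skipped = [i for i in range(num_frames) if i % step != 0]
    return key_frames + skipped


def restore_input_order(order, results):
    """Rearrange results given in processing order back into input-frame order."""
    paired = sorted(zip(order, results), key=lambda pair: pair[0])
    return [result for _, result in paired]


order = extraction_order(7)                  # [0, 3, 6, 1, 2, 4, 5]
results = [f"shape{i}" for i in order]       # stand-in for per-frame extraction results
print(restore_input_order(order, results))   # results back in input order
```

The skipped frames are processed after both of their neighbours, so they could either be interpolated from the neighbouring results or extracted with those neighbours added to the reference frame candidates, as the text describes.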
【０１９２】
次に、図２４のブロック図を用いて、物体追跡・抽出装置の他の構成例を説明する。
【０１９３】
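The switch unit SW182 receives the background information 1805 specified by the user and controls the input of the extraction frame 1801 to the background motion correction unit 183. The background motion correction unit 183 receives the extraction frame 1801 and the frame 1811 from which the object has already been extracted, and outputs the frame 1806 in which the background motion has been corrected. The background feature detection unit 184 receives the extraction frame 1801 and the background-motion-corrected frame 1806, and outputs the background feature 1807. The switch unit SW187 receives the background feature 1807 and controls the input of the object tracking/extraction result 1804 to the reference frame selection unit 188. The figure setting unit 180 receives the extraction frame 1801, the frame 1811 from which the object has already been extracted, and the initial figure 1802 set by the user, and outputs the extraction frame 1803 in which the figure has been set. The second object tracking/extraction unit 185 receives the extraction frame 1803 in which the figure has been set and the frame 1811 from which the object has already been extracted, and outputs the object tracking/extraction result 1804. The feature detection unit 185 receives the object tracking/extraction result 1804 and outputs the feature 1808. The switch unit SW186 receives the feature 1808 and controls the input of the object tracking/extraction result 1804 to the reference frame selection unit. The reference frame selection unit 188 receives the object tracking/extraction result 1804 and the frame 1811 from which the object has already been extracted, and outputs at least two reference frames 1809.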
【０１９４】
第一の物体追跡・抽出部１８９は、少なくとも２つの参照フレーム１８０９と、抽出フレーム１８０１を入力とし、物体の追跡・抽出結果１８１０をメモリ１９０に出力する。メモリ１９０は、物体の追跡・抽出結果１８１０を保持する。
【０１９５】
処理の流れは以下のようになる。
【０１９６】
ユーザが初期フレームにおいて抽出したい物体を大まかに囲む。以降のフレームの長方形は既に抽出された物体を囲む長方形を上下左右に数画素広げて設定する。この長方形をブロックに分割し、抽出済みのフレームとマッチングを取って対応する位置に抽出済み物体の形状を張り付ける。この処理によって得られた物体形状（予測物体形状）が大まかな物体を表す。予測の精度がある閾値以下にならない場合、別のフレームから予測を直してより予測精度を上げるように処理しても良い。
【０１９７】
予測精度が良い場合、この予測形状全部又は一部をそのまま抽出結果として出力する。この方法は、物体を追跡しつつ、物体も抽出できる。
【０１９８】
物体追跡及び物体形状予測の時に行なうブロック化は、長方形を固定ブロックサイズで分割しても良いし、マッチング閾値によって階層的ブロックマッチングによって行なっても良い。フレームを固定のサイズで分割し、物体を含むブロックだけを用いても良い。
【０１９９】
予測が悪い場合を考えて、物体予測形状を数画素分拡張して、予測エラーによる凹凸や穴が修正される。この方法で全ての参照フレーム候補に予測物体形状が設定される。物体が抽出される度にその物体を使って先のフレームの物体追跡図形が求め直されそれにより抽出エラーが防がれる。この追跡図形が物体を囲むように設定された追跡図形とする。
【０２００】
以下、抽出フレームより前のフレームについては物体が既に抽出されており、先のフレームについては物体が抽出されていないものとする。
【０２０１】
参照フレームの候補は、一定間隔おきに抽出するフレームに対して時間的に一定間隔毎にずれた前後５フレームとする。参照フレームの候補は、例えば、初期フレームと現フレームより前３フレーム、現フレームより後１フレーム、の合計５フレームのように限定する。ただし、前フレームが３フレームない場合はその分後のフレームの候補を増やし、後フレームが１フレームない場合はその分前４フレームを候補とする。
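The candidate rule above (the initial frame, three preceding frames, and one following frame, with shortages on one side made up on the other) might be sketched as follows. This is only an illustration under stated assumptions: candidates are taken at a spacing of one frame, the initial frame has index 0, and the name `reference_candidates` and its parameters are hypothetical.

```python
def reference_candidates(current, num_frames, initial=0, before=3, after=1):
    """Pick reference frame candidates around `current` (assumed spacing: 1 frame)."""
    # Frames before and after the current frame, nearest first, excluding the initial frame.
    prev = [i for i in range(current - 1, -1, -1) if i != initial]
    nxt = [i for i in range(current + 1, num_frames) if i != initial]

    take_prev = min(before, len(prev))
    take_next = min(after, len(nxt))
    short_prev = before - take_prev   # preceding frames that are missing
    short_next = after - take_next    # following frames that are missing

    # A shortage on one side is compensated by extra frames on the other side.
    take_next = min(len(nxt), take_next + short_prev)
    take_prev = min(len(prev), take_prev + short_next)

    chosen = prev[:take_prev] + nxt[:take_next]
    if initial != current:
        chosen.append(initial)        # the initial frame is always a candidate
    return sorted(chosen)


print(reference_candidates(10, 20))   # [0, 7, 8, 9, 11]
print(reference_candidates(2, 20))    # [0, 1, 3, 4, 5]
print(reference_candidates(19, 20))   # [0, 15, 16, 17, 18]
```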
【０２０２】
参照フレーム候補の物体の周辺に適当な領域が設定され、この領域と現フレームとの背景の動きが検出され、これにより参照フレームの図形内の背景が削除される。背景の動きを検出する方法として、物体を除いた全領域で現フレームに対してマッチングを取り、マッチング誤差が最小となる動きベクトルが背景の動きと判定される。
【０２０３】
背景動き削除時の動きベクトル検出誤差が大きい参照フレームは候補から外すことにより、背景動き削除が適当でない場合の抽出エラーを防ぐことができる。また、参照フレーム候補が減った場合、新たに参照フレーム候補を選び直してもよい。新たに付け加えた参照フレーム候補の図形設定や背景動き削除が行なわれていない場合は、新たに図形設定や背景動き削除を行なう必要がある。
【０２０４】
予め背景の動きがないと分かる場合は、この処理は行なわない。
【０２０５】
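Provisional reference frames are set from among the reference frame candidates, and from these combinations of frames a set of reference frames that satisfies equation (1) or equation (2) of the second embodiment is selected. If no combination of the reference frame candidates satisfies equation (1) or equation (2), it is preferable to select the frames for which the number of pixels in O_i ∩ O_j is smallest.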
【０２０６】
また、背景動き削除時の動きベクトル検出誤差がなるべく小さいフレームを選ぶように、参照フレーム候補の組合せを考慮するほうがよい。具体的には、式（１）又は式（２）による条件が同じ参照フレームセットがある場合、背景動き削除時の動きベクトル検出誤差が小さい方のフレームを選ぶなどの方法がある。背景の動きがない場合は、フレーム間差分が十分検出できるフレームを優先的に選ぶようにできる。
【０２０７】
また、物体予測の精度がよく、物体の一部をそのまま出力する場合、物体予測結果を抽出結果としない領域のみを対象に、式（１）又は式（２）による条件を満たすフレームを選ぶ。
【０２０８】
以下、２参照フレームが選択されたときの処理を説明する。
【０２０９】
参照フレームが選択されると、抽出フレームとのフレーム間差分を求め、設定された図形内のフレーム間差分に注目する。
【０２１０】
設定された閾値でフレーム間差分を２値化する。２値化に用いる閾値は画像に対して一定でもよいし、背景動き補償の精度に応じてフレーム毎に変えても良い。制御例としては、背景動き補償の精度が悪ければ、背景に余分な差分が多く発生しているので、２値化の閾値を大きくする例などがある。また、物体の部分的な輝度勾配やテクスチャ、エッジ強度に応じて変えても良い。この制御例として、輝度勾配が少ない領域や、エッジ強度が小さい領域のように比較的平坦な領域は２値化の閾値を小さくする。更に、ユーザが物体の性質から閾値を与えても良い。
【０２１１】
物体追跡図形の外側の画素について、隣接する背景領域の差分値を持つ画素を背景画素と決定する。また、同時に物体追跡図形の内側の画素についても、隣接する背景領域の差分値をもたない画素を背景画素でない、と決定する。
【０２１２】
フレーム間差分は、物体の静止領域では検出できない。従って、予測に用いたフレームとのフレーム間差分がゼロで、かつ予測に用いたフレームでは物体内部の画素である場合は、静止領域画素として背景画素に加えない。
【０２１３】
この背景画素は、現フレームと一つの参照フレームとの共通の背景領域となる。この時、ノイズの影響で背景領域とそれ以外の部分の境界が不自然であることがあるので、画像信号に境界を滑らかにするフィルターや、余分なノイズ領域を削除するフィルターをかけてもよい。
【０２１４】
各々の参照フレームに対して共通の背景領域が求まると、二つの共通背景領域に含まれない領域が検出され、それが物体領域として抽出される。先に予測した物体の形状を用いない部分に対しては抽出結果が出力され、物体全体の形状が抽出される。共通の背景から求めた形状を用いる部分と先に予測した物体形状を用いる部分の整合性が取れない場合、最後にフィルターをかけて得た出力結果は見ために良いものにできる。
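The final combination step, in which pixels that belong to neither of the two common background regions are taken as the object, can be sketched as below. The sketch makes several simplifying assumptions: frames are grayscale nested lists, one global threshold is used, and the growing of background pixels from the tracking figure and the handling of static object regions are omitted. The function names are hypothetical.

```python
def changed(current, reference, threshold):
    """True where the inter-frame difference exceeds the threshold (not common background)."""
    return [[abs(c - r) > threshold for c, r in zip(crow, rrow)]
            for crow, rrow in zip(current, reference)]


def extract_object(current, ref_a, ref_b, threshold):
    """Object = pixels outside the common background of BOTH reference frames."""
    diff_a = changed(current, ref_a, threshold)
    diff_b = changed(current, ref_b, threshold)
    return [[a and b for a, b in zip(arow, brow)]
            for arow, brow in zip(diff_a, diff_b)]


# One-row example: background value 0, object value 9.
current = [[0, 0, 9, 9, 0, 0]]   # object at columns 2-3
ref_a = [[9, 9, 0, 0, 0, 0]]     # object at columns 0-1
ref_b = [[0, 0, 0, 0, 9, 9]]     # object at columns 4-5
print(extract_object(current, ref_a, ref_b, threshold=4))
# [[False, False, True, True, False, False]]
```

The example only works because the object positions in the two reference frames do not overlap the object in the current frame, which is the kind of condition the reference frame selection is meant to secure.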
【０２１５】
Finally, the extraction order is rearranged into the order of the input frames, and the extracted object regions are output.
【０２１６】
本発明のような物体の形状を抽出する方法及び装置は、現在標準化が固まりつつあるＭＰＥＧ−４のオブジェクト符号化の入力手段として用いることができる。このＭＰＥＧ−４と物体抽出の応用例として、物体形状をウインドウ形式とする表示システムがある。このような表示システムは、多地点会議システムに有効である。限られた大きさのディスプレイにテキスト資料と、各地点で会議に参加する人物を四角いウインドウで表示するよりも、図２６のように人物は人物の形状で表示することにより、省スペース化できる。ＭＰＥＧ−４の機能を使えば、発言中の人物だけを大きくしたり、発言していない人物を半透明にしたりでき、システムの使用感がよくなる。
【０２１７】
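As described above, according to the third embodiment, selecting the object extraction method and apparatus according to the nature of the image eliminates unnecessary processing and yields stable extraction accuracy. In addition, removing the constraint of temporal order makes it possible to obtain sufficient extraction accuracy regardless of the motion of the object.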
【０２１８】
また、本第３実施形態は、第１実施形態及び第２実施形態の性能を改善するものであり、第１実施形態及び第２実施形態の各構成と第３実施形態で説明した構成とを適宜組み合わせて使用することもできる。
【０２１９】
図２７には、本発明の第４実施形態に係る物体抽出装置の第１の構成例が示されている。
【０２２０】
外部のカメラで撮像されたり、ビデオテープ、ビデオディスクなどの蓄積媒体から読み出されたりした後に、本物体抽出装置に入力されるテクスチャ画像２２１は、記録装置２２２、スイッチ部２２３、動き補償による物体抽出回路２２４に入力される。記録装置２２２は、入力されたテクスチャ画像２２１を保持するものである。例えば、パソコンなどで用いられているハードディスク、光磁気ディスクなどである。記録装置２２２は後にテクスチャ画像２２１を再び用いるために必要であり、テクスチャ画像２２１が外部の蓄積媒体に記録されていた画像である場合は、記録装置２２２を別に用意する必要はなく、その蓄積媒体が記録装置２２２として用いられる。この際は、記録装置２２２にテクスチャ画像２２１を入力しなおす必要はない。テクスチャ画像は、例えば、各画素の輝度（Ｙ）を０〜２５５の値で表した画素をラスタ順序（画像の左上の画素から右方向へ、上のラインから下のラインへの順序）で並べて形成され、一般に画像信号と呼ばれている。後に述べるシェイプ画像と区別するために、ここではテクスチャ画像と呼ぶことにする。テクスチャ画像としては、輝度以外にも、色差（Ｕ，Ｖなど）、色（Ｒ，Ｇ，Ｂなど）が用いられても良い。
【０２２１】
一方、最初のフレームにおいて、操作者が抽出したい物体を別途抽出しておいたシェイプ画像２２５が、動き補償による物体抽出回路２２４に入力される。シェイプ画像は、例えば、物体に属する画素の画素値を“２５５”、それ以外の画素の画素値を“０”で表した画素をテクスチャ画像と同様にラスタ順序で並べて生成される。
【０２２２】
Here, an example of generating the shape image 225 of the first frame will be described in detail with reference to FIG. 34 and related figures.
【０２２３】
図３４では、省略しているが、背景や前景にも図柄があり、そのうちで、家の形をした物体２２６を抽出したいとする。操作者は、モニタに表示された画像２２７に対して、物体２２６の輪郭をマウスやペンでなぞる。その輪郭線の内側の画素に“２５５”、外側の画素に“０”を代入して得た画像をシェイプ画像とする。操作者が細心の注意をはらって輪郭線を描けば、このシェイプ画像の精度は高いものになるが、ある程度精度が低い場合でも、以下の方法を用いれば、精度を上げることができる。
【０２２４】
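FIG. 35 shows the line 228 drawn by the operator and the contour line 229 of the object 226. At this stage the correct position of the contour line 229 has of course not yet been extracted, but the contour line 229 is shown in order to indicate its positional relationship to the line 228.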
【０２２５】
まず、輪郭線２２８を含むようにブロックが設定される。具体的には、画面をラスタ順でスキャンし、輪郭線２２８があった時、つまり、輪郭線２２８のシェイプ画像において、隣接する画素値に差があった時、その画素を中心にして所定のサイズのブロックを設ける。この際、既に設定したブロックと今回のブロックが重なる場合には、今回のブロック設定は行わずに、スキャンを進めるようにすると、図３６のように互いに重なりがなく、なおかつ接するようにブロックが設定できる。しかし、これだけでは、部分２３０，２３１，２３２がブロックに入っていない。そこで、もう一度スキャンを行い、ブロックに含まれない輪郭線があった時、やはり、その画素を中心にしてブロックが設けられる。但し、２度目のスキャンの時には、今回のブロックが既に設定したブロックと重なる部分があっても、中心とする画素が既に設定したブロックに含まれない限り、今回のブロックの設定を行う。図３７において斜線で示すブロック２３３，２３４，２３５，２３６が２スキャン目で設定されたブロックである。ブロックサイズは、固定にしてもよいが、輪郭線２２８によって囲まれる画素数が多い場合には大きく、その画素数が少ない場合には小さく、輪郭線２２８の凸凹が少ない場合には大きく、凸凹が多い場合には小さく、あるいは、画像の図柄が平坦な場合には大きく、図柄が細かい場合には小さく、設定してもよい。
【０２２６】
画面の端では、普通にブロックを設定すると画面からはみだしてしまうことがある。そういう場合は、そのブロックだけ、画面からはみ出さないようにブロックの端を切って長方形のブロックにする。この場合は相似ブロックも長方形とする。
【０２２７】
以上がシェイプ画像におけるブロックの設定方法である。
【０２２８】
次に、ブロック毎に、その相似なブロックをテクスチャ画像を用いて探索する。ここで、相似とは、ブロックサイズが異なるブロック同士で、一方のブロックサイズを他方と同じになるように、拡大あるいは縮小した時に、対応する画素の画素値がほぼ等しくなることをいう。例えば、図３８のブロック２３７に対しては、ブロック２３８が、テクスチャ画像の図柄が相似になる。同様に、ブロック２３９に対しては、ブロック２４０が、ブロック２４１に対してはブロック２４２が、相似である。本実施形態では、相似ブロックは、輪郭線上に設定したブロックよりも大きくする。また、相似ブロックは、画面全体を探索するのではなく、例えば、図３９に示す様に、ブロック２４３の近くのブロック２４４，２４５，２４６，２４７を四隅とするある一定の範囲内で探索すれば十分である。図３９は各ブロックの中心を起点におき、ブロック２４３の起点を用いて、ブロック２４４，２４５，２４６，２４７の起点を所定の画素幅だけ、上下方向と左右方向に動かした場合である。起点をブロックの左上角においた場合を図４０に示す。
【０２２９】
探索範囲内でも、一部が画面からはみ出す相似ブロックは、探索の対象から外すのであるが、ブロックが画面の端にあると、探索範囲にある全ての相似ブロックが探索の対象から外れてしまうことがある。そういう場合には、画面の端のブロックについては、探索範囲を画面の内側にずらして対応する。
【０２３０】
相似ブロックの探索は、多段階探索を行うと、演算量を少なくできる。多段階探索とは、例えば１画素や半画素ずつ起点をずらしながら、探索範囲全体を探索するのではなく、初めに、とびとびの位置の起点で誤差を調べる。次に、その中で誤差が小さかった起点の周囲だけを少し細かく起点を動かして誤差を調べるということを繰り返しながら、相似ブロックの位置をしぼりこんでいく方法である。
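The multi-stage search can be sketched as a coarse-to-fine loop over candidate origins. This is a minimal illustration, not the patent's implementation: `error` stands for any function returning the matching error of a candidate origin, and the halving of the step at each stage is an assumption.

```python
def multistage_search(error, start, radius, step):
    """Coarse-to-fine search for the origin (x, y) that minimises error(x, y)."""
    best = start
    while step >= 1:
        cx, cy = best
        # Examine origins on a sparse grid around the current best position.
        candidates = [(cx + dx, cy + dy)
                      for dy in range(-radius, radius + 1, step)
                      for dx in range(-radius, radius + 1, step)]
        best = min(candidates, key=lambda p: error(*p))
        radius = step   # the next stage looks only around the current best
        step //= 2      # with a finer grid
    return best


# Toy error surface with its minimum at (3, -5).
def toy_error(x, y):
    return (x - 3) ** 2 + (y + 5) ** 2


print(multistage_search(toy_error, start=(0, 0), radius=8, step=4))  # (3, -5)
```

With `radius=8` and `step=4` this evaluates 25 candidates per stage over three stages, instead of the 289 positions of an exhaustive one-pixel search over the same range. On an error surface with several local minima the coarse stage can lock onto the wrong one, which is the usual trade-off of such searches.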
【０２３１】
相似ブロックの探索において、相似ブロックの縮小処理を毎回行うと、処理時間が多く必要である。そこで、予め画像全体を縮小したものを生成し、別のメモリに保持しておけば、相似ブロックに対応する部分のデータをそのメモリから読み出すだけで済む。
【０２３２】
図３８では３つのブロック２３７，２３９，２４１についてだけ、相似ブロックを示しているが、実際には、図３７で示した全てのブロックに対して相似ブロックを求める。以上が相似ブロックの探索方法である。相似ブロックの探索はシェイプ画像ではなく、テクスチャ画像を用いることが肝要である。画面内で相似ブロックをブロックに写像する一次変換を考えた時に、テクスチャ画像の輪郭線は、この一次変換において不変である。
【０２３３】
次に、各ブロックとその相似ブロックの位置関係を用いて、シェイプ画像の輪郭がテクスチャ画像の輪郭に合うように補正する方法を説明する。
【０２３４】
図４１において、輪郭線２２８が操作者によって描かれた線である。この線が、正しい輪郭線２２９に近づけばよい。そのために、シェイプ画像の相似ブロック２３８の部分を読み出し、それをブロック２３７と同じサイズに縮小したもので、シェイプ画像のブロック２３７の部分を置き換える。この操作には、輪郭線を、相似ブロックからブロックへの一次変換の不動点を含む不変集合に近づける性質があるので、輪郭線２２８は輪郭線２２９に近づく。相似ブロックの一辺がブロックの一辺の２倍の長さの時、１回の置き換えで、輪郭線２２８と正しい輪郭線２２９の隔たりは概して、１／２になる。この置き換えを全てのブロックに対して１回行った結果が図４２の輪郭線２４８である。このブロックの置き換えを繰り返せば輪郭線２４８は、正しい輪郭線にさらに近づき、やがて、図４３に示すように、正しい輪郭線に一致する。実際には、２本の輪郭線のずれが画素間距離よりも小さい状態は意味がないので適当な回数で置き換えを終了する。本手法は、シェイプ画像で設定したＮ×Ｎ画素のブロックにテクスチャ画像での輪郭線が含まれる時に有効なのであるが、その場合、シェイプ画像の輪郭線とテクスチャ画像の輪郭線の距離は最大でおよそＮ／２である。相似ブロックの一辺の長さが、ブロックの一辺の長さのＡ倍とした時、１回の置き換えにつき、２本の輪郭線の距離は１／Ａになるのであるから、この距離が１画素よりも短くなることを式で表すと、置き換え回数をｘとして、
（Ｎ／２）×（１／Ａ）＾ｘ＜１
となる。ここで＾はべき乗を表し、上式では（１／Ａ）をｘ回乗ずるという意味である。上式から、
ｘ＞ｌｏｇ（２／Ｎ）／ｌｏｇ（１／Ａ）
となる。例えばＮ＝８，Ａ＝２の時は、
ｘ＞２
であり、置き換え回数は３回で十分である。
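The inequality (N/2) × (1/A)^x < 1 can also be checked numerically. The helper below is a hypothetical illustration that simply divides the initial distance N/2 by A until it drops below one pixel.

```python
def replacements_needed(n, a):
    """Smallest integer x such that (n / 2) * (1 / a) ** x < 1, for a > 1."""
    if a <= 1:
        raise ValueError("the similar block must be larger than the block (a > 1)")
    distance = n / 2   # worst-case initial contour distance in pixels
    x = 0
    while distance >= 1:
        distance /= a  # one replacement shrinks the distance by a factor of a
        x += 1
    return x


print(replacements_needed(8, 2))   # 3, matching the N = 8, A = 2 example in the text
```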
【０２３５】
この物体抽出装置のブロック図を図３０に示す。まず、操作者によって入力されるシェイプ画像２４９が、シェイプメモリ２５０に記録される。シェイプメモリ２５０においては、図３６，３７を用いて前述した様にブロックが設定される。一方、テクスチャ画像２５１は、テクスチャメモリ２５２に記録される。テクスチャメモリ２５２からは、シェイプメモリ２５０から送られる、ブロックの位置情報２５３を参照して、ブロックのテクスチャ画像２５４が探索回路２５５に送られる。同時に、図３９や図４０を用いて説明した様に相似ブロックの候補もテクスチャメモリ２５２から、探索回路２５５に送られる。探索回路２５５では、相似ブロックの各候補を縮小した後に、ブロックとの誤差を計算し、その誤差が最小となったものを相似ブロックとして決定する。誤差としては輝度値の差分の絶対値和や、それに色差の差分の絶対値和を加えたものなどが考えられる。輝度だけに比べて、色差も用いると、演算量は多くなるが、物体の輪郭において輝度の段差が小さくても、色差の段差が大きい場合に正しく相似ブロックを決定できるので精度が向上する。相似ブロックの位置の情報２５６は縮小変換回路２５７に送られる。縮小変換回路２５７には、シェイプメモリ２５０から、相似ブロックのシェイプ画像２５８も送られる。縮小変換回路２５７では、相似ブロックのシェイプ画像が縮小され、その縮小された相似ブロックは輪郭線が補正されたシェイプ画像２５９として、シェイプメモリ２５０に返され、対応するブロックのシェイプ画像が上書きされる。このシェイプメモリ２５０の置き換えが所定の回数に達した時は、補正されたシェイプ画像２５９は外部に出力される。シェイプメモリ２５０の書き換えは、ブロック毎に逐次上書きしても良いし、メモリを２画面分用意して、一方から他方へ、初めに画面全体のシェイプ画像をコピーした後に、輪郭部分のブロックは相似ブロックを縮小したもので置き換えるようにしても良い。
【０２３６】
この物体抽出方法を図４８のフローチャートを参照して説明する。
【０２３７】
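(Object extraction method using intra-frame reduced block matching)
In step S31, blocks are set on the contour portion of the shape data. In step S32, a similar block whose image-data pattern is similar to that of the current block is found within the same image data. In step S33, the shape data of the current block is replaced with data obtained by reducing the shape data of the similar block.
【０２３８】
In step S34, if the number of processed blocks has reached a predetermined number, the process proceeds to step S35; otherwise the next block becomes the processing target and the process returns to step S32.
【０２３９】
In step S35, if the number of repetitions of the replacement has reached a predetermined number, the process proceeds to step S36; otherwise the replaced shape data becomes the processing target and the process returns to step S31. In step S36, the shape data that has undergone the repeated replacement is output as the object region.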
【０２４０】
この方法はブロックのエッジと相似ブロックのエッジが合った場合に効果がある。従って、ブロックに複数のエッジがある場合には、エッジが正しく合わないことがあるので、そういうブロックについては置き換えをせずに入力されたままのエッジを保持する。具体的には、ブロックのシェイプ画像を左右方向と上下方向に各ラインをスキャンし、１つのラインで“０”から“２５５”へ、あるいは“２５５”から“０”に変化する点が２つ以上あるラインが所定の数以上あるブロックは置き換えをしない。また、物体と背景の境界であっても部分によっては、輝度などが平坦な場合がある。このような場合もエッジ補正の効果が期待できないのでテクスチャ画像の分散が所定値以下のブロックについても置き換えをせずに、入力されたままのエッジを保持する。
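The test for blocks containing several edges can be sketched as follows. Each row and column of the block's shape image is scanned, lines with two or more 0/255 transitions are counted, and the block is excluded from replacement when that count reaches a limit. The names and the `min_lines` parameter are hypothetical.

```python
def count_transitions(line):
    """Number of points where the shape value switches between 0 and 255."""
    return sum(1 for p, q in zip(line, line[1:]) if p != q)


def should_skip_replacement(block, min_lines):
    """True if at least `min_lines` rows or columns contain two or more transitions."""
    rows = [list(r) for r in block]
    cols = [list(c) for c in zip(*block)]
    multi_edge_lines = sum(1 for line in rows + cols if count_transitions(line) >= 2)
    return multi_edge_lines >= min_lines


single_edge = [[0, 0, 255, 255]] * 4   # one vertical edge: safe to replace
thin_stripe = [[0, 255, 0, 0]] * 4     # two edges in every row: keep the input edge
print(should_skip_replacement(single_edge, min_lines=2))  # False
print(should_skip_replacement(thin_stripe, min_lines=2))  # True
```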
【０２４１】
相似ブロックの誤差が所定値よりも小さくならない場合、縮小をあきらめ、同じサイズで相似ブロックを求めても良い。この際、自分とはなるべく重ならないように相似ブロックを選ぶ。縮小を行わないブロックだけでは、エッジが補正される効果はないが、縮小を行うことによってエッジが補正されたブロックから、その補正されたエッジをコピーすることで、縮小を行わないブロックについても、間接的にエッジが補正される。
【０２４２】
図４８に示したフローチャートは相似ブロックを見つけた直後にシェイプ画像の置き換えを行う例であったが、全ブロックの相似ブロックの位置情報を保持するようにすることで、初めに、相似ブロックの探索を全ブロックについて行い、次に、シェイプ画像の置き換えを全ブロックについて行う方法を図５０のフローチャートを参照して説明する。
【０２４３】
この例では、１回の相似ブロックの探索に対してシェイプ画像の置き換えが複数回反復できる。
【０２４４】
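In step S41, blocks are set on the contour portion of the shape data. In step S42, a similar block whose image-data pattern is similar to that of the current block is found within the same image data. In step S43, when the search for a similar block has been completed for all blocks, that is, when the number of processed blocks has reached a predetermined number, the process proceeds to step S44; otherwise it returns to step S42. In step S44, the shape data of the current block is replaced with the reduced shape data of its similar block.
【０２４５】
In step S45, when the replacement has been completed for all blocks, that is, when the number of processed blocks has reached a predetermined number, the process proceeds to step S46; otherwise it returns to step S44. In step S46, if the number of times all blocks have been replaced has reached a predetermined number, the process proceeds to step S47; otherwise it returns to step S44. In step S47, the shape data that has undergone the repeated replacement conversion is output as the object region.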
【０２４６】
次にエッジ補正の精度を上げることができるブロックの設定方法を説明する。
【０２４７】
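With the method described above, in which blocks are set around the contour line of the shape image, part of the correct contour line 301 may fall outside the blocks, as shown in FIG. 51(a). Here the contour line 302 of the shape image is drawn as a thick line. Suppose that the lower-right side of the contour line is the object and the upper-left side is the background. Then the portion 303, which is actually background, has been wrongly set as object, yet it cannot be corrected because it is not contained in any block. When there is such a gap between the blocks and the correct contour line, the contour is not corrected properly.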
【０２４８】
ブロックと正しい輪郭線の隙間を小さくするには、図５１（ｂ）に示したようにブロックをある程度重なり合わせる方法がある。こうすると、ブロックの数が増えるので、演算量は増加するが隙間３０４は小さくなる。従って抽出の精度は向上する。しかし、この例では、まだ隙間は完全にはなくならない。
【０２４９】
隙間を小さくするには、図５１（ｃ）に示したようにブロックサイズを大きくすることも有効である。この例では、前述したブロックの重ねあわせを併用した。これにより、この例では隙間が完全になくなる。
【０２５０】
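Thus, enlarging the block size is effective for widening the range over which the contour line can be corrected. However, if the block size is too large, the shape of the contour line contained in a block becomes complicated and a similar block becomes hard to find. FIG. 52 shows such an example.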
【０２５１】
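In FIG. 52(a), the hatched portion 305 represents the object region and the white portion 306 represents the background region. The contour line 307 of the given shape image is drawn as a black line. As shown, the contour line 307 of the shape image is far from the correct contour line, and the correct contour line has irregularities. FIG. 52(b) shows the result of arranging blocks by a method different from the one described earlier. Here, the image is first divided into rectangular blocks that neither overlap nor leave gaps. The variance of the texture image is calculated for each block, and blocks whose variance is smaller than a predetermined value are discarded. Consequently, only blocks whose variance is at or above the predetermined value remain in FIG. 52(b). A similar block is sought for each of these blocks, but, for example, no pattern corresponding to block 308 enlarged twofold vertically and horizontally exists near block 308, and the same holds for many of the other blocks. Therefore, although the portion with the minimum error is selected as the similar block, repeating the replacement conversion of the shape image using that positional relationship does not bring the contour into agreement with the correct contour line, as shown in FIG. 52(c). Nevertheless, compared with the contour line 307 of the shape image in FIG. 52(a), the contour line 309 of the shape image after edge correction in FIG. 52(c) does reflect the rough irregularities of the contour in the texture image (to the extent that there are peaks on the left and right with a valley between them). If the block size were made smaller in this example, even this rough correction would no longer be obtained.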
【０２５２】
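Thus, when the block size is enlarged to widen the correction range, the shape of the contour line contained in a block becomes complicated and a similar block may become hard to find. As a result, the edge is corrected only roughly. In such a case, the correction accuracy improves if edge correction is first performed with a large block size and then performed again on the result with a smaller block size. FIG. 52(d) shows the result of correcting FIG. 52(c) again with the block size halved vertically and horizontally, and then once more with the block size reduced to 1/4. Repeating the correction while gradually reducing the block size in this way improves the correction accuracy.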
【０２５３】
ブロックサイズを次第に小さくする方法を図５３のフローチャートを参照して説明する。
【０２５４】
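In step S51, the block size is set to b = A. In step S52, edge correction similar to that shown in FIG. 48 or FIG. 50 is performed. In step S53, b is examined; when b has become smaller than Z (< A), the process ends. When b is Z or more, the process proceeds to step S54. In step S54, the block size b is halved and the process returns to step S52.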
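The loop of steps S51 to S54 can be traced with a short sketch that lists the block sizes at which an edge-correction pass would run. It follows the step order literally, so one pass is still performed at the first size below Z. The extra stop at size 1 is an added safeguard and an assumption of this sketch, as is the function name.

```python
def block_size_schedule(a, z):
    """Block sizes used by successive edge-correction passes (steps S51 to S54)."""
    if a < 1 or z < 1:
        raise ValueError("block sizes must be positive")
    sizes = []
    b = a                      # S51: start with the large block size A
    while True:
        sizes.append(b)        # S52: one edge-correction pass at size b
        if b < z or b == 1:    # S53: stop once b has fallen below Z (or cannot shrink)
            break
        b //= 2                # S54: halve the block size and repeat
    return sizes


print(block_size_schedule(32, 8))   # [32, 16, 8, 4]
```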
【０２５５】
以上、ブロックサイズを初めは大きめにし、次第に小さくしながら補正を繰り返すことで補正の精度を向上する例を示した。
【０２５６】
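FIG. 54(a) shows an example in which tilting the blocks by 45 degrees makes gaps between the blocks and the correct contour line less likely. As shown, when the contour line is oblique, tilting the blocks lets them cover the correct contour line without enlarging the block size as much as in FIG. 51(c). In this example the correct contour line can also be covered without any overlap between blocks, as shown in FIG. 54(b). Tilting the sides of the blocks in the same direction as the contour line of the shape image thus makes gaps between the blocks and the correct contour line less likely. Specifically, the inclination of the contour line of the alpha image (shape image) is detected; if it is close to horizontal or vertical, the blocks are oriented as in FIG. 51(c), and otherwise they are tilted as in FIG. 54(b). Whether the contour is close to horizontal or vertical is judged by comparison with a threshold.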
【０２５７】
以上が、最初のフレームの物体抽出処理である。これは、必ずしも動画像の最初のフレームだけではなく、静止画像一般に用いることができる手法である。なお、置き換えを１回行ったシェイプ画像に対して、ブロックを設定しなおし、その相似ブロックを求めなおして、２回目の置き換えを行うというように、置き換えの度にブロック設定と相似ブロックの探索を行えば、演算量は増えるが、より補正の効果が得られる。
【０２５８】
また、相似ブロックはブロックのなるべく近い部分から選ばれるのが好ましいので、相似ブロックを探す範囲をブロックサイズによって切り換えるとよい。即ち、ブロックサイズが大きい場合には、相似ブロックを探す範囲を広くし、ブロックサイズが小さい場合には、相似ブロックを探す範囲を狭くする。
【０２５９】
また、本手法では、シェイプデータの置き換えの過程で、シェイプデータに小さい穴や、孤立した小領域が誤差として出現することがある。そこで、ステップＳ３４，Ｓ３５，Ｓ３６，Ｓ４５，Ｓ４６，Ｓ４７，Ｓ５３の前などで、シェイプデータから、小さい穴や、孤立した小領域を除くことにより、補正の精度を向上することができる。小さい穴や、孤立した小領域を除くには、例えば、画像解析ハンドブック（高木、下田監修、東京大学出版会、初版１９９１年１月）５７５〜５７６頁に記載されている膨張と収縮を組み合わせた処理や、６７７頁に記載された多数決フィルタなどを用いる。
【０２６０】
また、ブロックは図４９に示すように、より簡易的に設定しても良い。すなわち、画面を単純にブロック分割し、そのうちブロック２２００など、輪郭線２２８を含むブロックについてのみ、相似ブロックの探索や置き換えの処理を行う。
【０２６１】
また、与えられるテクスチャ画像が予めフラクタル符号化（特公平０８−３２９２５５号公報「画像の領域分割方法及び装置」）によって圧縮されているのであれば、その圧縮データに、各ブロックの相似ブロックの情報が含まれている。従って、輪郭線２２８を含むブロックの相似ブロックとしては、圧縮データを流用すれば、改めて相似ブロックを探索する必要はない。
【０２６２】
図２７に戻り、動画から物体を抽出する物体抽出装置の説明を続ける。
【０２６３】
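The motion-compensated object extraction circuit 224 generates the shape images 260 of the second and subsequent frames from the shape image 225 of the first frame, using motion vectors detected from the texture image 221.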
【０２６４】
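FIG. 29 shows an example of the motion-compensated object extraction circuit 224. The shape image 225 of the first frame is recorded in the shape memory 261. In the shape memory 261, blocks are set over the entire screen, as shown in frame 262 of FIG. 45. Meanwhile, the texture image 221 is sent to the motion estimation circuit 264 and is also recorded in the texture memory 263. The texture image 265 of the preceding frame is sent from the texture memory 263 to the motion estimation circuit 264. For each block of the current frame, the motion estimation circuit 264 finds the reference block with the minimum error within the preceding frame. FIG. 45 shows an example of a block 267 and a reference block 268 selected from the preceding frame 266. Here, if the error is smaller than a predetermined threshold, the reference block is made larger than the block. FIG. 45 also shows an example of a block 269 and a reference block 270 that is twice as large vertically and horizontally.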
【０２６５】
図２９に戻り、参照ブロックの位置の情報２７１は動き補償回路２７２に送られる。動き補償回路２７２には、シェイプメモリ２６１から、参照ブロックのシェイプ画像２７３も送られる。動き補償回路２７２では、参照ブロックの大きさがブロックと同じ場合は、そのまま、参照ブロックの大きさがブロックよりも大きい場合は、参照ブロックのシェイプ画像が縮小され、その参照ブロックのシェイプ画像は現処理フレームのシェイプ画像２６０として出力される。また、次のフレームに備えて、現処理フレームのシェイプ画像２６０はシェイプメモリ２６１に送られて、画面全体のシェイプ画像が上書きされる。
【０２６６】
参照ブロックがブロックよりも大きい場合、先に図４１，４２を用いて説明したのと同様に、輪郭線が正しい位置からずれていた場合に補正する効果がある。従って、与えられる最初のフレームのシェイプ画像に続く、動画シーケンスの全てのフレームにおいて、物体が高い精度で抽出される。従来手法のように、動画シーケンスの最初の方や、物体の動きが小さい時に、精度が悪いという不具合はない。
【０２６７】
フレーム間の動き補償による物体抽出を図４７のフローチャートを参照して説明する。
【０２６８】
ステップＳ２１で現処理フレームがブロックに分割される。ステップＳ２２では現処理ブロックと画像データの図柄が相似であり、かつ、現処理ブロックよりも大きい参照ブロックを各フレームあるいは、既にシェイプデータを求めたフレーム内から見つける。ステップＳ２３では参照ブロックのシェイプデータを切り出して縮小したサブブロックを現処理ブロックに貼り付ける。
【０２６９】
ステップＳ２４では処理済みブロック数が所定の数に達したらステップ２５に進む、そうでない場合は次のブロックに処理対象を進めてステップ２２に戻る。ステップＳ２５では貼り合わされたシェイプデータを物体領域として出力する。
【０２７０】
ここで、各フレームとは、本実施例では最初のフレームであり、予めシェイプ画像が与えられるフレームのことである。また、参照ブロックは必ずしも１フレーム前のフレームでなくても、ここで述べたように、既にシェイプ画像が求まっているフレームならよい。
【０２７１】
以上が動き補償を用いた物体抽出の説明である。物体抽出回路２２４としては、以上で説明した方法の他に、先に出願した、特開平１０−００１８４７「動画像の物体追跡／抽出装置」にあるフレーム間差分画像を用いる方法などもある。
【０２７２】
図２７に戻り、動画から物体を抽出する物体抽出装置の実施例の説明を続ける。
【０２７３】
シェイプ画像２６０はスイッチ部２２３とスイッチ部２８１に送られる。スイッチ部２２３では、シェイプ画像２６０が“０”（背景）の時には、テクスチャ画像２２１が、背景メモリ２７４に送られ、記録される。シェイプ画像２６０が“２５５”（物体）の時には、テクスチャ画像２２１は、背景メモリ２７４には送られない。これをいくつかのフレームに対して行い、そのシェイプ画像２６０がある程度正確であれば、物体を含まない、背景部分だけの画像が、背景メモリ２７４に生成される。
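The role of the switch unit 223 and the background memory 274 can be sketched as a per-pixel accumulation: texture pixels are copied into the memory only where the shape image marks background. The sketch assumes nested-list images and a separate `filled` map recording which pixels have been written. The `overwrite` flag and all names are illustrative assumptions.

```python
def accumulate_background(memory, filled, texture, shape, overwrite=True):
    """Copy background pixels (shape value 0) of one frame into the background memory."""
    for y, shape_row in enumerate(shape):
        for x, s in enumerate(shape_row):
            if s == 0 and (overwrite or not filled[y][x]):
                memory[y][x] = texture[y][x]
                filled[y][x] = True


memory = [[None, None, None, None]]
filled = [[False, False, False, False]]
# Frame 1: object (value 9) covers columns 2-3. Frame 2: it has moved to columns 1-2.
accumulate_background(memory, filled, [[1, 2, 9, 9]], [[0, 0, 255, 255]])
accumulate_background(memory, filled, [[1, 9, 9, 4]], [[0, 255, 255, 0]])
print(memory)   # [[1, 2, None, 4]]
print(filled)   # [[True, True, False, True]]
```

Column 2 was covered by the object in both frames, so it stays unwritten. With more frames and a moving object, such gaps tend to fill in.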
【０２７４】
次に、記録装置２２２から、テクスチャ画像２７５が再度最初のフレームから順に読み出され、あるいは、操作者が指定する物体を抽出したいフレームだけが読み出され差分回路２７６に入力される。同時に、背景メモリ２７４から、背景画像２７７が読み出され、差分回路２７６に入力される。差分回路２７６では、テクスチャ画像２７５と背景画像２７７の、互いに画面内で同じ位置にある画素同士の差分値２７８が求められ、背景画像を用いた物体抽出回路２７９に入力される。物体抽出回路２７９では、シェイプ画像２８０が生成されるのであるが、これは、差分値２７８の絶対値が予め定めるしきい値よりも大きい画素は、物体に属するとして画素値を“２５５”とし、そうでない画素は、背景に属するとして画素値“０”とすることで生成される。テクスチャ画像として、輝度だけでなく、色差や色も用いる場合は、各信号の差分の絶対値の和をしきい値と比較して物体か背景かが決定される。あるいは、輝度や色差毎に別々にしきい値を定めて、輝度、色差のいずれかにおいて、差分の絶対値がそのしきい値よりも大きい場合に物体、そうでない場合に背景とが判定される。このようにして生成されたシェイプ画像２８０がスイッチ部２８１に送られる。また、操作者によって決定される選択信号２８２が外部からスイッチ部２８１に入力され、この選択信号２８２によって、シェイプ画像２６０とシェイプ画像２８０のうちのいずれかが選択され、シェイプ画像２８３として外部に出力される。操作者は、シェイプ画像２６０とシェイプ画像２８０を各々、ディスプレイなどに表示し、正確な方を選択する。あるいは、シェイプ画像２６０が生成された段階でそれを表示し、その精度が満足するものでなかった場合に、シェイプ画像２８０を生成し、シェイプ画像２６０の精度が満足するものであった場合には、シェイプ画像２８０を生成せずに、シェイプ画像２６０をシェイプ画像２８３として外部に出力するようにすれば、処理時間を節約できる。選択は、フレーム毎に行ってもよいし、動画像シーケンス毎に行ってもよい。
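The thresholding performed by the difference circuit 276 and the object extraction circuit 279 can be sketched as follows, using the variant in which the absolute differences of all signals are summed before comparison with a single threshold. Pixels are assumed to be (Y, U, V) tuples in nested lists, and the function name is hypothetical.

```python
def extract_shape(texture, background, threshold):
    """Shape image: 255 where the texture differs from the background image, else 0."""
    shape = []
    for texture_row, background_row in zip(texture, background):
        row = []
        for t, b in zip(texture_row, background_row):
            # Sum of absolute differences over the Y, U and V components.
            diff = sum(abs(tc - bc) for tc, bc in zip(t, b))
            row.append(255 if diff > threshold else 0)
        shape.append(row)
    return shape


texture = [[(100, 128, 128), (30, 120, 130)]]
background = [[(98, 128, 127), (90, 128, 128)]]
print(extract_shape(texture, background, threshold=20))   # [[0, 255]]
```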
【０２７５】
図２７の物体抽出装置に対応する物体抽出方法を図４６のフローチャートを参照して説明する。
【０２７６】
（背景画像を用いる物体抽出方法）
ステップＳ１１では与えられる各フレームにおけるシェイプデータを動き補償することにより各フレームのシェイプデータが生成される。ステップＳ１２ではシェイプデータによって決定される背景領域の画像データが背景画像としてメモリに記憶される。
【０２７７】
ステップＳ１３では処理済みフレーム数が所定の数に達したらステップ１４に進む、そうでない場合は次のフレームに処理対象を進めてステップ１１に戻る。ステップＳ１４では画像データと背景画像との差分の絶対値が大きい画素を物体領域とし、そうでない画素を背景領域とする。
【０２７８】
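In this embodiment, the background moves when, for example, the imaging camera moves. In this case, the motion of the entire background from the preceding frame (the global motion vector) is detected. In the first scan, the background is recorded in the background memory shifted from the preceding frame by the global motion vector; in the second scan, the portion shifted from the preceding frame by the global motion vector is read from the background memory. If the global motion vectors detected in the first scan are recorded in a memory and read out for use in the second scan, the time needed to obtain them again can be saved. When the background is known to be stationary, for example because the camera is fixed, the processing time can be reduced further by having the operator flip a switch or the like so that the global motion vector is not detected and is always set to zero. When the global motion vector is obtained with half-pixel accuracy, the background memory is given twice the pixel density of the input image both vertically and horizontally. That is, the pixel values of the input image are written to the background memory at every other pixel. If, for example, the background has moved 0.5 pixel horizontally in the next frame, the pixel values are again written at every other pixel, between the pixels written previously. With this arrangement, some pixels of the background image may never have been written by the time the first scan ends. In that case, the gaps are filled by interpolation from the surrounding written pixels.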
【０２７９】
また、半画素の動きベクトルを用いる／用いないに関わらず、動画シーケンス全体を通じて、一度も背景領域にならない部分は、１スキャン目を終わっても背景メモリに画素値が代入されない。このような未定義の部分は、２スキャン目では、常に物体と判定する。これは、特に未定義の部分を記録するためのメモリを用意して、未定義か否かをいちいち判定しなくても、背景に希にしか出てこないと予想される画素値（Ｙ，Ｕ，Ｖ）＝（０，０，０）などで、予め背景メモリを初期化してから１スキャン目を開始すればよい。未定義の画素にはこの初期画素値が残るので２スキャン目では、自動的に物体と判定される。
【０２８０】
これまでの説明では、背景メモリを生成する時に、既に背景の画素値が代入されている画素についても、背景領域であれば、画素値が上書きされている。この場合、動画シーケンスの最初の方でも最後の方にでも背景である部分には、動画シーケンスの最後の方の背景の画素値が背景メモリに記録される。動画シーケンスの最初と最後でそういった背景が全く同じ画素値ならば問題はないが、カメラが非常にゆっくりと動いたり、背景の明るさが少しずつ変化するなどして、画素値がフレーム間で微少に変動する場合には、動画シーケンスの最初の方の背景と最後の方の背景とでは、画素値の差が大きくなるので、この背景メモリを用いると、動画シーケンスの最初の方のフレームで背景部分も物体と誤検出されてしまう。そこで、その前までのフレームでは、一度も背景領域にはならずに、現処理フレームで初めて背景領域となった画素についてのみ背景メモリへの書き込みを行い、既に背景の画素値が代入されている画素の上書きはしないようにすれば、背景メモリには動画シーケンスの最初の方の背景が記録されるので、正しく物体が抽出される。そして、２スキャン目にも、その物体抽出結果に応じて、現処理フレームの背景領域を背景メモリに上書きするようにすれば、現処理フレームの直前のフレームの背景と現処理フレームの背景という相関の高い背景同士を比較することになり、その部分が物体と誤検出されにくくなる。２スキャン目の上書きは、背景に微少な変動がある場合に有効なので、操作者が背景の動きは無しという意味にスイッチを切り替えるなどした場合は、上書きは行わない。このスイッチは、先のグローバル動きベクトルを行うか行わないかを切り替えるスイッチと共通でも構わない。
【０２８１】
１スキャン目は背景画像を生成するのが目的であるから、必ずしも全てのフレームを用いる必要はない。１フレームおき、２フレームおきなどとフレームを間引いても、ほぼ同じ背景画像が得られ、処理時間は短くなる。
【０２８２】
背景領域のうち、フレーム間差分がしきい値以下の画素だけを背景メモリに記録するようにすれば、画面に入り込んでくる他の物体が背景メモリに記録されずに済む。また、１スキャン目の物体領域が実際よりも物体側に誤検出された場合、物体の画素値が背景メモリに記録されてしまう。そこで、背景領域でも物体領域に近い画素は背景メモリに入力しないようにする。
【０２８３】
観光地で撮影した画像などで、前景の人などを除いた背景画像だけが必要な場合は、背景メモリに記録された背景画像を外部に出力する。
【０２８４】
以上が本実施形態の第１の構成例の説明である。本例によれば、動画シーケンスの初めの部分も最後の部分と同様に高い抽出精度が得られる。また、物体の動きが小さかったり、全く動かない場合でも正しく抽出される。
【０２８５】
次に、図２８を用いて、生成されたシェイプ画像２８０を修正する例を説明する。シェイプ画像２８０が生成されるまでは、図２７と同じなので説明を省略する。
【０２８６】
シェイプ画像２８０は、背景パレットを用いるエッジ補正回路２８４に入力される。また、テクスチャ画像２７５が、背景パレットによるエッジ補正回路２８４と縮小ブロックマッチングによるエッジ補正回路２８５に入力される。エッジ補正回路２８４の詳細なブロック図を図３１に示す。
【０２８７】
図３１において、シェイプ画像２８０は、補正回路２８６に入力され、同じフレームのテクスチャ画像２７５は比較回路２８７に入力される。背景パレットを保持するメモリ２８８からは、背景色２８９が読み出され、比較回路２８７に入力される。ここで、背景パレットは、背景部分に存在する輝度（Ｙ）と色差（Ｕ，Ｖ）の組すなわちベクトルの集まり、
（Ｙ１，Ｕ１，Ｖ１）
（Ｙ２，Ｕ２，Ｖ２）
（Ｙ３，Ｕ３，Ｖ３）
……………………
のことで、予め用意される。具体的には、背景パレットは、最初のフレームにおいて、背景領域に属する画素のＹ，Ｕ，Ｖの組を集めたものである。ここで、例えばＹ，Ｕ，Ｖが各々２５６通りの値をとるとすると、その組み合わせは膨大な数になり、背景の（Ｙ，Ｕ，Ｖ）の組み合わせ数も多くなり、後に説明する処理の演算量が多くなってしまうので、所定のステップサイズでＹ，Ｕ，Ｖの値を各々量子化することにより、組み合わせ数を抑制できる。これは、量子化をしない場合は異なるベクトル値だったもの同士が、量子化により同じベクトル値になる場合があるからである。
【０２８８】
比較回路２８７では、テクスチャ画像２７５の各画素のＹ，Ｕ，Ｖが量子化され、そのベクトルがメモリ２８８から順次送られてくる背景パレットに登録されているベクトル、すなわち背景色２８９のいずれかと一致するかどうかが調べられる。画素毎に、その画素の色が背景色かどうかの比較結果２９０が、比較回路２８７から補正回路２８６へ送られる。補正回路２８６では、シェイプ画像２８０のある画素の画素値が“２５５”（物体）であるにもかかわらず、比較結果２９０が背景色であった場合に、その画素の画素値を“０”（背景）に置き換えて、補正されたシェイプ画像２９１として出力する。この処理により、シェイプ画像２８０において物体領域が背景領域にはみ出して誤抽出されていた場合に、その背景領域を正しく分離できる。ただ、背景と物体に共通の色があり、背景パレットに物体の色も混じって登録されていると、物体のその色の部分もが背景と判定されてしまう。そこで、最初のフレームでは、先に説明したパレットを背景の仮のパレットとしておき、同様の方法で最初のフレームの物体のパレットも作る。次に、背景の仮のパレットの中で物体のパレットにも含まれる色については、背景の仮のパレットから除き、残ったものを背景パレットとする。これにより、物体の一部が背景になってしまう不具合を回避できる。
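The background palette and the correction it drives can be sketched as below. The sketch assumes (Y, U, V) tuples in nested lists and quantization by integer division with a single step size for all three components. The function names are hypothetical.

```python
def quantize(pixel, step):
    """Quantize a (Y, U, V) vector so that nearby colours share one palette entry."""
    return tuple(component // step for component in pixel)


def build_background_palette(texture, shape, step):
    """Colours seen in the background of the first frame, minus colours also seen in the object."""
    provisional_background, object_palette = set(), set()
    for texture_row, shape_row in zip(texture, shape):
        for pixel, s in zip(texture_row, shape_row):
            target = object_palette if s == 255 else provisional_background
            target.add(quantize(pixel, step))
    return provisional_background - object_palette


def correct_shape(texture, shape, palette, step):
    """Reset object pixels (255) to background (0) when their colour is a background colour."""
    return [[0 if s == 255 and quantize(p, step) in palette else s
             for p, s in zip(texture_row, shape_row)]
            for texture_row, shape_row in zip(texture, shape)]


first_texture = [[(10, 128, 128), (12, 128, 128), (200, 100, 100), (205, 100, 100)]]
first_shape = [[0, 0, 255, 255]]
palette = build_background_palette(first_texture, first_shape, step=16)

later_texture = [[(11, 130, 129), (14, 128, 128), (201, 101, 99), (9, 128, 128)]]
rough_shape = [[0, 255, 255, 255]]   # object region spills over onto background pixels
print(correct_shape(later_texture, rough_shape, palette, step=16))   # [[0, 0, 255, 0]]
```

A larger `step` merges more colours into each palette entry, which reflects the trade-off the text describes between overly fine and overly coarse quantization.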
【０２８９】
また、最初のフレームで与えられるシェイプ画像に誤差がある場合を考慮し、シェイプ画像のエッジの近傍の画素は、パレットの生成に用いないようにしても良い。また、各ベクトルの出現頻度を数え、頻度が所定値以下のベクトルはパレットに登録しないようにしても良い。量子化ステップサイズを小さくしすぎると、処理時間が多くなったり、背景色に非常に似た色でもベクトル値がわずかに異なるために背景と判定されなかったりし、逆に量子化ステップサイズを大きくしすぎると、背景と物体に共通するベクトルばかりになってしまう。そこで、最初のフレームに対して、いくつかの量子化ステップサイズを試し、与えられるシェイプ画像の様に背景色と物体色が分離される量子化ステップサイズが選ばれる。
【０２９０】
また、途中から新しい色が背景や物体に現れることがあるので、途中のフレームで背景パレットを作りなおしても良い。
【０２９１】
図２８に戻り、シェイプ画像２９１は、エッジ補正回路２８５に入力される。エッジ補正回路２８５は、先に説明した図３０の回路において、シェイプ画像２４９をシェイプ画像２９１、テクスチャ画像２５１をテクスチャ画像２７５とする回路と同じであるので、説明は省略するが、シェイプ画像のエッジがテクスチャ画像のエッジに合うようにシェイプ画像の補正を行う。補正されたシェイプ画像２９２はスイッチ部２８１に送られる。スイッチ部２８１からは、シェイプ画像２９２とシェイプ画像２６０のうちから選択されたシェイプ画像２９３が出力される。
【０２９２】
本例では、エッジ補正回路を物体抽出回路２７９の後段に設けたが、物体抽出回路２２４の後段に設ければ、シェイプ画像２６０の精度を向上できる。
【０２９３】
また、エッジ補正によって、抽出精度がかえって悪化する場合も希にある。そういう時に、悪化したシェイプ画像２９２が出力されてしまわないように、図２８において、シェイプ画像２８０やシェイプ画像２９１もスイッチ部２８１に入力すれば、エッジ補正を行わないシェイプ画像２８０や、背景パレットによるエッジ補正だけを施したシェイプ画像２９１を選択することも可能となる。
【０２９４】
図４４は、背景パレットに登録された背景色の画素をクロスハッチで示しており、先に図３０や図２９を用いて説明した相似ブロックの探索の時に、図４４の情報を用いると、輪郭抽出の精度をさらに高めることができる。背景に図柄がある場合に、物体と背景のエッジではなく、背景の図柄のエッジに沿うように相似ブロックが選ばれてしまうことがある。このような場合、ブロックと相似ブロックを縮小したブロックとの誤差を求める時に、対応画素がいずれも背景色同士の時は、その画素の誤差は、計算に含めないようにすると、背景の図柄のエッジがずれていても、誤差が発生せず、従って、物体と背景のエッジが合うように相似ブロックが正しく選択される。
【０２９５】
図３２は、本実施形態の物体抽出装置２９４を組み込んだ画像合成装置の例である。テクスチャ画像２９５はスイッチ部２９６と物体抽出装置２９４に入力され、最初のフレームのシェイプ画像２１００は物体抽出装置２９４に入力される。物体抽出装置２９４は、図２７や図２８で構成されており、各フレームのシェイプ画像２９７が生成され、スイッチ部２９６に送られる。一方、記録回路２９８には、予め合成用背景画像２９９が保持されており、現処理フレームの背景画像２９９が記録回路２９８から読み出され、スイッチ部２９６に送られる。スイッチ部２９６では、シェイプ画像の画素値が“２５５”（物体）の画素ではテクスチャ画像２９５が選択されて合成画像２１０１として出力され、シェイプ画像の画素値が“０”（背景）の画素では、背景画像２９９が選択されて合成画像２１０１として出力される。これにより、背景画像２９９の前景にテクスチャ画像２９５内の物体を合成した画像が生成される。
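The selection performed by the switch unit 296 amounts to a per-pixel choice between the texture image and the background image for compositing. A minimal sketch, assuming nested-list images of equal size and a hypothetical function name:

```python
def composite(texture, shape, new_background):
    """Take the texture pixel where the shape image is 255, otherwise the new background."""
    return [[t if s == 255 else b
             for t, s, b in zip(texture_row, shape_row, background_row)]
            for texture_row, shape_row, background_row
            in zip(texture, shape, new_background)]


texture = [["t0", "t1", "t2"]]
shape = [[0, 255, 0]]
new_background = [["b0", "b1", "b2"]]
print(composite(texture, shape, new_background))   # [['b0', 't1', 'b2']]
```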
【０２９６】
図３３はエッジ補正を行う別の例を示す。図３３のように設定されたブロックのうちの一つが、図３３のブロック２１０２であるとする。輪郭線を境界にして、ブロックは物体領域と背景領域に分けられている。この輪郭を左右方向にずらして得られたブロックが２１０３，２１０４，２１０５，２１０６である。それぞれずらす幅と向きが異なる。文献：福井「領域間の分離度に基づく物体輪郭抽出」（電子情報通信学会論文誌、Ｄ−ＩＩ、Ｖｏｌ．Ｊ８０−Ｄ−ＩＩ、Ｎｏ．６、ｐｐ．１４０６−１４１４、１９９７年６月）の１４０８ページに記述されている分離度を各々の輪郭線について求め、ブロック２１０２〜２１０６のうちで分離度が最も高い輪郭線を採用する。これにより、シェイプ画像の輪郭がテクスチャ画像のエッジに合う。
【０２９７】
以上述べてきたように、本第４実施形態によれば、動画シーケンスの初めの部分も最後の部分と同様に高い抽出精度が得られる。また、物体の動きが小さかったり、全く動かない場合でも正しく抽出される。さらに、現処理ブロックよりも大きい相似ブロックのシェイプデータを縮小して張り付けることにより、シェイプデータで与えられる物体領域の輪郭線がずれていてもそれを正しい位置に補正することが可能となり、物体領域の輪郭を大まかになぞったものをシェイプデータとして与えるだけで、以降の入力フレーム全てにおいて物体領域を高い精度で抽出することが可能となる。
【０２９８】
なお、以上の第１乃至第４実施形態は適宜組み合わせて利用することもできる。また、第１乃至第４実施形態の物体抽出方法の手順はすべてソフトウェアによって実現することもでき、この場合には、その手順を実行するコンピュータプログラムを記録媒体を介して通常のコンピュータに導入するだけで、第１乃至第４実施形態と同様の効果を得ることができる。
【０２９９】
【発明の効果】
以上のように、本発明によれば、目的とする物体を囲む図形を用いてその物体を追跡することにより、目的の物体以外の周囲の余分な動きに影響を受けずに、目的の物体を精度良く抽出・追跡することが可能となる。
【０３００】
また、入力画像によらずに高い抽出精度を得ることが可能となる。さらに、動画シーケンスの初めの部分も最後の部分と同様に高い抽出精度が得られる。また、物体の動きが小さかったり、全く動かない場合でも正しく抽出される。
【図面の簡単な説明】
【図１】本発明の第１実施形態に係る動画像の物体追跡／抽出装置の基本構成を示すブロック図。
【図２】同実施形態の物体追跡／抽出装置の第１の構成例を示すブロック図。
【図３】同実施形態の物体追跡／抽出装置の第２の構成例を示すブロック図。
【図４】同実施形態の物体追跡／抽出装置に設けられた背景領域決定部の具体的な構成の一例を示すブロック図。
【図５】同実施形態の物体追跡／抽出装置に設けられた図形設定部の具体的な構成の一例を示すブロック図。
【図６】同実施形態の物体追跡／抽出装置に設けられた背景動き削除部の具体的な構成の一例を示すブロック図。
【図７】同実施形態の物体追跡／抽出装置に設けられた背景動き削除部で使用される代表背景領域の一例を示す図。
【図８】同実施形態の物体追跡／抽出装置の動作を説明するための図。
【図９】本発明の第２実施形態に係る第１の動画像物体追跡／抽出装置を表すブロック図。
【図１０】同第２実施形態に係る第２の動画像物体追跡／抽出装置を表すブロック図。
【図１１】同第２実施形態に係る第３の動画像の物体追跡／抽出装置を表すブロック図。
【図１２】同第２実施形態に係る第４の動画像の物体追跡／抽出装置を表すブロック図。
【図１３】同第２実施形態の物体追跡／抽出装置で用いられる物体予測の方法を説明するための図。
【図１４】同第２実施形態の物体追跡／抽出装置で用いられる参照フレーム選択方法を説明するための図。
【図１５】同第２実施形態の物体追跡／抽出装置において第１の物体追跡／抽出部と第２の物体抽出部を切り替えて物体を抽出した結果の例を表す図。
【図１６】同第２実施形態の物体追跡／抽出装置を用いた動画像の物体追跡／抽出処理の流れを説明する図。
【図１７】本発明の第３実施形態に係る第１の動画像物体追跡／抽出装置を表すブロック図。
【図１８】同第３実施形態に係る第２の動画像物体追跡／抽出装置を表すブロック図。
【図１９】同第３実施形態に係る第３の動画像物体追跡／抽出装置を表すブロック図。
【図２０】同第３実施形態に係る第５の動画像物体追跡／抽出装置を表すブロック図。
【図２１】同第３実施形態に係る第６の動画像物体追跡／抽出装置を表すブロック図。
【図２２】同第３実施形態に係る第４の動画像物体追跡／抽出装置を表すブロック図。
【図２３】同第３実施形態に係る動画像物体追跡／抽出装置の他の構成例を表すブロック図。
【図２４】同第３実施形態に係る動画像物体追跡／抽出装置のさらに他の構成例を表すブロック図。
【図２５】同第３実施形態に係る動画像物体追跡／抽出装置に適用されるフレーム順序制御による抽出フレーム順の例を説明する図。
【図２６】同第３実施形態に係る動画像物体追跡／抽出装置の応用例を表す図。
【図２７】本発明の第４実施形態に係る物体抽出装置を示すブロック図。
【図２８】同第４実施形態に係る物体抽出装置にエッジ補正処理を適用した場合の構成例を示すブロック図。
【図２９】同第４実施形態に係る物体抽出装置に適用される動き補償部の構成例を示すブロック図。
【図３０】同第４実施形態に係る物体抽出装置に適用される縮小ブロックマッチングによる物体抽出部の構成例を示すブロック図。
【図３１】同第４実施形態に係る物体抽出装置で使用される背景パレットによるエッジ補正回路を示す図。
【図３２】同第４実施形態に係る物体抽出装置に適用される画像合成装置を示す図。
【図３３】同第４実施形態に係る物体抽出装置で使用される分離度を用いたエッジ補正の原理を説明する図。
【図３４】同第４実施形態に係る物体抽出装置で処理される処理画像全体を示す図。
【図３５】同第４実施形態で用いられる操作者によって描かれた輪郭線を示す図。
【図３６】同第４実施形態で用いられるブロック設定（１スキャン目）の様子を示す図。
【図３７】同第４実施形態で用いられるブロック設定（２スキャン目）の様子を示す図。
【図３８】同第４実施形態で用いられる相似ブロックを説明するための図。
【図３９】同第４実施形態で用いられる相似ブロックの探索範囲を説明するための図。
【図４０】同第４実施形態で用いられる相似ブロックの探索範囲の別の例を説明するための図。
【図４１】同第４実施形態で用いられるシェイプ画像の置き換え変換前の様子を示す図。
【図４２】同第４実施形態で用いられるシェイプ画像の置き換え変換後の様子を示す図。
【図４３】同第４実施形態において抽出された輪郭線を示す図。
【図４４】同第４実施形態において抽出された背景色の部分を表す図。
【図４５】同第４実施形態で使用される動き補償を説明するための図。
【図４６】同第４実施形態で使用される背景画像を用いた物体抽出方法のフローチャート。
【図４７】同第４実施形態で使用される動き補償による物体抽出方法のフローチャート。
【図４８】同第４実施形態で使用されるフレーム内の縮小ブロックマッチングによる物体抽出方法のフローチャート。
【図４９】同第４実施形態で用いられるブロック設定の他の例を示す図。
【図５０】エッジ補正を説明するためのフローチャート図。
【図５１】ブロック設定の例を示す図。
【図５２】物体領域の輪郭線を探索する過程を示す図。
【図５３】ブロックサイズを次第に小さくする方法を説明するフローチャート図
【図５４】ブロック設定の他の例を示す図。
【符号の説明】
１…初期図形設定部
２…物体追跡・抽出部
１１…図形設定部
１２…背景領域決定部
１３…物体抽出部
２１…背景動き削除部
２２…図形設定部
２３…背景領域決定部
２４…物体抽出部
３１…変化量検出部
３２…代表領域決定部
３３…背景変化量決定部
３４…代表領域の背景決定部
３５…背景領域決定部
３６…形状予測部
３７…静止物体領域決定部
４１…分離部
４２…動き検出部
４３…分割判定部
４４…図形決定部
５１…背景代表領域設定部
５２…動き検出部
５３…動き補償部
６１…図形設定部
６２…複数の物体追跡・抽出部
７０…図形設定部
７１…第２の物体追跡・抽出部
７２…参照フレーム選択部
７３…第１の物体追跡・抽出部
１１１…特徴量抽出部
１４１…フレーム順序制御部
２２４…物体抽出部
２７９…物体抽出部
２８４…エッジ補正部
２８５…エッジ補正部
[0001]
[Technical field of the invention]
The present invention relates to an object extraction apparatus for images, and more particularly to an object extraction apparatus that detects the position of a target object in an input moving image and tracks/extracts the moving object.
[0002]
[Prior art]
Conventionally, an algorithm for tracking / extracting an object in a moving image has been considered. This is a technique for extracting only an object from an image in which various objects and backgrounds are mixed. This technique is useful for processing and editing a moving image. For example, a person extracted from a moving image can be combined with another background.
[0003]
As a method used for object extraction, a region division technique based on spatio-temporal image region division is known (Echigo, Iizaku, “Spatio-temporal image segmentation for video mosaic”, 1997 IEICE Information and Systems Society Conference, D-12-81, p. 273, September 1997).
[0004]
In this region division method using spatio-temporal image region division, each frame of the moving image is divided into small regions according to color texture, and the regions are then merged using the relationship of motion between frames. Dividing the image within a frame requires an initial division, and the division result depends heavily on it. The method turns this to its advantage: it changes the initial division from frame to frame, obtains different division results as a consequence, and merges the divisions that are inconsistent with the motion between frames.
[0005]
However, if this method is applied as is to the tracking and extraction of an object in a moving image, the motion vectors are affected by extraneous motion other than that of the target moving object, so their reliability is often insufficient and erroneous merging becomes a problem.
[0006]
Japanese Patent Application Laid-Open No. 8-241414 discloses a moving object detection/tracking apparatus that uses a plurality of moving object detection devices in combination. This conventional apparatus is used, for example, in a monitoring system with a surveillance camera; it detects a moving object in an input moving image and tracks it. In this apparatus, the input moving image is input to an image dividing unit, an inter-frame difference type moving object detection unit, a background difference type moving object detection unit, and a moving object tracking unit. The image dividing unit divides the input moving image into blocks of a predetermined size. The division result is sent to the inter-frame difference type moving object detection unit and the background difference type moving object detection unit. The inter-frame difference type moving object detection unit detects the moving object in the input moving image using the inter-frame difference for each division result. Here, the frame interval used when taking the inter-frame difference is set based on the detection result of the background difference type moving object detection unit, so that the moving object can be detected regardless of its moving speed. The background difference type moving object detection unit detects the moving object by taking the difference between the moving object and a background image created for each division result from the moving images input so far. An integration processing unit integrates the detection results of the two detection units and extracts motion information of the moving object. After the object has been extracted in each frame, the moving object tracking unit associates corresponding moving objects between frames.
[0007]
In this configuration, the moving object is detected using not only the inter-frame difference but also the background difference, so the detection accuracy is higher than when only the inter-frame difference is used. However, because the mechanism detects moving objects by inter-frame and background differences over the entire input moving image, both detection results are affected by extraneous motion other than that of the target moving object. As a result, the target moving object cannot be extracted and tracked well in an image whose background contains complicated motion.
[0008]
As another object extraction technique, a method is also known in which a background image is first generated using a plurality of frames, and a region where the difference between pixel values of the background image and the input image is large is extracted as an object.
[0009]
An existing technique for object extraction using such a background image is disclosed in, for example, Japanese Patent Laid-Open No. 8-55222, “Moving Object Detection Device, Background Extraction Device, and Unattended Object Detection Device”.
[0010]
The image signal of the current processing frame is input to a frame memory that stores one frame of image, to first motion detection means, to second motion detection means, and to a switch. The image signal of the previous frame is read from the frame memory and input to the first motion detection means. Meanwhile, the background image signal generated up to that point is read from a frame memory provided for holding the background image and is input to the second motion detection means and to the switch. The first and second motion detection means each extract an object region using the difference between their two input image signals and send it to a logical operation circuit. The logical operation circuit takes the logical product of the two input images and outputs it as the final object region. The object region is also sent to the switch. Depending on the object region, the switch selects the background pixel signal for pixels that belong to the object region and the image signal of the current processing frame for pixels that do not; the selected signal is sent as an overwrite signal to the frame memory that holds the background image, and the pixel values of that frame memory are overwritten.
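The conventional update loop described above can be sketched as follows. This is only an illustrative reading of the paragraph, not code from the cited publication: the function names, the list-of-lists image representation, and the fixed threshold of 10 are all assumptions.

```python
from typing import List, Tuple

Image = List[List[int]]
Mask = List[List[bool]]


def moving_mask(a: Image, b: Image, threshold: int) -> Mask:
    """Mark pixels whose absolute difference between two images exceeds the threshold."""
    return [[abs(pa - pb) > threshold for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def update_background(current: Image, previous: Image, background: Image,
                      threshold: int = 10) -> Tuple[Mask, Image]:
    """One step of the conventional scheme: returns (object mask, new background)."""
    m1 = moving_mask(current, previous, threshold)    # first motion detection
    m2 = moving_mask(current, background, threshold)  # second motion detection
    # Logical product of the two detection results is the final object region.
    obj = [[x and y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
    # Switch: keep the stored background under the object,
    # otherwise overwrite it with the current frame.
    new_bg = [[bg if o else cur for o, bg, cur in zip(ro, rb, rc)]
              for ro, rb, rc in zip(obj, background, current)]
    return obj, new_bg
```

Note how a wrong value already stored in the background memory is only replaced once the pixel is no longer judged to be part of the object, which is why the early frames of a sequence give poor results.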
[0011]
In this method, as shown in Japanese Patent Laid-Open No. 8-55222, as the process proceeds, the background image gradually becomes correct, and eventually an object is correctly extracted. However, since the object is mixed in the background image at the beginning of the moving image sequence, the object extraction accuracy is poor. Also, when the movement of the object is small, the image of the object remains in the background image and the extraction accuracy does not increase.
[0012]
[Problems to be solved by the invention]
As described above, the conventional object extraction / tracking methods detect moving objects over the entire input moving image and are therefore affected by extraneous movement other than that of the target moving object. As a result, the target moving object cannot be accurately extracted and tracked.
[0013]
Also, the object extraction method using a background image has the problem that the extraction accuracy is poor at the beginning of the moving image sequence and that, when the movement of the object is small, the background image is never completed and the extraction accuracy does not improve.
[0014]
An object of the present invention is to provide a moving image object extraction apparatus capable of accurately extracting / tracking a target object without being affected by extraneous movement around it.
[0015]
Another object of the present invention is to provide an object extraction apparatus that makes it possible to determine the background image accurately and that achieves high extraction accuracy regardless of how large the movement of the object is, at the beginning of the moving image sequence just as at its end.
[0016]
[Means for Solving the Problems]
The present invention provides an object extraction device comprising: background region determination means for determining a first background region common to the current frame and a first reference frame based on the difference between the current frame to be extracted and the first reference frame, which is temporally different from the current frame, and for determining a second background region common to the current frame and a second reference frame based on the difference between the current frame and the second reference frame, which is temporally different from the current frame; means for extracting, as the object region, the region in the in-graphic image of the current frame that belongs to neither the first background region nor the second background region; and stationary object detection means for detecting a stationary object region.
[0017]
In this object extraction device, two reference frames are prepared for each current frame to be extracted. A first common background region shared by the current frame and the first reference frame is determined from a first difference image between them, and a second common background region shared by the current frame and the second reference frame is determined from a second difference image between them. Since both difference images contain the object region on the current frame, the object region on the current frame is extracted by detecting the region, within the in-graphic image of the current frame, that belongs to neither the first nor the second common background region. When the object is stationary, the stationary object region is detected from the fact that there is no difference between the previous object region and the current object region.
[0018]
In this way, the object is tracked by treating as the extraction target the region that belongs to none of the common background regions determined from temporally different reference frames. The target object can therefore be extracted and tracked with high accuracy without being affected by extraneous movement around it.
[0019]
It is also preferable to further provide background correction means for correcting the background motion of each reference frame or of the current frame so that the background motion between each of the first and second reference frames and the current frame becomes relatively zero. By providing this background correction means at either the input stage of the graphic setting means or the input stage of the background region determination means, the background can be made pseudo-constant between frames even when it changes gradually between successive frames, for example when the camera is panned. Taking the difference between the current frame and the first or second reference frame then cancels the background between the frames, so that the common background region detection and the object region extraction can be performed without being affected by background changes. The background correction means can be realized by motion compensation processing.
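As a rough illustration of such background correction, the sketch below estimates a single global translation between a reference frame and the current frame and shifts the reference frame accordingly. The pure-translation model, the exhaustive search, and the names `shift` and `estimate_global_shift` are assumptions made for this example; the text only states that motion compensation is used.

```python
from typing import List, Tuple

Image = List[List[int]]


def shift(img: Image, dx: int, dy: int, fill: int = 0) -> Image:
    """Translate an image by (dx, dy); uncovered pixels receive `fill`."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)] for y in range(h)]


def estimate_global_shift(ref: Image, cur: Image, search: int = 2) -> Tuple[int, int]:
    """Return the (dx, dy) within +/-search that minimizes the mean absolute
    difference between the shifted reference and the current frame,
    evaluated only where the two overlap."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            total = count = 0
            for y in range(h):
                for x in range(w):
                    sy, sx = y - dy, x - dx
                    if 0 <= sy < h and 0 <= sx < w:
                        total += abs(ref[sy][sx] - cur[y][x])
                        count += 1
            if count == 0:
                continue
            score = total / count
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best[1], best[2]
```

After `shift(ref, dx, dy)` the background of the reference frame lines up with that of the current frame, so an ordinary inter-frame difference cancels it. A real camera pan may also need sub-pixel or non-translational models, which this sketch does not cover.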
[0020]
In addition, the background region determination means preferably comprises means for detecting, in the difference image between the current frame and the first or second reference frame, the difference value of each pixel near the contour line of the region belonging to the in-graphic image of the current frame or of the first or second reference frame, and means for determining, from those difference values, the difference value to be judged as belonging to the common background region; the common background region is then determined from the difference image using the determined difference value as the threshold for background / object region determination. By focusing on the difference values of the pixels near the contour line, the threshold can be determined easily without examining the entire difference image.
[0021]
Further, the graphic setting means preferably comprises means for dividing the in-graphic image of the reference frame into a plurality of blocks, means for searching, for each of the blocks, for the region on the input frame where the error with respect to the block is minimized, and means for setting on the input frame a graphic surrounding the plurality of regions found. This makes it possible to set a new graphic that is optimal for the target input frame regardless of the shape and size of the graphic that was initially set.
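A minimal sketch of this block-based graphic setting is given below, assuming a rectangular graphic, a sum of absolute differences as the error, and a small exhaustive search window. The function names and the parameters `block` and `search` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]
Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), right and bottom exclusive


def sad(ref: Image, cur: Image, rx: int, ry: int, cx: int, cy: int, size: int) -> int:
    """Sum of absolute differences between a block of ref and a block of cur."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(size) for i in range(size))


def set_figure(ref: Image, cur: Image, rect: Rect, block: int = 2, search: int = 2) -> Rect:
    """Divide the in-graphic image of the reference frame into blocks, find the
    best match of each block in the current frame, and return the rectangle
    that surrounds all matched regions."""
    h, w = len(cur), len(cur[0])
    x0, y0, x1, y1 = rect
    found = []
    for by in range(y0, y1 - block + 1, block):
        for bx in range(x0, x1 - block + 1, block):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    cx, cy = bx + dx, by + dy
                    if 0 <= cx <= w - block and 0 <= cy <= h - block:
                        err = sad(ref, cur, bx, by, cx, cy, block)
                        if best is None or err < best[0]:
                            best = (err, cx, cy)
            found.append(best[1:])
    xs = [x for x, _ in found]
    ys = [y for _, y in found]
    return (min(xs), min(ys), max(xs) + block, max(ys) + block)
```

Because each block is matched independently, the surrounding rectangle can grow or shrink from frame to frame, which is what allows the graphic to adapt to the object.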
[0022]
The present invention also comprises prediction means for predicting, from frames whose object regions have already been extracted, the position or shape of the object on the current frame to be extracted, and means for selecting the first and second reference frames to be used by the background region determination means based on the predicted position or shape of the object on the current frame.
[0023]
As described above, by selecting an appropriate frame as a reference frame to be used, it is possible to always obtain a good extraction result.
[0024]
Here, let O_i, O_j, and O_curr denote the objects in the reference frames f_i and f_j and in the current frame f_curr to be extracted. The optimum reference frames f_i, f_j for correctly extracting the shape of the object are frames that satisfy
(O_i ∩ O_j) ⊆ O_curr
that is, frames f_i, f_j for which the intersection of O_i and O_j is contained in O_curr.
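The selection condition can be checked directly when object regions are represented as sets of pixel coordinates. This is a sketch only: the set representation and the function name are assumptions, and in practice O_curr is not known in advance, so a predicted position or shape would have to stand in for it.

```python
from typing import Set, Tuple

Region = Set[Tuple[int, int]]  # set of (x, y) pixel coordinates


def is_suitable_pair(o_i: Region, o_j: Region, o_curr: Region) -> bool:
    """True when (O_i ∩ O_j) ⊆ O_curr, the condition for reference frames f_i, f_j."""
    return (o_i & o_j) <= o_curr


# An object moving steadily to the right: the two reference objects do not
# overlap at all, so the condition holds trivially.
prev_obj = {(0, 0), (1, 0)}
curr_obj = {(1, 0), (2, 0)}
next_obj = {(2, 0), (3, 0)}
```

With `prev_obj` and `next_obj` as references the condition holds, whereas two reference frames in which the object sits at the same place away from the current object would violate it.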
[0025]
Further, the present invention is characterized in that a plurality of object extraction means that extract objects by mutually different methods are provided, and object extraction is performed while selectively switching among them. In this case, it is desirable to combine first object extraction means, which extracts an object using the differences between the current frame and at least two reference frames temporally shifted from it, with second object extraction means, which uses inter-frame prediction to extract an object by predicting the object region of the current frame from a frame in which object extraction has already been performed. Thereby, even when the object is partially stationary and no difference from the reference frames can be detected, the object extraction means based on inter-frame prediction can compensate.
[0026]
In addition, when a plurality of object extraction means are provided, it is preferable that the apparatus further comprises means for extracting a feature amount of the image for at least a part of the current frame to be extracted, and that the plurality of object extraction means be switched based on the extracted feature amount.
[0027]
For example, if it is known in advance whether the background moves, that property should be used. If the background moves, background motion compensation is performed, but the motion cannot always be compensated completely, and frames with complex motion can hardly be compensated at all. Since such frames can be identified in advance from the compensation error of the background motion compensation, they can, for example, be excluded from the reference frame candidates. This processing is unnecessary if the background does not move. If another object is moving, the background motion compensation may fail or the frame may be excluded from the reference frame candidates, so that a frame may not be selected even though it best satisfies the reference frame selection condition. In addition, various properties may be mixed within one image. The motion and texture of the object also differ from part to part, so a single tracking / extraction method, apparatus, or set of parameters may not extract the object successfully. It is therefore better either to have the user designate the parts of the image that have special properties, or to detect differences within the image automatically as feature amounts, and then, based on the feature amounts, to switch the tracking / extraction method or change its parameters, for example block by block within a frame.
[0028]
In this way, by switching a plurality of object extraction means based on the feature amount of the image, it becomes possible to accurately extract the shapes of objects in various images.
[0029]
When the first object extraction means, which uses the differences between the current frame and at least two reference frames temporally shifted from it, is used in combination with the second object extraction means, which uses inter-frame prediction, it is desirable to switch between the two selectively for each block in the frame based on the prediction error amount: when the prediction error of the second object extraction means is within a predetermined range, its extraction result is used as the object region, and when the prediction error exceeds the predetermined range, the extraction result of the first object extraction means is used as the object region.
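The per-block switching rule might be sketched as follows. The block results are treated as opaque values, and the name `switch_per_block` and the scalar `max_error` bound are assumptions introduced for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def switch_per_block(prediction_errors: Sequence[float],
                     predicted: Sequence[T],
                     differenced: Sequence[T],
                     max_error: float) -> List[T]:
    """For each block, keep the inter-frame prediction result while its
    prediction error stays within max_error; otherwise fall back to the
    result obtained from the differences with the reference frames."""
    chosen = []
    for err, pred_result, diff_result in zip(prediction_errors, predicted, differenced):
        chosen.append(pred_result if err <= max_error else diff_result)
    return chosen
```

For example, with errors `[0.1, 0.9]` and a bound of `0.5`, the first block keeps its predicted shape and the second block takes the difference-based shape.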
[0030]
Further, the second object extraction means performs inter-frame prediction in an order different from the input frame order so that the interval between the reference frame and the current frame to be extracted is at least a predetermined number of frames. As a result, the amount of motion between frames is larger than when inter-frame prediction is performed sequentially in input frame order, so the prediction accuracy, and consequently the extraction accuracy, can be improved.
[0031]
That is, depending on the frame interval, the motion may be too small or too complex for the shape prediction method based on inter-frame prediction to cope with. Therefore, for example, when the shape prediction error does not fall below a threshold, the prediction accuracy is improved by widening the interval to the extracted frame used for prediction, which in turn improves the extraction accuracy. In addition, when the background moves, the background motion between each reference frame candidate and the extraction frame is obtained and compensated, but depending on the frame interval the background motion may be too small or too complex for the compensation to be accurate. In this case as well, the motion compensation accuracy can be increased by widening the frame interval. If the order of the extraction frames is adaptively controlled in this way, the shape of the object can be extracted more reliably.
[0032]
Further, the present invention provides an object extraction device that receives moving image data and shape data representing an object region on a predetermined frame among the plurality of frames constituting the moving image data, and that extracts the object region from the moving image data using the shape data, the device comprising: means for reading the moving image data from a storage device in which it is recorded and generating shape data for each frame of the read moving image data by applying motion compensation to the shape data; means for generating a background image of the moving image data by sequentially overwriting, on a background memory, the image data of the background region of each frame as determined by the generated shape data; and means for reading the moving image data again from the storage device, obtaining, for each frame of the re-read moving image data, the difference from the corresponding pixel of the background image stored in the background memory, and determining pixels whose absolute difference is larger than a predetermined threshold to be the object region.
[0033]
In this object extraction device, a background image is generated on the background memory in a first scan that reads the moving image data from the storage device. A second scan is then performed, in which the object region is extracted using the background image completed in the first scan. Because the moving image data is stored in the storage device, it can be scanned twice in this way, and the object region can be extracted with sufficiently high accuracy from the very beginning of the moving image sequence.
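The two-scan procedure can be sketched as below. The per-frame shape data is assumed to be already available (the motion compensation step that produces it is omitted), and treating pixels that were never observed as background as object pixels is a choice made for this example, not something stated in the text.

```python
from typing import List, Optional

Image = List[List[int]]
Mask = List[List[bool]]


def build_background(frames: List[Image], shapes: List[Mask]) -> List[List[Optional[int]]]:
    """First scan: overwrite the background memory with every pixel that the
    shape data of a frame marks as background."""
    height, width = len(frames[0]), len(frames[0][0])
    background: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    for frame, shape in zip(frames, shapes):
        for y in range(height):
            for x in range(width):
                if not shape[y][x]:
                    background[y][x] = frame[y][x]
    return background


def extract_by_background(frame: Image, background: List[List[Optional[int]]],
                          threshold: int) -> Mask:
    """Second scan: a pixel is object when its absolute difference from the
    background exceeds the threshold (or when no background was ever seen)."""
    return [[bg is None or abs(p - bg) > threshold for p, bg in zip(rf, rb)]
            for rf, rb in zip(frame, background)]
```

Even when the shape data used in the first scan is a loose outline, the second scan can tighten it, because the completed background is compared pixel by pixel with each frame.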
[0034]
Further, the present invention further comprises means for selectively outputting, as the object extraction result, either the object region determined from the shape data of each frame or the object region determined from the absolute value of the difference from the background image. Depending on the image, the object region determined by the shape data obtained in the first scan may be more accurate than the object region obtained in the second scan using the difference from the background image. Therefore, the extraction accuracy can be further improved by selectively outputting the object region obtained in the first scan or the object region obtained in the second scan.
[0035]
Further, the present invention provides an object extraction device that receives moving image data and shape data representing an object region on a predetermined frame among the plurality of frames constituting the moving image data, and that sequentially obtains the shape data of each frame using, as a reference frame, the frame for which the shape data was given or a frame whose shape data has already been obtained, the device comprising: means for dividing the current processing frame into blocks; means for searching the reference frame, for each block, for a similar block whose image data has a similar pattern and whose area is larger than that of the current processing block; means for cutting out the shape data of the similar block from the reference frame, reducing it, and pasting it onto the corresponding block of the current processing frame; and means for outputting the pasted shape data as the shape data of the current processing frame.
[0036]
In this object extraction device, for each block of the current frame to be extracted, a similar block whose image data (texture) has a similar pattern and whose area is larger than that of the current processing block is searched for, and the shape data of that similar block is cut out, reduced, and pasted onto the block of the current processing frame. By reducing and pasting the shape data of a similar block larger than the current processing block in this way, the contour line of the object region given by the shape data can be corrected to the correct position even if it is shifted. Therefore, for example, the user only needs to trace the outline of the object region roughly on the first frame with a mouse or the like and supply it as shape data; the object region can then be extracted with high accuracy in all subsequent input frames.
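The search, reduce, and paste steps might look like the following sketch. It assumes that the similar block is exactly twice the size of the current block, that reduction is plain 2:1 subsampling, that the error is a sum of absolute differences, and that the frame dimensions are multiples of the block size; none of these specifics is given in the text.

```python
from typing import List

Image = List[List[int]]


def reduce2(img: Image, x: int, y: int, size: int) -> Image:
    """Reduce the (2*size) square at (x, y) to size x size by 2:1 subsampling."""
    return [[img[y + 2 * j][x + 2 * i] for i in range(size)] for j in range(size)]


def predict_shape(cur: Image, ref: Image, ref_shape: Image, block: int = 2) -> Image:
    """For each block of the current frame, find the double-size block of the
    reference frame whose reduced texture matches best, then paste the
    reduced shape data of that block."""
    h, w = len(cur), len(cur[0])
    out = [[0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            target = [row[bx:bx + block] for row in cur[by:by + block]]
            best = None
            for sy in range(0, h - 2 * block + 1):
                for sx in range(0, w - 2 * block + 1):
                    cand = reduce2(ref, sx, sy, block)
                    err = sum(abs(a - b) for ra, rb in zip(target, cand)
                              for a, b in zip(ra, rb))
                    if best is None or err < best[0]:
                        best = (err, sx, sy)
            patch = reduce2(ref_shape, best[1], best[2], block)
            for j in range(block):
                out[by + j][bx:bx + block] = patch[j]
    return out
```

Because the pasted shape comes from a region chosen by texture similarity rather than from the same position in the shape data, a contour that was drawn slightly off the true edge tends to be pulled back toward it.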
[0037]
Further, the present invention provides an object extraction device that receives image data and shape data representing an object region of the image and extracts the object region from the image data using the shape data, the device comprising: means for setting blocks on the contour portion of the shape data and searching the same image, for each block, for a similar block whose image data has a similar pattern and which is larger than the block; means for replacing the shape data of each block with a reduced version of the shape data of its similar block; means for repeating the replacement a predetermined number of times; and means for outputting the shape data after the repeated replacement as corrected shape data.
[0038]
As described above, by performing replacement processing using similar blocks found by block matching within the frame, the contour line given by the shape data can be corrected to the correct position. Further, since the block matching is performed within a frame, the similar block search and replacement can be repeated for the same block, which further improves the correction accuracy.
[0039]
[Detailed Description of the Invention]
FIG. 1 shows the overall configuration of a moving image object tracking / extracting apparatus according to a first embodiment of the present invention. This apparatus tracks the movement of a target object in an input moving image signal and comprises an initial graphic setting unit 1 and an object tracking / extracting unit 2. The initial graphic setting unit 1 initially sets, on the input moving image signal a1, a graphic that surrounds the target object to be tracked / extracted, based on an external initial graphic setting instruction signal a0. In accordance with the instruction signal a0, a graphic of arbitrary shape, such as a rectangle, a circle, or an ellipse, is set on the initial frame of the input moving image signal a1 so as to surround the target object. The instruction signal a0 can be input, for example, by having the user draw the graphic directly on the screen displaying the input moving image signal a1 with a pointing device such as a pen or a mouse, or by using such a pointing device to designate the position and size of the graphic. This makes it easy to designate, from outside, the object to be tracked / extracted on the initial frame image in which the target object appears.
[0040]
In addition, the initial graphic need not be input by the user; it can also be set automatically by analyzing an ordinary frame image to detect, for example, a human or animal face or body contour and setting the graphic so as to surround it.
[0041]
The object tracking / extracting unit 2 tracks and extracts the object based on the in-graphic image enclosed by the graphic set by the initial graphic setting unit 1. In the tracking / extraction of the moving object, the movement of the object is tracked by focusing on the object designated by the graphic. The target moving object can therefore be extracted / tracked without being affected by extraneous movement other than that of the target moving object.
[0042]
FIG. 2 shows an example of a preferable configuration of the object tracking / extracting unit 2.
[0043]
As shown in the figure, the object tracking / extracting unit comprises memories (M) 10 and 14, a graphic setting unit 11, a background region determining unit 12, and an object extracting unit 13.
[0044]
The graphic setting unit 11 sequentially sets a graphic on each input frame, using as a reference frame any frame that has already been input and had its graphic set. The graphic setting unit 11 receives the current frame image 101, the in-graphic image of the reference frame and its position 103, and the object extraction result 106 of the reference frame, and outputs the image 102 inside the region enclosed by the graphic on the current frame. That is, in the graphic setting process, the region on the current frame image 101 where the error with respect to the in-graphic image 103 of the reference frame is minimized is searched for based on the correlation between the in-graphic image 103 of the reference frame and the current frame image 101, and a graphic surrounding that region is set on the current frame image 101. The graphic to be set may be of any kind, such as a rectangle, a circle, an ellipse, or a region enclosed by edges. In the following, the case of a rectangle is described for simplicity. A specific configuration of the graphic setting unit 11 will be described later with reference to FIG. If no graphic surrounding the object is used, the in-graphic image is the entire image, and there is no need to input / output the position.
[0045]
The memory 10 holds at least about three frames that have been input and figure-set so far. Information to be held includes an image of a frame in which a figure is set, a position and shape of the set figure, an image in the figure, and the like. Further, not the entire image of the input frame but only the in-graphic image may be held.
[0046]
For each current frame to be extracted, the background region determination unit 12 uses as reference frames at least two arbitrary frames that are temporally different from the current frame, and determines the background region common to each reference frame and the current frame by taking the difference between that reference frame and the current frame. The background region determination unit 12 receives the in-graphic image of the current frame and its position 102, the in-graphic images and positions 103 of at least two frames held in the memory 10, and the object extraction results 106 of those at least two frames, and outputs the background regions 104 common to the in-graphic image of the current frame and that of each of the at least two frames. That is, when first and second frames are used as the reference frames, a first common background region, which is background in both the current frame and the first reference frame, is determined from the first difference image obtained by taking the inter-frame difference between the current frame and the first reference frame, and a second common background region, which is background in both the current frame and the second reference frame, is determined from the second difference image obtained by taking the inter-frame difference between the current frame and the second reference frame. A specific configuration of the background region determination unit 12 will be described later with reference to FIG. There is also a method of obtaining the common background using a background memory.
[0047]
If no graphic surrounding the object is used, the image in the graphic is the entire image and there is no need to input / output the position.
[0048]
The object extraction unit 13 extracts only the object region from the in-graphic image of the current frame using the common background regions determined by the background region determination unit 12; it receives the background regions 104 common to the current frame and each of the at least two frames, and outputs the object extraction result 106 of the current frame. Since both the first and second difference images contain the object region on the current frame, the object region on the current frame is extracted by detecting the region, within the in-graphic image of the current frame, that belongs to neither the first nor the second common background region. This exploits the fact that the regions other than the common background regions are candidates for the object region. In other words, the region other than the first common background region is the object region candidate on the first difference image, and the region other than the second common background region is the object region candidate on the second difference image, so the region where the two candidates overlap can be determined to be the object region of the current frame. Information indicating the position and shape of the object region can be used as the object extraction result 106. The image of the object region may also actually be extracted from the current frame using that information.
[0049]
The memory 14 holds at least two object extraction results, and is used to increase the extraction accuracy by feeding back the already extracted results.
[0050]
Here, with reference to FIG. 8, a method of object extraction / tracking processing used in the present embodiment will be described.
[0051]
Here, a case where an object is extracted from the current frame f (i) using three temporally continuous frames f (i−1), f (i), and f (i + 1) will be described as an example.
[0052]
First, the graphic setting unit 11 described above performs the graphic setting process. The graphic setting process is performed on each of the three frames f (i−1), f (i), and f (i + 1) using an arbitrary reference frame, and rectangles R (i−1), R (i), and R (i + 1) are set on the respective frames. The rectangles R (i−1), R (i), and R (i + 1) are position and shape information and do not exist as images.
[0053]
Next, the background region determination unit 12 determines a common background region.
[0054]
In this case, first, the inter-frame difference between the current frame f (i) and the first reference frame f (i−1) is taken to obtain a first difference image fd (i−1, i). Similarly, the inter-frame difference between the current frame f (i) and the second reference frame f (i + 1) is taken to obtain a second difference image fd (i, i + 1).
[0055]
Taking the first difference image fd (i−1, i) cancels the pixel values of the portions that have the same pixel value in the current frame f (i) and the first reference frame f (i−1), so the difference value of those pixels becomes zero. Therefore, if the backgrounds of the frames f (i−1) and f (i) are substantially the same, essentially only an image corresponding to the OR of the in-figure image of the rectangle R (i−1) and the in-figure image of the rectangle R (i) remains in the first difference image fd (i−1, i). The figure surrounding this remaining image is the polygon Rd (i−1, i) = R (i−1) OR R (i), as shown in the figure. The common background region of the current frame f (i) and the first reference frame f (i−1) is the entire region other than the actual object region within the polygon Rd (i−1, i) (here, the figure-eight shape obtained by overlapping two circles).
[0056]
Likewise, in the second difference image fd (i, i + 1), an image corresponding to the OR of the in-figure image of the rectangle R (i) and the in-figure image of the rectangle R (i + 1) remains. The figure surrounding this remaining image is the polygon Rd (i, i + 1) = R (i) OR R (i + 1), as shown. The common background region of the current frame f (i) and the second reference frame f (i + 1) is the entire region other than the actual object region within the polygon Rd (i, i + 1) (here, the figure-eight shape obtained by overlapping two circles).
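The polygon Rd obtained as the OR of two rectangles can be represented as a boolean mask, as in the short sketch below. The `(x0, y0, x1, y1)` rectangle convention with exclusive right and bottom edges and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Mask = List[List[bool]]
Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), right and bottom exclusive


def rect_mask(rect: Rect, width: int, height: int) -> Mask:
    """Boolean mask that is True inside the rectangle."""
    x0, y0, x1, y1 = rect
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]


def union_mask(a: Mask, b: Mask) -> Mask:
    """Pixel-wise OR of two masks, e.g. Rd(i-1, i) = R(i-1) OR R(i)."""
    return [[p or q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]
```

For instance, `union_mask(rect_mask(r_prev, w, h), rect_mask(r_cur, w, h))` gives the region in which the first difference image is expected to be non-zero when the backgrounds of the two frames agree.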
[0057]
Thereafter, a process of determining a common background area for the current frame f (i) and the first reference frame f (i-1) from the first difference image fd (i-1, i) is performed.
[0058]
A difference value serving as the threshold for distinguishing the common background region from the object region is required. It may be given by the user or set automatically by detecting the noise and properties of the image. In the latter case, the threshold need not be a single value for the whole screen; it may be determined locally according to the local properties of the image. The properties of the image may be, for example, the edge strength or the variance of the difference pixels. The threshold can also be determined using the figure that tracks the object.
[0059]
In this case, a difference value serving as the threshold for distinguishing the common background region from the object region is obtained, and the region of pixels whose difference value is equal to or smaller than the threshold is determined to be the common background region. This threshold can be determined using a histogram of the difference values of the pixels along the one-pixel-wide line just outside the polygon Rd (i−1, i) of the first difference image fd (i−1, i), that is, along the outline of the polygon Rd (i−1, i). The horizontal axis of the histogram is the pixel value (difference value), and the vertical axis is the number of pixels having that difference value. For example, the difference value at which the pixel count reaches half of the total number of pixels on the frame line of the polygon Rd (i−1, i) is taken as the threshold. Determining the threshold in this way makes it easy to obtain without examining the distribution of pixel values over the entire first difference image fd (i−1, i).
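One possible reading of this threshold rule is the median of the absolute difference values along the outline, sketched below. Two points are assumptions of this example rather than statements of the text: the outline is taken to be the mask pixels that touch a pixel outside the mask (4-neighbourhood), and "half of the total number of pixels" is read as the cumulative histogram count reaching one half.

```python
from typing import List, Tuple

Image = List[List[int]]
Mask = List[List[bool]]


def contour_pixels(mask: Mask) -> List[Tuple[int, int]]:
    """Mask pixels with at least one 4-neighbour outside the mask or the image."""
    h, w = len(mask), len(mask[0])

    def inside(x: int, y: int) -> bool:
        return 0 <= x < w and 0 <= y < h and mask[y][x]

    return [(x, y) for y in range(h) for x in range(w)
            if mask[y][x] and not all(inside(x + dx, y + dy)
                                      for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))]


def contour_threshold(diff: Image, mask: Mask) -> int:
    """Smallest difference value v such that at least half of the outline
    pixels have an absolute difference of v or less (the lower median)."""
    values = sorted(abs(diff[y][x]) for x, y in contour_pixels(mask))
    return values[(len(values) - 1) // 2]
```

The idea is that the outline of the tracking figure lies mostly on background, so the typical difference value found there gives a data-driven estimate of how much the background itself fluctuates between the two frames. The sketch raises an `IndexError` if the mask is empty.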
[0060]
Next, the common background region within the polygon Rd (i-1, i) of the first difference image fd (i-1, i) is determined using this threshold. The area other than the common background region is the object region including occlusion. As a result, the region within the polygon Rd (i-1, i) is divided into a background region and an object region, and is converted into a binary image in which the pixel value of the background region is “0” and the pixel value of the object region is “1”.
[0061]
Similarly, for the second difference image fd (i, i + 1), the common background region of the current frame f (i) and the second reference frame f (i + 1) is determined, and the area within the polygon Rd (i, i + 1) is converted into a binary image in which the background region has the pixel value “0” and the object region has the pixel value “1”.
[0062]
Thereafter, the object extraction process is performed by the object extraction unit 13.
[0063]
Here, a pixel-by-pixel AND operation is performed between the binary image within the polygon Rd (i-1, i) of the first difference image and the binary image within the polygon Rd (i, i + 1) of the second difference image. The intersection of the object regions including occlusion is thereby obtained, and the object O (i) on the current frame f (i) is extracted.
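The binarization and AND steps can be illustrated with a short sketch. It is a simplification under stated assumptions: the frames are grayscale arrays of equal size, the whole frame is treated as the figure, and the two thresholds `t1` and `t2` are supplied by the caller rather than derived from the outline histogram. The function name `orand_extract` is hypothetical.

```python
import numpy as np

def orand_extract(prev_frame, cur_frame, next_frame, t1, t2):
    """Extract the object of cur_frame from two difference images.

    Each binarized difference image marks the OR of the object positions in
    two frames; ANDing the two binary images leaves the current object.
    """
    cur = cur_frame.astype(np.int32)
    d1 = np.abs(cur - prev_frame.astype(np.int32))   # fd(i-1, i)
    d2 = np.abs(next_frame.astype(np.int32) - cur)   # fd(i, i+1)
    b1 = d1 > t1   # True ("1") = object region including occlusion
    b2 = d2 > t2
    return b1 & b2  # pixel-by-pixel AND
```

For a bright object that moves by its own width between frames, `b1` covers the previous and current positions, `b2` covers the current and next positions, and the AND keeps only the current position.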
[0064]
The above describes the case where the entire region other than the object region in the frame difference image is obtained as the common background region. Alternatively, only the in-figure image may be extracted from each frame, and the difference between the in-figure images may be calculated taking their positions on the frame into account. In this case, only the common background regions within the polygon Rd (i-1, i) and within the polygon Rd (i, i + 1) are determined.
[0065]
Thus, in this embodiment, object extraction is performed by the ORAND method, which focuses on the in-figure images, as follows:
1) By calculating a difference image between the current frame and each of the first and second reference frames, which differ in time from the current frame, the OR of the in-figure images of the current frame and the first reference frame and the OR of the in-figure images of the current frame and the second reference frame are obtained.
2) The difference images obtained by this OR processing of the in-figure images are ANDed, whereby the target object region is extracted from the in-figure image of the current frame.
[0066]
Further, the temporal relationship between the current frame and the two reference frames is not limited to the above example. For example, two frames f (i-m) and f (i-n) that temporally precede the current frame f (i) may be used as the reference frames, or two frames f (i + m) and f (i + n) that temporally follow it may be used as the reference frames.
[0067]
For example, in FIG. 8, the frames f (i−1) and f (i) are used as reference frames, the difference between each of these reference frames and the frame f (i + 1) is taken, and the same is applied to these difference images. If processing is performed, an object can be extracted from the frame f (i + 1).
[0068]
FIG. 3 shows a second configuration example of the object tracking / extracting unit 2.
[0069]
The main difference from the configuration of FIG. 2 is that a background motion deletion unit 21 is provided. The background motion deletion unit 21 is used to correct the background motion so that the background motion is relatively zero between each reference frame and the current frame.
[0070]
Hereinafter, the apparatus of FIG. 3 will be specifically described.
[0071]
The background motion deletion unit 21 receives the in-figure images, and their positions 206, of at least two frames temporally shifted from the current frame 201, and outputs images 202 in which the background motion of those temporally shifted frames has been deleted. A specific configuration example of the background motion deletion unit 21 will be described later with reference to FIG.
[0072]
The figure setting unit 22 corresponds to the figure setting unit 11 in FIG. 2. It receives the current frame 201, the at least two images 202 from which the background motion has been deleted, and the object extraction result 206 of the images 202, and outputs an image 203 representing the area surrounded by the arbitrary figure in the at least two images 202.
[0073]
The memory 26 holds an arbitrary graphic image and its position.
[0074]
The background region determination unit 23 corresponds to the background region determination unit 12 in FIG. 2. It receives the in-figure image and its position 203 and the object extraction result 206 of the images 202, and outputs a background area 204 common to the current frame and the at least two images 202. The object extraction unit 24 corresponds to the object extraction unit 13 in FIG. 2. It receives the background area 204 common to the current frame and the at least two images, and outputs an object extraction result 205 for the current frame. The memory 25 holds at least two object extraction results and corresponds to the memory 14 of FIG.
[0075]
By providing the background motion deletion unit 21 in this way, the background image can be kept artificially constant between successive frames even when it changes gradually from frame to frame, for example when the camera is panned. Therefore, when the difference between the current frame and a reference frame is obtained, the background cancels out between the frames, and the common background area detection process and the object area extraction process can be performed without being affected by the change in the background.
[0076]
Note that the background motion deletion unit 21 may be provided at the input stage of the background region determination unit 23 so that the background motion of the reference frame is deleted in accordance with the current frame.
[0077]
FIG. 4A shows a specific example of the configuration of the background region determination unit 12 (or 23).
[0078]
The change amount detection unit 31 obtains the difference between the current frame and each of the first and second reference frames described above. It receives the in-figure images and positions 302 of the current frame and of the temporally shifted frames, together with the object extraction result 301 of the temporally shifted frames, and outputs the change amount 303 between the in-figure images of the current frame and the temporally shifted frames. As the change amount, for example, a luminance difference between frames, a change in color, or an optical flow can be used. By using the object extraction result of the temporally shifted frames, an object can be extracted even if it does not change between frames. For example, if the change amount is an inter-frame difference, a portion where the inter-frame difference is zero and which belongs to the object is stationary, so it is the same as in the object extraction result of the temporally shifted frame.
[0079]
The representative area determination unit 32 receives the in-figure image of the current frame and its position 302, and outputs a region representing the background of the in-figure image as the representative area 304. As the representative area, the area within the figure that is expected to contain the most background is selected. For example, a band-like region is set at the outermost part of the figure, such as the outline of the figure on the difference image described with reference to FIG. Because the figure is set so as to surround the object, this region is highly likely to be background.
[0080]
The background change amount determination unit 33 receives the representative area 304 and the change amount 303, and outputs a change amount 305 used for determining the background. The background change amount is determined by taking a histogram of the change amounts (difference values) in the representative area, as described with reference to FIG.; a region having that change amount is determined to be a background region.
[0081]
The representative area background determination unit 34 receives the background change amount 305, determines the background 306 of the representative area, and outputs it. The background of the representative area is determined according to whether each pixel has the previously determined background change amount. The background area determination unit 35 receives the change amount 303, the background determination threshold 305, and the background 306 of the representative area, and outputs the background 307 of the area other than the representative area. The background of the area other than the representative area is determined by region growing from the representative area. For example, if an undetermined pixel that is adjacent, toward the inside of the figure, to a pixel already determined to be background matches the background change amount, it is determined to be background. Pixels that are not adjacent to the background, or that do not match the background change amount, are determined to be other than background. Alternatively, the determination may simply be made according to whether each pixel has the previously determined background change amount. By narrowing inward in this way from the contour line of the figure on the difference image, it is possible to determine how far the background area extends within the in-figure image.
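The inward growth from the representative area can be sketched as a breadth-first region growing. This is an illustrative sketch only: it assumes the input array is the change amount cropped to a rectangular figure, that the representative area is the one-pixel outline of that crop, that "matching the background change amount" means being at or below a threshold, and that 4-neighbour adjacency is used. The function name `grow_background` is hypothetical.

```python
from collections import deque

import numpy as np

def grow_background(change, threshold):
    """Grow the background inward from the outline of a rectangular figure."""
    h, w = change.shape
    matches = change <= threshold            # pixels with a background-like change
    background = np.zeros((h, w), dtype=bool)
    queue = deque()
    # Seed: outline pixels (representative area) that match the background.
    for y in range(h):
        for x in range(w):
            on_outline = y in (0, h - 1) or x in (0, w - 1)
            if on_outline and matches[y, x]:
                background[y, x] = True
                queue.append((y, x))
    # Grow: an undetermined neighbour of a background pixel becomes background
    # only if it also matches the background change amount.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and matches[ny, nx] and not background[ny, nx]:
                background[ny, nx] = True
                queue.append((ny, nx))
    return background
```

A low-change pixel that is enclosed by the object is never reached from the outline, so it stays outside the background, in line with the rule that pixels not adjacent to the background are treated as other than background.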
[0082]
Conversely, an object area that protrudes outward from the outline of the figure can also be detected. For example, if an undetermined pixel that is adjacent, toward the outside of the figure, to a pixel determined to be other than background does not match the background change amount, it is determined to be other than background. Pixels that are not adjacent to pixels other than the background, or that match the background change amount, are determined to be background. By expanding outward in this way from the contour line of the figure on the difference image, it is possible to determine how far the object extends outside the figure before the outside background begins. In this case, the change amount is also needed outside the figure. The figure may therefore be thickened by a few pixels to set a new figure that is large enough for the object not to protrude, and the change amount calculated inside it, or the change amount may simply be obtained for the entire frame. Alternatively, the change amount may be obtained in advance only inside the figure, and obtained as needed during the determination outside the figure. Naturally, when the object does not protrude from the figure, for example when there is no pixel other than background on the contour line, no processing outside the figure is necessary.
[0083]
Incidentally, when the object or a part of the object is stationary between the current frame and the reference frame, no difference is detected between the current frame and the reference frame, and the shape of the object may not be extracted correctly. Therefore, a method for detecting the object in the current frame using a reference frame whose object has already been extracted will be described with reference to FIG.
[0084]
FIG. 4B shows the background region determination unit 12 (or 23) including the stationary object region detection unit 37. In this configuration, the change amount detection unit 31 receives the in-figure image and position of the current frame and the in-figure images and positions 311 of at least two temporally shifted frames, and detects the change amount 313 between the in-figure images.
[0085]
The shape predicting unit 36 receives the in-figure image and position of the current frame, the in-figure images and positions 311 of at least two temporally shifted frames, and the image and object shape 317 of a frame whose object has already been extracted. For those frames, among the frames temporally shifted from the current frame, whose object has not yet been extracted, it predicts and outputs the object shape 312.
[0086]
The stationary object region determination unit 37 receives the predicted object shape 312, the change amount 313 between the reference frames and the current frame, and the object shape 317 of the already extracted frame, and determines, for the at least two frames, the object region 314 that is stationary with respect to the current frame.
[0087]
The background region determination unit 35 receives the object region 314 that is stationary with respect to the current frame for the at least two frames and the change amount 313 between the reference frames and the current frame, and determines and outputs the background region 316 common to the current frame and each of the at least two frames.
[0088]
First, consider the case where the object of the reference frame has already been extracted. If the inter-frame difference from the reference frame is zero for a region in the current frame, and the region at the same position in the reference frame is part of the object, then that region of the current frame can be extracted as part of a stationary object. Conversely, if the region is part of the background in the reference frame, the region in the current frame is background.
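The rule for stationary regions can be written compactly with boolean masks. This is a minimal sketch under stated assumptions: the source speaks of a difference of zero, whereas the sketch uses a small threshold to tolerate noise, and the reference frame's object mask is assumed to be available as a boolean array. The function name `classify_unchanged_pixels` is hypothetical.

```python
import numpy as np

def classify_unchanged_pixels(diff, threshold, ref_object_mask):
    """Split unchanged pixels into stationary object and common background.

    diff: absolute inter-frame difference between current and reference frame.
    ref_object_mask: True where the reference frame's extracted object lies.
    """
    unchanged = diff <= threshold
    stationary_object = unchanged & ref_object_mask     # unchanged and object in the reference
    common_background = unchanged & ~ref_object_mask    # unchanged and background in the reference
    return stationary_object, common_background
```

Pixels that did change are left to the difference-based processing described earlier; only the unchanged pixels are resolved by looking up the reference frame's extraction result.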
[0089]
However, when the object of the reference frame has not yet been extracted, the stationary object or part of the object cannot be extracted by the above method. In that case, the object shape of the reference frame that has not yet been extracted can be predicted using another frame from which the object has already been extracted, and it can be determined whether the object is a part of the object. As a prediction method, a block matching method, an affine transformation method, or the like often used in image coding is used.
[0090]
As an example, a block matching method as shown in FIG. 13 can be considered. If the shape of the object is predicted in this way, it is possible to determine whether the area where no inter-frame difference is detected is a part of a stationary object or the background.
[0091]
When the figure surrounding the object is not used, the image in the figure is the entire image, and there is no need to input / output the position. This shape prediction can use the same shape prediction as when a reference frame is selected. Further, in the embodiment switched to another object extraction method, an object shape obtained by another object extraction method can be used.
[0092]
FIG. 5 shows an example of a specific configuration of the graphic setting unit 11 (or 22).
[0093]
The dividing unit 41 receives the in-figure image and position 402 of a frame temporally shifted from the current frame, and outputs the divided images 403. The in-figure image may be divided into two or four equal parts, or edges may be detected and the image divided along them. In the following, for simplicity, the figure is divided into two equal parts, and each divided part is called a block. The motion detection unit 42 receives the divided in-figure images and their positions 403 and the in-figure image and position 401 of the current frame, and outputs the motion and error 404 of each divided image. Here, the position in the current frame that corresponds to each block is searched for so as to minimize the error, and the motion and the error are obtained. The division determination unit 43 receives the motion and error 404 and the object extraction result 407 of the temporally shifted frame, and outputs the determination result 406 of whether or not to divide the in-figure image of the temporally shifted frame and, when the image is not divided, the motion 405. Here, if a divided block contains none of the object extraction result of the temporally shifted frame, the block is deleted from the figure. Otherwise, if the obtained error is equal to or greater than a threshold, the block is divided further and the motion is obtained again; if the error is below the threshold, the motion of the block is determined. The figure determination unit 44 receives the motion 405 and outputs the in-figure image of the current frame and its position 407. Here, the figure determination unit 44 obtains the corresponding position of each block in the current frame and determines a new figure so as to include all the blocks at their corresponding positions. The new figure may be the union of all the blocks, or a rectangle or circle that contains all the blocks.
[0094]
In this way, the in-figure image of the reference frame is divided into a plurality of blocks, the area that minimizes the error with respect to the current frame is searched for each block, and a figure surrounding the plurality of areas found is set for the current frame. A new figure of optimum shape can thus be set for the input frame, regardless of the shape and size of the figure set initially.
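The figure setting by block search can be sketched as follows. This is a simplified illustration, not the configuration of FIG. 5 itself: it uses fixed-size square blocks instead of recursive halving, omits the deletion of blocks that contain no object and the error-driven re-division, uses the sum of absolute differences as the error, and takes the new figure to be the rectangle containing all matched blocks. The names `match_block` and `set_figure` are hypothetical, and the rectangle is given as (top, left, bottom, right) with exclusive bottom and right.

```python
import numpy as np

def match_block(ref, cur, y, x, size, search):
    """Return (dy, dx, error) of the displacement minimizing the block error."""
    block = ref[y:y + size, x:x + size].astype(np.int64)
    h, w = cur.shape
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ny, nx = y + dy, x + dx
            if ny < 0 or nx < 0 or ny + size > h or nx + size > w:
                continue  # candidate block would leave the frame
            cand = cur[ny:ny + size, nx:nx + size].astype(np.int64)
            err = int(np.abs(cand - block).sum())
            if best is None or err < best[2]:
                best = (dy, dx, err)
    return best

def set_figure(ref, cur, rect, size, search):
    """Move each block of the reference figure and return the enclosing rectangle."""
    top, left, bottom, right = rect  # assumed to be a multiple of size in each direction
    ys, xs = [], []
    for y in range(top, bottom, size):
        for x in range(left, right, size):
            dy, dx, _ = match_block(ref, cur, y, x, size, search)
            ys += [y + dy, y + dy + size]
            xs += [x + dx, x + dx + size]
    return min(ys), min(xs), max(ys), max(xs)
```

Because each block is moved independently, the enclosing rectangle can grow or shrink from frame to frame, which is the property the paragraph above points out.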
[0095]
Note that the reference frame used for figure setting may be any frame in which a figure has already been set and which is temporally shifted from the current frame. Just as forward prediction and backward prediction are used in ordinary coding techniques, a frame later in time than the current frame can also be used as the reference frame for figure setting.
[0096]
FIG. 6 shows an example of a specific configuration of the background motion deletion unit 21.
[0097]
The representative background area setting unit 51 receives the in-figure image of a temporally shifted frame and its position 501, and outputs a representative background area 503. The representative background area is an area that represents the global motion within the figure, that is, the motion of the background within the figure. For example, when the figure is a rectangle, a band-like frame region several pixels wide surrounding the rectangle is set, as shown in FIG. A few pixels outside the figure may also be used. The motion detection unit 52 receives the current frame 502 and the representative background area 503, and outputs a motion 504. In the example above, the motion of the band-like frame region around the rectangle is detected with respect to the current frame. The motion may be detected treating the frame region as a single area. Alternatively, as shown in FIG. 7, the region may be divided into a plurality of blocks and a motion obtained for each; the average of these motions, or the most frequent motion, may then be output.
[0098]
The motion compensation unit 53 receives the temporally shifted frame 501 and the motion 504, and outputs a motion-compensated image 505. Using the motion obtained above, the motion of the temporally shifted frame is deleted so that the frame is aligned with the current frame. The motion compensation may be block matching motion compensation or motion compensation using affine transformation.
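The three units of FIG. 6 can be approximated by a short sketch. It is an illustration under strong assumptions: the background motion is a single integer translation, the representative background area is a band just inside a rectangular figure, the error is the sum of absolute differences over the band, and compensation is done with a circular shift (`np.roll`), which wraps pixels around the frame border instead of padding them. All function names are hypothetical.

```python
import numpy as np

def band_mask(shape, rect, width):
    """Band-like frame region of the given width just inside a rectangle."""
    top, left, bottom, right = rect  # bottom and right are exclusive
    mask = np.zeros(shape, dtype=bool)
    mask[top:bottom, left:right] = True
    mask[top + width:bottom - width, left + width:right - width] = False
    return mask

def estimate_background_motion(ref, cur, mask, search):
    """Translation (dy, dx) of the band from the reference to the current frame."""
    ref64, cur64 = ref.astype(np.int64), cur.astype(np.int64)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            moved_ref = np.roll(ref64, (dy, dx), axis=(0, 1))
            moved_mask = np.roll(mask, (dy, dx), axis=(0, 1))
            err = int(np.abs(cur64 - moved_ref)[moved_mask].sum())
            if best is None or err < best[2]:
                best = (dy, dx, err)
    return best[0], best[1]

def delete_background_motion(ref, motion):
    """Shift the reference frame so that its background aligns with the current frame."""
    return np.roll(ref, motion, axis=(0, 1))
```

After compensation, the difference between the current frame and the shifted reference frame is close to zero wherever only the background is visible, so the subsequent difference-based steps see a background that appears static.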
[0099]
Note that the motion deletion may be performed not only on the background in the figure but also on the entire frame.
[0100]
As described above, this embodiment (1) tracks the object using a figure that roughly surrounds the object rather than the outline of the object; (2) sets an arbitrary figure in the current frame, determines the background area common to the in-figure image of the current frame and that of each of at least two other frames, and extracts the object of the current frame; (3) deletes the background motion of the at least two temporally shifted frames; (4) detects the change amount between in-figure images, determines the representative area, determines the change amount corresponding to the background from the change amounts of the in-figure images and positions of the current frame and the at least two frames, and determines whether a region is background from the relationship between the change amount and the representative area; (5) divides the in-figure image, detects the motion of the in-figure image or of a part of the divided in-figure image, determines whether to divide the in-figure image or the part of the divided in-figure image, and determines the in-figure image of the current frame and its position; and (6) sets an area representing the background, detects the motion of the background, and creates an image in which the background motion of the temporally shifted frame has been deleted. As a result, the target object can be extracted and tracked with high accuracy, by relatively simple processing, without being affected by extraneous motion other than that of the target object.
[0101]
Also, the object extraction / tracking procedure of the present embodiment can be realized by software control. In this case as well, the basic procedure is exactly the same: after the initial setting of the figure, the figure setting process is performed sequentially on the input frames, and the background region determination process and the object extraction process are performed either in parallel with the figure setting process or after it is completed.
[0102]
Next, a second embodiment of the present invention will be described.
[0103]
In the first embodiment described above, only one object extraction means, using the ORAND method, is provided. However, depending on the input image, sufficient extraction performance may not be obtained with that means alone. In the ORAND method of the first embodiment, a common background is set based on the difference between the current frame, which is the object extraction target, and a first reference frame that differs in time from the current frame, and another common background is set based on the difference between the current frame and a second reference frame that differs in time. However, no particular method is given for selecting the first and second reference frames. Depending on which frames are selected, the object extraction results may differ greatly, and good results may not be obtained.
[0104]
Therefore, in the second embodiment, the first embodiment is improved so that an object can be extracted with high accuracy regardless of the input image.
[0105]
First, a first configuration example of the object tracking / extracting apparatus according to the second embodiment will be described with reference to the block diagram of FIG.
[0106]
Only the configuration corresponding to the object tracking / extracting unit 2 of the first embodiment will be described below.
[0107]
The graphic setting unit 60 is the same as the graphic setting unit 11 of the first embodiment described with reference to FIG. 2 and receives a frame image 601 and a graphic 602 set for an initial frame or another input frame. Then, a figure is set in the frame image 601 and output. The switch unit SW61 receives the result 605 of the already performed object extraction, and outputs a signal 604 for switching the object extraction unit to be used based on the result.
[0108]
The object tracking / extracting unit 62 includes a plurality of object tracking / extracting units, the first to the Kth, as shown in the figure. These object tracking / extracting units perform object extraction using different methods. At least one of them uses the ORAND method described in the first embodiment. As object tracking / extracting units using other methods, for example, units based on shape prediction by block matching or on object shape prediction by affine transformation can be used. In these shape predictions, the position or shape of the object region on the current frame is predicted by inter-frame prediction between a frame from which the object has already been extracted and the current frame, and the object region is extracted from the in-figure image 603 of the current frame based on the prediction result.
[0109]
An example of shape prediction by block matching is shown in FIG. The in-figure image of the current frame is divided into blocks of the same size, as shown. For each block, the block with the most similar pattern (texture) is searched for in a reference frame from which the shape and position of the object have already been extracted. For this reference frame, shape data representing the object region has already been generated. In the shape data, pixels belonging to the object region have the value “255” and all other pixels have the value “0”. The shape data corresponding to the block found is pasted at the corresponding block position of the current frame. By performing this texture search and shape data pasting for all the blocks constituting the in-figure image of the current frame, the in-figure image of the current frame is filled with shape data that distinguishes the object region from the background region. The image (texture) corresponding to the object region can therefore be extracted using this shape data.
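The texture search and shape pasting can be sketched as follows. This is a minimal illustration with assumptions of its own: the whole frame is treated as the in-figure image, the frame size is a multiple of the block size, the similarity measure is the sum of absolute differences, and the search is an exhaustive scan of a square window. The function name `predict_shape` is hypothetical.

```python
import numpy as np

def predict_shape(cur, ref, ref_shape, size, search):
    """Predict the current frame's shape data (255 = object, 0 = other).

    For each block of cur, find the most similar block of ref within the
    search range and paste the matching part of ref_shape.
    """
    h, w = cur.shape
    ref64 = ref.astype(np.int64)
    predicted = np.zeros_like(ref_shape)
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            block = cur[y:y + size, x:x + size].astype(np.int64)
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ry, rx = y + dy, x + dx
                    if ry < 0 or rx < 0 or ry + size > h or rx + size > w:
                        continue  # reference block would leave the frame
                    err = int(np.abs(ref64[ry:ry + size, rx:rx + size] - block).sum())
                    if best is None or err < best[0]:
                        best = (err, ry, rx)
            _, ry, rx = best
            # Paste the shape data of the best-matching reference block.
            predicted[y:y + size, x:x + size] = ref_shape[ry:ry + size, rx:rx + size]
    return predicted
```

The minimum error found for each block is also a natural quantity to keep, since the text later uses the matching error to decide whether the prediction can be trusted.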
[0110]
The switch unit SW61, for example, performs the same operation as the first object tracking / extracting unit; when the extraction accuracy is good it selects the first object tracking / extracting unit, and otherwise it selects another object tracking / extracting unit. For example, if the first object tracking / extracting unit performs object shape prediction by block matching, the switching may be controlled according to the magnitude of the matching error. In the case of object shape prediction by affine transformation, the switching can be controlled according to the magnitude of the estimation error of the affine transformation coefficients. The unit of switching in the switch unit SW61 is not a frame but a small region within the frame, for example a block, or a region segmented on the basis of luminance or color. The object extraction method to be used can thereby be selected more finely, and the extraction accuracy can be improved.
[0111]
FIG. 10 shows a second example of the moving image object tracking / extracting apparatus according to the second embodiment.
[0112]
The graphic setting unit 70 is the same as the graphic setting unit 11 of the first embodiment described with reference to FIG. 2, and inputs an image 701 and a graphic 702 that has already been set for an initial frame or another input frame, A figure is set in the image 701 and output.
[0113]
The second object tracking / extracting unit 71 is used for extracting an object region by shape prediction such as a block matching method or affine transformation, and the current in-graphic image 703 output from the graphic setting unit 70, The shape and position 707 of the object on another extracted reference frame are input, and the shape and position of the object are predicted from the in-graphic image 703 of the current frame.
[0114]
The reference frame selection unit 72 receives the predicted shape and position 704 of the object in the current frame, predicted by the second object tracking / extracting unit 71, and the shape and position 707 of the objects already extracted, and selects at least two reference frames. The reference frame selection method is described below.
[0115]
Let $O_i$, $O_j$ and $O_{curr}$ be the objects in frames $i$ and $j$ and in the frame $curr$ being extracted, respectively. When the differences $d_i$ and $d_j$ between the current frame and two temporally different reference frames $f_i$ and $f_j$ are taken and ANDed to extract the object from the current frame $f_{curr}$, the overlapping part of the objects $O_i$ and $O_j$ is extracted by the AND processing of the temporally different frames, in addition to the object $O_{curr}$ to be extracted. Of course, if $O_i \cap O_j = \emptyset$, that is, if the objects $O_i$ and $O_j$ have no overlapping part and their overlap is the empty set, there is no problem.
[0116]
However, if the objects $O_i$ and $O_j$ have an overlapping part ($O_i \cap O_j \neq \emptyset$) and this overlap lies outside the object to be extracted, both $O_{curr}$ and $O_i \cap O_j$ remain as the extraction result.
[0117]
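In this case, as shown in FIG., if the background region of $O_{curr}$ (written $\overline{O_{curr}}$) and the region common to the objects $O_i$ and $O_j$ have no part in common, that is, $\overline{O_{curr}} \cap (O_i \cap O_j) = \emptyset$, there is no problem. However, as shown in FIG., if the background region $\overline{O_{curr}}$ and the region common to the objects $O_i$ and $O_j$ do have a part in common, that is, $\overline{O_{curr}} \cap (O_i \cap O_j) \neq \emptyset$, then $O_{curr}$ is extracted in the wrong shape, as shown by the oblique lines.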
[0118]
Therefore, the optimum reference frames $f_i$, $f_j$ for correctly extracting the shape of the object are frames that satisfy
$(O_i \cap O_j) \subseteq O_{curr}$                      ... (1)
that is, frames $f_i$, $f_j$ for which the overlapping part of $O_i$ and $O_j$ belongs to $O_{curr}$ (FIG. 14A).
[0119]
When more than two reference frames are selected, the condition becomes
$(O_i \cap O_j \cap \dots \cap O_k) \subseteq O_{curr}$              ... (2)
[0120]
Therefore, by selecting reference frames that satisfy expression (1) or (2) based on the prediction result of the position or shape of the object on the current frame, which is the object extraction target, the shape of the object can be extracted reliably.
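The selection rule, that the overlap of the reference objects must belong to the current object, can be checked directly on boolean masks. This is a minimal sketch assuming that the already extracted object shapes and the predicted shape of the current object are available as boolean arrays of equal size, and that the first pair satisfying the condition is acceptable; the source does not say how to choose among several valid pairs. The function name `select_reference_pair` is hypothetical.

```python
from itertools import combinations

import numpy as np

def select_reference_pair(candidate_masks, predicted_cur_mask):
    """Return indices (i, j) whose object overlap lies inside the predicted object.

    The overlap O_i AND O_j belongs to O_curr exactly when no overlap pixel
    falls outside the predicted current object.
    """
    for (i, mask_i), (j, mask_j) in combinations(enumerate(candidate_masks), 2):
        overlap = mask_i & mask_j
        if not np.any(overlap & ~predicted_cur_mask):
            return i, j
    return None  # no pair satisfies the condition
```

For more than two reference frames the same test applies to the AND of all selected masks.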
[0121]
The first object tracking / extracting unit 73 receives the at least two reference frames 705 selected by the reference frame selection unit 72 and the current image 701, extracts the object by the ORAND method, and outputs its shape and position 706.
[0122]
The memory 74 holds the extracted object shape and position 706.
[0123]
FIG. 11 shows a third configuration example of the object tracking / extracting apparatus according to the second embodiment.
[0124]
As shown in the figure, the object tracking / extracting apparatus includes a graphic setting unit 80, a second object tracking / extracting unit 81, a switch unit SW82, and a first object extracting unit 83. The graphic setting unit 80, the second object tracking / extracting unit 81, and the first object extracting unit 83 correspond respectively to the graphic setting unit 70, the second object tracking / extracting unit 71, and the first object tracking / extracting unit 73 shown in FIG. In this example, the switch unit SW82 selects between the extraction result of the second object tracking / extracting unit 81 and the extraction result of the first object extracting unit 83.
[0125]
That is, the graphic setting unit 80 receives the image 801 and the shape and position 802 of the initial figure, and outputs the shape and position 803 of the figure. The second object tracking / extracting unit 81 receives the shape and position 803 of the figure and the shape and position 806 of the already extracted object, and predicts and outputs the shape and position 804 of the object that has not yet been extracted. The switch unit SW82 receives the shape and position 804 of the object predicted by the second object tracking / extracting unit, and outputs a signal 805 that switches whether or not the first object tracking / extracting unit is to be used. The object tracking / extracting unit 83 receives the shape and position 806 of the already extracted object and the predicted shape and position 804 of the object that has not yet been extracted, and determines and outputs the shape and position 805 of the object.
[0126]
The unit of switching in the switch unit SW82 may be a block, as in the example described above, or a region segmented on the basis of luminance or color. As the criterion for switching, for example, the prediction error obtained when the object is predicted can be used. That is, when the prediction error in the second object tracking / extracting unit 81, which performs object extraction using inter-frame prediction, is equal to or less than a predetermined threshold, the switch unit SW82 switches so that the predicted shape obtained by the second object tracking / extracting unit 81 is used as the extraction result. When the prediction error in the second object tracking / extracting unit 81 exceeds the predetermined threshold, the switch unit SW82 switches so that the first object tracking / extracting unit 83 performs object extraction by the ORAND method and its extraction result is output to the outside.
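The per-block switching can be sketched as a merge of two shape images. This is an illustration under simplifying assumptions: both the predicted shape and the ORAND shape are taken to be already computed for the whole frame, whereas the text above runs the ORAND method only where the prediction error exceeds the threshold; blocks are square, and `block_errors` holds one prediction error per block. The function name `switch_per_block` is hypothetical.

```python
import numpy as np

def switch_per_block(predicted_shape, orand_shape, block_errors, size, threshold):
    """Use the predicted shape where the block error is small, else the ORAND shape."""
    result = orand_shape.copy()
    for (by, bx), err in np.ndenumerate(block_errors):
        if err <= threshold:
            rows = slice(by * size, (by + 1) * size)
            cols = slice(bx * size, (bx + 1) * size)
            result[rows, cols] = predicted_shape[rows, cols]
    return result
```

The merged result is a patchwork in which each block comes from whichever method was judged more reliable for that block.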
[0127]
FIG. 15 illustrates an example of an extraction result when the extraction unit to be used is switched based on the matching error for each block serving as a prediction unit.
[0128]
Here, the area indicated by the mesh is the object shape obtained by prediction in the second object tracking / extracting unit 81, and the area indicated by oblique lines is the object shape obtained by the first object tracking / extracting unit 83.
[0129]
FIG. 12 shows a fourth configuration example of the moving image object tracking / extracting apparatus according to the second embodiment.
[0130]
This object tracking / extracting device is obtained by adding the reference frame selection unit of FIG. 10 to the configuration of FIG.
[0131]
The graphic setting unit 90 receives the image 901 and the shape and position 902 of the initial graphic, and outputs the graphic shape and position 903. The second object tracking / extracting unit 91 receives the shape and position 903 of the graphic and the shape and position 908 of the already extracted object, predicts the shape and position 904 of the object that has not yet been extracted, and outputs it. The switch unit SW92 receives the predicted shape and position 904 of the object, determines whether the accuracy of the predicted object is good, and outputs a signal 905 that switches the output of the object extracted by the second object tracking / extracting unit. The reference frame selection unit 93 receives the predicted shape and position 904 of the object that has not yet been extracted and the shape and position 908 of the object that has already been extracted, and selects and outputs the shape and position 906 of the objects or predicted objects of at least two reference frames. The object tracking / extracting unit 94 receives the current image 901 and the shape and position 906 of the objects or predicted objects of at least two reference frames, extracts the object, and outputs its shape and position 907. The memory 95 holds either the extracted object shape and position 907 or the predicted object shape and position 904.
[0132]
Hereinafter, the procedure of the object tracking / extracting method in this example will be described with reference to FIG.
[0133]
(Step S1)
As a reference frame candidate, a frame shifted in time from the current frame is set in advance. This may be all frames other than the current frame, or may be limited to several frames before and after the current frame. For example, it is limited to a total of 5 frames including an initial frame, 3 frames before the current frame, and 1 frame after the current frame. However, if there are no three previous frames, the number of candidates for the subsequent frame is increased, and if there is no one subsequent frame, the previous four frames are set as candidates.
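As a rough sketch of the five-frame example in Step S1, the candidate list could be built as below. The function name, the use of frame index 0 as the initial frame, and the exact way a shortfall on one side is made up from the other side are assumptions made for illustration.

```python
def reference_frame_candidates(current, num_frames, n_before=3, n_after=1):
    """Return candidate frame indices: the initial frame (assumed index 0),
    up to n_before frames before the current frame, and up to n_after after."""
    wanted = n_before + n_after
    # Nearest frames first; the initial frame 0 is added separately below.
    before_pool = list(range(current - 1, 0, -1))
    after_pool = list(range(current + 1, num_frames))
    before = before_pool[:n_before]
    after = after_pool[:n_after]
    short = wanted - len(before) - len(after)
    if short > 0:
        # Too few earlier frames: take more subsequent frames.
        extra = after_pool[len(after):len(after) + short]
        after = after + extra
        short -= len(extra)
    if short > 0:
        # Too few subsequent frames: take more earlier frames.
        before = before + before_pool[len(before):len(before) + short]
    candidates = sorted(before + after)
    if current != 0:
        candidates = [0] + candidates
    return candidates
```

For a 20-frame sequence, frame 10 yields the initial frame, frames 7 to 9, and frame 11; the last frame yields the initial frame and the four preceding frames.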
[0134]
(Step S2)
First, in the initial frame, the user sets a figure, for example a rectangle, that encloses the object to be extracted. For the figures of the subsequent frames, the already set figure is divided into blocks, each block is matched, and the blocks are pasted at the corresponding positions. The object is tracked by setting a new rectangle so as to include all the pasted blocks. Object tracking figures are set for all reference frame candidates. Each time an object is extracted, the object tracking figure of the previous frame is re-determined using that object, which prevents extraction errors. In addition, the user inputs the object shape of the initial frame.
[0135]
In the following, it is assumed that the frame to be extracted is the current frame, that objects have already been extracted from the frames before the current frame, and that objects have not yet been extracted from the subsequent frames.
[0136]
(Step S3)
An appropriate area is set around the graphic of each reference frame candidate, the background motion relative to the current frame is detected, and the background inside the graphic of the reference frame is deleted. To detect the background motion, a region several pixels wide is set around the figure and matched against the current frame, and the motion vector that minimizes the matching error is taken as the background motion.
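A minimal sketch of this background motion detection is given below, assuming grayscale images stored as lists of rows, a rectangular figure with inclusive bounds, a full search over a small displacement range, and mean absolute difference as the matching error. None of these choices is prescribed by the text, and the function names are hypothetical.

```python
def ring_pixels(rect, ring_width, height, width):
    """Pixels in a band of ring_width pixels just outside the rectangle."""
    top, left, bottom, right = rect  # inclusive bounds of the figure
    pixels = []
    for y in range(max(0, top - ring_width), min(height, bottom + ring_width + 1)):
        for x in range(max(0, left - ring_width), min(width, right + ring_width + 1)):
            if not (top <= y <= bottom and left <= x <= right):
                pixels.append((y, x))
    return pixels


def background_motion(ref, cur, rect, ring_width=2, search=2):
    """Return ((dy, dx), error): the displacement from the reference frame to
    the current frame that minimizes the matching error over the ring."""
    h, w = len(ref), len(ref[0])
    ring = ring_pixels(rect, ring_width, h, w)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            total, count = 0, 0
            for y, x in ring:
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    total += abs(ref[y][x] - cur[yy][xx])
                    count += 1
            if count == 0:
                continue
            err = total / count  # mean absolute difference (assumed measure)
            if best is None or err < best[0]:
                best = (err, (dy, dx))
    return best[1], best[0]
```

The returned error can also serve as the motion vector detection error used in Step S4 to discard unreliable reference frame candidates.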
[0137]
(Step S4)
By removing from the candidates any reference frame with a large motion vector detection error at the time of background motion deletion, extraction errors caused by inappropriate background motion deletion can be prevented. When the number of reference frame candidates has decreased, new reference frame candidates may be selected. If graphic setting or background motion deletion has not yet been performed for a newly added reference frame candidate, these must be newly performed for it.
[0138]
(Step S5)
Next, a candidate object shape is predicted for the current frame, from which an object has not yet been extracted, and for the reference frame prior to the current frame. The rectangle set for the current frame or for the previous reference frame candidate is divided into blocks, for example, each block is matched against a frame from which an object has already been extracted (previous frame), and the corresponding object shape is pasted to predict the object shape. Each time an object is extracted, the object of the previous frame is re-predicted using that object, which prevents extraction errors.
[0139]
(Step S6)
At this time, for a block with a small prediction error, the predicted shape is output as is as the extraction result. Further, when object shape prediction is processed in units of blocks, block distortion may occur due to matching errors. Therefore, a filter that eliminates the block distortion may be applied to smooth the entire object shape.
[0140]
The rectangular division performed at the time of object tracking and object shape prediction may be performed with a fixed block size, or may be performed by hierarchical block matching with a matching threshold.
[0141]
For a block with a large prediction error, the following processing is performed.
[0142]
(Step S7)
A temporary reference frame is set from among the reference frame candidates, and for each combination, a set of reference frames satisfying Equation (1) or Equation (2) is selected. If no combination of the reference frame candidates satisfies Equation (1) or Equation (2), it is preferable to select the combination for which O_i ∩ O_j contains the fewest pixels. It is also preferable to choose the combination of reference frame candidates so that frames with a large motion vector detection error during background motion deletion are used as little as possible. Specifically, when several reference frame sets satisfy Expression (1) or Expression (2) equally well, one method is to select the set with the smaller motion vector detection error at the time of background motion deletion. Hereinafter, it is assumed that two frames are selected as reference frames.
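The fallback rule, choosing the pair whose O_i ∩ O_j has the fewest pixels, can be illustrated as follows. The sketch covers only this fallback and does not evaluate Equation (1) or Equation (2). Representing each predicted object shape O_i as a set of pixel coordinates, and the function name, are assumptions.

```python
from itertools import combinations


def select_reference_pair(object_masks):
    """object_masks maps a frame index to the set of (y, x) pixels of O_i.
    Returns the pair of frames with the smallest overlap and that overlap."""
    best_pair, best_overlap = None, None
    for i, j in combinations(sorted(object_masks), 2):
        overlap = len(object_masks[i] & object_masks[j])  # |O_i ∩ O_j|
        if best_overlap is None or overlap < best_overlap:
            best_pair, best_overlap = (i, j), overlap
    return best_pair, best_overlap


masks = {
    1: {(0, 0), (0, 1), (0, 2)},
    2: {(0, 1), (0, 2), (0, 3)},
    5: {(0, 2), (0, 3), (0, 4)},
}
pair, overlap = select_reference_pair(masks)
```

A tie between pairs could be broken by the motion vector detection error mentioned above; this sketch simply keeps the first pair found.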
[0143]
(Step S8)
When the reference frames are selected, the inter-frame difference between each reference frame and the current frame is obtained, and attention is paid to the inter-frame difference within the set figure. A histogram of the absolute difference values is obtained for the one-pixel-wide line just outside the set figure, the most frequent absolute difference value is taken as the difference value of the background region, and the background pixels on that line are determined accordingly. Proceeding inward from these background pixels, each adjacent pixel that has the difference value of the background region is also determined to be a background pixel, and this processing is continued sequentially until a pixel is judged not to be a background pixel. These background pixels form the common background area of the current frame and one reference frame. At this time, since the boundary between the background region and the other part may be unnatural due to the influence of noise, a filter that smooths the boundary or a filter that removes excess or noise regions may be applied.
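The histogram step and the inward growth of the background can be sketched as below. The sketch assumes a rectangular figure with inclusive bounds that leaves at least a one-pixel margin inside the image, an absolute-difference image given as a list of rows, a 4-neighbour growth rule, and a tolerance parameter tol. All of these, and the function name, are illustrative assumptions.

```python
from collections import Counter, deque


def common_background(diff, rect, tol=0):
    """diff: absolute inter-frame difference image (list of rows).
    Returns (background difference value, set of background pixels)."""
    top, left, bottom, right = rect  # inclusive bounds of the set figure
    h, w = len(diff), len(diff[0])
    # One-pixel-wide line just outside the figure.
    line = [(y, x)
            for y in range(top - 1, bottom + 2)
            for x in range(left - 1, right + 2)
            if 0 <= y < h and 0 <= x < w
            and (y in (top - 1, bottom + 1) or x in (left - 1, right + 1))]
    # The most frequent difference value on the line is the background value.
    hist = Counter(diff[y][x] for y, x in line)
    bg_value = hist.most_common(1)[0][0]

    def is_bg(y, x):
        return abs(diff[y][x] - bg_value) <= tol

    background = {p for p in line if is_bg(*p)}
    queue = deque(background)
    while queue:  # grow inward until no neighbouring pixel qualifies
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            inside = top <= ny <= bottom and left <= nx <= right
            if inside and (ny, nx) not in background and is_bg(ny, nx):
                background.add((ny, nx))
                queue.append((ny, nx))
    return bg_value, background
```

Pixels inside the figure that are never reached by this growth remain candidates for the object region.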
[0144]
(Step S9)
When a common background area has been obtained for each reference frame, the area that is not included in the two common background areas is detected and extracted as the object area. For the parts where the previously predicted object shape is not used, this result is used, and the shape of the entire object is output.
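The combination of the two common background areas can be written compactly with sets. In this sketch, a pixel inside the figure is treated as an object pixel only if it belongs to neither common background area; this reading of "not included in the two common background areas" is an assumption, as are the names used.

```python
def extract_object(figure_pixels, common_bg_a, common_bg_b):
    """figure_pixels: pixels inside the set figure.
    common_bg_a / common_bg_b: common background with each reference frame."""
    # Keep only pixels that neither reference frame marks as background.
    return {p for p in figure_pixels
            if p not in common_bg_a and p not in common_bg_b}


figure = {(0, 0), (0, 1), (0, 2), (0, 3)}
obj = extract_object(figure, {(0, 0)}, {(0, 3)})
```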
[0145]
If the part using the shape obtained from the common background and the part using the previously predicted object shape cannot be matched, a filter can be applied at the end to improve the output result.
[0146]
As described above, according to the second embodiment, an object can be extracted with high accuracy regardless of the input image. Alternatively, a reference frame suitable for object extraction can be selected.
[0147]
Next, a third embodiment of the present invention will be described.
[0148]
First, a first example of the object tracking / extracting device according to the third embodiment will be described with reference to the block diagram of FIG.
[0149]
Here, a configuration is adopted in which a feature amount of an image for at least a part of the region is extracted from the current frame that is an object extraction target, and a plurality of object extraction means are switched based on the feature amount.
[0150]
That is, the object tracking / extracting apparatus includes a graphic setting unit 110, a feature amount extracting unit 111, a switch unit SW112, a plurality of object tracking / extracting units 113, and a memory 114, as shown. The graphic setting unit 110, the switch unit SW112, and the plurality of object tracking / extracting units 113 are the same as the graphic setting unit 60, the switch unit SW61, and the plurality of object tracking / extracting units 62 of FIG. 9 described in the second embodiment, respectively. The object tracking / extracting unit to be used is switched based on the feature amount of the image of the current frame extracted by the feature amount extracting unit 111.
[0151]
The graphic setting unit 110 receives the extracted frame 1101, the initial graphic 1102 set by the user, and the extraction result 1106 of the already extracted frame, sets the graphic in the extracted frame, and outputs the graphic. The figure may be a geometric figure such as a rectangle, a circle, or an ellipse, or the user may input the object shape to the figure setting unit 110. In that case, the figure may not be a precise shape but may be a rough shape. The feature quantity detection unit 111 receives the extracted frame 1103 in which the figure is set and the extraction result 1106 of the already extracted frame, and outputs the feature quantity 1104. The switch unit SW112 receives the feature value 1104 and the extraction result 1106 of the already extracted frame as input, and controls the input of the extracted frame 1103 in which the figure is set to the object tracking / extracting unit.
[0152]
When the feature amount is obtained for the entire image, the switch unit SW112 can detect the properties of the image and use them to control input to the appropriate object tracking / extracting unit. The inside of the figure may also be divided into regions of an appropriate size, and a feature amount may be given for each divided region. Feature amounts include variance, luminance gradient, edge strength, and the like, which can be calculated automatically. In addition, properties of the object as visually perceived by a human may be given to the switch unit SW112 by the user. For example, if the target object is a person, the hair with a sharp edge may be specified, an extraction parameter selected accordingly, and edge correction performed as preprocessing before extraction.
[0153]
The feature amount may be a feature amount related not only to the inside of the set figure (object and its surroundings) but also to the outside of the figure (background portion).
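Two of the feature amounts named above, variance and luminance gradient, could be computed per block roughly as follows. The square block layout, the use of the mean absolute difference between neighbouring pixels as the gradient measure, and the function name are assumptions for illustration.

```python
def block_features(image, top, left, size):
    """Variance and mean absolute luminance gradient of one square block."""
    rows = range(top, top + size)
    cols = range(left, left + size)
    values = [image[y][x] for y in rows for x in cols]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    grads = []
    for y in rows:  # horizontal neighbour differences
        for x in range(left, left + size - 1):
            grads.append(abs(image[y][x + 1] - image[y][x]))
    for y in range(top, top + size - 1):  # vertical neighbour differences
        for x in cols:
            grads.append(abs(image[y + 1][x] - image[y][x]))
    gradient = sum(grads) / len(grads) if grads else 0.0
    return {"variance": variance, "gradient": gradient}
```

The same function can be applied to blocks inside the figure and to blocks in the background portion outside it.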
[0154]
Each of the plurality of (first to k-th) object tracking / extracting units 113 receives the extraction frame 1103 in which a figure is set and the extraction result 1106 of the already extracted frame as inputs, and outputs a result 1105 of tracking / extracting the object.
[0155]
The plurality of object tracking / extracting units 113 include, for example, a unit that extracts an object using the ORAND method, a unit that extracts an object using chroma key, and a unit that extracts an object by block matching or affine transformation.
[0156]
In the first embodiment, the background pixels are determined using the histogram of the inter-frame differences of the pixel values around the set graphic. However, a pixel whose inter-frame difference is equal to or less than a threshold may simply be determined to be a background pixel. Also, in the first embodiment, background pixels (difference value equal to or less than a certain value) are determined from the set figure toward the inside of the figure, but object pixels (difference value equal to or greater than a certain value) may instead be determined from the figure toward the outside of the figure, or an arbitrary order of operation may be used.
[0157]
The memory 114 receives the result 1105 obtained by tracking / extracting the object, and holds it.
[0158]
Hereinafter, the reason why a better extraction result can be obtained by switching the tracking / extraction method depending on the feature amount indicating the property of the image will be described.
[0159]
For example, if it is known in advance whether there is background motion, that property should be used. If there is background motion, background motion compensation is performed, but there is no guarantee that the motion can be compensated completely. Motion compensation is hardly possible for frames with complex motion. Since such frames can be identified in advance from the compensation error of background motion compensation, measures such as excluding them from the reference frame candidates can be taken. If there is no background motion, however, this processing is unnecessary. Moreover, if another object is moving, erroneous background motion compensation may be performed, or a frame may be excluded from the reference frame candidates even though it is optimal with respect to the reference frame selection condition, and the extraction accuracy may be reduced.
[0160]
In addition, various properties are mixed in one image. The motion and texture of the object also differ from part to part, and the object may not be extracted successfully with a single tracking / extraction method, apparatus, and set of parameters. Therefore, it is preferable that the user designate a part of the image having special properties, or that differences within the image be detected automatically as feature amounts, so that the object is extracted part by part while switching the tracking / extraction method or changing the parameters.
[0161]
Thus, by switching among a plurality of object tracking / extracting means, it becomes possible to accurately extract the shapes of objects in various images.
[0162]
Next, a second configuration example of the moving image object tracking / extracting device according to the third embodiment will be described using the block diagram of FIG.
[0163]
The graphic setting unit 120 receives the extracted frame 1201, the initial graphic 1202 set by the user, and the extraction result 1207 of the already extracted frame, sets a graphic in the extracted frame, and outputs it. The second object tracking / extracting unit 121 extracts an object region by shape prediction such as the block matching method or affine transformation; it receives the extraction frame 1203 in which the graphic is set and the extraction result 1207 of the already extracted frame as inputs, and outputs an object tracking / extraction result 1204.
[0164]
The feature amount extraction unit 122 receives the object tracking / extraction result 1204 as an input, and outputs the object feature amount 1205 to the switch unit SW123. The switch unit SW123 receives the object feature amount 1205 and controls the input of the object tracking / extraction result 1204 to the first object tracking / extracting unit. For example, when the second object tracking / extracting unit 121 tracks / extracts the object shape by the block matching method, the matching error is used as the feature amount, and for portions with a small matching error the predicted shape obtained by the second object tracking / extracting unit 121 is output as the extraction result. Other feature amounts include the luminance gradient and variance of each block and parameters representing texture complexity (fractal dimension and the like). When the luminance gradient is used, the input to the first object tracking / extracting unit is controlled so that the result of the first object tracking / extracting unit 124 by the ORAND method is used for blocks having almost no luminance gradient. Likewise, when edge detection is performed and the presence or strength of edges is used as the feature amount, the input to the first object tracking / extracting unit is controlled so that the result of the first object tracking / extracting unit 124 is used where there is no edge or where the edge is weak. In this way, switching can be controlled in units of blocks or regions that are parts of an image. Adaptive control is also possible by raising or lowering the switching threshold.
[0165]
The first object tracking / extracting unit 124 receives the extracted frame 1201, the object tracking / extraction result 1204, and the extraction result 1207 of the already extracted frame as inputs, and outputs the tracking / extraction result 1206 of the extracted frame to the memory 125.
The memory 125 receives and holds the extraction frame tracking / extraction result 1206 as an input.
[0166]
Next, a third configuration example of the object tracking / extracting apparatus according to the third embodiment will be described with reference to the block diagram of FIG.
[0167]
This object tracking / extracting apparatus is obtained by adding the reference frame selection unit described in the second embodiment to the configuration of FIG. That is, as illustrated, the object tracking / extracting apparatus comprises a graphic setting unit 130, a second object tracking / extracting unit 131, a feature amount extracting unit 132, a switch unit SW133, a reference frame selecting unit 134, a first object tracking / extracting unit 135, and a memory 136.
[0168]
The figure setting unit 130 receives the extracted frame 1301, the initial figure 1302 set by the user, and the extraction result 1308 of the already extracted frame, sets a figure in the extracted frame, and outputs it. The second object tracking / extracting unit 131 extracts an object region by shape prediction such as the block matching method or affine transformation; it receives the extraction frame 1303 in which the figure is set and the extraction result 1308 of the already extracted frame as inputs, and outputs an object tracking / extraction result 1304.
[0169]
The feature amount extraction unit 132 receives the object tracking / extraction result 1304 and outputs a feature amount 1305 of the object. The switch unit SW133 controls the input of the object tracking / extraction result 1304 to the first object tracking / extracting unit 135 using the object feature quantity 1305 as an input.
[0170]
The reference frame selection unit 134 receives the object tracking / extraction result 1304 destined for the first object tracking / extracting unit 135 and the extraction result 1308 of the already extracted frame, and outputs a reference frame 1306.
[0171]
An example of a feature of an object is the complexity of its motion. When the second object tracking / extracting unit 131 tracks / extracts an object by the block matching method, the extraction result of the first object tracking / extracting unit is output for portions with a large matching error. If there is partially complicated motion, the matching error becomes large in that portion, and that portion is extracted by the first object tracking / extracting unit 135. Therefore, the reference frame selection method used for the first object tracking / extracting unit 135 is switched using the matching error as a feature amount. Specifically, it is preferable to choose a selection method such that the selection condition of Expression (1) or (2) described in the second embodiment is satisfied by only the portion extracted by the first object tracking / extracting unit 135, rather than by the entire object shape.
[0172]
Examples of the background feature amount are information indicating 1) that the background is still, 2) that there is zooming, and 3) that there is panning. This feature amount may be input by the user, or a parameter obtained from the camera may be input as the feature amount. Other examples of the background feature amount include the background motion vector, the accuracy of the background-motion-corrected image, and the luminance distribution, texture, and edges of the background. For example, the reference frame selection method can be controlled by using, as the feature amount, the average difference between the background-motion-corrected image and the uncorrected image. As a control example, when the average difference is very large, the frame may be excluded from the reference frame candidates, or its selection priority may be lowered. If the background is stationary, or if background motion correction is perfect for all frames, the difference is zero. As the reference frame selection method, the same method as in the second embodiment can be used.
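The control example based on the average difference might look like the sketch below. It assumes that each candidate supplies a pair of equally sized images (the background-motion-corrected image and the image it is compared with), that the mean absolute difference is the average in question, and that lower values get higher selection priority. The function names and the reject_above parameter are hypothetical.

```python
def mean_abs_difference(img_a, img_b):
    """Average absolute pixel difference of two equally sized images."""
    total = sum(abs(a - b)
                for row_a, row_b in zip(img_a, img_b)
                for a, b in zip(row_a, row_b))
    count = sum(len(row) for row in img_a)
    return total / count


def rank_reference_candidates(candidates, reject_above):
    """candidates: list of (frame_index, corrected_image, uncorrected_image).
    Frames with a very large average difference are dropped; the rest are
    returned in order of increasing difference (highest priority first)."""
    scored = []
    for index, corrected, uncorrected in candidates:
        d = mean_abs_difference(corrected, uncorrected)
        if d <= reject_above:
            scored.append((d, index))
    return [index for _, index in sorted(scored)]
```

Instead of rejecting a frame outright, the difference could also be kept as a penalty term when ordering the candidates.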
[0173]
The first object tracking / extracting unit 135 receives the extracted frame 1301, the reference frame 1306, and the extraction result 1308 of the already extracted frame, and outputs the tracking / extraction result 1307 of the extracted frame, obtained by the ORAND method, to the memory 136. The memory 136 receives and holds the tracking / extraction result 1307 of the extracted frame.
[0174]
Next, with reference to FIG. 22, a fourth configuration example will be described in which, among the examples given above, a feature amount is obtained from the output of the second object tracking / extracting unit and a plurality of reference frame selecting units are switched.
[0175]
The graphic setting unit 160 receives the extracted frame 1601, the initial graphic 1602 set by the user, and the frame 1608 from which an object has already been extracted, and outputs a set graphic 1603. The second object tracking / extracting unit 161 extracts an object region by shape prediction such as the block matching method or affine transformation, and outputs the tracking / extraction result 1604. The feature amount detection unit 163 receives the object tracking / extraction result 1604 and outputs the feature amount 1605 to the switch unit SW164. The switch unit SW164 receives the feature amount 1605 and controls the input of the object tracking / extraction result 1604 to the reference frame selection units.
[0176]
The plurality of reference frame selection units 165 receive the object tracking / extraction result 1604 and the frame 1608 from which an object has already been extracted, and output at least two reference frames 1606.
[0177]
The first object tracking / extracting unit 166 performs object extraction by the ORAND method. It receives the reference frames 1606 and the extraction frame 1601 as inputs, and outputs the object tracking / extraction result 1607 to the memory 167. The memory 167 receives and holds the object tracking / extraction result 1607.
[0178]
Next, with reference to the block diagram of FIG. 23, an example will be described in which background information is obtained and the input to a plurality of reference frame selection units is controlled based on the error of background motion correction.
[0179]
The graphic setting unit 170 receives the extracted frame 1701, the initial graphic 1702 set by the user, and the frame 1710 from which an object has already been extracted, and outputs a set graphic 1703. The second object tracking / extracting unit 171 receives the set graphic 1703 and the frame 1710 from which the object has already been extracted, and outputs an object tracking / extraction result 1704. The switch unit SW172 receives background information 1705 designated by the user, and controls the input of the extracted frame 1701 to the background motion correction unit 173.
[0180]
The background motion correction unit 173 receives the extracted frame 1701 and the frame 1710 from which an object has already been extracted, and outputs a frame 1706 in which the background motion is corrected.
[0181]
The background feature quantity detection unit 174 receives the extracted frame 1701 and the frame 1706 whose background motion has been corrected, and outputs the background feature quantity 1707 to the switch unit SW175. The switch unit SW175 receives the background feature value 1707 and controls input of the object tracking / extraction result 1704 to the reference frame selection unit 176. The reference frame selection unit 176 receives the object tracking / extraction result 1704 and the frame 1710 from which the object has already been extracted, and outputs at least two reference frames 1708.
[0182]
The first object tracking / extracting unit 177 receives at least two reference frames 1708 and an extraction frame 1701 as input, and outputs an object tracking / extraction result 1709 to the memory 178. The memory 178 receives and holds the object tracking / extraction result 1709.
[0183]
Next, a fifth configuration example of the object tracking / extracting apparatus according to the third embodiment will be described with reference to the block diagram of FIG.
[0184]
The extracted frame output control unit 140 receives the image 1401 and the frame order 1405 for extraction, and outputs an extracted frame 1402. The frame order control unit 141 receives information 1405 regarding the frame order given by the user, and outputs a frame order 1406. The object tracking / extracting apparatus 142 extracts / tracks a target object from a moving image signal; it receives the extraction frame 1402 and outputs a tracking / extraction result 1403 to the tracking / extraction result output control unit 143. The tracking / extraction result output control unit 143 receives the tracking / extraction result 1403 and the frame order 1406 as inputs, rearranges the results into the frame order of the image 1401, and outputs them.
[0185]
The order of the frames may be given by the user or may be determined adaptively according to the motion of the object. A frame interval at which the motion of the object is easily detected is determined, and the object is extracted accordingly. That is, the frame order is controlled so that the object extraction process is performed in an order different from the input frame order, such that the interval between the reference frame and the current frame to be extracted is at least two frames. Compared with the case where shape prediction by inter-frame prediction or the ORAND operation is performed in the order of the input frames, this improves the prediction accuracy and, as a result, the extraction accuracy. Since the ORAND method can already raise the extraction accuracy by selecting an appropriate reference frame, this control is particularly effective for the shape prediction method based on inter-frame prediction using block matching or the like.
[0186]
That is, depending on the frame interval, the motion may be too small or too complex for the shape prediction method based on inter-frame prediction to cope with. Therefore, for example, when the shape prediction error does not fall below the threshold value, the interval to the extracted frame used for prediction is widened, which improves the prediction accuracy and, as a result, the extraction accuracy. In addition, when there is motion in the background, the background motion between each reference frame candidate and the extraction frame is obtained and compensated, but depending on the frame interval the background motion may be too small or too complex for accurate compensation. In this case as well, the motion compensation accuracy can be increased by widening the frame interval. If the order of the extraction frames is adaptively controlled in this way, the shape of the object can be extracted more reliably.
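A simple fixed-interval version of this frame order control is sketched below: frames at the chosen interval are processed first, the skipped frames afterwards, and the results are then put back into input order. The fixed interval and the function names are assumptions; the adaptive determination of the interval is not modelled.

```python
def extraction_order(num_frames, interval):
    """Frames at multiples of `interval` first, then the skipped frames."""
    first_pass = list(range(0, num_frames, interval))
    skipped = [f for f in range(num_frames) if f % interval != 0]
    return first_pass + skipped


def restore_input_order(order, results):
    """Rearrange results (given in processing order) into input frame order."""
    return [result for _, result in sorted(zip(order, results))]


order = extraction_order(7, 3)
results = ["shape%d" % f for f in order]  # stand-in for extraction results
in_input_order = restore_input_order(order, results)
```

With seven frames and an interval of three, frames 0, 3 and 6 are processed first, followed by frames 1, 2, 4 and 5.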
[0187]
Next, a sixth example of the object tracking / extracting device according to the third embodiment will be described with reference to the block diagram of FIG.
[0188]
The extracted frame output control unit 150 receives the image 1501 and the order 1505 of the extracted frames, and outputs an extracted frame 1502. The frame order control unit 151 receives information 1505 regarding the frame order given by the user, and outputs a frame order 1506. That is, the frame order control unit 151 is given a frame interval and determines the frame extraction order. The plurality of object tracking / extracting devices 152 extract / track a target object from a moving image signal; the input of the extraction frame 1502 to them is controlled according to the frame order 1506, and they output the tracking / extraction result 1503. The tracking / extraction result output control unit 153 receives the tracking / extraction result 1503 and the frame order 1506 as inputs, rearranges the results into the order of the image 1501, and outputs them.
[0189]
The skipped frames may be interpolated from the already extracted frames, or may be extracted by the same algorithm by changing the way of selecting reference frame candidates.
[0190]
Here, an example of processing of the object tracking / extracting apparatus in FIG. 21 will be described with reference to FIG.
[0191]
In FIG. 25, a frame indicated by diagonal lines is a frame extracted first with an interval of two frames. The skipped frame is extracted by the second object tracking / extracting device. After the frames on both sides are extracted as shown in FIG. 25, the object shape may be obtained by interpolation from the extraction results of the frames on both sides. Alternatively, parameters such as a threshold value may be changed, or the frames on both sides may be added to the reference frame candidate and extracted by the same method as the frames on both sides.
[0192]
Next, another configuration example of the object tracking / extracting apparatus will be described with reference to the block diagram of FIG.
[0193]
The switch unit SW182 receives the background information 1805 designated by the user as an input, and controls the input of the extraction frame 1801 to the background motion correction unit 183. The background motion correction unit 183 receives the extracted frame 1801 and the frame 1811 from which an object has already been extracted, and outputs a frame 1806 in which background motion is corrected. The background feature quantity detection unit 184 receives the extracted frame 1801 and the frame 1806 corrected for the background motion, and outputs a background feature quantity 1807. The switch unit SW187 receives the background feature value 1807 as input, and controls the input of the object tracking / extraction result 1804 to the reference frame selection unit 188. The graphic setting unit 180 receives the extracted frame 1801, the frame 1811 from which an object has already been extracted, and the initial graphic 1802 set by the user, and outputs an extracted frame 1803 in which a graphic is set. The second object tracking / extracting unit 185 receives an extraction frame 1803 in which a graphic is set and a frame 1811 from which an object has already been extracted, and outputs an object tracking / extraction result 1804. The feature quantity detection unit 185 receives the object tracking / extraction result 1804 and outputs a feature quantity 1808. The switch unit SW186 receives the feature value 1808 and controls the input of the object tracking / extraction result 1804 to the reference frame selection unit. The reference frame selection unit 188 receives the object tracking / extraction result 1804 and the frame 1811 from which an object has already been extracted, and outputs at least two reference frames 1809.
[0194]
The first object tracking / extracting unit 189 receives at least two reference frames 1809 and an extraction frame 1801 as inputs, and outputs an object tracking / extraction result 1810 to the memory 190. The memory 190 holds an object tracking / extraction result 1810.
[0195]
The flow of processing is as follows.
[0196]
The user roughly surrounds the object to be extracted with a rectangle in the initial frame. The rectangles of the subsequent frames are set by extending the rectangle surrounding the already extracted object by several pixels vertically and horizontally. This rectangle is divided into blocks, and the extracted object is pasted at the corresponding position by matching with the already extracted frame. The object shape (predicted object shape) obtained by this processing represents a rough object. If the prediction accuracy does not reach a certain threshold value, the prediction may be redone from another frame to increase the prediction accuracy.
[0197]
When the prediction accuracy is good, all or a part of the predicted shape is output as an extraction result. This method can also extract an object while tracking the object.
[0198]
Blocking performed at the time of object tracking and object shape prediction may be performed by dividing a rectangle into fixed block sizes or by hierarchical block matching with a matching threshold. The frame may be divided at a fixed size and only the block including the object may be used.
[0199]
Considering a case where prediction is bad, the object prediction shape is expanded by several pixels, and irregularities and holes due to prediction errors are corrected. With this method, predicted object shapes are set for all reference frame candidates. Each time an object is extracted, the object tracking figure of the previous frame is re-determined using the object, thereby preventing an extraction error. The tracking figure is set so as to surround the object.
[0200]
Hereinafter, it is assumed that an object has already been extracted for the frames before the extraction frame, and that no object has yet been extracted for the subsequent frames.
[0201]
Reference frame candidates are five frames before and after being shifted at regular intervals with respect to frames extracted at regular intervals. The reference frame candidates are limited to, for example, a total of 5 frames including an initial frame, 3 frames before the current frame, and 1 frame after the current frame. However, if there are no three previous frames, the number of candidates for the subsequent frame is increased, and if there is no one subsequent frame, the previous four frames are set as candidates.
[0202]
An appropriate area is set around the reference frame candidate object, and the background motion between this area and the current frame is detected, thereby deleting the background in the figure of the reference frame. As a method for detecting the background motion, matching is performed on the current frame in the entire area excluding the object, and the motion vector that minimizes the matching error is determined as the background motion.
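The background motion detection described above can be sketched as an exhaustive translation search. This is a minimal illustration, assuming grayscale frames stored as lists of rows, a purely translational background motion, and a mean absolute difference as the matching error; the function name, the search range parameter, and the error measure are illustrative choices not specified in the text.

```python
def estimate_background_motion(ref, cur, object_mask, search=2):
    """Return (dy, dx, mean_error) for the shift that best maps the
    non-object pixels of the reference frame onto the current frame.

    ref, cur     : 2-D lists of luminance values with the same size
    object_mask  : 2-D list, truthy where the reference pixel is object
    search       : maximum shift examined in each direction (assumed)
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            err, n = 0, 0
            for y in range(h):
                for x in range(w):
                    if object_mask[y][x]:
                        continue  # match only on the area excluding the object
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        err += abs(ref[y][x] - cur[yy][xx])
                        n += 1
            if n == 0:
                continue
            mean_err = err / n
            if best is None or mean_err < best[2]:
                best = (dy, dx, mean_err)
    return best
```

The returned matching error could also serve as the motion vector detection error used when deciding whether a reference frame candidate should be removed.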
[0203]
By removing from the candidates those reference frames that have a large motion vector detection error at the time of background motion deletion, it is possible to prevent extraction errors in cases where the background motion deletion is not appropriate. In addition, when the number of reference frame candidates decreases, new reference frame candidates may be selected. If graphic setting or background motion deletion has not yet been performed for a newly added reference frame candidate, it must be performed for that candidate.
[0204]
If it is known in advance that there is no background movement, this process is not performed.
[0205]
A temporary reference frame is set from the reference frame candidates, and a set of reference frames that satisfy Expression (1) or Expression (2) of the second embodiment is selected from the combinations of these frames. If no combination of the reference frame candidates satisfies Expression (1) or Expression (2), it is preferable to select the frames for which O_i ∩ O_j has the smallest number of pixels.
[0206]
In addition, it is preferable to choose the combination of reference frame candidates so that the motion vector detection error at the time of background motion deletion is as small as possible. Specifically, when several reference frame sets satisfy Expression (1) or Expression (2) equally well, the frames with the smaller motion vector detection error at the time of background motion deletion may be selected. When there is no background motion, frames from which the inter-frame difference can be sufficiently detected may be selected preferentially.
[0207]
Further, when the accuracy of object prediction is high and a part of the object is output as it is, a frame satisfying the condition of Expression (1) or Expression (2) is selected only for a region where the object prediction result is not the extraction result.
[0208]
Hereinafter, a process when the two reference frames are selected will be described.
[0209]
When the reference frame is selected, an inter-frame difference from the extracted frame is obtained, and attention is paid to the inter-frame difference in the set figure.
[0210]
The difference between frames is binarized with the set threshold value. The threshold value used for binarization may be constant for the image, or may be changed for each frame according to the accuracy of background motion compensation. As an example of control, if the accuracy of background motion compensation is poor, there are many extra differences in the background, so the threshold for binarization is increased. Further, it may be changed according to the partial luminance gradient, texture, and edge strength of the object. As an example of this control, the binarization threshold is reduced in a relatively flat region such as a region with a small luminance gradient or a region with low edge strength. Furthermore, the user may give a threshold value based on the property of the object.
[0211]
For pixels outside the object tracking graphic, a pixel having a difference value between adjacent background regions is determined as a background pixel. At the same time, with respect to the pixels inside the object tracking figure, pixels that do not have a difference value between adjacent background regions are determined not to be background pixels.
[0212]
The interframe difference cannot be detected in the stationary region of the object. Therefore, when the inter-frame difference from the frame used for prediction is zero and the pixel used for prediction is a pixel inside the object, it is not added as a still region pixel to the background pixel.
[0213]
This background pixel is a common background area for the current frame and one reference frame. At this time, the boundary between the background area and other parts may be unnatural due to the influence of noise, so a filter that smooths the boundary or a filter that removes the extra noise area may be applied to the image signal.
[0214]
When a common background area has been obtained for each reference frame, the area that is not included in either of the two common background areas is detected and extracted as the object area. This extraction result is output for the portion that does not use the previously predicted object shape, so that the shape of the entire object is extracted. If the part using the shape obtained from the common backgrounds and the part using the previously predicted object shape are not consistent with each other, the output result can be improved by applying a filter at the end.
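The combination of the two common background areas can be sketched as follows. This is only an illustration, assuming each common background area is given as a binary mask (truthy for background) of the same size as the frame, and using the 255/0 convention of the shape image for the output; the function name is hypothetical.

```python
def extract_object_region(common_bg_a, common_bg_b):
    """Mark as object (255) every pixel that belongs to neither of the
    two common background masks; all other pixels become 0."""
    return [
        [0 if (a or b) else 255 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(common_bg_a, common_bg_b)
    ]
```

A pixel is treated as object only when both reference frames fail to classify it as background, which is why at least two reference frames are needed.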
[0215]
Finally, the extracted object region is output with the extraction order replaced with the input frame order.
[0216]
The method and apparatus for extracting the shape of an object according to the present invention can be used as an input means for MPEG-4 object coding, which is currently being standardized. One application of MPEG-4 and object extraction is a display system in which a window takes the shape of an object. Such a display system is effective for a multipoint conference system. Rather than displaying the text material and the participants at each site in rectangular windows on a display of limited size, space can be saved by displaying each person in a window shaped like the person, as shown in FIG. If the MPEG-4 functions are used, only the person who is speaking can be enlarged, or people who are not speaking can be made translucent, which improves the usability of the system.
[0217]
As described above, according to the third embodiment, by selecting the object extraction method and apparatus according to the properties of the image, unnecessary processing can be omitted and stable extraction accuracy can be obtained. Also, by removing the restriction of time order, sufficient extraction accuracy can be obtained regardless of the movement of the object.
[0218]
The third embodiment improves the performance of the first embodiment and the second embodiment, and includes the configurations of the first embodiment and the second embodiment and the configurations described in the third embodiment. It can also be used in combination as appropriate.
[0219]
FIG. 27 shows a first configuration example of an object extraction apparatus according to the fourth embodiment of the present invention.
[0220]
A texture image 221, which is input to the object extraction apparatus after being captured by an external camera or read from a storage medium such as a video tape or a video disc, is input to a recording device 222, a switch unit 223, and an object extraction circuit 224 based on motion compensation. The recording device 222 holds the input texture image 221; examples are the hard disks and magneto-optical disks used in personal computers. The recording device 222 is needed in order to use the texture image 221 again later. When the texture image 221 is an image recorded on an external storage medium, it is not necessary to prepare the recording device 222 separately, and that storage medium is used as the recording device 222. In this case, the texture image 221 does not need to be input to the recording device 222 again. The texture image is formed, for example, by arranging pixels, each of whose luminance (Y) is represented by a value from 0 to 255, in raster order (from the upper left pixel of the image toward the right, and from the top line to the bottom line), and is generally called an image signal. It is referred to here as a texture image in order to distinguish it from the shape image described later. As a texture image, color differences (U, V, etc.) and colors (R, G, B, etc.) may be used in addition to luminance.
[0221]
On the other hand, for the first frame, a shape image 225, in which the object to be extracted has been separately specified by the operator, is input to the object extraction circuit 224 based on motion compensation. The shape image is generated, for example, by arranging pixels in raster order in the same way as the texture image, with the pixel value of pixels belonging to the object set to “255” and the pixel value of the other pixels set to “0”.
[0222]
Here, an embodiment for generating the shape image 225 of the first frame will be described in detail with reference to FIG.
[0223]
Although omitted in FIG. 34, it is assumed that the background and the foreground also have designs, and that it is desired to extract the object 226 in the shape of a house. The operator traces the outline of the object 226 with the mouse or the pen on the image 227 displayed on the monitor. An image obtained by substituting “255” for the inner pixel of the contour and “0” for the outer pixel is defined as a shape image. If the operator draws a contour line with great care, the accuracy of the shape image will be high. However, even if the accuracy is low to some extent, the accuracy can be improved by using the following method.
[0224]
FIG. 35 shows a line 228 drawn by the operator and an outline 229 of the object 226. At this stage, the correct position of the contour line 229 is of course not extracted, but the contour line 229 is shown to represent the positional relationship with the line 228.
[0225]
First, blocks are set so as to include the contour line 228. Specifically, the screen is scanned in raster order, and where the contour line 228 is present, that is, where adjacent pixel values in the shape image differ, a block of a predetermined size centered on that pixel is provided. At this time, if the current block would overlap an already set block, the scan is advanced without setting the current block, so that the blocks can be set without overlapping each other, as shown in FIG. However, with this alone, the portions 230, 231 and 232 are not included in any block. Therefore, the screen is scanned again, and where there is a contour pixel not included in any block, a block centered on that pixel is provided. In this second scan, even if the current block overlaps an already set block, the current block is set as long as its central pixel is not included in an already set block. In FIG. 37, blocks 233, 234, 235 and 236 indicated by diagonal lines are the blocks set in the second scan. The block size may be fixed, or it may be set large when the number of pixels surrounded by the contour line 228 is large and small when that number is small; large when the contour line 228 has few irregularities and small when it has many; or large when the pattern of the image is flat and small when the pattern is fine.
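The two-scan block setting can be sketched as follows. This is a minimal illustration, assuming the shape image is a list of rows of 0/255 values, that a contour pixel is one whose value differs from its right or lower neighbor, and that blocks are stored as half-open rectangles (top, left, bottom, right) clipped at the screen edge; all function names are hypothetical.

```python
def is_contour(shape, y, x):
    """A pixel is treated as a contour pixel when its value differs from
    the pixel to its right or the pixel below it (an assumed definition)."""
    h, w = len(shape), len(shape[0])
    v = shape[y][x]
    return (x + 1 < w and shape[y][x + 1] != v) or (y + 1 < h and shape[y + 1][x] != v)


def overlaps(a, b):
    """True when two half-open rectangles (top, left, bottom, right) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def contains(block, y, x):
    return block[0] <= y < block[2] and block[1] <= x < block[3]


def set_blocks(shape, size):
    """Return (blocks, n_first): all blocks, and how many came from scan 1."""
    h, w = len(shape), len(shape[0])
    half = size // 2

    def block_at(y, x):
        # Block centered on (y, x), cut at the screen edge so it does not protrude.
        return (max(0, y - half), max(0, x - half),
                min(h, y - half + size), min(w, x - half + size))

    contour = [(y, x) for y in range(h) for x in range(w) if is_contour(shape, y, x)]
    blocks = []
    # First scan: place a block only if it overlaps no block set so far.
    for y, x in contour:
        candidate = block_at(y, x)
        if not any(overlaps(candidate, b) for b in blocks):
            blocks.append(candidate)
    n_first = len(blocks)
    # Second scan: cover remaining contour pixels; overlap is now allowed,
    # and the center pixel is by construction outside every existing block.
    for y, x in contour:
        if not any(contains(b, y, x) for b in blocks):
            blocks.append(block_at(y, x))
    return blocks, n_first
```

After the second scan every contour pixel lies inside at least one block, while the blocks from the first scan remain mutually non-overlapping.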
[0226]
At the edge of the screen, a block set in the normal way may protrude from the screen. In such a case, only that block is cut at the edge of the screen into a rectangular (non-square) block so that it does not protrude. In this case, the similar block is also rectangular.
[0227]
The above is the block setting method in the shape image.
[0228]
Next, for each block, a similar block is searched for using the texture image. Here, similar means that, when one of two blocks of different sizes is enlarged or reduced to the size of the other, the pixel values of corresponding pixels are substantially equal. For example, for the block 237 of FIG. 38, the block 238 has a similar texture image pattern. Likewise, the block 240 is similar to the block 239, and the block 242 is similar to the block 241. In this embodiment, the similar block is made larger than the block set on the contour line. In addition, the similar block need not be searched for over the entire screen; for example, as shown in FIG. 39, it is sufficient to search within a certain range, such as the blocks 244, 245, 246 and 247 near the block 243. FIG. 39 shows a case where the center of each block is taken as its starting point, and the starting points of the blocks 244, 245, 246 and 247 are obtained by moving the starting point of the block 243 by a predetermined pixel width in the vertical and horizontal directions. FIG. 40 shows the case where the starting point is placed at the upper left corner of the block.
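The similar block search can be sketched as follows. This is only an illustration, assuming a luminance-only texture image stored as a list of rows, a similar block exactly twice the block size, reduction by averaging 2 × 2 pixels, a sum of absolute differences as the error, and a search centered so that the similar block surrounds the block; these choices and the function names are assumptions, not details fixed by the text.

```python
def reduce_by_2(img, top, left, size):
    """Reduce the (2*size) x (2*size) region at (top, left) to size x size
    by averaging each 2 x 2 group of pixels."""
    return [
        [(img[top + 2 * y][left + 2 * x] + img[top + 2 * y][left + 2 * x + 1]
          + img[top + 2 * y + 1][left + 2 * x] + img[top + 2 * y + 1][left + 2 * x + 1]) / 4
         for x in range(size)]
        for y in range(size)
    ]


def find_similar_block(texture, top, left, size, search):
    """Return (error, sim_top, sim_left) of the best similar block, or None
    when every candidate in the search range protrudes from the screen."""
    h, w = len(texture), len(texture[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t = top - size // 2 + dy
            l = left - size // 2 + dx
            # Candidates that partially protrude from the screen are skipped.
            if t < 0 or l < 0 or t + 2 * size > h or l + 2 * size > w:
                continue
            reduced = reduce_by_2(texture, t, l, size)
            err = sum(abs(reduced[y][x] - texture[top + y][left + x])
                      for y in range(size) for x in range(size))
            if best is None or err < best[0]:
                best = (err, t, l)
    return best
```

For a block containing a straight vertical edge, the error is zero only for a candidate in which the edge sits at twice the offset it has in the block, which is the positional relationship the later replacement step relies on.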
[0229]
Even within the search range, similar blocks that partially protrude from the screen are excluded from the search. However, if the block is at the edge of the screen, all similar blocks in the search range may end up being excluded. In such a case, the block at the edge of the screen is handled by shifting the search range toward the inside of the screen.
[0230]
The amount of calculation for the similar block search can be reduced by performing a multistage search. In a multistage search, instead of searching the entire search range while shifting the starting point by one pixel or half a pixel, for example, the error is first examined at starting points placed at coarse intervals. Next, the error is examined at more finely spaced starting points only around the starting point where the error was small, and this is repeated to narrow down the position of the similar block.
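A multistage search of this kind can be sketched as follows. This is a minimal illustration, assuming the matching error is supplied as a function of the displacement (dy, dx), that the first stage samples a coarse grid, and that each later stage halves the step and examines the 3 × 3 neighborhood of the current best point; the function name, the step schedule, and the neighborhood are illustrative assumptions.

```python
def multistage_search(error_at, search, coarse_step=4):
    """Return (error, dy, dx) found by a coarse-to-fine search.

    error_at    : callable (dy, dx) -> matching error
    search      : maximum displacement in each direction
    coarse_step : spacing of the first-stage grid (assumed power of two)
    """
    # Stage 1: examine the error only at coarsely spaced starting points.
    best = min(
        (error_at(dy, dx), dy, dx)
        for dy in range(-search, search + 1, coarse_step)
        for dx in range(-search, search + 1, coarse_step)
    )
    # Later stages: refine only around the point where the error was small.
    step = coarse_step // 2
    while step >= 1:
        _, cy, cx = best
        best = min(
            (error_at(cy + sy * step, cx + sx * step), cy + sy * step, cx + sx * step)
            for sy in (-1, 0, 1)
            for sx in (-1, 0, 1)
            if abs(cy + sy * step) <= search and abs(cx + sx * step) <= search
        )
        step //= 2
    return best
```

With a search range of ±8 and a coarse step of 4, this examines 25 + 9 + 9 positions instead of the 289 positions of a full one-pixel search, at the risk of settling on a local minimum when the error surface is not smooth.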
[0231]
In the search for similar blocks, if the similar block reduction process is performed every time, a long processing time is required. Therefore, if a reduced version of the entire image is generated in advance and stored in another memory, only the data corresponding to the similar block need be read from the memory.
[0232]
In FIG. 38, similar blocks are shown only for the three blocks 237, 239, and 241, but actually, similar blocks are obtained for all the blocks shown in FIG. The above is the method for searching for similar blocks. It is important to search for similar blocks using texture images instead of shape images. When considering a primary transformation that maps a similar block to a block in the screen, the contour line of the texture image is unchanged in the primary transformation.
[0233]
Next, a method of correcting the contour of the shape image so as to match the contour of the texture image using the positional relationship between each block and its similar block will be described.
[0234]
In FIG. 41, a contour line 228 is the line drawn by the operator. This line should be brought close to the correct contour line 229. For this purpose, the portion of the shape image corresponding to the similar block 238 is read out, reduced to the same size as the block 237, and used to replace the portion of the shape image corresponding to the block 237. This operation has the property of bringing the contour line closer to an invariant set that includes the fixed point of the primary transformation from the similar block to the block, so the contour line 228 approaches the contour line 229. When one side of the similar block is twice as long as one side of the block, the distance between the contour line 228 and the correct contour line 229 is generally halved by one replacement. The result of performing this replacement once for all the blocks is the contour line 248 in FIG. If this block replacement is repeated, the contour line 248 gets closer to the correct contour line, and eventually matches the correct contour line as shown in FIG. In practice, making the deviation between the two contour lines smaller than the inter-pixel distance is meaningless, so the replacement is terminated after an appropriate number of repetitions. This method is effective when the contour line of the texture image is included in the N × N pixel blocks set in the shape image. In this case, the distance between the contour line of the shape image and the contour line of the texture image is at most approximately N / 2. When the length of one side of the similar block is A times the length of one side of the block, the distance between the two contour lines becomes 1 / A of its previous value with each replacement. The condition for this distance to become smaller than one pixel after x replacements is therefore
(N / 2) × (1 / A) ^ x < 1
where ^ denotes a power, meaning that (1 / A) is multiplied x times. Solving for x gives
x > log (2 / N) / log (1 / A)
For example, when N = 8 and A = 2,
x > 2
and thus three replacements are sufficient.
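The number of replacements given by this condition can be checked numerically. The following is a small sketch, assuming N is the block side length in pixels and A is the size ratio of the similar block to the block (A > 1); the function name is hypothetical.

```python
def replacements_needed(block_size, scale):
    """Smallest integer x such that (N / 2) * (1 / A) ** x < 1,
    where N = block_size and A = scale (must be greater than 1)."""
    if scale <= 1:
        raise ValueError("scale must be greater than 1")
    distance = block_size / 2  # worst-case initial contour deviation
    x = 0
    while distance >= 1:
        distance /= scale  # each replacement shrinks the deviation by 1 / A
        x += 1
    return x
```

For N = 8 and A = 2 the deviation goes 4, 2, 1, 0.5, so the function returns 3, in agreement with the example above.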
[0235]
A block diagram of this object extraction apparatus is shown in FIG. First, the shape image 249 input by the operator is recorded in the shape memory 250. In the shape memory 250, blocks are set as described above with reference to FIGS. On the other hand, the texture image 251 is recorded in the texture memory 252. From the texture memory 252, the texture image 254 of a block is sent to the search circuit 255 with reference to the block position information 253 sent from the shape memory 250. At the same time, similar block candidates are also sent from the texture memory 252 to the search circuit 255 as described with reference to FIGS. In the search circuit 255, each similar block candidate is reduced, its error with respect to the block is calculated, and the candidate with the smallest error is determined to be the similar block. Possible error measures include the sum of absolute differences of luminance values and the sum of absolute differences of color differences. Using the color difference as well increases the amount of calculation compared with using the luminance alone. However, even when the luminance step at the contour of the object is small, the similar block can be determined correctly if the color difference step is large, so the accuracy improves. Similar block position information 256 is sent to the reduction conversion circuit 257. A shape image 258 of the similar block is also sent from the shape memory 250 to the reduction conversion circuit 257. In the reduction conversion circuit 257, the shape image of the similar block is reduced, and the reduced similar block is returned to the shape memory 250 as the shape image 259 whose contour has been corrected, overwriting the shape image of the corresponding block. When the number of replacements in the shape memory 250 reaches a predetermined number, the corrected shape image 259 is output to the outside.
The shape memory 250 may be overwritten sequentially block by block. Alternatively, two memory screens may be prepared, the shape image of the entire screen first copied from one to the other, and each block of the contour portion then replaced with a reduced version of its similar block.
[0236]
This object extraction method will be described with reference to the flowchart of FIG.
[0237]
(Object extraction method by reduced block matching in frame)
In step S31, a block is set in the contour portion of the shape data. In step S32, a similar block whose image data is similar to the current processing block is found from the same image data. In step S33, the shape data of the current processing block is replaced with data obtained by reducing the shape data of the similar block.
[0238]
If the number of processed blocks reaches a predetermined number in step S34, the process proceeds to step S35. If not, the process target is advanced to the next block and the process returns to step S32.
[0239]
In step S35, if the number of replacements reaches a predetermined number, the process proceeds to step S36. If not, the process returns to step S31 with the replaced shape data as the processing target. In step S36, the shape data that has been repeatedly replaced is output as an object region.
[0240]
This method is effective when the edge in a block matches the edge in its similar block. Therefore, when there are a plurality of edges in a block, the edges may not be matched correctly, and such a block is kept as it is without being replaced. Specifically, each line of the shape image of the block is scanned in the horizontal and vertical directions, and a block is not replaced if it has more than a predetermined number of lines in which the value changes from “0” to “255” or from “255” to “0” at two or more points. Also, even at the boundary between the object and the background, the luminance or the like may be flat in some portions. Since the effect of edge correction cannot be expected in such a case either, a block whose texture image variance is equal to or smaller than a predetermined value is not replaced, and the input edge is kept.
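The two conditions for leaving a block unreplaced can be sketched as follows. This is only an illustration, assuming the block of the shape image and the block of the texture image are given as small lists of rows; the function names and the two threshold parameters are hypothetical.

```python
def count_transitions(line):
    """Number of points along a line where the value changes (0 <-> 255)."""
    return sum(1 for a, b in zip(line, line[1:]) if a != b)


def has_multiple_edges(shape_block, max_lines):
    """True when more than max_lines horizontal or vertical lines of the
    block contain two or more value changes."""
    rows = [list(r) for r in shape_block]
    cols = [list(c) for c in zip(*shape_block)]
    n = sum(1 for line in rows + cols if count_transitions(line) >= 2)
    return n > max_lines


def variance(texture_block):
    values = [v for row in texture_block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def should_skip_replacement(shape_block, texture_block, max_lines, min_variance):
    """Keep the input edge when the block has several edges or a flat texture."""
    return has_multiple_edges(shape_block, max_lines) or variance(texture_block) <= min_variance
```

A block with a single straight edge has at most one change per line and is replaced normally, whereas a stripe crossing the block produces two changes per line and is skipped.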
[0241]
If the error of the similar block does not become smaller than a predetermined value, reduction may be given up and the similar block may be obtained at the same size as the block. In this case, the similar block is selected so that it does not overlap the block itself. A block replaced without reduction does not by itself have the effect of correcting the edge, but because a corrected edge is copied into it from a block whose edge has been corrected through reduction, its edge is also corrected indirectly.
[0242]
The flowchart shown in FIG. 48 is an example in which the shape image is replaced immediately after a similar block is found. However, by storing the position information of the similar blocks of all the blocks, it is also possible to first perform the similar block search for all blocks and then perform the shape image replacement for all blocks. This method will be described with reference to the flowchart of FIG.
[0243]
In this example, the replacement of the shape image can be repeated a plurality of times for one similar block search.
[0244]
In step S41, a block is set in the contour portion of the shape data. In step S42, a similar block whose image data pattern is similar to that of the current processing block is found in the same image data. In step S43, when the process of finding similar blocks for all the blocks is completed, that is, when the number of processed blocks reaches a predetermined number, the process proceeds to step S44. Otherwise, the process returns to step S42. In step S44, the shape data of the current processing block is replaced with the reduced shape data of the similar block.
[0245]
In step S45, when the replacement process is completed for all blocks, that is, when the number of processed blocks reaches the predetermined number, the process proceeds to step S46. Otherwise, the process returns to step S44. In step S46, if the number of replacements of all blocks reaches a predetermined number, the process proceeds to step S47. Otherwise, the process returns to step S44. In step S47, the shape data on which the replacement conversion has been repeated is output as an object region.
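The repeated replacement of steps S44 to S47 can be sketched as follows, with the similar block positions already found. This is a minimal illustration, assuming a similar block exactly twice the block size, reduction of the shape image by a 2 × 2 vote (255 when at least two of the four pixels are 255), and the two-memory variant in which each pass reads from a copy of the previous shape image; the function names and the tuple layout are hypothetical.

```python
def reduce_shape(src, top, left, size):
    """Reduce the (2*size) x (2*size) shape region at (top, left) to
    size x size; an output pixel is 255 when at least 2 of its 4 source
    pixels are 255 (an assumed rule)."""
    out = []
    for j in range(size):
        row = []
        for k in range(size):
            y, x = top + 2 * j, left + 2 * k
            count = sum(src[y + a][x + b] == 255 for a in (0, 1) for b in (0, 1))
            row.append(255 if count >= 2 else 0)
        out.append(row)
    return out


def correct_contour(shape, pairs, size, iterations):
    """Repeat the replacement for all blocks a fixed number of times.

    pairs : list of (block_top, block_left, similar_top, similar_left),
            i.e. the stored result of the similar block search
    """
    shape = [row[:] for row in shape]
    for _ in range(iterations):
        src = [row[:] for row in shape]  # second memory screen
        for bt, bl, st, sl in pairs:
            reduced = reduce_shape(src, st, sl, size)
            for j in range(size):
                shape[bt + j][bl:bl + size] = reduced[j]
    return shape
```

As a usage example, take a 32 × 32 image whose true vertical edge is at column 16 while the drawn contour is at column 18, with 8 × 8 blocks at column 12 whose similar blocks start at column 8. Three passes move the contour to columns 17, 16 and 16, so it reaches the true edge within the three replacements estimated for N = 8 and A = 2.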
[0246]
Next, a block setting method capable of increasing the accuracy of edge correction will be described.
[0247]
As described above, in the method of setting blocks around the contour line of the shape image, a part of the correct contour line 301 may not be included in any block, as shown in FIG. 51A. Here, the contour line 302 of the shape image is indicated by a thick line. Assuming that the lower right side of the contour line is the object and the upper left side is the background, the portion 303 that is actually background is erroneously set as object, but because it is not included in any block it cannot be corrected. Thus, if there is a gap between the blocks and the correct contour line, the contour is not corrected properly.
[0248]
In order to reduce the gap between the block and the correct contour line, there is a method in which the blocks are overlapped to some extent as shown in FIG. As a result, the number of blocks increases, so that the amount of calculation increases but the gap 304 decreases. Therefore, the accuracy of extraction is improved. However, in this example, the gap is still not completely removed.
[0249]
In order to reduce the gap, it is also effective to increase the block size as shown in FIG. In this example, the above-described block superposition is used together. Thereby, in this example, the gap is completely eliminated.
[0250]
As described above, it is effective to increase the block size in order to expand the range in which the contour line can be corrected. However, if the block size is too large, the shape of the contour line included in the block becomes complicated, and it becomes difficult to find similar blocks. An example is shown in FIG.
[0251]
In FIG. 52A, the hatched portion 305 represents the object region, and the white portion 306 represents the background region. The contour line 307 of the given shape image is indicated by a black line. As shown, the contour line 307 of the shape image is far from the correct contour line, and the correct contour line has irregularities. FIG. 52B shows the result of arranging blocks by a method different from the one described above. Here, the image is first divided into rectangular blocks that neither overlap each other nor leave gaps. The variance of the texture image is calculated for each block, and blocks whose variance is smaller than a predetermined value are cancelled. Therefore, in FIG. 52 (b), only blocks whose variance is equal to or greater than the predetermined value remain. A similar block is obtained for each of these blocks. However, near the block 308, for example, there is no pattern corresponding to the block enlarged twofold vertically and horizontally, and the same applies to many other blocks. Therefore, although the portion with the smallest error is selected as the similar block, repeating the replacement conversion of the shape image using that positional relationship does not produce a match with the correct contour line, as shown in FIG. Nevertheless, compared with the contour line 307 of the shape image in FIG. 52A, the contour line 309 of the shape image in FIG. 52C after edge correction reflects the rough unevenness of the contour line of the texture image (peaks on the left and right with a valley between them). If the block size were reduced in this example, even this rough correction could not be performed.
[0252]
As described above, when the block size is increased in order to widen the correction range, the shape of the contour line included in a block becomes complicated, and it may be difficult to find similar blocks. As a result, edge correction is performed only roughly. In such a case, the correction accuracy improves if edge correction is first performed with a large block size and then performed again on the result with a reduced block size. FIG. 52 (d) shows the result of correcting FIG. 52 (c) again with the block size reduced to 1/2 vertically and horizontally, and then once more with it reduced to 1/4. Thus, the correction accuracy can be improved by repeating the correction while gradually reducing the block size.
[0253]
A method of gradually reducing the block size will be described with reference to the flowchart of FIG.
[0254]
In step S51, the block size b is set to A. In step S52, edge correction similar to the edge correction shown in FIG. 48 or 50 is performed. In step S53, b is checked, and when b is smaller than Z (< A), the process ends. If b is greater than or equal to Z, the process proceeds to step S54. In step S54, the block size b is halved and the process returns to step S52.
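The control flow of steps S51 to S54 can be sketched as follows. This is only an illustration of the loop as literally described, returning the sequence of block sizes at which edge correction would be run; the function name is hypothetical and the edge correction itself is represented by a comment.

```python
def block_size_schedule(a, z):
    """Return the block sizes used for edge correction, following
    S51 (b = A), S52 (correct), S53 (stop if b < Z), S54 (halve b)."""
    if z < 2 or a < z:
        raise ValueError("expects A >= Z >= 2")
    sizes = []
    b = a                # S51
    while True:
        sizes.append(b)  # S52: edge correction would run here with size b
        if b < z:        # S53
            break
        b //= 2          # S54
    return sizes
```

With A = 32 and Z = 8 this yields 32, 16, 8, 4. Note that, read literally, the check in S53 comes after the correction, so the last correction runs at a size already below Z.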
[0255]
As described above, the correction accuracy is improved by starting with a large block size and repeating the correction while gradually decreasing the block size.
[0256]
FIG. 54 (a) shows an example in which the blocks are tilted by 45 degrees so that a gap is hardly formed between the blocks and the correct contour line. In this way, when the contour line is oblique, the correct contour line can be covered by tilting the blocks, without increasing the block size as much as in FIG. In this example, the correct contour line can be covered as shown in FIG. Thus, by tilting the sides of the blocks in the same direction as the contour line of the shape image, a gap between the blocks and the correct contour line can be made less likely to occur. Specifically, the inclination of the contour line of the shape image is detected; when it is close to horizontal or vertical, the blocks are oriented as in FIG. 51 (c), and otherwise they are tilted as in FIG. (b). Whether the inclination is close to horizontal or vertical is judged by comparison with a threshold value.
[0257]
The above is the object extraction process for the first frame. This technique can be used not only for the first frame of a moving image but also for still images in general. Note that the block setting and the similar block search may be redone at every replacement, for example by setting blocks again for the shape image that has been replaced once, obtaining similar blocks again, and then performing the second replacement. This increases the amount of calculation, but improves the effect of the correction.
[0258]
In addition, it is preferable that the similar block is selected from a portion as close as possible to the block. Therefore, the range for searching for the similar block may be switched depending on the block size. That is, when the block size is large, the range for searching for similar blocks is widened, and when the block size is small, the range for searching for similar blocks is narrowed.
[0259]
In this method, a small hole or an isolated small region may appear as an error in the shape data in the process of replacing the shape data. Therefore, the correction accuracy can be improved by removing small holes and isolated small regions from the shape data before steps S34, S35, S36, S45, S46, S47, S53, and the like. To remove small holes and isolated small regions, for example, the processing combining expansion and contraction described on pages 575 to 576 of the Image Analysis Handbook (supervised by Takagi and Shimoda, The University of Tokyo Press, first edition, January 1991), or the majority filter described on page 677 of the same book, is used.
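As one possible illustration of the removal of small holes and isolated small regions, a 3x3 majority filter over a binary shape image might be sketched as follows. This is a generic majority filter written under the assumption that the shape image is a list of rows holding "0" (background) and "255" (object); it is not taken from the cited handbook, and the window size is an assumed example.

```python
def majority_filter(shape, object_value=255, background_value=0):
    """Apply a 3x3 majority filter to a binary shape image (list of rows)."""
    height, width = len(shape), len(shape[0])
    result = [row[:] for row in shape]
    for y in range(height):
        for x in range(width):
            object_count = 0
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # Pixels outside the image are simply not counted.
                    if 0 <= ny < height and 0 <= nx < width:
                        total += 1
                        if shape[ny][nx] == object_value:
                            object_count += 1
            # The pixel takes the label held by more than half of its neighborhood.
            if 2 * object_count > total:
                result[y][x] = object_value
            else:
                result[y][x] = background_value
    return result
```

A single background pixel inside an object region is filled, and a single object pixel inside the background is removed, while larger regions keep their labels away from their boundaries.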
[0260]
Further, the blocks may be set more simply as shown in FIG. That is, the screen is simply divided into blocks, and similar blocks are searched or replaced only for blocks including the outline 228, such as the block 2200.
[0261]
Also, if the given texture image has been compressed in advance by fractal coding (Japanese Patent Publication No. 08-329255 “Image Segmentation Method and Apparatus”), information on the similar block of each block is included in the compressed data. Therefore, if the compressed data is used to obtain the similar blocks of the blocks including the contour line 228, it is not necessary to search for the similar blocks again.
[0262]
Returning to FIG. 27, the description of the object extracting apparatus for extracting an object from a moving image will be continued.
[0263]
The object extraction circuit 224 based on motion compensation generates a shape image 260 of the second and subsequent frames from the shape image 225 of the first frame, using the motion vectors detected from the texture image 221.
[0264]
FIG. 29 shows an example of the object extraction circuit 224 by motion compensation. A shape image 225 of the first frame is recorded in the shape memory 261. In the shape memory 261, blocks are set over the entire screen, as shown in the frame 262 in FIG. 45. Meanwhile, the texture image 221 is sent to the motion estimation circuit 264 and is also recorded in the texture memory 263. From the texture memory 263, the texture image 265 of the previous frame is sent to the motion estimation circuit 264. In the motion estimation circuit 264, for each block of the current processing frame, the reference block with the smallest error is found within the previous frame. FIG. 45 shows an example of a block 267 and the reference block 268 selected from the frame 266 one frame before. Here, if the error is smaller than a predetermined threshold, the reference block is made larger than the block. FIG. 45 also shows an example of a reference block 270 that is twice as large as the block 269.
[0265]
Returning to FIG. 29, the reference block position information 271 is sent to the motion compensation circuit 272. The shape image 273 of the reference block is also sent from the shape memory 261 to the motion compensation circuit 272. In the motion compensation circuit 272, the shape image of the reference block is used as it is when the reference block is the same size as the block, and is reduced when the reference block is larger than the block; the result is output as the shape image 260 of the current processing frame. Also, in preparation for the next frame, the shape image 260 of the current processing frame is sent to the shape memory 261, where the shape image of the entire screen is overwritten.
[0266]
When the reference block is larger than the block, as described with reference to FIGS. 41 and 42, a contour that has shifted from the correct position is corrected. Therefore, starting from the shape image given for the first frame, the object is extracted with high accuracy in every subsequent frame of the moving image sequence. Unlike the conventional method, there is no problem of poor accuracy at the beginning of the moving image sequence or when the movement of the object is small.
[0267]
Object extraction by motion compensation between frames will be described with reference to the flowchart of FIG.
[0268]
In step S21, the current processing frame is divided into blocks. In step S22, a reference block whose image data has a pattern similar to that of the current processing block and whose size is larger than that of the current processing block is found from each frame or from a frame for which shape data has already been obtained. In step S23, the sub-block obtained by cutting out and reducing the shape data of the reference block is pasted to the current processing block.
[0269]
In step S24, if the number of processed blocks has reached a predetermined number, the process proceeds to step S25. If not, the processing target is advanced to the next block and the process returns to step S22. In step S25, the pasted shape data is output as the object region.
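Steps S21 to S25 might be sketched as follows, purely as an illustration. The sketch assumes a single-channel texture image stored as a list of rows, an exhaustive search over the whole reference frame, a reference block exactly twice the block size, reduction by keeping every other sample, and the sum of absolute differences as the error; all function names and these choices are assumptions introduced here rather than details given in the embodiment.

```python
def crop(image, top, left, size):
    """Cut a size x size block out of an image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]

def reduce_half(block):
    """Halve a block by keeping every other sample (assumed reduction method)."""
    return [row[::2] for row in block[::2]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def propagate_shape(cur_tex, ref_tex, ref_shape, block=2):
    """Build the shape image of the current frame from a reference frame.
    Image sizes are assumed to be multiples of the block size."""
    height, width = len(cur_tex), len(cur_tex[0])
    big = 2 * block
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, block):            # S21: divide into blocks
        for left in range(0, width, block):        # S24: loop over all blocks
            target = crop(cur_tex, top, left, block)
            best = None
            # S22: find the larger reference block whose reduced texture
            # is most similar to the current block.
            for ry in range(len(ref_tex) - big + 1):
                for rx in range(len(ref_tex[0]) - big + 1):
                    err = sad(target, reduce_half(crop(ref_tex, ry, rx, big)))
                    if best is None or err < best[0]:
                        best = (err, ry, rx)
            _, ry, rx = best
            # S23: cut out, reduce, and paste the shape data of the reference block.
            patch = reduce_half(crop(ref_shape, ry, rx, big))
            for dy in range(block):
                for dx in range(block):
                    out[top + dy][left + dx] = patch[dy][dx]
    return out                                     # S25: output the object region
```

In practice the search would be limited to a neighborhood of the block, as the description of the search range suggests; the exhaustive search is used here only to keep the sketch short.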
[0270]
Here, “each frame” means the first frame in the present embodiment, that is, a frame for which a shape image is given in advance. Further, the reference block need not be taken from the frame one frame before; as described here, it may be taken from any frame for which a shape image has already been obtained.
[0271]
The above is the description of object extraction using motion compensation. As the object extraction circuit 224, in addition to the method described above, there is a method using an inter-frame difference image disclosed in Japanese Patent Application Laid-Open No. 10-001847 “Moving Image Object Tracking / Extraction Device”.
[0272]
Returning to FIG. 27, the description of the embodiment of the object extracting apparatus for extracting an object from a moving image will be continued.
[0273]
The shape image 260 is sent to the switch unit 223 and the switch unit 281. In the switch unit 223, when the shape image 260 is “0” (background), the texture image 221 is sent to the background memory 274 and recorded. When the shape image 260 is “255” (object), the texture image 221 is not sent to the background memory 274. This is performed for several frames, and if the shape image 260 is accurate to some extent, an image of only the background portion that does not include an object is generated in the background memory 274.
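The behavior of the switch unit 223 and the background memory 274 over several frames might be sketched as follows. The function name `update_background` and the use of `None` for pixels not yet written are assumptions made for this illustration.

```python
def update_background(background, texture, shape):
    """Copy into the background memory every texture pixel whose shape value
    is 0 (background); pixels whose shape value is 255 (object) are skipped."""
    for y, row in enumerate(shape):
        for x, label in enumerate(row):
            if label == 0:
                background[y][x] = texture[y][x]
    return background
```

Calling the function once per frame fills in, frame by frame, the parts of the background that the moving object uncovers.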
[0274]
Next, the texture image 275 is read again from the recording device 222, either in order from the first frame or only for the frames from which the operator has specified that the object is to be extracted, and is input to the difference circuit 276. At the same time, the background image 277 is read from the background memory 274 and input to the difference circuit 276. The difference circuit 276 obtains a difference value 278 between pixels at the same position in the screen in the texture image 275 and the background image 277, and inputs the difference value 278 to the object extraction circuit 279 using the background image. The object extraction circuit 279 generates a shape image 280. The shape image 280 is generated by regarding a pixel whose absolute difference value 278 is larger than a predetermined threshold value as belonging to the object and setting its pixel value to “255”, and regarding every other pixel as belonging to the background and setting its pixel value to “0”. When not only luminance but also color difference or color is used as the texture image, the sum of the absolute values of the differences of the respective signals is compared with a threshold value to determine whether the pixel is object or background. Alternatively, a threshold value is determined separately for luminance and for color difference, and the pixel is determined to be object when the absolute value of the difference exceeds the threshold in either luminance or color difference, and background otherwise. The shape image 280 generated in this way is sent to the switch unit 281. In addition, a selection signal 282 determined by the operator is input from the outside to the switch unit 281, and either the shape image 260 or the shape image 280 is selected by the selection signal 282 and output to the outside as the shape image 283. The operator displays the shape image 260 and the shape image 280 on a display or the like and selects the correct one. Alternatively, processing time can be saved by displaying the shape image 260 when it is generated, generating the shape image 280 only if the accuracy of the shape image 260 is not satisfactory, and, if the accuracy is satisfactory, outputting the shape image 260 to the outside as the shape image 283 without generating the shape image 280. The selection may be performed for each frame or for each moving image sequence.
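The decision made from the difference value 278 might be sketched as follows for the case where luminance and color difference are used together and the sum of the absolute differences is compared with a single threshold. The function name and the representation of each pixel as a `(Y, U, V)` tuple are assumptions made for this illustration.

```python
def extract_by_background_difference(frame, background, threshold):
    """frame and background are lists of rows of (Y, U, V) tuples.
    Returns a shape image holding 255 (object) or 0 (background)."""
    shape = []
    for frame_row, bg_row in zip(frame, background):
        out_row = []
        for pixel, bg in zip(frame_row, bg_row):
            # Sum of the absolute differences of the three signals.
            diff = sum(abs(a - b) for a, b in zip(pixel, bg))
            out_row.append(255 if diff > threshold else 0)
        shape.append(out_row)
    return shape
```

The alternative with separate thresholds for luminance and color difference would replace the single comparison with one comparison per signal, combined by a logical OR.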
[0275]
An object extraction method corresponding to the object extraction apparatus of FIG. 27 will be described with reference to the flowchart of FIG.
[0276]
(Object extraction method using background image)
In step S11, the shape data of each frame is generated by motion compensation of the given shape data in each frame. In step S12, the image data of the background area determined by the shape data is stored in the memory as a background image.
[0277]
In step S13, if the number of processed frames has reached a predetermined number, the process proceeds to step S14. If not, the processing target is advanced to the next frame and the process returns to step S11. In step S14, pixels for which the absolute value of the difference between the image data and the background image is large are set as the object region, and the other pixels are set as the background region.
[0278]
In the present embodiment, the background may move, for example when the camera capturing the images moves. In this case, the motion of the entire background from the previous frame (the global motion vector) is detected. In the first scan, the background is recorded in the background memory at a position shifted from the previous frame by the amount of the global motion vector. In the second scan, the portion shifted from the previous frame by the amount of the global motion vector is read from the background memory. If the global motion vector detected in the first scan is recorded in a memory and then read out and used in the second scan, the time required to obtain the global motion vector again can be saved. If the background is known to be stationary, for example because the camera is fixed, the operator can set a switch so that the global motion vector is not detected and is always treated as zero, which saves further processing time. When the global motion vector is obtained with half-pixel accuracy, the background memory is given twice the pixel density of the input image. That is, the pixel values of the input image are written to every other pixel of the background memory. For example, if the background has moved by 0.5 pixels in the horizontal direction in the next frame, the pixel values are again written to every other pixel, this time between the previously written pixels. With this scheme, some pixels of the background image may never have been written when the first scan is completed. In that case, the gaps are filled by interpolation from the surrounding written pixels.
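The double-density background memory might be sketched in one dimension as follows. The sketch assumes that the global motion is expressed as an integer number of half pixels, that unwritten pixels are marked with `None`, and that gaps are filled with the average of the written neighbors on either side; the function names, the sign convention of the offset, and the interpolation rule are assumptions introduced here.

```python
def write_row(memory, row, offset_half_pixels):
    """Write one input row into a memory row of twice the pixel density.
    Input pixel x lands at memory index 2 * x + offset_half_pixels."""
    for x, value in enumerate(row):
        index = 2 * x + offset_half_pixels
        if 0 <= index < len(memory):
            memory[index] = value

def fill_gaps(memory):
    """Return a copy in which each unwritten pixel that has at least one
    written neighbor is replaced by the average of its written neighbors."""
    filled = memory[:]
    for i, value in enumerate(memory):
        if value is None:
            neighbors = [memory[j] for j in (i - 1, i + 1)
                         if 0 <= j < len(memory) and memory[j] is not None]
            if neighbors:
                filled[i] = sum(neighbors) / len(neighbors)
    return filled
```

A frame with zero offset fills the even positions; a later frame displaced by half a pixel fills the odd positions, so that interpolation is needed only where no frame ever supplied a value.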
[0279]
Regardless of whether or not half-pixel motion vectors are used, no pixel value is written to the background memory, even after the first scan is completed, for a portion that never becomes the background region anywhere in the moving image sequence. Such an undefined portion is always determined to be an object in the second scan. Instead of preparing a memory for recording the undefined portions and checking each time whether a pixel is undefined, the background memory may be initialized in advance with a pixel value that is expected to appear rarely in the background, such as (Y, U, V) = (0, 0, 0), before the first scan is started. Since this initial pixel value remains in the undefined pixels, they are automatically determined to be an object in the second scan.
[0280]
In the description so far, when the background memory is generated, the pixel value of the background region is written even over a pixel to which a background pixel value has already been assigned. In this case, for a portion that is background both at the beginning and at the end of the moving image sequence, the background pixel value from the end of the sequence is recorded in the background memory. If the background has exactly the same pixel values at the beginning and at the end of the moving image sequence, there is no problem. However, when the pixel values change slightly from frame to frame because the camera moves very slowly or the brightness of the background changes little by little, the difference in pixel value between the background at the beginning and the background at the end of the sequence becomes large, and if such a background memory is used, the background portion of the early frames of the sequence is also erroneously detected as an object. Therefore, only the pixels that were not in the background region in any previous frame and have become the background region for the first time in the current processing frame are written to the background memory, and pixels to which a background pixel value has already been assigned are not overwritten. The background of the early part of the moving image sequence is then recorded in the background memory, so that the object is correctly extracted there. In the second scan, if the background region of the current processing frame is also written over the background memory according to the object extraction result, the backgrounds that are compared are the background of the frame immediately before the current processing frame and the background of the current processing frame, which are highly correlated, so that this portion is less likely to be erroneously detected as an object. The overwriting in the second scan is effective when the background changes slightly; therefore, if the operator sets the switch to indicate that there is no background movement, the overwriting is not performed. This switch may be shared with the switch described above that selects whether or not the global motion vector is detected.
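The first scan with an initialized background memory and without overwriting might be sketched as follows. The sentinel `(0, 0, 0)` follows the example value given in the description; the function name `first_scan` and the list-of-rows image layout are assumptions made for this illustration. A background pixel whose true value happens to equal the sentinel would be treated as undefined, which is the reason a rarely occurring value is chosen.

```python
UNDEFINED = (0, 0, 0)  # initial (Y, U, V) value expected to be rare in real backgrounds

def first_scan(frames, shapes):
    """Build the background memory so that the first background value seen
    at each pixel is kept and later frames do not overwrite it."""
    height, width = len(shapes[0]), len(shapes[0][0])
    background = [[UNDEFINED] * width for _ in range(height)]
    for frame, shape in zip(frames, shapes):
        for y in range(height):
            for x in range(width):
                # Write only pixels that are background now and still undefined.
                if shape[y][x] == 0 and background[y][x] == UNDEFINED:
                    background[y][x] = frame[y][x]
    return background
```

Pixels that remain `UNDEFINED` after the scan differ strongly from almost any input pixel and are therefore classified as object by the difference test of the second scan.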
[0281]
Since the purpose of the first scan is to generate a background image, it is not always necessary to use all frames. Even if frames are thinned out, for example by skipping one frame or two frames at a time, the same background image is obtained and the processing time is shortened.
[0282]
If, within the background region, only the pixels whose inter-frame difference is equal to or less than a threshold are recorded in the background memory, other objects that enter the screen are not recorded in the background memory. Further, when the object region in the first scan is erroneously detected as lying further inside the object than it actually does, pixel values of the object are recorded in the background memory. Therefore, pixels close to the object region are not written to the background memory even if they belong to the background region.
[0283]
When only a background image excluding a foreground person is necessary for an image taken at a sightseeing spot, the background image recorded in the background memory is output to the outside.
[0284]
The above is the description of the first configuration example of the present embodiment. According to this example, extraction accuracy as high as in the last part of the moving image sequence is obtained in the first part as well. Even if the object moves only slightly or does not move at all, it is correctly extracted.
[0285]
Next, an example of correcting the generated shape image 280 will be described with reference to FIG. The processing up to the generation of the shape image 280 is the same as in FIG. and its description is omitted.
[0286]
The shape image 280 is input to an edge correction circuit 284 that uses a background palette. The texture image 275 is input to the edge correction circuit 284 using the background palette and the edge correction circuit 285 using the reduced block matching. A detailed block diagram of the edge correction circuit 284 is shown in FIG.
[0287]
In FIG. 31, the shape image 280 is input to the correction circuit 286, and the texture image 275 of the same frame is input to the comparison circuit 287. The background color 289 is read from the memory 288 that holds the background palette and input to the comparison circuit 287. Here, the background palette is a set of luminance (Y) and color difference (U, V) existing in the background portion, that is, a collection of vectors,
(Y1, U1, V1)
(Y2, U2, V2)
(Y3, U3, V3)
……………………
and is prepared beforehand. Specifically, the background palette is a collection of the (Y, U, V) triples of the pixels belonging to the background area in the first frame. If, for example, Y, U, and V each take 256 values, the number of possible combinations becomes enormous, the number of (Y, U, V) combinations in the background also becomes large, and the amount of calculation increases. The number of combinations can therefore be suppressed by quantizing the values of Y, U, and V with a predetermined step size. This is because vector values that differ before quantization may become the same vector value after quantization.
[0288]
In the comparison circuit 287, Y, U, and V of each pixel of the texture image 275 are quantized, and the resulting vector is checked against the vectors registered in the background palette, which are sent sequentially from the memory 288 as the background color 289, to see whether it matches any of them. For each pixel, a comparison result 290 indicating whether the color of the pixel is a background color is sent from the comparison circuit 287 to the correction circuit 286. In the correction circuit 286, when the comparison result 290 indicates a background color even though the pixel value of a pixel of the shape image 280 is “255” (object), the pixel value of that pixel is replaced with “0” (background), and the corrected shape image 291 is output. With this process, when the object region in the shape image 280 has been erroneously extracted so that it protrudes into the background region, the background region can be correctly separated. However, if there is a color common to the background and the object, and the object color is therefore also registered in the background palette, the part of the object having that color is also determined to be background. Therefore, in the first frame, the palette described above is treated as a temporary background palette, and a palette for the object in the first frame is created in the same manner. Next, the colors in the temporary background palette that are also included in the object palette are removed, and the remaining colors are set as the background palette. This avoids the problem of part of the object being turned into background.
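The construction of the background palette and the correction performed by the correction circuit 286 might be sketched as follows. The sketch assumes that each pixel is a `(Y, U, V)` tuple, that quantization is integer division by the step size, and that the palettes are held as sets; the function names are assumptions introduced here.

```python
def quantize(pixel, step):
    """Quantize a (Y, U, V) tuple with a common step size."""
    return tuple(component // step for component in pixel)

def build_palettes(frame, shape, step):
    """Return (background_palette, object_palette) for the first frame."""
    temp_background, object_palette = set(), set()
    for frame_row, shape_row in zip(frame, shape):
        for pixel, label in zip(frame_row, shape_row):
            target = object_palette if label == 255 else temp_background
            target.add(quantize(pixel, step))
    # Colors shared with the object are removed from the temporary palette.
    return temp_background - object_palette, object_palette

def correct_shape(frame, shape, background_palette, step):
    """Turn object pixels whose color is a registered background color into background."""
    return [[0 if label == 255 and quantize(pixel, step) in background_palette else label
             for pixel, label in zip(frame_row, shape_row)]
            for frame_row, shape_row in zip(frame, shape)]
```

With this arrangement a color that occurs in both the object and the background of the first frame never causes an object pixel to be reclassified, at the cost of leaving some background pixels of that color uncorrected.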
[0289]
Further, in consideration of the case where the shape image given for the first frame contains errors, the pixels near the edge of the shape image may be excluded from the generation of the palettes. Further, the appearance frequency of each vector may be counted, and vectors whose frequency is equal to or lower than a predetermined value may be left out of the palette. If the quantization step size is too small, the processing time increases, and a color that is very similar to a background color is not determined to be background because its vector value differs slightly. Conversely, if the quantization step size is too large, many vectors become common to the background and the object. Therefore, several quantization step sizes are tried on the first frame, and the quantization step size that separates the background colors from the object colors in the same way as the given shape image is selected.
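One way of trying several quantization step sizes on the first frame might be sketched as follows. The selection rule used here, namely taking the coarsest candidate for which no quantized color is shared by the background and the object and otherwise the candidate with the fewest shared colors, is an assumption made for this illustration, as are the function names and the candidate values; the embodiment states only that the step size that separates the two is selected.

```python
def shared_colors(frame, shape, step):
    """Count the quantized colors that occur in both background and object."""
    background, obj = set(), set()
    for frame_row, shape_row in zip(frame, shape):
        for pixel, label in zip(frame_row, shape_row):
            key = tuple(component // step for component in pixel)
            (obj if label == 255 else background).add(key)
    return len(background & obj)

def choose_step(frame, shape, candidates=(64, 32, 16, 8)):
    """Try the candidates from coarse to fine and return the first one that
    leaves no shared color; fall back to the one with the fewest shared colors."""
    for step in candidates:
        if shared_colors(frame, shape, step) == 0:
            return step
    return min(candidates, key=lambda s: shared_colors(frame, shape, s))
```

Preferring the coarsest separating step keeps the palette small, which matches the concern about processing time expressed above.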
[0290]
In addition, since a new color may appear in the background or the object partway through the sequence, the background palette may be recreated at an intermediate frame.
[0291]
Returning to FIG. 28, the shape image 291 is input to the edge correction circuit 285. The edge correction circuit 285 is the circuit of FIG. 30 described above, with the shape image 249 replaced by the shape image 291 and the texture image 251 replaced by the texture image 275. It corrects the shape image so that the edge of the shape image is aligned with the edge of the texture image. The corrected shape image 292 is sent to the switch unit 281. The switch unit 281 outputs a shape image 293 selected from the shape image 292 and the shape image 260.
[0292]
In this example, the edge correction circuit is provided at the subsequent stage of the object extraction circuit 279. However, if the edge correction circuit is provided at the subsequent stage of the object extraction circuit 224, the accuracy of the shape image 260 can be improved.
[0293]
In addition, there are rare cases where the extraction accuracy deteriorates due to edge correction. To prevent the deteriorated shape image 292 from being output in such a case, the shape image 280 and the shape image 291 may also be input to the switch unit 281 in FIG. It then becomes possible to select the shape image 280, which has undergone no edge correction, or the shape image 291, which has undergone only the edge correction using the background palette.
[0294]
FIG. 44 shows, by cross hatching, the pixels having a background color registered in the background palette. If this background color information is used when searching for the similar blocks described above with reference to FIGS. 30 and 29, the accuracy of extraction can be increased further. When the background contains a pattern, a similar block may be selected along an edge of the background pattern rather than along the edge between the object and the background. In such a case, when the error between the block and the block obtained by reducing the similar block is calculated, the error of a pixel is excluded from the calculation if the corresponding pixels are both background colors. Then, even if the edges of the background pattern are misaligned, they produce no error, and the similar block is correctly selected so that the edges between the object and the background match.
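The error calculation that ignores pixel pairs in which both pixels are background colors might be sketched as follows. The function name, the use of the sum of absolute differences on a single channel, and the Boolean masks marking the background-colored pixels are assumptions made for this illustration.

```python
def masked_error(block, reduced_similar, block_is_bg, similar_is_bg):
    """Sum of absolute differences between a block and a reduced similar block,
    skipping positions where both pixels have a background-palette color."""
    error = 0
    for y in range(len(block)):
        for x in range(len(block[0])):
            if block_is_bg[y][x] and similar_is_bg[y][x]:
                continue  # background pattern against background pattern: ignored
            error += abs(block[y][x] - reduced_similar[y][x])
    return error
```

Because mismatches inside the background pattern no longer contribute, the remaining error is dominated by the boundary between object and background.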
[0295]
FIG. 32 is an example of an image composition device incorporating the object extraction device 294 of this embodiment. The texture image 295 is input to the switch unit 296 and the object extraction device 294, and the shape image 2100 of the first frame is input to the object extraction device 294. The object extraction device 294 is configured as shown in FIGS. 27 and 28, and a shape image 297 of each frame is generated and sent to the switch unit 296. On the other hand, the background image 299 for synthesis is held in the recording circuit 298 in advance, and the background image 299 of the current processing frame is read from the recording circuit 298 and sent to the switch unit 296. In the switch unit 296, the texture image 295 is selected and output as a composite image 2101 for pixels whose shape image pixel value is “255” (object), and for pixels whose shape image pixel value is “0” (background), A background image 299 is selected and output as a composite image 2101. As a result, an image in which the object in the texture image 295 is combined with the foreground of the background image 299 is generated.
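The selection performed by the switch unit 296 might be sketched as follows, assuming that the three images have the same size and are stored as lists of rows; the function name `composite` is an assumption.

```python
def composite(texture, shape, new_background):
    """Select the texture pixel where the shape value is 255 (object)
    and the pixel of the background image for synthesis elsewhere."""
    return [[tex if label == 255 else bg
             for tex, label, bg in zip(tex_row, shape_row, bg_row)]
            for tex_row, shape_row, bg_row in zip(texture, shape, new_background)]
```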
[0296]
FIG. 33 shows another example in which edge correction is performed. Assume that one of the blocks that have been set is the block 2102 in FIG. 33. The block is divided into an object area and a background area with the contour line as the boundary. Blocks 2103, 2104, 2105, and 2106 are obtained by shifting this contour line in the left-right direction, each with a different shift width and direction. The degree of separation described on page 1408 of Fukui, "Object contour extraction based on the degree of separation between regions" (The Institute of Electronics, Information and Communication Engineers Journal, D-II, Vol. J80-D-II, No. 6, pp. 1406-1414, June 1997) is obtained for each contour line, and the contour line having the highest degree of separation among the blocks 2102 to 2106 is adopted. Thereby, the contour line of the shape image matches the edge of the texture image.
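The selection of the contour line with the highest degree of separation might be sketched as follows. The sketch assumes that the degree of separation is the ratio of the between-region variance to the total variance of the pixel values, which is the commonly used definition and is taken here as an assumption rather than quoted from the cited paper. It further simplifies each block to a single row of pixel values and each candidate contour to a split position; the function names are hypothetical.

```python
def separability(object_values, background_values):
    """Ratio of between-region variation to total variation, in the range 0 to 1."""
    all_values = list(object_values) + list(background_values)
    mean_all = sum(all_values) / len(all_values)
    total = sum((v - mean_all) ** 2 for v in all_values)
    if total == 0:
        return 0.0  # flat block: no contour can be distinguished
    mean_obj = sum(object_values) / len(object_values)
    mean_bg = sum(background_values) / len(background_values)
    between = (len(object_values) * (mean_obj - mean_all) ** 2
               + len(background_values) * (mean_bg - mean_all) ** 2)
    return between / total

def best_contour(row, candidate_positions):
    """Among candidate split positions p (0 < p < len(row)), return the one for
    which row[:p] and row[p:] are best separated."""
    return max(candidate_positions, key=lambda p: separability(row[:p], row[p:]))
```

Each candidate position plays the role of one of the shifted contour lines of the blocks 2102 to 2106, and the position whose two sides differ most clearly is adopted.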
[0297]
As described above, according to the fourth embodiment, extraction accuracy as high as in the last part of the moving image sequence is obtained in the first part as well. Even if the object moves only slightly or does not move at all, it is correctly extracted. Furthermore, by reducing and pasting the shape data of a similar block larger than the current processing block, the contour line of the object region given by the shape data can be corrected to the correct position even if it has shifted. Simply by giving a rough trace of the contour of the object region as shape data, the object region can be extracted with high accuracy in all subsequent input frames.
[0298]
The first to fourth embodiments described above can be used in appropriate combination. Also, all the procedures of the object extraction methods of the first to fourth embodiments can be realized by software. In this case, the same effects as those of the first to fourth embodiments can be obtained simply by installing a computer program that executes the procedures on an ordinary computer via a recording medium.
[0299]
[Effect of the invention]
As described above, according to the present invention, by tracking an object using a figure surrounding the target object, the target object can be extracted and tracked with high accuracy without being affected by extraneous movements other than that of the target object.
[0300]
In addition, high extraction accuracy can be obtained regardless of the input image. Furthermore, extraction accuracy as high as in the last part of the moving image sequence is obtained in the first part as well. Even if the object moves only slightly or does not move at all, it is correctly extracted.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of a moving image object tracking / extracting apparatus according to a first embodiment of the present invention;
FIG. 2 is an exemplary block diagram showing a first configuration example of the object tracking / extracting apparatus according to the embodiment;
FIG. 3 is an exemplary block diagram showing a second configuration example of the object tracking / extracting apparatus according to the embodiment;
FIG. 4 is an exemplary block diagram showing an example of a specific configuration of a background area determination unit provided in the object tracking / extracting apparatus according to the embodiment;
FIG. 5 is an exemplary block diagram illustrating an example of a specific configuration of a graphic setting unit provided in the object tracking / extracting apparatus according to the embodiment;
FIG. 6 is an exemplary block diagram illustrating an example of a specific configuration of a background motion deletion unit provided in the object tracking / extracting apparatus according to the embodiment;
FIG. 7 is an exemplary view showing an example of a representative background area used in a background motion deleting unit provided in the object tracking / extracting apparatus according to the embodiment.
FIG. 8 is a view for explaining the operation of the object tracking / extracting apparatus according to the embodiment;
FIG. 9 is a block diagram showing a first moving image object tracking / extracting apparatus according to a second embodiment of the present invention.
FIG. 10 is a block diagram showing a second moving image object tracking / extracting apparatus according to the second embodiment.
FIG. 11 is a block diagram showing a third moving image object tracking / extracting apparatus according to the second embodiment;
FIG. 12 is a block diagram showing a fourth moving image object tracking / extracting apparatus according to the second embodiment;
FIG. 13 is a view for explaining an object prediction method used in the object tracking / extracting apparatus according to the second embodiment;
FIG. 14 is a view for explaining a reference frame selection method used in the object tracking / extracting apparatus according to the second embodiment;
FIG. 15 is a diagram illustrating an example of a result of extracting an object by switching between a first object tracking / extraction unit and a second object extraction unit in the object tracking / extraction apparatus according to the second embodiment;
FIG. 16 is a view for explaining the flow of moving object tracking / extraction processing using the object tracking / extraction apparatus according to the second embodiment;
FIG. 17 is a block diagram showing a first moving image object tracking / extracting apparatus according to a third embodiment of the present invention.
FIG. 18 is a block diagram showing a second moving image object tracking / extracting apparatus according to the third embodiment.
FIG. 19 is a block diagram showing a third moving image object tracking / extracting device according to the third embodiment.
FIG. 20 is a block diagram showing a fifth moving image object tracking / extracting apparatus according to the third embodiment.
FIG. 21 is a block diagram showing a sixth moving image object tracking / extracting device according to the third embodiment.
FIG. 22 is a block diagram showing a fourth moving image object tracking / extracting apparatus according to the third embodiment.
FIG. 23 is a block diagram showing another configuration example of the moving image object tracking / extracting apparatus according to the third embodiment.
FIG. 24 is a block diagram showing still another configuration example of the moving image object tracking / extracting apparatus according to the third embodiment.
FIG. 25 is a view for explaining an example of an extracted frame order by frame order control applied to the moving image object tracking / extracting apparatus according to the third embodiment;
FIG. 26 is a diagram showing an application example of the moving image object tracking / extracting device according to the third embodiment.
FIG. 27 is a block diagram showing an object extraction device according to a fourth embodiment of the present invention.
FIG. 28 is a block diagram showing a configuration example when edge correction processing is applied to the object extraction device according to the fourth embodiment;
FIG. 29 is a block diagram showing a configuration example of a motion compensation unit applied to the object extraction device according to the fourth embodiment;
FIG. 30 is a block diagram showing a configuration example of an object extraction unit by reduced block matching applied to the object extraction device according to the fourth embodiment.
FIG. 31 is a diagram showing an edge correction circuit using a background palette used in the object extraction device according to the fourth embodiment.
FIG. 32 is a view showing an image composition device applied to the object extraction device according to the fourth embodiment;
FIG. 33 is a view for explaining the principle of edge correction using the degree of separation used in the object extraction device according to the fourth embodiment;
FIG. 34 is a view showing the entire processed image processed by the object extracting device according to the fourth embodiment.
FIG. 35 is a view showing an outline drawn by an operator used in the fourth embodiment;
FIG. 36 is a diagram showing a state of block setting (first scan) used in the fourth embodiment.
FIG. 37 is a view showing a block setting (second scan) used in the fourth embodiment;
FIG. 38 is a view for explaining similar blocks used in the fourth embodiment;
FIG. 39 is a view for explaining a search range for similar blocks used in the fourth embodiment;
FIG. 40 is a view for explaining another example of the search range of similar blocks used in the fourth embodiment.
FIG. 41 is a diagram showing a state before replacement conversion of a shape image used in the fourth embodiment.
FIG. 42 is a view showing a state after replacement conversion of a shape image used in the fourth embodiment;
FIG. 43 is a diagram showing contour lines extracted in the fourth embodiment.
FIG. 44 is a diagram showing a background color portion extracted in the fourth embodiment.
FIG. 45 is a view for explaining motion compensation used in the fourth embodiment;
FIG. 46 is a flowchart of an object extraction method using a background image used in the fourth embodiment.
FIG. 47 is a flowchart of an object extraction method by motion compensation used in the fourth embodiment.
FIG. 48 is a flowchart of an object extraction method using reduced block matching in a frame used in the fourth embodiment.
FIG. 49 is a view showing another example of block setting used in the fourth embodiment.
FIG. 50 is a flowchart for explaining edge correction.
FIG. 51 is a diagram showing an example of block setting.
FIG. 52 is a diagram showing a process of searching for an outline of an object region.
FIG. 53 is a flowchart for explaining a method of gradually reducing the block size.
FIG. 54 is a diagram showing another example of block setting.
[Explanation of symbols]
1 ... Initial figure setting section
2 ... Object tracking / extraction unit
11 ... Figure setting section
12 ... Background region determination unit
13 ... Object extraction unit
21 ... Background motion deletion unit
22 ... Figure setting section
23 ... Background region determination unit
24 ... Object extraction unit
31 ... Change amount detection unit
32 ... Representative area determination unit
33 ... Background change amount determination unit
34 ... Background determination part of representative area
35 ... Background region determination unit
36 ... Shape prediction unit
37 ... Stationary object region determination unit
41 ... Separation part
42 ... Motion detector
43 ... Division determination unit
44 ... Figure determining section
51 ... Background representative area setting section
52 ... Motion detection unit
53 ... Motion compensation unit
61 ... Figure setting section
62 ... Plural object tracking / extraction units
70 ... Graphic setting section
71 ... Second object tracking / extracting unit
72 ... Reference frame selection section
73 ... First object tracking / extracting unit
111 ... Feature amount extraction unit
141 ... Frame order control unit
224 ... Object extraction unit
279 ... Object extraction unit
284 ... Edge correction unit
285 ... Edge correction unit

Claims

A moving image object extraction apparatus comprising:
stationary object determination means for determining, for each pixel at which the difference between a current frame containing the object to be extracted from a moving image signal and a reference frame is small, using an object shape of the reference frame given in advance, that the pixel of the current frame belongs to an object region if the corresponding pixel of the reference frame belongs to an object region, and that the pixel of the current frame belongs to a background region if the corresponding pixel of the reference frame belongs to a background region;
background region determination means for determining a first background region common to the current frame and the reference frame based on the determination result of the stationary object determination means and the difference between the current frame and the reference frame, and for determining a second background region common to the current frame and another reference frame based on the difference between the current frame and the other reference frame; and
means for extracting, as an object region, a region of the image of the current frame that belongs to neither the first background region nor the second background region.
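
The three means recited above can be illustrated by a minimal sketch. It assumes grayscale frames stored as nested lists, a single fixed threshold, and that the stationary object determination is applied only to the first reference frame. The function names common_background and extract_object, the threshold value, and the sample data are hypothetical and do not appear in the specification.

```python
def common_background(cur, ref, thr, ref_shape=None):
    """Mask that is True where `cur` and `ref` share background.

    `ref_shape` (True = object) is the object shape of the reference frame
    given in advance. When supplied, pixels with a small difference that lie
    on the reference object are treated as a stationary object, not background.
    """
    height, width = len(cur), len(cur[0])
    background = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if abs(cur[y][x] - ref[y][x]) > thr:
                continue  # large difference: not common background
            if ref_shape is not None and ref_shape[y][x]:
                continue  # small difference on the reference object: stationary object
            background[y][x] = True
    return background


def extract_object(cur, ref1, ref1_shape, ref2, thr):
    """Object mask: pixels in neither the first nor the second common background."""
    bg1 = common_background(cur, ref1, thr, ref1_shape)
    bg2 = common_background(cur, ref2, thr)
    return [
        [not b1 and not b2 for b1, b2 in zip(row1, row2)]
        for row1, row2 in zip(bg1, bg2)
    ]


# One-row example: the object (value 200) has not moved since ref1
# but has moved since ref2.
cur = [[10, 10, 200, 200, 10, 10]]
ref1 = [[10, 10, 200, 200, 10, 10]]
ref1_shape = [[False, False, True, True, False, False]]
ref2 = [[10, 10, 10, 10, 200, 200]]
mask = extract_object(cur, ref1, ref1_shape, ref2, thr=20)
```

In this example the object is stationary relative to the first reference frame, so the difference alone would classify every pixel as common background. The reference object shape is what keeps the two object pixels out of the first background region.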

The object extraction device according to claim 1, wherein the stationary object determination means uses, as the object shape given in advance, the object shape of the reference frame when that shape has already been extracted, and, when the object shape of the reference frame has not yet been extracted, generates the object shape of the reference frame by a block matching method from a frame whose object shape has already been extracted and uses the shape thus generated.

The object extraction device according to claim 1, further comprising background correction means for correcting the motion of the background of the reference frames or of the current frame so that the motion of the background between the current frame and each of the reference frame and the other reference frame becomes relatively zero.

The object extraction apparatus according to claim 1, wherein the background region determination unit includes a unit that determines the common background region using a threshold value given in advance.

The object extraction device according to claim 4, further comprising means for measuring the magnitude of the difference in the current frame and setting the threshold value large when the difference is large and small when the difference is small.

The object extraction device according to claim 4, further comprising means for dividing the current frame into a plurality of regions, measuring the magnitude of the difference in each divided region, and setting the threshold value large when the difference is large and small when the difference is small.
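
Region-wise threshold setting of this kind might be sketched as follows. The claim does not say how the magnitude of the difference is measured or how it maps to a threshold, so the mean absolute difference, the linear scale factor, the lower bound, and the name region_thresholds are all assumptions made for illustration.

```python
def region_thresholds(cur, ref, block, scale=2.0, floor=1.0):
    """Return {(top, left): threshold} for each block-sized region.

    The magnitude of the difference in a region is taken here to be the mean
    absolute difference (an assumption); the threshold grows linearly with it
    and never falls below `floor`.
    """
    height, width = len(cur), len(cur[0])
    thresholds = {}
    for top in range(0, height, block):
        for left in range(0, width, block):
            diffs = [
                abs(cur[y][x] - ref[y][x])
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            mean_diff = sum(diffs) / len(diffs)
            thresholds[(top, left)] = max(floor, scale * mean_diff)
    return thresholds


# The left 2x2 region differs by 1 per pixel, the right 2x2 region by 5.
cur = [[11, 11, 25, 25],
       [11, 11, 25, 25]]
ref = [[10, 10, 20, 20],
       [10, 10, 20, 20]]
thresholds = region_thresholds(cur, ref, block=2)
```

Setting block to the full frame size gives a single threshold for the whole current frame, which corresponds to the frame-wide variant.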

The object extraction device according to claim 1, further comprising: prediction means for predicting the position or shape of the object on the current frame from a frame from which an object region has already been extracted; and means for selecting, based on the position or shape of the object on the current frame predicted by the prediction means, the first and second reference frames to be used by the background region determination means such that the overlapping portion of the objects of the first and second reference frames belongs to the object of the current frame.
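
The selection condition can be illustrated with a short sketch. It assumes that object masks are already available for the candidate reference frames and that the predicted object on the current frame is given as a mask. The names overlap_inside and select_reference_pair, the search order, and the sample masks are hypothetical.

```python
from itertools import combinations


def overlap_inside(mask_a, mask_b, predicted):
    """True if every pixel that is object in both masks is also in `predicted`."""
    return all(
        (not (a and b)) or p
        for rows in zip(mask_a, mask_b, predicted)
        for a, b, p in zip(*rows)
    )


def select_reference_pair(candidates, predicted):
    """Return the first pair of frame indices whose object overlap lies inside
    the predicted object of the current frame, or None if no pair qualifies.

    `candidates` maps a frame index to that frame's object mask.
    """
    for i, j in combinations(sorted(candidates), 2):
        if overlap_inside(candidates[i], candidates[j], predicted):
            return i, j
    return None


T, F = True, False
predicted = [[F, F, T, T, F, F]]          # predicted object on the current frame
candidates = {
    0: [[T, T, T, F, F, F]],
    1: [[F, T, T, T, F, F]],
    2: [[F, F, T, T, T, F]],
}
pair = select_reference_pair(candidates, predicted)
```

Frames 0 and 1 overlap at a pixel outside the predicted object, so that pair is rejected; frames 0 and 2 overlap only inside it and are selected.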

The object extraction device according to claim 1, further comprising: initial figure setting means for setting a figure surrounding the target object region in an initial frame of the moving image signal; and figure setting means for setting, for each input frame of the moving image signal, in the current frame that is the input frame, a figure surrounding the region on the current frame that corresponds to the in-figure image of the reference frame, based on the correlation between the current frame and the in-figure image on the reference frame, wherein the means for extracting extracts, as an object region, a region of the in-figure image that belongs to neither the first background region nor the second background region.

The object extraction device according to claim 8, wherein the initial figure setting means sets a figure surrounding a target moving object based on an input from the outside.

An object extraction device comprising:
initial figure setting means for setting a figure surrounding a target object in an initial frame of a moving image signal;
figure setting means for setting, for each input frame of the moving image signal, in the input frame, a figure surrounding the region of the input frame that corresponds to the in-figure image of a reference frame temporally different from the input frame, based on the correlation between the input frame and the in-figure image on the reference frame;
background region determination means for determining a first background region common to the input frame to be extracted and a first reference frame temporally different from the input frame based on the difference between them, and for determining a second background region common to the input frame and a second reference frame temporally different from the input frame based on the difference between them;
first object extraction means for extracting, as an object region, a region of the in-figure image of the input frame that belongs to neither the first background region nor the second background region;
second object extraction means for extracting an object region from the in-figure image on the input frame to be extracted, using a method different from that of the first object extraction means;
switching means for selectively switching between the first and second object extraction means; and
means for extracting a feature quantity of the image for at least a partial region of the input frame to be extracted,
wherein the switching means selectively switches between the first and second object extraction means based on the extracted feature quantity.

An object extraction device comprising:
initial figure setting means for setting a figure surrounding a target object in an initial frame of a moving image signal;
figure setting means for setting, for each input frame of the moving image signal, in the input frame, a figure surrounding the region on the input frame that corresponds to the in-figure image of a reference frame temporally different from the input frame, based on the correlation between the input frame and the in-figure image on the reference frame;
background region determination means for determining a first background region common to the input frame to be extracted and a first reference frame temporally different from the input frame based on the difference between them, and for determining a second background region common to the input frame and a second reference frame temporally different from the input frame based on the difference between them;
first object extraction means for extracting, as an object region, a region of the in-figure image of the input frame that belongs to neither the first background region nor the second background region;
second object extraction means for extracting an object region from the in-figure image on the input frame to be extracted, using a method different from that of the first object extraction means; and
switching means for selectively switching between the first and second object extraction means,
wherein the second object extraction means uses a frame from which an object region has already been extracted as a reference frame, predicts from that reference frame the position or shape of the object on the input frame to be extracted, and extracts the object region from the input frame based on the prediction result, and
wherein the switching means selectively switches between the first and second object extraction means in units of blocks within a frame based on the amount of prediction error, such that the extraction result of the second object extraction means is used as the object region when the prediction error of the second object extraction means is within a predetermined range, and the extraction result of the first object extraction means is used as the object region when the prediction error exceeds the predetermined range.
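
The block-wise switching can be sketched as follows, assuming both extraction results and a per-block prediction error are already available. The name switch_blocks, the parameter max_error standing in for the predetermined range, and the sample values are hypothetical.

```python
def switch_blocks(first_mask, second_mask, pred_error, block, max_error):
    """Compose the final object mask block by block.

    Where the prediction error of a block is within `max_error`, the result of
    the second (prediction-based) extraction is used; otherwise the result of
    the first (background-difference) extraction is used.
    """
    height, width = len(first_mask), len(first_mask[0])
    result = [[False] * width for _ in range(height)]
    for top in range(0, height, block):
        for left in range(0, width, block):
            error = pred_error[top // block][left // block]
            source = second_mask if error <= max_error else first_mask
            for y in range(top, min(top + block, height)):
                for x in range(left, min(left + block, width)):
                    result[y][x] = source[y][x]
    return result


first_mask = [[True, True, True, True],
              [True, True, True, True]]
second_mask = [[False, False, False, False],
               [False, False, False, False]]
pred_error = [[1, 9]]  # one error value per 2x2 block
final_mask = switch_blocks(first_mask, second_mask, pred_error, block=2, max_error=5)
```

Here the left block keeps the prediction-based result because its error is small, while the right block falls back to the first extraction result.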

The object extraction device according to claim 11, wherein the second object extraction means performs inter-frame prediction in an order different from the input frame order so that the frame interval between the reference frame and the input frame to be extracted is at least a predetermined number of frames.