JP3855349B2

JP3855349B2 - Image recognition method and image information encoding method

Info

Publication number: JP3855349B2
Application number: JP08070297A
Authority: JP
Inventors: 美樹男笹木
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 1997-03-31
Filing date: 1997-03-31
Publication date: 2006-12-06
Anticipated expiration: 2017-03-31
Also published as: JPH10275237A

Description

【０００１】
【発明の属する技術分野】
本発明は、画像から動きのある対象物を抽出して認識する画像の認識方法、および画像を符号化する場合に、限られた情報伝送量の範囲内で、全体としての符号化の情報量の増大を抑制しながら特定領域の画質を向上させるための画像情報の符号化方法に関する。
【０００２】
【発明が解決しようとする課題】
自動車電話や携帯電話などの移動体通信におけるデジタル通信では、データ伝送レートは現在９６００ｂｐｓ（bit per second）が主流となっている。通信コストを考えた場合に、技術的には広帯域移動通信が可能になるとしても低コストで実現できる狭帯域通信の優位性は２０００年以降においても当分は崩れないと言われている。
【０００３】
そこで、このような点を考慮して、データ伝送レートを９６００ｂｐｓとした準動画伝送を可能とするパソコンベースでの自動車ＴＶ電話のプロトタイプが開発されたが、この場合においては、画像更新速度のレベルとしては１秒から２秒の間に１フレーム（画面）程度がせいぜいであった。ところが、このような画像更新速度では利用者が見る場合に動画像の表現能力に欠けるため、実用的なレベルでの使用が難しいものであった。
【０００４】
一方、ＩＴＵ−Ｔ／ＬＢＣで標準化された画像情報の圧縮に関するＨ．２６３規格は、有線系超低レート画像伝送（６４ｋｂｐｓ〜４．８ｋｂｐｓ）において最も高性能な方式と言われるが、この規格はデコーダ（復号化器）に関する規定であり、エンコーダ（符号化器）の演算アルゴリズムについての詳細な部分は利用者側の仕様に委ねられている。そこで、Ｈ．２６３評価ソフトウェアＴＭＮ５（Telenor 社）を用いて画質評価を行なったところ、フレームレートは比較的高い（９６００ｂｐｓで約５フレーム／秒）が、画質に関しては顔の表情の伝達能力が低いことがわかった。
【０００５】
また、実際の製品化にあたっては音声情報も送信する必要があるので、ＡＶ多重化に際しては、画像の実質レートとしては４ｋｂｐｓ程度しか取れないようになることが予想されるので、通信レートの制約はさらに厳しいものとなる。その一方で、製品としての性能を考慮すると、対話をする場合の顔の表情を表示させる場合には、５フレーム／秒以上の動画レートと顔の表情が良く分かる画質の両方が要求される。換言すると、画像の全体が高い精度で更新されなくとも、対象となる顔領域や手などの動きの部分のみを良好な精度で伝送することにより、利用者にとっては質の高い画像であると認識されるのである。
【０００６】
そこで、画像情報から動きのある対象物のうちの特定の対象領域である顔領域をいかにして効率的に検出して認識するかが第１の課題となり、さらに、限られた伝送容量の範囲内で、いかにしてその顔領域について他の領域よりも高い精度で符号化を行なうかという点が第２の課題となってくる。
【０００７】
この場合、人物領域の抽出処理に関しては、クラスタリングやテンプレートマッチング、あるいは動的輪郭モデルなどの多くの手法が提案されているが、様々な状況（帽子，髪形，背景色，肌色，陰影，全身／上半身／顔のみ等）に対して安定して抽出を行なえると共に、低い演算コストを同時に満足するものが少ないのが実情である。
【０００８】
本発明は、上記事情に鑑みてなされたもので、第１の目的は、動きのある対象物のうちの顔などの特定の対象領域を画像情報から効率的に検出して認識することができる画像の認識方法を提供することにあり、第２の目的は、その認識方法で得られた画像情報を高効率で符号化する画像情報の符号化方法を提供することにある。
【０００９】
【課題を解決するための手段】
請求項１記載の画像情報の符号化方法によれば、まず、画像の認識方法では、動領域検出過程において、動きのある対象物を撮影した画像を複数のブロックに分割したときに、時間的に前後する少なくとも２フレーム分の画像の各ブロックの画素データに基づいて動きが発生しているブロックを検出して対象物に相当する動領域を検出し、続いてモデル適合過程において、検出された動領域に対してあらかじめ規定されている対象物の形状モデルを当てはめてその内部に含まれる領域を対象物の推定領域として認識し、これによって画像中に占める動きのある対象物の領域を効率的に認識することができるようになる。
【００１０】
そして、上記した画像の認識方法により得られた画像情報に基づいて画像情報を符号化する場合に、モデル適合過程により抽出された推定領域に対して他の領域よりも多くの情報量を割り当てるように符号化処理を行なうことにより所定の伝送容量の範囲内で画像を符号化するので、抽出された推定領域について符号化における精度を高くして情報量を多くすることができ、対象物として利用者が最も注視すると予想されるもの例えば人物の顔などを設定することにより、利用者が復調された画像を見る場合に、注視する対象物の領域を重点的に高精度で表示させることができるので見た目に精度が高く感じられ、これにより、限られた伝送容量の範囲内で良好な画像を送るようにすることができる。
上記のような前提となる処理に加えて、
さらに、動領域検出過程においては、検出された動きブロックの情報に加えて画像の色情報に基づいて動領域を検出するので、動きの少ないブロックに対しても色情報から対象物に対応したブロックであるか否かの判定を行なえるようになり、対象物に相当する推定領域の抽出の精度を向上させることができるようになる。
加えて、動領域検出過程においては、色情報の検出を重心色の計算による対象物の色の決定と色距離計算としきい値判定により行なうので、具体的な演算処理が比較的簡単に行なうことができるようになる。
さらに、動領域検出過程においては、色距離計算に続くしきい値判定で用いるしきい値の設定を時間の経過に伴って低いレベルに変更するので、例えば、あるシーンの画像から対象物の推定領域を抽出する場合に、シーンの開始直後における変化に対して時間が経過するにしたがって推定領域の抽出精度が向上してくるのでしきい値を低レベルにしてより精度の高い判定処理を行なうことができるようになる。
請求項２記載の画像情報の符号化方法によれば、請求項１の説明で述べた前提となる処理に加えて、動領域検出過程において、検出した動きブロックについて時間的に前後するフレームの画像の動きブロックを参照してヒストグラムを作成して時間方向のフィルタリングを行なうことにより雑音除去処理を行なうので、ひとつの画像では判定しかねる雑音ブロックに対しても、複数のフレームの画像に渡って対応するブロックが存在しているか否かという時間的な経過を伴うフィルタリング処理を施すことにより簡単に除去することができるようになる。
請求項３記載の画像情報の符号化方法によれば、請求項１の説明で述べた前提となる処理に加えて、対象物のモデルとして、３次元モデル、色情報およびシーン構造を含むモデルを規定したモデルデータベースを設け、モデルデータベースからモデルを選択する場合に、確率的に状態遷移を行なうときの確率値を変更または状態遷移経路を変更するので、時間の経過と共に画像の状態に応じて適切なモデルを選択するようにして迅速かつ正確に推定領域の抽出を行なうことができるようになる。
請求項４記載の画像情報の符号化方法によれば、請求項１の説明で述べた前提となる処理に加えて、評価修正過程において、モデル適合過程により抽出された推定領域の適合度を画像のデータに基づいて評価し、その評価結果に応じて適合度がより高くなるように推定領域を修正するので、画像中の動きのある対象物を精度良く認識することができ、しかも、符号化処理過程では、修正されたより精度の高い推定領域を用いて符号化処理を行なうことができ、これによって、対象物に対応した推定領域をより精度よく抽出してその推定領域に対して符号化の精度を高めて符号化処理を行なえるので、限られた伝送容量の範囲内で対象物の動きをより滑らかにした画像を伝送することができる。
さらに、評価修正過程においては、３次元位置姿勢の決定を物理的制約や常識的な値を用いたしきい値判定により行なうので、実際的状況に即した安定した演算結果を比較的簡単に得ることができるようになる。
請求項５記載の画像情報の符号化方法によれば、請求項１の説明で述べた前提となる処理に加えて、評価修正過程において、モデル適合過程により抽出された推定領域の適合度を画像のデータに基づいて評価し、その評価結果に応じて適合度がより高くなるように推定領域を修正するので、画像中の動きのある対象物を精度良く認識することができ、しかも、符号化処理過程では、修正されたより精度の高い推定領域を用いて符号化処理を行なうことができ、これによって、対象物に対応した推定領域をより精度よく抽出してその推定領域に対して符号化の精度を高めて符号化処理を行なえるので、限られた伝送容量の範囲内で対象物の動きをより滑らかにした画像を伝送することができる。
さらに、評価修正過程においては、３次元位置姿勢算出値のしきい値判定で用いるしきい値の設定を時間の経過に伴って動的に変更するので、例えば、あるシーンの画像から対象物の３次元位置姿勢を推定する場合に、シーンの開始直後における変化に対して時間変化が経過するにしたがって逐次処理に伴い推定精度が向上してくるので、時間変化や予測誤差に関するしきい値を低いレベルに変更して少ない変化に対しても感度を高めてこれを検出できるように判定処理を行なうことができるようになる。
【００１１】
請求項６記載の画像情報の符号化方法あるいは請求項２４記載の画像の認識方法によれば、対象物の形状モデルをあらかじめ３次元モデルとして規定しておき、モデル適合過程では、動領域検出過程で得られた動領域の形状特徴に基づいて対象物の３次元モデルの概略の位置姿勢を推定しこれに基づいて推定領域を抽出するので、対象物の動きに応じて高い精度で追随して動領域に対する推定領域の抽出を行なうことができるようになる。
【００１２】
請求項７記載の画像情報の符号化方法あるいは請求項２５記載の画像の認識方法によれば、モデル適合過程では、対象物の３次元モデルの位置姿勢情報に基づいてそのワイヤフレームモデルを画像面上に投影して２次元的な領域を割り当て、その割り当てたワイヤフレームモデルの内部領域を前記対象物の推定領域として抽出するので、対象物の位置姿勢情報から対象物の推定領域の抽出に必要な演算処理を迅速に行なわせることができるようになる。
【００１３】
請求項８記載の画像情報の符号化方法あるいは請求項２６記載の画像の認識方法によれば、評価修正過程において、モデル適合過程により抽出された推定領域の適合度を画像のデータに基づいて評価し、その評価結果に応じて適合度がより高くなるように推定領域を修正するので、画像中の動きのある対象物を精度良く認識することができ、しかも、符号化処理過程では、修正されたより精度の高い推定領域を用いて符号化処理を行なうことができ、これによって、対象物に対応した推定領域をより精度よく抽出してその推定領域に対して符号化の精度を高めて符号化処理を行なえるので、限られた伝送容量の範囲内で対象物の動きをより滑らかにした画像を伝送することができる。
【００１４】
請求項９記載の画像情報の符号化方法あるいは請求項２７記載の画像の認識方法によれば、対象物の常識的なモデル情報として例えば色情報や運動情報あるいは３次元構造の情報があらかじめ規定されているので、評価修正過程においては、抽出されている推定領域のデータがその常識的なモデル情報に適合しているかを評価修正を行なうことにより、より正確な推定領域の抽出をすることができるようになる。
【００１５】
請求項１０記載の画像情報の符号化方法あるいは請求項２８記載の画像の認識方法によれば、評価修正過程においては、時間的に前後する画像のそれぞれにおいて抽出した推定領域が対象物の物理的な動き条件や常識的な動き条件に適合しているか否かを判定するので、抽出された推定領域が対象物の不自然な動きや位置などに相当している場合に、物理的条件を満たさないことにより除外することができるようになる。この場合、常識的な動きや位置条件とは、例えば、対象物が人物である場合に、前フレームと現フレームとの間での動き量は常識的にみて自ずと上限が存在するし、また、人物が空中に浮かんだり上下が転倒した状態となったり、あるいは突然消えたり現れたりすることはあまり考えられないというように、その対象物の性質として規定される条件である。
【００１６】
請求項１１記載の画像情報の符号化方法あるいは請求項２９記載の画像の認識方法によれば、評価修正過程において、対象物に対応してあらかじめ規定されている常識的色情報の条件，重心色の再評価および推定領域内の色連続性の評価を行なって推定領域の修正を行なうので、例えば、推定領域の境界に位置するブロックが領域内か外かを判定する際に、重心位置のブロックの色情報に基づいて色の連続性を評価することにより正確に判定を行なうことができ、推定領域の修正を高精度で行なうことができるようになる。
【００１７】
請求項１２記載の画像情報の符号化方法によれば、符号化処理過程では、抽出された推定領域の情報を次の画像の符号化処理に適用するので、推定領域の抽出に使用した画像のデータを繰り返し符号化処理のために読出して演算する場合に比べて、演算処理を並列に行なうことにより迅速に対応することができるようになる。なお、この場合において、現画像の推定領域を次の画像に適用することから、動きが発生している場合には対象物の領域のある程度のずれが生ずるが、人物の顔など対象物によっては実用上ほとんど悪影響を受けることがないので、演算量を減じて迅速な対応をすることができる。
【００１８】
請求項１３記載の画像情報の符号化方法によれば、領域予測過程において、推定領域の抽出過程で得られた情報に基づいて次の画像における対象物の推定領域を予測し、符号化処理過程ではこの予測された推定領域の情報に基づいて次の画像の符号化処理を行なうので、対象物の動きが大きい場合でもそれに追随して次の画像の対象物の推定領域を予測した結果に基づいて、より的確に符号化処理をすることができるようになる。
【００１９】
請求項１４記載の画像情報の符号化方法あるいは請求項３０記載の画像の認識方法によれば、動領域検出過程において、画像のパニングベクトルを検出してこれに基づいて検出された動きブロックの情報を補正するので、対象物を撮影する撮像手段が動く場合でも、その撮像手段が動いた分を差し引いて画面中の動きブロックを判定することができるようになり、対象物の動領域をより精度高く抽出することができるようになる。
【００２０】
請求項１５記載の画像情報の符号化方法あるいは請求項３１記載の画像の認識方法によれば、動領域検出過程においては、検出された動きブロックの情報に加えて画像の色情報に基づいて動領域を検出するので、動きの少ないブロックに対しても色情報から対象物に対応したブロックであるか否かの判定を行なえるようになり、対象物に相当する推定領域の抽出の精度を向上させることができるようになる。
【００２１】
請求項１６記載の画像情報の符号化方法あるいは請求項３２記載の画像の認識方法によれば、動領域検出過程においては、色情報の検出を重心色の計算による対象物の色の決定と色距離計算としきい値判定により行なうので、具体的な演算処理が比較的簡単に行なうことができるようになる。
【００２３】
請求項１７記載の画像情報の符号化方法あるいは請求項３３記載の画像の認識方法によれば、動領域検出過程において、検出した動きブロックについて重心位置との間の距離計算としきい値判定を行なうことにより雑音除去処理を行なうので、大きな領域に近接する孤立した雑音領域などを簡単に除去することができるようになる。
【００２５】
請求項１８記載の画像情報の符号化方法あるいは請求項３４記載の画像の認識方法によれば、評価修正過程においては、対象物の３次元モデルのワイヤフレームモデルを画像面に投影する際に、その対象物の所定部位の適合性が高くなるようにワイヤフレームモデルを当てはめてモデルの適応化を図るので、特徴的な部位を選定することにより簡単かつ迅速にワイヤフレームモデルの当てはめの処理を行なうことができるようになる。
【００２９】
【発明の実施の形態】
以下、本発明の一実施例について図面を参照しながら説明する。
図２は符号化器１の全体の概略的構成を示すもので、撮像手段としてのカメラ２は、対象物を含むシーンを撮影して画像情報を出力するもので、Ａ／Ｄ変換部３を介してデジタル信号に変換された状態で原画像情報として符号化部４に入力するようになっている。
【００３０】
符号化部４は、既存の動き補償予測符号化方式（例えば、Ｈ．２６１，Ｈ．２６３あるいはＭＰＥＧなど）で原画像情報を符号化するもので、これは、後述するモデルベース対象領域抽出部５に対して２次元動きベクトル検出部６にて原画像情報から動きベクトルを検出してその情報を与えるようになっており、また、モデルベース対象領域抽出部５から与えられる領域情報に基づいて符号化属性判定・制御部７にて抽出された領域について動き補償予測符号化の符号化処理を行なって圧縮ビットストリームを生成して通信路に出力するものである。
【００３１】
モデルベース対象領域抽出部５は、上述したように符号化部４の２次元動きベクトル検出部６から動きベクトルの情報が与えられると共に、原画像の情報が与えられるようになっており、これらの情報に基づいてモデルベースを用いて対象領域を抽出して領域情報として出力するものである。図１は、その構成を機能ブロックで示すものであり、以下、これについて詳述する。
【００３２】
モデルベース対象領域抽出部５は、動領域解析部８，モデル適合部９，評価・修正部１０，領域予測部１１およびモデルデータベース部１２の５つの機能ブロックから構成される。このうち、動領域解析部８は、パニングベクトル検出部８ａ，動きブロック判定部８ｂ，重心計算部８ｃ，平均色計算部８ｄ，背景雑音除去部８ｅおよび２次元形状パラメータ抽出部８ｆからなり、原画像情報からパニングベクトルを検出して動きベクトルから差し引いて動きブロックを判定し、その重心位置から重心を求め、雑音を除去して領域形状のデータを求めるものである。
【００３３】
モデル適合部９は、３次元位置姿勢推定部９ａ，透視変換部９ｂおよび投影領域計算部９ｃからなる。このモデル適合部９では、動領域解析部８において得られた結果に基づいてワイヤフレームモデルを画像面上に投影して当てはめを行なう。評価修正部１０は、色評価部１０ａ，適合度計算部１０ｂおよび３次元位置姿勢決定部１０ｃからなる。この評価修正部１０では、ワイヤフレーム投影によって得た切り出し領域である２次元推定領域の評価を行なうもので、原画像データとの適合度の評価演算を行ないフィードバックすることにより２次元推定領域の修正およびワイヤフレームモデルの適応的変形の処理を行なう。領域予測部１１は、移動平均処理部１１ａ，３次元動き予測処理部１１ｂおよび領域計算部１１ｃからなる。この領域予測部１１では、評価修正部１０において抽出された領域を現フレームの推定領域としてこれに基づいて次のフレームの推定領域を予測する。
【００３４】
モデルデータベース部１２は、色情報データベース部１２ａ，３次元形状データベース部１２ｂおよびシーン構造データベース部１２ｃからなる。これらは、あらかじめ分かっている対象物についてモデル化したデータを記憶するもので、これらのデータベースを駆動するモデル情報およびモード情報は符号化器の外部に設けられたシステム制御部から入力されるようになっている。
【００３５】
以上のように符号化器１が構成されている。一方、この符号化器１からのデータを受ける復号化器は、一般的な構成のもので良く、例えば、Ｈ．２６１，Ｈ．２６３あるいはＭＰＥＧの規格に対応した構成を有するものである。
【００３６】
次に本実施例の作用について図３ないし図２４も参照して説明する。なお、本実施例の具体的動作の説明に先だって、動作原理についてまず述べる。
すなわち、本実施例で対象としているのは、原画像の中から対象となる人物の顔の領域をいかに効率的に抽出するかということであり、その原理について図３を参照して概略的に説明する。ここでは、このような課題を人物の３次元的形状を標準的なワイヤフレームモデル（以下、ＷＦＭと略する）を設定し、これを原画像に映った人物像に対していかに忠実に割り付けるかという問題として捕らえる。そして、この割り付けを行なうためには、以下の５つの条件が満たされる必要がある。
【００３７】
（１）カメラ特性（焦点距離と視野角）が既知であること
（２）対象物（人物）のＷＦＭの３次元構造データが既知であること
（３）対象物（人物）が略剛体と見なせる運動姿勢を保っていること
（４）対象物（人物）に対するカメラの３次元位置姿勢が既知であること
（５）被写体の人物がＷＦＭと同等の常識的なサイズであること
【００３８】
上述の５つの条件（１）〜（５）が既知であるときに、ＷＦＭの画像面への２次元投影像は人物領域と略一致するはずである。しかし、上記（４）の条件を満たすことは通常の撮影環境では困難であるため、他の条件（１）〜（３）および（５）が満たされていると仮定すると共に、これに加えてさらに以下の情報を活用することによりＷＦＭを割り付けることにする。
【００３９】
（ａ）画像中のテクスチャや色の空間的変化および時間的変化
（ｂ）人物の持つテクスチャ、色の常識的情報（形状情報については上記ＷＦＭに相当）
（ｃ）人物の持つ物理的特性（速度、慣性、運動特性など）
（ｄ）人物の常識的位置姿勢（倒立した状態や、宙に浮いた状態でカメラに映る可能性は非常に低い）
【００４０】
以上を総合した結果、ＷＦＭ、画像情報、常識・物理的制約の３つを有機的に関連付けて問題を解決する手段として得られたのが、以下に示すような計算処理過程を含んでなる画像の符号化方法における対象物の領域の抽出方法である。
【００４１】
（ア）動き情報と色情報とにより画像から２次元の推定領域を抽出する
（イ）抽出された推定領域の２次元形状情報とあらかじめ規定されているモデル情報（色情報、ＷＦＭ情報）により、対象物の概略の３次元位置姿勢を算出する
（ウ）ＷＦＭをこの３次元位置姿勢から得た透視変換処理をすることにより画像面に投影し、領域別にラベル指定をすることにより顔領域を抽出する
（エ）抽出した顔領域の再評価（色、大きさ、位置姿勢の時間変化など）により、現フレームに関する３次元位置姿勢の最終推定値を算出する
（オ）ワイヤフレーム投影により顔領域の最終推定結果を出力する
（カ）必要に応じて、３次元位置姿勢時系列の線形予測により、次フレームにおける顔領域を推定する
【００４２】
図４ないし図８は、上述の原理に基づいて作成したプログラムのフローチャートである。ここで、図４のプログラムは全体の流れを示す概略的なもので、各ステップに対応して図５ないし図８のプログラムが設定されている。以下、これらのフローチャートを参照して全体の動作の流れについて説明する。
【００４３】
なお、一連の処理過程は、あるシーンが始まると、領域抽出のために、まず（Ａ）動領域解析過程を行なって推定領域を抽出し（ステップＳ１）、そのレベル判定を行なってＬＥＶＥＬ１以上の場合には、続いて（Ｂ）モデル適合過程を実施する（ステップＳ２）。そして、ＬＥＶＥＬ２以上の場合には（ステップＳ３）、続いて（Ｃ）評価修正過程（ステップＳ４）および（Ｄ）領域予測過程（ステップＳ５）を実行し、シーンが終了するまで上述を繰り返す（ステップＳ６）。以下、これらの各処理過程（Ａ）〜（Ｄ）を単位として説明する。
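The level-gated flow of steps S1 to S6 described above can be sketched as follows. This is a minimal illustration, not the patent's program: the function name `stages_for_level` and the stage labels are hypothetical, and the only behavior taken from the text is which stages run at LEVEL0, LEVEL1, and LEVEL2 or higher.

```python
def stages_for_level(level):
    """Return the stages executed for one frame at a given LEVEL (0 to 3)."""
    if level < 0 or level > 3:
        raise ValueError("level must be 0, 1, 2 or 3")
    stages = ["A_moving_area_analysis"]            # step S1 always runs
    if level >= 1:
        stages.append("B_model_fitting")           # step S2 (LEVEL1 or higher)
    if level >= 2:
        stages.append("C_evaluation_correction")   # step S4 (LEVEL2 or higher)
        stages.append("D_region_prediction")       # step S5
    return stages
```

At LEVEL0 only the moving area analysis runs and the previous frame's WFM pose and estimated region are reused, which is why the returned list then holds a single stage.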
【００４４】
（Ａ）動領域解析過程
これは、図４の全体のプログラムの動領域解析の処理ステップＳ１に相当するもので、動領域解析部８は、図５に示すような処理を実行する。画像情報が読み込まれてフレーム番号Ｎｆをインクリメントした後にブロックマッチング法などによって動ベクトルの検出処理を行なうと（ステップＡ１〜３）、続いて、パニングベクトル検出処理を実施するようになる（ステップＡ４）。そして、得られたパニングベクトルに基づいて動ベクトルを修正する（ステップＡ５）。
【００４５】
ここで、パニングベクトルとは、画像中の一部が動くのではなく、画面全体が動く場合（パンする場合）、つまり、撮像手段としてのカメラ２が動いた場合などの状況に対応するもので、このカメラ２の動きの大きさと方向をパニングベクトルとして表すものである。そして、上述のようにして検出した動ベクトルからこのパニングベクトルを差し引いて修正することにより、画像中の動き領域をより正確に検出しようというものである。このパニングベクトルの検出については後述する。
【００４６】
さて、得られた動ベクトルに基づいて、画像中の各ブロックについて１６×１６のブロック画素単位（マクロブロックＭＢＫ単位）で動きの有無に関するラベリングを行う（ステップＡ６）。このときラベリング結果に基づいて、あらかじめ設定されているレベルＬＥＶＥＬ０〜３の値と比較してレベル判定を行ない、以降の処理段階の判断基準とする。この場合、各レベルは、所要演算量に対応した動作レベルを考え、最も演算量を抑えたレベルをＬＥＶＥＬ０とする。
【００４７】
▲１▼ＬＥＶＥＬ０の場合
動きブロック数が非常に少なく、あるしきい値以下の場合、ＷＦＭの位置姿勢は変化していないとしてＬＥＶＥＬ０の判定を行ない（ステップＡ７）、ステップＡ１に戻って次フレームの読み込みを開始する。この場合には、前回の処理で得られているＷＦＭの位置姿勢情報や推定領域の情報が符号化処理に使用されるようになっている。
【００４８】
▲２▼ＬＥＶＥＬ１以上の場合
（ａ）動きブロック数が非常に少なくあるしきい値以下の場合には（ステップＡ７）、抽出された推定領域の重心ブロック内の平均色を計算し、それとの誤差が一定以内の平均色を持つブロックを領域に含めてラベリングする。
（ｂ）動きブロック数がある一定しきい値以上の場合、上記の色ベクトル評価によるラベル修正は行わない。
【００４９】
次に、色ベクトル評価によるラベル修正を行ない（ステップＡ８）、ラベル画像のクラスタリングを行ない（ステップＡ９）、クラスタリングされた領域の数である領域数Ｋを決定する（ステップＡ１０）。そして、これらの各領域ｋ（＝１，…，Ｋ）について重心位置をブロックアドレスで計算し、その重心色を領域色とする（ステップＡ１１〜Ａ１４）。この後に重心からの距離Ｌが最も近い別のクラスタ中心との距離の２分の１を超えるときにラベルを「０」にするといった距離によるフィルタ操作を施す（ステップＡ１５）。
【００５０】
次に、過去の複数フレームの情報を利用して出現回数の頻度で雑音成分を除去する時間フィルタリング手法や、大きなクラスタから孤立しているブロックを雑音成分と見なして除去する孤立点除去などの方法により雑音除去を施す（ステップＡ１６，Ａ１７）。この後、領域ｋの重心を再計算し（ステップＡ１８）、重心色との誤差評価により雑音除去を行う（ステップＡ１９）。そして、得られた領域形状の長軸と短軸およびその傾きを抽出し（ステップＡ２０）、以降、クラスタの数Ｋだけ同じことを繰り返す（ステップＡ１２〜Ａ２１）。
【００５１】
（Ｂ）モデル適合過程
図４に示したモデル適合過程（ステップＳ２）では、モデル適合部９において、図６に示すプログラムのフローチャートに従ってモデル適合過程が行なわれる。すなわち、まず、図３にも示したように、人物の上半身に対応したＷＦＭ（ワイヤフレームモデル）を選択し（ステップＢ１）、続いて、各領域に対応して、姿勢角の算出および距離ｒの算出を行なう（ステップＢ２〜Ｂ５）。
【００５２】
この場合、例えば、前述の領域形状の短軸を頭部の投影領域の幅であると見なして姿勢角パラメータである角度α、β、γの算出と比例計算により、距離ｒを求める。これを用いてＷＦＭの端点を透視変換し、ＷＦＭの投影像を得る（ステップＢ６）。この投影像の内部領域をＷＦＭの部位（頭、胴体など）情報に基づいてラベリングする（ステップＢ７）。例えば、頭部のラベル値を「１」、目を「２」、胴体を「７」、背景は「０」といった値に割り付けてラベリングする。そして、以上のことを人物の数つまり領域の数Ｋだけ繰り返し実行する。
【００５３】
すべての領域（人物）について上記演算を行った後、動作レベル指定がＬＥＶＥＬ１の場合は、ここで１フレームの領域抽出演算を終了して、次の画像読み込みに移る。すなわち、以下の評価修正過程、領域予測過程については実行しないのである。これによって、モデル適合の正確さは若干失われるが、全体の演算量を削減することができる。なお、このＬＥＶＥＬ１処理をシーンカット後しばらく時間が経過した後に適用すれば、対象物とＷＦＭのずれとはそれほど大きくならないと考えられる。そして、動作レベルがＬＥＶＥＬ２以上の場合は、以下の処理過程を実行する。
【００５４】
（Ｃ）評価修正過程
次に、図４に示した評価修正過程（ステップＳ４）では、評価修正部１０において、図７に示すプログラムのフローチャートに従って評価修正過程が行なわれる。すなわち、ワイヤフレーム投影によって得た切り出し領域（２次元推定領域）の評価を行なうために、原画像データをもとにして評価を行なって２次元推定領域の修正を行なうと共にワイヤフレームモデルの適応的変形の処理を行なうものである。
【００５５】
例えば、前述のようにラベリングした結果のデータを用いて、人物頭部領域を抽出してその領域重心の色を後述するようにして評価し（ステップＣ１）、この後、前述したレベルがＬＥＶＥＬ２である場合には以降の評価修正過程（ステップＣ３〜Ｃ６）を省略し、ＬＥＶＥＬ３である場合には続く色連続性評価，適合度評価，２次元位置修正および３次元位置姿勢修正を行なう（ステップＣ３〜Ｃ６）。この場合、ＬＥＶＥＬ２では、ステップＣ１の領域重心としての評価で妥当と判断されたときには、そのＷＦＭ領域重心および色を採用し、不適と判断されたときには動領域解析部８にて得られた値を用いるようになっている。
【００５６】
次に、シーンカット以後のフレーム時刻Ｎｆがあるしきい値Ｎｆ＿ＴＨに達している場合にはパラメータ評価しきい値の変更を行う（ステップＣ７，Ｃ８）。これは、シーンカット直後の例えば１〜３秒程度は大きくしきい値を取り、それ以後は小さくするものである。この値をもとにして、物理的制約による評価および常識適用による総合評価を行い（ステップＣ９〜Ｃ１２）、評価結果が良好であればこの処理を終了し、そうでない場合にはもう一度モデル適合過程（ステップＳ２）に戻って繰り返し行なう（ステップＣ１３）。
【００５７】
（Ｄ）領域予測過程
次に、図４に示した領域予測過程（ステップＳ５）では、領域予測部１１において、図８に示すプログラムのフローチャートに従って領域予測過程が行なわれる。上述したような１フレーム全体に渡る大局的処理により抽出した領域は現在フレームにおける推定領域であるが、これをもとにして現在フレームを再度処理して符号化を行うことは遅延時間と演算時間の増大を招く。そこで、ここでは、この推定領域から次のフレームの予測領域を求めることを行なう。
【００５８】
この場合、領域の平均動きベクトルの線形予測という形で２次元的に実現することも可能であるが、ここでは、各領域について（ステップＤ１，Ｄ２，Ｄ７）３次元位置姿勢の時系列の線形予測により（ステップＤ３）、Ｎｆ＋１時刻のＷＦＭの投影領域を予測する（ステップＤ４〜Ｄ６）。これにより、現在フレームの符号化と次フレームの領域予測を同時に並行して進めることができるようになる。
【００５９】
次に、上述した処理の流れにおける各部の処理内容の詳細について項目別に説明する。
Ａ．動領域解析過程について
（１）動ベクトルの検出
動きベクトルの検出は、本実施例においては、例えば、原画像（例えばＣＩＦ）サイズの画像についてブロック毎のフレーム間相関を計算することによって行うブロックマッチング法を用いる。この動ベクトルの検出方法については、ブロックマッチング法以外にも利用することができる。
【００６０】
（２）パニングベクトルの検出
カメラ２が静止している場合は、上述のようにして検出される動ベクトルがそのまま対象物の動きの２次元投影に対応することが多いが、カメラ２が動いている場合（パニングが有る場合）には、このカメラ２の動きに対応したパニングベクトルを検出して対象物の動きと分離する必要がある。
【００６１】
［ａ］パニングの有無の判定と計算
パニングベクトルが発生しているか否かの判定については、図９に示す計算手順に従う。これは、例えば、図１０のように、画像周辺部の４辺（あるいは矩形領域）について、各１個の平均動ベクトルを計算する（ステップＥ１）。特に、ここでは実施例として画像の最も外側から２番目のＭＢＫで構成する辺のブロック列ＢＬ１〜ＢＬ４をエッジ領域として、それらの各ＢＬ１〜４の平均動きベクトルＰＭＶｎ（ｎ＝１〜４）を次式（１）〜（４）に基づいて計算する。
【００６２】
【数１】

ＰＭＶｎ；エッジ領域ＢＬｎ（ｎ＝１〜４）のパニングベクトル
ＭＶ（ｉ，ｊ）；ＭＢＫ座標（ｉ，ｊ）のＭＢＫの動きベクトル
【００６３】
この４個のパニングベクトルＰＭＶ１〜４が大体同じ大きさ（≠０）と方向であれば画像全体が動いていると見なすことができるので、パニングベクトルが発生していると判定する。そこでまず、４個のパニングベクトルＰＭＶ１〜４の分散値ＶPAN ２を次式（５），（６）で計算する（ステップＥ２，Ｅ３）。
【００６４】
【数２】

ｄ（ｖｉ，ｖｊ）；ベクトルｖｉとｖｊの間の距離関数（ここでは２乗距離）
ＭＰＶ；４個のＰＭＶの平均ベクトル
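The formula images for equations (1) to (6) did not survive in this text. Based only on the symbol definitions given around them, one plausible reading is shown below. The set notation $BL_n$ for the blocks of edge region n and the 1/4 normalization of the variance are assumptions, not the patent's confirmed notation.

```latex
\begin{align}
\mathrm{PMV}_n &= \frac{1}{|BL_n|} \sum_{(i,j) \in BL_n} \mathrm{MV}(i,j), \qquad n = 1, \dots, 4 \\
\mathrm{MPV} &= \frac{1}{4} \sum_{n=1}^{4} \mathrm{PMV}_n \\
V_{PAN}^{2} &= \frac{1}{4} \sum_{n=1}^{4} d(\mathrm{PMV}_n, \mathrm{MPV}), \qquad d(v_i, v_j) = \lVert v_i - v_j \rVert^{2}
\end{align}
```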
【００６５】
また、画面全体のパニングベクトルの推定値であるＭＰＶの大きさについても評価する必要があるため、平均ベクトルＭＰＶの大きさを２乗距離値ＡPAN ２として表し、次式（７）により求める（ステップＥ５）。すなわち、
ＡPAN ２＝ｄ（ＭＰＶ，０） …（７）
であり、この２乗距離値ＡPAN ２および分散値ＶPAN ２を用いて以下のようにしきい値判定する（ステップＥ４，Ｅ６，Ｅ７，Ｅ８）。
【００６６】
▲１▼ＶPAN ２≧ＶPAN ２＿ＴＨのとき、
あるいは、
ＡPAN ２＜ＡPAN ２＿ＴＨのとき
→ パニングはなしと判定する（ステップＥ８）
▲２▼上記以外の場合
→ パニングありと判定し、ＭＰＶをパニングベクトルとする（ステップＥ７）
しきい値については、経験的に設定するが、例えば、次のように設定することで、比較的良好な結果を得ることができる。
（ＶPAN ２＿ＴＨ，ＡPAN ２＿ＴＨ）＝（５０，１）
【００６７】
［ｂ］パニングベクトル除去による動き量の補正
パニングありと判定したとき、各ＭＢＫの動きベクトルを次式（８）を用いて補正する。
ＴＭＶ（ｉ，ｊ）＝ＭＶ（ｉ，ｊ）−ＭＰＶ …（８）
この補正された動きベクトルＴＭＶを実質的な動きベクトルとし、以下の動領域解析過程を進める。
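The panning decision and the correction of equation (8) can be sketched in a few lines. This is an illustrative sketch under assumptions: the function names are hypothetical, the variance is normalized by the number of edge vectors (the original formula image is missing), and the default thresholds are the example values (50, 1) quoted in the text.

```python
def detect_panning(edge_vectors, v_th=50.0, a_th=1.0):
    """edge_vectors: the four mean edge vectors PMV1..PMV4 as (x, y) pairs."""
    n = len(edge_vectors)
    mpv = (sum(v[0] for v in edge_vectors) / n,
           sum(v[1] for v in edge_vectors) / n)            # mean vector MPV
    # Variance as mean squared distance to MPV (normalization is an assumption).
    v_pan2 = sum((v[0] - mpv[0]) ** 2 + (v[1] - mpv[1]) ** 2
                 for v in edge_vectors) / n
    a_pan2 = mpv[0] ** 2 + mpv[1] ** 2                     # d(MPV, 0), eq. (7)
    no_panning = v_pan2 >= v_th or a_pan2 < a_th           # case 1 in the text
    return (not no_panning), mpv

def correct_motion_vectors(mv_grid, edge_vectors):
    """Apply TMV = MV - MPV (eq. (8)) only when panning is detected."""
    panning, mpv = detect_panning(edge_vectors)
    if not panning:
        return mv_grid
    return [[(mx - mpv[0], my - mpv[1]) for (mx, my) in row] for row in mv_grid]
```

When all four edge vectors agree and are non-zero, the whole frame is treated as moving and every macroblock vector is reduced by MPV; otherwise the vectors pass through unchanged.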
【００６８】
（３）変化ブロックの重心位置の計算
変化ブロックの検出には以下の方法が考えられる。
▲１▼フレーム間差分のしきい値判定
▲２▼動き補償フレーム間差分のしきい値判定（パニングを考慮しない）
▲３▼パニング補正動きベクトルによる動領域判定
【００６９】
ここで、カメラ２が固定の場合は上記▲１▼の方法で十分である。しかし、この方法では、カメラ２のほんの僅かなぶれに対しても敏感に反応してしまうので、符号化器１の演算容量に余裕がある場合は上記▲２▼の方法または▲３▼の方法を使用することが望ましい。ここで、▲２▼の方法は主として色合いや輝度の変化に基づく変化ブロックの検出に対応し、▲３▼の方法は本来の対象物の動きに起因する変化ブロックの検出に対応する。そこで、▲２▼および▲３▼の方法の併用で検出した変化ブロックについて重心位置を検出する。
【００７０】
これらの操作は、画面内の対象物が１個と仮定できる場合である。対象物が複数個ある場合には上記で得た変化ブロック群のクラスタリングが必要になる。この場合において、クラスタリングを行うかどうかは以下のようなプロセスで決定される。
【００７１】
［ａ］第１ステップ …重心計算またはクラスタリングの起動
ケース１ …被写体モードにおいて人物が一人 → 重心計算
ケース２ …対象領域の数が不明 → 重心計算
ケース３ …対象領域の数が既知（２以上） → クラスタリング
ケース４ …対象領域の数が不明（２以上） → クラスタリング
［ｂ］第２ステップ …クラスタリング
クラスタリングにおいては重心を求めると同時にクラスタ半径も抽出する
【００７２】
第１ステップのケース２の場合の処理は、対象領域数Ｋ＝１の仮定を行ったことに相当しており、この仮定のもとに計算を進めた結果、後述の評価プロセスで矛盾する（誤差評価関数があるしきい値を超える）場合には、次のフレームにおいては仮定を変更してケース４を選択するようになっている。
【００７３】
（４）色解析
［ａ］重心色の計算
上述の（３）変化ブロックの重心位置の計算で決定された１個以上の領域重心ブロックについてブロック画素の平均色を計算する。これを重心色と呼び、例えば人物を対象とする場合はこの重心色を肌色の１次推定結果とする。なお、平均色は以下の式（９）〜（１１）で計算する。
【００７４】
【数３】

【００７５】
ＣPAT （ｋ，ｉ，ｊ）；ＭＢＫ（ｉ，ｊ）の平均色ベクトルのｋ番目の成分（ｋ＝１，２，３）
ＭＢＫ＿Ｙ（ｉ，ｊ，ix，jy）；ＭＢＫ（ｉ，ｊ）の（ix，jy）画素のＹ成分画素値
ＢＬＫ＿Ｕ（ｉ，ｊ，ix，jy）；ＭＢＫ（ｉ，ｊ）のＵブロックの（ix，jy）画素値
ＢＬＫ＿Ｖ（ｉ，ｊ，ix，jy）；ＭＢＫ（ｉ，ｊ）のＶブロックの（ix，jy）画素値
【００７６】
［ｂ］各ブロックの平均色と重心色との距離計算
これは次式（１２）にしたがって求める。
【００７７】
【数４】

ＬＣＤ（ｉ，ｊ）；ＭＢＫ（ｉ，ｊ）の平均色と重心色との間の絶対値距離
ＩＧ；重心ブロックの水平座標
ＪＧ；重心ブロックの垂直座標
【００７８】
［ｃ］各ブロックのしきい値判定
各ＭＢＫについて計算した色距離ＬＣＤをもとに、次のような判定を行う。
ＬＣＤ＞ＬＣＤ＿ＴＨの場合
→ ＭＢＫ（ｉ，ｊ）は領域外
ＬＣＤ≦ＬＣＤ＿ＴＨの場合
→ ＭＢＫ（ｉ，ｊ）は領域内
ＬＣＤ＿ＴＨ；しきい値
【００７９】
ここで、ＬＣＤは０〜２５５の値をとる３成分の絶対値距離の和として計算されるので、０≦ＬＣＤ≦７６５、となる。これをもとに、ＬＣＤ＿ＴＨを設定する（１００〜２００程度が妥当である）。なお、このしきい値はシーンカット直後と、その後で動的に変更することが望ましい。例えば、次のようである。
【００８０】
０≦Ｎｆ≦Ｎｆ＿ＴＨのとき
→ ＬＣＤ＿ＴＨ＝３００
Ｎｆ＿ＴＨ＜Ｎｆのとき
→ ＬＣＤ＿ＴＨ＝１００
Ｎｆ；ある１カットのシーンのフレーム番号
というように、再帰的領域抽出結果が安定して対象物をトラッキングできるようになるまでのフレーム時間をＮｆ＿ＴＨとしてその間は大きい変化を許容し、その後は許容誤差を小さく設定する。シーンカットの判定手法は限定しないが、画面全体についてのフレーム間動き補償差分電力の総和のしきい値判定に基づく方法が考えられる。
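The color-distance test with a threshold that tightens after the scene cut can be sketched as below. The sum of absolute differences over three components and the values 300 and 100 come from the text; the function names, the dictionary layout keyed by block coordinates (i, j), and the value passed for Nf_TH are illustrative assumptions.

```python
def color_distance(c1, c2):
    """Sum of absolute differences over the three color components (0 to 765)."""
    return sum(abs(a - b) for a, b in zip(c1, c2))

def lcd_threshold(nf, nf_th, early=300, late=100):
    """Loose threshold up to frame nf_th after a scene cut, tighter afterwards."""
    return early if nf <= nf_th else late

def label_blocks(mean_colors, centroid, nf, nf_th):
    """mean_colors: {(i, j): (Y, U, V)}; returns {(i, j): True if inside the region}."""
    ref = mean_colors[centroid]                  # centroid color of the region
    th = lcd_threshold(nf, nf_th)
    return {pos: color_distance(col, ref) <= th for pos, col in mean_colors.items()}
```

With this layout, a block whose mean color differs from the centroid color by 130 is accepted just after a scene cut and rejected once the tighter threshold applies.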
【００８１】
（５）雑音除去
［ａ］距離判定による雑音除去
重心ブロックと各ブロックとの間の距離を計算し、あるしきい値以上のブロックを対象領域の候補から除去する。これは、クラスタリングを適用した場合では、クラスタ半径を基準にした領域判定を行うことに相当する。すなわち、領域重心ＭＢＫ（ＩＧ，ＪＧ）について、次式（１３）により２乗距離ｄＲを計算し、その結果に基づいて下記のように判断する。
【００８２】
ｄＲ＝ｄ［（ＩＧ，ＪＧ），（ｉ，ｊ）］ …（１３）
ｄＲ＜（Ｒｃ＋δ）
→ ＭＢＫ（ｉ，ｊ）は領域候補である
ｄＲ≧（Ｒｃ＋δ）
→ ＭＢＫ（ｉ，ｊ）は領域候補ではない
Ｒｃ；クラスタ半径
δ；正整数（例えば１〜３程度）
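The distance-based noise removal can be sketched as follows. One point is an interpretation: the text calls dR a squared distance yet compares it with Rc + δ, so this sketch compares the squared distance with (Rc + δ) squared. The function name and the block list layout are hypothetical.

```python
def remove_far_blocks(blocks, centroid, rc, delta=2):
    """Keep only blocks (i, j) closer to the centroid than the radius rc + delta."""
    ig, jg = centroid
    limit2 = (rc + delta) ** 2                  # squared limit (interpretation)
    return [(i, j) for (i, j) in blocks
            if (i - ig) ** 2 + (j - jg) ** 2 < limit2]
```

A block exactly on the limit is rejected, matching the strict inequality dR < (Rc + δ) for region candidates.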
【００８３】
［ｂ］時間フィルタリングによる雑音除去
時間方向のヒストグラム処理により、雑音ブロックを除去する。これは、過去の複数フレームの処理で得られているデータに基づいてフィルタリング処理を行なうもので、例えば、３フレーム時間程度に渡って検出した動きベクトルの発生領域について重ね合わせた状態で得られるパターンについてヒストグラムをとって発生頻度の高いブロックを検出するようにした方法である。
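The temporal histogram filter can be illustrated with a simple frequency count over recent frames. The text only says that frequently occurring blocks are detected over roughly three frames; the function name and the cut-off `min_count` are assumptions made for this sketch.

```python
from collections import Counter

def temporal_filter(label_history, min_count=2):
    """label_history: one set of moving-block coordinates per recent frame."""
    counts = Counter(block for frame in label_history for block in frame)
    # Keep blocks that appear in at least min_count of the stored frames.
    return {block for block, count in counts.items() if count >= min_count}
```

Blocks that appear in only one of the stored frames are treated as noise and dropped, while blocks that persist across frames survive.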
【００８４】
［ｃ］孤立点除去や輪郭形状の滑らかさによるフィルタリング
通常の２値画像処理において用いられているのと同様の孤立点除去および輪郭雑音のフィルタリングをＭＢＫラベル画像に対して適用する。
【００８５】
［ｄ］領域内の穴埋め
領域内の穴は水平スキャンにおける２値平滑化により穴埋めを施す。ここで領域形状があらかじめモデル等で予測できる場合は（例えば楕円など）隣接する水平ブロック間での長さを予測することでブロック帯を安定して抽出できる。
【００８６】
（６）傾き検出
図４に示したように、抽出した領域の傾きから、ワイヤフレームの配置における回転角γを決定する。このとき、回転角γの決定は、例えば次のような方法で行なう。すなわち、抽出した領域の端部に位置する縦方向あるいは横方向のブロック列の対向する中点間を結んで長軸あるいは短軸を検出し、その軸の中点に直交する直線を求めて領域の端部と交差する位置を求めることにより残った短軸あるいは長軸を検出する。得られた長軸および短軸のデータから傾きを決定することができる。
【００８７】
回転角γの決定については、次のようにして求めることもできる。すなわち、領域の重心位置を求めこの重心位置を通る直線について傾きを変化させたときに領域端部と交差する線分の長さが最大となるものを長軸として検出し、この長軸に直交する線分を短軸として検出することもできる。
【００８８】
これは、隣接する水平ブロック帯の間の中心位置を結ぶことによって得られる局所的な傾きの平均に等しい（図１１（ａ）参照）。しかし、前述のようにして雑音除去を施して抽出した領域であっても、ある程度は抽出された領域形状に雑音を含んでいるため、傾き検出方法（特にエッジのみを用いる簡単な方法の場合）によっては長軸／短軸が反転する可能性がある（図１１（ｂ）参照）。これに対しては、次の方法が考えられる。
【００８９】
［ａ］雑音除去を高度化し、完全に凸領域になるようにする
［ｂ］傾き検出をロバスト化する
［ｃ］フレーム間の傾き変化に対して制約を設定する
［ａ］、［ｂ］は画像処理の改善による方法であるので省略する。［ｃ］については、後述する評価修正過程の説明で述べる。
【００９０】
Ｂ．モデル適合過程について
上記で得られた動領域解析結果をもとに、ワイヤフレームモデルを配置する。そのために、まず、３次元位置姿勢を計算する。ワイヤフレームの画像面上への投影に当たっては透視変換行列を計算するための位置姿勢パラメータｒ，θ，φ，α，β，γを上記の動領域解析結果から計算する。なお、これらの位置姿勢パラメータについては、ｒはカメラ２との間の距離を示す値であり、θ，φは対象物の姿勢を示す値であり、α，βはカメラ２の傾きを示す値であり、γはカメラ２の光軸回りの回転角を示す値である。
【００９１】
（１）対象物の３次元位置姿勢の計算
［ａ］光軸まわりの回転角γ
投影面上の長軸の直線の傾き角度をＤＲＣＴとすると、次式（１４）で表すことができる。
γＥ＝ａｒｃｔａｎ（ＤＲＣＴ） …（１４）
【００９２】
［ｂ］対象物中心の偏差角α，β
領域重心（ＩＧ，ＪＧ）を対象物中心とみなし、ＣＩＦ画像の画素座標（ＮＰＡ，ＮＰＢ）に変換すると、次式（１５）、（１６）のようになる。そして、これより、画像中心からの偏差角α，βは次式（１７）、（１８）で表される。
【００９３】
【数５】

Ｈ＿ＳＩＺＥ；ＣＩＦ画像の水平画素サイズ
Ｖ＿ＳＩＺＥ；ＣＩＦ画像の垂直画素サイズ
【数６】

ＦＬ；カメラの焦点距離
【００９４】
［ｃ］距離ｒ
動領域解析で抽出した一次推定領域をもとに、次式（１９），（２０）でカメラ２と対象物との間の距離ｒＥを計算する。
【００９５】
【数７】

ＭＨＷ：距離ｒ＝ｒｓの時に３次元対象物を画像面に投影した際に得られる対象領域の水平幅（あるいは垂直幅），ただし，対象物の光軸周り回転角はγ＝０とする
ＲＨＷ０：一次推定領域のラインスキャンで得られる水平（垂直）方向の幅
ＲＨＷ：一次推定領域の傾き角度γＥで補正した水平（垂直）方向の幅
【００９６】
［ｄ］位置を示す角度θ，φ
角度θ，φを１回目の動領域解析の１次推定結果から計算する方法は限定していない。１次推定においては、通常、先見的知識がない限り対象物は計算開始時点において正面を向いていると仮定する。即ち、
θ＝０，φ＝０
とする。
【００９７】
（２）投影領域のラベリング
ＭＢＫラベル画像と同じ解像度に変換したワイヤフレームをＭＢＫラベル画像にオーバレイし、直線部分をラベリングする。その後、ラインスキャンにより穴をラベリングする。
【００９８】
（３）部分領域のラベリング
上記の投影領域のラベリングにおいて、ワイヤフレームの３次元座標をもとに対象物の部分領域毎にラベル値を変えてやることにより、対象物の特徴部分と２次元部分領域を対応づけることができる。
【００９９】
Ｃ．評価修正処理について
（１）ワイヤフレーム投影によって得た切り出し領域（２次推定領域）の評価
ここでは、原画像データをもとにして前述のモデル適合処理過程で適合処理を行なったワイヤフレームモデルの適合度の評価を行ない、そのフィードバックにより、次の事項について処理を実行する。
［ａ］２次推定領域の修正（３次元位置姿勢の修正）
［ｂ］ワイヤフレームモデルの適応的変形
【０１００】
（２）常識の利用
評価手段としてはワイヤフレームモデルに対応するオブジェクト（車、人等）が持つ形状情報，位置姿勢、画像テクスチャに対して以下のような制約条件のもとで評価を行なう。
【０１０１】
［ａ］物理的制約
▲１▼力学的慣性の妥当性
▲２▼速度、振動周波数の妥当性
▲３▼位置関係の妥当性
［ｂ］概念的制約
▲１▼位置関係の妥当性評価
▲２▼色の妥当性
▲３▼運動軌跡の妥当性
【０１０２】
（３）色の妥当性評価
図１２に色の評価に関する基本方針を示す。
［ａ］重心色の再評価
切り出し領域（２次推定領域）の重心色を動領域解析過程における色解析処理と同様にして計算する。即ち、切り出し領域の重心座標を（IG１，JG１）とすると重心色は次式（２１）〜（２３）で表わされる。これらを用いて距離計算を式（２４）に基づいて行なう。
【０１０３】
【数８】

【０１０４】
これに対して次のようにしきい値判定を行なう。
【０１０５】
ＬＣＤＧ＞ＬＣＤＧ＿ＴＨ
→ＭＢＫ（ＩＧ１，ＪＧ１）は領域重心から外れている
ＬＣＤＧ≦ＬＣＤＧ＿ＴＨ
→ＭＢＫ（ＩＧ１，ＪＧ１）は領域重心として妥当である
ＬＣＤＧ＿ＴＨ；しきい値，対象領域を何であると仮定しているかによって値は異なりうる。
【０１０６】
［ｂ］領域内の色の連続性評価
ここでは、領域の境界部分の色の連続性評価を次のようにして行なう。図１２のように領域内をブロック毎にスキャンし、重心色と各ブロックの平均色との間の誤差を再評価する。即ち、ＬＣＤＧを各ブロックについて計算し、その値に応じて次のようにラベリングを行なう。尚、ここであらかじめ推定領域は「１」、領域外は「０」にラベリングしておく。
【０１０７】
ＬＣＤＧ＞ＬＣＤＧ＿ＴＨ
→ＭＢＫ（ｉ，ｊ）は領域外であると判定し，「２」にラベリングする
ＬＣＤＧ≦ＬＣＤＧ＿ＴＨ
→ＭＢＫ（ｉ，ｊ）は領域内にあると判定する
ＬＣＤＧ＿ＴＨ；しきい値，対象領域の仮定によって値は異なりうる
【０１０８】
［ｃ］３次元位置姿勢の修正ベクトルの計算
（ア）２次元の近似修正ベクトルの計算
上述の［ｂ］領域内の色の連続性評価で得られたラベル画像をもとに次の２つの部分領域の重心Ｇ１，Ｇ２を計算する。
Ｇ１（ＩＳＧ１，ＪＳＧ１）；２次推定領域内でラベル１（推定領域内）の部分領域の重心
Ｇ２（ＩＳＧ２，ＪＳＧ２）；２次推定領域内でラベル２（色誤差が大）の部分領域の重心
【０１０９】
これをもとにして、２次元領域としての位置修正ベクトルの方向を次式（２５）で計算する。
ＤＳＧ１２＝（ＩＳＧ２−ＩＳＧ１，ＪＳＧ２−ＪＳＧ１） …（２５）
ただし、これは方向のみ有効であり、大きさはあまり正確ではない。大きさを知るには図１３に示すようにＧ１、Ｇ２を通るベクトルを延長して推定領域境界との交点ブロックＢ２と実物領域境界との交点ブロックＢ１を求める。その座標の差の値ＤＳＢ１２は次式（２６）で示され、これが２次元近似修正ベクトルとなる。
ＤＳＢ１２＝（ＩＳＢ２−ＩＳＢ１，ＪＳＢ２−ＪＳＢ１） …（２６）
これを用いるか、あるいは後述の水平垂直スキャンによって２次元位置修正ベクトルＤＶ（＝ＤＶｘ＋ＤＶｙ）を計算する。
【０１１０】
（イ）位置関係
３次元位置姿勢を修正するためには次の２つを判定する必要がある。
１）推定領域と真の領域の間の大小関係
２）推定領域の境界部分における色の連続性
これは以下のようにして判定する。
【０１１１】
▲１▼水平スキャン
図１４のように［ｂ］領域内の色の連続性評価で得たラベル画像をもとにして推定領域の重心Ｇ（IG１，JG１）から右と左の両方向に［ii０，ii１］の範囲で水平スキャンを行ない、次の２つの位置を求める。
【０１１２】
ＣＬ（ii０Ｌ，JG１）；左スキャンで初めて２または０のラベル値のブロックに当たる位置
ＣＲ（ii０Ｒ，JG１）；右スキャンで初めて２または０のラベル値のブロックに当たる位置
この２点における色ベクトルの連続性を以下のようにして判定する
（ｃａｓｅ−ＨＣ１）
ＬＣＤ［（ii０Ｌ，JG１），（ii０Ｌ−１，JG１）］≦ＬＣＤＴＨ
→左連続
（ｃａｓｅ−ＨＣ２）
ＬＣＤ［（ii０Ｒ，JG１），（ii０Ｒ＋１，JG１）］≦ＬＣＤＴＨ
→右連続
また、位置関係を以下のようにして判定する。
【０１１３】
＜ｃａｓｅ−ＨＰ１＞
ii０＜ii０Ｌ且つ ii０Ｒ＜ii１
→真の領域は推定領域よりも小さい
ＤＶｘ＝（０，ＤＳＧ１２ｘ） …（２８）
ＲＤ＝Ｒ×Ｗ／ＷＤ …（２９）
ＷＤ＝ii０Ｒ−ii０Ｌ＋１ …（３０）
Ｗ＝ii１−ii０＋１ …（３１）
＜ｃａｓｅ−ＨＰ２＞
ii０＜ii０Ｌ且つ ii０Ｒ＝ii１且つ右連続
→真の領域は推定領域よりも右にずれている
ＤＶｘ＝（０，ii０−ii０Ｌ） …（３２）
【０１１４】
＜ｃａｓｅ−ＨＰ３＞
ii０≧ii０Ｌ且つ ii０Ｒ＜ii１且つ左連続
→真の領域は推定領域よりも左にずれている
ＤＶｘ＝（０，ii１−ii０Ｒ） …（３３）
＜ｃａｓｅ−ＨＰ４＞
ii０＝ii０Ｌ且つii０Ｒ＝ii１且つ右連続且つ左連続
→真の領域は推定領域よりも大きい
ＤＶｘ＝（０，ＤＳＧ１２ｘ） …（３４）
ＲＤ＝Ｒ×Ｗ／ＷＤ …（３５）
ＷＤ＝ii０Ｒ−ii０Ｌ＋１ …（３６）
Ｗ＝ii１−ii０＋１ …（３７）
ただし、
ＤＶｘ；水平方向の２次元位置修正ベクトル
ＲＤ；修正後の距離値
である。
【０１１５】
▲２▼垂直スキャン
図１４のように［ｂ］領域内の色の連続性評価で得たラベル画像をもとにして推定領域の重心Ｇ（IG１，JG１）から上下両方向に［jj０，jj１］の範囲で垂直スキャンを行ない、次の２つの位置を求める。
【０１１６】
ＣＵ（IG１，jj０Ｕ）；上スキャンで初めて２または０のラベル値のブロックに当たる位置
ＣＤ（IG１，jj０Ｄ）；下スキャンで初めて２または０のラベル値のブロックに当たる位置
この２点における色ベクトルの連続性を以下の様にして判定する
（ｃａｓｅ−ＶＣ１）
ＬＣＤ［（IG１，jj０Ｕ），（IG１，jj０Ｕ−１）］≦ＬＣＤＴＨ
→上連続
（ｃａｓｅ−ＶＣ２）
ＬＣＤ［（IG１，jj０Ｄ），（IG１，jj０Ｄ＋１）］≦ＬＣＤＴＨ
→下連続
また、位置関係及び垂直方向の２次元修正を以下のようにして判定する。
【０１１７】
＜ｃａｓｅ−ＶＰ１＞
jj０＜jj０Ｕ且つ jj０Ｄ＜jj１
→真の領域は推定領域よりも小さい。
【０１１８】
ＤＶｙ＝（０，ＤＳＧ１２ｙ） …（３８）
ＲＤ＝Ｒ×Ｈ／ＨＤ …（３９）
ＨＤ＝jj０Ｄ−jj０Ｕ＋１ …（４０）
Ｈ＝jj１−jj０＋１ …（４１）
＜ｃａｓｅ−ＶＰ２＞
jj０＜jj０Ｕ且つ jj０Ｄ＝jj１且つ下連続
→推定領域は真の領域よりも上にずれている
ＤＶｙ＝（０，jj０−jj０Ｕ） …（４２）
【０１１９】
＜ｃａｓｅ−ＶＰ３＞
jj０≧jj０Ｕ且つ jj０Ｄ＜jj１且つ上連続
→推定領域は真の領域よりも下にずれている
ＤＶｙ＝（０，jj１−jj０Ｄ） …（４３）
＜ｃａｓｅ−ＶＰ４＞
jj０＝jj０Ｕ且つ jj０Ｄ＝jj１且つ下連続且つ上連続
→真の領域は推定領域よりも大きい
ＤＶｙ＝（０，ＤＳＧ１２ｙ） …（４４）
ＲＤ＝Ｒ×Ｈ／ＨＤ …（４５）
ＨＤ＝jj０Ｄ−jj０Ｕ＋１ …（４６）
Ｈ＝jj１−jj０＋１ …（４７）
ただし、
ＤＶｙ；垂直方向の２次元位置修正ベクトル
ＲＤ；修正後の距離値
である。
以上の結果から位置関係が図１５（ａ）〜（ｆ）のどの場合に相当するかが判別できる。
【０１２０】
（ウ）３次元位置姿勢修正ベクトルの推定
上記によって場合分けされた結果と２次元位置修正近似ベクトルＤＶ及び距離修正値ＲＤによって３次元位置に関する修正を行なう。
【０１２１】
（４）物理的制約
対象物の動きに関して、対象物の物理的性質から常識的に判断した場合に自ずと決まる条件の範囲がある。このような条件をあらかじめ考慮しておくことにより対象物の位置姿勢の時間変化を評価する。
【０１２２】
［ａ］位置姿勢パラメータにおける時系列変化の妥当性
フレーム間の連続性を考えずに上記のプロセスで位置姿勢を推定した場合、対象物の速度や加速度、振動数が常識的範囲を越えてしまう場合がある。そこで、以下の３つの方法でフレーム毎の計算結果に修正を加える。
【０１２３】
▲１▼時系列の移動平均
▲２▼位置姿勢の時間差分の上限を設定する
▲３▼フレーム間予測値と誤差の上限を設定する
以下、▲２▼を重点的に説明する。また、▲３▼については領域予測処理の説明で詳述する。
【０１２４】
［ｂ］画面に平行な面内の時間的偏位の評価
画面内における領域重心Ｇの時間変化はα，βのフレーム間変化から計算できる。αおよびβの時間変化を次式（４８），（４９）で定義する。
【０１２５】
【数９】

ｎ；ビデオレート（３０ｆｒａｍｅ／ｓ）画像に対してＫｄｒｏｐフレームに１枚サンプルしたフレームの番号
これから、領域重心を起点とした画面に平行な面内の偏位量ＤＰＬを次式（５０）〜（５２）で計算する。
【０１２６】
【数１０】

上式（５２）においては、絶対値距離尺度を用いれば次式（５３）のようになる。
【０１２７】
ＤＰＬ（ｎ）＝｜Ｄｘ（ｎ）｜＋｜Ｄｙ（ｎ）｜ …（５３）
これらに対して、しきい値ＤＰＬ＿ＴＨを設定し、以下のような判定処理を行う。
【０１２８】
ＤＰＬ（ｎ）＞ＤＰＬ＿ＴＨ
→α（ｎ）＝α（ｎ−１），β（ｎ）＝β（ｎ−１）
ＤＰＬ＿ＴＨ≧ＤＰＬ（ｎ）
→α（ｎ），β（ｎ）を承認
例えば、特殊な場合は別として１秒間で人間が日常範囲の運動で動く量は１〜２ｍ位であると考えると、１／３０秒間の動き量のしきい値ＤＰＬ＿ＴＨは大体３０〜７０ｍｍ程度に設定できる。
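The arithmetic behind the 30 to 70 mm range, and the rule of holding the previous value when the change is too large, can be sketched as below. The function names are hypothetical; the 1 to 2 m per second figure and the 1/30 second frame interval come from the text.

```python
def dpl_threshold_mm(speed_m_per_s, frame_rate=30.0):
    """Displacement per frame, in millimetres, for a given speed."""
    return speed_m_per_s * 1000.0 / frame_rate

def accept_or_hold(previous, candidate, change, threshold):
    """Keep the previous estimate when the inter-frame change exceeds the threshold."""
    return previous if change > threshold else candidate
```

Speeds of 1 m/s and 2 m/s give roughly 33 mm and 67 mm per frame, which is consistent with the quoted range. The same hold-or-accept rule applies to (α, β) here and, per the text, to r, θ and φ with their own thresholds.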
【０１２９】
［ｃ］距離の時間変化の評価
視点から領域重心までの距離ｒの時間変化は、次式（５４）のように記述できる。
【０１３０】
【数１１】

これに対してしきい値ＡＲ＿ＴＨを設定し、以下のような判定処理を行う。
【０１３１】
Ｄｒ（ｎ）＞ＡＲ＿ＴＨ
→ｒ（ｎ）＝ｒ（ｎ−１）
ＡＲ＿ＴＨ≧Ｄｒ（ｎ）
→ｒ（ｎ）を承認
【０１３２】
［ｄ］姿勢の時間変化の評価
対象物の３次元姿勢θ，φについても時間変化としきい値判定を上記と同様に考えることができる。
▲１▼水平方向の姿勢θ
【数１２】

Ｄｔｈ（ｎ）＞ＴＨＥ＿ＴＨ
→θ（ｎ）＝θ（ｎ−１）
ＴＨＥ＿ＴＨ≧Ｄｔｈ（ｎ）
→θ（ｎ）を承認
▲２▼垂直方向の姿勢φ
【数１３】

Ｄｐｈ（ｎ）＞ＰＨＡ＿ＴＨ
→φ（ｎ）＝φ（ｎ−１）
ＰＨＡ＿ＴＨ≧Ｄｐｈ（ｎ）
→φ（ｎ）を承認
【０１３３】
［ｅ］動的制御
動領域解析において物理的制約を適用する際に、初期段階（例えば最初の１秒位）では、パラメータの制約を緩め、時間がたった後で徐々に制約を強める（図１６（ａ）〜（ｃ）参照）。すなわち、上記で設定した各しきい値ＸＸＸ＿ＴＨはシーンカット直後とその後で動的に変更することが望ましい。すなわち、
０≦Ｎｆ≦Ｎｆ＿ＴＨ
→ＸＸＸ＿ＴＨ＝ＴＨ１（ＴＨ１＞ＴＨ２）
Ｎｆ＿ＴＨ＜Ｎｆ
→ＸＸＸ＿ＴＨ＝ＴＨ２
Ｎｆ；ある１カットのシーンのフレーム番号
というように、再帰的領域抽出結果が安定して対象物をトラッキングできるようになるまでのフレーム時間をＮｆ＿ＴＨとしてその間は大きい変化を許容し、その後は、許容誤差を小さく設定する。
【０１３４】
（５）モデルの適応化
［ａ］肩幅の適応
ほぼ正面画像であると仮定できるとき、初期ワイヤフレームに対して胴体部分の水平幅をある範囲内で変更して胴体部分のみの適合度評価関数が最大となる値を選択する。
【０１３５】
▲１▼動領域解析結果から得られるおおよその肩幅Ｗｓｈを探索中心とする
▲２▼最大探索幅は頭部領域の水平幅Ｗｈｄに対して比率ｗを次式（５７）の範囲で考える。なお、目安として（ｗ１，ｗ２）＝（１．０，５．０）位である
Ｗｓｈ＝Ｗｈｄ×ｗ（ｗ１≦ｗ≦ｗ２） …（５７）
［ｂ］頭部の幅の適応
基本的に肩幅と同じ考え方で行う。
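The width adaptation can be sketched as a one-dimensional search over the ratio w in equation (57). The fit evaluation function is not specified in the text, so `fitness` below is a hypothetical callable supplied by the caller, and the uniform sampling with `steps` candidates is an assumption.

```python
def best_shoulder_width(w_hd, fitness, w1=1.0, w2=5.0, steps=41):
    """Search Wsh = Whd * w for w in [w1, w2] and return the best-scoring width."""
    candidates = [w_hd * (w1 + (w2 - w1) * k / (steps - 1)) for k in range(steps)]
    return max(candidates, key=fitness)          # width with the highest fit score
```

In practice the search could be centred on the rough shoulder width obtained from the moving area analysis, as the text suggests; this sketch scans the whole permitted range for simplicity.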
【０１３６】
（６）常識の適用による制約
図１のモデルデータベースから色情報、３次元形状、シーン構造を選択するには、システム制御部から得られるモード情報とモデル情報に加えて、上記のプロセスで得られたモデル適合の評価修正結果を基にして、シーンの状況を決定する必要がある。そこで、使用環境や被写体に関するモード制御を行なうにあたって、状態遷移のパターンをあらかじめ設定しておき、それらの分岐においては、状態遷移の判定情報が不足する初期段階においては確率値のデフォルト値に基づいて状態遷移を判定し、このような判定動作を繰り返すうちに評価結果に応じて適切な状態遷移が行なえるようにこの確率値を変更設定する。
【０１３７】
［ａ］確率ラベルの更新
状態遷移の解釈木のパスを一度通り、その評価プロセスによる評価スコア値ＳＣＯＲＥがしきい値ＳＣ＿ＴＨ１，ＳＣ＿ＴＨ２の値に対して次の場合に応じて確率ラベルの変更を行なう。
【０１３８】
ＳＣＯＲＥ≧ＳＣ＿ＴＨ１
→ｐ（ｐａｔｈ−ｉ）の確率ラベルを上げる
ＳＣ＿ＴＨ１＞ＳＣＯＲＥ≧ＳＣ＿ＴＨ２
→ｐ（ｐａｔｈ−ｉ）の確率ラベルを変更しない
ＳＣ＿ＴＨ２＞ＳＣＯＲＥ
→ｐ（ｐａｔｈ−ｉ）の確率ラベルを下げる
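The three-way update of a path's probability label can be sketched as follows. The text does not say by how much the label is raised or lowered, so the `step` size and the clamping to the range 0 to 1 are assumptions, and the function name is hypothetical.

```python
def update_probability_label(p, score, sc_th1, sc_th2, step=0.1):
    """Raise, keep, or lower p(path-i) according to the evaluation score."""
    if score >= sc_th1:
        p += step                    # good score: raise the label
    elif score < sc_th2:
        p -= step                    # poor score: lower the label
    return min(1.0, max(0.0, p))     # keep the label a valid probability
```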
【０１３９】
［ｂ］他の経路の並列評価の判定
確率ラベルの経路和の値ＳＰ（ｐａｔｈ−ｉ）の値に応じて各パスの解析・評価計算を行うかどうかを決定する。最も確率値の高いパスをｐａｔｈ−ｉとするとき、以下のように判定する。
【０１４０】
▲１▼ＳＰ（ｐａｔｈ−ｉ）≧０．８
→ｐａｔｈ−ｉの計算のみ行う
▲２▼０．８＞ＳＰ（ｐａｔｈ−ｉ）≧０．５
→ｐａｔｈ−ｉの次に確率の高いｐａｔｈ−ｊを選択し
ｐａｔｈ−ｉとｐａｔｈ−ｊの並列計算を進める
▲３▼０．５＞ＳＰ（ｐａｔｈ−ｉ）≧０．２
→確率の高い上位３個のｐａｔｈ−ｋを選択する
▲４▼０．２＞ＳＰ（ｐａｔｈ−ｉ）
→状況仮定の変更を行う
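The choice of how many state-transition paths to evaluate in parallel can be sketched from the four cases above. The thresholds 0.8, 0.5 and 0.2 come from the text; the function name, the dictionary input, and the boolean flag that signals a change of the situation hypothesis are illustrative assumptions. The input is assumed to be non-empty.

```python
def select_paths(path_scores):
    """path_scores: {path name: SP value}. Returns (paths to evaluate, revise flag)."""
    ranked = sorted(path_scores, key=path_scores.get, reverse=True)
    best = path_scores[ranked[0]]
    if best >= 0.8:
        return ranked[:1], False     # evaluate only the most probable path
    if best >= 0.5:
        return ranked[:2], False     # evaluate the top two paths in parallel
    if best >= 0.2:
        return ranked[:3], False     # evaluate the top three paths
    return [], True                  # too uncertain: revise the situation hypothesis
```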
【０１４１】
次に、発明者が実験的に行なった推定領域の抽出処理過程の結果について概略的に説明する。図１７ないし図２４は、ＣＩＦ形式のブロックに分割した場合の結果を示している。人物の頭部を撮影したある画像について、まず、動ベクトル検出によるラベリングを行なった結果が図１７に示される。これに対して、パニングベクトルを検出して補正を行なった結果が図１８に示される。さらに、時間方向のヒストグラム処理による雑音除去処理を行なった結果が図１９に示される。図中の値は頻度を示している。また、顔領域について重心色による評価を行なった結果が図２０に示される。そして、ワイヤフレーム適合による投影を行なった結果が図２１に示される。
【０１４２】
また、図２２ないし図２４にはこのようにして抽出された図２０に相当する推定領域の結果について時間的に変化する状態を、シーン開始時点から数えて、第１フレーム（Ｎｆ＝１），第７フレーム（Ｎｆ＝７）および第１３フレーム（Ｎｆ＝１３）について示している。
【０１４３】
このような本実施例によれば、画像の動領域を抽出して対象物の推定領域とし、その推定領域に対して符号化処理における符号化レートを他の領域に比べて高くするので、限られた伝送容量の範囲で画像の符号化信号を送信する場合においても、使用者が見た時の印象として人物の顔などの注目する領域に関して高精度で動き情報が良く表現されるようになり、動画像としての見た目の印象を良くすることができるようになる。
【０１４４】
また、本実施例によれば、対象物の３次元形状をワイヤフレームモデルとして規定すると共に、色のモデル情報やシーン構造を規定してモデルベースとして設定し、抽出された動領域の形状特徴に基づいてワイヤフレームモデルの位置姿勢を推定しこれに基づいて推定領域を抽出するので、対象物の動きに応じて高い精度で追随して動領域に対する推定領域の抽出を行なうことができる。
【０１４５】
さらに、モデル適合過程により抽出された推定領域の適合度を画像のデータに基づいて評価し、その評価結果に応じて適合度がより高くなるように推定領域を修正するので、符号化処理過程で、修正されたより精度の高い推定領域を用いて符号化処理を行なうことができ、これによって、対象物に対応した推定領域をより精度よく抽出できる。
【０１４６】
そして、対象物の常識的なモデル情報として例えば色情報や運動情報あるいは３次元構造の情報があらかじめ規定されているので、評価修正過程においては、抽出されている推定領域のデータがその常識的なモデル情報に適合しているかを評価修正を行なうことにより、より正確な推定領域の抽出をすることができるようになる。
【０１４７】
符号化処理過程では、前回のフレームで抽出された推定領域の情報に基づいてその推定領域に対応して精度を高めて符号化処理を行なうので、同じフレームの画像データについて繰り返し読出して信号処理をすることがなくなり、演算処理を迅速に実施することができる。また、必要に応じて、推定領域に対して次フレームの予測を行なうので、速い動きがある場合にもこれに追随して的確な推定領域の情報を得ることができる。
【０１４８】
また、動領域検出過程においては、検出された動きブロックの情報に加えて画像の色情報に基づいて動領域を検出するので、動きの少ないブロックに対しても色情報から対象物に対応したブロックであるか否かの判定を行なえ、対象物に相当する推定領域の抽出の精度を向上させることができるようになる。
【０１４９】
動領域検出過程においては、色距離計算に続くしきい値判定で用いるしきい値の設定を時間の経過に伴って低いレベルに変更するので、画像から対象物の推定領域を抽出する場合に、シーンの開始直後から時間が経過するにしたがって推定領域の抽出精度が向上してくることに基づいてより精度の高い判定処理を行なうことができるようになる。
【０１５０】
動領域検出過程において、重心位置からの距離による雑音除去を行なうと共に時間的に前後するフレームの画像のデータから雑音除去を行なうので、動領域の抽出を高精度で行なうことができる。
【図面の簡単な説明】
【図１】本発明の一実施例を示す符号化器のブロック構成図
【図２】全体の概略的構成図
【図３】領域情報を抽出する原理説明図
【図４】領域抽出プログラムのフローチャート
【図５】動領域解析処理プログラムのフローチャート
【図６】モデル適合処理プログラムのフローチャート
【図７】評価・修正処理プログラムのフローチャート
【図８】領域予測処理プログラムのフローチャート
【図９】パニングベクトル計算プログラムのフローチャート
【図１０】画像中のパニングベクトル検出の計算に使用するブロックの説明図
【図１１】抽出領域の傾きを検出する場合の作用説明図
【図１２】抽出した推定領域の色評価に関する概念的な説明図
【図１３】２次元の近似修正ベクトルの計算をする場合の概念的な説明図
【図１４】推定領域内の色連続性の判定処理をする場合の概念的な説明図
【図１５】推定領域と実物領域とのさまざまなずれの関係を示す説明図
【図１６】しきい値変更を示す作用説明図
【図１７】実験例を示す１フレームの各ブロックにおける、動ベクトル検出によるラベリング結果図
【図１８】同、パニングベクトル除去後のラベリング結果図
【図１９】同、時間方向のヒストグラム処理を行なった後のラベリング結果図
【図２０】同、重心色による評価結果図
【図２１】同、ワイヤフレーム適合による投影結果図
【図２２】同、時間の推移に伴う図２１相当図（その１）
【図２３】同、時間の推移に伴う図２１相当図（その２）
【図２４】同、時間の推移に伴う図２１相当図（その３）
【符号の説明】
１は符号化器、２はカメラ、３はＡ／Ｄ変換部、４は動き補償予測符号化部、５はモデルベース対象領域抽出部、６は２次元動きベクトル検出部、７は符号化属性判定・制御部、８は動領域解析部、８ａはパニングベクトル検出部、８ｂは動きブロック判定部、８ｃは重心計算部、８ｄは平均色計算部、８ｅは背景雑音除去部、８ｆは２次元形状パラメータ抽出部、９はモデル適合部、９ａは３次元位置姿勢推定部、９ｂは透視変換部、９ｃは投影領域計算部、１０は評価・修正部、１０ａは色評価部、１０ｂは適合度計算部、１０ｃは３次元位置姿勢速度決定部、１１は領域予測部、１１ａは移動平均部、１１ｂは３次元動き予測部、１１ｃは領域計算部、１２はモデルデータベース、１２ａは色情報データベース、１２ｂは３次元形状データベース、１２ｃはシーン構造データベースである。

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an image recognition method for extracting and recognizing a moving object from an image, and to an image information encoding method that, when an image is encoded within a limited information transmission capacity, improves the image quality of a specific region while suppressing any increase in the overall amount of encoded information.
[0002]
[Problems to be solved by the invention]
In digital mobile communications such as car phones and mobile phones, the mainstream data transmission rate is currently 9600 bps (bits per second). In terms of communication cost, it is said that even if broadband mobile communication becomes technically possible, narrowband communication, which can be realized at low cost, will keep its advantage for some time even after 2000.
[0003]
With this in mind, a PC-based prototype of a car videophone capable of quasi-video transmission at a data transmission rate of 9600 bps was developed, but its image update rate was at best about one frame (screen) every 1 to 2 seconds. At such an update rate the system cannot adequately convey motion to the viewer, which made it difficult to use at a practical level.
[0004]
On the other hand, the H.263 standard for the compression of image information, standardized by ITU-T/LBC, is said to be the highest-performing scheme for wired ultra-low-rate image transmission (64 kbps to 4.8 kbps). However, this standard specifies the decoder, and the details of the encoder's computation algorithm are left to the implementer. When image quality was evaluated using the H.263 evaluation software TMN5 (Telenor), the frame rate was found to be relatively high (about 5 frames/second at 9600 bps), but the ability to convey facial expressions was poor.
[0005]
In addition, an actual product must also transmit audio information, so after AV multiplexing the effective image rate is expected to be only about 4 kbps, which makes the communication rate constraint even more severe. On the other hand, in terms of product performance, displaying facial expressions during a conversation requires both a video rate of 5 frames/second or more and an image quality that shows facial expressions clearly. In other words, even if the entire image is not updated with high accuracy, the user perceives the image as high quality as long as the moving parts of interest, such as the face region and the hands, are transmitted with good accuracy.
[0006]
Therefore, the first problem is how to efficiently detect and recognize, from the image information, the face region, which is a specific target region among moving objects. The second problem is how to encode that face region with higher accuracy than other regions within the limited transmission capacity.
[0007]
Many methods, such as clustering, template matching, and active contour models, have been proposed for extracting person regions. In practice, however, few of them both extract stably under varied conditions (hat, hairstyle, background color, skin color, shadows, whole body / upper body / face only, etc.) and keep the computational cost low.
[0008]
The present invention has been made in view of the above circumstances. A first object is to provide an image recognition method capable of efficiently detecting and recognizing, from image information, a specific target region such as a face among moving objects. A second object is to provide an image information encoding method that encodes the image information obtained by the recognition method with high efficiency.
[0009]
[Means for Solving the Problems]
  According to the image information encoding method of claim 1, first, in the image recognition method, the moving region detection process divides an image of a moving object into a plurality of blocks, detects the blocks in which motion occurs based on the pixel data of each block in at least two temporally successive frames, and thereby detects the moving region corresponding to the object. Then, in the model fitting process, a predefined shape model of the object is fitted to the detected moving region, and the region contained inside the model is recognized as the estimated region of the object. As a result, the region occupied by the moving object in the image can be recognized efficiently.
[0010]
  Furthermore, when image information is encoded based on the information obtained by the above image recognition method, the encoding process allocates a larger amount of information to the estimated region extracted in the model fitting process than to other regions, while encoding the image within a predetermined transmission capacity. The encoding accuracy of the extracted estimated region can therefore be raised. If the region given the larger amount of information is the part most likely to be watched, for example a person's face, the region of interest is displayed with high precision when the user views the decoded image, so the image appears to be of high quality, and a good image can be transmitted within a limited transmission capacity.
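As an illustration only, the unequal allocation described above can be sketched in Python. The sketch assumes that the allocation is realized by giving macroblocks inside the estimated region a finer quantization step than the background, which is one common way to spend more bits on a region in motion-compensated coders; this passage does not state the mechanism. The function name `assign_quantizers` and the step values 8 and 24 are hypothetical.

```python
def assign_quantizers(region_mask, q_region=8, q_background=24):
    """Map each macroblock to a quantizer step size.

    region_mask: rows of 0/1 flags, 1 = block inside the estimated region.
    q_region / q_background: hypothetical step sizes (smaller = finer coding).
    """
    return [[q_region if inside else q_background for inside in row]
            for row in region_mask]


# 3 x 4 macroblocks; the 2 x 2 patch of ones stands for the face region.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
qmap = assign_quantizers(mask)
```

A real encoder would additionally adjust these step sizes under rate control so that the total stays within the transmission capacity; that feedback loop is omitted here.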
  In addition to the above premise, in the moving region detection process the moving region is detected based on the color information of the image as well as the information on the detected motion blocks. Even for a block with little motion, it can therefore be determined from the color information whether the block corresponds to the object, which improves the accuracy of extracting the estimated region corresponding to the object.
  In addition, in the moving region detection process, the color information is detected by determining the color of the object through a centroid color calculation, a color distance calculation, and a threshold determination, so the concrete computation can be performed relatively easily.
  Furthermore, in the moving region detection process, the threshold used in the threshold determination that follows the color distance calculation is changed to a lower level as time elapses. For example, when the estimated region of an object is extracted from the images of a scene, the extraction accuracy of the estimated region improves as time elapses after the change immediately following the start of the scene, so lowering the threshold accordingly allows the determination to be made with increased sensitivity.
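The color test described above (centroid color, color distance, threshold determination, with the threshold lowered over time) can be sketched as follows. This is a minimal sketch under stated assumptions: the Euclidean RGB distance, the linear decay schedule, and all numeric values are hypothetical, as are the function names; the specification does not fix any of them in this passage.

```python
import math


def mean_color(pixels):
    """Average (r, g, b) of the pixels in one block, e.g. the centroid block."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def color_distance(a, b):
    """Euclidean distance between two colors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def threshold_at(frame_index, start=60.0, floor=20.0, decay=2.0):
    """Threshold that starts loose and is lowered as frames elapse."""
    return max(floor, start - decay * frame_index)


def belongs_to_object(block_color, centroid_color, frame_index):
    """Threshold determination on the color distance."""
    return color_distance(block_color, centroid_color) <= threshold_at(frame_index)


centroid_color = mean_color([(198, 150, 120), (202, 150, 120)])  # (200, 150, 120)
candidate = (230, 150, 120)  # distance 30 from the centroid color
early = belongs_to_object(candidate, centroid_color, frame_index=0)
late = belongs_to_object(candidate, centroid_color, frame_index=20)
```

With these assumed values the same candidate block is accepted right after the scene starts, when the threshold is 60, and rejected 20 frames later, when the threshold has dropped to 20.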
  According to the image information encoding method of claim 2, in addition to the premise processing described for claim 1, in the moving region detection process noise is removed by creating a histogram of the detected motion blocks with reference to the motion blocks of temporally preceding and following frames and filtering in the time direction. Even a noise block that cannot be identified from a single image can therefore be removed easily, by checking over a plurality of frames whether a corresponding block persists as time passes.
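The time-direction filtering can be illustrated with a short Python sketch. It assumes that the histogram is a per-block count of how many recent frames labeled the block as moving, and that a block is kept only when this count reaches a minimum; the name `temporal_filter` and the parameter `min_count` are hypothetical.

```python
def temporal_filter(label_history, min_count):
    """Keep a block only if it was labeled as moving in enough frames.

    label_history: list of frames, each a grid of 0/1 motion labels.
    min_count: hypothetical minimum number of frames a block must appear in.
    """
    rows = len(label_history[0])
    cols = len(label_history[0][0])
    # Per-block histogram: in how many frames was this block labeled 1?
    counts = [[sum(frame[r][c] for frame in label_history)
               for c in range(cols)] for r in range(rows)]
    return [[1 if counts[r][c] >= min_count else 0 for c in range(cols)]
            for r in range(rows)]


history = [
    [[1, 0, 1]],
    [[1, 0, 0]],
    [[1, 1, 0]],
]
filtered = temporal_filter(history, min_count=2)
```

In this example only the first block is labeled in all three frames; the other two appear once each and are treated as noise.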
  According to the image information encoding method of claim 3, in addition to the premise processing described for claim 1, a model database is provided that defines, as models of the object, a three-dimensional model, color information, and a scene structure. When a model is selected from the model database, the probability values for state transitions are changed stochastically or the state transition path is changed, so an appropriate model is selected according to the state of the image as time passes, and the estimated region can be extracted quickly and accurately.
  According to the image information encoding method of claim 4, in addition to the premise processing described for claim 1, in the evaluation correction process the degree of fit of the estimated region extracted in the model fitting process is evaluated based on the image data, and the estimated region is corrected so that the degree of fit becomes higher according to the evaluation result. The moving object in the image can therefore be recognized with high accuracy, and the encoding process can use the corrected, more accurate estimated region. Because the estimated region corresponding to the object is extracted more accurately and encoded with higher accuracy, an image in which the movement of the object is smoother can be transmitted within the limited transmission capacity.
  Furthermore, in the evaluation correction process, the three-dimensional position and orientation are determined by threshold judgment using physical constraints and common-sense values, so stable calculation results that match the actual situation can be obtained relatively easily.
  According to the image information encoding method of claim 5, in addition to the premise processing described for claim 1, in the evaluation correction process the degree of fit of the estimated region extracted in the model fitting process is evaluated based on the image data, and the estimated region is corrected so that the degree of fit becomes higher according to the evaluation result. The moving object in the image can therefore be recognized with high accuracy, and the encoding process can use the corrected, more accurate estimated region. Because the estimated region corresponding to the object is extracted more accurately and encoded with higher accuracy, an image in which the movement of the object is smoother can be transmitted within the limited transmission capacity.
  Furthermore, in the evaluation correction process, the threshold used in the threshold determination of the calculated three-dimensional position and orientation values is changed dynamically as time elapses. When the three-dimensional position and orientation are estimated, the estimation accuracy improves through sequential processing as time passes after the change immediately following the start of the scene. By lowering the thresholds for temporal change and prediction error accordingly, the determination can be made with increased sensitivity so that even small changes are detected.
[0011]
  According to the image information encoding method of claim 6 or the image recognition method of claim 24, the shape model of the object is defined in advance as a three-dimensional model. In the model fitting process, the approximate position and orientation of the three-dimensional model of the object are estimated from the shape characteristics of the moving region obtained in the moving region detection process, and the estimated region is extracted based on the estimated position and orientation. The estimated region can therefore be extracted from the moving region while following the movement of the object with high accuracy.
[0012]
  According to the image information encoding method of claim 7 or the image recognition method of claim 25, in the model fitting process the wire frame model is projected onto the image plane based on the position and orientation information of the three-dimensional model of the object to assign a two-dimensional region, and the interior of the projected wire frame model is extracted as the estimated region of the object. The computation needed to extract the estimated region of the object from its position and orientation information can therefore be performed quickly.
[0013]
  According to the image information encoding method of claim 8 or the image recognition method of claim 26, in the evaluation correction process the degree of fit of the estimated region extracted in the model fitting process is evaluated based on the image data, and the estimated region is corrected so that the degree of fit becomes higher according to the evaluation result. The moving object in the image can therefore be recognized with high accuracy, and the encoding process can use the corrected, more accurate estimated region. Because the estimated region corresponding to the object is extracted more accurately and encoded with higher accuracy, an image in which the movement of the object is smoother can be transmitted.
[0014]
  According to the image information encoding method of claim 9 or the image recognition method of claim 27, information such as color information, motion information, or three-dimensional structure information is defined in advance as common-sense model information of the object. A more accurate estimated region can be extracted by performing an evaluation correction that checks whether the region data conforms to this common-sense model information.
[0015]
  According to the image information encoding method of claim 10 or the image recognition method of claim 28, in the evaluation correction process it is evaluated whether the estimated region extracted from each of the temporally changing images conforms to the physical motion conditions or common-sense motion conditions of the object. When the extracted estimated region corresponds to an unnatural movement or position of the object, it can therefore be excluded as not satisfying these conditions. Here, common-sense movement and position conditions are conditions defined from the nature of the object. For example, when the object is a person, the amount of movement between the previous frame and the current frame naturally has an upper limit, and it is unlikely that a person floats in the air, falls vertically, or suddenly disappears or appears.
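One such condition, the upper limit on movement between the previous and current frames, can be sketched as a simple check. The sketch assumes that the region position is represented by its center in block coordinates and that the limit is a fixed number of blocks per frame; the name `plausible_motion` and the limit of 8 are hypothetical.

```python
import math


def plausible_motion(prev_center, curr_center, max_shift):
    """True if the region center moved no more than max_shift between frames."""
    dx = curr_center[0] - prev_center[0]
    dy = curr_center[1] - prev_center[1]
    return math.hypot(dx, dy) <= max_shift


ok = plausible_motion((10, 10), (13, 14), max_shift=8)        # shift of 5 blocks
rejected = plausible_motion((10, 10), (40, 50), max_shift=8)  # shift of 50 blocks
```

An estimate that fails the check would be excluded as an unnatural movement; the other conditions named above (floating, sudden disappearance) would need their own tests.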
[0016]
  According to the image information encoding method of claim 11 or the image recognition method of claim 29, in the evaluation correction process the common-sense color information conditions defined in advance for the object, the re-evaluated centroid color, and the color continuity within the estimated region are evaluated. For example, when determining whether a block located at the boundary of the estimated region lies inside or outside the region, evaluating color continuity against the color information of the block at the centroid position allows the determination to be made accurately, and the estimated region can be corrected with high accuracy.
[0017]
  According to the image information encoding method of claim 12, in the encoding process the extracted estimated region information is applied to the encoding of the next image. Compared with the case where the image data used for extracting the estimated region is read out again for encoding, the computation can be carried out in parallel and therefore quickly. Because the estimated region of the current image is applied to the next image, some deviation from the region of the object arises when the object moves. However, depending on the object, such as a person's face, this has almost no adverse effect in practice, so the amount of computation can be reduced and a quick response obtained.
[0018]
  According to the image information encoding method of claim 13, in the region prediction process the estimated region of the object in the next image is predicted based on the information obtained in the estimated region extraction process, and in the encoding process the next image is encoded based on the information of this predicted estimated region. Even when the movement of the object is large, the encoding process can therefore be performed more accurately, based on the predicted estimated region of the object in the next image.
[0019]
  According to the image information encoding method of claim 14 or the image recognition method of claim 30, in the moving region detection process the panning vector of the image is detected and the information on the detected motion blocks is corrected based on the detected panning vector. Even when the imaging means itself moves, the motion blocks in the screen can therefore be determined after subtracting the movement of the imaging means, and the moving region of the object can be extracted with higher accuracy.
[0020]
  According to the image information encoding method of claim 15 or the image recognition method of claim 31, in the moving region detection process the moving region is detected based on the color information of the image in addition to the information on the detected motion blocks. Even for a block with little motion, it can therefore be determined from the color information whether the block corresponds to the object, which improves the accuracy of extracting the estimated region corresponding to the object.
[0021]
  According to the image information encoding method of claim 16 or the image recognition method of claim 32, in the moving region detection process the color information is detected by determining the color of the object through a centroid color calculation, a color distance calculation, and a threshold determination, so the concrete computation can be performed relatively easily.
[0023]
  According to the image information encoding method of claim 17 or the image recognition method of claim 33, in the moving region detection process noise is removed by calculating the distance from the centroid of the detected motion blocks and applying a threshold determination, so a noise region that is isolated near a large region can be removed easily.
[0025]
  According to the image information encoding method of claim 18 or the image recognition method of claim 34, in the evaluation correction process, when the wire frame model of the three-dimensional model of the object is projected onto the image plane, the wire frame model is fitted so that the degree of fit of a predetermined part of the object becomes high. By selecting a characteristic part, the wire frame model fitting can be performed easily and quickly.
[0029]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 2 shows an overall schematic configuration of the encoder 1. A camera 2 serving as the imaging means captures a scene including an object and outputs image information. The image information is converted into a digital signal by the A/D converter 3 and input to the encoding unit 4 as original image information.
[0030]
The encoding unit 4 encodes the original image information using an existing motion-compensated predictive coding method (for example, H.261, H.263, or MPEG). The two-dimensional motion vector detection unit 6 detects motion vectors from the original image information and supplies them to the model-based target region extraction unit 5, described later. Based on the region information supplied from the model-based target region extraction unit 5, the encoding attribute determination/control unit 7 applies motion-compensated predictive coding to the extracted region, generating a compressed bit stream that is output to the communication path.
[0031]
As described above, the model-based target region extraction unit 5 receives motion vector information from the two-dimensional motion vector detection unit 6 of the encoding unit 4 together with the original image information. Based on this information, it extracts the target region using the model base and outputs it as region information. FIG. 1 shows its configuration as functional blocks, which are described in detail below.
[0032]
The model-based target region extraction unit 5 is composed of five functional blocks: a moving region analysis unit 8, a model fitting unit 9, an evaluation correction unit 10, a region prediction unit 11, and a model database unit 12. Of these, the moving region analysis unit 8 includes a panning vector detection unit 8a, a motion block determination unit 8b, a centroid calculation unit 8c, an average color calculation unit 8d, a background noise removal unit 8e, and a two-dimensional shape parameter extraction unit 8f. It detects a panning vector from the image information, subtracts it from the motion vectors to determine the motion blocks, calculates the centroid and the average color at the centroid position, removes noise, and obtains the region shape data.
[0033]
The model fitting unit 9 includes a three-dimensional position/orientation estimation unit 9a, a perspective transformation unit 9b, and a projection region calculation unit 9c. It performs fitting by projecting a wire frame model onto the image plane based on the result obtained by the moving region analysis unit 8. The evaluation correction unit 10 includes a color evaluation unit 10a, a fitness calculation unit 10b, and a three-dimensional position/orientation determination unit 10c. It evaluates the two-dimensional estimated region, which is the cutout region obtained by wire frame projection, by computing its degree of fit with the original image data, and feeds the result back to correct the two-dimensional estimated region and adaptively deform the wire frame model. The region prediction unit 11 includes a moving average processing unit 11a, a three-dimensional motion prediction processing unit 11b, and a region calculation unit 11c. It takes the region extracted by the evaluation correction unit 10 as the estimated region of the current frame and predicts the estimated region of the next frame from it.
[0034]
The model database unit 12 includes a color information database unit 12a, a three-dimensional shape database unit 12b, and a scene structure database unit 12c. These store data modeled in advance for known objects. Model information and mode information for driving these databases are input from a system control unit provided outside the encoder.
[0035]
The encoder 1 is configured as described above. The decoder that receives data from the encoder 1 may have a general configuration, for example one conforming to the H.261, H.263, or MPEG standard.
[0036]
Next, the operation of this embodiment will be described with reference to the drawings. Before the specific operation is described, the operating principle is explained first.
The aim of this embodiment is to efficiently extract the target human face region from the original image, and the principle is outlined below. The problem is treated here as one of setting up a standard wire frame model (hereinafter abbreviated as WFM) and assigning the three-dimensional shape of a person as faithfully as possible to the person image appearing in the original image. For this assignment to be performed, the following five conditions need to be satisfied.
[0037]
(1) The camera characteristics (focal length and viewing angle) are known.
(2) The WFM three-dimensional structure data of the object (person) is known.
(3) The object (person) keeps a posture and motion that can be regarded as substantially rigid.
(4) The camera's three-dimensional position and orientation with respect to the object (person) are known.
(5) The subject person has a common-sense size equivalent to that of the WFM.
[0038]
When the above five conditions (1) to (5) are satisfied, the two-dimensional projection of the WFM onto the image plane should substantially coincide with the person region. However, condition (4) is difficult to satisfy in a normal shooting environment, so only the other conditions (1) to (3) and (5) are assumed to hold. The WFM is then assigned by additionally using the following information.
[0039]
(A) Spatial and temporal changes in texture and color in the image
(B) Common-sense information on the texture and color of a person (for shape information, this role is played by the WFM above)
(C) Physical characteristics of a person (speed, inertia, motion characteristics, etc.)
(D) Common-sense position and posture of a person (a person is very unlikely to be captured by the camera upside down or floating in the air)
[0040]
By combining the above, the method for extracting the region of an object in the image encoding method was obtained. It solves the problem by organically linking three elements, namely the WFM, the image information, and the common-sense and physical constraints, and consists of the following processing steps.
[0041]
(a) A two-dimensional estimated region is extracted from the image based on motion information and color information.
(b) The approximate three-dimensional position and orientation of the object are calculated from the two-dimensional shape information of the extracted estimated region and the predefined model information (color information, WFM information).
(c) The WFM is projected onto the image plane by the perspective transformation obtained from the three-dimensional position and orientation, and the face region is extracted by assigning a label to each region.
(d) The extracted face region is re-evaluated (color, size, temporal change in position and orientation, etc.) to calculate the final estimate of the three-dimensional position and orientation for the current frame.
(e) The final estimate of the face region is output by wire frame projection.
(f) If necessary, the face region in the next frame is estimated by linear prediction from the time series of three-dimensional position and orientation.
[0042]
FIGS. 4 to 8 are flowcharts of programs created based on the above principle. FIG. 4 shows the overall flow in outline, and the programs of FIGS. 5 to 8 correspond to its individual steps. The overall operation is described below with reference to these flowcharts.
[0043]
When a scene starts, a series of processing steps is performed to extract the region. First, (A) the moving region analysis process is performed to extract an estimated region (step S1), and the operation level is determined. If the level is LEVEL1 or higher, (B) the model fitting process is performed next (step S2). If the level is LEVEL2 or higher (step S3), (C) the evaluation correction process (step S4) and (D) the region prediction process (step S5) are then executed. The above is repeated until the scene ends (step S6). These processes (A) to (D) are described in turn below.
[0044]
(A) Moving region analysis process
This corresponds to step S1, the moving region analysis, of the overall program in FIG. 4, and the moving region analysis unit 8 executes the processing shown in its flowchart. The image information is read, the frame number Nf is incremented, and motion vectors are detected by a block matching method or the like (steps A1 to A3). Panning vector detection is then performed (step A4), and the motion vectors are corrected based on the obtained panning vector (step A5).
[0045]
Here, the panning vector represents the situation in which the entire screen moves (pans) rather than a part of the image, that is, the situation in which the camera 2 serving as the imaging means moves; the magnitude and direction of the movement of the camera 2 are expressed as the panning vector. By subtracting this panning vector from the motion vectors detected as described above, the moving region in the image is detected more accurately. The detection of the panning vector is described later.
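The correction in step A5 amounts to a per-block vector subtraction, which can be sketched as follows. The data layout (a grid of (vx, vy) tuples, one per block) and the function name `subtract_panning` are assumptions made for illustration.

```python
def subtract_panning(motion_vectors, panning):
    """Remove the camera (panning) component from each block motion vector."""
    px, py = panning
    return [[(vx - px, vy - py) for vx, vy in row] for row in motion_vectors]


# Camera pans by (2, 0); only the middle block also moves on its own.
vectors = [[(2, 0), (5, 1), (2, 0)]]
corrected = subtract_panning(vectors, panning=(2, 0))
```

After the subtraction the background blocks have zero motion, and only the block that moves relative to the scene keeps a non-zero vector.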
[0046]
Next, based on the obtained motion vectors, the blocks in the image are labeled for the presence or absence of motion in units of 16 × 16-pixel blocks (macroblocks, MBK) (step A6). Based on the labeling result, the operation level is determined by comparison with the preset levels LEVEL0 to LEVEL3, and this level serves as the criterion for the subsequent processing stages. Each level is an operation level corresponding to the required amount of computation, and LEVEL0 is the level with the smallest amount of computation.
[0047]
(1) LEVEL0
If the number of motion blocks is very small and below a certain threshold, the level is determined to be LEVEL0, meaning that the position and orientation of the WFM have not changed (step A7), and the process returns to step A1 to read the next frame. In this case, the WFM position/orientation information and the estimated region information obtained in the previous processing are used for the encoding process.
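The motion labeling of step A6 and the LEVEL0 decision of step A7 can be sketched together. The sketch assumes that a block counts as moving when the length of its corrected motion vector reaches a minimum magnitude, which the text does not specify; `label_motion_blocks`, `is_level0`, and both threshold values are hypothetical. How LEVEL1 to LEVEL3 are distinguished is not covered.

```python
def label_motion_blocks(vectors, min_magnitude=1.0):
    """Label a macroblock 1 if its motion vector is at least min_magnitude long."""
    return [[1 if (vx * vx + vy * vy) ** 0.5 >= min_magnitude else 0
             for vx, vy in row] for row in vectors]


def is_level0(labels, min_blocks):
    """LEVEL0: too few motion blocks, so the previous WFM pose is reused."""
    return sum(sum(row) for row in labels) < min_blocks


vectors = [[(0, 0), (3, 1), (0, 0)],
           [(0, 0), (0, 0), (0, 0)]]
labels = label_motion_blocks(vectors)
reuse_previous = is_level0(labels, min_blocks=2)
```

Here only one block moves, which is below the assumed minimum of two, so the frame is treated as LEVEL0 and the previous estimated region would be reused for encoding.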
[0048]
(2) LEVEL1 or higher
(a) If the number of motion blocks is less than or equal to a certain threshold (step A7), the average color of the centroid block of the extracted estimated region is calculated, and blocks whose average color is within a given error of that color are included in the region and labeled.
(b) If the number of motion blocks is equal to or greater than a certain threshold, label correction by color vector evaluation is not performed.
[0049]
Next, label correction by color vector evaluation is performed (step A8), the label image is clustered (step A9), and the number of clustered regions K is determined (step A10). For each region k (= 1, ..., K), the centroid position is calculated as a block address, and the color at the centroid is taken as the region color (steps A11 to A14). A distance-based filter is then applied: the label of any block whose distance L from the centroid exceeds one half of the distance to the nearest other cluster center is set to "0" (step A15).
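The distance-based filter of step A15 can be sketched as follows. The sketch assumes that clusters are given as lists of (row, column) block addresses and that distances are Euclidean in block units; setting a label to "0" is modeled by dropping the block from its cluster. The function names are hypothetical.

```python
import math


def centroid(blocks):
    """Center of gravity of a cluster of (row, col) block addresses."""
    n = len(blocks)
    return (sum(r for r, _ in blocks) / n, sum(c for _, c in blocks) / n)


def filter_by_distance(clusters):
    """Drop blocks farther from their own centroid than half the distance
    to the nearest other cluster centroid."""
    centers = [centroid(blocks) for blocks in clusters]
    kept = []
    for k, blocks in enumerate(clusters):
        others = [math.dist(centers[k], c)
                  for j, c in enumerate(centers) if j != k]
        limit = min(others) / 2 if others else math.inf
        kept.append([b for b in blocks if math.dist(b, centers[k]) <= limit])
    return kept


clusters = [
    [(0, 0), (0, 1), (0, 2), (0, 9)],   # centroid (0, 3); (0, 9) is an outlier
    [(0, 12), (0, 13), (0, 14)],        # centroid (0, 13)
]
cleaned = filter_by_distance(clusters)
```

The two centroids are 10 blocks apart, so the limit is 5 blocks; the block at (0, 9) lies 6 blocks from its own centroid and is removed.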
[0050]
Next, noise is removed by methods such as temporal filtering, which uses information from several past frames to remove noise components according to their frequency of appearance, and isolated point removal, which removes blocks isolated from large clusters as noise components (steps A16 and A17). The centroid of region k is then recalculated (step A18), and noise is removed by evaluating the error from the centroid color (step A19). The major axis, minor axis, and inclination of the resulting region shape are extracted (step A20). The same processing is repeated for each of the K clusters (steps A12 to A21).
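The text does not say how the major axis, minor axis, and inclination are computed in step A20. One standard possibility, shown here purely as an assumption, is to use the second-order central moments of the block addresses and take the axes of the equivalent ellipse. The function name `shape_parameters` is hypothetical.

```python
import math


def shape_parameters(blocks):
    """Semi-major axis, semi-minor axis, and inclination (radians) of the
    ellipse with the same second-order moments as the (row, col) blocks."""
    n = len(blocks)
    cy = sum(r for r, _ in blocks) / n
    cx = sum(c for _, c in blocks) / n
    mxx = sum((c - cx) ** 2 for _, c in blocks) / n
    myy = sum((r - cy) ** 2 for r, _ in blocks) / n
    mxy = sum((c - cx) * (r - cy) for r, c in blocks) / n
    # Eigenvalues of the 2 x 2 covariance matrix.
    common = math.sqrt(((mxx - myy) / 2) ** 2 + mxy ** 2)
    lam_major = (mxx + myy) / 2 + common
    lam_minor = max((mxx + myy) / 2 - common, 0.0)
    angle = 0.5 * math.atan2(2 * mxy, mxx - myy)
    return 2 * math.sqrt(lam_major), 2 * math.sqrt(lam_minor), angle


# A horizontal bar of five blocks: elongated along the column axis.
major, minor, angle = shape_parameters([(0, c) for c in range(5)])
```

For the horizontal bar the inclination is zero and the minor axis collapses to zero, as expected for a one-block-high shape.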
[0051]
(B) Model fitting process
In the model fitting process (step S2) shown in FIG. 4, the model fitting unit 9 performs model fitting according to its program flowchart. First, as shown in FIG. 3, a WFM (wire frame model) corresponding to the upper body of a person is selected (step B1), and then the posture angles and the distance r are calculated for each region (steps B2 to B5).
[0052]
In this case, for example, the minor axis of the region shape described above is regarded as the width of the projected head region, the angles α, β, and γ that serve as posture angle parameters are calculated, and the distance r is obtained by a proportional calculation. Using these, the end points of the WFM are perspective-transformed to obtain the projected image of the WFM (step B6). The interior of this projected image is labeled based on the WFM part information (head, trunk, etc.) (step B7). For example, the label value "1" is assigned to the head, "2" to the eyes, "7" to the body, and "0" to the background. The above is repeated for the number of persons, that is, the number of regions K.
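The proportional calculation of the distance r and the perspective transformation of a WFM end point can be sketched under a pinhole camera assumption. The focal length in pixels, the head width of the model, and the function names are hypothetical values chosen for illustration; the rotation by the posture angles α, β, and γ is omitted for brevity.

```python
def estimate_distance(focal_px, model_head_width, projected_width_px):
    """Proportional (pinhole) estimate of the distance r to the head."""
    return focal_px * model_head_width / projected_width_px


def project_point(point, focal_px):
    """Perspective-project a camera-frame 3-D point (x, y, z) to the image plane."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)


focal_px = 500.0          # assumed focal length in pixels
head_width = 0.16         # assumed WFM head width in metres
r = estimate_distance(focal_px, head_width, projected_width_px=80.0)
# A WFM vertex at the right edge of the head, placed at distance r.
u, v = project_point((head_width / 2, 0.0, r), focal_px)
```

With these values the head is estimated to be about 1 m away, and the vertex at the edge of the head projects about 40 pixels from the image center, which is consistent with the 80-pixel projected width used as input.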
[0053]
After the above calculation has been completed for all regions (persons), if the designated operation level is LEVEL1, the region extraction for the current frame ends here and reading of the next image starts; the evaluation correction and region prediction processes described below are not executed. The accuracy of model fitting is somewhat reduced, but the overall amount of computation is also reduced. If this LEVEL1 processing is applied some time after a scene cut, the discrepancy between the object and the WFM is not expected to become large. When the operation level is LEVEL2 or higher, the following processes are executed.
[0054]
(C) Evaluation correction process
Next, in the evaluation correction process (step S4) shown in FIG. 4, the evaluation correction unit 10 performs evaluation and correction according to its program flowchart. To evaluate the cutout region (two-dimensional estimated region) obtained by wire frame projection, the region is evaluated against the original image data, the two-dimensional estimated region is corrected, and the wire frame model is adaptively deformed.
[0055]
For example, using the data obtained from the labeling described above, the head region of the person is extracted and the color at the centroid of the region is evaluated as described later (step C1). If the level is LEVEL2, the subsequent evaluation and correction steps (steps C3 to C6) are omitted. If the level is LEVEL3, color continuity evaluation, fitness evaluation, two-dimensional position correction, and three-dimensional position/orientation correction are performed (steps C3 to C6). At LEVEL2, the centroid and color of the WFM region are adopted as the region centroid if the evaluation in step C1 finds them appropriate; otherwise, the values obtained by the moving region analysis unit 8 are used.
[0056]
Next, when the frame count Nf since the scene cut reaches a certain threshold Nf_TH, the parameter evaluation thresholds are changed (steps C7 and C8). For example, a large threshold is used for about 1 to 3 seconds immediately after the scene cut and is then reduced. Based on these values, an evaluation using physical constraints and an overall evaluation applying common-sense knowledge are performed (steps C9 to C12). If the evaluation result is good, the process ends; otherwise, it returns to the model fitting process (step S2) and repeats (step C13).
[0057]
(D) Region prediction process
Next, in the region prediction process (step S5) shown in FIG. 4, the region prediction unit 11 performs region prediction according to its program flowchart. The region extracted by the global processing over the entire frame described above is the estimated region in the current frame. If the current frame were processed again based on this region for encoding, the encoding would be delayed by the additional computation time. Therefore, the predicted region of the next frame is obtained here from this estimated region.
[0058]
In this case, the prediction could be realized two-dimensionally as a linear prediction of the average motion vector of the region. Here, however, for each region (steps D1, D2, and D7), the projection area of the WFM at time Nf + 1 is predicted (steps D4 to D6) by time-series linear prediction of the three-dimensional position and orientation (step D3). As a result, encoding of the current frame and region prediction of the next frame can be performed in parallel.
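The time-series linear prediction of step D3 can be sketched as follows. This is a minimal illustration that assumes a first-order (constant-velocity) extrapolation from the two most recent frames; the text does not specify the order of the predictor, and the function name `predict_next_pose` and the sample values are hypothetical.

```python
def predict_next_pose(prev, curr):
    """Extrapolate each pose parameter to frame Nf + 1.

    Assumption: first-order linear prediction, p(Nf+1) = 2*p(Nf) - p(Nf-1).
    """
    return {name: 2.0 * curr[name] - prev[name] for name in curr}


# Hypothetical pose parameters (r, theta, phi, alpha, beta, gamma) for two frames.
pose_prev = {"r": 1000.0, "theta": 0.0, "phi": 0.0,
             "alpha": 1.0, "beta": -2.0, "gamma": 5.0}
pose_curr = {"r": 990.0, "theta": 2.0, "phi": 0.0,
             "alpha": 1.5, "beta": -2.0, "gamma": 4.0}

pose_next = predict_next_pose(pose_prev, pose_curr)
```

The predicted parameters would then be fed to the wire frame projection to obtain the predicted region for the next frame.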
[0059]
Next, details of processing contents of each unit in the processing flow described above will be described item by item.
A. Dynamic region analysis process
(1) Motion vector detection
In this embodiment, for example, a block matching method is used to detect a motion vector by calculating the inter-frame correlation for each block of an image of the original image size (for example, CIF). Motion vector detection methods other than block matching can also be used.
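As an illustration of block matching, the sketch below performs an exhaustive search that minimizes the sum of absolute differences (SAD). The SAD criterion, the block size, the search range, and all function names are assumptions chosen for the example; the text only states that inter-frame correlation is computed per block.

```python
def sad(prev, curr, bx, by, mvx, mvy, size):
    """SAD between a block of curr and the block of prev displaced by -(mvx, mvy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(curr[by + y][bx + x] - prev[by + y - mvy][bx + x - mvx])
    return total


def block_motion_vector(prev, curr, bx, by, size=4, search=2):
    """Return the motion (mvx, mvy) from prev to curr that minimizes SAD.

    Ties are broken in favor of the smaller displacement.
    """
    h, w = len(prev), len(prev[0])
    best_key, best_mv = None, (0, 0)
    for mvy in range(-search, search + 1):
        for mvx in range(-search, search + 1):
            px, py = bx - mvx, by - mvy  # top-left corner of the block in prev
            if px < 0 or py < 0 or px + size > w or py + size > h:
                continue  # candidate block would fall outside the frame
            key = (sad(prev, curr, bx, by, mvx, mvy, size), abs(mvx) + abs(mvy))
            if best_key is None or key < best_key:
                best_key, best_mv = key, (mvx, mvy)
    return best_mv


def make_frame(x0, y0):
    """8x8 test frame with a bright 2x2 patch whose top-left corner is (x0, y0)."""
    frame = [[0] * 8 for _ in range(8)]
    for y in range(y0, y0 + 2):
        for x in range(x0, x0 + 2):
            frame[y][x] = 200
    return frame


prev_frame = make_frame(2, 2)
curr_frame = make_frame(3, 2)  # patch moved one pixel to the right
mv = block_motion_vector(prev_frame, curr_frame, bx=2, by=2)
```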
[0060]
(2) Panning vector detection
When the camera 2 is stationary, the motion vector detected as described above often corresponds directly to the two-dimensional projection of the motion of the object. When the camera 2 is moving (that is, when panning is present), however, it is necessary to detect a panning vector corresponding to the movement of the camera 2 and separate it from the movement of the object.
[0061]
[A] Judgment and calculation of panning
The determination as to whether or not a panning vector has occurred follows the calculation procedure shown in FIG. For example, as shown in FIG. 10, one average motion vector is calculated for each of the four sides (or rectangular areas) of the peripheral portion of the image (step E1). Here, as an example, the block rows BL1 to BL4 formed by the second MBK from the outermost edge of the image on each side are used as edge regions, and the average motion vectors PMVn (n = 1 to 4) of BL1 to BL4 are calculated by the following equations (1) to (4).
[0062]
[Expression 1]

PMVn; Panning vector of edge region BLn (n = 1 to 4)
MV (i, j); MBK motion vector of MBK coordinates (i, j)
[0063]
If these four panning vectors PMV1 to PMV4 have approximately the same magnitude (≠ 0) and direction, the entire image can be regarded as moving, and it is determined that panning has occurred. To this end, the variance VPAN² of the four panning vectors PMV1 to PMV4 is first calculated by the following equations (5) and (6) (steps E2 and E3).
[0064]
[Expression 2]

d (vi, vj); distance function between vectors vi and vj (here square distance)
MPV; average vector of 4 PMVs
[0065]
Further, since the magnitude of MPV, the estimate of the panning vector of the entire screen, must also be evaluated, the magnitude of the average vector MPV is expressed as the square distance APAN² and obtained by the following equation (7) (step E5):
APAN² = d (MPV, 0) (7)
Threshold judgment is then performed on the square distance APAN² and the variance VPAN² as follows (steps E4, E6, E7, E8).
[0066]
(1) When VPAN² ≧ VPAN²_TH, or when APAN² < APAN²_TH
→ It is determined that there is no panning (step E8).
(2) Cases other than the above
→ It is determined that there is panning, and MPV is set as the panning vector (step E7).
The thresholds are set empirically; for example, relatively good results are obtained with the following setting.
(VPAN²_TH, APAN²_TH) = (50, 1)
[0067]
[B] Correction of motion amount by removing panning vector
When it is determined that there is panning, the motion vector of each MBK is corrected using the following equation (8).
TMV (i, j) = MV (i, j) −MPV (8)
The corrected motion vector TMV is used as a substantial motion vector, and the following moving region analysis process is performed.
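The panning decision and the correction of equation (8) can be sketched as follows. Because equations (5) and (6) are not reproduced in the text, the sketch assumes that the variance is the mean squared distance of the four PMVs from their average MPV; the function names are hypothetical, and the default thresholds are the empirical values (50, 1) quoted above.

```python
def mean_vector(vectors):
    """Component-wise average of a list of 2-D vectors."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def sq_dist(a, b):
    """Square distance d(a, b) between two 2-D vectors."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def detect_panning(pmvs, vpan2_th=50.0, apan2_th=1.0):
    """Return (panning_present, panning_vector) from the four edge vectors."""
    mpv = mean_vector(pmvs)
    # Assumed form of the variance: mean squared distance from MPV.
    vpan2 = sum(sq_dist(p, mpv) for p in pmvs) / len(pmvs)
    apan2 = sq_dist(mpv, (0.0, 0.0))  # equation (7)
    if vpan2 >= vpan2_th or apan2 < apan2_th:
        return False, (0.0, 0.0)
    return True, mpv


def remove_panning(mv_field, mpv):
    """Equation (8): TMV(i, j) = MV(i, j) - MPV for every block."""
    return [[(mv[0] - mpv[0], mv[1] - mpv[1]) for mv in row] for row in mv_field]


panning, mpv = detect_panning([(4, 0), (4, 1), (5, 0), (3, -1)])
tmv = remove_panning([[(4, 0), (6, 2)]], mpv) if panning else None
```

In this example the four edge vectors agree closely, so panning is detected and the residual vector of the second block is the motion of the object itself.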
[0068]
(3) Calculation of the center of gravity of the changed block
The following methods can be considered for detecting changed blocks.
(1) Threshold judgment of the inter-frame difference
(2) Threshold judgment of the motion-compensated inter-frame difference (without considering panning)
(3) Moving-area judgment using the panning-corrected motion vector
[0069]
Here, when the camera 2 is fixed, method (1) is sufficient. However, since this method reacts sensitively to even a slight shake of the camera 2, it is desirable to use method (2) or method (3) when the encoder 1 has computing capacity to spare. Method (2) mainly corresponds to detecting changed blocks from changes in hue and luminance, and method (3) corresponds to detecting changed blocks caused by the actual movement of the object. Therefore, the centroid position is detected for the changed blocks found by the combined use of methods (2) and (3).
[0070]
These operations apply when the number of objects in the screen can be assumed to be one. When there are a plurality of objects, the group of changed blocks obtained above must be clustered. Whether to perform clustering is determined by the following process.
[0071]
[A] First step: centroid calculation or clustering activation
Case 1… One person in subject mode → Center of gravity calculation
Case 2 ... Number of target areas is unknown → Center of gravity calculation
Case 3 ... Number of target areas is known (2 or more) → Clustering
Case 4 ... The number of target areas is unknown (2 or more) → Clustering
[B] Second step: Clustering
In clustering, the center of gravity is calculated and the cluster radius is extracted at the same time
[0072]
The processing in case 2 of the first step corresponds to assuming that the number of target areas is K = 1. If proceeding with the calculation under this assumption leads to a contradiction in the evaluation process described later (that is, the error evaluation function exceeds a certain threshold value), the assumption is changed in the next frame and case 4 is selected.
[0073]
(4) Color analysis
[A] Calculation of barycentric color
The average color of the block pixels is calculated for one or more area centroid blocks determined by the above-described (3) calculation of the centroid position of the changed block. This is called a barycentric color. For example, when a person is targeted, the barycentric color is used as a primary estimation result of skin color. The average color is calculated by the following formulas (9) to (11).
[0074]
[Equation 3]

[0075]
CPAT (k, i, j); k-th component of the average color vector of MBK (i, j) (k = 1, 2, 3)
MBK_Y (i, j, ix, jy); Y component pixel value of (ix, jy) pixel of MBK (i, j)
BLK_U (i, j, ix, jy); (ix, jy) pixel value of U block of MBK (i, j)
BLK_V (i, j, ix, jy); (ix, jy) pixel value of V block of MBK (i, j)
[0076]
[B] Calculation of distance between average color and barycentric color of each block
This is obtained according to the following equation (12).
[0077]
[Expression 4]

LCD (i, j); absolute value distance between the average color of MBK (i, j) and the barycentric color
IG: Horizontal coordinates of the center of gravity block
JG: Vertical coordinates of the center of gravity block
[0078]
[C] Threshold judgment for each block
The following determination is performed based on the color distance LCD calculated for each MBK.
LCD> LCD_TH
→ MBK (i, j) is out of range
LCD ≦ LCD_TH
→ MBK (i, j) is in the area
LCD_TH; threshold value
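The color analysis steps [A] to [C] can be sketched as follows. Since the equation images for (9) to (12) are not reproduced, the sketch assumes that the average color is the plain mean of the Y, U, and V pixel values of a block and that LCD is the sum of the absolute differences of the three components; the function names and the sample colors are hypothetical.

```python
def mean_color(y_pixels, u_pixels, v_pixels):
    """Average color vector (Y, U, V) of one block from flat pixel lists."""
    return tuple(sum(p) / len(p) for p in (y_pixels, u_pixels, v_pixels))


def color_distance(c1, c2):
    """Absolute-value distance between two three-component color vectors."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


def label_by_color(cpat, ig, jg, lcd_th=100):
    """Mark each block as inside (True) or outside (False) the region.

    cpat maps block coordinates (i, j) to an average color (Y, U, V);
    (ig, jg) is the centroid block whose color is the barycentric color.
    """
    ref = cpat[(ig, jg)]
    return {pos: color_distance(color, ref) <= lcd_th for pos, color in cpat.items()}


cpat = {
    (0, 0): (120, 110, 150),  # centroid block, taken as the skin-color estimate
    (1, 0): (125, 112, 148),  # similar color
    (2, 0): (30, 128, 128),   # dark background block
}
inside = label_by_color(cpat, 0, 0)
```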
[0079]
Here, since the LCD is calculated as the sum of the absolute value distances of the three components having values of 0 to 255, 0 ≦ LCD ≦ 765. Based on this, LCD_TH is set (about 100 to 200 is appropriate). It is desirable that this threshold value be dynamically changed immediately after the scene cut and thereafter. For example, it is as follows.
[0080]
When 0 ≦ Nf ≦ Nf_TH
→ LCD_TH = 300
When Nf_TH <Nf
→ LCD_TH = 100
Nf: Frame number of one cut scene
In this way, Nf_TH is set to the frame time required for the recursive region extraction result to stabilize so that the object can be tracked stably; large changes are allowed until then, and the allowable error is set small thereafter. The scene cut determination method is not limited; one conceivable method is threshold judgment of the sum of the inter-frame motion-compensated difference power over the entire screen.
[0081]
(5) Noise removal
[A] Noise removal by distance judgment
The distance between the centroid block and each block is calculated, and blocks at or beyond a certain threshold are removed from the target area candidates. This is equivalent to performing region determination based on the cluster radius when clustering is applied. That is, for the region centroid MBK (IG, JG), the square distance dR is calculated by the following equation (13), and the following determination is made based on the result.
[0082]
dR = d [(IG, JG), (i, j)] (13)
dR <(Rc + δ)
→ MBK (i, j) is a region candidate
dR ≧ (Rc + δ)
→ MBK (i, j) is not a region candidate
Rc: Cluster radius
δ: positive integer (for example, about 1 to 3)
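The distance judgment can be sketched as follows. The text calls dR a square distance but compares it with the radius Rc + δ; this sketch assumes a Euclidean distance in block units so that both sides of the comparison have the same dimension. The function name and sample values are hypothetical.

```python
import math


def filter_by_cluster_radius(blocks, centroid, rc, delta=2):
    """Keep only blocks whose distance from the centroid is below Rc + delta.

    Assumption: dR is measured as a Euclidean distance in block units.
    """
    ig, jg = centroid
    return [(i, j) for (i, j) in blocks if math.hypot(i - ig, j - jg) < rc + delta]


candidates = [(5, 5), (6, 7), (12, 5)]
kept = filter_by_cluster_radius(candidates, centroid=(5, 5), rc=3, delta=2)
```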
[0083]
[B] Noise removal by temporal filtering
Noise blocks are removed by histogram processing in the time direction. This is a filtering process based on data obtained in the past processing of a plurality of frames. For example, the motion vector occurrence areas detected over about three frame times are overlaid, a histogram of the resulting pattern is taken, and frequently occurring blocks are detected.
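A minimal sketch of this temporal filtering is given below. It overlays binary block masks from three frames, counts how often each block was flagged, and keeps blocks that reach a minimum count. The minimum count of 2 is an assumption chosen for illustration, as are the function names.

```python
def temporal_histogram(masks):
    """Per-block count of how many frames flagged the block as moving."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(m[r][c] for m in masks) for c in range(cols)] for r in range(rows)]


def frequent_blocks(masks, min_count=2):
    """Binary mask of blocks flagged in at least min_count frames (assumed rule)."""
    hist = temporal_histogram(masks)
    return [[1 if n >= min_count else 0 for n in row] for row in hist]


# Hypothetical motion masks for three consecutive frames (2 x 3 blocks).
m1 = [[1, 0, 0], [1, 1, 0]]
m2 = [[1, 0, 1], [1, 1, 0]]
m3 = [[0, 0, 0], [1, 0, 0]]

hist = temporal_histogram([m1, m2, m3])
stable = frequent_blocks([m1, m2, m3])
```

The block flagged only once (top right) is treated as noise and dropped.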
[0084]
[C] Filtering by isolated point removal and contour shape smoothness
Isolated point removal and contour noise filtering similar to those used in ordinary binary image processing are applied to the MBK label image.
[0085]
[D] Fill in area
Holes in the region are filled by binary smoothing in horizontal scanning. Here, when the region shape can be predicted in advance by a model or the like (for example, an ellipse), the block band can be stably extracted by predicting the length between adjacent horizontal blocks.
[0086]
(6) Tilt detection
As shown in FIG. 4, the rotation angle γ in the arrangement of the wire frame is determined from the inclination of the extracted region, for example by the following method. A long axis or short axis is detected by connecting the midpoints of the opposing vertical or horizontal block rows located at the ends of the extracted region; a straight line orthogonal to this axis is then drawn through its midpoint, and the remaining short or long axis is detected from the positions where this line intersects the edge of the region. The inclination can be determined from the long-axis and short-axis data thus obtained.
[0087]
The rotation angle γ can also be determined as follows. The centroid of the region is obtained, and the inclination of a straight line passing through the centroid is varied; the longest line segment cut off by the edge of the region is detected as the long axis, and the short axis is likewise detected as a line segment through the centroid.
[0088]
This is equal to the average of local gradients obtained by connecting the center positions between adjacent horizontal block bands (see FIG. 11A). However, even after noise removal as described above, the extracted region shape still contains some noise, so depending on the inclination detection method (especially a simple method using only the edges), the major and minor axes may be reversed (see FIG. 11B). The following countermeasures can be considered.
[0089]
[A] Improve noise removal so that it becomes a completely convex region
[B] Robust tilt detection
[C] Set a constraint on the tilt change between frames
[A] and [B] are omitted here because they are methods based on improving the image processing. [C] will be described later in the description of the evaluation correction process.
[0090]
B. Model fitting process
A wire frame model is arranged based on the dynamic region analysis result obtained above. For this purpose, the three-dimensional position and orientation are calculated first. To project the wire frame onto the image plane, the position and orientation parameters r, θ, φ, α, β, and γ used to compute the perspective transformation matrix are calculated from the above-described dynamic region analysis results. Of these parameters, r indicates the distance to the camera 2, θ and φ indicate the orientation of the object, α and β indicate the tilt of the camera 2, and γ indicates the rotation angle of the camera 2 around the optical axis.
[0091]
(1) Calculation of the 3D position and orientation of the object
[A] Rotation angle γ around the optical axis
If the slope of the long-axis straight line on the projection plane is DRCT, the rotation angle can be expressed by the following equation (14).
γE = arctan (DRCT) (14)
[0092]
[B] Deviation angles α and β at the center of the object
When the area center of gravity (IG, JG) is regarded as the center of the object and converted into pixel coordinates (NPA, NPB) of the CIF image, the following equations (15) and (16) are obtained. Thus, the deviation angles α and β from the center of the image are expressed by the following equations (17) and (18).
[0093]
[Equation 5]

H_SIZE; Horizontal pixel size of CIF image
V_SIZE; vertical pixel size of CIF image
[Formula 6]

FL: Camera focal length
[0094]
[C] Distance r
Based on the primary estimation region extracted by the motion region analysis, the distance rE between the camera 2 and the object is calculated by the following equations (19) and (20).
[0095]
[Expression 7]

MHW: horizontal width (or vertical width) of the target area obtained when the three-dimensional object is projected onto the image plane at the distance r = rs, with the rotation angle of the object around the optical axis set to γ = 0
RHW0: width in the horizontal (vertical) direction obtained by line scanning of the primary estimation region
RHW: width in the horizontal (vertical) direction corrected by the inclination angle γE of the primary estimation region
[0096]
[D] Angles θ, φ indicating position
The method for calculating the angles θ and φ from the primary estimation result of the first dynamic region analysis is not limited. In the primary estimation, unless there is a priori knowledge, the object is usually assumed to be facing the front at the start of calculation. That is, θ = 0 and φ = 0 are set.
[0097]
(2) Projection area labeling
The wire frame converted to the same resolution as the MBK label image is overlaid on the MBK label image, and the straight line portion is labeled. Then, the holes are labeled by line scanning.
[0098]
(3) Labeling of partial areas
In the above-described labeling of the projection area, the characteristic parts of the object can be associated with two-dimensional partial areas by changing the label value for each partial area of the object based on the three-dimensional coordinates of the wire frame.
[0099]
C. Evaluation correction process
(1) Evaluation of cutout region (secondary estimation region) obtained by wire frame projection
Here, the fitness of the wire frame model placed by the above-described model fitting process is evaluated based on the original image data, and the following items are processed according to the feedback.
[A] Correction of secondary estimation area (correction of three-dimensional position and orientation)
[B] Adaptive deformation of wireframe model
[0100]
(2) Use of common sense
As an evaluation means, the shape information, position and orientation, and image texture of an object (car, person, etc.) corresponding to the wire frame model are evaluated under the following constraints.
[0101]
[A] Physical constraints
(1) Validity of mechanical inertia
(2) Validity of speed and vibration frequency
(3) Validity of positional relationship
[B] Conceptual constraints
(1) Validity evaluation of positional relationship
(2) Validity of color
(3) Validity of motion trajectory
[0102]
(3) Validity evaluation of color
FIG. 12 shows a basic policy regarding color evaluation.
[A] Reevaluation of barycentric color
The barycentric color of the cutout region (secondary estimation region) is calculated in the same manner as the color analysis process in the dynamic region analysis process. That is, if the barycentric coordinates of the cutout region are (IG1, JG1), the barycentric color is expressed by the following equations (21) to (23). Using these, the distance is calculated based on the equation (24).
[0103]
[Equation 8]

[0104]
In response to this, threshold determination is performed as follows.
[0105]
LCDG> LCDG_TH
→ MBK (IG1, JG1) is out of the area center of gravity
LCDG ≦ LCDG_TH
→ MBK (IG1, JG1) is valid as the region centroid
LCDG_TH: threshold value; it may differ depending on what the target area is assumed to be.
[0106]
[B] Continuity evaluation of colors in the area
Here, the continuity evaluation of the color of the boundary portion of the region is performed as follows. As shown in FIG. 12, the area is scanned for each block, and the error between the barycentric color and the average color of each block is reevaluated. That is, LCDG is calculated for each block, and labeling is performed as follows according to the value. Here, the estimated area is previously labeled “1” and the area outside the area is previously labeled “0”.
[0107]
LCDG> LCDG_TH
→ MBK (i, j) is determined to be out of the region and is labeled “2”.
LCDG ≦ LCDG_TH
→ MBK (i, j) is determined to be in the area.
LCDG_TH: threshold value; it may differ depending on what the target area is assumed to be.
[0108]
[C] Calculation of correction vector of 3D position and orientation
(A) Calculation of two-dimensional approximate correction vector
The centroids G1 and G2 of the following two partial areas are calculated based on the label image obtained by the color continuity evaluation in [B] above.
G1 (ISG1, JSG1); the center of gravity of the partial area of label 1 (in the estimation area) in the secondary estimation area
G2 (ISG2, JSG2); the center of gravity of the partial area of label 2 (large color error) within the secondary estimation area
[0109]
Based on this, the direction of the position correction vector as a two-dimensional area is calculated by the following equation (25).
DSG12 = (ISG2-ISG1, JSG2-JSG1) (25)
However, only the direction of this vector is meaningful; its magnitude is not very accurate. To obtain the magnitude, as shown in FIG. 13, the line through G1 and G2 is extended to find the blocks B1 and B2 where it intersects the estimated area boundary and the actual area boundary. The coordinate difference DSB12 is given by the following equation (26) and serves as the two-dimensional approximate correction vector.
DSB12 = (ISB2−ISB1, JSB2−JSB1) (26)
Either this vector is used, or a two-dimensional position correction vector DV (= DVx + DVy) is calculated by the horizontal and vertical scans described later.
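The direction vector DSG12 of equation (25) can be sketched as follows. The label image is assumed to be a list of rows indexed as `labels[j][i]` (j vertical, i horizontal), with label 1 for blocks inside the estimated area and label 2 for blocks with a large color error; the function names are hypothetical.

```python
def centroid_of_label(labels, value):
    """Centroid (i, j) of all blocks carrying the given label, or None if absent."""
    pts = [(i, j) for j, row in enumerate(labels) for i, v in enumerate(row) if v == value]
    if not pts:
        return None
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def correction_direction(labels):
    """Equation (25): DSG12 = G2 - G1; zero vector if either label is missing."""
    g1 = centroid_of_label(labels, 1)
    g2 = centroid_of_label(labels, 2)
    if g1 is None or g2 is None:
        return (0.0, 0.0)
    return (g2[0] - g1[0], g2[1] - g1[1])


labels = [
    [0, 1, 1, 2],
    [0, 1, 1, 2],
]
dsg12 = correction_direction(labels)
```

Here the blocks with a large color error lie on the right-hand side of the estimated area, so DSG12 points to the right.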
[0110]
(I) Positional relationship
In order to correct the three-dimensional position and orientation, it is necessary to determine the following two.
1) Magnitude relationship between the estimated area and the true area
2) Color continuity at the boundary of the estimated area
This is determined as follows.
[0111]
(1) Horizontal scan
As shown in FIG. 8, based on the label image obtained by the color continuity evaluation in [B] above, a horizontal scan is performed over the range [ii0, ii1] in both the left and right directions from the center of gravity G (IG1, JG1) of the estimation region to find the following two positions.
[0112]
CL (ii0L, JG1); position that hits a block with a label value of 2 or 0 for the first time in the left scan
CR (ii0R, JG1): Position that hits a block with a label value of 2 or 0 for the first time in a right scan
Determine the continuity of the color vectors at these two points as follows:
(Case-HC1)
LCD [(ii0L, JG1), (ii0L-1, JG1)] ≦ LCDTH
→ left continuous
(Case-HC2)
LCD [(ii0R, JG1), (ii0R + 1, JG1)] ≦ LCDTH
→ right continuous
The positional relationship is determined as follows.
[0113]
<Case-HP1>
ii0 <ii0L and ii0R <ii1
→ The true area is smaller than the estimated area
DVx = (0, DSG12x) (28)
RD = R × W / WD (29)
WD = ii0R−ii0L + 1 (30)
W = ii1-ii0 + 1 (31)
<Case-HP2>
ii0 <ii0L and ii0R = ii1 and right continuous
→ The true area is shifted to the right from the estimated area.
DVx = (0, ii0−ii0L) (32)
[0114]
<Case-HP3>
ii0 ≧ ii0L and ii0R <ii1 and left continuous
→ The true region is shifted to the left of the estimated region
DVx = (0, ii1-ii0R) (33)
<Case-HP4>
ii0 = ii0L and ii0R = ii1 and right continuous and left continuous
→ The true area is larger than the estimated area
DVx = (0, DSG12x) (34)
RD = R × W / WD (35)
WD = ii0R−ii0L + 1 (36)
W = ii1-ii0 + 1 (37)
where
DVx: horizontal two-dimensional position correction vector
RD: distance value after correction
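The horizontal scan and the case classification HP1 to HP4 can be sketched as follows. The sketch scans one row of the label image (label 1 inside the estimated area, labels 0 and 2 otherwise), takes the color continuity results as boolean inputs, and assumes that WD is ii0R − ii0L + 1 in every case, as in equation (36). All function names are hypothetical.

```python
def scan_to_boundary(row, start, stop, step):
    """Walk from start toward stop; return the first index whose label is not 1,
    or stop itself if the scan reaches the end of the range."""
    i = start
    while i != stop and row[i] == 1:
        i += step
    return i


def classify_horizontal(row, ig, ii0, ii1, left_cont, right_cont):
    """Return (case, ii0L, ii0R) following the Case-HP1 to Case-HP4 conditions."""
    ii0l = scan_to_boundary(row, ig, ii0, -1)
    ii0r = scan_to_boundary(row, ig, ii1, +1)
    if ii0 < ii0l and ii0r < ii1:
        case = "HP1"  # true area smaller than the estimated area
    elif ii0 < ii0l and ii0r == ii1 and right_cont:
        case = "HP2"  # true area shifted to the right
    elif ii0 >= ii0l and ii0r < ii1 and left_cont:
        case = "HP3"  # true area shifted to the left
    elif ii0 == ii0l and ii0r == ii1 and left_cont and right_cont:
        case = "HP4"  # true area larger than the estimated area
    else:
        case = None   # none of the listed cases applies
    return case, ii0l, ii0r


def corrected_distance(r, ii0, ii1, ii0l, ii0r):
    """RD = R * W / WD with W = ii1 - ii0 + 1 and WD = ii0R - ii0L + 1."""
    return r * (ii1 - ii0 + 1) / (ii0r - ii0l + 1)


row = [2, 2, 1, 1, 1, 2, 2]
case, ii0l, ii0r = classify_horizontal(row, ig=3, ii0=0, ii1=6,
                                       left_cont=False, right_cont=False)
rd = corrected_distance(1000.0, 0, 6, ii0l, ii0r)
```

In this example the matching blocks span only 5 of the 7 scanned blocks, so the object is judged smaller on screen than estimated and the distance is increased accordingly.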
[0115]
(2) Vertical scan
As shown in FIG. 14, based on the label image obtained by the color continuity evaluation in [B] above, a vertical scan is performed over the range [jj0, jj1] in both the upward and downward directions from the center of gravity G (IG1, JG1) of the estimated area to find the following two positions.
[0116]
CU (IG1, jj0U): Position that hits a block with a label value of 2 or 0 for the first time in the upper scan
CD (IG1, jj0D): Position that hits a block with a label value of 2 or 0 for the first time in the lower scan
Determine the continuity of the color vectors at these two points as follows:
(Case-VC1)
LCD [(IG1, jj0U), (IG1, jj0U-1)] ≦ LCDTH
→ upper continuous
(Case-VC2)
LCD [(IG1, jj0D), (IG1, jj0D + 1)] ≦ LCDTH
→ lower continuous
Further, the positional relationship and the two-dimensional correction in the vertical direction are determined as follows.
[0117]
<Case-VP1>
jj0 <jj0U and jj0D <jj1
→ The true area is smaller than the estimated area.
[0118]
DVy = (0, DSG12y) (38)
RD = R × H / HD (39)
HD = jj0D-jj0U + 1 (40)
H = jj1-jj0 + 1 (41)
<Case-VP2>
jj0 <jj0U and jj0D = jj1 and lower continuous
→ The estimated area is shifted above the true area.
DVy = (0, jj0−jj0U) (42)
[0119]
<Case-VP3>
jj0 ≧ jj0U and jj0D <jj1 and upper continuous
→ The estimated area is shifted below the true area.
DVy = (0, jj1-jj0D) (43)
<Case-VP4>
jj0 = jj0U and jj0D = jj1, and lower continuous and upper continuous
→ The true area is larger than the estimated area
DVy = (0, DSG12y) (44)
RD = R × H / HD (45)
HD = jj0D-jj0U + 1 (46)
H = jj1-jj0 + 1 (47)
where
DVy: vertical two-dimensional position correction vector
RD: distance value after correction
From the above results, it can be determined which case corresponds to the positional relationship in FIGS.
[0120]
(C) Estimation of 3D position and orientation correction vector
The three-dimensional position correction is performed based on the result classified according to the above, the two-dimensional position correction approximate vector DV, and the distance correction value RD.
[0121]
(4) Physical constraints
Regarding the movement of an object, there is a range of conditions that follows naturally, as a matter of common sense, from the physical properties of the object. Taking such conditions into account in advance, the temporal change in the position and orientation of the object is evaluated.
[0122]
[A] Validity of time-series change in position and orientation parameters
If the position and orientation are estimated by the above process without considering continuity between frames, the speed, acceleration, and frequency of the object may exceed the common sense range. Therefore, the calculation results for each frame are corrected by the following three methods.
[0123]
(1) Time series moving average
(2) Set an upper limit on the temporal difference of the position and orientation
(3) Set the inter-frame prediction value and the upper limit of error
Hereinafter, (2) will be described mainly. Further, (3) will be described in detail in the description of the region prediction process.
[0124]
[B] Evaluation of temporal deviation in a plane parallel to the screen
The temporal change of the region center of gravity G in the screen can be calculated from the inter-frame changes of α and β. The temporal changes of α and β are defined by the following equations (48) and (49).
[0125]
[Equation 9]

n: index of the frames sampled once every Kdrop frames from the video-rate (30 frames/s) image
From this, the displacement amount DPL in the plane parallel to the screen starting from the area centroid is calculated by the following equations (50) to (52).
[0126]
[Expression 10]

If the absolute-value distance measure is used in the above equation (52), the following equation (53) is obtained.
[0127]
DPL (n) = | Dx (n) | + | Dy (n) | (53)
For these, a threshold value DPL_TH is set, and the following determination process is performed.
[0128]
DPL (n)> DPL_TH
→ α (n) = α (n−1), β (n) = β (n−1)
DPL_TH ≧ DPL (n)
→ Approve α (n), β (n)
For example, if the speed of human movement in everyday activity is taken to be about 1 to 2 m per second, except for special cases, the threshold DPL_TH of the movement amount for 1/30 second can be set to about 30 to 70 mm.
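The determination process for α and β can be sketched as follows. The displacements Dx and Dy are taken as inputs because equations (50) and (51) are not reproduced, and the sketch assumes that the absolute-value distance is the sum |Dx| + |Dy|. The function name and the sample values are hypothetical; the threshold of 50 mm lies within the 30 to 70 mm range mentioned above.

```python
def gate_alpha_beta(prev_ab, curr_ab, dx, dy, dpl_th):
    """Accept the new (alpha, beta) only if the in-plane displacement is plausible.

    Assumption: DPL is the absolute-value distance |Dx| + |Dy|.
    If DPL exceeds DPL_TH, the values of the previous frame are held.
    """
    dpl = abs(dx) + abs(dy)
    return prev_ab if dpl > dpl_th else curr_ab


prev_ab = (0.10, 0.20)
curr_ab = (0.50, 0.20)

held = gate_alpha_beta(prev_ab, curr_ab, dx=80.0, dy=10.0, dpl_th=50.0)
accepted = gate_alpha_beta(prev_ab, curr_ab, dx=20.0, dy=10.0, dpl_th=50.0)
```

The same hold-or-accept pattern applies to the distance r and the postures θ and φ with their own thresholds.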
[0129]
[C] Evaluation of distance change over time
The time change of the distance r from the viewpoint to the region centroid can be described as the following equation (54).
[0130]
[Expression 11]

For this, a threshold value AR_TH is set, and the following determination process is performed.
[0131]
Dr (n)> AR_TH
→ r (n) = r (n-1)
AR_TH ≧ Dr (n)
→ Approve r (n)
[0132]
[D] Evaluation of time change of posture
With respect to the three-dimensional postures θ and φ of the object, time change and threshold determination can be considered in the same manner as described above.
(1) Horizontal orientation θ
[Expression 12]

Dth (n)> THE_TH
→ θ (n) = θ (n−1)
THE_TH ≧ Dth (n)
→ Approve θ (n)
(2) Vertical posture φ
[Formula 13]

Dph (n)> PHA_TH
→ φ (n) = φ (n−1)
PHA_TH ≧ Dph (n)
→ Approve φ (n)
[0133]
[E] Dynamic control
When applying physical constraints in dynamic region analysis, the parameter constraints are relaxed in the initial stage (for example, the first second or so) and gradually tightened as time passes (see FIGS. 16A to 16C). In other words, it is desirable that each threshold value XXX_TH set above be changed dynamically after a scene cut, as follows.
0 ≦ Nf ≦ Nf_TH
→ XXX_TH = TH1 (TH1> TH2)
Nf_TH < Nf
→ XXX_TH = TH2
Nf: Frame number of one cut scene
In this way, Nf_TH is set to the frame time required for the recursive area extraction result to stabilize so that the object can be tracked stably; large changes are allowed until then, and the allowable error is set small thereafter.
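The dynamic control of a threshold can be sketched as a simple two-level schedule. The value Nf_TH = 15 used in the example is an assumption chosen for illustration; the levels 300 and 100 are the LCD_TH values given earlier for the color distance. The function name is hypothetical.

```python
def dynamic_threshold(nf, nf_th, th1, th2):
    """Return TH1 while 0 <= Nf <= Nf_TH after a scene cut, and TH2 afterwards."""
    if th1 <= th2:
        raise ValueError("TH1 must be larger than TH2")
    return th1 if nf <= nf_th else th2


# LCD_TH schedule with an assumed stabilization time of 15 frames.
lcd_th_early = dynamic_threshold(nf=3, nf_th=15, th1=300, th2=100)
lcd_th_late = dynamic_threshold(nf=40, nf_th=15, th1=300, th2=100)
```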
[0134]
(5) Model adaptation
[A] Shoulder width adjustment
When it can be assumed that the image is almost a front image, the horizontal width of the body portion is changed within a certain range with respect to the initial wire frame, and a value that maximizes the fitness evaluation function of only the body portion is selected.
[0135]
(1) The approximate shoulder width Wsh obtained from the moving region analysis result is used as the search center.
(2) As for the maximum search width, the ratio w to the horizontal width Whd of the head region is considered within the range of the following equation (57). As a guide, (w1, w2) = (1.0, 5.0) or so.
Wsh = Whd × w (w1 ≦ w ≦ w2) (57)
[B] Adaptation of head width
Basically the same concept as shoulder width.
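The width search of equation (57) can be sketched as follows. For simplicity the sketch scans the whole allowed range w1 ≦ w ≦ w2 on a uniform grid rather than searching outward from the approximate shoulder width, and the fitness evaluation function is passed in as a callable because its definition is not given here. The function names, the grid size, and the toy fitness function are all hypothetical.

```python
def search_shoulder_width(whd, fitness, w1=1.0, w2=5.0, steps=41):
    """Return the shoulder width Wsh = Whd * w (w1 <= w <= w2) with the best fitness."""
    best_wsh, best_score = None, None
    for k in range(steps):
        w = w1 + (w2 - w1) * k / (steps - 1)
        wsh = whd * w
        score = fitness(wsh)
        if best_score is None or score > best_score:
            best_wsh, best_score = wsh, score
    return best_wsh


def toy_fitness(wsh):
    """Toy stand-in for the body-part fitness function, peaking at a width of 62."""
    return -abs(wsh - 62.0)


wsh = search_shoulder_width(whd=20.0, fitness=toy_fitness)
```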
[0136]
(6) Restrictions due to the application of common sense
In order to select color information, three-dimensional shape, and scene structure from the model database of FIG. 1, the situation of the scene must be determined based on the evaluation and correction results obtained by the above process, in addition to the mode information and model information obtained from the system control unit. Therefore, for mode control related to the usage environment and subject, a state transition pattern is set in advance. At each branch, in the initial stage when information for judging the state transition is insufficient, the transition is decided from the default probability values; as this decision is repeated, the probability values are updated according to the evaluation results so that appropriate state transitions can be made.
[0137]
[A] Update probability label
After one pass along a path of the state transition interpretation tree, the probability label is updated by comparing the evaluation score SCORE obtained in the evaluation process with the threshold values SC_TH1 and SC_TH2, according to the following cases.
[0138]
SCORE ≧ SC_TH1
→ Increase the probability label of p (path-i)
SC_TH1> SCORE ≧ SC_TH2
→ Do not change the probability label of p (path-i)
SC_TH2> SCORE
→ Decrease the probability label of p (path-i)
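The update rule can be sketched as follows. The text does not state by how much the probability label is raised or lowered, so the fixed step of 0.1 and the clamping to the range 0 to 1 are assumptions, as are the function name and the sample thresholds.

```python
def update_probability(p, score, sc_th1, sc_th2, step=0.1):
    """Raise, keep, or lower the probability label of a path from its score.

    Assumptions: fixed step size and clamping of the result to [0, 1].
    """
    if score >= sc_th1:
        p += step
    elif score < sc_th2:
        p -= step
    return min(1.0, max(0.0, p))


p_up = update_probability(0.5, score=0.9, sc_th1=0.8, sc_th2=0.4)
p_same = update_probability(0.5, score=0.6, sc_th1=0.8, sc_th2=0.4)
p_down = update_probability(0.5, score=0.2, sc_th1=0.8, sc_th2=0.4)
```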
[0139]
[B] Determination of parallel evaluation of other paths
It is determined whether to analyze / evaluate each path according to the value SP (path-i) of the path sum of the probability label. When the path with the highest probability value is assumed to be path-i, it is determined as follows.
[0140]
(1) SP (path-i) ≧ 0.8
→ Perform only path-i calculation
(2) 0.8> SP (path-i) ≧ 0.5
→ Select path-j, which has the next highest probability after path-i, and proceed with parallel computation of path-i and path-j
(3) 0.5> SP (path-i) ≧ 0.2
→ Select the top three path-k with the highest probability
(4) 0.2> SP (path-i)
→ Change situation assumptions
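The four-way decision can be sketched as follows. The sketch takes a mapping from path names to SP values and returns the paths to evaluate in parallel; an empty list stands for "change situation assumptions". The function name and the sample values are hypothetical.

```python
def select_paths(sp):
    """Choose which paths to evaluate from their path-sum probabilities SP."""
    ranked = sorted(sp, key=sp.get, reverse=True)
    top = sp[ranked[0]]
    if top >= 0.8:
        return ranked[:1]  # evaluate only the most probable path
    if top >= 0.5:
        return ranked[:2]  # evaluate the best two paths in parallel
    if top >= 0.2:
        return ranked[:3]  # evaluate the top three paths
    return []              # no path is credible: change the situation assumptions


single = select_paths({"path-a": 0.85, "path-b": 0.10})
pair = select_paths({"path-a": 0.60, "path-b": 0.30, "path-c": 0.10})
triple = select_paths({"path-a": 0.30, "path-b": 0.25, "path-c": 0.20, "path-d": 0.10})
none = select_paths({"path-a": 0.10, "path-b": 0.05})
```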
[0141]
Next, the result of the estimation region extraction process performed experimentally by the inventors will be schematically described. FIG. 17 to FIG. 24 show the results when the data is divided into CIF format blocks. FIG. 17 shows the result of labeling by motion vector detection for an image obtained by photographing a person's head. On the other hand, FIG. 18 shows the result of correcting by detecting the panning vector. Further, FIG. 19 shows the result of performing noise removal processing by histogram processing in the time direction. The value in the figure indicates the frequency. In addition, FIG. 20 shows the result of evaluation based on the barycentric color for the face area. FIG. 21 shows the result of projection by wire frame matching.
[0142]
Further, FIGS. 22 to 24 show how the estimation region corresponding to FIG. 20, extracted in this way, changes over time, counted from the scene start time: the first frame (Nf = 1), the seventh frame (Nf = 7), and the thirteenth frame (Nf = 15).
[0143]
According to the present embodiment, the moving region of the image is extracted as the estimated region of the object, and in the encoding process this region is encoded at a higher rate than the other regions. Therefore, even when the encoded image signal is transmitted within a specified transmission capacity, motion is expressed with high accuracy in the region of interest to the viewer, such as a person's face, and the perceived quality of the moving image can be improved.
[0144]
In addition, according to the present embodiment, the three-dimensional shape of the object is defined as a wire frame model, the color model information and the scene structure are defined and set as a model base, and the extracted moving region shape feature is defined. Based on this, the position and orientation of the wire frame model is estimated, and the estimated region is extracted based on the estimated position and orientation. Therefore, the estimated region can be extracted from the moving region with high accuracy according to the motion of the object.
[0145]
In addition, the fitness of the estimated region extracted by the model fitting process is evaluated based on the image data, and the estimated region is corrected so that the fitness becomes higher according to the evaluation result. The encoding process can therefore use the corrected, more accurate estimated region, and the estimated region corresponding to the object can be extracted with higher accuracy.
[0146]
Since, for example, color information, motion information, or three-dimensional structure information is defined in advance as common-sense model information on the object, the evaluation correction process can evaluate whether the data of the extracted estimated region conforms to this common-sense model information and correct it accordingly, so that a more accurate estimated region can be extracted.
[0147]
In the encoding process, the estimated region is encoded with high accuracy based on the information on the estimated region extracted in the previous frame. Therefore, the image data of the same frame does not have to be read and processed repeatedly, and the calculation can be performed quickly. In addition, since the estimated region for the next frame is predicted as needed, accurate estimated region information can be obtained that follows the object even when its motion is fast.
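The prediction of the next frame's estimated region can be sketched in a simplified two-dimensional form. The sketch assumes that only the centroid of the estimated region is tracked and that its next position is extrapolated from a moving average of recent displacements; the function name `predict_next_centroid` and the window length are hypothetical, and the three-dimensional motion prediction of the embodiment is not reproduced here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def predict_next_centroid(history: List[Point], window: int = 3) -> Point:
    """Extrapolate the next centroid from the latest displacements.

    history holds at least one centroid, oldest first. The average of the
    last `window` frame-to-frame displacements is added to the newest one.
    """
    if len(history) < 2:
        return history[-1]
    steps = [(b[0] - a[0], b[1] - a[1])
             for a, b in zip(history, history[1:])][-window:]
    vx = sum(s[0] for s in steps) / len(steps)
    vy = sum(s[1] for s in steps) / len(steps)
    return (history[-1][0] + vx, history[-1][1] + vy)


# Hypothetical centroid track in pixel coordinates.
track = [(10.0, 10.0), (12.0, 11.0), (14.0, 12.0), (16.0, 13.0)]
predicted = predict_next_centroid(track)
```

Shifting the previous estimated region by the predicted displacement gives a region that can be applied to the next frame before that frame has been analyzed.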
[0148]
In addition, in the moving region detection process, the moving region is detected based on the color information of the image in addition to the information on the detected motion blocks, so that the estimated region corresponding to the object can be extracted more accurately.
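One possible reading of this color-based step is sketched below. It assumes that each block is summarized by one average color, that the centroid color is the mean color of the motion blocks, and that a motion block is kept only when its Euclidean color distance to the centroid color is within a threshold. These choices, the function names, and the sample colors are illustrative assumptions rather than the procedure of the embodiment.

```python
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def centroid_color(colors: Sequence[Color]) -> Color:
    """Mean of the given colors, channel by channel."""
    n = len(colors)
    return (sum(c[0] for c in colors) / n,
            sum(c[1] for c in colors) / n,
            sum(c[2] for c in colors) / n)


def color_distance(a: Color, b: Color) -> float:
    """Euclidean distance between two colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_blocks(block_colors: List[Color], motion_flags: List[bool],
                  threshold: float) -> List[bool]:
    """Keep motion blocks whose color is close to the centroid color."""
    moving = [c for c, m in zip(block_colors, motion_flags) if m]
    if not moving:
        return [False] * len(block_colors)
    center = centroid_color(moving)
    return [m and color_distance(c, center) <= threshold
            for c, m in zip(block_colors, motion_flags)]


# Hypothetical block colors: two skin-like blocks, one dark moving block,
# and one skin-like block that is not moving.
colors = [(200.0, 150.0, 130.0), (205.0, 148.0, 128.0),
          (30.0, 30.0, 30.0), (198.0, 152.0, 131.0)]
flags = [True, True, True, False]
kept = select_blocks(colors, flags, threshold=100.0)
```

In this example the dark moving block is rejected because its color is far from the centroid color, while the fourth block is rejected because it shows no motion.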
[0149]
In the moving region detection process, the threshold used in the threshold determination that follows the color distance calculation is lowered as time elapses. Because the extraction accuracy of the estimated region improves with the time elapsed since the start of the scene, lowering the threshold in this way allows the determination to be made with higher accuracy.
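A threshold that falls over time can be sketched as a simple schedule. The linear shape, the start and end values, and the number of frames over which the threshold settles are all assumptions made for illustration; the text above only states that the threshold is lowered as time elapses from the start of the scene.

```python
def color_threshold(frame_index: int, initial: float = 100.0,
                    final: float = 40.0, settle_frames: int = 15) -> float:
    """Color distance threshold for a frame counted from the scene start.

    The value falls linearly from `initial` to `final` over `settle_frames`
    frames and then stays at `final`. All numbers are illustrative.
    """
    if frame_index >= settle_frames:
        return final
    t = max(frame_index, 0) / settle_frames
    return initial + (final - initial) * t


# Thresholds for the first 20 frames of a scene.
schedule = [color_threshold(i) for i in range(20)]
```

A loose threshold at the start tolerates a poorly estimated centroid color; once the estimate has stabilized, the tighter threshold rejects more background blocks.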
[0150]
In the moving region detection process, noise is removed based on the distance from the centroid position, and noise is also removed by using the image data of temporally preceding and following frames, so that the moving region can be extracted with high accuracy.
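Both kinds of noise removal can be sketched on block coordinates. The first function drops motion blocks that lie far from the centroid of all motion blocks; the second counts, for each block, the number of frames in which it was labeled as moving and keeps blocks that reach a minimum count. The function names, the distance limit, and the minimum count are hypothetical parameters.

```python
from collections import Counter
from typing import List, Set, Tuple

Block = Tuple[int, int]


def remove_far_blocks(blocks: List[Block], max_distance: float) -> List[Block]:
    """Keep blocks within max_distance (in block units) of the centroid."""
    if not blocks:
        return []
    gx = sum(b[0] for b in blocks) / len(blocks)
    gy = sum(b[1] for b in blocks) / len(blocks)
    return [b for b in blocks
            if ((b[0] - gx) ** 2 + (b[1] - gy) ** 2) ** 0.5 <= max_distance]


def temporal_filter(frames: List[Set[Block]], min_count: int) -> Set[Block]:
    """Keep blocks labeled as moving in at least min_count of the frames."""
    histogram = Counter(b for frame in frames for b in frame)
    return {b for b, n in histogram.items() if n >= min_count}


# Hypothetical motion blocks: a compact cluster plus one isolated block.
labeled = [(5, 5), (5, 6), (6, 5), (6, 6), (20, 1)]
spatially_clean = remove_far_blocks(labeled, max_distance=5.0)

# Hypothetical labels over three consecutive frames.
sequence = [{(5, 5), (5, 6)}, {(5, 5), (9, 9)}, {(5, 5), (5, 6)}]
temporally_clean = temporal_filter(sequence, min_count=2)
```

The isolated block (20, 1) is removed by the distance test, and the block (9, 9), which appears in only one frame, is removed by the temporal histogram.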
[Brief description of the drawings]
FIG. 1 is a block diagram of an encoder showing an embodiment of the present invention.
FIG. 2 is an overall schematic configuration diagram.
FIG. 3 is a diagram illustrating the principle of extracting region information.
FIG. 4 is a flowchart of a region extraction program.
FIG. 5 is a flowchart of a moving area analysis processing program.
FIG. 6 is a flowchart of a model adaptation processing program.
FIG. 7 is a flowchart of an evaluation / correction processing program.
FIG. 8 is a flowchart of a region prediction processing program.
FIG. 9 is a flowchart of a panning vector calculation program.
FIG. 10 is an explanatory diagram of blocks used for calculation of panning vector detection in an image.
FIG. 11 is a diagram for explaining the operation when detecting the inclination of the extraction region.
FIG. 12 is a conceptual explanatory diagram relating to color evaluation of an extracted estimation area.
FIG. 13 is a conceptual explanatory diagram for calculating a two-dimensional approximate correction vector.
FIG. 14 is a conceptual explanatory diagram in the case of performing color continuity determination processing in an estimated area.
FIG. 15 is an explanatory diagram showing various misalignment relationships between the estimated area and the real area.
FIG. 16 is an operation explanatory diagram showing threshold change.
FIG. 17 is a labeling result diagram by motion vector detection in each block of one frame showing an experimental example.
FIG. 18 shows the result of labeling after removing the panning vector.
FIG. 19 shows the result of labeling after performing histogram processing in the time direction.
FIG. 20 shows the result of evaluation based on the center of gravity color.
FIG. 21 is a projection result diagram by wire frame matching.
FIG. 22 is a view corresponding to FIG. 21, showing the change over time (part 1).
FIG. 23 is a view corresponding to FIG. 21, showing the change over time (part 2).
FIG. 24 is a view corresponding to FIG. 21, showing the change over time (part 3).
[Explanation of symbols]
1 is an encoder, 2 is a camera, 3 is an A/D conversion unit, 4 is a motion compensated prediction encoding unit, 5 is a model-based target region extraction unit, 6 is a two-dimensional motion vector detection unit, 7 is an encoding attribute determination/control unit, 8 is a moving region analysis unit, 8a is a panning vector detection unit, 8b is a motion block determination unit, 8c is a centroid calculation unit, 8d is an average color calculation unit, 8e is a background noise removal unit, 8f is a two-dimensional shape parameter extraction unit, 9 is a model adaptation unit, 9a is a three-dimensional position and orientation estimation unit, 9b is a perspective transformation unit, 9c is a projection area calculation unit, 10 is an evaluation/correction unit, 10a is a color evaluation unit, 10b is a fitness calculation unit, 10c is a three-dimensional position and orientation speed determination unit, 11 is a region prediction unit, 11a is a moving average unit, 11b is a three-dimensional motion prediction unit, 11c is a region calculation unit, 12 is a model database, 12a is a color information database, 12b is a three-dimensional shape database, and 12c is a scene structure database.

Claims

An image information encoding method for encoding image information obtained by an image recognition method that includes:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object,
the encoding method comprising an encoding process of performing encoding so as to allocate a larger amount of information to the estimated region extracted by the model fitting process than to other regions, wherein:
in the moving region detection process, the moving region is detected based on color information of the image in addition to information on the detected motion blocks;
the color information is detected by determining the color of the object through calculation of a centroid color, calculation of a color distance, and threshold determination; and
the threshold used in the threshold determination is changed to a lower level as time elapses.

An image information encoding method for encoding image information obtained by an image recognition method that includes:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object,
the encoding method comprising an encoding process of performing encoding so as to allocate a larger amount of information to the estimated region extracted by the model fitting process than to other regions, wherein
in the moving region detection process, noise removal is performed by creating a histogram with reference to the motion blocks in the images of the frames temporally preceding and following the detected motion blocks and applying temporal filtering.

An image information encoding method for encoding image information obtained by an image recognition method that includes:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object,
the encoding method comprising an encoding process of performing encoding so as to allocate a larger amount of information to the estimated region extracted by the model fitting process than to other regions, wherein:
a model database is provided that defines, as models of the object, models including a three-dimensional model, color information, and a scene structure; and
when a model is selected from the model database, a probability value used for stochastic state transition is changed or a state transition path is changed.

An image information encoding method for encoding image information obtained by an image recognition method that includes:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object,
the encoding method comprising:
an encoding process of performing encoding so as to allocate a larger amount of information to the estimated region extracted by the model fitting process than to other regions; and
an evaluation correction process of evaluating, based on the image data, the fitness of the estimated region extracted by the model fitting process and correcting the estimated region so that the fitness becomes higher according to the evaluation result, wherein:
in the encoding process, encoding is performed on the estimated region corrected through the evaluation correction process; and
in the evaluation correction process, the calculated values of the three-dimensional position and orientation are evaluated and corrected by threshold determination on the parameter values of the three-dimensional position and orientation of the object, their temporal change, and their prediction error.

An image information encoding method for encoding image information obtained by an image recognition method that includes:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object,
the encoding method comprising an encoding process of performing encoding so as to allocate a larger amount of information to the estimated region extracted by the model fitting process than to other regions, and an evaluation correction process of evaluating, based on the image data, the fitness of the estimated region extracted by the model fitting process and correcting the estimated region so that the fitness becomes higher according to the evaluation result, wherein:
in the encoding process, encoding is performed on the estimated region corrected through the evaluation correction process; and
in the evaluation correction process, the threshold used in threshold determination of the calculated values of the three-dimensional position and orientation is changed dynamically as time elapses.

The image information encoding method according to any one of claims 1 to 5, wherein:
the shape model of the object is defined as a three-dimensional model; and
in the model fitting process, an approximate position and orientation of the three-dimensional model of the object are estimated based on two-dimensional shape features of the moving region, and the estimated region is extracted based on the estimated position and orientation.

The image information encoding method according to claim 6, wherein
in the model fitting process, a wire frame model is projected onto the image plane based on the position and orientation information of the three-dimensional model of the object, and the area inside the projected model is extracted as the estimated region of the object.

The image information encoding method according to any one of claims 1 to 3, 6, and 7, further comprising
an evaluation correction process of evaluating, based on the image data, the fitness of the estimated region extracted by the model fitting process and correcting the estimated region so that the fitness becomes higher according to the evaluation result, wherein
in the encoding process, encoding is performed on the estimated region corrected through the evaluation correction process.

The image information encoding method according to any one of claims 4, 5, and 8, wherein
in the evaluation correction process, common-sense model information on the object is defined in advance, and evaluation and correction are performed based on the common-sense model information.

The image information encoding method according to claim 9, wherein
the evaluation correction process includes a process of determining whether the estimated regions extracted from the temporally successive images are consistent with physical motion conditions or common-sense motion conditions of the object.

The image information encoding method according to any one of claims 8 to 10, wherein
in the evaluation correction process, the estimated region is corrected by performing evaluation against common-sense color information for the object, re-evaluation of the centroid color, and evaluation of color continuity within the estimated region.

The image information encoding method according to any one of claims 1 to 11, wherein
in the encoding process, the information on the extracted estimated region is applied to the encoding of the image of the next frame.

The image information encoding method according to any one of claims 1 to 11, further comprising
a region prediction process of predicting the estimated region of the object in the next image based on information obtained in the course of extracting the estimated region, wherein
in the encoding process, the information on the predicted estimated region obtained in the region prediction process is applied to the encoding of the image of the next frame.

The image information encoding method according to any one of claims 1 to 13, wherein
in the moving region detection process, a panning vector of the image is detected, and the information on the detected motion blocks is corrected based on the detected panning vector.

The image information encoding method according to any one of claims 1 to 14, wherein
in the moving region detection process, the moving region is detected based on color information of the image in addition to the information on the detected motion blocks.

The image information encoding method according to claim 15, wherein
in the moving region detection process, the color information is detected by determining the color of the object through calculation of a centroid color, calculation of a color distance, and threshold determination.

The image information encoding method according to any one of claims 1 to 16, wherein
in the moving region detection process, noise removal is performed by calculating the distance between each detected motion block and the centroid position and applying threshold determination to that distance.

The image information encoding method according to any one of claims 7 to 16, wherein
in the evaluation correction process, when the wire frame model of the three-dimensional model of the object is projected onto the image plane, model fitting is performed by fitting the wire frame model so that the fitness of a predetermined part of the object becomes high.

An image recognition method comprising:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object, wherein:
in the moving region detection process, the moving region is detected based on color information of the image in addition to information on the detected motion blocks;
the color information is detected by determining the color of the object through calculation of a centroid color, calculation of a color distance, and threshold determination; and
the threshold used in the threshold determination is changed to a lower level as time elapses.

An image recognition method comprising:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object, wherein
in the moving region detection process, noise removal is performed by creating a histogram with reference to the motion blocks in the images of the frames temporally preceding and following the detected motion blocks and applying temporal filtering.

An image recognition method comprising:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object; and
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object, wherein:
a model database is provided that defines, as models of the object, models including a three-dimensional model, color information, and a scene structure; and
when a model is selected from the model database, a probability value used for stochastic state transition is changed or a state transition path is changed.

An image recognition method comprising:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object;
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object; and
an evaluation correction process of evaluating, based on the image data, the fitness of the estimated region extracted by the model fitting process and correcting the estimated region so that the fitness becomes higher according to the evaluation result, wherein
in the evaluation correction process, common-sense model information on the object is defined in advance, evaluation and correction are performed based on the common-sense model information, and the calculated values of the three-dimensional position and orientation are evaluated and corrected by threshold determination on the parameter values of the three-dimensional position and orientation of the object, their temporal change, and their prediction error.

An image recognition method comprising:
a moving region detection process of dividing image information into blocks, detecting blocks in which motion occurs from images of at least two temporally successive frames capturing a moving object, and thereby detecting a moving region corresponding to the object;
a model fitting process of applying a predefined shape model of the object to the moving region detected in the moving region detection process and recognizing the region contained inside the model as an estimated region of the object; and
an evaluation correction process of evaluating, based on the image data, the fitness of the estimated region extracted by the model fitting process and correcting the estimated region so that the fitness becomes higher according to the evaluation result, wherein
in the evaluation correction process, common-sense model information on the object is defined in advance, evaluation and correction are performed based on the common-sense model information, and the threshold used in threshold determination of the calculated values of the three-dimensional position and orientation is changed dynamically as time elapses.

The image recognition method according to any one of claims 19 to 23, wherein:
the shape model of the object is defined as a three-dimensional model; and
in the model fitting process, an approximate position and orientation of the three-dimensional model of the object are estimated based on two-dimensional shape features of the moving region, and the estimated region is extracted based on the estimated position and orientation.

The image recognition method according to claim 24, wherein
in the model fitting process, a wire frame model is projected onto the image plane based on the position and orientation information of the three-dimensional model of the object, and the area inside the projected model is extracted as the estimated region of the object.

The image recognition method according to any one of claims 19 to 21, 23, 25, further comprising
an evaluation correction process of evaluating, based on the image data, the fitness of the estimated region extracted by the model fitting process and correcting the estimated region so that the fitness becomes higher according to the evaluation result.

The image recognition method according to claim 26, wherein
in the evaluation correction process, common-sense model information on the object is defined in advance, and evaluation and correction are performed based on the common-sense model information.

The image recognition method according to claim 27, wherein
the evaluation correction process includes a process of determining whether the estimated regions extracted from the temporally successive images are consistent with physical motion conditions or common-sense motion conditions of the object.

The image recognition method according to any one of claims 26 to 28, wherein
in the evaluation correction process, the estimated region is corrected by performing evaluation against common-sense color information for the object, re-evaluation of the centroid color, and evaluation of color continuity within the estimated region.

The image recognition method according to any one of claims 19 to 29, wherein
in the moving region detection process, a panning vector of the image is detected, and the information on the detected motion blocks is corrected based on the detected panning vector.

The image recognition method according to any one of claims 19 to 30, wherein
in the moving region detection process, the moving region is detected based on color information of the image in addition to the information on the detected motion blocks.

The image recognition method according to claim 31, wherein
in the moving region detection process, the color information is detected by determining the color of the object through calculation of a centroid color, calculation of a color distance, and threshold determination.

The image recognition method according to any one of claims 19 to 32, wherein
in the moving region detection process, noise removal is performed by calculating the distance between each detected motion block and the centroid position and applying threshold determination to that distance.

The image recognition method according to any one of claims 26 to 33, wherein
in the evaluation correction process, when the wire frame model of the three-dimensional model of the object is projected onto the image plane, model fitting is performed by fitting the wire frame model so that the fitness of a predetermined part of the object becomes high.