JP2003512802A - System and method for three-dimensional modeling

Info

Publication number: JP2003512802A
Application number: JP2001532487A
Authority: JP
Inventors: ヨン，イェン; シャラパリ，キラン
Original assignee: Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 1999-10-21
Filing date: 2000-10-06
Publication date: 2003-04-02
Also published as: WO2001029767A3; EP1190385A2; KR20010089664A; WO2001029767A2

Abstract

(57) [Abstract] A method and an image processing system are disclosed for providing a three-dimensional model using information from a pair of input image frames. A three-dimensional surface is obtained by first identifying features in the input frames. Feature correspondence matching is then performed using frame-based disparity information. The identified features may also be related to one another using temporal information.

Description

Detailed Description of the Invention

[0001] [Field of the Invention] The present invention relates generally to the field of three-dimensional modeling and, more particularly, to a system and method for three-dimensional modeling of objects contained in digital images using disparity-based information.

[0002] [Background of the Invention] Applications for video/image communication over the Internet or the Public Switched Telephone Network (PSTN) are growing in popularity and use. In conventional video/image communication technology, a picture (in JPEG or GIF format) is captured and transmitted over a transport network. This approach, however, requires a large bandwidth because of the size of the picture (i.e., the amount of data).

[0003] Block-based hybrid coders are commonly used when coding moving images at 64 kbit/s to 2 Mbit/s. The coder subdivides each image of the sequence into independently moving blocks. Each block is then coded by 2D motion prediction and transform coding. Depending on the transmission rate, the received images may not play smoothly and may not play in real time.

[0004] Methods are used for video/image communication and/or for reducing the amount of information required to be transmitted. One such method is used in videophone applications. An image is encoded with three sets of parameters that define its motion, shape, and surface color. Because the subject of visual communication is typically a human, the primary focus can be directed at the subject's head or face.

[0005] In conventional videophone communication systems, a single camera is typically used to capture an image of the person making the video call. Because only one camera is used, it is difficult to capture the true three-dimensional (3D) surface of the face (i.e., the shape parameters). In general, multiple two-dimensional views of the object (typically six) are needed to generate a 3D surface. A distance transform is applied using these views. For example, an ellipse can be used as the generating function to obtain the Z coordinate of a point within the object as a function of its distance to the object's boundary. Such contours can only approximate the 3D shape.

[0006] Another method is called model-based coding. Low bit-rate communication can be achieved by encoding and transmitting only representative facial parameters of the subject's head. At the remote site, the facial image is synthesized using the transmitted parameters. In general, model-based coding requires at least four tasks: face segmentation, facial feature extraction, feature tracking, and motion estimation.

[0007] One known method for face segmentation is to form a data set that describes a parameterized face. This data set defines a three-dimensional description of the face object. The parameterized face is provided as an anatomy-based structure by modeling muscle and skin actuators and force-based deformations.

[0008] As shown in FIG. 1, a set of polygons defines a human face model 100. Each vertex of a polygon is defined by X, Y, and Z coordinates. Each vertex is identified by an index number. A particular polygon is defined by the set of indices surrounding that polygon. A code may be added to the set of indices to define a color for the particular polygon.
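By way of illustration only, the indexed-vertex representation described for the face model 100 might be sketched as follows. The class names, the `corners` helper, the vertex indices, the coordinates, and the color code value are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vertex = Tuple[float, float, float]  # (X, Y, Z) coordinates of one vertex


@dataclass
class Polygon:
    """A polygon defined by the indices of the vertices that surround it."""
    vertex_indices: Tuple[int, ...]
    color_code: Optional[int] = None  # optional code defining the polygon color


@dataclass
class FaceModel:
    """A face model: indexed vertices plus the polygons built from them."""
    vertices: Dict[int, Vertex]
    polygons: List[Polygon]

    def corners(self, polygon: Polygon) -> List[Vertex]:
        """Look up the coordinates of each vertex of a polygon by index."""
        return [self.vertices[i] for i in polygon.vertex_indices]


# Hypothetical three-vertex model; indices and coordinates are made up.
model = FaceModel(
    vertices={4: (0.0, 1.0, 2.5), 5: (0.5, 0.0, 2.0), 23: (-0.5, 0.0, 2.0)},
    polygons=[Polygon(vertex_indices=(4, 5, 23), color_code=7)],
)
```

Storing polygons as index tuples rather than as coordinates means that adjusting a single vertex updates every polygon that shares it.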

[0009] Systems and methods that analyze digital images, recognize human faces, and extract facial features are also known. Conventional facial feature detection systems use methods such as facial color tone detection, template matching, or edge detection approaches.

[0010] One of the most difficult problems in model-based coding is providing facial feature correspondence quickly, easily, and robustly. In sequential frames, the same facial features must be matched correctly. Conventionally, a block matching process is used to compare pixels in the current frame and the next frame to determine feature correspondence. If the entire frame is searched for face correspondences, the process is slow and may yield incorrect results because regions with the same gradient values are mismatched. If only a subset of the frame is searched, processing time may improve. In that situation, however, the process may be unable to determine all of the face correspondences.
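By way of illustration only, a conventional block matching search of the kind referred to above might be sketched as follows. The sum-of-absolute-differences cost, the function names, and the tiny test frames are assumptions chosen for the sketch; the disclosure does not specify a particular cost function or search strategy.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract_block(frame, top, left, size):
    """Cut a size x size block out of a frame given as a list of pixel rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(current, nxt, top, left, size, search):
    """Search a +/- `search` pixel window in `nxt` for the block of `current`
    at (top, left). Returns (cost, dy, dx) of the lowest-cost candidate."""
    reference = extract_block(current, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            # Skip candidates that fall outside the frame.
            if t < 0 or l < 0 or t + size > len(nxt) or l + size > len(nxt[0]):
                continue
            cost = sad(reference, extract_block(nxt, t, l, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Hypothetical 6 x 6 frames: a bright 2 x 2 patch moves from (1, 1) to (2, 3).
current = [[0] * 6 for _ in range(6)]
nxt = [[0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        current[1 + r][1 + c] = 9
        nxt[2 + r][3 + c] = 9

match = best_match(current, nxt, top=1, left=1, size=2, search=2)
```

The nested loops make the trade-off noted above concrete: the number of candidates grows with the square of the search range, while a small range misses any feature that moves farther than `search` pixels.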

[0011] Accordingly, there is a need in the art for an improved system and method for three-dimensional modeling of objects contained in digital images for reduced data-rate transmission.

[0012] [Brief Summary of the Invention] An object of the present invention is to address the limitations of the conventional video/image communication systems and model-based coding discussed above.

[0013] Another object of the present invention is to provide an object-oriented, cross-platform method of delivering compressed video information in real time.

[0014] A further object of the present invention is to enable coding of specific objects within an image frame.

[0015] A further object of the present invention is to integrate synthetic and natural visual objects interactively or in real time.

[0016] In one aspect of the present invention, an image processing device includes at least one feature extraction determiner configured to extract feature position information from a pair of input image signals, and a matching unit that matches corresponding features in the input image signals in accordance with the feature position information and disparity information.

[0017] One embodiment of the invention relates to a method of determining parameters related to a 3D model. The method includes the steps of extracting feature position information related to a pair of input images, and matching corresponding features in the pair of input images in accordance with the extracted feature position information and disparity information. The method further includes the step of determining the parameters for the 3D model in accordance with the feature correspondence matching.

[0018] These and other embodiments and aspects of the present invention are exemplified in the following detailed description.

[0019] The features and advantages of the present invention can be understood by reference to the detailed description of the preferred embodiments, taken together with the drawings.

[0020] [Description of the Preferred Embodiments] Referring to FIG. 2, a 3D modeling system 10 is shown. In general, the system 10 includes at least one feature extraction determiner 11, at least one set of temporal information 12, and a feature correspondence matching unit 13. A left frame 14 and a right frame 15 are input into the system 10. The left and right frames consist of image data, which may be digital or analog. If the image data is analog, an analog-to-digital circuit can be used to convert the data to digital form.

[0021] The feature extraction determiner 11 determines the positions/locations of features in a digital image, such as the positions of the facial features of the nose, eyes, and mouth. While two feature extraction determiners 11 are shown in FIG. 2, a single determiner may be used to extract the position information from both the left and right frames 14 and 15. The temporal information 12 includes data, such as preceding and/or subsequent frames, that is used to impose constraints for correct feature correspondence. It should be understood that the current frame to be processed needs to be the first frame input to the system 10. Test frames may be used to establish some hysteresis.

[0022] In a preferred embodiment, the system 10 is implemented by computer-readable code executed by a data processing apparatus. The code may be stored in a memory within the data processing apparatus, or read/downloaded from a storage medium such as a CD-ROM or floppy disk. In other embodiments, hardware circuitry may be used in place of, or in combination with, software instructions to implement the invention. For example, the invention may be implemented on a digital television platform using a Trimedia processor for processing and a television monitor for display. The invention can also be implemented on the computer 30 shown in FIG. 3.

[0023] As shown in FIG. 3, the computer 30 includes a network connection 31 for interfacing to a data network, such as a variable-bandwidth network or the Internet, and a fax/modem connection 32 for interfacing with other remote sources, such as a video or digital camera (not shown). The computer 30 also includes a display 33 for displaying information (including video data) to a user, a keyboard 34 for inputting text and user commands, a mouse 35 for positioning a cursor on the display 33 and for inputting user commands, a disk drive 36 for reading from and writing to floppy disks installed therein, and a CD-ROM drive 37 for accessing information stored on a CD-ROM. The computer 30 may also have one or more peripheral devices attached, such as a pair of video conference cameras for inputting images or the like, and a printer 38 for outputting images, text, or the like.

[0024] FIG. 4 shows the internal structure of the computer 30, which includes a memory 40 that may include random access memory (RAM), read-only memory (ROM), and a computer-readable medium such as a hard disk. The items stored in the memory 40 include an operating system 41, data 42, and applications 43. The data stored in the memory 40 may include the temporal information 12. In the preferred embodiment of the invention, the operating system is a windowing operating system such as UNIX (registered trademark), although the invention may also be used with other operating systems, such as Microsoft Windows 95. The applications stored in the memory 40 include a video coder 44, a video decoder 45, and a frame grabber 46. The video coder 44 encodes video data in a conventional manner, and the video decoder 45 decodes video data that has been coded in the conventional manner. The frame grabber 46 allows a single frame from a video signal stream to be captured and processed.

[0025] The computer 30 also includes a central processing unit (CPU) 50, a communication interface 51, a memory interface 52, a CD-ROM drive interface 53, a video interface 54, and a bus 55. The CPU 50 comprises a microprocessor or the like for executing computer-readable code, i.e., applications such as those noted above, out of the memory 40. Such applications may be stored in the memory 40 (as noted above) or, alternatively, on a floppy disk in the disk drive 36 or a CD-ROM in the CD-ROM drive 37. The CPU 50 accesses applications (or other data) stored on a floppy disk via the memory interface 52, and accesses applications (or other data) stored on a CD-ROM via the CD-ROM drive interface 53.

[0026] Application execution and other tasks of the computer 30 may be initiated using the keyboard 34 or the mouse 35. Output results from applications running on the computer may be displayed to the user on the display 33 or, alternatively, output via the network connection 31. For example, input video data may be received through the video interface 54 or the network connection 31. The input video data may be decoded by the video decoder 45. Output video data may be coded by the video coder 44 for transmission through the video interface 54 or the network connection 31. The display 33 preferably comprises a display processor for forming video images based on coded video data provided by the CPU 50 over the bus 55. Output results from the various applications may be provided to the printer 38.

[0027] Referring to FIG. 2, the left frame 14 and the right frame 15 preferably comprise a pair of stereo digital images. For example, the digital images may be received from two (still or video) cameras 60 and 61 (shown in FIG. 5) and stored in the memory 40 for subsequent processing. Other frames, or pairs of frames, captured at different angles or views may also be used. The cameras 60 and 61 may be part of another system, such as a video conferencing system or an animation system.

[0028] The cameras 60 and 61 are positioned close to each other, and the subject 64 is located a short distance from the cameras 60 and 61. As shown in FIG. 5, the cameras 60 and 61 are a distance b apart (center to center). The object 62 is at a distance f from each of the cameras 60 and 61. Preferably, b is approximately 12.7 to 15.24 centimeters (5 to 6 inches) and f is approximately 91.44 centimeters (3 feet). It should be understood, however, that the invention is not limited to these distances, which are merely examples.

[0029] Preferably, the camera 60 captures a front view and the camera 61 captures an offset or side view of the object 62. This allows the left frame 14 and the right frame 15 to be compared to determine a disparity map. In the preferred embodiment of the invention, the left frame 14 (image A) is compared with the right frame 15 (image B). The reverse comparison, however, may also be performed.

[0030] A digital frame or image can be described conceptually as having a plurality of horizontal scan lines and a plurality of vertical columns that form an array of pixels. The number of scan lines and columns determines the resolution of the digital image. To determine the disparity map, the scan lines are lined up; for example, scan line 10 of image A is matched with scan line 10 of image B. A pixel on scan line 10 of image A is matched to its corresponding pixel on scan line 10 of image B. Thus, for example, if the 15th pixel of scan line 10 of image A matches the 10th pixel of scan line 10 of image B, the disparity is calculated as 15 - 10 = 5. When the left and right cameras 60 and 61 are positioned close together, pixels of foreground information in the image, e.g., a person's face, have a larger disparity than pixels of background information.
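By way of illustration only, the per-pixel calculation along aligned scan lines might be sketched as follows. The function names and every matched pair other than the (15, 10) example are hypothetical, and the step that finds the matching pixel is assumed to have been performed already.

```python
def disparity(col_a: int, col_b: int) -> int:
    """Disparity of one matched pixel pair lying on the same scan line:
    column position in image A minus column position in image B."""
    return col_a - col_b


def scanline_profile(matches):
    """Build the disparity profile of one scan line.

    `matches` is a list of (column in image A, column in image B) pairs
    that have already been matched on the same scan line.
    Returns a list of (column in image A, disparity) pairs."""
    return [(col_a, disparity(col_a, col_b)) for col_a, col_b in matches]


# Scan line 10: the first pair is the example from the text (15 - 10 = 5);
# the remaining pairs are hypothetical.
matches_line_10 = [(15, 10), (40, 38), (100, 99)]
profile = scanline_profile(matches_line_10)
```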

[0031] The disparity map based on the disparity calculations may be stored in the memory 40. Each scan line (or column) of the image has a profile consisting of the disparity for each pixel in that scan line (or column). In this embodiment, the grayscale level of each pixel indicates the magnitude of the disparity calculated for that pixel. The darker the grayscale level, the lower the disparity.

[0032] A disparity threshold, e.g., 10, may be chosen; any disparity above the threshold indicates that the pixel is foreground information (i.e., the subject 64), while any disparity below 10 indicates that the pixel is background information. The choice of the disparity threshold is based in part on the camera distances discussed above. For example, a lower disparity threshold may be used if the object 62 is positioned at a greater distance from the camera 60 or 61, or a higher disparity threshold may be used if the cameras 60 and 61 are farther apart from each other.
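By way of illustration only, the thresholding step might be sketched as follows. The function names and the sample map are hypothetical. The text does not say how a disparity exactly equal to the threshold is classified; this sketch assumes it is treated as background.

```python
DISPARITY_THRESHOLD = 10  # example threshold value given in the text


def is_foreground(d: int, threshold: int = DISPARITY_THRESHOLD) -> bool:
    """True when the disparity is above the threshold (foreground pixel).
    Assumption: a disparity equal to the threshold counts as background."""
    return d > threshold


def foreground_mask(disparity_map, threshold: int = DISPARITY_THRESHOLD):
    """Classify every pixel of a disparity map given as a list of rows."""
    return [[is_foreground(d, threshold) for d in row] for row in disparity_map]


# Hypothetical 2 x 4 disparity map.
sample_map = [
    [2, 12, 15, 3],
    [1, 11, 10, 0],
]
mask = foreground_mask(sample_map)
```

Passing a different `threshold` corresponds to the adjustment described above for a more distant object or a wider camera spacing.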

[0033] The disparity map is used to extract the facial feature positions or coordinates from the left or right frames 14 and 15. The feature extraction determiner 11 preferably comprises the system and method described in US patent application Ser. No. 08/385,280, filed Aug. 30, 1999. The facial feature positions preferably include positions for the eyes, nose, and mouth, as well as outline positions of the head. With reference to FIG. 1, these positions correlate to the various vertices of the face model 100. For example, with respect to the nose, the facial feature extraction determiner preferably provides information directly related to vertices 4, 5, 23, and 58 shown in FIG. 1.

[0034] The feature extraction determiner 11, however, provides only the X and Y coordinates of the facial features. The feature correspondence matching unit 13 provides the Z coordinate. The feature extraction determiner 11 uses a triangulation procedure based on inferring the location of a 3D point given its perspective projections on the left and right stereo image frames 14 and 15. For example, given the X and Y coordinates of a feature point in the left and right frames 14 and 15 (F_L and F_R), the 3D surface (i.e., Z or depth information) can be determined by the equation

[0035] [Equation 1] Z = (f × b) / |F_L - F_R|

where the distance f (shown in FIG. 5) is the focal length of the cameras 60 and 61, the distance b (shown in FIG. 5) is the baseline distance between the cameras 60 and 61, and |F_L - F_R| is the disparity calculated as described above.
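By way of illustration only, the depth calculation might be sketched as follows, assuming the standard parallel-camera triangulation relation Z = f × b / |F_L - F_R| that follows from the definitions of f, b, and the disparity given above. The function name and the numeric values are hypothetical. The focal length is assumed to be expressed in pixels, so that Z comes out in the same unit as the baseline.

```python
def depth_from_disparity(focal_length: float, baseline: float,
                         x_left: float, x_right: float) -> float:
    """Depth Z of a feature point from its horizontal positions in the
    left and right frames, assuming Z = f * b / |F_L - F_R|."""
    disparity = abs(x_left - x_right)
    if disparity == 0:
        # Zero disparity corresponds to a point at unbounded distance.
        raise ValueError("zero disparity: depth is unbounded")
    return focal_length * baseline / disparity


# Hypothetical values: focal length 600 pixels, baseline 15.24 cm (6 inches),
# feature seen at column 320 in the left frame and column 310 in the right.
z = depth_from_disparity(focal_length=600.0, baseline=15.24,
                         x_left=320.0, x_right=310.0)
```

Because depth is inversely proportional to disparity, nearby foreground features (large disparity) are resolved more finely than distant background features.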

[0036] In this embodiment, the above equation provides the relationship between the disparity and the surface Z under certain geometric conditions. In particular, the image plane in front of each camera is at the focal length f, and both cameras are oriented identically, with the X axis of the camera reference frame pointing along the line defined by the positions of the cameras 60 and 61. The focal lengths of the cameras 60 and 61 are assumed to be the same. It is also assumed that any geometric distortion of the lenses of the cameras 60 and 61 has been compensated. Other geometric arrangements may be used, but the relationship between the disparity and the surface Z becomes more complicated.

[0037] The other vertices of the face model 100 shown in FIG. 1 can be interpolated or extrapolated based on the positions from the feature extraction determiner 11 (i.e., the vertex information for the facial features) and the determinations from the feature correspondence matching unit 13. The interpolation may be linear, non-linear, or based on a scaled model or function. For example, a vertex between two other known vertices may be determined using a predetermined parabolic function that all three vertices satisfy. Other face models having additional vertices may be used to provide enhanced or improved modeling results.
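By way of illustration only, one possible reading of the parabolic interpolation example is sketched below. It is assumed here that "predetermined" means the curvature of the parabola is fixed in advance, so that the two known vertices determine the remaining coefficients. The function name, the one-dimensional (x, z) simplification, and the numeric values are hypothetical.

```python
def parabola_through(p0, p1, curvature):
    """Return z(x) = a*x**2 + b*x + c with a = curvature (fixed in advance)
    that passes through the two known points p0 = (x0, z0), p1 = (x1, z1)."""
    (x0, z0), (x1, z1) = p0, p1
    if x0 == x1:
        raise ValueError("the two known vertices must have distinct x values")
    a = curvature
    # Subtracting the two point constraints eliminates c and yields b.
    b = (z1 - z0 - a * (x1 ** 2 - x0 ** 2)) / (x1 - x0)
    c = z0 - a * x0 ** 2 - b * x0
    return lambda x: a * x * x + b * x + c


# Two known vertices at x = 0 and x = 2 with depth 0, and a hypothetical
# preset curvature of -1; the intermediate vertex lies at x = 1.
z_of = parabola_through((0.0, 0.0), (2.0, 0.0), curvature=-1.0)
z_mid = z_of(1.0)
```

The same function also extrapolates: evaluating `z_of` outside the interval between the two known vertices gives a depth estimate beyond them.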

[0038] The face model 100 shown in FIG. 1 is a generic face with a neutral expression. Control of the face model 100 is scalable. A template of the face model 100 may be stored or loaded at the remote site before any communication is initiated. Using the extracted facial features, the vertices of the polygons can be adjusted to more closely match a particular person's face. In particular, based on the information and processing performed by the feature correspondence matching unit 13 and the feature extraction determiner 11, the template of the face model 100 is adapted and animated to enable movement, expressions, and audio synchronization (i.e., speech). In essence, the generic face model 100 is dynamically transformed in real time into a particular face. Real-time or non-real-time transmission of the model face parameters/data provides low bit-rate animation of the synthetic face model. The data rate is preferably 64 kbit/s or less, although data rates of 64 kbit/s to 4 Mbit/s are also acceptable for moving images.

[0039] In another embodiment, the temporal information 12 is used to verify the feature matching results from the feature correspondence matching unit 13 and/or to perform an alternative feature matching process. In this embodiment, for example, matching is performed by the feature correspondence matching unit 13 only on selected frames, preferably "key" frames (in the MPEG scheme). Once a key frame has been feature matched, the feature (i.e., depth) correspondences in other non-key frames (or in other key frames) can be determined by tracking the corresponding feature points over time. Given an initial feature correspondence, 3D motion can be determined, up to a scale in one translation direction, from two views (i.e., the temporal information 12 may consist of two sequential or consecutive left or right frames). The feature correspondence matching unit 13 is used to feature match other key frames periodically in order to eliminate any accumulated error from the temporal feature matching. The feature correspondence matching unit 13 may be configured to perform both the feature correspondence matching and the temporal feature matching as needed. Temporal feature matching can be performed faster than feature correspondence matching, which is advantageous for real-time processing.
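By way of illustration only, the alternation between the two matching modes might be sketched as follows. The fixed key-frame interval, the function name, and the mode labels are assumptions made for the sketch; the text says only that key frames are feature matched periodically.

```python
def plan_matching(frame_count: int, key_interval: int):
    """Assign a matching mode to each frame index.

    'stereo'   : feature correspondence matching on a key frame
                 (left/right disparity), which resets accumulated error.
    'temporal' : faster matching by tracking feature points from the
                 preceding frame.
    Assumption: key frames occur at a fixed interval."""
    if key_interval < 1:
        raise ValueError("key_interval must be at least 1")
    return [
        "stereo" if index % key_interval == 0 else "temporal"
        for index in range(frame_count)
    ]


# Hypothetical sequence of five frames with a key frame every third frame.
schedule = plan_matching(frame_count=5, key_interval=3)
```

A shorter interval bounds the tracking drift more tightly at the cost of running the slower stereo matching more often.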

[0040] The invention has numerous applications in fields such as video conferencing and animation/simulation of real objects, or in any application where object modeling is required. Typical applications include, for example, video games, multimedia creation, and improved navigation over the Internet.

[0041] Furthermore, the invention is not limited to 3D face models. It may be used with models of other physical objects and scenes, such as 3D models of automobiles and rooms. In such an embodiment, the feature extraction determiner 11 gathers position information related to the particular object or scene in question, e.g., the positions of wheels or the locations of furniture. Further processing is then based on this information.

[0042] While the present invention has been described above in terms of specific embodiments, it is to be understood that the invention is not intended to be confined or limited to the embodiments disclosed herein. For example, the invention is not limited to any specific type of filtering or mathematical transformation, or to any particular input image scale or orientation. On the contrary, the present invention is intended to cover the various structures and modifications thereof included within the spirit and scope of the appended claims.

[Brief description of drawings]

[FIG. 1] A schematic front view of a human face model used for three-dimensional model-based coding.

[FIG. 2] A block diagram of a 3D modeling system according to one aspect of the present invention.

[FIG. 3] A block diagram of an exemplary computer system capable of supporting the system of FIG. 2.

[FIG. 4] A block diagram showing the architecture of the computer system of FIG. 3.

[FIG. 5] A block diagram showing an exemplary arrangement according to a preferred embodiment of the present invention.

Continuation of front page: (51) Int.Cl.7, identification code, FI, theme code (reference): H04N 7/24; H04N 7/13 Z. (72) Inventor: シャラパリ，キラン, Prof. Holstlaan 6, 5656 AA Eindhoven, the Netherlands. F-terms (reference): 5B057 BA13 CA13 CA16 CB13 CB16 CG01 CH14 DA07 DB03 DC05 DC32; 5C059 MA00 MA05 MB01 MB08 MB12 MB18 PP04 SS07 SS08 SS26 UA02 UA05 UA33; 5C061 AA20 AB08; 5L096 AA09 CA05 FA69 HA07 JA18

Claims

1. An image processing device comprising: at least one feature extraction determiner configured to extract feature position information from a pair of input image signals; and a matching unit, coupled to the feature extraction determiner, arranged to match corresponding features in the input image signals according to the feature position information and disparity information.

2. The image processing apparatus according to claim 1, wherein the matching unit outputs three-dimensional information related to the input image.

3. The image processing apparatus according to claim 2, wherein the 3D surface information is based on a predetermined model.

4. The image processing apparatus according to claim 3, wherein the predetermined model is a human face model.

5. The image processing apparatus according to claim 1, wherein the matching unit performs matching on at least one frame of the input image signal and performs temporal feature matching on at least one other frame.

6. The image processing apparatus according to claim 5, wherein the temporal feature matching is performed using sequential input image frames.

7. The image processing apparatus according to claim 6, wherein one of the frames is a key frame.

8. A method of determining parameters related to a 3D model, comprising: extracting feature position information associated with a pair of input images; matching corresponding features in the pair of input images according to the extracted feature position information and disparity information; and determining parameters for the 3D model according to a result of the corresponding matching of the features.

9. The method of claim 8, wherein the 3D model is a human face model.

10. A method of encoding an object in a digital image for transmission, the method comprising: extracting feature position information associated with at least one pair of digital images; matching corresponding features in the pair of digital images; and encoding information for transmission according to a result of the corresponding matching of the features.

11. The method of claim 10, wherein the information for transmission comprises parameters related to a 3D model.