JP2002525735A

JP2002525735A - Tracking semantic objects in vector image sequences

Info

Publication number: JP2002525735A
Application number: JP2000570977A
Authority: JP
Inventors: Chuang Gu; Ming-Chieh Lee
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 1998-09-10
Filing date: 1999-09-10
Publication date: 2002-08-13
Anticipated expiration: 2019-09-10
Also published as: US7162055B2; ATE286337T1; EP1519589A3; JP4074062B2; EP1112661A1; US20040189863A1; US7088845B2; US20050240629A1; WO2000016563A1; EP1519589A2; EP1112661B1; DE69922973T2; US6711278B1; DE69922973D1

Abstract

(57) [Abstract] A semantic object tracking method tracks general semantic objects with multiple rigid motions, disconnected components, and multiple colors throughout a vector image sequence. The method accurately tracks these general semantic objects by spatially segmenting image regions from the current frame and then classifying each region according to the semantic object of the previous frame from which it originated. To classify each region, the method performs region-based motion estimation between each spatially segmented region and the previous frame, computing the region's predicted position in the previous frame. It then classifies each region of the current frame as part of a semantic object based on which semantic object of the previous frame contains the most overlapping points of the predicted region. In this way, each current region is tracked to exactly one semantic object from the previous frame, with no gaps or overlaps. Because the method projects regions into a frame in which the semantic object boundaries have already been computed, rather than projecting and adjusting boundaries in a frame where the object boundaries are unknown, it propagates little or no error.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] (Field of the Invention) The present invention relates to the analysis of video data and, more particularly, to a method for tracking meaningful entities, called semantic objects, as they move through a vector image sequence such as a video sequence.

[0002] (Background of the Invention) A semantic video object represents a meaningful entity in a digital video clip, such as a ball, car, airplane, building, cell, eye, lip, hand, or head. The term "semantic" in this context means that the viewer of the video clip attaches some meaning to the object. For example, each of the objects listed above represents some real-world entity, and the viewer associates the portion of the screen corresponding to that entity with the meaningful object it depicts. Semantic video objects are very useful in a variety of new digital video applications, including content-based communication, multimedia signal processing, digital video libraries, digital movie studios, and computer vision and pattern recognition. To use semantic video objects in these applications, object segmentation and tracking methods must identify the objects in each video frame.

[0003] The process of segmenting a video object generally refers to an automated or semi-automated method of extracting objects of interest from image data. Extracting a semantic video object from a video clip has remained a challenging task for many years. In a typical video clip, semantic objects may include disconnected components, different colors, and multiple rigid/non-rigid motions. Although semantic objects are easy for viewers to recognize, the wide variety of their shapes, colors, and motions makes it difficult to automate this process on a computer. Satisfactory results can be achieved by having the user draw an initial outline of a semantic object in an initial frame and then using that outline to compute the pixels that are part of the object in that frame. In each successive frame, motion estimation can be used to predict the initial boundary of the object based on the segmented object from the previous frame. This semi-automatic object segmentation and tracking method is described in co-pending U.S. patent application Ser. No. 09/054,280 by Chuang Gu and Ming Chieh Lee, entitled "Semantic Video Object Segmentation and Tracking," which is incorporated herein by reference.

[0004] Object tracking is the process of computing an object's position as it moves from frame to frame. To handle more general semantic video objects, an object tracking method must be able to handle objects that contain disconnected components and multiple non-rigid motions. Although a great deal of research has focused on object tracking, existing methods still do not accurately track objects that have multiple components with non-rigid motion.

[0005] Some tracking techniques use homogeneous gray scale/color as a criterion to track regions. See F. Meyer and P. Bouthemy, "Region-based tracking in an image sequence," ECCV'92, pp. 476-484, Santa Margherita, Italy, May 1992; Ph. Salembier, L. Torres, F. Meyer, and C. Gu, "Region-based video coding using mathematical morphology," Proceedings of the IEEE, Vol. 83, No. 6, pp. 843-857, June 1995; F. Marques and Cristina Molina, "Object tracking for content-based functionalities," VCIP'97, Vol. 3024, No. 1, pp. 190-199, San Jose, February 1997; and C. Toklu, A. Tekalp, and A. Erdem, "Simultaneous alpha map generation and 2-D mesh tracking for multimedia applications," ICIP'97, Vol. I, pp. 113-116, Santa Barbara, October 1997.

[0006] Others use homogeneous motion information to track moving objects. See, for example, J. Wang and E. Adelson, "Representing moving images with layers," IEEE Trans. on Image Processing, Vol. 3, No. 5, pp. 625-638, September 1994; and N. Brady and N. O'Connor, "Object detection and tracking using an EM-based motion estimation and segmentation framework," ICIP'96, Vol. I, pp. 925-928, Lausanne, Switzerland, September 1996.

[0007] Still others use a combination of spatial and temporal criteria to track objects. See M. J. Black, "Combining intensity and motion for incremental segmentation and tracking over long image sequences," ECCV'92, pp. 485-493, Santa Margherita, Italy, May 1992; C. Gu, T. Ebrahimi, and M. Kunt, "Morphological moving object segmentation and tracking for content-based video coding," Multimedia Communication and Video Coding, pp. 233-240, Plenum Press, New York, 1995; F. Moscheni, F. Dufaux, and M. Kunt, "Object tracking based on temporal and spatial information," Proc. ICASSP'96, Vol. 4, pp. 1914-1917, Atlanta, GA, May 1996; and C. Gu and M. C. Lee, "Semantic video object segmentation and tracking using mathematical morphology and perspective motion model," ICIP'97, Vol. II, pp. 514-517, Santa Barbara, October 1997.

[0008] Most of these techniques employ a forward tracking mechanism that projects the previous regions/objects into the current frame and then somehow assembles and adjusts the projected regions/objects in the current frame. The major drawback of these forward techniques lies in the difficulty of assembling and adjusting the projected regions in the current frame and of dealing with multiple non-rigid motions. In many of these cases, uncertain holes may appear or the resulting boundaries may become distorted.

[0009] FIGS. 1A-C provide simple examples of semantic video objects that illustrate the difficulties associated with object tracking. FIG. 1A shows a semantic video object of a building 100 containing multiple colors 102, 104. Methods that assume objects have a homogeneous color do not track these types of objects well. FIG. 1B shows the same building object as FIG. 1A, except that it is split into disconnected components 106, 108 by a tree 110 that partially occludes it. Methods that assume objects are formed of connected groups of pixels do not track these types of disconnected objects well. Finally, FIG. 1C shows a simple semantic video object depicting a person 112. Even this simple object has multiple components 114, 116, 118, 120 with different motions. Methods that assume an object has homogeneous motion do not track these types of objects well. In general, a semantic video object may have disconnected components, multiple colors, multiple motions, and arbitrary shapes.

[0010] In addition to dealing with these attributes of general semantic video objects, a tracking method must achieve an acceptable level of accuracy so that errors do not propagate from frame to frame. Because object tracking methods typically partition each frame based on the partition of the previous frame, errors in the previous frame tend to propagate to the next frame. Unless the tracking method computes an object's boundary with pixel-wise accuracy, significant errors are likely to propagate to the next frame. As a result, the object boundaries computed for each frame are imprecise, and the object can be lost after several frames of tracking.

[0011] (Summary of the Invention) The invention provides a method for tracking semantic objects in vector image sequences. It is particularly well suited to tracking semantic video objects in digital video clips, but it can also be used for a variety of other vector image sequences. The method is implemented in software program modules, but it can also be implemented in digital hardware logic or in a combination of hardware and software components.

[0012] The method tracks semantic objects in an image sequence by segmenting regions from a frame and then projecting the segmented regions into a target frame in which the boundary of one or more semantic objects is already known. Each projected region is classified as forming part of a semantic object by determining the extent to which it overlaps a semantic object in the target frame. In a typical application, for example, the tracking method repeats this for each frame, classifying regions by projecting them into the previous frame, in which the semantic object boundaries have already been computed.

[0013] The tracking method assumes that semantic objects have already been identified in the initial frame. To obtain the initial boundaries of a semantic object, a semantic object segmentation method can be used to identify the object's boundary in the initial frame.

[0014] After the initial frame, the tracking method operates on the segmentation results of the previous frame and on the current and previous image frames. For each frame in the sequence, a region extractor segments homogeneous regions from the frame. A motion estimator then performs region-based matching for each of these regions to identify the most closely matching region of image values in the previous frame. Using the motion parameters derived from this step, the segmented regions are projected into the previous frame, where the segmentation boundaries have already been computed. A region classifier then classifies the regions as being part of semantic objects in the current frame based on the extent to which the projected regions overlap semantic objects in the previous frame.

[0015] The above approach is particularly suited to operating on an ordered sequence of frames. In these types of applications, the segmentation results of the previous frame are used to classify the regions extracted from the next frame. However, the approach can also be used to track semantic objects between an input frame and any other target frame in which the semantic object boundaries are known.

[0016] One implementation of the method employs a unique spatial segmentation method. In particular, this spatial segmentation method is a region growing process in which image points are added to a region as long as the difference between the minimum and maximum image values of the points in the region stays below a threshold. The method is implemented as a sequential segmentation method: it starts with a first region at one starting point and forms further regions one after another using the same test to identify homogeneous groups of image points.
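The region growing test described in this paragraph can be illustrated with a minimal Python sketch. It is not the implementation from the patent: it assumes a 2-D frame of scalar (gray-scale) values, 4-connected neighbors, and seed points taken in raster order, and the names `grow_regions` and `threshold` are hypothetical.

```python
from collections import deque

def grow_regions(image, threshold):
    """Label connected regions whose (max - min) image value stays below threshold.

    image: list of rows of scalar values (assumed gray-scale frame).
    Returns a label map of the same shape; labels start at 0.
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for sy in range(height):
        for sx in range(width):
            if labels[sy][sx] != -1:
                continue  # point already belongs to an earlier region
            # Start a new region at the first unlabeled point (raster order).
            lo = hi = image[sy][sx]
            labels[sy][sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    if labels[ny][nx] != -1:
                        continue
                    value = image[ny][nx]
                    new_lo, new_hi = min(lo, value), max(hi, value)
                    # Add the point only if the region stays homogeneous.
                    if new_hi - new_lo < threshold:
                        labels[ny][nx] = next_label
                        lo, hi = new_lo, new_hi
                        queue.append((ny, nx))
            next_label += 1
    return labels
```

Because the test compares the running minimum and maximum of the whole region rather than neighboring pixel pairs, a slow gradient cannot drift a region across an edge, which is one plausible reason for choosing this criterion.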

[0017] Implementations of the method include other features that improve the accuracy of tracking. For example, the tracking method preferably includes region-based pre-processing, which removes image errors without blurring object boundaries, and post-processing of the computed semantic object boundaries. The computed boundary of an object is formed from the individual regions classified as being associated with the same semantic object in the target frame. In one implementation, a post-processor smooths the boundary of a semantic object using a majority operator filter. For each point in the frame, this filter examines the neighboring image points and determines the semantic object that contains the largest number of them. It then assigns the point to that semantic object.
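A majority operator of this kind might look like the following Python sketch. The square window, the `radius` parameter, the tie handling, and the name `majority_filter` are assumptions made for illustration; the text above does not specify the neighborhood shape or size.

```python
from collections import Counter

def majority_filter(labels, radius=1):
    """Reassign each point to the object label most common in its neighborhood.

    labels: 2-D list of semantic object labels (one label per image point).
    radius: half-width of the square window (assumed; includes the point itself).
    """
    height, width = len(labels), len(labels[0])
    result = [row[:] for row in labels]
    for y in range(height):
        for x in range(width):
            counts = Counter()
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    counts[labels[ny][nx]] += 1
            best = max(counts.values())
            # Change the label only when another label is strictly more frequent.
            if counts[labels[y][x]] < best:
                result[y][x] = max(counts, key=counts.get)
    return result
```

The filter reads from the original label map and writes to a copy, so the result does not depend on the order in which points are visited.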

[0018] Further advantages and features of the invention will become apparent from the following detailed description and the accompanying drawings.

[0019] (Detailed Description) Overview of the Semantic Object Tracking System. The following sections describe a semantic object tracking method. The method assumes that the semantic object for the initial frame (I-frame) is already known. The goal of the method is to find the semantic partition image in the current frame based on information from the previous semantic partition image and the previous frame.

[0020] A fundamental observation about the semantic partition image is that its boundaries are located at the physical edges of meaningful entities. A physical edge is the position between two connected points at which the image values (e.g., a color intensity triplet, a gray scale value, a motion vector) differ significantly. The tracking method takes advantage of this observation and solves the semantic video object tracking problem with a divide-and-conquer strategy.

[0021] First, the tracking method finds the physical edges in the current frame. This is realized with a segmentation method, in particular a spatial segmentation method. The goal of this segmentation method is to extract all the connected regions with homogeneous image values (e.g., color intensity triplets, gray scale values) in the current frame. Second, the tracking method classifies each extracted region in the current frame to determine which object in the previous frame it belongs to. This classification analysis is a region-based classification problem. Once the region-based classification problem is solved, the semantic video object in the current frame has been extracted and tracked.

[0022] FIG. 2 is a diagram illustrating the semantic video object tracking system. The tracking system comprises the following five modules:
1. region pre-processing 220;
2. region extraction 222;
3. region-based motion estimation 224;
4. region-based classification 226; and
5. region post-processing 228.

[0023] FIG. 2 uses the following notation:
I_i: input image for frame i;
S_i: spatial segmentation result for frame i;
M_i: motion parameters for frame i; and
T_i: tracking result for frame i.

[0024] The tracking method assumes that the semantic video object for the initial frame I_0 is already known. Starting with the initial frame, a segmentation process determines an initial partition defining the boundaries of the semantic objects in the frame. In FIG. 2, the I-segmentation block 210 represents a program for segmenting a semantic video object. This program takes the initial frame I_0 and computes the boundary of a semantic object. Typically, this boundary is represented as a binary or alpha mask. A variety of segmentation approaches may be used to find the semantic object for the first frame.

[0025] As described in co-pending U.S. patent application Ser. No. 09/054,280 by Gu and Lee, one approach is to provide a drawing tool that enables a user to draw a border around the inside and outside of a semantic video object's boundary. This user-drawn border then serves as a starting point for an automated method that snaps the computed boundary to the edges of the semantic video object. In applications involving more than one semantic video object of interest, the I-segmentation process 210 computes a partition image, such as a mask, for each one.

[0026] The post-processing block 212 used in the initial frame is a process for smoothing the initial partition image and removing errors. This process is the same as, or similar to, the post-processing used to process the results of tracking the semantic video object in the subsequent frames I_1, I_2.

[0027] The input to the tracking process starting in the next frame (I_1) includes the previous frame I_0 and the previous frame's segmentation result T_0. Dashed lines 216 separate the processing for each frame. Dashed line 214 separates the processing for the initial frame from that for the next frame, while dashed line 216 separates the processing for subsequent frames during semantic video object tracking.

[0028] Semantic video object tracking begins with frame I_1. The first step is to simplify the input frame I_1. In FIG. 2, simplification block 220 represents a region pre-processing step used to simplify the input frame I_1 before further analysis. In many cases, the input data contains noise that may adversely affect the tracking results. Region pre-processing removes this noise and ensures that further semantic object tracking is carried out on clean input data.

[0029] The simplification block 220 provides a cleaned result that enables the segmentation method to extract regions of connected pixels more accurately. In FIG. 2, the segmentation block 222 represents a spatial segmentation method for extracting connected regions with homogeneous image values in the input frame.

[0030] For each region, the tracking system determines whether the connected region originates from the previous semantic video object. When the tracking phase is complete for the current frame, the boundary of the semantic video object in the current frame is constructed from the boundaries of these connected regions. Therefore, the spatial segmentation should provide a dependable segmentation result for the current frame: no region should be missing, and no region should contain any area that does not belong to it.

[0031] The first step in determining whether a connected region belongs to the semantic video object is to match the connected region with a corresponding region in the previous frame. As shown in FIG. 2, motion estimation block 224 takes the connected regions and the current and previous frames as input and finds, for each region in the current frame, the corresponding region in the previous frame that most closely matches it. For each region, motion estimation block 224 provides motion information that predicts where the region comes from in the previous frame. This motion information indicates the location of each region's ancestor in the previous frame. Later, this location information is used to decide whether the current region belongs to the semantic video object.
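To make the idea of region-based matching concrete, here is a minimal Python sketch. It assumes a purely translational motion model, an exhaustive search over a small window, and a sum-of-absolute-differences error on scalar image values; none of these choices is stated in the paragraph above, and the name `estimate_region_motion` and the `search` parameter are hypothetical.

```python
def estimate_region_motion(current, previous, region, search=2):
    """Find the translation (dy, dx) that best maps a region into the previous frame.

    current, previous: 2-D lists of scalar image values with the same shape.
    region: list of (y, x) points of one segmented region in the current frame.
    search: maximum displacement tried in each direction (assumed parameter).
    """
    height, width = len(previous), len(previous[0])
    best_shift, best_error = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            error = 0
            inside = True
            for y, x in region:
                py, px = y + dy, x + dx
                if not (0 <= py < height and 0 <= px < width):
                    inside = False  # skip shifts that push the region off the frame
                    break
                error += abs(current[y][x] - previous[py][px])
            if inside and error < best_error:
                best_shift, best_error = (dy, dx), error
    return best_shift
```

The returned shift points backward in time: adding it to a current-frame point gives the predicted position of that point's ancestor in the previous frame.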

[0032] Next, the tracking system classifies each region as to whether it originates from the semantic video object. In FIG. 2, classification block 226 identifies the semantic object in the previous frame from which each region is likely to originate. The classification process uses the motion information for each region to predict where the region came from in the previous frame. By comparing the predicted region with the segmentation result of the previous frame, the classification process determines the extent to which the predicted region overlaps a semantic object or objects already computed for the previous frame. The result of the classification process associates each region in the current frame with a semantic video object or with the background. A tracked semantic video object in the current frame comprises the union of all the regions linked with the corresponding semantic video object in the previous frame.
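The overlap-based classification can be sketched in Python as follows. The sketch assumes that each region's motion is a single translation `(dy, dx)`, that the previous tracking result is a 2-D label map with 0 standing for the background, and that the winner is chosen by simple point counting; the function names are hypothetical.

```python
from collections import Counter

def classify_region(region, shift, previous_objects):
    """Assign a current-frame region to the previous-frame object it overlaps most.

    region: list of (y, x) points in the current frame.
    shift: (dy, dx) translation projecting the region into the previous frame.
    previous_objects: 2-D list of object labels for the previous frame (0 = background).
    """
    height, width = len(previous_objects), len(previous_objects[0])
    dy, dx = shift
    votes = Counter()
    for y, x in region:
        py, px = y + dy, x + dx
        if 0 <= py < height and 0 <= px < width:
            votes[previous_objects[py][px]] += 1
    if not votes:
        return 0  # projected entirely outside the frame: treat as background
    return votes.most_common(1)[0][0]

def track_frame(regions, shifts, previous_objects):
    """Build the current-frame object map as the union of classified regions."""
    height, width = len(previous_objects), len(previous_objects[0])
    current_objects = [[0] * width for _ in range(height)]
    for region, shift in zip(regions, shifts):
        label = classify_region(region, shift, previous_objects)
        for y, x in region:
            current_objects[y][x] = label
    return current_objects
```

Because every region receives exactly one label and the regions tile the frame, the resulting object map has neither gaps nor overlaps.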

[0033] Finally, the tracking system post-processes the linked regions for each object. In FIG. 2, post-processing block 228 fine-tunes the obtained boundaries of each semantic video object in the current frame. This process removes errors introduced in the classification procedure and smooths the boundaries to improve the visual effect.

[0034] For each subsequent frame, the tracking system repeats the same steps in an automated fashion, using the previous frame, the tracking result of the previous frame, and the current frame as input. FIG. 2 shows an example of the processing steps repeated for frame I_2. Blocks 240-248 represent the tracking system steps applied to the next frame.
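The per-frame loop described above can be summarized by a small Python driver. This is only a structural sketch: the five modules are passed in as caller-supplied callables, their signatures are assumptions, and it is also assumed that motion estimation works on the simplified current frame and the unmodified previous frame.

```python
def track_sequence(frames, initial_result, simplify, extract_regions,
                   estimate_motion, classify, postprocess):
    """Run the five tracking steps frame by frame.

    frames: list of input images I_0, I_1, ...
    initial_result: tracking result T_0 for the initial frame I_0.
    The remaining arguments are callables, one per module (hypothetical signatures).
    Returns the list of tracking results [T_0, T_1, ...].
    """
    results = [initial_result]
    for i in range(1, len(frames)):
        previous_frame, previous_result = frames[i - 1], results[i - 1]
        simplified = simplify(frames[i])            # region pre-processing
        regions = extract_regions(simplified)       # spatial segmentation S_i
        motion = estimate_motion(regions, simplified, previous_frame)  # M_i
        labels = classify(regions, motion, previous_result)  # backward classification
        results.append(postprocess(labels))         # tracking result T_i
    return results
```

Each result T_i depends only on I_(i-1), T_(i-1), and I_i, which mirrors the inputs listed in the paragraph above.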

[0035] Unlike other region and object tracking systems, which employ various forward tracking mechanisms, the tracking system shown in FIG. 2 performs backward tracking. The backward region-based classification approach has the advantage that, as a result of spatial segmentation, the final semantic video object boundaries are always positioned at the physical edges of meaningful entities. Also, because each region is treated individually, the tracking system can easily deal with disconnected semantic video objects and non-rigid motions.

[0036] Definitions. Before describing an implementation of the tracking system, it is helpful to begin with a series of definitions used throughout the rest of the description. These definitions help show that the tracking method applies not only to sequences of color video frames but also to other temporal sequences of multi-dimensional image data. In this context, "multi-dimensional" refers to the spatial coordinates of each discrete image point as well as the image value at that point. A temporal sequence of image data can be referred to as a "vector image sequence" because it consists of successive frames of multi-dimensional data arrays. As examples of vector image sequences, consider those listed in Table 1 below.

[0037]

[Table 1]

[0038] The dimension n refers to the number of dimensions in the spatial coordinates of an image sample. The dimension m refers to the number of dimensions of the image value located at the spatial coordinates of the image sample. For example, the spatial coordinates of a color volume image sequence include three spatial coordinates defining the location of an image sample in three-dimensional space, so n = 3. Each sample in the color volume image has three color values, R, G, and B, so m = 3.

[0039] The following definitions use set and graph theory notation to provide a foundation for describing the tracking system in the context of vector image sequences.

[0040] Definition 1, connected points: Let S be an n-dimensional set, so that a point p ∈ S ⇒ p = (p_1, ..., p_n). ∀p, q ∈ S, p and q are connected if and only if their distance D_{p,q} is equal to one.

[0041]

[Equation 1]

[0042] Definition 2, connected path: Let P (P ⊆ S) be a path consisting of m points p_1, ..., p_m. The path P is connected if and only if p_k and p_{k+1} (k ∈ {1, ..., m−1}) are connected points.

[0043] Definition 3, neighborhood point: Let R (R ⊆ S) be a region. A point

[0044]

[Equation 2]

[0045] is a neighborhood point of region R if and only if there exists another point q (q ∈ R) such that p and q are connected points.

[0046] Definition 4, connected region: Let R (R ⊆ S) be a region. R is a connected region if and only if ∀x, y ∈ R, ∃ a connected path P (P = {p_1, ..., p_m}) such that p_1 = x and p_m = y.
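Definitions 1, 2, and 4 can be sketched in a few lines of Python. This is a minimal sketch under a loudly stated assumption: Equation 1 is not reproduced in this text, so the distance D_{p,q} is taken here to be the city-block distance on integer coordinates. All function names are hypothetical.

```python
from collections import deque

def distance(p, q):
    # ASSUMPTION: city-block distance; Equation 1 is not available in this text.
    return sum(abs(a - b) for a, b in zip(p, q))

def are_connected(p, q):
    """Definition 1: two points are connected iff their distance is one."""
    return distance(p, q) == 1

def is_connected_path(path):
    """Definition 2: every consecutive pair of points on the path is connected."""
    return all(are_connected(a, b) for a, b in zip(path, path[1:]))

def is_connected_region(region):
    """Definition 4: any two points are joined by a connected path inside the region."""
    region = set(region)
    if not region:
        return True  # convention chosen for this sketch
    start = next(iter(region))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over connected points
        p = queue.popleft()
        for q in region:
            if q not in seen and are_connected(p, q):
                seen.add(q)
                queue.append(q)
    return seen == region
```

Under the city-block assumption, diagonal pixels are not connected, so a region made of two diagonally touching pixels counts as disconnected.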

[0047] Definition 5, partition image: A partition image P is a mapping P: S → T, where T is a complete ordered lattice. Let R_P(x) be the region containing a point x: R_P(x) = ∪_{y∈S} {y | P(x) = P(y)}. A partition image must satisfy the following conditions: ∀x, y ∈ S, R_P(x) = R_P(y) or R_P(x) ∩ R_P(y) = ∅; ∪_{x∈S} R_P(x) = S.

[0048] Definition 6, connected partition image: A connected partition image is a partition image P in which ∀x ∈ S, R_P(x) is always connected.

[0049] Definition 7, fine partition: If a partition image P is finer than another partition image P′ on S, this means ∀x ∈ S, R_P(x) ⊆ R_{P′}(x).

[0050] Definition 8, coarse partition: If a partition image P is coarser than another partition image P′ on S, this means ∀x ∈ S, R_P(x) ⊇ R_{P′}(x).

[0051] There are two extreme cases for a partition image. One is the "coarsest partition", which covers the whole of S: ∀x, y ∈ S, R_P(x) = R_P(y). The other is the "finest partition", in which each point of S is an individual region: ∀x, y ∈ S, x ≠ y ⇒ R_P(x) ≠ R_P(y).
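The two extreme partitions can be made concrete with a small sketch. It assumes, purely for illustration, that a partition image is stored as a dictionary from points to labels; the helper name `regions` is hypothetical.

```python
def regions(partition):
    """Group the points of a partition image {point: label} into its regions."""
    grouped = {}
    for point, label in partition.items():
        grouped.setdefault(label, set()).add(point)
    return {frozenset(points) for points in grouped.values()}

points = [(x, y) for x in range(2) for y in range(2)]
coarsest = {p: 0 for p in points}               # one region covering all of S
finest = {p: i for i, p in enumerate(points)}   # every point is its own region

print(len(regions(coarsest)))  # 1
print(len(regions(finest)))    # 4
```

Because every point carries exactly one label, the regions returned by `regions` are pairwise disjoint and their union is S, which is the condition required of a partition image in Definition 5.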

[0052] Definition 9, adjacent regions: Two regions R_1 and R_2 are adjacent if and only if ∃x, y (x ∈ R_1 and y ∈ R_2) such that x and y are connected points.

[0053] Definition 10, region adjacency graph: Let P be a partition image on a multi-dimensional set S. There are k regions (R_1, ..., R_k) in P, where S = ∪R_i and i ≠ j ⇒ R_i ∩ R_j = ∅. The region adjacency graph (RAG) consists of a set of vertices V and a set of edges L. Let V = {v_1, ..., v_k}, where each v_i is associated with the corresponding region R_i. The edge set L is {e_1, ..., e_t},

[0054]

[Equation 3]

[0055] where each e_i is built between two vertices if the two corresponding regions are adjacent regions.
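A region adjacency graph as described in Definitions 9 and 10 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: it assumes the partition image is a dictionary from integer-coordinate points to region labels and that "connected" means city-block distance one (Equation 1 is not reproduced in this text). The function name is hypothetical.

```python
def region_adjacency_graph(partition):
    """Build (V, L) from a partition image given as {point: region_label}.

    ASSUMPTION: points are integer tuples and two points are connected
    when they differ by exactly one step along a single axis.
    """
    vertices = set(partition.values())
    edges = set()
    for p, label in partition.items():
        # Checking only the +1 step per axis visits every connected pair once.
        for axis in range(len(p)):
            q = p[:axis] + (p[axis] + 1,) + p[axis + 1:]
            if q in partition and partition[q] != label:
                edges.add(frozenset((label, partition[q])))
    return vertices, edges

# A 3x2 partition image with three regions:
#   A A B
#   C C B
partition = {(0, 0): "A", (1, 0): "A", (2, 0): "B",
             (0, 1): "C", (1, 1): "C", (2, 1): "B"}
vertices, edges = region_adjacency_graph(partition)
print(sorted(vertices))  # ['A', 'B', 'C']
print(len(edges))        # 3
```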

[0056] FIGS. 3A-C illustrate examples of different types of partition images, and FIG. 3D shows an example of a region adjacency graph based on these partition images. In these examples, S is a set in a two-dimensional image. The white areas 300-308, hatched areas 310-314, and dotted area 316 represent different regions in a two-dimensional image frame. FIG. 3A shows a partition image having two disconnected regions (white areas 300 and 302). FIG. 3B shows a connected partition image having two connected regions (white area 304 and hatched area 312). FIG. 3C shows a finer partition image as compared to FIG. 3A in that the hatched area of FIG. 3A comprises two regions: hatched area 314 and dotted area 316. FIG. 3D shows the corresponding region adjacency graph of the partition image in FIG. 3C. The vertices 320, 322, 324, and 326 in the graph correspond to regions 306, 314, 316, and 308, respectively. The edges 330, 332, 334, 336, and 338 connect the vertices of adjacent regions.

[0057] Definition 11, vector image sequence: Given the product

[0058]

[Equation 4]

[0059] of m (m ≥ 1) totally ordered complete lattices L_1, ..., L_m, a vector image sequence is a sequence of mappings I_t: S → L, where S is an n-dimensional set and t is in the time domain.

[0060] Several types of vector image sequences are shown in Table 1. These vector image sequences can be obtained either from a series of sensors, such as color images, or from a computed parameter space, such as dense motion fields. Although the physical meaning of the input signals varies from case to case, all of them can be regarded as vector image sequences without exception.

[0061] Definition 12, semantic video objects: Let I be a vector image on an n-dimensional set S. Let P be a semantic partition image of I. S = ∪_{i=1,...,m} O_i, where each O_i indicates the location of a semantic video object.

[0062] Definition 13, semantic video object segmentation: Let I be a vector image on an n-dimensional set S. Semantic video object segmentation is to find the number of objects m and the location of each object O_i,

[0063] i = 1, ..., m, where S = ∪_{i=1,...,m} O_i.

[0064] Definition 14, semantic video object tracking: Let I_{t−1} be a vector image on an n-dimensional set S, and let P_{t−1} be the corresponding semantic partition image at time t−1. S = ∪_{i=1,...,m} O_{t−1,i}. Each O_{t−1,i} (i = 1, ..., m) is a semantic video object at time t−1. Semantic video object tracking in I_t is defined as finding the semantic video objects O_{t,i} at time t, i = 1, ..., m, such that ∀x ∈ O_{t−1,i} and ∀y ∈ O_{t,i}: P_{t−1}(x) = P_t(y).

Example Implementation

[0065] The following sections describe a specific semantic video object tracking method in more detail. FIG. 4 is a block diagram showing the principal components of the implementation described below. Each block in FIG. 4 represents a program module that implements part of the object tracking method outlined above. Depending on various considerations such as cost, performance, and design complexity, each of these modules can also be implemented in digital logic circuitry.

[0066] Using the notation defined above, the tracking method shown in FIG. 4 takes as input the segmentation result of the previous frame at time t−1 and the current vector image I_t. The current vector image is defined in m (m ≥ 1) totally ordered complete lattices L_1, ..., L_m of the product L (see Definition 11) on the n-dimensional set S:

[0067] ∀p, q ∈ S, I_t(p) = {L_1(p), L_2(p), ..., L_m(p)}

[0068] Using this information, the tracking method computes a partition image for each frame in the sequence. The result of the segmentation is a mask identifying the position of each semantic object in each frame. Each mask has an object number identifying which object it corresponds to in each frame.

[0069] For example, consider the color image sequence defined in Table 1. Each point p represents a pixel in a two-dimensional image. The number of points in the set S corresponds to the number of pixels in each image frame. The lattice at each pixel comprises three sample values, corresponding to the red, green, and blue intensity values. The result of the tracking method is a series of two-dimensional masks identifying the position of all of the pixels that form part of the corresponding semantic video object for each frame.

Region Pre-Processing

[0070] The implementation shown in FIG. 4 begins processing a frame by simplifying the input vector image. In particular, a simplifying filter 420 cleans the entire input vector image before further processing. In designing this pre-processing stage, it is preferable to select a simplifying method that does not introduce spurious data. For example, a low-pass filter may clean and smooth an image, but it may also distort the boundaries of a video object. Therefore, it is preferable to select a method that simplifies the input vector image while preserving the boundary positions of the semantic video objects.

[0071] Many non-linear filters, such as median filters or morphological filters, are candidates for this task. The current implementation uses a vector median filter, Median(·), to simplify the input vector image.

[0072] The vector median filter computes the median image value of the neighboring points for each point in the input image and replaces the image value at that point with the median value. For every point p in the n-dimensional set S, a structuring element E is defined around it that contains all of its connected points (see Definition 1 on connected points):

[0073] E = ∪_{q∈S} {D_{p,q} = 1}

[0074] The vector median of a point p is defined as the median of each component within the structuring element E:

[0075] Median(I_t(p)) = {median_{q∈E}{L_1(q)}, ..., median_{q∈E}{L_m(q)}}

[0076] By using such a vector median filter, small variations in the vector image I_t can be removed while the boundaries of video objects are well preserved under the spatial design of the structuring element E. As a result, the tracking process can identify the boundaries of semantic video objects more effectively.
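The component-wise vector median described above can be sketched in Python. This is a minimal sketch, not the patent's code: it assumes a frame stored as a dictionary from integer-coordinate points to value tuples, it takes D_{p,q} to be the city-block distance (Equation 1 is not reproduced here), and it uses `statistics.median_low` so that each output component is an actual sample value. The names `structuring_element` and `vector_median_filter` are hypothetical.

```python
from statistics import median_low

def structuring_element(p, frame):
    """E: points of the frame at distance one from p (ASSUMPTION: city-block)."""
    element = []
    for axis in range(len(p)):
        for step in (-1, 1):
            q = p[:axis] + (p[axis] + step,) + p[axis + 1:]
            if q in frame:
                element.append(q)
    return element

def vector_median_filter(frame):
    """Replace each value by the per-component median over E.

    frame: {point_tuple: value_tuple}; returns a new dict of the same shape.
    """
    filtered = {}
    for p, value in frame.items():
        element = structuring_element(p, frame)
        if not element:              # isolated point: keep its value unchanged
            filtered[p] = value
            continue
        filtered[p] = tuple(
            median_low([frame[q][i] for q in element])
            for i in range(len(value))
        )
    return filtered

# A flat 3x3 color frame with one outlier pixel in the center.
frame = {(x, y): (10, 10, 10) for x in range(3) for y in range(3)}
frame[(1, 1)] = (200, 0, 50)
cleaned = vector_median_filter(frame)
print(cleaned[(1, 1)])  # (10, 10, 10)
```

Note that E, as written in the text, contains only the points at distance one from p and not p itself; the sketch follows that reading.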

Region Extraction

[0077] After filtering the vector input image, the tracking process extracts regions from the current image. To accomplish this, the tracking process employs a spatial segmentation method 422 that takes the current image and identifies regions of connected points having "homogeneous" image values. These connected regions are the regions of points used in region-based motion estimation 424 and region-based classification 426.

[0078] In implementing the region extraction stage, there are three primary issues to address. First, the concept of "homogeneous" needs to be made precise. Second, the total number of regions should be found. Third, the location of each region must be fixed. The literature on segmentation of vector image data describes a variety of spatial segmentation methods. Most common spatial segmentation methods use:

[0079]
• polynomial functions to define the homogeneity of the regions;
• deterministic methods to find the number of regions; and/or
• boundary adjustment to finalize the location of all the regions.

[0080] These methods may provide satisfactory results in some applications, but they do not guarantee an accurate result for a wide variety of semantic video objects with non-rigid motion, disconnected regions, and multiple colors. The accuracy required of the spatial segmentation method is quite high because the accuracy with which the semantic objects can be classified depends on the accuracy of the regions. Preferably, after the segmentation stage, no region of a semantic object is missing, and no region contains any area that does not belong to it. Since the boundaries of the semantic video objects in the current frame are defined as a subset of all the boundaries of these connected regions, their accuracy directly affects the accuracy of the result of the tracking process. If the region boundaries are incorrect, the boundaries of the resulting semantic video objects will be incorrect as well. Therefore, the spatial segmentation method should provide an accurate spatial partition image for the current frame.

[0081] The current implementation of the tracking method uses a novel and fast spatial segmentation method called LabelMinMax. This particular approach grows one region at a time in a sequential fashion. It is unlike other parallel region-growing processes, which require that all seeds be specified before region growing proceeds from any seed. The sequential region-growing method extracts one region after another. This allows more flexible treatment of each region and reduces the overall computational complexity.

[0082] Region homogeneity is controlled by the difference between the maximum and minimum values in a region. Assume that the input vector image I_t is defined in m (m ≥ 1) totally ordered complete lattices L_1, ..., L_m of the product L (see Definition 11):

[0083] ∀p, q ∈ S, I_t(p) = {L_1(p), L_2(p), ..., L_m(p)}

[0084] The maximum and minimum values (MaxL and MinL) in a region R are defined as:

[0085]

[Equation 5]

[0086] If the difference between MaxL and MinL does not exceed a threshold (H = {h_1, h_2, ..., h_m}), the region is homogeneous:

[0087] Homogeneity: ∀i, 1 ≤ i ≤ m, (max_{p∈R}{L_i(p)} − min_{p∈R}{L_i(p)}) ≤ h_i

[0088] The LabelMinMax method labels each region one after another. It starts with a point p in the n-dimensional set S. Assume region R is the current region that LabelMinMax is operating on. At the beginning, it contains only the point p: R = {p}. Next, LabelMinMax checks all of the neighborhood points of region R (see Definition 3) to see whether region R is still homogeneous if a neighborhood point q is inserted into it. A point q is added to region R if the insertion does not change the homogeneity of that region. The point q should be deleted from set S when it is added to region R. Gradually, region R expands to the full homogeneous territory, where no more neighborhood points can be added. Then, a new region is constructed from one of the points remaining in S. This process continues until there are no more points left in S. The whole process can be clearly described by the following pseudo-code.

[0089] LabelMinMax:

[0090]

[Equation 6]
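The sequential region growing described in the prose above can be sketched in Python. This is a reconstruction from the textual description only, not the patent's pseudo-code: it assumes a frame stored as a dictionary from integer-coordinate points to value tuples, city-block connectivity, and a deterministic seed choice (the smallest remaining point). The names `neighbors` and `label_min_max` are hypothetical.

```python
def neighbors(p):
    """Points at city-block distance one from p (ASSUMPTION about D_{p,q})."""
    for axis in range(len(p)):
        for step in (-1, 1):
            yield p[:axis] + (p[axis] + step,) + p[axis + 1:]

def label_min_max(frame, thresholds):
    """Grow regions one at a time; returns {point: region_label}.

    frame: {point_tuple: value_tuple}; thresholds: H = (h_1, ..., h_m).
    A point joins the current region only if, for every component i,
    max_i - min_i over the enlarged region stays <= h_i.
    """
    remaining = set(frame)           # the set S; points are deleted as labeled
    labels = {}
    next_label = 0
    while remaining:
        seed = min(remaining)        # seed choice is an assumption of this sketch
        remaining.discard(seed)
        labels[seed] = next_label
        max_l = list(frame[seed])    # running MaxL of the current region
        min_l = list(frame[seed])    # running MinL of the current region
        frontier = [seed]
        while frontier:
            p = frontier.pop()
            for q in neighbors(p):
                if q not in remaining:
                    continue
                new_max = [max(a, b) for a, b in zip(max_l, frame[q])]
                new_min = [min(a, b) for a, b in zip(min_l, frame[q])]
                if all(hi - lo <= h
                       for hi, lo, h in zip(new_max, new_min, thresholds)):
                    max_l, min_l = new_max, new_min
                    remaining.discard(q)
                    labels[q] = next_label
                    frontier.append(q)
        next_label += 1
    return labels

# One row of four gray-level pixels: 0, 2, 10, 11 with threshold h_1 = 3.
row = {(0, 0): (0,), (1, 0): (2,), (2, 0): (10,), (3, 0): (11,)}
print(label_min_max(row, (3,)))
# {(0, 0): 0, (1, 0): 0, (2, 0): 1, (3, 0): 1}
```

Because MaxL can only rise and MinL can only fall as a region grows, a neighbor rejected once can never be accepted later by the same region, so each candidate needs to be tested at most once per adjacent region point.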

[0091] LabelMinMax has a number of advantages, including:

[0092]
• MaxL and MinL present a more precise description of a region's homogeneity compared to other criteria;
• the definition of homogeneity gives more rigid control over the homogeneity of a region, which leads to accurate regions;
• LabelMinMax provides reliable spatial segmentation results; and
• LabelMinMax has much lower computational complexity than many other approaches.

[0093] While these advantages make LabelMinMax a good choice for spatial analysis, it is also possible to use alternative segmentation methods to identify connected regions. For example, other region-growing methods use different homogeneity criteria and models of "homogeneous" regions to determine whether to add additional points to a homogeneous region. These criteria include, for example, an intensity threshold, where points are added to a region so long as the difference in intensity between each new point and a neighboring point in the region does not exceed the threshold. The homogeneity criteria may also be defined in terms of a mathematical function that describes how the intensity values of points in a region are allowed to vary and still be considered part of the connected region.

Region-Based Motion Estimation

[0094] The process of region-based motion estimation 424 matches the image values in the regions identified by the segmentation process with corresponding image values in the previous frame to estimate how each region has moved from the previous frame. To illustrate this process, consider the following example. Let I_{t−1} be the previous vector image on an n-dimensional set S at time t−1, and let I_t be the current vector image on the same set S at time t. The region extraction procedure extracts N homogeneous regions R_i (i = 1, 2, ..., N) in the current frame I_t:

[0095] S = ∪_{i=1,...,N} R_i

[0096] The tracking process now proceeds to classify each region as belonging to exactly one of the semantic video objects in the previous frame. The tracking process solves this region-based classification problem using region-based motion estimation and compensation. For each extracted region R_i in the current frame I_t, a motion estimation procedure is carried out to find where the region originated in the previous frame I_{t−1}. While a number of motion models may be used, the current implementation uses a translational motion model for the motion estimation procedure. In this model, the motion estimation procedure computes a motion vector V_i for region R_i that minimizes the prediction error (PE) on that region:

[0097]

[Equation 7]

[0098] where ‖·‖ denotes the sum of absolute differences between two vectors and V_i ≤ V_max (V_max is the maximum search range). This motion vector V_i is assigned to region R_i, indicating its trajectory location in the previous frame I_{t−1}.
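A translational search of this kind can be sketched as an exhaustive scan over candidate vectors. Equation 7 is not reproduced in this text, so the sketch assumes the prediction error is the sum over p ∈ R_i of ‖I_t(p) − I_{t−1}(p + V_i)‖, which matches the warping R′_i = {p + V_i} used later; it also assumes that V_max bounds each component of V_i and that candidates pointing outside the frame are rejected. All names are hypothetical.

```python
from itertools import product

def prediction_error(region, current, previous, v):
    """Sum over the region of the absolute differences between
    I_t(p) and I_{t-1}(p + v). Frames are {point_tuple: value_tuple}."""
    total = 0
    for p in region:
        q = tuple(a + b for a, b in zip(p, v))
        if q not in previous:
            return float("inf")  # ASSUMPTION: reject vectors leaving the frame
        total += sum(abs(a - b) for a, b in zip(current[p], previous[q]))
    return total

def estimate_motion(region, current, previous, v_max):
    """Return the translational vector minimizing the prediction error,
    searching every integer vector with components in [-v_max, v_max]."""
    n = len(next(iter(region)))
    candidates = product(range(-v_max, v_max + 1), repeat=n)
    return min(candidates,
               key=lambda v: prediction_error(region, current, previous, v))

# A bright pixel moves one step to the right between frames t-1 and t.
size = 5
previous = {(x, y): (0,) for x in range(size) for y in range(size)}
current = dict(previous)
previous[(1, 1)] = (9,)
current[(2, 1)] = (9,)
print(estimate_motion({(2, 1)}, current, previous, 2))  # (-1, 0)
```

The returned vector points backward, from the region's position in the current frame to its estimated origin in the previous frame.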

[0099] Other motion models may be used as well. For example, an affine or perspective motion model can be used to model the motion between a region in the current vector image and a corresponding region in the previous vector image. The affine and perspective motion models use a geometric transform (e.g., an affine or perspective transform) to define the motion of a region between one frame and another. The transform is expressed in terms of motion coefficients that may be computed by finding motion vectors for several points in a region and then solving a system of simultaneous equations using the motion vectors at the selected points to compute the coefficients. Another approach is to select an initial set of motion coefficients and then iterate until the error (e.g., a sum of absolute differences or a sum of squared differences) falls below a threshold.

Region-Based Classification

[0100] The region-based classification process 426 modifies the location of each region using its motion information to determine the region's estimated position in the previous frame. It then compares this estimated position with the boundaries of the semantic video objects in the previous frame (S_t) to determine which semantic video object the region most likely forms a part of.

[0101] To illustrate, consider the following example. Let I_{t−1} and I_t be the previous and current vector images on an n-dimensional set S, and let P_{t−1} be the corresponding semantic partition image at time t−1:

[0102] S = ∪_{i=1,...,m} O_{t−1,i}

[0103] Each O_{t−1,i} (i = 1, ..., m) indicates the location of a semantic video object at time t−1. Assume that there are N total extracted regions R_i (i = 1, 2, ..., N), each having an associated motion vector V_i (i = 1, 2, ..., N) in the current frame. The tracking method now needs to construct the current semantic partition image P_t at time t.

[0104] The tracking process fulfills this task by finding a semantic video object O_{t−1,j} (j ∈ {1, 2, ..., m}) for each region R_i in the current frame.

[0105] Since the motion information for each region R_i is already available at this stage, the region classifier 426 uses backward motion compensation to warp each region R_i in the current frame towards the previous frame. It warps the region by applying the motion information for the region to the points in the region. Let the warped region in the previous frame be R′_i:

[0106] R′_i = ∪_{p∈R_i} {p + V_i}

[0107] Ideally, the warped region R′_i should fall onto one of the semantic video objects in the previous frame:

[0108] ∃j ∈ {1, 2, ..., m} such that R′_i ⊆ O_{t−1,j}

[0109] If this is the case, the tracking method assigns the semantic video object O_{t−1,j} to this region R_i. In reality, however, because of potentially ambiguous results from the motion estimation process, R′_i may overlap with more than one semantic video object in the previous frame. That is,

[0110]

[Equation 8]

[0111] may hold.

[0112] The current implementation uses a majority criterion M for the region-based classification. For each region R_i in the current frame, if the majority part of the warped region R′_i comes from a semantic video object O_{t−1,j} (j ∈ {1, 2, ..., m}) in the previous frame, this region is assigned to that semantic video object O_{t−1,j}:

[0113] ∀p ∈ R_i and ∀q ∈ O_{t−1,j}, P_t(p) = P_{t−1}(q)

[0114] More specifically, the semantic video object O_{t−1,j} that has the majority overlapped area (MOA) with R′_i is found as:

[0115]

[Equation 9]

[0116] The complete semantic video objects O_{t,j} in the current frame are constructed one by one using this region-based classification procedure for all regions R_i (i = 1, 2, ..., N) in the current frame. Assume a point q ∈ O_{t−1,j}; then

[0117] O_{t,j} = ∪_{p∈S} {p | P_t(p) = P_{t−1}(q)}, j = 1, 2, ..., m.

[0118] By the design of this region-based classification process, there are no holes or gaps in the current frame, and no overlaps between different semantic video objects:

[0119] ∪_{i=1,...,m} O_{t,i} = ∪_{i=1,...,N} R_i = ∪_{i=1,...,m} O_{t−1,i} = S; ∀i, j ∈ {1, ..., m}, i ≠ j ⇒ O_{t,i} ∩ O_{t,j} = ∅
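The warp-and-vote classification can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: regions, motion vectors, and the previous partition image are stored in dictionaries, ties in the vote are broken by first occurrence, and a region whose warped points all fall outside the previous frame raises an error (the text does not specify that case). The name `classify_regions` is hypothetical.

```python
from collections import Counter

def classify_regions(regions, motion, prev_partition):
    """Assign each current-frame region to one previous-frame object.

    regions: {region_id: set of points R_i}
    motion: {region_id: motion vector V_i}
    prev_partition: P_{t-1} as {point: object_label}
    Returns P_t as {point: object_label}.
    """
    current_partition = {}
    for region_id, points in regions.items():
        v = motion[region_id]
        votes = Counter()
        for p in points:
            q = tuple(a + b for a, b in zip(p, v))   # warped point p + V_i
            if q in prev_partition:
                votes[prev_partition[q]] += 1
        if not votes:
            # ASSUMPTION: this case is not covered by the text.
            raise ValueError(f"region {region_id} warps outside the frame")
        winner = votes.most_common(1)[0][0]          # majority overlapped area
        for p in points:
            current_partition[p] = winner            # whole region gets one label
    return current_partition

# Previous frame: two background pixels followed by two object pixels.
prev_partition = {(0, 0): "bg", (1, 0): "bg", (2, 0): "obj", (3, 0): "obj"}
regions = {0: {(0, 0)}, 1: {(1, 0), (2, 0), (3, 0)}}
motion = {0: (0, 0), 1: (0, 0)}
print(classify_regions(regions, motion, prev_partition))
# {(0, 0): 'bg', (1, 0): 'obj', (2, 0): 'obj', (3, 0): 'obj'}
```

Since the regions partition S and each region receives exactly one label, every point of the current frame ends up in exactly one object, which is the no-gap, no-overlap property stated in the text.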

[0120] This is an advantage of this tracking system compared with tracking systems that track objects into frames where the boundaries of the semantic video objects have not been determined. For example, in forward tracking systems, object tracking proceeds into subsequent frames where precise boundaries are not known. The boundaries are then adjusted to fit the unknown boundaries based on some predetermined criteria that model a boundary condition.

Region Post-Processing

[0121] Assume the tracking result for the current frame is the semantic partition image P_t. For various reasons, there may be some errors in the region-based classification procedure. The goal of the region post-processing process is to remove those errors and, at the same time, smooth the boundaries of each semantic video object in the current frame. Interestingly, the partition image differs from a conventional image: the value at each point of the partition image only indicates the location of a semantic video object. Therefore, in general, conventional linear or non-linear filters for signal processing are not suitable for this post-processing.

[0122] The implementation uses a majority operator M(·) to fulfill this task. For every point p in the n-dimensional set S, a structuring element E is defined around it that contains all of its connected points (see Definition 1 on connected points):

[0123] E = ∪_{q∈S} {D_{p,q} = 1}

[0124] First, the majority operator M(·) finds the semantic video object O_{t,j} that has the maximal overlapped area (MOA) with the structuring element E:

[0125]

[Equation 10]

[0126] Second, the majority operator M(·) assigns the value of that semantic video object O_{t,j} to the point p:

[0127] Let q ∈ O_{t,j}; then P_t(p) = M(p) = P_t(q).

[0128] Because of the adoption of the majority criterion, very small areas (which are most likely errors) can be removed while the boundaries of each semantic video object are smoothed.
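One pass of such a majority operator can be sketched in Python. This is a minimal sketch under assumptions: the partition image is a dictionary from integer-coordinate points to object labels, E contains the points at city-block distance one from p (Equation 1 is not reproduced here), and ties are broken by first occurrence. The name `majority_filter` is hypothetical.

```python
from collections import Counter

def majority_filter(partition):
    """Apply the majority operator once to a partition image {point: label}.

    Each point takes the label that covers the largest part of its
    structuring element E. Labels are read from the input partition and
    written to a new dict, so the result does not depend on visiting order.
    """
    smoothed = {}
    for p, label in partition.items():
        votes = Counter()
        for axis in range(len(p)):
            for step in (-1, 1):
                q = p[:axis] + (p[axis] + step,) + p[axis + 1:]
                if q in partition:
                    votes[partition[q]] += 1
        # An isolated point has an empty E and keeps its own label.
        smoothed[p] = votes.most_common(1)[0][0] if votes else label
    return smoothed

# A single misclassified pixel inside the background is removed.
partition = {(x, y): "bg" for x in range(3) for y in range(3)}
partition[(1, 1)] = "obj"
print(set(majority_filter(partition).values()))  # {'bg'}
```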

Brief Overview of a Computer System

[0129] FIG. 5 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although the invention or aspects of it may be implemented in a hardware device, the tracking system described above is implemented in computer-executable instructions organized in program modules. The program modules include the routines, programs, objects, components, and data structures that perform the tasks and implement the data types described above.

[0130] Although FIG. 5 shows a typical configuration of a desktop computer, the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention may also be used in distributed computing environments where tasks are performed by remote processing devices that are linked through a computer network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

[0131] FIG. 5 shows an example of a computer system that serves as an operating environment for the invention. The computer system includes a personal computer 520, which includes a processing unit 521, a system memory 522, and a system bus 523 that interconnects various system components, including the system memory, to the processing unit 521. The system bus may comprise any of several types of bus structure, including a memory bus or memory controller, a peripheral bus, and a local bus using a bus architecture such as PCI, VESA, Micro Channel (MCA), ISA, or EISA. The system memory includes read-only memory (ROM) 524 and random access memory (RAM) 525. A basic input/output system 526 (BIOS), containing the basic routines that help to transfer information between elements within the personal computer 520, such as during start-up, is stored in ROM 524. The personal computer 520 further includes a hard disk drive 527, a magnetic disk drive 528 that, for example, reads from or writes to a removable disk 529, and an optical disk drive 530 that, for example, reads a CD-ROM disk 531 or reads from or writes to other optical media. The hard disk drive 527, magnetic disk drive 528, and optical disk drive 530 are connected to the system bus 523 by a hard disk drive interface 532, a magnetic disk drive interface 533, and an optical drive interface 534, respectively. The drives and their associated computer-readable media provide nonvolatile storage of data, data structures, and computer-executable instructions (program code such as dynamic link libraries and executable files) for the personal computer 520. Although the description of computer-readable media above refers to a hard disk, a removable magnetic disk, and a CD, it can also include other types of media that are readable by a computer, such as magnetic cassettes, flash memory cards, digital video disks, and Bernoulli cartridges.

[0132] A number of program modules may be stored in the drives and RAM 525, including an operating system 535, one or more application programs 536, other program modules 537, and program data 538. A user may enter commands and information into the personal computer 520 through a keyboard 540 and a pointing device, such as a mouse 542. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 521 through a serial port interface 546 that is coupled to the system bus, but they may also be connected by other interfaces, such as a parallel port, game port, or universal serial bus (USB). A monitor 547 or other type of display device is also connected to the system bus 523 via an interface, such as a display controller or video adapter 548. In addition to the monitor, a personal computer typically includes other peripheral output devices (not shown), such as speakers and printers.

[0133] The personal computer 520 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 549. The remote computer 549 may be a server, a router, a peer device, or another common network node, and typically includes many or all of the elements described relative to the personal computer 520, although only a memory storage device 550 is illustrated in FIG. 5. The logical connections depicted in FIG. 5 include a local area network (LAN) 551 and a wide area network (WAN). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.

[0134] When used in a LAN networking environment, the personal computer 520 is connected to the local network 551 through a network interface or adapter 553. When used in a WAN networking environment, the personal computer 520 typically includes a modem 554 or other means for establishing communications over the wide area network 552, such as the Internet. The modem 554, which may be internal or external, is connected to the system bus 523 via the serial port interface 546. In a networked environment, program modules described relative to the personal computer 520, or portions of them, may be stored in the remote memory storage device. The network connections shown are merely examples, and other means of establishing a communications link between the computers may be used.

Conclusion

[0135] While the invention has been described in the context of specific implementation details, it is not limited to those details. The invention provides a semantic object tracking method and system that identifies homogeneous regions in a vector image frame and then classifies these regions as being part of a semantic object. The classification method of the implementation described above is referred to as "backward tracking" because it projects a segmented region onto a previous frame in which the boundaries of the semantic objects were previously computed.

[0136] Note also that this tracking system generally applies to applications in which segmented regions are projected onto frames where the boundaries of the semantic video objects are known, even if those frames are not previous frames in an ordered sequence. Thus, the "backward" tracking scheme described above extends to applications in which classification is not necessarily limited to a previous frame, but is instead limited to frames where the boundaries of the semantic objects are known or previously computed. A frame in which the semantic video objects have already been identified is more generally referred to as a reference frame. The tracking of semantic objects for the current frame is computed by classifying the regions segmented in the current frame with respect to the semantic object boundaries in the reference frame.
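
The classification against a reference frame can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the patent: the names `classify_regions` and `build_label_map` are hypothetical, each region is assumed to carry a single translational motion vector, the reference frame is assumed to be given as a 2-D map of object indices, and a region is assigned to the object that receives the most warped points.

```python
from collections import Counter


def classify_regions(regions, motions, ref_labels):
    """Assign each current-frame region to one reference-frame object.

    regions:    {region_id: [(y, x), ...]} points of each segmented region
    motions:    {region_id: (dy, dx)} one translational vector per region (assumption)
    ref_labels: 2-D list, ref_labels[y][x] is the object index in the reference frame
    Returns {region_id: object_index}. A region whose warped points all fall
    outside the reference frame maps to None.
    """
    height, width = len(ref_labels), len(ref_labels[0])
    assignment = {}
    for region_id, points in regions.items():
        dy, dx = motions[region_id]
        votes = Counter()
        for y, x in points:
            wy, wx = y + dy, x + dx  # warp the point into the reference frame
            if 0 <= wy < height and 0 <= wx < width:
                votes[ref_labels[wy][wx]] += 1
        # The object overlapped by the most warped points wins.
        assignment[region_id] = votes.most_common(1)[0][0] if votes else None
    return assignment


def build_label_map(regions, assignment, height, width, background=0):
    """Form each object's area in the current frame as the union of its regions."""
    labels = [[background] * width for _ in range(height)]
    for region_id, points in regions.items():
        obj = assignment[region_id]
        if obj is not None:
            for y, x in points:
                labels[y][x] = obj
    return labels
```

Because every region receives exactly one object index, the object areas built this way cannot overlap one another. The sketch works with any reference frame whose object map is known, not only the immediately preceding frame.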

[0137] As noted above, the object tracking method applies generally to vector image sequences. It is therefore not limited to 2D video sequences or to sequences in which the image values represent intensity values.

[0138] The description of the region segmentation stage identified criteria that are particularly useful but are not required for all implementations of semantic video object tracking. As already noted, other segmentation techniques may be used to identify connected regions of points. The definition of a region's homogeneity may differ depending on the type of image values (e.g., motion vectors, color intensities) and on the application.
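
As one concrete instance of a homogeneity criterion, the following Python sketch grows connected regions under the rule recited in the claims: a neighbouring point is added only if the difference between the maximum and minimum values of the region stays below a threshold. The function name `grow_regions`, the use of scalar intensity values, 4-connectivity, and the raster-scan choice of seed points are assumptions made for illustration. Vector-valued images would need a suitable distance in place of the scalar difference.

```python
from collections import deque


def grow_regions(image, threshold):
    """Segment a 2-D intensity image into 4-connected regions.

    A neighbouring point joins a region only if, after adding it, the
    difference between the region's maximum and minimum values is still
    below `threshold`. Returns a 2-D list of region indices (0, 1, 2, ...).
    """
    height, width = len(image), len(image[0])
    region_of = [[None] * width for _ in range(height)]
    next_region = 0
    for sy in range(height):
        for sx in range(width):
            if region_of[sy][sx] is not None:
                continue
            # Start a new region at the first point not yet assigned.
            lo = hi = image[sy][sx]
            region_of[sy][sx] = next_region
            frontier = deque([(sy, sx)])
            while frontier:
                y, x = frontier.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    if region_of[ny][nx] is not None:
                        continue
                    value = image[ny][nx]
                    new_lo, new_hi = min(lo, value), max(hi, value)
                    if new_hi - new_lo < threshold:
                        lo, hi = new_lo, new_hi
                        region_of[ny][nx] = next_region
                        frontier.append((ny, nx))
            next_region += 1
    return region_of
```

Every point ends up in exactly one region, so the segmentation leaves no gaps. The result depends on the scan order of the seed points, which is one reason the text treats the criterion as useful but not mandatory.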

[0139] The motion model used to perform motion estimation and compensation can vary as well. Though computationally more complex, a motion vector can be computed for each individual point in a region. Alternatively, a single motion vector can be computed for each region, as in the translational model described above. Preferably, a region-based matching method should be used to find matching regions in the frame of interest. In region-based matching, a boundary or mask in the current frame is used to exclude points located outside the region from the process of minimizing the error between the predicted points and the corresponding region in the reference frame. This type of approach is described in U.S. Patent No. 5,796,855 by Ming-Chieh Lee, entitled Polygon Block Matching Method, which is incorporated herein by reference.
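
A minimal Python sketch of region-based matching with a single motion vector per region is given below. It is an assumption-laden illustration and not the method of the cited patent: the function name `estimate_region_motion`, the exhaustive search over a small window, the sum of absolute differences as the error measure, and the rule of skipping candidate vectors that move any point outside the reference frame are all hypothetical choices. What it does take from the text is that only the points inside the region contribute to the error.

```python
def estimate_region_motion(current, reference, points, search_range=2):
    """Find the translational vector (dy, dx) that best maps a region
    of the current frame onto the reference frame.

    current, reference: 2-D lists of scalar image values of equal size
    points: [(y, x), ...] points of one region of the current frame
    Returns (best_vector, best_error), or (None, None) if no candidate
    keeps all region points inside the reference frame.
    """
    height, width = len(reference), len(reference[0])
    best_vector, best_error = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            error = 0
            valid = True
            # Only the region's own points are matched; points outside
            # the region never enter the error sum.
            for y, x in points:
                ry, rx = y + dy, x + dx
                if not (0 <= ry < height and 0 <= rx < width):
                    valid = False
                    break
                error += abs(current[y][x] - reference[ry][rx])
            if valid and (best_error is None or error < best_error):
                best_vector, best_error = (dy, dx), error
    return best_vector, best_error
```

The returned vector can then be used to warp the points of the region into the reference frame for classification. A per-point motion field or a richer parametric model would replace the two nested loops over `dy` and `dx`.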

[0140] In view of the many possible implementations of the invention, the implementation described above is only an example of the invention and should not be taken as a limitation on its scope. Rather, the scope of the invention is defined by the appended claims. We therefore claim as our invention all that comes within the scope and spirit of these claims.

[Brief description of the drawings]

FIG. 1A is an example showing different types of semantic objects to illustrate the difficulty of tracking general semantic objects.

FIG. 1B is an example showing different types of semantic objects to illustrate the difficulty of tracking general semantic objects.

FIG. 1C is an example showing different types of semantic objects to illustrate the difficulty of tracking general semantic objects.

FIG. 2 is a block diagram illustrating a semantic object tracking system.

FIG. 3A is a diagram illustrating an example of a partitioned image and a method of representing the partitioned image in a region adjacency graph.

FIG. 3B is a diagram illustrating an example of a partitioned image and a method of representing the partitioned image in a region adjacency graph.

FIG. 3C is a diagram illustrating an example of a partitioned image and a method of representing the partitioned image in a region adjacency graph.

FIG. 3D is a diagram illustrating an example of a partitioned image and a method of representing the partitioned image in a region adjacency graph.

FIG. 4 is a flowchart illustrating an implementation of the semantic object tracking system.

FIG. 5 is a block diagram of a computer system that serves as an operating environment for an implementation of the invention.

Continuation of front page

(72) Inventor: Ming-Chieh Lee, Southeast 5558-166 Place, Bellevue, Washington 98006, United States (no street number)

F-terms (reference): 5C054 EA05 FB03 FC13 FE24 HA31; 5C059 MA01 MB02 MB03 NN24 NN36 PP04 PP26 PP28 RF11; 5L096 FA02 FA36 HA04 JA11

Claims


1. A method for tracking a video object in video frames, the method comprising: performing spatial segmentation on a video frame to identify regions of pixels having homogeneous intensity values; performing motion estimation between each region in the video frame and a previous frame; using the motion estimation for each region to warp the pixel locations in the region to locations in the previous frame; determining whether the warped pixel locations are within a boundary of a segmented video object in the previous frame, to identify a set of regions that are likely to be part of the video object; and forming a boundary of the video object in the video frame as a combination of each of the regions in the set.

2. The method of claim 1, wherein the steps of claim 1 are repeated for subsequent frames, using the boundary of the video object as a reference boundary for the next frame.

3. The method of claim 1, further comprising filtering the video frame to remove noise from the video frame before performing the spatial segmentation.

4. The method of claim 1, wherein each region is a connected group of pixels, and each region is determined to be homogeneous by ensuring that the difference between the intensity value of the pixel with the maximum intensity value in the region and the intensity value of the pixel with the minimum intensity value in the region is below a threshold.

5. The method of claim 4, wherein the segmentation is a sequential region growing method comprising: starting from a first pixel location in the video frame, growing a first region of connected pixels around the first pixel by adding pixels to the region such that the homogeneity criterion is satisfied; when no boundary pixels satisfy the homogeneity criterion, repeating the growing step with a pixel location outside the first region; and continuing the growing steps until each pixel in the frame has been identified as part of a region.

6. The method of claim 1, further comprising: for each region identified through spatial segmentation in the video frame, performing region-based motion estimation that includes matching only the pixels within the region with pixels in the previous frame; and applying a motion model that approximates the motion of the pixels in the region to corresponding locations in the previous frame, to find a corresponding location in the previous frame for each pixel in the region.

7. The method of claim 6, wherein the motion model is used to find, for each region, a motion vector that minimizes the prediction error between the pixel values warped from the video frame and the pixel values at the corresponding pixel locations in the previous video frame.

8. The method of claim 1, wherein the determining step comprises: counting the warped pixels that are inside the boundary of the segmented video object in the previous frame; and, when a majority of the warped pixels are inside the boundary of the segmented video object, classifying the region as part of the video object in the video frame.

9. A computer readable medium having instructions for performing the steps of claim 1.

10. A computer readable medium having instructions for tracking semantic objects in a vector image sequence of image frames, the medium comprising: a spatial segmentation module for segmenting an input vector image frame of the image sequence into regions, each region comprising a connected group of image points whose image values satisfy a homogeneity criterion; a motion estimation module for estimating motion between each region of the input image frame and a reference frame, and for determining motion parameters that approximate the motion of each region between the frames; a motion compensation module for applying the motion parameters of each region to the region to compute a predicted region in the reference frame; and a region classification module for evaluating whether each predicted region falls, at least in part, within the boundary of a semantic object in the reference frame, and for classifying each region as part of a semantic object of the reference frame based on the extent to which the predicted region falls within the boundary of that semantic object; wherein the boundary of a semantic object in the input image frame is formed from each region classified as part of that semantic object.

11. The medium of claim 10, wherein the homogeneity criterion of the spatial segmentation module is a maximum difference between a first image point of a connected group of image points that has the largest image value and a second image point of the connected group that has the smallest image value, and wherein the segmentation module selectively adds image points to a connected region to create a new connected region so long as the new connected region satisfies the homogeneity criterion.

12. The medium of claim 10, wherein the motion parameters of each region comprise a motion vector that, when used to project each image point in the region into a target frame, minimizes the sum of the differences between the image values of the projected points and the image values of the corresponding image points in the target frame.

13. The medium, wherein the target frame includes two or more semantic objects, each object occupying a non-overlapping area of the target frame; wherein region analysis identifies, for each predicted region, the semantic object of the target frame that has the maximum number of image points overlapping the predicted region, and classifies each region as being associated with the semantic object of the target frame that has the maximum number of overlapping image points; and wherein the boundary of each semantic object in the image frame is computed as a combination of the regions classified as being associated with the corresponding semantic object in the target frame.

14. The medium of claim 13, further comprising a majority operator that defines a structure of points around each point of the image frame, determines the semantic object in the image frame that has the largest area overlapping the structure, and assigns the value of that semantic object to the point.

15. A method for tracking semantic objects in a vector image sequence, the method comprising: performing spatial segmentation on an image frame to identify regions of discrete image points with homogeneous image values; performing motion estimation between each region in the image frame and a target image frame in which a boundary of a semantic object is known; using the motion estimation for each region to warp the image points of the region to locations in the target frame; determining whether the warped image point locations of each region are within the boundary of a semantic object in the target frame; when at least a threshold amount of a region overlaps a semantic object in the target frame, classifying the region as originating from that semantic object; and forming the boundary of the semantic object in the image frame as a combination of the regions of the image frame classified as originating from the semantic object of the target frame.

16. The method of claim 15, further comprising repeating the steps of claim 15 for a current frame, using the computed boundary of the semantic object of the previous frame, and classifying the regions segmented in the current frame as originating from one of the semantic objects of the previous frame.

17. The method of claim 15, wherein each region is a connected group of image points, and each region is determined to be homogeneous by adding a neighboring image point to the region only when, after the neighboring image point is added, the difference between the maximum image value and the minimum image value in the region is below a threshold.

18. The method of claim 15, wherein the target frame is the previous frame of a current frame; wherein each region segmented from the current frame is classified, using the steps of claim 15, as originating from exactly one of the semantic objects previously computed for the previous frame; wherein the boundary of each semantic object in the current frame is computed by combining the boundaries of the regions classified as originating from the same semantic object in the previous frame; and wherein the steps of claim 15 are repeated for successive frames of the vector image sequence.

19. A computer readable medium having instructions for performing the steps of claim 15.

20. A method for tracking semantic objects in a vector image sequence, the method comprising: performing spatial segmentation on an image frame to identify regions of discrete image points with homogeneous image values, wherein each region is a connected group of image points and each region is determined to be homogeneous by adding a neighboring image point to the region only when, after the neighboring image point is added, the difference between the maximum and minimum image values in the region is below a threshold; performing region-based motion estimation between each region in the image frame and the immediately preceding image frame in the vector image sequence; using the motion estimated for each region to warp the image points of the region to locations in the immediately preceding frame; determining whether the warped image point locations of each region are within the boundary of a semantic object in the immediately preceding frame; when at least a threshold amount of a region overlaps a semantic object in the immediately preceding frame, classifying the region as originating from that semantic object; and forming a boundary for each semantic object in the image frame as a combination of the regions of the image frame classified as originating from that semantic object.