JP2010057105A

JP2010057105A - Three-dimensional object tracking method and system

Info

Publication number: JP2010057105A
Application number: JP2008222253A
Authority: JP
Inventors: Yous Sofiane; ソフィアンユース; Laga Hamid; ラガハミド
Original assignee: Tokyo Institute of Technology NUC
Current assignee: Tokyo Institute of Technology NUC
Priority date: 2008-08-29
Filing date: 2008-08-29
Publication date: 2010-03-11

Abstract

PROBLEM TO BE SOLVED: To provide an object tracking method that can handle complicated occlusion.

SOLUTION: To track an object in three dimensions (3D), video from a 3D camera observing a predetermined environment from obliquely above is analyzed to form a depth map of the background of the environment and a depth map of the current scene. The two depth maps are compared to form a depth map of the foreground, and the 3D coordinates of each foreground point are computed from the foreground depth map. The 3D coordinates are transformed into world coordinates, and for each foreground point only the zw value of the world coordinates is extracted and mapped to the corresponding pixel of the image plane, thereby forming a world-Z map. Points where the zw value changes greatly between neighboring pixels on the world-Z map are detected to find the boundaries of one or more objects included in the foreground. The objects are segmented on the basis of the detected boundaries, and a 3D trajectory is created.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a method and system capable of detecting one or more moving bodies (hereinafter, objects) in a series of moving images (a video sequence) in real time and tracking them in three dimensions.

To track an object such as a person in a video sequence, it is necessary to detect the person or persons who appear in the monitored area of the video image, to assign each person a unique ID for as long as that person remains in the area, and to reproduce the trajectory of each tracked person in three dimensions. An ideal system must satisfy the following requirements.

First, tracking must be possible even in complex scenes with frequent occlusion (overlapping objects), such as crowded scenes. There are three types of occlusion: (1) partial occlusion, in which only part of the subject is visible to the sensor; (2) short-term occlusion, in which the subject is completely occluded for a short time; and (3) extended occlusion, in which the subject leaves the field of view for a long period. The system must identify these occlusions and perform tracking.

Second, the system must be robust to changes in lighting and to background complexity in the scene. A tracking system must operate in indoor or outdoor environments and against complex, dynamic backgrounds. Third, it must accurately reproduce the trajectory of each person in three-dimensional space or on the floor plane, and must distinguish and accurately track persons of similar appearance. Fourth, processing must be fast; that is, the system must run in real time so that a person's behavior can be analyzed and abnormalities detected. Fifth, the installation cost must be low and setup and maintenance must be easy. To satisfy the fifth requirement, the number of parameters needed to tune the system should be minimal.

Several tracking systems have already been proposed. Non-Patent Document 3 proposes a system based on a single monocular camera. This system introduces the idea of a Bayesian multi-blob tracker for object tracking. Once the objects to be tracked have been detected as blobs by the single monocular camera, tracking is performed on the basis of the appearance of each blob. The authors proposed applying a multi-blob likelihood function and Bayesian filtering for multi-object tracking to the case where the number of objects in the scene is unknown and changes over time.

However, this system cannot identify occlusion effectively, fails to distinguish persons of similar appearance, is sensitive to changes in lighting conditions, and cannot reproduce the 3D trajectory of a tracked object.

Non-Patent Document 1 proposes a system based on multiple monocular cameras. The system uses several monocular cameras placed around the monitored scene. Color-based background subtraction is performed to extract the foreground image. The intersections, on the overhead plane, of the cones corresponding to the foreground yield a set of candidates to track. New trajectories of the candidates are generated by associating the current candidates with the persons detected in previous frames on the basis of motion, color, and appearance.

This system can handle occlusion and can reproduce 3D trajectories as long as a person is not occluded in every view. Because it processes a batch of 100 frames at a time, it has a delay of 2 seconds, but the processing can be regarded as relatively fast.

In addition, because each person is recognized by color and appearance, the system can presumably preserve each person's ID. However, because it operates on the intersections of cones, it cannot properly handle cases such as several persons merging. Nor can it distinguish persons of similar appearance who wear similar colors. Furthermore, because background subtraction and tracking rely on color data, it is not robust to changes in illumination.

The systems of Non-Patent Document 2 and Patent Document 1 use a 3D camera that is fixed at a high position and observes the scene from an oblique direction. The 3D camera provides a depth map of the scene. The system operates as follows.

1) Start by subtracting the background, recorded with no objects in the scene, from the foreground.

2) Compute 3D points from the foreground depth and render a virtual overhead plane. Only the highest points are displayed on the overhead plane.

3) Post-process the blobs with several morphological operations so that each resulting blob represents one individual in the scene.

4) Associate the detected persons by tracking that provides the trajectory over time of each person's height pattern.

This system can handle occlusion, but it does not cope well when persons are close to one another and have similar appearance, in particular similar height. On the other hand, because it uses 3D data instead of color, it is robust to changes in illumination. It also reproduces 3D trajectories, is fast, and is inexpensive, but it does not keep a person's ID when that person leaves the scene and re-enters. Furthermore, because it uses the height pattern for tracking, it becomes confused when a person sits down, jumps, or raises a hand.

Patent Document 1: U.S. Patent No. 7,003,136
Non-Patent Document 1: F. Fleuret, J. Berclaz, R. Lengagne and P. Fua, “Multi-Camera People Tracking with a Probabilistic Occupancy Map”, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 30, no. 2, pp. 267-282, Feb. 2008.
Non-Patent Document 2: M. Harville, “Stereo person tracking with adaptive plan-view templates of height and occupancy statistics”, Journal of Image and Vision Computing, vol. 22, no. 2, pp. 127-142, 2004.
Non-Patent Document 3: M. Isard and J. MacCormick, “BraMBLe: A Bayesian multiple-blob tracker”, in Proceedings of the International Conference on Computer Vision, ICCV’01, pp. 34-41, 2001.
Non-Patent Document 4: Arun Hampapur, Lisa M. Brown, Jonathan Connell, Max Lu, Hans Merkl, S. Pankanti, “Multi-scale Tracking for Smart Video Surveillance”, IEEE Transactions on Signal Processing, Vol. 22, No. 2, March 2005.
Non-Patent Document 5: IBM smart surveillance system: http://reaserch.ibm.com/peoplevision/

The present invention has been made to solve the problems of the prior art described above. Its object is to provide a three-dimensional object tracking method and system that is robust in indoor and outdoor environments and against complex, dynamic backgrounds, and that can effectively handle occlusion of moving bodies or persons of similar appearance.

To solve the above problem, the method of the present invention comprises the steps of: analyzing the video of a 3D camera that observes a predetermined environment from obliquely above to form a depth map of the background of the environment and a depth map of the current scene of the environment; comparing the depth maps of the background and the current scene to form a depth map of the foreground; calculating the three-dimensional coordinates of each point of the foreground on the basis of the foreground depth map; converting the three-dimensional coordinates into world coordinates; extracting, for each point of the foreground, only the zw value of the world coordinates and mapping it to the corresponding pixel of the image plane to form a world-Z map; detecting points where the zw value changes greatly between adjacent pixels on the world-Z map to detect the boundaries of one or more objects included in the foreground; segmenting the objects on the basis of the detected boundaries; and creating a three-dimensional trajectory for each segmented object.

To solve the above problem, the system of the present invention comprises a 3D camera that observes a predetermined environment from obliquely above and a processor that analyzes the video of the 3D camera. The processor executes the following procedures: forming, from the video, depth maps of the background of the environment and of the current scene; comparing the two depth maps to calculate a depth map of the foreground; calculating the three-dimensional coordinates of each point of the foreground on the basis of the foreground depth map; converting the calculated three-dimensional coordinates into world coordinates; extracting, for each point of the foreground, only the zw value of the world coordinates and mapping it to the corresponding pixel of the image plane to form a world-Z map; detecting points where the zw value changes greatly between adjacent pixels on the world-Z map to detect the boundaries of one or more objects included in the foreground; segmenting the objects on the basis of the detected boundaries; and creating a three-dimensional trajectory for each segmented object.

In the above method or system, the background depth map may be formed from video captured when no object is present in the environment, and the foreground depth map may be formed by subtracting the background depth map from the depth map of the current scene.

Furthermore, in the above method or system, the three-dimensional coordinate data of the foreground may be filtered by the incidence angle with respect to the 3D camera to remove noise.

The video obtained from the 3D camera is analyzed to form a depth map of the foreground of the region to be observed (ROI, region of interest), and the three-dimensional coordinates of each foreground point are calculated from this depth map. The calculated three-dimensional coordinates are then converted into world coordinates. The world coordinate system is defined so that its XY plane is parallel to the floor of the ROI and the center of the 3D camera has world coordinates (0, 0, h), where h is an arbitrary value. From each foreground point converted to world coordinates, only the coordinate on the Z axis, the zw value, is taken and mapped onto the image plane to form the world-Z map of the foreground. In a world-Z map formed in this way, the zw values of points belonging to the same object vary continuously and uniformly along the Y axis of the image plane.

This property makes it possible to detect the boundaries between multiple objects included in the foreground. That is, when the world-Z map is scanned, a point where the zw value changes abruptly between adjacent pixels, instead of decreasing (or increasing) uniformly, can be regarded as a boundary between objects. Therefore, by detecting the objects in the foreground from the large changes in zw value that indicate boundaries between them, and segmenting them separately, multiple objects can be distinguished and segmented with high accuracy even when several objects are intermingled, that is, when occlusion occurs frequently.
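The boundary test described here can be sketched for a single image column. This is a minimal illustration, not the patented procedure: the function name `find_boundaries` and the threshold `max_jump` are hypothetical, and the sketch assumes that non-foreground pixels hold a zw value of 0.

```python
from typing import List


def find_boundaries(column: List[float], max_jump: float) -> List[int]:
    """Return each index i where zw jumps by more than max_jump from pixel i-1 to i."""
    boundaries = []
    for i in range(1, len(column)):
        prev, cur = column[i - 1], column[i]
        # Skip background pixels (zw == 0): a foreground/background edge
        # is not a boundary between two objects.
        if prev > 0.0 and cur > 0.0 and abs(cur - prev) > max_jump:
            boundaries.append(i)
    return boundaries


# One image column of a world-Z map, top to bottom (illustrative heights).
column = [0.0, 1.2, 1.1, 1.0, 0.9, 1.7, 1.6, 1.5, 0.0]
print(find_boundaries(column, max_jump=0.3))  # → [5]
```

Within each object the values fall by a small, steady step; the jump from 0.9 to 1.7 at index 5 is the only change larger than the threshold, so it is reported as an object boundary.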

Once the individual objects have been segmented, the correspondence describing how each object moved between time t-1 and time t can be established, for example by using a linear probability model. The three-dimensional trajectory of each object can thereby be detected. Object association is achieved by probability calculations based on features such as the object's 3D coordinates, average color, and facial features.

Because the method and system of the present invention detect objects essentially on the basis of depth maps, they are robust to changes in illumination and to background complexity in the environment being observed. Furthermore, multiple objects can be segmented effectively even in environments where occlusion occurs frequently. In addition, when the method and system are implemented, the only important parameter for system setup is the installation angle of the 3D camera, so the setup and maintenance costs of the system can be kept low.

Various embodiments of the present invention are described below with reference to the drawings. In the drawings, the same reference numerals denote the same or similar components, and duplicate descriptions are omitted.

FIG. 1 is a flowchart showing a three-dimensional tracking method for a moving body (hereinafter, object) according to an embodiment of the present invention; FIG. 2 is a diagram showing a system configuration for carrying out the method of FIG. 1; and FIGS. 3(a) to 3(f) are illustrations explaining the operations in several steps of the flowchart of FIG. 1.

As shown in FIG. 2, the three-dimensional object tracking system according to this embodiment basically consists of a single 3D camera (for example, a stereo camera) 2 fixed obliquely above a space 1 in which the movement of objects is to be tracked, and a processing device 3 (including a processor) that processes the information obtained from the camera 2. The processing device 3 may be an ordinary information processing device such as a computer, or may be a camera system integrated with the camera 2. The 3D camera 2 observes the space 1 from an inclined angle, and the processing device 3 creates a depth map of the space 1 from the observed image.

As preprocessing for object tracking, a background model is created first. Steps S1 and S2 in FIG. 1 are the steps for creating the background model. In step S1, the camera 2 acquires a background image of the space 1. The background image is an image of the space when no object to be tracked is present. In step S2, a depth map, that is, the background model, is created from the acquired background image. FIG. 3(c) shows a background model created in this way. Once created, the background model can be reused for processing every scene image, so steps S1 and S2 need not be repeated for the scene captured at each time t.

In step S3 of FIG. 1, the 3D camera 2 captures the actual scene in the space 1. The actual scene contains the objects to be tracked. FIG. 3(a) shows an example of a scene image. The scene is captured at fixed time intervals t, so step S3 and the subsequent steps are repeated at every interval t. In step S4, a depth map of the captured scene is created. FIG. 3(b) shows the depth map created from the scene image of FIG. 3(a). To identify objects and track their movement, only the objects must be cut out of the depth map of the image (see FIG. 3(b)). Therefore, in step S5, the depth map of the background image is subtracted from the depth map of the scene image to create a depth map of all the objects together, that is, the depth map of the foreground image.
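Step S5 can be sketched as a per-pixel comparison of the two depth maps. The text only says that the background depth map is subtracted from the scene depth map; the tolerance `min_diff`, the use of `None` for pixels without valid depth, and the function name below are assumptions made for illustration.

```python
from typing import List, Optional

DepthMap = List[List[Optional[float]]]  # None marks a pixel with no valid depth


def foreground_depth(scene: DepthMap, background: DepthMap,
                     min_diff: float = 0.1) -> DepthMap:
    """Keep the scene depth only where it differs from the background model."""
    fg: DepthMap = []
    for scene_row, bg_row in zip(scene, background):
        out_row: List[Optional[float]] = []
        for d_scene, d_bg in zip(scene_row, bg_row):
            if d_scene is None:
                out_row.append(None)           # no measurement in the scene
            elif d_bg is None or abs(d_bg - d_scene) > min_diff:
                out_row.append(d_scene)        # depth changed: foreground
            else:
                out_row.append(None)           # same as background
        fg.append(out_row)
    return fg


background = [[5.0, 5.0, 5.0],
              [4.0, 4.0, 4.0]]
scene = [[5.0, 3.2, 5.0],
         [4.0, 3.1, 4.02]]
print(foreground_depth(scene, background))
# → [[None, 3.2, None], [None, 3.1, None]]
```

The small difference at the last pixel (4.0 versus 4.02) falls under the assumed tolerance and is treated as background, which is one simple way to absorb depth jitter.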

In the next step S6, 3D data is created for each point of the foreground image from the foreground depth map created in step S5. In step S7, processing is performed to remove noise contained in the 3D data. The noise removal processing is described in detail in the section [Noise removal processing in 3D data] with reference to FIG. 4.
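The text does not say how step S6 turns a depth value into 3D data. A common choice, assumed here purely for illustration, is pinhole back-projection with focal lengths `fx`, `fy` and principal point `cx`, `cy` (all hypothetical parameters not named in the source).

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def back_project(depth: List[List[Optional[float]]],
                 fx: float, fy: float, cx: float, cy: float
                 ) -> List[List[Optional[Point3D]]]:
    """Convert each valid depth pixel (u, v, z) into camera coordinates (xc, yc, zc)."""
    points: List[List[Optional[Point3D]]] = []
    for v, row in enumerate(depth):
        out_row: List[Optional[Point3D]] = []
        for u, z in enumerate(row):
            if z is None:
                out_row.append(None)  # not part of the foreground
            else:
                # Assumed pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
                out_row.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        points.append(out_row)
    return points


cloud = back_project([[None, 2.0], [4.0, None]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
print(cloud)  # → [[None, (1.0, 0.0, 2.0)], [(0.0, 2.0, 4.0), None]]
```

Keeping the result on the pixel grid, rather than as a flat list of points, preserves the pixel-to-point correspondence that later per-pixel maps rely on.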

In step S8, the 3D data after noise removal is converted into world coordinates. The 3D data is expressed in camera coordinates; when it is converted into world coordinates, the floor of the scene is taken to be parallel to the XY plane of the world coordinate system. The world coordinate system is also set so that the center of the camera has world coordinates (0, 0, h), where h is an arbitrary value. The conversion from the camera coordinate system to the world coordinate system is described in detail in the section [Creating the world-Z map] with reference to FIG. 5.

In step S9, for each foreground point, the value on the Zw axis of the world coordinates, that is, the zw value, is taken and mapped onto the corresponding pixel of the image plane to create the world-Z map. FIG. 3(d) shows a world-Z map created in this way.

The maps shown in FIGS. 3(b), 3(c), and 3(d) are in reality sets of numerical data, but they are rendered as images to explain the function of each map. In the world-Z map of FIG. 3(d), pixels with large zw values are shown bright and pixels with small zw values are shown dark. Areas other than the foreground are processed with a zw value of 0.

Once the above processing has produced a world-Z map that holds, for each foreground point, only the zw value in world coordinates, the map is scanned in the next step S10 to detect each object in the foreground image. The map is normally scanned from the top-left corner to the top-right corner, and likewise from the top of the map to the bottom, but any scanning algorithm suited to the system configuration may be used.

In the world-Z map, each pixel holds the zw value of the world coordinates, that is, the height above the floor of the corresponding point on the object, so within a single object the zw value changes uniformly along the Yw axis of the world coordinates. In step S11, this property is used to detect the boundaries between objects in the world-Z map. The specific procedure for detecting the boundaries between objects is described in detail in the section [Characteristics of the world-Z map] with reference to FIGS. 6 and 7.

When the object boundaries have been detected, each detected object is segmented in step S12. FIG. 3(e) shows the state in which two objects have been identified in the world-Z map of FIG. 3(d) and segmented. The segmentation of step S12 is described in detail in the section [Segmentation] with reference to FIG. 7.

In step S13, the objects segmented at time t-1 are associated with the objects segmented at the current time t, for example by using a linear probability model. In step S14, the 3D trajectory of each object is created on the basis of the association made in step S13. By repeating the above procedure, that is, steps S3 to S14, at fixed time intervals, 3D tracking of the objects can be carried out.
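The exact form of the linear probability model is not given in this part of the text. As a stand-in, the sketch below scores each candidate pair with a weighted linear combination of position and color similarity and matches greedily. The weights, the exponential similarity, the `min_score` cut-off, and all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Obj = Dict[str, Tuple[float, float, float]]  # "pos": world xyz, "color": mean RGB in 0..1


def match_score(a: Obj, b: Obj, w_pos: float = 0.7, w_color: float = 0.3) -> float:
    """Linear combination of two similarities, each mapped into (0, 1]."""
    s_pos = math.exp(-math.dist(a["pos"], b["pos"]))
    s_color = math.exp(-math.dist(a["color"], b["color"]))
    return w_pos * s_pos + w_color * s_color


def associate(prev: List[Obj], curr: List[Obj], min_score: float = 0.5) -> Dict[int, int]:
    """Map each current-object index to the index of its best previous object."""
    pairs = sorted(((match_score(p, c), i, j)
                    for i, p in enumerate(prev)
                    for j, c in enumerate(curr)), reverse=True)
    used_prev, used_curr, result = set(), set(), {}
    for score, i, j in pairs:
        if score < min_score:
            break  # remaining pairs are too dissimilar to be the same object
        if i in used_prev or j in used_curr:
            continue
        result[j] = i
        used_prev.add(i)
        used_curr.add(j)
    return result


prev = [{"pos": (0.0, 0.0, 1.7), "color": (0.9, 0.1, 0.1)},
        {"pos": (2.0, 0.0, 1.6), "color": (0.1, 0.1, 0.9)}]
curr = [{"pos": (2.1, 0.0, 1.6), "color": (0.1, 0.1, 0.9)},
        {"pos": (0.1, 0.1, 1.7), "color": (0.9, 0.1, 0.1)}]
print(associate(prev, curr))  # → {0: 1, 1: 0}
```

Current objects left unmatched would be candidates for new IDs; appending each matched position to its ID's history yields the trajectory of step S14.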

Next, the main steps of FIG. 1 are described in detail.

[Noise removal processing in 3D data]
Noise removal is performed to eliminate the noise inherent in the depth information from the 3D camera. Such noise tends to arise mainly at object boundaries, and its presence degrades the accuracy of object segmentation. The inventors noted that such noise has a larger incidence angle (with respect to the camera) than the points of the point cloud. This property can be used to determine the threshold of a filter that removes the noise.

Specifically, an incidence angle map is formed on the image plane, in which each pixel records the incidence angle of one point of the point cloud. This map is then filtered so that only points whose incidence angle is smaller than a predetermined threshold are kept. The map formed in this way is used as a mask for the world-Z map.

FIG. 4 shows the noise that arises in stereo processing. As the figure shows, the incidence angle β of noise 40 with respect to camera 20 tends to be larger than the incidence angle γ of point 30 with respect to camera 20. In FIG. 4, 32 denotes the incidence vectors of noise 40 and point 30, and 34 denotes the normal vector of noise 40. The cosine of the incidence angle at a given point p is the product of the unit normal vector at p, written here as $\hat{\mathbf{n}}_p$, and the unit vector joining the camera center and p, written here as $\hat{\mathbf{v}}_p$. The normal vector can be calculated from valid points p1 and p2 that are adjacent to p but not collinear with it. That is, the normal vector $\mathbf{n}_p$ is given by the cross product of $\overrightarrow{p\,p_1}$ and $\overrightarrow{p\,p_2}$: $\mathbf{n}_p = \overrightarrow{p\,p_1} \times \overrightarrow{p\,p_2}$.

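As a result, the incidence map entry I(p) for point p is given by [equation not reproduced in this text]. From the preceding definitions, the entry is determined by the unit normal vector at p, obtained from the cross product of the vectors from p to its neighbors p1 and p2, and the unit vector joining the camera center and p, whose product is the cosine of the incidence angle.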
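The incidence angle computation and the threshold mask can be sketched as follows. Several details are assumptions: the entry is stored as an angle in degrees (whether the original stores the angle or its cosine is not recoverable here), the absolute value of the cosine is taken so that the result does not depend on the order of the neighbors, the camera center is placed at the origin, and the threshold `MAX_ANGLE_DEG` is a hypothetical value.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a: Vec) -> Vec:
    n = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    if n == 0.0:
        raise ValueError("zero-length vector (collinear neighbours?)")
    return (a[0] / n, a[1] / n, a[2] / n)


def incidence_angle_deg(p: Vec, p1: Vec, p2: Vec,
                        camera: Vec = (0.0, 0.0, 0.0)) -> float:
    """Angle between the surface normal at p and the ray from the camera to p."""
    normal = unit(cross(sub(p1, p), sub(p2, p)))  # normal from two neighbours
    view = unit(sub(p, camera))                   # camera centre -> p
    cos_angle = abs(sum(n * v for n, v in zip(normal, view)))
    return math.degrees(math.acos(min(1.0, cos_angle)))


# A patch facing the camera, and a patch stretched along the line of sight.
facing = incidence_angle_deg((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0))
grazing = incidence_angle_deg((0.0, 0.0, 2.0), (0.01, 0.0, 3.0), (0.0, 1.0, 2.0))

MAX_ANGLE_DEG = 75.0  # hypothetical threshold
mask = [angle < MAX_ANGLE_DEG for angle in (facing, grazing)]
print(mask)  # → [True, False]
```

The second patch spans a large depth gap between neighboring pixels, which is typical of stereo artifacts at object edges; its incidence angle is close to 90 degrees, so the mask rejects it.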

[Creating the world-Z map]
The world-Z map records, at each pixel of the image plane, the world Z coordinate of the corresponding three-dimensional point of the scene. FIG. 5 shows the relationship between camera coordinates and world coordinates. In FIG. 5, Xc, Yc, and Zc denote the axes of the camera coordinate system, and Xw, Yw, and Zw denote the axes of the world coordinate system. The camera 2 is placed at height h above the floor so as to observe the floor at angle α. The floor is assumed to be parallel to the XY plane, and the center of the camera 2 has world coordinates (0, 0, h). In this case, the world coordinates (xw, yw, zw) of a point P whose camera coordinates are (xc, yc, zc) are calculated as

$$
\begin{aligned}
x_w &= z_c \sin(\alpha) \\
y_w &= x_c \\
z_w &= h - z_c \cos(\alpha)
\end{aligned}
\tag{1}
$$

The world-Z map is created by taking only the world Z coordinate calculated from equation (1) and mapping it to each pixel of the image plane.
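Equation (1) and the construction of the world-Z map translate almost directly into code. The sketch below assumes the foreground points are kept on the pixel grid as camera-coordinate tuples, with `None` for non-foreground pixels; those pixels receive a zw value of 0, as stated for FIG. 3(d). The function names and sample values are illustrative.

```python
import math
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def camera_to_world(xc: float, yc: float, zc: float,
                    h: float, alpha: float) -> Point3D:
    """Equation (1): camera coordinates -> world coordinates (alpha in radians)."""
    xw = zc * math.sin(alpha)
    yw = xc
    zw = h - zc * math.cos(alpha)
    return xw, yw, zw


def world_z_map(points: List[List[Optional[Point3D]]],
                h: float, alpha: float) -> List[List[float]]:
    """Keep only zw per pixel; non-foreground pixels are set to 0."""
    return [[0.0 if p is None else camera_to_world(*p, h, alpha)[2]
             for p in row]
            for row in points]


# Camera 3 units above the floor, observing at 60 degrees.
grid = [[None, (0.0, 0.0, 2.0)],
        [(0.5, 0.0, 3.0), None]]
zmap = world_z_map(grid, h=3.0, alpha=math.radians(60.0))
print([[round(z, 3) for z in row] for row in zmap])
```

With these sample values the two foreground pixels come out at heights of about 2.0 and 1.5, since cos 60° = 0.5.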

[Characteristics of the world-Z map]
FIGS. 6 and 7 illustrate the characteristics of the world-Z map. FIG. 6 shows the camera 2 observing a scene that contains objects A and B standing close together on the floor 10, and FIG. 7 shows the world-Z map of the scene of FIG. 6. In FIG. 7, 12 is the image plane. When the scene of FIG. 6 is observed by the camera 2, the distance d1 between the camera 2 and object A is smaller than the distance d2 between the camera 2 and object B, so object A appears lower than object B on the image plane 12 of the world-Z map.

Therefore, when objects A and B have similar heights, as is the case with people, the point at the highest position (Y coordinate) on the world-Z map can be regarded as belonging to the object farthest from the camera 2. Specifically, point P3 in FIG. 7 is considered to belong to object B, which is farther away than object A. Further, as is apparent from FIG. 6, point P1 of object A and point P2 of object B appear at the same position on the image plane 12. However, object A is closer to the camera than object B, and therefore the world-Z coordinate value Z1 of point P1 is larger than the world-Z coordinate value Z2 of point P2; that is, Z1 > Z2.

Within a single object, the world-Z coordinate value decreases steadily as the Y coordinate on the image plane increases. That is, in object B of FIG. 7, the zw value decreases steadily from point P3 as the Y coordinate increases. Similarly, within object A, the zw value is largest at point P1 and decreases steadily as the Y coordinate increases.

In summary, the world-Z map has the following characteristics.

1) The world-Z map contains all of the 3D points calculated from the depth map, so no information is lost.
2) Within a single object, the zw value decreases steadily as the Y coordinate increases.
3) An object close to the camera has larger zw values on the world-Z map than an object far from the camera. Consequently, when the objects have roughly the same height, as is the case with people, the point at the highest position (Y coordinate) on the world-Z map belongs to the object farthest from the camera.

[Object Segmentation]
By using characteristics 2) and 3) of the world-Z map, multiple objects that partially overlap on the image plane can be identified and segmented. When the world-Z map shown in FIG. 7 is scanned, for example from right to left and from top to bottom, the zw value may change sharply between adjacent pixels instead of decreasing steadily. Such a location can be regarded as a boundary between objects. For example, in the world-Z map, the zw value within object B decreases steadily from point P3 toward point P2, but the pixel at point P2 also corresponds to point P1 of object A, so the zw value rises abruptly there. This is because Z1 > Z2, as shown in FIG. 6. Within object A, the zw value is at its maximum, Z1, at point P1 and decreases steadily thereafter.

As described above, boundaries between objects can be detected by scanning the world-Z map and finding the points at which the zw value changes sharply between adjacent pixels. The present method uses this property to identify multiple objects in the world-Z map.
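The boundary test described above can be sketched as follows. The sketch assumes the world-Z map is a nested list with `None` at background pixels, and that each pixel is compared only with the pixel above it and the pixel to its left. The function name `find_boundaries` and the parameter `eps` (standing in for a jump threshold) are hypothetical.

```python
def find_boundaries(zmap, eps):
    """Return the set of (row, col) pixels whose zw value differs from
    the foreground pixel above or to the left by more than eps."""
    boundaries = set()
    for r, row in enumerate(zmap):
        for c, z in enumerate(row):
            if z is None:
                continue  # background pixel: nothing to compare
            neighbours = []
            if r > 0:
                neighbours.append(zmap[r - 1][c])
            if c > 0:
                neighbours.append(row[c - 1])
            if any(n is not None and abs(z - n) > eps for n in neighbours):
                boundaries.add((r, c))
    return boundaries

# One image column: a far object (zw falling 1.7 -> 1.3) partly hidden
# by a nearer object whose top again has zw = 1.7.
column = [[1.7], [1.5], [1.3], [1.7], [1.5], [1.3]]
jumps = find_boundaries(column, eps=0.3)
```

In this toy column the only pixel flagged is the fourth one, where the zw value rises abruptly instead of continuing to fall.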

[Segmentation]
FIG. 8 illustrates the procedure for segmenting objects A and B from a world-Z map that contains the two objects within the monitored area. FIG. 8(a) shows a world-Z map created from an image obtained by the 3D camera. Scanning of this map starts at its top-left corner and proceeds across the screen from right to left and then from top to bottom, whereby the boundary 13 between object B and the other object is detected. To detect the boundary, a user-specified threshold ε may be applied to the change in the zw value between adjacent pixels. Next, based on the detected boundary 13, object B is segmented as shown in FIG. 8(b) (segment B').

Next, segment B' is removed from the world-Z map and the remaining map is scanned to determine object A (FIG. 8(c)), which is then segmented as shown in FIG. 8(d) (segment A'). As a result, objects A and B are segmented in the world-Z map as segment A' and segment B'.
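One possible way to realize this peel-off procedure is region growing that never crosses a zw jump larger than the threshold. This is a substitute technique chosen for illustration, not the scanning procedure of FIG. 8 itself. The function name `segment_objects`, the parameter `eps`, the label grid, and the 4-neighbour connectivity are all assumptions.

```python
from collections import deque

def segment_objects(zmap, eps):
    """Label foreground pixels so that neighbouring pixels share a label
    only when their zw values differ by at most eps.

    zmap[row][col] is a zw value or None for background.
    Returns a grid of integer labels (0 = background).
    """
    rows, cols = len(zmap), len(zmap[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r0 in range(rows):                   # top to bottom
        for c0 in range(cols - 1, -1, -1):   # right to left
            if zmap[r0][c0] is None or labels[r0][c0]:
                continue
            # First unlabelled pixel met by the scan seeds a new segment.
            next_label += 1
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and zmap[nr][nc] is not None
                            and labels[nr][nc] == 0
                            and abs(zmap[nr][nc] - zmap[r][c]) <= eps):
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
    return labels
```

Because the scan runs from the top of the image, the first segment found is the one containing the topmost foreground pixel, and later segments are grown only from pixels that earlier segments did not claim. A flood fill can leak between two objects if some path of small zw steps connects them, so this sketch is only suitable for clean maps.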

[Segment Tracking]
FIG. 9 shows an overview of 3D object tracking based on a linear probability model. FIG. 9(a) shows the positions of objects X1, X2, and X3 detected at time t-1, and FIG. 9(b) shows the positions of the two segments Y1 and Y2 detected at time t. Tracking consists of finding the correspondence between the segments Y1 and Y2 detected at time t and the objects X1, X2, and X3 detected at time t-1. Each variable Xi (i = 1, 2, 3) encodes certain features of object Xi that are relevant to tracking. These features include 3D position, placement, appearance, and motion.

For example, let (x1, y1, z1) be the 3D position of the object, (h1, w1) its height and width, and (r1, g1, b1) its average color; then X1 = (x1, y1, z1, w1, r1, g1, b1) can be assumed. The same assumption is made for Yj (j = 1, 2). In establishing this correspondence, it is assumed that the number of objects in the scene may change over time. For example, one person may leave the scene while another enters it. The system handles such situations automatically.

In general, the set of objects detected at time t-1 can be written as
X = (X1, X2, ..., Xn), where n is the number of objects,
and the set of segments detected at time t as
Y = (Y1, Y2, ..., Ym), where m is the number of segments.
Note that m ≠ n in general.

For tracking, consider a matrix P with m + 1 rows and n columns. Each column corresponds to an object detected in the previous frame, and each of the first m rows corresponds to a segment detected in frame t. One additional row is appended to the matrix; it gives the probability that each person detected at time t-1 has disappeared at time t. With i denoting the row number and j the column number, each entry P(i, j) is the probability that Yi corresponds to Xj.

FIG. 9(d) shows an example of such a matrix P. Once the matrix P has been determined, the next step is to find the optimal assignment. FIG. 9(c) shows an example of such an assignment: Y1 is assigned to X3, Y2 is assigned to X2, and X1 is regarded as having disappeared at time t (it is assigned to φ). The probability of this assignment is P(1, 3) × P(2, 2) × P(3, 1) = 0.343, which is the highest value among all possible assignments.
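The search for the best assignment can be illustrated with an exhaustive search, which is practical only for small n and m. The specification does not state how the optimum is found, so the brute-force strategy and the function name `best_assignment` are assumptions. The matrix values below are hypothetical, chosen only so that the winning product equals 0.343; they are not the values of FIG. 9(d).

```python
from itertools import product

def best_assignment(P):
    """P has m + 1 rows (m segments plus a final 'disappeared' row)
    and n columns (objects from the previous frame).

    Returns (probability, rows), where rows[j] is the 0-based row
    assigned to object j. A segment row may be used at most once;
    the final row may be used by any number of objects.
    """
    m, n = len(P) - 1, len(P[0])
    best_prob, best_rows = -1.0, None
    for rows in product(range(m + 1), repeat=n):
        used = [i for i in rows if i < m]
        if len(used) != len(set(used)):
            continue  # two objects claimed the same segment
        prob = 1.0
        for j, i in enumerate(rows):
            prob *= P[i][j]
        if prob > best_prob:
            best_prob, best_rows = prob, rows
    return best_prob, best_rows

# Hypothetical matrix: rows are Y1, Y2, disappeared; columns are X1, X2, X3.
P = [
    [0.2, 0.1, 0.7],
    [0.1, 0.7, 0.2],
    [0.7, 0.2, 0.1],
]
prob, rows = best_assignment(P)
```

With these values the search selects the disappeared row for X1, Y2 for X2, and Y1 for X3, which matches the assignment pattern described for FIG. 9(c).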

The case m > n indicates that new objects have entered the scene. The algorithm finds no assignment for objects that have just entered the scene; they are interpreted as newly arrived objects and added to the list of persons detected at time t. The values of the matrix P are calculated from the degree of similarity between the objects X1, X2, X3 and the segments Y1, Y2. Several properties of the objects can be used to measure this similarity, for example the 3D distance between objects, color similarity, face similarity, or a combination of these.
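As one illustration of filling the matrix P from a similarity measure, the sketch below uses only the 3D distance between positions. The Gaussian weighting, the constant weight for the disappeared row, the per-column normalization, and the names `build_matrix`, `sigma`, and `vanish_weight` are all assumptions made for this example; the specification does not give a formula.

```python
import math

def build_matrix(objects, segments, sigma=0.5, vanish_weight=0.1):
    """Return an (m + 1) x n matrix of assignment probabilities.

    objects  : list of n (x, y, z) positions from time t-1
    segments : list of m (x, y, z) positions from time t
    Each column is normalized to sum to 1; the last row is the
    'disappeared' option.
    """
    def similarity(a, b):
        # Closer positions give a weight nearer to 1.
        return math.exp(-math.dist(a, b) ** 2 / (2 * sigma ** 2))

    raw = [[similarity(y, x) for x in objects] for y in segments]
    raw.append([vanish_weight] * len(objects))
    col_sums = [sum(row[j] for row in raw) for j in range(len(objects))]
    return [[row[j] / col_sums[j] for j in range(len(objects))] for row in raw]
```

Color or face similarity could be folded into `similarity` in the same way, for example by multiplying several such weights together.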

FIG. 1 is a flowchart showing the procedure for carrying out an object tracking method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing the configuration of an object tracking system according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram for explaining the method of FIG. 1.
FIG. 4 is a diagram showing the relationship between a point cloud and noise.
FIG. 5 is a diagram showing the relationship between camera coordinates and world coordinates.
FIG. 6 is a diagram showing the relationship between objects and the camera.
FIG. 7 is a diagram showing a world-Z map.
FIG. 8 is a diagram showing the object segmentation procedure.
FIG. 9 is a diagram showing the procedure for 3D tracking of objects.

Explanation of symbols

1 Space
2 3D camera
3 Processing device (computer)

Claims

A method for three-dimensional tracking of objects, comprising the steps of:
analyzing video from a 3D camera that observes a predetermined environment from obliquely above, and forming a depth map of the background of the environment and a depth map of the current scene of the environment;
comparing the depth maps of the background and the current scene to form a foreground depth map;
calculating the three-dimensional coordinates of each point of the foreground based on the foreground depth map;
converting the three-dimensional coordinates into world coordinates;
extracting only the zw value of the world coordinates for each point of the foreground and mapping it to the corresponding pixel of the image plane to form a world-Z map;
detecting points at which the zw value changes greatly between adjacent pixels on the world-Z map, thereby detecting the boundaries of one or more objects included in the foreground;
segmenting the objects based on the detected boundaries; and
creating a three-dimensional trajectory for each segmented object.

The method according to claim 1, wherein the depth map of the background is formed from video captured when no object is present in the environment, and the foreground depth map is formed by subtracting the depth map of the background from the depth map of the current scene.

The method according to claim 1, further comprising a step of filtering the foreground three-dimensional coordinate data by an incident angle with respect to the 3D camera to remove noise.

A system for three-dimensional tracking of objects, comprising:
a 3D camera that observes a predetermined environment from obliquely above; and
a processor that analyzes video from the 3D camera,
wherein the processor executes the procedures of:
forming a depth map of the background of the environment and a depth map of the current scene from the video;
comparing the respective depth maps to calculate a foreground depth map;
calculating the three-dimensional coordinates of each point of the foreground based on the foreground depth map;
converting the calculated three-dimensional coordinates into world coordinates;
extracting only the zw value of the world coordinates for each point of the foreground and mapping it to the corresponding pixel of the image plane to form a world-Z map;
detecting points at which the zw value changes greatly between adjacent pixels on the world-Z map, thereby detecting the boundaries of one or more objects included in the foreground;
segmenting the objects based on the detected boundaries; and creating a three-dimensional trajectory for each segmented object.

The system according to claim 4, wherein the depth map of the background is formed from video captured when no object is present in the environment, and the foreground depth map is formed by subtracting the depth map of the background from the depth map of the current scene.

The system according to claim 4, wherein the processor further filters the foreground three-dimensional coordinate data according to the angle of incidence with respect to the 3D camera to remove noise.