JP2023081582A - Information processing device, information processing method, and program

Info

Publication number: JP2023081582A
Application number: JP2021195407A
Authority: JP
Inventors: 敬正角田; Norimasa Kadota
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2023-06-13

Abstract

To enable acquisition of the position and trajectory of a target object in a three-dimensional space.

SOLUTION: An information processing device detects a prescribed target object from images photographed by each of a plurality of photographing means and, for each photographing means, predicts a state of the target object at a second point in time after a first point in time, based on the detection result of the target object for the images photographed at the first point in time. The information processing device then updates, for each photographing means, the predicted state of the target object based on the detection result of the detection means and the prediction result of the prediction means, and estimates a photographing timing for each photographing means based on the updated state of the target object and the detection result of the detection means.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing technique applicable when an object is imaged by a plurality of imaging devices and, for example, tracked.

There is a technique for estimating the position of a subject in a three-dimensional space based on images captured by a plurality of fixed cameras. In this technique, the identity of the subject is determined across a plurality of temporally consecutive images (hereinafter referred to as frames) acquired by a plurality of synchronized cameras, and the trajectory of the subject in the three-dimensional space is estimated.

Patent Document 1 discloses a technique for estimating the trajectories of a plurality of persons on a plane. In the technique described in Patent Document 1, a large number of particles are placed on a bird's-eye-view plane looking down on a field such as a soccer field, and the next position is predicted from a model of a person's movement on that plane. The predicted particles are then projected onto a camera frame, and the trajectories of the plurality of persons on the plane are estimated by a particle filter that relocates the particles to regions of the frame that are likely to be foreground. Patent Document 2 discloses a technique for estimating the trajectory of a person by a particle filter that treats the position and face orientation of the person in a three-dimensional space as state variables and relocates the particles using a classifier suited to the state of the person's face orientation.

Patent Document 1: JP 2013-58132 A
Patent Document 2: JP 2008-26974 A

When the object to be tracked is one whose acceleration changes greatly and which moves at high speed, such as a ball during a sports match, tracking may fail. The same happens with the techniques disclosed in Patent Document 1 and Patent Document 2: the position and trajectory of the tracked object in the three-dimensional space can no longer be acquired, and tracking fails. This is thought to occur because, for example, the frame rate of the cameras is insufficient to track a fast-moving object, or the imaging timings of the cameras are not synchronized, so that the position and trajectory in the three-dimensional space cannot be acquired.

Accordingly, an object of the present invention is to make it possible to acquire the position and trajectory of an object in a three-dimensional space.

An information processing apparatus of the present invention includes: detection means for detecting a predetermined object from images respectively captured by a plurality of imaging means; prediction means for predicting, for each of the imaging means, the state of the object at a second time point after a first time point, based on the detection result of the object for an image captured at the first time point; updating means for updating, for each of the imaging means, the predicted state of the object based on the detection result of the detection means and the prediction result of the prediction means; and estimation means for estimating an imaging timing for each of the imaging means based on the updated state of the object and the detection result of the detection means.

According to the present invention, the position and trajectory of an object in a three-dimensional space can be acquired.

A diagram showing the functional configuration of the three-dimensional tracking device of the embodiments.
A diagram showing an example arrangement of cameras and objects, and examples of images captured by the cameras.
A flowchart showing the flow of information processing according to the first and second embodiments.
A diagram explaining the FOVs of the detectors and the number of FOVs that overlap in the space.
A diagram showing an example of the detector FOVs and object arrangement, and of frames and detector outputs.
A diagram explaining a method of assigning observed values to objects.
A diagram explaining the imaging timing of each camera.
A diagram explaining the posterior probability distribution of the imaging timing of each camera.
A flowchart showing the flow of information processing according to the third embodiment.

Embodiments are described below with reference to the drawings. The following embodiments do not limit the present invention, and not all combinations of the features described in the embodiments are essential to the solution provided by the present invention. The configuration of each embodiment may be modified or changed as appropriate according to the specifications of the device to which the present invention is applied and to various conditions (conditions of use, operating environment, and the like). Parts of the embodiments described below may also be combined as appropriate. In the following embodiments, identical components are given the same reference numerals.

This embodiment describes an example in which the position and trajectory of a tracked object in a three-dimensional space are estimated from images captured by a plurality of imaging devices (hereinafter referred to as cameras where appropriate), and the object is tracked. In this embodiment, players and a ball during a sports match are used as examples of tracked objects.

When the object to be tracked is one whose acceleration changes greatly and which moves at high speed, such as a ball, tracking may fail if the images are captured by cameras with an ordinary frame rate intended for acquiring video for viewing. In the techniques disclosed in Patent Document 1 and Patent Document 2, the time step of the prior distribution, which predicts the three-dimensional position of the tracked object at the next time from its three-dimensional position at a given time, is the same as the time interval between frames captured by the camera. In other words, the frequency of the tracking result (referred to as the tracking rate) equals the predetermined imaging rate of the camera, that is, the frame rate. For example, when a camera operating at 30 frames per second (hereinafter, fps) images an object moving at 50 km/h, the object moves about 0.5 m in the three-dimensional space between one frame and the next. As a result, the likelihood calculated from the object detection position in the later frame may deviate greatly from the prediction of the prior distribution, and tracking may fail. The tracking rate can be raised by using high-speed cameras that capture images at more than 30 fps, but this requires a correspondingly expensive system. Furthermore, with the techniques of Patent Document 1 and Patent Document 2, when the cameras constituting the imaging system are connected to one another via a local area network (LAN), the imaging timings of the cameras are not synchronized and tracking may fail.
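
The per-frame displacement quoted above can be checked with a short calculation. The following Python sketch is only an illustration; the function name `displacement_per_frame` is hypothetical and does not appear in the source.

```python
def displacement_per_frame(speed_kmh: float, fps: float) -> float:
    """Distance in metres an object travels between two consecutive frames."""
    speed_ms = speed_kmh * 1000.0 / 3600.0  # convert km/h to m/s
    return speed_ms / fps


# 50 km/h observed at 30 fps, the example used in the description.
d = displacement_per_frame(50.0, 30.0)
print(f"{d:.3f} m")  # prints "0.463 m", i.e. roughly 0.5 m
```

Doubling the effective tracking rate halves this displacement, which is why a tracking rate above the camera frame rate helps keep the observation close to the prediction.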

This embodiment therefore provides the information processing apparatus described below so that, in a system in which a plurality of cameras with a general-purpose frame rate such as 30 fps are connected via a LAN or the like, an object can be tracked at a tracking rate exceeding the frame rate of each camera. That is, the information processing apparatus of this embodiment can estimate the three-dimensional position and trajectory of the tracked object with high accuracy, and can thus track the object, even when the general-purpose-frame-rate cameras are not synchronized.

<First Embodiment>
As a first embodiment, an imaging system in which a plurality of cameras are arranged around a three-dimensional space such as a stadium or a gymnasium is described as an example. The information processing apparatus of the first embodiment estimates the imaging timings of the cameras and makes it possible to track an object in the three-dimensional space with a temporal resolution equal to or higher than the frame rate of the cameras.

Here, the imaging timing of a camera is defined as follows. Taking as the reference time the time at which the shutter is released in any one of the plurality of cameras, the imaging timing of each of the other cameras is the time difference (in seconds) between the reference time and the time at which that camera's shutter is released. In this embodiment, a time is a single instant on a continuous time axis, and the times at which the shutters of the cameras are released are instants on the same time axis. When the frame rate of each camera is 30 fps, the difference between the largest and smallest of the imaging timings of the cameras is no more than the duration of one frame (about 0.033 seconds at 30 fps).
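
As an illustration of this definition, the sketch below converts shutter-release times on a common time axis into imaging timings relative to a reference camera, and checks the one-frame bound. The function names and the sample release times are hypothetical and are not taken from the source.

```python
def imaging_timings(release_times, reference_index=0):
    """Offsets in seconds of each camera's shutter release from the reference camera."""
    ref = release_times[reference_index]
    return [t - ref for t in release_times]


def spread_within_one_frame(timings, fps):
    """True if the largest and smallest timings differ by at most one frame period."""
    return max(timings) - min(timings) <= 1.0 / fps


# Hypothetical release times (seconds) of four unsynchronized 30 fps cameras.
releases = [12.000, 12.010, 12.025, 11.995]
offsets = imaging_timings(releases)  # camera 0 is the reference, so offsets[0] == 0.0
print(spread_within_one_frame(offsets, fps=30.0))
```

Because only differences matter, the choice of reference camera shifts every offset by the same amount and leaves the spread unchanged.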

The information processing apparatus of this embodiment estimates the imaging timing of each camera based on the position of the object detected from the image captured by each camera and on the state of the object in the three-dimensional space. In this embodiment, the state of an object includes at least its position and velocity in the three-dimensional space; when orientation matters, as with a person's face (the direction in which the face is turned), the state includes the orientation in addition to the position and velocity. Based on the order of the estimated imaging timings from earliest to latest, the information processing apparatus selects each camera (or each camera's image) in turn, performs object detection, and uses the detection results to track the position of the object in the three-dimensional space. In this way, the information processing apparatus of this embodiment can track an object with a temporal resolution equal to or higher than the frame rate of the cameras. The details are described later. In the following description, a target that is detected from an image and tracked is referred to simply as an object where appropriate.

FIG. 1(a) is a diagram showing the functional configuration of the three-dimensional tracking device 200 of the first embodiment. Before the three-dimensional tracking device 200 according to the first embodiment is described, an example of the imaging system assumed in this embodiment is described with reference to FIGS. 2(a) to 2(d).

FIG. 2(a) is a diagram showing an example of an arrangement 100 of a plurality of cameras. A space 101 is a three-dimensional space in which the cameras are arranged and which is imaged by those cameras; cameras 102 to 111 are arranged in it. FIG. 2(a) also shows the X-axis 113x, Y-axis 114y, and Z-axis 115z of the three-dimensional coordinates (world coordinates) of the space 101, together with the origin 112. The plane formed by the X-axis 113x and the Z-axis 115z is the ground, and the Y-axis 114y runs in the height direction, with the direction toward the ground being positive (that is, a height above the ground is expressed as a negative value). Each of the cameras 102 to 111 is fixed to a wall portion or the like of the space (not shown) at a certain height above the ground, has a lens, an image sensor, and so on, and is installed so that it can image an imaging area including the ground of the space 101. In this embodiment, each of the cameras 102 to 111 is installed at a position from which it can image objects (tracking targets) present in the space 101. It is also assumed that each of the cameras 102 to 111 has been calibrated by a known technique.

FIG. 2(b) is a diagram showing an example of an arrangement 120 of persons and other objects present in the space at a certain time. The space 121 is the same space as the space 101 in FIG. 2(a). In the space 121 illustrated in FIG. 2(b), persons 122, 123, and 124 and a ball 125 are assumed to be present as objects to be tracked.

FIG. 2(c) is a diagram showing an example of an image 131 of one frame 130 obtained when, among the cameras 102 to 111, the camera 104 for example images the space 121 of FIG. 2(b). The image 131 captured by the camera 104 shows persons 132, 133, and 134 and a ball 135. The persons 132, 133, and 134 correspond to the persons 122, 123, and 124, respectively, in the arrangement 120 shown in FIG. 2(b). Likewise, the ball 135 corresponds to the ball 125 in the arrangement 120 shown in FIG. 2(b). The coordinate system of the frame 130 is a pixel coordinate system defined by an origin 138, a u-axis 136u, and a v-axis 137v. A suitably normalized coordinate system may also be used.

FIG. 2(d) is a diagram showing an example of an image 141 of one frame 140 obtained when, among the cameras 102 to 111, the camera 102 for example images the space 121 of FIG. 2(b). The image 141 captured by the camera 102 shows persons 142 and 143 and a ball 144. The persons 142 and 143 correspond to the persons 122 and 123, respectively, in the arrangement 120 shown in FIG. 2(b). Likewise, the ball 144 corresponds to the ball 125 in the arrangement 120 shown in FIG. 2(b). The coordinate system of the frame 140 is a pixel coordinate system defined by an origin 145, a u-axis 146u, and a v-axis 147v.
The frame 130 of FIG. 2(c) captured by the camera 104 and the frame 140 of FIG. 2(d) captured by the camera 102 are given here as examples, but the cameras other than the cameras 102 and 104 capture images in the same manner.
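
The relationship between the world coordinates of FIG. 2(a) and the pixel coordinates of FIGS. 2(c) and 2(d) can be illustrated with a pinhole projection. The source states only that the cameras are calibrated by a known technique, so the model below, the function name `project_point`, and every parameter value are assumptions made for illustration; lens distortion is ignored.

```python
def project_point(world_xyz, rotation, translation, fx, fy, cx, cy):
    """Project a world point to pixel coordinates with a pinhole model.

    rotation is a 3x3 world-to-camera matrix given as nested lists and
    translation is a 3-vector. Returns (u, v, depth), where depth is the
    z coordinate of the point in the camera frame.
    """
    cam = [
        sum(rotation[i][j] * world_xyz[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    x, y, z = cam
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy, z


# Hypothetical camera: identity pose, 1000 px focal length, 1920x1080 image.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v, depth = project_point((1.0, -0.5, 5.0), identity, (0.0, 0.0, 0.0),
                            fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
print(u, v, depth)  # 1160.0 440.0 5.0
```

With the Y axis positive toward the ground, a point above the camera's optical axis has a negative camera-frame y in this example and therefore lands above the image centre (smaller v).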

Next, the functional configuration of the three-dimensional tracking device 200 according to the first embodiment shown in FIG. 1(a) is described with reference to FIG. 2(a).
The three-dimensional tracking device 200 of this embodiment has an imaging unit 220, a processing unit 230, and a monitoring unit 240.
The imaging unit 220 includes a plurality of moving image acquisition devices. In the example of FIG. 1(a), the imaging unit 220 has K moving image acquisition devices in total, from a first moving image acquisition unit 201 to a K-th moving image acquisition unit 202. In the example of FIG. 2(a), the cameras 102 to 111 are illustrated as the K moving image acquisition devices, and FIG. 1(a) shows them as the first moving image acquisition unit 201 to the K-th moving image acquisition unit 202. In the following description, the K moving image acquisition units from the first moving image acquisition unit 201 to the K-th moving image acquisition unit 202 are referred to as cameras where appropriate. That is, a camera in the following description denotes any one of the first moving image acquisition unit 201 to the K-th moving image acquisition unit 202. The imaging unit 220 is assumed to be connected to the processing unit 230 via a communication path such as a local area network.

The processing unit 230 is an example application of the information processing apparatus of this embodiment. The functions of the processing unit 230 are described in detail below with reference to FIG. 3 and other drawings.
FIG. 3(a) is a flowchart showing the flow of information processing in the processing unit 230 of the three-dimensional tracking device 200. An overview of the overall processing in the three-dimensional tracking device 200 is first given with reference to this flowchart.
In step S101, the initial value setting unit 205 sets the number of objects (persons, ball) to be tracked in the three-dimensional space, an ID (identification information) for identifying each object, and initial values of the state of each object. The initial value setting unit 205 also sets initial values of the estimates, described later, of the imaging timing of each of the K cameras of the imaging unit 220, that is, the first moving image acquisition unit 201 to the K-th moving image acquisition unit 202.

In the loop L101 to L111, the processing unit 230 executes iterative processing over time. Specifically, the processing unit 230 executes iterative processing in which a time index t is assigned in order from 1 to T.
In the loop L102 to L112, the processing unit 230 executes iterative processing over the first moving image acquisition unit 201 to the K-th moving image acquisition unit 202. Specifically, the processing unit 230 executes iterative processing in which a camera index k is assigned in order from 1 to K.

In step S102, the moving image acquisition selection unit 203 selects, from among the first moving image acquisition unit 201 to the K-th moving image acquisition unit 202 of the imaging unit 220, the camera from which a frame is to be acquired, based on the current estimates of the imaging timings (described later) held by the timing estimation unit 210. Specifically, the moving image acquisition selection unit 203 first selects the camera with the earliest imaging timing and thereafter selects the cameras one by one in order of imaging timing from earliest to latest. In other words, based on the current estimates of the imaging timings, the moving image acquisition selection unit 203 first selects, from among the images acquired by the K cameras, the image acquired by the camera with the earliest imaging timing, and thereafter selects the images one by one in order of imaging timing from earliest to latest.
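
The selection rule of step S102 amounts to visiting the cameras in ascending order of their estimated imaging timings. A minimal Python sketch follows; the function name `selection_order` and the sample estimates are hypothetical, and cameras are indexed from 0 here rather than from 1 as in the flowchart.

```python
def selection_order(timing_estimates):
    """Camera indices sorted from the earliest to the latest estimated imaging timing."""
    return sorted(range(len(timing_estimates)), key=lambda k: timing_estimates[k])


# Hypothetical timing estimates (seconds) for four cameras.
estimates = [0.020, 0.000, 0.031, 0.008]
print(selection_order(estimates))  # [1, 3, 0, 2]
```

Processing the frames in this order means that successive observations are separated by less than one frame period, which is what allows a tracking rate above the frame rate of any single camera.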

Next, in step S103, the processing unit 230 acquires one current frame (still image) captured by the moving image acquisition unit selected in the preceding step (hereinafter, camera k) from among the moving image acquisition units of the imaging unit 220. The image of the current frame captured by camera k is input to the detection unit 204.

In step S104, the detection unit 204 detects the objects appearing in the frame acquired in the preceding step. When a plurality of objects appear in the frame, the detection unit 204 detects each of them. The detection unit 204 then obtains the position of each object detected in the frame and a score representing how likely each detection is, and treats these as the observed values of the objects actually detected from the frame captured by the camera. In this embodiment the objects are a plurality of persons and one ball, as described above, so the detection unit 204 acquires their positions and scores as the actual observed values of the objects. The positions and scores of the objects (persons and ball) are described in detail later. The ID association unit 207 assigns, to the observed values of each object detected by the detection unit 204, identification information for identifying each of them individually.

Next, in the loop L103 to L113, the processing unit 230 executes iterative processing over objects. That is, the processing unit 230 executes iterative processing in which an object index n is assigned in order from 1 to N. The value N is the value acquired in step S101 as the initial number of objects. In the soccer match used as an example in this embodiment, the objects are a plurality of persons and one ball, and the index n is assigned in order from 1 to N to each of these objects. Hereinafter, the object assigned index n is referred to as object n.

Next, in step S105, with the current time denoted by t (t ≥ 1), the prediction unit 206 predicts the probability distribution of the state of object n at time t based on the state of object n, such as its position and velocity, obtained at the earlier time t-1. For example, if time t-1 is the first time point and the later time t is the second time point, the prediction unit 206 in step S105 predicts the state of the object at the second time point (predicts its probability distribution) based on the state of the object obtained in the processing at the first time point. The details of this processing are described later.
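
The source defers the details of the prediction in step S105. As a rough illustration of what predicting a Gaussian state distribution can look like, the sketch below applies a constant-velocity motion model to one coordinate axis. The motion model, the noise model, the function name `predict_axis`, and the parameter `accel_noise` are all assumptions of this sketch, not the method stated in the source.

```python
def predict_axis(mean, cov, dt, accel_noise):
    """Constant-velocity prediction for one axis of a Gaussian state.

    mean = (position, velocity); cov = ((a, b), (b, c)) is its 2x2 covariance.
    accel_noise is a hypothetical white-noise acceleration intensity.
    Returns the predicted (mean, cov) after dt seconds.
    """
    p, v = mean
    (a, b), (_, c) = cov
    new_mean = (p + v * dt, v)
    # Covariance propagation F P F^T with F = [[1, dt], [0, 1]], plus process noise Q.
    q11 = accel_noise * dt ** 3 / 3.0
    q12 = accel_noise * dt ** 2 / 2.0
    q22 = accel_noise * dt
    new_a = a + 2.0 * dt * b + dt * dt * c + q11
    new_b = b + dt * c + q12
    new_c = c + q22
    return new_mean, ((new_a, new_b), (new_b, new_c))


# Ball moving at about 13.9 m/s (50 km/h), predicted 0.01 s ahead.
mean, cov = predict_axis((0.0, 13.9), ((0.01, 0.0), (0.0, 1.0)), dt=0.01, accel_noise=50.0)
print(mean)
```

A full state would apply the same step to each of the three axes, and the uncertainty grows with dt, so shorter steps between observations keep the predicted distribution tighter.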

Next, at branch B101, the processing unit 230 branches the processing depending on whether there is any probability that object n is present within the FOV of camera k, based on the state (position, velocity, and so on) of object n predicted in the preceding step and on the information of the FOV overlap map described later. If there is a probability that object n is present within the FOV of camera k, the processing unit 230 advances the processing to step S106; otherwise, it advances the processing to step S108.
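
The FOV overlap map used at branch B101 is described later in the source. As a stand-in, the sketch below decides the branch with a simple bounds test on the predicted pixel position, widened by a multiple of its standard deviation. This test, the function name `may_be_in_fov`, and the default `n_sigma` are assumptions made for illustration only.

```python
def may_be_in_fov(u, v, depth, sigma_px, width, height, n_sigma=3.0):
    """Return True if the predicted object could plausibly appear in the image.

    (u, v) is the predicted pixel position, depth the camera-frame distance,
    and sigma_px the predicted positional standard deviation in pixels.
    """
    if depth <= 0.0:
        return False  # behind the camera, so it cannot be observed
    margin = n_sigma * sigma_px
    return (-margin <= u < width + margin) and (-margin <= v < height + margin)


# Branch B101: go to S106 when True, skip to S108 when False.
print(may_be_in_fov(960.0, 540.0, 10.0, 5.0, 1920, 1080))  # True
print(may_be_in_fov(-20.0, 540.0, 10.0, 5.0, 1920, 1080))  # False
```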

In step S106, the prediction unit 206 predicts the probability distribution of the observed values that should be obtained at time t, based on the state of object n, such as its position and velocity, obtained at the past time t-1 preceding the current time t (t ≥ 1). That is, if time t-1 is the first time point and time t is the second time point, the prediction unit 206 in step S106 predicts the probability distribution of the observed values that should be obtained at the second time point, based on the state of the object obtained in the processing at the first time point. The details of this processing are described later.

Next, in step S107, the ID association unit 207 calculates the likelihood of the actual observed values, based on the predicted probability distribution of the observed values of object n in camera k at time t obtained as the prediction result in the preceding step and on the actual observed values of camera k obtained in step S104. Based on the likelihood of the observed values, the ID association unit 207 then associates object n in the three-dimensional space with an observed value. Concretely, this association is made between the identification information assigned to object n and the observed value. After being associated in this way by the ID association unit 207, the detection result of the detection unit 204 and the prediction result of the prediction unit 206 are sent to the update unit 208.
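
A likelihood-based association of this kind can be sketched as follows. The source does not specify the assignment rule in this passage, so the isotropic Gaussian likelihood, the greedy one-to-one matching, the threshold `min_likelihood`, and all names and sample values below are assumptions of the sketch.

```python
import math


def gaussian_likelihood(obs, mean, var):
    """Density of an isotropic 2D Gaussian (variance var per axis) at obs = (u, v)."""
    du, dv = obs[0] - mean[0], obs[1] - mean[1]
    return math.exp(-(du * du + dv * dv) / (2.0 * var)) / (2.0 * math.pi * var)


def associate(predicted, detections, min_likelihood=1e-9):
    """Greedily match each object ID to at most one detection index.

    predicted maps object ID -> ((u, v), var); detections is a list of (u, v).
    Objects without a sufficiently likely detection map to None.
    """
    pairs = []
    for oid, (mean, var) in predicted.items():
        for j, det in enumerate(detections):
            pairs.append((gaussian_likelihood(det, mean, var), oid, j))
    pairs.sort(key=lambda item: item[0], reverse=True)  # most likely pairs first
    result = {oid: None for oid in predicted}
    used = set()
    for lik, oid, j in pairs:
        if lik < min_likelihood:
            break
        if result[oid] is None and j not in used:
            result[oid] = j
            used.add(j)
    return result


predicted = {"ball": ((100.0, 100.0), 25.0), "person1": ((300.0, 200.0), 100.0)}
detections = [(305.0, 198.0), (102.0, 99.0), (900.0, 900.0)]
print(associate(predicted, detections))  # {'ball': 1, 'person1': 0}
```

The third detection stays unassigned because it is far from every prediction; a globally optimal assignment method could replace the greedy loop if ambiguous cases matter.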

In step S108, the update unit 208 updates the state of object n at time t using the observed values associated with object n in the processing of steps S106 and S107. That is, if time t-1 is the first time point and time t is the second time point, the update unit 208 updates the state of object n at the second time point based on the actual observed values and the predicted state of the object. The updated state of object n thus corresponds to the state of object n at a third time point. The details of this processing are described later.
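
The source defers the details of the update in step S108. To show the general shape of combining a predicted Gaussian state with an associated observation, the sketch below performs a linear Kalman-style update on one axis, observing position directly. In the device the observation is an image position from one camera, so the real measurement model differs; the linear model, the function name `update_axis`, and the handling of a missing observation are assumptions of this sketch.

```python
def update_axis(mean, cov, observed_position=None, obs_var=1.0):
    """Update a (position, velocity) Gaussian with a direct position observation.

    cov = ((a, b), (b, c)). If observed_position is None (no observation was
    associated with the object), the prediction is returned unchanged.
    """
    if observed_position is None:
        return mean, cov
    p, v = mean
    (a, b), (_, c) = cov
    s = a + obs_var              # innovation variance
    k0, k1 = a / s, b / s        # gain for position and velocity
    y = observed_position - p    # innovation
    new_mean = (p + k0 * y, v + k1 * y)
    new_cov = ((a - k0 * a, b - k0 * b), (b - k1 * a, c - k1 * b))
    return new_mean, new_cov


mean, cov = update_axis((1.0, 10.0), ((1.0, 0.0), (0.0, 1.0)), observed_position=2.0)
print(mean, cov)  # (1.5, 10.0) ((0.5, 0.0), (0.0, 1.0))
```

With equal prior and observation variances the updated position lies midway between prediction and observation, and the position variance is halved.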

Next, in step S109, the visualization unit 209 executes processing for visualizing the state of object n updated in step S108. The result of this visualization represents the tracking status of the object and is sent to the monitoring unit 240 for display.
Next, in step S110, the timing estimation unit 210 estimates the imaging timing of each of the K cameras based on the state of object n updated in step S108 and on the detection result obtained by the detection unit 204 in step S104. The detection result of the detection unit 204 is sent to the timing estimation unit 210 via the update unit 208 after being associated by the ID association unit 207. The details of these processes are described later.

Next, the information processing in the processing unit 230 of the three-dimensional tracking device 200 shown in FIG. 1(a) will be described in more detail, following the flowchart shown in FIG. 3(a).
In this embodiment, the position, velocity, and orientation of the tracked object in the three-dimensional space are treated as unobservable hidden variables, and the position and score of the object detected from the frame image captured by a camera are treated as observed values. The embodiment uses a state-space model framework in which the hidden variables (the position, velocity, and orientation of the object in the three-dimensional space) are estimated from the observed values (the position and score of the object) detected in the camera images. In this framework, the hidden variables are random variables called state variables. This embodiment uses a Gaussian state-space model, in which the state variables are described only by their mean (first moment) and variance-covariance matrix (second moment).

In step S101, the initial value setting unit 205 sets the initial values of the estimated imaging timings of the K cameras, the number of objects to be tracked (persons, ball) in the three-dimensional space, their IDs (identification information), and the initial values of the state variables. The initial values of the imaging-timing estimates need not be exact; arbitrary values, for example random values, may be used. However, the difference between the largest and smallest imaging-timing estimates across the cameras must not exceed the elapsed time of one frame (0.033 seconds at 30 fps).
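
One simple way to satisfy the one-frame constraint on the initial timing estimates is to draw every value from a single frame period. The sketch below assumes uniform random draws; the function name and the seed argument are hypothetical.

```python
import random

def init_timing_offsets(num_cameras, fps=30.0, seed=None):
    """Draw arbitrary initial imaging-timing estimates (seconds), one per camera."""
    rng = random.Random(seed)
    frame_period = 1.0 / fps
    # All draws lie within one frame period, so max - min cannot exceed it.
    return [rng.uniform(0.0, frame_period) for _ in range(num_cameras)]

offsets = init_timing_offsets(10, fps=30.0, seed=0)
spread = max(offsets) - min(offsets)
```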

The initial values for the number of tracked objects, the IDs, and the state variables depend on whether the object is unique in the three-dimensional space or one of several. For a unique object, the initial values need not be exact and rough values suffice, because even from rough initial values the estimate can be expected to converge to the vicinity of the true value as the state-space model is repeatedly updated. In this case, the first moment of the state variable may be set to an arbitrary value and the second moment to a sufficiently large positive semidefinite matrix. Since the ID is identification information for distinguishing individual objects, a natural number of 1 or more that does not collide with the IDs of other types of objects is assigned. A concrete example of a unique object in the three-dimensional space is the ball in a soccer match.

For objects of which several exist in the three-dimensional space, the initial value of the number of objects must be set accurately. The positions must also be estimated relatively accurately and used as initial values. When tracking multiple objects, an assignment problem must be solved that associates the position of each object at the next time, predicted from its current position, with the object positions detected in each camera image (described in detail under step S107). The initial position of each object must therefore be accurate enough that solving this assignment problem yields the correct assignment. A concrete example of such objects is the persons, such as players, in a soccer match.

At the start of a soccer match, for example, persons are unlikely to overlap much. In such a scene, determining the initial values by the following method gives sufficiently accurate initial values for the number, IDs, and positions of the persons.

To set these initial values, the height of a person's face is first fixed at a plausible value, such as 1.75 m above the ground, and the current frame is acquired from each of the K cameras. The face detection processing of step S104, described later, is then applied to each frame, and each detected position (u, v) in pixel coordinates is projected into the three-dimensional space using the calibration parameters of the corresponding camera to obtain a position in the three-dimensional space. In this projection, the Y-axis value is fixed at −1.75 m, and the X-axis and Z-axis values are estimated from (u, v). The Y-axis value is negative because, as described with reference to FIG. 2(a), the Y axis is positive toward the ground.
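
With the height fixed, each pixel coordinate gives one linear equation in the two unknowns X and Z. The sketch below assumes a standard 3×4 perspective projection matrix as the calibration parameters; the function name and the example matrix are hypothetical.

```python
def backproject_to_height(P, u, v, y_fixed=-1.75):
    """Solve for (x, z) such that P projects (x, y_fixed, z) onto pixel (u, v).

    P is a 3x4 perspective projection matrix given as nested lists.
    """
    # Row 1 and row 2 of the projection each yield one linear equation in x and z.
    a11 = P[0][0] - u * P[2][0]
    a12 = P[0][2] - u * P[2][2]
    b1 = u * (P[2][1] * y_fixed + P[2][3]) - (P[0][1] * y_fixed + P[0][3])
    a21 = P[1][0] - v * P[2][0]
    a22 = P[1][2] - v * P[2][2]
    b2 = v * (P[2][1] * y_fixed + P[2][3]) - (P[1][1] * y_fixed + P[1][3])
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("degenerate system: cannot intersect the pixel ray with the plane")
    # Cramer's rule for the 2x2 system.
    x = (b1 * a22 - a12 * b2) / det
    z = (a11 * b2 - a21 * b1) / det
    return x, y_fixed, z

# Hypothetical projection matrix, for illustration only.
P_example = [
    [1000.0, 0.0, 500.0, 100.0],
    [0.0, 1000.0, 300.0, 200.0],
    [0.0, 0.0, 1.0, 5.0],
]
# Pixel at which the point (2.0, -1.75, 10.0) appears under P_example.
u_example, v_example = 7100.0 / 15.0, 1450.0 / 15.0
position = backproject_to_height(P_example, u_example, v_example)
```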

Next, the detection points projected into the three-dimensional space from each camera are clustered. Each cluster of detection points is regarded as an object, IDs are assigned according to the number of clusters, and the cluster position is used as the initial position of the object in the three-dimensional space. The IDs obtained in this way become the IDs of the individual objects, and the positions become the values of the first moment of the position components of each object's state variable. The state variables of the state-space model in this embodiment also have components other than position; these may be set to zero. The second moment may be a positive semidefinite matrix of suitable magnitude. Details of the state variables will be described later.

Next, in loops L101 to L111, the processing unit 230 executes iterative processing over time. In loops L102 to L112, the processing unit 230 executes iterative processing over the cameras. A detailed description of these loops is omitted.

Next, in step S102, the moving image acquisition selection unit 203 executes camera selection processing (selection of a moving image acquisition unit) based on the imaging-timing estimates. In this embodiment, as described above, the imaging-timing estimate of a camera is the difference in imaging time (in seconds) between that camera and a reference camera. The imaging-timing estimates are estimated and updated in step S110, described later, for each of the K moving image acquisition units in FIG. 1. As loops L102 to L112 iterate over the cameras, the moving image acquisition selection unit 203 therefore selects the cameras in ascending order of their current imaging-timing estimates: at index k = 1 it selects the camera with the smallest estimate, and at index k = K the camera with the largest estimate. Details such as the method for obtaining the imaging-timing estimates are described later under step S110.
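
The selection order amounts to sorting the cameras by their current timing estimates. A minimal sketch, with a hypothetical function name and made-up offsets:

```python
def camera_processing_order(timing_estimates):
    """Return camera IDs sorted by ascending imaging-timing estimate (seconds)."""
    return sorted(timing_estimates, key=lambda cam: timing_estimates[cam])

# Hypothetical offsets relative to a reference camera (camera 102 here).
estimates = {102: 0.0, 103: 0.021, 104: 0.004, 105: 0.017}
order = camera_processing_order(estimates)
```

Here `order[0]` plays the role of the camera chosen at k = 1 and `order[-1]` the camera chosen at k = K.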

In the next step S103, the processing unit 230 acquires the frame at the current time from camera k selected in the previous step and inputs its image to the detection unit 204. In this embodiment, each camera is assumed to have Full HD resolution (1920×1080 pixels) and a frame rate of 30 frames per second. The camera specifications are not limited to these values.

The shutter timings of the K cameras need not be synchronized by an electrical signal such as a trigger pulse or synchronization signal; each camera is assumed to capture images at its own autonomous period, driven by the clock of its internal microcontroller. The imaging unit 220 and the processing unit 230 are connected via a communication path such as a local area network. Frames acquired and transmitted by the imaging unit 220 and received by the processing unit 230 may therefore be delayed or dropped because of the performance of relay devices such as switching hubs on the network path, bandwidth limits, and the like. In other words, the frames acquired by the processing unit 230 in this embodiment may be asynchronous and may include dropped frames. Frames captured by the imaging unit 220 may be acquired online over a communication path such as a local area network, or they may be stored in an external storage device or the like and acquired from there.

Next, in step S104, the detection unit 204 detects objects in the frame image acquired in the previous step. In this embodiment, the objects to be detected are persons and the ball. For object tracking as in this embodiment, it is desirable that a predetermined position such as the center of gravity of the object can be detected regardless of the object's position and orientation, and that the detected position does not shift as the appearance changes. When a known object detector is used, it is therefore preferable to track an object whose circumscribed rectangle in the frame changes little with position and orientation, rather than one whose circumscribed rectangle changes greatly. A spherical ball is thus an ideally shaped object. For a person, it is desirable to treat the head, which is close to a sphere, as the object to be detected. Handling near-spherical objects reduces the shift in the detected positions output by the detection unit 204 in this step. If a detector is available that can detect the position of an object in the frame without shift regardless of changes in the object's appearance or shape, it may be used instead.

In recent years, image recognition technology using convolutional neural networks (CNNs) has advanced, and techniques for executing multiple image recognition tasks simultaneously at high speed and with high accuracy have been studied. Reference 1 discloses a technique that incorporates a layer for estimating object candidate regions into an object-recognition CNN and performs two recognition tasks, detection of object candidate regions and classification into multiple categories, at an operating speed of 150 fps. In this embodiment, such a fast CNN-based object detector can be used as the detector (detection unit 204).

Reference 1: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection", CVPR 2016

The detector outputs the position and score of each object. The position is the position (u, v) in the pixel coordinates of the image. The score (hereinafter denoted q where appropriate) is a value normalized to [0, 1] that represents how likely the detection is to be the object (in this embodiment, the ball or a person's head); that is, it is a likelihood. The closer the score is to 1, the more likely the detected position (u, v) is to be that object. The output of the detection unit 204 is thus expressed as (u, v, q), which this embodiment calls an observed value. The detection unit 204 may output a bounding box fitted to the size of the object in pixel coordinates, as in Reference 1; in this embodiment, the center of the bounding box is used as (u, v).

In general, a detector that detects objects in an image is limited in the object sizes at which it can recognize the object. The detection unit 204 likewise has a limited detection range within which it can detect the target object in an image captured by a camera.
FIG. 4(a) is a diagram for explaining the FOV (Field of View) determined by the angle of view of the camera and the detection range of the detector. To simplify the drawing, FIG. 4 represents the FOV in a plane containing the optical center of the camera 401. In FIG. 4(a), the region where the angle of view 402 of the camera 401 and the detection range 403 of the detector intersect (the shaded region in the figure) is the FOV 404 of the detector. The FOV 404 thus forms a sector with a specific detection range and angle.

FIG. 4(b) shows an FOV overlap map in which the detector FOVs are superimposed on the space 101 illustrated in FIG. 2(a). The cameras 102 to 111 correspond to the cameras 102 to 111 in FIG. 2(a). The sector-shaped FOV 411 is the FOV of the detector applied to the frames of the camera 109, position 412 is the position of point a in the three-dimensional space, and position 413 is the position of point b. In the FOV overlap map of FIG. 4(b), the sector-shaped FOVs of the cameras are drawn overlapping one another, and the number of overlapping FOVs is indicated by shading. At point a, the detector FOVs of three cameras, 109, 110, and 108, overlap. At point b, the detector FOVs of five cameras, 109, 110, 108, 102, and 106, overlap. Because the camera arrangement in the example of FIG. 2(a) is sparse, there are also regions where few FOVs overlap; region 414, for example, is covered only by the detector FOV of the camera 108. With such an FOV overlap map, once the arrangement of the objects is known, the number of observed values at each camera (the number of objects captured) can be obtained.
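
A lookup in such an overlap map can be sketched as a point-in-sector test on the ground plane. The sector parameterization (camera position, heading, half angle, maximum range) and all numeric values below are assumptions for illustration, not values from the figures.

```python
import math

def in_sector(point, cam_pos, heading, half_angle, max_range):
    """Return True if a 2-D point lies inside a camera's sector-shaped FOV."""
    dx = point[0] - cam_pos[0]
    dz = point[1] - cam_pos[1]
    dist = math.hypot(dx, dz)
    if dist > max_range:
        return False
    if dist == 0.0:
        return True
    bearing = math.atan2(dz, dx)
    # Wrap the angular difference into [-pi, pi] before comparing.
    diff = math.atan2(math.sin(bearing - heading), math.cos(bearing - heading))
    return abs(diff) <= half_angle

def covering_cameras(point, cameras):
    """List the IDs of all cameras whose detector FOV contains the point."""
    return [cam_id for cam_id, fov in cameras.items() if in_sector(point, **fov)]

# Two hypothetical cameras facing each other along the X axis.
cameras = {
    "A": {"cam_pos": (0.0, 0.0), "heading": 0.0, "half_angle": math.pi / 4, "max_range": 10.0},
    "B": {"cam_pos": (10.0, 0.0), "heading": math.pi, "half_angle": math.pi / 4, "max_range": 6.0},
}
both = covering_cameras((5.0, 0.0), cameras)
only_a = covering_cameras((2.0, 0.0), cameras)
none = covering_cameras((-1.0, 0.0), cameras)
```

The length of the returned list is the overlap count that the shading in an FOV overlap map would encode for that point.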

FIG. 5 shows an example of the observed values obtained from a frame of one camera 506 and the detector.
FIG. 5(a) shows the positional relationship between the arrangement of persons and the FOV. The space 501 is the three-dimensional space in which the persons captured by the camera are located. FIG. 5(a) shows persons 502, 503, and 504 and a ball 505. The FOV 507 is the detector FOV when the detector is applied to frames acquired by the camera 506. In this example, the persons 502 and 503 and the ball 505 are inside the detector FOV 507.

FIG. 5(b) visualizes the image of a frame 511 acquired by the camera 506 of FIG. 5(a) together with part of the detector output. FIG. 5(b) shows persons 512 and 513 and a ball 514; the person 512 corresponds to the person 502 in FIG. 5(a), the person 513 to the person 503, and the ball 514 to the ball 505. As an example of visualizing part of the detector output, bounding boxes 515, 516, 517, and 518 are drawn.

The output of the detector used in this embodiment is the center (u, v) and size (h, w) of the bounding box and the score q; the rectangles 515, 516, 517, and 518 are the bounding boxes expressed by (u, v, h, w). The bounding boxes 515, 516, and 517 correspond to the person 512, the person 513, and the ball 514, respectively. The detector output may, however, contain errors: an object to be detected may be present but not detected, or a detection may be output although no such object is present. This embodiment calls the former case "undetected" and the latter "over-detection". The bounding box 518 in FIG. 5(b) is an example of over-detection, where an object is reported although none is present. As noted above, this embodiment calls the detector output (u, v, q) an observed value.

As a result, each object in the space 101 (the heads of the persons and the one ball) is detected by one or more detectors, and the number of detectors varies with the object's position. Because the FOV overlap map can be derived from the camera parameters and detector specifications, it is usually possible to know in advance which cameras and detectors will detect an object at each point. However, because of occlusion caused by overlapping objects, missed detections, and the like, situations can also arise in which no detector reports the object.

Next, in loops L103 to L113, the processing unit 230 executes iterative processing over the objects. That is, as described above, the processing unit 230 iterates with the object index n running from 1 to N, where N is the number of objects (number of clusters) estimated in step S101. A detailed description of this loop is omitted.

To estimate positions in the three-dimensional space and track objects, this embodiment uses the detector output (u, v, q) of each camera as the observed value, as described above, and uses six dimensions as the state variable: the position (x, y, z) of the object in the three-dimensional space and its velocity along the x, y, and z axes. For a person's head, whose appearance in the image and hence detector score q change with position and orientation, a state-space model is used whose state variable has nine dimensions in total, adding the orientation (φ, θ, ψ). Using this state-space model, state estimation and tracking are performed with an extended Kalman filter.

The observed value y and the state variables x of the head and the ball can thus be written as equations (1) to (3), where equation (2) gives the state variable of the head and equation (3) that of the ball.
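
The equations themselves do not appear in this text. A form consistent with the surrounding definitions would be the following sketch; the dot notation for the velocity components, the bold vector symbols, and the omission of per-component subscripts are assumed notation.

```latex
\begin{align}
\mathbf{y}_{t,k_j} &= \left[\, u_{t,k_j},\; v_{t,k_j},\; q_{t,k_j} \,\right]^{T} \tag{1} \\
\mathbf{x}_{t,n} &= \left[\, x,\; y,\; z,\; \dot{x},\; \dot{y},\; \dot{z},\; \phi,\; \theta,\; \psi \,\right]^{T} \quad \text{(head)} \tag{2} \\
\mathbf{x}_{t,n} &= \left[\, x,\; y,\; z,\; \dot{x},\; \dot{y},\; \dot{z} \,\right]^{T} \quad \text{(ball)} \tag{3}
\end{align}
```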

Here, (u, v, q) is the observed value output by the detector, (x, y, z) is the position of the object in the three-dimensional space, and the velocity has components along the x, y, and z axes. (φ, θ, ψ) represents the orientation of the person's head. In equations (1) to (3), the subscript t denotes time, $k_j$ denotes the j-th observed value in the frame of camera k, the subscript n denotes the n-th of the multiple objects, and T denotes transposition. The subscript n can also be read as the ID of the object.

After the processing of step S107, described later, n and $k_j$ are associated, and $y_{t,k_j}$ is identified with $y_{t,k,n} = [u_{t,k,n}, v_{t,k,n}, q_{t,k,n}]^T$, where $y_{t,k,n}$ is the observed value of object n by camera k at time t. At time t = 1, the state variable $x_{t,n}$ holds the initial value set in step S101. At time t = 1, this initial value may be set as the first moment $x_{0|0,n}$ and second moment $V_{0|0,n}$ of the initial filter distribution (posterior distribution) of the state variable, described later.

In the next step S105, the prediction unit 206 obtains the predicted distribution of the state variables and the predicted distribution of the observed values.
First, the system equation used to obtain the predicted distribution of the state variables is described. In this embodiment, equation (4) is used as the system equation for a person's head.

In equation (4), Δt is the time step size (in seconds), and $s_t$ is white Gaussian noise called process noise, whose variance-covariance matrix is $Q_t$. In this system equation, the position (x, y, z) is handled with a position-and-velocity (PV) trend component model formulated as a second-order Markov process, and the orientation (φ, θ, ψ) is modeled as a first-order Markov process. The time step size Δt is normally the sampling interval, but this embodiment uses the reciprocal of the product of the camera frame rate (fps) and the number of cameras (K), that is, 1/(fps×K). For example, with K = 10 cameras at 30 fps, Δt is 1/300 second. The temporal resolution of object tracking in this embodiment is therefore finer than the camera frame period by a factor equal to the number of cameras.

The ball used in soccer is a sphere, and its appearance changes little with position and orientation. This embodiment therefore uses equation (5), which models only position and velocity, as the system equation of the ball.

In equation (5), Δt is the time width of one step (in seconds), and $s_t$ is the process noise (white Gaussian noise), whose variance-covariance matrix is $Q_t$.
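
The time step and a position-and-velocity prediction for the ball can be sketched as follows. Because the system equation is not reproduced in this text, the sketch assumes the usual constant-velocity transition (position advances by velocity × Δt, velocity is carried over) and propagates only the noise-free mean; the function names are hypothetical.

```python
def time_step(fps, num_cameras):
    """Time step size in seconds: 1 / (fps * K)."""
    return 1.0 / (fps * num_cameras)

def predict_ball_mean(state, dt):
    """Advance a 6-D state [x, y, z, vx, vy, vz] by one step (noise-free mean)."""
    x, y, z, vx, vy, vz = state
    return [x + vx * dt, y + vy * dt, z + vz * dt, vx, vy, vz]

dt = time_step(fps=30.0, num_cameras=10)
predicted = predict_ball_mean([0.0, -0.2, 5.0, 3.0, 0.0, -6.0], dt)
```

With ten cameras at 30 fps, one prediction step moves a ball travelling at 3 m/s by 1 cm, which shows how fine the effective time resolution becomes.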

Next, the observation equation used to obtain the predicted distribution of the observed values output by the detector is described.
A point in the three-dimensional space can be projected onto the pixel coordinates of a camera using camera parameters obtained in advance by calibration. This projection is described by equation (6).

In equation (6), $p_{xx,k}$ denotes the elements of the perspective projection matrix of camera k, which are obtained in advance by camera calibration, and γ is a parameter of the homogeneous coordinate system.
Based on this projection, the observation equations for the position (u, v) in the observed value are given by equations (7) and (8). In this embodiment, equations (7) and (8) model the observation for both the person's head and the ball.

Equations (7) and (8) model the process by which a position (x, y, z) in the three-dimensional space is observed as pixel coordinates (u, v). As described above, the cameras of the multi-camera imaging system in this embodiment are asynchronous, and frames may be dropped. The shift in detected position caused by this asynchrony can be suppressed by the processing of this embodiment, which estimates the imaging timings and performs the updates in that order. However, because a quantized constant value is used for the time step width Δt in prediction, some discrepancy may well arise between the position of the object in the three-dimensional space and its position in the pixel coordinates of the camera frame. Shifts in detected position caused by the object detector and by camera calibration errors can also occur. For these reasons, an object in the three-dimensional space is considered to be observed by the detector with an error in its position. Equations (7) and (8) model this error as observation noise.
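
Written out under the standard pinhole model, the projection and the position observation equations would take roughly the following form, corresponding to equations (6) to (8). The numeric element indices $p_{11,k}$ to $p_{34,k}$ and the noise symbols $w_{t,u}$ and $w_{t,v}$ are assumed notation, chosen by analogy with $p_{xx,k}$ and $w_{t,q}$.

```latex
\begin{align*}
\gamma \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  &= \begin{bmatrix}
       p_{11,k} & p_{12,k} & p_{13,k} & p_{14,k} \\
       p_{21,k} & p_{22,k} & p_{23,k} & p_{24,k} \\
       p_{31,k} & p_{32,k} & p_{33,k} & p_{34,k}
     \end{bmatrix}
     \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \\[4pt]
u_{t,k_j} &= \frac{p_{11,k}\,x + p_{12,k}\,y + p_{13,k}\,z + p_{14,k}}
                  {p_{31,k}\,x + p_{32,k}\,y + p_{33,k}\,z + p_{34,k}} + w_{t,u} \\[4pt]
v_{t,k_j} &= \frac{p_{21,k}\,x + p_{22,k}\,y + p_{23,k}\,z + p_{24,k}}
                  {p_{31,k}\,x + p_{32,k}\,y + p_{33,k}\,z + p_{34,k}} + w_{t,v}
\end{align*}
```

Dividing the first two rows by the third eliminates γ, which is why the observation equations are nonlinear in (x, y, z) and call for an extended rather than a linear Kalman filter.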

For the score q, this embodiment uses different observation models for a person's head and for the ball. The observation equation for the head is equation (9).

$$q_{t,k_j} = \alpha_0 + \alpha_1 \lVert x_{t,n} - C_k \rVert_2 + \alpha_2 \cos(\theta_x) + \alpha_3 \cos(\theta_y) + \alpha_4 \cos(\theta_z) + w_{t,q} \tag{9}$$

In equation (9), $C_k$ is the position of camera k in the three-dimensional space, $\lVert \cdot \rVert_2$ is the Euclidean norm, $\alpha_0$, $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ are model parameters, and $w_{t,q}$ is white Gaussian noise called observation noise. $\theta_x$, $\theta_y$, and $\theta_z$ can be expressed using the elements of the matrix in equation (10), where R is the rotation matrix of the camera's extrinsic parameters and $R_o$ is the rotation matrix obtained from the orientation (φ, θ, ψ) of the human body: $\theta_x = \operatorname{asin}(r_{32})$, $\theta_y = \operatorname{atan}(-r_{31}/r_{33})$, and $\theta_z = \operatorname{atan}(r_{21}/r_{11})$.

Equation (9) is a multiple regression model of the observation process for the detector score among the observed values. A detector's score generally varies with the size and orientation of the imaged object. The size of an object in a camera image normally correlates with its distance from the camera. Assuming no bias in the training data, when the object is large (close to the camera), image features such as texture are generally extracted robustly and the score is high. Conversely, when the object is small (far from the camera), the texture is degraded, image features cannot be extracted stably, and the score tends to be low. For a person in particular, when the person faces the camera, the parts that are important for identification, such as the eyes, nose, and mouth, appear consistently, so the score tends to be high. Conversely, when the person's back is to the camera, fewer parts are available as identification cues, and the score tends to be low.

In equation (9), the first term is a constant term; the second term models the relationship between the object-to-camera distance and the score with a linear model; the third, fourth, and fifth terms model the relationship between the orientation of the human body as seen from the camera and the score with cosine functions; and the sixth term is a noise term.
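As a rough illustration of equation (9), the following Python sketch evaluates the noise-free part of the head score model. The function name `head_score`, the argument layout, and the convention that the caller supplies the 3x3 matrix of equation (10) as `m` (with `m[i][j]` holding the element r with row i+1 and column j+1) are assumptions made for this example, and the parameter values in the usage lines are arbitrary.

```python
import math

def head_score(x, cam_pos, alpha, m):
    """Noise-free score of equation (9) for a head at position x.

    x, cam_pos: 3-element sequences (object position and camera position C_k).
    alpha: (a0, a1, a2, a3, a4) model parameters.
    m: 3x3 nested list standing in for the matrix of equation (10);
       m[2][2] and m[0][0] must be non-zero in this simple sketch.
    """
    a0, a1, a2, a3, a4 = alpha
    dist = math.dist(x, cam_pos)              # Euclidean norm ||x - C_k||_2
    theta_x = math.asin(m[2][1])              # asin(r32)
    theta_y = math.atan(-m[2][0] / m[2][2])   # atan(-r31 / r33)
    theta_z = math.atan(m[1][0] / m[0][0])    # atan(r21 / r11)
    return (a0 + a1 * dist
            + a2 * math.cos(theta_x)
            + a3 * math.cos(theta_y)
            + a4 * math.cos(theta_z))

# Arbitrary example: identity matrix (all angles zero), head 5 units from the camera.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
score = head_score((0.0, 3.0, 4.0), (0.0, 0.0, 0.0),
                   (0.9, -0.02, 0.1, 0.1, 0.1), identity)
```

With a negative α_1, as in this example, the score falls as the head moves away from the camera, which matches the tendency described above.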

The observation equation for the ball, on the other hand, is given by equation (11) below.

\[ q_{t,kj} = \alpha_0 + \alpha_1 \lVert x_{t,n} - C_k \rVert_2 + w_{t,q} \tag{11} \]

The ball is assumed to be a sphere, whose shape is invariant to rotation, so its appearance changes little with position and orientation. The observation equation for the score q is therefore a linear regression model whose only explanatory variable is the distance from the camera to the object.
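Because equation (11) is a straight line in the camera-to-object distance, its two parameters can be fitted in closed form once (distance, score) samples are available. The sketch below assumes such samples have been collected; `fit_ball_model` and the sample values are hypothetical, and ordinary least squares is used as one possible estimator.

```python
def fit_ball_model(distances, scores):
    """Ordinary least-squares fit of q = a0 + a1 * d (noise-free equation (11))."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_q = sum(scores) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (q - mean_q) for d, q in zip(distances, scores))
    a1 = sxy / sxx              # slope: change of score per unit distance
    a0 = mean_q - a1 * mean_d   # intercept
    return a0, a1

# Synthetic, noise-free samples generated from a0 = 0.95 and a1 = -0.01.
d = [5.0, 10.0, 20.0, 40.0]
q = [0.95 - 0.01 * di for di in d]
a0, a1 = fit_ball_model(d, q)
```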

There are several methods for estimating the model parameters α_0, α_1, α_2, α_3, and α_4 in equations (9) and (11). One is to assign ground-truth orientations in three-dimensional space to a plurality of person images captured by calibrated cameras, obtain the detector scores for the human body, and estimate the parameters by the least squares method from the resulting samples of orientation and score. Another is to use the likelihood function obtained from equation (12), described later, to calculate the likelihood of the model for the observed values, obtain the log-likelihood from a large number of observed values, and search by grid search for the parameters that maximize the log-likelihood. The latter method is efficient because it requires no manual ground-truth annotation. Other methods exist, such as a recursive search using the EM algorithm or a self-organizing model in which the model parameters are also incorporated into the state space; whichever method is used, the functions described in this embodiment are not significantly impaired.
Equations (7), (8), (9), and (11) above can be summarized and are hereinafter expressed as equation (12) below.

\[ y_{t,kj} = h_{t,k}(x_{t,n}) + w_t \tag{12} \]

Here, the variance-covariance matrix of the observation noise w_t in equation (12) is R_t. The likelihood function P(y_{t,kj} | x_{t,n}) can also be obtained from equation (12).
From the system equation and the observation equation above, the current state (time t) and the current observed value of object n are predicted from its state one time step earlier (time t-1) by equations (13) to (16) below.

Here, x_{t|t-1,n} is the first moment of the predicted distribution of the state variables, V_{t|t-1,n} is the second moment of the predicted distribution of the state variables, y_{t|t-1,k,n} is the first moment of the predicted distribution of the observed values, and U_{t|t-1,k,n} is the second moment of the predicted distribution of the observed values. Q_t is the variance-covariance matrix of the process noise, and R_t is the variance-covariance matrix of the observation noise. F_t is the matrix of the system equations (equations (4) and (5)), and H_{t,k} is the Jacobian matrix of h_{t,k}(x_{t,n}) in the observation equation (equation (12)).
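Equations (13) to (16) are not reproduced in this text. Under the symbol definitions above, the standard extended Kalman filter prediction step would take the following form; this is a reconstruction from those definitions rather than a transcription of the original equations, and the assignment of equation numbers is an assumption.

```latex
\begin{align}
x_{t|t-1,n} &= F_t\, x_{t-1|t-1,n} \tag{13} \\
V_{t|t-1,n} &= F_t\, V_{t-1|t-1,n}\, F_t^{\top} + Q_t \tag{14} \\
y_{t|t-1,k,n} &= h_{t,k}(x_{t|t-1,n}) \tag{15} \\
U_{t|t-1,k,n} &= H_{t,k}\, V_{t|t-1,n}\, H_{t,k}^{\top} + R_t \tag{16}
\end{align}
```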

Hereinafter, for simplicity, the predicted distribution of the state variables, which follows a Gaussian distribution with the first and second moments above, is written P(x_{t,n} | Y_{t-1}), and the predicted distribution of the observed values is written P(y_{t,k,n} | Y_{t-1}). Y_{t-1} is the set of observed values up to time t-1, given by equation (17) below.

In equation (17), y_{t,k,n} is the observed value of the object with ID n at time t in camera k. K_n is the number of cameras whose fields of view overlap at the position of object n in three-dimensional space.

In step S105, the prediction unit 206 predicts the state variables of object n as described above. That is, the prediction unit 206 applies equations (13) and (14) to the first moment x_{t-1|t-1,n} and the second moment V_{t-1|t-1,n} of the filter distribution of the state variables of object n, respectively. The prediction unit 206 thereby obtains the first moment x_{t|t-1,n} and the second moment V_{t|t-1,n} of the predicted distribution.

At branch B101, the prediction unit 206 uses the FOV overlap map information 410 shown in FIG. 4 to determine whether the object position predicted in the previous step lies within the FOV of camera k, and branches the processing according to the result. If the predicted position lies within the FOV of camera k, the prediction unit 206 advances the processing to step S106; otherwise, it advances the processing to step S108. At branch B101, the prediction unit 206 may instead advance the processing to step S106 whenever the probability that the object lies within the FOV of camera k, computed from the first and second moments of the predicted distribution of the state variables, is not approximately zero.

In step S106, the prediction unit 206 predicts the observed value of object n in camera k from the position and velocity state of the object predicted in step S105. That is, the prediction unit 206 predicts the first and second moments of the distribution of the observed values obtained by camera k for object n. Using equations (15) and (16), the prediction unit 206 obtains the first and second moments y_{t|t-1,k,n} and U_{t|t-1,k,n} of the predicted distribution of the observed values of camera k from the first and second moments x_{t|t-1,n} and V_{t|t-1,n} of the predicted distribution of the state variables of object n.
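Steps S105 and S106 together amount to the prediction half of an extended Kalman filter. The following sketch assumes the standard EKF prediction form and NumPy; `ekf_predict`, the constant-velocity matrix `F`, and the toy linear observation function are illustrative choices, not taken from the embodiment.

```python
import numpy as np

def ekf_predict(x_filt, V_filt, F, Q, h, H, R):
    """Predict state and observation moments for one object and one camera.

    x_filt, V_filt: filter mean and covariance at time t-1.
    F, Q: system matrix and process-noise covariance.
    h: observation function; H: its Jacobian at the predicted mean.
    R: observation-noise covariance.
    """
    x_pred = F @ x_filt              # first moment of the state prediction
    V_pred = F @ V_filt @ F.T + Q    # second moment of the state prediction
    y_pred = h(x_pred)               # first moment of the observation prediction
    U_pred = H @ V_pred @ H.T + R    # second moment of the observation prediction
    return x_pred, V_pred, y_pred, U_pred

# Toy 1-D constant-velocity state [position, velocity] with dt = 0.1 s.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])           # observe position only, so h(x) = H x
x_pred, V_pred, y_pred, U_pred = ekf_predict(
    np.array([0.0, 2.0]), np.eye(2), F, 0.01 * np.eye(2),
    lambda x: H @ x, H, np.array([[0.05]]))
```

In the embodiment, h would be the camera projection and score model of equation (12) and H its Jacobian; the linear toy model is used here only to keep the example short.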

Next, in step S107, the ID associating unit 207 associates each of the plurality of (for example, J) observed values obtained by camera k at time t with an object n. That is, the ID associating unit 207 calculates the likelihood of each actual observed value from the predicted probability distribution of the observed value of object n in camera k at time t and the actual observed values in camera k. The ID associating unit 207 then associates object n in three-dimensional space with an observed value on the basis of that likelihood. In this embodiment, it is assumed that at time t camera k yields J observed values {y_{t,k1}, y_{t,k2}, ..., y_{t,kJ}}, which include false detections. In the example of FIG. 5(b) described above, three observed values are obtained, one of which is a false detection (over-detection).

Here, the first and second moments of the predicted distribution of the observed values obtained from equations (15) and (16) define the Gaussian distribution of equation (18) below.

\[ l_{kj,n} = N(y_{t,kj};\, y_{t|t-1,k,n},\, U_{t|t-1,k,n}) \tag{18} \]

Giving the observed value y_{t,kj} to this function as an argument yields the likelihood l_{kj,n} of y_{t,kj} as an observation of object n. Applying equation (18) to each of the observed values {y_{t,k1}, y_{t,k2}, ..., y_{t,kJ}} and taking the observed value with the highest likelihood as the observation of the person (object n) associates the observed values with object n.
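The likelihood of equation (18) is an ordinary multivariate Gaussian density, which can be sketched as below. The helper name `gaussian_likelihood` and the two-dimensional (u, v) example values are assumptions for illustration; the observed value in the embodiment also carries the score q, which would simply add a dimension.

```python
import numpy as np

def gaussian_likelihood(y, y_pred, U_pred):
    """Density of N(y; y_pred, U_pred), i.e. the likelihood l of equation (18)."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    U = np.asarray(U_pred, dtype=float)
    k = diff.size
    maha_sq = diff @ np.linalg.solve(U, diff)   # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(U))
    return float(np.exp(-0.5 * maha_sq) / norm)

# Two candidate detections (u, v) for one object; the nearer one is more likely.
y_pred = [100.0, 50.0]
U_pred = [[25.0, 0.0], [0.0, 25.0]]
l_near = gaussian_likelihood([102.0, 51.0], y_pred, U_pred)
l_far = gaussian_likelihood([130.0, 80.0], y_pred, U_pred)
```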

In the association, the observed value with the maximum likelihood among the plurality of observed values may be assigned to a given person (object n) by a so-called greedy method; alternatively, the association that maximizes the sum of the likelihoods may be computed by linear programming. In the latter case, using the Mahalanobis distance calculated from the observed value and the first and second moments of the predicted distribution, the association that minimizes the sum of the Mahalanobis distances is computed by the Hungarian method, which gives the assignment that maximizes the sum of the likelihoods.

At this point, because of missed detections and over-detections by the detector, the number of actual observed values may not match the number of true observed values that should be present in the frame. The number of true observed values equals the number of persons within the FOV of one camera, which is 2 in the example of FIG. 5(b).

FIG. 6 illustrates the cost matrices used in the Hungarian method when missed detections and over-detections occur. Consider a situation in which objects with ID1, ID2, ID3, and ID4 exist in three-dimensional space and camera k images the objects with ID2, ID3, and ID4.
FIG. 6(a) shows an example of the cost matrix used in the Hungarian method when the number of actual observed values matches the number of true observed values, FIG. 6(b) when there are fewer actual observed values (missed detection), and FIG. 6(c) when there are more actual observed values (over-detection).

In the example of FIG. 6(a), P(y_{t,k,2} | Y_{t-1}), P(y_{t,k,3} | Y_{t-1}), and P(y_{t,k,4} | Y_{t-1}) are the predicted distributions of the observed values of the objects with ID2, ID3, and ID4, respectively. y_{t,k1}, y_{t,k2}, and y_{t,k3} are the first, second, and third observed values of camera k at time t, respectively. The values 607 in FIG. 6(a) are Mahalanobis distances, and the matrix whose elements are these Mahalanobis distances is the cost matrix 608. When the number of actual observed values matches the number of true observed values, as in FIG. 6(a), the Hungarian method can be applied directly to this cost matrix.

FIG. 6(b) shows an example in which a missed detection occurs. If the actual observed values are only the two values y_{t,k1} and y_{t,k2}, their number no longer matches the number of true observed values. In that case, a dummy observed value y_{t,k-1} (that is, index j = -1) is set with an infinite (∞) Mahalanobis distance, which makes the cost matrix square so that the Hungarian method can be applied. As a result of the Hungarian method, the predicted distribution of one of the IDs is assigned to the dummy observed value y_{t,k-1}. This corresponds to a missing observed value, and processing different from the usual processing is performed in the update of step S108, described later.

FIG. 6(c) shows an example in which an over-detection occurs. In this case, a dummy predicted distribution P(y_{t,k,-1} | Y_{t-1}) is set with an infinite (∞) Mahalanobis distance, which makes the cost matrix square, and the Hungarian method is then applied.
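The padding described for FIG. 6(b) and FIG. 6(c) can be sketched as follows. This is a minimal illustration under stated assumptions: `pad_cost_matrix` and `assign` are hypothetical helpers, a large finite constant `BIG` stands in for ∞ so that totals stay comparable, and the optimal assignment is found by brute-force enumeration of permutations in place of the Hungarian method (practical only for the handful of objects seen in one frame).

```python
from itertools import permutations

BIG = 1e6  # finite stand-in for the infinite Mahalanobis distance

def pad_cost_matrix(cost):
    """Make a (rows = objects, cols = observations) cost matrix square."""
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    size = max(n_rows, n_cols)
    padded = [list(row) + [BIG] * (size - n_cols) for row in cost]  # dummy observations
    padded += [[BIG] * size for _ in range(size - n_rows)]          # dummy predictions
    return padded

def assign(cost):
    """Return, per object row, the matched observation column, or None for a dummy."""
    n_rows, n_cols = len(cost), len(cost[0])
    padded = pad_cost_matrix(cost)
    size = len(padded)
    # Brute force: try every one-to-one pairing and keep the cheapest total.
    best = min(permutations(range(size)),
               key=lambda p: sum(padded[i][p[i]] for i in range(size)))
    return [best[i] if best[i] < n_cols else None for i in range(n_rows)]

# Missed detection: three predicted objects (ID2, ID3, ID4), two observations.
cost = [[0.5, 4.0],
        [3.5, 0.8],
        [6.0, 5.0]]
matches = assign(cost)
```

An object whose entry is `None` was paired with the dummy observation and would be handled as a missing observed value in the update step.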

A missed detection and an over-detection may occur simultaneously in a frame, so that the number of actual observed values appears to match the number of true observed values and an erroneous association results. To handle such cases, a threshold is set on the Mahalanobis distance; for example, an ID associated at a Mahalanobis distance of 3 or more is treated in the same way as the missing observed value described above, which reduces erroneous associations.
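The threshold test can be expressed as a small post-processing step on the assignment result. In this sketch, `gate_matches` is a hypothetical helper, the threshold of 3 follows the example value in the text, and `None` marks an object treated as having a missing observed value.

```python
def gate_matches(matches, cost, threshold=3.0):
    """Drop matches whose Mahalanobis distance is at or above the threshold.

    matches[i] is the observation index assigned to object i, or None.
    cost[i][j] is the Mahalanobis distance between object i and observation j.
    """
    gated = []
    for i, j in enumerate(matches):
        if j is not None and cost[i][j] < threshold:
            gated.append(j)
        else:
            gated.append(None)   # handled like a missing observed value
    return gated

# Object 1 was paired with a far-away detection, so its match is rejected.
cost = [[0.5, 4.0],
        [3.5, 6.2]]
gated = gate_matches([0, 1], cost)
```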

Next, in step S108, the updating unit 208 updates the predicted distribution of the state variables using the observed value at time t and obtains the filter distribution (posterior distribution). In the extended Kalman filter, the update equations that correct the first moment x_{t|t-1,n} and the second moment V_{t|t-1,n} of the predicted distribution of the state variables of object n at time t on the basis of the observed value y_{t,k,n} associated in the previous step are given by equations (19) to (21) below.

Here, K_{t,k,n} is the Kalman gain for time t, camera k, and object n; R_t is the variance-covariance matrix of the observation noise; and H_{t,k} is the Jacobian matrix of h_{t,k}(x_{t,n}) in equation (12).
When object n has a corresponding observed value, applying the update equations (19) to (21) yields the first and second moments of the filter distribution of the state variables corrected by the observed value.
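Equations (19) to (21) are not reproduced in this text. With the symbols defined above, the standard extended Kalman filter update would read as follows; this is a reconstruction, and the order in which the three expressions map to the equation numbers is an assumption.

```latex
\begin{align}
K_{t,k,n} &= V_{t|t-1,n}\, H_{t,k}^{\top} \left( H_{t,k}\, V_{t|t-1,n}\, H_{t,k}^{\top} + R_t \right)^{-1} \tag{19} \\
x_{t|t,n} &= x_{t|t-1,n} + K_{t,k,n} \left( y_{t,k,n} - h_{t,k}(x_{t|t-1,n}) \right) \tag{20} \\
V_{t|t,n} &= \left( I - K_{t,k,n}\, H_{t,k} \right) V_{t|t-1,n} \tag{21}
\end{align}
```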

On the other hand, when there is no observed value corresponding to object n, for example because the observed value is missing or the object lies outside the FOV of camera k, the current filter distribution is replaced with the predicted distribution of the state variables. That is, the update equations when there is no observed value are equations (22) and (23) below.

The updating unit 208 updates the state variables of object n by applying either equations (19), (20), and (21) or equations (22) and (23) above.
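The two update branches can be sketched together as one function. The code below assumes the standard EKF correction for the case with an observed value and, for the case without one, simply passes the predicted moments through as the filter moments, as the text describes for equations (22) and (23). `ekf_update` and the toy values are illustrative, and `y=None` is this sketch's way of signalling a missing observed value.

```python
import numpy as np

def ekf_update(x_pred, V_pred, y, h, H, R):
    """Return filter mean and covariance; y=None means no associated observation."""
    if y is None:
        # No observation: the predicted distribution becomes the filter distribution.
        return x_pred.copy(), V_pred.copy()
    S = H @ V_pred @ H.T + R                          # predicted observation covariance
    K = V_pred @ H.T @ np.linalg.inv(S)               # Kalman gain
    x_filt = x_pred + K @ (y - h(x_pred))             # corrected first moment
    V_filt = (np.eye(len(x_pred)) - K @ H) @ V_pred   # corrected second moment
    return x_filt, V_filt

H = np.array([[1.0, 0.0]])                            # observe position only
x_pred = np.array([0.2, 2.0])
V_pred = np.eye(2)
R = np.array([[1.0]])
x_filt, V_filt = ekf_update(x_pred, V_pred, np.array([0.4]), lambda x: H @ x, H, R)
x_same, V_same = ekf_update(x_pred, V_pred, None, lambda x: H @ x, H, R)
```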

Next, in step S109, the visualization unit 209 visualizes the estimated positions in three-dimensional space of the N objects n (n = 1 to N) obtained in loops L102 to L112, together with the time series of the estimated positions.
In the visualization, the results may be drawn in a virtual three-dimensional space, or superimposed as trajectories or points on the real images acquired by the cameras. The visualization result is transmitted to the monitoring unit, where the user can view it.

Next, in step S110, the timing estimation unit 210 estimates the imaging timing of each of the K cameras.
When the observed values of camera k include an observed value of object n, let the detection position corresponding to that observed value be (u_{t,k,n}, v_{t,k,n}). From equation (7), the straight line connecting the optical center of the camera and the detection position is then given by equation (24) below.

\[ (u_{t,k,n}\, p_{21,k} - v_{t,k,n}\, p_{11,k}) X + (u_{t,k,n}\, p_{22,k} - v_{t,k,n}\, p_{12,k}) Y + (u_{t,k,n}\, p_{23,k} - v_{t,k,n}\, p_{13,k}) Z + u_{t,k,n}\, p_{24,k} - v_{t,k,n}\, p_{14,k} = 0 \tag{24} \]

Let p'_{t,n} be the foot of the perpendicular dropped onto this straight line from the point (x_{t,n}, y_{t,n}, z_{t,n}), which is the position component of the first moment x_{t|t,n} of the filter distribution of object n, and let e_{t,n} be the difference between the point x_{t,n} = (x_{t,n}, y_{t,n}, z_{t,n}) and p'_{t,n}. The estimated imaging timing d_{t,n} (seconds) is the difference between the current estimated position and the position estimated from the current observed value, divided by the current estimated velocity. That is, d_{t,n} is obtained by dividing each dimension of the difference e_{t,n} by the corresponding velocity component of the first moment of the filter distribution, shown in equation (25), and averaging the resulting X, Y, and Z values.

The timing estimation unit 210 computes this estimate for each of the objects n associated with the observed values of camera k and takes their average as the estimated imaging timing of camera k. The timing estimation unit 210 performs this calculation for all K cameras and obtains an estimated imaging timing for each camera. The values obtained are used as the per-camera estimated imaging timings in step S102.
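The timing estimate of step S110 can be sketched geometrically as follows. Several choices here are assumptions rather than details from the text: the viewing line is represented by a point on it (`cam_center`) and a direction (`ray_dir`) instead of the coefficients of equation (24), the error is taken as the foot of the perpendicular minus the filter position, and velocity components close to zero are skipped to avoid division by zero. The names `timing_offset` and `camera_timing` are hypothetical.

```python
def timing_offset(x_pos, velocity, cam_center, ray_dir, eps=1e-9):
    """Estimate d_{t,n} in seconds for one object seen by one camera.

    x_pos, velocity: position and velocity parts of the filter mean.
    cam_center, ray_dir: a point on the viewing line through the detection
    and the direction of that line (assumed to come from back-projection).
    """
    rel = [p - c for p, c in zip(x_pos, cam_center)]
    scale = sum(r * d for r, d in zip(rel, ray_dir)) / sum(d * d for d in ray_dir)
    foot = [c + scale * d for c, d in zip(cam_center, ray_dir)]  # p'_{t,n}
    err = [f - p for f, p in zip(foot, x_pos)]                   # e_{t,n}, sign: foot - x
    # Average e_i / v_i over the axes whose velocity is not negligible.
    ratios = [e / v for e, v in zip(err, velocity) if abs(v) > eps]
    return sum(ratios) / len(ratios) if ratios else 0.0

def camera_timing(offsets):
    """Average the per-object estimates to get the estimate for camera k."""
    return sum(offsets) / len(offsets)

# Two objects whose detections both suggest a capture about 0.05 s after the filter time.
d1 = timing_offset((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.1, -5.0, 0.0), (0.0, 1.0, 0.0))
d2 = timing_offset((1.0, 1.0, 0.0), (0.0, -4.0, 0.0), (-5.0, 0.8, 0.0), (1.0, 0.0, 0.0))
d_k = camera_timing([d1, d2])
```

With this sign convention, a positive value means the frame shows the object further along its direction of travel than the filter position, that is, the frame was captured later than the filter time.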

By executing the above processing, the three-dimensional tracking device 200 of the first embodiment estimates the positions of a plurality of objects in three-dimensional space and achieves tracking whose time resolution Δt is finer than the camera frame interval by a factor equal to the number of cameras. This time resolution exceeds the frame rate of the cameras. According to this embodiment, therefore, using the object detection results of a plurality of cameras that capture images at different timings makes it possible to obtain the position and trajectory in three-dimensional space of a fast-moving tracked object, such as a ball in a sports match, and to reduce tracking failures.

<Second embodiment>
In the first embodiment, the estimated imaging timing of each camera is obtained by direct calculation from the estimated position and velocity of the tracked object in three-dimensional space and the actual observed values. In the second embodiment, the Bayesian framework is also applied to the estimation of the imaging timing of each camera, and the posterior probability distribution of the imaging timing is calculated. Furthermore, in the second embodiment, when the state variables of the tracked object are updated, a correction weighted by the posterior probability distribution of the imaging timing is applied, which reduces the number of missing observed values and achieves more stable tracking with high time resolution.

FIG. 1(b) shows the functional configuration of a three-dimensional tracking device 300 according to the second embodiment.
The three-dimensional tracking device 300 according to the second embodiment shown in FIG. 1(b) has an imaging unit 320, a processing unit 330, and a monitoring unit 340.
The imaging unit 320 includes a first moving image acquiring unit 301 to a K-th moving image acquiring unit 302, similar to those of the imaging unit 220 in FIG. 1(a). In the following description, as before, when the K moving image acquiring units from the first moving image acquiring unit 301 to the K-th moving image acquiring unit 302 need not be distinguished, a moving image acquiring unit may simply be referred to as a camera.

Unlike the functional configuration shown in FIG. 1(a), the processing unit 330 of the second embodiment does not have the moving image acquisition selection unit 203 described above, but has a temporary storage device 311, a timing prediction unit 314, and a timing updating unit 313. Some functions in the functional configuration of the processing unit 330 of the second embodiment also differ from those of the first embodiment; the details are described with reference to FIG. 3 and other figures.

FIG. 3(b) is a flowchart showing the flow of information processing in the processing unit 330 of the three-dimensional tracking device 300. An overview of the overall processing in the three-dimensional tracking device 300 of the second embodiment is given below with reference to this flowchart.

In step S201, the initial value setting unit 305 sets initial values for the number, IDs, and states of the tracked objects (persons and ball) in three-dimensional space. The initial value setting unit 305 further treats the imaging timings of the K cameras of the imaging unit 320 as state variables and sets their initial values. In step S201 of the second embodiment, in addition to the initial values of the first and second moments of the filter distribution of the state variables of object n, the initial value setting unit 305 sets the initial values of the first and second moments of the filter distribution of the imaging timing, which is the state variable of camera k. The differences between step S201 and step S101 of the first embodiment are described later.

In loops L201 to L211, the processing unit 330 executes iterative processing over time. That is, the processing unit 330 iterates with the time index t running from 1 to T.
Next, in loops L202 to L212, the processing unit 330 executes iterative processing over the moving image acquiring units of the imaging unit 320. That is, the processing unit 330 iterates with the camera index k running from 1 to K.
The processing of loops L201 to L211 and L202 to L212 is the same as that of loops L101 to L111 and L102 to L112 of the first embodiment, so a detailed description is omitted.

Next, in step S202, one camera k (moving image acquiring unit k) among the first moving image acquiring unit 301 to the K-th moving image acquiring unit 302 of the imaging unit 320 acquires one current frame (still image).

Next, in step S203, the detection unit 304 detects the positions and scores of the objects (persons and ball) appearing in the frame acquired in the previous step. When a plurality of persons are present in the frame, as in the soccer example described above, a plurality of positions and scores corresponding to those persons is generally detected. This processing is the same as step S104 of the first embodiment, so a detailed description is omitted.

Next, in step S204, the detection unit 304 stores the object detection results (positions and scores of the persons and ball) obtained in the previous step in the temporary storage device 311.

In loops L203 to L213, the processing unit 330 executes iterative processing over objects. That is, the processing unit 330 iterates with the object index n running from 1 to N. N is the value obtained in step S201 as the initial value of the number of objects.

Next, in step S205, the prediction unit 306 takes the current time as t (t ≥ 1) and predicts the probability distribution of the state at time t from the state of object n, such as position and velocity, at time t-1. This processing is the same as step S105 of the first embodiment, except that it uses the object states such as position and velocity stored in the temporary storage device 311, so a detailed description is omitted.

In loops L204 to L214, the processing unit 330 executes iterative processing over the cameras (moving image acquiring units) of the imaging unit 320. That is, the processing unit 330 iterates with the camera index k running from 1 to K. This processing is the same as loops L102 to L112 of the first embodiment and loops L202 to L212 of the second embodiment, so a detailed description is omitted.

Next, at branch B201, the prediction unit 306 branches the processing according to whether object n lies within the FOV of camera k, on the basis of the position of object n predicted in the previous step and the FOV overlap map. If object n may lie within the FOV of camera k, the prediction unit 306 advances the processing to step S206; if object n does not lie within the FOV, it proceeds to the next iteration. This processing is the same as branch B101 of the first embodiment, so a detailed description is omitted.

In step S206, the prediction unit 306 takes the current time as t (t ≥ 1) and predicts the probability distribution of the observed value to be obtained at time t from the state of object n, such as position and velocity, at time t-1. This processing is the same as step S106 of the first embodiment, so a detailed description is omitted.

Next, in step S207, the ID association unit 307 reads out the object detection results stored in the temporary storage device 311 that correspond to the frame acquired by camera k.
Then, in step S208, the ID association unit 307 computes the likelihood of the actual observation at camera k at time t from the predicted probability distribution of the observation of object n at camera k, which was predicted for time t in the previous step. Based on that likelihood, the ID association unit 307 associates object n in the three-dimensional space with the observation. This association processing is the same as step S107 of the first embodiment, so a detailed description is omitted.
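As a rough illustration of likelihood-based association, the following sketch scores each detection under a predicted Gaussian over the pixel position (u, v) and picks the most likely one. The details of step S107 are not reproduced in this passage, so the axis-aligned covariance, the gating threshold `min_likelihood`, and the function names are assumptions made for the example.

```python
import math

def gaussian_likelihood(obs, mean, var):
    """Likelihood of a 2-D observation under an axis-aligned Gaussian."""
    (u, v), (mu_u, mu_v), (var_u, var_v) = obs, mean, var
    expo = -0.5 * ((u - mu_u) ** 2 / var_u + (v - mu_v) ** 2 / var_v)
    return math.exp(expo) / (2.0 * math.pi * math.sqrt(var_u * var_v))

def associate(detections, pred_mean, pred_var, min_likelihood=1e-6):
    """Return (index, likelihood) of the most likely detection, or (None, 0.0) if none passes the gate."""
    best_idx, best_like = None, 0.0
    for i, det in enumerate(detections):
        like = gaussian_likelihood(det, pred_mean, pred_var)
        if like > best_like:
            best_idx, best_like = i, like
    if best_like < min_likelihood:
        return None, 0.0
    return best_idx, best_like
```

A detection far from every prediction returns `None`, which corresponds to a missing observation for that object and camera.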

Next, in step S209, the update unit 308 calculates, from the posterior probability distribution of the imaging-timing state variable, the weight to be used when updating the state variables of object n.
Then, in step S210, the update unit 308 updates the state variables of object n using the weight obtained in the previous step. The details of this step are described later.
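One way to read the weight calculation of step S209 is as the probability mass that the imaging-timing posterior assigns to a given time interval. The sketch below assumes a Gaussian posterior with a known mean and standard deviation and integrates it over an interval using the error function; the interval bounds and the function names are hypothetical, and the exact definition used in the embodiment may differ.

```python
import math

def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def timing_weight(mean, std, t_start, t_end):
    """Posterior probability that the imaging timing lies in [t_start, t_end]."""
    return gaussian_cdf(t_end, mean, std) - gaussian_cdf(t_start, mean, std)
```

A sharply peaked posterior concentrates its weight in one interval, whereas an uncertain one spreads it over several.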

In loop L205 to L215, the processing unit 330 iterates over the cameras (video acquisition units) of the imaging unit 320. That is, the processing unit 330 repeats the processing while stepping the camera index k from 1 to K. This processing is the same as loop L102 to L112 of the first embodiment and loops L202 to L212 and L204 to L214 of the present embodiment, so a detailed description is omitted.

Next, in step S211, the timing prediction unit 314 takes the current time as t (t ≥ 1) and, based on the imaging timing (state variable) of camera k at time t−1, predicts the probability distribution of that state at time t.
Then, in step S212, the timing update unit 313 reads out the object detection results stored in the temporary storage device 311 that correspond to the frame acquired by camera k. This processing is the same as step S207, so a detailed description is omitted.
Then, in step S213, the timing update unit 313 updates the imaging-timing state of camera k using the detection results obtained in the previous step.

Next, in step S214, the visualization unit 309 executes processing for visualizing the updated state of the objects and displays the result on the monitoring unit 340. This processing is the same as step S109 of the first embodiment, so a detailed description is omitted.

Next, following the flowchart shown in FIG. 3(b), the processing in each functional unit of the processing unit 330 of the three-dimensional tracking device 300 shown in FIG. 1(b) is described in more detail. Loops L201 to L211, L202 to L212, L203 to L213, L204 to L214, and L205 to L215 are the same as in the first embodiment, so a detailed description is omitted. Steps S203, S205, S206, S208, and S214 and branch B201 are likewise the same as in the first embodiment, so a detailed description is omitted. Step S212 is the same as step S207 of the second embodiment, so a detailed description is omitted.

In this embodiment, the position, velocity, and orientation of each tracked object are treated as state variables, and the state-space model is extended so that the imaging timing of each camera is also treated as a state variable. To this end, the second embodiment introduces a system equation describing the time evolution of the imaging timing and an observation equation describing the process by which the imaging timing is observed.

In this embodiment, the imaging timing of camera k is denoted $d_{t,k}$, and a first-order random walk model is used. The imaging timing $d_{t,k}$ is expressed by equation (27) below.

$$d_{t,k} = D\,d_{t-1,k} + s_{t,n} \qquad (27)$$
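For a scalar Gaussian state, the prediction step implied by equation (27) propagates the mean through D and inflates the variance by the process-noise variance. The sketch below shows this under the assumption that the process noise has a known variance `q`; the default values of `D` and `q` are placeholders, not values from the embodiment.

```python
def predict_timing(mean, var, D=1.0, q=1e-4):
    """One prediction step of d_t = D * d_{t-1} + s_t with s_t ~ N(0, q).

    Returns the predicted mean and variance of the imaging timing.
    """
    return D * mean, D * D * var + q
```

With D = 1 the mean is unchanged and only the uncertainty grows between updates, which is the usual behavior of a random walk.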

Here, $s_{t,n}$ is process noise (white Gaussian noise). Let the position of object n in the three-dimensional space at time t be $\mathbf{x}_{t,n} = (x_{t,n}, y_{t,n}, z_{t,n})$ and let its velocity be given by equation (28); the position in the three-dimensional space attributable to the imaging timing of camera k can then be written as equation (29).
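Equations (28) and (29) are not reproduced in this text. Purely as an illustration, and assuming the object moves at constant velocity over the short timing offset, the position seen by a camera whose imaging timing is offset by `d` can be sketched as the nominal position shifted by velocity times `d`. The function name and argument layout are hypothetical.

```python
def displaced_position(pos, vel, d):
    """Position after moving at constant velocity `vel` for a timing offset `d` (seconds)."""
    return tuple(p + v * d for p, v in zip(pos, vel))
```

For example, a ball moving at 20 m/s along x is displaced by about 0.2 m for a 10 ms offset, whereas a person walking at 1 m/s moves only about 1 cm.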

Furthermore, based on these equations together with equations (7) and (8), equations (30) and (31) below are used as the observation equations, where $w_{t,u}$ and $w_{t,v}$ are the observation noise terms.

Returning to the flowchart of FIG. 3(b): in the second embodiment, in step S201 the initial value setting unit 305 sets the initial values of the first and second moments of the filter distribution of the state variables of object n and, in addition, the initial values of the first and second moments of the filter distribution of the imaging timing, which is the state variable of camera k.

In the second embodiment, in step S210 the update unit 308 updates the state variables of object n using, as the weight, the value obtained in the previous step by integrating the posterior distribution of the imaging timing over each time step. The extended Kalman filter update in this case uses the method given by equations (32) to (34) below, which forms a weighted sum of the Kalman feedback terms. Here, $w_{t,k,n}$ is the integrated value of the imaging timing obtained in the previous step.
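Equations (32) to (34) are not reproduced in this text, so the following is only one simple way a weighted sum of Kalman feedback terms could look. It assumes a scalar state observed directly (observation matrix equal to 1), computes the standard Kalman gain for each camera's observation, and scales each correction by its weight. The tuple layout `(z, r, w)` and the variance update are assumptions for illustration, not the embodiment's formulas.

```python
def weighted_kalman_update(x_pred, p_pred, observations):
    """Update a scalar state with a weighted sum of per-camera Kalman corrections.

    observations: list of (z, r, w) = measurement, measurement-noise variance,
    and weight in [0, 1]; the weights are assumed to sum to at most 1.
    """
    correction = 0.0
    total_gain = 0.0
    for z, r, w in observations:
        gain = p_pred / (p_pred + r)          # standard scalar Kalman gain
        correction += w * gain * (z - x_pred)  # weighted Kalman feedback
        total_gain += w * gain
    x_new = x_pred + correction
    p_new = (1.0 - total_gain) * p_pred
    return x_new, p_new
```

With a single observation of weight 1 this reduces to the ordinary Kalman update, and a weight of 0 leaves the prediction untouched.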

Next, in step S211, the timing prediction unit 314 predicts the probability distribution based on the imaging timing of camera k. In the second embodiment, the timing prediction unit 314 models the observation in the same way for a person's head and for the ball, using the observation equations for the position (u, v) among the observations described above.

FIG. 7 shows an example of the true imaging timings assumed in this embodiment. The example assumes ten cameras, takes one of them as the reference, and shows the offsets of the imaging timings of the remaining nine cameras within one period. T denotes the timing of the camera frame period. Timing 710 is the timing of the reference camera, and timings 701, 702, 703, 704, 705, 706, 707, 708, and 709 are the timings of the remaining nine cameras, listed from the latest relative to timing 710. In other words, timings 701 to 709 are each offset from timing 710 on the time axis.
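The notion of an offset within one frame period relative to a reference camera can be sketched as follows. The example assumes, hypothetically, that capture timestamps are available in integer milliseconds; in the embodiment the true timings are not observable and are estimated instead, so this only illustrates the quantity being estimated.

```python
def timing_offsets(capture_times, ref, period):
    """Offset of each camera's capture time from camera `ref`, wrapped into [0, period)."""
    t_ref = capture_times[ref]
    return {k: (t - t_ref) % period for k, t in capture_times.items()}
```

A camera that fires 5 ms before the reference with a 40 ms frame period is reported as 35 ms late, since only the offset within one period matters.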

In this embodiment, the true imaging timings shown in FIG. 7 cannot be observed directly and are therefore treated as state variables. That is, each imaging timing is estimated as a posterior distribution, within the framework of the state-space model, from the detection results of each camera, which are observable.
FIG. 8 shows the posterior distributions of the imaging timings estimated with the state-space model described above, given the observations (u, v) of each camera. In step S210 of this embodiment, the imaging timing of each of the ten cameras shown in FIG. 7 is estimated as a probability distribution (Gaussian distribution) 801 in FIG. 8.
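To make the idea of a Gaussian posterior over the imaging timing concrete, the sketch below performs one scalar Kalman update. Since equations (30) and (31) are not reproduced here, it assumes a linearized observation in which the observed pixel displacement `z` equals a factor `h` (the object's apparent pixel velocity, in pixels per second) times the timing offset, plus noise of variance `r`. All names and the linear model are assumptions for illustration.

```python
def update_timing(mean, var, z, h, r):
    """Scalar Kalman update for the observation model z = h * d + w, w ~ N(0, r).

    mean, var: prior mean and variance of the timing offset d (seconds).
    Returns the posterior mean and variance.
    """
    s = h * h * var + r          # innovation variance
    gain = var * h / s           # Kalman gain
    new_mean = mean + gain * (z - h * mean)
    new_var = (1.0 - gain * h) * var
    return new_mean, new_var
```

Under this model a fast-moving object (large `h`) shrinks the posterior variance far more than a slow one, which is why observations of fast objects are more informative about the timing offset.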

As described above, the second embodiment also applies the Bayesian framework to the estimation of the imaging timing of each camera: it computes the posterior probability distribution of the imaging timing and, when updating the state variables of a tracked object, applies a correction weighted by that posterior distribution. As a result, the second embodiment reduces the number of missing observations and enables more stable tracking with high temporal resolution.

<Third Embodiment>
The first and second embodiments assume several types of tracked objects. Some, such as people, move relatively slowly, while others, such as a ball, move at high speed. Moreover, during a sports match such as soccer, the ball moves at high speed the moment a player kicks it but moves at about the same speed as a person while being dribbled. In general, the displacement of object detection positions on the frames of the individual network cameras, which arises from imaging-timing offsets caused by the cameras being unsynchronized, becomes pronounced when the detected object moves at high speed. In addition, depending on the position and orientation of the camera, the displacement of the detection position grows as the object gets closer to the camera and as its velocity component parallel to the plane of the camera's pixel coordinates gets larger. The third embodiment describes a case that exploits these properties to improve the accuracy of imaging-timing estimation. In the third embodiment, when the imaging timing is updated, an index is computed that quantifies how favorable the conditions of the detection result of the detection unit 304 are. That index is then used as a weight when updating the state-space model of the imaging timing, setting the amount by which the imaging timing is updated and thereby improving the accuracy of the imaging-timing estimate.
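The two properties named above (proximity to the camera and a large velocity component parallel to the pixel plane) can be combined into a simple illustrative score. The sketch below is not the index defined later in the description; it merely divides the speed parallel to the image plane by the distance to the camera, assuming the camera position and optical-axis direction are known and the object is not at the camera position. The function and argument names are hypothetical.

```python
import math

def suitability_index(pos, vel, cam_pos, cam_axis):
    """Speed parallel to the image plane divided by the distance to the camera."""
    norm = math.sqrt(sum(a * a for a in cam_axis))
    axis = [a / norm for a in cam_axis]                    # unit optical axis
    rel = [p - c for p, c in zip(pos, cam_pos)]
    dist = math.sqrt(sum(r * r for r in rel))              # assumed > 0
    along = sum(v * a for v, a in zip(vel, axis))          # speed along the optical axis
    parallel = [v - along * a for v, a in zip(vel, axis)]  # velocity within the image plane
    return math.sqrt(sum(c * c for c in parallel)) / dist
```

An object moving sideways close to the camera scores high, while one moving straight toward or away from the camera scores zero under this particular definition.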

The functional configuration of the three-dimensional tracking device of the third embodiment is the same as that of the three-dimensional tracking device 300 of the second embodiment shown in FIG. 1(b), so its illustration is omitted.
FIG. 9 is a flowchart showing the flow of information processing in the processing unit of the three-dimensional tracking device. An overview of the overall processing according to the third embodiment is first given with reference to this flowchart.

First, in step S301, the initial value setting unit 305 sets the number of tracked objects (persons and the ball) in the three-dimensional space, their IDs, and the initial values of their states. In this embodiment, the imaging timings of the K cameras (video acquisition units 301 to 302) of the imaging unit 320 are also treated as state variables, and their initial values are set. This processing is the same as step S201 of the second embodiment, so a detailed description is omitted.

In loop L301 to L311, the processing unit 330 iterates over time. That is, the processing unit 330 repeats the processing while stepping the time index t from 1 to T.
Next, in loop L302 to L312, the processing unit 330 iterates over the cameras. That is, the processing unit 330 repeats the processing while stepping the camera index k from 1 to K.
The processing of loops L301 to L311 and L302 to L312 is the same as that of loops L201 to L211 and L202 to L212 of the second embodiment, so a detailed description is omitted.

In step S302, camera k (image acquisition unit k) of the imaging unit 320 acquires one current frame (still image). This processing is the same as step S202 of the second embodiment, so a detailed description is omitted.

Next, in step S303, the detection unit 304 detects the positions and scores of the objects (persons and the ball) appearing in the frame acquired in the previous step. When a plurality of objects (a plurality of persons) are present in the frame, the detection unit 304 basically detects as many positions and scores as there are objects (persons). This processing is the same as step S203 of the second embodiment, so a detailed description is omitted.

Next, in step S304, the detection unit 304 stores the object detection results obtained in the previous step in the temporary storage device 311. This processing is the same as step S204 of the second embodiment, so a detailed description is omitted.

Next, in loop L303 to L313, the processing unit 330 iterates over the objects. That is, the processing unit 330 repeats the processing while stepping the object index n from 1 to N. N is the value obtained in step S301 as the initial number of objects. This processing is the same as loop L203 to L213 of the second embodiment, so a detailed description is omitted.

In step S305, the prediction unit 306 takes the current time as t (t ≥ 1) and, based on the state of object n at time t−1 (position, velocity, and so on), predicts the probability distribution of the state at time t. This processing is the same as step S205 of the second embodiment, so a detailed description is omitted.

Next, in loop L304 to L314, the processing unit 330 iterates over the video acquisition units. That is, the processing unit 330 repeats the processing while stepping the camera index k from 1 to K. This processing is the same as loop L204 to L214 of the second embodiment, so a detailed description is omitted.

At branch B301, the prediction unit 306 branches the processing according to whether there is a probability that object n is within the FOV of camera k, judged from the position of object n predicted in the previous step and the FOV overlap map. If there is such a probability, the prediction unit 306 advances to step S306; otherwise, it proceeds to the next iteration. This processing is the same as branch B201 of the second embodiment, so a detailed description is omitted.

In step S306, the prediction unit 306 takes the current time as t (t ≥ 1) and, based on the state of object n at time t−1 (position, velocity, and so on), predicts the probability distribution of the observation to be acquired at time t. This processing is the same as step S206 of the second embodiment, so a detailed description is omitted.

Next, in step S307, the ID association unit 307 reads out the object detection results stored in the temporary storage device 311 that correspond to the frame acquired by camera k. This processing is the same as step S207 of the second embodiment, so a detailed description is omitted.

Next, in step S308, the ID association unit 307 associates object n in the three-dimensional space with an observation, based on the predicted probability distribution of the observation of object n at camera k, predicted for time t in the previous step, and on the likelihood of the actual observation at camera k at time t. This processing is the same as step S208 of the second embodiment, so a detailed description is omitted.

Next, in step S309, the update unit 308 calculates, from the posterior probability distribution of the imaging-timing state variable, the weight to be used when updating the state variables of object n. This processing is the same as step S209 of the second embodiment, so a detailed description is omitted.

Next, in step S310, the update unit 308 updates the state variables of object n using the weight obtained in the previous step. This processing is the same as step S210 of the second embodiment, so a detailed description is omitted.

In loop L305 to L315, the processing unit 330 iterates over the cameras. That is, the processing unit 330 repeats the processing while stepping the camera index k from 1 to K. This processing is the same as loop L205 to L215 of the second embodiment, so a detailed description is omitted.

Next, in step S311, the timing prediction unit 314 takes the current time as t (t ≥ 1) and, based on the imaging timing (state variable) of video acquisition unit k at time t−1, predicts the probability distribution of that state at time t. This processing is the same as step S211 of the second embodiment, so a detailed description is omitted.

Next, in step S312, the timing update unit 313 reads out the object detection results stored in the temporary storage device 311 that correspond to the frame acquired by camera k. This processing is the same as step S212 of the second embodiment, so a detailed description is omitted.

Next, in step S313, the timing update unit 313 computes an index representing how favorable the conditions are for updating the imaging timing, that is, an index quantifying the favorable conditions of the detection result of the detection unit 304, and computes the weight for the imaging-timing update from that index.
Then, in step S314, the timing update unit 313 updates the imaging-timing state of camera k using the weight for the imaging-timing update obtained in the previous step and the detection results read out in step S312. That is, the timing update unit 313 uses the index quantifying the conditions favorable for imaging-timing estimation as the weight when updating the state-space model of the imaging timing, sets the update amount accordingly, and updates the imaging timing.

Thereafter, in step S315, the visualization unit 309 executes processing for visualizing the updated state of the objects and displays the result on the monitoring unit 340. This processing is the same as step S214 of the second embodiment, so a detailed description is omitted.
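The overview above can be summarized as a control-flow skeleton. FIG. 9 itself is not reproduced here, so the nesting below is one plausible reading of the loop and step order in the text rather than a transcription of the flowchart. The function `run_tracking`, the `in_fov` callback, and the log labels are hypothetical and stand in for the actual processing of each step.

```python
def run_tracking(T, K, N, in_fov, log):
    """Walk the loop structure of the overview, appending step labels to `log`."""
    log.append("S301")                                 # initial values (objects and timings)
    for t in range(1, T + 1):                          # loop L301 to L311: time
        for k in range(1, K + 1):                      # loop L302 to L312: cameras
            log.append(f"S302-S304 k={k}")             # acquire frame, detect, store
        for n in range(1, N + 1):                      # loop L303 to L313: objects
            log.append(f"S305 n={n}")                  # predict object state
            for k in range(1, K + 1):                  # loop L304 to L314: cameras
                if not in_fov(t, n, k):                # branch B301
                    continue
                log.append(f"S306-S310 n={n} k={k}")   # predict observation, associate, weight, update
        for k in range(1, K + 1):                      # loop L305 to L315: cameras
            log.append(f"S311-S314 k={k}")             # predict timing, read, weight, update timing
        log.append("S315")                             # visualize
    return log
```

Calling `run_tracking(1, 2, 1, lambda t, n, k: k == 1, [])` yields the step sequence for one time step with two cameras and one object that is visible only from camera 1.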

The processing in each functional unit of the processing unit 330 of the third embodiment is described below in more detail, following the flowchart shown in FIG. 9. As already noted, the steps other than steps S313 and S314 are the same as the corresponding processing of the second embodiment, so a detailed description is omitted.

In step S313, the timing update unit 313 computes an index that quantifies the conditions favorable for estimating the imaging timing and uses it as a weight when updating the state-space model of the imaging timing. Updating the imaging timing is more favorable when the angle between the object's velocity vector and the image plane in world coordinates is close to a right angle and the object is close to the camera.

Accordingly, from the factor multiplying the imaging timing $d_{t,k}$ of camera k in equations (30) and (31) above, equation (35) below is used as the index that quantifies how favorable the conditions are for updating the imaging timing.

Here, $\alpha_{t,k,n}$ takes a larger real value when the angle between the object's velocity vector and the image plane in world coordinates is close to a right angle and the object is close to the camera. In this step, the timing update unit 313 obtains the index $\alpha_{t,k,n}$ for each object n within the iteration over camera k at time t.

Next, to use this index as the weight when updating the state-space model of the imaging timing, the timing update unit 313 normalizes it according to equation (36) below.
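Equation (36) is not reproduced in this text. As a minimal sketch of what such a normalization could look like, the following assumes the indices for one camera are non-negative and divides each by their sum over the objects, so that the resulting weights sum to 1. The zero-sum fallback is an assumption added for the example.

```python
def normalize_indices(alphas):
    """Scale non-negative indices so that they sum to 1; return all zeros if the sum is zero."""
    total = sum(alphas)
    if total <= 0.0:
        return [0.0 for _ in alphas]
    return [a / total for a in alphas]
```

Weights obtained this way can be passed directly to an update that forms a weighted sum of per-object corrections.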

Then, in step S314, the timing update unit 313 applies the weights computed in this way to the update equations that form a weighted sum of the Kalman feedback terms, already given as equations (32), (33), and (34), and updates the imaging timing.

In the third embodiment, an index quantifying the conditions favorable for measuring the imaging-timing offset is computed and used as a weight when updating the state-space model of the imaging timing, which makes it possible to improve the accuracy of the imaging-timing estimate.

The information processing device of each embodiment described above (in particular, the processing unit of the three-dimensional tracking device) may be implemented by a personal computer or the like connected to the K cameras (video acquisition units) and the monitoring unit. The computer then performs the information processing described in the embodiments, from object detection through visualization. The computer in this example executes the program code of software that implements the information processing of the embodiments.

Although the hardware configuration is not illustrated, the computer that implements the information processing device of the embodiments includes a CPU, a ROM, a RAM (random access memory), an auxiliary storage device, a display unit (monitoring unit), an operation unit, a communication I/F, a bus, and so on. The CPU controls the entire computer using computer programs and data stored in the ROM and RAM and executes the information processing described above, from object detection through visualization. The information processing device may also have one or more pieces of dedicated hardware separate from the CPU, with at least part of the CPU's processing executed by that dedicated hardware. Examples of dedicated hardware include an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array), and a DSP (digital signal processor). The ROM stores programs and the like that do not need to be changed. The RAM temporarily stores programs and data supplied from the auxiliary storage device and data supplied from outside, such as from the cameras, via the communication I/F. The auxiliary storage device consists of an HDD or the like and stores various data such as image data and calibration information. The display unit (monitoring unit) consists of, for example, a liquid crystal display or an LED display and displays a GUI and the like with which the user operates the information processing device. The operation unit consists of, for example, a keyboard, a mouse, a joystick, or a touch panel, and inputs various instructions to the CPU in response to user operations. The CPU also operates as a display control unit that controls the display unit and as an operation control unit that controls the operation unit. The communication I/F is used for communication with devices external to the information processing device. For example, when the information processing device is connected to an external device by wire, a communication cable is connected to the communication I/F. When the information processing device has a function for communicating wirelessly with an external device, the communication I/F includes an antenna. The bus connects the units of the information processing device and carries information between them. In the embodiments, the external devices connected to the information processing device are the cameras described above, other information processing devices, and the like. Although the display unit and the operation unit have been described as residing inside the information processing device, they may instead exist outside it as separate devices, the display unit as the monitoring unit described above and the operation unit as an input device.

The present invention can also be realized by processing in which a program that implements one or more functions of the embodiments described above is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of that system or device read out and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.
The embodiments described above are merely specific examples of carrying out the present invention, and the technical scope of the present invention must not be interpreted restrictively on their basis. That is, the present invention can be carried out in various forms without departing from its technical idea or its main features.

200: three-dimensional tracking device, 201, 202: video acquisition unit, 230: processing unit, 203: video acquisition selection unit, 204: detection unit, 205: initial value setting unit, 206: prediction unit, 207: ID association unit, 208: updating unit, 209: visualization unit, 210: timing estimation unit

Claims

1. An information processing apparatus comprising:
detection means for detecting a predetermined object from images respectively captured by a plurality of imaging means;
prediction means for predicting, for each of the imaging means, the state of the object at a second time point after a first time point, based on the detection result of the object for the image captured at the first time point;
updating means for updating, for each of the imaging means, the predicted state of the object based on the detection result by the detection means and the prediction result by the prediction means; and
estimating means for estimating an imaging timing for each of the imaging means based on the updated state of the object and the detection result of the detection means.

2. The information processing apparatus according to claim 1, wherein said updating means updates the predicted state of the object to the state of the object at a third point in time.

3. The information processing apparatus according to claim 1, wherein the imaging timing is the difference between a reference imaging time and the imaging time of each of the other imaging means, the reference imaging time being the imaging time of one of the plurality of imaging means.

4. The information processing apparatus according to any one of claims 1 to 3, wherein the state of the object is the position, velocity, and orientation of the object in a three-dimensional space, or the position and velocity of the object in the three-dimensional space.

5. The information processing apparatus according to claim 1, wherein said prediction means predicts the state of said object as a probability distribution.

6. The information processing apparatus according to claim 5, wherein the prediction means predicts the state of the object as the probability distribution at a time interval obtained by dividing the imaging time interval of the imaging means by the number of the imaging means.

7. The information processing apparatus according to claim 5, wherein the updating means updates the predicted state of the object as the probability distribution, based on the detection result of the detection means for the image acquired at the second time point and the prediction result of the prediction means.

8. The information processing apparatus according to any one of claims 1 to 7, further comprising selecting means for selecting, from among the plurality of imaging means, the imaging means that performs imaging at the first time point, based on the imaging timing estimated by the estimating means.

9. The information processing apparatus according to any one of claims 1 to 8, further comprising associating means for associating identification information with the object detected by the detecting means.

10. The information processing apparatus according to claim 9, wherein the associating means obtains a likelihood of the detection result of the object by the detection means, based on that detection result and the prediction result by the prediction means, and associates the detection result with the prediction result based on the likelihood.

11. An information processing apparatus comprising:
detection means for detecting a predetermined object from images respectively captured by a plurality of imaging means;
timing prediction means for predicting an imaging timing for each of the imaging means; and
timing updating means for updating the imaging timing based on the detection result of the object by the detection means and the prediction result of the imaging timing by the timing prediction means.

12. The information processing apparatus according to claim 11, further comprising:
prediction means for predicting, for each of the imaging means, the state of the object at a second time point after a first time point, based on the detection result of the object for the image captured at the first time point; and
updating means for updating, for each of the imaging means, the predicted state of the object based on the detection result by the detection means and the prediction result by the prediction means.

13. The information processing apparatus according to claim 12, wherein said updating means updates the predicted state of the object to the state of the object at a third point in time.

14. The information processing apparatus according to claim 13, wherein the updating means calculates a weight for each of the imaging means based on the imaging timing updated by the timing updating means, and updates the state of the object based on the weight.

15. The information processing apparatus according to claim 14, wherein said updating means updates the state of said object based on a weighted sum using said weight for each said imaging means.

16. The information processing apparatus according to any one of claims 12 to 15, further comprising associating means for associating identification information with the object detected by the detecting means.

17. The information processing apparatus according to claim 16, wherein the associating means obtains a likelihood of the detection result of the object by the detection means, based on that detection result and the prediction result by the prediction means, and associates the detection result with the prediction result based on the likelihood.

18. The information processing apparatus according to any one of claims 11 to 17, wherein said timing prediction means predicts said imaging timing as a probability distribution.

19. The information processing apparatus according to any one of claims 11 to 18, wherein the timing updating means calculates an index quantifying the conditions under which the detection means obtained the detection result of the object, and uses the index as a weight to set the amount by which the imaging timing is updated.

20. The information processing apparatus according to any one of claims 1 to 19, wherein the detection means acquires, as the detection result of the object, at least one of a position of the object and a score representing the likelihood of the position.

21. The information processing apparatus according to any one of claims 1 to 20, further comprising visualization means for visualizing the positions of said objects in chronological order.

22. An information processing method comprising:
a detection step of detecting a predetermined object from images respectively captured by a plurality of imaging means;
a prediction step of predicting, for each of the imaging means, the state of the object at a second time point after a first time point, based on the detection result of the object for the image captured at the first time point;
an update step of updating, for each of the imaging means, the predicted state of the object based on the detection result of the detection step and the prediction result of the prediction step; and
an estimating step of estimating an imaging timing for each of the imaging means based on the updated state of the object and the detection result of the detection step.

23. An information processing method comprising:
a detection step of detecting a predetermined object from images respectively captured by a plurality of imaging means;
a timing prediction step of predicting an imaging timing for each of the imaging means; and
a timing update step of updating the imaging timing based on the detection result of the object by the detection step and the prediction result of the imaging timing by the timing prediction step.

24. A program for causing a computer to function as the information processing apparatus according to any one of claims 1 to 20.
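
Illustrative note (not part of the claims): three quantities recited in the claims above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of the embodiment. The function names, the list-based state representation, and the normalization of the weights are all hypothetical choices made for illustration. The claims do not specify how the per-camera weights are derived from the imaging timing, so here they are simply passed in by the caller.

```python
from typing import List, Sequence


def imaging_timing_offsets(capture_times: Sequence[float],
                           reference_index: int = 0) -> List[float]:
    """Imaging timing as the difference between each camera's capture time
    and the capture time of one camera chosen as the reference."""
    reference = capture_times[reference_index]
    return [t - reference for t in capture_times]


def prediction_interval(frame_interval: float, num_cameras: int) -> float:
    """Prediction time step: the imaging interval of the cameras divided by
    the number of cameras."""
    if num_cameras <= 0:
        raise ValueError("num_cameras must be positive")
    return frame_interval / num_cameras


def weighted_state(states: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-camera state estimates.

    Assumption: weights are normalized to sum to 1 inside this function;
    the source only states that a weighted sum is used."""
    if len(states) != len(weights):
        raise ValueError("one weight is required per camera state")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) / total
            for i in range(dim)]
```

For example, with capture times of 0.0 s, 0.5 s, and 0.25 s and the first camera as the reference, `imaging_timing_offsets` returns the offsets of the other cameras relative to it, and `prediction_interval(1.0, 4)` gives the step at which the state would be predicted for four cameras sharing a 1 s imaging interval.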