JP7195892B2

JP7195892B2 - Coordinate transformation matrix estimation method and computer program

Info

Publication number: JP7195892B2
Application number: JP2018219735A
Authority: JP
Inventors: 周平田良島; 啓仁野村; 和彦太田
Original assignee: NTT Communications Corp
Current assignee: NTT Communications Corp
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2022-12-26
Anticipated expiration: 2038-11-22
Also published as: JP2020086879A

Description

The present invention relates to a coordinate transformation matrix estimation method and a computer program.

Filming sports such as soccer and rugby and analyzing the footage to evaluate team tactics and individual player performance has long supported tactical planning and the recruitment of promising players.
However, the volume of sports video that could potentially be analyzed is extremely large, and analyzing all of it manually would require enormous cost and time. If sports video could be analyzed automatically by machine, it would become possible to collect the useful knowledge hidden in large-scale data without omission. Technology that contributes to the automatic analysis of sports video therefore has very high industrial value.

Sports video is usually shot by a camera placed at the side of the playing field that follows the players. Typical statistics that can be derived from sports video include a player's amount of activity (for example, the distance a player runs during a match) and a player's movement trajectory. However, detecting and tracking the players in the video is not sufficient on its own to obtain these statistics. It must also be clear how the field appears in each of the video frames that make up the video. More specifically, the coordinate transformation matrix that associates physically identical points on the field, as captured in the video frames at times t and t+t_0, must be known. When the field is a two-dimensional plane, this coordinate transformation matrix is known to be expressible as a 3-by-3 matrix. The technique of estimating the coordinate transformation matrix from the input video frames when its parameters are unknown is called registration.

The simplest way to perform registration from video frames is as follows. First, four arbitrary points that capture the same positions on the field are manually associated between the video frames at times t and t+t_0. Next, the parameters of the coordinate transformation matrix are estimated from these correspondences by DLT (Direct Linear Transform). With this method, when the camera moves to follow the match, corresponding points must be supplied manually for every video frame at successive times. The cost is enormous, and the method is also unsuited to real-time analysis.
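The four-point DLT step described above can be sketched in plain Python as follows. This is a minimal illustration, not the procedure of the patent: it assumes the inhomogeneous form of DLT in which the bottom-right element of the matrix is fixed to 1 (which fails when that element is actually 0), and the function names `solve_linear`, `homography_from_4_points`, and `apply_homography` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4_points(src: Sequence[Point], dst: Sequence[Point]) -> Matrix:
    """Estimate a 3x3 matrix H (with H[2][2] = 1, an assumption of this sketch)
    that maps each src point onto the corresponding dst point."""
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("exactly four correspondences are required")
    a: Matrix = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), rearranged to be linear in h
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        # v = (h3 x + h4 y + h5) / (h6 x + h7 y + 1)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: Matrix, p: Point) -> Point:
    """Map a point through H using homogeneous coordinates."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Calling `homography_from_4_points(src, dst)` with four field points marked in the frame at time t and the same four points marked in the frame at time t+t_0 returns a matrix that `apply_homography` can then use to carry any other point from one frame to the other. The manual marking step is exactly the cost the paragraph above identifies.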

Techniques such as those in Non-Patent Documents 1 and 2 have been proposed for estimating the parameters of the coordinate transformation matrix without supplying corresponding points manually.
In Non-Patent Document 1, characteristic regions of the field captured in each video frame (for example, lines and circles) are detected in advance, and the parameters of the coordinate transformation matrix are estimated by associating these characteristic regions across multiple video frames.
In Non-Patent Document 2, a neural network that assigns each pixel of an image to a class such as "center circle", "sideline", or "grass" is trained in advance, and the parameters of the coordinate transformation matrix are estimated by associating video frames on the basis of its output.

Non-Patent Document 1: Ankur Gupta, James J. Little, Robert J. Woodham, “Using Line and Ellipse Features for Rectification of Broadcast Hockey Video”, Computer and Robot Vision (CRV), 2011 Canadian Conference on
Non-Patent Document 2: Namdar Homayounfar, Sanja Fidler, Raquel Urtasun, “Sports Field Localization via Deep Structured Models”, In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, p.5212-5220

When characteristic regions are captured in the video frames, the parameters of the coordinate transformation matrix can be estimated by the methods of Non-Patent Documents 1 and 2. However, for video frames in which almost no characteristic region of the field is visible, the estimation accuracy of these methods deteriorates. One way to address this problem is to make more explicit use of time-series information.

Specifically, on the assumption that the scene changes smoothly between video frames shot at close times, corresponding points between the frames can be estimated with a known optical flow estimation method, and the coordinate transformation matrix can be estimated from them. However, when this assumption is violated, for example when the region captured by the camera changes abruptly between consecutively captured video frames, the accuracy of the optical flow estimate deteriorates. As a result, the parameters of the coordinate transformation matrix can no longer be estimated accurately. Abrupt changes in the captured region between consecutive frames are very common in video produced for viewing. Being unable to handle such situations may therefore severely restrict the kinds of video that can be analyzed.

As described above, conventional methods suffer from reduced estimation accuracy of the coordinate transformation matrix when no characteristic region of the imaging target area is captured, or when the region captured by the camera changes abruptly.
In view of the above circumstances, an object of the present invention is to provide a technique capable of improving the estimation accuracy of a coordinate transformation matrix.

One aspect of the present invention is a coordinate transformation matrix estimation method comprising: a detection step of detecting objects in each of a plurality of images; an extraction step of extracting, from the plurality of images, a plurality of images to be associated that were captured a predetermined time interval apart or captured at the same time by different imaging devices; an object association step of associating the object detection results between the plurality of images extracted as association targets; and a transformation matrix estimation step of estimating a coordinate transformation matrix between the images based on the association result of the object association step and the object detection results.

One aspect of the present invention is the above coordinate transformation matrix estimation method, wherein, in the detection step, each object is classified after it is detected by determining its class, and, when the object detection results for the images include information indicating the classification results, detection results that include information indicating the same classification result in both images are associated in the object association step as detection results of the same object.

One aspect of the present invention is the above coordinate transformation matrix estimation method, wherein the object association step comprises: a provisional association step of provisionally associating the object detection results between the images; a graph construction step of constructing a graph whose nodes are the pairs of object detection results provisionally associated between the images in the provisional association step; and a selection step of selecting the nodes assumed to be correct associations by determining, in descending order of a priority assigned according to how strongly each node is related to the other nodes of the graph, whether each node satisfies a condition for addition to a cluster containing the nodes assumed to be correct associations.

One aspect of the present invention is the above coordinate transformation matrix estimation method, wherein, in the detection step, each object is classified after it is detected by determining its class, and, when the object detection results for the images include information indicating the classification results, detection results that include information indicating the same classification result in both images are provisionally associated in the provisional association step as detection results of the same object.

One aspect of the present invention is the above coordinate transformation matrix estimation method, wherein, in the graph construction step, each pair of nodes in the constructed graph is evaluated for geometric consistency and node pairs evaluated as geometrically consistent are connected by an edge, and, in the selection step, nodes connected by more edges are treated as higher-priority nodes when selecting the nodes assumed to be correct associations.

One aspect of the present invention is the above coordinate transformation matrix estimation method, wherein, in the selection step, nodes are examined in descending order of priority, a node is determined to satisfy the condition for addition to the cluster when its feature points are not already used as feature points of a node that has been added to the cluster, and nodes satisfying the condition are added to the cluster.
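The selection step described in the aspects above can be illustrated with a small Python sketch. It is only a reading of the summary text, under several assumptions: each node is a provisional pair of detection indices, the "feature points" of a node are taken to be those two detections, the geometric consistency test is performed elsewhere and supplied as a list of edges, and the name `select_matches` is hypothetical. The detailed procedure of the embodiment may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A node is a provisional match:
# (detection index in the first image, detection index in the second image)
Match = Tuple[int, int]


def select_matches(nodes: List[Match],
                   edges: Iterable[Tuple[int, int]]) -> List[Match]:
    """Greedily build the cluster of matches assumed to be correct.

    nodes: provisional matches between the two images.
    edges: pairs of node indices that were judged geometrically consistent
           (the consistency test itself is outside this sketch).
    """
    # Priority: a node connected by more edges is examined earlier.
    degree: Counter = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    order = sorted(range(len(nodes)), key=lambda i: -degree[i])

    used_first, used_second = set(), set()
    cluster: List[Match] = []
    for i in order:
        first, second = nodes[i]
        # Addition condition: neither detection is already used in the cluster.
        if first in used_first or second in used_second:
            continue
        cluster.append(nodes[i])
        used_first.add(first)
        used_second.add(second)
    return cluster
```

Because `sorted` is stable, nodes with the same number of edges keep their input order. A different tie-breaking rule would be an equally valid reading of the text.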

One aspect of the present invention is the above coordinate transformation matrix estimation method, wherein, when the input comprises detection results of objects captured in a plurality of images of the imaging target area taken from opposing directions by a plurality of imaging devices, the association in the object association step is performed after one set of object detection results is reversed.

One aspect of the present invention is a computer program for executing the above coordinate transformation matrix estimation method.

The present invention makes it possible to improve the estimation accuracy of a coordinate transformation matrix.

A diagram showing the functional configuration of the image processing system 100 in the first embodiment.
A flowchart showing the flow of the coordinate transformation matrix estimation processing performed by the image processing device 10 in the first embodiment.
A diagram showing an example of person detection by the object detection unit 102 in the first embodiment.
A schematic block diagram showing the functional configuration of the image processing device 10a in the second embodiment.
A schematic block diagram showing the functional configuration of the object association unit 103a in the second embodiment.
A flowchart showing the flow of the coordinate transformation matrix estimation processing performed by the image processing device 10a in the second embodiment.
A diagram showing an example of person detection by the object detection unit 102a in the second embodiment.
A diagram showing the result of provisional association by the provisional association unit 1031 in the second embodiment.
A diagram showing the result of graph construction by the graph construction unit 1032 in the second embodiment.
A diagram showing the processing performed on the graph constructed by the graph construction unit 1032 in the second embodiment.
A flowchart showing the flow of the cluster extraction processing performed by the image processing device 10a in the second embodiment.
A diagram showing the cluster extraction result in the second embodiment.
A schematic block diagram showing the functional configuration of the image processing device 10b in the third embodiment.
A diagram showing the functional configuration of the image processing system 100c in the fourth embodiment.

An embodiment of the present invention will be described below with reference to the drawings.
(First embodiment)
FIG. 1 is a diagram showing the functional configuration of an image processing system 100 according to the first embodiment. The image processing system 100 includes an imaging device 1 and an image processing device 10.
The imaging device 1 captures an imaging target area 2. The imaging target area 2 is, for example, a stadium where sports such as soccer or rugby are played. The imaging device 1 is, for example, a camera. By shooting the imaging target area 2 as video, the imaging device 1 generates a plurality of video frames in which part or all of the imaging target area 2 is captured, and outputs the generated video frames to the image processing device 10. A video frame is an example of a still image captured by an imaging device.

The image processing device 10 estimates the parameters of a coordinate transformation matrix between video frames based on a plurality of input video frames, and transforms the video frames using the estimated parameters. For example, the image processing device 10 estimates the coordinate transformation matrix between two video frames captured at different times. The video frames input to the image processing device 10 are images captured at different times t and t+t_0 in the video. The times t and t+t_0 are separated by a predetermined time interval, for example one frame, several frames, 1 second, 5 seconds, 10 seconds, several hours, or several years. The closer the interval between times t and t+t_0 is to 0, the better. In the following description, the image captured at time t is written as video frame I^t, and the image captured at time t+t_0 is written as video frame I^{t+t_0}.

Next, the specific configuration of the image processing device 10 will be described.
The image processing device 10 includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus, and executes an image processing program. By executing the image processing program, the image processing device 10 functions as a device including an image acquisition unit 101, an object detection unit 102, an object association unit 103, a transformation matrix estimation unit 104, and an image processing unit 105. All or some of the functions of the image processing device 10 may be implemented using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), an FPGA (Field Programmable Gate Array), or a GPU (Graphics Processing Unit). The image processing program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The image processing program may also be transmitted and received via a telecommunication line.

The image acquisition unit 101 acquires the video frames output from the imaging device 1. For example, the image acquisition unit 101 acquires a plurality of video frames.
The object detection unit 102 detects the objects captured in each of the video frames acquired by the image acquisition unit 101. Here, an object is a moving object that can move autonomously or in response to operation, such as a person, an animal, a robot, or a vehicle. The following description takes the case where the object is a person as an example.

For person detection, the method disclosed in Reference 1 below may be used, for example. Person detection is not limited to the method of Reference 1, and any well-known technique can be used.
(Reference 1: Joseph Redmon, Ali Farhadi, “YOLOv3: An Incremental Improvement”, In arXiv, 2018)

As the person detection result, the object detection unit 102 detects the coordinates of the four corners of a rectangle enclosing each person captured in the video frame, that is, the segmented region of each person. As for the granularity of person detection, either of two detectors may be used: one that assigns a label identifying the specific person captured in the video frame and outputs it together with the four corner coordinates of the rectangle, or one that detects any person captured in the video frame.

In the following description, a detector that outputs the label of a specific person captured in the video frame together with the four corner coordinates of the rectangle is referred to as a specific object detector, and a detector that detects any person captured in the video frame is referred to as a general object detector. A specific object detector and a general object detector can be implemented by the same method. In general, however, detecting specific persons requires collecting training data for each individual target person. A specific object detector therefore tends to cost more to build than a general object detector, which can be built from training data on arbitrary persons. The object detection unit 102 in the first embodiment uses a specific object detector.

The object association unit 103 associates the person detection results between video frames, based on the per-frame person detection results produced by the object detection unit 102. Specifically, the object association unit 103 first extracts, from the plurality of frames, a plurality of video frames to be associated that were captured a predetermined time interval apart. Next, based on the object detection results for each of the extracted video frames, the object association unit 103 associates the object detection results between those frames. For example, based on the detection results for the video frame I^t captured at time t and for the video frame I^{t+t_0} captured at time t+t_0, a predetermined time after time t, the object association unit 103 associates the object detection results between the video frames I^t and I^{t+t_0}. Each person detection result includes the label of the specific person and the coordinates of the four corners of the rectangle.
The transformation matrix estimation unit 104 estimates the coordinate transformation matrix H^{t,t+t_0} between the video frames I^t and I^{t+t_0}, based on the person detection results from the object detection unit 102 and the association result from the object association unit 103.

Among coordinate transformation matrices, one with 8 degrees of freedom is called a projective transformation matrix, one with 6 degrees of freedom an affine transformation matrix, and one with 4 degrees of freedom a similarity transformation matrix. Any of these may be used in the present invention. The parameters of the assumed transformation matrix can be estimated by computing, from the person detection results that make up each correspondence in the association result, the coordinates that correspond between the video frames, and then applying a known parameter estimation method. To compute corresponding coordinates from the person detection results, the centroids of the rectangles may be used as corresponding points. Alternatively, noting that the bottom of a person is in contact with the ground, the point ((x_1+x_2)/2, y_1) may be used, where (x_1, y_1) and (x_2, y_1) are the lower-left and lower-right coordinates of the rectangle. As the known parameter estimation method, DLT or RANSAC, disclosed in Reference 2 below, may be used depending on the number of corresponding points obtained.
(Reference 2: Martin A. Fischler, Robert C. Bolles, “Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography”, In Communications of the ACM, 1981)
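The two ways of turning a person rectangle into a corresponding point can be sketched as follows. This is a minimal illustration under stated assumptions: rectangles are represented as (x_min, y_min, x_max, y_max) in image coordinates with y increasing downward, so the lower edge of the rectangle is y_max, and the names `Box`, `box_centroid`, `box_foot_point`, and `correspondences` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
# Assumed box layout: (x_min, y_min, x_max, y_max), with y growing downward.
Box = Tuple[float, float, float, float]


def box_centroid(box: Box) -> Point:
    """Centroid of the rectangle."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def box_foot_point(box: Box) -> Point:
    """Midpoint of the lower edge, i.e. ((x_1 + x_2) / 2, y_1) in the text,
    where the person is assumed to touch the ground."""
    x_min, _, x_max, y_max = box
    return ((x_min + x_max) / 2.0, y_max)


def correspondences(matched_boxes: Sequence[Tuple[Box, Box]],
                    use_foot_point: bool = True) -> List[Tuple[Point, Point]]:
    """Turn associated rectangles (one per frame) into point correspondences."""
    pick = box_foot_point if use_foot_point else box_centroid
    return [(pick(a), pick(b)) for a, b in matched_boxes]
```

The resulting point pairs are what a parameter estimation method such as DLT or RANSAC would take as input. The foot point has the advantage of lying on the field plane, whereas the centroid lies roughly at the middle of the person's body.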

The image processing unit 105 performs image processing on the input video frames using the coordinate transformation matrix H^{t,t+t_0} estimated by the transformation matrix estimation unit 104. Specifically, the image processing unit 105 converts a video frame into an image viewed from a specific direction. For example, the image processing unit 105 converts the video frame I^{t+t_0} into an image viewed from the direction in which the video frame I^t was shot.

The image conversion performed by the image processing unit 105 makes it possible to visualize information such as the formation of the players on the field, the position of each player at a given time, and the players' movements.
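One way such a view conversion could be carried out is inverse warping with nearest-neighbor sampling, sketched below in plain Python. This is an illustrative assumption, not the method of the image processing unit 105: the image is a list of pixel rows, the matrix `h` is assumed to map coordinates in the view of I^t to coordinates in I^{t+t_0} (the direction convention is not fixed by the text), and the name `warp_to_reference_view` is hypothetical.

```python
from typing import Any, List

Matrix = List[List[float]]
Image = List[List[Any]]  # rows of pixel values


def warp_to_reference_view(image: Image, h: Matrix,
                           out_width: int, out_height: int,
                           fill: Any = 0) -> Image:
    """Resample `image` so that it appears as seen from the reference view.

    For every output pixel (u, v) in the reference view, the source location
    is obtained as h @ (u, v, 1) and the nearest source pixel is copied.
    Output pixels that fall outside the source image keep the `fill` value.
    """
    src_height = len(image)
    src_width = len(image[0]) if image else 0
    out = [[fill] * out_width for _ in range(out_height)]
    for v in range(out_height):
        for u in range(out_width):
            w = h[2][0] * u + h[2][1] * v + h[2][2]
            if abs(w) < 1e-12:
                continue  # point at infinity, nothing to sample
            x = round((h[0][0] * u + h[0][1] * v + h[0][2]) / w)
            y = round((h[1][0] * u + h[1][1] * v + h[1][2]) / w)
            if 0 <= x < src_width and 0 <= y < src_height:
                out[v][u] = image[y][x]
    return out
```

Iterating over output pixels rather than input pixels avoids holes in the result. A production implementation would normally use interpolation and a vectorized library routine instead of nested loops.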

FIG. 2 is a flowchart showing the flow of the coordinate transformation matrix estimation processing performed by the image processing device 10 in the first embodiment.
The image acquisition unit 101 acquires a plurality of video frames (step S101). The image acquisition unit 101 outputs the acquired video frames to the object detection unit 102. The object detection unit 102 detects the persons captured in each of the video frames (step S102). As described above, the object detection unit 102 in the first embodiment is a specific object detector. It outputs the label of each specific person captured in a video frame, together with the four corner coordinates of the rectangle, as the person detection result to the object association unit 103 and the transformation matrix estimation unit 104.

FIG. 3 is a diagram showing an example of person detection by the object detection unit 102 in the first embodiment. FIG. 3(A) shows a video frame 20 captured at time t, and FIG. 3(B) shows a video frame 30 captured at time t+t_0.
In the video frame 20, four persons 21-1 to 21-4 are detected. The specific object detector assigns each of the detected persons 21-1 to 21-4 a label identifying that person. FIG. 3(A) shows that the object detection unit 102 has classified person 21-1 as "person 1", person 21-2 as "person 2", person 21-3 as "person 3", and person 21-4 as "person 4", and has assigned each the corresponding label.
As the person detection result for the video frame 20, the object detection unit 102 obtains the labels indicating the classification results of persons 21-1 to 21-4 and the four corner coordinates of the rectangular regions 22-1 to 22-4 enclosing persons 21-1 to 21-4.

In the video frame 30, three persons 31-1 to 31-3 are detected. The specific object detector assigns each detected person a label that identifies that person. FIG. 3(B) shows that the object detection unit 102 has classified person 31-1 as "person 3", person 31-2 as "person 4", and person 31-3 as "person 2", and has assigned the corresponding label to each.
As the person detection result for the video frame 30, the object detection unit 102 acquires the labels indicating the classification results of persons 31-1 to 31-3 and the coordinate values of the four corners of the rectangular regions 32-1 to 32-3 surrounding them.

Returning to FIG. 2, the description continues.
From the plurality of video frames, the object association unit 103 extracts the video frames to be associated (for example, two frames) that were captured a predetermined time interval apart (step S103). The timing at which extraction starts and the predetermined time interval are set in advance. Based on the person detection results output from the object detection unit 102 for the extracted video frames, the object association unit 103 associates the person detection results in the video frame 20 with those in the video frame 30 (step S104). Specifically, it uses the labels obtained as person detection results in each extracted video frame: person detection results that carry the same label are associated as detections of the same person. The object association unit 103 outputs the association result to the transformation matrix estimation unit 104.

In the example shown in FIG. 3, person 21-2 in the video frame 20 and person 31-3 in the video frame 30 both carry the label "person 2". The object association unit 103 therefore associates person 21-2 in the video frame 20 with person 31-3 in the video frame 30 as detections of the same person.
Likewise, person 21-3 in the video frame 20 and person 31-1 in the video frame 30 both carry the label "person 3". The object association unit 103 therefore associates person 21-3 in the video frame 20 with person 31-1 in the video frame 30 as detections of the same person.

Likewise, in the example shown in FIG. 3, person 21-4 in the video frame 20 and person 31-2 in the video frame 30 both carry the label "person 4". The object association unit 103 therefore associates person 21-4 in the video frame 20 with person 31-2 in the video frame 30 as detections of the same person.

Note that the object association unit 103 does not associate a person detection result when no person detection result with the same label exists in the other video frame. For such a detection result, the object association unit 103 either outputs no association result or outputs null (no detection) as the association result to the transformation matrix estimation unit 104.
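The label-based association described above can be illustrated with a minimal Python sketch. The detection layout (a label plus a box given as `(x_min, y_min, x_max, y_max)`), the function name `associate_by_label` and all coordinate values are assumptions made for illustration; the text only states that detections carrying the same label are paired and that unmatched detections yield no association or null.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation: a box is (x_min, y_min, x_max, y_max) and a
# detection is (label, box). Neither layout is specified in the text.
Box = Tuple[float, float, float, float]
Detection = Tuple[str, Box]


def associate_by_label(
    frame_a: List[Detection], frame_b: List[Detection]
) -> List[Tuple[Detection, Optional[Detection]]]:
    """Pair each detection in frame_a with the frame_b detection that has
    the same label, or with None when frame_b has no such label."""
    by_label: Dict[str, Detection] = {det[0]: det for det in frame_b}
    return [(det, by_label.get(det[0])) for det in frame_a]


# Labels follow the FIG. 3 example; the coordinates are invented.
frame_20 = [
    ("person 1", (10, 10, 30, 60)),
    ("person 2", (50, 12, 70, 62)),
    ("person 3", (90, 15, 110, 65)),
    ("person 4", (130, 11, 150, 61)),
]
frame_30 = [
    ("person 3", (40, 15, 60, 65)),
    ("person 4", (80, 11, 100, 61)),
    ("person 2", (5, 12, 25, 62)),
]

pairs = associate_by_label(frame_20, frame_30)
matched_labels = [a[0] for a, b in pairs if b is not None]
unmatched_labels = [a[0] for a, b in pairs if b is None]
print(matched_labels)    # labels present in both frames
print(unmatched_labels)  # labels with no counterpart (null association)
```

In this sketch "person 1" has no counterpart in the second frame, so it is paired with `None`, which corresponds to the null output mentioned above.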

Based on the person detection results from the object detection unit 102 and the association result from the object association unit 103, the transformation matrix estimation unit 104 estimates the coordinate transformation matrix H^{t,t+t_0} between the video frames I^t and I^{t+t_0} (step S105). The transformation matrix estimation unit 104 outputs the estimated coordinate transformation matrix H^{t,t+t_0} to the image processing unit 105.

The image processing system 100 in the first embodiment configured as described above can improve the accuracy with which the parameters of the coordinate transformation matrix are estimated. Specifically, the image processing apparatus 10 detects persons in each of the input video frames and associates the person detection results between video frames on the basis of their labels. Because the association relies on labels, it requires no complicated computation and can be performed simply. The image processing apparatus 10 then estimates the coordinate transformation matrix between the video frames from the association result and the person detection results in each video frame. Unlike conventional methods, the image processing system 100 in the first embodiment can therefore estimate the coordinate transformation matrix even when no characteristic region of the imaging target area 2 is captured. Even when the region captured by the camera changes abruptly, the coordinate transformation matrix can still be estimated by detecting and associating the persons captured in the video frames.
The image processing system 100 can thus estimate the parameters of the coordinate transformation matrix even in situations where the estimation accuracy of conventional methods degrades, which makes it possible to improve the estimation accuracy of those parameters.

<Modification>
Some of the functional units of the image processing apparatus 10 may be housed separately. For example, the image acquisition unit 101, the object detection unit 102, the object association unit 103 and the transformation matrix estimation unit 104, or only the object association unit 103 and the transformation matrix estimation unit 104, may be configured in a separate housing as a coordinate transformation matrix estimation device. In that configuration, the image processing apparatus 10 acquires the coordinate transformation matrix from the coordinate transformation matrix estimation device and performs image processing on the video frames.

(Second embodiment)
The second embodiment describes a configuration in which the image processing apparatus detects persons in video frames using a general object detector and uses the result to estimate the parameters of the coordinate transformation matrix. The overall configuration of the image processing system 100 is the same as in the first embodiment; only the configuration of the image processing apparatus differs.

FIG. 4 is a schematic block diagram showing the functional configuration of the image processing apparatus 10a in the second embodiment.
The image processing apparatus 10a includes a CPU, a memory, an auxiliary storage device and the like connected by a bus, and executes an image processing program. By executing the image processing program, the image processing apparatus 10a functions as an apparatus comprising an image acquisition unit 101, an object detection unit 102a, an object association unit 103a, a transformation matrix estimation unit 104 and an image processing unit 105. All or some of the functions of the image processing apparatus 10a may be implemented in hardware such as an ASIC, a PLD or an FPGA. The image processing program may be recorded on a computer-readable recording medium. Examples of computer-readable recording media are portable media such as flexible disks, magneto-optical disks, ROMs and CD-ROMs, and storage devices such as hard disks built into computer systems. The image processing program may also be transmitted and received over a telecommunication line.

The image processing apparatus 10a differs from the image processing apparatus 10 in that it includes an object detection unit 102a and an object association unit 103a in place of the object detection unit 102 and the object association unit 103. Its other components are the same as those of the image processing apparatus 10. The description of the image processing apparatus 10a as a whole is therefore omitted, and only the object detection unit 102a and the object association unit 103a are described.

The object detection unit 102a detects the objects captured in each of the video frames acquired by the image acquisition unit 101. In the second embodiment, the object detection unit 102a uses a general object detector to detect the persons captured in each of the video frames I^t and I^{t+t_0}.

Based on the person detection results for the video frames I^t and I^{t+t_0} produced by the object detection unit 102a, the object association unit 103a associates the person detection results between the video frames.

FIG. 5 is a schematic block diagram showing the functional configuration of the object association unit 103a in the second embodiment. The object association unit 103a comprises a provisional association unit 1031, a graph construction unit 1032 and a selection unit 1033.

The provisional association unit 1031 provisionally associates the person detection results for the video frames I^t and I^{t+t_0} produced by the object detection unit 102a. One way to do so is to pair every person detection result in the video frame I^t with every person detection result in the video frame I^{t+t_0}. For example, if four person detection results are obtained in the video frame I^t and three in the video frame I^{t+t_0}, the provisional association unit 1031 pairs them exhaustively and outputs 4 × 3 = 12 provisional associations to the graph construction unit 1032.
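The exhaustive pairing in this example can be sketched in a few lines of Python. The function name `provisional_pairs_exhaustive` and the use of index pairs to represent provisional associations are assumptions for illustration only.

```python
from itertools import product
from typing import List, Sequence, Tuple


def provisional_pairs_exhaustive(
    detections_t: Sequence[object], detections_t0: Sequence[object]
) -> List[Tuple[int, int]]:
    """Return every (i, j) index pair, where i indexes a detection in frame
    I^t and j indexes a detection in frame I^{t+t_0}."""
    return list(product(range(len(detections_t)), range(len(detections_t0))))


# Four detections in the first frame and three in the second, as in the text.
all_pairs = provisional_pairs_exhaustive(
    ["21-1", "21-2", "21-3", "21-4"], ["31-1", "31-2", "31-3"]
)
print(len(all_pairs))  # 12
```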

Alternatively, the provisional association unit 1031 may provisionally associate the person detection results of the video frames by a method that keeps only the plausible associations among the exhaustive pairings.
As a first method, the provisional association unit 1031 uses the method disclosed in Reference 2 to calculate the similarity between the image features extracted from a person detection result in the video frame I^t and those extracted from a person detection result in the video frame I^{t+t_0}, and provisionally associates person detection results whose similarity is equal to or greater than a first threshold.

As a second method, the provisional association unit 1031 compares the aspect ratio of the rectangular region obtained from a person detection result in the video frame I^t with that of the rectangular region obtained from a person detection result in the video frame I^{t+t_0}, and provisionally associates person detection results whose rectangular regions have matching aspect ratios.
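A rough sketch of the aspect-ratio method follows. The text does not say how closely two aspect ratios must agree to count as matching, so the tolerance `tol`, the box layout `(x_min, y_min, x_max, y_max)` and the function names are hypothetical choices made only for this illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)


def aspect_ratio(box: Box) -> float:
    """Width divided by height of a rectangular region."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) / (y_max - y_min)


def provisional_pairs_by_aspect(
    boxes_t: Sequence[Box], boxes_t0: Sequence[Box], tol: float = 0.05
) -> List[Tuple[int, int]]:
    """Keep only the (i, j) pairs whose aspect ratios differ by at most tol.
    The tolerance is an assumption; the text only says the ratios match."""
    pairs = []
    for i, box_a in enumerate(boxes_t):
        for j, box_b in enumerate(boxes_t0):
            if abs(aspect_ratio(box_a) - aspect_ratio(box_b)) <= tol:
                pairs.append((i, j))
    return pairs


# Invented boxes: aspect ratios 0.4 and 0.6 in each frame.
boxes_t = [(0, 0, 20, 50), (0, 0, 30, 50)]
boxes_t0 = [(5, 5, 17, 35), (0, 0, 24, 40)]
aspect_pairs = provisional_pairs_by_aspect(boxes_t, boxes_t0)
print(aspect_pairs)
```

With these invented boxes only the pairs with equal aspect ratios survive, which reduces the number of nodes the later graph stage has to handle.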

The graph construction unit 1032 takes the provisional associations obtained by the provisional association unit 1031 as input and constructs a graph in which each provisionally associated pair of person detection results is a node. Specifically, it constructs a graph in which each such pair is placed virtually as a node. For example, if the provisional association unit 1031 produced 12 provisional associations, the graph construction unit 1032 constructs a graph containing 12 virtually placed nodes.

Based on the graph obtained by the graph construction unit 1032, the selection unit 1033 selects, from among the provisional associations, those presumed to be correct.

FIG. 6 is a flowchart showing the flow of the coordinate transformation matrix estimation processing performed by the image processing apparatus 10a in the second embodiment.
The image acquisition unit 101 acquires a plurality of video frames (step S201) and outputs them to the object detection unit 102a. The object detection unit 102a detects the persons captured in each of the video frames (step S202). As described above, the object detection unit 102a in the second embodiment is a general object detector; it outputs the coordinate values of the four corners of the rectangle surrounding each detected person, as the person detection result, to the provisional association unit 1031 and the transformation matrix estimation unit 104.

FIG. 7 is a diagram showing an example of person detection by the object detection unit 102a in the second embodiment. FIG. 7(A) shows a video frame 20 captured at time t, and FIG. 7(B) shows a video frame 30 captured at time t+t_0.
In the video frame 20, four persons 21-1 to 21-4 are detected. As the detection result for the video frame 20, the object detection unit 102a acquires the coordinate values of the four corners of the rectangular regions 22-1 to 22-4 surrounding persons 21-1 to 21-4.
In the video frame 30, three persons 31-1 to 31-3 are detected. As the detection result for the video frame 30, the object detection unit 102a acquires the coordinate values of the four corners of the rectangular regions 32-1 to 32-3 surrounding persons 31-1 to 31-3.

Returning to FIG. 6, the description continues.
From the plurality of video frames, the provisional association unit 1031 extracts the video frames to be associated (for example, two frames) that were captured a predetermined time interval apart (step S203). The timing at which extraction starts and the predetermined time interval are set in advance. Based on the person detection results output from the object detection unit 102a for the extracted video frames, the provisional association unit 1031 provisionally associates the person detection results in the video frame 20 with those in the video frame 30 (step S204). The provisional association unit 1031 outputs the provisional associations to the graph construction unit 1032.

FIG. 8 is a diagram showing the result of provisional association by the provisional association unit 1031 in the second embodiment. The example in FIG. 8 shows the result obtained not by exhaustive pairing but by a method that keeps only plausible associations; seven provisional associations result. Because the associations are provisional, some persons, such as persons 21-2, 21-3, 21-4 and 31-1 to 31-3, have more than one association. In reality, each person should have either no association or exactly one. The image processing apparatus 10a therefore refines the associations by the processing described below.

Returning to FIG. 6, the description continues.
The graph construction unit 1032 constructs a graph by virtually placing, as a node, each pair of provisionally associated person detection results obtained by the provisional association unit 1031 (step S205).

FIG. 9 is a diagram showing the result of graph construction by the graph construction unit 1032 in the second embodiment.
The upper part of FIG. 9 shows the provisional associations obtained by the provisional association unit 1031, and the lower part shows a graph in which each provisional association is placed as a node. Node 40-21 represents the association between the detection result of person 21-2 and that of person 31-1. Node 40-23 represents the association between the detection result of person 21-2 and that of person 31-3.

Node 40-33 represents the association between the detection result of person 21-3 and that of person 31-3. Node 40-31 represents the association between the detection result of person 21-3 and that of person 31-1. Node 40-42 represents the association between the detection result of person 21-4 and that of person 31-2. Node 40-12 represents the association between the detection result of person 21-1 and that of person 31-2. Node 40-41 represents the association between the detection result of person 21-4 and that of person 31-1.

Next, the graph construction unit 1032 connects with an edge each node pair in the constructed graph that is geometrically consistent. The geometric consistency of a node pair can be evaluated from the attribute information (for example, the coordinate values of the four rectangle corners) of the person detection results that make up the pair. For example, if both nodes of a pair are correct associations, the scale of the persons captured in each video frame should be consistent. The graph construction unit 1032 therefore first calculates, for each node, the area ratio of the person detection regions captured in the video frames. It then calculates the similarity of these area ratios for every node pair. Finally, it connects with an edge only those node pairs whose similarity is equal to or greater than the first threshold.

As another method, if both nodes of a pair are correct associations, the distance between the person detection regions captured in the video frames should also be consistent with the scale of the detected persons. The graph construction unit 1032 therefore first calculates, for the two nodes in question, the distance between the two person detection regions captured in the video frame at time t and the distance between the two person detection regions captured in the video frame at time t+t_0. Next, it calculates, for each node, the area ratio of the person detection regions captured in the video frames, and then the similarity of the area ratios of the two nodes. Finally, it connects with an edge only those node pairs for which the calculated distance and similarity values are equal to or greater than the second threshold.

In the example shown in FIG. 9, assume that the associations indicated by nodes 40-42, 40-31 and 40-23 (persons 21-4 and 31-2, persons 21-3 and 31-1, and persons 21-2 and 31-3, respectively) are the correct ones. The evaluation method above is explained under this assumption, taking nodes 40-42 and 40-31 as the two nodes in question. First, the graph construction unit 1032 calculates the area ratio of the person detection regions captured in the video frames for each of nodes 40-42 and 40-31. That is, for node 40-42 it calculates the ratio of the area of the region 22-4 surrounding person 21-4 to the area of the region 32-2 surrounding person 31-2. Similarly, for node 40-31 it calculates the ratio of the area of the region 22-3 surrounding person 21-3 to the area of the region 32-1 surrounding person 31-1.

Next, the graph construction unit 1032 calculates the similarity of the two area ratios. If the calculated similarity is equal to or greater than the first threshold, the graph construction unit 1032 evaluates nodes 40-42 and 40-31 as geometrically consistent. It performs this processing for every node pair and connects with an edge each node pair evaluated as geometrically consistent.
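The area-ratio test can be sketched as follows. The text does not define how the similarity of two area ratios is computed, so the min/max quotient used in `ratio_similarity` is an assumption, as are the box layout `(x_min, y_min, x_max, y_max)`, the node names `a`, `b`, `c`, the coordinates and the threshold value.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)
Node = Tuple[Box, Box]                   # (region in frame at t, region in frame at t+t_0)


def area(box: Box) -> float:
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


def area_ratio(node: Node) -> float:
    """Area of the region at time t divided by the area at time t+t_0."""
    box_t, box_t0 = node
    return area(box_t) / area(box_t0)


def ratio_similarity(r1: float, r2: float) -> float:
    """Hypothetical similarity in (0, 1]; 1.0 means identical area ratios."""
    return min(r1, r2) / max(r1, r2)


def build_edges(nodes: Dict[str, Node], threshold: float) -> Set[FrozenSet[str]]:
    """Connect every node pair whose area-ratio similarity reaches the threshold."""
    edges = set()
    for (name_a, node_a), (name_b, node_b) in combinations(nodes.items(), 2):
        if ratio_similarity(area_ratio(node_a), area_ratio(node_b)) >= threshold:
            edges.add(frozenset((name_a, name_b)))
    return edges


# Invented nodes: "a" and "b" keep roughly the same scale, "c" does not.
nodes = {
    "a": ((0, 0, 10, 20), (0, 0, 10, 20)),
    "b": ((0, 0, 20, 40), (0, 0, 20, 38)),
    "c": ((0, 0, 10, 20), (0, 0, 20, 40)),
}
edges = build_edges(nodes, threshold=0.9)
print(edges)
```

Here node `c` implies a very different scale change from the other two, so it receives no edge, mirroring how inconsistent provisional associations are left unconnected.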

FIG. 10 is a diagram showing processing on the graph constructed by the graph construction unit 1032 in the second embodiment. As shown in FIG. 10, nodes 40-23, 40-31 and 40-42 are connected to one another by edges 42. That is, these nodes were evaluated as geometrically consistent node pairs.
In contrast, nodes 40-21, 40-33, 40-12 and 40-41, which are not connected by an edge 42, were evaluated as not geometrically consistent with any other node.

For each node, the graph construction unit 1032 outputs to the selection unit 1033 the constructed graph, including its edges, together with information indicating which person detection result in the video frame 20 (for example, the video frame I^t) is associated with which person detection result in the video frame 30 (for example, the video frame I^{t+t_0}).
Based on the information output from the graph construction unit 1032, the selection unit 1033 executes cluster extraction processing (step S206). Cluster extraction processing extracts a cluster containing the nodes expected to be correct associations. Its details are described with reference to FIG. 11.

The transformation matrix estimation unit 104 determines whether a cluster has been extracted (step S207). If the output from the selection unit 1033 contains cluster information, the transformation matrix estimation unit 104 determines that a cluster was extracted. If the output from the selection unit 1033 is Null, it determines that no cluster was extracted. If no cluster was extracted (step S207-NO), the image processing apparatus 10a ends the processing of FIG. 6; that is, it outputs Null because the coordinate transformation matrix cannot be obtained.

If a cluster was extracted (step S207-YES), the transformation matrix estimation unit 104 estimates the coordinate transformation matrix H^{t,t+t_0} between the video frames I^t and I^{t+t_0}, based on the person associations identified by the nodes in the cluster and on the person detection results from the object detection unit 102a (step S208). The transformation matrix estimation unit 104 outputs the estimated coordinate transformation matrix H^{t,t+t_0} to the image processing unit 105.

図１１は、第２の実施形態における画像処理装置１０ａが行うクラスタ抽出処理の流れを示すフローチャートである。
選択部１０３３は、グラフ構築部１０３２によって構築されたエッジを含むグラフにおいて、最も次数の高いノードを選択する（ステップＳ３０１）。ここで次数とは、ノードに接続されているエッジの数を表す。エッジの数が多いということは、接続しているノードが多いということである。すなわち、選択部１０３３は、グラフにおいて、接続しているノードが最も多いノードを選択する。例えば、図１０の場合、エッジの数が多いノードは、ノード４０－４１、４０－３１及び４０－４２である。そのため、選択部１０３３は、ノード４０－４１、４０－３１及び４０－４２のいずれかを選択する。 FIG. 11 is a flow chart showing the flow of cluster extraction processing performed by the image processing apparatus 10a according to the second embodiment.
The selection unit 1033 selects a node with the highest degree in the graph including edges constructed by the graph construction unit 1032 (step S301). Here, the degree represents the number of edges connected to the node. A large number of edges means a large number of connected nodes. That is, the selection unit 1033 selects the node with the largest number of connected nodes in the graph. For example, in FIG. 10, the nodes with the highest number of edges are nodes 40-41, 40-31 and 40-42. Therefore, the selection unit 1033 selects one of the nodes 40-41, 40-31 and 40-42.
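The selection in step S301 can be sketched as follows. This is a minimal illustration, assuming the graph is held as a dictionary that maps each node to the set of nodes it shares an edge with; that representation, the function name and the extra node label "40-11" are hypothetical and do not come from the specification. Ties are broken by sorted order, which is just one way of picking "one of" the tied nodes.

```python
def select_highest_degree_node(adjacency):
    """Return (node, degree) for a node with the largest number of edges."""
    if not adjacency:
        return None, 0
    # max() keeps the first maximal element, so sorting makes the tie-break deterministic.
    best = max(sorted(adjacency), key=lambda n: len(adjacency[n]))
    return best, len(adjacency[best])


# Hypothetical graph: three mutually connected nodes and one node without edges.
graph = {
    "40-31": {"40-41", "40-42"},
    "40-41": {"40-31", "40-42"},
    "40-42": {"40-31", "40-41"},
    "40-11": set(),  # made-up label for an unconnected provisional association
}
seed, degree = select_highest_degree_node(graph)
```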

選択部１０３３は、選択したノードの次数が第３の閾値以上であるか否かを判定する（ステップＳ３０２）。選択したノードの次数が第３の閾値以上ではない場合（ステップＳ３０２－ＮＯ）、選択部１０３３はクラスタ抽出処理の結果としてＮｕｌｌを出力する（ステップＳ３０３）。その後、選択部１０３３は、クラスタ抽出処理を終了する。
一方、選択したノードの次数が第３の閾値以上である場合（ステップＳ３０２－ＹＥＳ）、選択部１０３３はクラスタの初期化を行う（ステップＳ３０４）。具体的には、選択部１０３３は、空集合でクラスタを初期化する。これにより、クラスタにはノードがいずれも含まれなくなる。選択部１０３３は、クラスタの初期化後に周辺ノードに対してランキング付けを行う（ステップＳ３０５）。周辺ノードとは、ステップＳ３０１の処理で選択されたノード以外のノードである。 The selection unit 1033 determines whether the degree of the selected node is greater than or equal to the third threshold (step S302). If the degree of the selected node is not equal to or greater than the third threshold (step S302-NO), the selection unit 1033 outputs Null as a result of cluster extraction processing (step S303). After that, the selection unit 1033 terminates the cluster extraction process.
On the other hand, if the degree of the selected node is equal to or greater than the third threshold (step S302-YES), the selection unit 1033 initializes the cluster (step S304). Specifically, the selection unit 1033 initializes the cluster with an empty set. This leaves the cluster without any nodes. The selection unit 1033 ranks the peripheral nodes after initializing the cluster (step S305). Peripheral nodes are nodes other than the node selected in the process of step S301.

具体的には、選択部１０３３は、ステップＳ３０１の処理で選択したノードを開始ノードとして、ページランクアルゴリズムを用いて周辺ノードをランキングする。このとき、下記参考文献３で開示されている近似ページランクアルゴリズムを用いることも可能であり、この場合にはグラフのサイズに依存しない計算コストで周辺ノードをランキングすることができる。一般的に、エッジの数が多いほどランキングが高くなる傾向がある。
（参考文献３：Reid Andersen, Fan Chung, Kevin Lang, “Local Graph Partitioning using PageRank Vectors”, In Proc. IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2006） Specifically, the selection unit 1033 ranks the peripheral nodes using the PageRank algorithm, with the node selected in step S301 as the starting node. The approximate PageRank algorithm disclosed in Reference 3 below may also be used, in which case the peripheral nodes can be ranked at a computational cost that does not depend on the size of the graph. In general, the more edges a node has, the higher its ranking tends to be.
(Reference 3: Reid Andersen, Fan Chung, Kevin Lang, “Local Graph Partitioning using PageRank Vectors”, In Proc. IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2006)
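The ranking in step S305 can be illustrated with a personalized PageRank computed by plain power iteration. Note that this is not the approximate algorithm of Reference 3; it is a simpler stand-in whose cost does grow with the graph size. The restart probability `alpha`, the iteration count, the handling of nodes without edges, and the function names are assumptions made for this sketch.

```python
def personalized_pagerank(adjacency, seed, alpha=0.15, iterations=100):
    """Power iteration for PageRank that always restarts at `seed`."""
    rank = {n: 0.0 for n in adjacency}
    rank[seed] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in adjacency}
        nxt[seed] += alpha  # restart mass (total mass stays 1)
        for n, neighbours in adjacency.items():
            if not neighbours:
                # Assumption: a node without edges hands its mass back to the seed.
                nxt[seed] += (1.0 - alpha) * rank[n]
                continue
            share = (1.0 - alpha) * rank[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        rank = nxt
    return rank


def rank_peripheral_nodes(adjacency, seed):
    """Order every node except the seed by descending PageRank score."""
    scores = personalized_pagerank(adjacency, seed)
    peripheral = [n for n in adjacency if n != seed]
    return sorted(peripheral, key=lambda n: (-scores[n], n))


# Hypothetical path graph a - b - c plus an unconnected node d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
scores = personalized_pagerank(graph, "a")
ranked = rank_peripheral_nodes(graph, "a")
```

Nodes close to the start node and well connected to it receive higher scores, which matches the tendency described above.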

選択部１０３３は、ランキング付けの結果、複数の周辺ノードのうちランキングＮ＝１番目の周辺ノードを選択する（ステップＳ３０６）。選択部１０３３は、選択した周辺ノードがクラスタへの追加条件を満たすか否かを判定する（ステップＳ３０７）。クラスタへの追加条件とは、周辺ノードをクラスタに追加するために満たすべき条件であり、例えば選択された周辺ノードにおける特徴点が、既にクラスタに追加されているノードにおける特徴点として使用されていないことである。ここで、ノードにおける特徴点とは、ノードを構成している人物の検出結果である。 Based on the ranking, the selection unit 1033 selects the peripheral node ranked N=1 from among the plurality of peripheral nodes (step S306). The selection unit 1033 determines whether the selected peripheral node satisfies the condition for addition to the cluster (step S307). The condition for addition to the cluster is a condition that a peripheral node must meet in order to be added to the cluster; for example, the feature points of the selected peripheral node must not already be used as feature points of a node that has been added to the cluster. Here, the feature points of a node are the person detection results that make up the node.

クラスタへの追加条件は、一つの映像フレームの中には、同一人物は高々一度のみ出現するという制約から導かれるものである。したがって、選択部１０３３は、選択した周辺ノードを構成する人物の検出結果が、既にクラスタに含まれているノードを構成する人物の検出結果として使用されている場合には、選択した周辺ノードをクラスタへの追加条件を満たさないと判定する。
一方、選択部１０３３は、選択した周辺ノードを構成する人物の検出結果が、既にクラスタに含まれているノードを構成する人物の検出結果として使用されていない場合には、選択した周辺ノードをクラスタへの追加条件を満たすと判定する。 The condition for addition to the cluster follows from the constraint that the same person appears at most once in a single video frame. Therefore, if a person detection result that makes up the selected peripheral node is already used as a person detection result of a node included in the cluster, the selection unit 1033 determines that the selected peripheral node does not satisfy the condition for addition to the cluster.
On the other hand, if no person detection result that makes up the selected peripheral node is used as a person detection result of a node already included in the cluster, the selection unit 1033 determines that the selected peripheral node satisfies the condition for addition to the cluster.
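The check in step S307 can be written compactly. The sketch assumes that each node is a pair of indices: the index of a person detection in frame I^t and the index of a person detection in frame I^{t+t0}. That encoding and the function name are hypothetical.

```python
def satisfies_addition_condition(candidate, cluster):
    """Return True if neither detection of `candidate` is already used in `cluster`."""
    cand_t, cand_t0 = candidate
    for det_t, det_t0 in cluster:
        # The same person can appear at most once per frame, so sharing either
        # detection with an existing node is a conflict.
        if cand_t == det_t or cand_t0 == det_t0:
            return False
    return True


# Cluster already holds two associations: detection 0 -> 0 and detection 1 -> 2.
cluster = [(0, 0), (1, 2)]
ok = satisfies_addition_condition((2, 1), cluster)                  # no shared detection
reuses_frame_t = satisfies_addition_condition((0, 3), cluster)      # detection 0 of I^t reused
reuses_frame_t0 = satisfies_addition_condition((3, 2), cluster)     # detection 2 of I^{t+t0} reused
```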

選択した周辺ノードがクラスタへの追加条件を満たさない場合（ステップＳ３０７－ＮＯ）、選択部１０３３は選択した周辺ノードをクラスタに追加しない（ステップＳ３０８）。その後、選択部１０３３は、ランキングＮの値に１を加算する（ステップＳ３０９）。選択部１０３３は、ランキングＮ番目の周辺ノードを選択する（ステップＳ３１０）。その後、選択部１０３３は、ステップＳ３０７以降の処理を実行する。 If the selected peripheral node does not satisfy the conditions for addition to the cluster (step S307-NO), the selection unit 1033 does not add the selected peripheral node to the cluster (step S308). After that, the selection unit 1033 adds 1 to the value of the ranking N (step S309). The selection unit 1033 selects the N-th ranked peripheral node (step S310). After that, the selection unit 1033 executes the processes after step S307.

一方、選択した周辺ノードがクラスタへの追加条件を満たす場合（ステップＳ３０７－ＹＥＳ）、選択部１０３３は選択した周辺ノードをクラスタに追加する（ステップＳ３１１）。選択部１０３３は、周辺ノードがクラスタに追加されると、クラスタの精度を示す評価値ｓｃｏｒｅ＝δＳ／ｖｏｌ（Ｓ）を算出する（ステップＳ３１２）。ここで、δＳはクラスタに含まれない周辺ノードと、クラスタ内のノードとを接続するエッジの数を表し、ｖｏｌ（Ｓ）はクラスタ内のノードの次数の総和を表す。選択部１０３３は、クラスタ内のノードを特定する情報と、算出した評価値ｓｃｏｒｅとを対応付けて記憶する。このように、選択部１０３３は、評価値ｓｃｏｒｅを算出する度に、評価値ｓｃｏｒｅに対応付けて、クラスタに含まれるノードを特定する情報を記憶する。 On the other hand, if the selected peripheral node satisfies the conditions for addition to the cluster (step S307-YES), the selection unit 1033 adds the selected peripheral node to the cluster (step S311). When the peripheral node is added to the cluster, the selection unit 1033 calculates an evaluation value score=δS/vol(S) indicating the accuracy of the cluster (step S312). Here, δS represents the number of edges connecting the peripheral nodes not included in the cluster and the nodes within the cluster, and vol(S) represents the sum of degrees of the nodes within the cluster. The selection unit 1033 associates and stores information specifying a node in the cluster with the calculated evaluation value score. In this way, every time the selection unit 1033 calculates the evaluation value score, the selection unit 1033 stores information specifying the nodes included in the cluster in association with the evaluation value score.
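The evaluation value of step S312 can be computed directly from its definition. In this sketch the graph is again assumed to be a dictionary of neighbour sets; the function name is hypothetical, and returning infinity when vol(S) is zero is an assumed convention that the specification does not state.

```python
def cluster_score(adjacency, cluster):
    """score = dS / vol(S): boundary edges divided by the total degree inside the cluster."""
    members = set(cluster)
    # dS: edges with one endpoint inside the cluster and the other outside.
    boundary = sum(1 for n in members for m in adjacency[n] if m not in members)
    # vol(S): sum of the degrees of the nodes in the cluster.
    volume = sum(len(adjacency[n]) for n in members)
    if volume == 0:
        return float("inf")  # assumed convention for a cluster with no edges
    return boundary / volume


# Hypothetical graph: triangle a-b-c with one extra edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
score_ab = cluster_score(graph, ["a", "b"])        # 2 boundary edges / volume 4
score_abc = cluster_score(graph, ["a", "b", "c"])  # 1 boundary edge / volume 7
```

Adding node c lowers the score because the triangle is then fully enclosed and only one edge leaves the cluster.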

選択部１０３３は、記憶している情報に基づいて、算出した評価値ｓｃｏｒｅが前回の評価値ｓｃｏｒｅよりも高くなっているか否かを判定する（ステップＳ３１３）。評価値ｓｃｏｒｅが前回の評価値ｓｃｏｒｅよりも高くなっていない場合（ステップＳ３１３－ＮＯ）、選択部１０３３はステップＳ３０９以降の処理を実行する。
一方、評価値ｓｃｏｒｅが前回の評価値ｓｃｏｒｅよりも高くなっている場合（ステップＳ３１３－ＹＥＳ）、選択部１０３３は前回の評価値ｓｃｏｒｅを算出した際のクラスタを選択する。そして、選択部１０３３は、選択したクラスタを最適なクラスタとして抽出する（ステップＳ３１４）。なお、選択部１０３３は、前回の評価値ｓｃｏｒｅが記憶されていない場合には、比較ができないため高くなっていないと判定する。 The selection unit 1033 determines whether or not the calculated evaluation value score is higher than the previous evaluation value score based on the stored information (step S313). If the evaluation value score is not higher than the previous evaluation value score (step S313-NO), the selection unit 1033 executes the processes after step S309.
On the other hand, if the evaluation value score is higher than the previous evaluation value score (step S313-YES), the selection unit 1033 selects the cluster as it was when the previous evaluation value score was calculated, and extracts that cluster as the optimum cluster (step S314). Note that when no previous evaluation value score is stored, no comparison can be made, so the selection unit 1033 determines that the score has not increased.

評価値ｓｃｏｒｅは、クラスタを構成する組み合わせがいいほど小さい値となる。そのため、評価値ｓｃｏｒｅが前回の評価値ｓｃｏｒｅよりも高くなった場合には、新たに周辺ノードが追加されたことによりクラスタを構成するノードの組み合わせが良くなくなっていく可能性がある。そこで、選択部１０３３は、算出した評価値ｓｃｏｒｅが前回の評価値ｓｃｏｒｅよりも高くなっている場合には、前回の評価値ｓｃｏｒｅを算出した際のクラスタを最適なクラスタとして抽出する。 The evaluation value score becomes a smaller value as the combination forming the cluster is better. Therefore, when the evaluation value score becomes higher than the previous evaluation value score, there is a possibility that the combination of the nodes forming the cluster will deteriorate due to the addition of new peripheral nodes. Therefore, when the calculated evaluation value score is higher than the previous evaluation value score, the selection unit 1033 extracts the cluster when the previous evaluation value score was calculated as the optimum cluster.

選択部１０３３は、抽出したクラスタのサイズが第４の閾値以上であるか否かを判定する（ステップＳ３１５）。クラスタのサイズは、例えばクラスタを構成するノードの数である。抽出したクラスタのサイズが第４の閾値未満である場合（ステップＳ３１５－ＮＯ）、選択部１０３３はステップＳ３０３の処理を実行する。
一方、抽出したクラスタのサイズが第４の閾値以上である場合（ステップＳ３１５－ＹＥＳ）、選択部１０３３は選択したクラスタをクラスタ抽出処理の結果として出力する（ステップＳ３１６）。選択部１０３３で出力されるクラスタは、クラスタに含まれる全てのノードに対応する人物検出結果が、クラスタに含まれるノード間で共有されないという制約を満たしている。 The selection unit 1033 determines whether the size of the extracted cluster is equal to or greater than the fourth threshold (step S315). The cluster size is, for example, the number of nodes forming the cluster. If the size of the extracted cluster is less than the fourth threshold (step S315-NO), the selection unit 1033 executes the process of step S303.
On the other hand, if the size of the extracted cluster is equal to or larger than the fourth threshold (step S315-YES), the selection unit 1033 outputs the selected cluster as the result of cluster extraction processing (step S316). The cluster output by the selection unit 1033 satisfies the constraint that the person detection results corresponding to all the nodes included in the cluster are not shared among the nodes included in the cluster.
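Steps S304 to S316 can be put together in one loop, sketched below. Nodes are assumed to be pairs of detection indices and the graph a dictionary of neighbour sets, as a convenience for this example. Two points are assumptions because the flow above does not settle them: the cluster is started empty exactly as in step S304 and the sketch simply processes whatever ranked list it is given, and if the ranked list runs out before the score rises, the cluster built so far is kept.

```python
def extract_cluster(adjacency, ranked_nodes, min_size):
    """Grow a cluster in ranking order and stop as soon as the score rises."""
    def score(members):
        boundary = sum(1 for n in members for m in adjacency[n] if m not in members)
        volume = sum(len(adjacency[n]) for n in members)
        return boundary / volume if volume else float("inf")

    cluster = []      # S304: start from the empty set
    previous = None   # (score, cluster snapshot) stored at the last addition
    for node in ranked_nodes:                               # S306, S309, S310
        if any(node[0] == c[0] or node[1] == c[1] for c in cluster):
            continue                                        # S307-NO, S308
        cluster.append(node)                                # S311
        current = score(set(cluster))                       # S312
        if previous is not None and current > previous[0]:  # S313-YES
            cluster = previous[1]                           # S314
            break
        previous = (current, list(cluster))                 # S313-NO
    if len(cluster) < min_size:                             # S315-NO, then S303
        return None
    return cluster                                          # S316


# Hypothetical nodes: (detection index in I^t, detection index in I^{t+t0}).
A, B, C, D, E, F, G = (0, 0), (1, 1), (2, 2), (0, 1), (3, 3), (4, 4), (5, 5)
edges = [(A, B), (A, C), (B, C), (C, E), (E, F), (E, G), (D, F)]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

ranked = [A, D, B, C, E]  # made-up ranking; D conflicts with A and is skipped
result = extract_cluster(adjacency, ranked, min_size=3)
```

In this example the score falls from 1.0 to 0.5 to 1/7 as A, B and C are added, then rises to 0.2 when E is added, so the cluster from the previous step is returned.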

図１２は、第２の実施形態におけるクラスタの抽出結果を示す図である。
図１２に示すように、図１１の処理で選択されたクラスタ４１には、ノード４０－４１、４０－３１及び４０－４２が含まれている。変換行列推定部１０４では、選択部１０３３によって抽出されたクラスタ４１に含まれるノード４０－４１、４０－３１及び４０－４２で示される映像フレーム間の対応付けの結果を用いて座標変換行列を推定する。 FIG. 12 is a diagram showing a cluster extraction result in the second embodiment.
As shown in FIG. 12, the cluster 41 selected in the process of FIG. 11 includes nodes 40-41, 40-31 and 40-42. The transformation matrix estimation unit 104 estimates the coordinate transformation matrix using the associations between video frames indicated by nodes 40-41, 40-31 and 40-42, which are included in the cluster 41 extracted by the selection unit 1033.

以上のように構成された第２の実施形態における画像処理システム１００では、座標変換行列のパラメータの推定精度を向上させることができる。具体的には、まず画像処理装置１０ａは、映像フレーム間における人物検出結果の対応付けを暫定的に行う。これにより、一人の人物において１又は複数の対応付けがなされる。次に、画像処理装置１０ａは、暫定対応付けをそれぞれノードとして仮想上にノードを配置したグラフを構築する。次に、画像処理装置１０ａは、構築したグラフにおいて幾何的な一貫性があるノードペアを探し出し、エッジで接続する。次に、画像処理装置１０ａは、エッジに基づいて、正しい対応付けと想定される対応付けのノードを選択し、選択したノードと、クラスタへの追加条件を満たすノードを含むクラスタを抽出する。抽出されたクラスタには、正しい対応付けと想定されるノードが含まれているため、画像処理装置１０ａはクラスタに含まれるノードで示される対応付けの結果と、各映像フレームにおける人物の検出結果とに基づいて映像フレーム間の座標変換行列を推定する。このように、第３の実施形態における画像処理システム１００では、従来のように、撮影対象領域２における特徴的な領域が撮像されていなくても座標変換行列を推定することができる。また、カメラが撮像する領域が急激に変化したとしても映像フレーム内に撮像されている人物を検出して対応付けすることによって座標変換行列を推定することができる。したがって、画像処理システム１００は、従来の方法で座標変換行列のパラメータの推定精度が低下してしまう状況下であっても、座標変換行列のパラメータを推定することができる。そのため、座標変換行列のパラメータの推定精度を向上させることが可能になる。 In the image processing system 100 according to the second embodiment configured as described above, it is possible to improve the accuracy of estimating the parameters of the coordinate transformation matrix. Specifically, first, the image processing device 10a provisionally associates human detection results between video frames. Thereby, one or more correspondences are made in one person. Next, the image processing apparatus 10a constructs a graph in which nodes are virtually arranged with the provisional correspondences as nodes. Next, the image processing device 10a searches for geometrically consistent node pairs in the constructed graph and connects them with edges. Next, the image processing device 10a selects nodes with correspondences that are assumed to be correct correspondences based on the edges, and extracts clusters that include the selected nodes and nodes that satisfy conditions for addition to the clusters. Since the extracted cluster includes nodes that are assumed to have correct association, the image processing device 10a combines the result of association indicated by the nodes included in the cluster with the detection result of the person in each video frame. 
From these, it estimates the coordinate transformation matrix between the video frames. In this way, unlike the conventional approach, the image processing system 100 of the second embodiment can estimate the coordinate transformation matrix even when no characteristic area of the imaging target area 2 is captured. Also, even if the area captured by the camera changes abruptly, the coordinate transformation matrix can be estimated by detecting and associating the persons captured in the video frames. Therefore, the image processing system 100 can estimate the parameters of the coordinate transformation matrix even in situations where the estimation accuracy of the conventional method would degrade. As a result, the estimation accuracy of the parameters of the coordinate transformation matrix can be improved.

また、画像処理装置１０ａは、クラスタへの追加条件を満たすか否かを判定し、クラスタへの追加条件を満たすノードのみをクラスに追加している。これにより、クラスタ内において同一の人物検出結果が含まれなくなる。したがって、誤った対応付けを軽減することができる。その結果、座標変換行列のパラメータの推定精度を向上させることが可能になる。 Further, the image processing device 10a determines whether or not each node satisfies the condition for addition to the cluster, and adds only nodes that satisfy the condition to the cluster. As a result, the same person detection result is never included twice in the cluster, which reduces erroneous associations. Consequently, the estimation accuracy of the parameters of the coordinate transformation matrix can be improved.

＜変形例＞
画像処理装置１０ａが備える一部の機能部は、別の筐体で構成されてもよい。例えば、画像取得部１０１、物体検出部１０２ａ、物体対応付け部１０３ａ及び変換行列推定部１０４、又は、物体対応付け部１０３ａ及び変換行列推定部１０４が、別の筐体で座標変換行列推定装置として構成されてもよい。このように構成される場合、画像処理装置１０ａは、座標変換行列推定装置から座標変換行列を取得して、映像フレームの画像処理を行う。
選択部１０３３は、周辺ノード全てに対してクラスタへの追加条件を満たすか否かの判定を行うように構成されてもよい。このように構成される場合、選択部１０３３は、図１１のステップＳ３１３の処理に代えて、ランキング付けされた周辺ノード全てに対して処理を行ったか否かを判定する。そして、ランキング付けされた周辺ノード全てに対して処理を行っていない場合には、選択部１０３３はステップＳ３０９以降の処理を実行する。一方。ランキング付けされた周辺ノード全てに対して処理を行った場合には、選択部１０３３は評価値が最小となるクラスタを構成するノードの組み合わせを選択する。そして、選択部１０３３は、選択したノードの組み合わせで構成されるクラスタを処理結果として抽出する。評価値が最小となるクラスタが複数ある場合には、選択部１０３３は評価値が最小となるクラスタの中からランダムにクラスタを選択してもよい。 <Modification>
Some functional units included in the image processing device 10a may be housed separately. For example, the image acquisition unit 101, the object detection unit 102a, the object association unit 103a and the transformation matrix estimation unit 104, or only the object association unit 103a and the transformation matrix estimation unit 104, may be configured in a separate housing as a coordinate transformation matrix estimation device. In this configuration, the image processing device 10a acquires the coordinate transformation matrix from the coordinate transformation matrix estimation device and performs image processing on the video frames.
The selection unit 1033 may be configured to determine, for all peripheral nodes, whether the condition for addition to the cluster is satisfied. In this configuration, instead of the process of step S313 in FIG. 11, the selection unit 1033 determines whether all the ranked peripheral nodes have been processed. If not all of the ranked peripheral nodes have been processed, the selection unit 1033 executes the processes from step S309 onward. On the other hand, once all the ranked peripheral nodes have been processed, the selection unit 1033 selects the combination of nodes forming the cluster with the smallest evaluation value, and extracts the cluster formed by that combination as the processing result. If several clusters share the smallest evaluation value, the selection unit 1033 may select one of them at random.
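This modification can be sketched as a loop that never stops early: every ranked node is processed, the cluster and its evaluation value are recorded after each addition, and the recorded cluster with the smallest value is returned. The node encoding (pairs of detection indices), the dictionary-of-sets graph, the function name and the `rng` parameter are assumptions made for this example.

```python
import random


def extract_min_score_cluster(adjacency, ranked_nodes, rng=random):
    """Process every ranked node, then return the recorded cluster with the lowest score."""
    def score(members):
        boundary = sum(1 for n in members for m in adjacency[n] if m not in members)
        volume = sum(len(adjacency[n]) for n in members)
        return boundary / volume if volume else float("inf")

    cluster, snapshots = [], []
    for node in ranked_nodes:
        # Skip nodes that reuse a detection already present in the cluster.
        if any(node[0] == c[0] or node[1] == c[1] for c in cluster):
            continue
        cluster.append(node)
        snapshots.append((score(set(cluster)), list(cluster)))
    if not snapshots:
        return None
    best = min(s for s, _ in snapshots)
    # Several snapshots may tie for the minimum; pick one of them at random.
    return rng.choice([c for s, c in snapshots if s == best])


# Hypothetical nodes: (detection index in I^t, detection index in I^{t+t0}).
A, B, C, D, E, F, G = (0, 0), (1, 1), (2, 2), (0, 1), (3, 3), (4, 4), (5, 5)
edges = [(A, B), (A, C), (B, C), (C, E), (E, F), (E, G), (D, F)]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

best_cluster = extract_min_score_cluster(adjacency, [A, D, B, C, E, F])
```

Here the recorded values are 1.0, 0.5, 1/7, 0.2 and 1/6, so the three-node cluster with value 1/7 is the unique minimum.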

（第３の実施形態）
第１の実施形態で説明したように、特定物体人物検出器を用いた場合には特定人物毎のラベルがあるが、学習の精度が低い場合には誤ったラベルが付与されてしまうことや、複数の人物に同じラベルが付与されてしまうことも想定される。そのような場合に第１の実施形態のようにラベルに基づいて人物の検出結果の対応付けを行ってしまうと、座標変換行列のパラメータの推定精度が低下してしまう場合もある。そこで、第３の実施形態では、第１の実施形態のように特定物体人物検出器により特定の人物を検出した後に、第２の実施形態のように暫定対応付けを行って座標変換行列を推定する。 (Third embodiment)
As described in the first embodiment, when the specific object person detector is used, each detection carries a label for a specific person; however, if the learning accuracy is low, an incorrect label may be assigned, or the same label may be assigned to multiple persons. In such a case, if the person detection results are associated on the basis of the labels as in the first embodiment, the accuracy of estimating the parameters of the coordinate transformation matrix may decrease. Therefore, in the third embodiment, after a specific person is detected by the specific object person detector as in the first embodiment, provisional association is performed as in the second embodiment to estimate the coordinate transformation matrix.

図１３は、第３の実施形態における画像処理装置１０ｂの機能構成を表す概略ブロック図である。
画像処理装置１０ｂは、バスで接続されたＣＰＵやメモリや補助記憶装置などを備え、画像処理プログラムを実行する。画像処理プログラムの実行によって、画像処理装置１０ａは、画像取得部１０１、物体検出部１０２、物体対応付け部１０３ｂ、変換行列推定部１０４及び画像処理部１０５を備える装置として機能する。なお、画像処理装置１０ｂの各機能の全て又は一部は、ＡＳＩＣやＰＬＤやＦＰＧＡやＧＰＵ等のハードウェアを用いて実現されてもよい。また、画像処理プログラムは、コンピュータ読み取り可能な記録媒体に記録されてもよい。コンピュータ読み取り可能な記録媒体とは、例えばフレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置である。また、画像処理プログラムは、電気通信回線を介して送受信されてもよい。 FIG. 13 is a schematic block diagram showing the functional configuration of an image processing device 10b according to the third embodiment.
The image processing device 10b includes a CPU, a memory, an auxiliary storage device and the like connected via a bus, and executes an image processing program. By executing the image processing program, the image processing device 10b functions as a device including an image acquisition unit 101, an object detection unit 102, an object association unit 103b, a transformation matrix estimation unit 104, and an image processing unit 105. All or part of the functions of the image processing device 10b may be implemented using hardware such as an ASIC, PLD, FPGA or GPU. The image processing program may be recorded on a computer-readable recording medium. Computer-readable recording media include, for example, portable media such as flexible disks, magneto-optical disks, ROMs and CD-ROMs, and storage devices such as hard disks built into computer systems. The image processing program may also be transmitted and received via a telecommunication line.

画像処理装置１０ｂは、物体対応付け部１０３に代えて物体対応付け部１０３ｂを備える点で画像処理装置１０と構成が異なる。画像処理装置１０ｂは、他の構成については画像処理装置１０と同様である。そのため、画像処理装置１０ｂ全体の説明は省略し、物体対応付け部１０３ｂについて説明する。 The image processing device 10b differs in configuration from the image processing device 10 in that it includes an object association unit 103b instead of the object association unit 103. The image processing device 10b is otherwise the same as the image processing device 10. Therefore, a description of the image processing device 10b as a whole is omitted, and only the object association unit 103b is described.

物体対応付け部１０３ｂは、物体検出部１０２によって検出された各映像フレームの人物検出結果に基づいて、映像フレーム間における人物の検出結果を対応付ける。具体的には、物体対応付け部１０３ｂは、人物検出結果として得られたラベルを用いて、同じ情報を示すラベルが付与されている人物検出結果を同一の人物の検出結果として暫定的に対応付ける。上述したように、学習の精度が低い場合には誤ったラベルが付与されてしまうことや、複数の人物に同じラベルが付与されてしまうことがあるため、物体対応付け部１０３ｂは同じ情報を示すラベルが付与されている人物検出結果を暫定的に対応付ける。その後、物体対応付け部１０３ｂは、第２の実施形態と同様の処理を行う。 The object association unit 103b associates the person detection results between video frames based on the person detection results of each video frame detected by the object detection unit 102. Specifically, using the labels obtained as person detection results, the object association unit 103b provisionally associates person detection results that carry labels indicating the same information as detection results of the same person. As described above, when the learning accuracy is low, an incorrect label may be assigned or the same label may be assigned to multiple persons; for this reason, the object association unit 103b treats the association of person detection results carrying the same label as provisional only. After that, the object association unit 103b performs the same processing as in the second embodiment.

また、物体対応付け部１０３ｂは、第２の実施形態と同様の方法で暫定的に人物検出結果を対応付け、同じ情報を示すラベルが付与されている人物検出結果の組が、同じ情報を示すラベルが付与されていない人物検出結果の組よりも選択されやすいように重みづけしてもよい。 Further, the object association unit 103b temporarily associates the person detection results by the same method as in the second embodiment, and sets of person detection results to which labels indicating the same information are assigned indicate the same information. It may be weighted so that it is more likely to be selected than a set of unlabeled person detection results.

以上のように構成された第３の実施形態における画像処理システム１００では、映像フレームそれぞれで検出された、同じ情報を示すラベルが付与されている人物検出結果を同一の人物の検出結果として対応付けるのではなく、暫定的に対応付けて他の検出結果との対応関係を踏まえて最終的な対応付けを選択する。これにより、特定物体人物検出器の学習の精度が低い場合に、誤ったラベルが付与されていたり、複数の人物に同一のラベルが付与されていた場合であっても、より精度の高い対応付けを行うことができる。そのため、特定物体人物検出器の学習の精度が低い場合であっても、座標変換行列のパラメータの推定精度の低下を抑制することができる。 In the image processing system 100 according to the third embodiment configured as described above, person detection results detected in the respective video frames and carrying labels indicating the same information are not associated outright as detection results of the same person. Instead, they are associated provisionally, and the final association is selected in light of the correspondence with the other detection results. As a result, even when the learning accuracy of the specific object person detector is low and an incorrect label has been assigned or the same label has been assigned to multiple persons, a more accurate association can be made. Therefore, even when the learning accuracy of the specific object person detector is low, a decrease in the estimation accuracy of the parameters of the coordinate transformation matrix can be suppressed.

＜変形例＞
画像処理装置１０ｂが備える一部の機能部は、別の筐体で構成されてもよい。例えば、画像取得部１０１、物体検出部１０２、物体対応付け部１０３ｂ及び変換行列推定部１０４、又は、物体対応付け部１０３ｂ及び変換行列推定部１０４が、別の筐体で座標変換行列推定装置として構成されてもよい。このように構成される場合、画像処理装置１０ｂは、座標変換行列推定装置から座標変換行列を取得して、映像フレームの画像処理を行う。 <Modification>
Some functional units included in the image processing device 10b may be housed separately. For example, the image acquisition unit 101, the object detection unit 102, the object association unit 103b and the transformation matrix estimation unit 104, or only the object association unit 103b and the transformation matrix estimation unit 104, may be configured in a separate housing as a coordinate transformation matrix estimation device. In this configuration, the image processing device 10b acquires the coordinate transformation matrix from the coordinate transformation matrix estimation device and performs image processing on the video frames.

（第４の実施形態）
第４の実施形態では、複数台のカメラを用いて、異なる視点（例えば、相対する方向）で撮影した複数の画像を用いる場合を例に説明する。
図１４は、第４の実施形態における画像処理システム１００ｃの機能構成を示す図である。画像処理システム１００ｃは、複数の撮像装置１、３及び画像処理装置１０ｃを備える。撮像装置１及び３は、例えば撮影対象領域２を挟んで相対する向きに設置される。なお、複数の撮像装置１及び３の設置位置と撮影方向の向きは既知である。 (Fourth embodiment)
In the fourth embodiment, a case of using a plurality of images shot from different viewpoints (for example, facing directions) using a plurality of cameras will be described as an example.
FIG. 14 is a diagram showing the functional configuration of an image processing system 100c according to the fourth embodiment. The image processing system 100c includes a plurality of imaging devices 1 and 3 and an image processing device 10c. The imaging devices 1 and 3 are installed facing each other with an imaging target region 2 interposed therebetween, for example. Note that the installation positions and shooting directions of the plurality of imaging devices 1 and 3 are known.

撮像装置１は、撮影対象領域２を撮影する。撮像装置１は、例えばカメラである。撮像装置１は、撮影対象領域２を撮像することによって映像フレームを生成し、生成した映像フレームを画像処理装置１０ｃに出力する。
撮像装置３は、撮影対象領域２を撮影する。撮像装置３は、例えばカメラである。撮像装置３は、撮影対象領域２を撮像することによって映像フレームを生成し、生成した映像フレームを画像処理装置１０ｃに出力する。 The image capturing device 1 captures an image of an image capturing target area 2 . The imaging device 1 is, for example, a camera. The imaging device 1 generates a video frame by capturing an image of the imaging target region 2, and outputs the generated video frame to the image processing device 10c.
The imaging device 3 photographs the imaging target area 2 . The imaging device 3 is, for example, a camera. The imaging device 3 generates a video frame by imaging the imaging target region 2, and outputs the generated video frame to the image processing device 10c.

画像処理装置１０ｃは、入力した複数枚の映像フレームに基づいて映像フレーム間の座標変換行列を推定し、推定した座標変換行列により映像フレームの画像変換を行う。例えば、画像処理装置１０ｃは、撮像装置１によって撮像された映像フレームと、撮像装置３によって撮像された映像フレームとの２枚の映像フレームに基づいて映像フレーム間の座標変換行列を推定する。また、例えば、画像処理装置１０ｃは、撮像装置１によって撮像された複数枚の映像フレームに基づいて映像フレーム間の座標変換行列を推定する。また、例えば、画像処理装置１０ｃは、撮像装置３によって撮像された複数枚の映像フレームに基づいて映像フレーム間の座標変換行列を推定する。 The image processing device 10c estimates a coordinate transformation matrix between video frames based on a plurality of input video frames, and performs image transformation of the video frames using the estimated coordinate transformation matrix. For example, the image processing device 10c estimates the coordinate transformation matrix between the video frames based on two video frames, the video frame captured by the imaging device 1 and the video frame captured by the imaging device 3. Also, for example, the image processing device 10 c estimates a coordinate transformation matrix between video frames based on a plurality of video frames captured by the imaging device 1 . Also, for example, the image processing device 10 c estimates a coordinate transformation matrix between video frames based on a plurality of video frames captured by the imaging device 3 .

画像処理装置１０ｃに入力される複数枚の映像フレームは、異なる撮像装置で撮像された画像又は異なる時刻に撮像された画像である。ただし、画像処理装置１０ｃに入力される複数枚の映像フレームの撮影時刻が近いほど好ましい。以下の説明では、撮像装置１によって時刻ｔに撮像された画像を映像フレームＩ^ｔ１と記載し、撮像装置３によって時刻ｔに撮像された画像を映像フレームＩ^ｔ３と記載する。 The plurality of video frames input to the image processing device 10c are images captured by different imaging devices or images captured at different times. However, the closer the capture times of the video frames input to the image processing device 10c, the better. In the following description, the image captured by the imaging device 1 at time t is referred to as video frame I^t1, and the image captured by the imaging device 3 at time t is referred to as video frame I^t3.

図１４に示すように、第４の実施形態では、第１の実施形態～第３の実施形態とは異なり、画像処理装置１０ｃが複数の撮像装置１及び３で撮像された映像フレームのいずれか又は両方を用いる。撮像装置１及び３は、相対する方向で撮影対象領域２を撮像する。したがって、撮像装置１によって撮像された映像フレームと、撮像装置３によって撮像された映像フレームとでは、物体の進行方向が反対になる。そのため、第１の実施形態～第３の実施形態と同様に、得られた座標変換行列をそのまま用いて座標変換を行うと正しい画像が得られなくなる。 As shown in FIG. 14, in the fourth embodiment, unlike the first to third embodiments, an image processing device 10c selects one of video frames captured by a plurality of imaging devices 1 and 3. Or use both. The imaging devices 1 and 3 image the imaging target area 2 in opposite directions. Therefore, the moving direction of the object is opposite between the video frame captured by the imaging device 1 and the video frame captured by the imaging device 3 . Therefore, as in the first to third embodiments, if coordinate transformation is performed using the obtained coordinate transformation matrix as it is, a correct image cannot be obtained.

そこで、まず画像処理装置１０ｃは、撮像装置１及び３の識別情報（例えば、カメラＩＤ）を用いて、入力された映像フレームが撮像装置１及び３のいずれで撮像された映像フレームであるのかを判定する。次に、画像処理装置１０ｃは、判定結果により撮像装置が切り替わった場合には、撮像装置１と撮像装置３との相対的な関係性から一方の人物検出結果を逆転させて対応付けを行う。 Therefore, the image processing device 10c first uses the identification information of the imaging devices 1 and 3 (for example, a camera ID) to determine which of the imaging devices 1 and 3 captured the input video frame. Next, if the determination shows that the imaging device has switched, the image processing device 10c reverses one of the person detection results according to the relative relationship between the imaging device 1 and the imaging device 3, and then performs the association.

次に、画像処理装置１０ｃの具体的な構成について説明する。
画像処理装置１０ｃは、バスで接続されたＣＰＵやメモリや補助記憶装置などを備え、画像処理プログラムを実行する。画像処理プログラムの実行によって、画像処理装置１０ｃは、画像取得部１０１ｃ、物体検出部１０２ｃ、物体対応付け部１０３ｃ、変換行列推定部１０４ｃ及び画像処理部１０５を備える装置として機能する。なお、画像処理装置１０ｃの各機能の全て又は一部は、ＡＳＩＣやＰＬＤやＦＰＧＡやＧＰＵ等のハードウェアを用いて実現されてもよい。また、画像処理プログラムは、コンピュータ読み取り可能な記録媒体に記録されてもよい。コンピュータ読み取り可能な記録媒体とは、例えばフレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置である。また、画像処理プログラムは、電気通信回線を介して送受信されてもよい。 Next, a specific configuration of the image processing device 10c will be described.
The image processing device 10c includes a CPU, a memory, an auxiliary storage device, etc., which are connected via a bus, and executes an image processing program. By executing the image processing program, the image processing device 10c functions as a device including an image acquisition unit 101c, an object detection unit 102c, an object association unit 103c, a transformation matrix estimation unit 104c, and an image processing unit 105. All or part of each function of the image processing device 10c may be implemented using hardware such as ASIC, PLD, FPGA, GPU, or the like. Also, the image processing program may be recorded on a computer-readable recording medium. Computer-readable recording media include portable media such as flexible disks, magneto-optical disks, ROMs and CD-ROMs, and storage devices such as hard disks incorporated in computer systems. Also, the image processing program may be transmitted and received via an electric communication line.

画像処理装置１０ｃは、画像取得部１０１、物体検出部１０２、物体対応付け部１０３及び変換行列推定部１０４に代えて画像取得部１０１ｃ、物体検出部１０２ｃ、物体対応付け部１０３ｃ及び変換行列推定部１０４ｃを備える点で画像処理装置１０と構成が異なる。画像処理装置１０ｃは、他の構成については画像処理装置１０と同様である。そのため、画像処理装置１０ｃ全体の説明は省略し、画像取得部１０１ｃ、物体検出部１０２ｃ、物体対応付け部１０３ｃ及び変換行列推定部１０４ｃについて説明する。 The image processing device 10c differs in configuration from the image processing device 10 in that it includes an image acquisition unit 101c, an object detection unit 102c, an object association unit 103c and a transformation matrix estimation unit 104c instead of the image acquisition unit 101, the object detection unit 102, the object association unit 103 and the transformation matrix estimation unit 104. The image processing device 10c is otherwise the same as the image processing device 10. Therefore, a description of the image processing device 10c as a whole is omitted, and only the image acquisition unit 101c, the object detection unit 102c, the object association unit 103c and the transformation matrix estimation unit 104c are described.

画像取得部１０１ｃは、撮像装置１及び３のいずれか一方又はそれぞれから出力された映像フレームを取得する。また、画像取得部１０１ｃは、映像フレームからカメラＩＤを取得し、取得したカメラＩＤが変化した場合にはその変化を検出する。画像取得部１０１ｃは、カメラＩＤの変化を検出した場合には、カメラＩＤと、カメラＩＤが変化したときに取得された映像フレームがどの映像フレームであるのかを示す情報とを含む通知を物体対応付け部１０３ｃに出力する。 The image acquisition unit 101c acquires video frames output from either one or both of the imaging devices 1 and 3. The image acquisition unit 101c also acquires the camera ID from each video frame and detects any change in the acquired camera ID. When it detects a change in the camera ID, the image acquisition unit 101c outputs to the object association unit 103c a notification containing the camera ID and information indicating which video frame was acquired when the camera ID changed.
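The camera ID change detection described here can be sketched as a small generator. The representation of a frame as a `(camera_id, image)` tuple and the shape of the notification, a `(frame_index, camera_id)` pair, are assumptions for illustration; the specification does not define either format.

```python
def detect_camera_changes(frames):
    """Yield (frame_index, camera_id) whenever the camera ID differs from the previous frame."""
    previous_id = None
    for index, (camera_id, _image) in enumerate(frames):
        if previous_id is not None and camera_id != previous_id:
            # This pair stands in for the notification sent to the association unit.
            yield index, camera_id
        previous_id = camera_id


# Hypothetical stream: camera 1, then camera 3, then back to camera 1.
frames = [(1, "f0"), (1, "f1"), (3, "f2"), (3, "f3"), (1, "f4")]
changes = list(detect_camera_changes(frames))
```

The first frame never triggers a notification because there is no earlier camera ID to compare it with.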

物体検出部１０２ｃは、画像取得部１０１ｃによって取得された複数枚の映像フレームそれぞれから、映像フレームに撮像されている物体を検出する。なお、物体検出部１０２ｃによる物体の検出方法は、第１の実施形態～第３の実施形態のいずれかの実施形態における同名の機能部と同様の処理を行う。すなわち、物体検出部１０２ｃは、一般物体検出器又は特定物体検出器のいずれかを用いて、映像フレームから人物検出を行う。 The object detection unit 102c detects objects captured in the video frames from each of the plurality of video frames acquired by the image acquisition unit 101c. The method of detecting an object by the object detection unit 102c performs the same processing as that of the function unit with the same name in any one of the first to third embodiments. That is, the object detection unit 102c uses either a general object detector or a specific object detector to detect a person from a video frame.

The object association unit 103c associates person detection results between video frames based on the person detection results obtained for each video frame by the object detection unit 102c. Specifically, the object association unit 103c first extracts, from the plurality of video frames, a plurality of video frames to be associated that were captured a predetermined time interval apart. Next, the object association unit 103c associates the object detection results between the extracted video frames based on the detection results of the objects detected in each of them. The video frames to be associated by the object association unit 103c are video frames captured a predetermined time interval apart by the same imaging device or by different imaging devices, and video frames captured at the same time by different imaging devices. In this way, from among the video frames processed by the object detection unit 102c, the object association unit 103c takes a plurality of video frames that were captured a predetermined time interval apart or captured at the same time by different imaging devices, and associates the object detection results between those video frames based on the detection results of the objects detected in each of them.
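The extraction rule above (frames a predetermined interval apart from the same or different imaging devices, plus simultaneous frames from different imaging devices) can be sketched as below. The representation of a frame as a `(camera_id, time_index)` tuple, the integer time index, and the function name `extract_pairs` are assumptions introduced for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Frame = Tuple[int, int]  # (camera_id, time_index); an assumed representation


def extract_pairs(frames: List[Frame], t0: int) -> List[Tuple[Frame, Frame]]:
    """Return the frame pairs that qualify as association targets."""
    pairs = []
    for a, b in combinations(frames, 2):
        dt = abs(a[1] - b[1])
        same_camera = a[0] == b[0]
        # Case 1: captured t0 apart, by the same or by different cameras.
        # Case 2: captured at the same time by different cameras.
        if dt == t0 or (dt == 0 and not same_camera):
            pairs.append((a, b))
    return pairs


frames = [(1, 0), (3, 0), (1, 2), (1, 5)]
pairs = extract_pairs(frames, t0=5)
```

Here the frame `(1, 2)` is paired with nothing, because it is neither 5 time steps away from another frame nor simultaneous with a frame from a different camera.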

For example, the object association unit 103c associates object detection results between two video frames based on the detection results for the video frame I^t captured by the imaging device 1 at time t and the detection results for a video frame captured by the imaging device 3 either at the same time or at a time separated from time t by a predetermined interval. As another example, the object association unit 103c associates object detection results between the video frame I^t and the video frame I^{t+t_0} based on the detection results for the video frame I^t captured by the imaging device 1 at time t and the detection results for the video frame I^{t+t_0} captured by the imaging device 1 at time t+t_0, which is separated from time t by a predetermined interval. As a further example, the object association unit 103c associates object detection results between the video frame I_3^t and the video frame I_3^{t+t_0} based on the detection results for the video frame I_3^t captured by the imaging device 3 at time t and the detection results for the video frame I_3^{t+t_0} captured by the imaging device 3 at time t+t_0, which is separated from time t by a predetermined interval.

The object association unit 103c performs the same processing as the functional unit of the same name in any one of the first to third embodiments. For example, when the object detection unit 102c uses a general object detector, the object association unit 103c performs the same processing as the functional unit of the same name in either the first embodiment or the third embodiment. When the object detection unit 102c uses a specific object detector, the object association unit 103c performs the same processing as the functional unit of the same name in the second embodiment. The object association unit 103c differs from the first to third embodiments in how it processes the person detection results between video frames when it receives a notification from the image acquisition unit 101c.

Specifically, when associating the video frame acquired at the time the camera ID changed with the video frame acquired immediately before or immediately after it, the object association unit 103c reverses one of the two person detection results according to the relative relationship between the imaging device 1 and the imaging device 3, and then performs the association.
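The text does not spell out what reversing a detection result means at the coordinate level. One plausible reading, for two imaging devices facing each other across the imaging target area, is a 180-degree rotation of the detection coordinates within the frame, so that left/right and near/far are both swapped. The sketch below illustrates that reading only; the bounding-box format `(x_min, y_min, x_max, y_max)`, the assumption that both frames share the same size, and the function name `reverse_detections` are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def reverse_detections(boxes: List[Box], width: float, height: float) -> List[Box]:
    """Rotate each bounding box by 180 degrees about the center of the frame.

    This is an assumed interpretation of "reversing" for cameras that face
    each other; other camera layouts would need a different mapping.
    """
    return [
        (width - x_max, height - y_max, width - x_min, height - y_min)
        for (x_min, y_min, x_max, y_max) in boxes
    ]


reversed_boxes = reverse_detections([(10, 20, 30, 60)], width=100, height=80)
```

Applying the function twice returns the original boxes, which is a quick sanity check that the mapping is a pure rotation and keeps `x_min <= x_max` and `y_min <= y_max`.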

The transformation matrix estimation unit 104c estimates the coordinate transformation matrix of the video frame I based on the person detection results from the object detection unit 102c and the association results from the object association unit 103c.
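To make the estimation step concrete, the sketch below solves for a 3x3 coordinate transformation matrix from exactly four associated points, fixing the bottom-right element to 1. It is a minimal illustration under stated assumptions, not the estimator of the embodiment: each associated person is assumed to be reduced to a single `(x, y)` point, the four points are assumed to be correct associations with no three collinear, and the function names are hypothetical. A practical estimator would normally use more correspondences and a robust fitting scheme.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_homography(src, dst):
    """Estimate H (with H[2][2] = 1) mapping four src points onto four dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in the 8 unknowns.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(hm, point):
    """Map an (x, y) point through H using homogeneous coordinates."""
    x, y = point
    w = hm[2][0] * x + hm[2][1] * y + hm[2][2]
    return ((hm[0][0] * x + hm[0][1] * y + hm[0][2]) / w,
            (hm[1][0] * x + hm[1][1] * y + hm[1][2]) / w)


# Four associated points related by a pure translation of (2, 3).
H = estimate_homography([(0, 0), (1, 0), (1, 1), (0, 1)],
                        [(2, 3), (3, 3), (3, 4), (2, 4)])
mapped = apply_homography(H, (0.5, 0.5))
```

Fixing the last element to 1 removes the scale ambiguity of the matrix, which leaves eight unknowns and is why four point correspondences suffice in this simplified setting.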

In the image processing system 100c configured as described above, when video frames of the imaging target area 2 captured by a plurality of cameras from mutually different directions are acquired, one of the person detection results is reversed before the person detection results are associated between the video frames. As a result, the parameters of the coordinate transformation matrix can be estimated even when the acquired video frames were captured by a plurality of cameras from mutually different directions.

<Modification>
The imaging devices 1 and 3 may be installed so as to capture the same direction or so as to capture different directions. When the imaging devices 1 and 3 are installed so as to capture the same direction, the image acquisition unit 101c does not need to perform the determination processing based on the camera ID. When the imaging devices 1 and 3 are installed so as to capture different directions, they are installed, for example, so as to face directions orthogonal to each other.

<Modification Common to the First to Fourth Embodiments>
The image acquisition units 101 and 101c may be configured to acquire video frames from a source other than an imaging device. Specifically, the image acquisition units 101 and 101c may acquire video frames from a storage device that stores video frames in which part or all of the imaging target area 2 is captured, from a portable recording medium such as a USB (Universal Serial Bus) device or an SD card, or over a network.

Although embodiments of the present invention have been described in detail above with reference to the drawings, the specific configuration is not limited to these embodiments, and designs and the like that do not depart from the gist of the present invention are also included.

DESCRIPTION OF SYMBOLS: 1, 3: imaging device; 10, 10a, 10b, 10c: image processing device; 101, 101c: image acquisition unit; 102, 102a, 102c: object detection unit; 103, 103a, 103b, 103c: object association unit; 1031: provisional association unit; 1032: graph construction unit; 1033: selection unit; 104, 104c: transformation matrix estimation unit; 105: image processing unit

Claims

1. A coordinate transformation matrix estimation method comprising:
a detection step of detecting an object from each of a plurality of images;
an extraction step of extracting, from the plurality of images, a plurality of images to be associated that were captured a predetermined time interval apart or were captured at the same time by different imaging devices;
an object association step of associating object detection results between the plurality of images extracted as association targets; and
a transformation matrix estimation step of estimating a coordinate transformation matrix between the images based on the association result of the object association step and the object detection results,
wherein the object association step includes:
a provisional association step of provisionally associating the object detection results between the images;
a graph construction step of constructing a graph whose nodes are the pairs of object detection results provisionally associated between the images in the provisional association step; and
a selection step of selecting the nodes of associations assumed to be correct by determining, for the plurality of nodes in the graph and in descending order of a priority assigned according to the relationships between the nodes, whether each node satisfies a condition for addition to a cluster containing the nodes of associations assumed to be correct.

2. The coordinate transformation matrix estimation method according to claim 1, wherein, in the detection step, after the object is detected, the detected object is classified by determining its class, and
when the object detection results between the images include information indicating the classification result of the object, in the provisional association step, object detection results that include information indicating the same classification result are provisionally associated between the images as detection results of the same object.

3. The coordinate transformation matrix estimation method according to claim 1 or 2, wherein, in the graph construction step, each pair of nodes included in the constructed graph is evaluated for geometric consistency, and node pairs evaluated as geometrically consistent are connected by an edge, and
in the selection step, the nodes of associations assumed to be correct are selected with a node that is connected by edges to a larger number of nodes treated as a node of higher priority.

4. The coordinate transformation matrix estimation method according to any one of claims 1 to 3, wherein, in the selection step, it is determined, in order from the node with the highest priority, that a node satisfies the condition for addition to the cluster if the feature point of the node is not used as a feature point in any node already added to the cluster, and a node that satisfies the condition for addition to the cluster is added to the cluster.

5. The coordinate transformation matrix estimation method according to any one of claims 1 to 4, wherein, when detection results of objects captured in a plurality of images obtained by a plurality of imaging devices photographing an imaging target area from opposite directions are input, in the object association step, the association is performed after one of the object detection results is reversed.

6. A computer program for executing the coordinate transformation matrix estimation method according to any one of claims 1 to 5.
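As an informal illustration of the selection step recited in claims 3 and 4, the following sketch ranks nodes by the number of edges attached to them and greedily adds a node to the cluster only if neither of its detections is already used by a node in the cluster. The node representation `(index_in_image_a, index_in_image_b)`, the function name `select_correspondences`, and the assumption that the geometric consistency edges are supplied as input are all choices made for this example; the claims do not prescribe them.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (detection index in image A, detection index in image B)


def select_correspondences(nodes: List[Node],
                           edges: List[Tuple[Node, Node]]) -> List[Node]:
    """Greedily pick nodes in descending order of degree, one use per detection."""
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Higher degree means higher priority; ties keep the input order.
    ordered = sorted(nodes, key=lambda node: degree[node], reverse=True)
    used_a, used_b, cluster = set(), set(), []
    for i, j in ordered:
        # Addition condition: neither detection is already used in the cluster.
        if i not in used_a and j not in used_b:
            cluster.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return cluster


nodes = [(0, 0), (0, 1), (1, 1), (2, 2)]
edges = [((0, 0), (1, 1)), ((0, 0), (2, 2)), ((1, 1), (2, 2))]
cluster = select_correspondences(nodes, edges)
```

In this example the node `(0, 1)` has no consistent neighbors and conflicts with `(0, 0)` and `(1, 1)`, so it is left out of the cluster.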