JP2020042608A - Detection apparatus and program

Info

Publication number: JP2020042608A
Application number: JP2018170289A
Authority: JP
Inventors: 建鋒徐; Kenho Jo; カノクパンラートニポンパン; Kanokpan Ratnipompan; 和之田坂; Kazuyuki Tasaka
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-09-12
Filing date: 2018-09-12
Publication date: 2020-03-19
Anticipated expiration: 2038-09-12
Also published as: JP6962662B2

Abstract

To provide a detection apparatus that can detect objects in a high-resolution image at high speed. SOLUTION: A detection apparatus includes a first acquisition unit 1, which acquires a first candidate area as an area with motion from a first image, and a first detection unit 2, which detects the area and type of an object only for those windows, among the windows defined on the first image, in which the first candidate area has been acquired, thereby detecting the area and type of the object in the whole of the first image. The first detection unit 2 detects at least the area of the object. The apparatus further includes a second acquisition unit 3, which acquires, from a second image obtained by imaging the same scene as the first image from a different viewpoint, a second candidate area corresponding to the area detected in the first image, and a second detection unit 4, which detects the object only for those windows, among the windows defined on the second image, in which the second candidate area has been acquired, thereby detecting the object in the whole of the second image. SELECTED DRAWING: Figure 2

Description

The present invention relates to a detection device and a program that can detect objects in an image at high speed even when the image is of high resolution, and that are suitable for application to, for example, images in a high-resolution multi-view video.

In recent years, object detection techniques have been developed that use deep learning to recognize the class of an object in an image and also to estimate the region (bounding box) in which the object exists in the image. For example, Non-Patent Document 1 discloses an object detection technique with a two-stage structure, composed of a convolutional layer that extracts a feature map, a network that extracts object candidate regions (Region Proposal Network), and a network that outputs classification and regression results. Although the method of Non-Patent Document 1 achieves highly accurate detection, the use of the network for extracting object candidate regions makes the computation heavy.

To speed up the computation, methods that require only one network instead of a two-stage structure have therefore been proposed. For example, YOLOv3 (You Only Look Once version 3) of Non-Patent Document 2 achieves high speed by dividing the image into a grid and performing object detection for each cell with one simple network. As another example, SSD (Single Shot Detector) of Non-Patent Document 3 likewise performs object detection with one simple network, and further improves the recognition accuracy for small objects by introducing a multiscale approach.

Non-Patent Document 1: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91-99.
Non-Patent Document 2: Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. arXiv (2018).
Non-Patent Document 3: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21-37.
Non-Patent Document 4: C. Stauffer and W. Grimson. 1999. Adaptive background mixture models for real-time tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. 246-252.

However, even methods such as those of Non-Patent Documents 2 and 3, which achieve fast and accurate detection as described above, have the problem that the computation time becomes long when they are applied to a case in which the detection targets are small and sparse within a high-resolution image such as a 4K image.

FIG. 1 shows an image from a soccer video as a schematic example of an image in which this problem occurs; the same situation can arise in other sports videos and in videos of other genres. In FIG. 1, a vast soccer field F is captured from a bird's-eye view over the whole of image P. As examples of targets to be detected, one soccer player PL and a ball B (illustrated as a human-shaped figure and a dot, respectively) appear very small on the field F and are sparsely distributed within image P. In an ordinary soccer video, other objects such as other players and spectator seats (whether or not they are detection targets) would also be captured in image P, but they are omitted from FIG. 1 because it is a schematic example.

As a numerical example of the sizes in FIG. 1, image P is a 4K image of 4096 pixels wide by 2160 pixels high, whereas the rectangular region (bounding box) surrounding the player PL is about 100 pixels wide by 40 pixels high and the rectangular region surrounding the ball B is about 10 pixels wide by 10 pixels high. Both are very small relative to the whole of image P and are sparsely distributed.

The following is a numerical example (as confirmed by the present inventors) of the computation time when the detection methods of Non-Patent Documents 2 and 3 are applied as-is to an image P such as that of FIG. 1. Consider the case in which the high-resolution image P is input directly, without being reduced to the input size assumed by YOLOv3 or SSD (for YOLOv3, for example, the 416 × 416 size described next). With an NVIDIA GTX 1080Ti as the computing environment, running YOLO on one sliding window of 416 pixels wide by 416 pixels high takes about 25 ms (milliseconds). Applying this sliding window over the entirety of one 4K image P and running YOLO therefore takes about 1278 ms, as calculated below, which is too long for real-time video processing.
25 ms × (4096 pixels × 2160 pixels) ÷ (416 pixels × 416 pixels) ≈ 1278 ms

In FIG. 1, window W shows a schematic example of such a sliding window of size 416 × 416 within image P.

Furthermore, when each image P is the image of one viewpoint of a multi-view video, running YOLO on the entire multi-view video (at a given time) requires the above 1278 ms multiplied by the number of viewpoints. For example, a multi-view video with four viewpoints requires four times as long, 5112 ms per time frame, which makes real-time video processing even more difficult.
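The estimates above can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the function name `full_scan_time_ms` and its parameters are illustrative and do not appear in the source, and the window count is taken as the plain area ratio used in the formula above rather than a count of whole windows.

```python
def full_scan_time_ms(width, height, window=416, ms_per_window=25.0, views=1):
    """Estimate detection time when every window position is processed.

    The number of windows is approximated by the area ratio
    (image area) / (window area), as in the formula in the text.
    """
    windows_per_image = (width * height) / (window * window)
    return ms_per_window * windows_per_image * views


single_view = full_scan_time_ms(4096, 2160)          # one 4K image
four_views = full_scan_time_ms(4096, 2160, views=4)  # four viewpoints
print(round(single_view), round(four_views))         # 1278 5112
```

Because the cost is linear in both image area and number of viewpoints, any reduction in the number of processed windows carries over directly to the total time.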

As described above, the conventional techniques have the problem that, even when the targets to be detected are fairly sparse within a high-resolution image, the amount of computation for detection increases in direct proportion to the resolution. The conventional techniques have the further problem that, when the images belong to a multi-view image, the amount of computation for detection likewise increases in direct proportion to the number of viewpoints.

In view of the above problems of the conventional techniques, a first object of the present invention is to provide a detection device and a program capable of detecting objects in an image at high speed even when the image is of high resolution. A second object of the present invention is to provide a detection device and a program capable of detecting objects in an image at high speed even when the image is one of the images of a multi-view image.

To achieve the above objects, a first feature of the present invention is a detection device comprising: a first acquisition unit that acquires, from a first image, a first candidate region as a region with motion; and a first detection unit that detects the region and/or type of a target in the whole of the first image by detecting the region and/or type of the target only for those windows, among the windows defined on the first image, in which the first candidate region has been acquired.

A second feature is that the first detection unit detects at least the region of the target, and that the detection device further comprises: a second acquisition unit that acquires, from a second image obtained by capturing from another viewpoint the same scene as that captured in the first image, a second candidate region as a region corresponding to the region of the target detected in the first image by the first detection unit; and a second detection unit that detects the region and type of the target in the whole of the second image by detecting the region and type of the target only for those windows, among the windows defined on the second image, in which the second candidate region has been acquired.

A third feature of the present invention is a detection device for a first image in which the region of a target has already been detected, comprising: a second acquisition unit that acquires, from a second image obtained by capturing from another viewpoint the same scene as that captured in the first image, a second candidate region as a region corresponding to the region of the target detected in the first image; and a second detection unit that detects the region and/or type of the target in the whole of the second image by detecting the region and/or type of the target only for those windows, among the windows defined on the second image, in which the second candidate region has been acquired.

A fourth feature is a program that causes a computer to function as the detection device according to any of the first to third features.

According to the first feature, detection can be performed at high speed even on a high-resolution image, because window-based detection processing is applied only to regions with motion. According to the second or third feature, detection can be performed at high speed even on a high-resolution image, because, in the second image obtained by capturing from another viewpoint the same scene as the first image, window-based detection processing is applied only to the regions of the second image that correspond to regions in which a target has already been detected in the first image.

FIG. 1 is a diagram showing a schematic example of an image in which the problem occurs.
FIG. 2 is a functional block diagram of a detection device according to one embodiment.
FIG. 3 shows a schematic example of a camera arrangement for capturing a multi-view image, for the case of four viewpoints.
FIG. 4 is a flowchart of the operation of the detection device according to one embodiment.
FIG. 5 is an example of the result of applying the background subtraction method using a Gaussian mixture model (mixture of normal distributions) in the first detection unit.
FIG. 6 is a diagram showing a schematic example of the processing by the first detection unit.
FIG. 7 is a diagram for explaining how the first detection unit converts a relative position detected within a window into an absolute position within the image.
FIG. 8 is a diagram for explaining one embodiment in which the second acquisition unit acquires the second candidate region.
FIG. 9 is a diagram for explaining the enlargement processing by the second detection unit.

FIG. 2 is a functional block diagram of a detection device according to one embodiment. As shown in FIG. 2, the detection device 10 includes a first acquisition unit 1, a first detection unit 2, a second acquisition unit 3, and a second detection unit 4. As its overall operation, the detection device 10 can perform the following first operation and second operation. In the first operation, the first acquisition unit 1 and the first detection unit 2 receive a first image as input, and the first detection unit 2 outputs the detection result for the first image (the range occupied by each target in the first image and the type of that target). In the second operation, the second acquisition unit 3 and the second detection unit 4 receive a second image as input, and the second detection unit 4 outputs the detection result for the second image (the range occupied by each target in the second image and the type of that target).

In one embodiment (hereinafter, embodiment EA), the detection device 10 performs the first operation and then performs the second operation using information obtained in the first operation (namely, the first region, that is, the range occupied by the target, taken from the detection result of the first detection unit 2). In another embodiment (hereinafter, embodiment EB), the detection device 10 performs only the first operation and not the second operation. In this case, the second acquisition unit 3 and the second detection unit 4 can be omitted, so that the detection device 10 includes only the first acquisition unit 1 and the first detection unit 2.

The first image and the second image input to the detection device 10 can be prepared so as to have the following relationship. As also shown in FIG. 2, within a multi-view image (a plurality of images of a common scene captured at a common time by cameras in mutually different arrangements (viewpoints)), the image at a first viewpoint is the first image, and the image at a second viewpoint different from the first viewpoint is the second image.

FIG. 3 shows a schematic example of a camera arrangement for capturing a multi-view image, for the case of four viewpoints, taking as an example the case in which images like image P of FIG. 1 (an image from a video of a soccer match) are captured at each viewpoint. The four cameras C10, C20, C30, and C40 capture the field F, where a soccer match is taking place as the common scene, from mutually different arrangements (viewpoints). FIG. 3 schematically shows the field F and the cameras C10, C20, C30, and C40 as seen from above; targets such as the player PL and the ball B shown in FIG. 1 are omitted. Letting P10, P20, P30, and P40 be the images obtained by the cameras C10, C20, C30, and C40, respectively, capturing the field F as the common scene at a common time, the image group (P10, P20, P30, P40) constitutes the multi-view image serving as the frame at each time of the multi-view video.

As also shown in FIG. 2, the processing performed by each of the units 1 to 4 of the detection device 10 is outlined below.

The first acquisition unit 1 acquires, as a first candidate region R1, a region of the input first image P1 that is determined to have motion as foreground. The acquired first candidate region R1 is output to the first detection unit 2.

Here, for the first image P1(t) as the frame image at each time t of the video (t is a time index, t = 1, 2, 3, ..., and for the purpose of explanation the current time of interest is denoted t; the same applies hereinafter), the first acquisition unit 1 can obtain the first candidate region R1(t) at the current time t by referring to the first image P1(t) at the current time t and the first images P1(t-k) at one or more past times t-k (k ≥ 1), thereby acquiring information on motion as foreground within the first image P1(t).

The first candidate region R1(t) may consist of any number of regions (connected regions) within the first image P1(t). The same applies to the first region in the first image, described later for the first detection unit 2, and to the second candidate region in the second image, described later for the second acquisition unit 3.
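As a rough illustration of acquiring motion regions from the current frame and one past frame, the sketch below uses plain frame differencing followed by connected-component grouping. This is a deliberately simplified stand-in, not the method of the source, whose figure list refers to background subtraction with a mixture-of-normal-distributions model. The function name `motion_boxes`, the threshold value, and the use of 4-connectivity are all assumptions made for the example.

```python
def motion_boxes(current, past, threshold=30):
    """Return bounding boxes (x0, y0, x1, y1) of 4-connected moving regions.

    current, past: 2-D lists of grayscale values with identical shape.
    x1 and y1 are exclusive. A pixel counts as moving when its absolute
    difference from the past frame exceeds `threshold` (assumed value).
    """
    h, w = len(current), len(current[0])
    moving = [[abs(current[y][x] - past[y][x]) > threshold for x in range(w)]
              for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not moving[y][x] or seen[y][x]:
                continue
            # Flood-fill one connected region and track its extent.
            stack = [(x, y)]
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cx, cy = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and moving[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            boxes.append((x0, y0, x1 + 1, y1 + 1))
    return boxes
```

The result is a list of any length, which matches the remark that the candidate region may consist of an arbitrary number of connected regions.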

For the windows W defined in advance as moving within the first image P1(t) for target detection processing, the first detection unit 2 processes only those windows W that contain the first candidate region R1(t) obtained by the first acquisition unit 1 at each time t, detecting the regions and types of the captured targets in them, and thereby obtains the detection result of the regions and types of the captured targets for the whole of the first image P1(t).

In this way, rather than performing region detection and type determination for all of the many windows W defined in advance as moving within the first image P1(t), the first detection unit 2 performs this processing only on the windows W that contain the first candidate region R1(t). The processing can therefore be completed at high speed, for example when the distribution of the captured targets in the first image P1(t) is sparse. A schematic example is described later with reference to FIG. 6.
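The window restriction described above can be sketched as a simple filter over a predefined set of windows. In this sketch the windows are assumed to be a non-overlapping 416 × 416 tiling with clipped edge windows, and a window is selected when it overlaps any candidate box; both choices, and the helper names `tile_windows`, `overlaps`, and `select_windows`, are assumptions for illustration, since the source only states that the windows are defined in advance. The candidate boxes are made-up values of roughly player and ball size.

```python
def tile_windows(width, height, size=416):
    """Non-overlapping windows (x0, y0, x1, y1); edge windows are clipped."""
    return [(x, y, min(x + size, width), min(y + size, height))
            for y in range(0, height, size)
            for x in range(0, width, size)]


def overlaps(a, b):
    """True when two boxes (x0, y0, x1, y1) share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def select_windows(windows, candidates):
    """Keep only the windows that overlap some candidate region."""
    return [w for w in windows if any(overlaps(w, c) for c in candidates)]


windows = tile_windows(4096, 2160)
candidates = [(1000, 500, 1100, 540), (3000, 1500, 3010, 1510)]
active = select_windows(windows, candidates)
print(len(active), "of", len(windows), "windows need detection")
```

With sparse candidates only a small fraction of the windows is passed to the detector, which is where the speed-up comes from.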

In embodiment EA (in which both the first operation and the second operation are performed), the information on the target regions contained in the detection result of the first detection unit 2 (that is, the information on the regions and types of the targets captured in the first image P1(t)) is also output to the second acquisition unit 3 as information on the first region D1(t).

At each time t, the second acquisition unit 3 acquires, as a second candidate region R2(t), the region of the input second image P2(t) that corresponds to the first region D1(t) obtained from the first detection unit 2. The acquired second candidate region R2(t) is output to the second detection unit 4.

Here, the second acquisition unit 3 applies to the first region D1(t) a transformation H (the homography transformation H described later) from a pixel position (x1, y1) in the first image P1(t) to a pixel position (x2, y2) in the second image P2(t). The transformation H is determined by the relationship between the first camera parameters, obtained in advance for the first camera capturing the first image P1(t), and the second camera parameters, obtained in advance for the second camera capturing the second image P2(t). The first region D1(t) in the first image P1(t) is thereby mapped to a region H(D1(t)) in the second image P2(t), and the second candidate region R2(t) can be acquired on the basis of the mapped region H(D1(t)).
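The mapping of a detected box through H can be illustrated as follows. This is a minimal sketch that assumes H is already available as a 3 × 3 matrix and that the mapped region is approximated by the axis-aligned rectangle enclosing the four transformed corners; how the source derives H from the camera parameters and how it forms R2(t) from H(D1(t)) are not reproduced here. The names `apply_homography`, `map_box`, and the matrix `H_example` are hypothetical.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # divide by the homogeneous coordinate


def map_box(H, box):
    """Map a box (x0, y0, x1, y1) and return the enclosing axis-aligned box."""
    x0, y0, x1, y1 = box
    corners = [apply_homography(H, x, y)
               for x, y in ((x0, y0), (x1, y0), (x1, y1), (x0, y1))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up matrix for illustration only: scale by 2, then shift by (10, 20).
H_example = [[2.0, 0.0, 10.0],
             [0.0, 2.0, 20.0],
             [0.0, 0.0, 1.0]]
mapped = map_box(H_example, (100, 50, 200, 90))
```

A general homography does not keep rectangles axis-aligned, which is why the sketch takes the enclosing rectangle of the transformed corners.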

The second detection unit 4 performs, using the second image P2(t) and the second candidate region R2(t), the same processing that the first detection unit 2 performs using the first image P1(t) and the first candidate region R1(t), and can thereby obtain the detection result of the regions and types of the captured targets for the whole of the second image P2(t).

That is, for the windows W defined in advance as moving within the second image P2(t) for target detection processing, the second detection unit 4 processes only those windows W that contain the second candidate region R2(t) obtained by the second acquisition unit 3 at each time t, detecting the regions and types of the captured targets in them, and can thereby obtain the detection result of the regions and types of the captured targets for the whole of the second image P2(t). A schematic example is described later, with reference to FIG. 6, in the description of the first detection unit 2.

Therefore, for the same reason as described for the first detection unit 2 (only some of all the windows W are subjected to detection processing), the second detection unit 4 can also complete the processing at high speed, for example when the distribution of the captured targets in the second image P2(t) is sparse.

FIG. 4 is a flowchart of the operation of the detection device 10 according to one embodiment (an embodiment corresponding to embodiment EA described above, in which both the first operation and the second operation are performed). FIG. 4 shows the steps for performing detection processing on the first image and the second images as the images at each time t of a multi-view video.

As a premise for the description of the steps of FIG. 4, the multi-view video subjected to detection processing by the detection device 10 is assumed to be captured by cameras C1, C2, ..., CN at N viewpoints (N ≥ 2), forming multi-view images P1(t), P2(t), ..., PN(t) at each time t. Of these, P1(t) is used as the first image, and each of the remaining N-1 images P2(t), ..., PN(t) is used as a second image. In the detection device 10, it can be set in advance that the image of any one viewpoint among such general N-viewpoint (N ≥ 2) multi-view images P1(t), P2(t), ..., PN(t) is used as the first image and that the images of the remaining N-1 viewpoints are each used as a second image. Because variable names (P1(t), P2(t), ..., PN(t)) must be assigned for the purpose of explanation, the image of this one arbitrarily selectable viewpoint is simply denoted P1(t) as the first image, and the images P2(t), ..., PN(t) of the other N-1 viewpoints are denoted as second images, without loss of generality.

This assignment of variable names is also consistent with the outline description of units 1 to 4 above, in which the first image was the image P1(t) from camera C1 and the second image was the image P2(t) from camera C2. That is, what is processed as the second image in the outline description need not be the image of only the one camera C2; as described below with reference to FIG. 4, images from other cameras C3, ..., CN (excluding camera C1 of the first image) may additionally be present.

When the flow of FIG. 4 starts, the process first proceeds to step S10. In step S10, the detection device 10 acquires the multi-view images P1(t), P2(t), ..., PN(t) at the current time t as input data, and the process proceeds to step S12. In step S12, the first acquisition unit 1 and the first detection unit 2 perform the processing described in the outline on the first image P1(t) among the multi-view images at the current time t acquired in step S10, thereby detecting the regions and types of the targets captured in the first image P1(t), and the process proceeds to step S14.

In step S14, the second acquisition unit 3 and the second detection unit 4 perform the processing described in the outline on one second image Pn(t) (n being one of 2, 3, ..., N) for which detection processing has not yet been completed, among the multi-view images at the current time t acquired in step S10, thereby detecting the regions and types of the targets captured in that second image Pn(t), and the process proceeds to step S16.

As also explained in the outline, when detection processing is performed on the second image Pn(t) in step S14, the second acquisition unit 3 refers to the information on the first region D1(t) in the detection result obtained by the first detection unit 2 in step S12, and the processing by the second acquisition unit 3 and the second detection unit 4 is then performed.

In step S16, it is determined whether, for the current time t, the detection processing of step S14 has been completed for all N-1 second images Pn(t) identified by the index n (n = 2, 3, ..., N). If it has been completed for all N-1 images (an affirmative determination), processing proceeds to step S20; if any image remains unprocessed (a negative determination), processing proceeds to step S18.

In step S18, one of the N-1 second images Pn(t) (n = 2, 3, ..., N) at the current time t for which the processing of step S14 has not yet been completed is selected, and processing returns to step S14. In that repeated step S14, detection is performed on the second image Pn(t) selected in step S18.

When step S14 is reached from step S12 rather than from step S18, detection is still incomplete for all N-1 second images Pn(t), so any one of them may be chosen as the detection target. For each time t, step S14 is executed exactly N-1 times, once for each of the N-1 second images Pn(t). The order of execution (the order of n) may be set in advance, in which case step S18 makes its selection according to that order; for example, the ascending order n = 2, 3, ..., N may be used.

In step S20, the detection device 10 outputs, as the detection result for the whole set of input multi-view images P1(t), P2(t), ..., PN(t), the detection result for the first image P1(t) obtained in step S12 for the current time t together with the detection results for the second images Pn(t) (n = 2, 3, ..., N) obtained in the N-1 executions of step S14; processing then proceeds to step S22. In step S22, the current time t is updated to the next time t+1, and processing returns to step S10.

In this way, the flow of FIG. 4 outputs, in step S20, a detection result for the multi-view images P1(t), P2(t), ..., PN(t), which are the frames of the multi-view video at each time t = 1, 2, 3, ....
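
The per-frame control flow of steps S10 to S20 can be sketched as follows. This is a minimal illustration, not the claimed implementation: `process_frame`, `detect_first`, and `detect_second` are hypothetical names, and the two callables merely stand in for the combined processing of units 1 and 2 and of units 3 and 4.

```python
def process_frame(images, detect_first, detect_second):
    """One pass of steps S10-S20 for the multi-view images at a single time t.

    images        -- list [P1, P2, ..., PN]; images[0] is the base viewpoint
    detect_first  -- hypothetical callable standing in for units 1 and 2;
                     returns (first_region, detections) for the first image
    detect_second -- hypothetical callable standing in for units 3 and 4;
                     called as detect_second(image, n, first_region)
    """
    # Step S12: detect in the first image and keep the first region D1(t).
    first_region, first_detections = detect_first(images[0])
    results = {1: first_detections}

    # Steps S14-S18: visit each second image once, here in ascending order of n.
    for n in range(2, len(images) + 1):
        results[n] = detect_second(images[n - 1], n, first_region)

    # Step S20: the combined result for all N viewpoints.
    return results
```

Step S22 would correspond to calling this function again with the images of time t+1.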

The processing performed by each of the units 1 to 4 of the detection device 10 is described in detail below.

＜First acquisition unit 1＞
The first acquisition unit 1 acquires a foreground region from the first image P1(t), which serves as the "base viewpoint" among the multi-view images P1(t), P2(t), ..., PN(t). For example, in the case of soccer video such as the schematic example of FIG. 1, regions such as the players PL and the ball B are acquired as the foreground region.

The term "base viewpoint" reflects the following technical significance of what has already been described. In embodiment EA, the second acquisition unit 3 refers to the information on the first region D1(t) in the first image P1(t), the base viewpoint, which makes it possible to narrow down the detection targets in the second images Pn(t) (n = 2, 3, ..., N) other than the base viewpoint to the second candidate region R2(t). Because the first image is thus a prerequisite for the detection processing of the second images, it is referred to as the base viewpoint.

Specifically, the first acquisition unit 1 can acquire the first candidate region R1(t) as the foreground region by background subtraction, an existing technique. For example, the background subtraction method using a Gaussian mixture disclosed in Non-Patent Document 4, cited above, can be used to separate moving objects as foreground. Methods such as that of Non-Patent Document 4 model the background with a mixture of Gaussian distributions (MoG), which copes with problems such as a background that sways in the wind or an illumination environment that changes with the position of the sun or the movement of clouds. Because MoG updates the background model sequentially using newly observed images, it can also handle slow changes in illumination, such as a change in the position of the sun.

FIG. 5 shows an example in which the players and the ball have been extracted as foreground by applying the MoG-based background subtraction to the first image, the soccer video shown schematically in FIG. 1; foreground pixels are shown in white and background pixels in black.

＜First detection unit 2＞
The first detection unit 2 considers the windows W (sliding windows) defined in advance as moving over the first image P1(t), and applies a detector to only those windows W that contain the first candidate region R1(t) acquired as foreground by the first acquisition unit 1. The detector is an existing one, such as YOLO disclosed in Non-Patent Document 2 cited above, that detects the region and type of each object captured within a window. In this way, the region and type of each object are obtained as the detection result for the whole of the first image P1(t).

FIG. 6 illustrates a schematic example of the processing by the first detection unit 2, divided into parts [1] to [4]. [1] is an example of the windows W defined in advance in the image for applying a detector such as YOLO; here, as a schematic example, the whole image is divided into 8 columns and 3 rows, giving 24 windows in total. [2] is an example of a foreground extraction result like that of FIG. 5, but, for legibility in the explanation of FIG. 6, it is drawn with the opposite convention to FIG. 5: the background as white pixels and the foreground as black pixels. [3] overlays the predetermined windows defined in [1] on the foreground extraction result of [2], and [4] shows in gray the six windows, among those overlaid in [3], that contain foreground and are therefore subject to YOLO or a similar detector.

As can be seen in [4] of FIG. 6, the first detection unit 2 does not apply a detector such as YOLO to all 24 windows; only the six windows containing foreground are subject to the detector (the remaining 18 windows are excluded), so that sparse objects in the image can be detected at high speed.
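
The window restriction of FIG. 6 can be illustrated with a short sketch. It assumes the foreground mask is given as a 2D list of 0/1 values and that the windows form a non-overlapping grid; the function name `select_windows` and the mask representation are illustrative choices, not part of the disclosure.

```python
def select_windows(mask, cols, rows):
    """Return (col, row) indices of grid windows that contain foreground.

    mask       -- 2D list of 0/1 values (1 = foreground), height x width
    cols, rows -- number of windows horizontally and vertically
    """
    height, width = len(mask), len(mask[0])
    selected = []
    for r in range(rows):
        # Integer pixel bounds of this row of windows.
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            # Keep the window if any pixel inside it is foreground.
            if any(mask[y][x] for y in range(y0, y1) for x in range(x0, x1)):
                selected.append((c, r))
    return selected
```

Only the windows returned here would be passed to the detector; all others are skipped, which is where the speed-up for sparse scenes comes from.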

The example of FIG. 6 defines the windows W by dividing the whole image vertically and horizontally, so that different windows do not overlap. The windows may instead be so-called sliding windows, in which windows at different positions overlap. For example, the window W may have the same size as in FIG. 6 (one eighth of the image width and one third of the image height), while the stride is set to half the window size in both the horizontal and vertical directions rather than the full window size as in FIG. 6. In this case, adjacent windows overlap by half a window, and the window can be slid to 8×2-1=15 positions horizontally and 3×2-1=5 positions vertically, for a total of 15×5=75 positions. Any other window definition that follows an existing window-sliding scheme for searching within an image (for example, those used in template matching) may also be used; for example, part of a window may be allowed to extend outside the image at the image border.
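
The count of 75 positions can be checked with a small sketch. The 3840×2160 frame size is an assumed example (the text does not fix a resolution), and `window_origins` is a hypothetical helper name.

```python
def window_origins(image_size, window_size, stride):
    """Top-left offsets along one axis for windows lying fully inside the image."""
    return list(range(0, image_size - window_size + 1, stride))

# Assumed 3840x2160 frame with the window size of FIG. 6 (1/8 width, 1/3 height).
width, height = 3840, 2160
win_w, win_h = width // 8, height // 3           # 480 x 720 window
xs = window_origins(width, win_w, win_w // 2)    # stride of half a window
ys = window_origins(height, win_h, win_h // 2)
positions = [(x, y) for y in ys for x in xs]     # all window top-left corners
```

Here `xs` has 15 entries and `ys` has 5, matching the 8×2-1 and 3×2-1 counts in the text.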

By applying a detector such as YOLO, the first detection unit 2 obtains the region and type of each object within a window W (a window W containing foreground). (The type information can be obtained as a pair of a type and its confidence value, for example "type: person, confidence: 0.8".) Furthermore, by combining the relative position of the region within the window W with the position that the window W occupies in the first image P1(t), the region of the detected object within the first image P1(t) is obtained.

FIG. 7 illustrates how a region that the first detection unit 2 has detected as relative coordinates within a window W, as described above, is converted into coordinates in the whole first image P1(t).

FIG. 7 shows the first detection unit 2 performing detection on the j-th window W(j) in the first image P1(t) and detecting the region of the i-th object as a rectangular region Obj-1(i). The rectangular region Obj-1(i) can be represented by two coordinate pairs, marked with black dots (●) in FIG. 7: its upper-left vertex $(x_r^1(i), y_r^1(i))$ and its lower-right vertex $(x_r^2(i), y_r^2(i))$. These two coordinate pairs are obtained as relative coordinates within the window W(j), with the upper-left vertex of W(j) as the origin (reference position). The window W(j) itself is located such that its upper-left vertex (also marked with a black dot (●) in FIG. 7) has the position $(x_w(j), y_w(j))$ in the first image P1(t).

Accordingly, the first detection unit 2 uses the following equations (1A) to (1D) to convert the relative coordinates $(x_r^1(i), y_r^1(i))$ and $(x_r^2(i), y_r^2(i))$ of the upper-left and lower-right points, which represent the detected i-th rectangular region Obj-1(i) within the window W(j), into the absolute coordinates $(x_1^1(i), y_1^1(i))$ and $(x_1^2(i), y_1^2(i))$ of those two points in the whole image P1(t), thereby obtaining the region detection result in the whole image P1(t).
$$x_1^1(i) = x_r^1(i) + x_w(j) \tag{1A}$$
$$y_1^1(i) = y_r^1(i) + y_w(j) \tag{1B}$$
$$x_1^2(i) = x_r^2(i) + x_w(j) \tag{1C}$$
$$y_1^2(i) = y_r^2(i) + y_w(j) \tag{1D}$$
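
Equations (1A) to (1D) amount to adding the window's upper-left corner to both vertices of the box. A minimal sketch, with the hypothetical name `window_to_image` and boxes represented as plain tuples:

```python
def window_to_image(box, window_origin):
    """Convert a box from window-relative to image-absolute coordinates.

    box           -- (x_r1, y_r1, x_r2, y_r2): upper-left and lower-right
                     vertices relative to the window's upper-left corner
    window_origin -- (x_w, y_w): the window's upper-left corner in the image
    """
    x_r1, y_r1, x_r2, y_r2 = box
    x_w, y_w = window_origin
    # Equations (1A)-(1D): the same offset is added to both vertices.
    return (x_r1 + x_w, y_r1 + y_w, x_r2 + x_w, y_r2 + y_w)
```

For instance, a box (10, 20, 50, 120) found in a window whose corner lies at (480, 720) maps to (490, 740, 530, 840).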

As also shown in FIG. 7, the coordinate axes in an image (including within a window) are oriented, for both relative and absolute coordinates, so that +x (the direction of increasing x) points to the right horizontally and +y (the direction of increasing y) points downward vertically. The same applies to the other coordinates described below.

The information on the first region D1(t) in the image P1(t) obtained by the first detection unit 2 can be expressed by giving the absolute coordinates $(x_1^1(i), y_1^1(i))$ and $(x_1^2(i), y_1^2(i))$ of the above two points for every detected object region Obj-1(i) (i = 1, 2, ..., M, where M is the total number of detected regions). Instead of using each object region Obj-1(i), detected as a rectangle enclosing the object, as it is, the first region D1(t) may use a version enlarged by a predetermined ratio. With this enlargement, the second candidate region Rn(t) (n = 2, 3, ..., N), described next for the second acquisition unit 3, is also obtained with a margin, so that detection in the second image becomes more reliable even if the transformation $H_{n,1}$ described later contains an error.

FIG. 7 additionally shows, as a white dot (○), the relative coordinates $(x_r^1(i), y_r^2(i))$ of the lower-left vertex of the rectangular region Obj-1(i) within the window W(j). This is shown for reference in connection with FIG. 8 and other figures described later. The relative coordinates $(x_r^1(i), y_r^2(i))$ of the lower-left vertex can likewise be converted into the absolute coordinates $(x_1^1(i), y_1^2(i))$ in the image by equations (1A) and (1D) above.

＜Second acquisition unit 3＞
The second acquisition unit 3 acquires the second candidate region Rn(t) (n = 2, 3, ..., N) by applying a predetermined homography transformation $H_{n,1}$, which maps pixel coordinates of the first image P1(t) to image coordinates of the second image Pn(t) (n = 2, 3, ..., N), to the information on the first region D1(t) detected in the first image P1(t) by the first detection unit 2.

FIG. 8 illustrates an embodiment in which the second acquisition unit 3 acquires the second candidate region Rn(t) (n = 2, 3, ..., N). As shown in [1], let the rectangular region of an object detected in the first image P1(t) be the region Obj-1(i). (That is, the information on the first region D1(t) is the information on the regions Obj-1(i) for all detected objects.) The region Obj-1(i) in [1] of FIG. 8 is the same as the region Obj-1(i) shown in FIG. 7. The two ends of its lower (+y side) horizontal edge, namely the lower-left vertex (○) and the lower-right vertex (●), are shown with their absolute coordinates in the image P1(t), $(x_1^1(i), y_1^2(i))$ and $(x_1^2(i), y_1^2(i))$ respectively.

As shown schematically in [2] of FIG. 8, the second acquisition unit 3 applies the homography transformation $H_{n,1}$ to each of the two end points $(x_1^1(i), y_1^2(i))$ and $(x_1^2(i), y_1^2(i))$ of the lower horizontal line segment (of length $Len_1(i)$) of the rectangular region Obj-1(i) in [1], and thereby obtains the two transformed points $(x_n^{1L}(i), y_n^{2L}(i))$ and $(x_n^{2R}(i), y_n^{2R}(i))$ on the second image Pn(t), shown in [3] as a white dot (○) and a black dot (●) respectively.

As is well known from the mathematics used in computer graphics and related fields, this transformation can be computed as the multiplication of a column vector of size 3, which expresses the coordinates in homogeneous form, by the 3×3 matrix $H_{n,1}$, as in equations (2L) and (2R) below. The superscript T denotes transposition, so each side is a column vector of size 3 in homogeneous coordinates. Because homogeneous coordinates are defined only up to scale, the product on the right-hand side is normalized so that its third component equals 1.
$$(x_n^{1L}(i),\ y_n^{2L}(i),\ 1)^T = H_{n,1}\,(x_1^1(i),\ y_1^2(i),\ 1)^T \tag{2L}$$
$$(x_n^{2R}(i),\ y_n^{2R}(i),\ 1)^T = H_{n,1}\,(x_1^2(i),\ y_1^2(i),\ 1)^T \tag{2R}$$
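
Equations (2L) and (2R) can be sketched in plain Python as a matrix-vector product followed by division by the third homogeneous component. The function name `apply_homography` and the nested-list matrix representation are assumptions made for illustration; the matrix in the usage note is an arbitrary example, not a calibrated $H_{n,1}$.

```python
def apply_homography(H, point):
    """Map an image point through a 3x3 homography given as nested lists.

    H     -- 3x3 matrix [[h00, h01, h02], [h10, h11, h12], [h20, h21, h22]]
    point -- (x, y) pixel coordinates in the source image
    """
    x, y = point
    # Multiply H by the homogeneous column vector (x, y, 1)^T.
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Normalize so that the third homogeneous component becomes 1.
    return (u / w, v / w)
```

Applying this to the lower-left and lower-right vertices of Obj-1(i) yields the two transformed points used in the following paragraphs; with the example matrix `[[2, 0, 10], [0, 2, 20], [0, 0, 2]]`, the point (4, 6) maps to (9.0, 16.0).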

The second acquisition unit 3 defines the width $Len_n(i) = |x_n^{2R}(i) - x_n^{1L}(i)|$ as the horizontal (x-direction) extent spanned by the two transformed points obtained from equations (2L) and (2R), and on that basis obtains the rectangular region Obj-n(i) in the second image Pn(t) corresponding to the rectangular region Obj-1(i) of the first image P1(t), as shown in [3]. The information on the second candidate region Rn(t) (n = 2, 3, ..., N) is thus obtained as the information on the rectangular regions Obj-n(i) for all detected objects i.

Together with the width $Len_n(i) = |x_n^{2R}(i) - x_n^{1L}(i)|$, this fixes the x-axis extent of the rectangular region Obj-n(i), whose sides are parallel to the image coordinate axes x and y: its right and left ends lie at $x_n^{2R}(i)$ and $x_n^{1L}(i)$ (or the reverse). To fully determine the rectangular region Obj-n(i), its y-axis extent, that is, the positions of its lower and upper ends, must also be determined. This can be done as follows.

＜Position of the lower end＞
First, the lower end of the rectangular region Obj-n(i) in the second image Pn(t) may be defined as the larger of the y coordinates of the two transformed points $(x_n^{1L}(i), y_n^{2L}(i))$ and $(x_n^{2R}(i), y_n^{2R}(i))$, that is, $\max(y_n^{2L}(i), y_n^{2R}(i))$. [3] of FIG. 8 shows, as an example, the case in which the transformation yields $\max(y_n^{2L}(i), y_n^{2R}(i)) = y_n^{2R}(i)$.

＜Position of the upper end＞
Next, the upper end of the rectangular region Obj-n(i) in the second image Pn(t) may be determined as follows. The aspect ratio of Obj-n(i) in the second image Pn(t) is assumed to be identical to that of the corresponding rectangular region Obj-1(i) in the first image P1(t); determining the height of Obj-n(i) on this assumption also determines the position of its upper end.

That is, letting $Height_1(i)$ be the height of the rectangular region Obj-1(i) and $Height_n(i)$ be the height of the rectangular region Obj-n(i), the height $Height_n(i)$ can be determined from relation (3) below, which states that the aspect ratios (width-to-height ratios) are equal.
$$\frac{Len_1(i)}{Height_1(i)} = \frac{Len_n(i)}{Height_n(i)} \tag{3}$$
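
Putting the width, lower end, and relation (3) together, the box Obj-n(i) can be assembled as in the sketch below. The function name `second_view_box` and the tuple layout are illustrative assumptions; the inputs are the two transformed bottom-edge points and the width and height of Obj-1(i).

```python
def second_view_box(left_pt, right_pt, len_1, height_1):
    """Axis-aligned box Obj-n(i) from the two transformed bottom-edge points.

    left_pt, right_pt -- transformed points (x_n^1L, y_n^2L) and (x_n^2R, y_n^2R)
    len_1, height_1   -- width and height of Obj-1(i) in the first image
    Returns (x_left, y_top, x_right, y_bottom), with +y pointing downward.
    """
    # The x extent is spanned by the two points (in either order).
    x_left, x_right = sorted((left_pt[0], right_pt[0]))
    len_n = x_right - x_left                  # Len_n(i)
    # Lower end: the larger of the two transformed y coordinates.
    y_bottom = max(left_pt[1], right_pt[1])
    # Relation (3): equal aspect ratio gives the height in the second image.
    height_n = len_n * height_1 / len_1
    return (x_left, y_bottom - height_n, x_right, y_bottom)
```

For example, points (100, 400) and (140, 410) with a 20×50 source box give a 40×100 box whose lower edge lies at y = 410.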

The embodiment described above, which determines the rectangular region Obj-n(i) in the second image Pn(t) by the method of FIG. 8, makes the homography transformation $H_{n,1}$, which is a plane-to-plane transformation, applicable to three-dimensional regions by relying on the following premises.

The first premise is that every object to be detected in the first image P1(t) and the second image Pn(t) (for example, a soccer player) stands on a common plane in space (for example, the plane of the soccer field F in FIG. 1), which is the plane to which the homography matrix $H_{n,1}$ applies. In other words, the objects do not depart greatly from the common plane in the height direction and can be regarded as standing approximately on it. The transformation of the two points at the lower end of the rectangle, described with FIG. 8, rests on this first premise. (In the soccer-player example, the premise is that the position of the head, which is away from the ground, would be mapped to a distorted position by the homography, whereas the position of the feet, which touch the ground, is mapped to a position approximately on the ground.) The first premise further assumes that the cameras are arranged so that, in both the first image P1(t) and the second image Pn(t), the lower side (+y direction) is the side closer to the common plane (the ground or the like) in space.

The second premise is that the rectangles enclosing the same object in the first image P1(t) and in the second image Pn(t) have approximately the same aspect ratio; the aspect-ratio-based determination of the upper end in FIG. 8 rests on this second premise.

Each homography matrix $H_{n,1}$ can be calculated in advance by any existing method using camera calibration markers (for example, square markers) placed on the common plane such as the ground. For fixed cameras, the matrix $H_{n,1}$ may be prepared as a fixed parameter. For moving cameras, camera calibration using the markers may be performed at each time.

＜Second detection unit 4＞
As already described, the processing of the second detection unit 4 is the same as that of the first detection unit 2. The only difference lies in the data processed: the first detection unit 2 processes the first image P1(t) and refers to the first candidate region R1(t) to restrict the windows, whereas the second detection unit 4 processes the second image Pn(t) and refers to the second candidate region Rn(t) to restrict the windows. The processing itself is common to the first detection unit 2 and the second detection unit 4.

The second detection unit 4 may, however, also implement the following additional embodiment.

The significance of this additional embodiment is explained first. As already described, the second detection unit 4 (and the first detection unit 2) uses a detector, such as YOLO, built in advance by deep learning. This prior training uses a large amount of labeled training data in which objects are captured at a standard size in images of a standard size. Consequently, from the standpoint of detection confidence, the resulting detector has a preferred standard size for the objects within the window being processed. In particular, an object that is too small carries little information to begin with and lowers detection confidence, so a certain minimum size is desirable.

In a shooting situation such as that described with FIG. 1, however, this minimum size may not be available, and objects may be too small for the trained detector to detect with the required confidence. In this additional embodiment, therefore, when a region is judged to be too small, it is enlarged before the detector processes the window.

FIG. 9 illustrates the enlargement processing by the second detection unit 4. [1] of FIG. 9 shows the rectangular region Obj-n(i) in the second image Pn(t) from [3] of FIG. 8 as it appears within a window W. When Obj-n(i) is judged to be small, it is enlarged (by similarity scaling) by a factor of scale, as shown in [2], with its upper-left vertex $(x_n^{1L}(i), y_n^{1L}(i))$, marked with a black dot (●), as the fixed reference position, yielding the enlarged rectangular region Obj-n'(i). The window W containing the enlarged rectangular region Obj-n'(i), as shown in [2], is then used as the detection target of the detector.

A predetermined point within the rectangular region Obj-n(i) other than the upper-left vertex may also be used as the reference position for enlargement. In addition, the background BGW outside Obj-n(i) in the window W of [1] of FIG. 9 (without enlargement) and the background BGW' outside Obj-n'(i) in the window W of [2] (with enlargement) may be treated as containing none of the texture captured in the corresponding second image Pn(t) when the detector is run.
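
The geometric part of the enlargement in FIG. 9 can be sketched as follows, assuming the upper-left vertex is the fixed reference position. The name `enlarge_box` is hypothetical, and the sketch covers only the box coordinates, not the resampling of pixel data.

```python
def enlarge_box(box, scale):
    """Scale a box about its upper-left vertex, which stays fixed.

    box   -- (x0, y0, x1, y1) with (x0, y0) the upper-left vertex
    scale -- enlargement factor applied to both width and height
    """
    x0, y0, x1, y1 = box
    # Similarity scaling: width and height grow by the same factor.
    return (x0, y0, x0 + (x1 - x0) * scale, y0 + (y1 - y0) * scale)
```

With a different reference point, the same scaling would be applied to the offsets of both vertices from that point instead.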

In the enlargement processing of this additional embodiment, the second detection unit 4 may decide whether to enlarge, and calculate the enlargement factor scale when it does, as follows.

First, the area $S_n(i)$ of the rectangular region Obj-n(i) in [3] of FIG. 8 or [1] of FIG. 9 is obtained from the pixel count, that is, the area $S_1(i)$, of the rectangular region Obj-1(i) in the first image P1(t) shown in [1] of FIG. 8. Because the aspect ratios are equal under the second premise above, the area is given by equation (4) below, in which $S_1(i)$ is multiplied by the square of the width ratio $Len_n(i)/Len_1(i)$.
$$S_n(i) = S_1(i)\left(\frac{Len_n(i)}{Len_1(i)}\right)^2 \tag{4}$$

Enlargement is performed when the area $S_n(i)$ thus obtained is less than or equal to a predetermined threshold $S_{TH}$ (a threshold that can be set according to the standard size handled by the detector). The enlargement factor scale is calculated by first computing the intermediate value $s_{tmp}$ of equation (5A) and then selecting between equations (5B) and (5C): equation (5B) is used when $s_{tmp}$ is less than the threshold TH, and equation (5C) when it is greater than or equal to TH. Here sqrt() denotes the square root, which converts a ratio determined from areas into the enlargement factor scale, a ratio of lengths.
$$s_{tmp} = \frac{S_{peak}(c)}{S_n(i)} \tag{5A}$$
$$scale = \sqrt{s_{tmp}} \quad (\text{if } s_{tmp} < TH) \tag{5B}$$
$$scale = \sqrt{TH} \quad (\text{if } s_{tmp} \ge TH) \tag{5C}$$

In (5B) and (5C) above, TH is a threshold set in advance so that the enlargement factor scale does not become too large. (Note that the threshold TH is distinct from the threshold S_TH applied to the area S_n(i).) This threshold is set because excessive enlargement degrades image quality and tends to lower detection reliability. In (5A) above, S_peak(c) is a predetermined area at which detection reliability is ensured, chosen according to the type c (for example, c = soccer player, ball, ...) of the rectangular region Obj-1(i) detected in the first image P1(t); a constant value independent of the type c may also be used. The type c may be obtained by referring to the detection result of the first detection unit 2.
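The enlargement decision and equations (5A) to (5C) can be sketched in Python as follows. This is a minimal sketch, not the implementation of the embodiment: the function name, the use of None to signal that no enlargement is needed, and all numeric values are assumptions made for illustration.

```python
import math
from typing import Optional


def enlargement_scale(area_n: float, s_th: float,
                      s_peak: float, th: float) -> Optional[float]:
    """Return the linear enlargement factor, or None when no enlargement applies.

    area_n: area S_n(i) of the second candidate region.
    s_th:   threshold S_TH on the area (enlarge only if area_n <= s_th).
    s_peak: area S_peak(c) at which detection reliability is ensured.
    th:     threshold TH that caps the area ratio s_tmp.
    """
    if area_n <= 0:
        raise ValueError("area_n must be positive")
    if area_n > s_th:
        return None                  # region is large enough, no enlargement
    s_tmp = s_peak / area_n          # equation (5A)
    if s_tmp < th:
        return math.sqrt(s_tmp)      # equation (5B)
    return math.sqrt(th)             # equation (5C): capped enlargement


# Hypothetical values: S_TH = 900, S_peak(c) = 1600, TH = 9.
print(enlargement_scale(400.0, 900.0, 1600.0, 9.0))   # 2.0 (s_tmp = 4 < TH)
print(enlargement_scale(100.0, 900.0, 1600.0, 9.0))   # 3.0 (s_tmp = 16 >= TH)
print(enlargement_scale(1000.0, 900.0, 1600.0, 9.0))  # None (area above S_TH)
```

Because TH bounds the area ratio, the linear factor never exceeds sqrt(TH), which is 3 in this hypothetical setting.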

As described above, according to the present invention, detection can be performed at high speed even when, in 4K video or the like, the players and other detection targets in a soccer video such as that of FIG. 1 appear sparsely and at small sizes. Supplementary explanations of additional embodiments and the like of the present invention are given below.

(1) As in the additional embodiment above, the second detection unit 4 can enlarge the region Obj-n(i) by the factor scale when its area S_n(i) is determined to be small. As an alternative and/or additional embodiment, so that the detector can detect such a region properly as is, without enlargement, even when it is determined to be small, second learning data may be newly prepared by reducing the images in the first learning data on which the detector has already been trained. The detector is then retrained by deep learning or the like, and the retrained detector is used by the second detection unit 4.

Specifically, the size of each target is calculated from the target region (bounding box) labeled in the original learning data (first learning data) and compared with the minimum target size obtained in advance by statistical processing or the like. Retraining then uses second learning data to which versions reduced to multiple levels of target size have been added. For example, suppose that statistical processing gives a minimum ball size of 100 and that the ball region (bounding box) labeled in a certain training image in the first learning data has an area of 400. This training image is reduced by factors of 1/4, 1/3, 1/2, and 1 (a factor of 1 leaves the original size unchanged), yielding four training images, three of which are newly created reduced images. Applying the same processing to the other training images in the first learning data yields the second learning data.
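The reduction levels in the ball example can be sketched in Python as follows. The text does not state whether the factors 1/4, 1/3, 1/2, and 1 apply to the area or to the side length; this sketch assumes they apply to the area, since 400 × 1/4 equals the minimum size of 100. Discarding levels that would fall below the minimum size is likewise an assumption, and the function name is hypothetical.

```python
import math
from fractions import Fraction

# Area reduction factors from the example (assumed to act on the area).
AREA_FACTORS = (Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(1, 1))


def reduction_plan(box_area, min_area, area_factors=AREA_FACTORS):
    """List (area_factor, linear_factor, reduced_area) for each usable level.

    A level is kept only if the reduced bounding-box area stays at or above
    min_area (an assumption of this sketch, not stated in the text).
    """
    plan = []
    for factor in area_factors:
        reduced_area = box_area * factor
        if reduced_area >= min_area:
            # The image side length scales with the square root of the area factor.
            plan.append((factor, math.sqrt(factor), reduced_area))
    return plan


# Example from the text: minimum ball size 100, labeled ball area 400.
for factor, linear, area in reduction_plan(400, 100):
    print(f"area x{factor}: resize sides by {linear:.3f}, box area {float(area):.1f}")
```

Under these assumptions all four levels are kept for the example, giving box areas from 100 up to the original 400; the resized images themselves would be produced with an image library of choice.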

(2) The first detection unit 2 and the second detection unit 4 detect the region and type of a target in an image using a detector such as YOLO, but they may output only one of the region and the type. In that case, both the region and the type may be detected as internal processing while only one of them is used as the actual output data. However, for the second acquisition unit 3 and the second detection unit 4 to operate, the first detection unit 2 must output the region information (information on the first region D1(t)) to the second acquisition unit 3, even if it does not output that information externally (as data used by a user or the like). Besides YOLO, the first detection unit 2 and the second detection unit 4 may use SSD or the like, as disclosed in the aforementioned Non-Patent Document 3, as a detector capable of fast window-based detection.

(3) As described above, the present invention can take the form of embodiment EA (in which both the first operation and the second operation are performed) or embodiment EB (in which only the first operation is performed). In addition, embodiment EC, which extracts only the second operation of embodiment EA, is also possible. In embodiment EC, the first operation is completed in advance and the resulting first region D1(t) is used as input. Embodiment EC also admits the following variant: when the second region Dn(t) has already been obtained as a detection result for the second image Pn(t) of a certain camera viewpoint n, the detected second region Dn(t) may be used, in place of the detected first region D1(t), as the input for obtaining the second region Dm(t) as a detection result from the second image Pm(t) of another camera viewpoint m (m ≠ n, m ≧ 2).

(4) The present invention can also be provided as a program that causes a computer to function as the detection device 10. The computer may have a well-known hardware configuration including a CPU (central processing unit), memory, and various interfaces, with the CPU executing instructions corresponding to the functions of each unit of the detection device 10. The computer may further include a GPU (graphics processing unit) capable of faster parallel processing than the CPU, and all or any part of the functions of the detection device 10 may be executed by loading the program on the GPU instead of the CPU.

10: detection device, 1: first acquisition unit, 2: first detection unit, 3: second acquisition unit, 4: second detection unit

Claims

A detection device comprising:
a first acquisition unit that acquires a first candidate region, as a region with motion, from a first image; and
a first detection unit that detects the region and/or type of a target over the entire first image by detecting the region and/or type of the target only for those windows, among the windows defined on the first image, from which the first candidate region has been acquired.

The detection device according to claim 1, wherein the first acquisition unit acquires the first candidate region by applying a background subtraction method.

The detection device according to claim 1, wherein the first detection unit detects at least the region of the target, the detection device further comprising:
a second acquisition unit that acquires, from a second image obtained by capturing the same scene as the first image from a different viewpoint, a second candidate region as a region corresponding to the target region detected from the first image by the first detection unit; and
a second detection unit that detects the region and type of the target over the entire second image by detecting the region and type of the target only for those windows, among the windows defined on the second image, from which the second candidate region has been acquired.

A detection device for use with a first image in which a target region has already been detected, the detection device comprising:
a second acquisition unit that acquires, from a second image obtained by capturing a scene common to the first image from a different viewpoint, a second candidate region as a region corresponding to the target region detected from the first image; and
a second detection unit that detects the region and/or type of the target over the entire second image by detecting the region and/or type of the target only for those windows, among the windows defined on the second image, from which the second candidate region has been acquired.

The detection device according to claim 3 or 4, wherein the second acquisition unit acquires the second candidate region by applying a homography transformation to the target region detected from the first image.

The detection device according to claim 5, wherein the first image and the second image are captured with the target able to move on a predetermined plane, and
the second acquisition unit acquires the second candidate region as a rectangle enclosing the side on the second image that is obtained by applying the homography transformation to the side, closer to the predetermined plane, of the rectangle enclosing the target region detected from the first image.

The detection device according to claim 3, wherein the second detection unit detects the region and/or type of the target by applying a detector constructed in advance by deep learning using learning data, and, when the size of the second candidate region is determined to be small compared with the standard size of targets for the detector, the second detection unit enlarges the second candidate region before detecting the region and/or type of the target and then detects the region and/or type of the target in the enlarged region.

A program for causing a computer to function as the detection device according to claim 1.