JP7063837B2

JP7063837B2 - Area extraction device and program

Info

Publication number: JP7063837B2
Application number: JP2019059632A
Authority: JP
Inventors: 軍陳; 整内藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-03-27
Filing date: 2019-03-27
Publication date: 2022-05-09
Anticipated expiration: 2039-03-27
Also published as: JP2020160812A

Description

The present invention relates to an area extraction device and a program capable of appropriately detecting object regions in multi-viewpoint images even in the presence of occlusion.

A subject contour (silhouette) is the region of a person, animal, or other general object in an image, extracted as a binary mask image; the silhouette boundary corresponds to the boundary of the depicted object. Contour extraction techniques separate an object in the image from the background by foreground/background separation. They fall broadly into two types: those that use deep learning and those that model the background.

As a deep learning method, Mask R-CNN of Non-Patent Document 1 provides a general framework for instance segmentation, that is, detecting object masks in an image and assigning an identification result to each object mask. Mask R-CNN can be further generalized and used for other tasks such as human pose estimation, object detection with rectangular bounding boxes, and keypoint detection.

In background modeling methods, the background is modeled mathematically by statistical processing, and the parameters of the model distribution are estimated from the distribution of pixel values over a small range. Non-Patent Document 2 relies on the assumption that background pixel values follow a normal distribution spread around a specific value (the mean) with a small amplitude (the variance).
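The Gaussian background assumption can be illustrated with a minimal sketch for a single pixel. The sample values, the function names, and the multiplier `k` are assumptions made for illustration; they are not taken from Non-Patent Document 2.

```python
from statistics import mean, pstdev

def fit_background(samples):
    """Estimate the per-pixel Gaussian parameters (mean, standard deviation)."""
    return mean(samples), pstdev(samples)

def is_background(value, mu, sigma, k=2.5):
    """True if the value lies within k standard deviations of the background mean."""
    return abs(value - mu) <= k * sigma

# Grey levels observed at one pixel over several background-only frames (made-up data).
samples = [100, 102, 98, 101, 99]
mu, sigma = fit_background(samples)

near = is_background(101, mu, sigma)   # small deviation from the mean
far = is_background(140, mu, sigma)    # large deviation from the mean
```

A value close to the estimated mean is classified as background, while a value far from it is treated as a foreground candidate.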

Non-Patent Document 1: He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017, October). Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on (pp. 2980-2988). IEEE.
Non-Patent Document 2: Kenji Terabayashi, Kazunori Umeda, Alessandro Moro, "Real-time adaptive threshold processing of background subtraction using person tracking information", Institute of Electrical Engineers of Japan, General Industry Study Group materials, GID-09-17, pp. 89-90 (2009).
Non-Patent Document 3: Laurentini, A. "The visual hull concept for silhouette-based image understanding." IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994, 16(2): 150-162.
Non-Patent Document 4: Lorensen, William E., and Harvey E. Cline. "Marching cubes: A high resolution 3D surface construction algorithm." ACM SIGGRAPH Computer Graphics. Vol. 21. No. 4. ACM, 1987.

However, the conventional contour extraction techniques described above have the following problems.

Mask R-CNN of Non-Patent Document 1 is robust to noise and to scenes crowded with objects, but it cannot produce precise mask shapes; it yields only coarse masks. Such coarse masks are unsuitable for free-viewpoint video technology, which generates 3D models of objects from contour masks in multi-viewpoint video. Furthermore, Mask R-CNN is weak against occlusion and generally cannot correctly segment objects that are occluded.

The background modeling method of Non-Patent Document 2 works well in environments where the light source is properly controlled, but it cannot achieve sufficient accuracy in environments where the lighting changes dynamically, such as outdoors. Moreover, the shadow region of an object moves together with the object and therefore cannot be detected correctly.

In view of the above problems of the prior art, an object of the present invention is to provide an area extraction device and a program capable of appropriately detecting object regions in multi-viewpoint images even in the presence of occlusion.

To achieve the above object, the present invention is an area extraction device comprising: a division unit that applies region segmentation to the image of each viewpoint of a multi-viewpoint image to obtain a first foreground mask for each viewpoint; a back projection unit that applies a visual hull method with a relaxed intersection criterion to the first foreground masks of the viewpoints to obtain a three-dimensional model; a projection unit that projects the three-dimensional model onto the image plane of each viewpoint of the multi-viewpoint image to obtain a second foreground mask for each viewpoint; a calculation unit that calculates, from the second foreground mask of each viewpoint, a distance map to the foreground; and an improvement unit that extracts the foreground by applying, to the second foreground mask of each viewpoint, a background subtraction method that takes the distance map into account, thereby obtaining a third foreground mask for each viewpoint as the region extraction result for the multi-viewpoint image. The invention is also a program that causes a computer to function as the area extraction device.

According to the present invention, object regions can be detected appropriately even in the presence of occlusion by combining a visual hull method with a relaxed intersection criterion and a background subtraction method that takes a distance map into account.

FIG. 1 is a functional block diagram of an area extraction device according to one embodiment.
FIG. 2 is a flowchart of the operation of the area extraction device on video according to one embodiment.
FIG. 3 is a flowchart of the operation of the area extraction device according to one embodiment on the multi-viewpoint image at an arbitrary time in the video.
FIG. 4 is a diagram showing a schematic example of a camera arrangement for obtaining multi-viewpoint video.
FIG. 5 is a diagram showing a schematic example of region extraction as a coarse foreground mask by the division unit.
FIG. 6 compares the results of the existing visual hull method with those of the visual hull method under the relaxed condition of the present embodiment.
FIG. 7 is a diagram showing, as a schematic example of a background image for the background subtraction method, the background image prepared for the image of FIG. 5.
FIG. 8 is a diagram showing a schematic example of the exclusive OR (XOR) of an example current foreground mask and previous foreground mask.
FIG. 9 is a diagram showing a schematic example of the data processed by the area extraction device.
FIG. 10 is a diagram showing the hardware configuration of a general computer device.

FIG. 1 is a functional block diagram of the area extraction device 10 according to one embodiment. The area extraction device 10 includes a division unit 1, a back projection unit 2, a projection unit 3, a calculation unit 4, an improvement unit 5, and a prediction unit 6.

FIG. 2 is a flowchart of the operation of the area extraction device 10 on video according to one embodiment. The area extraction device 10 can read a multi-viewpoint video and perform region extraction on the multi-viewpoint image that forms the frame at each time t; FIG. 2 shows the flow for processing each time frame of the video in this way. Each step of FIG. 2 is described below.

When the flow of FIG. 2 starts with the current time t set to the initial time t=1 of the multi-viewpoint video, in step S10 the area extraction device 10 reads, as input, the multi-viewpoint image that forms the frame at the current time t, and acquires the parameters for the current time t that were predicted at the immediately preceding time t-1 (the statistical parameters used by the improvement unit 5 to apply the background subtraction method). The process then proceeds to step S11.

In step S11, the area extraction device 10 performs the processing described in detail later in the division unit 1, the back projection unit 2, the projection unit 3, the calculation unit 4, and the improvement unit 5, so that the improvement unit 5 outputs the foreground masks of the multi-viewpoint image at the current time t. The process then proceeds to step S12.

In step S12, the prediction unit 6 predicts the parameters for the next time t+1 (the statistical parameters of the background subtraction method) using, among other things, the output of the improvement unit 5 from the immediately preceding step S11, and the process proceeds to step S13. In step S13, the current time t is updated to the next time t+1 and the process returns to step S10; the flow of FIG. 2 is repeated in the same way for each time t of the multi-viewpoint video.

The parameters for the current time t that are acquired in step S10, predicted at the immediately preceding time t-1, are the parameters predicted by the prediction unit 6 in the step S12 (at time t-1) that immediately preceded that step S10. For t=1 (the initial time), parameters calculated in advance from a predetermined background image for applying the background subtraction method may be used, as described later in the explanation of the prediction unit 6.
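The per-frame loop of FIG. 2 can be sketched as follows. This only illustrates how the background parameters are carried from one time step to the next; the function name `run_video` and the `extract` and `predict` callables are hypothetical stand-ins for the processing of units 1 to 5 and of the prediction unit 6.

```python
def run_video(frames, initial_params, extract, predict):
    """Process frames in time order, carrying background parameters forward."""
    params = initial_params          # t = 1: parameters from the prepared background image
    masks_per_time = []
    for frame in frames:             # S10: read the multi-viewpoint image at time t
        masks = extract(frame, params)           # S11: units 1 to 5 produce the foreground masks
        masks_per_time.append(masks)
        params = predict(frame, masks, params)   # S12: predict the parameters for t + 1
    return masks_per_time            # S13 (t <- t + 1) is the loop advance

# Toy stand-ins that only expose the data flow.
log = run_video(
    frames=["f1", "f2", "f3"],
    initial_params=0,
    extract=lambda frame, params: (frame, params),
    predict=lambda frame, masks, params: params + 1,
)
```

In this toy run each frame is paired with the parameter value predicted during the previous iteration, which mirrors the hand-off between steps S12 and S10.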

As shown in steps S10 to S12 of FIG. 2, the area extraction device 10 applies common processing to the multi-viewpoint image that forms the frame at each time t of the multi-viewpoint video. FIG. 3 is a flowchart of this common processing, that is, of the operation of the area extraction device 10 according to one embodiment on the multi-viewpoint image at an arbitrary time t in the video.

The processing performed by each functional unit of the area extraction device 10 is described in detail below, following the steps of FIG. 3.

(0) Multi-viewpoint video as input data
The multi-viewpoint video input to the area extraction device 10 is obtained by capturing a common scene with a plurality of (at least two) cameras placed at different viewpoints. FIG. 4 shows, as a schematic example of a camera arrangement for obtaining multi-viewpoint video, ten cameras C1 to C10 surrounding and capturing a common field F (for example, a field F on which a sport is played). The videos of the individual cameras are assumed to be time-synchronized. Each camera is also assumed to have been calibrated independently so that its camera parameters are known.

As indicated by lines L1 and L2 in FIG. 1, the multi-viewpoint image at each time of the input multi-viewpoint video is input to the division unit 1 and to the improvement unit 5 of the area extraction device 10.

(1) Step S1: the division unit 1 obtains a coarse foreground mask
In step S1, the division unit 1 applies region segmentation to the image of each camera viewpoint of the input multi-viewpoint image at the current time t to obtain a coarse foreground mask, outputs this foreground mask to the back projection unit 2, and the process proceeds to step S2.

Specifically, the division unit 1 can obtain the coarse foreground mask by using the deep learning (convolutional neural network) based Mask R-CNN of Non-Patent Document 1 cited above. (Mask R-CNN is known to detect objects robustly even when shadows are present.) The network parameters may be learned using, for example, the COCO dataset as training data. This training data allows more than 20 types of objects to be detected, but in the present embodiment not all of them need to be detected; only the predetermined types of objects set in advance as foreground for the input multi-viewpoint image (video) may be extracted as the foreground mask. For example, when the multi-viewpoint video is a sports video of volleyball, only the players and the ball may be extracted. The following description also uses the extraction of players and the ball from volleyball video as its example.

FIG. 5 is a schematic example of region extraction as a coarse foreground mask by the division unit 1. A player PL is present on the field F in the image P of a certain camera viewpoint, and the horizontal band B of the volleyball net hides part of the player PL, causing occlusion. In this case, as shown in data D, the player PL is detected as separate regions R1 and R2. The location BN shown in FIG. 5 is referred to later in the explanation of the improvement unit 5.

When extracting the foreground mask with Mask R-CNN, the division unit 1 may keep as extraction results only those regions that are larger than a region-size threshold set according to the type of object. For example, regions detected as players may be kept only if they are larger than a first threshold TH1, and regions detected as the ball only if they are larger than a second threshold TH2; regions that do not satisfy the threshold condition are excluded from the extraction result. Thresholds may be set for each of the height, the width, and the number of pixels of the region, or for only some of them.
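The class-dependent size filter can be sketched as below, using the pixel count as the size measure. The dictionary layout of a detection, the label strings, and the numeric values chosen for TH1 and TH2 are assumptions made for this example; they are not the output format of any particular Mask R-CNN implementation.

```python
def filter_by_size(detections, min_pixels):
    """Keep detections whose pixel count exceeds the threshold for their class."""
    kept = []
    for det in detections:
        threshold = min_pixels.get(det["label"])
        if threshold is not None and det["pixels"] > threshold:
            kept.append(det)
    return kept

TH1, TH2 = 500, 20   # hypothetical thresholds for players and for the ball
detections = [
    {"label": "player", "pixels": 1200},
    {"label": "player", "pixels": 300},   # too small, dropped
    {"label": "ball", "pixels": 45},
    {"label": "net", "pixels": 9000},     # not a target class, dropped
]
kept = filter_by_size(detections, {"player": TH1, "ball": TH2})
```

Classes without an entry in the threshold table are discarded, which corresponds to extracting only the predetermined target types.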

(2) Step S2: the back projection unit 2 back-projects the foreground masks under a relaxed condition to obtain a 3D model
In step S2, the back projection unit 2 back-projects the foreground mask of each viewpoint of the multi-viewpoint image into the three-dimensional space of the field F (real space) and applies the visual hull method to obtain a 3D model of the objects. It outputs this 3D model to the projection unit 3, and the process proceeds to step S3.

The existing visual hull method is as disclosed in Non-Patent Document 3 cited above, among others, and is expressed by equation (1) below. That is, a 3D model of an object is obtained by taking the intersection, over all camera viewpoints, of the visual cones of the object regions identified in the foreground masks.

Here, I is the set of IDs of all camera masks, i∈I is the ID of the i-th camera mask, and V_i is the visual cone formed by the i-th camera mask (foreground). The visual cone V_i can be obtained by computing the perspective projection matrix (camera matrix) from the camera parameters of the i-th camera and finding the spatial coordinates (x,y,z) that correspond to image coordinates (u,v) on the foreground mask, that is, the spatial coordinates (x,y,z) on the ray that extends from the camera viewpoint through the pixel position (u,v). The intersection in equation (1) is evaluated on voxels defined in the three-dimensional space. After the intersection VK(I) over the camera mask set I is obtained as a voxel point cloud in space, the marching cubes algorithm of Non-Patent Document 4 cited above is applied, giving the visual hull as a polygon mesh model in the three-dimensional space.
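Equation (1) itself is not reproduced in this text. From the definitions of I, V_i, and VK(I) given here, it would plausibly take the following form; this is a reconstruction from the surrounding description and should not be read as the published formula.

```latex
\mathrm{VK}(I) = \bigcap_{i \in I} V_i \tag{1}
```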

In the present embodiment as well, the back projection unit 2 obtains the 3D model by the visual hull method followed by polygon meshing. However, it does not apply the existing visual hull method as is; it applies the visual hull method under a relaxed condition. Whereas equation (1) of the existing method takes the intersection over the foreground masks of all camera viewpoints in the camera mask set I, the present embodiment relaxes the intersection criterion, for example as in equation (1') below. If the camera mask set I contains N camera viewpoints, a spatial region is adopted into the relaxed intersection VK(I) when the visual cones V_i of at least r*N of the viewpoints pass through it, for a fixed ratio r (0<r<1), rather than the cones of all N viewpoints. In equation (1'), S(I) is a subset of the camera mask set I with r*N elements.
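Equation (1') is likewise not reproduced in this text. One form consistent with the description above, in which a point belongs to the result when at least rN visual cones contain it, is sketched below. It is a reconstruction under that reading, with rN assumed to be an integer, and is not the published formula.

```latex
\mathrm{VK}(I) = \bigcup_{\substack{S(I) \subseteq I \\ |S(I)| = rN}} \; \bigcap_{i \in S(I)} V_i \tag{1'}
```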

FIG. 6 compares the results of the existing visual hull method with those of the visual hull method under the relaxed condition of the present embodiment. The upper row is the existing method: with ten cameras as in FIG. 4, result R3 is the region through which all ten visual cones pass, and part of it is shown enlarged on the right as result R30. In result R30, the player PL3 in the resulting 3D model is split apart, because the player region had been divided by occlusion as shown in data D of FIG. 5. The lower row shows, for comparison, result R4 and its enlarged part R40 under the relaxed condition of the present embodiment (for example, the condition that at least 8 of the 10 visual cones pass through). The player who was split by the existing method is obtained as player PL4 in the 3D model without being split.
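The effect of relaxing the intersection criterion can be illustrated with a voxel-voting sketch. Here each visual cone is represented by a hypothetical predicate that reports whether a voxel projects inside that camera's foreground mask; the toy voxels and predicates are made up for the example, and `math.ceil` is one assumed way of handling a non-integer r*N.

```python
import math

def relaxed_visual_hull(voxels, cones, r):
    """Keep voxels that lie inside at least ceil(r * N) of the N visual cones.

    `cones` is a list of predicates; cones[i](voxel) is True when the voxel
    projects inside the foreground mask of camera i.
    """
    needed = math.ceil(r * len(cones))
    return [v for v in voxels if sum(1 for inside in cones if inside(v)) >= needed]

voxels = [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
# Toy cones: every camera sees all voxels, except the last one, whose mask is
# split by an occluder and misses the voxel at height 1.
cones = [lambda v: True, lambda v: True, lambda v: True, lambda v: v[2] != 1]

strict = relaxed_visual_hull(voxels, cones, r=1.0)    # conventional full intersection
relaxed = relaxed_visual_hull(voxels, cones, r=0.75)  # 3 of 4 cones suffice
```

With the full intersection the occluded voxel is lost and the object splits in two, whereas the relaxed vote keeps it.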

The foreground mask that the back projection unit 2 uses in step S2 to obtain the 3D model under the relaxed condition is the one obtained in the immediately preceding step. If the preceding step was step S1 (that is, step S2 was reached from step S1), the back projection unit 2 uses as its input the coarse foreground mask output by the division unit 1 in step S1. If the preceding step was step S7, described later (that is, the process returned to step S2 from step S7), the back projection unit 2 uses as its input the foreground mask output by the improvement unit 5 in step S7 (the "intermediate version" of the foreground mask described later, indicated by line L3 in FIG. 1).

(3) Step S3: the projection unit 3 projects the 3D model onto each image of the multi-viewpoint image to obtain foreground masks
In step S3, the projection unit 3 projects the 3D model obtained by the back projection unit 2 in the immediately preceding step S2 onto each image plane of the multi-viewpoint image to obtain a foreground mask for each viewpoint, outputs these foreground masks to the calculation unit 4 and the improvement unit 5, and the process proceeds to step S4. These outputs are indicated by lines L3 and L4, respectively, in FIG. 1.

Because the back projection unit 2 obtains the 3D model as a polygon mesh model, the projection unit 3 can project each polygon element (a planar element such as a triangle) onto the image of each viewpoint of the multi-viewpoint image by using the same perspective projection matrix relationship that the back projection unit 2 used (a projection that obtains the image coordinates (u,v) corresponding to the spatial coordinates (x,y,z) of the 3D model), and thereby obtain the foreground mask as the projection result.
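The projection from spatial coordinates (x,y,z) to image coordinates (u,v) can be sketched with a 3x4 perspective projection matrix and homogeneous coordinates. The matrix below describes a hypothetical camera chosen only to make the arithmetic easy to follow; it is not derived from the document.

```python
def project(P, point):
    """Project a 3D point to image coordinates with a 3x4 camera matrix P."""
    x, y, z = point
    hom = (x, y, z, 1.0)
    # Multiply P by the homogeneous point, then divide by the third component.
    a, b, w = (sum(row[j] * hom[j] for j in range(4)) for row in P)
    return a / w, b / w

# Hypothetical camera: focal length 100 px, principal point (50, 50),
# located at the origin and looking along +z.
P = [
    [100.0, 0.0, 50.0, 0.0],
    [0.0, 100.0, 50.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
u, v = project(P, (1.0, 2.0, 10.0))
```

Projecting the vertices of every mesh polygon in this way and filling the resulting 2D polygons would give the projected foreground mask for that viewpoint.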

Because the back projection unit 2 obtains the 3D model under the relaxed condition, as shown schematically in FIG. 6, the foreground mask obtained when the projection unit 3 projects this 3D model is expected to have the splitting of a single object by occlusion, as illustrated in FIG. 5 (the division of the player PL into regions R1 and R2), eliminated or reduced.

For each polygon (face element) of the polygon mesh in the 3D model, a winding direction may be defined for traversing the edges that enclose the polygon (the direction of traversal when the face element is viewed from its front side on the 3D model). Only those polygons whose winding direction in the projected 2D image is the same as their winding direction in 3D (that is, polygons that face the front in the image of the corresponding viewpoint) may then be projected as polygon elements. This makes it possible to skip projecting the parts that lie on the back side of the 3D model and are not visible (parts that need not be projected in the image of the corresponding viewpoint).
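The winding-direction test can be sketched with the signed area of a projected triangle, whose sign flips when the vertex order is reversed. The convention that front-facing triangles have a positive signed area is an assumption made for this example; the actual sign depends on the mesh and image coordinate conventions in use.

```python
def signed_area(tri):
    """Twice the signed area of a 2D triangle; the sign encodes the winding direction."""
    (u0, v0), (u1, v1), (u2, v2) = tri
    return (u1 - u0) * (v2 - v0) - (u2 - u0) * (v1 - v0)

def front_facing(triangles):
    """Keep projected triangles whose winding matches the assumed front-face convention."""
    return [t for t in triangles if signed_area(t) > 0]

tris = [
    [(0, 0), (4, 0), (0, 3)],   # positive signed area: kept
    [(0, 0), (0, 3), (4, 0)],   # same triangle with reversed winding: culled
]
visible = front_facing(tris)
```

Only the triangles that pass this test would then be rasterized into the projected mask.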

(4) Step S4: the calculation unit 4 calculates a distance map from the foreground mask
In step S4, the calculation unit 4 calculates a distance map from the foreground mask obtained by the projection unit 3 in the immediately preceding step S3, outputs this distance map to the improvement unit 5, and the process proceeds to step S5.

The distance map can be obtained as a map of the shortest distance d(u,v) from each pixel position (u,v) on the image plane of each viewpoint of the multi-viewpoint image to a point belonging to the foreground mask obtained from the projection unit 3. If the position (u,v) itself belongs to the foreground, d(u,v)=0.
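The definition of d(u,v) can be written out directly as a brute-force search over the foreground pixels. The sketch below assumes Euclidean distance, which the text does not specify, and is meant only for small masks; a dedicated distance transform would normally be used in practice.

```python
import math

def distance_map(mask):
    """Shortest Euclidean distance from each pixel to the nearest foreground pixel."""
    fg = [(u, v) for v, row in enumerate(mask) for u, val in enumerate(row) if val]
    if not fg:
        raise ValueError("mask contains no foreground pixel")
    return [
        [min(math.hypot(u - fu, v - fv) for fu, fv in fg) for u in range(len(row))]
        for v, row in enumerate(mask)
    ]

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
d = distance_map(mask)   # d[v][u]: zero on the foreground, growing away from it
```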

(5) Step S5: the improvement unit 5 improves the foreground mask using the distance map
In step S5, the improvement unit 5 applies a background subtraction method with a variable threshold (a threshold that can change from pixel position to pixel position) that uses the distance map obtained by the calculation unit 4 in the immediately preceding step S4. From within the foreground mask obtained by the projection unit 3 in step S3, only the regions that this background subtraction method judges to be foreground are extracted, giving an improved foreground mask. The process then proceeds to step S6.

In other words, the improvement unit 5 takes the foreground region of the foreground mask that the projection unit 3 obtained by projection and, so to speak, "shaves off" the parts that the background subtraction method judges from the pixel information to be background, removing them from the foreground region to obtain an improved foreground mask. The following inclusion relationship therefore holds. (The inclusion relation "⊂" here also covers the case of equality.)
Foreground mask obtained by the improvement unit 5 ⊂ Foreground mask obtained by the projection unit 3

Because pixel information is needed to apply the background subtraction method, the improvement unit 5 obtains this pixel information by referring to the input multi-viewpoint image, as indicated by line L2 in FIG. 1.

The background subtraction method applied by the improvement unit 5 can judge foreground and background according to equations (2) and (3) below.

Equation (2) determines, for a viewpoint image at the current time t of the multi-viewpoint video, whether each pixel p=(u,v) (each pixel within the foreground mask obtained by the projection unit 3) is kept as foreground or removed as background: a pixel p with M_{t,p}=0 is background, and a pixel p with M_{t,p}=1 is foreground. (Under this definition, M_{t,p} is the foreground mask obtained by the improvement unit 5.) I_{t,p} is the pixel value at position p, which can be obtained by referring to the original multi-viewpoint image (the frame at time t), as indicated by line L2.

μ_{t,p} and σ_{t,p} are the mean and standard deviation of the background pixel distribution (Gaussian model) at pixel position p in the viewpoint image being processed at the current time t of the multi-viewpoint video. For the initial time t=1, the mean and standard deviation are obtained in advance, as prior knowledge, by calculating them for a predetermined model background image; for subsequent times t≧2, the values predicted by the prediction unit 6 in step S9, described later, may be used.

Dis_{t,p} is the value of the distance map obtained by the calculation unit 4 at pixel position p in the target viewpoint image at the current time t of the multi-viewpoint video. In this embodiment, unlike the ordinary background subtraction method, the threshold decision uses a function value f(Dis_{t,p}) that depends on the distance map (a variable threshold based on Dis_{t,p}). For example, as shown in equation (3), f can be a function proportional to Dis_{t,p} with a constant k > 0, so that the range of pixel values judged to be background widens as the distance increases. This enables foreground extraction that follows the distance map obtained by the calculation unit 4 (which reflects the foreground mask from the projection unit 3). Besides equation (3), any other non-negative increasing (non-decreasing) function of Dis_{t,p} may be used as f.
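
The decision described for equations (2) and (3) can be sketched as follows. The equations themselves are not reproduced in this text, so the exact form of the test is an assumption: here a pixel is treated as background when |I_{t,p} - μ_{t,p}| ≤ f(Dis_{t,p})·σ_{t,p}, with f(Dis) = k·Dis. The function name `refine_foreground`, the single-channel arrays, and the default value of `k` are hypothetical, and the distance map is taken as a given input.

```python
import numpy as np

def refine_foreground(image, mask, mean, std, dist, k=0.5):
    """Return a refined binary mask (1 = foreground, 0 = background).

    Assumed test (not taken verbatim from the patent text): a pixel inside
    `mask` is treated as background when |I - mean| <= f(dist) * std,
    where f(d) = k * d with k > 0.
    """
    image = image.astype(np.float64)
    # Distance-dependent threshold: larger distance -> wider background range.
    threshold = k * dist * std
    is_background = np.abs(image - mean) <= threshold
    # Only pixels already in the projected mask can remain foreground.
    refined = mask.astype(bool) & ~is_background
    return refined.astype(np.uint8)
```

With k = 1 and a standard deviation of 2, a pixel that differs from the background mean by 4 is kept as foreground at distance 1 but removed as background at distance 3.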

The model background image, which serves as prior knowledge for obtaining the mean and standard deviation of the pixel distribution model (Gaussian model) at the initial time t = 1, can be prepared as a background image of everything other than the objects to be extracted as foreground in the multi-viewpoint video. Objects that are not extraction targets but can occlude the foreground should also be excluded from the background image.

For example, when the player PL (and a ball, not shown) is to be extracted as foreground from the multi-viewpoint video, as in image P of FIG. 5, the volleyball net including the horizontal band B shown in FIG. 5 is not an extraction target but is a cause of occlusion. It is therefore excluded from the background image, and a background BG such as that shown schematically in FIG. 7 (consisting only of the field F, without the occluding volleyball net and the like) is prepared in advance.

In this embodiment, applying the background subtraction method with the decisions of equations (2) and (3) means that, even when the player PL set as the extraction target is occluded, only the part of the horizontal band B that overlaps and occludes the player PL is extracted as foreground, so the occlusion of the player PL is resolved. With the ordinary background subtraction method, the entire volleyball net including the horizontal band B shown in FIG. 5 would be extracted as foreground. In this embodiment, by contrast, the decisions of equations (2) and (3) extract as foreground, together with the player PL, only the occluding portion near the player PL (the neighboring portion BN enclosed by the dotted ellipse in FIG. 5). As a result, the player PL can be extracted with the occlusion resolved.

(6) Step S6: Convergence determination for the improved foreground mask obtained by the improvement unit 5
In step S6, the improvement unit 5 determines whether the foreground mask it obtained in the immediately preceding step S5 has converged. Specifically, let the foreground mask obtained in the immediately preceding step S5 be the n-th foreground mask FG(n) (the n-th pass of the loop formed by returning from step S7 to step S2 in FIG. 3). Its difference from the previous foreground mask FG(n-1) is evaluated as the number of pixels |XOR(FG(n),FG(n-1))| in the set given by their exclusive OR, XOR(FG(n),FG(n-1)). If this pixel count is below a predetermined threshold, the mask is judged to have converged and the process proceeds to step S8; if it is at or above the threshold, the mask is judged not to have converged and the process proceeds to step S7.

FIG. 8 shows a schematic example of the exclusive OR (XOR) of a current foreground mask FG(n) and a previous foreground mask FG(n-1).

When the loop count in FIG. 3 is n = 1 (the first pass), that is, when step S6 is reached for the first time for the multi-viewpoint image at the current time t, the previous foreground mask used for comparison, FG(1-1) = FG(0), may be the foreground mask output by the projection unit 3 in step S3.

Alternatively, in another embodiment, when the loop count in FIG. 3 is n = 1 (the first pass), step S6 may always return a negative determination and always proceed to step S7.

When the loop count n in FIG. 3 satisfies n ≥ 2, the output of the improvement unit 5 in the step S5 immediately preceding this step S6 (the n-th pass) is the current foreground mask FG(n), and the output of the improvement unit 5 in step S5 of the (n-1)-th pass is the previous foreground mask FG(n-1).
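
The convergence test of step S6 can be sketched as follows. This is a minimal illustration assuming the masks are same-sized binary NumPy arrays; the function name `has_converged` and the parameter name `max_changed_pixels` (standing for the predetermined threshold) are hypothetical.

```python
import numpy as np

def has_converged(fg_current, fg_previous, max_changed_pixels):
    """Return True when fewer than `max_changed_pixels` pixels differ.

    The difference is the pixel count of XOR(FG(n), FG(n-1)).
    """
    changed = np.logical_xor(fg_current.astype(bool), fg_previous.astype(bool))
    return int(np.count_nonzero(changed)) < max_changed_pixels
```

For the first pass (n = 1), `fg_previous` would be the mask output by the projection unit 3, or the check could simply be skipped, matching the two alternatives described above.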

(7) Step S7
In step S7, because convergence was not determined in the immediately preceding step S6, the improvement unit 5 outputs the current foreground mask FG(n) to the back projection unit 2 as an intermediate rather than final version, as indicated by line L5 in FIG. 1, and the process returns to step S2.

(8) Step S8
In step S8, because convergence was determined in the immediately preceding step S6, the improvement unit 5 outputs the current foreground mask FG(n) as the final version (the final foreground extraction result for the frame image at the current time t of the multi-viewpoint video), as indicated by line L6 in FIG. 1, and the process proceeds to step S9.

(9) Step S9: The prediction unit 6 predicts the background subtraction parameters for the next time t+1
In step S9, after receiving notification that the improvement unit 5 has obtained the final result for the current time t, as indicated by line L7, the prediction unit 6 predicts the parameters (mean and standard deviation) with which the improvement unit 5 will apply the background subtraction method at the next time t+1 and outputs them to the improvement unit 5. The flow of FIG. 3 then ends. Step S9 in FIG. 3 corresponds to step S12 in FIG. 2.

Specifically, the prediction unit 6 may update the parameters using any existing method for updating background subtraction parameters over the video time series, for example by the following equations (4) and (5). l_t is a predetermined weight for the weighted-sum update and may be set in the range 0 < l_t < 1. Equations (4) and (5) update the parameters at locations judged to be background at the current time t. Locations judged to be foreground carry over unchanged the background parameters from the earlier time at which they were last judged to be background.
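
A weighted-sum update of this kind can be sketched as follows. Equations (4) and (5) are not reproduced in this text, so the specific form used here (an exponential running average of the mean and of the variance) is an assumption based on common practice, not the patent's own formula. The function name `predict_background_params` and the parameter `weight` (playing the role of l_t) are hypothetical.

```python
import numpy as np

def predict_background_params(image, mask, mean, std, weight=0.05):
    """Predict per-pixel (mean, std) for time t+1 from the frame at time t.

    Assumed update (a common running-average form):
      mean' = (1 - w) * mean + w * I
      var'  = (1 - w) * std**2 + w * (I - mean)**2
    Only pixels judged background (mask == 0) are updated; foreground
    pixels carry over their previous background parameters unchanged.
    """
    image = image.astype(np.float64)
    background = mask == 0
    new_mean = (1.0 - weight) * mean + weight * image
    new_var = (1.0 - weight) * std ** 2 + weight * (image - mean) ** 2
    next_mean = np.where(background, new_mean, mean)
    next_std = np.where(background, np.sqrt(new_var), std)
    return next_mean, next_std
```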

FIG. 9 shows a schematic example of data processed by the region extraction device 10. Data D1, D2, D3, and D4 are the processing results (for the same n-th pass of the loop in FIG. 3) for an image from one common viewpoint at a common time (an image of a volleyball match in which the players and the ball are the detection targets). They are, respectively, the foreground mask obtained by the division unit 1, the foreground mask obtained by the projection unit 3, the distance map obtained by the calculation unit 4, and the foreground mask obtained by the improvement unit 5. To the left of data D1, D2, D3, and D4, data D10, D20, D30, and D40 show enlarged views of the upper-left region of D1, D2, D3, and D4.

It can be confirmed that the loss of the player region due to occlusion seen in data D10 is repaired in data D20, and that the extra background present in data D20 has disappeared in data D40.

As described above, according to one embodiment of the present invention, combining the visual volume intersection method with relaxed intersection determination and the background subtraction method that takes the distance map into account makes it possible to detect object regions appropriately even when occlusion is present. Supplementary matters are described below.

(1) The loop processing in FIG. 3 may be omitted. That is, step S6 may be skipped so that the process proceeds directly from step S5 to step S8, and the first output of the improvement unit 5 may be taken as the final result.

(2) FIG. 10 shows the hardware configuration of a general computer device 70; the region extraction device 10 can be implemented as one or more computer devices 70 having this configuration. The computer device 70 includes: a CPU (central processing unit) 71 that executes predetermined instructions; a dedicated processor 72 (a GPU (graphics processing unit), a dedicated deep learning processor, or the like) that executes some or all of the instructions of the CPU 71 in place of or in cooperation with the CPU 71; a RAM 73 as a main storage device that provides a work area for the CPU 71 and the dedicated processor 72; a ROM 74 as an auxiliary storage device; a communication interface 75; a display 76; a camera 77; an input interface 78 that accepts user input via a mouse, keyboard, touch panel, or the like; and a bus B for exchanging data among these components.

Each unit of the region extraction device 10 can be implemented by the CPU 71 and/or the dedicated processor 72 reading from the ROM 74 and executing a predetermined program corresponding to that unit's function. When imaging-related processing is performed, the camera 77 also operates in conjunction; when display-related processing is performed, the display 76 also operates in conjunction; and when communication-related processing for data transmission and reception is performed, the communication interface 75 also operates in conjunction.

For example, the input multi-viewpoint video may be acquired from a network via the communication interface 75. The final result obtained by the improvement unit 5 may be displayed on the display 76. When the region extraction device 10 is implemented as a system of two or more computer devices 70, the information required for each process may be transmitted and received over a network.

10 ... region extraction device, 1 ... division unit, 2 ... back projection unit, 3 ... projection unit, 4 ... calculation unit, 5 ... improvement unit

Claims

A region extraction device comprising:
a division unit that applies region segmentation to the image of each viewpoint of a multi-viewpoint image to obtain a first foreground mask for each viewpoint;
a back projection unit that obtains a three-dimensional model by applying, to the first foreground masks of the viewpoints, a visual volume intersection method in which the intersection determination is relaxed;
a projection unit that projects the three-dimensional model onto the image plane of each viewpoint of the multi-viewpoint image to obtain a second foreground mask for each viewpoint;
a calculation unit that calculates, from the second foreground mask of each viewpoint, a distance map to the foreground; and
an improvement unit that extracts the foreground by applying, to the second foreground mask of each viewpoint, a background subtraction method that takes the distance map into account, thereby obtaining a third foreground mask for each viewpoint as the result of region extraction from the multi-viewpoint image,
wherein the back projection unit applies the visual volume intersection method with relaxed intersection determination by obtaining the three-dimensional model as the region for which the intersection determination holds in the first foreground masks of at least some of all the viewpoints of the multi-viewpoint image,
wherein the improvement unit, in applying the background subtraction method, determines a pixel position to be background when the difference in pixel value from a predefined background model at that pixel position is within a predetermined range, and
wherein, in that determination, the predetermined range is made wider as the distance value of the distance map at the pixel position is larger, so that the background subtraction method taking the distance map into account is applied.

The region extraction device according to claim 1, wherein the division unit performs region segmentation, including identification of region type, using a convolutional neural network trained in advance.

The region extraction device according to claim 2, wherein the division unit obtains the first foreground mask by treating only specific region types as foreground.

The region extraction device according to any one of claims 1 to 3, wherein the calculation unit calculates the distance map by assigning, as the value of the distance map at each pixel position, the shortest distance from that pixel position to the foreground in the second foreground mask.

The region extraction device according to any one of claims 1 to 4, wherein the back projection unit, the projection unit, the calculation unit, and the improvement unit perform an iterative process in which each operates in this order, and the third foreground mask obtained by the improvement unit in one iteration is used as the target of the visual volume intersection method in the back projection unit in the next iteration, and
wherein the improvement unit compares the third foreground mask obtained in the current iteration with the third foreground mask obtained in the previous iteration, and ends the iterative process when it determines that there is no difference.

The region extraction device according to claim 5, wherein the improvement unit performs the comparison by obtaining the exclusive OR of the third foreground mask obtained in the current iteration and the third foreground mask obtained in the previous iteration.

A program that causes a computer to function as the region extraction device according to any one of claims 1 to 6.