JP2023080290A - Information processing apparatus, control method of the same and program

Info

Publication number: JP2023080290A
Application number: JP2023065897A
Authority: JP
Inventors: Keisuke Morisawa (森澤 圭輔)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-05-07
Filing date: 2023-04-13
Publication date: 2023-06-08
Anticipated expiration: 2038-11-06
Also published as: JP7353527B2

Abstract

To provide an information processing apparatus capable of generating high-quality three-dimensional shape data while suppressing loss of the three-dimensional shape data of an object. SOLUTION: The information processing apparatus generates three-dimensional shape data of an object on the basis of a plurality of captured images acquired by a plurality of cameras. The information processing apparatus includes: determination means that, for each of predetermined elements constituting a three-dimensional space, determines whether a condition is met that the number of cameras, among the plurality of cameras, for which the pixel or region corresponding to the predetermined element is included in the region of the object in the captured image is equal to or less than a first threshold; and generation means that generates three-dimensional shape data of the object including the predetermined elements that are not determined by the determination means to meet the condition. SELECTED DRAWING: Figure 4

Description

The present invention relates to an information processing apparatus, a control method for the information processing apparatus, and a program.

In recent years, attention has been drawn to a technique in which a plurality of cameras are installed at different positions to capture images synchronously from multiple viewpoints, and virtual viewpoint content is generated using the multi-viewpoint images obtained by the capture. With this technique, highlight scenes of soccer or basketball, for example, can be viewed from various angles, which gives the user a greater sense of presence than ordinary images. Virtual viewpoint content based on multi-viewpoint images can be generated and viewed by collecting the images captured by the plurality of cameras in an image processing unit such as a server, having the image processing unit perform processing such as three-dimensional model generation and rendering, and transmitting the result to a user terminal.

A calculation technique called the visual volume intersection method (the visual hull approach) has been proposed as a representative method for generating a detailed three-dimensional model. Non-Patent Document 1 discloses extracting the region (silhouette) of a target object (foreground) from multi-viewpoint images and generating a three-dimensional model (three-dimensional shape data) by the visual volume intersection method.

Non-Patent Document 1: Laurentini, A., "The Visual Hull Concept for Silhouette-Based Image Understanding", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 2, pp. 150-162, Feb. 1994

However, with the technique described in Non-Patent Document 1, part of the generated three-dimensional shape data may be missing when another object such as a structure overlaps the target object, or when the target object and the background are similar in color.

The present invention has been made in view of the above problem, and an object of the present invention is to provide a technique for suppressing loss of the three-dimensional shape data of an object and generating high-quality three-dimensional shape data.

An information processing apparatus according to the present invention that achieves the above object is
an information processing apparatus that generates three-dimensional shape data of an object based on a plurality of captured images acquired by image capturing with a plurality of cameras, the apparatus comprising:
determination means for determining, for each of predetermined elements constituting a three-dimensional space, whether a condition is met that the number of cameras, among the plurality of cameras, for which a pixel or region corresponding to the predetermined element is included in the region of the object in the captured image is equal to or less than a first threshold; and
generation means for generating three-dimensional shape data of the object that includes the predetermined elements not determined by the determination means to meet the condition.

According to the present invention, loss of the three-dimensional shape data of an object can be suppressed and high-quality three-dimensional shape data can be generated.

FIG. 1 is a diagram showing an example of voxels constituting a three-dimensional model according to an embodiment.
FIG. 2 is a diagram showing a configuration example of a virtual viewpoint image generation system including a three-dimensional model generation device according to an embodiment.
FIG. 3 is a diagram showing an example of the camera arrangement of the virtual viewpoint image generation system according to an embodiment.
FIG. 4(a) is a diagram showing an example of the functional configuration of the three-dimensional model generation device according to the first embodiment, and FIG. 4(b) is a diagram showing an example of its hardware configuration.
FIG. 5 is a flowchart showing the procedure of processing performed by the three-dimensional model generation device according to the first embodiment.
FIG. 6 is a diagram showing an example of captured images captured by a plurality of cameras according to an embodiment of the present invention.
FIG. 7 is a diagram showing an example of structure mask images according to an embodiment of the present invention.
FIG. 8 is a diagram showing an example of foreground mask images according to an embodiment of the present invention.
FIG. 9 is a diagram showing an example of integrated mask images obtained by integrating the foreground mask images and the structure mask images according to an embodiment of the present invention.
FIG. 10 is a diagram showing the voxel space for which a three-dimensional model of the stadium system according to the first embodiment is to be generated.
FIG. 11 is a diagram showing the True Count and False Count according to the first embodiment.
FIG. 12 is a diagram showing an example of a three-dimensional model generated by applying the False Count threshold determination according to the first embodiment.
FIG. 13 is a diagram showing an example of a three-dimensional model generated by applying the False Count threshold determination and the True Count threshold determination according to the first embodiment.
FIG. 14 is a diagram showing a three-dimensional model in which a loss occurs.
FIG. 15 is a diagram showing an example of the functional configuration of a three-dimensional model generation device according to the second embodiment.
FIG. 16 is a flowchart showing the procedure of processing performed by the three-dimensional model generation device according to the second embodiment.
FIG. 17 is a diagram showing an example of the camera arrangement and the foreground in a virtual viewpoint image generation system according to the second embodiment.
FIG. 18 is a diagram showing the True Count and False Count according to the second embodiment.
FIG. 19 is a diagram showing the functional blocks of a three-dimensional model generation device according to the third embodiment.
FIG. 20 is a diagram showing the processing flow of the three-dimensional model generation device according to the third embodiment.
FIG. 21 is a diagram showing the True Count and False Count with and without weighted addition according to the third embodiment.
FIG. 22 is a diagram showing an overview of three-dimensional model generation by the visual volume intersection method.
FIG. 23 is a diagram showing an example of the functional configuration of a three-dimensional model generation device according to the fourth embodiment.
FIG. 24 is a flowchart showing the procedure of processing performed by the three-dimensional model generation device according to the fourth embodiment.
FIG. 25 is a diagram showing the False Count and Structure according to the fourth embodiment.

Embodiments will be described below with reference to the drawings. The configurations shown in the following embodiments are merely examples, and the present invention is not limited to the illustrated configurations.

(First embodiment)
In the first embodiment, for each voxel (partial region) in the space to be examined in order to generate a three-dimensional model (three-dimensional shape data) of a target object (foreground, object), it is determined whether the number of cameras for which the position (pixel or region) corresponding to that voxel in the captured image is included in the foreground mask image, which indicates the foreground region, is equal to or less than a threshold. An example will be described in which the foreground three-dimensional model is generated by removing the voxel (partial region) from the candidates for the three-dimensional model when that number is equal to or less than the threshold.

Specifically, for each of a plurality of partial regions constituting a three-dimensional space, it is determined whether a condition is met that the number of cameras, among the plurality of cameras, for which the partial region is included in the foreground region indicating the region of the target object in the captured image is equal to or less than a first threshold. A three-dimensional model of the target object is then generated so as to include the partial regions that were not determined to meet the condition.

<Representation of the three-dimensional model>
FIG. 1(a) shows a single cubic voxel. FIG. 1(b) shows a voxel set representing the target space for three-dimensional model generation. As shown in FIG. 1(b), a voxel is a minute partial region constituting the three-dimensional space. FIG. 1(c) shows an example in which the voxel set of a three-dimensional model of a quadrangular pyramid is generated by removing the voxels outside the pyramid region from the voxel set of the target space in FIG. 1(b). Although an example in which the three-dimensional space and the three-dimensional model are composed of cubic voxels is described here, they are not limited to this and may be composed of a point cloud or the like. The three-dimensional model referred to here is data representing a three-dimensional shape.
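
The voxel representation of FIG. 1 can be sketched as follows. This is a minimal illustration only, assuming that a voxel set is stored as a Python set of integer (x, y, z) indices; the grid size and the helper names `make_voxel_space`, `inside_pyramid`, and `carve` are hypothetical and do not come from the embodiment.

```python
from itertools import product

def make_voxel_space(n):
    """Return the full voxel set of an n x n x n target space (cf. FIG. 1(b))."""
    return set(product(range(n), repeat=3))

def inside_pyramid(voxel, n):
    """Hypothetical test: is the voxel inside a square pyramid whose base
    covers the bottom layer (z = 0) and which narrows by one voxel per layer?"""
    x, y, z = voxel
    return z <= x <= n - 1 - z and z <= y <= n - 1 - z

def carve(space, keep):
    """Remove every voxel for which keep(voxel) is False (cf. FIG. 1(c))."""
    return {v for v in space if keep(v)}

n = 5
space = make_voxel_space(n)
pyramid = carve(space, lambda v: inside_pyramid(v, n))
print(len(space), len(pyramid))
```

A set of indices keeps the sketch short; a dense boolean array or a point cloud would serve equally well, as the text notes.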

<System configuration>
FIG. 2 is a block diagram showing a configuration example of a virtual viewpoint image generation system including the three-dimensional model generation device according to the first embodiment. The virtual viewpoint image generation system 1 includes a camera array 11 including a plurality of cameras 10a to 10z, a control device 12, a foreground/background separation device 13, a three-dimensional model generation device 14, and a rendering device 15.

The camera array 11 is a group of image capturing devices including the plurality of cameras 10a to 10z; it captures a subject from various angles and outputs the images to the foreground/background separation device 13 and the control device 12. The cameras 10a to 10z, the foreground/background separation device 13, and the control device 12 are assumed to be connected in a star topology, but they may instead be connected in a ring topology using daisy-chain connections, a bus topology, or the like. The camera array 11 is arranged around a stadium as shown in FIG. 3, for example, and all cameras capture images synchronously from various angles toward a common gaze point on the field. However, a plurality of gaze points may be set, for example one gaze point toward which half of the cameras in the camera array 11 are directed and another gaze point toward which the remaining half are directed.

Here, the foreground is a predetermined target object (a subject for which a three-dimensional model is to be generated based on captured images) that can be viewed from any angle at a virtual viewpoint; in this embodiment, it refers to a person on the field of the stadium. The background is the region other than the foreground; in this embodiment, it refers to the entire stadium (the field, spectator seats, and so on). However, the foreground and background are not limited to these examples. The virtual viewpoint images in this embodiment include not only images representing the view from a viewpoint that can be specified, but any image representing the view from a virtual viewpoint at which no camera is installed.

The control device 12 calculates camera parameters indicating the positions and orientations of the cameras 10a to 10z from the images captured synchronously by the camera array 11, and outputs the calculated camera parameters to the three-dimensional model generation device 14. The camera parameters consist of extrinsic parameters and intrinsic parameters. The extrinsic parameters consist of a rotation matrix and a translation matrix and indicate the position and orientation of the camera. The intrinsic parameters include information such as the focal length and optical center of the camera and indicate the angle of view of the camera, the size of the image sensor, and the like.

The process of calculating the camera parameters is called calibration. The camera parameters can be obtained, for example, from the correspondence between points in the three-dimensional world coordinate system, acquired using a plurality of images in which a camera has captured a specific pattern such as a checkerboard, and the corresponding two-dimensional points.
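
To illustrate how the extrinsic and intrinsic parameters relate a point in the world coordinate system to a pixel, the following sketch applies a standard pinhole camera model. It is an assumption made for illustration: the text does not specify the projection model, lens distortion is ignored, and the function name `project` and the numeric values are hypothetical.

```python
def project(point_w, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    R (3x3 rotation) and t (3-element translation) are the extrinsic
    parameters; fx, fy (focal lengths in pixels) and cx, cy (optical center)
    are the intrinsic parameters. Lens distortion is ignored (an assumption).
    Returns None if the point is not in front of the camera.
    """
    # World -> camera coordinates: Xc = R * Xw + t
    xc, yc, zc = (
        sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)
    )
    if zc <= 0:
        return None
    # Camera -> pixel coordinates: perspective division, then intrinsics
    return (fx * xc / zc + cx, fy * yc / zc + cy)

# Example: a camera at the world origin looking along +Z (identity rotation).
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]
print(project((1.0, 0.5, 2.0), R, t, fx=1000, fy=1000, cx=960, cy=540))
```

Calibration estimates R, t, and the intrinsic values so that known world points, such as checkerboard corners, project onto their observed pixel positions.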

The control device 12 also calculates, for the images captured by the cameras 10a to 10z, a structure mask image indicating the region of a structure that may overlap in front of the foreground, and outputs information on the calculated structure mask image. In this embodiment, a structure is a stationary object installed in the space to be captured. As an example, a soccer goal is treated as the structure, and an image indicating the region of the goal in the image captured by each camera serves as the structure mask image. A structure region that may overlap in front of the foreground means a region that may occlude the foreground object when captured from the capturing direction of at least one camera.

The foreground/background separation device 13 distinguishes, in the images captured by the plurality of cameras and input from the camera array 11, the region where a person on the field is present as the foreground from the remaining background region, and outputs a foreground mask image indicating the foreground region. The foreground region can be identified by, for example, treating a region that differs between a background image held in advance and the captured image as the foreground region, or treating the region of a moving object as the foreground region.
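
The first identification method, the difference from a background image held in advance, can be sketched as below. This is a simplified illustration under stated assumptions: images are grayscale lists of rows, and the function name `foreground_mask` and the threshold value of 30 are hypothetical and not taken from the embodiment.

```python
def foreground_mask(frame, background, threshold=30):
    """Return a binary mask: 1 where |frame - background| exceeds the
    threshold, 0 elsewhere. Images are lists of rows of gray values (0-255)."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]

background = [[100, 100, 100], [100, 100, 100]]
frame = [[100, 180, 102], [20, 100, 131]]
print(foreground_mask(frame, background))
```

A pixel whose value is close to the background, such as the 102 above, stays 0, which is how a target object with a color similar to the background can drop out of the mask.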

Here, a mask image is a reference image representing a specific portion to be extracted from a captured image, and is a binary image represented by 0s and 1s. For example, a foreground mask image indicates the region of the captured image where the foreground, such as a player, is present; it has the same resolution as the captured image, with pixels belonging to the foreground region set to 1 and all other pixels set to 0. However, the format of the mask image is not limited to this; any information indicating the region of a specific object in the captured image may be used.

The three-dimensional model generation device 14 functions as an information processing device that generates a three-dimensional model using a plurality of captured images captured by a plurality of cameras. First, it receives the camera parameters and the structure mask image information from the control device 12, and receives the foreground mask images from the foreground/background separation device 13. The three-dimensional model generation device 14 then integrates the structure mask image and the foreground mask image to generate an integrated mask image indicating the integrated region. Next, for each voxel in the space for which the foreground three-dimensional model is to be generated, it determines whether to remove the voxel based on the number of cameras for which the voxel is not included in the integrated mask image and the number of cameras for which the voxel is included in the foreground mask image. Finally, based on the voxels that remain after the voxels determined to be removed have been removed, it generates the foreground three-dimensional model by, for example, the visual volume intersection method, and outputs the model to the rendering device 15.

The basic principle of the visual volume intersection method will now be described with reference to FIG. 22. As shown in FIG. 22(a), when a target object C is captured, a mask image Da representing the two-dimensional silhouette of the target object is obtained on the image plane S. Consider a cone that spreads out in three-dimensional space from the projection center Pa of the camera so as to pass through each point on the contour of the mask image. This cone is called the "visual volume" of the object for that camera; the visual volume Va is shown in FIG. 22(b). As shown in FIG. 22(c), the three-dimensional model of the target object can be obtained by finding the region common to a plurality of visual volumes, that is, the intersection of the visual volumes.
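
On a voxel grid, the intersection of visual volumes amounts to keeping only the voxels whose projection falls inside the silhouette of every camera. The sketch below shows this basic form, not the threshold-based variant of the embodiment. It assumes each camera is given as a hypothetical (project, mask) pair and uses two toy orthographic cameras instead of calibrated perspective cameras.

```python
def visual_hull(voxels, cameras):
    """Keep the voxels whose projection lies inside every camera's silhouette.

    cameras: list of (project, mask) pairs, where project(voxel) returns
    integer pixel coordinates (u, v) and mask[v][u] is 1 inside the silhouette.
    """
    def in_silhouette(voxel, project, mask):
        u, v = project(voxel)
        return 0 <= v < len(mask) and 0 <= u < len(mask[0]) and mask[v][u] == 1

    return {
        vox for vox in voxels
        if all(in_silhouette(vox, proj, mask) for proj, mask in cameras)
    }

# Toy setup (hypothetical): two orthographic cameras on a 3 x 3 x 3 grid.
voxels = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
front = (lambda p: (p[0], p[2]), [[0, 1, 0], [0, 1, 0], [0, 1, 0]])  # looks along y
side = (lambda p: (p[1], p[2]), [[0, 0, 0], [1, 1, 1], [0, 0, 0]])   # looks along x
print(sorted(visual_hull(voxels, [front, side])))
```

Because a voxel must be inside every silhouette, a single mask with a missing region removes the corresponding voxels, which is the loss the embodiment aims to suppress.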

The rendering device 15 receives the three-dimensional model from the three-dimensional model generation device 14 and receives images showing the foreground from the foreground/background separation device 13. It obtains the positional relationship between the foreground images and the three-dimensional model from the camera parameters, colors the three-dimensional model by pasting the corresponding foreground images onto it, and generates a virtual viewpoint image in which the three-dimensional model is observed from an arbitrary viewpoint. The virtual viewpoint image may include an image of the background. That is, the rendering device 15 may generate a virtual viewpoint image of the background and foreground as seen from a set viewpoint by placing a background model, a foreground model, and a viewpoint position in the three-dimensional space.

<Functional configuration of the three-dimensional model generation device>
Next, the functional configuration of the three-dimensional model generation device according to this embodiment will be described with reference to FIG. 4(a). The three-dimensional model generation device 14 includes a receiving unit 100, a structure mask storage unit 101, a camera parameter holding unit 102, a mask integration unit 103, a coordinate conversion unit 104, a mask inside/outside determination unit 105, a threshold setting unit 106, a foreground model generation unit 107, and an output unit 108. The function of each processing unit is realized by the CPU 1001, described later with reference to FIG. 4(b), executing a computer program read from the ROM 1002 or the RAM 1003.

The receiving unit 100 receives, from the control device 12, the camera parameters of each camera constituting the camera array 11 and the structure mask images indicating the region of the structure. The receiving unit 100 also receives from the foreground/background separation device 13, for every capture, the image captured by each camera of the camera array 11 and the foreground mask image indicating the foreground region in that image.

The structure mask storage unit 101 stores the structure mask images received by the receiving unit 100. A structure mask image is a fixed image determined by the position of the camera.

The camera parameter holding unit 102 holds, as camera parameters, extrinsic parameters indicating the position and/or orientation of each camera of the camera array 11 and intrinsic parameters indicating the focal length and/or image size.

Each time the camera array 11 captures images, the mask integration unit 103 integrates the foreground mask image received from the foreground/background separation device 13 with the structure mask image stored in the structure mask storage unit 101 to generate an integrated mask image. The method of integrating the foreground mask image and the structure mask image is described in detail later.

The coordinate conversion unit 104 calculates the position and angle of view of each captured image in the world coordinate system based on the camera parameters held in the camera parameter holding unit 102, and converts them into information indicating which capture region in the three-dimensional space each captured image covers.

The mask inside/outside determination unit 105 determines that a voxel in the target voxel space is to be removed when the number of cameras for which the voxel is included in the foreground mask image is equal to or less than a threshold. It also determines that a voxel is to be removed when the number of cameras for which the voxel is not included in the integrated mask image is equal to or greater than another threshold.

The threshold setting unit 106 sets the thresholds that the mask inside/outside determination unit 105 uses to determine whether to remove a voxel. These thresholds may be set in response to a user operation on the three-dimensional model generation device 14, or may be set automatically by the threshold setting unit 106. The foreground model generation unit 107 removes, from the voxels in the target voxel space, the voxels that the mask inside/outside determination unit 105 has determined should be removed, and generates a three-dimensional model based on the remaining voxels. The output unit 108 outputs the three-dimensional model generated by the foreground model generation unit 107 to the rendering device 15.
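
The two removal conditions and the subsequent model generation can be sketched as follows. This is an illustrative reading of the description, not the actual implementation: the function names, the per-voxel counts, and the threshold values are all hypothetical, and the counts are assumed to have been computed beforehand for each voxel.

```python
def should_remove(true_count, false_count, true_threshold, false_threshold):
    """Decide whether a voxel is removed from the foreground model candidates.

    true_count:  number of cameras whose foreground mask contains the voxel.
    false_count: number of cameras whose integrated mask does not contain it.
    """
    return true_count <= true_threshold or false_count >= false_threshold

def generate_model(voxel_counts, true_threshold, false_threshold):
    """voxel_counts maps voxel -> (true_count, false_count); return kept voxels."""
    return {
        voxel for voxel, (tc, fc) in voxel_counts.items()
        if not should_remove(tc, fc, true_threshold, false_threshold)
    }

# Hypothetical counts for three voxels observed by five cameras.
counts = {
    "person": (2, 0),  # hidden by the structure in 3 cameras, foreground in 2
    "goal": (0, 0),    # inside the structure mask only, never in the foreground
    "empty": (0, 5),   # outside every integrated mask
}
print(generate_model(counts, true_threshold=0, false_threshold=1))
```

With these example values, the voxel behind the structure survives because it is never outside an integrated mask, while the structure itself is dropped because no camera sees it as foreground.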

<Hardware configuration of the three-dimensional model generation device>
Next, an example of the hardware configuration of the three-dimensional model generation device according to this embodiment will be described with reference to FIG. 4(b). The three-dimensional model generation device 14 includes a CPU 1001, a ROM 1002, a RAM 1003, a storage device 1004, and a bus 1005, and is connected to an input device 1006 and a display device 1007.

The CPU 1001 controls the various operations performed by the functional blocks of the three-dimensional model generation device 14 described above. The content of this control is specified by programs in the ROM 1002 and the RAM 1003, described below. The CPU 1001 can also run a plurality of computer programs in parallel. The ROM 1002 stores computer programs and data describing the control procedures executed by the CPU 1001. The RAM 1003 stores control programs to be processed by the CPU 1001 and provides a work area for various data when the CPU 1001 executes various kinds of control. The functions of the program code stored in a recording medium such as the ROM 1002 or the RAM 1003 are realized when the CPU 1001 reads and executes the code; the type of recording medium does not matter.

The storage device 1004 can store various data. It has a recording medium such as a hard disk, floppy disk, optical disk, magnetic disk, magneto-optical disk, magnetic tape, or nonvolatile memory card, and a drive that drives the recording medium to record information. The stored computer programs and data are loaded into the RAM 1003 when needed, in response to instructions from a keyboard or the like or instructions from various computer programs.

The bus 1005 is a data bus or the like connected to each component; it enables communication between the components and high-speed exchange of information. The input device 1006 provides various input environments for the user. A keyboard, a mouse, and the like are typical means of providing these input operation environments, but a touch panel, a stylus pen, or the like may also be used. The display device 1007 is composed of a liquid crystal display or the like, and displays to the user the states of various input operations and the corresponding calculation results. The configuration described above is only an example, and the device is not limited to it.

<Processing>
FIG. 5 is a flowchart showing the procedure of the processing performed by the three-dimensional model generation device according to this embodiment.

In S201, the receiving unit 100 receives the structure mask image of each camera constituting the camera array 11 from the control device 12. An example of the captured images and the structure mask images is described here. FIG. 6 shows an example of five captured images captured by five cameras that form part of the camera array 11. One person is on the field, and a goal is present on the field as a structure. In FIGS. 6(b), 6(c), and 6(d), the goal is in front of the person, so part of the person is hidden. FIG. 7 shows the structure mask images corresponding to the captured images in FIG. 6. Each is a binary image in which the region of the goal, the structure, is 1 (white) and the region other than the structure is 0 (black).

Ｓ２０２において、受信部１００は、前景領域を示す前景マスク画像を前景背景分離装置１３から受信する。ここで、前景マスク画像の一例を説明する。図８は、図６で示した各撮影画像に対応する前景マスク画像を示している。前景背景分離装置１３は、時間的に変化のある領域を前景領域として抽出するため、図８（ｂ）、図８（ｃ）、図８（ｄ）のようにゴールに隠れた人物の一部の領域は前景領域として抽出されない。また、図８（ｅ）では時間的変化の無かった人物の足の一部が前景領域として抽出されていない。 In S202, the receiving unit 100 receives the foreground mask image representing the foreground area from the foreground/background separation device 13. An example of the foreground mask image will now be described. FIG. 8 shows the foreground mask images corresponding to the captured images shown in FIG. 6. Since the foreground/background separation device 13 extracts an area that changes over time as the foreground area, the part of the person hidden behind the goal is not extracted as a foreground area, as shown in FIGS. 8(b), 8(c), and 8(d). Also, in FIG. 8(e), a part of the person's foot that did not change over time is not extracted as a foreground area.

Ｓ２０３において、マスク統合部１０３は、Ｓ２０１及びＳ２０２で受信した構造物マスク画像と前景マスク画像とを統合して統合マスク画像を生成する。図９は、図７で示した構造物マスク画像と図８で示した前景マスク画像とを統合した結果である統合マスク画像の一例を示す。統合マスク画像は２値で表される前景マスク画像と構造物マスク画像とのＯＲ（論理和）により算出する。 In S203, the mask integration unit 103 integrates the structure mask image and the foreground mask image received in S201 and S202 to generate an integrated mask image. FIG. 9 shows an example of an integrated mask image resulting from integrating the structure mask image shown in FIG. 7 and the foreground mask image shown in FIG. 8. The integrated mask image is calculated as the logical OR of the binary foreground mask image and the binary structure mask image.
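The OR integration of S203 can be illustrated with a minimal Python sketch. It assumes each mask is a list of rows of 0/1 values; the function name `integrate_masks` and the tiny sample masks are hypothetical and are not taken from the figures.

```python
def integrate_masks(foreground_mask, structure_mask):
    """Pixel-wise logical OR of two binary masks (lists of rows of 0/1)."""
    if len(foreground_mask) != len(structure_mask):
        raise ValueError("masks must have the same height")
    integrated = []
    for fg_row, st_row in zip(foreground_mask, structure_mask):
        if len(fg_row) != len(st_row):
            raise ValueError("masks must have the same width")
        # A pixel belongs to the integrated mask if either mask marks it.
        integrated.append([fg | st for fg, st in zip(fg_row, st_row)])
    return integrated


# Hypothetical 2x3 masks: a vertical "person" strip and a horizontal "goal" bar.
foreground = [[0, 1, 0],
              [0, 1, 0]]
structure = [[0, 0, 0],
             [1, 1, 1]]
integrated = integrate_masks(foreground, structure)
```

In this sketch a pixel hidden behind the structure stays inside the integrated mask even though the foreground mask misses it, which is the property the later False Count test relies on.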

Ｓ２０４において、マスク内外判定部１０５は、対象ボクセル空間内から未選択のボクセルを一つ選択する。 In S204, the mask inside/outside determination unit 105 selects one unselected voxel from within the target voxel space.

Ｓ２０５において、マスク内外判定部１０５は、選択された一つのボクセルが各カメラの統合マスク画像のマスク領域内に含まれないカメラの台数（以降、ＦａｌｓｅＣｏｕｎｔと呼ぶ）をカウントする。 In S205, the mask inside/outside determination unit 105 counts the number of cameras in which one selected voxel is not included in the mask area of the integrated mask image of each camera (hereinafter referred to as False Count).

Ｓ２０６において、マスク内外判定部１０５は、ＦａｌｓｅＣｏｕｎｔが閾値以上であるか否かを判定する。ＦａｌｓｅＣｏｕｎｔが閾値以上である場合、選択された一つのボクセルは前景でも構造物でもないと判定できるため、Ｓ２０７へ進む。これにより、明らかに非前景である多くのボクセルを除去することができる。一方、ＦａｌｓｅＣｏｕｎｔが閾値未満である場合、選択された一つのボクセルは前景又は構造物であると判定できるため、Ｓ２０８へ進む。 In S206, the mask inside/outside determination unit 105 determines whether False Count is equal to or greater than a threshold. If the False Count is greater than or equal to the threshold, it can be determined that the selected one voxel is neither the foreground nor a structure, so the process proceeds to S207. This allows many voxels that are clearly non-foreground to be removed. On the other hand, if the False Count is less than the threshold, it can be determined that the selected one voxel is the foreground or a structure, so the process proceeds to S208.

Ｓ２０７において、前景モデル生成部１０７は、選択された一つのボクセルを対象ボクセル空間から除去する。Ｓ２０８において、マスク内外判定部１０５は、選択された一つのボクセルが各カメラの前景マスク画像のマスク領域内に含まれるカメラの台数（以降、ＴｒｕｅＣｏｕｎｔと呼ぶ）をカウントする。 In S207, the foreground model generation unit 107 removes one selected voxel from the target voxel space. In S208, the mask inside/outside determination unit 105 counts the number of cameras in which one selected voxel is included in the mask region of the foreground mask image of each camera (hereinafter referred to as True Count).

Ｓ２０９において、マスク内外判定部１０５は、ＴｒｕｅＣｏｕｎｔが他の閾値以下であるか否かを判定する。ＴｒｕｅＣｏｕｎｔが他の閾値以下である場合、選択された一つのボクセルは構造物であると判定できるため、Ｓ２０７へ進み、選択された一つのボクセルを対象ボクセル空間から除去する。一方、ＴｒｕｅＣｏｕｎｔが他の閾値を超過する場合、選択された一つのボクセルは前景と判定できるため、対象ボクセル空間から除去しない。 In S209, the mask inside/outside determination unit 105 determines whether or not the True Count is equal to or less than another threshold. If the True Count is equal to or less than another threshold, it can be determined that the selected voxel is a structure, so the process proceeds to S207 to remove the selected voxel from the target voxel space. On the other hand, if the True Count exceeds another threshold, the selected one voxel can be determined as the foreground and is not removed from the target voxel space.
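The per-voxel decision of S205 to S209 can be sketched as follows. This is a hedged illustration, not the claimed implementation: it assumes the mask lookups have already been done, so that each camera contributes one boolean per mask, and the function and parameter names are hypothetical.

```python
def should_keep_voxel(in_integrated_mask, in_foreground_mask,
                      false_count_threshold, true_count_threshold):
    """Return True if the voxel survives the checks of S205 to S209.

    in_integrated_mask[i] and in_foreground_mask[i] state whether the voxel
    falls inside the respective mask image of camera i.
    """
    # S205: number of cameras whose integrated mask does not contain the voxel.
    false_count = sum(1 for inside in in_integrated_mask if not inside)
    # S206: at or above the threshold means neither foreground nor structure.
    if false_count >= false_count_threshold:
        return False  # S207: remove
    # S208: number of cameras whose foreground mask contains the voxel.
    true_count = sum(1 for inside in in_foreground_mask if inside)
    # S209: at or below the other threshold means the voxel is a structure.
    if true_count <= true_count_threshold:
        return False  # S207: remove
    return True


# 16 cameras; a voxel seen as foreground by 13 of them and covered by the
# integrated mask in all of them is kept with thresholds 10 and 5.
head_kept = should_keep_voxel([True] * 16, [True] * 13 + [False] * 3, 10, 5)
```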

Ｓ２１０において、マスク内外判定部１０５は、対象ボクセル空間内の全てのボクセルについて処理が完了したか否かを判定する。全てのボクセルについて処理が完了した場合、Ｓ２１１へ進む。一方、全てのボクセルについて処理が完了していない場合、Ｓ２０４に戻って、未選択のボクセルのうち次の一つのボクセルを選択し、以降、同様の処理を行う。 In S210, the mask inside/outside determination unit 105 determines whether or not the processing has been completed for all voxels in the target voxel space. If the processing has been completed for all voxels, the process proceeds to S211. On the other hand, if the processing has not been completed for all voxels, the process returns to S204 to select the next voxel from among the unselected voxels, and perform the same processing thereafter.

Ｓ２１１において、前景モデル生成部１０７は、対象ボクセル空間についてボクセルの除去判定を行った後の残りのボクセルを用いて、前景の三次元モデルを生成する。 In S211, the foreground model generation unit 107 generates a three-dimensional model of the foreground using the remaining voxels after performing the voxel removal determination for the target voxel space.

Ｓ２１２において、出力部１０８は、前景モデル生成部１０７により生成された前景の三次元モデルをレンダリング装置１５へ出力する。以上の一連の処理が、各カメラにより撮影されたフレーム毎に実施される。 In S212, the output unit 108 outputs the foreground 3D model generated by the foreground model generation unit 107 to the rendering device 15. The series of processes described above is performed for each frame captured by each camera.

ここで、図３に示した１６台のカメラにより競技場を撮影する仮想視点画像生成システムを例として、三次元モデルの生成例を説明する。図１０は、本実施形態に係る競技場システムの三次元モデル生成対象のボクセル空間を示す図である。図３の仮想視点画像生成システムにおける三次元モデル生成の対象領域として格子で示された直方体の領域が対象ボクセル空間を表している。 Here, an example of generating a three-dimensional model will be described, taking as an example the virtual viewpoint image generation system shown in FIG. 3, which captures images of a stadium using 16 cameras. FIG. 10 is a diagram showing the voxel space for which a 3D model is to be generated in the stadium system according to this embodiment. The rectangular parallelepiped region drawn as a grid, which is the target region for three-dimensional model generation in the virtual viewpoint image generation system of FIG. 3, represents the target voxel space.

図１１は、図３に示した仮想視点画像生成システムにおける前景、一部のカメラで未検出の前景、構造物に隠れた前景、構造物、非前景として、それぞれ人物、人物の足、人物の頭部、ゴール、その他の領域に対する、ボクセルのＦａｌｓｅＣｏｕｎｔ／ＴｒｕｅＣｏｕｎｔと、判定結果の例を示している。ただし、１台のカメラで人物の足の前景抽出に失敗しており、また３台のカメラで人物の頭部が構造物であるゴールに隠れており、これらは前景背景分離装置１３により前景として抽出されないものとする。 FIG. 11 shows an example of the False Count/True Count of voxels and the determination results for a person, the person's feet, the person's head, the goal, and other regions, which serve respectively as the foreground, foreground undetected by some cameras, foreground hidden by a structure, a structure, and non-foreground in the virtual viewpoint image generation system shown in FIG. 3. It is assumed, however, that foreground extraction of the person's feet fails in one camera and that the person's head is hidden behind the goal, which is a structure, in three cameras, so that these are not extracted as foreground by the foreground/background separation device 13.

Ｓ２０６の判定において、ＦａｌｓｅＣｏｕｎｔの閾値が固定値の１０である場合、その他の領域に位置するボクセルはＦａｌｓｅＣｏｕｎｔが１６であり閾値を超えることから除去される。その結果、例えば図１２に示すような前景と構造物とから構成される三次元モデルが生成されることになる。ここで図１２は、ＦａｌｓｅＣｏｕｎｔの閾値判定を適用して生成された三次元モデルの一例を示す図である。 In the determination of S206, if the False Count threshold is a fixed value of 10, voxels located in other regions are removed because the False Count is 16 and exceeds the threshold. As a result, a three-dimensional model composed of the foreground and the structure as shown in FIG. 12, for example, is generated. Here, FIG. 12 is a diagram showing an example of a three-dimensional model generated by applying the False Count threshold determination.

さらに、Ｓ２０９の判定において、ＴｒｕｅＣｏｕｎｔの閾値（他の閾値）が固定値の５である場合、構造物であるゴールの領域に位置するボクセルはＴｒｕｅＣｏｕｎｔが０で閾値以下であることから除去される。一方、人物、人物の足、頭部の領域に位置するボクセルはＴｒｕｅＣｏｕｎｔは各々１６、１５、１３であり、第２の閾値を超過するため除去されない。 Furthermore, in the determination of S209, if the True Count threshold (the other threshold) is a fixed value of 5, the voxels located in the goal area, which is a structure, are removed because their True Count is 0, which is equal to or less than the threshold. On the other hand, the voxels located in the regions of the person, the person's feet, and the person's head have True Counts of 16, 15, and 13, respectively, which exceed this second threshold, so they are not removed.

すなわち、図１１に示すように、前景（人物）、一部未検出の前景（足）及び構造物で隠れた前景（頭部）はボクセル残存と判定され、構造物（ゴール）及び非前景（その他の領域）はボクセル除去と判定されることになる。従って、最終的に、図１０で示した対象空間のボクセル集合から、例えば図１３に示すような欠落のない人物の三次元モデルが生成されることになる。ここで図１３は、ＦａｌｓｅＣｏｕｎｔの閾値判定及びＴｒｕｅＣｏｕｎｔの閾値判定を適用して生成された三次元モデルの一例を示す図である。 That is, as shown in FIG. 11, the voxels of the foreground (person), the partially undetected foreground (feet), and the foreground hidden by the structure (head) are determined to remain, and the voxels of the structure (goal) and the non-foreground (other regions) are determined to be removed. Therefore, a complete three-dimensional model of the person, as shown for example in FIG. 13, is finally generated from the set of voxels in the target space shown in FIG. 10. Here, FIG. 13 is a diagram showing an example of a three-dimensional model generated by applying both the False Count threshold determination and the True Count threshold determination.

これに対し、図１４は、図８に示した前景マスク画像のみを用いて視体積交差法により三次元モデル生成した例を示す。図８（ａ）は人物全体が写っているが、図８（ｂ）、図８（ｃ）、図８（ｄ）に示す撮影画像では構造物のゴールにより人物の頭の一部が隠れている。さらに、図８（ｅ）に示す撮影画像では人物の足が前景として抽出されていない。そのため、生成された三次元モデルも一部が欠落している。 In contrast, FIG. 14 shows an example in which a three-dimensional model is generated by the visual volume intersection method using only the foreground mask images shown in FIG. 8. FIG. 8(a) shows the whole person, but in the images shown in FIGS. 8(b), 8(c), and 8(d), part of the person's head is hidden by the goal, which is a structure. Furthermore, in the image shown in FIG. 8(e), the person's feet are not extracted as foreground. As a result, part of the generated three-dimensional model is also missing.

以上説明したように、本実施形態では、対象物体（前景）の三次元モデルを生成する対象となる空間内の各ボクセルについて、対象とするボクセルが前景の領域を示す前景マスク画像に含まれるカメラの数が閾値（ＴｒｕｅＣｏｕｎｔの閾値）以下であるか否かを判定し、当該数が閾値以下である場合にそのボクセルを除去する。 As described above, in this embodiment, for each voxel in the space for which a three-dimensional model of the target object (foreground) is to be generated, it is determined whether the number of cameras for which the voxel is included in the foreground mask image representing the foreground region is equal to or less than a threshold (the True Count threshold), and the voxel is removed if that number is equal to or less than the threshold.

本実施形態によれば、対象物体（前景）の領域を示す前景マスク画像に欠落がある場合でも、生成する対象物体（前景）の三次元モデルの欠落を回避し、三次元モデルの品質を向上させることができる。 According to this embodiment, even if the foreground mask image showing the area of the target object (foreground) has missing parts, missing parts in the generated three-dimensional model of the target object (foreground) can be avoided and the quality of the three-dimensional model can be improved.

また、前景マスク画像と構造物マスク画像とを統合して統合マスク画像を生成し、対象とするボクセルが統合マスク画像に含まれないカメラの数が閾値（ＦａｌｓｅＣｏｕｎｔの閾値）以上である場合に、当該ボクセルを除去すると判定する。これにより、明らかに非前景である多くのボクセルを除去することができるので、後段の処理の速度を向上させることが可能となる。 In addition, the foreground mask image and the structure mask image are integrated to generate an integrated mask image, and a voxel is determined to be removed when the number of cameras for which the voxel is not included in the integrated mask image is equal to or greater than a threshold (the False Count threshold). As a result, many voxels that are clearly non-foreground can be removed, so the speed of subsequent processing can be improved.

（第２の実施形態）
本実施形態では、画角内外判定の結果に基づいて閾値を設定することにより、注視点から離れた位置にある前景も除去されないように三次元モデルを生成する例を説明する。 (Second embodiment)
In the present embodiment, an example will be described in which a three-dimensional model is generated by setting a threshold value based on the result of the inside/outside angle of view determination so that the foreground at a position away from the gaze point is not removed.

第１の実施形態では、ボクセルが各カメラから撮影範囲内（画角内）か否かを判定していないため、多数のカメラで撮影範囲外である場合に、誤って前景を示すボクセルを除去してしまう可能性がある。 In the first embodiment, since it is not determined whether the voxel is within the shooting range (within the angle of view) of each camera, voxels that erroneously indicate the foreground are removed when they are outside the shooting range of many cameras. There is a possibility of doing so.

例えば、図３に示す競技場の仮想視点画像生成システムにおいて、注視点と反対側のゴール付近に位置する人物の領域に位置するボクセルを撮影範囲内に含むカメラの台数は３台であり、ＴｒｕｅＣｏｕｎｔが３となる。その際、ＴｒｕｅＣｏｕｎｔの閾値が５である場合、閾値未満であるため当該ボクセルは除去されてしまうことになる。 For example, in the virtual viewpoint image generation system of the stadium shown in FIG. 3, only three cameras include within their shooting range a voxel located in the region of a person near the goal on the side opposite the gaze point, so the True Count is 3. In that case, if the True Count threshold is 5, the voxel is removed because its True Count is below the threshold.

本実施形態では、ボクセルを撮影範囲内（画角内）に含むカメラの台数に基づいてＴｒｕｅＣｏｕｎｔの閾値を算出することにより、ボクセルが注視点から離れていたとしても、誤って前景を示すボクセルを除去してしまうことを回避する。 In this embodiment, the True Count threshold is calculated based on the number of cameras that include the voxel within their shooting range (within the angle of view), which avoids erroneously removing a voxel that represents the foreground even if the voxel is far from the gaze point.

＜三次元モデル生成装置の機能構成＞
図１５を参照して、本実施形態に係る三次元モデル生成装置の機能構成を説明する。本実施形態に係る三次元モデル生成装置１４は、受信部１００、構造物マスク保存部１０１、カメラパラメータ保持部１０２、マスク統合部１０３、座標変換部１０４、マスク内外判定部１０５、閾値設定部１０６、前景モデル生成部１０７、出力部１０８に加えて、画角内外判定部１０９及び閾値算出部１１０をさらに備えている。各処理部の機能は、第１の実施形態において図４（ｂ）を参照して説明したＣＰＵ１００１がＲＯＭ１００２やＲＡＭ１００３から読み出したコンピュータプログラムを実施することにより実現される。また、仮想視点画像生成システムの構成は第１実施形態と同様であるため、説明は省略する。 <Functional configuration of 3D model generation device>
The functional configuration of the three-dimensional model generation device according to this embodiment will be described with reference to FIG. 15. The 3D model generation device 14 according to this embodiment includes a receiving unit 100, a structure mask storage unit 101, a camera parameter storage unit 102, a mask integration unit 103, a coordinate conversion unit 104, a mask inside/outside determination unit 105, a threshold setting unit 106, a foreground model generation unit 107, and an output unit 108, and further includes an inside/outside angle of view determination unit 109 and a threshold calculation unit 110. The function of each processing unit is realized by the CPU 1001, described with reference to FIG. 4(b) in the first embodiment, executing a computer program read from the ROM 1002 or the RAM 1003. Since the configuration of the virtual viewpoint image generation system is the same as that of the first embodiment, its description is omitted.

三次元モデル生成装置１４の受信部１００乃至出力部１０８の機能は第１の実施形態と同様であるため、説明を省略する。 Since the functions of the receiving unit 100 to the output unit 108 of the 3D model generation device 14 are the same as those of the first embodiment, description thereof will be omitted.

画角内外判定部１０９は、各カメラのカメラパラメータに基づいて、対象ボクセル空間内の各ボクセルが各カメラの撮影範囲内であるか否かを判定する。 The inside/outside angle of view determination unit 109 determines whether or not each voxel in the target voxel space is within the shooting range of each camera based on the camera parameters of each camera.
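The text states only that the determination is based on each camera's parameters. As a hedged illustration, the sketch below assumes a pinhole camera described by a 3x4 projection matrix and an image size in pixels; this camera model, the function name `is_in_view`, and the sample matrix are assumptions and are not taken from the source.

```python
def is_in_view(point, projection, width, height):
    """Return True if a 3D point projects inside the image of one camera.

    projection is an assumed 3x4 pinhole projection matrix (list of 3 rows).
    """
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    # Multiply the projection matrix by the homogeneous point.
    u_h, v_h, w = (sum(p * c for p, c in zip(row, homogeneous))
                   for row in projection)
    if w <= 0:
        return False  # the point lies behind the camera
    u, v = u_h / w, v_h / w
    return 0 <= u < width and 0 <= v < height


# Hypothetical camera at the origin looking along +z:
# focal length 100 px, principal point (50, 50), 100x100 image.
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 50.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
centre_visible = is_in_view((0.0, 0.0, 5.0), P, 100, 100)
```

Running this check once per camera for a voxel centre yields the list of in-view cameras that the threshold calculation and the False Count of this embodiment use.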

閾値算出部１１０は、撮影範囲内であると判定されたカメラの台数に所定の割合を乗算した値を、ＴｒｕｅＣｏｕｎｔの閾値として算出する。例えば、あるボクセルを撮影範囲内とするカメラの台数が５台、所定の割合を６０％とすると、そのボクセルに対するＴｒｕｅＣｏｕｎｔの閾値は３として算出される。閾値算出部１１０により算出された閾値は閾値設定部１０６へ出力され、閾値設定部１０６は閾値算出部１１０から入力された閾値をＴｒｕｅＣｏｕｎｔの閾値として設定する。 The threshold calculation unit 110 calculates, as the True Count threshold, the value obtained by multiplying the number of cameras for which the voxel is determined to be within the shooting range by a predetermined ratio. For example, if five cameras include a certain voxel within their shooting range and the predetermined ratio is 60%, the True Count threshold for that voxel is calculated as 3. The threshold calculated by the threshold calculation unit 110 is output to the threshold setting unit 106, and the threshold setting unit 106 sets the threshold input from the threshold calculation unit 110 as the True Count threshold.

なお、あるボクセルを撮影範囲内とするカメラの台数が一定数未満である場合、生成される三次元モデルの精度は低くなり、処理が不要であると考えられることから、カメラの台数が一定数未満である場合には閾値を所定値に設定するように構成してもよい。 Note that if the number of cameras that include a certain voxel within their shooting range is less than a certain number, the accuracy of the generated three-dimensional model is low and the processing is considered unnecessary. The device may therefore be configured to set the threshold to a predetermined value when the number of cameras is less than that certain number.
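The threshold calculation, including the optional fallback for voxels seen by too few cameras, can be sketched as below. The parameter names `min_cameras` and `fallback` are hypothetical, and the source does not state which predetermined value is used, so the fallback value in the example is a placeholder.

```python
def true_count_threshold(cameras_in_view, ratio, min_cameras=0, fallback=0.0):
    """True Count threshold for one voxel.

    cameras_in_view: number of cameras whose angle of view contains the voxel.
    ratio: predetermined ratio, e.g. 0.6 for 60%.
    min_cameras, fallback: assumed names for the "certain number" and the
    "predetermined value" mentioned in the text.
    """
    if cameras_in_view < min_cameras:
        return fallback
    return cameras_in_view * ratio


# Example from the text: 5 cameras in view and a ratio of 60% give 3.
threshold = true_count_threshold(5, 0.6)
```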

＜処理＞
図１６は、本実施形態に係る三次元モデル生成装置が実施する処理の手順を示すフローチャートである。Ｓ３０１～Ｓ３０４の処理は、第１の実施形態で図５を参照しながら説明したＳ２０１～Ｓ２０４の処理と同様であるため、説明を省略する。 <Processing>
FIG. 16 is a flowchart showing the procedure of processing performed by the 3D model generation device according to this embodiment. Since the processing of S301 to S304 is the same as the processing of S201 to S204 described with reference to FIG. 5 in the first embodiment, description thereof will be omitted.

Ｓ３０５において、画角内外判定部１０９は、各カメラのカメラパラメータに基づいて、Ｓ３０４で選択された一つのボクセルが各カメラの画角内に含まれるか否かを判定する。 In S305, the inside/outside angle of view determination unit 109 determines, based on the camera parameters of each camera, whether or not the one voxel selected in S304 is included within the angle of view of each camera.

Ｓ３０６において、マスク内外判定部１０５は、選択された一つのボクセルが各カメラの統合マスク画像のマスク領域内に含まれず、且つ、選択された一つのボクセルが画角内に含まれる、カメラの台数（以降、ＦａｌｓｅＣｏｕｎｔと呼ぶ）をカウントする。 In S306, the mask inside/outside determination unit 105 determines the number of cameras in which the selected voxel is not included in the mask area of the integrated mask image of each camera and the selected voxel is included in the angle of view. (hereafter referred to as False Count).

Ｓ３０７～Ｓ３０９の処理は、Ｓ２０６～Ｓ２０８の処理と同様であるため、説明を省略する。 Since the processing of S307-S309 is the same as the processing of S206-S208, the description thereof is omitted.

Ｓ３１０において、閾値算出部１１０は、選択された一つのボクセルを画角内に含むカメラの台数に基づいて、ＴｒｕｅＣｏｕｎｔの閾値を算出する。閾値設定部１０６は、閾値算出部１１０により算出されたＴｒｕｅＣｏｕｎｔの閾値を設定する。 In S310, the threshold calculation unit 110 calculates the threshold of True Count based on the number of cameras including one selected voxel within the angle of view. The threshold setting unit 106 sets the threshold of True Count calculated by the threshold calculation unit 110 .

Ｓ３１１～Ｓ３１４の処理は、Ｓ２０９～Ｓ２１２の処理と同様であるため、説明を省略する。以上が図１６の一連の処理である。 Since the processing of S311 to S314 is the same as the processing of S209 to S212, description thereof will be omitted. The above is the series of processing in FIG.

ここで、図１７は、注視点に近い前景Ａと注視点から遠い前景Ｂとを含む競技場を図３と同様に１６台のカメラにより撮影する仮想視点画像生成システムを示す。前景Ａは１６台全てのカメラで画角内であり、前景Ｂは図１７のカメラ１０ｋ、１０ｌ、１０ｍの３台のカメラでのみ画角内であるものとする。 Here, FIG. 17 shows a virtual viewpoint image generation system in which a stadium including a foreground A close to the gaze point and a foreground B far from the gaze point is captured by 16 cameras, as in FIG. 3. It is assumed that the foreground A is within the angle of view of all 16 cameras, and that the foreground B is within the angle of view of only the three cameras 10k, 10l, and 10m in FIG. 17.

また、図１８は、図１７に示す仮想視点画像生成システムにおいて注視点から近い前景Ａの位置のボクセルと、注視点から遠い前景Ｂの位置のボクセルとのそれぞれのＦａｌｓｅＣｏｕｎｔ／ＴｒｕｅＣｏｕｎｔの一例を示す。ＦａｌｓｅＣｏｕｎｔの閾値は固定値の１０とし、ＴｒｕｅＣｏｕｎｔの閾値は、ボクセルを画角内に含むカメラの台数の７０％とする。 FIG. 18 shows an example of the False Count/True Count of a voxel at the position of the foreground A, close to the gaze point, and of a voxel at the position of the foreground B, far from the gaze point, in the virtual viewpoint image generation system shown in FIG. 17. The False Count threshold is a fixed value of 10, and the True Count threshold is 70% of the number of cameras that include the voxel within their angle of view.

注視点に近い前景Ａに位置するボクセルは１６台全てのカメラで統合マスク画像内に含まれるため、ボクセルが統合マスク画像外となるカメラは存在しない。従って、ボクセルが統合マスク画像外且つ画角内のカメラの台数は０であり、ＦａｌｓｅＣｏｕｎｔは０である。 Since voxels located in the foreground A near the gaze point are included in the integrated mask image for all 16 cameras, there is no camera whose voxels are outside the integrated mask image. Therefore, the number of cameras whose voxels are outside the integrated mask image and within the angle of view is 0, and the False Count is 0.

また、注視点に近い前景Ａに位置するボクセルを画角内に含むカメラの台数も１６台であるので、ＴｒｕｅＣｏｕｎｔの閾値は１６台の７０％である１１．２となる。そして、注視点に近い前景Ａに位置するボクセルは全てのカメラで前景マスク画像内となるためＴｒｕｅＣｏｕｎｔは１６となり、当該カウント値は閾値（１１．２）以上であるのでボクセルは除去されない。 Also, since the number of cameras including voxels located in the foreground A close to the gaze point within the angle of view is also 16, the threshold for the True Count is 11.2, which is 70% of the 16 cameras. Since the voxels located in the foreground A near the point of interest are within the foreground mask image for all cameras, the True Count is 16, and since the count value is equal to or greater than the threshold value (11.2), the voxels are not removed.

注視点から遠い前景Ｂの位置のボクセルは１３台のカメラ（カメラ１０ｋ、１０ｌ、１０ｍを除く１３台）で画角外となり、３台のカメラ（カメラ１０ｋ、１０ｌ、１０ｍ）で画角内となる。また、３台のカメラ（カメラ１０ｋ、１０ｌ、１０ｍ）でボクセルが統合マスク画像内となる。従って、ボクセルが統合マスク画像外且つ画角内のカメラの台数は０台であり、ＦａｌｓｅＣｏｕｎｔは０である。 The voxel at the position of the foreground B, far from the gaze point, is outside the angle of view of 13 cameras (all except cameras 10k, 10l, and 10m) and within the angle of view of three cameras (cameras 10k, 10l, and 10m). For those three cameras, the voxel is also within the integrated mask image. Therefore, the number of cameras for which the voxel is outside the integrated mask image and within the angle of view is 0, and the False Count is 0.

また、注視点から遠い前景Ｂに位置するボクセルを画角内に含むカメラの台数が３台であるので、ＴｒｕｅＣｏｕｎｔの閾値は３台の７０％である２．１となる。そして、注視点から遠い前景Ｂに位置するボクセルは３台のカメラで前景マスク画像内となるためＴｒｕｅＣｏｕｎｔは３となり、当該カウント値は閾値（２．１）以上であるのでボクセルは除去されない。 Also, since the number of cameras including voxels located in the foreground B far from the gaze point within the angle of view is three, the threshold value of the True Count is 2.1, which is 70% of the three cameras. A voxel located in the foreground B far from the gaze point is included in the foreground mask image by the three cameras, so the True Count is 3. Since the count value is equal to or greater than the threshold value (2.1), the voxel is not removed.
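The numbers for foregrounds A and B can be reproduced with a small sketch of the second embodiment's decision (S306 and S309 to S311). This is an illustration under assumptions: the per-camera booleans are taken as given, the function name is hypothetical, and list indices 10 to 12 are arbitrary stand-ins for cameras 10k, 10l, and 10m.

```python
def keep_voxel(in_view, in_integrated_mask, in_foreground_mask,
               false_count_threshold, ratio):
    """Decide whether a voxel is kept, using a view-dependent threshold."""
    # S306: cameras that see the voxel but whose integrated mask misses it.
    false_count = sum(1 for visible, inside in zip(in_view, in_integrated_mask)
                      if visible and not inside)
    if false_count >= false_count_threshold:
        return False
    # S309: cameras whose foreground mask contains the voxel.
    true_count = sum(1 for inside in in_foreground_mask if inside)
    # S310: threshold is a ratio of the number of cameras that see the voxel.
    threshold = sum(1 for visible in in_view if visible) * ratio
    # S311: remove when the True Count is at or below the threshold.
    return true_count > threshold


NUM_CAMERAS = 16
all_cameras = [True] * NUM_CAMERAS
three_cameras = [i in (10, 11, 12) for i in range(NUM_CAMERAS)]

# Foreground A: seen and masked as foreground by all 16 cameras (16 > 11.2).
keep_a = keep_voxel(all_cameras, all_cameras, all_cameras, 10, 0.7)
# Foreground B: seen and masked as foreground by 3 cameras only (3 > 2.1).
keep_b = keep_voxel(three_cameras, three_cameras, three_cameras, 10, 0.7)
```

With the fixed True Count threshold of 5 from the first embodiment, foreground B's count of 3 would not exceed the threshold, which is the failure this embodiment is meant to avoid.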

このように、対象とするボクセルが画角内に含まれるカメラの台数に基づいて、ＴｒｕｅＣｏｕｎｔの閾値を設定することによって、注視点から離れており、画角内であるカメラ台数が少ない前景について三次元モデルを生成することができる。従って、注視点から遠い前景であっても欠落を抑制した三次元モデルを生成することが可能となる。 In this way, by setting the True Count threshold based on the number of cameras whose angle of view includes the target voxel, a three-dimensional model can be generated even for a foreground that is far from the gaze point and within the angle of view of only a few cameras. It is therefore possible to generate a three-dimensional model in which omissions are suppressed even for a foreground far from the gaze point.

（第３の実施形態）
本実施形態では、対象とするボクセルが構造物マスク画像に含まれるカメラ台数に基づいて重み値を設定する。そして、対象とするボクセルが前景マスク画像に含まれるカメラの数と、対象とするボクセルが構造物マスク画像に含まれるカメラの台数に重み値を乗算した値とを加算した値が、ＴｒｕｅＣｏｕｎｔの閾値以下である場合に、当該ボクセルを除去すると判定する。これにより、多数のカメラで構造物により前景が遮られた場合でも欠落のない三次元モデルを生成する例を説明する。 (Third Embodiment)
In this embodiment, a weight value is set based on the number of cameras for which the target voxel is included in the structure mask image. The voxel is then determined to be removed when the sum of the number of cameras for which the target voxel is included in the foreground mask image and the number of cameras for which the target voxel is included in the structure mask image multiplied by the weight value is equal to or less than the True Count threshold. An example will be described in which a three-dimensional model with no omissions is thereby generated even when the foreground is blocked by a structure in many cameras.

第１の実施形態及び第２の実施形態では、各ボクセルのＴｒｕｅＣｏｕｎｔとしてボクセルが前景マスク画像内に含まれるカメラのみをカウントする例を説明した。しかし、その場合、多数のカメラにおいて構造物で隠れた前景の位置にあるボクセルは、ＴｒｕｅＣｏｕｎｔが閾値を超えずに、除去されてしまうことがある。 In the first and second embodiments, an example was described in which only cameras whose voxels are included in the foreground mask image are counted as the true count of each voxel. However, in that case, voxels in the foreground locations hidden by structures in many cameras may be removed without the TrueCount exceeding the threshold.

これに対して、本実施形態では、対象とするボクセルが前景マスク画像外であっても構造物マスク画像内に含まれる場合には、そのボクセルは前景である可能性があるため、ボクセルが構造物マスク画像内に含まれると判定されたカメラの台数に重み値を乗算した値を、ＴｒｕｅＣｏｕｎｔに加算することで、前景の欠落を回避する。 In contrast, in this embodiment, a target voxel that is outside the foreground mask image but inside the structure mask image may still be foreground. The number of cameras for which the voxel is determined to be inside the structure mask image is therefore multiplied by a weight value and added to the True Count, which avoids loss of the foreground.

＜三次元モデル生成装置の機能構成＞
図１９を参照して、本実施形態に係る三次元モデル生成装置の機能構成を説明する。本実施形態に係る三次元モデル生成装置１４は、第２の実施形態の三次元モデル生成装置の構成に加えて、重み設定部１１１をさらに備えている。各処理部の機能は、第１の実施形態において図４（ｂ）を参照して説明したＣＰＵ１００１がＲＯＭ１００２やＲＡＭ１００３から読み出したコンピュータプログラムを実施することにより実現される。また、仮想視点画像生成システムの構成は第１実施形態と同様であるため、説明は省略する。 <Functional configuration of 3D model generation device>
The functional configuration of the three-dimensional model generation device according to this embodiment will be described with reference to FIG. 19. The three-dimensional model generation device 14 according to this embodiment further includes a weight setting unit 111 in addition to the configuration of the three-dimensional model generation device of the second embodiment. The function of each processing unit is realized by the CPU 1001, described with reference to FIG. 4(b) in the first embodiment, executing a computer program read from the ROM 1002 or the RAM 1003. Since the configuration of the virtual viewpoint image generation system is the same as that of the first embodiment, its description is omitted.

重み設定部１１１は、対象とするボクセルが構造物マスク画像内と判定された場合にＴｒｕｅＣｏｕｎｔに加算する値を、カメラ１台当たりの重み値として設定する。この重み値は、前景に位置するボクセルの可能性を示す値と同等であり、本実施形態では、カメラ１台当たりの重み値を０．５と設定する。そして、対象とするボクセルが構造物マスク画像内と判定されたカメラの台数に、カメラ１台当たりの重み値０．５を乗算した値を、ＴｒｕｅＣｏｕｎｔに加算する。 The weight setting unit 111 sets, as a weight value per camera, a value to be added to the True Count when it is determined that the target voxel is within the structure mask image. This weight value is equivalent to the value indicating the likelihood of voxels located in the foreground, and in this embodiment the weight value per camera is set to 0.5. Then, a value obtained by multiplying the number of cameras for which the target voxel is determined to be within the structure mask image by a weight value of 0.5 per camera is added to the True Count.

＜処理＞
図２０は、本実施形態に係る三次元モデル生成装置が実施する処理の手順を示すフローチャートである。 <Processing>
FIG. 20 is a flowchart showing the procedure of processing performed by the 3D model generation device according to this embodiment.

Ｓ４０１～Ｓ４０４の処理はＳ３０１～Ｓ３０４の処理と同様であり、Ｓ４０５～Ｓ４０８の処理はＳ３０６～Ｓ３０９の処置と同様である。Ｓ４０９の処理はＳ３０５の処理と同様であり、Ｓ４１０の処理はＳ３１０の処理と同様である。 The processing of S401-S404 is the same as the processing of S301-S304, and the processing of S405-S408 is the same as the processing of S306-S309. The processing of S409 is the same as the processing of S305, and the processing of S410 is the same as the processing of S310.

Ｓ４１１において、マスク内外判定部１０５は、選択された一つのボクセルが各カメラの構造物マスク画像のマスク領域内に含まれるカメラの台数をカウントする。 In S411, the mask inside/outside determination unit 105 counts the number of cameras in which one selected voxel is included in the mask area of the structure mask image of each camera.

Ｓ４１２において、重み設定部１１１は、構造物マスク画像のマスク領域内に含まれるカメラの台数に、カメラ１台当たりの重み値０．５を乗算した値を、Ｓ４０８で算出されたＴｒｕｅＣｏｕｎｔに加算する。Ｓ４１３～Ｓ４１６の処理は、Ｓ３１１～Ｓ３１４の処理と同様である。以上で図２０の一連の処理が終了する。 In S412, the weight setting unit 111 multiplies the number of cameras included in the mask region of the structure mask image by a weight value of 0.5 per camera, and adds the value to the True Count calculated in S408. do. The processing of S413-S416 is the same as the processing of S311-S314. Thus, the series of processing in FIG. 20 ends.

ここで、図２１に、ある前景領域に位置するボクセルにおける、重み加算なしの場合のＴｒｕｅＣｏｕｎｔの例と、本実施形態に係る重み加算ありの場合のＴｒｕｅＣｏｕｎｔの例とを示す。 Here, FIG. 21 shows an example of True Count without weight addition and an example of True Count with weight addition according to the present embodiment in a voxel located in a certain foreground region.

このボクセルは１６台全てのカメラで画角内であり、対象とするボクセルを前景マスク画像内に含むカメラの台数が７台、対象とするボクセルを構造物マスク画像内に含むカメラの台数が９台であるものとする。この場合、ボクセルが統合マスク画像外であるカメラは０台（全カメラ１６台－７台－９台）である。従って、ボクセルが統合マスク画像外且つ画角内のカメラの台数は０であり、ＦａｌｓｅＣｏｕｎｔは０である。 This voxel is within the angle of view of all 16 cameras, the number of cameras including the target voxel in the foreground mask image is 7, and the number of cameras including the target voxel in the structure mask image is 9. shall be a table. In this case, the number of cameras whose voxels are outside the integrated mask image is 0 (16 total cameras-7 cameras-9 cameras). Therefore, the number of cameras whose voxels are outside the integrated mask image and within the angle of view is 0, and the False Count is 0.

重み加算なしの場合、対象とするボクセルを前景マスク画像内に含むカメラの台数が７台であるため、ＴｒｕｅＣｏｕｎｔは７となる。ＴｒｕｅＣｏｕｎｔの閾値が、対象とするボクセルを画角内に含むカメラの台数の７０％であるものとする。すると、閾値は１１．２（１６×０．７）となるため、ＴｒｕｅＣｏｕｎｔ（７）＜閾値（１１．２）であり、ＴｒｕｅＣｏｕｎｔが閾値以下となることから、当該ボクセルは除去されてしまう。 Without weight addition, seven cameras include the target voxel in the foreground mask image, so the True Count is 7. Assume that the True Count threshold is 70% of the number of cameras that include the target voxel within their angle of view. The threshold is then 11.2 (16 × 0.7), so True Count (7) < threshold (11.2). Because the True Count is equal to or less than the threshold, the voxel is removed.

一方、重み加算ありの場合、対象とするボクセルを前景マスク画像内に含むカメラの台数が７台であるため、同様にＴｒｕｅＣｏｕｎｔは７となり、これに重み値が加算されることになる。対象とするボクセルを構造物マスク画像内に含むカメラの台数が９であり、カメラ１台あたりの重み値が０．５であるため、９×０．５＝４．５を重み値として加算する。重み値を加算した後のＴｒｕｅＣｏｕｎｔは１１．５であり、ＴｒｕｅＣｏｕｎｔ（１１．５）＞閾値（１１．２）となり、閾値を超えることから、当該ボクセルは前景であるものとして除去されない。 With weight addition, on the other hand, seven cameras include the target voxel in the foreground mask image, so the True Count is likewise 7, and the weight value is added to it. Nine cameras include the target voxel in the structure mask image and the weight value per camera is 0.5, so 9 × 0.5 = 4.5 is added as the weight value. The True Count after adding the weight value is 11.5, so True Count (11.5) > threshold (11.2). Because it exceeds the threshold, the voxel is regarded as foreground and is not removed.
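The arithmetic of this comparison can be checked with a few lines of Python. The variable names are illustrative only; the numbers are the ones given in the text.

```python
WEIGHT_PER_CAMERA = 0.5   # weight per camera that sees the voxel in the structure mask
foreground_cameras = 7    # cameras whose foreground mask contains the voxel
structure_cameras = 9     # cameras whose structure mask contains the voxel
cameras_in_view = 16      # cameras whose angle of view contains the voxel

threshold = cameras_in_view * 0.7                       # 70% of in-view cameras
plain_true_count = foreground_cameras                   # no weight addition
weighted_true_count = (foreground_cameras
                       + structure_cameras * WEIGHT_PER_CAMERA)

# A voxel is removed when its True Count is at or below the threshold.
removed_without_weight = plain_true_count <= threshold
removed_with_weight = weighted_true_count <= threshold
```

Here `removed_without_weight` evaluates to `True` and `removed_with_weight` to `False`, matching the two cases described above.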

なお、本実施形態では、構造物が一つである場合を想定したが、前景と重なる可能性のある異なる複数の構造物がある場合、構造物マスク画像の種類ごとに異なる重み値を設定し、その重み値に基づく値をＴｒｕｅＣｏｕｎｔに加算してもよい。例えば、競技場の競技フィールドを囲むように設置されている電子看板の構造物マスク画像については、電子看板は大きく前景と重なりやすいことから前景を含む可能性が高くなるので、カメラ１台当たりの重み値を０．５とする。また、ゴールの構造物マスク画像については、カメラ１台当たりの重み値を０．３とする。ゴールよりも電子看板の方が大きく隙間もないので前景（人物）と重なる可能性が高いと考えられることから、電子看板に対する重み値を、ゴールに対する重み値よりも大きい値としている。 This embodiment assumes a single structure, but if there are several different structures that may overlap the foreground, a different weight value may be set for each type of structure mask image, and a value based on that weight value may be added to the True Count. For example, for the structure mask image of the electronic signboards installed around the playing field of the stadium, the weight value per camera is set to 0.5, because the signboards are large and easily overlap the foreground, which makes it more likely that the mask contains foreground. For the structure mask image of the goal, the weight value per camera is set to 0.3. Since the electronic signboard is larger than the goal and has no gaps, it is considered more likely to overlap the foreground (person), so the weight value for the electronic signboard is set larger than the weight value for the goal.
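A per-structure weighting of this kind might be sketched as follows. The weights 0.5 and 0.3 come from the text; the function name, the dictionary keys, and the camera counts in the example are hypothetical.

```python
def weighted_true_count(foreground_cameras, structure_cameras, weights):
    """Add a weighted contribution per structure type to the True Count.

    structure_cameras maps a structure type to the number of cameras whose
    mask for that structure contains the voxel; weights maps the same keys
    to the weight value per camera.
    """
    total = float(foreground_cameras)
    for structure, cameras in structure_cameras.items():
        total += cameras * weights[structure]
    return total


weights = {"signboard": 0.5, "goal": 0.3}
# Hypothetical voxel: foreground in 7 cameras, behind the signboard in 4,
# and behind the goal in 5.
count = weighted_true_count(7, {"signboard": 4, "goal": 5}, weights)
```

Keeping the weights in a mapping keyed by structure type leaves room for the position-dependent or scene-dependent weights mentioned in the text, although how those would be chosen is not specified.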

また、ボクセル位置、シーン、マスク領域の大きさや形状、撮影対象となる競技場のエリアなどに応じて異なる重み値を設定してもよい。 Also, different weight values may be set according to the voxel position, the scene, the size and shape of the mask area, the area of the stadium to be photographed, and the like.

以上説明したように、本実施形態では、対象とするボクセルが構造物マスク画像のマスク領域内に含まれるカメラの台数に基づく重みをＴｒｕｅＣｏｕｎｔに加算した上で、閾値判定を行う。これにより、多数のカメラで前景が構造物に遮られる場合でも、欠落のない三次元モデルの生成を実現することができる。 As described above, in this embodiment, the threshold determination is performed after adding to the True Count a weight based on the number of cameras for which the target voxel is included in the mask region of the structure mask image. As a result, a three-dimensional model with no omissions can be generated even when the foreground is blocked by a structure in many cameras.

（第４の実施形態）
本実施形態では、第１の実施形態で用いた前景マスク画像に含まれるカメラ台数（ＴｒｕｅＣｏｕｎｔ）の代わりに、構造物マスク画像に含まれるカメラ台数を用いる処理について説明する。 (Fourth embodiment)
In this embodiment, processing using the number of cameras included in the structure mask image instead of the number of cameras (True Count) included in the foreground mask image used in the first embodiment will be described.

第１の実施形態では、前景マスク画像と構造物マスク画像に基づいて生成した三次元モデルに対して、毎回、前景マスク画像を更新して、三次元モデルを構成するボクセルが前景マスク画像に含まれるか判定するため、処理が煩雑となる場合がある。本実施形態では、前景マスク画像と構造物マスク画像に基づいて生成した三次元モデルに対して、固定の構造物マスク画像に含まれるカメラ台数をカウントすることにより、構造物を含まない前景三次元モデルの生成を行う。 In the first embodiment, for the three-dimensional model generated based on the foreground mask image and the structure mask image, the foreground mask image is updated every time and it is determined whether the voxels constituting the three-dimensional model are included in the foreground mask image, so the processing may become complicated. In this embodiment, for the three-dimensional model generated based on the foreground mask image and the structure mask image, the number of cameras for which a voxel is included in the fixed structure mask image is counted, and a foreground three-dimensional model that does not include the structure is thereby generated.

<Functional configuration and hardware configuration of the three-dimensional model generation device>
FIG. 23 is a diagram showing the configuration of the three-dimensional model generation device in this embodiment. The configuration of the three-dimensional model generation device 14 in this embodiment is substantially the same as in the first embodiment, and description of blocks that perform the same processing is omitted. The three-dimensional model generation device 14 according to this embodiment includes a mask inside/outside determination unit 112 in place of the mask inside/outside determination unit 105. For each voxel in the target voxel space, the mask inside/outside determination unit 112 counts the number of cameras for which the voxel falls inside or outside the mask of the integrated mask image and of the structure mask image, determines by threshold determination whether or not to remove the voxel, and outputs the result to the foreground model generation unit 107. The hardware configuration of the three-dimensional model generation device 14 of this embodiment is the same as that shown in FIG. 4(b), so its description is omitted.

<Processing>
FIG. 24 is a flowchart showing the procedure of the processing performed by the three-dimensional model generation device 14 in this embodiment. S501 to S507 and S510 to S512 are the same as S201 to S207 and S210 to S212 described in the first embodiment with reference to FIG. 5, so detailed description of the overlapping parts is omitted and the explanation focuses on the necessary parts.

In S506, the mask inside/outside determination unit 112 determines whether or not FalseCount is equal to or greater than a threshold. If FalseCount is less than the threshold, the selected voxel can be determined to be foreground or a structure, so the process proceeds to S508.

In S508, the mask inside/outside determination unit 112 counts the number of cameras for which the pixel or area corresponding to the selected voxel is included in the mask region of that camera's structure mask image (hereinafter referred to as StructureCount).

In S509, the mask inside/outside determination unit 112 determines whether StructureCount is equal to or greater than a threshold. If StructureCount is equal to or greater than the threshold, the selected voxel can be determined to be a structure, so the process proceeds to S507 and the selected voxel is removed from the target voxel space. If StructureCount is less than the threshold, the selected voxel can be determined to be foreground and is not removed from the target voxel space.
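The per-voxel decision in S506 to S509 can be sketched as below. This is only an illustration under stated assumptions: the function names and the per-camera boolean lists are hypothetical, and the projection of a voxel into each camera's mask images is assumed to have been done already.

```python
def count_cameras(in_integrated_mask, in_structure_mask):
    """Return (FalseCount, StructureCount) for one voxel.

    in_integrated_mask[i]: True if the voxel projects inside camera i's
        integrated mask (foreground mask merged with structure mask).
    in_structure_mask[i]: True if it projects inside camera i's structure mask.
    Both lists are hypothetical inputs, one entry per camera.
    """
    false_count = sum(1 for inside in in_integrated_mask if not inside)
    structure_count = sum(1 for inside in in_structure_mask if inside)
    return false_count, structure_count


def keep_voxel(in_integrated_mask, in_structure_mask,
               false_threshold, structure_threshold):
    """Return True if the voxel stays in the target voxel space."""
    false_count, structure_count = count_cameras(
        in_integrated_mask, in_structure_mask)
    if false_count >= false_threshold:
        # S506 -> S507: neither foreground nor structure, so remove.
        return False
    if structure_count >= structure_threshold:
        # S509 -> S507: judged to be a structure, so remove.
        return False
    # S509: judged to be foreground, so keep.
    return True


# Four cameras, with assumed thresholds of 3 (FalseCount) and 2 (StructureCount).
print(keep_voxel([True] * 4, [False] * 4, 3, 2))                # foreground voxel
print(keep_voxel([True] * 4, [True, True, False, False], 3, 2)) # structure voxel
```

Because the structure mask images are fixed, the second test needs only a count against masks that do not change from frame to frame.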

Here, an example of generating a three-dimensional model is described, taking as an example the virtual viewpoint image generation system shown in FIG. 3, in which a stadium is captured by 16 cameras. FIG. 25 shows examples of the voxel FalseCount/StructureCount values and the determination results for a person, the person's feet, the person's head, a goal, and the other regions, which serve respectively as examples of foreground, foreground undetected by some cameras, foreground hidden by a structure, a structure, and non-foreground in the virtual viewpoint image generation system shown in FIG. 3.

It is assumed, however, that foreground extraction of the person's feet has failed in one camera and that the person's head is hidden behind the goal, which is a structure, in three cameras, so that these are not extracted as foreground by the foreground/background separation device 13.

In the determination of S504, when the FalseCount threshold is a fixed value of 10, voxels located in the other regions (that is, everywhere except the person, the feet, the head, and the goalposts of the structure) have a FalseCount of 16, which exceeds the threshold, and are therefore removed. FIG. 12 shows an example of a three-dimensional model generated by applying the FalseCount threshold determination.

Furthermore, in the determination shown in S508, when the StructureCount threshold is a fixed value of 3, voxels located in the region of the goal, which is a structure, have a StructureCount of 5, which is equal to or greater than the threshold, and are therefore removed. Voxels located in the regions of the person, the person's feet, and the head each have a StructureCount of 0, which is less than the threshold, and are therefore not removed. As a result, the three-dimensional model of the person with no missing parts shown in FIG. 13 is generated.
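The 16-camera example can be traced with a short script. The thresholds (10 and 3), the FalseCount of 16 for the other regions, and the StructureCount values of 5 and 0 come from the text. The remaining counts are not stated in this passage and are assumptions chosen only so that the example runs; the actual values are those of FIG. 25.

```python
FALSE_THRESHOLD = 10      # fixed FalseCount threshold from the text
STRUCTURE_THRESHOLD = 3   # fixed StructureCount threshold from the text

# region: (FalseCount, StructureCount)
REGIONS = {
    "person": (0, 0),   # FalseCount assumed
    "feet":   (1, 0),   # FalseCount assumed (one camera missed the feet)
    "head":   (0, 0),   # FalseCount assumed
    "goal":   (0, 5),   # FalseCount assumed; StructureCount 5 from the text
    "other":  (16, 0),  # FalseCount 16 from the text; StructureCount assumed
}


def decide(false_count, structure_count):
    """Apply the FalseCount test, then the StructureCount test."""
    if false_count >= FALSE_THRESHOLD:
        return "removed"  # neither foreground nor structure
    if structure_count >= STRUCTURE_THRESHOLD:
        return "removed"  # structure
    return "kept"         # foreground


results = {name: decide(*counts) for name, counts in REGIONS.items()}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

Under these values the person, feet, and head voxels are kept while the goal and the other regions are removed, matching the outcome described for FIG. 13.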

With the above processing, the threshold determination on the number of cameras for which a voxel is included in the structure mask (StructureCount) makes it possible to generate a three-dimensional model with no missing parts even when the foreground is blocked by a structure.

(Other embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of the system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

1: virtual viewpoint video generation system, 10: camera, 11: camera array, 12: control device, 13: foreground/background separation device, 14: three-dimensional model generation device, 15: rendering device, 100: receiving unit, 101: structure mask storage unit, 102: camera parameter holding unit, 103: mask integration unit, 104: coordinate conversion unit, 105: mask inside/outside determination unit, 106: threshold setting unit, 107: foreground model generation unit, 108: output unit, 109: angle-of-view inside/outside determination unit, 110: threshold calculation unit, 111: weight setting unit, 112: mask inside/outside determination unit

An information processing apparatus according to the present invention that achieves the above object includes:
first acquisition means for acquiring first area information indicating an area of an object in a plurality of images obtained by capturing from a plurality of capturing directions;
second acquisition means for acquiring second area information indicating an area of a structure in the plurality of images obtained by capturing from the plurality of capturing directions; and
generation means for generating three-dimensional shape data corresponding to the object based on the first area information indicating the area of the object acquired by the first acquisition means and the second area information indicating the area of the structure acquired by the second acquisition means,
wherein the generation means generates the three-dimensional shape data corresponding to the object by removing, from among the constituent elements that may constitute the three-dimensional shape data, the constituent elements corresponding to the structure.

Claims

1. An information processing device that generates three-dimensional shape data of an object based on a plurality of captured images captured by a plurality of cameras, comprising:
determination means for determining, for each predetermined element constituting a three-dimensional space, whether or not a condition is met that the number of cameras, among the plurality of cameras, for which a pixel or area corresponding to the predetermined element is included in the area of the object in the captured image is equal to or less than a first threshold; and
generation means for generating three-dimensional shape data of the object that includes the predetermined elements not determined by the determination means to meet the condition.

2. The information processing apparatus according to claim 1, further comprising integration means for generating an integrated area by integrating the area of the object and a structure area indicating the area of a structure,
wherein, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is equal to or greater than a second threshold, the determination means determines that the pixel or area corresponding to the predetermined element is to be removed from the candidates for the three-dimensional shape data of the object.

3. The information processing apparatus according to claim 2, wherein, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is less than the second threshold, the determination means determines whether or not the number of cameras for which the pixel or area corresponding to the predetermined element is included in the area of the object is equal to or less than the first threshold.

4. The information processing apparatus according to claim 2 or 3, wherein, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area and is included in the angle of view is equal to or greater than the second threshold, the determination means determines that the pixel or area corresponding to the predetermined element is to be removed from the candidates for the three-dimensional shape data of the object.

5. The information processing apparatus according to any one of claims 2 to 4, further comprising weight setting means for setting a weight value based on the number of cameras for which the pixel or area corresponding to the predetermined element is included in the structure area,
wherein the determination means determines that the pixel or area corresponding to the predetermined element is to be removed from the candidates for the three-dimensional shape data of the object when the value obtained by adding, to the number of cameras for which the pixel or area corresponding to the predetermined element is included in the area of the object, the product of the weight value and the number of cameras for which the pixel or area corresponding to the predetermined element is included in the structure area is equal to or less than the first threshold.

6. The information processing apparatus according to claim 5, wherein said weight setting means sets said weight value according to the type or shape of a structure in said structure area.

7. The information processing apparatus according to any one of claims 1 to 6, further comprising threshold calculation means for calculating the first threshold based on the number of cameras whose angle of view includes the pixel or area corresponding to the predetermined element.

8. An information processing device that generates three-dimensional shape data of an object based on a plurality of captured images captured by a plurality of cameras, comprising:
determination means for determining, for each predetermined element constituting a three-dimensional space, whether or not a condition is met that the number of cameras, among the plurality of cameras, for which a pixel or area corresponding to the predetermined element is included in the structure area in the captured image is equal to or greater than a first threshold; and
generation means for generating three-dimensional shape data of the object that includes the predetermined elements not determined by the determination means to meet the condition.

9. The information processing apparatus according to claim 8, further comprising integration means for generating an integrated area by integrating the object area and the structure area,
wherein, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is equal to or greater than a second threshold, the determination means determines that the pixel or area corresponding to the predetermined element is to be removed from the candidates for the three-dimensional shape data of the object.

10. The information processing apparatus according to claim 9, wherein, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is less than the second threshold, the determination means determines whether or not the number of cameras for which the pixel or area corresponding to the predetermined element is included in the structure area is equal to or greater than the first threshold.

11. The information processing apparatus according to any one of claims 1 to 10, wherein the object is a subject for which three-dimensional shape data is to be generated.

12. The information processing apparatus according to claim 1, further comprising output means for outputting the three-dimensional shape data of said object generated by said generating means to a rendering device.

13. A control method for an information processing device that generates three-dimensional shape data of an object based on a plurality of captured images captured by a plurality of cameras, comprising:
a determination step of determining, for each predetermined element constituting a three-dimensional space, whether or not a condition is met that the number of cameras, among the plurality of cameras, for which a pixel or area corresponding to the predetermined element is included in the area of the object in the captured image is equal to or less than a first threshold; and
a generation step of generating three-dimensional shape data of the object that includes the predetermined elements not determined in the determination step to meet the condition.

14. The control method for an information processing device according to claim 13, further comprising an integration step of generating an integrated area by integrating the area of the object and a structure area indicating the area of a structure,
wherein, in the determination step, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is equal to or greater than a second threshold, it is determined that the pixel or area corresponding to the predetermined element is to be removed from the candidates for the three-dimensional shape data of the object.

15. The control method for an information processing device according to claim 14, wherein, in the determination step, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is less than the second threshold, it is determined whether or not the number of cameras for which the pixel or area corresponding to the predetermined element is included in the area of the object is equal to or less than the first threshold.

16. The control method for an information processing device according to claim 14 or 15, wherein, in the determination step, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area and is included in the angle of view is equal to or greater than the second threshold, it is determined that the pixel or area corresponding to the predetermined element is to be removed from the candidates for the three-dimensional shape data of the object.

17. A control method for an information processing device that generates three-dimensional shape data of an object based on a plurality of captured images captured by a plurality of cameras, comprising:
a determination step of determining, for each predetermined element constituting a three-dimensional space, whether or not a condition is met that the number of cameras, among the plurality of cameras, for which a pixel or area corresponding to the predetermined element is included in the structure area in the captured image is equal to or greater than a first threshold; and
a generation step of generating three-dimensional shape data of the object that includes the predetermined elements not determined in the determination step to meet the condition.

18. The control method for an information processing device according to claim 17, further comprising an integration step of generating an integrated area by integrating the object area and the structure area,
wherein, in the determination step, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is equal to or greater than a second threshold, it is determined that the pixel or area corresponding to the predetermined element is to be removed from the candidates for the three-dimensional shape data of the object.

19. The control method for an information processing device according to claim 18, wherein, in the determination step, when the number of cameras for which the pixel or area corresponding to the predetermined element is not included in the integrated area is less than the second threshold, it is determined whether or not the number of cameras for which the pixel or area corresponding to the predetermined element is included in the structure area is equal to or greater than the first threshold.

20. A program for causing a computer to function as each means of the information processing apparatus according to any one of claims 1 to 12.