JP7078564B2 - Image processing device and program

Info

Publication number: JP7078564B2
Application number: JP2019029364A
Authority: JP
Inventors: 敬介野中
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-02-21
Filing date: 2019-02-21
Publication date: 2022-05-31
Anticipated expiration: 2039-02-21
Also published as: JP2020135525A

Description

The present invention relates to an image processing device and a program capable of obtaining an appropriate three-dimensional model even when the subject is occluded.

Conventionally, techniques have been proposed for generating video from a free viewpoint not captured by any camera (hereinafter, free-viewpoint video), targeting sports scenes and the like. Based on video captured by multiple cameras, such a technique synthesizes video from virtual viewpoints where no camera is placed and displays the result on a screen, enabling viewing from various viewpoints.

Among free-viewpoint video synthesis techniques, there is an existing technique that synthesizes high-quality free-viewpoint video by generating a 3D computer graphics (3DCG) model of the subject using a principle called the visual hull method (Non-Patent Document 1). In this method, the silhouette information of the subject obtained from multiple cameras is back-projected into three-dimensional space and described as a very large number of points (point cloud data), reproducing the approximate shape of the subject precisely. Free-viewpoint video is generated by taking the previously generated 3DCG model of the subject as input, determining the position of the virtual viewpoint, and rendering the model on a display. In addition, techniques have been proposed that realize the visual hull method using a group of virtual planes rather than point cloud data (Patent Documents 1 and 2).

As described above, multiple inventions have been proposed that build a 3DCG model of a subject from multiple camera videos and synthesize virtual video from an arbitrary viewpoint, and most of them follow the principle of the visual hull method.

Patent Document 1: Japanese Patent Application No. 2017-167472
Patent Document 2: Japanese Patent Application No. 2018-161868

Non-Patent Document 1: Laurentini, A. "The Visual Hull Concept for Silhouette Based Image Understanding." IEEE PAMI, 16, 2 (1994), 150-162

The visual hull method enables 3DCG modeling of subjects in a wide variety of scenes, but its application presupposes that, in principle, every camera captures the subject. That is, if the subject is occluded in any camera and part of the subject's recognition result in that camera image is missing, the subject model is likewise missing that part, causing significant quality degradation in the synthesized video. In particular, many widely used person extraction techniques assume that the person is moving, so this degradation is pronounced when the subject is occluded by a stationary object. (When the occlusion arises from the front-back relationship between subjects, the shape of the subject on the background side can be supplemented with the shape of the subject in the foreground.) The same problem occurs in person extraction using machine learning when the subject person is occluded by a stationary object other than a person.

This problem is especially pronounced in sports video and the like, where the interaction between the subject and stationary objects is important. For example, consider sport climbing, for which FIG. 1 shows a schematic example of the venue. When the athlete PL, who is the subject, grasps a hold HL (a structure fixed to and protruding from the wall W that provides the athlete PL with footholds and handholds for climbing) in order to climb the wall W, the hand is occluded by the hold HL and is not modeled, so the synthesized video often shows the athlete PL with the hand missing. This is not limited to the hand: if any part of the athlete PL is occluded by a hold HL as seen from any camera, the synthesized video is likewise missing that part. (In the schematic example of FIG. 1, the athlete PL is depicted on the ground GR before climbing the wall W.)

As described above, the conventional visual hull method has the problem that an appropriate model cannot be obtained when the subject to be modeled in 3DCG (for example, the athlete PL) is occluded by something other than the modeling target (for example, a hold HL). Even if virtual video from an arbitrary viewpoint is synthesized from such an inappropriate model, the result is an inappropriate synthesized video in which parts of the subject are missing.

In view of the above problems of the prior art, an object of the present invention is to provide an image processing device and a program capable of obtaining an appropriate three-dimensional model even when the subject is occluded.

To achieve the above object, the present invention is an image processing device comprising: a first extraction unit that extracts, from each image of a multi-viewpoint image, the region of a subject as a first mask; a second extraction unit that extracts, from each image of the multi-viewpoint image, the region of a foreground object, which is a target distinct from the subject, as a second mask; an identification unit that identifies individual regions in the second mask; an integration unit that obtains an integrated first mask by adding to the first mask those identified individual regions of the second mask that are determined, based on positional relationship, to be causing a loss due to occlusion in the first mask; and a generation unit that generates a three-dimensional model of the subject using the integrated first masks corresponding to the images of the multi-viewpoint image. The present invention is also a program that causes a computer to function as the image processing device.

According to the present invention, the integration unit obtains the integrated first mask by adding those identified individual regions of the second mask that are determined, based on positional relationship, to be causing a loss due to occlusion in the first mask, and the generation unit then generates the three-dimensional model of the subject using the integrated first mask. An appropriate three-dimensional model can therefore be obtained even when the subject is occluded.

FIG. 1 is a diagram showing a schematic example of a sport climbing venue, as an example in which video exhibiting the problematic situation is captured.
FIG. 2 is a functional block diagram of the image processing device according to one embodiment.
FIG. 3 is a diagram showing a schematic example of the processing by the identification unit.
FIG. 4 is a diagram showing a schematic example of the second embodiment of the labeling in the additional processing of the identification unit.
FIG. 5 is a diagram showing a schematic example of the integration processing by the integration unit.
FIG. 6 is a diagram showing an example of the hardware configuration of a general computer device capable of realizing the image processing device.

FIG. 2 is a functional block diagram of the image processing device 10 according to one embodiment. As shown, the image processing device 10 includes a first extraction unit 1, a second extraction unit 2, an identification unit 3, an integration unit 4, a generation unit 5, a synthesis unit 6, and a calibration unit 7. As its overall operation, the image processing device 10 receives, at the first extraction unit 1, the second extraction unit 2, and the calibration unit 7, a multi-viewpoint image as each time frame of a multi-viewpoint video. After processing by the units shown, the generation unit 5 creates a 3DCG model from which the effect of losses due to occlusion has been removed, and the synthesis unit 6 uses this model to output a synthesized image at a virtual viewpoint position specified by user input or the like. By outputting this synthesized image for each time of the input multi-viewpoint video, the image processing device 10 can output free-viewpoint video.

The following description treats the frame at one arbitrary time in the multi-viewpoint video as the multi-viewpoint image. (That is, when the image processing device 10 handles video, the processing at each time can basically be the same. Therefore, unless video or time is explicitly mentioned, the following description concerns this processing common to every time.)

An overview of the processing of each functional unit of the image processing device 10 follows; FIG. 2 also shows the input and output data of each unit. As is known in the technical fields of the visual hull method and free-viewpoint image generation, the multi-viewpoint image input to the image processing device 10 is obtained by capturing the same scene (for example, the sport climbing venue of FIG. 1) with multiple (at least two) cameras arranged to surround it.

The first extraction unit 1 extracts, from the image of each viewpoint of the input multi-viewpoint image, the region of the subject for which the downstream generation unit 5 generates a 3DCG model, as a first mask, and outputs it to the integration unit 4. The second extraction unit 2 extracts, from the image of each viewpoint of the input multi-viewpoint image, the region of things (stationary objects) other than the target for which the generation unit 5 generates a 3DCG model, as a second mask, and outputs it to the identification unit 3. The calibration unit 7 analyzes the input multi-viewpoint image to obtain calibration data (camera calibration data) for the cameras capturing the multi-viewpoint image, and outputs it to the identification unit 3 and the generation unit 5.

The identification unit 3 identifies individual regions in the second mask obtained from the second extraction unit 2 (generally composed of one or more closed regions), thereby obtaining an identified second mask, and outputs it to the integration unit 4.

In one embodiment, the identification unit 3 handles the situation in which two or more different stationary objects are actually captured but appear connected as a single closed region in the image. For each closed region in the second mask, it identifies whether the region arises from one stationary object or from two or more stationary objects, and outputs the second mask so identified to the integration unit 4. The identification unit 3 can use the calibration data obtained from the calibration unit 7 to perform this identification.

Here, when one closed region arises from two or more stationary objects, the identified second mask is output with the additional distinction of which part of that closed region arises from which stationary object.

In every embodiment, the second mask identified by the identification unit 3 is a second mask in which the individual region of each individual stationary object has been identified. (The second mask as obtained from the second extraction unit 2 carries no such identification.)

The integration unit 4 takes the first mask obtained from the first extraction unit 1 and, using the second mask with identified individual regions obtained from the identification unit 3, fills each location in the first mask where occlusion is determined to have occurred by adding the corresponding individual region of the second mask (integration processing). It outputs the resulting integrated first mask to the generation unit 5. (Depending on the determination, no individual region may be added, in which case the first mask obtained from the first extraction unit 1 and the first mask integrated by the integration unit 4 are identical as data.)
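The integration processing outlined above can be sketched as follows. This is only an illustration under a stated assumption: the overview does not specify the positional criterion, so the sketch treats an individual region as occluding when it overlaps or is 4-adjacent to a first-mask pixel. The function name `integrate_masks` and that adjacency rule are hypothetical, not taken from the specification.

```python
def integrate_masks(first_mask, labeled_second_mask):
    """Add occluding individual regions of the second mask to the first mask.

    first_mask: 2D list of 0/1 (1 = subject).
    labeled_second_mask: 2D list of ints (0 = background, k > 0 = region ID).
    ASSUMPTION: a region is judged to occlude the subject if any of its
    pixels overlaps or is 4-adjacent to a subject pixel.
    """
    height, width = len(first_mask), len(first_mask[0])
    touching = set()  # IDs of regions judged to occlude the subject
    for y in range(height):
        for x in range(width):
            label = labeled_second_mask[y][x]
            if label == 0:
                continue
            for ny, nx in ((y, x), (y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < height and 0 <= nx < width and first_mask[ny][nx] == 1:
                    touching.add(label)
                    break
    # Union of the subject mask and every region judged to occlude it.
    return [
        [
            1 if first_mask[y][x] == 1 or labeled_second_mask[y][x] in touching else 0
            for x in range(width)
        ]
        for y in range(height)
    ]
```

When no region satisfies the criterion, the function returns a mask identical to the input first mask, which matches the parenthetical remark above.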

The first mask output by the first extraction unit 1, the second mask output by the second extraction unit 2, the identified second mask output by the identification unit 3, and the integrated first mask output by the integration unit 4 are each obtained for every camera-viewpoint image of the input multi-viewpoint image, and within each of the functional units 1, 2, 3, and 4 the processing to obtain them is the same for every viewpoint.

The generation unit 5 applies the visual hull method to the integrated first masks obtained from the integration unit 4, one for each camera-viewpoint image of the multi-viewpoint image, thereby generating a 3DCG model of the subject captured in the multi-viewpoint image, and outputs it to the synthesis unit 6. In doing so, the generation unit 5 can refer to the calibration data obtained from the calibration unit 7.
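As a rough illustration of the visual hull principle used here, the sketch below carves a set of candidate voxels, keeping a voxel only if it projects inside the mask of every camera. It is a minimal voxel-based sketch, not the method of the specification (which permits any existing method). The name `carve_visual_hull` is hypothetical, and each entry of `projections` is assumed to be a callable, derived from the calibration data, that maps a 3D point to integer (row, column) image coordinates.

```python
def carve_visual_hull(voxels, masks, projections):
    """Keep the voxels whose projection falls inside every silhouette mask.

    voxels: iterable of (x, y, z) candidate points.
    masks: list of 2D 0/1 lists, one per camera (1 = silhouette).
    projections: list of callables, one per camera, mapping (x, y, z)
        to integer (row, col). ASSUMPTION: built from calibration data.
    """
    kept = []
    for voxel in voxels:
        inside_all = True
        for mask, project in zip(masks, projections):
            row, col = project(voxel)
            in_image = 0 <= row < len(mask) and 0 <= col < len(mask[0])
            if not in_image or mask[row][col] != 1:
                inside_all = False  # one camera sees background: carve away
                break
        if inside_all:
            kept.append(voxel)
    return kept
```

Because a voxel is discarded as soon as one camera sees background at its projection, a hole in any single mask removes the corresponding part of the model, which is why the masks are integrated before this step.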

The synthesis unit 6 renders the 3DCG model obtained from the generation unit 5 at a virtual viewpoint position specified by user input or the like, thereby synthesizing and outputting a synthesized image. As already described, the synthesized video obtained by outputting this synthesized image for each time frame of the input multi-viewpoint video is free-viewpoint video of the subject at the specified virtual viewpoint position.

The functional units outlined above are described in detail below.

<Calibration unit 7>
It is assumed that a predetermined plane in space (for example, the wall WL or the ground GR in the example of FIG. 1) is captured in the multi-viewpoint image. The calibration unit 7 associates characteristic points of this plane in the image of each viewpoint (for example, intersections of the white lines of a sports court) with points on the plane in the captured real space, and computes calibration data as camera parameters (extrinsic and intrinsic parameters). For example, when the multi-viewpoint image comes from typical sports video, the court dimensions are standardized; using them as prior knowledge (a field model of the court) makes it easy to compute which real-space (world coordinate system) coordinates given by this prior knowledge correspond to the image-plane points detected by existing corner detection, feature point detection, or similar methods.

This camera calibration in the calibration unit 7 can be performed manually or with any existing automatic calibration method. As a manual method, for example, the user selects intersections of white lines on the screen and associates them with a field model measured in advance, from which the camera parameters can be estimated. If the image is distorted, the intrinsic parameters may be estimated first.

When the input multi-viewpoint video is assumed to be captured by fixed cameras, this camera calibration in the calibration unit 7 need only be performed once, at the beginning of the video. That is, the calibration data obtained at a given time may be treated as not changing over time and reused at later times. When moving cameras are assumed, the camera parameters may be computed for each time frame by automatic calibration with any existing method, as mentioned above.

<First extraction unit 1>
The first extraction unit 1 extracts a first mask that represents the shape of the subject (a moving object) in each image of the multi-viewpoint image as a binary mask image with values 0 and 1. That is, the first mask is extracted as a binary mask image in which, at each pixel position of each image, a value of 0 means background (no subject present) and a value of 1 means foreground (subject present). (The opposite convention for which value denotes foreground or background may also be used.) The extracted first mask is processed by the downstream integration unit 4 into the integrated first mask (given, as before integration, in binary mask image format), which is then input to the generation unit 5 and used to generate the 3DCG model shape of the subject.

To obtain the first mask as a binary mask, the first extraction unit 1 can use, for example, the existing background subtraction method. In this technique, video without the subject, or statistical information such as its mean, is registered in advance as background statistics; the difference between the background statistics and the camera image at the target time (that is, the input multi-viewpoint image) is computed and thresholded to extract the subject region. The first mask can also be extracted using a wide range of other existing techniques, such as machine-learning-based extraction of persons and the like (object recognition).
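The background subtraction described above can be sketched as follows, assuming single-channel (grayscale) frames stored as nested lists and the per-pixel mean as the background statistic. The function names and the grayscale simplification are illustrative assumptions; a practical system would typically handle color frames and more robust statistics.

```python
def mean_background(frames):
    """Per-pixel mean of subject-free frames, used as background statistics."""
    count = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[y][x] for frame in frames) / count for x in range(width)]
        for y in range(height)
    ]


def background_subtraction_mask(frame, background, threshold):
    """Return a 0/1 mask: 1 where the frame differs from the background
    by more than the threshold (foreground), 0 elsewhere (background)."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]
```

A subject that stays motionless behind a static occluder produces no difference at the occluded pixels, which is the loss the later integration step is meant to compensate for.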

<Second extraction unit 2>
Separately from the subject mask obtained as the first mask by the first extraction unit 1, the second extraction unit 2 masks, as foreground objects, things that can be stationary foreground objects occluding the subject in each image of the multi-viewpoint image (for example, the holds HL of the sport climbing in FIG. 1), thereby obtaining a second mask. Existing methods can be used to obtain the second mask; for example, a region segmentation technique that divides an image into small regions using color information or the like can be used.

Specifically, for each small region obtained by an existing image segmentation method, statistics such as the mean or median of the color information it contains are compared with those of the predetermined color information of the foreground objects given as prior knowledge, and similar image regions are extracted to obtain the second mask.

The threshold for the similarity determination may be a predetermined value set by the user. Pixels in regions determined to be foreground objects are then assigned 1 (foreground) and pixels in other regions are assigned 0 (background), yielding the second mask as a binary mask image in the same data format as the first mask. The second mask, as foreground object mask data, may include regions that in the actual scene do not appear in front of the subject (that is, do not occlude the subject) and therefore do not become foreground in the sense of the target of the visual hull method, but the second extraction unit 2 does not need to distinguish these when obtaining the second mask. The processing that distinguishes them is realized in the downstream integration unit 4.
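The color comparison described above can be sketched as follows. The sketch assumes that a segmentation step has already produced a map of small-region IDs, uses the mean RGB color as the region statistic, and uses Euclidean distance in RGB space as the similarity measure. The name `second_mask_from_segments` and the choice of distance are illustrative assumptions; the specification leaves the statistic and the measure open.

```python
import math


def second_mask_from_segments(image, segments, reference_color, threshold):
    """Build a 0/1 foreground-object mask from pre-segmented small regions.

    image: 2D list of (r, g, b) tuples.
    segments: 2D list of region IDs from an existing segmentation method.
    reference_color: (r, g, b) of the foreground object (prior knowledge).
    threshold: user-set maximum color distance for a region to match.
    """
    sums = {}  # region ID -> [sum_r, sum_g, sum_b, pixel_count]
    for y, row in enumerate(segments):
        for x, seg in enumerate(row):
            acc = sums.setdefault(seg, [0.0, 0.0, 0.0, 0])
            r, g, b = image[y][x]
            acc[0] += r
            acc[1] += g
            acc[2] += b
            acc[3] += 1
    foreground = set()
    for seg, (r, g, b, n) in sums.items():
        mean_color = (r / n, g / n, b / n)
        if math.dist(mean_color, reference_color) <= threshold:
            foreground.add(seg)
    # 1 for pixels in matching regions, 0 elsewhere (same format as first mask).
    return [[1 if seg in foreground else 0 for seg in row] for row in segments]
```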

Besides the region segmentation above, the second extraction unit 2 can extract foreground objects using a wide range of existing techniques, such as machine-learning-based object recognition, semantic segmentation, and chroma keying. In each case, prior knowledge about the foreground objects, corresponding to the color information in the region segmentation case (for object recognition, information such as the class of object to be recognized), is assumed to be given likewise. Although this specification mainly uses examples involving completely stationary artificial objects, natural objects with slight movement, such as trees, can also be treated as stationary objects by extracting their shape masks using machine-learning-based segmentation or the like. That is, the stationary object extracted in the second mask may be any target that can be judged to be stationary.

As a matter of terminology, the movable target that forms the foreground in the first mask is called the subject, and the target distinct from the subject that forms the foreground in the second mask (a stationary object, including one with slight movement as described above) is called the foreground object.

<Identification unit 3>
The identification unit 3 assigns an ID to (labels) each closed region, in image coordinates, of the second mask output from the second extraction unit 2 as foreground object mask data, thereby obtaining identified second mask data. Existing techniques such as region growing may be used for this labeling.
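The per-closed-region labeling can be sketched with a simple region-growing (breadth-first flood fill) pass. The sketch assumes 4-connectivity, which the specification does not fix, and the function name `label_closed_regions` is illustrative.

```python
from collections import deque


def label_closed_regions(mask):
    """Label each 4-connected region of 1-pixels with its own ID (1, 2, ...).

    Returns (labels, count), where labels has the same shape as mask
    and 0 marks background.
    """
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or labels[y][x] != 0:
                continue
            # New seed pixel: grow the region outward from it.
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels, next_label
```

Two objects that touch in the image receive a single ID from this pass, since labeling in image coordinates alone cannot separate them.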

In one embodiment of the present invention, after labeling each closed region in image coordinates with such an existing technique, the following additional processing may also be performed. Depending on the foreground objects, multiple objects may be connected and share a single closed region in the image. To handle this case, additional processing separates the labels of different foreground objects within the same closed region (to which the existing labeling has assigned a single label), and the labeling result of the additional processing may be output to the integration unit 4 as the final identified second mask. (Alternatively, the additional processing may be omitted and the result of labeling closed regions in image coordinates with the existing technique output to the integration unit 4 as the identified second mask.)

FIG. 3 is a diagram showing a schematic example of the processing by the identification unit 3. The upper part shows the pre-identification second mask F_[pre-identification] output from the second extraction unit 2, and the lower part shows the identified second mask F obtained from it through the processing by the identification unit 3. In the pre-identification second mask F_[pre-identification], the regions shown in black are merely obtained as a foreground object mask, with no identification of individual regions, whereas in the identified second mask F a total of 10 individual regions F₁₀ to F₁₉ have been identified.

The identified second mask F of FIG. 3 is an example of identification with the additional processing in the identification unit 3. The two individual regions F₁₃ and F₁₄ (of which F₁₃ is shown in a light color for distinction) are, as can be seen in the pre-identification second mask F_[pre-identification], connected to each other and form a single closed region in image coordinates; they received the same label from the existing labeling and were then given two different labels by the additional processing. Similarly, the three individual regions F₁₇, F₁₈, and F₁₉ (of which F₁₇ and F₁₈ are shown in a light color for distinction) are connected to each other and form a single closed region in image coordinates; they received the same label from the existing labeling and were then given three different labels by the additional processing.

なお、その他の個別領域F₁₀、F₁₁、F₁₂、F₁₅及びF₁₆に関しては、既存技術のラベリングで画像座標上の閉領域として同一ラベルが付与されたうえでさらに、追加処理を行ったが同一ラベルのままで変化しなかったものである。 For the other individual areas F ₁₀ , F ₁₁ , F ₁₂ , F ₁₅ and F ₁₆ , additional processing was performed after the same label was given as a closed area on the image coordinates by labeling the existing technology. Remains the same label and does not change.

以下、識別部3における追加処理のラベリングに関する第一実施形態と第二実施形態とを説明する。 Hereinafter, the first embodiment and the second embodiment regarding the labeling of the additional processing in the identification unit 3 will be described.

In the first embodiment, the additional labeling can be performed by the following steps (1) to (3).
(1) The visual volume crossing method is performed using the calibration data, i.e. the camera parameters obtained by camera calibration in the calibration unit 7, and the second mask, i.e. the foreground object mask data extracted by the second extraction unit 2, to obtain the three-dimensional shape of the foreground objects. Here, the visual volume crossing method may be applied using the images of all camera viewpoints of the multi-viewpoint image. As with the visual volume crossing method in the generation unit 5 described later, any existing method may be used. Images judged to be of poor quality, for example because foreground objects such as holds are not captured at close range, and images from camera viewpoints whose camera parameters are inaccurate may be excluded from the visual volume crossing method.
(2) The obtained three-dimensional shape is then labeled per connected region by existing techniques, so that different objects are distinguished from one another in three-dimensional space.
(3) The three-dimensional shape is then reprojected onto each camera image plane (the second mask as foreground object mask data), and mask regions belonging to different objects, as distinguished by the labeling in three-dimensional space, are given different labels.
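
Steps (1) to (3) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the patent: a 5×5×5 voxel grid, two orthographic "cameras" (a top view indexed by (x, y) and a side view indexed by (y, z)) standing in for calibrated perspective cameras, and the helper names `carve`, `label_3d` and `reproject_yz`. Two holds that touch in the side view but are apart in the top view end up with different labels in the side view.

```python
from collections import deque
from itertools import product

NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def carve(mask_xy, mask_yz, size):
    """Step (1): keep voxels whose projections lie inside both silhouettes."""
    return {(x, y, z) for x, y, z in product(range(size), repeat=3)
            if (x, y) in mask_xy and (y, z) in mask_yz}

def label_3d(voxels):
    """Step (2): 6-connected component labelling by breadth-first search."""
    labels = {}
    next_label = 0
    for seed in sorted(voxels):
        if seed in labels:
            continue
        next_label += 1
        labels[seed] = next_label
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                n = (x + dx, y + dy, z + dz)
                if n in voxels and n not in labels:
                    labels[n] = next_label
                    queue.append(n)
    return labels

def reproject_yz(labels):
    """Step (3): carry each voxel label back to the side-view pixel (y, z).

    If voxels with different labels hit the same pixel, the last one wins;
    a real implementation would need a visibility rule here.
    """
    return {(y, z): lab for (x, y, z), lab in labels.items()}

# Top view (x, y): the two holds are apart. Side view (y, z): they touch.
mask_xy = ({(x, y) for x in (0, 1) for y in (0, 1)}
           | {(x, y) for x in (3, 4) for y in (2, 3)})
mask_yz = {(y, z) for y in range(4) for z in range(2)}

side_labels = reproject_yz(label_3d(carve(mask_xy, mask_yz, size=5)))
```

In this toy scene the single connected side-view region is split into two labels because the carved voxels form two separate connected components in 3D.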

By steps (1) to (3) above, even when the masks of several different objects are connected and share a single closed region in some foreground object mask data (second mask), an additional labeling result can be obtained in which they are identified as different objects in this two-dimensional mask image, because their separation in three-dimensional space is taken into account.

In the first embodiment, the per-closed-region labeling of the second mask by existing techniques may be omitted as preprocessing. (That is, since unique labels are obtained in step (3), the labeling result may be obtained by steps (1) to (3) alone rather than as additional processing.)

Unlike the first embodiment, which uses the foreground object masks of almost all camera images of the multi-viewpoint image, the second embodiment can also perform labeling using only a small number of camera viewpoints.

In the second embodiment, the results of the region segmentation or machine-learning-based recognition used in the second extraction unit 2 can be used. Specifically, for each small region already obtained by segmentation (or recognition), a representative point (or region) such as its centroid is projected into three-dimensional space. The same operation is performed for two or three cameras. When the distance between the ray projected through a representative point (the ray passing through the optical center of the camera and the representative point) and the ray of a representative point of a foreground object mask in another camera is at or below a certain threshold, identification information is obtained indicating that the two small regions correspond to the same object across the different camera images. This operation makes it possible to give different labels to foreground object masks that are connected in one camera image but clearly separated in another camera, and to output labeled foreground mask data.

FIG. 4 is a diagram showing a schematic example of the second embodiment of the additional labeling in the identification unit 3. Assume that the three-dimensional space being captured contains two holds of the kind described with FIG. 1, a first hold HL1 and a second hold HL2, as objects to be extracted as foreground objects in the second mask. Camera C1 captures holds HL1 and HL2 roughly from the front, so on its image P1 (second mask) they are obtained separately as closed regions R11 and R12 (two single closed regions), whose representative points are p11 and p12, respectively.

Camera C2, on the other hand, captures holds HL1 and HL2 from a strongly oblique direction, so on its image P2 (second mask) they are obtained connected as a single closed region R20, whose representative point is p20. In the example of FIG. 4, the additional labeling result can be obtained by the following (Process 1) to (Process 3).

(Process 1)
Obtaining the following two threshold determination results (a) and (b) yields identification information that the single closed region R20 of camera C2 contains two different foreground objects.
(a) Since the distance between ray C1-p11 of camera C1 and ray C2-p20 of camera C2 is at or below a preset threshold th, region R11 and region R20 (or their representative points p11 and p20) are determined to correspond.
(b) Since the distance between ray C1-p12 of camera C1 and ray C2-p20 of camera C2 is also at or below the same threshold th, region R12 and region R20 (or their representative points p12 and p20) are determined to correspond.
(Here, in "ray C1-p11", for example, C1 denotes the optical center of camera C1, and "ray C1-p11" is the straight line passing through the optical center C1 and the point p11.)
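
The threshold determinations (a) and (b) reduce to the distance between two straight lines in space. The following is a minimal sketch of that test, assuming each ray is already given as an optical center plus a 3D direction vector (deriving the direction from a pixel requires the calibration data and is not shown). The function names `ray_distance` and `corresponds` are hypothetical.

```python
import math

def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def ray_distance(c1, d1, c2, d2):
    """Shortest distance between the lines c1 + s*d1 and c2 + t*d2."""
    n = cross(d1, d2)            # common perpendicular direction
    w = sub(c2, c1)
    if norm(n) < 1e-12:          # parallel lines: point-to-line distance
        return norm(cross(w, d1)) / norm(d1)
    return abs(dot(w, n)) / norm(n)

def corresponds(c1, d1, c2, d2, th):
    """True when the two rays pass within th of each other."""
    return ray_distance(c1, d1, c2, d2) <= th
```

For example, `ray_distance((0, 0, 0), (1, 0, 0), (0, 1, 1), (0, 1, 0))` evaluates to 1.0, so the two rays would be judged to correspond for any threshold th of 1.0 or more.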

Therefore, because (a) and (b) above both hold for the single closed region R20 of camera C2, identification information is obtained that R20 is a connection of the two closed regions R11 and R12 that are properly separated in the other camera C1, that is, that the holds HL1 and HL2, two different foreground objects, are connected in it.

(Process 2)
After this identification information is obtained, representative points p21 and p22 corresponding to holds HL1 and HL2 as two different foreground objects (shown as white circles ○ in image P2 of FIG. 4) are obtained as coordinates in image P2 of camera C2. Specifically, spatial coordinates x1 and x2 (not shown in FIG. 4) are first obtained as representative points of holds HL1 and HL2 in three-dimensional space. The representative point p21 corresponding to the first hold HL1 is then obtained as the intersection of ray C2-x1 with image P2 (image P2 as the projection plane, as is known from the epipolar geometry model), and likewise the representative point p22 corresponding to the second hold HL2 is obtained as the intersection of ray C2-x2 with image P2. If a representative point p21 or p22 obtained as such an intersection does not fall within region R20, a point within R20 near the obtained point may be used as the representative point instead.

The spatial coordinates x1 and x2, the representative points of holds HL1 and HL2 in three-dimensional space, may be obtained as follows. Although not shown in FIG. 4, assume that, besides image P1 of camera C1, holds HL1 and HL2 are also obtained separately in image P3 of yet another camera C3 as closed regions R31 and R32 (with representative points p31 and p32, respectively), and that correspondences with image P1 have been obtained by threshold determination on the distance between rays (the same kind of determination as (a) and (b) above), namely the correspondence between closed regions R11 and R31 and that between closed regions R12 and R32. Then, as is known from the epipolar geometry model, the spatial coordinate x1 can be obtained as the intersection of ray C1-p11 and ray C3-p31, and likewise x2 as the intersection of ray C1-p12 and ray C3-p32. If the rays do not actually intersect exactly because of errors or the like, the midpoint of the positions at which the rays come closest to each other, for example, may be used as the intersection.
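
The fallback described above, taking the midpoint of the positions where two rays come closest, can be sketched as follows. This is a standard closest-point computation for two lines, not code from the patent; rays are assumed to be given as an optical center and a 3D direction, and the name `closest_midpoint` is hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def closest_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment joining lines c1 + s*d1 and c2 + t*d2."""
    w0 = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    s = (b * e - c * d) / denom   # parameter of the closest point on line 1
    t = (a * e - b * d) / denom   # parameter of the closest point on line 2
    q1 = tuple(p + s * v for p, v in zip(c1, d1))
    q2 = tuple(p + t * v for p, v in zip(c2, d2))
    return tuple((u + v) / 2 for u, v in zip(q1, q2))
```

When the two rays intersect exactly, both closest points coincide and the function returns the intersection itself, so the same routine covers both cases in the text.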

(Process 3)
Then, for each pixel position in the single closed region R20 of image P2, the distances to representative points p21 and p22 are calculated, for example. If the distance to p21 is smaller, the pixel position is labeled as corresponding to the first hold HL1; if the distance to p22 is smaller, it is labeled as corresponding to the second hold HL2. The distance may be evaluated using only the Euclidean distance in image coordinates, but a distance in color space or the like may additionally be evaluated, so that a pixel is labeled according to the representative point p21 or p22 whose texture it is judged to resemble and to which it is reasonably close. When evaluating in color space, a small neighborhood of the pixel under evaluation may also be included and evaluated with a histogram or the like. The example of the single closed region R20 in image P2 of FIG. 4 describes the case where two different foreground objects are connected, but labeling can be performed in exactly the same way when three or more are connected.
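
As a minimal sketch of this per-pixel assignment, the snippet below uses only the Euclidean distance in image coordinates; the optional color-space term is left out. The function name `label_region` and the dictionary layout are assumptions for illustration.

```python
import math

def label_region(region_pixels, rep_points):
    """Give each pixel the key of its nearest representative point.

    region_pixels: iterable of (x, y) pixels in the connected region.
    rep_points: dict mapping an object id to its representative (x, y).
    Ties go to the id that comes first in rep_points.
    """
    return {p: min(rep_points, key=lambda k: math.dist(p, rep_points[k]))
            for p in region_pixels}

# A one-row region with two representative points, as for R20 in FIG. 4.
region = [(x, 0) for x in range(6)]
labels = label_region(region, {"HL1": (1, 0), "HL2": (4, 0)})
```

Because `rep_points` can hold any number of entries, the same function handles three or more connected objects without change.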

The second embodiment, described above with the schematic example of FIG. 4, can more generally be realized by the following (Procedure 1) to (Procedure 3). As is known, the calibration data obtained from the calibration unit 7 can be used when performing the epipolar geometry calculations (calculations on rays and on reprojection).

(Procedure 1)
Using the epipolar geometry model between camera images (second masks) of different viewpoints, correspondences between individual closed regions are obtained exhaustively by threshold determination on the distance between the corresponding rays (a ray passing through the optical center of a camera and the representative point of a region; the same applies below). Depending on whether correspondences are duplicated, identification information is obtained for each closed region of each camera image as to whether one, or two or more, foreground objects in three-dimensional space are presumed to correspond to it. A closed region for which no correspondence with a closed region in a camera image of a different viewpoint is obtained may be presumed to contain one foreground object.
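
One possible reading of "whether correspondences are duplicated" is sketched below: a region is presumed to contain as many objects as the largest number of distinct regions it matches in any single other camera, with a default of one. This counting rule, the function name `estimate_object_counts`, and the (camera, region) key format are assumptions; the patent does not spell out the bookkeeping.

```python
from collections import defaultdict

def estimate_object_counts(regions, matches):
    """Estimate how many foreground objects each closed region contains.

    regions: iterable of (camera, region) keys.
    matches: pairs of keys whose rays passed the distance threshold.
    """
    partners = defaultdict(lambda: defaultdict(set))
    for a, b in matches:
        partners[a][b[0]].add(b)   # regions of b's camera matched to a
        partners[b][a[0]].add(a)   # and the reverse direction
    counts = {}
    for key in regions:
        per_camera = [len(s) for s in partners[key].values()]
        counts[key] = max(per_camera, default=1)  # unmatched regions count as 1
    return counts

regions = [("C1", "R11"), ("C1", "R12"), ("C2", "R20")]
matches = [(("C1", "R11"), ("C2", "R20")), (("C1", "R12"), ("C2", "R20"))]
counts = estimate_object_counts(regions, matches)
```

With the FIG. 4 correspondences, R20 matches two regions of camera C1 and is therefore presumed to contain two objects, while R11 and R12 are each presumed to contain one.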

In Procedure 1, correspondences may be sought only between camera images whose viewpoint difference satisfies a certain condition, for example only between camera images whose viewpoint difference is within a certain range.

(Procedure 2)
For a first closed region of a first camera image presumed to correspond to two or more foreground objects, reference is made to a second camera image and a third camera image, different from the first, in which the closed regions corresponding to the first closed region (the same plural number of closed regions in the second and third camera images) have each been presumed to contain one foreground object. Ray intersections under the epipolar geometry model are obtained for the representative points of the corresponding regions in the second and third camera images, and these ray intersections are taken as the representative points (spatial coordinates) of the respective foreground objects that are connected inside the first closed region of the first camera image.

If N closed regions correspond to the first closed region in the second camera image and a different number M of closed regions correspond in the third camera image (N ≠ M; N > M may be assumed without loss of generality), each of the M rays of the third camera image is paired with whichever of the N rays of the second camera image gives the smallest error when the intersection is computed, and the resulting M intersections are taken as the representative points (spatial coordinates) of the foreground objects.

(Procedure 3)
The representative points (spatial coordinates) of the foreground objects obtained above are reprojected to image coordinates of the first camera image. Each pixel of the first closed region of the first camera image is then given the ID of the representative point nearest to that pixel (the distance may use not only the Euclidean distance in image coordinates but also a distance in color space), and this is the labeling result. If a reprojected image coordinate does not fall within the first closed region, a point within the first closed region near the reprojected coordinate may be used as the representative point.
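
The last sentence of Procedure 3 can be sketched as a small helper that moves a reprojected point into the region when it falls outside. Choosing the nearest region pixel is one reasonable interpretation of "near"; the function name `snap_to_region` is hypothetical.

```python
import math

def snap_to_region(point, region_pixels):
    """Return a pixel of the region to use as the representative point.

    point: reprojected (x, y), possibly non-integer.
    region_pixels: set of integer (x, y) pixels of the closed region.
    """
    pixel = (round(point[0]), round(point[1]))
    if pixel in region_pixels:
        return pixel
    # Otherwise take the nearest region pixel; sorting makes ties deterministic.
    return min(sorted(region_pixels), key=lambda p: math.dist(p, point))

region = {(0, 0), (1, 0), (2, 0)}
inside = snap_to_region((1.2, 0.3), region)
outside = snap_to_region((5.2, 0.1), region)
```

Here `inside` is (1, 0) because the rounded point already lies in the region, and `outside` is (2, 0), the region pixel closest to the reprojected coordinate.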

(Process 1) to (Process 3) in the description of FIG. 4 correspond to specific examples of (Procedure 1) to (Procedure 3) above, respectively. (Procedure 2) presupposes that the multi-viewpoint image consists of three or more viewpoints.

<Integration unit 4>
The integration unit 4 uses the first mask, i.e. the subject mask obtained from the first extraction unit 1, and the identified second mask, i.e. the labeled foreground object mask obtained from the identification unit 3, to generate an integrated first mask as the input data for the visual volume crossing method executed by the generation unit 5 downstream of the integration unit 4.

The processing in the integration unit 4 is based on the following consideration. The data input to the visual volume crossing method should contain no noise or the like on the image plane and should represent the shape of the subject well. However, as described above as the problem to be solved, regions occluded by a stationary object are missing from the first mask obtained by the first extraction unit 1, so processing to fill those regions is needed. Therefore, only a part of the identified second mask obtained by the identification unit 3 (only the part considered the minimum necessary to fill the missing regions) is added to the first mask, so that these desirable properties of the input data are preserved as far as possible.

Specifically, in the integration unit 4, let M be the first mask, M_i a mask closed region contained in the first mask M (i is an index assigned to each closed region), F the identified second mask, and F_l a closed region contained in the second mask F (l is the label value obtained as the identification result). According to equations (1) and (2), it is determined for each mask closed region M_i whether each closed region F_l of the second mask is to be added; the regions so determined are added, updating region M_i to region M_i'; and, as in equation (3), the sum of the updated regions M_i' is output as a new mask image M', i.e. the integrated first mask.

That is, a closed region F_l satisfying the condition "TH1 > |M_i ∩ F_l| > TH2" given for the label value l in equation (2) is added to the closed region M_i as in equation (1) (the union is taken of the image regions regarded as sets), which updates M_i to M_i'. As in equation (3), the whole of the updated closed regions M_i' gives a new mask image M' that integrates the original mask image M. In equation (2), which contains the condition that an element l must satisfy to belong to the set L, |·| (the absolute-value symbol) is an operator returning the number of pixels in its argument image region (the same applies to equation (4) below), and TH1 and TH2 are predetermined upper and lower thresholds on this pixel count, respectively.
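
Equations (1) to (3) themselves do not appear in this text. Judging only from the description above, they presumably have the following form, written here with set unions; this is a reconstruction under that assumption, not the patent's own typesetting. The set L is determined separately for each M_i.

```latex
\begin{align}
M_i' &= M_i \cup \bigcup_{l \in L} F_l \tag{1} \\
L &= \{\, l \mid TH1 > |M_i \cap F_l| > TH2 \,\} \tag{2} \\
M' &= \bigcup_i M_i' \tag{3}
\end{align}
```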

That is, as in equation (2), the closed region F_l is added to the closed region M_i as in equation (1) when the pixel count |M_i ∩ F_l| of the overlap M_i ∩ F_l between M_i and F_l is larger than the lower threshold TH2 and smaller than the upper threshold TH1. The thresholds TH1 and TH2 may be given fixed values by user setting, or, as in equations (4-1) and (4-2) below, each may be given as the smaller of the pixel count |F_l| of the closed region F_l and the pixel count |M_i| of the closed region M_i multiplied by a fixed ratio r1 or r2 (0 < r2 < r1 < 1), respectively.
TH1 = min{r1*|F_l|, r1*|M_i|} …(4-1)
TH2 = min{r2*|F_l|, r2*|M_i|} …(4-2)
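
The selection and merging rule can be sketched with pixel sets. The ratios r1 = 0.9 and r2 = 0.05, the function name `integrate`, and the toy one-row masks are illustrative assumptions; the text only requires 0 < r2 < r1 < 1.

```python
def integrate(first_regions, second_regions, r1=0.9, r2=0.05):
    """Merge selected second-mask regions F_l into the first-mask regions M_i.

    first_regions, second_regions: dicts mapping an index or label to a
    set of (x, y) pixels. Returns the integrated mask M' as one pixel set.
    """
    merged = set()
    for m in first_regions.values():
        updated = set(m)
        for f in second_regions.values():
            overlap = len(m & f)                   # |M_i ∩ F_l|
            th1 = min(r1 * len(f), r1 * len(m))    # equation (4-1)
            th2 = min(r2 * len(f), r2 * len(m))    # equation (4-2)
            if th1 > overlap > th2:                # condition in equation (2)
                updated |= f                       # equation (1)
        merged |= updated                          # equation (3)
    return merged

M = {1: {(x, 0) for x in range(10)}, 2: {(x, 0) for x in range(14, 18)}}
F = {1: {(x, 0) for x in range(8, 14)}, 2: {(x, 5) for x in range(3)}}
M_new = integrate(M, F)
```

In this toy case F[1] overlaps M[1] by two pixels and is merged, bridging the gap between M[1] and M[2], while F[2] has no overlap with either region and is left out.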

In equation (2), the lower threshold TH2 excludes from addition a closed region F_l whose overlap M_i ∩ F_l is so small that it is considered to overlap only by chance due to noise or the like. The upper threshold TH1 excludes from addition a closed region F_l that completely contains the closed region M_i (the set inclusion M_i ⊂ F_l holds) or nearly does so, for example when a foreground region lies behind the subject and overlaps it completely or almost completely.

Instead of equation (2), which uses both the upper threshold TH1 and the lower threshold TH2, only one of them may be used, as in equation (2-1) or (2-2) below.
L = {l | TH1 > |M_i ∩ F_l|} …(2-1)
L = {l | |M_i ∩ F_l| > TH2} …(2-2)

The identification of each closed region M_i in the first mask M may be performed when the first extraction unit 1 extracts the first mask. For this, a technique similar to the existing labeling techniques described for the second extraction unit 2 (such as region growing in image coordinates) may be used.

FIG. 5 is a diagram showing a schematic example of the integration processing by the integration unit 4, divided into data D1 to D4. Data D1 shows, as an example of the input first mask M, a first mask M consisting of two closed regions M₁ and M₂. This first mask M is an example in which the arm of a player PL as shown in FIG. 1 is occluded by a hold HL, causing a missing portion. Data D2 shows, as an example of the input identified second mask F, an identified second mask F consisting of two identified closed regions F₁ and F₂ (closed regions each corresponding to a hold HL as shown in FIG. 1).

In FIG. 5, data D3 shows, as a schematic example of the condition determination of equation (2), the first mask M of data D1 overlaid with the identified second mask F of data D2, and data D4 shows a schematic example of the integrated first mask M' obtained by performing the addition of equation (1) based on this determination and taking the whole as in equation (3).

The example of data D3 and D4 is one in which the condition of equation (2) holds, as "TH1 > |M₁ ∩ F₁| > TH2", only for the combination of closed region M₁ and closed region F₁, and does not hold for any other combination. That is, equation (2) gives "L = {1}" for closed region M₁ and "L = empty set (none applicable)" for closed region M₂. Accordingly, applying equation (1) updates the two closed regions M₁ and M₂ of the first mask M as in equations (1-1) and (1-2) below, and applying equation (3) to these yields the integrated first mask M' as in equation (3-1) below. This is the example of FIG. 5.
M₁' = M₁ + F₁ …(1-1)
M₂' = M₂ …(1-2)
M' = M₁' + M₂'
   = M + F₁ …(3-1)

In the schematic example of FIG. 5 above, the integration is given by equation (3-1): in obtaining the integrated first mask M', only the closed region F₁ that causes the missing portion in the first mask M is added, and the closed region F₂ that does not cause it is not added.

As already described, the integration unit 4 performs the same processing independently on the first mask and the identified second mask corresponding to the image of each camera viewpoint of the multi-viewpoint image. Because the first and second masks differ from one camera viewpoint to another, the determination result of equation (2) may also differ between viewpoints. For example, at one camera viewpoint the integration may add one or more individual regions of the second mask, while at another camera viewpoint it may add none.

<Generation unit 5>
The generation unit 5 estimates the three-dimensional shape of the subject and generates a 3DCG model. For the specific generation processing, the methods of Patent Documents 1 and 2 or Non-Patent Document 1 cited above, or any other existing visual volume crossing method, may be used. In general, the visual volume crossing method obtains the shape of the 3DCG model by taking the intersection of the visual cones given by the subject shape information of all cameras (in the present embodiment, the integrated first mask obtained by the integration unit 4). Consequently, if even one of the binary masks of the cameras contains a missing region, the shape of the 3DCG model is also missing there. In the present embodiment, however, the integration unit 4 fills in the portions missing due to occlusion by stationary objects as described above, so the generation unit 5 can generate an appropriate 3DCG model from which the influence of the missing portions has been removed.
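
Why a single missing mask region damages the model follows directly from the intersection. The toy sketch below assumes a 3×3×3 voxel grid and two orthographic silhouettes (a top view over (x, y) and a side view over (y, z)) in place of real calibrated cameras; the function name `visual_hull` is hypothetical.

```python
from itertools import product

def visual_hull(mask_xy, mask_yz, size):
    """Keep a voxel only if every view's silhouette contains its projection."""
    return {(x, y, z) for x, y, z in product(range(size), repeat=3)
            if (x, y) in mask_xy and (y, z) in mask_yz}

full_xy = {(x, y) for x in range(3) for y in range(3)}
full_yz = {(y, z) for y in range(3) for z in range(3)}
occluded_xy = full_xy - {(1, 1)}   # one pixel lost to occlusion in one view

complete = visual_hull(full_xy, full_yz, size=3)
damaged = visual_hull(occluded_xy, full_yz, size=3)
```

Removing one pixel from one silhouette deletes the whole column of voxels behind it (`damaged` has 24 voxels against 27 in `complete`), even though the other view is intact. Filling the mask before this step, as the integration unit 4 does, avoids that loss.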

<Synthesis unit 6>
The synthesis unit 6 synthesizes the final image from the virtual viewpoint according to the shape of the 3DCG model obtained by the generation unit 5. Any existing rendering method may be used for this synthesis. According to the position coordinates of the virtual viewpoint given by user input or the like, the color information of the subject's 3D model can be determined using nearby camera textures (the textures of those images in the multi-viewpoint image whose camera viewpoints are close to the virtual viewpoint). For the background and other content that is not the subject and has not been turned into a 3DCG model, a background model prepared in advance can be used (for example, when sports are being shot, a CG model of the sports venue and equipment serving as the background, such as the holds in sport climbing). Superimposing this background model with the above 3DCG model yields the final composite image.
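As a minimal sketch of the texture selection step, the cameras nearest to the virtual viewpoint can be ranked by distance. The text only says that textures from nearby camera viewpoints are used, so the Euclidean distance between positions, the parameter `k`, and the name `nearest_cameras` are assumptions; a real renderer might also weigh viewing direction or visibility.

```python
import math


def nearest_cameras(virtual_pos, camera_positions, k=2):
    """Return the names of the k cameras closest to the virtual viewpoint.

    camera_positions maps a hypothetical camera name to its (x, y, z) position.
    Closeness is plain Euclidean distance, which is an illustrative assumption.
    """
    ranked = sorted(
        camera_positions,
        key=lambda name: math.dist(virtual_pos, camera_positions[name]),
    )
    return ranked[:k]


cameras = {"cam0": (0, 0, 5), "cam1": (5, 0, 0), "cam2": (-5, 0, 0)}
selected = nearest_cameras((4, 0, 1), cameras)
print(selected)
```

The textures of the selected cameras would then be projected onto the subject's 3D model and blended to determine its color.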

As described above, according to one embodiment of the present invention, a free-viewpoint image can be synthesized without impairing the shape of the subject in footage in which the subject interacts with stationary objects, as in sport climbing. Supplementary explanations of individual matters follow.

(1) The integration unit 4 performs the integration process using equations (1) to (3) above. Equation (2) specifies the condition for determining what is to be integrated, that is, which closed regions of the second mask are considered to cause a defect due to shielding in the first mask. More generally, this determination may use the following equation (5).
L = { l | the positional relationship between closed region M_i and closed region F_l satisfies a predetermined condition } …(5)

Equation (2) above is a specific case of equation (5) in which whether the positional relationship satisfies the predetermined condition is determined from the overlap between closed region M_i and closed region F_l. In another embodiment, the condition in equation (5) may be judged to be satisfied when, for example, the distance between closed region M_i and closed region F_l is smaller than a predetermined threshold. This distance may be computed as the distance between representative points (such as the centroids) of the two closed regions, or as the minimum distance between a point on the boundary of closed region M_i and a point on the boundary of closed region F_l. The distance threshold may be a fixed value or, similarly to the overlap threshold of equation (4) above, a fixed proportion of the size of closed region M_i and/or closed region F_l (for example, the length of one side of the rectangle enclosing the region).

The determination in equation (5) may also combine the overlap-based determination and the distance-based determination between closed regions described above.
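The determination of equation (5) can be sketched as a predicate over two pixel sets. This is an illustration under stated assumptions rather than the embodiment itself: regions are represented as sets of (x, y) pixels, the minimum distance is taken over all pixel pairs instead of boundary points only, the two criteria are combined with a logical OR, and the names and default values (`min_overlap`, `ratio`, `satisfies_condition`) are hypothetical.

```python
import math


def bbox_longest_side(region):
    """Longest side, in pixels, of the rectangle enclosing the region."""
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return max(max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


def centroid(region):
    """Representative point of a region: the mean of its pixel coordinates."""
    n = len(region)
    return (sum(x for x, _ in region) / n, sum(y for _, y in region) / n)


def min_pixel_distance(region_a, region_b):
    """Smallest distance between any pixel of region_a and any pixel of region_b."""
    return min(math.dist(p, q) for p in region_a for q in region_b)


def satisfies_condition(m_i, f_l, min_overlap=1, ratio=0.5, use_centroid=False):
    """True if F_l overlaps M_i enough OR lies close enough to it (assumed OR)."""
    overlap = len(m_i & f_l)
    if use_centroid:
        distance = math.dist(centroid(m_i), centroid(f_l))
    else:
        distance = min_pixel_distance(m_i, f_l)
    # Threshold scales with the size of M_i, in the spirit of equation (4).
    threshold = ratio * bbox_longest_side(m_i)
    return overlap >= min_overlap or distance < threshold


m_i = {(x, y) for x in range(4) for y in range(4)}   # 4x4 subject region
f_near = {(4, y) for y in range(4)}                  # touches M_i on the right
f_far = {(10, y) for y in range(4)}                  # well separated from M_i
f_overlap = {(3, 0), (4, 0)}                         # shares one pixel with M_i

print(satisfies_condition(m_i, f_near), satisfies_condition(m_i, f_far))
```

The set L of equation (5) would then collect every index l for which the predicate holds for the given M_i. Whether the centroid distance or the minimum distance is the better choice depends on how elongated the regions are; the example region `f_near` passes with the minimum distance but not with the centroid distance.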

(2) FIG. 6 shows an example of the hardware configuration of a general computer device 30 that can implement the image processing device 10. As shown in FIG. 6, the computer device 30 includes a CPU (central processing unit) 101 that executes predetermined instructions; a dedicated processor 102 (a GPU (graphics processing unit), a dedicated deep learning processor, or the like) that executes some or all of the instructions of the CPU 101 in place of or in cooperation with the CPU 101; a RAM 103 as a main storage device that provides a work area for the CPU 101 and the dedicated processor 102; a ROM 104 as an auxiliary storage device; a communication IF (interface) 201; a display 202; and a bus B for exchanging data among them.

Each functional unit of the image processing device 10 shown in FIG. 2 can be implemented by the CPU 101 and/or the dedicated processor 102 reading from the ROM 104 and executing a predetermined program corresponding to the processing of that functional unit. When communication-related processing for sending and receiving data over a network is performed, the communication IF 201 also operates in conjunction; when display-related processing is performed, the display 202 also operates in conjunction. For example, the multi-viewpoint image as input may be received via the communication IF 201, and the composite image as output may be transmitted via the communication IF 201. The composite image obtained by the synthesis unit 6 may be displayed on the display 202.

The image processing device 10 may be implemented on a single computer device 30, or as a system in which two or more computer devices 30 that communicate with each other over a network each take charge of one or more of the functional units of the image processing device 10 shown in FIG. 2.

10 ... Image processing device, 1 ... First extraction unit, 2 ... Second extraction unit, 3 ... Identification unit, 4 ... Integration unit, 5 ... Generation unit, 6 ... Synthesis unit

Claims

An image processing device comprising:
a first extraction unit that extracts the region of a subject as a first mask from each image of a multi-viewpoint image;
a second extraction unit that extracts, as a second mask, the region of a foreground object, which is an object different from the subject, from each image of the multi-viewpoint image;
an identification unit that identifies individual regions in the second mask;
an integration unit that obtains an integrated first mask by adding, to the first mask, those identified individual regions of the second mask that are determined, based on a positional relationship, to cause a defect due to shielding in the first mask; and
a generation unit that generates a three-dimensional model of the subject using the integrated first mask corresponding to each image of the multi-viewpoint image,
wherein the integration unit adds to the first mask, as regions causing the defect, those identified individual regions of the second mask whose overlap with the first mask has a size satisfying a predetermined condition.

The image processing device according to claim 1, wherein the predetermined condition includes that the size of the overlap is larger than a lower threshold.

The image processing device according to claim 2, wherein the predetermined condition includes that the size of the overlap is smaller than an upper threshold.

The image processing device according to any one of claims 1 to 3, wherein the identification unit generates a three-dimensional model of the foreground object by applying the visual volume crossing method to the second masks corresponding to the images of the multi-viewpoint image, labels the three-dimensional model, and reprojects the labeled three-dimensional model onto the second mask to identify the individual regions in the second mask.

An image processing device comprising:
a first extraction unit that extracts the region of a subject as a first mask from each image of a multi-viewpoint image;
a second extraction unit that extracts, as a second mask, the region of a foreground object, which is an object different from the subject, from each image of the multi-viewpoint image;
an identification unit that identifies individual regions in the second mask;
an integration unit that obtains an integrated first mask by adding, to the first mask, those identified individual regions of the second mask that are determined, based on a positional relationship, to cause a defect due to shielding in the first mask; and
a generation unit that generates a three-dimensional model of the subject using the integrated first mask corresponding to each image of the multi-viewpoint image,
wherein the identification unit:
identifies individual closed regions in image coordinates in the second mask corresponding to each image of the multi-viewpoint image;
obtains, using an epipolar geometry model, the correspondence of the individual closed regions between the second masks corresponding to images of different camera viewpoints in the multi-viewpoint image;
when, in the obtained correspondence, a plurality of closed regions at each of a second camera viewpoint and a third camera viewpoint correspond to one closed region at a first camera viewpoint, obtains the spatial coordinates corresponding to each of the plurality of closed regions; and
identifies the one closed region as a plurality of different individual regions in image coordinates based on the obtained plurality of spatial coordinates.

The image processing device according to any one of claims 1 to 5, wherein the first extraction unit extracts the first mask by applying a background subtraction method or object recognition.

The image processing device according to any one of claims 1 to 6, wherein the second extraction unit extracts the second mask by applying region segmentation or object recognition that uses prior knowledge given about the foreground object.

The image processing device according to any one of claims 1 to 7, further comprising a synthesis unit that synthesizes a free-viewpoint image corresponding to the multi-viewpoint image by rendering the three-dimensional model of the subject from a designated virtual viewpoint.

A program that causes a computer to function as the image processing device according to any one of claims 1 to 8.