JP2023026244A

JP2023026244A - Image generation apparatus, image generation method, and program

Info

Publication number: JP2023026244A
Application number: JP2021132065A
Authority: JP
Inventors: 秀藤田; Shu Fujita
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-08-13
Filing date: 2021-08-13
Publication date: 2023-02-24

Abstract

To provide an image generation apparatus that suppresses degradation in the quality of a virtual viewpoint image. SOLUTION: An image generation apparatus selects either three-dimensional shape information of a subject generated based on silhouette information of the subject, which is obtained from a plurality of captured images of the subject captured by a plurality of imaging devices, or three-dimensional shape information of the subject generated based on a posture of the subject estimated from the plurality of captured images, and generates a virtual viewpoint image using the selected three-dimensional shape information. SELECTED DRAWING: Figure 1

Description

本開示は、画像生成装置および画像生成方法、プログラムに関する。 The present disclosure relates to an image generation device, an image generation method, and a program.

異なる位置に配置された複数の撮像装置により得られた複数の画像（複数視点画像）を用いて、ユーザにより指定された仮想視点からの仮想視点画像を生成する技術が注目されている。複数視点画像から仮想視点コンテンツを生成する技術によれば、例えば、サッカーやバスケットボールのハイライトシーンを様々な角度から視聴することが出来るため、通常の画像と比較してユーザに高臨場感を与えることが出来る。 Techniques that use a plurality of images (multi-viewpoint images) obtained by a plurality of imaging devices arranged at different positions to generate a virtual viewpoint image from a virtual viewpoint designated by a user have attracted attention. A technique that generates virtual viewpoint content from multi-viewpoint images allows, for example, highlight scenes of soccer or basketball to be viewed from various angles, and can therefore give the user a greater sense of presence than ordinary images.

特許文献１には、仮想視点画像を生成する方法として、モデルを形成する３次元空間を点やボクセル空間として捉え、被写体の３次元形状の取得と色付けを複数視点画像から得られた被写体のシルエット情報に基づいて行う方法が記載されている。また、特許文献２には、仮想視点画像を生成する方法として、３次元オブジェクトテンプレートモデルを複数視点画像から得られた被写体の姿勢情報に基づいて調整することで３次元形状の取得と色付けを行う方法が記載されている。 Patent Document 1 describes, as a method for generating a virtual viewpoint image, a method in which the three-dimensional space in which a model is formed is treated as a point or voxel space, and the three-dimensional shape of a subject is acquired and colored based on silhouette information of the subject obtained from multi-viewpoint images. Patent Document 2 describes, as a method for generating a virtual viewpoint image, a method in which a three-dimensional shape is acquired and colored by adjusting a three-dimensional object template model based on posture information of a subject obtained from multi-viewpoint images.

特開２０２０－０４２６６６号公報 JP 2020-042666 A
特開２０１６－１２６４２５号公報 JP 2016-126425 A

シルエット情報に基づいて３次元形状を取得する方法と、姿勢情報に基づいて３次元形状を取得する方法とでは、撮像領域や被写体の状態が３次元形状の推定の精度に及ぼす影響は異なる。シルエット情報に基づく方法では、被写体のシルエットに即した形状を生成できるが、例えばその被写体を撮像できるカメラの数が少ない場合や、その被写体の付近に被写体が密集している場合、３次元形状を正しく推定できなくなる可能性が高い。そのため、これらの状況下では、３次元形状の推定の精度が落ちてしまう。一方、姿勢情報に基づく方法では、上述した状況においても比較的高精度に３次元形状を推定できるが、被写体の形状が３次元オブジェクトテンプレートモデルと乖離していた場合には３次元形状を正しく推定できなくなる可能性が高い。３次元形状の推定の精度を維持することは仮想視点画像の画質を維持するうえで重要である。 The effect that the imaging region and the state of the subject have on the accuracy of three-dimensional shape estimation differs between the method of acquiring a three-dimensional shape based on silhouette information and the method of acquiring it based on posture information. The silhouette-based method can generate a shape that conforms to the silhouette of the subject, but it is likely to fail to estimate the three-dimensional shape correctly when, for example, few cameras can capture the subject or other subjects are densely clustered near it. The accuracy of three-dimensional shape estimation therefore drops in these situations. The posture-based method, on the other hand, can estimate the three-dimensional shape with relatively high accuracy even in those situations, but it is likely to fail when the shape of the subject deviates from the three-dimensional object template model. Maintaining the accuracy of three-dimensional shape estimation is important for maintaining the image quality of virtual viewpoint images.

本開示における一つの態様によれば、仮想視点画像の画質の低下を抑制する技術が提供される。 According to one aspect of the present disclosure, there is provided a technique for suppressing deterioration in image quality of virtual viewpoint images.

本発明の一態様による画像生成装置は以下の構成を備える。すなわち、
複数の撮像装置により被写体を撮像した複数の撮像画像から得られる被写体のシルエット情報に基づいて、前記被写体の３次元形状情報を生成する第１の生成手段と、
前記複数の撮像画像から推定された前記被写体の姿勢に基づいて前記被写体の３次元形状情報を生成する第２の生成手段と、
前記複数の撮像画像に含まれる各被写体について、前記第１の生成手段により生成された３次元形状情報と前記第２の生成手段により生成された３次元形状情報のうちの一方を、仮想視点画像の生成に用いる３次元形状情報として選択する選択手段と、
前記選択手段により選択された種類の３次元形状情報を用いて、仮想視点画像を生成する生成手段と、を有する。 An image generation device according to one aspect of the present invention has the following configuration. That is, the device includes:
first generation means for generating three-dimensional shape information of a subject based on silhouette information of the subject obtained from a plurality of captured images of the subject captured by a plurality of imaging devices;
second generation means for generating three-dimensional shape information of the subject based on the posture of the subject estimated from the plurality of captured images;
selection means for selecting, for each subject included in the plurality of captured images, one of the three-dimensional shape information generated by the first generation means and the three-dimensional shape information generated by the second generation means as the three-dimensional shape information to be used for generating a virtual viewpoint image; and
generation means for generating a virtual viewpoint image using the type of three-dimensional shape information selected by the selection means.

本開示によれば、仮想視点画像の画質の低下を抑制することができる。 According to the present disclosure, it is possible to suppress deterioration in image quality of a virtual viewpoint image.

第１実施形態による画像処理システムの装置構成と、画像生成装置の機能構成の例を示すブロック図。 FIG. 1 is a block diagram showing an example of the device configuration of an image processing system and an example of the functional configuration of an image generation device according to the first embodiment.
被写体の姿勢を表現する３次元形状情報および姿勢情報を説明する図。 FIG. 2 is a diagram for explaining three-dimensional shape information and posture information expressing the posture of a subject.
撮像領域を複数のカメラで撮像する様子を表す模式図。 FIG. 3 is a schematic diagram showing how an imaging region is imaged by a plurality of cameras.
第１実施形態による画像生成装置のハードウェア構成例を示すブロック図。 FIG. 4 is a block diagram showing a hardware configuration example of the image generation device according to the first embodiment.
第１実施形態による仮想視点画像の生成処理を示すフローチャート。 FIG. 5 is a flowchart showing virtual viewpoint image generation processing according to the first embodiment.
選択部による３次元形状情報の選択を説明する図。 FIG. 6 is a diagram for explaining selection of three-dimensional shape information by a selection unit.
第２実施形態による仮想視点画像の生成処理を示すフローチャート。 FIG. 7 is a flowchart showing virtual viewpoint image generation processing according to the second embodiment.
撮像領域を複数のカメラで撮像する様子を表す模式図。 FIG. 8 is a schematic diagram showing how an imaging region is imaged by a plurality of cameras.

以下、添付図面を参照して実施形態を詳しく説明する。なお、以下の実施形態は特許請求の範囲に係る発明を限定するものではない。実施形態には複数の特徴が記載されているが、これらの複数の特徴の全てが発明に必須のものとは限らず、また、複数の特徴は任意に組み合わせられてもよい。さらに、添付図面においては、同一若しくは同様の構成に同一の参照番号を付し、重複した説明は省略する。 Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. In addition, the following embodiments do not limit the invention according to the scope of claims. Although multiple features are described in the embodiments, not all of these multiple features are essential to the invention, and multiple features may be combined arbitrarily. Furthermore, in the accompanying drawings, the same or similar configurations are denoted by the same reference numerals, and redundant description is omitted.

＜第１実施形態＞
第１実施形態では、仮想視点画像を生成する際、撮像領域中の各領域を撮像することが可能な撮像装置の数に応じて、使用する形状推定の方法を切り替えながら仮想視点画像を生成する実施形態について述べる。本実施形態の画像処理システムは、複数の撮像装置により異なる方向から撮像した複数の撮像画像（複数視点画像）、各撮像装置の状態（位置、姿勢）、指定された仮想視点（仮想視点情報）基づいて、仮想視点からの見えを表す仮想視点画像を生成する。 <First embodiment>
In the first embodiment, a virtual viewpoint image is generated while the shape estimation method is switched according to the number of imaging devices capable of imaging each region within the imaging region. The image processing system of this embodiment generates a virtual viewpoint image representing the view from a virtual viewpoint based on a plurality of captured images (multi-viewpoint images) captured from different directions by a plurality of imaging devices, the state (position, orientation) of each imaging device, and a designated virtual viewpoint (virtual viewpoint information).

複数の撮像装置は、複数の方向から撮像領域を撮像する。撮像領域は、例えば、ラグビーやサッカーが行われる競技場の平面と任意の高さで囲まれた領域である。複数の撮像装置は、このような撮像領域を取り囲むようにして、それぞれ異なる位置と方向に設置され、同期して撮像を行う。なお、撮像装置は撮像領域の全周にわたって設置されなくてもよい。例えば、設置場所の制限等によって、撮像領域の一部の方向にのみ複数の撮像装置が設置されていてもよい。また、配置される撮像装置の数に制限はない。例えば、撮像領域をラグビーの競技場とする場合、競技場の周囲に数十～数百台程度の撮像装置が設置されてもよい。また、望遠カメラと広角カメラなど画角が異なる撮像装置が設置されていてもよい。例えば、望遠カメラを用いれば、高解像度に被写体を撮像できるので、生成される仮想視点画像の解像度も向上する。また、例えば、広角カメラを用いれば、一台のカメラで撮像できる範囲が広いので、カメラ台数を減らすことができる。撮像装置は現実世界の一つの時刻情報で同期され、撮像した動画には毎フレームの画像に撮像時刻情報が付与される。 A plurality of imaging devices capture images of the imaging region from a plurality of directions. The imaging area is, for example, an area surrounded by a plane of a stadium where rugby or soccer is played and an arbitrary height. A plurality of imaging devices are installed in different positions and directions so as to surround such an imaging region, and perform imaging in synchronism. Note that the imaging devices do not have to be installed over the entire circumference of the imaging area. For example, a plurality of imaging devices may be installed only in a part of the imaging area due to restrictions on installation locations. In addition, there is no limit to the number of arranged imaging devices. For example, if the imaging area is a rugby stadium, several tens to hundreds of imaging devices may be installed around the stadium. Imaging devices with different angles of view, such as a telephoto camera and a wide-angle camera, may also be installed. For example, if a telephoto camera is used, the subject can be imaged with high resolution, so the resolution of the generated virtual viewpoint image is also improved. Also, for example, if a wide-angle camera is used, the range that can be captured by one camera is wide, so the number of cameras can be reduced. The imaging device is synchronized with one piece of time information in the real world, and imaging time information is added to each frame image of the captured moving image.

撮像装置の状態とは、撮像装置の位置、姿勢（向き、撮像方向）、焦点距離、光学中心、歪みなどの状態のことをいう。撮像装置の位置、姿勢（向き、撮像方向）は、撮像装置そのもので制御されてもよいし、撮像装置が搭載される雲台により制御されてもよい。以下では、撮像装置の状態を撮像装置のパラメータ（以下、カメラパラメータ）として説明を行う。カメラパラメータには、雲台等の別の装置を制御するためのパラメータが含まれていてもよい。また、撮像装置の位置、姿勢（向き、撮像方向）に関するカメラパラメータは、いわゆる外部パラメータであり、撮像装置の焦点距離、画像中心、歪みに関するカメラパラメータは、いわゆる内部パラメータである。撮像装置の位置や姿勢は一つの原点と直交する３軸を持つ座標系で表現される（以下、世界座標系と呼ぶ）。 The state of the imaging device refers to the state of the imaging device, such as the position, orientation (orientation, imaging direction), focal length, optical center, distortion, and the like. The position and orientation (orientation, imaging direction) of the imaging device may be controlled by the imaging device itself, or may be controlled by a camera platform on which the imaging device is mounted. In the following description, the state of the imaging device will be described as a parameter of the imaging device (hereinafter referred to as a camera parameter). Camera parameters may include parameters for controlling another device such as a camera platform. Camera parameters relating to the position and orientation (orientation, imaging direction) of the imaging device are so-called external parameters, and camera parameters relating to the focal length, image center, and distortion of the imaging device are so-called internal parameters. The position and orientation of the imaging device are represented by a coordinate system having three axes orthogonal to one origin (hereinafter referred to as a world coordinate system).
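
The split between external and internal parameters can be illustrated with a minimal pinhole projection sketch. It is not taken from the disclosure: the class and function names are hypothetical, the convention `x_cam = R * x_world + t` is an assumption, and lens distortion (which the text lists among the internal parameters) is ignored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class CameraParams:
    # External parameters (assumed convention: x_cam = R * x_world + t).
    rotation: Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major
    translation: Vec3
    # Internal parameters (lens distortion is ignored in this sketch).
    fx: float  # focal length in pixels, horizontal
    fy: float  # focal length in pixels, vertical
    cx: float  # image center, horizontal
    cy: float  # image center, vertical


def world_to_camera(cam: CameraParams, point: Vec3) -> Vec3:
    """Transform a world-coordinate point into the camera coordinate system."""
    x, y, z = (
        sum(cam.rotation[i][j] * point[j] for j in range(3)) + cam.translation[i]
        for i in range(3)
    )
    return (x, y, z)


def project(cam: CameraParams, point: Vec3) -> Optional[Tuple[float, float]]:
    """Project a world point to pixel coordinates, or None if it is behind the camera."""
    x, y, z = world_to_camera(cam, point)
    if z <= 0.0:
        return None
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)
```

With an identity rotation and zero translation, a point on the optical axis lands on the image center `(cx, cy)`, which is a quick sanity check for any chosen convention.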

仮想視点画像は、自由視点画像、任意視点画像とも呼ばれる。但し、仮想視点画像は、ユーザが自由に（任意に）指定した視点に対応する画像に限定されない。例えば、複数の候補からユーザが選択した視点に対応する画像なども仮想視点画像に含まれる。また、仮想視点の指定は、ユーザ操作により行われてもよいし、画像解析の結果等に基づいて自動で行われてもよい。また、以下の実施形態では仮想視点画像が静止画である場合を中心に説明するが、仮想視点画像は動画であってもよい。 A virtual viewpoint image is also called a free viewpoint image or an arbitrary viewpoint image. However, the virtual viewpoint image is not limited to the image corresponding to the viewpoint freely (arbitrarily) specified by the user. For example, the virtual viewpoint image includes an image corresponding to a viewpoint selected by the user from a plurality of candidates. Also, the designation of the virtual viewpoint may be performed by a user operation, or may be automatically performed based on the result of image analysis or the like. Also, in the following embodiments, the case where the virtual viewpoint image is a still image will be mainly described, but the virtual viewpoint image may be a moving image.

仮想視点画像の生成に用いられる仮想視点情報は、仮想視点の位置及び向きを示す。具体的には、仮想視点情報は、仮想視点の３次元位置を表すパラメータと、パン、チルト、及びロール方向における仮想視点の向きを表すパラメータとを含む。なお、仮想視点情報の内容は上記に限定されない。例えば、仮想視点情報のパラメータには、仮想視点の視野の大きさ（画角）を表すパラメータが含まれてもよい。また、仮想視点情報は複数フレームに関連したパラメータを有していてもよい。つまり、仮想視点情報は、仮想視点画像の動画を構成する複数のフレームにそれぞれ対応するパラメータを有し、連続する複数の時点それぞれにおける仮想視点の位置及び向きを示す情報であってもよい。 The virtual viewpoint information used to generate the virtual viewpoint image indicates the position and orientation of the virtual viewpoint. Specifically, the virtual viewpoint information includes parameters representing the three-dimensional position of the virtual viewpoint and parameters representing the orientation of the virtual viewpoint in the pan, tilt, and roll directions. Note that the content of the virtual viewpoint information is not limited to the above. For example, the parameters of the virtual viewpoint information may include parameters representing the size of the field of view (angle of view) of the virtual viewpoint. Also, the virtual viewpoint information may have parameters associated with multiple frames. In other words, the virtual viewpoint information may be information that has parameters corresponding to a plurality of frames that form a moving image of a virtual viewpoint image, and that indicates the position and orientation of the virtual viewpoint at each of a plurality of successive time points.
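
One possible way to hold the virtual viewpoint information described above is sketched below. The type and field names are hypothetical; the disclosure only states that the information contains a three-dimensional position, pan, tilt, and roll orientation, optionally an angle of view, and optionally one set of parameters per frame.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class VirtualViewpoint:
    """Virtual viewpoint parameters for a single frame."""
    position: Tuple[float, float, float]        # three-dimensional position
    pan_deg: float                              # orientation in the pan direction
    tilt_deg: float                             # orientation in the tilt direction
    roll_deg: float                             # orientation in the roll direction
    angle_of_view_deg: Optional[float] = None   # optional size of the field of view


@dataclass
class VirtualViewpointInfo:
    """One viewpoint per frame; a still image is the single-frame case."""
    frames: List[VirtualViewpoint]

    def at(self, frame_index: int) -> VirtualViewpoint:
        return self.frames[frame_index]

    @property
    def is_still(self) -> bool:
        return len(self.frames) == 1
```

Keeping the per-frame viewpoints in a list makes the still-image and moving-image cases share one representation.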

仮想視点画像は、例えば、以下のような方法で生成される。まず、複数の撮像装置により異なる方向から撮像することで複数の撮像画像（複数視点画像）が取得される。次に、複数視点画像から、人物やボールなどの被写体に対応する前景領域を抽出した前景画像と、前景領域以外の背景領域を抽出した背景画像が取得される。前景画像、背景画像は、テクスチャ情報（色情報など）を有している。前景画像に基づいて、被写体の３次元形状を表す前景モデルと前景モデルに色付けするためのテクスチャ情報とが生成される。また、背景画像に基づいて、競技場などの背景の３次元形状を表す背景モデルに色づけするためのテクスチャ情報が生成される。そして、前景モデルと背景モデルに対してテクスチャ情報をマッピングし、仮想視点情報が示す仮想視点に応じてレンダリングを行うことにより、仮想視点画像が生成される。ただし、仮想視点画像の生成方法はこれに限定されず、前景モデルや背景モデルを用いずに撮像画像の射影変換により仮想視点画像を生成する方法など、種々の方法を用いることができる。 A virtual viewpoint image is generated, for example, by the following method. First, a plurality of captured images (multi-viewpoint images) are obtained by capturing images from different directions using a plurality of imaging devices. Next, a foreground image obtained by extracting a foreground area corresponding to a subject such as a person or a ball, and a background image obtained by extracting a background area other than the foreground area are obtained from the multi-viewpoint image. A foreground image and a background image have texture information (such as color information). A foreground model representing the three-dimensional shape of the subject and texture information for coloring the foreground model are generated based on the foreground image. Also, based on the background image, texture information for coloring the background model representing the three-dimensional shape of the background such as the stadium is generated. A virtual viewpoint image is generated by mapping texture information on the foreground model and the background model and performing rendering according to the virtual viewpoint indicated by the virtual viewpoint information. However, the method of generating a virtual viewpoint image is not limited to this, and various methods such as a method of generating a virtual viewpoint image by projective transformation of a captured image without using a foreground model or a background model can be used.

前景画像は、撮像装置により撮像されて取得された撮像画像から、被写体の領域（前景領域）を抽出した画像である。前景領域として抽出される被写体（前景被写体ともいう）とは、一般に、時系列で同じ方向から撮像を行った場合において動きのある（その位置や形が変化し得る）動的被写体（動体）を指す。例えば、競技において、それが行われるフィールド内にいる選手や審判などの人物が被写体となる。また、球技であれば、人物に加えてボールなども被写体となる。コンサートやエンタテイメントにおいては、歌手、演奏者、パフォーマー、司会者などが被写体となる。なお、前景領域は前景被写体のシルエット情報と解釈でき、以降、本実施形態では、前景画像を前景シルエット情報と呼ぶ。 A foreground image is an image obtained by extracting the region of a subject (foreground region) from a captured image acquired by an imaging device. A subject extracted as a foreground region (also called a foreground subject) generally refers to a dynamic subject (moving object) that moves (whose position or shape can change) when images are captured from the same direction in time series. For example, in a sporting event, people such as players and referees on the field where the event takes place are subjects. In a ball game, the ball is also a subject in addition to the people. In concerts and entertainment, singers, musicians, performers, hosts, and the like are subjects. Note that the foreground region can be interpreted as silhouette information of the foreground subject, and the foreground image is hereinafter referred to as foreground silhouette information in this embodiment.

背景画像とは、少なくとも前景となる被写体とは異なる領域（背景領域）の画像である。具体的には、背景画像は、撮像画像から前景となる被写体を取り除いた状態の画像である。また、背景は、時系列で同じ方向から撮像を行った場合において静止している、又は静止に近い状態が継続している撮像対象物を指す。このような撮像対象物は、例えば、コンサート等のステージ、競技などのイベントを行うスタジアム、球技で使用するゴールなどの構造物、フィールド、などである。ただし、背景は少なくとも前景となる被写体とは異なる領域であり、撮像対象には、被写体と背景の他に、別の物体等が含まれていてもよい。 A background image is an image of an area (background area) different from at least the foreground subject. Specifically, the background image is an image obtained by removing the foreground subject from the captured image. In addition, the background refers to an object to be imaged that is stationary or continues to be nearly stationary when imaged from the same direction in time series. Such imaging targets are, for example, stages of concerts and the like, stadiums where events such as competitions are held, structures such as goals used in ball games, fields, and the like. However, the background is at least a region different from the foreground subject, and the imaging target may include other objects in addition to the subject and background.

［画像処理システムの構成］
第１実施形態の画像処理システムの構成について図面を参照しながら説明する。図１は、第１実施形態による画像処理システムの装置構成の例、および、画像生成装置１の機能構成の例を示すブロック図である。画像処理システムは、画像生成装置１、撮像システム２、操作装置３、表示装置４を備える。画像生成装置１は、撮像システム２、操作装置３、表示装置４に接続される。画像生成装置１は、撮像システム２から撮像画像又は前景シルエット情報、カメラパラメータを取得し、操作装置３から仮想視点情報を取得する。画像生成装置１は、撮像システム２と操作装置３から得られた情報に基づいて、仮想視点画像を生成する。生成した仮想視点画像は、表示装置４へ出力される。撮像システム２は複数の撮像装置を備える。以下、撮像システム２が備える撮像装置をカメラと称する。撮像システム２の各カメラは、カメラを識別するための識別番号を持つ。撮像システム２は、カメラが撮像した画像から前景シルエット情報を抽出する機能など、撮像以外の機能やその機能を実現するためのハードウェア（回路や装置など）を含んでもよい。操作装置３は、仮想視点画像を生成するための仮想視点情報を指定する。仮想視点情報は、例えば、ジョイスティック、ジョグダイヤル、タッチパネル、キーボード、及びマウスなどにより、ユーザ（操作者）から指定される。なお、仮想視点情報の指定はユーザ指定に限定されない。例えば、被写体を認識するなどして、自動的に仮想視点情報が指定されても構わない。表示装置４は、画像生成装置１から仮想視点画像を取得し、それらをディスプレイなどの表示デバイスを用いて出力する。 [Configuration of image processing system]
The configuration of the image processing system of the first embodiment will be described with reference to the drawings. FIG. 1 is a block diagram showing an example of the device configuration of the image processing system and an example of the functional configuration of the image generation device 1 according to the first embodiment. The image processing system includes an image generation device 1, an imaging system 2, an operation device 3, and a display device 4. The image generation device 1 is connected to the imaging system 2, the operation device 3, and the display device 4. The image generation device 1 acquires captured images or foreground silhouette information, together with camera parameters, from the imaging system 2, and acquires virtual viewpoint information from the operation device 3. The image generation device 1 generates a virtual viewpoint image based on the information obtained from the imaging system 2 and the operation device 3. The generated virtual viewpoint image is output to the display device 4. The imaging system 2 includes a plurality of imaging devices. Hereinafter, an imaging device included in the imaging system 2 is referred to as a camera. Each camera of the imaging system 2 has an identification number for identifying the camera. The imaging system 2 may include functions other than imaging, such as a function of extracting foreground silhouette information from an image captured by a camera, and hardware (circuits, devices, etc.) for realizing those functions. The operation device 3 designates virtual viewpoint information for generating a virtual viewpoint image. The virtual viewpoint information is designated by a user (operator) using, for example, a joystick, jog dial, touch panel, keyboard, or mouse. Note that designation of the virtual viewpoint information is not limited to designation by the user. For example, the virtual viewpoint information may be designated automatically, for instance by recognizing a subject. The display device 4 acquires virtual viewpoint images from the image generation device 1 and outputs them on a display unit such as a monitor.

次に、画像生成装置１の機能構成について説明する。画像生成装置１は、情報取得部１０１、第１形状推定部１０２、姿勢推定部１０３、第２形状推定部１０４、選択部１０５、画像生成部１０６を有する。 Next, the functional configuration of the image generation device 1 will be described. The image generation device 1 has an information acquisition unit 101, a first shape estimation unit 102, a posture estimation unit 103, a second shape estimation unit 104, a selection unit 105, and an image generation unit 106.

情報取得部１０１は、撮像システム２の複数のカメラが撮像した複数の撮像画像を取得する。情報取得部１０１は、撮像システム２から取得した撮像画像から前景被写体のシルエット情報（前景シルエット情報）を生成する。なお、撮像システム２が前景シルエット情報を生成してもよい。その場合、情報取得部１０１は、撮像システム２から前景シルエット情報を取得する。さらに、情報取得部１０１は、撮像システム２のカメラパラメータを取得する。なお、情報取得部１０１が、撮像システム２のカメラパラメータを算出するようにしてもよい。例えば、情報取得部１０１は、複数のカメラの撮像画像から対応点を算出し、対応点を各カメラに投影した時の誤差が最小になるように最適化し、各カメラを校正することでカメラパラメータを算出する。カメラの校正には既存のいかなる方法が用いられてもよい。カメラパラメータは、撮像画像に同期して取得されてもよいし、事前準備の段階で取得されてもよいし、必要に応じて撮像画像に非同期で取得されてもよい。さらに、情報取得部１０１は、撮像領域中の各部分領域について、部分領域を撮像可能なカメラ（以下、有効カメラともいう）の数の情報を取得する。例えば、各カメラの位置、姿勢、画角の情報に基づいて、撮像領域を分割する複数の部分領域の各々について、当該部分領域を撮影することが可能なカメラの数を判定する。なお、部分領域ごとの有効カメラの数は選択部１０５に提供される。 The information acquisition unit 101 acquires a plurality of captured images captured by the plurality of cameras of the imaging system 2. The information acquisition unit 101 generates silhouette information of foreground subjects (foreground silhouette information) from the captured images acquired from the imaging system 2. Note that the imaging system 2 may generate the foreground silhouette information instead, in which case the information acquisition unit 101 acquires the foreground silhouette information from the imaging system 2. The information acquisition unit 101 also acquires the camera parameters of the imaging system 2. Alternatively, the information acquisition unit 101 may calculate the camera parameters of the imaging system 2. For example, the information acquisition unit 101 calculates corresponding points from the images captured by the plurality of cameras, optimizes so that the error when the corresponding points are projected onto each camera is minimized, and calibrates each camera, thereby calculating the camera parameters. Any existing method may be used to calibrate the cameras. The camera parameters may be acquired in synchronization with the captured images, may be acquired in an advance preparation stage, or may be acquired asynchronously with the captured images as needed. Furthermore, for each partial region in the imaging region, the information acquisition unit 101 acquires information on the number of cameras capable of imaging that partial region (hereinafter also referred to as effective cameras). For example, based on the position, orientation, and angle of view of each camera, the number of cameras capable of capturing each of the plurality of partial regions into which the imaging region is divided is determined. The number of effective cameras for each partial region is provided to the selection unit 105.
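
The count of effective cameras per partial region could be computed along the following lines. This is a simplified sketch under stated assumptions, not the method of the disclosure: each camera's field of view is approximated by a circular cone, a region is tested only at its center point, occlusion and distance limits are ignored, and all names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Camera:
    camera_id: int
    position: Vec3
    direction: Vec3   # viewing direction (need not be normalized)
    fov_deg: float    # full angle of view, treated here as a circular cone


def sees(cam: Camera, point: Vec3) -> bool:
    """Return True if the point lies inside the camera's viewing cone."""
    to_point = tuple(p - c for p, c in zip(point, cam.position))
    dist = math.sqrt(sum(v * v for v in to_point))
    norm = math.sqrt(sum(v * v for v in cam.direction))
    if dist == 0.0 or norm == 0.0:
        return False
    cos_angle = sum(a * b for a, b in zip(to_point, cam.direction)) / (dist * norm)
    return cos_angle >= math.cos(math.radians(cam.fov_deg / 2.0))


def count_effective_cameras(
    cameras: List[Camera], region_centers: Dict[str, Vec3]
) -> Dict[str, int]:
    """Count, for every partial region, the cameras whose cone contains its center."""
    return {
        region_id: sum(sees(cam, center) for cam in cameras)
        for region_id, center in region_centers.items()
    }
```

A region near the middle of the imaging region is typically covered by more cameras than one near the edge, which is the situation the later discussion of FIG. 3 relies on.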

第１形状推定部１０２は、情報取得部１０１が取得した前景被写体の前景シルエット情報とカメラパラメータに基づいて３次元形状情報を推定する。前景シルエット情報に基づく３次元形状の推定には、例えば、Ｖｉｓｕａｌｈｕｌｌ法やＰｈｏｔｏｈｕｌｌ法などの既知の方法が用いられ得る。以下、第１形状推定部１０２により推定される３次元形状情報を、シルエットベースの３次元形状情報と称する。 The first shape estimation unit 102 estimates three-dimensional shape information based on the foreground silhouette information of the foreground subject acquired by the information acquisition unit 101 and the camera parameters. Known methods such as the Visual Hull method and the Photo Hull method can be used for estimating the three-dimensional shape based on the foreground silhouette information. The three-dimensional shape information estimated by the first shape estimation unit 102 is hereinafter referred to as silhouette-based three-dimensional shape information.
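
The idea behind a silhouette-based (Visual Hull style) estimate can be sketched as point carving: a candidate point is kept only if it projects onto the foreground silhouette in every view. The sketch below is illustrative only. The projection functions, the mask layout, and the rule that a point falling outside an image is discarded are all assumptions, and the names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[int, int]                       # (column, row)
Projector = Callable[[Vec3], Optional[Pixel]]
Mask = Sequence[Sequence[bool]]               # mask[row][col] is True on foreground
View = Tuple[Projector, Mask]


def is_foreground(mask: Mask, pixel: Optional[Pixel]) -> bool:
    """True if the pixel is inside the image and marked as foreground."""
    if pixel is None:
        return False
    col, row = pixel
    return 0 <= row < len(mask) and 0 <= col < len(mask[row]) and bool(mask[row][col])


def carve(points: Sequence[Vec3], views: Sequence[View]) -> List[Vec3]:
    """Keep the candidate points that project onto the silhouette in all views."""
    return [
        p for p in points
        if all(is_foreground(mask, project(p)) for project, mask in views)
    ]
```

Each additional view can only remove points, so the surviving set tightens around the subject as more cameras observe it. With few views, points outside the true shape tend to survive.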

姿勢推定部１０３は、情報取得部１０１で取得した撮像画像（あるいは前景シルエット情報）とカメラパラメータを用いて、前景被写体の姿勢を推定し、姿勢情報を生成する。姿勢情報は、例えば、対象の被写体の骨格を表現するボーンモデルである。姿勢推定には、例えば、深層学習を利用した姿勢推定方法など、既知の方法が用いられ得る。また、姿勢推定部１０３は、姿勢情報から被写体のトラッキング情報を取得する。姿勢情報に基づく被写体のトラッキングは、例えば、ある被写体のボーンモデルの各節点と、１フレーム前の全ての被写体に対応するボーンモデルにおける各節点との、差分が最小となる被写体を探索することで行われる。 The posture estimation unit 103 estimates the posture of a foreground subject using the captured images (or foreground silhouette information) acquired by the information acquisition unit 101 and the camera parameters, and generates posture information. The posture information is, for example, a bone model representing the skeleton of the target subject. A known method, such as a posture estimation method using deep learning, can be used for the posture estimation. The posture estimation unit 103 also acquires tracking information of the subject from the posture information. Tracking of a subject based on posture information is performed, for example, by searching, among the bone models of all subjects one frame earlier, for the subject whose nodes differ least from the nodes of the bone model of the subject in question.
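
The tracking search described above can be sketched as a nearest-match lookup over the previous frame. The disclosure does not define how the node difference is measured; the sum of Euclidean distances between corresponding nodes used here is an assumption, as are the function names and the requirement that all bone models list their nodes in the same order.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
BoneNodes = List[Vec3]  # node positions of one bone model, in a fixed order


def node_difference(a: BoneNodes, b: BoneNodes) -> float:
    """Sum of distances between corresponding nodes (assumed difference measure)."""
    return sum(math.dist(p, q) for p, q in zip(a, b))


def match_to_previous(current: BoneNodes, previous: Dict[int, BoneNodes]) -> int:
    """Return the ID of the previous-frame subject whose bone model is closest."""
    if not previous:
        raise ValueError("no subjects in the previous frame")
    return min(previous, key=lambda sid: node_difference(current, previous[sid]))
```

Matching each subject independently like this can assign two current subjects to the same previous subject. Resolving such conflicts is outside the scope of this sketch.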

ここで、本実施形態による姿勢情報を、図２を用いて説明する。図２は、ある被写体を表現する３次元形状の一例と姿勢情報を示す模式図である。なお、３次元形状及び姿勢情報は、３次元空間上の情報であるが、説明のために図２では２次元の画像に簡略化して示す。被写体２０１は、３次元形状２０２（３次元モデル）及び姿勢情報２０３の元となる被写体である。３次元形状２０２は、点の集合である点群で表現されている。点群は、３次元空間上の点の位置情報(ｘ，ｙ，ｚ)と、点の大きさを示す情報の集合であり、一つの点は例えば一辺の長さがｋである立方体２０４で表現される。姿勢情報２０３は、被写体の構造の主要な節点と、節点間を接続する線より結線されるボーンモデルとして表される。 Here, the posture information according to this embodiment will be described with reference to FIG. 2. FIG. 2 is a schematic diagram showing an example of a three-dimensional shape representing a subject, together with posture information. Although the three-dimensional shape and the posture information are information in a three-dimensional space, they are shown simplified as two-dimensional images in FIG. 2 for the sake of explanation. A subject 201 is the subject from which a three-dimensional shape 202 (three-dimensional model) and posture information 203 are derived. The three-dimensional shape 202 is represented by a point cloud, which is a set of points. The point cloud is a set of position information (x, y, z) of points in the three-dimensional space and information indicating the size of the points, and one point is represented, for example, by a cube 204 whose sides have length k. The posture information 203 is represented as a bone model formed by the main nodes of the subject's structure and the lines connecting those nodes.
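
The two representations in FIG. 2 might be held in data structures like the following. This is a hypothetical sketch: it assumes a single cube edge length `k` shared by all points, treats each stored position as the cube's center, and uses string node names, none of which the disclosure specifies.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PointCloud:
    """Three-dimensional shape as points, each drawn as a cube of edge length k."""
    points: List[Vec3]   # assumed to be cube centers
    k: float             # cube edge length

    def bounding_box(self) -> Tuple[Vec3, Vec3]:
        """Axis-aligned box enclosing every cube in the cloud."""
        half = self.k / 2.0
        lo = tuple(min(p[i] for p in self.points) - half for i in range(3))
        hi = tuple(max(p[i] for p in self.points) + half for i in range(3))
        return lo, hi


@dataclass
class BoneModel:
    """Posture information: named nodes joined by line segments (bones)."""
    nodes: Dict[str, Vec3]
    bones: List[Tuple[str, str]]

    def bone_length(self, index: int) -> float:
        start, end = self.bones[index]
        return math.dist(self.nodes[start], self.nodes[end])
```

The point cloud carries the surface detail of the subject, while the bone model carries only a handful of nodes. This difference in detail underlies the accuracy trade-off between the two estimation methods.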

図１に戻り、第２形状推定部１０４は、姿勢推定部１０３で取得した前景被写体の姿勢情報に基づいて、前景被写体の３次元形状を取得する。なお、姿勢情報に基づく３次元を推定する方法には、例えば、人体の標準テンプレートモデルあるいは事前に３次元スキャンした人体モデルを用意しておき、そのモデルを姿勢情報にフィッティングする方法などが用いられ得る。以下、第２形状推定部１０４により推定される３次元形状情報を、姿勢ベースの３次元形状情報と称する。 Returning to FIG. 1, the second shape estimation unit 104 acquires the three-dimensional shape of the foreground subject based on the posture information of the foreground subject acquired by the posture estimation unit 103. As a method of estimating a three-dimensional shape based on posture information, for example, a method can be used in which a standard template model of the human body, or a human body model three-dimensionally scanned in advance, is prepared and fitted to the posture information. The three-dimensional shape information estimated by the second shape estimation unit 104 is hereinafter referred to as posture-based three-dimensional shape information.

ここで、シルエットベースの３次元形状推定と姿勢ベースの３次元形状推定の精度が、被写体に対する有効カメラの数に関してどう変化するかについて、図３を用いて説明する。図３は、ある撮像領域上に存在する被写体を複数のカメラで撮像する様子を表す模式図である。３次元形状の推定対象となる撮像領域３０１を撮像するように複数のカメラ３０２（カメラ３０２ａ～３０２ｆ）が配置されている。複数のカメラ３０２は、撮像領域３０１の中心を向いている。複数のカメラ３０２による複数の撮像画像からは、複数の前景シルエット情報３０３ａ～３０３ｆが得られる。また、図３において、被写体３０４は撮像領域３０１の中心に存在し、被写体３０５は撮像領域３０１の端に存在している。シルエットベースの３次元形状の推定では、３次元空間上の点を各カメラへ投影し、その点が前景であるかどうかを判定することで３次元形状を得ている。そのため、シルエットベースの３次元形状の推定は、被写体を撮像できるカメラの数が多いほど精度が向上する。図３においてシルエットベースの３次元形状推定を行うと、被写体３０４は精度良く３次元形状を得られる可能性が高い一方で、被写体３０５の３次元形状精度は低くなる可能性が高い。一方、姿勢ベースの３次元形状推定は、姿勢情報、つまり対象の骨格を表現するボーンモデルを基に、事前に用意したモデルを変形させることで３次元形状を得る。姿勢情報は比較的少数の視点で得られることが知られており、姿勢情報に基づく３次元形状推定の精度は、被写体を撮像できるカメラの数に依らず、ほぼ一定となる。したがって、図３において姿勢情報に基づく３次元形状推定を行うと、被写体３０４と被写体３０５の３次元形状の精度に大きな差は生じない可能性が高い。 Here, how the accuracies of silhouette-based 3D shape estimation and attitude-based 3D shape estimation change with respect to the number of effective cameras for the subject will be described with reference to FIG. FIG. 3 is a schematic diagram showing how a plurality of cameras capture images of a subject existing in a certain imaging area. A plurality of cameras 302 (cameras 302a to 302f) are arranged so as to capture images of an imaging region 301 whose three-dimensional shape is to be estimated. A plurality of cameras 302 face the center of the imaging area 301 . A plurality of foreground silhouette information 303a to 303f are obtained from a plurality of captured images by a plurality of cameras 302. FIG. Also, in FIG. 3, a subject 304 exists in the center of the imaging area 301 and a subject 305 exists at the edge of the imaging area 301 . In silhouette-based 3D shape estimation, a 3D shape is obtained by projecting a point in 3D space onto each camera and determining whether the point is in the foreground. Therefore, silhouette-based three-dimensional shape estimation improves in accuracy as the number of cameras capable of capturing images of the subject increases. 
When silhouette-based three-dimensional shape estimation is performed in FIG. 3, the three-dimensional shape of the subject 304 is likely to be obtained with high accuracy, whereas the accuracy of the three-dimensional shape of the subject 305 is likely to be low. Posture-based three-dimensional shape estimation, on the other hand, obtains a three-dimensional shape by deforming a model prepared in advance based on the posture information, that is, a bone model representing the skeleton of the target. Posture information is known to be obtainable from a relatively small number of viewpoints, so the accuracy of three-dimensional shape estimation based on posture information is almost constant regardless of the number of cameras capable of capturing the subject. Therefore, when three-dimensional shape estimation based on posture information is performed in FIG. 3, the accuracy of the three-dimensional shapes of the subject 304 and the subject 305 is unlikely to differ greatly.

次に、シルエットベースの３次元形状推定の精度と姿勢ベースの３次元形状推定の精度が、被写体の状態に応じてどう変化するかを説明する。シルエットベースの３次元形状は、被写体を各カメラで観測したシルエットのままになるように推定されるため、被写体の実際の形状に即した３次元形状を推定できる。一方で、姿勢ベースの３次元形状は、基本となる３次元形状をボーンモデルにフィッティングすることにより推定される。そのため、被写体形状の変化への柔軟な対応が困難であり、例えば、被写体がバットやボールなどを持ったときや帽子の着用などで外観が変化すると、推定精度が低下してしまう可能性がある。 Next, how the accuracy of silhouette-based three-dimensional shape estimation and the accuracy of posture-based three-dimensional shape estimation change according to the state of the subject will be described. A silhouette-based three-dimensional shape is estimated so as to match the silhouettes of the subject observed by each camera, so a three-dimensional shape that conforms to the actual shape of the subject can be estimated. A posture-based three-dimensional shape, on the other hand, is estimated by fitting a base three-dimensional shape to a bone model. It is therefore difficult to respond flexibly to changes in the shape of the subject; for example, the estimation accuracy may drop when the subject's appearance changes because the subject is holding a bat or a ball or wearing a hat.

Returning to FIG. 1, the selection unit 105 selects either the silhouette-based three-dimensional shape information generated by the first shape estimation unit 102 or the posture-based three-dimensional shape information generated by the second shape estimation unit 104 as the three-dimensional shape information used for generating the virtual viewpoint image. The selected three-dimensional shape information is output to the image generation unit 106. The selection unit 105 of the first embodiment selects, for each subject, one of the two types of three-dimensional shape (silhouette-based or posture-based) based on whether the number of cameras capable of imaging the partial area in which the subject exists (effective cameras) is equal to or greater than a certain number. The image generation unit 106 generates a virtual viewpoint image based on the captured images and camera parameters acquired by the information acquisition unit 101, the three-dimensional shape selected by the selection unit 105, and the virtual viewpoint information from the operation device 3.

FIG. 4 is a block diagram showing an example of the hardware configuration of the image generation apparatus 1. The image generation apparatus 1 has a CPU 401, a ROM 402, a RAM 403, an auxiliary storage device 404, a display unit 405, an operation unit 406, a communication I/F 407, a GPU 408, and a bus 409. The CPU 401 implements each function of the image generation apparatus 1 shown in FIG. 1 by controlling the entire image generation apparatus 1 using computer programs and data stored in the ROM 402 or the RAM 403. Note that the image generation apparatus 1 may have one or more pieces of dedicated hardware different from the CPU 401, and the dedicated hardware may execute at least part of the processing otherwise performed by the CPU 401. Examples of dedicated hardware include an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array), and a DSP (digital signal processor). The ROM 402 stores programs and the like that do not need to be changed. The RAM 403 temporarily stores programs and data supplied from the auxiliary storage device 404, data supplied from outside via the communication I/F 407, and the like. The auxiliary storage device 404 is composed of, for example, a hard disk drive, and stores various data such as image data and audio data.

The GPU 408 is a processor dedicated to image processing and can compute efficiently by processing a large amount of image data in parallel. It is therefore effective to use the GPU 408 when processing large-scale data, for example when estimating the three-dimensional shape of a foreground subject or generating a virtual viewpoint image. In this embodiment, the GPU 408 can accordingly be used in addition to the CPU 401 for the processing performed by the first shape estimation unit 102, the second shape estimation unit 104, the image generation unit 106, and the like. This configuration is not essential, however, and the processing of the first shape estimation unit 102, the second shape estimation unit 104, and the image generation unit 106 may be implemented by only one of the CPU 401 and the GPU 408.

The display unit 405 is composed of, for example, a liquid crystal display or LEDs, and displays a GUI (Graphical User Interface) and the like with which the user operates the image generation apparatus 1. The operation unit 406 is composed of, for example, a keyboard, a mouse, a joystick, or a touch panel, and inputs various instructions to the CPU 401 in response to user operations. The CPU 401 operates as a display control unit that controls the display unit 405 and as an operation control unit that controls the operation unit 406.

The communication I/F 407 is used for communication with devices external to the image generation apparatus 1. For example, when the image generation apparatus 1 is connected to an external device by wire, a communication cable is connected to the communication I/F 407. When the image generation apparatus 1 has a function of communicating wirelessly with an external device, the communication I/F 407 includes an antenna. The bus 409 connects the units of the image generation apparatus 1 and transmits information between them.

In this embodiment, the display unit 405 and the operation unit 406 are assumed to exist inside the image generation apparatus 1, but the configuration is not limited to this. At least one of the display unit 405 and the operation unit 406 may exist as a separate device outside the image generation apparatus 1. That is, the operation device 3 and the display device 4 shown in FIG. 1 may be incorporated into the image generation apparatus 1 or may exist as devices external to the image generation apparatus 1.

[Virtual viewpoint image generation processing]
The processing performed by the image generation apparatus 1 will be described with reference to the flowchart shown in FIG. 5.

In step S501, the information acquisition unit 101 acquires, from the imaging system 2, a plurality of captured images captured by the plurality of cameras and camera information for each camera. The camera information includes, for example, the camera parameters (position, orientation, angle of view, and so on) of each of the cameras.

The information acquisition unit 101 also generates foreground silhouette information from the plurality of captured images (multi-viewpoint images) acquired from the imaging system 2. The foreground silhouette information can be generated by a general method such as background subtraction, which calculates the difference between a captured image of the subject and a background image captured in advance when no subject is present, for example before the start of a game. The method of generating the foreground silhouette information is not limited to this, however. For example, the foreground silhouette information may be generated by extracting the region of a subject recognized by a method such as human body recognition. The foreground silhouette information may also be extracted by the imaging system 2 and then acquired by the information acquisition unit 101. In that case, the processing in which the information acquisition unit 101 generates the foreground silhouette information of the subject can be omitted. When the information acquisition unit 101 acquires a foreground image that includes texture information, the foreground silhouette information can be generated by erasing the texture information. In this case, if the foreground silhouette information is handled as 8-bit data, for example, the pixel value of the region where a subject exists is set to 255 and the pixel value of every other region is set to 0. The acquired foreground silhouette information is output to the first shape estimation unit 102 and the posture estimation unit 103. The information acquisition unit 101 also outputs the texture information for the foreground silhouette information to the image generation unit 106.
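The background subtraction and the 8-bit silhouette encoding described above can be sketched as follows. This is a minimal illustration only: grayscale images are represented as nested lists, and the function name `silhouette_from_background` and the `diff_threshold` value are assumptions introduced here, not part of the disclosure.

```python
def silhouette_from_background(frame, background, diff_threshold=30):
    """Return an 8-bit mask: 255 where the frame differs from the background, else 0.

    frame and background are equally sized 2D lists of grayscale values (0 to 255).
    diff_threshold is a hypothetical tuning value, not a value from the description.
    """
    return [
        [255 if abs(f - b) > diff_threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 2x3 images: only the middle column contains a subject.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 200, 12], [10, 180, 10]]
mask = silhouette_from_background(frame, background)
```

In this toy input, the small difference in the right-hand pixel (12 versus 10) stays below the threshold and is treated as background, which is the kind of noise tolerance a practical threshold would need to provide.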

As described above, the information acquisition unit 101 may also calculate the camera parameters. The camera parameters need not be acquired or calculated each time a captured image is acquired; it is sufficient that they be acquired or calculated at least once before shape estimation. The acquired camera parameters are output to the first shape estimation unit 102, the posture estimation unit 103, and the image generation unit 106. Further, the information acquisition unit 101 acquires, from the imaging system 2, information on the number of cameras capable of imaging each partial area in the imaging area (effective cameras). Alternatively, the information acquisition unit 101 may calculate the number of effective cameras for each partial area in the imaging area based on the camera parameters acquired from the imaging system 2. The information on the number of effective cameras for each partial area need not be calculated each time a captured image is acquired; it is sufficient that it be calculated at least once before the method of estimating the three-dimensional shape is selected.
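One way the number of effective cameras for a partial area could be computed from camera parameters is sketched below. The sketch assumes a simplified two-dimensional, top-down camera model (position, viewing direction, and half angle of view) and tests only the center point of the partial area; the names `sees_point` and `count_effective_cameras` and the dictionary keys are hypothetical.

```python
import math


def sees_point(cam_pos, cam_dir, half_fov_deg, point):
    """Return True if the point lies within the camera's horizontal angle of view."""
    dx, dy = point[0] - cam_pos[0], point[1] - cam_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return True  # The point coincides with the camera position.
    cos_angle = (dx * cam_dir[0] + dy * cam_dir[1]) / (dist * math.hypot(*cam_dir))
    return cos_angle >= math.cos(math.radians(half_fov_deg))


def count_effective_cameras(cameras, point):
    """Count the cameras that can image the given point of a partial area."""
    return sum(
        1 for c in cameras if sees_point(c["pos"], c["dir"], c["half_fov_deg"], point)
    )


# Four cameras around the imaging area, all facing its center (0, 0).
cameras = [
    {"pos": (10.0, 0.0), "dir": (-1.0, 0.0), "half_fov_deg": 20.0},
    {"pos": (-10.0, 0.0), "dir": (1.0, 0.0), "half_fov_deg": 20.0},
    {"pos": (0.0, 10.0), "dir": (0.0, -1.0), "half_fov_deg": 20.0},
    {"pos": (0.0, -10.0), "dir": (0.0, 1.0), "half_fov_deg": 20.0},
]
center_count = count_effective_cameras(cameras, (0.0, 0.0))
edge_count = count_effective_cameras(cameras, (8.0, 0.0))
```

With this arrangement, every camera covers the center of the area, while a point near the edge falls outside the angle of view of two of the four cameras, mirroring the situation of the subjects 304 and 305 in FIG. 3.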

In step S502, the first shape estimation unit 102 estimates, based on the foreground silhouette information and camera parameters acquired in step S501, a voxel set constituting the silhouette shapes of all the foreground subjects (silhouette-based three-dimensional shape information). As described above, a well-known method such as the visual hull method or the photo hull method can be used to estimate such a voxel set. The voxel size may be set in advance by the user through a GUI (Graphical User Interface), or may be set using a text file or the like. The first shape estimation unit 102 also calculates visibility information indicating whether each estimated point is visible from each camera.
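The idea behind the visual hull method can be sketched as follows. For brevity the sketch assumes orthographic views along the coordinate axes in place of projection with real camera parameters, and the names `is_foreground` and `carve_visual_hull` are hypothetical; it is an illustration of the principle, not the estimation actually performed by the first shape estimation unit 102.

```python
def is_foreground(mask, axis, voxel):
    """Project a voxel orthographically along the given axis and test the silhouette."""
    x, y, z = voxel
    # Dropping the coordinate along the viewing axis gives the pixel position.
    row, col = {"x": (y, z), "y": (x, z), "z": (x, y)}[axis]
    return mask[row][col] == 255


def carve_visual_hull(size, silhouettes):
    """Keep the voxels of a size^3 grid that are foreground in every silhouette.

    silhouettes maps a viewing axis ("x", "y" or "z") to a size x size mask holding
    255 for foreground and 0 for background.
    """
    return {
        (x, y, z)
        for x in range(size)
        for y in range(size)
        for z in range(size)
        if all(is_foreground(m, axis, (x, y, z)) for axis, m in silhouettes.items())
    }


# A single foreground pixel at the center of each 3x3 view.
center = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
hull_three_views = carve_visual_hull(3, {"x": center, "y": center, "z": center})
hull_one_view = carve_visual_hull(3, {"z": center})
```

With three views the hull shrinks to the single voxel (1, 1, 1), whereas a single view leaves a column of three voxels along the viewing axis. This illustrates why the accuracy of the silhouette-based shape depends on the number of cameras that can image the subject.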

In step S503, the posture estimation unit 103 estimates the postures of all the foreground subjects and generates posture information, based on the captured images or the foreground silhouette information acquired in step S501 and on the camera parameters. For the posture estimation, a widely used posture estimation method based on deep learning can be used, for example.

In step S504, the posture estimation unit 103 acquires tracking information for the subjects. In this embodiment, the tracking information is obtained by searching, among all the subjects one frame earlier, for the subject whose posture information differs least from the posture information of the subject of interest in the current frame. Here, the posture information is, for example, information indicating the position of each node of the bone model, and the difference in posture information is the sum of the absolute values of the differences between the positions of corresponding nodes. The tracking may instead be estimated from the difference between the centers of gravity of the bone models in consecutive frames, or may be performed using person recognition based on deep learning or the like. The posture estimation unit 103 assigns identification information (a subject ID) to each tracked foreground subject and supplies it, together with position information indicating the position of the foreground subject, to the first shape estimation unit 102 and the second shape estimation unit 104. The first shape estimation unit 102 uses the position information of the foreground subjects to associate the silhouette-based three-dimensional shape information of each subject with its subject ID.
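The matching by minimum posture difference can be sketched as follows. The sketch assumes that each posture is a list of node positions in the same node order and that the difference is summed per coordinate; the names `pose_difference` and `match_to_previous` and the sample data are hypothetical.

```python
def pose_difference(pose_a, pose_b):
    """Sum of absolute coordinate differences over corresponding bone-model nodes."""
    return sum(
        abs(a - b)
        for node_a, node_b in zip(pose_a, pose_b)
        for a, b in zip(node_a, node_b)
    )


def match_to_previous(current_pose, previous_poses):
    """Return the subject ID from the previous frame whose pose differs least."""
    return min(
        previous_poses,
        key=lambda subject_id: pose_difference(current_pose, previous_poses[subject_id]),
    )


# Two subjects in the previous frame, each with a two-node bone model.
previous_poses = {
    "A": [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0)],
    "B": [(5.0, 0.0, 1.0), (5.0, 0.0, 2.0)],
}
current_pose = [(4.8, 0.1, 1.0), (4.9, 0.0, 2.0)]
matched_id = match_to_previous(current_pose, previous_poses)
```

Here the subject of interest in the current frame is matched to subject "B", whose nodes have moved only slightly since the previous frame.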

In step S505, the second shape estimation unit 104 estimates the posture shape of each foreground subject (posture-based three-dimensional shape information) based on the posture information generated in step S503. For example, the subject is three-dimensionally scanned in advance to acquire a bone model as its three-dimensional model. The second shape estimation unit 104 obtains a three-dimensional shape by fitting the bone model acquired in advance to the posture estimated in step S503. The method of estimating a three-dimensional shape based on posture information is not limited to this. For example, the three-dimensional shape may be estimated by deforming a standard template model of the human body based on the estimated posture. The second shape estimation unit 104 uses the position information of the foreground subjects supplied from the posture estimation unit 103 to associate the posture-based three-dimensional shape information of each subject with its subject ID.

In step S506, the selection unit 105 selects a subject to focus on (hereinafter, the subject of interest) from among all the subjects. In the subsequent steps S507 to S510, the selection unit 105 selects either the silhouette-based three-dimensional shape information or the posture-based three-dimensional shape information as the three-dimensional shape information used for generating the virtual viewpoint image of the subject of interest. In steps S507 to S509, it is determined whether each of conditions 1 to 3 is satisfied, and in step S510, a three-dimensional shape is selected according to the determination results.

In step S507, the selection unit 105 determines whether the number of cameras capable of imaging the partial area in which the subject of interest exists (effective cameras) is less than a threshold (condition 1). The selection unit 105 first identifies the partial area in which the subject of interest exists based on the position information of the foreground subject identified by the subject ID of the subject of interest. The selection unit 105 then acquires the number of effective cameras for the identified partial area from the information acquisition unit 101 and compares it with the threshold. The threshold used here is an index of how far the accuracy of the silhouette shape is to be trusted. The threshold is therefore determined by how many cameras must see the subject for the accuracy of the silhouette shape to be considered sufficient. For example, if the silhouette shape of a subject seen by eight or more cameras is considered sufficiently accurate, the threshold is set to 8.

In step S508, the selection unit 105 determines whether the three-dimensional shape based on posture information was used for the subject of interest one frame earlier (condition 2). In step S509, the selection unit 105 determines whether the subject of interest was outside the angle of view of the virtual viewpoint one frame earlier (condition 3). The angle of view of the virtual viewpoint can be identified based on the virtual viewpoint information acquired from the operation device 3. In step S510, the selection unit 105 selects the three-dimensional shape to be used for the subject of interest in the current frame based on the determination results for conditions 1 to 3 in steps S507 to S509.

Here, the type of three-dimensional shape information (silhouette-based or posture-based) selected according to the combination of determination results for conditions 1 to 3 will be described with reference to FIG. 6. FIG. 6 shows, for every combination of determination results as to whether each of conditions 1, 2, and 3 is satisfied, which of the silhouette-based three-dimensional shape information and the posture-based three-dimensional shape information is selected. In FIG. 6, Y indicates that a condition is satisfied and N indicates that it is not. Accordingly, for condition 1, Y indicates that the number of effective cameras is less than the threshold, and N indicates otherwise. For condition 2, Y indicates that the three-dimensional shape information selected for the subject of interest one frame earlier was posture-based, and N indicates otherwise. For condition 3, Y indicates that the subject of interest one frame earlier was outside the virtual viewpoint image (outside the angle of view of the virtual viewpoint), and N indicates otherwise. Basically, which three-dimensional shape information to select is determined by condition 1, that is, by whether the number of effective cameras imaging the area in which the subject of interest exists is less than the threshold. If the determination were made by condition 1 alone, however, an unnatural virtual viewpoint image could be generated. For example, when a subject continues to be shown in the virtual viewpoint image and moves from an area where the number of effective cameras is equal to or greater than the threshold to an area where it is less than the threshold, the type of three-dimensional shape used switches at that moment. As a result, an unnatural change occurs in the image of the subject of interest in the virtual viewpoint image. In this embodiment, conditions 2 and 3 are therefore taken into account to prevent such unnatural changes.

Suppose that the number of effective cameras capable of imaging the area in which the subject of interest exists is less than the threshold, and that a silhouette-based three-dimensional shape was used for the subject of interest one frame earlier. If posture-based three-dimensional shape information were used for the subject of interest at this point, the type of three-dimensional shape would switch. As a result, an unnatural discontinuity would occur in the image of the subject of interest in the virtual viewpoint image. Condition 2 is therefore used to select the three-dimensional shape information so that the type of three-dimensional shape of the subject of interest does not switch between the previous frame and the current frame, which preserves the continuity of the image of the subject of interest. If the determination were made by conditions 1 and 2 alone, however, the type of shape used in the first frame would continue to be used indefinitely. Condition 3, the determination of whether the subject of interest was within the angle of view of the virtual viewpoint one frame earlier, is therefore also used. If the subject of interest was outside the angle of view of the virtual viewpoint one frame earlier, it did not appear in the virtual viewpoint image. Consequently, even if the subject of interest is within the angle of view of the virtual viewpoint in the current frame and the type of three-dimensional shape to be used differs between the previous frame and the current frame, no unnatural discontinuity occurs in the image of the subject of interest. The switch between the posture-based three-dimensional shape and the silhouette-based three-dimensional shape can thus be made without affecting the image of the subject of interest.

Therefore, in Case 3 of FIG. 6 the three-dimensional shape used for the subject of interest is the posture-based three-dimensional shape, and in Case 4 it is the silhouette-based three-dimensional shape. The same reasoning applies to the other cases in FIG. 6, so their description is omitted.
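The selection logic explained above can be sketched as follows. Because the table of FIG. 6 is not reproduced in this text, the sketch is only one reading of the prose: the type used one frame earlier is kept while the subject was visible in the virtual viewpoint image, and condition 1 decides otherwise. The function name `select_shape`, its parameters, and the returned labels are hypothetical, and the actual table may differ in individual cases.

```python
def select_shape(effective_cameras, threshold, prev_was_posture, prev_outside_view):
    """Choose "silhouette" or "posture" for the subject of interest in the current frame.

    Assumed reading of conditions 1 to 3; not a transcription of FIG. 6.
    """
    too_few_cameras = effective_cameras < threshold  # condition 1
    if prev_outside_view:  # condition 3
        # The subject was not shown one frame earlier, so switching causes no
        # visible discontinuity and condition 1 alone decides.
        return "posture" if too_few_cameras else "silhouette"
    # The subject was shown one frame earlier: keep the same type (condition 2).
    return "posture" if prev_was_posture else "silhouette"


# Few effective cameras, silhouette-based shape used one frame earlier.
kept = select_shape(5, 8, prev_was_posture=False, prev_outside_view=False)
switched = select_shape(5, 8, prev_was_posture=False, prev_outside_view=True)
```

In the first call the silhouette-based shape is kept to avoid a visible switch, and in the second call the switch to the posture-based shape is allowed because the subject was outside the angle of view of the virtual viewpoint one frame earlier.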

Returning to FIG. 5, in step S511 it is checked whether all the subjects have been processed. If not all the subjects have been processed, the processing returns to step S506, and the three-dimensional shape to be used for generating the virtual viewpoint image is selected with the next subject as the subject of interest. When it is determined that all the subjects have been processed, the three-dimensional shape information of each foreground subject is output to the image generation unit 106. In step S512, the image generation unit 106 generates a virtual viewpoint image based on the camera parameters and the texture information of the subjects from the information acquisition unit 101, the three-dimensional shape information from the selection unit 105, and the virtual viewpoint information from the operation device 3. The generated virtual viewpoint image is output to the display device 4.

A method of generating the virtual viewpoint image will now be described. The image generation unit 106 executes processing for generating a foreground virtual viewpoint image (a virtual viewpoint image of the subject region) and processing for generating a background virtual viewpoint image (a virtual viewpoint image of everything other than the subject region). The virtual viewpoint image is then generated by superimposing the foreground virtual viewpoint image on the generated background virtual viewpoint image. The generated virtual viewpoint image is transmitted to and output on the display device 4.

A method of generating the foreground virtual viewpoint image will be described. The foreground virtual viewpoint image can be generated by treating each voxel as a three-dimensional point with coordinates (Xw, Yw, Zw), calculating the color of the voxel, and rendering the colored voxels by an existing CG rendering method. Before calculating the colors, the image generation unit 106 first generates, for each camera of the imaging system 2, a distance image whose pixel values are the distances from the camera to the surface of the three-dimensional shape of the subject. Next, to assign a color to a voxel, the image generation unit 106 transforms the three-dimensional point (Xw, Yw, Zw) into the camera coordinate system of each camera that contains the point within its angle of view. The camera coordinate system is a three-dimensional coordinate system whose origin is the lens center of the camera and which is defined by the lens plane (Xc, Yc) and the lens optical axis (Zc). The image generation unit 106 then transforms the three-dimensional point in the camera coordinate system into the camera image coordinate system, and calculates the distance d from the voxel to the camera and the coordinates (Xi, Yi) on the camera image. The camera image coordinate system is a two-dimensional coordinate system defined on a plane a certain distance in front of the lens plane, in which the Xc and Yc axes of the camera coordinate system are parallel to the Xi and Yi axes of the camera image coordinate system, respectively. The image generation unit 106 calculates the difference between the distance d and the pixel value of the distance image at the coordinates (Xi, Yi) (that is, the distance to the surface), and if the calculated difference is equal to or less than a preset threshold, determines that the voxel is visible from the camera. When the voxel is determined to be visible, the image generation unit 106 uses the pixel value at the coordinates (Xi, Yi) in the image captured by that camera of the imaging system 2 as the color of the voxel. When the voxel is determined to be visible from a plurality of cameras, the image generation unit 106 acquires pixel values from the texture information for the foreground silhouette information obtained from the image captured by each of those cameras and uses, for example, their average value as the color of the voxel. The method of calculating the color is not limited to this, however. For example, instead of the average value, the pixel value of the image captured by the camera of the imaging system 2 closest to the virtual viewpoint may be used. By repeating the same processing for all voxels, colors can be assigned to all the voxels constituting the three-dimensional shape information. The cameras subjected to the visibility determination for each voxel constituting the shape may be all the cameras constituting the imaging system 2, but are not limited to these. For example, the determination may be limited to cameras for which the camera information indicates that the voxel is visible, or to the cameras used for the shape estimation. Doing so shortens the processing time for generating the virtual viewpoint image.
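The coordinate transformations and the visibility-based coloring described above can be sketched as follows. The sketch assumes an ideal pinhole camera given by a rotation matrix, a camera center, and a focal length, and it compares the absolute difference between d and the distance-image value against the threshold; these modeling choices and the names `world_to_camera`, `camera_to_image`, and `voxel_color` are assumptions made for illustration.

```python
import math


def world_to_camera(point_w, rotation, camera_center):
    """Transform a world point (Xw, Yw, Zw) into camera coordinates (Xc, Yc, Zc)."""
    relative = [p - c for p, c in zip(point_w, camera_center)]
    return tuple(sum(r * v for r, v in zip(row, relative)) for row in rotation)


def camera_to_image(point_c, focal_length):
    """Project a camera-space point onto the image plane, giving (Xi, Yi)."""
    xc, yc, zc = point_c
    return (focal_length * xc / zc, focal_length * yc / zc)


def voxel_color(observations, tolerance):
    """Average the colors from the cameras in which the voxel is judged visible.

    observations holds one (d, surface_distance, rgb) tuple per camera, where d is
    the voxel-to-camera distance and surface_distance is the distance-image value
    at the projected coordinates (Xi, Yi).
    """
    visible = [rgb for d, surface, rgb in observations if abs(d - surface) <= tolerance]
    if not visible:
        return None  # Not visible from any camera; this case is left open here.
    return tuple(sum(rgb[i] for rgb in visible) / len(visible) for i in range(3))


# A camera 5 units behind the origin, looking along +Z with no rotation.
identity = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
point_c = world_to_camera((1.0, 0.5, 0.0), identity, (0.0, 0.0, -5.0))
xi, yi = camera_to_image(point_c, focal_length=2.0)
d = math.sqrt(sum(v * v for v in point_c))

# Three cameras observe the voxel; the second one sees a nearer surface instead.
color = voxel_color(
    [(2.00, 2.01, (200, 100, 50)), (3.00, 2.20, (0, 0, 0)), (2.50, 2.52, (100, 100, 150))],
    tolerance=0.05,
)
```

In this example the voxel is occluded in the second camera, so only the colors from the first and third cameras are averaged.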

Next, a method of generating the background virtual viewpoint image will be described. To generate the background virtual viewpoint image, the image generation unit 106 acquires three-dimensional shape information of the background, such as a stadium. A CG model of the stadium and structures, created in advance and stored in the system, is used as the three-dimensional shape information of the background. The image generation unit 106 compares the normal vector of each face constituting the CG model with the direction vector of each camera constituting the imaging system 2, and for each face extracts the camera that contains the face within its angle of view and faces it most directly. The image generation unit 106 then projects the vertex coordinates of each face onto the image captured by the extracted camera, generates a texture image to be applied to the face, and renders the result by an existing texture mapping method to generate the background virtual viewpoint image. The virtual viewpoint image is generated by superimposing the foreground virtual viewpoint image on the background virtual viewpoint image obtained in this way.
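The extraction of the most directly facing camera for a face can be sketched as follows. The sketch assumes that the candidate cameras have already been limited to those containing the face within their angle of view, and it scores each camera by the cosine between the face normal and the reversed viewing direction; the name `most_facing_camera` and the sample vectors are hypothetical.

```python
import math


def most_facing_camera(face_normal, camera_directions):
    """Return the name of the camera whose viewing direction opposes the normal most.

    camera_directions maps a camera name to its viewing direction vector. Only
    cameras that contain the face within their angle of view should be passed in.
    """
    def facing_score(direction):
        dot = sum(n * d for n, d in zip(face_normal, direction))
        norms = math.sqrt(sum(n * n for n in face_normal)) * math.sqrt(
            sum(d * d for d in direction)
        )
        return -dot / norms  # 1.0 when the camera looks straight at the face.

    return max(camera_directions, key=lambda name: facing_score(camera_directions[name]))


# A ground face with an upward normal, seen by three cameras.
ground_normal = (0.0, 0.0, 1.0)
camera_directions = {
    "overhead": (0.0, 0.0, -1.0),
    "oblique": (1.0, 0.0, -1.0),
    "side": (1.0, 0.0, 0.0),
}
best = most_facing_camera(ground_normal, camera_directions)
```

The overhead camera is selected because its viewing direction is exactly opposite to the face normal, so its captured image shows the face with the least foreshortening.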

As described above, according to the first embodiment, a decrease in the accuracy of estimating the three-dimensional shape of a subject can be suppressed, and as a result, the image quality of the virtual viewpoint image can be improved.

<Second embodiment>
In the first embodiment, the virtual viewpoint image is generated by switching the shape estimation method (the type of three-dimensional shape information) according to the number of effective cameras for each area in the imaging area. In the second embodiment, the virtual viewpoint image is generated while switching the shape estimation method according to the degree of density of subjects in each area in the imaging area. The device configuration of the image processing system, the functional configuration of the image generation device, and the hardware configuration of the image generation device in the second embodiment are the same as those in the first embodiment (FIGS. 1 and 4).

How the accuracy of the silhouette-based three-dimensional shape and the accuracy of the posture-based three-dimensional shape change with the degree of density of subjects will be described with reference to FIG. 8. FIG. 8 is a schematic diagram showing a state in which a plurality of subjects are close together in a small area of the imaging area. A camera 802 captures an imaging area 801 that is the target of three-dimensional shape estimation. Foreground silhouette information 803 is generated from the image captured by the camera 802. FIG. 8 shows a plurality of subjects 804 crowded into a small area of the imaging area 801. In this case, the subjects overlap one another in the image captured by the camera 802, so the silhouette of each individual subject cannot be obtained correctly from the foreground silhouette information 803. The accuracy of the silhouette-based three-dimensional shape therefore decreases as the subjects become more crowded. On the other hand, it is known that posture estimation methods using deep learning obtain posture information from captured images with relatively high accuracy regardless of the number of effective cameras and the degree of density of subjects. The posture-based three-dimensional shape can therefore maintain a certain level of accuracy even when the subjects are crowded. Accordingly, selecting the type of three-dimensional shape according to the degree of density of subjects can be expected to prevent degradation of the image quality of the virtual viewpoint image.

[Virtual viewpoint image generation processing]
Processing performed by the image generation device 1 according to the second embodiment will be described with reference to the flowchart shown in FIG. 7. Steps that perform the same processing as in the first embodiment (FIG. 5) are given the same step numbers, and their detailed description is omitted. The second embodiment differs from the first embodiment in that the degree of density of subjects in the area where the subject of interest exists is used as condition 1.

In step S707, as condition 1, the selection unit 105 determines whether the partial area in which the subject of interest selected in step S506 exists is a subject-dense area, that is, an area in which the degree of subject density exceeds a predetermined threshold. In the second embodiment, each of the partial areas obtained by dividing the imaging area into n parts is judged to be a subject-dense area or not according to whether it contains a number of bone models equal to or greater than a threshold. As described in the first embodiment, the posture estimation unit 103 assigns position information and a subject ID to every foreground subject. The partial area in which the subject of interest exists can therefore be identified, for example, from the position information of the foreground subject selected as the subject of interest. The number of bone models in the identified partial area can be obtained, for example, by counting the foreground subjects located in that area based on the position information of all the foreground subjects. The number of divisions of the imaging area should be set appropriately according to the size (area or volume) of the imaging area. For example, the imaging area is divided so that each partial area has an area of about 1 m². The threshold for the number of bone models at which an area is judged to be a subject-dense area is set to 3, for example. Besides the number of bone models in a partial area obtained by dividing the imaging area into n parts, the degree of subject density may be judged from the number of bone models existing within a predetermined distance of the subject of interest.
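The density determination of step S707 can be sketched as follows. This is a minimal illustration under stated assumptions: subject positions are taken as 2D ground-plane coordinates in metres keyed by subject ID, the partial areas are square cells of 1 m per side (about 1 m² each), and the function names are hypothetical. The threshold of 3 and the "equal to or greater than" comparison follow the example values given above.

```python
import math
from collections import Counter

def cell_of(position, cell_size=1.0):
    """Map a ground-plane position (x, y) in metres to the index of its grid cell."""
    x, y = position
    return (math.floor(x / cell_size), math.floor(y / cell_size))

def is_dense_region(target_id, positions, cell_size=1.0, threshold=3):
    """True if the cell containing the target subject holds at least `threshold` subjects."""
    counts = Counter(cell_of(p, cell_size) for p in positions.values())
    return counts[cell_of(positions[target_id], cell_size)] >= threshold

def neighbours_within(target_id, positions, radius):
    """Alternative measure: number of other subjects within `radius` of the target."""
    tx, ty = positions[target_id]
    return sum(
        1
        for sid, (x, y) in positions.items()
        if sid != target_id and math.hypot(x - tx, y - ty) <= radius
    )

# Subjects 1 to 3 share one cell; subject 4 stands alone.
positions = {1: (0.2, 0.3), 2: (0.8, 0.9), 3: (0.5, 0.5), 4: (5.0, 5.0)}
print(is_dense_region(1, positions), is_dense_region(4, positions))
```

The grid version depends on where the cell boundaries fall, whereas the distance-based count is centred on the subject of interest, which is one reason the latter may be preferred when subjects cluster across a cell boundary.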

In step S710, the selection unit 105 determines the type of three-dimensional shape to use for the subject of interest based on the determination results for conditions 1 to 3 in steps S707, S508, and S509. The determination in step S710 corresponds to FIG. 6 with condition 1 replaced by the condition of step S707 (where "Y" indicates that the area was determined to be a subject-dense area).

As described above, according to the second embodiment, a decrease in the accuracy of estimating the three-dimensional shape can be suppressed for each subject located in an area where subjects are crowded. As a result, the image quality of the virtual viewpoint image can be improved.

In each of the embodiments described above, the three-dimensional shape information is selected based on the determination results for conditions 1 to 3, but the selection is not limited to this. For example, the three-dimensional shape information may be selected using condition 1 alone. Alternatively, instead of or in addition to conditions 1 to 3, the silhouette-based three-dimensional shape information may be selected when the difference between the three-dimensional shape estimated by the second shape estimation unit 104 and the three-dimensional shape estimated by the first shape estimation unit 102 is large. As described above, the estimation accuracy of the posture-based three-dimensional shape information decreases when the shape deviates greatly from the standard bone model. The selection unit 105 therefore quantifies the difference between the three-dimensional shape estimated by the first shape estimation unit 102 and the three-dimensional shape estimated by the second shape estimation unit 104, and selects the silhouette-based three-dimensional shape when the quantified difference exceeds a threshold. This reduces the loss of estimation accuracy caused by the shape of the subject of interest deviating from the standard. The difference can be quantified, for example, by calculating the size (volume) of the discrepancy between the two three-dimensional shapes.
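One way to quantify the difference as a volume can be sketched as follows. This is a minimal illustration under stated assumptions: both three-dimensional shapes are taken to be voxelized on the same grid and represented as sets of integer voxel indices, and the volume of the discrepancy is taken to be the volume of the voxels occupied by exactly one of the two shapes. The function names and the set representation are introduced only for the example.

```python
def shape_difference_volume(voxels_a, voxels_b, voxel_size):
    """Volume occupied by exactly one of the two voxelized shapes.

    voxels_a, voxels_b: sets of (i, j, k) voxel indices on a shared grid (assumed).
    voxel_size: edge length of one voxel, in metres.
    """
    mismatched = voxels_a ^ voxels_b  # symmetric difference of the two sets
    return len(mismatched) * voxel_size ** 3

def prefer_silhouette(silhouette_voxels, pose_voxels, voxel_size, threshold):
    """True when the quantified difference exceeds the threshold."""
    return shape_difference_volume(silhouette_voxels, pose_voxels, voxel_size) > threshold

silhouette = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
pose = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
print(shape_difference_volume(silhouette, pose, voxel_size=0.5))
```

When `prefer_silhouette` returns `True`, the silhouette-based three-dimensional shape would be selected, since a large discrepancy suggests that the subject deviates from the standard bone model.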

In each of the embodiments described above, the type of three-dimensional shape can be changed when the subject of interest in the immediately preceding frame is outside the angle of view of the virtual viewpoint (the type of three-dimensional shape information is maintained while the subject of interest in the immediately preceding frame is within the angle of view of the virtual viewpoint), but this is not the only possibility. For example, the type of three-dimensional shape may become changeable when the size of the image of the subject drawn in the virtual viewpoint image of the immediately preceding frame is smaller than a predetermined size. In this case, the type of three-dimensional shape information is maintained while the size of the image of the subject drawn in the virtual viewpoint image of the immediately preceding frame is equal to or larger than the predetermined size. If the subject drawn in the virtual viewpoint image is small, a change in the type of three-dimensional shape information does not look noticeably unnatural. Alternatively, the type of three-dimensional shape information may become changeable when the portion of the subject of interest drawn in the virtual viewpoint image of the immediately preceding frame is less than a predetermined proportion of the whole subject. In this case, the type of three-dimensional shape information is maintained while the predetermined proportion or more of the subject of interest is drawn in the virtual viewpoint image of the immediately preceding frame.
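The three conditions under which the type of three-dimensional shape may change between frames can be sketched as follows. This is a minimal illustration: the function names, the string labels for the shape types, and the handling of the first frame (where no previous selection exists) are assumptions made for the example and are not specified in the description.

```python
def may_switch_by_view(was_in_view):
    """Switching is allowed only if the subject was outside the previous frame's view."""
    return not was_in_view

def may_switch_by_size(drawn_size, min_size):
    """Switching is allowed only if the subject was drawn smaller than min_size."""
    return drawn_size < min_size

def may_switch_by_ratio(drawn_ratio, min_ratio):
    """Switching is allowed only if less than min_ratio of the subject was drawn."""
    return drawn_ratio < min_ratio

def next_shape_type(previous_type, candidate_type, may_switch):
    """Keep the previous selection unless switching is currently allowed.

    previous_type is None on the first frame (an assumption of this sketch),
    in which case the candidate is adopted directly.
    """
    if previous_type is None or may_switch:
        return candidate_type
    return previous_type

# The subject was visible in the previous frame, so the selection is held.
print(next_shape_type("silhouette", "posture", may_switch_by_view(True)))
```

Holding the selection while the subject is clearly visible avoids an abrupt change in the rendered appearance of the same subject from one frame to the next.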

The present disclosure can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of the system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

The invention is not limited to the embodiments described above, and various modifications and variations are possible without departing from the spirit and scope of the invention. Accordingly, the claims are appended to make public the scope of the invention.

1: image generation device, 2: imaging device, 3: operation device, 4: display device, 101: camera information acquisition unit, 102: first shape estimation unit, 103: posture estimation unit, 104: second shape estimation unit, 105: selection unit, 106: image generation unit

Claims

An image generation device comprising:
a first generation means for generating three-dimensional shape information of a subject based on silhouette information of the subject obtained from a plurality of captured images of the subject captured by a plurality of imaging devices;
a second generation means for generating three-dimensional shape information of the subject based on the posture of the subject estimated from the plurality of captured images;
a selection means for selecting, for each subject included in the plurality of captured images, one of the three-dimensional shape information generated by the first generation means and the three-dimensional shape information generated by the second generation means as the three-dimensional shape information used to generate a virtual viewpoint image; and
an image generation means for generating a virtual viewpoint image using the three-dimensional shape information selected by the selection means.

The image generation device according to claim 1, wherein the selection means performs the selection based on the number of imaging devices, among the plurality of imaging devices, capable of imaging the partial area in which the subject exists, among a plurality of partial areas obtained by dividing an imaging area.

The image generation device according to claim 2, wherein the selection means selects the three-dimensional shape information generated by the first generation means when the number of imaging devices capable of imaging exceeds a threshold, and selects the three-dimensional shape information generated by the second generation means when the number of imaging devices capable of imaging is equal to or less than the threshold.

The image generation device according to any one of claims 1 to 3, wherein the selection means performs the selection based on the degree of density of subjects in the partial area in which the subject exists, among a plurality of partial areas obtained by dividing an imaging area of the plurality of imaging devices.

The image generation device according to claim 4, wherein the selection means selects the three-dimensional shape information generated by the second generation means when the degree of density exceeds a threshold, and selects the three-dimensional shape information generated by the first generation means when the degree of density is equal to or less than the threshold.

The image generation device according to claim 1, wherein the selection means performs the selection based on the number of subjects existing within a predetermined distance range from the subject.

The image generation device according to claim 6, wherein the selection means selects the three-dimensional shape information generated by the second generation means when the number of subjects existing within the range exceeds a threshold, and selects the three-dimensional shape information generated by the first generation means when the number of subjects existing within the range is equal to or less than the threshold.

The image generation device according to claim 1, wherein the selection means quantifies the difference between the three-dimensional shape represented by the three-dimensional shape information generated by the first generation means and the three-dimensional shape represented by the three-dimensional shape information generated by the second generation means, and performs the selection based on the magnitude of the quantified difference.

The image generation device according to claim 8, wherein the selection means selects the three-dimensional shape information generated by the first generation means when the magnitude of the difference exceeds a threshold, and selects the three-dimensional shape information generated by the second generation means when the magnitude of the difference is less than the threshold.

10. The selecting means maintains the selection in the immediately preceding frame while the subject exists within the range of the virtual viewpoint image of the immediately preceding frame generated by the image generating means. The image generation device according to any one of .

The image generation device according to any one of claims 1 to 9, wherein the selection means maintains the selection made in the immediately preceding frame while the size of the image of the subject drawn in the virtual viewpoint image of the immediately preceding frame generated by the image generation means is equal to or larger than a predetermined size.

The image generation device according to any one of claims 1 to 9, wherein the selection means maintains the selection made in the immediately preceding frame while a predetermined proportion or more of the subject is drawn in the virtual viewpoint image of the immediately preceding frame generated by the image generation means.

An image generation method comprising:
a first generation step of generating three-dimensional shape information of a subject based on silhouette information of the subject obtained from a plurality of captured images of the subject captured by a plurality of imaging devices;
a second generation step of generating three-dimensional shape information of the subject based on the posture of the subject estimated from the plurality of captured images;
a selection step of selecting, for each subject included in the plurality of captured images, one of the three-dimensional shape information generated in the first generation step and the three-dimensional shape information generated in the second generation step as the three-dimensional shape information used to generate a virtual viewpoint image; and
an image generation step of generating a virtual viewpoint image using the three-dimensional shape information selected in the selection step.

A program for causing a computer to function as each means of the image generating apparatus according to any one of claims 1 to 12.