JP2023021735A - Image processing device, image processing method, and program

Info

Publication number: JP2023021735A
Application number: JP2021126787A
Authority: JP
Inventors: 大地阿達; Taichi Adachi
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-08-02
Filing date: 2021-08-02
Publication date: 2023-02-14

Abstract

To provide an image processing device capable of acquiring a virtual viewpoint image that expresses a part of an object like an afterimage. SOLUTION: An image processing device acquires object shape data corresponding to each of multiple times in a period of continuous synchronous imaging with multiple imaging devices, and determines a specific portion that is a part of the object. The image processing device projects the object shape data at a reference time among the multiple times, and the shape data of the specific portion of the object at a specific time different from the reference time among the multiple times, to a virtual viewpoint, and generates a virtual viewpoint image. SELECTED DRAWING: Figure 4

Description

The present disclosure relates to a technique for generating a virtual viewpoint image based on a plurality of captured images.

There is a technique in which a plurality of imaging devices are installed at different positions and the plurality of captured images obtained by synchronously imaging a subject (object) are used to generate a virtual viewpoint image representing the view from a virtual viewpoint. As an application of this virtual viewpoint image, Patent Document 1 discloses a method that generates virtual viewpoint images of a specific object viewed from a desired virtual viewpoint at each of several different points in time and superimposes them to express the trajectory of the object.

Patent Document 1: Japanese Patent Application Laid-Open No. 2021-13095

Koichi Ogawara, Xiaolu Li, Katsushi Ikeuchi, "Estimation of human body motion using flexible deformation model with joint structure", Proc. of MIRU2006, pp. 994-999, 2006.

In some cases it is desirable to express only a moving part of an object (for example, only a person's arm) as an afterimage in a virtual viewpoint image. With the method of Patent Document 1, however, the parts other than the arm that do not move are also expressed as afterimages, so a virtual viewpoint image in which only the moving arm appears as an afterimage could not be obtained.

An image processing device according to the present disclosure includes: acquisition means for acquiring shape data of an object corresponding to each of a plurality of times in a period during which images are continuously captured by a plurality of imaging devices; determining means for determining a specific part that is a part of the object; and generating means for generating a virtual viewpoint image corresponding to a virtual viewpoint by using the shape data of the object at a reference time among the plurality of times and the shape data of the specific part of the object at a specific time, among the plurality of times, that differs from the reference time.

According to the technique of the present disclosure, it is possible to obtain a virtual viewpoint image in which a part of an object is expressed like an afterimage.

FIG. 1 is a diagram showing an example of the configuration of an image processing system.
FIG. 2 is a diagram showing an example of the hardware configuration of the image processing device.
FIG. 3 is a diagram showing an example of the software configuration of the image processing device.
FIG. 4 is a flowchart showing the flow of virtual viewpoint image generation processing according to Embodiment 1.
FIGS. 5(a) and 5(b) are diagrams explaining detection of a specific part.
FIGS. 6(a) to 6(d) are diagrams showing projection images of 3D models.
FIGS. 7(a) and 7(b) are diagrams explaining synthesis processing of projection images.
FIGS. 8(a) to 8(c) are diagrams showing an example of afterimage expression.
FIG. 9 is a flowchart showing the flow of virtual viewpoint image generation processing according to Embodiment 2.
FIG. 10 is a diagram showing an example of a UI screen for part designation.
FIGS. 11(a) and 11(b) are diagrams explaining adjustment at the time of synthesis.
FIGS. 12(a) and 12(b) are diagrams showing an example of afterimage expression.

Embodiments of the present disclosure are described below with reference to the drawings. The following embodiments do not limit the technique of the present disclosure, and not all of the configurations described in them are essential to the means for solving the problem.

[Embodiment 1]
First, an outline of the virtual viewpoint image is given briefly. A virtual viewpoint image is an image representing the view from a virtual camera viewpoint (virtual viewpoint) that differs from the actual camera viewpoints, and is also called a free viewpoint image. The virtual viewpoint is set, for example, by the user operating a controller to specify it directly or by selecting it from a plurality of preset virtual viewpoint candidates. Virtual viewpoint images include both moving images and still images; the following embodiments are described using a moving image as an example. When a moving virtual viewpoint image is generated, the virtual viewpoint may be fixed or may change so as to follow the movement of the object.

<About system configuration>
FIG. 1 is a diagram showing an example of the configuration of an image processing system for generating a virtual viewpoint image according to this embodiment. The image processing system 100 has a plurality of imaging devices (cameras) 101, an image processing device 102, a controller 103, and a display device 104. In the image processing system 100, the image processing device 102 generates a virtual viewpoint image based on a plurality of captured images obtained by synchronous imaging with the plurality of cameras 101, and displays it on the display device 104.

The cameras 101 are installed so as to surround the object, and each camera 101 images the object in time synchronization with the others. Where the installation location is constrained, as in a studio or a concert hall, the cameras 101 are installed only in some directions around the imaging target area. Each camera 101 is realized, for example, by a digital video camera equipped with a video signal interface typified by the serial digital interface (SDI). Each camera 101 adds time information, typified by a time code, to the video signal it outputs and transmits the signal to the image processing device 102. An imaging parameter set, consisting of the camera's three-dimensional position (x, y, z), imaging direction (pan, tilt, roll), angle of view, resolution, and so on, is transmitted together with it. The imaging parameter set is calculated in advance for each camera 101 by a known camera calibration method and stored.

The image processing device 102 generates a virtual viewpoint image that expresses a part of an object like an afterimage, based on the plurality of captured images obtained by synchronous imaging with the plurality of cameras 101. Specifically, it generates shape data of the foreground object, sets the moving part of that object (the specific part) that is to be given the afterimage expression, projects the shape data of the object onto the virtual viewpoint, and synthesizes the projection images. Details of the functions of the image processing device 102 are described later.

The controller 103 is a control device that the user operates to cause the image processing device 102 to generate a virtual viewpoint image. Through input devices of the controller 103 such as a joystick and a keyboard, the operator makes the various settings and data inputs needed to generate a virtual viewpoint image. Specifically, the operator specifies the three-dimensional position of the virtual viewpoint (virtual camera) in the imaging space, the line-of-sight direction, the angle of view, the resolution, the time codes needed for image generation, and so on. The times represented by these time codes include the start and end times of virtual viewpoint image generation within the imaging period of the plurality of captured images, and the time of a key frame (the reference time used for the afterimage expression). The information for virtual viewpoint image generation set in this way according to the user's designation (hereinafter, "virtual viewpoint information") is transmitted to the image processing device 102.

The display device 104 acquires and displays the image data sent from the image processing device 102 (data of UI screens for a graphical user interface and data of virtual viewpoint images). The display device 104 is realized, for example, by a liquid crystal display, a projector, or a head-mounted display.

<About hardware configuration>
FIG. 2 is a diagram showing an example of the hardware configuration of the image processing device 102. The image processing device 102, which is an information processing device, has a CPU 211, a ROM 212, a RAM 213, an auxiliary storage device 214, an operation unit 215, a communication I/F 216, and a bus 217.

The CPU 211 realizes each function of the image processing device 102 by controlling the entire image processing device 102 using computer programs and data stored in the ROM 212 or the RAM 213. The image processing device 102 may have a GPU (Graphics Processing Unit) or one or more pieces of dedicated hardware separate from the CPU 211, and the GPU or the dedicated hardware may perform at least part of the processing otherwise performed by the CPU 211. Examples of dedicated hardware include ASICs (application-specific integrated circuits), FPGAs (field-programmable gate arrays), and DSPs (digital signal processors).

The ROM 212 stores programs and other data that do not need to be changed. The RAM 213 temporarily stores programs and data supplied from the auxiliary storage device 214, data supplied from outside via the communication I/F 216, and the like. The auxiliary storage device 214 is composed of, for example, a hard disk drive, and stores various data such as image data and sound volume data.

The operation unit 215 is composed of, for example, a keyboard and a mouse, and inputs various instructions to the CPU 211 in response to user operations. The CPU 211 operates as a display control unit that controls the display device 104 and as an operation control unit that controls the operation unit 215. The communication I/F 216 is used for communication with devices external to the image processing device 102. For example, when the image processing device 102 is connected to an external device by wire, a communication cable is connected to the communication I/F 216. When the image processing device 102 has a function for wireless communication with an external device, the communication I/F 216 has an antenna.

The bus 217 connects the units of the image processing device 102 and carries information between them. In this embodiment the controller 103 and the display device 104 are provided as external devices, but either may instead be built into the image processing device 102 as one of its functional units.

<About software configuration>
FIG. 3 is a diagram showing an example of the software configuration of the image processing device 102. The image processing device 102 has a data acquisition unit 300, a model generation unit 301, a model analysis unit 302, and a virtual viewpoint image generation unit 303. The virtual viewpoint image generation unit 303 in turn has a projection unit 304 and a synthesis unit 305. The function of each unit is described below.

The data acquisition unit 300 acquires the various data used to generate a virtual viewpoint image: specifically, the data of the plurality of captured images synchronously captured by the cameras 101, the imaging parameter set of each camera 101, and the virtual viewpoint information. The acquired data is stored in the auxiliary storage device 214.

Based on the plurality of captured images and the imaging parameter set of each camera 101, the model generation unit 301 generates shape data representing the three-dimensional shape of a foreground object such as a stage performer (hereinafter, "foreground model"), for example by the visual hull method (shape from silhouette). The foreground model of this embodiment is expressed as a textured polygon mesh, a three-dimensional point cloud in which each point has a color, or the like, and is generated as a file in a general-purpose format such as Stanford PLY or Wavefront OBJ. The foreground model may also be shape data without color information. In that case, the projection unit 304 described later performs processing that applies colors corresponding to the virtual viewpoint (texture mapping). The generated foreground model is stored in the auxiliary storage device 214.
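The visual hull idea mentioned above keeps a point in space only if every camera sees it inside the object's silhouette. The following is a minimal sketch of that idea, not the implementation of the model generation unit 301: it assumes a small voxel grid and two idealized orthographic views (one looking along the z-axis, one along the x-axis) in place of the calibrated cameras 101. The function name `carve_visual_hull` and its parameters are hypothetical.

```python
def carve_visual_hull(grid_size, front_silhouette, side_silhouette):
    """Keep the voxels whose projections fall inside both silhouettes.

    Assumptions for illustration only:
    - front_silhouette is a set of (x, y) pixels seen along the z-axis,
    - side_silhouette is a set of (z, y) pixels seen along the x-axis,
    - both views are orthographic, so projection just drops one coordinate.
    """
    kept = set()
    for x in range(grid_size):
        for y in range(grid_size):
            for z in range(grid_size):
                in_front = (x, y) in front_silhouette
                in_side = (z, y) in side_silhouette
                # A voxel survives only if no view carves it away.
                if in_front and in_side:
                    kept.add((x, y, z))
    return kept


front = {(1, 1), (1, 2)}
side = {(0, 1), (0, 2), (1, 2)}
hull = carve_visual_hull(3, front, side)
print(sorted(hull))
```

With real cameras, the coordinate-dropping step would be replaced by a perspective projection using each camera's imaging parameter set.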

The model analysis unit 302 analyzes the change in shape (movement) of the foreground model generated by the model generation unit 301 over a plurality of times, detects the portion of the foreground model where the amount of change is large, and determines the specific part to be given the afterimage expression.

Based on the virtual viewpoint information input from the controller 103, the projection unit 304 projects the foreground model onto the virtual viewpoint and generates a projection image corresponding to the virtual viewpoint. In addition to the foreground model described above, the projection targets include shape data corresponding to the specific part of the foreground model (hereinafter, "partial model") and shape data of the background object (hereinafter, "background model"). The background model is shape data with color information that represents the three-dimensional shape of a background object such as stage equipment; it is generated in advance and stored in the auxiliary storage device 214. The background model may be generated from design data such as CAD data, or from shape and color data scanned with a laser scanner or the like. Alternatively, it may be generated from captured images from multiple viewpoints using a computer vision technique such as Structure from Motion.

The synthesis unit 305 synthesizes the projection images obtained when the projection unit 304 projects each 3D model (foreground model, partial model, background model) onto the virtual viewpoint, and generates a virtual viewpoint image in which the specific part is expressed like an afterimage. The data of the generated virtual viewpoint image is transmitted to the display device 104.

The functional units described above may be distributed among a plurality of image processing devices. For example, a first image processing device may have the data acquisition unit 300 and the model generation unit 301, and a second image processing device may have the model analysis unit 302 and the virtual viewpoint image generation unit 303.

<Flow of generating virtual viewpoint image>
Next, the flow of virtual viewpoint image generation in the image processing device 102 is described with reference to the flowchart of FIG. 4. The series of processes shown in the flowchart of FIG. 4 is realized by the CPU 211 reading a control program stored in the ROM 212, the auxiliary storage device 214, or the like, loading it into the RAM 213, and executing it. In the following description, the symbol "S" denotes a step.

In S401, the data acquisition unit 300 acquires the data of the plurality of captured images for a continuous period (for example, 5 seconds), the imaging parameter set of each camera 101, and the virtual viewpoint information. The acquired data is loaded into and held in the RAM 213.

In S402, the model generation unit 301 uses the captured image data and the imaging parameter sets acquired in S401 to generate a foreground model for each of a plurality of times, in the past or future relative to the reference time t of interest, within the continuous period. The time of interest is, for example, the time corresponding to a specific key frame designated by a time code within the acquired continuous period. The plurality of past or future times for which foreground models are generated may be set in advance or may be designated in the virtual viewpoint information.

In S403, the model analysis unit 302 analyzes how the shape of the foreground models generated in S402 changes over time and determines the specific part to be given the afterimage expression. Specifically, based on the foreground models generated for the plurality of times, it detects a part that moves significantly as time passes and determines that part as the specific part to be processed for the afterimage expression. FIG. 5(a) illustrates the detection of the specific part; it shows the foreground models corresponding to different times superimposed on the y-axis of the same space. FIG. 5(b) shows the relationship between the times. In FIG. 5(b), the thin double-headed arrow 501 indicates the range over which virtual viewpoint images are generated, and the thick double-headed arrow 502 indicates the range expressed as an afterimage. Time t-n is a time within the range 502 other than the reference time t, and the variable n takes any integer corresponding to the "plurality of past or future times" described above. For example, n = 1 denotes one time step before the reference time t, n = 2 denotes two time steps before it, and n = -1 denotes one time step after it. In FIG. 5(a), the one-dot chain line represents the left arm of the foreground model at time t-1, and the two-dot chain line represents the left arm of the foreground model at time t-2. That is, only the left arm moves between time t-2 and the reference time t. In other words, of the shape represented by the foreground model, only the portion corresponding to the left arm shows a large amount of change between the different times in the range 502.

In this way, a moving part of the foreground object can be detected by calculating the amount of change of the foreground model between different times. If the foreground model is in polygon mesh form, for example, the distance between polygons can be used to calculate the amount of change. From each vertex of the foreground model at one time, a search is made along the normal direction toward the foreground model at another time, and the point where the search meets the mesh of that other model is taken as the nearest-neighbor point. The distance between the vertex and its nearest-neighbor point is then obtained as a signed distance, which is positive if the nearest-neighbor point lies outside the mesh as seen from the vertex and negative if it lies inside. This signed distance is obtained for all vertices of the foreground model at a given time, and the vertices whose distance has an absolute value larger than the standard deviation (σ) of the distribution are extracted. The part of the foreground model made up of the extracted vertices is then detected. In the example of FIG. 5(a), the vertices of the portion corresponding to the left arm have signed distances with large absolute values, so they are extracted by this procedure. The part of the foreground model made up of the extracted vertices is determined as the specific part to be processed for the afterimage expression. This processing is performed for every one of the plurality of times for which a foreground model was generated in S402, and the specific part at each time is determined. The shape data corresponding to the determined specific part (shape data corresponding to a portion of the foreground model; hereinafter, "partial model") is held in the RAM 213 separately from the foreground model.
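The extraction step of S403 can be sketched as follows, assuming the signed distances have already been computed for every vertex (the normal-direction search against the other mesh is not shown). The function name `select_moving_vertices` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import pstdev


def select_moving_vertices(signed_distances):
    """Return indices of vertices whose |signed distance| exceeds sigma.

    signed_distances[i] is the signed distance from vertex i of the model
    at one time to the mesh of the model at another time (assumed given).
    """
    # Assumption: sigma is the population standard deviation of all distances.
    sigma = pstdev(signed_distances)
    return [i for i, d in enumerate(signed_distances) if abs(d) > sigma]


# Vertices 4 to 6 stand in for a moving arm; the rest barely move.
distances = [0.0, 0.01, -0.02, 0.0, 0.5, -0.6, 0.55, 0.0]
print(select_moving_vertices(distances))
```

The returned indices identify the vertex group from which a partial model would be built for that time.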

In S404, the projection unit 304 projects the foreground model at the reference time, the partial models of the specific part determined in S403, and the background model prepared in advance, based on the virtual viewpoint information. Known model-based rendering or image-based rendering may be used for the projection. FIG. 6(a) shows the projection image of the foreground model 501 described above at the reference time t, and FIGS. 6(b) and 6(c) show the projection images of the left-arm partial model of the foreground model 501 at time t-1 and time t-2, respectively. FIG. 6(d) shows the projection image of the background model. These projection images are, for example, 4-channel images in which each pixel has 8-bit (0 to 255) RGB values plus an alpha value. The alpha value is a numerical value representing transparency: the maximum value of 255 means opaque, transparency increases as the value decreases, and the minimum value of 0 means completely transparent. That is, a pixel onto which the shape data of a model is projected has an alpha value greater than 0, and a pixel onto which nothing is projected has an alpha value of 0. Accordingly, for the projection image of the background model and the projection image of the foreground model at the reference time t, neither of which is to be made transparent at all, the alpha value of each pixel is set to 255. For the projection images of the partial models, which are to be made transparent in order to express the afterimage, different alpha values are set, such as 255 × 0.6 = 153 for time t-1 and 255 × 0.3 = 77 for time t-2. Here, 0.6 and 0.3 are coefficients for controlling the transmittance, and take values less than 1.0 according to the difference from the reference time. In the example of FIGS. 6(b) and 6(c), the transmittance is made higher (the alpha value smaller) the further the corresponding specific time lies in the past relative to the reference time. This method of controlling the transmittance is only an example; the transmittance may instead be made higher the further the specific time corresponding to the partial model lies in the future relative to the reference time.
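The alpha assignment described for S404 can be written down as a small lookup. This is a sketch under assumptions: the coefficient table simply reuses the example values 0.6 and 0.3 from the text, the function name `afterimage_alpha` is hypothetical, and offsets not in the table are treated as fully transparent, which the text does not specify.

```python
from fractions import Fraction

# Example coefficients from the text: offset n from the reference time -> coefficient.
EXAMPLE_COEFFICIENTS = {0: Fraction(1), 1: Fraction(6, 10), 2: Fraction(3, 10)}


def afterimage_alpha(n, coefficients=EXAMPLE_COEFFICIENTS):
    """Return the 8-bit alpha value for the model projected at offset n."""
    # Assumption: offsets without a coefficient are not drawn (alpha 0).
    coefficient = coefficients.get(n, Fraction(0))
    # Round half up with exact arithmetic so that 255 x 0.3 = 76.5 becomes 77.
    return int(255 * coefficient + Fraction(1, 2))


for n in (0, 1, 2):
    print(n, afterimage_alpha(n))
```

Exact fractions are used instead of floats because Python's built-in `round` rounds halves to the nearest even integer, which would turn 76.5 into 76 rather than the 77 given in the text.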

In S405, the synthesis unit 305 synthesizes all the projection images generated in S404 to generate a virtual viewpoint image. FIG. 7(a) shows how the four projection images of FIGS. 6(a) to 6(d) are alpha-blended. By stacking the four projection images 701 to 704 in layers and alpha-blending them as shown in FIG. 7(a), a virtual viewpoint image at the reference time t in which the left arm is expressed like an afterimage is obtained, as shown in FIG. 7(b).
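The layered alpha blending of S405 can be illustrated per pixel with the standard "over" operator. This is a sketch, not the synthesis unit's actual method: it assumes straight (non-premultiplied) 8-bit RGBA pixels and a bottom-to-top layer order, neither of which the text states, and the names `over` and `composite_layers` are hypothetical.

```python
def over(top, bottom):
    """Blend one straight-alpha RGBA pixel over another (8-bit channels)."""
    alpha_top = top[3] / 255.0
    alpha_bottom = bottom[3] / 255.0
    alpha_out = alpha_top + alpha_bottom * (1.0 - alpha_top)
    if alpha_out == 0.0:
        return (0, 0, 0, 0)  # both pixels are fully transparent
    channels = []
    for c_top, c_bottom in zip(top[:3], bottom[:3]):
        value = (c_top * alpha_top
                 + c_bottom * alpha_bottom * (1.0 - alpha_top)) / alpha_out
        channels.append(int(value + 0.5))
    return (*channels, int(alpha_out * 255.0 + 0.5))


def composite_layers(pixels_bottom_to_top):
    """Blend the same pixel position from every layer, bottom layer first."""
    result = (0, 0, 0, 0)
    for pixel in pixels_bottom_to_top:
        result = over(pixel, result)
    return result


# Opaque blue background under a red afterimage pixel with alpha 153.
print(composite_layers([(0, 0, 255, 255), (255, 0, 0, 153)]))
```

Applying `composite_layers` at every pixel position of the stacked projection images yields one blended frame.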

In S406, it is determined whether processing has been completed for all the reference times designated in the virtual viewpoint information. If an unprocessed reference time remains, the flow proceeds to S407, where the reference time of interest is updated, and then returns to S402 to continue processing. If the virtual viewpoint information specifies that the virtual viewpoint moves over time, virtual viewpoint images projected at arbitrary camera angles are obtained. When processing has been completed for all reference times, this flow ends.

The above is the flow of generating a virtual viewpoint image according to the present embodiment. The virtual viewpoint image data thus obtained is transmitted to the display device 104 for viewing by the user. Note that the target foreground object is not limited to a dynamic object such as a person and may be a static object such as a building. For example, based on captured images obtained by continuously imaging the construction of a high-rise building over a certain period, making only the newly constructed part opaque and the existing part transparent yields virtual viewpoint images corresponding to the respective reference times, as shown in FIGS. 8(a) to 8(g), that express the day-to-day change. In this case, for example, the building is imaged every day (S401), a foreground model is generated from the obtained captured images (S402), and a portion whose shape has changed greatly compared with the previous day (that is, the portion built on that day) is detected and determined as the specific part (S403). Then, the partial model corresponding to the portion built on that day and the foreground model for the previous day are projected onto the virtual viewpoint (S404), and alpha synthesis is performed using the obtained projection images (S405). At this time, by setting, for example, the alpha value to "127" for the current day and "255" for the previous day, a virtual viewpoint image is obtained in which the difference from the previous day can be grasped intuitively.
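The alpha synthesis of S405 can be illustrated with a minimal sketch. It assumes 8-bit RGBA pixels with straight (non-premultiplied) alpha and the conventional "over" operator; the function names `over` and `composite` and the bottom-to-top layer order are hypothetical choices for illustration and are not specified in this description.

```python
def over(top, bottom):
    """Composite one RGBA pixel over another (8-bit channels, straight alpha)."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    a_top = ta / 255.0
    a_bot = ba / 255.0
    a_out = a_top + a_bot * (1.0 - a_top)
    if a_out == 0.0:
        return (0, 0, 0, 0)  # both pixels fully transparent

    def blend(t, b):
        # Weighted average of the two colors, normalized by the output alpha.
        return round((t * a_top + b * a_bot * (1.0 - a_top)) / a_out)

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), round(a_out * 255))


def composite(layers):
    """Stack equally sized 2-D RGBA pixel grids, ordered bottom to top."""
    result = [row[:] for row in layers[0]]
    for layer in layers[1:]:
        for y, row in enumerate(layer):
            for x, pixel in enumerate(row):
                result[y][x] = over(pixel, result[y][x])
    return result
```

For example, a pixel with alpha value 127 placed over a pixel with alpha value 255 contributes roughly half of its color to the result, which is how a half-transparent layer remains visible without hiding the layer beneath it.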

<Modification 1>
In the above-described embodiment, the specific part is determined based on the shape change of the foreground model across a plurality of times, but the method of determining the specific part is not limited to this. That is, prior to determining the specific part, projection onto the virtual viewpoint may be performed for the plurality of times, the difference between the resulting projection images at the plurality of times may be obtained, and the part corresponding to an image region with a large difference value may be determined as the specific part. In this case, a partial model corresponding to the determined specific part is generated from the foreground model, the partial model is projected onto the virtual viewpoint again to obtain projection images corresponding to FIGS. 7(b) and 7(c), and the synthesis processing described above is then performed.
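The image-difference step of this modification can be sketched as follows. The sketch assumes two projection images of equal size given as 2-D grids of RGB tuples; the function name `changed_region_mask` and the threshold value are hypothetical, and mapping the resulting image region back to a part of the foreground model is not shown.

```python
def changed_region_mask(image_a, image_b, threshold=30):
    """Return a 2-D boolean mask that is True where two projections differ strongly.

    The difference measure (sum of absolute RGB differences) and the
    threshold are illustrative assumptions, not values from this description.
    """
    mask = []
    for row_a, row_b in zip(image_a, image_b):
        mask.append([
            sum(abs(ca - cb) for ca, cb in zip(pa[:3], pb[:3])) > threshold
            for pa, pb in zip(row_a, row_b)
        ])
    return mask
```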

<Modification 2>
Instead of alpha synthesis, which changes transmittance, the afterimage may be expressed by color change processing that changes, for example, brightness or saturation. In this case, each projection image is a three-channel RGB image whose pixels have no α value. The brightness or saturation is then lowered, for example, the further the specific time corresponding to the partial model lies in the past relative to the reference time, or the further it lies in the future relative to the reference time. This makes it possible to visually express the movement of the object over time. In addition, highlight processing may be performed to make the specific part at the specific time stand out.
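A minimal sketch of such color change processing is given below. It assumes that brightness corresponds to the V component and saturation to the S component of the HSV color model, and that the reduction is linear in the number of time steps from the reference time; the function name `fade_pixel` and the parameters `falloff` and `mode` are hypothetical.

```python
import colorsys


def fade_pixel(rgb, steps_from_reference, falloff=0.25, mode="brightness"):
    """Lower the brightness or saturation of an 8-bit RGB pixel.

    steps_from_reference: signed number of time steps between the specific
    time and the reference time (negative for the past, positive for the future).
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    # The further from the reference time, the smaller the factor.
    factor = max(0.0, 1.0 - falloff * abs(steps_from_reference))
    if mode == "brightness":
        v *= factor
    else:
        s *= factor
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r, g, b))
```

Using the absolute value of the time difference covers both cases described above with one function; a caller that wants to fade only past or only future times can simply pass the relevant partial models.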

<Modification 3>
In the above-described embodiment, the projection images corresponding to the respective times are stacked as layers and synthesized, but synthesis using depth information may be performed in order to accurately express the occlusion relationship of the foreground object at the plurality of times. That is, when a projection image is generated for each model in S404, a depth image recording, for each pixel, the distance from the virtual viewpoint to the foreground object is also generated. Then, in the synthesis in S405, the projection images are selected and synthesized pixel by pixel so that the smaller the depth (that is, the closer to the virtual viewpoint), the higher the layer. This enables a more three-dimensional video expression that appropriately represents the actual occlusion relationship.
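The per-pixel ordering by depth can be sketched as follows for a single pixel. The sketch assumes that each projection image supplies, for the pixel in question, an RGB color, an alpha value in the range 0 to 1, and a depth; the function name `composite_pixel_by_depth` and this data layout are hypothetical.

```python
def composite_pixel_by_depth(samples, background_rgb):
    """Blend the samples for one pixel so that the nearest one is the top layer.

    samples: list of (rgb, alpha, depth) tuples, one per projection image,
    where alpha is in [0, 1] and a smaller depth means closer to the viewpoint.
    """
    color = list(background_rgb)
    # Paint from far to near so that the nearest sample is applied last.
    for rgb, alpha, _depth in sorted(samples, key=lambda s: s[2], reverse=True):
        color = [alpha * c + (1.0 - alpha) * base for c, base in zip(rgb, color)]
    return tuple(round(c) for c in color)
```

Because the sort is done per pixel, the layer order can differ from pixel to pixel, which is what allows an arm from an earlier time to appear in front of the body in one region and behind it in another.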

<Modification 4>
In the above-described embodiment, the specific part is determined by detecting the motion of the foreground object, but it may instead be determined based on the positional relationship with a specific region in the three-dimensional space in which the imaging was performed. Specifically, a bounding box is set in the three-dimensional space such that the person's shoulder lies on its boundary and the parts other than the arm are contained inside it, and the portion inside or outside the bounding box is detected and determined as the specific part. The bounding box may have any shape for which inside/outside determination is possible in three-dimensional space, such as a rectangular parallelepiped, a cylinder, a sphere, or an ellipsoid. The position, orientation, and size of the bounding box may also be changed for each time (each frame). Information such as the shape and size of the bounding box, whether it changes over time, and whether the inside or the outside is to be detected may be stored in advance in the auxiliary storage device 214 and read out as needed. This makes it possible to determine the specific part more easily.
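The inside/outside determination can be sketched as follows. The sketch assumes that the foreground model is available as a list of three-dimensional points and shows only an axis-aligned rectangular parallelepiped and a sphere; the function names and the `take_inside` flag are hypothetical.

```python
def inside_box(point, box_min, box_max):
    """True if the point lies inside an axis-aligned box (boundary included)."""
    return all(lo <= p <= hi for p, lo, hi in zip(point, box_min, box_max))


def inside_sphere(point, center, radius):
    """True if the point lies inside a sphere (boundary included)."""
    return sum((p - c) ** 2 for p, c in zip(point, center)) <= radius ** 2


def select_specific_part(points, is_inside, take_inside=False):
    """Keep the points inside (or, by default, outside) the given region."""
    return [p for p in points if is_inside(p) == take_inside]
```

With a box that contains the torso and has the shoulder on its boundary, calling `select_specific_part` with `take_inside=False` would return the points of the arm, since only the arm extends outside the box.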

As described above, according to the present embodiment, a virtual viewpoint image in which a moving part of a foreground object is expressed as an afterimage can be obtained easily.

[Embodiment 2]
In Embodiment 1, the specific part to be expressed as an afterimage is determined automatically by analyzing the temporal change in the shape of the foreground model and detecting a portion with large movement. Embodiment 2 describes a mode in which the structure of the generated foreground model is analyzed, the result is presented to the user, and the specific part is determined based on the user's explicit instruction. In addition, Embodiment 1 assumes that parts other than the specific part do not change across the plurality of times, whereas the present embodiment also describes the adjustment performed when the position, posture, or the like of the foreground object changes across the plurality of times. Description of content common to Embodiment 1, such as the basic system configuration, is omitted, and the following description focuses on the differences.

<Flow of generating virtual viewpoint image>
The flow of generating a virtual viewpoint image according to the present embodiment is described with reference to the flowchart of FIG. 9. The series of processes shown in the flowchart of FIG. 9 is realized by the CPU 211 reading a control program stored in the ROM 212, the auxiliary storage device 214, or the like, loading it into the RAM 213, and executing it. In the following description, the symbol "S" denotes a step.

S901 and S902 correspond to S401 and S402, respectively, in the flowchart of FIG. 4 according to Embodiment 1. That is, captured image data, an imaging parameter set, and virtual viewpoint information for a continuous period are acquired (S901), and then a foreground model corresponding to each of a plurality of times is generated (S902).

In S903, the model analysis unit 302 analyzes the structure of the foreground model generated in S902. A known method may be applied to the structural analysis. For example, a method is known in which a reference bone model is superimposed on the foreground object by minimizing the distance between the surfaces of the reference bone model and the foreground model, and the bone structure is thereby associated (Non-Patent Document 1). Alternatively, the "OpenPose" method, which takes video of the object as input and infers the bone structure using a deep learning model, or a motion capture method may be used. For a foreground model in polygon mesh format, the structural analysis result is obtained as metadata that associates a part name and a part ID with each mesh polygon constituting the model. The structural analysis result thus obtained is stored in the RAM 213.

In S904, the specific part is determined based on user input via a UI screen that reflects the structural analysis result obtained in S903. More specifically, data of a UI screen for designating the specific part to be expressed as an afterimage (part designation UI screen) is first transmitted to the display device 104 and displayed there. FIG. 10 shows an example of the part designation UI screen. On the UI screen 1000 of FIG. 10, the foreground model for which the specific part is to be designated is displayed in the center of the screen, and to its right are displayed a list 1001 of all parts obtained by the structural analysis and a seek bar 1002 for designating a desired time among the plurality of times described above. The user operates the pointer 1003 with a mouse or the like to designate the specific part to be expressed as an afterimage. For example, to designate the left arm, the user places the pointer 1003 over the left shoulder and then drags it to the vicinity of the fingertips of the left hand. For a part designated in this way, the corresponding part name in the list 1001 is, for example, highlighted. This allows the user to confirm that the designated part has been selected as the specific part. The above designation method is only an example; for instance, a button may be provided for each part obtained by the structural analysis so that a desired part can be designated by pressing the corresponding button.

S905 corresponds to S404 of Embodiment 1. That is, the foreground model at the reference time s, the partial model of the specific part determined in S904, and a background model prepared in advance are projected based on the virtual viewpoint information. In the present embodiment, the position, posture, and the like of the partial model corresponding to the specific part determined in S904 are adjusted as necessary at this point. FIG. 11(a) illustrates how the position and posture are adjusted to the shape data of the foreground model at the reference time so that the base of the left arm, which has been determined as the specific part, coincides at each time. Here, the foreground object, a person, moves from time t-1 to time t while moving the left arm. The foreground model 1100 corresponds to time t, and the foreground model 1100' corresponds to time t-1. In the adjustment, a geometric transformation is performed so that the base 1101' of the left arm of the foreground model 1100' at time t-1 overlaps the base 1101 of the left arm of the foreground model 1100 at time t. If, for example, the scale of the left arm differs between time t-1 and time t, a transformation that matches the scales may also be performed. FIG. 11(b) shows the result of the adjustment: the foreground model 1100' at time t-1 and the foreground model at time t overlap three-dimensionally with the bases of the left arm, designated as the specific part, coinciding. Such adjustment is applied to the partial model as necessary, and the foreground model and the partial model are then projected onto the virtual viewpoint.
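The adjustment of position and scale can be sketched as follows. The sketch assumes that the partial model is a list of three-dimensional vertices and that the base of the arm at each time is known as a single anchor point; the function name `align_partial_model` and the uniform `scale` parameter are hypothetical. Adjustment of posture (rotation) is omitted to keep the sketch short.

```python
def align_partial_model(vertices, anchor_src, anchor_dst, scale=1.0):
    """Move a partial model so that its anchor coincides with a target anchor.

    vertices:   vertices of the partial model at the specific time
    anchor_src: anchor (e.g. base of the arm) at the specific time
    anchor_dst: corresponding anchor at the reference time
    scale:      uniform scale applied about the anchor
    """
    return [
        tuple(d + scale * (v - s) for v, s, d in zip(vertex, anchor_src, anchor_dst))
        for vertex in vertices
    ]
```

Scaling about the anchor rather than about the origin keeps the base of the arm fixed at the target position while the rest of the arm is resized.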

S906 to S908 correspond to S405 to S407 of Embodiment 1, respectively, and do not differ from them, so their description is omitted.

The above is the flow of generating a virtual viewpoint image according to the present embodiment. Note that the structural analysis of the foreground model at each time may be performed by a separate device before the flow of FIG. 9 starts, in which case S903 merely acquires the result. In addition, when adjustment is performed in S905, the adjustment result may be displayed on the UI screen so that the user can fine-tune the position, posture, and the like.

<Modification 1>
Instead of changing the transmittance of the user-designated specific part according to time, the transmittance around the specific part may be changed according to the distance from the specific part. Such afterimage expression is useful, for example, for analyzing a golf swing. That is, the head of the golf club is determined as the specific part, the head is made opaque, and transparency processing is applied when the foreground model of the golf club is projected so that the club becomes more transparent with increasing distance from the head. In this transparency processing, with reference to the structural analysis result, the alpha value is made smaller as the distance from the designated head portion increases, so that the golf club becomes transparent in a gradation, and the foreground model at each time is projected onto the virtual viewpoint. For the relationship between distance and alpha value, a transparency start distance and a transparency end distance may be set in advance. Alternatively, a portion with large movement may be detected, and the distance from the head portion to the farthest part of it may be set as the transparency end distance. In that case, the transparency start distance may be the distance from the head to the nearest part. FIG. 12(a) shows an example of a virtual viewpoint image obtained by the synthesis processing according to this modification. The projection image of the foreground model at the latest time is synthesized with projection images of the foreground model at each earlier time that have been processed so that the golf club becomes gradually transparent from the base of the head toward the grip. This modification makes it possible to obtain a virtual viewpoint image that emphasizes the trajectory of the specific part designated by the user.
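The relationship between distance and alpha value can be sketched as follows. The sketch assumes a linear ramp between the transparency start distance and the transparency end distance; the description only states that the alpha value decreases with distance, so the linear shape, the function name `alpha_for_distance`, and the default alpha range are assumptions.

```python
def alpha_for_distance(distance, start, end, alpha_max=255, alpha_min=0):
    """Map the distance from the specific part to an 8-bit alpha value.

    Up to `start` the pixel stays at alpha_max (opaque); from `end` onward it
    is at alpha_min (transparent); in between, alpha falls linearly.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    if distance <= start:
        return alpha_max
    if distance >= end:
        return alpha_min
    t = (distance - start) / (end - start)
    return round(alpha_max + t * (alpha_min - alpha_max))
```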

Note that by applying the method of this modification together with the method of the above-described embodiment, in which the image becomes more transparent the further the time lies in the past, a virtual viewpoint image as shown in FIG. 12(b) is obtained. At this time, the virtual viewpoint may be moved (for example, at a constant speed from a position capturing the person from the front to a position capturing the person from behind at the end of the swing). This yields a virtual viewpoint image in which the movement of the club head is projected from arbitrary camera angles.

As described above, according to the present embodiment, a virtual viewpoint image in which only the part of the foreground object designated by the user is expressed as an afterimage can be obtained easily. In addition, by adjusting the position and posture of the foreground model as necessary, a virtual viewpoint image with less sense of incongruity can be obtained.

(Other Embodiments)
The present disclosure can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

102 Image processing device
300 Data acquisition unit
302 Model analysis unit
303 Virtual viewpoint image generation unit
304 Projection unit
305 Synthesis unit

Claims

An image processing device comprising:
acquisition means for acquiring shape data of an object corresponding to each of a plurality of times during a period in which images are continuously captured by a plurality of imaging devices;
determining means for determining a specific part that is a part of the object; and
generating means for generating a virtual viewpoint image corresponding to a virtual viewpoint using the shape data of the object at a reference time among the plurality of times and the shape data of the specific part of the object at a specific time, among the plurality of times, that differs from the reference time.

The image processing device according to claim 1, wherein the generating means generates the virtual viewpoint image by synthesizing a first projection image obtained by projecting the shape data of the object at the reference time onto the virtual viewpoint and a second projection image obtained by projecting the shape data of the specific part of the object at the specific time onto the virtual viewpoint.

The image processing device according to claim 2, wherein the first projection image and the second projection image are four-channel images in which each pixel has an alpha value in addition to RGB values, and
the generating means generates the virtual viewpoint image by alpha synthesis using the first projection image and the second projection image.

The image processing device according to claim 3, wherein, in the alpha synthesis, the transmittance of pixels corresponding to the specific part at the specific time in the second projection image differs from the transmittance of pixels corresponding to the specific part at the reference time.

The image processing device according to claim 4, wherein, when the specific time is earlier than the reference time, the generating means performs the alpha synthesis so that the transmittance of the pixels corresponding to the specific part at the specific time is higher than the transmittance of the pixels corresponding to the specific part at the reference time.

The image processing device according to claim 5, wherein the generating means performs the alpha synthesis so that the transmittance of pixels corresponding to the specific part in the second projection image increases the further the specific time lies in the past relative to the reference time.

The image processing device according to claim 4, wherein, when the specific time is later than the reference time, the generating means performs the alpha synthesis so that the transmittance of the pixels corresponding to the specific part at the specific time is higher than the transmittance of the pixels corresponding to the specific part at the reference time.

The image processing device according to claim 7, wherein the generating means performs the alpha synthesis so that the transmittance of pixels corresponding to the specific part in the second projection image increases the further the specific time lies in the future relative to the reference time.

The image processing device according to claim 3, wherein, in the alpha synthesis, the transmittance of pixels corresponding to the specific part in the second projection image differs from the transmittance of pixels corresponding to the periphery of the specific part.

The image processing device according to claim 9, wherein the generating means performs the alpha synthesis so that the transmittance of pixels corresponding to the specific part is higher than the transmittance of pixels corresponding to the periphery of the specific part.

The image processing device according to claim 10, wherein the generating means performs the alpha synthesis so that the transmittance of pixels corresponding to the periphery of the specific part in the second projection image increases as the distance from the specific part increases.

The image processing device according to claim 2, wherein the first projection image and the second projection image are three-channel images in which each pixel has RGB values, and
the generating means generates the virtual viewpoint image by synthesizing the first projection image and an image obtained by changing the brightness or saturation of the second projection image.

The image processing device according to claim 12, wherein, in the synthesis, the brightness or saturation of pixels corresponding to the specific part at the specific time in the second projection image differs from the brightness or saturation of pixels corresponding to the specific part at the reference time.

The image processing device according to claim 13, wherein, when the specific time is earlier than the reference time, the generating means performs the synthesis so that the brightness or saturation of the pixels corresponding to the specific part at the specific time is lower than the brightness or saturation of the pixels corresponding to the specific part at the reference time.

The image processing device according to claim 14, wherein the generating means performs the synthesis so that the brightness or saturation of pixels corresponding to the specific part at the specific time in the second projection image decreases the further the specific time lies in the past relative to the reference time.

The image processing device according to claim 13, wherein, when the specific time is later than the reference time, the generating means performs the synthesis so that the brightness or saturation of the pixels corresponding to the specific part at the specific time is lower than the brightness or saturation of the pixels corresponding to the specific part at the reference time.

The image processing device according to claim 16, wherein the generating means performs the synthesis so that the brightness or saturation of pixels corresponding to the specific part at the specific time in the second projection image decreases the further the specific time lies in the future relative to the reference time.

The image processing device according to claim 2, wherein the generating means
generates a depth image in which the distance from the virtual viewpoint to the object is recorded for each pixel, and
performs the synthesis by selecting the second projection image for each pixel so that the smaller the depth, the higher the layer.

The image processing device according to claim 2, wherein the generating means performs the synthesis by applying a geometric transformation to the shape data corresponding to the specific part.

The image processing device according to claim 19, wherein the geometric transformation is processing that adjusts at least one of the position, posture, and scale of the shape data corresponding to the specific part in accordance with the shape data of the object at the reference time.

The image processing device according to any one of claims 1 to 20, wherein the determining means
analyzes shape changes of the object at the plurality of times using the shape data, and
determines, based on the result of the analysis, a portion with a large amount of change as the specific part.

The image processing device according to any one of claims 1 to 20, wherein the determining means
analyzes shape changes of the object at the plurality of times using projection images obtained by projecting the shape data of the object corresponding to each of the plurality of times onto a virtual viewpoint, and
determines, based on the result of the analysis, a portion with a large amount of change as the specific part.

The image processing device according to any one of claims 1 to 20, wherein the determining means
analyzes, using the shape data, the positional relationship at the plurality of times between the object and a specific region in the three-dimensional space in which the imaging was performed, and
determines, based on the result of the analysis, a portion inside or outside the specific region as the specific part.

The image processing device according to claim 23, wherein the specific region is a bounding box.

The image processing device according to any one of claims 1 to 20, further comprising a graphical user interface for accepting designation of the specific part from a user, wherein
the determining means determines the specific part based on the accepted designation.

The image processing device according to claim 25, wherein the graphical user interface displays a result of structural analysis of the object, and
the user designates the specific part based on the displayed result of the structural analysis.

An image processing method comprising:
an acquisition step of acquiring shape data of an object corresponding to each of a plurality of times during a period in which images are continuously captured by a plurality of imaging devices;
a determining step of determining a specific part that is a part of the object; and
a generation step of generating a virtual viewpoint image corresponding to a virtual viewpoint using the shape data of the object at a reference time among the plurality of times and the shape data of the specific part of the object at a specific time, among the plurality of times, that differs from the reference time.

A program for causing a computer to function as the image processing device according to any one of claims 1 to 26.