JP2023048531A

JP2023048531A - Information processing device, information processing method, program, and acoustic automatic synthesis system

Info

Publication number: JP2023048531A
Application number: JP2021157900A
Authority: JP
Inventors: 尊士宮浦; Takashi Miyaura
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2023-04-07

Abstract

To efficiently synthesize virtual viewpoint image data and acoustic data acquired in imaging spaces different from each other. SOLUTION: An information processing device initially acquires data on a first attitude model and data on a virtual viewpoint image, generated on the basis of a first multi-viewpoint image obtained from a plurality of imaging devices for imaging a first imaging space. Next, the device acquires data on a second attitude model generated on the basis of a second multi-viewpoint image obtained from the plurality of imaging devices for imaging a second imaging space different from the first imaging space. Moreover, the device acquires data on sound collected when the second multi-viewpoint image is taken in the second imaging space. Furthermore, the device evaluates a coincidence between the first attitude model and the second attitude model, and generates the data including the virtual viewpoint image and the sound, on the basis of the coincidence. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a technique for synthesizing acoustic data with virtual viewpoint image data.

There is a technique in which imaging devices such as cameras placed at a plurality of different positions perform synchronized imaging from multiple viewpoints, and a virtual viewpoint image is generated based on the plurality of captured images obtained by the imaging (hereinafter referred to as "multi-viewpoint images"). A virtual viewpoint image allows an object existing in the imaging space to be viewed from various angles. Some systems that generate virtual viewpoint images synthesize data of sound collected when the multi-viewpoint images are captured (hereinafter referred to as "acoustic data") with the data of the virtual viewpoint image (hereinafter referred to as "virtual viewpoint image data"). In such a system, the acoustic data is synthesized with the virtual viewpoint image data by synchronizing the time of the virtual viewpoint image with the time of the sound collected when the multi-viewpoint images were captured. Patent Document 1 discloses, in the field of game systems, a technique for outputting a sound corresponding to the motion of a character according to the distance between joints in a joint model of the character moving in a virtual space.

Patent Document 1: Japanese Patent Laid-Open No. 2020-195672 (JP 2020-195672 A)

Virtual viewpoint image data generated based on multi-viewpoint images captured in a first imaging space such as a studio can be synthesized with data of sound collected in a second imaging space different from the first imaging space, such as a stage, by having a user perform the synthesis manually. Specifically, a user who provides the virtual viewpoint image (hereinafter simply referred to as the "user") synthesizes the acoustic data with the virtual viewpoint image by manually adjusting the timing at which the previously acquired acoustic data and the virtual viewpoint image data are synchronized. With such a method, however, the user must manually synthesize each of the various pieces of acoustic data to be synthesized, one by one, with the virtual viewpoint image data while working on the virtual viewpoint image. Such a method therefore has the problem that the editing needed to synthesize the acoustic data with the virtual viewpoint image data takes a long time.

The present disclosure is intended to solve this problem, and an object thereof is to provide an information processing device that efficiently synthesizes virtual viewpoint image data and acoustic data acquired in imaging spaces different from each other.

An information processing device according to the present disclosure includes: first model acquisition means for acquiring data of a first posture model generated based on a first multi-viewpoint image obtained from a plurality of imaging devices that image a first imaging space; image acquisition means for acquiring data of a virtual viewpoint image generated based on the first multi-viewpoint image; second model acquisition means for acquiring data of a second posture model generated based on a second multi-viewpoint image obtained from a plurality of imaging devices that image a second imaging space different from the first imaging space; sound acquisition means for acquiring data of sound collected in the second imaging space when the second multi-viewpoint image is captured; evaluation means for evaluating a degree of matching between the first posture model and the second posture model; and data generation means for generating data including the virtual viewpoint image and the sound based on the degree of matching.

According to the present disclosure, virtual viewpoint image data and acoustic data acquired in imaging spaces different from each other can be synthesized efficiently.

FIG. 1 is a block diagram showing an example of the configuration of an information processing system according to Embodiment 1.
FIG. 2 is a block diagram showing an example of the hardware configuration of a first information processing device according to Embodiment 1.
FIG. 3 is an explanatory diagram for explaining an example of a standard shape model and a standard posture model according to Embodiment 1.
FIG. 4 is a flowchart showing an example of the flow of processing in a second information processing device according to Embodiment 1.
FIG. 5 is a flowchart showing an example of the flow of processing in the first information processing device according to Embodiment 1.
FIG. 6 is an explanatory diagram for explaining an example of association processing in an association unit according to Embodiment 1.
FIG. 7 is an explanatory diagram for explaining an example of the configuration of information indicating the association between acoustic data and a posture model made by the association unit according to Embodiment 1.
FIG. 8 is a flowchart showing an example of the flow of processing in a second information processing device according to Embodiment 2.
FIG. 9 is a flowchart showing an example of the flow of processing in a first information processing device according to Embodiment 2.
FIG. 10 is an explanatory diagram for explaining an example of association processing in an association unit according to Embodiment 2.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. Note that the configurations shown in the following embodiments are merely examples, and the scope of the present disclosure is not limited to those configurations.

(Embodiment 1)
[Configuration]
An information processing system 1 according to Embodiment 1 will be described with reference to FIGS. 1 to 7. FIG. 1 is a block diagram showing an example of the configuration of the information processing system 1 according to Embodiment 1. The information processing system 1 includes a plurality of imaging devices 11, a sound collection device 15, a plurality of imaging devices 16, a first information processing device 100, a second information processing device 150, and an output device 19.

The information processing system 1 is a system for synthesizing virtual viewpoint image data and acoustic data acquired in imaging spaces different from each other. Sound affects a viewer watching an image in various ways. For example, in the field of performing arts, stage sounds such as hand clapping or footsteps produced by the movements of performers have a large effect on viewers in terms of staging. In the field of indoor competitions such as sports, the sound in a space such as a gymnasium is indispensable for giving viewers a sense of presence at the competition. These sounds are affected by reverberation in the imaging space, such as a stage or a gymnasium, and by the structure of the walls, floor, and the like that form the imaging space. However, the reverberation, structure, and the like of an imaging space such as a studio, in which a system for generating virtual viewpoint images is installed, differ from those of a stage, a gymnasium, or the like. Sound collection in an imaging space such as a studio therefore cannot acquire the acoustic data needed to give viewers a sense of staging, presence, or the like. The information processing system 1 in the present embodiment is a system for solving this problem.

Each of the plurality of imaging devices 11 is configured by a digital video camera, a digital still camera, or the like, and is installed around a first imaging space such as a studio (hereinafter referred to as the "first imaging space"). Each of the plurality of imaging devices 11 images the first imaging space and outputs data of the captured image obtained by the imaging (hereinafter referred to as "captured image data") to the first information processing device 100. Each of the plurality of imaging devices 16 is configured by a digital video camera, a digital still camera, or the like, and is installed around a second imaging space such as a stage or a gymnasium (hereinafter referred to as the "second imaging space"). Each of the plurality of imaging devices 16 images the second imaging space and outputs the captured image data obtained by the imaging to the second information processing device 150.

The sound collection device 15 is configured by a microphone or the like and collects sound in the second imaging space, specifically, sound generated when an object existing in the second imaging space moves. The sound collection device 15 converts the collected sound into an acoustic signal and outputs the acoustic signal to the second information processing device 150. Hereinafter, the captured image data output by the respective imaging devices 11 are collectively referred to as data of a multi-viewpoint image (hereinafter referred to as "multi-viewpoint image data"). Similarly, the captured image data output by the respective imaging devices 16 are also collectively referred to as multi-viewpoint image data.

The second information processing device 150 acquires the acoustic signal output by the sound collection device 15 and the multi-viewpoint image data output by the plurality of imaging devices 16. The second information processing device 150 outputs the sound indicated by the acquired acoustic signal to the first information processing device 100 as acoustic data. In addition, based on the plurality of pieces of captured image data that make up the acquired multi-viewpoint image data, the second information processing device 150 generates data of a posture model corresponding to an object appearing in the captured images, and outputs the generated posture model data to the first information processing device 100. The second information processing device 150 includes an audio acquisition unit 151, a second image group acquisition unit 152, a second foreground acquisition unit 153, a second model generation unit 154, a model output unit 155, and a sound output unit 156. Details of each unit included in the second information processing device 150 will be described later.

Here, the posture model data is data representing the positions of the joints that make up an object, the connection relationships between the joints, the distances between the joints, the angles of the joints, and the like. Hereinafter, it is assumed that the acoustic data includes information indicating, for example, the time at which the second information processing device 150 acquired the audio signal. It is also assumed that the imaging devices 16 are time-synchronized with one another, and that the second information processing device 150 and each imaging device 16 are time-synchronized with each other. It is further assumed that the captured image data output by each imaging device 16 includes information indicating the time at which the captured image was captured (hereinafter referred to as "imaging time information"). Since methods for synchronizing time between devices are well known, a description thereof is omitted.

The first information processing device 100 acquires the acoustic data and the posture model data output by the second information processing device 150, and the multi-viewpoint image data output by the plurality of imaging devices 11. Based on the plurality of pieces of captured image data that make up the acquired multi-viewpoint image data, the first information processing device 100 generates data of a virtual viewpoint image (hereinafter referred to as "virtual viewpoint image data") and data of a posture model corresponding to an object appearing in the captured images. Hereinafter, the posture model data generated by the first information processing device 100 is referred to as first posture model data, and the posture model data generated by the second information processing device 150 is referred to as second posture model data. The first information processing device 100 evaluates the degree of matching between the first posture model and the second posture model, and generates data including the generated virtual viewpoint image and the acquired sound based on that degree of matching.
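
The evaluation of the degree of matching can be illustrated with a minimal sketch. The passage above does not specify a metric, so the following is only an assumption: each posture model is reduced to named joint positions, both poses are translated so that a root joint lies at the origin (the two imaging spaces have unrelated coordinate origins), and the mean joint distance is mapped to a score. The names `Pose`, `normalize`, and `matching_degree`, and the choice of `"hip"` as root, are hypothetical.

```python
import math

# Hypothetical representation: joint name -> (x, y, z) position.
Pose = dict[str, tuple[float, float, float]]


def normalize(pose: Pose, root: str = "hip") -> Pose:
    """Translate the pose so that the root joint sits at the origin."""
    rx, ry, rz = pose[root]
    return {name: (x - rx, y - ry, z - rz) for name, (x, y, z) in pose.items()}


def matching_degree(first: Pose, second: Pose, root: str = "hip") -> float:
    """Return a score in (0, 1]; 1.0 means all shared joints coincide.

    Assumed metric: 1 / (1 + mean distance between corresponding joints).
    """
    a, b = normalize(first, root), normalize(second, root)
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("the two poses share no joints")
    mean_dist = sum(math.dist(a[j], b[j]) for j in shared) / len(shared)
    return 1.0 / (1.0 + mean_dist)


# The same pose captured in two spaces with different origins.
studio = {"hip": (0.0, 0.0, 0.0), "knee": (0.0, -0.5, 0.0), "head": (0.0, 0.8, 0.0)}
stage = {"hip": (2.0, 0.0, 1.0), "knee": (2.0, -0.5, 1.0), "head": (2.0, 0.8, 1.0)}
print(matching_degree(studio, stage))
```

A score of this kind could then be compared against a threshold to decide whether the sound associated with the second posture model should accompany the virtual viewpoint image; the threshold itself would also be a design choice not stated in this passage.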

Specifically, the first information processing device 100 generates the data including the generated virtual viewpoint image and the acquired sound by synthesizing the acquired acoustic data with the generated virtual viewpoint image data to produce virtual viewpoint image data with acoustic data. The first information processing device 100 then outputs the synthesized virtual viewpoint image data with acoustic data to the output device 19. The first information processing device 100 includes a first image group acquisition unit 101, a first foreground acquisition unit 102, an image generation unit 103, a first model generation unit 104, a model acquisition unit 105, a sound acquisition unit 106, an association unit 107, an evaluation unit 108, and a data generation unit 109. Details of each unit included in the first information processing device 100 will be described later. Hereinafter, it is assumed that the imaging devices 11 are time-synchronized with one another. It is also assumed that the captured image data output by each imaging device 11 includes information indicating the time at which the captured image was captured (imaging time information). Since methods for synchronizing time between devices are well known, a description thereof is omitted.

The output device 19 has a display output unit configured by an LCD or the like and an audio output unit configured by a speaker or the like. It renders the virtual viewpoint image data output by the first information processing device 100 and outputs the virtual viewpoint image and the sound so that they can be viewed and heard. The destination to which the first information processing device 100 outputs the synthesized virtual viewpoint image data is not limited to the output device 19; the first information processing device 100 may output the synthesized virtual viewpoint image data to a storage device not shown in FIG. 1. In this case, the first information processing device 100 writes the synthesized virtual viewpoint image data to the storage device and causes the storage device to store it.

The hardware configuration of the first information processing device 100 will be described. The processing of each unit included in the first information processing device 100 is performed by hardware such as an ASIC (Application Specific Integrated Circuit) built into the first information processing device 100. The processing may instead be performed by hardware such as an FPGA (Field Programmable Gate Array). The processing may also be performed by software using a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), and a memory.

With reference to FIG. 2, the hardware configuration of the first information processing device 100 in the case where each unit included in the first information processing device 100 operates as software will be described. FIG. 2 is a block diagram showing an example of the hardware configuration of the first information processing device 100 according to Embodiment 1. The first information processing device 100 is configured by a computer, and, as shown as an example in FIG. 2, the computer has a CPU 201, a ROM 202, a RAM 203, an auxiliary storage device 204, a display unit 205, an operation unit 206, a communication unit 207, and a bus 208.

The CPU 201 controls the computer using programs or data stored in the ROM 202 or the RAM 203, thereby causing the computer to function as each unit included in the first information processing device 100 shown in FIG. 1. The first information processing device 100 may have one or more pieces of dedicated hardware different from the CPU 201, and the dedicated hardware may execute at least part of the processing otherwise performed by the CPU 201. Examples of dedicated hardware include an ASIC, an FPGA, and a DSP (digital signal processor). The ROM 202 stores programs and the like that do not need to be changed. The RAM 203 temporarily stores programs or data supplied from the auxiliary storage device 204, data supplied from the outside via the communication unit 207, and the like. The auxiliary storage device 204 is configured by, for example, a hard disk drive, and stores various data such as image data and audio data.

The display unit 205 is configured by, for example, a liquid crystal display or LEDs, and displays a GUI (Graphical User Interface) or the like with which the user operates or views the first information processing device 100. The operation unit 206 is configured by, for example, a keyboard, a mouse, a joystick, or a touch panel, and inputs various instructions to the CPU 201 in response to user operations. The CPU 201 also operates as a display control unit that controls the display unit 205 and as an operation control unit that controls the operation unit 206.

The communication unit 207 is used for communication with devices external to the first information processing device 100. For example, when the first information processing device 100 is connected to an external device by wire, a communication cable is connected to the communication unit 207. When the first information processing device 100 has a function of communicating wirelessly with an external device, the communication unit 207 includes an antenna. The bus 208 connects the units of the first information processing device 100 and transmits information between them. In Embodiment 1, the display unit 205 and the operation unit 206 are described as being inside the first information processing device 100, but at least one of the display unit 205 and the operation unit 206 may exist as a separate device outside the first information processing device 100.

The hardware configuration of the second information processing device 150 will be described. As with the first information processing device 100, the processing of each unit included in the second information processing device 150 is performed by hardware such as an ASIC or an FPGA built into the second information processing device 150. The processing may also be performed by software using a CPU or a GPU, and a memory. When each unit included in the second information processing device 150 operates as software, the second information processing device 150 may be configured by the computer shown in FIG. 2, and the computer may be caused to function as each unit included in the second information processing device 150 shown in FIG. 1.

[Processing of each unit]
The processing of each unit included in the second information processing device 150 will be described. The audio acquisition unit 151 acquires the audio signal output by the sound collection device 15 and digitizes the audio signal by AD conversion to generate acoustic data. When generating the acoustic data, the audio acquisition unit 151 generates it so that, for example, it includes information indicating the time at which the audio signal was acquired. The audio acquisition unit 151 stores the generated acoustic data in the auxiliary storage device 204 included in the second information processing device 150 and causes the auxiliary storage device 204 to hold the acoustic data. The sound output unit 156 outputs the acoustic data generated by the audio acquisition unit 151 to the first information processing device 100. Specifically, the sound output unit 156 reads the acoustic data held in the auxiliary storage device 204 and outputs the read acoustic data to the first information processing device 100.
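
As a rough illustration of acoustic data that carries its acquisition time, the sketch below mimics AD conversion in software by quantizing amplitudes to signed integers and storing the time of the first sample. Actual AD conversion is performed by hardware; the container `AcousticData`, the functions `digitize` and `time_of_sample`, and the 16-bit depth are all assumptions made for this example, not details from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class AcousticData:
    start_time: float   # acquisition time of the first sample, in seconds
    sample_rate: int    # samples per second
    samples: list[int]  # quantized amplitudes


def digitize(signal: list[float], start_time: float, sample_rate: int,
             bits: int = 16) -> AcousticData:
    """Quantize amplitudes in [-1.0, 1.0] and attach the acquisition time."""
    peak = 2 ** (bits - 1) - 1
    # Clip out-of-range values before scaling to the integer range.
    quantized = [round(max(-1.0, min(1.0, s)) * peak) for s in signal]
    return AcousticData(start_time, sample_rate, quantized)


def time_of_sample(data: AcousticData, index: int) -> float:
    """Recover the acquisition time of an individual sample."""
    return data.start_time + index / data.sample_rate


data = digitize([0.0, 1.0, -1.0, 2.0], start_time=10.0, sample_rate=4)
print(data.samples, time_of_sample(data, 2))
```

Storing only the start time and the sample rate is one way to make every sample addressable by time, which is what a later association with imaging times would need.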

The second image group acquisition unit 152 acquires the captured image data output by each of the plurality of imaging devices 16. That is, the second image group acquisition unit 152 acquires multi-viewpoint image data from the plurality of imaging devices 16. For each of the plurality of captured images that make up the multi-viewpoint image acquired by the second image group acquisition unit 152, the second foreground acquisition unit 153 extracts the image region corresponding to an object appearing in the captured image and acquires the extracted image region as a foreground region. The second foreground acquisition unit 153 further generates a foreground image showing the acquired foreground region. Here, the object corresponding to the image region extracted as the foreground region generally refers to a dynamic object that moves, or whose position can change, when imaged from the same direction in time series (hereinafter referred to as a "moving object"). In a competition, for example, moving objects are persons such as players or referees in the field where the competition takes place, and, in a ball game, the ball or other equipment used in it. At events such as concerts or entertainment shows, singers, musicians, performers, hosts, and the like are moving objects. Since methods for generating a foreground image from a captured image are well known, a description thereof is omitted.

Based on the foreground regions corresponding to the respective captured images acquired by the second foreground acquisition unit 153, the second model generation unit 154 generates data of the posture model (second posture model) corresponding to the object appearing in the captured images. The method by which the second model generation unit 154 generates the second posture model data will now be described. The second model generation unit 154 first acquires data of a standard three-dimensional shape model imitating a standard human shape (hereinafter referred to as the "standard shape model") and data of a standard posture model corresponding to the standard shape model (hereinafter referred to as the "standard posture model"). The data of the standard shape model and the standard posture model are stored in advance in, for example, the auxiliary storage device 204 included in the second information processing device 150, and the second model generation unit 154 acquires these data by reading them from the auxiliary storage device 204.

The standard shape model 301 and the standard posture model 302 will be described with reference to FIG. 3. FIG. 3 is an explanatory diagram for explaining an example of the standard shape model 301 and the standard posture model 302 according to Embodiment 1. The standard shape model 301 is a model represented by a three-dimensional mesh. The data of the standard shape model 301 includes coordinates indicating the position of each vertex and identification information, such as IDs (identifiers), of the vertices that form each face, such as a triangle or a quadrilateral. The standard shape model 301 may instead be represented by a set of points called voxels.
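
The mesh data described above (vertex coordinates plus, per face, the IDs of its vertices) can be sketched as a small data structure. The class name `ShapeModel` and its methods are hypothetical and only illustrate the layout; the actual data format is not given in this passage.

```python
from dataclasses import dataclass

Vertex = tuple[float, float, float]


@dataclass
class ShapeModel:
    vertices: dict[int, Vertex]   # vertex ID -> position
    faces: list[tuple[int, ...]]  # each face lists the IDs of its vertices

    def face_positions(self, face_index: int) -> list[Vertex]:
        """Resolve the vertex IDs of one face into coordinates."""
        return [self.vertices[vid] for vid in self.faces[face_index]]

    def is_valid(self) -> bool:
        """True when every face has at least 3 vertices and uses known IDs."""
        return all(
            len(face) >= 3 and all(vid in self.vertices for vid in face)
            for face in self.faces
        )


# A tetrahedron: four vertices and four triangular faces.
tetra = ShapeModel(
    vertices={0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0),
              2: (0.0, 1.0, 0.0), 3: (0.0, 0.0, 1.0)},
    faces=[(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)],
)
print(tetra.is_valid(), tetra.face_positions(3))
```

Referring to vertices by ID means that deforming the model only requires updating the coordinate table, while the face list stays unchanged.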

The data of the standard posture model 302 includes information indicating the positions in the standard posture model 302 that correspond to joint parts of the human body, such as the head, shoulders, elbows, wrists, or hands and feet (hereinafter referred to as "joint information 303"). In addition to the joint information 303, the data of the standard posture model 302 includes information indicating the connection relationships between joint parts in the standard posture model 302 (hereinafter referred to as "connection information 304"). The connection information 304 is, for example, information indicating the distances between joint parts in the standard posture model 302. In addition to the joint information 303 and the connection information 304, the data of the standard posture model 302 includes information indicating the angles at the joints in the standard posture model 302 (hereinafter referred to as "angle information"). The angle information is, for example, information indicating, for the line segments that connect adjacent joint parts in the standard posture model 302, the angle formed by adjacent line segments.
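
The three kinds of information above can be related to one another in a short sketch: given joint positions and the list of connected joints, both the distances and the angles between adjacent segments can be derived. The class `PostureModel` and its field and method names are hypothetical, and deriving distances and angles on demand rather than storing them is an assumption of this example.

```python
import math
from dataclasses import dataclass

Point = tuple[float, float, float]


@dataclass
class PostureModel:
    joints: dict[str, Point]            # joint information: name -> position
    connections: list[tuple[str, str]]  # pairs of connected joint parts

    def distance(self, a: str, b: str) -> float:
        """Distance between two joint parts (cf. connection information)."""
        return math.dist(self.joints[a], self.joints[b])

    def angle(self, a: str, pivot: str, b: str) -> float:
        """Angle in degrees at `pivot` between segments pivot-a and pivot-b."""
        p = self.joints[pivot]
        u = [x - y for x, y in zip(self.joints[a], p)]
        v = [x - y for x, y in zip(self.joints[b], p)]
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0.0:
            raise ValueError("a segment has zero length")
        cos = sum(x * y for x, y in zip(u, v)) / norm
        # Clamp to guard against rounding slightly outside [-1, 1].
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


arm = PostureModel(
    joints={"shoulder": (0.0, 0.0, 0.0), "elbow": (1.0, 0.0, 0.0),
            "wrist": (1.0, 1.0, 0.0)},
    connections=[("shoulder", "elbow"), ("elbow", "wrist")],
)
print(arm.distance("shoulder", "elbow"), arm.angle("shoulder", "elbow", "wrist"))
```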

The second model generation unit 154 deforms the standard shape model 301, which corresponds to the standard posture model 302 shown as an example in FIG. 3, so that it matches the foreground regions acquired by the second foreground acquisition unit 153. The second model generation unit 154 takes the standard posture model 302 in the state that matched best as a result of the deformation as the estimated posture model corresponding to the object (the second posture model), and generates data of the second posture model. However, the method of generating the second posture model is not limited to the method described above. For example, the second model generation unit 154 may estimate the two-dimensional posture of the object in each two-dimensional image, estimate a three-dimensional posture model based on the position, imaging direction, angle of view, and the like of each imaging device 16, and thereby generate the data of the second posture model.

The second model generation unit 154 stores the data of the generated second posture model in the auxiliary storage device 204 of the second information processing device 150, which holds the data. When storing the data of the second posture model in the auxiliary storage device 204, the second model generation unit 154 also stores information indicating the imaging time of the captured images used to generate the second posture model (imaging time information), in association with the data of that second posture model. The model output unit 155 outputs the data of the second posture model generated by the second model generation unit 154, together with the imaging time information associated with that data, to the first information processing device 100. Specifically, the model output unit 155 reads the data of the second posture model and the imaging time information from the auxiliary storage device 204 and outputs them to the first information processing device 100.

The processing of each unit of the first information processing device 100 will now be described. The model acquisition unit 105 acquires the data of the second posture model output by the model output unit 155 of the second information processing device 150, together with the imaging time information associated with that data. The sound acquisition unit 106 acquires the acoustic data output by the sound output unit 156 of the second information processing device 150. The associating unit 107 associates the data of the second posture model with the acoustic data. Details of this association will be described later with reference to FIGS. 5 and 6.

The first image group acquisition unit 101 acquires the captured image data output by each of the plurality of imaging devices 11. That is, the first image group acquisition unit 101 acquires multi-viewpoint image data from the plurality of imaging devices 11. For each of the captured images constituting the multi-viewpoint images acquired by the first image group acquisition unit 101, the first foreground acquisition unit 102 extracts the image region corresponding to an object appearing in the captured image and acquires the extracted image region as a foreground region. The first foreground acquisition unit 102 further generates a foreground image representing the acquired foreground region. Here, the object is a moving object, for example a natural person present in the first imaging space, such as a studio. The first model generation unit 104 generates data of a posture model corresponding to the object appearing in the captured images (the first posture model), based on the foreground regions, acquired by the first foreground acquisition unit 102, that correspond to the respective captured images. Specifically, for example, the first model generation unit 104 generates the data of the first posture model by the same method that the second model generation unit 154 uses to generate the data of the second posture model, as described above.

The evaluation unit 108 evaluates the degree of matching between the first posture model and the second posture model, based on the data of the first posture model generated by the first model generation unit 104 and the data of the second posture model acquired by the model acquisition unit 105. Specifically, for example, the evaluation unit 108 treats the data of the second posture model as the reference (correct) data and compares the data of the first posture model against it to evaluate the degree of matching. More specifically, the evaluation unit 108 evaluates the degree of matching by comparing at least one of the joint information 303, the connection information 304, and the angle information contained in the data of the first posture model with the corresponding information contained in the data of the second posture model. For example, the evaluation unit 108 determines that the first posture model and the second posture model match when at least one of the difference between the joint information 303, the difference between the connection information 304, and the difference between the angle information is equal to or less than a predetermined threshold.

The evaluation unit 108 may combine two or more of the joint information 303, the connection information 304, and the angle information to evaluate the degree of matching between the first posture model and the second posture model. The evaluation unit 108 may also compare only the parts of the joint information 303 that indicate the positions of predetermined joint parts in the standard posture model 302, such as the head and the joints at the ends of the limbs. Similarly, the evaluation unit 108 may compare only the parts of the connection information 304 that relate to predetermined pairs of joint parts in the standard posture model 302, or only the parts of the angle information that relate to predetermined joint parts in the standard posture model 302. The method by which the evaluation unit 108 evaluates the degree of matching between the first posture model and the second posture model is not limited to those described above.
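As a rough illustration of a threshold-based comparison restricted to joint information, consider the sketch below. The function name `joints_match` and its parameters are hypothetical. Two points are assumptions not stated in the source: the per-joint differences are aggregated by requiring every compared joint to be within the threshold, and both models are expressed in a common coordinate frame (for example, relative to a root joint), since they come from different imaging spaces.

```python
import math
from typing import Optional

Point3 = tuple[float, float, float]


def joints_match(
    first: dict[str, Point3],
    second: dict[str, Point3],
    threshold: float,
    joints_to_compare: Optional[list[str]] = None,
) -> bool:
    """Return True when every compared joint of `first` lies within `threshold`
    of the same joint in `second`, which is treated as the reference model.

    `joints_to_compare` optionally restricts the comparison to predetermined
    joints (for example the head and the ends of the limbs).
    """
    names = joints_to_compare if joints_to_compare is not None else list(second)
    return all(math.dist(first[n], second[n]) <= threshold for n in names)


# Illustrative joint positions in metres (made-up values).
first_model = {"head": (0.0, 0.0, 1.70), "hand": (0.5, 0.0, 1.0)}
second_model = {"head": (0.0, 0.0, 1.72), "hand": (0.5, 0.0, 1.4)}

all_joints = joints_match(first_model, second_model, threshold=0.05)
head_only = joints_match(first_model, second_model, threshold=0.05, joints_to_compare=["head"])
```

With these values the full comparison fails because the hand differs by 0.4, while the head-only comparison succeeds. The same pattern could be applied to connection distances or joint angles.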

The image generation unit 103 generates a virtual viewpoint image based on the multi-viewpoint images acquired by the first image group acquisition unit 101 and on the foreground region, acquired by the first foreground acquisition unit 102, in each captured image constituting the multi-viewpoint images. Specifically, for example, the image generation unit 103 generates the virtual viewpoint image as follows. First, using the foreground images representing the foreground regions in the captured images, the image generation unit 103 generates data of a three-dimensional shape corresponding to the object (hereinafter referred to as a "foreground model") by the visual hull method. Because generating three-dimensional shape data by the visual hull method is well known, its description is omitted.

Next, the image generation unit 103 colors the foreground model by texture-mapping it with at least one of the captured images constituting the multi-viewpoint images. The image generation unit 103 also colors a background model, which serves as the background of the foreground model, by texture-mapping it with a background image prepared in advance. Here, the background image is, for example, a captured image obtained by imaging a stadium, a stage, or the like. Finally, the image generation unit 103 generates the virtual viewpoint image by rendering according to the position of a viewpoint in the three-dimensional virtual space specified by a user or the like (hereinafter referred to as a "virtual viewpoint"). The method by which the image generation unit 103 generates the virtual viewpoint image is not limited to the method described above. For example, the image generation unit 103 may generate a virtual viewpoint image by applying a projective transformation to the captured images, without using three-dimensional shape data.

Based on the degree of matching that results from the evaluation by the evaluation unit 108, the data generation unit 109 generates data that includes the virtual viewpoint image generated by the image generation unit 103 and the sound represented by the acoustic data acquired by the sound acquisition unit 106. Specifically, the data generation unit 109 combines the acoustic data acquired by the sound acquisition unit 106 with the virtual viewpoint image data generated by the image generation unit 103 to generate virtual viewpoint image data with acoustic data. That is, the data including the virtual viewpoint image and the sound generated by the data generation unit 109 is virtual viewpoint image data with acoustic data. More specifically, the data generation unit 109 first acquires the time corresponding to the second posture model that the evaluation unit 108 has determined to match the first posture model. Here, the time corresponding to the second posture model is the imaging time of the captured images from which the foreground regions used to generate the second posture model were acquired.

Next, from the acoustic data that the associating unit 107 has associated with the data of the second posture model, the data generation unit 109 acquires the acoustic data at the time corresponding to the second posture model that the evaluation unit 108 has determined to match the first posture model. Finally, the data generation unit 109 combines the acquired acoustic data for that time with the data of the virtual viewpoint image corresponding to the first posture model that was determined to match, and thereby generates virtual viewpoint image data with acoustic data. Here, the virtual viewpoint image corresponding to the first posture model is the virtual viewpoint image generated using the same foreground regions as those used to generate the first posture model. The data generation unit 109 outputs the generated virtual viewpoint image data with acoustic data to the output device. For the data of a virtual viewpoint image corresponding to a first posture model that the evaluation unit 108 has determined not to match the second posture model, the data generation unit 109 outputs the data to the output device as it is, without combining the acoustic data acquired by the sound acquisition unit 106.

[Operation flow]
The operation of the second information processing device 150 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the flow of processing in the second information processing device 150 according to the first embodiment. In the description of FIG. 4, the symbol "S" denotes a step. First, in S401, the sound acquisition unit 151 acquires the audio signal output by the sound collector 15 and generates acoustic data. In addition, the second image group acquisition unit 152 acquires the captured image data output by each of the plurality of imaging devices 16, that is, multi-viewpoint image data. Next, in S402, for each of the captured images constituting the multi-viewpoint images acquired in S401, the second foreground acquisition unit 153 acquires the foreground region corresponding to the object appearing in the captured image and generates a foreground image representing the foreground region. Next, in S403, the second model generation unit 154 estimates the second posture model and generates data of the second posture model.

Next, in S404, the model output unit 155 outputs the data of the second posture model and the imaging time information associated with that data to the first information processing device 100. Specifically, for example, the model output unit 155 outputs the data of the second posture model and the imaging time information to the first information processing device 100 when it receives an output instruction from the first information processing device 100. In addition, the sound output unit 156 outputs the acoustic data to the first information processing device 100. Specifically, for example, the sound output unit 156 outputs the acoustic data to the first information processing device 100 when it receives an output instruction from the first information processing device 100. After S404, the second information processing device 150 ends the processing of the flowchart shown in FIG. 4.

The operation of the first information processing device 100 will be described with reference to FIG. 5. FIG. 5 is a flowchart showing an example of the flow of processing in the first information processing device 100 according to the first embodiment. In the description of FIG. 5, the symbol "S" denotes a step. First, in S501, the model acquisition unit 105 acquires the data of the second posture model and the imaging time information associated with that data from the second information processing device 150. In addition, the sound acquisition unit 106 acquires the acoustic data from the second information processing device 150.

Next, in S502, the associating unit 107 analyzes the acoustic data acquired in S501. Specifically, for example, the associating unit 107 analyzes the volume of the sound represented by the acoustic data, that is, the amplitude of the audio signal corresponding to the sound, and searches the acoustic data for the point in time at which the volume reaches a peak. In this volume analysis, the associating unit 107 may, for example, analyze the volume of the sound at a predetermined frequency, such as 48 kilohertz (kHz), among the frequencies contained in the sound. The associating unit 107 also cuts out part of the acoustic data based on the analysis result of S502. One conceivable example of such cutting out is to extract, from the acoustic data acquired in S501, the portion corresponding to a landing sound or the like produced when a natural person, who is the object, jumps on the stage.

In this case, any sound other than the landing sound produced by the stage floor at the moment of landing is unnecessary. Therefore, acoustic data outside the period to be combined with the virtual viewpoint image data is deleted, for example by removing the periods before and after the volume peak in which the volume is equal to or less than a predetermined threshold, and the remainder is cut out as acoustic data for synthesis. The method of cutting out the acoustic data for synthesis is not limited to the one described above.
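One simple way to realize this kind of peak search and cut-out is sketched below. The function name `cut_out_around_peak` is hypothetical, and the sketch makes two assumptions beyond the text: the peak is taken to be the single loudest sample, and the input is an amplitude envelope (for example short-window RMS values) rather than a raw waveform, so that one quiet sample does not end the clip prematurely.

```python
def cut_out_around_peak(envelope: list[float], silence_threshold: float) -> tuple[int, int, int]:
    """Return (start, peak, end) indices; envelope[start:end] is the clip to keep.

    Starting at the loudest sample, the clip is extended in both directions
    until the amplitude falls to `silence_threshold` or below.
    """
    if not envelope:
        raise ValueError("envelope must not be empty")
    # Index of the loudest sample (the first one if there are ties).
    peak = max(range(len(envelope)), key=lambda i: abs(envelope[i]))
    start = peak
    while start > 0 and abs(envelope[start - 1]) > silence_threshold:
        start -= 1
    end = peak + 1
    while end < len(envelope) and abs(envelope[end]) > silence_threshold:
        end += 1
    return start, peak, end


# Made-up envelope: near silence, a short loud event, then near silence again.
envelope = [0.0, 0.01, 0.02, 0.5, 0.9, 0.4, 0.03, 0.0]
start, peak, end = cut_out_around_peak(envelope, silence_threshold=0.05)
clip = envelope[start:end]  # -> [0.5, 0.9, 0.4]
```

Returning indices rather than the clip itself keeps the peak position available, which is what the later association with a posture model frame relies on.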

Next, in S503, the associating unit 107 associates the acoustic data for synthesis with the data of the second posture model. Specifically, based on the result of the analysis in S502, the associating unit 107 first identifies the data of the second posture model generated from the captured images that were captured at the same time as the point at which the volume of the acoustic data for synthesis reaches its peak. In the following description, the data of a posture model generated from the captured images captured at a given time is referred to as the frame data of a posture model frame. In other words, based on the result of the analysis in S502, the associating unit 107 identifies the frame data of the second posture model frame generated from the captured images that were captured at the same time as the point at which the volume of the acoustic data for synthesis reaches its peak. Next, the associating unit 107 associates the time corresponding to that peak with the frame data of the identified second posture model frame. In this way, the associating unit 107 associates the acoustic data for synthesis with the data of the second posture model.

The association processing in the associating unit 107 will be described with reference to FIG. 6. FIG. 6 is an explanatory diagram illustrating an example of the association processing in the associating unit 107 according to the first embodiment. The upper part of FIG. 6 shows second posture model frames 601a to 601e arranged in time series, and the lower part shows acoustic data 602a to 602c represented as a time-series audio signal. As an example, the lower part of FIG. 6 shows, as an audio signal, the part of the acoustic data that corresponds to 48 kHz. Here, the acoustic data 602b is the acoustic data for synthesis cut out by the cut-out processing in S502. The acoustic data 602a and 602c are the portions of the acoustic data acquired in S501 that were deleted by the cut-out processing in S502. In FIG. 6, the horizontal axis is the time axis, and in the lower part of FIG. 6 the vertical axis indicates the amplitude of the audio signal.

The acoustic data 602b, as acoustic data for synthesis to be combined with a virtual viewpoint image, is associated with one of the second posture model frames 601a to 601e. The analysis in S502 identifies the time corresponding to the point at which the volume of the acoustic data 602b is greatest, that is, the point at which the acoustic data 602b reaches its maximum amplitude. After that, the second posture model frame generated from the captured images captured at that same time is identified. In the example shown in FIG. 6, the time corresponding to the second posture model frame 601d coincides with the time at which the acoustic data 602b reaches its maximum amplitude, so the associating unit 107 associates the acoustic data 602b with the second posture model frame 601d.

However, even if the second posture model frames and the acoustic data are time-synchronized with each other, the frame rate of the second posture model frames and the sampling rate of the acoustic data may differ. In that case, there may be no second posture model frame at exactly the time at which the acoustic data reaches its maximum amplitude. In such a case, the second posture model frame whose time is closest to the time of maximum amplitude is identified, and the time of maximum amplitude is associated with that second posture model frame. The method of associating the acoustic data with the second posture model is not limited to the one described above, and any method that can synchronize the sound with the second posture model may be used.
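The nearest-frame selection described above can be sketched as follows. The function names, the 30 frames-per-second frame times, and the 44.1 kHz sampling rate are all assumptions chosen for illustration; the source does not state concrete rates.

```python
def sample_index_to_time(sample_index: int, sample_rate_hz: float, start_time: float = 0.0) -> float:
    """Convert an audio sample index to a time in seconds on the shared clock."""
    return start_time + sample_index / sample_rate_hz


def nearest_frame_index(frame_times: list[float], peak_time: float) -> int:
    """Index of the posture model frame whose imaging time is closest to `peak_time`."""
    if not frame_times:
        raise ValueError("frame_times must not be empty")
    return min(range(len(frame_times)), key=lambda i: abs(frame_times[i] - peak_time))


# Five posture model frames at an assumed 30 frames per second.
frame_times = [i / 30 for i in range(5)]
# Peak found at sample 4600 of audio sampled at an assumed 44.1 kHz (about 0.104 s).
peak_time = sample_index_to_time(4600, sample_rate_hz=44_100)
matched = nearest_frame_index(frame_times, peak_time)  # -> 3, the fourth frame (0.1 s)
```

The sketch presumes that both streams share a common time origin, which corresponds to the time synchronization mentioned in the text.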

The structure of the information indicating the association between acoustic data and posture models made by the associating unit 107 will be described with reference to FIG. 7. FIG. 7 is an explanatory diagram illustrating an example of the structure of this information according to the first embodiment. As shown in FIG. 7, for example, pattern numbers (pattern No.) are assigned according to the number of patterns of acoustic data to be combined with virtual viewpoint images. Here, each pattern holds the acoustic data to be combined with virtual viewpoint images for, for example, one performing-arts program or one shooting scene. The "number of sounds" shown in FIG. 7 holds, for each pattern No., the number of pieces of acoustic data to be combined with virtual viewpoint images, that is, the number of pieces of acoustic data for synthesis cut out in S502. Each item of "acoustic information" shown in FIG. 7 stores acoustic data for synthesis, and the corresponding "posture estimation model data" stores the frame data of the second posture model frame at the time at which that acoustic data for synthesis reaches its maximum amplitude.
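A possible in-memory layout for this association information is sketched below. The class and field names (`Pattern`, `SoundEntry`, and so on) are hypothetical and only mirror the items named in the text: a pattern No., a number of sounds, and, per sound, the acoustic data for synthesis together with the frame data of the associated second posture model frame.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SoundEntry:
    """One piece of acoustic data for synthesis and its associated posture frame."""
    sound: list[float]          # "acoustic information": cut-out acoustic data
    posture_frame: Any          # "posture estimation model data": second posture model frame data


@dataclass
class Pattern:
    """All sounds registered for one program or shooting scene."""
    pattern_no: int
    entries: list[SoundEntry] = field(default_factory=list)

    @property
    def sound_count(self) -> int:
        """The "number of sounds" for this pattern, derived from the entries."""
        return len(self.entries)


# Example: pattern 1 holds a single landing sound tied to one posture frame.
landing = SoundEntry(sound=[0.5, 0.9, 0.4], posture_frame={"head": (0.0, 0.0, 1.7)})
pattern = Pattern(pattern_no=1, entries=[landing])
```

Deriving the count from the list avoids the two values drifting apart; a serialized format could still store the count explicitly, as in FIG. 7.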

After S503, in S511, the first image group acquisition unit 101 acquires the captured image data output by each of the plurality of imaging devices 11, that is, multi-viewpoint image data. Next, in S512, for each of the captured images constituting the multi-viewpoint images acquired in S511, the first foreground acquisition unit 102 acquires the foreground region corresponding to the object appearing in the captured image and generates a foreground image representing the foreground region. Next, in S513, the image generation unit 103 generates a virtual viewpoint image. Next, in S514, the first model generation unit 104 estimates the first posture model and generates data of the first posture model.

Next, in S515, the evaluation unit 108 evaluates the degree of matching between the first posture model and the second posture model and determines whether the two match. If they are determined to match in S515, then in S516 the data generation unit 109 combines the acoustic data for synthesis that was associated with the data of the second posture model in S503 with the virtual viewpoint image data generated in S513. The data generation unit 109 then outputs the combined virtual viewpoint image data to the output device. If the first posture model and the second posture model are determined not to match in S515, the data generation unit 109 outputs the virtual viewpoint image data generated in S513 to the output device as it is.

In S520, the first information processing device 100 determines whether a termination condition is satisfied. The termination condition is satisfied, for example, when an operation signal indicating a termination instruction is received from the user. The first information processing device 100 repeats the processing from S511 to S516 until the termination condition is satisfied, and ends the processing of the flowchart shown in FIG. 5 when it is satisfied.
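The repeated portion of FIG. 5 can be summarized by a short loop. The sketch below is an assumption-laden outline, not the device's implementation: `run_loop`, `find_matching_sound`, and `should_stop` are hypothetical names, image capture and model estimation (S511 to S514) are represented by an already prepared sequence of frames, and the position of the termination check within the loop is chosen for convenience.

```python
from typing import Any, Callable, Iterable, Iterator, Optional


def run_loop(
    frames: Iterable[tuple[Any, Any]],
    find_matching_sound: Callable[[Any], Optional[Any]],
    should_stop: Callable[[], bool] = lambda: False,
) -> Iterator[tuple[Any, Optional[Any]]]:
    """Yield (virtual_viewpoint_image, sound_or_None) for each frame.

    `frames` supplies (image, first_posture_model) pairs, standing in for S511 to S514.
    `find_matching_sound` returns the associated sound when a second posture model
    matches (S515), or None when nothing matches.
    """
    for image, first_posture in frames:
        if should_stop():  # termination check, corresponding to S520
            break
        sound = find_matching_sound(first_posture)
        # With a sound this corresponds to S516; with None the image passes through unchanged.
        yield image, sound


# Toy data: postures are labels, and only the "land" posture has a registered sound.
frames = [("img0", "stand"), ("img1", "land"), ("img2", "stand")]
sounds = {"land": "thud"}
output = list(run_loop(frames, sounds.get))
```

Here only the second frame is paired with a sound, mirroring the branch in S515 where non-matching frames are output without acoustic data.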

As described above, the first information processing device 100 makes it possible to efficiently combine virtual viewpoint image data and acoustic data acquired in different imaging spaces. As a result, sound that cannot be reproduced in a studio or the like can be combined efficiently with a virtual viewpoint image, which reduces the workload of combining acoustic data with virtual viewpoint image data.

In the first embodiment, the first information processing device 100 has been described as including the first image group acquisition unit 101, the first foreground acquisition unit 102, the image generation unit 103, and the first model generation unit 104, but it is not limited to this configuration. For example, if the first information processing device 100 has a first posture model acquisition unit (not shown in FIG. 1) that acquires data of a first posture model generated by a device other than the first information processing device 100, the first information processing device 100 need not have the first model generation unit 104. Similarly, if the first information processing device 100 has an image acquisition unit (not shown in FIG. 1) that acquires a virtual viewpoint image generated by a device other than the first information processing device 100, the first information processing device 100 need not have the image generation unit 103. The first information processing device 100 may also include the units of the second information processing device 150. That is, the first information processing device 100 may have the functions of the second information processing device 150. If the first information processing device 100 has all the components of the second information processing device 150, the information processing system 1 need not include the second information processing device 150.

(Embodiment 2)
An information processing system 1 according to the second embodiment will be described with reference to FIGS. 8 to 10. The configuration of the information processing system 1 according to the second embodiment is the same as that of the information processing system 1 according to the first embodiment shown as an example in FIG. 1. That is, the information processing system 1 includes a plurality of imaging devices 11, a sound collector 15, a plurality of imaging devices 16, a first information processing device 100, a second information processing device 150, and an output device 19.

The information processing system 1 according to the first embodiment operates as follows. First, the second information processing apparatus 150 generates, in advance, data of a second posture model and acoustic data based on captured image data and audio signals obtained by imaging and sound collection in a second imaging space such as a stage. Next, the first information processing apparatus 100 generates data of a first posture model and virtual viewpoint image data based on captured image data obtained by imaging in a first imaging space such as a studio. The first information processing apparatus 100 then evaluates the degree of matching between the data of the first posture model and the data of the second posture model, synthesizes the virtual viewpoint image data and the acoustic data based on that degree of matching, and outputs the synthesized virtual viewpoint image data.

In contrast, the information processing system 1 according to the second embodiment (hereinafter simply "information processing system 1") operates as follows. First, the second information processing apparatus 150 generates, in advance, data of a second posture model and acoustic data based on captured image data and audio signals obtained by imaging and sound collection in a second imaging space such as a stage. It also generates physical information in advance by analyzing the data of the second posture model. Physical information is described later. Next, the first information processing apparatus 100 generates data of a first posture model and virtual viewpoint image data based on captured image data obtained by imaging in a first imaging space such as a studio. It also generates physical information by analyzing the data of the first posture model. The first information processing apparatus 100 then evaluates the degree of matching between, on the one hand, the data of the first posture model and its corresponding physical information and, on the other hand, the data of the second posture model and its corresponding physical information. Finally, the first information processing apparatus 100 synthesizes the virtual viewpoint image data and the acoustic data based on that degree of matching, and outputs the synthesized virtual viewpoint image data.

[Configuration]
The functional block configuration of the first information processing apparatus 100 according to the second embodiment (hereinafter simply "first information processing apparatus 100") is the same as that of the first information processing apparatus 100 according to the first embodiment, shown as an example in FIG. 1, so its description is omitted. That is, the first information processing apparatus 100 includes a first image group acquisition unit 101, a first foreground acquisition unit 102, an image generation unit 103, a first model generation unit 104, a model acquisition unit 105, a sound acquisition unit 106, an associating unit 107, an evaluation unit 108, and a data generation unit 109. Likewise, the functional block configuration of the second information processing apparatus 150 according to the second embodiment (hereinafter simply "second information processing apparatus 150") is the same as that of the second information processing apparatus 150 according to the first embodiment, shown as an example in FIG. 1, so its description is omitted. That is, the second information processing apparatus 150 includes an audio acquisition unit 151, a second image group acquisition unit 152, a second foreground acquisition unit 153, a second model generation unit 154, a model output unit 155, and a sound output unit 156. The hardware configurations of the first information processing apparatus 100 and the second information processing apparatus 150 are also the same as in the first embodiment, so their description is omitted.

[Processing of each unit]
The differences between the information processing system 1 and the information processing system 1 according to the first embodiment are described below. First, the processing of each unit of the second information processing apparatus 150 is described. The audio acquisition unit 151, the second image group acquisition unit 152, the second foreground acquisition unit 153, and the sound output unit 156 are the same as the corresponding units according to the first embodiment, so their description is omitted.

In addition to generating the data of the second posture model, the second model generation unit 154 analyzes the generated data of the second posture model and generates physical information corresponding to that data (hereinafter "second physical information"). Here, physical information is information indicating the velocity or acceleration of a joint part of the object in a second posture model frame. Specifically, the second model generation unit 154 generates the second physical information by calculating the velocity or acceleration of the joint part based on the frame data of a plurality of second posture model frames. The second model generation unit 154 stores the generated second physical information in the auxiliary storage device 204 of the second information processing apparatus 150, together with the data of the second posture model and information indicating the capture time of the captured images used to generate the second posture model (imaging time information). Specifically, the second model generation unit 154 stores the second physical information in the auxiliary storage device 204 in association with the frame data of the second posture model frame.
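As an illustration only, the calculation of joint velocity and acceleration from a plurality of posture model frames could be sketched with finite differences as follows. The frame layout (a capture time in seconds plus joint positions keyed by name), the function names, and the use of finite differences are assumptions made for this sketch; the disclosure does not specify how the velocity or acceleration is computed.

```python
from typing import Dict, Tuple

Vec = Tuple[float, ...]
# Hypothetical frame layout: (capture time in seconds, joint name -> position).
Frame = Tuple[float, Dict[str, Vec]]


def joint_velocity(prev: Frame, curr: Frame, joint: str) -> Vec:
    """Finite-difference velocity of one joint between two frames."""
    dt = curr[0] - prev[0]
    if dt <= 0:
        raise ValueError("frames must be in strictly increasing time order")
    p0, p1 = prev[1][joint], curr[1][joint]
    return tuple((b - a) / dt for a, b in zip(p0, p1))


def joint_acceleration(f0: Frame, f1: Frame, f2: Frame, joint: str) -> Vec:
    """Acceleration from the change in velocity across three consecutive frames."""
    v01 = joint_velocity(f0, f1, joint)
    v12 = joint_velocity(f1, f2, joint)
    # The two velocities are centred on the midpoints of their intervals,
    # so the elapsed time between them is half the total span.
    dt = (f2[0] - f0[0]) / 2
    return tuple((b - a) / dt for a, b in zip(v01, v12))
```

With frames captured at a fixed rate, the resulting values could be stored alongside each frame's data, mirroring the association between physical information and frame data described above.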

The model output unit 155 outputs, to the first information processing apparatus 100, the second physical information generated by the second model generation unit 154 in addition to the data of the second posture model and the imaging time information associated with that data. Specifically, the model output unit 155 reads the data of the second posture model, the imaging time information, and the second physical information from the auxiliary storage device 204, and outputs them to the first information processing apparatus 100.

Next, the processing of each unit of the first information processing apparatus 100 is described. The first image group acquisition unit 101, the first foreground acquisition unit 102, the image generation unit 103, the sound acquisition unit 106, and the data generation unit 109 are the same as the corresponding units according to the first embodiment, so their description is omitted. In addition to generating the data of the first posture model, the first model generation unit 104 analyzes the generated data of the first posture model and generates physical information corresponding to that data (hereinafter "first physical information"). Specifically, the first model generation unit 104 generates the first physical information by calculating the velocity or acceleration of the joint part based on the frame data of a plurality of first posture model frames. The generated first physical information is associated with the frame data of the first posture model frame.

The model acquisition unit 105 acquires the second physical information in addition to the data of the second posture model output by the model output unit 155 of the second information processing apparatus 150 and the imaging time information associated with that data. The acquired second physical information is associated with the frame data of the second posture model frames. The associating unit 107 has a function of cutting out acoustic data for synthesis from the acoustic data, and a function of associating with one another the acoustic data for synthesis, the frame data of a second posture model frame, and the second physical information corresponding to that frame data. The function of cutting out acoustic data for synthesis was described in the first embodiment, so its description is omitted. The method of associating the acoustic data for synthesis with the frame data of a second posture model frame was also described in the first embodiment and is likewise omitted. Because the second physical information is already associated with the frame data of the second posture model frames, that association is not described further.

The evaluation unit 108 has a function of evaluating the degree of matching between the first posture model and the second posture model based on the data of the first posture model and the data of the second posture model. Specifically, the evaluation unit 108 evaluates the degree of matching by comparing at least one of the joint information 303, the connection information 304, and the angle information included in the data of the first posture model with its counterpart in the data of the second posture model. In addition, the evaluation unit 108 has a function of evaluating the degree of matching between the first posture model and the second posture model based on the first physical information and the second physical information. Specifically, the evaluation unit 108 compares information indicating the velocity or acceleration of a joint part of the object in the first posture model with information indicating the velocity or acceleration of the corresponding joint part of the object in the second posture model. For example, when at least one of the difference between the velocities and the difference between the accelerations is equal to or less than a predetermined threshold, the evaluation unit 108 determines that the first posture model and the second posture model match.
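The threshold comparison in the example above could be sketched as follows. This is a minimal illustration under stated assumptions: the difference is measured as a Euclidean distance between vectors, and velocity and acceleration each have their own threshold. Neither choice, nor the function name, comes from the disclosure, which only refers to a difference value and a predetermined threshold.

```python
import math
from typing import Sequence


def physical_info_matches(
    v1: Sequence[float],
    v2: Sequence[float],
    a1: Sequence[float],
    a2: Sequence[float],
    v_threshold: float,
    a_threshold: float,
) -> bool:
    """Return True when at least one of the velocity difference and the
    acceleration difference is at or below its threshold."""
    velocity_diff = math.dist(v1, v2)      # assumed metric: Euclidean distance
    acceleration_diff = math.dist(a1, a2)  # assumed metric: Euclidean distance
    return velocity_diff <= v_threshold or acceleration_diff <= a_threshold
```

Requiring both differences to fall below their thresholds would be a stricter variant; the text above describes the "at least one" form.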

More specifically, for example, the evaluation unit 108 first evaluates the degree of matching between the first posture model and the second posture model based on the data of the first posture model and the data of the second posture model. Then, when the first posture model and the second posture model are determined to match on that basis, the evaluation unit 108 evaluates the degree of matching between the two models based on the first physical information and the second physical information. Such stepwise evaluation can improve the accuracy of evaluating the degree of matching between the first posture model and the second posture model. Note that when the first posture model and the second posture model are determined to match based on the data of the two posture models, the evaluation unit 108 may instruct the first model generation unit 104 and the model acquisition unit 105 to generate the physical information.
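The stepwise evaluation described above can be sketched as a two-stage check in which the physical-information comparison runs only after the posture comparison succeeds. In this sketch the first stage compares joint angles with a tolerance in degrees; the dictionary layout, the tolerance parameter, and the function names are hypothetical, and the second stage is passed in as a callable so that it is evaluated lazily.

```python
from typing import Callable, Dict


def angles_match(a1: Dict[str, float], a2: Dict[str, float], tol_deg: float) -> bool:
    """Stage 1: every joint angle present in both models differs by at most tol_deg."""
    shared = a1.keys() & a2.keys()
    return bool(shared) and all(abs(a1[j] - a2[j]) <= tol_deg for j in shared)


def staged_match(
    a1: Dict[str, float],
    a2: Dict[str, float],
    tol_deg: float,
    physical_check: Callable[[], bool],
) -> bool:
    """Run the physical-information check only when the posture check passes."""
    if not angles_match(a1, a2, tol_deg):
        return False
    # Stage 2 is deferred until here, so physical information could be
    # generated on demand rather than for every frame.
    return physical_check()
```

Deferring the second stage corresponds to the variation noted above, in which physical information is generated only after the posture data are found to match.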

[Operation flow]
The operation of the second information processing apparatus 150 is described with reference to FIG. 8. FIG. 8 is a flowchart showing an example of the flow of processing in the second information processing apparatus 150 according to the second embodiment. In the description of FIG. 8, the symbol "S" denotes a step. Steps in FIG. 8 that bear the same reference signs as in FIG. 4 are not described again. First, the second information processing apparatus 150 executes the processing from S401 to S403.

After S403, in S804, the second model generation unit 154 generates the second physical information. Next, in S805, the model output unit 155 outputs the data of the second posture model, the imaging time information associated with that data, and the second physical information to the first information processing apparatus 100. Specifically, for example, the model output unit 155 outputs the data of the second posture model, the imaging time information, and the second physical information to the first information processing apparatus 100 when it receives an output instruction from the first information processing apparatus 100. The sound output unit 156 also outputs the acoustic data to the first information processing apparatus 100; for example, it does so when it receives an output instruction from the first information processing apparatus 100. After S805, the second information processing apparatus 150 ends the processing of the flowchart shown in FIG. 8.

The operation of the first information processing apparatus 100 is described with reference to FIG. 9. FIG. 9 is a flowchart showing an example of the flow of processing in the first information processing apparatus 100 according to the second embodiment. In the description of FIG. 9, the symbol "S" denotes a step. Steps in FIG. 9 that bear the same reference signs as in FIG. 5 are not described again. First, the first information processing apparatus 100 executes the processing of S501 and S502.

After S502, in S903, the associating unit 107 associates the acoustic data for synthesis, the data of the second posture model, and the second physical information with one another. Specifically, based on the result of the analysis in step S502, the associating unit 107 first identifies the data of the second posture model generated from the captured images that were captured at the same time as the point at which the volume of the acoustic data for synthesis reaches a local maximum. In other words, it identifies the frame data of the second posture model frame generated from the captured images captured at that time. Next, the associating unit 107 associates with one another the acoustic data for synthesis, the frame data of that second posture model frame, and the second physical information corresponding to that frame data. In this way, the associating unit 107 associates the acoustic data for synthesis, the data of the second posture model, and the second physical information with one another.
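For illustration, locating the points at which the volume reaches a local maximum could look like the sketch below. It assumes that the analysis in S502 yields a sequence of volume levels, one per short time window; that representation and the function name are assumptions, since this section does not describe how the volume is measured.

```python
from typing import List, Sequence


def local_maxima(levels: Sequence[float]) -> List[int]:
    """Indices where the volume level rises from the previous window and
    does not rise again in the next one (a simple local-maximum test)."""
    return [
        i
        for i in range(1, len(levels) - 1)
        if levels[i - 1] < levels[i] >= levels[i + 1]
    ]
```

Each returned index would then be converted to a time and used to look up the second posture model frame captured at that time.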

The association processing in the associating unit 107, in particular the association between the acoustic data for synthesis and the frame data of a plurality of second posture model frames, is described with reference to FIG. 10. FIG. 10 is an explanatory diagram illustrating an example of the association processing in the associating unit 107 according to the second embodiment. In FIG. 10, the upper part shows second posture model frames 1001a to 1001e arranged in time series, and the lower part shows acoustic data 1002a to 1002c as a time-series audio signal. As an example, the lower part of FIG. 10 shows, as an audio signal, the portion of the acoustic data that corresponds to 48 kHz. Here, acoustic data 1002b is the acoustic data for synthesis cut out by the cut-out processing in S502. Acoustic data 1002a and 1002c are the portions of the audio data acquired in S501 that were removed by the cut-out processing in S502. In FIG. 10, the horizontal axis is the time axis, and in the lower part of FIG. 10 the vertical axis indicates the amplitude of the audio signal.

The acoustic data 1002b is associated with one of the second posture model frames 1001a to 1001e as acoustic data for synthesis to be combined with the virtual viewpoint image. The analysis in S502 identifies the time at which the volume of the acoustic data 1002b is greatest, that is, the time at which the acoustic data 1002b reaches its maximum amplitude. The second posture model frame generated from the captured images captured at that time is then identified. In the example shown in FIG. 10, the time corresponding to the second posture model frame 1001d coincides with the time at which the acoustic data 1002b reaches its maximum amplitude. The associating unit 107 therefore associates the acoustic signal 1002b with the second posture model frame 1001d. If no second posture model frame exists at the same time as the time at which the acoustic data reaches its maximum amplitude, then, as with the associating unit 107 according to the first embodiment, the second posture model frame whose time is closest to that time is identified first. The time at which the acoustic data reaches its maximum amplitude is then associated with the identified second posture model frame.
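The association between the maximum-amplitude time and a posture model frame, including the fallback to the closest frame, could be sketched as follows. The sample list, sample rate, and frame time list are hypothetical inputs chosen for this sketch, and ties between equally distant frames are resolved in favour of the earlier frame, which the disclosure does not specify.

```python
from typing import Sequence


def peak_time(samples: Sequence[float], sample_rate: float, start_time: float = 0.0) -> float:
    """Time of the sample with the largest absolute amplitude."""
    if not samples:
        raise ValueError("samples must not be empty")
    idx = max(range(len(samples)), key=lambda i: abs(samples[i]))
    return start_time + idx / sample_rate


def nearest_frame_index(frame_times: Sequence[float], t: float) -> int:
    """Index of the frame whose capture time is closest to t.
    An exact match has distance zero and is therefore selected directly."""
    if not frame_times:
        raise ValueError("frame_times must not be empty")
    return min(range(len(frame_times)), key=lambda i: abs(frame_times[i] - t))
```

For example, `nearest_frame_index(frame_times, peak_time(samples, rate))` gives the index of the frame to associate with the cut-out acoustic data.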

After the acoustic signal 1002b and the second posture model frame 1001d are associated, the acoustic signal 1002b is also associated with one or more second posture model frames that fall within a predetermined period before and after the time of the second posture model frame 1001d. The acoustic signal 1002b need not be associated with every second posture model frame in that period. For example, it may be associated only with those second posture model frames in the period that are separated from one another by a predetermined time interval. The method of determining which second posture model frames near the second posture model frame 1001d are associated with the acoustic signal 1002b is not limited to the above. Likewise, the method of associating the acoustic data with the second posture model is not limited to the above; any method that can synchronize the sound with the second posture model may be used.
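One way to select the neighbouring frames is sketched below. It assumes frames are captured at a fixed rate, so that keeping every `step`-th frame counted from the anchor approximates a predetermined time interval; the parameter names and this index-based thinning are assumptions made for illustration, not the method of the disclosure.

```python
from typing import List, Sequence


def frames_in_window(
    frame_times: Sequence[float],
    anchor_index: int,
    half_window: float,
    step: int = 1,
) -> List[int]:
    """Indices of frames within +/- half_window seconds of the anchor frame,
    keeping only every step-th frame counted from the anchor."""
    if step < 1:
        raise ValueError("step must be at least 1")
    anchor_t = frame_times[anchor_index]
    return [
        i
        for i, t in enumerate(frame_times)
        if abs(t - anchor_t) <= half_window and (i - anchor_index) % step == 0
    ]
```

The anchor frame itself is always included, because its index offset is zero.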

After S503, the first information processing apparatus 100 executes the processing from S511 to S514. After S514, in S911, the first model generation unit 104 generates the first physical information. Next, in S912, the evaluation unit 108 evaluates the degree of matching between the frame data of the initial first posture model frame to be evaluated and the frame data of the second posture model frame, and determines whether the first posture model frame and the second posture model frame match. If they are determined to match in S515, the evaluation unit 108 executes the processing of S913. In S913, the evaluation unit 108 determines whether, for the frames from the initial posture model frame to be evaluated onward, the frame data and first physical information of the first posture model frames match the frame data and second physical information of the second posture model frames.

If a match is determined in S913, then in S516 the data generation unit 109 synthesizes, with the virtual viewpoint image data generated in S513, the acoustic data for synthesis that was associated with the data of the second posture model in S503. The data generation unit 109 then outputs the synthesized virtual viewpoint image data to the output device. If no match is determined in S912 or S913, the data generation unit 109 outputs the virtual viewpoint image data generated in S513 to the output device as it is. In S520, the first information processing apparatus 100 determines whether a termination condition is satisfied. The first information processing apparatus 100 repeats the processing from S511 to S516 until the termination condition is satisfied, and ends the processing of the flowchart shown in FIG. 9 when it is satisfied.
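The branch described above, in which sound is attached only when both determinations succeed, could be sketched as follows. The `OutputFrame` type, the byte-string payloads, and the function name are placeholders invented for this sketch; the actual data formats are not given in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputFrame:
    image: bytes            # virtual viewpoint image data (placeholder type)
    audio: Optional[bytes]  # acoustic data for synthesis, or None when not attached


def compose_output(
    image: bytes,
    audio_clip: bytes,
    posture_matched: bool,
    physical_matched: bool,
) -> OutputFrame:
    """Attach the audio clip only when both the posture determination and the
    physical-information determination succeeded; otherwise pass the image through."""
    if posture_matched and physical_matched:
        return OutputFrame(image=image, audio=audio_clip)
    return OutputFrame(image=image, audio=None)
```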

As described above, the first information processing apparatus 100 can efficiently synthesize virtual viewpoint image data and acoustic data acquired in mutually different imaging spaces. As a result, sound that cannot be reproduced in a studio or the like can be efficiently synthesized with the virtual viewpoint image, which reduces the workload of synthesizing the acoustic data with the virtual viewpoint image data. Furthermore, in addition to the degree of matching between the data of the first posture model and the data of the second posture model, the first information processing apparatus 100 evaluates the degree of matching between the first physical information corresponding to the first posture model and the second physical information corresponding to the second posture model. This improves the accuracy of automatically synthesizing virtual viewpoint image data and acoustic data acquired in mutually different imaging spaces. As a result, acoustic data from the wrong time is less likely to be synthesized with the virtual viewpoint image data.

(Other embodiments)
The present disclosure can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of that system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

Within the scope of the present disclosure, the embodiments may be freely combined, any component of any embodiment may be modified, and any component of any embodiment may be omitted.

100 Information processing apparatus
103 Image generation unit
104 First model generation unit
105 Model acquisition unit
106 Sound acquisition unit
108 Evaluation unit
109 Data generation unit

Claims

1. An information processing apparatus comprising:
first model acquisition means for acquiring data of a first posture model generated based on first multi-viewpoint images obtained from a plurality of imaging devices imaging a first imaging space;
image acquisition means for acquiring data of a virtual viewpoint image generated based on the first multi-viewpoint images;
second model acquisition means for acquiring data of a second posture model generated based on second multi-viewpoint images obtained from a plurality of imaging devices imaging a second imaging space different from the first imaging space;
sound acquisition means for acquiring sound data collected when the second multi-viewpoint images are captured in the second imaging space;
evaluation means for evaluating a degree of matching between the first posture model and the second posture model; and
data generation means for generating data including the virtual viewpoint image and the sound based on the degree of matching.

2. The information processing apparatus according to claim 1, further comprising associating means for analyzing the sound data and associating the data of the second posture model with the analyzed sound data,
wherein the data generation means identifies, based on the degree of matching, the first posture model corresponding to the second posture model, and generates data in which the virtual viewpoint image corresponding to the identified first posture model is associated with the sound data associated with the data of the second posture model identified as corresponding to that first posture model.

3. The information processing apparatus according to claim 2, wherein the associating means analyzes the sound data, cuts out sound data for synthesis from the sound data based on a result of the analysis, and associates the cut-out sound data for synthesis with the data of the second posture model corresponding to the period of the sound data for synthesis, and
the data generation means generates data including the virtual viewpoint image and the sound for synthesis.

4. The information processing apparatus according to claim 3, wherein the associating means cuts out the sound data for synthesis based on the volume level of the sound data.

5. The information processing apparatus according to claim 4, wherein the associating means cuts out the sound data for synthesis based on a volume level corresponding to a predetermined frequency in the sound data.

6. The information processing apparatus according to any one of claims 1 to 5, wherein the evaluation means evaluates the degree of matching between the first posture model and the second posture model using information indicating positions of joints of the first posture model, information indicating positions of joints of the second posture model, information indicating connection relationships between joints in the first posture model, and information indicating connection relationships between joints in the second posture model.

7. The information processing apparatus according to any one of claims 1 to 6, further comprising physical information acquisition means for acquiring first physical information that is physical information of the first posture model and second physical information that is physical information of the second posture model,
wherein the evaluation means evaluates the degree of matching between the first posture model and the second posture model based on the first physical information and the second physical information.

8. The information processing apparatus according to claim 7, wherein the first physical information and the second physical information are information indicating at least one of the velocity and acceleration of joint parts in the first posture model and the second posture model, and the angles of parts in the first posture model and the second posture model.

9. The information processing apparatus according to any one of claims 1 to 8, wherein the first posture model is a three-dimensional model representing joints of an object existing in the first imaging space and connection relationships between the joints, and the second posture model is a three-dimensional model representing joints of an object existing in the second imaging space and connection relationships between the joints.

10. The information processing apparatus according to any one of claims 1 to 9, further comprising:
first image group acquisition means for acquiring the first multi-viewpoint images;
first foreground acquisition means for acquiring, for each of a plurality of captured images constituting the first multi-viewpoint images, an image area corresponding to an object in the captured image as a foreground area;
first model generation means for generating the first posture model corresponding to the object based on the foreground area acquired for each of the plurality of captured images; and
image generation means for generating the virtual viewpoint image based on the first multi-viewpoint images and the foreground area acquired for each of the plurality of captured images constituting the first multi-viewpoint images,
wherein the first model acquisition means acquires the data of the first posture model generated by the first model generation means, and
the image acquisition means acquires the data of the virtual viewpoint image generated by the image generation means.

11. The information processing apparatus according to any one of claims 1 to 10, further comprising:
second image group acquisition means for acquiring the second multi-viewpoint images;
second foreground acquisition means for acquiring, for each of a plurality of captured images constituting the second multi-viewpoint images, an image area corresponding to an object in the captured image as a foreground area;
second model generation means for generating the second posture model corresponding to the object based on the foreground area acquired for each of the plurality of captured images; and
sound acquisition means for acquiring, from a sound collecting device installed in the second imaging space, a signal of sound collected by the sound collecting device, and digitizing the signal to acquire the sound data,
wherein the second model acquisition means acquires the data of the second posture model generated by the second model generation means, and
the sound acquisition means acquires the sound data acquired by the sound acquisition means.

12. An information processing system comprising:
a first information processing apparatus that generates data of a first posture model and data of a virtual viewpoint image based on first multi-viewpoint images obtained from a plurality of imaging devices imaging a first imaging space; and
a second information processing apparatus that generates data of a second posture model based on second multi-viewpoint images obtained from a plurality of imaging devices imaging a second imaging space different from the first imaging space, and generates sound data based on a sound signal obtained through sound collection by a sound collecting device installed in the second imaging space,
wherein the first information processing apparatus acquires the data of the second posture model and the sound data generated by the second information processing apparatus, and generates data including the virtual viewpoint image and the sound based on a degree of matching between the first posture model and the second posture model.

13. An information processing method comprising:
a first model acquisition step of acquiring data of a first posture model generated based on first multi-viewpoint images obtained from a plurality of imaging devices imaging a first imaging space;
an image acquisition step of acquiring data of a virtual viewpoint image generated based on the first multi-viewpoint images;
a second model acquisition step of acquiring data of a second posture model generated based on second multi-viewpoint images obtained from a plurality of imaging devices imaging a second imaging space different from the first imaging space;
a sound acquisition step of acquiring data of sound collected in the second imaging space when the second multi-viewpoint images are captured;
an evaluation step of evaluating a degree of matching between the first posture model and the second posture model; and
a data generation step of generating, based on the degree of matching, data including the virtual viewpoint image and the sound.

14. A program for causing a computer to operate as the information processing apparatus according to any one of claims 1 to 11.
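
For illustration only, the volume-level-based cutting out of sound data for synthesis recited in claims 3 and 4 might be sketched in Python as below. The frame length, the RMS level measure, and the threshold value are assumptions made for this sketch and are not taken from the disclosure. Restricting the level to a predetermined frequency, as in claim 5, would require filtering the samples before measuring the level, which is not shown.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def cut_out_segments(samples: Sequence[float],
                     frame_len: int = 4,
                     threshold: float = 0.1) -> list[tuple[int, int]]:
    """Return (start, end) sample index pairs, end exclusive, for runs of
    frames whose RMS level is at or above the (assumed) threshold."""
    segments: list[tuple[int, int]] = []
    start = None
    for i in range(0, len(samples), frame_len):
        loud = rms(samples[i:i + frame_len]) >= threshold
        if loud and start is None:
            start = i                      # a loud run begins
        elif not loud and start is not None:
            segments.append((start, i))    # the loud run ended at this frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))  # run continues to the end
    return segments
```

Dividing the returned sample indices by the sampling rate gives the period of each cut-out segment, which could then be matched to the frames of the second posture model captured during that period.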