JP2022007108A

JP2022007108A - Information processor, information processing method, and program

Info

Publication number: JP2022007108A
Application number: JP2020109838A
Authority: JP
Inventors: 奨平岩本; Shohei Iwamoto
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2020-06-25
Filing date: 2020-06-25
Publication date: 2022-01-13

Abstract

To estimate the position of a sound source relating to a virtual viewpoint image. SOLUTION: An information processing system 10 acquires viewpoint information showing the transition of a virtual viewpoint corresponding to a virtual viewpoint image generated on the basis of a plurality of picked-up images obtained by imaging an imaging area from a plurality of directions. The information processing system 10 estimates the position of a sound source in the imaging area on the basis of the acquired viewpoint information. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for estimating the position of a sound source.

There is a technique in which a plurality of photographing devices are installed at different positions to shoot synchronously from multiple viewpoints, and the plurality of images obtained by the shooting are used to generate a virtual viewpoint image whose viewpoint can be changed arbitrarily. For example, by generating a virtual viewpoint image corresponding to a viewpoint designated by the user from a plurality of images of a competition such as soccer or basketball, the user can watch the competition from various viewpoints.

In addition, ways of making the acoustic signal reproduced together with the virtual viewpoint image more realistic are being studied. The virtual viewpoint corresponding to the virtual viewpoint image can be set at any position in the field where the competition being shot is held, but it is difficult to bring a sound-collecting microphone into the field and have it follow the virtual viewpoint. It is therefore conceivable to install multiple microphones around the field and to select or mix the sound pickup signals obtained from those microphones to generate the acoustic signal reproduced together with the virtual viewpoint image. In this case, selecting and mixing the sound pickup signals based on the position of the sound source in the field can improve the realism of the acoustic signal.

Patent Document 1 discloses selecting a sound pickup position based on the position of a subject included in the field of view corresponding to a virtual viewpoint, in order to generate an acoustic signal to be reproduced together with a virtual viewpoint image. Patent Document 1 also discloses two methods of detecting the position of the subject: analyzing the virtual viewpoint image and using a position sensor.

Patent Document 1: Japanese Unexamined Patent Publication No. 2018-19295

However, when an acoustic signal is generated based on a sound source position detected by analyzing the generated virtual viewpoint image, as in the technique described in Patent Document 1, the delay from the generation of the virtual viewpoint image to the generation of the acoustic signal may become large. In addition, depending on the competition or event being shot, it may be difficult to attach a position sensor to the subject.

In view of the above problems, an object of the present invention is to provide a new method for estimating the position of a sound source related to a virtual viewpoint image.

In order to solve the above problems, the information processing apparatus according to the present invention has, for example, the following configuration. That is, it has an acquisition means for acquiring viewpoint information representing the transition of a virtual viewpoint corresponding to a virtual viewpoint image generated based on a plurality of captured images obtained by photographing a shooting area from a plurality of directions, and an estimation means for estimating the position of a sound source in the shooting area based on the viewpoint information acquired by the acquisition means.

According to the present invention, the position of a sound source related to a virtual viewpoint image can be estimated.

A diagram showing a configuration example of the system according to the embodiment.
A flowchart showing processing in the learning phase.
A flowchart showing processing in the application phase.
A flowchart showing processing related to re-learning.
A diagram showing an example of microphone positions and a sound source position.
A diagram showing a configuration example of the system according to a modification.
A flowchart showing processing in the learning phase.
A flowchart showing processing in the application phase.

[System configuration]
FIG. 1A is a block diagram showing the configuration of an information processing system 10 according to the present embodiment. FIG. 1B is a block diagram showing the configuration of a sound generation system 20 according to the present embodiment. The information processing system 10 performs learning using camera paths corresponding to virtual viewpoint images and detected sound source positions, and generates a trained model. The sound generation system 20 estimates a sound source position by feeding an input camera path to the trained model. In the present embodiment, the phase in which the information processing system 10 generates and updates the trained model is called the learning phase, and the phase in which the sound generation system 20 applies the trained model to estimate the sound source position is called the application phase.

A virtual viewpoint image is generated based on a plurality of images captured by a plurality of image pickup devices and on a designated virtual viewpoint, and represents the scene as seen from that virtual viewpoint. The virtual viewpoint image in the present embodiment is also called a free viewpoint image, but it is not limited to an image corresponding to a viewpoint freely (arbitrarily) designated by the user; for example, an image corresponding to a viewpoint the user selects from a plurality of candidates is also a virtual viewpoint image. The virtual viewpoint may be designated by a user operation, or automatically based on, for example, the result of image analysis. The present embodiment mainly describes the case where the virtual viewpoint image is a moving image, but the virtual viewpoint image may include a still image. A virtual viewpoint image can be regarded as an image simulating the captured image that a camera would obtain if it were placed at the position of the virtual viewpoint set in the space. In the present embodiment, the viewpoint information indicating how the virtual viewpoint changes over time is referred to as a camera path.

The viewpoint information used to generate the virtual viewpoint image is information indicating the position and orientation (line-of-sight direction) of the virtual viewpoint. Specifically, the viewpoint information is a parameter set including parameters representing the three-dimensional position of the virtual viewpoint and parameters representing the orientation of the virtual viewpoint in the pan, tilt, and roll directions. The content of the viewpoint information is not limited to the above. For example, the parameter set may include a parameter representing the size of the field of view (angle of view) of the virtual viewpoint. A camera path contains a plurality of parameter sets, one for each of a plurality of times. For example, a camera path may have a parameter set for each of the frames constituting the moving image of the virtual viewpoint image, and thus indicate the position and orientation of the virtual viewpoint at each of a plurality of consecutive time points.
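The parameter set and camera path described above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, and units (seconds, meters, degrees) are assumptions for illustration and are not specified in the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ViewpointParams:
    """One parameter set: the virtual viewpoint at a single time point."""
    time_s: float                    # time of the frame (seconds, assumed unit)
    x: float                         # three-dimensional position (meters, assumed unit)
    y: float
    z: float
    pan_deg: float                   # orientation (degrees, assumed unit)
    tilt_deg: float
    roll_deg: float
    fov_deg: Optional[float] = None  # optional angle of view


@dataclass
class CameraPath:
    """Viewpoint information: one parameter set per consecutive time point."""
    frames: List[ViewpointParams] = field(default_factory=list)

    def append(self, params: ViewpointParams) -> None:
        # Keep the frames ordered in time so the path describes a transition.
        if self.frames and params.time_s <= self.frames[-1].time_s:
            raise ValueError("frames must be appended in increasing time order")
        self.frames.append(params)

    def duration_s(self) -> float:
        if len(self.frames) < 2:
            return 0.0
        return self.frames[-1].time_s - self.frames[0].time_s


def example_path() -> CameraPath:
    """Build a three-frame path at an assumed 60 frames per second."""
    path = CameraPath()
    for i in range(3):
        path.append(ViewpointParams(time_s=i / 60.0, x=10.0 + i, y=5.0, z=2.0,
                                    pan_deg=90.0, tilt_deg=-10.0, roll_deg=0.0))
    return path
```

Making the angle of view optional mirrors the statement that the parameter set may, but need not, include it.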

As shown in FIG. 1A, the information processing system 10 includes a sound source detection unit 111, a video data storage unit 112, a camera path generation unit 130, and a model generation unit 100. The model generation unit 100 includes a camera path receiving unit 101, a teacher data generation unit 102, a learning unit 103, a sound source information storage unit 104, a sound source type storage unit 105, and a re-learning unit 106.

The camera path receiving unit 101 receives the camera path generated by the camera path generation unit 130. The camera path may be transferred, for example, frame by frame while it is being generated, or as the viewpoint information for all frames at once after generation is complete.

The teacher data generation unit 102 takes the camera path received by the camera path receiving unit 101 and acquires, from the sound source information storage unit 104, the sound source position, sound source occurrence time, and sound source type ID corresponding to that camera path. It then transmits the camera path together with the sound source position, sound source occurrence time, and sound source type ID to the learning unit 103 as one set of teacher data. The sound source position corresponding to a camera path is the position of a sound source related to the virtual viewpoint image generated based on that camera path. For example, consider a case where a virtual viewpoint image is generated based on a camera path and a plurality of captured images obtained by shooting the field where a match is played from a plurality of directions during a certain shooting period. In this case, the position of a sound source that emits sound in the field (in the shooting area) during that shooting period is the sound source position corresponding to the camera path.
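As a rough illustration of this pairing step, the sketch below matches a camera path with the stored sound source records for the same match whose occurrence time falls within the time span of the path. The matching rule, the record layout, and all names are assumptions for illustration; the source does not state how the correspondence is decided.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# One frame of a camera path: (time_s, x, y, z, pan, tilt, roll). Assumed layout.
Frame = Tuple[float, float, float, float, float, float, float]


@dataclass(frozen=True)
class SoundSourceRecord:
    match_id: str
    time_s: float                         # sound source occurrence time
    type_id: int                          # sound source type ID
    position: Tuple[float, float, float]  # X, Y, Z in meters


@dataclass(frozen=True)
class TeacherSample:
    camera_path: Tuple[Frame, ...]        # input data
    label: SoundSourceRecord              # correct answer data


def build_teacher_data(match_id: str,
                       camera_path: Sequence[Frame],
                       records: Sequence[SoundSourceRecord]) -> List[TeacherSample]:
    """Pair a camera path with the sound sources that occurred during it."""
    if not camera_path:
        return []
    start, end = camera_path[0][0], camera_path[-1][0]
    path = tuple(camera_path)
    return [TeacherSample(path, rec)
            for rec in records
            if rec.match_id == match_id and start <= rec.time_s <= end]
```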

Based on the received teacher data, the learning unit 103 creates and updates, by machine learning, a sound source calculation model, which is a trained model. In this learning, the sound source position, sound source occurrence time, and sound source type ID in the teacher data are treated as correct answer data. Specific machine learning algorithms that can be used include linear regression, logistic regression, support vector machines, and neural networks.
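Of the algorithms listed, linear regression is the simplest to show in a few lines. The sketch below fits, for each axis independently, an ordinary least squares line from one camera path feature to the sound source coordinate. The feature used here (the mean viewpoint position over the path) and every function name are assumptions chosen for illustration; the source does not specify the features or the model structure, and this sketch predicts only the position, not the occurrence time or type ID.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Line = Tuple[float, float]  # (slope, intercept)


def mean_position(path: Sequence[Vec3]) -> Vec3:
    """Hypothetical feature: average viewpoint position over the camera path."""
    n = len(path)
    return (sum(p[0] for p in path) / n,
            sum(p[1] for p in path) / n,
            sum(p[2] for p in path) / n)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Line:
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, my  # constant feature: fall back to predicting the mean
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def train(paths: Sequence[Sequence[Vec3]], sources: Sequence[Vec3]) -> List[Line]:
    """Fit one line per axis from the path feature to the source coordinate."""
    feats = [mean_position(p) for p in paths]
    return [fit_line([f[k] for f in feats], [s[k] for s in sources])
            for k in range(3)]


def predict(model: Sequence[Line], path: Sequence[Vec3]) -> Vec3:
    f = mean_position(path)
    x, y, z = (a * f[k] + b for k, (a, b) in enumerate(model))
    return (x, y, z)
```

A practical model would likely use richer inputs (orientation, per-frame motion) and one of the nonlinear methods the text mentions; the point here is only the shape of the fit and predict steps.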

The sound source information storage unit 104 stores sound source information about the sound sources that occurred during a match. For example, the sound source information is managed in a table such as Table 1 below. These data are generated by the sound source detection unit 111 and are treated as correct answer data. They are also referenced by the teacher data generation unit 102.

In Table 1, column 104-1 stores a match ID that uniquely identifies the match in which the sound source occurred. Column 104-2 stores the time at which the sound source occurred. Column 104-3 stores a sound source type ID identifying the type of the sound source. Columns 104-4, 104-5, and 104-6 store the X, Y, and Z coordinates of the sound source position, respectively. In the example of Table 1, the unit is meters.

The sound source type storage unit 105 stores sound source type information. For example, the sound source types are managed in a table such as Table 2 below.

In Table 2, column 105-1 stores a sound source type ID identifying the type of sound source. Column 105-2 stores the sound source type name corresponding to the sound source type ID. In the example of Table 2, the names of sound source types expected to occur during a rugby match are stored.
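The column layouts of Table 1 and Table 2 map naturally onto two relational tables joined on the sound source type ID. The following sketch uses Python's standard sqlite3 module; the table names, column names, and the sample type names in the usage note are hypothetical, since the source gives only the meaning of each column.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    # Table 2: sound source type ID (105-1) and type name (105-2).
    conn.execute(
        "CREATE TABLE sound_source_type ("
        " type_id INTEGER PRIMARY KEY,"
        " type_name TEXT NOT NULL)"
    )
    # Table 1: match ID (104-1), occurrence time (104-2), type ID (104-3),
    # and X, Y, Z coordinates in meters (104-4 to 104-6).
    conn.execute(
        "CREATE TABLE sound_source ("
        " match_id TEXT NOT NULL,"
        " time_s REAL NOT NULL,"
        " type_id INTEGER NOT NULL REFERENCES sound_source_type(type_id),"
        " x_m REAL NOT NULL, y_m REAL NOT NULL, z_m REAL NOT NULL)"
    )


def sources_with_names(conn: sqlite3.Connection, match_id: str):
    """Return (time, type name, x, y, z) rows for one match, ordered by time."""
    cur = conn.execute(
        "SELECT s.time_s, t.type_name, s.x_m, s.y_m, s.z_m"
        " FROM sound_source AS s"
        " JOIN sound_source_type AS t ON t.type_id = s.type_id"
        " WHERE s.match_id = ?"
        " ORDER BY s.time_s",
        (match_id,),
    )
    return cur.fetchall()
```

For example, after inserting hypothetical types such as (1, "kick") and (2, "scrum") and a few sound source rows into an in-memory database opened with `sqlite3.connect(":memory:")`, `sources_with_names(conn, "M1")` lists the sources of match M1 with readable type names.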

The re-learning unit 106 determines the validity of the sound source positions calculated in the application phase by the sound source position calculation unit 140 of the sound generation system 20, and performs re-learning by feeding the result back to the model generation unit 100. This determination is made based on the data stored in the sound source history storage unit 129, and is described later with reference to FIG. 4.

The sound source detection unit 111 detects sound sources based on the video data acquired from the video data storage unit 112, determines the match ID, sound source occurrence time, sound source type, and sound source position, and stores those data in the sound source information storage unit 104. In the present embodiment, sound sources are detected by performing video recognition processing on the video data. However, the detection method is not limited to this. For example, an operator may manually enter the data for some or all of the sound sources while checking the video. Alternatively, the sound source position may be detected manually or automatically using GPS information output from a position sensor attached to the person or object that is the sound source. The video data storage unit 112 stores video data shot by a bird's-eye view camera, that is, a camera that shoots from a position overlooking the field.

The camera path generation unit 130 generates a camera path for generating a virtual viewpoint image. One way to generate a camera path is, for example, to derive it from the input produced when a user moves the virtual viewpoint in a virtual space with a joystick controller. The generated camera path is transmitted to the camera path receiving unit 101.

As shown in FIG. 1B, the sound generation system 20 has a sound generation unit 120 and a camera path generation unit 131. The sound generation unit 120 includes a camera path receiving unit 121, a sound source information acquisition unit 122, a microphone selection unit 123, a competition sound generation unit 124, a cheer sound generation unit 125, and a mixing unit 126. The sound generation unit 120 further includes a sound pickup signal storage unit 127, a camera path history storage unit 128, a sound source history storage unit 129, and a sound source position calculation unit 140.

The sound source position calculation unit 140 calculates the sound source position, sound source occurrence time, and sound source type ID by inputting the camera path generated by the camera path generation unit 131 into the sound source calculation model generated by the model generation unit 100 in the learning phase. In the present embodiment, the input data to the sound source calculation model is the camera path, but the input is not limited to this. For example, the input data may include captured images or parameters obtained from captured images. In that case, the teacher data used to generate the sound source calculation model also includes the captured images or the parameters obtained from them.

Here, a parameter obtained from a captured image is, for example, information such as the number or density of objects in each region of the field. Including such data in the teacher data and the input data makes it more likely that the sound source position is calculated correctly in situations where players are crowded together and competition sounds are likely to occur, such as a scrum in rugby. The sound source position calculation unit 140 may also correct the sound source position output from the sound source calculation model in the application phase. For example, the sound source position may be corrected based on sound pickup signals collected by one or more microphones or on parameters calculated from those signals.

Here, a supplementary explanation is given of why the sound source position and sound source occurrence time can be estimated from the camera path. A camera path is created by a user who operates the virtual viewpoint so as to keep the subject of interest within the field of view of the virtual viewpoint. Meanwhile, the competition sounds that viewers should hear for a greater sense of presence often occur at or near the position of that subject. There is therefore a correlation between the position of the subject the virtual viewpoint is following and the sound source position and occurrence time, which makes it possible to estimate the sound source position and occurrence time from the camera path by machine learning.
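One way to make this correlation concrete is to compute where the virtual viewpoint's line of sight meets the field; a sound source near the followed subject would then tend to lie close to that point. The sketch below is illustrative geometry only, not the method of the source, and rests on assumed conventions: the field is the plane z = 0, pan is measured counterclockwise from the +X axis about the Z axis, and tilt is the elevation angle, with negative values looking down.

```python
import math
from typing import Optional, Tuple


def gaze_point_on_field(position: Tuple[float, float, float],
                        pan_deg: float,
                        tilt_deg: float) -> Optional[Tuple[float, float]]:
    """Intersect the viewpoint's optical axis with the field plane z = 0.

    Returns the (x, y) intersection, or None when the viewpoint is not
    above the field or is not looking downward.
    """
    x0, y0, z0 = position
    pan = math.radians(pan_deg)
    tilt = math.radians(tilt_deg)
    # Unit direction vector of the line of sight under the assumed convention.
    dx = math.cos(tilt) * math.cos(pan)
    dy = math.cos(tilt) * math.sin(pan)
    dz = math.sin(tilt)
    if z0 <= 0.0 or dz >= 0.0:
        return None
    t = -z0 / dz  # distance along the ray at which z reaches 0
    return (x0 + t * dx, y0 + t * dy)
```

A viewpoint 10 m above the origin looking along +X with a tilt of -45 degrees meets the field about 10 m ahead, at roughly (10, 0).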

A supplementary explanation is also given of why the sound source type can be estimated from the camera path. For example, a camera path for a place kick scene in rugby often shows the ball in close-up before it is kicked and then follows the ball through the air immediately after the kick. As in this example, the pattern of movement of the virtual viewpoint represented by a camera path correlates with the content of the play, and the content of the play naturally correlates with the sound source type. There is therefore a correlation between the camera path and the sound source type, which makes it possible to estimate the sound source type from the camera path.

The camera path receiving unit 121 receives the camera path generated by the camera path generation unit 131. The sound source information acquisition unit 122 transmits the camera path acquired by the camera path receiving unit 121 to the sound source position calculation unit 140 and obtains the sound source occurrence time, sound source type, and sound source position as return values.

The microphone selection unit 123 selects, according to the sound source position acquired by the sound source information acquisition unit 122, one or more microphones for picking up the competition sound included in the acoustic signal for reproduction. For example, it may select microphones whose distance from the estimated sound source position is equal to or less than a predetermined threshold. Alternatively, it may select microphones whose directivity points toward the estimated sound source position, or microphones that are both within the threshold distance of the sound source position and directed toward it.
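The selection rules above (distance threshold, directivity, or both) can be sketched as follows. The microphone model is an assumption made for illustration: each microphone has a position, a main pickup axis, and a half angle describing its pickup lobe, and it is treated as directed toward the source when the source lies inside that lobe. None of these names or the lobe model come from the source.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Microphone:
    mic_id: str
    position: Vec3
    axis: Vec3             # main pickup direction (need not be normalized)
    half_angle_deg: float  # half width of the assumed pickup lobe


def points_at(mic: Microphone, source: Vec3) -> bool:
    """True if the source lies inside the microphone's assumed pickup lobe."""
    to_src = tuple(s - p for s, p in zip(source, mic.position))
    dist = math.hypot(*to_src)
    axis_len = math.hypot(*mic.axis)
    if dist == 0.0:
        return True
    if axis_len == 0.0:
        return False
    cos_angle = sum(a * b for a, b in zip(mic.axis, to_src)) / (dist * axis_len)
    return cos_angle >= math.cos(math.radians(mic.half_angle_deg))


def select_microphones(mics: Sequence[Microphone],
                       source: Vec3,
                       max_distance_m: float,
                       require_directivity: bool = False) -> List[Microphone]:
    """Keep microphones within the distance threshold, optionally also
    requiring that they are directed toward the estimated source position."""
    selected = []
    for mic in mics:
        if math.dist(mic.position, source) > max_distance_m:
            continue
        if require_directivity and not points_at(mic, source):
            continue
        selected.append(mic)
    return selected
```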

The competition sound generation unit 124 generates the acoustic signal used as the competition sound from the sound pickup signals collected by the microphones selected by the microphone selection unit 123. The cheer sound generation unit 125 generates the acoustic signal used as the cheer sound from the sound pickup signals collected by microphones for cheer sound pickup, which are installed separately from the microphones for picking up the competition sound. The mixing unit 126 mixes the sound generated by the competition sound generation unit 124 with the sound generated by the cheer sound generation unit 125 to generate the acoustic signal for reproduction, which is played back together with the virtual viewpoint image.
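The mixing step can be illustrated with a weighted sum of two mono signals. This is a minimal sketch under stated assumptions: samples are floats in the range [-1, 1], both signals share one sample rate, and the gain values are arbitrary defaults; the source does not describe the actual mixing method.

```python
from typing import List, Sequence


def mix_signals(competition: Sequence[float],
                cheer: Sequence[float],
                competition_gain: float = 1.0,
                cheer_gain: float = 0.5) -> List[float]:
    """Weighted sum of two mono signals, clipped to [-1, 1].

    The shorter signal is treated as silent after its last sample.
    """
    n = max(len(competition), len(cheer))
    mixed = []
    for i in range(n):
        c = competition[i] if i < len(competition) else 0.0
        h = cheer[i] if i < len(cheer) else 0.0
        value = competition_gain * c + cheer_gain * h
        mixed.append(max(-1.0, min(1.0, value)))  # hard clip to avoid overflow
    return mixed
```

Keeping the cheer gain below the competition gain is one possible choice for letting the on-field sound stand out; the right balance would depend on the content.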

The sound pickup signal storage unit 127 stores the sound pickup signals collected by the microphones installed in the stadium. In the present embodiment, the signals are stored in a form that allows a desired signal to be extracted by specifying the microphone used and the pickup time. For example, the signals may be stored in a database and referenced with SQL. Alternatively, the signals may be managed as a group of WAVE files and referenced by file name.

The camera path history storage unit 128 stores the history of the camera paths received by the camera path receiving unit 121. For example, the camera path history is managed in a table such as Table 3 below.

Column 128-1 stores a camera path ID that uniquely identifies the camera path. Column 128-2 stores the contents of the camera path. As a storage format, for example, the time information and the viewpoint position and orientation information contained in the camera path may be expressed in JSON and stored as a single data item. Alternatively, the time information and the viewpoint position and orientation information may be managed in a database.
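As an illustration of the JSON option, the sketch below serializes a camera path into one JSON string that could be stored in a single column, and parses it back. The key names (`frames`, `time`, `position`, `orientation`) are hypothetical; the source says only that time, position, and orientation information may be expressed in JSON.

```python
import json
from typing import Any, Dict, List

Frame = Dict[str, Any]


def camera_path_to_json(frames: List[Frame]) -> str:
    """Serialize the frames of a camera path into one compact JSON string."""
    return json.dumps({"frames": frames}, separators=(",", ":"), sort_keys=True)


def camera_path_from_json(text: str) -> List[Frame]:
    """Parse a stored camera path and check the shape of each frame."""
    frames = json.loads(text)["frames"]
    for frame in frames:
        if len(frame["position"]) != 3 or len(frame["orientation"]) != 3:
            raise ValueError("position and orientation must have 3 components")
        float(frame["time"])  # raises if the time is not numeric
    return frames


EXAMPLE_FRAMES = [
    # position: [x, y, z]; orientation: [pan, tilt, roll] (assumed order)
    {"time": 0.0, "position": [10.0, 5.0, 2.0], "orientation": [90.0, -10.0, 0.0]},
    {"time": 0.5, "position": [11.0, 5.0, 2.0], "orientation": [92.0, -10.0, 0.0]},
]
```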

The sound source history storage unit 129 stores the history of the sound source information calculated by applying the sound source calculation model. For example, the history of the sound source information is managed in a table such as Table 4 below.

Column 129-1 stores a match ID that uniquely identifies the match to which the sound source information belongs. Column 129-2 stores the ID associated with the camera path that the sound source position calculation unit 140 used as input to calculate the sound source position. Column 129-3 stores the time at which the sound source occurred. Column 129-4 stores a sound source type ID identifying the type of the sound source. Columns 129-5, 129-6, and 129-7 store the X, Y, and Z coordinates of the sound source position, respectively. In the example of Table 4, the unit of the coordinates is meters.

Column 129-8 stores information indicating whether the determination by the re-learning unit 106 has been performed. In the present embodiment, TRUE in this column indicates that the re-learning unit 106 has already determined the validity of the sound source position calculation result for that row and that the feedback of the result has been transmitted to the sound source position calculation unit 140.
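The role of the flag in column 129-8 can be sketched with a small table in which rows are fetched while still unjudged and then marked once the determination has been made. The table name, column names, and the use of 0 and 1 for FALSE and TRUE are assumptions for illustration, and the validity determination itself is outside this sketch.

```python
import sqlite3
from typing import Iterable, List, Tuple


def create_history_table(conn: sqlite3.Connection) -> None:
    # Columns follow 129-1 to 129-8; judged is 0 (FALSE) until the row is judged.
    conn.execute(
        "CREATE TABLE sound_source_history ("
        " match_id TEXT NOT NULL,"
        " camera_path_id TEXT NOT NULL,"
        " time_s REAL NOT NULL,"
        " type_id INTEGER NOT NULL,"
        " x_m REAL NOT NULL, y_m REAL NOT NULL, z_m REAL NOT NULL,"
        " judged INTEGER NOT NULL DEFAULT 0)"
    )


def add_result(conn: sqlite3.Connection, match_id: str, camera_path_id: str,
               time_s: float, type_id: int, x: float, y: float, z: float) -> None:
    """Record one calculated sound source; it starts out unjudged."""
    conn.execute(
        "INSERT INTO sound_source_history"
        " (match_id, camera_path_id, time_s, type_id, x_m, y_m, z_m)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (match_id, camera_path_id, time_s, type_id, x, y, z),
    )


def fetch_unjudged(conn: sqlite3.Connection) -> List[Tuple]:
    """Rows whose calculation result has not yet been judged."""
    return conn.execute(
        "SELECT rowid, camera_path_id, time_s, type_id, x_m, y_m, z_m"
        " FROM sound_source_history WHERE judged = 0 ORDER BY rowid"
    ).fetchall()


def mark_judged(conn: sqlite3.Connection, rowids: Iterable[int]) -> None:
    """Set the flag to TRUE (1) for rows that have been judged."""
    conn.executemany(
        "UPDATE sound_source_history SET judged = 1 WHERE rowid = ?",
        [(r,) for r in rowids],
    )
```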

Like the camera path generation unit 130, the camera path generation unit 131 generates a camera path for generating a virtual viewpoint image. The generated camera path is transmitted to the camera path receiving unit 121.

FIG. 1C shows the hardware configuration of the model generation unit 100 and the sound generation unit 120. The model generation unit 100 can be realized by an information processing device having a CPU 161, a ROM 162, a RAM 163, an auxiliary storage device 164, a display unit 165, an operation unit 166, a communication I/F 167, a GPU 168, and a bus 169. The sound generation unit 120 can be realized by an information processing device having a CPU 171, a ROM 172, a RAM 173, an auxiliary storage device 174, a display unit 175, an operation unit 176, a communication I/F 177, a GPU 178, and a bus 179. The model generation unit 100 and the sound generation unit 120 can communicate with each other via a network 180.

The CPU 161 realizes each function of the model generation unit 100 shown in FIG. 1A by controlling the entire model generation unit 100 using computer programs and data stored in the ROM 162 and the RAM 163. The model generation unit 100 may have one or more pieces of dedicated hardware separate from the CPU 161, and the dedicated hardware may execute at least part of the processing otherwise performed by the CPU 161. Examples of such dedicated hardware include ASICs (application-specific integrated circuits), FPGAs (field-programmable gate arrays), and DSPs (digital signal processors).

The ROM 162 stores programs and other data that do not need to be changed. The RAM 163 temporarily stores programs and data supplied from the auxiliary storage device 164, data supplied from outside via the communication I/F 167, and the like. The auxiliary storage device 164 is composed of, for example, a hard disk drive, and stores various data such as image data and acoustic data. In the present embodiment, the sound source information storage unit 104, the sound source type storage unit 105, and the video data storage unit 112 are implemented on the auxiliary storage device 164, but the configuration is not limited to this. For example, they may be implemented on an external device connected via the communication I/F 167.

The display unit 165 is composed of, for example, a liquid crystal display or LEDs, and displays a GUI (Graphical User Interface) and the like with which the user operates the model generation unit 100. The operation unit 166 is composed of, for example, a keyboard, a mouse, a joystick, or a touch panel, and inputs various instructions to the CPU 161 in response to user operations. The CPU 161 operates as a display control unit that controls the display unit 165 and as an operation control unit that controls the operation unit 166.

The communication I/F 167 is used for communication between the model generation unit 100 and external devices. For example, when the model generation unit 100 is connected to an external device by wire, a communication cable is connected to the communication I/F 167. When the model generation unit 100 has a function of communicating wirelessly with an external device, the communication I/F 167 includes an antenna.

Because the GPU 168 can compute efficiently by processing large amounts of data in parallel, it is effective when learning is performed repeatedly using a learning model, as in deep learning. In the present embodiment, therefore, the GPU 168 is used in addition to the CPU 161 for the processing by the learning unit 103. Specifically, when a learning program involving the learning model is executed, learning is performed by the CPU 161 and the GPU 168 computing in cooperation. However, the processing of the learning unit 103 may be executed by only one of the CPU 161 and the GPU 168.

The bus 169 connects the units of the model generation unit 100 and transfers information among them.

In the present embodiment, the display unit 165 and the operation unit 166 are assumed to exist inside the model generation unit 100, but at least one of the display unit 165 and the operation unit 166 may exist outside the model generation unit 100 as a separate device.

The hardware configuration of the sound generation unit 120 is the same as that of the model generation unit 100. However, the GPU 178 may be used for the processing in which the sound source position calculation unit 140 calculates the sound source position.

[Operation flow]
FIG. 2 is a flowchart showing the processing of the model generation unit 100 in the learning phase. The processing shown in FIG. 2 is realized by the CPU 161 and the GPU 168 loading a program stored in the ROM 162 into the RAM 163 and executing it. At least part of the processing shown in FIG. 2 may be realized by one or more pieces of dedicated hardware different from the CPU 161 and the GPU 168. The same applies to the flowcharts shown in FIGS. 4 and 7, which are described later. The processing shown in FIG. 2 starts when the start of the learning phase is instructed after video data has been stored in the video data storage unit 112, a camera path has been generated by the camera path generation unit 130, and the sound sources corresponding to that camera path have been detected by the sound source detection unit 111. However, the start timing of the processing shown in FIG. 2 is not limited to this.

In S201, the camera path receiving unit 101 receives the camera path generated by the camera path generation unit 130. In S202, the teacher data generation unit 102 calculates the start time and end time of the camera path (that is, the start time and end time of the shooting period corresponding to the virtual viewpoint image generated based on the camera path) from the time information included in the camera path received in S201. In S203, the teacher data generation unit 102 acquires the match ID included in the camera path received in S201.

In S204, the teacher data generation unit 102 acquires sound source information including the sound source position, the sound source generation time, and the sound source type ID from the sound source information storage unit 104. The sound source information acquired here is that of the sound sources whose match ID in Table 1 matches the match ID acquired in S203 and whose sound source generation time in Table 1 falls between the start time and the end time of the camera path calculated in S202.

The processing of S205 to S208 is executed for each piece of sound source information acquired in S204. In S206, the teacher data generation unit 102 generates one set of teacher data with the camera path received in S201 as input data and the sound source position, sound source generation time, and sound source type ID acquired in S204 as correct answer data. In S207, the learning unit 103 updates the sound source calculation model using the teacher data generated in S206. By repeating this update for each piece of sound source information, the sound source calculation model is generated as a trained model.
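The learning-phase steps S202 to S206 can be illustrated with a minimal sketch. The class names, field names, and the keyframe representation of a camera path below are assumptions made for illustration and do not appear in the specification; the model update of S207 is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class SoundSource:            # hypothetical row of the sound source table
    match_id: str
    position: Vec3            # sound source position, metres
    time: float               # sound source generation time, seconds
    type_id: int              # sound source type ID

@dataclass
class CameraPath:             # hypothetical camera path representation
    match_id: str
    keyframes: List[Tuple[float, Vec3]]  # (time, virtual viewpoint position)

def path_time_range(path: CameraPath) -> Tuple[float, float]:
    """S202: start and end time of the camera path."""
    times = [t for t, _ in path.keyframes]
    return min(times), max(times)

def build_teacher_data(path: CameraPath, sources: List[SoundSource]):
    """S204 to S206: one (input, correct answer) pair per matching sound source."""
    start, end = path_time_range(path)
    selected = [s for s in sources
                if s.match_id == path.match_id and start <= s.time <= end]
    return [(path, (s.position, s.time, s.type_id)) for s in selected]
```

Each returned pair would then be passed to whatever learning routine updates the sound source calculation model.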

FIG. 3 is a flowchart showing the processing of the sound generation unit 120 in the application phase. The processing shown in FIG. 3 is realized by the CPU 171 and the GPU 178 loading a program stored in the ROM 172 into the RAM 173 and executing it. At least part of the processing shown in FIG. 3 may be realized by one or more pieces of dedicated hardware different from the CPU 171 and the GPU 178. The same applies to the flowchart shown in FIG. 8, which is described later. The processing shown in FIG. 3 starts when the start of the application phase is instructed after the sound source calculation model has been generated as a trained model by the learning-phase processing shown in FIG. 2 and a camera path has been generated by the camera path generation unit 131. However, the start timing of the processing shown in FIG. 3 is not limited to this.

In S301, the camera path receiving unit 121 receives the camera path generated by the camera path generation unit 131. In S302, the sound source information acquisition unit 122 inputs the camera path acquired in S301 into the sound source calculation model and acquires the estimated sound source position, sound source generation time, and sound source type ID. Two or more sets of sound source information may be acquired at this time. In S303, the sound source position, sound source generation time, and sound source type ID acquired in S302 are stored in the sound source history storage unit 129. The stored data is referred to by the re-learning unit 106 and used to determine whether the result calculated by the sound source calculation model was correct. This determination processing is described later with reference to FIG. 4.

In S304, the microphone selection unit 123 selects the microphones to be used based on the sound source position estimated in S302. As a selection method, for example, as shown in FIG. 5, the distance from each of the microphones 501 to 512 to the sound source position 520 may be obtained, and the microphones whose distance is equal to or less than a certain threshold may be selected. The directivity of each microphone may also be taken into account. In addition to selecting microphones, the microphone selection unit 123 may determine, for each microphone, a coefficient specifying the proportion at which the sound pickup signal of that microphone is mixed as competition sound into the audio signal for playback.
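The distance-threshold selection of S304 could be sketched as follows. The function name, the dictionary layout, and in particular the inverse-distance weighting used for the mix coefficients are assumptions for illustration; the specification does not state how the coefficients are computed, and microphone directivity is ignored here.

```python
import math

def select_microphones(mic_positions, source_pos, max_distance):
    """Return {mic_id: mix coefficient} for microphones within max_distance.

    mic_positions maps a microphone ID to its (x, y, z) position in metres.
    """
    distances = {mic_id: math.dist(pos, source_pos)
                 for mic_id, pos in mic_positions.items()}
    chosen = {m: d for m, d in distances.items() if d <= max_distance}
    if not chosen:
        return {}
    # Assumed weighting: closer microphones contribute more.
    # The +1.0 avoids division by zero; weights are normalised to sum to 1.
    weights = {m: 1.0 / (d + 1.0) for m, d in chosen.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}
```

An empty result means that no microphone lies within the threshold, in which case a caller might widen the threshold or fall back to the nearest microphone.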

In S305, the sound pickup signals picked up by the microphones selected in S304 are acquired from the sound pickup signal storage unit 127, and the competition sound is generated. As a method of generating the competition sound, for example, when the sound source type name indicated by the sound source type ID is a place kick, only the impact sound of the kick may be cut out from the sound pickup signal as a short segment, based on the sound source generation time. In S306, a cheering sound corresponding to the camera path is generated. As a method of generating the cheering sound, for example, the sound pickup signals picked up by a plurality of cheering-sound microphones may be distributed in a balanced manner between the L channel and the R channel to generate stereo channels. Alternatively, stereo or multichannel audio such as 5.1-channel audio may be generated so as to follow changes in the virtual viewpoint.

In S307, the competition sound generated in S305 and the cheering sound generated in S306 are mixed to generate the audio signal to be played back together with the virtual viewpoint image. As a mixing method, for example, the average levels of the competition sound and the cheering sound may be calculated, and the volumes may be adjusted automatically so that the two are mixed at equivalent levels. The generated audio signal for playback is output from the sound generation unit 120 to an external speaker, a network, or a storage device. By synthesizing the sound pickup signals based on the sound source position estimated from the camera path in this way, a highly realistic audio signal suitable for playback together with the virtual viewpoint image corresponding to that camera path can be generated.
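The level-matched mix described for S307 might look like the sketch below. Using the RMS value as the "average level", scaling the cheering sound to the competition sound, and averaging the two with equal weight are all assumptions; the specification only says that the levels are made equivalent before mixing.

```python
import math

def rms(signal):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not signal:
        return 0.0
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_matched(game, cheer):
    """Scale the cheering sound to the competition sound's level, then mix."""
    n = min(len(game), len(cheer))          # mix only the overlapping part
    game, cheer = game[:n], cheer[:n]
    g_level, c_level = rms(game), rms(cheer)
    # Leave the cheering sound unscaled if either signal is silent.
    gain = g_level / c_level if g_level > 0 and c_level > 0 else 1.0
    return [0.5 * (g + gain * c) for g, c in zip(game, cheer)]
```

The 0.5 factor keeps the mixed samples within the range of the inputs when both signals peak at the same time.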

Next, with reference to FIG. 4, processing is described in which the re-learning unit 106 determines whether the sound source position estimated in the application phase was correct and has the sound source calculation model re-learn from the determination result. The processing shown in FIG. 4 starts when the application-phase processing shown in FIG. 3 has been performed. However, the start timing of the processing shown in FIG. 4 is not limited to this.

In S401, some or all of the data stored in the sound source history storage unit 129 whose judged flag in column 129-8 is FALSE is acquired. The processing of S402 to S407 is executed for each piece of data acquired in S401. In S403, the match ID of the data acquired in S401 is acquired. In S404, the overhead camera video of the match corresponding to the match ID acquired in S403 is acquired from the video data storage unit 112. If there is no overhead camera video corresponding to the match ID, it may be replaced with other data from which the sound source position calculation result can be judged, such as broadcast data from television coverage. If no video data that allows this judgment can be obtained, the judgment may be omitted.

In S405, whether the sound source position, sound source generation time, and sound source type ID in the history data acquired in S401 are correct is determined based on the overhead camera video. As a determination method, for example, the sound source position may be marked on the overhead camera video at the sound source generation time acquired in S401 so that a human judge can visually decide whether the result is correct. In that case, the re-learning unit 106 has a UI that accepts the judgment, and the input information (TRUE/FALSE) is used as the determination result.

In S406, the sound source calculation model is updated by feeding back the determination result acquired in S405 together with the data acquired in S401. Such re-learning improves the sound source calculation model and thereby the accuracy of sound source position estimation in subsequent application phases.
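The bookkeeping around the re-learning flow of FIG. 4 can be sketched as below. The dictionary keys (`judged`, `correct`, `camera_path_id`, `position`) are hypothetical stand-ins for the columns of the sound source history table, and the actual model update of S406 is deliberately left out because the specification does not describe its algorithm.

```python
def pending_records(history):
    """S401: history rows whose judged flag is still False."""
    return [row for row in history if not row["judged"]]

def record_judgement(row, is_correct):
    """S405: store the TRUE/FALSE verdict entered by the human judge."""
    row["judged"] = True
    row["correct"] = is_correct
    return row

def feedback_samples(history):
    """S406 input: judged rows paired with their verdict."""
    return [(row["camera_path_id"], row["position"], row["correct"])
            for row in history if row["judged"]]
```

The samples returned by `feedback_samples` are what a re-learning routine would consume, with incorrect estimates serving as negative feedback.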

[Modification]
The embodiment described above generates, by machine learning, a sound source calculation model that takes a camera path as input and outputs a sound source position. However, the input to the sound source calculation model is not limited to a camera path. In the configuration described below, instead of a camera path, the positions and orientations of microphones and the sound pickup signals picked up by those microphones are used as input, and a sound source calculation model that estimates the sound source position is generated by machine learning.

FIG. 6(A) is a block diagram showing a configuration example of the information processing system 11, which performs the learning-phase processing in this modification. FIG. 6(B) is a block diagram showing a configuration example of the sound generation system 21, which performs the application-phase processing in this modification. In the following, description of the parts that perform the same processing as the configuration described with reference to FIGS. 1(A) and 1(B) is omitted, and the description focuses on the differences. The hardware configuration is the same as that described with reference to FIG. 1(C).

The microphones 641-1 to 641-M are installed around the field or in similar locations to pick up sound in the sound pickup area (for example, cheering sounds or competition sounds in the field). The field is also the imaging area at which the imaging devices that acquire the plurality of captured images used to generate the virtual viewpoint image are aimed. The sound pickup signals picked up by the microphones 641-1 to 641-M are stored in the sound pickup signal storage unit 632.

The microphone information storage unit 631 stores microphone information indicating the positions and orientations of the installed microphones. The microphone information is managed, for example, in a table such as Table 5 below.

Column 631-1 stores a match ID that uniquely identifies the match in which the microphone was installed. Column 631-2 stores an ID that uniquely identifies the installed microphone. Columns 631-3, 631-4, and 631-5 store the X, Y, and Z coordinates of the microphone position, respectively. In the example of Table 5, the coordinates are in meters. Columns 631-6 and 631-7 hold the two angles that express the orientation of the microphone in a spherical coordinate system. Column 631-6 is the angle between the Z axis and the radius vector; it takes values from 0 to 90 degrees, where 90 degrees represents the horizontal direction. Column 631-7 is the angle between the X axis and the projection of the radius vector onto the XY plane; it takes values from 0 to 360 degrees, where 0 degrees represents the X-axis direction.
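For illustration, the two orientation angles can be converted into a unit direction vector with the standard spherical-coordinate formulas. The function name is an assumption; the angle conventions follow the column descriptions above (polar angle measured from the Z axis, azimuth measured from the X axis in the XY plane).

```python
import math

def mic_direction(polar_deg, azimuth_deg):
    """Unit vector for a microphone orientation given in degrees.

    polar_deg:   angle between the Z axis and the direction (0 to 90)
    azimuth_deg: angle between the X axis and the XY projection (0 to 360)
    """
    theta = math.radians(polar_deg)
    phi = math.radians(azimuth_deg)
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))
```

With this convention, a polar angle of 90 degrees and an azimuth of 0 degrees points horizontally along the X axis, and a polar angle of 0 degrees points straight up along the Z axis.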

The sound pickup signal storage unit 632 stores the sound pickup signals picked up by the microphones installed at the field. The sound pickup signal storage unit 632 is placed at a location accessible from the model generation unit 600.

The teacher data generation unit 601 acquires microphone information from the microphone information storage unit 631 and sound pickup signals from the sound pickup signal storage unit 632. The teacher data generation unit 601 also acquires, from the sound source information storage unit 603, the sound source position, sound source generation time, and sound source type ID of every competition sound that occurred during the sound pickup period corresponding to the sound pickup signals. The teacher data generation unit 601 then transmits the acquired microphone information, sound pickup signals, sound source positions, sound source generation times, and sound source type IDs to the learning unit 602 as teacher data. The learning unit 602 creates and updates the sound source calculation model based on the received teacher data. Within the teacher data, the sound source position, sound source generation time, and sound source type ID are treated as correct answer data.

The sound source position calculation unit 660 inputs the microphone positions, the microphone orientations, and the sound pickup signals picked up by the microphones into the sound source calculation model to estimate the sound source position, sound source generation time, and sound source type ID. In the learning phase of the sound source calculation model, the model is generated and updated by the model generation unit 600. In the application phase of the sound source calculation model, the sound source position, sound source generation time, and sound source type ID are calculated using, as production data, the microphone positions and orientations stored in the microphone information storage unit 651 and the sound pickup signals stored in the sound pickup signal storage unit 652.

In this modification, the input data to the sound source calculation model consists of the microphone positions, the microphone orientations, and the sound pickup signals picked up by the microphones, but the data input to the model is not limited to these. For example, because sound pickup characteristics vary with the type of microphone, the microphone type may be included in the input data. Because temperature and humidity affect sound quality, the temperature and humidity may be included in the input data. Because the reverberation characteristics of sound vary with the structure of the space in which sound is picked up, location information indicating where the sound was picked up may be included in the input data. Captured images, or parameters obtained from captured images, may also be included in the input data.

Here, a supplementary explanation is given of why the sound source position and sound source generation time can be estimated from the microphone positions, the microphone orientations, and the sound pickup signals picked up by the microphones. For example, when a shot is taken in soccer, the timing at which the kick sound appears in a microphone's sound pickup signal correlates with the microphone position and the position where the shot sound occurred (the sound source position). The orientation of a directional microphone also correlates with the level at which it picks up the shot sound. Because the microphone positions, the microphone orientations, and the sound pickup signals correlate with the sound source position in this way, the position can be estimated by machine learning. Furthermore, machine learning is used rather than a simple distance calculation because detecting, for example, the excitement of the spectators after a shot improves the accuracy of estimating that a shot occurred. This reduces the likelihood of confusing the shot sound with other sounds (for example, the drums of a cheering squad) and improves the accuracy of sound source position estimation.
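One physical reason for the timing correlation is propagation delay: a sound reaches more distant microphones later. The sketch below assumes a speed of sound of about 343 m/s (air at roughly 20 °C) and free-field propagation without reflections; the function name and constant are illustrative and not part of the specification.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 degrees C

def arrival_time(source_pos, source_time, mic_pos):
    """Time at which a sound emitted at source_time reaches the microphone."""
    distance = math.dist(source_pos, mic_pos)   # metres
    return source_time + distance / SPEED_OF_SOUND
```

Differences between the arrival times at several microphones constrain where the sound originated, which is the kind of relationship a learned model can exploit alongside level and crowd-reaction cues.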

A supplementary explanation is also given of why the sound source type can be estimated from the microphone positions, the microphone orientations, and the sound pickup signals picked up by the microphones. For example, as described above, detecting the excitement of the spectators makes it possible to estimate whether a shot was taken, a corner kick was taken, and so on. Because the sound pickup signals correlate with the sound source type in this way, the type can be estimated using machine learning.

The sound source information acquisition unit 622 acquires the camera path received by the camera path receiving unit 621. It also acquires, from the sound pickup signal storage unit 652, the sound pickup signals corresponding to the period from the start time to the end time of that camera path, and acquires microphone information from the microphone information storage unit 651. By inputting the acquired data into the sound source calculation model, the sound source position, sound source generation time, and sound source type ID can be obtained.

FIG. 7 is a flowchart showing the processing of the model generation unit 600 in the learning phase of this modification. The processing shown in FIG. 7 starts when the start of the learning phase is instructed after the sound pickup signals and microphone information have been stored in the data storage unit 630 and the sound sources have been detected by the sound source detection unit 111. However, the start timing of the processing shown in FIG. 7 is not limited to this.

In S701, a match ID is acquired. As an acquisition method, for example, an operator selects the match to be learned via an external operation PC, and the match ID corresponding to the selected match is sent to the teacher data generation unit 601. Alternatively, for example, the match ID may be determined automatically for a match or sound source position that has not yet been learned.

In S702, the microphone information corresponding to the match ID acquired in S701 is acquired from the microphone information storage unit 631. In S703, the sound pickup signals corresponding to the match ID acquired in S701 are acquired from the sound pickup signal storage unit 632. In S704, the teacher data generation unit 601 acquires sound source information including the sound source position, sound source generation time, and sound source type ID from the sound source information storage unit 604. The sound source information acquired here is that of the sound sources whose match ID in Table 1 matches the match ID acquired in S701.

The processing of S705 to S709 is executed for each piece of sound source information acquired in S704. In S706, the portion of the sound pickup signal spanning several seconds before and after the acquired sound source generation time is cut out from the sound pickup signal acquired in S703. How many seconds are cut out depends on the machine learning algorithm of the sound source calculation model. In particular, when learning is to take into account how the cheering builds up before and after the sound source generation time, a range of 10 seconds or more before and after may be cut out.
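The windowed cut-out of S706 could be implemented along the lines of the sketch below. It assumes a signal stored as a flat list of samples with a known sample rate and start time; these parameters, and the function name, are assumptions for illustration.

```python
def clip_window(signal, sample_rate, signal_start, event_time, half_width):
    """Return the samples within half_width seconds of event_time.

    signal:       list of samples
    sample_rate:  samples per second
    signal_start: time of signal[0], seconds
    event_time:   sound source generation time, seconds
    half_width:   seconds to keep before and after the event
    """
    first = int(round((event_time - half_width - signal_start) * sample_rate))
    last = int(round((event_time + half_width - signal_start) * sample_rate))
    first = max(0, first)            # clamp to the start of the recording
    last = min(len(signal), last)    # clamp to the end of the recording
    return signal[first:last]
```

For example, a `half_width` of 10 keeps roughly 20 seconds around the event, which corresponds to the case where crowd reaction before and after the sound is included in learning.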

In S707, the teacher data generation unit 601 generates one set of teacher data with the microphone information acquired in S702 and the sound pickup signal cut out in S706 as input data and the sound source position, sound source generation time, and sound source type ID acquired in S704 as correct answer data. In S708, the learning unit 602 updates the sound source calculation model using the teacher data generated in S707. By repeating this update for each piece of sound source information, the sound source calculation model is generated as a trained model.

FIG. 8 is a flowchart showing the processing of the sound generation unit 620 in the application phase of this modification. The processing shown in FIG. 8 starts when the start of the application phase is instructed after the sound source calculation model has been generated as a trained model by the learning-phase processing shown in FIG. 7 and the microphone information and sound pickup signals have been stored in the data storage unit 650. However, the start timing of the processing shown in FIG. 8 is not limited to this.

In S801, the camera path receiving unit 621 receives the camera path generated by the camera path generation unit 643. In S802, the match ID is acquired from the camera path acquired in S801. In S803, the microphone information corresponding to the match ID acquired in S802 is acquired from the microphone information storage unit 651. In S804, the sound pickup signals that correspond to the match ID acquired in S802 and to the period from the start time to the end time of the camera path acquired in S801 are acquired from the sound pickup signal storage unit 652. In S805, the sound source position calculation unit 660 inputs the microphone information acquired in S803 and the sound pickup signals acquired in S804 into the sound source calculation model and calculates the sound source position, sound source generation time, and sound source type ID. Two or more sets of sound source information may be acquired at this time.

The processing of S806 to S809 is the same as that of S304 to S307 described with reference to FIG. 3, so its description is omitted. In this way, by synthesizing the sound pickup signals based on the sound source position estimated from the microphone information and the sound pickup signals to generate an acoustic signal for playback, a highly realistic acoustic signal suitable for playback together with the virtual viewpoint video corresponding to the sound pickup period can be generated.
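As an illustration of synthesizing sound pickup signals based on an estimated sound source position, the following Python sketch weights each microphone by the inverse of its distance to the source and mixes the signals accordingly. The inverse-distance rule and the function names `mixing_weights` and `mix` are assumptions made for this example; the text does not state which weighting the embodiment uses.

```python
import math
from typing import List, Sequence

Position = Sequence[float]


def mixing_weights(source_pos: Position, mic_positions: List[Position],
                   eps: float = 1e-6) -> List[float]:
    """Return normalized weights that favor microphones closer to the source.

    Assumption: weight is proportional to 1 / distance; eps avoids division by zero.
    """
    inverse = [1.0 / (math.dist(source_pos, m) + eps) for m in mic_positions]
    total = sum(inverse)
    return [w / total for w in inverse]


def mix(signals: List[Sequence[float]], weights: List[float]) -> List[float]:
    """Weighted sum of the per-microphone signals, truncated to the shortest one."""
    length = min(len(s) for s in signals)
    return [
        sum(w * s[i] for w, s in zip(weights, signals))
        for i in range(length)
    ]
```

For example, with a source 1 m from one microphone and 3 m from another, `mixing_weights` gives the nearer microphone roughly three quarters of the mix.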

The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions. The program may also be provided recorded on a computer-readable recording medium.

10 Information processing system
20 Sound generation system
100 Model generation unit
120 Sound generation unit

Claims

An information processing apparatus comprising:
acquisition means for acquiring viewpoint information indicating a transition of a virtual viewpoint corresponding to a virtual viewpoint image generated based on a plurality of captured images obtained by imaging an imaging area from a plurality of directions; and
estimation means for estimating a position of a sound source in the imaging area based on the viewpoint information acquired by the acquisition means.

The information processing apparatus according to claim 1, wherein the estimation means
inputs the viewpoint information acquired by the acquisition means into a trained model that is obtained by machine learning and that outputs data relating to a sound source in the imaging area in response to input data including viewpoint information, and
estimates the position of the sound source in the imaging area based on the data output from the trained model.

The information processing apparatus according to claim 2, wherein the input data includes at least one of a captured image obtained by imaging the imaging area, information indicating a position of a microphone that picks up sound in the imaging area, and a sound pickup signal obtained by picking up sound in the imaging area.

The information processing apparatus according to claim 2 or 3, wherein the machine learning uses at least one of linear regression, logistic regression, a support vector machine, and a neural network as an algorithm.

The information processing apparatus according to any one of claims 2 to 4, further comprising re-learning means for retraining the trained model using the data output from the trained model and a determination result regarding the validity of the data.

The information processing apparatus according to any one of claims 1 to 5, wherein the viewpoint information represents a transition, over a predetermined period, of the position and orientation of the virtual viewpoint corresponding to the virtual viewpoint image.

The information processing apparatus according to claim 6, wherein the estimation means estimates the position of the sound source in the imaging area during the predetermined period.

The information processing apparatus according to any one of claims 1 to 7, wherein the estimation means estimates, based on the viewpoint information acquired by the acquisition means, at least one of an occurrence time of the sound source in the imaging area and a type of the sound source in the imaging area.

The information processing apparatus according to any one of claims 1 to 8, further comprising generation means for generating an acoustic signal to be reproduced together with the virtual viewpoint image by processing, based on the estimation result of the estimation means, a sound pickup signal obtained by picking up sound in the imaging area.

An information processing system comprising:
learning means for generating the trained model by using, as teacher data, viewpoint information indicating a transition of a virtual viewpoint corresponding to a virtual viewpoint image and sound source information indicating a position of a sound source; and
the information processing apparatus according to any one of claims 2 to 5.

An information processing apparatus comprising:
acquisition means for acquiring microphone information indicating positions of a plurality of microphones that pick up sound in a sound pickup area, and sound pickup signals acquired based on sound pickup by the plurality of microphones; and
estimation means for inputting the microphone information and the sound pickup signals acquired by the acquisition means into a trained model that is obtained by machine learning and that outputs data relating to a sound source in the sound pickup area in response to input data including microphone information and sound pickup signals, and for estimating a position of the sound source in the sound pickup area based on the data output from the trained model.

The information processing apparatus according to claim 11, wherein the input data includes information indicating at least one of the directivity of the microphones, the type of the microphones, the temperature, the humidity, and the sound pickup location.

An information processing method comprising:
an acquisition step of acquiring viewpoint information indicating a transition of a virtual viewpoint corresponding to a virtual viewpoint image generated based on a plurality of captured images obtained by imaging an imaging area from a plurality of directions; and
an estimation step of estimating a position of a sound source in the imaging area based on the viewpoint information acquired in the acquisition step.

The information processing method according to claim 13, wherein the estimation step
inputs the viewpoint information acquired in the acquisition step into a trained model that is obtained by machine learning and that outputs data relating to a sound source in the imaging area in response to input data including viewpoint information, and
estimates the position of the sound source in the imaging area based on the data output from the trained model.

The information processing method according to claim 13, further comprising a generation step of generating an acoustic signal to be reproduced together with the virtual viewpoint image by processing, based on the estimation result of the estimation step, a sound pickup signal obtained by picking up sound in the imaging area.

An information processing method comprising:
an acquisition step of acquiring microphone information indicating positions of a plurality of microphones that pick up sound in a sound pickup area, and sound pickup signals acquired based on sound pickup by the plurality of microphones; and
an estimation step of inputting the microphone information and the sound pickup signals acquired in the acquisition step into a trained model that is obtained by machine learning and that outputs data relating to a sound source in the sound pickup area in response to input data including microphone information and sound pickup signals, and estimating a position of the sound source in the sound pickup area based on the data output from the trained model.

The information processing method according to claim 16, wherein the input data includes information indicating at least one of the directivity of the microphones, the type of the microphones, the temperature, the humidity, and the sound pickup location.

A program for causing a computer to function as each means of the information processing apparatus according to any one of claims 1 to 9, 11, and 12.