JP2018019295A

JP2018019295A - Information processing system, control method therefor, and computer program

Info

Publication number: JP2018019295A
Application number: JP2016148996A
Authority: JP
Inventors: Masanobu Funakoshi (船越　正伸)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2016-07-28
Filing date: 2016-07-28
Publication date: 2018-02-01

Abstract

PROBLEM TO BE SOLVED: To provide a technology capable of generating sound having presence matching arbitrary viewpoint video. SOLUTION: An information processing system for processing a picture and sound corresponding to any viewpoint on the basis of a plurality of picture signals obtained by imaging by a plurality of imaging apparatuses and a plurality of sound collection signals obtained by collecting sound at a plurality of sound collection points comprises: acquisition means for acquiring a viewpoint position and a direction of a line of sight to an imaging object; selection means for selecting, from the plurality of sound collection points, depending on the viewpoint position and the direction of the line of sight, a sound collection point for a sound collection signal to be used for generating a sound signal corresponding to a picture that depends on the viewpoint position and the direction of the line of sight and is based on the plurality of picture signals; and sound generation means for generating a sound signal using a sound collection signal collected at the sound collection point selected by the selection means. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing system, a control method therefor, and a computer program, and more particularly to a technique for generating an audio reproduction signal suitable for an arbitrary viewpoint video.

In recent years, systems have been developed that apply computer graphics technology and the like to generate video from an arbitrary viewpoint by appropriately processing video captured by multiple cameras installed so as to surround a wide area such as a sports stadium.

To give a greater sense of presence to the arbitrary viewpoint video generated by such a system, it is necessary to generate and reproduce an acoustic signal that matches the video.

It is therefore known to designate an arbitrary position in a stadium or the like and generate an acoustic signal suited to that position (Patent Document 1). In this configuration, an arbitrary viewing point and viewing angle are input in addition to the directivity and location of multiple microphones installed in the stadium; the signal distribution ratio for each channel and the effect of the distance between each microphone and the viewing point are calculated automatically, and surround sound is mixed automatically.

Patent Document 1: Japanese Patent Laid-Open No. 2005-223771

However, in the configuration of Patent Document 1, the surround signal is generated using all the microphone signals regardless of where in the stadium the listening point is designated. As a result, the perceived sound field changes little even when the listening point moves.

Therefore, an object of the present invention is to provide a technique capable of generating realistic sound that matches an arbitrary viewpoint video.

In order to achieve the above object, an information processing system according to the present invention has the following configuration. Namely, it is
an information processing system that processes an image and sound corresponding to an arbitrary viewpoint based on a plurality of image signals captured by a plurality of imaging devices and a plurality of sound collection signals collected at a plurality of sound collection points, the system comprising:
acquisition means for acquiring a viewpoint position and a direction of a line of sight with respect to an imaging target;
selection means for selecting, from the plurality of sound collection points and according to the viewpoint position and the direction of the line of sight, a sound collection point of a sound collection signal to be used for generating an acoustic signal corresponding to an image that accords with the viewpoint position and the direction of the line of sight and is based on the plurality of image signals; and
sound generation means for generating an acoustic signal using the sound collection signal collected at the sound collection point selected by the selection means.

According to the present invention, it is possible to provide a technique capable of generating realistic sound that matches an arbitrary viewpoint video.

FIG. 1 is a block diagram showing a configuration example of the arbitrary viewpoint video generation system.
FIG. 2 is a schematic diagram showing the arrangement of sound collection points in a stadium.
FIG. 3 is a flowchart showing the processing procedure of the main processing.
FIG. 4 is a diagram showing the data structures of information used in the arbitrary viewpoint video generation system.
FIG. 5 is a flowchart showing the processing procedure of the listening range determination processing.
FIG. 6 is a schematic diagram showing the relationship between the viewpoint and the listening range, listening point, and listening direction.
FIG. 7 is a flowchart showing the processing procedure of the subject position detection processing.
FIG. 8 is a flowchart showing the processing procedure of the sound collection point selection processing.
FIG. 9 is a flowchart showing the processing procedure of the processing for selecting sound collection points within the listening range.
FIG. 10 is a flowchart showing the processing procedure of the reproduction signal generation processing.
FIG. 11 is a flowchart showing the processing procedure of the stereo reproduction signal generation processing.
FIG. 12 is a flowchart showing the processing procedure of the surround reproduction signal generation processing.
FIG. 13 is a flowchart showing the processing procedure of the headphone reproduction signal generation processing.
FIG. 14 is a flowchart showing the processing procedure of the processing for selecting sound collection points within the listening range.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The following embodiments do not limit the present invention, and not all combinations of the features described in the embodiments are necessarily essential to implementing the present invention. Identical components are denoted by the same reference numerals.

<< Embodiment 1 >>
(Arbitrary viewpoint video generation system)
An arbitrary viewpoint video generation system according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing a configuration example of the arbitrary viewpoint video generation system according to the present embodiment. The system operates as an information processing system that outputs an image and sound corresponding to an arbitrary viewpoint, based on a plurality of image signals captured by a plurality of imaging devices and a plurality of sound collection signals collected at a plurality of sound collection points.

In FIG. 1, reference numeral 1 denotes a sound collection signal input unit. It receives sound collection signals from a plurality of microphones installed at sound collection points distributed evenly throughout the stadium that is the imaging target of this system, and performs amplification, noise removal, and the like on the signals. It further adds information indicating the characteristics of each sound collection point to the corresponding sound collection signal and outputs the result to the sound collection point selection unit 3 as sound collection point information.

FIG. 2 is a diagram schematically illustrating an example of how sound collection points are set in the stadium that is the imaging target. In FIG. 2, 101 denotes one of the sound collection points, 102 the spectator seats, 103 the track, and 104 the ground. The present embodiment describes an example in which, as shown in FIG. 2, sound collection points are set evenly throughout the stadium and sound is collected at them at all times.

Reference numeral 2 denotes a listening range determination unit, which determines the listening range, listening point, and listening direction based on the viewpoint information specified by the viewpoint information specifying unit 6, the viewpoint video output by the viewpoint video generation unit 8, and the subject position output by the subject position detection unit 9. Reference numeral 3 denotes a sound collection point selection unit, which selects the sound collection points to be used for generating the sound reproduction signal from the sound collection point information output by the sound collection signal input unit 1, according to the listening range, listening point, and listening direction output from the listening range determination unit 2. Reference numeral 4 denotes an acoustic signal generation unit, which generates a reproduction acoustic signal in an arbitrary reproduction format based on the sound collection signals of the sound collection points selected by the sound collection point selection unit 3, and outputs it to the sound reproduction unit 11 and the MUX 15.

Reference numeral 5 denotes an operation unit, which accepts the user's operation instructions for this system. Reference numeral 6 denotes a viewpoint information specifying unit, which generates viewpoint information based on the user's operation instructions transmitted via the operation unit 5 and outputs it to the listening range determination unit 2 and the viewpoint video generation unit 8. The operation unit 5 is realized by a keyboard, a pointing device such as a mouse, a touch panel, or the like. Thus, in the present embodiment, the arbitrary viewpoint video generation system acquires viewpoint information via the viewpoint information specifying unit 6.

Reference numeral 7 denotes a video signal input unit, which receives video signals captured by a plurality of cameras installed in the stadium that is the imaging target of this system, and performs amplification, noise removal, and the like on the video signals. It further adds the camera parameters at the time of shooting to each video signal and outputs the result to the viewpoint video generation unit 8 as camera shooting information. The present embodiment describes an example in which video, that is, a moving image, is generated as the arbitrary viewpoint image, but still images may also be handled.

Reference numeral 8 denotes a viewpoint video generation unit, which generates an arbitrary viewpoint video by appropriately processing the video from the plurality of cameras in accordance with the viewpoint information specified by the viewpoint information specifying unit 6, and outputs it to the listening range determination unit 2, the subject position detection unit 9, the video reproduction unit 10, and the MUX 15. Reference numeral 9 denotes a subject position detection unit, which detects the position of a subject appearing in the viewpoint video based on the viewpoint video generated by the viewpoint video generation unit 8 and the viewpoint information. As described later, subjects include persons and specific objects other than persons. Reference numeral 10 denotes a video reproduction unit, which reproduces the viewpoint video generated by the viewpoint video generation unit 8 and outputs it to the display unit 19.

Reference numeral 11 denotes a sound reproduction unit, which reproduces the acoustic signal generated by the acoustic signal generation unit 4 according to the reproduction environment. Reference numeral 12 denotes a stereo speaker set, which amplifies the stereo acoustic signal generated by the acoustic signal generation unit 4 as appropriate, converts it into sound, and outputs it. Reference numeral 13 denotes a surround speaker set, which amplifies the surround acoustic signal generated by the acoustic signal generation unit 4 as appropriate, converts it into sound, and outputs it. Reference numeral 14 denotes headphones, which convert the headphone signal generated by the acoustic signal generation unit 4 into sound and output it. The present embodiment describes an example in which the sound of the acoustic signal is output by any one of the stereo speaker set 12, the surround speaker set 13, and the headphones 14, but the sound reproduction environment is not limited to those illustrated here.

Reference numeral 15 denotes a MUX (multiplexer), which multiplexes the arbitrary viewpoint video signal generated by the viewpoint video generation unit 8 and the acoustic signal generated by the acoustic signal generation unit 4 into a single piece of video stream data and outputs it to the communication unit 16 and the output unit 18. Reference numeral 16 denotes a communication unit, which transmits the video stream data output from the MUX 15 to the communication network 17 as appropriate. Reference numeral 17 denotes a communication network, which is a public communication network such as the Internet or a public line network. Reference numeral 18 denotes an output unit, which has an output terminal and outputs the video stream data output from the MUX 15 to an external device connected to that terminal. Reference numeral 19 denotes a display unit, which displays the arbitrary viewpoint video reproduced by the video reproduction unit 10. The display unit 19 is realized by a liquid crystal panel, an organic EL display, or the like.

These components are connected via a control bus to a CPU (central processing unit), not shown, and their operations are controlled in an integrated manner in accordance with instructions from the CPU. The CPU reads a computer program from a storage device, not shown, and controls the entire apparatus according to that program.

(Main processing)
In the configuration of the present embodiment shown in FIG. 1, the processing that determines the listening range, listening point, and listening direction according to the viewpoint information and generates the sound field based on them is described below with reference to flowcharts. FIG. 3 is a flowchart showing the processing procedure of the main processing of the present embodiment. Each of the following steps is executed under the control of the CPU.

S101 is a process in which the viewpoint information specifying unit 6 changes the viewpoint information in accordance with the commands temporarily stored in its internal change command buffer, and outputs it to the listening range determination unit 2 and the viewpoint video generation unit 8.

FIG. 4(a) shows the data structure of the viewpoint information in the present embodiment. As shown in FIG. 4(a), the viewpoint information includes a viewpoint position, a viewpoint depression angle, a viewpoint direction, and an angle of view. The viewpoint position is a three-dimensional coordinate indicating the position of the viewpoint in the stadium that is the imaging target. As an example, the present embodiment sets a three-dimensional coordinate system with the X axis pointing east, the Y axis pointing north, and the Z axis pointing up, with the origin at the southwest corner of the whole stadium. The viewpoint depression angle is the depression angle at which the viewpoint is directed, and is specified in the range of ±90° with the horizontal direction as 0°. The viewpoint direction is the direction in the horizontal plane in which the viewpoint is facing; in the present embodiment it is expressed as an absolute direction with true north (that is, the positive Y-axis direction) as 0°, with clockwise positive and counterclockwise negative. The angle of view is a value indicating, as angles, the vertical and horizontal extent of the viewpoint video seen from the viewpoint. The three-dimensional direction of the line of sight observed from the viewpoint is hereinafter referred to as the direction of the line of sight. The direction of the line of sight corresponds to the combination of the viewpoint depression angle (viewpoint elevation angle) and the viewpoint direction. The system of the present embodiment can generate video seen from the arbitrary viewpoint specified by this information.
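As an illustration of how the viewpoint information could be held and turned into a three-dimensional line-of-sight vector, the following is a minimal Python sketch. The class name, field names, and function name are hypothetical and do not appear in the specification. The sketch also assumes that the vertical angle is stored as a signed elevation (negative when looking down), which the specification does not state explicitly.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ViewpointInfo:
    """Hypothetical container mirroring the fields of FIG. 4(a)."""
    position: Tuple[float, float, float]    # (x east, y north, z up); origin at the SW corner
    elevation_deg: float                    # assumption: signed, negative = looking down, -90..+90
    azimuth_deg: float                      # 0 = true north (+Y), clockwise positive
    angle_of_view_deg: Tuple[float, float]  # (horizontal, vertical) extent in degrees


def line_of_sight(vp: ViewpointInfo) -> Tuple[float, float, float]:
    """Combine the vertical angle and the horizontal direction into a unit vector."""
    el = math.radians(vp.elevation_deg)
    az = math.radians(vp.azimuth_deg)
    horizontal = math.cos(el)  # length of the projection onto the horizontal plane
    # Azimuth is measured clockwise from +Y (north), so east (+X) uses sin and north uses cos.
    return (horizontal * math.sin(az), horizontal * math.cos(az), math.sin(el))
```

With these conventions, an azimuth of 90° and an elevation of 0° point due east along the X axis, and an elevation of -90° points straight down.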

When the processing of S101 ends, three flows run in parallel: the flow from S102 to S105 that generates the acoustic signal, the flow from S106 to S108 that generates the video, and the processing of S109. Each of these flows is described below.

In S102, the sound captured at the sound collection points is acquired, and the sound collection signal input unit 1 performs amplification, noise removal, and the like on the sound collection signals of the plurality of microphones. It further adds header information indicating the characteristics of each sound collection point to the corresponding sound collection signal and outputs the result to the sound collection point selection unit 3 as sound collection point information. FIG. 4(b) shows the data structure of the sound collection point information in the present embodiment. As shown in FIG. 4(b), the sound collection point information includes a sound collection point ID, sound collection point coordinates, and a sound collection signal. The sound collection point ID is a number (identification information) for identifying the sound collection point. The sound collection point coordinates indicate the position of the sound collection point; in the present embodiment they are coordinates in the horizontal plane of the stadium. The sound collection signal is the acoustic signal itself collected by the microphone installed at the sound collection point.
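The header-attachment step of S102 can be pictured with a short Python sketch. All names here are hypothetical, and assigning IDs from the list index is an assumption made only for illustration; the specification says only that the ID is an identifying number.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SoundCollectionPointInfo:
    """Hypothetical container mirroring the fields of FIG. 4(b)."""
    point_id: int                     # number identifying the sound collection point
    coordinates: Tuple[float, float]  # horizontal-plane (x, y) position in the stadium
    signal: List[float]               # samples collected by the microphone at this point


def attach_headers(
    signals: Sequence[Sequence[float]],
    coordinates: Sequence[Tuple[float, float]],
) -> List[SoundCollectionPointInfo]:
    """Pair each (already amplified and denoised) signal with its point's header."""
    if len(signals) != len(coordinates):
        raise ValueError("one coordinate pair is required per sound collection signal")
    # Assumption: the ID is simply the index of the microphone in the input list.
    return [
        SoundCollectionPointInfo(i, (float(x), float(y)), list(sig))
        for i, (sig, (x, y)) in enumerate(zip(signals, coordinates))
    ]
```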

Next, in S103, the listening range determination unit 2 determines the listening range, listening point, and listening direction based on the viewpoint information transmitted from the viewpoint information specifying unit 6 in S101. Details of this process (listening range determination process) will be described later with reference to FIGS. 5 and 6. Next, in S104, the sound collection point selection unit 3 selects the sound collection points whose sound collection signals are to be used for generating an acoustic signal that matches the viewpoint video. Details of this process (sound collection point selection process) will be described later with reference to FIGS. 8 and 9. Next, in S105, the acoustic signal generation unit 4 generates an acoustic signal suited to each sound reproduction environment. Details of this process (reproduction signal generation process) will be described later with reference to FIGS. 10, 11, 12, and 13. The generated acoustic signal is output to the sound reproduction unit 11 and the MUX 15. When this processing ends, the process proceeds to S110.

Meanwhile, in S106, the video signals captured by the plurality of cameras are acquired, and the video signal input unit 7 performs noise removal, brightness adjustment, and the like on the video signals. It further adds the shooting parameters of each camera to the video signal as header information and outputs the result to the viewpoint video generation unit 8 as camera shooting information. FIG. 4(c) shows the data structure of the camera shooting information in the present embodiment. As shown in FIG. 4(c), the camera shooting information includes a camera position, a camera depression angle, a camera direction, an angle of view, a focal length, and a video signal. The camera position is a three-dimensional coordinate indicating the position of the camera in the stadium that is the imaging target. The camera depression angle is the depression angle at which the camera is directed, and is specified in the range of ±90° with the horizontal direction as 0°. The camera direction is the direction in the horizontal plane in which the camera is facing; in the present embodiment it is expressed as an absolute direction with true north (that is, the positive Y-axis direction) as 0°, with clockwise positive and counterclockwise negative. The angle of view is a value indicating the extent of the captured video as an angle. The focal length is a value indicating the distance from the optical center of the camera lens to the imaging surface. The video signal is the signal of the video itself captured by this camera.

Next, in S107, the viewpoint video generation unit 8 appropriately processes and combines the multi-camera video received in S106, based on the viewpoint information transmitted from the viewpoint information specifying unit 6 in S101, to generate an arbitrary viewpoint video. That is, it performs image generation that produces, from the plurality of image signals, an image corresponding to the viewpoint position and the direction of the line of sight. Techniques for synthesizing video from an arbitrary viewpoint out of multiple camera videos are well known and commonly practiced in this field, so a detailed description is omitted.

Next, in S108, the subject position detection unit 9 analyzes the multi-camera video received in S106 and the arbitrary viewpoint video generated in S107 to detect the position in the stadium where each subject appearing in the arbitrary viewpoint video actually exists. Details of this process (subject position detection process) will be described later with reference to FIG. 7. When this processing ends, the process proceeds to S110.

Further, in S109, the viewpoint information specifying unit 6 accepts a viewpoint change instruction entered by the user via the operation unit 5, converts it into a viewpoint information change command, and temporarily stores it in the internal change command buffer. When this processing ends, the process proceeds to S110.

S110 is a process in which the MUX 15 multiplexes the sound reproduction signal generated in S105 and the arbitrary viewpoint video signal generated in S107 into a single piece of video stream data and outputs it to the communication unit 16 and the output unit 18. When this processing ends, the process proceeds to S111.

S111 is a process in which the CPU, not shown, determines the output destination of the system of the present embodiment. If the output destination is a reproduction device, the process proceeds to S112. If it is the communication network, the process proceeds to S113. If it is an external device, the process proceeds to S114.

S112 is a process in which the sound reproduction unit 11 and the video reproduction unit 10 output the sound reproduction signal generated in S105 and the arbitrary viewpoint video signal generated in S107, in synchronization, to the sound reproduction environment and the display unit 19, respectively. Such processing is commonly performed in ordinary video output devices and is well known, so a detailed description is omitted. Because this processing reproduces the arbitrary viewpoint video and its matching acoustic signal in synchronization, it can heighten the sense of presence during video playback. When this processing ends, the process proceeds to S115.

S113 is a process in which the communication unit 16 transmits the video stream data created in S110 to the outside via the communication network 17. When this processing ends, the process proceeds to S115.

S114 is a process in which the output unit 18 outputs the video stream created in S110 to the external device connected to the external output terminal. When this processing ends, the process proceeds to S115.

S115 is a process in which the CPU, not shown, determines whether to end the main processing performed by this flow as a whole. If the processing is to end (YES in S115), the processing of this flow ends. Otherwise (NO in S115), the process returns to S101.
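The control flow of one pass through S101 to S110, with its three parallel branches, can be sketched in Python as follows. This is only an outline under stated assumptions: the function and parameter names are hypothetical, each step is passed in as a callable rather than implemented, and a thread pool stands in for whatever parallel execution mechanism the actual system uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_main_iteration(
    apply_viewpoint_changes: Callable[[], Any],  # S101
    audio_branch: Callable[[Any], Any],          # S102 to S105
    video_branch: Callable[[Any], Any],          # S106 to S108
    accept_user_input: Callable[[], None],       # S109
    mux: Callable[[Any, Any], Any],              # S110
) -> Any:
    """Run one pass of the main flow and return the multiplexed stream data."""
    viewpoint = apply_viewpoint_changes()
    # The three branches start after S101 and all finish before S110.
    with ThreadPoolExecutor(max_workers=3) as pool:
        audio_future = pool.submit(audio_branch, viewpoint)
        video_future = pool.submit(video_branch, viewpoint)
        input_future = pool.submit(accept_user_input)
        audio = audio_future.result()
        video = video_future.result()
        input_future.result()  # re-raises any error from the input branch
    return mux(video, audio)
```

An outer loop would call this function repeatedly, route the returned stream according to the destination check of S111, and stop when the termination check of S115 succeeds.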

(Listening range determination processing)
FIG. 5 is a flowchart showing the detailed processing procedure of the listening range determination processing of S103 in the present embodiment. All processing in this flowchart is performed by the listening range determination unit 2.

First, in S201, the listening point information stored in an internal RAM (not shown) of the listening range determination unit 2 is initialized. The data structure of the listening point information is shown in FIG. 4(d). The listening point information in the present embodiment includes a listening range, a listening point, and a listening direction. Of these, the listening range is data indicating the range, within the stadium being photographed, in which sounds are generated that would be heard close at hand by a viewer immersed in the viewpoint video. In the present embodiment, the coordinates of four points on the horizontal plane are stored, and the range enclosed by the quadrilateral formed by connecting them is the listening range. As described later, the listening range functions as a spatial range serving as a reference for selecting the sound collection points of the collected sound signals used to generate the acoustic signal. The listening point is the reference point for arranging the collected sound signals when the reproduction signal is generated in later processing, and is likewise stored as coordinates on the horizontal plane. The listening direction is the reference direction for arranging the collected sound signals when the reproduction signal is generated in later processing. In the present embodiment, the front direction as seen from the listening point is expressed as an absolute angle, in the same manner as the viewpoint direction. In S201, all of the data in this listening point information is initialized.
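The listening point information described above can be pictured as a small record. The following is a minimal sketch only; the class name `ListeningPointInfo`, its field names, and the choice of initial values are illustrative assumptions and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point2D = Tuple[float, float]  # (x, y) on the horizontal plane


@dataclass
class ListeningPointInfo:
    """Hypothetical container mirroring the three items of FIG. 4(d)."""
    # Four vertices of the quadrilateral that bounds the listening range.
    listening_range: List[Point2D] = field(default_factory=list)
    # Reference point for arranging the collected sound signals.
    listening_point: Point2D = (0.0, 0.0)
    # Absolute angle in degrees; 0 is the positive Y-axis direction.
    listening_direction_deg: float = 0.0

    def reset(self) -> None:
        """Corresponds to S201: clear every field (initial values assumed)."""
        self.listening_range = []
        self.listening_point = (0.0, 0.0)
        self.listening_direction_deg = 0.0
```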

In S202, it is determined whether the elevation angle of the viewpoint information specified by the viewpoint information specifying unit 6 in S101 is below −10°. The purpose of this processing is to determine whether the specified viewpoint is an overhead viewpoint or a horizontal viewpoint. An overhead viewpoint is one positioned to look down on the shooting target, such as a stadium, from above; a horizontal viewpoint is one positioned to observe the shooting target horizontally, almost directly from the side. In the present embodiment the criterion for switching is an elevation angle of −10°, but this is only an example, and another criterion may be used depending on the circumstances. For example, if the shooting target is a basketball court in a gymnasium, the criterion elevation angle may be made steeper, such as −20°. In this way, when the nature of the observation target makes the vertical extent of observation large, the cases judged to be an overhead viewpoint can be set appropriately, so that the listening range is determined appropriately and an acoustic signal with a sense of presence can be rendered. Alternatively, the horizontal-plane projection range of the angle of view, which is calculated in the later step S203, may be computed first; the viewpoint is then treated as an overhead viewpoint if the projection range fits within a predetermined range, and as a horizontal viewpoint if it does not. Thus, in the present embodiment, it is determined whether the elevation angle of the line of sight falls below a predetermined negative value, and the processing for determining the listening range branches according to the result, so that the listening range on which acoustic signal generation is based can be determined appropriately.

If the result of this determination is that the elevation angle is below −10°, that is, in the case of an overhead viewpoint (YES in S202), the process proceeds to S203. Otherwise, that is, in the case of a horizontal viewpoint (NO in S202), the process proceeds to S205.
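The S202 branch reduces to a single threshold comparison. The sketch below assumes a hypothetical function name and treats the threshold as a parameter, since the text notes that −10° is only an example.

```python
def is_overhead_viewpoint(elevation_deg: float, threshold_deg: float = -10.0) -> bool:
    """Return True when the line of sight points below the threshold (S202: YES).

    elevation_deg is negative when looking downward. The default of -10
    degrees follows the example in the text; -20 might suit a gymnasium.
    """
    return elevation_deg < threshold_deg
```

A viewpoint with an elevation of exactly −10° is treated here as a horizontal viewpoint, because the text tests for "below −10°".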

In S203, the range obtained by projecting the angle of view of this viewpoint information onto the playing surface of the stadium or other shooting target, that is, onto the horizontal plane, is calculated and set as the listening range. As an example, consider projecting onto a playing surface of height 0 the viewpoint video of a viewpoint with viewpoint position (15, 0, 10), viewpoint depression angle −45°, viewpoint direction 0° (that is, the positive Y-axis direction), horizontal angle of view 20°, and vertical angle of view 14°. In this case, the horizontal-plane coordinates of the center point of the projection plane (Z = 0) are (15, 10), and the projection is a trapezoid whose upper edge has Y coordinate 10 × tan 52° ≈ 12.8 and whose lower edge has Y coordinate 10 × tan 38° ≈ 7.8. Here, half of the vertical angle of view of 14° is 14°/2 = 7°, so 52° = 45° + 7° and 38° = 45° − 7°.

The distances from the viewpoint to the center points of the upper and lower edges of the trapezoid are 10 / cos 52° ≈ 16.2 and 10 / cos 38° ≈ 12.7, respectively. Opening 10° to the left and right of each (half of the 20° horizontal angle of view), the length of the upper edge of the trapezoid is 16.2 × tan 10° × 2 ≈ 5.7 and the length of the lower edge is 12.7 × tan 10° × 2 ≈ 4.5.

Therefore, on the projection plane Z = 0, the listening range is the region enclosed by the four points (12.15, 12.8), (17.85, 12.8), (12.75, 7.8), and (17.25, 7.8). Here, the x coordinates of the two vertices of the upper edge of the trapezoid are 15 − 5.7/2 = 12.15 and 15 + 5.7/2 = 17.85, and those of the two vertices of the lower edge are 15 − 4.5/2 = 12.75 and 15 + 4.5/2 = 17.25. The listening range calculated in this way is stored in the listening point information held in the internal RAM of the listening range determination unit 2.
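The arithmetic of this worked example can be reproduced with a short sketch. The function name and signature are assumptions made for illustration, and the code handles only the case of the example, where the viewpoint direction is 0° (positive Y axis) and the playing surface is at Z = 0. Because it does not round intermediate values, its results differ from the figures in the text in the second decimal place.

```python
import math


def project_view_to_ground(eye, elevation_deg, h_fov_deg, v_fov_deg):
    """Project the angle of view onto the plane Z = 0 (sketch of S203).

    eye is (x, y, z); the line of sight is assumed to head toward +Y.
    Returns four (x, y) vertices: far-left, far-right, near-left, near-right.
    Valid only while the far edge stays below the horizon.
    """
    ex, ey, ez = eye
    # Angle between "straight down" and the line of sight (45 deg in the example).
    center = 90.0 + elevation_deg
    corners = []
    for edge_deg in (center + v_fov_deg / 2.0, center - v_fov_deg / 2.0):
        a = math.radians(edge_deg)
        y = ey + ez * math.tan(a)          # e.g. 10 * tan(52 deg)
        dist = ez / math.cos(a)            # e.g. 10 / cos(52 deg)
        half_width = dist * math.tan(math.radians(h_fov_deg / 2.0))
        corners += [(ex - half_width, y), (ex + half_width, y)]
    return corners
```

Calling `project_view_to_ground((15.0, 0.0, 10.0), -45.0, 20.0, 14.0)` yields vertices that agree with the four points in the text to within rounding.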

Next, in S204, for the listening range determined in S203, the direction on the projection plane (horizontal plane) that corresponds to the upward direction of the angle of view is set as the listening direction. In the preceding example this is 0° (that is, the positive Y-axis direction). When this processing ends, the process proceeds to S208.

In S205, on the other hand, the subject position coordinates detected by the subject position detection unit 9 in S108 are acquired. Next, in S206, a range enclosing the acquired subject positions is calculated and set as the listening range. For example, if three subject positions were acquired in S205 and their horizontal-plane coordinates are (2, 2), (6, 3), and (5, 6), the listening range enclosing them is set as the region bounded by the four points (1, 1), (1, 7), (7, 1), and (7, 7). That is, letting Xmin and Xmax be the minimum and maximum X coordinates over all subject positions, and Ymin and Ymax the minimum and maximum Y coordinates, the listening range in this example is the quadrilateral with vertices (Xmin − 1, Ymin − 1), (Xmin − 1, Ymax + 1), (Xmax + 1, Ymin − 1), and (Xmax + 1, Ymax + 1). The listening range need not be a quadrilateral, however, as long as it is a region of the minimum necessary size that contains all of the acquired subject positions.
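The bounding computation of S206 can be sketched as follows. The function name is hypothetical, and the margin of 1 is taken from the example in the text; the source does not say whether it is fixed.

```python
def bounding_listening_range(subjects, margin=1.0):
    """Return the four vertices of the box enclosing all subject positions.

    subjects is a non-empty list of (x, y) horizontal-plane coordinates.
    Vertex order follows the text: (Xmin-1, Ymin-1), (Xmin-1, Ymax+1),
    (Xmax+1, Ymin-1), (Xmax+1, Ymax+1).
    """
    xs = [x for x, _ in subjects]
    ys = [y for _, y in subjects]
    x_min, x_max = min(xs) - margin, max(xs) + margin
    y_min, y_max = min(ys) - margin, max(ys) + margin
    return [(x_min, y_min), (x_min, y_max), (x_max, y_min), (x_max, y_max)]
```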

Next, in S207, the viewpoint direction of the viewpoint information transmitted in S101 is stored unchanged in the listening point information as the listening direction. As a result, for a horizontal viewpoint, the viewpoint direction in the viewpoint video matches the direction of sound in the reproduction signal. When this processing ends, the process proceeds to S208.

S208 is processing that stores the center point of the listening range determined in S203 or S206 in the listening point information as the listening point. In the example of the present embodiment the listening range is a quadrilateral, so the point where its diagonals intersect is calculated and stored as the listening point in the listening point information in the internal RAM. The coordinates of the listening point may instead be the average of the coordinates of the vertices of the quadrilateral defining the listening range. Next, in S209, the listening point information stored in the internal RAM is output to the sound collection point selection unit 3, and the listening range determination processing ends and returns.
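Both ways of obtaining the listening point in S208 can be sketched briefly. The function names are hypothetical, and the code assumes the vertex order used in the examples of the text, in which the first and fourth vertices form one diagonal and the second and third form the other.

```python
def diagonal_intersection(quad):
    """Intersection of diagonals A-D and B-C of quad = [A, B, C, D]."""
    (ax, ay), (bx, by), (cx, cy), (dx, dy) = quad
    r = (dx - ax, dy - ay)          # direction of diagonal A -> D
    s = (cx - bx, cy - by)          # direction of diagonal B -> C
    denom = r[0] * s[1] - r[1] * s[0]
    if denom == 0:
        raise ValueError("diagonals are parallel; not a proper quadrilateral")
    # Solve A + t*r = B + u*s for t using 2D cross products.
    t = ((bx - ax) * s[1] - (by - ay) * s[0]) / denom
    return (ax + t * r[0], ay + t * r[1])


def vertex_average(quad):
    """Alternative mentioned in the text: mean of the four vertices."""
    return (sum(x for x, _ in quad) / 4.0, sum(y for _, y in quad) / 4.0)
```

For a rectangle the two methods coincide, but for the trapezoid of the overhead example they give different Y coordinates (about 10.0 for the diagonal intersection and 10.3 for the vertex average), so the choice affects where the listening point lands.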

FIG. 6 is a schematic diagram showing the relationship between the viewpoint and the listening range, listening point, and listening direction in the present embodiment. In FIG. 6, the listening range is indicated by a dotted rectangle, the listening point by an eight-pointed star, and the listening direction by a black arrow. FIG. 6(a) shows the case of an overhead viewpoint: the range obtained by projecting the angle of view onto the horizontal plane is the listening range, the intersection of its diagonals is the listening point, and the positive Y-axis direction, which corresponds to the upward direction of the angle of view, is the listening direction. FIG. 6(b) shows the case of a horizontal viewpoint: the positions of the subjects in the viewpoint video are detected, and the listening range is set so as to enclose them. The listening point is the intersection of the diagonals of the listening range, and the viewpoint direction (positive Y-axis direction) is the listening direction.

As described above, in the listening range determination processing of the present embodiment, the listening range, listening point, and listening position corresponding to the arbitrary viewpoint video are determined automatically from the viewpoint information and the positions of the subjects. That is, a listening point serving as a reference for generating an acoustic signal corresponding to the image generated by the viewpoint video generation unit 8 is determined according to the viewpoint position and the direction of the line of sight, and an acoustic signal corresponding to this listening point is generated on the basis of the plurality of collected sound signals. By automatically determining the listening point, listening range, and listening direction according to the viewpoint position, the direction of the line of sight, the angle of view, the positions of the subjects, and so on, and automatically generating the sound for an arbitrary viewpoint, changes in sound that accompany changes in viewpoint can be rendered faithfully. In the present embodiment, the listening range and the listening position are determined from the projection range of the angle of view or from detection of subject positions, but the center of the arbitrary viewpoint video may simply always be treated as the listening point.

In the present embodiment, when the elevation angle of the line of sight falls below a predetermined negative value and the viewpoint is judged to be an overhead viewpoint, the range of the shooting target corresponding to the angle of view seen from the viewpoint is determined as the listening range. When the elevation angle of the line of sight does not fall below the predetermined negative value and the viewpoint is judged to be a horizontal viewpoint, a range containing the positions of the subjects in the shooting target is determined as the listening range. By branching the method of determining the listening range according to the elevation angle of the line of sight in this way, sound with a sense of presence appropriate to that elevation angle can be reproduced.

In the present embodiment, the listening direction, which indicates the direction of listening at the listening point, is determined on the basis of the direction of the line of sight, and an acoustic signal corresponding to this listening direction is generated when the acoustic signal is generated. Specifically, for an overhead viewpoint the upward direction of the angle of view is determined as the listening direction, and for a horizontal viewpoint the viewpoint direction is determined as the listening direction. An acoustic signal corresponding to the direction of the line of sight can therefore be generated. Other variations are possible without departing from the gist of the present invention.

(Subject position detection processing)
FIG. 7 is a flowchart showing the detailed procedure of the subject position detection processing of S108 in the present embodiment. All processing in this flowchart is executed by the subject position detection unit 9.

First, in S301, all data temporarily stored in the internal RAM of the subject position detection unit 9 is initialized. Next, in S302, the viewpoint video generated by the viewpoint video generation unit 8 in S107 is analyzed, and all subjects that are in focus in the viewpoint video are detected and extracted. For example, by converting the viewpoint video into a contrast image, an edge image, or the like, subjects with clear outlines, that is, subjects in focus, are detected. The subjects to be extracted are not limited to people such as players and may be objects such as cars or motorcycles. There may be one subject or several. In S302, all such in-focus subjects are extracted, and the features of each extracted image are temporarily stored in the internal RAM as subject information.

Next, S303 through S306 form a loop over the individual items of subject information extracted in S302. The loop starts in S303. In S304, from among the camera images used to generate the arbitrary viewpoint video, a plurality of camera images in which the subject being processed appears are identified, and the camera position coordinates and the direction of the subject are obtained for each. Next, in S305, the position coordinates of the subject being processed are calculated by triangulation from the plurality of camera position coordinates and subject directions obtained in S304. The calculated coordinates are stored in the internal RAM of the subject position detection unit 9 as subject position coordinates.
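The source does not spell out the triangulation of S305. As one illustration only, the sketch below intersects two rays on the horizontal plane, each defined by a camera position and a bearing toward the subject. The function name is hypothetical, and the bearing convention (0° is the positive Y axis, increasing clockwise) is an assumption chosen to match the absolute angles used elsewhere in the description. A practical system would likely use more than two cameras and a least-squares fit.

```python
import math


def triangulate_2d(cam1, bearing1_deg, cam2, bearing2_deg):
    """Intersect two bearing rays on the horizontal plane.

    cam1, cam2: (x, y) camera positions.
    bearing*_deg: direction toward the subject; 0 deg = +Y, clockwise positive
    (assumed convention).
    """
    d1 = (math.sin(math.radians(bearing1_deg)), math.cos(math.radians(bearing1_deg)))
    d2 = (math.sin(math.radians(bearing2_deg)), math.cos(math.radians(bearing2_deg)))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        raise ValueError("bearings are parallel; position cannot be triangulated")
    # Solve cam1 + t*d1 = cam2 + u*d2 for t.
    t = ((cam2[0] - cam1[0]) * d2[1] - (cam2[1] - cam1[1]) * d2[0]) / denom
    return (cam1[0] + t * d1[0], cam1[1] + t * d1[1])
```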

In S306, it is determined whether processing has been completed for all items of subject information; if so, the loop is exited, and the subject position detection processing ends and returns. The subject position coordinates stored in the internal RAM are output to the listening range determination unit 2 as needed in response to requests from the listening range determination unit 2.

Thus, in the present embodiment, the arbitrary viewpoint image generated by the viewpoint video generation unit 8 is analyzed to detect the positions of the subjects appearing in that image. The positions of the subjects can therefore be detected, and the listening range determined appropriately, without providing dedicated sensors or the like for detecting subject positions. The positions of the subjects may nevertheless be detected using position sensors or the like.

(Sound collection point selection processing)
FIG. 8 is a flowchart showing the detailed procedure of the sound collection point selection processing of S104 in the present embodiment. All processing in this flowchart is executed by the sound collection point selection unit 3.

First, in S401, the selected sound collection point information list stored in the internal RAM of the sound collection point selection unit 3 is initialized. The selected sound collection point information list is an area that stores information on the selected sound collection points. FIG. 4(e) shows an example of the data structure of the selected sound collection point information in the present embodiment. As shown in FIG. 4(e), the selected sound collection point information includes a sound collection point ID, a corresponding sound source arrangement direction ID, and a direction as seen from the listening point. Of these, the sound collection point ID is an ID (identification information) for identifying the sound collection point, and uses the same data as the item of the same name stored in the sound collection point information described with reference to FIG. 4(b). The corresponding sound source arrangement direction ID is a number (identification information) indicating the sound source arrangement direction covered by this sound collection point. Sound source arrangement directions are described later. The direction as seen from the listening point is the direction of this sound collection point as seen from the listening point, calculated relative to the listening direction.

Next, in S402, the approximate directions in which sound sources are to be arranged around the listener during reproduction are determined on the basis of the listening point information determined in S103. In the present embodiment, eight directions are set as the sound source arrangement directions, starting from the listening direction as 0° and going once around the horizontal plane at 45° intervals.

Next, S403 through S410 form a loop over the sound source arrangement directions set in S402. The loop starts in S403.

In S404, the region within ±22.5° of the target sound source arrangement direction, as seen from the listening point, is set as the sound collection point search range. The sound collection point corresponding to this sound source arrangement direction is searched for within this search range.
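The eight arrangement directions of S402 and the ±22.5° search ranges of S404 amount to dividing the circle around the listening point into equal sectors. A minimal sketch follows; the function names are hypothetical, and the treatment of an angle that falls exactly on a sector boundary (assigned here to the higher-numbered sector) is an assumption, since the source does not address it.

```python
def source_directions(n=8):
    """Arrangement directions relative to the listening direction, in degrees."""
    return [i * 360.0 / n for i in range(n)]


def sector_index(relative_deg, n=8):
    """Index of the arrangement direction whose search range contains the angle.

    relative_deg is measured from the listening direction. With n = 8 each
    search range spans +/-22.5 degrees around its arrangement direction.
    """
    width = 360.0 / n
    return int(((relative_deg + width / 2.0) % 360.0) // width)
```

For example, a sound collection point seen at −30° relative to the listening direction falls in sector 7, whose arrangement direction is 315° (equivalently −45°).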

Next, in S405, it is determined whether there is a sound collection point within the search range set in S404. If there is a sound collection point within the search range (YES in S405), the process proceeds to S406. Otherwise (NO in S405), it is decided that no sound collection point is assigned to the target sound source arrangement direction, and the process proceeds to S410.

S406 is processing that determines whether there is a sound collection point that lies within the angular extent of the search range and outside the listening range. If there is such a sound collection point (YES in S406), the process proceeds to S407. Otherwise (NO in S406), the process proceeds to S408.

S407 is processing that selects, as the sound source for this sound source arrangement direction, the sound collection point that is within the search range, outside the listening range, and closest to the listening point. Once a sound collection point has been selected, a new element is added to the selected sound collection point information list stored in the internal RAM of the sound collection point selection unit 3, and the sound collection point ID of the selected sound collection point and the corresponding sound source arrangement direction ID are stored in it. When this processing ends, the process proceeds to S409.

In S408, on the other hand, the sound collection point that is within the search range, within the listening range, and farthest from the listening point is selected as the sound source for this sound source arrangement direction. In this case too, an element storing the information on the selected sound collection point is created and added to the selected sound collection point information list. When this processing ends, the process proceeds to S409.
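The decision made for one sound source arrangement direction in S405 through S408 can be sketched as below. The function name and the candidate tuple layout are illustrative assumptions; the candidates are taken to be already restricted to the search range of the direction, and whether each lies inside the listening range is supplied as a flag rather than computed.

```python
import math


def select_for_direction(listening_point, candidates):
    """Pick one sound collection point for a single arrangement direction.

    candidates: list of (point_id, (x, y), inside_listening_range) tuples,
    all assumed to lie within the +/-22.5 degree search range.
    Returns the selected point_id, or None when nothing is assigned.
    """
    if not candidates:                       # S405: NO, assign nothing
        return None

    def distance(candidate):
        return math.dist(listening_point, candidate[1])

    outside = [c for c in candidates if not c[2]]
    if outside:                              # S406: YES -> S407
        return min(outside, key=distance)[0]  # nearest point outside the range
    return max(candidates, key=distance)[0]   # S408: farthest point inside
```

Preferring the nearest point just outside the listening range, and otherwise the farthest point inside it, keeps the selected point close to the boundary of the listening range in either case.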

S409 is processing that calculates the direction of the sound collection point selected in S407 or S408 relative to the listening direction, and stores it in the selected sound collection point information newly added in the preceding step. For example, if the coordinates of the listening point are (1, 1), the coordinates of the sound collection point are (2, 1 + √3), and the listening direction is 60°, the direction of the sound collection point as seen from the listening point is −30°, so this angle is stored in the selected sound collection point information. When this processing ends, the process proceeds to S410.
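The −30° of this example can be checked with a short sketch. The function name is hypothetical. The text states that 0° is the positive Y-axis direction; that angles increase clockwise (toward the positive X axis) is an assumption inferred from the example, under which the sound collection point lies at an absolute angle of 30°.

```python
import math


def direction_from_listener(listening_point, listening_dir_deg, collection_point):
    """Direction of a sound collection point relative to the listening direction.

    Absolute angles use 0 deg = +Y and increase clockwise (assumed convention).
    The result is normalized to the interval [-180, 180).
    """
    dx = collection_point[0] - listening_point[0]
    dy = collection_point[1] - listening_point[1]
    absolute_deg = math.degrees(math.atan2(dx, dy))   # angle measured from +Y
    return (absolute_deg - listening_dir_deg + 180.0) % 360.0 - 180.0
```

With the values of the example, the absolute angle is atan2(1, √3) = 30°, and 30° − 60° = −30°.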

In S410, it is determined whether processing has been completed for all sound source arrangement directions; if so, the loop ends and the process proceeds to S411.

S411 is processing that selects, from among the sound collection points within the listening range, those to be used for reproduction, and determines their arrangement directions. The details of this processing (in-listening-range sound collection point selection processing) are described later with reference to FIG. 9.

In S412, the selected sound collection point information list created by the processing so far is output to the acoustic signal generation unit 4. The sound collection point selection processing then ends and returns.

As described above, in the present embodiment, the sound collection points used to generate the acoustic signal are selected from among the plurality of sound collection points on the basis of the listening range determined according to the viewpoint position and the direction of the line of sight, and the acoustic signal is generated using the collected sound signals collected at those sound collection points. By selecting the sound collection points needed to generate the reproduction signal before the reproduction signal generation processing of S105, the processing required for reproduction signal generation can be reduced. Furthermore, by automatically selecting sound collection points that match the range of interest in the viewpoint video, a more immersive reproduced acoustic signal that matches the viewpoint video can be generated. Sound with a sense of presence that matches the arbitrary viewpoint video can therefore be generated with a small amount of processing.

In the present embodiment, the shooting target as seen from the listening point determined according to the viewpoint position and the direction of the line of sight is divided into a plurality of regions, and a sound collection point is selected from each of the regions on the basis of the listening range. Sound collection points around the listening point can therefore be selected evenly, with the listening point as the reference.

When a sound collection point exists inside the listening range in a region seen from the listening point, the sound collection point farthest from the listening point is selected from among those inside the listening range. When no sound collection point exists inside the listening range in a region seen from the listening point, the sound collection point closest to the listening point is selected from among those in that region. Sound collection points appropriate to the extent of the listening range can therefore be selected, and the reproduction signal can be generated appropriately.

In the present embodiment, an example has been described in which eight directions are selected for the sound sources arranged around the listener in the reproduced acoustic signal, but the number of sound source directions is not limited to eight and may be larger or smaller. Also, an example has been described in which the surroundings of the listening point are divided equally to obtain the sound source arrangement directions, but the division need not be equal; for example, the directions may be chosen by dividing according to the channel directions of the sound reproduction environment.

(In-listening-range sound collection point selection processing)
FIG. 9 is a flowchart showing the detailed procedure of the in-listening-range sound collection point selection processing of S411 in the present embodiment. All processing in this flowchart is also executed by the sound collection point selection unit 3.

First, in S501, the sound collection points contained in the listening range are listed and temporarily stored in the internal RAM of the sound collection point selection unit 3. Next, S502 through S505 form a loop over the individual sound collection points listed in S501. The loop starts in S502.

In S503, it is determined whether the sound collection point being processed is included in the selected sound collection point information list stored in the internal RAM. If the target sound collection point is not included in the selected sound collection point information list (NO in S503), the process proceeds to S504. If it is included (YES in S503), it has already been selected as a sound collection point to be used for reproduction, so the process proceeds to S505, the end of the loop body.

In S504, a new element is added to the selected sound collection point information list stored in the internal RAM, and the sound collection point ID of this sound collection point and 0° as the direction seen from the listening point are stored in it. As a result, in the present embodiment, the signal collected at a sound collection point within the listening range is reproduced in the later reproduction signal generation processing so as to be localized in front of the listener. When this processing ends, the process proceeds to S505.

In S505, it is checked whether processing has been completed for all of the sound collection points listed in S501; if so, the loop is exited, and the in-listening-range sound collection point selection processing ends and returns.
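The loop of S501 through S505 can be sketched as follows. The function name and the dictionary keys are illustrative assumptions standing in for the selected sound collection point information of FIG. 4(e). The source does not say what is stored as the corresponding sound source arrangement direction ID for points added in S504, so the sketch leaves it as `None`.

```python
def add_in_range_points(selected, in_range_ids):
    """Append every in-range sound collection point not yet selected.

    selected: list of dicts with keys 'point_id', 'direction_id', 'direction_deg'.
    in_range_ids: IDs of the sound collection points inside the listening range
    (the list built in S501).
    """
    already = {entry["point_id"] for entry in selected}
    for point_id in in_range_ids:            # S502 to S505 loop
        if point_id in already:              # S503: YES, already selected
            continue
        selected.append({                    # S504: localize in front (0 deg)
            "point_id": point_id,
            "direction_id": None,
            "direction_deg": 0.0,
        })
        already.add(point_id)
    return selected
```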

Thus, in the present embodiment, all sound collection points inside the listening range are selected and used to generate the reproduction signal, so sound with a sense of presence appropriate to the listening range can be generated. In addition, in each of the plurality of regions seen from the listening point, the sound collection point closest to the listening point is selected, so an acoustic signal with a sense of presence based on the positional relationship between the listening point and the sound collection points can be generated.

（再生信号生成処理）
図１０は、本実施形態におけるＳ１０５の再生信号生成処理の詳細な処理手順を示すフローチャートである。再生信号生成処理では、複数の収音信号に基づいて、聴取点及び聴取範囲に応じた音響信号を生成する。なお、本フローチャートにおける処理は全て音響信号生成部４によって実行される。 (Playback signal generation processing)
FIG. 10 is a flowchart showing a detailed processing procedure of the reproduction signal generation processing of S105 in the present embodiment. In the reproduction signal generation process, an acoustic signal corresponding to the listening point and the listening range is generated based on the plurality of collected sound signals. Note that all the processes in this flowchart are executed by the acoustic signal generation unit 4.

Ｓ６０１は、音響信号生成部４の内部にある出力バッファを初期化してクリアする処理である。出力バッファは再生音響信号の出力チャンネル毎のバッファになっており、生成した音響信号を出力チャンネル毎に蓄積する。処理を終えると、Ｓ６０２へ進む。 S601 is a process of initializing and clearing the output buffer in the acoustic signal generation unit 4. The output buffer is a buffer for each output channel of the reproduced sound signal, and stores the generated sound signal for each output channel. When the process is finished, step S602 follows.

Ｓ６０２は、これから生成する音響信号を再生する環境を判定する処理である。前述のように、本実施形態の例では音響信号を再生する環境として、ステレオ再生環境、サラウンド再生環境、及び、ヘッドフォン再生環境が設けられており、音響再生フォーマットもこれらの環境のいずれかに合わせて設定されている。 S602 is a process of determining the environment in which the acoustic signal about to be generated will be reproduced. As described above, in the example of the present embodiment, a stereo playback environment, a surround playback environment, and a headphone playback environment are provided as environments for reproducing the acoustic signal, and the sound playback format is set to match one of these environments.

ステレオ再生環境である場合は、Ｓ６０３へ進む。サラウンド再生環境である場合は、Ｓ６０４へ進む。ヘッドフォン再生環境である場合は、Ｓ６０５へ進む。 If it is a stereo reproduction environment, the process proceeds to S603. If it is a surround playback environment, the process proceeds to S604. If it is a headphone playback environment, the process proceeds to S605.

Ｓ６０３は、Ｓ１０４で選択した収音点の収音信号を用いてステレオ再生信号を生成する処理である。この処理の詳細は図１１を用いて後述する。処理を終えると、Ｓ６０６へ進む。 In step S603, a stereo reproduction signal is generated using the sound collection signal at the sound collection point selected in step S104. Details of this processing will be described later with reference to FIG. When the process is finished, step S606 follows.

Ｓ６０４は、Ｓ１０４で選択した収音点の収音信号を用いてサラウンド再生信号を生成する処理である。この処理の詳細は図１２を用いて後述する。処理を終えると、Ｓ６０６へ進む。 S604 is processing for generating a surround reproduction signal using the sound collection signal at the sound collection point selected in S104. Details of this processing will be described later with reference to FIG. When the process is finished, step S606 follows.

Ｓ６０５は、Ｓ１０４で選択した収音点の収音信号を用いてヘッドフォン再生信号を生成する処理である。この処理の詳細は図１３を用いて後述する。処理を終えると、Ｓ６０６へ進む。 S605 is a process of generating a headphone playback signal using the sound collection signal at the sound collection point selected in S104. Details of this processing will be described later with reference to FIG. When the process is finished, step S606 follows.

Ｓ６０６は、直前の処理で生成した再生音響信号を、音響再生部１１やＭＵＸ１５へ出力する処理である。処理を終えると、再生信号生成処理を終了し、リターンする。なお、図１０のフローチャートでは、どれか一つの再生フォーマットを選択して生成する例を示しているが、これらのフォーマットを逐次的に全て生成するようにしてもよい。 S606 is a process of outputting the reproduced sound signal generated in the immediately preceding process to the sound reproducing unit 11 or the MUX 15. When the process is finished, the reproduction signal generation process is finished and the process returns. Note that the flowchart of FIG. 10 shows an example in which any one playback format is selected and generated, but all of these formats may be generated sequentially.
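
The flow of FIG. 10 amounts to clearing per-channel buffers, dispatching on the playback environment, and returning the filled buffers. The sketch below shows that shape only. The function name, the `generators` mapping, and the channel names are assumptions made for illustration; the source names only the three environments.

```python
from typing import Callable, Dict, List

Buffers = Dict[str, List[float]]

# Channel names per environment are illustrative assumptions.
CHANNELS = {
    "stereo": ["L", "R"],
    "surround": ["L", "R", "C", "LFE", "SL", "SR"],
    "headphone": ["L", "R"],
}


def generate_playback_signal(environment: str, num_samples: int,
                             generators: Dict[str, Callable[[Buffers], None]]) -> Buffers:
    """Clear the buffers (S601), dispatch on the environment (S602), return them (S606)."""
    if environment not in CHANNELS:
        raise ValueError(f"unknown playback environment: {environment}")
    buffers = {ch: [0.0] * num_samples for ch in CHANNELS[environment]}  # S601
    generators[environment](buffers)  # S603, S604, or S605 fills the buffers
    return buffers                    # S606: hand over to playback or the MUX
```

Generating all formats sequentially, as the text permits, would simply call this function once per environment.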

上記のように、本実施形態では、視線の方向に基づいて、聴取点における聴取の方向を示す聴取方向をさらに決定し、Ｓ１０４において選択された収音点において収音された収音信号を用いて、聴取方向の正面から聞こえる音響信号を生成する。このため、任意視点画像に対応する音響を、その方向を考慮して再現よく表現することができる。 As described above, in the present embodiment, a listening direction, which indicates the direction of listening at the listening point, is further determined based on the direction of the line of sight, and the sound collection signals collected at the sound collection points selected in S104 are used to generate an acoustic signal that is heard from the front in the listening direction. For this reason, the sound corresponding to the arbitrary viewpoint image can be reproduced faithfully, taking its direction into account.

（ステレオ再生信号生成処理）
図１１は、本実施形態におけるＳ６０３のステレオ再生信号生成処理の詳細な処理手順を示すフローチャートである。なお、本フローチャートにおける処理も全て音響信号生成部４によって実行される。 (Stereo playback signal generation processing)
FIG. 11 is a flowchart showing a detailed processing procedure of the stereo reproduction signal generation processing of S603 in the present embodiment. Note that all the processing in this flowchart is also executed by the acoustic signal generation unit 4.

Ｓ７０１からＳ７０９までは、Ｓ１０４において収音点選択部３から出力された選択収音点情報リストに格納されている個々の選択収音点情報に対してループ処理を行う。 From S701 to S709, a loop process is performed on each selected sound collection point information stored in the selected sound collection point information list output from the sound collection point selection unit 3 in S104.

Ｓ７０１でループ処理が開始される。Ｓ７０２では、処理対象の選択収音点情報の聴取点から見た方向が、−９０°から９０°の範囲内かどうか、つまり、対象の収音点が聴取点から見て真横から前方に位置するかどうかを判定する。この処理の結果、範囲内にない、つまり、収音点が後方にある場合（Ｓ７０２でＮＯ）は、Ｓ７０３へ処理は進む。そうでない場合、つまり、前方にある場合（Ｓ７０２でＹＥＳ）は、Ｓ７０７へ処理は進む。 In S701, the loop processing starts. In S702, it is determined whether the direction viewed from the listening point in the selected sound collection point information being processed is within the range of −90° to 90°, that is, whether the target sound collection point lies anywhere from directly beside the listening point to in front of it. If the direction is outside this range, that is, if the sound collection point is behind the listening point (NO in S702), the process proceeds to S703. Otherwise, that is, if the point is in front (YES in S702), the process proceeds to S707.

Ｓ７０３では、対象となっている選択収音点情報に格納されている収音信号を逆位相化する。これにより、聴取者が後方に音像を感じることはないが、後方からの音が通常の音と違って聴こえる演出を行うことができる。次に、Ｓ７０４では、聴取点から見た収音点の方向が正か否かを判定する。正でない場合、すなわち、聴取点から見て収音点が左方にある場合（Ｓ７０４でＮＯ）は、Ｓ７０５へ進む。正の場合、すなわち、聴取点から見て収音点が右方にある場合（Ｓ７０４でＹＥＳ）は、Ｓ７０６へ進む。 In S703, the phase of the sound collection signal stored in the selected sound collection point information being processed is inverted. As a result, the listener does not perceive a sound image behind them, but sound from the rear can be rendered so that it is heard differently from ordinary sound. Next, in S704, it is determined whether the direction of the sound collection point viewed from the listening point is positive. If it is not positive, that is, if the sound collection point is to the left as viewed from the listening point (NO in S704), the process proceeds to S705. If it is positive, that is, if the sound collection point is to the right as viewed from the listening point (YES in S704), the process proceeds to S706.

Ｓ７０５では、聴取点から見た方向に１８０°を加えて符号を反転する。一方、Ｓ７０６では、聴取点から見た方向から１８０°を減じて符号を反転する。Ｓ７０５とＳ７０６の処理により、聴取点を中心とする円の後ろ半分を前に折り返すようにして、後方の方向を前方の方向に変換することができる。処理を終えると、Ｓ７０７へ進む。 In S705, 180° is added to the direction viewed from the listening point and the sign is inverted. In S706, by contrast, 180° is subtracted from the direction viewed from the listening point and the sign is inverted. The processing of S705 and S706 folds the rear half of the circle centered on the listening point over onto the front half, converting a rear direction into a front direction. When the process is finished, the process proceeds to S707.
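
Steps S702 to S706 can be illustrated with a short sketch. The function name `fold_rear_to_front` is hypothetical, and directions are assumed to be given in degrees in the range −180° to 180°, with positive values to the right as stated for S704.

```python
from typing import List, Tuple


def fold_rear_to_front(direction_deg: float,
                       signal: List[float]) -> Tuple[float, List[float]]:
    """Fold a rear direction onto the front half-circle and invert the signal phase."""
    if -90.0 <= direction_deg <= 90.0:       # S702 YES: already in front
        return direction_deg, list(signal)
    inverted = [-x for x in signal]          # S703: phase inversion
    if direction_deg > 0:                    # S704 YES: rear right -> S706
        folded = -(direction_deg - 180.0)
    else:                                    # S704 NO: rear left -> S705
        folded = -(direction_deg + 180.0)
    return folded, inverted
```

With this mapping a point at 120° (rear right) folds to 60° (front right), and a point at −120° folds to −60°, so left and right are preserved.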

Ｓ７０７では、−９０°から９０°の範囲にある聴取点から見た方向に対するステレオパンニング計算を行い、得られたＬ，Ｒチャンネルの振幅分配率に従って収音信号をＬ，Ｒチャンネルに分配する。一般に、標準のステレオ再生環境では±３０°に左右スピーカーを配置するため、±９０°の範囲にある聴取点から見た方向を、±３０°の範囲に線形に投射することにより、ステレオパンニング計算を行う。本実施形態では、このパンニング計算を、聴取点から見た方向をθとすると、サイン則を用いて以下のように行う。
ｗＬ＝（sin３０°−sin(θ＊３０/９０)）／2sin３０°＝１／２−sin(θ／３)
ｗＲ＝（sin３０°＋sin(θ＊３０/９０)）／2sin３０°＝１／２＋sin（θ／３）（１）
ただし、ｗLは左チャンネルに対する振幅分配率、ｗRは右チャンネルに対する振幅分配率である。 In S707, a stereo panning calculation is performed on the direction viewed from the listening point, which lies in the range of −90° to 90°, and the sound collection signal is distributed to the L and R channels according to the resulting amplitude distribution ratios. In a standard stereo playback environment the left and right speakers are generally placed at ±30°, so the panning calculation linearly maps the direction viewed from the listening point, in the ±90° range, onto the ±30° range. In this embodiment, with θ denoting the direction viewed from the listening point, the panning calculation uses the sine law as follows.
wL = (sin 30° − sin(θ × 30/90)) / (2 sin 30°) = 1/2 − sin(θ/3)
wR = (sin 30° + sin(θ × 30/90)) / (2 sin 30°) = 1/2 + sin(θ/3)    (1)
Here, wL is the amplitude distribution ratio for the left channel, and wR is the amplitude distribution ratio for the right channel.
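
Equation (1) can be checked numerically with a small sketch. The function names and the in-place buffer accumulation are illustrative assumptions; the weights themselves follow equation (1) directly, with θ in degrees and positive values to the right.

```python
import math
from typing import List, Tuple


def stereo_weights(theta_deg: float) -> Tuple[float, float]:
    """Return (wL, wR) from equation (1) for a direction in [-90, 90] degrees."""
    s = math.sin(math.radians(theta_deg / 3.0))  # maps +/-90 deg onto +/-30 deg
    return 0.5 - s, 0.5 + s


def pan_into(signal: List[float], theta_deg: float,
             out_l: List[float], out_r: List[float]) -> None:
    """Distribute the signal (S707) and add it to the L/R output buffers (S708)."""
    w_l, w_r = stereo_weights(theta_deg)
    for n, x in enumerate(signal):  # buffers are assumed at least as long as signal
        out_l[n] += w_l * x
        out_r[n] += w_r * x
```

The two weights always sum to 1: a source straight ahead is split equally, and a source at ±90° goes entirely to one channel.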

次に、Ｓ７０８では、Ｓ７０７で分配したチャンネル信号を、各チャンネルの出力バッファに各々加算する。Ｓ７０９では、選択収音点情報リストに含まれている全ての選択収音点情報の処理が終了したかを確認する。全て終了した場合は、ループ処理を抜けてステレオ再生信号生成処理を終了し、リターンする。 Next, in S708, the channel signal distributed in S707 is added to the output buffer of each channel. In S709, it is confirmed whether or not the processing of all selected sound collection point information included in the selected sound collection point information list has been completed. When all the processing is completed, the loop processing is exited, the stereo reproduction signal generation processing is terminated, and the process returns.

（サラウンド再生信号生成処理）
図１２は、本実施形態におけるＳ６０４のサラウンド再生信号生成処理の詳細な処理手順を示すフローチャートである。なお、本フローチャートにおける処理も全て音響信号生成部４によって実行される。 (Surround playback signal generation processing)
FIG. 12 is a flowchart showing a detailed processing procedure of the surround reproduction signal generation processing of S604 in the present embodiment. Note that all the processing in this flowchart is also executed by the acoustic signal generation unit 4.

Ｓ８０１からＳ８０７までは、選択収音点情報リストに含まれている各選択収音点情報に対するループ処理を行う。 From S801 to S807, a loop process is performed for each selected sound collection point information included in the selected sound collection point information list.

Ｓ８０１でループ処理を開始する。Ｓ８０２では、聴取点から見た収音点の方向が既定のチャンネル配置方向かどうかを判定する。例えば、再生環境が５．１チャンネルサラウンド再生環境だとすると、既定のチャンネル配置角度は０°、±３０°、±１１０°〜１３０°になる。対象の選択収音点情報により示される収音点の聴取点から見た方向がこの範囲の角度に該当する場合は、チャンネル配置方向であると判定する。この場合（Ｓ８０２でＹＥＳ）はＳ８０６へ処理は進む。そうでない場合（Ｓ８０２でＮＯ）は、Ｓ８０３へ処理は進む。 In S801, the loop processing starts. In S802, it is determined whether the direction of the sound collection point viewed from the listening point coincides with a predefined channel arrangement direction. For example, if the playback environment is a 5.1-channel surround playback environment, the predefined channel arrangement angles are 0°, ±30°, and ±110° to ±130°. If the direction viewed from the listening point of the sound collection point indicated by the selected sound collection point information falls on one of these angles, it is determined to be a channel arrangement direction. In that case (YES in S802), the process proceeds to S806. Otherwise (NO in S802), the process proceeds to S803.

Ｓ８０３は、聴取点から見た収音点の方向の角度を挟み込む方向にある二つのチャンネルを選択する処理である。例えば、聴取点から見た収音点の方向が５０°であるとすると、それを挟み込むチャンネルとして、３０°のＲチャンネルと、１２０°のＳＲチャンネルが選択される。 S803 is processing for selecting two channels in a direction that sandwiches the angle of the direction of the sound collection point viewed from the listening point. For example, if the direction of the sound collection point as viewed from the listening point is 50 °, the 30 ° R channel and the 120 ° SR channel are selected as the channels sandwiching it.
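
One way to find the two channels that sandwich a direction (S803) is to walk around the ring of channel angles. The sketch below is an assumption-laden illustration: the function name is hypothetical, the L, C, and SL labels are not named in the source, and each surround channel is fixed at ±120° rather than treated as a 110° to 130° range.

```python
from typing import Dict, Tuple

# Assumed 5.1 layout (degrees, positive to the right); only R and SR are named in the text.
CHANNEL_ANGLES: Dict[str, float] = {
    "C": 0.0, "R": 30.0, "SR": 120.0, "SL": -120.0, "L": -30.0,
}


def enclosing_channels(direction_deg: float,
                       angles: Dict[str, float] = CHANNEL_ANGLES) -> Tuple[str, str]:
    """Return the pair of adjacent channels whose angles sandwich the direction."""
    ring = sorted(angles.items(), key=lambda kv: kv[1] % 360.0)
    d = direction_deg % 360.0
    for i, (name, ang) in enumerate(ring):
        next_name, next_ang = ring[(i + 1) % len(ring)]
        span = (next_ang - ang) % 360.0   # arc from this channel to the next
        offset = (d - ang) % 360.0        # arc from this channel to the direction
        if offset < span:
            return name, next_name
    raise ValueError("at least two distinct channel angles are required")
```

For the example in the text, a direction of 50° yields the R and SR channels.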

次に、Ｓ８０４では、Ｓ８０３で選択したチャンネル間で振幅パンニング計算を行い、二つのチャンネルにこの収音点の収音信号を分配する。本実施形態ではサイン則によって振幅パンニング計算を行う。先ほどの例で説明すると、ＲとＳＲの方向の中心方向は７５°であり、中心方向と各チャンネルの方向との開き角は４５°となる。また、チャンネル間の中心方向から見ると、５０°は５０°−７５°＝−２５°となる。Ｒ、ＳＲチャンネルへの各分配率ｗＲ、ｗＳＲはサイン則よりそれぞれ次式で求められる。
ｗＲ＝（sin４５°−sin（−２５°））／2sin４５°≒０．７９９
ｗＳＲ＝（sin４５°＋sin（−２５°））／2sin４５°≒０．２０１（２）
次に、Ｓ８０５では、Ｓ８０４で分配した各チャンネルの信号を、チャンネル毎に出力バッファに加算する。一方、Ｓ８０６では、Ｓ８０２で判定された方向が同じチャンネルの出力バッファに、収音信号をそのまま加算する。 Next, in S804, an amplitude panning calculation is performed between the channels selected in S803, and the sound collection signal of this sound collection point is distributed to the two channels. In the present embodiment, the amplitude panning calculation uses the sine law. In the example above, the center direction between R and SR is 75°, and the opening angle between the center direction and each channel direction is 45°. Measured from the center direction between the channels, 50° becomes 50° − 75° = −25°. The distribution ratios wR and wSR for the R and SR channels are obtained from the sine law as follows.
wR = (sin 45° − sin(−25°)) / (2 sin 45°) ≈ 0.799
wSR = (sin 45° + sin(−25°)) / (2 sin 45°) ≈ 0.201    (2)
Next, in S805, the signal of each channel distributed in S804 is added to the output buffer of that channel. In S806, on the other hand, the sound collection signal is added as it is to the output buffer of the channel whose direction was found to match in S802.
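
The sine-law split between two adjacent channels generalizes equation (2) to any channel pair. The sketch below is a hypothetical helper, not text from the patent; it assumes the two channel angles are given in degrees with the first smaller than the second and the source direction lying between them.

```python
import math
from typing import Tuple


def sine_law_weights(direction_deg: float, angle_a_deg: float,
                     angle_b_deg: float) -> Tuple[float, float]:
    """Return (w_a, w_b) for channels at angle_a < direction < angle_b."""
    center = (angle_a_deg + angle_b_deg) / 2.0  # center direction of the pair
    half = (angle_b_deg - angle_a_deg) / 2.0    # opening angle to each channel
    phi = direction_deg - center                # direction seen from the center
    ratio = math.sin(math.radians(phi)) / math.sin(math.radians(half))
    return (1.0 - ratio) / 2.0, (1.0 + ratio) / 2.0
```

Calling `sine_law_weights(50.0, 30.0, 120.0)` reproduces the R/SR example: the center is 75°, the opening angle is 45°, and the relative direction is −25°, so the channel nearer the source (R) receives the larger share. The two weights always sum to 1.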

Ｓ８０７では、選択収音点情報リストに含まれる全ての選択収音点情報に対する処理が終了したかどうかを確認する。全ての処理が終了した場合は、ループ処理を終了し、Ｓ８０８へ進む。 In S807, it is confirmed whether or not the processing for all the selected sound collection point information included in the selected sound collection point information list is completed. When all the processes are finished, the loop process is finished and the process proceeds to S808.

Ｓ８０８では、各チャンネルの出力バッファに蓄積されている音響信号に対して、ローパスフィルタ（ＬＰＦ）を掛けて加算することにより、ＬＦＥ（ＬｏｗＦｒｅｑｕｅｎｃｙＥｌｅｍｅｎｔ）信号を生成する。ＬＦＥ信号は低域信号であり、通常は８０Ｈz以下の信号をローパスフィルタで取り出すようにする。この信号は、サラウンドスピーカーセットに含まれるサブウーファーによって再生される。生成されたＬＦＥ信号は、ＬＦＥチャンネル用の出力バッファに蓄積される。処理を終えると、サラウンド再生信号生成処理を終了し、リターンする。 In S808, an LFE (Low Frequency Element) signal is generated by applying a low-pass filter (LPF) to the acoustic signals accumulated in the output buffers of the respective channels and summing the results. The LFE signal is a low-frequency signal; normally, components at or below 80 Hz are extracted with the low-pass filter. This signal is reproduced by the subwoofer included in the surround speaker set. The generated LFE signal is stored in the output buffer for the LFE channel. When the process is finished, the surround reproduction signal generation processing ends and the process returns.
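
The LFE step can be sketched with a simple filter. The source does not specify the filter design, so the first-order (one-pole) low-pass below is purely an assumption chosen for brevity, as are the function names, the 48 kHz sample rate, and the requirement that all channel buffers have equal length.

```python
import math
from typing import Dict, List


def one_pole_lowpass(signal: List[float], cutoff_hz: float,
                     sample_rate_hz: float) -> List[float]:
    """First-order IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def make_lfe(buffers: Dict[str, List[float]], cutoff_hz: float = 80.0,
             sample_rate_hz: float = 48000.0) -> List[float]:
    """Low-pass every non-LFE channel, sum the results, store them as the LFE buffer."""
    main = [buf for name, buf in buffers.items() if name != "LFE"]
    lfe = [0.0] * (len(main[0]) if main else 0)
    for buf in main:
        for n, v in enumerate(one_pole_lowpass(buf, cutoff_hz, sample_rate_hz)):
            lfe[n] += v
    buffers["LFE"] = lfe
    return lfe
```

A production system would more likely use a steeper filter; the one-pole form is used here only to keep the example self-contained.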

（ヘッドフォン再生信号生成処理）
図１３は、本実施形態におけるＳ６０５のヘッドフォン再生信号生成処理の詳細な処理手順を示すフローチャートである。なお、本フローチャートにおける処理も全て４の音響信号生成部によって実行される。 (Headphone playback signal generation processing)
FIG. 13 is a flowchart showing a detailed processing procedure of the headphone reproduction signal generation processing of S605 in the present embodiment. Note that all the processing in this flowchart is also executed by the acoustic signal generation unit 4.

Ｓ９０１からＳ９０４までは、選択収音点情報リストに含まれている各選択収音点情報に対するループ処理を行う。 From S901 to S904, a loop process is performed on each selected sound collection point information included in the selected sound collection point information list.

Ｓ９０１でループ処理を開始する。Ｓ９０２では、聴取点から見た方向のＨＲＩＲを収音信号に畳み込むことにより、両耳信号を計算する。ＨＲＩＲとは頭部伝達関数（ＨｅａｄＲｅｌａｔｅｄＩｍｐｕｌｓｅＲｅｓｐｏｎｓｅ）である。ＨＲＩＲは、音源方向によって変化する人間の頭部や耳介による音の回り込みを測定して両耳分のインパルス応答としたものである。収音信号に対してこれを畳み込むことにより、ヘッドフォンで聴いた場合に聴取点から見た方向に収音信号が定位する立体音響信号を作成することができる。なお、本実施形態では、方向毎のＨＲＩＲを格納したデータベースが音響信号生成部４の内部ＲＯＭに格納されており、任意の方向を入力して検索することにより、両耳分のＨＲＩＲを読み出して用いることができる。 In S901, the loop processing starts. In S902, binaural signals are calculated by convolving the sound collection signal with the HRIR for the direction viewed from the listening point. HRIR stands for head-related impulse response. An HRIR is a pair of impulse responses, one per ear, obtained by measuring how sound diffracts around the human head and pinnae, which varies with the direction of the sound source. Convolving the sound collection signal with it produces a stereophonic signal in which, when heard over headphones, the sound collection signal is localized in the direction viewed from the listening point. In the present embodiment, a database storing an HRIR for each direction is held in the internal ROM of the acoustic signal generation unit 4, and the HRIRs for both ears can be read out and used by querying it with an arbitrary direction.

次に、Ｓ９０３では、Ｓ９０２で生成した両耳信号を、Ｌ，Ｒの出力チャンネル毎に出力バッファに加算する。 Next, in S903, the binaural signal generated in S902 is added to the output buffer for each of the L and R output channels.

Ｓ９０４では、選択収音点情報リストに含まれる全ての選択収音点情報に対する処理が終了したかどうかを確認する。全ての処理が終了した場合は、ループを抜けて、ヘッドフォン再生信号生成処理を終了し、リターンする。 In S904, it is confirmed whether or not the processing for all the selected sound collection point information included in the selected sound collection point information list is completed. When all the processes are completed, the process exits the loop, ends the headphone reproduction signal generation process, and returns.
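
Steps S902 and S903 can be sketched as a lookup followed by two convolutions. Everything named here is hypothetical: the database is modeled as a dictionary from direction in degrees to a (left, right) pair of impulse responses, the nearest stored direction is used as the lookup rule, and direct-form convolution stands in for whatever the actual implementation uses.

```python
from typing import Dict, List, Tuple

HrirDb = Dict[float, Tuple[List[float], List[float]]]


def convolve(signal: List[float], impulse_response: List[float]) -> List[float]:
    """Direct-form linear convolution."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def add_binaural(signal: List[float], direction_deg: float, hrir_db: HrirDb,
                 out_l: List[float], out_r: List[float]) -> None:
    """Convolve with the nearest stored HRIR (S902) and add to the L/R buffers (S903)."""
    nearest = min(hrir_db,
                  key=lambda d: abs((d - direction_deg + 180.0) % 360.0 - 180.0))
    h_l, h_r = hrir_db[nearest]
    for buf, h in ((out_l, h_l), (out_r, h_r)):
        # Samples that would fall past the end of the buffer are dropped.
        for n, v in enumerate(convolve(signal, h)[:len(buf)]):
            buf[n] += v
```

Measured HRIR sets typically contain hundreds of taps per ear, in which case an FFT-based convolution would be the more practical choice.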

なお、本実施形態では、ステレオ再生処理においても全方向の収音点の収音信号を用いて再生信号を生成しているが、例えば、ステレオならば前方にある収音点の収音信号のみを用いて再生信号を生成するようにしてもよい。 Note that in the present embodiment the stereo reproduction processing also generates the reproduction signal using the sound collection signals of sound collection points in all directions; however, in the case of stereo, for example, the reproduction signal may be generated using only the sound collection signals of sound collection points located in front.

また、本実施形態では、収音点の位置に収音用のマイクロホンを設置しているが、収音の実現手法はこのような態様に限られない。例えば、遠方の微小な音も収音できるマイクロホンを複数用いて、収音点に対応する位置を狙って収音、処理することにより、ピンポイントで遠距離から狙った収音点の音を収音してもよい。 In the present embodiment, a sound collection microphone is installed at the position of each sound collection point, but the way sound collection is realized is not limited to this form. For example, a plurality of microphones capable of picking up faint distant sounds may be aimed at the position corresponding to a sound collection point, and their signals processed, so that the sound at the targeted sound collection point is picked up pinpoint from a long distance.

また、本実施形態では収音信号や撮影映像信号をすぐに処理して任意視点映像及びそれに見合う音響信号を生成、再生しているが、収音信号や撮影映像信号を一旦記憶装置に記憶しておき、後で処理するようにしてもよい。 In this embodiment, the sound collection signal and the captured video signal are immediately processed to generate and reproduce the arbitrary viewpoint video and the sound signal corresponding thereto. However, the sound collection signal and the captured video signal are temporarily stored in the storage device. It may be processed later.

以上説明したように、本実施形態にかかる構成により、視点情報から任意視点映像に応じた聴取範囲、聴取点、聴取位置を自動的に決定することで、任意視点の動きに応じて変化する臨場感のある音場再生を実現することができる。 As described above, with the configuration according to the present embodiment, the listening range, listening point, and listening position corresponding to the arbitrary viewpoint video are determined automatically from the viewpoint information, which realizes realistic sound field reproduction that changes with the movement of the arbitrary viewpoint.

また、視点情報から任意視点映像に応じた聴取範囲を決定し、聴取範囲に応じた最小限の収音点を選択して再生音場に適宜配置することで、処理量を抑制しつつ映像に見合う臨場感のある音場再生を実現できる。すなわち、聴取範囲に基づき音響信号を生成するために用いる収音点を選択して、音響信号を生成することで、必要最小限の収音信号を選択して任意視点映像に対応する臨場感ある音響信号を自動的に生成することができる。 In addition, by determining from the viewpoint information a listening range corresponding to the arbitrary viewpoint video, selecting the minimum set of sound collection points for that listening range, and placing them appropriately in the reproduced sound field, realistic sound field reproduction that matches the video can be achieved while the processing load is kept down. That is, by selecting, based on the listening range, the sound collection points used to generate the acoustic signal and then generating the acoustic signal, only the minimum necessary sound collection signals are selected and a realistic acoustic signal corresponding to the arbitrary viewpoint video can be generated automatically.

＜＜その他の実施形態＞＞
実施形態１では、聴取範囲内の収音点による収音信号を全て用いて再生信号を生成しているが、聴取範囲内の収音信号のうち重要な収音信号を選択して用いることもできる。ここでは、重要な収音信号の一例として、人の声（以下、「音声」という）を含む収音信号を選択する例を説明する。以下、この場合の実施形態について説明する。 << Other Embodiments >>
In the first embodiment, the reproduction signal is generated using all of the sound collection signals from the sound collection points within the listening range. However, important sound collection signals can also be selected from among those within the listening range and used. Here, as an example of an important sound collection signal, selecting a sound collection signal that contains a human voice (hereinafter referred to as "voice") is described. An embodiment for this case is described below.

本実施形態と実施形態１の差分は図８におけるＳ４１１の聴取範囲内収音点選択処理であり、他は同一であるため説明を省略し、実施形態１と異なる点を中心に簡潔に説明する。 The difference between the present embodiment and the first embodiment is the sound collection point selection processing within the listening range in S411 in FIG. 8, and the others are the same, so the description thereof will be omitted, and a brief description will be given focusing on the differences from the first embodiment. .

図１４は本実施形態における聴取範囲内収音点選択処理の詳細な処理手順を示すフローチャートである。Ｓ１００１からＳ１００３までの処理は図９のＳ５０１からＳ５０３までの処理と同一であるため説明を省略する。 FIG. 14 is a flowchart showing a detailed processing procedure of the sound collection point selection processing within the listening range in the present embodiment. The processing from S1001 to S1003 is the same as the processing from S501 to S503 in FIG. 9, so its description is omitted.

Ｓ１００４は、対象となる収音点情報の収音信号を解析し、音声が含まれているかどうかを判定する処理である。収音信号に対してピッチ検出処理やフォルマント検出処理等を行うことにより、収音信号に音声が含まれているかどうかを判定する。判定の結果、音声が含まれている場合はＳ１００６に進む。そうでない場合は、Ｓ１００５へ進む。 S1004 is a process of analyzing the sound collection signal of the target sound collection point information and determining whether it contains voice. Whether the sound collection signal contains voice is determined by applying pitch detection, formant detection, or similar processing to it. If voice is found, the process proceeds to S1006. Otherwise, the process proceeds to S1005.

Ｓ１００５は、収音信号の平均振幅が予め定めた既定値（閾値）を超えているかどうかを判定する処理である。判定の結果、閾値を超えている場合はＳ１００６に進む。そうでない場合は、Ｓ１００７へ進む。 S1005 is processing for determining whether or not the average amplitude of the collected sound signal exceeds a predetermined value (threshold value). As a result of the determination, if the threshold value is exceeded, the process proceeds to S1006. Otherwise, the process proceeds to S1007.

Ｓ１００６とＳ１００７は図９におけるＳ５０４とＳ５０５と同一の処理であるため説明を省略する。 S1006 and S1007 are the same processes as S504 and S505 in FIG. 9, so their description is omitted.

以上説明した処理制御を行うことで、聴取範囲内の収音点で収音した信号のうち、重要な情報が含まれていると推定される声が混じっている信号や、音源の発生源に近いと推定できる平均振幅の大きな信号のみを選択して再生することができる。よって、再生信号生成処理にかかる処理量をさらに削減しつつ、重要な音のみを選択して再生することができる。 With the processing control described above, out of the signals collected at the sound collection points within the listening range, only signals containing voice, which are presumed to carry important information, and signals with a large average amplitude, which can be presumed to be close to the sound source, are selected and reproduced. Therefore, only important sounds are selected and reproduced while the processing load of the reproduction signal generation processing is further reduced.
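
The decision made in S1004 and S1005 can be sketched as a single predicate. The names below are hypothetical. Voice detection itself (pitch or formant analysis) is outside the scope of this sketch and is passed in as a caller-supplied function, and the average amplitude is assumed to mean the mean absolute sample value.

```python
from typing import Callable, List


def is_important(signal: List[float],
                 contains_voice: Callable[[List[float]], bool],
                 amplitude_threshold: float) -> bool:
    """Return True if the signal should be selected (proceed to S1006)."""
    if contains_voice(signal):                     # S1004: voice detected
        return True
    if not signal:
        return False
    mean_amplitude = sum(abs(x) for x in signal) / len(signal)
    return mean_amplitude > amplitude_threshold    # S1005: loud enough
```

A sound collection point for which this returns False is skipped, which is what reduces the amount of work in the later reproduction signal generation.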

また、上記構成では、聴取範囲内の収音点を選択する場合に、人の音声等の音響的に重要な収音点を検知する例を説明した。さらに、視点映像を解析してボールを蹴ったりスクラムを組むなどの所定のイベントが生じている場所を特定し、その場所に最も近い収音点を選択するようにしてもよい。イベントの発生は、ユーザの指定やセンサの使用等により検知することができる。 In the above configuration, an example in which an acoustically important sound collection point such as a human voice is detected when a sound collection point within the listening range is selected has been described. Furthermore, it is also possible to analyze a viewpoint video, specify a place where a predetermined event such as kicking a ball or forming a scrum occurs, and select a sound pickup point closest to the place. The occurrence of an event can be detected by user designation, use of a sensor, or the like.

また、実施形態１では、俯瞰視点の場合に画面の上方向の音を前方に配置しているが、水平面の他に、上層にもチャンネルを配置する再生フォーマットへ出力する場合には、上層チャンネルに配置するようにしてもよい。このように、Ｓ１０４において選択された収音点の視点から見た位置関係に基づいて、収音点において収音された収音信号を合成して、音響信号を生成することで、収音点の配置に応じた臨場感ある音響を再現することができる。その他、本発明の主旨を逸脱しない範囲で実施することが可能である。 In the first embodiment, in the case of a bird's-eye viewpoint, sound from the upward direction of the screen is placed in front; however, when outputting to a playback format that also places channels in an upper layer in addition to the horizontal plane, that sound may be assigned to the upper-layer channels. In this way, by synthesizing the sound collection signals collected at the sound collection points selected in S104, based on the positional relationship of those points as seen from the viewpoint, to generate the acoustic signal, realistic sound that reflects the arrangement of the sound collection points can be reproduced. Other modifications are possible without departing from the gist of the present invention.

＜その他の実施形態＞
本発明は、上述の実施形態の１以上の機能を実現するプログラムを、ネットワーク又は記憶媒体を介してシステム又は装置に供給し、そのシステム又は装置のコンピュータにおける１つ以上のプロセッサーがプログラムを読出し実行する処理でも実現可能である。また、１以上の機能を実現する回路（例えば、ＡＳＩＣ）によっても実現可能である。 <Other embodiments>
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

収音信号入力部：１、聴取範囲決定部：２、収音点選択部：３、音響信号生成部：４、操作部：５、視点情報指定部：６、映像信号入力部：７、視点映像生成部：８、被写体位置検知部：９、映像再生部：１０、音響再生部：１１、ステレオスピーカーセット：１２、サラウンドスピーカーセット：１３、ヘッドフォン：１４、表示部：１９ Sound collection signal input unit: 1, listening range determination unit: 2, sound collection point selection unit: 3, sound signal generation unit: 4, operation unit: 5, viewpoint information designation unit: 6, video signal input unit: 7, viewpoint Video generation unit: 8, subject position detection unit: 9, video playback unit: 10, sound playback unit: 11, stereo speaker set: 12, surround speaker set: 13, headphones: 14, display unit: 19

Claims

An information processing system that processes an image and sound corresponding to an arbitrary viewpoint based on a plurality of image signals photographed by a plurality of photographing devices and a plurality of sound collection signals collected at a plurality of sound collection points, the system comprising:
acquisition means for acquiring a viewpoint position and a direction of a line of sight with respect to an imaging target;
selection means for selecting, from the plurality of sound collection points and according to the viewpoint position and the direction of the line of sight, a sound collection point whose sound collection signal is used to generate an acoustic signal corresponding to an image that accords with the viewpoint position and the direction of the line of sight and is based on the plurality of image signals; and
sound generation means for generating an acoustic signal using the sound collection signal collected at the sound collection point selected by the selection means.

The information processing system according to claim 1, further comprising determination means for determining, according to the viewpoint position and the direction of the line of sight, a listening range that serves as a reference range for selecting the sound collection point whose sound collection signal is used to generate the acoustic signal corresponding to the image,
wherein the selection means selects, from the plurality of sound collection points and based on the listening range, the sound collection point to be used for generating the acoustic signal.

The determining means further determines a listening point as a reference for generating the acoustic signal according to the viewpoint position and the direction of the line of sight,
The selection means divides the imaging target, as seen from the listening point, into a plurality of regions and selects the sound collection point from each of the plurality of regions based on the listening range. Information processing system described in 1.

When there is a sound collection point within the listening range in the region, the selection means selects a sound collection point farthest from the listening point among the sound collection points existing within the listening range. The information processing system according to claim 3.

When the sound collection point does not exist within the listening range in the area, the selection means selects a sound collection point closest to the listening point from among the sound collection points existing in the area. The information processing system according to claim 3 or 4.

The information processing system according to claim 3, wherein the selection unit selects a sound collection point closest to the listening point in each of the plurality of regions.

The determining means further determines, based on the direction of the line of sight, a listening direction indicating the direction of listening at the listening point;
The sound generation means generates an acoustic signal that can be heard from the front in the listening direction by using a sound pickup signal picked up at a sound pickup point selected by the selection means. The information processing system according to any one of the above.

The information processing system according to any one of claims 2 to 7, wherein the selection unit selects all sound collection points existing in the listening range.

The information processing system according to any one of claims 1 to 8, wherein the selection unit selects a sound collection point where a sound collection signal including a human voice is collected.

The information processing system according to claim 1, wherein the selection unit selects a sound collection point where a sound collection signal having an average amplitude exceeding a predetermined threshold is collected.

The information processing system according to any one of claims 1 to 10, wherein the selection unit selects a sound pickup point closest to a place where occurrence of a predetermined event is detected.

The sound generation means generates the sound signal by synthesizing the sound pickup signals collected at the sound pickup points based on the positional relationship of the sound pickup points selected by the selection means as viewed from the viewpoint. The information processing system according to any one of claims 1 to 11, wherein:

An information processing system that processes an image and sound corresponding to an arbitrary viewpoint based on a plurality of image signals photographed by a plurality of photographing devices and a plurality of sound collection signals collected at a plurality of sound collection points, the system comprising:
acquisition means for acquiring a viewpoint position and a direction of a line of sight with respect to an imaging target;
determination means for determining, according to the viewpoint position and the direction of the line of sight, a listening point that serves as a reference for generating an acoustic signal corresponding to an image that accords with the viewpoint position and the direction of the line of sight and is based on the plurality of image signals;
selection means for selecting, from the plurality of sound collection points, a sound collection point used for generating the acoustic signal based on a positional relationship between the listening point and the sound collection points; and
sound generation means for generating an acoustic signal using the sound collection signal collected at the sound collection point selected by the selection means.

A control method for an information processing system that processes an image and sound corresponding to an arbitrary viewpoint based on a plurality of image signals photographed by a plurality of photographing devices and a plurality of sound collection signals collected at a plurality of sound collection points, the method comprising:
an acquisition step in which acquisition means acquires a viewpoint position and a direction of a line of sight with respect to an imaging target;
a selection step in which selection means selects, from the plurality of sound collection points and according to the viewpoint position and the direction of the line of sight, a sound collection point whose sound collection signal is used to generate an acoustic signal corresponding to an image that accords with the viewpoint position and the direction of the line of sight and is based on the plurality of image signals; and
a sound generation step in which sound generation means generates an acoustic signal using the sound collection signal collected at the sound collection point selected in the selection step.

A computer program for causing a computer to function as each means of the information processing system according to any one of claims 1 to 13.