JP7457525B2 - Receiving device, content transmission system, and program

Info

Publication number: JP7457525B2
Application number: JP2020028692A
Authority: JP
Inventors: 侑輝河村; 浩一郎今村; 裕靖永田; 悠喜山上; 知也楠
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2024-03-28
Anticipated expiration: 2040-02-21
Also published as: JP2021136465A

Description

The present invention relates to a receiving device, a content transmission system, and a program for reproducing three-dimensional sound.

In recent years, advances in AR (Augmented Reality) and VR (Virtual Reality) technology have led to the spread of AR/VR-compatible terminals and AR/VR content. AR/VR-compatible terminals include smartphones, tablet terminals, VR goggles, and AR glasses. For example, Patent Document 1 discloses a configuration example of an AR content viewing system.

When AR/VR content is played back, visual information is rendered in real time by the GPU (Graphics Processing Unit) installed in the terminal, which realizes interactive graphical display that follows the user's movements. An AR/VR-compatible terminal is equipped with multiple sensors, such as a gyro sensor and an acceleration sensor, and the information acquired from these sensors is used to estimate the terminal's own position and orientation. For AR content, in which objects given as CG data or the like are composited onto real-space video captured by the terminal's camera, the captured real-space video can also be used as input to the self-position and orientation estimation. The GPU then renders graphics in real time according to a viewport that reflects the viewpoint position and line-of-sight direction obtained from this estimation. The basic technologies needed to play AR/VR content, such as self-position and orientation estimation on smartphones and tablet terminals, are increasingly implemented as standard OS (Operating System) level functions, such as ARKit on iOS terminals and ARCore on Android terminals, which makes it easier for general developers to develop and distribute AR/VR-compatible applications.

The freedom of the user's viewpoint position and line-of-sight direction when viewing AR/VR content is expressed in a unit called DoF (Degrees of Freedom). For example, 360-degree VR video content in which the viewpoint position is fixed and only the line-of-sight direction is interactive is called 3DoF, because the line-of-sight direction has three rotational degrees of freedom (Roll, Pitch, Yaw). In contrast, AR/VR content in which the viewpoint position can also move freely is called 6DoF, because the three degrees of freedom of viewpoint translation (X, Y, Z) are added to the three degrees of freedom of line-of-sight direction, for six in total. A system that is based on 3DoF but adds a small amount of freedom to the visual information by allowing viewpoint movement within a limited range, such as head movement while seated on a fixed chair, is sometimes called 3DoF+.

Multimodal stimulation that combines visual and auditory information is expected to heighten the user's sense of immersion when viewing AR/VR content. For example, Patent Document 2 discloses a configuration example of a system for game content in which the user can move freely around a virtual space; the system adaptively generates environmental sounds according to the relationship between the area where the environmental sound originates and the position and direction of a virtual camera corresponding to the user's viewpoint. Patent Document 3 discloses a model of a system for VR games that generates and presents monaural or stereo audio in which the sounds of sound source objects are mixed according to the user's line-of-sight direction.

The techniques disclosed in the prior art documents above either assume that the sound source object is at the same height as the user's viewpoint, or treat a sound source that is actually at a different height as if it were at viewpoint height. In other words, sound from sound source objects and environmental-sound areas above or below the viewpoint is localized on a two-dimensional plane that is parallel to the ground at the height of the user's line of sight. Consequently, when the user is facing straight ahead, for example, presenting the sound of an airplane flying overhead or of an insect chirping near the ground at the user's feet gives no sense of depth in the vertical direction.

To address this, techniques have been proposed that apply frequency-domain acoustic processing using the frequency characteristics of the transfer function of sound waves from the position of the sound source object to the entrance of the user's ear canal (the head-related transfer function). These achieve three-dimensional sound image localization, including directions above and below viewpoint height, even though the final audio output is two-channel audio, the same as stereo. Three-dimensional sound reproduced over two channels in this way is generally called binaural audio. Besides being generated by frequency-domain computation with head-related transfer functions, binaural audio can also be captured directly from real space with a dummy head that simulates the shape of the human head and ear canals. For this reason, live-action 360-degree VR video content is sometimes produced by simultaneously shooting wide-field video, such as full-sphere video from a 360-degree camera, and capturing binaural audio with a dummy head, and is then provided as packaged VR video content. However, the user perceives the spatial effect of binaural audio correctly only when the user's line-of-sight direction matches the direction the dummy head was facing during capture.
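As a rough illustration of the binaural generation described above, the following Python sketch filters a mono source with a left and right impulse-response pair. It works in the time domain, which is mathematically equivalent to multiplying by head-related transfer functions in the frequency domain. The function names and the two-tap impulse responses are hypothetical and chosen only for illustration; they are not measured data and are not taken from the patent.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono source with a left/right impulse-response pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses (made up for illustration, not measured data):
# the source is on the listener's left, so the right ear receives it
# one sample later and at half the amplitude.
HRIR_LEFT = [1.0, 0.0]
HRIR_RIGHT = [0.0, 0.5]

left, right = render_binaural([1.0, 2.0, 3.0], HRIR_LEFT, HRIR_RIGHT)
```

A practical renderer would use measured responses of a few hundred taps and FFT-based convolution; the direct loop here only makes the operation explicit.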

Patent Document 1: Japanese Patent Application Publication No. 2019-008319
Patent Document 2: Japanese Patent Application Publication No. 2007-229241
Patent Document 3: Japanese Patent Application Publication No. 2019-013765

In 3DoF VR content, where the user can freely change the line-of-sight direction, and in 6DoF AR/VR content, where the user can move freely within the three-dimensional space of the content, the audio captured with a dummy head cannot be used as is. Instead, binaural audio must be generated in real time by computation with head-related transfer functions, according to the constantly changing relative relationship between the user's viewpoint position and the positions of the sound source objects. To achieve this, first, an independent audio stream for each sound source object must be transmitted together with three-dimensional coordinates indicating its position in the three-dimensional space of the content, linked on the time axis. Second, the receiving device must implement a function that receives the data in which the audio streams and three-dimensional coordinates of the sound source objects are linked, and generates in real time binaural audio that corresponds to the sound source objects and the user's viewpoint position.

However, implementing functions that support three-dimensional sound faces two obstacles. First, typical mobile terminals such as tablet terminals and smartphones are not equipped with a DSP (Digital Signal Processor), that is, dedicated audio-processing hardware comparable to the GPU for graphics processing. Second, APIs (Application Programming Interfaces) for audio processing comparable to OpenGL (Open Graphics Library) for graphics processing are not sufficiently developed. To provide content for mobile terminals such as smartphones and tablet terminals, which have reached a wide range of users, three-dimensional sound processing must be implemented in software running on the CPU (Central Processing Unit).

Furthermore, as the number of sound source objects streamed as components of the content increases, the processing load on the receiving device grows with the number of sources and can become excessive. When the growth in sound source objects causes an excessive processing load, processing delays may cause loss of synchronization between visual and auditory information, audio dropouts, and similar problems.

In other words, when AR/VR content combined with three-dimensional sound is provided by streaming, and the movement of sound source objects in the three-dimensional space, the movement of the user's viewpoint position (viewing position), and the rotation of the user's line-of-sight direction are all dynamic, conventional binaural audio generation with head-related transfer functions requires an enormous number of head-related transfer functions to cover the relative relationships between source position and viewing position and direction. This makes a software implementation on mobile terminals, whose memory and computational resources are limited, impractical. In addition, when the number of sound source objects placed in the three-dimensional space of the content increases, the limits of computational resources such as CPU and memory make it impossible to process the audio of all sound source objects in real time, and viewing quality degrades through loss of synchronization with visual information caused by increased delay, audio dropouts, and the like.

An object of the present invention, made in view of these circumstances, is to provide, at a reasonable implementation cost, a receiving device, a content transmission system, and a program that can suppress increases in the amount of computation and in program and circuit scale, ensure real-time playback of AR/VR content combined with three-dimensional sound, and improve viewing quality.

A receiving device according to one embodiment receives audio chunks, obtained by dividing the audio stream of a sound source object into blocks, and sound source metadata including the three-dimensional coordinates of the sound source object in a world coordinate system. The receiving device includes: a coordinate conversion unit that converts the three-dimensional coordinates of the sound source object from the world coordinate system to a view coordinate system; an object selection unit that defines a sound source object selection area in the view coordinate system based on the processing load of the receiving device and selects the sound source objects located within the selection area as the sound source objects to be acoustically processed; and a three-dimensional sound rendering unit that generates binaural audio using the audio chunks of the sound source objects to be acoustically processed.
In the receiving device according to one embodiment, in the above configuration, the object selection unit defines the sound source object selection area so that it becomes smaller as the processing load becomes larger.
Further, in the receiving device according to one embodiment, in the above configuration, the sound source metadata includes a priority of the sound source object, and the object selection unit selects, as the sound source objects to be acoustically processed, those sound source objects that would fall within the sound source object selection area if their distance from the origin of the view coordinate system were changed so as to become shorter as the priority becomes higher.
Further, in the receiving device according to one embodiment, in the above configuration, the sound source metadata includes a maximum sound pressure level of the sound source object, and the object selection unit selects, as the sound source objects to be acoustically processed, those sound source objects that would fall within the sound source object selection area if their distance from the origin of the view coordinate system were changed so as to become shorter as the maximum sound pressure level becomes higher.
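The load-dependent selection described above can be sketched in Python as follows. This is only one possible reading under stated assumptions: the selection area is taken to be a sphere centered on the view-coordinate origin, its radius shrinks linearly with a load value normalized to the range 0.0 to 1.0, and the priority adjustment divides the distance by a priority of 1 or more. None of these specifics, nor the names `SourceObject`, `selection_radius`, and `select_objects`, come from the text.

```python
import math
from dataclasses import dataclass


@dataclass
class SourceObject:
    object_id: int
    view_xyz: tuple        # position in the view coordinate system
    priority: float = 1.0  # assumed scale: >= 1, larger is more important


def selection_radius(load, max_radius=20.0, min_radius=2.0):
    """Shrink the selection sphere linearly as load goes from 0.0 to 1.0."""
    load = min(max(load, 0.0), 1.0)
    return max_radius - (max_radius - min_radius) * load


def select_objects(objects, load):
    """Return IDs of objects whose priority-adjusted distance is in the sphere."""
    radius = selection_radius(load)
    selected = []
    for obj in objects:
        distance = math.dist((0.0, 0.0, 0.0), obj.view_xyz)
        effective = distance / obj.priority  # higher priority, shorter distance
        if effective <= radius:
            selected.append(obj.object_id)
    return selected


objects = [
    SourceObject(1, (1.0, 0.0, 0.0)),
    SourceObject(2, (0.0, 0.0, 10.0)),
    SourceObject(3, (0.0, 12.0, 0.0), priority=4.0),
]
light = select_objects(objects, load=0.0)  # large sphere
heavy = select_objects(objects, load=0.9)  # small sphere
```

Under light load all three objects are kept. Under heavy load object 2 is dropped, while object 3 survives even though it is farther away, because its priority shortens its effective distance. The same pattern could be applied to the maximum sound pressure level by substituting a level-derived factor for `priority`.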

Furthermore, in one embodiment, the receiving device may further include a position and orientation estimation unit that estimates the user's line-of-sight direction, and the coordinate conversion unit may convert the three-dimensional coordinates of the sound source object from the world coordinate system to the view coordinate system whose axis direction is the line-of-sight direction.

Furthermore, in one embodiment, the position and orientation estimation unit may estimate the user's viewpoint position, and the coordinate conversion unit may convert the three-dimensional coordinates of the sound source object from the world coordinate system to the view coordinate system whose origin is the viewpoint position.
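A minimal Python sketch of such a world-to-view conversion is shown below. It assumes a right-handed, Y-up world, a view axis pointing along -Z (the OpenGL convention), and a listener orientation given by yaw alone; roll and pitch are omitted for brevity. The function name `world_to_view` and these conventions are assumptions made for the example.

```python
import math


def world_to_view(point, eye, yaw):
    """Convert a world-space point to view space.

    point, eye: (x, y, z) tuples in a right-handed, Y-up world.
    yaw: listener rotation about the Y axis in radians (0 = facing -Z,
         positive values turn the listener to the left).
    """
    # Translate so that the viewpoint position becomes the origin.
    dx, dy, dz = (p - e for p, e in zip(point, eye))
    c, s = math.cos(yaw), math.sin(yaw)
    # Apply the inverse of the listener's yaw rotation.
    return (c * dx - s * dz, dy, s * dx + c * dz)


# A listener at (0, 1, 0) who has turned 90 degrees to the left faces -X,
# so a source 5 units away along -X ends up straight ahead on the -Z axis.
ahead = world_to_view((-5.0, 1.0, 0.0), eye=(0.0, 1.0, 0.0), yaw=math.pi / 2)
```

A full implementation would use the complete rotation (for example a quaternion or 3x3 matrix) reported by the position and orientation estimation.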

Furthermore, in one embodiment, the three-dimensional sound rendering unit may include a mapping unit that assigns the audio chunks of the sound source objects to be acoustically processed to virtual multi-channel speakers arranged in the view coordinate system, and a downmix unit that generates the binaural audio using the audio chunks assigned to the virtual multi-channel speakers.
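The two-stage structure (mapping, then downmix) might look like the following Python sketch. Everything concrete here is an assumption for illustration: the four-speaker layout, the nearest-direction assignment rule, and the fixed left/right gain pairs, which stand in for the per-speaker head-related filtering a real downmix would apply. The appeal of such a structure is that the number of filters depends on the fixed speaker count, not on the number or position of sources.

```python
import math

# Hypothetical layout: (name, unit direction in view coords, (left, right) gain).
# The gain pairs are placeholders for per-speaker HRTF filters.
SPEAKERS = [
    ("front", (0.0, 0.0, -1.0), (0.7, 0.7)),
    ("left", (-1.0, 0.0, 0.0), (1.0, 0.2)),
    ("right", (1.0, 0.0, 0.0), (0.2, 1.0)),
    ("top", (0.0, 1.0, 0.0), (0.6, 0.6)),
]


def nearest_speaker(view_xyz):
    """Index of the speaker whose direction best matches the source direction."""
    norm = math.sqrt(sum(c * c for c in view_xyz)) or 1.0
    unit = [c / norm for c in view_xyz]
    dots = [sum(u * d for u, d in zip(unit, direction))
            for _, direction, _ in SPEAKERS]
    return dots.index(max(dots))


def render(sources, chunk_len):
    """sources: list of (view_xyz, samples). Returns (left, right) lists."""
    feeds = [[0.0] * chunk_len for _ in SPEAKERS]
    for view_xyz, samples in sources:  # mapping step
        feed = feeds[nearest_speaker(view_xyz)]
        for i, s in enumerate(samples[:chunk_len]):
            feed[i] += s
    left = [0.0] * chunk_len
    right = [0.0] * chunk_len
    for feed, (_, _, (gain_l, gain_r)) in zip(feeds, SPEAKERS):  # downmix step
        for i, s in enumerate(feed):
            left[i] += gain_l * s
            right[i] += gain_r * s
    return left, right


left, right = render(
    [((-3.0, 0.0, 0.0), [1.0, 1.0]), ((0.0, 5.0, 0.0), [0.5, 0.0])],
    chunk_len=2,
)
```

In this example the first source is assigned to the "left" speaker and the second to the "top" speaker before both feeds are mixed down to two channels.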

A content transmission system according to one embodiment includes the above receiving device and a distribution device that associates the audio chunks with the sound source metadata and transmits them to the receiving device.

A program according to one embodiment causes a computer to function as the above receiving device.

According to the present invention, a receiving device that receives streamed AR/VR content can suppress increases in the amount of computation and circuit scale, ensure real-time content playback, and improve viewing quality. It also becomes possible to use terminals of varying performance, with different CPU clocks and memory capacities, as receiving devices and to realize a service in which content is played back according to the processing performance of each terminal.

FIG. 1 is a diagram showing an example of the viewing position of the receiving device and the arrangement of multiple sound source objects according to the first embodiment.
FIG. 2 is a block diagram showing a configuration example of the AR content transmission system according to the first embodiment.
FIG. 3 is a diagram showing the blocking of the audio stream and the sound source metasequence according to the first embodiment.
FIG. 4 is a block diagram showing a configuration example of the three-dimensional sound rendering unit according to the first embodiment.
FIG. 5 is a diagram showing an example of the world coordinate system of three-dimensional space content (viewed from directly above) and the arrangement of sound source objects.
FIG. 6 is a diagram showing an example of the view coordinate system centered on the AR viewpoint (viewed from directly above).
FIG. 7 is a diagram showing an example of the view coordinate system placed in the world coordinate system (viewed from directly above).
FIG. 8 is a diagram showing an example of the view coordinate system placed in the world coordinate system (viewed from the side).
FIG. 9 is a diagram explaining the processing of the three-dimensional sound rendering unit according to the first embodiment.
FIG. 10 is a diagram explaining the processing of the three-dimensional sound rendering unit according to the first embodiment.
FIG. 11 is a block diagram showing a configuration example of the 360-degree VR video content transmission system according to the second embodiment.

Embodiments of the present invention will be described in detail below with reference to the drawings.

(First embodiment)
The first embodiment describes an AR content transmission system that transmits AR content with six degrees of freedom (6DoF).

FIG. 1 is a diagram showing an example of multiple sound source objects arranged in a three-dimensional space and a tablet-type receiving device 10 that receives AR content. In the figures, the three-dimensional space is drawn in a right-handed, Y-up coordinate system, but the coordinate system used in any given implementation is not limited to this. In the example shown in FIG. 1, sound source object O₁ is placed at (X₁, Y₁, Z₁), sound source object O₂ at (X₂, Y₂, Z₂), sound source object O₃ at (X₃, Y₃, Z₃), sound source object O₄ at (X₄, Y₄, Z₄), and sound source object O₅ at (X₅, Y₅, Z₅) in the three-dimensional space. The user can move freely within the space while holding the receiving device 10. The receiving device 10 generates binaural audio by processing the audio streams of the sound source objects according to the user's viewpoint position and line-of-sight direction. The user listens to the audio through the speakers of the receiving device 10 (SP(L) and SP(R)), external stereo headphones, or the like.

FIG. 2 is a diagram showing a configuration example of the AR content transmission system according to the first embodiment. The AR content transmission system 1 includes a receiving device (AR receiving device) 10 and a distribution device 40. A time server 50 is provided to synchronize the receiving device 10 and the distribution device 40. In the example shown in FIG. 2 there is a single time server 50, but the receiving device 10 and the distribution device 40 may each refer to a different time server. The time server 50 may be one that is provided on the Internet.

The distribution device 40 streams AR content over a transmission path 60 such as broadcasting or the Internet. The distribution device 40 associates the audio streams of the multiple sound source objects arranged in the three-dimensional space of the AR content with sound source metadata that includes the position information of the sound source objects (three-dimensional coordinates in the world coordinate system), and transmits them to the receiving device 10. In the example shown in FIG. 2, the distribution device 40 includes a clock generation unit 41 and a multiplexing unit 42.

The clock generation unit 41 generates a synchronous clock synchronized with the time input from the time server 50 and outputs it to the multiplexing unit 42.

The multiplexing unit 42 multiplexes a three-dimensional model sequence (a model sequence of three-dimensional objects), audio streams (sequences of audio chunks), and sound source metasequences (sequences of sound source metadata) to generate AR content, and transmits it to the outside of the distribution device 40. For example, when MMT (MPEG Media Transport) is used as the multiplexing method of the multiplexing unit 42, an audio chunk can be associated with an MPU (Media Processing Unit). The clock generation unit 41 outputs a presentation time stamp (PTS) based on UTC (Coordinated Universal Time), which is an absolute time, to the multiplexing unit 42, and the multiplexing unit 42 attaches the PTS to each piece of data.

The three-dimensional model sequence is data in which, for example, geometry data representing the shape of a three-dimensional object and texture data representing its surface pattern are sequenced at a constant frame rate, and it is transmitted in real time.

The sound source metasequence is data obtained by sampling, at a constant frame rate, position coordinate information that gives the position of the part of a three-dimensional object that acts as a sound source (the sound source object) in world coordinates, and it is transmitted in real time. The sound source point is typically assumed to be, for example, the center of the mouth for a person object, or the sound hole of a stringed instrument or the center of the striking surface of a percussion instrument for a musical instrument, but which part of the same object serves as the sound source depends on the content. For example, in content showing a person tap dancing, the soles of the person object's shoes are the sound source. Two or more sound source objects may also be associated with what is visually a single three-dimensional object, as in the case of a person object who tap dances while singing (the center of the mouth and the shoe soles are both sound sources).

The audio stream is stream data of the audio emitted from the sound source (sound source object) of a three-dimensional object, and it is transmitted in real time.

FIG. 3 is a diagram showing the blocking of the audio stream and the sound source metasequence. To link the audio stream and the sound source metasequence, both are divided into blocks at a roughly constant period. Hereinafter, a block of the audio stream is called an "audio chunk" and a block of the sound source metasequence is called "sound source metadata." As control information for multiplexing the various data, a sound source object ID (object_id) is assigned to each sound source object. Audio chunks and sound source metadata are also given a time-series sequence number (sequence_num) so that they can be aligned on the time axis. Sound source metadata and audio chunks are linked by the sound source object ID and the sequence number. In addition to coordinate data (coordinates), the sound source metasequence may include a priority (priority), a maximum sound pressure level (maximum_level), and other fields described later.
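The linkage by sound source object ID and sequence number can be illustrated with a short Python sketch. The field names `object_id`, `sequence_num`, `coordinates`, `priority`, and `maximum_level` follow the text; the class names, field types, default values, and the `pair_up` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AudioChunk:
    object_id: int
    sequence_num: int
    samples: bytes  # encoded audio for one block (type assumed)


@dataclass
class SourceMetadata:
    object_id: int
    sequence_num: int
    coordinates: tuple          # (x, y, z) in the world coordinate system
    priority: int = 0           # optional field
    maximum_level: float = 0.0  # optional field


def pair_up(chunks, metadata):
    """Join audio chunks and metadata on (object_id, sequence_num)."""
    index = {(m.object_id, m.sequence_num): m for m in metadata}
    pairs = []
    for chunk in chunks:
        meta = index.get((chunk.object_id, chunk.sequence_num))
        if meta is not None:  # skip chunks whose metadata is missing
            pairs.append((chunk, meta))
    return pairs


chunks = [AudioChunk(1, 0, b"a"), AudioChunk(1, 1, b"b"), AudioChunk(2, 0, b"c")]
metadata = [SourceMetadata(1, 0, (0, 0, 0)), SourceMetadata(2, 0, (1, 2, 3))]
pairs = pair_up(chunks, metadata)
```

Here the chunk with `object_id` 1 and `sequence_num` 1 has no matching metadata and is left out of `pairs`; how a receiver should treat such a gap is not specified in the text.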

The receiving device 10 is a mobile terminal such as a smartphone or tablet terminal, AR glasses, video see-through AR goggles, or the like. The receiving device 10 receives, from the distribution device 40, audio chunks obtained by dividing the audio streams of the sound source objects into blocks, and sound source metadata including the position information of the sound source objects.

In the example shown in FIG. 2, the receiving device 10 includes a clock generation unit 11, a demultiplexing unit 12, a first buffer 13, a second buffer 14, a model decoding unit 15, a camera 16, a frame memory 17, a detection unit 18, a position and orientation estimation unit 19, a model rendering unit 20, a video synthesis unit 21, a display 22, a coordinate conversion unit 23, a processing load measurement unit 24, an object selection unit 25, an audio decoding unit 26, a three-dimensional sound rendering unit 27, and a speaker 28.

The clock generation unit 11 generates a synchronous clock synchronized with the time input from the time server 50 and outputs it to the first buffer 13 and the second buffer 14.

The demultiplexing unit 12 receives, from the distribution device 40 over a transmission path 60 such as broadcasting or the Internet, AR content in which the three-dimensional model sequence, the audio streams, and the sound source metasequences are multiplexed, and separates them. It then outputs the three-dimensional model sequence to the first buffer 13, and the audio streams and sound source metasequences to the second buffer 14. A single buffer may be used, but in this embodiment the buffer is divided into the first buffer 13 and the second buffer 14 for convenience of explanation.

各データは、第１バッファ１３又は第２バッファ１４に蓄えられた後、クロック生成部１１から入力された同期クロックに同期して、後段の処理が行われる。例えば、多重化方式にＭＭＴが使用された場合には、第１バッファ１３及び第２バッファ１４は、同じＰＴＳが付与されたデータの処理結果が最終出力時に同時に提示されるように、それぞれ処理時間を考慮した適切なオフセットを付けて後段にデータを出力する。 After being stored in the first buffer 13 or the second buffer 14, each piece of data undergoes subsequent processing in synchronization with the synchronized clock input from the clock generation unit 11. For example, when MMT is used as the multiplexing method, the first buffer 13 and the second buffer 14 each output data to the subsequent stage with an appropriate offset that accounts for processing time, so that the processing results of data assigned the same PTS are presented simultaneously at the final output.

モデル復号部１５は、第１バッファ１３から取得した３次元モデルシーケンスを、ｇｌＴＦ（GL Transmission format）やＨ．２６５／ＨＥＶＣ（High Efficiency Video Coding）などの既存の方式により、モデルレンダリング部２０が直接処理可能な形式に復号し、モデルレンダリング部２０に出力する。例えば、グラフィックの描画処理にＯｐｅｎＧＬで規定される関数を用いる場合に、ＶＢＯ（Vertex Buffer Object）形式が用いられる場合がある。 The model decoding unit 15 decodes the 3D model sequence obtained from the first buffer 13 into a format that can be directly processed by the model rendering unit 20 using an existing method such as glTF (GL Transmission format) or H.265/HEVC (High Efficiency Video Coding), and outputs the decoded data to the model rendering unit 20. For example, when using functions defined in OpenGL for graphic rendering processing, the VBO (Vertex Buffer Object) format may be used.

カメラ１６は、受信装置１０の周囲の映像を撮影し、撮影したフレーム画像をフレームメモリ１７に出力する。 The camera 16 photographs the surroundings of the receiving device 10 and outputs the photographed frame image to the frame memory 17.

検出部１８は、ジャイロセンサ、加速度センサ、地磁気センサ、重力センサなどの１以上のセンサを有する。検出部１８は、各種センサにより検出したセンサ情報を位置姿勢推定部１９に出力する。ジャイロセンサは、物体が同じ方向の運動を続ける慣性の法則を利用して、３自由度（Ｒｏｌｌ，Ｐｉｔｃｈ，Ｙａｗ）の回転量を検知することができる。また、加速度センサは、物体が同じ場所に留まり続ける慣性の法則を利用して、３自由度（Ｘ，Ｙ，Ｚ）の移動速度変化を検知することができる。また、地磁気センサは南北方向を検知でき、重力センサは地面との垂直方向を検知できる。 The detection unit 18 includes one or more sensors such as a gyro sensor, an acceleration sensor, a geomagnetic sensor, and a gravity sensor. The detection unit 18 outputs the sensor information detected by these sensors to the position and orientation estimation unit 19. The gyro sensor can detect the amount of rotation in three degrees of freedom (roll, pitch, yaw) by using the law of inertia, under which an object continues to move in the same direction. The acceleration sensor can detect changes in velocity in three degrees of freedom (X, Y, Z) by using the law of inertia, under which an object tends to remain in the same place. The geomagnetic sensor can detect the north-south direction, and the gravity sensor can detect the direction perpendicular to the ground.

位置姿勢推定部１９は、検出部１８により検出されたセンサ情報を用いて受信装置１０の姿勢を推定する。位置姿勢推定部１９は、カメラ１６により撮影されたフレーム画像をさらに用いて受信装置１０の位置及び姿勢を推定してもよい。位置姿勢推定部１９は、受信装置１０の姿勢から、ユーザの視線方向を推定する。例えば、ユーザの視線方向は、カメラ１６の向いている方向としてもよい。 The position and orientation estimation unit 19 estimates the orientation of the receiving device 10 using the sensor information detected by the detection unit 18. The position and orientation estimating unit 19 may further use the frame images captured by the camera 16 to estimate the position and orientation of the receiving device 10. The position and orientation estimating unit 19 estimates the user's line-of-sight direction from the orientation of the receiving device 10. For example, the direction of the user's line of sight may be the direction in which the camera 16 is facing.

また、位置姿勢推定部１９は、カメラ１６により撮影された映像に基づいて、ユーザの視点位置を推定する。ユーザの視点位置は、カメラ１６の位置としてもよい。位置姿勢推定部１９は、例えば、実空間を撮影したある１枚の映像フレームの画像から特徴点を検出し、その次の映像フレームの画像内でその特徴点と同様の特徴量をもつ点を近傍探索により検出し、一つの特徴点の移動量を判定する。次に、前後２フレームにおけるその特徴点の位置と、映像フレームと同じ時間間隔での視線方向変化の推定結果を組み合わせることで、三点測量により受信装置１０と特徴点との距離を求めることができる。同様に、位置姿勢推定部１９は、複数の特徴点と受信装置１０との距離を検出し、それらの特徴点が同一平面に存在することを判断することで、実空間内の平面を検出することができる。そして、位置姿勢推定部１９は、ユーザの視点位置及び視線方向を示す視点情報をモデルレンダリング部２０及び座標変換部２３に出力する。 The position and orientation estimation unit 19 also estimates the user's viewpoint position based on the video captured by the camera 16. The user's viewpoint position may be taken to be the position of the camera 16. For example, the position and orientation estimation unit 19 detects a feature point in the image of one video frame of real space, finds a point with a similar feature value in the image of the next video frame by neighborhood search, and determines how far that feature point has moved. Next, by combining the positions of that feature point in the two consecutive frames with the estimated change in line-of-sight direction over the same time interval as the video frames, the distance between the receiving device 10 and the feature point can be obtained by triangulation. Similarly, the position and orientation estimation unit 19 can detect a plane in real space by obtaining the distances between a plurality of feature points and the receiving device 10 and determining that those feature points lie on the same plane. The position and orientation estimation unit 19 then outputs viewpoint information indicating the user's viewpoint position and line-of-sight direction to the model rendering unit 20 and the coordinate conversion unit 23.

なお、ＡＲコンテンツを実際に視聴し始める前に、端末の位置姿勢推定部１９に実空間の平面などのコンテンツ視聴空間の状況を学習させるキャリブレーション作業をユーザに行わせてもよい。事前のキャリブレーションをユーザに行わせる場合には、ユーザがＡＲによってオブジェクトを合成させるように意図する平面（床面や、テーブルの卓面）を中心に、実空間をカメラ１６で撮影して平面検出を行わせる。一般に、事前キャリブレーションによる空間学習を行うことで、ＡＲによる実空間映像へのオブジェクトの合成をより安定させることが可能であるが、ＡＲコンテンツ視聴において事前キャリブレーションを必要としない場合もある。 Before actually starting to view AR content, the user may be made to perform a calibration operation in which the position and orientation estimation unit 19 of the terminal learns the state of the content viewing space, such as a plane in real space. When making the user perform pre-calibration, the camera 16 captures an image of the real space, centered on a plane (such as a floor or a table surface) on which the user intends to synthesize an object using AR, and plane detection is performed. In general, spatial learning through pre-calibration can make the synthesis of objects into real-space images using AR more stable, but there are cases in which pre-calibration is not required for viewing AR content.

モデルレンダリング部２０は、位置姿勢推定部１９から視点情報を入力し、モデル復号部１５により復号された３次元モデルシーケンスに対して、視点位置及び視線方向に応じたビューポートのレンダリングを行ってレンダリング画像を生成し、映像合成部２１に出力する。 The model rendering unit 20 inputs viewpoint information from the position and orientation estimation unit 19, and performs rendering of a viewport according to the viewpoint position and line of sight direction for the 3D model sequence decoded by the model decoding unit 15 to generate a rendering image, which is output to the video synthesis unit 21.

映像合成部２１は、モデルレンダリング部２０から入力したレンダリング画像と、フレームメモリ１７から入力したフレーム画像とを合成して合成画像を生成し、該合成画像をディスプレイ２２に表示させる。 The video synthesis section 21 synthesizes the rendered image inputted from the model rendering section 20 and the frame image inputted from the frame memory 17 to generate a synthesized image, and causes the display 22 to display the synthesized image.

座標変換部２３は、位置姿勢推定部１９から視点情報を入力し、第２バッファ１４から音源メタデータ（音源オブジェクトのワールド座標系における３次元座標）を入力する。座標変換部２３は、音源オブジェクトの３次元座標を、ワールド座標系から、ユーザの視点位置を中心として視線方向を軸方向とするビュー座標系に座標変換を行い、座標変換後の音源メタデータをオブジェクト選択部２５に出力する。座標変換には、例えば、アフィン変換を用いることができる。視点位置を（Ｖｘ，Ｖｙ，Ｖｚ）、視線方向の単位ベクトルを（Ｄｘ，Ｄｙ，Ｄｚ）とすると、Ｙ軸中心の回転角α、Ｘ軸中心の回転角βに関して、式（１）が成立する。 The coordinate conversion unit 23 inputs viewpoint information from the position and orientation estimation unit 19, and inputs sound source metadata (three-dimensional coordinates of the sound source object in the world coordinate system) from the second buffer 14. The coordinate conversion unit 23 converts the three-dimensional coordinates of the sound source object from the world coordinate system to a view coordinate system centered on the user's viewpoint position and with the line of sight as the axial direction, and outputs the sound source metadata after coordinate conversion to the object selection unit 25. For example, an affine transformation can be used for the coordinate conversion. If the viewpoint position is (Vx, Vy, Vz) and the unit vector of the line of sight is (Dx, Dy, Dz), then equation (1) holds for the rotation angle α around the Y axis and the rotation angle β around the X axis.

この時、ワールド座標系からビュー座標系への回転行列Ａは、式（２）で表される。 At this time, the rotation matrix A from the world coordinate system to the view coordinate system is expressed by equation (2).

また、視点位置を原点とする座標移動行列Ｔは、式（３）で表される。 Further, a coordinate movement matrix T whose origin is the viewpoint position is expressed by equation (3).

以上より、座標変換部２３は、式（４）の行列演算式により、音源オブジェクトの３次元座標ベクトルＰをビュー座標系の座標ベクトルＰ’に座標変換する。これらの座標変換演算は、３次元モデルの描画時においても一般的に行われる演算処理であり、ＧＰＵの機能を用いることができる。なお、上記の変換行列は一例であり、ワールド座標系及びビュー座標系における右手系・左手系の違いや、軸極性の向き、座標ベクトルを行ベクトルで表現するか列ベクトルで表現するか等により異なる場合がある。 From the above, the coordinate conversion unit 23 transforms the three-dimensional coordinate vector P of the sound source object into the coordinate vector P' in the view coordinate system using the matrix expression of equation (4). These coordinate transformations are operations commonly performed when drawing three-dimensional models as well, so GPU functions can be used for them. Note that the above transformation matrices are only one example; they may differ depending on whether the world and view coordinate systems are right-handed or left-handed, the polarity of the axes, whether coordinate vectors are expressed as row vectors or column vectors, and so on.
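
The world-to-view conversion described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the patent's own matrices: the view coordinate system is taken to have its +Z axis along the line of sight and +Y up, α is the rotation about the Y axis, β is the rotation about the X axis, and the function name `world_to_view` is hypothetical. Other handedness or axis conventions would change the signs.

```python
import math

def world_to_view(p, v, d):
    """Convert world point p to view coordinates (assumed convention).

    p: (x, y, z) of the sound source object in world coordinates
    v: (Vx, Vy, Vz) viewpoint position
    d: (Dx, Dy, Dz) unit line-of-sight vector
    Assumption: the view +Z axis points along d, with +Y up.
    """
    # Angles of the line of sight: alpha about the Y axis, beta about the X axis.
    alpha = math.atan2(d[0], d[2])
    beta = math.asin(max(-1.0, min(1.0, d[1])))

    # Translation: move the viewpoint to the origin.
    x, y, z = p[0] - v[0], p[1] - v[1], p[2] - v[2]

    # Rotation about Y by -alpha brings the line of sight into the YZ plane.
    ca, sa = math.cos(alpha), math.sin(alpha)
    x1 = ca * x - sa * z
    z1 = sa * x + ca * z

    # Rotation about X by beta aligns the line of sight with +Z.
    cb, sb = math.cos(beta), math.sin(beta)
    y2 = cb * y - sb * z1
    z2 = sb * y + cb * z1
    return (x1, y2, z2)

# An object 3 units straight ahead of a viewer looking along world +X
# ends up at approximately (0, 0, 3) in view coordinates.
print(world_to_view((3.0, 1.0, 2.0), (0.0, 1.0, 2.0), (1.0, 0.0, 0.0)))
```

Because the transform is a translation followed by rotations, it preserves the distance from the viewpoint, which is what the later selection step relies on.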

処理負荷測定部２４は、ＣＰＵ使用率、メモリ使用率などの受信装置１０の処理負荷を示す負荷情報を測定し、オブジェクト選択部２５に出力する。 The processing load measurement unit 24 measures load information indicating the processing load of the receiving device 10, such as CPU usage rate and memory usage rate, and outputs it to the object selection unit 25.

オブジェクト選択部２５は、クロックにより制御されたタイミングで、第２バッファ１４から音声チャンクを入力する。オブジェクト選択部２５は、処理負荷測定部２４により測定された処理負荷に基づいて、ビュー座標系において音源オブジェクト選択領域を規定し、該音源オブジェクト選択領域内に位置する音源オブジェクトを、音響処理対象の音源オブジェクトとして選択する。換言すれば、オブジェクト選択部２５は、処理負荷測定部２４により測定された処理負荷、及び座標変換部２３から入力された音源メタデータ（音源オブジェクトのビュー座標系における３次元座標）に基づいて、音響処理対象の音源オブジェクトを選択する。そして、オブジェクト選択部２５は、選択した音源オブジェクトの音声チャンクを音声復号部２６に出力し、該音源オブジェクトの３次元座標を３次元音響レンダリング部２７に出力する。 The object selection unit 25 receives audio chunks from the second buffer 14 at timing controlled by the clock. Based on the processing load measured by the processing load measurement unit 24, the object selection unit 25 defines a sound source object selection area in the view coordinate system and selects the sound source objects located within that area as the sound source objects to be acoustically processed. In other words, the object selection unit 25 selects the sound source objects to be acoustically processed based on the processing load measured by the processing load measurement unit 24 and the sound source metadata (three-dimensional coordinates of the sound source objects in the view coordinate system) input from the coordinate conversion unit 23. The object selection unit 25 then outputs the audio chunks of the selected sound source objects to the audio decoding unit 26 and outputs their three-dimensional coordinates to the three-dimensional sound rendering unit 27.

具体的には、オブジェクト選択部２５は、処理負荷測定部２４から入力された負荷情報を元に、処理負荷の評価値Ｌを算出する。例えば、ＣＰＵ使用率をＲ_１とし、メモリ使用率をＲ_２とし、係数をＫ_１及びＫ_２すると、オブジェクト選択部２５は、評価値Ｌ＝Ｋ_１×Ｒ_１＋Ｋ_２×Ｒ_２とする。係数Ｋ_１及びＫ_２の一方は０であってもよい。オブジェクト選択部２５は、評価値Ｌ（処理負荷）が大きいほど、音源オブジェクト選択領域が小さくなるように規定する。例えば、オブジェクト選択部２５は、評価値Ｌが第１の閾値を超える場合には、音響処理対象の音源オブジェクトの数を減らすように音源オブジェクト選択領域を縮小し、評価値Ｌが第２の閾値よりも小さい場合には、処理対象の音源オブジェクトの数を増やすように音源オブジェクト選択領域を拡大する。 Specifically, the object selection unit 25 calculates a processing load evaluation value L from the load information input from the processing load measurement unit 24. For example, letting the CPU usage rate be R_1, the memory usage rate be R_2, and the coefficients be K_1 and K_2, the object selection unit 25 sets the evaluation value L = K_1 × R_1 + K_2 × R_2. One of the coefficients K_1 and K_2 may be 0. The object selection unit 25 defines the sound source object selection area so that it becomes smaller as the evaluation value L (processing load) becomes larger. For example, when the evaluation value L exceeds a first threshold, the object selection unit 25 shrinks the sound source object selection area so as to reduce the number of sound source objects to be acoustically processed, and when the evaluation value L is smaller than a second threshold, it expands the sound source object selection area so as to increase the number of sound source objects to be processed.
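
The evaluation value and the two-threshold control described above might look like the sketch below. The coefficient values, the thresholds, the step size, and the radius limits are all illustrative assumptions (the text does not specify them), and the function names `load_score` and `update_radius` are hypothetical. The selection area is represented here by a single radius for simplicity.

```python
def load_score(cpu_usage, mem_usage, k1=0.7, k2=0.3):
    """Evaluation value L = K_1 * R_1 + K_2 * R_2 (usage rates in 0..1).

    k1 and k2 are assumed example weights; either may be 0.
    """
    return k1 * cpu_usage + k2 * mem_usage

def update_radius(radius, score, upper=0.8, lower=0.4,
                  step=0.5, r_min=1.0, r_max=20.0):
    """Shrink the selection area above the first threshold, grow it below
    the second threshold, and leave it unchanged in between.

    All numeric defaults are assumptions made for illustration.
    """
    if score > upper:
        return max(r_min, radius - step)
    if score < lower:
        return min(r_max, radius + step)
    return radius

# Heavy load (CPU 95 %, memory 90 %) shrinks a radius of 5.0 to 4.5.
print(update_radius(5.0, load_score(0.95, 0.90)))
```

Keeping a gap between the two thresholds avoids resizing the area on every small fluctuation of the measured load.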

例えば、オブジェクト選択部２５は、音源オブジェクト選択領域をユーザの視点位置であるビュー座標系の原点を中心とした半径Ｒの球体とし、原点からの距離ｒがｒ＜Ｒとなる音源オブジェクトを処理対象とすることが考えられる。すなわち、オブジェクト選択部２５は、処理負荷の評価値Ｌと半径Ｒの関係をＲ＝ｆ（Ｌ）（ｆ（ｘ）は単調減少関数）とし、処理負荷が予め定める閾値よりも大きい時は半径Ｒを小さくし、処理負荷が予め定める閾値よりも小さい時は半径Ｒを大きくする制御を行う。なお、本実施形態では音源オブジェクト選択領域をユーザの視点位置を中心とする球体とするが、ユーザの視線方向に指向性を持たせた楕円体など、その他の形状で定義することも可能である。 For example, the object selection unit 25 may define the sound source object selection area as a sphere of radius R centered on the origin of the view coordinate system, which is the user's viewpoint position, and process the sound source objects whose distance r from the origin satisfies r < R. That is, the object selection unit 25 sets the relationship between the processing load evaluation value L and the radius R as R = f(L), where f(x) is a monotonically decreasing function, and performs control so that the radius R is reduced when the processing load is larger than a predetermined threshold and enlarged when the processing load is smaller than a predetermined threshold. In this embodiment the sound source object selection area is a sphere centered on the user's viewpoint position, but it can also be defined with other shapes, such as an ellipsoid with directivity along the user's line of sight.
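
A minimal sketch of the spherical selection follows. The linear form chosen for f(L), the radius limits, and the names `radius_from_load` and `select_in_sphere` are assumptions for illustration; the text only requires f to be monotonically decreasing.

```python
import math

def radius_from_load(score, r_max=10.0, r_min=1.0):
    """R = f(L) with an assumed linear, monotonically decreasing f.

    score is the evaluation value L, clamped to the range 0..1.
    """
    s = min(1.0, max(0.0, score))
    return r_max - (r_max - r_min) * s

def select_in_sphere(objects, radius):
    """Return the IDs of objects whose distance r from the view origin
    satisfies r < radius.

    objects: dict mapping object ID to (x, y, z) in view coordinates.
    """
    selected = []
    for obj_id, (x, y, z) in objects.items():
        r = math.sqrt(x * x + y * y + z * z)
        if r < radius:
            selected.append(obj_id)
    return selected

objects = {"O1": (1.0, 0.0, 0.0), "O2": (0.0, 6.0, 0.0), "O3": (3.0, 0.0, 4.0)}
# With L = 0.5 the radius is 5.5, so O1 (r = 1) and O3 (r = 5) are kept.
print(select_in_sphere(objects, radius_from_load(0.5)))
```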

音源メタデータは、音源オブジェクトの優先度ｐを含んでもよい。この場合には、オブジェクト選択部２５は、ビュー座標系における音源オブジェクトの原点からの距離を優先度ｐが大きいほど短くなるように変更した場合に音源オブジェクト選択領域に含まれることになる音源オブジェクトを、音響処理対象の音源オブジェクトとして選択する。例えば、原点と音源オブジェクトとの実際の距離ｒに対して、ｒ’＝ｒ＊ｇ（ｐ）（ｇ（ｘ）は単調減少関数）を音源オブジェクト選択領域の半径Ｒと比較する際の評価値ｒ’としてもよい。例えば、ｇ（ｘ）＝１／ｘとしたとき、オブジェクト選択部２５は、優先度ｐ＝１の音源オブジェクトについては、ｒ’＝ｒ≦Ｒを満たさなければ音響処理対象の音源オブジェクトとして選択しないが、優先度ｐ＝１００の音源オブジェクトについては、ｒ’＝ｒ／１００≦Ｒを満たせば音響処理対象の音源オブジェクトとして選択する。 The sound source metadata may include the priority p of the sound source object. In this case, the object selection unit 25 selects, as a sound source object to be acoustically processed, a sound source object that will be included in the sound source object selection area when the distance from the origin of the sound source object in the view coordinate system is changed so that the greater the priority p, the shorter it becomes. For example, for the actual distance r between the origin and the sound source object, r' = r * g(p) (g(x) is a monotonically decreasing function) may be used as the evaluation value r' when comparing with the radius R of the sound source object selection area. For example, when g(x) = 1/x, the object selection unit 25 does not select a sound source object with priority p = 1 as a sound source object to be acoustically processed unless r' = r ≦ R is satisfied, but selects a sound source object with priority p = 100 as a sound source object to be acoustically processed if r' = r/100 ≦ R is satisfied.

また、音源メタデータは、音源オブジェクトの最大音圧レベルｌを含んでもよい。この場合には、オブジェクト選択部２５は、ビュー座標系における音源オブジェクトの原点からの距離を最大音圧レベルｌが大きいほど短くなるように変更した場合に音源オブジェクト選択領域に含まれることになる音源オブジェクトを、音響処理対象の音源オブジェクトとして選択する。つまり、最大音圧レベルｌについても、音圧レベルの高い音声ほど視聴位置から遠くても聞こえるため、ｒ’＝ｒ＊ｈ（ｌ）（ｈ（ｘ）は単調減少関数）を評価値とする。最大音圧レベルを音源メタデータとして伝送することにより、受信装置１０で実際に音声チャンクを復号せずとも音圧レベルに応じた選択が可能となり、処理負荷の軽減が可能となる。優先度ｐ及び最大音圧レベルｌをともに考慮すると、音源オブジェクトの原点距離の評価値ｒ’は実際の距離ｒに対して、ｒ’＝ｒ＊ｇ（ｐ）＊ｈ（ｌ）となり、オブジェクト選択部２５は、ｒ’≦Ｒを満たす音源オブジェクトを、音響処理対象の音源オブジェクトとして選択する。 The sound source metadata may also include the maximum sound pressure level l of the sound source object. In this case, the object selection unit 25 selects, as a sound source object to be acoustically processed, a sound source object that would fall within the sound source object selection area if its distance from the origin in the view coordinate system were shortened as the maximum sound pressure level l increases. That is, because louder sounds remain audible farther from the listening position, r' = r * h(l), where h(x) is a monotonically decreasing function, is used as the evaluation value. Transmitting the maximum sound pressure level as sound source metadata lets the receiving device 10 select objects according to sound pressure level without actually decoding the audio chunks, which reduces the processing load. When both the priority p and the maximum sound pressure level l are considered, the evaluation value r' of the sound source object's distance from the origin is r' = r * g(p) * h(l) for the actual distance r, and the object selection unit 25 selects the sound source objects satisfying r' ≦ R as the sound source objects to be acoustically processed.
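
The combined evaluation value r' = r * g(p) * h(l) can be sketched as below. g(x) = 1/x is the example given in the text; the form of h, which models the level in dB relative to an assumed reference level, is purely an illustrative assumption, as are the names `effective_distance` and `is_selected`.

```python
def g(priority):
    """Priority weighting; g(x) = 1/x is the example given in the text."""
    return 1.0 / priority

def h(max_level_db, ref_db=60.0):
    """Level weighting (assumed form): every 20 dB above the reference
    level shrinks the evaluated distance by a factor of 10."""
    return 10.0 ** (-(max_level_db - ref_db) / 20.0)

def effective_distance(r, priority=1.0, max_level_db=60.0):
    """Evaluation value r' = r * g(p) * h(l) compared against radius R."""
    return r * g(priority) * h(max_level_db)

def is_selected(r, radius, priority=1.0, max_level_db=60.0):
    """True when the object satisfies r' <= R."""
    return effective_distance(r, priority, max_level_db) <= radius

# A distant object (r = 200) with priority 100 is evaluated as if it were
# at distance 2, so it is selected with R = 5; with priority 1 it is not.
print(is_selected(200.0, 5.0, priority=100.0), is_selected(200.0, 5.0))
```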

音声復号部２６は、オブジェクト選択部２５により選択された音源オブジェクトの音声チャンクを、３次元音響レンダリング部２７が直接処理可能な形式に変換する。例えば、音声復号部２６は、ＭＰＥＧ－４ＡＡＣ（Advanced Audio Coding）、ＭＰＥＧ－Ｈ３ＤＡ（３Ｄ Audio）などの圧縮ストリームを復号処理し、ＰＣＭ（Pulse Code Modulation）データなどの非圧縮ストリームに変換する。音声復号部２６は、復号処理した音声チャンク（復号済み音声チャンク）を３次元音響レンダリング部２７に出力する。 The audio decoding unit 26 converts the audio chunks of the sound source objects selected by the object selection unit 25 into a format that the three-dimensional sound rendering unit 27 can process directly. For example, the audio decoding unit 26 decodes a compressed stream such as MPEG-4 AAC (Advanced Audio Coding) or MPEG-H 3DA (3D Audio) and converts it into an uncompressed stream such as PCM (Pulse Code Modulation) data. The audio decoding unit 26 outputs the decoded audio chunks to the three-dimensional sound rendering unit 27.

３次元音響レンダリング部２７は、音声復号部２６から復号済み音声チャンクを入力し、オブジェクト選択部２５から音響処理対象として選択された音源オブジェクトのビュー座標系における３次元座標を入力する。３次元音響レンダリング部２７は、音響処理対象の音源オブジェクトの音声チャンクを用いてバイノーラル音声を生成し、スピーカ２８、又は図示しないヘッドフォンなどからバイノーラル音声を出力させる。 The three-dimensional sound rendering unit 27 inputs the decoded audio chunk from the audio decoding unit 26 and receives the three-dimensional coordinates in the view coordinate system of the sound source object selected as a target for audio processing from the object selection unit 25. The three-dimensional sound rendering unit 27 generates binaural audio using audio chunks of the sound source object to be processed, and outputs the binaural audio from the speaker 28 or headphones (not shown).

図４は、３次元音響レンダリング部２７の構成例を示すブロック図である。３次元音響レンダリング部２７は、マッピング部２７１と、ダウンミックス部２７２と、を備える。 Figure 4 is a block diagram showing an example configuration of the three-dimensional audio rendering unit 27. The three-dimensional audio rendering unit 27 includes a mapping unit 271 and a downmixing unit 272.

マッピング部２７１は、復号済み音声チャンク及び音源メタデータ（音源オブジェクトの３次元座標）を、音源オブジェクトＩＤ及びシーケンス番号によって紐付け可能な状態で入力する。そして、マッピング部２７１は、復号済み音声チャンクを、ビュー座標系において視点位置を中心とした所定位置に配置された所定数の仮想マルチチャンネルスピーカ（仮想チャンネルベース音源）に割り当てる（ミックスする）。 The mapping unit 271 inputs the decoded audio chunk and sound source metadata (three-dimensional coordinates of the sound source object) in a state that can be linked by the sound source object ID and sequence number. The mapping unit 271 then assigns (mixes) the decoded audio chunk to a predetermined number of virtual multi-channel speakers (virtual channel-based sound sources) arranged at predetermined positions centered on the viewpoint position in the view coordinate system.

図５は、３次元空間コンテンツのワールド座標系（真上からの視点）及び音源オブジェクトＯ_１～Ｏ_５の配置例を示す図である。図６は、視点位置を原点としたビュー座標系（真上からの視点）及び仮想マルチチャンネルスピーカの配置の例を示す図である。図６に示す例では、仮想マルチチャンネルスピーカは、視点位置と同じ高さに、視点位置を中心とする円上に等間隔に８個配置されており、以降の図７から図１０ついても同様である。仮想マルチチャンネルスピーカの数及び配置場所はこの限りではなく、任意の数の仮想マルチチャンネルスピーカを任意の場所に配置可能である。例えば、５．１ｃｈや２２．２ｃｈのマルチチャンネル音響のスピーカ配置などを使用してもよい。 FIG. 5 is a diagram showing an example of the world coordinate system of the three-dimensional space content (viewed from directly above) and the arrangement of sound source objects O_1 to O_5. FIG. 6 is a diagram showing an example of the view coordinate system with the viewpoint position as the origin (viewed from directly above) and the arrangement of the virtual multi-channel speakers. In the example shown in FIG. 6, eight virtual multi-channel speakers are arranged at equal intervals on a circle centered on the viewpoint position, at the same height as the viewpoint position; the same applies to FIGS. 7 to 10 below. The number and placement of the virtual multi-channel speakers are not limited to this, and any number of virtual multi-channel speakers can be placed at any location. For example, the speaker arrangement of 5.1ch or 22.2ch multi-channel sound may be used.

ユーザの動作に伴い視点位置は移動するため、ワールド座標系とビュー座標系は、ユーザの動作によって相対的な位置関係が変化する。図７は、ある瞬間における、ワールド座標系に配置されたビュー座標系（真上からの視点）の例を示す図である。図８は、ある瞬間における、ワールド座標系に配置されたビュー座標系（真横からの視点）の例を示す図である。 Since the viewpoint position moves with the user's movements, the relative positional relationship between the world coordinate system and the view coordinate system changes depending on the user's movements. FIG. 7 is a diagram showing an example of a view coordinate system (viewpoint from directly above) arranged in the world coordinate system at a certain moment. FIG. 8 is a diagram illustrating an example of a view coordinate system (viewpoint from the side) arranged in the world coordinate system at a certain moment.

図９は、マッピング部２７１の処理の一例として、音源オブジェクトからビュー座標系の仮想マルチチャンネルスピーカへのマッピングの例を示す図である。図９では、オブジェクト選択部２５は視点位置を中心とする球状の音源オブジェクト選択領域の内側に位置する音源オブジェクトを音響処理対象とするものとし、音源オブジェクト選択領域を２点鎖線で示している。なお、音源オブジェクトＯ_２は、上側から投影図では音源オブジェクト選択領域の内部に位置するように見えるが、空間的に見ると原点を中心とする球状の音源オブジェクト選択領域の外部に位置するため、マッピングの対象から外れる。また、図９では、例として、ビュー座標系に変換された音源オブジェクトの座標と仮想マルチチャンネルスピーカとの間の距離を利用し、音源オブジェクトから最も距離が近い仮想マルチチャンネルスピーカにマッピングすることで規定数のチャンネルにマッピングをしていることを示している。なお、音源オブジェクトから仮想マルチチャンネルスピーカへのマッピング手法はこれに限られるものではなく、例えば１つの音源オブジェクトを複数の仮想マルチチャンネルスピーカに分散させてもよい。マッピングでは、例えば、ビュー座標系の原点と音源オブジェクトとの距離ｒに応じて音圧を減衰させる。なお、音響処理対象の音源オブジェクトの数が仮想マルチチャンネルスピーカの規定数に足りない場合には、オブジェクト選択部２５は音源オブジェクト選択領域を広げて選択をやり直してもよい。 FIG. 9 is a diagram showing, as an example of the processing of the mapping unit 271, mapping from sound source objects to the virtual multi-channel speakers in the view coordinate system. In FIG. 9, the object selection unit 25 treats the sound source objects located inside a spherical sound source object selection area centered on the viewpoint position as targets of acoustic processing, and the sound source object selection area is indicated by a two-dot chain line. Note that although the sound source object O_2 appears to lie inside the sound source object selection area in the top-down projection, spatially it lies outside the spherical selection area centered on the origin and is therefore excluded from mapping. In FIG. 9, as an example, each sound source object is mapped to the virtual multi-channel speaker closest to it, using the distance between the object's coordinates in the view coordinate system and each virtual multi-channel speaker, so that the objects are mapped onto the prescribed number of channels. The method of mapping sound source objects to virtual multi-channel speakers is not limited to this; for example, one sound source object may be distributed across multiple virtual multi-channel speakers. In the mapping, the sound pressure is attenuated, for example, according to the distance r between the origin of the view coordinate system and the sound source object. If the number of sound source objects to be acoustically processed falls short of the prescribed number of virtual multi-channel speakers, the object selection unit 25 may widen the sound source object selection area and perform the selection again.
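
The nearest-speaker mapping with distance attenuation described above can be sketched as follows. The ring layout (speaker 0 straight ahead on +Z, indices increasing clockwise when viewed from above), the inverse-distance gain law, and all function names are assumptions made for illustration; the text does not fix them.

```python
import math

def ring_speakers(count=8, radius=1.0):
    """Virtual speakers equally spaced on a circle at viewpoint height.

    Assumed layout: speaker 0 is straight ahead (+Z) and indices increase
    clockwise when viewed from above (+X is to the right).
    """
    return [(radius * math.sin(2 * math.pi * i / count),
             0.0,
             radius * math.cos(2 * math.pi * i / count))
            for i in range(count)]

def _distance(a, b):
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in range(3)))

def nearest_speaker(position, speakers):
    """Index of the virtual speaker closest to the sound source object."""
    return min(range(len(speakers)),
               key=lambda i: _distance(position, speakers[i]))

def distance_gain(position, ref=1.0):
    """Assumed inverse-distance attenuation, capped at 1.0 inside ref."""
    r = _distance(position, (0.0, 0.0, 0.0))
    return ref / max(r, ref)

speakers = ring_speakers()
obj = (5.0, 0.0, 0.0)  # an object directly to the right of the viewer
print(nearest_speaker(obj, speakers), distance_gain(obj))
```

Distributing one object across several neighbouring speakers, as the text also allows, would replace `nearest_speaker` with a panning law that returns per-speaker weights.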

ダウンミックス部２７２は、仮想マルチチャンネルスピーカに割り当てられた復号済み音声チャンクを用いて、ユーザの左右の耳に対応する２チャンネルのバイノーラル音声にダウンミックスする。具体的には、ダウンミックス部２７２は、音声信号を一定の処理区間に区切り、音声信号を周波数領域に変換して固定数の頭部伝達関数の周波数特性を掛け合わせた後に、時間領域の音声信号に戻す処理を行うことにより、バイノーラル音声にダウンミックスする。 The downmix unit 272 downmixes the decoded audio chunks assigned to the virtual multi-channel speakers into two-channel binaural audio corresponding to the user's left and right ears. Specifically, the downmix unit 272 divides the audio signal into fixed processing sections, converts each section into the frequency domain, multiplies it by the frequency characteristics of a fixed number of head-related transfer functions, and then converts the result back into a time-domain audio signal, thereby producing the binaural downmix.

図１０は、ダウンミックス部２７２の処理の一例として、ビュー座標系の仮想マルチチャンネルスピーカからバイノーラル音声を生成するダウンミックス処理の例を示している。図１０に示す例では、前後方向の仮想マルチチャンネルスピーカについては左右両方のスピーカ２８－１及び２８－２に両方に均等に割り当て、その他の仮想マルチチャンネルスピーカは最寄りの左右いずれかのスピーカ２８－１又は２８－２にマッピングしている。これらのマッピングごとに頭部伝達関数の周波数領域の演算を行い、ミックスすることで、ユーザは上下前後左右の立体感を体験できる。図１０に示すダウンミックスによると、必要な頭部伝達関数の数は１０個となる。なお、図９に示したマッピング、及び図１０に示したダウンミックスはあくまで一例であり、より高度なアルゴリズムにより３次元音場の再現性を高めてもよい。例えば、図１０においては視線方向の左寄りの仮想マルチチャンネルスピーカはスピーカ２８－１のみへ、右寄りの仮想マルチチャンネルスピーカはスピーカ２８－２のみへの頭部伝達関数を考慮しているが、視線方向の左寄りの仮想マルチチャンネルスピーカからスピーカ２８－２へ、右寄りの仮想マルチチャンネルスピーカはスピーカ２８－１への頭部伝達関数を考慮するように、頭部伝達関数を増やしても良い。頭部伝達関数の数を増やすことで、よりリアリティのあるバイノーラル音声の生成が期待できるが、処理負荷増加とのトレードオフとなる。 FIG. 10 shows, as an example of the processing of the downmix unit 272, a downmix process that generates binaural audio from the virtual multi-channel speakers in the view coordinate system. In the example shown in FIG. 10, the front and rear virtual multi-channel speakers are assigned equally to both the left and right speakers 28-1 and 28-2, and each of the other virtual multi-channel speakers is mapped to the nearer of the left and right speakers 28-1 and 28-2. By performing the frequency-domain head-related transfer function calculation for each of these mappings and mixing the results, the user can experience a three-dimensional sense of up, down, front, back, left, and right. With the downmix shown in FIG. 10, ten head-related transfer functions are required. Note that the mapping shown in FIG. 9 and the downmix shown in FIG. 10 are merely examples, and the reproducibility of the three-dimensional sound field may be improved with more advanced algorithms. For example, FIG. 10 considers head-related transfer functions only from the virtual multi-channel speakers left of the line of sight to speaker 28-1 and only from those right of the line of sight to speaker 28-2, but the number of head-related transfer functions may be increased so that the paths from the left-side virtual multi-channel speakers to speaker 28-2 and from the right-side virtual multi-channel speakers to speaker 28-1 are also considered. Increasing the number of head-related transfer functions can be expected to produce more realistic binaural audio, at the cost of increased processing load.
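
The count of ten head-related transfer functions follows from the routing: two centre speakers feed both ears (4 paths) and the six side speakers feed one ear each (6 paths). The sketch below enumerates those paths. The ring layout (speaker 0 straight ahead, indices increasing clockwise viewed from above), the labels "L" for speaker 28-1 and "R" for speaker 28-2, and the function name `hrtf_paths` are assumptions for illustration.

```python
import math

def hrtf_paths(count=8, cross_paths=False):
    """List the (virtual speaker index, ear) pairs that need an HRTF.

    Assumed layout: speakers equally spaced on a circle, speaker 0 straight
    ahead, indices increasing clockwise viewed from above (+X is right).
    "L" stands for speaker 28-1 and "R" for speaker 28-2.
    """
    paths = []
    for i in range(count):
        x = math.sin(2 * math.pi * i / count)  # lateral offset of speaker i
        if abs(x) < 1e-9 or cross_paths:
            # Front/rear speakers (or every speaker, when cross paths are
            # enabled) are rendered to both ears.
            paths.append((i, "L"))
            paths.append((i, "R"))
        elif x < 0:
            paths.append((i, "L"))  # left-side speaker to the left ear only
        else:
            paths.append((i, "R"))  # right-side speaker to the right ear only
    return paths

print(len(hrtf_paths()))                  # 10 paths, as in the FIG. 10 example
print(len(hrtf_paths(cross_paths=True)))  # 16 paths when cross paths are added
```

The jump from 10 to 16 paths illustrates the trade-off noted above: each extra path adds one more frequency-domain multiplication per processing section.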

なお、音源オブジェクトのストリーミング伝送について、配信装置４０は音声チャンクの音声符号データと音源メタデータを含む全てのデータをＵＤＰ／ＩＰパケットなどに多重化してストリーミング伝送してもよいし、音声ストリームの符号データの実体は伝送せずに、代わりに音声チャンクのロケーション情報と音源メタデータをＵＤＰ／ＩＰパケットなどに多重化してストリーミング伝送してもよい。ロケーション情報としては、ＨＴＴＰなどにより音声チャンクのファイルを取得するためのＵＲＬ情報や、ＩＰマルチキャストなどにより音声チャンクのストリームを追加受信するためのマルチキャストＩＰアドレスおよびポート番号などを指定することができる。この場合には、受信装置１０は、オブジェクト選択部２５により音響処理対象として選択された音源オブジェクトのみをロケーション情報により指定される音声チャンクのファイルを取得することで、負荷削減が可能となる。また、音源メタデータにロケーション情報を含めて、配信装置４０は音源メタデータのみをストリーミング伝送するようにしてもよい。 Regarding streaming transmission of the sound source objects, the distribution device 40 may multiplex all data, including the encoded audio data of the audio chunks and the sound source metadata, into UDP/IP packets or the like and stream it. Alternatively, instead of transmitting the encoded audio data itself, it may multiplex location information for the audio chunks together with the sound source metadata into UDP/IP packets or the like and stream that. As the location information, URL information for acquiring audio chunk files via HTTP or the like, or a multicast IP address and port number for additionally receiving an audio chunk stream via IP multicast or the like, can be specified. In this case, the receiving device 10 can reduce its load by acquiring the audio chunk files specified by the location information only for the sound source objects selected by the object selection unit 25 as targets of acoustic processing. The location information may also be included in the sound source metadata, so that the distribution device 40 streams only the sound source metadata.

（第２の実施形態） (Second embodiment)
次に、第２の実施形態として、自由度が３ＤｏＦのＶＲコンテンツ（３６０度ＶＲ映像コンテンツ）を伝送する３６０度ＶＲコンテンツ伝送システムについて説明する。 Next, as a second embodiment, a 360-degree VR content transmission system that transmits VR content with 3DoF (three degrees of freedom), that is, 360-degree VR video content, will be described.

図１１は、第２の実施形態に係る３６０度ＶＲコンテンツ伝送システム２の構成例を示す図である。３６０度ＶＲコンテンツ伝送システム２は、受信装置（ＶＲ受信装置）１０Ａと、配信装置４０Ａと、を備える。時刻サーバ５０は、受信装置１０Ａと配信装置４０Ａとを同期させるために設けられる。以下、第１の実施形態に係るＡＲコンテンツ伝送システム１と同一の構成については適宜説明を省略し、相違する部分について説明する。 Figure 11 is a diagram showing an example of the configuration of a 360-degree VR content transmission system 2 according to the second embodiment. The 360-degree VR content transmission system 2 includes a receiving device (VR receiving device) 10A and a distribution device 40A. A time server 50 is provided to synchronize the receiving device 10A and the distribution device 40A. Below, descriptions of the same configuration as the AR content transmission system 1 according to the first embodiment will be omitted as appropriate, and differences will be described.

配信装置４０Ａは、放送やインターネットなどの伝送路６０を経由して、ＶＲコンテンツをストリーミング伝送する。配信装置４０Ａは、ＶＲコンテンツの３次元空間内に複数配置される音源オブジェクトの音声ストリームと、音源オブジェクトの位置情報（３次元座標）を含む音源メタデータとを、関連付けて、受信装置１０Ａに送信する。配信装置４０Ａは、クロック生成部４１と、多重化部４２と、を備える。 The distribution device 40A transmits VR content by streaming via a transmission path 60 such as broadcasting or the Internet. The distribution device 40A transmits to the receiving device 10A an audio stream of multiple sound source objects placed in the three-dimensional space of the VR content, associated with sound source metadata including position information (three-dimensional coordinates) of the sound source objects. The distribution device 40A includes a clock generation unit 41 and a multiplexing unit 42.

配信装置４０Ａは、第１の実施形態の配信装置４０と比較して、３次元モデルシーケンスではなくＶＲ映像シーケンスを多重化して伝送する点が相違する。すなわち、多重化部４２は、ＶＲ映像シーケンス、音声ストリーム（音声チャンクのシーケンス）、及び音源メタシーケンス（音源メタデータのシーケンス）を多重化してＶＲコンテンツを生成し、配信装置４０の外部に送信する。 The distribution device 40A differs from the distribution device 40 of the first embodiment in that it multiplexes and transmits a VR video sequence instead of a three-dimensional model sequence. That is, the multiplexing unit 42 multiplexes the VR video sequence, the audio stream (sequence of audio chunks), and the sound source metasequence (sequence of sound source metadata) to generate VR content, and transmits it to the outside of the distribution device 40.

受信装置１０Ａは、スマートフォン、タブレット型端末などのモバイル端末、ＶＲゴーグル、ＶＲヘッドマウントディスプレイなどの端末である。図１１に示す例では、受信装置１０Ａは、クロック生成部１１と、多重分離部１２と、第１バッファ１３と、第２バッファ１４と、検出部１８と、位置姿勢推定部１９Ａと、ディスプレイ２２と、座標変換部２３Ａと、処理負荷測定部２４と、オブジェクト選択部２５と、音声復号部２６と、３次元音響レンダリング部２７と、スピーカ２８と、映像復号部２９と、映像切出部３０と、を備える。受信装置１０Ａは、第１の実施形態の受信装置１０と比較して、モデル復号部１５、カメラ１６、フレームメモリ１７、モデルレンダリング部２０、及び映像合成部２１を有しておらず、映像復号部２９及び映像切出部３０を有している点が相違する。また、位置姿勢推定部１９Ａ及び座標変換部２３Ａの処理が、位置姿勢推定部１９及び座標変換部２３の処理と相違する。 The receiving device 10A is a terminal such as a mobile terminal (a smartphone, a tablet terminal, or the like), VR goggles, or a VR head-mounted display. In the example shown in FIG. 11, the receiving device 10A includes a clock generation unit 11, a demultiplexer 12, a first buffer 13, a second buffer 14, a detection unit 18, a position and orientation estimation unit 19A, a display 22, a coordinate conversion unit 23A, a processing load measurement unit 24, an object selection unit 25, an audio decoding unit 26, a three-dimensional sound rendering unit 27, a speaker 28, a video decoding unit 29, and a video cropping unit 30. Compared with the receiving device 10 of the first embodiment, the receiving device 10A differs in that it does not have the model decoding unit 15, the camera 16, the frame memory 17, the model rendering unit 20, or the video synthesis unit 21, and instead has the video decoding unit 29 and the video cropping unit 30. In addition, the processing of the position and orientation estimation unit 19A and the coordinate conversion unit 23A differs from that of the position and orientation estimation unit 19 and the coordinate conversion unit 23.

The demultiplexing unit 12 receives, from the distribution device 40A via a transmission path 60 such as broadcast or the Internet, VR content in which a VR video sequence, an audio stream, and a sound source metasequence are multiplexed, and separates them. It then outputs the VR video sequence to the first buffer 13, and outputs the sound source metasequence and the audio stream to the second buffer 14.

The video decoding unit 29 decodes the VR video sequence obtained from the first buffer 13 using an existing method such as H.265/HEVC, and outputs the decoded video to the video cropping unit 30.

The position and orientation estimation unit 19A estimates the user's line-of-sight direction using the sensor information input from the detection unit 18. For example, the position and orientation estimation unit 19A estimates the user's line-of-sight direction from gyro sensor information. Because VR content with three degrees of freedom (3DoF, fixed viewpoint position) is assumed here, there is no need to estimate the viewpoint position from an acceleration sensor or camera video, as the position and orientation estimation unit 19 of the first embodiment does. The position and orientation estimation unit 19A outputs viewpoint information indicating the user's line-of-sight direction to the coordinate conversion unit 23A and the video cropping unit 30. Note that the position and orientation estimation unit 19A may also output the user's viewpoint position information, as a function equivalent to that of the position and orientation estimation unit 19, but the viewpoint position information is not used for 3DoF content.

The coordinate conversion unit 23A converts the three-dimensional coordinate vector P of the sound source object into a coordinate vector P' in the view coordinate system by the matrix operation of equation (5), using the rotation matrix A shown in equation (2). Because VR content with 3DoF is assumed here, the sound source position information in the sound source metadata can be authored so that the origins of the world coordinate system and the view coordinate system coincide (that is, the user's viewpoint position is fixed at the origin of the world coordinate system). In this case, the conversion from the world coordinate system to the user-centered view coordinate system consists only of rotation and involves no translation.
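
The rotation-only conversion described above can be illustrated with a minimal Python sketch. It does not reproduce equations (2) and (5), which are defined elsewhere in this specification; the yaw/pitch/roll convention, the choice of axes, and the function names `rotation_matrix` and `world_to_view` are assumptions made purely for illustration.

```python
import math

def rotation_matrix(yaw, pitch, roll):
    """Assumed head orientation R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians.

    R maps view-frame vectors to world-frame vectors.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def world_to_view(p_world, head_rotation):
    """Rotate a world-frame point into the view frame: P' = R^T P.

    No translation is applied because the viewpoint is fixed at the
    world origin (3DoF). R^T is the inverse of a rotation matrix R.
    """
    return [
        sum(head_rotation[k][i] * p_world[k] for k in range(3))
        for i in range(3)
    ]

# A source on the world y axis, seen by a user whose head is yawed 90 degrees,
# ends up on the view x axis (up to floating-point rounding).
p_view = world_to_view([0.0, 1.0, 0.0], rotation_matrix(math.pi / 2, 0.0, 0.0))
```

Because only a rotation is applied, the distance of the sound source from the origin is unchanged by the conversion.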

The video cropping unit 30 crops, from the VR video decoded by the video decoding unit 29, the video of the viewport corresponding to the user's line-of-sight direction estimated by the position and orientation estimation unit 19A to generate a cropped video, and displays the cropped video on the display 22.

As described above, the receiving devices 10 and 10A define a sound source object selection area in the view coordinate system based on the processing load, and select the sound source objects located within the sound source object selection area as the sound source objects to be subjected to acoustic processing. This configuration reduces the amount of computation and the circuit scale of the audio decoding unit 26 and the three-dimensional sound rendering unit 27, ensures real-time content playback, and improves viewing quality. It also makes it possible to use terminals of varying performance, such as different CPU clock speeds and memory capacities, as receiving devices, and to realize a service in which content is played back according to the processing performance of each terminal.

The receiving devices 10 and 10A may also convert the position information of the sound source object, which is transmitted as three-dimensional coordinates in the world coordinate system, into three-dimensional coordinates in a view coordinate system whose origin is the user's viewpoint position. With this configuration, the user's viewpoint position can be treated as fixed at the center of the coordinate system, which simplifies subsequent calculations.

Furthermore, the receiving devices 10 and 10A may, as a primary process, assign the audio chunks of the sound source objects to be subjected to acoustic processing to virtual multi-channel speakers arranged in the view coordinate system, and then, as a secondary process, generate binaural audio using the audio chunks assigned to the virtual multi-channel speakers. With this configuration, the primary process can use low-load operations such as simple distance-based sound pressure attenuation, without the high-load frequency-domain operations required for binaural audio generation. In addition, because the primary process fixes the number of virtual channels and the sound source positions, the secondary process can generate binaural audio using a finite number of head-related transfer functions. The amount of computation and the circuit scale of the three-dimensional sound rendering unit 27 can therefore be further reduced.
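
As a rough illustration of this two-stage structure, the following Python sketch assigns each sound source to a virtual speaker and then mixes the fixed set of speaker channels down to two ears. Everything specific in it is an assumption: the four-speaker layout, nearest-speaker assignment, 1/distance attenuation, the axis convention (positive y is to the user's left), and above all the fixed left/right gains, which merely stand in for head-related transfer functions. An actual downmix would filter each channel with a measured transfer function rather than scale it.

```python
import math

# Hypothetical virtual speaker layout in the view coordinate system:
# name -> (position, (left_gain, right_gain)).
# The gain pairs are placeholders for per-speaker head-related transfer functions.
VIRTUAL_SPEAKERS = {
    "front_left": ((1.0, 1.0, 0.0), (0.8, 0.3)),
    "front_right": ((1.0, -1.0, 0.0), (0.3, 0.8)),
    "rear_left": ((-1.0, 1.0, 0.0), (0.7, 0.2)),
    "rear_right": ((-1.0, -1.0, 0.0), (0.2, 0.7)),
}
ORIGIN = (0.0, 0.0, 0.0)

def map_to_speakers(sources, chunk_len):
    """Primary process: add each chunk, attenuated by distance, to the nearest speaker.

    sources is a list of (view_position, samples) pairs.
    """
    channels = {name: [0.0] * chunk_len for name in VIRTUAL_SPEAKERS}
    for position, samples in sources:
        # Clamp so that sources closer than 1 unit are not amplified.
        distance = max(math.dist(position, ORIGIN), 1.0)
        nearest = min(
            VIRTUAL_SPEAKERS,
            key=lambda name: math.dist(position, VIRTUAL_SPEAKERS[name][0]),
        )
        for i, sample in enumerate(samples):
            channels[nearest][i] += sample / distance
    return channels

def downmix_to_binaural(channels, chunk_len):
    """Secondary process: mix the fixed speaker channels down to left and right."""
    left, right = [0.0] * chunk_len, [0.0] * chunk_len
    for name, samples in channels.items():
        left_gain, right_gain = VIRTUAL_SPEAKERS[name][1]
        for i, sample in enumerate(samples):
            left[i] += left_gain * sample
            right[i] += right_gain * sample
    return left, right

# One source ahead and to the left of the user, with a two-sample chunk.
channels = map_to_speakers([((2.0, 2.0, 0.0), [1.0, 0.5])], chunk_len=2)
left, right = downmix_to_binaural(channels, chunk_len=2)
```

The point of the split is visible in the loop structure: the cost of the secondary process depends only on the number of virtual speakers, not on the number of sound source objects.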

Furthermore, the receiving devices 10 and 10A may enlarge or reduce the sound source object selection area according to the current processing load obtained by load measurement. With this configuration, the processing load on the audio decoding unit 26 and the three-dimensional sound rendering unit 27 can be optimized.
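
A minimal Python sketch of such load-dependent selection follows. It assumes a spherical selection area centered on the origin of the view coordinate system and a linear mapping from a normalized load value (0.0 to 1.0) to the radius; the shape of the area, the mapping, the radius limits, and the function names are illustrative assumptions rather than details taken from this specification.

```python
import math

ORIGIN = (0.0, 0.0, 0.0)

def selection_radius(load, r_min=1.0, r_max=20.0):
    """Assumed mapping: the radius shrinks linearly as the load (0.0 to 1.0) grows."""
    load = min(max(load, 0.0), 1.0)
    return r_max - (r_max - r_min) * load

def select_sources(view_positions, load):
    """Return the indices of sources whose view-space position lies inside the sphere."""
    radius = selection_radius(load)
    return [
        index
        for index, position in enumerate(view_positions)
        if math.dist(position, ORIGIN) <= radius
    ]

positions = [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
idle = select_sources(positions, load=0.0)   # all three sources are processed
busy = select_sources(positions, load=0.5)   # the most distant source is dropped
```

Re-evaluating `select_sources` each time a new load measurement arrives makes the number of decoded and rendered objects follow the terminal's available capacity.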

The receiving devices 10 and 10A may also select, as a sound source object to be subjected to acoustic processing, a sound source object that would fall within the sound source object selection area if its distance from the origin of the view coordinate system were changed so as to become shorter as its priority and/or maximum sound pressure level becomes larger. With this configuration, the viewing quality of the content can be further improved.
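
One way to picture this adjustment is the Python sketch below, which divides the actual distance by factors that grow with priority and with maximum sound pressure level before comparing it with the radius of a spherical selection area. The scaling formula, the 60 dB reference level, the spherical area, and the function names are assumptions chosen for illustration; the specification only requires that the adjusted distance become shorter as the priority or level becomes larger.

```python
import math

ORIGIN = (0.0, 0.0, 0.0)

def effective_distance(position, priority=0, max_level_db=0.0, level_ref_db=60.0):
    """Assumed adjustment: shorten the distance as priority or peak level increases."""
    distance = math.dist(position, ORIGIN)
    priority_factor = 1.0 + max(priority, 0)
    level_factor = 1.0 + max(max_level_db - level_ref_db, 0.0) / level_ref_db
    return distance / (priority_factor * level_factor)

def select_with_priority(sources, radius):
    """Select names of sources whose adjusted distance is within the radius.

    sources is a list of (name, view_position, priority, max_level_db) tuples.
    """
    return [
        name
        for name, position, priority, max_level_db in sources
        if effective_distance(position, priority, max_level_db) <= radius
    ]

sources = [
    ("narration", (12.0, 0.0, 0.0), 3, 60.0),   # high priority
    ("ambience", (12.0, 0.0, 0.0), 0, 60.0),    # neither prioritized nor loud
    ("explosion", (12.0, 0.0, 0.0), 0, 120.0),  # loud
]
selected = select_with_priority(sources, radius=10.0)
```

All three sources sit at the same physical distance outside the area, yet the high-priority and loud ones are kept while the ordinary one is dropped, which is the behavior the paragraph above describes.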

<Program>
A computer capable of executing program instructions can also be used to function as the receiving devices 10 and 10A described above. This can be realized by storing, in a storage unit of the computer, a program describing the processing that implements the functions of the receiving devices 10 and 10A, and having a processor of the computer read and execute the program. At least part of this processing may instead be implemented in hardware. Here, the program instructions may be program code, code segments, or the like for performing the necessary tasks. The processor may be a CPU, a GPU, a DSP, an ASIC (Application Specific Integrated Circuit), or the like.

The program may also be recorded on a computer-readable recording medium. Using such a recording medium, the program can be installed on a computer. The recording medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited and may be, for example, a CD-ROM or a DVD-ROM. The program can also be provided by download over a network.

Although the above embodiments have been described as representative examples, it will be apparent to those skilled in the art that many modifications and substitutions can be made within the spirit and scope of the present invention. Accordingly, the present invention should not be construed as being limited by the above embodiments, and various modifications and changes are possible without departing from the scope of the claims. For example, a plurality of the constituent blocks or processing steps described in the embodiments may be combined into one, or one may be divided into a plurality.

1 AR content transmission system
2 360-degree VR video content transmission system
10, 10A Receiving device
11 Clock generation unit
12 Demultiplexing unit
13 First buffer
14 Second buffer
15 Model decoding unit
16 Camera
17 Frame memory
18 Detection unit
19, 19A Position and orientation estimation unit
20 Model rendering unit
21 Video composition unit
22 Display
23, 23A Coordinate conversion unit
24 Processing load measurement unit
25 Object selection unit
26 Audio decoding unit
27 Three-dimensional sound rendering unit
28 Speaker
29 Video decoding unit
30 Video cropping unit
40, 40A Distribution device
41 Clock generation unit
42 Multiplexing unit
50 Time server
60 Transmission path

Claims

A receiving device that receives audio chunks obtained by dividing an audio stream of a sound source object into blocks, and sound source metadata including three-dimensional coordinates of the sound source object in a world coordinate system, the receiving device comprising:
a coordinate conversion unit that converts the three-dimensional coordinates of the sound source object from the world coordinate system to a view coordinate system;
an object selection unit that defines a sound source object selection area in the view coordinate system based on a processing load of the receiving device, and selects a sound source object located within the sound source object selection area as a sound source object to be subjected to acoustic processing; and
a three-dimensional sound rendering unit that generates binaural audio using the audio chunks of the sound source object to be subjected to acoustic processing,
wherein the object selection unit defines the sound source object selection area so that the area becomes smaller as the processing load becomes larger.

A receiving device that receives audio chunks obtained by dividing an audio stream of a sound source object into blocks, and sound source metadata including three-dimensional coordinates of the sound source object in a world coordinate system, the receiving device comprising:
a coordinate conversion unit that converts the three-dimensional coordinates of the sound source object from the world coordinate system to a view coordinate system;
an object selection unit that defines a sound source object selection area in the view coordinate system based on a processing load of the receiving device, and selects a sound source object located within the sound source object selection area as a sound source object to be subjected to acoustic processing; and
a three-dimensional sound rendering unit that generates binaural audio using the audio chunks of the sound source object to be subjected to acoustic processing,
wherein the sound source metadata includes a priority of the sound source object, and
the object selection unit selects, as the sound source object to be subjected to acoustic processing, a sound source object that would be included in the sound source object selection area if the distance of the sound source object from the origin of the view coordinate system were changed so as to become shorter as the priority becomes larger.

A receiving device that receives audio chunks obtained by dividing an audio stream of a sound source object into blocks, and sound source metadata including three-dimensional coordinates of the sound source object in a world coordinate system, the receiving device comprising:
a coordinate conversion unit that converts the three-dimensional coordinates of the sound source object from the world coordinate system to a view coordinate system;
an object selection unit that defines a sound source object selection area in the view coordinate system based on a processing load of the receiving device, and selects a sound source object located within the sound source object selection area as a sound source object to be subjected to acoustic processing; and
a three-dimensional sound rendering unit that generates binaural audio using the audio chunks of the sound source object to be subjected to acoustic processing,
wherein the sound source metadata includes a maximum sound pressure level of the sound source object, and
the object selection unit selects, as the sound source object to be subjected to acoustic processing, a sound source object that would be included in the sound source object selection area if the distance of the sound source object from the origin of the view coordinate system were changed so as to become shorter as the maximum sound pressure level becomes larger.

The receiving device further comprising a position and orientation estimation unit that estimates a line-of-sight direction of a user,
wherein the coordinate conversion unit converts the three-dimensional coordinates of the sound source object from the world coordinate system to the view coordinate system whose axial direction is the line-of-sight direction.

The receiving device according to claim 4, wherein the position and orientation estimation unit estimates a viewpoint position of the user, and
the coordinate conversion unit converts the three-dimensional coordinates of the sound source object from the world coordinate system to the view coordinate system having the viewpoint position as its origin.

The receiving device according to any one of claims 1 to 5, wherein the three-dimensional sound rendering unit comprises:
a mapping unit that assigns the audio chunks of the sound source object to be subjected to acoustic processing to virtual multi-channel speakers arranged in the view coordinate system; and
a downmix unit that generates the binaural audio using the audio chunks assigned to the virtual multi-channel speakers.

A content transmission system comprising:
the receiving device according to any one of claims 1 to 6; and
a distribution device that associates the audio chunks with the sound source metadata and transmits them to the receiving device.

A program for causing a computer to function as the receiving device according to claim 1.