JP7173249B2

JP7173249B2 - CLIENT DEVICE, DISPLAY SYSTEM, CLIENT DEVICE PROCESSING METHOD AND PROGRAM

Info

Publication number: JP7173249B2
Application number: JP2021150008A
Authority: JP
Inventors: 郁夫塚越
Original assignee: Sony Corp; Sony Group Corp
Current assignee: Sony Corp; Sony Group Corp
Priority date: 2017-05-09
Filing date: 2021-09-15
Publication date: 2022-11-16
Anticipated expiration: 2037-05-09
Also published as: JP2022008400A

Description

The present technology relates to a client device, a display system, a processing method of a client device, and a program.

A head mounted display (HMD) is known that is worn on a user's head and can present images to the individual user through a display or the like arranged in front of the eyes (see, for example, Patent Document 1). In recent years, individuals have begun to enjoy omnidirectional images created for virtual reality (VR) on HMDs. When several viewers are each enjoying content in a personal space, it is expected that they will be able to communicate by sharing one another's VR spaces rather than only their own.

Patent Document 1: JP 2016-025633 A

An object of the present technology is to enable a plurality of clients (viewers) to share one another's VR spaces and communicate with each other.

The concept of the present technology resides in a client device including:
a receiving unit that receives, from a server, a server-delivered stream including a video stream obtained by encoding image data of a background image, and receives, from another client device, a client transmission stream including substitute image meta information for displaying a substitute image of the other client; and
a control unit that controls a decoding process of decoding the video stream to obtain the image data of the background image, a substitute image data generation process of generating image data of the substitute image on the basis of the substitute image meta information, and an image data synthesis process of synthesizing the image data of the substitute image with the image data of the background image.

In the present technology, the receiving unit receives, from a server, a server-delivered stream including a video stream obtained by encoding image data of a background image, and also receives, from another client device, a client transmission stream including substitute image meta information for displaying a substitute image of the other client. The substitute image is, for example, an avatar or a symbol by which a character can be recognized.

The control unit controls the decoding process, the substitute image data generation process, and the image data synthesis process. In the decoding process, the video stream is decoded to obtain the image data of the background image. In the substitute image data generation process, image data of the substitute image is generated on the basis of the substitute image meta information. In the image data synthesis process, the image data of the substitute image is synthesized with the image data of the background image.

For example, information indicating an allowable synthesis range for the substitute image in the background image may be inserted in the layer of the video stream and/or the layer of the server-delivered stream, and the control unit may control the synthesis process on the basis of that information so that the substitute image is placed within the allowable synthesis range of the background image.

In this case, the substitute image meta information may include synthesis position information indicating a synthesis position of the substitute image within the allowable synthesis range, and the control unit may control the synthesis process so that the substitute image is synthesized at the synthesis position indicated by the synthesis position information. Also in this case, for example, the substitute image meta information may include size information indicating the size of the substitute image, and the control unit may control the synthesis process so that the substitute image is synthesized with the background image at the size indicated by the size information.

As described above, in the present technology, image data of a substitute image is generated on the basis of the substitute image meta information, and this image data is synthesized with the image data of the background image. Each client can therefore see the substitute images of the other clients synthesized on a common background image, so that the clients can share one another's VR spaces and communicate well.

Note that in the present technology, for example, the client transmission stream may include audio data corresponding to the substitute image meta information together with object metadata, and the control unit may further control an audio output process of rendering the audio data in accordance with the object metadata to obtain audio output data whose sound image position is the synthesis position of the substitute image. This allows each client to perceive the voice of the client represented by each substitute image as coming from the synthesis position of that substitute image on the background image.

Further, in the present technology, for example, the client transmission stream may include caption data corresponding to the substitute image meta information together with display position information, and the control unit may further control a caption synthesis process of synthesizing caption display data with the image data of the background image on the basis of the display position information so that the caption given by the caption data is displayed at a position corresponding to the synthesis position of the substitute image. This allows each client to see the caption from the client represented by each substitute image at a position corresponding to the synthesis position of that substitute image on the background image.

Further, in the present technology, for example, the client device may further include a transmitting unit that transmits, to another client device, a client transmission stream including substitute image meta information for displaying its own substitute image, and in the substitute image data generation process, image data of its own substitute image may further be generated on the basis of that substitute image meta information. This makes it possible to synthesize not only the substitute images of other clients but also the client's own substitute image with the background image.

Further, in the present technology, for example, the image data of the background image may be image data of a wide viewing angle image, and the control unit may further control an image cutout process of cutting out a part of the image data of the background image to obtain display image data. For example, the image given by the display image data is displayed on an HMD, and the cutout range is determined, for example, in accordance with the head posture detected by a sensor mounted on the HMD.

Another concept of the present technology resides in a server including:
an imaging unit that captures an image of a subject to obtain image data of a background image; and
a transmitting unit that transmits, to a client device, a server-delivered stream including a video stream obtained by encoding the image data of the background image,
wherein information indicating an allowable synthesis range for a substitute image in the background image is inserted in a layer of the video stream and/or a layer of the server-delivered stream.

In the present technology, the imaging unit captures an image of a subject to obtain image data of a background image. For example, the image data of the background image may be image data of a wide viewing angle image. The transmitting unit transmits, to a client device, a server-delivered stream including a video stream obtained by encoding the image data of the background image. Here, information indicating an allowable synthesis range for a substitute image in the background image is inserted in the layer of the video stream and/or the layer of the server-delivered stream.

As described above, in the present technology, information indicating the allowable synthesis range for a substitute image in the background image is inserted in the layer of the video stream and/or the layer of the server-delivered stream and delivered. On the basis of this information, a client device can therefore easily place the substitute image of each client on the background image within the range intended by the server.

According to the present technology, a plurality of clients can share one another's VR spaces and communicate with each other. Note that the effects are not necessarily limited to those described here and may be any of the effects described in the present disclosure.

FIG. 1 is a block diagram showing a configuration example of a space sharing display system as an embodiment.
FIG. 2 is a diagram showing an example of the stream transmission/reception relationship between a server and a plurality of client devices.
FIG. 3 is a block diagram showing a configuration example of the server.
FIG. 4 is a diagram showing a structural example of the video attribute information SEI message.
FIG. 5 is a diagram showing the content of the main information in the structural example of the video attribute information SEI message.
FIG. 6 is a diagram for explaining camera state information.
FIG. 7 is a diagram showing an example of the information stored in the video attribute information box.
FIG. 8 is a block diagram showing a configuration example of the transmission system of a client device.
FIG. 9 is a diagram showing a structural example of avatar rendering control information and the content of the main information in that structural example.
FIG. 10 is a diagram showing a structural example of avatar database selection information and the content of the main information in that structural example.
FIG. 11 is a diagram showing a structural example of audio object rendering information as the object metadata of each object and the content of the main information in that structural example.
FIG. 12 is a diagram for explaining how to obtain the values of "Azimuth", "Radius", and "Elevation".
FIG. 13 is a diagram for explaining a TTML structure and a structural example of metadata.
FIG. 14 is a block diagram showing a configuration example of the reception system of a client device.
FIG. 15 is a block diagram showing a configuration example of a receiving module.
FIG. 16 is a block diagram showing a configuration example of an avatar database selection unit.
FIG. 17 is a diagram showing an example of an avatar database list.
FIG. 18 is a diagram showing an outline of the rendering process in a renderer.
FIG. 19 is a diagram schematically showing sound pressure control by remapping in the renderer.
FIG. 20 is a diagram showing an example of a background image.
FIG. 21 is a diagram showing an example of a state in which avatars and captions are synthesized within the allowable synthesis range (sy_window) of the background image.

Hereinafter, modes for carrying out the invention (hereinafter referred to as "embodiments") will be described. The description will be given in the following order.
1. Embodiment
2. Modification

<1. Embodiment>
[Space sharing display system]
FIG. 1 shows a configuration example of a space sharing display system 10 as an embodiment. In the space sharing display system 10, a server 100 and a plurality of client devices 200 are connected via a network 300 such as the Internet.

The server 100 transmits, to each client device 200 via the network 300, a server-delivered stream including a video stream obtained by encoding image data of a background image obtained by imaging a subject. For example, the image data of the background image is image data of a wide viewing angle image. Information indicating an allowable synthesis range for a substitute image in the background image is inserted in the layer of the video stream and/or the layer of the server-delivered stream (container). The substitute image is, for example, an avatar or a symbol by which a character can be recognized. In this embodiment an avatar is assumed as the substitute image, and the substitute image is described below as an avatar.

The client device 200 receives the server-delivered stream sent from the server 100 via the network 300 and decodes the video stream included in the server-delivered stream to obtain the image data of the background image. The client device 200 also receives a client transmission frame (container), sent from another client device 200 via the network 300, that includes avatar meta information for displaying the avatar of the other client.

The client device 200 generates image data of the avatar on the basis of the avatar meta information and synthesizes this avatar image data with the image data of the background image. In doing so, the client device 200 places the avatar within the allowable synthesis range of the background image on the basis of the information, inserted in the layer of the video stream and/or the layer of the server-delivered stream, that indicates the allowable synthesis range for the avatar in the background image.

The avatar meta information includes synthesis position information indicating the synthesis position of the avatar within the allowable synthesis range, and the client device 200 synthesizes the avatar at the synthesis position indicated by this synthesis position information. The avatar meta information also includes size information indicating the size of the avatar, and the client device 200 synthesizes the avatar with the background image at the size indicated by the size information.
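As an illustration of placing an avatar inside the allowable synthesis range, the following Python sketch computes the avatar rectangle in background-image pixel coordinates. It is only a sketch under stated assumptions: the synthesis position is taken to be a pixel offset from the top-left corner of the range, the size is a width and height in pixels, and an out-of-range position is shifted back into the range. The names `SyWindow` and `place_avatar` are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyWindow:
    """Allowable synthesis range in background-image pixel coordinates."""
    x_start: int
    y_start: int
    x_end: int
    y_end: int


def place_avatar(window, rel_x, rel_y, width, height):
    """Return (left, top, right, bottom) of the avatar in the background image.

    rel_x / rel_y: assumed offset of the avatar from the window's top-left corner.
    The rectangle is shifted if necessary so that it stays inside the window;
    an avatar larger than the window is rejected.
    """
    win_w = window.x_end - window.x_start
    win_h = window.y_end - window.y_start
    if width > win_w or height > win_h:
        raise ValueError("avatar does not fit in the allowable synthesis range")
    left = window.x_start + min(max(rel_x, 0), win_w - width)
    top = window.y_start + min(max(rel_y, 0), win_h - height)
    return (left, top, left + width, top + height)
```

A receiver could call `place_avatar` once per received avatar and then paste the rendered avatar pixels into the returned rectangle.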

The client device 200 also transmits, to the other client devices 200 via the network 300, a client transmission stream (container) including avatar meta information for displaying its own avatar. In this case, the client device 200 generates image data of its own avatar on the basis of the avatar meta information for displaying its own avatar and synthesizes this avatar image data with the image data of the background image.

Note that some client devices 200 may not have the function of transmitting, to other client devices 200, a client transmission stream (container) including avatar meta information for displaying their own avatars.

When the image data of the background image is image data of a normal viewing angle image, the client device 200 sends the image data of the background image with the avatar image data synthesized, as it is, as display image data to an HMD 400A serving as a display device. When the image data of the background image is image data of a wide viewing angle image, on the other hand, the client device 200 cuts out a part of the image data of the background image with the avatar image data synthesized to obtain display image data, and sends this display image data to the HMD 400A serving as a display device. The cutout range in this case is determined, for example, in accordance with the head posture detected by a sensor mounted on the HMD.
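The cutout step can be pictured with a small Python sketch that turns a head posture into the top-left corner of the cutout range. The source does not specify the projection or the mapping, so everything here is an assumption for illustration: the background is treated as an equirectangular image spanning 360 degrees horizontally and 180 degrees vertically, yaw 0 points at the image center, and the function name `cutout_origin` is hypothetical.

```python
def cutout_origin(img_w, img_h, yaw_deg, pitch_deg, view_w, view_h):
    """Return (left, top) of the cutout range in background-image pixels.

    Assumes an equirectangular background: yaw 0 is the image center and grows
    to the right; pitch 0 is the horizon and grows upward.
    The horizontal position wraps around; the vertical position is clamped.
    """
    center_x = (img_w / 2 + yaw_deg / 360.0 * img_w) % img_w
    center_y = img_h / 2 - pitch_deg / 180.0 * img_h
    left = int(round(center_x - view_w / 2)) % img_w
    top = int(round(center_y - view_h / 2))
    top = min(max(top, 0), img_h - view_h)
    return left, top
```

Because `left` wraps modulo the image width, a cutout that crosses the right edge continues from column 0; the caller has to read the columns modulo `img_w` accordingly.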

The client transmission stream sent from another client device 200 also includes audio data corresponding to the avatar meta information together with object metadata. Likewise, for the avatar meta information for displaying the client's own avatar, corresponding audio data exists together with object metadata. The client device 200 renders the audio data in accordance with the object metadata to obtain audio output data whose sound image position is the synthesis position of the avatar, and sends the audio output data to headphones (HP) 400B serving as an audio output device.
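To give a feel for tying the sound image position to the avatar's synthesis position, the following Python sketch derives left and right headphone gains from the avatar's horizontal position in the displayed view. This is a deliberately simplified stand-in, not the object-based rendering described in the source: it uses plain constant-power stereo panning, ignores distance and elevation, and the function name `stereo_gains` and its parameters are hypothetical.

```python
import math


def stereo_gains(avatar_x, view_left, view_width):
    """Return (left_gain, right_gain) for an avatar at horizontal pixel avatar_x.

    Constant-power panning: the pan value is 0.0 at the left edge of the view
    and 1.0 at the right edge, and left_gain**2 + right_gain**2 == 1.
    """
    pan = min(max((avatar_x - view_left) / view_width, 0.0), 1.0)
    angle = pan * math.pi / 2
    return math.cos(angle), math.sin(angle)
```

An avatar in the middle of the view gets equal gains on both channels, while an avatar at the left edge is heard only in the left channel.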

The client transmission stream sent from another client device 200 also includes caption data corresponding to the avatar meta information together with display position information. Likewise, for the avatar meta information for displaying the client's own avatar, corresponding caption data exists together with display position information. On the basis of the display position information, the client device 200 synthesizes the caption display data with the image data of the background image so that the caption given by the caption data is displayed at a position corresponding to the synthesis position of the avatar.

FIG. 2 shows an example of the stream transmission/reception relationship between the server 100 and the plurality of client devices 200. In the illustrated example, there are three client devices 200, and every client device 200 sends client transmission frames to the other client devices 200. A client transmission frame includes avatar meta information, audio data, and text data (caption data).
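The relationship in FIG. 2 can be sketched as a small data structure plus a fan-out step. The Python below is illustrative only: the class `ClientTransmissionFrame`, its field names, and the function `fan_out` are hypothetical, and the actual container format of a client transmission frame is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClientTransmissionFrame:
    sender_id: str             # hypothetical identifier of the sending client
    avatar_meta: dict          # avatar meta information (e.g. position, size)
    audio_data: bytes = b""    # encoded voice of the sender
    caption_text: str = ""     # text (caption) data of the sender


def fan_out(frame: ClientTransmissionFrame,
            client_ids: List[str]) -> Dict[str, ClientTransmissionFrame]:
    """Map each receiving client to the frame it gets; the sender is skipped."""
    return {cid: frame for cid in client_ids if cid != frame.sender_id}
```

With three clients that all transmit, each frame is delivered to the two other clients, which gives six deliveries per round.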

[Server configuration]
FIG. 3 shows a configuration example of the server 100. The server 100 includes a control unit 101, a locator 102, a video capture 103, a format conversion processing unit 104, a video encoder 105, an audio capture 106, an audio encoder 108, a container encoder 109, and a network interface 110. These units are connected by a bus 111.

The control unit 101 controls the operation of each unit of the server 100. A user operation unit 101a is connected to the control unit 101. The locator 102 receives radio waves from GPS satellites and obtains position information (longitude, latitude, and altitude). The video capture 103 is a camera (imaging unit) that captures an image of a subject to obtain image data of a background image. The video capture 103 obtains wide viewing angle image data or a plurality of pieces of image data from which wide viewing angle image data is obtained. The format conversion processing unit 104 performs mapping processing (deformation of a wide viewing angle image, synthesis of a plurality of images, and the like) on the image data obtained by the video capture 103 to obtain image data in the image format of the encoder input.

The video encoder 105 encodes the image data obtained by the format conversion processing unit 104 using HEVC or the like to obtain encoded image data, and generates a video stream including this encoded image data. In doing so, the video encoder 105 places a video attribute information SEI message (Video_attribute_information SEI message) in the SEI message group "SEIs" of the access unit (AU).

Inserted in this SEI message are capture information indicating the imaging state of the camera (imaging unit), position information (GPS data) indicating the position of the camera (imaging position), and information indicating the allowable synthesis range for the avatar in the background image.

FIG. 4 shows a structural example (Syntax) of the video attribute information SEI message, and FIG. 5 shows the content (Semantics) of the main information in that structural example. The 8-bit field "message_id" carries identification information identifying the message as a video attribute information SEI message. The 8-bit field "byte_length" indicates the number of subsequent bytes as the size of this video attribute information SEI message.

The 8-bit field "target_content_id" carries identification information of the video content. The field "capture_position()" indicates the imaging position. The 16-bit field "position_latitude" indicates the imaging position (latitude). The 16-bit field "position_longitude" indicates the imaging position (longitude). The 16-bit field "position_elevation" indicates the imaging position (elevation).

The 16-bit field "camera_direction" indicates the direction in which the camera faces at the time of imaging. For example, as shown in FIG. 6(a), it indicates a compass direction such as north, south, east, or west. The 16-bit field "camera_V_angle" indicates the angle of the camera from the horizontal at the time of imaging, as shown in FIG. 6(b).

The field "sy_window()" indicates the allowable synthesis range for the avatar in the background image. The 16-bit field "sy_window_x_start" indicates the start position (horizontal position) of the allowable synthesis range. The 16-bit field "sy_window_y_start" indicates the start position (vertical position) of the allowable synthesis range. The 16-bit field "sy_window_x_end" indicates the end position (horizontal position) of the allowable synthesis range. The 16-bit field "sy_window_y_end" indicates the end position (vertical position) of the allowable synthesis range.
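The fields described above can be read as a fixed-length record. The Python sketch below is for illustration only and rests on assumptions that the text does not confirm: the fields are taken to appear in exactly the order in which they are described, to be unsigned and big-endian, and to have no reserved bits or further fields between them, so the actual syntax in FIG. 4 may differ. The function name `parse_video_attribute_information` is hypothetical.

```python
import struct

# Field names as they appear in the description; the order is an assumption.
_FIELDS = (
    "message_id", "byte_length", "target_content_id",
    "position_latitude", "position_longitude", "position_elevation",
    "camera_direction", "camera_V_angle",
    "sy_window_x_start", "sy_window_y_start",
    "sy_window_x_end", "sy_window_y_end",
)

# Three 8-bit fields followed by nine 16-bit fields, big-endian, no padding.
_LAYOUT = struct.Struct(">3B9H")


def parse_video_attribute_information(payload: bytes) -> dict:
    """Decode the assumed 21-byte layout into a dict of raw integer values."""
    if len(payload) < _LAYOUT.size:
        raise ValueError("payload shorter than the assumed message layout")
    return dict(zip(_FIELDS, _LAYOUT.unpack_from(payload)))
```

The values are returned as raw integers; how latitude, longitude, direction, and angle are scaled or signed is not stated in this part of the description and is left to the caller.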

Returning to FIG. 3, the audio capture 106 is a microphone that collects the sound corresponding to the subject imaged by the video capture 103 and obtains audio data of two or more channels, for example 5.1 channels. The audio encoder 108 encodes the audio data obtained by the audio capture 106 using MPEG-H Audio, AC4, or the like to generate an audio data stream.

The container encoder 109 generates, as the server-delivered stream, a container including the video stream obtained by the video encoder 105 and the audio stream obtained by the audio encoder 108, here an MP4 stream.

In doing so, the container encoder 109 defines a video attribute information box ("vaib" box) in the "udta" box defined in the initialization segment (IS) or the "moof" box.

As with the video attribute information SEI message, this video attribute information box carries capture information indicating the imaging state of the camera (imaging unit), position information (GPS data) indicating the position of the camera (imaging position), and information indicating the allowable synthesis range for the avatar in the background image. Note that it is not always necessary to insert both the video attribute information box and the video attribute information SEI message; only one of them may be inserted.

FIG. 7 shows an example of the information stored in the "vaib" box. "position_latitude" is the imaging position (latitude), "position_longitude" is the imaging position (longitude), and "position_elevation" is the imaging position (elevation). "camera_direction" indicates the direction in which the camera faces at the time of imaging, expressed as a bearing from north. "camera_V_angle" indicates the angle of the camera from the horizontal at the time of imaging. "sy_window_x_start" is the start position (horizontal position) of the allowable synthesis range, "sy_window_y_start" is its start position (vertical position), "sy_window_x_end" is its end position (horizontal position), and "sy_window_y_end" is its end position (vertical position).
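For the container side, the following Python sketch writes and reads a "vaib" box using the general ISO base media file format box header (a 32-bit big-endian size that includes the header, followed by a four-character type). The payload layout is an assumption made for illustration: nine unsigned 16-bit big-endian values in the order listed above, with no version or flags field. The function names `build_vaib_box` and `parse_vaib_box` are hypothetical.

```python
import struct

# Payload field order follows the list in the description; this is an assumption.
VAIB_FIELDS = (
    "position_latitude", "position_longitude", "position_elevation",
    "camera_direction", "camera_V_angle",
    "sy_window_x_start", "sy_window_y_start",
    "sy_window_x_end", "sy_window_y_end",
)


def build_vaib_box(values: dict) -> bytes:
    """Serialize the values as a box: 32-bit size, type 'vaib', then payload."""
    payload = struct.pack(">9H", *(values[name] for name in VAIB_FIELDS))
    return struct.pack(">I4s", 8 + len(payload), b"vaib") + payload


def parse_vaib_box(box: bytes) -> dict:
    """Check the box header and decode the assumed payload layout."""
    size, box_type = struct.unpack_from(">I4s", box)
    if box_type != b"vaib" or size != len(box):
        raise ValueError("not a well-formed 'vaib' box under the assumed layout")
    return dict(zip(VAIB_FIELDS, struct.unpack_from(">9H", box, 8)))
```

A receiver that finds no "vaib" box in the container could fall back to the SEI message, since either carrier alone may be present.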

Returning to FIG. 3, the network interface 110 communicates with the client devices 200 via the network 300. The network interface 110 transmits the server-delivered stream obtained by the container encoder 109 to the client devices 200 via the network 300.

図３に示すサーバ１００の動作を簡単に説明する。ビデオキャプチャ１０３では、被写体が撮像され、広視野角画像データや、広視野角画像データを得るための複数枚の画像データが得られる。ビデオキャプチャ１０３で得られた画像データは、フォーマット変換処理部１０４に供給される。フォーマット変換処理部１０４では、ビデオキャプチャ１０３から供給される画像データに対してマッピング処理（広視野角画像の変形、複数画像の合成など）が施され、エンコーダ入力の画像フォーマットの画像データが得られる。 The operation of the server 100 shown in FIG. 3 will be briefly described. The video capture 103 captures an image of a subject and obtains wide-viewing-angle image data and a plurality of image data for obtaining wide-viewing-angle image data. Image data obtained by the video capture 103 is supplied to the format conversion processing unit 104 . In the format conversion processing unit 104, the image data supplied from the video capture 103 is subjected to mapping processing (deformation of a wide viewing angle image, synthesis of a plurality of images, etc.) to obtain image data in the image format of the encoder input. .

フォーマット変換処理部１０４で得られた画像データは、ビデオエンコーダ１０５に供給される。ビデオエンコーダ１０５では、フォーマット変換処理部１０４からの画像データに対してＨＥＶＣなどの符号化が施されて符号化画像データが得られ、この符号化画像データを含むビデオストリームが生成される。 The image data obtained by the format conversion processing unit 104 is supplied to the video encoder 105 . In the video encoder 105, encoding such as HEVC is performed on the image data from the format conversion processing unit 104 to obtain encoded image data, and a video stream including this encoded image data is generated.

In addition, the video encoder 105 places a video attribute information SEI message (see FIG. 4) in the SEI message group "SEIs" of the access unit (AU). This SEI message carries capture information indicating the imaging state of the camera (imaging unit), position information (GPS data) indicating the position of the camera (imaging position), and information indicating the allowable composition range of the avatar in the background image.

音声キャプチャ１０６では、ビデオキャプチャ１０３で撮像される被写体に対応した音声（音）が集音されて、２チャネルあるいはそれ以上の多チャネルの音声データが得られる。各チャネルの音声データは、オーディオエンコーダ１０８に供給される。音声エンコーダ１０８では、音声キャプチャ１０６で得られた音声データに対して、ＭＰＥＧ-ＨＡｕｄｉｏ，ＡＣ４等の符号化が施され、オーディオデータストリームが生成される。 The audio capture 106 collects the audio (sound) corresponding to the subject imaged by the video capture 103, and obtains multi-channel audio data of two or more channels. Audio data for each channel is supplied to audio encoder 108 . The audio encoder 108 encodes the audio data obtained by the audio capture 106 according to MPEG-H Audio, AC4, etc., to generate an audio data stream.

The video stream obtained by the video encoder 105 and the audio stream obtained by the audio encoder 108 are supplied to the container encoder 109. The container encoder 109 generates a container including the video stream and the audio stream, here an MP4 stream, as the server delivery stream.

In addition, the container encoder 109 newly defines a video attribute information box (see FIG. 7) in the "udta" box defined in the initialization segment (IS) or in the "moof" box. This box carries capture information indicating the imaging state of the camera (imaging unit), position information (GPS data) indicating the position of the camera (imaging position), and information indicating the allowable composition range of the avatar in the background image.

コンテナエンコーダ１０９で得られたサーバ配信ストリームはネットワークインタフェース１１０に供給される。ネットワークインタフェース１１０では、サーバ配信ストリームを、ネットワーク３００を介して、クラインアント装置２００に送信することが行われる。 The server-delivered stream obtained by container encoder 109 is supplied to network interface 110 . The network interface 110 transmits the server-delivered stream to the client device 200 via the network 300 .

[Configuration of client device]
A configuration example of the client device 200 will be described. FIG. 8 shows a configuration example of the transmission system 200T of the client device 200. The transmission system 200T includes a control unit 201, a metadata generator 202, an audio capture 203, an object information generation unit 204, an audio encoder 205, a character generation unit 206, a caption encoder 207, a container encoder 208, and a network interface 209. The units are connected by a bus 210.

制御部２０１は、クライアント装置２００、従って送信系２００Ｔの各部の動作を制御する。この制御部２０１には、ユーザ操作部２０１ａが接続されている。メタデータジェネレータ２０２は、ユーザ操作部２０１ａからのユーザ操作に応じて、アバターメタ情報を発生する。アバターメタ情報は、アバターレンダリング制御情報（avator_rendering_control_information）とアバターデータベース選択情報（avator_database_selection）からなっている。 The control unit 201 controls the operation of each unit of the client device 200, and thus the transmission system 200T. A user operation unit 201 a is connected to the control unit 201 . The metadata generator 202 generates avatar meta information according to user operations from the user operation unit 201a. Avatar meta information consists of avatar rendering control information (avator_rendering_control_information) and avatar database selection information (avator_database_selection).

アバターレンダリング制御情報には、背景画像の許容合成範囲内におけるアバターの合成位置を示す情報と、そのアバターのサイズを示す情報が含まれている。図９（ａ）はアバターレンダリング制御情報の構造例(Syntax)を示し、図９（ｂ）はその構造例における主要な情報の内容（Semantics）を示している。 The avatar rendering control information includes information indicating the composition position of the avatar within the allowable composition range of the background image and information indicating the size of the avatar. FIG. 9(a) shows an example structure (Syntax) of avatar rendering control information, and FIG. 9(b) shows the contents (Semantics) of main information in the example structure.

「message_id」の８ビットフィールドは、アバターレンダリング制御情報であることを識別する識別情報を示す。「byte_length」の８ビットフィールドは、このアバターレンダリング制御情報のサイズとして、以降のバイト数を示す。 An 8-bit field of "message_id" indicates identification information identifying avatar rendering control information. An 8-bit field of "byte_length" indicates the number of subsequent bytes as the size of this avatar rendering control information.

「client_id」の８ビットフィールドは、このアバターメタ情報を送信するクライアント（クライアント装置２００）の識別情報を示す。「target_content_id」の８ビットフィールドは、合成対象のビデオコンテンツ（背景画像）の識別情報を示す。「number_of_client_objects」の８ビットフィールドは、クライアントから送信されるオブジェクト、つまりアバターの数を示す。 An 8-bit field of "client_id" indicates identification information of the client (client device 200) that transmits this avatar meta information. An 8-bit field of "target_content_id" indicates identification information of the video content (background image) to be synthesized. An 8-bit field of "number_of_client_objects" indicates the number of objects, ie avatars, sent from the client.

オブジェクトの数だけ、「client_object_id」、「avator_center_position_x」、「avator_center_position_y」、「avator_rendering_size」の各フィールドが繰り返し存在する。「client_object_id」の８ビットフィールドは、クライアントから送信されるオブジェクト（アバター）の識別情報を示す。 The fields "client_object_id", "avator_center_position_x", "avator_center_position_y", and "avator_rendering_size" are repeated as many times as there are objects. An 8-bit field of "client_object_id" indicates identification information of an object (avatar) transmitted from the client.

「avator_center_position_x」の１６ビットフィールドは、許容合成範囲（sy_window）の中でアバター合成位置の中心座標のｘ座標（水平ポジション）を示す。「avator_center_position_y」の１６ビットフィールドは、許容合成範囲の中でアバター合成位置の中心座標のｙ座標（垂直ポジション）を示す。「avator_rendering_size」の１６ビットフィールドは、合成させるアバターの大きさ（サイズ）を示す。なお、サイズはアバター合成位置の中心座標からの長方形の対角線で求められる。データベースのアバター画像の元のアスペクト比を維持したまま、合成させるアバターのサイズとの比率に応じたサイズ変換を行う。 A 16-bit field of "avator_center_position_x" indicates the x-coordinate (horizontal position) of the center coordinate of the avatar synthesis position within the allowable synthesis range (sy_window). A 16-bit field of "avator_center_position_y" indicates the y-coordinate (vertical position) of the central coordinate of the avatar synthesis position within the allowable synthesis range. A 16-bit field of "avator_rendering_size" indicates the size of the avatar to be synthesized. Note that the size is obtained from the diagonal line of the rectangle from the center coordinates of the avatar synthesis position. While maintaining the original aspect ratio of the avatar image in the database, size conversion is performed according to the ratio with the size of the avatar to be synthesized.
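The field widths given above for the avatar rendering control information can be turned into a small parser. This is a minimal sketch under stated assumptions: the fields are taken to be unsigned and big-endian and packed in the order described, with no reserved bits; the names `AvatarPlacement`, `AvatarRenderingControl`, and `parse_avatar_rendering_control` are hypothetical and do not come from FIG. 9.

```python
import struct
from dataclasses import dataclass
from typing import List

# Assumed layout: five 8-bit header fields, then per object one 8-bit id
# followed by three 16-bit values. Big-endian byte order is an assumption.
_HEADER = struct.Struct(">BBBBB")
_OBJECT = struct.Struct(">BHHH")


@dataclass
class AvatarPlacement:
    client_object_id: int
    avator_center_position_x: int
    avator_center_position_y: int
    avator_rendering_size: int


@dataclass
class AvatarRenderingControl:
    message_id: int
    client_id: int
    target_content_id: int
    objects: List[AvatarPlacement]


def parse_avatar_rendering_control(buf: bytes) -> AvatarRenderingControl:
    """Parse one message laid out as described in the text (layout assumed)."""
    message_id, byte_length, client_id, content_id, count = _HEADER.unpack_from(buf, 0)
    # byte_length counts the bytes that follow the byte_length field itself.
    if len(buf) < 2 + byte_length:
        raise ValueError("message is shorter than byte_length indicates")
    if byte_length < 3 + count * _OBJECT.size:
        raise ValueError("byte_length is too small for the declared object count")
    objects = []
    offset = _HEADER.size
    for _ in range(count):
        objects.append(AvatarPlacement(*_OBJECT.unpack_from(buf, offset)))
        offset += _OBJECT.size
    return AvatarRenderingControl(message_id, client_id, content_id, objects)
```

The identifier spelling "avator" is kept in the field names to match the syntax names used in the text.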

アバターデータベース選択情報には、アバターの画像データをアバターデータベースから得るための選択情報が含まれている。図１０（ａ）はアバターデータベース選択情報の構造例(Syntax)を示し、図１０（ｂ）はその構造例における主要な情報の内容（Semantics）を示している。 The avatar database selection information includes selection information for obtaining avatar image data from the avatar database. FIG. 10(a) shows an example structure (Syntax) of avatar database selection information, and FIG. 10(b) shows the contents (Semantics) of main information in the example structure.

「message_id」の８ビットフィールドは、アバターデータベース選択情報であることを識別する識別情報を示す。「byte_length」の８ビットフィールドは、このアバターデータベース選択情報のサイズとして、以降のバイト数を示す。「client_id」の８ビットフィールドは、このアバターデータベース選択情報を送信するクライアント（クライアント装置２００）の識別情報を示す。「target_content_id」の８ビットフィールドは、合成対象のビデオコンテンツ（背景画像）の識別情報を示す。 An 8-bit field of "message_id" indicates identification information for identifying avatar database selection information. An 8-bit field of "byte_length" indicates the number of subsequent bytes as the size of this avatar database selection information. An 8-bit field of "client_id" indicates identification information of the client (client device 200) that transmits this avatar database selection information. An 8-bit field of "target_content_id" indicates identification information of the video content (background image) to be synthesized.

「number_of_client_objects」の８ビットフィールドは、クライアントから送信されるオブジェクト、つまりアバターの数を示す。オブジェクトの数だけ、「client_object_id」、「body_type」、「body_angle」、「emotional_type」、「face_angle」の各フィールドが繰り返し存在する。「client_object_id」の８ビットフィールドは、クライアントから送信されるオブジェクト（アバター）の識別情報を示す。 An 8-bit field of "number_of_client_objects" indicates the number of objects, ie avatars, sent from the client. The fields "client_object_id", "body_type", "body_angle", "emotional_type", and "face_angle" are repeated as many times as there are objects. An 8-bit field of "client_object_id" indicates identification information of an object (avatar) transmitted from the client.

「body_type」の１６ビットフィールドは、アバターの全身体系の種類を示す。「body_angle」の１６ビットフィールドは、アバター画像の正面からの向きの属性を示す。「emotional_type」の１６ビットフィールドは、アバターの表情・感情の種類を示す。「face_angle」の１６ビットフィールドは、アバターの顔の向きを示す。 A 16-bit field of "body_type" indicates the type of body system of the avatar. A 16-bit field of "body_angle" indicates the attribute of the orientation from the front of the avatar image. A 16-bit field of "emotional_type" indicates the type of facial expression/emotion of the avatar. A 16-bit field of "face_angle" indicates the orientation of the avatar's face.
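The sending side of the avatar database selection information can be sketched in the same way. The function below is illustrative only: `pack_avatar_database_selection` is a hypothetical name, the fields are assumed to be unsigned, big-endian, and tightly packed, and the meaning of the numeric codes for body type, emotion, and angles is not defined here.

```python
import struct

_PER_OBJECT = struct.Struct(">BHHHH")  # id (8 bits) + four 16-bit fields, assumed layout


def pack_avatar_database_selection(message_id, client_id, target_content_id, objects):
    """Serialize avatar database selection information (field layout assumed).

    `objects` is a list of dicts keyed by the field names used in the text.
    """
    body_len = 3 + len(objects) * _PER_OBJECT.size
    if body_len > 255:
        # byte_length is an 8-bit field, so at most 255 bytes may follow it.
        raise ValueError("too many objects for an 8-bit byte_length")
    body = struct.pack(">BBB", client_id, target_content_id, len(objects))
    for obj in objects:
        body += _PER_OBJECT.pack(obj["client_object_id"], obj["body_type"],
                                 obj["body_angle"], obj["emotional_type"],
                                 obj["face_angle"])
    return struct.pack(">BB", message_id, body_len) + body
```

With an 8-bit "byte_length", this assumed layout allows at most 28 objects per message, which the length check enforces.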

図８に戻って、音声キャプチャ２０３は、各オブジェクト、つまり各アバターの音声（音）を集音して音声データを得るマイクロホンである。オブジェクト情報生成部２０４は、オブジェクト毎にアバター合成位置情報に基づいてオブジェクトメタデータを生成し、各オブジェクトのオブジェクト符号化データ（符号化サンプルデータ、オブジェクトメタデータ）を出力する。 Returning to FIG. 8, the voice capture 203 is a microphone that collects the voice (sound) of each object, that is, each avatar, to obtain voice data. The object information generation unit 204 generates object metadata for each object based on the avatar synthesis position information, and outputs object encoded data (encoded sample data, object metadata) of each object.

FIG. 11(a) shows a structure example of the voice object rendering information (Voice_object_rendering_information) serving as the object metadata of each object (avatar), and FIG. 11(b) shows the contents (Semantics) of the main information in that structure example. The 8-bit field "message_id" indicates identification information identifying the voice object rendering information. The 8-bit field "byte_length" indicates the number of subsequent bytes as the size of this voice object rendering information. The 8-bit field "client_id" indicates identification information of the client (client device 200) that transmits this audio data. The 8-bit field "target_content_id" indicates identification information of the video content (background image) to be composited.

「number_of_client_objects」の８ビットフィールドは、クライアントから送信されるオブジェクト、つまりアバターの数を示す。オブジェクトの数だけ、「client_object_id」、「Azimuth」、「Radius」、「Elevation」の各フィールドが繰り返し存在する。「client_object_id」の８ビットフィールドは、クライアントから送信されるオブジェクト（アバター）の識別情報を示す。 An 8-bit field of "number_of_client_objects" indicates the number of objects, ie avatars, sent from the client. The "client_object_id", "Azimuth", "Radius", and "Elevation" fields are repeated as many times as there are objects. An 8-bit field of "client_object_id" indicates identification information of an object (avatar) transmitted from the client.

「Azimuth」の１６ビットフィールドは、オブジェクトとしてのアバターの位置情報としてのアジマス（Azimuth）を示す。「Radius」の１６ビットフィールドは、オブジェクトとしてのアバターの位置情報としてのラジアス（Radius ）を示す。「Elevation」の１６ビットフィールドは、オブジェクトとしてのアバターの位置情報としてのエレベーション（Elevation）を示す。 A 16-bit field of "Azimuth" indicates an azimuth as positional information of an avatar as an object. A 16-bit field of "Radius" indicates a radius as positional information of an avatar as an object. A 16-bit field of "Elevation" indicates an elevation as positional information of an avatar as an object.

ここで、図１２を参照して、「Azimuth」、「Radius」、「Elevation」の値の求め方について説明する。ＨＭＤ４００Ａで展開される画像上におけるアバターの合成位置の中心座標を点Ｐで表している。上述したようにアバターの合成位置は、背景画像におけるアバターの許容合成範囲内にあり、アバターの合成位置情報（「avator_center_position_x」、「avator_center_position_y」）で特定される。 Here, with reference to FIG. 12, how to obtain the values of "Azimuth", "Radius", and "Elevation" will be described. A point P represents the central coordinates of the avatar synthesis position on the image developed by the HMD 400A. As described above, the avatar synthesis position is within the avatar's allowable synthesis range in the background image, and is specified by the avatar synthesis position information (“avator_center_position_x”, “avator_center_position_y”).

この実施の形態において、背景画像におけるアバターの許容合成範囲は、ＨＭＤ４００Ａで展開される画像範囲が対応するように設定される。これにより、アバターの合成位置情報によりＨＭＤ４００Ａで展開される画像上における点Ｐの座標が特定される。また、この実施の形態において、ＨＭＤ４００Ａで展開される画像範囲は、デフォルトの表示状態では、背景画像におけるアバターの許容合成範囲に対応したものとされる。 In this embodiment, the allowable composition range of the avatar in the background image is set so as to correspond to the image range developed by the HMD 400A. Thereby, the coordinates of the point P on the image developed by the HMD 400A are specified based on the combined position information of the avatar. Also, in this embodiment, the image range developed by the HMD 400A corresponds to the allowable composition range of the avatar in the background image in the default display state.

LT, LB, RT, and RB denote virtual speakers on the assumed display monitor, and point Q denotes the center of the assumed viewing position. With r the distance from point Q to point P, θ the angle between QA and QB, and φ the angle between QB and QP, the values of "Azimuth", "Radius", and "Elevation" (the avatar position information) are obtained as follows.
Azimuth = θ
Elevation = φ
Radius = r
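The relations above can be sketched numerically. Points A and B are defined only in FIG. 12, so the following rests on assumptions: A is taken to lie straight ahead of Q, B is taken to be the projection of P onto the horizontal plane through Q, the axes are x to the right, y up, and z forward, and angles are returned in degrees. The function name `avatar_position_to_polar` and the sign conventions are hypothetical.

```python
import math


def avatar_position_to_polar(p, q=(0.0, 0.0, 0.0)):
    """Return (azimuth, elevation, radius) of point P as seen from point Q.

    Assumed geometry: azimuth is the angle between the forward direction (QA)
    and the horizontal projection of QP (QB); elevation is the angle between
    QB and QP; radius is the distance from Q to P.
    """
    dx, dy, dz = p[0] - q[0], p[1] - q[1], p[2] - q[2]
    horizontal = math.hypot(dx, dz)                        # length of QB
    radius = math.sqrt(dx * dx + dy * dy + dz * dz)        # r
    azimuth = math.degrees(math.atan2(dx, dz))             # theta
    elevation = math.degrees(math.atan2(dy, horizontal))   # phi
    return azimuth, elevation, radius
```

A renderer with a different sign convention for azimuth would need the sign flipped before the values are used as object metadata.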

As described above, by transmitting the voice object rendering information (see FIG. 11) with the values of "Azimuth", "Radius", and "Elevation", which are the composition position information of the object (avatar), the receiving side can input these values to the renderer as object metadata without further processing.

Alternatively, the receiving side can identify the coordinates of point P from the avatar composition position information ("avator_center_position_x", "avator_center_position_y") included in the avatar rendering control information (see FIG. 9), obtain the values of "Azimuth", "Radius", and "Elevation" from point P and point Q, the center of the assumed viewing position (see FIG. 12), and input them to the renderer as object metadata.

In that case, the values of "Azimuth", "Radius", and "Elevation", which are the composition position information of each object (avatar), need not be transmitted in the voice object rendering information (see FIG. 11); for example, "number_of_client_objects" is set to 0.

Even in that case, sending the value of "Radius" makes it possible to convey from the server 100 to the client device 200 an appropriate depth position as the composition position of each object (avatar). When the values of "Azimuth", "Radius", and "Elevation" are then inserted into the voice object rendering information (see FIG. 11), "Azimuth" and "Elevation" are set, for example, to invalid values.

また、「Radius」の値も送らない場合であっても、クライアント装置２００側で、アバターレンダリング制御情報（図９参照）に含まれる「avator_rendering_size」の情報に基づいて、オブジェクト（アバター）のサイズに応じて、求められた「Radius」の値を調整することにより、各オブジェクト（アバター）の合成位置の奥行位置を適切な位置に設定することが可能となる。 In addition, even if the value of "Radius" is not sent, the client device 200 side adjusts the size of the object (avatar) based on the information of "avator_rendering_size" included in the avatar rendering control information (see FIG. 9). By adjusting the value of "Radius" obtained accordingly, it is possible to set the depth position of the synthesis position of each object (avatar) to an appropriate position.

Returning to FIG. 8, the audio encoder 205 encodes the object encoded data (encoded sample data, object metadata) of each object obtained by the object information generation unit 204 to obtain MPEG-H 3D Audio encoded audio data. This encoded audio data constitutes the audio data corresponding to the avatar meta information.

The character generation unit 206 generates, as appropriate, caption text data (character codes) DT corresponding to each object, that is, each avatar, based on user operations from the user operation unit 201a. The caption encoder 207 receives the text data DT and obtains caption (subtitle) text information in a predetermined format, which in this embodiment is TTML (Timed Text Markup Language). This TTML constitutes the caption data corresponding to the avatar meta information.

FIG. 13(a) shows the TTML structure. TTML is written in XML and consists of a header (head) and a body (body). The header contains elements such as metadata (metadata), styling (styling), and layout (layout). The metadata includes title information, copyright information, and the like. The styling includes, in addition to an identifier (id), information such as color (color), font (fontFamily), size (fontSize), and alignment (textAlign). The layout includes, in addition to the identifier (id) of the region in which the subtitle is placed, information such as extent (extent), offset (padding), background color (backgroundColor), and alignment (displayAlign). The body contains the caption text information and the like.

この実施の形態において、ＴＴＭＬには、字幕オブジェクトレンダリング情報が挿入される。図１３（ｂ）は、メタデータ（ＴＴＭ：TTML Metadata）の構造例を示し、「target_content_id」、「client_id」、「client_object_id」の各情報が存在する。「target_content_id」は、合成対象のビデオコンテンツ（背景画像）の識別情報を示す。「client_id」は、この字幕データを送信するクライアント（クライアント装置２００）の識別情報を示す。「client_object_id」は、クライアントから送信されるオブジェクト（アバター）の識別情報を示す。なお、字幕の表示位置の情報は、ボディに含まれている。 In this embodiment, the TTML is populated with subtitle object rendering information. FIG. 13(b) shows an example of the structure of metadata (TTM: TTML Metadata), in which information of "target_content_id", "client_id", and "client_object_id" exists. “target_content_id” indicates identification information of the video content (background image) to be synthesized. "client_id" indicates the identification information of the client (client device 200) that transmits this caption data. "client_object_id" indicates the identification information of the object (avatar) transmitted from the client. Information on the display position of the caption is included in the body.
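As an illustration of how the three identifiers could travel in the TTML metadata, the following sketch builds a small TTML document with Python's standard `xml.etree.ElementTree`. The exact element layout of FIG. 13(b) is not reproduced here: the namespace `urn:example:avatar-caption`, its `av` prefix, the function name, and the default timing values are all assumptions made for the example. Only the TTML namespace URI is the real W3C one.

```python
import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"      # W3C TTML namespace
AV_NS = "urn:example:avatar-caption"     # hypothetical namespace for the three ids

ET.register_namespace("tt", TT_NS)
ET.register_namespace("av", AV_NS)


def build_caption_ttml(text, target_content_id, client_id, client_object_id,
                       begin="0s", end="5s"):
    """Return a TTML string whose head/metadata carries the three identifiers."""
    tt = ET.Element(f"{{{TT_NS}}}tt")
    head = ET.SubElement(tt, f"{{{TT_NS}}}head")
    metadata = ET.SubElement(head, f"{{{TT_NS}}}metadata")
    for name, value in (("target_content_id", target_content_id),
                        ("client_id", client_id),
                        ("client_object_id", client_object_id)):
        ET.SubElement(metadata, f"{{{AV_NS}}}{name}").text = str(value)
    body = ET.SubElement(tt, f"{{{TT_NS}}}body")
    div = ET.SubElement(body, f"{{{TT_NS}}}div")
    p = ET.SubElement(div, f"{{{TT_NS}}}p", {"begin": begin, "end": end})
    p.text = text
    return ET.tostring(tt, encoding="unicode")
```

A receiver could read `client_id` and `client_object_id` back from the metadata to associate the caption with the matching avatar.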

図８に戻って、コンテナエンコーダ２０８は、メタデータジェネレータ２０２で発生されたアバターメタ情報、オーディオエンコーダ２０５で得られた３Ｄオーディオの符号化音声データおよび字幕エンコーダ２０７で得られた字幕のテキスト情報であるＴＴＭＬを含むコンテナ、ここではＭＰ４ストリームを、クライアント送信ストリームとして生成する。 Returning to FIG. 8, the container encoder 208 converts the avatar meta information generated by the metadata generator 202, the encoded 3D audio data obtained by the audio encoder 205, and the caption text information obtained by the caption encoder 207 into A container containing a certain TTML, here an MP4 stream, is generated as a client transmission stream.

ネットワークインタフェース２０９は、ネットワーク３００を介して、他のクライアント装置２００と通信をする。ネットワークインタフェース２０９は、コンテナエンコーダ２０８で得られたクライアント送信ストリームを、ネットワーク３００を介して、他のクラインアント装置２００に送信する。 A network interface 209 communicates with other client devices 200 via the network 300 . Network interface 209 transmits the client transmission stream obtained by container encoder 208 to other client devices 200 via network 300 .

図８に示す送信系２００Ｔの動作を簡単に説明する。メタデータジェネレータ２０２では、ユーザ操作部２０１ａからのユーザ操作に応じて、アバターメタ情報が発生される。このアバターメタ情報は、アバターレンダリング制御情報（図９参照）と、アバターデータベース選択情報（図１０参照）からなっている。アバターレンダリング制御情報には、背景画像の許容合成範囲内におけるアバターの合成位置を示す情報と、そのアバターのサイズを示す情報が含まれている。また、アバターデータベース選択情報には、アバターの画像データをアバターデータベースから得るための選択情報が含まれている。 The operation of the transmission system 200T shown in FIG. 8 will be briefly described. In the metadata generator 202, avatar meta information is generated according to the user operation from the user operation unit 201a. This avatar meta information consists of avatar rendering control information (see FIG. 9) and avatar database selection information (see FIG. 10). The avatar rendering control information includes information indicating the composition position of the avatar within the allowable composition range of the background image and information indicating the size of the avatar. The avatar database selection information also includes selection information for obtaining image data of avatars from the avatar database.

音声キャプチャ２０３では、各オブジェクト、つまり各アバターの音声（音）が集音されて音声データが得られる。この各オブジェクト（アバター）の音声データは、オブジェクト情報生成部２０４に供給される。また、このオブジェクト情報生成部２０４には、背景画像における各オブジェクト（アバター）の合成位置情報が供給される。 The audio capture 203 collects the audio (sound) of each object, ie, each avatar, to obtain audio data. The audio data of each object (avatar) is supplied to the object information generating section 204 . The object information generation unit 204 is also supplied with information on the position of each object (avatar) in the background image to be combined.

オブジェクト情報生成部２０４では、オブジェクト毎にオブジェクト合成位置情報に基づいてオブジェクトメタデータが生成され、各オブジェクトのオブジェクト符号化データ（符号化サンプルデータ、オブジェクトメタデータ）が得られる。ここで、各オブジェクト（アバター）のオブジェクトメタデータとして音声オブジェクトレンダリング情報（図１１参照）が含まれる。この音声オブジェクトレンダリング情報には、各オブジェクト（アバター）の位置情報（θ，φ，ｒ）が含まれている。 The object information generation unit 204 generates object metadata for each object based on the object synthesis position information, and obtains object encoded data (encoded sample data, object metadata) of each object. Here, audio object rendering information (see FIG. 11) is included as object metadata of each object (avatar). This audio object rendering information includes position information (θ, φ, r) of each object (avatar).

オブジェクト情報生成部２０４で得られた各オブジェクトのオブジェクト符号化データ（符号化サンプルデータ、オブジェクトメタデータ）は、オーディオエンコーダ２０５に供給される。オーディオエンコーダ２０５では、各オブジェクトのオブジェクト符号化データに対して符号化が施されて、ＭＰＥＧ－Ｈ３ＤＡｕｄｉｏの符号化音声データが得られる。 The encoded object data (encoded sample data, object metadata) of each object obtained by the object information generator 204 is supplied to the audio encoder 205 . The audio encoder 205 encodes the encoded object data of each object to obtain MPEG-H 3D Audio encoded audio data.

The character generation unit 206 generates, as appropriate, caption text data (character codes) DT corresponding to each object, that is, each avatar, based on user operations from the user operation unit 201a. This text data DT is supplied to the caption encoder 207. The caption encoder 207 is also supplied with display position information of the caption corresponding to each object (avatar).

字幕エンコーダ２０７では、テキストデータＤＴに基づいて字幕（サブタイトル）のテキスト情報としてのＴＴＭＬが得られる。このＴＴＭＬの例えばメタデータにレンダリング情報が挿入される（図１３参照）。なお、字幕の表示位置の情報は、ヘッドに含まれる。アバターレンダリング情報はメタデータ以外の部分、例えば「origin」や「extent」と共に、ヘッド配下のレイアウトに含まれるようにしてもよい。 The caption encoder 207 obtains TTML as text information of captions (subtitles) based on the text data DT. Rendering information is inserted into, for example, metadata of this TTML (see FIG. 13). Information on the display position of the caption is included in the head. The avatar rendering information may be included in the layout under the head together with parts other than the metadata, such as "origin" and "extent".

メタデータジェネレータ２０２で発生されたアバターメタ情報、オーディオエンコーダ２０５で得られた３Ｄオーディオの符号化音声データおよび字幕エンコーダ２０７で得られた字幕のテキスト情報であるＴＴＭＬは、コンテナエンコーダ２０８に供給される。コンテナエンコーダ２０８では、アバターメタ情報、符号化音声データおよびＴＴＭＬを含むＭＰ４ストリームがクライアント送信ストリームとして生成される。 The avatar meta information generated by the metadata generator 202, the encoded 3D audio data obtained by the audio encoder 205, and the subtitle text information TTML obtained by the subtitle encoder 207 are supplied to the container encoder 208. . At the container encoder 208, an MP4 stream containing avatar meta information, encoded audio data and TTML is generated as the client transmission stream.

The client transmission stream obtained by the container encoder 208 is supplied to the network interface 209. The network interface 209 transmits the client transmission stream to the other client devices 200 via the network 300.

FIG. 14 shows a configuration example of the receiving system 200R of the client device 200. The receiving system 200R includes a control unit 201, a network interface 211, a container decoder 212, a video decoder 213, a plane converter 214, receiving modules 215 and 215A, an audio decoder 216, a mixer 218, and a synthesizing unit 219. The units are connected by a bus 210.

制御部２０１は、クライアント装置２００、従って受信系２００Ｒの各部の動作を制御する。この制御部２０１には、ユーザ操作部２０１ａが接続されている。ネットワークインタフェース２１１は、ネットワーク３００を介して、サーバ１００および他のクライアント装置２００と通信をする。ネットワークインタフェース２１１は、サーバ１００から、上述したサーバ配信ストリームを受信する。また、ネットワークインタフェース２１１は、他のクライアント装置２００から、上述したクライアント送信ストリームを受信する。 The control unit 201 controls the operation of each unit of the client device 200, and thus the receiving system 200R. A user operation unit 201 a is connected to the control unit 201 . Network interface 211 communicates with server 100 and other client devices 200 via network 300 . Network interface 211 receives the above-described server-delivered stream from server 100 . Also, the network interface 211 receives the client transmission stream described above from another client device 200 .

The container decoder 212 extracts the video stream and the audio stream from the server delivery stream (MP4 stream) received by the network interface 211. In doing so, the container decoder 212 extracts the video attribute information box ("vaib" box) present in the "udta" box defined in the initialization segment (IS) or in the "moof" box, and sends it to the control unit 201. The control unit 201 thereby recognizes the capture information indicating the imaging state of the camera, the position information (GPS data) indicating the position of the camera (imaging position), and the information indicating the allowable composition range of the avatar in the background image.

ビデオデコーダ２１３は、コンテナデコーダ２１２で取り出されたビデオストリームにデコード処理を施して、背景画像の画像データを得る。また、ビデオデコーダ２１３は、ビデオストリームに挿入されているパラメータセットやＳＥＩメッセージを抽出し、制御部２０１に送る。 The video decoder 213 decodes the video stream extracted by the container decoder 212 to obtain image data of the background image. The video decoder 213 also extracts parameter sets and SEI messages inserted in the video stream and sends them to the control unit 201 .

この抽出情報には、上述したビデオ・アトリビュート・インフォメーション・ＳＥＩメッセージ（図４参照）も含まれる。これにより、制御部２０１は、カメラの撮像状態を示すキャプチャ情報と、カメラの位置（撮像位置）を示す位置情報（ＧＰＳデータ）と、背景画像におけるアバターの許容合成範囲を示す情報を認識する。 This extracted information also includes the video attribute information SEI message (see FIG. 4) described above. Thereby, the control unit 201 recognizes capture information indicating the imaging state of the camera, position information (GPS data) indicating the position of the camera (imaging position), and information indicating the permissible composition range of the avatar in the background image.

プレーンコンバータ２１４は、ビデオデコーダ２１３で得られた背景画像の画像データが非線形な画像データである場合には線形な画像データに変換する。また、プレーンコンバータ２１４は、背景画像の画像データが広視野角画像の画像データである場合、その画像データから、ＨＭＤ４００Ａの表示視野角に対応した部分だけを切り出し、表示用画像データを得る。 A plane converter 214 converts the image data of the background image obtained by the video decoder 213 into linear image data when the image data is non-linear image data. Further, when the image data of the background image is the image data of the wide viewing angle image, the plane converter 214 extracts only the portion corresponding to the display viewing angle of the HMD 400A from the image data to obtain display image data.

例えば、背景画像におけるアバターの許容合成範囲の大きさはＨＭＤ４００Ａの表示視野角に対応して設定されており、プレーンコンバータ２１４は、デフォルトの状態では、この許容合成範囲に対応した画像データを切り出して表示用画像データとする。その後、プレーンコンバータ２１４は、切出し範囲を、例えばＨＭＤ搭載のセンサで検出される頭部姿勢に応じて変更していく。 For example, the size of the allowable composition range of the avatar in the background image is set corresponding to the display viewing angle of the HMD 400A, and the plane converter 214 cuts out image data corresponding to this allowable composition range in the default state. This is image data for display. After that, the plane converter 214 changes the cutout range according to the head posture detected by the HMD-equipped sensor, for example.
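The cut-out behaviour described above can be sketched as a window that starts at the allowable composition range and then moves with head posture. This is a rough illustration, not the plane converter's actual processing: it assumes an equirectangular background in which 360 degrees of yaw spans the full image width and 180 degrees of pitch spans the full height, and it simply translates the window instead of reprojecting. The function name and the (x, y, width, height) return convention are hypothetical.

```python
def shifted_crop_window(pano_w, pano_h, window, yaw_deg=0.0, pitch_deg=0.0):
    """Return (x, y, width, height) of the cut-out for a given head posture.

    `window` is (x_start, y_start, x_end, y_end), for example the allowable
    composition range, used unchanged when yaw and pitch are both zero.
    """
    x0, y0, x1, y1 = window
    w, h = x1 - x0, y1 - y0
    dx = round(yaw_deg / 360.0 * pano_w)       # turning right moves the window right
    dy = round(-pitch_deg / 180.0 * pano_h)    # looking up moves the window up
    new_x = (x0 + dx) % pano_w                 # wrap horizontally around the panorama
    new_y = min(max(y0 + dy, 0), pano_h - h)   # clamp vertically at the image edges
    return new_x, new_y, w, h
```

Because the horizontal position wraps, a caller would need to stitch two pieces when the returned window crosses the right edge of the image.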

The audio decoder 216 decodes the audio stream extracted by the container decoder 212 to obtain 2-channel audio data for audio reproduction on the headphones (HP) 400B. When the decoding process yields multi-channel audio data such as 5.1-channel data, the audio decoder 216 further downmixes it to 2-channel audio data.
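The downmix step is not specified in detail here. As one possible illustration, the sketch below applies a common stereo downmix in which the center and surround channels are attenuated by 1/√2 and the LFE channel is dropped. The coefficients, the channel order (L, R, C, LFE, Ls, Rs), and the function name are assumptions, not the decoder's documented behaviour.

```python
import math

_G = 1.0 / math.sqrt(2.0)  # about -3 dB, assumed gain for center and surrounds


def downmix_5_1_to_stereo(frame):
    """Downmix one 5.1 sample frame (L, R, C, LFE, Ls, Rs) to (left, right)."""
    l, r, c, _lfe, ls, rs = frame   # LFE is discarded in this sketch
    left = l + _G * c + _G * ls
    right = r + _G * c + _G * rs
    return left, right
```

In practice the result may need scaling or limiting, since the summed channels can exceed the original sample range.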

受信モジュール２１５は、ネットワークインタフェース２１１で受信されたクライアント送信ストリームを処理し、アバターの画像データとそのアバターの合成位置情報、アバターに対応した字幕の表示データとその字幕の表示位置情報、さらにアバターに対応した２チャネルの音声データを得る。 The receiving module 215 processes the client transmission stream received by the network interface 211 to obtain the image data of the avatar and the composition position information of that avatar, the display data of the subtitles corresponding to the avatar and the display position information of those subtitles, and the 2-channel audio data corresponding to the avatar.

また、受信モジュール２１５Ａは、自身のクライアント装置２００の送信系２００Ｔ（図８参照）で生成されたクライアント送信ストリームを処理し、アバターの画像データとそのアバターの合成位置情報、アバターに対応した字幕の表示データとその字幕の表示位置情報、さらにアバターに対応した２チャネルの音声データを得る。受信モジュール２１５Ａは、背景画像に自身のアバターを合成するために設けられている。なお、自身のクライアント装置２００が送信系２００Ｔ（図８参照）を持たない場合には、受信系２００Ｒ（図１４参照）における受信モジュール２１５Ａは不要となる。 In addition, the receiving module 215A processes the client transmission stream generated by the transmission system 200T (see FIG. 8) of its own client device 200 to obtain the image data of the avatar and the composition position information of that avatar, the display data of the subtitles corresponding to the avatar and the display position information of those subtitles, and the 2-channel audio data corresponding to the avatar. The receiving module 215A is provided in order to composite the client's own avatar on the background image. If the client device 200 itself does not have the transmission system 200T (see FIG. 8), the receiving module 215A in the reception system 200R (see FIG. 14) is unnecessary.

図１５は、受信モジュール２１５（２１５Ａ）の構成例を示している。この受信モジュール２１５（２１５Ａ）は、コンテナデコーダ２２１と、メタ情報解析部２２２と、アバターデータベース選択部２２３と、アバターデータベース２２４と、サイズ変換部２２５と、オーディオデコーダ２２６と、レンダラ２２７と、字幕デコーダ２２８と、フォント展開部２２９を有している。 FIG. 15 shows a configuration example of the receiving module 215 (215A). This receiving module 215 (215A) includes a container decoder 221, a meta information analysis section 222, an avatar database selection section 223, an avatar database 224, a size conversion section 225, an audio decoder 226, a renderer 227, and a subtitle decoder. 228 and a font development unit 229 .

コンテナデコーダ２２１は、クライアント送信ストリームからアバターメタ情報、３Ｄオーディオの符号化音声データおよび字幕のテキスト情報であるＴＴＭＬを取り出す。メタ情報解析部２２２は、コンテナデコーダ２２１で得られたアバターメタ情報を解析する。 The container decoder 221 extracts avatar meta information, 3D audio encoded voice data, and TTML, which is text information of subtitles, from the client transmission stream. A meta information analysis unit 222 analyzes the avatar meta information obtained by the container decoder 221 .

メタ情報解析部２２２は、アバターデータベース選択情報（図１０参照）に基づいて、アバターの画像データをアバターデータベース２２４から得るための選択情報を取得する。この選択情報は、アバターの全身体系の種類「body_type」、正面からの向き「body_angle」、表情・感情の種類「emotional_type」、顔の向き「face_angle」の各情報からなっている。 The meta-information analysis unit 222 acquires selection information for obtaining avatar image data from the avatar database 224 based on the avatar database selection information (see FIG. 10). This selection information consists of each information of the avatar's whole body system type "body_type", direction from the front "body_angle", expression/emotion type "emotional_type", and face direction "face_angle".

また、メタ情報解析部２２２は、アバターレンダリング制御情報（図９参照）に基づいて、背景画像の許容合成範囲内におけるアバターの合成位置情報「avator_center_position_x」、「avator_center_position_y」と、そのアバターのサイズ情報「avator_rendering_size」を取得する。 In addition, based on the avatar rendering control information (see FIG. 9), the meta information analysis unit 222 acquires the avatar composition position information "avator_center_position_x" and "avator_center_position_y" within the allowable composition range of the background image, and the size information "avator_rendering_size" of that avatar.

アバターデータベース選択部２２３は、メタ情報解析部２２２で取得された選択情報を参照してアバターデータベース２２４から取得されるアバターの構成データに基づいて、アバターの画像データを得る。 The avatar database selection unit 223 refers to the selection information acquired by the meta information analysis unit 222 and obtains image data of the avatar based on the avatar configuration data acquired from the avatar database 224 .

図１６は、アバターデータベース選択部２２３の構成例を示している。アバターデータベース選択部２２３は、データベースマッピング部２２３ａを備えている。アバターの全身体系の種類「body_type」、正面からの向き「body_angle」、表情・感情の種類「emotional_type」、顔の向き「face_angle」の各情報がデータベースマッピング部２２３ａに入力され、これらの情報に基づいてアバターデータベース２２４からアバターの構成データが取得されてマッピングされ、アバターの画像データが得られる。 FIG. 16 shows a configuration example of the avatar database selection unit 223. The avatar database selection unit 223 has a database mapping unit 223a. The avatar's whole body system type "body_type", direction from the front "body_angle", expression/emotion type "emotional_type", and face direction "face_angle" are input to the database mapping unit 223a. Based on this information, avatar configuration data is acquired from the avatar database 224 and mapped to obtain the avatar image data.

図１７は、アバターデータベース２２４のリスト例を示している。例えば、アバターの全身体系の種類「body_type」の構成データとしては、“直立している”、“腰かけている”、“寝そべっている”の３状態が保持されている。また、例えば、正面からの向き「body_angle」の構成データとしては、“前向き”、“後ろ向き”、“右向き”、“左向き”、“上向き”、“下向き”の６状態が保持されている。また、例えば、表情・感情の種類「emotional_type」構成データとしては、“無表情”、“笑っている”、“泣いている”、“怒っている”の４状態が保持されている。また、顔の向き「face_angle」の構成データとしては、“正面直視”、“伏し目”の２状態が保持されている。 FIG. 17 shows an example list of avatar database 224 . For example, three states of "upright", "sitting", and "lying" are held as configuration data for the type "body_type" of the avatar's whole body system. Also, for example, six states of "forward", "backward", "rightward", "leftward", "upward", and "downward" are held as configuration data of the orientation "body_angle" from the front. Further, for example, four states of "expressionless", "laughing", "crying", and "angry" are held as the "emotional_type" configuration data of the type of facial expression/emotion. In addition, as configuration data of the face orientation “face_angle”, two states of “front straight view” and “downcast eyes” are held.
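The lookup performed by the database mapping unit 223a against the list of FIG. 17 can be pictured with the sketch below. The state names follow the list given above, but the numeric codes (taken here simply as list indices), the table name `AVATAR_DATABASE`, and the function `map_selection` are assumptions made for illustration; the actual code values are defined by the avatar database selection information (FIG. 10), which is not reproduced here.

```python
# Configuration data held for each selection field, in the order listed in the text.
# Treating the list index as the transmitted code value is an assumption.
AVATAR_DATABASE = {
    "body_type": ("standing upright", "sitting", "lying down"),
    "body_angle": ("forward", "backward", "right", "left", "up", "down"),
    "emotional_type": ("expressionless", "laughing", "crying", "angry"),
    "face_angle": ("looking straight ahead", "downcast eyes"),
}


def map_selection(selection):
    """Map the four selection codes to the configuration data they pick."""
    parts = {}
    for field, states in AVATAR_DATABASE.items():
        code = selection[field]
        if not 0 <= code < len(states):
            raise ValueError(f"unknown {field} code: {code}")
        parts[field] = states[code]
    return parts
```

A real implementation would return image components rather than strings; the strings only stand in for the configuration data from which the avatar image is assembled.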

図１５に戻って、サイズ変換部２２５は、アバターデータベース選択部２２３で得られたアバターの画像データに対して、メタ情報解析部２２２で取得されたサイズ情報に基づいて、サイズ変換処理を施し、サイズ変換されたアバターの画像データを得る。 Returning to FIG. 15, the size conversion unit 225 performs size conversion processing on the avatar image data obtained by the avatar database selection unit 223, based on the size information acquired by the meta information analysis unit 222, to obtain size-converted avatar image data.

オーディオデコーダ２２６は、コンテナデコーダ２２１で得られた音声符号化データにデコード処理を施し、オブジェクト符号化データとしての符号化サンプルデータおよびオブジェクトメタデータ（音声オブジェクトレンダリング情報）を得る。レンダラ２２７は、オーディオデコーダ２２６で得られた符号化サンプルデータおよびオブジェクトメタデータに対してレンダリング処理を施し、背景画像におけるアバターの合成位置が音像位置となるように、各スピーカのチャネルデータを得る。 The audio decoder 226 decodes the encoded audio data obtained by the container decoder 221 to obtain encoded sample data as object encoded data and object metadata (audio object rendering information). The renderer 227 renders the encoded sample data and object metadata obtained by the audio decoder 226, and obtains channel data for each speaker so that the avatar synthesis position in the background image becomes the sound image position.

図１８は、レンダラ２２７におけるレンダリング処理の概要を示している。この図１８において、図１２と対応する部分には同一符号を付して示している。オブジェクトメタデータに含まれるアバター位置情報（θ，φ，ｒ）は、ＨＭＤ４００Ａで展開される画像上におけるアバターの合成位置の中心座標である点Ｐに対応する。 FIG. 18 shows an outline of the rendering processing in the renderer 227. In FIG. 18, parts corresponding to those in FIG. 12 are denoted by the same reference numerals. The avatar position information (θ, φ, r) included in the object metadata corresponds to the point P, which is the center coordinate of the avatar composition position on the image developed by the HMD 400A.

なお、クライアント装置２００では、上述したように、アバターレンダリング制御情報（図９参照）に含まれるアバターの合成位置情報（「avator_center_position_x」、「avator_center_position_y」）により点Ｐの座標を特定でき、この点Ｐと想定する鑑賞位置の中心を点Ｑから「Azimuth」、「Radius」、「Elevation」の値を求めて、レンダラ２２７で用いることも可能である（図１２参照）。 Note that, as described above, the client device 200 can specify the coordinates of the point P from the avatar composition position information ("avator_center_position_x", "avator_center_position_y") included in the avatar rendering control information (see FIG. 9). It is also possible to obtain the values of "Azimuth", "Radius", and "Elevation" from this point P and the point Q, which is the center of the assumed viewing position, and to use them in the renderer 227 (see FIG. 12).

その場合、「Radius」の値に関しては、サーバ１００から音声オブジェクトレンダリング情報（図１１参照）に挿入されて送られてくる「Radius」の値を使用するか、あるいはアバターレンダリング制御情報（図９参照）に含まれる「avator_rendering_size」の情報に基づいて、オブジェクト（アバター）のサイズに応じて、求められた「Radius」の値を調整して使用することで、アバターの合成位置の奥行位置を適切な位置に設定することが可能となる。 In that case, as the value of "Radius", either the value of "Radius" inserted in the audio object rendering information (see FIG. 11) and sent from the server 100 is used, or the obtained value of "Radius" is adjusted according to the size of the object (avatar) based on the "avator_rendering_size" information included in the avatar rendering control information (see FIG. 9). This makes it possible to set the depth of the avatar composition position appropriately.
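As a rough illustration of deriving "Azimuth", "Elevation", and "Radius" from the points P and Q, the sketch below uses one possible convention: x to the right, y upward, z from the viewer into the image, with angles in degrees. The convention, the function names, and the inverse-proportional size adjustment in `adjust_radius` are assumptions of this sketch; the specification does not fix them in this passage.

```python
import math


def polar_from_points(p, q=(0.0, 0.0, 0.0)):
    """Return (azimuth_deg, elevation_deg, radius) of point P seen from point Q."""
    dx, dy, dz = (p[i] - q[i] for i in range(3))
    radius = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.degrees(math.atan2(dx, dz))                    # left/right angle
    elevation = math.degrees(math.atan2(dy, math.hypot(dx, dz)))  # up/down angle
    return azimuth, elevation, radius


def adjust_radius(radius, rendering_size, reference_size=1.0):
    """Scale the radius so that a larger avatar is treated as being closer.

    The inverse-proportional rule is hypothetical; rendering_size must be > 0.
    """
    return radius * reference_size / rendering_size
```

With this convention a point straight ahead of Q has zero azimuth and elevation, and doubling the avatar size halves the radius handed to the renderer.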

そして、この点Ｐが、中心鑑賞位置である点Ｑから各スピーカ位置へ伸ばした軸Ｑ－ＬＴ，Ｑ－ＬＢ，Ｑ－ＲＴ，Ｑ－ＲＢ上のベクトルｒ＿ＬＴ，ｒ＿ＬＢ，ｒ＿ＲＴ，ｒ＿ＲＢに射影される。そして、各スピーカのチャンネルデータの音圧レベルはそれぞれこの４つのベクトルのベクトル量に相当するものとされる。 This point P is then projected onto the vectors r_LT, r_LB, r_RT, and r_RB on the axes Q-LT, Q-LB, Q-RT, and Q-RB extending from the point Q, which is the central viewing position, to the respective speaker positions. The sound pressure level of the channel data of each speaker corresponds to the magnitude of the respective one of these four vectors.
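The projection described above can be sketched as a scalar projection of the vector Q→P onto each axis Q→speaker. The function name `speaker_levels`, the use of 3D tuples, and the clamping of negative projections to zero (a speaker on the far side of Q receives no level) are assumptions made for this illustration, and each speaker is assumed to be at a position different from Q.

```python
import math


def speaker_levels(p, q, speakers):
    """Project Q->P onto each axis Q->speaker and return the projected lengths.

    speakers maps a name such as "LT" to a 3D position. Negative projections
    are clamped to 0.0, which is an assumption of this sketch.
    """
    v = [p[i] - q[i] for i in range(3)]
    levels = {}
    for name, pos in speakers.items():
        axis = [pos[i] - q[i] for i in range(3)]
        norm = math.sqrt(sum(a * a for a in axis))
        projection = sum(v[i] * axis[i] for i in range(3)) / norm
        levels[name] = max(0.0, projection)
    return levels
```

When P coincides with one speaker position, that speaker receives the largest level and the diagonally opposite speaker receives none, which matches the intuition of localising the sound image at the avatar.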

なお、図１８の例は、ＨＭＤ４００Ａに展開される画像がデフォルトの状態、すなわちＨＭＤ４００Ａに展開される画像が背景画像におけるアバターの許容合成範囲に対応している場合を示している。上述したようにプレーンコンバータ２１４における切出し範囲はＨＭＤ搭載のセンサで検出される頭部姿勢に応じて変更されていく。 Note that the example of FIG. 18 shows the case where the image developed on the HMD 400A is in the default state, that is, the image developed on the HMD 400A corresponds to the allowable composition range of the avatar in the background image. As described above, the cutout range in the plane converter 214 is changed according to the head posture detected by the sensor mounted on the HMD.

この場合、ＨＭＤ４００Ａに展開される画像上の点Ｐの位置も変化し、変化量によってはＨＭＤ４００Ａに展開される画像上から点Ｐの位置が外れることも想定される。この場合、レンダラ２２７では、アバター位置情報（θ，φ，ｒ）で求められた点Ｐの位置ではなく、変化後の点Ｐの位置に基づいて各スピーカのチャンネルデータの音圧レベルが設定される。 In this case, the position of the point P on the image developed on the HMD 400A also changes, and depending on the amount of change the point P may fall outside the image developed on the HMD 400A. In this case, the renderer 227 sets the sound pressure level of the channel data of each speaker based on the position of the point P after the change, not the position of the point P obtained from the avatar position information (θ, φ, r).

また、レンダラ２２７は、上述したように各スピーカのチャンネルデータに、リマッピング（Remapping）による音圧制御を施し、ヘッドフォン４００Ｂで再生するための２チャネルの音声データに変換して出力する。なお、クライアント側における音声出力が、ヘッドフォン４００Ｂではなく、スピーカＬＴ，ＬＢ，ＲＴ，ＲＢで行われる場合には、このリマッピングによる音圧制御は省略される。 In addition, the renderer 227 applies sound pressure control by remapping to the channel data of each speaker as described above, converts the data into two-channel audio data to be reproduced by the headphones 400B, and outputs the data. Note that when sound output on the client side is performed by the speakers LT, LB, RT, and RB instead of the headphones 400B, the sound pressure control by this remapping is omitted.

図１９は、レンダラ２２７におけるリマッピングによる音圧制御を概略的に示している。Ｄ＿ＬＴ，Ｄ＿ＬＢ，Ｄ＿ＲＴ，Ｄ＿ＲＢはそれぞれスピーカＬＴ，ＬＢ，ＲＴ，ＲＢに出力するチャネルデータを示し、“Left ear”，“Right ear”はヘッドフォン４００Ｂで再生するための２チャネルの音声データを示している。ここで、リマッピングによる音圧制御では、各スピーカから左右の耳までの伝達特性、いわゆる頭部伝達関数（ＨＲＴＦ：Head Related Transfer Function）を各チャネルデータに畳み込んでから合算して２チャネルにダウンミックスすることが行われる。 FIG. 19 schematically shows sound pressure control by remapping in the renderer 227. D_LT, D_LB, D_RT, and D_RB indicate the channel data to be output to the speakers LT, LB, RT, and RB, respectively, and "Left ear" and "Right ear" indicate the 2-channel audio data for reproduction on the headphones 400B. In the sound pressure control by remapping, the transfer characteristics from each speaker to the left and right ears, the so-called head-related transfer function (HRTF), are convolved with each channel data, and the results are then summed to downmix to 2 channels.
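The convolve-then-sum step can be sketched in plain Python as below. The impulse responses passed in as `hrirs` (one left-ear and one right-ear response per speaker) are placeholders supplied by the caller, and the sketch assumes that all channels have the same length and all responses have the same length; neither the data layout nor the function names are taken from the specification.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def remap_to_binaural(channels, hrirs):
    """Convolve each speaker channel with its left/right response and sum.

    channels: {"D_LT": [...], "D_LB": [...], "D_RT": [...], "D_RB": [...]}
    hrirs:    {name: (left_ear_response, right_ear_response)} for the same names
    Returns (left_ear_samples, right_ear_samples).
    """
    left, right = None, None
    for name, samples in channels.items():
        h_left, h_right = hrirs[name]
        part_l = convolve(samples, h_left)
        part_r = convolve(samples, h_right)
        left = part_l if left is None else [a + b for a, b in zip(left, part_l)]
        right = part_r if right is None else [a + b for a, b in zip(right, part_r)]
    return left, right
```

A production renderer would normally use measured HRTF data and FFT-based convolution; the direct form is used here only to keep the example self-contained.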

図１５に戻って、字幕デコーダ２２８は、コンテナデコーダ２２１で得られたＴＴＭＬから字幕のテキストデータや制御コードを得る。制御コードの１つとして、表示位置情報も得られる。フォント展開部２２９は、字幕デコーダ２２８で得られた字幕のテキストデータや制御コードに基づいてフォント展開して、字幕表示データ（ビットマップデータ）を得る。 Returning to FIG. 15 , the caption decoder 228 obtains caption text data and control codes from the TTML obtained by the container decoder 221 . Display position information is also obtained as one of the control codes. The font expansion unit 229 expands fonts based on the subtitle text data and control code obtained by the subtitle decoder 228 to obtain subtitle display data (bitmap data).
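To illustrate how subtitle text and a display position might be pulled out of TTML, the sketch below parses a document with Python's standard `xml.etree.ElementTree`. It assumes that the position is carried as a `tts:origin` attribute on a `region` element referenced by each `p` element, which is one common TTML layout; the function name `parse_ttml` is hypothetical, and the actual TTML profile used in the client transmission stream is not specified in this passage.

```python
import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"
TTS_NS = "http://www.w3.org/ns/ttml#styling"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def parse_ttml(document):
    """Return a list of {"text": ..., "origin": ...} for each <p> element."""
    root = ET.fromstring(document)
    # Collect the origin (display position) declared for each region id.
    regions = {}
    for region in root.iter(f"{{{TT_NS}}}region"):
        regions[region.get(f"{{{XML_NS}}}id")] = region.get(f"{{{TTS_NS}}}origin")
    subtitles = []
    for p in root.iter(f"{{{TT_NS}}}p"):
        text = "".join(p.itertext()).strip()
        subtitles.append({"text": text, "origin": regions.get(p.get("region"))})
    return subtitles
```

The returned text would then go to font expansion, and the origin would serve as the display position information passed on with the bitmap.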

図１５に示す受信モジュール２１５（２１５Ａ）の動作を簡単に説明する。クライアント送信ストリームは、コンテナデコーダ２２１に供給される。コンテナデコーダ２２１では、クライアント送信ストリームからアバターメタ情報、３Ｄオーディオの符号化音声データおよび字幕のテキスト情報であるＴＴＭＬが取り出される。 The operation of the receiving module 215 (215A) shown in FIG. 15 will be briefly described. The client sent stream is provided to container decoder 221 . The container decoder 221 extracts avatar meta information, 3D audio encoded voice data, and TTML, which is text information of subtitles, from the client transmission stream.

コンテナデコーダ２２１で取り出されたアバターメタ情報は、メタ情報解析部２２２に供給される。メタ情報解析部２２２では、アバターデータベース選択情報（図１０参照）に基づいて、アバターの画像データをアバターデータベース２２４から得るための選択情報が取得される。この選択情報は、アバターの全身体系の種類「body_type」、正面からの向き「body_angle」、表情・感情の種類「emotional_type」、顔の向き「face_angle」の各情報からなっている。 The avatar meta information extracted by the container decoder 221 is supplied to the meta information analysis section 222 . The meta-information analysis unit 222 acquires selection information for obtaining avatar image data from the avatar database 224 based on the avatar database selection information (see FIG. 10). This selection information consists of each information of the avatar's whole body system type "body_type", direction from the front "body_angle", expression/emotion type "emotional_type", and face direction "face_angle".

また、メタ情報解析部２２２では、アバターレンダリング制御情報（図９参照）に基づいて、背景画像の許容合成範囲内におけるアバターの合成位置情報「avator_center_position_x」、「avator_center_position_y」と、そのアバターのサイズ情報「avator_rendering_size」が取得される。 In addition, based on the avatar rendering control information (see FIG. 9), the meta information analysis unit 222 acquires the avatar composition position information "avator_center_position_x" and "avator_center_position_y" within the allowable composition range of the background image, and the size information "avator_rendering_size" of that avatar.

メタ情報解析部２２２で取得された選択情報は、アバターデータベース選択部２２３に供給される。アバターデータベース選択部２２３では、選択情報に基づいてアバターデータベース２２４から取得されるアバターの構成データに基づいてアバターの構成データが取得されてマッピングされ、アバターの画像データが得られる。 The selection information acquired by the meta information analysis unit 222 is supplied to the avatar database selection unit 223. The avatar database selection unit 223 acquires avatar configuration data from the avatar database 224 based on the selection information and maps it to obtain the avatar image data.

アバターデータベース選択部２２３で得られたアバターの画像データは、サイズ変換部２２５に供給される。また、このサイズ変換部２２５には、メタ情報解析部２２２で取得されたアバターのサイズ情報が供給される。サイズ変換部２２５では、アバターデータベース選択部２２３から供給されるアバターの画像データに対して、サイズ情報に基づいて、サイズ変換処理が施され、サイズ変換されたアバターの画像データが得られる。このようにサイズ変換部２２５で得られたアバターの画像データは、メタ情報解析部２２２で取得されたアバターの合成位置情報と共に、受信モジュール２１５（２１５Ａ）の出力とされる。 The avatar image data obtained by the avatar database selection unit 223 is supplied to the size conversion unit 225 . Also, the avatar size information acquired by the meta-information analysis unit 222 is supplied to the size conversion unit 225 . The size conversion unit 225 performs size conversion processing on the avatar image data supplied from the avatar database selection unit 223 based on the size information, and obtains size-converted avatar image data. The avatar image data obtained by the size conversion unit 225 in this way is output from the reception module 215 (215A) together with the avatar synthesis position information obtained by the meta-information analysis unit 222 .

また、コンテナデコーダ２２１で取り出された符号化音声データは、オーディオデコーダ２２６に供給される。オーディオデコーダ２２６では、符号化音声データにデコード処理が施され、オブジェクト符号化データとしての符号化サンプルデータおよびオブジェクトメタデータ（音声オブジェクトレンダリング情報）が得られる。このオブジェクト符号化データは、レンダラ２２７に供給される。 Also, the encoded audio data extracted by the container decoder 221 is supplied to the audio decoder 226 . The audio decoder 226 decodes the encoded audio data to obtain encoded sample data as object encoded data and object metadata (audio object rendering information). This object encoded data is supplied to renderer 227 .

レンダラ２２７では、オーディオデコーダ２２６で得られたオブジェクト符号化データ（符号化サンプルデータおよびオブジェクトメタデータ）に対してレンダリング処理が施され、背景画像におけるアバターの合成位置が音像位置となるように、例えばＨＭＤ４００Ａで展開される画像の左右上下に配置された仮想スピーカのチャネルデータが生成される（図１８参照）。 The renderer 227 performs rendering processing on the object encoded data (encoded sample data and object metadata) obtained by the audio decoder 226 and generates channel data for virtual speakers arranged, for example, at the left, right, top, and bottom of the image developed by the HMD 400A, so that the avatar composition position in the background image becomes the sound image position (see FIG. 18).

さらに、レンダラ２２７では、４つのチャネルデータに頭部伝達関数（ＨＲＴＦ）を用いたリマッピングによる音圧制御が行われて、ヘッドフォン４００Ｂで再生するための２チャネルの音声データが生成される（図１９参照）。このようにレンダラ２２７で得られた２チャネルの音声データは、受信モジュール２１５（２１５Ａ）の出力とされる。 Further, the renderer 227 performs sound pressure control by remapping on the four channel data using the head-related transfer function (HRTF) to generate 2-channel audio data for reproduction on the headphones 400B (see FIG. 19). The 2-channel audio data obtained by the renderer 227 in this way is output from the receiving module 215 (215A).

また、コンテナデコーダ２２１で取り出されたＴＴＭＬは、字幕デコーダ２２８に供給される。字幕デコーダ２２８では、ＴＴＭＬから字幕のテキストデータや制御コードが得られる。制御コードの１つとして、表示位置情報も得られる。 Also, the TTML extracted by the container decoder 221 is supplied to the caption decoder 228 . The caption decoder 228 obtains caption text data and control codes from the TTML. Display position information is also obtained as one of the control codes.

字幕デコーダ２２８で得られた字幕のテキストデータや制御コードは、フォント展開部２２９に供給される。フォント展開部２２９では、字幕のテキストデータや制御コードに基づいてフォント展開がされて、字幕表示データ（ビットマップデータ）が得られる。このようにフォント展開部２２９で得られた字幕表示データは、字幕デコーダ２２８で取得された字幕の表示位置情報と共に、受信モジュール２１５（２１５Ａ）の出力とされる。 The subtitle text data and control code obtained by the subtitle decoder 228 are supplied to the font development unit 229 . The font developing unit 229 develops the font based on the text data of the caption and the control code to obtain caption display data (bitmap data). The subtitle display data obtained by the font developing unit 229 in this way is output from the reception module 215 (215A) together with the subtitle display position information obtained by the subtitle decoder 228 .

図１４に戻って、ミクサ２１８は、オーディオデコーダ２１６で得られた２チャネルの音声データと、受信モジュール２１５，２１５Ａ（図１５参照）で得られた２チャネルの音声データを合成して、ヘッドフォン（ＨＰ）４００Ｂに送る２チャネルの音声データを得る。 Returning to FIG. 14, the mixer 218 combines the 2-channel audio data obtained by the audio decoder 216 with the 2-channel audio data obtained by the receiving modules 215 and 215A (see FIG. 15) to obtain the 2-channel audio data to be sent to the headphones (HP) 400B.
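A simple way to picture the mixing step is a per-sample sum of the stereo sources with clipping, as sketched below. The representation of audio as lists of `(left, right)` float pairs, the function name `mix_stereo`, and the hard clipping at `limit` are assumptions of this sketch; the specification does not state how the mixer 218 combines or limits the signals.

```python
def mix_stereo(*sources, limit=1.0):
    """Sum several stereo streams sample by sample and clip to [-limit, limit].

    Each source is a list of (left, right) samples. Mixing stops at the end of
    the shortest source, because zip() truncates to the shortest input.
    """
    def clip(x):
        return max(-limit, min(limit, x))

    mixed = []
    for frames in zip(*sources):
        left = sum(frame[0] for frame in frames)
        right = sum(frame[1] for frame in frames)
        mixed.append((clip(left), clip(right)))
    return mixed
```

Here one source would be the background audio from the audio decoder 216 and the others the avatar audio from the receiving modules 215 and 215A.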

合成部２１９は、制御部２０１の制御のもと、プレーンコンバータ２１４で得られた表示用画像データに、受信モジュール２１５，２１５Ａで得られたアバターの画像データを、合成位置情報に基づいて、背景画像のアバター許容合成範囲内の特定位置にアバターが配置されるように合成し、さらに、受信モジュール２１５，２１５Ａで得られた字幕表示データを表示位置情報に基づいて合成し、ＨＭＤ４００Ａに送る表示画像データを得る。 Under the control of the control unit 201, the synthesizing unit 219 composites the avatar image data obtained by the receiving modules 215 and 215A onto the display image data obtained by the plane converter 214, based on the composition position information, so that the avatar is placed at a specific position within the allowable avatar composition range of the background image. It further composites the subtitle display data obtained by the receiving modules 215 and 215A based on the display position information, and thereby obtains the display image data to be sent to the HMD 400A.
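The placement and overlay performed by the synthesizing unit 219 can be sketched as follows. Images are modelled as nested lists of pixel values with `None` marking transparent avatar pixels, and the avatar rectangle is clamped so that it stays inside the allowable composition range; the clamping, the pixel model, and the function names are assumptions for illustration, and the avatar is assumed to be no larger than the window.

```python
def avatar_rect(center_x, center_y, width, height, window):
    """Return (left, top, width, height) of the avatar inside the window.

    window is (left, top, width, height) of the allowable composition range.
    Clamping to the window is an assumption of this sketch.
    """
    win_left, win_top, win_w, win_h = window
    left = center_x - width // 2
    top = center_y - height // 2
    left = max(win_left, min(left, win_left + win_w - width))
    top = max(win_top, min(top, win_top + win_h - height))
    return left, top, width, height


def composite(background, sprite, left, top):
    """Copy non-transparent sprite pixels onto a copy of the background."""
    out = [row[:] for row in background]
    for y, row in enumerate(sprite):
        for x, pixel in enumerate(row):
            if pixel is not None:          # None marks a transparent pixel
                out[top + y][left + x] = pixel
    return out
```

Subtitle bitmaps could be overlaid with the same `composite` call, using the subtitle display position in place of the avatar rectangle.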

なお、図１４に示す受信系２００Ｒの構成例では、自身のクライアント装置２００の送信系２００Ｔ（図８参照）で生成されたクライアント送信ストリームを処理する受信モジュール２１５Ａを備える例を示した。しかし、この受信モジュール２１５Ａの代わりに、自身のクライアント装置２００の送信系２００Ｔ（図８参照）で生成されたアバターメタ情報、符号化音声データ、ＴＴＭＬを処理するモジュール（図１５に示す受信モジュール２１５Ａのコンテナデコーダ２２１を除いた構成）、あるいはアバターメタ情報、符号化音声データ、ＴＴＭＬに対応した他のデータ、情報を入力して同様の出力を得るモジュールであってもよい。 Note that the configuration example of the receiving system 200R shown in FIG. 14 shows an example in which the receiving module 215A processes the client transmission stream generated by the transmitting system 200T (see FIG. 8) of the client device 200 itself. However, instead of this receiving module 215A, a module (receiving module 215A shown in FIG. 15) that processes avatar meta information, encoded audio data, and TTML generated by the transmitting system 200T (see FIG. 8) of its own client device 200 configuration excluding the container decoder 221), or a module that inputs avatar meta information, encoded audio data, other data and information corresponding to TTML and obtains the same output.

図１４に示す受信系２００Ｒの動作を簡単に説明する。ネットワークインタフェース２１１では、サーバ１００から、ネットワーク３００を介して、サーバ配信ストリームが受信される。また、ネットワークインタフェース２１１では、他のクライアント装置２００から、ネットワーク３００を介して、クライアント送信ストリームが受信される。 The operation of the receiving system 200R shown in FIG. 14 will be briefly described. The network interface 211 receives the server distribution stream from the server 100 via the network 300 . Also, the network interface 211 receives a client transmission stream from another client device 200 via the network 300 .

ネットワークインタフェース２１１で受信されたサーバ配信ストリームは、コンテナデコーダ２１２に供給される。コンテナデコーダ２１２では、サーバ配信ストリーム（ＭＰ４ストリーム）からビデオストリームおよびオーディオストリームが取り出される。 A server-delivered stream received at network interface 211 is provided to container decoder 212 . The container decoder 212 extracts a video stream and an audio stream from the server delivery stream (MP4 stream).

また、コンテナデコーダ２１２では、イニシャライゼーション・セグメント（ＩＳ）や“ｍｏｏｆ”のボックスに定義される“ｕｄｔａ”のボックスに存在するビデオ・アトリビュート・インフォメーション・ボックスが取り出され、制御部２０１に送られる。これにより、制御部２０１では、カメラの撮像状態を示すキャプチャ情報と、カメラの位置（撮像位置）を示す位置情報（ＧＰＳデータ）と、背景画像におけるアバターの許容合成範囲を示す情報が認識される。 In addition, the container decoder 212 extracts the video attribute information box present in the "udta" box defined in the initialization segment (IS) or the "moof" box, and sends it to the control unit 201. As a result, the control unit 201 recognizes the capture information indicating the imaging state of the camera, the position information (GPS data) indicating the position of the camera (imaging position), and the information indicating the allowable composition range of the avatar in the background image.

また、コンテナデコーダ２１２で取り出されたビデオストリームは、ビデオデコーダ２１３に供給される。ビデオデコーダ２１３では、ビデオストリームにデコード処理が施されて、背景画像の画像データが得られる。 Also, the video stream extracted by the container decoder 212 is supplied to the video decoder 213 . The video decoder 213 decodes the video stream to obtain image data of the background image.

また、ビデオデコーダ２１３では、ビデオストリームに挿入されているパラメータセットやＳＥＩメッセージが抽出され、制御部２０１に送られる。この抽出情報には、ビデオ・アトリビュート・インフォメーション・ＳＥＩメッセージ（図４参照）も含まれる。これにより、制御部２０１では、カメラの撮像状態を示すキャプチャ情報と、カメラの位置（撮像位置）を示す位置情報（ＧＰＳデータ）と、背景画像におけるアバターの許容合成範囲を示す情報が認識される。 Also, the video decoder 213 extracts the parameter set and SEI message inserted in the video stream and sends them to the control unit 201 . This extracted information also includes a video attribute information SEI message (see FIG. 4). As a result, the control unit 201 recognizes the capture information indicating the imaging state of the camera, the position information (GPS data) indicating the position of the camera (imaging position), and the information indicating the allowable composition range of the avatar in the background image. .

ビデオデコーダ２１３で得られた背景画像の画像データは、プレーンコンバータ２１４に供給される。プレーンコンバータ２１４では、背景画像の画像データが非線形な画像データである場合には線形な画像データに変換される。また、プレーンコンバータ２１４では、背景画像の画像データから、ＨＭＤ４００Ａの表示視野角に対応した部分だけが切り出され、表示用画像データが得られる。 Image data of the background image obtained by the video decoder 213 is supplied to the plane converter 214 . In the plane converter 214, if the image data of the background image is non-linear image data, it is converted into linear image data. Also, in the plane converter 214, only a portion corresponding to the display viewing angle of the HMD 400A is cut out from the image data of the background image to obtain image data for display.

例えば、背景画像におけるアバターの許容合成範囲の大きさはＨＭＤ４００Ａの表示視野角に対応して設定されており、デフォルトの状態では、この許容合成範囲に対応した画像データが切り出されて表示用画像データとされる。その後、切出し範囲は、例えばＨＭＤ搭載のセンサで検出される頭部姿勢に応じて変更されていく。 For example, the size of the allowable composition range of the avatar in the background image is set to correspond to the display viewing angle of the HMD 400A, and in the default state the image data corresponding to this allowable composition range is cut out and used as the display image data. After that, the cutout range is changed according to, for example, the head posture detected by a sensor mounted on the HMD.

また、コンテナデコーダ２１２で取り出されたオーディオストリームは、オーディオデコーダ２１６に供給される。オーディオデコーダ２１６では、オーディオストリームにデコード処理が施されて、ヘッドフォン（ＨＰ）４００Ｂでの音声再生のための２チャネルの音声データが得られる。なお、デコード処理で５．１チャネル等の多チャネルの音声データが得られる場合、オーディオデコーダ２１６では、さらに、２チャネルにダウンミックスされて２チャネルの音声データとされる。 Also, the audio stream extracted by the container decoder 212 is supplied to the audio decoder 216. The audio decoder 216 decodes the audio stream to obtain 2-channel audio data for audio reproduction on the headphones (HP) 400B. Note that when the decoding process yields multi-channel audio data such as 5.1-channel audio data, the audio decoder 216 further downmixes it to 2 channels to obtain the 2-channel audio data.

また、ネットワークインタフェース２１１で受信された他のクライアント装置２００からのクライアント送信ストリームは、受信モジュール２１５に供給される。この受信モジュール２１５では、クライアント送信ストリームが処理され、アバターの画像データとそのアバターの合成位置情報、アバターに対応した字幕の表示データとその字幕の表示位置情報、さらにアバターに対応した２チャネルの音声データが得られる（図１５参照）。 Also, client transmission streams from other client devices 200 received by the network interface 211 are supplied to the reception module 215 . In this receiving module 215, the client transmission stream is processed, and the image data of the avatar, the combined position information of the avatar, the display data of the caption corresponding to the avatar and the display position information of the caption, and the two-channel audio corresponding to the avatar are displayed. Data are obtained (see Figure 15).

また、自身のクライアント装置２００の送信系２００Ｔ（図８参照）で生成されたクライアント送信ストリームは、受信モジュール２１５Ａに供給される。この受信モジュール２１５Ａでは、受信モジュール２１５と同様に、クライアント送信ストリームが処理され、アバターの画像データとそのアバターの合成位置情報、アバターに対応した字幕の表示データとその字幕の表示位置情報、さらにアバターに対応した２チャネルの音声データが得られる（図１５参照）。 Also, the client transmission stream generated by the transmission system 200T (see FIG. 8) of its own client device 200 is supplied to the reception module 215A. In this receiving module 215A, similarly to the receiving module 215, the client transmission stream is processed, and the image data of the avatar and the combined position information of the avatar, the display data of the caption corresponding to the avatar and the display position information of the caption, and further the avatar 2-channel audio data corresponding to are obtained (see FIG. 15).

オーディオデコーダ２１６で得られた２チャネルの音声データは、ミクサ２１８に供給される。また、このミクサ２１８には、受信モジュール２１５，２１５Ａで得られた２チャネルの音声データが供給される。ミクサ２１８では、オーディオデコーダ２１６で得られた２チャネルの音声データと、受信モジュール２１５，２１５Ａで得られた２チャネルの音声データが合成されて、ヘッドフォン（ＨＰ）４００Ｂに送る２チャネルの音声データが得られる。 The 2-channel audio data obtained by the audio decoder 216 is supplied to the mixer 218. The mixer 218 is also supplied with the 2-channel audio data obtained by the receiving modules 215 and 215A. The mixer 218 combines the 2-channel audio data obtained by the audio decoder 216 with the 2-channel audio data obtained by the receiving modules 215 and 215A to obtain the 2-channel audio data to be sent to the headphones (HP) 400B.

また、プレーンコンバータ２１４で得られた表示用画像データは合成部２１９に供給される。また、この合成部２１９には、受信モジュール２１５，２１５Ａで得られたアバターの画像データおよびアバター合成位置情報や、字幕表示データおよび表示位置情報が供給される。合成部２１９では、プレーンコンバータ２１４で得られた表示用画像データに、受信モジュール２１５，２１５Ａで得られたアバターの画像データが、合成位置情報に基づいて、背景画像のアバター許容合成範囲内の特定位置にアバターが配置されるように合成され、さらに、受信モジュール２１５，２１５Ａで得られた字幕表示データが表示位置情報に基づいて合成され、ＨＭＤ４００Ａに送る表示画像データが得られる。 Also, the display image data obtained by the plane converter 214 is supplied to the synthesizing unit 219. The synthesizing unit 219 is also supplied with the avatar image data and avatar composition position information, and the subtitle display data and display position information, obtained by the receiving modules 215 and 215A. In the synthesizing unit 219, the avatar image data obtained by the receiving modules 215 and 215A is composited onto the display image data obtained by the plane converter 214, based on the composition position information, so that the avatar is placed at a specific position within the allowable avatar composition range of the background image. Further, the subtitle display data obtained by the receiving modules 215 and 215A is composited based on the display position information, and the display image data to be sent to the HMD 400A is obtained.

図２０は、背景画像の一例を示し、矩形破線枠はアバターの許容合成範囲(sy_window)を示している。この背景画像の中心（「＋」の文字で示している）は、ビデオ・アトリビュート・インフォメーション・ＳＥＩメッセージ（図４参照）やビデオ・アトリビュート・インフォメーション・ボックス（図７参照）における「camera_direction」、「camera_V_angle」の情報に対応した位置となる。 FIG. 20 shows an example of a background image, and the rectangular dashed frame indicates the allowable composition range (sy_window) of the avatar. The center of this background image (indicated by the "+" mark) is the position corresponding to the "camera_direction" and "camera_V_angle" information in the video attribute information SEI message (see FIG. 4) or the video attribute information box (see FIG. 7).

図２１は、背景画像の許容合成範囲(sy_window)内にアバターおよび字幕が合成された状態の一例を示している。図示の例では、Ａ１，Ａ２，Ａ３の３つのアバターが合成され、さらに２つの字幕が合成されている。ここで、Ａ１のアバターと、それに関連づけられた字幕は、「clinent_id」が“０ｘＡ１”であるクライアント（クライアント装置２００）によるものである。また、Ａ２のアバターは、「clinent_id」が“０ｘＡ２”であるクライアントによるものである。また、Ａ３のアバターと、それに関連づけられた字幕は、「clinent_id」が“０ｘＡ３”であるクライアント（クライアント装置２００）によるものである。 FIG. 21 shows an example of a state in which an avatar and subtitles are synthesized within the permissible synthesis range (sy_window) of the background image. In the illustrated example, three avatars A1, A2, and A3 are synthesized, and two subtitles are synthesized. Here, the avatar of A1 and the subtitles associated with it are from the client (client device 200) whose "clinent_id" is "0xA1". Also, the avatar of A2 belongs to a client whose "clinent_id" is "0xA2". Also, the avatar of A3 and the caption associated with it are those of the client (client device 200) whose "clinent_id" is "0xA3".

以上説明したように、図１に示す空間共有表示システム１０において、クライアント装置２００では、背景画像の画像データにアバターメタ情報に基づいてアバターの画像データを生成し、このアバターの画像データを背景画像の画像データに合成するものである。そのため、クライアントのそれぞれは、共通の背景画像に他のクライアントのアバターが合成されたものを認識でき、互いのＶＲ空間を共有して良好にコミュニケーションをとることが可能となる。 As described above, in the space sharing display system 10 shown in FIG. 1, the client device 200 generates avatar image data based on the avatar meta information and composites this avatar image data with the image data of the background image. Therefore, each client can see the avatars of the other clients composited on a common background image, and the clients can share each other's VR space and communicate well.

また、図１に示す空間共有表示システム１０において、クライアント送信ストリームには、アバターメタ情報に対応した音声データがオブジェクトメタデータと共に含まれており、クライアント装置２００では、音声データにオブジェクトメタデータに応じたレンダリング処理を行ってアバターの合成位置を音像位置とする音声出力データを得ることができる。そのため、クライアントのそれぞれに、背景画像上の各アバターの合成位置からそのアバターのクライアントからの音声が出ているように知覚させることが可能となる。 In the shared space display system 10 shown in FIG. 1, the client transmission stream includes audio data corresponding to avatar meta information together with object metadata. Rendering processing can be performed to obtain audio output data having the avatar synthesis position as the sound image position. Therefore, it is possible for each client to perceive that the voice of the avatar's client is coming from the composite position of each avatar on the background image.

また、図１に示す空間共有表示システム１０において、クライアント送信ストリームには、アバターメタ情報に対応した字幕データが表示位置情報と共に含まれており、クライアント装置２００では、字幕データによる字幕がアバターの合成位置に対応した位置に表示されるように表示位置情報に基づいて字幕の表示データを背景画像の画像データに合成することができる。そのため、クライアントのそれぞれに、背景画像上の各アバターの合成位置に対応した位置にそのアバターのクライアントからの字幕を認識させることが可能となる。 In the space sharing display system 10 shown in FIG. 1, the client transmission stream includes caption data corresponding to the avatar meta information together with the display position information. The display data of the caption can be combined with the image data of the background image based on the display position information so that the caption is displayed at a position corresponding to the position. Therefore, it is possible for each client to recognize the caption from the client of the avatar at the position corresponding to the composite position of each avatar on the background image.

Further, in the space-sharing display system 10 shown in FIG. 1, information indicating the allowable composition range of avatars in the background image is inserted into the layer of the video stream obtained by encoding the image data of the background image and/or the layer of the server distribution stream containing that video stream, and is distributed in that form. Based on this information, the client device 200 can therefore easily place each client's avatar on the background image within the range intended by the server 100.
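
A minimal sketch of keeping an avatar inside the allowable composition range follows. It assumes the range is an axis-aligned rectangle in background pixel coordinates and that the requested position is given as relative coordinates between 0 and 1 within that range; the Region type and place_avatar function are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    left: int
    top: int
    width: int
    height: int


def place_avatar(allowed: Region, rel_x: float, rel_y: float,
                 avatar_w: int, avatar_h: int) -> Tuple[int, int]:
    """Return the (left, top) pixel position that keeps the avatar inside `allowed`."""
    if avatar_w > allowed.width or avatar_h > allowed.height:
        raise ValueError("avatar does not fit inside the allowable range")
    # Clamp out-of-range requests to the edges of the allowable range.
    rel_x = min(1.0, max(0.0, rel_x))
    rel_y = min(1.0, max(0.0, rel_y))
    left = allowed.left + round(rel_x * (allowed.width - avatar_w))
    top = allowed.top + round(rel_y * (allowed.height - avatar_h))
    return left, top
```

Because the relative coordinates are scaled by the range size minus the avatar size, the avatar's full extent stays within the range, not only its top-left corner.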

<2. Modifications>
In the embodiment described above, the client device 200 is provided separately from the HMD 400A, but the HMD 400A and the client device 200 may also be configured integrally. Although not described above, a photographed (live-action) image can also be used as an avatar.

In the embodiment described above, the container is MP4 (ISOBMFF). However, the present technology is not limited to MP4 containers and is equally applicable to containers of other formats such as MPEG-2 TS and MMT.

The present technology can also take the following configurations.
(1) A client device including:
a receiving unit that receives, from a server, a server distribution stream containing a video stream obtained by encoding image data of a background image, and receives, from another client device, a client transmission stream containing substitute image meta information for displaying a substitute image of the other client; and
a control unit that controls decoding processing of decoding the video stream to obtain the image data of the background image, substitute image data generation processing of generating image data of the substitute image based on the substitute image meta information, and image data composition processing of combining the image data of the substitute image with the image data of the background image.
(2) The client device according to (1), wherein
information indicating an allowable composition range of the substitute image in the background image is inserted in a layer of the video stream and/or a layer of the server distribution stream, and
the control unit controls the composition processing, based on the information indicating the allowable composition range, so that the substitute image is placed within the allowable composition range of the background image.
(3) The client device according to (2), wherein
the substitute image meta information includes composite position information indicating a composite position of the substitute image within the allowable composition range, and
the control unit controls the composition processing so that the substitute image is composited at the composite position indicated by the composite position information.
(4) The client device according to (2) or (3), wherein
the substitute image meta information includes size information indicating a size of the substitute image, and
the control unit controls the composition processing so that the substitute image is composited onto the background image at the size indicated by the size information.
(5) The client device according to (3), wherein
the client transmission stream contains audio data corresponding to the substitute image meta information together with object metadata, and
the control unit further controls audio output processing of rendering the audio data in accordance with the object metadata to obtain audio output data whose sound image position is the composite position of the substitute image.
(6) The client device according to (3) or (5), wherein
the client transmission stream contains subtitle data corresponding to the substitute image meta information together with display position information, and
the control unit further controls subtitle composition processing of combining subtitle display data with the image data of the background image, based on the display position information, so that the subtitle given by the subtitle data is displayed at a position corresponding to the composite position of the substitute image.
(7) The client device according to any one of (1) to (6), further including
a transmitting unit that transmits, to another client device, a client transmission stream containing substitute image meta information for displaying the device's own substitute image, wherein
the substitute image data generation processing further generates image data of the device's own substitute image based on the substitute image meta information for displaying the device's own substitute image.
(8) The client device according to any one of (1) to (7), wherein
the image data of the background image is image data of a wide viewing angle image, and
the control unit further controls image clipping processing of clipping out part of the image data of the background image to obtain display image data.
(9) A processing method for a client device, including:
a receiving step in which a receiving unit receives, from a server, a server distribution stream containing a video stream obtained by encoding image data of a background image, and receives, from another client device, a client transmission stream containing substitute image meta information for displaying a substitute image of the other client; and
a control step in which a control unit controls decoding processing of decoding the video stream to obtain the image data of the background image, substitute image data generation processing of generating image data of the substitute image based on the substitute image meta information, and image data composition processing of combining the image data of the substitute image with the image data of the background image.
(10) A server including:
an imaging unit that images a subject to obtain image data of a background image; and
a transmitting unit that transmits, to a client device, a server distribution stream containing a video stream obtained by encoding the image data of the background image, wherein
information indicating an allowable composition range of a substitute image in the background image is inserted in a layer of the video stream and/or a layer of the server distribution stream.
(11) The server according to (10), wherein the image data of the background image is image data of a wide viewing angle image.
(12) A processing method for a server, including:
an imaging step in which an imaging unit images a subject to obtain image data of a background image; and
a transmitting step in which a transmitting unit transmits, to a client device, a server distribution stream containing a video stream obtained by encoding the image data of the background image, wherein
information indicating an allowable composition range of a substitute image in the background image is inserted in a layer of the video stream and/or a layer of the server distribution stream.
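
The image clipping processing of configuration (8) can be sketched as follows, purely as an illustration. The sketch assumes the wide viewing angle image covers a full 360 degrees horizontally, so that columns wrap around at the left and right edges; the configuration itself only requires a wide viewing angle image, and the name clip_viewport is hypothetical.

```python
from typing import List

Image = List[List[int]]   # assumption: one grayscale value per pixel


def clip_viewport(image: Image, center_x: int, top: int,
                  view_w: int, view_h: int) -> Image:
    """Clip a view_w x view_h window whose horizontal centre is column center_x."""
    width = len(image[0])
    left = center_x - view_w // 2
    rows = image[top:top + view_h]
    # The modulo wraps columns that run past either edge of the panorama.
    return [[row[(left + dx) % width] for dx in range(view_w)] for row in rows]


panorama = [list(range(8)) for _ in range(2)]   # 8 columns, 2 rows
view = clip_viewport(panorama, center_x=0, top=0, view_w=4, view_h=1)
```

Here a viewer facing column 0 sees columns 6, 7, 0, and 1, which is the behaviour expected when the head orientation crosses the seam of the panorama.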

The main feature of the present technology is that the client transmission streams from other client devices contain avatar meta information, and that avatar image data generated based on this avatar meta information is combined with the image data of the background image. As a result, each client can see the avatars of the other clients composited onto a common background image, and the clients can share one another's VR spaces and communicate well (see FIGS. 2 and 21).

DESCRIPTION OF SYMBOLS
10 ... Space-sharing display system
100 ... Server
101 ... Control unit
101a ... User operation unit
102 ... Locator
103 ... Video capture
104 ... Format conversion unit
105 ... Video encoder
106 ... Audio capture
108 ... Audio encoder
109 ... Container encoder
110 ... Network interface
111 ... Bus
200 ... Client device
200T ... Transmission system
200R ... Reception system
201 ... Control unit
201a ... User operation unit
202 ... Metadata generator
203 ... Audio capture
204 ... Object information generation unit
205 ... Audio encoder
206 ... Character generation unit
207 ... Subtitle encoder
208 ... Container encoder
209 ... Network interface
210 ... Bus
211 ... Network interface
212 ... Container decoder
213 ... Video decoder
214 ... Plane converter
215, 215A ... Receiving module
216 ... Audio decoder
218 ... Mixer
219 ... Composition unit
221 ... Container encoder
222 ... Meta information analysis unit
223 ... Avatar database selection unit
223a ... Database mapping unit
224 ... Avatar database
225 ... Size conversion unit
226 ... Audio decoder
227 ... Renderer
228 ... Subtitle decoder
229 ... Font expansion unit
300 ... Network
400A ... Head-mounted display (HMD)
400B ... Headphones (HP)

Claims

1. A client device comprising:
a receiving unit that receives, from a server, a server distribution stream containing a video stream obtained by encoding image data of a background image, and receives, from another client device, a client transmission stream containing substitute image meta information for displaying a substitute image of the other client; and
a control unit that controls decoding processing of decoding the video stream to obtain the image data of the background image, substitute image data generation processing of generating image data of the substitute image based on the substitute image meta information, image data composition processing of combining the image data of the substitute image with the image data of the background image, and image data output processing of outputting, to a display unit, display image data based on the image generated by the image data composition processing.

2. The client device according to claim 1, wherein the background image includes an image captured at the server.

3. The client device according to claim 1, wherein
the client transmission stream contains text information corresponding to the substitute image meta information, and
the control unit controls the image data composition processing so that text corresponding to the text information is composited at a position corresponding to the image data of the substitute image.

4. The client device according to any one of claims 1 to 3, wherein
the substitute image meta information includes size information indicating a size of the substitute image, and
the control unit controls the image data composition processing so that the substitute image is composited onto the background image at the size indicated by the size information.

5. The client device according to any one of claims 1 to 4, wherein
the image data of the background image is image data of a wide viewing angle image, and
the control unit further controls image clipping processing of clipping out part of the image data of the background image to obtain the display image data.

6. The client device according to any one of claims 1 to 5, wherein
the client device is connected to an external device having the display unit.

7. The client device according to any one of claims 1 to 5,
further comprising the display unit.

8. The client device according to any one of claims 1 to 7, wherein
the display unit is a head-mounted display.

9. A display system comprising:
the client device according to any one of claims 1 to 8; and
the server, wherein
the server includes
an imaging unit that images a subject to obtain the image data of the background image, and
a transmitting unit that transmits, to the client device, the server distribution stream containing the video stream obtained by encoding the image data of the background image.

10. A display system comprising:
the client device according to claim 6; and
the external device, wherein
the external device includes the display unit, which displays the display image data.

11. A processing method for a client device, comprising:
a receiving step in which a receiving unit receives, from a server, a server distribution stream containing a video stream obtained by encoding image data of a background image, and receives, from another client device, a client transmission stream containing substitute image meta information for displaying a substitute image of the other client; and
a control step in which a control unit controls decoding processing of decoding the video stream to obtain the image data of the background image, substitute image data generation processing of generating image data of the substitute image based on the substitute image meta information, image data composition processing of combining the image data of the substitute image with the image data of the background image, and image data output processing of outputting, to a display unit, display image data based on the image generated by the image data composition processing.

12. The processing method for a client device according to claim 11, wherein
the background image includes an image captured at the server.

13. The processing method for a client device according to claim 11 or 12, wherein
the client transmission stream contains text information corresponding to the substitute image meta information, and
in the control step, the control unit controls the image data composition processing so that text corresponding to the text information is composited at a position corresponding to the image data of the substitute image.

14. The processing method for a client device according to any one of claims 11 to 13, wherein
the substitute image meta information includes size information indicating a size of the substitute image, and
in the control step, the control unit controls the image data composition processing so that the substitute image is composited onto the background image at the size indicated by the size information.

15. The processing method for a client device according to any one of claims 11 to 14, wherein
the image data of the background image is image data of a wide viewing angle image, and
in the control step, the control unit further controls image clipping processing of clipping out part of the image data of the background image to obtain the display image data.

16. The processing method for a client device according to any one of claims 11 to 15, wherein
the client device is connected to an external device having the display unit.

17. The processing method for a client device according to any one of claims 11 to 15, wherein
the client device includes the display unit.

18. The processing method for a client device according to any one of claims 11 to 17, wherein
the display unit is a head-mounted display.

19. A program for causing a computer to execute the processing method for a client device according to any one of claims 11 to 18.