JP2006014180A

JP2006014180A - Data processor, data processing method and program therefor

Info

Publication number: JP2006014180A
Application number: JP2004191539A
Authority: JP
Inventors: Toshiyuki Nakagawa (中川利之)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2004-06-29
Filing date: 2004-06-29
Publication date: 2006-01-12

Abstract

PROBLEM TO BE SOLVED: To provide a data processor, a data processing method, and a program therefor that, when reproducing encoded multimedia data composed of a plurality of objects, can reliably composite and reproduce media object data such as moving pictures and audio in synchronization with the scene.

SOLUTION: A media decoder circuit 104 decodes the plurality of object data items demultiplexed from the multimedia data by a demultiplexer circuit 102. A scene description data decoder circuit 103 decodes scene description data contained in the received multimedia data. The demultiplexer circuit 102 outputs an event issuing instruction at the timing when it starts reading object data. An event generating circuit 105 generates an event in response to the event issuing instruction. A scene composing circuit 106 composes the plurality of decoded object data based on the scene description data, in accordance with the timing at which the event is received.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a data processing apparatus, a data processing method, and a program therefor that separate and decode multimedia data containing, for example, encoded moving image object data, audio object data, and scene description data, and that compose and output the decoded data.

MPEG (Moving Picture Experts Group)-1, MPEG-2, and similar standards are known as international coding standards for compressing, encoding, multiplexing, and transmitting or storing moving images and audio, and for demultiplexing and decoding them.

ISO/IEC 14496 Part 1 (MPEG-4 Systems), on the other hand, standardizes a technique for multiplexing and synchronizing encoded bit streams of multimedia data that contain multiple objects such as still images, moving images, audio, text, and CG (computer graphics).

Unlike conventional multimedia data, an MPEG-4 data stream of this kind contains, in addition to still images, moving images, and audio, BIFS (Binary Format for Scenes) as the information for arranging objects such as text and CG in space and time. BIFS is an extension of VRML (Virtual Reality Modeling Language) that can also handle natural moving images and audio, and it describes an MPEG-4 scene in binary form.

Because each object that makes up such multimedia data (still images, moving images, audio, and so on) is encoded with its own optimal method before transmission, the decoding side also decodes each object individually. The decoded objects are then arranged in time and space according to the scene description information, the time axis of each data item is synchronized with the player's internal time axis, and the scene is composed and reproduced.

Besides VRML and BIFS, common methods for describing scene composition include HTML (Hypertext Markup Language), as well as SMIL (Synchronized Multimedia Integration Language) and XMT (Extensible MPEG-4 Textual Format), which are written in XML (Extensible Markup Language).

When a bit stream of such multimedia data is reproduced, not only the moving images and audio but also the scene and every object in it must be composed and reproduced in synchronization. A method of composing and reproducing audio, video, and CG in synchronization has therefore been proposed (see, for example, Patent Document 1).

Another proposed method acquires subsequent content data in advance while the current content data is being reproduced. This lets playback honor the times specified by the scene description information while shortening the delay until playback starts, or until the next playback starts (see, for example, Patent Document 2).

Patent Document 1: JP-A-10-136259
Patent Document 2: JP-A-2002-268999

However, the method proposed in Patent Document 1 says nothing about the delay required to read moving images or audio and start playback. If, for example, reading moving images or audio distributed over a network takes time, they cannot be composed and reproduced according to the temporal arrangement given by the scene description information.

The method proposed in Patent Document 2 is limited to receiving and reproducing multimedia data distributed over a network, so it cannot be applied to playback from a storage medium. Moreover, depending on the network's communication speed and line conditions, it does not always guarantee synchronization between the moving images or audio and the scene.

Scene description methods such as VRML and BIFS assume an ideal operating environment in which processing is infinitely fast. In a real operating environment, however, reading moving images and audio is a heavy task, so playback takes time to start. That startup time depends on the network's communication speed and line conditions and on the processing capability of the receiving terminal. Consequently, when the communication speed is low, the line is congested, or a receiving terminal with low processing capability is used, the scene can fall out of synchronization with the media object data that makes it up, such as moving images and audio, at the start of playback or when the scene description data is updated.

The present invention was made in view of these circumstances. Its object is to provide a data processing apparatus, a data processing method, and a program therefor that, when reproducing encoded multimedia data composed of multiple objects such as moving images, audio, still images, text, and CG, can reliably compose and reproduce media object data such as moving images and audio in synchronization with the scene.

The present invention has been made to solve the problems described above. A data processing apparatus according to the present invention receives, via a network, multimedia data containing one or more items of object data relating to encoded moving images and/or audio, and reproduces the data according to scene description data. The apparatus comprises: separation means for separating the received multimedia data into units of object data; one or more first decoding means for decoding the plurality of object data items separated by the separation means; second decoding means for decoding encoded scene description data when it is received, either as part of the multimedia data or as independent data over a communication path different from the network; event generation means for generating an event in accordance with the timing at which the separation means starts reading object data or the timing at which the first decoding means decodes object data; and scene composition means for composing, in accordance with the timing at which the event generated by the event generation means is received, the plurality of object data items decoded by the first decoding means, based on the scene description data decoded by the second decoding means.

A data processing method according to the present invention uses a data processing apparatus that receives, via a network, multimedia data containing one or more items of object data relating to encoded moving images and/or audio, and reproduces the data according to scene description data. The method comprises: a separation step of separating the received multimedia data into units of object data; one or more first decoding steps of decoding the plurality of object data items separated in the separation step; a second decoding step of decoding encoded scene description data when it is received, either as part of the multimedia data or as independent data over a communication path different from the network; an event issuing step of issuing an event in accordance with the timing at which reading of object data starts in the separation step or the timing at which object data is decoded in the first decoding step; and a scene composition step of composing, in accordance with the timing at which the event issued in the event issuing step is received, the plurality of object data items decoded in the first decoding step, based on the scene description data decoded in the second decoding step.

A program according to the present invention is a program for a data processing apparatus that receives, via a network, multimedia data containing one or more items of object data relating to encoded moving images and/or audio, and reproduces the data according to scene description data. The program causes a computer to execute: a separation step of separating the received multimedia data into units of object data; one or more first decoding steps of decoding the plurality of object data items separated in the separation step; a second decoding step of decoding encoded scene description data when it is received, either as part of the multimedia data or as independent data over a communication path different from the network; an event issuing step of issuing an event in accordance with the timing at which reading of object data starts in the separation step or the timing at which object data is decoded in the first decoding step; and a scene composition step of composing, in accordance with the timing at which the event issued in the event issuing step is received, the plurality of object data items decoded in the first decoding step, based on the scene description data decoded in the second decoding step.

A recording medium according to the present invention is a computer-readable recording medium storing a program for a data processing apparatus that receives, via a network, multimedia data containing one or more items of object data relating to encoded moving images and/or audio, and reproduces the data according to scene description data. The program causes a computer to execute: a separation step of separating the received multimedia data into units of object data; one or more first decoding steps of decoding the plurality of object data items separated in the separation step; a second decoding step of decoding encoded scene description data when it is received, either as part of the multimedia data or as independent data over a communication path different from the network; an event issuing step of issuing an event in accordance with the timing at which reading of object data starts in the separation step or the timing at which object data is decoded in the first decoding step; and a scene composition step of composing, in accordance with the timing at which the event issued in the event issuing step is received, the plurality of object data items decoded in the first decoding step, based on the scene description data decoded in the second decoding step.

With the data processing apparatus, data processing method, and program according to the present invention, when the objects in multimedia data composed of a plurality of objects are separated and reproduced, media object data such as moving images and audio can be reliably composed and reproduced in synchronization with the scene, regardless of the type of communication line, the line conditions, or the processing capability of the terminal.

Embodiments of the present invention will now be described with reference to the drawings.
[First Embodiment]
FIG. 1 shows the basic configuration of a multimedia data receiving apparatus (hereinafter simply the receiving apparatus) 101 according to a first embodiment of the present invention. FIG. 1 shows the circuit configuration together with the flow of data between the circuits.

As shown in FIG. 1, the receiving apparatus 101 comprises a demultiplexing circuit (separation means) 102, a scene description data decoding circuit (second decoding means) 103, a media decoding circuit (first decoding means) 104, an event generation circuit 105, a scene composition circuit 106, and an output device 107. In FIG. 1, reference numeral 100 denotes a transmission path, typified by various networks; in this embodiment it is a network over which processed and encoded multimedia data is distributed. Although this embodiment acquires the multimedia data from the transmission path 100, acquisition is not limited to a communication path such as a broadcast or communication network; the multimedia data may instead be acquired, for example, by reading it from a recording medium such as a DVD-RAM.

When the receiving apparatus 101 receives multimedia data distributed over the network via the transmission path 100, it inputs the data to the demultiplexing circuit 102. The demultiplexing circuit 102 separates the received multimedia data into scene description data and media object data such as still images, moving images, and audio, and outputs them to the scene description data decoding circuit 103 and the media decoding circuit 104, respectively. As noted above, the receiving apparatus 101 may also be provided with a component for reading data from a recording medium, in which case multimedia data read from that medium is input to the demultiplexing circuit 102.

Although FIG. 1 shows a single media decoding circuit 104, the apparatus of this embodiment can decode still image, moving image, and audio object data even when the multimedia data contains several objects of each kind. For example, the media decoding circuit 104 is assumed to consist of multiple decoding circuits each for still images, moving images, and audio.
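The separation and per-object decoding described in the two paragraphs above can be sketched as follows. This is a minimal illustration, not the apparatus's actual design: the `Packet` type, the stream-type strings, `DecoderPool`, and the pass-through `EchoDecoder` are all hypothetical names introduced here, and real decoders (JPEG, MPEG-4, AAC, and so on) would replace the placeholder.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Packet:
    stream_type: str  # hypothetical labels: "scene", "image", "video", "audio"
    object_id: int    # distinguishes several objects of the same media type
    payload: bytes


def demultiplex(packets):
    """Split packets into scene description packets and per-object media
    queues keyed by (media type, object id)."""
    scene = []
    media = defaultdict(list)
    for p in packets:
        if p.stream_type == "scene":
            scene.append(p)
        else:
            media[(p.stream_type, p.object_id)].append(p)
    return scene, dict(media)


class DecoderPool:
    """Creates one decoder per (media type, object id) on demand, standing in
    for the multiple still image / moving image / audio decoding circuits."""

    def __init__(self, factories):
        self.factories = factories  # media type -> decoder class
        self.instances = {}

    def decoder_for(self, key):
        if key not in self.instances:
            media_type, _ = key
            self.instances[key] = self.factories[media_type]()
        return self.instances[key]


class EchoDecoder:
    """Placeholder that returns its input unchanged."""

    def decode(self, payload):
        return payload


pool = DecoderPool({"image": EchoDecoder, "video": EchoDecoder, "audio": EchoDecoder})
packets = [
    Packet("scene", 0, b"bifs"),
    Packet("video", 0, b"v0"),
    Packet("video", 1, b"w0"),
    Packet("audio", 0, b"a0"),
    Packet("video", 0, b"v1"),
]
scene, media = demultiplex(packets)
decoded = {
    key: [pool.decoder_for(key).decode(p.payload) for p in queue]
    for key, queue in media.items()
}
```

Keying decoders by (media type, object id) is one simple way to give each of several concurrent video or audio objects its own decoder state.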

The still image object data is data that has been high-efficiency (compression) encoded with, for example, the well-known JPEG scheme. The moving image object data is data that has been high-efficiency encoded with, for example, the well-known MPEG-2, MPEG-4, or H.263 scheme. The audio object data is data that has been subjected to high-efficiency encoding such as the well-known CELP (Code Excited Linear Prediction), AAC (Advanced Audio Coding), or TwinVQ (transform-domain weighted interleave vector quantization) coding.

The encoded scene description data and media object data are decoded by the scene description data decoding circuit 103 and the media decoding circuit 104, respectively, and supplied to the scene composition circuit 106. The scene composition circuit 106 composes the scene and the decoded media object data based on the decoded scene description data. The resulting final multimedia data sequence is supplied to the output device 107, typified by a display, speaker, or printer, and reproduced.

In addition, the demultiplexing circuit 102 has a function (issue instruction means) of sending an event issue instruction to the event generation circuit 105 when it starts reading the moving image or audio associated with each node. On receiving the event issue instruction, the event generation circuit 105 sends an event to the scene composition circuit 106.
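The chain demultiplexer, event generator, scene compositor can be sketched as three small objects wired by callbacks. This is a hedged sketch, not the patent's implementation: the class names, the millisecond clock, and in particular the assumption that the compositor anchors a node's local timeline at the moment the node's event arrives are illustrative choices made here.

```python
class SceneCompositor:
    """Hypothetical compositor: records, per node, when that node's event
    arrived and measures the node's local time from that moment."""

    def __init__(self):
        self.start_ms = {}

    def on_event(self, node_name, now_ms):
        # Only the first event for a node anchors its timeline (assumption).
        self.start_ms.setdefault(node_name, now_ms)

    def local_time_ms(self, node_name, now_ms):
        start = self.start_ms.get(node_name)
        if start is None:
            return None  # media not started yet: the node is not animated
        return now_ms - start


class EventGenerator:
    def __init__(self, compositor):
        self.compositor = compositor

    def on_issue_instruction(self, node_name, now_ms):
        # Turn the demultiplexer's instruction into an event for the compositor.
        self.compositor.on_event(node_name, now_ms)


class Demultiplexer:
    def __init__(self, event_generator):
        self.event_generator = event_generator
        self.reading = set()

    def start_reading(self, node_name, now_ms):
        # Issue the instruction once, when reading of this node's media begins.
        if node_name not in self.reading:
            self.reading.add(node_name)
            self.event_generator.on_issue_instruction(node_name, now_ms)


compositor = SceneCompositor()
demux = Demultiplexer(EventGenerator(compositor))
demux.start_reading("MOVIE", now_ms=500)  # movie data starts arriving 500 ms late
```

Because the node's clock starts when its media actually starts, a late start shifts the node's animation by the same amount instead of letting the scene run ahead of the media.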

Next, the data structure of the multimedia data received by the receiving apparatus 101 in this embodiment will be described using a specific example.
FIG. 2 shows an example of the overall data structure of multimedia data 200 in this embodiment. As shown in FIG. 2, the multimedia data 200 consists of packets such as scene description data 201 and media data 202 to 205. The scene description data 201 is transmitted before the media data 202 to 205 that make up the scene, and the receiving apparatus 101 likewise acquires the scene description data 201 first, followed by the media data 202 to 205.

Time information for synchronization management (DTS, CTS, and so on) is added to the packet header of each packet of the scene description data 201 and the media data 202 to 205. The DTS (Decoding Time Stamp) indicates the time by which the packet must have arrived in a decoding buffer (not shown) preceding the scene description data decoding circuit 103 and the media decoding circuit 104. The CTS (Composition Time Stamp) indicates the time at which the packet must be present in a composition memory (not shown) following those decoding circuits. Each packet is decoded at the DTS time given in its packet header and becomes valid at or after its CTS time. In the ideal operating environment described above, in which processing is infinitely fast, media data can be decoded and reproduced exactly according to the DTS and CTS.
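A minimal sketch of the two timestamp rules in the paragraph above. The `TimedPacket` type and its `arrival_ms` field are hypothetical: the arrival time is a simulated quantity added here only to show when a real, non-ideal receiver misses a DTS deadline.

```python
from dataclasses import dataclass


@dataclass
class TimedPacket:
    name: str
    dts_ms: int      # must be in the decoding buffer by this time; decoded then
    cts_ms: int      # must be in composition memory by this time; valid from then
    arrival_ms: int  # simulated time the packet actually reached the buffer


def late_packets(packets):
    """Names of packets that arrive after their DTS and so cannot be
    decoded on schedule."""
    return [p.name for p in packets if p.arrival_ms > p.dts_ms]


def is_valid(packet, now_ms):
    """A decoded unit may be used for composition at or after its CTS."""
    return now_ms >= packet.cts_ms


packets = [
    TimedPacket("BIFS", 0, 0, 0),
    TimedPacket("Frame0", 0, 0, 500),        # 500 ms late
    TimedPacket("Frame10", 500, 500, 1000),  # 500 ms late
]
```

In the ideal environment every `arrival_ms` would be at or before its `dts_ms`, and `late_packets` would return an empty list.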

A specific example of the scene description data 201 shown in FIG. 2 will now be described. FIG. 3 shows a description example of the scene description data 201. As shown in FIG. 3, the scene description data 201 takes the form of BIFS data 300, written using BIFS (Binary Format for Scenes) as the scene description language. Here, a scene is the screen layout and temporal structure presented to the viewer; the MPEG-4 Systems part standardizes BIFS as its scene description language. For simplicity, FIG. 3 shows the BIFS data as text.

The BIFS data 300 begins with a Group node 301. Every BIFS scene begins with a node of the type called SFTopNode, and the Group node 301 is one such SFTopNode. In the children field of the Group node 301, information about the moving image is described as a Transform2D node 302. The Transform2D node 302 is defined with the name MOVIE using the keyword DEF; hereinafter it is referred to simply as MOVIE 302.

The moving image data actually displayed is defined by a MovieTexture node 303, whose url field contains "test.mpeg", indicating the location of the moving image object data. Here, "test.mpeg" is, for example, a moving image file in the MPEG-1 Video format.

The TimeSensor node 304 generates events as time passes: starting at time 0 s (field startTime 0) and lasting 1 s (field cycleInterval 1), it outputs fraction_changed events with values from 0.0 to 1.0. The TimeSensor node 304 is defined with the name TIMER using the keyword DEF; hereinafter it is referred to simply as TIMER 304.
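The fraction_changed output of TIMER 304 can be modeled as elapsed time divided by cycleInterval. This sketch assumes the sensor does not loop (loop FALSE, the VRML default), so after one cycle the output stays at its final value 1.0; the function name is hypothetical.

```python
def time_sensor_fraction(now_s, start_time=0.0, cycle_interval=1.0):
    """fraction_changed of a non-looping TimeSensor at time now_s, or None
    before startTime. Holding 1.0 after the cycle stands in for the sensor's
    last emitted value."""
    if now_s < start_time:
        return None
    elapsed = now_s - start_time
    return min(elapsed / cycle_interval, 1.0)
```

With the field values from FIG. 3 (startTime 0, cycleInterval 1), the fraction rises linearly from 0.0 at 0 s to 1.0 at 1 s.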

The PositionInterpolator2D node 305 is one of the interpolator nodes. It performs linear interpolation on input values from 0 to 1 (field key [0 1]) and outputs values from (1, 1) to (2, 2) (field keyValue [1 1, 2 2]) as value_changed events; for an input value of 0.5, for example, it outputs (1.5, 1.5). The PositionInterpolator2D node 305 is defined with the name SCALE using the keyword DEF; hereinafter it is referred to simply as SCALE 305.
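The interpolation performed by SCALE 305 is piecewise linear over the key/keyValue pairs. A minimal sketch follows; the function name is hypothetical, and with only the two keys of FIG. 3 there is a single segment.

```python
from bisect import bisect_right


def interpolate_2d(fraction, keys, key_values):
    """Piecewise-linear interpolation of 2D points, as a PositionInterpolator2D
    does: keys are ascending fractions, key_values the matching (x, y) points."""
    if fraction <= keys[0]:
        return key_values[0]
    if fraction >= keys[-1]:
        return key_values[-1]
    i = bisect_right(keys, fraction) - 1  # segment keys[i] .. keys[i + 1]
    t = (fraction - keys[i]) / (keys[i + 1] - keys[i])
    (x0, y0), (x1, y1) = key_values[i], key_values[i + 1]
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))


SCALE_KEYS = [0.0, 1.0]                   # key [0 1]
SCALE_VALUES = [(1.0, 1.0), (2.0, 2.0)]   # keyValue [1 1, 2 2]
```

Calling `interpolate_2d(0.5, SCALE_KEYS, SCALE_VALUES)` reproduces the (1.5, 1.5) example given above.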

ROUTE statement 306 routes the fraction_changed output event of TIMER 304 to the set_fraction input event of SCALE 305. Through this ROUTE connection, linear interpolation is performed and the result is sent out as the value_changed output event of SCALE 305.

ROUTE statement 307 then routes the value_changed output event of SCALE 305 to the scale field of MOVIE 302. Through this ROUTE connection, the display scale of the moving image MOVIE 302 is changed.
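The two ROUTE statements form a small event cascade. The toy router below is illustrative only (it is not a BIFS decoder, and `EventRouter` and its methods are hypothetical names): each route copies an emitted value into the destination field, and an optional handler on the destination may emit a further event. SCALE's handler is written directly as 1 + f, which is the linear interpolation from (1, 1) to (2, 2).

```python
from collections import defaultdict


class EventRouter:
    """Toy ROUTE mechanism: ROUTE src.eventOut TO dst.eventIn."""

    def __init__(self):
        self.routes = defaultdict(list)  # (node, eventOut) -> [(node, eventIn)]
        self.handlers = {}               # (node, eventIn) -> fn(value) -> (eventOut, value) | None
        self.fields = {}                 # (node, eventIn) -> last value received

    def route(self, src, event_out, dst, event_in):
        self.routes[(src, event_out)].append((dst, event_in))

    def emit(self, node, event_out, value):
        for dst, event_in in self.routes[(node, event_out)]:
            self.fields[(dst, event_in)] = value
            handler = self.handlers.get((dst, event_in))
            if handler is not None:
                result = handler(value)
                if result is not None:
                    self.emit(dst, *result)  # cascade to the next route


def scale_set_fraction(f):
    # keyValue (1,1) -> (2,2): x = 1 + f * (2 - 1), same for y
    return "value_changed", (1 + f, 1 + f)


router = EventRouter()
router.handlers[("SCALE", "set_fraction")] = scale_set_fraction
router.route("TIMER", "fraction_changed", "SCALE", "set_fraction")  # ROUTE 306
router.route("SCALE", "value_changed", "MOVIE", "scale")            # ROUTE 307
router.emit("TIMER", "fraction_changed", 0.5)
```

Real VRML/BIFS event cascades also carry timestamps and loop-breaking rules, which this sketch omits.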

Thus, according to the BIFS data 300, the scale of the moving image object data "test.mpeg" is enlarged from (1, 1) to (2, 2) over the one second starting at time 0 s. At 0.5 s, for example, it is displayed at 1.5 times its width and height. Detailed description of the other nodes and fields is omitted here.
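Combining TIMER 304, SCALE 305, and ROUTE 307, the scale applied to MOVIE 302 can be written as a function of scene time t. This worked form assumes a non-looping TimeSensor (the VRML default), with the fraction held at 1 once the cycle ends.

```latex
\[
f(t) = \min\!\left(\frac{t - t_{\mathrm{start}}}{T_{\mathrm{cycle}}},\ 1\right), \qquad t \ge t_{\mathrm{start}}
\]
\[
\mathbf{s}(t) = (1,1) + f(t)\,\bigl[(2,2) - (1,1)\bigr] = \bigl(1 + f(t),\ 1 + f(t)\bigr)
\]
\[
t_{\mathrm{start}} = 0\ \mathrm{s},\ T_{\mathrm{cycle}} = 1\ \mathrm{s}:\quad
\mathbf{s}(0.5\ \mathrm{s}) = (1.5,\ 1.5), \qquad \mathbf{s}(t) = (2,\ 2)\ \text{for } t \ge 1\ \mathrm{s}
\]
```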

Next, an example of multimedia data containing the BIFS data 300 shown in FIG. 3 will be described, together with how its decoding and reproduction differ depending on the operating environment.
FIG. 4A shows an example of multimedia data containing the BIFS data 300 shown in FIG. 3. As shown in FIG. 4A, the multimedia data 400 consists of the BIFS data 300 described above and the frames FrameN (N = 0, 1, 2, ...) of the moving image object data "test.mpeg". FIG. 4B shows how the decoding and reproduction of the multimedia data 400 shown in FIG. 4A differ depending on the operating environment.

In this embodiment, the packet of the BIFS data 300 in the multimedia data 400 carries 0 msec (milliseconds) as both its DTS and CTS, and each frame FrameN of the moving image object data "test.mpeg" carries 50 × N msec as both its DTS and CTS.

When the multimedia data 400 with these DTS and CTS values is decoded and reproduced in an ideal operating environment, as shown in the upper part of FIG. 4B, the BIFS data 300 and the moving image object data Frame0 (401) are decoded and reproduced at time 0 msec. The moving image object data Frame10 (402) is decoded and reproduced at time 500 msec. Likewise, Frame20 (403) is decoded and reproduced at time 1000 msec, and Frame30 (404) at time 1500 msec. Thus, in an ideal operating environment where processing is infinitely fast, the BIFS data 300 and the moving image object data FrameN are decoded and reproduced at exactly the timing given by their DTS and CTS.
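The ideal schedule follows directly from the timestamp rule DTS = CTS = 50 × N msec, which corresponds to 20 frames per second. A minimal sketch (the function names are hypothetical):

```python
FRAME_PERIOD_MS = 50  # DTS = CTS = 50 x N msec, i.e. 1000 / 50 = 20 frames per second


def frame_timestamp_ms(n):
    """DTS/CTS of FrameN of "test.mpeg" in this example."""
    return FRAME_PERIOD_MS * n


def ideal_schedule(frame_numbers):
    """In the ideal (infinitely fast) environment each frame is decoded and
    reproduced exactly at its timestamp."""
    return {f"Frame{n}": frame_timestamp_ms(n) for n in frame_numbers}


schedule = ideal_schedule([0, 10, 20, 30])
```

The result matches the times listed above for Frame0 (401) through Frame30 (404).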

Next, example screens will be described for the case where the multimedia data 400 is decoded and reproduced at the ideal timing shown in the upper part of FIG. 4B.
FIG. 5 shows example screens over time when the multimedia data 400 is decoded and reproduced at the ideal timing shown in the upper part of FIG. 4B. As shown in FIG. 5, at time 0 msec the moving image Frame0 (401) is reproduced at its original width and height (scale 1) (screen 500). Next, at time 500 msec, Frame10 (402) is reproduced at 1.5 times its width and height (screen 501). Likewise, at times 1000 msec and 1500 msec, Frame20 (403) and Frame30 (404) are reproduced at twice their width and height (screens 502 and 503). In this way, each frame FrameN of the moving image object data is reproduced at the scale intended by the content creator, at the timing given by its DTS and CTS.

However, delays may occur in receiving or decoding the moving image object data FrameN that follows the BIFS data 300 in the multimedia data 400. This can happen when the network to which the transmission path 100 is connected is slow, when the line is congested, or when the processing capability of the receiving apparatus 101 is low. In such cases, FrameN may not be decoded and played back at the timing given by the time information DTS and CTS. The lower part of FIG. 4(b) illustrates this case.

Specifically, the delayed environment shown in the lower part of FIG. 4(b) is an example in which a delay of 500 msec occurs in receiving or decoding the moving image object data. The BIFS data 300 is decoded and played back at time 0 msec. Frames 401 to 404 of the moving image object data, however, are each decoded and played back 500 msec later than indicated by their DTS and CTS.

Next, an example of the screens displayed when the multimedia data 400 is decoded and played back in the delayed environment shown in the lower part of FIG. 4(b) will be described.

FIG. 6 shows an example of the screens displayed when conventional decoding and playback processing is applied to the multimedia data 400 in the delayed environment shown in the lower part of FIG. 4(b). As shown in FIG. 6, the decoding and playback timing is shifted, so each frame of the moving image is played back at a scale different from the one the content creator intended.

That is, at time 0 msec the BIFS data 300 has been decoded, but nothing is displayed because Frame0 (401) has not yet been decoded (screen 600). At time 500 msec, Frame0 (401), which should have been displayed at 0 msec, is played back at a width and height scale of 1.5x (screen 601). At time 1000 msec, Frame10 (402), which should have been displayed at 500 msec, is played back at 2x (screen 602). At time 1500 msec, Frame20 (403), which should have been displayed at 1000 msec, is played back at 2x (screen 603). At time 2000 msec, Frame30 (404), which should have been displayed at 1500 msec, is played back at 2x (screen 604).

With conventional decoding and playback in such a delayed environment, the scene description data is decoded and played back according to its time information, so the scene (field values and so on) changes as time passes. If the reception or decoding of the subsequent media object data is delayed, however, the scene description data and the media object data fall out of synchronization.
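The mismatch can be reproduced with a simple hypothetical model. It assumes a 50 ms frame interval and a scale that rises linearly from 1x to 2x over the first second; these assumptions were chosen to match the quoted values and are not the actual BIFS of FIG. 3. The scene clock keeps running from 0 msec while every frame arrives 500 msec late. Each frame is therefore drawn with the scale for its late display time, which gives the values described for FIG. 6. `conventional_schedule` is an illustrative name.

```python
FRAME_INTERVAL_MS = 50  # inferred: Frame10 is due at 500 ms


def scene_scale(scene_time_ms: float) -> float:
    """Hypothetical scale curve: 1.0 -> 2.0 linearly over 0-1000 ms, then held."""
    return 1.0 + min(max(scene_time_ms / 1000.0, 0.0), 1.0)


def conventional_schedule(frame_numbers, media_delay_ms):
    """Conventional receiver: the scene clock runs from 0 regardless of the
    media, so each frame is drawn at (CTS + delay) with the scale for that
    later time."""
    schedule = []
    for n in frame_numbers:
        shown_at = n * FRAME_INTERVAL_MS + media_delay_ms
        schedule.append((shown_at, n, scene_scale(shown_at)))
    return schedule


# With a 500 ms delay, Frame0 and Frame10 are drawn at the wrong scale.
for shown_at, n, scale in conventional_schedule([0, 10, 20, 30], 500):
    print(f"{shown_at:5d} ms  Frame{n}  scale {scale:.1f}x")
```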

The receiving apparatus 101 of this embodiment includes the event generation circuit 105. It can therefore play back the scene description data and the media object data in synchronization even in the delayed environment described above. The process of compositing multiple media object data based on the scene description data, using the events generated by the event generation circuit 105, is described below with reference to the flowcharts of FIGS. 7 and 8.

FIG. 7 is a flowchart of the process in which the scene composition circuit 106 composites the scene description data and the media object data when the receiving apparatus 101 receives scene description data. FIG. 7 focuses on the scene composition circuit 106. Before step S701, however, the demultiplexing circuit 102 demultiplexes the received multimedia data. The scene description data decoding circuit 103 and the media decoding circuit 104 then decode the demultiplexed data. As described above, the scene description data contained in the multimedia data is either transmitted before the media object data or retransmitted so that the sender can intentionally update the scene.

First, the scene composition circuit 106 reads the decoded scene description data from the scene description data decoding circuit 103 (step S701). Next, it parses the scene from the scene description data it has read (step S702). Based on the result of that analysis, it determines whether the scene description data references media data such as moving images, audio, or still images (step S703). When BIFS is used as the scene description data, for example, this is determined by whether the scene contains a node that references media data, typified by the MovieTexture, AudioClip, and ImageTexture nodes.

In the BIFS data 300 shown in FIG. 3, for example, the MovieTexture node 303 is present, so the scene composition circuit 106 determines that media data is referenced ("Yes" in step S703).
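Step S703 amounts to scanning the parsed scene graph for nodes that reference media. The sketch below assumes a simplified in-memory node tree and checks only the three node types the text names. `Node`, `media_nodes`, and the example scene are hypothetical: a real BIFS decoder has its own node representation, and the actual structure of FIG. 3 is not reproduced here.

```python
from dataclasses import dataclass, field

# Node types the text names as referencing media data (step S703).
MEDIA_NODE_TYPES = {"MovieTexture", "AudioClip", "ImageTexture"}


@dataclass
class Node:
    kind: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def media_nodes(root: Node) -> list:
    """Return every node under root that references media data, depth-first."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.kind in MEDIA_NODE_TYPES:
            found.append(node)
        stack.extend(reversed(node.children))  # preserve document order
    return found


# Hypothetical scene loosely modeled on FIG. 3.
scene = Node("OrderedGroup", children=[
    Node("TimeSensor", {"cycleInterval": 1.0}),
    Node("Transform2D", {"scale": (1.0, 1.0)}, [
        Node("Shape", children=[
            Node("Appearance", children=[
                Node("MovieTexture", {"url": "test.mpeg", "speed": 1.0}),
            ]),
        ]),
    ]),
])

refs = media_nodes(scene)
print("S703:", "Yes" if refs else "No", [n.kind for n in refs])
```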

If it determines in step S703 that the scene does not reference media data ("No" in step S703), the scene composition circuit 106 has no media data to composite. It composites the scene description data read in step S701 as is and outputs it (step S707).

If it determines in step S703 that the scene references media data ("Yes" in step S703), the scene composition circuit 106 determines whether the current time has reached the playback start time of that media data (step S704). For the AudioClip node mentioned above, this can be determined from the startTime field: the start time of the media data is deemed reached once the current time passes startTime. For the MovieTexture node, the start time is deemed reached at the current time, even if startTime has not arrived, unless reverse playback is specified. The reason is that, as long as the value of the speed field is not negative, the MovieTexture node must display the first frame of the moving image (Frame0 (401) in the example of FIG. 4) whether or not the current time has passed startTime. The ImageTexture node has no field indicating a start time, so for it, too, the current time is deemed to be the start time.
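The per-node rule of step S704 can be written as a small predicate. This sketch assumes that node attributes are held in a plain dict with VRML-style defaults (startTime 0, speed 1.0). `start_time_reached` is a hypothetical name. The text does not spell out the reverse-playback branch for MovieTexture; falling back to a startTime comparison there is our assumption.

```python
def start_time_reached(kind: str, attrs: dict, now_s: float) -> bool:
    """Sketch of step S704 for the three node types named in the text."""
    start = attrs.get("startTime", 0.0)
    if kind == "AudioClip":
        # Started once the current time has reached or passed startTime.
        return now_s >= start
    if kind == "MovieTexture":
        # With non-negative speed the first frame must be shown now anyway.
        # Assumption: with reverse playback (speed < 0), fall back to startTime.
        return attrs.get("speed", 1.0) >= 0 or now_s >= start
    if kind == "ImageTexture":
        # No start-time field, so the current time always counts as the start.
        return True
    raise ValueError(f"not a media node: {kind}")


print(start_time_reached("MovieTexture", {"startTime": 5.0}, 0.0))
```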

If it determines in step S704 that the current time has not reached the start time of the media data ("No" in step S704), the scene composition circuit 106 does not need to composite that media data at the current time. It then determines whether the scene description data references any other media data (step S708). If it determines that the current time is the start time of the media data ("Yes" in step S704), the scene composition circuit 106 determines whether it has received an event from the event generation circuit 105 (step S705).

This determination uses the event that the event generation circuit 105 sends to the scene composition circuit 106 when the demultiplexing circuit 102 starts reading the media data in the scene. When BIFS is used as the scene description data, for example, the duration_changed event of the MovieTexture or AudioClip node is sent from the event generation circuit 105 to the scene composition circuit 106 and used for the determination. When the demultiplexing circuit 102 starts reading the moving image or audio associated with a node, it sends an event issuing instruction to the event generation circuit 105. On receiving it, the event generation circuit 105 sends a duration_changed event to the scene composition circuit 106. The duration_changed event gives the duration of the moving image or audio in seconds. A value of -1 means that the moving image or audio has not yet been loaded or cannot be used for some reason.

In the BIFS data 300 shown in FIG. 3, for example, the demultiplexing circuit 102 sends an event issuing instruction to the event generation circuit 105 when it starts reading the moving image object data "test.mpeg". The event generation circuit 105 then sends the duration_changed event of the MovieTexture node to the scene composition circuit 106.
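The event path that feeds step S705 can be sketched as three objects passing a message. The class names, the node identifier `MovieTexture303`, and the 2-second duration are hypothetical. Only the duration_changed semantics come from the text: the duration is given in seconds, and -1 means not loaded or unusable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurationChanged:
    node_id: str
    duration_s: float  # -1 means not yet loaded or unusable

    @property
    def media_available(self) -> bool:
        return self.duration_s >= 0


class SceneComposer:
    """Stands in for scene composition circuit 106: records startable nodes."""

    def __init__(self):
        self.ready = set()

    def on_event(self, event: DurationChanged) -> None:
        if event.media_available:
            self.ready.add(event.node_id)


class EventGenerator:
    """Stands in for event generation circuit 105."""

    def __init__(self, sink: SceneComposer):
        self.sink = sink

    def issue(self, node_id: str, duration_s: float) -> None:
        self.sink.on_event(DurationChanged(node_id, duration_s))


class Demultiplexer:
    """Stands in for demultiplexing circuit 102."""

    def __init__(self, events: EventGenerator):
        self.events = events

    def start_reading(self, node_id: str, duration_s: float) -> None:
        # Reading of the stream behind node_id has begun: issue the instruction.
        self.events.issue(node_id, duration_s)


composer = SceneComposer()
demux = Demultiplexer(EventGenerator(composer))
demux.start_reading("MovieTexture303", 2.0)  # hypothetical 2 s clip ("test.mpeg")
print(composer.ready)  # {'MovieTexture303'}
```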

If it determines that an event has been received from the event generation circuit 105 ("Yes" in step S705), the scene composition circuit 106 reads the decoded media object data from the media decoding circuit 104 (step S707). Processing then proceeds to step S708, where it determines whether the scene references any other media data. If there is no other reference ("No" in step S708), the scene composition circuit 106 composites the media data read in step S707 with the scene description data and outputs the result to the output device 107 (step S709).

If no event has been received from the event generation circuit 105 ("No" in step S705), one of two things has happened. Either reading of the media data has not started, because the line is congested or the processing capability of the receiving apparatus 101 is low, or the media data cannot be used for some other reason. In this case, the scene composition circuit 106 cannot read the decoded media object data from the media decoding circuit 104. It waits for an event from the event generation circuit 105 (step S706) and then repeats step S705. Until the event arrives, the scene does not change as time passes and its field values are not updated. Thus, in the example of the BIFS data 300 in FIG. 3, the TIMER 304 outputs no fraction_changed event until time 500 msec has passed.

If, in step S708, the scene references other media data ("Yes" in step S708), the scene composition circuit 106 returns to step S704 to determine whether the start time of that media object data has been reached.
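Taken together, steps S703 to S709 can be sketched as a loop over the media references found in the scene. This is a minimal Python sketch, not the apparatus itself. It assumes that the events of step S705 arrive on a `queue.Queue` as node identifiers. `compose_new_scene`, the stand-in decoder, and the 50 ms timer are hypothetical. Blocking on the queue stands in for step S706: while the function blocks, nothing advances the scene.

```python
import queue
import threading


def compose_new_scene(media_refs, now_s, event_queue, read_media):
    """Sketch of FIG. 7 after S701/S702; media_refs is the parse result.

    media_refs: list of (node_id, start_reached), start_reached(now_s) -> bool
    event_queue: queue.Queue of node_ids whose media-ready event has arrived
    read_media: callable returning decoded media for a node_id
    Returns the media to composite with the scene in S709 (empty: scene only).
    """
    received = set()
    composed = []
    for node_id, start_reached in media_refs:  # S703; S708 loops back to S704
        if not start_reached(now_s):           # S704 "No": skip for now
            continue
        while node_id not in received:         # S705
            received.add(event_queue.get())    # S706: block until an event
        composed.append(read_media(node_id))   # S707
    return composed                            # S709


events = queue.Queue()
# Hypothetical timing: the media read starts 50 ms later on a slow line.
threading.Timer(0.05, events.put, args=("MovieTexture303",)).start()
media = compose_new_scene([("MovieTexture303", lambda t: True)], 0.0, events,
                          lambda nid: f"decoded {nid}")
print(media)  # ['decoded MovieTexture303']
```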

FIG. 8 is a flowchart of the process in which the scene composition circuit 106 of the receiving apparatus 101 composites the scene description data and the media object data when the scene description data has already been read. In other words, the process of FIG. 8 follows the process shown in FIG. 7.

First, the scene composition circuit 106 updates the field values of the already-read scene description data as time passes (step S801). In the example of the BIFS data 300 in FIG. 3, the TIMER 304 outputs fraction_changed events according to the elapsed time.

Next, the scene composition circuit 106 determines whether any media data in the scene has not yet started (step S802). The BIFS data 300 in FIG. 3 contains the MovieTexture node 303. Step S704 in FIG. 7 has already determined that its playback has started, so the circuit determines that there is no unstarted media data.

When it determines that there is no unstarted media data ("No" in step S802), either the scene references no media data or playback of all media data has already started. The scene composition circuit 106 therefore composites the scene description data and outputs it (step S808).

If it determines that the scene contains unstarted media data ("Yes" in step S802), the scene composition circuit 106 determines whether the current time has reached the start time of that media data (step S803). This determination is the same as step S704 in FIG. 7.

If it determines that the current time has not reached the start time of the media data ("No" in step S803), the scene composition circuit 106 does not need to composite that media data at the current time and proceeds to step S807.

If it determines that the current time is the start time of the media data ("Yes" in step S803), the scene composition circuit 106 needs to know whether reading of that media data has started. It therefore determines whether it has received an event from the event generation circuit 105 (step S804). This determination is the same as step S705 in FIG. 7.

If it determines that the event has been received ("Yes" in step S804), the scene composition circuit 106 reads the decoded media object data from the media decoding circuit 104 (step S806). It then determines whether the scene contains any other unstarted media data (step S807). If there is none, it composites the media object data with the scene description data and outputs the result to the output device 107 (step S808).

If it determines that the event has not yet been received, the scene composition circuit 106 waits for an event from the event generation circuit 105 (step S805), as in step S706 of FIG. 7.

If, in step S807, other unstarted media data exists ("Yes" in step S807), the scene composition circuit 106 returns to step S803 to determine whether the start time of that media data has been reached.
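The per-update flow of FIG. 8 differs from FIG. 7 mainly in two ways: step S801, and keeping track of which media have already started. This is a minimal sketch that assumes events are delivered as node identifiers on a `queue.Queue`. `update_scene`, the `AudioClip7` node, and its 1-second start are all hypothetical. `str.lower` stands in for the media decoder.

```python
import queue


def update_scene(now_s, advance_fields, unstarted, event_queue, read_media, received):
    """Sketch of FIG. 8 for a scene that has already been read.

    unstarted: dict node_id -> start_reached(now_s); started entries are removed
    received: set of node_ids whose event has already arrived (kept by caller)
    Returns the media newly started at this update, to composite in S808.
    """
    advance_fields(now_s)                                    # S801
    started = []
    for node_id, start_reached in list(unstarted.items()):  # S802 / S807
        if not start_reached(now_s):                         # S803 "No" -> S807
            continue
        while node_id not in received:                       # S804
            received.add(event_queue.get())                  # S805: wait
        started.append(read_media(node_id))                  # S806
        del unstarted[node_id]
    return started                                           # S808


timeline = []
pending = {"AudioClip7": lambda t: t >= 1.0}  # hypothetical clip starting at 1 s
events, seen = queue.Queue(), set()
events.put("AudioClip7")
print(update_scene(0.5, timeline.append, pending, events, str.lower, seen))  # []
print(update_scene(1.0, timeline.append, pending, events, str.lower, seen))  # ['audioclip7']
```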

FIG. 9 shows an example of the screens displayed over time when the multimedia data 400 is played back in the delayed environment described above, according to the flowcharts of FIGS. 7 and 8.

As shown in FIG. 9, at time 0 msec the BIFS data 300 has been decoded, but nothing is displayed because decoding of the moving image object data is not complete (screen 600). At time 500 msec, Frame0 (401) is displayed at a width and height scale of 1x (screen 500). Until reading of Frame0 (401) starts, the scene composition circuit 106 keeps the scene unchanged as time passes, and the TIMER 304 outputs no fraction_changed event. The scale field of MOVIE 302 therefore remains (1, 1).

At time 1000 msec, Frame10 (402) is displayed at a width and height scale of 1.5x (screen 501). At time 1500 msec, Frame20 (403) is displayed at 2x (screen 502). At time 2000 msec, Frame30 (404) is displayed at 2x (screen 503).

Thus, with the receiving apparatus 101 of this embodiment, the scene composition circuit 106 can hold the scene and wait until it receives the event from the event generation circuit 105, even when the start time of the media data has arrived. As shown in FIG. 9, waiting for Frame0 to be decoded delays playback of the first screen (screen 500) by 500 msec. The subsequent screens (screens 501 to 503), however, are played back just as in the ideal environment of FIG. 5. In other words, the receiving apparatus 101 separates multimedia data composed of multiple objects into its objects and plays them back. When it does so, it reliably composites and plays back media object data such as moving images and audio in synchronization with the scene managed by the scene description data. This holds regardless of the type of communication line, the line conditions, or the processing capability of the terminal.
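The effect of holding the scene can be checked numerically. This sketch makes the same hypothetical assumptions as a simple model of FIG. 5: a 50 ms frame interval and a scale that rises linearly from 1x to 2x over the first second. It also assumes that the event arrives exactly when Frame0 becomes available. It adds a hypothetical `SceneClock` whose time stands still while held. Under these assumptions it reproduces the scale values described for FIG. 9.

```python
FRAME_INTERVAL_MS = 50  # inferred: Frame10 is due at 500 ms


def scene_scale(scene_time_ms: float) -> float:
    """Hypothetical scale curve: 1.0 -> 2.0 linearly over 0-1000 ms, then held."""
    return 1.0 + min(max(scene_time_ms / 1000.0, 0.0), 1.0)


class SceneClock:
    """Scene time that stands still while held (e.g. awaiting duration_changed)."""

    def __init__(self):
        self.held_total_ms = 0.0
        self.held_since_ms = None

    def hold(self, wall_ms):
        if self.held_since_ms is None:
            self.held_since_ms = wall_ms

    def release(self, wall_ms):
        if self.held_since_ms is not None:
            self.held_total_ms += wall_ms - self.held_since_ms
            self.held_since_ms = None

    def now(self, wall_ms):
        if self.held_since_ms is not None:
            wall_ms = self.held_since_ms  # frozen at the moment the hold began
        return wall_ms - self.held_total_ms


def event_driven_schedule(frame_numbers, media_delay_ms):
    clock = SceneClock()
    clock.hold(0)                  # MovieTexture start time reached, no event yet
    clock.release(media_delay_ms)  # event arrives when the media read starts
    schedule = []
    for n in frame_numbers:
        shown_at = n * FRAME_INTERVAL_MS + media_delay_ms
        schedule.append((shown_at, n, scene_scale(clock.now(shown_at))))
    return schedule


for shown_at, n, scale in event_driven_schedule([0, 10, 20, 30], 500):
    print(f"{shown_at:5d} ms  Frame{n}  scale {scale:.1f}x")
```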

In the examples of FIGS. 3 and 4 in the embodiment above, the scene consists only of a moving image, for ease of explanation. The media objects that make up a scene are not limited to moving images, however; audio, still images, and the like can also be used. Likewise, the object controlled for synchronization with the scene is not limited to a moving image. It may be an object constituting a moving image, audio, a still image, or the like.

[Second Embodiment]
In the first embodiment described above, when the receiving apparatus 101 receives multimedia data, the demultiplexing circuit 102 separates it into scene description data and media object data. It then inputs them to the decoding circuits 103 and 104, respectively. A receiving apparatus 1000 according to a second embodiment, which has a different configuration, is described below.

FIG. 10 shows the schematic configuration of the receiving apparatus 1000 according to the second embodiment. As shown in FIG. 10, the receiving apparatus 1000 receives the scene description data over a path (transmission path 100) separate from that of the media object data. In FIG. 10, components with the same functions as in FIG. 1 are given the same reference numerals, namely 100 and 102 to 107.

When the receiving apparatus 1000 of this embodiment receives scene description data, it inputs the data to the scene description data decoding circuit 103 via a transmission path 100 (upper side of FIG. 10) separate from that of the media object data. The multiplexed media object data arrives via a different transmission path 100 (lower side of FIG. 10). As in the first embodiment, it is input to the demultiplexing circuit 102, which demultiplexes it and feeds each stream to the media decoding circuit 104. The processing flow of the receiving apparatus 1000 is the same as that of the receiving apparatus 101 of the first embodiment shown in FIGS. 7 and 8.

The receiving apparatus 1000 of this embodiment can play back multimedia data that uses VRML, SMIL, or the like as scene description data and does not multiplex the scene description data and the encoded media data into the same stream. For such data, it reliably composites and plays back the scene and the media object data in synchronization, as in the first embodiment. This holds regardless of the communication line, the line conditions, or the processing capability of the terminal.

[Third Embodiment]
In the first embodiment, the event generation circuit 105 sends an event to the scene composition circuit 106 when the demultiplexing circuit 102 starts reading the media data in the scene. A receiving apparatus 1100 according to a third embodiment, which has a different configuration, is described below.

FIG. 11 shows the schematic configuration of the receiving apparatus 1100 according to the third embodiment. As shown in FIG. 11, the event generation circuit 105 of the receiving apparatus 1100 generates events according to the progress of media data decoding in the media decoding circuit 114. In FIG. 11, components with the same functions as in FIG. 1 are given the same reference numerals, namely 100, 102, 103, and 105 to 107.

If the processing capability of the receiving apparatus 101 of the first embodiment shown in FIG. 1 is low, the media decoding circuit 104 takes time to decode, and the scene and the media data may fall out of synchronization. In this embodiment, therefore, the media decoding circuit 114 has a function (issuing instruction means) that sends an event issuing instruction to the event generation circuit 105 when decoding of the media data is complete. On receiving the instruction, the event generation circuit 105 sends an event to the scene composition circuit 106. The processing of the receiving apparatus 1100 is the same as that of the receiving apparatus 101 of the first embodiment shown in FIGS. 7 and 8.
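The importance of the trigger point can be shown with the same hypothetical linear scale curve (1x to 2x over the first second). In this sketch, reading of the media starts at 500 msec but decoding takes a further 300 msec; these timings and all names are assumptions, not from the disclosure. Suppose the scene is released when reading starts, as in the first embodiment. The scene clock has then already advanced by the time Frame0 is ready. Releasing it on decode completion, as in this embodiment, keeps Frame0 at its intended scale.

```python
def scene_scale(scene_time_ms: float) -> float:
    """Hypothetical scale curve: 1.0 -> 2.0 linearly over 0-1000 ms, then held."""
    return 1.0 + min(max(scene_time_ms / 1000.0, 0.0), 1.0)


def first_frame_scale(read_start_ms, decode_ms, trigger):
    """Scale at which Frame0 is drawn when the scene is held from 0 ms until the event.

    trigger: "read_start" (first embodiment: demultiplexer starts reading)
             or "decode_done" (third embodiment: decoder finishes decoding)
    """
    if trigger not in ("read_start", "decode_done"):
        raise ValueError(f"unknown trigger: {trigger}")
    frame_ready_ms = read_start_ms + decode_ms
    release_ms = read_start_ms if trigger == "read_start" else frame_ready_ms
    scene_time_ms = frame_ready_ms - release_ms  # scene time elapsed since release
    return scene_scale(scene_time_ms)


# Hypothetical slow decoder: reading starts at 500 ms, decoding takes 300 ms.
print(first_frame_scale(500, 300, "read_start"))   # 1.3: scene ran ahead
print(first_frame_scale(500, 300, "decode_done"))  # 1.0: as intended
```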

As described above, the receiving apparatus 1100 of this embodiment controls the progress of the scene by taking into account how far decoding of the media object data has progressed. When it plays back encoded multimedia data composed of multiple objects, decoding media object data such as video and audio may take time. Even then, the scene and the media object data are reliably composited and played back in synchronization, regardless of the communication line, the line conditions, or the processing capability of the terminal.

[Fourth Embodiment]
In the second embodiment described above, the event generation circuit 105 sends an event to the scene composition circuit 106 when the demultiplexing circuit 102 starts reading the media data in the scene. A receiving apparatus 1200 according to a fourth embodiment, which has a different configuration, is described below.

FIG. 12 shows the schematic configuration of the receiving apparatus 1200 according to the fourth embodiment. As shown in FIG. 12, the event generation circuit 105 of the receiving apparatus 1200 generates events according to the progress of media data decoding in the media decoding circuit 114. In FIG. 12, components with the same functions as in FIG. 10 are given the same reference numerals, namely 100, 102, 103, and 105 to 107.

If the processing capability of the receiving apparatus 1000 of the second embodiment shown in FIG. 10 is low, the media decoding circuit 104 takes time to decode, and the scene and the media data may fall out of synchronization. In this embodiment, therefore, the media decoding circuit 114 sends an event issuing instruction to the event generation circuit 105 when decoding of the media data is complete. On receiving the instruction, the event generation circuit 105 sends an event to the scene composition circuit 106.

As described above, the receiving apparatus 1200 of this embodiment controls the progress of the scene by taking into account how far decoding of the media object data has progressed. When it plays back encoded multimedia data composed of multiple objects, decoding media object data such as video and audio may take time. Even then, the scene and the media object data are reliably composited and played back in synchronization, regardless of the communication line, the line conditions, or the processing capability of the terminal.

[Other Embodiments]
In the embodiments described above, each function in the receiving apparatus is implemented by a circuit, but the invention is not limited to this. Each function may instead be implemented in software. In that case, a system or apparatus is supplied with a recording medium on which the program code for those functions is recorded. A computer (or CPU or MPU) of the system or apparatus then reads and executes the program code stored on the recording medium. The program code read from the recording medium itself implements the functions of the embodiments described above, and the recording medium on which the program code is recorded constitutes the present invention.

As the recording medium for supplying the program code, for example, a flexible disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, nonvolatile memory card, or ROM can be used.

Needless to say, this covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read-out program code, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on instructions in the program code, and the functions of the above-described embodiments are realized by that processing.

Furthermore, after the program code read from the recording medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on the function expansion board or function expansion unit may perform part or all of the actual processing based on instructions in the program code, and the functions of the above-described embodiments may be realized by that processing.

A program product, such as a computer-readable recording medium on which the above program code is recorded, can also be applied as an embodiment of the present invention.

Embodiments of the present invention have been described in detail above with reference to the drawings; however, the specific configuration is not limited to these embodiments, and designs and the like that do not depart from the gist of the present invention are also included.

FIG. 1 is a diagram showing the basic configuration of the multimedia data receiving apparatus 101 as the first embodiment of the present invention.

FIG. 2 is a diagram showing an example data structure of the entire multimedia data 200 in this embodiment.

FIG. 3 is a diagram showing a description example of the scene description data 201 shown in FIG. 2.

FIG. 4 is a diagram showing an example of multimedia data including the BIFS data 300 shown in FIG. 3, and the differences in decoding and playback processing of that multimedia data depending on the operating environment.

FIG. 5 is a diagram showing example screens over time when the multimedia data 400 is decoded and played back at the ideal timing shown in FIG. 4.

FIG. 6 is a diagram showing an example screen when conventional decoding and playback processing is performed on the multimedia data 400 in an environment in which delays occur, as shown in FIG. 4.

FIG. 7 is a flowchart showing the process by which the scene composition circuit 106 composites the scene description data and the media object data when the receiving apparatus 101 receives the scene description data.

FIG. 8 is a flowchart showing the process by which the scene composition circuit 106 composites the scene description data and the media object data when the scene description data has already been read.

FIG. 9 is a diagram showing example screens on which the multimedia data 400 is played back over time, based on the flowcharts of FIGS. 7 and 8, in the above-described environment in which delays occur.

FIG. 10 is a diagram showing the schematic configuration of the receiving apparatus 1000 in the second embodiment.

FIG. 11 is a diagram showing the schematic configuration of the receiving apparatus 1100 in the third embodiment.

FIG. 12 is a diagram showing the schematic configuration of the receiving apparatus 1200 in the fourth embodiment.

Explanation of symbols

100 Transmission path
101 Multimedia data receiving apparatus
102 Demultiplexing circuit
103 Scene description data decoding circuit
104, 114 Media decoding circuit
105 Event generation circuit
106 Scene composition circuit
107 Output device
1000, 1100, 1200 Multimedia data receiving apparatus (receiving apparatus)

Claims

A data processing apparatus that receives, via a network, multimedia data including one or more pieces of encoded object data relating to moving images and/or audio, and performs playback processing according to scene description data, the apparatus comprising:
separating means for separating the received multimedia data into units of the object data;
one or more first decoding means for decoding the plurality of object data separated by the separating means;
second decoding means for decoding the scene description data when scene description data, encoded either as part of the data contained in the multimedia data or as independent data over a communication path different from the network, is received;
event generating means for generating an event in accordance with the timing at which the separating means starts reading the object data or the timing at which the first decoding means decodes the object data; and
scene composition means for compositing the plurality of object data decoded by the first decoding means, based on the scene description data decoded by the second decoding means, in accordance with the timing at which the event generated by the event generating means is received.

The data processing apparatus according to claim 1, wherein the scene composition means suspends the compositing process until it receives the event generated by the event generating means.

The data processing apparatus according to claim 1 or 2, wherein the multimedia data conforms to the MPEG (Moving Picture Experts Group)-4 standard, the scene description data is described in any one of BIFS (Binary Format for Scenes), VRML (Virtual Reality Modeling Language), SMIL (Synchronized Multimedia Integration Language), and XMT (Extensible MPEG-4 Textual Format), and the event generating means generates a predetermined event for a node relating to moving images or audio defined in the BIFS, VRML, SMIL, or XMT.

A data processing method using a data processing apparatus that receives, via a network, multimedia data including one or more pieces of encoded object data relating to moving images and/or audio, and performs playback processing according to scene description data, the method comprising:
a separating step of separating the received multimedia data into units of the object data;
one or more first decoding steps of decoding the plurality of object data separated in the separating step;
a second decoding step of decoding the scene description data when scene description data, encoded either as part of the data contained in the multimedia data or as independent data over a communication path different from the network, is received;
an event issuing step of issuing an event in accordance with the timing at which reading of the object data starts in the separating step or the timing at which the object data is decoded in the first decoding step; and
a scene composition step of compositing the plurality of object data decoded in the first decoding step, based on the scene description data decoded in the second decoding step, in accordance with the timing at which the event issued in the event issuing step is received.

A program for a data processing apparatus that receives, via a network, multimedia data including one or more pieces of encoded object data relating to moving images and/or audio, and performs playback processing according to scene description data, the program causing a computer to execute:
a separating step of separating the received multimedia data into units of the object data;
one or more first decoding steps of decoding the plurality of object data separated in the separating step;
a second decoding step of decoding the scene description data when scene description data, encoded either as part of the data contained in the multimedia data or as independent data over a communication path different from the network, is received;
an event issuing step of issuing an event in accordance with the timing at which reading of the object data starts in the separating step or the timing at which the object data is decoded in the first decoding step; and
a scene composition step of compositing the plurality of object data decoded in the first decoding step, based on the scene description data decoded in the second decoding step, in accordance with the timing at which the event issued in the event issuing step is received.

A computer-readable recording medium on which is recorded a program for a data processing apparatus that receives, via a network, multimedia data including one or more pieces of encoded object data relating to moving images and/or audio, and performs playback processing according to scene description data, the program causing a computer to execute:
a separating step of separating the received multimedia data into units of the object data;
one or more first decoding steps of decoding the plurality of object data separated in the separating step;
a second decoding step of decoding the scene description data when scene description data, encoded either as part of the data contained in the multimedia data or as independent data over a communication path different from the network, is received;
an event issuing step of issuing an event in accordance with the timing at which reading of the object data starts in the separating step or the timing at which the object data is decoded in the first decoding step; and
a scene composition step of compositing the plurality of object data decoded in the first decoding step, based on the scene description data decoded in the second decoding step, in accordance with the timing at which the event issued in the event issuing step is received.