JP4499204B2

JP4499204B2 - Image signal multiplexing apparatus and method, and transmission medium

Info

Publication number: JP4499204B2
Application number: JP20212498A
Authority: JP
Inventors: 輝彦鈴木; 陽一矢ヶ崎
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1997-07-18
Filing date: 1998-07-16
Publication date: 2010-07-07
Anticipated expiration: 2018-07-16
Also published as: JPH1185966A

Description

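[0001]
[Technical Field of the Invention]
The present invention relates to an image signal multiplexing apparatus and method, and a transmission medium. In particular, it relates to an image signal multiplexing apparatus and method, and a transmission medium, suitable for data that is recorded on a recording medium such as a magneto-optical disk or magnetic tape, reproduced from that medium, and shown on a display, and for data that is transmitted from a sending side to a receiving side over a transmission path and displayed, edited, or recorded at the receiving side, as in video conference systems, videophone systems, broadcasting equipment, and multimedia database retrieval systems.
[0002]
[Prior Art]
In systems that transmit moving image signals to remote locations, such as video conference systems and videophone systems, the image signal is compression-coded by exploiting the line correlation and inter-frame correlation of the video signal so that the transmission path is used efficiently.
[0003]
In recent years, as the processing power of computers has increased, computer-based moving image information terminals have also become widespread. Such systems transmit information to remote locations over a transmission path such as a network. Here too, the image signals, audio signals, data, and other signals to be transmitted are compression-coded before transmission so that the path is used efficiently.
[0004]
The terminal decodes the transmitted compressed signal according to a predetermined method, restores the original image signal, audio signal, data, and so on, and outputs them to its display, speaker, or other output device. Conventionally the transmitted image signal was simply output to the display as is, but a computer-based information terminal can now convert multiple image signals, audio signals, and data and then present them in a two-dimensional or three-dimensional space. This is achieved by having the sending side describe the 2D or 3D space information in a predetermined way, and having the terminal apply, for example, a predetermined conversion to the image signals according to that description and display the result.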
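[0005]
A representative scheme for describing such spatial information is VRML (Virtual Reality Modeling Language). It has also been standardized by ISO-IEC/JTC1/SC24, and the latest version, VRML 2.0, is specified in IS14772. VRML is a language for describing three-dimensional space, and it defines collections of data for describing the attributes, shapes, and so on of that space. Such a collection of data is called a node. Describing a 3D space amounts to describing how these predefined nodes are combined. The defined nodes include ones that represent attributes such as color and texture and ones that represent polygon shapes.
[0006]
A computer-based information terminal generates given objects from polygons and the like by CG (computer graphics) according to such a VRML description. VRML also allows a texture to be applied to a 3D object built from the polygons generated this way. A node called Texture is defined for still-image textures and a node called MovieTexture for moving-image textures; the node holds information about the texture to be applied (file name, display start time, display end time, and so on).
[0007]
Texture application (hereinafter called texture mapping where appropriate) is described with reference to Fig. 14. First, the texture to be applied (an image signal), a signal representing its transparency (the key signal), and 3D object information are input from outside and stored in predetermined areas of memory group 151. The texture is stored in texture memory 152, the transparency signal in grayscale memory 153, and the 3D object information in 3D information memory 154. Here, 3D object information means polygon formation information, lighting information, and the like.
[0008]
Rendering circuit 155 forms a 3D object from polygons based on the given 3D object information stored in memory group 151. Based on the 3D object information, rendering circuit 155 reads the given texture and transparency signal from memories 152 and 153 and applies them to the generated 3D object. The transparency signal gives the transparency of the texture at the corresponding position, and therefore the transparency of the object at the position where that texture is applied. Rendering circuit 155 supplies the signal of the textured object to 2D conversion circuit 156.
[0009]
2D conversion circuit 156 converts the 3D object into a 2D image signal obtained by projecting it onto a 2D plane according to externally supplied viewpoint information. The 3D object converted into a 2D image signal is then output externally. The texture may be a still image or a moving image. For a moving image, the above operation is performed each time the image frame to be applied changes.
[0010]
VRML also supports compressed image formats for textures, such as JPEG (Joint Photographic Experts Group), a high-efficiency coding scheme for still images, and MPEG (Moving Picture Experts Group), a moving-image coding scheme. In this case the texture (image) is decoded by the decoding process for the given compression scheme, and the decoded image signal is stored in memory 152 of memory group 151.
[0011]
Rendering circuit 155 applies the texture stored in memory 152 regardless of the image format, whether it is a moving or still image, and its content. What can be applied to any one polygon is always a single texture held in memory; multiple textures cannot be applied to one polygon.
[0012]
When such 3D information and texture information are sent over a transmission path, the information must be compressed so that the path is used efficiently. In particular, when a moving image is applied to a 3D object, compressing the moving image for transmission is essential.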
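[0013]
The MPEG scheme mentioned above, for example, was discussed in ISO-IEC/JTC1/SC2/WG11 and proposed as a draft standard; it adopts a hybrid scheme combining motion-compensated predictive coding with DCT (Discrete Cosine Transform) coding. MPEG defines several profiles and levels to support various applications and functions. The most basic is Main Profile at Main Level (MP@ML).
[0014]
A configuration example of an MP@ML encoder of the MPEG scheme is described with reference to Fig. 15. The input image signal is first input to frame memory group 1 and stored in a predetermined order. The image data to be coded is input to motion vector detection circuit (ME) 2 in macroblock units. Motion vector detection circuit 2 processes the image data of each frame as an I picture, P picture, or B picture according to a preset sequence. Which picture type each sequentially input frame is processed as is predetermined (for example, the order I, B, P, B, P, ..., B, P).
[0015]
Motion vector detection circuit 2 performs motion compensation with reference to a predetermined reference frame and detects the motion vector. Motion compensation (inter-frame prediction) has three modes: forward prediction, backward prediction, and bidirectional prediction. A P picture uses forward prediction only; a B picture can use forward, backward, or bidirectional prediction. Motion vector detection circuit 2 selects the prediction mode that minimizes the prediction error and generates the corresponding prediction vector.
[0016]
At this point, the prediction error is compared with, for example, the variance of the macroblock to be coded; if the macroblock's variance is smaller, no prediction is performed for that macroblock and it is intra-frame coded. The prediction mode is then intra-picture coding (intra). The motion vector and the prediction mode are input to variable length coding circuit 6 and motion compensation circuit (MC) 12.
[0017]
Motion compensation circuit 12 generates predicted image data based on the input motion vector, and that data is input to arithmetic circuit 3. Arithmetic circuit 3 computes the difference between the value of the macroblock to be coded and the value of the predicted image, and outputs it to DCT circuit 4. For an intra macroblock, arithmetic circuit 3 outputs the signal of the macroblock to be coded to DCT circuit 4 unchanged.
[0018]
DCT circuit 4 applies DCT (discrete cosine transform) processing to the input signal and converts it into DCT coefficients. The DCT coefficients are input to quantization circuit (Q) 5 and quantized with a quantization step corresponding to the amount of data accumulated in transmission buffer 7 (buffer occupancy); the quantized data is then input to variable length coding circuit (VLC) 6.
[0019]
Variable length coding circuit 6 converts the quantized data supplied from quantization circuit 5 (for example, I-picture data) into a variable length code such as a Huffman code, in accordance with the quantization step (scale) supplied from quantization circuit 5, and outputs it to transmission buffer 7. Variable length coding circuit 6 also receives the quantization step (scale) from quantization circuit 5, and the prediction mode (indicating which of intra-picture, forward, backward, or bidirectional prediction was set) and the motion vector from motion vector detection circuit 2; these are also variable-length coded.
[0020]
Transmission buffer 7 temporarily stores the input coded data and outputs data corresponding to the stored amount to quantization circuit 5. When the amount of remaining data rises to the allowable upper limit, transmission buffer 7 uses a quantization control signal to increase the quantization scale of quantization circuit 5, which reduces the amount of quantized data. Conversely, when the remaining data falls to the allowable lower limit, transmission buffer 7 uses the quantization control signal to decrease the quantization scale, which increases the amount of quantized data. This prevents overflow or underflow of transmission buffer 7. The coded data stored in transmission buffer 7 is read out at predetermined timing and output to the transmission path as a bitstream.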
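The buffer feedback described in paragraph [0020] can be illustrated with a minimal sketch. It is not taken from the patent: the function name, the fixed step of 1, and the clamp range of 1 to 31 are assumptions made for illustration only.

```python
def next_quant_scale(q_scale, buffer_bits, lower, upper,
                     q_min=1, q_max=31, step=1):
    """Return the quantization scale to use next, given buffer occupancy.

    Assumed policy: a coarser scale (fewer bits) when the buffer reaches
    its upper limit, a finer scale (more bits) when it reaches its lower
    limit, otherwise no change. q_min, q_max and step are illustrative.
    """
    if buffer_bits >= upper:
        q_scale += step      # buffer nearly full: quantize more coarsely
    elif buffer_bits <= lower:
        q_scale -= step      # buffer nearly empty: quantize more finely
    return max(q_min, min(q_max, q_scale))
```

A real encoder would typically adjust the scale more smoothly, but the direction of the feedback is the point here: occupancy up means scale up, occupancy down means scale down.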
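[0021]
Meanwhile, the quantized data output from quantization circuit 5 is input to inverse quantization circuit (IQ) 8 and inverse-quantized according to the quantization step supplied from quantization circuit 5. The output of inverse quantization circuit 8 (the DCT coefficients obtained by inverse quantization) is input to IDCT (inverse DCT) circuit 9. IDCT circuit 9 applies inverse DCT processing to the input DCT coefficients, and the resulting output data (difference data) is supplied to arithmetic circuit 10. Arithmetic circuit 10 adds the difference data to the predicted image data from motion compensation circuit 12, and the resulting image data is stored in frame memory (FM) group 11. For an intra macroblock, arithmetic circuit 10 supplies the output of IDCT circuit 9 to frame memory group 11 unchanged.
[0022]
Next, a configuration example of an MPEG MP@ML decoder is described with reference to Fig. 16. The coded image data (bitstream) sent over the transmission path is received by a receiving circuit (not shown) or reproduced by a playback device, temporarily stored in reception buffer 21, and then supplied as coded data to variable length decoding circuit (IVLC) 22. Variable length decoding circuit 22 variable-length decodes the coded data supplied from reception buffer 21, outputs the motion vector and prediction mode to motion compensation circuit 27 and the quantization step to inverse quantization circuit (IQ) 23, and outputs the decoded quantized data to inverse quantization circuit 23.
[0023]
Inverse quantization circuit 23 inverse-quantizes the quantized data supplied from variable length decoding circuit 22 according to the quantization step also supplied from that circuit, and outputs the result (the DCT coefficients obtained by inverse quantization) to IDCT circuit 24. IDCT circuit 24 applies inverse DCT processing to these DCT coefficients, and the output data (difference data) is supplied to arithmetic circuit 25.
[0024]
When the data output from IDCT circuit 24 is I-picture data, it is output from arithmetic circuit 25 as image data and is supplied to and stored in frame memory group 26 so that predicted image data can be generated for image data (P- or B-picture data) input to arithmetic circuit 25 later. This image data is also output externally as the reproduced image. When the data output from IDCT circuit 24 is a P or B picture, motion compensation circuit 27 generates predicted image data from the image data stored in frame memory group 26, according to the motion vector and prediction mode supplied from variable length decoding circuit 22, and outputs it to arithmetic circuit 25. Arithmetic circuit 25 adds the output data (difference data) from IDCT circuit 24 to the predicted image data from motion compensation circuit 27 to produce the output image data. For a P picture, the output of arithmetic circuit 25 is also stored in frame memory group 26 as predicted image data and used as the reference image for the image signal decoded next.
[0025]
Besides MP@ML, MPEG defines various profiles and levels and provides various tools. Scalability is one such tool. MPEG introduces scalable coding schemes that realize scalability across different image sizes and frame rates. With spatial scalability, for example, decoding only the lower-layer bitstream yields an image signal of small image size, and decoding both the lower-layer and upper-layer bitstreams yields an image signal of large image size.
[0026]
A spatial scalability encoder is described with reference to Fig. 17. With spatial scalability, the lower layer corresponds to an image signal of small image size and the upper layer to one of large image size.
[0027]
The lower-layer image signal is first input to frame memory group 1 and coded in the same way as MP@ML. However, the output of arithmetic circuit 10 is not only supplied to frame memory group 11 and used as lower-layer predicted image data; it is also enlarged by image enlargement circuit (up sampling) 31 to the same image size as the upper layer and then used for upper-layer predicted image data as well.
[0028]
The upper-layer image signal is first input to frame memory group 51. Motion vector detection circuit 52 determines the motion vector and prediction mode as in MP@ML. Motion compensation circuit 62 generates predicted image data according to the motion vector and prediction mode determined by motion vector detection circuit 52 and outputs it to weighting circuit (W) 34. Weighting circuit 34 multiplies the predicted image data by weight W and outputs the weighted predicted image data to arithmetic circuit 33.
[0029]
As described above, the output data (image data) of arithmetic circuit 10 is input to frame memory group 11 and image enlargement circuit 31. Image enlargement circuit 31 enlarges the image data generated by arithmetic circuit 10 to the same size as the upper-layer image and outputs it to weighting circuit (1-W) 32. Weighting circuit 32 multiplies the output of image enlargement circuit 31 by weight (1-W) and outputs it to arithmetic circuit 33 as weighted predicted image data.
[0030]
Arithmetic circuit 33 adds the outputs of weighting circuits 32 and 34 and outputs the sum to arithmetic circuit 53 as predicted image data. The output of arithmetic circuit 33 is also input to arithmetic circuit 60, added to the output of inverse DCT circuit 59, and then input to frame memory group 61, where it is subsequently used as a prediction reference frame for image data to be coded. Arithmetic circuit 53 computes the difference between the image data to be coded and the output of arithmetic circuit 33 (predicted image data) and outputs it as difference data. For an intra-frame coded macroblock, however, arithmetic circuit 53 outputs the image data to be coded to DCT circuit 54 unchanged.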
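The upper-layer prediction formed by weighting circuits 32 and 34 and arithmetic circuit 33 can be sketched as follows. This is an illustrative sketch only: images are plain 2D lists, the enlargement is nearest-neighbour with an integer factor, and the function names are hypothetical. The patent does not specify the interpolation filter used by image enlargement circuit 31.

```python
def upsample_nearest(img, factor):
    """Enlarge a 2D list of samples by an integer factor (nearest neighbour).

    Stand-in for image enlargement circuit 31; the real filter is not
    specified in the text.
    """
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def blend_prediction(mc_pred, base_recon, w, factor=2):
    """Upper-layer prediction: W * MC prediction + (1 - W) * enlarged base.

    mc_pred    : motion-compensated upper-layer prediction (circuit 62)
    base_recon : locally decoded lower-layer image (output of circuit 10)
    w          : weight W, assumed here to lie between 0 and 1
    """
    up = upsample_nearest(base_recon, factor)
    return [[w * a + (1 - w) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(mc_pred, up)]
```

With W = 1 the prediction comes entirely from the upper layer's own motion compensation, and with W = 0 entirely from the enlarged lower layer; intermediate values mix the two.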
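[0031]
DCT circuit 54 applies DCT (discrete cosine transform) processing to the output of arithmetic circuit 53, generates DCT coefficients, and outputs them to quantization circuit 55. As in MP@ML, quantization circuit 55 quantizes the DCT coefficients according to a quantization scale determined from, among other things, the amount of data stored in transmission buffer 57, and outputs the quantized data to variable length coding circuit 56. Variable length coding circuit 56 variable-length codes the quantized data (quantized DCT coefficients) and outputs it via transmission buffer 57 as the upper-layer bitstream.
[0032]
The output of quantization circuit 55 is also inverse-quantized by inverse quantization circuit 58 with the quantization scale used in quantization circuit 55. The output of inverse quantization circuit 58 (the DCT coefficients obtained by inverse quantization) is supplied to IDCT circuit 59, inverse-DCT processed there, and input to arithmetic circuit 60. Arithmetic circuit 60 adds the output of arithmetic circuit 33 to the output of inverse DCT circuit 59 (difference data), and the result is input to frame memory group 61.
[0033]
Variable length coding circuit 56 also receives the motion vector and prediction mode detected by motion vector detection circuit 52, the quantization scale used in quantization circuit 55, and the weight W used in weighting circuits 34 and 32; each is coded and supplied to buffer 57 as coded data. The coded data is transmitted as a bitstream via buffer 57.
[0034]
Next, an example of a spatial scalability decoder is described with reference to Fig. 18. The lower-layer bitstream is input to reception buffer 21 and then decoded as in MP@ML. However, the output of arithmetic circuit 25 is not only output externally and stored in frame memory group 26 for use as predicted image data for image signals decoded later; it is also enlarged by image signal enlargement circuit 81 to the same image size as the upper-layer image signal and then used as upper-layer predicted image data.
[0035]
The upper-layer bitstream is supplied via reception buffer 71 to variable length decoding circuit 72, where the variable length code is decoded. The quantization scale, motion vector, prediction mode, and weighting coefficient are decoded together with the DCT coefficients. The quantized data decoded by variable length decoding circuit 72 is inverse-quantized in inverse quantization circuit 73 using the decoded quantization scale, and the resulting DCT coefficients are supplied to IDCT circuit 74. The DCT coefficients are inverse-DCT processed by IDCT circuit 74, and the output data is supplied to arithmetic circuit 75.
[0036]
Motion compensation circuit 77 generates predicted image data according to the decoded motion vector and prediction mode and inputs it to weighting circuit 84. Weighting circuit 84 multiplies the output of motion compensation circuit 77 by the decoded weight W and outputs the result to arithmetic circuit 83.
[0037]
The output of arithmetic circuit 25 is output as lower-layer reproduced image data and to frame memory group 26; at the same time it is enlarged by image signal enlargement circuit 81 to the same image size as the upper layer and output to weighting circuit 82. Weighting circuit 82 multiplies the output of image signal enlargement circuit 81 by (1-W), using the decoded weight W, and outputs the result to arithmetic circuit 83.
[0038]
Arithmetic circuit 83 adds the outputs of weighting circuits 84 and 82 and outputs the sum to arithmetic circuit 75. Arithmetic circuit 75 adds the output of IDCT circuit 74 to the output of arithmetic circuit 83, outputs the result as the upper-layer reproduced image, and supplies it to frame memory group 76 for use as predicted image data for image data decoded later.
[0039]
The above describes luminance signal processing; chrominance signals are processed in the same way. In that case, however, the motion vector used is the luminance motion vector halved in both the vertical and horizontal directions.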
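[0040]
The MPEG scheme has been described above, but various other high-efficiency moving-image coding schemes have been standardized. ITU-T, for example, specifies H.261 and H.263, mainly as coding schemes for communication. H.261 and H.263 also basically combine motion-compensated predictive coding with DCT transform coding, as MPEG does; details such as header information differ, but the image signal coding apparatus (encoder) and image signal decoding apparatus (decoder) have similar configurations.
[0041]
Within MPEG as well, standardization is under way on a new high-efficiency moving-image coding scheme called MPEG4. A major feature of MPEG4 is that an image can be coded in object units (divided into multiple images and coded) and manipulated. The decoding side combines the image signals of the objects, that is, multiple image signals, to reconstruct one image.
[0042]
An image composition system that combines multiple images into one uses, for example, a method called chroma key. A given object is shot in front of a background of a specific uniform color such as blue, the non-blue region is extracted, and it is composited onto another image. The signal indicating the extracted region is called the key signal.
[0043]
Next, a method of coding a composite image is shown with reference to Fig. 19. Image F1 is the background and image F2 is the foreground. Foreground F2 is an image generated by shooting in front of a background of a specific color and extracting the region that is not that color. The signal indicating the extracted region is key signal K1. Composite image F3 is composed from F1, F2, and K1. To code this image, F3 is normally coded as is with a scheme such as MPEG. Information such as the key signal is then lost, and re-editing or re-compositing the image, for example changing only background F1 while keeping foreground F2, becomes difficult.
[0044]
In contrast, as shown in Fig. 20, images F1 and F2 and key signal K1 can each be coded separately and their bitstreams multiplexed to form the bitstream of image F3.
[0045]
Fig. 21 shows how a bitstream formed as in Fig. 20 is decoded to obtain composite image F3. The bitstream is demultiplexed into the bitstreams of F1, F2, and K1, each of which is decoded to obtain decoded images F1' and F2' and decoded key signal K1'. Combining F1' and F2' according to key signal K1' then yields decoded composite image F3'. In this case re-editing and re-compositing, such as changing only background F1' while keeping foreground F2', is possible while the data remains in bitstream form.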
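The key-based composition of F1', F2', and K1' can be sketched as a per-sample blend. The patent does not give the blending formula; the sketch below assumes the key is normalized to the range 0 to 1 (1 = foreground) and uses the common linear mix. The function name and the list-of-lists image representation are illustrative.

```python
def composite(background, foreground, key):
    """Blend foreground over background, sample by sample.

    Assumed formula: F3 = K * F2 + (1 - K) * F1, with K in [0, 1].
    All three inputs are 2D lists of the same size.
    """
    return [[k * f + (1.0 - k) * b for b, f, k in zip(brow, frow, krow)]
            for brow, frow, krow in zip(background, foreground, key)]
```

Because the foreground and key are kept separate from the background, replacing the background only requires calling `composite` again with a different first argument, which is the re-editing benefit the paragraph above describes.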
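[0046]
As described above, in MPEG4 each image sequence making up a composite image, such as images F1 and F2, is called a VO (Video Object). The image frame of a VO at a given time is called a VOP (Video Object Plane). A VOP consists of luminance and chrominance signals and a key signal. An image frame means one image at a given time, and an image sequence means a set of image frames at different times. That is, each VO is a set of VOPs at different times. The size and position of each VO vary over time; that is, VOPs belonging to the same VO can differ in size and position.
[0047]
Figs. 22 and 23 show the configurations of an encoder and decoder that code and decode in object units as described above. Fig. 22 is an example of an encoder. The input image signal is first input to VO construction circuit 101. VO construction circuit 101 divides the input image by object and outputs an image signal representing each object (VO). The image signal of each VO consists of an image signal and a key signal. The image signals output from VO construction circuit 101 are output, VO by VO, to VOP construction circuits 102-0 to 102-n. For example, the image signal and key signal of VO 0 are input to VOP construction circuit 102-0, those of VO 1 to VOP construction circuit 102-1, and likewise those of VO n to VOP construction circuit 102-n.
[0048]
In VO construction circuit 101, for image signals generated by chroma key as shown in Fig. 20, for example, each VO consists of the image signal and its key signal as is. For images that have no key signal, or whose key signal has been lost, the image is segmented into regions, a given region is extracted, a key signal is generated, and the result is made a VO.
[0049]
VOP construction circuits 102-0 to 102-n extract from each image frame the smallest rectangle that contains the object in the image. The numbers of pixels in the rectangle's horizontal and vertical directions are, however, made multiples of 16. VOP construction circuits 102-0 to 102-n extract the image signal (luminance and chrominance signals) and key signal contained in that rectangle and output them. They also output a flag indicating the size of the VOP (VOP size) and a flag indicating the position of the VOP in absolute coordinates (VOP POS).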
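The rectangle extraction performed by the VOP construction circuits can be sketched as below. This is a sketch under assumptions: the object is taken to be wherever the key signal is nonzero, the rectangle is anchored at the object's top-left corner and grown to the right and downward to reach a multiple of 16, and clipping against the frame boundary is ignored. The function name and return format are hypothetical.

```python
def vop_rectangle(key, block=16):
    """Return ((left, top), (width, height)) for the object in `key`.

    key   : 2D list, nonzero where the object is present
    block : width and height are rounded up to a multiple of this value
    Returns None if the key signal is empty.
    """
    rows = [y for y, row in enumerate(key) if any(row)]
    cols = [x for row in key for x, v in enumerate(row) if v]
    if not rows:
        return None
    top, left = min(rows), min(cols)
    height = max(rows) - top + 1
    width = max(cols) - left + 1
    # Round each dimension up to the next multiple of `block`.
    width = -(-width // block) * block
    height = -(-height // block) * block
    return (left, top), (width, height)
```

The position part of the result plays the role of VOP POS and the size part the role of VOP size.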
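[0050]
The outputs of VOP construction circuits 102-0 to 102-n are input to VOP coding circuits 103-0 to 103-n and coded. The outputs of VOP coding circuits 103-0 to 103-n are input to multiplexing circuit 104, formed into one bitstream, and output externally.
[0051]
Fig. 23 shows an example of a decoder. The multiplexed bitstream is demultiplexed by demultiplexing circuit 111 into the bitstream of each VO. The bitstream of each VO is input to the corresponding VOP decoding circuit 112-0 to 112-n and decoded. VOP decoding circuits 112-0 to 112-n decode each VOP's image signal, key signal, size flag (VOP size), and absolute-coordinate position flag (VOP POS), and input them to image reconstruction circuit 113. Image reconstruction circuit 113 uses each VOP's image signal, key signal, size flag (VOP size), and position flag (VOP POS) to composite the image and outputs the reproduced image.
[0052]
Next, an example of VOP coding circuit 103-0 is described with reference to Fig. 24 (the other VOP coding circuits 103-1 to 103-n are configured the same way). The image signal and key signal making up each VOP are input to image signal coding circuit 121 and key signal coding circuit 122, respectively. Image signal coding circuit 121 codes with a scheme such as MPEG or H.263. Key signal coding circuit 122 codes with, for example, DPCM. Another way to code the key signal is to perform motion compensation using the motion vector detected by image signal coding circuit 121 and code the difference signal. The number of bits generated by key signal coding is input to image signal coding circuit 121, and coding is controlled so that a predetermined bit rate is achieved.
[0053]
The bitstream of the coded image signal (motion vectors and texture information) and the bitstream of the key signal are input to multiplexing circuit 123, formed into one bitstream, and output via transmission buffer 124.
[0054]
Fig. 25 shows a configuration example of VOP decoding circuit 112-0 (the other VOP decoding circuits 112-1 to 112-n are configured the same way). The bitstream is first input to demultiplexing circuit 131 and separated into the image signal bitstream (motion vectors and texture information) and the key signal bitstream, which are decoded by image signal decoding circuit 132 and key signal decoding circuit 133, respectively. If the key signal was coded with motion compensation, the motion vector decoded by image signal decoding circuit 132 is input to key signal decoding circuit 133 and used for decoding.
[0055]
The above describes a method of coding an image VOP by VOP; this scheme is currently being standardized as MPEG4 in ISO-IEC/JTC1/SC29/WG11. However, no method for coding each VOP efficiently as described above has been established yet, nor have functions such as scalability.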
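[0056]
A method of scalably coding an image in object units is described below. As described above, rendering circuit 155 applies the texture stored in texture memory 152 to a polygon regardless of the image format, whether it is a moving or still image, and its content. What can be applied to one polygon is always a single texture held in memory; multiple textures cannot be applied to one polygon. In most cases, moreover, an image is transmitted in compressed form, and the terminal decodes the compressed bitstream and then stores the result in a given texture-mapping memory.
[0057]
Conventionally, decoding a bitstream always produces one image signal. For example, decoding an MPEG MP@ML bitstream yields one image sequence. With MPEG2 scalability, decoding the lower-layer bitstream yields a low-quality image, and decoding the lower- and upper-layer bitstreams yields a high-quality image signal. In either case one image sequence is decoded.
[0058]
The situation differs for schemes such as MPEG4 that code an image in object units. One object may be made up of multiple bitstreams, in which case multiple images are obtained, one per bitstream. As a result, a texture cannot be applied to a 3D object described in, for example, VRML.
[0059]
One way to solve this is to assign one VRML node (polygon) to one image object (VO). In the case of Fig. 21, for example, background F1' could be assigned to one node, and foreground F2' and key signal K1' to one node. However, when one image object is made up of multiple bitstreams and multiple images are generated at decoding, the following problems arise. They are described with reference to Figs. 26 to 31.
[0060]
Three-layer scalable coding is taken as an example. With three-layer scalable coding there are, besides the lower layer (base layer), two upper layers: a first upper layer (enhancement layer 1, hereinafter called upper layer 1 where appropriate) and a second upper layer (enhancement layer 2, hereinafter called upper layer 2 where appropriate). Compared with an image decoded up to the first upper layer, an image decoded up to the second upper layer has higher quality. Here, higher quality means spatial resolution for spatial scalable coding, frame rate for temporal scalable coding, and image SNR for SNR (Signal to Noise Ratio) scalable coding.
[0061]
In MPEG4, which codes in object units, the first and second upper layers are related in one of the following ways.
(1) The second upper layer includes the whole region of the first upper layer.
(2) The second upper layer corresponds to part of the region of the first upper layer.
(3) The second upper layer corresponds to a wider region than the first upper layer.
[0062]
Relation (3) arises with scalable coding of three or more layers. It is the case where the first upper layer corresponds to part of the lower layer and the second upper layer includes the whole region of the lower layer, or where the first upper layer corresponds to part of the lower layer and the second upper layer corresponds to a wider region than the first upper layer but still only part of the lower layer. With relation (3), decoding up to the first upper layer improves the quality of only part of the lower-layer image, and decoding up to the second upper layer improves the quality of a wider region or of the whole lower-layer image. In relation (3), the VOP shape may be rectangular or arbitrary.
[0063]
Figs. 26 to 31 show examples of three-layer spatial scalable coding. Fig. 26 shows an example of spatial scalability in relation (1) where all VOP shapes are rectangular. Fig. 27 shows an example of spatial scalability in relation (2) where the VOP shapes are rectangular. Fig. 28 shows an example of spatial scalability in relation (3) where the VOPs of all layers are rectangular. Fig. 29 shows an example of spatial scalability in relation (3) where the VOP of the first upper layer has an arbitrary shape and the VOPs of the lower layer and second upper layer are rectangular. Figs. 30 and 31 show examples of spatial scalability in relation (1) where the VOP shapes are rectangular and arbitrary, respectively.
[0064]
When the quality of the whole image improves, as in Fig. 26, the situation is the same as with conventional scalable coding such as MPEG2, and it suffices to display the single highest-quality image. In MPEG4, which codes in object units, however, cases such as those in Figs. 27, 28, and 29 exist. In the case of Fig. 27, for example, when the bitstreams of the lower layer and upper layers 1 and 2 are decoded, the images of the lower layer and upper layer 1 are resolution-converted, and the two converted image sequences are then composited with the decoded image sequence of upper layer 2 to reconstruct the whole image. In the case of Fig. 29, only upper layer 1 and the lower layer may be decoded, with only the image of upper layer 1 output and composited with another image sequence decoded from other bitstreams.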
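[0065]
[Problems to be Solved by the Invention]
As described above, when an image is coded in object units, simply assigning one node to one object has the problem that, if multiple images are generated for one object, those images cannot be applied to the object as a texture.
[0066]
The present invention was made in view of this situation and makes it possible to reliably apply the images to an object as a texture even when multiple images are generated for one object.
[0067]
[Means for Solving the Problems]
The image signal multiplexing apparatus of the present invention comprises: selection means for selecting a scene descriptor representing spatial configuration information that describes a given object, and for selecting, from among scalably coded bitstreams of multiple layers, the bitstreams that make up the given object; generation means for generating an object descriptor representing information about the object made up of the selected bitstreams; and multiplexing means for multiplexing a start code, the selected scene descriptor and bitstreams, and the generated object descriptor, and outputting them in the order start code, scene descriptor, a given number of object descriptors, a given number of bitstreams.
[0068]
The image signal multiplexing method of the present invention includes: a selection step of selecting a scene descriptor representing spatial configuration information that describes a given object, and of selecting, from among scalably coded bitstreams of multiple layers, the bitstreams that make up the given object; a generation step of generating an object descriptor representing information about the object made up of the selected bitstreams; and a multiplexing step of multiplexing a start code, the selected scene descriptor and bitstreams, and the generated object descriptor, and outputting them in the order start code, scene descriptor, a given number of object descriptors, a given number of bitstreams.
[0069]
The recording medium of the present invention records a program that causes a computer to execute processing including: a selection step of selecting a scene descriptor representing spatial configuration information that describes a given object, and of selecting, from among scalably coded bitstreams of multiple layers, the bitstreams that make up the given object; a generation step of generating an object descriptor representing information about the object made up of the selected bitstreams; and a multiplexing step of multiplexing a start code, the selected scene descriptor and bitstreams, and the generated object descriptor, and outputting them in the order start code, scene descriptor, a given number of object descriptors, a given number of bitstreams.
[0078]
The transmission medium according to claim 20 is characterized by transmitting a program that includes: a separation step of separating the spatial configuration information, the bitstreams of multiple layers making up the object, and the information about the object from a multiplexed bitstream in which are multiplexed and transmitted spatial configuration information describing an object, bitstreams of multiple layers having different qualities that make up the object, and information about the object including at least dependency information indicating the information dependencies between different bitstreams; a control step of controlling the processing in the separation step, based on the dependency information, so as to select the spatial configuration information describing a given object or the bitstreams of multiple layers making up the object; an analysis step of analyzing the selected spatial configuration information; a decoding step of decoding the bitstreams of multiple layers; a mixing step of mixing, among the output signals decoded in the decoding step, those corresponding to the same object; and a reconstruction step of reconstructing an image signal, based on the information about the object, from the output data analyzed in the analysis step and the output signal mixed in the mixing step.
[0079]
In the image signal multiplexing apparatus and method and the program of the recording medium of the present invention, a scene descriptor representing spatial configuration information that describes a given object is selected; from among scalably coded bitstreams of multiple layers, the bitstreams that make up the given object are selected; an object descriptor representing information about the object made up of the selected bitstreams is generated; and a start code, the selected scene descriptor and bitstreams, and the generated object descriptor are multiplexed and output in the order start code, scene descriptor, a given number of object descriptors, a given number of bitstreams.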
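[0083]
[Embodiments of the Invention]
Embodiments of the present invention are described below. First, the bitstream multiplexing apparatus and demultiplexing apparatus of the first embodiment are described with reference to Fig. 1. In the following description, the coded audio and video bitstreams (Elementary Streams, ES) are assumed to have been recorded in advance in a given storage device 202, but the bitstreams may instead be input directly from video and audio encoders to multiplexing circuit 203 without passing through storage device 202. The coding and decoding scheme is assumed below to be MPEG4, but the description applies equally to any scheme that divides an image into multiple images and codes them.
[0084]
Storage device 202 holds the bitstreams ES (Elementary Streams) corresponding to each AV (audio and video) object, the object stream information OI needed to decode each bitstream, and a scene descriptor SD (Scene Descriptor) describing a 2D or 3D scene (a virtual space made up of the transmitted images). The object stream information OI includes, for example, the buffer size needed for decoding and the time stamp of each access unit (frame or VOP). Details are given later.
[0085]
The object information OI lists all the information on the bitstreams ES corresponding to each AV (audio and video) object. Object descriptor generation circuit 204 generates an object descriptor OD (Object Descriptor) corresponding to the object information OI supplied from storage device 202.
[0086]
Multiplexing circuit 203 multiplexes, in a predetermined order, the bitstreams ES and scene descriptor SD recorded in storage device 202 and the object descriptor OD supplied from object descriptor generation circuit 204, and transmits multiplexed bitstream FS.
[0087]
The makeup of the bitstreams forming each object is now described. A scene such as that shown in Fig. 21, for example, consists of two objects, background F1' and foreground F2'. Key signal K1' and foreground F2', however, form one bitstream ES. The case of Fig. 21 therefore consists of two video objects VO, and if scalable coding is not used, each VO consists of one bitstream ES.
[0088]
In the cases of Figs. 26 to 29, the frame consists of one video object VO. Because scalable coding is used, however, that one video object VO consists of three bitstreams ES. Figs. 26 to 29 show three-layer scalable coding, but the number of layers is arbitrary.
[0089]
In Figs. 30 and 31, the scene consists of two video objects VO, a background (Fig. 30) and a foreground (Fig. 31), and each video object VO consists of three bitstreams ES.
[0090]
By sending a request signal from the terminal, the user can freely set which video objects are displayed and, with scalable coding, which layers are displayed.
[0091]
In the embodiment of Fig. 1, the user sends a request signal REQ that specifies the required video objects and bitstreams from an external terminal (not shown) to the sending side. Request signal REQ is supplied to stream control circuit 201. The object stream information OI for the bitstreams of each video object is recorded in storage device 202. As described above, the object stream information OI includes, for example, information indicating how many bitstreams make up a given object, the information needed to decode each bitstream, the buffer size, and which other bitstreams are needed for decoding.
[0092]
Stream control circuit 201 refers to the object stream information OI supplied from storage device 202 in accordance with request signal REQ, decides which bitstreams to transmit, and supplies stream request signal SREQ to multiplexing circuit 203, storage device 202, and object descriptor generation circuit 204. Storage device 202 reads out the given bitstreams ES and scene descriptor SD according to stream request signal SREQ and outputs them to multiplexing circuit 203.
[0093]
Object descriptor generation circuit 204 reads, according to stream request signal SREQ, the object stream information OI on the bitstreams of each object (VO) recorded in storage device 202, and extracts as the object descriptor OD only the information on the bitstreams requested by SREQ. It also generates an ID number OD_ID indicating which object the descriptor corresponds to and writes it into the object descriptor OD. For example, in the case of Fig. 26, when only the lower layer and upper layer 1 are requested for a given object, object descriptor generation circuit 204 extracts only the information on the lower layer and upper layer 1 from the object stream information OI to form the object descriptor OD, generates the ID number OD_ID identifying that object, and writes it into the object descriptor OD. The object descriptor OD generated this way is supplied to multiplexing circuit 203. The syntax of the object descriptor OD and object stream information OI, and the details of the scene descriptor SD, are described later.
[0094]
Next, the operation of multiplexing circuit 203 is described with reference to Fig. 2. Multiplexing circuit 203 is supplied with the bitstreams ES1 to ESn to be transmitted, according to stream request signal SREQ. Bitstreams ES1 to ESn are supplied to switch 231. The scene descriptor SD and object descriptors OD are likewise supplied to switch 231. Multiplexing circuit 203 also has start code generation circuit 232, and the start code it generates is supplied to switch 231 as well. Switch 231 switches its connection in a predetermined order and outputs the resulting data externally as multiplexed bitstream FS.
[0095]
As multiplexed bitstream FS, the start code generated by start code generation circuit 232 is output first. The connection of switch 231 is then switched and the scene descriptor SD is output. After the scene descriptor SD has been output, the connection is switched again and the object descriptors OD are output. There is one object descriptor OD per object, so as many object descriptors as there are objects are output. Fig. 2 shows the case of three objects. After the object descriptors OD have been output, the connection of switch 231 is switched again, and bitstreams ES1 to ESn are selected in turn, a predetermined data size at a time, and output. As shown in Fig. 1, the multiplexed bitstream FS passes through a given transmission path and is supplied to demultiplexing circuit 205.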
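The output order produced by switch 231 (start code, scene descriptor SD, one object descriptor OD per object, then the elementary streams in fixed-size pieces) can be sketched as follows. The sketch returns the ordered list of segments rather than one byte string, because the text does not specify how segment boundaries are framed. The start code value, the function name, and the round-robin chunking policy are assumptions for illustration.

```python
START_CODE = b"\x00\x00\x01\xb0"  # hypothetical value, not from the patent


def multiplex(scene_descriptor, object_descriptors, streams, chunk_size=4):
    """Return the segments of multiplexed bitstream FS in output order.

    scene_descriptor   : bytes for SD
    object_descriptors : list of bytes, one OD per object
    streams            : list of bytes, ES1..ESn
    chunk_size         : the "predetermined data size" taken from each ES
                         per turn (illustrative value)
    """
    out = [START_CODE, scene_descriptor]
    out.extend(object_descriptors)
    offset = 0
    # Visit the streams in turn, taking chunk_size bytes from each,
    # until every stream is exhausted.
    while any(offset < len(es) for es in streams):
        for es in streams:
            chunk = es[offset:offset + chunk_size]
            if chunk:
                out.append(chunk)
        offset += chunk_size
    return out
```

A practical multiplexer would also need some way to mark which ES each chunk belongs to and where each descriptor ends; that framing is outside what this passage describes.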
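[0096]
Next, demultiplexing circuit 205 is described in detail with reference to Fig. 3. Multiplexed bitstream FS is first supplied to switch 241. Switch 241 first detects the start code and thereby recognizes the data that follow. After the start code is detected, the scene descriptor SD is read out from switch 241 and output. The connection of switch 241 is then changed and the object descriptors OD are read out and output. There are as many object descriptors OD as objects, and they are output in sequence. After all object descriptors OD have been output, the connection of switch 241 is changed again, and bitstreams ES1 to ESn are read out according to the predetermined connections and output.
[0097]
As shown in Fig. 1, the scene descriptor SD that was read out is supplied to syntax analysis circuit (parser) 208 and analyzed. The parsed scene description is supplied to reconstruction circuit 209 as 3D object information. The 3D object information actually consists of information on nodes, polygons, and so on, but is referred to below simply as nodes where appropriate.
[0098]
As shown in Fig. 1, the object descriptors OD that were read out are supplied to syntax analysis circuit (parser) 206 and analyzed. Syntax analysis circuit 206 identifies the types and number of decoders needed and has demultiplexing circuit 205 supply bitstreams ES1 to ESn to the required decoders 207-1 to 207-n. The buffer size and other information needed to decode each bitstream are read from the object descriptor OD and output from syntax analysis circuit 206 to decoders 207-1 to 207-n. Each decoder 207-1 to 207-n is initialized based on the initialization information, such as buffer size, supplied from syntax analysis circuit 206 (that is, conveyed by the object descriptor OD). To identify which object each bitstream ES1 to ESn belongs to, syntax analysis circuit 206 also reads the ID number OD_ID of each object descriptor OD. The ID number OD_ID of each object descriptor OD is then output from syntax analysis circuit 206 to the decoders 207-1 to 207-n that decode the bitstreams listed in that object descriptor.
[0099]
Each decoder 207-1 to 207-n decodes its bitstream by the decoding method corresponding to the encoding and outputs the video or audio signal to reconstruction circuit 209. Each decoder also outputs to reconstruction circuit 209 the ID number OD_ID indicating which object the image belongs to. For an image signal, each decoder also decodes from the bitstream the signals indicating its position and size (POS, SZ) and outputs them to reconstruction circuit 209. For an image signal, each decoder further decodes from the bitstream the signal indicating transparency (key signal) and outputs it to reconstruction circuit 209.
[0100]
Next, the correspondence between the signals used to reconstruct the image, and reconstruction circuit 209, are described with reference to Figs. 4 and 5. Fig. 4 shows an example without scalable coding, and Fig. 5 an example with scalable coding.
[0101]
In Fig. 4, reconstruction circuit 209 consists of composition circuit 252, and the image signal generated by composition circuit 252 is supplied to display 251 and displayed. In Fig. 4, composition circuit 252 and display 251 are drawn together as reconstruction circuit 209, but this is only to show how the image built by composition circuit 252 appears on display 251; reconstruction circuit 209 does not actually include the display.
[0102]
In Fig. 4, the screen of display 251 shows a rectangular image sequence and a triangular pyramid generated by CG. A decoded texture is also applied to the triangular pyramid object. The texture may be a moving image or a still image.
[0103]
Fig. 4 shows the correspondence between the scene descriptor SD and the output screen. A descriptor such as VRML is used as the scene descriptor SD, for example. The scene descriptor SD consists of groups of descriptions called nodes. There is a parent (root) node SD0 that describes how each object is arranged in the whole image. As its child there is node SD1, which describes information about the triangular pyramid. Node SD2, also a child of root node SD0, holds information about the rectangular plane onto which an image is applied. In the example of Fig. 4 the image signal consists of three video objects VO. Information about the background, the first video object VO, is held in node SD2. Information about the plane onto which the sun, the second video object VO, is applied is held in node SD3. Information about the plane onto which the person, the third video object VO, is applied is held in node SD4. Nodes SD3 and SD4 are children of node SD2.
[0104]
Nodes SD0 to SD4 thus make up one scene descriptor SD. Each node SD0 to SD4 corresponds to one 3D or 2D object. In the example of Fig. 4, node SD0 corresponds to the object for the whole scene, node SD1 to the triangular pyramid, node SD2 to the background, node SD3 to the sun, and node SD4 to the person. To apply a texture to each node, a flag indicating which bitstream corresponds to each node is needed. To identify this, each node records the ID number OD_ID of the object descriptor supplied from the decoder of the corresponding bitstream. One object descriptor OD thereby corresponds to one node, and so one video object VO is applied to one 2D or 3D object.
[0105]
Each node SD0 to SD4 of the scene descriptor SD is interpreted by syntax analysis circuit 208 and supplied as 3D object information to composition circuit 252 of reconstruction circuit 209. Decoders 207-1 to 207-4 are supplied with bitstreams ES1 to ES4 from demultiplexing circuit 205 and with the ID number OD_ID of the corresponding object descriptor OD from syntax analysis circuit 206. After decoding its bitstream, each decoder 207-1 to 207-4 supplies to composition circuit 252 of reconstruction circuit 209, as decoded signals, the ID number OD_ID and the decoded signal (image or audio) and, for an image, the key signal and the signals indicating the image's position and size (POS, SZ). Here, the position of an image means its position relative to the parent node one level above the node it belongs to.
[0106]
Fig. 6 shows a configuration example of composition circuit 252. Parts in Fig. 6 that correspond to those in Fig. 14 carry the same reference numerals. The input 3D object information (including nodes SD0 to SD4 and polygon information), image signal (Texture), key signal, ID number OD_ID, and position and size signals (POS, SZ) are supplied to object composition circuits 271-1 to 271-n. One object composition circuit 271-i corresponds to one node SDi (i = 1, 2, 3, ..., n). Object composition circuit 271-i receives from decoder 207-i the decoded signal that has the ID number OD_ID given in node SDi and, for an image signal, applies it to the 2D or 3D object it generates. As noted above, when an ID number OD_ID and decoded signal are supplied to the corresponding object composition circuit 271-i, the node to which each decoded signal corresponds must be found. The correspondence is therefore recognized by matching the ID number OD_ID supplied to reconstruction circuit 209 against the ID number OD_ID contained in each node. Based on the result, the decoded signal is supplied to the object composition circuit 271-i that receives the corresponding node.
[0107]
The texture to be applied (image signal), the signal representing its transparency (key signal), and the signals indicating its position and size (POS, SZ), supplied from decoder 207-i, are stored in predetermined areas of memory group 151-i. Likewise, the node (2D or 3D object information) supplied from syntax analysis circuit 208 is stored in a predetermined storage area of memory group 151-i. The texture (image signal) is stored in texture memory 152-i, the transparency signal (key signal) and ID number OD_ID in grayscale memory 153-i, and the node in 3D information memory 154-i. The ID number OD_ID is supplied and used to identify the object. The position and size signals (POS, SZ) may be stored in any of the memories; in this example they are stored in grayscale memory 153-i. Here, 3D object information means polygon formation information, lighting information, and so on. The position and size signals are stored at a predetermined location in memory group 151-i.
[0108]
Rendering circuit 155-i forms a 2D or 3D object from polygons based on the node stored in 3D information memory 154-i. It reads the given texture and transparency signal from texture memory 152-i and grayscale memory 153-i and applies them to the generated 3D object. The transparency signal gives the transparency of the texture at the corresponding position, and therefore the transparency of the object at the position where that texture is applied. Rendering circuit 155-i supplies the textured signal to 2D conversion circuit 156. The signals indicating the image's position and size (position relative to the parent node) are likewise read from their location in memory group 151-i (grayscale memory 153-i in this example) and output to 2D conversion circuit 156.
[0109]
2D conversion circuit 156 is supplied with textured 2D or 3D objects from object composition circuits 271-1 to 271-n, one per node. Based on externally supplied viewpoint information and the position and size signals (POS, SZ), it projects the 3D objects onto a 2D plane and converts them into a 2D image signal. The 3D objects converted into a 2D image signal are then output to display 251 and displayed. When all objects are 2D objects, the outputs of rendering circuits 155-1 to 155-n are composited according to their transparency (key signal) and position and size signals and output. No viewpoint conversion is performed in this case.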
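[0110]
Next, the example of Fig. 5, in which scalable coding is used, is described. Here reconstruction circuit 209 consists of mixing circuit 261 and composition circuit 252, and the image signal they generate is supplied to display 251 and displayed. As in Fig. 4, mixing circuit 261, composition circuit 252, and display 251 are drawn together as reconstruction circuit 209 in Fig. 5, but this is only to show how the image built by mixing circuit 261 and composition circuit 252 appears on display 251; reconstruction circuit 209 does not actually include the display. In the example of Fig. 5, display 251 shows a rectangular image sequence and a triangular pyramid generated by CG. A decoded texture is also applied to the triangular pyramid object. The texture may be a moving image or a still image.
[0111]
Fig. 5 shows the correspondence between the scene descriptor SD and the output screen. In Fig. 5 there is a parent node SD0 that describes how each object is arranged in the whole image. Its children are node SD1, which describes information about the triangular pyramid, and node SD2, which describes information about the rectangular plane onto which an image is applied. Unlike the example of Fig. 4, the image signal corresponding to node SD2 in Fig. 5 consists of one video object VO. In Fig. 5, however, the image corresponding to node SD2 is coded with three-layer scalable coding, and the video object VO is assumed to consist of three video object layers. Fig. 5 illustrates three layers, but the number of layers is arbitrary.
[0112]
Each node SD0 to SD2 of the scene descriptor SD is interpreted by syntax analysis circuit 208 and the result is supplied to composition circuit 252. Decoders 207-1 to 207-4 are supplied with bitstreams ES1 to ESn from demultiplexing circuit 205 and with the ID number OD_ID of the corresponding object descriptor OD from syntax analysis circuit 206. After decoding its bitstream, each decoder 207-1 to 207-4 supplies to mixing circuit 261 the decoded signal and, for an image, the key signal, the signals indicating the image's position and size (POS, SZ), and a signal FR indicating the scaling factor. Here, the position of an image means the relative position of each layer within the same video object VO. Each decoder 207-1 to 207-4 also supplies the ID number OD_ID to composition circuit 252. The configuration of composition circuit 252 is the same as that shown in Fig. 6, so its description is omitted here.
[0113]
As noted above, when an ID number OD_ID and decoded signal are supplied to the corresponding object composition circuit 271-i, the node to which each decoded signal corresponds must be found. The correspondence is therefore recognized by matching the ID number OD_ID supplied to reconstruction circuit 209 against the ID number OD_ID contained in each node. Based on the result, the decoded signal is supplied to the object composition circuit 271-i that receives the corresponding node.
[0114]
With scalable coding, the bitstreams of the layers (VOL) belong to the same video object VO and therefore have the same ID number OD_ID. One video object VO corresponds to one node, and correspondingly to one texture memory 152-i in composition circuit 252. With scalable coding, therefore, the outputs of the layers (the outputs of decoders 207-2 to 207-4) are first supplied to mixing circuit 261 and combined into one image sequence.
[0115]
Mixing circuit 261 first composites the images of the layers based on the image signals, key signals, scaling-factor signals, and position and size signals supplied from decoders 207-2 to 207-4, and then outputs the result to composition circuit 252. Composition circuit 252 can therefore associate one image sequence with one object.
[0116]
For example, when scalable coding as in Fig. 29 is used and the lower layer and upper layer 1 are transmitted and decoded, the lower-layer image signal is resolution-converted based on scaling-factor signal FR. The decoded image of upper layer 1 is then composited onto this image at the corresponding position according to the key signal.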
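The two-layer case handled by mixing circuit 261 (enlarge the lower layer by FR, then overlay upper layer 1 at its position according to the key signal) can be sketched as below. The sketch assumes an integer scaling factor, nearest-neighbour enlargement, a key normalized to 0 to 1, and a position given as (x, y) in the enlarged image; none of these details are fixed by the text, and the function name is hypothetical.

```python
def mix_layers(base, fr, enh, key, pos):
    """Combine a lower-layer image with one upper-layer image.

    base : 2D list, decoded lower-layer image
    fr   : integer scaling factor (signal FR), assumed integral here
    enh  : 2D list, decoded upper-layer image
    key  : 2D list, same size as enh, 1 where enh should replace the base
    pos  : (x, y) of enh's top-left corner in the enlarged image
    """
    # Resolution conversion of the lower layer (nearest neighbour).
    canvas = []
    for row in base:
        wide = [v for v in row for _ in range(fr)]
        canvas.extend([list(wide) for _ in range(fr)])
    # Overlay the upper layer where the key signal says so.
    x0, y0 = pos
    for y, (erow, krow) in enumerate(zip(enh, key)):
        for x, (e, k) in enumerate(zip(erow, krow)):
            old = canvas[y0 + y][x0 + x]
            canvas[y0 + y][x0 + x] = k * e + (1 - k) * old
    return canvas
```

The result is a single image per time instant, which is what allows one texture memory per node to be sufficient even though several layers were decoded.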
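[0117]
The image sequence composited by mixing circuit 261 is supplied to composition circuit 252. Composition circuit 252 builds the image as in the case of Fig. 4 and outputs it to display 251 to obtain the final output image.
[0118]
Thus, in this example, one object (for video, a video object VO) is assigned to one node, and mixing circuit 261 is placed ahead of memory group 151, which stores the textures, 3D information, and so on for rendering circuit 155. Multiple images are mixed according to given key signals and then recorded in texture memory 152, which makes it possible to texture-map an image signal made up of multiple resolutions.
[0119]
Also, in the example of Fig. 1, a descriptor recording the system information of the bitstreams that make up an object is generated for that object; only the information on bitstreams that must be decoded is stored in it, and all bitstreams listed in the descriptor are decoded. This identifies a decodable combination of bitstreams and makes it possible to decode the given signal. In this case, the descriptor is generated and transmitted one-to-one between the sending side and the receiving side.
[0120]
Next, Figs. 7 to 9 show the structure of the object descriptor OD. Fig. 7 shows the overall structure (syntax) of the object descriptor OD.
[0121]
Node ID is a 10-bit flag giving the ID number of the descriptor. It corresponds to OD_ID above. streamCount is an 8-bit flag giving the number of bitstreams ES contained in the object descriptor. That many ES_Descriptor items, which carry the information needed to decode each bitstream ES, are transmitted. extentionFlag is a flag indicating whether other descriptors are transmitted; if its value is 1, other descriptors are transmitted.
[0122]
ES_Descriptor is a descriptor giving information about each bitstream. Fig. 8 shows the structure (syntax) of ES_Descriptor. ES_Number is a 5-bit flag giving the ID number that identifies the bitstream. streamType is an 8-bit flag giving the format of the bitstream, for example MPEG2 video. QoS_Descriptor is an 8-bit flag giving the requirements placed on the network for transmission.
[0123]
ESConfigParams is a descriptor holding the information needed to decode the bitstream; Fig. 9 shows its structure (syntax). Details of ESConfigParam are given in the MPEG4 System VM.
[0124]
Fig. 10 shows the scene descriptor for applying a moving image. SFObjectID is a flag giving the ID number OD_ID, which is the ID of the object descriptor of the texture to be applied. Fig. 11 shows the scene descriptor for applying a still image. SFObjectID is a flag giving the ID number OD_ID of the object descriptor of the texture to be applied. The formats of Figs. 10 and 11 follow VRML node descriptions.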
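[0125]
Next, Fig. 12 shows the bitstream multiplexing apparatus and demultiplexing apparatus of the second embodiment. In this embodiment, all bitstreams belonging to an object are multiplexed and transmitted. In the first embodiment of Fig. 1, only the bitstreams requested by the receiving side were multiplexed and transmitted, and the object descriptor OD was generated to match the bitstreams transmitted. Because the receiving side decoded all bitstreams listed in the object descriptor OD, there was no particular need to transmit the information dependencies between bitstreams.
[0126]
In the second embodiment, by contrast, the object descriptors OD are stored in storage device 202 in advance, and the sending side multiplexes and transmits all bitstreams recorded in the object descriptor OD. The object descriptor OD of the second embodiment differs from that of the first in that it records the information dependencies between bitstreams. In other respects it is the same as in the first embodiment.
[0127]
Multiplexing circuit 203 reads the scene descriptor SD, object descriptors OD, and bitstreams ES recorded in storage device 202, multiplexes them in a predetermined order, and transmits them. The transmission order and the configuration of multiplexing circuit 203 are the same as in the first embodiment. Multiplexed bitstream FS is supplied via the transmission path to demultiplexing circuit 205.
[0128]
The user inputs from the terminal a request signal REQ indicating which objects to display. Request signal REQ is supplied to demultiplexing circuit 205, syntax analysis circuit 206, and reconstruction circuit 209. Syntax analysis circuit 206 analyzes each transmitted object descriptor OD, generates a signal SREQ requesting the required bitstreams, and supplies it to demultiplexing circuit 205. When the user requests a given bitstream, the object descriptor OD records whether other bitstreams are needed to decode it and, if so, which ones.
[0129]
Demultiplexing circuit 205 supplies only the required bitstreams to decoders 207-1 to 207-n according to the user's request signal REQ and the signal SREQ requesting the required bitstreams, and supplies the required object descriptors OD to syntax analysis circuit 206. Syntax analysis circuit 206 analyzes the object descriptors OD and, based on them and the user's request signal REQ, outputs the initialization information and ID number OD_ID to each decoder 207-1 to 207-n. Decoding, composition, and display then proceed as in the first embodiment.
[0130]
Thus in this example a descriptor (object descriptor) recording the system information of the bitstreams that make up an object is generated for that object, with flags indicating which bitstreams are needed to decode each bitstream. Decoding the given bitstreams according to the flags in the descriptor identifies a decodable combination of bitstreams and makes it possible to decode the given signal. In this case the descriptor is generated once at the sending side, and the same descriptor is transmitted to all receivers.
[0131]
In the second embodiment, unlike the first, the object descriptor OD records information for identifying the other bitstreams needed to decode a given bitstream. The object descriptor OD of the second embodiment is now described. Its overall structure is the same as in the first embodiment, shown in Fig. 7.
[0132]
Fig. 13 shows the ES_Descriptor that describes information about each bitstream. isOtherStream is a 1-bit flag indicating whether other bitstreams are needed to decode this bitstream. If its value is 0, this bitstream can be decoded on its own. If the value of isOtherStream is 1, this bitstream cannot be decoded on its own.
[0133]
streamCount is a 5-bit flag giving how many other bitstreams are needed. Based on streamCount, that many ES_Number values are transmitted.
[0134]
ES_Number is an ID identifying a bitstream needed for decoding. The rest of ES_Descriptor is the same as in the first embodiment. The structure of ESConfigParams, which represents the information needed to decode each bitstream, is also the same as in the first embodiment, shown in Fig. 9.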
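The way a receiver could use the dependency fields (isOtherStream, streamCount, and the list of ES_Number values) to decide which bitstreams to pull from the multiplex can be sketched as follows. The dictionary representation and the function name are hypothetical. The text does not say whether each descriptor lists only its direct dependencies or all of them, so the sketch follows the references transitively, which gives the same answer in both cases.

```python
def required_streams(requested, depends_on):
    """Return the sorted ES_Number values needed to decode `requested`.

    depends_on : dict mapping an ES_Number to the list of ES_Number
                 values carried in its ES_Descriptor; an empty list
                 corresponds to isOtherStream == 0.
    """
    needed, stack = set(), [requested]
    while stack:
        es = stack.pop()
        if es in needed:
            continue            # already handled; also guards against cycles
        needed.add(es)
        stack.extend(depends_on.get(es, []))
    return sorted(needed)
```

For a three-layer object where stream 0 is the lower layer, stream 1 depends on 0, and stream 2 depends on 1, requesting stream 2 yields all three streams, while requesting stream 0 yields only the lower layer.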
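[0135]
The processing described above (multiplexing and demultiplexing) can be implemented as a program, and the program can be transmitted (provided) to users. As the transmission medium, recording media such as magnetic disks, CD-ROMs, and solid-state memory can be used, as well as communication media such as networks and satellites. Needless to say, the processing can also be implemented in hardware rather than as a program.
[0136]
Various modifications and applications are conceivable without departing from the spirit of the present invention. The gist of the present invention is therefore not limited to the embodiments.
[0137]
[Effects of the Invention]
According to the first image signal multiplexing apparatus and method and the program of the recording medium of the present invention, object-based scalable bitstreams having multiple layers can be texture-mapped.
[Brief Description of the Drawings]
[Fig. 1] Block diagram showing a configuration example of the image signal multiplexing apparatus and image signal demultiplexing apparatus of the present invention.
[Fig. 2] Block diagram showing a configuration example of multiplexing circuit 203 of Fig. 1.
[Fig. 3] Block diagram showing a configuration example of demultiplexing circuit 205 of Fig. 1.
[Fig. 4] Diagram showing the correspondence between the signals used to reconstruct an image, and reconstruction circuit 209 of Fig. 1.
[Fig. 5] Diagram showing the correspondence between the signals used to reconstruct an image, and reconstruction circuit 209 of Fig. 1.
[Fig. 6] Block diagram showing a configuration example of composition circuit 252 of Fig. 5.
[Fig. 7] Diagram showing the structure of the object descriptor.
[Fig. 8] Diagram showing the structure of ES_Descriptor.
[Fig. 9] Diagram showing the structure of ESConfigParams.
[Fig. 10] Diagram showing the structure of the scene descriptor for moving images.
[Fig. 11] Diagram showing the structure of the scene descriptor for still images.
[Fig. 12] Block diagram showing another configuration example of the image signal multiplexing apparatus and image signal demultiplexing apparatus of the present invention.
[Fig. 13] Diagram showing the structure of ES_Descriptor.
[Fig. 14] Block diagram showing a configuration example of a conventional object composition circuit.
[Fig. 15] Block diagram showing a configuration example of a conventional image signal coding apparatus.
[Fig. 16] Block diagram showing a configuration example of a conventional image signal decoding apparatus.
[Fig. 17] Block diagram showing another configuration example of a conventional image signal coding apparatus.
[Fig. 18] Block diagram showing another configuration example of a conventional image signal decoding apparatus.
[Fig. 19] Diagram explaining conventional image composition.
[Fig. 20] Diagram explaining image composition.
[Fig. 21] Diagram explaining image composition.
[Fig. 22] Block diagram showing yet another configuration example of a conventional image signal coding apparatus.
[Fig. 23] Block diagram showing yet another configuration example of a conventional image signal decoding apparatus.
[Fig. 24] Block diagram showing a configuration example of VOP coding circuit 103-0 of Fig. 22.
[Fig. 25] Block diagram showing a configuration example of VOP decoding circuit 112-0 of Fig. 23.
[Fig. 26] Diagram explaining an image object.
[Fig. 27] Diagram explaining an image object.
[Fig. 28] Diagram explaining an image object.
[Fig. 29] Diagram explaining an image object.
[Fig. 30] Diagram explaining an image object.
[Fig. 31] Diagram explaining an image object.
[Explanation of Reference Numerals]
201 stream control circuit, 202 storage device, 203 multiplexing circuit, 204 object descriptor generation circuit, 205 demultiplexing circuit, 206 syntax analysis circuit, 207-1 to 207-n decoders, 208 syntax analysis circuit, 209 reconstruction circuit
[0001]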
BACKGROUND OF THE INVENTION
The present invention relates to an image signal multiplexing apparatus and method, and a transmission medium. In particular, it relates to an image signal multiplexing apparatus and method, and a transmission medium, suitable for application to data that is recorded on a recording medium such as a magneto-optical disk or magnetic tape, reproduced from that medium, and displayed on a display, and to data that is transmitted from a transmitting side to a receiving side over a transmission path and then displayed, edited, or recorded on the receiving side, as in video conference systems, videophone systems, broadcasting equipment, and multimedia database search systems.
[0002]
[Prior art]
For example, in systems that transmit moving image signals to remote locations, such as video conference systems and videophone systems, image signals are compressed and encoded by exploiting the line correlation and inter-frame correlation of the video signal in order to use the transmission path efficiently.
[0003]
In recent years, since the processing capability of computers has improved, moving image information terminals using computers are becoming widespread. In such a system, information is transmitted to a remote place through a transmission line such as a network. In this case as well, signals such as image signals, sound signals, and data to be transmitted are compressed and transmitted in order to efficiently use the transmission path.
[0004]
On the terminal side, the transmitted compressed signal is decoded by a predetermined method to restore the original image signal, sound signal, data, and so on, which are output to the display, speakers, and other devices of the terminal. Conventionally, the transmitted image signal and the like were simply output to the display terminal as they were, but an information terminal using a computer can now convert a plurality of such image signals, sound signals, and data and then display them in a two-dimensional or three-dimensional space. This can be realized by having the transmitting side describe the two-dimensional and three-dimensional space information by a predetermined method, and having the terminal apply, for example, a predetermined conversion process to the image signals according to that description and display the result.
[0005]
A typical method for describing such spatial information is, for example, VRML (Virtual Reality Modeling Language). It has also been standardized in ISO-IEC/JTC1/SC24, and the latest version, VRML2.0, is described in IS14772. VRML is a language for describing a three-dimensional space, and it defines collections of data for describing the attributes, shapes, and so on of that space. Such a collection of data is called a node. Describing a three-dimensional space means describing how these predefined nodes are combined. The defined nodes include ones that indicate attributes such as color and texture and ones that indicate polygon shapes.
[0006]
In a computer-based information terminal, predetermined objects are generated from polygons and the like by CG (Computer Graphics) in accordance with a description such as VRML. VRML also makes it possible to paste a texture onto a three-dimensional object composed of polygons generated in this way. A node called Texture is defined for the case where the texture to be pasted is a still image, and a node called MovieTexture for the case where it is a moving image. These nodes carry information about the texture to be pasted (file name, display start time, display end time, and so on).
[0007]
Here, the pasting of the texture (hereinafter referred to as texture mapping as appropriate) will be described with reference to FIG. 14. First, a texture (image signal) to be pasted, a signal indicating its transparency (key signal), and three-dimensional object information are input from the outside and stored in predetermined storage areas of the memory group 151. The texture is stored in the texture memory 152, the signal indicating the transparency is stored in the gray scale memory 153, and the three-dimensional object information is stored in the three-dimensional information memory 154. Here, the three-dimensional object information includes polygon formation information and illumination information.
[0008]
The rendering circuit 155 forms a three-dimensional object from polygons based on the predetermined three-dimensional object information recorded in the memory group 151. Based on the three-dimensional object information, the rendering circuit 155 reads the predetermined texture and the signal indicating its transparency from the memory 152 and the memory 153 and pastes them onto the generated three-dimensional object. The signal indicating the transparency gives the transparency of the texture at the corresponding position, and therefore the transparency of the object at the position where that part of the texture is pasted. The rendering circuit 155 supplies the signal of the object to which the texture has been pasted to the two-dimensional conversion circuit 156.
[0009]
The two-dimensional conversion circuit 156 converts the three-dimensional object into a two-dimensional image signal by mapping it onto a two-dimensional plane based on viewpoint information supplied from the outside. The resulting two-dimensional image signal is then output to the outside. The texture may be a still image or a moving image. In the case of a moving image, the above operation is performed each time the image frame of the moving image to be pasted changes.
[0010]
VRML also supports compressed image formats as the format of the texture to be pasted, such as JPEG (Joint Photographic Experts Group), a high-efficiency encoding method for still images, and MPEG (Moving Picture Experts Group), a moving image encoding method. In this case, the texture (image) is decoded by a decoding process based on the predetermined compression method, and the decoded image signal is recorded in the memory 152 in the memory group 151.
[0011]
The rendering circuit 155 pastes the texture recorded in the memory 152 regardless of the format of the image, whether it is a moving image or a still image, and its content. Only the single texture stored in the memory can be pasted onto a given polygon at any one time; a plurality of textures cannot be pasted onto one polygon.
[0012]
By the way, when transmitting such three-dimensional information and texture information via a transmission line, it is necessary to compress and send the information in order to efficiently use the transmission line. In particular, when a moving image is pasted on a three-dimensional object, it is essential to compress and transmit the moving image.
[0013]
For example, the MPEG method described above was discussed in ISO-IEC/JTC1/SC2/WG11 and proposed as a draft standard. It adopts a hybrid method that combines motion-compensated predictive coding with DCT (Discrete Cosine Transform) coding. In MPEG, several profiles and levels are defined to support various applications and functions. The most basic is the main profile at main level (MP@ML).
[0014]
A configuration example of an MPEG MP@ML encoder will be described with reference to FIG. The input image signal is first input to the frame memory group 1 and stored in a predetermined order. Image data to be encoded is input to a motion vector detection circuit (ME) 2 in units of macroblocks. The motion vector detection circuit 2 processes the image data of each frame as an I picture, P picture, or B picture according to a predetermined sequence set in advance. Whether each sequentially input frame is processed as an I, P, or B picture is determined in advance (for example, the frames are processed in the order I, B, P, B, P, ..., B, P).
[0015]
The motion vector detection circuit 2 performs motion compensation with reference to a predetermined reference frame determined in advance, and detects the motion vector. There are three types of motion compensation (interframe prediction): forward prediction, backward prediction, and bidirectional prediction. The prediction mode for P pictures is only forward prediction, and the prediction modes for B pictures are three types: forward prediction, backward prediction, and bidirectional prediction. The motion vector detection circuit 2 selects a prediction mode that minimizes the prediction error, and generates a prediction vector at that time.
[0016]
At this time, for example, the prediction error is compared with the variance of the macroblock to be encoded. When the variance of the macroblock is smaller, the prediction is not performed on the macroblock, and the intraframe encoding is performed. In this case, the prediction mode is intra-picture coding (intra). The motion vector and the prediction mode are input to the variable length coding circuit 6 and the motion compensation circuit (MC) 12.
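The mode decision described above can be sketched as follows. The text does not say how the prediction error is measured, so this sketch assumes mean squared error so that it is directly comparable with the variance. The function names and the flat-list representation of a macroblock are illustrative only.

```python
from typing import Dict, List


def variance(block: List[float]) -> float:
    """Variance of the pixel values in the macroblock to be encoded."""
    mean = sum(block) / len(block)
    return sum((p - mean) ** 2 for p in block) / len(block)


def mean_squared_error(block: List[float], prediction: List[float]) -> float:
    """Assumed prediction-error measure: mean squared difference to the prediction."""
    return sum((p - q) ** 2 for p, q in zip(block, prediction)) / len(block)


def choose_prediction_mode(block: List[float],
                           predictions: Dict[str, List[float]]) -> str:
    """Pick the inter mode with the smallest error, or 'intra' if the variance is smaller."""
    errors = {mode: mean_squared_error(block, pred)
              for mode, pred in predictions.items()}
    best_mode = min(errors, key=errors.get)
    if variance(block) < errors[best_mode]:
        return "intra"
    return best_mode
```

For a P picture the `predictions` dictionary would hold only a forward candidate; for a B picture it would hold forward, backward, and bidirectional candidates.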
[0017]
The motion compensation circuit 12 generates predicted image data based on the input motion vector, and the predicted image data is input to the arithmetic circuit 3. The arithmetic circuit 3 calculates difference data between the value of the macroblock to be encoded and the value of the predicted image, and outputs it to the DCT circuit 4. In the case of an intra macroblock, the arithmetic circuit 3 outputs the macroblock signal to be encoded to the DCT circuit 4 as it is.
[0018]
In the DCT circuit 4, the input signal is subjected to DCT (Discrete Cosine Transform) processing and converted to DCT coefficients. These DCT coefficients are input to the quantization circuit (Q) 5 and quantized with a quantization step corresponding to the data accumulation amount (buffer accumulation amount) of the transmission buffer 7, and the quantized data is then input to the variable length coding circuit (VLC) 6.
[0019]
The variable length coding circuit 6 converts the quantized data (for example, I-picture data) supplied from the quantization circuit 5 into a variable length code such as a Huffman code, in accordance with the quantization step (scale) supplied from the quantization circuit 5, and outputs it to the transmission buffer 7. The variable length coding circuit 6 also receives the quantization step (scale) from the quantization circuit 5 and the prediction mode (intra-picture prediction, forward prediction, backward prediction, or bidirectional prediction) and motion vector from the motion vector detection circuit 2, and variable-length encodes these as well.
[0020]
The transmission buffer 7 temporarily stores the input encoded data and outputs data corresponding to the amount stored to the quantization circuit 5. When the remaining amount of data increases to the allowable upper limit, the transmission buffer 7 increases the quantization scale of the quantization circuit 5 by means of a quantization control signal, thereby reducing the amount of quantized data. Conversely, when the remaining amount of data decreases to the allowable lower limit, the transmission buffer 7 reduces the quantization scale of the quantization circuit 5 by means of the quantization control signal, thereby increasing the amount of quantized data. In this way, overflow or underflow of the transmission buffer 7 is prevented. The encoded data stored in the transmission buffer 7 is read out at a predetermined timing and output to the transmission path as a bit stream.
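The buffer feedback described above can be sketched as a simple threshold rule. The step size of 1 and the scale range of 1 to 31 are assumptions made for illustration; the text only says that the scale is raised at the upper limit and lowered at the lower limit.

```python
def update_quantization_scale(scale: int, buffer_fullness: int,
                              lower_limit: int, upper_limit: int,
                              min_scale: int = 1, max_scale: int = 31) -> int:
    """Return the next quantization scale given the current buffer occupancy.

    A coarser scale produces fewer bits (guards against overflow); a finer
    scale produces more bits (guards against underflow).
    """
    if buffer_fullness >= upper_limit:
        return min(scale + 1, max_scale)   # buffer nearly full: quantize more coarsely
    if buffer_fullness <= lower_limit:
        return max(scale - 1, min_scale)   # buffer nearly empty: quantize more finely
    return scale                           # within limits: leave the scale unchanged
```

A practical encoder would typically adjust the scale more smoothly than this two-threshold rule, but the direction of the feedback is the same.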
[0021]
On the other hand, the quantized data output from the quantizing circuit 5 is input to the inverse quantizing circuit (IQ) 8 and is inversely quantized corresponding to the quantization step supplied from the quantizing circuit 5. Output data of the inverse quantization circuit 8 (DCT coefficient obtained by inverse quantization) is input to an IDCT (inverse DCT) circuit 9. The IDCT circuit 9 performs inverse DCT processing on the input DCT coefficient, and the obtained output data (difference data) is supplied to the arithmetic circuit 10. The arithmetic circuit 10 adds the difference data and the predicted image data from the motion compensation circuit 12, and the output image data is stored in the frame memory (FM) group 11. In the case of an intra macroblock, the arithmetic circuit 10 supplies the output data from the IDCT circuit 9 to the frame memory group 11 as it is.
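The reason the encoder contains its own inverse quantization and addition path is that its reference frames must match what the decoder will reconstruct, not the original input. The sketch below illustrates this under strong simplifications: the DCT and inverse DCT are omitted, quantization is a plain division by the step with truncation toward zero, and all names are hypothetical.

```python
from typing import List, Tuple


def quantize(values: List[float], step: int) -> List[int]:
    """Simplified uniform quantizer (truncates toward zero)."""
    return [int(v / step) for v in values]


def dequantize(levels: List[int], step: int) -> List[int]:
    """Inverse of the simplified quantizer: scale the levels back up."""
    return [q * step for q in levels]


def encode_and_reconstruct(block: List[int], prediction: List[int],
                           step: int) -> Tuple[List[int], List[int]]:
    """Return the quantized levels to transmit and the locally decoded block."""
    residual = [b - p for b, p in zip(block, prediction)]   # difference data
    levels = quantize(residual, step)                       # what gets encoded
    decoded_residual = dequantize(levels, step)             # what the decoder recovers
    reconstructed = [p + r for p, r in zip(prediction, decoded_residual)]
    return levels, reconstructed
```

The reconstructed block, rather than the input block, is what would be stored as the reference for later prediction.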
[0022]
Next, a configuration example of an MPEG MP@ML decoder will be described with reference to FIG. The encoded image data (bit stream) transmitted through the transmission path is received by a reception circuit (not shown) or reproduced by a reproduction device, temporarily stored in the reception buffer 21, and then supplied to the variable length decoding circuit (IVLC) 22 as encoded data. The variable length decoding circuit 22 performs variable length decoding on the encoded data supplied from the reception buffer 21, outputs the motion vector and prediction mode to the motion compensation circuit 27 and the quantization step to the inverse quantization circuit (IQ) 23, and outputs the decoded quantized data to the inverse quantization circuit 23.
[0023]
The inverse quantization circuit 23 inversely quantizes the quantized data supplied from the variable length decoding circuit 22 according to the quantization step also supplied from the variable length decoding circuit 22, and outputs the result (the DCT coefficients obtained by inverse quantization) to the IDCT circuit 24. The IDCT circuit 24 applies inverse DCT processing to these DCT coefficients, and the output data (difference data) is supplied to the arithmetic circuit 25.
[0024]
When the output data from the IDCT circuit 24 is I-picture data, it is output from the arithmetic circuit 25 as image data, and is supplied to and stored in the frame memory group 26 in order to generate predicted image data for image data (P- or B-picture data) input to the arithmetic circuit 25 later. This image data is also output to the outside as it is as a reproduced image. When the data output from the IDCT circuit 24 is a P or B picture, the motion compensation circuit 27 generates predicted image data from the image data stored in the frame memory group 26 according to the motion vector and prediction mode supplied from the variable length decoding circuit 22, and outputs it to the arithmetic circuit 25. The arithmetic circuit 25 adds the output data (difference data) input from the IDCT circuit 24 and the predicted image data supplied from the motion compensation circuit 27 to obtain output image data. In the case of a P picture, the output data of the arithmetic circuit 25 is also stored in the frame memory group 26 as predicted image data and used as a reference image for the image signal to be decoded next.
[0025]
In addition to MP@ML, MPEG defines various profiles and levels and provides various tools. Scalability is one such tool. MPEG introduces scalable encoding methods that realize scalability across different image sizes and frame rates. For example, in the case of spatial scalability, decoding only the lower layer bit stream yields an image signal with a small image size, while decoding both the lower layer and upper layer bit streams yields an image signal with a large image size.
[0026]
The spatial scalability encoder will be described with reference to FIG. In the case of spatial scalability, the lower layer corresponds to an image signal having a small image size, and the upper layer corresponds to an image signal having a large image size.
[0027]
The lower layer image signal is first input to the frame memory group 1 and encoded in the same manner as MP@ML. However, the output data of the arithmetic circuit 10 is not only supplied to the frame memory group 11 and used as predicted image data for the lower layer, but is also enlarged by the image enlargement circuit (upsampling) 31 to the same image size as the upper layer and then used as predicted image data for the upper layer.
[0028]
The upper layer image signal is first input to the frame memory group 51. The motion vector detection circuit 52 determines a motion vector and a prediction mode, similarly to MP @ ML. The motion compensation circuit 62 generates predicted image data according to the motion vector determined by the motion vector detection circuit 52 and the prediction mode, and outputs the predicted image data to the weight addition circuit (W) 34. The weight addition circuit 34 multiplies the predicted image data by the weight W and outputs the weight predicted image data to the arithmetic circuit 33.
[0029]
The output data (image data) of the arithmetic circuit 10 is input to the frame memory group 11 and the image enlargement circuit 31 as described above. The image enlargement circuit 31 enlarges the image data generated by the arithmetic circuit 10 so as to have the same size as the image size of the upper layer, and outputs it to the weight addition circuit (1-W) 32. The weight addition circuit 32 multiplies the output data of the image enlargement circuit 31 by the weight (1-W) and outputs the result to the arithmetic circuit 33 as weight prediction image data.
[0030]
The arithmetic circuit 33 adds the output data of the weight addition circuit 32 and the output data of the weight addition circuit 34 and outputs the result to the arithmetic circuit 53 as predicted image data. The output data of the arithmetic circuit 33 is also input to the arithmetic circuit 60, added to the output data of the inverse DCT circuit 59, and input to the frame memory group 61, where it is subsequently used as a prediction reference frame for image data to be encoded. The arithmetic circuit 53 calculates the difference between the image data to be encoded and the output data (predicted image data) of the arithmetic circuit 33 and outputs it as difference data. However, in the case of an intra-frame encoded macroblock, the arithmetic circuit 53 outputs the image data to be encoded to the DCT circuit 54 as it is.
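The upper layer prediction formed by the weight addition circuits 34 and 32 and the arithmetic circuit 33 amounts to W times the motion-compensated prediction plus (1-W) times the enlarged lower layer image. The sketch below shows this under stated assumptions: images are nested lists of numbers, the enlargement uses nearest-neighbor pixel repetition (the text does not specify the interpolation filter), and the function names are illustrative.

```python
from typing import List

Image = List[List[float]]


def upsample_nearest(image: Image, factor: int = 2) -> Image:
    """Enlarge an image by repeating each pixel (assumed stand-in for circuit 31)."""
    enlarged: Image = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        enlarged.extend([list(wide_row) for _ in range(factor)])
    return enlarged


def weighted_prediction(mc_prediction: Image, lower_enlarged: Image,
                        w: float) -> Image:
    """Blend the two predictions: w * temporal + (1 - w) * spatial."""
    return [[w * a + (1.0 - w) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mc_prediction, lower_enlarged)]
```

With W = 1 the upper layer is predicted purely from its own previous frames, and with W = 0 purely from the enlarged lower layer.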
[0031]
The DCT circuit 54 performs DCT (discrete cosine transform) processing on the output data of the arithmetic circuit 53, generates DCT coefficients, and outputs the DCT coefficients to the quantization circuit 55. As in the case of MP @ ML, the quantization circuit 55 quantizes the DCT coefficient according to the quantization scale determined from the data accumulation amount of the transmission buffer 57 and outputs the quantized data to the variable length coding circuit 56. The variable length coding circuit 56 performs variable length coding on the quantized data (quantized DCT coefficient), and then outputs it as a bit stream of the upper layer via the transmission buffer 57.
[0032]
The output data of the quantization circuit 55 is also inversely quantized by the inverse quantization circuit 58 at the quantization scale used in the quantization circuit 55. The output data of the inverse quantization circuit 58 (the DCT coefficients obtained by inverse quantization) is supplied to the IDCT circuit 59, subjected to inverse DCT processing by the IDCT circuit 59, and then input to the arithmetic circuit 60. In the arithmetic circuit 60, the output data of the arithmetic circuit 33 and the output data (difference data) of the inverse DCT circuit 59 are added, and the result is input to the frame memory group 61.
[0033]
The variable length coding circuit 56 also receives the motion vector and prediction mode detected by the motion vector detection circuit 52, the quantization scale used by the quantization circuit 55, and the weight W used by the weight addition circuits 34 and 32. Each of these is encoded and supplied to the buffer 57 as encoded data. The encoded data is transmitted as a bit stream via the buffer 57.
[0034]
Next, an example of a spatial scalability decoder will be described with reference to FIG. After the lower layer bit stream is input to the reception buffer 21, it is decoded in the same manner as MP@ML. However, the output data of the arithmetic circuit 25 is not only output to the outside and stored in the frame memory group 26 for use as predicted image data for image signals decoded later, but is also enlarged by the image signal enlargement circuit 81 to the same image size as the upper layer image signal and then used as predicted image data for the upper layer.
[0035]
The upper layer bit stream is supplied to the variable length decoding circuit 72 via the reception buffer 71, and the variable length code is decoded. At this time, the quantization scale, the motion vector, the prediction mode, and the weighting coefficient are decoded together with the DCT coefficients. The quantized data decoded by the variable length decoding circuit 72 is inversely quantized by the inverse quantization circuit 73 using the decoded quantization scale, and the resulting DCT coefficients (the DCT coefficients obtained by inverse quantization) are supplied to the IDCT circuit 74. The DCT coefficients are subjected to inverse DCT processing by the IDCT circuit 74, and the output data is supplied to the arithmetic circuit 75.
[0036]
The motion compensation circuit 77 generates predicted image data according to the decoded motion vector and the prediction mode, and inputs the predicted image data to the weight addition circuit 84. The weight addition circuit 84 multiplies the decoded weight W by the output data of the motion compensation circuit 77 and outputs the result to the arithmetic circuit 83.
[0037]
The output data of the arithmetic circuit 25 is output as reproduced image data of the lower layer and also output to the frame memory group 26. At the same time, the image signal enlargement circuit 81 enlarges it to the same image size as the upper layer and outputs it to the weight addition circuit 82. The weight addition circuit 82 multiplies the output data of the image signal enlargement circuit 81 by (1-W), using the decoded weight W, and outputs the result to the arithmetic circuit 83.
[0038]
The arithmetic circuit 83 adds the output data of the weight addition circuit 84 and the output data of the weight addition circuit 82 and outputs the result to the arithmetic circuit 75. In the arithmetic circuit 75, the output data of the IDCT circuit 74 and the output data of the arithmetic circuit 83 are added. The result is output as the reproduced image of the upper layer and also supplied to the frame memory group 76, where it is subsequently used as predicted image data for image data to be decoded.
[0039]
The above description covers the processing of the luminance signal; the color difference signals are processed in the same way. In that case, however, the motion vector used is the luminance motion vector halved in both the vertical and horizontal directions.
[0040]
Although the MPEG system has been described above, various other high-efficiency encoding systems for moving images have been standardized. For example, ITU-T prescribes H.261 and H.263, mainly as encoding methods for communication. Like the MPEG system, H.261 and H.263 are basically a combination of motion-compensated predictive coding and DCT transform coding. Details such as the header information differ, but the image signal encoding device (encoder) and the image signal decoding device (decoder) have the same configuration.
[0041]
Within the MPEG system described above, standardization of a new high-efficiency encoding method for moving image signals, called MPEG4, is also in progress. A major feature of MPEG4 is that an image can be encoded and processed in units of objects (that is, divided into a plurality of images and encoded). On the decoding side, the image signals of the individual objects, that is, a plurality of image signals, are synthesized to reconstruct one image.
[0042]
For example, a chroma key method is used for an image composition system that composes a plurality of images to form one image. This is a method in which a predetermined object is photographed in front of a background of a specific uniform color such as blue, an area other than blue is extracted therefrom, and is synthesized with another image. At this time, a signal indicating the extracted region is called a key signal.
[0043]
Next, a method for encoding a composite image will be described with reference to FIG. Image F1 is the background and image F2 is the foreground. The foreground F2 is an image that is generated by shooting in front of a background of a specific color and extracting an area other than that color. At this time, the signal indicating the extracted region is the key signal K1. The synthesized image F3 is synthesized using these F1, F2, and K1. When this image is encoded, normally, F3 is directly encoded by an encoding method such as MPEG. At this time, information such as the key signal is lost, and it becomes difficult to re-edit and re-synthesize the image such that the foreground F2 remains unchanged and only the background F1 is changed.
[0044]
On the other hand, as shown in FIG. 20, the bit stream of the image F3 can also be constructed by encoding the images F1 and F2 and the key signal K1 separately and multiplexing the resulting bit streams.
[0045]
FIG. 21 shows a method for obtaining the composite image F3 by decoding the configured bitstream as shown in FIG. The bit stream is decomposed into F1, F2 and K1 bitstreams by demultiplexing, and each is decoded to obtain decoded images F1 ′, F2 ′ and a decoded key signal K1 ′. At this time, if F1 ′ and F2 ′ are synthesized in accordance with the key signal K1 ′, a decoded synthesized image F3 ′ can be obtained. In this case, re-editing and resynthesizing can be performed such that the foreground F2 ′ is left as it is, and only the background F1 ′ is changed with the bit stream as it is.
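The synthesis of F1′ and F2′ according to the key signal K1′ can be sketched as a per-pixel blend. This sketch assumes the key is normalized to the range 0.0 to 1.0, where 1 means the foreground is fully shown. The text does not give the exact arithmetic, so the blending formula and the function name are assumptions.

```python
from typing import List

Image = List[List[float]]


def composite_with_key(background: Image, foreground: Image, key: Image) -> Image:
    """Per-pixel synthesis: key * foreground + (1 - key) * background."""
    return [[k * f + (1.0 - k) * b for b, f, k in zip(row_b, row_f, row_k)]
            for row_b, row_f, row_k in zip(background, foreground, key)]
```

Because the background is a separate input, the background alone can be swapped by calling the function again with a different first argument while reusing the decoded foreground and key.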
[0046]
As described above, in MPEG4, each image sequence constituting a composite image such as images F1 and F2 is called VO (VideoObject). An image frame at a certain time of VO is called VOP (Video Object Plane). VOP consists of luminance and color difference signals, and key signals. An image frame means one image at a predetermined time, and an image sequence means a set of image frames at different times. That is, each VO is a set of VOPs at different times. Each VO varies in size and position depending on the time. That is, the size and position of VOPs belonging to the same VO are different.
[0047]
The configurations of the encoder and decoder for encoding and decoding in units of objects described above are shown in FIGS. FIG. 22 shows an example of an encoder. The input image signal is first input to the VO configuration circuit 101. The VO configuration circuit 101 divides the input image into objects and outputs an image signal representing each object (VO). Each VO image signal consists of an image signal and a key signal. The image signals output from the VO configuration circuit 101 are output, VO by VO, to the VOP configuration circuits 102-0 to 102-n. For example, the image signal and key signal of VO 0 are input to the VOP configuration circuit 102-0, those of VO 1 are input to the VOP configuration circuit 102-1, and likewise those of VO n are input to the VOP configuration circuit 102-n.
[0048]
In the VO configuration circuit 101, for an image signal generated with a chroma key as shown in FIG. 20, for example, each image signal and its key signal form a VO as they are. For an image that has no key signal or whose key signal has been lost, image region segmentation is performed, a predetermined region is extracted, a key signal is generated, and the result forms a VO.
[0049]
The VOP configuration circuits 102-0 to 102-n extract a minimum rectangular portion including an object in the image from each image frame. At this time, however, the number of pixels in the horizontal and vertical directions of the rectangle is a multiple of 16. The VOP configuration circuits 102-0 to 102-n extract the image signals (luminance and color difference signals) and key signals included in the above-described rectangle and output them. It also outputs a flag indicating the size of the VOP (VOP size) and a flag indicating the position of the VOP in absolute coordinates (VOP POS).
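The rectangle extraction described above can be sketched from a binary key mask. The text requires only that the width and height be multiples of 16; the choice here to keep the top-left corner fixed and grow the rectangle to the right and downward is an assumption, as are the function names. The sketch does not clip the rectangle to the frame boundary.

```python
from typing import List, Optional, Tuple


def round_up_to_16(n: int) -> int:
    """Smallest multiple of 16 that is greater than or equal to n."""
    return ((n + 15) // 16) * 16


def vop_rectangle(key: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x, y, width, height) of the VOP, or None if the mask is empty.

    (x, y) plays the role of the VOP position and (width, height) the VOP size.
    """
    rows = [y for y, row in enumerate(key) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(key[0])) if any(row[x] for row in key)]
    x0, y0 = cols[0], rows[0]
    width = round_up_to_16(cols[-1] - x0 + 1)
    height = round_up_to_16(rows[-1] - y0 + 1)
    return x0, y0, width, height
```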
[0050]
Output signals from the VOP configuration circuits 102-0 to 102-n are input to the VOP encoding circuits 103-0 to 103-n and encoded. Outputs of the VOP encoding circuits 103-0 to 103-n are input to the multiplexing circuit 104, configured as one bit stream, and output to the outside as a bit stream.
[0051]
FIG. 23 shows an example of a decoder. The multiplexed bit stream is demultiplexed by the demultiplexing circuit 111 and separated into the bit streams of the individual VOs. The bit stream of each VO is input to the VOP decoding circuits 112-0 to 112-n and decoded. The VOP decoding circuits 112-0 to 112-n decode the image signal, key signal, size flag (VOP size), and absolute-coordinate position flag (VOP POS) of each VOP and input them to the image reconstruction circuit 113. The image reconstruction circuit 113 synthesizes an image using the image signal, key signal, size flag (VOP size), and absolute-coordinate position flag (VOP POS) of each VOP, and outputs a reproduced image.
[0052]
Next, an example of the VOP encoding circuit 103-0 (the other VOP encoding circuits 103-1 to 103-n are configured similarly) will be described with reference to FIG. The image signal and key signal constituting each VOP are input to the image signal encoding circuit 121 and the key signal encoding circuit 122, respectively. The image signal encoding circuit 121 performs encoding processing by a method such as MPEG or H.263. The key signal encoding circuit 122 performs encoding processing using, for example, DPCM. There is also a method of encoding a difference signal by performing motion compensation using a motion vector detected by the image signal encoding circuit 121 when encoding a key signal. The bit amount generated by the key signal encoding is input to the image signal encoding circuit 121 and controlled to have a predetermined bit rate.
[0053]
The bit stream of the encoded image signal (motion vector and texture information) and the bit stream of the key signal are input to the multiplexing circuit 123, configured as one bit stream, and output via the transmission buffer 124.
[0054]
FIG. 25 illustrates a configuration example of the VOP decoding circuit 112-0 (the other VOP decoding circuits 112-1 to 112-n are configured similarly). First, the bit stream is input to the demultiplexing circuit 131 and separated into a bit stream of the image signal (motion vector and texture information) and a bit stream of the key signal, which are input to the image signal decoding circuit 132 and the key signal decoding circuit 133, respectively. If the key signal was encoded with motion compensation, the motion vector decoded by the image signal decoding circuit 132 is input to the key signal decoding circuit 133 and used for decoding.
[0055]
The method for encoding an image for each VOP has been described above. This method is currently being standardized as MPEG4 in ISO-IEC / JTC1 / SC29 / WG11. However, a method for efficiently encoding each VOP as described above has not been established yet, and a function such as scalability has not been established.
[0056]
Hereinafter, a method for performing scalable coding of an image in units of objects will be described. As described above, the rendering circuit 155 pastes the texture recorded in the texture memory 152 to the polygon regardless of the image format, whether it is a moving image or a still image, and the content thereof. What can be pasted to one polygon is always one texture stored in the memory, and a plurality of textures cannot be pasted to one polygon. In many cases, the image is compressed and transmitted, and after the compressed bit stream is decoded on the terminal side, the image is stored in a predetermined texture pasting memory.
[0057]
In the conventional case, the number of image signals generated by decoding a bit stream is always one. For example, when an MP@ML bit stream in MPEG is decoded, one image sequence is decoded. In the case of scalability in MPEG2, when the lower-layer bit stream is decoded, a low-quality image is obtained, and when both the lower-layer and upper-layer bit streams are decoded, a high-quality image signal is obtained. In either case, one image sequence is decoded.
[0058]
However, in the case of a system such as MPEG4 that encodes an image in object units, the situation is different. That is, one object may be composed of a plurality of bit streams. In such a case, one image is obtained for each bit stream, so a plurality of images are obtained for one object. As a result, the images cannot be pasted as a texture onto, for example, a three-dimensional object described in VRML or the like.
[0059]
As a method for solving this, it is conceivable to assign one VRML node (polygon) to one image object (VO). For example, in the case of FIG. 21, it can be considered that the background F1 ′ is assigned to one node, and the foreground F2 ′ and the key signal K1 ′ are assigned to one node. However, if one image object consists of multiple bit streams and multiple images are generated during decoding, the following problem arises. This will be described with reference to FIGS.
[0060]
A description will be given taking three-layer scalable coding as an example. In the case of three-layer scalable coding, in addition to the lower layer (base layer), there are two upper layers, namely, a first upper layer (enhancement layer 1, hereinafter referred to as upper layer 1 as appropriate) and a second upper layer (enhancement layer 2, hereinafter referred to as upper layer 2 as appropriate). Compared with the image decoded up to the first upper layer, the image decoded up to the second upper layer has improved image quality. Here, the improvement in image quality refers to the spatial resolution in the case of spatial scalable coding, the frame rate in the case of temporal scalable coding, and the SNR of the image in the case of SNR (Signal to Noise Ratio) scalable coding.
[0061]
In the case of MPEG4 encoded in object units, the relationship between the first upper layer and the second upper layer is one of the following.
(1) The second upper layer includes all areas of the first upper layer.
(2) The second upper layer corresponds to a partial region of the first upper layer.
(3) The second upper layer corresponds to a wider area than the first upper layer.
[0062]
The relationship (3) exists when scalable coding of three or more layers is performed. This is the case where the first upper layer corresponds to a partial region of the lower layer and the second upper layer includes the entire region of the lower layer, or where the first upper layer corresponds to a partial region of the lower layer and the second upper layer corresponds to a region that is wider than that of the first upper layer but is still a partial region of the lower layer. In the case of the relationship (3), when decoding up to the first upper layer, the image quality of only a part of the lower-layer image is improved, and when decoding up to the second upper layer, the image quality of a wider region, or of the entire region of the lower-layer image, is improved. In the relationship (3), the shape of the VOP may be a rectangle or an arbitrary shape.
[0063]
FIG. 26 to FIG. 31 show examples of spatially scalable coding with three layers. FIG. 26 shows an example in which the VOPs are all rectangular in the spatial scalability in the relationship (1). FIG. 27 shows an example of the spatial scalability in the relationship (2) when the VOP shape is a rectangle. Further, FIG. 28 shows an example of the case where the VOP shapes of all layers are rectangular in the spatial scalability in the relationship (3). FIG. 29 shows an example of the spatial scalability in the relationship (3) in which the VOP shape of the first upper layer is an arbitrary shape and the VOP shapes of the lower layer and the second upper layer are rectangular. FIG. 30 and FIG. 31 show an example of the spatial scalability in the relationship (1), where the VOP has a rectangular shape and an arbitrary shape, respectively.
[0064]
Here, as shown in FIG. 26, when the image quality of the entire image is improved, it is sufficient to display one image with the highest image quality as in the conventional scalable encoding such as MPEG2. However, there are cases as shown in FIG. 27, FIG. 28, and FIG. 29 in MPEG4 encoded in object units. For example, in the case of FIG. 27, when the bit streams of the lower layer and the upper layers 1 and 2 are decoded, the images of the lower layer and the upper layer 1 are subjected to resolution conversion, and the two resolution-converted image sequences are then combined with the decoded image sequence of the upper layer 2 to reconstruct the entire image. In the case of FIG. 29, only the upper layer 1 and the lower layer may be decoded, and only the image of the upper layer 1 may be output and synthesized with another image sequence decoded from another bit stream.
[0065]
[Problems to be solved by the invention]
As described above, when an image is encoded in units of objects, a method that simply assigns one node to one object has the problem that, when a plurality of images are generated for one object, the images cannot be pasted onto the object as a texture.
[0066]
The present invention has been made in view of such a situation, and makes it possible to reliably paste images onto an object as a texture even when a plurality of images are generated for one object.
[0067]
[Means for Solving the Problems]
The image signal multiplexing apparatus of the present invention includes: selecting means for selecting a scene descriptor indicating spatial configuration information describing a predetermined object, and for selecting a bit stream constituting the predetermined object from a plurality of hierarchically encoded bit streams; generating means for generating an object descriptor indicating information relating to the object composed of the selected bit stream; and multiplexing means for multiplexing a start code, the selected scene descriptor and bit stream, and the generated object descriptor, and outputting them in the order of the start code, the scene descriptor, a predetermined number of object descriptors, and a predetermined number of bit streams.
[0068]
The image signal multiplexing method of the present invention includes: a selection step of selecting a scene descriptor indicating spatial configuration information describing a predetermined object, and of selecting a bit stream constituting the predetermined object from a plurality of hierarchically encoded bit streams; a generation step of generating an object descriptor indicating information about the object composed of the selected bit stream; and a multiplexing step of multiplexing a start code, the selected scene descriptor and bit stream, and the generated object descriptor, and outputting them in the order of the start code, the scene descriptor, a predetermined number of object descriptors, and a predetermined number of bit streams.
[0069]
The recording medium of the present invention records a program for causing a computer to execute processing including: a selection step of selecting a scene descriptor indicating spatial configuration information describing a predetermined object, and of selecting a bit stream constituting the predetermined object from a plurality of hierarchically encoded bit streams; a generation step of generating an object descriptor indicating information about the object composed of the selected bit stream; and a multiplexing step of multiplexing a start code, the selected scene descriptor and bit stream, and the generated object descriptor, and outputting them in the order of the start code, the scene descriptor, a predetermined number of object descriptors, and a predetermined number of bit streams.
[0078]
The transmission medium according to claim 20 transmits a program including: a separation step of separating, from a multiplexed bit stream in which spatial configuration information describing an object, bit streams of a plurality of layers having different qualities constituting the object, and information on the object including at least dependency information indicating dependency relationships between different bit streams are multiplexed and transmitted, the spatial configuration information, the bit streams of the plurality of layers constituting the object, and the information on the object; a control step of controlling the processing in the separation step, based on the dependency information, so as to select the spatial configuration information describing a predetermined object and the bit streams of the plurality of layers constituting the object; an analysis step of analyzing the selected spatial configuration information; a decoding step of decoding the bit streams of the plurality of layers; a mixing step of mixing, among the output signals decoded in the decoding step, the output signals corresponding to the same object; and a reconstruction step of reconstructing an image signal, based on the information on the object, from the output data analyzed in the analysis step and the output signal mixed in the mixing step.
[0079]
In the image signal multiplexing apparatus and method and the program of the recording medium of the present invention, a scene descriptor indicating spatial configuration information describing a predetermined object is selected, a bit stream constituting the predetermined object is selected from a plurality of hierarchically encoded bit streams, and an object descriptor indicating information about the object composed of the selected bit stream is generated. A start code, the selected scene descriptor and bit stream, and the generated object descriptor are then multiplexed and output in the order of the start code, the scene descriptor, a predetermined number of object descriptors, and a predetermined number of bit streams.
[0083]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below. First, the bit stream multiplexer and demultiplexer in the first embodiment will be described with reference to FIG. In the following description, encoded audio and video bit streams (Elementary Stream (ES)) are described as being recorded in advance in a predetermined storage device 202. The bit stream may be directly input from the multiplexing device to the multiplexing circuit 203 without passing through the storage device 202. Hereinafter, the encoding and decoding methods will be described assuming the MPEG4 method, but any method can be applied in the same manner as long as the image is divided into a plurality of images and encoded.
[0084]
The storage device 202 records a bit stream ES (Elementary Stream) corresponding to each AV (audio and video) object, object stream information OI necessary for decoding each bit stream, and a scene descriptor SD (Scene Descriptor) describing a two-dimensional or three-dimensional scene (a virtual space constituted by images to be transmitted). Here, the object stream information OI includes, for example, a buffer size necessary for decoding, a time stamp of each access unit (frame or VOP), and the like. Details will be described later.
[0085]
The object stream information OI describes all the information on the bit stream ES corresponding to each AV (audio and video) object. The object descriptor generation circuit 204 generates an object descriptor OD (Object Descriptor) corresponding to the object stream information OI supplied from the storage device 202.
[0086]
The multiplexing circuit 203 multiplexes, in a predetermined order, the bit stream ES and the scene descriptor SD recorded in the storage device 202 and the object descriptor OD supplied from the object descriptor generation circuit 204, and transmits the resulting multiplexed bit stream FS.
[0087]
Here, the configuration of the bit stream forming each object will be described. For example, a scene as shown in FIG. 21 includes two objects, a background F1 ′ and a foreground F2 ′. The key signal K1 ′ and the foreground F2 ′ are composed of one bit stream ES. Therefore, the scene of FIG. 21 is composed of two video objects VO, and when scalable coding is not used, each VO is composed of one bit stream ES.
[0088]
In the case of FIGS. 26 to 29, the frame is composed of one video object VO. However, in this case, since scalable encoding is performed, one video object VO is composed of three bit streams ES. FIG. 26 to FIG. 29 show examples of scalable encoding of three layers, but the number of layers may be arbitrary.
[0089]
In FIGS. 30 and 31, the scene is composed of two video objects VO of the background (FIG. 30) and the foreground (FIG. 31), and each video object VO is composed of three bit streams ES.
[0090]
By sending a request signal from the terminal, the user can arbitrarily set which video object is displayed and, in the case of scalable coding, which layer is displayed.
[0091]
In the embodiment of FIG. 1, the user transmits a request signal REQ for specifying a necessary video object and a bit stream from an external terminal (not shown) to the transmission side. The request signal REQ is supplied to the stream control circuit 201. Object stream information OI of the bit stream of each video object is recorded in the storage device 202. As described above, the object stream information OI includes, for example, information indicating how many bit streams a given object is composed of, information necessary for decoding each bit stream, the buffer size, and which other bit streams are required for decoding.
[0092]
The stream control circuit 201 refers to the object stream information OI supplied from the storage device 202 according to the request signal REQ, determines which bit streams to transmit, and supplies the stream request signal SREQ to the multiplexing circuit 203, the storage device 202, and the object descriptor generation circuit 204. The storage device 202 reads a predetermined bit stream ES and scene descriptor SD in accordance with the stream request signal SREQ and outputs them to the multiplexing circuit 203.
[0093]
The object descriptor generation circuit 204 reads the object stream information OI related to the bit stream of each object (VO) recorded in the storage device 202 according to the stream request signal SREQ, and extracts only the information on the bit streams requested by the stream request signal SREQ as the object descriptor OD. The object descriptor generation circuit 204 also generates an ID number OD_ID indicating which object the object descriptor OD corresponds to. For example, in the case of FIG. 26, when only the lower layer and the upper layer 1 are requested for a predetermined object, the object descriptor generation circuit 204 extracts only the information on the lower layer and the upper layer 1 from the object stream information OI to generate the object descriptor OD, generates an ID number OD_ID indicating the object, and writes it into the object descriptor OD. The object descriptor OD generated in this way is supplied to the multiplexing circuit 203. Details of the syntax of the object descriptor OD, the object stream information OI, and the scene descriptor SD will be described later.
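The extraction performed by the object descriptor generation circuit 204 can be illustrated with a minimal Python sketch. The record layout (`StreamInfo` and its field names) and the function `generate_object_descriptor` are hypothetical and chosen only for illustration; the sketch does not reproduce the actual syntax of the object stream information OI or the object descriptor OD.

```python
from dataclasses import dataclass, field


@dataclass
class StreamInfo:
    """One entry of the object stream information OI (field names are assumptions)."""
    es_id: int        # identifies the bit stream ES
    layer: str        # "base", "enh1" or "enh2" in this sketch
    buffer_size: int  # buffer size needed for decoding
    depends_on: list = field(default_factory=list)  # es_ids required for decoding


def generate_object_descriptor(od_id, object_stream_info, requested_layers):
    """Keep only the requested bit streams and tag the result with OD_ID."""
    selected = [s for s in object_stream_info if s.layer in requested_layers]
    return {"OD_ID": od_id, "streams": selected}


# Three-layer object; only the lower layer and upper layer 1 are requested.
oi = [
    StreamInfo(1, "base", 4096),
    StreamInfo(2, "enh1", 8192, depends_on=[1]),
    StreamInfo(3, "enh2", 16384, depends_on=[1, 2]),
]
od = generate_object_descriptor(7, oi, requested_layers={"base", "enh1"})
print(od["OD_ID"], [s.es_id for s in od["streams"]])
```

In this sketch the information on upper layer 2 is simply left out of the descriptor, which mirrors the case in which only the lower layer and upper layer 1 are requested.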
[0094]
Next, the operation of the multiplexing circuit 203 will be described with reference to FIG. The multiplexing circuit 203 is supplied with bit streams ES1 to ESn to be transmitted in accordance with the stream request signal SREQ. Each bit stream ES1 to ESn is supplied to the switch 231. Similarly, the scene descriptor SD and the object descriptor OD are supplied to the switch 231. Further, the multiplexing circuit 203 is provided with a start code generation circuit 232, and the start code generated by the start code generation circuit 232 is also supplied to the switch 231. The switch 231 outputs data obtained by switching connections in a predetermined order to the outside as a multiplexed bit stream FS.
[0095]
First, the start code generated by the start code generation circuit 232 is output as the multiplexed bit stream FS. Next, the connection of the switch 231 is switched, and the scene descriptor SD is output. After the scene descriptor SD is output, the connection of the switch 231 is switched again, and the object descriptor OD is output. Since there are as many object descriptors OD as there are objects, that number of object descriptors is output. FIG. 2 shows a case where the number of objects is three. After the object descriptors OD are output, the connection of the switch 231 is switched again, and the bit streams ES1 to ESn are selected and output for each predetermined data size. As shown in FIG. 1, the multiplexed bit stream FS is supplied to the demultiplexing circuit 205 after passing through a predetermined transmission path.
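The output order of the switch 231 can be sketched as follows. This is only an illustration under stated assumptions: the start code value, the tagged-unit representation of the stream, the round-robin interleaving, and the names `START_CODE` and `multiplex` are hypothetical and are not taken from the specification.

```python
START_CODE = b"\x00\x00\x01\xb0"  # hypothetical value, used only for illustration


def multiplex(scene_descriptor, object_descriptors, bitstreams, chunk_size=4):
    """Return the units of the multiplexed stream FS in switch order."""
    units = [("START", START_CODE), ("SD", scene_descriptor)]
    # One object descriptor is output per object.
    units += [("OD", od) for od in object_descriptors]
    # The elementary streams follow, output in chunks of a predetermined size
    # (round-robin selection is an assumption of this sketch).
    offsets = [0] * len(bitstreams)
    while any(off < len(es) for off, es in zip(offsets, bitstreams)):
        for i, es in enumerate(bitstreams):
            if offsets[i] < len(es):
                units.append((f"ES{i + 1}", es[offsets[i]:offsets[i] + chunk_size]))
                offsets[i] += chunk_size
    return units


fs = multiplex(b"scene", [b"od1", b"od2"], [b"AAAAAA", b"BB"])
print([tag for tag, _ in fs])
```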
[0096]
Next, details of the demultiplexing circuit 205 will be described with reference to FIG. First, the multiplexed bit stream FS is supplied to the switch 241. First, the switch 241 recognizes each subsequent data by detecting a start code. After detecting the start code, the scene descriptor SD is read from the switch 241 and output. Next, the connection of the switch 241 is changed, and the object descriptor OD is read and output. There are as many object descriptors OD as the number of objects, and they are output sequentially. After all the object descriptors OD are output, the connection of the switch 241 is changed again, and each bit stream ES1 to ESn is read and output according to a predetermined connection.
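The separation performed by the switch 241 can be sketched in the same spirit. The stream is modelled here as a list of tagged units, which is an assumption of this sketch (the actual framing of FS is not given at this point), and the function name `demultiplex` is hypothetical.

```python
def demultiplex(units):
    """Split tagged units of FS into SD, the list of ODs, and per-stream ES data."""
    it = iter(units)
    tag, _ = next(it)
    if tag != "START":
        raise ValueError("start code not found")
    scene_descriptor = None
    object_descriptors = []
    streams = {}
    for tag, payload in it:
        if tag == "SD":
            scene_descriptor = payload
        elif tag == "OD":
            object_descriptors.append(payload)
        else:  # "ES1", "ES2", ...: reassemble each elementary stream
            streams[tag] = streams.get(tag, b"") + payload
    return scene_descriptor, object_descriptors, streams


fs = [("START", b""), ("SD", b"scene"), ("OD", b"od1"), ("OD", b"od2"),
      ("ES1", b"AAAA"), ("ES2", b"BB"), ("ES1", b"AA")]
sd, ods, es = demultiplex(fs)
print(sd, ods, es)
```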
[0097]
The read scene descriptor SD is supplied to a parsing circuit (parser) 208 and analyzed as shown in FIG. The parsed scene description is supplied to the reconstruction circuit 209 as three-dimensional object information. The three-dimensional object information is actually composed of information such as nodes and polygons, but in the following description, it will be described as a node as appropriate.
[0098]
Further, the read object descriptor OD is supplied to the parsing circuit (parser) 206 and analyzed as shown in FIG. The parsing circuit 206 identifies the type and number of necessary decoders, and causes the necessary decoders 207-1 to 207-n to supply the respective bit streams ES1 to ESn from the demultiplexing circuit 205. Further, the buffer amount necessary for decoding each bit stream is read from the object descriptor OD, and is output from the syntax analysis circuit 206 to each decoder 207-1 to 207-n. Each of the decoders 207-1 to 207-n is initialized based on initialization information such as a buffer size supplied from the syntax analysis circuit 206 (that is, transmitted by the object descriptor OD). Further, the syntax analysis circuit 206 reads the ID number OD_ID of each object descriptor OD in order to identify which object each bitstream ES1 to ESn belongs to. Then, the ID number OD_ID of each object descriptor OD is output from the syntax analysis circuit 206 to the decoders 207-1 to 207-n that decode the bit stream described in the object descriptor OD.
[0099]
Each of the decoders 207-1 to 207-n decodes the bit stream based on a predetermined decoding method corresponding to encoding, and outputs a video or audio signal to the reconstruction circuit 209. Each decoder 207-1 to 207-n outputs an ID number OD_ID indicating to which object the image belongs to the reconstruction circuit 209. In the case of an image signal, each of the decoders 207-1 to 207-n decodes a signal (POS, SZ) indicating the position and size from the bit stream and outputs the decoded signal to the reconstruction circuit 209. Further, in the case of an image signal, the decoders 207-1 to 207-n decode a signal indicating a transparency (key signal) from the bit stream and output the decoded signal to the reconstruction circuit 209.
[0100]
Next, the correspondence between signals for reconstructing an image and the reconstruction circuit 209 will be described with reference to FIGS. 4 and 5. FIG. 4 shows an example in the case where scalable coding is not performed, and FIG. 5 shows an example in the case where scalable coding is performed.
[0101]
In FIG. 4, the reconstruction circuit 209 includes a synthesis circuit 252, and an image signal generated by the synthesis circuit 252 is supplied to the display 251 and displayed. In FIG. 4, the synthesis circuit 252 and the display 251 are shown as the reconstruction circuit 209, but this is to show how the image constituted by the synthesis circuit 252 is shown on the display 251. In practice, the reconstruction circuit 209 does not include a display.
[0102]
In FIG. 4, a rectangular image sequence and a triangular pyramid generated by CG are displayed on the screen of the display 251. The decoded texture is also pasted on the triangular pyramid object. Here, the texture may be a moving image or a still image.
[0103]
FIG. 4 shows the correspondence between the scene descriptor SD and the output screen. As the scene descriptor SD, for example, a descriptor such as VRML is used. The scene descriptor SD is composed of a group of descriptions called nodes. There is a parent (root) node SD0 describing how to arrange each object in the entire image. As a child node, there is a node SD1 describing information about a triangular pyramid. In addition, information regarding a rectangular plane on which an image is pasted is described in a node SD2 as a child node of the root node SD0. In the example of FIG. 4, the image signal is composed of three video objects VO. Information relating to the background as the first video object VO is described in the node SD2. In addition, information regarding a plane for attaching the sun as the second video object VO is described in the node SD3. Further, information regarding a plane on which a person as the third video object VO is pasted is described in the node SD4. Nodes SD3 and SD4 are child nodes of node SD2.
[0104]
Therefore, one scene descriptor SD is constituted by the nodes SD0 to SD4. Each node SD0 to SD4 corresponds to one three-dimensional or two-dimensional object. In the example of FIG. 4, the node SD0 corresponds to the object of the entire scene, the node SD1 to the triangular pyramid object, the node SD2 to the background object, the node SD3 to the sun object, and the node SD4 to the person object. When a texture is pasted to each node, a flag indicating which bit stream corresponds to each node is required. In order to identify this, the ID number OD_ID of the object descriptor supplied from the decoder of the corresponding bit stream is described in each node. As a result, one object descriptor OD corresponds to one node. Thus, one video object VO is pasted on one two-dimensional or three-dimensional object.
[0105]
Each node SD0 to SD4 constituting the scene descriptor SD is interpreted by the syntax analysis circuit 208 and supplied to the synthesis circuit 252 of the reconstruction circuit 209 as three-dimensional object information. The decoders 207-1 to 207-4 are supplied with the bit streams ES1 to ES4 from the demultiplexing circuit 205, and are supplied with the ID number OD_ID of the corresponding object descriptor OD from the syntax analysis circuit 206. Each decoder 207-1 to 207-4 decodes the bit stream and supplies, to the synthesis circuit 252 of the reconstruction circuit 209, the ID number OD_ID and the decoded signal (image or audio) and, in the case of an image, also a key signal and signals (POS, SZ) indicating the position and size of the image. Here, the position of the image means the relative position with respect to the parent node one level above the node to which the image belongs.
[0106]
A configuration example of the synthesis circuit 252 is shown in FIG. In FIG. 6, parts corresponding to those shown in FIG. 14 are given the same reference numerals. The input three-dimensional object information (including the nodes SD0 to SD4 and polygon information), image signal (texture), key signal, ID number OD_ID, and position and size signals (POS, SZ) are supplied to the object composition circuits 271-1 to 271-n, respectively. One object composition circuit 271-i corresponds to one node SDi (i = 1, 2, 3,..., n). The object composition circuit 271-i receives the decoded signal having the ID number OD_ID indicated by the node SDi from the decoder 207-i, and in the case of an image signal, pastes it on the generated two-dimensional or three-dimensional object. As described above, when the ID number OD_ID and the decoded signal are supplied to the corresponding object composition circuit 271-i, it is necessary to find which node each decoded signal corresponds to. Therefore, the correspondence is recognized by comparing the ID number OD_ID supplied to the reconstruction circuit 209 with the ID number OD_ID included in the node. Based on the recognition result, the decoded signal is supplied to the object composition circuit 271-i to which the corresponding node is supplied.
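The matching of decoded signals to nodes by comparing ID numbers OD_ID can be sketched as a simple lookup. The dictionary layout of nodes and decoded signals and the function name `route_decoded_signals` are hypothetical; only the comparison of the OD_ID from the decoder with the OD_ID described in the node follows the text.

```python
def route_decoded_signals(nodes, decoded_signals):
    """Map each node name to the decoded signal whose OD_ID matches the node's."""
    by_od_id = {sig["OD_ID"]: sig for sig in decoded_signals}
    routed = {}
    for node in nodes:
        od_id = node.get("OD_ID")  # a node without a texture carries no OD_ID
        if od_id in by_od_id:
            routed[node["name"]] = by_od_id[od_id]
    return routed


# Illustrative values only; the arrival order of the decoder outputs is arbitrary.
nodes = [{"name": "SD0"},
         {"name": "SD1", "OD_ID": 1},
         {"name": "SD2", "OD_ID": 2}]
signals = [{"OD_ID": 2, "texture": "background"},
           {"OD_ID": 1, "texture": "pyramid"}]
routed = route_decoded_signals(nodes, signals)
print({name: sig["texture"] for name, sig in routed.items()})
```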
[0107]
The texture (image signal) to be pasted, the signal indicating the transparency (key signal), and the signals indicating the position and size (POS, SZ), supplied from the decoder 207-i, are stored in predetermined areas of the memory group 151-i. Similarly, the node (two-dimensional or three-dimensional object information) supplied from the syntax analysis circuit 208 is stored in a predetermined storage area of the memory group 151-i. The texture (image signal) is stored in the texture memory 152-i, the signal indicating the transparency (key signal) and the ID number OD_ID are stored in the grayscale memory 153-i, and the node is stored in the three-dimensional information memory 154-i. The ID number OD_ID is supplied and used to identify the object. The signals (POS, SZ) indicating the position and size may be stored in any memory; in this example, they are stored in the grayscale memory 153-i. Here, the three-dimensional object information includes polygon formation information and illumination information. The signals indicating the position and size are stored at a predetermined position in the memory group 151-i.
[0108]
The rendering circuit 155-i forms a two-dimensional or three-dimensional object with polygons based on the nodes recorded in the three-dimensional information memory 154-i. The rendering circuit 155-i reads a predetermined texture and the signal indicating its transparency from the texture memory 152-i and the grayscale memory 153-i and pastes them on the generated three-dimensional object. The signal indicating the transparency indicates the transparency of the texture at the corresponding position, and therefore the transparency of the object at the position where that texture is pasted. The rendering circuit 155-i supplies the signal with the texture pasted to the two-dimensional conversion circuit 156. Similarly, the signals indicating the position and size of the image (relative position with respect to the parent node) are read out from a predetermined position (in this example, the grayscale memory 153-i) of the memory group 151-i and output to the two-dimensional conversion circuit 156.
[0109]
The two-dimensional conversion circuit 156 is supplied with two-dimensional or three-dimensional objects to which textures are pasted from the object composition circuits 271-1 to 271-n, corresponding to the number of nodes. The two-dimensional conversion circuit 156 maps a three-dimensional object onto a two-dimensional plane and converts it into a two-dimensional image signal based on viewpoint information supplied from the outside and the signals (POS, SZ) indicating the position and size of the image. The three-dimensional object converted into the two-dimensional image signal is then output to the display 251 and displayed. When all the objects are two-dimensional objects, the output data from the rendering circuits 155-1 to 155-n are synthesized according to the transparency (key signal) and the signals indicating the position and size of the image, and are output. In this case, conversion based on the viewpoint is not performed.
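For the case in which all objects are two-dimensional, the synthesis according to the key signal and the position can be sketched as below. Several points are assumptions of this sketch rather than statements of the specification: images are plain nested lists, the key signal takes values from 0.0 (transparent) to 1.0 (opaque), linear blending is used, and the size SZ is taken to be the dimensions of the layer itself. The function name `composite` is hypothetical.

```python
def composite(canvas, layer, key, pos):
    """Blend `layer` onto `canvas` in place at `pos` = (x, y) using the key signal."""
    x0, y0 = pos
    for y, row in enumerate(layer):
        for x, value in enumerate(row):
            cy, cx = y0 + y, x0 + x
            # Pixels falling outside the canvas are skipped.
            if 0 <= cy < len(canvas) and 0 <= cx < len(canvas[0]):
                a = key[y][x]
                canvas[cy][cx] = a * value + (1.0 - a) * canvas[cy][cx]
    return canvas


background = [[10.0] * 4 for _ in range(3)]
foreground = [[200.0, 200.0], [200.0, 200.0]]
key = [[1.0, 0.0], [0.5, 1.0]]
result = composite(background, foreground, key, pos=(1, 1))
for row in result:
    print(row)
```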
[0110]
Next, an example in which the scalable coding of FIG. 5 is performed will be described. In this case, the reconstruction circuit 209 includes a mixing circuit 261 and a synthesis circuit 252, and an image signal generated by the mixing circuit 261 and the synthesis circuit 252 is supplied to the display 251 and displayed. In FIG. 5, as in FIG. 4, the mixing circuit 261, the synthesis circuit 252, and the display 251 are shown as the reconstruction circuit 209, but this is to show how the image composed by the mixing circuit 261 and the synthesis circuit 252 appears on the display 251; in practice, the reconstruction circuit 209 does not include the display. In the example of FIG. 5, a rectangular image sequence and a triangular pyramid generated by CG are displayed on the display 251. The decoded texture is also pasted on the triangular pyramid object. Here, the texture may be a moving image or a still image.
[0111]
FIG. 5 shows the correspondence between the scene descriptor SD and the output screen. In the case of FIG. 5, there is a parent node SD0 describing how to arrange each object in the entire image. As its child nodes, there are a node SD1 in which information about a triangular pyramid is described and a node SD2 in which information about a rectangular plane to which an image is pasted are described. Unlike the example of FIG. 4, the image signal corresponding to the node SD2 in FIG. 5 is composed of one video object VO. However, in the case of FIG. 5, it is assumed that the image corresponding to the node SD2 has been subjected to three-layer scalable coding, and a video object VO is composed of three video object layers. In addition, although FIG. 5 illustrates an example of three layers, the number of layers may be arbitrary.
[0112]
Each node SD0 to SD2 constituting the scene descriptor SD is interpreted by the syntax analysis circuit 208, and the analysis result is supplied to the synthesis circuit 252. Bit streams ES1 to ESn are supplied from the demultiplexing circuit 205 to the decoders 207-1 to 207-4, and the ID number OD_ID of the corresponding object descriptor OD is supplied from the syntax analysis circuit 206. After the decoders 207-1 to 207-4 decode the bit streams, the decoded signal and, in the case of an image, a key signal, signals (POS, SZ) indicating the position and size of the image, and a signal FR indicating the magnification are supplied to the mixing circuit 261. Here, the position of the image means the relative position of each layer in the same video object VO. Each decoder 207-1 to 207-4 supplies the ID number OD_ID to the synthesis circuit 252. Since the configuration of the synthesis circuit 252 is the same as that shown in FIG. 6, the description thereof is omitted here.
[0113]
As described above, when the ID number OD_ID and the decoded signal are supplied to the corresponding object composition circuit 271-i, it is necessary to find which node each decoded signal corresponds to. Therefore, the correspondence is recognized by comparing the ID number OD_ID supplied to the reconstruction circuit 209 with the ID number OD_ID included in the node. Based on the recognition result, the decoded signal is supplied to the object composition circuit 271-i to which the corresponding node is supplied.
[0114]
In the case of scalable coding, the bit stream of each layer (VOL) belongs to the same video object VO and therefore has the same ID number OD_ID. One video object VO corresponds to one node, and one texture memory 152-i corresponds to that one video object VO. Therefore, in the case of scalable coding, the output of each layer (the output of the decoders 207-2 to 207-4) is first supplied to the mixing circuit 261 and synthesized into one image sequence.
[0115]
The mixing circuit 261 first synthesizes the images of the layers based on the image signal, the key signal, the signal indicating the magnification, and the signal indicating the position and size of the image supplied from each of the decoders 207-2 to 207-4. The synthesized image is then output to the synthesis circuit 252. The synthesis circuit 252 can therefore associate one image sequence with one object.
[0116]
For example, when scalable coding as shown in FIG. 29 is performed and the lower layer and the upper layer 1 are transmitted and decoded, the image signal of the lower layer is subjected to resolution conversion based on the signal FR indicating the magnification. Next, the decoded image of the upper layer 1 is combined with the converted image at the corresponding position in accordance with the key signal.
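The two-step mixing described here (resolution conversion of the lower layer, then key-controlled composition of the upper layer) can be sketched as follows. This is only an illustration under stated assumptions: images are lists of rows of integer samples, FR is an integer factor, the resolution conversion is nearest-neighbour repetition, and a key value of 1 means that the upper-layer sample replaces the lower-layer sample. None of these details is specified in the text, and the function names are hypothetical.

```python
def upscale(image, fr):
    """Nearest-neighbour enlargement by an integer factor fr (assumed filter)."""
    return [[px for px in row for _ in range(fr)]
            for row in image for _ in range(fr)]


def mix_layers(lower, fr, upper, key, pos):
    """Enlarge the lower layer by fr, then blend the upper layer onto it
    at pos = (top, left) according to the key signal (1 = upper layer)."""
    base = upscale(lower, fr)
    top, left = pos
    for y, (row, key_row) in enumerate(zip(upper, key)):
        for x, (px, k) in enumerate(zip(row, key_row)):
            by, bx = top + y, left + x
            # Samples that fall outside the enlarged lower layer are dropped.
            if 0 <= by < len(base) and 0 <= bx < len(base[0]):
                base[by][bx] = k * px + (1 - k) * base[by][bx]
    return base


lower = [[10, 20], [30, 40]]
upper = [[100, 100], [100, 100]]
key = [[1, 0], [0, 1]]
mixed = mix_layers(lower, 2, upper, key, (1, 1))
```

With a key signal that takes fractional values between 0 and 1, the same expression would give a soft blend at object boundaries.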
[0117]
The image sequence synthesized by the mixing circuit 261 is supplied to the synthesis circuit 252. The composition circuit 252 constructs an image in the same manner as in FIG. 4 and outputs it to the display 251 to obtain a final output image.
[0118]
As described above, in this example, one object (a video object VO in the case of video) is allocated to one node, and the mixing circuit 261 is provided in the stage preceding the memory group 151 that stores the texture, three-dimensional information, and the like used by the rendering circuit 155. A plurality of images are mixed according to a predetermined key signal and then recorded in the texture memory 152, which enables texture mapping of an image signal having a plurality of resolutions.
[0119]
As described above, in the example of FIG. 1, a descriptor recording the system information of the bit streams constituting a certain object is generated for that object. Only information on the bit streams that must be decoded is recorded in the descriptor, and all bit streams described in the descriptor are decoded, so that a combination of decodable bit streams can be identified and a predetermined signal can be decoded. In this case, the descriptor is generated and transmitted one-to-one between the transmitting side and the receiving side.
[0120]
Next, FIGS. 7 to 9 show the structure of the object descriptor OD. FIG. 7 shows the overall configuration (syntax) of the object descriptor OD.
[0121]
Node ID is a 10-bit flag indicating the ID number of the descriptor, and corresponds to the OD_ID described above. StreamCount is an 8-bit flag indicating the number of bit streams ES included in the object descriptor. As many ES_Descriptors, each carrying the information necessary for decoding a bit stream ES, are transmitted as this number indicates. Furthermore, extensionFlag is a flag indicating whether or not other descriptors are transmitted. When this value is 1, other descriptors are transmitted.
[0122]
ES_Descriptor is a descriptor indicating information on each bit stream. FIG. 8 shows the configuration (syntax) of ES_Descriptor. ES_Number is a 5-bit flag indicating an ID number for identifying the bit stream. StreamType is an 8-bit flag indicating the format of the bit stream, for example, MPEG2 video. Furthermore, QoS_Descriptor is an 8-bit flag indicating a request to the network at the time of transmission.
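As a rough illustration of how fixed-width flags such as these could be read from a byte sequence, the following Python sketch reads ES_Number (5 bits), StreamType (8 bits), and QoS_Descriptor (8 bits). The field order, the most-significant-bit-first packing, the `BitReader` class, and the sample byte values are all assumptions; the actual syntax is the one shown in FIG. 8.

```python
class BitReader:
    """Reads unsigned integers MSB-first from a bytes object (assumed packing)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse_es_descriptor_header(reader: BitReader) -> dict:
    """Read the three flags whose widths are given in the text.
    The order of the fields is an assumption."""
    return {
        "ES_Number": reader.read(5),
        "StreamType": reader.read(8),
        "QoS_Descriptor": reader.read(8),
    }


# Hypothetical sample: ES_Number = 3, StreamType = 0x20, QoS_Descriptor = 1,
# followed by three padding bits.
sample = bytes([0x19, 0x00, 0x08])
header = parse_es_descriptor_header(BitReader(sample))
```

ESConfigParams is left out of the sketch because its layout is not given in the text.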
[0123]
ESConfigParams is a descriptor in which information necessary for decoding the bit stream is described, and its configuration (syntax) is shown in FIG. 9. Details of ESConfigParams are described in the MPEG4 System VM.
[0124]
FIG. 10 shows a scene descriptor for pasting a moving image. SFObjectID is a flag indicating the ID number OD_ID of the object descriptor of the texture to be pasted. FIG. 11 shows a scene descriptor for pasting a still image. Here too, SFObjectID is a flag indicating the ID number OD_ID of the object descriptor of the texture to be pasted. The formats of FIGS. 10 and 11 conform to the VRML node description.
[0125]
Next, FIG. 12 shows a bit stream multiplexing apparatus and demultiplexing apparatus according to the second embodiment. In this embodiment, all bit streams belonging to an object are multiplexed and transmitted. In the first embodiment of FIG. 1, by contrast, only the bit streams requested by the receiving side are multiplexed and transmitted, and the object descriptor OD is generated according to the bit streams to be transmitted. Since all the bit streams described in the object descriptor OD are decoded on the receiving side, the first embodiment has no particular need to transmit the dependency of information between the bit streams.
[0126]
In the second embodiment, on the other hand, the object descriptor OD is stored in the storage device 202 in advance, and the transmitting side multiplexes and transmits all the bit streams recorded in the object descriptor OD. The object descriptor OD in the second embodiment differs from that of the first embodiment in that it describes the dependency of information between bit streams. The other points are the same as in the first embodiment.
[0127]
The multiplexing circuit 203 reads the scene descriptor SD, the object descriptor OD, and the bit stream group ES recorded in the storage device 202, multiplexes them in a predetermined order, and transmits them. The transmission order and the configuration of the multiplexing circuit 203 are the same as those in the first embodiment. The multiplexed bit stream FS is supplied to the demultiplexing circuit 205 via the transmission path.
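A minimal sketch of the output order of the multiplexing circuit 203 is given below. It follows the order stated in the claims (start code, scene descriptor, a number of object descriptors, a number of bit streams). The start code value, the absence of length or framing fields, and the function name `multiplex` are assumptions for illustration only.

```python
START_CODE = b"\x00\x00\x01"  # hypothetical value; the text does not define it


def multiplex(scene_descriptor: bytes, object_descriptors, bitstreams) -> bytes:
    """Concatenate the inputs in the order: start code, scene descriptor SD,
    object descriptors OD, bit streams ES."""
    parts = [START_CODE, scene_descriptor]
    parts.extend(object_descriptors)
    parts.extend(bitstreams)
    return b"".join(parts)


fs = multiplex(b"SD", [b"OD1", b"OD2"], [b"ES1", b"ES2", b"ES3"])
```

A practical multiplexer would also need some way for the receiver to find the boundaries between the parts; that aspect is outside this sketch.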
[0128]
The user inputs, from the terminal, a request signal REQ indicating which object is to be displayed. The request signal REQ is supplied to the demultiplexing circuit 205, the syntax analysis circuit 206, and the reconfiguration circuit 209. The syntax analysis circuit 206 analyzes each transmitted object descriptor OD, generates a signal SREQ requesting the necessary bit streams, and supplies the signal SREQ to the demultiplexing circuit 205. The object descriptor OD records, for the bit stream requested by the user, whether other bit streams are necessary for decoding it and, if so, which ones.
[0129]
The demultiplexing circuit 205 supplies only the necessary bit streams to the decoders 207-1 to 207-n according to the request signal REQ from the user and the signal SREQ requesting the necessary bit streams, and supplies the necessary object descriptor OD to the syntax analysis circuit 206. The syntax analysis circuit 206 analyzes the object descriptor OD and, based on the object descriptor OD and the request signal REQ from the user, supplies the initialization information of the decoders 207-1 to 207-n and the ID number OD_ID to each of the decoders 207-1 to 207-n. Thereafter, decoding, synthesis, and display are performed in the same manner as in the first embodiment.
[0130]
As described above, in this example, a descriptor (object descriptor) recording the system information of the bit streams constituting an object is generated for that object, and a flag indicating the bit streams necessary for decoding each bit stream is recorded in it. By decoding the predetermined bit streams according to the flags described in the descriptor, it is possible to identify a combination of decodable bit streams and decode a predetermined signal. In this case, the descriptor is generated once on the transmitting side, and a common descriptor is then transmitted to all recipients.
[0131]
In the second embodiment, unlike the first embodiment, the object descriptor OD describes information for identifying the other bit streams necessary for decoding a predetermined bit stream. The object descriptor OD in the second embodiment is described below. Its overall configuration is the same as that of the first embodiment shown in FIG. 7.
[0132]
FIG. 13 shows an ES_Descriptor that describes information about each bitstream. isOtherStream is a 1-bit flag that indicates whether another bitstream is required to decode this bitstream. If this value is 0, this bitstream can be decoded alone. When the value of isOtherStream is 1, this bit stream cannot be decoded alone.
[0133]
streamCount is a 5-bit flag indicating how many other bit streams are required. As many ES_Number values are transmitted as streamCount indicates.
[0134]
ES_Number is an ID for identifying a bit stream necessary for decoding. The rest of the configuration of ES_Descriptor is the same as in the first embodiment. The configuration of ESConfigParams, which represents the information necessary for decoding each bit stream, is also the same as in the first embodiment shown in FIG. 9.
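The way a receiver might use isOtherStream, streamCount, and the listed ES_Number values to decide which bit streams to extract can be sketched as follows. The dictionary representation (an empty list standing for isOtherStream = 0), the function name `required_streams`, and the decision to follow dependencies transitively are assumptions; the text does not say whether a descriptor lists only direct dependencies.

```python
def required_streams(dependencies, requested):
    """Return the sorted ES_Number values needed to decode `requested`.

    dependencies maps each ES_Number to the list of ES_Number values named
    in its ES_Descriptor; an empty list corresponds to isOtherStream = 0.
    """
    needed = set()
    stack = [requested]
    while stack:
        es = stack.pop()
        if es in needed:
            continue  # already visited; also guards against cyclic entries
        needed.add(es)
        stack.extend(dependencies[es])
    return sorted(needed)


# Hypothetical three-layer object: stream 0 is the lower layer,
# stream 1 needs stream 0, and stream 2 needs stream 1.
dependencies = {0: [], 1: [0], 2: [1]}
for_upper_layer = required_streams(dependencies, 2)
for_lower_layer = required_streams(dependencies, 0)
```

The resulting list plays the role of the signal SREQ in this sketch: it names the bit streams that the demultiplexing circuit 205 would pass on to the decoders.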
[0135]
The above-described processing (multiplexing and demultiplexing) can be realized by a program, and the program can be transmitted (provided) to a user. As the transmission medium, communication media such as networks and satellites can be used in addition to recording media such as magnetic disks, CD-ROMs, and solid-state memory. Needless to say, the above-described processing can be realized not only as a program but also as hardware.
[0136]
Various modifications and application examples can be considered without departing from the gist of the present invention. Therefore, the gist of the present invention is not limited to the embodiment.
[0137]
[Effects of the invention]
According to the image signal multiplexing apparatus and method, and the program of the recording medium, of the present invention, it is possible to texture-map, in units of objects, scalable bit streams having a plurality of layers.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration example of an image signal multiplexing device and an image signal demultiplexing device according to the present invention.
FIG. 2 is a block diagram illustrating a configuration example of the multiplexing circuit 203 in FIG. 1.
FIG. 3 is a block diagram showing a configuration example of the demultiplexing circuit 205 in FIG. 1.
FIG. 4 is a diagram illustrating a correspondence relationship between signals for reconstructing an image and the reconstruction circuit 209 in FIG. 1.
FIG. 5 is a diagram illustrating a correspondence relationship between signals for reconstructing an image and the reconstruction circuit 209 in FIG. 1;
FIG. 6 is a block diagram illustrating a configuration example of the synthesis circuit 252 in FIG. 5.
FIG. 7 is a diagram illustrating a configuration of an object descriptor.
FIG. 8 is a diagram illustrating a configuration of an ES_Descriptor.
FIG. 9 is a diagram showing a configuration of ESConfigParams.
FIG. 10 is a diagram illustrating a configuration of a scene descriptor for moving images.
FIG. 11 is a diagram illustrating a configuration of a scene descriptor for a still image.
FIG. 12 is a block diagram illustrating another configuration example of the image signal multiplexing device and the image signal demultiplexing device according to the present invention.
FIG. 13 is a diagram illustrating a configuration of an ES_Descriptor.
FIG. 14 is a block diagram illustrating a configuration example of a conventional object composition circuit.
FIG. 15 is a block diagram illustrating a configuration example of a conventional image signal encoding device.
FIG. 16 is a block diagram illustrating a configuration example of a conventional image signal decoding apparatus.
FIG. 17 is a block diagram illustrating another configuration example of a conventional image signal encoding device.
FIG. 18 is a block diagram illustrating another configuration example of a conventional image signal decoding device.
FIG. 19 is a diagram for explaining conventional image composition.
FIG. 20 is a diagram for explaining image composition.
FIG. 21 is a diagram for explaining image composition.
FIG. 22 is a block diagram showing still another configuration example of a conventional image signal encoding device.
FIG. 23 is a block diagram illustrating still another configuration example of a conventional image signal decoding device.
FIG. 24 is a block diagram illustrating a configuration example of the VOP encoding circuit 103-0 in FIG.
FIG. 25 is a block diagram illustrating a configuration example of the VOP decoding circuit 112-0 in FIG.
FIG. 26 is a diagram illustrating an image object.
FIG. 27 is a diagram illustrating an image object.
FIG. 28 is a diagram illustrating an image object.
FIG. 29 is a diagram illustrating an image object.
FIG. 30 is a diagram illustrating an image object.
FIG. 31 is a diagram illustrating an image object.
[Explanation of symbols]
201 stream control circuit, 202 storage device, 203 multiplexing circuit, 204 object descriptor generation circuit, 205 demultiplexing circuit, 206 syntax analysis circuit, 207-1 to 207-n decoder, 208 syntax analysis circuit, 209 reconfiguration circuit

Claims

An image signal multiplexing apparatus comprising: selecting means for selecting a scene descriptor indicating spatial configuration information describing a predetermined object, and for selecting the bitstreams constituting the predetermined object from a plurality of hierarchically encoded bitstreams;
generating means for generating an object descriptor indicating information on the object composed of the selected bitstreams; and
multiplexing means for multiplexing a start code, the selected scene descriptor and bitstreams, and the generated object descriptor, and for outputting them in the order of the start code, the scene descriptor, a predetermined number of the object descriptors, and a predetermined number of the bitstreams.

The image signal multiplexing apparatus according to claim 1, wherein the object descriptor includes at least one of a flag indicating spatial configuration information describing the object, a flag indicating the number of bit streams, and information necessary for decoding the bit streams.

An image signal multiplexing method comprising: a selecting step of selecting a scene descriptor indicating spatial configuration information describing a predetermined object, and selecting the bit streams constituting the predetermined object from a plurality of hierarchically encoded bit streams;
a generating step of generating an object descriptor indicating information about the object composed of the selected bit streams; and
a multiplexing step of multiplexing a start code, the selected scene descriptor and bit streams, and the generated object descriptor, and outputting them in the order of the start code, the scene descriptor, a predetermined number of the object descriptors, and a predetermined number of the bit streams.

A recording medium on which is recorded a program that causes a computer to execute processing comprising: a selecting step of selecting a scene descriptor indicating spatial configuration information describing a predetermined object, and selecting the bit streams constituting the predetermined object from a plurality of hierarchically encoded bit streams;
a generating step of generating an object descriptor indicating information about the object composed of the selected bit streams; and
a multiplexing step of multiplexing a start code, the selected scene descriptor and bit streams, and the generated object descriptor, and outputting them in the order of the start code, the scene descriptor, a predetermined number of the object descriptors, and a predetermined number of the bit streams.